Customer Reviews


17 Reviews
5 star: (8)
4 star: (7)
3 star: (1)
2 star: (1)
1 star: (0)

The most helpful favorable review: "Good book with a light start" (4.0 out of 5 stars), published on February 14, 2004 by A Williams. 30 of 31 people found it helpful.
The most helpful critical review: "what is in a name?" (2.0 out of 5 stars), published on December 29, 2005 by Onetitfemme. 13 of 16 people found it helpful.


30 of 31 people found the following review helpful:
4.0 out of 5 stars Good book with a light start, February 14, 2004
By A Williams "honestpuck" (Neutral Bay, NSW Australia)
(VINE VOICE) (REAL NAME)
This review is from: Spidering Hacks (Paperback)
The `Hacks' series from O'Reilly seems to be breeding as fast as virii in a Windows network - every time you turn around there is another one. While the writing and editing have remained at a high standard, some, such as `eBay Hacks', have not really had great material. `Spidering Hacks' is an improvement, almost back to the quality I remember in the last contribution from Calishain, `Google Hacks'.

She and Kevin Hemenway have taken a fairly complex topic, spidering and scraping web sites, and reduced it to manageable chunks in their hundred hacks. The writing has the same light, readable feel you quickly grow to expect from O'Reilly. Certainly I have never found myself faulting their editing.

There are some caveats. It seems that O'Reilly and Dornfest (the editor of this book and the series) have fallen in love with having a hundred hacks and little in the way of an introduction. I think this might have been a better book if it had been done as 90 `hacks' with a much larger introduction, as the first chapter's hacks are all too light and are really introductory material, such as how an HTML page is built and how to properly register your spider. Given that only someone with a fair amount of web knowledge is going to consider spidering a website in the first place, this early material is way too slight. From Hack 9 on it quickly gets down to useful and informative chunks and no longer feels `lightweight'.

This may reflect an attempt to extend the `Hacks' series into places where it has to be forced. While the format worked well for Google and Amazon, I felt the entire topic of eBay was too light for this series, and perhaps spidering is too heavy or complex. If this book had been written in a more traditional format, some of my complaints would disappear.

All the examples are in Perl and the serious part of the book starts with examples using LWP::Simple to grab a page before going on to LWP::UserAgent and much more complex requests using authentication, custom headers and posting form data. It also covers using curl and wget.

Then it gets down to the nitty gritty of scraping using HTML::TreeBuilder and HTML::TokeParser. This is all further expanded through the next few hacks until, from Hack 39 through to Hack 89, there is a good series of examples (perhaps a few too many). Finally there are two chapters, one on maintaining your collection and `Giving Back To The World', which tells how to make your site easy to scrape and how to use RSS.

O'Reilly have a page for the book with ten example hacks, index, Table of Contents and errata and you can also visit hacks.oreilly.com for the same ten hacks with the possibility of more being added.

As a whole this volume seems a little thin. If you've been doing the maths then you've realised that only about thirty of the hundred hacks actually give any details on building and running a serious web spider. Sure, a number of the examples provide good information on how to perform various tasks, and some of the last eleven hacks are good to know, but overall the book feels like it lacks solid information. A bit more information on various crawling and page parsing techniques would have been good.

After that criticism I'm now surprising myself: I'm going to recommend this book. This isn't a large field, and when you consider that most other books on writing spiders and crawlers are less than practical and more than expensive, "Spidering Hacks" has many good points. It's written for the practical Perl programmer, it examines several methods and gives lots of examples, and while not cheap it's certainly inexpensive. Given that I found it both useful and inspiring, the complaints above may be a little like nitpicking. I should also say that I found this volume immensely useful in writing my own spider and scraper (it gets a list of new books from the web sites of several publishers). I have to be honest and admit that there are three publishers, O'Reilly, Addison Wesley and Prentice Hall, from whom I expect a decent standard and whom I criticise a little harder when they move from that norm. If this book had come from SAMS or Wrox I may well not have looked quite so hard for flaws, and would have been a little more generous in my treatment of the ones I found.

That said, I recommend this book to you if you want a practical introduction to building a web spider in Perl.



23 of 23 people found the following review helpful:
4.0 out of 5 stars Many examples of how to use spiders, April 8, 2004
This review is from: Spidering Hacks (Paperback)
The book has a nice collection of case studies on how to gather data from disparate websites. You might consider this as showing a simple way for you to use Web Services.

Spidering is the way that search engines gather their data. But you do not have to be Altavista or Google to use spiders. Nor do you have to be scanning a large fraction of the Web. The authors demystify spiders. If you can follow their examples, then you get concrete instances of usage that might help your particular application.

Thoughtfully, the examples are mostly written in Perl, with a few in Java. These languages should be familiar to many. Though even if you don't know them, the logic of the code can still be useful. (That is, you can treat the code as pseudocode.)

While spiders are probably best known as being used by search engines, they are really only the starting point for the latter. The much harder problems start when you have the data amassed by a spider. Now you have to efficiently find correlations between the various web pages. You should be aware that the book does not discuss these with any significant depth. Not surprising, because these are outside the scope of the book. The examples do show how to use the data found by spiders. But most of these are for web pages that sit in a given domain. So the pages are closely affiliated in content and structure.



18 of 18 people found the following review helpful:
5.0 out of 5 stars Lots of great ideas, March 22, 2004
This review is from: Spidering Hacks (Paperback)
Once in a long while you get a book that inspires you with a lot of great small ideas. Spidering Hacks is just that type of book. The web has a wealth of structured and semi-structured data that is just waiting to be mined with automated tools. This book not only teaches you how to get the data out of these sources, but gives you ideas about where to look for information and what to do with it.

This book demonstrates everything I like in a technical book. It not only describes how things are done. It also gives practical examples of how the technology can be useful in the real world, and presents them enthusiastically. It makes you want to go out and implement all of the ideas and to keep on going with some of your own.

Nitpicks I have with the book are minor. The 'Hacks' format seems imposed; for example, hack #8 is about installing CPAN. I don't think that section should be left out, but I don't think it's a hack either. But hey, I don't care that much about the structure as long as it isn't an imposing flaw and the content within the structure is great, as it is with this book.

Have to say, O'Reilly is on a roll with the Hacks series. They have all been fine books.



14 of 15 people found the following review helpful:
4.0 out of 5 stars Rich samples, fit your specific needs if you're Perl lover, February 25, 2004
By Otto Yuen (Toronto, ON Canada)
This review is from: Spidering Hacks (Paperback)
If you are a Perl lover looking for a book to help you extract content from this huge, resource-rich Internet, this book fits your needs quite well. Overall it is good: the author shows you how to set up your spidering tools -- Perl modules. Yes, Perl; if you're a Java person, too bad. He shows you how to use Perl modules for crawling web pages, logging on to systems, extracting specific content, and massaging data to your needs, across 100 different scenarios. Most of them are practical, but they don't go into much detail; you have to read the programs listed in the book, which is quite painful for non-Perl people like me. In addition, it doesn't provide many screenshots of the results of running the sample code. Most importantly, the author tries to avoid copyright questions by pointing readers to URL links for reference. In general, it's still a good tool book in the spidering field.


15 of 17 people found the following review helpful:
5.0 out of 5 stars Great Book, January 5, 2004
By A Customer
This review is from: Spidering Hacks (Paperback)
Are you ready to be the next Google? It is widely known that Google pulled out in front of (and largely obsoleted) major search engine players like Altavista and Yahoo because of Google's highly accurate search results -- you find what you search for. They are so confident in their search engine spiders they even have an "I'm feeling lucky" button to transport you to the first search result found -- it's arrogance, but well deserved arrogance. In a sentence, Google works.

Enter Kevin Hemenway and Tara Calishain's latest O'Reilly book: Spidering Hacks. Continuing in the O'Reilly "Hacks" tradition, this comprehensive guidebook provides a hundred clear, useful tools for designing and implementing the next generation -- or maybe just your own customized -- spider (or bot, if you prefer).

So why build your own spider? Well, if you have a large website, your spider could check link integrity, HTML standards compliance and meta-tags. If you are researching a topic and Google is not returning what you want, creating your own spider might be just what you need. This handy book (with examples in Perl) will show you how to:

* Create a site-friendly bot that won't get you banned by webmasters (Hack #16 -- Respecting Your Scrapee's Bandwidth, and Hack #17 -- Respecting robots.txt)

* Interested in graphics, audio and video? Hacks #33 through #42 step you through collecting media files. Specific examples include scraping films from www.ifilm.com (Hack #34), gathering movies from the Library of Congress (Hack #35) and archiving images from Webshots. You'll have your own personalized library in no time.

* Weblog-Free Google Results -- Weblogs (aka Blogs) are amazingly popular these days. With Google's pagerank algorithm, that means they get heavy emphasis in your search results. Hack #50 skims down the search results by eliminating those annoying Blogs.

In addition, you'll find multiple hacks covering Amazon.com and RSS feeds. The book includes much information regarding spider automation (e.g. cron jobbing your spiders). You'll find content filtering and even a hack using PHP code (Hack #84).

This book is extraordinarily helpful and is a great resource for any Perl hacker. I highly recommend it to any computer hobbyist interested in data mining, spidering and scraping. Well done, O'Reilly!



8 of 8 people found the following review helpful:
5.0 out of 5 stars Example-filled and easy-to-follow, March 7, 2004
This review is from: Spidering Hacks (Paperback)
The knowledgeable collaboration of Kevin Hemenway and Tara Calishain, Spidering Hacks: 100 Industrial-Strength Tips & Tools is an extensive, 402-page instructional guidebook and reference to Internet data retrieval through the use of spiders and scrapers. Including information on methodology, philosophies, and ethical considerations, as well as freely available modules, scripts, frameworks, and templates, information on how to build alternative interfaces to online databases, how to keep one's data current and share it in a user-friendly manner, and so much more, Spidering Hacks is an example-filled, easy-to-follow, highly recommended computer shelf resource.


13 of 16 people found the following review helpful:
2.0 out of 5 stars what is in a name?, December 29, 2005
This review is from: Spidering Hacks (Paperback)
well, sometimes a generalizing lie.

IMHO, this book should have been named "(some) Spidering Hacks using Perl"

they could have spared the "100" and "industrial strength" sales pitches from the title as well

the very little python and java code that was either mentioned and/or included as code examples was, I think, a way to pepper the content and apparently make it more appealing to a broader audience

* the book is mostly about Perl scripts (you could compile Perl to C and then use c2java, for example, but why bother if, as I noticed right away, it was mostly toy code?) I wonder what the "industrial strength" thing was all about. There are also some GNU utils examples (wget and curl), for which you could get better examples online
* the book has "examples" that don't make any sense (to me) and that you could even see as a total waste of time: why bother scraping amazon's pages if they offer SOAP/RSS feeds? And not only that, but then he goes on telling you how to scrape a site offering financial stock info, too!?!?! I would have started by splitting the book in two: cases for which you don't really need scraping at all and those for which you do
* the author, in an attempt to reach the "100" mark, included cases on how to download, say, MP3s with Beatles songs and PDF files from IRS sites as separate cases :-? I wonder what the difference is once you have a connection to the data feed?!?

there is "Web Content Mining with Java" (ISBN: 047084311X), and as you see the publishers/authors named this book after what it is all about; and if you want to read about "industrial strength" approaches I would recommend "Mining the Web" (ISBN: 1558607544)

usually "hacks" books are about hacks, meaning you already know your stuff and are learning some hacks. If you know the basics of spiders and how to retrieve data off the Net programmatically, this book is not for you. If you, on the other hand, are new to this subject and are a Perl programmer, you may learn a few things from it

otf


7 of 8 people found the following review helpful:
5.0 out of 5 stars Perl-intensive book on web crawler design, May 16, 2006
This review is from: Spidering Hacks (Paperback)
A spider (also known as a web crawler or web robot) is a program which browses the World Wide Web in a methodical, automated manner. This book is about how to create programs that perform the functions of a web crawler, with most of the Hacks being written in Perl. Like the rest of the Hacks series, this book presents 100 bite-sized chunks of code or technique to tackle specific activities. In this book these range from the simple - how to download a set of image files - to the complex - cross-referring the output from one site with another to generate a third set of data. No matter what the complexity, each hack is clearly explained, with the code samples balanced with instructions, examples and notes on how to hack the hack.

As already mentioned, the hacks in this book mostly use Perl, though scattered here and there you'll find some Java, Python and PHP. If you really hate Perl, then you will not like this book. On the other hand, the authors assume only a rudimentary knowledge of Perl, and there is no requirement for any knowledge of network programming of any description. After the opening chapter, which gives guidance on being a good spidering citizen (how to respect the sites you are taking data from), there is a second chapter which details how to create a spidering toolkit (how to find and install the suite of modules that many of the hacks depend on).

With a toolkit in place and a knowledge of good behavior, the book dives into the various hacks that are organized by topic: collecting media files, gleaning data from databases (with many examples for Yahoo!, Amazon, Google, Alexa and other popular information sources), maintaining your collections (more automation with "cron" or other scheduling tools) and a final chapter on giving something back (creating a web service, generating RSS feeds and so on).

The bulk of the hacks are in chapter four, which looks at extracting data from databases. Aside from the obvious sources such as Amazon and Google, these include online banks, tracking FedEx packages and more. There are a range of techniques used to grab and filter the data, so even if a data source you want to use isn't listed, the chances are that one of these hacks can be refactored to do what you want.

If Perl is not your thing then the very light sprinkling of non-Perl hacks probably isn't enough to make this a worthwhile purchase. If you're a Perl hacker interested in spidering there is a ton of stuff for you here without doubt. Also, if you are a student looking for a good supplement on building a web spider from scratch, this is probably not the book for you either, but the various hacks will give you some ideas on what you might want to do in your own spider if you wish to write one in a higher level language such as Java. Amazon does not show the table of contents so I do that here for completeness:

Chapter 1. Walking Softly
1. A Crash Course in Spidering and Scraping
2. Best Practices for You and Your Spider
3. Anatomy of an HTML Page
4. Registering Your Spider
5. Preempting Discovery
6. Keeping Your Spider Out of Sticky Situations
7. Finding the Patterns of Identifiers
Chapter 2. Assembling a Toolbox
Perl Modules
Resources You May Find Helpful
8. Installing Perl Modules
9. Simply Fetching with LWP::Simple
10. More Involved Requests with LWP::UserAgent
11. Adding HTTP Headers to Your Request
12. Posting Form Data with LWP
13. Authentication, Cookies, and Proxies
14. Handling Relative and Absolute URLs
15. Secured Access and Browser Attributes
16. Respecting Your Scrapee's Bandwidth
17. Respecting robots.txt
18. Adding Progress Bars to Your Scripts
19. Scraping with HTML::TreeBuilder
20. Parsing with HTML::TokeParser
21. WWW::Mechanize 101
22. Scraping with WWW::Mechanize
23. In Praise of Regular Expressions
24. Painless RSS with Template::Extract
25. A Quick Introduction to XPath
26. Downloading with curl and wget
27. More Advanced wget Techniques
28. Using Pipes to Chain Commands
29. Running Multiple Utilities at Once
30. Utilizing the Web Scraping Proxy
31. Being Warned When Things Go Wrong
32. Being Adaptive to Site Redesigns
Chapter 3. Collecting Media Files
33. Detective Case Study: Newgrounds
34. Detective Case Study: iFilm
35. Downloading Movies from the Library of Congress
36. Downloading Images from Webshots
37. Downloading Comics with dailystrips
38. Archiving Your Favorite Webcams
39. News Wallpaper for Your Site
40. Saving Only POP3 Email Attachments
41. Downloading MP3s from a Playlist
42. Downloading from Usenet with nget
Chapter 4. Gleaning Data from Databases
43. Archiving Yahoo! Groups Messages with yahoo2mbox
44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
45. Gleaning Buzz from Yahoo!
46. Spidering the Yahoo! Catalog
47. Tracking Additions to Yahoo!
48. Scattersearch with Yahoo! and Google
49. Yahoo! Directory Mindshare in Google
50. Weblog-Free Google Results
51. Spidering, Google, and Multiple Domains
52. Scraping Amazon.com Product Reviews
53. Receive an Email Alert for Newly Added Amazon.com Reviews
54. Scraping Amazon.com Customer Advice
55. Publishing Amazon.com Associates Statistics
56. Sorting Amazon.com Recommendations by Rating
57. Related Amazon.com Products with Alexa
58. Scraping Alexa's Competitive Data with Java
59. Finding Album Information with FreeDB and Amazon.com
60. Expanding Your Musical Tastes
61. Saving Daily Horoscopes to Your iPod
62. Graphing Data with RRDTOOL
63. Stocking Up on Financial Quotes
64. Super Author Searching
65. Mapping O'Reilly Best Sellers to Library Popularity
66. Using All Consuming to Get Book Lists
67. Tracking Packages with FedEx
68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds
72. Automatically Finding Blogs of Interest
73. Scraping TV Listings
74. What's Your Visitor's Weather Like?
75. Trendspotting with Geotargeting
76. Getting the Best Travel Route by Train
77. Geographic Distance and Back Again
78. Super Word Lookup
79. Word Associations with Lexical Freenet
80. Reformatting Bugtraq Reports
81. Keeping Tabs on the Web via Email
82. Publish IE's Favorites to Your Web Site
83. Spidering GameStop.com Game Prices
84. Bargain Hunting with PHP
85. Aggregating Multiple Search Engine Results
86. Robot Karaoke
87. Searching the Better Business Bureau
88. Searching for Health Inspections
89. Filtering for Content
Chapter 5. Maintaining Your Collections
90. Using cron to Automate Tasks
91. Scheduling Tasks Without cron
92. Mirroring Web Sites with wget and rsync
93. Accumulating Search Results Over Time
Chapter 6. Giving Back to the World
94. Using XML::RSS to Repurpose Data
95. Placing RSS Headlines on Your Site
96. Making Your Resources Scrapable with Regular Expressions
97. Making Your Resources Scrapable with a REST Interface
98. Making Your Resources Scrapable with XML-RPC
99. Creating an IM Interface
100. Going Beyond the Book


2 of 2 people found the following review helpful:
4.0 out of 5 stars A Classic (is it outdated?), October 12, 2010
Amazon Verified Purchase
This review is from: Spidering Hacks (Paperback)
This book was published in October 2003. It is now late in 2010. It is a great book, but can it really be 5 stars after 7 years?

I've owned the book since 2005, and any time I have a spidering question, I still turn to it, and am rarely disappointed. My copy is thoroughly dogeared. But how much of my work could actually be shortcutted if I was using the most recent Perl modules?


5.0 out of 5 stars Enter the Spider, February 8, 2010
Amazon Verified Purchase
This review is from: Spidering Hacks (Paperback)
I've always wondered what a spider was and now I know what a scraper is too. The book provides a lot of info and links. I've just started and am happy so far.



This product

Spidering Hacks
Spidering Hacks by Tara Calishain (Paperback - November 1, 2003)
$29.99 $16.35