17 Reviews
30 of 31 people found the following review helpful:
4.0 out of 5 stars
Good book with a light start,
By
This review is from: Spidering Hacks (Paperback)
The 'Hacks' series from O'Reilly seems to be breeding as fast as viruses in a Windows network: every time you turn around there is another one. While the writing and editing have remained high in quality, some titles, such as 'eBay Hacks', have not really had great material. 'Spidering Hacks' is an improvement, almost back to the quality I remember in the last contribution from Calishain, 'Google Hacks'.
She and Kevin Hemenway have taken a fairly complex topic, spidering and scraping web sites, and reduced it to manageable chunks in their hundred hacks. The writing has the same light, readable feel you can quickly grow to expect from O'Reilly. Certainly I have never found myself faulting their editing.
There are some caveats. It seems that O'Reilly and Dornfest (the editor of this book and the series) have fallen in love with having a hundred hacks and little in the way of an introduction. I think this might have been a better book if it had been done as 90 'hacks' with a much larger introduction, as the first chapter's hacks are all too light and are really introductory material, such as how an HTML page is built and how to properly register your spider. Given that only someone with a fair amount of web knowledge is going to consider spidering a website in the first place, this early material is way too slight. From Hack 9 on it quickly gets down to useful and informative chunks in each hack and no longer feels 'lightweight'.
This may be a reflection of trying to extend the 'Hacks' series into places where it has to be forced. While the format worked well for Google and Amazon, I felt the entire topic of eBay was too light for this series, and perhaps spidering is too heavy or complex. If this book had been written in a more traditional format, some of my complaints would disappear.
All the examples are in Perl, and the serious part of the book starts with examples using LWP::Simple to grab a page before going on to LWP::UserAgent and much more complex requests using authentication, custom headers and posted form data. It also covers using curl and wget. Then it gets down to the nitty-gritty of scraping using HTML::TreeBuilder and HTML::TokeParser. This is all further expanded through the next few hacks until, from Hack 39 through to Hack 89, there is a good series of examples (perhaps a few too many). Finally there are two chapters, one on maintaining your collection and one called 'Giving Back to the World', which tells how to make your site easy to scrape and how to use RSS. O'Reilly have a page for the book with ten example hacks, index, table of contents and errata, and you can also visit hacks.oreilly.com for the same ten hacks with the possibility of more being added.
As a whole this volume seems a little thin. If you've been doing the maths then you've realised that only about thirty of the hundred hacks actually give any details on building and running a serious web spider. Sure, a number of the examples provide good information on how to perform various tasks, and some of the last eleven hacks are good to know, but in all the book feels like it lacks solid information throughout. A bit more information on various crawling and page parsing techniques would have been good.
After that criticism I'm now surprising myself: I'm going to recommend this book. This isn't a large field, and when you consider that most other books on writing spiders and crawlers are less than practical and more than expensive, 'Spidering Hacks' has many good points. It's written for the practical Perl programmer, it examines several methods and gives lots of examples, and while not cheap it's certainly inexpensive. Given that I found it both useful and inspiring, the complaints above may be a little like nitpicking.
I should also say that I found this volume immensely useful in writing my own spider and scraper (it gets a list of new books from the web sites of several publishers.) I have to be honest and admit that there are three publishers, O'Reilly, Addison Wesley and Prentice Hall, from whom I expect a decent standard and criticise a little harder when they move from that norm. If this book had come from SAMS or Wrox I may well have not looked quite so hard for flaws and been a little more generous in my treatment of the ones I found. That said, I recommend this book to you if you want a practical introduction to building a web spider in Perl.
23 of 23 people found the following review helpful:
4.0 out of 5 stars
Many examples of how to use spiders,
By
This review is from: Spidering Hacks (Paperback)
The book has a nice collection of case studies on how to gather data from disparate websites. You might consider this as showing a simple way for you to use Web Services.
Spidering is the way that search engines gather their data. But you do not have to be Altavista or Google to use spiders. Nor do you have to be scanning a large fraction of the Web. The authors demystify spiders. If you can follow their examples, then you get concrete instances of usage that might help your particular application. Thoughtfully, the examples are mostly written in Perl, with a few in Java. These languages should be familiar to many. Even if you don't know them, though, the logic of the code can still be useful. (That is, you can treat the code as pseudocode.)
While spiders are probably best known for being used by search engines, they are really only the starting point for the latter. The much harder problems start once you have the data amassed by a spider. Now you have to efficiently find correlations between the various web pages. You should be aware that the book does not discuss these in any significant depth. That is not surprising, because they are outside the scope of the book. The examples do show how to use the data found by spiders. But most of these are for web pages that sit in a given domain, so the pages are closely affiliated in content and structure.
18 of 18 people found the following review helpful:
5.0 out of 5 stars
Lots of great ideas,
By Jack D. Herrington "engineer and author" (Silicon Valley, CA) (VINE VOICE) (REAL NAME)
This review is from: Spidering Hacks (Paperback)
Once in a long while you get a book that inspires you with a lot of great small ideas. Spidering Hacks is just that type of book. The web has a wealth of structured and semi-structured data that is just waiting to be mined with automated tools. This book not only teaches you how to get the data out of these sources, but gives you ideas about where to look for information and what to do with it.
This book demonstrates everything I like in a technical book. It not only describes how things are done. It also gives practical examples of how the technology can be useful in the real world, and presents them enthusiastically. It makes you want to go out and implement all of the ideas and to keep on going with some of your own.
Nitpicks I have with the book are minor. The 'Hacks' format seems imposed; for example, hack #8 is about installing CPAN. I don't think that section should be left out, but I don't think it's a hack either. But hey, I don't care that much about the structure as long as it isn't an imposing flaw and the content within the structure is great, as it is with this book. Have to say, O'Reilly is on a roll with the Hacks series. They have all been fine books.
14 of 15 people found the following review helpful:
4.0 out of 5 stars
Rich samples, fit your specific needs if you're Perl lover,
By Otto Yuen (Toronto, ON Canada)
This review is from: Spidering Hacks (Paperback)
If you are a Perl lover looking for a book to help you extract content from this huge, resource-rich Internet, this book fits your needs quite well. Overall it is good: the author shows you how to set up your spidering tools, namely Perl modules. Yes, Perl; if you're a Java person, too bad. He shows you how to use Perl modules for crawling web pages, logging on to systems, extracting specific content, and massaging data to your needs, across 100 different scenarios. Most of them are practical, but they don't cover much of the detail; you have to read the programs listed in the book, which is quite painful for non-Perl people like me. In addition, it doesn't provide many screenshots of the results of running the sample code. Most importantly, the author tries to avoid copyright questions by delegating to URL links for readers to reference. In general, it's still a good tool book in the spidering field.
15 of 17 people found the following review helpful:
5.0 out of 5 stars
Great Book,
By A Customer
This review is from: Spidering Hacks (Paperback)
Are you ready to be the next Google? It is widely known that Google pulled out in front of (and largely obsoleted) major search engine players like Altavista and Yahoo largely because of Google's highly accurate search results -- you find what you search for. They are so confident in their search engine spiders they even have an "I'm feeling lucky" button to transport you to the first search result found -- it's arrogance, but well deserved arrogance. In a sentence, Google works.
Enter Kevin Hemenway and Tara Calishain's latest O'Reilly book: Spidering Hacks. Continuing in the O'Reilly "Hacks" tradition, this comprehensive guidebook provides a hundred clear, useful tools for designing and implementing the next generation -- or maybe just your own customized -- spider (or bot, if you prefer). So why build your own spider? Well, if you have a large website, your spider could check link integrity, HTML standards and meta-tags. If you are researching a topic and Google is not returning what you want, creating your own spider might be just what you need. This handy book (with examples in Perl) will show you how to:
* Create a site-friendly bot that won't get you banned by webmasters (Hack #16 -- Respecting Your Scrapee's Bandwidth, and Hack #17 -- Respecting robots.txt).
* Interested in graphics, audio and video? Hacks #33 through #42 step you through collecting media files. Specific examples include scraping films from www.ifilm.com (Hack #34), gathering movies from the Library of Congress (Hack #35) and archiving images from Webshots. You'll have your own personalized library in no time.
* Weblog-Free Google Results -- Weblogs (aka blogs) are amazingly popular these days. With Google's PageRank algorithm, that means they get heavy emphasis in your search results. Hack #50 skims down the search results by eliminating those annoying blogs.
In addition, you'll find multiple hacks covering Amazon.com and RSS feeds.
The book includes much information regarding spider automation (e.g. running your spiders from cron). You'll find content filtering and even a hack using PHP code (Hack #84). This book is extraordinarily helpful and is a great resource for any Perl hacker. I highly recommend it to any computer hobbyist interested in data mining, spidering and scraping. Well done, O'Reilly!
8 of 8 people found the following review helpful:
5.0 out of 5 stars
Example-filled and easy-to-follow,
By Midwest Book Review (Oregon, WI USA)
This review is from: Spidering Hacks (Paperback)
The knowledgeable collaboration of Kevin Hemenway and Tara Calishain, Spidering Hacks: 100 Industrial-Strength Tips & Tools is an extensive, 402-page instructional guidebook and reference to Internet data retrieval through the use of spiders and scrapers. Including information on methodology, philosophies, and ethical considerations, as well as freely available modules, scripts, frameworks, and templates, information on how to build alternative interfaces to online databases, how to keep one's data current and share it in a user-friendly manner, and so much more, Spidering Hacks is an example-filled, easy-to-follow, highly recommended computer shelf resource.
13 of 16 people found the following review helpful:
2.0 out of 5 stars
what is in a name?,
By Onetitfemme "Onetitfemme" (USA/NY)
This review is from: Spidering Hacks (Paperback)
well, sometimes a generalizing lie.
IMHO, this book should have been named "(some) Spidering Hacks using Perl". They could have spared the "100" and "industrial strength" sales pitches from the title as well. The very little Python and Java code that was mentioned or included as code examples was, I think, a way to pepper the content and make it appear more appealing to a broader audience.
* The book is mostly about Perl scripts (you could compile Perl to C and then use c2java, for example, but why bother if, as I noticed right away, it was mostly toy code?). I wonder what the "industrial strength" thing was all about. There are also some GNU utils examples (wget and curl), for which you could get better examples online.
* The book has "examples" that don't make any sense (to me) and that you could even see as a total waste of time: why bother scraping Amazon's pages if they offer SOAP/RSS feeds? And not only that, but then he goes on to tell you how to scrape a site offering financial stock info, too!?!?! I would have started by splitting the book in two: cases for which you don't really need scraping at all and those for which you do.
* The author, in an attempt to reach the "100" mark, included cases on how to download, say, MP3s of Beatles songs and PDF files from IRS sites as separate cases :-? I wonder what the difference is once you have a connection to the data feed?!?
There is "Web Content Mining with Java" (ISBN: 047084311X), and as you see the publishers/authors named that book after what it is all about. If you want to read about "industrial strength" approaches, I would recommend "Mining the Web" (ISBN: 1558607544).
Usually "hacks" books are about hacks, meaning you already know your stuff and are learning some hacks. If you know the basics of spiders and how to retrieve data off the Net programmatically, this book is not for you. If you, on the other hand, are new to this subject and are a Perl programmer, you may learn a few things from it.
otf
7 of 8 people found the following review helpful:
5.0 out of 5 stars
Perl-intensive book on web crawler design,
This review is from: Spidering Hacks (Paperback)
A spider (also known as a web crawler or web robot) is a program which browses the World Wide Web in a methodical, automated manner. This book is about how to create programs that perform the functions of a web crawler, with most of the Hacks being written in Perl. Like the rest of the Hacks series, this book presents 100 bite-sized chunks of code or technique to tackle specific activities. In this book these range from the simple - how to download a set of image files - to the complex - cross-referring the output from one site with another to generate a third set of data. No matter what the complexity, each hack is clearly explained, with the code samples balanced with instructions, examples and notes on how to hack the hack.
As already mentioned, the hacks in this book mostly use Perl, though scattered here and there you'll find some Java, Python and PHP. If you really hate Perl, then you will not like this book. On the other hand, the authors assume only a rudimentary knowledge of Perl, and there is no requirement for any knowledge of network programming of any description. After the opening chapter, which gives guidance on being a good spidering citizen (how to respect the sites you are taking data from), there is a second chapter which details how to create a spidering toolkit (how to find and install the suite of modules that many of the hacks depend on). With a toolkit in place and a knowledge of good behavior, the book dives into the various hacks, which are organized by topic: collecting media files, gleaning data from databases (with many examples for Yahoo!, Amazon, Google, Alexa and other popular information sources), maintaining your collections (more automation with "cron" or other scheduling tools) and a final chapter on giving something back (creating a web service, generating RSS feeds and so on). The bulk of the hacks are in chapter four, which looks at extracting data from databases. Aside from the obvious sources such as Amazon and Google, these include online banks, tracking FedEx packages and more. There is a range of techniques used to grab and filter the data, so even if a data source you want to use isn't listed, the chances are that one of these hacks can be refactored to do what you want. If Perl is not your thing then the very light sprinkling of non-Perl hacks probably isn't enough to make this a worthwhile purchase. If you're a Perl hacker interested in spidering, there is without doubt a ton of stuff for you here.
Also, if you are a student looking for a good supplement on building a web spider from scratch, this is probably not the book for you either, but the various hacks will give you some ideas on what you might want to do in your own spider if you wish to write one in a higher level language such as Java. Amazon does not show the table of contents so I do that here for completeness:
Chapter 1. Walking Softly
1. A Crash Course in Spidering and Scraping
2. Best Practices for You and Your Spider
3. Anatomy of an HTML Page
4. Registering Your Spider
5. Preempting Discovery
6. Keeping Your Spider Out of Sticky Situations
7. Finding the Patterns of Identifiers
Chapter 2. Assembling a Toolbox
Perl Modules
Resources You May Find Helpful
8. Installing Perl Modules
9. Simply Fetching with LWP::Simple
10. More Involved Requests with LWP::UserAgent
11. Adding HTTP Headers to Your Request
12. Posting Form Data with LWP
13. Authentication, Cookies, and Proxies
14. Handling Relative and Absolute URLs
15. Secured Access and Browser Attributes
16. Respecting Your Scrapee's Bandwidth
17. Respecting robots.txt
18. Adding Progress Bars to Your Scripts
19. Scraping with HTML::TreeBuilder
20. Parsing with HTML::TokeParser
21. WWW::Mechanize 101
22. Scraping with WWW::Mechanize
23. In Praise of Regular Expressions
24. Painless RSS with Template::Extract
25. A Quick Introduction to XPath
26. Downloading with curl and wget
27. More Advanced wget Techniques
28. Using Pipes to Chain Commands
29. Running Multiple Utilities at Once
30. Utilizing the Web Scraping Proxy
31. Being Warned When Things Go Wrong
32. Being Adaptive to Site Redesigns
Chapter 3. Collecting Media Files
33. Detective Case Study: Newgrounds
34. Detective Case Study: iFilm
35. Downloading Movies from the Library of Congress
36. Downloading Images from Webshots
37. Downloading Comics with dailystrips
38. Archiving Your Favorite Webcams
39. News Wallpaper for Your Site
40. Saving Only POP3 Email Attachments
41. Downloading MP3s from a Playlist
42. Downloading from Usenet with nget
Chapter 4. Gleaning Data from Databases
43. Archiving Yahoo! Groups Messages with yahoo2mbox
44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
45. Gleaning Buzz from Yahoo!
46. Spidering the Yahoo! Catalog
47. Tracking Additions to Yahoo!
48. Scattersearch with Yahoo! and Google
49. Yahoo! Directory Mindshare in Google
50. Weblog-Free Google Results
51. Spidering, Google, and Multiple Domains
52. Scraping Amazon.com Product Reviews
53. Receive an Email Alert for Newly Added Amazon.com Reviews
54. Scraping Amazon.com Customer Advice
55. Publishing Amazon.com Associates Statistics
56. Sorting Amazon.com Recommendations by Rating
57. Related Amazon.com Products with Alexa
58. Scraping Alexa's Competitive Data with Java
59. Finding Album Information with FreeDB and Amazon.com
60. Expanding Your Musical Tastes
61. Saving Daily Horoscopes to Your iPod
62. Graphing Data with RRDTOOL
63. Stocking Up on Financial Quotes
64. Super Author Searching
65. Mapping O'Reilly Best Sellers to Library Popularity
66. Using All Consuming to Get Book Lists
67. Tracking Packages with FedEx
68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds
72. Automatically Finding Blogs of Interest
73. Scraping TV Listings
74. What's Your Visitor's Weather Like?
75. Trendspotting with Geotargeting
76. Getting the Best Travel Route by Train
77. Geographic Distance and Back Again
78. Super Word Lookup
79. Word Associations with Lexical Freenet
80. Reformatting Bugtraq Reports
81. Keeping Tabs on the Web via Email
82. Publish IE's Favorites to Your Web Site
83. Spidering GameStop.com Game Prices
84. Bargain Hunting with PHP
85. Aggregating Multiple Search Engine Results
86. Robot Karaoke
87. Searching the Better Business Bureau
88. Searching for Health Inspections
89. Filtering for Content
Chapter 5. Maintaining Your Collections
90. Using cron to Automate Tasks
91. Scheduling Tasks Without cron
92. Mirroring Web Sites with wget and rsync
93. Accumulating Search Results Over Time
Chapter 6. Giving Back to the World
94. Using XML::RSS to Repurpose Data
95. Placing RSS Headlines on Your Site
96. Making Your Resources Scrapable with Regular Expressions
97. Making Your Resources Scrapable with a REST Interface
98. Making Your Resources Scrapable with XML-RPC
99. Creating an IM Interface
100. Going Beyond the Book
2 of 2 people found the following review helpful:
4.0 out of 5 stars
A Classic (is it outdated?),
By
Amazon Verified Purchase
This review is from: Spidering Hacks (Paperback)
This book was published in October 2003. It is now late in 2010. It is a great book, but can it really be 5 stars after 7 years?
I've owned the book since 2005, and any time I have a spidering question, I still turn to it, and am rarely disappointed. My copy is thoroughly dogeared. But how much of my work could actually be shortcutted if I was using the most recent Perl modules?
5.0 out of 5 stars
Enter the Spider,
By
Amazon Verified Purchase
This review is from: Spidering Hacks (Paperback)
I've always wondered what a spider was and now I know what a scraper is too. The book provides a lot of info and links. I've just started and am happy so far.
Spidering Hacks by Tara Calishain (Paperback - November 1, 2003)