- Paperback: 306 pages
- Publisher: No Starch Press; 1 edition (March 30, 2007)
- Language: English
- ISBN-10: 1593271204
- ISBN-13: 978-1593271206
- Product Dimensions: 7 x 1 x 9.2 inches
- Shipping Weight: 1.4 pounds
- Average Customer Review: 20 customer reviews
- Amazon Best Sellers Rank: #1,461,682 in Books
Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL 1st Edition
About the Author
Michael Schrenk has developed webbots for over 15 years, working just about everywhere from Silicon Valley to Moscow, for clients like the BBC, foreign governments, and many Fortune 500 companies. He's a frequent Defcon speaker and lives in Las Vegas, Nevada.
Top customer reviews
It would be great to see an update to the book that specifically speaks to forms on .asp pages and how the parameters are passed from the client to the server. I have not been able to get a scraper to work with those types of pages.
I was very pleased with how this book covered concepts. The book uses PHP and the cURL library as a teaching tool instead of trying to give a lesson in how to use PHP as a crawler language. The way the code is explained makes it very easy to translate into whatever language you are most comfortable coding in. The book uses fundamental functional programming concepts which make it easy to pick up the general idea without actually knowing PHP.
My boss bought this book to help my group with a project we were working on, and even my co-workers who had no background with PHP were able to use this book to write a web bot in C# (using the cURL library) very easily. The concepts from this book easily transferred over to object-oriented concepts.
I did download some of the material to check it out and tried a few things. If you do not know PHP and want to get started webscraping as your primary goal, this book would be for you.
However, if you're like me, you may get less out of it. I've been programming a rather complex database-driven personal site for over 8 months while learning PHP. Previously, I had very successfully used Perl for webscraping; I became interested in the webscraping possibilities of PHP and decided to check it out. This book did not add much to what I'd already learned in the past 8+ months, although it does have a decent jumpstart guide to cURL.
Technically, the book and examples are very basic and beginner level. All code is procedural and has absolutely no references to object-oriented programming. This is great for a simple project, but building anything larger than a targeted webbot or two is beyond the scope of this book.
I was very dismayed at Mr. Schrenk's opinion of regular expressions:
"The use of regular expressions is a parsing language in itself, and most modern programming languages support aspects of regular expressions. In the right hands, regular expressions are also useful for parsing and substituting text; however, they are famous for their sharp learning curve and cryptic syntax. I avoid regular expressions whenever possible."
This disregard for regular expressions effectively wipes out a powerful toolset for budding developers. Regular expressions are no harder to learn than PHP. The reasoning behind his disdain for them is also flawed:
"The regular expression engine used by PHP is not as efficient as engines used in other languages, and is certainly less efficient than PHP's built-in functions for parsing HTML."
Through its preg_* functions, PHP uses PCRE, a regular expression engine modeled on the one used (very effectively) in Perl. There have been many studies showing that preg_*-style expressions outperform basic text matching in PHP. In this assessment the author is terribly wrong.
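To make the two approaches being argued over concrete, here is a minimal sketch that pulls a page title out of HTML both ways: once with a regular expression and once with plain substring searches between two delimiters. It is written in Python rather than PHP, so it uses Python's `re` module and not the engine behind preg_*; the function names and the sample HTML are made up for illustration, and the sketch says nothing about which approach is faster.

```python
import re

# Hypothetical sample page used only for illustration.
SAMPLE_HTML = "<html><head><title>Example Store</title></head><body></body></html>"


def title_with_regex(html):
    """Extract the <title> text with a regular expression."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def title_with_delimiters(html, start="<title>", stop="</title>"):
    """Extract the text between two delimiters using plain string searches."""
    lowered = html.lower()  # case-insensitive search; assumes ASCII markup
    begin = lowered.find(start)
    if begin == -1:
        return None
    begin += len(start)
    end = lowered.find(stop, begin)
    if end == -1:
        return None
    return html[begin:end].strip()


if __name__ == "__main__":
    print(title_with_regex(SAMPLE_HTML))
    print(title_with_delimiters(SAMPLE_HTML))
```

Both functions return the same string for well-formed input; they differ mainly in how much pattern flexibility you get versus how much bookkeeping you write by hand.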
The book does a great job of explaining how to make single-use scripts for scraping, but never how to create a larger infrastructure. There is no focus on creating multi-process engines with pcntl_fork() or proc_open(), which are critical for scaling web scraping applications. A single script scraping a few hundred websites on a single thread would take ages compared with a multi-threaded engine.
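As a rough illustration of the scaling point, the sketch below fetches a list of pages concurrently instead of one after another. It is an assumption-laden stand-in, not the PHP approach named above: it uses a Python thread pool rather than pcntl_fork() or proc_open(), and the names `fetch` and `fetch_all` and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page; return (url, body bytes), or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return url, response.read()
    except (OSError, ValueError):
        # OSError covers network and HTTP errors; ValueError covers malformed URLs.
        return url, None


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Run fetcher over urls concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))
```

Because each download spends most of its time waiting on the network, several workers can overlap that waiting. A real crawler would also need per-site rate limiting and retry handling, which this sketch leaves out.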
If you are looking to break into web scraping and are not sure where to start, this is likely the best (and possibly only) book on the market. If you are intermediate or advanced, you will quickly question the author's logic and see that scaling will become the number one issue you have to overcome.
I'm sure there will be some a#$h@#e that will say it's too rudimentary. It's an intro and it takes you up to intermediate and explains stuff about PHP that I didn't even know existed. Definitely worth the money. I can't wait for the sequel.