- Paperback: 256 pages
- Publisher: O'Reilly Media; 1 edition (July 24, 2015)
- Language: English
- ISBN-10: 1491910291
- ISBN-13: 978-1491910290
- Product Dimensions: 7 x 0.6 x 9.2 inches
- Shipping Weight: 13.6 ounces
- Average Customer Review: 51 customer reviews
- Amazon Best Sellers Rank: #37,665 in Books
Web Scraping with Python: Collecting Data from the Modern Web 1st Edition
There is a newer edition of this item.
From the Publisher
Q&A with author Ryan Mitchell
What got you interested in web scraping?
In 2011, I started working for a company called Abine, which offered a service to remove customers’ personal information from various sites on the Internet. In the early days of the company, the process of looking for someone’s personal information on all of these sites, filling out all these opt-out forms, faxing, emailing, compiling reports to send back to the customers -- it all took a lot of time! I started looking into ways to streamline these processes and add additional features. I built bots that could search for profiles, store information in our database, fill out web forms, create documents, and send the emails and faxes automatically. Some of these sites were fairly bot-resistant, so I had to learn, and even invent, some interesting techniques to deal with them. I really fell in love with building bots and scraping the web, and continued to do it even after I left the company!
Why is Python such a good fit for web scraping and building web crawlers?
I’ll be honest: As far as high-performance programming languages go, Python does not win many speed contests. But with web scraping, you’re not looking for speed -- sending and receiving data across the Internet will be thousands of times slower than any relatively tiny differences in language performance, so you can throw that metric out the window! What you need is something that’s lightweight, easy to deploy to remote machines, that can be installed and run anywhere, that’s easy to write and modify, and, perhaps most importantly, that has a plethora of well-documented tools for just about any situation. Python has all of these in spades.
What’s the most interesting way you’ve used web scraping, for professional or side projects?
One of my favorite scraping projects, and something I introduce in Web Scraping with Python, is scraping Wikipedia for historical edits by IP address, time of the edit, and language. You can resolve the IP address to a geographic location, and explore when and where speakers of different languages are making edits. Lots of interesting sociological research potential there!
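As a rough illustration of the first step in that kind of project, here is a minimal sketch that separates anonymous edits (recorded under an IP address) from edits by named accounts and tallies them by language. It assumes the revision history has already been scraped into tuples; the sample records, the function names, and the tuple layout are all hypothetical, and the geolocation lookup is left out because it depends on an external service.

```python
import ipaddress
from collections import Counter

# Hypothetical sample records: (editor, ISO timestamp, language edition).
# The addresses come from ranges reserved for documentation.
SAMPLE_EDITS = [
    ("203.0.113.7", "2015-03-01T08:15:00", "en"),
    ("ExampleUser", "2015-03-01T09:00:00", "en"),
    ("198.51.100.23", "2015-03-02T14:30:00", "fr"),
    ("2001:db8::1", "2015-03-02T21:45:00", "fr"),
]


def is_ip_address(editor):
    """Return True if the editor field parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(editor)
        return True
    except ValueError:
        return False


def anonymous_edits(edits):
    """Keep only the edits whose editor field is an IP address."""
    return [edit for edit in edits if is_ip_address(edit[0])]


def edits_per_language(edits):
    """Count anonymous edits for each language edition."""
    return Counter(lang for _, _, lang in anonymous_edits(edits))


if __name__ == "__main__":
    for lang, count in edits_per_language(SAMPLE_EDITS).items():
        print(lang, count)
```

Each IP address that survives the filter could then be passed to a geolocation service of your choice to explore where the edits come from.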
A recent hobby of mine has also been automated CAPTCHA solving. I really enjoy analyzing new types of CAPTCHAs for vulnerabilities, writing scripts to pre-process the images, creating data sets for machine learning algorithms, and seeing how high I can get the success percentage of my bots! No real practical applications these days, but you never know when it will come in handy.
What information do you hope that readers of your book will walk away with?
I try to stress a couple of things throughout the book:
First, no website is bot-proof. Attempts to make websites more bot-proof generally also result in a loss of usability for human users. That loss of usability may be in the form of slower loading times, poor browser compatibility, lack of accessibility for users with mobility or visual impairments, or users on mobile devices. And many of these measures have no real deterring effect on web scrapers. If you can view the data in a browser, you can capture it with a scraper.
What’s the most exciting or important thing happening in your space right now?
Like many fields, especially computer science fields, there’s a lot being done with machine learning and big data. The split of page requests between humans and bots is about 50/50 right now, and as more humans are getting on the Internet, more bots are too -- and outpacing them! There’s just so much data, and so many machines collecting that data, and so many connections we haven’t been able to make before, waiting to be made. And these aren’t just data scientists and server farm owners making them, either! The kind of research that once might have required months or years of surveys and data collection is now just a Python script, a database, and a weekend of coding away!
About the Author
Ryan Mitchell is a Software Engineer at LinkeDrive in Boston, where she develops their API and data analysis tools. She is a graduate of Olin College of Engineering and is a master's degree student at Harvard University School of Extension Studies. Prior to joining LinkeDrive, she was a Software Engineer working on web scraping and data analysis at Abine.
Top customer reviews
Nonetheless, as an entry-level Python programmer, I found the book mostly accessible. If you're an experienced coder (Python or otherwise), this book is a great investment in your data acquisition skills.
I'll end on a positive note - my boss likes weather updates for our offices in four different cities (we do logistics). He wants this report at 6:15am daily. I was able to write a .py script that scrapes the webpage, compiles the results into a string, logs into my email account and sends the report to him daily, on time. Now I never have to worry about this early morning task again!
If you need to automate the retrieval, processing and delivery of online information, this book is for you!
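For readers curious what the weather-report script described in the review above might look like, here is a minimal sketch of the "compile results into a string" and "build the email" steps. It is not the reviewer's code: the city names, readings, addresses, and function names are hypothetical, and the scraping and sending steps are omitted because they depend on the specific weather page and mail provider.

```python
from email.message import EmailMessage


def build_report(readings):
    """Turn a {city: summary} mapping into one line of text per city."""
    return "\n".join(f"{city}: {summary}" for city, summary in readings.items())


def build_message(body, sender, recipient):
    """Wrap the report text in an email message ready to be sent."""
    msg = EmailMessage()
    msg["Subject"] = "Daily weather report"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    # Hypothetical readings; a real script would scrape these from a web page.
    readings = {
        "City A": "Sunny, 72F",
        "City B": "Rain, 65F",
    }
    message = build_message(
        build_report(readings),
        sender="me@example.com",
        recipient="boss@example.com",
    )
    # Sending would typically use smtplib (for example smtplib.SMTP_SSL with
    # login and send_message), configured for the mail provider in use.
    print(message)
```

Scheduling the script for a fixed time each morning would be handled outside Python, for example with cron or a task scheduler.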
Most recent customer reviews
Clear, concise and useful
Kind of obsolete. Most of the code in this book does not work. Published in 2015.
No errata is posted on the book's web site.