Web Scraping with Python: Collecting Data from the Modern Web 1st Edition
Ryan Mitchell (Author)
There is a newer edition of this item.
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once.
Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.
- Learn how to parse complicated HTML pages
- Traverse multiple pages and sites
- Get a general overview of APIs and how they work
- Learn several methods for storing the data you scrape
- Download, read, and extract data from documents
- Use tools and techniques to clean badly formatted data
- Read and write natural languages
- Crawl through forms and logins
- Understand how to scrape JavaScript
- Learn image processing and text recognition
From the brand
Sharing the knowledge of experts
O'Reilly's mission is to change the world by sharing the knowledge of innovators. For over 40 years, we've inspired companies and individuals to do new things (and do them better) by providing the skills and understanding that are necessary for success.
Our customers are hungry to build the innovations that propel the world forward. And we help them do just that.
From the Publisher
Q&A with author Ryan Mitchell
What got you interested in web scraping?
In 2011, I started working for a company called Abine that offered a service to remove customers’ personal information from various sites on the Internet. In the early days of the company, the process of looking for someone’s personal information on all of these sites, filling out all these opt-out forms, faxing, emailing, compiling reports to send back to the customers -- it all took a lot of time! I started looking into ways to streamline these processes and add additional features. I built bots that could search for profiles, store information in our database, fill out web forms, create documents, and send the emails and faxes automatically. Some of these sites were fairly bot-resistant, so I had to learn, and even invent, some interesting techniques to deal with them. I really fell in love with building bots and scraping the web, and continued to do it even after I left the company!
Why is Python such a good fit for web scraping and building web crawlers?
I’ll be honest: As far as high-performance programming languages go, Python does not win many speed contests. But with web scraping, you’re not looking for speed -- sending and receiving data across the Internet will be thousands of times slower than any relatively tiny differences in language performance, so you can throw that metric out the window! What you need is something that’s lightweight, easy to deploy to remote machines, that can be installed and run anywhere, that’s easy to write and modify, and, perhaps most importantly, that has a plethora of well-documented tools for just about any situation. Python has all of these in spades.
What’s the most interesting way you’ve used web scraping, for professional or side projects?
One of my favorite scraping projects, and something I introduce in Web Scraping with Python, is scraping Wikipedia for historical edits by IP address, time of the edit, and language. You can resolve the IP address to a geographic location, and explore when and where speakers of different languages are making edits. Lots of interesting sociological research potential there!
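As a rough illustration of the first step in such a project, the sketch below separates anonymous edits (where the editor name is an IP address) from registered-user edits and counts them per language. It is not the code from the book: the sample records, the `is_ip_address` helper, and the tuple layout are hypothetical, and the scraping and geolocation steps are left out entirely.

```python
import ipaddress
from collections import Counter


def is_ip_address(username: str) -> bool:
    """Return True if the editor name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username)
        return True
    except ValueError:
        return False


# Hypothetical sample records: (editor name, edit timestamp, wiki language code).
# The addresses come from ranges reserved for documentation.
edits = [
    ("192.0.2.17", "2015-03-01T08:15:00Z", "en"),
    ("SomeRegisteredUser", "2015-03-01T09:02:00Z", "en"),
    ("2001:db8::1", "2015-03-02T21:40:00Z", "de"),
    ("198.51.100.4", "2015-03-03T13:05:00Z", "de"),
]

# Keep only the edits whose editor name is an IP address.
anonymous_edits = [edit for edit in edits if is_ip_address(edit[0])]

# Count anonymous edits per language edition.
edits_per_language = Counter(lang for _, _, lang in anonymous_edits)
```

Each remaining IP address could then be passed to whatever geolocation lookup you choose, which is the part that makes the when-and-where comparison possible.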
A recent hobby of mine has also been automated CAPTCHA solving. I really enjoy analyzing new types of CAPTCHAs for vulnerabilities, writing scripts to pre-process the images, creating data sets for machine learning algorithms, and seeing how high I can get the success percentage of my bots! No real practical applications these days, but you never know when it will come in handy.
What information do you hope that readers of your book will walk away with?
I try to stress a couple of things throughout the book:
First, no website is bot-proof. Attempts to make websites more bot-proof generally also result in a loss of usability for human users. That loss of usability may be in the form of slower loading times, poor browser compatibility, lack of accessibility for users with mobility or visual impairments, or users on mobile devices. And many of these measures have no real deterring effect on web scrapers. If you can view the data in a browser, you can capture it with a scraper.
Second, writing web scrapers that capture the data you want often involves combining multiple techniques, some creative thinking, and a dash of laziness. I can’t count the number of times people have asked me to build a bot, or to help them build a bot, to collect data that could be easily obtained through an API! So sometimes your data collection problem can be solved using the information from only a single chapter in the book. On the other hand, I also provide an example of a web scraper that uses JavaScript execution, HTML parsing, DOM interaction, and optical character recognition, all in one piece of code, in order to extract the text from book previews on Amazon! (Sorry, Amazon!) When faced with a web scraping problem you should always 'work the steps' to try to formulate a data extraction and processing plan -- it’s not just about learning a single library or command!
What’s the most exciting or important thing happening in your space right now?
Like many fields, especially computer science fields, there’s a lot being done with machine learning and big data. The percentage of page requests performed by humans and bots is about 50/50 right now, and as more humans are getting on the Internet, more bots are too -- and outpacing them! There’s just so much data, and so many machines collecting that data, and so many connections we haven’t been able to make before, waiting to be made. And these aren’t just data scientists and server farm owners making them, either! The kind of research that once might have required months or years of surveys and data collection is now just a Python script, a database, and a weekend of coding away!
Editorial Reviews
About the Author
Ryan Mitchell is a Software Engineer at LinkeDrive in Boston, where she develops their API and data analysis tools. She is a graduate of Olin College of Engineering and is a master’s degree student at Harvard University School of Extension Studies. Prior to joining LinkeDrive, she was a Software Engineer working on web scraping and data analysis at Abine.
Product details
- Publisher : O'Reilly Media; 1st edition (July 24, 2015)
- Language : English
- Paperback : 256 pages
- ISBN-10 : 1491910291
- ISBN-13 : 978-1491910290
- Item Weight : 15.9 ounces
- Dimensions : 7.01 x 0.54 x 9.17 inches
- Best Sellers Rank: #1,172,907 in Books
- #148 in Internet Web Browsers
- #184 in Online Internet Searching
- #471 in Web Services
About the author

Ryan Mitchell is a senior software engineer at HedgeServ in Boston, where she develops APIs and data analytics tools for hedge fund managers. She is a graduate of Olin College of Engineering and Harvard University Extension School with a master’s in software engineering and certificate in data science. Since 2012 she has regularly consulted, lectured, and run workshops around the country on the topics of web scraping, Python automation tools, and data science.
Customer reviews
Top reviews from the United States
Nonetheless, as an entry-level Python programmer, I found the book mostly readily accessible. If you're an experienced coder (Python or otherwise) this book is a great investment in your data acquisition skills.
I'll end on a positive note - my boss likes weather updates for our offices in four different cities (we do logistics.) He wants this report at 6:15am daily. I was able to write a .py script that scrapes the webpage, compiles results into a string, logs into my email account and sends the report to him daily, on time. Now I never have to worry about this early morning task again!
If you need to automate the retrieval, processing and delivery of online information, this book is for you!
Top reviews from other countries
However, I recommend that anyone seriously considering buying this book should definitely wait for the new edition.
On the content:
- short explanations, but entirely sufficient as long as you are not a programming beginner
- easy to stay focused, since the author never digresses into details or repeats things unnecessarily
- modules are BARELY explained
- very thorough on the different variants and ways of mining/crawling different websites
- questionable use of info boxes (e.g. on page 67, sets in Python are suddenly explained?!??! Why??!)
Verdict:
Absolutely great if you know nothing about the field and want ready-made examples that you can use yourself right away.
Very poor if you already have a lot of experience and are looking for a deeper understanding of how the modules used are built.
Since the former applies to most people, and you can also guess as much from the page count (~230), it gets 4 stars.
It doesn't go into much depth on how to process child nodes and the like, which would be desirable. It gives a foundation for each topic, and from there you have to fend for yourself. Still, it is satisfying to see that Python can do things much better than a simple shell script. I'm looking forward to seeing how far this can go...
That said, if you have never programmed in Python, it doesn't help much; you are expected to know the language.









