Clean Data
Megan Squire (Author)
Key Features
- Grow your data science expertise by filling your toolbox with proven strategies for a wide variety of cleaning challenges
- Familiarize yourself with the crucial data cleaning processes, and share your own clean data sets with others
- Complete real-world projects using data from Twitter and Stack Overflow
Book Description
Is much of your time spent doing tedious tasks such as cleaning dirty data, accounting for lost data, and preparing data to be used by others? If so, then having the right tools makes a critical difference, and will be a great investment as you grow your data science expertise.
The book starts by highlighting the importance of data cleaning in data science, and will show you how to reap rewards from reforming your cleaning process. Next, you will cement your knowledge of the basic concepts that the rest of the book relies on: file formats, data types, and character encodings. You will also learn how to extract and clean data stored in RDBMS, web files, and PDF documents, through practical examples.
At the end of the book, you will be given a chance to tackle a couple of real-world projects.
What you will learn
- Understand the role of data cleaning in the overall data science process
- Learn the basics of file formats, data types, and character encodings to clean data properly
- Master critical features of the spreadsheet and text editor for organizing and manipulating data
- Convert data from one common format to another, including JSON, CSV, and some special-purpose formats
- Implement three different strategies for parsing and cleaning data found in HTML files on the Web
- Reveal the mysteries of PDF documents and learn how to pull out just the data you want
- Develop a range of solutions for detecting and cleaning bad data stored in an RDBMS
- Create your own clean data sets that can be packaged, licensed, and shared with others
- Use the tools from this book to complete two real-world projects using data from Twitter and Stack Overflow
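As a rough illustration of the format-conversion topic listed above (this is not code from the book), a CSV-to-JSON conversion with light cleaning might be sketched in Python as follows. The function name `csv_to_json` and the sample data are hypothetical, and the sketch assumes every row has no more fields than the header row.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects.

    Hypothetical helper for illustration only. Whitespace around keys
    and values is trimmed, and empty cells become JSON null.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        cleaned = {}
        for key, value in row.items():
            # DictReader fills short rows with None, so guard before stripping.
            value = (value or "").strip()
            # Treat empty cells as missing rather than as empty strings.
            cleaned[key.strip()] = value if value else None
        rows.append(cleaned)
    return json.dumps(rows, indent=2)


# Made-up sample with stray whitespace and one empty cell.
sample = "name,language\nAda , Python\nGrace,\n"
print(csv_to_json(sample))
```

Mapping empty cells to null rather than to an empty string is a design choice: it keeps "no value" distinguishable from "blank text" after the conversion.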
About the Author
Megan Squire is a professor of computing sciences at Elon University. She has been collecting and cleaning dirty data for two decades. She is also the leader of FLOSSmole.org, a research project to collect data and analyze it in order to learn how free, libre, and open source software is made.
Table of Contents
- Why Do You Need Clean Data?
- Fundamentals – Formats, Types, and Encodings
- Workhorses of Clean Data – Spreadsheets and Text Editors
- Speaking the Lingua Franca – Data Conversions
- Collecting and Cleaning Data from the Web
- Cleaning Data in PDF Files
- RDBMS Cleaning Techniques
- Best Practices for Sharing Your Clean Data
- Stack Overflow Project
- Twitter Project
Product details
- Publisher : Packt Publishing (May 25, 2015)
- Language : English
- Paperback : 272 pages
- ISBN-10 : 1785284010
- ISBN-13 : 978-1785284014
- Item Weight : 1.04 pounds
- Dimensions : 7.5 x 0.62 x 9.25 inches
About the author

I am a professor of Computing Sciences at Elon University (NC, USA). I teach mostly database systems, web development, data mining, and data science courses. My first language was Perl, but I am learning to love Python. Some colleagues and I started the FLOSSmole project in 2004 as a way to collect data and analyses about free, libre, and open source software (FLOSS) projects. We provide lots of historical data about the way open source has grown and changed over the past decade-plus.
Customer reviews
Top reviews from the United States
Most obviously, the title is a lie, as the book is not about data cleaning, but about data extraction, primarily web scraping. Invocations of data science are similarly for the gullible - you might as well write a book about, say, project management, and call it "data science", because hey, data scientists manage projects too. Second, the author advertises the book to beginners, but can't be bothered to provide beginner-adequate hand-holding, and just gives them code dumps - off you go. This isn't helping beginners, it's exploiting them.
Beginners, you don't have to settle for this. If it's actual data cleaning that you are looking for, I recommend "Data Quality Assessment" by Arkady Maydanchik. For what this book was actually going for, see "Web Scraping with Python" by Ryan Mitchell, or "Automate the Boring Stuff with Python" by Al Sweigart. Once in a blue moon, Packt does publish a good book, but "Clean Data" is part of the 99%, a low-quality quickie by a non-expert and a non-writer.
4.5 stars for content. Pros: I recently began a graduate program at MIT in a data science-heavy field, and the sections on converting between formats in Python were especially helpful as a jump start to several projects. As another reviewer mentioned, this is an appropriate book not only for beginners, but also for anyone returning to the field after a hiatus. Cons: I agree with another reviewer that a section on R would have been helpful. Also, the code snippets were written like a cookbook, where some may prefer a more line-by-line explanation.
Ask any data scientist, developer, or analyst and they'll tell you that they spend more time than they'd care to admit cleaning, parsing, and formatting data to suit their needs. This very process is the root of countless hours of lost productivity, frustrating bugs in code, and incomplete or sloppy analysis. This book attempts to arm the reader with a set of tools and a mindset by which the reader can successfully clean data and display it in compelling ways. For me it's been the most valuable technical literature I've dedicated time to in quite a while.
But enough with the rhetoric, what should you know before thinking about purchasing this book? I'll focus on what I thought were the key skills to be gained from the book and also some things you should be aware of prior to investing your time and money.
Key Takeaways:
1 - You will learn to seamlessly convert common file types like CSV, TSV, JSON, and HTML into MySQL tables and vice versa. There are many subtleties I wasn't even aware of - for example, using the correct data types when cleaning data with tools like MS Excel. These subtleties, if not handled correctly, can cause major headaches down the road.
2 - You'll learn to scrape and clean data using Python and PHP. Python is one of the go-to data science and visualization languages and is a personal favorite tool of mine. While PHP may not be a great choice for data science, it is refreshingly easy to use in conjunction with MySQL and doesn't require nearly as much boilerplate code as other languages. Those of you familiar with JDBC know just how annoying some languages make working with SQL.
3 - You'll learn to automate daily workflow items. The amount of data contained in PDFs, text files, and spreadsheets is enormous and can be difficult to parse. Oftentimes, companies will resort to hiring more people or implementing increasingly complicated processes to store and communicate that data - this book will teach you how to automate those types of tasks and make life easier for yourself and your colleagues.
4 - This book will teach you to visualize the data you've cleaned using d3.js - a very powerful visualization library used by companies like the New York Times. Programmatic data visualization is a difficult task and it's very difficult to figure out all the nuts and bolts by yourself. It was enormously beneficial to have Dr. Squire's help in class working out the details of a tricky visualization problem, and she's done a great job of communicating that knowledge in this book.
Some things to consider before purchasing:
1 - If you're looking for a cookbook for a specific language this book may not be for you. While Dr. Squire includes numerous working code examples, it's my understanding that she's trying to impart knowledge of the fundamentals and thought process of cleaning data.
2 - If you've yet to learn the fundamentals of programming, this book will not spend much time teaching them - after all, it's not intended to. If you are a beginner looking to become a data scientist, I would start with some books that go over programming fundamentals like data structures, objects, classes, functions, etc.


