Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work 1st Edition

3.8 out of 5 stars 12 customer reviews
ISBN-13: 978-1449321888
ISBN-10: 1449321887

Editorial Reviews

Book Description

Mapping the World of Data Problems

About the Author

Q Ethan McCallum is a consultant, writer, and technology enthusiast, though perhaps not in that order. His work has appeared online on The O’Reilly Network and Java.net, and also in print publications such as C/C++ Users Journal, Doctor Dobb’s Journal, and Linux Magazine. In his professional roles, he helps companies to make smart decisions about data and technology.

Product Details

  • Paperback: 264 pages
  • Publisher: O'Reilly Media; 1 edition (November 24, 2012)
  • Language: English
  • ISBN-10: 1449321887
  • ISBN-13: 978-1449321888
  • Product Dimensions: 7 x 0.6 x 9.2 inches
  • Shipping Weight: 1.2 pounds
  • Average Customer Review: 3.8 out of 5 stars (12 customer reviews)
  • Amazon Best Sellers Rank: #585,823 in Books

Customer Reviews

Top Customer Reviews

Format: Paperback
"Bad Data Handbook" follows the low-cost formula of O'Reilly's earlier "Beautiful Data": 15-20 practitioners each contribute a short essay, then O'Reilly puts the manuscripts together and adds a sexy, vague title. Anything goes in: "bad" as in not available in a form ready for analysis? Here's an essay on web scraping in Python! While this increases the chances of you finding something interesting, or even useful, it also increases the share of content that you will find irrelevant. As for my own expectations, "Bad Data Handbook" is not a book dedicated to data quality; if that is what you would like, I recommend "Data Quality Assessment" by Arkady Maydanchik.
17 people found this helpful.
Format: Paperback
Bad data is a fact of life. Coping with bad data is a valuable, learned skill. Bad Data Handbook offers insights from over 20 authors based on their years of personal experience managing ill-defined, often chaotic and incomplete data. We begin with an exploration of what is meant by *bad data* and what checks we can perform to help us understand data quality as a prerequisite to data analysis.

Kevin Fink offers suggestions on approaching data critically in order to ensure that we understand what we're working with before we begin to try to manipulate it. Fink offers useful scripts in shell and Perl that can be used to inspect data and perform basic sanity checks. Paul Murrell tackles the problem of scraping data from sources formatted for human consumption into a format more amenable for algorithmic analysis using R. And on and on.

Each chapter addresses a critical concern in the data life-cycle: identifying, annotating, capturing, archiving, versioning, manipulating, analyzing, and deriving actionable information from imperfect or incomplete data. The advice offered is both powerful and immediately useful to data scientists and newcomers to the field alike and for me has spurred several ideas for how to approach teaching statistics.

Given the number of authors who contributed to this volume, it should come as no surprise that the tone, writing styles, and tools used vary greatly among the chapters, occasionally wandering into technical minutiae. The book holds together remarkably well regardless, and was a pleasure to read.

Disclosure: I received a complimentary ebook copy of this book to review.
8 people found this helpful.
Format: Paperback
In the movie The Sixth Sense, Cole Sear says "I see dead people." Author Q. Ethan McCallum, whose excellent book Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work just came out, likely sees bad data just about everywhere.

So just what is this monster called bad data? McCallum writes on page 1 that it is difficult to explicitly define what bad data is. He writes that some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and incompatible file formats. But he also notes that it is much more than that.

Chapter 1 notes that bad data includes data that eats up your time, causes you to stay late at the office, drives you to tear out your hair in frustration and more. It's data that you can't access, data that you had and then lost, data that's not the same today as it was yesterday. Ultimately, bad data is data that gets in the way. And there are so many ways to get there, from bad storage, to poor representation, to misguided policy.

In the book, McCallum gathered numerous authors to detail how bad data issues have affected them and what they have done to deal with and remediate them.

Most books that have close to 20 authors suffer from poor organization, repetitive material, and an overall lack of structure. This title suffers from none of that, and provides the reader with an excellent guidebook for avoiding the garbage in, garbage out scenario when dealing with data. This is particularly important given that we are living in a data-driven society.

While ostensibly a dry topic, the authors' expertise is such that they are able to make the text most interesting.
10 people found this helpful.
Format: Paperback
TL;DR summary of the review - awesome book. If you work with real-world datasets, or you work with people who do, you owe it to yourself to read this book. I wish it had been around 8 years earlier when I started working with large-scale social sciences census data. All of the fun, and all of the pain, of dealing with government data and social sciences data is particularly true for census information.

Much of the book could be summed up as noting that less-than-perfect data is still very useful, but you need to understand how the data is bad - is it random? What kinds of bias are introduced, if any? What impact will that have on your conclusions? Go get your hands dirty with the data itself - go look at a few hundred records in a text editor to see what you've got. You'll want to test the data all through your analysis, to ensure that you can identify both where you're hitting issues and where you're introducing issues yourself, and you'll be happier if you can automate these tests so that you can run them often without creating a burden for yourself. Prefer simple tools and portable file formats - in particular, Excel is not your friend. The book discusses a number of different case studies and anecdotes for dealing with data that has problems of one flavor or another. The authors have been there before and you can learn from their experience.
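As a rough illustration of the automated, repeatable checks described above (this is not code from the book), a minimal Python sketch might look like the following. The field names `age` and `state`, the sample records, the age bounds, and the `check_records` helper are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
RECORDS = [
    {"age": "34", "state": "MN"},
    {"age": "", "state": "MN"},
    {"age": "-5", "state": "mn"},
    {"age": "abc", "state": "WI"},
]


def check_records(records, min_age=0, max_age=120):
    """Count simple data-quality problems rather than stopping at the first one."""
    problems = Counter()
    for row in records:
        age = row.get("age", "").strip()
        if not age:
            problems["missing_age"] += 1
        else:
            try:
                value = int(age)
            except ValueError:
                problems["non_numeric_age"] += 1
            else:
                # Assumed plausible range; adjust to the dataset at hand.
                if not min_age <= value <= max_age:
                    problems["age_out_of_range"] += 1
        state = row.get("state", "")
        if state != state.upper():
            problems["state_not_uppercase"] += 1
    return problems


if __name__ == "__main__":
    for name, count in sorted(check_records(RECORDS).items()):
        print(f"{name}: {count}")
```

Counting problems by type, instead of raising on the first bad row, gives a quick picture of how the data is bad, and the same function can be rerun cheaply after each step of an analysis.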

Discussions of social sciences survey data and its inherent imperfections and messy metadata definitely rang true with my experiences dealing with census data, as did the chapter on the lowly, undervalued flat file as a data structure.

I'll summarize three takeaway messages that resonated for my own experience:
1.
3 people found this helpful.
