Most Helpful Customer Reviews
12 of 14 people found the following review helpful:
4.0 out of 5 stars
Cursed by a bad name, May 19, 2005
If only this book didn't have the "Data Crunching" name. Far from being about data analysis, this is really a general book about different data formats (e.g. text, XML, database, binary) and how they are created and accessed in different languages. It's a reasonable fundamentals book. It also serves to introduce a wide variety of different technologies, without going into explicit depth about each.
As long as you understand what you are getting then I recommend this book. This is not a book about data processing techniques. Also, I recommend NOT using the material in the book that references SQL. The code does not use prepared statements properly and is vulnerable to SQL injection attacks.
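To illustrate the point about prepared statements, here is a minimal sketch of a parameterized query using Python's standard `sqlite3` module. It is not code from the book; the `users` table, its columns, and the `find_user` helper are hypothetical names chosen for the example.

```python
import sqlite3

def find_user(conn, name):
    """Look up rows by name with a bound parameter (hypothetical helper)."""
    # The ? placeholder makes the driver bind the value separately,
    # so the input is treated as data and never parsed as SQL.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

print(find_user(conn, "alice"))
# A classic injection string is just an unmatched name here.
print(find_user(conn, "alice' OR '1'='1"))
```

Building the same query by string concatenation would let the second input rewrite the WHERE clause, which is the kind of vulnerability described above.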
5 of 5 people found the following review helpful:
5.0 out of 5 stars
It's about using the right tool for the right job, June 13, 2006
Gregory Wilson likes Python and bash but doesn't particularly care for XSLT (or Perl, and possibly Java as well), doesn't express a preference in the great Emacs vs. Vi(m) holy war, and divides programming languages into two camps: agile, like Python and Ruby, and "sturdy", like Java. He's an adjunct CS professor at the University of Toronto, a contributing editor with Dr. Dobb's Journal, and is developing "Software Carpentry", which is either a basic course on software development aimed at scientists and engineers for the Python Software Foundation or a project to develop a newer, easier-to-use set of software development tools.
In the book, "Data Crunching: Solve Everyday Problems Using Java, Python, and More", data crunching is explored through a series of examples. The closest that Wilson comes to giving a definition is when, at the start of the first chapter, he refers to data crunching/munging as the "other 10%" of a programming task that takes up the "other 90% of the time". The first example that he gives is his experience helping a high school science teacher convert PDB (Protein Data Bank) files containing the coordinates of atoms in various molecules into a format that a Fortran sphere-drawing program could process.
From the introduction, he moves on to the manipulation of text and text files using Unix command-line tools and Python, with Java work-alikes following most of the Python scripts. Although the book's subtitle, "Solve Everyday Problems Using Java, Python, and More", gives Java first billing (possibly for marketing reasons?), Wilson's preference for Python over Java is never in doubt. After presenting the Java equivalent of a Python script that counts the number of times every email address appears in a list of email addresses, he writes:
All right. It's two-and-a-half times longer than the equivalent Python program, it isn't as fast on small files, and we have to compile it before we can run it, but other than that, it's almost as easy...
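For a sense of scale, a Python script for this kind of task can be very short. The following is a sketch under stated assumptions, not the book's actual script: it assumes one address per line, treats addresses as case-insensitive, and uses a hypothetical `count_addresses` function and made-up sample addresses.

```python
from collections import Counter

def count_addresses(lines):
    """Count how often each email address appears (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Assumption: one address per line; case differences are ignored.
        address = line.strip().lower()
        if address:  # skip blank lines
            counts[address] += 1
    return counts

# Made-up sample input, for illustration only.
sample = ["turing@example.org", "hopper@example.org", "Turing@example.org", ""]
for address, n in sorted(count_addresses(sample).items()):
    print(n, address)
```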
With a table of useful commands, explanation of redirection and piping, and some guidelines on how to make sure that your command-line tools follow convention, the text chapter could actually be viewed as a pretty passable introduction to the philosophy of Unix.
The chapter on Regular Expressions is great. So good, in fact, that I wish I could go back in time and give myself a photocopy of those thirty-odd pages at the point that I was struggling to get a handle on RE's some years back. Also included in this chapter is a brief, but very lucid, discussion of character encoding and a bit on using grep.
Although the Text and RE chapters were my favorite, Wilson's clear and concise writing style makes the entire book, including the coverage of XML, binary data processing, and relational databases, a joy to read. With segues like "But wait a second. Wait just one pattern-matching second.", lists of email addresses to munge that include entries for Alan Turing, John von Neumann, and Grace Hopper, and the like, he also manages to inject some pleasant, if a bit groan-worthy, humor here and there into what could otherwise be a rather dry book.
He uses the last chapter, titled "Horseshoe Nails", to quickly address a number of topics, like encoding, the pitfalls of floating point arithmetic, and unit testing, which (not a surprise in a title coming from the Pragmatic Bookshelf) he likes, going so far as to say that the spread of test-driven development has been the "real revolution in programming in the last decade". Diff is introduced and he brings the venerable make to the table as a tool for automating test running.
He doesn't say it in so many words, though his retooling the old saying that "two years of hard work can save you an hour in the library" as "an hour of hard work can often save you sixty seconds on Google" comes close, but the message is to work smarter rather than harder. Use industrial-strength tools and processes when industrial-strength solutions are called for and agile, simplest-things-that-work solutions whenever possible.
6 of 7 people found the following review helpful:
5.0 out of 5 stars
Just what the newbie or occasional programmer needs, June 10, 2005
Data Crunching is a short book with great how-to-like code examples of very common data parsing and manipulation techniques. The examples are easy to follow and clearly demonstrate the author's point. None of the topics are covered in great depth but each contains enough to whet the reader's appetite for more. The text and examples are thought-provoking, leading the reader to ask the right kind of questions when detailed information is needed.
The book covers the most common aspects of data crunching, including text files, regular expressions, XML, binary files, relational databases and unit testing. The book dedicates a chapter to each of these topics. Each chapter has one or more sample problems to solve. I found the sample problems to be well thought out: if not exactly the same as a real-life data crunching problem I've had to solve in the past, then sufficiently close that I could easily apply the principles (and sample code) to my problem. I thought the regular expressions section was an excellent, succinct (re)introduction to regular expressions. Wilson starts with basic patterns, quickly and clearly working up to common complex patterns. The regular expressions chapter also includes a nice bit of Python code that generates a table of patterns, test strings and those patterns that match them. I liked the chapter on XML but noticed that, although it has a good example of an XSLT template, there is no code showing how to process it. The chapter on relational databases covers all the most common SQL needed for daily use (think 10% of the SQL that works on 90% of the problems). This includes sub-selects, negation, aggregation and views. The last chapter, "Horseshoe Nails", covers miscellaneous topics including testing. The author of course covers unit testing but also simple ways of testing when full-blown unit testing is overkill. The last chapter also has sections on encoding, dealing with floating point numbers, and dates and times and how to format them with strftime. I was impressed by the author's ability to cull such important techniques and idioms and organize them into a small, yet incredibly useful text.
Data Crunching covers real-life data parsing and manipulation concepts. It does so without tangential journeys into other areas of programming. Each of the five main topics includes simple code examples, usually in Python, Java or both, that clearly demonstrate the topic. The author does an impressive job of squeezing in almost all the issues in the daily work of data crunching. The reader can expect to come away with something of value on each topic covered, especially the newbie or occasional script writer.
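The pattern-versus-test-string table mentioned for the regular expressions chapter can be approximated in a few lines. This is a rough sketch of the idea rather than the book's code; the `match_table` function, the patterns, and the test strings are all hypothetical.

```python
import re

def match_table(patterns, strings):
    """Return one row per test string: (string, [does each pattern match?])."""
    rows = []
    for s in strings:
        # re.search finds a match anywhere in the string, not only at the start.
        rows.append((s, [bool(re.search(p, s)) for p in patterns]))
    return rows

# Hypothetical patterns and test strings, for illustration only.
patterns = [r"^a", r"\d+", r"b$"]
strings = ["ab", "a1", "xyz"]

print("string".ljust(8) + "".join(p.ljust(6) for p in patterns))
for s, hits in match_table(patterns, strings):
    print(s.ljust(8) + "".join(("yes" if h else "-").ljust(6) for h in hits))
```

A table like this makes it easy to see at a glance which patterns are too greedy or too strict for a given set of inputs.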
Most Recent Customer Reviews