Every day, all around the world, programmers have to recycle legacy data, translate from one vendor's proprietary format into another's, check that configuration files are internally consistent, and search through web logs to see how many people have downloaded the latest release of their product. This kind of "data crunching" may not be glamorous, but knowing how to do it efficiently is essential to being a good programmer.
This book describes the most useful data crunching techniques, explains when you should use them, and shows how they will make your life easier. Along the way, it will introduce you to some handy, but under-used, features of Java, Python, and other languages. It will also show you how to test data crunching programs, and how data crunching fits into the larger software development picture.
Greg Wilson holds a Ph.D. in Computer Science from the University of Edinburgh, and has worked on high-performance scientific computing, data visualization, and computer security. He is the author of Data Crunching and Practical Parallel Programming (MIT Press, 1995), and is a contributing editor at Doctor Dobb's Journal, and an adjunct professor in Computer Science at the University of Toronto.
Born and raised on Vancouver Island; studied engineering at Queen's University in Ontario, worked for a while, then went to Edinburgh for a Master's, some more work, and a PhD. Traveled while writing my first book on parallel programming; came to Toronto "for a couple of years" in 1994, and have never left. I've worked for big corporations, startups, and myself (prefer the small to the large), been a university professor (enjoyed the teaching more than the red tape), and am now project lead for Software Carpentry, a crash course on software development for scientists and engineers. You can find me online at http://third-bit.com (personal stuff) or http://software-carpentry.org (the course).
If only this book didn't have the "Data Crunching" name. Far from being about data analysis, this is really a general book about different data formats (e.g. text, XML, database, binary) and how they are created and accessed in different languages. It's a reasonable fundamentals book. It also serves to introduce a wide variety of different technologies, without going into explicit depth about each.
As long as you understand what you are getting then I recommend this book. This is not a book about data processing techniques. Also, I recommend NOT using the material in the book that references SQL. The code does not use prepared statements properly and is vulnerable to SQL injection attacks.
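To make the SQL injection concern concrete, here is a minimal sketch using Python's standard sqlite3 module. It is not code from the book; the table, column names, and sample values are invented for illustration.

```python
import sqlite3

def find_user_unsafe(conn, name):
    # Vulnerable: the user-supplied value is pasted into the SQL text itself.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, name):
    # Parameterized: the value travels separately from the SQL text,
    # so quotes inside it cannot change the structure of the query.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()

# Hypothetical in-memory database, used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

malicious = "nobody' OR '1'='1"
unsafe_rows = find_user_unsafe(conn, malicious)  # the injected OR matches every row
safe_rows = find_user_safe(conn, malicious)      # treated as a literal name: no match
```

The only difference between the two functions is whether the value is concatenated into the query string or handed to the driver as a parameter.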
Gregory Wilson likes Python and bash but doesn't particularly care for XSLT (or Perl, and possibly Java as well, either), doesn't express a preference in the great Emacs vs. Vi(m) holy war, and divides programming languages into two camps - agile, like Python and Ruby, and "sturdy", like Java. He's an adjunct CS professor at the University of Toronto, a contributing editor with Dr. Dobb's Journal, and is developing "Software Carpentry", which is either a basic course on software development aimed at scientists and engineers for the Python Software Foundation or a project to develop a newer, easier-to-use set of software development tools.
In the book, "Data Crunching: Solve Everyday Problems Using Java, Python, and More", data crunching is explored through a series of examples. The closest that Wilson comes to giving a definition is when, at the start of the first chapter, he refers to data crunching/munging as the "other 10%" of a programming task that takes up the "other 90% of the time". The first example that he gives is his experience helping a high school science teacher convert PDB (Protein Data Bank) files containing the coordinates of atoms in various molecules into a format that a Fortran sphere-drawing program could process.
From the introduction, he moves on to the manipulation of text and text files using Unix command-line tools and Python, with Java work-alikes following most of the Python scripts. Although the book's subtitle, "Solve Everyday Problems Using Java, Python, and More", gives Java first billing (possibly for marketing reasons?), Wilson's preference for Python over Java is never in doubt.... After presenting the Java equivalent of a Python script that counts the number of times every email address appears in a list of email addresses, he writes:
All right. It's two-and-a-half times longer than the equivalent Python program, it isn't as fast on small files, and we have to compile it before we can run it, but other than that, it's almost as easy...
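For readers who have not seen this kind of script, the following is a rough sketch of counting how often each address appears in a list. It is not the book's code; the sample addresses are made up, and folding case and skipping blank lines are assumptions of this sketch.

```python
from collections import Counter

def count_addresses(lines):
    """Count how often each address appears.

    Assumption of this sketch: addresses are compared case-insensitively
    and blank lines are ignored.
    """
    cleaned = (line.strip().lower() for line in lines)
    return Counter(addr for addr in cleaned if addr)

# Made-up sample input; a real script would read lines from a file or stdin.
sample = [
    "alan@example.org",
    "grace@example.org",
    "Alan@example.org",
    "",
    "john@example.org",
    "grace@example.org",
    "alan@example.org",
]

counts = count_addresses(sample)
for address, n in counts.most_common():
    print(n, address)
```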
With a table of useful commands, explanation of redirection and piping, and some guidelines on how to make sure that your command-line tools follow convention, the text chapter could actually be viewed as a pretty passable introduction to the philosophy of Unix.
The chapter on Regular Expressions is great. So good, in fact, that I wish I could go back in time and give myself a photocopy of those thirty-odd pages at the point that I was struggling to get a handle on RE's some years back. Also included in this chapter is a brief, but very lucid, discussion of character encoding and a bit on using grep.
Although the Text and RE chapters were my favorite, Wilson's clear and concise writing style makes the entire book, including the coverage of XML, binary data processing, and relational databases, a joy to read. With segues like "But wait a second. Wait just one pattern-matching second.", lists of email addresses to munge that include entries for Alan Turing, John von Neumann, and Grace Hopper, and the like, he also manages to inject some pleasant, if a bit groan-worthy, humor here and there into what could otherwise be a rather dry book.
He uses the last chapter, titled "Horseshoe Nails", to quickly address a number of topics, like encoding, the pitfalls of floating point arithmetic, and unit testing, which (not a surprise in a title coming from the Pragmatic Bookshelf) he likes, going so far as to say that the spread of test-driven development has been the "real revolution in programming in the last decade". Diff is introduced, and he brings the venerable make to the table as a tool for automating test running.
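As a small illustration of the kind of floating point pitfall mentioned here (my own example, not one taken from the book), decimal fractions such as 0.1 have no exact binary representation, so exact equality tests can fail where a tolerance-based comparison succeeds:

```python
import math

# Ten additions of 0.1 accumulate rounding error.
total = sum([0.1] * 10)

exactly_one = (total == 1.0)               # exact comparison fails
close_to_one = math.isclose(total, 1.0)    # tolerance-based comparison succeeds

# The classic case: 0.1 + 0.2 is not exactly 0.3.
naive_equal = (0.1 + 0.2 == 0.3)
tolerant_equal = math.isclose(0.1 + 0.2, 0.3)

print(exactly_one, close_to_one, naive_equal, tolerant_equal)
```

Checks like these also make natural unit tests: compare against an expected value with a tolerance rather than with `==`.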
He doesn't say it in so many words, though his retooling the old saying that "two years of hard work can save you an hour in the library" as "an hour of hard work can often save you sixty seconds on Google" comes close, but the message is to work smarter rather than harder. Use industrial-strength tools and processes when industrial-strength solutions are called for and agile, simplest-things-that-work solutions whenever possible.
Data Crunching is a short book with great how-to-like code examples of very common data parsing and manipulation techniques. The examples are easy to follow and clearly demonstrate the author's point. None of the topics are covered in great depth but each contains enough to whet the reader's appetite for more. The text and examples are thought provoking, leading the reader to ask the right kind of questions when detailed information is needed.
The book covers the most common aspects of data crunching, including text files, regular expressions, XML, binary files, relational databases and unit testing. The book dedicates a chapter to each of these topics. Each chapter has one or more sample problems to solve. I found the sample problems to be well thought out. They were, if not exactly the same as a real-life data crunching problem I've had to solve in the past, then sufficiently close that I could easily apply the principles (and sample code) to my problem. I thought the regular expressions section was an excellent, succinct (re)introduction to regular expressions. Wilson starts with basic patterns, quickly and clearly working up to common complex patterns. The regular expressions chapter also includes a nice bit of Python code that generates a table of patterns, test strings and those patterns that match them. I liked the chapter on XML but noticed that there was no code example on performing an XSLT transformation. There is, however, a good example of an XSLT template, but no code on how to process it. The chapter on relational databases covers all the most common SQL needed for daily use (think 10% of the SQL that works on 90% of the problems). This includes sub-selects, negation, aggregation and views. The last chapter, "Horseshoe Nails", covers miscellaneous topics including testing.... The author of course covers unit testing but also simple ways of testing when full-blown unit testing is overkill. The last chapter also has sections on encoding, dealing with floating point numbers, dates and times and how to format them with strftime. I was impressed by the author's ability to cull such important techniques and idioms and organize them into a small, yet incredibly useful text.
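The pattern-versus-test-string table mentioned above can be approximated in a few lines. This is my own sketch of the idea, not the book's code; the patterns, test strings, and the `match_table` helper are hypothetical.

```python
import re

def match_table(patterns, strings):
    """Return (string, [bool per pattern]) rows, using re.search for each pair."""
    return [
        (s, [re.search(p, s) is not None for p in patterns])
        for s in strings
    ]

# Hypothetical patterns and test strings chosen for illustration.
patterns = [r"^a", r"\d+", r"b$"]
strings = ["ab", "a1", "xyz", "42b"]

table = match_table(patterns, strings)

# Print one row per test string, one column per pattern.
print("string".ljust(8) + "".join(p.ljust(8) for p in patterns))
for s, hits in table:
    print(s.ljust(8) + "".join(("yes" if h else "-").ljust(8) for h in hits))
```

A table like this is a quick way to check a new pattern against both strings it should match and strings it should reject.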
Data Crunching covers real-life data parsing and manipulation concepts. It does so without tangential journeys into other areas of programming. Each of the five main topics includes simple code examples, usually in Python, Java or both, that clearly demonstrate the topic. The author does an impressive job of squeezing in almost all the issues in the daily work of data crunching. The reader can expect to come away with something of value on each topic covered, especially the newbie or occasional script writer.
Yeah, it's 'Short, Informative, Useful and Clear' (like someone already said) but... it's not enough. It seems like the introductory chapters of an excellent book, but the really important chapters do not exist.
Too expensive for 176 pages with tips on XML, regexes, DB and some Unix commands.
This book is mainly concerned with scripting as a 'glue' between applications: processing various input and output formats. The book is divided into 5 main categories of data handling: plain text, regular expressions, XML, binary data and SQL. There is a final chapter on various miscellaneous topics. Most of the examples are given in Python. Some of the code is demonstrated in Java, although, disappointingly for a book published in 2005, none of the Java 5.0 features are leveraged. However, if nothing else, it demonstrates why Java is not anyone's first choice for such activities.
If you've read any of the O'Reilly cookbook series, you will know what to expect, although the chapters are more cohesive and less episodic. Beginning programmers will get the most out of this book, although intermediate programmers should find at least some material here that's new to them.
The XML chapter is a pretty good introduction to the use and advantages/disadvantages of SAX and DOM, and XSLT is also described, although the discussion is not so clear. Those without experience with databases will welcome the chapter on SQL. The discussion on dealing with plain text files in chapter 1 was a highlight for me, a subject not often covered in much depth in cookbooks; if, like me, you still regularly need to convert between various plain text formats, this chapter will help formalise approaches that you may already be carrying out in a less than rigorous fashion.
Additionally, the paragraphs on floating point arithmetic were intriguing but all too brief. The chapter on dealing with binary is fairly good, although rather dry.... Peter Seibel's discussion of binary data in the context of writing a Shoutcast server in Practical Common Lisp shows that the subject can be dealt with in a more compelling fashion. That said, for the most part, author Greg Wilson is a genial companion; the writing style is chatty, but doesn't overdo it.
Overall, if you own any cookbook-style books, there is little here that you don't already know. Even for a beginner, it's hard to see how anyone who decides they need this book hasn't already been exposed to some of the material here. In particular, does anyone really need yet another introduction to regular expressions? The treatment here isn't bad, it's just that this material is already covered in many introductory programming books (especially those that cover scripting languages like Perl and Python). As this takes up nearly 20% of the book, and there are fewer than 200 pages, it's a bit of a waste. Personally, I would have preferred more discussion of the less well-treated subjects, some of which are too sparsely described, but this would have detracted from the book's main aim.
This would be suitable for a beginner Pythonista, who for some reason didn't want the bulk of the likes of Python Cookbook. Otherwise, if you feel that some Pragmatic Programmers books can be rather lightweight and somewhat overpriced, this will not change your mind.