Text Processing with Ruby: Extract Value from the Data That Surrounds You 1st Edition
From the Publisher
Top Five Text Processing Tips by Rob Miller, author of Text Processing with Ruby
Clean up your data first
Data in the real world is messy. It almost always pays off to take some time to normalize different sources of data and to get them into the same format before you begin whatever actual processing you need to do. You’ll have fewer exceptions and special cases in your code, and it’ll be a lot more resilient.
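As a rough illustration of this tip, the sketch below normalizes names and dates that arrive in different shapes before any real processing happens. The field names, the sample values, and the two date formats are assumptions made for the example, not something taken from the book.

```ruby
require "date"

# Collapse stray whitespace and inconsistent capitalization in a name.
def normalize_name(name)
  name.strip.split(/\s+/).map(&:capitalize).join(" ")
end

# Try each date format we assume the sources use; return nil if none match.
def normalize_date(text)
  ["%Y-%m-%d", "%d/%m/%Y"].each do |format|
    begin
      return Date.strptime(text, format)
    rescue ArgumentError
      next
    end
  end
  nil
end

# Bring one hypothetical record into a single, consistent shape.
def normalize_record(record)
  { name: normalize_name(record[:name]), joined: normalize_date(record[:joined]) }
end

records = [
  { name: "  Alice SMITH ", joined: "2015-03-01" },
  { name: "bob jones",      joined: "01/04/2015" } # assumed to be day/month/year
]
cleaned = records.map { |record| normalize_record(record) }
```

Once every record has the same shape, later steps can treat `cleaned` uniformly instead of branching on where each record came from.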
Master regular expressions
There are definitely some text processing problems that can’t be solved with regular expressions, but not that many. While they’re not always the best or most readable option, knowing regular expressions well will get you out of many tight spots, and even more often than that will be the first step towards a more robust solution.
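To make this concrete, here is a small sketch that uses named captures to pull fields out of a web server log line. The log format and the `parse_log_line` helper are assumptions chosen for illustration; a real log may need a stricter pattern.

```ruby
# Extract the client address, HTTP verb, and path from a log line.
# Returns nil when the line does not look like a request entry.
def parse_log_line(line)
  pattern = /\A(?<ip>\S+) .*?"(?<verb>[A-Z]+) (?<path>\S+)/
  match = pattern.match(line)
  return nil unless match

  { ip: match[:ip], verb: match[:verb], path: match[:path] }
end

line = '127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(line)
```

Named captures keep the pattern readable: the code refers to `match[:path]` rather than a positional group number that changes whenever the pattern does.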
Break your problem into discrete steps
Almost all text processing tasks, no matter how complicated they seem on the face of it, are really a series of small transformations. Figuring out how to frame your problem in this way will make it easy to take a pipeline approach, where your text flows through a series of small, discrete steps, each of which transforms the data in a particular way and then passes it on. Such programs are both easier to reason about and easier to modify and extend.
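One way to sketch the pipeline idea in Ruby is to write each step as a small method and fold the data through them in order. The step names and the word-counting task are assumptions picked for the example.

```ruby
# Step 1: drop lines that contain only whitespace.
def strip_blank_lines(lines)
  lines.reject { |line| line.strip.empty? }
end

# Step 2: lowercase every line so counting is case-insensitive.
def downcase_lines(lines)
  lines.map(&:downcase)
end

# Step 3: split lines into words and count how often each appears.
def count_words(lines)
  lines.flat_map { |line| line.scan(/[a-z']+/) }
       .each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

# Pass the text through each step in turn; the output of one is the input of the next.
def run_pipeline(text)
  steps = [method(:strip_blank_lines), method(:downcase_lines), method(:count_words)]
  steps.reduce(text.lines) { |data, step| step.call(data) }
end

counts = run_pipeline("The cat\n\nthe hat\n")
```

Because each step only knows about its own input and output, adding, removing, or reordering a transformation means editing the `steps` array rather than rewriting the program.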
Figure out a strategy for missing data
Data in the real world, as well as being messy, also frequently has gaps. Decide early on how you’re going to cope with that — how you’ll represent the absence of particular fields or properties — and you’ll avoid messiness later on.
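A minimal sketch of one such strategy: pick a single representation for "missing" (here `nil`) and convert every placeholder to it as early as possible. The list of placeholder strings and the helper names are assumptions for illustration.

```ruby
# Treat empty strings and a few common placeholders as missing.
# Returns the trimmed text, or nil when the value counts as absent.
def blank_to_nil(value)
  text = value.to_s.strip
  return nil if text.empty? || %w[n/a na null -].include?(text.downcase)

  text
end

# Apply the same rule to every field in a row (a Hash of field name to raw value).
def clean_row(row)
  row.transform_values { |value| blank_to_nil(value) }
end

row = clean_row("name" => "Ada", "email" => "N/A", "phone" => "  ")
```

With gaps represented in one consistent way, later code can check for `nil` in a single place instead of guessing which of several spellings of "unknown" a source happened to use.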
Make the most of existing tools
There are hundreds of command-line tools that exist solely to process textual data. Each of them is capable of performing a particular transformation, which means you don’t need to reinvent the wheel. If you use existing tools for the parts of your problem that have already been solved, all that remains is to solve the unique problem that you have.
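As a hedged example of leaning on existing tools, the sketch below hands the counting of duplicate lines to the standard `sort` and `uniq -c` commands and only parses their output in Ruby. It assumes a Unix-like system with both commands on the PATH; the helper name is made up for the example.

```ruby
require "open3"

# Count duplicate lines by delegating to `sort | uniq -c`.
# Returns a Hash mapping each distinct line to the number of times it appeared.
def count_lines_with_coreutils(lines)
  input = lines.map { |line| "#{line}\n" }.join
  output, status = Open3.capture2("sort | uniq -c", stdin_data: input)
  raise "sort | uniq failed" unless status.success?

  # Each output line looks like "   2 value": a padded count, a space, then the text.
  output.lines.each_with_object({}) do |line, counts|
    count, value = line.strip.split(" ", 2)
    counts[value] = count.to_i
  end
end

counts = count_lines_with_coreutils(%w[b a b])
```

Ruby is left with only the part that is specific to the problem, turning the tool's output into a data structure.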
Author, Confident Ruby; Head Chef, RubyTapas.com
"This is a fun, readable, and very useful book. I'd recommend it to anyone who needs to deal with text -- which is probably everyone."
Developer, maintainer of text gem
"While Ruby has become established as a Web development language, thanks to Rails, it's an excellent language for working with text as well. Text Processing with Ruby covers the nuts and bolts of what I believe is a natural domain for Ruby, all the way from bringing text into the environment via files, the Web, and other means through to parsing what it says and sending it back out again."
Editor of Ruby Weekly
"The biggest selling point of this book is that I can apply it right away -- I am literally using the things I've learned at work today. Perfect for the beginner to intermediate Rubyist, or any programmer who wants some standout techniques for handling text whatever language they're using."
"A lot of people get into Ruby via Rails. This book is really well suited to anyone who knows Rails, but wants to know more Ruby."
Director, Studio Nelstrom, and author of Practical Vim
About the Author
I like that Rob carefully places Ruby in relation to UNIX coreutils and demonstrates many organic CLI workflows, with each tool used appropriately. I also appreciate, due to the very lightweight and readable Ruby syntax, the gentle introduction to parsers and NLP. This could provide the newcomer a conceptual foundation before venturing into more industrial strength tools (in Scala, Java, Python, Go, what-have-you).
Rob is a talented writer and I look forward to more from him. One "star" subtracted due to a formatting snafu in my edition that is not representative of the normal high quality of this publisher.
Part 1, Acquiring Text, starts with the basics: reading from files or from standard input. It quickly moves on to extracting data from CSV files and scraping data from HTML files using the Nokogiri library.
Part 2, Modifying and Manipulating Text, opens with an introduction to regular expressions (does every programming book have a chapter on regex?). Then it gets really meaty with a chapter on writing parsers, and another on natural language processing. I particularly enjoyed the section where the author demonstrates how to use the Parslet library to parse a Rich Text Format file.
Part 3, Writing Text, starts again with the basics: writing to standard output, standard error, and to a file. Then it goes on to discuss serialising data to JSON, XML, or CSV formats. The last chapter uses ERB to render templates into text files. Anyone who knows Rails will be familiar with the ERB templating language, but I found it refreshing to see it used outside of a Rails context.
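For readers who have only met ERB inside Rails, a minimal standalone sketch might look like the following. It uses only the `erb` standard library; the template text and the `render_report` helper are my own illustration, not an example from the book.

```ruby
require "erb"

# Render a plain-text report from a title and a list of items.
# trim_mode "-" lets `<%-` and `-%>` suppress the whitespace around control lines.
def render_report(title, items)
  template = ERB.new(<<~TEMPLATE, trim_mode: "-")
    <%= title %>
    <%- items.each do |item| -%>
    * <%= item %>
    <%- end -%>
  TEMPLATE
  template.result(binding)
end

report = render_report("Shopping", %w[eggs milk])
File.write("report.txt", report) if ENV["WRITE_REPORT"] # writing to disk is optional here
```

The template sees `title` and `items` because `binding` exposes the method's local variables to ERB.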
If you want to learn Ruby as your first programming language, this is not the first book that you should read on Ruby, but it would be a good choice as your second book. Or if you already know how to program and you want to add Ruby to your repertoire, then this would be a great place to start. I've been working with Ruby for 8 years and I picked up lots of new stuff. I wish I could have read this book years ago!
And if you ask where you could use text processing, I would say website scraping, for example. You'll find the tools you need for that task described in this book.