Data-Intensive Text Processing with MapReduce (Synthesis Lectures on Human Language Technologies, 7): Lin, Jimmy, Dyer, Chris, Hirst, Graeme: 9781608453429
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader "think in MapReduce", but also discusses limitations of the programming model. Table of Contents: Introduction / MapReduce Basics / MapReduce Algorithm Design / Inverted Indexing for Text Retrieval / Graph Algorithms / EM Algorithms for Text Processing / Closing Remarks
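As a rough illustration of the programming model described above, the following is a minimal single-process sketch of word counting in the map/reduce style. It is not taken from the book and is not Hadoop code: the names `map_word_count`, `reduce_word_count`, and `run_mapreduce` are hypothetical, and the in-memory grouping step only stands in for the shuffle-and-sort that a real execution framework performs across a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Mapper (hypothetical name): emit (word, 1) for every token in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reducer (hypothetical name): sum all partial counts for one word."""
    yield word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Iterator[Tuple[str, int]]],
) -> Dict[str, int]:
    """Toy driver (an assumption, not a real framework API).

    Runs the mapper over every input record, groups the intermediate
    values by key in memory, then runs the reducer once per key.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    results: Dict[str, int] = {}
    for key in sorted(groups):  # reducers see keys in sorted order
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


docs = [("d1", "to be or not to be"), ("d2", "to think in MapReduce")]
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(counts)
```

The mapper and reducer never refer to each other or to how the data is distributed; that separation is what would let a framework handle scheduling and fault tolerance on their behalf.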
I bought this book for a project at work, to prototype a log analysis system using Hadoop. I haven't bought very many technical books in the last few years, but the quality of most online documentation for Hadoop is poor and books seemed like a better option. This book is a good intro to the Map Reduce algorithm, but once I got my head around that I didn't use it very much (probably because my problem space was nothing like the sort of text processing that the book focuses on).
The book is educational and, hopefully, will be helpful for writing map-reduce programs. It concentrates not on the API, but on algorithms, which is rare and should be appreciated. Text processing is a good example of data-intensive processing, but the book may be useful in many other fields.
So far I have studied up to chapter 3. This book is very helpful for gaining good insight into the mapper, reducer, combiner, and partitioner. I highlighted almost every line in chapters 2 and 3 as an important point.
This (fairly short - 150 pages) book presents a collection of techniques and design patterns for map reduce, focusing on text processing (for which, read index construction for information retrieval). The text is well organised, and the algorithms are presented in clear and concise pseudo-code.
Good points: suitable as an introduction to map reduce algorithm design (especially for document indexing), unencumbered by the sorts of implementation details that tend to be in-your-face in books on Hadoop.
Potential issues: the strength of the book in avoiding implementation details means that once you have read this, you are not quite ready to put the lessons into practice. In addition, the focus on indexing means that some of the later chapters cover topics that seem to be essential reading for the few rather than the many.
This book is compact and intense but is an insightful and powerful demonstration as to how a problem may be decomposed to fit the MapReduce paradigm. Equally important, it describes the types of problem that are not suited to decomposition as MapReduce Jobs. It covers in detail the use of MapReduce in text indexing, graph algorithms, and expectation maximization, but the techniques described could easily be applied to a wide range of applications. I was able to turn the pseudo code snippets, together with Hadoop: The Definitive Guide, into working examples in a relatively short space of time.
I just bought this book in order to review some of the map reduce algorithms, especially those related to text processing. This book is a good reference if you're planning to do map reduce programming for text analytics.