- Paperback: 400 pages
- Publisher: O'Reilly Media; 1 edition (July 20, 2015)
- Language: English
- ISBN-10: 1491900083
- ISBN-13: 978-1491900086
- Product Dimensions: 7 x 0.9 x 9.2 inches
- Shipping Weight: 1.6 pounds
- Average Customer Review: 9 customer reviews
- Amazon Best Sellers Rank: #131,702 in Books
Hadoop Application Architectures: Designing Real-World Big Data Applications 1st Edition
About the Author
Mark is a committer on Apache Bigtop, a committer and PMC member on Apache Sentry (incubating), and a contributor to the Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume projects. He is also a section author of O’Reilly’s book on Apache Hive, Programming Hive.
Ted is a Senior Solutions Architect at Cloudera helping clients be successful with Hadoop and the Hadoop ecosystem. Previously, he was a Lead Architect at the Financial Industry Regulatory Authority (FINRA), helping build out a number of solutions from web applications and Service Oriented Architectures to big data applications. He has also contributed code to Apache Flume, Apache Avro, YARN, and Apache Pig.
Jonathan is a Solutions Architect at Cloudera working with partners to integrate their solutions with Cloudera’s software stack. Previously, he was a technical lead on the big data team at Orbitz Worldwide, helping to manage the Hadoop clusters for one of the most heavily trafficked sites on the internet. He's also a co-founder of the Chicago Hadoop User Group and Chicago Big Data, technical editor for Hadoop in Practice, and has spoken at a number of industry conferences on Hadoop and big data.
Gwen is a Solutions Architect at Cloudera. She has 15 years of experience working with customers to design scalable data architectures. Formerly a senior consultant at Pythian, Oracle ACE Director and board member at NoCOUG. Gwen is a frequent speaker at industry conferences and maintains a popular blog.
Top customer reviews
I have written a detailed chapter-by-chapter review of this book on [...]; the first and last parts of this review are given here. For my review of all chapters, search i-programmer DOT info for STIRK together with the book's title.
This book aims to provide best practices and example architectures for Hadoop technologists. How does it fare?
This book is written for developers and architects that are already familiar with Hadoop, who wish to learn some of the current best practices, example architectures and complete implementations. It assumes some existing knowledge of Hadoop and its components (e.g. Flume, HBase, Pig, and Hive). Book references are provided for those needing topic refreshers. Additionally, it’s assumed you are familiar with Java programming, SQL and relational databases. It consists of two sections, the first of which has seven chapters and looks at factors that influence application architectures. The second consists of three chapters, each providing a complete end-to-end case study.
Below is a chapter-by-chapter exploration of the topics covered.
Section I Architectural Considerations for Hadoop Applications
Chapter 1 Data Modeling in Hadoop
The chapter opens with a look at storage considerations. Various file types are discussed, and the importance of splittable compressed data is highlighted. Avro and Parquet are generally the preferred file formats for row-based and columnar storage respectively.
The chapter continues with a look at factors to consider when storing data in HDFS. Directory structures are recommended (e.g. /users/<username>). If you know what tools you intend to use to process the data (e.g. Hive), you can take advantage of partitioning (reduces IO), bucketing (improves the performance of joins), and denormalization (eliminates the need for joining data).
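To make the partitioning and bucketing ideas a little more concrete, here is a minimal sketch in plain Python. It is not code from the book: the /data base directory, the clicks table name, and the use of crc32 as the bucket hash are all assumptions chosen for illustration (Hive uses its own hash function). It shows how a date-partitioned directory layout lets a query skip directories it does not need, and how a key is assigned to a fixed number of buckets.

```python
import zlib


def partition_path(base, table, year, month, day):
    """Build a Hive-style partition directory, e.g. .../year=2015/month=07/day=20."""
    return f"{base}/{table}/year={year:04d}/month={month:02d}/day={day:02d}"


def bucket_for(key, num_buckets):
    """Assign a join key to a bucket. crc32 is a stand-in, not Hive's actual hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def prune(paths, year, month):
    """Keep only the partition directories a query on (year, month) has to read."""
    wanted = f"/year={year:04d}/month={month:02d}/"
    return [p for p in paths if wanted in p]


# Hypothetical layout: two days of data in each of June and July 2015.
paths = [
    partition_path("/data", "clicks", 2015, month, day)
    for month in (6, 7)
    for day in (1, 2)
]

# A query restricted to July only needs to touch half of the directories.
july = prune(paths, 2015, 7)
print(july)
```

The same key always lands in the same bucket, which is what makes it possible to join two tables bucketed on that key without comparing every bucket against every other.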
Factors to consider when storing data in HBase are discussed next. HBase is a NoSQL database, often thought of as a huge distributed hash table. This key-value store is optimized for fast lookups, and is especially suitable for problems having relatively few get and put requests. HBase tables can have millions of columns and billions of rows. Important considerations for choosing the row key are discussed. Other aspects of HBase covered include: use of timestamps, hops, tables and regions, and the use of column families.
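This summary does not spell out the row key considerations, so the following is only a sketch of one commonly used approach, not necessarily the book's recommendation. HBase stores rows sorted by row key, so the key layout affects both how the load spreads and the order in which rows come back from a scan. The bucket count, the timestamp bound, and the key layout below are hypothetical values chosen for illustration.

```python
import hashlib

NUM_SALT_BUCKETS = 16      # hypothetical; chosen for illustration
MAX_TS = 10**13 - 1        # hypothetical upper bound on millisecond timestamps


def row_key(user_id: str, ts_millis: int) -> str:
    """Compose a row key of the form salt|user_id|reversed_timestamp.

    The salt is derived from the user id so that different users are spread
    across key ranges, and the reversed timestamp makes newer events for the
    same user sort before older ones.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    salt = int(digest, 16) % NUM_SALT_BUCKETS
    reversed_ts = MAX_TS - ts_millis
    return f"{salt:02d}|{user_id}|{reversed_ts:013d}"


older = row_key("u1", 1000)
newer = row_key("u1", 2000)
print(sorted([older, newer]))
```

Because the salt depends only on the user id, all events for one user stay adjacent, and a scan over that user's prefix returns the most recent event first.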
The chapter ends with a look at metadata, describing what metadata is, and why it’s important. The importance of the Hive metastore and its reuse by other tools is discussed.
This chapter provides a useful discussion of features to consider in data modeling. Some sections seem wordy, but probably need to be so. Some useful recommendations are given (e.g. use the Avro file format), together with supporting reasons.
From its start, it’s clear this is not a book for beginners. The chapter is well written, has useful explanations, discussions, diagrams, references, links to other chapters, and considered recommendations. A useful chapter conclusion is provided. These features apply to the whole book.
This book aims to provide current Hadoop best practices, example architectures and complete implementations, and it succeeds in each area.
The book is well written, providing good explanations, examples, walkthroughs, and diagrams. Useful links are given between chapters, and there’s a valuable conclusion at the end of each chapter. The order of the chapters is helpful in understanding the flow of topics. This is not a book for beginners, but does contain useful references to books to get you up to speed.
In many ways, this book follows on naturally from “Hadoop: The Definitive Guide”, which I recently reviewed. It provides practical discussions of the many factors to consider when presented with common Hadoop architectural concerns (e.g. whether to use HDFS or HBase?). The book offers recommendations, and provides supporting information that backs these up.
The book doesn’t cover all Hadoop technologies (e.g. it omits Machine Learning), but it does cover many popular ones. Some of the books referenced are getting old and some chapters have footnotes at the end, which would be better placed on the pages where they are referenced.
Hadoop is changing rapidly. This book suggests the near future will see a decline in MapReduce processing and a rise in processing using Spark. Similarly, at a higher level of abstraction, SQL in its various flavours also appears to be in the ascendancy.
If you want to know the current state of Hadoop and its components, want a practical discussion of the pros and cons for using various tools, and want solutions to common problems, I can highly recommend this book.
The book is very well and clearly organized, and proceeds very logically through Hadoop storage options, how to ingest data into a Hadoop environment, how to choose and use processing engines for Hadoop such as MapReduce, Spark, Hive, etc., and how to utilize those engines to do important and critical tasks such as record deduplication, windowing analysis, and time series modification. The exposition of these fundamental building blocks is followed by graph processing on Hadoop, where both Giraph and Spark GraphX are described and contrasted. Then the topic of orchestration of Hadoop workflows is described to an extent, mainly showing how to configure and use Oozie. Part I finishes by describing Near-Realtime processing in Hadoop, and shows how Storm, Trident and Spark Streaming can be used for satisfying different requirements.
The second part of the book is dedicated to real-world use cases such as Clickstream Analytics, Fraud Detection, and Data Warehousing. The authors provide a good and broad overview for each case, clearly showing where and how the Hadoop software stack helps, together with architectural recommendations. I think the final use case, the Data Warehousing chapter, is the most interesting one because it makes use of a very popular, publicly available movie data set known as MovieLens. Thanks to this, it is very easy to follow this chapter by using the same data and applying the designs and programming steps, creating your own customizations and investigating different scenarios and technical challenges you can come up with.
As a conclusion, I can recommend this book to big data architects and software engineers who are not total novices when it comes to Hadoop. The book is of course a bit dated: in the very fast-moving world of big data, 2015 already sounds like the distant past. But thanks to the extensive industrial and practical experience of the authors, the way they explain their thinking and justifications for very different scenarios sheds light on current and upcoming challenges for many big data engineers.