- Paperback: 358 pages
- Publisher: O'Reilly Media; 1 edition (June 16, 2017)
- Language: English
- ISBN-10: 1491943203
- ISBN-13: 978-1491943205
- ASIN: 1491943203
- Product Dimensions: 7 x 0.7 x 9.2 inches
- Shipping Weight: 1.8 pounds
- Customer Reviews: 27 customer ratings
- Amazon Best Sellers Rank: #433,851 in Books
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark 1st Edition
About the Author
Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Apache Spark and holds office hours at coffee shops at home and abroad. She is a Spark committer with frequent contributions, specializing in PySpark and Machine Learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.
Rachel Warren is a data scientist and software engineer at Alpine Data Labs, where she uses Spark to address real world data processing challenges. She has experience working as an analyst both in industry and academia. She graduated with a degree in Computer Science from Wesleyan University in Connecticut.
This book is the second of three related books that I've had the chance to work through over the past few months, in the following order: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). If you are new to Apache Spark, these three texts will help point you in the right direction, although keep in mind that the related tech stack is evolving, so you will need to supplement this material with web documentation and developer forums, and get hands-on with the tooling. Reading these books in reverse order of publication date exposed me to the more current material first, but this was largely a coincidence.
Keep in mind one of the initial assertions of the authors that "this book was created using the Spark 2.0.1 APIs, but much of the code will work in earlier versions of Spark as well. In places where this is not the case we have attempted to call that out". As Spark 2.4.0 was just released in November 2018, you will find that some of the material provided here is either outdated or treated as commonplace in the newer "Spark: The Definitive Guide". The unfortunate dilemma is that a book specifically focusing on Spark performance simply isn't available outside of what the authors provide here, so you will need to account for differences across versions, especially in the several instances where the authors provide workarounds that they warn are unlikely to remain viable in the long term.
Unlike "Spark: The Definitive Guide", which provides Python, Scala, and Spark SQL code, readers should be aware that the bulk of code provided in this book is Scala, "simply in the interest of time and space", because "it is the belief of the authors that 'serious' performant Spark development is most easily achieved in Scala", and while "these reasons are very specific to using Spark with Scala, there are many more general arguments for (and against) Scala's applications in other contexts." As the authors further state their case, they provide tips for learning Scala alongside additional arguments for picking up the language: "to be a Spark expert you have to learn a little Scala anyway", "the Spark Scala API is easier to use than the Java API", and "Scala is more performant than Python."
This densely written book of slightly over 300 pages in length is broken down into 10 chapters and an appendix: (1) "Introduction to High Performance Spark", (2) "How Spark Works", (3) "DataFrames, Datasets, and Spark SQL", (4) "Joins (SQL and Core)", (5) "Effective Transformations", (6) "Working with Key/Value Data", (7) "Going Beyond Scala", (8) "Testing and Validation", (9) "Spark MLlib and ML", (10) "Spark Components and Packages", and an appendix on "Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist". While the chapters aren't grouped into broader sections, chapters 1 and 2 are essentially an introduction, and chapters 3, 4, 5, 6, and 8 provide the bulk of the content (chapter 8 should likely sit directly alongside the other 4 chapters, as testing and validation are a natural follow-up to much of what is discussed). As far as the remaining 3 chapters are concerned, while chapter 7 would likely provide value as a last chapter, chapters 9 and 10 seem a bit misplaced, with chapter 10 seemingly better suited for an appendix alongside the one appendix provided.
The diagrams in chapters 3 through 6 are especially well done, and supplement the discussions very well. Standouts include the diagram in chapter 3 on Spark SQL windowing (which personally helped supplement the cursory explanation in "Spark: The Definitive Guide"), the diagrams in chapter 4 on joins, the diagrams in chapter 5 on narrow versus wide dependencies between partitions and caching versus checkpointing, and the diagrams in chapter 6 on groupByKey (although I found one of several errors here) and sortByKey. While the diagrams in chapters 1 and 2 are beneficial, these can be largely found in the documentation (perhaps with the exception of the diagrams provided in the section entitled "The Anatomy of a Spark Job").
The appendix is beneficial to the point that it could likely have been expanded and included in the body of the text, possibly following the introductory chapters, because the discussion here is all about what one can do outside one's application code (application code being what the bulk of this book is essentially about). Topics covered here are broken down into sections on "Spark Tuning and Cluster Sizing", "Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?", "Serialization Options", and "Some Additional Debugging Techniques". This is a highly recommended text for anyone looking to broaden their understanding of the hows and whys behind optimizing Spark.
Much of the book is written with a focus on performance. There's some discussion of statistical concepts, but the book is clearly aimed at helping the reader use Spark in a resource-efficient manner (which makes a lot of sense, given that Spark comes into play when you're tackling large data sets).
Virtually all of the code examples are written in Scala. When I began reading, my Scala abilities were fairly limited, but the authors do a good job of parsing and commenting on the code such that I now feel much stronger in Scala, as well. They do have a chapter that discusses using Python and Java (including JVM), but most of the book is presented through Scala.
My one complaint about this book is that it's a bit heavy on the code. It's possible that it's necessary, but I ended up skimming most of the coding examples, and it made for some tedious reading at times. Then again, there were several examples that I scrutinized closely, and having thorough examples did help me learn quite a bit of Scala.
Top international reviews
For beginner Spark users, the book may feel overwhelming, particularly as it focuses on Spark RDDs rather than the more widely used Spark SQL API. I would highly recommend Chambers and Zaharia's Spark: The Definitive Guide as an alternative purchase, as it is both more comprehensive and easier to understand. For those hoping to learn Scala or Spark with Scala, this book also probably dives in way too fast, and I would recommend Chiusano and Bjarnason's excellent Functional Programming in Scala (although quite hard) and Alexander's Functional Programming Simplified.
On the positive side, the chapter on Key/Value data, although it covers material that is perhaps fairly widely known, was both well explained and clarifying, as was some of the information about how to make transformations more effective.
Some of the code examples are very difficult to read. On top of this, huge chunks of the book 'build upon' old examples, but this just ends up being a complete refactor of the old examples to improve them. Therefore this book can't be used as a handbook without reading it through first. Code examples should have been small and distinct.
Despite these complaints this is a truly fantastic guide, full of straight answers that are difficult or impossible to find online via trial and error.
The text also references Spark UI screenshots that are unreadable, and coloured lines in diagrams that are printed in black and white.