- Paperback: 358 pages
- Publisher: O'Reilly Media; 1st edition (June 16, 2017)
- Language: English
- ISBN-10: 1491943203
- ISBN-13: 978-1491943205
- Product Dimensions: 7 x 0.8 x 9.2 inches
- Shipping Weight: 1.8 pounds
- Average Customer Review: 11 customer reviews
- Amazon Best Sellers Rank: #56,518 in Books
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark, 1st Edition
From the Publisher
Best practices for scaling and optimizing Apache Spark
About the Author
Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Apache Spark and holds office hours at coffee shops at home and abroad. She is a Spark committer with frequent contributions, specializing in PySpark and Machine Learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.
Rachel Warren is a data scientist and software engineer at Alpine Data Labs, where she uses Spark to address real world data processing challenges. She has experience working as an analyst both in industry and academia. She graduated with a degree in Computer Science from Wesleyan University in Connecticut.
Top customer reviews
Much of the book is written with a focus on performance. There's some discussion of statistical concepts, but the book is clearly aimed at helping the reader use Spark in a resource-efficient manner (which makes a lot of sense, given that Spark comes into play when you're tackling large data sets).
Virtually all of the code examples are written in Scala. When I began reading, my Scala abilities were fairly limited, but the authors do such a good job of walking through and commenting on the code that I now feel much stronger in Scala as well. They do have a chapter on using Python and Java (and other JVM languages), but most of the book is presented through Scala.
My one complaint about this book is that it's a bit heavy on the code. It's possible that it's necessary, but I ended up skimming most of the coding examples, and it made for some tedious reading at times. Then again, there were several examples that I scrutinized closely, and having thorough examples did help me learn quite a bit of Scala.
The book is written assuming the reader has some experience working with Spark or another streaming data processing engine. Novice users may find themselves overwhelmed by the advanced concepts introduced in the book.
The book begins with an introduction to Spark and Scala, the building blocks of high-speed data processing. This introduction examines the trade-offs that make Scala a better choice than Python or Java for streaming data processing. The authors then provide their perspective on how Spark fits into the big data ecosystem.
The next chapter focuses on concepts familiar to all data scientists – an overview of Datasets, DataFrames, and Spark's twist on structured query language. Later chapters extend this foundational knowledge with advanced concepts such as Resilient Distributed Datasets (RDDs), SQL joins, data transformations, and machine learning. As with all O'Reilly books, after introducing each concept the authors provide code snippets so we can practice on our own.
I especially enjoyed the sections on writing test cases and troubleshooting, and the discussion of some of the more common exceptions.
This book will definitely stay on my office shelf. I highly recommend it to other data scientists and data engineers.
Most people in (and out) of IT will never have any contact with Spark. I need to know about it only because my job involves having at least a superficial knowledge of every significant aspect of IT.
This book presumes you are already conversant with Apache Spark and need no education or hand-holding in that regard.
Rather this book’s goal is to help the reader make their Spark queries “faster, able to handle larger data sizes, and use fewer resources”. Being able to at least read Scala is highly recommended.
The entire book is loaded with detailed examples. For the casual reader, such as myself, lacking a Spark environment to play in, there is an empty feeling – you can read the examples, study them, but not run them.
Having read dozens of programming cookbooks over the course of my career, I can say this one feels right – but without being able to run the examples, that's just an assumption. It does, however, make me wish I had some huge datasets to work on. Maybe I can get a job with the NSA? I bet there are a lot of Spark experts there.
This book, however, dedicates several chapters to explaining it in detail, helping the reader understand the internals and the performance implications.