Learning Spark: Lightning-Fast Data Analytics 2nd Edition
Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.
Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:
- Learn the high-level Structured APIs in Python, SQL, Scala, or Java
- Understand Spark operations and the SQL engine
- Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
- Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow
- ISBN-10: 1492050040
- ISBN-13: 978-1492050049
- Edition: 2nd
- Publisher: O'Reilly Media
- Publication date: August 25, 2020
- Language: English
- Dimensions: 7.25 x 1 x 9.25 inches
- Print length: 397 pages
From the brand
Sharing the knowledge of experts
O'Reilly's mission is to change the world by sharing the knowledge of innovators. For over 40 years, we've inspired companies and individuals to do new things (and do them better) by providing the skills and understanding that are necessary for success.
Our customers are hungry to build the innovations that propel the world forward. And we help them do just that.
From the Publisher
From the Preface
Who This Book Is For
Most developers who grapple with big data are data engineers, data scientists, or machine learning engineers. This book is aimed at those professionals who are looking to use Spark to scale their applications to handle massive amounts of data.
In particular, data engineers will learn how to use Spark’s Structured APIs to perform complex data exploration and analysis on both batch and streaming data; use Spark SQL for interactive queries; use Spark’s built-in and external data sources to read, refine, and write data in different file formats as part of their extract, transform, and load (ETL) tasks; and build reliable data lakes with Spark and the open source Delta Lake table format.
For data scientists and machine learning engineers, Spark’s MLlib library offers many common algorithms to build distributed machine learning models. We will cover how to build pipelines with MLlib, best practices for distributed machine learning, how to use Spark to scale single-node models, and how to manage and deploy these models using the open source library MLflow.
While the book is focused on learning Spark as an analytical engine for diverse workloads, we will not cover all of the languages that Spark supports. Most of the examples in the chapters are written in Scala, Python, and SQL. Where necessary, we have infused a bit of Java. For those interested in learning Spark with R, we recommend Javier Luraschi, Kevin Kuo, and Edgar Ruiz’s Mastering Spark with R (O’Reilly).
Finally, because Spark is a distributed engine, building an understanding of Spark application concepts is critical. We will guide you through how your Spark application interacts with Spark’s distributed components and how this is decomposed into parallel tasks on a cluster. We will also cover which deployment modes are supported and in what environments.
While there are many topics we have chosen to cover, there are a few that we have opted to not focus on. These include the older low-level Resilient Distributed Dataset (RDD) APIs and GraphX, Spark’s API for graphs and graph-parallel computation. Nor have we covered advanced topics such as how to extend Spark’s Catalyst optimizer to implement your own operations, how to implement your own catalog, or how to write your own DataSource V2 data sinks and sources. Though part of Spark, these are beyond the scope of your first book on learning Spark.
Instead, we have focused and organized the book around Spark’s Structured APIs, across all its components, and how you can use Spark to process structured data at scale to perform your data engineering or data science tasks.
Editorial Reviews
About the Author
Brooke Wenig is a machine learning practice lead at Databricks. She leads a team of data scientists who develop large-scale machine learning pipelines for customers and teaches courses on distributed machine learning best practices. Previously, she was a principal data science consultant at Databricks. She holds an M.S. in computer science from UCLA with a focus on distributed machine learning.
Tathagata Das is a staff software engineer at Databricks, an Apache Spark committer, and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and is currently one of the core developers of Structured Streaming and Delta Lake. Tathagata holds an M.S. in computer science from UC Berkeley.
Denny Lee is a staff developer advocate at Databricks who has been working with Apache Spark since 0.6. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has an M.S. in biomedical informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers.
Product details
- Publisher : O'Reilly Media; 2nd edition (August 25, 2020)
- Language : English
- Paperback : 397 pages
- ISBN-10 : 1492050040
- ISBN-13 : 978-1492050049
- Item Weight : 1.4 pounds
- Dimensions : 7.25 x 1 x 9.25 inches
- Best Sellers Rank: #68,870 in Books (See Top 100 in Books)
- #12 in Mathematical Analysis (Books)
- #35 in Data Processing
- #111 in Software Development (Books)
Customer reviews
Top reviews from the United States
By Arturo Amador Cruz on July 25, 2022
I used the book as an extra study resource when taking some Databricks certifications. It was a great addition to my study materials.
By JA on September 3, 2020
I was able to follow along in this book fairly easily. Working on a MacBook, I did have to first install Scala, download Spark, enable Spark in IntelliJ, etc. I didn't have trouble with this as it was fairly straightforward. With my environment set up, I found the book presents every code sample in Scala and Python. I worked through the code samples, chapter by chapter, writing Scala in IntelliJ or sometimes writing Scala in the Spark CLI itself.
I did take a detour from the book slightly to learn a bit more about sbt, which is the Scala build tool.
For a beginner such as myself, this book is a godsend, but I do wish the authors had approached some things differently.
In my opinion, some topics are covered in a very "hand-wavy" manner. For example, Chapter 4 discusses managed vs. unmanaged tables. While knowing this difference exists is helpful, the authors never discuss when you should use a managed table versus an unmanaged one. They could have included that information or pointed the reader to an external source. This part of Chapter 4 then shows sample code for creating a managed table from a CSV file, but it's not clear what I should do with that information. What patterns apply to a managed table vs. an unmanaged table? What are the trade-offs? Since this is a beginner book, I feel even a single extra page here would have added significant value to this section.
Sometimes the book shares an interesting tidbit using terminology or concepts that the authors haven't really described. I found this very frustrating. For example:
> (Chapter 4, page 92) ... you can create multiple SparkSessions within a single Spark application—this can be handy, for example, in cases where you want to access (and combine) data from two different SparkSessions that don’t share the same Hive metastore configurations.
If you search for mentions of Hive, you'll see the authors briefly mention that Spark uses a Hive metastore to persist table metadata. So are the authors saying I can use one Spark installation and access table metadata from different Hive metastores? Why would I ever want to access only the metadata for different tables? Again, the use case isn't clear.
As a beginner, I found this book very valuable, and I believe it is a great investment.
Top reviews from other countries
Reviewed in the United Kingdom on February 9, 2022
I haven't read the chapter on streaming or the two chapters on machine learning, as they aren't applicable to me, but everything else has been just what I needed. Well done to the authors for putting together such an amazing guide.
If you want to see the contents of the different chapters, I've added them as photos for your convenience.
I recommend it.
I have the Kindle edition and noticed that the formulas on one of the pages on machine learning were slightly cut off at the edges, but I won't remove a star for that. In my view there is plenty of material online for understanding those regression formulas. What really worked for me is how great a job the authors have done in explaining how to use Spark 3.0.
Since I am a Python and SQL user, this book really benefits me at work. The syntax and function explanations are very clear, and with an online Databricks account you can really practice as you learn with an uncomplicated dataset. How to program with the DataFrame API is really well covered.