- Paperback: 280 pages
- Publisher: O'Reilly Media; 2nd edition (July 6, 2017)
- Language: English
- ISBN-10: 1491972955
- ISBN-13: 978-1491972953
- Product Dimensions: 7 x 0.5 x 9 inches
- Shipping Weight: 1.2 pounds
- Average Customer Review: 9 customer reviews
- Amazon Best Sellers Rank: #197,228 in Books
Advanced Analytics with Spark: Patterns for Learning from Data at Scale 2nd Edition
From the Publisher
From the Preface
What’s in This Book
The first chapter will place Spark within the wider context of data science and big data analytics. After that, each chapter will comprise a self-contained analysis using Spark. The second chapter will introduce the basics of data processing in Spark and Scala through a use case in data cleansing. The next few chapters will delve into the meat and potatoes of machine learning with Spark, applying some of the most common algorithms in canonical applications. The remaining chapters are a bit more of a grab bag and apply Spark in slightly more exotic applications—for example, querying Wikipedia through latent semantic relationships in the text or analyzing genomics data.
The Second Edition
Since the first edition, Spark has experienced a major version upgrade that instated an entirely new core API and sweeping changes in subcomponents like MLlib and Spark SQL. In the second edition, we’ve made major renovations to the example code and brought the materials up to date with Spark’s new best practices.
About the Author
Sandy Ryza develops algorithms for public transit at Remix. Previously, he was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project. He holds the Brown University computer science department's 2012 Twining award for "Most Chill".
Uri Laserson is an Assistant Professor of Genetics at the Icahn School of Medicine at Mount Sinai, where he develops scalable technology for genomics and immunology using the Hadoop ecosystem.
Sean Owen is Director of Data Science at Cloudera. He is an Apache Spark committer and PMC member, and was an Apache Mahout committer.
Josh Wills is the Head of Data Engineering at Slack, the founder of the Apache Crunch project, and wrote a tweet about data scientists once.
Top customer reviews
As a Computer Forensics Expert, I need to be at least passingly familiar with technologies wherein data pertinent to an investigation might be found.
This book is an excellent introduction to large-scale data analytics, such as estimating financial risk, performing semantic analysis, analyzing neuroimaging data and more.
Each of the three authors writes clearly and the unseen hand of a good editor is evident in how seamlessly the chapters flow into each other.
Personally, I would have benefited from more illustrations, but I am less knowledgeable than the intended audience for this book.
Still, the book is eminently readable for anyone with a moderate understanding of programming, network operations and statistics, the last being quite important to comprehension.
For me, the book served its purpose: educating me in the essentials of Spark and its uses.
Like most O'Reilly books, this one assumes the reader is generally knowledgeable but needs more/better specifics about this particular area. The authors do a good job of introducing concepts without making you feel like you're wasting your time reviewing the bare basics, but also giving you enough depth and background so that you're not suddenly dropped in the middle of an unfamiliar landscape.
I particularly liked the wide variety of examples. Honestly, I doubt that I'll use half of what I learned from them, but I don't know which half, and the book gives you a solid foundation to work from so that you're ready to tackle a broad range of projects.
While many books review the basic ideas and techniques of Machine Learning beginning with Regression and Classification, this handbook discusses these techniques in light of stream and parallel processing with Spark and server/workstation clusters. This is much needed for both the Internet of Things and Transaction Processing in the corporate world.