- Paperback: 280 pages
- Publisher: O'Reilly Media; 2nd edition (July 6, 2017)
- Language: English
- ISBN-10: 1491972955
- ISBN-13: 978-1491972953
- Product Dimensions: 7 x 0.5 x 9 inches
- Shipping Weight: 1.2 pounds
- Average Customer Review: 7 customer reviews
- Amazon Best Sellers Rank: #34,912 in Books
Advanced Analytics with Spark: Patterns for Learning from Data at Scale 2nd Edition
From the Publisher
From the Preface
What’s in This Book
The first chapter will place Spark within the wider context of data science and big data analytics. After that, each chapter will comprise a self-contained analysis using Spark. The second chapter will introduce the basics of data processing in Spark and Scala through a use case in data cleansing. The next few chapters will delve into the meat and potatoes of machine learning with Spark, applying some of the most common algorithms in canonical applications. The remaining chapters are a bit more of a grab bag and apply Spark in slightly more exotic applications—for example, querying Wikipedia through latent semantic relationships in the text or analyzing genomics data.
The Second Edition
Since the first edition, Spark has experienced a major version upgrade that instated an entirely new core API and sweeping changes in subcomponents like MLlib and Spark SQL. In the second edition, we’ve made major renovations to the example code and brought the materials up to date with Spark’s new best practices.
About the Author
Sandy Ryza develops algorithms for public transit at Remix. Previously, he was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project. He holds the Brown University computer science department's 2012 Twining award for "Most Chill".
Uri Laserson is an Assistant Professor of Genetics at the Icahn School of Medicine at Mount Sinai, where he develops scalable technology for genomics and immunology using the Hadoop ecosystem.
Sean Owen is Director of Data Science at Cloudera. He is an Apache Spark committer and PMC member, and was an Apache Mahout committer.
Josh Wills is the Head of Data Engineering at Slack, the founder of the Apache Crunch project, and wrote a tweet about data scientists once.
Top customer reviews
As a Computer Forensics Expert, I need to be at least passingly familiar with technologies wherein data pertinent to an investigation might be found.
This book is an excellent introduction to large-scale data analytics, covering applications such as estimating financial risk, performing semantic analysis, and analyzing neuroimaging data.
Each of the authors writes clearly, and the unseen hand of a good editor is evident in how seamlessly the chapters flow into each other.
Personally, I would have benefited from more illustrations, but I am less knowledgeable than the intended audience for this book.
Still, the book is eminently readable for anyone with a moderate understanding of programming, network operations and statistics, the last being quite important to comprehension.
For me, the book served its purpose: educating me in the essentials of Spark and its uses.
While many books review the basic ideas and techniques of machine learning, beginning with regression and classification, this handbook discusses these techniques in light of stream and parallel processing with Spark on server and workstation clusters. This is much needed, both for the Internet of Things and for transaction processing in the corporate world.
Like most O'Reilly books, this one assumes the reader is generally knowledgeable but needs more/better specifics about this particular area. The authors do a good job of introducing concepts without making you feel like you're wasting your time reviewing the bare basics, but also giving you enough depth and background so that you're not suddenly dropped in the middle of an unfamiliar landscape.
I particularly liked the wide variety of examples. Honestly, I doubt that I'll use half of what I learned from them, but I don't know which half, and the book gives you a solid foundation to work from so that you're ready to tackle a broad range of projects.
A wide variety of datasets is used throughout the text. The first dataset encountered comes from the UC Irvine Machine Learning Repository and consists of curated records from a linkage study performed in a German hospital in 2010. Another interesting dataset is the Audioscrobbler music recommendation dataset from 2005, as provided by last.fm. The Covtype dataset, which records the types of forest covering parcels of land in Colorado, appears later in the book. Later still, a full dump of Wikipedia is used (the book recommends a cluster of computers for these examples). Other datasets include network analysis data, New York City taxi trip data, neuroimaging data, and financial data. The rich, diverse set of data used throughout the examples really brings this book into 5-star territory.
A wide variety of techniques is also used: decision trees, k-means clustering, latent semantic analysis, co-occurrence networks, temporal data analysis, and Monte Carlo simulation. Highly recommended for the data scientist interested in learning the ins and outs of Spark.