- Paperback: 276 pages
- Publisher: O'Reilly Media; 1 edition (April 20, 2015)
- Language: English
- ISBN-10: 1491912766
- ISBN-13: 978-1491912768
- Product Dimensions: 7 x 0.6 x 9.2 inches
- Shipping Weight: 1 pound
- Average Customer Review: 22 customer reviews
- Amazon Best Sellers Rank: #470,263 in Books
Advanced Analytics with Spark: Patterns for Learning from Data at Scale 1st Edition
There is a newer edition of this item.
About the Author
Sandy Ryza is a data scientist at Cloudera and an active contributor to the Apache Spark project. He recently led Spark development at Cloudera and now spends his time helping customers with a variety of analytic use cases on Spark. He is also a member of the Hadoop Project Management Committee.
Uri Laserson is a data scientist at Cloudera, where he focuses on Python in the Hadoop ecosystem. He also helps customers deploy Hadoop on a wide range of problems, focusing on life sciences and health care. Previously, Uri cofounded Good Start Genetics, a next-generation diagnostics company, while working towards a PhD in biomedical engineering at MIT.
Sean Owen is Director of Data Science for EMEA at Cloudera. He has been a significant contributor to the Apache Mahout machine learning project since 2009, and authored its “Taste” recommender framework. He created the Oryx (formerly Myrrix) project for real-time, large-scale learning on Hadoop, built on lambda architecture principles, and has contributed to Spark and Spark’s MLlib project.
Josh Wills is Cloudera's Senior Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. He is the founder and VP of the Apache Crunch project for creating optimized MapReduce and Spark pipelines in Java. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+.
Top customer reviews
Spark has emerged as the big data platform of choice for data scientists, both for its ease of use and for its performance and optimization. In a few lines of Scala code, Spark allows you to write iterative algorithms that scale out very well. For a data scientist who wants to explore large-scale data sets, Spark is a great starting point (this is incredible progress by the Spark community, given that the project is just about 4 years old). However, Spark itself is moving fast and maturing with time, and Spark, Scala, and distributed algorithms are typically not in the arsenal of many data scientists today.
What this book does is teach you how to think about data science problems at scale, in the context of Spark. Through well-chosen examples covering both supervised and unsupervised learning, the authors take you step by step from a practical problem definition (say, how to recommend music given a user's listening history) to which features are relevant, which machine learning algorithm to use, how to tune parameters to optimize the solution, and how you can use Spark to do all of this in an interactive, iterative manner. As a bonus, they also point you to well-engineered data sets that you can use to follow along with the discussion and learn by trying out the examples yourself.
By embracing the feature engineering, data cleaning and error handling, and tuning and feedback steps, the authors manage to show how real-world data science works, how you can do full-stack data science using Spark, and how much you gain from the interactive nature of the Spark REPL.
Overall, I highly recommend this book, and though it is the first book on Data Science using Spark, it sets a high standard for subsequent efforts.
This book presents 9 case studies of data analysis applications in various domains. The topics are diverse and the authors always use real-world datasets. Besides learning Spark and data science, you will also have the opportunity to gain insight into topics like taxi traffic in NYC, deforestation, or neuroscience. Readers without any previous exposure to machine learning might struggle to understand certain chapters, so I think it's a good idea to actually try the examples yourself while reading and to Google for further details about the methods used. Many of the chapters end with only basic models, which barely outperform the baselines, so if you want to, there is a lot of room for improvement and further work.
Spark itself provides its users with APIs in three languages: Java, Scala, and Python. This book successfully covers each one of these, although you can feel a slight preference for Scala throughout the book. For Scala beginners, the authors always explain the special constructs or syntax features, which is a nice touch. The Introduction and Appendix chapters provide basic information about the Spark core, RDDs (resilient distributed datasets), and the options for running Spark, whether on a cluster (Mesos, YARN, or Spark's own cluster manager) or in a standalone setting. Throughout the book you can find some really worthwhile tips about Spark and data analysis, such as using a serializer other than Java's default (they recommend Kryo), and an overview of data cleansing and the whole machine learning pipeline. To sum up, I recommend this book to every data scientist, because it demonstrates advanced topics like workload distribution and scaling through enjoyable examples.
The vignettes introduce a variety of topics that Spark can tackle (recommendations, graph analysis, Monte Carlo methods), each by analyzing a publicly available dataset. The analysis conducted is explained well and is very useful as an introduction to the techniques used. As an overview of the capabilities of Spark, this method excels. In addition, all code is available on the authors' GitHub, though there are some discrepancies between the code in the repo and the book (beyond what's expected when comparing a static book and a changing git repo). This is useful for following along and replicating the analysis, as well as for altering their techniques to explore the data further.
On the other hand, this is not an in-depth introduction to Spark as a whole. There is an appendix introducing some Spark basics, but you'll get much further with Spark's own documentation, or the other O'Reilly book, Learning Spark. Without this aspect, it becomes harder to generalize these analyses for your own purposes.
Most recent customer reviews
The "advanced" part of the title seems questionable to me.