- Paperback: 466 pages
- Publisher: O'Reilly Media; 1st edition (November 1, 2012)
- Language: English
- ISBN-10: 1449319793
- ISBN-13: 978-1449319793
- Product Dimensions: 7 x 0.9 x 9.2 inches
- Shipping Weight: 1.8 pounds
- Average Customer Review: 158 customer reviews
- Amazon Best Sellers Rank: #200,521 in Books
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython 1st Edition
From the Publisher
What Is This Book About?
This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.
When I say 'data', what am I referring to exactly? The primary focus is on structured data, a deliberately vague term that encompasses many different common forms of data, such as:
- Multidimensional arrays (matrices).
- Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files.
- Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user).
- Evenly or unevenly spaced time series.
This is by no means a complete list. Even though it may not always be obvious, a large percentage of data sets can be transformed into a structured form that is more suitable for analysis and modeling. If not, it may be possible to extract features from a data set into a structured form.
As an example, a collection of news articles could be processed into a word frequency table which could then be used to perform sentiment analysis. Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used data analysis tool in the world, will not be strangers to these kinds of data.
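The forms of structured data named in this section can be illustrated with a minimal sketch. It is not taken from the book: the table names, column names, and sample values (`orders`, `customers`, `articles`, and so on) are all made up for illustration, and it assumes NumPy and pandas are installed. The last few lines show one possible way to turn a small collection of texts into a word frequency table.

```python
from collections import Counter

import numpy as np
import pandas as pd

# 1. Multidimensional array (matrix): a single numeric type throughout.
matrix = np.arange(6).reshape(2, 3)

# 2. Tabular data: each column may hold a different type.
#    (All names and values here are hypothetical sample data.)
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 20, 10],
    "amount": [9.99, 24.50, 5.00],
    "placed": pd.to_datetime(["2012-11-01", "2012-11-02", "2012-11-05"]),
})

# 3. Multiple tables related by a key column (customer_id).
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["Ann", "Bob"]})
joined = orders.merge(customers, on="customer_id", how="left")

# 4. Time series: values indexed by unevenly spaced timestamps.
series = orders.set_index("placed")["amount"]

# Feature extraction: unstructured text turned into a word frequency table.
articles = ["Markets rise as data improves", "Data shows markets fall"]
counts = Counter(word.lower() for text in articles for word in text.split())
word_table = pd.Series(dict(counts), name="count").sort_values(ascending=False)
```

Here `joined` carries the customer name alongside each order, and `word_table` maps each lowercased word to its count across both texts. A real pipeline would likely need proper tokenization and punctuation handling, which this sketch leaves out.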
About the Author
Wes McKinney is the main author of pandas, the popular open source Python library for data analysis. Wes is an active speaker and participant in the Python and open source communities. He worked as a quantitative analyst at AQR Capital Management and Python consultant before founding DataPad, a data analytics company, in 2013. He graduated from MIT with an S.B. in Mathematics.
Top customer reviews
In particular, see sections: Tutorials, Intro to Data Structures - Series and DataFrame, and Essential Basic Functionality.
The remaining 1/4 of the book had a very useful, concentrated intro to NumPy, Advanced NumPy, and a Python Essentials reference. This book does not cover the newer development of R function calls from Python. In my opinion, R is winning the R vs Pandas argument due to ggplot2 and statistical learning professors publishing code first in R. Since R is now easy to use from within Python, Pandas might not get as much use. But it's still useful to know how to use Pandas as part of a data analyst's toolkit.
I also want to warn buyers about faint printing on several physical copies of this book. I bought from Amazon AND directly from O'Reilly Media in trying to get a physical book that had good, solid printing on all pages. This was not possible. The physical book from O'Reilly had even fainter/worse printing than the version I got from Amazon. Better to save your money and just get the eBook version if you are OK with that, which you can usually find cheaper online. O'Reilly puts on excellent conferences, but may be getting out of the printed book business. I guess most programmers buy eBooks now. I just find eBooks difficult to deal with when it comes to dense, technical books. I am fine with eBooks for fiction or more narrative non-fiction such as economics, popular science, or history.
I didn't find the book as helpful as I had hoped. I already had a fair bit of experience with numpy, on which pandas is built, so perhaps the book was not written for me. I think the book was premature, as pandas has often changed in a backwards-incompatible manner. It also glosses over many of the idiosyncrasies of the package, which fortunately are also changing.
I find pandas both helpful and maddening, and more than once I have removed all traces of it from my work, only to come back and give it another try at a later date. Due to its popularity, it clearly fills a need. At this point I am once again committed to using pandas. I've learned not to expect too much from it, to avoid using it in situations requiring more than 2 dimensions (multiIndexing is not a universal solution for higher dimension data!), and I often find it best to resort to rebuilding DataFrames from the underlying data after manipulating the values in numpy. I've also found that when working with data that pushes up against your available RAM, pandas can be a real problem. The book does little to help here.
I wish the book had a clearer explanation of the various caveats and gotchas of pandas, and how it deviates from numpy. I found little help for understanding the efficiency of different strategies for big data or complex manipulations. Pandas tries to do a lot, with some remarkable success, but you can quickly find yourself down a rabbit hole. The package clearly is improving, but the book doesn't help one understand these boundaries. Perhaps a dedicated pandas book once the package is more mature will do this. For now, this is a decent introductory book for generally lighter weight problems (though mileage will vary). In the meantime StackOverflow can be quite helpful, though I'm afraid it contributes to the impression that pandas requires a rather byzantine set of solutions for less standard problems.
Wes McKinney is clearly a data geek. His examples are a bit harder to follow than those of other writers, but the depth of his knowledge -- both in problem solving and using Python to do it -- makes the effort to follow them worth it a thousandfold. He covers everything from accessing data from numerous types of sources to solving really nasty data problems using simple tools. I found his writing clear, though probably not concise. No matter, he quickly gave me what I needed.