Hacker News Comments on
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

Wes McKinney · 3 HN comments
HN Books has aggregated all Hacker News stories and comments that mention "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython" by Wes McKinney.
HN Books may receive an affiliate commission when you make purchases on sites after clicking through links on this page.
Amazon Summary
Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language. Written by Wes McKinney, the main author of the pandas library, this hands-on book is packed with practical case studies. It’s ideal for analysts new to Python and for Python programmers new to scientific computing.

- Use the IPython interactive shell as your primary development environment
- Learn basic and advanced NumPy (Numerical Python) features
- Get started with data analysis tools in the pandas library
- Use high-performance tools to load, clean, transform, merge, and reshape data
- Create scatter plots and static or interactive visualizations with matplotlib
- Apply the pandas groupby facility to slice, dice, and summarize datasets
- Measure data by points in time, whether it’s specific instances, fixed periods, or intervals
- Learn how to solve problems in web analytics, social sciences, finance, and economics, through detailed examples
Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this book.
Sep 16, 2016 · danso on R for Data Science
There's a large number of such books, though none that are as authoritative with respect to Python (this is a statement about the size of Python's community vs. R, not necessarily about the authors):

- via Wes McKinney, creator of pandas (which makes Python about as close to R as you can get): https://www.amazon.com/Python-Data-Analysis-Wrangling-IPytho...

- http://joelgrus.com/2015/04/26/data-science-from-scratch-fir...

There are a bunch of books specific to machine learning too, though I haven't read them myself.

hadley
What would you recommend for visualisation?
nrjames
I've used R and Python. I stick with Python whenever possible because IMO it supports the non-modeling parts of data science more effectively: ETL scripts, API creation, Flask for hosting simple websites, etc. yHat makes python-ggplot and Rodeo, which is similar to RStudio. I explore and develop algorithms in Jupyter notebooks, documenting along the way, while running "hardened" code from the command line, often nohup'ing it on a Linux box, for services that run perpetually and keep me updated via Slack/SMS/email, etc.

For visualization, almost everything I do is in D3, p5.js, or in Processing (Java), which has a Python interpreter, for those interested. There are some great Processing books and Daniel Shiffman is the Hadley of that world. Tons of engaging resources from him. There are tons and tons of good D3 books and online resources. bl.ocks and Mike Bostock's other online articles are wonderful.

Every organization with data scientists defines "data science" differently. People with a modeling and stats focus probably should stick with R. If you find yourself in a position with a wider scope, you simply must have more tools in your tool belt, and in my opinion, R, Python, and JavaScript all are part of that package. For me, personally, Processing is, too. Have a look at Ben Fry's work to understand why. I also use openFrameworks when the volume of data to visualize and performance concerns require it.

danso
I actually have no idea about that. I don't think there's an equivalent to R's base graphics, so that would seem to make matplotlib the closest thing to a standard -- seaborn [0], which I've seen used a lot lately for more advanced dataviz, lives atop it, but it's also relatively new.

People seem to have conflicted feelings about matplotlib, maybe because of its origin in MATLAB? Not that MATLAB itself is bad, but I think the decision to make matplotlib's API comfortable for MATLAB users causes confusion for contemporary users, even before the usual 2.x vs 3.x issues (matplotlib was ported to 3.x a few years ago, but many users still write Python in the 2.x style).

Anecdotally, I feel like I see advice like "Just use plotly" more than I see recommendations to actually learn matplotlib. I actually gave up on matplotlib until I stumbled upon this comprehensive tutorial, which covers the basics and many elaborate use cases. If there's a book that does it better, I haven't heard about it:

http://www.labri.fr/perso/nrougier/teaching/matplotlib/

The matplotlib site itself is chock-full of well-documented examples, but some of them seem to be significantly more verbose than they need to be. My impression is that the library is stable/ubiquitous enough that there isn't a big movement to overhaul things. Last time I looked at the API changes for v2.0 [1] (1.5.3 is stable), most of the changes had to do with default styles and stylesheets, which is non-trivial given the number of people who use ggplot2 because it "just works".

[0] https://stanford.edu/~mwaskom/software/seaborn/

[1] http://matplotlib.org/devdocs/users/dflt_style_changes.html

sonabinu
What are your thoughts on bokeh? I always seem to revert to R for visualizations.
nickdavidhaynes
Wes's book is definitely the standard. But I would hold off buying one right now - he's currently working on a (much-needed) second edition, coming out next year (http://wesmckinney.com/).

Joel's book is a great resource for preparing for interviews or learning really basic stuff and less of an introduction to the tools.

Jul 16, 2015 · pvnick on How to learn data science
Good article for beginners. A couple thoughts, just to build on what the author said:

First off, data science == fancy name for data mining/analysis. I wanted to clear that up due to the buzzwordy nature of "data science."

Learn SQL - this is the big one. You must be proficient with SQL to be effective at data science. Whether it's running on an RDBMS or translating to map/reduce (Hive) or DAG (Spark), SQL is invaluable. If you don't know what those acronyms mean yet, don't worry. Just learn SQL.

Learn to communicate insights - I would add here to try some UI techniques. Highcharts, d3.js, these are good libraries for telling your data story. You can also do a ton just with Excel and not need to write any code beyond what you wrote for the mining portion (usually SQL).

I would also go back to basics with regard to statistical techniques. Start with the simple z-score; it is such an important tool in your data science toolbox. If you're just looking at raw numbers, try to z-normalize the data and see what happens. You'd be surprised what you can achieve with a high school statistics textbook, Postgres/MySQL (or even Excel!), and a moderate-sized data set. These are powerful enough to answer the majority of your questions, and when they fail, then move on to more sexy algorithms.
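
As a rough illustration of the z-normalization idea above, here is a minimal sketch using only Python's standard library. The `z_scores` helper, the sample numbers, and the cutoff of 2 are all made up for the example; they are assumptions, not anything from the comment itself.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (x - mean) / std for each x, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are identical")
    return [(x - mu) / sigma for x in values]

# Hypothetical daily sign-up counts; the last day looks unusual in the raw numbers.
daily_signups = [102, 98, 105, 97, 101, 99, 180]
scores = z_scores(daily_signups)

# Flag values more than 2 standard deviations from the mean (an arbitrary cutoff).
outliers = [x for x, z in zip(daily_signups, scores) if abs(z) > 2]
print(outliers)
```

After normalization the values have mean 0 and standard deviation 1, which makes series measured on different scales directly comparable.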

Edit: one more thing I forgot to mention. After SQL, learn Python. There are a ton of libraries in the Python ecosystem that are perfect for data science (numpy, scipy, scikit-learn, etc). It's also one of the top languages used in academic settings. My preferred data science workspace involves Python, IPython Notebook, and Pandas. (This book is quite good: http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython...)

DataWorker
This is the truth. People can't do simple statistics, even with advanced degrees. In many cases advanced degrees make things worse. The ability to reason about data and strong fundamentals in math and statistics are what's needed.

Someone else mentioned Gelman's blog. That's a great place to find evidence that PhDs do not lead to an increased ability to ferret out "truth" or insight from data. In many cases they just hide the mistakes so that others without that background don't know they're being misled.

rjusher
But how do you become a good data scientist, instead of a technical person who knows how to apply an algorithm in Python/R?

What I am trying to ask is how you become good at setting your starting point (formulating your hypotheses), communicating your insights, and selecting which tools apply where. If you are good at coding and have experience in things related to computer science, you have the ability to handle a dataset (SQL knowledge) and the data tools (Python, Pandas, etc), but that doesn't earn you the title of data scientist.

eitally
Practice. And education, but mostly practice. This is the kind of thing that is typically taught in formal educational settings (at least in engineering, which is my experience). As an example, I learned more about probability & statistics in 1) AP biology in high school, and 2) a "simulation systems" class in my industrial engineering master's curriculum. We spent much of the former class learning basic statistical analysis techniques (ANOVA, chi-square, etc) to apply to our lab data, and the latter class was all about statistical analysis of process flows (aimed at the real life problem of factory production planning & scheduling and manufacturing process optimization).

So, do I consider myself a data scientist? Absolutely not. But I do understand basic statistical concepts and know how to apply them to several categories of real-life data analysis problems.

I'm a terrible coder, btw.

rjusher
Would you recommend any approach, or should I go dust off my high school and college books in search of study material? Or is that material too basic?
stared
I would shift priorities to Python (from SQL).

Unless one has the "data scientist" title just to make "database engineer" look fancier, data comes in various shapes and forms. And most questions cannot be answered with a simple aggregation.

For example, the data I work on (I am a freelance data scientist) comes as flat CSV files, XLS files, JSON files, some text files I need to parse, various SQL databases, MongoDB, things I am getting from various APIs, etc.
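
To make the "various shapes and forms" point concrete, here is a small sketch that pulls records from a CSV source and a JSON source into one common shape using only the standard library. The field names (`user`, `amount`), the inline sample data, and both helper functions are hypothetical, chosen just for illustration.

```python
import csv
import io
import json

# Hypothetical inputs standing in for a flat file and an API response.
csv_text = "user,amount\nalice,10.5\nbob,3\n"
json_text = '[{"user": "carol", "amount": 7.25}]'

def rows_from_csv(text):
    """Parse CSV text into dicts, converting the amount column to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"user": r["user"], "amount": float(r["amount"])} for r in reader]

def rows_from_json(text):
    """Parse a JSON array of objects into the same dict shape."""
    return [{"user": r["user"], "amount": float(r["amount"])} for r in json.loads(text)]

# Once both sources share a shape, the analysis code no longer cares where rows came from.
records = rows_from_csv(csv_text) + rows_from_json(json_text)
total = sum(r["amount"] for r in records)
print(len(records), total)
```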

While understanding joins is crucial (and normal forms, etc), SQL itself takes a negligible amount of my time (and effort).

S4M
I would disagree with that advice. If you work as a data scientist in a company, you will likely have the logs of something stored in an SQL table (be it a pure SQL database or something like Hadoop Hive), and you will have to answer (and ask) questions like: "Do people convert more when they come from X or Y?", so you will have to run a couple of queries to get the conversion rates for people coming from X and Y.

This was my experience when I worked as a data scientist about a year ago. Now, YMMV, especially if you're a freelancer; I guess your clients are more comfortable with giving you raw dumps of data as files instead of giving you access to their database servers.
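
For readers who have not written this kind of query, a minimal sketch of the "conversion rate by source" question might look like the following. It uses an in-memory SQLite database through Python's standard `sqlite3` module; the `visits` table, its columns, and the sample rows are invented for the example and do not come from the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (visitor_id INTEGER, source TEXT, converted INTEGER)"
)
# Hypothetical log rows: converted is 1 if the visitor converted, else 0.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "X", 1), (2, "X", 0), (3, "X", 0), (4, "X", 1),
     (5, "Y", 1), (6, "Y", 0), (7, "Y", 0), (8, "Y", 0)],
)

# Because converted is 0/1, its average per group is the conversion rate.
rows = conn.execute(
    """
    SELECT source,
           COUNT(*)       AS visits,
           SUM(converted) AS conversions,
           AVG(converted) AS conversion_rate
    FROM visits
    GROUP BY source
    ORDER BY source
    """
).fetchall()
conn.close()

for source, visits, conversions, rate in rows:
    print(source, visits, conversions, rate)
```

The same `GROUP BY` query should carry over largely unchanged to other SQL engines, though dialect details may differ.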

stared
I work as a freelancer. And actually, I never ever processed logs.

Of course, sometimes I am given SQL access to a server, but I never learnt SQL except in action (i.e. the things I needed right then).

And most of the time I work with flat files. Even if they come from SQL, they typically need serious preprocessing before I can do a more advanced analysis.

BTW: I have no problems with composing rather advanced queries. But if SQL is a problem for someone (and, in case of doubt, it can't be Googled in no time), then I am curious how they can get machine learning.

platz
Regarding SQL, have you noticed any increase in the usage of "window functions"? How important do you find them for your work?
dummy7953
You're basically describing stuff I was doing like 20 freaking years ago. Minus the Hive & Spark & Highcharts & d3.js - naturally.

But back then I couldn't get any of my managers to understand or appreciate what I was doing. Fickle finger of fate.

pvnick
Hell, even William Gosset was doing "data science" when he popularized the Student's t-distribution while working for the Guinness Brewery back in 1908.
mziel
I lived through Statistics, Business Analysis, Decision Analytics, Data Analytics, Data Mining, and now Data Science. Same thing renamed over and over again.

Regarding the post above, it's right. A data scientist is someone better at statistics (classical stats, Bayesian, machine learning) than a computer scientist, and better at programming (SQL, R/Python for building models) than an academic statistician. Plus a teaspoon of visualization (ggplot or d3).

rrmm
AI has gone through the same sort of buzzword treadmill and even programming in general. Only after living through a few cycles does it really become obvious how cyclic these sorts of trends are.

I'm trying to work on being less jaded about it, and not letting my annoyance with the-new-trendy-thing-that-i-remember-doing-years-ago-under-a-different-name get in the way of learning new technology and new lessons.

But it's a struggle.

Lofkin
Great comment.

BTW, you can make interactive visualizations in pure Python with bokeh: http://bokeh.pydata.org/en/latest/

Also, with Blaze, you can use Pandas (or even dplyr) syntax in Python to query Hive, Spark and other large stores: http://blaze.pydata.org/en/latest/

I guess the question is what do you mean by advanced topics? What direction do you want to go in? The latter book you mentioned seems to cover a number of topics and is probably a good bet.

If you are interested in the web, both these books were good: http://www.amazon.com/Python-Web-Programming-Steve-Holden/dp... http://www.amazon.com/The-Definitive-Guide-Django-Developmen...

Here are a few books that cover some "advanced?" topics that I'd like to read when I have time (would also like to hear other peoples' recommendations on them):

- http://www.amazon.com/Python-Data-Analysis-Wes-McKinney/dp/1...
- http://www.amazon.com/Twisted-Network-Programming-Essentials...
- http://www.amazon.com/Foundations-Python-Network-Programming...
- http://www.amazon.com/Introduction-Tornado-Michael-Dory/dp/1...
- http://onlinebookplace.com/programming-computer-vision-with-...

I'm not sure on your background or the quality of these books, but an understanding of data structures, algorithms, and object oriented programming could be considered important: http://www.amazon.com/Data-Structures-Algorithms-Using-Pytho... http://www.amazon.com/Python-Algorithms-Mastering-Language-E... http://www.amazon.com/Python-3-Object-Oriented-Programming/d...

Although these and other intermediate to advanced topics tend to be covered better in non-language-specific books such as this shotgun blast to the head. Don't worry, it's just an "introduction": http://www.amazon.com/Introduction-Algorithms-Thomas-H-Corme...

michelleclsun
Thanks @poof131 - I'd like to go deeper into algorithms / data manipulation / social network analysis (for my job), and also web programming using python (weekend reading).

I'm currently reading Python for Data Analysis, but I find that while I can read about how to use a library, it's hard to retain specific syntax and use cases if I'm not using those libraries immediately / frequently.

One book I really like is Collective Intelligence (http://shop.oreilly.com/product/9780596529321.do), which has some good examples on social network analysis.

HN Books is an independent project and is not operated by Y Combinator or Amazon.com.