Hacker News Comments on
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan

John Kruschke · 5 HN comments
HN Books has aggregated all Hacker News stories and comments that mention "Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan" by John Kruschke.
Amazon Summary
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, Second Edition provides an accessible approach for conducting Bayesian data analysis, as material is explained clearly with concrete examples. Included are step by step instructions on how to carry out Bayesian data analyses in the popular and free software R and WinBugs, as well as new programs in JAGS and Stan. The new programs are designed to be much easier to use than the scripts in the first edition. In particular, there are now compact high level scripts that make it easy to run the programs on your own data sets. The book is divided into three parts and begins with the basics: models, probability, Bayes’ rule, and the R programming language. The discussion then moves to the fundamentals applied to inferring a binomial probability, before concluding with chapters on the generalized linear model. Topics include metric predicted variable on one or two groups; metric predicted variable with one metric predictor; metric predicted variable with multiple metric predictors; metric predicted variable with one nominal predictor; and metric predicted variable with multiple nominal predictors. The exercises found in the text have explicit purposes and guidelines for accomplishment. This book is intended for first year graduate students or advanced undergraduates in statistics, data analysis, psychology, cognitive science, social sciences, clinical sciences, and consumer sciences in business. 
- Accessible, including the basics of essential concepts of probability and random sampling
- Examples with R programming language and JAGS software
- Comprehensive coverage of all scenarios addressed by non Bayesian textbooks: t tests, analysis of variance (ANOVA) and comparisons in ANOVA, multiple regression, and chi square (contingency table analysis)
- Coverage of experiment planning
- R and JAGS computer programming code on website
- Exercises have explicit purposes and guidelines for accomplishment
- Provides step by step instructions on how to conduct Bayesian data analyses in the popular and free software R and WinBugs.
Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this book.
The only book accessible by beginners, while still dealing with real-world level analysis: https://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/d...
Characteristic examples from the book Doing Bayesian Data Analysis 2nd edition [1] programmed in Clojure and OpenCL to run on the GPU. Much, much faster than Stan or JAGS!

The library used (Bayadera) is still pre-release, so much polishing is still needed, so this can be considered a preview. But, it is still very useful, and not more complex for programmers than the mainstream Bayesian tools.

[1] https://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/d...

marmaduke
It's not really comparable to stan if you're not computing gradients for HMC. Of course maybe for these models you don't need to.
dragandj
Please point me to models that absolutely need HMC, and I can try to see how Bayadera fares.
marmaduke
Stan's developers in particular use it for hierarchical models, but in general anything with highly correlated parameters works better with HMC than MCMC, IIRC.

Michael Betancourt (a Stan dev working on the HMC parts) has a pair of YouTube videos which go into details.

That said, I switched to pymc3 so that I could compute logp via opencl more easily and if there were better ways to do this, I'm happy to see them.

dragandj
But there ARE hierarchical models in the examples. One is 158-dimensional. With highly correlated parameters. Works like a charm in Bayadera.
marmaduke
My point was mainly that comparing speed between an algorithm that doesn't require gradients and HMC in Stan is apples and oranges.
dragandj
How's that? The algorithms have the same goal - find the posterior distribution. The time to get to that distribution is what is important and what is compared, provided that both algorithms get proper results. How they do it underneath is irrelevant for the user who waits.

That's like saying that comparing a horse cart and an automobile is comparing apples and oranges.

That being said, there are other things where Stan might fare better. User familiarity, or maturity, or personal taste...

te
Looks cool. Would love to see the "much, much faster" claim quantified. Both including and excluding compile times. Stan is neat, but the recompile time after every tweak to the model really got to be a drag. If you can improve on that, it would be a real win for me.
dragandj
Recompile time is under a second for most models that I tried. Let's say 150 - 700 ms. Once you compile it, you can use it many times.

The difference in inference time is in the post below (but YMMV).

thom
This looks great, and very tastefully implemented on the Clojure side. Have you considered splitting out things like the quil utils?
dragandj
Thanks! About the quil utils - do you mean get rid of them, or do you think that they would be useful on their own? Btw, I only use quil for the sketch setup; the plots are done in low-level Processing. Quil does some (un)boxing, so it would be a drag when plotting millions of sample points.
thom
Ultimately I've been disappointed with Incanter, which seems effectively abandoned at this point, and I'm always on the lookout for alternatives that might one day rival the tools available in R. I use quil for various visualisations, but it (and I guess Processing beneath it) is jarringly mutable. Despite that, it would be interesting to me to see more plotting libraries spring up for Clojure, I suppose that was my point.
simonb
Shameless self promotion: https://github.com/sbelak/huri has a half-decent plotting DSL that uses ggplot underneath.
dragandj
Bayadera does not aim to rival or to mimic R tools - it aims to be better! But, it is strongly opinionated; either you take the bayesian path, or you use something else :)

Currently, I do not have resources to take on the task of creating a general-purpose plotting library. My idea with the plotting toolbox available in Bayadera is to provide a tool that would do one job - provide fast, easy to use, plotting for exactly those types of plots that are needed for bayesian stuff - and do that job well.

Plotting in general is not that hard technically, but the problem is that each user in each situation wants something unique. The level of customizability found in ggplot, matplotlib and the likes is rather high; that's something I do not have motivation to pursue.

feral
Any hint how much faster? Not looking for defensible benchmarks, but are you talking an order of magnitude? Multiple orders?

Does it make the same probabilistic guarantees as the methods used in Stan etc? Or is it trading validity for speed?

patall
Validity probably depends on whether you use a professional GPU or gaming devices. I am currently running NMF on a GTX980 and from time to time the algorithm totally fails, probably due to missing ECC. I hope our new Tesla server will solve the issue.
dragandj
I cannot guarantee it, but this should not be a problem for this particular algorithm (MCMC).
dragandj
Nothing is universal and guaranteed, of course. YMMV, and all that.

For example, robust linear regression from chapter 17, that fits 300 points over 4 parameters (easy, but far from trivial) runs in 180 seconds in JAGS and 485 in Stan, in parallel with 4 chains, taking 20,000 samples.

Bayadera takes 276,297,912 samples in 300 milliseconds, giving much more fine-grained estimates.

So, depending on how you count the difference, it would be 500-1000 times faster for this particular analysis, while per-sample ratio is something like 7,000,000 (compared to JAGS).

Of course, JAGS and Stan are mature software packages, while Bayadera is still pre-release...
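
A quick way to sanity-check the ratios quoted above is to redo the arithmetic, taking the figures in the comment at face value. This is only a sketch, not a benchmark: it assumes the 20,000 samples are the total for each JAGS/Stan run (the comment does not say whether that is per chain), and the function names are made up for illustration.

```python
# Figures quoted in the comment above (robust linear regression, chapter 17).
jags_seconds = 180.0
stan_seconds = 485.0
bayadera_seconds = 0.3            # 300 milliseconds
reference_samples = 20_000        # ASSUMPTION: total samples for the JAGS/Stan runs
bayadera_samples = 276_297_912


def wall_clock_ratio(slow_seconds, fast_seconds):
    """How many times longer the slower run took, ignoring sample counts."""
    return slow_seconds / fast_seconds


def throughput_ratio(fast_samples, fast_seconds, slow_samples, slow_seconds):
    """Ratio of samples-per-second between the faster and the slower run."""
    return (fast_samples / fast_seconds) / (slow_samples / slow_seconds)


jags_speedup = wall_clock_ratio(jags_seconds, bayadera_seconds)
stan_speedup = wall_clock_ratio(stan_seconds, bayadera_seconds)
per_sample_vs_jags = throughput_ratio(
    bayadera_samples, bayadera_seconds, reference_samples, jags_seconds
)

print(f"wall clock vs JAGS: {jags_speedup:,.0f}x")
print(f"wall clock vs Stan: {stan_speedup:,.0f}x")
print(f"samples per second vs JAGS: {per_sample_vs_jags:,.0f}x")
```

With these inputs the wall-clock ratio comes out near 600 against JAGS and near 1,600 against Stan, and the per-sample ratio against JAGS lands around 8 million, the same order of magnitude as the figure quoted above. None of this says anything about effective sample size, which is what matters when comparing samplers.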

feral
Thanks. About the second part of my question - are you doing much the same stuff as JAGS/Stan? Like, they do a lot of work to make sure that their MCMC is validly converging to the posterior - does Bayadera make similar guarantees?

Is the speedup coming from a better implementation, or because GPUs are just way faster, or because it cuts statistical corners? If it's cutting corners, are they sensible?

dragandj
It uses a different MCMC algorithm - affine invariant ensemble MCMC. The difference comes from the fact that this algorithm is parallelizable, while JAGS/Stan's isn't. So, many GPU cores are the main factor. But, the algorithm is also a factor, in the sense that parallel chains always mutually inform each other.

They may do a lot of work to make sure that MCMC is validly converging, and Bayadera also does its stuff on that front, but the truth is, and you'll find it in any book on MCMC (Gelman included) that you can never guarantee MCMC convergence.
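
For readers who have not met ensemble samplers, here is a minimal NumPy sketch of one well-known affine invariant ensemble method, the Goodman and Weare "stretch move", in the split-ensemble form that lets half the walkers update at once. It is not Bayadera's code, and the thread does not say which move Bayadera uses; the toy target, the function names, and the settings (64 walkers, scale `a=2.0`, 500 steps of burn-in) are all assumptions chosen for illustration.

```python
import numpy as np

# Toy target: a strongly correlated 2-D Gaussian (hypothetical example).
COVARIANCE = np.array([[1.0, 0.95], [0.95, 1.0]])
PRECISION = np.linalg.inv(COVARIANCE)


def log_density(x):
    """Unnormalized log density for an array of points with shape (n, 2)."""
    return -0.5 * np.sum((x @ PRECISION) * x, axis=-1)


def stretch_update(active, complement, log_p, rng, a=2.0):
    """Update every walker in `active` at once, using `complement` as partners."""
    n, dim = active.shape
    # Draw z from g(z) proportional to 1/sqrt(z) on [1/a, a].
    z = ((a - 1.0) * rng.random(n) + 1.0) ** 2 / a
    partners = complement[rng.integers(len(complement), size=n)]
    # Propose along the line through the walker and its partner.
    proposal = partners + z[:, None] * (active - partners)
    log_accept = (dim - 1) * np.log(z) + log_p(proposal) - log_p(active)
    accept = np.log(rng.random(n)) < log_accept
    return np.where(accept[:, None], proposal, active)


def sample(log_p, walkers, n_steps, seed=0):
    """Run the split-ensemble stretch move; returns shape (n_steps, n_walkers, dim)."""
    rng = np.random.default_rng(seed)
    half = len(walkers) // 2
    first, second = walkers[:half].copy(), walkers[half:].copy()
    chain = []
    for _ in range(n_steps):
        # Each half moves in one vectorized step, informed by the other half.
        first = stretch_update(first, second, log_p, rng)
        second = stretch_update(second, first, log_p, rng)
        chain.append(np.concatenate([first, second]))
    return np.stack(chain)


initial = np.random.default_rng(1).normal(size=(64, 2))
chain = sample(log_density, initial, n_steps=2000)
posterior = chain[500:].reshape(-1, 2)          # drop burn-in, pool all walkers
estimated_corr = np.corrcoef(posterior.T)[0, 1]
print(f"estimated correlation: {estimated_corr:.3f}")
```

Each half of the ensemble is updated in a single vectorized step, which is the part that maps naturally onto many GPU cores, and every proposal is built from another walker's position, which is one way to read "parallel chains always mutually inform each other". As the comment says, none of this guarantees convergence.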

eli_gottlieb
Can you point me to any good documentation on parallel MCMC algorithms and any info you might have written down on how you parallelized it? This sounds extremely worth porting over to some other probabilistic programming languages.
dragandj
I'll be glad to send you the paper once it gets accepted.
eli_gottlieb
Thanks!
nextos
Looks very nice. I wonder if the upcoming Xeon Phi will make the task of parallel sampling simpler. Or at least compiling and optimising automatic parallel samplers on the fly. Macros might be great for this. That's the ultimate probabilistic programming goal. Write the model and get efficient sampling for free.
dragandj
Thanks. I doubt that Xeon Phi would be any faster than my old AMD R9 290X, and the 10x price tag is also not inviting.
Having read many books/online courses, I strongly recommend this: http://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/dp...

It gives a gentle but thorough introduction from first principles; lots of good intuition and 'why'.

It works well with "Probabilistic Programming & Bayesian Methods for Hackers" also mentioned, but I'd start with this. It is much more accessible than many other introductory books, IMO.

I don't have any Google Hangout chat messages to run the first example using Jupyter. I know that you are not going to share your data, but it would be handy if some fake conversations could be included. People like me like to first install the applications and then run them to see whether they work as claimed. I installed the conda distribution and the Jupyter notebook works correctly. (I installed conda on Ubuntu and then seaborn, PyMC3 and pandas; PyMC3 and seaborn with pip, since conda installs version 2.3 of PyMC3.) It works.

I should say that the first step is to clone:

cd where_you_want_the_data_to_be_copied
git clone ....

# and now start jupyter notebook with

jupyter notebook

# go to File/open/ and select the first section.

I see that I can edit the markdown. I translated the introduction to section 0, here it goes. Thanks for this tutorial. The graphics are nice.

### Section 0: Introduction

Welcome to "Bayesian Modelling in Python", a tutorial for people interested in Bayesian statistical techniques with Python. The list of tutorial sections can be found on the project [homepage](https://github.com/markdregan/Hangout-with-PyMC3).

Statistics is a subject I never liked during my university years. The frequentist techniques we were taught (p-values, etc.) seemed contrived, and ultimately I turned my back on a subject I was not interested in.

That changed when I discovered Bayesian statistics, a branch of statistics quite different from the frequentist statistics usually taught at most universities. My learning was inspired by numerous publications, blogs and videos. To those starting out in Bayesian statistics I would wholeheartedly recommend the following:

- [Doing Bayesian Data Analysis](http://www.amazon.com/Doing-Bayesian-Analysis-Second-Edition...) by John Kruschke
- [Python port](https://github.com/aloctavodia/Doing_Bayesian_data_analysis) of John Kruschke's examples by Osvaldo Martin
- [Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programmin...) was a great source of inspiration for me in learning Bayesian statistics. In recognition of the great influence it had on me, I have adopted the same visual style used in BMH.
- [While My MCMC Gently Samples](http://twiecki.github.io/) blog by Thomas Wiecki
- [Healthy Algorithms](http://healthyalgorithms.com/tag/pymc/) blog by Abraham Flaxman
- [Scipy Tutorial 2014](https://github.com/fonnesbeck/scipy2014_tutorial) by Chris Fonnesbeck

I created this tutorial in the hope that others will find it useful and that it will help them learn Bayesian techniques the same way these resources helped me. Any contribution from the community (corrections, comments, contributions) is welcome.

Some good books on Machine Learning:

Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Flach): http://www.amazon.com/Machine-Learning-Science-Algorithms-Se...

Machine Learning: A Probabilistic Perspective (Murphy): http://www.amazon.com/Machine-Learning-Probabilistic-Perspec...

Pattern Recognition and Machine Learning (Bishop): http://www.amazon.com/Pattern-Recognition-Learning-Informati...

There are some great resources/books for Bayesian statistics and graphical models. I've listed them in (approximate) order of increasing difficulty/mathematical complexity:

Think Bayes (Downey): http://www.amazon.com/Think-Bayes-Allen-B-Downey/dp/14493707...

Bayesian Methods for Hackers (Davidson-Pilon et al): https://github.com/CamDavidsonPilon/Probabilistic-Programmin...

Doing Bayesian Data Analysis (Kruschke), aka "the puppy book": http://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/dp...

Bayesian Data Analysis (Gelman): http://www.amazon.com/Bayesian-Analysis-Chapman-Statistical-...

Bayesian Reasoning and Machine Learning (Barber): http://www.amazon.com/Bayesian-Reasoning-Machine-Learning-Ba...

Probabilistic Graphical Models (Koller et al): https://www.coursera.org/course/pgm http://www.amazon.com/Probabilistic-Graphical-Models-Princip...

If you want a more mathematical/statistical take on Machine Learning, then the two books by Hastie/Tibshirani et al are definitely worth a read (plus, they're free to download from the authors' websites!):

Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/

The Elements of Statistical Learning: http://statweb.stanford.edu/~tibs/ElemStatLearn/

Obviously there is the whole field of "deep learning" as well! A good place to start is with: http://deeplearning.net/

yedhukrishnan
Those are really useful. Thank you. Books are pricey though!
shogunmike
I know...some of them are indeed expensive!

At least the latter two ("ISL" and "ESL") are free to download though.

alexcasalboni
Those are great resources!

In case you are interested in MLaaS (Machine Learning as a Service), you can check these as well:

Amazon Machine Learning: http://aws.amazon.com/machine-learning/ (my review here: http://cloudacademy.com/blog/aws-machine-learning/)

Azure Machine Learning: http://azure.microsoft.com/en-us/services/machine-learning/ (my review here: http://cloudacademy.com/blog/azure-machine-learning/)

Google Prediction API: https://cloud.google.com/prediction/

BigML: https://bigml.com/

Prediction.io: https://prediction.io/

OpenML: http://openml.org/

yedhukrishnan
I went through the links and your review. They are really good. Thanks!
HN Books is an independent project and is not operated by Y Combinator or Amazon.com.