HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Simon Peyton Jones: Data Parallel Haskell

nuACM · Youtube · 34 HN points · 13 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention nuACM's video "Simon Peyton Jones: Data Parallel Haskell".
Youtube Summary
There are many approaches to exploiting multi-cores, but a particularly promising one is the "data-parallel" paradigm, because it combines massive parallelism (on both shared and distributed memory) with a simple, single-control-flow programming model. Indeed, I think that data parallelism is the only way we will be able to exploit tens or hundreds of processors effectively.

Alas, data-parallel programming is usually restricted to "flat" data parallelism, which is good for implementers but bad for programmers. Instead, I'll describe the "nested" data parallel programming model, first developed in the 90's by Blelloch and Sabot. It is great for programmers but much harder to implement; as a result, it's virtually unknown in practice. It's really only feasible to support nested data parallelism in a purely functional language, so we are building a high-performance implementation in Haskell.
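
As a rough illustration of the difference (not code from the talk, and not real DPH syntax): sparse matrix-vector multiplication is the canonical nested example, because each row has a different number of non-zero entries. The sketch below uses ordinary Haskell lists and Control.Parallel.Strategies (from the parallel package) just to show the shape of the nesting; real Data Parallel Haskell uses parallel arrays and a compiler vectorisation pass to flatten it.

    import Control.Parallel.Strategies (parMap, rdeepseq)

    -- A sparse row is a list of (column index, value) pairs; rows have
    -- irregular lengths, which is exactly the nesting that flat data
    -- parallelism struggles with.
    type SparseRow    = [(Int, Double)]
    type SparseMatrix = [SparseRow]
    type Vec          = [Double]

    -- Dot product of one sparse row with a dense vector.
    dotp :: SparseRow -> Vec -> Double
    dotp row v = sum [x * (v !! i) | (i, x) <- row]

    -- The outer map over rows runs in parallel; each element of that map
    -- is itself a reduction over a differently sized row.
    smvm :: SparseMatrix -> Vec -> Vec
    smvm m v = parMap rdeepseq (`dotp` v) m

    main :: IO ()
    main = print (smvm [[(0, 2), (2, 1)], [(1, 3)]] [1, 2, 3])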

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
One of my favorites of these is only showing the speedup..."scales linearly", without comparing to a non-parallel baseline. The first time I saw this was with Cilk, which reported linear scaling. I was impressed, until I found out that the parallel version was 6x slower on a single CPU than the sequential version.

Similar with an SPJ talk on parallel Haskell (can't find it right now). Big claims about this being the One True Way, and then it required four cores to match the sequential C case. Hmmm.

UPDATE: https://www.youtube.com/watch?v=NWSZ4c9yqW8 around the 1h 10min mark. Four cores to match sequential C (previously had written six, was wrong)

scott_s
Agreed. To my surprise, he kinda argues against this in one of his items (https://blogs.fau.de/hager/archives/6558):

Always compare the fully parallel accelerator code with a sequential (single-core) CPU version. You can even argue in favor of this, saying that “serial CPU code constitutes a well-defined baseline.” Be sure to rush ahead to the next slide before anyone dares to ask why you don’t use a single GPGPU “core” to make the comparison even more “well-defined.”

To be clear, this is about the performance of accelerated codes (such as GPU or FPGA). I had to read that multiple times to come up with the most charitable explanation. What he's really arguing against is framing: it's not valid to claim a 200x performance increase if your baseline is the sequential version on a single core. The graph he shows is actually about a 2.3x performance improvement (comparing the GPU with ECC on against a CPU using 12 cores). But as mpweiher points out, I think you absolutely need to include the sequential version in your comparison set: the sequential performance is your "table stakes" baseline. If you can never beat it, then I don't care what your scalability on the device is! It should be there as a sanity check: yes, this approach is better than just using a single thread on the CPU.

Where it gets interesting is that whether or not the sequential version wins is often dependent on problem size. In that case, the sequential version needs to be included because even if "large enough" problem sizes benefit from the accelerator, there may be real problem sizes that are not large enough. I think Hager is also coming from the world of HPC, where it's okay to focus on particular kernels instead of end-to-end application performance. This is the case in the linked post; the performance graph is for a matrix multiply, not an application which uses matrix multiply for a larger result. And end-to-end application performance often requires factoring in some of the other bullet points (such as device communication) that focusing on kernels side-steps.

smitherfield
As a bit of a layman with this stuff, I'm wondering if "running the parallel version on a single CPU" means running a parallel algorithm (which is different from a sequential algorithm) on one CPU, or if the sequential and the single-threaded parallel algorithms are equivalent, and it's measuring the difference between e.g. 8 threads per 1 CPU vs. 1 thread per 8 CPUs (in which case, obviously, context switches would account for the performance difference).
scott_s
When someone says "running the parallel version on a single CPU" they usually mean running the parallel version with 1 thread on 1 CPU. Comparing that to the sequential (non-parallel) version allows you to measure the baseline overhead cost of parallelization. There are also non-baseline overhead costs of parallelization, which usually come in the form of synchronization overhead between threads or processes.
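
A minimal sketch of that measurement, assuming GHC and the parallel package (the `expensive` workload and the executable name are invented for illustration): build with -threaded, time the sequential path, then time the parallel path run with "+RTS -N1" so it uses a single core; the gap between the two is the baseline parallelization overhead described above.

    import Control.Parallel.Strategies (parMap, rdeepseq)
    import System.Environment (getArgs)

    -- A deliberately slow function so each list element is real work.
    expensive :: Int -> Int
    expensive n = length (show (product [1 .. toInteger n]))

    -- Sequential (non-parallel) version.
    seqVersion :: [Int] -> Int
    seqVersion = sum . map expensive

    -- Parallel version: one spark per element.
    parVersion :: [Int] -> Int
    parVersion = sum . parMap rdeepseq expensive

    main :: IO ()
    main = do
      args <- getArgs
      let input = [1000 .. 1400]
      case args of
        ["seq"] -> print (seqVersion input)   -- time ./bench seq
        _       -> print (parVersion input)   -- time ./bench par +RTS -N1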
ergl
I had a similar realization when I read about STM performance [1], where almost all benchmarks showed considerable slowdowns for the parallel+STM implementations when compared against sequential solutions.

[1]: http://queue.acm.org/detail.cfm?id=1454466

inetknght
> This video is not available.

> Sorry about that.

Oh, Youtube...

adrianN
See also "Scalability! But at What COST?" https://www.usenix.org/system/files/conference/hotos15/hotos...
mpweiher
Ahh, yes, the Configuration that Outperforms a Single Thread. I think this should be a standard + required measurement.
The "FP for multi-core" trope is a red-herring, at least so far. Yes, immutability makes parallelizing easier. It also makes things in general so much more expensive that we're still net-negative by a large margin.

For example, Simon Peyton Jones gave a talk[1] about Data Parallel Haskell, which after many years of development was still slower on 6 cores than C on a single core.

[1] https://www.youtube.com/watch?v=NWSZ4c9yqW8

Symmetry
I think the state of the art for Haskell has improved a lot in terms of scientific computing since 2010?

http://research.microsoft.com/en-us/um/people/simonpj/papers...

mpweiher
Glad to hear that they've improved enough to look good on their own benchmarks :-)
Well, nested data parallelism generally applies to hierarchical structures like trees, whereas flat data parallelism applies to things like arrays and whatnot. You might want to watch this (similar in some ways) talk by SPJ on Data Parallel Haskell http://youtu.be/NWSZ4c9yqW8
LOL! My 1st year university course was in 1989. When were you born?

Interestingly, that is exactly what I see in the current generation of FP hypsters. Grandiose claims based on very limited experience. Aka: know so little, assume so much.

But don't worry, it happens to the best. My favorite is still the Simon Peyton-Jones talk on Data Parallel Haskell, where he can't help himself, but just has to make the snide comment "this is only possible in a language like Haskell". Hand raised in the auditorium: The HPC community has been doing this for years. In FORTRAN. But carry on...

(And of course, the results are dismal: 6 cores required to match sequential C! Considering that this sort of concurrency [there are others] is exclusively for better performance, that is a damning result).

Despite these, er, issues, it is an absolutely fascinating talk, and the techniques look quite worthwhile. After all, they've been in use in the HPC community for some time :-)

https://www.youtube.com/watch?v=NWSZ4c9yqW8

Sep 06, 2014 · sgrove on Rhine – A Lisp on LLVM
I believe Haskell compilers are pretty amenable to stuff like this - watching "Data Parallel Haskell", it was interesting to see how they actually map out contiguous memory, etc. https://www.youtube.com/watch?v=NWSZ4c9yqW8

Sadly, it doesn't seem to be as tightly integrated as what you'd see in SBCL from a developers PoV, but I'm certainly intrigued by their ability to achieve some very specific implementation while separating it from the expression of the algorithms.

Jun 28, 2014 · mpweiher on Teenage Haskell
You make some good points, but I don't quite buy them.

Turbo Pascal for example was written by a single guy, Anders Hejlsberg, in about 2 years for several platforms in assembly language, and included an integrated (though primitive) text editor/IDE.

Smalltalk was taken from ST-72 to ST-80 in 8 years, with small teams that also had to invent/implement complete VMs, operating systems, bitmap graphics engines, GUI framework(s) and sophisticated IDEs. You write that it would be nice to have an FP IDE so it could give you better feedback (or were you talking about an existing FP IDE?). These guys built the live IDE (even better than types for direct feedback) in a fraction of the time. Does OPAL still compile to C? Does it still use Tk?

Think about Ruby, Python, etc.

I could go on (the VPRI stuff is also amazing), but I think you get the idea: it seems that not only can you be a productive member of society without FP, it sure looks like people can be a lot more productive without FP than with it.

Java is not really a good example of anything. Certainly not of an OOPL: "Java is the most distressing thing to happen to computing since MS-DOS." "If the pros at Sun had had a chance to fix Java, the world would be a much more pleasant place. This is not secret knowledge. It’s just secret to this pop culture." — Alan Kay (who coined the term "object oriented")

I don't quite get why you see LISP as an example of the power of FP; it is a multi-paradigm language with some functional elements, and most of its power comes, AFAIK, from its meta-system.

And HOFs like map or fold have been in other languages since the dawn of time; they're certainly in Smalltalk.

In terms of rhetoric, Peter Pepper might seem like an outlier, but as far as I can tell he is not. You can see it in this very thread: if you don't buy the self-evident awesomeness, it must be because "you didn't understand" and "refuse to understand", it cannot possibly be because you legitimately have a different point of view.

Another example. I watched the following talk by Simon Peyton Jones on data parallel Haskell:

https://www.youtube.com/watch?v=NWSZ4c9yqW8

Interesting stuff. But he has to go and say it: "this is only possible in a pure FP language like Haskell". Hand raised in the auditorium. "Actually, this sort of thing is pretty common in the HPC community, in FORTRAN [under some specific name]". And of course he can't just shut up and read up on things he obviously doesn't know about, he has to retort: "well, that means that it is nothing but a variant of Haskell at this point". Ouch! And finally, at the end of the talk he gives some numbers. It turns out that this amazing thing that's only possible in Haskell needs 6 cores to match the performance of a single-core C program. 6:1. That's absolutely pathetic, especially when compared to what the HPC FORTRAN guys are doing. So maybe he shouldn't be lecturing them; they know this stuff a lot better than he does.

Now don't get me wrong, the talk was interesting and thought provoking, but the mismatch between grandness of the rhetoric and the paucity of the results was its usual epically proportioned self.

This is both a straw man (OP asked for limited help, not omnipotence), and predicting the future by drawing a line from the past.

Functional compilers that outperform hand tuned C++ on many core machines may be possible, but very hard. Just because everyone has previously underestimated the difficulty does not mean it's an idea worthy of derision.

I think it's much more likely that it's perpetually ten years away, until one day it's not (like the current, possibly temporary end to Moore's Law).

Watching Simon Peyton Jones' talk on data parallel Haskell, I'd not make a bet against high performance Haskell's eventual success, especially as core counts continue to climb, and NUMA gets more NU.

http://www.youtube.com/watch?v=NWSZ4c9yqW8

SolarNet
We're already there [1] for vectorized code (Haskell vs. hand-written C). Note how they compare naive Haskell solutions with their vector library vs. heavily optimized C code using finely tuned libraries. And they are competitive, actually beating the finely tuned library in non-toy examples while still using naive Haskell.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.278...
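
For a concrete sense of what "naive Haskell with the vector library" looks like (my own toy example, not one from the paper): pipelines over Data.Vector.Unboxed like the dot product below are fused by the library into a single loop over unboxed memory, which is what lets straightforward code get close to hand-written C.

    import qualified Data.Vector.Unboxed as U

    -- Naive-looking dot product; zipWith and sum fuse into one loop,
    -- with no intermediate vector allocated.
    dotp :: U.Vector Double -> U.Vector Double -> Double
    dotp xs ys = U.sum (U.zipWith (*) xs ys)

    main :: IO ()
    main = do
      let n  = 1000000 :: Int
          xs = U.generate n fromIntegral
          ys = U.replicate n 2.0
      print (dotp xs ys)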

tl;dr: "we remark that all 3 languages provide easy-to-use, high-level and efficient support for parallelism. Haskell has an edge in its rich libraries and aggressive optimisation. Through laziness, top-level parallelism can be specified in a more modular way, minimising the need to modify components and avoiding any notion of explicit threads. Thus, Haskell is the language of choice for minimally intrusive parallelism. F# and Scala provide several mechanisms for parallelisation, some as high level as Haskell’s, some lower level with more potential for performance tuning. Thus, these languages provide a more direct route to controlling low level aspects and can use object-oriented and impure language features."

It would be cool to see a similar contrast between Data Parallel Haskell (http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell / http://www.youtube.com/watch?v=NWSZ4c9yqW8) and others.

Dude, you just DID NOT insult Haskell!

[Sort of relevant]

Apparently SPJ was once asked by a customs official in America if an American citizen could do what he did. Someone with him announced that his work as the creator of ML made him irreplaceable.

So you see, ML's got fanboys in high places. Even close to SPJ apparently.

Source for SPJ Koan: http://www.youtube.com/watch?v=NWSZ4c9yqW8 (around the 4 min mark).

chrisdone
FWIW you can link to a specific time in a YouTube video like this: http://www.youtube.com/watch?v=NWSZ4c9yqW8#t=3m0s
Here's the main Haskell hero, Simon Peyton Jones giving a talk about data parallel Haskell, which I think is a promising language abstraction to do hard things with parallelism relatively easily:

http://www.youtube.com/watch?v=NWSZ4c9yqW8

As for techniques that can be used now: Haskell's `par` and `pseq` combinators allow adding parallelization to "hard" code after-the-fact pretty easily.
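
As a small, hedged illustration of that after-the-fact style (the classic parallel Fibonacci, not an example from the linked talk): `par` sparks one recursive call for possible evaluation on another core while `pseq` forces the other on the current thread, and removing either combinator changes only the running time, never the result.

    import Control.Parallel (par, pseq)

    -- Plain sequential version.
    fib :: Int -> Integer
    fib n | n < 2     = fromIntegral n
          | otherwise = fib (n - 1) + fib (n - 2)

    -- Parallelized after the fact: spark the first recursive call, force
    -- the second here, then combine. Below a threshold, stay sequential
    -- so the sparks aren't too fine-grained to be worth it.
    pfib :: Int -> Integer
    pfib n
      | n < 25    = fib n
      | otherwise = a `par` (b `pseq` (a + b))
      where
        a = pfib (n - 1)
        b = pfib (n - 2)

    main :: IO ()
    main = print (pfib 35)   -- build with -threaded, run with +RTS -N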

SeanLuke
Wasn't the promise of Haskell the idea that parallelism would be implicit? What are operators like those doing there?
Peaker
No, that was never "the promise" of Haskell.

Making automatic implicit parallelism in Haskell is trivial - but it will yield too much parallelism and the overhead will trump the benefits.

The `par` and `pseq` combinators allow the programmer to specify which computations to parallelize explicitly to avoid parallelizing computations that are too small to be worth it and to allow the programmer to care for data-locality.

Despite being explicit, they are still far easier than other explicit parallelism mechanisms such as explicit threads because they are:

* So easy to throw in the program

* Guaranteed not to alter the semantics of your program - so you can just throw them in there and profile the performance changes -- and you know they won't break your program. They can alter the performance for better or worse.

pjscott
On a few occasions, I've taken a program and added a few functions from Control.Parallel, and five minutes later, I had a significantly faster program running on several cores. It was surprisingly easy.

Granted, I was going after the low-hanging-fruit cases, but even so, I was impressed.

While this isn't undeniable proof, it's interesting nonetheless: Simon Peyton Jones' talk on Data Parallel Haskell.

http://www.youtube.com/watch?v=NWSZ4c9yqW8

He talks about how this scheme can parallelize non-trivial data structures, and how tedious bookkeeping can be avoided. I haven't used it yet in a project, so I can't speak to its efficacy - but worth a watch, imho.

May 07, 2010 · 34 points, 0 comments · submitted by alrex021
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.