Hacker News Comments on
2012-04-07 - Lab-Tools FPGA Binning Co-processor for APL
LabToolsInstruments · YouTube · 25 HN points · 0 HN comments
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.

⬐ pflanze
I don't understand why binning 1 million values would take more than a second on the CPU. I've tested the following in C:

    int len = 1000000;
    int numbuckets = 1000;
    double* vec = ...; // malloc(sizeof(double)*len) and fill with values between 0..1
    unsigned long* res = calloc(numbuckets, sizeof(unsigned long));
    int i;
    for (i = 0; i < len; i++) {
        int ii = vec[i] * numbuckets;
        res[ii]++;
    }

This runs in 8 ms on my elderly laptop.
It would be slower if the individual bins weren't of equal size, but the examples shown in the video seem to suggest that they are; and even with about a 10 times slowdown for a binary search it would still be faster than the USB 2.0 transfer. Without using a GPU, multiple threads, or SIMD.
⬐ thesz
⬐ sparky
Make more buckets, make bins with different bounds (not all equal to 1/numbuckets). All of a sudden your algorithm will do much more work due to cache misses and the need to perform conditional jumps.
⬐ pflanze
That does not explain it. Yes, it will be slower with more buckets; but by how much? 40 ms is still >50 times faster than the "software algorithm" in the video. And he didn't use 40 million buckets there, as can be seen from the graphs.

    numbuckets   uint64   uint32
    1000         6 ms     5 ms
    10000        6 ms     5 ms
    100000       6 ms     7 ms
    1000000      14 ms    8 ms
    10000000     38 ms    32 ms
    20000000     38 ms
    40000000     40 ms
    80000000     44 ms
> make bins with different bounds
You'd have to implement this and use a huge number of buckets, and likely still use an older CPU than he did, to get to his software timings. Or use a slow language implementation (interpreted, or boxing arithmetic results). Or implement a slow algorithm like a linear search over the buckets (which would explain the roughly factor-2 difference between his two algorithms: the faster one stops once the bucket is found). In any of these cases, pending a better explanation, he either doesn't know how to write fast code or deliberately left the software version far slower than necessary to make the FPGA version shine.
⬐ pflanze
I've now implemented the binary search version; I was right, the factor of 10 was about correct. Even with 20 million buckets the algorithm is still faster than the FPGA, and more than 3 times faster than his faster software algorithm, even though he's got a newer CPU (I'm running it on a Core 2 Duo (single-threaded) at 2.5 GHz). And again, he's really using something between about 1000..100000 buckets, which would make my program on my machine between 17 and 32 times faster than his. (Results in my previous post were with gcc -O1.)

    buckets      gcc -O1   gcc -O3
    1000         75 ms     72 ms
    10000        105 ms    102 ms
    100000       135 ms    136 ms
    1000000      243 ms    240 ms
    10000000     580 ms
    20000000     693 ms
All results so far were with random numbers with an uneven distribution: vec[i] = (i / len) * random_between(0., 1.). Here are the results for the binary search with vec[i] = random_between(0., 1.):
(He's using a Gaussian distribution, not a uniform one.)

    buckets      gcc -O1
    1000         82 ms
    10000        115 ms
    100000       148 ms
    1000000      360 ms
    10000000     735 ms
    20000000     860 ms
Here's the code:
(Embedded within a Scheme driver program: http://christianjaeger.ch/scratch/binning.scm)

    int nvals = ..;
    double* vals = ..;
    int nbuckets = ..;
    double* buckets = ..;
    unsigned int* res = ..;
    int i;
    for (i = 0; i < nvals; i++) {
        double val = vals[i];
        int bi;
        {
            int lo = 0, hi = nbuckets;
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (val < buckets[mid]) {
                    hi = mid;
                } else {
                    lo = mid + 1;
                }
            }
            bi = lo;
        }
        res[bi]++;
    }
Aliasing conflict! This is about doing histogramming on an FPGA, not a tool to group FPGA dice into speed grades :)
Would be interesting to see how an FPGA stacks up against a GPU on this problem. GPUs are very fast at parallel histogramming, but hardwiring the number-to-bin-index computation for a particular problem instance might buy you quite a bit of energy efficiency, or even possibly a bit of performance if the computation involves a divide. If the bins are of non-uniform size and the indexing computation involves a binary search on a lookup table, it seems like the much higher clock speeds on a GPU would win out.
⬐ sitkack
GPU clocks aren't really that high. FPGAs win when it comes to combinatorial logic; if statements kill GPUs. My money is on the FPGA by a superlative.
⬐ TD-Linux
Most FPGAs can't handle very high clocks either (~400 MHz max; 100 MHz is more reasonable for the Xilinx Artix chips). A huge fanout with a ton of comparators is not going to run that fast. I imagine binning with an ISA with conditional execution wouldn't be that bad as far as jump penalty goes. Even with jumps, as long as they are predicted correctly it's fine.
The big limitation here is probably the USB 2.0 speed - 40MB/s is not a lot compared to a CPU's bandwidth to main memory.
⬐ sitkack
GPUs just recently hit 800+ MHz; they were at 600 MHz for a long while. As for the USB bottleneck, I am speaking generically and assuming an equal footing (memory bandwidth, bus, etc.); the specific device (binning on an FPGA) as a co-processor for a computer system doesn't really make sense from a speed standpoint. I saw this more as a proof of concept for augmenting APL with an FPGA. Of course, APL itself should just get compiled down to the GPU. It does make for a great benchmark. Eventually FPGA fabric and GPU compute will merge into an oatmeal-with-raisins consistency. I'd argue that with embedded multipliers, FPGAs are already there. Altera has an OpenCL SDK.