HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
A Billion Rows per Second: Metaprogramming Python for Big Data

thenewcircle.com · 109 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention thenewcircle.com's video "A Billion Rows per Second: Metaprogramming Python for Big Data".
Watch on thenewcircle.com
thenewcircle.com Summary
ProTech provides technical training including Microsoft, Linux, Java, Oracle, IBM, Project Management, VMware, Perl, Internet Security & more.
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Sep 27, 2013 · 109 points, 16 comments · submitted by BuffaloSweat
iskander
Cool use of Numba. Does anyone know if there's more information available about their query language and what kinds of Python expressions they dynamically generate?
vtuulos
Sorry, not yet. I hope to be able to make more information available soon.

The query language is very straightforward. More interestingly, this approach makes it easy to implement various algorithms for machine learning / data mining, at least compared to MapReduce.
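The DSL itself has not been published, but the general technique the talk describes, generating Python source for a query and JIT-compiling it with Numba so the scan loop runs at native speed, can be sketched roughly as follows (the toy schema and all names are illustrative, not from the talk):

    import numba
    import numpy as np

    def compile_query(filter_expr):
        # filter_expr is a Python expression over row i of the columns,
        # e.g. "country[i] == 1"; in a real system it would be produced
        # from a parsed query, not written by hand.
        src = (
            "def query(country, revenue):\n"
            "    total = 0.0\n"
            "    for i in range(country.shape[0]):\n"
            f"        if {filter_expr}:\n"
            "            total += revenue[i]\n"
            "    return total\n"
        )
        namespace = {}
        exec(src, namespace)
        # JIT-compile the generated function to machine code via LLVM
        return numba.njit(namespace["query"])

    country = np.random.randint(0, 5, 10_000_000)
    revenue = np.random.rand(10_000_000)
    q = compile_query("country[i] == 1")
    print(q(country, revenue))  # compiled on first call, fast thereafter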

lcampbell
From my understanding of the video, analysts issue queries as frontend-generated or hand-written SQL. The SQL is parsed by PostgreSQL, which forwards it via FDW[1] to Multicorn[2]. Their custom data storage and processing backend implements the API Multicorn expects (i.e., the multicorn.ForeignDataWrapper interface); this is where they transform the parsed, serialized SQL into their custom DSL (the metaprogramming bit), which compiles to LLVM. (A minimal sketch of the Multicorn side follows the links below.)

--

[1] http://wiki.postgresql.org/wiki/Foreign_data_wrappers

[2] http://multicorn.org/
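To make that pipeline concrete, a minimal Multicorn wrapper looks roughly like this. The subclassing and the execute() signature follow Multicorn's documented interface; the backend logic is a hypothetical stand-in for their DSL/LLVM layer:

    from multicorn import ForeignDataWrapper

    class BillionRowsFDW(ForeignDataWrapper):
        def __init__(self, options, columns):
            super(BillionRowsFDW, self).__init__(options, columns)
            self.columns = columns

        def execute(self, quals, columns):
            # quals carry the WHERE clauses PostgreSQL pushed down; a real
            # backend would translate them into its DSL and compile via
            # LLVM instead of printing them.
            for qual in quals:
                print(qual.field_name, qual.operator, qual.value)
            # rows are returned as dicts of column name -> value
            yield {c: None for c in columns}

On the PostgreSQL side the wrapper is registered with CREATE SERVER and CREATE FOREIGN TABLE, after which any SQL frontend (Tableau included) can query it like an ordinary table.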

dialtone
Yeah, that's pretty much how it works. The frontend is anything that supports PostgreSQL as a database. Right now we use Tableau, but we also used to have a custom WebUI on top of this service; it was very functional-inspired.

Unfortunately Ville is on vacation right now, otherwise he'd be glad to dive more into the details of how that piece worked.

iskander
Why did you guys choose to compile through Numba rather than directly to LLVM or C?
vtuulos
Numba was just the fastest way to get it working. LLVM(py) is very low-level. C is still an option, but Numba made interfacing a no-brainer.
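For contrast, the Numba path is essentially one decorator on a typed loop, versus hand-assembling IR with llvmpy. A minimal illustration (the function is illustrative, not from the talk):

    import numba
    import numpy as np

    @numba.njit  # compiled through LLVM to machine code on first call
    def dot(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * b[i]
        return total

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    print(dot(a, b))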
vtuulos
Author here: Fortunately HN works even in the Finnish countryside. I am happy to answer any questions.
dev360
Very inspiring talk! Is it possible to deal with continuous data in those matrices or is it more oriented around discrete values?
vtuulos
Thanks! Our approach supports both discrete and continuous values. It is mainly optimized for the use case where you want to aggregate continuous variables over discrete filters.
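A small NumPy illustration of that use case (the column names are hypothetical): sum a continuous column over rows selected by a discrete one.

    import numpy as np

    segment = np.random.randint(0, 4, 1_000_000)  # discrete filter column
    revenue = np.random.rand(1_000_000)           # continuous value column

    mask = segment == 2
    print(revenue[mask].sum())                    # aggregate for one segment
    print(np.bincount(segment, weights=revenue))  # all segments at once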
iskander
Any interest in also trying Parakeet (https://github.com/iskandr/parakeet) for the backend? I'm curious to see how the performance would compare with Numba. I also have a semi-usable Builder API which constructs typed functions at a higher-level than llvmpy.
rtkwe
So in essence they do tons of pre-processing on their data. I wonder how long the pre-processing takes compared to the speed gains it produces for them.
dialtone
It doesn't take that long, actually: about 1 hour per day of data, and every day we process about 10TB of uncompressed log files. The result can be stored and reused as many times as you need.
lmm
How does this compare to something like spark/shark?
aborochoff
Interesting, you're processing ~500GB (10TB / 24 hours) of uncompressed log lines in 1 hour? Is the setup the same as in the presentation?
lymie
I keep thinking about these two buzzwords that came out of the last two elections... Social Media and Big Data... it's like Karl Rove and Richard Nixon. I can't help thinking the next BIG THING is going to be Total Awareness and Strom Thurmond.
pjvds
My takeaway is that the reason this is possible is that they care about data structures. A language can give you an order-of-magnitude performance improvement, but, according to Ville, you can get almost unbounded improvement if you rethink the algorithms and data structures.
mathattack
Structure of the database matters a lot. Going column-oriented [1] will improve performance dramatically before you even scale up the processing power (see the sketch after the link).

[1] http://en.wikipedia.org/wiki/Column-oriented_DBMS
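A quick NumPy sketch of the effect: scanning one column of a row-oriented (C-order) table strides through memory, while a columnar layout scans contiguously and vectorizes cleanly. The table here is synthetic.

    import numpy as np

    n_rows, n_cols = 5_000_000, 20
    row_store = np.random.rand(n_rows, n_cols)     # row-oriented table
    col_store = np.ascontiguousarray(row_store.T)  # columnar copy

    strided = row_store[:, 3].sum()  # steps n_cols * 8 bytes per element
    contig = col_store[3].sum()      # contiguous scan, cache/SIMD friendly
    assert np.isclose(strided, contig)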

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.