Hacker News Comments on
A Billion Rows per Second: Metaprogramming Python for Big Data
thenewcircle.com · 109 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention thenewcircle.com's video "A Billion Rows per Second: Metaprogramming Python for Big Data".
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.
- iskander: Cool use of Numba. Does anyone know if there's more information available about their query language and what kinds of Python expressions they dynamically generate?
  - vtuulos: Sorry, not yet. I hope to be able to make more information available soon. The query language is very straightforward. More interestingly, this approach makes it easy to implement various algorithms for machine learning / data mining, at least compared to MapReduce.
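The talk doesn't publish the actual query language, but the general technique the thread is asking about can be sketched: generate Python source for an aggregation kernel from a query expression, then hand the generated function to Numba. The filter syntax and function names below are made up for illustration, not AdRoll's DSL:

```python
import numpy as np

try:
    from numba import njit               # JIT-compiles the kernel when Numba is installed
except ImportError:                      # pure-Python fallback so the sketch still runs
    def njit(func):
        return func

def compile_query(filter_expr):
    """Build an aggregation kernel from a textual filter expression.
    `filter_expr` uses a hypothetical syntax, e.g. 'country[i] == 2'."""
    src = (
        "def kernel(country, revenue):\n"
        "    total = 0.0\n"
        "    for i in range(len(country)):\n"
        f"        if {filter_expr}:\n"
        "            total += revenue[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)                 # metaprogramming step: source text -> function
    return njit(namespace["kernel"])     # Numba turns the loop into machine code

country = np.array([1, 2, 2, 3])
revenue = np.array([10.0, 20.0, 30.0, 5.0])
kernel = compile_query("country[i] == 2")
print(kernel(country, revenue))          # 50.0
```

The generated loop is the kind of code Numba optimizes well: a tight numeric loop over flat arrays, with the query logic baked in at compile time rather than interpreted per row.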
- lcampbell: From my understanding of the video, the queries are generated by analysts via a frontend-generated or hand-written SQL query. The SQL query is parsed by PostgreSQL, which forwards it via FDW [1] to Multicorn [2]. Their custom data storage and processing backend implements the API expected by Multicorn (e.g., you implement the multicorn.ForeignDataWrapper interface); this is where they transform the parsed, serialized SQL into their custom DSL (the metaprogramming bit), which compiles to LLVM.
  - dialtone: Yeah, that's pretty much how it works. The frontend is anything that supports PostgreSQL as a database. Right now we use Tableau, but we also used to have a custom WebUI on top of this service; it was very functional-inspired. Unfortunately Ville is on vacation right now, otherwise he'd be glad to dive more into the details of how that piece worked.
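The Multicorn end of the pipeline lcampbell describes can be sketched as a small foreign data wrapper. The `ForeignDataWrapper` base class and the `execute(quals, columns)` / `Qual(field_name, operator, value)` interface are Multicorn's; the `LogFDW` class, its in-memory rows, and the stub base class (which only exists so the sketch runs outside a PostgreSQL session) are hypothetical:

```python
try:
    from multicorn import ForeignDataWrapper
except ImportError:
    # Stand-in base class so the sketch runs without a PostgreSQL server.
    class ForeignDataWrapper:
        def __init__(self, options, columns):
            self.options = options
            self.columns = columns

class LogFDW(ForeignDataWrapper):
    """Hypothetical foreign table backed by an in-memory list of log rows."""

    def __init__(self, options, columns):
        super().__init__(options, columns)
        self.rows = [
            {"country": "FI", "revenue": 10.0},
            {"country": "US", "revenue": 20.0},
        ]

    def execute(self, quals, columns):
        # PostgreSQL has already parsed the SQL; Multicorn hands the WHERE
        # clauses over as Qual objects (field_name, operator, value). This
        # naive version interprets equality quals row by row; the backend in
        # the talk instead compiles the query down to LLVM at this point.
        for row in self.rows:
            if all(not (q.operator == "=" and row.get(q.field_name) != q.value)
                   for q in quals):
                yield {col: row[col] for col in columns}

fdw = LogFDW({}, ["country", "revenue"])
print(list(fdw.execute([], ["revenue"])))   # [{'revenue': 10.0}, {'revenue': 20.0}]
```

Once a class like this is registered as a foreign table, any PostgreSQL client (Tableau, psql, a web UI) can query it with plain SQL, which is what makes the frontend swappable.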
- iskander: Why did you guys choose to compile through Numba rather than directly to LLVM or C?
  - vtuulos: Numba was just the fastest way to get it working. LLVM(py) is very low-level. C is still an option, but Numba made interfacing with it a no-brainer.
- vtuulos: Author here: Fortunately HN works even in the Finnish countryside. I am happy to answer any questions.
  - dev360: Very inspiring talk! Is it possible to deal with continuous data in those matrices, or is it more oriented around discrete values?
    - vtuulos: Thanks! Our approach supports both discrete and continuous values. It is mainly optimized for the use case where you want to aggregate continuous variables over discrete filters.
  - iskander: Any interest in also trying Parakeet (https://github.com/iskandr/parakeet) for the backend? I'm curious to see how the performance would compare with Numba. I also have a semi-usable Builder API which constructs typed functions at a higher level than llvmpy.
- rtkwe: So in essence they do tons of pre-processing on their data. I wonder how long the pre-processing takes compared to the amount of speed gain it produces for them.
  - dialtone: Doesn't take so long, actually: about 1 hour per day of data, and every day we process about 10TB of uncompressed log files. The result of this can be stored and reused as many times as you need.
    - lmm: How does this compare to something like Spark/Shark?
    - lymie: Interesting, you're processing ~500GB (10TB / 24 hours) of uncompressed log lines in 1 hour? Is the setup the same as in the presentation?
- aborochoff: I keep thinking about these two buzzwords that came out of the last two elections... Social Media and Big Data... it's like Karl Rove and Richard Nixon. I can't help thinking the next BIG THING is going to be Total Awareness and Strom Thurmond.
- pjvds: My takeaway is that the reason this is possible is that they care about data structure. A language can give you an order of magnitude in performance, but - according to Ville - you can get an almost unbounded improvement if you rethink the algorithms and data structures.
  - mathattack: Structure of the database matters a lot. Going column-oriented [1] will improve performance dramatically before you scale up the processing power.
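The data-structure point can be made concrete in a few lines of NumPy. This is a minimal sketch with synthetic data (not the talk's): a column-oriented layout stores one contiguous array per field, so the "aggregate a continuous variable over a discrete filter" pattern vtuulos describes becomes a couple of vector operations instead of a per-row loop over objects:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Column-oriented ("struct of arrays") layout: one contiguous array per
# field rather than one Python object per row. Filters and aggregates
# then run as whole-array operations over cache-friendly memory.
country = rng.integers(0, 50, size=n)    # discrete dimension column
revenue = rng.random(n)                  # continuous measure column

mask = country == 7                      # discrete filter: one vector op
total = revenue[mask].sum()              # aggregate the continuous column
```

A row-oriented layout (a list of dicts, say) forces the same query through the interpreter once per row; the columnar version touches each column sequentially, which is also the layout a compiler like Numba or LLVM can vectorize.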