Hacker News Comments on
Dataflow: A Unified Model for Batch and Streaming Data Processing
@Scale · YouTube · 37 HN points · 0 HN comments
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.
⬐ buremba
It's strange that the service looks quite promising, but I don't know of any company that uses it. Is it not mature enough yet (which is admittedly a strange argument for a managed service), or is it hard to use? I skimmed the documentation, but the pricing model seemed unclear compared to AWS's managed services.
⬐ jkff
Hi! Dataflow team member here.
On maturity: Dataflow is built by the people who, over the past 12 years, created MapReduce, FlumeJava and FlumeC++, Pregel, and MillWheel, and it draws on the experience gained from all of these. It shares most of the back-end stack (work scheduling, pipeline optimization, fault tolerance, etc.) with FlumeJava and FlumeC++ for batch jobs and with MillWheel for streaming jobs, all of which are used extensively inside Google for data processing, and it shares a lot of the Java SDK with FlumeJava.
On usage: the blog post that announced General Availability of Dataflow lists a few public customer testimonials: http://googlecloudplatform.blogspot.com/2015/08/Announcing-G... . Quite a few blog posts by other users can also be found via https://www.reddit.com/r/dataflow .
Happy to answer additional questions. If I forget to check on this thread, feel free to ask on [email protected] or on StackOverflow with tag google-cloud-dataflow - we constantly monitor these and usually answer everybody.
⬐ alooPotato
We use it here at Streak for streaming log processing. We have it set up so that our backend and client-side logs are streamed to our "Dataflow job", where we do some pretty simple processing/transformations, and the output then goes to BigQuery. It sounds simple, but there is a lot of complexity it's hiding when you're streaming at a large enough scale. We know because we built our own infrastructure for this at first and it sucked.
As for pricing, it just consumes your regular Google Compute Engine instances, so it's all based on how big your jobs are and how long they run for.