Hacker News Comments on "Dataflow: A Unified Model for Batch and Streaming Data Processing" @Scale Youtube Video



@Scale · Youtube · 37 HN points · 0 HN comments

HN Theater has aggregated all Hacker News stories and comments that mention @Scale's video "Dataflow: A Unified Model for Batch and Streaming Data Processing".

Youtube Summary

Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves. On top of that -- consumers want answers *now*. This talk will cover how Google has evolved its earlier work on batch and streaming systems (including MapReduce, FlumeJava, and Millwheel) into Dataflow, a new programming model that allows users to clearly trade off correctness, latency, and cost. An overview of this model will be provided, including a demo of the fully managed service it enables, and a discussion on some of the many use cases that got Google here.

Presenter
Frances Perry


Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.

Google Dataflow: A Unified Model for Batch and Streaming Data Processing [video]


Oct 23, 2015 · 35 points, 3 comments · submitted by espeed

⬐ buremba
It's strange: the service looks quite promising, but I don't know of any company that uses it. Is it not mature enough (which is admittedly a strange argument against a managed service), or is it hard to use? I skimmed the documentation, but the pricing model seemed unclear compared to AWS's managed services.

⬐ jkff
Hi! Dataflow team member here.
On maturity: Dataflow is built by the people who over the past 12 years created MapReduce, FlumeJava and FlumeC++, Pregel and Millwheel, based on the sum of experience obtained from all of these. It shares most of the back-end stack (work scheduling, pipeline optimization, fault tolerance etc.) with FlumeJava and FlumeC++ for batch jobs and Millwheel for streaming jobs, all of which are extensively used inside Google for data processing, and it shares a lot of the Java SDK with FlumeJava.
On usage: the blog post that announced General Availability of Dataflow lists a few public customer testimonials: http://googlecloudplatform.blogspot.com/2015/08/Announcing-G... . Quite a few blog posts by other users can also be found via https://www.reddit.com/r/dataflow .
Happy to answer additional questions. If I forget to check on this thread, feel free to ask on [email protected] or on StackOverflow with tag google-cloud-dataflow - we constantly monitor these and usually answer everybody.

⬐ alooPotato
We use it here at Streak for streaming log processing. We have it set up so that our backend and client-side logs are streamed to our "dataflow job", where we do some pretty simple processing/transformations, and the output is written to BigQuery. It sounds simple, but there is a lot of complexity it's hiding when you're streaming at a large enough scale. We know because we built our own infrastructure for this at first and it sucked.
As for pricing, it just consumes your regular Google Compute Engine instances, so it's all based on how big your jobs are and how long they run for.

Google Dataflow a Unified Model for Batch and Streaming Data Processing

Sep 16, 2015 · 2 points, 0 comments · submitted by patangay
