HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Dataflow: A Unified Model for Batch and Streaming Data Processing

@Scale · Youtube · 37 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention @Scale's video "Dataflow: A Unified Model for Batch and Streaming Data Processing".
Youtube Summary
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves. On top of that -- consumers want answers *now*. This talk will cover how Google has evolved its earlier work on batch and streaming systems (including MapReduce, FlumeJava, and Millwheel) into Dataflow, a new programming model that allows users to clearly trade off correctness, latency, and cost. An overview of this model will be provided, including a demo of the fully managed service it enables, and a discussion on some of the many use cases that got Google here.
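As a rough illustration of the event-time windowing the summary mentions, the sketch below groups out-of-order events into fixed windows using the timestamp each event carries rather than its arrival order. It is plain Python, not the Dataflow SDK; the function name `fixed_window_counts`, the `(event_time, payload)` tuple layout, and the 60-second window are assumptions made for the example. It also processes a finite list in one pass, so it does not model the latency and cost trade-offs the talk covers.

```python
from collections import defaultdict


def fixed_window_counts(events, window_seconds):
    """Count events per fixed event-time window.

    events: iterable of (event_time_seconds, payload), in any arrival order.
    Returns a dict mapping each window's start time to its event count.
    (Hypothetical helper for illustration; not part of any Dataflow API.)
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Assign the event to the window containing its own timestamp,
        # regardless of when it arrived.
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Events listed in arrival order, which differs from event-time order.
arrivals = [(65, "c"), (5, "a"), (130, "e"), (59, "b"), (70, "d")]
result = fixed_window_counts(arrivals, 60)
print(result)
```

Because the grouping key is derived from the event's timestamp, shuffling `arrivals` leaves the result unchanged, which is the property event-time ordering is meant to provide.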

Presenter
Frances Perry

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Oct 23, 2015 · 35 points, 3 comments · submitted by espeed
buremba
It's strange: the service looks quite promising, but I don't know of any company that uses it. Is it not mature enough (admittedly a strange argument against a managed service), or is it hard to use? I skimmed the documentation, but the pricing model seemed unclear compared to AWS's managed services.
jkff
Hi! Dataflow team member here.

On maturity: Dataflow is built by the people who over the past 12 years created MapReduce, FlumeJava and FlumeC++, Pregel and Millwheel, based on the sum of experience obtained from all of these. It shares most of the back-end stack (work scheduling, pipeline optimization, fault tolerance etc.) with FlumeJava and FlumeC++ for batch jobs and Millwheel for streaming jobs, all of which are extensively used inside Google for data processing, and it shares a lot of the Java SDK with FlumeJava.

On usage: the blog post that announced General Availability of Dataflow lists a few public customer testimonials: http://googlecloudplatform.blogspot.com/2015/08/Announcing-G... . Quite a few blog posts by other users can also be found via https://www.reddit.com/r/dataflow .

Happy to answer additional questions. If I forget to check on this thread, feel free to ask on [email protected] or on StackOverflow with tag google-cloud-dataflow - we constantly monitor these and usually answer everybody.

alooPotato
We use it here at Streak for streaming log processing. We have it set up so that our backend and client-side logs are streamed to our "dataflow job", where we do some pretty simple processing/transformations, and the results are then written to BigQuery. It sounds simple, but there is a lot of complexity it's hiding when you're streaming at a large enough scale. We know because we built our own infrastructure for this at first and it sucked.

As for pricing, it just consumes your regular Google Compute Engine instances, so it's all based on how big your jobs are and how long they run for.

Sep 16, 2015 · 2 points, 0 comments · submitted by patangay
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.