HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Distributed Systems in One Lesson by Tim Berglund

Devoxx Poland · YouTube · 2 HN points · 1 HN comment
HN Theater has aggregated all Hacker News stories and comments that mention Devoxx Poland's video "Distributed Systems in One Lesson by Tim Berglund".
YouTube Summary
Normally simple tasks like running a program or storing and retrieving data become much more complicated when we start to do them on collections of computers, rather than single machines. Distributed systems has become a key architectural concern, and affects everything a program would normally do—giving us enormous power, but at the cost of increased complexity as well.

Using a series of examples all set in a coffee shop, we’ll explore topics like distributed storage, computation, timing, messaging, and consensus. You'll leave with a good grasp of each of these problems, and a solid understanding of the ecosystem of open-source tools in the space.

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Oct 18, 2020 · 1 point, 0 comments · submitted by anderspitman
https://www.youtube.com/watch?v=Y6Ev8GIlbxc&t=28m15s

Just saw this, and it might explain why Hadoop is out of the spotlight. In summary, Spark and Kafka seem to be better? I'm not sure, as I'm just starting to enter this field.

nathairtras
Getting this out of the way first: beyond some limited Amazon EMR work, I've only just started exploring non-Hadoop/non-HDFS Spark execution, but I'm interested in learning more about it. What follows is a combination of work experience and armchair research in the evenings; I'm not claiming to be an expert.

I have grown to really appreciate Spark in the Hadoop space. I started with plans to go with Impala, then went to Hive due to stability concerns, and finally to Spark for its speed and flexibility. You can write code against a DataFrame or write Spark SQL, so you still have SQL.

HDFS has benefits over other storage approaches: if you are running Spark in the same cluster, you get data proximity. You can go with a different storage back-end, but that costs you in performance. "Performance of multiple query and enrichment jobs concurrently executed resulted in 90% longer execution times." https://redhatstorage.redhat.com/2018/06/25/why-spark-on-cep...
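
To make the data-proximity point concrete, here is a toy sketch of locality-aware task placement. It is not Spark's or YARN's actual scheduler; the `Node` class, the `place_tasks` function, and the node and block names are all hypothetical, chosen only for illustration. The idea it shows is simply "run the task where the block already lives if a slot is free there, otherwise read the block over the network."

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical worker that can run a limited number of tasks."""
    name: str
    free_slots: int
    blocks: set = field(default_factory=set)  # ids of blocks stored on this node


def place_tasks(block_ids, nodes):
    """Assign one task per block, preferring a node that stores the block.

    Returns a dict mapping block id -> (node name, "local" or "remote").
    Blocks that cannot be placed because no slot is free are left out.
    """
    placement = {}
    for block in block_ids:
        candidates = [n for n in nodes if n.free_slots > 0]
        if not candidates:
            break  # cluster is full; remaining tasks would have to wait
        # Prefer nodes that hold the block. If none of them has a free slot,
        # fall back to any node with capacity (a remote read).
        local = [n for n in candidates if block in n.blocks]
        pool = local or candidates
        chosen = max(pool, key=lambda n: n.free_slots)
        chosen.free_slots -= 1
        placement[block] = (chosen.name, "local" if local else "remote")
    return placement


# Assumed example cluster: node-a stores b1 and b2 but has only one slot.
nodes = [
    Node("node-a", free_slots=1, blocks={"b1", "b2"}),
    Node("node-b", free_slots=2, blocks={"b3"}),
]
placement = place_tasks(["b1", "b2", "b3"], nodes)
for block, (node, kind) in placement.items():
    print(block, node, kind)
```

In this made-up cluster, the task for `b2` ends up on `node-b` and has to fetch its data remotely because `node-a` is already busy. With a storage back-end outside the compute cluster, every read is remote in this sense, which is one way to think about the performance cost mentioned above.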

Unless you really have BIG data, you're incurring a lot of maintenance overhead to support a cluster when you may do just fine without one.

I haven't had the freedom to explore other possibilities until recently, and I'm very interested in how Spark on k8s is working out. (The same comment could be made here as above and elsewhere: do you really need k8s? But I want to play with k8s and learn more about it, so... for that purpose I 'need' it.) https://spark.apache.org/docs/latest/running-on-kubernetes.h...

And there's always the cloud route. You can run an EMR job that uses files in S3. There is a cost, but you do not need to support a cluster in the same way. Or, if you're feeling adventurous, use Lambda. https://www.qubole.com/blog/spark-on-aws-lambda/

And Spark isn't the only option. I have started learning about Dask, which also looks very interesting for performing some of the same tasks. https://dask.org

Feb 11, 2019 · 1 point, 0 comments · submitted by mpiedrav
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.