HN Books @HNBooksMonth

The best books of Hacker News.

Hacker News Comments on
Big Data: Principles and best practices of scalable realtime data systems

Nathan Marz, James Warren · 2 HN comments
HN Books has aggregated all Hacker News stories and comments that mention "Big Data: Principles and best practices of scalable realtime data systems" by Nathan Marz, James Warren.
Amazon Summary
Summary

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Book

Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size, or speed. Fortunately, scale and simplicity are not mutually exclusive.

Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases.

This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.

What's Inside

- Introduction to big data systems
- Real-time processing of web-scale data
- Tools like Hadoop, Cassandra, and Storm
- Extensions to traditional database skills

About the Authors

Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems.
James Warren is an analytics architect with a background in machine learning and scientific computing.

Table of Contents

- A new paradigm for Big Data

PART 1 BATCH LAYER

- Data model for Big Data
- Data model for Big Data: Illustration
- Data storage on the batch layer
- Data storage on the batch layer: Illustration
- Batch layer
- Batch layer: Illustration
- An example batch layer: Architecture and algorithms
- An example batch layer: Implementation

PART 2 SERVING LAYER

- Serving layer
- Serving layer: Illustration

PART 3 SPEED LAYER

- Realtime views
- Realtime views: Illustration
- Queuing and stream processing
- Queuing and stream processing: Illustration
- Micro-batch stream processing
- Micro-batch stream processing: Illustration
- Lambda Architecture in depth

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this book.
The Lambda architecture for data processing, as popularized by Nathan Marz et al. [0], has two processing components, the Batch layer and the Stream (speed) layer, alongside a serving layer that indexes the batch output for queries. At a high level, Batch accepts staleness in exchange for quality, whilst Stream optimises for freshness at the expense of quality [1].

I believe what GP means by Lambda is that you'd need a system that batch-processes the data that has to be amended or changed (that is, reprocesses older data) but stream-processes whatever is required in real time [2].
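
To make the split concrete, here is a minimal sketch of a Lambda-style query that merges a batch view with a real-time view. It is a toy model under stated assumptions, not code from the book: the event list, the `BATCH_CUTOFF` value, and the function names `batch_view`, `realtime_view`, and `query` are all hypothetical.

```python
from collections import Counter

# Hypothetical event log: (timestamp, page) page-view events.
events = [(1, "home"), (2, "docs"), (3, "home"), (4, "home"), (5, "docs")]

# Assumption: the most recent batch run covered timestamps <= 3.
BATCH_CUTOFF = 3


def batch_view(all_events, cutoff):
    """Recompute counts from scratch over all data up to the cutoff."""
    return Counter(page for ts, page in all_events if ts <= cutoff)


def realtime_view(all_events, cutoff):
    """Incrementally count only the events the batch run has not seen yet."""
    view = Counter()
    for ts, page in all_events:
        if ts > cutoff:
            view[page] += 1
    return view


def query(page, batch, realtime):
    """Answer a query by merging the batch view with the real-time view."""
    return batch[page] + realtime[page]


batch = batch_view(events, BATCH_CUTOFF)
realtime = realtime_view(events, BATCH_CUTOFF)
print(query("home", batch, realtime))
```

When the next batch run completes, the cutoff moves forward and the real-time view for the covered period can be discarded, which is how errors in the streaming path get corrected over time.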

An alternative is the Kappa architecture proposed initially by Jay Kreps [3][4], co-creator of Apache Kafka.
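
As a rough illustration of the Kappa idea, the sketch below keeps a single stream-processing path and handles reprocessing by replaying the retained log through a revised job, then switching readers to the new view. The in-memory `log` list stands in for a retained, replayable log such as a Kafka topic, and `process_v1`, `process_v2`, and `run_job` are hypothetical names chosen for this example.

```python
from collections import Counter

# Hypothetical append-only log, standing in for a retained, replayable topic.
log = ["home", "docs", "home", "HOME", "docs"]


def process_v1(view, record):
    """Original job: counts page names exactly as logged."""
    view[record] += 1


def process_v2(view, record):
    """Revised job: normalises case before counting."""
    view[record.lower()] += 1


def run_job(process, records, start_offset=0):
    """Run a stream job over the log from start_offset, building a fresh view."""
    view = Counter()
    for record in records[start_offset:]:
        process(view, record)
    return view


serving = run_job(process_v1, log)      # the view currently being served
reprocessed = run_job(process_v2, log)  # replay from offset 0 with the new code
serving = reprocessed                   # switch readers to the new view
print(serving["home"])
```

There is no separate batch codebase here: the same kind of job does both live processing and reprocessing, at the cost of having to replay stored data through the streaming path.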

---

[0] https://www.amazon.com/dp/1617290343

[1] https://en.wikipedia.org/wiki/Lambda_architecture

[2] https://speakerdeck.com/druidio/real-time-analytics-with-ope...

[3] https://engineering.linkedin.com/distributed-systems/log-wha...

[4] https://dataintensive.net/

thekhatribharat
Here's a related article: https://medium.com/open-factory/state-of-the-m-art-big-data-...

An excerpt from the article:

Furthermore, the big data tools can be combined using a growing number of data processing architectures — Lambda and Kappa, among others.

battery_cowboy
Thanks so much for the comment, it was very helpful!
sologoub
The sources are good and thorough, but very long. Here’s an OK summary of the Kappa proposal: https://milinda.pathirage.org/kappa-architecture.com/

In theory this sounds great, but you have to account for processing capacity.

While compute is getting cheaper, one of the key reasons streaming in Lambda sacrifices quality for throughput is compute capacity (as well as timing). If you have to feed already stored data through the same streaming pipe, you either have to have a lot of excess capacity, be willing to pay for that additional burst, or accept latency in your results (assuming you can keep up with your incoming workload and not lose data). There is no free lunch.
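
A back-of-the-envelope sketch of that trade-off: if stored data is replayed through the same pipe while new data keeps arriving, only the spare capacity (capacity minus ingest rate) goes toward draining the backlog. The function name `catch_up_seconds` and the figures in the example are hypothetical, and the model assumes constant rates.

```python
def catch_up_seconds(backlog, ingest_rate, capacity):
    """Seconds needed to drain `backlog` records while new ones keep arriving.

    ingest_rate and capacity are in records per second. Returns None when
    capacity does not exceed the ingest rate, because the backlog would
    then never shrink.
    """
    spare = capacity - ingest_rate
    if spare <= 0:
        return None
    return backlog / spare


# Assumed workload: one day of stored data at 1,000 records/s.
backlog = 86_400 * 1_000

print(catch_up_seconds(backlog, 1_000, 1_500))  # 50% headroom
print(catch_up_seconds(backlog, 1_000, 3_000))  # 200% headroom
print(catch_up_seconds(backlog, 1_000, 1_000))  # no headroom
```

Under these assumptions, 50% headroom takes two days to replay one day of data, tripling capacity cuts that to twelve hours, and with no headroom the replay never finishes.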

You might want to read this (for free): http://book.mixu.net/distsys/single-page.html

And pay a little to read this book: http://www.amazon.com/Designing-Data-Intensive-Applications-...

And this one: http://www.amazon.com/Big-Data-Principles-practices-scalable...

Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty well known for his work on Kafka.

Both are very good books on building scalable data processing systems.

HN Books is an independent project and is not operated by Y Combinator or Amazon.com.