Hacker News Comments on
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this book.
Great post. Also highly recommend Designing Data-Intensive Applications by Martin Kleppmann (https://www.amazon.com/Designing-Data-Intensive-Applications...). The sections on "Storage and Retrieval", "Replication", "Partitioning" and "Transactions" really opened up my eyes!
⬐ lysecret: Absolutely loved the book. Can someone recommend similar books?
⬐ avinassh: Database Internals is also pretty good.
⬐ skrtskrt: Seconding Database Internals - it's not just about "Internals of a database", as part 2 gets nitty gritty with the general problems of distributed systems, consensus, consistency, availability, etc.
⬐ pixelmonkey: There is a quite-nice interactive browser dataviz here that shows you books similar to the themes, categories, and topics discussed in DDIA:
⬐ wombatpm: Database Design for Mere Mortals by Michael J. Hernandez
⬐ dangets: I have not read it personally, but I've seen 'How Query Engines Work' highly recommended several times before. I have a procrastinatory tab open to check it out some day.
⬐ itsmemattchung: Second this.
I really like how he (Martin Kleppmann) starts the book with a primitive data structure for constructing a database design, then evolves the system slowly and describes the various trade-offs of building a database from the ground up.
This book has an excellent reputation for the foundations of data-intensive software:
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
⬐ alexott: It could be a good second step, imho…
I’m a Midwest dev w/ 8 YOE at non big tech who got multiple FAANG+ offers last year. I wrote the below guide for friends interested in following the same path so I’ll just post it here.
Took about four months of studying ~2 hours daily.
0. Total Compensation (TC)
Compensation data: https://www.levels.fyi/
Get the app Blind and start browsing it daily. People regularly post their offers, and it is the most up to date info on the market. It’s an anonymous forum where your company email is verified. You can DM employees of target companies for referrals or information about roles.
1. Leetcode (LC)
Buy a yearlong Leetcode premium subscription and do all the modules listed here, in no particular order, but skip decision trees and machine learning: https://leetcode.com/explore/learn/
When you are done with that, do all the problems on this list: https://www.teamblind.com/post/New-Year-Gift---Curated-List-...
A lot of these problems are in the modules linked previously, so you will only have 30-40 new problems here.
Next, do random problems until you "see through the matrix."
Focus on medium level problems. Try to do something like 35% easy, 50% medium, 15% hard.
If you can't find the optimal solution to a problem, "upsolve" by reading a bit of the solution and trying again. If you still can't get it, copy the code of the solution and study it. Then erase it and try to solve it from memory.
Periodically go back over solved problems and re-solve them while taking notes.
Your goal should be to solve two random LC mediums in ~35 minutes. Consider using Python as your interview language if you are comfortable enough with it. It's faster than Java for writing.
Some places will have you run the code, others it will be a glorified whiteboard, so don't use the run button as a crutch.
Around two weeks before your interview, start doing company tagged problems like: https://leetcode.com/company/doordash/
Start doing this part first and grind it hard. It might take 3 months, it might take a year. It takes as long as it takes until you think you can crush it. I spent around 2 hrs each day in the morning on LC.
2. System Design
If you are being considered for senior level roles, this will be by far the most important part of your interview as far as leveling. If you are shaky, they will downlevel.
Read Designing Data-Intensive Applications. Read it more than once.
These courses on educative.io are useful: https://www.educative.io/courses/grokking-the-system-design-... https://www.educative.io/courses/grokking-adv-system-design-...
These videos are also really good: https://www.codekarle.com/
Tech talks on Cassandra/Kafka and stuff like that are good.
Videos are the best last minute prep before interviews for design.
Amazon tends to be easier in terms of LC problems but ask more behavioral. Amazon also has a reputation of being stressful and pay is not at the level of Meta/Google, though that might be changing. I would do this interview first since it’s good practice for getting behavioral stories real sharp.
Google is way slower than these other companies, so if you wanna consider them, get the process started as early as you can.
If you are interested in remote, also consider Zoom, Square, Twitter, and Coinbase.
Get referrals wherever you can. Most places will ignore you unless you have them. I applied to probably 25+ companies and was rejected or ignored by all but Uber and AirBnB. Places I had referrals to, I scored onsites 100% of the time, including places that rejected me before a referral.
You can get referrals off Blind. I didn’t do this, but I guess it happens! You probably also have people somewhere in your network in FAANG and top tier companies if you look. If people think you have a chance of passing they’ll be happy to refer. Referral bonuses are several thousand dollars. Ask them for mock interviews as well.
The process is recruiter call -> "phone screen" (do an LC problem on Hackerrank while on a zoom call) -> "onsite" which is 5 hours of zoom...usually 2 coding, 1 behavioral (maybe a small coding question as well), 1 design.
Do mock interviews with friends/colleagues for LC problems. I had 3 different people give me a total of 6 mock interviews. You can also pay for this with different companies like interviewing.io or randoms off Blind.
Getting mock interviews for system design is harder, and you might have to pay for it. I did and it was the best money I spent that year.
Also for interviews you can interview over 2-3 days after 3pm PST to avoid taking time off work if you’re not in PST.
Recruiters will let you push back interviews for any reason multiple times, especially if it's for more interview prep, so if you aren't where you want to be before one, it's totally fine to ask for more time.
You should try to get all your interviews lined up very close together to get competing offers, especially if you want Google, who tends to lowball candidates that do not have competing offers.
⬐ wallflower: Congratulations on your multiple offers! I'm still deciding whether to commit to this journey (of sorts), and your detailed guide that you wrote for your friends is what I was hoping to receive so I don't get "lost". Thank you for your gift!
⬐ fbftethrowaway: No problem. I would say the journey was worth it. I tripled my compensation.
Good luck to you no matter what you choose to do!
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/Designing-Data-Intensive-Applications...
I surprisingly really enjoyed it. Well written and it pulled back the veil on a lot of concepts that I thought were too complex for me to understand/enjoy.
⬐ eatonphil: I agree it's well written but for me that's way more on the side of required reading.
I'm thinking about books you don't _need_ to read but that are just really neat or advanced discussions.
⬐ jstx1: The book is okay, I just don't get the hype. From the title of this thread I could guess that DDIA would be the top suggestion and I don't understand what people see in it. It seemed like a decent enough overview of things + some implementation details that you forget if you don't think about them every day, nothing groundbreaking. Maybe it's because there aren't many other similar books and this information is a bit more scattered around?
⬐ Arcsech: I agree that this is a great book, but I think it fails the "impractical" test - DDIA is an immensely practical book that's basically required background reading if you work with distributed data systems.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann https://www.amazon.com/Designing-Data-Intensive-Applications...
You can learn a lot of algorithms. It's useless unless you start to create architecture and use them in practice.
Coding and building teach you more than taking a course or watching a video. If you don't have any programming background, you can enroll in some Coursera or Udacity courses to start with. Then go through this course: http://web.stanford.edu/class/cs106x/ (the course reader is really good). After that, for data engineering, read this book: https://www.amazon.com/Designing-Data-Intensive-Applications.... Also learn some SQL. Take some data, feed it into a SQLite db, ask a question, and convert the question into a query. Becoming good at this takes some time. Be patient. The learning curve is like a hockey stick: the initial phase might have a dip, but it accelerates in the later phase. WHATEVER YOU DO, DO NOT JOIN A BOOTCAMP.
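A minimal sketch of that "load some data, ask a question, turn it into a query" exercise, using Python's built-in sqlite3 module. The table name, columns, and rows are invented purely for illustration.

```python
import sqlite3

# Hypothetical sample data: (city, year, rainfall in mm).
rows = [
    ("Oslo", 2020, 900),
    ("Oslo", 2021, 750),
    ("Lima", 2020, 10),
    ("Lima", 2021, 20),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE rainfall (city TEXT, year INTEGER, mm INTEGER)")
conn.executemany("INSERT INTO rainfall VALUES (?, ?, ?)", rows)

# Question: "Which city had the most total rainfall?"
# Query: group by city, sum the rainfall, sort descending, keep the first row.
wettest = conn.execute(
    "SELECT city, SUM(mm) AS total FROM rainfall "
    "GROUP BY city ORDER BY total DESC LIMIT 1"
).fetchone()
print(wettest)
```

The habit worth building is the middle step: writing the question down in plain words first, then translating each part of it (group, aggregate, sort, limit) into SQL.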
⬐ markus_zhang: Thanks. I'm actually in the middle of cs106 and indeed the reader is pretty good. I also have the Data-Intensive book, which I read a bit of, but I think I need more technical muscle for it.
Source and more info : https://martin.kleppmann.com/2020/11/18/distributed-systems-...
This recorded series is from Kleppmann's Concurrent and Distributed Systems course which he teaches at University of Cambridge. In case the name seems familiar, Kleppmann is the author of perhaps HN's favourite book "Designing Data-Intensive Applications" https://www.amazon.com/dp/1449373321
If you are interested in computer science in general, I highly recommend:
1. Structure and Interpretation of Computer Programs (available for free, e.g. here: http://sarabander.github.io/sicp/html/index.xhtml)
Also, I haven't read it yet, but this book has been praised here a lot recently: https://www.amazon.com/Designing-Data-Intensive-Applications...
In the case of technical books, I have found that pricing is not that different from digital to physical books.
This is a good example: https://www.amazon.com/dp/1449373321/ref=cm_sw_r_tw_dp_x_qQ8...
It is $29.99 for Kindle and Physical is $34.91.
I would say exactly the opposite. I regret buying a Kindle-only book from Amazon, because it is DRM-protected and I am forced to use the "Amazon Kindle" application; otherwise I cannot open it. I am usually okay with DRM, but I wish I had bought it elsewhere with less annoying protection.
Psst, "Designing Data Intensive Applications" was a very good read. Do you know similar books that focus on distributed systems?
I read this book. https://www.amazon.com/Designing-Data-Intensive-Applications...
And it was really enlightening. I would highly recommend it. It starts off by teaching different implementations of the various parts of a DBMS, then goes on to how distributed systems deal with various problems.
⬐ ifarhanp: I'm a mobile application developer aspiring to learn backend. Will this be a good read for me if I have negligible experience on the server side of things?
⬐ colmvp: Just my two cents: I'm a front-end dev, and I found this book connected with me a lot more once I also learned some backend.
Earlier this year, I read a few chapters of the book, and it was really abstract to me, so I stopped reading it.
On a whim, I had two developers teach me SQL and relational database theory. I spent a few weeks creating a few databases and connecting them to the front end to mimic a real-life application before picking up the book again, and the book made way more sense.
⬐ ifarhanp: Thanks, makes sense to me.
⬐ damidekronik: Not really in my opinion, the book is really about the data aspect. Probably your best bet is to just pick your favourite programming language and build something in it.
The book is great but not what you seem to be looking for.
⬐ ifarhanp: I'm not looking to learn backend through this book. I find the content of the book interesting and just wanted to know if my inexperience with the server side would hamper my ability to grasp it.
⬐ nwsm: You would definitely benefit from doing general backend side projects and reading before reading this book. I don't think you'll have enough context for it otherwise.
However, the book is definitely relevant to mobile applications. The backends for all the most popular apps are built with systems described in it.
⬐ ifarhanp: Thanks for the suggestion.
⬐ thatcodingdude: It's actually recommended in the teach yourself cs curriculum, so I'll get around to reading it!
Read this book:
If you are interested in distributed systems, I found the book "Designing Data-Intensive Applications" by Martin Kleppmann to be a good starting point. It's not only about distributed systems but also covers quite a bit of ground on data systems overall. https://www.amazon.com/Designing-Data-Intensive-Applications...
How would you compare Database Internals to Designing Data Intensive Applications?
⬐ deepaksurti: The book you refer to is really a kind of system design for applications which handle large data volumes. OTOH, the book I refer to talks about how database software can be developed from the ground up, thus helping you understand the internals.
⬐ avremel: That book does cover many implementation details of a database. However, sometimes at a high level, and as you mention, specifically in the context of distributed systems.
Referencing my copy of Designing Data-Intensive Applications, here are some approaches mentioned:
1) The naive approach is to assign all writes to a chunk randomly. This makes reads a lot more expensive, as a read for a particular key (e.g. device) now has to touch every chunk.
2) If you know a particular key is hot, you can spread writes for that particular key to random chunks. You need some extra bookkeeping to keep track of which keys you are doing this for.
3) Splitting hot chunks into smaller chunks. You will wind up with varying sized chunks, but each chunk will now have a roughly equal write volume.
One more approach I would like to add is rate-limiting. If the reads or writes for a particular key cross some threshold, you can drop any additional operations. Of course this is only fine if you are ok with having operations on hot keys often fail.
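Approach 2 can be sketched roughly as follows. This is only an illustration under assumed parameters: the chunk count, the fan-out per hot key, the `#` suffix format, and the helper names are all hypothetical, not something taken from the book.

```python
import random
import zlib

NUM_CHUNKS = 8            # assumed number of chunks (partitions)
HOT_KEY_FANOUT = 4        # assumed number of sub-keys per hot key
hot_keys = {"device-42"}  # the extra bookkeeping: keys known to be hot


def chunk_for(key: str) -> int:
    """Map a key to a chunk with a stable hash."""
    return zlib.crc32(key.encode()) % NUM_CHUNKS


def write_chunk(key: str) -> int:
    """Pick the chunk for a write; hot keys get a random suffix first."""
    if key in hot_keys:
        key = f"{key}#{random.randrange(HOT_KEY_FANOUT)}"
    return chunk_for(key)


def read_chunks(key: str) -> set[int]:
    """Chunks a read must touch; hot keys need every suffixed variant."""
    if key in hot_keys:
        return {chunk_for(f"{key}#{i}") for i in range(HOT_KEY_FANOUT)}
    return {chunk_for(key)}
```

The trade-off from the list shows up directly: writes to a hot key are spread out, while reads for that key must gather from several chunks. Keys that are not hot still cost a single chunk lookup.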
For anyone eager to read something now, Designing Data-Intensive Applications is an excellent and completed book that covers nearly all of the same material with significant depth.
⬐ ps101: Why is this book considered to be so good? I started it because it's been recommended on HN so much and I gave it up rather quickly because it was really dry and not all that focused on practical applications. Should I give it another go?
⬐ rainloft: This book enabled me to think better from first principles.
e.g. How might I go about optimizing a redshift query? Well, now that I have an idea about how data is laid out on disk, because redshift is a columnar store, if I try to optimize X query, here's how I imagine the index to be so that sequential reads would be faster.
I could find a reference on how to optimize redshift queries, but this book answers the WHY and not just the immediate how.
I've read so many books that were practical, yet became so much less useful over time. (e.g. reading a book about the specifics of the Angular API, whereas now I write mostly React.)
I keep returning back to this book for understanding a top-level view of the fundamentals of distributed systems, specifically data stores.
I hope you give the book a second look at some point.
⬐ atwebb: I really like it because it covers just enough on a number of topics and ties them together. There are many books which can allow you to delve further into specific subjects.
⬐ indogooner: The book may seem rather shallow if you are an experienced developer, but I feel it is extremely good at covering breadth in data-intensive applications. For practical applications I have found following open source frameworks like Kafka, Spark or Presto more helpful. You can also go through the references cited in the book to look at other applications.
⬐ hdra: I highly recommend it. It does a very good job of explaining the "magic" behind all the data storage techniques, giving you very good fundamentals and an intuition for why each of them is good for certain kinds of problems.
No more need to hope the vague Medium post you found while trying to decide which DB to use would match your use case closely enough.
⬐ laichzeit0: I’ve read the book too and didn’t feel like it covered much that isn’t covered by an undergraduate CS curriculum of databases and distributed systems. Perhaps the book appeals to developers without a formal education in computer science?
⬐ alttab: Some senior engineers have been in the game long enough that they could have a reputable CS degree without classes in DB or distributed systems. Now it seems less likely, but after teaching Java, C++, C, the other topics were electives.
⬐ kevstev: Indeed. I have been out of school almost 20 years now; back then distributed systems were an advanced research topic and no class on them existed. My database course was an elective and focused entirely on RDBMSes and SQL.
I have kept up to date on these technologies, I participated in undergrad research on distributed systems and my career has revolved around them. Many devs never really get a say in where their data goes; they might read a blog post or two about new systems, but it leaves a very light imprint. It's been rather spotty as to whether I had any say in where my data is stored throughout my career.
⬐ triplee: This. Data engineering has ramped up significantly and if you want senior people you'll quickly run out of people who've been exclusively doing "big data" for 5+ years.
So your options are either senior software engineers who have done some data work (that's how I got to be a Data Engineer) or people who've been doing analytical data work (either in the traditional warehousing space or via science/insurance/finance type spaces) that are semi-technical but have no formal engineering background.
The former are people who went to college in the late 90s/early 2000s (like myself) when things were different. The latter need to hyperfocus on coming up to speed in engineering.
I reviewed this guide a couple months ago for my employer to consider as the basis of an internal bootcamp, and I'd note that it's perfect for the audiences I mentioned. Also, even for people with more up to date academic experience, note that the transactional database schemas that software normally deals with often look wildly different than analytical structures.
⬐ soobrosa: I might even have stronger feelings than Vicky in terms of how 'useful' it is. If you want to build another piece of tooling that we already have too much of, then maybe. http://veekaybee.github.io/2019/04/11/attic-compsci/
⬐ atwebb: Yep! It is really great and covers theory, technical implementations, and practical implementations while not locking into any vendor or specific tech stack. If anything, its technical information is too dense.
⬐ longcommonname: I recently took over a large (new) data engineering project. After being given almost no direction, I sat down and read this book and let it assist me with my design.
When we reviewed the design I mentioned a few points that were like: "Yeah I know the little requirements you gave counteracted this design, but if we do it this way it'll help us out (source in book)"
This book is really well written, and I've learned so much from it and I keep opening it up every day for further guidance.
I'd highly recommend reading [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications...). The book gives you a great overview of designing data systems - foundational knowledge you'll need in any DE role.
The reason you can't find data engineering materials online is because real data engineering really only happens at a handful of companies - and those companies maintain this knowledge base internally and do not share it.
I noticed that you listed tools / frameworks to learn, as well as languages. Another piece of advice would be to not focus on those because they come and go (for example, Hadoop is pretty much deprecated in any DE-heavy company). What lasts is an understanding of distributed systems, distributed query engines, storage technologies, and algorithms & data structures. If you have a firm grasp on those, you won't have to start from scratch every time a new framework is introduced. You'll immediately recognize what problems the tech is solving and how they're solving it, and based on your knowledge you can connect the dots and know if that solution is what you need.
Another thing to do is watch CS186 from Berkeley in its entirety. This course is about relational databases, but will give you the foundation you need to speak the DE language.
Source: I work as a data engineer at what some would call a big company :)
⬐ elamje: Great advice! I actually got that book last night as I researched more. I’ll be looking into the Berkeley class as well!
I haven't read Designing Distributed Systems, but I have read Designing Data-Intensive Applications and it was fantastic.
An overview of databases (what and why, but also a lot of how) plus distributed concepts and modern architectures.
⬐ detaro: Agreed, good book. Was a bit more theoretical than I expected, but a way better way to refresh that stuff than digging through my course materials from university.
⬐ weitzj: Agree fully. Such a great book, with a good thread to follow from cover to cover. Many references to other articles/books.
Heck, for me it feels like it even has some suspense. “Ok, I now understood the single database instance and various ways data can be organized on disk - what can go wrong next? What problems are with more dbs?”
I never had this book-“feeling” with a technical book before, where you just want to continue reading it under your blankets at night. Love it and happily advocating it in the company. :)
I'll structure this in "current/future/recent_past" format if I may.
Currently reading:
* The Go Programming Language
* Building Microservices
Plan to do next:
* Designing Data-Intensive Applications
* Designing Distributed Systems
* Unix and Linux System Administration 5th ed, but probably just gonna skip/read chapters of interest, i.e. I wanna get a better understanding of SystemD.
Read last month:
* Learning React
Good for a quick intro but I probably wouldn't read cover-to-cover again, some sections are old, but overall an OK book.
* React Design Patterns and Best Practices
Really liked this one, picked up a tonne of new ideas and approaches that are hard to find otherwise for a newbie in the JS scene. These two books, some time spent reading up on webpack, and lots of github/practice code made me no longer scared of JS and no longer feeling the fatigue. I mean, I was one of the people who dismissed everything frontend related: big node_modules, electron, complicated build systems etc. But now I sort of understand why and am on the other side of the fence.
* Flexbox in CSS
Wanted to understand what the new flexbox layout is about, since it's been a while since I've done some serious CSS work. Long story short, I made it about halfway through and dropped it - not any more useful than the MDN docs, and actually playing with someone's codepen gave me a better understanding in 5 minutes than 3 hours spent with this book.
⬐ apodobnik: Designing Data-Intensive Applications is fantastic.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
I read through this book last year when I saw it recommended on HN. I recommended it to engineers on my team at work.
I’m reading it for a second time now, and just finished chapter 2 today. It’s dense but an amazingly detailed and thorough text.
I'd recommend the following:
Clean Code: A Handbook of Agile Software Craftsmanship is a great book on writing and reading code.
Similarly, Clean Architecture: A Craftsman's Guide to Software Structure and Design is, no surprise, a book on organizing and architecting software.
Designing Data-Intensive Applications may be overkill for your situation, but it's a good read to get an idea about how large scale applications function.
The Architecture of Open Source Applications is a fantastic free resource that walks through how many applications are built. As another comment mentioned, reading code and understanding how other programs are built are great ways to build your "how to do things" repertoire.
Finally, I'd also recommend taking some classes. I started as a self-taught developer, but I've since taken classes both in-person and online that have been a tremendous help. There are many available for free online, and if in-person classes work better for you (motivation, support, resources, etc), definitely go that route. They're a fantastic way to grow.
Designing Data-Intensive Applications by Martin Kleppmann 
⬐ nigealj: Couldn't agree more. One of the best distributed data systems books! A must read for anyone dabbling in that part of the stack!
⬐ russdpale: This book is fantastic!
⬐ nickdotnet: I'm about halfway through this book right now and learning a lot. Do you have any recommendations for follow-up books or simply ways to put the lessons of the book into practice?
I second Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Rather than covering theoretical aspects in detail, it focuses on real-life problems that can be solved using various paradigms.
As a self-taught developer, I used to think that some of the theoretical elements were overhyped. I can build iOS apps that work, and I did just that for the last 2-3 years. However, many of the programs that I wrote have not been as easy to maintain as I would like, and some difficult-to-fix bugs have popped up over time, both of which are due to a lack of deeper understanding of CS fundamentals. Last year I started interviewing and was ridiculed at one company in particular for a lack of CS knowledge. Afterwards I started exploring a lot of the CS concepts listed in this link, and I have since found numerous ways to improve my code quality and gained a better understanding of how CS best practices came to be. I also used to think that algorithms and data structures were relatively useless for an iOS developer, and I was able to do the job without them, thus proving my point. However, after gaining a better understanding, it quickly becomes clear that things like view hierarchies are simply trees, and understanding ways to traverse these hierarchies can lead to much cleaner code. With the open sourcing of Swift, I also became more interested in understanding the language, but a lot of the language design decisions didn't make sense to me until I gained a better understanding of CS fundamentals. I have found the programming languages course on Coursera to be particularly useful, and have also greatly enjoyed the book Designing Data Intensive Applications. There's also a great video from this year's WWDC that really inspires algorithm study and use in everyday applications.
"CP/AP: a false dichotomy" https://martin.kleppmann.com/2015/05/11/please-stop-calling-... . Martin Kleppmann is the author of "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" https://www.amazon.com/Designing-Data-Intensive-Applications...
⬐ dustingetz: When I speak about Datomic, my most frequent questions are about the CAP theorem, which Datomic seemingly subverts to do seemingly impossible things (like the aforementioned Spanner): http://www.dustingetz.com/:datomic-cap-theorem/
⬐ uluyol: I think that the CP/AP dichotomy is a good example of how we treat fundamental, but not hard, tradeoffs as hard.
E.g. often we have a fundamental tradeoff between latency and throughput, and it's impossible to get 100% in both metrics. However, we can still do very well in both, and that's what matters in practice.
We have the same thing with CAP. You can build a CP system with high availability.
⬐ zzzcpan: Fundamental tradeoffs in CAP are latency/time for consistency, since real-world networks are always partitioned. You are just using a different meaning of availability when talking about high availability, not the one from CAP.
I haven't seen anyone touch on this, but I remember reading about this in Data Intensive Applications. The way that they solved the celebrity feed issue was to decouple users with high amounts of followers from normal users.
Here is a quick excerpt; this book is filled to the brim with these gems.
> The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance.
Approach 1 keeps a global collection of tweets; a user's home timeline is built at read time by looking up the tweets of everyone they follow and merging them in time order.
Approach 2 involves posting a tweet from each user into each follower's timeline, with a cache similar to how a mailbox would work.
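A toy sketch of the hybrid described in the excerpt, to make the two code paths concrete. Everything here is an assumption for illustration: the in-memory dictionaries, the function names, and especially the tiny celebrity threshold. It is not how Twitter actually implements it.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 2  # assumed cutoff; a real system would use a far larger number

followers = defaultdict(set)          # author -> users following them
following = defaultdict(set)          # user -> authors they follow
tweets_by_author = defaultdict(list)  # approach 1 storage: every tweet, per author
home_timeline = defaultdict(list)     # approach 2 storage: per-user timeline cache


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post(author, timestamp, text):
    tweet = (timestamp, author, text)
    tweets_by_author[author].append(tweet)
    if not is_celebrity(author):
        # Approach 2: fan out on write into each follower's cached timeline.
        for user in followers[author]:
            home_timeline[user].append(tweet)


def read_timeline(user):
    # Start from the cache, then apply approach 1 for celebrities only:
    # fetch their tweets at read time and merge them in.
    merged = set(home_timeline[user])
    for author in following[user]:
        if is_celebrity(author):
            merged.update(tweets_by_author[author])
    return sorted(merged, reverse=True)  # newest first


follow("alice", "star")
follow("bob", "star")
follow("alice", "bob")
post("bob", 1, "hi")
post("star", 2, "hello fans")
print(read_timeline("alice"))
```

Here "star" has two followers and so skips the fan-out, while "bob" has one and is written straight into alice's cached timeline. The merge uses a set so that a tweet fanned out before its author crossed the threshold is not shown twice.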
⬐ hinkley: It’s an oft overlooked inequality in these systems. People get so wrapped up in some whiz bang thing that they don’t stop to think if they should.
At the end of the day, one of the most important aspects of your information architecture is how many times each write to the system is actually observed. That answer can dictate a lot about your best processing strategies.
⬐ tmd83: This. It took me a while to learn to look at this, and I eventually focused less on caching items that don't have enough reads to justify the caching or precalculation.
Another good resource is Designing Data-Intensive Applications . Chapter 2 does a really good job explaining how different categories of databases relate to different data models, including examples of querying graph-like data models using `WITH RECURSIVE` compared to a query language for graph databases.
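For a feel of what a `WITH RECURSIVE` query over graph-like data looks like, here is a small sketch run through Python's built-in sqlite3 module (SQLite supports recursive common table expressions). The `within` table and the place names are invented for illustration and are not the book's own example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge table: each row says "child is located within parent".
conn.execute("CREATE TABLE within (child TEXT, parent TEXT)")
conn.executemany(
    "INSERT INTO within VALUES (?, ?)",
    [
        ("Boise", "Idaho"),
        ("Idaho", "United States"),
        ("United States", "North America"),
        ("London", "England"),
    ],
)

# Follow the "within" edges upward from one starting place, however deep they go.
rows = conn.execute(
    """
    WITH RECURSIVE ancestors(place) AS (
        SELECT parent FROM within WHERE child = ?
        UNION
        SELECT w.parent FROM within AS w JOIN ancestors AS a ON w.child = a.place
    )
    SELECT place FROM ancestors
    """,
    ("Boise",),
).fetchall()
places = {row[0] for row in rows}
print(places)
```

The recursive part is the second `SELECT`: it keeps joining the edge table against rows already found until no new rows appear. A graph query language can usually express the same variable-length traversal in a line or two.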
I read this book titled "Designing Data Intensive Applications", which covers this and a lot of other stuff about designing applications in general. https://www.amazon.com/Designing-Data-Intensive-Applications...
⬐ kuwze: He also published this awesome paper on CRDTs, "A Conflict-Free Replicated JSON Datatype".
⬐ pbowyer: It's the best written computer-related book I've read. On a par with Friedl's "Mastering Regular Expressions".
Very highly recommended.
⬐ sah2ed: Currently reading the book and I agree with your recommendation, it's very well-written, perhaps written with an intuitive learner in mind.
⬐ _jal: > On a par with Friedl's "Mastering Regular Expressions".
That comparison sold me. You deserve a commission.
⬐ drej: Too bad I can't upvote this multiple times.
⬐ acidtrucks: Thanks, I literally just came here to ask for book recommendations on this topic.
Are there any other suggestions?
⬐ commandlinefan: I know this sounds like a cliche at this point, but volume 3 of Donald Knuth's "The Art of Computer Programming" goes into more depth on the theory that underpins these algorithms than anything else I've ever seen/read/heard of (in fairness, I haven't read OP's suggestion yet, though).
⬐ Joeri: This is one of the best technology books I've ever read. I spent a few years diving into big data architecture. I thought I had a reasonably good handle on it, then I read this book. So many insights.
⬐ rollulus: That book is a must-read for anyone dealing with data.
⬐ tw1010: The simultaneity of your and mkandas89's comment is what I'd call a canonical example of spooky action at a distance.
⬐ mkandas89: It's a must read for anyone dealing with data
⬐ polymathemagics: Completely agree, fantastic book. Does anyone know of any similarly wonderful technical books?
⬐ _nrvs: Just finished reading DDIA, can't recommend it enough! I learned a lot of new info in every single chapter even for topics I thought I had a firm grasp on. Great job Martin Kleppmann!
Designing Data-Intensive Applications by Martin Kleppmann. There's a previous HN thread about it. Helped me understand a bit more about databases and systems. The book is also very approachable and has the perfect blend of application and theory at a high level that anyone approaching the industry for the first time stands to gain a lot from reading it.
The Architecture of Open Source Applications series is a good one for learning how to build production applications and you can read it online. The chapter on Scalable Web Architecture is a must-read.
You should try to understand how databases in general work, it will help you with your query writing.
One thing you have to realize is that once you get a little advanced, you have to get into the details of individual SQL implementations; at that point it's not about SQL but about Postgres.
I've found these books really valuable
# SQL Performance Explained: Everything Developers Need to Know about SQL Performance
This book fundamentally talks about how to effectively use and leverage the SQL indices. Talks about all the important implementations (Postgres, MySQL, Oracle, SQL Server).
# Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
This book gets mentioned a bunch around here and for a good reason. There aren't too many concrete resources on making your systems "webscale" and this one is really good.
# PostgreSQL 9.0 High Performance
Discusses all the different settings and tweaks you can do in Postgres. It's crazy how much of a perf gain you can get just by twiddling the parameters of the database, i.e. all the tricks you can do when the single instances are bottlenecks.
There's a similar book for MySQL https://www.amazon.com/High-Performance-MySQL-Optimization-R...
# PostgreSQL 9 High Availability Cookbook
Discusses how you go from one Postgres instance to more than one. Talks about replication, monitoring, cluster management, avoiding downtime etc., i.e. all the tricks you can do to manage multiple instances. Again there's a similar book for MySQL https://www.amazon.com/MySQL-High-Availability-Building-Cent...
Last but not least check out the postgres documentation, people consider it a standard of what good documentation looks like https://www.postgresql.org/docs/9.6/static/index.html
Also last but not least, read up on relational algebra (the foundation of SQL) https://en.wikipedia.org/wiki/Relational_algebra. I've always found SQL to be extremely verbose (the syntax reminds me of idk COBOL or smth) but there's another query language called Datalog, that's for our purposes similar to SQL but the syntax is much more legible.
E.g. check out these snippets from these slides (page 29) (and check out the whole class too)
% Datalog
s(X) <- p(X,Y).
s(X) <- r(Y,X).
t(X,Y,Z) <- p(X,Y), r(Y,Z).
w(X) <- s(X), not q(X).
-- the same views in SQL
CREATE VIEW s AS (SELECT a FROM p)
UNION
(SELECT b FROM r);
CREATE VIEW t AS
SELECT a, b, c
FROM p, r
WHERE p.b = r.a;
CREATE VIEW w AS (TABLE s)
MINUS (TABLE q);
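To see the relational algebra behind both notations, the four Datalog rules can be evaluated by hand with plain Python sets. This is only a sketch: the contents of `p`, `r`, and `q` are made-up sample relations, and each relation is modelled as a set of tuples.

```python
# Hypothetical sample relations, as sets of tuples.
p = {(1, 2), (3, 4)}
r = {(2, 5), (4, 1)}
q = {(1,)}

# s(X) <- p(X,Y).  s(X) <- r(Y,X).
# Two rules with the same head amount to a union of projections.
s = {(x,) for (x, _) in p} | {(x,) for (_, x) in r}

# t(X,Y,Z) <- p(X,Y), r(Y,Z).
# The shared variable Y is a join condition between p and r.
t = {(x, y, z) for (x, y) in p for (y2, z) in r if y == y2}

# w(X) <- s(X), not q(X).
# Negation is a set difference.
w = s - q

print(s, t, w)
```

Each rule maps to one operator: projection plus union for `s`, a join for `t`, and a difference for `w`, which is the same work the three SQL views do.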
You might want to read this (for free): http://book.mixu.net/distsys/single-page.html
And pay a little to read this book: http://www.amazon.com/Designing-Data-Intensive-Applications-...
Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty well known for his work on Kafka.
Both are very good books on building scalable data processing systems.