Hacker News Comments on
How We've Scaled Dropbox
Stanford · YouTube · 76 HN points · 9 HN comments
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.
An important detail which this guide leaves out: The "actual location of the row" is usually the leaf node of another B-Tree. That is the primary B-Tree and it's indexed by the primary key.
The major consequence is every non-primary index query involves dereferencing N pointers, which could mean loading N pages from disk/SSD to get leaf nodes spread out across the primary tree. Whereas if you query a range in the primary B-Tree directly, the rows are consecutive so you would only load N/M consecutive pages based on the ratio of rowsize to pagesize.
That's why some people use composite keys for the primary key, to get better data locality in the primary B-Tree index.
See "How We've Scaled Dropbox"[1] to hear the same thing.
At 48:36 they explain why they changed from PRIMARY KEY (id) to PRIMARY KEY (ns_id, latest, id). ns_id is kinda like user_id, so that change groups each user's journal entries together on disk.
Specifically, PRIMARY KEY (id) orders rows by creation date, whereas PRIMARY KEY (ns_id, latest, id) orders them primarily by ns_id.
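A rough way to see the locality effect described above is to count how many pages a single namespace's rows land on under each key ordering. This is only a toy model in Python, assuming fixed-size rows packed into pages in primary-key order; the page capacity, the round-robin write pattern, and the names `pages_touched` and `ROWS_PER_PAGE` are illustrative assumptions, and it ignores interior B-Tree nodes and caching.

```python
ROWS_PER_PAGE = 4  # assumed page capacity ("M" in the comment above)


def pages_touched(rows, key, ns_id, rows_per_page=ROWS_PER_PAGE):
    """Count distinct pages holding rows of `ns_id` when the table is
    stored in the order given by `key` (a stand-in for the primary key)."""
    ordered = sorted(rows, key=key)
    return len({
        position // rows_per_page
        for position, row in enumerate(ordered)
        if row["ns_id"] == ns_id
    })


# Hypothetical workload: 8 namespaces writing round-robin, 64 rows total,
# so each namespace owns 8 rows interleaved with everyone else's.
rows = [{"id": i, "ns_id": i % 8, "latest": 1} for i in range(64)]

# PRIMARY KEY (id): rows sit in creation order.
by_id = pages_touched(rows, lambda r: r["id"], ns_id=3)

# PRIMARY KEY (ns_id, latest, id): rows of one namespace sit together.
by_ns = pages_touched(rows, lambda r: (r["ns_id"], r["latest"], r["id"]), ns_id=3)

print(by_id, by_ns)
```

In this model the 8 rows of namespace 3 are spread over 8 pages when ordered by id, but fit in 2 pages (N/M = 8/4) when ordered by (ns_id, latest, id).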
⬐ iaabtpbtpnn
This is true of MySQL (which the guide uses), but not necessarily of other databases such as Postgres.
⬐ srcreigh
You're right. Postgres doesn't give any control over primary data locality. That might make querying 1 row a bit faster in Postgres (no log(N) traversal of the primary B-tree index), but picking out N rows could be a lot slower.
https://www.postgresql.org/docs/13/indexes-index-only-scans....
This talk has a lot of details on the early version.
⬐ fermienrico
I’m curious what the state of their affairs is today. 2012-2018 is a huge amount of time in the tech world and I’m curious what improvements they’ve made.
⬐ zawerf
The evolution of their SQL schema (from around 45:00 on) is pretty cool.
For example, to implement undo/version control for files, they just added a single `prev_rev` column. There are some arguably better (but more complicated) ways to do it, but it would've been premature optimization since this simple solution clearly worked out for them.
⬐ nodesocket
If I am understanding correctly, a sort_order column would work as well. Essentially then you can just do a select with order by sort_order.
⬐ redwood
I wonder if their focus on data center build out (rather than differentiation of their offering and customers, yes admittedly debatable) will be deemed a success or misstep in the long run.
⬐ leowoo91
Looks like an investor level presentation. Pretty sure there are 10x more layers of controllers needed to handle >1M users.
⬐ waz0wski
They've posted about this on their tech blog:
https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr...
https://blogs.dropbox.com/tech/2018/06/extending-magic-pocke...
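The `prev_rev` idea from zawerf's comment above can be sketched roughly as a linked list of revisions, where each new row points at the row it replaces. This is only an illustration using Python's built-in `sqlite3` in place of MySQL; apart from `prev_rev`, `id`, and `ns_id`, the table name, the other columns, and the helpers `save` and `history` are hypothetical and not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file_journal (        -- hypothetical table name
        id        INTEGER PRIMARY KEY,
        ns_id     INTEGER NOT NULL,
        path      TEXT    NOT NULL,    -- hypothetical column
        blocklist TEXT    NOT NULL,    -- hypothetical column
        prev_rev  INTEGER REFERENCES file_journal(id)
    )
    """
)


def save(ns_id, path, blocklist):
    """Insert a new revision pointing at the previous one (or NULL)."""
    prev = conn.execute(
        "SELECT MAX(id) FROM file_journal WHERE ns_id = ? AND path = ?",
        (ns_id, path),
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO file_journal (ns_id, path, blocklist, prev_rev) "
        "VALUES (?, ?, ?, ?)",
        (ns_id, path, blocklist, prev),
    )
    return cur.lastrowid


def history(rev_id):
    """Walk the prev_rev chain from a revision back to the first one."""
    versions = []
    while rev_id is not None:
        blocklist, rev_id = conn.execute(
            "SELECT blocklist, prev_rev FROM file_journal WHERE id = ?",
            (rev_id,),
        ).fetchone()
        versions.append(blocklist)
    return versions


r1 = save(1, "/notes.txt", "h1")
r2 = save(1, "/notes.txt", "h1,h2")
r3 = save(1, "/notes.txt", "h3")
versions = history(r3)  # newest first
```

Undo then amounts to following `prev_rev` one step back and writing that content as a new revision, which is why a single column is enough for the simple case.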
I found the Dropbox lecture [1] at Stanford one of the most riveting things ever. There is just so much technology behind Dropbox, it is staggering.
There is a reason why it is so much better than iCloud sync, Google Drive, Box or OneDrive.
⬐ travbrack
This led me to find another loosely related but very entertaining piece of Dropbox history. The original "Show HN" post: [1]. It's funny to see so much skepticism knowing now what the company became.
⬐ billforsternz
Yes, this is one of the classics - right up there with the "less space than a Nomad, no wireless, lame" comment (which wasn't on HN I don't think - but we all know it could have been :)
Edit: I see the motherlode is in place earlier in the thread "Especially when you could build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem"
⬐ mixmastamyk
The iPod comment was from slashdot, if memory serves.
⬐ wiredfool
Yes. CmdrTaco, when posting to the home page:
https://slashdot.org/story/01/10/23/1816257/Apple-releases-i...
(that post is old enough to vote in this year's election).
⬐ anentropic
ha, that's a nice counterpart to the 'trivial' thread above https://news.ycombinator.com/item?id=18071820
imo the best way is to look at what others have built. Here are some of my favorite talks that go from 0 users to millions.
Dropbox - https://www.youtube.com/watch?v=PE4gwstWhmc
Instagram - https://www.youtube.com/watch?v=oNA2C1vC8FQ
Slack (bonus. not as applicable but good reminder of why initial architecture does matter) - https://www.youtube.com/watch?v=WE9c9AZe-DY
How we scaled Dropbox - Kevin Modzelewski: https://www.youtube.com/watch?v=PE4gwstWhmc
The initial Node.js presentation: https://www.youtube.com/watch?v=ztspvPYybIY&t=597s
Dropbox is a lot more than a GUI on top of rsync. Even purely from an engineering standpoint (ignoring product & design) that's incorrect.
You might enjoy this talk: https://www.youtube.com/watch?v=PE4gwstWhmc
Applications don't really need to be well architected until they are hitting scale. Then the parts of their system that need to relieve pressure will need to be re-architected. This is almost like a case study and there are a lot of good talks on youtube from places like dropbox and facebook that explain the problem and solution. Example: https://www.youtube.com/watch?v=PE4gwstWhmc
If you don't want to do youtube case studies, there are also books to read about distributed systems. Reading about cloud architecture can help too.
⬐ LrnByTeach
> Applications don't really need to be well architected until they are hitting scale.
Very true: a well-architected system before hitting scale is considered over-engineering.
> Then the parts of their system that need to relieve pressure will need to be re-architected.
> This is almost like a case study and there are a lot of good talks on youtube from places like dropbox and facebook that explain the problem and solution. Example: https://www.youtube.com/watch?v=PE4gwstWhmc
This is a talk from 2012 about the evolution of Dropbox's architecture: http://www.youtube.com/watch?v=PE4gwstWhmc. It is incredibly detailed (down to the exact SQL schema they use for file metadata, etc.).