HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Erlang Factory 2014 - That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp

Erlang Solutions · Youtube · 21 HN points · 16 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Erlang Solutions's video "Erlang Factory 2014 - That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp".
Youtube Summary
March 7th, 2014

Rick Reed

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Oct 28, 2022 · 1 points, 0 comments · submitted by Tomte
WhatsApp. Rick Reed’s “That’s Billion with a B” presentation is a good glimpse into their own view of it https://youtu.be/c12cYAUTXXs
Mar 03, 2022 · 2 points, 0 comments · submitted by Tomte
Jun 29, 2021 · 1 points, 0 comments · submitted by Tomte
One of my favorite talks is by Rick Reed about scaling Erlang at WhatsApp. What an absolute savage. He flies through an articulate and in-depth curriculum on system performance and bottleneck mitigation.

The talk is called "That's 'Billion' with a 'B'" and makes for a great lunchtime watch: https://www.youtube.com/watch?v=c12cYAUTXXs

staticassertion
Yeah, not quite related to the linked slides, which are about scaling in terms of productivity, but still a great talk.

I really found that "isolation" is just a key optimization and one of the most useful properties of a system, which they call out in that talk. I wrote about it a bit more in depth here: https://www.graplsecurity.com/post/architecting-for-performa...

I've written our data processing layer and our event orchestration layer as basically an actor oriented system, with push/pull systems in a couple of key places where it makes sense. It's incredible what you can do with strong isolated infrastructure in terms of performance, security, reliability, and quality.

secondcoming
I know nothing about Erlang, but I do know high-volume systems. Needing 550 beefy servers to handle all that load is - in my mind - not that impressive.
strmpnk
I remember that year. I was giving a talk at EF during the same time slot but the schedule originally had me in the large room and they had a much smaller one.

When the news of the acquisition hit, everyone wanted to see the WhatsApp talk. The organizers knew this so we swapped rooms. So, I started my talk by asking if anyone in the room was here for the WhatsApp talk and told them they could quietly leave and I wouldn't mind and a bunch of people got up.

Heheh. I don't blame them. I didn't really like my talk and Rick Reed is very good at what he does and the talk is no exception.

May 06, 2020 · 1 points, 0 comments · submitted by Tomte
Nov 05, 2019 · 4 points, 0 comments · submitted by Tomte
Apr 07, 2019 · 1 points, 0 comments · submitted by Tomte
Nov 16, 2018 · 4 points, 0 comments · submitted by Tomte
Jun 06, 2018 · 1 points, 1 comments · submitted by Tomte
mikece
How accurate is it to say that WhatsApp's ability to scale is mainly a function of it being written in Erlang?
There's a talk one of the engineers gives[1] (that someone else posted here so I'm watching it) about their architecture that's published in March of 2014, and in it he talks about ~550 servers, which includes 250 multimedia servers and 150 chat servers.

The only place he mentions 16 is when he talks about the "multimedia database". I think there were 16 sharded database servers.

1: https://www.youtube.com/watch?v=c12cYAUTXXs

Absolutely stellar talk by principal engineer Rick Reed about scaling their Erlang setup: https://www.youtube.com/watch?v=c12cYAUTXXs.
Here is the description of their stack and how Erlang helped them by Rick Reed at Erlang Factory:

https://www.youtube.com/watch?v=c12cYAUTXXs

Here is Jamshid Mahdavi talking about their stack as well but focusing on development, testing and shipping code. Some of the stuff they do will surprise you if you've been at a shop with a large QA team and deployment pipelines with many stages.

https://www.youtube.com/watch?v=tW49z8HqsNw

Yes, I understand, I was just wondering if Erlang would be considered a "proven language".

From my (limited) Erlang experience, Erlang & Elixir have the same core functionality. Elixir has a "nicer" syntax (I prefer Erlang's syntax but most devs I've talked to like Elixir's).

So if Elixir isn't proven but the concurrency model & programming paradigms make sense, maybe Erlang is a good choice.

* WhatsApp Scaling to 1 Billion users in erlang: https://www.youtube.com/watch?v=c12cYAUTXXs
* https://www.erlang-factory.com/upload/presentations/395/Erla...

hellofunk
Erlang has been in use for decades, so yes. For many years, and possibly still true, nearly every phone call in the world went through Erlang code. Yes it's proven.
thodin
Actually not every call, and not even a majority of calls. And most of the Erlang-based software available in the telecom world is far from perfect.
zerr
As I remember WhatsApp did quite a heavy modification of Erlang/VM, i.e. not a typical/idiomatic Erlang use case I believe.
mercer
That sounds fascinating. Any sources?
zerr
It was mentioned in one of their tech talks/slides, I don't remember exactly which one.
hellbanner
I haven't watched but this would be a good guess https://www.youtube.com/watch?v=c12cYAUTXXs "Scaling with A B: Erlang and WhatsApp"
ZenoArrow
> "I was just wondering if Erlang would be considered a "proven language"."

Erlang is proven in production for server and telephony apps.

> "From my (limited) Erlang experience, Erlang & Elixir have the same core functionality."

Depends on what you mean by 'core functionality'. If you mean the functionality provided by Beam VM or OTP, then there's a case to be made for that. However, the library ecosystem for Elixir is separate from the one for Erlang.

To use an alternate example, Java is definitely a 'production ready' language for enterprise software. However, if I write a new language for the JVM, it's not automatically 'production ready'. In most cases it's the library ecosystem which is under evaluation, not the platform the language runs on.

di4na
Just saying, but so far, there have been more critical bugs in erlang solved due to the Elixir community than Elixir critical bugs solved... so well...
erszcz
Can you give some examples?
Not sure why this comment saw a couple downvotes earlier. mbesto is correct: for most startups, most of the time, competitive advantage doesn't come from the underlying tech stack. To make a general statement, most things could be done similarly on any of several platforms. However, when product requirements match exceptionally well with a specialized technology, you can see things that would simply be infeasible or extremely tough using a different stack.

WhatsApp + Erlang was one of those cases (watch this talk and imagine trying to recreate that system with only a handful of server engineers using any other tech: https://www.youtube.com/watch?v=c12cYAUTXXs). Discord + Elixir appears to be another.

Curious if anyone has any examples that spring to mind from outside the highly concurrent messaging space.

Here's a great talk from Rick Reed (a former mentor of mine at Yahoo!) about how WhatsApp scaled their Erlang infrastructure: https://vimeo.com/44312354

Here's another talk: https://www.youtube.com/watch?v=c12cYAUTXXs

Jul 19, 2016 · 2 points, 0 comments · submitted by kornish
It's required because your phone is where your messages are stored.

Whatsapp don't retain messages/media after they've been delivered to your phone, which is a compelling privacy feature for many.

It's also what allows them to serve such an enormous user base with limited hardware. Their technology stack (FreeBSD/Erlang) is pretty interesting, more info here:

2014 talks by Rick Reed:

https://www.youtube.com/watch?v=c12cYAUTXXs

https://www.youtube.com/watch?v=TneLO5TdW_M

Slides:

http://www.erlang-factory.com/static/upload/media/1394350183...

There's likely no technical reason why you couldn't use a pc instead of a phone for users that want to use the pc as the primary client (with the phone optionally accessing the DB on the pc in the same way that the desktop client does for the phone). Perhaps they've decided that this is a small and declining market.

Edit:

Slides for second talk

http://www.slideshare.net/iXsystems/rick-reed-600-m-unsuspec...

440k connections/sec, 1.1 million msgs/sec, 1 billion images/day, and that was in 2014...

executesorder66
> It's required because your phone is where your messages are stored.

So are you saying that a desktop PC's hard drive can't handle storage of some text messages but a phone can?

yxlx
No, he's saying that in order for the messages to be available, they have to be stored locally and since most people need access to the messages from their phones, it makes sense to do it this way to ensure that all messages are stored on the phone, instead of users ending up with some messages on their phones and some on their PCs.
xomateix
Of course, he is not saying that, he is just explaining how whatsapp is actually storing the messages.
Longhanks
Neither does iMessage store messages on servers, and yet I receive iMessages on every device I signed on.
kccqzy
But you can't retrieve historical iMessages on a new device.
hudell
I'm not really asking for that feature. If I could send and receive messages on the computer without having to open the smartphone app every five minutes for it to restore the connection, I would be happy already.
uola
"Perhaps they've decided that this is a small and declining market."

For consumers (and facebook is a consumer company) it's all about who owns mobile (and can also compete with facebook). They want desktop to just be enough of a feature to be more appealing than other platforms, but not enough that it detracts from mobile.

Freak_NL
> There's likely no technical reason why you couldn't use a pc instead of a phone […]

Any personal computer built in the last five years can do anything a smartphone or tablet can in terms of processing power and connectivity. A smartphone is a computer with hardware that enables it to use cell phone networks and make calls.

My inner cynic strongly suspects that Facebook and other similar corporations really like the control they have on the overall user experience on the two major mobile operating systems; i.e., eyeballs on a smartphone or tablet are worth more than those on a general purpose computing device.

Too much freedom on a personal computer; with browsers that feature all kinds of privacy enhancing add-ons such as ad-blockers and tracker-blockers. Much harder to monetize.

kome
You are totally right! That's the obvious answer. On desktop you have too much freedom.
rhaps0dy
I'm going to assume this is sarcasm and upvote.
Not "parent", but search for the Erlang Factory talks; they talk about it. Here is one from Rick Reed, in which he explains how the product works:

https://www.youtube.com/watch?v=c12cYAUTXXs

Klarna (the European payment processor) and WhatsApp architectures are built on Mnesia. Mnesia is Erlang's built-in distributed database. Due to its age it has a few bad corner cases that could be fixed if it just had a better backend (the front-end API is nice, well designed, supports transactions, and is built into the language).

So this effort brought in the ability to have a better backend, and make Mnesia a better option as a general purpose distributed database.

Here is a talk on how WhatsApp uses Mnesia:

Erlang Factory 2014 - That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp

https://www.youtube.com/watch?v=c12cYAUTXXs

Here is an example of using Mnesia:

https://en.wikibooks.org/wiki/Erlang_Programming/Using_mnesi...

scurvy
Raise your hand if you've ever actually run mnesia on a busy server in a distributed system. OK, show of hands of how many people will do that again? Thought so.

mnesia should not be in the OTP. Not by a longshot.

I love Erlang. I hate mnesia.

angersock
Why the hate for mnesia?
scurvy
You'll get a lot of mnesia partitions on even the best network with a slightly busy system. It's not very sturdy, and resynchs are really slow.
toast0
I only see mnesia partitions when the network is having issues. Maybe my network is better than your best network? Maybe network issues a few times a year is a lot?

Resynchs are really slow, because "resynch" is copy data from the network. Mostly this goes close to the line speed for your network though; be sure to move the data files out of the way on the node that will be receiving the copies: otherwise mnesia first loads those, then throws them away sigh. It would be pretty nice if there was support for logging changes when a node is down, though.

scurvy
>I only see mnesia partitions when the network is having issues. Maybe my network is better than your best network?<

Doubtful on the network thing. mnesia will partition then resync when the server gets really busy. As others have mentioned, it might have nothing to do with anything that Erlang is doing. It might be something external. It could just be a lot of traffic that causes it to fall behind. Either way, one missed message and then you're forced to fully resync.

>Resynchs are really slow, because "resynch" is copy data from the network. Mostly this goes close to the line speed for your network though<

Which means your other node has its interface maxed out, causing more service disruptions. I've never run mnesia on a 10gig network, but that definitely was the case with 1gig. I'm not really willing to test or run mnesia in a 10gig/40gig environment. Been burned by it too many times.

>Mostly this goes close to the line speed for your network though; be sure to move the data files out of the way on the node that will be receiving the copies: otherwise mnesia first loads those, then throws them away sigh. It would be pretty nice if there was support for logging changes when a node is down, though.<

Which again, is another sign that it's not robust enough for Internet application usage. Probably OK for some 1990's phone switching, but not for how distributed systems are built today. How are you going to manually move the files out of the way in today's world of systemd automatically restarting failed daemons? Manual operator intervention? Thought so, and this is why ops teams hate mnesia.

lostcolony
Yeah; running Mnesia in production, albeit for pretty low volume systems (maybe 3000 writes a day on average), in multiple clusters, in different locations, running from anywhere between half a year to 2 years, I think I've seen...two netsplits, total, both of them caused by a VM snapshotting process that caused the socket to hang. But we've got 24/7 ops keeping an eye on things in any case, and the data being stored is maybe ~1 gig, so resyncing is fast. It fits our use case.
derefr
I think people are trying to use mnesia for things it wasn't made for. mnesia basically has exactly one purpose:

• in a 1:1 master:hot-spare setup,

• where the nodes contain their own data in process-memory-space rather than relying on a separate "database" node,

• and you need to be able to fail-over to the hot-spare and promote it to master, without business logic being aware of this,

• and your system has time tolerances allowing you to manually fix the old master and bring it up as the new hot-spare,

then mnesia is perfect.

You know what system I'm describing?

Call switching!

You know what system I'm not describing?

Most things!

lostcolony
Actually, a lot of internal business style applications fit Mnesia's model too (if you're hosting them in a single datacenter). While it means you'll need to manually deal with netsplits (or write your own code to address them, or try to borrow https://github.com/uwiger/unsplit), if you're hosting it internally on your own servers, that might be a rare enough event (and with sufficiently minimal likelihood/consequence of things going down, coming back up, etc, such as to introduce problematic inconsistencies rather than downtime/ignorable inconsistencies in the event of major network/system thrashing) as to be worth it for certain things.

Something, like, say, file processing. You have a watch directory, you want to be able to process everything that lands in that directory in a scalable manner, but don't want to re-process the same thing (but it's okay if you do, just inefficient). Mnesia is probably fine to keep tabs of what you've processed already; in the event of a netsplit you can just let all sides of it keep going, until you get around to fixing the cluster. Your inconsistencies just lead to inefficiencies, rather than real data loss, and you have a clear path to fixing them (just dump the data on the partitioned nodes, and rejoin them to the cluster). As such, you have a more resilient, scalable system than you would if you just used a centralized database, while not having to configure and manage a separate DB.

That said, I like the idea of being able to swap Mnesia out for something a little less warty, if it's pretty seamless in operation.
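
The watch-directory scenario above can be sketched in a few lines. This is only an illustration under loud assumptions: plain Python sets stand in for a replicated Mnesia table, and the names `process`, `main`, and `part` are hypothetical, not Mnesia APIs. It models the recovery path the comment describes (dump the partitioned node's data and rejoin) and shows that the cost is repeated work rather than lost data.

```python
# Illustrative sketch only: Python sets stand in for a replicated table
# of already-processed file names. Nothing here is a real Mnesia call.

def process(seen, incoming):
    """Handle every file this node has not recorded yet; return what it handled."""
    handled = [name for name in incoming if name not in seen]
    seen.update(handled)
    return handled

# Before the netsplit both nodes hold the same record of processed files.
main = {"a.csv", "b.csv"}
part = set(main)

# During the netsplit each side keeps going on its own view.
work_main = process(main, ["c.csv", "d.csv"])
work_part = process(part, ["d.csv", "e.csv"])

# Healing: dump the partitioned node's data and copy the table from main.
part = set(main)

# e.csv is no longer recorded anywhere, so it is picked up and processed again.
redo = process(main, ["e.csv"])

# Everything that was done twice: d.csv during the split, e.csv after the heal.
wasted = (set(work_main) & set(work_part)) | set(redo)
```

Here `wasted` ends up holding the files that were processed twice, while `main` still records every file exactly once. That is the trade-off the comment accepts: inefficiency in exchange for not running a separate database.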

seiji
> You know what system I'm describing? Call switching!

Exactly right!

The "distributed" part of Erlang (including mnesia) was designed to run in a blade-like system where the networking was provided by a physical common backplane among all the compute cards in the chassis.

So, a lot of distributed erlang and mnesia falls apart when dealing with network partitions and resyncs and real-world scenarios that wouldn't really happen on a common physical substrate.

That's why most sane erlang people won't run distributed erlang (gotta love that epmd), they'll run their own TCP servers connecting to external DBs.

derefr
Yep. It feels like people want to use Erlang's distribution mechanism for communication between machines which have no pre-existing relationship, which is not what it's for. It's what regular sockets are for—and Erlang is great at everything to do with regular sockets.

I do wonder if you could get an interesting boost in fault-tolerance by writing an Erlang application to run distributed between, say, several EC2 instances in the same Placement Group. That gets you the analogous "backplane guarantee" in virtualized network-space, AFAIK.

mononcqc
Distributed Erlang on its own is fine to run in production (particularly as a control plane for metadata between nodes) because few assumptions about reliability of the network are baked in; most of them can be made by the application designer.

Distributed OTP applications (with the automated takeover/failover mechanism) see very little use in the real world because of their built-in assumption that network failures are rarer than software or hardware failures, which results in all netsplits being perceived as nodes going down (a great way to get split brains!)

I believe the issue is that unlike web chats and apps like Telegram, Whatsapp doesn't store conversations on their servers beyond what is required to deliver messages to the required destination device(s).

This presents obvious problems with having not just a web client, but any client other than the main device, since the main device is the only store of the existing conversations. New messages sent also need to be synchronized across devices, which I presume is why it's required to keep your phone connected to the Internet. Otherwise Whatsapp would need to keep conversations on their servers until they could be synced, which is very much not their model.

Details on their architecture:

https://www.youtube.com/watch?v=c12cYAUTXXs

http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...
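
A minimal sketch of the store-and-forward idea described in this comment, where the server holds a message only until it reaches the destination device. The class and method names are hypothetical and say nothing about how WhatsApp implements it; handing a message over is treated as an acknowledged delivery, which is a simplification.

```python
from collections import defaultdict, deque


class StoreAndForward:
    """Hypothetical relay that keeps messages only until they are delivered."""

    def __init__(self):
        # recipient -> queue of messages that have not been delivered yet
        self._pending = defaultdict(deque)

    def send(self, recipient, text):
        """Queue a message for a recipient who may be offline."""
        self._pending[recipient].append(text)

    def deliver(self, recipient):
        """Hand over everything queued for the recipient, then forget it."""
        queue = self._pending.pop(recipient, deque())
        return list(queue)

    def stored_count(self):
        """Number of messages the relay is still holding."""
        return sum(len(queue) for queue in self._pending.values())


relay = StoreAndForward()
relay.send("alice", "hi")
relay.send("alice", "are you there?")

first = relay.deliver("alice")   # both queued messages, in order
second = relay.deliver("alice")  # nothing left: the relay kept no history
```

After the first delivery the relay holds nothing, so a second device asking later gets an empty list. That is the synchronization problem the comment points at: history lives only on the device that received it.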

72deluxe
Doesn't that sound like a poor architecture then?

If the system is reliant on the main device, what happens if that gets run over or destroyed? WhatsApp have no control over the fate of the device. Why wouldn't they store the messages on the server and sync from that? WhatsApp have control of the fate of their own servers, which is far more reliable, SURELY.

If I send a new message from a device, either the other devices can poll periodically for messages that I have sent, or they can use WebSockets and be notified when a new message is sent, or when the app opens it can fetch all the recent messages that I have sent from other devices. It isn't difficult with a fine grained timestamp, surely? That's what Google Talk does.

You'd only need to order my messages by timestamp to get all messages that I had sent.

It's a bit of a daft architecture if it is incapable of this basic mechanism, isn't it?
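
The timestamp-ordering idea in this comment can be sketched as a merge of per-device message logs. The log format below is an assumption made for illustration, and the sketch presumes every device stamps messages from a clock the others agree with, which real devices do not guarantee.

```python
import heapq

# Hypothetical per-device logs of (timestamp, device, text),
# each already sorted by timestamp.
phone = [(1, "phone", "hello"), (4, "phone", "on my way")]
laptop = [(2, "laptop", "typing from my desk"), (3, "laptop", "brb")]

# heapq.merge lazily combines sorted inputs into one sorted stream.
timeline = list(heapq.merge(phone, laptop, key=lambda message: message[0]))

# The conversation as it would appear on any device after syncing.
texts = [text for _, _, text in timeline]
```

The merge itself is simple, as the comment argues. What it needs is a place where all the per-device logs are available, and that is the server-side storage the replies say WhatsApp chose not to keep.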

onion2k
If WhatsApp don't store the messages then they can't give them to any authorities if they're asked. That's a pretty big selling point for some users.
mike_hearn
WhatsApp operates at insane scale very cheaply. It's one reason they can charge $1 per year or less (lots of users don't seem to pay - they keep giving me free extensions for example). Not archiving all messages on their end keeps their costs in check, and is better for privacy.
Igglyboo
It's actually a feature, not poor architecture. They're putting security and privacy above UX. This is especially important in this day and age with all of the NSA and related revelations.

If you want a better UX there are hundreds of other options available.

Nov 26, 2014 · rbsn on Yahoo Mail moving to React
Just to follow up on the topic of WhatsApp and Erlang, here is a presentation given at Erlang Factory 2014 about their goals with scaling WhatsApp to billions of simultaneous users.

http://youtu.be/c12cYAUTXXs

Oct 07, 2014 · rdtsc on Mnesia and CAP
Unrelated, but maybe interesting: WhatsApp was running Mnesia and made it work fantastically. Especially with such a small team of engineers.

Here is a video of Rick Reed talking about how they did it (hint: it is not one monolithic Mnesia cluster).

http://www.youtube.com/watch?v=c12cYAUTXXs

lostcolony
Yeah, you can build scalability into Mnesia, but it's not there by default. It's also dependent on what kind of data and persistence guarantees you want; if it's mostly transient data, or acceptable to lose some data in the one in a million case, Mnesia out of the box is probably fine. If you need stronger guarantees, you either have to do a lot of work, or you should investigate an alternative solution.
My other favorite "scalability" study is from WhatsApp:

http://www.youtube.com/watch?v=c12cYAUTXXs

That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp

(That talk title was created before the acquisition and was meant to imply message count; after the acquisition it got a secondary meaning.)

The one thing that is fascinating about it, is how small their team was compared to the volume and complexity of the operation.

dav-
The thought of even touching systems at such scales terrifies me.
rdtsc
Agreed. Someone once asked how you simulate load like that to test, and he answered, "We can't; we try to do gradual deployment and rolling upgrades."
Jul 25, 2014 · 2 points, 0 comments · submitted by rdtsc
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.