HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Velocity 2011: Artur Bergman, "Artur on SSD's"

O'Reilly · Youtube · 6 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention O'Reilly's video "Velocity 2011: Artur Bergman, "Artur on SSD's"".
Youtube Summary
Artur Bergman (Wikia/Fastly),
Artur Bergman, hacker and technologist at-large, is the VP of Engineering and Operations at Wikia. He provides the technical backbone necessary for Wikia's mission to compile and index the world's knowledge. He is also an enthusiastic apologist for federated identity and a board member of the OpenID Foundation. In past lives, he's built high volume financial trading systems, re-implemented Perl 5's threading system, written djabberd, managed LiveJournal's engineering team, and served as operations architect at Six Apart. His current interests extend to encompass semantic search, large scale infrastructure, open source development, federated instant messaging, neurotransmitters, and the future of cyborgs.
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
https://youtu.be/H7PJ1oeEyGg

https://web.archive.org/web/20110703091817/http://buyafuckin...

The memory hierarchy of a modern CPU with spinning rust might as well be the Parker Solar Probe waiting on a sea anemone.

It's beyond me how someone these days can buy a laptop with spinning rust and 4 GiB and then complain it's "slow." Maybe they should've bought a real laptop for real work that's repairable and upgradable?

tambourine_man
>repairable and upgradable

Unfortunately, the best ones aren't anymore.

failwhaleshark
Marketing "the best" (most expensive / newest) aren't the best, for most purposes.

I bought a T480 with dual batteries that does run for 10 hours. It has a 2 TiB SSD and 32 GiB. Works fine for me. Water-resistant, repairable, awesome keyboard, and MIL-SPEC drop rated too. Optional magnesium display top assembly cover.

alpaca128
Those aren't the best ones then.
Artur Bergman, creator of the Fastly CDN, at Velocity 2011 - giving a (very) short talk about SSDs: https://www.youtube.com/watch?v=H7PJ1oeEyGg
Look. :)

Yeah, the build machine churns a lot. But that work should be done primarily by the FS cache, by buffers. Yes, it's going to write out those small files, but if DragonflyBSD has any kind of respectable kernel, throughput should be a solid curve, not lots of bursts.

I would love it if my old colleague Przemek from Wikia would talk about the SSD wear on our MySQL servers, which had about 100k-200k databases per shard.

We wore the _fuck_ out of some SSDs.

You should replace your HDDs with SSDs, though, for a number of reasons, and take the long view, as kryptistk noted the OP is doing. Really compare the cost of SAS 15k drives and Intel 320s or 530s.

But in his place, I think you can take the words of the inimitable Mr. Bergman:

  https://www.youtube.com/watch?v=H7PJ1oeEyGg
Stop wasting your life. But don't expect a machine that does lots of random IO, like a database, to have 1-2% SSD wear after two years. It might not last two years. If it does, use it more. Aren't you making money with these drives? ;)
Jabbles
There are several sites doing SSD stress tests. This one claims to have written 2 Petabytes to a drive without failing:

http://techreport.com/review/27436/the-ssd-endurance-experim...

rasz_pl
2 petabytes while validating each write RIGHT AFTER it happens, and never powering that drive down. It's great when all you need is a /dev/null, not so much if you need to power down from time to time and retrieve useful data later.
awfag4y44444
I believe they have powered them down. It's not solely a drive write cache test.
kalleboo
They did do a 5 day unpowered retention test http://techreport.com/review/25681/the-ssd-endurance-experim...
awfag4y44444
Not all written bytes are equally expensive to the flash. Single-stream sequential writes are going to give you a lot longer lifetime than tons of random small IO with zero TRIM.
engendered
> But don't expect a machine that does lots of random IO, like a database, to have 1-2% SSD wear after two years.

It rather depends upon the database, no? A 512GB Samsung 850 Pro would last 29 years with 100GB of writes a day in a horrendous, worst-case 3x write amplification scenario. Very, very few databases write more than 100GB of data a day, and most write orders of magnitude less.
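
The 29-year figure above checks out arithmetically; here is a rough sketch, assuming a total NAND write budget of roughly 3.2 PB (back-derived from the quoted number, not an official endurance rating):

  # Back-of-the-envelope SSD lifetime estimate (all numbers illustrative)
  host_writes_per_day_gb = 100          # claimed daily database writes
  write_amplification = 3.0             # the "horrendous, worst case" factor from the comment
  nand_budget_gb = 3.2e6                # ~3.2 PB assumed total NAND write budget

  nand_writes_per_day_gb = host_writes_per_day_gb * write_amplification  # 300 GB/day
  lifetime_years = nand_budget_gb / nand_writes_per_day_gb / 365
  print(f"{lifetime_years:.0f} years")  # -> "29 years" under these assumptions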

rasz_pl
Reading also wears out NAND flash. SSDs have read counters alongside erase counters; too many reads and the drive forces a whole-sector rewrite to mitigate data degradation.
Gurkenmaster
Does this mean that SSDs will lose data if you reach the write limit and continue reading from it?
dillondf
There is something called read disturb, where reading a row of NAND over and over again can disturb adjacent rows. The effect is many orders of magnitude lower than the erase wear effect, and data caching within the drive (as well as in the OS) mitigates it significantly. So for all intents and purposes you don't have to worry about it.

The SSD's normal internal scans will detect the bits when they start to get flaky and rewrite the block then. It might do some rough tracking to trigger a scan sooner than it would otherwise, but it is more a firmware issue and not so much a wear-out issue.

-Matt

dillondf
To be clear here, what the poster means is that piecemeal database writes of, say, 128-byte records can cause a huge amount of write amplification, so 100GB/day worth of database writes can end up being 1000GB/day worth of flash rewrites. This issue largely goes away if the database backend appends replacement records rather than rewriting them in place, and uses the index to point to the new copy of the record. At that point the RAID system's battery-backed RAM, combined with the appends, results in much lower write amplification, and an SSD could probably handle it.

-Matt
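
A minimal sketch of the append-style backend Matt describes (replacement records appended rather than rewritten in place, with an index repointed to the newest copy); the class and names are illustrative, not any real engine's API:

  # Toy append-only record store: updates append a new copy and repoint the index.
  class AppendOnlyStore:
      def __init__(self):
          self.log = []        # sequential, append-only storage (flash-friendly writes)
          self.index = {}      # key -> offset of the latest record version

      def put(self, key, value):
          self.log.append((key, value))        # always append, never rewrite in place
          self.index[key] = len(self.log) - 1  # index points at the newest copy

      def get(self, key):
          return self.log[self.index[key]][1]

  store = AppendOnlyStore()
  store.put("user:42", {"name": "artur"})
  store.put("user:42", {"name": "artur", "ssd": True})  # update appends, no in-place rewrite
  print(store.get("user:42"))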

engendered
I know of no database products that write 128-byte blocks. Most are at least 2KB, if not larger (SQL Server and pgsql use 8KB blocks). Yes, you could conceivably imagine some hypothetical situation that would be bad for an SSD, but it is incredibly unlikely. And when I say 100GB of writes, I mean, literally, 100GB of writes, which in actual source data is generally much, much, much, much smaller.
dillondf
You do understand that flash erase blocks are around 128KB now, right? Random writes to a SSD can only be write-combined to a point. Once the SSD has to actually start erasing blocks and doing collections of those randomly combined sectors things can get dicey real fast in the face of further random writes. It doesn't have a magic wand. The point is that the SSD has no way to determine ahead of time what the rewrite order is going to be as random blocks continue to get written. You can rationalize it all you want, the write amplification is going to happen everywhere in the chain to one degree or another. For that matter, filesystems are ganging blocks together now too (and have been for a long time). It is not uncommon to see 16KB-64KB filesystem block sizes.

Nobody gives a shit about a mere 100GB in physical writes to a storage subsystem in a day. A single consumer 512GB Crucial can handle a rate like that for almost 30 years.

Write amplification effects are not to be underestimated, but fortunately there are only a very few situations where it's an issue. And as I said, the situations can be largely mitigated by database writers getting off their butts and fixing their backends to push the complexity to the indexes and away from trying to optimize the linear layout of the database by seek-writing into it.

-Matt

pkaye
4KB random writes should have a WAF of 2-7 depending on over-provisioning. But real workloads are not purely random, so it will be better than that.
engendered
> You do understand that flash erase blocks are around 128KB now, right?

How is that relevant to your wrong claim about database 128-byte writes? You word that as if you're correcting me.

I don't even know what point you're trying to make anymore, but you're trying to buttress your argument with what can best be described as diversions.

Most people write far less than they think. Databases aren't particularly magical, any more than the web server log file that is written to a line at a time. Yes, people "give a shit" about a "mere" 100GB of writes, because the vast majority of real projects, including at major corporations, write far less than that per day. So are we just talking about dick measuring now?

wtallis
The point is that even writing in 8kB chunks will lead to a lot of write amplification when the erase block size is at least 128kB and the flash page size is 16kB. 8kB writes are definitely less bad than 128B writes, but it's still not enough write combining to pass either of the relevant thresholds.
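
To put rough numbers on wtallis's point (a worst-case sketch; real drives write-combine and over-provision, so the actual WAF is far lower, as pkaye notes above):

  # Worst-case write-amplification arithmetic for the sizes discussed above
  erase_block = 128 * 1024   # bytes per NAND erase block
  flash_page  = 16 * 1024    # bytes per NAND flash page
  host_write  = 8 * 1024     # typical database block size

  # If every host write eventually forces a read-modify-write of a full erase block:
  print(erase_block / host_write)   # 16.0 -> worst-case amplification for 8KB writes
  # If a write only fills half a flash page before the page has to be programmed:
  print(flash_page / host_write)    # 2.0  -> per-page padding overhead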
engendered
It's a 64x difference. And even that is grossly understating it because most database systems feature write coalescing.

It's actually "ironic" in that database systems were built to avoid random IO because it was glacially slow on spinning rust storage. They do everything possible to avoid it, and in actual, empirical metrics running databases on flash the write amplification has been very low, and hardly mirrors the claims throughout this thread.

dillondf
I'm sorry, engendered... but are you an idiot? Do you even understand the context of the conversation?

-Matt

dang
> are you an idiot? Do you even understand

This comment breaks the HN guidelines. We'd appreciate it if you'd read the site rules and follow them:

https://news.ycombinator.com/newsguidelines.html

https://news.ycombinator.com/newswelcome.html

AlisdairO
A block might be 8KB, but the actual update you're making to the block might be much smaller. I imagine by '128 byte write' he's talking about a lot of random row updates, where each row is 128 bytes. Now, if you're not too unlucky, many updates will be combined on the same page per checkpoint, but that's not a given. On the other hand, it's reasonably likely that several updates will be combined per erase block per checkpoint. A heavily indexed table can exhibit some pretty random write patterns, though.

Additionally, the WAL has to be synced to disk every commit (unlike a web server log file), and WAL records can be very small. WAL is of course append-only, so you'd hope that a good SSD with a battery/cap backup would cache the writes and flush on the SSD erase block filling up.

engendered
> A block might be 8KB, but the actual update you're making to the block might be much smaller.

All major databases deal only in those 8KB increments (or whatever their block size is, whether larger or smaller, but never as small as originally claimed). They don't write less. Indeed, it's worth noting that most database systems (in fact every major one) actually write to a sequential transaction log (which they do not have to checkpoint every n bytes), and only on commit do they make a strategy for changing those source pages and extents, unrolling it from the log and checkpointing it, which by default includes coalescing and combining strategies. The idea that databases are randomly writing small bits of data all over the place is simply wrong, but it is the entire foundation of almost every comment in this thread.

https://technet.microsoft.com/en-us/library/aa337560%28v=sql...

As one example. Oracle, pgsql, mysql, and others do the exact same thing.

They aren't randomly changing an int here and a bool there.

I worked on a financial system where we wrote just absurd amounts of data a day. We ran it on a FusionIO SLC device (with a mirror backup), and churned the data essentially around the clock. After three years the FusionIO's little lifespan indicator hadn't even moved.

tldr; people grossly overestimate the "magical" nature of databases.
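
A minimal sketch of the log-then-checkpoint pattern described above (sequential WAL appends on commit, with page writes coalesced at checkpoint time); this is illustrative toy code, not how any particular engine is implemented:

  # Toy write-ahead-log + checkpoint flow: commits are sequential appends,
  # page rewrites are deferred and coalesced at checkpoint time.
  class ToyEngine:
      def __init__(self):
          self.wal = []          # sequential, append-only log (flash-friendly)
          self.dirty_pages = {}  # page_id -> latest in-memory contents
          self.pages = {}        # the "data files" on disk

      def commit(self, page_id, new_contents):
          self.wal.append((page_id, new_contents))   # one sequential append per commit
          self.dirty_pages[page_id] = new_contents   # page rewrite deferred

      def checkpoint(self):
          # Many commits touching the same page collapse into one page write here.
          self.pages.update(self.dirty_pages)
          self.dirty_pages.clear()
          self.wal.clear()                           # log can be truncated after checkpoint

  db = ToyEngine()
  for i in range(1000):
      db.commit(page_id=7, new_contents=f"row version {i}")  # 1000 commits...
  db.checkpoint()                                            # ...one page write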

AlisdairO
I'm really not sure you read my post very thoroughly. I'm fairly intimately acquainted with the internals of database systems, and you don't seem to be replying to what I actually wrote - I wasn't attacking your point of view (I generally agree that a decently designed DB is unlikely to trash an SSD all that quickly), I was just hoping to shed light on the other poster's wording.

If you have an 8kb block and you change 128 bytes of it, the 'actual update' is much smaller than 8kb. Sure, you're reading/writing 8kb to disk, but everything outside of that 128 bytes is basically fat for the purposes of that change. As I said in my previous post, that can absolutely be mitigated by writes being combined through the checkpointing process, and one would hope that a decent SSD could cope easily with combining writes to an appending log.

A database can still be writing data all over the place. A heavily indexed table can cause quite varied write patterns, which can result in a lot of different pages getting touched. Fortunately, the reality is that well-designed DBs and SSDs are fairly capable of dealing with this.

engendered
My original comment on this whole discussion was that few databases write more than 100GB a day. I am not talking about whether you inserted n integers or updated so many varchar columns -- when you actually monitor its IO, it is extremely unlikely that your database exceeds 100GB a day of writes, and in all likelihood it is an order of magnitude or two below this. Whenever anyone waves their hands and talks about databases as if they somehow imply massive use, they're just fearmongering -- actual empirical stats are your friend, and actual empirical stats show that most real-world databases barely register on the lifespan of most SSDs.

So now that we agree we're talking about database writes at the IO level, the other matter is how it writes it. I've built a lot of systems on a lot of databases, and the write amplification has generally been very low. I've been building and running significant databases on SSDs for about 7 years now, and while everyone else is finally starting to realize that they're wasting their time if they aren't, we still see the sort of extreme niche fearmongering that makes other people clutch onto their magnetic storage (and I heard it the entire time: "OMG but don't databases kill flash???!?!?". No, certain volumes of writes and types of writes do. Only metering will tell you if that applies). Yes, some people do very odd things that can kill storage, but that is extremely rare. It almost certainly doesn't apply to the overwhelming percentage of HN readers.

lennel
exactly :) memory is cheap - this is an exercise in that, imo.
pixl97
Yeah, many people evaluate SSDs on a dollars-per-unit-of-longevity number, which is a terrible metric for them. In $ per IOPS, SSDs kill hard drives. The number of IOPS you can fit in a 4U rack with SSDs is insane and destroys hard drives in performance per watt when you consider all the extra controllers, cases, and power supplies needed to get anywhere close to the same performance.
VLM
And don't forget the expensive labor, if you have 20x the number of spindles to meet latency targets, you will get 20x the number of failures (roughly) over time, 20x the labor to set it all up, etc.
I find it a bit disturbing how this post reaches the top of HN. But I suppose I shouldn't be surprised.

I probably live in my own little bubble, but only lately have I realized that NoSQL has two audiences: (1) People for whom normalization can't work because of their application's characteristics and the limitations of current hardware. (2) People who just don't understand basic relational concepts, or why they were invented in the first place.

It's kinda sad. I've consulted on projects where people implemented sharding before adding any indices to MySQL.

The thing about being in group (1) is that you can also recognize when the ground shifts beneath your feet. Artur Bergman is one of those guys.

http://www.youtube.com/watch?v=H7PJ1oeEyGg&feature=youtu...

diminish
The point of NoSQL is mostly in group (3): web2+ startups with a possible 1B users and 1000B things (comments etc) per user. Joins and normalization are a bit costlier here, and what is written in your Oracle development manuals for your small workgroup intracompany app doesn't work here.

Then there's group (0), mostly db developers from the client/server architecture; when they attack web2.0 problems, they fail because they stick to dogmatic notions as if they were universally true. While the (2) people are ignorant of relational concepts, the (0) people are stubborn, uneducable people who end up creating all types of scalability problems. They overuse the notions of normalization, but forget that they're attacking the wrong problem with the wrong tools.

PS: I am not yet using NoSQL in production and have a solid past in Oracle/Db2/Ms SQL/Sybase, and am now doing startups in MySQL/Postgres and Mongo.

alexchamberlain
I have to disagree with you. Normalisation is a very basic concept. Whilst it can cause a couple of problems, I am very sceptical about most startups hitting them. I'd like to see DBs introduce denormalisation as a feature, separating the logic from practicality.

Having said this, I like working with MongoDB. I like schemaless design and flexibility.

sanderjd
Your group (3) is just his group (1) stated differently.
raleec
He denormalized the data.
diminish
yeah, I showed the nosql way by repeating the data.
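
For readers following along, a tiny illustration of what "repeating the data" means here (a hypothetical users/comments example, not taken from the thread):

  # Normalized: each fact stored once, rows joined by key.
  users = {1: {"name": "artur"}}
  comments = [
      {"user_id": 1, "text": "buy SSDs"},
      {"user_id": 1, "text": "seriously, buy SSDs"},
  ]

  # Denormalized ("the nosql way"): user data repeated inside each comment,
  # trading storage and update cost for join-free reads.
  comments_denorm = [
      {"user": {"name": "artur"}, "text": "buy SSDs"},
      {"user": {"name": "artur"}, "text": "seriously, buy SSDs"},
  ]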
sravfeyn
I was just going to write a similar thing. This shows that there is a significant percentage of HN users who haven't had a formal introduction to basic CS concepts.
neilk
I'm one of those people. I was your typical web monkey, learning everything from how-to guides and O'Reilly books. Luckily I had the chance to read O'Reilly's Oracle Design (1997), which is actually a cleverly disguised general purpose RDBMS design handbook, including a good description of the normalized forms.
Tangaroa
The upvotes could also mean that people appreciate a good presentation of introductory material, and they may see a use for it as something to send to the newbies on their team.

For this subject, I usually send people to Bill Kent's Guide to the Five Normal Forms: http://www.bkent.net/Doc/simple5.htm

SkyMarshal
I'm one of those. I keep a list of good intro material I can send to friends, nephews, etc, who want to learn programming. HN and reddit are two good sources of those kinds of posts.
ismarc
There's a group (3), but it's likely just as small as or smaller than group (1): where your data is easily normalized (and easier to work with in that form), but the cost of getting an rdbms to support your write load is an order of magnitude more than that of a persistent 'nosql' datastore from which you do batch dumps into an rdbms.

The catch is that very few people end up in group 3 and still have the cost of running/administering the rdbms over the 'nosql' one actually matter.

Come on, now. Would this "Artur on SSDs" video be nearly as memorable (and therefore efficacious) without all the swearing?

http://www.youtube.com/watch?v=H7PJ1oeEyGg

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.
~ yaj@
yahnd.com ~ Privacy Policy ~