HN Books @HNBooksMonth

The best books of Hacker News.

Hacker News Comments on
The Design and Implementation of the 4.4BSD Operating System

Keith Bostic, Marshall Kirk McKusick, Michael J. Karels, John S. Quarterman · 1 HN comment
HN Books has aggregated all Hacker News stories and comments that mention "The Design and Implementation of the 4.4BSD Operating System" by Keith Bostic, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman.
View on Amazon [↗]
HN Books may receive an affiliate commission when you make purchases on sites after clicking through links on this page.
Amazon Summary
This book describes the design and implementation of the BSD operating system - previously known as the Berkeley version of UNIX. Today, BSD is found in nearly every variant of UNIX, and is widely used for Internet services and firewalls, timesharing, and multiprocessing systems. Readers involved in technical and sales support can learn the capabilities and limitations of the system; applications developers can learn effectively and efficiently how to interface to the system; systems programmers can learn how to maintain, tune, and extend the system. Written from the unique perspective of the system's architects, this book delivers the most comprehensive, up-to-date, and authoritative technical information on the internal structure of the latest BSD system.

As in the previous book on 4.3BSD (with Samuel Leffler), the authors first update the history and goals of the BSD system. Next they provide a coherent overview of its design and implementation. Then, while explaining key design decisions, they detail the concepts, data structures, and algorithms used in implementing the system's facilities. As an in-depth study of a contemporary, portable operating system, or as a practical reference, readers will appreciate the wealth of insight and guidance contained in this book.
HN Books Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this book.
These days, most people need the opposite. They're probably using systems optimized for spinning disks but they're running it on flash, and all the layers of complexity added over the years to optimize disk latency are just slowing things down.

Like, this is the kind of bullshit people used to do research on: https://tlb.org/docs/usenixw95.pdf (I'm an author). This paper (and 100s of others) exists only because sequential reads & writes on disks were much faster than random access, because you had to (a) move the head, and (b) wait for the sector you're interested in to come around. At 7200 RPM a full rotation takes 8.3 milliseconds, so that wait averages about 4.2 milliseconds, and it was worth burning 10000s of CPU instructions to try to sort read requests into some order that might avoid waiting for another rotation. Many popular disk controllers couldn't read sequential sectors, so it was better to sort write requests so that they hit every second sector or something. Madness.
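As a concrete illustration of the request-sorting trick described above (a minimal sketch; the `request` struct and its fields are hypothetical, not taken from the paper):

```c
#include <stdlib.h>

/* Hypothetical pending request: just a target sector number. */
struct request {
    unsigned long sector;
    /* ... buffer, length, completion callback ... */
};

static int by_sector(const void *a, const void *b)
{
    unsigned long sa = ((const struct request *)a)->sector;
    unsigned long sb = ((const struct request *)b)->sector;
    return (sa > sb) - (sa < sb);
}

/* Sort the pending queue by sector so the head sweeps across the
 * platter in one direction instead of ping-ponging back and forth. */
void schedule_elevator(struct request *queue, size_t n)
{
    qsort(queue, n, sizeof *queue, by_sector);
}
```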

Anyway, today there are only 3 kinds of secondary storage worth caring about: flash (sector granularity, but no seeks), cloud (like S3), and archive (like Glacier).

But if you're working with some old-timey environment and really need to know about optimizing for spinning disks, most of the ideas were published in the late 80s and 90s. This book https://www.amazon.co.uk/Design-Implementation-Operating-Add... (McKusick et al) is excellent, and describes a filesystem that works well on a wide variety of workloads. Or see the references in the Blackwell paper above.

marginalia_nu
Flash still effectively has "seeks". Sequential IO is still faster than random IO, since it's a block operation.

Especially for writing: sustained random small writes are disastrously bad on an SSD. https://en.m.wikipedia.org/wiki/Write_amplification
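One common mitigation is to coalesce small appends into large aligned writes so the drive sees big sequential IO instead of scattered small writes. A minimal sketch (the `BLOCK` size is illustrative; real erase-block sizes vary by drive, and a real implementation would also flush the remainder on close):

```c
#include <string.h>
#include <unistd.h>

#define BLOCK (256 * 1024)  /* illustrative erase-block-sized buffer */

struct wbuf {
    int fd;
    size_t used;
    char data[BLOCK];
};

/* Buffer small appends; flush a full BLOCK at a time (assumes n <= BLOCK). */
void wbuf_put(struct wbuf *b, const void *p, size_t n)
{
    if (b->used + n > BLOCK) {
        write(b->fd, b->data, b->used);  /* one large write instead of many small ones */
        b->used = 0;
    }
    memcpy(b->data + b->used, p, n);
    b->used += n;
}
```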

BeeOnRope
SSDs read at page granularity. SSDs don't really have an equivalent to a HD seek: there is latency, but it can be overlapped with the latency of dozens of other commands.

Some SSDs may support reading an entire block in a faster way than page-by-page commands, but because of the FTL a read of a logical block worth of data may be split across many physical blocks, so you cannot necessarily take advantage of this.

memset
Thank you for this! Here is my question, though. I have an example C program that uses mmap() to read data from a file. It can do this linearly or select random records. I'm running it on my M1 Mac.

https://gist.github.com/poundifdef/e748c467d354662ed034b5f64...

The program runs orders of magnitude slower when I do random reads rather than linear. Theoretically this seems like it should not be the case since it is running on an SSD, but clearly there are other tricks the OS (or hardware?) is doing to optimize for the linear case.

I'm looking to better understand why I'm observing that performance difference and how I can better design software around it, even though (intuitively) it seems like it should not matter with non-mechanical disks.
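Since the gist link above is truncated, here is a minimal sketch of the kind of benchmark being described (the record size and access pattern are assumptions, not the actual gist):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>

#define REC_SIZE 64  /* hypothetical fixed record size */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file [random]\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror(argv[1]); return 1; }

    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    size_t nrec = (size_t)st.st_size / REC_SIZE;
    int random_order = argc > 2;      /* any extra arg => random access */
    volatile char sink = 0;           /* defeat dead-code elimination */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < nrec; i++) {
        size_t r = random_order ? (size_t)rand() % nrec : i;
        sink ^= base[r * REC_SIZE];   /* touch one byte per record */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%s: %.3f s\n", random_order ? "random" : "linear",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}
```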

gniv
As a sibling comment pointed out, this is about the page (or block) size. Both disks and SSDs share the property that reads are done in blocks of 256KB or more. So even if you're reading a few bytes, you are actually reading the entire block. In the linear (sequential) read case, you are using it all.

The block size is a parameter in the theoretical model for I/O-efficient computation.

memset
Okay. Does this imply that I should see similar SSD read speeds when reading 10 blocks contiguously vs randomly?

If I can fit, say, 2k records into a single block, then I would expect reading the first 4k records to have similar SSD read performance to reading 2k from the first block and 2k from the last? (Haven't coded the experiment yet, but that would be the prediction?)

And: to the point of using block size as a parameter, do you have suggestions on further reading for techniques people have used to incorporate that into the design of their structures? Or is it basically the same as efficient paging algorithms?

gniv
> Okay. Does this imply that I should see similar SSD read speeds when reading 10 blocks contiguously vs randomly?

Yes, with some caveats:

1. Not sure if you can read block-aligned data from high-level code.

2. There might be some read-ahead heuristics in the I/O stack.

> techniques people have used to incorporate that into the design of their structures?

I haven't kept up with the research. You can try searching on scholar.google.com for "I/O-efficient" algorithms and data structures. Also "cache-oblivious". There should be some good surveys now, since this is not a new research area.

Note that most of the algorithms that were optimized for disk reads did not typically take into consideration sequential vs random reads. The model simply assumed that reading a block of size B has a unit cost and the performance of algos/ds was expressed in terms of the number of these units.
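For the curious, the standard formalization of this is the external-memory (I/O) model of Aggarwal and Vitter: with input size $N$, block size $B$, and internal memory $M$, cost is counted in block transfers, exactly the unit-cost accounting described above. The familiar bounds are:

```latex
\mathrm{scan}(N) = \Theta\!\left(\frac{N}{B}\right) \text{ I/Os},
\qquad
\mathrm{sort}(N) = \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right) \text{ I/Os}
```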

mamcx
You will like this https://ayende.com/blog/posts/series/195587-B/implementing-a... and https://www.reddit.com/r/databasedevelopment/ which talk about this kind of stuff.
xani_
Someone did test exactly that:

https://panthema.net/2019/0322-nvme-batched-block-access-spe...

From my experience, as long as you can do that access multithreaded, you won't really be penalized for random reads. With single-threaded access I've seen read performance as much as halved (used fio for testing), but that didn't translate into the multithreaded benchmarks.

sakras
Just to be a bit pedantic, what you really want is several concurrent IOs; it doesn’t have to be multithreaded. For example, io_uring can kick off many concurrent IOs from a single (userspace) thread.

But yes, if you don’t have io_uring you need to use threads, and if you use a lot of them context switches can have a nontrivial overhead from my experience.
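A minimal single-threaded sketch of what's being described, using liburing (link with -luring; the file name, queue depth, and offsets are illustrative assumptions):

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>

#define QD 32    /* queue depth: concurrent reads in flight */
#define BS 4096  /* read size in bytes */

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(QD, &ring, 0);

    int fd = open("testfile", O_RDONLY);  /* assume a file of >= 4 MiB */
    if (fd < 0) return 1;
    char *bufs = aligned_alloc(BS, (size_t)QD * BS);

    /* Queue QD reads at random 4 KiB-aligned offsets, then submit them
     * all at once: the SSD sees QD requests in flight from one thread. */
    for (int i = 0; i < QD; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        off_t off = (off_t)(rand() % 1024) * BS;
        io_uring_prep_read(sqe, fd, bufs + (size_t)i * BS, BS, off);
    }
    io_uring_submit(&ring);

    /* Reap all completions. */
    for (int i = 0; i < QD; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    free(bufs);
    return 0;
}
```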

xani_
Pedantic and wrong; good fucking luck saturating a whole SSD from a single thread with anything other than pure benchmark code
BeeOnRope
SSDs generally read at the page level, so you can get full performance, or close to it, from much smaller reads than 256KB.

For example, you may be able to get maximum performance from 4K or 8K byte reads. Of course, there are a few layers between you and the SSD, so you have to make sure the SSD actually sees enough pending requests at all times for maximum performance. If you just read one 4K block before submitting the next one, you'll get terrible performance, because at a queue depth of 1 the SSD is idle most of the time (i.e., the bandwidth-delay product is much larger than 4096 bytes).

This effect can lead to the erroneous conclusion that small block sizes are slow when queue depth is actually the confounding factor (large blocks effectively feed many page-sized read requests to the SSD at once, so you can get away with a lower queue depth).
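To put rough numbers on that bandwidth-delay point (a back-of-envelope sketch; the throughput and latency figures are illustrative, not measured):

```c
/* Queue depth from the bandwidth-delay product: bytes in flight =
 * target throughput * latency; divide by the read size for the
 * queue depth needed to keep the drive busy. */
#include <stdio.h>

int main(void)
{
    double throughput = 3e9;   /* 3 GB/s target */
    double latency = 100e-6;   /* 100 us per read */
    double bdp = throughput * latency;  /* bytes that must be in flight */
    printf("in flight: %.0f bytes -> QD ~ %.0f at 4 KiB reads\n",
           bdp, bdp / 4096);   /* ~300000 bytes -> QD ~ 73 */
    return 0;
}
```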

Writes are different but you also don't need gigantic block sizes. For SSDs available as EC2 local storage, for example, a 4K random write load can extract full performance if you get the other parameters right.

tlb
That program's data set is only 0.8 GB, which is like $10 of RAM. The time difference you're measuring between random & sequential access is probably mostly due to CPU cache and TLB, not flash.

Whatever real-world application you have in mind, this probably isn't representative. It pays to have realistic benchmarks before spending much time optimizing.

xani_
Shouldn't matter as long as you do direct IO for benchmarking
memset
Understood - it sounds like a “minimal working code example” can give misleading results here. I’ll pay more attention to this when testing, I appreciate it.
amelius
Once you mmap() a file the OS will (probably) use the same algorithms for accessing data as it uses for virtual memory. So you probably want to read:

https://news.ycombinator.com/item?id=19302299

wtallis
If you want to get good random read performance out of an SSD, you can't use mmap. A thread can only page fault on one address at a time, but full random IO performance requires giving the SSD many requests to work on in parallel.
gavinray
This. You should open the fd with O_DIRECT, non-mmap'ed.

You have a lot more control this way.
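A minimal Linux-only sketch of this suggestion (the file name and offset are placeholders; alignment requirements vary by filesystem, but 4096 is a common safe value):

```c
/* O_DIRECT bypasses the page cache, but requires the buffer, offset,
 * and length to all be aligned; unaligned values make pread() fail
 * with EINVAL. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096

int main(void)
{
    int fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) return 1;

    void *buf;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0) return 1;

    /* One aligned 4 KiB read at an aligned offset. */
    ssize_t n = pread(fd, buf, ALIGN, 8 * ALIGN);

    free(buf);
    close(fd);
    return n == ALIGN ? 0 : 1;
}
```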

dsp
In the random case you’re reading a whole page to get some tiny struct.
jltsiren
Disk-based and in-memory algorithms and data structures need similar techniques these days. The numbers are just different.

RAM latency is something like 100 ns, or maybe a bit less. Sequential read speed ranges from tens to hundreds of gigabytes per second. However, if you divide cache line size by latency, you are still orders of magnitude below that. If an algorithm accesses the memory randomly and waits for the results before continuing, it's often much slower than an algorithm that reads sequentially, even when the structs are conveniently the size of a cache line.

SSD read latency is around 100 µs, or three orders of magnitude higher. Sequential read speed is gigabytes per second, or 1-2 orders of magnitude lower. Again, if you divide page size by latency, you are nowhere near the peak throughput. Reading sequentially can be much faster, because the OS and the controller can guess your intent and read ahead.
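Working those divisions through with the round numbers above (a quick sketch to make the gap explicit):

```c
/* Latency-bound throughput at queue depth 1, per the figures above:
 * one cache line per RAM round trip, one page per SSD round trip. */
#include <stdio.h>

int main(void)
{
    printf("RAM:  64 B / 100 ns = %.2f GB/s\n", 64.0 / 100e-9 / 1e9);
    printf("SSD: 4 KiB / 100 us = %.1f MB/s\n", 4096.0 / 100e-6 / 1e6);
    return 0;  /* ~0.64 GB/s and ~41 MB/s: far below peak sequential */
}
```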

anonymoushn
> Anyway, today there are only 3 kinds of secondary storage worth caring about: flash (sector granularity, but no seeks), cloud (like S3), and archive (like Glacier).

We mostly read from spinning magnets attached to computers, since flash is expensive and cloud is expensive and slow. I guess this is true if you're fine with throwing away lots of money, but the question sort of presupposes that you are not.

mgraczyk
There is reason to believe that won't be true much longer, for example

https://wikibon.com/qlc-flash-hamrs-hdd/

It's likely within the next 10 years HDDs will no longer be viable. In terms of TCO with replacement costs we may already be there.

charleslmunger
Write amplification is a concern on flash memory. The other secret is that just like flash didn't eliminate the cost of seeks, RAM isn't as fast for random access as sequential. All modern systems benefit from data locality, to varying degrees.

A CPU has layers of caches in front of memory, which then has its own caches in front of NVMe, which has its own controller, usually with an SLC cache in front of MLC, which oftentimes is being used as a cache in front of the network or slower storage.

For many problems modern hardware is fast enough, but if your problem is attempting to make use of all modern hardware offers, you need to optimize for it. Meaningfully saturating a 100 Gbps network connection, for example, is hard to do without considering data locality.

GuB-42
I generally agree but:

> Anyway, today there are only 3 kinds of secondary storage worth caring about: flash (sector granularity, but no seeks), cloud (like S3), and archive (like Glacier).

"cloud" and "archive" are not storage devices; they are a network architecture and a service. There is a whole lot to write about that, but it is not on the same layer.

"The cloud" uses flash memory and spinning rust, and cloud providers need to care about how to get the best performance out of the hardware they use. Maybe there is not much left to discover, and maybe fewer people need to know about it, but it still matters. Same thing for "archive", with less flash and more tape.

sitkack
Reading from RAM, SSD, and even spinning disk all shares the same property: there is a significant setup time and then a lower cost for each block read.
BeeOnRope
That is not really true for SSDs, which provide random access at the same performance, at least once a "suitable" block size is reached.

This must be the case in fact because SSDs do not lay out data in the same order as the linear addressing used to access them: rather, logical addresses are mapped to physical ones inside a flash translation layer meaning that "most" access effectively looks random at page granularity.

sitkack
The flash _blocks_ are 64kB or larger. If one is doing writes smaller than this, then potentially one could get write amplification on the device, depending on how it buffers and coalesces writes.
BeeOnRope
Yes, that's true, though "blocks" is an overloaded term (e.g., people will talk about what "block size" they are reading at in the application layer).

In any case, I'm mostly disputing your characterization of SSDs as behaving like hard drives: having a long setup time and then faster reading of subsequent blocks. Unless you are talking about command latency, they don't really work like that.

To a first-order approximation, you can think of SSDs as ideal page-wise parallel random access devices, with a certain read latency and the ability to be processing up to N reads at once. As long as the queue depth keeps up to N reads in progress at once, the maximum bandwidth, or a substantial fraction of it, can be achieved.

sitkack
We are definitely in the weeds.

When used via an OS, you get better performance when you treat an SSD like a fast tape than you do if you do random IO (in the application domain, in the SSD domain at a high enough granularity yes, they are random access). There is setup time in opening a page, writing blocks within the page, sealing the page. If your writes are less than an SSD block, it has to copy the existing data to a buffer, write it, append the new data, seal the block. Or it applies a translation layer. SSDs are complex devices and are not fundamentally random access, no more so than DRAM, which is also not random access.

BeeOnRope
You were talking about reads and I responded about reads.

I'll repeat my claim: SSDs largely behave as random access devices for page-aligned random reads of an integer multiple of pages. In particular, without any performance footguns activated in the application & OS stack, you can get a substantial portion of, or all of, the throughput with such a random read workload compared to the most sequential read workload you can imagine.

This is totally unlike disks (resp. tapes) where you have the hard physical reality of average seek time in the milliseconds (resp. seconds) making random read loads orders of magnitude slower (resp. even more orders) than sequential ones.

Writes are different and more complicated, but SSDs also behave as random write devices if you choose your granularity to be the block size, and even at page size (if you can avoid write amplification, which is not always possible).

You have page and block reversed in your description: a page is smaller than and contained within a block.

SSDs do not "open a block, overwrite a page and then seal the block": they write newly written pages to a freshly erased block, using a variety of possible strategies, which sometimes give an advantage to nearby-in-time writes to the same block and sometimes do not.

kadoban
That sounds quite a bit less true for RAM and SSDs: as long as you're reading a cache line or a block at a time, respectively, isn't the cost consistent for each read?
sitkack
They would be consistent, but some fraction would still be setup time. RAM behaves more like a serial device than a truly random-access one.

https://stackoverflow.com/questions/56086993/what-does-strea...

http://thebeardsage.com/dram-commands/

kijiki
I recall (>20 years ago) reading TAOCP, and laughing to myself about how obsolete the section on sorting using multiple tape drives was. And here we are.

There are newer editions (2004, 2014) of the McKusick et al book, s/4.4 BSD/FreeBSD/1 for the updated title. If you don't want any taint of knowledge of fast flash drives, maybe go with the 2004 edition.

stingraycharles
While what you’re saying is valid, what you classify as bullshit (log-structured storage) is still immensely popular and important in the world of flash.

In my experience, read-ahead is still extremely important, even when you have SSDs, and unless all your writes are >4kb, you’re still going to benefit from a certain amount of “sequential-ness” of your writes.

Optimizing things for disk storage still pays off tremendously; it’s just that the way things are optimized is very different, and perhaps much more nuanced.

Where before it was as easy as saying “just read/write things sequentially; random access is expensive”, nowadays you need to think about how the kernel interacts with storage, and sector sizes and whatnot. There definitely are do’s and don’ts when optimizing data structures for this.

xani_
If you're writing on SATA SSD, yes.

If you're writing on NVMe, well, there is a good chance that if you're not at the high end (big site, a lot of things to do), you can just ignore that as the IOPS will be "good enough"

Like, a single relatively shitty NVMe drive can still sustain ~700MB/s of random (4k block) writes. Use good ones, and use more than one, and you quickly hit a CPU bottleneck before you hit NVMe performance limits.

You still want to keep writes kinda grouped together, but that's mostly for wear-levelling reasons

formercoder
Doesn’t the chip on the drive deal with wear leveling?
xani_
It does, but it's always easier if you allocate/deallocate big blocks, especially if you just put sub-block-size data on disk
stingraycharles
A lot of the cloud is based on iSCSI, which, although its “random access” time is relatively constant, still has increased latency compared to local storage. As such, you definitely do want to benefit from things such as read-ahead, and designing your storage structures around these types of mechanisms yields significantly better results.
HN Books is an independent project and is not operated by Y Combinator or Amazon.com.