HN Books @HNBooksMonth

The best books of Hacker News.

Hacker News Comments on
The Field Guide to Understanding Human Error

Sidney Dekker · 8 HN comments
HN Books has aggregated all Hacker News stories and comments that mention "The Field Guide to Understanding Human Error" by Sidney Dekker.
View on Amazon [↗]
HN Books may receive an affiliate commission when you make purchases on sites after clicking through links on this page.
Amazon Summary
When faced with a human error problem, you may be tempted to ask 'Why didn't they watch out better? How could they not have noticed?'. You think you can solve your human error problem by telling people to be more careful, by reprimanding the miscreants, by issuing a new rule or procedure. These are all expressions of 'The Bad Apple Theory', where you believe your system is basically safe if it were not for those few unreliable people in it. This old view of human error is increasingly outdated and will lead you nowhere.

The new view, in contrast, understands that a human error problem is actually an organizational problem. Finding a 'human error' by any other name, or by any other human, is only the beginning of your journey, not a convenient conclusion. The new view recognizes that systems are inherent trade-offs between safety and other pressures (for example: production). People need to create safety through practice, at all levels of an organization.

Breaking new ground beyond its successful predecessor, "The Field Guide to Understanding Human Error" guides you through the traps and misconceptions of the old view. It explains how to avoid the hindsight bias, to zoom out from the people closest in time and place to the mishap, and resist the temptation of counterfactual reasoning and judgmental language. But it also helps you look forward. It suggests how to apply the new view in building your safety department, handling questions about accountability, and constructing meaningful countermeasures. It even helps you in getting your organization to adopt the new view and improve its learning from failure. So if you are faced by a human error problem, abandon the fallacy of a quick fix. Read this book.

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this book.
Mar 03, 2022 · droopyEyelids on My Stripe Tax Story
If anyone is interested in this topic, I recommend the book "Field Guide to Understanding Human Error" by Sidney Dekker

It's an incredibly thorough treatment of the incentives and psychology that lead to people labeling process failures as 'human error'.

Most of the book deals with manufacturing, aviation, and air traffic control failures, but the principles generalize so easily to software development that it's a treat to read. One thing that makes it so good is that I was vaguely aware of most of what he covers before reading it, but watching him stitch it all together brought me to the point of intuitively understanding the concepts that had been floating in the back of my mind, and of being able to see them all around me at work. He puts it together so smoothly that, after reading it, it felt like I had always known what I had just learned.

It's super expensive on Amazon https://www.amazon.com/Field-Guide-Understanding-Human-Error... but available on all the online library sites that aren't for linking in polite company. It's also on Audible.

I think you're missing a couple things here.

One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.

I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term, even at preventing bugs. For many reasons, but chief among them is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.
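
To make the two metrics concrete, here is a minimal sketch with hypothetical incident data in Python; the definitions are the standard ones rather than anything specific to this thread:

    from datetime import datetime, timedelta

    # Hypothetical incident log: (failure start, service restored) pairs.
    incidents = [
        (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 9, 20)),
        (datetime(2022, 1, 17, 14, 0), datetime(2022, 1, 17, 16, 0)),
        (datetime(2022, 2, 2, 23, 30), datetime(2022, 2, 3, 0, 15)),
    ]

    # MTTR: average time from failure to recovery.
    mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

    # MTBF: average uptime between the end of one failure and the start of the next.
    gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
    mtbf = sum(gaps, timedelta()) / len(gaps)

    print("MTTR:", mttr)  # how fast we recover
    print("MTBF:", mtbf)  # how long we run between failures

A gate in front of every release tends to push MTBF up while stretching MTTR, which is the trade-off described above.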

The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. [1]

To me, the attitude in, "Getting a call that production is not working is the event that I am trying to prevent by all means possible," sounds like it's adaptive in a blame-avoidance environment, but not in actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices. And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.

[1] For those unfamiliar, I recommend Dekker's "Field Guide to Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...

cjfd
One can talk about MTBF and MTTR, but not all failures are created equal, so maybe not all attempts to do statistics about them make sense. The main class of failures that I worry about regarding MTTR is the very same observable problem that you solved last week occurring again due to a lack of quality gates. To the customer this looks like last week's problem was not solved at all, despite promises to the contrary. If the customer were calculating MTTR, he would say that the TTR for this event is at least a week, and I could not blame him for saying that. Since getting the same bug twice is worse than getting two different ones, it actually is quite great that quality gates defend against known bugs.

The blame vs. reward issue sounds rather orthogonal to the one we are discussing here. If the house crumbles, one can choose to blame or not blame the one who built it, but independently of that, in that situation it is quite clear that it is not the time to attach pretty pictures to the walls. That is, it certainly is not the time to do any improvement, let alone reward anyone for it. First the walls have to be reliable; then we can attach pictures to them. The question of what percentage of my time I spend repairing failures versus writing new things seems to me more important than MTBF vs. MTTR.

I have to grant you that underneath what I write there is some fear going on, but it is not the fear of blame. It is the fear of finding myself in a situation that I do not want to be in: the thing is not working in production, I have no idea what caused it, no way to reproduce it, and I will just have to make an educated guess at how to fix it. Note that all of the stuff that was written to provide quality gates is often also very helpful for reproducing customer issues in the lab. In this way quality gates can decrease MTTR by a very large amount.

wpietri
Problems don't occur due to a lack of quality gates. Quality gates are one way to fix problems, but are far from the only way. And, IMHO, far from the best way.

And I think the issue of blame is very much related to what you say drives this: fear. Fear is the wrong mindset with which to approach quality. Much more effective are things like bravery, curiosity, and resolve. I think if you dig in on why you experience fear, you'll find it relates to blame and experiences related to blame culture. That's how it was for me.

If you really want to know why bugs occur in production and how to keep them from happening again, the solution isn't to create a bunch of non-production environments that you hope will catch the kinds of bugs you expect. The solution is a better foundation (unit tests, acceptance tests, load tests), better monitoring (so you catch bugs sooner), and better operating of the app (including observability and replayability).

cjfd
I am sorry but what you are saying really does not make much sense to me. You say quality gates are bad and instead we should have unit tests, acceptance tests and so on. Actually, unit tests and acceptance tests are examples of quality gates. And do note that the original article is down even on unit tests because they are not the production environment.

Then you say that e.g., bravery is better than fear. Well, there is fear right there inside bravery. I would be inclined to make up the equation bravery = fear + resolve.

And why are you pitting replayability against what I am saying? Replayability is a very good example of what I was talking about the whole time. I have written an application in the past that could replay its own log file. That worked very well to reproduce issues. I would do that again if the situation arose. Many of these replayed logs would afterwards become automated tests. The author of the original article would be against it, though. The replaying is not done in the production environment, so it is bad, apparently.
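
As a rough illustration of the replay approach described above (a hypothetical sketch, not the application from the comment; the event types and file name are made up):

    import json

    def handle_event(state, event):
        # Hypothetical domain logic: apply one logged event to the state.
        if event["type"] == "deposit":
            state[event["account"]] = state.get(event["account"], 0) + event["amount"]
        elif event["type"] == "withdraw":
            state[event["account"]] = state.get(event["account"], 0) - event["amount"]
        return state

    def replay(log_path):
        # Rebuild state by feeding every recorded event back through the handler.
        state = {}
        with open(log_path) as log:
            for line in log:
                state = handle_event(state, json.loads(line))
        return state

    # Replaying a copy of the production log on a developer machine reproduces
    # the exact sequence of inputs behind a reported issue; interesting replays
    # can later be checked in as automated regression tests.
    print(replay("events.log"))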

wpietri
I don't believe the original article is down on unit tests. He's very clearly down on manual tests and tests that are part of a human-controlled QA step. But he also says, "If you have manual tests, automate them and build them into your CI pipeline (if they do deliver value)." So he is in favor of automated tests being part of a CI pipeline.

And I'm saying that the things I listed are good ways to get quality while not having QA environments and QA steps in the process.

I also don't know where you get the notion that all debugging has to be done in production. If one can do it there, great. But if not, developers still have machines. He's pretty clearly against things like QA and pre-prod environments, not developers running the code they're working on.

So it seems to me you're mainly upset at things that I don't see in his article.

kerpele
> The main class of failures that I worry about regarding MTTR is the very same observable problem that you solved last week occurring again due to a lack of quality gates. To the customer this looks like last week's problem was not solved at all, despite promises to the contrary.

I think the quality gates mentioned in the article are the ones where you have a human approving a deployment. If you have an issue in production and you solve it you should definitely add an automated test to make sure the same issue doesn’t reappear. That automated test should then work as a gate preventing deployment if the test fails.
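
A minimal sketch of that pattern, assuming a pytest-style suite and a hypothetical parse_order function behind an imagined earlier outage; in a CI pipeline, a failing test here blocks the deployment:

    # test_regressions.py -- runs in CI; a red test stops the deploy.
    import pytest

    from orders import parse_order  # hypothetical module that had the production bug

    def test_empty_payload_is_rejected_not_crashed():
        # The imagined outage: an empty payload raised an unhandled exception.
        # This test pins the fix: the function must reject it with a clear error.
        with pytest.raises(ValueError):
            parse_order("")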

cjfd
I can't say I read every last word of the article, but it says so many strange and wrong things that I would not give it the benefit of the doubt of the sort 'but it can't really be saying that, right?'
John Allspaw applied concepts from The Field Guide to Understanding Human Error to software post mortems. When I was at Etsy, he taught a class explaining this whole concept. We read the book and discussed concepts like the Fundamental Attribution Error.

I've found it very beneficial, and the concepts we learned have helped me in almost every aspect of understanding the complicated world we live in. I've taken these concepts to two other companies now, to great effect.

https://www.amazon.com/Field-Guide-Understanding-Human-Error...

https://codeascraft.com/2012/05/22/blameless-postmortems/

https://codeascraft.com/2016/11/17/debriefing-facilitation-g...

https://www.oreilly.com/library/view/velocity-conference-201...

rytor718
So much fantastic reading I hadn't seen before in this discussion. Mucho thanks to all for sharing!
One of the things I think about when analyzing organizational behavior is where something falls on the supportive vs controlling spectrum. It's really impressive how much they're on the supportive end here.

When organizations scale up, and especially when they're dealing with risks, it's easy for them to shift toward the controlling end of things. This is especially true when internally people can score points by assigning or shifting blame.

Controlling and blaming are terrible for creative work, though. And they're also terrible for increasing safety beyond a certain pretty low level. (For those interested, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error" [1], a great book on how to investigate airplane accidents, and how blame-focused approaches deeply harm real safety efforts.) So it's great to see Slack finding a way to scale up without losing something that has allowed them to make such a lovely product.

[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...

michaelbuckbee
This extends to other aspects of their ecosystem. I've shipped applications for most of the major app ecosystems and Slack's was literally the _only_ one where it really felt like they wanted to help me get things set up correctly and help me ship something.

The reviewer rejected my Slack add-on twice, but was really nice about it, gave specific reasons, encouraged me to fix it and reapply, etc.

A very pleasant experience compared to some of the other systems where it feels like you're begging to be capriciously rejected.

paulific
Yes, I did some translation work for a global engineering company whose approach to achieving "zero fatal accidents" was exactly what you suggest. Instead of placing the blame on people for not following the rules, they identify the ultimate cause and fix the problems with their systems that made the accident possible in the first place.
dvtrn
Having recently escaped from a "control and blame" environment: this is also horrible for releases. Left unchecked, more and more energy is expended doubling down on architecting for perfection in fault tolerance. Risk aversion goes through the roof and cripples decision making, and before you know it your entire team of developers has become full-time maintenance coders; you stop innovating, spend cycles creating imaginary problems for yourself, and begin slowly sinking.

We had a guy who more or less appointed himself manager when the previous engineering manager decided he couldn't deal with the environment anymore. His insistence on controlling everything led to a conscious decision to destroy the engineering wiki and knowledge base and force everyone to funnel through him, creating a single source of truth. Once his mind was made up on something, he would berate other engineers and team members to get what he wanted. Features stopped being developed and things began to fail chronically, and because senior leadership weren't tech people, they all deferred to him. Once they decided to officially make him engineering manager (for no reason other than that he had been on the team the longest, because people were beginning to wise up and quit the company), all but 2 of the 12-person engineering department quit, because no one wanted to work for him.

Imagine my schadenfreude after leaving that environment to find out they were forced to close after years of failing to innovate, resulting in the market catching up and passing them. Never in my adult life have I seen a company inflict so many wounds on itself and then be shocked when competitors start plucking customers off like grapes.

wpietri
For those for whom this excellent description has resonance, I strongly recommend the book, "Why Does He Do That? Inside the Minds of Angry and Controlling Men". [1] It's nominally written about domestic abuse, but its descriptions of abuser psychology and its taxonomy of abuser behaviors have been really helpful to me in a work context.

[1] https://www.amazon.com/Why-Does-He-That-Controlling-ebook/dp...

I think that's ridiculous. Pilots are correctly very reluctant to hit things. Historically, we have wanted them to do their best to avoid that.

You could argue that we should now train pilots to carefully pause and consider whether the thing they are about to hit is safe to hit. But for that, you'd have to show that the additional reaction time in avoiding collisions is really net safer. And if you did argue that, you couldn't judge the current pilots by your proposed new standard.

For those interested, by the way, in really thinking through accident retrospectives, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...

I read it just out of curiosity, but it turned out to be very applicable to software development.

craftyguy
Well, by trying to avoid hitting one thing, they hit the ground. That's (arguably?) worse than hitting just about anything else, other than perhaps another large aircraft or a missile.
goldenkey
You do realize that the ground exerts a fixed force based on the landing, while an accident at 1000 ft in a non-gliding helicopter means falling at terminal velocity to certain death?
wpietri
In this case, maybe. But you have to do a sum over all cases to prove that your proposed solution is better. Otherwise it's just a way, post facto, to blame somebody. An error which is one of the biggest topics of Dekker's book.
The path to a disaster has been compared to a tunnel [0]. You can escape from the tunnel at many points, but you may not realize it.

Trying to find the 'real cause' is a fool's errand, because there are many places and ways to avoid the outcome.

I do take your meaning; reducing speed and following well-established rules would almost certainly have saved them.

0. PDF: http://www.leonardo-in-flight.nl/PDF/FieldGuide%20to%20Human...

Amazon: https://www.amazon.com/Field-Guide-Understanding-Human-Error...

mjlee
I prefer the swiss cheese model.

https://en.wikipedia.org/wiki/Swiss_cheese_model

"In the Swiss Cheese model, an organisation's defenses against failure are modeled as a series of barriers, represented as slices of cheese. The holes in the slices represent weaknesses in individual parts of the system and are continually varying in size and position across the slices. The system produces failures when a hole in each slice momentarily aligns, permitting (in Reason's words) "a trajectory of accident opportunity", so that a hazard passes through holes in all of the slices, leading to a failure."

A series of minor failures that combine for a serious crisis seems very relatable.
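
A back-of-the-envelope sketch of the aligned-holes picture, assuming (unrealistically) that the layers fail independently; the per-layer numbers are purely illustrative:

    import random

    # Illustrative probability that a hazard slips through each barrier,
    # e.g. code review, automated tests, canary deploy, monitoring.
    layers = [0.10, 0.20, 0.05, 0.15]

    def breach_probability(miss_probs):
        # Analytic chance that a hazard finds a hole in every slice.
        p = 1.0
        for miss in miss_probs:
            p *= miss
        return p

    def simulate(miss_probs, trials=200_000):
        # Monte Carlo estimate of the same quantity.
        breaches = sum(
            all(random.random() < miss for miss in miss_probs)
            for _ in range(trials)
        )
        return breaches / trials

    print(breach_probability(layers))  # 0.00015
    print(simulate(layers))            # roughly 0.00015

The product-of-probabilities estimate is optimistic precisely because real holes are correlated and keep shifting, which is what the "momentarily aligns" language in the quote is about.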

jabl
Aha! I sense an opportunity for an enterprising statistical physicist to model this with https://en.wikipedia.org/wiki/Percolation_theory :)
civilian
Yup! My cousin and cousin-in-law work to maintain human health at some nuclear power plants in Canada. They use the Swiss cheese model, and whenever they notice that one layer missed something, they do a re-evaluation of that safety layer.
JshWright
I've never heard the tunnel analogy. Seems odd, since tunnels generally only have one entrance and one exit...

In a lot of fields where the stakes are high and mistakes are costly (aviation and emergency services are the two I'm most familiar with) the analogy used is a chain. Break any link in the chain and you prevent the event.

https://www.aopa.org/asf/publications/inst_reports2.cfm?arti...

blattimwind
> I've never heard the tunnel analogy. Seems odd, since tunnels generally only have one entrance and one exit...

Road and train tunnels have many emergency exits...

JshWright
Generally they have a parallel tunnel (either for traffic in the other direction, or specifically for emergency egress), and connections between them. They still take you to the same general place.
Piskvorrr
Urban tunnels have many regular entries and exits, as well as emergency exits, not to speak of emergency crossovers.
csours
It's the perception tunnel. There are many branches, but only one route is followed. In hindsight, you can see all the branches and call out everything that went wrong.
tantalor
Normally that is called "tunnel vision".

https://en.wiktionary.org/wiki/tunnel_vision

(figuratively) The tendency to focus one's attention on one specific idea or viewpoint, to the exclusion of everything else; a one-track mind.

mustacheemperor
I think the idea is that if you are stuck in a dark train tunnel and there's a train coming towards you, there may be doorways, recesses, etc in the walls that are invisible without the proper tools (flashlight, etc).
mostlyskeptical
As we say when teaching new riders to ride a motorcycle, a crash is often an intersection of factors; removing even one of those factors would likely have prevented it.
pbhjpbhj
I've noticed a lot of learners (e.g. in YouTube videos) crash because they grab at the handlebars, grabbing the throttle and inadvertently accelerating.

Is there a reason the throttle can't be reversed so that grabbing would reduce throttle and slow the bike?

(FWIW, I'm a fully licensed motorcyclist, don't currently ride.)

drdrey
Are you recommending the 2nd edition specifically? There is a 3rd edition available: https://www.amazon.com/Field-Guide-Understanding-Human-Error...
csours
No, I just didn't notice. (Speaking of confusing UI, Amazon seems to do everything possible to move UI elements, small and large, around on a basis so arbitrary as to seem completely random.)
I am in favor of commentary, but his comment only makes sense as hindsight. If he had posted it beforehand, he would mostly be wrong, because AirBnB mostly works. If he had posted it after any of the many successful outcomes, he would look dumb.

There is no reason to say that these people "got it wrong". They were unlucky. Suppose the same shitheels broke a window, climbed in, unlocked the door, and had a big party on a weekend when the owners were away. One inclined to superiority-by-hindsight could say, "Well duh, why didn't they have bars on their windows?"

After a rare negative occurrence, one can always look back with hindsight, find some way the bad outcome could theoretically have been averted, and then say, "Well duh." Always. It is a great way to sound and feel smart. But it never actually fixes anything. Indeed, it can prevent the fixing of things because, having blamed someone, we mostly stop looking for useful lessons to learn.

If you want the book-length version of this, Sidney Dekker's "Field Guide to Understanding Human Error" has a great explanation of why retrospective blame ends up being immensely harmful: http://www.amazon.com/Field-Guide-Understanding-Human-Error/...

mikeash
"Why didn't they have bars on their windows?" could be a reasonable criticism or it could be pure hindsight. Which one depends on whether the person stating it was faced with a similar decision and decided in the direction they advise. If I've evaluated the risks and decided that putting bars on my windows is wise, and I did it, then it's totally reasonable for me to criticize people who didn't and subsequently got their house broken into. That's not hindsight, that's having foresight, and criticizing other people for not having it.

That's relevant because the AirBnB question is one that a lot of us have actually thought about and decided on. I (and presumably the other guy above) had the foresight to realize that renting my house on AirBnB was not wise. Pointing out the natural consequences of making what we see as the unwise choice is not hindsight, because we made that decision beforehand.

I don't see how this particular instance doesn't fix things, or prevents fixing things. Convincing people not to rent their houses on AirBnB if they care about said houses is a fix! It's a really good fix! It doesn't fix the problem of "AirBnB guests can sometimes cause major damage to their accommodations" but it does fix "AirBnB guests cause emotional violation by trashing a person's primary dwelling."

"Don't rent your home to strangers from the internet" is not searching in hindsight for some way the problem could have been avoided, and using that to unjustly criticize. It's a completely rational approach to life that a lot of people have been following for a long time.

wpietri
Your theory here seems to be that one person in the world having an extremely negative outcome validates basically all negative risk perceptions. I think that's bunk. It makes no more sense than saying that because one person wins the lottery, it validates everybody who buys a ticket and loses. To evaluate risk and reward, you have to look at baskets of outcomes, not crazy outliers.

Suppose he lent his house to a friend who trashed it? Suppose he gave keys to a cleaning service and some reprobate there stole the keys and trashed the house? Suppose an earthquake destroyed his house? No matter what happens, there's always a way to blame people after the fact.

The way this discourages fixing things is that blame never fixes anything. Blame isn't a solution, it's another problem.

If somebody wants to do an integrated risk analysis on AirBnB and make some predictions on that, great. I'd love to read it. But comments like the one I call out are predicting the past. They're not just worthless, they're harmful to useful dialog.

mikeash
What exactly is the difference between "blame" and analyzing a past problem with the idea of avoiding similar problems in the future, and what makes you put this particular statement on the "blame" side of it?
wpietri
From Wikipedia: "Blame is the act of censuring, holding responsible, making negative statements about an individual or group that their action or actions are socially or morally irresponsible."

This guy didn't "analyze a problem". There was zero intellectual contribution. He called people stupid for taking a risk and then being unhappy when they got a 1-in-a-million negative outcome.

If you want a longer version of why blame impedes risk analysis, try the book I recommended upthread. (Or pretty much any book on retrospectives will have a shorter version.) But the short version is that you basically have a choice between the emotional activity of blame or the analytical activity of analysis. Because humans.

We can't have a discussion about the human factors in automated systems without talking about Sidney Dekker's book The Field Guide to Understanding Human Error:

http://www.amazon.com/Field-Guide-Understanding-Human-Error/...

Fantastic read about the futility of placing blame on a single human in a catastrophe like this. It makes a strong case for why more automation often causes more work. Definitely worth checking out, Etsy has applied it to their engineering work by using it to facilitate blameless post mortems:

http://codeascraft.com/2012/05/22/blameless-postmortems/

gmu3
Nicholas Carr's new book is also about the dangers of too much automation: http://www.amazon.com/Glass-Cage-Automation-Us/dp/0393240762
dllthomas
Seems related to http://www.ctlab.org/documents/How%20Complex%20Systems%20Fai...
barrkel
It's worth bearing in mind that the accident rate is a fifth of what it was, according to the article.

The automation may have created new dangers, but it probably reduced more common errors.

HN Books is an independent project and is not operated by Y Combinator or Amazon.com.