Hacker News Comments on "Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017" GOTO Conferences Youtube Video

Rankings: this week · month (apr/may) · year (2024) · all time

digests · search

Hacker News Comments on
Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017

GOTO Conferences · Youtube · 8 HN points · 12 HN comments

HN Theater has aggregated all Hacker News stories and comments that mention GOTO Conferences's video "Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017".

Youtube Summary

This presentation was recorded at GOTO Chicago 2017. #GOTOcon #GOTOchgo
http://gotochgo.com

Bryan Cantrill - Chief Technology Officer at Joyent

ABSTRACT
As software is increasingly developed to be deployed as part of a service, the manifestations of defects have changed: the effects of broken software are increasingly unlikely to be felt by merely one user, but many (or even all) -- with concomitant commercial consequences. Debugging service [...]

Download slides and read the full abstract here:
https://gotochgo.com/2017/sessions/86

https://twitter.com/gotochgo
https://www.facebook.com/GOTOConference
http://gotocon.com
#Debugging #DebuggingUnderFire

Looking for a unique learning experience?
Attend the next GOTO Conference near you! Get your ticket at http://gotocon.com

SUBSCRIBE TO OUR CHANNEL - new videos posted almost daily.
https://www.youtube.com/user/GotoConferences/?sub_confirmation=1

HN Theater Rankings

This course is unranked · view top recommended courses

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.

⬐

Nov 16, 2022 · vaidhy on Amazon confirms corporate staff cuts that could hit 10k employees

If it were so obvious, I am wondering why you need to worry about doing it wrong. You talk about trade-offs as if they are fixed in time. That is not the case at all in anything but toy systems. Maybe you need to find someone who can design a quality micro-service?
You have no idea what I do and what my competencies are, yet, you are assuming you know a lot more about system design than I do and I am missing something simple.
Here is another take from Bryan Cantrill [https://www.youtube.com/watch?v=30jNsCVLpAE&t=1413s]. Very likely, he does not know anything about services either, I guess.

⬐ matai_kolila
Can you show me where I said "building microservices is obvious"? Or if you don't believe I said that, can you describe what you think it is I'm suggesting here? Because my intent was to say, "It is obvious that microservices can be designed well and designed poorly." In fact, if you'll observe, I removed the word "microservices" from my final statement, to demonstrate that such a statement is trivially true for any given thing.
Nowhere did I say, "It is obvious how to design a good microservice." but it seems like you're arguing against that statement, not the one I made.
And for what it's worth, I don't give a hoot who you are, who Bryan Cantrill is, or what your supposed competencies are; make your argument, don't rely on your pedigree to speak for you. That should be obvious.

⬐

Aug 09, 2022 · OmarAssadi on The Story of Mel (1983)

I’m surprised I haven’t seen this linked yet, but Bryan Cantrill of dtrace, Sun, lawnmower, Joyent, etc, fame gave an amazing talk for Monktoberfest 2016, titled “Oral Tradition in Software Engineering”, which features The Story of Mel [1]. Highly recommend checking it out — there are loads of little gems and stories like this throughout.
All of his other presentations are great too and definitely worth a listen if you like this sort of thing [2]. A couple of my favorites are “Fork Yeah! The Rise and Development of Illumos” [3] and “Debugging Under Fire: Keep your Head when Systems have Lost their Mind” [4].
[1]: https://youtu.be/4PaWFYm0kEw?t=644
[2]: http://dtrace.org/blogs/bmc/2018/02/03/talks/
[3]: https://youtu.be/-zRN7XLCRhc
[4]: https://youtu.be/30jNsCVLpAE

⬐ int0x2e
Bryan Cantrill's talks are some the best I've ever seen. I've always tried sharing them around with colleagues (with limited success, but still worth it in my view...)

⬐ jasonladuke0311
"Don't fall into the trap of anthropomorphizing Larry Ellison" is one of the funniest things I've ever heard: https://youtu.be/-zRN7XLCRhc?t=2302

⬐

Jul 01, 2022 · KronisLV on Upptime/upptime: Uptime monitor and status page powered by GitHub

> All I say is that you must relay on externals if A-E are all on the same network as it may go down.
Thankfully, it's not too hard to take advantage of multiple networks in a hybrid/multi-cloud setup nowadays! Though, depending on the necessary access controls and auditing, such a setup might require slightly more work.
You do bring up an excellent point, though, about how it's a serious single point of failure in many systems out there, because personally I've also seen many setups like that (the majority of them, actually): I do suspect that in many cases that is indeed done for ease of use/convenience, even if it may lead to downtime.
Of course, in some cases downtime is acceptable, so I cannot argue that it can also make sense to choose such a simpler setup - for example, for having your own company's applications/monitoring for development environments all on the same network.
Though if this topology is retained at scale, things can get a bit interesting. On a similar note, I recall Bryan Cantrill doing an interesting presentation "Debugging Under Fire: Keep your Head when Systems have Lost their Mind" that talked about restarting their whole data center and the implications of that: https://youtu.be/30jNsCVLpAE

⬐

Mar 22, 2022 · wingmanjd on Hackers claim to have breached Okta systems

I suppose it all depends on how much infra needs to be stood up for the absolute necessities of the business to operate. Does the company need that internal ticketing system in place to process external client transactions? Probably not, but it'll need it eventually (so maybe that moves to 2nd tier restore process?). My company's RTO is 24hrs to processing new client transactions. Restoring old ones will definitely take longer, but at least new ones can proceed.
If your own company's RTO is 2w, that sounds like a lot needs to be in place. Part of the business continuity/ disaster recovery is getting management to sign off on those types of numbers, big or small. Make sure they're realistic.
You're right that this type of recovery is not fun. Bryan Cantrill gives a great presentation about managing an outage (https://www.youtube.com/watch?v=30jNsCVLpAE). One of my biggest takeaways, if you're looking at a sweeping outage and a long haul of a recovery, do sleep management asap with your team. Dead tired people are more likely to make brain dead decisions.

⬐ bogomipz
What a great link, thanks for sharing.
>"Dead tired people are more likely to make brain dead decisions"
Indeed. In reading the post-mortem on the recent multi-day Roblox outage, it's hard not to imagine some bad decisions were made that only made problems worse and that likely these were on account of people just being fried by lack of sleep:
https://blog.roblox.com/2022/01/roblox-return-to-service-10-...

⬐

Jan 24, 2022 · OmarAssadi on Famous outages along with deep postmortems?

Perhaps not famous, but Bryan Cantrill, who gives my favorite talks, has an interesting and funny talk on one of the Joyent outages: https://youtu.be/30jNsCVLpAE

⬐

Oct 18, 2021 · KronisLV on Is it time to rewrite the Operating System in Rust? (2018) [video]

I'll give it a watch and will summarize, because while i'm not always on board with the Rust hype (and therefore want to know more, to help eliminate my biases, one way or the other).
That said, i rather enjoyed Bryan Cantrill's talk from 2017, "Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017": https://youtu.be/30jNsCVLpAE
So i wouldn't necessarily turn away from any of his videos just because of the sometimes humorous or not awfully serious tone.

⬐ KronisLV

Okay, so here's a slightly delayed summary (had to fix some prod issues):

  - a discussion that the parent commenter took an issue with: "What's software? What's hardware? It's hard to answer that."
  -   essentially, an OS is a program that abstracts away hardware
  -   kernel: a piece of the OS that runs with the highest level of privileges, but only a part of the OS
  -   the OS as a whole includes libraries, daemons etc.
  - expansion on the history of OSes and how we got to where we are now
  -   developing OSes isn't always lucrative, you don't hear about the innovative companies that didn't survive
  -   mentions of https://en.wikipedia.org/wiki/Second-system_effect
  -   a brief story about how trying to outsource the PL/I compiler wasn't a good idea
  -   the Unix approach was way more organic in comparison to PL/I, less of a waterfall
  - a little bit about programming languages
  -   a little bit about the history of C and how it wasn't created the exact time as Unix
  -   some words about languages that now seem esoteric, like https://en.wikipedia.org/wiki/Language_H
  -   thoughts on the imporance of macros and the C preprocessor
  - more about OSes in the 1990s
  -   languages like C++ and Java got more popular
  -   many of the OSes of the time suffered from the aforementioned second system effort, were overcomplicated
  -   oftentimes overcomplication also lead to great resource usage with little tangible benefit
  -   with the arrival of Linux, the C based OSes became more entrenched
  -   at the same time, the actual languages that focused on the ease of development (Java, Python, Ruby) also gained popularity, though in a different context
  - software systems in 2010s
  -   without a doubt, it's nice to be able to use higher level abstractions
  -   Node.js got surprisingly popular due to a high-performance runtime with the aforementioned benefits
  -   Go was also developed, though its garbage collector is mentioned as a problem here, because it makes C interop harder
  -   a bit of elaboration about GC and the problems with it, how easy it is to have a reference into a huge graph
  -   essentially, it has its use cases, but at the same time there are problems it's just not suited for (a certain class of software)
  - how Bryan got started with Rust and a bit about it
  -   initially he didn't want to go back to C++, because of a variety of past issues
  -   he got increasingly interested in Rust and its potential benefits
  -   curiously, there is a category of people who are curious about Rust, but haven't actually written any
  -   it's nice to have a language that's built around safety, parallelism and speed
  - more about Rust
  -   its ownership system allows for the power of garbage collection, but the performance of manual memory management
  -   being able to determine when a memory object is no longer in use and being able to do so statically is like a super power
  -   the compiler itself just feels really *friendly* by pointing you to directly where the problems are
  -   composing software becomes easier all of the sudden, when compared to C, since it's hard to get it right there
  -   going back to C or porting C software can actually be somewhat difficult because of unclear ownership
  -   some of the Rust performance gains are actually from good implementation of language internals, like them using b-trees
  -   algebraic types are also nice to have and the FFI in Rust is really well thought out
  -   there's also the "unsafe" keyword which allows loosening the security guarantees when necessary
  - about OS development and Rust
  -   no-one cares about how easy OS components were to develop or how long they took to develop, everyone just wants things to work
  -   a bit of information about having to deal with failed memory allocations, design discussions, language features etc.
  -   lots of OS projects out there in Rust
  -   however, writing your own OS essentially forsakes Linux binary compatibility, so a lot of software won't run anymore
  -   you have to consider what is the actual advantage of rewriting existing software in Rust, safety alone might not be enough
  -   a callback to the fact that an OS is more than just the kernel! you could rewrite systemd in Rust and other pieces of software, however not all software is a good candidate for being rewritten
  -   firmware in user space (e.g. OpenBMC) could probably benefit greatly from Rust as well, in addition to just having open software in the first place

tl;dr - Rust is promising, yet isn't a silver bullet. That said, there are certainly domains and perhaps particular OS components which could benefit from the safety that Rust provides, especially because its performance is also good!

⬐ guerrilla
Wow, impressive notes. Thank you.
That reminds me of the GNU coreutils being rewritten in Rust: https://lib.rs/crates/coreutils

⬐ puzzledobserver
Haven't watched the video, but I recently watched Timothy Roscoe's OSDI keynote on operating systems and hardware [0], where he argues for the operating system to include all of the kernel, device drivers and blobs that manage the SOC. He points out that the system emerging from the interaction of these components is complex and essentially unarchitected.
After watching his talk, I wanted to figure out how a computer, say a Raspberry PI, _really_ works. And build an operating system for it. Maybe in Rust?
[0] https://www.usenix.org/conference/osdi21/presentation/fri-ke...

⬐

Aug 19, 2021 · e12e on Harbormaster: Anti-Kubernetes for your personal server

Not (I think) the exacttalk/blog post gp was thinking of - but worth watching IMNHO:
"Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017" https://youtu.be/30jNsCVLpAE
Ed: oh, here we go I think?
> Running Aground: Debugging Docker in Production Bryan Cantrill19,102 views16 Jan 2018 Talk originally given at DockerCon '15, which (despite being a popular presentation and still broadly current) Docker Inc. has elected to delist.
https://www.youtube.com/watch?v=AdMqCUhvRz8

⬐ thor_molecules
awesome, thanks!

Post Hype Microservices with Bryan Cantrill

Apr 28, 2021 · 4 points, 0 comments · submitted by dralley

⬐

Feb 26, 2021 · KronisLV on Google admits Kubernetes container tech is too complex

> Joyent famously took down their whole region by rebooting wrong nodes.
In case anyone's interested, here's a pretty funny and educational talk by Bryan Cantrill about that particular incident:
GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind
https://www.youtube.com/watch?v=30jNsCVLpAE

⬐

Dec 14, 2020 · znpy on Google outage – resolved

It reminds me of this: https://www.youtube.com/watch?v=30jNsCVLpAE -- "GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill"

⬐

Sep 24, 2020 · pdkl95 on They're deleting my channel, but they don't know why? [video]

> I run a YouTube channel that is over 10 years old, has over 770,000 subscribers, almost 600 public videos, 120,000,000 views
When writing software involved in managing a live, public, massively multi-user system, the traditional unix-style commands that are immediate, often silent, and capable of damaging effects become a really easy way to shoot yourself in the foot. Worse, some commands might let you accidentally shoot everyone's foot on a typo. The traditional example is accidentally typing something like "rm -f * .bak" (note the extra space after the star).
For a good discussion of this type of problem, I highly recommend Bryan Cantrill's talk[1] about the time an operator accidentally rebooted an entire datacenter with a single miss-typed command.
The general solution to this is building sanity checks into the software. The user just asked to reformat 500 hosts, but almost all previous uses olf the 'reformat' command affected less than 10 hosts. Maybe we should ask for verification from an actual human if they really intended to run this unusually destructive command.
Why doesn't YouTube have this kind of sanity check in their automated takedown/strike/channel-deletion tools? Google wrote automation that can decide to delete a channel with a long history and many successful videos. Why doesn't that automation have basic sanity checks that ask for operator input when asked to do an unusually destructive action like deleting a 10 year old channel with a huge history?
[1] https://www.youtube.com/watch?v=30jNsCVLpAE

⬐ zaroth
Good luck checking in code that will increase the human burden and reduce automation. Imagine how many TPS reports you would have to fill out to get permission to make a system page a human more often at “Google scale”.
When the metrics finally hit the dashboards sounds like it would be career limiting.
“But we prevented pdkl95’s channel from being deleted!” is definitely going to be a solid defense for why you hobbled their infallible AI from ruthlessly executing its system oversight responsibility.

⬐ mikenew
And yet they've created a widespread perception that YouTube mistreats it's creators, and we see stories like this all the time now. Surely someone at Google is capable of taking the long view. And if the culture at Google prevents that then Google is broken.

⬐ motoboi
Every time a person come close the have a long time view, they get senior enough to leave the company or the product team. All companies have a tendency to lost bearing of its mission and focus on the wellbeing of the company owners or executives.

⬐ Sophistifunk
Youtube runs the way it runs because they don't care if a bunch of creators get hosed. All they care about is keeping the copyright wolves at bay.

⬐ Crosseye_Jack
Under DCMA, on receiving a valid counter notification they are free to put the content back on the site, remove the strike from the content uploader and let the courts handle the matter between the two parties.
https://twitter.com/leonardjfrench/status/130794817037951385...
It’s only if Google “ignore” valid dmca takedown notifications they are at risk of losing their safe harbour exceptions. How they handle the counter notifications can be pretty hit and miss. (but tbf, most of the time it’s because the person filing the counter notification didn’t fill it out exactly how Google’s bots like) I’ve seen cases where YT just automatically removed the strike from the account and restored the video on receipt of the counter notice. I’ve seen cases where they have locked the account even though the person filing the DMCA notifications has publicly admitted to taking down the video because it was critical of the work and the video only contained small pieces of the original content (exactly what you would expect to be covered under the criticism and comment portions of fair use exception).

⬐ rlayton2
I don't know, but I've read that the counter here is that YouTube's system isn't actually DCMA - they are pursuing copyright over and above that system, and therefore the "protections" in DCMA don't apply - you broke a Google policy, not a law, and therefore you have no recourse.

⬐ Crosseye_Jack
There is "copyright take down requests" which are DMCA takedowns and gives the uploader strikes against their account and there is Googles own copyright system (Content ID) which Google would rather have right holders use (as they don't have to issue strikes to the users) that allow the claimant leave the video up but get the analyics from the video, claim the ad revunue from the video, mute the video (in cases where its an audio claim) or remove the video.
When you get a claim on the latter system (Content ID) you don't get a strike on your account but if you appeal the claim its (normally) upto the claimant to decide if your appeal is valid or not (sometimes YT does step in an say "yeah its fair use, have your ad rev back" but that is not the norm).
These takedowns are actual DMCA takedown requests. (even TeamYouTube are telling him to issue (valid) counter notifications to these claims - https://twitter.com/TeamYouTube/status/1306733040211824645)

⬐ AstralStorm
They're not afraid of DMCA, but big money pulling out if they're insufficiently aggressive about claims...

⬐ Crosseye_Jack
Which is why they have Content ID. Big money can use the system to claim the ad rev / remove the content (without giving the user a strike) then putting the onus on the uploader of the video to then take the fight to the claimant for breaching their rights.

⬐ kords
They have ways to detect the "unusual". If I fill a form really fast, or something is weird in the state of my browser I get prompted to select crosswalks, road-signs or hydrants. It will be great if they would apply them on their algorithms too, when an "unusual" result occurs.

⬐ bigiain
I'm now imagining a newly emergent dark pattern, where when you try to delete some expensive cloud resource, but before it'll stop charging you you get into one of those insane recaptcha loops that keeps saying "Click on all the Lovecraftian Gods that drive you insane just by looking at them!" or "type the text" where the displayed text is all in Druidic runes or something...

⬐ noir_lord
Sounds like the Land of Azure.

⬐ tatersolid
This is actually a pretty effective control. I almost deleted the wrong storage account in Azure the the other day.
We do not generally allow deletions of stateful resources to be done via automation after being bitten, and use the Azure lock mechanism.
We are now also changing our naming standards from “component-environment” to “environment-component”. In this example I almost deleted “app-prd-sa” instead of “app-dev-sa”. Much harder to do when you lead the name with “production”.
Anybody have “safer” naming conventions in use out there that I should be aware of? Didn’t find much authoritative out there via search; naming things is hard.

⬐ Bedon292
Reminds me of when s3 went down in 2017, and took a chunk of the internet with it (to include their own status dashboard) due to someone entering a command wrong.
https://aws.amazon.com/message/41926/

⬐ mattl
Other than websites and maybe some web-ish apps not too much else would be affected by S3 outage, a DNS outage would affect more of the Internet vs the web.

⬐ CreepGin
Websites and web-ish apps are not a chunk of the internet? Also there are tons of non-web-ish apps that depend on S3.

⬐ dylan604
i work on multiple projects that are 100% data crunching with content stored on S3 and accessed solely by ec2 instances. there is no UI or public access. yet if S3 went down, we'd be up a certain creek without a certain self-propelling device and a big whole in the boat without a bucket (physically and in the cloud)

⬐ megous
All of Google was gone from my country due to bungled router configuration for half a day some time ago. Now that was fun to watch. Everything from 8.8.8.8 to all their backend APIs for android, to all their web properties, went poof.

⬐ mattl
Yikes. Time to use a few different providers?

⬐ eru
There's actually quite a few enterprise client of S3. Lots in finance, too.

⬐ Bedon292
Maybe "the internet" was a bit too generic. But it destroyed a lot of tech worker productivity for the day at least.
Slack, GitHub, GitLab, BitBucket, Docker, etc all had issues.
https://venturebeat.com/2017/02/28/aws-is-investigating-s3-i...

⬐ ckozlowski
I was just coming to see if someone mentioned this! I work for AWS, and it's both surprising and heartening to see how often I see this anecdote get repeated, and many times by us to our own customers. We had teams looking for weeks at all of our other services for other instances of this.
It was a good operational lesson and one I'm happy to see shared still. If a command will let you perform a self-inflicted wound like that without checks, then it's time to review that command. Err on the side of caution and bias towards minimizing blast radius even if it means sacrificing some speed. "Move fast and break things" may be true at times, but not when the "*" character is ever involved. =)

⬐ Bedon292
Yeah its an interesting one. I tend to think of it in the context of, well you screwed up and deleted the wrong instance, but at least it wasn't that bad.
I also think about the response to it. Sure it sucked that it happened, and someone was probably feeling really bad about it. But it was a learning opportunity as well. Not just for them, or AWS, but for everyone in tech. Put those safety nets in place.

⬐ josmala
It is my personal opinion that you don't need "*" character ever there are other alternatives available always. And if avoiding that character makes it far safer I'm all for it.
rm -rf /+(?)
If anyone not understanding the joke, + character matches 1 or more instances, and (?) matches any one character. And / is start of entire filesystem. rm is remove. -rf (recursive, never ask verification). So it removes every file that you have rights to remove.
You can also do same thing accidentally by proxy. When my first computing teacher asked entire class to write anything to terminal to show that you had to write specific commands and computers don't understand normal language. Then she told to press enter. I asked her if she was sure. She asked why. I said that I wrote "format c:" . I have never seen teacher walk so fast in a classroom. I think that was last time she used that line in the classroom.

⬐ tmpz22
Because building it won’t get you a promotion at google.

⬐ eru
That's what we called 'promotion oriented programming'.

⬐ lmilcin
Yeah. If the channel owner has spent so much effort it would be fair to expect actual human to spend literally a minute to verify it is legit removal case.

⬐ pwillia7
Because that would be sprints not spent making more money for Alphabet shareholders

⬐ moltar
But that’s not true though. Creators create content and make money with ads. No creators = no ad platform.

⬐ edoceo
The number of creators harmed is less than the number of other creators, new creators and other factors. Loss of 0.00001% of the content on YT is, literally, no big deal. You, me, all other creators just don't matter.
That is BigG just don't give a f---.

⬐ eru
Not sure your logic holds.
If Google is so big that 0.00001% loss of revenue doesn't register for them, they are surely big enough to spare 0.00001% of total engineering time to fix that loss?
In fact, I hear that as a common complaint that when working as an engineer at Google most of the time you are just making some system fractions of a percent more efficient.
Just to be sure, I am not saying that either your premise or the statement you are inferring is wrong. No opinion on that. I'm just saying that your premises don't lead to the conclusion. (But your conclusion might be right for other reasons.)

⬐ edoceo
I'm saying that a loss of content does NOT create a loss of revenue.

⬐ eru
OK, then it makes sense.

⬐ nxpnsv
Weigh the cost of false positive against false negative. In the current landscape YT has little incentive to change.

⬐ hahajk
and if you start implementing these sanity checks, they better be everywhere. Operators will start to rely on them and might become careless - they believe the system will catch any egregious typo. The one spot where the sanity check isn’t implemented will become more vulnerable.
⬐ Ensorceled
How could some sanity checks be worse than NO sanity checks at all? In the YouTube case, I mean. This is happening ALL the time.
⬐ lopmotr
Many of the cases we hear about don't seem to be mistakes, including this one. Sometimes nobody can figure out any possible reason for a ban and sometimes it gets reinstated (that's the undo feature in action). But often there are obvious copyright issues or offensive material and the channel owner only goes clutching at fair use or freedom of speech after the ship has sailed.

⬐ hamandcheese
For the same reason human babysitters of autonomous vehicles don't succeed in preventing pedestrians from getting run over.

⬐ Ensorceled
Right, but a human babysitter is better than NO babysitter.

⬐ dylan604
That's the same mentality of people stating because one form of pollution reduction or green energy generation isn't solving the problem that it's not worth attempting to do what you can when you can where you can. Those small moves are what gets the thing moving.

⬐ lmm
An inconsistently applied sanity check creates a false sense of security. Like how a safety barrier that isn't strong enough to take someone's weight might be worse than no barrier.
⬐ Ensorceled
Right, I understand the concept. What I'm saying is that a safety barrier that sometimes doesn't work is better than rails that always guide you off a cliff; which is currently the YouTube model.

⬐ fiddlerwoaroof
Yeah, some people recommend something like
    alias rm=‘rm -i’
Unfortunately, as soon as you’re on another computer or at someone else’s shell, you’ll be more comfortable thinking less about rm-ing things and the sanity check won’t be there.
⬐ pwg
The better recommendation is to alias rm -i to del:
    alias del='rm -i'
And then get in the habit of using 'del' unless you really want a no-confirmation delete.
Then when you are on another computer or at someone else's shell you get:
    del: command not found
Instead of having a bunch of files silently deleted.
⬐ thaumasiotes
This strategy will fail pretty badly if you're on someone else's computer using Windows or DOS, where 'del' is the delete command.

⬐ happymellon
Does `del -rf /` do a lot on Windows?

⬐ thaumasiotes
I doubt it, but `del *` should do the same thing on both systems.

⬐ pwg
Except that on windows, "del" defaults to prompting "are you sure" when one types "del *".

⬐ fiddlerwoaroof
Another trick is to run
    touch —- -i 
In important directories. Then rm -rf * expands to rm -rf .... -i ... and you get the “are you sure” prompt.
⬐ tgtweak
Where is Bryan anyway he's been unusually quiet and I know he lurks on HN.
bcantrill where art thou

⬐ cellularmitosis
Have you heard about Oxide Computer Company? https://oxide.computer/team/

⬐ Tokkemon
Because writing good software is haaaarddd. /s

⬐ stickfigure
Two days ago I restored a customer's data from backup because they typed "ALL" and hit OK on a dialog that said "This will change ALL of your things, type ALL to confirm". This dialog only pops up if N > 100, and normally makes you type the number (ALL is a special case). I just do not know how to make this more idiotproof short of making them fax in the request.
No matter what you think is sane, some insane person will prove you wrong. In this case, tiredness + ESL = bad reading comprehension. It's going to happen.
⬐ laughinghan
It sounds like you found the solution, you just made it really cumbersome: undo.
All confirmation dialogs should be replaced with undo. The happy path has lower friction, and in case of a mistake they'll heave a huge sigh of relief. When possible, it's better in all cases for all users, whether novices or power users.
And many things that at first blush seem like undo isn't possible, are actually easy to make undoable with a simple tweak: deleting data? Don't actually delete it until 24 hours later. Sending an email? Wait 10 seconds to actually send it, similar to Gmail's Undo Send.

⬐ AstralStorm
Unless you're actually trying to do a huge synthetic operation where info would choke, e.g. creating an archive while removing existing files.
That one is still possible to undo, just slower and more expensive...
Then there are fun ones like Windows update holding 20 GB of insufficient undo, mechanical or hardware failures induced by extra load, and how to decide where an operation ends.

⬐ stickfigure
Implementing soft-delete is much easier than "soft-update", which is what this would have been.

⬐ laughinghan
Yeah, unfortunately in some cases if you didn't plan for it from the beginning it's not easy to tack on later.
In my opinion, it's usually worth it though. You only hear from the folks asking you to restore things from backup—you won't hear from the folks who experience unnecessary friction and tell their friends or coworkers "it's okay, it works, it's kind of annoying to use though, I can't put my finger on anything specific".

⬐ eru
You could still wait 10 seconds, and have a 'fake' undo button that aborts. (You can even put up a progress bar to pretend you are doing work during those 10 seconds.)
That's purely a UI element and is completely independent of how the actual destructive operation is implemented in the backend nor how hard it would be to reverse.

⬐ laughinghan
I don't think that would be much better than a confirmation dialog that makes you wait 10 seconds before you can click OK. It's often only after clicking around and seeing the resulting changes that it sinks in that a mistake was made, and they reach for Undo. And that would add just as much friction to the happy path.

⬐ eru
It can be much better than the confirmation dialog, because it's meant to be implemented in such a way that you can get on with the rest of your work while the undo-countdown is ticking.
From personal experience with gmail's fake undo, in terms of things sinking in, it works almost as well as regular undo for me; and not like a confirmation dialog (which doesn't work at all).
So there's less friction, there's no extra click you need to make after ten seconds. And, also from personal experience, the force-delayed confirmation dialog I've used (I think in Chrome and Firefox for certain actions), don't seem to lead me to thinking at all. At least not any better than a regular confirmation dialog.
But in any case, all these are empirical questions, and it would be interesting to run a little user study with the different options, instead of endless speculation.

⬐ stickfigure
Gmail is a pretty specific case - email is fundamentally asynchronous and "delay send" for something that's already scheduled is straightforward.
Imagine trying to apply this undo to a bulk add/remove labels operation. Once you've committed the transaction, there is no simple 'undo'. It's possible to build a system capable of undo, sure, but you're talking about a lot of upfront work and complexity. Plus a fairly exotic database schema.

⬐ laughinghan
There's nothing exotic about it, you just need an OLAP rather than OLTP database schema: https://en.wikipedia.org/wiki/OLAP_cube

⬐ eru
I don't see the problem?
I would imagine you would stick all your UI actions in something like a log, and then only apply that log to your actual data with a delay?
But not sure whether you call that 'a lot of upfront work and complexity'?
Perhaps I'm a bit blind, because I come from a part of the programming world that's very keen on persistent datastructures, where undos are trivial to implement. (https://en.wikipedia.org/wiki/Persistent_data_structure)

⬐ pdkl95
Unfortunately human error can never be completely eliminated. However, I'm not really talking about this type of problem. In my previous [1], the operator understood how the tool worked; they simply made a mistake when typing the command, and the tool without warning accepted the command to reboot the entire datacenter. Particularly telling are these comments: (sic)
    Operator-1: I ewas rebooting an rb
    Operator-1: forgot to put -n
    [...]
    Operator-5: [...] i've almost done what Operatolr-1
                just did a *number* of times.
That isn't a user understanding problem; it's a dangerous tool that doesn't fail safely. In your case, at least you detected the unusually destructive action and asked for verification. Youtube isn't even attempting simple sanity checks like your "N > 100" test.
> normally makes you type the number
> I just do not know how to make this more idiotproof
Requiring explicit typing of the number or an explicit phrase like "Yes, I want to delete everything." are can help a lot.
If possible, another good approach is to explicitly show the full list of proposed changes. Phrases like "This will change ALL of ..." might have multiple interpretations (ALL what? All of the the things in my entire account? All of the things in the current/last project/group? All of the things I think (perhaps incorrectly) were referenced in this action?). If someone is expecting to change only a few records, a confirmation popup that asks "Do you want to make these changes:" followed by a huge list has a large size/presence that should conflict with their expectations. "I only wanted to change a few things - wtf is this huge list?"
⬐ humaniania
Require a different user's authentication or admin code to approve an "ALL" transaction.

⬐ segfaultbuserr
Yes, it works.
But when the sample is large enough, it still happens once in a while. In 2004, a CSB investigation showed that an entire chemical plant exploded after the interlock was bypassed by the supervisor password [0][1].
> The explosion occurred when maintenance personnel entered a password to override computer safeguards, allowing premature opening of the sterilizer door. This caused an explosive mixture of ethylene oxide (EO) to be evacuated to the open-flame catalytic oxidizer by the chamber ventilation system. The oxidizer is used to remove EO in compliance with California air quality regulations. When the EO reached the oxidizer it ignited and the flame quickly traveled back through the ducting to the sterilizer where approximately fifty pounds of EO ignited and exploded.
Apparently the supervisor who owned the password didn't receive any training on the nature of the process and the dangers of bypassing the interlock...
[0] https://www.csb.gov/assets/1/20/sterigenics_report.pdf
[1] https://www.youtube.com/watch?v=_2UnKLm2Eag

⬐ C1sc0cat
Why would you ever ever have the ability do this "allowing premature opening of the sterilizer door".
Ironic it was air quality regulations - that did for them.

⬐ bluGill
There needs to be something in case the implementation forgot something. These are dangerous of course, but they can also save the day when something unexpected happens.

⬐ C1sc0cat
This is a physical chemical plant not a website - if you do need to do something like that you do it manually.

⬐ zentiggr
While there might be emergency conditions that would make this cumbersome, in general that sorry of two person control makes sense. That's why the military uses it for especially dangerous actions or conditions. (Weapons loading on the sub i served on, for example).

⬐ tapland
Printing a list of the fist 100 or so affected files is useful to give a real wake up call and a chanse to double check.

⬐ heavenlyblue
The bigger question I have, why did you not delete the data when asked to?
I understand idiot users, but what about users who actually want to delete it?

⬐ dannyw
GP mentioned restoring from backups. You generally don’t delete from backups outside of the normal cycling policy, because otherwise they’re not backups.

⬐ john_minsk
Don't allow the user type "all", but another guy (admin) to verify such commands.

⬐ waheoo
Reverse survivorship bias...

⬐ dbcurtis
An old office mate had been a systems programmer back in the days of mainframes. He once put a local-site patch into the mainframe boot code where the question about wiping and restoring the on-line storage required operators to type in, correctly capitalized and punctuated, the precise string: "Yes, I really do want to spend the entirety of my shift hanging tapes."
.... and someone still did it. The struggle is real.

⬐ segfaultbuserr
In electrical engineering, the saying is "Nothing is foolproof to a sufficiently talented fool."
Also, in computer folklore, there are numerous stories of how non-technical users purposefully destroy foolproof mechanisms by brute-force, e.g. cut the slot on a DDR3 socket to insert a DDR2 RAM module and fry everything... And I wonder whether "don't use brute-force, if you have difficulty getting it in, it means you are doing it wrong" should be taught as the first rule when working with hardware. Unfortunately, to add the confusion, we also have connectors that can be surprisingly hard to connect and disconnect even under normal circumstances...

⬐ Uberphallus
In my environment the adage is "when you say foolproof you mean fooldetecting".

⬐ IggleSniggle
I think this is a good analogy for software. To the typical user, some software is easy and sensical, while other is obtuse and requires significant jiggering just to do the thing it was ostensibly designed to do.

⬐ segfaultbuserr
There is a difference, however. The connectors that require a lot of force for insertion and removal have some true advantages, they are usually the simplest, cheapest and generally reliable components. Almost nothing can go wrong with a simple wire terminal, it's just a piece of rectangular or round metal. Although it can be difficult to disconnect for servicing, but you're only expected to do that once per year. On the other hand, "easy" connectors are designed in a way that, instead of requiring the mating force necessary for a good contact , it's provided by the connector mechanism itself, and as a result, they're often more complex, expensive, or fragile, such as a USB connector or a ZIF socket.
A software analogy for an "easy" connector would be, "fancy software with good user experience often has a lot of complexity hidden behind of scene, and can be fragile". But I'm not sure what would be the analogy for a cheap connector. Perhaps, a shell script?

⬐ IggleSniggle
Yes, a shell script, but for the sake of the discussion, you must imagine that to the average user, invoking a shell script is “the same” as navigating through a few settings menus and clicking some checkboxes they don’t understand: that is to say, when the UX is sufficiently lacking, users often enter “well I don’t really know what I’m doing just push through” mode.

⬐ rtx
Atleast in this instance some blame lies with computer hardware designers. Make things simple, no one puts a three the wrong way.

⬐ TeMPOraL
> And I wonder whether "don't use brute-force, if you have difficulty getting it in, it means you are doing it wrong" should be taught as the first rule when working with hardware. Unfortunately, to add the confusion, we also have connectors that can be surprisingly hard to connect and disconnect even under normal circumstances...
And everyone has likely experienced in their lives plenty of appliances, self-assembly kits and other objects where some components required application of force to put together, because there's resistance coming from the feature that prevents the object from coming apart together. My rule of thumb is now that if the force seems to be veering into "could break surrounding structure" levels, or if the thing starts making unexpected sounds, then I'm doing it wrong.
... and then I have to put a CPU on a motherboard and the correct way absolutely does involve close-to-breaking forces and squeaky sounds.

“us-east-1 is being rebooted?” (Debugging under Fire, GOTO 2017)

Jul 25, 2020 · 1 points, 0 comments · submitted by tosh

⬐

Jul 12, 2019 · merlincorey on Root cause analysis: significantly elevated error rates on 2019‑07‑10

> In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better.
Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:
> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible
[0] Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill: https://www.youtube.com/watch?v=30jNsCVLpAE

⬐

May 18, 2018 · tonyarkles on The sad state of sysadmin in the age of containers (2015)

> I'm not sure you do, since you admitted to really being at least part sysadmin, earlier :)
Hah!
I wish I could remember which talk it was. Bryan Cantrill had a good line about DevOps in one of his talks (I think it was this one: https://www.youtube.com/watch?v=30jNsCVLpAE). The gist was "you can say you're DevOps, but when the shit hits the fan, you're either going to be Dev or Ops. If you're a Dev, you're going to want to debug the problem before rebooting the failed machine. If you're Ops, you're going to want to reboot the machine as fast as you can to get it back up." Through that lens, I'm definitely pretty far over on the Dev spectrum; when something goes catastrophically wrong and someone reboots a box to "solve the problem", my first reaction is "YOU FUCKER YOU BURNED THE CORPSE"
It's such a tricky thing all around. Looking back at what I wrote earlier, I also realize that in some ways I'm facilitating the ignorance. I've got a really nice Consul and Nomad setup for the team, so they can pretty much just toss .wars and Docker containers at the cluster and they'll automatically get scheduled somewhere with spare capacity. The load balancer, the database cluster, all of the service discovery and job scheduling stuff... they've never had to get in and set any of that up. Maybe it's time to do more mentoring...
Anyway, thanks for getting me thinking. I've been a little bit grouchy lately about all of the recent experiences of people not knowing how the stuff they build actually runs. You've been a great mirror for some self-reflection :)
Edit: also, bare metal rocks.

Hacker News Comments on Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017

Hacker News Stories and Comments

Hacker News Comments on
Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017