HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Mastering Outages with Incident Command for DevOps: Learning from the Fire Department

IT Revolution · Youtube · 63 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention IT Revolution's video "Mastering Outages with Incident Command for DevOps: Learning from the Fire Department".
Youtube Summary
DOES18 Las Vegas — Leading companies such as Google, PagerDuty, and Atlassian have developed successful major incident management practices based on the Incident Command System (ICS), which was first developed by fire departments. We can learn from these organizations, where managing emergencies is a core capability.

Mastering Outages with Incident Command for DevOps: Learning from the Fire Department

Brent Chapman, Principal, Great Circle Associates

Brent Chapman is an expert at emergency management, and at guiding organizations to prepare for and learn from emergencies, working from a strong background in IT infrastructure and site reliability engineering (SRE).

As a leader in Google’s legendary SRE organization, Brent convinced senior management of the need to strengthen and standardize the company’s incident management practices, and created the Incident Management at Google (IMAG) system that is now used throughout the company. He also helped refine the Postmortems at Google (PMAG) system that the company uses to learn from incidents large and small.

Brent brings a unique perspective to his work in IT, as a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events such as Burning Man, and a Community Emergency Response Team (CERT) member and instructor.

Throughout his career, Brent has designed, built, managed, and scaled IT infrastructure and teams for everything from embryonic startups to giants such as Google, Apple, and Netflix. He is the coauthor of the highly regarded O’Reilly book Building Internet Firewalls, the developer of widely used open source software, and a popular speaker at conferences worldwide. He has worked with dozens of organizations both in Silicon Valley and around the world, as well as with a variety of non-profit and government entities.

Brent has a rare combination of experience as an emergency manager, technology manager, people manager, software developer, network/systems engineer, and educator. Now, he shares that expertise worldwide with clients as the founder and principal of Great Circle Associates, Inc.

DOES18 Las Vegas
DOES 2018 US
DevOps Enterprise Summit 2018
https://events.itrevolution.com/us/

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Nov 29, 2018 · 63 points, 11 comments · submitted by kiyanwang
janzer
For anyone wanting to dig into the ICS, the authoritative introduction is the NIMS IS-100[1] course. Be aware that it's your standard government-produced course, i.e. not entertaining in any way, but it generally manages to get the point across.

If I understand/remember correctly, IS-100 and 200 are the required courses on ICS (along with 700, 800 for NIMS itself) for all frontline firefighters on any fire department wanting to receive federal money.

1. https://training.fema.gov/is/courseoverview.aspx?code=IS-100...

mbubb
Yes, these are dry... I took them as part of a CERT org a few years ago and retook them for the EMS operations chapter of a class I'm in right now.
mbubb
Coincidentally, I am finishing up an EMT class and prepping for the NREMT test. I have been very struck by the similarity between the differential diagnosis process an EMT uses and troubleshooting a server or network issue. The problems are similar too when you get tunnel vision (It must be asthma/It must be the DNS resolver... etc). Nice video!
monkmartinez
ICS seems like a natural fit for so many professions. I wonder if there is a future for a Fire Officer to teach/adapt ICS for <insert profession>?
deadmanwalking
I know of at least one company that has an Ex-fire chief from the US as well as other services specifically to teach IC to IT and other areas - http://www.blackrock3.com/

Really useful training, have had to use it multiple times to coordinate responses to ransomware and other IT disasters, and it does benefit from the buy-in of senior leadership at the company I work for.

Biggest challenge we have is keeping ICs: after deciding to centralise the IC organisation in 3 locations, all ICs were offered the choice to relocate or leave.

We have a lot of very new ICs.

chanandler_bong
I had a 10-year background in fire/EMS before getting into tech, and have "sneaked in" ICS concepts and practices to several of my employers and teams.

I tried to introduce the ICS concepts formally and up-front, but met a lot of pushback; "we don't need that", "we're not dealing with fires" or "it's too complicated".

By using ICS principles without calling them such, people usually see the value. I even got a sizable promotion and raise due to my "clear and concise handling of several serious incidents and putting procedures in place to handle similar in the future". All I did was direct people into ICS functions and act as an IC.

I agree that ICS concepts should be used more widely and outside of just emergency services, but getting past the "stigma" of the title is the hard part.

mlosapio
Volunteer Firefighter and SRE here.

ICS is crucial in any and all of our incidents and should be the model on how any disaster is handled

nodesocket
"While I go work with Slack for a couple years..."

Can't be encouraging for Slack, couple of years. But, that's the deal now, people leave tech jobs every 1-5 years for something better or start their own company.

sokoloff
If tech companies want employees to stay longer, they need to work to make that the best option for their employees. If they don’t, we shouldn’t be surprised that employees leave to take a better option.
brentchapman
I’m the speaker in that video. Historically, on average, I’ve stayed at each of my employers for about 2 years. Google was a big exception, as I was there for almost 6 years, but partly that was because I had 3 fairly different roles during the time I was there.

So far, a month in, I’m loving working for Slack; it’s a great company, and an excellent group of people. So there’s every possibility that I’ll be there longer than 2 years!

bmv
test
KineticLensman
ICS like this is already well developed for dealing with cyber incidents, e.g. malware incursion or data exfiltration. A really well thought through example is the NIST cyber security framework [0], which defines a cyber defensive lifecycle (identify, protect, detect, respond and recover). At least in the UK, this lifecycle has been adopted by Critical National Infrastructure (CNI) organisations such as electricity generation, transmission and distribution.

The key to successful incident response is to design, agree and test the ICS processes, roles and Command and Control (C2) hierarchies before incidents occur, capturing the results in an Incident Response Plan. The IRP will typically involve standing up an incident response team when an incident reaches a pre-defined severity level. Incident response teams are often structured into bronze, silver and gold levels of command (gold typically including Cxx individuals such as CIO and CFO, not just DevOps roles) that temporarily replace Business as Usual (BAU) management with incident C2. In the UK at least, the bronze/silver/gold hierarchy was developed by the ‘blue-light’ services (police, fire and ambulance), and is itself a simplified version of military C2 chains of command.

If management don't buy into this type of approach, it doesn't bode well for an organisation's ability to deal with a crisis.

[0] https://www.nist.gov/cyberframework

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.