SEV0

The Pearl, San Francisco

Maintaining blameless incident culture when everyone knows whodunnit

Andrew's talk emphasizes the importance of maintaining a blameless incident culture, focusing on system improvements rather than blaming individuals to encourage continuous learning and accountability.

  • Andrew Guenther, Member of Technical Staff, OpenAI
The transcript below has been generated using AI and may not fully match the audio.
All right. How's it going, everybody? Today, we're going to talk about maintaining blameless incident culture when everybody knows whodunnit. I love that this title was the only one that had to wrap around to a second line in the agenda.
We're going to start off, as many talks do, with a show of hands. Raise your hand if you've been involved in an incident. This is the easiest one. Awesome. This is a confidence boost for me. I needed people to raise their hands for something. Raise your hand if you've ever felt blamed for an incident. Most everybody. The more uncomfortable question: raise your hand if you've ever blamed someone for an incident. Don't lie to me, I know you've done it. There we go.
Really early on in my career, I worked on a payroll platform as an intern, and my summer project was basically building this PDF export feature for monthly payroll reports. One of my teammates came to me one day. He's like, "Hey, we've got an incident going on right now. You should come check it out. It's a cool opportunity to learn." I was like, "Hell yeah, man. I want to go check that out. I want to see how this works." I got in the room and sat with everybody while they were trying to figure out what was wrong. Something was blocking a customer from accessing the system. It seemed like any write operation was failing. They couldn't run their payroll. People were upset. And as the team continued to dig deeper and deeper into it, they realized something was blocking all of the write threads to the database. And I will give you one guess what the thing was that was blocking all the write threads to the database. It was me! It was my PDF export job.
Hi, I'm Andrew Guenther. For over a decade now, I've been causing and responding to incidents at companies big and small.
I've had a pretty good stint over at AWS, and now I'm working with OpenAI to build services to help other services be more reliable. I'm also pretty heavily involved in our overall incident management process. And we'll talk more about that particular incident later.
But first, I want to talk a little bit about human nature. Believe it or not, even though I'm from OpenAI, I'm not going to talk about AI during this talk. I don't know if that disappoints anyone. This is mostly going to be about humans. Pesky humans.
To err is human. And to blame is human, too. People make mistakes, and we're biased to attribute errors to people rather than our processes. If you've ever been on the phone with customer support, you know this very well. You assume that they're bad at their job, when maybe they just don't have the tools to help you out. But even though you called them months ago and told them you'd be traveling, and your credit card's frozen, and you couldn't have been more clear about the dates... I digress.
It's natural. And because it's natural, that's why we call it blameless. It's the absence of something rather than the presence of something. If it was natural to not blame, I wouldn't be here giving this talk. We would all be in utopia, everything would be great, and we wouldn't have to talk about this. But the point is, since we're pushing back on human nature, this is something that we can't just declare in a document, "everything we do is blameless," and it just becomes so. This is something that we have to be practicing all the time.
Now, I've already probably said "blame" like ten different times here. So I brought cards with me to ensure I have plenty of synonyms for the word blame. And just to keep it exciting, I'm gonna shuffle them. All right, I'll keep those on standby.
A brief refresher on blameless: what it is and what it isn't.
We want to attribute errors and issues not to people or teams; instead, we want to be able to take a look at our systems with a critical lens. Mistakes happen, and if we're willing to accept those purely as a human failing, we're not going to improve. We're not going to get better. We want to be focused on understanding what happened, and we want to be more resilient in the future. And this has the added benefit that it encourages people to come forward and be honest about what's happening, and the people who own those mistakes are motivated to help improve them.
And we do this because we know that it works. Aviation started doing this back in the 1950s, and today flying in a plane is safer than walking down the street, in large part due to this culture and this process. In 1974, a Turkish Airlines flight crashed due to explosive decompression from the cargo hold because the latch was not locked properly. And rather than just saying, "We'll just fire the guy on the tarmac who didn't latch it properly," they redesigned the cargo holds so that the pressure difference between the inside and the outside of the plane would, if the door wasn't latched, just suck it shut, and it would latch on its own. And that is a direct result of systems-focused thinking.
Quickly, what isn't a blameless culture? I hear criticism like this more frequently than I would expect: if you don't blame anybody, does nobody get in trouble? Or if somebody gets angry and deletes your database, do you just pretend you don't know who did it? Are there no consequences? No, that's not what it's about. We don't just throw our hands up and say, "Kramer deleted the database. There's nothing we could have possibly done." It's an opportunity for us to take a critical look at our systems and understand how this was able to happen. There can still be consequences.
There can still be consequences in times of negligence. And the point is that we're focused on how we continue to improve in the future.
What does it take to get here, to build this wonderful, just place that we want to be in? We have to focus on more than just the postmortem. I think there's a huge imbalance of literature around blameless postmortems versus looking at the rest of our process: the times we're in incidents and in between incidents. And again, we also need to acknowledge the realities of human nature here. We're fighting against something that is natural. This is something that takes practice. It can be uncomfortable at times, and we need to be thinking about that. And we also need to celebrate the success of our process. There is no better tool in your toolbox than positive reinforcement. No matter what kind of culture you are trying to drive, positive reinforcement is always going to be your most effective tool for getting that done and driving that sort of shift.
So let's get on to the meat and potatoes. During an incident, how can we ensure that we are fostering a just culture?
I really like this tweet from HBO back in 2021, calling out their intern for basically sending out this massive blast of test emails to pretty much every single HBO Max customer. This isn't how I would have played it. I wouldn't have called him out on Twitter. But what I found really interesting here was the outpouring of support and reassurance that came in the replies. Just tons of people jumping in and acknowledging we have all been here before. We have all been that intern. Although my favorite one is probably this one by Oscar Mayer, just for a little bit of corporate shillery.
Now, I'm sure many of you have been this intern. I was that intern. How did that feel? I think it'll really resonate when I say causing an incident is really fucking stressful.
And as engineers and SREs, our brains very quickly jump to mitigation. That's natural, right? We want to fix the problem. We want to protect those nines. But it's also important to take a moment and just breathe and let people know that it's going to be okay. That initial flurry of "What happened? What did you do?" is incredibly stressful. It raises the stress levels of everybody in the room. And when we're stressed out, we can make more mistakes.
So focus on positive reinforcement here, on the positive behaviors that we want to see: "Hey, thanks for letting us know. Thank you for being honest." Highlight missing guardrails and opportunities to improve. And this may sound very kumbaya, and it is, right? You're probably thinking of incidents you've been in where it's, "I would never say these things." And again, that's because we're fighting against that human nature, right? It is much more natural for us, especially as engineers, to just want to dive straight into the problem. And I encourage you to take that moment and just acknowledge the people involved.
Going back to that incident I was talking about in the intro: when I caused it, I felt awful. I sat in the room as this intern watching all of my teammates deal with my mess, not able to do anything. And one of our senior engineers told me, "Hey, man, it's all right. I appreciate that you're here, but we've got it. We'll talk later, we'll figure out ways to improve. Don't worry about it." That may sound like a fairy tale of the best possible way this could go down, but I promise you it happened. And also, there's a totally non-zero chance that was his polite way of just getting me out of the room so that everyone there could curse my name. And I don't blame them, because responding to incidents is also really fucking stressful.
You have availability goals, you have customer SLAs, there's a lot of pressure involved with resolving an incident, and it is so easy for resentment to pile up and build. So what can we do about this?
We need to support our incident responders and our incident causers, acknowledging the people, not just the process. A little bit of empathy goes a long way. Reach out to your teammates, offer support, highlight things that are going well. If you are a manager or a leader, please do not be the person who jumps on the call and just proclaims, "What's the status?" Because, man, first off, I'm sure somebody has already written a lovely summary of what's happening that you can go back and refer to. And secondly, you just caused the stress levels in that room to absolutely skyrocket. Start with empathy, then get down to business.
Next, we want to encourage openness and honesty, the good and the bad. Who here has written the most even-keeled, unbiased, productive, non-finger-pointing postmortem of their lives while the rage of a thousand suns burned in their heart? Yeah? Yeah. And this is because when we incentivize people to act one way in an open forum but don't give them the opportunity to vent those frustrations privately, that's how we build resentment. People need that opportunity to let it out, and leaders, I encourage you to give people that opportunity. One-on-ones are not the place to tell people, when they're upset, "No, we don't blame here. We don't do that." It is not the place. Let people let it out. And this is important because if you don't let it out somewhere, it's going to come out somewhere else, and it's probably not going to be the place that you want it to come out.
So next I want to dig a little bit into postmortems. Like I said, I think there is endless literature about blameless postmortems, so I don't want to talk much about the document itself, but I want to talk about the process surrounding it.
When do we write it, who writes it, and what does the shape of the document look like? What does the process look like?
So to start off with, when do we write a postmortem? I've personally always found it most effective when the criteria for requiring a postmortem are a defined part of your process and are as objective as possible. I've seen and worked at places where those criteria are not objective, and I get it, it's usually done with the best of intentions. We want to write as many postmortems as we can, because we want to learn all these things, and we want to continue to grow, and we want to keep getting better, so why would we put limits on when postmortems get written? But when you allow those processes to be subjective, they can start to feel arbitrary and punitive. I know of companies where if another team asks you to write a postmortem for an incident, you don't have an option. You just have to. And these types of rules are very easy to weaponize.
At a previous company, a member of my team caused a very minor incident for another team. They blocked builds for a couple of hours, and within moments of resolution, the manager from that team assigned a postmortem to this person. When I reached out to them and asked, "Why is this important?" their response was, "This is a good opportunity for them to learn a lesson." Now, that's a bit of a comic book villain example, but this happens, right? This is the way that these processes can get weaponized and become very punitive. And by being objective, we can limit the ways that our processes are used for antagonistic purposes.
I typically see companies tie postmortem requirements to incident severity. And that's all well and good, but like I said earlier, we want to learn. I know we have more talks later about lower severity incidents and how much insight there is to gain from those.
So I don't want to discourage that, but let teams define their own criteria for when maybe a lower severity postmortem is appropriate, right? You're not going to get this right 100 percent of the time. Those subjective measures, or sorry, those objective measures are our first line of defense. They're something that we can agree on, but we need to have a bit of wiggle room for that exceptional case.
Next I want to talk about who is going to write the postmortem. This is another place that I start to see fault finding creep into the process. And if we got these concepts from aviation, this is a place where we're very different from aviation. There is no NTSB of software. We do not ask pilots to investigate and write the reports on their own crashes. In software, someone likely involved is going to have to write this thing. Now, I don't know of any companies who have an internal NTSB, but if you do, I would love to hear from you afterwards, because I'm very keen to see how this works in practice.
I like to think about this incident from 2012 with AWS's load balancer service. Some of you probably remember this. This is the big Christmas Eve outage. It took down Netflix and was very widely publicized. The story of how this happened was that someone who was relatively new to the ELB team was working on Christmas Eve, God bless them, and as they were wrapping up for a nice winter's rest, they ran the reset environment script to reset their dev environment. As part of their onboarding, they were given this script by a very well-intentioned senior engineer, and that script pointed to prod by default. And so the entire ELB control plane was basically wiped out within moments, causing this incident.
Now, obviously this got a postmortem. Who do you think wrote it? Do you think the junior engineer who ran the script wrote it? Do you think the senior engineer who gave them the script wrote it?
I see a lot of head shaking. You're all correct. It was neither of them. The postmortem ended up being written by the team owning the control plane database and the access controls for the control plane database.
A really useful way that I find to reframe this question of who's going to write the postmortem is: who can help us learn the most from this incident? Now, sometimes that is going to be the person who caused the incident. And that's fine, right? The point here is that we want to reframe the question in such a way that it is not accusatory. It is focused on those positive outcomes. It is focused on that learning and improvement.
Writing a postmortem shouldn't feel like a punishment. It's an opportunity to share learnings, and that's way easier said than done, right? Because you're taking hours, days, to figure out what happened, get it all nice and well written, present it to a bunch of people, get a bunch of feedback, get the right action items. It's a huge time investment. And no matter how many times you tell somebody, "This is a good thing you're doing," it rarely feels that way. In fact, if you can easily convince somebody on any of your product development teams that they're doing a good thing when they're writing a postmortem, this talk probably just isn't for you. You've done it, you've figured it out, and I have nothing left to teach you.
But it does bring me to my next question: what does the postmortem end up looking like? Do postmortems have to take hours or days to write? There's this homogeneity that we've developed around postmortems. In the true spirit of adopting technologies and processes that we like because we've seen successful companies do it that way, we often fail to take into account that we're not them. If you only have one service, maybe Kubernetes isn't for you. That's my talk at KubeCon next month. But anyways, back to incidents.
Every company is different, and your incident process should reflect that. And it's not just a matter of your company's maturity. It is what your company values, what you focus on, and what is important. If you're a growth stage company where things are falling over every day as you scream past 200 million weekly active users, you might not be in the best position to take weeks to write a postmortem, right? You need to right-size your structure and your expectations around what is most important to you.
Let's put it another way. If the quality of postmortems isn't something you would put on somebody's promo doc, you shouldn't ask them to spend a week writing one. It is clearly not valuable enough to you to ask them to go through that process, right?
So if we strip this down to core elements, what are we actually trying to get out of this process of developing a postmortem? We want to understand maybe some trends. We want to know where to invest. At a previous company, again a growth phase company with things falling over all the time, what we ended up doing was trying to distill it down to basically a tweet. After an incident completed, write two sentences about even your surface-level understanding of what happened and what the approximate impact was. At the end of the quarter, what we were left with was this two- or three-page document of all the incidents that happened over the quarter. It was super digestible, it was very easy to just sit down and read, and it made trends really obvious. Because when you're an early phase company, you're not dealing with these crazy, wild, deep technical problems most of the time, right? Your five whys is one why. The why is we don't have CI/CD, or we just don't do synthetics.
That's the answer, and your postmortem process should acknowledge that and reflect that. You can still get value out of lighter weight processes that just meet you where you are.
So to sum all that up: we want to be objective about when to write a postmortem, we want to be thoughtful about who writes the postmortem, and we want to be aligned about how to write the postmortem.
All right, so we've had the incident, we've written the postmortem, we're done. We've won. It's complete. Resiliency culture is solved. No, we're not done yet, because there's the whole rest of the time. There's after the postmortem, there's in between incidents, there's before the next incident. What are we doing the rest of the time? This is a muscle that we have to be exercising. This takes constant work. We can't just be thinking about this during incidents.
First off, we want to celebrate those successful outcomes. Again, positive reinforcement is the most powerful tool in your toolbox. We want to highlight the things that went well. We want to highlight when guardrails work. We want to highlight improvements as we see them in our processes and our availability, whatever metrics we have available to us. I think this is something that we've actually done an especially great job at, at OpenAI. Teams often highlight when those added guardrails are successful at catching something. Part of the way that we do this is that all of our postmortems have that sort of gross categorization that I was talking about earlier. We can get that dopamine, watching that number go down as we bring in improvements, but also having metrics around, "Hey, canary rollback caught six issues this week." That's a huge success, and that's a great way to reinforce that this process is working, we're learning things that are valuable, and this is a good place for us to be investing.
Next, reinforcing psychological safety. Psychological safety is something you see when you look at aviation as well as healthcare.
This is something that's very commonly talked about: offering that safe space for people to be open and honest outside of an incident, before an incident, to raise issues, and a place to share those concerns without fear of blame or judgment. That encouragement helps keep that culture alive in between incidents.
I'll also say, as a brief aside, I've talked about positive reinforcement a good handful of times. This most powerful tool in your toolbox is not the only tool in your toolbox. Because I can tell you right now that I become a force to be reckoned with when I see antagonistic behavior during incidents. It is also important to protect these spaces, and whether you're an IC or a manager or a leader, there are different ways to approach this depending on your personality. Not everybody is going to be the guy who jumps in on a public channel, says "Don't do that," and calls people out. That's okay. If that's not you, there are plenty of ways that you can help support others, again by lending positive reinforcement, or maybe talking to somebody on the side and just saying, "Hey, that wasn't okay. Let me know if you need anything." Or you can just be brash and call people an asshole. Whatever works for you.
Lastly, I mentioned this briefly earlier, but also assign value to the time spent in incident response. I made a comment that if you're not gonna put a good postmortem on somebody's promo doc, you shouldn't ask them to spend a week writing it. You should be putting good postmortems on people's promo docs, right? Because when we're not assigning value to things, that's when they start to feel punitive. That's when they start to feel like wasted time. When all you're thinking about is, "You slipped a deadline."
It's, "I slipped a deadline because I wrote this fabulous postmortem that everybody loved." When you don't assign value to that, and not just postmortem writing but also incident response, it becomes really hard to foster this culture, because you're telling people, "This isn't important, what you're doing." But if it is important to you, we want to highlight that this is something that we want to continue to see happen.
So again, outside of incidents, we want to make sure that we're celebrating those successful outcomes, we want to be reinforcing that psychological safety, and we want to be assigning value to that time spent during incident response.
Now, these are all great things to do outside of an incident, and I have two more nuggets to leave you with in my three minutes left. First off, Stephen said it earlier, but also: have more incidents. Incident is not a dirty word. We are fighting against human nature here. These are muscles that need to be exercised and practiced. And this doesn't mean, obviously, just bring down prod more. Don't manufacture more incidents. As a small company, for five minutes of downtime, you're probably not even going to open an incident. As a large company with enterprise customers, that's probably a SEV0 and a public apology. I don't know. We want to always be raising the bar, so that we can continue to practice these muscles. We want to have more incidents, and we need to make clear that this is a good thing. We are focusing on those positive outcomes. Your end goal should not be to drive incidents to zero, right? Because when we talk about assigning value to that time and aligning incentives, if you're saying we're driving this to zero, that inherently makes this a bad thing. This is a thing that we don't want to do. And that's not true. We want to have more incidents, we want to keep learning, we want to continue to improve.
That's what this is all about.
And the last thing I'll say: again, there's been a lot of kumbaya talk about things that go well, being really nice to everybody all the time, and incidents being all sunshine and rainbows. You don't have to get it right every time. You're not gonna get it right every single time. As a parent, there's this philosophy that I've really latched on to, maybe this says something about me, called good enough parenting. There was research done that basically said you only need to accurately respond to a child's needs 30 percent of the time in order for them to be well adjusted later in life. I shoot for higher than that number. I like to think I may be hovering at 40 percent. But I think this is really applicable, and it is very important to keep in mind when we think about incident response, and when we think about culture change and culture shift. It's really easy to get caught up in the times that we get it wrong, but again, the best way to build cultural change is to build allies. We want that to grow within the organization, and you don't have to get it right every time. Somebody sees that one example, being in the room that one time with the senior engineer who tells you it's going to be okay, and then over a decade later you're on stage talking about blameless culture, right? Those little gestures can have a really profound impact, and you don't have to be getting it right every single time.
All right, let's do that thing where we show the first slide again to make sure everyone gets the point. So when we're talking about systems-focused culture, what does it require? What does it take to do this? We need to be focused on more than just the postmortem. This is all parts of our process, not just incident management. This is the times outside of incidents, in between incidents, before the next incident.
We need to acknowledge that we are pushing against something that is very natural. It is going to feel uncomfortable. We are not going to get it right all the time, and that's okay. And lastly, again, the best way to drive that change is by celebrating those successful outcomes, by celebrating the success of your process, and by continuing to use positive reinforcement as that most powerful tool in your toolbox.
Awesome. That is it for me. Thank you.
