
The Pearl, San Francisco

Building processes that survive contact with reality

Panel discussion on evolving incident management processes, focusing on practical approaches to implementing new ideas while maintaining effectiveness.

  • Colette Alexander, Director of SRE & Observability, HashiCorp
  • Dawer Jung, Game Reliability Manager, Netflix
  • Uma Chingunde, Engineering Leader (ex-Render, Stripe)
  • Chris Evans, Co-founder and CPO, incident.io
The transcript below has been generated using AI and may not fully match the audio.
Chris: Let's dive in. So, building processes that survive contact with reality. What are we talking about here? I think broadly, a lot of things are changing in the world of incident management at the moment. There's a ton of new approaches to running and learning from incidents being adopted, and things that people have been doing, and data they've been looking at, for many years are being proven to be either ineffective in some regards or actively harmful in others. But with so much change going on, it can be an absolute minefield knowing where to focus and where we should actually be looking. So this panel came about because we want to look at how we can bring in some of those processes and how we're evolving things, but in a way that actually survives contact with reality. So no spouting from the pulpit and saying "this is how everyone should do things" without practical guidance on how to do it. We have an incredible panel here of absolute experts in their field, and I'm genuinely really appreciative of you folks being here. We're going to go through a few different topics. Before we dive in, let's do a quick round of intros. Uma, do you want to kick us off?

Uma: Sure. Until recently, I led engineering at a Series B startup called Render. It's a platform as a service, so you can imagine we're a platform serving thousands of users, each trying to do very different things. Very early on when I joined, I was actually an early incident.io user, because I realized we didn't have the software we needed to manage all the incidents we were seeing. Prior to that, I was at Stripe for three years, and I actually have a hot take that fintech has the most interesting incidents, which is why I believe incident.io exists as well. And before that, I was at VMware and Delphix. Always infra, in different forms.

Chris: Nice. Go for it, Dawer.

Dawer: Hello everyone. My name is Dawer Jung. I'm at Netflix, working on the games side of things as a game reliability manager. It's really about ensuring that our players have a great, stable experience and can play in a reliable way. I've been there for almost a year now, really building and working with folks in the greater streaming business on their already awesomely built incident process, and putting that games flavor on things. Prior to that I was at Epic Games, and previously at a Web3 company, but also at Riot Games, where I helped launch a lot of their titles post League of Legends while also looking after League of Legends. Before that, about 15 years of experience in both managed services and organizations like Cisco and Symantec, as well as in IT. So it's an interesting mix of coming from regular IT, then gaming, and now the Netflix side of things. Netflix is a very different culture in how they approach things in general, so that's been a really good learning. But gaming is also a totally different type of customer and set of expectations, so that's been super interesting as well. I've spent a lot of that time in the IT tooling world and used a lot of incident tools. All of them are great, and it's been really interesting to see this journey of people going from managing tickets
to actually being able to manage the incident, and focus less on, as someone was saying earlier, that feeling of filling in 24 different fields. No one likes that, especially in companies where it's really about not slowing the business down just because something's gone wrong, and really trying to focus on that proactivity as well.

Colette: Yeah.

Dawer: Thank you. Go for it, Colette.

Colette: I'm Colette. I'm a director of engineering at HashiCorp, for our SRE and observability teams. Before HashiCorp, I worked at a company called Cognite in Oslo, Norway. Before that, I was at Spotify, and that's where I got really interested in reliability, and especially learning from incidents. I have a 12, 13 year career in tech, so it goes back a ways. Before that, I was a professional cellist in rock bands, so I have what's called an alternative background for engineering leaders. I just got my Master of Science in Human Factors and System Safety from Lund University. I'm following in the footsteps of my friend and mentor, John Allspaw, and many other people, including J. Paul Reed, who's in the audience. So yeah, I love to talk about learning from incidents.

Chris: That is a good thing. A quick show of hands: how many people are aware of the concept of resilience engineering? Amazing. Amazing. It's going to be a great conversation. Colette, I know this is a thing that you care about a ton, and I think you have a super interesting background, right? As you said, you have the academic side and the things from Lund, which is super interesting; you've clearly studied and thought about this stuff from that angle. But being a director of SRE at HashiCorp is pretty much as real and sharp-end as it gets, right? And we've spoken about resilience engineering in the past. I wonder if you can, for anyone who hasn't heard of it, give your definition of resilience engineering. How do you think about it? And what is a practical way in which it can be applied in an organization like HashiCorp, for example?

Colette: Yeah, that's a good question. In the academic sense, resilience engineering has evolved through a long practice of safety that goes back to the 1920s, really, when we look at safety in work. It arose as a conceptual way to reframe how we look at safety: not through the lens of what somebody did wrong or what the bad thing was that happened, but rather, what do people do that is right? Whether you're having an incident or not having an incident, what's really happening on the ground? How are people working? And how do we grow the strengths that all the people working on the ground have, the strengths that prevent us from having incidents every day? So I think that's where resilience engineering conceptually comes from. How I think about practicing it, certainly at HashiCorp, and really at Spotify too, was around looking at incidents and doing deeper dives: getting beyond what a lot of people refer to as shallow data about incidents and doing these, what we call, in-depth incident analyses. Which is getting really into the heads of the engineers on the ground: what do they know about the systems and what's going wrong, and using the incident report as a way to transmit knowledge that's inside of engineers' heads. We often forget this, right?
But what we are expert in, we don't really know. We don't have that outside perspective on what it is we are truly expert in, and other people might really gain a lot from learning it, right? When we do these in-depth incident analyses, what you really see is the mountains of information inside of engineers' heads coming out onto paper and being disseminated throughout an organization. To me, that's a huge benefit and boost for the resilience of any organization.

Chris: Nice. I guess to put a super practical point on it: if there was that one practice where you'd say, hey, everyone, leave this conference, here's the thing you should take away from resilience engineering... you're smiling. Is there a thing? Is there one thing that you'd say, just go and do this, and you'll see some benefit in return, some value from it?

Colette: So I'm going to steal something from Allspaw here, because, you know, John Allspaw, who invented DevOps and blameless postmortems and yadda yadda. Hi, John. In-depth incident analysis is 100 percent the one practice that I try to bring with me to every organization that I lead in, because I think it's so valuable for the people who work in it. You need to be trained in how to do it, and you can't do an in-depth analysis for every single incident. But what you can do is start by separating the kind of action-item-y postmortem process, which is a little bit different, from the learning part. Try to make them two separate meetings, right? Get your action items out of the way... maybe after you've done a session, an hour together where you're learning, where you're just trying to get information out of people's heads about what they know about what happened during the incident, then do action items a couple of days later. Because your action items, A, are going to get a lot better, because everybody in the room will suddenly have a shared understanding of the incident that they didn't have before, and B, even if you're not necessarily documenting that meeting, you're improving the amount of knowledge that's being shared in your org. Does that make sense?

Chris: Yeah, it makes perfect sense. I guess, Dawer, Uma, your experiences?

Uma: I love the idea of being able to do that. My worry would be that in a very fast-paced... so I ran engineering starting at eight to nine people, and then we grew to 55. There were definitely times in that era where we had more incidents per week than engineers. In that sort of scenario, I can't imagine having, honestly, the time to be able to have separate meetings. But maybe...

Colette: Just pick one where you do that, right? And I think that's the thing: this isn't a "you have to do it for every single one." You can pick the most interesting one, or maybe even the least interesting one, and find out it's more interesting than you thought it was.

Chris: Interesting. Yeah, it's like Andrew's talk earlier, where it's "have objective criteria for things." It sounds like maybe the answer is: don't do that if you're time constrained, and don't try to do it for everything. But, as you say, pick something, go deep, and figure out how valuable that is. Is that fair?

Colette: Yeah. I also think there's this thing about time, right? Like, how many people here do code review? Have a code review practice that's mandated in their company? Yeah. I think even smaller startups have that.
Code review takes a lot of time, and we still do it, and we don't really question it. My question is: can you add one more hour to your post-incident process, even if it's just for one incident a month, if you're going to magnify the value that you get out of it?

Chris: Yeah, it makes sense. I often think about this concept of learning, and the focal point being the post-incident review. In a lot of the organizations I've worked in, a lot of the learning happens implicitly and around the edges. I've been in organizations where we haven't had that kind of "go deep and do the thing," but the team ends up chatting over lunch, right? And that's where I've actually found a ton of the value: get people together who are like-minded and all want to build reliable systems. That's kind of the secret sauce. So I've been quite comfortable not mandating deep incident reviews. How do you think about that? Would you say good approach, bad approach? I'm curious.

Colette: I don't know if it's an approach that scales to an organization of 1,200 or 3,000. But when we talk about processes or things that we're trying to apply, my general take is that most processes you're going to mandate from above are probably not necessarily valuable to your organization in terms of increasing its resilience. The things that people are already doing on the ground, talking at the water cooler, talking at lunch, talking in Slack about something that happened, that's the kind of stuff you can magnify, right?

Chris: Yeah, 100 percent. I guess if you were to look at resilience engineering as a broader practice and look at the as-yet-unsolved parts of it, the parts where you haven't yet been able to say "here's a way we can operationalize this inside an organization," is there something that comes to mind for you?

Colette: From an academic perspective, I think resilience engineering is a very young field. It's only been around for 20 years, and I don't think software has really been participating that much, maybe a little bit over the last, what do you think, J. Paul, ten years? Yeah. So I don't know that it's solved much, right? But I do think it's raised very interesting questions. One of the reasons you don't hear people talking about mean time to recovery very much right now is probably largely due to the folks participating in the resilience engineering movement, who basically did a lot of work, Courtney Nash and a bunch of other people, to say this is horseshit, these statistics don't mean anything. So I think it's already had an impact in terms of how it's affected our field. What we haven't solved is, I don't know, name any problem on the ground: how do any organizations understand what makes them more resilient or less resilient? How can any leaders tell, from where they're sitting, where they are on a spectrum of less resilient to more resilient? There's a really great paper called Going Solid that calls upon a nuclear term, which is essentially, I think, when the core melts down or something, but the idea of an organization going solid is that it's becoming more brittle to the stresses that are happening.
Do any of your engineering leaders know what those signals are? Do we have an understood language among all of us for when our organizations are getting pushed closer to that line of being brittle and unable to withstand any more? That's where I think we need to go.

Chris: Yeah, it's interesting. I think the MTTR thing is a very relevant conversation for me anyway, so totally aligned. But I think this is one area where we have so much work to do, because whilst it gets a chuckle in the room here and everyone's like "yeah, obviously don't look at MTTR," in the vast majority of organizations I speak to there are a bunch of people who are still like, "the graph that shows me MTTR, that's the thing I'm trying to drive down." There's a lot of education that I feel is yet to be done in organizations.

Dawer: I think there's an anchoring here of: why are we doing these things? Why does it matter? Like you said, I've worked in organizations where it was very much all these MTTR metrics, whereas in the past six, seven years it's really been rooted more around what the impact is to the customer experience, or the player experience, depending on the kind of company you work for. At the end of the day, these things all need to be, in my opinion, in service of those things: let's look at the customer experience, what are customers telling us, what are the problem areas, okay, where do we need to be more resilient or more stable, and where do we need to spend those calories? Just hearing all the previous talks before me, there's a lot of stuff about what we should measure and whatnot. And on the topic of celebrating successes: how many minutes are we saving? How many incidents are we preventing? Like Andrew's example, how many times have our canary services identified an issue that we normally would not have seen? And how has that impacted the customer experience? Because at the end of the day, to be blunt, they're the people paying our checks, right? That's an area I've really been focused on as well, and it's why I question the teams: okay, you want to do this thing, but what's the positive impact to our customers as a result of you putting in these calories? And then also, what things can we put in place to protect that time? Because from an engineering standpoint, you want your engineers to be focused, working on features and cool things that are going to enhance your product, et cetera. There's time they have to budget, to a certain degree, to deal with incidents, and I think all of us can agree we want to minimize that time as much as possible, that toil required. Resilience engineering is a big factor in getting towards that. And it dovetails into what Stephen was talking about before in terms of how we build those structures to enable that. Because at the end of the day, technologies are enablers for us, right? So how do we leverage them in the right areas, in service of what our goals are?

Chris: For sure.
We might have to move us on, unless there's any final thoughts on resilience engineering. Colette, I imagine you have many.

Colette: Yeah. We were talking about this a little bit at dinner last night, and my favorite way to describe the theory-to-practice problem is the meme, you all might know this, of how to draw an owl: two circles, and then draw the rest of the fucking owl. I sometimes feel there are these concepts that are very up here, and we are down here in our context all the time. I would just say: don't ever underestimate your ability to translate those things for your reality. And just because they don't work one time doesn't mean they won't work another time, or won't work a week later. Be willing to experiment like the scientists that we all are when we're doing this stuff. That's all I would say.

Chris: I love that. I think that's super pragmatic: a lot of this stuff no one has answers for right now, but I reckon we can figure it out if we try things. That's how things have happened in the past. Super interesting. I imagine lots of people will want to continue the conversation, and there's lots of time later today, but let's move us on. I want to talk a little bit about applying local context to the kind of things we're doing here. I think these things sometimes introduce a bunch of friction in orgs when we don't apply that local context. A great example: I think most people in this room, or most engineers, can intuit which practices from a Google-sized organization will apply to a 10-person startup, and which ones are like, cool, maybe one day, if we are some goliath that prints money. But I feel like in incident management we haven't quite built up that muscle memory, or those reps, to be like, hey, that's really cool and I can imagine that works. A great example is maybe the one you gave, Colette, the deep investigation into an incident. If I'm a five-person startup and I'm default dead, maybe that's a thing where I should apply some context and use the round-the-lunch-table mechanism as my way of learning and distributing that knowledge amongst people. Uma, you've worked at Stripe, a pretty big company, and then Render, a much smaller company. What's your take on that angle?

Uma: So I already shared the constraints, which is that sometimes we had more incidents than people. I actually asked around a bit at that time: how are people scaling incident investigation or incident debriefs? And I was getting a lot of advice, where I was told everything from "as the engineering leader, you should absolutely not be in the room when the incidents are being discussed," and I can get why that might create a more blameful environment somewhere like Google, but there is literally no one else to facilitate the meeting at the nine or fifteen-person startup that we were. So I was like, yes, I get where this is coming from, but that's not my reality. Or the other one: "you should have objective criteria, and then decide on that, and then go into the details."
So that's where my approach actually is: we should use the fact that we're a startup to our advantage. These practices were built to scale for thousands of engineers, where it's hard to decide what an incident means because you have so many different things going on, whereas when you have a few dozen people, you don't. With a handful of incidents, or dozens of incidents, I actually know which ones are more interesting and which ones are not. It's turning the fact that we're a startup into a superpower. I know, for instance, that this incident caused zero downtime and was over in five minutes, but exposed a real corner case, so we're going to discuss that. Versus this one caused a massive outage, but was resolved immediately and everyone knows exactly what happened; we're still going to discuss it. The other interesting thing is how you decide which incidents get postmortems and which don't. I thought Andrew made a really good point, which is that you have to give engineers the ownership and have them decide which ones they spend time discussing. The other angle I'll add, which he also alluded to, is that the postmortem process itself can feel punitive. If you say, "oh, you were in an incident, great, now you have to spend an hour writing the report," that's going to actually disincentivize people from even calling an incident the next time. So you have to be very careful about the incentives you set in place. So: use the fact that you can be a lot more subjective in a smaller, fast-moving environment, and really ground it, I think you made a really good point earlier, in what's the ultimate value you're trying to provide, what's the customer impact. And then we had this vague metric called interestingness, like the example I gave before. The other thing I found really useful was that I would actually just allow engineers to nominate incidents themselves, so you decide collectively in the weekly meeting which ones we discuss and which ones we don't. There was a collective upward-downward process where we'd have a list and people would just be like, "oh yeah, this was straightforward, let's skip it," versus "this one was priority three, but was actually really interesting." So using that sort of subjective analysis, rather than trying to model purely the objective criteria that works at scale, is what worked for us.

Chris: Yeah, that makes a ton of sense to me. You've reminded me, actually, of a past life working in a financial institution: for some of the gnarly big incidents you had to produce public postmortems, you'd have these very formal processes, and you'd go and tick the boxes and write the report and stuff. And then we'd have these other, inconsequential incidents where a bunch of engineers sat on beanbags on the ground floor of our office. The stark difference in the type of conversation and the type of sharing you have is vast. And no one mandates that you sit on beanbags and have those kinds of conversations, but they're super interesting, right? Dawer, Colette, any thoughts on this applying-local-context kind of thing?

Dawer: I'm a pretty keep-it-simple-stupid kind of person.
So when it comes to postmortems and things like that, it's just about how I can use what's at hand to make it as frictionless and quick as possible. Because at the end of the day, the only things I really care about for a postmortem, and feel free to shoot me down, are: why did it happen? What's the timeline, generally? And, most importantly, what are the follow-up actions, who's owning those, and when are they going to be done by? That's it. And generally, I'd want all of that initially fleshed out before the meeting, because those meetings can become very expensive, depending, obviously, on how many people are in them. Really trying to be respectful of people's time. I'm all about protecting engineers' time so they can focus on doing the cool stuff. So it's really that kind of simple thing, and I'm not going to plug incident.io here, but the automated version of being able to generate that PIR in incident.io saves a lot of time for us. And, like you said, it incentivizes people: "oh, I'm going to look at this because I don't have to write it, I just have to review it, and then it's going to come to a meeting." That's fine, instead of asking people to do the thing by committee, which is going to take four hours and is very onerous when they could have been spending those four hours working on a cool feature; it's really hard to make that case to the respective engineering managers as well. So, just that simple approach, and then pushing your teams to make it better. It's good, but right now, for me, the most important thing is follow-ups. What's the best way for me to track follow-ups? I haven't found a good solution for that yet. We're in a meeting and we're like, okay, cool, we've got the follow-ups from the incident in the Slack channel, I'll put the little fast-forward emoji on them, sweet, it's in there, I have a list now. Okay, how do I track those? Do we put them into Jira via API? Do we store them in a separate page or something? These are the kinds of things where we're like, okay, how do we save time on this but still get to the outcomes that we need?

Chris: Colette, you were smiling through all of that.

Colette: Oh, I was going to say, I think one of the best decisions I ever made as a manager was at a smaller startup, the smallest place I've ever worked. I came in and they were mandating postmortems for every single incident. And, shockingly, there weren't very many incidents; I know, you'd be so surprised to hear that. So when I came in, I suggested that, hey, maybe we shouldn't mandate postmortems. It should just be: if engineers are curious, or feel like it's important to follow up, or if, say, a leader says that was a SEV1, maybe we should follow up; I guess maybe that will happen too, but hopefully not. And, shockingly, we suddenly had more incidents, and there was more discussion and more learning. So I think that's a good example of how letting go of process, or of enforcing something, can actually help things move forward, I think for the better.
I'm curious, though, about the action items. In your experience, Dawer, when you're tracking those, if you were tracking those action items super effectively, what do you think would be the outcome of that?

Dawer: Generally, I'd want to look at those action items and ask what they're trying to achieve. Is it going to result in better resilience, in terms of "hey, our stuff's going to fail less," or "hey, we're going to know sooner"? Sometimes it's just about decreasing the time to know what is actually wrong, so you can then engage and page the right team at the right time. It's all in service of that, as well as, okay, cool, again, what is the impact of it? What's the benefit to the customers as a result of this, or the benefit to the team, to the engineering team? Is this going to reduce their time spent on incidents by 50 percent, or something like that? Those are the kinds of things... just like alerts, you want them to be actionable. You want your follow-ups from incidents to actually be valuable, but also actually actionable in that respect. Not just a bunch of fluff like "okay, yeah, someone should look at alerting." That's not good enough, right?

Colette: Yeah, no, totally. I find action items after incidents to be very frustrating, because so often, so many of them don't get done. But I don't think that's a sign that something's wrong; I think it's a sign that perhaps the action items, when they got generated, just maybe weren't the right ones. I just remember this one person coming to me and being like, "Colette, the incident happened again." And I was like, huh. And they were like, "an incident like this happened last summer, and I tracked all of the action items, and I made sure everybody followed up, and every engineering manager had a meeting with me until those action items were done." And I was like, huh, and then what happened? He was like, "it happened again." And I'm like, yes, it happened again. To me, there's not a correlation between all of the action items getting completed and that incident not happening again, or an incident like it not happening again, or us all being totally safe. And I constantly grapple with this desire that I, and many other leaders I know, have to track action items, and the fact that I just don't know that tracking them gives us anything.

Uma: If I could add, I think I get the frustration, and I get the sense that action items go into this pile. I actually think incident.io does a pretty good job of listing these out; the dashboards actually have these, so we use those, and I get it. But tying the action items back to the learning: I actually think what doesn't get talked about enough is how much you can learn from the follow-ups. That's actually one of the big missing things: it's not just the postmortem and the meetings that you have, it's also the follow-ups. So the key thing
that I think we, as engineering leaders, need to do is give the team the mandate that the follow-ups are important, and give them the space to do the follow-ups. That's where I actually believe the real learning comes from, because that's the practice. And to your point, there's a sort of tightrope between blame and keeping it blameless. But you also want a culture where you don't just see the same type of incident over and over again, because otherwise you're missing something critical. If it's repeated, my question would be: what did we miss the first time? Did the team not get the time to follow up? Did we not invest enough in fixing the underlying things that broke the first time? Because the goal should actually be more novel incidents, rather than more of the same incidents. And I think that's where you can tie the room for follow-ups and action items to giving the team the agency and ownership to invest in those, so that they feel, okay, they're actually not going to get woken up again for the same reason.

Dawer: Which is a good candidate for that deeper incident analysis, right? Because if it happened again, something's... maybe it was different this time, but it's worth maybe deep diving into it a bit more. But that's the thing, we can't prevent all incidents. Stuff's going to happen again, like someone's going to forget to renew an SSL certificate or something, right, because they didn't put the alert in. Those things can repeat themselves. But yeah, I get your frustration. It just feels bad not doing anything, just having incidents and moving on.

Uma: My hot take is: why bother having a learning session if you're just going to ignore all the data you got from it and not do anything with it?

Colette: Yeah, so I think it's interesting, because how do you define it? I happen to believe that most engineers do follow-up items to incidents even before the postmortem happens. A lot of the really important follow-up action items happen right after, or even during, the incident, right? And the idea that engineers are the source of the resilience in your organization, that they are constantly taking action to keep you up all of the time, means to me that if they're ignoring an action item or it's not getting done, maybe that's not the most important thing for them to be doing. And that leads me to think that maybe tracking action items is more about making us feel better as managers, right? And making us able to tell a story to our leadership of "I've got it under control, we're doing something about it." And if that's what it is, okay, that's fine, let's just admit it. But I don't know that it's actually improving the safety of our system.

Uma: I just feel like we can't make a blanket statement about it, because I think we've all been in incidents where, yes, the biggest things that caused the outage got fixed, but then something at the next level that would prevent the incident from happening again, people talked about it in the room, decided it was important, and then nothing happened, and so you go back and have another one. There is that in-between gray area. There are absolutely ones which are like,
"yes, we should update the documentation because it's incorrect," which no one's going to remember to do, and the actual things that caused the outage are at the other end, but there's this in-between gray area. That's where you have the engineers making their best judgment call, but the gray area is the tension where, yes, they can ignore this and put it on the backlog or go work on the feature. That's where you have to put in the right norms: yes, we want you to work on the feature, but we also want you to work on this thing that will prevent the next incident. And that's the right sort of prioritization that I think is a leadership job.

Chris: I think that point around why we track them is genuinely a really important one. And you might be right, Colette, that this stuff's happening regardless of whether it's tracked or not, so why track things? The way I look at this is: no team exists in a vacuum where they are able to work locally optimizing for themselves, right? That's not how broader systems work. Engineers could be like, great, I don't need to track this, you just need to trust that resilience is happening, it's all going on behind the scenes. But I don't think that's how the real world works here, right? People need to be able to report up to leadership, or maybe you're a regulated entity and you have an incident and someone is literally saying, if you're not doing something off the back of this that you can communicate, I have no confidence that the right things will happen. I think that's honestly where I see a ton of the tension happening. And I think there is a very happy medium, which is giving ownership over the process. So it's not top-down, like "I need five action items." It's: folks, you have an incident debrief, you decide the action items you take away. I'm a big believer that soak time is super important, because you get the recency bias of: sit a bunch of engineers in a room and say let's all figure out what we could do that would make this never happen again, and it's super easy to go, "here's 40 things." And then you go, how do they stack up against everything else that my team has to do, which is also generating a bunch of reliability and resilience? That doesn't often happen. So I think it is: generate things, give them soak time, cull the things that are not important, and then communicate the things that are important to satisfy the broader organizational needs. That's broadly my thinking on it, which I think is a good balance of letting the right things happen, but also letting the right reporting happen up the chain. Anyway, we are nine minutes from the end. This is going way faster than I expected. Let's talk about probably one of the most hot-button topics. Incident metrics are super evocative. We've already covered things like MTTR. At the core, metrics are there because we have such messy data in the world of incidents: there are often hundreds of things going on over a month, and many of the organizations we work with have that sort of scale. And it's: how do I communicate to folks around an organization what is going on?
What is the narrative that I set, without having to spend hours and hours giving people all of that? MTTR is a super interesting example, because I think there are still a lot of organizations that are hooked on this as a metric, as something valuable. And it makes sense to me, right? It is super intuitive: how long are my incidents going on for? I obviously want that to go down. That makes perfect logical sense. I've read all the papers, and I think everyone in here is intellectually honest enough to know this is probably not a good metric to be targeting people on and trying to drive down. But to bring this around and land a question: Dawer, you have run live operations at a bunch of places, you've reeled off a bunch of games everyone will be well aware of. When it comes to incident data, what are the things that you care about? What would you actually end up looking at that you find useful?

Dawer: MTTR. A hundred percent. Kidding. I agree, I agree. You said it better: horseshit. It's really about... because the problem with metrics like MTTR is that it's the average, and you're not looking at the worst and the best. And we want to celebrate successes: hey, we did a good job here. But it's really about focusing on the learnings. What incident can we best learn from? Where did we do well, and on what incident didn't we do well? So looking at those sorts of things, and having a bit more of a communal discussion with folks: okay, which ones should we highlight, where are some of the wins and losses, and what are some of the things we're choosing to do and not choosing to do? Because, similar to the previous topic, it's okay that we don't do things. It's okay that we're not going to follow up on certain things. But there's a cost: there's a cost to doing something, and there's a cost to not doing something. It's about ensuring we have the right measurements around what gives us the best indication. In gaming, live operations or reliability, I see our role as protecting the player experience. So how have we done that? What things have we put in place? How available has our system been, from a game code perspective or from a platform perspective? And trying to find the signals that can best lead us to the bottom of what we should put in place, and that help us prioritize and discuss with those engineering teams: okay, hey, can we please prioritize this over some feature work? Because I know that can be a very tough discussion to have. And I'm being general about this because it depends on the business and the type of business you have. It's different for gaming, and it's different for other companies, because of a bunch of things, right? All those compliance issues and whatnot. But it's really about what is most impactful to the customer, and also what's impactful to the internal teams. What you want to have a beat on as well is: how much strain is this putting on your teams?
Because I think COVID taught us a lot in the past couple of years about really focusing on people's health, especially when you're working remote and not in the office anymore. Sometimes you're just working all day, all night, and you don't realize it. So try to be mindful of the impact you're having on your own staff, as well as, hopefully, the impact you're having on your customers. So that's where my head's at.

Colette: I love those. Especially the stress on your team: hours worked, especially off-hours, on incidents. I love time to muster, or time to actually be responding to the incident as a collective, right? I think that's really useful to understand. How long does it take us from detection to when we're all able to get together, talk about this, and figure out what's going on, with the right people in the room?

Chris: What do you do with a metric like that one? Is that just an interesting data point? Because I agree, I think it's interesting, right? Did it take us ages to find the right people? Was there some organizational context we're missing? That it paged Colette, and then Colette had to page Uma, and Uma had to get Dawer, and we were slowed down because of that?

Colette: Yeah.

Chris: Is that the kind of conclusion, that it took a long time?

Colette: When my product VP comes to me and says, "I want to take this to three and a half nines," I can show him how long it took for his team to respond to an incident and say, three and a half nines is going to cost us this; we're going to have to ask people to act this much faster. I can take them down the chain of how it works in order to actually achieve the level of nines that my product leaders are asking for in terms of reliability.

Chris: That's awesome. Yeah, Uma.

Uma: One that I really like, that Stripe used and probably still uses, was actually "users having a bad day." And the idea was, and both of you are alluding to it, that with the nines and other stuff, all of this is really about user impact. It doesn't matter how many nines you have if a user is unhappy, and that's your most important user; or vice versa, your users are happy but your engineers are getting paged all the time. Those are two really bad extremes. So I think use both user impact and engineering health, like how many times your engineers are getting paged. We actually started splitting nighttime versus daytime pages and trying to drive down the nighttime pages: going through the root causes, were they false alerts? It turned out one category was false alerts, and so we shifted them to paging during the daytime, things like that. So focus on those two audiences, your team and your users, because those are, I think, the two most important stakeholders in the incident space, basically.

Dawer: And I think incidents have this bad rap of all the metrics seeming very punitive. It's "oh, you did a bad job because..." whatnot. But I think really good metrics are about highlighting the hot spots,
in terms of, hey, where are the areas that we need to focus on a little bit more, again, blamelessly if we can. Maybe we have a break in our PagerDuty escalation policy where it's paging the wrong person or something like that. What's going to tell us that? And it's not because someone did a bad job; maybe life happens, or whatnot. Or maybe we have to accommodate the fact that there's a quiet period coming up, or a change freeze coming up, or something like that. That's where, ideally, it would be good to head. Other metrics: we talked about this earlier in terms of number of incidents, trying to reduce the number of incidents we have per period. I was thinking about that, and I flip-flopped on it, going, no, we should never do that, what's the point of tracking it, because you can't really prevent a number of incidents. But then I thought about it some more, and it made sense, because if you're looking from a customer-experience perspective, our customers do not care how many incidents we have. They don't care. All they care about is: is the service up? Can I play the game? Can I make a payment? Et cetera, right? But looking inwards, I was like, actually, I do care if this one engineer has ten incidents and he got paged for a month, right? So there are certain things we do want to track, like the number of incidents, not necessarily from a punitive perspective, but as a "hey, this is an area we should pay attention to," because Jim over here shouldn't get paged ten times a month. That's bad for him.

Chris: Yeah, I think that's a very powerful point. Whenever I talk to people about incident data and metrics and insights, and people ask "what numbers should I look at?", I'm like: whatever leads you to the most interesting avenue, a starting point for an investigation, is the thing. These things are deliberately lossy; they are modeling messy data with humans involved. It's essentially: how do I narrow the focus in some way to find a thing? Which is often not what actually happens, and is how I end up showing that MTTR is largely useless: how many times have you looked at the graph and gone, "oh, that's interesting, I can go do a thing"? Almost never. And when you do, you go, "oh yeah, that was that really bad incident, we know that took forever, that drove up MTTR," for whatever reason it is.

Colette: And to be clear, MTTR was created, as far as I know, by DORA as a way for organizations to track their journey in the DevOps, continuous-deployment space, right? If you're an organization that's been running ITIL and doing weekly or monthly releases, then maybe, yeah, MTTR is a good thing for you to monitor as you go through that transformation, as a way to understand whether you are successful. It is not a good way to monitor whether incidents are causing too many problems for you, right?

Chris: For sure. I'm getting an angry red light down here, which is saying we have exhausted the entire 45 minutes. These folks will be around for the rest of the day. I will also be around. If you'd like to talk about any more of this, definitely grab us; we'd love to talk. Thank you, folks. Thank you.
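As a back-of-the-envelope illustration of two points from the discussion above (that a mean-based MTTR hides the shape of your incident durations, and what a "three and a half nines" target actually leaves you as a downtime budget), here is a minimal sketch using only the Python standard library. It is not from the panel: the function names are made up for this example and the incident durations are hypothetical.

```python
# A minimal sketch (not from the panel); the durations below are invented for illustration.
from statistics import mean, quantiles

def duration_summary(durations_minutes):
    """Contrast a mean-based MTTR with a percentile view of the same data."""
    deciles = quantiles(durations_minutes, n=10)  # 9 cut points: 10th..90th percentile
    return {
        "mttr_mean": round(mean(durations_minutes), 1),
        "p50": round(deciles[4], 1),
        "p90": round(deciles[8], 1),
        "worst": max(durations_minutes),
    }

def downtime_budget_minutes_per_year(availability):
    """How much downtime per year a given availability target leaves you."""
    return (1 - availability) * 365 * 24 * 60

# Mostly short incidents plus one very long one: the mean is dragged up by the outlier.
durations = [12, 15, 18, 20, 25, 30, 35, 40, 45, 600]
print(duration_summary(durations))
# "Three and a half nines" (99.95%) leaves roughly 263 minutes of downtime a year.
print(round(downtime_budget_minutes_per_year(0.9995)))
```

With these made-up numbers the mean lands around 84 minutes even though the median incident is under half an hour, which is the "the average hides the worst and the best" problem described above; and the 99.95% line is the kind of arithmetic behind showing a product VP what a nines target would actually demand of response times.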
