Organization-aware incident response

The talk by Lawrence Jones discusses the importance of leveraging organizational context during incident response. It emphasizes using structured data and service catalogs to enhance incident management by bringing valuable organizational knowledge directly to responders.

Lawrence Jones, Product Engineer, incident.io

The transcript below has been generated using AI and may not fully match the audio.

It's a pleasure to be here today. I'm gonna figure out the clicker. Let's hope that works. Oh, cool. I'm here to talk to you about Organization-Aware Incident Response, which might seem a bit vague, so I will intro that and explain it.

But first, like Chris said, I'm a product engineer at incident.io. I've been here for about three years now, so I joined alongside the three founders. What this means is I've had the privilege to work with, honestly, most of the customers that we've onboarded over the last three years. And as Stephen said, that is actually a really interesting opportunity to see how incident response works across many different companies, and a lot of that will be going into this talk.

Before I joined incident.io, I worked at a company called GoCardless. GoCardless is a fintech. I used to be a principal SRE there. I have about six-ish years of experience of trying to lead their infrastructure team, dealing with anything from massive financial incidents to, honestly, just going back to the previous talk, when I managed to screw things up and bring the entire etcd cluster down.

So yeah, now you know who I am, I think it's time to introduce the concept of the talk, which is that incidents are a whole-organization game. It's not a solo effort. It doesn't just impact your engineering teams. Instead, what you have is that incidents demand organizational knowledge, right? When you respond to incidents you need to navigate your organization, you need to know who to pull in, you need to translate technical problems to business terms, and you need to understand real-world impact.

If your responders depend exclusively on an individual's understanding of that organization, then honestly, that puts you in a position where you are only as good as your weakest link, which is obviously not what you want. A bus factor is not what you want to be communicating when it comes to the incident response at your company. You want to be a lot more resilient than that.
And that's why I will posit that you should be bringing all the organizational context that you possibly can down into the incident response process. So you're not forcing people to try and reach out to information that's too far away to be actually useful inside of a time-pressured incident situation. Instead you're bringing it right to the people who are responding on the ground so that they can do something with it, and they don't need to rely on that individual's knowledge of the organization. So hopefully it will democratize incidents in the same way that Stephen was talking about earlier.

Now, when I'm talking about organizational context, at least in this talk, I'm going to be talking about structured data. This is normally data that you will end up putting inside a tool called a catalog. You normally use a product like OpsLevel, Cortex, Backstage, or even incident.io. We have our own catalog product, and most tools actually have a catalog, basically. But I would say that most people stop well short of the amount of value that you can actually get from it, especially when you're talking about how you can bring that value into the incident response process.

So yeah, my whole point here is that I think we should be making more of these tools. And this talk is about how you can explore different types of data to put inside of your catalog so that you can hopefully push more of this organizational context down to the people on the ground responding.

But first, what even is this data? I've got this fairly rough diagram here. Table stakes, down in the bottom left, is what I think almost everyone who has a response process has. I call this the things that you own and who owns them. So you can think of the teams inside of your organization, and then the services that you run, and maybe even features or products. Almost everyone who has a response process has this, so hopefully this is very familiar to you.
And it is normally the sum total of what you end up putting inside of your catalog. But I think the interesting thing is that almost no one then explores the deeper technical data sources, or even the non-technical ones over in the corner, despite the fact that there's a huge amount of value in doing so if you can bring those into your incident response process.

So let's just revisit the table stakes thing, because I want to make sure everyone is on the same page. As I said, this is normally what you would find in a service catalog. I think this data provides a really important facility, which is, for me, in very basic and non-technical terms, translating from technical speak to human speak, or machine speak to human. What you're really seeing when an incident gets kicked off is an alert comes in, and it'll talk about a service or a Kubernetes cluster, which means, honestly, absolutely nothing for the business. You don't know who owns it, you don't even know what the impact of this would be, or what products have fallen over. So this is what the table stakes data does. It helps you link from the initial incident cause to who owns it and what the impact actually is.

So I've said that everyone has this, but I think that kind of understates how difficult it is to produce this data. So instead of going into why you should have this, I think it's interesting to look at why it is so hard to collect this and build out a catalog of all these things.

The first thing is, I want everyone to realize, these are made-up things. They're like fantasy concepts. When I say, what is a service, what the hell actually is a service? Lots of you will run Kubernetes. Is it the Service manifest that you're gonna put in your clusters? Is that actually what you think of in terms of a service? Or is it more like the logical application that you're running in production? More interestingly, does your team even agree?
Like, when you say service, how many different interpretations are going through people's heads at the time? So until you actually model what these concepts are, officially, you should assume that everyone is understanding them as different things. So that's obviously hard.

The next is that this data rots really quickly. I've seen lots of customers who come to us with a spreadsheet of services, and that's great. The problem is, as soon as you've created the spreadsheet, I think you all know, it's immediately out of date, right? In fact, even people who've invested heavily into tools like Backstage, for example, struggle with this middle one. Trying to get people to update the Backstage catalog is a really uphill battle, and it's something that people honestly find really difficult.

And that's because you really need the third component here, which is that until you try and use the data that you've put inside of your catalog, you're going to find it will always fall out of date. So what you actually have to do is build this positive feedback loop, or a flywheel, where you put the data inside of whatever your authoritative source is, your catalog. You start using it in core use cases, so things that deliver value to the company and the people who are going to use it. And then as a result, you've created a motivation for people to go and update that data.

So I'd actually argue that incident response is uniquely placed for this. I think everyone here would recognize that if you ever get paged for something that is not actually a feature that your team owns, you are very motivated to go and update the data. And that's a really positive feedback loop that you can start building into this type of organizational context.
And actually it's a thing that helps people get to a place where they end up with a catalog that accurately reflects their organization.

But yeah, what can you actually do with this stuff? In an incident context, it allows you to tag your incidents, up there in the top left, so you can quickly find historical incidents that have targeted these similar types of people. On the bottom left, if you've got an incident tool that supports it, you can start doing cool things for your incident responders, like subscribing automatically to all the incidents of your team. This is useful. I can imagine lots of people in this room might be quite interested in something like this. And then obviously you've got the classical use case over on the right, which is you receive alerts. The alerts are tagged with something like service or feature or whatever, and you use this data to resolve who you should actually be contacting.

But you all know this, this is standard fare. What's interesting to me is, what are the more advanced use cases? What data have we not yet put in our catalogs, and what can we do with it if we do?

I'm gonna start with deeper technical context. I think we said today that incident response is an organization-wide affair. I think everyone agrees with that, but let's face it, incidents are mostly, at heart, a very technical problem, or at least most incidents have a hard technical problem at their core. If you can bring more detailed information into your incident response process that helps you navigate that technical landscape, it will help your responders respond more efficiently.

Yeah. We have two... oh, interesting, this is one by one. We have two use cases that often come up with our customers. The first is for customers who run hardware devices. So actually Netflix is a really good example.
Netflix run their software on many different TVs, which obviously the Netflix employees in the room can tell you. As an incident responder, if something's going wrong with one of those devices, it's really useful for you to be able to quickly access the context associated with that device. Is it inside of an SLA? Do we have any help docs associated with this device? How do I actually go about debugging this?

And then on the right, you've got code modules. Many people in the room may be running monolithic software codebases nowadays. Lots of tools make this kind of hard to do, but you can take package-level information about your monolith and push it into your catalog, so that you can do things like identify when incidents are more frequently appearing from certain packages inside of your codebase.

What this actually looks like is, you've got the devices example on the left. We have several customers who do this, where they have a device list inside of their catalog that tells you whether or not things are inside or outside of SLA. Then if you end up having an incident that tags a device that is outside of support, they can immediately send a message to the incident channel and go, hey, I know you've been paged for this, but watch out. If you've got other incidents going on that are targeting other devices that are actually inside support, then make sure you prioritize those. It can just help direct your responders so that they spend their time more wisely on the incidents that are available.

And then on the top right, you can see an example of what we do for our monolith, where we push all of our software packages in there. Then we can detect when a package that is high criticality in our codebase keeps appearing in incidents, and prioritize some investment in it.
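As a rough illustration of the device example described above, here is a minimal Python sketch of a catalog lookup that produces a channel message when an incident tags a device that is outside of support. Everything in it is an assumption made for illustration: the `Device` fields, the catalog contents, the `example.com` URL, and the `support_warning` helper are hypothetical and are not taken from the talk or from any particular catalog product.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Device:
    """A hypothetical catalog entry for a hardware device."""
    name: str
    in_sla: bool                       # is the device still inside its support SLA?
    help_doc_url: Optional[str] = None


# Hypothetical catalog contents, keyed by device name.
CATALOG: Dict[str, Device] = {
    "tv-model-a": Device("tv-model-a", in_sla=True,
                         help_doc_url="https://example.com/docs/tv-model-a"),
    "tv-model-b": Device("tv-model-b", in_sla=False),
}


def support_warning(device_name: str, catalog: Dict[str, Device]) -> Optional[str]:
    """Return a message for the incident channel, or None if no nudge is needed.

    A message is returned when the device is outside of support, or when
    the device is missing from the catalog altogether.
    """
    device = catalog.get(device_name)
    if device is None:
        # Missing data is itself a signal that the catalog needs updating.
        return f"Device '{device_name}' is not in the catalog; consider adding it."
    if device.in_sla:
        return None
    return (
        f"Heads up: '{device.name}' is outside of support. If other open "
        "incidents affect devices that are inside support, prioritize those first."
    )


if __name__ == "__main__":
    for tagged in ("tv-model-a", "tv-model-b", "tv-model-c"):
        print(tagged, "->", support_warning(tagged, CATALOG))
```

The same lookup shape would apply to the code-module case: a catalog entry per package with a criticality field, checked whenever an incident is tagged with that package.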
So that's deeper technical context, but this is the one where I think everyone's ears should prick up, because this is where a huge amount of value comes. Obviously we've got all of this technical information, and we've said that incidents are technical, but especially when you're talking about your response process, some of the most janky parts are the coordination that you have between, say, GTM and the technical teams inside of your organization.

So yeah. I would say almost everyone in this room, if you're running an incident response process, probably has a CRM, because it's weird if you have incidents and you don't have customers. And I would also say that you should be bringing all of that data inside of your CRM into your tool. You can do this in a load of ways: either you sync it from your data warehouse, or you use an incident tool like incident.io, which can connect to something like Salesforce and automatically sync in your customer details.

The reason you're doing this is that now that you've synced in that information, you can start exposing really key relationships. So things that maybe are crucially important to the business and the GTM side of the business, but may not have even been possible for your engineers and your responders in the incident to get at. As an example, you often have customers who have customer success managers, right? If you have a major incident that's impacted that customer, you should really probably be pulling in that customer success manager. Now, historically, I found this really hard to do, because often the information is locked behind a very expensive Salesforce license. I may not even have access to it.

So a load of this stuff becomes possible if you can surface the relationships inside of your response process. And it also makes it easy to understand impact. What plan is this customer on? How much did they pay us? Are they in a trial?
Do they have an account exec? All this stuff is extremely relevant for you when you're trying to respond to an incident, but often has just not been accessible to people. Some customers even bring in the sentiment of their customers, which is often quite a fun one.

So yeah, going back to practical examples. These are things that we see across our customer base, but we do them ourselves. You can see up on the top left, whenever we have an incident that hits a customer, we drop a really helpful debugging message into the channel. So you can actually see Erin is Netflix's CSM, and thankfully, sentiment is green there. We've then got all the really useful debugging links going to our staff room, which is our internal portal that allows you to manage these accounts. You're essentially taking this information that was previously maybe a bit too far away for people to access readily in an incident, and you're pushing it to them directly, so that it's just right there and they don't need to waste any time on this.

You can also see, on the bottom left, this is the part where you get to bring your GTM organization, or people outside of your technical team, into the incident response process. So this is the part where, if an incident is major, you should automatically go message the CSM if their customer was involved in it. Hopefully that prevents a situation where they end up on a call, maybe about an awkward renewal next week, and they have no idea that they were in a really bad incident last week. It's really tying things together and starting to provide value not just to your engineering team and your responders, but across the organization, right?

And then on the right, how many of you are regulated and have something like incident reporting thresholds? I have been in incidents like this before, and sadly, whenever you're considering a reporting threshold, it probably means that the incident is really bad,
which probably means that you don't want to stop and crack out Excel to try and run some numbers to figure out how much revenue is impacted by this incident. This is stuff that we just shouldn't be doing. If you can bring this data into your incident response process and get it into your tooling, you can end up automatically calculating all this stuff. It just removes a huge amount of burden from your responders. And honestly, it prevents some pretty costly mistakes. No one wants to get fined by a regulatory body. That's actually a lot more expensive than the incident should ever be.

Cool. Finally, you get this. You've put all of this information inside a tool, a catalog, that allows you to then use it inside of your incident response. You've got the deeper technical concepts, you've got your GTM side of the world, so your CRM is all in there, and you've got your fundamentals. And the cool thing about this is that now you get to take, honestly, those fantasy concepts that mean a lot to your business, but up until now haven't really been real or useful, and you get to interpret your incident data, which is real hard facts, across those dimensions, right?

So there's a load of use cases here, but as a quick example from our account: we've established the concept of product features, and of integrations, which are the third-party integrations that our product supports. And we know the teams who own them. This means that we can take some real-world incident data, which is how long in hours people have spent responding to incidents, and we can start slicing it by these dimensions that we've put into our incident response tool.

So the cool thing here is, if anyone in this room has ever gone, should we schedule some technical investment? You're often going, what's the ROI, how long are we spending on reactive work, should we even do this?
And if you manage to get this in the right place then you can answer this question quite effectively. So what we have over here, on the top left, is the amount of time that we spend each week on responding to incidents. And you can see in March, we have this big green block, which is the on-call team. They were having a really bad week. So then what you do is you go, actually I'd like to split this data by the feature area. Where am I spending my time? So you can start interpreting which parts of your product are causing you to respond reactively.

And it turns out it was alerts on this particular month, which is cool, but alerts are quite big and we have lots of integrations. This is the power alert. So actually knowing it was alerts wasn't good enough, so you can then split it again and you can go, which integrations are causing the problem? And as it happened, in March it was Sentry, so we ended up investing in our Sentry provider, and as a result, hopefully, the reactive work went down.

There's loads of use cases for this, especially if you start combining the GTM data with the rest of this dataset. For example, which of your teams have the most revenue associated with their incidents? That's probably quite useful for you to know, right? But up until now, it's been very difficult to try and calculate.

Why isn't everyone doing this? I think there are broadly three reasons, and this is what I'll leave you with before we go into the break. I think lots of people let perfect be the enemy of good here. They're caught up in this idea that we don't know everything that we'll put in the catalog yet. We don't know what data we'll need. How would we represent all of this? And I guess my point here is that it's a game of incremental value.
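The slicing walked through above (hours by team, then by feature area, then by integration) can be sketched in a few lines of Python. This is only an illustrative sketch: the `IncidentRecord` shape, the `hours_by` helper, and all of the numbers are made up for the example and do not reflect real incident data or any specific tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class IncidentRecord:
    """Hypothetical record: responder hours tagged with catalog dimensions."""
    team: str
    feature: str
    integration: Optional[str]  # None when no third-party integration is involved
    hours: float


def hours_by(records: Iterable[IncidentRecord], dimension: str,
             **filters: str) -> Dict[str, float]:
    """Sum responder hours per value of `dimension`, keeping only records
    whose attributes match every keyword filter (e.g. team="on-call")."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        if any(getattr(record, name) != value for name, value in filters.items()):
            continue
        key = getattr(record, dimension)
        if key is None:
            continue  # record is not tagged along this dimension
        totals[key] += record.hours
    return dict(totals)


# Made-up sample data for one bad week.
RECORDS = [
    IncidentRecord("on-call", "alerts", "sentry", 6.0),
    IncidentRecord("on-call", "alerts", "other-integration", 1.5),
    IncidentRecord("on-call", "schedules", None, 2.0),
    IncidentRecord("response", "workflows", None, 1.0),
]

if __name__ == "__main__":
    print(hours_by(RECORDS, "team"))                           # which team?
    print(hours_by(RECORDS, "feature", team="on-call"))        # which feature area?
    print(hours_by(RECORDS, "integration", feature="alerts"))  # which integration?
```

Adding a new dimension by hand amounts to adding one field to the record and tagging a few incidents with it, which fits the incremental approach: try a dimension, see whether the breakdown is useful, and refine the data afterwards.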
So if you have a dimension that you want to push into your incident response, or understand your incident response against, just add it by hand, try it, see how it works. The cool thing is you don't need to be perfect, because if you can build your flywheel, and you can start using this in your incident response process, it will encourage people to go back and update the data. And you've ended up in this organic situation where people are going, ooh, actually, I see what we were doing there, I'd love to get this data in. Oh, that data is wrong, I should probably go back and update it.

So yeah, as I said, incident response is an amazing place for you to do this. If anyone here is struggling to keep their Backstage catalog or similar up to date, I would highly encourage you to consider pushing that into your incident response process. I think it will make you have a much better time.

Then finally, there's some element here of flexibility of tools. I think you have to pick a tool that's flexible enough to model your data. I think many incident tools are married to this simple concept of services and teams and not really much else. That might not let you express your organization in the way that you would need to get the value out of this sort of thing. So you want to pick one that will allow you to speak in your organization's language. So yeah, just bear that in mind when you're trying to figure out how you're going to implement this in your incident response process.

But yeah, that's it. So I guess my case here is you should be pushing a lot more of this information into your incident response. Anything that your responders could benefit from knowing that is locked behind other tools, you should try and unlock and push directly to them proactively. It's going to make everyone's lives easier and you'll probably have a much nicer time in your incidents. Thanks.
