
The Pearl, San Francisco

Stop, Drop, and SEV4: Why small incidents are a big deal

Derek Brown emphasizes the importance of documenting and sharing knowledge from small incidents to improve incident management and investigations.

  • Derek Brown, Service & Network Platform Lead, Plaid
The transcript below has been generated using AI and may not fully match the audio.
Hello everyone, my name is Derek. I'm a software engineering lead at Plaid, and before I get into things, I thought I'd tell you a little bit about me.

I started my engineering journey in the container space. I loved digging into container runtimes, and things were very far away from incidents. Then I joined Meta and realized that I knew absolutely nothing about containers at Meta: they built their own platform, and it's its own entire universe. So I sat back and thought about what I wanted to spend my time doing and the problems I wanted to solve. And I saw this observability team that was very underfunded and had a lot of systems under its belt. They were building their own time series database. They were building their own logging database. And there were 14 people. As many of you can imagine, that was just an impossible situation, and it taught me very quickly how to manage incidents, but also how to think deeply about how we designed our alerting system and our incident management tooling, because those 14 people were managing all of those things from design to execution.

The other thing you need to know about me is that one of my heroes is my mom. She works in local government, and her primary job is making sure that hungry kids get fed. That was something that was hard to reconcile with this world of software engineering, where you're so far away from your customer and from solving real problems. So I did the logical thing, which is I decided to go to law school. And I was in the back row of class, really hoping the professors would not call on me, because I was actively dealing with an incident, and I didn't really know how to explain to a law professor that I was making sure Instagram stayed up right now. It was just not a conversation I was prepared to have.

A few years into my journey at Meta, I decided to take that story further and deliver a lot of the things I'd been working on in reliability and incident management more broadly, so I ended up at Lacework, a cybersecurity company that delivers that kind of tooling as a product. Throughout that time I managed privacy and security incidents, borrowing from my experience as a lawyer. Most recently, I joined Plaid a little under a year ago, where I'm starting to merge these two things, and I recently joined the California Bar. So at some point these will fully merge, but at the moment I'm someone who gets to wear two different hats. And that's really exciting: sometimes you get that call, "we need to get lawyers on the phone," and I'm like, ha, I've been here the whole time.

Like many of you, I imagine, I'm a student of how incidents evolve. And the part I find most fascinating is the part between "I see some weird log line" and "how do we get to a SEV?" Because when we talk about MTTR and all these things, that's where a lot of our ability to respond to an incident comes down to. And I see two different, we'll say, anti-patterns emerge in that space.

The first is that, especially for more junior engineers or people who haven't listened to talks like this where we discuss what an incident actually is, the incident framework itself gets in the way of having meaningful investigations. If I tell someone what I actually want them doing in the first phases of an incident, it's really easy. It's two things. Number one, I want you to figure out what the impact is.
If it matters to customers, then we need to figure out how to resolve it. If it doesn't matter to customers, then throw it in the backlog and we'll figure out how to deal with it later. And the second thing is, if there is an impact to customers, how do we mitigate it? I don't want people squabbling about who owns it or what's the best way to deal with the situation. We really want to be focusing on just: how do we fix the problem?

And so here comes the anti-pattern. The question you absolutely should not be asking is: is this an incident in the first place? We had some conversation earlier about whether incidents feel punitive, or whether it's a conversation about ownership and which team needs to be driving it, and this question comes up a lot: is this an incident in the first place? Should I be prioritizing this above everything else? This should just absolutely not be in the critical path of resolving an incident. We should be thinking about whether it's an incident later, and about the level and severity of that incident down the road, once we're already on the path to resolution, not before we actually get the right people on the phone fixing the problem.

Now, the second problem I see a lot in the investigation space is that this effort is just completely lost. You've probably all seen really well-written post-mortems after every SEV0 and SEV1 incident. But how many times have you actually seen a really well-written post-mortem for the investigation that turned out to be a dud? Probably not, right? A few of the patterns I see: number one, this just lives in tribal knowledge. You have one engineer who maintains that one cursed MySQL database, who knows how to dig into the schema and figure out what is blocking the query load. But it's never documented, and that engineer leaves the company. That's a huge organizational risk: no one else can pick up that problem, or other problems. People go off and do solo investigations, and they don't have a good way to transfer that knowledge, because they're not documenting it and they're not in the office in person anymore to hand it off. And I think the worst one of all: magic incantations. My bash history, I'm sure, is worth hundreds of thousands of dollars to me, and maybe to many of the engineers on your teams as well. So we need to get into the pattern of taking all of these investigation tools and scripts and methods and putting them somewhere.

So my hypothesis, and proposal, and feature request, if you will, for the folks from incident.io, is to start treating investigations as first-class citizens. What this means is that I want to separate the idea of an incident, which is "something is bad and it's impacting our customers," from investigations: "something interesting is happening; how do we figure out whether it's bad or not?"

To give some practical examples of when something is an investigation but maybe not an incident: alerts. You probably get a lot of false-positive alerts. Let's call those investigations and not necessarily incidents. You get the alert, you spend some time looking in your monitoring system, and it was just one pod. It restarted. It was an AWS issue. Document it, leave it as an investigation; it doesn't become an incident. User reports are the same thing. A lot of the time, it's a user issue: they didn't understand how the tool was supposed to work. And if you document the process of figuring out that it was the user's fault and put that in an investigation, then the next time someone comes back to that procedure, they know exactly what to do.
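To make that separation concrete, here is a minimal sketch of what an investigation as a first-class record might look like. The class and field names are illustrative assumptions, not a real incident-tooling schema; the point is simply that an investigation captures its trigger, the steps taken, and an outcome, so the next person can replay the procedure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """How an investigation resolved -- most never become incidents."""
    BENIGN = "benign"        # false positive, user error, transient blip
    BUG = "bug"              # real defect, goes to the backlog
    INCIDENT = "incident"    # customer impact found, escalate to a SEV


@dataclass
class Investigation:
    """A first-class record of 'something interesting is happening'."""
    trigger: str                                    # alert, user report, weird log line
    steps: list[str] = field(default_factory=list)  # queries run, commands, findings
    outcome: Optional[Outcome] = None
    runbook_link: Optional[str] = None              # where the reusable procedure lives

    def log_step(self, note: str) -> None:
        """Append a step so the investigation doubles as documentation."""
        self.steps.append(note)


# The "one pod restarted" false positive from above, as a record:
inv = Investigation(trigger="alert: pod restart count > 0 on api service")
inv.log_step("Checked monitoring: a single pod restarted once")
inv.log_step("Correlated with an AWS host event; no customer impact")
inv.outcome = Outcome.BENIGN
```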
And this one I think might be a little bit controversial, but debugging effort as well. I've spent a shocking amount of time in strace and Wireshark and GDB, things that no one should ever have to touch for any reason given where we are in the software world. But being able to document those investigations so that other people understand the right command flags to get back all of that investigation context, I think, is crucial.

Great, we've launched an investigation, and we have some documented steps for it: this is how we evaluated whether it was important, or how we even got to the investigation in the first place. What do we do with it? I think this should be obvious to the people in the room. Sometimes we file a bug. We say: this isn't important today, but we do need to address it; it's a problem with our software; let's throw it in the backlog and fix it. If it's really bad and it's impacting customers right now, we file an incident. And, of course, we document everything that went into the investigation in a runbook.

One of the things we talk about a lot at Plaid is the difference between type 1 and type 2 decisions. For people unfamiliar with this framework, a type 1 decision is a one-way door: something we have to spend a lot of time scrutinizing, because it's going to cost a lot to implement and it's going to be hard to revert. A type 2 decision, by contrast, is really cheap to implement; it's probably not even worth a long conversation. Just execute it, see how it goes, and if it doesn't go well, revert it. So what I'd like to do is give you a proposal for how you can implement this investigation-first strategy as a type 2 decision within your organization.

I spent some time thinking about what you actually need in order to have a better investigation experience than you do today. The first thing is that it needs to be collaborative. One of the hardest things about the remote world is that it's hard to sit in front of one screen, collect all of these different metrics, and pull all of these signals together when multiple people are investigating at once. So we need some way to pull all of those investigation threads into one place. The second is that it helps us create a coherent picture of what's going on. I cannot tell you how many times I've gotten 45 minutes into an investigation and then realized that on step one I wrote the wrong query and I'd been following a false lead the whole time. So we need to be able to piece together the puzzle as we go along and build a coherent story of what's happening. And lastly, we need some way to be retrospective about the investigation process. Investigations are a significant amount of the time your engineers spend, and they sit in the critical path before an incident gets resolved. So being able to think about how we actually improve the investigation and debugging experience, not just the overall incident experience, is crucial.

So how do we create a prototype of this? The reality is, it already exists. I suggest you can use your SEV4 today as a method of tracking these investigations, even the ones that don't cause impact to customers. So what does that look like? I would suggest using your existing incident management tooling to actually track investigations. And this is where the title of my talk comes from: stop, drop, and SEV4. The analogy I like to make is that you should just absolutely file the SEV4 without even thinking about it, because that takes all the guesswork out of "is it a SEV? What level is it?" Start using the communication and investigation tooling that you get out of your incident management platform, and get your money's worth.
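As a sketch of what that point-and-shoot filing could look like, here is a hypothetical helper that declares a SEV4 through an incident-management API. The endpoint URL, payload fields, and response shape are all assumptions for illustration; substitute the real API of whatever platform you use rather than treating this as a documented interface.

```python
import os

import requests  # third-party: pip install requests

# Hypothetical endpoint -- replace with your incident platform's real API.
API_URL = "https://api.example-incident-tool.com/v1/incidents"


def stop_drop_sev4(summary: str) -> str:
    """File a SEV4 immediately, with zero classification guesswork.

    The severity is hardcoded on purpose: the whole point is to skip
    the "is this a SEV? what level?" debate and start investigating.
    """
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['INCIDENT_API_TOKEN']}"},
        json={
            "severity": "SEV4",       # assumed field names
            "summary": summary,
            "status": "investigating",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["id"]          # assumed response shape


# Saw something weird in the logs? File first, classify later:
# incident_id = stop_drop_sev4("Spike in 5xx responses from checkout service")
```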
The next step is to actually do lightweight retrospectives on all of these investigation procedures. See, it's really common that you have a SEV4 and realize your debugging experience is terrible. Two weeks later, you have a SEV1, and you're still stuck in that place where you don't know how to investigate the issue. Quite frequently, I find that those SEV4s are a harbinger of hard-to-solve debugging or monitoring problems. And being able to empower an engineer to say "this debugging experience is not good, the observability is lacking, I need to go spend some time fixing those issues" is really crucial to being able to improve the incident experience, not just the investigation experience.

And lastly, as an engineering leader, take the time to prioritize that investigation, observability, and debugging tooling and those improvements. We all talk about this in the context of an incident; there it's self-evident: we say "that's an incident follow-up, we absolutely need to prioritize it." That may not be as clear in the case of an investigation follow-up, and being able to create the space for engineers to focus on improving their debugging tooling and their monitoring tooling is really crucial to starting this flywheel effect of improved incident response.

So lastly, I thought I'd wrap up with how you go and sell this type 2 decision to your leadership, and to other people in your organization, as being valuable. Maybe a less controversial topic than the last one: it's really crucial for reducing the time to mitigate incidents. Having that point-and-shoot response of "I saw something weird happen in the logs, let me go file a SEV4 so I can start investigating it" keeps the time to investigate down and lets people focus on responding to incidents rather than trying to classify them.

The second thing is that, over time, your observability tooling is going to get better, because people start thinking more about the investigation process. When you have this big pool of data from all of these SEVs, you can realize: oh, this system is super hard to debug; I lost a lot of engineering time investigating that system.

And lastly, you can start piecing together these different investigations to figure out larger patterns in your systems. Again, quite often a SEV4 leads to a SEV2 two weeks later. And especially now, in the world of AIOps, having all of that documented, so a system can come back, collect all of the steps that you performed, and suggest them to you in future incidents, is just absolutely invaluable.
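The pattern-mining this unlocks can be sketched in a few lines. Assuming you close each investigation with the system involved and a rough count of engineer-hours spent (as in the record sketched earlier — the data below is made up for illustration), you can surface which systems quietly eat the most debugging time, which is exactly the signal for prioritizing observability and tooling work:

```python
from collections import defaultdict

# Assumed shape: (system, engineer_hours, outcome) per closed investigation.
closed_investigations = [
    ("cursed-mysql", 6.0, "benign"),
    ("cursed-mysql", 4.5, "bug"),
    ("api-gateway", 0.5, "benign"),
    ("cursed-mysql", 8.0, "incident"),
]

hours_by_system: dict[str, float] = defaultdict(float)
for system, hours, _outcome in closed_investigations:
    hours_by_system[system] += hours

# Rank systems by total investigation time: the top entries are the
# candidates for better dashboards, runbooks, and debugging tooling.
for system, total in sorted(hours_by_system.items(), key=lambda kv: -kv[1]):
    print(f"{system}: {total:.1f} engineer-hours across investigations")
```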
But another benefit of this approach, and I think perhaps the less obvious one, is the way it changes your organization. And this is the note I wanted to leave on. It can be really hard, especially for junior engineers, to get practice at incident management. I think we talked earlier about the case where an intern had caused an incident and was locked out of the room. If you have more incidents, and more investigation options, people just get into the habit of good incident management practices. It's more surface area for every team to practice what it means to be a good incident manager.

Another thing I've realized in implementing this is that it's much easier to recognize contributions within your organization from people who are doing really solid investigation work. I'm sure if I ask who's a good investigator on your team, someone comes to mind: somebody who spends a lot of time digging into deep problems that no one else can investigate. But I'm guessing you've had a hard time at some point writing a promo packet for that individual, or trying to explain their impact to the organization, because so much of their time is spent on one-off investigations or contributions you can't really articulate as shipping a feature to X number of users. Being able to point to a long list of investigations that they, and only they, could have solved, and how that unlocked other potential in your organization, is absolutely huge. It has changed the incentives for teams I've been on, my team now included, and the way they actually approach incident investigations, because they now have more incentive to spend time doing quality investigations and don't see it as just a side project to the work they're doing.

And lastly, this promotes really good sharing of investigation techniques between different teams. It can be really hard for that central infrastructure team that's been built up over five years to share the ways and methods they use to dig into each service. But having a documented investigation process that other people can follow along with, either during the incident or after the incident investigation, gives teams more learning opportunities to spread that knowledge.

The last thing, and I guess this is also not a spicy take, they stole my thunder earlier: I think absolutely everyone should be involved in this incident process. One of the things I tried doing at Meta was pulling my interns into incidents and exposing them to what it is to be an incident manager. And quite frankly, the feedback I got was "don't scare the interns." But this is the job; this is what we all signed up for. And frankly, it's one of the reasons I think this work stays very interesting. Incident investigation, and investigations in general, are really fascinating. If you're a person who likes that CSI-style investigation, this can be a really good motivator, a reason you pick one team over another or one company over another, and a way to build a good engineering culture. So encourage more people within your organization, whether it's interns or teams that traditionally hand this off to someone else, to go do investigations and start practicing through this idea of investigations as a first-class concept. I won't keep everyone from lunch, which sounds great, but thank you so much for listening, and I hope you adopt this type 2 decision of stop, drop, and SEV4.
