
The Pearl, San Francisco

Culture over chaos: Fostering a positive On-call environment

Ryan Schroeder shares insights on how Netflix promotes a healthy on-call culture, emphasizing team support, prioritizing work, and adapting rotations to reduce burnout.

  • Ryan Schroeder, Senior Software Engineer, Netflix
The transcript below has been generated using AI and may not fully match the audio.
My name's Ryan Schroeder, and I'm a member of the Core Reliability Team at Netflix. Now, I realize I am between us and cocktails, so I'll try to keep this short. First, I want to ask a question. Does your team enjoy being on call? Think about that for a moment. Really? If the answer is no, do you know why? I've been doing on call at Netflix for over 13 years now, and I can honestly say, if on call is done well, I enjoy being on call. However, I've been on a team before where I did not enjoy on call. So how is it possible that, of two teams at the same company, one can have an enjoyable on-call experience and the other one, not so enjoyable? That's what I want to talk about today: how do you foster a positive on-call environment?
First and foremost, being on call should not feel like a burden. At Netflix, like a lot of places, every service team staffs their own rotation. What this means is that each team has their own needs, and that makes each on-call rotation unique. We, the core team, promote operational and reliability best practices across the company, which includes how to maintain a healthy on-call rotation, but we also carry the pager for those member-impacting incidents. Now, we use that phrase "carry the pager" casually. We throw that around, but I think for this particular example, it helps to set that context: how heavy is that pager? Does it feel like a burden, or is it a light burden?
So at Netflix, what we try to encourage across our entire organization is that on call is a team activity. First, the team supports each other to share that burden of on call. Secondly, on call is important: on-call tasks and toil-reducing tasks have got to be prioritized. And then finally, on call can be flexible. We regularly review and adjust how we do on call to meet the needs of the current situation. So let's dive into these, starting with on call being a team activity. We all know what that rough shift feels like.
So when the page goes off, asking for help or escalating into the team is a norm that we've established across our team. We have a lot of new members on our team. They don't fully understand the system, and so we encourage them to ask for that help. Our core team maintains an entire-team escalation policy. So if anybody needs help, when they push the escalate button, we all show up. It's like the bat signal, if you will. But we celebrate when that happens, because it's not a bad thing. It's recognizing that somebody says, "I need help with this." And chances are it's big enough that they need more than just one person. So we put out the bat signal, and the people who can show up do. And then we also sometimes celebrate with retros and stuff after that. When we were all in person, we'd get donuts, that sort of thing.
Because of that second-level escalation, where everybody shows up if they're asked, we've also established this team norm where you help out even without being asked, because you understand that if it starts going off the rails or starts going sideways, you're going to get brought in anyway. And so that encourages everyone to keep an eye on how things are going, show up when they can, and participate however they will.
Another aspect of this is relieving the on call after a tough shift. Personally, I've been part of an overnight shift where I got paged three times. You come in in the morning, and when it wasn't going so well, the team was like, "Oh, that looks like a rough shift." Almost as if they're saying, "I'm glad it wasn't me." There was no recognition of how your rough shift may impact the rest of your day, no offer to relieve you or something like that. And being on call shouldn't feel like you drew the short straw, particularly if it's a noisy pager or those sorts of things.
And then finally, distributing that follow-up work.
In a similar sort of way, our team held a norm where if you were the incident commander, you were driving most of the follow-up work because you had the most context. Again, we had a situation where one team member had eight incidents over the course of a two-day shift. That's way more than any one person can do. And we ended up in a situation that was like, who's going to do all the work? And so we reviewed that and we said, look, let's distribute that work. And so again, my call to you is: as a leader of a team, how are you making sure that on call is a team activity?
Now that it's a team activity, we have to recognize that on call is important work. It's more than just carrying the pager and being paged. One of my favorite things about being on call is that every shift gives you a chance to make the next shift better. Whether you're paged or not, review your dashboards, review your operational procedures. Because nothing makes on call worse than showing up to an incident unprepared. Our system is constantly changing, as I'm sure all of yours are. One dashboard could look one way one day, but then look different the next day. Is that an expected change? Is that a different change? Until you're aware of what's happening there, again, you could get caught unprepared.
Next, make space for that learning and curiosity. Again, we've got a lot of new team members. And as you are on call, you need to take the time to look at a system and say, "I don't know what that does. How do I engage with that to understand what it's doing in the infrastructure? What happens if it fails? Where can I observe it?" Those sorts of things. And again, we encourage this, whether you're an incident response team or a service team. If you're constantly on call and you don't feel prepared, make that space for learning and curiosity.
And then finally, prioritize the work that reduces that on-call burden.
So again, we all know what that feels like, being paged or being put in situations where you feel so helpless, so unprepared, or unequipped to fix them. That's a recipe for burnout. And so we've got to make sure we're prioritizing that on-call work. Whether it's work that directly addresses the contributing factors to the incidents, or work that just makes operations that much easier, it has to be prioritized.
And then finally, on call is flexible. Make that on-call rotation work for you. There are various levers that you can use to adjust the rotation, and you'd be surprised how simple but effective they are at reducing that burden of on call. One of the easiest ones is the shift length itself. In an ideal world, we run a 2-2-3 rotation, which means the weekdays and then the weekend are separate shifts. When our team is well staffed, that's great. It's a wonderful duration between shifts. But one time our team got down to only five people, and that meant we were on call about every 10 days. That just started to weigh on us as a team. And so one of the easiest things that we did is we moved to a 3-4 shift: three days, and then a four-day shift over the weekend. So again, it's those simple things. You don't have to silently suffer with an on-call rotation that's not working for you; you can adjust it as needed.
Another thing that we do is during our holiday quiet period. Rather than, again, being that member that draws the short straw of who's on call for Christmas or who's on call for which day, we set up a schedule and we say, put in your availability, and we'll go day by day. That makes balancing those commitments across family activities or other times like that more tenable. We'll even hyper-focus that on, say, an on-site week, where nobody wants to be on call and miss all of the sessions. So we'll even split up our shifts into six-hour shifts.
A morning shift, an evening shift, and then an overnight shift. Again, make the on call work for you.
When we talk about coverage and escalation, again, I mentioned that our team maintains an entire-team rotation for when we get paged. Maybe that's not sustainable. Maybe you have a secondary rotation, but you've got to keep in mind that a secondary rotation is almost the same as carrying two rotations now. So again, there are tradeoffs to make there.
How does your handoff work? We used to hand off at midnight. We didn't have a lot of incidents overnight, but the handoff at midnight didn't really give a lot of confidence that the person you were handing off to was available and ready to take it over. So a simple thing was just shifting that to the daytime. Again, these are simple things to do, but they make on call feel more flexible.
And then finally, the duties. Again, being on call is part of the job. It's a wonderful part of the job, but when you incorporate a duty routine around it, it helps take away that ambiguity of who's doing what. And so again, come up with a duty rotation schedule. Those are some of the things that we've done to make on call flexible.
So with that, I want to leave you with this: improving on call improves team health. Nothing is worse than showing up to an incident, being unprepared, feeling alone. Those are the kinds of things that really destroy a team, that make on call feel like a burden. And by keeping on call a team activity, important, and flexible, you can foster that effective and positive on-call environment. Thank you.
