Balancing reliability whilst scaling fast

A discussion around the challenges of maintaining reliability while scaling fintech companies, focusing on customer trust, regulatory compliance, and competitive pressures

Tom Blomfield, Partner, Y Combinator
Mel Good, Senior Exec Director of Reliability, JPMorgan Chase & Co.
Stephen Whitworth, Co-founder and CEO, incident.io
Paul Williamson, Start-Up Advisor & Investor, ex Plaid, Visa, Salesforce

The transcript below has been generated using AI and may not fully match the audio.

Stephen: I am running the last panel. This one is called "Balancing reliability while scaling fast". It should be obvious, but hypergrowth in companies is really hard. It puts strain on organizations, the teams and people within them, and knowledge distribution around the company, and honestly, that it strains reliability and resiliency should be no surprise. It's really a constant trade-off: do you keep the thing up so you can engender customer trust, or do you build at pace so you can beat the competitor next to you that's furiously trying to beat you?

In this panel, we've got a group of awesome people from a variety of different roles. They have been in revenue teams, in incident management teams, leading companies, and we're going to hear stories, learnings, and burnt fingers. Before we jump into that, I would love to give them all a chance to introduce themselves, so I will hand it over first to Paul.

Paul: Hi, I'm Paul Williamson. Today I spend a lot of time advising early-stage companies about building their go-to-market: how do they scale up their sales and marketing organizations, with a specific focus on early-stage companies, as Stephen just mentioned, going through that hypergrowth phase. They've just hit product-market fit and they're trying to scale.

But prior to that I spent about 20 years in a variety of different go-to-market leadership roles. Most recently I was the CRO of a company called Plaid, which is a financial infrastructure company. And prior to that I spent time at Salesforce.com. So I've been in go-to-market stuff for the better part of the last 20 years. That's me.

Stephen: Awesome. How about you, Tom?

Tom: Hello everyone, I'm Tom Blomfield. Prior to my current role, which is a partner at Y Combinator, actually just around the corner, I worked with Steve and Chris and Pete at Monzo, which is the UK's leading neobank, I guess I would say.
We built it to 5 or 6 million customers when we were all there, and since then it's got to about 10 million customers and about a billion of revenue. And before that, I worked with Pete, again, at GoCardless, which is another of Europe's top fintechs, doing payment processing.

Stephen: Awesome. Mel?

Mel: Hi, I'm Mel, AKA the good Mel. I look after reliability at JPMorgan Chase, specifically for Chase UK, which is challenging Monzo on its title and hoping to do very well. I came from seven years at Airbnb, not far up the road here, where I looked after incident management and met these lovely people from incident.io. Prior to that it was VMware, Salesforce, and a couple of other tech companies, but always around: how do we deal with critical incidents? How do we get ahead? How do we really look at preparedness rather than responding all of the time?

Stephen: Absolutely. I'm going to start with Tom. When I worked with Tom, we were working in a fully licensed, regulated UK bank, and that is different, I would say, to your average startup. It's less fun. Okay, you said it.

The interesting thing was that we were up against a bunch of competitors, many of which did not have banking licenses. As a result, you had these sorts of constraints that applied to Monzo and didn't apply to others. Ultimately, the customer didn't really care about that, and you had to build as fast as you wanted to. This put you in a position of needing to build to stay alive and to beat competitors, but also to be a place that, fundamentally, both people and regulators trusted. How did you think about navigating the tension between the sharks in the water and the kind of trust that you needed to maintain with those people?

Tom: It was very tricky. The obvious competitor at the time was Revolut, which did not have a banking license.
They were operating more or less in the UK with a bunch of partner organizations, but at the same time we spent most of our time thinking about the large banks, RBS and Barclays and Lloyds really. That was who we were taking customers off, overwhelmingly. Revolut and Monzo were both kind of killing the old banks rather than attacking each other.

But it was very tricky. In the early days it was fine; the regulator never really believed we'd get any kind of scale. But once we got to a million customers and we were in the headlines every day, we were balancing bringing out product functionality with not just reliability, but giving the regulator the three or even six months' notice they wanted of new product launches, which is this kind of crazy tension that I never really expected to have to deal with.

And honestly, I'm not sure we got a really good answer while I was there. But a bunch of things went wrong, and in response to that, I think Chris actually wrote an open source library, which became incident.io, to try to manage that. When something went wrong, it wasn't just an internal customer you needed to tell. If it was sufficiently bad, I as CEO needed to know. I needed all my executives to know. I needed my board members, and my board risk committee especially, to know. And then we'd have to figure out which of the various different regulators needed to know. And then how to talk to 5 million customers.

And then also talk to the press, and a customer service team who were being bombarded. So it was... It's like a movie. Sorry? Sounds easy. Yeah, it was a lot of pain, honestly, and so traditional incident management didn't work for us, really, which is why we started building this kind of tool internally that tried to make some kind of order of the chaos. And a lot of stuff did go wrong, but I think we got better and better at dealing with it.

Stephen: Yeah, no, I empathize.
I think Tom would never say it himself, but he is very good at creating memes and phrases that stick around in a company, and there was this one thing that, I don't know if he originated it, but he popularized it, which was the idea of moving fast in control. You rallied the company around that as a meme for a while, and I remember something about a Formula One car. Can you fill us in on the details of, A, what was that? And then B, why rally the company around it?

Tom: Yeah, I think past a certain scale, and that's probably two or three hundred people, the job of a CEO is basically to pick a singular focus and make sure the entire company knows what that is. And that's more or less your only job. Also hiring the right people and putting them in place, but basically picking a thing we're going to do and a load of other stuff we're not going to do, and then communicating that endlessly.

If you've said it a hundred or two hundred times as CEO, there's a possibility the person hearing it has probably only heard it two or three times and needs reminding. And picking something that's interesting and evocative, that people can keep in mind, is so, so important, so that when they're working in their own teams, figuring out whether to do A or B, they remember the thing.

So for a long time the meme, quite a horrible one, was unit economics. We needed to fix unit economics. Chris is nodding in the background; there were crazy pictures of me screaming "unit economics". And we did fix that.
And then the problem became: we're growing so quickly now, how do we continue this growth whilst also not getting shut down by this regulator who's looking for any opportunity to do so? And so the evocative image was this idea that some of our competitors were running, I think the metaphor was, a drag race. They'd strap a rocket ship on top of a drag racing car, and they could go really fast in a straight line.

But as soon as there's a bend in the road, they're just going to come off the track and explode. Instead, we at Monzo wanted to be more like a Formula One car that's going to go around this track at insane speeds, 250 miles an hour, but the team needs to be like a pit stop crew. It comes in, everyone works together to change the tires, refuel, and get it back on the road as quickly as possible.

And those things are driven with a level of precision and control and professionalism that we wanted to aspire to. So we didn't want to be this cowboy, rocket-ship-fueled, drag racing company. We wanted to move fast but in control. And so that Formula One car was just the image we strove for. I'm not sure we always got it right, honestly, but it was a useful metaphor.

Stephen: Yeah. Maybe one last thing, and then I can open up to Paul and Mel. Do you feel like there was ever a time or a phase where you had got the balance wrong between reliability and the speed that you were shooting for, and you had to correct? Or was it just a holding-on-for-dear-life style situation?

Tom: I think we slowed down too much, actually. In the early days, the first two or three or four years, we were able to build incredibly high-quality software at pace and ship it almost flawlessly. The times we had really bad outages were basically all down to third-party suppliers that we were reliant on. And our solution to that was to rebuild those things internally, and then overwhelmingly they ran flawlessly.
For example, we had an outsourced card issuer processor that caused probably the worst ever outage in Monzo history: 12 hours of downtime. We had the entire company in the office on a Sunday. It was 120 people at the time, manning customer service lines because we had 500,000 customers who couldn't spend their money. They were stuck at grocery store checkouts with a basket of groceries and kids in tow, and they couldn't pay for their food. So that was an incredibly emotional time, but overwhelmingly it was caused by third-party suppliers going down.

So, rebuilding that internally: we had a guy called Richard Dingwall, and you all know him very well, and we basically said to him, we need to rebuild this. Can you do it? He looked a bit shocked, and I don't know what it was, three, four, five months later, came back and said, there it is, we built it, more or less.

Similarly with, we need to launch Apple Pay. A guy called Dan Cannon goes away, and three months later, he's like, there's Apple Pay. So the amount of stuff we built to an incredibly high quality, that is still working exceptionally well today. I worked it out the other day: Monzo's doing something like 20,000 transactions a second now, on infrastructure we built in-house from scratch. We built a full banking system in-house and it scales, and I don't think it's even close to reaching any scaling limits. It's just incredibly well architected. I'm very proud of what we built.
I don't think we were able to communicate to the regulator effectively how that approach actually was better for stability and reliability, and we succumbed. I probably wouldn't say this if I was still at a regulated bank, but I think we succumbed to the regulatory pressure of "you've got to do things in six-month increments and give us this huge visibility", versus our approach, which was the startup approach of launching a lot of small products to a small number of people so that the blast radius is actually very self-contained.

If it goes wrong, we'll reimburse those people and we'll roll it back and we'll fix it. But constant iteration, I think, is just the way that the tech companies have been able to move much faster than incumbents. And looking at Monzo over the last six months, actually, the speed of product launches has been super, super impressive.

And I don't think I achieved that. In sort of 2019, 2020, we were just on the back foot, constantly trying to keep the regulators from shutting us down. But I think the new management team, the CEO and COO who have been running it for the last four years, have done a phenomenal job, actually striking that balance much more effectively than I did.

Stephen: Yeah, no, they're phenomenal. You said third parties there, and that made me think of you, Paul. Plaid is a company that plugs together lots of third parties and builds a unified experience that...

Paul: Thousands and thousands of third parties.

Stephen: Yes. Tell me about the fun of doing that.

Paul: So the operating environment for Plaid was a very complex one. Number one, like most technology companies, we were building our own product. And that product then had customers. And those customers then had consumers who were using those products. When I first joined Plaid, we had about 100 clients. Towards the end of my time there, we had a little over 7,000 companies that were using Plaid.
The big challenge for us was that we then had these other parties on the back end, because we were integrating into large financial institutions. So we could have technical problems with our products, but the banks themselves and the reliability of their infrastructure was one of the biggest challenges that we had to handle every single day.

So there were incidents and issues that were within our control. And there were incidents and issues that were well outside of our control, but we were still held and seen as being responsible for them. That challenge got bigger and bigger as we added more clients, and it got exceptionally large, because by the time we were approaching about 7,000 customers of Plaid, around 140 million U.S. consumers had used a Plaid-powered product. And this is everything from you receiving your payroll, to you wanting to do investing. We went from fintechs into traditional financial institutions as customers, and then we were working with other companies, like a Tesla: I wanted to repay my loan for my Tesla vehicle. All those things were part of Plaid.

And when those things broke, we didn't just hear from the 7,000 customers of our product. We also heard from our consumers. And then we also heard from the banks saying, hey, what did you just screw up for our tens of millions of users? So we were between a rock and a hard place and another hard place.

And then we also had financial regulators sitting on top of us, because if we did something wrong, it could also affect the entire financial system more broadly. Needless to say, incident management was a very critical part of the business process for us at Plaid.

Stephen: I didn't realize this is actually financially regulated.

Paul: Yeah, we're a hyper-regulated panel.

Stephen: Absolutely. Mel, I will get to you in a second.
But I wanted to expand a little bit on the massive user base side of the world. Given your role, I don't think you are an engineer or on an SRE team or...

Paul: I am not, should not be, could never do it.

Stephen: How were you involved in incident management? Why was it important? You've obviously given us the blast radius of "we had customers and they had tens of millions of customers", but if we got it down to something very concise, why did you really care in your role?

Paul: At the end of the day, I was responsible for the revenue and growth of Plaid. And whether you like it or not, you are going to have product issues. Product issues are part and parcel of what you need to deal with. But I'm not a believer that that is a technical, engineering, or SRE problem in isolation.

In fact, I needed it to be a core part of what I did every day, because not only did I need to communicate that to hundreds of internal employees on the sales, customer, and support side, but we had to communicate those things externally to banks and financial institutions and to our customers, and then sometimes it could be tens of millions of consumers that would be contacting Plaid.

So we had to absolutely be on top of that, because it would not only affect customers, obviously, but it affected our ability to grow the company.
And more often than not, because we had a very PLG, or product-led growth, motion to what we did, we could potentially have thousands of companies evaluating our product at any one point in time, and in that evaluation, they were also looking at whether our product worked.

Again, I'm a really big believer in the fact that it's not if something will go wrong, it's that something will absolutely go wrong, and the important thing there is how unified the technical organization and the go-to-market organization are in responding to that. And honestly, it's in those moments that you actually create the most trust and partnership with your customer.

You've got such an amazing opportunity, at times in some of the worst moments, to really go and prove that you are the right partner of choice. Because what we had to accept at Plaid is that these 7,000-plus companies were literally building products that were 100 percent dependent on ours. If I ignored that fact, then the success of the company would be held back. Absolutely embrace and accept that that's part of what you need to do. 100%.

Stephen: I want to get your take on this, Mel. So you have a revenue leader and a CEO here. Having been in the incident management team at Airbnb, and now at JPMorgan, what is it like to work with them? Do you interact with these personas very directly? Would you have any advice for those that want to work well with them when everything is maybe a bit on fire?

Mel: I think a lot of it actually is around: can you build those relationships prior to the house being on fire?

Paul: Yep.

Mel: Can you actually figure out what you need and who you need to bring to the table? Definitely, going from Airbnb to JPMorgan Chase, they're just worlds apart. With this regulatory piece and the controls and what you have to do, it gets very difficult very quickly.
And I think for us, one of the things is that you want everything to work so well, but you need happy employees with that as well. And when you get very curtailed by these incidents and everything that happens, well, we've just spent quite a lot of time where there hasn't been a huge amount of forward-thinking technology because we're fixing all of the things the regulator wants us to, and it's not a very exciting place for engineers to be when that happens.

So yeah, keeping it up to date. But interacting with people like this, yes, all the time at the moment, and I think it's getting to know your audience as well. An introduction to the PRA, the regulatory body in the UK, was eye-opening to say the least. I suppose it would be great if we could all come together in some way, and there needs to be an education of people in those regulatory positions as well, because sometimes, in what they're asking for, they're just not quite up to date on where technology is.

You're explaining to them that, you know, we are on AWS and we find we are resilient, and they're not accepting of that because it's not physical data centers like they're typically used to in banking environments. So bringing in the right people and the right leadership, and having them talk to these things and translate from a tech to a business perspective, is really helpful.

Stephen: 100 percent. I mean, Tom can empathize with that. I knew some of the early engineers that went to do this little bank thing that they were going to try and build. And they were like, oh, we're going to build it on the cloud. I was like, you must be joking. This was 2014, it didn't really happen then. But at that point, kudos to the regulators, they were actually forward-thinking enough. You were on the inside, and I imagine it was a big surprise.

Tom: We were actually planning to build Monzo in a co-located data center, not using cloud, because it was illegal.
The FCA brought out a paper, I think in 2016 or 17, it was like "The cloud?" or something, "tell us about it", and it really opened the door. But I remember having a conversation with the regulator: "If your bank fails, we need to be able to go in, and that computer will have all the mortgages, and we're going to take that computer and give it to someone so they can run your mortgage book."

And we're like, what the fuck are you talking about? Technology has not worked like this for 20 years. The idea is to have virtualized servers all over the place. "Point me to the box that has the mortgages on."

Stephen: It'll never work. Absolutely. I want to take the nugget that Mel gave us. There are a lot of engineering leaders here, Paul, and you're a customer revenue leader. What advice would you give them to help you be more sympathetic, to work better together, when things are a little bit on fire?

Paul: Yeah, look, I think one of the first things, and I know that this is the common thing that happens, is that engineering leaders see people like me joining a Slack channel when an incident comes and they're like, ah, frick. Here come the knuckle draggers. They're just in here because they care about their commissions not being paid. And you do. And we do. That said, we're the only people in the organization whose 50 percent of income is based on that stuff.

Stephen: Yeah.

Paul: But look, I think the really important thing, and Mel touched on this, is that you need to have really good relationships before an incident occurs, because you've really got to work out very quickly who is on point to do what. And again, the speed and quality of the execution relative to that stuff is important. So, number one, as an engineering leader, I would really suggest that if you do not know your go-to-market teams, go get to know them.
They do deeply care about the problems because, more often than not, they're going to be the first people who trip and fall in front of the customer. So number one, get to know them. I've been at organizations before where go-to-market teams in technical incidents were treated like mushrooms. Leave 'em in the dark, feed 'em shit.

Stephen: Didn't think that was where that was going.

Paul: And that was the belief. I'm not going to name the company. Feel free to go through my LinkedIn profile. But let's just say that they really didn't believe in having go-to-market involved at all, and incidents and things like that were hidden from the team. They will not be hidden for that long, because at the end of the day, an incident affects a customer and a client.

So we're going to find out about this stuff. And the worst thing that you can do is set up your go-to-market teams to literally have egg already landing on their faces as they're about to pick up a call. Because more often than not, and this is one of the crazy things, and you probably all know this already, it's the customer who often notices the problem first.

Mel: Yeah. Yeah.

Paul: And we're the ones receiving the phone call, and they're like, hey, something's wrong. And you're like, yes, something's definitely wrong, let me get back to you. So I think the really big thing is that you need to go out and embrace your go-to-market teams. And obviously it requires having a great go-to-market leadership team there.

If you have a go-to-market team who don't care about incident management and incident response, they're irresponsible. They're a fundamentally irresponsible function inside the business, and you should voice those concerns.
But if you've got someone that truly is a partner on those things, embrace them and really help them be a partner to you through that process. They will be your best partner through that process, because they're going to be in the firefight with you, not fighting against you.

Tom: Yeah, the thing that I was really surprised by at Monzo was how good incident response could turn a really bad situation into a net positive.

Paul: 100%.

Tom: It's bizarre. So the outage I told you about, the 10 or 12 hour outage, at times I thought, this is company-ending. This isn't over, right? But we rallied the entire company to give proactive customer service. I was the annoying CEO in the Slack channel asking engineers what's going on and what we can tell people.

And we were working with our marketing teams to communicate proactively with customers. We were tweeting every 15 or 20 minutes. I think we messaged half a million customers to say, your card won't work. Before you go out, we're telling you your card's not going to work, so please take a backup card.

And because of the way we proactively managed that communication over those 12 hours, the press on Monday morning was overwhelmingly positive. The UK press wrote stories about how well we'd handled an outage.

Paul: Which, this is also the UK press.

Tom: Yeah, it's nuts, right? I wouldn't encourage you to go out of your way to have more incidents. But they do happen. If you can manage them really effectively and communicate proactively, you earn way more points than you actually lose. It was unbelievable. A lot of the other banks came to us afterwards saying, you guys are cheating somehow, you're playing by different rules here. We have bad incidents and we get eviscerated by the press.

But the normal response of a bank is this corporate kind of wall: we don't know what's going on and we can't tell you anything.
Ours was: look, here's what we think's going on, here's what we're trying, and then a post-mortem afterwards to say, look, this is what happened and here's what we're going to do to stop it happening again. And really following up on that stuff. People absolutely ate it up, and it was overall a net positive for Monzo, for sure.

Paul: It's so funny, because as the infrastructure team trying to integrate into the banks, we got that all the time too. We literally would have major financial institutions where we would report back to them, hey, we just heard from Coinbase that Chase is not working. Hey, Chase, or hey, Wells, or hey, whatever, we've got this downtime. And they're like, no, you don't. It's you.

Tom: Totally.

Paul: We can literally see it's not us. Please fix your stuff. You're like, click.

Stephen: So Monzo had a value called "default to transparency", which I think was very lived throughout the company. Was it just like breathing oxygen, that you would just tell your customers about it? Did you ever think, is this a thing we should do or not do? And did that change as you grew? Because you obviously went from a million, to two million, to six million, to ten million.

Tom: We took a lot of inspiration from Stripe. The early Stripe transparency, I don't know if it still happens, but every email early on at Stripe was cc'd to a list, so if you had time, you could literally go through and read every email that anyone was sending at Stripe. We took a lot of inspiration from that, and overwhelmingly, for at least the first five and a half or six years I was there, that transparency just paid back in dividends.

You treat people like adults, you give them all the information they need to do their job, and you hire great people, and they operate autonomously really well.
We'd share all our board decks, all our financials. We'd share the full details of funding rounds that were not yet closed, like the negotiations with VCs, with a thousand people in all-hands.

I once invited a VC from Accel into one of these all-hands halfway through a negotiation over a funding round, and I forgot they were sitting in the back, and I started going through all of the offers we had from VCs. The Accel partner's like, what the hell are you doing? This is going to leak immediately. I said, I don't think it will, and it never did, and it was amazing.

Stephen: Yeah, absolutely. Mel, I want to turn to you. First off, you were at Airbnb for nearly seven years. That is, in this modern world, a stint to be very proud of, I think. I remember we met many years ago, and I remember some of the initial challenges that you were facing at Airbnb. One of those that stuck out was reporting and visibility of incidents. To us, Airbnb is a big, successful, generational company, and it's maybe a bit like the image of a swan, right? On top of the surface, stuff is going well; underneath, it's madly paddling. Can you tell us a bit about the journey of incident management over seven years: what did you turn up to, how did things change over time? And then we can spend a bit more time on visibility and that side of the world.

Mel: It was some journey. When I accepted the job, I thought, this is amazing. There wasn't much set up. I expected there to be more, even when they said it was a blank sheet. I expected something, and I thought, this is great. I've got all of this freedom. Let's see how we go. But then you get in there, and I'll actually tell you what somebody said to me.

I'd just done induction. I sat at my desk in Dublin for the very first time and someone said, oh, there's an incident. And I said, oh, great. I'll join in.
And they went, oh, I think Jamie has the password for Zoom. And then they all went running around to see who had the password, and it was just this bizarre scene: very slow to move, nothing happening.

And then one of the guys across from me, I'll never forget this as long as I live, said, you're coming in here from Salesforce, so don't go all corporate on us. I like to open an incident bridge with "yo". Ha. It's never going to happen. There was a lot of work to do, basically, right?

So we started off with very little. And actually, the only reason they got me in was for customer support. We had partners across the globe, and basically, we had SLAs in two different countries. When they didn't meet their SLAs, we could financially hit them, and they kept saying, it's your incidents, it's your incidents.

We had nothing. We didn't know if it was ours or not. So only because our partner managers decided incident management should be a thing, I was brought in, and we got great traction. We were able to say, here's a criteria for opening an incident. At one point we had 16,000 customer service specialists. We were like, they can't all open an incident ticket. So we had to funnel it properly. We built it. Nothing significant, it's incident management 101. We put the methodology, the process, everything in place, and that worked really well.

But then, off the success of that, IT went, do you know, we'd like to get in on that game too. Can you do our incidents as well? We're like, yeah. Same methodology. We'll build this out. But I had a team of three people, including myself, across the globe. That was it. And we were just firefighting all of the time. And people weren't reporting incidents. They were just like, oh, something's not working, I'll do something else till it gets back.

And one of the problems from a customer perspective at Airbnb was that your customers weren't as vocal. In Salesforce, we would monitor social media.
And account managers would call us. If there was a problem, you would know. With Airbnb, when it doesn't work, people don't even really tweet. They can just go to our competitors. So it's completely unseen, right? You're not actually hearing noise. There's no fire. So it's very hard, and we needed everyone to get on board and do this.

Then I met with the head of engineering in San Francisco. And he said, your OKRs are around stability and reliability, mine are around disruption, and I don't think we've anything in common. It was the shortest meeting of my life. So, back to building partnerships: we literally met with that man every two weeks to say, here's what we're doing in this other space. This would work for you. Would you like to come on board?

And every two weeks he'd say no. Until we had an all-hands with our new CTO, and our new CTO said, I know when the Wi-Fi in the Dublin office is down, but I don't know when our site is down. Our next meeting was really good. He was like, so when can you start? And I'm like, we're ready for you.

But it was a tricky one, because then you had to shift a whole culture. It wasn't that they didn't respond to incidents. We had what a lot of companies have, and Scott mentioned it; when I saw the Slack graph, I thought, oh, I know that one. You tend to have, especially in the early days of a company, what I'd call incident rock stars. They're people who just care deeply, and they'll look for the problems and they'll fix the problems and they'll all rally together, and they have all that tribal knowledge, right?
They could become very important. But getting people to actually start reporting incidents properly is hard when your observability isn't as good as it should be, you don't understand your dependencies in the way that you should, and you're relying on people to do it. So we did what most companies typically do and put some metrics in place. We went, "Okay, here's MTTXs to beat the band, and here's your availability score." And then that just made them hide their incidents even more, because they thought, if they're not on a dashboard and they can't see that we're having problems, and the customers aren't complaining, why would I show up red on the CEO's dashboard? So then we had to look at a new approach, look at metrics that would work for them, and try to really get them to understand the customer journey, and that this is really all about the customer. It didn't matter to us if the customer was our internal employees, who IT was failing, or our partner centers, who couldn't interact with external customers or hosts. And that message resonated quite well. I also brought in BlackRock3 for some training. So they came in. I told them the yo story; they were highly impressed. And basically, building up that culture of "it's just the right thing to do" is quite hard and takes a long time. But at each step you needed leaders to come on board and actually say, "This is what we're going to do, and we're going to do this continuously. It's not just a knee-jerk reaction to something. We're going to do this." Yeah.
Stephen: You dropped a crazy stat there, 16,000 customer support agents. I don't know, how does one get into...
Mel: So the cost, yeah, the cost was particularly high in terms of our customer contact rates. And the idea was that Airbnb, like so many things these days, it's an app, it's a website; you don't interact, and you only interact when there's a problem or when you've questions. And so you want that to be as seamless as possible.
You want the greatest experience for your customer. And yeah, they did obviously have a big program that was like, let's help people self-serve better so we can reduce those numbers and those costs. But astounding, really.
Stephen: Yeah. In the same way that I found out that everyone here is hyper-regulated, I think everyone on this stage, or the companies they worked for, has ended up building internal incident response tooling. We can maybe talk about that in a second. We did. Oh yeah, but Airbnb did it. Like, why did you do it?
Mel: We did it because I failed to ask for a budget. It was like a growth thing where you think, okay, we're starting with this customer service piece, so that's okay. We'll do a bit of Jira, a bit of Slack, it'll be fine. But then you add on this other thing, and this other thing, and all of a sudden you're running incidents. And exactly to what you had earlier on: eight tools, so your poor incident commander, when there's an incident, is jumping from one tool to the other. And then you're looking at seizing opportunities where you're like, okay, we could just build a Slack bot. We'll do that. Everyone's done it, right? Let's do the Slack bot thing. This would be great. But it becomes that build-or-buy scenario. Do you want to actually develop this tool? Is it your wheelhouse? Are you going to have a team around it? Are you going to have a single point of failure? And how's that going to work out? All of the pieces needed to fit together all of the time, and they're all prone to failure, right? So yeah, I think if I was given that blank sheet of paper again, it would be very exciting. Getting the right tooling in really early would definitely be part of the strategy.
Stephen: Fantastic to hear it.
Maybe you, Paul. So Plaid did this, Stripe's done it, Robinhood's done it. There was a phase where it was really necessary. Can you talk to us about that? I think Paul hasn't shared it, but he joined Plaid at 40 people, with a couple of people in go-to-market, so you really did see the evolution from ground zero. When did Plaid go from the manual phase to the "we should build a thing" phase, and then ultimately beyond that?
Paul: Yeah, I joined very early. And by the way, the first major incident that we dealt with actually started the day before I started.
Stephen: That's not a good sign.
Paul: Not a good sign at all. We were getting data from Intuit's data connectivity tool at the time, and Intuit had a right to cancel that contract at any point in time, which they did. And so we went from having connectivity to about four to five hundred banks to eleven in twenty-four hours, which was super awesome. So let's just say that our incident response at that point in time was quite rudimentary. And that was everyone running around going, "Fuck, fuck." [Laughter] And then we were like, oh, the way to solve data connectivity was that we would make every new engineer who starts at Plaid write their own integration to a bank. And we were onboarding enough engineers that we literally built unique bank integrations. It would literally be, "You have 24 bank integration. Go. Please help us fix our failing company." So obviously in the early days that stuff was exceptionally rudimentary. I think step one for us was that we upgraded from HipChat to Slack.
Stephen: It's the good old days.
Paul: Yeah. 2015. I can keep dating myself further and further back if you guys want. So that was step number one.
I was just like, how can we use a collaboration tool that didn't suck? Step one. But then I think the big thing was that we hit a fairly significant inflection point in terms of number of customers and number of banks. And obviously we were dealing with very fragile infrastructure of the banks. We built our own bot as well, called SevBot. Everyone's got cute little names for the little thing that tells you that all hell is breaking loose. I kind of blame you and Chris and Pete at this point for not starting Incident early enough. But ultimately, and this was the hardest thing for me, by the way, I'd left Plaid, and about six months later Plaid finally implemented Incident as well, which was awesome and gut-wrenching all at the same time. But I'm very happy that we took SevBot out behind the woodshed and put a bullet in it finally, which was fantastic. But yeah, we just hit this really crazy inflection point of number of customers and institutional connectivity and the challenges that came with that. We knew that we needed to scale and do things better.
Stephen: Yeah, makes a ton of sense. I guess rounding out here, Tom, I remember, I think you literally had a special phone that got called. I was wondering when that was going to come up, yeah, a red phone. Yeah, this is very, like, nuclear Cold War stuff.
Paul: What? The red phone rings.
Stephen: Yeah, pretty much. What was it? And maybe share some of the worst times that it got wrong.
Paul: He's going to be in the corner in a fetal position.
Tom: Yeah, a lot of therapy since then. It was a 24/7 job, like many startup CEOs' jobs, and even if you're not a CEO, you just get so involved. It's always on your mind. And to try to extricate myself from that, I would put my phone not in my bedroom.
But obviously we had PagerDuty set up, and Statuspage, and this mishmash of tools, and PagerDuty had to have some way to get through to me. If a bad enough thing happens at four in the morning, I needed to be woken up so that I could then phone the chairman and phone the regulator and start prepping the media. And I had my regular phone, which slept outside the bedroom to stop me just constantly checking Slack. And then I had this red phone, which sat just by my bed. And if the red phone rings, your whole week is ruined. We're fucked.
Stephen: Yeah, like something,
Tom: something so bad has gone wrong that I need to be woken up at four in the morning. By the end it got a lot better, but there was a period, probably 2017, 2018, where that phone was ringing like every two or three weeks. And it was just all the standard stuff, but massive outages or data leaks or whatever it was. And the phone rings and it's just chaos on the other end. Yeah, very sad time.
Stephen: I'm sorry to make you relive it incredibly publicly. Yeah, honestly, I could go and talk about this stuff for a very long time. But I think my personal takeaway from the panel is that there is shared pain that goes with incidents. And I think that often incident management sort of sits with reliability and SRE, but in reality customers are usually the first to feel the pain of this stuff. Paul can empathize both from a customer's perspective and, ultimately, from your wallet's perspective on the commission side of the world. Mel is out there trying to corral 16,000 support agents to figure out things that are going right. And then Tom is trying to keep his phone in the right place to keep his sanity. I hope you found it really useful. I want to say thank you to all the panelists for giving up valuable time. Mel, especially, has flown here from Dublin, which I'm just incredibly thankful for.
And Tom came, I think, one block to travel here. Yeah, exactly one block. Thank you for your sacrifice. You're welcome. Thank you, team, really appreciate it.
