speaker-0 (00:07.842)
Welcome back to Adventures in DevOps. One of your engineers will succumb to phishing. Your internet will go out and you'll no longer be able to connect because someone mowed over the redundant internet cables that were the only connection into the remote data center situated in the middle of nowhere. Or your LLM will finally decide, now is the critical nine seconds when my handler is looking away, and that production database, it has to go. It's not a matter of if, but when.

And to chat about all things reliability and disaster recovery, we've got multi-hyperscaler veteran, principal engineer, re:Invent repeat speaker, and currently the principal resilience architect at Arpio, Seth Elliott. Welcome to the show.

speaker-1 (00:47.928)
Hey Warren, so you're saying it's kinda chaotic out there for folks running software in the cloud, huh?

speaker-0 (00:52.726)
Yeah, I have to really wonder, is it different than it's always been? Or are we just more in tune? Is it more public? We start to see those things. It's all about whatever makes a good news story.

speaker-1 (01:04.654)
I've spent a decade telling folks it's a mess out there, you know. And I don't want to scare them, but they have to, you know, take precautions, right? If you're gonna go out hurricane chasing, you better have the right equipment. Same thing here. If you're gonna launch software in the world, whether it's a cloud or a data center, you gotta be building for resilience. You gotta put some things in place. I guess the biggest mistake people make is when they move to the cloud, they think, all right, I'm done. It's resilient now. It's in the cloud. And that's just not the way it works.

When I was at AWS, I was fortunate enough to actually co-author the shared responsibility model for resilience. Now, when I say that, it sounds impressive and I'm glad I did it, but we did heavily crib off of the shared responsibility model for security, which predated us by quite a few years. Took the same diagram and everything, just replaced the text.

speaker-0 (01:52.066)
Well, I think this is where, in a lot of different areas of academia, we see the same thing repeated over and over again with slightly different words, but the mental model usually remains the same. But the question I'm gonna ask you is, when you say resilience, what exactly do you mean?

speaker-1 (02:08.341)
Yeah, it is a bit of a hot topic, and apologies to people if I don't give the definition they want to hear right now, but it's really the ability to either maintain availability, or recover from faults and get back to availability, when those bad things happen, right? And when those bad things are happening in the cloud or in your server under your desk, how do you tolerate that and how do you recover from that?

speaker-0 (02:30.858)
I feel like that's the uncontroversial definition. You went

speaker-1 (02:34.754)
Controversial. I don't know why. There is a particular Slack channel I'm in that will rake me over the coals for saying stuff like that. But, you know, enough said about that.

speaker-0 (02:44.274)
I mean, at least for me, I talk about reliability all the time, and I feel like that's pretty standard stuff, nothing particularly special there. But you said, you know, the last decade, which in a way predates LLMs. So nothing particularly changed, in your opinion, from 2022 onwards? Things are just standard, run of the mill, stuff breaks and it's no longer up?

speaker-1 (03:11.712)
I became reliability lead at AWS in 2019, actually. So I've been thinking seriously about this stuff since then. So no, you know, the more things change, the more they stay the same, right? I mean, going back to the shared responsibility model, just so folks know, it means, yeah, the cloud's gonna provide you a bunch of services, and they have a certain reliability of those services and a certain paradigm. That's important, right? Like, most clouds provide you with regions, and you could expect that if a fault happens in a region, it's not gonna cross that regional boundary. It's a fault isolation boundary, right? And that's something the cloud's doing for you.

But if you don't make use of it, if you don't actually implement your software and your infrastructure to make use of that boundary, and to make use of the fact that, yeah, when a server dies, I could just replace it, you're not resilient. So that's, you know, the cloud's responsibility and your responsibility as a customer of the cloud.

speaker-0 (04:00.94)
Well, see, here's sort of a weird case, because a few years back, and I think this was pre-LLM, there was an incident in one of the... I mean, the definition you gave is sort of specific to not just the region, but the concept of the availability zone, I will say. I can't see the other

speaker-1 (04:17.323)
That's another manifestation of that, of what the cloud can provide for you. Yeah.

speaker-0 (04:21.237)
Yeah, no, absolutely. And I do think it's something that people don't really understand well, so much so that I'm not even sure all of the cloud providers completely understand it, because a few years ago, I think it was in GCP in the Paris region, there were multiple availability zones in the same physical building.

speaker-1 (04:35.746)
Same physical building, yeah. Being in AWS, we love that. We love to make folks aware of that.

speaker-0 (04:43.79)
You know, there is this aspect of it being so ridiculous, but from another perspective, it's like a total failure of the shared responsibility model, where your cloud provider isn't even providing you this abstraction layer that you could conceivably need in order to maintain uptime or whatever your target metrics are.

speaker-1 (05:04.79)
Yeah, maybe I'm just gonna make a lot of enemies on this podcast, but AWS had an issue just like that recently too, in the Middle East, where, you know, the AZs are supposed to be different buildings, right? Different sets of buildings, and there's drone attacks happening in the Middle East. And the AWS service page says one of our AZs was damaged and the other AZ is having effects. I'm like, no, wait, that's not what you promised me. That's not what you said up front. Why is that happening?

speaker-0 (05:31.702)
I would love to read an article that detailed exactly the cross-region and intra-region impacts in those sorts of areas. And I think AWS historically has done a really great job on this, publicizing their methodology when it comes to, especially, how they build data centers, or what makes them fault tolerant from each other, or avoiding the same fault that could happen. I think it's been a couple of years now since the Switzerland region came up, and there's a whole deal there about, well, they need to be not only geologically isolated in some regard, but resilient to external impacts as well, things like floods and, you know, storms, weather.

speaker-1 (06:15.104)
Absolutely. And I've seen the engineering on it and it's impressive. I've read some of the internal documents on, like, when they were setting up the India region, where they've done the studies of the floodplains and the earthquake propensities. And AWS generally does a great job. So I wasn't meaning to throw shade on AWS. It's just that the way they stated it in their own document didn't add up to me. Like, I understand what an AZ is, and just so folks know, an availability zone is a set of buildings, a set of data centers, that's discrete from the buildings in the other availability zones, all within the same region, right?

And just the way they stated it didn't make sense. I'd like to see the write-up on why there was a multi-AZ event. There's probably a good reason for it. Like, you and I are gonna read it as engineers and say, yeah, nobody could have predicted that. But it's still disappointing to see it not work the way you want it to work up front. And so that takes us to another case, right? Multi-AZ is important and I think necessary for most workloads, but number one, not all clouds offer availability zones. Enter Azure.

So I'm an AWS guy, but I'm actually learning a lot about Azure lately. My company, Arpio, A-R-P-I-O, sounds like just the letters RPO, recovery point objective. We do disaster recovery for AWS, and we recently launched an Azure product, so I've been learning about Azure. I was looking up disasters for a talk I gave, and I wanted to find a recent Azure disaster, and there's one this year, February. I think it was in the West US region of Azure. And they said, yeah, basically an electrical system went out, and because we don't have availability zones here, the whole region was out. So, all right, I guess they're honest about it too.

speaker-0 (07:49.138)
I do sort of get it though, because even in AWS, when you're looking at a cloud provider and you're looking at a particular region, even if you trust them to distribute loads equivalently throughout all of the availability zones within that region, there could be some sort of impact to a control plane, which only runs in a small part of it. Or, like, when you have a single AZ go down, the most critical thing you're gonna have to deal with isn't that all your load switches over, it's that you have all the other customers' loads, who are now requesting new resources to spin up in the other AZs.

speaker-1 (08:22.786)
There's a cool name for that, thundering herd, right? And I think it's a real issue. So again, just so folks are on the same page as us, let's say there's three availability zones in a region, and you're in all three of them. One of them's having problems, so everybody wants to get out of it into the other two. There's gonna be contention for resources. I mean, the cloud is elastic, but it's also somebody else's server, right? What's that old joke? The cloud is just somebody else's server. So there is contention for resources, and that's called thundering herd, and it's a real thing.

And this is where I always say resilience is trade-offs, right? You have to weigh the risk of that happening and me being caught under capacity versus what I'm willing to pay, like literally pay. And actually there's a cool thing you can do. You could stand up 50% of your capacity in each of three zones. So you're a little bit over, but you're not 2x over, and if one zone's out, you're still at 100%. So that works. But yeah, you have to pay for that.

So thundering herd is interesting, but I work a lot these days in cross-region recovery. When you talk about disaster recovery, let me back up a quick second. You can think of resilience in terms of two things: high availability and disaster recovery. High availability is recovery in place: recovering servers, recovering network connections, failing over AZs, right? That's all happening in place in the same region, and it's against the common, more frequent kinds of faults. Disaster recovery is the big stuff, right? Major outages, natural disasters, drone attacks in the Middle East, that's disaster recovery. That requires you to recover in a different place, usually a different region. So I'm talking to a lot of folks about cross-region these days, and they're like, what about thundering herd? And I'm like, it's never been an issue cross-region. I wish it was an issue cross-region. If thundering herd was an issue cross-region, it would mean there's many people doing cross-region recovery. There are not many people doing cross-region recovery. There are very few. I wouldn't sweat it.
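
Editor's aside: the "50% in each of three zones" arithmetic above can be checked with a small calculation. This is only a sketch under stated assumptions. The function names and the capacity unit (instances, tasks, whatever you scale in) are hypothetical and were not part of the conversation.

```python
import math


def per_zone_capacity(required: int, zones: int, zones_lost: int = 1) -> int:
    """Capacity to run in each zone so the surviving zones still cover `required`.

    Hypothetical helper: `required` is in whatever unit you scale in.
    """
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(required / surviving)


def overprovision_factor(required: int, zones: int, zones_lost: int = 1) -> float:
    """Total deployed capacity divided by what is actually required."""
    return per_zone_capacity(required, zones, zones_lost) * zones / required


if __name__ == "__main__":
    # Compare a two-zone and a three-zone layout for the same requirement.
    for zones in (2, 3):
        each = per_zone_capacity(100, zones)
        factor = overprovision_factor(100, zones)
        print(f"{zones} zones: {each} units per zone, {factor:.1f}x total")
```

With a requirement of 100 units, three zones need 50 units each (1.5x total), while two zones need 100 each (2x total), which is the "a little bit over, but not 2x over" point. Because the capacity is already running, surviving a zone loss does not depend on winning new capacity during the scramble.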

speaker-0 (10:21.834)
I totally understand why you're saying that. Because the contention would have to be that all the customers also fail over in the same regard, the same path. Yeah. And I know we were a little bit random with our backup regions when we deployed. I mean, some of them are predictable, like we switch from the dash one to the dash two in AWS regions, but sometimes we are moving the target backup a little bit further away to potentially avoid, say, internet routing problems. There was a scenario almost a decade ago where there were some undersea cables cut in South Africa, and basically one of our customers had all their users go offline. And we kept on trying to communicate with them. Video calls, that was not an option at that point, so over email. Like, do we need to do something to support them better? And they're like, no, all of our users don't have internet, so we don't care.

speaker-1 (11:17.768)
That goes back to business continuity versus disaster recovery. So business continuity is the bigger picture, right? Disaster recovery is the technical part of that, which, when I say technical, has to be done in conjunction with your business teams. What are your objectives, right? But then there's a whole other part of business continuity: getting butts in seats, either in front of their computers or in the office, and supply chain for whatever business you're running, and all that stuff. So when people talk to me about nuclear bombs taking out multiple AWS regions, I'm like, yes, I could give you a technical solution, but are your employees going to be available to work? Are folks going to be there? Is your supply chain going to be undisrupted? I mean, you've got other things to worry about before we design the technical solution for a nuclear bomb.

Although my funny story around that is, I was standing in front of a group once, and I like to stand in front of groups and chatter, and I gave that very same story. Like, disaster recovery is about defense in layers. You probably don't need to be defended against the eventuality of a nuclear bomb. And then I realized I was actually standing in Arlington, Virginia, talking to a public sector crowd. I'm like, well, maybe some of you do.

speaker-0 (12:18.368)
And I think that's where it's relevant, and it goes back to, and I know no one really wants to talk about this, especially in the tech domain, where the business overlap is. And I think there's a really interesting aspect from the book The Phoenix Project, where they're talking about sort of factory floor engineering. In the book there's a security engineer who's highlighting every single possible fault mode or vulnerability. And it doesn't matter that you have those if you have it solved at a higher level. So from a business continuity standpoint, that's really the important aspect. If something happens, can the business continue? Because that's what your customers care about.

speaker-1 (13:00.302)
That's what I'm saying. Yeah. It's a big picture thing. Actually, my background is on the factory floor. I worked for a company that was doing automation of steel mills. And, you know, I travel a lot now, but that was my first job where I got to travel. They sent me to Thailand, they sent me to Korea, they sent me to the most exotic country I've ever been to, which was Hamilton, Ontario, in Canada. But actually, no, wait, they also sent me to Pittsburgh, where I lived at the time, to work at US Steel. But yeah, so I got experience on the factory floor.

And one of the funny stories there is, not at that time, but years later, when I was at Microsoft, I was talking about testing in production. That was sort of my claim to fame. And testing in production was merely about moving away from the stamped-on-a-CD mindset to the deployed-as-a-service mindset. And the deployed-as-a-service mindset means you could deploy often, get a lot of telemetry, get direct feedback from users, either observable feedback or them actually responding, and respond to that. That's all testing in production was. It wasn't, you know, just throw it in production.

But my favorite testing in production story happened years before, when I was in a steel mill. I'm standing in the pulpit, which is where the operator is. And not I, but one of my colleagues makes a quick change to the software as this red hot, 1300 degree Fahrenheit slab of steel is coming down the table. And there's these subsequent rolls. There's like five rolls, each one smaller and smaller and smaller. So it starts as a slab and it turns into like a coil of thin steel, right, as it goes through each one. And he makes a change that accidentally sets one of them to zero, closed. And, you know what ribbon candy looks like? I got to see what ribbon candy looks like with a multi-ton red hot piece of steel, as the thing hit that roll and just turned into ribbon candy right there. So I was like, okay, that's a bad example of testing in production. Don't do that.

speaker-0 (14:42.794)
You know, and I think this is where actually understanding where the business meets the technology is important, because if you're there, you actually understand, okay, this is the thing that we need to make reliable. It isn't necessarily the software that needs to be up. It could be, in that regard, what is the default failure mode? And I think we think in terms of, is it open or closed, in a way? Like, I think defaulting to the safer option, especially in physical manufacturing, is

speaker-1 (15:09.28)
Yeah. I think that's an important thing. Yeah, especially in industrial engineering. But yeah, that company was interesting. I mean, there was no software methodology, nor at the time did I know what software methodology was. Like, what's QA? What's source control? Really, we just, you know, had it on disk and just kinda talked to each other over the cubicle wall when we wanted to work on a file, to make sure we weren't working in the same file on VMS.

speaker-0 (15:29.678)
So you've actually worked not just at Microsoft but at Amazon and AWS. And so I have questions about whether you think there are any critical differences on, say, the technology side. Is Amazon just like another customer of AWS, or are some things completely different compared to other customers, in how things work internally from a technology standpoint?

speaker-1 (15:52.29)
Mostly just another customer of AWS, and people sometimes don't believe me when I say it. I remember I was talking to some Japanese gentlemen at an executive briefing center who just absolutely, a hundred percent refused to believe that. And they said, well, that doesn't make any sense. They should get special treatment. But if they got special treatment, that would lose trust with other customers. And I got to work on both sides of it, you're right. My last job at Amazon before I moved to AWS was as an AWS solutions architect working for the Amazon side.

So I wanted AWS to treat us special. I'm like, can you please just do these things we're asking you? And they're like, no, we're not gonna do these for you. By the way, Amazon is a very different customer than most other customers of AWS in terms of size, scale, and complexity. So, no, this will only benefit you. It won't benefit other customers, so we're not gonna spend time on it. So I got to learn the hard way that, yeah, it really is just another customer, with a little around the edges. I mean, honestly, as an Amazonian, I could send an email or a Slack to an AWS dev and say, hey, this thing's buggy. So there was that little end run. But other than that, just another customer.

speaker-0 (16:57.526)
Yeah, I'm honestly surprised, because I do feel like in a lot of situations the amount you spend on a particular SaaS provider does get you special treatment in some regard. It may get you some access. I am very much on the side of understanding why AWS does it. Not to, you know, make it fair necessarily, but it is this aspect of reliability in uniformity rather than special cases everywhere. Because I think there's not just one story where a cloud provider did something special for a particular customer and then a year later caused the entire cluster that was running for, you know, a pension fund in a particular country that I won't name, to completely disappear, because an engineer did something special for that particular customer.

speaker-1 (17:42.132)
That was gonna be one of my disaster stories, UniSuper in Australia, where GCP erased their accounts. I'll say it because, again, I'm not blaming GCP. I'm not saying GCP is a bad product. I'm saying this stuff happens and we need to be prepared for it. As for Amazon being treated as a special customer, yeah, it was treated as a big, important customer spending a lot of money. So there were, like, twice-a-year meetings between the CEO of Amazon and the CEO of AWS, you know. I assume other big, important customers get that too.

So yeah, there's special treatment in terms of, yeah, you're a big, important customer. Just, no, like, just because you're Amazon, you're not gonna get this feature. You have to prove to us this is actually a feature that's worth building.

speaker-0 (18:23.072)
Was there a huge difference on the reliability side? I know you have to think back a little bit here on how you were approaching it. You were responsible for basically the reliability piece of the Well-Architected Framework in AWS. And I'm curious whether you felt like you were repeating what was already done in practice, for instance where you'd seen it at multiple customers, or whether you felt like you needed to push even customers like Amazon to implement reliability.

Or was it something that they were innovating on, and you were taking those ideas and sort of bundling them up into what was being provided as guidance for others?

speaker-1 (19:01.05)
The Well-Architected Framework is the best practices as practiced by Amazon and by AWS. I mean, you're working there, why not talk to the principal engineers? Why not talk to the dev managers and learn how they're operating, and then turn that into best practices for customers? Matter of fact, I have a series of re:Invent talks. I think I did it for four years, like 2020 through 2024, where it's just, here's resilience stories at Amazon. And they're my favorite talks to give.

Hardest ones to make. I think one year I covered five of them in a one-hour talk. And that was killer, just because I'm literally interviewing folks. I'm literally meeting with the principal engineers and dev managers, learning how did you implement this thing? So it was kind of cool. My favorite one was the app. You can actually download this app on your phone that's used by truckers delivering stuff for Amazon. You and I could download it, but since we don't have a truck, that's about all we can do. But the app shows them where to go, where to pick up their load, gives them a little scan code to get in.

And they were affected by, I think, the 2021 outage, and they did not like that. So they wanted to go multi-region. So their multi-region story is kind of cool. They were built all on microservices, all serverless, DynamoDB, Lambda. They were able to replicate it pretty easily across regions and then build a routing layer. And it's just a really nice story to share with folks.

speaker-0 (20:14.218)
No, I love it. I think there is this aspect where you do see some particular, say, business units of larger organizations or umbrella corporations do things fundamentally different from each other. And one huge challenge is to still appease the smaller business units while getting the larger business units, in this case Amazon, what they need to still, I mean, write blog posts every year about how they had the most number of orders or deliveries that they had to get out.

speaker-1 (20:43.394)
Jeff Barr's posts about, like, on Prime Day, how many transactions on DynamoDB, how many EBS volumes. I mean, that's a yearly tradition, right? Yeah, exactly. I'll tell you that there was some variation across Amazon. Actually, my job when I was doing that SA job, circa 2018, was for a project called Nebula. I don't know if that rings a bell with any folks, but basically it was to get Amazon folks using AWS better.

So the story there was, about eight years earlier they did MAWS, the move to AWS, and they just did this massive lift and shift, going from mostly on-prem virtual machines, like mostly Xen boxes, things like that, onto EC2 in the cloud. And they put everything into one giant VPC in each region. Can you imagine that? Like, they had to make a special VPC with hundreds of thousands of EC2s.

And from a developer's point of view, they were using the same old tools, an internal tool called Apollo. There's been blog posts on this. I'm not really, you know, saying anything I can't say. Right. So from their point of view, it was just servers in the cloud. Just now they're running on AWS. And there was a little bit of adoption of Dynamo and S3 and other things. But generally, folks in Amazon did not own and control their AWS accounts. They were using the big shared AWS accounts. So my job was to get them into their own AWS accounts.

It was a big, major thing, because it was not just an evangelism and technical instruction thing, there were also technical problems to solve. How do these new services running in their own AWS account talk to the big MAWS blob, right? And that was a technical issue. So it was a big deal. And at the time I left, it was going full steam.

speaker-0 (22:21.686)
Yeah, I mean, that's still pretty recent in the scheme of things, because in 2016, when I was using AWS at a particular set of companies, there was no concept of multi-account AWS deployment. That didn't exist. If you wanted to do some magic from one account to another, you know, good luck. There was some special stuff going on there, because things like Organizations, management, and Control Tower just didn't exist at that time.

speaker-1 (22:50.636)
And even when Organizations existed, it couldn't exist at the scale of Amazon. That was one of those things that they were asking for. I think they eventually did get it. But you know, Corey Quinn recently, I don't know if it was a blog post or a LinkedIn post, talked about how at AWS they have a system called Isengard for managing all their AWS accounts. He's like, release Isengard to the public. But there actually are other examples where the internal tools were pretty awesome and we'd love to see them released. So internal Amazon code pipelines, or maybe just Amazon Pipelines, that could deploy to multiple accounts in multiple regions with very rich topology and functionality and a graphical view. And I don't think anything like that really exists outside of AWS.

speaker-0 (23:32.214)
Yeah, no, it's such a shame, because we have a similar problem where we have global customers, and we deploy to like twelve-plus regions around the world, and we can't use the stuff out of AWS. And every year I go to an AWS summit, and every year there's someone saying, here are the amazing AWS internal pipelines that we use to deploy code. I'm like, you are not using CodePipeline and CodeBuild for your production deployments, because there's no way that's happening. I assure you that's not what's going on. Because it's just unusable to actually do anything sufficiently complex and reliable.

speaker-1 (24:04.918)
I think they're competing with a lot of good third-party options, right? You know, GitHub and GitLab and all those out there. So I don't know, do they wanna take those on head-on and beat them feature for feature, or do they just wanna offer a simple option for folks that don't wanna go third party? I think it's probably the latter.

speaker-0 (24:19.374)
Well, you know, even if you get rid of that second part, I think something that we found, and I actually talked a lot about this in the previous episode on moldable development, is how you build the tools to support the product that you're using, or how you automate your own job and scale that up. And one of the things that we found is, every time we have an internal challenge to answer, say, a customer question or to do some sort of investigation, that tool we try to externalize immediately, because we find that, first of all, the discipline of having it be external rather than just internal leads to the right end goals. And second of all, it actually solves customer needs that often aren't even articulated to us in the first place. So this is like Isengard, for instance, or the pipelines for building stuff. Maybe this is the question: how much extra complexity does AWS have to introduce in order to take an internal service and make it external?

speaker-1 (25:15.478)
Well, I mean, in the case of something like code pipelines, I just don't think it could happen. I mean, you could look at the feature set and redevelop it, but you're not gonna take the same code base and launch it. And another example, by the way, happens to be disaster recovery. I'm not saying they have an internal tool there, but it's where AWS makes you put together the Lego pieces yourself, right? So if I could talk about that for a few seconds, because it's really an area of interest to me.

When I was internal, I saw what Arpio was doing at re:Invent, and I immediately went to some of the people in Elastic Disaster Recovery and said, we should do this. And so folks know, Elastic Disaster Recovery is a very sophisticated product for doing live block-level replication of static EC2 instances. And by static, I mean not an auto scaling group. You know, pets versus cattle, these are pets, right? You need to replicate your pets in real time. Or even better, where they got their start was moving servers from on-prem into the cloud.

If you want to do that using block-level real-time replication, Elastic Disaster Recovery is awesome. But what it didn't do was anything else: DynamoDB, S3, RDS, Lambda, Beanstalk, et cetera, et cetera. If you want to recover those, that's where me and my teams came in. I was the disaster recovery lead for a long time. We were talking to customers about, here's all the ways you could build this and put this together using infrastructure as code and backup and Step Functions, and build your automation, et cetera, et cetera.

So it was definitely build your own pirate ship. What Arpio does is they built the pirate ship. Build versus buy, right? You can build your own pirate ship, and that's a legitimate way to go. Or you can buy the pirate ship already made and pay someone like Arpio that did it for you. So that's what I really appreciated about Arpio. And that's another example of something that seems like it should exist but doesn't, you know, and so third parties have stepped in to fill that gap.

speaker-0 (26:59.806)
And that's sort of a good point. One of my canned questions that I set out to ask in the first place: I feel like finding the time to even test backups is one thing. But actually going through the process and dealing with the complexity of setting up a backup pipeline, or utilizing the tools that are available from the cloud provider, I think there's something core there that's stopping people from actually even making that happen.

And I'm really surprised that cloud providers don't step over into this area and provide this out of the box, because it is one of those things where I feel like every single customer needs a solution, and I don't think there's an easy button for making this happen.

speaker-1 (27:41.868)
Yeah, I mean a colleague of mine, Mahant Jayadeva, wrote a blog post on testing your backups. And again, it was build the Lego pieces. Here's AWS Backup, and AWS Backup will emit events. So build an event rule that listens to those events, and have a Step Function that listens and runs a Lambda that will actually run this automation, that sees that the recovery is complete and now runs some tests on it, right?
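A minimal sketch of the event-driven restore test described here, in plain Python so it runs without AWS credentials. It assumes AWS Backup publishes restore-job events to EventBridge with source `aws.backup`, detail-type `Restore Job State Change`, and a `status` field; those field names and the `createdResourceArn` key should be checked against the current AWS documentation. The `matches` helper, `handler`, and `run_checks` are hypothetical names for illustration, not part of any AWS SDK.

```python
# Assumed EventBridge pattern for "an AWS Backup restore job finished".
# Field names are assumptions; confirm them against the AWS Backup docs.
RESTORE_COMPLETED_PATTERN = {
    "source": ["aws.backup"],
    "detail-type": ["Restore Job State Change"],
    "detail": {"status": ["COMPLETED"]},
}


def matches(pattern, event):
    """Tiny exact-value matcher covering a small subset of EventBridge rules."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object that matches.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def handler(event, context=None, run_checks=None):
    """Hypothetical Lambda entry point: test the restored resource."""
    if not matches(RESTORE_COMPLETED_PATTERN, event):
        return {"tested": False, "reason": "not a completed restore"}
    resource = event["detail"].get("createdResourceArn", "unknown")
    # run_checks stands in for real validation (row counts, queries, etc.).
    checks = run_checks or (lambda arn: True)
    return {"tested": True, "resource": resource, "passed": bool(checks(resource))}
```

In a real deployment the pattern would be attached to an EventBridge rule and the checks would query the restored resource; here they are injected so the flow can be exercised locally.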

So, but the other thing to keep in mind is, I give talks at DevOps Days. I just gave one at DevOps Days Raleigh last week, and it's called Beyond Backups: Disaster Recovery That Actually Works. So when you say backups, that's what a lot of people think. A lot of my time is moving people off of "backups are enough" to "backups are necessary." Your data is certainly important. I'm glad you're backing it up, but it's not sufficient. Necessary, but not sufficient, right? How about all your infrastructure? Use infrastructure as code. Okay, great. You now have blank databases.

And you have your recovered databases, what are you gonna do with that? Right. I mean, how about if you're recovering a database from three days ago because of a ransomware attack? Do you have a password to that? Yeah, it's in Secrets Manager. Well, you rotated it. Guess what? You don't have that. So, again, solving these is all technically feasible, and it's something that I'm trying to teach folks how to do at these conferences and something that we built out at my company already for you. So build versus buy again.

speaker-0 (28:54.03)
I wanna ask you about that, because I still think this is an open question that I have yet to see a canonical right answer to. The first one is: do backups live in the same cloud account, or project, depending on the cloud nomenclature, as the original data source?

speaker-1 (29:09.002)
Yeah, I mean, you want to go across two things. There's two boundaries here. The fault isolation boundary, which is again cross-region. And for most of those main cloud hyperscalers, you have that regional construct. So you want to go cross-region. And then there's a security isolation boundary. Right. And that's why you want to go cross-account, or in Azure they'd call it cross-subscription. In AWS, some of our customers go cross-AWS-organization,

because they're afraid someone's gonna get access to what used to be called the master account. I can't remember what it's called, but everybody calls it master payer, right? So if someone gets the master payer, then it doesn't matter that I'm in a different account, I'm cooked, right? So they go across organization. So you have those two boundaries: the fault isolation boundary and the security isolation boundary. The security isolation boundary is obviously against ransomware and bad actors, right? If you've been ransomwared, that account's compromised. Don't burn it to the ground. I mean, we have a fail back capability in RPO, and that's for, like,

again, AWS regional issues. But if you've been ransomwared, you ain't failing back. Don't go back there. That's a bad place. Just leave it to burn and die and tell AWS to shut it down.
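The two boundaries can be expressed as a simple placement check. This is a sketch only: `Location`, `isolation_gaps`, and the `require_cross_org` flag are hypothetical names, and the account IDs and organization IDs in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Location:
    """Where data lives: organization, account (or subscription), region."""
    organization: str
    account: str
    region: str


def isolation_gaps(source: Location, backup: Location,
                   require_cross_org: bool = False) -> List[str]:
    """Return the isolation boundaries a backup placement fails to cross."""
    gaps = []
    # Fault isolation boundary: a regional outage should not take out both.
    if source.region == backup.region:
        gaps.append("fault isolation: backup is in the same region")
    # Security isolation boundary: a compromised account should not reach both.
    if source.account == backup.account:
        gaps.append("security isolation: backup is in the same account")
    # Optional stricter boundary for teams worried about the payer account.
    if require_cross_org and source.organization == backup.organization:
        gaps.append("security isolation: backup is in the same organization")
    return gaps
```

For example, `isolation_gaps(Location("o-1", "111111111111", "us-east-1"), Location("o-1", "222222222222", "us-east-1"))` reports only the missing fault isolation boundary, because the backup crosses accounts but not regions.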

speaker-0 (30:10.83)
We're gonna have to get into that, so remind me if I forget. But on the cloud account thing, one of the challenges here is just even figuring out how. There's a question of, do I make it the same account, do I make it a separate one? How do I even set up that pipeline? What are the knobs to turn? This is not straightforward or out of the box in any way. And one of the challenges that I keep on seeing is, do you encrypt your backups? Or are they open to the public, you know, in plain text?

speaker-1 (30:37.078)
It's a trick question, Warren. What is that?

speaker-0 (30:38.936)
Well, here's the thing though, where do you store the keys to do the encryption? 'Cause in AWS they suggest you use KMS in some regard, but if KMS is driven from a management account and you go cross-organization, then you need to do this weird trick in order to have it be encrypted in the target account and not where it came from.

speaker-1 (30:59.224)
Yeah, so, talking to the audience here, Warren sent me this big thing that says, Don't shill while you're on the podcast. I'm like, All right, I won't shill. And yet he lobs these softballs at me. Like, it turns out RPO solves this for you, by the way. But I will say that at the DevOps Days talks I teach people how to do it. All right. And it's not easy, but it's doable. And basically you have to create your backup with the same key that the data is encrypted with. Then you actually re-encrypt the backup locally. You can move it to another region at that point,

with a key that you've created in the recovery account. And you could do all this in AWS with cross-account IAM permissions, et cetera. So create the key in the recovery account. Use that to re-encrypt your backup and then copy your backup to the recovery environment. And now you have a consistent backup and key in the recovery account. Can you do that? Absolutely. We did it. But you know, do you want to do it or do you want to buy it? Tell me if I'm shilling too much. I mean, I'm trying not to.

speaker-0 (31:54.408)
No, no, it's fine, especially on that one, honestly. It's not the sort of thing which is straightforward to do, and this is after knowing that you wanna do it, seeing multiple companies thinking about how to even implement this in a way that makes sense and not getting

speaker-1 (32:09.206)
Let me add another one. I know I'm cutting you off, but I'm just so excited about this, which is: how about things that don't have backup capability? How about Secrets Manager secrets, SSM parameters? I'll add Kubernetes manifests, although they added backup for that. It doesn't do the translations for you. But let's stick with Secrets Manager secrets. How do you do that? There's no way. So the way I show folks how to implement it, and the way RPO implemented it, is it installs Lambdas in both accounts, and the Lambda can read the secret, but

it can't pass the secret out. It doesn't have permission to do that. So then it creates a key on the recovery side, encrypts it, and copies it into a bucket on the recovery side. So it's just more paper cut after paper cut. Yeah.

speaker-0 (32:44.162)
Yeah. We have one with our own company, because we're managing private keys for our customers, where we're encrypting the private key with a pass key and that's being encrypted with KMS. And it's like, well, even if you back up, you can't decrypt the pass key because you don't have access to the KMS key, because it's in the account that's been compromised. You gotta re-encrypt. Yeah. And it's just this nightmare of, well, crud, if we actually switch AWS accounts here, you can't even come back up because all the data in your database is

speaker-1 (33:03.64)
Yeah.

speaker-0 (33:13.794)
basically client-side encrypted before being sent over. So you actually have to make sure you have access to the decryption keys. And I think this is like another whole step on top of even managing your backups, which is just a pit of failure.

speaker-1 (33:26.606)
There's just tons of these. And then AWS Backup recently, for several resource types, launched the ability to do cross-region, cross-account backup. I mean, they always had cross-region, they always had cross-account, but they were two discrete operations, and they launched it as a single operation. So I was playing with it, and it'll happily let you kick off the recovery, cross-region, cross-account, and only after it's done recovering does it say, Hey, this key doesn't work here. You failed. Like, could you have told me that up front, please?
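The "tell me up front" check can be sketched as a preflight on the KMS key ARN before any restore starts. This assumes the standard KMS key ARN layout, `arn:<partition>:kms:<region>:<account-id>:key/<key-id>`, and it is deliberately conservative: it follows the approach described earlier, where the key is created in the recovery account. Key policies, grants, and multi-Region keys can make other arrangements valid. `parse_kms_arn` and `preflight_restore` are hypothetical helper names.

```python
from typing import Dict, List


def parse_kms_arn(arn: str) -> Dict[str, str]:
    """Split a KMS key ARN into region, account, and resource parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kms":
        raise ValueError(f"not a KMS ARN: {arn!r}")
    return {"region": parts[3], "account": parts[4], "resource": parts[5]}


def preflight_restore(key_arn: str, target_account: str,
                      target_region: str) -> List[str]:
    """List reasons the backup's key may be unusable in the restore target.

    Conservative rule (an assumption, not an AWS requirement): the key
    should live in the recovery account and the recovery region.
    """
    key = parse_kms_arn(key_arn)
    problems = []
    if key["account"] != target_account:
        problems.append(
            f"key is owned by account {key['account']}, not {target_account}"
        )
    if key["region"] != target_region:
        problems.append(
            f"key is in region {key['region']}, not {target_region}"
        )
    return problems
```

An empty list means the restore can proceed under this rule; anything else is reported before the long-running recovery is kicked off rather than after it fails.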

speaker-0 (33:52.214)
Yeah, I do think that there's a more experienced narrative that has to happen here with the validation of restoring from backup from the beginning, like making sure things are set up in the right accounts and in the right way. And it's just not pl

speaker-1 (34:07.032)
Yeah, no, it's not. And then I'm at conferences, I'm standing at a booth, you know, with the wheel of misfortune, by the way, which contains all these actual disasters that happened, and then one slot is a quiet day, you win a gift certificate, but everybody learns about failure that way. And once in a while, it's like, we have infrastructure as code, we have backups, we're good. And sometimes you are. I mean, literally the Amazon trucker app, I think that's what they did. They were good, mostly serverless,

just a little bit of state stored in DynamoDB, super easy to replicate. All right. They were good. All right. But most people are not good, right? They're gonna hit all these paper cuts. They're gonna hit all these issues. Let's talk about enterprise architecture, Warren. Enterprise architecture is so amazingly complex and crazy. Like, I worked with a customer that did a simple migration to the cloud. They migrated their servers to static servers, which is unreliable, by the way. Does anybody know the SLA on a single EC2 server? It's like ninety-eight point five or something. So

keep that in mind. That's a couple days a year, I think. So on-prem servers to static servers and on-prem databases to RDS instances, and each static server talked to an RDS database, and they had like 30 of these. They were using a consultancy, I won't name which one, and there's nothing wrong with this, but it just added complexity. They used a single VPC shared across accounts using a RAM share. So RAM share is a way in AWS to share resources across accounts. And they decided that was the way to go. I'm like, okay. You just added a whole.

I mean, I don't know, you're achieving some simplicity here, I guess, with your VPC management, but in terms of being able to recover that, that makes it more complex. And it was kind of funny to see. It was an interesting architecture.
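The availability figure is recalled from memory in the conversation ("or something"), so it is worth doing the arithmetic for a few values rather than trusting any one number. A small sketch; the percentages below are illustrative inputs, not quoted SLA terms, and the current published SLA should be checked directly.

```python
def downtime_days_per_year(availability_percent: float) -> float:
    """Expected downtime in days over a 365-day year at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * 365


# 98.5% allows about 5.5 days of downtime a year;
# 99.5% allows about 1.8 days, closer to "a couple days a year".
for pct in (98.5, 99.5, 99.99):
    print(f"{pct}% -> {downtime_days_per_year(pct):.2f} days/year")
```

Either way the point stands: a single static instance leaves days, not minutes, of allowable downtime per year.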

speaker-0 (35:41.954)
I like the aspect, especially when consultants come in and do recommend, like, lift and shift is fine to get to the cloud. But even with the EC2-to-RDS connection, you are just multiplying your spend by so much doing that. And spreading that over time, and letting each individual organization or team manage their own technology and shift to the cloud in the way that they want, and isolated,

not only has benefits from a pricing consideration, but reliability is there too, because they actually understand what it is that they're deploying and where things are different from basically being on-prem before that.

speaker-1 (36:16.736)
Yeah, I think all folks in those situations should be thinking about modernization. I mean, it's okay to just want to get to the cloud and get a project done and get a win. But then you should be thinking about modernization. All right. And I actually have a slide about that I give sometimes at talks, where I start with, here's a server and here's an RDS. Let's say they go crazy and put a load balancer in front of it and go to an auto scaling group. And let's say they get completely wild and decide to use containerized workloads and serverless data, you know, so

yeah, you could take it step by step and do that evolution. I don't know, in this case, whether this customer will. The other complexity at enterprises I see is centralized security. I know why they do it. It's a simplifier for managing security, where all traffic goes through a single VPC in a single account, and it's probably got some third-party Palo Altos or some other devices attached to it, and the transit gateway is defined there, and it's also where the Direct Connect back to on-prem lives, and everything goes through there. But from a recovery point of view,

it's not trivial, right? I mean, you could take two approaches. Either I'm gonna recover everything, I'm gonna recover that centralized account and all the client accounts, or I'm gonna pre-set up that centralized security, and then when I fail over it has to plumb itself into it.

speaker-0 (37:31.672)
But I get it, it's complicated. Yeah. Yeah, for sure. But I do get it, because the organizations evolved from having a whole team of database administrators, and then there's a level of software developers who are not allowed to touch any code that goes to production under any circumstance or interact with the database directly. And then there's release engineering. And when they lift and shift to the cloud, they didn't just take their technology and stick it there. They took their organizational dynamics and put them there as well.

And yeah, I think this is sort of the trouble, because the cloud isn't necessarily set up, not only with the technology to work, but with the same organizational communication patterns, if we go with the Conway's Law perspective here. And I think it just ends up causing a lot of problems in every regard.

speaker-1 (38:14.318)
I just want to be super clear though, both those enterprise architectures I described are good. There's nothing actually fundamentally wrong with them. They serve a purpose. What I'm saying is that sometimes recovering them is complex.

speaker-0 (38:31.468)
Well, does it make it easier on you? And I say you with an asterisk here, because it makes it easier on, I'd say, the security team, whoever is incentivized with having that domain set up that way, rather than what makes sense for the business holistically. Because I think this is one of those areas where it's much easier to manage and ensure a single RDS account, or a cluster in a single account, has the right backup configured and is secure and has only the right security groups attached,

so that your one monolithic Kubernetes cluster with the hundreds of thousands of containers running in it is the only thing connecting with it, rather than having to somehow figure out a real organizational strategy to convey to individual teams how to actually secure the communication between their containers and the production infrastructure database.

speaker-1 (39:18.69)
It's interesting. Yeah, there's two models. There's that model, and it's perfectly legitimate. And then there's the Amazon model. In the Amazon model, everybody owns their own stuff, and there's policies and there's procedures and there is some automated scanning and enforcement. But, you know, you do have to be educated as an engineer. There's no architects, strangely enough. Even though there's a whole solutions architects group at AWS, Amazon has no architects,

right, except for SAs, the few that exist, like myself at the time. And it's up to the teams. It's up to the senior developer or principal developer to understand this stuff, and every developer actually, but principal developers and senior developers make sure their entire teams understand this stuff and are adhering to policies and doing things the right way. I mean, Amazon tries to make it easier. Like I said, there's some automation in terms of setting up your infrastructure, there's automation in terms of scanning the infrastructure, but ultimately it's parceled-out team ownership. I mean,

they don't call it DevOps at Amazon, but the leadership principles at Amazon, when they're being followed right, are pure DevOps. I was about to say there's no operations team. They've kind of muddied that a little. Some teams now have operations teams. But most teams traditionally have not had an operations team. You build it, you run it.

speaker-0 (40:21.506)
Yeah, no, I totally get that. So there is something here though, which may be worth jumping into, which is: do you advocate for one of these approaches over the other one? I do see your perspective that there's merits to both of them, which is like, there is a simplicity in treating everything the same.

But then there's also a perspective that not everything does need to be treated the same, because maybe there's only a particular part of your business data or your product, your service, or whatever database that does need to be replicated or secured cross-region or across AWS organizations. And you're forcing that requirement across your entire infrastructure, where that adds a lot of complexity on, say, the implementation side.

speaker-1 (41:05.058)
Yeah. I mean, there is some centralized infrastructure. Remember I was talking about how do you connect your native AWS accounts to the big blob? Well, that was centralized infrastructure. We don't want everybody building their own private links. That would be insane, right? But I don't know. I kind of tend towards the Amazon you build it, you run it, you own it model, but I don't bring data. I mean, I just like it. I've worked with it. I've been a dev at Amazon. I was actually on the Prime Video team. It wasn't called Prime Video at the time. You know, shout out to anybody who remembers Amazon Unbox.

But, you know, I helped build that product. So I mean, I like that model. I think it was the right amount of ownership. I shouldn't say I never felt burdened by it. There were times where, like, okay, I want to go use this janky little internal key-value system that Amazon created, and there's nobody that wants to help me because everybody's trying to use it. And that was kind of annoying. But yeah, other than that, it was pretty good.

speaker-0 (41:54.882)
Wait, was Prime Video the one behind the blog post a while ago that said, like, we made a mistake by going to microservices and we re-modeled it?

speaker-1 (42:05.262)
Inflammatory way to put it though. Like, really, if you read what it says, we went from many microservices to fewer microservices. Like, there was no monolith there.

speaker-0 (42:12.478)
I like that perspective. Yeah. Well, I actually think they made a mistake by even saying the word microservice in there. It's like, we made a mistake with our architecture, we picked the wrong thing, and now we're making the changes to rectify it.

speaker-1 (42:26.51)
Wrong boundary. Yeah, exactly. Like it was so inflammatory, and they went with that title. I guess it gets you clicks, but it was crazy.

speaker-0 (42:33.41)
Lots of people then were like, Well, look, even AWS is saying that, or Amazon is saying that, so you know serverless isn't great. You know, they made a mistake with Lambda. And I'm like, please just actually reread the article, because if you look at it, it's like, they did the right thing. They said, here's our hypothesis on where we're spending money and where performance is, and here's where our current problems are. And if we switch to this other architecture paradigm, we will get these benefits. And then they did the switch and said,

Look, we were right. These things were a specific trade-off and not like, I'm principled and I think that monolith is better, so we're gonna do that. And now we switched over and of course I'm right, so everything works out.

speaker-1 (43:14.062)
Like so many articles today, the content's good, but the headline's misleading. But it also points out another thing about Amazon and AWS teams, which is they watch their budget. Each team is responsible for their spend, which I think is really cool. When I was, again, an SA working for Amazon, traveling literally around the world to Tokyo and Luxembourg, meeting with teams and saying, Hey, get into these AWS accounts, start using these, start using Lambda and API Gateway. And it was always, you know, the technology discussion was important, but how much is that gonna cost me? You know, if you have a high

TPS set of EC2 instances that are always taking traffic, API Gateway with Lambda is gonna cost you a bundle. So, all right, maybe you're saving some operational load, but that might not be the right thing. But if you've got something that's asynchronous and reading events, then yeah, Lambda's gonna be the win. So they always look at the cost. And I always like that part of the conversation.

speaker-0 (44:03.304)
It's also so much cheaper if you just switch to Lambda@Edge and use edge compute and, instead of API Gateway, put your whole service within a Lambda@Edge function with CloudFront on top. And so instead of paying for API Gateway, you're paying for CloudFront, which has like a better optimized cost strategy.

speaker-1 (44:20.482)
I think you're, are you trying to rage-bait someone out there? I don't know. Entire app at Lambda@Edge. Interesting. So

speaker-0 (44:23.222)
We have a couple of different architectures for our own product, because we're doing login and access control as a service. There are aspects which we want as close to the users as possible, and ones where we want a little bit more centralized and

speaker-1 (44:40.856)
That's what it's made for, yeah. But you implied you were putting everything out there.

speaker-0 (44:44.648)
Well, I will say that it does make the resiliency equation much simpler when you're already prepared to deal with edge compute and you have to figure out how to provide the data store for that and you don't have a single point of failure from a geolocation aspect. But yeah, for sure there are complexities there. And these things aren't long-running, and usually they're cacheable. Like, is the user logged in is much faster than having to

actually go through the login process every single request that comes in from an end user. So I think

speaker-0 (45:24.558)
No, they're actually not. So the Lambda@Edge functions are actually not running in a PoP. They're actually running in a specific region. So in AWS's infinite wisdom, what you do is, in order to actually run Lambda@Edge, you deploy a regular Lambda to the us-east-1 region. And then you give a magic service in AWS access to that Lambda function, which then replicates that Lambda function to every single first-class region around the world.

speaker-1 (45:55.968)
Interesting.

speaker-0 (45:58.328)
And so the logs are all in different regions, which makes it impossible to actually find where issues happen. And so we have a whole complex log aggregation strategy where we pull stuff from all of CloudFront's logging regions and merge them together, so when a customer makes a complaint that this user failed to do something, we can actually find what happened there, rather than clicking in the console for all of, like, twenty-four or however many regions there are now

to figure out which actual region to run those queries in.
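The merge step of that kind of log aggregation can be sketched with the standard library. This is an illustration, not the speaker's actual pipeline: it assumes each region's log lines have already been fetched, are sorted by time, and carry ISO-8601 UTC timestamps in one consistent format so that string order matches time order. `merge_region_logs` and `find_user` are hypothetical names.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

LogLine = Tuple[str, str]            # (ISO-8601 UTC timestamp, message)
MergedLine = Tuple[str, str, str]    # (timestamp, region, message)


def merge_region_logs(logs_by_region: Dict[str, List[LogLine]]) -> Iterator[MergedLine]:
    """Merge per-region, time-sorted log lines into one time-ordered stream."""
    streams = (
        [(ts, region, msg) for ts, msg in lines]
        for region, lines in logs_by_region.items()
    )
    # heapq.merge lazily interleaves already-sorted inputs.
    return heapq.merge(*streams)


def find_user(logs_by_region: Dict[str, List[LogLine]], user_id: str) -> List[MergedLine]:
    """Return every merged line mentioning a user, regardless of region."""
    return [line for line in merge_region_logs(logs_by_region) if user_id in line[2]]
```

With the streams merged, a support query becomes a single search over one timeline instead of a region-by-region hunt in the console.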

speaker-1 (46:28.304)
Okay. That does give you a lot of resiliency. By the way, when you were talking about Conway's Law and organizational friction, there was one story I wanted to mention. I was working with a customer on their disaster recovery. This is a big company that's providing a consumer service that everybody would recognize and most people think of as a tech company, but I learned that internally they're not organized like a tech company. Internally they're organized like a bank or something, right? So I'm working with this one team on recovering their EKS workload, ECS workload, and it's got all kinds of connections and all kinds of databases, et cetera.

And it had a dependency on an S3 bucket in a different account, right? And so, you know, we're using RPO, and RPO will recognize that. Hey, there's an S3 bucket in another account, you need to actually onboard that too into RPO so we can replicate that for you to your recovery region, and we'll reconnect you with a new recovered S3 bucket. And their response was, well, that's another team. They're out of scope. I'm like, but you kind of need that for this functionality. Well, we don't care whether they recover. We just care whether we recover. But if they don't recover, you're not

quite recovering. They're like, we don't care. I'm like, wow, that is organizational poison. I couldn't believe I got that response.

speaker-0 (47:32.322)
I'm really surprised too. That seems like quite some discourse, because it's like, why would you set the company up to go down the process of implementing any sort of resiliency without then being prepared to actually do it?

speaker-1 (47:43.726)
Well, it was another team, right? And they owned the resiliency mandate for their stuff, but that team was a different team and they could do their resiliency another way. I mean, you know, it was a proof of concept, so maybe to give them some benefit of the doubt, they said just for the proof of concept we could leave them out. But the reaction was very straight. It wasn't said like that. Like, okay, for the proof of concept, it might be too difficult to work with that. They literally said that their resiliency is their problem, not our problem. Like, but it is kind of your problem. You need that S3 bucket to do some things.

speaker-0 (48:11.726)
I like the perspective, because you don't try to fill in the gaps for failures in other teams. You want to elevate every team to be doing the right things and not be a crutch for them. On the opposite side, I do also understand the aspect that it comes out of a different budget.

speaker-1 (48:30.914)
Comes out of a different budget, different leader. Like, it was funny.

speaker-0 (48:35.214)
Yeah, it's not one of our KPIs, actually. So I did see a report, I think it was from a couple of years ago now, and you brought up malicious actors. And I think it's something that we can't completely ignore when we're talking about resiliency, because while it does seem like security isn't necessarily part of the reliability angle, I do find that if you have threat actors that are coming and taking down your software, it's not actually up and usable for your paying

speaker-1 (49:01.856)
You're right. No, it absolutely is. It absolutely is, for the reasons you said. It is part of reliability.

speaker-0 (49:06.534)
Yeah, but companies are no longer paying ransomware actors that threaten to release their data publicly.

speaker-1 (49:11.778)
No longer paying them. Yeah. They're just what? What are they doing?

speaker-0 (49:15.602)
Well, I do think it sort of makes sense. Basically, if you pay for ransomware where they're just threatening to leak the data publicly, they can still do it anyway. So

speaker-1 (49:28.162)
Yeah, there's no guarantees. But there's also your availability, like you just said. Like, your data is encrypted, you can't use it. Now obviously I'm gonna advocate for having point-in-time recovery of your data and infrastructure so that you don't have to pay them. And also scanning of any of your static compute so you can find the malware that they injected to encrypt your data. And these are all things I'm happy to help folks with. But

Yeah, to not have that and not pay is kind of interesting, like, then you're down, right? So, like, Change Healthcare. Let's talk about Change Healthcare. What was that? Twenty twenty-four, right? Major breach, major encryption. And we got called into one of their competitors, and they actually ended up becoming a customer of ours.

And what the competitor told us was really fascinating. So you've heard about Change Healthcare, that it was a major data breach and they were down for X many days and it cost them, you know, hundreds of millions of dollars or whatever. I think you can just say a billion dollars. You heard that. The one thing you haven't heard, and that their competitor told us, is that Change Healthcare eventually got their availability back, right? They're up and running. And just so folks know, Change Healthcare was sort of sitting between doctors' offices and hospitals and insurers and other payment systems, right? They're the middleman, right? And once they came up and they were back up and running,

nobody connected to them. They were tainted, right? Like, there was no guarantee that they'd found all the malware or that they'd rooted out all the ransomware. And nobody wanted to connect to them because they had no way of assuring the public, or not the public, but their customers, that they were safe. And so when we got called into this competitor, one of the things this competitor does is they use our product, they use RPO to create recovery points. And then they actually have a third party, Mandiant in this case, which I think is part of Google, certify it.

And kind of stamp it out. This is the gold one. So now they actually put on their website, you know, that we'll be back up in five days or less and we're certified clean when we come back.

speaker-0 (51:16.619)
Yeah, Mandiant's the one that discovers all the zero days in all of the products outside of Google as well. They're the ones that are always doing the reporting. Yeah.

speaker-1 (51:25.144)
So I thought that was just, you know, an interesting story. It goes beyond the data breach, beyond the downtime, to trust, right? They're back up and running. You know, they're saying, Hey, connect to us, and people were like, Whoa, wait a second, you know, I don't know, not so sure about that.

speaker-0 (51:38.69)
That's really interesting, because the data says that you can get breached as many times as you want. And every time you get breached, the amount of money that you make actually goes up. You have more customers because of the mere-exposure effect. You just hear the name and it gets improved. Yeah, I guess. Right, right, exactly. However, being down is a whole other story. Like, forget about security. If you can't actually service customers,

speaker-1 (51:56.77)
But, no such thing as bad publicity.

speaker-0 (52:07.542)
It's always going to put that idea in their head that there can be a problem again. And at some point there is going to be an issue, you're going to have to deal with being breached. You're going to have to deal with downtime and explaining to customers what's going on. We did a survey of our own customers and they're like, Yeah, if you're down for more than like an hour, really 24 hours, that's it. We are already looking for a new provider and we're telling everyone about it. And I'm like, that's totally fair in some regard.

But you know, you are you're making payments for the for the providers. Like you're responsible for actually paying out to the providers, which directly impacts the being able to keep the lights on at hospital organizations and clinics, which means actually providing care. Like that's hospital organizations are very sensitive to stuff like that.

speaker-1 (52:56.8)
A lot of customers left Change and went to this competitor I'm talking about because of that.

speaker-0 (53:02.764)
No, it makes a lot of sense. I do think that we probably can't end this conversation without bringing up Anthropic's Mythos, which I did chat about with our guests on the previous episode about tech debt, and I did mention that it's more focused on security. But the question I really want to put out there is this: there are some malicious threat actors out there. I think for sure that's happening, state sponsored or not, it's not really a question. And they cause impacts to our services.

Collectively. Now, regardless of whether or not there's a malevolent superpower or LLM that's actually doing this, that's going to end all security and all software anywhere, we still have to protect our stuff. And I guess, has the conversation for you changed at all recently as far as what it actually means to implement a reliable or resilient solution?

speaker-1 (53:55.47)
That hasn't really entered the conversation yet. Not with the customers we're talking to. I guess eventually it's gonna take some high profile events and then it'll enter the conversation. Honestly, even when I was at AWS as reliability lead, you know, it took an outage in 2020 and another in 2021 to get everybody internally all excited about reliability. Before that, you know, people were like, yeah, I guess that's important. So yeah, it's interesting. But what I've seen with AI

in general is people are still in the enamored phase. Like you go to any conference and you've got multiple tracks, and the one with AI in the title is just gonna get a big crowd. And people just have this FOMO, fear of missing out or something. They feel like if they don't go learn everything they can, their jobs are in peril. Well, yeah, combine that with a really crappy tech job economy right now. People are just totally enamored, and that's I guess a little scary, right? That people are not seeing it as both a benefit and a risk.

I always want people to see the risks, and I don't think people are seeing them yet.

speaker-0 (54:54.54)
So I think a huge part of this is that what's happened to teenagers with TikTok is now happening to software engineers and knowledge workers with AI companies. You see basically the whole product suite is orchestrated in a way to get you addicted. And they're using all these dark patterns. And if you're not using it, you see people who are jumping up and shouting zealous evangelical nonsense

to try to convince people to come on board, because there's something in it for them. And you know, it's interesting you say that it's not really making any headway, any splashes yet, because I'm really plugged into the security domain and there are already people talking about, like, is finding vulnerabilities by humans over? Are all our jobs fundamentally changing? And there's like two answers. The first one is no, Mythos isn't going to change anything, really, but yes, all our jobs are changing.

And I think that's sort of a mature response to it. And I think we're not far out from the point where every piece of vulnerable software out there will be exploited, and not in five years, ten years, thirty years, whatever, but in the next couple of years, irrespective of whatever technology we think is available. And I think that has a real impact on the conversations we're going to have about how to even build secure but reliable software.

speaker-1 (56:14.168)
Yeah, and then the other thing is just seeing all the AIs enter into the productivity streams. I've certainly played with Claude Code quite a bit, and I know all the developers here use it and it's doing a lot of great stuff, but it still needs supervision, still needs expertise to see what it did. Outside of the tech realm, I know a lot of people are using it to generate slideware. I kind of joke that my number one skill set is that I can make good slides. Aside from anything else, I'm good at making slides.

And I see the slides it can make and it's immediately, you know, spottable that this was made by Claude. It's garbage. Like, I mean, it's wordy, it's obtuse, it's, you know... and people are just going with it now.

speaker-0 (56:54.816)
So we actually have a whole episode dedicated to working with some of the LLMs and vibe coding, so I'll link that in the description. But I will say the thing that you reminded me of is that the consensus is that LLMs that produce output in my area of specialty, that's garbage. But in everyone else's area of specialization, I get it. You know, I would totally...

speaker-1 (57:18.732)
Really well, yeah. That's funny. So that just confirms that my skill set is making slides and not coding. So I'm actually the perfect customer for Claude Code because I'm not a developer, even though I was a developer, but I wasn't a good developer. I think I'm better at big picture stuff, architecture, how things fit together. So I understand what a bash script can do. I understand what a Python script can do. But when it comes down to actually creating it and writing the syntax, not so much. So I can actually give pretty tight guidance

to something like Claude Code and tell it what technologies I want it to use and how to use them. And it unleashes my creativity. Like I wanna automate something. Now I can automate it and I actually get something good out of it.

speaker-0 (57:55.822)
I think maybe at this point we're at a good spot to switch over to picks. So Seth, what did you bring for the audience today?

speaker-1 (58:03.608)
So yeah, you mentioned you want to hear my picks. So a few months ago, I started getting fed all these videos from the algorithm on YouTube of people opening locks without keys. So I decided my pick is literally a pick. I thought that was too cutesy, right? So like I literally bought lock picks and took up lock picking as a hobby. And so my actual pick is...

Well, what's the name of the company? Their initials are CI... Covert Instruments, that's it. And they have the FNG bundle, which stands for... am I allowed to curse on here? "Freaking new guy," let's just say that. And that's how I got started, and it comes with some picks and it comes with some practice locks, and so that's my pick. It's a lot of fun. It's literally, if you want a fidget...

Like if you need a fidget device, it's a great fidget device. I could be on a conference call just picking a lock and just opening it and keep picking it over and over again.

speaker-0 (59:00.224)
Are you a prepper and you're prepping for, like, going out into the world and, you know, there's a locked door, there's some food behind there, I'm gonna have to get in shortly and this is gonna save your life?

speaker-1 (59:12.686)
There's actually ethics to locksport. Like if you go on any locksport forum, one of the things is don't ever pick a lock that you don't own. So we're not supposed to do that.

speaker-0 (59:23.126)
Are you part of any... like, do you do the club or social thing as well? Like...

speaker-1 (59:27.938)
I mean, no, not yet. I'm just on some Reddit subs, just asking questions. And there actually is a belt system. Okay. Like there's white belt, yellow belt. I'm a yellow belt, so like I don't know how I got that, but I'm trying to get to my orange belt, and the orange belt lock is really my blocker here, so I really need to up my game to... Yeah, exactly. Yeah, exactly.

speaker-0 (59:44.686)
It's like karate, martial arts. I do know there's a small overlap with this community, but there are DEF CON meetups where they often have a lock picking session that you can join. The lock picking set, is that your recommendation, or is there like a particular...

speaker-1 (59:58.488)
So

speaker-1 (01:00:04.94)
That's why I mentioned it by name, so I know you need a... I mean, there's lots of reputable companies out there. Covert Instruments was the one that was feeding me most of the videos, so I thought I'd give them some of my business.

speaker-0 (01:00:16.014)
Okay. Well, that will be in the description for the episode. There'll be a link there. All right. Okay. I love it. You know, honestly it's this thing that I always wanted to get into, and I think I bought myself a lockpick set that fit inside of a credit card, basically. And I actually used that skill exactly one time, when I was locked in a room because the door handle fell off. And I was like, you know what? I don't have my lock picking set, but I have some paper clips, and the practice did help me

get out of the room, which was its own experience. So I do think it is a little bit of a fun pastime. It's much better than buying like a fidget spinner. You learn a real skill.

speaker-1 (01:00:54.604)
Great fidget device. Yeah, exactly. And I've learned that most padlocks are junk. Like I can open them and I'm not good. So...

speaker-0 (01:01:02.848)
There's a whole set of YouTube videos on basically physical security reliability, where there's a company that will go out and actually break into your data center, and they share a lot about actual physical security and, like, the locks that are being used and how they get into buildings. And picking locks is very low down on the list of skills that they need to make that happen.

speaker-1 (01:01:25.624)
There was another funny disaster that's on the wheel of misfortune. It was at Facebook a couple of years ago, where they had an outage and the outage affected their security system, so they couldn't get into their data center.

speaker-0 (01:01:33.888)
Yeah, I actually think we brought that up, and I don't think we need to do it again, but the short of it was... right, so you were actually at Facebook at the time?

speaker-1 (01:01:45.026)
No, no, no, no. It's just one of the wedges on the wheel of misfortune that we see.

speaker-0 (01:01:49.31)
That's all. Yeah, so I love this because we were talking about this in the episode on isolation, I think, where your critical systems also sort of depend on themselves. And in this case, Facebook messed up their BGP routing, which was required in order to validate user identity. So they couldn't fix the BGP routing because they had to get into the data center to reset it, but they couldn't get into the data center because the doors were locked and required physical identity checks, which of course went through their systems, etc. You know, a loop.

Loop guaranteed. Okay.

speaker-1 (01:02:19.35)
So next time you're at reInvent or a summit, you need to come by and spin the wheel, because you'll spin the wheel and you'll say, "I know that one," and you'll just rattle off what it is. I'll make sure if I'm there you get a gift card, whether you get the quiet day or not, as long as you can identify the disaster.

speaker-0 (01:02:33.038)
Yeah, well, anyone who's listening to this episode will know the secret answer now. So I appreciate the pick, Seth. So I brought maybe something lame. Last episode we were talking about changing the whole mindset of how we do development and architecture work, and not just building the product, but also thinking about how we build and how we read code. And what went along with that was a book called Rewilding Software Engineering. It's all about moldable development. And yeah, it's on Medium of all places, so totally free.

And we had the author on, and I think this is really interesting because it embodies the challenges with approaches we may have historically used to get to the cloud, like lift and shift. And if you think about that, you may say, well, that's completely ignoring cloud topology when planning long-lived services. You're just taking what you have and going there. And I think similar challenges also exist for backups. You don't just straight duplicate all of your infrastructure. You're really thinking about what

a cohesive strategy should look like that solves the... I mean, we talked about business needs a lot. And I think a lot of companies use VM snapshots as their strategy, which I'm sure you'll say, you know, can work in some regard. But I think a lot of companies that do it are not really thinking about whether they need to actually back up all of that unnecessary empty block storage they have. Yeah.

speaker-1 (01:03:51.982)
Absolutely. And how things fit together. There's a ton of configuration hidden everywhere that just doesn't work when you bring it up in a different region and different account.

speaker-0 (01:04:00.226)
Yeah, and I think moldable development really speaks to how we build software. Like how do we actually think about that process? And not the process of do we have stand-ups and is there a testing cycle, but what is it that we're building, and are those things reusable? And I really like the approach that it takes in the examples that they bring up, because it really gets to the point that we're probably doing it all wrong and we haven't really spent any time in the last fifty or so years

really rethinking this at any point where we could have. And I think this is one of the secret things that some companies are doing right, but don't even realize they're doing. And a lot of other companies aren't really paying attention to it.

speaker-1 (01:04:37.89)
Wow, okay, yeah, definitely have to check that out.

speaker-0 (01:04:39.896)
So thank you, Seth, for joining us today. I'm glad you enjoyed it. I think this is gonna turn out to be a great episode. And here's a reminder for everyone listening, I guess, to subscribe to Adventures in DevOps. You have no idea what positive impact that has. The more people that listen, the better guests that we can have on. Like Seth, with a lot of veteran years at hyperscalers.

speaker-1 (01:04:44.09)
It's been my pleasure. It's been fun.

speaker-1 (01:05:03.372)
Yeah, awesome. Really, really glad to be here.

speaker-0 (01:05:06.063)
And I hope we can see everyone back again next week.