Warren (00:07) Welcome back to another episode of Adventures in DevOps. Before we get to the show, I just want to introduce today's sponsor, which is Rootly AI. I really appreciate the team for both being a great guest on the show in a previous episode, as well as sponsoring this week's episode. Just one of the interesting things going on with their AI Labs org is their upcoming release of the On-Call Burnout Detector. It's a free open source tool for detecting early signs of overwork in incident responders and engineers. Besides that, though, I don't want to spoil anything more about it, so check it out, and as always, there'll be a link in the description. And now back to the show. I met our guest at a recent conference, and after hearing him tell story after story, I just couldn't wait to bring him onto this episode. So I want to welcome today's guest, Paul Conroy, CTO at Square One.

Paul (00:55) Thanks, Warren. Thanks for having me. It's good to be here.

Warren (00:59) You know, it's a web agency, if I've got that right. I don't know if you use different terminology for what Square One actually does.

Paul (01:05) Yeah, we're a full digital agency, actually. We do web, we do mobile products, and we do consulting as well. We do a lot of work around Stripe payments in particular. So a little bit of everything. A little bit more than just the websites we started out with, at this stage.

Warren (01:18) You know, usually I'm hesitant to bring on someone from a development agency. They usually just want to push their services or try to get their stuff out there, looking for additional customers. But when I was at the Build Stuff conference in Vilnius, which was honestly a great conference, just hearing you speak, I thought: great guest, can't wait to have him on the show, these are great stories he has to share. And most specifically, it was the practical solutions you had to, basically, malicious users. In today's world there's just a lot of this, especially ramping up with, I hate to say AI, but bots galore everywhere, and having to deal with extra ancillary users who aren't adding any value to your platform.

Paul (02:04) Yeah, and there's a real cost to these kinds of users. I guess the AI issue is a big one at the moment. I've been lucky enough that at Square One we work with a lot of large online publishers, news and sports sites. But in the past, I've also worked at large classified websites, which would see large-scale attempts from, as you say, malicious actors, for one reason or another. And we had a story a few years ago, working at a property portal in Ireland. Ireland's a bit of a funny market for real estate in that it's a national obsession. Property prices have gone through the roof a few times, and there have been big crashes. But generally speaking, people in Ireland just love going on property websites. It's not just buying and renting properties. It's: what does the kitchen of a million euro house look like? Or: my neighbour down the road is selling up, I wonder what way they've decorated their bedroom. I was talking to a friend about this at the event in Vilnius, and they said, so you're basically saying that the Irish are a nation of voyeurs. And I said, I wouldn't say that exactly, but I wouldn't not say it.
So anyway, property in Ireland: huge business, because you have these very popular websites, with people looking at all of these photos as much as doing the day-to-day business of property. And at the time I was working at a property portal, there were sort of two of us, a duopoly. We were both very popular. New entrants came and went all of the time. And the market at the time was a bit odd, because if you were an estate agent, there wasn't a central portal that would syndicate your listings around. You basically had to make a sunk-cost effort for every site you engaged with. So anyone new who came along had a bit of a chicken-and-egg problem: no agents spending time on it because there are no users, and no users because there's no content. So a lot of them would fade away. Then one day we noticed an interesting new competitor had come into the market. They were from abroad, an existing property portal, and they had pretty deep pockets. And the way they launched was with this viral campaign, before viral campaigns were even really a thing. They had a little video made up, like the Palm Islands in Dubai, the artificial residences: they were going to build a Shamrock Island in Dublin Bay. It was going to have high-rise living, fast metros, the world's first giraffe-only zoo, all of this sort of stuff. And it was presented as a legitimate proposal, with coverage on the national news, all of that. They turned all of this attention into, basically, a launch announcement for their website. So this was a serious new competitor. And the way they solved the chicken-and-egg problem was they came and scraped all of our listings and injected them onto their website. The idea was: you're searching for property, you end up on their website, and suddenly they have all of our listings in their search results. When you click through one, it comes back to our listing, and we ultimately deliver the lead, the inquiry or the phone call or whatever, to the advertiser, which is what we're being paid for. But their game plan was: if they're popular enough over time, they can go to the agents and say, look, we have all of this traffic, cut these guys out of the loop, come work with us. So it was a big threat for us. And when someone comes along and is maliciously working against your system in this way, you've a couple of approaches, and one of them is to set the lawyers on them, send off legal letters, please-leave-us-alone sort of stuff. The problem we had at the time was that the case law wasn't really settled around screen scraping; it was very early on. And also we were a very young company. We'd become the biggest one in the market, but a lot of us were there in our first job, learning as we went, having built up a good cohort of users. Spending a bunch of money and waiting two years for a legal case to settle was not a great option for us, especially because within two years it might be moot: you could be out of business. So we took a different approach. Well, first the standard approach: someone is scraping your content, so what do you do? You find their IP addresses, you block them at the firewall level, and that's the end of it. But they really wanted the content. So every day they would move around a little bit, we'd have to block them again, and you're into this constant back and forth.
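For the curious, the "standard approach" Paul describes here is about as simple as it sounds. A minimal sketch in TypeScript, Express-style; the IPs and route are made up, and the episode never says what the portal's real stack was:

```typescript
import express, { Request, Response, NextFunction } from "express";

// Hypothetical denylist, updated by hand each time the scraper moved.
const blockedIps = new Set(["203.0.113.10", "203.0.113.11"]);

function blockScrapers(req: Request, res: Response, next: NextFunction) {
  if (blockedIps.has(req.ip ?? "")) {
    // The blunt instrument: refuse the request outright.
    res.status(403).send("Forbidden");
    return;
  }
  next();
}

const app = express();
app.use(blockScrapers);
app.get("/listings", (_req, res) => res.json({ listings: [] }));
app.listen(3000);
```

The weakness is exactly the cat-and-mouse described above: the denylist only ever reflects yesterday's IPs.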
And the problem was that our blocking would catch them, but they would get a large chunk of content each time before we caught them. So they were getting a lot of content, and the blocking wasn't really doing us any good. So we thought: if they're going to get our listings anyway, can we do something that at least makes the listings on their website look bad, and obviously not right, so people will come back to us? And you could do things like change all of the addresses to a script tag with an alert, "Haha, not today guys," or something like that, and hope it ends up rendering on their site. But that becomes very obvious: you'll spot something like that very quickly, and you can put in automated checks to catch that sort of thing. So we needed to send back stuff that was semantically sound, but just a little bit fuzzier, a little bit off. What we did was: if we detected their IP addresses, we would return our listings to them, but instead of, say, a three-bedroom house for 400,000 euro, we'd say this is a four-bedroom house for 280,000 euro. Fuzz the details just a little bit. And it worked; it fed through to their site. The thinking we had was that people would see this information, say "this sounds great," click through to our website, see it's different there, and conclude: well, we know and trust this existing website, clearly these new guys are charlatans, and never use them again. And it worked a little bit, but in practice, not so much. A lot of people just got very annoyed: someone is wrong here in the chain, we don't know or care who it is, but you're all wasting our time. They would complain to the agencies; estate agents would complain to us and say, we don't know what you guys are doing, but this is annoying, people are ringing me, I don't have time for this, so leave it alone. So that wasn't working great for us overall. But we still had the idea that if we could do something to make their listings look a little less credible, we'd be in a good spot. And it needed to be something they couldn't easily detect automatically, because at scale, you know, they're scraping 35, 40, 50,000 properties every day; they don't want to put manual effort into this, but anything they could detect technically would be a problem for us. So we couldn't mess with the province or the region, because those are fairly fixed; you can check them against a list somewhere of the valid provinces and areas. But the main part of the address is often free text, not hugely validated. People will use vanity addresses there, put in a slightly nicer street name, or there'll be typos, whatever it is. So we had that, and we had the photograph, that we could do something with. So we set it up so that if we detected their scraper IPs coming through, we would initially also have a set of fake properties. We would return the core information, the price and all that would be fine, but instead of 15 Main Street, Dublin, you now get 10 Downing Street, Dublin, Ireland, and a photograph of the British Prime Minister waving outside the office there. Or you get the White House, 1600 Pennsylvania Avenue, County Cork. Things like this that to a human eye are really obvious: Willy Wonka's Chocolate Factory, or the Emerald Palace in Oz, because we called this
Project Yellow Brick Road internally, so there were a lot of Wizard of Oz references thrown into it. Anyway, the data went across to their website. I mentioned that in Ireland property is an obsession; people were constantly on social media sharing funny or interesting houses, that sort of stuff, a very popular thing. Well, this is what happened. They looked at their website the next day and saw that there were some legitimate

Warren (09:03) It's gone viral in a bad way.

Paul (09:11) properties in there, but the vast, vast majority of it was just obvious junk, joke properties all over the place. And most property portals from time to time might have one or two of these, someone joking, putting up "the prime minister's office is now vacant" before an election, something like that. But the scale here, the vast, vast majority being absolute nonsense, meant people were laughing about it. It went viral, but there were also questions being raised about how secure their platform was. Has it been hacked? Is this malicious? Because the scale of it is so big, someone must have done something to them. So they deleted all of the properties, they bounced their IPs, they tried again, and we caught them again in the net. For a few days this pattern repeated itself, where we just fed them this constant stream of junk. And it got to a point where it just wasn't practical for them to keep going. They gave up on us and moved on to targeting our other existing competitor in the market. Now, that competitor had fewer listings than us, and they also had access to more expensive lawyers and were more inclined to have a conversation in the courts. So, long story short, this new competitor ended up winding down and leaving the market relatively quickly afterwards. But I think what we managed to do there was flip the cost curve around a little bit. They were going to keep coming at us, and outright blocking wasn't working well. What we were able to do instead was effectively waste their time and their resources and make it not cost-effective for them to continue targeting us. Because at 35, 40, 50,000 properties, if we're spiking the data subtly like this, they're going to need a human pair of eyes to go over it and check it, and the return on investment was just not going to be there for them. It was more attractive for them to target one of the other competitors.

Warren (10:55) Yeah, I really like that perspective. I think it's something a lot of people miss out on. Attackers, and I hesitate to call them necessarily malicious, because they're not necessarily trying to take down your website. But in a way they are trying to take down your business, right? So "malicious" is, you know...

Paul (11:08) Let's see.

Warren (11:18) ...on the line, as far as whether that's the appropriate word. I don't know if I have a better one. But the interesting thing, which I think a lot of people miss, is that malicious attackers don't have an infinite revenue stream. They can't afford to overcome every single security block or countermeasure that's in place. And you came up with a pretty cheap way to deter these particular users from coming back at all.
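The whole poisoning scheme boils down to a branch in the listing handler: normal traffic gets real data, while suspected scraper IPs get plausible-but-wrong records plus the obvious joke honeypots. A sketch of that idea in TypeScript; every field name, value, and IP here is invented for illustration, not the portal's actual code:

```typescript
interface Listing {
  address: string;
  bedrooms: number;
  price: number;
  photoUrl: string;
}

// Hypothetical set of known scraper IPs, fed by detection elsewhere.
const scraperIps = new Set(["198.51.100.7"]);

// Obviously-fake records for the honeypot phase ("Project Yellow Brick Road").
const jokeListings: Listing[] = [
  {
    address: "10 Downing Street, Dublin",
    bedrooms: 4,
    price: 350_000,
    photoUrl: "/img/pm-waving.jpg",
  },
  {
    address: "The White House, 1600 Pennsylvania Avenue, County Cork",
    bedrooms: 6,
    price: 420_000,
    photoUrl: "/img/white-house.jpg",
  },
];

// Keep a record semantically sound but subtly wrong: shift the bedroom
// count and the price so the scraped copy no longer matches the real one.
function fuzz(listing: Listing): Listing {
  return {
    ...listing,
    bedrooms: listing.bedrooms + 1,
    price: Math.round(listing.price * 0.7),
  };
}

function serveListings(clientIp: string, real: Listing[]): Listing[] {
  if (!scraperIps.has(clientIp)) return real;
  // Scraper detected: subtly-fuzzed records mixed with outright jokes.
  return real.map(fuzz).concat(jokeListings);
}
```

The point of the design is that nothing in the fuzzed output fails validation: the data is only wrong to a human who knows the market, which is exactly the check the scraper couldn't afford to run at scale.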
And I think that's core to practical security engineering, and something we don't really see that much today, because a lot of companies, a lot of security organizations, call these things best practices and then go full steam implementing them, even if they potentially cost a lot, when there are practical, low-cost alternatives that actually solve the core of the problem.

Paul (12:07) Absolutely, and you're right there. You have the industry best practices, and it's always tempting to reach for the established playbook: let's just go with it. Maybe it's not a hundred percent perfect for our company, maybe it was built for Google or whoever, but it's a best practice, so let's do it. But yeah, there's definitely a case for looking at the context of your own business and what's going to work well there. And the cost thing, I think, is a useful point as well, to loop back to what we talked about at the start with the AI bots. Cloudflare have a service now for this sort of thing. AI bots are coming to your site, your bandwidth bills are going bananas because they're sucking up all of this content. Again, not necessarily malicious users, but users we don't particularly want. And Cloudflare came up with what they call the Labyrinth. The bots land on your website expecting a detailed site where you've poured your heart and soul over many years into writing on the ins and outs of the Dutch train system, for example; they're going to get all this amazing training data. And Cloudflare starts sending them back, not your application, but AI-generated slop. Plausible slop, but slop nonetheless, full of internal links to other pages of AI-generated slop. The idea here, again, is just to burn through the resources of whoever's coming at your website, the AI bots in this case. And I guess Cloudflare are making the bet that they have more resources to burn than these guys do. As a customer sitting behind the big shield, you let the two big guys fight it out and carry on writing your passion project about the Dutch train system or whatever it is. But it's a really interesting thing they rolled out, a cool system, definitely worth a look if you're seeing a big spike in bandwidth bills without any corresponding real-user benefit coming out of it. I think that discrepancy is happening more and more.

Warren (13:59) I think it's an extension of what was happening initially with zip bombs, where there's a gzipped archive that recursively points to a made-up directory structure far bigger than the disk would actually hold. A malicious attacker scraping your site would search it for credentials, for instance. Say you call it secrets.zip: they go and try to unzip that archive, and it just consumes all of their resources. And I feel like these sorts of honeypots are great, especially when they play off a real user interaction. I want to ask you something, though: did you get lucky with having that competitor fold, basically? Because without that happening, wouldn't the, was it a Dutch company, still have access to essentially the same listings? Because I imagine the real estate agents were uploading the same data both to your website and to your competitor.
And so at that point, wouldn't that data still be available somewhere?

Paul (15:01) So the pre-existing competitor we had in Ireland, they didn't necessarily have 100% the same listings. They would have been very, very strong in rentals, for example, but there were certain areas of the country where our sales team had just done a better job than theirs had, so they had big blind spots. They'd be very strong in some parts of the country, but huge parts of it were much more on our side of things. If you only had their listings, you'd have a very incomplete picture of the market, which was one of the reasons we had become the number one website, and the big target. And the way the uploading worked for the agents at the time, it was quite a good moat. Once you'd managed to hook someone in to put their listings onto your website, it's a lot of effort for them to go and do it somewhere else. And it's not unusual: if you have the one website that delivers most of your traffic, you'll make sure every listing goes up there, but if you get distracted by a phone call, or you're going out to lunch, and you forget to do it on the other website, that wasn't unheard of, even in the areas where we both had similar numbers of estate agents. So that was why we ended up being the main target in the first place.

Warren (16:08) Did you ever consider sharing with your local competitor the way to prevent the external competitor from scraping their website? Because that data was still available through them, and blocking the malicious, unwanted traffic to your competitor was still somehow beneficial for you.

Paul (16:28) The short answer is no. Although, to be honest, over the years, over a few beers at various times, it has certainly been shared informally with them once or twice. But no, from our perspective at the time, we just wanted these guys to leave us alone. That was our main priority. If they left us alone, we were happy enough that we could fight them commercially and do whatever we needed to do. And the fact that they were then attacking the other guys, to be frank, that wasn't the worst outcome in the world. Now these other guys, instead of rolling out new products and doing whatever they need to do, have to spend time and effort dealing with whatever is going on. It's a bit like the old story about the guys on safari. They're driving in a Jeep, the Jeep breaks down, and everyone has to make a run for it back to base while the lions are coming over. One guy stops to tie his laces, and the guy beside him says, what are you doing? You're crazy. You're never going to outrun the lion if you're stopping to fiddle with your laces. And the guy says, well, I don't need to outrun the lion. I just need to outrun you. You don't necessarily need total victory if you can push your adversary off to someone else. Not the most, let's say, charitable-sounding way of approaching it. But if you're being attacked in some way, being able to simply deflect the attacker is as good as a victory in many cases, I think. You don't need total domination.
Warren (17:51) But I think that's one of the aspects that's completely missing from the whole area of detection engineering, or building up security boundaries and understanding your threat model: many companies aren't taking into account the actual, practical nature of who their attackers are and what they look like. When should things be blocked, and when should they be let through? Where is the benefit, or the disadvantage? And cost comes into the equation for both you and the attacker. There are a lot of solutions out there that are just incredibly expensive, and I see people putting these up all day long. My question is: why do you care if you get an extra, I don't know, 2,000 RPS for an hour or so, one time? How much is that actually going to cost you in sheer monetary value? If it's a DDoS attack and you can say, we can handle it because we scale, then just ignore it. That's free engineering: you don't have to pay extra for an external solution, and you don't have to maintain it. If there is an impact to your business, like there was in this case, where it wasn't just a technological threat but an existential threat for the business, then you can evaluate it a little differently, and in a way I see a good justification for spending more money to actually solve the problem. But often, when you look at the threat model and the countermeasures, none of them will actually fix the problem. "We'll just block some IP addresses" is usually the standard one. Well, IPv4 or IPv6? And all of them? What if they go on AWS and proxy the traffic through Lambda or CloudFront or Cloudflare, what are you going to do then? Block all the non-residential IP addresses? That's ridiculous. You're not going to do that and keep those lists up to date; it's just a lot of extra work. So I really liked the solution. Very practical.

Paul (19:44) There's a lot of extra work, and there are also a lot of potential negative side effects. We had an example recently where a number of our publishers were being hammered by bots that someone had fired up inside one of the Google Cloud IP ranges. They were passing themselves off as various AI research bots or whatever, but they were sending obscene amounts of traffic out of absolutely nowhere. The temptation is to say: let's find where the traffic's coming from. It's very clear, here's the block of IPs, let's just smash that whole range and go back to the day job. But the problem was that whatever block they'd managed to get in GCP had a partial overlap with Googlebot's own IP range. So you do a very naive block, let's just kill all of these things off, stop this traffic, fine. Four or five days later, you open Google Search Console and you see a whole flurry of errors saying Googlebot was blocked from crawling all of these pages. Now you're dropping out of the index, and that's a whole other set of problems you need to deal with, and deal with very quickly. So yeah, you're dead right that smashing the big red button and blocking them off at the source isn't always the most effective solution for you.

Warren (20:54) You're letting fear overcome the actual size of the impact, right?
Like, "oh no, they're costing us pennies per minute" is a lot different from "oh no, our business is no longer listed anywhere," which you obviously aren't thinking about when you say, we could probably just block the whole range. I get the sense that you aren't just...

Paul (21:14) Yeah, exactly.

Warren (21:21) You haven't just worked in the security domain, though. You have a lot of experience with the different cloud providers and what they're doing. So maybe a slightly odd stepping stone here: practical solutions to complex technology problems. I just get this feeling that you've got a bunch of other stories in the same domain.

Paul (21:42) Yeah, that's true. Plugging together little bits of technology from different areas to solve problems is something we've done many, many times. One example is from the elections in Ireland not so long ago. Elections are interesting because they tend to be semi-scheduled. Sometimes they're off schedule for whatever political reasons, but generally you have a relatively short window when you know they're definitely going to happen. You might hear it'll probably happen in November, but you won't get confirmation until a certain date, and then you have a very short window. And the nice thing about elections is they really get the creative juices flowing among journalists and editors, the people who live in this space all the time. So we were working with a large publisher who had an industry-leading CMS for news publishing. It does the job perfectly, but it's very optimized for: I need to serve an article, I need to serve a river of articles. Anything else you want to do, that's someone else's problem. It does that one job really, really well. The problem is that when something like an election comes along, you've got journalists and editors who now want to do: here's a cool data visualization, or here's this really cool interactive thing, put in your views and see which party matches your policies, all these really interesting ideas. And you're sitting there going, okay guys, the election was announced, we've three and a half weeks to get everything built and live. And the flexibility the system gives us is that we can basically embed things into an article, like a tweet or a YouTube video, but you're not going to be able to go hog wild and build whole new sections or anything like that. So one of the tools we were looking at was a coalition builder. In Ireland, our electoral system uses proportional representation with the single transferable vote. I'm not sure if it's the same with you guys, but it turns elections here into almost a national sport. You have seats available and, say, 10 candidates; instead of just picking your favourite, you rank them all. At the end of each round, a candidate drops out. If my candidate drops out, my number one vote is dead, but my number two vote now becomes a number one vote for someone else, and this goes on and on. It sounds complex, and it is kind of complex, but it means that after the election there are three or four days of manic media coverage. The most popular website in the country is a Google Sheet shared by one of the journalists, who keeps track of all the counts across all the different count centres. It's an odd time when there's a huge amount of interest and traffic.
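For anyone unfamiliar with the count, the transfer mechanic Paul describes is easy to sketch. Here's a toy elimination round in TypeScript; real STV counts also handle quotas and surplus distribution, which this deliberately skips:

```typescript
// Each ballot ranks candidates, most-preferred first.
type Ballot = string[];

// Count first preferences among candidates still in the race:
// a ballot counts for its highest-ranked surviving candidate.
function tally(ballots: Ballot[], remaining: Set<string>): Map<string, number> {
  const counts = new Map<string, number>([...remaining].map((c) => [c, 0]));
  for (const ballot of ballots) {
    const top = ballot.find((c) => remaining.has(c));
    if (top !== undefined) counts.set(top, (counts.get(top) ?? 0) + 1);
  }
  return counts;
}

// One round: eliminate the lowest-placed candidate. Their ballots transfer
// automatically on the next tally, exactly the "my number two vote now
// becomes a number one vote for someone else" effect described above.
function eliminateLowest(ballots: Ballot[], remaining: Set<string>): string {
  const counts = tally(ballots, remaining);
  const [lowest] = [...counts.entries()].reduce((a, b) => (b[1] < a[1] ? b : a));
  remaining.delete(lowest);
  return lowest;
}

// e.g. with ballots [["A","B"], ["C","B"], ["B"]] and {A, B, C} remaining,
// eliminating the first of the tied candidates transfers that ballot to B.
```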
With this electoral system, what it means is that you can't get elected by being really, really popular with 30, 35% of people and despised by everyone else. You have to be, if not likable, at least less dislikable than someone else. And what it means in practice is that we end up with coalition governments all of the time. So once the votes are counted and the seats are allocated, you need to figure out: well, 27 seats from this party and 15 from that party, how are we going to combine them? So the idea was a game where you could see what the results were, allocate the parties yourself, and see what kind of government could be formed, whether it crosses the magic number, all that sort of stuff. We came up with a way to do this where we'd put a web component together to do the front-end rendering, nice little JavaScript animations, that sort of thing. It would work as an embed, the same as a YouTube or Twitter embed, so it works within the constraints of our CMS. But it needed to pull its data from somewhere. And we were looking for a system where you pull in structured data; it needed to be very easy to use; it needed tight permissions, so that only the editors and journalists could touch it; and ideally it would have some kind of version history, so if someone makes a mess of it or fat-fingers a number, we can roll it back easily enough. Now, I know some people listening will be thinking: this sounds like a great case for a limited-scope, basic-auth microservice we can spin up. And we could have, but again, we had three weeks to get the thing built and live, we had a team very used to working with this one big CMS, and there wasn't a huge amount of additional infrastructure available to provision that sort of thing. We needed to work within the constraints of what we had. And it turned out Google Sheets was a solution for this. Yes. The permissions are sorted: Google manages them, that's fine. It has version tracking built in. Usability: everyone knows how to use Google Sheets. And what I didn't know before we started looking into this was that Google Sheets has a scripting language internally, and you can publish the sheet in such a way that a script transforms your data into something like an API response. As far as the outside world is concerned, you hit a certain URL, the script runs, and it maps the data from your Google Sheet into a lovely JSON response. So now we're in a mode where: we have a Google Sheet, we can give it to our editors to put in whatever they need, there's this little script sitting in the middle, about 20 lines of code, and our front end can call the whole thing. Happy days, this is great. We were worried about scale, of course. The election is going to be huge, people are going to be constantly refreshing and sharing this, it'll be a big viral tool. But scale's not going to be a problem for Google, is it? So we tested it, and it looked okay. And then we started to hit a few problems fairly quickly. CORS, first of all: CORS errors. But then we also had an issue with speed, because the script would run, but the first time it ran it would take maybe five or six seconds to respond, and on follow-up loads it would be faster, but not by much. So, not great.
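The publish-a-script trick Paul mentions is Google Apps Script: a doGet handler deployed as a web app can serve the sheet as JSON. Roughly along these lines; a sketch assuming a sheet named "Results" with one header row, not their actual code (Apps Script is JavaScript, but tools like clasp let you write it in TypeScript):

```typescript
// Google Apps Script web app: deployed with link access, it turns the
// spreadsheet into a read-only JSON endpoint.
function doGet(): GoogleAppsScript.Content.TextOutput {
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Results");
  if (!sheet) throw new Error("Results sheet missing");

  // Treat the first row as the header; every other row becomes one record.
  const [header, ...rows] = sheet.getDataRange().getValues();
  const records = rows.map((row) =>
    Object.fromEntries(header.map((key, i) => [String(key), row[i]]))
  );

  return ContentService.createTextOutput(JSON.stringify(records)).setMimeType(
    ContentService.MimeType.JSON
  );
}
```

Notably, ContentService gives you no control over response headers, which is exactly where the CORS pain just described comes from.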
And the bigger problem was that it's also rate limited, because these URLs aren't really intended to be a public API on the night of an election, I guess. I think it was something like 50 requests a minute. Under normal circumstances that should be fine, but at the scale we were looking at, it was going to be problematic. So we had a look around and came to Cloudflare Workers. With Cloudflare Workers, we could put a very small script together that would run at the edge somewhere. First of all, it gets us around the CORS error: the worker makes a server-to-server request, so no CORS restriction applies, and we can put whatever CORS headers we want on the response. Great, we can now consume it from the front end. We could also push caching headers onto it. So we call the sheet, and when we send the response from the worker, we put a cache header on it to say: for the next minute, this is publicly cacheable, it's not going to change. And then we used Cloudflare's own caching rules to say: cache this in the CDN, don't even go back to the worker. That solved two things fairly quickly. It solved the rate-limiting issue, because now we're hitting the sheet at most once a minute, maybe two or three times depending on the edge data centre, but not enough to challenge the rate limit. And it also meant the speed should be quite good: Cloudflare's CDN is going to be an order of magnitude faster than Google. Now, it doesn't solve the problem of the cold cache every minute or so. So we had a cron set up on a machine somewhere that just curled this URL every 30 seconds. It took the hit whenever the cache was cold, which meant the front end would always be hitting a warm cache. So we looked at this, and now it's working, it's fine, it's really responsive. Then we had a chat with the editors about cache invalidation: when they update the sheet, they want the change to be live immediately. We were looking at ways to do it; Cloudflare has great APIs for purging the cache, and we were considering integrating that with some kind of button in Google Sheets. We started looking into it, and then one of the editors said: that's a lot of work, guys, for the sake of the data being a minute stale. I think we're okay for this game. So that knocked it on the head; we live with a one-minute cache. Great music to our ears, because what's the old joke? The two hardest things in computer science are cache invalidation, naming things, and off-by-one errors. We were thinking this could get bad, but no: it worked, it went live, and it did exactly what we needed it to. It stood up to huge traffic, it was very popular, and it worked really well. And the whole editorial team didn't have to learn a new tool; they could live in this Google Sheet. Then, once the election was over, we snapshotted the API response, stuck it in an S3 bucket, and it's effectively static until the next election. Wind down the worker, and the job is done. So that was a nice example, I think, where we took a couple of smaller technologies and plugged them together in ways that were probably not in the manual. But it worked, and it worked really, really well in this case for us.
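A Worker along the lines Paul describes is only a couple of dozen lines. Here's a sketch: the Apps Script URL, the one-minute lifetime, and the wildcard CORS header are stand-ins, and it uses the Workers cache API in place of the account-level caching rules mentioned in the episode:

```typescript
// Cloudflare Worker: proxies the Sheets endpoint, fixes CORS, and lets the
// edge absorb almost all of the traffic to stay under the upstream rate limit.
const ORIGIN_URL =
  "https://script.google.com/macros/s/EXAMPLE_DEPLOYMENT_ID/exec"; // stand-in

export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const cache = caches.default;
    const cacheKey = new Request(request.url, { method: "GET" });

    // Serve a warm copy from this data centre if we have one.
    const cached = await cache.match(cacheKey);
    if (cached) return cached;

    // Server-to-server fetch: the browser's CORS rules don't apply here.
    const upstream = await fetch(ORIGIN_URL);
    const response = new Response(upstream.body, upstream);
    response.headers.set("Access-Control-Allow-Origin", "*");
    // Publicly cacheable for a minute; the editors accepted that staleness.
    response.headers.set("Cache-Control", "public, max-age=60");

    ctx.waitUntil(cache.put(cacheKey, response.clone()));
    return response;
  },
};
```

The cron-based warmer then just has to curl the public URL every 30 seconds so real users never pay the cold-cache cost of that five-second Apps Script run.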
Warren (30:06) I was actually going to ask whether you were comfortable with the SLAs provided by Google Sheets, given that the success of the news company was riding on it. Because if you're using one news site, and you're only going there to see the up-to-date, or one-minute-behind, election results, and it's down, you're immediately going to switch, maybe permanently, to somewhere else. But I think switching over to Cloudflare pretty much put you in full serverless land, right? Your database is Google Sheets and everything else is being cached. And I think every cloud provider offers some sort of strategy like that.

Paul (30:44) Yeah, it worked like a charm with Cloudflare. Once you managed to get the different layers tuned, between the worker and the caching rules at the account level, it was great, and it went really, really smoothly. Now, it went really, really smoothly because I think we'd had one experience of not tuning those things correctly in the past. The caching layers... I'd say there are very few problems I've come across over the years that have caused me as much pain and as many sleepless nights as a misconfigured cache, or multiple cache layers. Because, you know yourself, people are working on an application, and on day one there's a very well-considered, thought-out caching strategy. But over time things drift. Suddenly there's a slow part of the application over here: well, let's just wrap a cache around this particular function. Now this part is slow: well, let's put a cache over here as well. It's like you have a giant rug and you're just shoving all these lumpy things under it. Over time it builds up, and you can end up at a point where, okay, we have a cache on this particular resource retrieval, and a cache on the formatting around it, and then a cache on this thing getting injected, and then a response cache, and maybe then a browser cache going out to the users. And when you're trying to purge all of these... We work with news websites, and when you work with news websites, it is inevitable that someone is going to make a mistake at some point. You'll have a big article about someone being convicted of an absolutely horrible crime, disgusting, despicable stuff, and the photograph will be of some guy who ran a marathon for a local charity, accidentally attached to the wrong story. Something silly like that. It's human error; it happens more than you'd probably expect. That's the kind of thing that leads to people's hair being on fire, saying this has to vanish really quickly. And it can be incredibly frustrating when you have your nice caching strategy from day one, and your nice cache invalidation strategy from day one, but the caching structure and the invalidation structure haven't kept up with each other. So you can end up with really hairy problems where, at a simple level, say you have a Varnish cache around an application, a Cloudflare cache at the edge, and then it goes out to the user. If you purge them in slightly the wrong order, or you're doing it async and they execute in slightly the wrong order: you clear your Cloudflare cache, great, that's nice and fresh. But your internal Varnish cache hasn't cleared yet. A new request comes in, Cloudflare gets the old data, and it re-caches it.
Now your Varnish cache is cleared, but Cloudflare is holding the old data again. So you sometimes end up with: well, what's the policy for clearing the cache, guys? Or you just keep clicking the "kill all caches" button until eventually it works. It can be very stressful if that kind of strategy isn't well thought out, well tested, and well structured, because it can be a big source of pain. And not just in the reputational sense for news organizations: pretty much every app using a cache is using it for something performance-sensitive. Staying conscious of the cache invalidation strategy, and making sure it keeps pace, is a key way to avoid those hyper-stress moments when something big comes out of the blue and you're suddenly fighting against the screaming noise.

Warren (33:46) And I think that's another practical success here as well, because the news editor saying, okay, we'll just let the data be stale for one minute, that's celebration time, honestly. Because trying to figure out how to wire in the invalidations, and then have them actually work correctly, is just... well, good luck. And in a way you sort of lucked out, because for the Irish elections, only the edge locations in Ireland particularly mattered, and who cares if there's even a 10-minute delay on invalidation for someone loading the data from another country? Assuming the news provider was even willing to serve it abroad, because I know a lot of newspapers and online content are just region-locked for whatever reason. A long time ago, I was actually advising the equivalent organization in Switzerland, early on, before an election. They had run all of their technology on-prem and had the smart idea of finally moving to the cloud, and caching was a huge problem for them as well. And I think this is one thing that people outside the content domain don't realize: it's not like you just have a lot of public data that can be cached exactly as-is. There are a lot of small tweaks, A/B testing, that end up in the cached data, and for them it was paywall stuff. You don't want to cache per user, keyed on authorization tokens or bearer tokens or whatever, because it's the same data for everyone; but you don't want it to be public either, because then you wouldn't need to log in to get it. Getting that right is really quite tricky, and the same goes for invalidation. You remove content, and the search engines, the bots going out and scraping websites, want to see a 410 Gone when content has been removed; they don't want to see a 404. And it's like: okay, well, that means we need to keep track of when things were deleted, keep a note of it, and expose that, but only to the Googlebot and not to actual users, because that would be a confusing experience. But then that data could get cached while we're not paying attention to the user agent that's actually calling us. And I think that's where you start to realize: actually, this is a little more complicated, and it has nothing to do with caching. But then there's caching on top of it.

Paul (36:09) Yeah. So many layers; it's turtles all the way down.
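To make the ordering trap concrete: a purge that walks the layers from the origin outward, awaiting each step, avoids the re-cache race described above. A sketch that assumes a Varnish instance configured (via its VCL) to accept PURGE requests, plus Cloudflare's real purge-by-URL API; the zone ID and token are placeholders:

```typescript
// Purge layers inside-out: if the edge is cleared while an inner cache still
// holds stale data, the very next request re-populates the edge with it.
async function purgeUrl(url: string, zoneId: string, apiToken: string): Promise<void> {
  // 1. Inner layer first: Varnish, via an (assumed) PURGE-enabled VCL.
  const varnish = await fetch(url, { method: "PURGE" });
  if (!varnish.ok) throw new Error(`Varnish purge failed: ${varnish.status}`);

  // 2. Only then the edge: Cloudflare's purge-by-URL endpoint.
  const edge = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ files: [url] }),
    }
  );
  if (!edge.ok) throw new Error(`Edge purge failed: ${edge.status}`);
}
```

Firing both purges async in parallel, as in the war story above, is exactly what lets the edge re-cache stale data from an un-purged inner layer.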
And then you get into problems where you're supposed to send one code back to the Googlebot, but Google's guidelines are also very strong on not treating the Googlebot differently from regular users. So you're damned if you do and damned if you don't. But yeah, caching in particular is one of those things where, the day you first put it in, it's the greatest thing in the world: now my app is hyper-performant, it's brilliant, I'm going to cache everything. Suddenly you have the cache hammer and everything looks like a nail, and it's when you start needing to flush the cache that things get a little bit hairier.

Warren (36:43) Yeah, and I think that isn't really evaluated enough, and it usually only gets evaluated as a result of major production downtime, because when people put caches in place, they're not planning for the failure mode when the cache gets flushed; they're looking at the new steady-state resource usage. Say you have a cache in front of your service, put there because the underlying database was slow and you didn't want to scale it up, or you couldn't, because you had an index that just wasn't well optimized, or couldn't be given the data it was querying; or it's a third-party service, like Google Sheets, and you don't have any control over it in the first place. You put the cache there, and then one day you go: oh, some of the data in the cache is wrong, let's flush it. Well, you flush it, and now all those requests come directly to your database and overwhelm it. Now not only is the data not in your cache, no one's getting any results at all, and it's a real production incident. You just don't think about the long-term impact of setting up that system. So one of the things I want to ask about, and it comes up frequently in conversations, especially around serverless: how do you evaluate the short, medium, and long term for some of these options? I see some very staunch serverless proponents, and I'll admit I'm one of them, but I try to ask: are we going to end up in a situation where we have tens of thousands of these Cloudflare Workers all doing independent things? How are we going to keep track of them, know what they're supposed to be doing, and even manage that? Are those separate repositories, or what? I assume you've seen this in some regard.

Paul (38:14) Yeah, yeah. I think your point about all the different workers is very similar to the conversation you'd have about microservices versus the monolith. You're breaking things down into a million small pieces, and the microservice skeptics will point to the cases where, now, to debug one request, I need to spin up 15 microservices locally and chase it through all of them. That holistic view you're talking about, having an idea of what lives where and what is doing what, it can get complex very quickly as you break down these components.
So I think that's always the trade-off with microservices, small workers, whatever it is: the extra velocity you get from breaking things out into individual components, each with one thing to do, one thing to focus on, and a clearly defined boundary, versus the complexity of, okay, now I have four of these things, how many nodes do I have, how are they interacting with each other, and how much do I need to load into my own mental model to get a handle on it? Those boundaries are, more often than not, where things start to fail and, let's diplomatically say, get interesting. It gets very colorful. Whether it's microservices, or a very complex Kubernetes stack where suddenly one pod randomly in the middle somewhere is misfiring for whatever reason, pinning it down and diagnosing it is a lot more challenging than when everything ran on one massive server sitting in the corner that you could shell into and dig your way through. There are definitely trade-offs to be had. And it is difficult to keep that high-level view of what's happening where, and what we need to be aware of across the whole system, when we ourselves are straddling more than one of these services; it's not all isolated teams. That can be a very hard view to get. I think that's one of the areas where some of the AI tooling is certainly helping, because what we're talking about here is effectively documentation, system documentation. And most developers are not developers because they love sitting down and writing 15-page docs that are out of date 10 minutes after hitting publish. But with the AI tools now, particularly the last few iterations, they're getting stronger and better to the degree that there's really less of an excuse not to have this type of documentation available, and not only available, but current. I don't know if you've followed the more recent models, but over Christmas, for example, there was a huge boom in people talking about Opus 4.5, Claude Code, and the new versions of Codex, saying they've hit an inflection point where a lot more stuff just works that didn't six or nine months ago. Things like generating better documentation, getting graphs and flowcharts that actually look like graphs and flowcharts rather than some weird ASCII-art thing: it's easier, it's happening, it's quick to do and quick to have. And then it's quick to feed back to these tools when something goes wrong and say: listen, I have 15 different services documented here, something is breaking somewhere, help. I'll look over here, you look at all of this, and together we'll figure it out. So I think that's an area that's going to get more helpful over time. The usual caveats about AI hallucinations and sanity-checking apply, but generating that documentation, and keeping it updated, is a much, much smaller task than it was 12 or 18 months ago.

Warren (41:46) Also, color me naive here: why do we need documentation with LLMs at all? Can't it just be a text prompt that gives us the answer?
Paul (41:56) Well, we could. A text prompt: go to the LLM and say, my system's broken, please fix it, I'll tell you no more, figure it out for yourself. I mean, that's the dream stage. But I'm talking more about how, historically, when we put together these microservices or workers, we might have a repo, and we want a readme in the repo explaining what it's doing, what it's not doing, what it's taking responsibility for. So when we're chasing down "this cache thing has misfired and it's gone through these 16 different channels," we know what each of those channels is actually supposed to be doing. Rather than having the LLM analyze the monster code base from scratch, we've effectively distilled it into: here's the documentation explaining it at a high level. As with any documentation, it's never "we're 100% sure this is what it's doing," but we're maybe 90, 95% sure this is what it should be doing. So you could have a double pass. From your perspective as the incident investigator, yes, you get a text box saying, my system's broken, go figure it out. The LLM does a first pass, reads the 15 sets of documentation, figures out it's probably in one of these two repos, and then goes deep, spinning off sub-agents that really learn the ins and outs of the internals of this one application, and go deeper and deeper. But yeah, ultimately a text box where I fire in a question and everything gets sorted out for me: that's the dream.

Warren (43:14) I'm seeing a pattern here, because when you put it like that, the documentation is like an LLM knowledge-base cache. Rather than reading through the whole source code, and having to do that on the fly, you cache the results in what you're calling a document, the readme, which it can read back later to generate an answer. And that seems obvious. And then you just get into the standard problem: well, you're going to have to invalidate that cache at some point.

Paul (43:40) I like it. Yeah, that's it. That's a great way of putting it.

Warren (43:44) Maybe this is a good point to switch over to picks. I'll ask Paul: what did you bring for the audience today?

Paul (43:52) Yes. So my pick is a book, The Code Book by Simon Singh. It's all about cryptography and secrecy, from ancient Egypt up to about the Enigma. I love history, but I'd say I'm more of a pop-history fan. I'm not going to sit down with a 1,200-page deep college textbook on history; I love historical stories illustrated by anecdotes and personal details. And this book is great for that, because there are basically five or six different eras of cryptography covered, from Caesar ciphers back in the day all the way up to the Enigma, and what went on in between. There's a story in it that I love. I think I've told it to so many people at this stage that I owe Simon Singh royalties for finding it in the first place. It's about the Thirty Years' War, when the Spanish were going to war with the French. The Spanish also controlled the Netherlands, so they needed to get back and forth quite a bit over dodgy territory. And the problem the Spanish kept having was that every time their armies and their diplomats were moving around, the French were there waiting.
They were absolutely hammering them, destroying them militarily. They knew all of their plans, all of their strategy. So the Spanish went through the normal things: there must be double agents, let's purge a whole bunch of people, and so on. But it turned out that wasn't it at all. The Spanish were using really strong cryptography. It wasn't a Caesar cipher; it was a modified version where you're swapping out: instead of an A, you use 127; instead of an E, it's 329. So it's not subject to the same simple frequency analysis you'd run on plain text. They were confident they had this amazing system in place. So how on earth were the French breaking it? They'd gotten rid of anyone who could have been a double agent, but the French were still routinely breaking it. So they went to the Pope, and the Pope at the time was kind of the boss of Europe, the boss of all the kings, but he also had his own army. And the Spanish said: Your Holiness, this makes no sense. We're using the best cryptography in the world, but they're reading it. The only way this could be happening is if something else is going on: the King of France has clearly made a deal with the devil and sold his soul to break our cryptography. It's the only possibility. This may be the 17th-century equivalent of "there's a bug in the compiler": you jump straight to "the devil did it." The Pope heard them very patiently and said, yeah, I'm not going to excommunicate him, because the Pope had also broken their cryptography. It turned out that what the Spanish were doing was this: they had a key, and the key itself was under lock and key, secure, passed around carefully, but it wasn't changing. Operationally, they kept it the same; they were using the same key for their messages all of the time. And while swapping letter A for 127 or whatever it was meant it wasn't subject to basic frequency analysis, with a bit more time it's still effectively subject to it. If you think about diplomatic cables going back and forth, a lot of them start with "my esteemed Lord"; there are phrases you can start to pick apart. And when you're using the same keys for so long, the French eventually just worked out: okay, these are the mappings. And so did the Pope's cryptographers, and so did the British, and pretty much everybody except the Spanish. So it was a lesson in the importance of keys not staying static for too long, and in assuming that different threat models do exist, because after this, for a couple of centuries, "Spanish cryptographer" was used as a sort of pejorative term around Europe in security circles. So that's one of many; I could probably fill a whole episode with the stories in this book. It's one I'd definitely recommend. It's a fairly easy read; it's not going in and out of the full algorithms behind quantum cryptography or anything like that. It's a nice light read that blitzes through the different ages in very distinct chapters. I enjoyed it quite a lot, and I'd recommend it to anyone even tangentially interested in the area.

Warren (47:40) I like your pick. I think it's super relevant, and I'm sure there's at least one person out there interested in reading that book. I'm going to put it on my list to pick up and read on my vacation. Thank you, Paul, for bringing that. I guess I'll move over to mine.
So my pick: I'm living in Switzerland, so I go hiking a lot, and I had to bring a hiking shoe. I just recently bought these, and I'm sure I'm going to get some crap for it. I'll just hold it up here: it's a North Face Hedgehog hiking shoe, the 6 I think is the actual version. And yeah, I know what people are thinking: North Face isn't known for quality hiking gear, it's glamping more than anything. But honestly, I feel like there are some lines they've locked in on that are still quality, that I've been managing to get for the last decade or so. I don't know what it is, and maybe people will say I'm just shilling their brand. Honestly, I think it was an accident that they started making actually useful hiking gear, and the boots are just one of the things I really like. You have to get them with the Vibram soles; there are a lot of knockoffs, even from North Face, that are there to just be worn around, not actually hiked in, so you have to be careful which ones you go for. But these are waterproof and everything, really sturdy. I absolutely love them.

Paul (49:03) I'm not a huge hiker myself, but what would you say is the main difference between them and regular hiking shoes? Is it the comfort in the soles, or more reliable waterproofing?

Warren (49:13) I think this is where it's different for everyone. When you're buying hiking stuff, each brand makes subtly different sizes and shapes, so it's hard to say one is definitely better than the others. I would have preferred to go with a more traditional hiking-gear company, but a lot of them don't fit my foot well. Do I have special feet? Maybe. But for whatever reason, I really like these. For me, they're light and malleable, though you have to be careful when buying hiking boots, or even shoes, because malleability in the sole makes your foot do more work when you're walking on non-flat surfaces, rocks and whatnot. Terrain is also super important. If, say, you're in Switzerland doing T2s and T3s, it could be wet and you don't want to slip, and you may or may not need ankle support: these are great. If you go up to T4s or T5s, or you're walking on paved roads, yeah, get a different shoe for sure. Well, thank you so much, Paul, for coming on and joining us for this episode. It's been absolutely fantastic.

Paul (50:22) Yeah, great stuff. I had a blast; it was really great. Thanks for having me on.

Warren (50:25) Yeah, and I'd be happy to do it again. Thank you to Rootly for sponsoring this episode. I appreciate all the listeners, and hopefully you'll join us back again next week.