Welcome back everyone to another episode of Adventures in DevOps. I'm really excited today because we're going to jump into one of the areas that I find personally really interesting, but also our guest has worked at a number of companies in areas that I feel like lots of companies get wrong. So I just want to welcome to the show Sylvain from Rootly, who is the head of Developer Relations. Hey, how are you doing? Thank you for having me. Doing all good. Good to hear. So, as I said, head of Developer Relations, and I've got to be honest, I feel like a lot of companies have started to overuse this term to mean a wide variety of different roles and responsibilities. Can you give me a breakdown of what it means for you today? Yeah, absolutely. Actually, it's a great question, because I just put it on LinkedIn two days ago when I was hired. Historically, the role of developer relations is to empower developers to use a product or developer tool by providing educational resources, by answering any questions they may have, and then just overall marketing the product in a way that fits engineers, which is not product marketing, right, it's all about what you get out of it. Don't tell me, show me. So it's tutorials, talks, in the form of articles, YouTube videos, and so on and so forth. But as I joined Rootly, it was clear that AI, and more specifically here LLMs, are the future of incident management. For those who don't know, Rootly is an on-call and incident management platform. So when something breaks and you have people on call who need to respond to the incident, that's where they go to manage the incident up until it's resolved. So AI plays a big role in that, and we'll speak about it during this episode. Most of my time at Rootly, most of my energy, I would say maybe seventy five percent, has been dedicated to AI agent relations, because we can see an AI agent as just another member of the team, and this agent also needs to be taught and onboarded, just in a way that's different from humans. So while I'm the head of Developer Relations, you could also say I'm the head of AI Agent Relations. Well, that's definitely a wide area. You know, I'm interested because, as you mentioned, you're not doing product marketing, but you're still marketing in some way to engineers, and they're notoriously the biggest challenge to get on board with whatever you're trying to sell. I mean, I've found that of all groups of people, even ones in the technology space, engineers always want to do things themselves, right? It is, and with Rootly we are targeting SREs, site reliability engineers, who are even more skeptical and hard to convince, and for good reason, right? Their job is to ensure that the infrastructure is running smoothly and in an optimized fashion, and so you want to be careful with a new framework or new tool or methodology that might bring chaos or instability. And, you know, I used to be an SRE myself. Back then SRE was not truly a thing yet. I was working for SlideShare as a DevOps engineer. We were in the top fifteen most visited websites in the world back then, displaying about one point five billion slides a day, which was definitely large volume. And we got acquired by LinkedIn, where I worked as a senior SRE, this time for three years, and you can imagine the volume.
So I've been on that side, on the engineering side, and so I completely get the persona. And I think at the end of the day, it's just that these people don't want to waste time with marketing copy. They want to understand what's in it for them and their job, right? So for me, it comes very naturally. I don't think it's a challenge, but for someone who does not come from an engineering background and doesn't have as good an understanding as an engineer may have, it may be hard to communicate to this audience, which is where I think this reputation might come from. Yeah, no, I totally get it. I really want to dive into the core topic, which is self-healing systems. We were talking a little bit before the episode started about how this has been a lifelong project area for you. How did you get into this? Was it the first thing, like, did you always know this was the thing you were going to go into? How long have you been doing this, and what does it really mean to work on self-healing systems? Yeah, so it goes back to a project I started when I was working for SlideShare as a DevOps engineer slash site reliability engineer. I was on call and I had to manage outages. Back then we were using Puppet as a way to ensure that the infrastructure was as it should be. And ultimately there were a lot of repeat incidents, a lot of outages or issues that were coming from the same types of problems. You know, as engineers grow in their careers, they kind of know what the main failure types are, and it gets repetitive, and I think a great engineer wants to automate themselves away and not do the same thing over and over again. So yeah, that's where the idea of building a self-healing system came about. So maybe, sort of like you said, you would see the same sort of regressions over and over again. Was there one in your mind that was just the most common, where every single time it happened it was the driver for you to make a real change in an organization? Yeah, you know, I know it's been more than a decade, so I won't have a super sharp example, but I think the classic ones are issues with lack of resources, whether it's storage or CPU or memory, and you need either to increase or decrease the load somehow, or distribute, or scale. It could be a service that's misbehaving and you need to restart it. It could be a lot of things. I think the industry took a different route. I think now with Kubernetes, what we do is that if something is misbehaving, we just shut it down, right, we get rid of it and we start a new one. And obviously Kubernetes is great at scaling, so I think this contributed to the self-healing idea; it still works in this way, right, by nuking things or scaling things you can heal a system. I think in my mind I wanted to take a different approach. I was envisioning a system that would actually address the root cause in some of the cases, you know, instead of just nuking things, really try to mimic what a human engineer would do. So yeah, that's kind of the philosophy that I had back then.
I mean, there's definitely a huge population of engineers who think that what they would do in those examples would for sure be to restart the machine or the container or the node if it started to run out of memory or processing power. And I feel like that's sort of the crux of one of the issues that I've seen over and over again: we do build those systems, and I say we as collective humanity and not at my current company, that automatically restart or allocate more memory or processing power. And I feel like the automatic scale-out or scale-up of resources can make sense if it doesn't negatively impact the feedback loop you need to solve the problem. And I feel like this is one of the problems with automatic restarts: it doesn't really solve the problem. It's still there, it's going to keep happening, and also you're delaying actually doing the investigation, and you're eliminating some of the evidence that would allow you to identify the problem. So it's really great to hear that you thought the appropriate process was to go figure out why is there extra memory usage, why is the machine getting stuck, et cetera, et cetera. And I feel like that's sort of the thing that sets apart the best SREs from the ones that are just coming in to, quote unquote, do the job. It is. I think you need to strike the right balance between achieving the end result, which is availability, stability, and if restarting is the way to go and you don't need to spend engineering resources, and obviously it's working, then I don't think it's an issue. But yeah, as you said, you need to strike the right balance between just doing this over and over and, if it's a repeat outage or issue, recognizing that it might need investigation. And so the idea that I had back then, I think I started in twenty twelve, twenty thirteen, was to really ingest as much data as you can from a distributed system, whether it's any logs, any metrics, application logs, traces, and ingest all of this into a database. Back then we were using Fluentd, which is an open-source data collector that still exists and actually is very popular; at SlideShare we were one of the first main users, a big use case actually, shout out to the team if they are listening, and they have gone very far with this technology since. And we stored all of this in an unstructured database; back then I bet on MongoDB. The technology doesn't really matter, but that's what we used for the prototype. And then, based on that, come up with a state of the system, and then try to resolve the issue by throwing at the system a bunch of actions that would be safe. I'm not speaking about just any action, like an rm or a drop-database, you need to be careful with those, but a set of safe actions, and then use machine learning as an engine to basically learn, depending on the state of the system, what could solve the issue. And so we designed this for a widely distributed infrastructure. I actually continued this work at LinkedIn, and they asked me to write a patent, which eventually was accepted. But I never got a chance to build it, unfortunately, because then I left to become an entrepreneur. But that's why I was telling you I have been swimming in this topic for a little while. Yeah, I mean, I still want to go further into that.
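To make the loop Sylvain is describing concrete, here is a minimal Python sketch of the idea: ingest signals, reduce them to a state, try only safe actions, and learn which action tends to resolve which state. Everything here, the action names, thresholds, and state labels, is hypothetical; the original prototype ingested data through Fluentd into MongoDB and used a more serious learning engine than simple success counts.

```python
import random
from collections import defaultdict

# Hypothetical catalog of *safe* remediation actions -- no rm, no DROP DATABASE.
SAFE_ACTIONS = ["restart_service", "clear_tmp_files", "scale_out_workers", "rotate_logs"]

# success_counts[(state, action)] -> [successes, attempts]; this is the "learning" part.
success_counts = defaultdict(lambda: [0, 0])

def summarize_state(metrics: dict) -> str:
    """Reduce raw metrics (in the prototype, documents collected via Fluentd)
    into a coarse state label the learner can key on."""
    if metrics["mem_used_pct"] > 90:
        return "memory_pressure"
    if metrics["error_rate"] > 0.05:
        return "elevated_errors"
    return "healthy"

def pick_action(state: str) -> str:
    """Prefer the action with the best observed success rate for this state,
    but keep exploring occasionally so new fixes can still be discovered."""
    if random.random() < 0.1:
        return random.choice(SAFE_ACTIONS)

    def score(action: str) -> float:
        ok, total = success_counts[(state, action)]
        return ok / total if total else 0.0

    return max(SAFE_ACTIONS, key=score)

def record_outcome(state: str, action: str, resolved: bool) -> None:
    """Feed the result back in so the next incident benefits from this one."""
    ok, total = success_counts[(state, action)]
    success_counts[(state, action)] = [ok + (1 if resolved else 0), total + 1]
```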
So just to summarize a little bit, the strategy is: we're collecting tons of logs, maybe metrics, et cetera, we maybe have access to the source code, and we train on that data to identify, based off what sort of errors we're actually seeing, how to pinpoint which part of the source code or the infrastructure could be problematic, and then we utilize that on errors that actually do come out of the system to help dive in and identify the cause. Or does it go further than that? Yes, I think that's a good point. Back then I dove directly into resolution. I was a young engineer, I was like twenty five, twenty six, maybe even younger than that, so I was not really mature. But I think starting with the root cause analysis is the right approach. Obviously, now with hindsight it makes sense, but my goal was really resolution, which is ultimately where you want to go. So yeah, I was really focusing on direct resolution, and it would be a mix of runbooks that we could feed it, and then, more interestingly, a set of safe actions, safe commands, that the system could run to see if they solved the issue, and then learn from it. Maybe it tries something and it doesn't work, that's fine, we just ditch that instruction set as an option; but sometimes it will succeed, and it will use that for the next incident. And all of this would be based on machine learning, obviously: the more success you have with an instruction, the more likely you are to use it next time. Back then, machine learning was nowhere near as advanced as it is today, so it was hard to achieve this goal. And I don't think anyone in the industry built this type of system until now. You had one player that did something similar, which is Facebook. They built a system called FBAR, F-B-A-R, that they defined as self-healing. It was used to manage Facebook's data center racks, so it was not about whole systems but racks, where it would automatically perform actions to solve some production issues. But it was deterministic, so there was no machine learning used in it. And then Dropbox in twenty sixteen presented Naoru at SREcon, and this was a self-healing system for distributed systems, this time for web infrastructure, but same thing, it was deterministic. So these systems have been around for more than a decade, and they've been producing value. I know they've been in production, and I think SREs are generally not comfortable with having mechanisms or systems working on their behalf, kind of in the shadows, but the truth is that they've been around for a while. Now the main difference, the big difference, which is a huge difference, is that we are including this machine learning, LLM part, which is non-deterministic. And that's a big deal. So let's just dive into that for a second. When we say deterministic in the history of self-healing systems, we're talking about things like auto scaling groups, or identifying specifically, based off of rules that some engineer wrote, what we're seeing and then how to handle the situation very concretely. Is that accurate? It's a great description, and the ones I mentioned earlier are exactly how you describe it: a runbook that's written by a human, and then this runbook is triggered based on a specific signal.
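For contrast with the LLM-based approach discussed next, a deterministic runbook really can be this literal: a hand-written trigger plus a fixed list of steps. This sketch is illustrative only; the alert shape, thresholds, and commands are made up.

```python
# A deterministic runbook: a human writes both the trigger and the exact steps.
RUNBOOKS = [
    {
        "name": "disk-almost-full",
        "trigger": lambda alert: alert["metric"] == "disk_used_pct" and alert["value"] > 95,
        "steps": ["journalctl --vacuum-size=500M", "systemctl restart app.service"],
    },
]

def handle_alert(alert: dict) -> list[str]:
    """Return the commands to run; the same alert always yields the same plan."""
    for runbook in RUNBOOKS:
        if runbook["trigger"](alert):
            return runbook["steps"]
    return []  # no runbook applies -> page a human
```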
But it's absolutely deterministic. Yeah. And I think the interesting thing with deterministic systems is that they still require you to do the root cause analysis: if no runbook applied, or a runbook applied, either by a human or through automation, but didn't actually resolve whatever incident you had, you still had to do the root cause analysis yourself. And now I feel like we're getting into, I think everyone's waiting for us to talk about this, how to apply, I hate to say it, AI to this concept. So now that it's twenty twenty five, what does it mean to deploy an LLM to self-heal a system? What does that actually look like in practice? Yeah, so I think you hit the nail on the head speaking about the root cause analysis. That's obviously the first step this system needs to do: understand what's up. In the past, for these self-healing systems, part of these runbooks were triggered automatically based on a very specific signal, maybe a log line, something very trivial, and then they would be automatically applied. In other cases, a human would need to do the root cause analysis and then push the runbook, which would still save a ton of time, right, because the system would orchestrate all the actions that need to be done, and when we are speaking about very complicated infrastructure, like at Google, Facebook, or LinkedIn, that can be a lot of work. But here the idea is that we can throw whatever broken system at this LLM and it should understand what's up, or at least come up with a hypothesis. And the hypothesis part is, sorry, not trivial, very important. I don't think we should consider these LLMs to be God or to be a silver bullet. We should consider them just as another human, which can make mistakes; we make a lot of mistakes. And so one key element is to understand that this hypothesis also carries a degree of certainty, and I think a great AI SRE will provide, as part of the diagnosis, the degree of certainty it has about that diagnosis, so a human can say, hey, if it's fifty percent, maybe I should not pay too much attention to it; if it's ninety five percent, okay, maybe I should really look into it. So what sorts of source data are you utilizing to feed into the LLM? I'm going to ask you questions about that afterwards, but specifically right now, is it a list of, like, source code and some other things? What does the source data set look like? So, as with many tools in the LLM space today, context is king, and I'm going to speak about what we are building at Rootly, which again is an incident management platform. And why do I repeat this? Because it really matters, in the sense that for engineering teams, all the signals that are associated with an incident will flow through their incident management platform. It just makes sense, right? And so these platforms, such as Rootly, have pretty much all the context that is generally available to solve the incident. So it can be monitoring, logging, traces, and it can be more than this. It can be Slack conversations, right, it can be a Zoom call, you can do a transcript of a Zoom call, so you can feed this into the LLM.
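Here is a rough sketch of what "a diagnosis with a degree of certainty" could look like in code. The prompt, the JSON shape, and the incident fields are all assumptions for illustration; the point is only that the agent assembles the context sources just listed and returns a confidence a human can triage on.

```python
import json

def build_context(incident: dict) -> str:
    """Concatenate the signals mentioned above: alerts, logs, chat discussion.
    Field names are hypothetical."""
    parts = [
        "## Alerts\n" + json.dumps(incident["alerts"], indent=2),
        "## Recent logs\n" + "\n".join(incident["log_lines"][-200:]),
        "## Slack discussion\n" + "\n".join(incident["slack_messages"]),
    ]
    return "\n\n".join(parts)

PROMPT = """You are assisting an on-call SRE. Given the context below,
propose the most likely root cause and a confidence between 0 and 1.
Respond as JSON: {{"hypothesis": "...", "confidence": 0.0}}

{context}"""

def diagnose(incident: dict, llm) -> dict:
    """`llm` is any callable that takes a prompt string and returns text."""
    raw = llm(PROMPT.format(context=build_context(incident)))
    # Surface the certainty so a human can triage: ~0.5 is a shrug,
    # 0.95 is worth waking up for.
    return json.loads(raw)
```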
But I think there is also other very important data, such as the history of post mortems, the incident resolution reports, where everything is documented: what happened, how it happened, how we solved the incident, who solved it. And all this data is super important for the AI agent to find a root cause. And the last one I forgot to mention is obviously anything that's linked to changes, which usually takes the form of code, so the list of commits is often where you'll find the issue. So you've actually got a system that ingests all this information and outputs, here are some action items to take. And then I imagine some companies are actually automating based off of that to remediate the problems in production. Or is it that you need a human to review this before you do anything else? Yeah, I think this field is extremely young; the auto-healing mechanisms around LLMs are extremely young. The oldest company in the field is maybe two years old, perhaps even less. There is a lot of competition in the space, I've seen at least twenty five products, and I've spoken to a lot of engineers who are building this internally at large companies, so everybody's doing it differently, and the maturity of the products is also very different. But for SREs, reliability is obviously ultimately the goal, and not experimentation, right, that's definitely a secondary goal. Starting with just investigation is the right way to go. And then, as the space matures, and perhaps the models too, what works great with this technology is that you can teach it, right, it becomes better over time; you can train models on your data, you can tune them. So as the technology matures and the models learn, and perhaps as we learn as humans, we can go more towards allowing these tools to do the resolution. But I think the first step is just root cause analysis. Yeah, I mean, at least from my personal standpoint, I'm scared to hand over the tools to make changes to production infrastructure automatically without involving some sort of review process. And I guess it's fine to have another LLM review the first LLM's work in some way, but I don't know if the direction matters. I think you need someone to review the context of what's happening, just like you probably want multiple engineers on call to validate any sort of code changes that would have to go into production. Because otherwise you're in a situation where there's a critical event, it's three a.m., something's going wrong with the database, you log in and accidentally drop the production DB. I'll pause there, because this has actually happened to more than one company, but there's one in our history, I think it's almost ten years old now: a major source code hosting company had a very famous production incident with, I think it was Postgres at the time. So engineers are definitely not infallible when it comes to remediation.
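Going back to the two "human-generated" context sources mentioned a moment ago, past post mortems and recent commits, here is a small illustrative Python sketch of how they might be pulled into the agent's context. The field names and the naive keyword ranking are assumptions; a production system would use embeddings and the incident platform's real APIs.

```python
def rank_similar_postmortems(current_summary: str, postmortems: list[dict], top_k: int = 3) -> list[dict]:
    """Naive keyword-overlap ranking of past post mortems; the point is simply
    that this history is a gold mine of context for the agent."""
    current_words = set(current_summary.lower().split())

    def overlap(pm: dict) -> int:
        return len(current_words & set(pm["summary"].lower().split()))

    return sorted(postmortems, key=overlap, reverse=True)[:top_k]

def build_agent_context(incident: dict, postmortems: list[dict], recent_commits: list[dict]) -> str:
    """Combine past post mortems (what broke before and how it was fixed)
    with the recent change log, since changes are often the culprit."""
    similar = rank_similar_postmortems(incident["summary"], postmortems)
    sections = [
        "## Similar past incidents\n"
        + "\n".join(f"- {pm['title']}: {pm['resolution']}" for pm in similar),
        "## Commits in the last 24h\n"
        + "\n".join(f"- {c['sha']} {c['message']}" for c in recent_commits),
    ]
    return "\n\n".join(sections)
```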
But I guess my question is going to be: do you find, with all the data that you're collecting, that the set of incidents all point back to, as far as you're concerned, repeatable or already-seen problems, like, oh yeah, this sort of software development issue or some syntax problem like a null reference exception or dynamic module loading or memory exhaustion or something like that? Or are there minor differences as time goes on, like, it used to be this, but the next thing is something that we haven't discovered yet, so you're still discovering new failure modes? Yes. Just picking up very briefly on what you said before, I totally agree with you that LLMs should be considered as another human. So code review, doing canary deploys, passing the change through the CI/CD, basically making sure that the change is safe, is just a must-do, right? I don't think we should treat what the AI proposes as the resolution path any differently than what a human would propose. I mean, it goes further than that, though, right? Because if we were able to confidently take the output from LLMs and feed it back in, LLMs should be able to develop increasingly large solutions of any size. And we see that no company has an automated software development agent that can just continually push out code. Even ones operating at very small scopes have utterly failed in their releases and the push-out of their products, let alone larger companies that have been trying to build stuff up. And then there's the recent craze of vibe coding. Yeah, and for anyone who's not aware, it's this idea where you don't even look at the code. You just have the LLM produce all the output, and whenever there's a problem, you just say, hey, here's the issue, try to fix it. The problem is that the context window will have to keep growing indefinitely; every new feature you add will continue to grow it. And so as long as we have these two failure modes, a, the LLM's finite context window, and b, the fact that companies who have made it their sole goal to make money off of automated software development aren't wildly successful at it, the likelihood of you being able to do it, let alone being able to trust it, fundamentally tells us that we're not at that point yet. For sure. I think this vibe coding is viable in many situations. If you want to prototype, maybe if you are a very young startup, I think it makes a lot of sense. But when you get to the stage where you hire an SRE, or you need stability in your product, or you are pushing a product that is crucial for your customers' systems, I don't think this type of engineering practice, if we can call it that, makes sense. But I do think this technology can bring a lot of value. You mentioned, do we find patterns in the type of incidents we see through the system, and that's a really great question. So one of the initiatives I'm leading at Rootly is the Rootly AI Labs. It's a community-driven initiative where we hire software engineers, we have the head of platform engineering at Venmo, a former head of AI at another large company, and other very smart students and PhDs from Stanford and elsewhere, and we pay them to create open-source prototypes leveraging the latest AI innovations to see how they can be applied to the world of reliability and system operations.
And one of the projects that we're working on is exactly what you mentioned: to create a graph of the incidents and see if we can find patterns. So it could be an area of your infrastructure, or a part of your application, or perhaps a type of failure. Let's say, we spoke about resources: are resources often what's failing your system, maybe because your scaling rules are not aggressive enough? And LLMs are helping us to create this graph because they are great at ingesting unstructured data and making sense of it. And so then we can create this graph that can empower an SRE team to understand where instability comes from. I mean, that's something I'd be super interested to find out: where, statistically, are the most problems coming from, and how that maps, or what the confounding variables are, between maybe the culture of the company, or the software languages that they're utilizing, or the frameworks, or the industries, right? You know, maybe these industries have these common incidents. I think that'd be super interesting to see. Well, yeah, so we're building it. You can check it out, we have a GitHub org, if you look for Rootly AI Labs. Everything is open source and we're always welcoming people to join, just giving ideas or contributing. Again, we're paying people to do that, so it's kind of a side job. But yeah, I would say it's kind of a side thing, it's not as ambitious a goal as a self-healing system, but it does show that LLMs can allow you to do other things that are interesting. There are two other prototypes that I think might be interesting, and I'll speak about them very briefly. One of them is to create a diagram out of a post mortem showing where things went wrong. Post mortems are actually kind of painful for engineers to write, no one wants to do that, you need to remember what happened and bring all of it together. Actually, that's what's good with LLMs, and that's something we have in Rootly: Rootly will draft a post mortem for you and then you just have to review it, and chances are the post mortem is going to be great. And then the next step that we tried with Rootly AI Labs is offering another way to consume a post mortem, and a visual way may help, especially for non-engineering audiences, to understand where the failure happened and why another service that may seem totally unrelated was down as well. So the way it works is that it will ingest the post mortem, make sense of it as JSON, then ingest your code base and infrastructure as code and make JSON out of that, and then merge the two and create a Markdown graph. So yeah, that's another way to leverage LLMs which can ultimately help an SRE team do their job more efficiently. I like how you called out that after it drafts a post mortem, someone actually has to review what it's created. Like, no, don't just take that and start sending it to people as the official thing. If you take an LLM-generated post mortem and you put it up publicly, you will for sure get harassed on Bluesky and so on very quickly about how you spent zero effort making sure that it was accurate. It's very easy to identify LLM-generated stuff like that.
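A minimal sketch of that post-mortem-to-diagram idea, assuming the two intermediate JSON shapes Sylvain mentions and emitting Mermaid source, which is one common way to render a graph inside a Markdown document; the actual prototype's formats may well differ.

```python
def postmortem_to_graph(postmortem_json: dict, infra_json: dict) -> str:
    """Merge the failure chain extracted from a post mortem with the service
    dependencies extracted from infrastructure-as-code, and return Mermaid
    flowchart source (which Markdown renderers display as a diagram).
    All field names here are illustrative."""
    lines = ["graph TD"]
    for dep in infra_json["dependencies"]:          # e.g. {"from": "api", "to": "db"}
        lines.append(f'    {dep["from"]} --> {dep["to"]}')
    for step in postmortem_json["failure_chain"]:   # e.g. {"service": "db", "event": "connection pool exhausted"}
        lines.append(f'    {step["service"]}:::failed')
        lines.append(f'    {step["service"]} -.- {step["service"]}_note["{step["event"]}"]')
    lines.append("    classDef failed stroke:#f00,stroke-width:2px;")
    return "\n".join(lines)
```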
And the second thing that we've built that may be of interest to the audience is an on-call burnout detector. I think that's particularly interesting for companies that are distributed, where a manager may not be in touch as much with what the team is doing, and especially for large companies. So what we do is we feed in all the data associated with an incident responder: how long their shifts were over the last week, how many incidents they had to troubleshoot, what the severity of those incidents was, how long they were working during the night, and many other things that are mostly unstructured data. So again, LLMs are great at this, and from this an LLM can come up with kind of a burnout level and say, hey, this person got hit very hard with a bunch of tough incidents, you may consider giving them a break. So as long as it doesn't also suggest the therapy that might be necessary and try to provide that, I think you're on the right track there. Yeah, I mean, it can be difficult to see the differences between individuals. Some of them are way more interested in actually jumping in and diving in and trying to identify those problems and solve them, and others care more about the routine. But I don't think in the history of my engineering career I ever saw someone jump up and down and say, yes, I would love to be woken up at three a.m. and jump on a call with other people and try to figure out what was happening. So, you know, I think you're definitely onto something interesting there. Yeah, I was, when I was young, because I wanted to learn, but I'm definitely not into that anymore, so go for it. Yeah, I think that's an interesting point, because over your career things change for you, and maybe at some point you are willing to make some sacrifices. But I don't know if that was the case for me. I remember my first job out of university, there would be incidents in the middle of the night, and I never had to deal with that sort of thing in my life up until that point. I didn't run my own data center in my home, and even if I did, it was not at the point where you'd be getting alerts waking you up to deal with one of your virtual machines failing. And at university it wasn't a thing; you're definitely awake while you're causing problems, right, things aren't happening while you're sleeping. And so at my first job this would happen, and I definitely came away from that with the idea that this is wrong, like, I don't ever want to be woken up in the middle of the night, it's not a requirement. And since then I've really been on the path of highly reliable systems. And I think the part that really stumps a lot of people is they focus a lot on the preventative side, that they can try to prevent every problem: oh, get one hundred percent test coverage, or have a highly reliable solution by duplicating the infrastructure in multiple regions. And I mean, the thing I think you said at the beginning of the episode is that it will go down. You cannot have a one hundred percent reliable system, and so at some point you have to optimize for recovery and not just prevention.
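For illustration, the on-call burnout detector described above could be as simple as handing each responder's recent record to an LLM and asking for a ranked score. The prompt, the JSON shape, and the record fields below are hypothetical.

```python
import json

BURNOUT_PROMPT = """You are helping an engineering manager spot on-call burnout.
Given one responder's recent on-call record, rate burnout risk from 0 (fine)
to 10 (needs a break immediately) and explain why in one sentence.
Respond as JSON: {{"score": 0, "reason": "..."}}

{record}"""

def burnout_report(responders: list[dict], llm) -> list[dict]:
    """Each record might include shift length, incidents troubleshot, their
    severity, and minutes worked overnight -- the mostly unstructured signals
    mentioned here. `llm` is any text-in/text-out callable."""
    report = []
    for person in responders:
        raw = llm(BURNOUT_PROMPT.format(record=json.dumps(person, indent=2)))
        result = json.loads(raw)
        report.append({"name": person["name"], **result})
    # Highest risk first, so a manager can act on the top of the list.
    return sorted(report, key=lambda r: r["score"], reverse=True)
```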
And this is where I think a lot of people get stuck, because at our company we have a five nines reliability SLA, and that means that by the time someone gets alerted and they get online, we've already violated the SLA, let alone identified and fixed the problem. That's a great point you bring up. First of all, getting woken up at three a.m. is never a pleasant experience, and it takes time for your brain to get into it. Maybe you were in some deep sleep and you wake up kind of having a little panic attack, or something tough on your body, and then you need time to ingest the data and so on and so forth. So we know how hard it is on your body and your mind. I think that's where AI SREs, which are like self-healing systems or tools that can lead to that, can help: this tool can ingest so much data in such a small amount of time and give you something to get started with, like an initial root cause analysis. Then by the time you get to your computer, you already have something ready to look at; hopefully it's ninety five percent confidence and you just have to push the fix it suggests. I think it's such a great tool, and I think it will have a great positive impact on the health of our people. The second thing I think is interesting with these tools is that you mentioned five nines, and we know that it's possible to get five nines or six nines, but the companies that are achieving that, the Googles of the world, are investing huge amounts of resources, human and financial, to reach that level, and for the rest of us, the rest of the businesses, it's simply not possible, until today. I believe that these self-healing tools will allow companies to reach this type of SLA without spending the budget that Google does. And I think that's truly going to redefine the SRE space. Yeah, I mean, I will say that one of the biggest struggles we have is actually customer perspective alignment. It's a challenge for us to know what the status of our system is; it's subjective. Is it up or down? It's not like you can look at some chart and have the answer there. And what's even more important is that if we believe that our system is up, our customers also believe that our system is up, because this mismatch is really what you're trying to solve for. Reliability is not whether you think it's up, it's whether or not the people that are paying you money to run some system believe it is, and that customer expectation alignment is actually a huge challenge, and I'm not sure you can fundamentally solve that problem. But yeah, I do think there's a huge gap between where a lot of companies are at, which is their software going down at least once a week, and something much further than that. Yeah, so I do think LLMs can help with that in some capacity. Maybe I can jump in and also share a little bit about what the challenges of building these types of tools are. Yeah, please, I'm dying to know. Right. So, I think one of the hardest things, which I think is a big weakness, is obviously the non-deterministic part of the system.
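As an editorial aside on the five-nines point above, the arithmetic makes clear why a purely human-speed response can't meet that SLA on its own:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

for label, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    downtime_minutes = SECONDS_PER_YEAR * (1 - availability) / 60
    print(f"{label}: about {downtime_minutes:.1f} minutes of downtime allowed per year")

# five nines works out to roughly 5.3 minutes per year, about 26 seconds a month,
# which is less than the time it usually takes to page someone and open a laptop.
```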
And here I think the old adage, you cannot improve or fix what you cannot measure, applies very well. For LLMs, even if you provide the same input, the output will be different. And so it's very hard for an engineering team to ensure that, one, my system is running well; as you say, it's subjective, and I think here it's even more subjective, because it's not a matter of just, hey, am I getting a two hundred or a five hundred, or maybe it's a two hundred but with too much latency. We're speaking about an output which is natural language. And two, is my output better or worse? So that's a big challenge in building these systems. And another point is that these systems don't have skin in the game. LLMs are like dream machines: they are designed to put together chains of tokens that are, statistically, the most likely to be pleasing, and sometimes what they assemble is not rooted in reality, but they still did their job as designed. And so if we compare this to a human: let's say you're my manager and I'm working on troubleshooting this incident, and I'm like, hey, I think that's the issue, I think this is where we should look. I have skin in the game, right, I'm putting my skills on the line, and so when I share this with you, I have a certain degree of certainty that this is a probable cause. For LLMs, there is none of that. So what we've done at Rootly is that we have two types of agents. We have the master agent, which orchestrates sub-agents that are in charge of doing the work of gathering data, trying to understand, doing the grunt work, and then coming up with an answer. And the master agent will make sure that the overall narrative makes sense, and catch a sub-agent that's coming up with something that doesn't make sense, like a manager would do. So what's funny with LLMs is that it kind of mimics a human dynamic. Yeah, no, I feel like the most common questions I end up asking are how do you know, and why now, and LLMs are not so good at answering those, especially when a bunch of changes all stack together to cause the problem, right? You look at individual changes and they all seem fine, and only together do they cause the issue. So I do see that sort of interaction as necessary. I do want to ask you about your models, though. Are you taking some foundational model out there that's available, open source, and tuning it, or are you building it up from scratch? Is there one particular company's models that you like more than others? What does this look like for you? Yeah, so I think the assumption that anyone not deep in the space would make is that you need to train models, that you need to tune them using, in our case, your customer data, or, if you are building this internally, your specific data. What we found is that this is actually not needed for most incidents: off-the-shelf models work perfectly fine and will find most of the issues. Training models is actually really hard and really costly, and we haven't found the need so far. We're still early in the space, so I think we'll get to it eventually, but for now we're finding the most value by not doing it.
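The master-and-sub-agents pattern described a moment ago can be sketched very simply; the prompts, agent names, and incident fields below are invented, and a real orchestrator would add tool calls, retries, and the confidence scoring discussed earlier.

```python
def investigate(incident: dict, llm) -> dict:
    """Minimal sketch of the orchestration pattern: sub-agents each dig through
    one source, then a 'manager' pass checks that the combined narrative hangs
    together before anything is shown to a human."""
    sub_agents = {
        "logs": "Summarize anything anomalous in these logs:\n{data}",
        "metrics": "Describe unusual patterns in these metrics:\n{data}",
        "changes": "Which of these recent commits could plausibly cause the symptoms?\n{data}",
    }
    findings = {
        name: llm(prompt.format(data=incident[name]))
        for name, prompt in sub_agents.items()
    }
    review = llm(
        "You are the lead investigator. Do these findings form one consistent story? "
        "If any finding contradicts the others, flag it.\n" + str(findings)
    )
    return {"findings": findings, "review": review}
```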
Again, it's difficult, it's expensive, and then there is a lot of skepticism and, I think, issues around privacy and security: companies don't want their data going into LLMs, even if we would do this only for their own model. So what we found matters the most is really the context that you provide, and what we've found to be the most valuable is the non-technical stuff. What I mean by this is the human-generated context, and when we link this to Rootly, it's two things. The first one is the former post mortems; this is a gold mine of information. Most of the time, if an area of your system is unstable, it will remain unstable at least for some period of time. Generally you have action items that your team is supposed to implement; sometimes the action items are done, sometimes not, there is always a prioritization issue between we need to release this feature versus fix this potential bug. And the second thing is all the communications happening on Slack or Teams or Zoom or Google Meet and whatnot, because that's how incidents are solved, right, it's humans communicating with each other and sharing so much information that's business specific. LLMs are trained on a ton of data that's online, but it's obviously not specific to a company, and so we found that this data really boosts the results that we get out of these tools. Yeah, I think you said it a different way: I like the idea that you have to pull in the business criteria, understanding, and context in order to get a valuable output. And I think it can even be more than that. It's the fundamental nature of the LLMs that we have today; it's a transformer architecture, which is fundamentally lacking the reasoning piece, like, they'll never be able to reason, which means they'll never be able to make a decision based off of the business context. But they'll be able to do a little bit better at pulling that in and combining it with the output they would normally produce. So one of my questions here is, okay, building up a foundational model, and I think we've heard this before on Adventures in DevOps, is incredibly expensive. Also, the industry is moving so quickly that the newest foundational models are just as good, so spending money on building a new one doesn't make sense. Actually, I think we heard one time that even fine-tuning models doesn't make sense, because the next generation, like, say, Anthropic's Claude three point seven versus three point five, is not really that much of an improvement, but you are getting more up-to-date data, the timestamp has changed. And if you spend time fine-tuning it, then by the time the next one comes out, all your fine-tuning, first of all, is a waste, second of all, is expensive, and third of all, you may be able to throw your queries, your prompts, at the new model and get the right answer out anyway. So it's good to hear that. Does that mean you're using something from Ollama or DeepSeek or something like that? We found that what Anthropic provides is generally the best performing. Ultimately, we integrate a number of different model providers and we use different models at different steps of the process.
I cannot explain it in detail because it would be too long, but basically the agent will come up with an initial prompt that we compose, and different models are better at different parts: coming up with, let's say, the master thesis of what we need to look for might be better done by one model, and then the actual technical part may be done better by another model. And it's a moving target, as you said; the industry is moving fast, there is a constant flow of new models, so I don't think it's something that is really set in stone. Do you do something to validate model changes? So, for instance, when three point seven came out, were you still using three point five as before? Or do you have some sort of evaluation templates or system prompts that you can throw at it and validate that the answers still make sense, that the RCAs and post mortems you're doing are still understandable and match, somehow validating the outputs? How does this process work for you? Yeah, LangChain has a bunch of open-source tools that allow you to do this. So we are constantly tracing everything; it's kind of a tree, right, with different nodes and paths, and we keep track of everything that's being done, the reasoning, the output, and we constantly measure the performance. So that's definitely something that we do. That being said, I think it's still a challenge to really understand how the quality is shifting. There was a talk at SREcon in Santa Clara a few weeks ago, I think it was an AI director at Azure, who was speaking about one of their model-based products, and he was saying that it's very hard for them to understand how to measure that, and they are relying on NPS, the Net Promoter Score, which is basically an industry-standard rating, like, would you recommend this to your friends and family, to assess how their models are doing. And I think that was really shocking to the audience. I mean, we do know from experience that NPS is totally wrong from a measurement standpoint, because, from a human psychology standpoint, you should never ask someone what they would do; you should use metrics on what they have done. And I think that's often the problem. But I think it really goes to show that there is no good way of adequately measuring these things. You have to do it within the context of what your business is doing, for instance, really being able to do the incident management. And I do think, at least I know I have this question, so I'm sure someone else does, that you're getting to the point where you don't want to have to make the code changes, to go into GitHub or GitLab or, heaven forbid, Bitbucket or one of the other ones, to actually put up a pull request to fix the problem. Wouldn't it be great if there was another LLM out there that had the context of the source code and everything, and you could just give it the output from Rootly and have a different agent do that? And that, for me, means you need to somehow integrate with other agents. And I can't believe I'm saying this, but MCP, Model Context Protocol, like, how do you feel about that? I'm a huge fan of this. I think that's exactly the architecture that you need to have in mind.
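Editorial aside on the model-validation question just discussed: a model swap can be gated on a small regression suite of past incidents with known-good diagnoses. This is only an illustrative sketch; Rootly's actual evaluation uses the tracing tooling mentioned above, and the cases and keyword scoring here are made up.

```python
GOLDEN_CASES = [
    # Frozen incidents with a reference diagnosis, used as a regression suite
    # whenever a model upgrade is considered. Contents are illustrative.
    {"context": "api latency spiked right after the 14:02 deploy", "expected_keywords": ["deploy", "rollback"]},
    {"context": "disk full on db-replica-2, writes failing", "expected_keywords": ["disk", "replica"]},
]

def score_model(candidate_llm) -> float:
    """Crude keyword-based scoring; real setups trace every step and often add
    an LLM-as-judge pass instead of exact keyword matches."""
    hits = 0
    for case in GOLDEN_CASES:
        answer = candidate_llm(f"Diagnose this incident:\n{case['context']}").lower()
        if all(keyword in answer for keyword in case["expected_keywords"]):
            hits += 1
    return hits / len(GOLDEN_CASES)

# Only promote the new model if it does at least as well as the current one:
# if score_model(new_model) >= score_model(current_model): switch.
```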
It's not one agent, it's a collection of agents. And it can go as deep as, let's say you are doing work with GitHub: you can have an agent working on commits, one on pull requests, one on each type of resource that you may have with GitHub. You really need to tailor the agents for them to do their best job, because, again, they don't always have the business logic and understanding that we do as humans, and bringing that context into each of these small sub-agents is critical. As for MCP, I'm a big fan of MCP. I've been wearing an MCP badge at SREcon and KubeCon, because I truly think, well, the technology itself is nothing amazing, at the end of the day it's just a gateway to an API. I mean, I really can't stress that enough for anyone who's not caught up on this: it's nothing special. Just imagine you deploy a new API or a reverse proxy or CDN in front of your existing software and you're mapping from one protocol to another, like from TCP to UDP, from HTTP to, or from REST to gRPC. It's really just another one. And I think the joke right now is that the S in MCP stands for security. Yeah, I think we go back to what we discussed at the beginning of the conversation: you have the engineers who will be naysayers, and obviously there are a lot of things that are wrong with this protocol, it's not stable, it's full of bugs, security is absolutely not a concern. But I think what's interesting is the concept of bridging the world between these AI agents and all the resources that are out there, whether it's data or systems, and MCP, just think of it as USB, right, it's facilitating the conversation between these two entities. It's open source and it really unleashes a lot of power for AI agents and data sources, which, as you say, most of the time are going to be REST APIs, to communicate in a way that's very optimized. Because there are many issues with agents consuming, most of the time, REST APIs: as we said, they may not have the business logic, getting a piece of information from an API may require multiple calls to multiple routes, and the LLM may not know if that's feasible, may not take the best path, may get lost. And so MCP is really enabling this, removing all of this complexity. For instance, I built an MCP server for Rootly, and what it allows developers to do is, when they get paged, instead of opening the web app, going to Rootly, looking at the incident, which takes time and context switching, which we know is bad for developers, they can just ask in their favorite editor, get me the latest incident. It's going to pop up in their chat, and, assuming it's simple enough and there is enough data in the payload of the incident, you can ask it to fix the incident. And so you go from production incident to resolution in a matter of minutes. Again, as you said, some people are like, yeah, it's a joke, it's not revolutionary. True, it's not. But I think what's great is that it enables workflows and it removes a lot of friction. And we see a lot of companies and customers, like Canva and Databricks.
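As a concrete illustration of that workflow, here is a toy MCP server exposing a single "get the latest incident" tool, written against what I believe is the official MCP Python SDK's FastMCP helper; the incident API URL, auth scheme, and response shape are assumptions, not Rootly's real API.

```python
import os
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("incidents")

@mcp.tool()
def get_latest_incident() -> dict:
    """Return the most recent incident so the editor's agent can start from it."""
    resp = httpx.get(
        "https://api.example-incident-tool.com/v1/incidents?page[size]=1",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {os.environ['INCIDENT_API_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio so an editor's agent can attach to it
```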
They are huge engineering organizations, and they're investing so much into MCP because they want their developers to remain where they produce the most value, which is in their IDE, and so they are trying to bring in as many MCP servers as they can. And it doesn't really matter if it's MCP; actually, IBM released a competing protocol called ACP, which does the same thing. But they're trying to bring all the context that engineers need to do their work into the IDE, and MCP is allowing just that. I think I'd be remiss if I didn't bring up Randall Munroe's comic: we have fourteen competing standards, you know what, we need one universal standard to cover everything, and then some time later we have fifteen competing standards. I mean, there really are. AWS came out not long ago with Smithy for HTTP services, a design pattern for documenting their APIs. We had the OpenAPI Specification, it's on version three point one right now, so that's three versions later, and there's a whole bunch of these that different companies use. And I think the biggest trouble a lot of them have, like we have an OpenAPI specification for ours, is that even getting a human to understand what was written there is quite challenging, so feeding that into a model is nonsensical, it's just not going to get you there. As you pointed out, often the pattern requires multiple calls. I mean, we have things like GraphQL, which has its own problems and whatnot. So I think we're just going to keep seeing more of these, and I don't think we're ever going to really be able to settle on one. It would be nice if we could have one. The thing about MCP is, even if we pretend for one moment it's the worst thing in the world, as you pointed out, I think Azure, GCP, and AWS all released MCP servers for their built-in AI products, so you can interact with AWS Bedrock through an MCP server. So, irrelevant of what you think about it, it now exists and large companies have put some effort behind it, and maybe they're just trying to capture some of the market share and later things can evolve. I do think that, especially with a lot of companies going for speed over quality, we may not get that many more iterations of the protocol to make it work. I mean, I'll take this over using high-frequency sound to communicate between devices that have LLMs; I don't need that. It can go over the internet, please; that's where I'm comfortable with my security. I'm not comfortable with things going through the airwaves, because otherwise it's going to be, Alexa, please order me another twenty-four-roll pack of toilet paper, coming from an advertisement running on my television, and actually have it happen. And, this is recorded, that shouldn't happen; I don't need that to happen, but people will have this happening. So I think MCP is still a little bit more secure than some of these other protocols that are out there. It is. Yeah, again, I'm not an MCP evangelist; I'm vouching for the concept, not necessarily the technology. I think there are some serious limitations, a lot of issues with it. One of them is security, which I think we've already discussed, so we won't relitigate that.
But I think one issue is, for instance, you spoke about OpenAPI: you can actually feed in your OpenAPI spec and MCP can use it as a reference, which is great, because if your API spec is constantly updated with the latest state and translated into MCP, then you make sure your MCP server is always up to date. What we found out at Rootly is that, because we work with large corporations like LinkedIn, Canva, and Cisco and so on, they have very specific requests for how they want to run their incident management. So our API is very verbose, we have a lot of routes to please our customers, and if you expose all of this to MCP, it's going to get lost in it, even though it's supposed to be able to handle it. So you need to restrict the number of routes that you expose. And the second thing is, even at the next level in the MCP server chain, within the client, in the editor, what people recommend is that you have at most five to ten MCP servers; after that your local agent is going to get lost, because, again, too much context. So this technology, I don't know if it's going to mature or if something is going to replace it. You could envision maybe something that centralizes these MCP servers into central hubs so you don't have to configure fifty of them. But I think it's on the right track, and we see adoption, but we will see where this moves. OpenAI recently announced that they are supporting MCP, which is interesting because they're competing with Anthropic. So yeah, I think there will be more of this for sure. And actually my pick at the end of the episode will be related to that, so I think it's really interesting that you brought it up. Yeah, I mean, there's a lot there, realistically, and unless you need it, you probably don't need to spend any time looking at MCP. You know, it's highly specific to agents communicating with each other. I think the hard problem that we'll get to very quickly is that at scale, being concise and meaningful and focused on what the business value is will be even more important. And arguably it has always been important, but it's very easy to add another route to your OpenAPI specification or your web service or whatever you're running and say users will just deal with it, right, they'll deal with the problem. And I think, realistically, you want to be as clear and concise as possible about what you're offering and what your business is and what the product does, but still give your customers freedom to utilize your product how they want. And now you are almost required to make that happen because of limited context windows for LLMs, for agents; for MCP it's going to be even more of a problem. I mean, you scared me by saying five to ten. I feel like if you have any more than one, you really have to question what the thing is that you're fundamentally offering. I do see platforms like Atlassian's, where you may have one for Jira and one for Confluence, because that's like a knowledge base and there are the day-to-day issues, and one maybe for the Git server; each one of those could potentially be a different server. You say you're not an evangelist, but you are the first person on this podcast to come on and say MCP, so I think that, by definition, makes you the evangelist.
And I think this may be a good moment to switch over to picks. But before we do that, I'll ask you, is there any one last thing that you want to share? Yeah, if you are curious about MCP, and, you know, I've been to KubeCon and SREcon and the vast majority of people still don't know about it, we are organizing an event on April twenty fourth at GitHub in San Francisco. We'll have speakers from Anthropic, OpenAI, GitHub, Factory AI, and a lot of other companies. We'll have demos and a panel where we'll go over what the heck MCP is and, more broadly, as we've been chatting about, what this means for the industry and where it's going. So yeah, you can type MCP Rootly event GitHub into Google, or perhaps we can share it in the description of the episode. Yeah, for sure, there'll be a link. Okay, then I think it's a great point to move on to our picks. So I'll go first. My pick is a short article online by Ed Zitron. He has a blog, and the article is called Where's the Money? He's arguing that there's no AI revolution: if you look at companies like Anthropic and OpenAI, they're funneling tons of money into it and they're not getting the value out, so in a way they're doing the nice thing of subsidizing all our great AI usage, so get it while the fountain is flowing. He's really got a great one here, and it seems there are more out there. It's just a really great breakdown of how companies are supposed to work, where the money is coming from, where it's being spent, and it challenges some of those assumptions. So if you are only optimistic about everything related to AI, I highly recommend reading the article, because there are a bunch of really good points made that are hard to argue against. Love it. Yeah, that's an interesting one. I think AGI, and the goal of getting to this great intelligence, that's why the money just keeps pouring in. Yeah, I mean, there is this theory that basically we can spend literally all of humanity's resources to achieve this, because once we have it, it will produce so much value. That theory hasn't been proven yet, but I'll leave it to people to read the article; he's articulated it much better than I have. Okay, so what have you got for us today? Well, my pick is going to be a piece that I wrote, and I know it's going to be controversial, which is why I want to share it even more. We didn't speak a lot about this on the episode, but online everybody is speaking about vibe coding, and so I think what's coming for us SREs is incident vibing, because the amount of incidents that is going to come our way is probably going to increase. And more importantly, I think a lot of the fundamentals that make an engineering organization solid are going away. A few things. For instance, a team that knows their code base very well is kind of going away, because humans are not doing the coding anymore, right, they are merely reading it, doing code review. Perhaps they will even use another LLM, another model, to do the code review of the first model's output. But anyway, I think in general the knowledge of the code base is going to go down. The other one is having subject matter experts in some fields, especially as your company grows.
You know, let's say maybe you want someone very sharp on databases or web servers or whatever it is. And this, again, is going away because of what I've just mentioned, but also because I think it's going to be increasingly harder for young professionals to gain the experience and the flair that senior engineers have. And so what's the solution? I think it's incident vibing, and I think it's one of those stories where, if you cannot beat them, you should join them. And so in this article I speak about some of the ways that companies can get ready for incident vibing. I love it. Well, we'll share that in the picks section of the episode. I mean, I both love and hate your pick, honestly, because I'm so with you that vibe coding is terrible. And if we look at the DORA Report, or the episode we did on the DORA Report from twenty twenty four, we see that LLMs sacrifice quality for speed. We also know that there's a huge problem coming and companies are still adopting it, so you have to live with the outcome: even if you are using LLMs as best as you can, that means you're going to get more incidents. And so I'm totally with you. I hate that this is happening, but there's no avoiding it, and so the next level is also vibing the incident resolution. Okay, it is. And we've seen companies hiring engineers who cannot code, they can only prompt. And yeah, whether you like it or not, it's happening, it's coming, it's the future of software engineering in some capacity, and so I just think we need to get ready for it. That's the only thing you can do. I mean, I love the perspective. You know, it doesn't matter if you agree or disagree with utilizing it; it's happening. And with that, I'll say thank you, Sylvain, so much for coming on this episode and sharing your perspective and what Rootly has been doing. Thank you for inviting me. Yeah, and thanks to all the listeners and viewers of this podcast.