Warren Parad (00:00)
Welcome back to Adventures in DevOps. Today we're going to review the complexities of building models at enterprise organizations. And hopefully we'll find AI success stories.

Our guest has spent his career on large-scale systems across machine learning, data infrastructure, compilers, and cloud-native architectures. His work spans graph processing, multimodal inference, and RAG. The co-founder and CTO at Corvik, Donald Nguyen. Welcome to the show.

Donald (00:27)
Thanks for having me.

Warren Parad (00:29)
So I have to ask, how large scale is large?

Donald (00:33)
I mean, large scale can mean a lot of things to a lot of people. I think one part that's important to get at is that, in addition to scale, where we can talk about gigabytes, petabytes and stuff like that, it's about complexity of the data, right? In terms of the volume you're talking about, it's not a lot, but oh my God, the computer can read my PDFs, it can see these pictures, right? This data that we actually always had that was just sitting there, just dead, sitting in S3 or something like that. Now it means something. And once you do that, then you start asking, well, if we can see these pictures, what if it knows a little bit about some other data, and then you start connecting it. And those links and those combinations, they create like a combinatorial explosion of complexity. And I think a lot of the difficulty people see in managing scale is not about the volume, but how these interactions play out.

Warren Parad (01:27)
So for them, they're storing all the data in S3 or somewhere else. How much is that really, that you commonly see in practice? Are we talking about gigabytes? Is it terabytes, petabytes, et cetera?

Donald (01:28)
Yeah.

I think for a lot of the cases, we're talking about gigabytes, but it's really about data that's been untapped, these PDFs, and about the pipelines you have to build in order to build meaning on top of it. Right? And so one thing is you store the data. Good. Second thing is you have something that can just do a keyword search over that. Great. Third thing is I want to build a specific way of interpreting this data because I'm an auditor or something like that, and I want to find whether or not this document is in compliance with my parameters. That's a very complicated query to express in terms of keyword search. So you have to start building these additional indexes on top of it. Like I said, it's the complexity more than anything else.
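The three stages Donald lists (store the data, keyword search, a use-case-specific index) can be sketched roughly as below. This is a toy illustration only, not a description of any real pipeline: the documents, the `extract_fields` helper, and the 30-day payment-terms rule are all made up for the example.

```python
import re

# Stage 1: stored documents, standing in for text already pulled out of PDFs.
documents = {
    "invoice-001": "Invoice from Acme. Payment terms: net 30 days. Total: 1200 USD.",
    "invoice-002": "Invoice from Globex. Payment terms: net 90 days. Total: 450 USD.",
    "manual-001": "User manual. Payment is not discussed here.",
}

# Stage 2: plain keyword search over the stored text.
def keyword_search(term):
    return sorted(doc_id for doc_id, text in documents.items()
                  if term.lower() in text.lower())

# Stage 3: a derived index built for one viewer (an auditor, say).
# extract_fields is a hypothetical helper that pulls out the one field
# this auditor cares about; other viewers would extract other fields.
def extract_fields(text):
    match = re.search(r"net (\d+) days", text)
    return {"net_days": int(match.group(1))} if match else None

audit_index = {}
for doc_id, text in documents.items():
    fields = extract_fields(text)
    if fields is not None:
        audit_index[doc_id] = fields

def out_of_compliance(max_net_days=30):
    # The "compliance" rule is an assumption made up for this sketch.
    return sorted(doc_id for doc_id, fields in audit_index.items()
                  if fields["net_days"] > max_net_days)

print(keyword_search("payment"))
print(out_of_compliance())
```

The keyword search matches every document that mentions payment, including the manual, while the compliance question can only be answered from the derived index.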

Warren Parad (02:21)
It sounds like really a focus on the knowledge bases of LLM systems. It's not necessarily first-class business analytics or data regarding the end product. When you say PDFs, the first thing that comes into my mind are invoices and manuals and maybe some documentation or specifications that you have written out, and you want to answer some critical questions that help you basically get to the next step of whatever the business is trying to achieve.

Donald (02:49)
Yeah, I mean, I think that's a good starting point. Think of what differentiates one organization from another in a digital world. It's basically the data they have and the kind of prior they have about their data. I'm a supermarket, I've got a bunch of things about my customers. I'm an investment bank, I have different things about my clients.

Warren Parad (03:02)
Mm-hmm.

Donald (03:11)
In terms of the mechanical interactions that they're doing, the transactional things that they're doing, it's very similar. You intake a customer, you intake a client, you collect some data about it, but the actual data makes the difference between JP Morgan and

Warren Parad (03:27)
So one thing I saw in a lot of companies that I've worked with in one way or another is that they were storing just masses of data that was completely and utterly useless. And so one of the biggest challenges was always, I'll say cleaning, I think the canonical term is cleaning the data. But I feel like it's just so much more than that, where it's even figuring out... I mean, there's like first-order effects, like you can be like, well, that's just analytic or log data on some transactions that happened decades ago,

Donald (03:37)
Yeah.

Yeah. Yeah, yeah.

Warren Parad (03:56)
probably useless as compared to some stuff that's relevant for right now. Do you see that a majority of the data falls into the category of complete garbage that needs to get removed before you can start utilizing the data effectively? Or is that just a very small part of it?

Donald (04:13)
So one thing that's interesting, I'll talk to the latter, is that in an enterprise setting, you're talking about all these different stakeholders, right? And so the people collecting the data are very different than the people who are actually using the data. They're separated. And they speak completely different languages. On the data collection side, you're talking about ETL and Airbyte and Kafka. And then on the data analysis side it's like Excel, right? A spreadsheet or something like this, right? And when you create that separation of responsibility, obviously if I'm just charged with collecting data, I'm very precious about collecting everything, because someone can ask me, where is that data? And I'll feel bad if I don't have it. But it creates a weird over-indexing in the wrong ways. Because at the end of the day, the whole purpose of this is to drive business outcomes. And so if you're just collecting stuff because you want to collect stuff, because you're not sure what you need, you create a weird ecosystem. So I think a lot of that comes from that: just because we can, should we? And then secondly, yeah, a lot of data is garbage.

I guess another specific case is I used to work at a graph database company and we were doing these complicated graph neural network embeddings. So if people are familiar with language embeddings that we use for retrieval augmented generation, imagine you could build embeddings not over words and sentences, but over relationships of things. So it would be nice to identify, like you're in a social network and you're like a hub, a very important person in that social network. You need to know who you're related to, and then you can imagine an embedding that kind of represents that latently. And it's a very sophisticated technology, and then we were building out these pipelines for various sophisticated use cases around fraud detection.

And you know, maybe it's better in like 1% of the cases, but it's so complicated to use, so difficult to use, and you have to know exactly what this embedding actually means before you build on top of it.

Warren Parad (06:06)
Embeddings was one of those words where it took me a lot of years before I was able to map it to, oh, I'm just changing the representation of data in my database from, you know, saving it as a string to saving it as a list of numbers.

Donald (06:07)
Yeah.

Warren Parad (06:20)
I don't understand why it was picked, but that one definitely held me back from true understanding of what was happening. But once I got that, I'm like, yeah, OK, well then, I totally understand. You store the data in a different format, like you're doing the CQRS pattern, command query responsibility segregation. And why store it as a string when storing it as something else allows you to search and filter and get the results back faster? It's interesting you bring up relationships in the graph world.

Donald (06:23)
Oh yeah.

Yeah. Yeah, yeah, yeah.

Warren Parad (06:49)
What did that actually look like in practice? Was there something special that had to be done in order to figure out how to actually store those relationships in a vector database? Or are there common strategies already, and it was pretty much just converting the format of the data that you had to a different format so that it could be used in a more optimized way at runtime?

Donald (07:10)
The way I think of an embedding, it's like there's the universe that we're in, and then there's a different universe that has a different type of meaning. And an embedding is just sort of, in this current universe, right, the word cat and the word dog sound different, but in the other universe, they're very close together, they sound the same.
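As a rough illustration of "close together in the other universe", here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-written for the example and are not the output of any real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this sketch.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.0, 0.9],
}

def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(nearest("cat"))
```

As strings, "cat" and "dog" share no letters, but in this made-up vector space they sit far closer to each other than either does to "invoice".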

Warren Parad (07:25)
Yeah. Yeah.

Donald (07:30)
What is curious about relationships, and just things above the surface kind of semantic embedding type of thing, is that it's very use-case specific. Look at a PDF.

If I am like a business person and I want to see an invoice, I have to see these certain fields. If I'm an artist, I think of this invoice very differently. I think of like the meaning of commercialization in culture, something like this. And that's the type of stuff that is not inherent in the data. It's something that comes from the viewer of the data, and they have to impose a bias to say, look at this, but consider it like this. And a lot of the challenge about building on top of relationships is trying to answer that question. Say, if I'm looking at this, what connections do I want to draw? Do I want to consider this invoice very similar to this other invoice? Or do I want to consider it very differently because they use a different name? It really depends on what you're trying to do.
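One way to picture the invoice question is a similarity score whose weights come from the viewer rather than from the data. The sketch below is an assumption-laden toy: the field names, the two "views", and their weights are all invented for illustration.

```python
# Two invoices that differ only in how the vendor name is written.
invoice_a = {"vendor": "Acme Corp", "amount": 1200, "currency": "USD"}
invoice_b = {"vendor": "ACME Corporation", "amount": 1200, "currency": "USD"}

def similarity(a, b, weights):
    """Weighted fraction of fields that match exactly, under one viewer's weights."""
    total = sum(weights.values())
    matched = sum(w for field, w in weights.items() if a[field] == b[field])
    return matched / total

# Hypothetical viewers: each imposes a different bias on the same data.
accounting_view = {"vendor": 0, "amount": 7, "currency": 3}    # spelling of the name is ignored
vendor_audit_view = {"vendor": 8, "amount": 1, "currency": 1}  # the name is what matters

print(similarity(invoice_a, invoice_b, accounting_view))
print(similarity(invoice_a, invoice_b, vendor_audit_view))
```

Under the first view the two invoices are identical; under the second they are mostly different, even though nothing in the data changed.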

Warren Parad (08:21)
That's incredibly novel, honestly.

I don't think anyone's ever brought that up, at least to me before, in a way. I used to treat the embedding models as pretty much just on a spectrum of how accurate they are at deriving the, like a...

Donald (08:28)
You

Warren Parad (08:38)
semantic meaning in the embedding result from the original words, but if you say, actually there's some business context or hidden meaning in what's actually captured, and that needs to be carried through into what you're persisting in the database, because without that you're losing a critical aspect, that tells me that you actually can have different models that are optimized for different use cases, specifically for embedding and not just for semantic validation or being able to query back later.

Donald (09:05)
Yeah.

Warren Parad (09:07)
That's scary, because now that means when you're selecting a model to populate your database, to actually calculate the embeddings for all your documents, you have to be concerned about not just how accurate the model is, but how it was designed to actually vectorize the original data, and with what intent. How do you actually evaluate that in practice?

Donald (09:08)
You

Welcome to the deep pool. How do we

This is where we'll use the phrasing of building a model and an embedding. But I think you'll have to understand that, technically, the way you solve this problem is not literally by training a model. What we are trying to do is all the same thing: to look at something and try to represent it in a different space so that other things that we consider similar are close in that other space. The way embedding models work is basically to do this using matrix multiply, where we look at a pre-existing corpus to figure out what the values that we're using in this matrix multiply are going to be. But as we sort of up-level, I guess, or make it more use-case specific, you could train a literal model based off your specific use case in order to understand this kind of thing.

But it turns out, one, that's very expensive. Two, you might not have enough data. And three, you want to elicit information that the human knows, the prior about their space or the problem they're trying to solve or whatever. And they'll just say, like, ignore this, right, or

And yeah.
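To make the "matrix multiply" framing concrete, here is a deliberately oversimplified sketch: a word-count vector multiplied by a small weight matrix, plus an `ignore` parameter standing in for the human saying "ignore this". Real embedding models stack many layers and nonlinearities on top of this idea; the vocabulary, the matrix `W`, and the `ignore` mechanism are all assumptions invented for the example.

```python
VOCAB = ["cat", "dog", "invoice", "total"]

# Hypothetical 4x2 weight matrix. In a real model these values would be
# learned from a pre-existing corpus; here they are written by hand.
W = [
    [1.0, 0.0],  # cat
    [0.9, 0.1],  # dog
    [0.0, 1.0],  # invoice
    [0.1, 0.9],  # total
]

def bag_of_words(text):
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [float(words.count(v)) for v in VOCAB]

def embed(text, ignore=()):
    """Embed text as (word counts) x W, after zeroing words the human said to ignore."""
    counts = bag_of_words(text)
    counts = [0.0 if VOCAB[i] in ignore else c for i, c in enumerate(counts)]
    # A 1x4 row vector times a 4x2 matrix gives a 1x2 embedding.
    return [sum(counts[i] * W[i][j] for i in range(len(VOCAB))) for j in range(2)]

print(embed("cat dog"))
print(embed("invoice total"))
print(embed("cat invoice", ignore=("invoice",)))
```

The mask is the cheap alternative Donald alludes to: instead of retraining on scarce data, the person's prior is applied directly to what gets embedded.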

Warren Parad (10:32)
I mean, you're just leading yourself to the next question. How do you actually manage those pipelines at scale? Because I think one of the challenges that has historically been there is, especially if you're in a space where your customers are providing you basically free-form data input. So maybe it's just CSVs or Excel files or PDFs, hopefully not, but even JSON that they've...

Donald (10:37)
Yep.

Warren Parad (10:55)
configured themselves to be an input to your system. The whole goal of your product is to say, well, there isn't a standard of how to do this. Or maybe there is, but they've decided not to follow that standard because that company makes too much money and they have their own use case. Or they're in health care or finance, and so who cares about standards there? And realistically, then you have this job of actually parsing that data correctly. And for the longest time, the challenge was, well, we have to stand up a pipeline specifically for that. Not only do we have to get the data out, but we sort of have to validate it, because even if a customer or user tells us exactly what the spec or schema is of that thing, they will send us incorrectly specced data afterwards. And I think this was a huge challenge because it pretty much meant, historically, for every single customer you had, you would have a new pipeline set up.

And I think maybe one of the hopes was that utilizing LLMs would allow you to dynamically create pipelines in a way that would avoid the need to maintain them. And I still think at the end of the day, though, you still have a huge explosion of these things that you still have to be fully aware of and control. And my question is, how in practice are you actually managing those?
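The validation step Warren describes, where data is checked against the schema the customer declared because later uploads often drift from it, might look something like this minimal sketch. The schema, field names, and sample batch are hypothetical; a production pipeline would more likely lean on an established schema-validation library.

```python
# Hypothetical schema a customer declared for their uploads.
SCHEMA = {"customer_id": str, "amount": float, "currency": str}

def validate_record(record, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors

def split_batch(records):
    """Separate conforming records from rejected ones, keeping the reasons."""
    accepted, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected

batch = [
    {"customer_id": "c-1", "amount": 12.5, "currency": "EUR"},
    {"customer_id": "c-2", "amount": "12,50", "currency": "EUR"},  # wrong type
    {"customer_id": "c-3", "currency": "USD", "note": "rush"},     # missing and extra field
]
accepted, rejected = split_batch(batch)
for record, errors in rejected:
    print(record["customer_id"], errors)
```

Keeping the rejection reasons alongside the record is what lets one generic pipeline report problems back per customer instead of needing a hand-built pipeline for each.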

Donald (12:07)
One, we want to create a situation where the system is very transparent about what it's doing.

There's no one standard about this. Like I said, either you might follow a process, or you just want the thing to be transparent, or whatever it produces is fine. You have to be very flexible about...

Warren Parad (12:21)
So I think one of the things that we're getting into here is that it is incredibly subjective whether or not the output is even good. I don't know that that's really the appropriate word there, but especially when it comes to the side where it's decidable by the business what to do with that outcome, what to do with the data that's being saved, what to do with whatever you've piped in, what to do with what you've cleaned, or when you do a...

Donald (12:46)
We used to have humans doing customer support. Now we inserted some kind of automation process so that first-line support is done by someone else. So then I shaved some head count, basically. And then with respect to whether or not that was a good decision, well, you clearly saved money. And then you do have some metric out there about dissatisfied customers or something like that. And as long as that doesn't increase dramatically, right?

Warren Parad (13:06)
You hope. I think that's one of the metrics that we sort of pray that companies are tracking, like how satisfied or dissatisfied. And I know after you file a support ticket you get an email saying, like, how helpful was I? I don't know if I trust that many companies are actually utilizing that to make decisions.

Donald (13:13)
Yeah

Yeah.

Yeah, it's pretty nebulous. And that's why you hear a lot of people, a lot of companies saying you need to just 10x your AI token usage. Why? We assume that this is increasing productivity. Therefore, we assume that the bottom line will be improved by this. But does anyone have a very concrete number that we should be aiming for?

Warren Parad (13:43)
Well, it's interesting you bring that up, because I actually spent the last couple of months basically interviewing software companies informally on their success stories regarding LLMs, and none of them can come up with really anything other than, well, there are some processes that we couldn't quite automate before and now we can. But I'm really hoping there's something more than that out there, I just haven't seen it. I see two things really. It's the very non-technical automation that I think could have been automated before, but may have required software engineering to make that happen, and they didn't have that expertise or capability. And the other one is engineers complaining that their organizations are requiring them to do something that doesn't actually have a positive impact for the business.

Donald (14:28)
Yeah.

We're still kind of working things out. Like, I feel like some fraction of people have a feeling like this should do something, right? But like the mechanics of it on the ground, I think are very...

It's like we're still searching about...

Warren Parad (14:44)
We're still searching for the value of cryptocurrencies. So 2009 maybe.

Donald (14:51)
Yeah, yeah, this is how you know you're old. You're like, that thing. Oh wait, that was a decade ago.

Warren Parad (14:54)
Well, I don't know if I'd say I've bet on it being successful.

But I definitely have been optimistic about it finding a use case. And now it almost has one with the micropayments processing for agent interactions. Cryptocurrencies and AI are now together. That's been the unique opportunity. So it could be a while, I think, before we find cornerstone use cases that really solve stuff outside of, say, workflow automation, which I think historically is always a pain, especially in organizations that don't have the technical expertise. And so if anything, it's lowered the barrier to improve those non-technical processes that organizations have in place.

Donald (15:40)
I think a bet that I'm making is that a lot of these things where you're just sort of a middleman, and you depended on encoding that business process in a non-scalable way, like that was a complete product, just that process, to do sales automation or whatever, and then another one. Those, I think, are going to slowly erode away.

Warren Parad (15:57)
Are you talking about like the integrations between two systems that don't directly communicate with each other? Because both of them have very different views,

but neither of them is willing to directly integrate. Like there's no public standard for how to do it. Someone has to write the adapter. And historically that meant you would have hundreds of thousands of adapters out there. And then these companies spin up all the time. There's a whole bunch of them. I think we've had some of them on the podcast in the past, actually, that talk about how their whole strategy was to dynamically generate adapters for every possible integration you could ever have. There's SDKs out there that supposedly do this, and products as well.

And I do think you're right, though, that those sorts of things go away when one of the two systems fully takes over the responsibility of that integration.

What's the canonical implementation of this that you are using internally or

Donald (16:50)
Firecracker VMs, just like spawn stuff, right? So like that traditional kind of sandbox serverless model.

Warren Parad (16:55)
Firecracker, that's like what AWS is utilizing to safely deploy instances of Lambda functions and execute them at a ridiculous

Donald (17:01)
Mm-hmm. Yep, yep. Yep.

Again, we can't really fit to a specific technology unless it's very broadly applicable. So S3, great.

But supporting MySQL versus Postgres versus other things and having that be your canonical data storage, it's not so great, because you have the multi-tenancy and then you're very specific to the cloud or very specific to the technology.

Warren Parad (17:24)
But there's only like a finite number of those things, right? Like, couldn't I create a connector for MariaDB and MS SQL and Oracle Database and DynamoDB, a key-value store, and MongoDB and Cassandra, and I can go through the list. Like it's a finite number. It's probably less than a hundred. Like that wouldn't be that bad, would it?

Donald (17:40)
Yeah.

It wouldn't be that bad, but then the other requirement that I have is that we take those and we build data pipelines on them. So we take that data out, we enrich it. Where do we put it? I'm not putting it back in that database, am I? I really need to put it somewhere that I understand. And at that point, you're like, why don't I just put it there when I bring in the data initially?

Warren Parad (17:49)
Mm-hmm.

For sure.

A lot of organizations have their data already stored in such a disparate number of systems. The number of companies that I've talked to that have multiple identity providers, even within the same company, is like a nonsensical number. And they often have more than two. It's like some business units have Entra ID and other ones have Google Workspace and a third has Okta. And there's some giant project going on at this moment where they're trying to actually introduce another one or collapse them down. And that's just identity providers, where I think everyone is in agreement, like one is probably the right number to have and not more than one. I mean, of course, there's edge cases.

Donald (18:14)
Yeah. Yeah.

Yeah. Yeah.

Yeah.

Warren Parad (18:40)
Then getting to the point where you're trying to reduce the number of stored data repositories that you have, I don't know if that can ever happen in practice. But the cardinality of the different data stores that you have within a company is vastly smaller than the cardinality of different data types, like the actual data that's being saved in those.

I could see that it wouldn't necessarily be a problem to actually build a connector for every single type of data store you could ever have. I build one for RDS and one for Cosmos DB and whatever is in GCP Cloud SQL. So what is the best place to actually stick this data? And please don't say Snowflake.

Donald (19:18)
you

I think if you're in an existing organization, you're talking about your internal data platform. Very justified reasons.

And then you have to make a very important decision, right? But if you sort of optimize for kind of batch-ish things, I would suggest just making a data lake or running in Databricks or something like this, because that's a good kind of open place to put stuff.

You can dump things in, process them, and then you can sort of expand modularly.

Warren Parad (19:47)
You have the opportunity to dogfood your own platform in a way, because you have your own data sources. What are you doing with the target data? Are you sticking it in Databricks? Are you using Dynatrace? Funneling it into a hole back in S3 somewhere? What happens to your business data? Where does it go?

Donald (19:49)
Yeah, yeah. Yeah.

We're using object storage and then a bunch of Parquet files, and building workflows on top of those things.

Warren Parad (20:08)
So you're doing the difficult work of assembling long-form table reads from Parquet files in S3 and whatnot. Yeah.

Donald (20:14)
Yeah, yeah, yeah, yeah, yeah,

Warren Parad (20:16)
So what would you say is like some fundamental technology challenge you're actually facing still today?

Donald (20:22)
As you sort of open the platform, even to vibe-coded code, basically, a lot of things that you would have handled by human interaction have to be made more...

Right now, since we control the sandbox, it's just like there. So like, you're in the sandbox, you can see the logs. It's kind of this weird sandbox,

Warren Parad (20:42)
I mean, the first thing that comes to mind would be a concern where there is something malicious that would show up in the logs. And so while it could be blocked fundamentally at the HTTP interface or whatever, GCP, whatever the data input validation could be there, but then it still gets in the logs, or maybe it's passed in through a mechanism which is considered secure, say like a common field or whatever. And then you end up with a Log4j, because even though it gets to your monitoring platform and gets re-ingested, now it's like first-class data that you've actually saved in a different mechanism. How have you gone about figuring out how to protect even the sandbox? I still think it's important to protect the sandbox, even if you call it that, because in some way it must still have access to maybe sensitive data, either being able to write or...

Donald (21:03)
Yeah.

Warren Parad (21:25)
the internet in some regard.

Donald (21:27)
Yeah, and you hit the nail on the head, right? Because I think a lot of the ways we think about application building have a trusted computing base, right? But when you assume that the actual thing that you're running is untrusted, you have to really think about all the decisions that led up to this, right? And I think it's about being very clear, thinking about the world the same way AWS or GCP thinks about the world, which is like, I've got customers coming in with just garbage code or whatever, and I have to protect the actual data center from this kind of stuff, right? And you have to be very clear about what is part of your infrastructure versus the application space, and really never mix...

Warren Parad (22:08)
How do you decide what's in the threat model and what should be left out?

Donald (22:12)
I don't think there's a strong sense of what fine-grained access control means for agents. That's basically what it is. We have very coarse-grained stuff at the data level, but beyond that,

Warren Parad (22:22)
So there are some standards around transactional auth, and obviously because we're in the authorization space ourselves, we have a more nuanced perspective here. It definitely is way more complicated than just row-level security and it goes much further than that, where you really need to think about, when an agent performs an action, what are the exact pieces of data it should have access to, both from a reading standpoint and a writing standpoint. And at that level of granularity, I just...

Donald (22:27)
Yeah.

Yeah.

Warren Parad (22:48)
That system just doesn't exist. It can't exist. There's a limit: you would never be able to have both the context that you can provide to an agent and the context for what a policy should be that can apply to its interaction, generated outside of the agent that has to do the thing. Maybe to give an analogy, there is this canonical problem in just regular old systems, no LLMs involved, where your customer-facing service calls something internal, and then you call a third party and whatnot. There's no way for the customer to know what the third party is going to require until the request actually gets there. You can sort of try to predict what that third party is going to need and what sort of access control you'll want. And this is the whole, well, you offer a service that integrates, say, with Google Drive,

and your customer comes in and makes a request and then you're like, you know what? We need access to your Google Drive, be able to read and write all files, delete your entire drive. Also, just in case we need your calendar access and be able to create meetings on all of your calendars and also delete your calendars just in case. And something about your email too while we're at it, why not? And just request all of that upfront because you don't know fundamentally later what is actually going to be necessary to complete the action, especially if it's asynchronous. And so I think those sorts of things actually make this a...

sort of not solvable problem in some regard. On this topic, there's this idea where you can just stand up some agents and limit the access that that thing has. And they're like, oh yeah, you create a bunch of identities for your agents or your system to run in that particular sandbox. But if you fast forward like five years from now,

that agent is going to be a perfect representation or proxy for all of the things you already have in your organization of every single, you know, team member. It's going to have access to all of your email systems, all your databases, et cetera, because you're going to have decided that, well, it needs to read and write to all those things. And so saying that you should give your agents their own identity is sort of like a short-term myopic mindset, which doesn't really hold up in the end. So I don't really understand why people are pushing for that so

Donald (24:42)
Yeah.

But yeah, you're right. People get excited. It's like, can I have more data, more data, more data. And then you can create this giant, just super god agent that has everything.

Warren Parad (25:04)
I think fundamentally, with the transformer architecture, prompt injection attacks will always be a thing. And it's like, well, what if you take another model and you wrap the first model and you sort of validate that the inputs and the outputs don't seem like a prompt injection attack. But then it's turtles all the way down, right? Like it's like, well, then you need another model to make sure the second model, you know, wasn't. And it's like, whoever has more turtles stacked on top of each other will be successful in the end.

Donald (25:09)
Yep.

Yeah, don't do that.

Yeah, yeah.

Warren Parad (25:28)
I have more turtles, therefore I can protect, you know, prevent those attacks. I thought one more turtle ahead.

Donald (25:33)
Yeah, I think from the data perspective, the roadmap is kind of clear, which is like if you know what data is sensitive or not, and you can just provide it or not, it's very binary. I think that's a problem that can be engineered and solved. But you're right that on the control, the action side, it's very hard, right? Like what actions you allow it to take. So I think that's why I'm happy I'm just working sort of more on the data side. You're just talking about confidentiality of data, not the actions you can take right now. It's just, can you read this or not? But you're right that if we want a world where these things are acting on our behalf and sending emails or making decisions that have like external side effects, I don't know how you can really manage that properly.
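The "very binary" read-side control Donald describes can be sketched as a filter that runs before anything reaches the agent. This is a toy under stated assumptions: the `Record` type, the `sensitive` flag, and the `agent_may_read_sensitive` switch are invented for illustration, and it deliberately says nothing about the harder action-side problem.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    sensitive: bool  # assumes sensitivity has already been labeled upstream

def readable_by_agent(records, agent_may_read_sensitive):
    """Return only the records this agent is allowed to see."""
    return [r for r in records if agent_may_read_sensitive or not r.sensitive]

records = [
    Record("r1", "Public product manual", sensitive=False),
    Record("r2", "Customer payroll export", sensitive=True),
    Record("r3", "Release notes", sensitive=False),
]

visible = readable_by_agent(records, agent_may_read_sensitive=False)
print([r.record_id for r in visible])
```

The decision is a yes or no per record, which is why it is tractable; deciding which emails an agent may send on someone's behalf has no equally simple predicate.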

Warren Parad (26:16)
I think it's one of those things where humans have already sort of figured out the solutions to most problems and we're re-figuring them out when we use LLMs, but the solutions are exactly the same solutions we already have in place. So if we just look to how we solve this problem without LLMs, I think we'll find some answers for us. Like if we go to the bank, you get multiple different validation prompts, like, are you sure you want to send this money to this particular address? Well, that address has error correction in it, and we do bank account checks for the size of the transaction, as well as maybe some fraud detection, you know, LLMs aside. If we look at governments as well, there's different levels of clearance. But you see how challenging it is, because there's like a secret clearance and then there's a top secret clearance and then there's another thing above that which is like for your eyes only, which basically means that we don't know how to do clearance correctly at this level. So we'll just say, you know, if you need it someone else will decide that you can get it, but there's no canonical answer for that problem. It has to be figured out every single time, in that moment.
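One concrete example of a payment address with a built-in check is the IBAN, whose two check digits are verified with a mod-97 computation. The conversation doesn't name IBAN, so treating it as the example here is an assumption; strictly speaking, the scheme detects typos rather than correcting them. A minimal sketch, skipping the per-country length rules a real validator would also enforce:

```python
def _to_digits(s):
    """Replace letters with numbers (A=10 ... Z=35); digits stay as they are."""
    return "".join(str(int(ch, 36)) for ch in s)

def is_valid_iban(iban):
    """Mod-97 check only; does not verify country-specific length or format."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end, then test mod 97.
    rearranged = compact[4:] + compact[:4]
    return int(_to_digits(rearranged)) % 97 == 1

def check_digits(country, bban):
    """Compute the two check digits for a country code and account part (BBAN)."""
    remainder = int(_to_digits(bban.upper() + country.upper() + "00")) % 97
    return f"{98 - remainder:02d}"

print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))   # widely used sample IBAN
print(is_valid_iban("GB82 WEST 1234 5698 7654 33"))   # last digit mistyped
print(check_digits("GB", "WEST12345698765432"))
```

A single mistyped digit changes the remainder, so the transfer can be refused before any money moves, which is the kind of pre-LLM safeguard Warren is pointing to.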

Donald (27:22)
I

Warren Parad (27:23)
Unsolved problems in security, yes. And I don't think it's getting better. Maybe for something quickly different. So I was chatting with a few company leaders in the last couple of weeks, and they told me that they are already budgeting 50% of the engineering salary for LLM usage. So if you're paying someone 100K today, then tomorrow you should be allocating 150K, 50K of which will be just token usage by their favorite LLM provider. What do you think? Accurate, inaccurate, something you're already planning for?

Donald (27:59)
It depends what kind of code you're writing, really, because I think what I've discovered is there are certain types where your skill level is quite low, and so what the LLM is providing you is great. You get a lot of information out of that, you learn a lot, or whatever.

Is it half? Should we think about, like, the way we evaluate employees is how many tokens did you use, you didn't use enough?

I

Warren Parad (28:22)
I think you get what you measure, right? And I think we're already seeing companies falling prey to token maxing, but it's sort of a different aspect, right? You have some role or responsibility in your company. There's something that has to get completed and you have someone who's put on the job of doing that thing.

Donald (28:29)
Hahaha

Warren Parad (28:38)
And I think as companies evolve, and if LLMs really are a tool that everyone will use at some point, then it's a no-brainer to believe that they may use a tool to solve that, and that tool may be an LLM, and that tool has a cost. How, like...

What is the expectation here? Is it that you have no idea how much these things will end up costing? Do you have an expectation of what would be reasonable, what would be unreasonable? Like if an engineer came to me and said, yeah, this next year I project I'm gonna use three times my salary on tokens, I'd be like, that sounds wrong. Which has nothing to do with the salary. It has to do with the fact that I don't know if you're generating three times your salary's worth of value from tokens. So maybe that's the conversation.

Donald (29:23)
What would it take to take a normal software developer and then bring them a little bit more aligned towards product or business? That's the way you would value it, right? It's that you give the person this tool so they can expand their responsibility set. And that's how I would price it.

Warren Parad (29:42)
I love that you're tempted to just think about token usage from an engineering standpoint. Because I feel like the biggest usage of tokens would be the product manager saying, you know what? It's incredibly critical that we prototype this idea and get it in front of customers. And to do that means I'm just going to send my favorite LLM on an infinite cycle loop until this thing is done. And then after spending n...

Donald (29:49)
you

Warren Parad (30:07)
thousands of whatever your favorite currency is, I'll throw it in front of customers and get some feedback there. And that horrifies me.

Donald (30:16)
I mean,

I can buy it for, like, just showing a prototype or something. I think the biggest value that we had for non-engineering use cases of LLMs is actually bringing people to speak the same language. So our designers can produce real working websites and designs and changes, and then they're submitted as code. Again, it's really garbage code to read, but at least we're all sort of speaking the same language and we're understanding what we're actually communicating, because we can actually make a physical artifact about that that behaves in high fidelity. And I think that part is

Warren Parad (30:21)
Mm-hmm.

Donald (30:48)
very useful. For a product manager, if I were a manager of a product team, would I be like, you need to maximize your token usage? I think that part is a little scary in the sense that you have to be kind of aligned with reality. And so if you can't actually produce that thing and solve all the other problems, then knowing that you can do this is nice. And seeing it very concretely is very nice. But in some sense, ideas are cheap, right? The ability to execute is the most important thing.

Warren Parad (31:15)
The right business outcome, I think the companies that are just trying to optimize token usage have the wrong incentives at play realistically. But so I think that's a much more mature way of looking at it for sure.

I do think that there's a lot of overlap with how things are migrating to at a large scale with the concept of, say, outsourcing one of your critical internal departments to another company to function. I don't mean like using a SaaS provider, but maybe like if you're doing software engineering and you're hiring a third-party company to basically make you some prototypes or some mobile apps. If you are uncomfortable with that model, I feel like you should also be uncomfortable utilizing an LLM to solve a problem.

I also really like your perspective on the data, because I think this is something that definitely was lost for me early on in my career. It's that data is sort of like a throwaway term. It means almost nothing, or maybe it means a lot to a lot of people, and that's sort of the problem. And it's really saying the term data is like, what is the data actually representing? Because, you know, you brought up this point with the embedding model early on. It's that there are some cornerstone critical aspects to why the data exists and what the sort of inherent

Donald (32:08)
Yeah. Yeah. Yeah.

Warren Parad (32:25)
information or value is in that particular data and that's where the complexity is. And you really need to understand that that's where the challenge is and not just the storage or data pipelining or processing of arbitrary pieces of data or formats.

Donald (32:39)
Do you know the Douglas Adams thing? You know, like the answer to everything is 42? What was the question? Like that's the most important, the data, that's 42. How did you get that? That's the whole journey.

Warren Parad (32:49)
Yes, life, the universe and everything. It's summed up in a very simple number. That was the, wait, that was the number crunching machine, right? Keep asking, you know, what's the most important question in the universe that requires the entire processing of the whole universe in order to actually come to a conclusion. So with that, maybe we'll switch over to picks. So Donald, what'd you bring for us today?

Donald (32:58)
Yeah.

Well, I have a young daughter, so I often have to read to her at bed. And the challenge, if anyone knows, about reading stories to children is a lot of the stories are terrible, and you don't want to read them over and over and over again. So we lucked upon a set of books called InvestiGators, which is like a little graphic novel. I think it's geared towards kids who can read, but it's enjoyable for younger folks, I think, mostly because it's just like a very funny pun-filled

absurd story about these alligators who are investigating different things. But it's one of those things, very rare, like if people are familiar with the TV show Bluey, it's the same category, where as an adult you enjoy watching it. I go to my daughter, like, can we read this book? Because I'm curious about what's happening next in the story,

because it's something she enjoys, but it actually has this continuity, with the storylines built on top of each other over multiple series. Again, very uncommon for children's books. They're very mechanical if you find a series. So, InvestiGators, by John Patrick Green, I believe. If you have a kid, read it to them.

Warren Parad (34:21)
Yeah, I'm with you on that because it does seem like most children's books, I mean, even some adult books, they're very episodic in a way, right? Like there's some story and that's a complete case. And it does feel like there is no continuity in any regard, which sort of removes any sort of desire to continue on in some way. Okay, well, I think you may be the first to come on the podcast and recommend a children's book. Not like, there have been kids' toys, so.

Donald (34:26)
Yeah, yeah.

Hahaha

Warren Parad (34:48)
Totally fine. I actually really like the pick, which makes me almost want to check it out. I'm always looking for interesting things to read in German because I want to improve my German language skills, and I've often thought, you know, what would be great is easy television shows or books, and finding good content that has been well translated is a challenge. And I find children's stories

Donald (34:57)
Yeah, yeah.

Warren Parad (35:10)
tend to be translated, but most of them are bad. I may actually look into that. Yeah, so what did I bring? I almost forgot. I brought this episode from a YouTuber named Dr. Nemo. Specifically, the episode is "You've never seen a clockwise circle pit." That's a pit of people running in a clockwise motion.

Donald (35:13)
Yeah. Yeah.

Warren Parad (35:34)
Counterclockwise for me actually has always felt more appropriate, just anything counterclockwise, from conservation of angular momentum to electromagnetic wave Poynting vectors. It's always counterclockwise. But one thing I never challenged is why is it called clockwise and counterclockwise, when clocks are the only thing really that goes backwards? I feel like we should have picked a different term, and there's a lot of stuff in here that's based on not just nurture but also

Donald (35:41)
Yeah.

Warren Parad (36:03)
nature of how humans work and standard stuff as well as cultural aspects. I wonder if it's true across all cultures or if it's just a Western world aspect. And we may never know because I don't think we have enough segregation. There's a lot of things, say, in physics and mathematics where counterclockwise is positive and clockwise is negative.

Donald (36:26)
good. I love those kinds of things where, you know, there's all these, a lot of these random conventions, right? Like, how do they come about? Was it just like a chance one day or, you know, is there an inherentness? And I guess your point, right, is that if we thought from a mathematical basis, maybe we would have chosen a different way. Maybe we just figured this out too late.

Warren Parad (36:41)
Yeah, well, he didn't actually say this in the video, but my theory is that we've decided in the Cartesian plane that the x-coordinate goes left to right, negative to positive, and the y goes from negative to positive, down to up. So down to up, left to right. And if you do a mathematical operation where you go from a high positive x-coordinate to a high positive y-coordinate, you go like this. And that's the...

Donald (37:00)
Mm-hmm. Yeah.

Warren Parad (37:10)
that's this curl, and that curl then becomes positive because both those numbers are positive. But if we changed our mental model so that the primary coordinate was the vertical and not the horizontal line, then we would have reversed them and that would have made that negative. Or if we thought about numbers differently, where negative numbers were higher than positive numbers. So there is a huge aspect there. Do other cultures have different counting systems? Do they have a different way of thinking about it?

Donald (37:12)
Yeah.

Yeah.

Warren Parad (37:39)
I know that there are some tribes that fundamentally have it built into their culture, like which way is north, which way is south. So system or cardinal directions or waves, like they just sort of know. I don't. I have no idea which way north is most of the time.

Donald (37:56)
But I think it reminded me that...

I think we take a lot of things for granted, like just numbers, right? And you can almost see it today in the way we use the words for numbers, right? And in different languages, they're very weird, in that you think of one, two, three, four, five, six, seven, eight, fine. Okay, 11, 12, 13. Then you see this pattern, like, you know, 13, 14, 15, 16, 17, 18, 19, great. But it's different, right? Then 20, and you have this repeat again. So you can get a sense that when...

Warren Parad (38:01)
Mm.

Donald (38:27)
English speakers, or whatever they came from, were thinking about numbers, they weren't thinking about it in terms of that very rigorous kind of base 10 feeling. They were just like, I need a thing for this concept, for one of these things, and then for two of these things. And at some point we stopped inventing and we're like, I see the pattern now, and so we should do this, right? And you know that different languages have very different counting systems, like French, which has remnants of base 20. Yeah.

Warren Parad (38:52)
Oh, things like 96 in French are

so absurd. There's something... Yeah, I'll say that German numbers definitely mess with me the most, because you put the ones digit before the tens when you speak, but the larger digits still come first. Like 11 and 12, they're the same as in English for the most part. But if you get to say 21, you

Donald (38:56)
Well, it depends if you're Belgian or French French.

Warren Parad (39:21)
say one-and-twenty and two-and-twenty and, you know, three-and-thirty, and it's so confusing when people are talking to me and I'm just like, wait, what number did you say? Maybe that will be my pick for next week's episode, so maybe I won't spoil it for the viewers until then. But other than that, thank you, Donald, for coming on for today's episode and talking through multimodal stuff,

Donald (39:23)
Yeah.

Yeah. Yeah.

Warren Parad (39:42)
building AIs, LLMs at scale for customers. I think it's been great.

Donald (39:47)
Awesome, thank you. Pleasure.

Warren Parad (39:49)
And thanks to all the viewers for tuning into this week's episode, and I hope to see you all back again next week.