1
00:00:07,810 --> 00:00:13,756
Welcome back to Adventures in DevOps, where every episode's a deep dive into a specific
topic with an expert guest.

2
00:00:13,756 --> 00:00:17,520
Today's adventure focuses on writing documentation and feature flags.

3
00:00:17,520 --> 00:00:20,543
As the expert, we've got someone with quite the unfamiliar title.

4
00:00:20,543 --> 00:00:28,238
Previously, she's been a software developer, API integration engineer, and now is the
documentation lead at Unleash, Melinda Fekete.

5
00:00:28,238 --> 00:00:30,190
Hi Warren, thank you so much for having me.

6
00:00:30,190 --> 00:00:32,550
Yeah, you know, and I'm really excited.

7
00:00:32,610 --> 00:00:37,910
What I'll say is that we try to limit who shows up on the podcast based off of their
titles.

8
00:00:37,910 --> 00:00:43,190
And you let me in on a little secret before we started recording that you don't believe in
job titles.

9
00:00:43,190 --> 00:00:44,490
So what's that about?

10
00:00:44,502 --> 00:00:46,703
Yeah, you know, um I work at a very small company.

11
00:00:46,703 --> 00:00:51,915
So I did build the documentation website and I maintain it.

12
00:00:51,915 --> 00:00:58,338
And I also do all of the technical content that is on there, but it's also, just a part of
my role.

13
00:00:58,338 --> 00:01:09,872
So I do other DevRel things like conferences and meetups and developer education and talks and workshops and a bit of marketing, a bit of this and that.

14
00:01:09,872 --> 00:01:12,754
So what would you say is a good job title for that?

15
00:01:12,754 --> 00:01:13,754
I don't know.

16
00:01:14,419 --> 00:01:21,225
You know, I think this is where if you're at a small enough company and still like around
the startup phase, then there's like this idea of founding engineer.

17
00:01:21,225 --> 00:01:24,538
But then like, what do you call the second founding engineer and the third founding
engineer?

18
00:01:24,538 --> 00:01:29,874
And then I feel like you have this idea that you should start applying titles or roles.

19
00:01:29,874 --> 00:01:34,702
I think what I would say is that often labels are helpful, but they're all wrong in a way.

20
00:01:34,702 --> 00:01:43,020
I do want to share though, a long time ago when I was in the university, I had a lot of
professors who would say, we want engineers who can do at least one other thing.

21
00:01:43,020 --> 00:01:46,182
That's the most common feedback that we get from the industry.

22
00:01:46,182 --> 00:01:49,615
And this was, uh wow, almost 20 years ago.

23
00:01:49,876 --> 00:01:52,498
And I didn't really understand it at the time.

24
00:01:52,558 --> 00:02:01,366
And I think the longer I've been in my career, the more I've started to come to terms with
this idea that just doing one specific thing like your

25
00:02:01,366 --> 00:02:08,672
engineering job ends as soon as the code gets deployed to production, is just not the right approach.

26
00:02:08,672 --> 00:02:12,023
Yeah, it's what they sometimes refer to as T-shaped, right?

27
00:02:12,023 --> 00:02:21,697
So being really good in one kind of vertical area and going really deep and having that
expertise, but also ah just dipping your toes in a couple of other things and trying

28
00:02:21,697 --> 00:02:22,298
things out.

29
00:02:22,298 --> 00:02:25,429
And I think it also helps with job satisfaction.

30
00:02:25,429 --> 00:02:30,611
And if you just do the same thing every single day, it's going to get boring quite
quickly.

31
00:02:30,611 --> 00:02:35,293
So of course there's engineers who only like to code and that's all they like to do.

32
00:02:35,293 --> 00:02:38,034
But even back when I was an engineer,

33
00:02:38,310 --> 00:02:48,503
Um, if you had just told me to code all day and do nothing else, I would have probably
quit because I really need that variety of like, um, being involved in like, I don't know,

34
00:02:48,503 --> 00:02:49,204
interviews.

35
00:02:49,204 --> 00:02:57,102
So it's nice to have a couple of different responsibilities that you also care about and
can kind of experiment and try new things.

36
00:02:57,102 --> 00:03:05,607
How did you make that shift into an area that you felt comfortable with, which wasn't your primary remit when you started?

37
00:03:05,607 --> 00:03:07,948
Like you didn't just like one day wake up and like, you know what?

38
00:03:07,948 --> 00:03:11,256
I only want to write documentation from now on.

39
00:03:11,256 --> 00:03:22,853
For me, the first time I saw the spark and um excitement with documentation was when I was working at this coffee roasting company.

40
00:03:22,853 --> 00:03:26,074
I was integrating um with different APIs.

41
00:03:26,074 --> 00:03:30,276
um We built this IoT espresso machine.

42
00:03:30,557 --> 00:03:34,859
A um big part of my role was just trying to figure out these different integrations.

43
00:03:34,859 --> 00:03:39,401
I was looking at a lot of API docs and some of them were terrible.

44
00:03:40,204 --> 00:03:47,396
You really had to spend days and days trying to, with trial and error, figure out what the
heck was going on.

45
00:03:47,396 --> 00:03:51,407
And that really inspired me to ask, how can you do this better?

46
00:03:51,407 --> 00:03:55,949
And I started looking at some of the companies who I thought were doing it really well.

47
00:03:55,949 --> 00:03:57,349
And I haven't looked back since.

48
00:03:57,349 --> 00:04:06,690
I think it's been three or four years now that I mostly do documentation and some small things on the side, but I love it.

49
00:04:06,690 --> 00:04:19,375
So I think there is this old joke where no engineer comes into a company and is looking
over the current stack and looks at all the source code and says, wow, this last guy just

50
00:04:19,375 --> 00:04:21,616
wrote the most perfect code ever.

51
00:04:21,696 --> 00:04:22,907
I don't need to change anything.

52
00:04:22,907 --> 00:04:25,108
But I actually think that's true for documentation.

53
00:04:25,108 --> 00:04:30,360
I don't remember looking at any portal and being like, wow, the docs for this software
product, they are fantastic.

54
00:04:30,360 --> 00:04:36,202
ah So you mentioned that there are some where you thought, we should do better, and you looked to some

55
00:04:36,635 --> 00:04:40,560
pinnacle ones out there that should be modeled for any software product.

56
00:04:40,560 --> 00:04:42,292
What are those in your mind?

57
00:04:42,734 --> 00:04:45,051
uh Unleash does a pretty good job.

58
00:04:46,830 --> 00:04:49,053
Make your own company, okay, because you're working on it.

59
00:04:49,053 --> 00:04:52,256
So everyone's gonna have to go look at that after this episode now.

60
00:04:52,256 --> 00:04:57,160
Yeah, I mean, some of the big open source projects like GitLab, I think, do an amazing job.

61
00:04:57,160 --> 00:05:02,373
Actually the landscape has changed, I would say significantly over the last couple of
months even.

62
00:05:02,373 --> 00:05:05,446
LLMs, AI, they're very good at writing documentation, right?

63
00:05:05,446 --> 00:05:10,640
So you can produce a lot of good quality content with very minimal input.

64
00:05:10,640 --> 00:05:14,318
So the name of the game has kind of been more around, like...

65
00:05:14,318 --> 00:05:17,398
what's some of the experience you can build around it.

66
00:05:17,398 --> 00:05:30,718
Things like the AI search piece of the puzzle: you have to make your documentation also usable for LLMs, because about 50% of all documentation

67
00:05:30,718 --> 00:05:34,838
visitors for a site like ours are now AI tools.

68
00:05:34,838 --> 00:05:41,758
So you have to find a sweet balance of, like, what's working for humans and what's working for LLMs as well.

69
00:05:41,758 --> 00:05:49,518
Did it shift your job responsibility from historically having to write docs for humans to now having to write docs for LLMs?

70
00:05:49,518 --> 00:06:02,398
There's a couple of fun things you can use. Like, the tooling that I use allows you to distinguish between what you want to expose to a

71
00:06:02,398 --> 00:06:04,618
human and what you want to expose to an LLM.

72
00:06:04,618 --> 00:06:11,018
So you have different code blocks and you can show some of it to humans, some of it to
LLMs.

73
00:06:11,418 --> 00:06:16,418
A lot of it is like there's a good overlap, but...

74
00:06:16,454 --> 00:06:21,996
LLMs typically do well when a lot of the content is just plain markdown.

75
00:06:21,996 --> 00:06:32,739
You strip out any of the fancy tables and accordions and all the things that um you put a lot of work into to make the content readable for humans.

76
00:06:32,739 --> 00:06:44,982
But then you have to strip away a lot of those fancy UI features for LLMs and things like,
I don't know if you've heard about LLMs.txt, which is um typically something you...

77
00:06:44,982 --> 00:06:53,702
you would expose for all of your pages so that when LLMs come and look at your docs, it's
a very clean and simple, easy to understand structure for them.
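For context, an llms.txt file is conventionally a plain-markdown index served at the site root, with a title, a short summary, and sections of links; a minimal sketch might look like this (the domain and page titles here are hypothetical, not Unleash's actual file):

```markdown
# Unleash Documentation

> Open-source feature management: getting started, SDKs, API reference, and guides.

## Getting started

- [Quickstart](https://docs.example.com/quickstart.md): set up your first feature flag
- [SDK overview](https://docs.example.com/sdks.md): supported languages and clients

## API reference

- [Feature flags API](https://docs.example.com/api/flags.md): create and toggle flags
```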

78
00:06:53,826 --> 00:06:58,429
Yeah, I mean, I can imagine, realistically, there are two paths there.

79
00:06:58,429 --> 00:07:09,417
There's exposing the docs that you have to the training processes that companies who are building models are running, so that that information can end up in LLMs straight away,

80
00:07:09,417 --> 00:07:18,394
so that when your users, or whoever your prospective customers are, query an LLM or prompt it for information, it can actually return the results.

81
00:07:18,394 --> 00:07:21,876
And then there's basically a more complex search.

82
00:07:21,876 --> 00:07:28,512
aspect, where at runtime you're able to answer questions from the web that historically couldn't be answered.

83
00:07:28,512 --> 00:07:39,721
Realistically, you maybe got away with some high-level JSON blocks at the top of some web pages, but now LLMs are directly consuming the data that is on uh individual article

84
00:07:39,721 --> 00:07:43,934
pages and summarizing it or providing a useful answer.

85
00:07:43,934 --> 00:07:46,396
And that has a second aspect there.

86
00:07:46,396 --> 00:07:51,392
And I think one of the problems is a lot of the tools out there don't do a great job of

87
00:07:51,392 --> 00:07:54,804
removing the visual elements that have been added in some way.

88
00:07:54,804 --> 00:08:05,521
Like if you just look at generated HTML, or even markdown in some cases, it's very difficult to get useful, optimized input for LLMs.

89
00:08:05,521 --> 00:08:07,464
uh What's the strategy there?

90
00:08:07,464 --> 00:08:09,426
Do you just like write the docs twice?

91
00:08:09,426 --> 00:08:14,169
I think a lot of the modern documentation platforms do it behind the scenes for you.

92
00:08:14,169 --> 00:08:21,074
So the way they generate the LLMs.txt file, if they're doing it well, they'll
automatically strip some of those.

93
00:08:21,074 --> 00:08:29,259
We write markdown with custom React components, which are the things like the tables and
the drop downs and the buttons and whatnot.

94
00:08:29,259 --> 00:08:33,172
And so all of that gets stripped out uh from the LLMs.txt.

95
00:08:33,172 --> 00:08:35,803
And I think that alone is already a big win.
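As a rough illustration of that stripping step, here is a minimal sketch in Python. This is not any platform's actual implementation; the component names and the heuristic (custom components start with a capital letter) are assumptions for the example.

```python
import re

def strip_mdx_components(markdown: str) -> str:
    """Remove custom JSX/MDX components (e.g. <Accordion>, <Tabs>) while
    keeping their inner text, so the output is plain markdown."""
    # Drop top-level import/export lines that MDX allows.
    lines = [line for line in markdown.splitlines()
             if not re.match(r"\s*(import|export)\s", line)]
    text = "\n".join(lines)
    # Custom components conventionally start with a capital letter;
    # strip the opening/closing tags but keep the content between them.
    text = re.sub(r"</?[A-Z][A-Za-z0-9]*[^>]*>", "", text)
    # Collapse the extra blank lines left behind.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

A real pipeline would parse the MDX AST instead of using regexes, but the idea is the same: the human-facing UI wrappers go, the prose and code stay.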

96
00:08:35,803 --> 00:08:38,702
And yeah, I'm experimenting with including

97
00:08:38,702 --> 00:08:46,549
Uh, certain explanations. Like, if you strip out a table, but you still want to tell the LLM what that is all about.

98
00:08:46,549 --> 00:08:49,432
Like, you can probably explain it in a different way.

99
00:08:49,432 --> 00:08:59,270
And I've just started experimenting with some of it, and um I have the data around, you know, what pages got viewed by what percent of humans versus LLMs.

100
00:08:59,270 --> 00:09:01,402
So I still need to do a lot of digging around.

101
00:09:01,402 --> 00:09:05,876
Um, but there's, there's some useful features out there.

102
00:09:05,964 --> 00:09:06,425
for sure.

103
00:09:06,425 --> 00:09:18,586
So just for context, we've been using Docusaurus for a while now, and the LLMs.txt functionality plugin is atrocious in every way.

104
00:09:18,586 --> 00:09:23,852
ah So it's always something that we're sort of looking at, especially offering a very
technical product.

105
00:09:23,852 --> 00:09:27,295
You can be sure that it's more likely to get picked up in some way.

106
00:09:27,295 --> 00:09:30,368
And I think one of the biggest problems there is when

107
00:09:30,368 --> 00:09:39,048
you have custom React components or Vue components, or just any sort of thing that you wrote yourself and not just pure markdown. Getting embedded links to work has been a huge

108
00:09:39,048 --> 00:09:47,667
struggle for us. Like, if you're using a custom React component that has something clickable in it to go to a different page, is whatever process you're using to write

109
00:09:47,667 --> 00:09:51,808
your documentation going to be smart enough to somehow pull that out and then list it appropriately?

110
00:09:51,808 --> 00:09:53,989
Yeah, I've not extensively looked at it.

111
00:09:53,989 --> 00:09:59,262
So I would say um we recently migrated off of Docusaurus.

112
00:09:59,522 --> 00:10:09,007
Exactly for some of the reasons you mentioned; the plugins and the entire ecosystem were giving me nightmares.

113
00:10:09,988 --> 00:10:17,892
But uh the tool we have now is called Fern and we started using it as of last Monday.

114
00:10:17,953 --> 00:10:20,246
So it's very new for me and I'm

115
00:10:20,246 --> 00:10:28,832
Still just like trying to learn and experiment and understand what is happening behind the
scenes with things like the LLMs.txt conversion.

116
00:10:28,832 --> 00:10:32,974
So ask me again in six months and we can have a chat about it.

117
00:10:32,974 --> 00:10:36,586
Well, if you do the research and write it down, then we'll definitely want those stats.

118
00:10:36,586 --> 00:10:47,436
So Fern's the one that does the SDK generation but also contains the docs portal. Was the motivating factor mostly on the docs side or the automatic SDK generation?

119
00:10:47,436 --> 00:10:50,027
We don't use them for the SDK stuff.

120
00:10:50,487 --> 00:10:51,847
Just the docs.

121
00:10:52,308 --> 00:10:57,669
Yeah, I wanted a platform that was a little bit easier to maintain for me.

122
00:10:57,669 --> 00:11:05,412
And I think their out-of-the-box components are a bit more sexy than the Docusaurus ones.

123
00:11:05,412 --> 00:11:08,232
Um, just overall, the team was super nice.

124
00:11:08,232 --> 00:11:11,914
So I loved working with them throughout kind of the evaluation process.

125
00:11:11,914 --> 00:11:13,684
um

126
00:11:13,684 --> 00:11:23,709
You know, the AI search capabilities and also this um other AI tooling, like the LLMs.txt generation, was not something we had before.

127
00:11:23,709 --> 00:11:31,604
So for me to just get all of that out of the box with as little involvement from my side
as possible was a win.

128
00:11:31,604 --> 00:11:42,890
You know, doing the platform and the content and also this bunch of other things around developer relations um can be uh time consuming.

129
00:11:43,342 --> 00:11:51,982
So I really want to dive into that a little bit, because one of the things that has come up, I'd say the biggest learning for me, was: who am I writing these docs for?

130
00:11:52,082 --> 00:11:55,842
And you're nodding your head, I'm sure you have some opinions here.

131
00:11:55,842 --> 00:12:06,462
So my first question is like primarily when you're thinking about writing the docs, are we
talking about like end users like through Fern who are technical users?

132
00:12:06,902 --> 00:12:08,622
Are they how-tos?

133
00:12:08,622 --> 00:12:11,242
Are they guided tours?

134
00:12:11,398 --> 00:12:22,758
Well, so I would say about 80 to 90 % of our audience is developers who are either just
getting started with feature management or maybe they've already got Unleash and they're

135
00:12:22,758 --> 00:12:32,997
trying to figure out how to integrate their SDK and maybe 10 to 15 % are business decision
makers who are looking to evaluate what platform to buy.

136
00:12:34,398 --> 00:12:38,020
We categorize the content into three or four main types.

137
00:12:38,020 --> 00:12:42,492
So we have kind of getting started content, which is very developer focused.

138
00:12:42,492 --> 00:12:52,646
Then we have more of the tutorials and guides, which is more of a step-by-step kind of
hand holding, very detailed across all the different SDKs.

139
00:12:52,646 --> 00:12:56,818
And then we have the API documentation, which is one of the largest categories.

140
00:12:56,818 --> 00:13:01,174
And then the SDK documentation and release notes.

141
00:13:01,174 --> 00:13:09,719
So I would say those categories are pretty classic in terms of what you'll find on the
typical documentation site.

142
00:13:09,719 --> 00:13:20,706
We typically monitor our usage around what SDKs our customers are using and try and focus on those languages when we do examples or uh guides and tutorials.

143
00:13:20,942 --> 00:13:21,822
That absolutely makes sense.

144
00:13:21,822 --> 00:13:33,342
I guess one of the things I sort of realized early on is that the message that I heard when I was writing documentation in, like, wiki docs was: who is your audience?

145
00:13:33,342 --> 00:13:39,842
And I think all my English teachers from my entire academic career would always try to get
this point across.

146
00:13:39,842 --> 00:13:43,542
And I'll say I never understood what that actually meant.

147
00:13:43,562 --> 00:13:46,638
But I stumbled upon a few years ago...

148
00:13:46,638 --> 00:13:51,958
uh There's a website, diataxis.fr, which I think really makes the point here.

149
00:13:51,958 --> 00:13:55,284
And you brought up whether they're how-to guides or release notes.

150
00:13:55,284 --> 00:13:57,719
But really the point is, who is this for?

151
00:13:57,719 --> 00:13:59,441
Why am I even writing this?

152
00:13:59,630 --> 00:14:02,881
I think developers are notoriously hard to track as well.

153
00:14:02,881 --> 00:14:05,592
So sometimes you don't get all of the data you wish you had.

154
00:14:05,592 --> 00:14:12,354
um But I do have information on what people are searching for in the search bar.

155
00:14:12,354 --> 00:14:22,236
I get information on what people are asking in the AI search and um what percent of their questions are being answered correctly.

156
00:14:22,236 --> 00:14:28,638
Then I also get feedback on every specific page.

157
00:14:28,962 --> 00:14:31,663
Did this page help you or not help you?

158
00:14:31,764 --> 00:14:34,986
Or feedback on specific code examples.

159
00:14:35,567 --> 00:14:44,022
And also inside the product, we send out a survey to users on, I don't know, I think a three to six month basis.

160
00:14:44,022 --> 00:14:47,094
You get a survey request and then some of them fill it out.

161
00:14:47,094 --> 00:14:49,816
And then there's questions around the documentation there.

162
00:14:49,816 --> 00:14:55,881
So I try to rely on all of those different feedback points to figure out: is this page working well?

163
00:14:55,881 --> 00:14:58,338
And is it solving the problem that I

164
00:14:58,338 --> 00:14:59,364
think it should be solving.

165
00:14:59,364 --> 00:15:00,710
uh

166
00:15:01,066 --> 00:15:10,538
I mean, are you using the information to decide, like, what to write next or to change pages, to improve docs in certain areas, or is there something else there?

167
00:15:10,690 --> 00:15:15,002
Yeah, I think one of the most useful ones actually is the AI search.

168
00:15:15,002 --> 00:15:24,707
And the dashboard will give you an indication, like ah what are some of your content gaps
or what are some of the questions that people struggle with and don't get an answer in

169
00:15:24,707 --> 00:15:26,057
your documentation.

170
00:15:26,117 --> 00:15:36,642
And that is one of the main ways I prioritize what to work on next besides the new
features and the developments from the engineering teams.

171
00:15:36,982 --> 00:15:39,773
I have a question that's going to be controversial.

172
00:15:39,773 --> 00:15:51,707
Internal uh documentation for your architecture, for your services, for other developers: does it live inside a Git repository, committed to source code?

173
00:15:51,707 --> 00:15:59,850
Or does it go in some sort of open platform that enables anyone to uh update, edit, write
it as much as they want?

174
00:16:00,098 --> 00:16:06,580
Well, we're an open source company, almost everything that we do is in the open source
repo.

175
00:16:06,580 --> 00:16:17,453
And we also document our architecture decision records and things like that, because we have quite a lot of contributors who also help with developing the SDKs.

176
00:16:17,453 --> 00:16:22,604
And it's very useful for them to understand the evolution of the product.

177
00:16:22,604 --> 00:16:27,555
And so it's either in the docs or um inside the open source repo.

178
00:16:27,555 --> 00:16:29,886
The documentation is open source as well.

179
00:16:30,062 --> 00:16:39,366
um We have a few things internally, around the cloud architecture, or like runbooks, things like that.

180
00:16:39,366 --> 00:16:41,607
But ah I would say not a ton.

181
00:16:41,607 --> 00:16:52,651
We try to really talk about everything publicly and our customer success team and our
support function very heavily relies on the documentation.

182
00:16:53,252 --> 00:16:58,594
So we don't have a ton of things that would only live in Slack or only live in...

183
00:16:59,948 --> 00:17:02,826
a private site. And I think that's quite good.

184
00:17:02,956 --> 00:17:17,494
Yeah, I think that one of the arguments is that there's a huge overhead to getting documentation into a Git repository, that the whole pull request, review, and merging process is

185
00:17:17,494 --> 00:17:20,495
non-trivial for someone without technical capabilities.

186
00:17:20,495 --> 00:17:29,861
uh Whereas on the other side, it's, I want it to be accurate and I want it to be reviewed
before it goes uh public or becomes the source of truth there.

187
00:17:29,861 --> 00:17:32,686
So it sounds like you're more on the side of, like,

188
00:17:32,686 --> 00:17:35,266
Let's have it in source control.

189
00:17:35,266 --> 00:17:40,646
so my question is going to be, what do you do for the non-technical writers?

190
00:17:40,646 --> 00:17:41,846
You'd be like, well, too bad.

191
00:17:41,846 --> 00:17:47,106
Technical writers should always have a technical background and be able to use Git in
order to write documentation.

192
00:17:47,608 --> 00:17:54,118
I think it is simpler than it initially sounds, like learning Git and working in Markdown.

193
00:17:54,118 --> 00:18:02,744
I would say that if you understand the technology that you're working on well enough, you probably are not going to have a lot of trouble learning docs as code.

194
00:18:02,744 --> 00:18:12,988
um But a lot of the documentation platforms, like Fern, also offer a no-code editor.

195
00:18:12,988 --> 00:18:17,150
I personally have not used that a ton, but I have given access to

196
00:18:17,236 --> 00:18:26,273
this editor to, for example, my boss, who's not really got time to learn how to download the repository and work in it.

197
00:18:26,273 --> 00:18:27,183
I'm sure you could do it.

198
00:18:27,183 --> 00:18:28,815
He's just not got the time.

199
00:18:28,815 --> 00:18:38,462
So he's got access to this no-code platform, and I think it works like Notion: you can drag and drop elements around and fix just, like, a small typo, and it will open a pull request on

200
00:18:38,462 --> 00:18:39,118
your behalf.

201
00:18:39,118 --> 00:18:51,266
So, like, all the benefits of a WYSIWYG, Notion-like or Confluence-like documentation portal, but backed by uh some sort of Git repository that gets updated and is auditable and

202
00:18:51,266 --> 00:18:54,008
trackable and reviewable straight away.

203
00:18:54,008 --> 00:19:06,897
So I do think, though, you're in an area where you're forced to have a disciplined approach to what it looks like, both from a tool usage standpoint and documentation as far

204
00:19:06,897 --> 00:19:08,738
as rollout goes because

205
00:19:09,026 --> 00:19:15,359
The primary aspect of your business, I feel, is well aligned with this sort of mentality, and that's feature flags.

206
00:19:15,596 --> 00:19:17,046
Yes, that is true.

207
00:19:17,647 --> 00:19:28,013
Yeah, what's been a little bit scary for me is this past year, like 2025, was uh kind of a massive year for these cloud outages, right?

208
00:19:28,013 --> 00:19:33,195
So you probably remember the AWS one, the GCP one, the Cloudflare one.

209
00:19:33,336 --> 00:19:45,378
And in my mind, some of these teams are world-class engineering teams who basically wrote the books on reliability, and we look up to them for their practices.

210
00:19:45,378 --> 00:19:56,778
To see some of these things fall apart was a bit scary. I never worked at Google, but I assume they have world-class CI/CD pipelines and a super

211
00:19:56,778 --> 00:19:58,569
sophisticated setup.

212
00:19:58,569 --> 00:20:01,892
And they're amazing at DevOps and getting code into production.

213
00:20:01,892 --> 00:20:07,026
But even companies like that are not good at staying in control of that code once it is in
production.

214
00:20:07,026 --> 00:20:14,592
So that massive GCP outage that happened in, I think, June: it was because of a single-line policy change in Google's IAM.

215
00:20:14,656 --> 00:20:23,626
And they merged the code and it was live for a couple of weeks, I think, before it
suddenly got activated and then half of the internet was down.

216
00:20:23,626 --> 00:20:29,660
I looked at the incident report in quite a lot of detail to see what was going on.

217
00:20:29,660 --> 00:20:35,264
They were able to identify the root cause in about 10 minutes, which I think is pretty good.

218
00:20:35,305 --> 00:20:40,269
And they prepared the rollback and redeployed within 40 minutes, also OK.

219
00:20:40,269 --> 00:20:41,464
But like the...

220
00:20:41,464 --> 00:20:50,220
whole outage still lasted about four hours due to all of these systemic recovery delays and backlog clearing and stuff.

221
00:20:50,220 --> 00:21:01,217
And that's the real scary part: even if you have perfect DevOps and you're really good at getting code into production, uh DevOps cannot bring your code back up fast.

222
00:21:01,217 --> 00:21:02,919
And this is where it comes in.

223
00:21:02,919 --> 00:21:11,424
I've always been a long-time advocate for feature flags, but these stories from the past year really kind of reinforce that you really need that

224
00:21:11,424 --> 00:21:17,889
added runtime control where, you know, your rollback is seconds rather than, like, four hours.

225
00:21:17,889 --> 00:21:20,773
And if I remember correctly, both the Cloudflare

226
00:21:20,830 --> 00:21:25,492
And the Google outage in the action items are kind of the summary of the incident.

227
00:21:25,492 --> 00:21:27,703
say, we wish this was behind a feature flag.

228
00:21:27,703 --> 00:21:30,854
So yeah, I think that it's a good one to look at.

229
00:21:30,854 --> 00:21:39,178
And if you have time to read through those incident reports, I would recommend it because
um I would guess that Google has built their feature flagging platform internally and so

230
00:21:39,178 --> 00:21:41,879
have some of these bigger companies, like Amazon.

231
00:21:41,879 --> 00:21:43,920
But something is going on there.

232
00:21:43,920 --> 00:21:49,464
And maybe it's the fact that they're treating some of these backend changes like
configuration updates or

233
00:21:49,464 --> 00:21:57,498
policy changes, as if they don't need a flag. And a lot of times people think of flags as, like, UI changes or kind of these more cosmetic things.

234
00:21:57,498 --> 00:22:07,170
But we've seen that it is also very useful for some of these more behind-the-scenes hidden changes, because they can also lead to real outages that are then difficult to roll back.
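The idea of gating a backend or policy change behind a flag can be sketched in a few lines. This is a hypothetical illustration, not Google's or Unleash's actual code; the flag name, the stand-in flag client, and the policy functions are all invented for the example.

```python
class StaticFlags:
    """Stand-in for a feature-flag client; any real SDK works the same way."""
    def __init__(self, enabled):
        self.enabled = set(enabled)

    def is_enabled(self, name):
        return name in self.enabled

def evaluate_policy_v1(request):
    return "allow"   # known-good existing behavior

def evaluate_policy_v2(request):
    return "deny"    # the risky new change

def apply_policy(request, flags):
    # Gating the backend change behind a flag means rollback is a
    # runtime toggle that takes seconds, not a redeploy that takes hours.
    if flags.is_enabled("new-policy-path"):
        return evaluate_policy_v2(request)
    return evaluate_policy_v1(request)
```

Toggling `new-policy-path` off immediately routes all traffic back through the known-good path, which is exactly the escape hatch the incident reports wished for.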

235
00:22:07,170 --> 00:22:08,642
You bring up a really interesting point there.

236
00:22:08,642 --> 00:22:17,046
So historically, the outages at, say, AWS, which is the one that I've been tracking, are usually due to very complex sets of

237
00:22:17,046 --> 00:22:24,893
root causes, like not just one thing: it was a race condition plus a number of other things. Not like the time they brought down the whole internet because some engineer

238
00:22:24,893 --> 00:22:30,188
switched off like the S3 bucket database, ah which did happen.

239
00:22:30,188 --> 00:22:38,345
The thing about the policy is that, ah... so I don't even count, like, Azure or GCP outages anymore, because they always seem like simple things that happen.

240
00:22:38,466 --> 00:22:43,890
They put uh multiple different uh availability zones in the same data center.

241
00:22:43,890 --> 00:22:47,351
There's that one, and there was a flood, and so the region was offline.

242
00:22:47,452 --> 00:22:56,958
But yeah, I think the Cloudflare one was definitely like, we didn't treat uh changing a feature flag with the same attention that we would give to merging a pull request.

243
00:22:56,958 --> 00:23:02,914
And configuration changes are just as critical and can have widespread effects, especially
if you have

244
00:23:02,914 --> 00:23:12,069
your code behind a feature flag, then you should really be giving it the same attention or even more, because that's where the critical path is at that moment.

245
00:23:12,069 --> 00:23:20,424
I mean, I think it's an interesting point that you bring up, especially these hyperscalers
where you want to trust them that they are uh much more rigorous in reliability.

246
00:23:20,424 --> 00:23:30,249
And as you pointed out, they literally wrote the docs on it. Well, like, Cloudflare talks about edge workers, and GCP, you know, has the SRE book, whatever.

247
00:23:30,249 --> 00:23:32,334
And AWS has like a whole

248
00:23:32,334 --> 00:23:35,714
portal dedicated to high reliability stuff.

249
00:23:35,954 --> 00:23:39,634
And obviously they're down all the time.

250
00:23:40,234 --> 00:23:42,234
yeah, there is a question there.

251
00:23:42,234 --> 00:23:47,074
It's like, if they can't get it right, how is anyone else supposed to?

252
00:23:47,532 --> 00:23:56,096
Well, we've definitely tried to build similar sorts of capabilities into the platform as
you have on a pull request in GitHub: a diff and review of what's

253
00:23:56,096 --> 00:23:58,026
changing in this feature flag.

254
00:23:58,026 --> 00:24:01,508
The same way, you can add any number of required approvals.

255
00:24:01,508 --> 00:24:10,702
And we really try to lock down our production environment so that at least two developers
need to approve a change to a feature flag in production.

256
00:24:10,702 --> 00:24:11,792
And I think that helps.
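
The approval gate described here, where a production flag change only applies once multiple distinct developers have signed off, can be sketched roughly like this. This is a hypothetical in-house model for illustration; `ChangeRequest`, `required_approvals`, and the other names are assumptions, not Unleash's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    """A proposed feature-flag change that needs peer review, like a pull request."""
    flag: str
    new_state: bool
    author: str
    required_approvals: int = 2          # lock down production: two developers must sign off
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # Like on GitHub: you cannot approve your own change.
        if reviewer == self.author:
            raise ValueError("authors cannot approve their own change")
        self.approvals.add(reviewer)

    def can_apply(self) -> bool:
        return len(self.approvals) >= self.required_approvals

cr = ChangeRequest(flag="new-checkout", new_state=True, author="carol")
cr.approve("alice")
assert not cr.can_apply()                # one approval is not enough
cr.approve("bob")
assert cr.can_apply()                    # two distinct approvals unlock the change
```

The point is the same as with code review: the flag flip is treated as a change under review, with a diff and a minimum approver count, rather than a button anyone can press.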

257
00:24:11,792 --> 00:24:16,608
But another thing that's helped us at Unleash is we love boring technology.

258
00:24:16,608 --> 00:24:26,944
So we're one of those people who, you know, love to try new things, but when it comes to
the products that we're building for our customers, we really try to prioritize the

259
00:24:26,944 --> 00:24:29,085
truly tried and tested stuff.

260
00:24:29,085 --> 00:24:34,388
You know, with feature flags, latency is a big thing, right?

261
00:24:34,388 --> 00:24:39,371
So when you toggle a flag, you want um it to take effect as quickly as possible.

262
00:24:39,371 --> 00:24:44,194
So architecturally, you can decide, do you want to do streaming or do you want to do
polling?

263
00:24:44,194 --> 00:24:47,356
between your SDKs and the feature flag server.

264
00:24:47,597 --> 00:24:51,570
So streaming is very sexy, and updates are instant.

265
00:24:51,570 --> 00:24:55,693
But we can see that when there's an outage, things take longer to propagate.

266
00:24:55,693 --> 00:25:01,627
And we try to do an approach where, for example, we do polling in our SDKs.

267
00:25:01,627 --> 00:25:05,190
You configure the polling interval, and it's very reliable.

268
00:25:05,190 --> 00:25:08,973
But if you really care about instant latency, we do offer streaming.

269
00:25:08,973 --> 00:25:10,764
But we always fall back to polling.
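
The streaming-with-polling-fallback approach described here can be sketched like this. This is a hypothetical client for illustration; `fetch`, `stream`, and `poll_interval` are assumed names, not Unleash's actual SDK API:

```python
class FlagClient:
    """Prefer streaming for low latency, but degrade to plain polling when the
    stream is unavailable: the boring, reliable path always works."""

    def __init__(self, fetch, stream=None, poll_interval=15.0):
        self.fetch = fetch               # callable returning the full flag state
        self.stream = stream             # optional callable yielding updates, may raise
        self.poll_interval = poll_interval  # configurable, like an SDK polling interval
        self.flags = {}

    def refresh_once(self):
        if self.stream is not None:
            try:
                self.flags.update(self.stream())
                return "stream"
            except Exception:
                pass                     # stream outage: fall through to polling
        self.flags = self.fetch()        # reliable polling path
        return "poll"

def fetch():
    return {"new-checkout": True}

def broken_stream():
    raise ConnectionError("stream down")

client = FlagClient(fetch, stream=broken_stream)
assert client.refresh_once() == "poll"   # falls back when streaming fails
assert client.flags["new-checkout"] is True
```

This mirrors the trade-off in the conversation: streaming gives instant propagation when healthy, while polling bounds how stale a client can get during an outage to one poll interval.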

270
00:25:10,764 --> 00:25:18,057
Yeah, I can definitely see that there is a spectrum here, right, for companies in why
they're introducing the flags.

271
00:25:18,057 --> 00:25:26,587
I think early on, they are maybe not really specific about what their goal is and then
throw the same approach at every single instance.

272
00:25:26,587 --> 00:25:33,923
Whereas really, like anything, even if you have a provider backing you, you need to pay
attention to what the right way to approach this is.

273
00:25:33,923 --> 00:25:38,497
Because on one side, I would say latency is actually good here.

274
00:25:38,497 --> 00:25:40,236
You prefer to be slow.

275
00:25:40,236 --> 00:25:41,406
to get the reliability.

276
00:25:41,406 --> 00:25:48,219
But having real-time switches, I know, is what the marketing department wants.

277
00:25:48,219 --> 00:26:00,535
Anyone who's done anything with high-reliability systems knows that the trade-off is
cache invalidation or extra network requests. Very fast polling or streaming means keeping

278
00:26:00,535 --> 00:26:01,175
extra connections up.

279
00:26:01,175 --> 00:26:05,587
So you're paying the cost somewhere else, which isn't necessarily a good thing.

280
00:26:05,587 --> 00:26:08,430
In one of the previous episodes, we were discussing

281
00:26:08,430 --> 00:26:17,850
with the guest about how great it was that they could avoid needing updates faster than,
like, one second or one minute, and tolerate it being wrong for that long.

282
00:26:17,850 --> 00:26:20,970
Like it's okay to wait 60 seconds before this gets rolled out.

283
00:26:20,970 --> 00:26:27,970
It's not that critical that everyone who comes to the website sees the updated version,
you know, two seconds from now.

284
00:26:27,970 --> 00:26:36,480
Like you don't need that level of precision, because the trade-off is a high risk to your
product, or you're working in production.

285
00:26:36,480 --> 00:26:46,478
And I think risk is a huge aspect here because the research from Dora, which we actually
talked a lot about in the 2025 report episode, says that more untested code is actually

286
00:26:46,478 --> 00:26:48,631
getting into production because of AI.

288
00:26:50,404 --> 00:26:54,892
More code, velocity is up, but, like, stability is down.

289
00:26:54,892 --> 00:26:57,083
Yeah, quality is down.

290
00:26:57,083 --> 00:27:02,504
The interesting thing is that people feel like they are being more productive, quote
unquote, whatever that means.

291
00:27:02,504 --> 00:27:11,487
But the actual quality metrics show us that solutions are getting worse for the end
users, for customers, for clients.

292
00:27:11,487 --> 00:27:23,032
And so there is this question of, if you're using feature flags, there can be a tendency
to throw the work over the wall and rely on just enabling the flag.

293
00:27:23,032 --> 00:27:31,054
How do you get teams to be disciplined about making sure stuff is tested before
actually deploying it to production, even though it's behind a flag?

294
00:27:31,054 --> 00:27:34,616
Actually, what we do at Unleash is breakathons with every new feature.

295
00:27:34,616 --> 00:27:43,321
So we enable the feature in production only for ourselves and then get together on a
Google Meet call, like all of us, and try and break it.

296
00:27:43,401 --> 00:27:47,564
So we spend like an hour together and try to find all of the things that are wrong with
it.

297
00:27:47,564 --> 00:27:49,124
And it's very fun actually.

298
00:27:49,124 --> 00:27:54,938
Maybe not so much for the person who built the thing, but for everyone else it's very
fun.

299
00:27:54,938 --> 00:27:58,680
So once you're happy with it internally, you start rolling out.

300
00:27:58,680 --> 00:28:01,511
We roll it out to, say, five or 10%.

301
00:28:01,511 --> 00:28:08,133
You can put some automation in place that says: the error rates are below this threshold
for, let's say, 12 hours.

302
00:28:08,133 --> 00:28:18,836
Then we can progress to the next stage, which may be only 10%, or it could be 50%, or
maybe it's a segment of your customers that you think or you know are more like

303
00:28:18,836 --> 00:28:24,277
experimental and ready to try new things rather than the ones where you really need that
stability.
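
Rolling out to "say, five or 10%" while keeping the decision stable per user is typically done by hashing each user into a fixed bucket. A rough sketch; the hash choice here is illustrative (real feature-flag servers may use something like MurmurHash3), and the function names are assumptions:

```python
import hashlib

def rollout_bucket(flag: str, user_id: str) -> float:
    """Deterministically map (flag, user) to a bucket in [0, 100).
    The same user always lands in the same bucket for a given flag,
    so the rollout decision is sticky across requests."""
    digest = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 10000 / 100.0

def is_enabled(flag: str, user_id: str, rollout_percent: float) -> bool:
    return rollout_bucket(flag, user_id) < rollout_percent

# Stickiness: the decision for a given user never flip-flops between requests.
assert is_enabled("new-checkout", "user-42", 50) == is_enabled("new-checkout", "user-42", 50)

# Raising the percentage only ever turns users on, never off:
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if is_enabled("new-checkout", u, 10)}
at_50 = {u for u in users if is_enabled("new-checkout", u, 50)}
assert at_10 <= at_50
```

Because the bucket is fixed per user, moving from 10% to 50% is strictly additive: everyone who already had the feature keeps it, which is what makes staged rollouts safe to progress.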

304
00:28:24,277 --> 00:28:27,788
Then you kind of progress through these stages and you can go away from your

305
00:28:27,788 --> 00:28:37,027
laptop, you can go to sleep or whatever, and you know that if those metrics spike, then
your rollout can be paused automatically or go back to the previous stage or whatever you

306
00:28:37,027 --> 00:28:37,735
kind of define.
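
The unattended progression described here, advance when error rates stay under a threshold, pause or fall back when they spike, can be sketched as a small state machine. Stages, threshold, and function names are illustrative assumptions, not a real Unleash API:

```python
STAGES = [5, 10, 50, 100]    # rollout percentages, progressed automatically

def next_stage(current: int, error_rate: float, threshold: float = 0.01) -> int:
    """Decide the next rollout percentage after an observation window
    (e.g. 12 hours): progress when healthy, fall back or pause on a spike."""
    i = STAGES.index(current)
    if error_rate < threshold:
        # Healthy window: advance one stage (cap at full rollout).
        return STAGES[min(i + 1, len(STAGES) - 1)]
    # Metrics spiked: go back to the previous stage, or pause at 0%.
    return STAGES[i - 1] if i > 0 else 0

assert next_stage(5, error_rate=0.001) == 10   # healthy, advance to 10%
assert next_stage(50, error_rate=0.05) == 10   # error spike, fall back
assert next_stage(5, error_rate=0.05) == 0     # spike at first stage: pause
```

This is the "go to sleep" property: the rollout only moves forward through stages you defined, and a metric spike automatically reverses or halts it without anyone at a laptop.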

307
00:28:37,735 --> 00:28:42,741
So I think we're probably at a good point to move over to picks for the episode.

308
00:28:42,741 --> 00:28:43,222
Nice.

309
00:28:43,222 --> 00:28:46,356
So I'll ask you, Melinda, what did you bring for the audience today?

310
00:28:46,726 --> 00:28:47,987
So I brought a game.

311
00:28:47,987 --> 00:28:51,166
It's a game called Wavelength.

312
00:28:51,166 --> 00:28:52,486
Do you know it?

313
00:28:53,089 --> 00:28:56,911
No. It's like a, how do I describe it, like a communication,

314
00:28:56,911 --> 00:29:05,514
collaborative communication game, and it's something that I've played with five-year-olds
and my friends and family, but also, like, my co-workers.

315
00:29:05,514 --> 00:29:14,230
We actually love to play this game. We're a remote team and we have a team hour every
Friday, and we always pick a game, and this is something we played

316
00:29:14,230 --> 00:29:16,911
recently and like had tons of fun with it.

317
00:29:16,911 --> 00:29:22,244
The way it works is, it's a board game, but there's also a digital version, like an app
version, that is quite good.

318
00:29:22,244 --> 00:29:26,305
You get a spectrum with two extremes at the ends, and it changes every round.

319
00:29:26,305 --> 00:29:31,197
And let's say in one round, you get a scale, which is from good pizza topping to bad pizza
topping.

320
00:29:31,197 --> 00:29:35,099
And you get a random point on the scale, and only you see that point on the scale.

321
00:29:35,099 --> 00:29:41,522
And so if it was like off center towards bad pizza topping, you have to come up with a
clue to your team.

322
00:29:41,802 --> 00:29:44,096
to help them identify where that point on the scale is.

323
00:29:44,096 --> 00:29:47,149
So I would say maybe, like, pineapple, and the team has to, like...

324
00:29:47,519 --> 00:29:49,445
Going straight for controversy right there.

325
00:29:49,445 --> 00:29:51,521
She's like decided before the episode.

326
00:29:51,521 --> 00:29:52,023
You know what?

327
00:29:52,023 --> 00:29:53,738
I was gonna be pineapple on pizza.

328
00:29:53,738 --> 00:29:54,976
That's gonna be my example

329
00:29:54,976 --> 00:29:57,689
Just trying to help you with the YouTube comments, you know?

330
00:29:58,911 --> 00:30:08,852
And so the team, like, debates and discusses what pineapple must mean and tries to
identify that point on the scale, and then you score points based on that.

331
00:30:08,852 --> 00:30:10,604
But it's so much fun.

332
00:30:10,925 --> 00:30:15,512
So I really recommend it if you're looking for something to play with your team or at

333
00:30:15,512 --> 00:30:17,925
Yeah, really embodying that Italian philosophy there.

334
00:30:17,925 --> 00:30:20,297
ah Yeah.

335
00:30:20,297 --> 00:30:22,630
So my pick is actually a television show this time.

336
00:30:22,630 --> 00:30:24,452
It's uh called Bosch.

337
00:30:24,452 --> 00:30:26,494
It's an LA detective procedural.

338
00:30:26,494 --> 00:30:27,095
Have you heard of it?

339
00:30:27,095 --> 00:30:34,763
I started watching in December, and I've binged, like, all 10-plus seasons of it.

340
00:30:34,763 --> 00:30:37,526
Because it's just so great.

341
00:30:37,526 --> 00:30:38,328
I love it as well.

342
00:30:38,328 --> 00:30:42,094
I love detective shows and like hospital dramas and stuff.

343
00:30:42,150 --> 00:30:50,955
I don't do the hospital dramas, but I think the main actor, Titus Welliver, is just
absolutely fantastic.

344
00:30:50,995 --> 00:30:56,498
It reminded me a lot of Law and Order, which I watched a lot when I was younger.

345
00:30:56,578 --> 00:30:58,349
And it's so much better.

346
00:30:58,349 --> 00:31:01,821
Honestly, this may be one of the best procedurals I've ever seen.

347
00:31:01,821 --> 00:31:10,188
And it does these nice skips during the show to get rid of downtime that you would
otherwise have to deal with.

348
00:31:10,188 --> 00:31:12,178
You never know what's going on.

349
00:31:12,178 --> 00:31:20,445
You get dropped into the middle of a situation and it's like, I'm still trying to figure
out is there going to be a crime or what's going on in these people's lives at this

350
00:31:20,445 --> 00:31:21,285
moment?

351
00:31:21,306 --> 00:31:26,289
it's like always getting into a new show, which I find they really captured well.

352
00:31:26,289 --> 00:31:30,112
It doesn't feel like every season is just like a continuation of the one before it.

353
00:31:30,112 --> 00:31:32,908
It does feel new every time you watch it.

354
00:31:32,908 --> 00:31:34,041
Yeah, plus one for that.

355
00:31:34,041 --> 00:31:34,942
Go watch it.

356
00:31:34,942 --> 00:31:37,926
Well, thank you so much, Melinda, for coming on today's episode.

357
00:31:37,926 --> 00:31:39,869
It's been absolutely fantastic.

358
00:31:39,869 --> 00:31:45,555
Feature flags and how to use them correctly and, most importantly, what's next in
documentation.

359
00:31:46,197 --> 00:31:53,526
And thanks to all the listeners and viewers for joining today's episode, and I hope
we'll see everyone back next week.

