Sorry

This feed does not validate.

Source: https://www.oreilly.com/radar/feed/index.xml

  1. <?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
  2. xmlns:content="http://purl.org/rss/1.0/modules/content/"
  3. xmlns:media="http://search.yahoo.com/mrss/"
  4. xmlns:wfw="http://wellformedweb.org/CommentAPI/"
  5. xmlns:dc="http://purl.org/dc/elements/1.1/"
  6. xmlns:atom="http://www.w3.org/2005/Atom"
  7. xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
  8. xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
  9. xmlns:custom="https://www.oreilly.com/rss/custom"
  10.  
  11. >
  12.  
  13. <channel>
  14. <title>Radar</title>
  15. <atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
  16. <link>https://www.oreilly.com/radar</link>
  17. <description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
  18. <lastBuildDate>Thu, 16 Oct 2025 11:13:24 +0000</lastBuildDate>
  19. <language>en-US</language>
  20. <sy:updatePeriod>
  21. hourly </sy:updatePeriod>
  22. <sy:updateFrequency>
  23. 1 </sy:updateFrequency>
  24. <generator>https://wordpress.org/?v=6.8.2</generator>
  25.  
  26. <image>
  27. <url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
  28. <title>Radar</title>
  29. <link>https://www.oreilly.com/radar</link>
  30. <width>32</width>
  31. <height>32</height>
  32. </image>
  33. <item>
  34. <title>Generative AI in the Real World: Context Engineering with Drew Breunig</title>
  35. <link>https://www.oreilly.com/radar/?post_type=podcast&#038;p=17562</link>
  36. <comments>https://www.oreilly.com/radar/?post_type=podcast&#038;p=17562#respond</comments>
  37. <pubDate></pubDate>
  38. <dc:creator><![CDATA[Ben Lorica and Drew Breunig]]></dc:creator>
  39. <category><![CDATA[Generative AI in the Real World]]></category>
  40. <category><![CDATA[Podcast]]></category>
  41.  
  42. <guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=17562</guid>
  43.  
  44. <enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3" length="0" type="audio/mpeg" />
  45. <media:content
  46. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png"
  47. medium="image"
  48. type="image/png"
  49. width="2560"
  50. height="2560"
  51. />
  52.  
  53. <media:thumbnail
  54. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png"
  55. width="160"
  56. height="160"
  57. />
  58. <description><![CDATA[In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and [&#8230;]]]></description>
  59. <content:encoded><![CDATA[
  60. <p>In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what the current crop of LLMs is actually capable of.</p>
  61.  
  62.  
  63.  
  64. <p><strong>About the <em>Generative AI in the Real World </em>podcast</strong>: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>
  65.  
  66.  
  67.  
  68. <p>Check out <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*m7f70i*_ga*MTYyODYzMzQwMi4xNzU4NTY5ODYz*_ga_092EL089CH*czE3NTkxNzAwODUkbzE0JGcwJHQxNzU5MTcwMDg1JGo2MCRsMCRoMA.." target="_blank" rel="noreferrer noopener">other episodes</a> of this podcast on the O’Reilly learning platform.</p>
  69.  
  70.  
  71.  
  72. <h2 class="wp-block-heading">Transcript</h2>
  73.  
  74.  
  75.  
  76. <p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>
  77.  
  78.  
  79.  
  80. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=0" target="_blank" rel="noreferrer noopener">00.00</a>: <strong>All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he&#8217;s also in the process of writing a book for O&#8217;Reilly called the <em>Context Engineering Handbook</em>. And with that, Drew, welcome to the podcast.</strong></p>
  81.  
  82.  
  83.  
  84. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=23" target="_blank" rel="noreferrer noopener">00.23</a>: Thanks, Ben. Thanks for having me on here. </p>
  85.  
  86.  
  87.  
  88. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=26" target="_blank" rel="noreferrer noopener">00.26</a>: <strong>So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What&#8217;s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it&#8217;s the delicate art and science of filling the context window with just the right information. What&#8217;s broken with how teams think about context today? </strong></p>
  89.  
  90.  
  91.  
  92. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=56" target="_blank" rel="noreferrer noopener">00.56</a>: I think it&#8217;s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the <a href="https://learning.oreilly.com/library/view/prompt-engineering-for/9781098153427/" target="_blank" rel="noreferrer noopener">prompt engineering book</a> for O&#8217;Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? And I think it has to do with the way the models, and the way they&#8217;re being built, are evolving. I think it also has to do with the way that we&#8217;re learning how to use these models. </p>
  93.  
  94.  
  95.  
  96. <p>And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you&#8217;re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing.&nbsp;</p>
  97.  
  98.  
  99.  
  100. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=120" target="_blank" rel="noreferrer noopener">02.00</a>: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you&#8217;re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt. </p>
  101.  
  102.  
  103.  
  104. <p>But when we start to have this multiturn, systematic editing and preparation of contexts, a new word was needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it&#8217;s an evolution. And it shows how we&#8217;re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programming, whereas prompt engineering lives in how people use chatbots, which is a different field. It&#8217;s not better and not worse.&nbsp;</p>
  105.  
  106.  
  107.  
  108. <p>And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context and setting up systems that fix and edit your context, if that makes sense.&nbsp;</p>
  109.  
  110.  
  111.  
  112. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=213" target="_blank" rel="noreferrer noopener">03.33</a>: <strong>Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it&#8217;s the prompt. So you&#8217;re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window. </strong></p>
  113.  
  114.  
  115.  
  116. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=242" target="_blank" rel="noreferrer noopener">04.02</a>: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn&#8217;t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stewart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing. </p>
  117.  
  118.  
  119.  
  120. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=276" target="_blank" rel="noreferrer noopener">04.36</a>: <strong>So the word “context” means you&#8217;re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific.</strong> </p>
  121.  
  122.  
  123.  
  124. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=295" target="_blank" rel="noreferrer noopener">04.55</a>: And I think it also is like, it has to be edited by a person. I&#8217;m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it&#8217;s a good delineation. </p>
  125.  
  126.  
  127.  
  128. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=314" target="_blank" rel="noreferrer noopener">05.14</a>: <strong>And then I think one of the very immediate lessons that people realize is, just because. . . </strong></p>
  129.  
  130.  
  131.  
  132. <p><strong>So one of the things that these model providers do when they have a model release, one of the things they note, is: What&#8217;s the size of the context window? So people started associating context window [with] “I stuff as much as I can in there.” But the reality is actually that, one, it&#8217;s not efficient. And two, it also is not useful to the model. Just because you have a massive context window doesn&#8217;t mean that the model treats the entire context window evenly.</strong></p>
  133.  
  134.  
  135.  
  136. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=357" target="_blank" rel="noreferrer noopener">05.57</a>: Yeah, it doesn&#8217;t treat it evenly. And it&#8217;s not a one-size-fits-all solution. So I don&#8217;t know if you remember last year, but that was the big dream, which was, “Hey, we&#8217;re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don&#8217;t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope. </p>
  137.  
  138.  
  139.  
  140. <p>And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is.&nbsp;.&nbsp;.&nbsp; Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.</p>
  141.  
  142.  
  143.  
  144. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=421" target="_blank" rel="noreferrer noopener">07.01</a>: Perhaps the thing that clicked for me was in <a href="https://arxiv.org/abs/2507.06261" target="_blank" rel="noreferrer noopener">Google&#8217;s Gemini 2.5 paper</a>. Fantastic paper. And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.</p>
  145.  
  146.  
  147.  
  148. <p>And buried in there was a real “warts and all” case study, which are my favorite kind: when you talk about the hard things and especially when you cite the things you can&#8217;t overcome. And Gemini 2.5 had a million-token context window with, eventually, 2 million tokens coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons. They start to hallucinate. One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge.”&nbsp;</p>
  149.  
  150.  
  151.  
  152. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=502" target="_blank" rel="noreferrer noopener">08.22</a>: So inside every model there&#8217;s a knowledge base. There&#8217;s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model. </p>
  153.  
  154.  
  155.  
  156. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=523" target="_blank" rel="noreferrer noopener">08.43</a>: So that&#8217;s one way it can fail. We call this “context distraction,” though Kelly Hong at Chroma has written an <a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">incredible paper documenting this</a>, which she calls “context rot,” which is a similar way [of] charting when these benchmarks start to fall apart.</p>
  157.  
  158.  
  159.  
  160. <p>Now the cool thing about this is that you can actually use this to your advantage. There&#8217;s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for.&nbsp;.&nbsp;.&nbsp;</p>
  161.  
  162.  
  163.  
  164. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=553" target="_blank" rel="noreferrer noopener">09.13</a>: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something it doesn&#8217;t know how to do by providing examples in your context. And those examples illustrate how it should perform. It&#8217;s not something that it&#8217;s seen before. It&#8217;s not in the weights. It&#8217;s a completely unique problem. </p>
  165.  
  166.  
  167.  
  168. <p>Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context. And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.</p>
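<p><em>A minimal illustration of the in-context learning described here: the examples in the prompt, not the weights, define a made-up output format. The task, prompt text, and <code>call_llm</code> helper below are illustrative, not from the episode.</em></p>

<pre class="wp-block-code"><code># A made-up formatting task: nothing in the weights defines this exact
# "DD-MON" code, so the in-context examples are what teach the model it.
prompt = """Rewrite each date in the compact code shown in the examples.

Input: March 5, 2024
Output: 05-MAR

Input: November 19, 1999
Output: 19-NOV

Input: July 2, 2031
Output:"""

# response = call_llm(prompt)  # hypothetical client call; any chat/completions API works
</code></pre>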
  169.  
  170.  
  171.  
  172. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=597" target="_blank" rel="noreferrer noopener">09.57</a>: And so all of this is just to try to illustrate the complexity of what&#8217;s going on here and how I think one of the traps that leads us to this place is that the gift and the curse of LLMs is that we prompt and build contexts that are in the English language or whatever language you speak. And so that leads us to believe that they&#8217;re going to react like other people or entities that read the English language.</p>
  173.  
  174.  
  175.  
  176. <p>And the fact of the matter is, they don&#8217;t—they&#8217;re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in.&nbsp;</p>
  177.  
  178.  
  179.  
  180. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=635" target="_blank" rel="noreferrer noopener">10.35</a>: <strong>This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights so they have prior beliefs. They saw something [on] the internet, and they will opine against the precise information you retrieve from the context. </strong></p>
  181.  
  182.  
  183.  
  184. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=668" target="_blank" rel="noreferrer noopener">11.08</a>: This is a really big problem. </p>
  185.  
  186.  
  187.  
  188. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=669" target="_blank" rel="noreferrer noopener">11.09</a>: <strong>So this is true even if the context window’s small actually.</strong> </p>
  189.  
  190.  
  191.  
  192. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=673" target="_blank" rel="noreferrer noopener">11.13</a>: Yeah, and Ben, you touched on something that&#8217;s really important. So in my <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">original blog post</a>, I document four ways that context fails. I talk about “context poisoning.” That&#8217;s when the model hallucinates something in a long-running task and it stays in there, and so it&#8217;s continually confusing it. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren&#8217;t relevant to the task inside your context, and suddenly the model thinks that it has to pay attention to this stuff and it leads it astray. And then the last thing is “context clash,” which is when there&#8217;s information in the context that’s at odds with the task that you are trying to perform. </p>
  193.  
  194.  
  195.  
  196. <p>A good example of this is, say you&#8217;re asking the model to only reply in JSON, but you&#8217;re using MCP tools that are defined with XML. And so you&#8217;re creating this backwards thing. But I think there&#8217;s a fifth piece that I need to write about because it keeps coming up. And it&#8217;s exactly what you described.</p>
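<p><em>A sketch of that kind of “context clash,” with illustrative message contents: one part of the context demands JSON-only output while the tool definitions pulled into the same context are XML-shaped, so the two pull the model in different directions.</em></p>

<pre class="wp-block-code"><code># Illustrative only: the reply instructions and the tool definitions that
# share this context describe the world in conflicting formats.
messages = [
    {"role": "system", "content": "You must reply ONLY with a single JSON object."},
    {"role": "system", "content": "Available tool: <tool name='lookup'><param name='query' type='string'/></tool>"},
    {"role": "user", "content": "Look up the population of Lisbon and report it."},
]
# A cleaner context would describe the tools in the same format the reply
# is expected in, or keep tool definitions away from the output instructions.
</code></pre>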
  197.  
  198.  
  199.  
  200. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=743" target="_blank" rel="noreferrer noopener">12.23</a>: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There’s three situations you get yourself into when you&#8217;re interacting with an LLM. The first is when you&#8217;re working with the weights. You&#8217;re asking it a question that it knows how to answer. It&#8217;s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That&#8217;s what I call “working with the weights.” </p>
  201.  
  202.  
  203.  
  204. <p>The second is what we referred to earlier, which is that in-context learning, which is you&#8217;re doing something that it doesn&#8217;t know about and you&#8217;re showing an example, and then it does it. And this is great. It&#8217;s wonderful. We do it all the time.&nbsp;</p>
  205.  
  206.  
  207.  
  208. <p>But then there&#8217;s a third example which is, you&#8217;re providing it examples. But those examples are at odds with some things that it had learned, usually during posttraining, during the fine-tuning or RL stage. A really good example is output formats.&nbsp;</p>
  209.  
  210.  
  211.  
  212. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=814" target="_blank" rel="noreferrer noopener">13.34</a>: Recently a friend of mine was updating his pipeline to try out a new model, Moonshot&#8217;s. A really great model, and a really great model for tool use. And so he just changed his model and hit run to see what happened. And he kept failing—his thing couldn&#8217;t even work. He&#8217;s like, “I don&#8217;t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.</p>
  213.  
  214.  
  215.  
  216. <p>I looked at his code and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I&#8217;ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that. That one change passed every test. Like basically crushed it because it was working with the weights. He wasn&#8217;t fighting the weights. Everyone’s experienced this if you build with AI: the stubborn things it refuses to do, no matter how many times you ask it, including formatting.&nbsp;</p>
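<p><em>A rough sketch of the kind of change described above (the prompt wording, tag name, and extraction code are illustrative): ask for the answer in an XML tag the model has seen heavily in posttraining, then extract it, instead of asking for a Markdown/ASCII box.</em></p>

<pre class="wp-block-code"><code>import re

# Illustrative instruction: request an XML-tagged answer rather than an ASCII box.
XML_INSTRUCTION = (
    "Solve the task, then put only the final answer inside "
    "<answer> and </answer> tags."
)

def extract_answer(model_output: str):
    """Pull the final answer out of the XML tag, if the model produced one."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_answer("Sure. <answer>42</answer>"))  # prints: 42
</code></pre>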
  217.  
  218.  
  219.  
  220. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=875" target="_blank" rel="noreferrer noopener">14.35</a>: [Here’s] my favorite example of this though, Ben: So in ChatGPT’s web interface or their application interface, if you go there and you try to prompt an image, a lot of the images that people prompt—and I&#8217;ve talked to user research about this—are really boring prompts. They have a text box that can be anything, and they&#8217;ll say something like “a black cat” or “a statue of a man thinking.”</p>
  221.  
  222.  
  223.  
  224. <p>OpenAI realized this was leading to a lot of bad images because the prompt wasn&#8217;t detailed; it wasn&#8217;t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and it hands it to another model and says, “Improve this prompt,” and it improves the prompt for you. And if you inspect in Chrome or Safari or Firefox, whatever, you inspect the developer settings, you can see the JSON being passed back and forth, and you can see your original prompt going in. Then you can see the improved prompt.&nbsp;</p>
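<p><em>A simplified sketch of that kind of routing, with hypothetical helpers (<code>call_llm</code>, <code>call_image_model</code>) standing in for whatever APIs you use: short, low-detail prompts get rewritten by another model before the image model ever sees them.</em></p>

<pre class="wp-block-code"><code>def generate_image(user_prompt: str):
    """Hypothetical routing: expand overly short image prompts before generation."""
    if len(user_prompt.split()) < 8:  # crude "too short / low detail" check
        user_prompt = call_llm(
            "Rewrite this image prompt with concrete visual detail "
            "(subject, material, lighting, composition). Reply with the "
            "rewritten prompt only: " + user_prompt
        )
    return call_image_model(user_prompt)  # hypothetical image-generation call
</code></pre>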
  225.  
  226.  
  227.  
  228. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=936" target="_blank" rel="noreferrer noopener">15.36</a>: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin&#8217;s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot or to the LLM that said, “Generate this image and after you generate the image, do not reply. Do not ask follow up questions. Do not ask. Do not make any comments describing what you&#8217;ve done. Just generate the image.” And in this prompt, then nine times, some of them in all caps, they say, “Please do not reply.” And the reason is because a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want you to always be asking a follow-up question and they train it. And so now they have to fight the prompts. They have to add in all these statements. And that&#8217;s another way that fails. </p>
  229.  
  230.  
  231.  
  232. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1002" target="_blank" rel="noreferrer noopener">16.42</a>: So why I bring this up—and this is why I need to write about it—is as an applied AI developer, you need to recognize when you&#8217;re fighting the prompt, understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different, because you&#8217;re just banging your head against a wall and you&#8217;re going to get inconsistent, bad applications and the same statement 20 times over. </p>
  233.  
  234.  
  235.  
  236. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1027" target="_blank" rel="noreferrer noopener">17.07</a>: <strong>By the way, the other thing that’s interesting about this whole topic is, people actually somehow have underappreciated or forgotten all of the progress we&#8217;ve made in information retrieval. There&#8217;s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it&#8217;s the kind of thing that people underappreciated. And so by simply loading your context window with massive amounts of garbage, you&#8217;re actually leaving so much progress in information retrieval on the field.</strong></p>
  237.  
  238.  
  239.  
  240. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1084" target="_blank" rel="noreferrer noopener">18.04</a>: I do think it&#8217;s hard. And that&#8217;s one of the risks: We&#8217;re building all this stuff so fast from the ground up, and there&#8217;s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.</p>
  241.  
  242.  
  243.  
  244. <p>I really do think there&#8217;s two pools of developers. There&#8217;s the “throw everything in the model” pool, and then there&#8217;s the “I&#8217;m going to take incremental steps and find the most optimal model.” And I often find that latter group, which I called a compound AI group after a <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/" target="_blank" rel="noreferrer noopener">paper that was published out of Berkeley</a>, those tend to be people who have run data pipelines, because it&#8217;s not just a simple back and forth interaction. It&#8217;s gigabytes or even more of data you&#8217;re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is actually incredibly key, if not a total requirement. So there&#8217;s a lot of innovation that comes out of that space because of that kind of boundary.</p>
  245.  
  246. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1148" target="_blank" rel="noreferrer noopener">19.08</a>: <strong>If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?</strong></p>
  247.  
  248.  
  249.  
  250. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1169" target="_blank" rel="noreferrer noopener">19.29</a>: Well you&#8217;re going to laugh, Ben, because the answer is dependent on the context, and I mean the context in the team and what have you. </p>
  251.  
  252.  
  253.  
  254. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1178" target="_blank" rel="noreferrer noopener">19.38</a>: <strong>But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .</strong></p>
  255.  
  256.  
  257.  
  258. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1190" target="_blank" rel="noreferrer noopener">19.50</a>: The first thing I&#8217;m gonna do is I&#8217;m going to look in the room and I&#8217;m going to look at the titles of all the people in there, and I&#8217;m going to see if they have any subject-matter experts or if it&#8217;s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like. </p>
  259.  
  260.  
  261.  
  262. <p>I see a lot of teams that don&#8217;t have this, and they end up building fairly brittle prompt systems. And then they can&#8217;t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It&#8217;s not how they&#8217;re used to building.&nbsp;</p>
  263.  
  264.  
  265.  
  266. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1238" target="_blank" rel="noreferrer noopener">20.38</a>: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about some of like the old Web 2.0 days and I&#8217;m sure you remember. . . Remember back at the beginning of the iOS app craze, we&#8217;d be at a dinner party and someone would find out that you&#8217;re capable of building an app, and you would get cornered by some guy who&#8217;s like “I&#8217;ve got a great idea for an app,” and he would just talk at you—usually a he. </p>
  267.  
  268.  
  269.  
  270. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1275" target="_blank" rel="noreferrer noopener">21.15</a>: <strong>This is back in the Objective-C days. . .</strong></p>
  271.  
  272.  
  273.  
  274. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1277" target="_blank" rel="noreferrer noopener">21.17</a>: Yes, way back when. And this is someone who loves Objective-C. So you&#8217;d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec, which usually gets specced out by the evals it leads to. And you can even see this. OpenAI is arguably creating and at the forefront of this stuff. And what are they doing? They&#8217;re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks because they can&#8217;t do it themselves. And so that&#8217;s the first thing. Got to work with the subject-matter expert. </p>
  275.  
  276.  
  277.  
  278. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1324" target="_blank" rel="noreferrer noopener">22.04</a>: The second thing is if they&#8217;re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of that framework is that they optimize the prompt for you with the help of an LLM and your eval data. </p>
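<p><em>For a sense of what that looks like in practice, here is a minimal DSPy-style sketch (the model name, toy data, and metric are illustrative): you declare the task and an eval metric, and the optimizer builds and tunes the prompt against your examples instead of you hand-editing it.</em></p>

<pre class="wp-block-code"><code>import dspy

# Point DSPy at whatever model you use; the name here is illustrative.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Declare *what* the step does; DSPy owns the actual prompt text.
qa = dspy.Predict("question -> answer")

# A few SME-labeled examples (toy data for illustration).
trainset = [
    dspy.Example(question="Is the sky blue on a clear day?", answer="yes").with_inputs("question"),
    dspy.Example(question="Is 7 an even number?", answer="no").with_inputs("question"),
]

def exact_match(example, prediction, trace=None):
    # The eval your subject-matter expert helps define; exact match keeps it simple.
    return example.answer.lower() == prediction.answer.strip().lower()

# Let the optimizer tune the prompt and examples against the metric,
# rather than hand-patching a growing prompt by feel.
optimizer = dspy.BootstrapFewShot(metric=exact_match)
tuned_qa = optimizer.compile(qa, trainset=trainset)

print(tuned_qa(question="Is water wet?").answer)
</code></pre>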
  279.  
  280.  
  281.  
  282. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1357" target="_blank" rel="noreferrer noopener">22.37</a>: <strong>Throw in BAML?</strong></p>
  283.  
  284.  
  285.  
  286. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1359" target="_blank" rel="noreferrer noopener">22.39</a>: BAML is similar [but it’s] more like the spec for how to describe the entire spec. So it&#8217;s similar.</p>
  287.  
  288.  
  289.  
  290. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1372" target="_blank" rel="noreferrer noopener">22.52</a>: <strong>BAML and TextGrad? </strong></p>
  291.  
  292.  
  293.  
  294. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1375" target="_blank" rel="noreferrer noopener">22.55</a>: TextGrad is more like the prompt optimization I&#8217;m talking about. </p>
  295.  
  296.  
  297.  
  298. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1377" target="_blank" rel="noreferrer noopener">22:57</a>: <strong>TextGrad plus GEPA plus Regolo?</strong></p>
  299.  
  300.  
  301.  
  302. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1382" target="_blank" rel="noreferrer noopener">23.02</a>: Yeah, those things are really important. And the reason I say they&#8217;re important is. . .</p>
  303.  
  304.  
  305.  
  306. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1388" target="_blank" rel="noreferrer noopener">23.08</a>: <strong>I mean, Drew, those are kind of advanced topics. </strong></p>
  307.  
  308.  
  309.  
  310. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1392" target="_blank" rel="noreferrer noopener">23.12</a>: I don&#8217;t think they&#8217;re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it&#8217;s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There&#8217;s a lot of things to like about it.</p>
  311.  
  312.  
  313.  
  314. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1413" target="_blank" rel="noreferrer noopener">23.33</a>: <strong>DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .</strong></p>
  315.  
  316.  
  317.  
  318. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1421" target="_blank" rel="noreferrer noopener">23.41</a>: Well. . . I wouldn&#8217;t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy. </p>
  319.  
  320.  
  321.  
  322. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1428" target="_blank" rel="noreferrer noopener">23.48</a>: The point here is if it&#8217;s a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is as you build, you tend to have a lot of hotfix statements like, “Oh, there&#8217;s a bug over here. We&#8217;ll say it over here. Oh, that didn&#8217;t fix it. So let&#8217;s say it again.” It will encourage you to have one person who <em>really</em> understands this prompt. And so you end up being reliant on this prompt magician. Even though they&#8217;re written in English, there&#8217;s kind of no syntax highlighting. They get messier and messier as you build the application because they start to grow and become these growing collections of edge cases.</p>
  323.  
  324.  
  325.  
  326. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1467" target="_blank" rel="noreferrer noopener">24.27</a>: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you&#8217;re doing it against one model, and then at some point there&#8217;s going to be a better, cheaper, more effective model. And you&#8217;re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.</p>
  327.  
  328.  
  329.  
  330. <p>And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time.&nbsp;</p>
  331.  
  332.  
  333.  
  334. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1503" target="_blank" rel="noreferrer noopener">25.03</a>: <strong>Although I think right now “agents” is our hot topic, right? So we talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay. So their agents are really just prompts.” </strong></p>
  335.  
  336.  
  337.  
  338. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1516" target="_blank" rel="noreferrer noopener">25.16</a>: In the loop. . .</p>
  339.  
  340.  
  341.  
  342. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1519" target="_blank" rel="noreferrer noopener">25.19</a>: <strong>So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .</strong></p>
  343.  
  344.  
  345.  
  346. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1530" target="_blank" rel="noreferrer noopener">25.30</a>: Because that context is growing. As that loop goes, you&#8217;re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It&#8217;s very difficult. And you have to go at it systematically. </p>
  347.  
  348.  
  349.  
  350. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1546" target="_blank" rel="noreferrer noopener">25.46</a>: <strong>And then the problem is that, in many situations, the models are not even models that you control, actually. You&#8217;re using them through an API like OpenAI or Claude so you don&#8217;t actually have access to the weights. So even if you&#8217;re one of the super, super advanced teams that can do gradient descent and backprop, you can&#8217;t do that. Right? So then, what are your options for being more rigorous in doing optimization?</strong></p>
  351.  
  352.  
  353.  
  354. <p><strong>Well, it&#8217;s precisely these tools that Drew alluded to, which is the TextGrads of the world, the GEPA. You have these compound systems that are nondifferentiable. So then how do you actually do optimization in a world where you have things that are not differentiable? Right. So these are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art to something with a little more discipline.&nbsp;</strong></p>
  355.  
  356.  
  357.  
  358. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1613" target="_blank" rel="noreferrer noopener">26.53</a>: And I think a good example is, even if you aren&#8217;t going to use prompt optimization-type tools. . . The prompt optimization is a great solution for what you just described, which is when you can&#8217;t control the weights of the models you&#8217;re using. But the other thing too, is, even if you aren&#8217;t going to adopt that, you need to get evals because that&#8217;s going to be step one for anything, which is you need to start working with subject-matter experts to create evals.</p>
  359.  
  360.  
  361.  
  362. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1642" target="_blank" rel="noreferrer noopener">27.22</a>: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. And the reason they were arguing against evals is they&#8217;re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they&#8217;d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that&#8217;s insane to me. </p>
  363.  
  364.  
  365.  
  366. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1677" target="_blank" rel="noreferrer noopener">27.57</a>: And even if you&#8217;re doing that—which I think is a good thing, and you may not go create coverage and evals; you have some taste. . . And I do think when you&#8217;re building more qualitative tools. . . So a good example is like if you’re Character.AI or you’re Portola Labs, who’s building essentially personalized emotional chatbots, it&#8217;s going to be harder to create evals and it&#8217;s going to require taste as you build them. But having evals is going to ensure that your whole thing didn&#8217;t fall apart because you changed one sentence, which sadly is a risk because this is probabilistic software.</p>
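<p><em>Even a tiny eval harness beats “feels right to me.” A minimal sketch (the cases and the <code>call_model</code> argument are illustrative): rerun the same SME-labeled cases on every prompt or model change and watch the score, instead of eyeballing a few outputs.</em></p>

<pre class="wp-block-code"><code># Toy labeled cases; in practice these come from your subject-matter expert.
EVAL_SET = [
    {"input": "Does the contract auto-renew?", "expected": "yes"},
    {"input": "Which state's law governs the agreement?", "expected": "delaware"},
]

def run_evals(call_model) -> float:
    """Score any prompt/model pipeline (passed in as a callable) against the cases."""
    passed = 0
    for case in EVAL_SET:
        output = call_model(case["input"]).strip().lower()
        passed += int(case["expected"] in output)
    return passed / len(EVAL_SET)

# score = run_evals(my_pipeline)  # rerun on every change; alert on regressions
</code></pre>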
  367.  
  368.  
  369.  
  370. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1713" target="_blank" rel="noreferrer noopener">28.33</a>: <strong>Honestly, evals are super important. Number one, because, basically, leaderboards like LMArena are great for narrowing your options. But at the end of the day, you still need to benchmark all of these against your own application use case and domain. And then secondly, obviously, it&#8217;s an ongoing thing. So it ties in with reliability. The more reliable your application is, that means most likely you&#8217;re doing evals properly in an ongoing fashion. And I really believe that eval and reliability are a moat, because basically what else is your moat? Prompt? That&#8217;s not a moat. </strong></p>
  371.  
  372.  
  373.  
  374. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1761" target="_blank" rel="noreferrer noopener">29.21</a>: So first off, violent agreement there. The only asset teams truly have—unless they&#8217;re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.</p>
  375.  
  376.  
  377.  
  378. <p>We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn&#8217;t hold focus groups, didn&#8217;t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I&#8217;m the genius who comes in and tweaks these things and ships them. And that often takes you very far.</p>
  379.  
  380.  
  381.  
  382. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1813" target="_blank" rel="noreferrer noopener">30.13</a>: I also think it&#8217;s a bias of success. We only know about the ones that succeed, but the best ones, when they grow up and they start to serve an audience that&#8217;s way bigger than what they could hold in their head, they start to grow up with AB testing and ABX testing throughout their organization. And a good example of that is Facebook.</p>
  383.  
  384.  
  385.  
  386. <p>Facebook stopped being just one person&#8217;s choices and started having to do testing and ABX testing in every aspect of their business. Compare that to Snap, which again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “He&#8217;s the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.</p>
  387.  
  388.  
  389.  
  390. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1864" target="_blank" rel="noreferrer noopener">31.04</a>: But again, that&#8217;s how you get the global reach. I think there&#8217;s a lot of people who probably are really great vibe shippers. And they&#8217;re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience. </p>
  391.  
  392.  
  393.  
  394. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1894" target="_blank" rel="noreferrer noopener">31.34</a>: <strong>So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What&#8217;s a buzzword out there that either irks you or you think is undertalked about at this point? So what&#8217;s a buzzword out there, Drew? </strong></p>
  395.  
  396.  
  397.  
  398. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1917" target="_blank" rel="noreferrer noopener">31.57</a>: [laughs] I mean, I wish you had given me some time to think about it. </p>
  399.  
  400.  
  401.  
  402. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1918" target="_blank" rel="noreferrer noopener">31.58</a>: <strong>We are in a hype cycle here. . .</strong></p>
  403.  
  404.  
  405.  
  406. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1922" target="_blank" rel="noreferrer noopener">32.02</a>: We’re always in a hype cycle. I don’t like anthropomorphizing LLMs or AI for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, which means that we don&#8217;t have substantive conversations about these things, and we don&#8217;t learn how to build really well with them because we think they&#8217;re intelligent. We think they&#8217;re a PhD in your pocket. We think they&#8217;re all of these things and they&#8217;re not—they&#8217;re fundamentally different. </p>
  407.  
  408.  
  409.  
  410. <p>I&#8217;m not against using the way we think the brain works for inspiration. That&#8217;s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it&#8217;s a PhD in your pocket, and here&#8217;s the benchmark to prove it—you&#8217;re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going.&nbsp;</p>
  411.  
  412.  
  413.  
  414. <p>But I also think it just doesn&#8217;t help you build sustainable programs because you aren&#8217;t actually understanding how it works. You&#8217;re just kind of reducing it down. AGI is one of those things. And superintelligence, but AGI especially.</p>
  415.  
  416.  
  417.  
  418. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2001" target="_blank" rel="noreferrer noopener">33.21</a>: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “<a href="https://en.wikipedia.org/wiki/A_Cyborg_Manifesto" target="_blank" rel="noreferrer noopener">A Cyborg Manifesto</a>” in the ’80s. She looks at tech and science history through kind of a feminist lens. You would just sit in that class and your mind would explode, and then at the end, you just have to sit there for like five minutes afterwards, just picking up the pieces. </p>
  419.  
  420.  
  421.  
  422. <p>She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don&#8217;t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types and things that we understand to be, or we believed to be, important, but they&#8217;re not. And big data, another one that is very, very relevant.&nbsp;</p>
  423.  
  424.  
  425.  
  426. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2074" target="_blank" rel="noreferrer noopener">34.34</a>: <strong>That&#8217;s my handle on Twitter. </strong></p>
  427.  
  428.  
  429.  
  430. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2095" target="_blank" rel="noreferrer noopener">34.55</a>: Yeah, there you go. So like it&#8217;s, you know, I fill it with Ben Lorica. That&#8217;s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn&#8217;t understand why it works and how well it worked.</p>
  431.  
  432.  
  433.  
  434. <p>And so what happened is, if you looked at the news and you looked at people&#8217;s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren&#8217;t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don&#8217;t work.” NVIDIA bears said, “We&#8217;re not going to need any more data centers if it&#8217;s going to be this efficient.” And bulls said, “No, we&#8217;re going to need tons of them because it&#8217;s going to use everything.”</p>
  435.  
  436.  
  437.  
  438. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2144" target="_blank" rel="noreferrer noopener">35.44</a>: And AGI is another term like that, which means everything and nothing. And when the point where we&#8217;ve supposedly reached it comes, it isn&#8217;t. And compounding that is that it&#8217;s in the contract between OpenAI and Microsoft—I forget the exact term, but it&#8217;s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.</p>
  439.  
  440.  
  441.  
  442. <p>And so it&#8217;s a very loaded definition right now that&#8217;s being debated back and forth as they try to figure out how to take [Open]AI into being a for-profit corporation. And Microsoft has a lot of leverage because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.</p>
  443.  
  444.  
  445.  
  446. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2188" target="_blank" rel="noreferrer noopener">36.28</a>: So because it&#8217;s going to be that thing, and you&#8217;ve seen Sam Altman come out and some days he talks about how LLMs are just software. Some days he talks about how it’s a PhD in your pocket, some days he talks about how we&#8217;ve already passed AGI, it&#8217;s already over. </p>
  447.  
  448.  
  449.  
  450. <p>I think Nathan Lambert has some <a href="https://www.interconnects.ai/p/agi-is-what-you-want-it-to-be" target="_blank" rel="noreferrer noopener">great writing about how AGI is a mistake</a>. We shouldn&#8217;t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you. </p>
  451.  
  452.  
  453.  
  454. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2243" target="_blank" rel="noreferrer noopener">37.23</a>: <strong>So I&#8217;ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term that my friend Ion Stoica is now talking more about: “systems engineering.” If you look particularly at the agentic applications, you&#8217;re talking about systems.</strong></p>
  455.  
  456.  
  457.  
  458. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2228" target="_blank" rel="noreferrer noopener">37.08</a>: That&#8217;s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don&#8217;t want any regulation or if you <em>want</em> regulation—it&#8217;s kind of a fuzzy word. And that has some really good properties. </p>
  459.  
  460.  
  461.  
  462. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2243" target="_blank" rel="noreferrer noopener">37.23</a>: <strong>So I&#8217;ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term, which my friend Ion Stoica is now talking more about: “systems engineering.” If you look particularly at the agentic applications, you&#8217;re talking about systems.</strong></p>
  463.  
  464.  
  465.  
  466. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2275" target="_blank" rel="noreferrer noopener">37.55</a>: Can I add one thing to this? Violent agreement. I think that is an underrated. . . </p>
  467.  
  468.  
  469.  
  470. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2280" target="_blank" rel="noreferrer noopener">38.00</a>: <strong>Although I think it&#8217;s too boring a term, Drew, to take off.</strong></p>
  471.  
  472.  
  473.  
  474. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2283" target="_blank" rel="noreferrer noopener">38.03</a>: That’s fine! The reason I like it is because—and you were talking about this when you talk about fine-tuning—is, looking at the way people build and looking at the way I see teams with success build, there&#8217;s pretraining, where you&#8217;re basically training on unstructured data and you&#8217;re just building your base knowledge, your base English capabilities and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills, but you&#8217;re teaching reasoning, you&#8217;re teaching it validated functions like code and math. You&#8217;re teaching it how to chat with you. This is where it learns to converse. You&#8217;re teaching it how to use tools and specific sets of tools. And then you&#8217;re teaching it alignment, what&#8217;s safe, what&#8217;s not safe, all these other things. </p>
  475.  
  476.  
  477.  
  478. <p>But then after it ships, you can still RL that model, you can still fine-tune that model, and you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing is, I think we&#8217;re going to see that posttraining all the way through to a final applied AI product. That&#8217;s going to be a real shades-of-gray gradient. It’s going to be. And this is one of the reasons why I think open models have a pretty big advantage in the future is that you&#8217;re going to dip down the way throughout that, leverage that.&nbsp;.&nbsp;.</p>
  479.  
  480.  
  481.  
  482. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2372" target="_blank" rel="noreferrer noopener">39.32</a>: The only thing that&#8217;s keeping us from doing that now is we don&#8217;t have the tools and the operating system to align throughout that posttraining to shipping. Once we do, that operating system is going to change how we build, because the distance between posttraining and building is going to look really, really, really blurry. I really like the systems engineering type of approach, but I also think you can also start to see this yesterday [when] Thinking Machines released their first product.</p>
  483.  
  484.  
  485.  
  486. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2404" target="_blank" rel="noreferrer noopener">40.04</a>: And so Thinking Machines is Mira [Murati]. Her very hype thing. They launched their first thing, and it’s called Tinker. And it&#8217;s essentially, “Hey, you can write some very simple Python code, and then we will do the RL for you or the fine-tuning for you using our cluster of GPUs so you don&#8217;t have to manage that.” And that is the type of thing that we want to see in a maturing kind of development framework. And you start to see this operating system emerging. </p>
  487.  
  488.  
  489.  
  490. <p>And it reminds me of the early days of O&#8217;Reilly, where it&#8217;s like I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don&#8217;t have to. I can spin up a Docker image, I can ship to Render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we&#8217;re going to see a similar evolution from that. And I&#8217;m really excited. And I think you have picked a great underrated term.&nbsp;</p>
  491.  
  492.  
  493.  
  494. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2456" target="_blank" rel="noreferrer noopener">40.56</a>: <strong>Now with that. Thank you, Drew. </strong><br><br><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2458" target="_blank" rel="noreferrer noopener">40.58</a>: Awesome. Thank you for having me, Ben.</p>
  495. ]]></content:encoded>
  496. <wfw:commentRss>https://www.oreilly.com/radar/?post_type=podcast&#038;p=17562/feed/</wfw:commentRss>
  497. <slash:comments>0</slash:comments>
  498. </item>
  499. <item>
  500. <title>From Habits to Tools</title>
  501. <link>https://www.oreilly.com/radar/from-habits-to-tools/</link>
  502. <comments>https://www.oreilly.com/radar/from-habits-to-tools/#respond</comments>
  503. <pubDate>Wed, 15 Oct 2025 12:49:38 +0000</pubDate>
  504. <dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
  505. <category><![CDATA[AI & ML]]></category>
  506. <category><![CDATA[Commentary]]></category>
  507.  
  508. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17557</guid>
  509.  
  510. <media:content
  511. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-colorful-drops_Otherworldly.jpg"
  512. medium="image"
  513. type="image/jpeg"
  514. width="2304"
  515. height="1792"
  516. />
  517.  
  518. <media:thumbnail
  519. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-colorful-drops_Otherworldly-160x160.jpg"
  520. width="160"
  521. height="160"
  522. />
  523. <custom:subtitle><![CDATA[The Future of AI-Assisted Development]]></custom:subtitle>
  524. <description><![CDATA[This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI. AI-assisted coding is here to stay. I&#8217;ve seen many companies now require all developers to install Copilot extensions in their IDEs, and teams are increasingly being measured on AI-adoption metrics. Meanwhile, the tools themselves have become genuinely [&#8230;]]]></description>
  525. <content:encoded><![CDATA[
  526. <p class="has-cyan-bluish-gray-background-color has-background"><em>This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI.</em></p>
  527.  
  528.  
  529.  
  530. <p>AI-assisted coding is here to stay. I&#8217;ve seen many companies now require all developers to install Copilot extensions in their IDEs, and teams are increasingly being measured on AI-adoption metrics. Meanwhile, the tools themselves have become genuinely useful for routine tasks: Developers regularly use them to generate boilerplate, convert between formats, write unit tests, and explore unfamiliar APIs—giving us more time to focus on solving our real problems instead of wrestling with syntax or going down research rabbit holes.</p>
  531.  
  532.  
  533.  
  534. <p>Many team leads, managers, and instructors looking to help developers ramp up on AI tools assume the biggest challenge is learning to write better prompts or picking the right AI tool; that assumption misses the point. The real challenge is figuring out how developers can use these tools in ways that keep them engaged and strengthen their skills instead of becoming disconnected from the code and letting their development skills atrophy.</p>
  535.  
  536.  
  537.  
  538. <p>This was the challenge I took on when I developed the Sens-AI Framework. When I was updating <a href="https://learning.oreilly.com/library/view/head-first-c/9781098141776/" target="_blank" rel="noreferrer noopener"><em>Head First C#</em></a> (O&#8217;Reilly 2024) to help readers ramp up on AI skills alongside other fundamental development skills, I watched new learners struggle not with the mechanics of prompting but with maintaining their understanding of the code they were producing. The framework emerged from those observations—five habits that keep developers engaged in the design conversation: context, research, framing, refining, and critical thinking. These habits address the real issue: making sure the developer stays in control of the work, understanding not just what the code does but why it&#8217;s structured that way.</p>
  539.  
  540.  
  541.  
  542. <h2 class="wp-block-heading"><strong>What We&#8217;ve Learned So Far</strong></h2>
  543.  
  544.  
  545.  
  546. <p>When I updated <em>Head First C# </em>to include AI exercises, I had to design them knowing learners would paste instructions directly into AI tools. That forced me to be deliberate: The instructions had to guide the learner while also shaping how the AI responded. Testing those same exercises against Copilot and ChatGPT showed the same kinds of problems over and over—AI filling in gaps with the wrong assumptions or producing code that looked fine until you actually had to run it, read and understand it, or modify and extend it.</p>
  547.  
  548.  
  549.  
  550. <p>Those issues don&#8217;t only trip up new learners. More experienced developers can fall for them too. The difference is that experienced developers already have habits for catching themselves, while newer developers usually don&#8217;t—unless we make a point of teaching them. AI skills aren&#8217;t exclusive to senior or experienced developers either; I&#8217;ve seen relatively new developers pick up AI skills quickly because they&#8217;ve built these habits early.</p>
  551.  
  552.  
  553.  
  554. <h2 class="wp-block-heading"><strong>Habits Across the Lifecycle</strong></h2>
  555.  
  556.  
  557.  
  558. <p>In “<a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">The Sens-AI Framework</a>,” I introduced the five habits and explained how they work together to keep developers engaged with their code rather than becoming passive consumers of AI output. These habits also address specific failure modes, and understanding how they solve real problems points the way toward broader implementation across teams and tools:</p>
  559.  
  560.  
  561.  
  562. <p><strong>Context</strong> helps avoid vague prompts that lead to poor output. Ask an AI to “make this code better” without sharing what the code does, and it might suggest adding comments to a performance-critical section where comments would just add clutter. But provide the context—“This is a high-frequency trading system where microseconds matter,” along with the actual code structure, dependencies, and constraints—and the AI understands it should focus on optimizations, not documentation.</p>
  563.  
  564.  
  565.  
  566. <p><strong>Research</strong> makes sure the AI isn&#8217;t your only source of truth. When you rely solely on AI, you risk compounding errors—the AI makes an assumption, you build on it, and soon you&#8217;re deep in a solution that doesn&#8217;t match reality. Cross-checking with documentation or even asking a different AI can reveal when you&#8217;re being led astray.</p>
  567.  
  568.  
  569.  
  570. <p><strong>Framing</strong> is about asking questions that set up useful answers. &#8220;How do I handle errors?&#8221; gets you a try-catch block. &#8220;How do I handle network timeout errors in a distributed system where partial failures need rollback?&#8221; gets you circuit breakers and compensation patterns. As I showed in “<a href="https://www.oreilly.com/radar/understanding-the-rehash-loop/" target="_blank" rel="noreferrer noopener">Understanding the Rehash Loop</a>,” proper framing can break the AI out of circular suggestions.</p>
  571.  
  572.  
  573.  
  574. <p><strong>Refining</strong> means not settling for the first thing the AI gives you. The first response is rarely the best—it&#8217;s just the AI&#8217;s initial attempt. When you iterate, you&#8217;re steering toward better patterns. Refining moves you from &#8220;This works&#8221; to “This is actually good.&#8221;</p>
  575.  
  576.  
  577.  
  578. <p><strong>Critical thinking</strong> ties it all together, asking whether the code actually works for your project. It&#8217;s debugging the AI&#8217;s assumptions, reviewing for maintainability, and asking, &#8220;Will this make sense six months from now?&#8221;</p>
  579.  
  580.  
  581.  
  582. <p>The real power of the Sens-AI Framework comes from using all five habits together. They form a reinforcing loop: Context informs research, research improves framing, framing guides refinement, refinement reveals what needs critical thinking, and critical thinking shows you what context you were missing. When developers use these habits in combination, they stay engaged with the design and engineering process rather than becoming passive consumers of AI output. It&#8217;s the difference between using AI as a crutch and using it as a genuine collaborator.</p>
  583.  
  584.  
  585.  
  586. <h2 class="wp-block-heading"><strong>Where We Go from Here</strong></h2>
  587.  
  588.  
  589.  
  590. <p>If developers are going to succeed with AI, these habits need to show up beyond individual workflows. They need to become part of:</p>
  591.  
  592.  
  593.  
  594. <p><strong>Education</strong>: <em>Teaching AI literacy alongside basic coding skills.</em> As I described in “<a href="https://www.oreilly.com/radar/the-ai-teaching-toolkit-practical-guidance-for-teams/" target="_blank" rel="noreferrer noopener">The AI Teaching Toolkit</a>,” techniques like having learners debug intentionally flawed AI output help them spot when the AI is confidently wrong and practice breaking out of rehash loops. These aren&#8217;t advanced skills; they&#8217;re foundational.</p>
  595.  
  596.  
  597.  
  598. <p><strong>Team practice</strong>: <em>Using code reviews, pairing, and retrospectives to evaluate AI output the same way we evaluate human-written code.</em> In my teaching article, I described techniques like AI archaeology and shared language patterns. What matters here is making those kinds of habits part of standard training—so teams develop vocabulary like &#8220;I&#8217;m stuck in a rehash loop&#8221; or &#8220;The AI keeps defaulting to the old pattern.&#8221; And as I explored in “<a href="https://www.oreilly.com/radar/trust-but-verify/" target="_blank" rel="noreferrer noopener">Trust but Verify</a>,” treating AI-generated code with the same scrutiny as human code is essential for maintaining quality.</p>
  599.  
  600.  
  601.  
  602. <p><strong>Tooling</strong>: <em>IDEs and linters that don&#8217;t just generate code but highlight assumptions and surface design trade-offs.</em> Imagine your IDE warning: &#8220;Possible rehash loop detected: you&#8217;ve been iterating on this same approach for 15 minutes.&#8221; That&#8217;s one direction IDEs need to evolve—surfacing assumptions and warning when you&#8217;re stuck. The technical debt risks I outlined in “<a href="https://www.oreilly.com/radar/building-ai-resistant-technical-debt/" target="_blank" rel="noreferrer noopener">Building AI-Resistant Technical Debt</a>” could be mitigated with better tooling that catches antipatterns early.</p>
  603.  
  604.  
  605.  
  606. <p><strong>Culture</strong>: <em>A shared understanding that AI is a collaboration too (and not a teammate)</em>. A team&#8217;s measure of success for code shouldn&#8217;t revolve around AI. Teams still need to understand that code, keep it maintainable, and grow their own skills along the way. Getting there will require changes in how they work together—for example, adding AI-specific checks to code reviews or developing shared vocabulary for when AI output starts drifting. This cultural shift connects to the requirements engineering parallels I explored in “<a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>”—we need the same clarity and shared understanding with AI that we&#8217;ve always needed with human teams.</p>
  607.  
  608.  
  609.  
  610. <p><strong>More convincing output will require more sophisticated evaluation.</strong> Models will keep getting faster and more capable. What won&#8217;t change is the need for developers to think critically about the code in front of them.</p>
  611.  
  612.  
  613.  
  614. <p>The Sens-AI habits work alongside today&#8217;s tools and are designed to stay relevant to tomorrow&#8217;s tools as well. They&#8217;re practices that keep developers in control, even as models improve and the output gets harder to question. The framework gives teams a way to talk about both the successes and the failures they see when using AI. From there, it&#8217;s up to instructors, tool builders, and team leads to decide how to put those lessons into practice.</p>
  615.  
  616.  
  617.  
  618. <p>The next generation of developers will never know coding without AI. Our job is to make sure they build lasting engineering habits alongside these tools—so AI strengthens their craft rather than hollowing it out.</p>
  619. ]]></content:encoded>
  620. <wfw:commentRss>https://www.oreilly.com/radar/from-habits-to-tools/feed/</wfw:commentRss>
  621. <slash:comments>0</slash:comments>
  622. </item>
  623. <item>
  624. <title>Magic Words: Programming the Next Generation of AI Applications</title>
  625. <link>https://www.oreilly.com/radar/magic-words-programming-the-next-generation-of-ai-applications/</link>
  626. <comments>https://www.oreilly.com/radar/magic-words-programming-the-next-generation-of-ai-applications/#respond</comments>
  627. <pubDate>Wed, 15 Oct 2025 10:06:50 +0000</pubDate>
  628. <dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
  629. <category><![CDATA[AI & ML]]></category>
  630. <category><![CDATA[Commentary]]></category>
  631.  
  632. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17539</guid>
  633.  
  634. <media:content
  635. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Cluny_-_Mero_-_Croix-Talisman_motifs_magiques_base_sur_Abracadabra_-_VIe-VII_siecle-_Ag_nielle-scaled.jpg"
  636. medium="image"
  637. type="image/jpeg"
  638. width="2560"
  639. height="1467"
  640. />
  641.  
  642. <media:thumbnail
  643. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Cluny_-_Mero_-_Croix-Talisman_motifs_magiques_base_sur_Abracadabra_-_VIe-VII_siecle-_Ag_nielle-160x160.jpg"
  644. width="160"
  645. height="160"
  646. />
  647. <description><![CDATA[“Strange was obliged to invent most of the magic he did, working from general principles and half-remembered stories from old books.” — Susanna Clarke, Jonathan Strange &#38; Mr Norrell Fairy tales, myths, and fantasy fiction are full of magic spells. You say “abracadabra” and something profound happens.1 Say “open sesame” and the door swings open. [&#8230;]]]></description>
  648. <content:encoded><![CDATA[
  649. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  650. <p class="has-cyan-bluish-gray-background-color has-background"><em>“Strange was obliged to invent most of the magic he did, working from general principles and half-remembered stories from old books.”</em><br><br><em>— </em>Susanna Clarke,<em> Jonathan Strange &amp; Mr Norrell</em></p>
  651. </blockquote>
  652.  
  653.  
  654.  
  655. <p>Fairy tales, myths, and fantasy fiction are full of magic spells. You say “abracadabra” and something profound happens.<sup>1</sup> Say “open sesame” and the door swings open.</p>
  656.  
  657.  
  658.  
  659. <p>It turns out that this is also a useful metaphor for what happens with large language models.</p>
  660.  
  661.  
  662.  
  663. <p>I first got this idea from David Griffiths’s O’Reilly course on <a href="https://learning.oreilly.com/live-events/using-generative-ai-to-boost-your-personal-productivity/0636920099736/" target="_blank" rel="noreferrer noopener">using AI to boost your productivity</a>. He gave a simple example. You can tell ChatGPT “Organize my task list using the Eisenhower four-sided box.” And it just knows what to do, even if you yourself know nothing about General Dwight D. Eisenhower’s approach to decision making. David then suggests his students instead try “Organize my task list using Getting Things Done,” or just “Use GTD.” Each of those phrases is shorthand for systems of thought, practices, and conventions that the model has learned from human culture.</p>
  664.  
  665.  
  666.  
  667. <p>These are magic words. They’re magic not because they do something unworldly and unexpected but because they have the power to summon patterns that have been encoded in the model. The words act as keys, unlocking context and even entire workflows.</p>
  668.  
  669.  
  670.  
  671. <p>We all use magic words in our prompts. We say something like “Update my <em>resume</em>” or “Draft a <em>Substack post</em>” without thinking how much detailed prompting we’d have to do to create that output if the LLM didn’t already know the magic word.</p>
  672.  
  673.  
  674.  
  675. <p>Every field has a specialized language whose terms are known only to its initiates. We can be fanciful and pretend they are magic spells, but the reality is that each of them is really a kind of <strong>fuzzy function call </strong>to an LLM, bringing in a body of context and unlocking a set of behaviors and capabilities. When we ask an LLM to write a program in <em>Javascript </em>rather than <em>Python</em>, we are using one of these fuzzy function calls. When we ask for output as an <em>.md</em> file, we are doing the same. Unlike a function call in a traditional programming language, it doesn’t always return the same result, which is why developers have an opportunity to enhance the magic.</p>
  676.  
  677.  
  678.  
  679. <h2 class="wp-block-heading"><strong>From Prompts to Applications</strong></h2>
  680.  
  681.  
  682.  
  683. <p>The next light bulb went off for me in a conversation with Claire Vo, the creator of an AI application called <a href="http://chatprd.ai" target="_blank" rel="noreferrer noopener">ChatPRD</a>. Claire spent years as a product manager, and as soon as ChatGPT became available, began using it to help her write product requirement documents or PRDs. Every product manager knows what a PRD is. When Claire prompted ChatGPT to “write a PRD,” it didn’t need a long preamble. That one acronym carried decades of professional practice. But Claire went further. She refined her prompts, improved them, and taught ChatGPT how to think like her. Over time, she had trained a system, not at the model level, but at the level of context and workflow.</p>
  684.  
  685.  
  686.  
  687. <p>Next, Claire turned her workflow into a product. That product is a software interface that wraps up a number of related magic words into a useful package. It controls access to her customized magic spell, so to speak. Claire added detailed prompts, integrations with other tools, access control, and a whole lot of traditional programming in a next-generation application that uses a mix of traditional software code and “magical” fuzzy function calls to an LLM. ChatPRD even interviews users to learn more about their goals, customizing the application for each organization and use case.</p>
  688.  
  689.  
  690.  
  691. <p>Claire’s <a href="https://www.chatprd.ai/blog/chatprd-quickstart-guide" target="_blank" rel="noreferrer noopener">quickstart guide to ChatPRD</a> is a great example of what a magic-word (fuzzy function call) application looks like.</p>
  692.  
  693.  
  694.  
  695. <figure class="wp-block-embed is-type-rich is-provider-embed-handler wp-block-embed-embed-handler wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
  696. <iframe title="UPDATED ChatPRD Demo - Product Tour &amp; Updated Features (2025)" width="500" height="281" src="https://www.youtube.com/embed/-V6bzSwYUZY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
  697. </div></figure>
  698.  
  699.  
  700.  
  701. <p>You can also see how magic words are crafted into magic spells and how these spells are even part of the architecture of applications like Claude Code through the explorations of developers like Jesse Vincent and Simon Willison.</p>
  702.  
  703.  
  704.  
  705. <p>In “<a href="https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-in-september-2025/" target="_blank" rel="noreferrer noopener">How I&#8217;m Using Coding Agents in September, 2025</a>,” Jesse first describes how his <a href="http://claude.md" target="_blank" rel="noreferrer noopener">claude.md</a> file provides a base prompt that “encodes a bunch of process documentation and rules that do a pretty good job keeping Claude on track.” And then his workflow calls on a bunch of specialized prompts he has created (i.e., “spells” that give clearer and more personalized meaning to specific magic words) like “brainstorm,” “plan,” “architect,” “implement,” “debug,” and so on. Note how inside these prompts, he may use additional magic words like DRY, YAGNI, and TDD, which refer to specific programming methodologies. For example, here’s his planning prompt (boldface mine):</p>
  706.  
  707.  
  708.  
  709. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  710. <p><code>Great. I need your help to write out a comprehensive implementation plan.</code></p>
  711.  
  712.  
  713.  
  714. <p><code>Assume that the engineer has zero context for our codebase and questionable</code><br><code>taste. document everything they need to know. which files to touch for each</code><br><code>task, code, testing, docs they might need to check. how to test it.give </code><br><code>them the whole plan as bite-sized tasks. <strong>DRY. YAGNI. TDD.</strong> <strong>frequent commits</strong>.</code></p>
  715.  
  716.  
  717.  
  718. <p><code>Assume they are a skilled developer, but know almost nothing about our</code><br><code>toolset or problem domain. assume they don't know good test design</code> <code>very</code><br><code>well.</code></p>
  719.  
  720.  
  721.  
  722. <p><code>please write out this plan, in full detail, into docs/plans/</code></p>
  723. </blockquote>
  724.  
  725.  
  726.  
  727. <p>But Jesse didn’t stop there. He built a project called <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a>, which uses Claude’s <a href="https://docs.claude.com/en/docs/claude-code/plugins" target="_blank" rel="noreferrer noopener">recently announced plug-in architecture</a> to&nbsp;“give Claude Code superpowers with a comprehensive skills library of proven techniques, patterns, and tools.” <a href="https://blog.fsck.com/2025/10/09/superpowers/" target="_blank" rel="noreferrer noopener">Announcing the project</a>, he wrote:</p>
  728.  
  729.  
  730.  
  731. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  732. <p>Skills are what give your agents Superpowers. The first time they really popped up on my radar was a few weeks ago when Anthropic rolled out improved Office document creation. When the feature rolled out, I went poking around a bit – I asked Claude to tell me all about its new skills. And it <a href="https://claude.ai/share/0fe5a9c0-4e5a-42a1-9df7-c5b7636dad92" target="_blank" rel="noreferrer noopener">was only too happy to dish</a>…. [Be sure to follow this link! &#8211; TOR]</p>
  733.  
  734.  
  735.  
  736. <p>One of the first skills I taught Superpowers was <a href="https://raw.githubusercontent.com/obra/superpowers-skills/35c29f0fe22881149a991eca1276c148567a7c29/skills/meta/writing-skills/SKILL.md" target="_blank" rel="noreferrer noopener">How to create skills</a>. That has meant that when I wanted to do something like add git worktree workflows to Superpowers, it was a matter of describing how I wanted the workflows to go&#8230;and then Claude put the pieces together and added a couple notes to the existing skills that needed to clue future-Claude into using worktrees.</p>
  737. </blockquote>
  738.  
  739.  
  740.  
  741. <p>After reading Jesse’s post, Simon Willison did a bit more digging into the original document handling skills that Claude had announced and that had sparked Jesse’s brainstorm. He noted:</p>
  742.  
  743.  
  744.  
  745. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  746. <p>Skills are more than just prompts though: the repository also includes dozens of pre-written Python scripts for performing common operations.</p>
  747.  
  748.  
  749.  
  750. <p>&nbsp;<a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/scripts/fill_fillable_fields.py" target="_blank" rel="noreferrer noopener">pdf/scripts/fill_fillable_fields.py</a> for example is a custom CLI tool that uses <a href="https://pypi.org/project/pypdf/">pypdf</a> to find and then fill in a bunch of PDF form fields, specified as JSON, then render out the resulting combined PDF.</p>
  751.  
  752.  
  753.  
  754. <p>This is a really sophisticated set of tools for document manipulation, and I love that Anthropic have made those visible—presumably deliberately—to users of Claude who know how to ask for them.</p>
  755. </blockquote>
  756.  
  757.  
  758.  
  759. <p>You can see what’s happening here. Magic words are being enhanced and given a more rigorous definition, and new ones are being added to what, in fantasy tales, they call a “grimoire,” or book of spells. Microsoft calls such spells “<a href="https://paradox921.medium.com/amplifier-notes-from-an-experiment-thats-starting-to-snowball-ef7df4ff8f97" target="_blank" rel="noreferrer noopener">metacognitive recipes</a>,” a wonderful term that should get widely adopted, though in this article I’m going to stick with my fanciful analogy to magic.</p>
  760.  
  761.  
  762.  
  763. <p>At O’Reilly, we’re working with a very different set of magic words. For example, we’re building a system for precisely targeted competency-based learning, through which our customers can skip what they already know, master what they need, and prove what they’ve learned. It also gives corporate learning system managers the ability to assign learning goals and to measure the ROI on their investment.</p>
  764.  
  765.  
  766.  
  767. <p>It turns out that there are dozens of <em>learning frameworks</em> (and that is itself a magic word). In the design of our own specialized learning framework, we’re invoking Bloom’s taxonomy, SFIA, and the Dreyfus Model of Skill Acquisition. But when a customer says, “We love your approach, but we use LTEM,” we can invoke that framework instead. Every corporate customer also has its own specialized tech stack. So we are exploring how to use magic words to let whatever we build adapt dynamically not only to our end users’ learning needs but to the tech stack and to the learning framework that already exists at each company.</p>
  768.  
  769.  
  770.  
  771. <p>That would be a nightmare if we had to support dozens of different learning frameworks using traditional processes. But the problem seems much more tractable if we are able to invoke the right magic words. That’s what I mean when I say that magic words are a crucial building block in the next generation of application programming.</p>
  772.  
  773.  
  774.  
  775. <h2 class="wp-block-heading"><strong>The Architecture of Magic</strong></h2>
  776.  
  777.  
  778.  
  779. <p>Here’s the important thing: Magic isn’t arbitrary. In every mythic tradition, it has structure, discipline, and cost. The magician’s power depends on knowing the right words, pronounced in the right way, with the right intent.</p>
  780.  
  781.  
  782.  
  783. <p>The same is true for AI systems. The effectiveness of our magic words depends on context, grounding, and feedback loops that give the model reliable information about the world.</p>
  784.  
  785.  
  786.  
  787. <p>That’s why I find the emerging ecosystem of AI applications so fascinating. It’s about providing the right context to the model. It’s about defining vocabularies, workflows, and roles that expose and make sense of the model’s abilities. It’s about turning implicit cultural knowledge into explicit systems of interaction.</p>
  788.  
  789.  
  790.  
  791. <p>We’re only at the beginning. But just as early programmers learned to build structured software without spelling out exact machine instructions, today’s AI practitioners are learning to build structured reasoning systems out of fuzzy language patterns.</p>
  792.  
  793.  
  794.  
  795. <p>Magic words aren’t just a poetic image. They’re the syntax of a new kind of computing. As people become more comfortable with LLMs, they will pass around the magic words they have learned as power user tricks. Meanwhile, developers will wrap more advanced capabilities around existing magic words and perhaps even teach the models new ones that haven&#8217;t yet had the time to accrete sufficient meaning through wide usage in the training set. Each application will be built around a shared vocabulary that encodes its domain knowledge. Back in 2022, Mike Loukides called these systems “<a href="https://www.oreilly.com/radar/formal-informal-languages/" target="_blank" rel="noreferrer noopener">formal informal languages</a>.” That is, they are spoken in human language, but do better when you apply a bit of rigor.</p>
  796.  
  797.  
  798.  
  799. <p>And at least for the foreseeable future, developers will write “shims” between the magic words that control the LLMs and the more traditional programming tools and techniques that interface with existing systems, much as Claire did with ChatPRD. But eventually we’ll see true AI to AI communication.</p>
  800.  
  801.  
  802.  
  803. <p>Magic words and the spells built around them are only the beginning. Once people start using them in common, they become <em>protocols</em>. They define how humans and AI systems cooperate, and how AI systems cooperate with each other.</p>
  804.  
  805.  
  806.  
  807. <p>We can already see this happening. Frameworks like LangChain or the Model Context Protocol (MCP) formalize how context and tools are shared. Teams build agentic workflows that depend on a common vocabulary of intent. What is an MCP server, after all, but a mapping of a fuzzy function call into a set of predictable tools and services available at a given endpoint?</p>
  808.  
  809.  
  810.  
  811. <p>In other words, what was once a set of magic spells is becoming infrastructure. When enough people use the same magic words, they stop being magic and start being standards—the building blocks for the next generation of software.</p>
  812.  
  813.  
  814.  
  815. <p>We can already see this progression with MCP. There are three distinct kinds of MCP servers. Some, like <a href="https://github.com/microsoft/playwright-mcp" target="_blank" rel="noreferrer noopener">Playwright MCP</a>, are designed to make it easier for AIs to interface with applications originally designed for interactive human use. Others, like the <a href="https://github.com/github/github-mcp-server" target="_blank" rel="noreferrer noopener">GitHub MCP Server</a>, are designed to make it easier for AIs to interface with existing APIs, that is, with interfaces originally designed to be called by traditional programs. But some are designed as a frontend for a true AI-to-AI conversation. Other protocols, like A2A, are already optimized for this third use case.</p>
  816.  
  817.  
  818.  
  819. <p>But in each case, an MCP server is really a dictionary (or in magic terms, a spellbook) that explains the magic words that it understands and how to invoke them. As Jesse Vincent put it to me after reading a draft of this piece:</p>
  820.  
  821.  
  822.  
  823. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  824. <p>The part that feels the most like magic spells is the part that most MCP authors do incredibly poorly. Each tool has a “description” field that tells the LLM how you use the tool. That description field is read and internalized by the LLM and changes how it behaves. Anthropic are particularly good at tool descriptions and most everybody else, in my experience, is&#8230;less good.</p>
  825. </blockquote>
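
<p>To make that point concrete, here is a rough sketch of what a single tool entry might look like in an MCP server&#8217;s tool listing. The tool, its description text, and its fields are invented for illustration; only the name/description/inputSchema shape follows the MCP format.</p>

<pre class="wp-block-code"><code>{
  "name": "convert_markdown_to_pdf",
  "description": "Convert a Markdown document to a PDF. Use this only when the user explicitly asks for a PDF. The 'markdown' argument must be the full document text, not a file path. Returns a URL to the rendered file.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "markdown": { "type": "string" },
      "paper_size": { "type": "string", "enum": ["A4", "letter"] }
    },
    "required": ["markdown"]
  }
}</code></pre>

<p>A description written this way is doing the same work as a spell: it tells the model not just what the tool does but when to reach for it and how to call it well.</p>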
  826.  
  827.  
  828.  
  829. <p>In many ways, publishing the prompts, tool descriptions, context, and skills that add functionality to LLMs may be a more important frontier of open source AI than open weights. It’s important that we treat our enhancements to magic words not as proprietary secrets but as shared cultural artifacts. The more open and participatory our vocabularies are, the more inclusive and creative the resulting ecosystem will be.</p>
  830.  
  831.  
  832.  
  833. <hr class="wp-block-separator has-alpha-channel-opacity"/>
  834.  
  835.  
  836.  
  837. <h2 class="wp-block-heading">Footnotes</h2>
  838.  
  839.  
  840.  
  841. <ol class="wp-block-list">
  842. <li>While often associated today with stage magic and cartoons, this magic word was apparently used from Roman times as a healing spell. One proposed etymology suggests that it comes <a href="https://en.wikipedia.org/wiki/Abracadabra" target="_blank" rel="noreferrer noopener">from the Aramaic for “I create as I speak.”</a></li>
  843. </ol>
  844.  
  845.  
  846.  
  847. <p></p>
  848. ]]></content:encoded>
  849. <wfw:commentRss>https://www.oreilly.com/radar/magic-words-programming-the-next-generation-of-ai-applications/feed/</wfw:commentRss>
  850. <slash:comments>0</slash:comments>
  851. </item>
  852. <item>
  853. <title>Enlightenment</title>
  854. <link>https://www.oreilly.com/radar/enlightenment/</link>
  855. <comments>https://www.oreilly.com/radar/enlightenment/#respond</comments>
  856. <pubDate>Tue, 14 Oct 2025 11:03:06 +0000</pubDate>
  857. <dc:creator><![CDATA[Mike Loukides]]></dc:creator>
  858. <category><![CDATA[AI & ML]]></category>
  859. <category><![CDATA[Commentary]]></category>
  860.  
  861. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17534</guid>
  862.  
  863. <media:content
  864. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Student-sleeps-while-AI-works-1.jpg"
  865. medium="image"
  866. type="image/jpeg"
  867. width="2304"
  868. height="1792"
  869. />
  870.  
  871. <media:thumbnail
  872. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Student-sleeps-while-AI-works-1-160x160.jpg"
  873. width="160"
  874. height="160"
  875. />
  876. <description><![CDATA[In a fascinating op-ed, David Bell, a professor of history at Princeton, argues that “AI is shedding enlightenment values.” As someone who has taught writing at a similarly prestigious university, and as someone who has written about technology for the past 35 or so years, I had a deep response. Bell’s is not the argument [&#8230;]]]></description>
  877. <content:encoded><![CDATA[
  878. <p>In a fascinating op-ed, David Bell, a professor of history at Princeton, argues that “<a href="https://www.nytimes.com/2025/08/02/opinion/artificial-intelligence-enlightenment.html" target="_blank" rel="noreferrer noopener">AI is shedding enlightenment values</a>.” As someone who has taught writing at a similarly prestigious university, and as someone who has written about technology for the past 35 or so years, I had a deep response.</p>
  879.  
  880.  
  881.  
  882. <p>Bell’s is not the argument of an AI skeptic. For his argument to work, AI has to be pretty good at reasoning and writing. It’s an argument about the nature of thought itself. Reading is thinking. Writing is thinking. Those are almost clichés—they even turn up in students’ assessments of <a href="https://lithub.com/what-happened-when-i-tried-to-replace-myself-with-chatgpt-in-my-english-classroom/" target="_blank" rel="noreferrer noopener">using AI in a college writing class</a>. It’s not a surprise to see these ideas in the 18th century, and only a bit more surprising to see how far Enlightenment thinkers took them. Bell writes:</p>
  883.  
  884.  
  885.  
  886. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  887. <p><em>The great political philosopher Baron de Montesquieu wrote: “One should never so exhaust a subject that nothing is left for readers to do. The point is not to make them read, but to make them think.” Voltaire, the most famous of the French “philosophes,” claimed, “The most useful books are those that the readers write half of themselves.”</em></p>
  888. </blockquote>
  889.  
  890.  
  891.  
  892. <p>And in the late 20th century, the great Dante scholar John Freccero would say to his classes “The text reads you”: How you read <a href="https://digitaldante.columbia.edu/dante/divine-comedy/" target="_blank" rel="noreferrer noopener"><em>The Divine Comedy</em></a> tells you who you are. You inevitably find your reflection in the act of reading.</p>
  893.  
  894.  
  895.  
  896. <p>Is the use of AI an aid to thinking or a crutch or a replacement? If it’s either a crutch or a replacement, then we have to go back to Descartes’s “I think, therefore I am” and read it backward: What am I if I don’t think? What am I if I have offloaded my thinking to some other device? Bell points out that books guide the reader through the thinking process, while AI expects us to guide the process and all too often resorts to flattery. <a href="https://openai.com/index/sycophancy-in-gpt-4o/" target="_blank" rel="noreferrer noopener">Sycophancy isn’t limited to a few recent versions of GPT</a>; “That’s a great idea” has been a staple of AI chat responses since its earliest days. A dull sameness goes along with the flattery—the paradox of AI is that, for all the talk of general intelligence, it really doesn’t think better than we do. It can access a wealth of information, but it ultimately gives us (at best) an unexceptional average of what has been thought in the past. Books lead you through radically different kinds of thought. Plato is not Aquinas is not Machiavelli is not Voltaire (and for great insights on the transition from the fractured world of medieval thought to the fractured world of Renaissance thought, see Ada Palmer’s <a href="https://press.uchicago.edu/ucp/books/book/chicago/I/bo246135916.html" target="_blank" rel="noreferrer noopener"><em>Inventing the Renaissance</em></a>).</p>
  897.  
  898.  
  899.  
  900. <p>We’ve been tricked into thinking that education is about preparing to enter the workforce, whether as a laborer who can plan how to spend his paycheck (readin’, writin’, ’rithmetic) or as a potential lawyer or engineer (Bachelor’s, Master’s, Doctorate). We’ve been tricked into thinking of schools as factories—just look at any school built in the 1950s or earlier, and compare it to an early 20th century manufacturing facility. Take the children in, process them, push them out. Evaluate them with exams that don’t measure much more than the ability to take exams—not unlike the benchmarks that the AI companies are constantly quoting. The result is that students who can read Voltaire or Montesquieu as a dialogue with their own thoughts, who could potentially make a breakthrough in science or technology, are rarities. They’re not the students our institutions were designed to produce; they have to struggle against the system, and frequently fail. As one elementary school administrator told me, “They’re handicapped, as handicapped as the students who come here with learning disabilities. But we can do little to help them.”</p>
  901.  
  902.  
  903.  
  904. <p>So the difficult question behind Bell’s article is: How do we teach students to think in a world that will inevitably be full of AI, whether or not that AI looks like our current LLMs? In the end, education isn’t about collecting facts, duplicating the answers in the back of the book, or getting passing grades. It’s about learning to think. The educational system gets in the way of education, leading to short-term thinking. If I’m measured by a grade, I should do everything I can to optimize that metric. <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law" target="_blank" rel="noreferrer noopener">All metrics will be gamed</a>. Even if they aren’t gamed, metrics shortcut around the real issues.</p>
  905.  
  906.  
  907.  
  908. <p>In a world full of AI, retreating to stereotypes like “AI is damaging” and “AI hallucinates” misses the point, and is a sure route to failure. What’s damaging isn’t the AI, but the set of attitudes that make AI just another tool for gaming the system. We need a way of thinking with AI, of arguing with it, of completing AI’s “book” in a way that goes beyond maximizing a score. In this light, so much of the discourse around AI has been misguided. I still hear people say that AI will save you from needing to know the facts, that you won’t have to learn the dark and difficult corners of programming languages—but as much as I personally would like to take the easy route, facts are the skeleton on which thinking is based. Patterns arise out of facts, whether those patterns are historical movements, scientific theories, or software designs. And errors are easily uncovered when you engage actively with AI’s output.</p>
  909.  
  910.  
  911.  
  912. <p>AI can help to assemble facts, but at some point those facts need to be internalized. I can name a dozen (or two or three) important writers and composers whose best work came around 1800. What does it take to go from those facts to a conception of the Romantic movement? An AI could certainly assemble and group those facts, but would you then be able to think about what that movement meant (and continues to mean) for European culture? What are the bigger patterns revealed by the facts? And what would it mean for those facts and patterns to reside only within an AI model, without human comprehension? You need to know the shape of history, particularly if you want to think productively about it. You need to know the dark corners of your programming languages if you’re going to debug a mess of AI-generated code. Returning to Bell’s argument, the ability to find patterns is what allows you to complete Voltaire’s writing. AI can be a tremendous aid in finding those patterns, but as human thinkers, we have to make those patterns our own.</p>
  913.  
  914.  
  915.  
  916. <p>That’s really what learning is about. It isn’t just collecting facts, though facts are important. Learning is about understanding and finding relationships and understanding how those relationships change and evolve. It’s about weaving the narrative that connects our intellectual worlds together. That’s enlightenment. AI can be a valuable tool in that process, as long as you don’t mistake the means for the end. It can help you come up with new ideas and new ways of thinking. Nothing says that you can’t have the kind of mental dialogue that Bell writes about with an AI-generated essay. ChatGPT may not be Voltaire, but not much is. But if you don’t have the kind of dialogue that lets you internalize the relationships hidden behind the facts, AI is a hindrance. We’re all prone to be lazy—intellectually and otherwise. What’s the point at which thinking stops? What’s the point at which knowledge ceases to become your own? Or, to go back to the Enlightenment thinkers, when do you stop writing your share of the book?</p>
  917.  
  918.  
  919.  
  920. <p>That’s not a choice AI makes for you. It’s your choice.</p>
  921. ]]></content:encoded>
  922. <wfw:commentRss>https://www.oreilly.com/radar/enlightenment/feed/</wfw:commentRss>
  923. <slash:comments>0</slash:comments>
  924. </item>
  925. <item>
  926. <title>The Architect&#8217;s Dilemma</title>
  927. <link>https://www.oreilly.com/radar/the-architects-dilemma/</link>
  928. <comments>https://www.oreilly.com/radar/the-architects-dilemma/#respond</comments>
  929. <pubDate>Mon, 13 Oct 2025 11:22:35 +0000</pubDate>
  930. <dc:creator><![CDATA[Heiko Hotz]]></dc:creator>
  931. <category><![CDATA[AI & ML]]></category>
  932. <category><![CDATA[Deep Dive]]></category>
  933.  
  934. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17515</guid>
  935.  
  936. <media:content
  937. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Robot-Concierge-maximalism.jpg"
  938. medium="image"
  939. type="image/jpeg"
  940. width="960"
  941. height="747"
  942. />
  943.  
  944. <media:thumbnail
  945. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Robot-Concierge-maximalism-160x160.jpg"
  946. width="160"
  947. height="160"
  948. />
  949. <custom:subtitle><![CDATA[Choosing Between Tools and Agents with MCP and A2A]]></custom:subtitle>
  950. <description><![CDATA[The agentic AI landscape is exploding. Every new framework, demo, and announcement promises to let your AI assistant book flights, query databases, and manage calendars. This rapid advancement of capabilities is thrilling for users, but for the architects and engineers building these systems, it poses a fundamental question: When should a new capability be a [&#8230;]]]></description>
  951. <content:encoded><![CDATA[
  952. <p>The agentic AI landscape is exploding. Every new framework, demo, and announcement promises to let your AI assistant book flights, query databases, and manage calendars. This rapid advancement of capabilities is thrilling for users, but for the architects and engineers building these systems, it poses a fundamental question: When should a new capability be a simple, predictable <em>tool</em> (exposed via the Model Context Protocol, MCP) and when should it be a sophisticated, collaborative <em>agent</em> (exposed via the Agent2Agent Protocol, A2A)?</p>
  953.  
  954.  
  955.  
  956. <p>The common advice is often circular and unhelpful: “Use MCP for tools and A2A for agents.” This is like telling a traveler that cars use motorways and trains use tracks, without offering any guidance on which is better for a specific journey. This lack of a clear mental model leads to architectural guesswork. Teams build complex conversational interfaces for tasks that demand rigid predictability, or they expose rigid APIs to users who desperately need guidance. The outcome is often the same: a system that looks great in demos but falls apart in the real world.</p>
  957.  
  958.  
  959.  
  960. <p>In this article, I argue that the answer isn’t found by analyzing your service&#8217;s internal logic or technology stack. It&#8217;s found by looking outward and asking a single, fundamental question: Who is calling your product/service? By reframing the problem this way—as a user experience challenge first and a technical one second—the architect’s dilemma evaporates.</p>
  961.  
  962.  
  963.  
  964. <p>This essay draws a line where it matters for architects: the line between MCP tools and A2A agents. I will introduce a clear framework, built around the “Vending Machine Versus Concierge” model, to help you choose the right interface based on your consumer&#8217;s needs. I will also explore failure modes, testing, and the powerful <em>Gatekeeper Pattern</em> that shows how these two interfaces can work together to create systems that are not just clever but truly reliable.</p>
  965.  
  966.  
  967.  
  968. <h2 class="wp-block-heading"><strong>Two Very Different Interfaces</strong></h2>
  969.  
  970.  
  971.  
  972. <p>MCP presents tools—named operations with declared inputs and outputs. The caller (a person, program, or agent) must already know what it wants, and provide a complete payload. The tool validates, executes once, and returns a result. If your mental image is a vending machine—insert a well-formed request, get a deterministic response—you’re close enough.</p>
  973.  
  974.  
  975.  
  976. <p>A2A presents agents—goal-first collaborators that converse, plan, and act across turns. The caller expresses an outcome (“book a refundable flight under $450”), not an argument list. The agent asks clarifying questions, calls tools as needed, and holds onto session state until the job is done. If you picture a concierge—interacting, negotiating trade-offs, and occasionally escalating—you’re in the right neighborhood.</p>
  977.  
  978.  
  979.  
  980. <p>Neither interface is “better.” They are optimized for different situations:</p>
  981.  
  982.  
  983.  
  984. <ul class="wp-block-list">
  985. <li>MCP is fast to reason about, easy to test, and strong on determinism and auditability.</li>
  986.  
  987.  
  988.  
  989. <li>A2A is built for ambiguity, long-running processes, and preference capture.</li>
  990. </ul>
  991.  
  992.  
  993.  
  994. <h2 class="wp-block-heading"><strong>Bringing the Interfaces to Life: A Booking Example</strong></h2>
  995.  
  996.  
  997.  
  998. <p>To see the difference in practice, let&#8217;s imagine a simple task: booking a specific meeting room in an office.</p>
  999.  
  1000.  
  1001.  
  1002. <p><strong>The MCP &#8220;vending machine&#8221;</strong> expects a perfectly structured, machine-readable request for its book_room_tool. The caller must provide all necessary information in a single, valid payload:</p>
  1003.  
  1004.  
  1005.  
  1006. <pre class="wp-block-code"><code>{
  1007.  "jsonrpc": "2.0",
  1008.  "id": 42,
  1009.  "method": "tools/call",
  1010.  "params": {
  1011.    "name": "book_room_tool",
  1012.    "arguments": {
  1013.      "room_id": "CR-104B",
  1014.      "start_time": "2025-11-05T14:00:00Z",
  1015.      "end_time": "2025-11-05T15:00:00Z",
  1016.      "organizer": "user@example.com"
  1017.    }
  1018.  }
  1019. }</code></pre>
  1020.  
  1021.  
  1022.  
  1023. <p>Any deviation—a missing field or incorrect data type—results in an immediate error. This is the vending machine: You provide the exact code of the item you want (e.g., “D4”) or you get nothing.</p>
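
<p>To make that failure mode concrete, here is a minimal sketch of the kind of response a malformed call might get back. The message text is illustrative; -32602 is the standard JSON-RPC &#8220;invalid params&#8221; code that MCP&#8217;s JSON-RPC transport uses.</p>

<pre class="wp-block-code"><code>{
  "jsonrpc": "2.0",
  "id": 42,
  "error": {
    "code": -32602,
    "message": "Invalid params: missing required field 'end_time'"
  }
}</code></pre>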
  1024.  
  1025.  
  1026.  
  1027. <p><strong>The A2A &#8220;concierge,&#8221;</strong> an &#8220;office assistant&#8221; agent, is approached with a high-level, ambiguous goal. It uses conversation to resolve ambiguity:</p>
  1028.  
  1029.  
  1030.  
  1031. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  1032. <p><strong>User:</strong> &#8220;Hey, can you book a room for my 1-on-1 with Alex tomorrow afternoon?&#8221;<br><strong>Agent:</strong> &#8220;Of course. To make sure I get the right one, what time works best, and how long will you need it for?&#8221;</p>
  1033. </blockquote>
  1034.  
  1035.  
  1036.  
  1037. <p>The agent’s job is to take the ambiguous goal, gather the necessary details, and then likely call the MCP tool behind the scenes once it has a complete, valid set of arguments.</p>
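
<p>For contrast with the MCP payload above, here is a rough sketch of how that same goal might arrive at the office assistant agent over A2A. Note that the &#8220;payload&#8221; is just the user&#8217;s natural-language goal rather than a structured argument list; the method name and message/parts shape follow the published A2A spec, but treat the exact field names as illustrative, since the protocol is still evolving.</p>

<pre class="wp-block-code"><code>{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "message/send",
  "params": {
    "message": {
      "role": "user",
      "parts": [
        {
          "kind": "text",
          "text": "Hey, can you book a room for my 1-on-1 with Alex tomorrow afternoon?"
        }
      ],
      "messageId": "msg-001"
    }
  }
}</code></pre>

<p>The agent&#8217;s clarifying question and the eventual confirmation come back as further messages and task updates in the same session, which is exactly the stateful, multiturn behavior the framework below is designed to surface.</p>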
  1038.  
  1039.  
  1040.  
  1041. <p>With this clear dichotomy established—the predictable vending machine (MCP) versus the stateful concierge (A2A)—how do we choose? As I argued in the introduction, the answer isn’t found in your tech stack. It’s found by asking the most important architectural question of all: <strong>Who is calling your service?</strong></p>
  1042.  
  1043.  
  1044.  
  1045. <h3 class="wp-block-heading"><strong>Step 1: Identify your consumer</strong></h3>
  1046.  
  1047.  
  1048.  
  1049. <ol class="wp-block-list">
  1050. <li><strong>The machine consumer: A need for predictability</strong><br>Is your service going to be called by another automated system, a script, or another agent acting in a purely deterministic capacity? This consumer requires absolute predictability. It needs a rigid, unambiguous contract that can be scripted and relied upon to behave the same way every single time. It cannot handle a clarifying question or an unexpected update; any deviation from the strict contract is a failure. <em>This consumer doesn’t want a conversation; it needs a vending machine.</em> This nonnegotiable requirement for a predictable, stateless, and transactional interface points directly to designing your service as a tool (MCP).</li>
  1051.  
  1052.  
  1053.  
  1054. <li><strong>The human (or agentic) consumer: A need for convenience</strong><br>Is your service being built for a human end user or for a sophisticated AI that&#8217;s trying to fulfill a complex, high-level goal? This consumer values convenience and the offloading of cognitive load. They don’t want to specify every step of a process; they want to delegate ownership of a goal and trust that it will be handled. They&#8217;re comfortable with ambiguity because they expect the service—the agent—to resolve it on their behalf. <em>This consumer doesn’t want to follow a rigid script; they need a concierge. </em>This requirement for a stateful, goal-oriented, and conversational interface points directly to designing your service as an agent (A2A).</li>
  1055. </ol>
  1056.  
  1057.  
  1058.  
  1059. <p>By starting with the consumer, the architect’s dilemma often evaporates. Before you ever debate statefulness or determinism, you first define the user experience you are obligated to provide. In most cases, identifying your customer will give you your definitive answer.</p>
  1060.  
  1061.  
  1062.  
  1063. <h3 class="wp-block-heading"><strong>Step 2: Validate with the four factors</strong></h3>
  1064.  
  1065.  
  1066.  
  1067. <p>Once you have identified who calls your service, you have a strong hypothesis for your design. A machine consumer points to a tool; a human or agentic consumer points to an agent. The next step is to validate this hypothesis with a technical litmus test. This framework gives you the vocabulary to justify your choice and ensure the underlying architecture matches the user experience you intend to create.</p>
  1068.  
  1069.  
  1070.  
  1071. <ol class="wp-block-list">
  1072. <li><strong>Determinism versus ambiguity</strong><br>Does your service require a precise, unambiguous input, or is it designed to interpret and resolve ambiguous goals? <em>A vending machine is deterministic.</em> Its API is rigid: <code>GET /item/D4</code>. Any other request is an error. This is the world of MCP, where a strict schema ensures predictable interactions. <em>A concierge handles ambiguity.</em> &#8220;Find me a nice place for dinner&#8221; is a valid request that the agent is expected to clarify and execute. This is the world of A2A, where a conversational flow allows for clarification and negotiation.</li>
  1073.  
  1074.  
  1075.  
  1076. <li><strong>Simple execution versus complex process</strong><br>Is the interaction a single, one-shot execution, or a long-running, multistep process? <em>A vending machine performs a short-lived execution.</em> The entire operation—from payment to dispensing—is an atomic transaction that is over in seconds. This aligns with the synchronous-style, one-shot model of MCP. <em>A concierge manages a process.</em> Booking a full travel itinerary might take hours or even days, with multiple updates along the way. This requires the asynchronous, stateful nature of A2A, which can handle long-running tasks gracefully.</li>
  1077.  
  1078.  
  1079.  
  1080. <li><strong>Stateless versus stateful</strong><br>Does each request stand alone or does the service need to remember the context of previous interactions? <em>A vending machine is stateless.</em> It doesn&#8217;t remember that you bought a candy bar five minutes ago. Each transaction is a blank slate. MCP is designed for these self-contained, stateless calls. <em>A concierge is stateful. </em>It remembers your preferences, the details of your ongoing request, and the history of your conversation. A2A is built for this, using concepts like a session or thread ID to maintain context.</li>
  1081.  
  1082.  
  1083.  
  1084. <li><strong>Direct control versus delegated ownership</strong><br>Is the consumer orchestrating every step, or are they delegating the entire goal? <em>When using a vending machine, the consumer is in direct control.</em> You are the orchestrator, deciding which button to press and when. With MCP, the calling application retains full control, making a series of precise function calls to achieve its own goal. <em>With a concierge, you delegate ownership.</em> You hand over the high-level goal and trust the agent to manage the details. This is the core model of A2A, where the consumer offloads the cognitive load and trusts the agent to deliver the outcome.</li>
  1085. </ol>
  1086.  
  1087.  
  1088.  
  1089. <figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Factor</strong></td><td><strong>Tool (MCP)</strong></td><td><strong>Agent (A2A)</strong></td><td><strong>Key question</strong></td></tr><tr><td><em>Determinism</em></td><td>Strict schema; errors on deviation</td><td>Clarifies ambiguity via dialogue</td><td>Can inputs be fully specified up front?</td></tr><tr><td><em>Process</em></td><td>One-shot</td><td>Multi-step/long-running</td><td>Is this atomic or a workflow?</td></tr><tr><td><em>State</em></td><td>Stateless</td><td>Stateful/sessionful</td><td>Must we remember context/preferences?</td></tr><tr><td><em>Control</em></td><td>Caller orchestrates</td><td>Ownership delegated</td><td>Who drives: the caller or callee?</td></tr></tbody></table></figure>
  1090.  
  1091.  
  1092.  
  1093. <p><em>Table 1: Four question framework</em></p>
  1094.  
  1095.  
  1096.  
  1097. <p>These factors are not independent checkboxes; they are four facets of the same core principle. A service that is deterministic, transactional, stateless, and directly controlled is a tool. A service that handles ambiguity, manages a process, maintains state, and takes ownership is an agent. By using this framework, you can confidently validate that the technical architecture of your service aligns perfectly with the needs of your customer.</p>
  1098.  
  1099.  
  1100.  
  1101. <h3 class="wp-block-heading"><strong>No framework, no matter how clear&#8230;</strong></h3>
  1102.  
  1103.  
  1104.  
  1105. <p>&#8230;can perfectly capture the messiness of the real world. While the &#8220;Vending Machine Versus Concierge&#8221; model provides a robust guide, architects will eventually encounter services that seem to blur the lines. The key is to remember the core principle we&#8217;ve established: The choice is dictated by the consumer&#8217;s experience, not the service&#8217;s internal complexity.</p>
  1106.  
  1107.  
  1108.  
  1109. <p>Let&#8217;s explore two common edge cases.</p>
  1110.  
  1111.  
  1112.  
  1113. <p><strong>The complex tool: The iceberg</strong><br>Consider a service that performs a highly complex, multistep internal process, like a video transcoding API. A consumer sends a video file and a desired output format. This is a simple, predictable request. But internally, this one call might kick off a massive, long-running workflow involving multiple machines, quality checks, and encoding steps. It&#8217;s a hugely complex process.</p>
  1114.  
  1115.  
  1116.  
  1117. <p>However, from the consumer&#8217;s perspective, none of that matters. They made a single, stateless, fire-and-forget call. They don&#8217;t need to manage the process; they just need a predictable result. This service is like an iceberg: 90% of its complexity is hidden beneath the surface. But because its external contract is that of a vending machine—a simple, deterministic, one-shot transaction—it is, and should be, implemented as a tool (MCP).</p>
  1118.  
  1119.  
  1120.  
  1121. <p><strong>The simple agent: The scripted conversation</strong><br>Now consider the opposite: a service with very simple internal logic that still requires a conversational interface. Imagine a chatbot for booking a dentist appointment. The internal logic might be a simple state machine: ask for a date, then a time, then a patient name. It&#8217;s not &#8220;intelligent&#8221; or particularly flexible.</p>
  1122.  
  1123.  
  1124.  
  1125. <p>However, it must remember the user&#8217;s previous answers to complete the booking. It&#8217;s an inherently stateful, multiturn interaction. The consumer cannot provide all the required information in a single, prevalidated call. They need to be guided through the process. Despite its internal simplicity, the need for a stateful dialogue makes it a concierge. It must be implemented as an agent (A2A) because its consumer-facing experience is that of a conversation, however scripted.</p>
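<p>As a rough illustration of how little machinery &#8220;stateful&#8221; requires, here is a minimal Python sketch of that scripted concierge. The class and field names are hypothetical and not tied to any particular A2A SDK; the point is only that the answers collected so far are session state that must survive across turns.</p>



<pre class="wp-block-code"><code># Minimal sketch of the scripted dentist-booking concierge. Names are
# illustrative; the dict of answers is the session state across turns.
class DentistBookingAgent:
    FIELDS = ["date", "time", "patient_name"]
    PROMPTS = {
        "date": "What day would you like to come in?",
        "time": "What time works for you?",
        "patient_name": "Whose appointment is this?",
    }

    def __init__(self):
        self.answers = {}  # conversation state (the "session")

    def record_answer(self, user_message):
        # Attach the reply to whichever field we asked about last.
        for field in self.FIELDS:
            if field not in self.answers:
                self.answers[field] = user_message.strip()
                return

    def next_prompt(self):
        # Ask for the first missing field; confirm once everything is collected.
        for field in self.FIELDS:
            if field not in self.answers:
                return self.PROMPTS[field]
        a = self.answers
        return f"Booked {a['patient_name']} for {a['date']} at {a['time']}."

agent = DentistBookingAgent()
print(agent.next_prompt())            # "What day would you like to come in?"
agent.record_answer("Next Tuesday")
print(agent.next_prompt())            # "What time works for you?"</code></pre>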
  1126.  
  1127.  
  1128.  
  1129. <p>These gray areas reinforce the framework&#8217;s central lesson. Don&#8217;t get distracted by what your service does internally. Focus on the experience it provides externally. That contract with your customer is the ultimate arbiter in the architect&#8217;s dilemma.</p>
  1130.  
  1131.  
  1132.  
  1133. <h2 class="wp-block-heading"><strong>Testing What Matters: Different Strategies for Different Interfaces</strong></h2>
  1134.  
  1135.  
  1136.  
  1137. <p>A service&#8217;s interface doesn&#8217;t just dictate its design; it dictates how you validate its correctness. Vending machines and concierges have fundamentally different failure modes and require different testing strategies.</p>
  1138.  
  1139.  
  1140.  
  1141. <p><strong>Testing MCP tools (vending machines):</strong></p>
  1142.  
  1143.  
  1144.  
  1145. <ul class="wp-block-list">
  1146. <li><strong>Contract testing:</strong> Validate that inputs and outputs strictly adhere to the defined schema.</li>
  1147.  
  1148.  
  1149.  
  1150. <li><strong>Idempotency tests:</strong> Ensure that calling the tool multiple times with the same inputs produces the same result without side effects.</li>
  1151.  
  1152.  
  1153.  
  1154. <li><strong>Deterministic logic tests:</strong> Use standard unit and integration tests with fixed inputs and expected outputs.</li>
  1155.  
  1156.  
  1157.  
  1158. <li><strong>Adversarial fuzzing:</strong> Test for security vulnerabilities by providing malformed or unexpected arguments.</li>
  1159. </ul>
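<p>To make the first two checks concrete, here is a minimal, self-contained Python sketch (runnable with pytest). The <code>book_room</code> function is an in-memory stand-in for the booking tool and its error shape is an assumption; in a real suite you would call the actual tool through your MCP client instead.</p>



<pre class="wp-block-code"><code># Contract and idempotency tests against an in-memory stand-in for
# book_room_tool. Swap in real calls through your MCP client in practice.
REQUIRED = {"room_id", "start_time", "end_time", "organizer"}
_bookings = {}  # (room_id, start_time, end_time) -> booking id

def book_room(arguments):
    """Stand-in tool: strict schema, idempotent on identical input."""
    missing = REQUIRED - arguments.keys()
    if missing:
        return {"isError": True, "missing": sorted(missing)}
    key = (arguments["room_id"], arguments["start_time"], arguments["end_time"])
    booking_id = _bookings.setdefault(key, f"bk-{len(_bookings) + 1}")
    return {"isError": False, "booking_id": booking_id}

VALID = {
    "room_id": "CR-104B",
    "start_time": "2025-11-05T14:00:00Z",
    "end_time": "2025-11-05T15:00:00Z",
    "organizer": "user@example.com",
}

def test_contract_rejects_missing_field():
    args = dict(VALID)
    del args["end_time"]
    assert book_room(args)["isError"]  # schema violation is an immediate error

def test_repeated_call_is_idempotent():
    first = book_room(VALID)
    second = book_room(VALID)
    assert first["booking_id"] == second["booking_id"]  # no double booking</code></pre>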
  1160.  
  1161.  
  1162.  
  1163. <p><strong>Testing A2A agents (concierges):</strong></p>
  1164.  
  1165.  
  1166.  
  1167. <ul class="wp-block-list">
  1168. <li><strong>Goal completion rate (GCR):</strong> Measure the percentage of conversations where the agent successfully achieved the user&#8217;s high-level goal.</li>
  1169.  
  1170.  
  1171.  
  1172. <li><strong>Conversational efficiency:</strong> Track the number of turns or clarifications required to complete a task.</li>
  1173.  
  1174.  
  1175.  
  1176. <li><strong>Tool selection accuracy:</strong> For complex agents, verify that the right MCP tool was chosen for a given user request.</li>
  1177.  
  1178.  
  1179.  
  1180. <li><strong>Conversation replay testing:</strong> Use logs of real user interactions as a regression suite to ensure updates don&#8217;t break existing conversational flows.</li>
  1181. </ul>
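<p>The first two of these metrics are straightforward to compute from conversation logs. Here is a minimal sketch; the log format (one record per conversation) is an illustrative assumption.</p>



<pre class="wp-block-code"><code># Scoring conversation logs for goal completion rate and conversational
# efficiency. The log format is an illustrative assumption.
conversations = [
    {"goal_achieved": True, "turns": 4},
    {"goal_achieved": True, "turns": 7},
    {"goal_achieved": False, "turns": 12},
]

def goal_completion_rate(logs):
    return sum(c["goal_achieved"] for c in logs) / len(logs)

def average_turns_to_completion(logs):
    completed = [c["turns"] for c in logs if c["goal_achieved"]]
    return sum(completed) / len(completed)

print(f"GCR: {goal_completion_rate(conversations):.0%}")                        # 67%
print(f"Turns to completion: {average_turns_to_completion(conversations):.1f}") # 5.5</code></pre>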
  1182.  
  1183.  
  1184.  
  1185. <h2 class="wp-block-heading"><strong>The Gatekeeper Pattern</strong></h2>
  1186.  
  1187.  
  1188.  
  1189. <p>Our journey so far has focused on a dichotomy: MCP or A2A, vending machine or concierge. But the most sophisticated and robust agentic systems do not force a choice. Instead, they recognize that these two protocols don’t compete with each other; they complement each other. The ultimate power lies in using them together, with each playing to its strengths.</p>
  1190.  
  1191.  
  1192.  
  1193. <p>The most effective way to achieve this is through a powerful architectural choice we can call the Gatekeeper Pattern.</p>
  1194.  
  1195.  
  1196.  
  1197. <p>In this pattern, a single, stateful A2A agent acts as the primary, user-facing entry point—the concierge. Behind this gatekeeper sits a collection of discrete, stateless MCP tools—the vending machines. The A2A agent takes on the complex, messy work of understanding a high-level goal, managing the conversation, and maintaining state. It then acts as an intelligent orchestrator, making precise, one-shot calls to the appropriate MCP tools to execute specific tasks.</p>
  1198.  
  1199.  
  1200.  
  1201. <p>Consider a travel agent. A user interacts with it via A2A, giving it a high-level goal: &#8220;Plan a business trip to London for next week.&#8221;</p>
  1202.  
  1203.  
  1204.  
  1205. <ul class="wp-block-list">
  1206. <li>The travel agent (A2A) accepts this ambiguous request and starts a conversation to gather details (exact dates, budget, etc.).</li>
  1207.  
  1208.  
  1209.  
  1210. <li>Once it has the necessary information, it calls a flight_search_tool (MCP) with precise arguments like origin, destination, and date.</li>
  1211.  
  1212.  
  1213.  
  1214. <li>It then calls a hotel_booking_tool (MCP) with the required city, check_in_date, and room_type.</li>
  1215.  
  1216.  
  1217.  
  1218. <li>Finally, it might call a currency_converter_tool (MCP) to provide expense estimates.</li>
  1219. </ul>
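<p>Sketched in Python, the orchestration step of that flow might look like the following once the conversation has resolved the details. The <code>call_mcp_tool</code> helper and the argument names are illustrative stand-ins for a real MCP client, not a specific SDK.</p>



<pre class="wp-block-code"><code># Minimal sketch of the gatekeeper's orchestration once the A2A conversation
# has turned "plan a business trip to London" into concrete details.
def call_mcp_tool(name, arguments):
    print(f"tools/call -> {name}: {arguments}")  # stand-in for a real MCP request
    return {"ok": True, "tool": name}

def plan_business_trip(details):
    """The concierge pressing the vending machines' buttons in order."""
    flight = call_mcp_tool("flight_search_tool", {
        "origin": details["origin"],
        "destination": "LHR",
        "date": details["departure_date"],
    })
    hotel = call_mcp_tool("hotel_booking_tool", {
        "city": "London",
        "check_in_date": details["departure_date"],
        "room_type": details.get("room_type", "standard"),
    })
    expenses = call_mcp_tool("currency_converter_tool", {
        "amount": details["budget"],
        "from_currency": details["home_currency"],
        "to_currency": "GBP",
    })
    return {"flight": flight, "hotel": hotel, "expenses": expenses}

itinerary = plan_business_trip({
    "origin": "SFO",
    "departure_date": "2025-11-10",
    "budget": 3000,
    "home_currency": "USD",
})</code></pre>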
  1220.  
  1221.  
  1222.  
  1223. <p>Each tool is a simple, reliable, and stateless vending machine. The A2A agent is the smart concierge that knows which buttons to press and in what order. This pattern provides several significant architectural benefits:</p>
  1224.  
  1225.  
  1226.  
  1227. <ul class="wp-block-list">
  1228. <li><strong>Decoupling:</strong> It separates the complex, conversational logic (the &#8220;how&#8221;) from the simple, reusable business logic (the &#8220;what&#8221;). The tools can be developed, tested, and maintained independently.</li>
  1229.  
  1230.  
  1231.  
  1232. <li><strong>Centralized governance:</strong> The A2A gatekeeper is the perfect place to implement cross-cutting concerns. It can handle authentication, enforce rate limits, manage user quotas, and log all activity before a single tool is ever invoked.</li>
  1233.  
  1234.  
  1235.  
  1236. <li><strong>Simplified tool design:</strong> Because the tools are just simple MCP functions, they don&#8217;t need to worry about state or conversational context. Their job is to do one thing and do it well, making them incredibly robust.</li>
  1237. </ul>
  1238.  
  1239.  
  1240.  
  1241. <h2 class="wp-block-heading"><strong>Making the Gatekeeper Production-Ready</strong></h2>
  1242.  
  1243.  
  1244.  
  1245. <p>Beyond its design benefits, the Gatekeeper Pattern is the ideal place to implement the operational guardrails required to run a reliable agentic system in production.</p>
  1246.  
  1247.  
  1248.  
  1249. <ul class="wp-block-list">
  1250. <li><strong>Observability:</strong> Each A2A conversation generates a unique trace ID. This ID must be propagated to every downstream MCP tool call, allowing you to trace a single user request across the entire system. Structured logs for tool inputs and outputs (with PII redacted) are critical for debugging.</li>
  1251.  
  1252.  
  1253.  
  1254. <li><strong>Guardrails and security:</strong> The A2A Gatekeeper acts as a single point of enforcement for critical policies. It handles authentication and authorization for the user, enforces rate limits and usage quotas, and can maintain a list of which tools a particular user or group is allowed to call.</li>
  1255.  
  1256.  
  1257.  
  1258. <li><strong>Resilience and fallbacks:</strong> The Gatekeeper must gracefully manage failure. When it calls an MCP tool, it should implement patterns like timeouts, retries with exponential backoff, and circuit breakers. Critically, it is responsible for the final failure state—escalating to a human in the loop for review or clearly communicating the issue to the end user.</li>
  1259. </ul>
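<p>As one example of the resilience guardrail, here is a minimal retry-with-exponential-backoff wrapper the gatekeeper might apply around each tool call. It is a sketch only; a production version would also propagate the trace ID, enforce a timeout per attempt, and feed a circuit breaker.</p>



<pre class="wp-block-code"><code># Minimal sketch of retrying an MCP tool call with exponential backoff.
import time

def call_with_retries(tool_call, max_attempts=3, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_call()
        except Exception:  # in practice, catch your MCP client's error types
            if attempt == max_attempts:
                raise  # final failure: escalate to a human or report to the user
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, doubling

# Usage: result = call_with_retries(lambda: call_mcp_tool("flight_search_tool", args))</code></pre>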
  1260.  
  1261.  
  1262.  
  1263. <p>The Gatekeeper Pattern is the ultimate synthesis of our framework. It uses A2A for what it does best—managing a stateful, goal-oriented process—and MCP for what it was designed for—the reliable, deterministic execution of a task.</p>
  1264.  
  1265.  
  1266.  
  1267. <h2 class="wp-block-heading"><strong>Conclusion</strong></h2>
  1268.  
  1269.  
  1270.  
  1271. <p>We began this journey with a simple but frustrating problem: the architect&#8217;s dilemma. Faced with the circular advice that &#8220;MCP is for tools and A2A is for agents,&#8221; we were left in the same position as a traveler trying to get to Edinburgh—knowing that cars use motorways and trains use tracks but with no intuition on which to choose for our specific journey.</p>
  1272.  
  1273.  
  1274.  
  1275. <p>The goal was to build that intuition. We did this not by accepting abstract labels, but by reasoning from first principles. We dissected the protocols themselves, revealing how their core mechanics inevitably lead to two distinct service profiles: the predictable, one-shot &#8220;vending machine&#8221; and the stateful, conversational &#8220;concierge.&#8221;</p>
  1276.  
  1277.  
  1278.  
  1279. <p>With that foundation, we established a clear, two-step framework for a confident design choice:</p>
  1280.  
  1281.  
  1282.  
  1283. <ol class="wp-block-list">
  1284. <li><strong>Start with your customer.</strong> The most critical question is not a technical one but an experiential one. A machine consumer needs the predictability of a vending machine (MCP). A human or agentic consumer needs the convenience of a concierge (A2A).</li>
  1285.  
  1286.  
  1287.  
  1288. <li><strong>Validate with the four factors.</strong> Use the litmus test of determinism, process, state, and ownership to technically justify and solidify your choice.</li>
  1289. </ol>
  1290.  
  1291.  
  1292.  
  1293. <p>Ultimately, the most robust systems will synthesize both, using the Gatekeeper Pattern to combine the strengths of a user-facing A2A agent with a suite of reliable MCP tools.</p>
  1294.  
  1295.  
  1296.  
  1297. <p>The choice is no longer a dilemma. By focusing on the consumer&#8217;s needs and understanding the fundamental nature of the protocols, architects can move from confusion to confidence, designing agentic ecosystems that are not just functional but also intuitive, scalable, and maintainable.</p>
  1298. ]]></content:encoded>
  1299. <wfw:commentRss>https://www.oreilly.com/radar/the-architects-dilemma/feed/</wfw:commentRss>
  1300. <slash:comments>0</slash:comments>
  1301. </item>
  1302. <item>
  1303. <title>Everyday AI Agents</title>
  1304. <link>https://www.oreilly.com/radar/everyday-ai-agents/</link>
  1305. <comments>https://www.oreilly.com/radar/everyday-ai-agents/#respond</comments>
  1306. <pubDate>Fri, 10 Oct 2025 11:30:16 +0000</pubDate>
  1307. <dc:creator><![CDATA[David Michelson]]></dc:creator>
  1308. <category><![CDATA[AI & ML]]></category>
  1309. <category><![CDATA[Events]]></category>
  1310.  
  1311. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17512</guid>
  1312.  
  1313. <media:content
  1314. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Everyday-AI-Agents.jpg"
  1315. medium="image"
  1316. type="image/jpeg"
  1317. width="2304"
  1318. height="1792"
  1319. />
  1320.  
  1321. <media:thumbnail
  1322. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Everyday-AI-Agents-160x160.jpg"
  1323. width="160"
  1324. height="160"
  1325. />
  1326. <description><![CDATA[A common misconception about O&#8217;Reilly is that we cater only to the deeply technical learner. While we&#8217;re proud of our deep roots in the tech community, the breadth of our offerings, both in books and on our learning platform, has always aimed to reach a broader audience of tech-adjacent and tech-curious people who want to [&#8230;]]]></description>
  1327. <content:encoded><![CDATA[
  1328. <p>A common misconception about O&#8217;Reilly is that we cater only to the deeply technical learner. While we&#8217;re proud of our deep roots in the tech community, the breadth of our offerings, both in books and on our learning platform, has always aimed to reach a broader audience of tech-adjacent and tech-curious people who want to learn new technologies and skills to improve how they work. For this audience, generative AI has opened up a world of new capabilities, making it possible to contribute to technical work that previously required coding knowledge or specialized expertise. As <a href="https://www.oreilly.com/radar/ai-and-programming-the-beginning-of-a-new-era/" target="_blank" rel="noreferrer noopener">Tim O’Reilly has put it,</a> “the addressable surface area of programming has gone up by orders of magnitude. There’s so much more to do and explore.”</p>
  1329.  
  1330.  
  1331.  
  1332. <p>Over the last few years, many in this less technical audience have become adept at using chatbots in their daily lives for summarizing, writing, analyzing data, automating tedious tasks, and even prototyping. But this proficiency with chatbots is just the beginning. The underlying technology has evolved beyond simple conversations and outputs to power the next step: AI agents.</p>
  1333.  
  1334.  
  1335.  
  1336. <p>While chatbots are great for answering questions and generating outputs, AI agents are designed to take action. They are proactive, goal-oriented, and can handle complex, multi-step tasks. If we’re often encouraged to think of chatbots as bright but overconfident interns, we can think of AI agents as competent direct reports you can hand an entire project to. They’ve been trained, understand their goals, and can make decisions and employ tools to achieve their ends, all with minimal oversight. Across industries, agents are already handling real work, from automating software development to managing complex marketing campaigns and customer service calls. But there’s a gap. Many people who are comfortable with chatbots don’t yet see the path to harnessing the power of agents in their everyday work. They’ve heard the hype, but how can agents impact their daily work? How do you get started?</p>
  1337.  
  1338.  
  1339.  
  1340. <p>This is why we&#8217;ve created the October 23rd <a href="https://learning.oreilly.com/live-events/genai-superstream-everyday-ai-agents/0642572213459/" target="_blank" rel="noreferrer noopener">GenAI Superstream: Everyday AI Agents</a>. This event is designed to bridge that gap and show you how to move from simply chatting with AI to building and deploying AI agents that can become valuable co-workers. Kathy Pham (VP of AI at Workday) and Claire Vo (CEO at ChatPRD) will kick off the conference with a fireside chat about how agents are already changing work and why it matters. From there, we’ll get into the specifics. You&#8217;ll hear from Jacob Bank of Relay.app, who will help demystify agents and share real patterns for automating your work, and from April Dunnam of Microsoft, who will demonstrate how to build agents directly within Microsoft 365. You’ll also learn how agents can help designers enforce creative governance with Nadia Elinbabi of Lowes and manage complex product workflows with Aman Khan of Arize AI. David Griffiths of HereScreen will explain how thinking like a programmer—without needing to be one—can help you design more intelligent and flexible agents. Finally, Babak Hodjat, CTO of AI at Cognizant, will talk about how individual agents can evolve into multi-agent ecosystems that manage complex operations across an entire enterprise. Together, the goal of these presentations is to show vivid instances of real-world agents in action, inspiring you to imagine how you might use agents to augment your own abilities and work smarter.</p>
  1341.  
  1342.  
  1343.  
  1344. <p>Democratization of technical capabilities is one of the key benefits of the current sea change ushered in by genAI. We believe that everyone, regardless of their technical background, should have the opportunity to participate in this transformation. Whether you’re new to agents, feel like your experimentation with agents has plateaued, or you just want a measured assessment of the hype, this GenAI Superstream is your chance to get informed, be inspired, and take the first steps toward building your own AI-powered future. We hope you’ll join us.</p>
  1345. ]]></content:encoded>
  1346. <wfw:commentRss>https://www.oreilly.com/radar/everyday-ai-agents/feed/</wfw:commentRss>
  1347. <slash:comments>0</slash:comments>
  1348. </item>
  1349. <item>
  1350. <title>Control Codegen Spend</title>
  1351. <link>https://www.oreilly.com/radar/control-codegen-spend/</link>
  1352. <comments>https://www.oreilly.com/radar/control-codegen-spend/#respond</comments>
  1353. <pubDate>Thu, 09 Oct 2025 11:19:19 +0000</pubDate>
  1354. <dc:creator><![CDATA[Tim O'Brien]]></dc:creator>
  1355. <category><![CDATA[AI & ML]]></category>
  1356. <category><![CDATA[Commentary]]></category>
  1357.  
  1358. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17506</guid>
  1359.  
  1360. <media:content
  1361. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Who-Should-Get-Paid-in-the-Age-of-AI.jpg"
  1362. medium="image"
  1363. type="image/jpeg"
  1364. width="2304"
  1365. height="1792"
  1366. />
  1367.  
  1368. <media:thumbnail
  1369. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Who-Should-Get-Paid-in-the-Age-of-AI-160x160.jpg"
  1370. width="160"
  1371. height="160"
  1372. />
  1373. <custom:subtitle><![CDATA[Use the Right Model—Then Switch Back]]></custom:subtitle>
  1374. <description><![CDATA[This article originally appeared on Medium. Tim O’Brien has given us permission to repost here on Radar. When you’re working with AI tools like Cursor or GitHub Copilot, the real power isn’t just having access to different models—it’s knowing when to use them. Some jobs are OK with Auto. Others need a stronger model. And [&#8230;]]]></description>
  1375. <content:encoded><![CDATA[
  1376. <figure class="wp-block-table"><table class="has-cyan-bluish-gray-background-color has-background has-fixed-layout"><tbody><tr><td><em>This article originally appeared on </em><a href="https://medium.com/@tobrien/control-codegen-spend-use-the-right-model-then-switch-back-cf173753d0ae" target="_blank" rel="noreferrer noopener"><em>Medium</em></a><em>. Tim O’Brien has given us permission to repost here on Radar.</em></td></tr></tbody></table></figure>
  1377.  
  1378.  
  1379.  
  1380. <p>When you’re working with AI tools like Cursor or GitHub Copilot, the real power isn’t just having access to different models—it’s knowing when to use them. Some jobs are OK with Auto. Others need a stronger model. And sometimes you should bail and switch if you continue spending money on a complex problem with a lower-quality model. If you don’t, you’ll waste both time and money.</p>
  1381.  
  1382.  
  1383.  
  1384. <p>And this is the missing discussion in code generation. There are a few “camps” here; the majority of people writing about this appear to view this as a fantastical and fun “vibe coding” experience, and a few people out there are trying to use this technology to deliver real products. If you are in that last category, you’ve probably started to realize that you can spend a <em>fantastic</em> amount of money if you don’t have a strategy for model selection.</p>
  1385.  
  1386.  
  1387.  
  1388. <p>Let’s make it very specific—if you sign up for Cursor and drop $20/month on a subscription using Auto and you are happy with the output, there’s not much to worry about. But if you are starting to run agents in parallel and are paying for token consumption atop a monthly subscription, this post will make sense. In my own experience, a single developer working alone can easily spend $200–$300/day (or four times that figure) if they are trying to tackle a project and have opted for the most expensive model.</p>
  1389.  
  1390.  
  1391.  
  1392. <p><em>And—if you are a company and you give your developers unlimited access to these tools—get ready for some surprises.</em></p>
  1393.  
  1394.  
  1395.  
  1396. <h2 class="wp-block-heading"><strong>My Escalation Ladder for Models…</strong></h2>
  1397.  
  1398.  
  1399.  
  1400. <ol class="wp-block-list">
  1401. <li><strong>Start here: Auto.</strong> Let Cursor route to a strong model with good capacity. If output quality degrades or you get stuck in a loop, escalate. (Cursor explicitly says Auto selects among premium models and will switch when output is degraded.)<br></li>
  1402.  
  1403.  
  1404.  
  1405. <li><strong>Medium-complexity tasks: Sonnet 4/GPT‑5/Gemini.</strong> Use for focused tasks on a handful of files: robust unit tests, targeted refactors, API remodels.<br></li>
  1406.  
  1407.  
  1408.  
  1409. <li><strong>Heavy lift: Sonnet 4 &#8211; 1 million. </strong>If I need to do something that requires more context but still don’t want to pay top dollar, I’ve been starting to move up to models that don’t quickly max out on context.<br></li>
  1410.  
  1411.  
  1412.  
  1413. <li><strong>Ultraheavy lift: Opus 4/4.1.</strong> Use this when the task spans multiple projects or requires long context and careful reasoning, then <strong>switch back</strong> once the big move is done. (Anthropic positions Opus 4 as a deep‑reasoning, long‑horizon model for coding and agent workflows.)</li>
  1414. </ol>
  1415.  
  1416.  
  1417.  
  1418. <p>Auto works fine, but there are times when you can sense that it’s selected the wrong model. If you use these models enough, you can tell when you’re looking at Gemini Pro output by its verbosity, or at the ChatGPT models by the way they go about solving a problem.</p>
  1419.  
  1420.  
  1421.  
  1422. <p>I’ll admit that my heavy and ultraheavy choices here are biased towards the models I’ve had more experience with—your own experience might vary. Still, you should also have a similar escalation list. Start with Auto and only upgrade if you need to; otherwise, you are going to learn some lessons about how much this costs.</p>
  1423.  
  1424.  
  1425.  
  1426. <h2 class="wp-block-heading"><strong>Watch Out for “Thinking” Model Costs</strong></h2>
  1427.  
  1428.  
  1429.  
  1430. <p>Some models support explicit “thinking” (longer reasoning). Useful, but costlier. Cursor’s docs note that enabling thinking on specific Sonnet versions can count as <strong>two requests</strong> under team request accounting, and in the individual plans, the same idea translates to <strong>more tokens</strong> burned. In short, thinking mode is excellent—use it when you need it.</p>
  1431.  
  1432.  
  1433.  
  1434. <p>And when do you need it? My rule of thumb here is that when I already understand what needs to be done, when I’m asking for a unit test to be polished or a method to be implemented in the pattern of another… I usually don’t need a thinking model. On the other hand, if I’m asking it to analyze a problem and propose various options for me to choose from, or (something I do often) when I’m asking it to challenge my decisions and play devil’s advocate, I will pay the premium for the best model.</p>
  1435.  
  1436.  
  1437.  
  1438. <h2 class="wp-block-heading"><strong>Max Mode and When to Use It</strong></h2>
  1439.  
  1440.  
  1441.  
  1442. <p>If you need giant context windows or extended reasoning (e.g., sweeping changes across 20+ files), <strong>Max Mode</strong> can help—but it will consume more usage. Make Max Mode a <strong>temporary tool</strong>, not your default. If you find yourself constantly requiring Max Mode to be turned on, there’s a good chance you are “overapplying” this technology.</p>
  1443.  
  1444.  
  1445.  
  1446. <p>If it needs to consume a million tokens for hours on end? That’s usually a hint that you need another programmer. More on that later, but what I’ve seen too often are managers who think this is like the “vibe coding” they are witnessing. Spoiler alert: Vibe coding is that thing that people do in presentations because it takes five minutes to make a silly video game. It’s 100% not programming, and to use codegen, here’s the secret: You have to understand how to program.</p>
  1447.  
  1448.  
  1449.  
  1450. <p>Max Mode and thinking models are not a shortcut, and neither are they a replacement for good programmers. If you think they are, you are going to be paying top dollar for code that will one day have to be rewritten by a good programmer using these same tools.</p>
  1451.  
  1452.  
  1453.  
  1454. <h2 class="wp-block-heading"><strong>Most Important Tip: Watch Your Bill as It Happens</strong></h2>
  1455.  
  1456.  
  1457.  
  1458. <p>The most important tip is to regularly monitor your utilization and usage fees in Cursor, since they appear within a minute or two of running something. You can see usage by the minute, the number of tokens consumed, and in some cases, how much you’re being charged beyond your subscription. Make a habit of checking a couple of times a day, especially during heavy sessions, and ideally every half hour. This helps you catch runaway costs—like spending $100 an hour—before they get out of hand, which is entirely possible if you’re running many parallel agents or doing resource-intensive work. Paying attention ensures you stay in control of both your usage and your bill.</p>
  1459.  
  1460.  
  1461.  
  1462. <h2 class="wp-block-heading"><strong>Keep Track and Avoid Loops</strong></h2>
  1463.  
  1464.  
  1465.  
  1466. <p>The other thing you need to do is keep track of what works and what doesn’t. Over time, you’ll notice it’s very easy to make mistakes, and the models themselves can sometimes fall into loops. You might give an instruction, and instead of resolving it, the system keeps running the same process again and again. If you’re not paying attention, you can burn through a lot of tokens—and a lot of money—without actually getting sound output. That’s why it’s essential to watch your sessions closely and be ready to interrupt if something looks like it’s stuck.</p>
  1467.  
  1468.  
  1469.  
  1470. <p>Another pitfall is pushing the models beyond their limits. There are tasks they can’t handle well, and when that happens, it’s tempting to keep rephrasing the request and asking again, hoping for a better result. In practice, that often leads to the same cycle of failure, except you’re footing the bill for every attempt. Knowing where the boundaries are and when to stop is critical.</p>
  1471.  
  1472.  
  1473.  
  1474. <p>A practical way to stay on top of this is to maintain a running diary of what worked and what didn’t. Record prompts, outcomes, and notes about efficiency so you can learn from experience instead of repeating expensive mistakes. Combined with keeping an eye on your live usage metrics, this habit will help you refine your approach and avoid wasting both time and money.</p>
  1475. ]]></content:encoded>
  1476. <wfw:commentRss>https://www.oreilly.com/radar/control-codegen-spend/feed/</wfw:commentRss>
  1477. <slash:comments>0</slash:comments>
  1478. </item>
  1479. <item>
  1480. <title>The AI Teaching Toolkit: Practical Guidance for Teams</title>
  1481. <link>https://www.oreilly.com/radar/the-ai-teaching-toolkit-practical-guidance-for-teams/</link>
  1482. <comments>https://www.oreilly.com/radar/the-ai-teaching-toolkit-practical-guidance-for-teams/#respond</comments>
  1483. <pubDate>Wed, 08 Oct 2025 11:12:34 +0000</pubDate>
  1484. <dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
  1485. <category><![CDATA[AI & ML]]></category>
  1486. <category><![CDATA[Commentary]]></category>
  1487.  
  1488. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17503</guid>
  1489.  
  1490. <media:content
  1491. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/AI-Teaching-Toolkit.jpg"
  1492. medium="image"
  1493. type="image/jpeg"
  1494. width="2304"
  1495. height="1792"
  1496. />
  1497.  
  1498. <media:thumbnail
  1499. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/AI-Teaching-Toolkit-160x160.jpg"
  1500. width="160"
  1501. height="160"
  1502. />
  1503. <description><![CDATA[Teaching developers to work effectively with AI means building habits that keep critical thinking active while leveraging AI’s speed. But teaching these habits isn’t straightforward. Instructors and team leads often find themselves needing to guide developers through challenges in ways that build confidence rather than short-circuit their growth. (See “The Cognitive Shortcut Paradox.”) There are [&#8230;]]]></description>
  1504. <content:encoded><![CDATA[
  1505. <p>Teaching developers to work effectively with AI means building habits that keep critical thinking active while leveraging AI’s speed.</p>
  1506.  
  1507.  
  1508.  
  1509. <p>But teaching these habits isn’t straightforward. Instructors and team leads often find themselves needing to guide developers through challenges in ways that build confidence rather than short-circuit their growth. (See “<a href="https://www.oreilly.com/radar/the-cognitive-shortcut-paradox/" target="_blank" rel="noreferrer noopener">The Cognitive Shortcut Paradox</a>.”) There are the regular challenges of working with AI:</p>
  1510.  
  1511.  
  1512.  
  1513. <ul class="wp-block-list">
  1514. <li>Suggestions that look correct while hiding subtle flaws</li>
  1515.  
  1516.  
  1517.  
  1518. <li>Less experienced developers accepting output without questioning it</li>
  1519.  
  1520.  
  1521.  
  1522. <li>AI producing patterns that don’t match the team’s standards</li>
  1523.  
  1524.  
  1525.  
  1526. <li>Code that works but creates long-term maintainability headaches</li>
  1527. </ul>
  1528.  
  1529.  
  1530.  
  1531. <p>The Sens-AI Framework (see “<a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">The Sens-AI Framework: Teaching Developers to Think with AI</a>”) was built to address these problems. It focuses on five habits—context, research, framing, refining, and critical thinking—that help developers use AI effectively while keeping learning and design judgment in the loop.</p>
  1532.  
  1533.  
  1534.  
  1535. <p>This toolkit builds on and reinforces those habits by giving you concrete ways to integrate them into team practices, whether you’re running a workshop, leading code reviews, or mentoring individual developers. The techniques that follow include practical teaching strategies, common pitfalls to avoid, reflective questions to deepen learning, and positive signs that show the habits are sticking.</p>
  1536.  
  1537.  
  1538.  
  1539. <h2 class="wp-block-heading"><strong>Advice for Instructors and Team Leads</strong></h2>
  1540.  
  1541.  
  1542.  
  1543. <p>The strategies in this toolkit can be used in classrooms, review meetings, design discussions, or one-on-one mentoring. They’re meant to help new learners, experienced developers, and teams have more open conversations about design decisions, context, and the quality of AI suggestions. The focus is on making review and questioning feel like a normal, expected part of everyday development.</p>
  1544.  
  1545.  
  1546.  
  1547. <p><strong>Discuss assumptions and context explicitly. </strong>In code reviews or mentoring sessions, ask developers to talk about times when the AI gave them poor or unexpected results. Also try asking them to explain what they think the AI might have needed to know to produce a better answer, and where it might have filled in gaps incorrectly. Getting developers to articulate those assumptions helps spot weak points in design before they’re cemented into the code. (See “<a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>.”)</p>
  1548.  
  1549.  
  1550.  
  1551. <p><strong>Encourage pairing or small-group prompt reviews. </strong>Make AI-assisted development collaborative, not siloed. Have developers on a team or students in a class share their prompts with each other, and talk through why they wrote them a certain way, just like they’d talk through design decisions in pair or mob programming. This helps less experienced developers see how others approach framing and refining prompts.</p>
  1552.  
  1553.  
  1554.  
  1555. <p><strong>Encourage researching idiomatic use of code.</strong> One thing that often holds back intermediate developers is not knowing the idioms of a specific framework or language. AI can help here—if they ask for the <em>idiomatic</em> way to do something, they see not just the syntax but also the patterns experienced developers rely on. That shortcut can speed up their understanding and make them more confident when working with new technologies.</p>
  1556.  
  1557.  
  1558.  
  1559. <p>Here are two examples of how using AI to research idioms can help developers quickly adapt:</p>
  1560.  
  1561.  
  1562.  
  1563. <ul class="wp-block-list">
  1564. <li>A developer with deep experience writing microservices but little exposure to Spring Boot can use AI to see the idiomatic way to annotate a class with <code>@RestController</code> and <code>@RequestMapping</code>. They might also learn that Spring Boot favors constructor injection over field injection with <code>@Autowired</code>, or that <code>@GetMapping("/users")</code> is preferred over <code>@RequestMapping(method = RequestMethod.GET, value = "/users")</code>.</li>
  1565.  
  1566.  
  1567.  
  1568. <li>A Java developer new to Scala might reach for <code>null</code> instead of Scala’s <code>Option</code> types—missing a core part of the language’s design. Asking the AI for the idiomatic approach surfaces not just the syntax but the philosophy behind it, guiding developers toward safer and more natural patterns.</li>
  1569. </ul>
  1570.  
  1571.  
  1572.  
  1573. <p><strong>Help developers recognize rehash loops as meaningful signals. </strong>When the AI keeps circling the same broken idea, even developers who have experienced this many times may not realize they&#8217;re caught in a rehash loop. Teach them to recognize the loop as a signal that the AI has exhausted its context, and that it’s time to step back. That pause can lead to research, reframing the problem, or providing new information. For example, you might stop and say: &#8220;Notice how it’s circling the same idea? That’s our signal to break out.&#8221; Then demonstrate how to reset: open a new session, consult documentation, or try a narrower prompt. (See “<a href="https://www.oreilly.com/radar/understanding-the-rehash-loop/" target="_blank" rel="noreferrer noopener">Understanding the Rehash Loop</a>.”)</p>
  1574.  
  1575.  
  1576.  
  1577. <p><strong>Research beyond AI.</strong> Help developers learn that when hitting walls, they don’t need to just tweak prompts endlessly. Model the habit of branching out: check official documentation, search Stack Overflow, or review similar patterns in your existing codebase. AI should be one tool among many. Showing developers how to diversify their research keeps them from looping and builds stronger problem-solving instincts.</p>
  1578.  
  1579.  
  1580.  
  1581. <p><strong>Use failed projects as test cases. </strong>Bring in previous projects that ran into trouble with AI-generated code and revisit them with Sens-AI habits. Review what went right and wrong, talk about where it might have helped to break out of the vibe coding loop to do additional research, reframe the problem, and apply critical thinking. Work with the team to write down lessons you learned from the discussion. Holding a retrospective exercise like this lowers the stakes—developers are free to experiment and critique without slowing down current work. It’s also a powerful way to show how reframing, refining, and verifying could have prevented past issues. (See “<a href="https://www.oreilly.com/radar/building-ai-resistant-technical-debt/" target="_blank" rel="noreferrer noopener">Building AI-Resistant Technical Debt</a>.”)</p>
  1582.  
  1583.  
  1584.  
  1585. <p><strong>Make refactoring part of the exercise. </strong>Help developers avoid the habit of deciding the code is finished when it runs and seems to work. Have them work with the AI to clean up variable names, reduce duplication, simplify overly complex logic, apply design patterns, and find other ways to prevent technical debt. By making evaluation and improvement explicit, you can help developers build the muscle memory that prevents passive acceptance of AI output. (See “<a href="https://www.oreilly.com/radar/trust-but-verify/" target="_blank" rel="noreferrer noopener">Trust but Verify</a>.”)</p>
  1586.  
  1587.  
  1588.  
  1589. <h2 class="wp-block-heading"><strong>Common Pitfalls to Address with Teams</strong></h2>
  1590.  
  1591.  
  1592.  
  1593. <p>Even with good intentions, teams often fall into predictable traps. Watch for these patterns and address them explicitly, because otherwise they can slow progress and mask real learning.</p>
  1594.  
  1595.  
  1596.  
  1597. <p><strong>The completionist trap: </strong><em>Trying to read every line of AI output even when you’re about to regenerate it.</em> Teach developers it’s okay to skim, spot problems, and regenerate early. This helps them avoid wasting time carefully reviewing code they’ll never use, and reduces the risk of cognitive overload. The key is to balance thoroughness with pragmatism—they can start to learn when detail matters and when speed matters more.</p>
  1598.  
  1599.  
  1600.  
  1601. <p><strong>The perfection loop: </strong><em>Endless tweaking of prompts for marginal improvements.</em> Try setting a limit on iteration—for example, if refining a prompt doesn’t get good results after three or four attempts, it’s time to step back and rethink. Developers need to learn that diminishing returns are a sign to change strategy, not to keep grinding, so energy that should go toward solving the problem doesn’t get lost in chasing minor refinements.</p>
  1602.  
  1603.  
  1604.  
  1605. <p><strong>Context dumping:</strong> <em>Pasting entire codebases into prompts.</em> Teach scoping—What’s the minimum context needed for this specific problem? Help them anticipate what the AI needs, and provide the minimal context required to solve each problem. Context dumping can be especially problematic with limited context windows, where the AI literally can’t see all the code you’ve pasted, leading to incomplete or contradictory suggestions. Teaching developers to be intentional about scope prevents confusion and makes AI output more reliable.</p>
  1606.  
  1607.  
  1608.  
  1609. <p><strong>Skipping the fundamentals: </strong><em>Using AI for extensive code generation before understanding basic software development concepts and patterns.</em> Ensure learners can solve simple development problems on their own (without the help of AI) before accelerating with AI on more complex ones. This helps reduce the risk of developers building a shallow knowledge platform that collapses under pressure. Fundamentals are what allow them to evaluate AI’s output critically rather than blindly trusting it.</p>
  1610.  
  1611.  
  1612.  
  1613. <h2 class="wp-block-heading"><strong>AI Archaeology: A Practical Team Exercise for Better Judgment</strong></h2>
  1614.  
  1615.  
  1616.  
  1617. <p>Have your team do an <strong>AI archaeology</strong> exercise. Take a piece of AI-generated code from the previous week and analyze it together. More complex or nontrivial code samples work especially well because they tend to surface more assumptions and patterns worth discussing.</p>
  1618.  
  1619.  
  1620.  
  1621. <p>Have each team member independently write down their own answers to these questions:</p>
  1622.  
  1623.  
  1624.  
  1625. <ul class="wp-block-list">
  1626. <li>What assumptions did the AI make?</li>
  1627.  
  1628.  
  1629.  
  1630. <li>What patterns did it use?</li>
  1631.  
  1632.  
  1633.  
  1634. <li>Did it make the right decision for our codebase?</li>
  1635.  
  1636.  
  1637.  
  1638. <li>How would you refactor or simplify this code if you had to maintain it long-term?</li>
  1639. </ul>
  1640.  
  1641.  
  1642.  
  1643. <p>Once everyone has had time to write, bring the group back together—either in a room or virtually—and compare answers. Look for points of agreement and disagreement. When different developers spot different issues, that contrast can spark discussion about standards, best practices, and hidden dependencies. Encourage the group to debate respectfully, with an emphasis on surfacing reasoning rather than just labeling answers as right or wrong.</p>
  1644.  
  1645.  
  1646.  
  1647. <p>This exercise makes developers slow down and compare perspectives, which helps surface hidden assumptions and coding habits. By putting everyone’s observations side by side, the team builds a shared sense of what good AI-assisted code looks like.</p>
  1648.  
  1649.  
  1650.  
  1651. <p>For example, the team might discover the AI consistently uses older patterns your team has moved away from or that it defaults to verbose solutions when simpler ones exist. Discoveries like that become teaching moments about your team’s standards and help calibrate everyone’s “code smell” detection for AI output. The retrospective format makes the whole exercise more friendly and less intimidating than real-time critique, which helps to strengthen everyone’s judgment over time.</p>
  1652.  
  1653.  
  1654.  
  1655. <h2 class="wp-block-heading"><strong>Signs of Success</strong></h2>
  1656.  
  1657.  
  1658.  
  1659. <p>Balancing pitfalls with positive indicators helps teams see what good AI practice looks like. When these habits take hold, you’ll notice developers:</p>
  1660.  
  1661.  
  1662.  
  1663. <p><strong>Reviewing AI code with the same rigor as human-written code—but only when appropriate.</strong> When developers stop saying “the AI wrote it, so it must be fine” and start giving AI code the same scrutiny they’d give a teammate’s pull request, it demonstrates that the habits are sticking.</p>
  1664.  
  1665.  
  1666.  
  1667. <p><strong>Exploring multiple approaches instead of accepting the first answer.</strong> Developers who use AI effectively don’t settle for the initial response. They ask the AI to generate alternatives, compare them, and use that exploration to deepen their understanding of the problem.</p>
  1668.  
  1669.  
  1670.  
  1671. <p><strong>Recognizing rehash loops without frustration.</strong> Instead of endlessly tweaking prompts, developers treat rehash loops as signals to pause and rethink. This shows they’re learning to manage AI’s limitations rather than fight against them.</p>
  1672.  
  1673.  
  1674.  
  1675. <p><strong>Sharing “AI gotchas” with teammates.</strong> Developers start saying things like “I noticed Copilot always tries this approach, but here’s why it doesn’t work in our codebase.” These small observations become collective knowledge that helps the whole team work together and with AI more effectively.</p>
  1676.  
  1677.  
  1678.  
  1679. <p><strong>Asking “Why did the AI choose this pattern?” instead of just asking “Does it work?”</strong> This subtle shift shows developers are moving beyond surface correctness to reasoning about design. It’s a clear sign that critical thinking is active.</p>
  1680.  
  1681.  
  1682.  
  1683. <p><strong>Bringing fundamentals into AI conversations:</strong> Developers who are working positively with AI tools tend to relate AI output back to core principles like readability, separation of concerns, or testability. This shows they’re not letting AI bypass their grounding in software engineering.</p>
  1684.  
  1685.  
  1686.  
  1687. <p><strong>Treating AI failures as learning opportunities:</strong> When something goes wrong, instead of blaming the AI or themselves, developers dig into why. Was it context? Framing? A fundamental limitation? This investigative mindset turns problems into teachable moments.</p>
  1688.  
  1689.  
  1690.  
  1691. <h2 class="wp-block-heading"><strong>Reflective Questions for Teams</strong></h2>
  1692.  
  1693.  
  1694.  
  1695. <p>Encourage developers to ask themselves these reflective questions periodically. They slow the process just enough to surface assumptions and spark discussion. You might use them in training, pairing sessions, or code reviews to prompt developers to explain their reasoning. The goal is to keep the design conversation active, even when the AI seems to offer quick answers.</p>
  1696.  
  1697.  
  1698.  
  1699. <ul class="wp-block-list">
  1700. <li><strong>What does the AI need to know to do this well?</strong> (Ask this before writing any prompt.)</li>
  1701.  
  1702.  
  1703.  
  1704. <li><strong>What context or requirements might be missing here?</strong> (Helps catch gaps early.)</li>
  1705.  
  1706.  
  1707.  
  1708. <li><strong>Do you need to pause here and do some research?</strong> (Promotes branching out beyond AI.)</li>
  1709.  
  1710.  
  1711.  
  1712. <li><strong>How might you reframe this problem more clearly for the AI?</strong> (Encourages clarity in prompts.)</li>
  1713.  
  1714.  
  1715.  
  1716. <li><strong>What assumptions are you making about this AI output?</strong> (Surfaces hidden design risks.)</li>
  1717.  
  1718.  
  1719.  
  1720. <li><strong>If you’re getting frustrated, is that a signal to step back and rethink?</strong> (Normalizes stepping away.)</li>
  1721.  
  1722.  
  1723.  
  1724. <li><strong>Would it help to switch from reading code to writing tests to check behavior?</strong> (Shifts the lens to validation.)</li>
  1725.  
  1726.  
  1727.  
  1728. <li><strong>Do these unit tests reveal any design issues or hidden dependencies?</strong> (Connects testing with design insight.)</li>
  1729.  
  1730.  
  1731.  
  1732. <li><strong>Have you tried starting a new chat session or using a different AI tool for this research?</strong> (Models flexibility with tools.)</li>
  1733. </ul>
  1734.  
  1735.  
  1736.  
  1737. <p>The goal of this toolkit is to help developers build the kind of judgment that keeps them confident with AI while still growing their core skills. When teams learn to pause, review, and refactor AI-generated code, they move quickly without losing sight of design clarity or long-term maintainability. These teaching strategies give developers the habits to stay in control of the process, learn more deeply from the work, and treat AI as a true collaborator in building better software. As AI tools evolve, these fundamental habits—questioning, verifying, and maintaining design judgment—will remain the difference between teams that use AI well and those that get used by it.</p>
  1738. ]]></content:encoded>
  1739. <wfw:commentRss>https://www.oreilly.com/radar/the-ai-teaching-toolkit-practical-guidance-for-teams/feed/</wfw:commentRss>
  1740. <slash:comments>0</slash:comments>
  1741. </item>
  1742. <item>
  1743. <title>Radar Trends to Watch: October 2025</title>
  1744. <link>https://www.oreilly.com/radar/radar-trends-to-watch-october-2025/</link>
  1745. <comments>https://www.oreilly.com/radar/radar-trends-to-watch-october-2025/#respond</comments>
  1746. <pubDate>Tue, 07 Oct 2025 11:17:09 +0000</pubDate>
  1747. <dc:creator><![CDATA[Mike Loukides]]></dc:creator>
  1748. <category><![CDATA[Radar Trends]]></category>
  1749. <category><![CDATA[Commentary]]></category>
  1750.  
  1751. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17499</guid>
  1752.  
  1753. <media:content
  1754. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-3.png"
  1755. medium="image"
  1756. type="image/png"
  1757. width="1400"
  1758. height="950"
  1759. />
  1760.  
  1761. <media:thumbnail
  1762. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-3-160x160.png"
  1763. width="160"
  1764. height="160"
  1765. />
  1766. <custom:subtitle><![CDATA[Developments in Programming, Operations, Augmented Reality, and More]]></custom:subtitle>
  1767. <description><![CDATA[This month we have two more protocols to learn. Google has announced the Agent Payments Protocol (AP2), which is intended to help agents to engage in ecommerce—it’s largely concerned with authenticating and authorizing parties making a transaction. And the Agent Client Protocol (ACP) is concerned with communications between code editors and coding agents. When implemented, [&#8230;]]]></description>
  1768. <content:encoded><![CDATA[
  1769. <p>This month we have two more protocols to learn. Google has announced the Agent Payments Protocol (AP2), which is intended to help agents to engage in ecommerce—it’s largely concerned with authenticating and authorizing parties making a transaction. And the Agent Client Protocol (ACP) is concerned with communications between code editors and coding agents. When implemented, it would allow any code editor to plug in any compliant agent.</p>
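<p>As a rough illustration of the editor/agent split that ACP standardizes, here is a minimal sketch of the kind of message exchange involved. To be clear, this is not the actual ACP schema; the method and field names below are invented for illustration only.</p>

<pre class="wp-block-code"><code>import json

# Hypothetical editor-to-agent request in a JSON-RPC style.
# Method and field names are made up for illustration; they are
# not taken from the real Agent Client Protocol specification.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "agent/prompt",  # hypothetical method name
    "params": {
        "sessionId": "demo-session",
        "prompt": "Add a unit test for parse_config()",
        "workspaceRoot": "/home/dev/project",
    },
}

# Hypothetical agent-to-editor reply describing proposed edits.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "status": "completed",
        "edits": [{"path": "tests/test_config.py", "kind": "create"}],
    },
}

# In a real integration these messages would travel over stdio or a socket;
# here we only print their shape.
print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))</code></pre>

<p>Because both sides agree on the message format, any editor that can emit requests of this shape could, in principle, drive any compliant agent.</p>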
  1770.  
  1771.  
  1772.  
  1773. <p>All hasn’t been quiet on the virtual reality front. Meta has announced its new VR/AR glasses, with the ability to display images on the lenses along with capabilities like live captioning for conversations. They’re much less obtrusive than the previous generation of VR goggles.</p>
  1774.  
  1775.  
  1776.  
  1777. <h2 class="wp-block-heading">AI</h2>
  1778.  
  1779.  
  1780.  
  1781. <ul class="wp-block-list">
  1782. <li>Suno has <a href="https://suno.com/studio-welcome" target="_blank" rel="noreferrer noopener">announced</a> an AI-driven digital audio workstation (DAW), a tool for enabling people to be creative with AI-generated music.</li>
  1783.  
  1784.  
  1785.  
  1786. <li>Ollama has added its own <a href="https://docs.ollama.com/web-search" target="_blank" rel="noreferrer noopener">web search API</a>. Ollama’s search API can be used to augment the information available to models. </li>
  1787.  
  1788.  
  1789.  
  1790. <li>GitHub Copilot now offers a command-line tool, <a href="https://github.blog/changelog/2025-09-25-github-copilot-cli-is-now-in-public-preview/" target="_blank" rel="noreferrer noopener">GitHub Copilot CLI</a>. It can use either Claude Sonnet 4 or GPT-5 as the backing model, though other models should be available soon. Claude Sonnet 4 is the default.</li>
  1791.  
  1792.  
  1793.  
  1794. <li>Alibaba has released <a href="https://qwen.ai/blog?id=87dc93fc8a590dc718c77e1f6e84c07b474f6c5a&amp;from=home.latest-research-list" target="_blank" rel="noreferrer noopener">Qwen3-Max</a>, a trillion-plus parameter model. There are reasoning and nonreasoning variants, though the reasoning variant hasn’t yet been released. Alibaba <a href="https://thesequence.substack.com/p/the-sequence-radar-727-qwens-oneweek" target="_blank" rel="noreferrer noopener">also released</a> models for <a href="https://qwen.ai/blog?id=b4264e11fb80b5e37350790121baf0a0f10daf82&amp;from=research.latest-advancements-list" target="_blank" rel="noreferrer noopener">speech-to-text</a>, <a href="https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&amp;from=research.latest-advancements-list" target="_blank" rel="noreferrer noopener">vision-language</a>, <a href="https://qwen.ai/blog?id=4266edf7f3718f2d3fda098b3f4c48f3573215d0&amp;from=home.latest-research-list" target="_blank" rel="noreferrer noopener">live translation</a>, and more. They’ve been busy. </li>
  1795.  
  1796.  
  1797.  
  1798. <li>GitHub has launched its <a href="https://github.blog/ai-and-ml/github-copilot/meet-the-github-mcp-registry-the-fastest-way-to-discover-mcp-servers/" target="_blank" rel="noreferrer noopener">MCP Registry</a> to make it easier to discover MCP servers archived on GitHub. It’s also working with Anthropic and others to build an <a href="https://github.com/modelcontextprotocol/registry/" target="_blank" rel="noreferrer noopener">open source MCP registry</a>, which lists servers regardless of their origin and integrates with GitHub’s registry. </li>
  1799.  
  1800.  
  1801.  
  1802. <li>DeepMind has published <a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf" target="_blank" rel="noreferrer noopener">version 3.0</a> of its <a href="https://deepmind.google/discover/blog/strengthening-our-frontier-safety-framework/" target="_blank" rel="noreferrer noopener">Frontier Safety Framework</a>, a framework for experimenting with AI-human alignment. They’re particularly interested in scenarios where the AI doesn’t follow a user’s directives, and in behaviors that can’t be traced to a specific reasoning chain.</li>
  1803.  
  1804.  
  1805.  
  1806. <li>Alibaba has released the <a href="https://github.com/Alibaba-NLP/DeepResearch" target="_blank" rel="noreferrer noopener">Tongyi DeepResearch</a> reasoning model. Tongyi is a 30.5B parameter mixture-of-experts model, with 3.3B parameters active. More importantly, it’s fully open source, with no restrictions on how it can be used. </li>
  1807.  
  1808.  
  1809.  
  1810. <li><a href="https://locallyai.app/" target="_blank" rel="noreferrer noopener">Locally AI</a> is an iOS app that lets you run large language models on your iPhone or iPad. It works offline; there’s no need for a network connection. </li>
  1811.  
  1812.  
  1813.  
  1814. <li>OpenAI has added <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/chatgpt-now-gives-you-greater-control-over-gpt-5-thinking-model/" target="_blank" rel="noreferrer noopener">control over the “reasoning” process</a> to its GPT-5 models. Users can choose between four levels: Light (Pro users only), Standard, Extended, and Heavy (Pro only). </li>
  1815.  
  1816.  
  1817.  
  1818. <li>Google has announced the <a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol" target="_blank" rel="noreferrer noopener">Agent Payments Protocol</a> (AP2), which facilitates purchases. It focuses on authorization (proving that it has the authority to make a purchase), authentication (proving that the merchant is legitimate), and accountability (in case of a fraudulent transaction).</li>
  1819.  
  1820.  
  1821.  
  1822. <li><a href="https://www.dbreunig.com/2025/09/15/ai-adoption-at-work-play.html" target="_blank" rel="noreferrer noopener">Bring Your Own AI</a>: Employee adoption of AI greatly exceeds official IT adoption. We’ve seen this before, on technologies as different as the iPhone and open source.</li>
  1823.  
  1824.  
  1825.  
  1826. <li>Alibaba has <a href="https://news.smol.ai/issues/25-09-11-qwen3-next/" target="_blank" rel="noreferrer noopener">released</a> the ponderously named <a href="https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&amp;from=research.latest-advancements-list" target="_blank" rel="noreferrer noopener">Qwen3-Next-80B-A3B-Base</a>. It’s a mixture-of-experts model with an unusually low ratio of active parameters to total parameters (3.75%). Alibaba claims that the model cost 1/10 as much to train and is 10 times faster than its previous models. If this holds up, Alibaba is winning on performance where it counts.</li>
  1827.  
  1828.  
  1829.  
  1830. <li>Anthropic has announced a <a href="https://www.anthropic.com/news/create-files" target="_blank" rel="noreferrer noopener">major upgrade to Claude’s capabilities</a>. It can now execute Python scripts in a sandbox and can create Excel spreadsheets, PowerPoint presentations, PNG files, and other documents. You can upload files for it to analyze. And of course this comes with security risks.</li>
  1831.  
  1832.  
  1833.  
  1834. <li>The <a href="https://guides.lib.uchicago.edu/c.php?g=1241077&amp;p=9082322" target="_blank" rel="noreferrer noopener">SIFT</a> method—stop, investigate the source, find better sources, and trace quotes to their original context—is a way of structuring your use of AI output that will make you less vulnerable to misinformation. Hint: it’s not just for AI.</li>
  1835.  
  1836.  
  1837.  
  1838. <li>OpenAI’s <a href="https://help.openai.com/en/articles/10169521-projects-in-chatgpt" target="_blank" rel="noreferrer noopener">Projects</a> feature is now available to <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/chatgpt-makes-projects-feature-free-adds-a-toggle-to-split-chat/" target="_blank" rel="noreferrer noopener">free</a> accounts. Projects is a set of tools for organizing conversations with the LLM. Projects are separate workspaces with their own custom instructions, independent memory, and context. They can be forked. Projects sounds something like Git for LLMs—a set of features that’s badly needed.</li>
  1839.  
  1840.  
  1841.  
  1842. <li><a href="https://developers.googleblog.com/en/introducing-embeddinggemma/" target="_blank" rel="noreferrer noopener">EmbeddingGemma</a> is a new open weights embedding model (308M parameters) that’s designed to run on devices, requiring as little as 200 MB of memory.</li>
  1843.  
  1844.  
  1845.  
  1846. <li>An <a href="https://arstechnica.com/science/2025/09/these-psychological-tricks-can-get-llms-to-respond-to-forbidden-prompts/" target="_blank" rel="noreferrer noopener">experiment</a> with GPT-4o-mini shows that language models can fall prey to psychological manipulation. Is this surprising? After all, they are trained on human output.</li>
  1847.  
  1848.  
  1849.  
  1850. <li>“<a href="https://www.lukew.com/ff/entry.asp?2117" target="_blank" rel="noreferrer noopener">Platform Shifts Redefine Apps</a>”: AI is a new kind of platform and demands rethinking what applications mean and how they should work. Failure to do this rethinking may be why so many AI efforts fail.</li>
  1851.  
  1852.  
  1853.  
  1854. <li><a href="https://mcpui.dev/" target="_blank" rel="noreferrer noopener">MCP-UI</a> is a protocol that allows MCP servers to <a href="https://thenewstack.io/how-mcp-ui-powers-shopifys-new-commerce-widgets-in-agents/" target="_blank" rel="noreferrer noopener">send React components</a> or Web Components to agents, allowing the agent to build an appropriate browser-based interface on the fly.</li>
  1855.  
  1856.  
  1857.  
  1858. <li>The <a href="https://agentclientprotocol.com/overview/introduction" target="_blank" rel="noreferrer noopener">Agent Client Protocol</a> (ACP) is a new protocol that standardizes communications between code editors and coding agents. It’s currently supported by the Zed and Neovim editors, and by the Gemini CLI coding agent.</li>
  1859.  
  1860.  
  1861.  
  1862. <li>Gemini 2.5 Flash is now using a <a href="https://blog.google/products/gemini/updated-image-editing-model/" target="_blank" rel="noreferrer noopener">new image generation model</a> that was internally known as “nano banana.” This new model can edit uploaded images, merge images, and maintain visual consistency across a series of images.</li>
  1863. </ul>
  1864.  
  1865.  
  1866.  
  1867. <h2 class="wp-block-heading">Programming</h2>
  1868.  
  1869.  
  1870.  
  1871. <ul class="wp-block-list">
  1872. <li>Anthropic <a href="https://www.anthropic.com/news/enabling-claude-code-to-work-more-autonomously" target="_blank" rel="noreferrer noopener">released Claude Code 2.0</a>. New features include the ability to checkpoint your work, so that if a coding agent wanders off-course, you can return to a previous state. They have also added the ability to run tasks in the background, call hooks, and use subagents.</li>
  1873.  
  1874.  
  1875.  
  1880. <li>The Wasmer project has <a href="https://wasmer.io/posts/python-on-the-edge-powered-by-webassembly" target="_blank" rel="noreferrer noopener">announced</a> that it now has full Python support in the beta version of Wasmer Edge, its WebAssembly runtime for serverless edge deployment.</li>
  1881.  
  1882.  
  1883.  
  1884. <li>Mitchell Hashimoto, cofounder of HashiCorp, has <a href="https://mitchellh.com/writing/libghostty-is-coming" target="_blank" rel="noreferrer noopener">promised</a> that a library for Ghostty (libghostty) is coming! This library will make it easy to embed a terminal emulator into an application. Perhaps more important, libghostty might standardize the code for terminal output across applications. </li>
  1885.  
  1886.  
  1887.  
  1888. <li>There’s a new benchmark for agentic coding: <a href="https://quesma.com/blog/introducing-compilebench/" target="_blank" rel="noreferrer noopener">CompileBench</a>. CompileBench tests the ability of models to <a href="https://simonwillison.net/2025/Sep/22/compilebench/" target="_blank" rel="noreferrer noopener">solve complex problems in figuring out how to build code</a>. </li>
  1889.  
  1890.  
  1891.  
  1892. <li>Apple is reportedly <a href="https://medium.com/@yashbatra11111/why-apple-is-quietly-rewriting-ios-in-a-language-youve-never-heard-of-2f70181df3bb" target="_blank" rel="noreferrer noopener">rewriting iOS in a new programming language</a>. Rust would be the obvious choice, but rumors are that it’s something of their own creation. Apple likes languages it can control. </li>
  1893.  
  1894.  
  1895.  
  1896. <li><a href="https://www.oracle.com/news/announcement/oracle-releases-java-25-2025-09-16/" target="_blank" rel="noreferrer noopener">Java 25</a>, the latest long-term support release, has a number of new features that <a href="https://thenewstack.io/java-25-oracle-makes-java-easier-to-learn-ready-for-ai-development/" target="_blank" rel="noreferrer noopener">reduce the boilerplate</a> that makes Java difficult to learn. </li>
  1897.  
  1898.  
  1899.  
  1900. <li><a href="https://luau.org/" target="_blank" rel="noreferrer noopener">Luau</a> is a new scripting language derived from Lua. It claims to be fast, small, and safe. It’s backward compatible with Version 5.1 of Lua.</li>
  1901.  
  1902.  
  1903.  
  1904. <li>OpenAI has <a href="https://www.latent.space/p/gpt5-codex" target="_blank" rel="noreferrer noopener">launched</a> <a href="https://openai.com/index/introducing-upgrades-to-codex/" target="_blank" rel="noreferrer noopener">GPT-5 Codex</a>, a version of GPT-5 trained specifically for software engineering. Codex is now available both in the CLI tool and through the API. It’s clearly intended to challenge Anthropic’s dominant coding tool, Claude Code.</li>
  1905.  
  1906.  
  1907.  
  1908. <li>Do prompts belong in code repositories? We’ve argued that prompts should be archived. But <a href="https://towardsdatascience.com/why-your-prompts-dont-belong-in-git/" target="_blank" rel="noreferrer noopener">they don’t belong in a source code repository</a> managed with Git. There are better tools available.</li>
  1909.  
  1910.  
  1911.  
  1912. <li>This is cool and different. A developer has <a href="https://joshfonseca.com/blogs/animal-crossing-llm" target="_blank" rel="noreferrer noopener">hacked</a> the 2001 game <em>Animal Crossing</em> so that the dialog is generated by an LLM rather than coming from the game’s memory.</li>
  1913.  
  1914.  
  1915.  
  1916. <li>There’s a new programming language, vibe-coded in its entirety with Claude. In <a href="https://simonwillison.net/2025/Sep/9/cursed/" target="_blank" rel="noreferrer noopener">Cursed</a>, all the keywords are Gen Z slang. It’s not yet listed, but it would be a worthy addition to <a href="https://esolangs.org/wiki/Main_Page" target="_blank" rel="noreferrer noopener">Esolang</a>. </li>
  1917.  
  1918.  
  1919.  
  1920. <li>Claude Code is now integrated into the Zed editor (beta), using the <a href="https://agentclientprotocol.com/overview/introduction" target="_blank" rel="noreferrer noopener">Agent Client Protocol</a> <a href="https://agentclientprotocol.com/overview/introduction" target="_blank" rel="noreferrer noopener">(ACP)</a>. </li>
  1921.  
  1922.  
  1923.  
  1924. <li>Ida Bechtle’s <a href="https://www.youtube.com/watch?v=GfH4QL4VqJ0" target="_blank" rel="noreferrer noopener">documentary on the history of Python</a>, complete with many interviews with Guido van Rossum, is a must-watch.</li>
  1925. </ul>
  1926.  
  1927.  
  1928.  
  1929. <h2 class="wp-block-heading">Security</h2>
  1930.  
  1931.  
  1932.  
  1933. <ul class="wp-block-list">
  1934. <li>The first <a href="https://www.koi.security/blog/postmark-mcp-npm-malicious-backdoor-email-theft" target="_blank" rel="noreferrer noopener">malicious MCP server</a> has been found in the wild. Postmark-MCP, an MCP server for interacting with the Postmark email service, suddenly (as of version 1.0.16) started sending copies of all the email it handles to its developer.</li>
  1935.  
  1936.  
  1937.  
  1938. <li>I doubt this is the first time, but <a href="https://www.bleepingcomputer.com/news/security/malicious-rust-packages-on-cratesio-steal-crypto-wallet-keys/" target="_blank" rel="noreferrer noopener">supply chain security vulnerabilities have now hit Rust’s package management system, Crates.io</a>. Two packages that steal keys for cryptocurrency wallets have been found. It’s time to be careful about what you download.</li>
  1939.  
  1940.  
  1941.  
  1942. <li><a href="https://embracethered.com/blog/posts/2025/cross-agent-privilege-escalation-agents-that-free-each-other/" target="_blank" rel="noreferrer noopener">Cross-agent privilege escalation</a> is a new kind of vulnerability in which a compromised intelligent agent uses indirect prompt injection to cause a victim agent to overwrite its configuration, granting it additional privileges. </li>
  1943.  
  1944.  
  1945.  
  1946. <li>GitHub is taking a number of measures to <a href="https://www.bleepingcomputer.com/news/security/github-tightens-npm-security-with-mandatory-2fa-access-tokens/" target="_blank" rel="noreferrer noopener">improve software supply chain security</a>, including requiring two-factor authentication (2FA), expanding <a href="https://repos.openssf.org/trusted-publishers-for-all-package-repositories" target="_blank" rel="noreferrer noopener">trusted publishing</a>, and more.</li>
  1947.  
  1948.  
  1949.  
  1950. <li>A compromised npm package uses a <a href="https://www.bleepingcomputer.com/news/security/npm-package-caught-using-qr-code-to-fetch-cookie-stealing-malware/" target="_blank" rel="noreferrer noopener">QR code to encode malware</a>. The malware is apparently encoded in the QR code (which is valid, but too dense to be read by a normal camera), unpacked by the package, and used to steal cookies from the victim’s browser. </li>
  1951.  
  1952.  
  1953.  
  1954. <li>Node.js and its package manager npm have been in the news because of an ongoing series of supply chain attacks. Here’s the <a href="https://www.bleepingcomputer.com/news/security/self-propagating-supply-chain-attack-hits-187-npm-packages/" target="_blank" rel="noreferrer noopener">latest report</a>.</li>
  1955.  
  1956.  
  1957.  
  1958. <li>A <a href="https://blogs.cisco.com/security/detecting-exposed-llm-servers-shodan-case-study-on-ollama" target="_blank" rel="noreferrer noopener">study by Cisco</a> has discovered over a thousand unsecured LLM servers running on Ollama. Roughly 20% were actively serving requests. The rest may have been idle Ollama instances, waiting to be exploited. </li>
  1959.  
  1960.  
  1961.  
  1962. <li>Anthropic has announced that <a href="https://old.reddit.com/r/LocalLLaMA/comments/1n2ubjx/if_you_have_a_claude_personal_account_they_are/" target="_blank" rel="noreferrer noopener">Claude will train on data from personal accounts</a>, effective September 28. This includes Free, Pro, and Max plans. Work plans are exempted. While the company describes training on personal data as opt-in, the setting is (currently) enabled by default, so in practice it’s opt-out.</li>
  1963.  
  1964.  
  1965.  
  1966. <li>We now have “<a href="https://www.bleepingcomputer.com/news/security/malware-devs-abuse-anthropics-claude-ai-to-build-ransomware/" target="_blank" rel="noreferrer noopener">vibe hacking</a>,” the use of AI to develop malware. Anthropic has reported several instances in which Claude was used to create malware that the authors could not have created themselves. Anthropic is banning threat actors and implementing classifiers to detect illegal use.</li>
  1967.  
  1968.  
  1969.  
  1970. <li>Zero trust is basic to modern security. But groups implementing zero trust have to realize that it’s a project that’s <a href="https://www.bleepingcomputer.com/news/security/why-zero-trust-is-never-done-and-is-an-ever-evolving-process/" target="_blank" rel="noreferrer noopener">never finished</a>. Threats change, people change, systems change.</li>
  1971.  
  1972.  
  1973.  
  1974. <li>There’s a new technique for jailbreaking LLMs: write prompts with <a href="https://www.theregister.com/2025/08/26/breaking_llms_for_fun/" target="_blank" rel="noreferrer noopener">bad grammar and run-on sentences</a>. These seem to prevent guardrails from taking effect. </li>
  1975.  
  1976.  
  1977.  
  1978. <li>In an attempt to minimize the propagation of malware on the Android platform, Google <a href="https://www.bleepingcomputer.com/news/security/google-to-verify-all-android-devs-to-block-malware-on-google-play/" target="_blank" rel="noreferrer noopener">plans</a> to block “sideloading” apps for Android devices and require developer ID verification for apps installed through Google Play.</li>
  1979.  
  1980.  
  1981.  
  1982. <li>A <a href="https://research.checkpoint.com/2025/zipline-phishing-campaign/" target="_blank" rel="noreferrer noopener">new phishing attack called ZipLine</a> targets companies using their own “contact us” pages. The attacker then engages in an extended dialog with the company, often posing as a potential business partner, before eventually delivering a malware payload.</li>
  1983. </ul>
  1984.  
  1985.  
  1986.  
  1987. <h2 class="wp-block-heading">Operations</h2>
  1988.  
  1989.  
  1990.  
  1991. <ul class="wp-block-list">
  1992. <li>The <a href="https://blog.google/technology/developers/dora-report-2025/" target="_blank" rel="noreferrer noopener">2025 DORA report</a> is out! DORA may be the most detailed summary of the state of the IT industry. DORA’s authors note that AI is everywhere and that the use of AI now improves end-to-end productivity, something that was ambiguous in last year’s report.</li>
  1993.  
  1994.  
  1995.  
  1996. <li>Microsoft has <a href="https://www.bleepingcomputer.com/news/microsoft/microsoft-word-will-save-your-files-to-the-cloud-by-default/" target="_blank" rel="noreferrer noopener">announced</a> that Word will save files to the cloud (OneDrive) by default. This (so far) appears to apply only when using Windows. The feature is currently in beta.</li>
  1997. </ul>
  1998.  
  1999.  
  2000.  
  2001. <h2 class="wp-block-heading">Web</h2>
  2002.  
  2003.  
  2004.  
  2005. <ul class="wp-block-list">
  2006. <li>Do we need another browser? <a href="https://helium.computer/" target="_blank" rel="noreferrer noopener">Helium</a> is a Chromium-based browser that is private by default. </li>
  2007.  
  2008.  
  2009.  
  2010. <li>Are scientists <a href="https://www.psypost.org/scientists-say-x-formerly-twitter-has-lost-its-professional-edge-and-bluesky-is-taking-its-place/" target="_blank" rel="noreferrer noopener">moving from Twitter to Bluesky</a>?</li>
  2011. </ul>
  2012.  
  2013.  
  2014.  
  2015. <h2 class="wp-block-heading">Virtual and Augmented Reality</h2>
  2016.  
  2017.  
  2018.  
  2019. <ul class="wp-block-list">
  2020. <li>Meta has announced a pair of <a href="https://arstechnica.com/gadgets/2025/09/metas-799-ray-ban-display-is-the-companys-first-big-step-from-vr-to-ar/" target="_blank" rel="noreferrer noopener">augmented reality glasses</a> with a small display on one of the lenses, bringing it to the edge of AR. In addition to displaying apps from your phone, the glasses can do “live captioning” for conversations. The display is controlled by a wristband.</li>
  2021. </ul>
  2022. ]]></content:encoded>
  2023. <wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-october-2025/feed/</wfw:commentRss>
  2024. <slash:comments>0</slash:comments>
  2025. </item>
  2026. <item>
  2027. <title>Mapping the Design Space of AI Coding Assistants</title>
  2028. <link>https://www.oreilly.com/radar/from-autocomplete-to-agents-mapping-the-design-space-of-ai-coding-assistants/</link>
  2029. <comments>https://www.oreilly.com/radar/from-autocomplete-to-agents-mapping-the-design-space-of-ai-coding-assistants/#respond</comments>
  2030. <pubDate>Mon, 06 Oct 2025 11:09:27 +0000</pubDate>
  2031. <dc:creator><![CDATA[Sam Lau and Philip Guo]]></dc:creator>
  2032. <category><![CDATA[AI & ML]]></category>
  2033. <category><![CDATA[Research]]></category>
  2034.  
  2035. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17493</guid>
  2036.  
  2037. <media:content
  2038. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Analysis_of_AI_assistants.jpg"
  2039. medium="image"
  2040. type="image/jpeg"
  2041. width="2304"
  2042. height="1792"
  2043. />
  2044.  
  2045. <media:thumbnail
  2046. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Analysis_of_AI_assistants-160x160.jpg"
  2047. width="160"
  2048. height="160"
  2049. />
  2050. <custom:subtitle><![CDATA[From Autocomplete to Agents: Analyzing 90 Tools from Industry and Academia]]></custom:subtitle>
  2051. <description><![CDATA[Just a few years ago, AI coding assistants were little more than autocomplete curiosities—tools that could finish your variable names or suggest a line of boilerplate. Today, they’ve become an everyday part of millions of developers’ workflows, with entire products and startups built around them. Depending on who you ask, they represent either the dawn [&#8230;]]]></description>
  2052. <content:encoded><![CDATA[
  2053. <p>Just a few years ago, AI coding assistants were little more than autocomplete curiosities—tools that could finish your variable names or suggest a line of boilerplate. Today, they’ve become an everyday part of millions of developers’ workflows, with entire products and startups built around them. Depending on who you ask, they represent either the dawn of a new programming era or the end of programming as we know it. Amid the hype and skepticism, one thing is clear: The landscape of coding assistants is expanding rapidly, and it can be hard to zoom out and see the bigger picture.</p>
  2054.  
  2055.  
  2056.  
  2057. <p>I’m <a href="https://lau.ucsd.edu/" target="_blank" rel="noreferrer noopener">Sam Lau</a> from UC San Diego, and my colleague <a href="https://pg.ucsd.edu/" target="_blank" rel="noreferrer noopener">Philip Guo</a> and I are presenting a <a href="https://lau.ucsd.edu/pubs/2025_analysis-of-90-genai-coding-tools_VLHCC.pdf" target="_blank" rel="noreferrer noopener">research paper</a> at the Visual Languages and Human-Centric Computing conference (VL/HCC) on this very topic. We wanted to know: <strong>How have AI coding assistants evolved over the past few years, and where is the field headed?</strong></p>
  2058.  
  2059.  
  2060.  
  2061. <p>To answer this question, we analyzed <strong>90 AI coding assistants</strong> created between 2021 and 2025: 58 industry products and 32 academic prototypes. Some were widely used commercial assistants, while others were experimental research systems that explored entirely new ways of working with AI. Rather than focusing on who was “best” or which system was most powerful, we took a different approach. We built a <strong>design space framework</strong>: a kind of map that highlights the major choices designers and researchers make when building coding assistants. By comparing industry and academic systems side by side, we hoped to uncover both patterns and blind spots in how these tools are being shaped.</p>
  2062.  
  2063.  
  2064.  
  2065. <p>The result is the first comprehensive snapshot of the space at this critical moment in 2025 when AI coding assistants are starting to mature but their future directions remain very much in flux.</p>
  2066.  
  2067.  
  2068.  
  2069. <p>Here’s a summary of our findings:</p>
  2070.  
  2071.  
  2072.  
  2073. <figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1600" height="1332" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-1600x1332.png" alt="Overview of findings" class="wp-image-17494" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-1600x1332.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-300x250.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-768x640.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-1536x1279.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-2048x1705.png 2048w" sizes="(max-width: 1600px) 100vw, 1600px" /></figure>
  2074.  
  2075.  
  2076.  
  2077. <h2 class="wp-block-heading"><strong>10 Dimensions That Define These Tools</strong></h2>
  2078.  
  2079.  
  2080.  
  2081. <p>What makes one coding assistant feel like a helpful copilot and another feel like a clunky distraction? In our analysis, we identified 10 dimensions of design, grouped into four broad themes:</p>
  2082.  
  2083.  
  2084.  
  2085. <ol class="wp-block-list">
  2086. <li>Interface: How the assistant shows up (inline autocomplete, proactive suggestions, full IDEs).</li>
  2087.  
  2088.  
  2089.  
  2090. <li>Inputs: What you can feed it (text, design files, code analysis, custom project rules).</li>
  2091.  
  2092.  
  2093.  
  2094. <li>Capabilities: What it can do (self-correct, run code, call external tools).</li>
  2095.  
  2096.  
  2097.  
  2098. <li>Outputs: How it delivers results (code blocks, interactive outputs, reasoning traces, references).</li>
  2099. </ol>
  2100.  
  2101.  
  2102.  
  2103. <p>For example, some assistants like GitHub Copilot are optimized for speed and minimal friction: autocomplete a few keystrokes, press tab, keep coding. Academic projects like WaitGPT and DBox are designed for exploration and learning by slowing users down to reflect on trade-offs, offering explanations, or scaffolding programming concepts for beginners. (Links to all 90 projects are in our <a href="https://lau.ucsd.edu/pubs/2025_analysis-of-90-genai-coding-tools_VLHCC.pdf" target="_blank" rel="noreferrer noopener">paper PDF</a>.)</p>
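<p>As a rough sketch (not the coding scheme we used in the paper), here is one way these dimensions could be recorded as data; the field names and example values below simply mirror the four themes described above.</p>

<pre class="wp-block-code"><code>from dataclasses import dataclass, field
from typing import List

# Illustrative only: a toy encoding of where a tool sits in the design space,
# grouped by the four themes (interface, inputs, capabilities, outputs).
@dataclass
class CodingAssistantProfile:
    name: str
    interface: str
    inputs: List[str] = field(default_factory=list)
    capabilities: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)

# A speed-oriented, autocomplete-style assistant.
autocomplete_style = CodingAssistantProfile(
    name="autocomplete-style assistant",
    interface="inline autocomplete",
    inputs=["surrounding code", "text prompts"],
    capabilities=["suggest completions"],
    outputs=["code blocks"],
)

# A more agentic assistant aimed at professional engineers.
agentic_style = CodingAssistantProfile(
    name="agentic assistant",
    interface="terminal or IDE agent",
    inputs=["text prompts", "custom project rules"],
    capabilities=["run code", "self-correct", "call external tools"],
    outputs=["code blocks", "reasoning traces", "references"],
)

print(autocomplete_style)
print(agentic_style)</code></pre>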
  2104.  
  2105.  
  2106.  
  2107. <p>One of the clearest findings from our survey is a split between industry and academia.</p>
  2108.  
  2109.  
  2110.  
  2111. <ul class="wp-block-list">
  2112. <li>Industry products focus on speed, efficiency, and seamless integration. Their pitch is simple: write code faster, with fewer errors. Think of tools like Cursor, Claude Code, or GitHub Copilot, which promise “coding at the speed of thought.”</li>
  2113.  
  2114.  
  2115.  
  2116. <li>Academic prototypes, by contrast, diverge in many directions. Some deliberately slow down the coding process to encourage reflection. Others focus on scaffolding learning for students, supporting accessibility, or enabling entirely new ways of prompting, like letting users sketch a UI instead of writing a text-based prompt.</li>
  2117. </ul>
  2118.  
  2119.  
  2120.  
  2121. <p>This divergence reflects two different priorities: one optimized for productivity in professional software engineering, the other for exploring what programming could be or should be. Both approaches have value, and to us the most interesting question is whether the two cultures might eventually converge, or at least learn from each other.</p>
  2122.  
  2123.  
  2124.  
  2125. <h2 class="wp-block-heading"><strong>Six Personas, Six Ways of Coding with AI</strong></h2>
  2126.  
  2127.  
  2128.  
  2129. <p>Another way to make sense of the space is to ask: Who are these tools really for? We identified six user personas that kept reappearing across systems:</p>
  2130.  
  2131.  
  2132.  
  2133. <ul class="wp-block-list">
  2134. <li>Software engineers, who seek tools to accelerate professional workflows</li>
  2135.  
  2136.  
  2137.  
  2138. <li>HCI researchers and hobbyists, who create prototypes and new ways of working with AI</li>
  2139.  
  2140.  
  2141.  
  2142. <li>UX designers, who use assistants to quickly prototype and iterate on interface ideas</li>
  2143.  
  2144.  
  2145.  
  2146. <li><a href="https://pg.ucsd.edu/publications/conversational-programmers-in-industry_CHI-2016.pdf" target="_blank" rel="noreferrer noopener">Conversational programmers</a>, nontechnical professionals who engage in vibe coding by describing ideas in natural language</li>
  2147.  
  2148.  
  2149.  
  2150. <li>Data scientists, who need explainability and quick iterations on code-driven experiments</li>
  2151.  
  2152.  
  2153.  
  2154. <li>Students learning to code, who benefit from scaffolding, guidance, and explanations</li>
  2155. </ul>
  2156.  
  2157.  
  2158.  
  2159. <p>Each persona requires different designs, which we highlight within our design space. For example, tools designed for software engineers like Claude Code and Aider are integrated into their existing code editors and terminals, support a high degree of customization, and have autonomy to write and run code without human intervention. In contrast, tools for designers like Lovable and Vercel v0 are browser-based and can create applications using a visual mockup like a Figma design file.</p>
  2160.  
  2161.  
  2162.  
  2163. <h2 class="wp-block-heading"><strong>What Comes After Autocomplete, Chat, and Agents?</strong></h2>
  2164.  
  2165.  
  2166.  
  2167. <p>So where does this leave us? Coding assistants are no longer experimental toys. They’re woven into production workflows, classrooms, design studios, and research labs. But their future is far from settled.</p>
  2168.  
  2169.  
  2170.  
  2171. <p>From our perspective, the central challenge is that academia and industry are innovating in parallel yet rarely in conversation with one another. While industry tools optimize for speed, generating lots of code quickly is not the same as building good software. In fact, recent studies have shown that although AI coding assistants are claimed to boost productivity by 10x, the reality so far is closer to incremental improvements. (See <a href="https://addyo.substack.com/p/the-reality-of-ai-assisted-software" target="_blank" rel="noreferrer noopener">Addy Osmani’s recent blog post</a> for a summary.) <strong>What if academia and industry could work together to combine rigorous study of real barriers to productivity with the practical experience of scaling tools in production?</strong> If this could happen, we might move beyond simply making code faster to write toward making software development itself more rapid and sustainable.</p>
  2172.  
  2173.  
  2174.  
  2175. <p>Check out our paper <a href="https://lau.ucsd.edu/pubs/2025_analysis-of-90-genai-coding-tools_VLHCC.pdf" target="_blank" rel="noreferrer noopener">here</a> and email us if you&#8217;d like to discuss anything related to it!</p>
  2176. ]]></content:encoded>
  2177. <wfw:commentRss>https://www.oreilly.com/radar/from-autocomplete-to-agents-mapping-the-design-space-of-ai-coding-assistants/feed/</wfw:commentRss>
  2178. <slash:comments>0</slash:comments>
  2179. </item>
  2180. <item>
  2181. <title>Generative AI in the Real World: Emmanuel Ameisen on LLM Interpretability</title>
  2182. <link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-emmanuel-ameisen-on-llm-interpretability/</link>
  2183. <comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-emmanuel-ameisen-on-llm-interpretability/#respond</comments>
  2184. <pubDate>Thu, 02 Oct 2025 14:31:22 +0000</pubDate>
  2185. <dc:creator><![CDATA[Ben Lorica and Emmanuel Ameisen]]></dc:creator>
  2186. <category><![CDATA[Podcast]]></category>
  2187.  
  2188. <guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=17488</guid>
  2189.  
  2190. <enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3" length="0" type="audio/mpeg" />
  2191. <media:content
  2192. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png"
  2193. medium="image"
  2194. type="image/png"
  2195. width="2560"
  2196. height="2560"
  2197. />
  2198.  
  2199. <media:thumbnail
  2200. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png"
  2201. width="160"
  2202. height="160"
  2203. />
  2204. <description><![CDATA[In this episode, Ben Lorica and Anthropic interpretability researcher Emmanuel Ameisen get into the work Emmanuel’s team has been doing to better understand how LLMs like Claude work. Listen in to find out what they’ve uncovered by taking a microscopic look at how LLMs function—and just how far the analogy to the human brain holds. [&#8230;]]]></description>
  2205. <content:encoded><![CDATA[
  2206. <p>In this episode, Ben Lorica and Anthropic interpretability researcher Emmanuel Ameisen get into the work Emmanuel’s team has been doing to better understand how LLMs like Claude work. Listen in to find out what they’ve uncovered by taking a microscopic look at how LLMs function—and just how far the analogy to the human brain holds.</p>
  2207.  
  2208.  
  2209.  
  2210. <p><strong>About the <em>Generative AI in the Real World </em>podcast</strong>: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>
  2211.  
  2212.  
  2213.  
  2214. <p>Check out <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*m7f70i*_ga*MTYyODYzMzQwMi4xNzU4NTY5ODYz*_ga_092EL089CH*czE3NTkxNzAwODUkbzE0JGcwJHQxNzU5MTcwMDg1JGo2MCRsMCRoMA.." target="_blank" rel="noreferrer noopener">other episodes</a> of this podcast on the O’Reilly learning platform.</p>
  2215.  
  2216.  
  2217.  
  2218. <h2 class="wp-block-heading"><strong>Transcript</strong></h2>
  2219.  
  2220.  
  2221.  
  2222. <p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>
  2223.  
  2224.  
  2225.  
  2226. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=0" target="_blank" rel="noreferrer noopener">00.00</a><br><strong>Today we have Emmanuel Ameisen. He works at Anthropic on interpretability research. And he also authored an O&#8217;Reilly book called </strong><a href="https://learning.oreilly.com/library/view/building-machine-learning/9781492045106/"><strong><em>Building Machine Learning Powered Applications</em></strong></a><strong>. So welcome to the podcast, Emmanuel. </strong></p>
  2227.  
  2228.  
  2229.  
  2230. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=22" target="_blank" rel="noreferrer noopener">00.22</a><br>Thanks, man. I&#8217;m glad to be here. </p>
  2231.  
  2232.  
  2233.  
  2234. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=24" target="_blank" rel="noreferrer noopener">00.24</a><br><strong>As I go through what you and your team do, it&#8217;s almost like biology, right? You&#8217;re studying these models, but increasingly they look like biological systems. Why do you think that&#8217;s useful as an analogy? And am I actually accurate in calling this out?</strong></p>
  2235.  
  2236.  
  2237.  
  2238. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=50" target="_blank" rel="noreferrer noopener">00.50</a><br>Yeah, that&#8217;s right. Our team&#8217;s mandate is to basically understand how the models work, right? And one fact about language models is that they&#8217;re not really written like a program, where somebody sort of by hand described what should happen in that logical branch or this logical branch. Really the way we think about it is they&#8217;re almost grown. But what that means is, they&#8217;re trained over a large dataset, and on that dataset, they learn to adjust their parameters. They have many, many parameters—often, you know, billions—in order to perform well. And so the result of that is that when you get the trained model back, it&#8217;s sort of unclear to you how that model does what it does, because all you&#8217;ve done to create it is show it tasks and have it improve at how it does these tasks.</p>
  2239.  
  2240.  
  2241.  
  2242. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=108" target="_blank" rel="noreferrer noopener">01.48</a><br>And so it feels similar to biology. I think the analogy is apt because for analyzing this, you kind of resort to the tools that you would use in that context, where you try to look inside the model [and] see which parts seem to light up in different contexts. You poke and prod in different parts to try to see, “Ah, I think this part of the model does this.” If I just turn it off, does the model stop doing the thing that I think it&#8217;s doing? It&#8217;s very much not what you would do in most cases if you were analyzing a program, but it is what you would do if you&#8217;re trying to understand how a mouse works. </p>
  2243.  
  2244.  
  2245.  
  2246. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=142" target="_blank" rel="noreferrer noopener">02.22</a><br><strong>You and your team have discovered surprising ways as to how these models do problem-solving, the strategies they employ. What are some examples of these surprising problem-solving patterns? </strong></p>
  2247.  
  2248.  
  2249.  
  2250. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=160" target="_blank" rel="noreferrer noopener">02.40</a><br>We&#8217;ve spent a bunch of time studying these models. And again I should say, whether it&#8217;s surprising or not depends on what you were expecting. So maybe there&#8217;s a few ways in which they’re surprising. </p>
  2251.  
  2252.  
  2253.  
  2254. <p>There&#8217;s various bits of common knowledge about, for example, how models predict one token at a time. And it turns out if you actually look inside the model and try to see how it&#8217;s sort of doing its job of predicting text, you&#8217;ll find that actually a lot of the time it&#8217;s predicting multiple tokens ahead of time. It&#8217;s sort of deciding what it&#8217;s going to say in a few tokens and presumably in a few sentences to decide what it says now. That might be surprising to people who have heard that [models] are predicting one token at a time.&nbsp;</p>
  2255.  
  2256.  
  2257.  
  2258. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=208" target="_blank" rel="noreferrer noopener">03.28</a><br>Maybe another one that&#8217;s sort of interesting to people is that if you look inside these models and you try to understand what they represent in their artificial neurons, you’ll find that there are general concepts they represent.</p>
  2259.  
  2260.  
  2261.  
  2262. <p>So one example I like is you can say, “Somebody is tall,” and then, inside the model, you can find neurons activating for the concept of something being tall. And you can have them all read the same text, but translated in French: “Quelqu&#8217;un est grand.” And then you&#8217;ll find the same neurons that represent the concept of somebody being tall or active.</p>
  2263.  
  2264.  
  2265.  
  2266. <p>So you have these concepts that are shared across languages and that the model represents in one way, which is again, maybe surprising, maybe not surprising, in the sense that that&#8217;s clearly the optimal thing to do, or that&#8217;s the way that.&nbsp;.&nbsp;. You don&#8217;t want to repeat all of your concepts; like in your brain, you don&#8217;t want to have a separate French brain, an English brain, ideally. But surprising if you think that these models are mostly doing pattern matching. Then it is surprising that, when they&#8217;re processing English text or French text, they&#8217;re actually using the same representations rather than leveraging different patterns.&nbsp;</p>
  2267.  
  2268.  
  2269.  
  2270. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=281" target="_blank" rel="noreferrer noopener">04.41</a><br><strong>[In] the text you just described, is there a material difference between the reasoning and nonreasoning models? </strong></p>
  2271.  
  2272.  
  2273.  
  2274. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=291" target="_blank" rel="noreferrer noopener">04.51</a><br>We haven&#8217;t studied that in depth. I will say that the thing that&#8217;s interesting about reasoning models is that when you ask them a question, instead of answering right away, they spend a while writing some text thinking through the problem, oftentimes using math or code. You know, trying to think: “Ah, well, maybe this is the answer. Let me try to prove it. Oh no, it&#8217;s wrong.” And so they&#8217;ve proven to be good at a variety of tasks that models which immediately answer aren’t good at. </p>
  2275.  
  2276.  
  2277.  
  2278. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=322" target="_blank" rel="noreferrer noopener">05.22</a><br>And one thing that you might think if you look at reasoning models is that you could just read their reasoning and you would understand how they think. But it turns out that one thing that we did find is that you can look at a model&#8217;s reasoning, that it writes down, that it samples, the text it&#8217;s writing, right? It&#8217;s saying, “I&#8217;m now going to do this calculation,” and in some cases when for example, the calculation is too hard, if at the same time you look inside the model&#8217;s brain inside its weights, you&#8217;ll find that actually it could be lying to you.</p>
  2279.  
  2280.  
  2281.  
  2282. <p>It&#8217;s not at all doing the math that it says it&#8217;s doing. It&#8217;s just kind of doing its best guess. It&#8217;s taking a stab at it, just based on either context clues from the rest or what it thinks is probably the right answer—but it&#8217;s totally not doing the computation. And so one thing that we found is that you can&#8217;t quite always trust the reasoning that is output by reasoning models.</p>
  2283.  
  2284.  
  2285.  
  2286. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=379" target="_blank" rel="noreferrer noopener">06.19</a><br><strong>Obviously one of the frequent complaints is around hallucination. So based on what you folks have been learning, are we getting close to a, I guess, much more principled mechanistic explanation for hallucination at this point? </strong></p>
  2287.  
  2288.  
  2289.  
  2290. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=399" target="_blank" rel="noreferrer noopener">06.39</a><br>Yeah. I mean, I think we&#8217;re making progress. We study that in our recent paper, and we found something that&#8217;s pretty neat. So hallucinations are cases where the model will confidently say something that&#8217;s wrong. You might ask the model about some person. You&#8217;ll say, “Who&#8217;s Emmanuel Ameisen?” And it&#8217;ll be like “Ah, it’s the famous basketball player” or something. So it will say something when instead it should have said, “I don&#8217;t quite know. I&#8217;m not sure who you&#8217;re talking about.” And we looked inside the model’s neurons while it&#8217;s processing these kinds of questions, and we did a simple test: We asked the model, “Who&#8217;s Michael Jordan?” And then we made up some name. We asked it, “Who&#8217;s Michael Batkin?” (which it doesn&#8217;t know).</p>
  2291.  
  2292.  
  2293.  
  2294. <p>And if you look inside there&#8217;s something really interesting that happens, which is that basically these models by default—because they&#8217;ve been trained to try not to hallucinate—they have this default set of neurons that is just: If you ask me about anyone, I&#8217;ll just say no. I&#8217;ll just say, “I don&#8217;t know.” And the way that the models actually choose to answer is if you mentioned somebody famous enough, like Michael Jordan, there&#8217;s neurons for like, “Oh, this person is famous; I definitely know them” that activate and that turns off the neurons that were going to promote the answer for, “Hey, I&#8217;m not too sure.” And so that&#8217;s why the model answers in the Michael Jordan case. And that&#8217;s why it doesn&#8217;t answer by default in the Michael Batkin case.</p>
  2295.  
  2296.  
  2297.  
  2298. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=489" target="_blank" rel="noreferrer noopener">08.09</a><br>But what happens if instead you now force the neurons for “Oh, this is a famous person” to turn on even when the person isn’t famous? The model is just going to answer the question. And in fact, what we found is in some hallucination cases, this is exactly what happens. It&#8217;s that basically there&#8217;s a separate part of the model&#8217;s brain, essentially, that&#8217;s making the determination of “Hey, do I know this person or not?” And then that part can be wrong. And if it&#8217;s wrong, the model&#8217;s just going to go on and yammer about that person. And so it&#8217;s almost like you have a split mechanism here, where, “Well I guess the part of my brain that&#8217;s in charge of telling me I know says, ‘I know.’ So I&#8217;m just gonna go ahead and say stuff about this person.” And that’s, at least in some cases, how you get a hallucination. </p>
  2299.  
  2300.  
  2301.  
  2302. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=534" target="_blank" rel="noreferrer noopener">08.54</a><br><strong>That&#8217;s interesting because a person would go, “I know this person. Yes, I know this person.” But then if you actually don&#8217;t know this person, you have nothing more to say, right? It&#8217;s almost like you forget. Okay, so I&#8217;m supposed to know Emmanuel, but I guess I don&#8217;t have anything else to say. </strong></p>
  2303.  
  2304.  
  2305.  
  2306. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=555" target="_blank" rel="noreferrer noopener">09.15</a><br>Yeah, exactly. So I think the way I&#8217;ve thought about it is there&#8217;s definitely a part of my brain that feels similar to this thing, where you might ask me, you know, “Who was the actor in the second movie of that series?” and I know I know; I just can&#8217;t quite recall it at the time. Like, “Ah, you know, this is how they look; they were also in that other movie”—but I can&#8217;t think of the name. But the difference is, if that happens, I&#8217;m going to say, “Well, listen, man, I think I know, but at the moment I just can&#8217;t quite recall it.” Whereas the models are like, “I think I know.” And so I guess I&#8217;m just going to say stuff. It&#8217;s not that the “Oh, I know” [and] “I don&#8217;t know” parts [are] separate. That&#8217;s not the problem. It&#8217;s that they don&#8217;t catch themselves sometimes early enough like you would, where, to your point exactly, you&#8217;d just be like, “Well, look, I think I know who this is, but honestly at this moment, I can&#8217;t really tell you. So let&#8217;s move on.” </p>
  2307.  
  2308.  
  2309.  
  2310. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=610" target="_blank" rel="noreferrer noopener">10.10</a><br><strong>By the way, this is part of a bigger topic now in the AI space around reliability and predictability, the idea being, I can have a model that&#8217;s 95% [or] 99% accurate. And if I don&#8217;t know when the 5% or the 1% is inaccurate, it&#8217;s quite scary. Right? So I&#8217;d rather have a model that&#8217;s 60% accurate, but I know exactly when that 60% is. </strong></p>
  2311.  
  2312.  
  2313.  
  2314. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=645" target="_blank" rel="noreferrer noopener">10.45</a><br>Models are getting better at hallucinations for that reason. That&#8217;s pretty important. People are training them to just be better calibrated. If you look at the rates of hallucinations for most models today, they&#8217;re so much lower than the previous models. But yeah, I agree. And I think in a sense maybe like there&#8217;s a hard question there, which is at least in some of these examples that we looked at, it&#8217;s not necessarily that, insofar as what we&#8217;ve seen, that you can clearly see just from looking at the inside of the model, oh, the model is hallucinating. What we can see is the model thinks it knows who this person is, and then it&#8217;s saying some stuff about this person. And so I think the key bit that would be interesting to do future work on is then try to understand, well, when it&#8217;s saying things about people, when it&#8217;s saying, you know, this person won this championship or whatever, is there a way there that we can kind of tell whether those are real facts or those are sort of confabulated in a way? And I think that&#8217;s still an active area of research. </p>
  2315.  
  2316.  
  2317.  
  2318. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=711" target="_blank" rel="noreferrer noopener">11.51</a><br><strong>So in the case where you hook up Claude to web search, presumably there&#8217;s some sort of citation trail where at least you can check, right? The model is saying it knows Emmanuel and then says who Emmanuel is and gives me a link. I can check, right? </strong></p>
  2319.  
  2320.  
  2321.  
  2322. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=788" target="_blank" rel="noreferrer noopener">13:08</a><br>And you know, as you think about these things, I think you get into effort-level questions, where right now, if you go to Claude, there&#8217;s a research mode where you can send it off on a quest and it&#8217;ll do research for a long time. It&#8217;ll cross-reference tens and tens and tens of sources.</p>
  2323.  
  2324.  
  2325.  
  2326. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=770" target="_blank" rel="noreferrer noopener">12.50</a><br><strong>Case in point: science. There&#8217;s tons and tons of scientific papers now that get retracted. So just because it does a web search, what it should do is also cross-verify that search with whatever database there is for retracted papers.</strong></p>
  2327.  
  2328.  
  2329.  
  2330. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=788" target="_blank" rel="noreferrer noopener">13:08</a><br>And you know, as you think about these things, I think you get an answer like effort-level questions where right now, if you go to Claude, there&#8217;s a research mode where you can send it off on a quest and it&#8217;ll do research for a long time. It&#8217;ll cross-reference tens and tens and tens of sources.</p>
  2331.  
  2332.  
  2333.  
  2334. <p>But that will take I don&#8217;t know, it depends. Sometimes 10 minutes, sometimes 20 minutes. And so there&#8217;s a question like, when you&#8217;re asking, “Should I buy these running shoes?” you don&#8217;t care, [but] when you&#8217;re asking about something serious or you&#8217;re going to make an important life decision, maybe you do. I always feel like as the models get better, we also want them to get better at knowing when they should spend 10 seconds or 10 minutes on something.&nbsp;</p>
  2335.  
  2336.  
  2337.  
  2338. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=827" target="_blank" rel="noreferrer noopener">13.47</a><br><strong>There&#8217;s a surprising and growing number of people who go to these models to ask for help with medical questions. And as anyone who uses these models knows, a lot of it comes down to your problem, right? A neurosurgeon will prompt this model about brain surgery very differently than you and me, right? </strong></p>
  2339.  
  2340.  
  2341.  
  2342. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=848" target="_blank" rel="noreferrer noopener">14.08</a><br>Of course. In fact, that was one of the cases that we studied actually, where we prompted the model with a case that&#8217;s similar to one that a doctor would see. Not in the language that you or I would use, but in the sort of language like “This patient is age 35, presenting symptoms A, B, and C,” because we wanted to try to understand how the model arrives at an answer. And so the question had all these symptoms. And then we asked the model, “Based on all these symptoms, answer in only one word: What other tests should we run?” Just to force it to do all of its reasoning in its head, so it can&#8217;t write anything down. </p>
  2343.  
  2344.  
  2345.  
  2346. <p>And what we found is that there were groups of neurons that were activating for each of the symptoms. And then there were two different groups of neurons that were activating for two potential diagnoses, two potential diseases. And then those were promoting a specific test to run, which is sort of what a practitioner would call a differential diagnosis: The person either has A or B, and you want to run a test to know which one it is. And then the model suggested the test that would help you decide between A and B. And I found that quite striking because I think, again, setting aside the question of reliability for a second, there&#8217;s a depth of richness to the model&#8217;s internal representations as it does all of this just to answer in one word.&nbsp;</p>
  2347.  
  2348.  
  2349.  
  2350. <p>This makes me excited about continuing down this path of trying to understand the model, like the model&#8217;s done a full round of diagnosing someone and proposing something to help with the diagnosis just in one forward pass in its head. As we use these models in a bunch of places, I really want to understand all of the complex behavior like this that happens in its weights.&nbsp;</p>
  2351.  
  2352.  
  2353.  
  2354. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=961" target="_blank" rel="noreferrer noopener">16.01</a><br><strong>In traditional software, we have debuggers and profilers. Do you think, as interpretability matures, our tools for building AI applications could have the equivalent of debuggers that flag when a model is going off the rails?</strong></p>
  2355.  
  2356.  
  2357.  
  2358. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=984" target="_blank" rel="noreferrer noopener">16.24</a><br>Yeah. I mean, that&#8217;s the hope. I think debuggers are a good comparison actually, because debuggers mostly get used by the person building the application. If I go to, I don&#8217;t know, claude.ai or something, I can&#8217;t really use the debugger to understand what&#8217;s going on in the backend. And so that&#8217;s the first stage of debuggers, and the people building the models use it to understand the models better. We&#8217;re hoping that we&#8217;re going to get there at some point. We&#8217;re making progress. I don&#8217;t want to be too optimistic, but I think we&#8217;re on a path here. With this work I’ve been describing, the vision was to build this big microscope, basically, where the model is doing something, it&#8217;s answering a question, and you just want to look inside. And just like a debugger will show you basically the states of all of the variables in your program, we want to see the state of all of the neurons in this model.</p>
  2359.  
  2360.  
  2361.  
  2362. <p>It&#8217;s like, okay. The “I definitely know this person” neuron is on and the “This person is a basketball player” neuron is on—that&#8217;s kind of interesting. How do they affect each other? <em>Should</em> they affect each other in that way? So I think in many ways we&#8217;re sort of getting to something close where at least you can inspect the execution of your running program like you would with a debugger. You&#8217;re inspecting the execution of the language model.&nbsp;</p>
  2363.  
  2364.  
  2365.  
  2366. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1066" target="_blank" rel="noreferrer noopener">17.46</a><br>Of course, then there&#8217;s a question of, What do you do with it? That I think is another active area of research where, if you spend some time looking at your debugger, you can say, “Ah, okay, I get it. I initialized this variable the wrong way. Let me fix it.”</p>
  2367.  
  2368.  
  2369.  
  2370. <p>We&#8217;re not there yet with models, right? Even if I tell you “This is exactly how this is happening and it’s wrong,” then the way that we make them again is we train them. So really, you have to think, “Ah, can we give it other examples so that <em>it</em> would learn to do it the right way?”&nbsp;</p>
  2371.  
  2372.  
  2373.  
  2374. <p>It&#8217;s almost like we&#8217;re doing neuroscience on a developing child or something. But then our only way to actually improve them is to change the curriculum of their school. So we have to translate from what we saw in their brain to “Maybe they need a little more math. Or maybe they need a little more English class.” I think we&#8217;re on that path. I&#8217;m pretty excited about it.&nbsp;</p>
  2375.  
  2376.  
  2377.  
  2378. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1113" target="_blank" rel="noreferrer noopener">18.33</a><br>We also open-sourced the tools to do this a couple months back. And so, you know, this is something that can now be run on open source models. And people have been doing a bunch of experiments with them, trying to see if they show some of the same behaviors that we saw in the Claude models that we studied. And so I think that also is promising. And there&#8217;s room for people to contribute if they want to. </p>
  2379.  
  2380.  
  2381.  
  2382. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1136" target="_blank" rel="noreferrer noopener">18.56</a><br><strong>Do you folks internally inside Anthropic have special interpretability tools—not that the interpretability team uses but [that] now you can push out to other people in Anthropic as they&#8217;re using these models? I don&#8217;t know what these tools would be. Could be what you describe, some sort of UX or some sort of microscope towards a model. </strong></p>
  2383.  
  2384.  
  2385.  
  2386. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1162" target="_blank" rel="noreferrer noopener">19.22</a><br>Right now we&#8217;re sort of at the stage where the interpretability team is doing most of the microscopic exploration, and we&#8217;re building all these tools and doing all of this research, and it mostly happens on the team for now. I think there&#8217;s a dream and a vision to have this. . . You know, I think the debugger metaphor is really apt. But we&#8217;re still in the early days. </p>
  2387.  
  2388.  
  2389.  
  2390. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1186" target="_blank" rel="noreferrer noopener">19.46</a><br><strong>You used the example earlier [where] the part of the model “That is a basketball player” lights up. Is that what you would call a concept? And from what I understand, you folks have a lot of these concepts. And by the way, is a concept something that you have to consciously identify, or do you folks have an automatic way of, “Here&#8217;s millions and millions of concepts that we&#8217;ve identified and we don&#8217;t have actual names for some of them yet”?</strong></p>
  2391.  
  2392.  
  2393.  
  2394. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1221" target="_blank" rel="noreferrer noopener">20.21</a><br>That&#8217;s right, that&#8217;s right. The latter one is the way to think about it. The way that I like to describe it is basically, the model has a bunch of neurons. And for a second let&#8217;s just imagine that we can make the comparison to the human brain, [which] also has a bunch of neurons.</p>
  2395.  
  2396.  
  2397.  
  2398. <p>Usually it&#8217;s groups of neurons that mean something. So it&#8217;s like I have these five neurons around. That means that the model’s reading text about basketball or something. And so we want to find all of these groups. And the way that we find them basically is in an automated, unsupervised way.</p>
  2399.  
  2400.  
  2401.  
  2402. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1255" target="_blank" rel="noreferrer noopener">20.55</a><br>The way you can think about it, in terms of how we try to understand what they mean, is maybe the same way that you do in a human brain, where if I had full access to your brain, I could record all of your neurons. And [if] I wanted to know where the basketball neuron was, probably what I would do is I would put you in front of a screen and I would play some basketball videos, and I would see which part of your brain lights up, you know? And then I would play some videos of football and I&#8217;d hopefully see some common parts, like the sports part and then the football part would be different. And then I play a video of an apple and then it&#8217;d be a completely different part of the brain. </p>
  2403.  
  2404.  
  2405.  
  2406. <p>And that&#8217;s basically exactly what we do to understand what these concepts mean in Claude: we just run a bunch of text through and see which parts of its weight matrices light up, and that tells us, okay, this is probably the basketball concept.&nbsp;</p>
  2407.  
  2408.  
  2409.  
  2410. <p>The other way we can confirm that we&#8217;re right is just we can then turn it off and see if Claude then stops talking about basketball, for example.</p>
  2411.  
  2412.  
  2413.  
  2414. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1312" target="_blank" rel="noreferrer noopener">21.52</a><br><strong>Does the nature of the neurons change between model generations or between types of models—reasoning, nonreasoning, multimodal, nonmultimodal?</strong></p>
  2415.  
  2416.  
  2417.  
  2418. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1323" target="_blank" rel="noreferrer noopener">22.03</a><br>Yeah. I mean, at the base level all the weights of the model are different, so all of the neurons are going to be different. So the sort of trivial answer to your question [is] yes, everything&#8217;s changed. </p>
  2419.  
  2420.  
  2421.  
  2422. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1334" target="_blank" rel="noreferrer noopener">22.14</a><br><strong>But you know, it&#8217;s kind of like [in] the brain, the basketball concept is close to the Michael Jordan concept.</strong></p>
  2423.  
  2424.  
  2425.  
  2426. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1341" target="_blank" rel="noreferrer noopener">22.21</a><br>Yeah, exactly. There&#8217;s basically commonalities, and you see things like that. We don&#8217;t at all have an in-depth understanding of anything like you&#8217;d have for the human brain, where it’s like “Ah, this is a map of where the concepts are in the model.” However, you do see that, provided that the models are trained on and doing kind of the same “being a helpful assistant” stuff, they&#8217;ll have similar concepts. They’ll all have the basketball concept, and they&#8217;ll have a concept for Michael Jordan. And these concepts will be using similar groups of neurons. So there&#8217;s a lot of overlap between the basketball concept and the Michael Jordan concept. You&#8217;re going to see similar overlap in most models.</p>
  2427.  
  2428.  
  2429.  
  2430. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1383" target="_blank" rel="noreferrer noopener">23.03</a><br><strong>So channeling your previous self, if I were to give you a keynote at a conference and I give you three slides—this is in front of developers, mind you, not ML researchers—what are the one to three things about interpretability research that developers should know about or potentially even implement or do something about today?</strong></p>
  2431.  
  2432.  
  2433.  
  2434. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1410" target="_blank" rel="noreferrer noopener">23.30</a><br>Oh man, it&#8217;s a good question. My first slide would say something like: models, language models in particular, are complicated and interesting, they can be understood, and it&#8217;s worth spending time to understand them. The point here being, we don&#8217;t have to treat them as this mysterious thing. We don&#8217;t have to settle for approximations like “Oh, they&#8217;re just next-token predictors,” or “They’re just pattern matchers,” or “They’re black boxes.” We can look inside, and we can make progress on understanding them, and we can find a lot of rich structure. That would be slide one.</p>
  2435.  
  2436.  
  2437.  
  2438. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1450" target="_blank" rel="noreferrer noopener">24.10</a><br>Slide two would be the stuff that we talked about at the start of this conversation, which would be, “Here&#8217;s three ways your intuitions are wrong.” You know, oftentimes this is, “Look at this example of a model planning many tokens ahead, not just waiting for the next token. And look at this example of the model having these rich representations showing that it&#8217;s sort of like actually doing multistep reasoning in its weights rather than just kind of matching to some training data example.” And then I don&#8217;t know what my third example would be. Maybe this universal language example we talked about. Complicated, interesting stuff. </p>
  2439.  
  2440.  
  2441.  
  2442. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1484" target="_blank" rel="noreferrer noopener">24.44</a><br>And then, three: What can you do about it? That&#8217;s the third slide. It&#8217;s an early research area. There&#8217;s not anything that you can take that will make anything that you&#8217;re building better today. Hopefully if I&#8217;m viewing this presentation in six months or a year, maybe this third slide is different. But for now, that&#8217;s what it is.</p>
  2443.  
  2444.  
  2445.  
  2446. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1501" target="_blank" rel="noreferrer noopener">25.01</a><br>If you&#8217;re interested in this stuff, there are these open source libraries that let you do this tracing on open source models. Just go grab some small open source model, ask it some weird question, and then just look inside its brain and see what happens.</p>
  2447.  
  2448.  
  2449.  
  2450. <p>I think the thing that I respect the most and identify [with] the most about just being an engineer or developer is this willingness, this stubbornness, to understand. Your program has a bug? Like, I&#8217;m going to figure out what it is, and it doesn&#8217;t matter what level of abstraction it&#8217;s at.</p>
  2451.  
  2452.  
  2453.  
  2454. <p>And I would encourage people to use that same level of curiosity and tenacity to look inside these very weird models that are everywhere. Now, those would be my three slides.&nbsp;</p>
  2455.  
  2456.  
  2457.  
  2458. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1549" target="_blank" rel="noreferrer noopener">25.49</a><br><strong>Let me ask a follow up question. As you know, most teams are not going to be doing much pretraining. A lot of teams will do some form of posttraining, whatever that might be—fine-tuning, some form of reinforcement learning for the more advanced teams, a lot of prompt engineering, prompt optimization, prompt tuning, some sort of context grounding like RAG or GraphRAG.</strong></p>
  2459.  
  2460.  
  2461.  
  2462. <p><strong>You know more about how these models work than a lot of people. How would you approach these various things in a toolbox for a team? You’ve got prompt engineering, some fine-tuning, maybe distillation, I don&#8217;t know. So put on your posttraining hat, and based on what you know about interpretability or how these models work, how would you go about, systematically or in a principled way, approaching posttraining?&nbsp;</strong></p>
  2463.  
  2464.  
  2465.  
  2466. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1614" target="_blank" rel="noreferrer noopener">26.54</a><br>Lucky for you, I also used to work on the posttraining team at Anthropic. So I have some experience as well. I think it&#8217;s funny, what I&#8217;m going to say is the same thing I would have said before I studied these model internals, but maybe I&#8217;ll say it in a different way or something. The key takeaway I keep on having from looking at model internals is, “God, there&#8217;s a lot of complexity.” And that means they&#8217;re able to do very complex reasoning just in latent space inside their weights. There&#8217;s a lot of processing that can happen—more than I think most people have an intuition for. And two, that also means that usually, they&#8217;re doing a bunch of different algorithms at once for everything they do.</p>
  2467.  
  2468.  
  2469.  
  2470. <p>So they&#8217;re solving problems in three different ways. And a lot of times, the weird mistakes you might see when you&#8217;re looking at your fine-tuning or just looking at the resulting model come down to, “Ah, well, there are three different ways to solve this thing, and the model just kind of picked the wrong one this time.”&nbsp;</p>
  2471.  
  2472.  
  2473.  
  2474. <p>Because these models are already so complicated, I find that the first thing to do is just pretty much always to build some sort of eval suite. That&#8217;s the thing that people fail at the most. It doesn&#8217;t take that long—it usually takes an afternoon. You just write down 100 examples of what you want and what you don&#8217;t want. And then you can get incredibly far by just prompt engineering and context engineering, or just giving the model the right context.</p>
  2475.  
  2476.  
  2477.  
  2478. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1714" target="_blank" rel="noreferrer noopener">28.34</a><br>That’s my experience, having worked on fine-tuning models: it&#8217;s something you only want to resort to if everything else fails. I mean, it&#8217;s pretty rare that everything else fails, especially with the models getting better. And so, yeah, understanding that, in principle, the models have an immense amount of capacity and it&#8217;s just your job to tease that capacity out is the first thing I would say. Or the second thing, I guess, after just: build some evals.</p>
  2479.  
  2480.  
  2481.  
  2482. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1740" target="_blank" rel="noreferrer noopener">29.00</a><br><strong>And with that, thank you, Emmanuel. </strong></p>
  2483.  
  2484.  
  2485.  
  2486. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1743" target="_blank" rel="noreferrer noopener">29.03</a><br>Thanks, man.</p>
  2487. ]]></content:encoded>
  2488. <wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-emmanuel-ameisen-on-llm-interpretability/feed/</wfw:commentRss>
  2489. <slash:comments>0</slash:comments>
  2490. </item>
  2491. <item>
  2492. <title>The Cognitive Shortcut Paradox</title>
  2493. <link>https://www.oreilly.com/radar/the-cognitive-shortcut-paradox/</link>
  2494. <pubDate>Wed, 01 Oct 2025 11:07:04 +0000</pubDate>
  2495. <dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
  2496. <category><![CDATA[AI & ML]]></category>
  2497. <category><![CDATA[Commentary]]></category>
  2498.  
  2499. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17489</guid>
  2500.  
  2501. <media:content
  2502. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1.jpg"
  2503. medium="image"
  2504. type="image/jpeg"
  2505. width="2304"
  2506. height="1792"
  2507. />
  2508.  
  2509. <media:thumbnail
  2510. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1-160x160.jpg"
  2511. width="160"
  2512. height="160"
  2513. />
  2514. <description><![CDATA[This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI. AI gives novice developers the ability to skip the slow, messy parts of learning. For experienced developers, that can mean getting to a working solution faster. Developers early in their learning path, however, face what I call [&#8230;]]]></description>
  2515. <content:encoded><![CDATA[
  2516. <p class="has-cyan-bluish-gray-background-color has-background"><em>This article is part of a series on the </em><a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener"><em>Sens-AI Framework</em></a><em>—practical habits for learning and coding with AI.</em></p>
  2517.  
  2518.  
  2519.  
  2520. <p>AI gives novice developers the ability to skip the slow, messy parts of learning. For experienced developers, that can mean getting to a working solution faster. Developers early in their learning path, however, face what I call the <strong>cognitive shortcut paradox</strong>: they need coding experience to use AI tools well, because experience builds the judgment required to evaluate, debug, and improve AI-generated code—but leaning on AI too much in those first stages can keep them from ever gaining that experience.</p>
  2521.  
  2522.  
  2523.  
  2524. <p>I saw this firsthand when adapting <em><a href="https://www.oreilly.com/library/view/head-first-c/9781098141776/" target="_blank" rel="noreferrer noopener">Head First C#</a></em> to include AI exercises. The book’s exercises are built to teach specific development concepts like object-oriented programming, separation of concerns, and refactoring. If new learners let AI generate the code before they’ve learned the fundamentals, they miss the problem-solving work that leads to those “aha!” moments where understanding really clicks.</p>
  2525.  
  2526.  
  2527.  
  2528. <p>With AI, it’s easy for new learners to bypass the learning process completely by pasting the exercise instructions into a coding assistant, getting a complete program in seconds, and running it without ever working through the design or debugging. When the AI produces the right output, it feels like progress to the learner. But the goal was never just to have a running program; it was to understand the requirements and craft a solution that reinforced a specific concept or technique that was taught earlier in the book. The problem is that to the novice, the work still looks right—code that compiles and produces the expected results—so the missing skills stay hidden until the gap is too wide to close.</p>
  2529.  
  2530.  
  2531.  
  2532. <p>Evidence is emerging that AI chatbots can boost productivity for experienced workers but have little measurable impact on skill growth for beginners. In practice, the tool that speeds mastery for seniors can slow it for juniors, because it hands over a polished answer before they’ve had the chance to build the skills needed to use that answer effectively.</p>
  2533.  
  2534.  
  2535.  
  2536. <p>The cognitive shortcut paradox isn’t just a classroom issue. In real projects, the most valuable engineering work often involves understanding ambiguous requirements, making architectural calls when nothing is certain, and tracking down the kind of bugs that don’t have obvious fixes. Those abilities come from wrestling with problems that don’t have a quick path to “done.” If developers turn to AI at the first sign of difficulty, they skip the work that builds the pattern recognition and systematic thinking senior engineers depend on.</p>
  2537.  
  2538.  
  2539.  
  2540. <p>Over time, the effect compounds. A new developer might complete early tickets through vibe coding, feel the satisfaction of shipping working code, and gain confidence in their abilities. Months later, when they’re asked to debug a complex system or refactor code they didn’t write, the gap shows. By then, their entire approach to development may depend on AI to fill in every missing piece, making it much harder to develop independent problem-solving skills.</p>
  2541.  
  2542.  
  2543.  
  2544. <p>The cognitive shortcut paradox presents a fundamental challenge for how we teach and learn programming in the AI era. The traditional path of building skills through struggle and iteration hasn&#8217;t become obsolete; it&#8217;s become more critical than ever, because those same skills are what allow developers to use AI tools effectively. The question isn&#8217;t whether to use AI in learning, but how to use it in ways that build rather than bypass the critical thinking abilities that separate effective developers from code generators. This requires a more deliberate approach to AI-assisted development, one that preserves the essential learning experiences while harnessing AI&#8217;s capabilities.</p>
  2545. ]]></content:encoded>
  2546. </item>
  2547. <item>
  2548. <title>The Java Developer’s Dilemma: Part 1</title>
  2549. <link>https://www.oreilly.com/radar/the-java-developers-dilemma-part-1/</link>
  2550. <pubDate>Tue, 30 Sep 2025 11:09:21 +0000</pubDate>
  2551. <dc:creator><![CDATA[Markus Eisele]]></dc:creator>
  2552. <category><![CDATA[AI & ML]]></category>
  2553. <category><![CDATA[Deep Dive]]></category>
  2554.  
  2555. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17484</guid>
  2556.  
  2557. <media:content
  2558. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/A-human-with-a-laptop-races-a-humanoid-robot.jpg"
  2559. medium="image"
  2560. type="image/jpeg"
  2561. width="2304"
  2562. height="1792"
  2563. />
  2564.  
  2565. <media:thumbnail
  2566. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/A-human-with-a-laptop-races-a-humanoid-robot-160x160.jpg"
  2567. width="160"
  2568. height="160"
  2569. />
  2570. <description><![CDATA[This is the first of a three-part series by Markus Eisele. Stay tuned for the follow-up posts. AI is everywhere right now. Every conference, keynote, and internal meeting has someone showing a prototype powered by a large language model. It looks impressive. You ask a question, and the system answers in natural language. But if [&#8230;]]]></description>
  2571. <content:encoded><![CDATA[
  2572. <figure class="wp-block-table"><table class="has-cyan-bluish-gray-background-color has-background has-fixed-layout"><tbody><tr><td><em>This is the first of a three-part series by Markus Eisele. Stay tuned for the follow-up posts.</em></td></tr></tbody></table></figure>
  2573.  
  2574.  
  2575.  
  2576. <p>AI is everywhere right now. Every conference, keynote, and internal meeting has someone showing a prototype powered by a large language model. It looks impressive. You ask a question, and the system answers in natural language. But if you are an enterprise Java developer, you probably have mixed feelings. You know how hard it is to build reliable systems that scale, comply with regulations, and run for years. You also know that what looks good in a demo often falls apart in production. That’s the dilemma we face. How do we make sense of AI and apply it to our world without giving up the qualities that made Java the standard for enterprise software?</p>
  2577.  
  2578.  
  2579.  
  2580. <h2 class="wp-block-heading">The History of Java in the Enterprise</h2>
  2581.  
  2582.  
  2583.  
  2584. <p>Java became the backbone of enterprise systems for a reason. It gave us strong typing, memory safety, portability across operating systems, and an ecosystem of frameworks that codified best practices. Whether you used <a href="https://jakarta.ee/" target="_blank" rel="noreferrer noopener">Jakarta EE</a>, <a href="http://spring.io/" target="_blank" rel="noreferrer noopener">Spring</a>, or later, <a href="https://quarkus.io/" target="_blank" rel="noreferrer noopener">Quarkus</a> and <a href="https://micronaut.io/" target="_blank" rel="noreferrer noopener">Micronaut</a>, the goal was the same: build systems that are stable, predictable, and maintainable. Enterprises invested heavily because they knew Java applications would still be running years later with minimal surprises.</p>
  2585.  
  2586.  
  2587.  
  2588. <p>This history matters when we talk about AI. Java developers are used to deterministic behavior. If a method returns a result, you can rely on that result as long as your inputs are the same. Business processes depend on that predictability. AI does not work like that. Outputs are probabilistic. The same input might give different results. That alone challenges everything we know about enterprise software.</p>
  2589.  
  2590.  
  2591.  
  2592. <h2 class="wp-block-heading">The Prototype Versus Production Gap</h2>
  2593.  
  2594.  
  2595.  
  2596. <p>Most AI work today starts with prototypes. A team connects to an API, wires up a chat interface, and demonstrates a result. Prototypes are good for exploration. They aren’t good for production. Once you try to run them at scale you discover problems.</p>
  2597.  
  2598.  
  2599.  
  2600. <p>Latency is one issue. A call to a remote model may take several seconds. That’s not acceptable in systems where a two-second delay feels like forever. Cost is another issue. Calling hosted models is not free, and repeated calls across thousands of users quickly add up. Security and compliance are even bigger concerns. Enterprises need to know where data goes, how it’s stored, and whether it leaks into a shared model. A quick demo rarely answers those questions.</p>
  2601.  
  2602.  
  2603.  
  2604. <p>The result is that many prototypes never make it into production. The gap between a demo and a production system is large, and most teams underestimate the effort required to close it.</p>
  2605.  
  2606.  
  2607.  
  2608. <h2 class="wp-block-heading">Why This Matters for Java Developers</h2>
  2609.  
  2610.  
  2611.  
  2612. <p>Java developers are often the ones who receive these prototypes and are asked to “make them real.” That means dealing with all the issues left unsolved. How do you handle unpredictable outputs? How do you log and monitor AI behavior? How do you validate responses before they reach downstream systems? These are not trivial questions.</p>
  2613.  
  2614.  
  2615.  
  2616. <p>At the same time, business stakeholders expect results. They see the promise of AI and want it integrated into existing platforms. The pressure to deliver is strong. The dilemma is that we cannot ignore AI, but we also cannot adopt it naively. Our responsibility is to bridge the gap between experimentation and production.</p>
  2617.  
  2618.  
  2619.  
  2620. <h2 class="wp-block-heading">Where the Risks Show Up</h2>
  2621.  
  2622.  
  2623.  
  2624. <p>Let’s make this concrete. Imagine an AI-powered customer support tool. The prototype connects a chat interface to a hosted LLM. It works in a demo with simple questions. Now imagine it deployed in production. A customer asks about account balances. The model hallucinates and invents a number. The system has just broken compliance rules. Or imagine a user submits malicious input and the model responds with something harmful. Suddenly you’re facing a security incident. These are real risks that go beyond “the model sometimes gets it wrong.”</p>
  2625.  
  2626.  
  2627.  
  2628. <p>For Java developers, this is the dilemma. We need to preserve the qualities we know matter: correctness, security, and maintainability. But we also need to embrace a new class of technologies that behave very differently from what we’re used to.</p>
  2629.  
  2630.  
  2631.  
  2632. <h2 class="wp-block-heading">The Role of Java Standards and Frameworks</h2>
  2633.  
  2634.  
  2635.  
  2636. <p>The good news is that the Java ecosystem is already moving to help. Standards and frameworks are emerging that make AI integration less of a wild west. The OpenAI API has become a de facto standard, providing a way to access models in a consistent form, regardless of vendor. That means code you write today won&#8217;t be locked into a single provider. The Model Context Protocol (MCP) is another step, defining how tools and models can interact in a consistent way.</p>
  2637.  
  2638.  
  2639.  
  2640. <p>Frameworks are also evolving. Quarkus has <a href="https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html" target="_blank" rel="noreferrer noopener">extensions for LangChain4j</a>, making it possible to define AI services as easily as you define REST endpoints. Spring has introduced <a href="https://spring.io/projects/spring-ai" target="_blank" rel="noreferrer noopener">Spring AI</a>. These projects bring the discipline of dependency injection, configuration management, and testing into the AI space. In other words, they give Java developers familiar tools for unfamiliar problems.</p>
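<p>To make that concrete, here is a minimal sketch of what an AI service can look like with the Quarkus LangChain4j extension. The interface, prompt text, and method names are illustrative, and the exact annotation packages may differ between versions of quarkus-langchain4j.</p>

<pre class="wp-block-code"><code>import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Illustrative AI service: the framework generates the implementation,
// so calling the model looks like calling any other injected bean.
@RegisterAiService
public interface SupportAssistant {

    @SystemMessage("You are a concise assistant for an internal support team.")
    String answer(@UserMessage String question);
}</code></pre>

<p>Quarkus then lets you inject <code>SupportAssistant</code> like any other CDI bean, which keeps configuration, observability, and testing in the places Java developers already expect them.</p>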
  2641.  
  2642.  
  2643.  
  2644. <h2 class="wp-block-heading">The Standards Versus Speed Dilemma</h2>
  2645.  
  2646.  
  2647.  
  2648. <p>A common argument against Java and enterprise standards is that they move too slowly. The AI world changes every month, with new models and APIs appearing at a pace that no standards body can match. At first glance, it looks like standards are a barrier to progress. The reality is different. In enterprise software, standards are not the anchors holding us back. They’re the foundation that makes long-term progress possible.</p>
  2649.  
  2650.  
  2651.  
  2652. <p>Standards define a shared vocabulary. They ensure that knowledge is transferable across projects and teams. If you hire a developer who knows JDBC, you can expect them to work with any database supported by the driver ecosystem. If you rely on Jakarta REST, you can swap frameworks or vendors without rewriting every service. This is not slow. This is what allows enterprises to move fast without constantly breaking things.</p>
  2653.  
  2654.  
  2655.  
  2656. <p>AI will be no different. Proprietary APIs and vendor-specific SDKs can get you started quickly, but they come with hidden costs. You risk locking yourself in to one provider, or building a system that only a small set of specialists understands. If those people leave, or if the vendor changes terms, you’re stuck. Standards avoid that trap. They make sure that today’s investment remains useful years from now.</p>
  2657.  
  2658.  
  2659.  
  2660. <p>Another advantage is the support horizon. Enterprises don’t think in terms of weeks or hackathon demos. They think in years. Standards bodies and established frameworks commit to supporting APIs and specifications over the long term. That stability is critical for applications that process financial transactions, manage healthcare data, or run supply chains. Without standards, every system becomes a one-off, fragile and dependent on whoever built it.</p>
  2661.  
  2662.  
  2663.  
  2664. <p>Java has shown this again and again. Servlets, CDI, JMS, JPA: These standards secured decades of business-critical development. They allowed millions of developers to build applications without reinventing core infrastructure. They also made it possible for vendors and open source projects to compete on quality, not just lock-in. The same will be true for AI. Emerging efforts like LangChain4j and the Java SDK for the <a href="https://modelcontextprotocol.io/sdk/java/mcp-overview" target="_blank" rel="noreferrer noopener">Model Context Protocol</a> or the <a href="https://github.com/a2aproject/a2a-java" target="_blank" rel="noreferrer noopener">Agent2Agent Protocol SDK</a> will not slow us down. They’ll enable enterprises to adopt AI at scale, safely and sustainably.</p>
  2665.  
  2666.  
  2667.  
  2668. <p>In the end, speed without standards leads to short-lived prototypes. Standards with speed lead to systems that survive and evolve. Java developers should not see standards as a constraint. They should see them as the mechanism that allows us to bring AI into production, where it actually matters.</p>
  2669.  
  2670.  
  2671.  
  2672. <h2 class="wp-block-heading">Performance and Numerics: Java’s Catching Up</h2>
  2673.  
  2674.  
  2675.  
  2676. <p>One more part of the dilemma is performance. Python became the default language for AI not because of its syntax, but because of its libraries. NumPy, SciPy, PyTorch, and TensorFlow all rely on highly optimized C and C++ code. Python is mostly a frontend wrapper around these math kernels. Java, by contrast, has never had numerics libraries of the same adoption or depth. JNI made calling native code possible, but it was awkward and unsafe.</p>
  2677.  
  2678.  
  2679.  
  2680. <p>That is changing. The Foreign Function &amp; Memory (FFM) API (<a href="https://openjdk.org/jeps/454" target="_blank" rel="noreferrer noopener">JEP 454</a>) makes it possible to call native libraries directly from Java without the boilerplate of JNI. It’s safer, faster, and easier to use. This opens the door for Java applications to integrate with the same optimized math libraries that power Python. Alongside FFM, the Vector API (<a href="https://openjdk.org/jeps/508" target="_blank" rel="noreferrer noopener">JEP 508</a>) introduces explicit support for SIMD operations on modern CPUs. It allows developers to write vectorized algorithms in Java that run efficiently across hardware platforms. Together, these features bring Java much closer to the performance profile needed for AI and machine learning workloads.</p>
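<p>As a small illustration of the Vector API, here is a sketch of a SIMD dot product, the kind of inner loop that shows up in embedding similarity and inference code. It assumes a recent JDK with the incubating <code>jdk.incubator.vector</code> module enabled, and it is a sketch rather than a tuned kernel.</p>

<pre class="wp-block-code"><code>import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;

// Sketch of a SIMD dot product with the Vector API (incubating).
// Run with: --add-modules jdk.incubator.vector
public final class DotProduct {

    static float dot(float[] a, float[] b) {
        var species = FloatVector.SPECIES_PREFERRED;
        float sum = 0f;
        int i = 0;
        int upper = species.loopBound(a.length);
        // Vectorized main loop: multiply lanes, then reduce them to a scalar.
        for (; i &lt; upper; i += species.length()) {
            FloatVector va = FloatVector.fromArray(species, a, i);
            FloatVector vb = FloatVector.fromArray(species, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for whatever elements remain.
        for (; i &lt; a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}</code></pre>

<p>The same loop written over plain arrays would run anywhere; the Vector API version simply gives the JIT an explicit shape it can map onto the widest SIMD registers the hardware offers.</p>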
  2681.  
  2682.  
  2683.  
  2684. <p>For enterprise architects, this matters because it changes the role of Java in AI systems. Java is no longer just an orchestration layer that calls external services. With projects like <a href="https://github.com/tjake/Jlama" target="_blank" rel="noreferrer noopener">Jlama</a>, models can run inside the JVM. With FFM and the Vector API, Java can take advantage of native math libraries and hardware acceleration. That means AI inference can move closer to where the data lives, whether in the data center or at the edge, while still benefiting from the standards and discipline of the Java ecosystem.</p>
  2685.  
  2686.  
  2687.  
  2688. <h2 class="wp-block-heading">The Testing Dimension</h2>
  2689.  
  2690.  
  2691.  
  2692. <p>Another part of the dilemma is testing. Enterprise systems are only trusted when they’re tested. Java has a long tradition of unit testing and integration testing, supported by standards and frameworks that every developer knows: JUnit, TestNG, Testcontainers, Jakarta EE testing harnesses, and more recently, <a href="https://quarkus.io/guides/dev-services" target="_blank" rel="noreferrer noopener">Quarkus Dev Services</a> for spinning up dependencies in integration tests. These practices are a core reason Java applications are considered production-grade. Hamel Husain’s work on evaluation frameworks is directly relevant here. He describes three levels of evaluation: unit tests, model/human evaluation, and <a href="https://hamel.dev/blog/posts/evals?utm_source=chatgpt.com" target="_blank" rel="noreferrer noopener">production-facing A/B tests</a>. For Java developers treating models as black boxes, the first two levels map neatly onto our existing practice: unit tests for deterministic components and black-box evaluations with curated prompts for system behavior.</p>
  2693.  
  2694.  
  2695.  
  2696. <p>AI-infused applications bring new challenges. How do you write a unit test for a model that gives slightly different answers each time? How do you validate that an AI component works correctly when the definition of “correct” is fuzzy? The answer is not to give up testing but to extend it.</p>
  2697.  
  2698.  
  2699.  
  2700. <p>At the unit level, you still test deterministic components around the AI service: context builders, data retrieval pipelines, validation, and guardrail logic. These remain classic unit test targets. For the AI service itself, you can use schema validation tests, golden datasets, and bounded assertions. For example, you may assert that the model returns valid JSON, contains required fields, or produces a result within an acceptable range. The exact words may differ, but the structure and boundaries must hold.</p>
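<p>As a rough illustration, here is how such a bounded assertion might look as a plain JUnit 5 test, with Jackson checking the structure of the response. The <code>aiSummarize</code> stub, the field names, and the 0–1 score range are assumptions made for the example, not part of any particular framework.</p>

<pre class="wp-block-code"><code>import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertTrue;

// Bounded-assertion sketch: the exact wording of the model's answer may vary,
// but the structure and the numeric range must hold on every run.
class RiskSummaryContractTest {

    private final ObjectMapper mapper = new ObjectMapper();

    @Test
    void responseIsValidJsonWithRequiredFieldsAndBoundedScore() throws Exception {
        String response = aiSummarize("Customer 42, three late payments in 2024");

        JsonNode json = mapper.readTree(response);               // must parse as JSON
        assertTrue(json.has("riskScore"), "riskScore field missing");
        assertTrue(json.has("summary"), "summary field missing");

        double score = json.get("riskScore").asDouble();
        assertTrue(score &gt;= 0.0 &amp;&amp; score &lt;= 1.0, "riskScore out of range: " + score);
    }

    // Hypothetical stand-in so the sketch is self-contained; a real test would
    // call the injected AI service, ideally against a small golden dataset.
    private String aiSummarize(String caseDescription) {
        return "{\"riskScore\": 0.42, \"summary\": \"Elevated risk from repeated late payments\"}";
    }
}</code></pre>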
  2701.  
  2702.  
  2703.  
  2704. <p>At the integration level, you can bring AI into the picture. Dev Services can spin up a local Ollama container or mock inference API for repeatable test runs. Testcontainers can manage vector databases like PostgreSQL with pgvector or Elasticsearch. Property-based testing libraries such as <a href="https://jqwik.net/" target="_blank" rel="noreferrer noopener">jqwik</a> can generate varied inputs to expose edge cases in AI pipelines. These tools are already familiar to Java developers; they simply need to be applied to new targets.</p>
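<p>Property-based testing fits the deterministic pieces that surround the model. The sketch below uses jqwik to throw arbitrary user input at a hypothetical <code>PromptBuilder</code> and assert that the assembled prompt never exceeds an assumed character budget; the class, the limits, and the system message are illustrative only.</p>

<pre class="wp-block-code"><code>import net.jqwik.api.ForAll;
import net.jqwik.api.Property;
import net.jqwik.api.constraints.StringLength;

// Property-based sketch: whatever the user types, the prompt sent to the
// model must stay within the (assumed) character budget.
class PromptBuilderProperties {

    private static final int MAX_PROMPT_CHARS = 8_000;

    @Property
    boolean promptNeverExceedsBudget(@ForAll @StringLength(max = 5_000) String userInput) {
        String prompt = PromptBuilder.build("You are a concise support assistant.", userInput);
        return prompt.length() &lt;= MAX_PROMPT_CHARS;
    }

    // Hypothetical deterministic component under test: it truncates user input
    // before combining it with the system message.
    static final class PromptBuilder {
        static String build(String systemMessage, String userInput) {
            String trimmed = userInput.length() &gt; 4_000 ? userInput.substring(0, 4_000) : userInput;
            return systemMessage + "\n\n" + trimmed;
        }
    }
}</code></pre>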
  2705.  
  2706.  
  2707.  
  2708. <p>The key insight is that AI testing must complement, not replace, the testing discipline we already have. Enterprises cannot put untested AI into production and hope for the best. By extending unit and integration testing practices to AI-infused components, we give stakeholders the confidence that these systems behave within defined boundaries, even when individual model outputs are probabilistic.</p>
  2709.  
  2710.  
  2711.  
  2712. <p>This is where Java’s culture of testing becomes an advantage. Teams already expect comprehensive test coverage before deploying. Extending that mindset to AI ensures that these applications meet enterprise standards, not just demo requirements. Over time, testing patterns for AI outputs will mature into the same kind of de facto standards that JUnit brought to unit tests and Arquillian brought to integration tests. We should expect evaluation frameworks for AI-infused applications to become as normal as JUnit in the enterprise stack.</p>
  2713.  
  2714.  
  2715.  
  2716. <h2 class="wp-block-heading">A Path Forward</h2>
  2717.  
  2718.  
  2719.  
  2720. <p>So what should we do? The first step is to acknowledge that AI is not going away. Enterprises will demand it, and customers will expect it. The second step is to be realistic. Not every prototype deserves to become a product. We need to evaluate use cases carefully, ask whether AI adds real value, and design with risks in mind.</p>
  2721.  
  2722.  
  2723.  
  2724. <p>From there, the path forward looks familiar. Use standards to avoid lock-in. Use frameworks to manage complexity. Apply the same discipline you already use for transactions, messaging, and observability. The difference is that now you also need to handle probabilistic behavior. That means adding validation layers, monitoring AI outputs, and designing systems that fail gracefully when the model is wrong.</p>
  2725.  
  2726.  
  2727.  
  2728. <p>The Java developer’s dilemma is not about choosing whether to use AI. It’s about how to use it responsibly. We cannot treat AI like a library we drop into an application and forget about. We need to integrate it with the same care we apply to any critical system. The Java ecosystem is giving us the tools to do that. Our challenge is to learn quickly, apply those tools, and keep the qualities that made Java the enterprise standard in the first place.</p>
  2729.  
  2730.  
  2731.  
  2732. <p>This is the beginning of a larger conversation. In the next article we will look at new types of applications that emerge when AI is treated as a core part of the architecture, not just an add-on. That’s where the real transformation happens.</p>
  2733. ]]></content:encoded>
  2734. </item>
  2735. <item>
  2736. <title>Flow State to Free Fall: An AI Coding Cautionary Tale</title>
  2737. <link>https://www.oreilly.com/radar/flow-state-to-free-fall-an-ai-coding-cautionary-tale/</link>
  2738. <pubDate>Mon, 29 Sep 2025 10:59:37 +0000</pubDate>
  2739. <dc:creator><![CDATA[Sreeram Venkatasubramanian]]></dc:creator>
  2740. <category><![CDATA[AI & ML]]></category>
  2741. <category><![CDATA[Commentary]]></category>
  2742.  
  2743. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17481</guid>
  2744.  
  2745. <media:content
  2746. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Robot-free-falling-down-a-waterfall.jpg"
  2747. medium="image"
  2748. type="image/jpeg"
  2749. width="2304"
  2750. height="1792"
  2751. />
  2752.  
  2753. <media:thumbnail
  2754. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Robot-free-falling-down-a-waterfall-160x160.jpg"
  2755. width="160"
  2756. height="160"
  2757. />
  2758. <custom:subtitle><![CDATA[Learning to Hammer Pitons in the Age of AI]]></custom:subtitle>
  2759. <description><![CDATA[When I was eight years old, I watched a mountaineering documentary while waiting for the cricket match to start. I remember being incredibly frustrated watching these climbers inch their way up a massive rock face, stopping every few feet to hammer what looked like giant nails into the mountain. “Why don’t they just climb faster?” [&#8230;]]]></description>
  2760. <content:encoded><![CDATA[
  2761. <p>When I was eight years old, I watched a mountaineering documentary while waiting for the cricket match to start. I remember being incredibly frustrated watching these climbers inch their way up a massive rock face, stopping every few feet to hammer what looked like giant nails into the mountain.</p>
  2762.  
  2763.  
  2764.  
  2765. <p>“Why don’t they just climb faster?” I asked my father. “They&#8217;re wasting so much time with those metal things!”</p>
  2766.  
  2767.  
  2768.  
  2769. <p>“Those are safety anchors, son. If they fall, they don’t want to tumble all the way back to the bottom.”</p>
  2770.  
  2771.  
  2772.  
  2773. <p>I found this logic deeply unsatisfying. Clearly, the solution was simple: don&#8217;t fall. Just climb faster and more carefully.</p>
  2774.  
  2775.  
  2776.  
  2777. <p>Thirty years later, debugging AI-generated code at 2 AM in my Chennai office, I finally understood what those mountaineers were doing.</p>
  2778.  
  2779.  
  2780.  
  2781. <h2 class="wp-block-heading">The Intoxicating Rush of AI-Powered Flow</h2>
  2782.  
  2783.  
  2784.  
  2785. <p>Last month, I was working on a revenue analysis project for my manager—the kind of perfectionist who notices when PowerPoint slides have inconsistent font sizes. The task seemed straightforward: slice and dice our quarterly revenue across multiple dimensions. Normally, this would have been a three-day slog of SQL queries, CSV exports, and fighting with chart libraries.</p>
  2786.  
  2787.  
  2788.  
  2789. <p>But this time, I had my AI assistant. And it was like having a data visualization superhero as my personal coding buddy.</p>
  2790.  
  2791.  
  2792.  
  2793. <p>“Create a stacked bar chart showing quarterly revenue by contract type,” I typed. Thirty seconds later: a beautiful, publication-quality chart.</p>
  2794.  
  2795.  
  2796.  
  2797. <p>I was in what psychologists call “flow state,” supercharged by AI assistance. Chart after chart materialized on my screen. For three glorious hours, I was completely absorbed. I generated seventeen different visualizations, created an interactive dashboard, and even added animated transitions that made the data dance.</p>
  2798.  
  2799.  
  2800.  
  2801. <p>I was so caught up in the momentum that the thought of stopping to commit changes never even crossed my mind. Why interrupt this beautiful flow?</p>
  2802.  
  2803.  
  2804.  
  2805. <p>That should have been my first clue that I was about to learn a very expensive lesson about the value of safety anchors.</p>
  2806.  
  2807.  
  2808.  
  2809. <h2 class="wp-block-heading">When the Mountain Crumbles</h2>
  2810.  
  2811.  
  2812.  
  2813. <p>At 1:47 AM, disaster struck. I asked my AI assistant to “optimize the color palette for color-blind accessibility” across all my charts. It was a reasonable request—the kind of thoughtful enhancement that makes software better.</p>
  2814.  
  2815.  
  2816.  
  2817. <p>What happened next was like watching a controlled demolition, except there was nothing controlled about it.</p>
  2818.  
  2819.  
  2820.  
  2821. <p>The AI didn&#8217;t just change colors. It restructured my entire charting library. It modified the data processing pipeline. It altered the component architecture. It even changed the CSS framework ”for better accessibility compliance.”</p>
  2822.  
  2823.  
  2824.  
  2825. <p>Suddenly, my beautiful dashboard looked like it had been designed by someone having a heated argument with their computer. Charts overlapped, data disappeared, and the color scheme now resembled a medical diagram of various internal organs.</p>
  2826.  
  2827.  
  2828.  
  2829. <p>”No problem,” I thought. ”I&#8217;ll just ask it to undo those changes.”</p>
  2830.  
  2831.  
  2832.  
  2833. <p>This is where I learned that AI assistants, despite their impressive capabilities, have the rollback skills of a three-year-old trying to unscramble an egg.</p>
  2834.  
  2835.  
  2836.  
  2837. <p>I spent the next two hours in what can only be described as a negotiation with a well-meaning but entirely confused digital assistant. By 4 AM, I had given up and reverted to the last committed version of my code—from six hours earlier. Three hours of brilliant AI-generated visualizations vanished into the digital equivalent of that mountainside I would have tumbled down as an impatient eight-year-old.</p>
  2838.  
  2839.  
  2840.  
  2841. <h2 class="wp-block-heading">The Wisdom of Slow Climbing</h2>
  2842.  
  2843.  
  2844.  
  2845. <p>The next morning, over coffee and the particular kind of wisdom that comes from watching your colleague&#8217;s spectacular failure, my teammate Mohan delivered his verdict.</p>
  2846.  
  2847.  
  2848.  
  2849. <p>“You know what you did wrong?” he said. “You forgot to use pitons.”</p>
  2850.  
  2851.  
  2852.  
  2853. <p>“Pitons?”</p>
  2854.  
  2855.  
  2856.  
  2857. <p>“Like mountain climbers. They hammer those metal spikes into the rock every few feet and attach their safety rope. If they fall, they only drop back to the last piton, not all the way to the bottom.”</p>
  2858.  
  2859.  
  2860.  
  2861. <p>“Your pitons are your commits, your tests, your version control. Every time you get a working feature, you hammer in a piton. Test it, commit it, make sure you can get back to that exact spot if something goes wrong.”</p>
  2862.  
  2863.  
  2864.  
  2865. <p>“But the AI was so fast,” I protested. “Stopping to commit felt like it would break my flow.”</p>
  2866.  
  2867.  
  2868.  
  2869. <p>“Flow is great until you flow right off a cliff,” Mohan replied. “The AI doesn&#8217;t understand your safety rope. It just keeps climbing higher and higher, making bigger and bigger changes. You&#8217;re the one who has to decide when to stop and secure your position.”</p>
  2870.  
  2871.  
  2872.  
  2873. <p>As much as I hated to admit it, Mohan was right. I had been so mesmerized by the AI&#8217;s speed that I had abandoned every good software engineering practice I knew. No incremental commits, no systematic testing, no architectural planning—just pure, reckless velocity.</p>
  2874.  
  2875.  
  2876.  
  2877. <h2 class="wp-block-heading">The Art of Strategic Impatience</h2>
  2878.  
  2879.  
  2880.  
  2881. <p>But this isn&#8217;t just about my late-night coding disaster. This challenge is baked into how AI assistants work.</p>
  2882.  
  2883.  
  2884.  
  2885. <p>AI assistants are incredibly good at making us feel productive. They generate code so quickly and confidently that it&#8217;s easy to mistake output for outcomes. But productivity without sustainability is just a fancy way of creating technical debt.</p>
  2886.  
  2887.  
  2888.  
  2889. <p>This isn&#8217;t an argument against AI-assisted development—it&#8217;s an argument for getting better at it. The mountaineers in that documentary weren&#8217;t slow because they were incompetent; they were methodical because they understood the consequences of failure.</p>
  2890.  
  2891.  
  2892.  
  2893. <p>The AI doesn&#8217;t care about your codebase either. It doesn&#8217;t understand your architecture, your business constraints, or your technical debt. It&#8217;s a powerful tool, but it&#8217;s not a substitute for engineering judgment. And engineering judgment, it turns out, is largely about knowing when to slow down.</p>
  2894.  
  2895.  
  2896.  
  2897. <p>Which brings us back to those mountaineers and their methodical approach. In my revenue dashboard disaster, I was going incredibly fast, but I ended up arriving at the same place I started, six hours later and significantly more exhausted. The irony is that if I had spent 15 minutes every hour committing working code and running tests, I would have finished the project faster, not slower.</p>
  2898.  
  2899.  
  2900.  
  2901. <p>My experience isn&#8217;t unique. Across the industry, developers are discovering that AI-powered productivity comes with hidden costs.</p>
  2902.  
  2903.  
  2904.  
  2905. <h2 class="wp-block-heading">The Future Is Methodical</h2>
  2906.  
  2907.  
  2908.  
  2909. <p>We&#8217;re living through the most significant shift in software development productivity since the invention of high-level programming languages. AI assistants are genuinely transformative tools that can accelerate development in ways that seemed impossible just a few years ago.</p>
  2910.  
  2911.  
  2912.  
  2913. <p>But they don&#8217;t eliminate the need for good engineering practices; they make those practices more important. The faster you can generate code, the more crucial it becomes to have reliable ways of validating, testing, and versioning that code. This might disappoint the eight-year-old in all of us who just wants to climb faster. But it should encourage the part of us that wants to actually reach the summit. Building software with AI assistance is a high-risk activity. You&#8217;re generating code faster than you can fully understand it, integrating libraries you didn&#8217;t choose, and implementing patterns you might not have had time to fully vet.</p>
  2914.  
  2915.  
  2916.  
  2917. <p>In that environment, safety anchors aren&#8217;t overhead—they&#8217;re essential infrastructure. The future of AI-assisted development isn&#8217;t about eliminating the methodical practices that make software engineering work. It&#8217;s about getting better at them, because we&#8217;re going to need them more than ever.</p>
  2918.  
  2919.  
  2920.  
  2921. <p>Now if you&#8217;ll excuse me, I have some commits to catch up on. And this time, I&#8217;m setting a timer.</p>
  2922. ]]></content:encoded>
  2923. </item>
  2924. <item>
  2925. <title>Why AI Efficiency May Be Making Your Organization More Fragile</title>
  2926. <link>https://www.oreilly.com/radar/why-ai-efficiency-may-be-making-your-organization-more-fragile/</link>
  2927. <pubDate>Thu, 25 Sep 2025 11:00:38 +0000</pubDate>
  2928. <dc:creator><![CDATA[Brinda Sarathy and Rajeshwari Ganesan]]></dc:creator>
  2929. <category><![CDATA[AI & ML]]></category>
  2930. <category><![CDATA[Commentary]]></category>
  2931.  
  2932. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17473</guid>
  2933.  
  2934. <media:content
  2935. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Humanoid-robot-lumberjacks-chopping-trees-in-a-sunny-forest.-746312.jpg"
  2936. medium="image"
  2937. type="image/jpeg"
  2938. width="2304"
  2939. height="1792"
  2940. />
  2941.  
  2942. <media:thumbnail
  2943. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Humanoid-robot-lumberjacks-chopping-trees-in-a-sunny-forest.-746312-160x160.jpg"
  2944. width="160"
  2945. height="160"
  2946. />
  2947. <description><![CDATA[The productivity gains from AI tools are undeniable. Development teams are shipping faster, marketing campaigns are launching quicker, and deliverables are more polished than ever. But if you’re a technology leader watching these efficiency improvements, you might want to ask yourself a harder question: Are we building a more capable organization, or are we unintentionally [&#8230;]]]></description>
  2948. <content:encoded><![CDATA[
  2949. <p>The productivity gains from AI tools are undeniable. Development teams are shipping faster, marketing campaigns are launching quicker, and deliverables are more polished than ever. But if you’re a technology leader watching these efficiency improvements, you might want to ask yourself a harder question: Are we building a more capable organization, or are we unintentionally creating a more fragile one?</p>
  2950.  
  2951.  
  2952.  
  2953. <p>If you’re a humanist (or anyone in public higher education), you may be wondering: How will AI compromise the ability of newer generations of scholars and students to think critically, to engage in nuance and debate, and to experience the <em>benefits</em> born out of human friction?</p>
  2954.  
  2955.  
  2956.  
  2957. <p>This article itself is a testament to serendipitous encounters—and to taking more meandering paths instead of always defaulting to the optimized fast track.</p>
  2958.  
  2959.  
  2960.  
  2961. <p>There&#8217;s a pattern emerging among AI-augmented teams—whether in tech firms or on college campuses—that should concern anyone responsible for long-term organizational health and human well-being. In the AI arms race, we&#8217;re seeing what ecologists would recognize as a classic monoculture problem—and the tech industry and early AI adopters in higher education might learn a lesson from nature’s playbook gone wrong.</p>
  2962.  
  2963.  
  2964.  
  2965. <h2 class="wp-block-heading"><strong>The Forestry Parallel</strong></h2>
  2966.  
  2967.  
  2968.  
  2969. <p>Consider how industrial forestry approached &#8220;inefficient&#8221; old-growth forests in the mid-20th century. Faced with complex ecosystems full of fallen logs, competing species, and seemingly “decadent” and “unproductive” old-growth trees, American foresters could only see waste. For these technocrats, waste represented unharnessed value. With the gospel of conservation efficiency as their guiding star, foresters in the US clear-cut complexity and replaced it with monocultures: uniform rows of fast-growing trees optimized for rapid timber yield, a productive and profitable cash crop.</p>
  2970.  
  2971.  
  2972.  
  2973. <p>By the narrow metric of board feet of timber per acre per year, it worked brilliantly. But the ecological costs only emerged later. Without biodiversity, these forests became vulnerable to pests, diseases, and catastrophic fires. It turns out that less complex systems are also less resilient and are limited in their ability to absorb shocks or adapt to a changing climate. What looked like optimization to the foresters of yesterday was actually a system designed for fragility.</p>
  2974.  
  2975.  
  2976.  
  2977. <p>This pattern mirrors what ecological and environmental justice research has revealed about resource management policies more broadly: When we optimize for single metrics while ignoring systemic complexity, we often create the very vulnerabilities we&#8217;re trying to avoid, including decimating systems linked to fostering resilience and well-being. The question is: Are we repeating this pattern in knowledge work? The early warning signs suggest we are.</p>
  2978.  
  2979.  
  2980.  
  2981. <h2 class="wp-block-heading"><strong>The Real Cost of Frictionless Workflows</strong></h2>
  2982.  
  2983.  
  2984.  
  2985. <p>Today&#8217;s AI tools excel at what managers have long considered inefficiency: the messy, time-consuming parts of knowledge work. (There are also considerable environmental and social justice concerns about AI, but we will save them for a future post.) But something more concerning is happening beneath the surface. We&#8217;re seeing a dangerous homogenization of skills across traditional role boundaries.</p>
  2986.  
  2987.  
  2988.  
  2989. <p>Junior developers, for instance, can generate vast quantities of code, but this speed often comes at the expense of quality and maintainability. Product managers generate specifications without working through edge cases but also find themselves writing marketing copy and creating user documentation. Marketing teams craft campaign content without wrestling with audience psychology, yet they increasingly handle tasks that once required dedicated UX researchers or data analysts.</p>
  2990.  
  2991.  
  2992.  
  2993. <p>This role convergence might seem like efficiency, but it&#8217;s actually skill flattening at scale. When everyone can do everything adequately with AI assistance, the deep specialization that creates organizational resilience starts to erode. More pointedly, when AI becomes both the first and last pass in project conception, problem identification, and product generation, we lose out on examining core assumptions, ideologies, and systems with baked-in practices—and that critical engagement is very much what we need when adopting a technology as fundamentally transformative as AI. AI sets the table for conversations, and our engagement with one another is potentially that much less robust as a result.</p>
  2994.  
  2995.  
  2996.  
  2997. <p>For organizations and individuals, role convergence and faster workflows may feel like liberation and lead to a more profitable bottom line. But at the individual level, “cognitive offloading” can lead to significant losses in critical thinking, cognitive retention, and the ability to work without the crutch of technology. Depending heavily on AI to generate ideas or find “solutions” may be seductive in the short run—especially for a generation already steeped in social anxiety and social isolation—but it risks further corroding problem-solving in collaboration with others. Organizationally, we&#8217;re accumulating what we call &#8220;cognitive debt&#8221;—the hidden costs of optimization that compound over time.</p>
  2998.  
  2999.  
  3000.  
  3001. <p>The symptoms are emerging faster than expected:</p>
  3002.  
  3003.  
  3004.  
  3005. <ul class="wp-block-list">
  3006. <li>Junior team members report anxiety about their value-add when AI can produce their typical deliverables faster.</li>
  3007.  
  3008.  
  3009.  
  3010. <li>Critical thinking skills atrophy when problem framing is outsourced to large language models.</li>
  3011.  
  3012.  
  3013.  
  3014. <li>Team discussions become thinner when AI provides the first draft of everything, reducing the productive friction that generates new insights.</li>
  3015.  
  3016.  
  3017.  
  3018. <li>Decision-making processes accelerate but become more brittle when faced with novel situations.</li>
  3019.  
  3020.  
  3021.  
  3022. <li>Deep domain expertise gets diluted as everyone becomes a generalist with AI assistance.</li>
  3023. </ul>
  3024.  
  3025.  
  3026.  
  3027. <h2 class="wp-block-heading"><strong>What Productive Friction Actually Does</strong></h2>
  3028.  
  3029.  
  3030.  
  3031. <p>The most successful knowledge workers have always been those who could synthesize disparate perspectives, ask better questions, and navigate ambiguity. These capabilities develop through what we might call &#8220;productive friction&#8221;—the discomfort of reconciling conflicting viewpoints, the struggle of articulating half-formed ideas, and the hard work of building understanding from scratch and in relationship with other people. This is wisdom born out of experience, not algorithm.</p>
  3032.  
  3033.  
  3034.  
  3035. <p>AI can eliminate this friction, but friction isn&#8217;t just drag—the slowing down of process may have its own benefits. The contained friction sometimes produced through working collectively is like the biodiverse and ostensibly “messy” forest understory where there are many layers of interdependence. This is the rich terrain in which assumptions break down, where edge cases lurk, and where real innovation opportunities hide. From an enterprise AI architecture perspective, friction often reveals the most valuable insights about system boundaries and integration challenges.</p>
  3036.  
  3037.  
  3038.  
  3039. <p>When teams default to AI-assisted workflows for most thinking tasks, they become cognitively brittle. They optimize for output velocity at the expense of the adaptability they&#8217;ll need when the next paradigm shift arrives.</p>
  3040.  
  3041.  
  3042.  
  3043. <h2 class="wp-block-heading"><strong>Cultivating Organizational Resilience</strong></h2>
  3044.  
  3045.  
  3046.  
  3047. <p>The solution isn&#8217;t to abandon AI tools—that would be both futile and counterproductive. Instead, technology leaders need to design for long-term capability building rather than short-term output maximization. The efficiency granted by AI should create an opportunity not just to build faster, but to think deeper—to finally invest the time needed to truly understand the problems we claim to solve, a task the technology industry has historically sidelined in its pursuit of speed. The goal is creating organizational ecosystems that can adapt and thrive and be more humane, not just optimize. It may mean <em>slowing down</em> to ask even more difficult questions: Just because we can do it, should it be done? What are the ethical, social, and environmental implications of unleashing AI? Simply saying AI will solve these thorny questions is like foresters of yore who only focused on the cash crop and were blind to the longer-term negative externalities of ravaged ecosystems.</p>
  3048.  
  3049.  
  3050.  
  3051. <p><strong>Here are four strategies that preserve cognitive diversity alongside algorithmic efficiency:</strong></p>
  3052.  
  3053.  
  3054.  
  3055. <ol class="wp-block-list">
  3056. <li><strong>Make process visible, not just outcomes</strong><br>Instead of presenting AI-generated deliverables as finished products, require teams to identify the problems they&#8217;re solving, alternatives they considered, and assumptions they&#8217;re making before AI assistance kicks in. This preserves the reasoning layer that&#8217;s getting lost and maintains the interpretability that&#8217;s crucial for organizational learning.<br></li>
  3057.  
  3058.  
  3059.  
  3060. <li><strong>Schedule cognitive cross-training</strong><br>Institute regular &#8220;AI-free zones&#8221; where teams work through problems without algorithmic assistance. Treat these as skill-building exercises, not productivity drains. They are also crucial to maintaining human sociality. Like physical cross-training, the goal is maintaining cognitive fitness and preventing the skill atrophy we&#8217;re observing in AI-augmented workflows.<br></li>
  3061.  
  3062.  
  3063.  
  3064. <li><strong>Scale apprenticeship models</strong><br>Pair junior team members with seniors on problems that require building understanding from scratch. AI can assist with implementation, but humans should own problem framing, approach selection, and decision rationale. This counters the dangerous trend toward skill homogenization.<br></li>
  3065.  
  3066.  
  3067.  
  3068. <li><strong>Institutionalize productive dissent</strong><br>Every team of &#8220;true believers&#8221; needs some skeptics to avoid being blindsided. For every AI-assisted recommendation, designate someone to argue the opposite case or identify failure modes. Rotate this role to normalize productive disagreement and prevent groupthink. This mirrors the natural checks and balances that make diverse ecosystems resilient.</li>
  3069. </ol>
  3070.  
  3071.  
  3072.  
  3073. <h2 class="wp-block-heading"><strong>The Organizational Radar Question</strong></h2>
  3074.  
  3075.  
  3076.  
  3077. <p>The critical question for technology leaders isn&#8217;t whether AI will increase productivity—it will. But at what cost and for whom? The question is whether your organization—and your people—will emerge from this transition more capable or more fragile.</p>
  3078.  
  3079.  
  3080.  
  3081. <p>Like those foresters measuring only timber yield, we risk optimizing for metrics that feel important but miss systemic health. The organizations that thrive in the AI era won&#8217;t be those that adopted the tools fastest, but those that figured out how to preserve and cultivate uniquely human capabilities alongside algorithmic efficiency.</p>
  3082.  
  3083.  
  3084.  
  3085. <p>Individual optimization matters less than collective intelligence. As we stand at the threshold of truly transformative AI capabilities, perhaps it&#8217;s time to learn from the forests: Diversity, not efficiency, is the foundation of antifragile systems.</p>
  3086.  
  3087.  
  3088.  
  3089. <p><em>What steps are your organization taking to preserve cognitive diversity? The decisions you make in the next 12 months about how to integrate AI tools may determine whether you&#8217;re building a resilient ecosystem or a mundane monoculture.</em></p>
  3090. ]]></content:encoded>
  3091. </item>
  3092. </channel>
  3093. </rss>
  3094.  