This is a valid RSS feed.
This feed is valid, but interoperability with the widest range of feed readers could be improved.
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:custom="https://www.oreilly.com/rss/custom" > <channel> <title>Radar</title> <atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" /> <link>https://www.oreilly.com/radar</link> <description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description> <lastBuildDate>Thu, 23 Oct 2025 15:26:07 +0000</lastBuildDate> <language>en-US</language> <sy:updatePeriod> hourly </sy:updatePeriod> <sy:updateFrequency> 1 </sy:updateFrequency> <generator>https://wordpress.org/?v=6.8.3</generator> <image> <url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url> <title>Radar</title> <link>https://www.oreilly.com/radar</link> <width>32</width> <height>32</height></image> <item> <title>Code Generation and the Shifting Value of Software</title> <link>https://www.oreilly.com/radar/code-generation-and-the-shifting-value-of-software/</link> <comments>https://www.oreilly.com/radar/code-generation-and-the-shifting-value-of-software/#respond</comments> <pubDate>Thu, 23 Oct 2025 11:14:26 +0000</pubDate> <dc:creator><![CDATA[Tim O'Brien]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17582</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-lights-3.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-lights-3-160x160.jpg" width="160" height="160" /> <description><![CDATA[This article originally appeared on Medium. Tim O’Brien has given us permission to repost here on Radar. One of the most unexpected changes in software development right now comes from code generation. We’ve all known that it could speed up certain kinds of work, but what’s becoming clear is that it also reshapes the economics […]]]></description> <content:encoded><![CDATA[<p class="has-cyan-bluish-gray-background-color has-background"><em>This article originally appeared on </em><a href="https://medium.com/@tobrien/code-generation-and-the-shifting-value-of-software-0c64cfc91adc" target="_blank" rel="noreferrer noopener"><em>Medium</em></a><em>. Tim O’Brien has given us permission to repost here on Radar.</em></p> <p>One of the most unexpected changes in software development right now comes from code generation. We’ve all known that it could speed up certain kinds of work, but what’s becoming clear is that it also reshapes the economics of libraries, frameworks, and even the way we think about open source.</p> <p>Just to be clear, I don’t view this as a threat to the employment of developers. I think we’ll end up needing more developers, and I also think that more people will start to consider themselves developers. 
But I do think that there are practices that are expiring:</p> <ol class="wp-block-list"><li><strong>Purchasing software</strong>—It will become more challenging to sell software unless it provides a compelling and difficult-to-reproduce product.</li> <li><strong>Adopting open source frameworks</strong>—Don’t get me wrong, open source will continue to play a role, but there’s going to be more of it, and there will be fewer “star stage” projects.</li> <li><strong>Software architects</strong>—Again, I’m not saying that we won’t have software architects, but the human process of considering architecture alternatives and having very expensive discussions about abstractions is already starting to disappear.</li></ol> <h2 class="wp-block-heading"><strong>Why Are You Paying for That?</strong></h2> <p>Take paid libraries as an example. For years, developers paid for specific categories of software simply because they solved problems that felt tedious or complex to recreate. A table renderer with pagination, custom cell rendering, and filtering might have justified a license fee because of the time it saved. What developer wants to stop and rewrite the pagination logic for that React table library?</p> <p>Lately, I’ve started answering, “me.” Instead of upgrading the license and paying some ridiculous per-developer fee, why not just ask Claude Sonnet to “render this component with an HTML table that also supports on-demand pagination”? At first, it feels like a mistake, but then you realize it’s cheaper and faster to ask a generative model to write a tailored implementation for that table—and it’s simpler.</p> <p>Most developers who buy software libraries end up using one or two features, while most of the library’s surface area goes untouched. Flipping the switch and moving to a simpler custom approach makes your build cleaner. (I know some of you pay for a very popular React component library with a widespread table implementation that recently raised prices. I also know some of you started asking, “Do I really need this?”)</p> <p>If you can point your IDE at it and say, “Hey, can you implement this in HTML with some simple JavaScript?” and it generates flawless code in five minutes—why wouldn’t you? The next question becomes: Will library creators start adding new legal clauses to lock you in? (My prediction: That’s next.)</p> <p>The moat around specific, specialized libraries keeps shrinking. If you can answer “Can I just replace that?” in five minutes, then replace it.</p> <h2 class="wp-block-heading"><strong>Did You Need That Library?</strong></h2> <p>This same shift also touches open source. Many of the libraries we use came out of long-term community efforts to solve straightforward problems. Logging illustrates this well: Packages like Log4j or Winston exist because developers needed consistent logging across projects. However, most teams utilize only a fraction of that functionality. These days, generating a lightweight logging library with exactly the levels and formatting you need often proves easier.</p> <p>Although adopting a shared library still offers interoperability benefits, the balance tilts toward custom solutions. I just needed to format logs in a standard way. Instead of adding a dependency, we wrote a 200-line internal library. Done.</p> <p>Five years ago, that might have sounded wild. Why rewrite Winston? 
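To give a sense of scale, here is a minimal sketch of what such a purpose-built logger might look like, shown in Java in the spirit of the Log4j example above; the class name and JSON line format are illustrative, not a description of the actual 200-line library mentioned earlier: <pre class="wp-block-code"><code>import java.time.Instant;

/** A tiny, purpose-built logger: levels, timestamps, one JSON line per entry. */
public final class MiniLog {
    public enum Level { DEBUG, INFO, WARN, ERROR }

    private final String component;
    private final Level threshold;

    public MiniLog(String component, Level threshold) {
        this.component = component;
        this.threshold = threshold;
    }

    public void log(Level level, String message) {
        // Only emit entries at or above the configured threshold.
        if (level.ordinal() >= threshold.ordinal()) {
            System.out.printf("{\"ts\":\"%s\",\"level\":\"%s\",\"component\":\"%s\",\"msg\":\"%s\"}%n",
                    Instant.now(), level, component, message);
        }
    }

    public void info(String message)  { log(Level.INFO, message); }
    public void warn(String message)  { log(Level.WARN, message); }
    public void error(String message) { log(Level.ERROR, message); }
}</code></pre> Everything it does is visible in one file, and everything it leaves out is complexity the team never pays for.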
But once you see the level of complexity these libraries carry, and you realize Claude Opus can generate that same logging library to your exact specifications in five minutes, the whole discussion shifts. Again, I’m not saying you should drop everything and craft your own logging library. But look at the 100 dependencies you have in your software—some of them add complexity you’ll never use.</p> <h2 class="wp-block-heading"><strong>Say Goodbye to “Let’s Think About”</strong></h2> <p>Another subtle change shows up in how we solve problems. In the past, a new requirement meant pausing to consider the architecture, interfaces, or patterns before implementing anything. Increasingly, I delegate that “thinking” step to a model. It runs in parallel, proposing solutions while I evaluate and refine. The time between idea and execution keeps shrinking. Instead of carefully choosing among frameworks or libraries, I can ask for a bespoke implementation and iterate from there.</p> <p>Compare that to five years ago. Back then, you assembled your most senior engineers and architects to brainstorm an approach. That still happens, but more often today, you end up discussing the output of five or six independent models that have already generated solutions. You discuss outcomes of models, not ideas for abstractions.</p> <p>The bigger implication: Entire categories of software may lose relevance. I’ve spent years working on open source libraries like Jakarta Commons—collections of utilities that solved countless minor problems. Those projects may no longer matter when developers can write simple functionality on demand. Even build tools face this shift. Maven, for example, once justified an ecosystem of training and documentation. But in the future, documenting your build system in a way that a generative model can understand might prove more useful than teaching people how to use Maven.</p> <h2 class="wp-block-heading"><strong>The Common Thread</strong></h2> <p>The pattern across all of this is simple: Software generation makes it harder to justify paying for prepackaged solutions. Both proprietary and open source libraries lose value when it’s faster to generate something custom. Direct automation displaces tooling and frameworks. Frameworks existed to capture standard code that generative models can now produce on demand.</p> <p>As a result, the future may hold more custom-built code and fewer compromises to fit preexisting systems. 
In short, code generation doesn’t just speed up development—it fundamentally changes what’s worth building, buying, and maintaining.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/code-generation-and-the-shifting-value-of-software/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>AI Is Reshaping Developer Career Paths</title> <link>https://www.oreilly.com/radar/ai-is-reshaping-developer-career-paths/</link> <comments>https://www.oreilly.com/radar/ai-is-reshaping-developer-career-paths/#respond</comments> <pubDate>Wed, 22 Oct 2025 11:14:11 +0000</pubDate> <dc:creator><![CDATA[Andrew Stellman]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17579</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Shift-button.png" medium="image" type="image/png" width="1080" height="1080" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Shift-button-160x160.png" width="160" height="160" /> <custom:subtitle><![CDATA[From Specialists to Generalists]]></custom:subtitle> <description><![CDATA[This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI. A few decades ago, I worked with a developer who was respected by everyone on our team. Much of that respect came from the fact that he kept adopting new technologies that none of us had worked […]]]></description> <content:encoded><![CDATA[<p class="has-cyan-bluish-gray-background-color has-background"><em>This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI.</em></p> <p>A few decades ago, I worked with a developer who was respected by everyone on our team. Much of that respect came from the fact that he kept adopting new technologies that none of us had worked with. There was a cutting-edge language at the time that few people were using, and he built an entire feature with it. He quickly became known as the person you’d go to for these niche technologies, and it earned him a lot of respect from the rest of the team.</p> <p>Years later, I worked with another developer who went out of his way to incorporate specific, obscure .NET libraries into his code. That too got him recognition from our team members and managers, and he was viewed as a senior developer in part because of his expertise with these specialized tools.</p> <p>Both developers built their reputations on deep knowledge of specific technologies. It was a reliable career strategy that worked for decades: Become the expert in something valuable but not widely known, and you’d have authority on your team and an edge in job interviews.</p> <p>But AI is changing that dynamic in ways we’re just starting to see.</p> <p>In the past, experienced developers could build deep expertise in a single technology (like Rails or React, for example) and that expertise would consistently get them recognition on their team and help them stand out in reviews and job interviews. It used to take months or years of working with a specific framework before a developer could write <em>idiomatic code</em>, or code that follows the accepted patterns and best practices of that technology.</p> <p>But now AI models are trained on countless examples of idiomatic code, so developers without that experience can generate similar code immediately. 
That puts less of a premium on the time spent developing that deep expertise.</p> <h2 class="wp-block-heading"><strong>The Shift Toward Generalist Skills</strong></h2> <p>That change is reshaping career paths in ways we’re just starting to see. The traditional approach worked for decades, but as AI fills in more of that specialized knowledge, the career advantage is shifting toward people who can integrate across systems and spot design problems early.</p> <p>As I’ve trained developers and teams who are increasingly adopting AI coding tools, I’ve noticed that the developers who adapt best aren’t always the ones with the deepest expertise in a specific framework. Rather, they’re the ones who can spot when something looks wrong, integrate across different systems, and recognize patterns. Most importantly, they can apply those skills even when they’re not deep experts in the particular technology they’re working with.</p> <p>This represents a shift from the more traditional dynamic on teams, where being an expert in a specific technology (like being the “Rails person” or the “React expert” on the team) carried real authority. AI now fills in much of that specialized knowledge. You can still build a career on deep Rails knowledge, but thanks to AI, it doesn’t always carry the same authority on a team that it once did.</p> <h2 class="wp-block-heading"><strong>What AI Still Can’t Do</strong></h2> <p>Both new and experienced developers routinely find themselves accumulating technical debt, especially when deadlines push delivery over maintainability, and this is an area where experienced engineers often distinguish themselves, even on a team with wide AI adoption. The key difference is that an experienced developer often knows they’re taking on debt. They can spot antipatterns early because they’ve seen them repeatedly and take steps to “pay off” the debt before it gets much more expensive to fix.</p> <p>But AI is also changing the game for experienced developers in ways that go beyond technical debt management, and it’s starting to reshape their traditional career paths. What AI still can’t do is tell you when a design or architecture decision today will cause problems six months from now, or when you’re writing code that doesn’t actually solve the user’s problem. 
That’s why being a generalist, with skills in architecture, design patterns, requirements analysis, and even project management, is becoming more valuable on software teams.</p> <p>Many developers I see thriving with AI tools are the ones who can:</p> <ul class="wp-block-list"><li><strong>Recognize when generated code will create maintenance problems</strong> even if it works initially</li> <li><strong>Integrate across multiple systems</strong> without being deep experts in each one</li> <li><strong>Spot architectural patterns and antipatterns</strong> regardless of the specific technology</li> <li><strong>Frame problems clearly</strong> so AI can generate more useful solutions</li> <li><strong>Question and refine AI output</strong> rather than accepting it as is</li></ul> <h2 class="wp-block-heading"><strong>Practical Implications for Your Career</strong></h2> <p>This shift has real implications for how developers think about career development:</p> <p><strong>For experienced developers:</strong> Your years of expertise are still important and valuable, but the career advantage is shifting from “I know this specific tool really well” to “I can solve complex problems across different technologies.” Focus on building skills in system design, integration, and pattern recognition that apply broadly.</p> <p><strong>For early-career developers:</strong> The temptation might be to rely on AI to fill knowledge gaps, but this can be dangerous. Those broader skills—architecture, design judgment, problem-solving across domains—typically require years of hands-on experience to develop. Use AI as a tool, but make sure you’re still building the fundamental thinking skills that let you guide it effectively.</p> <p><strong>For teams:</strong> Look for people who can adapt to new technologies quickly and integrate across systems, not just deep specialists. The “Rails person” might still be valuable, but the person who can work with Rails, integrate it with three other systems, and spot when the architecture is heading for trouble six months down the line is becoming more valuable.</p> <p>The developers who succeed in an AI-enabled world won’t always be the ones who know the most about any single technology. They’ll be the ones who can see the bigger picture, integrate across systems, and use AI as a powerful tool while maintaining the critical thinking necessary to guide it toward genuinely useful solutions.</p> <p>AI isn’t replacing developers. 
It’s changing what kinds of developer skills matter most.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/ai-is-reshaping-developer-career-paths/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>The Java Developer’s Dilemma: Part 2</title> <link>https://www.oreilly.com/radar/the-java-developers-dilemma-part-2/</link> <comments>https://www.oreilly.com/radar/the-java-developers-dilemma-part-2/#respond</comments> <pubDate>Tue, 21 Oct 2025 11:17:33 +0000</pubDate> <dc:creator><![CDATA[Markus Eisele]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17572</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-fractal-drops-1.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-fractal-drops-1-160x160.jpg" width="160" height="160" /> <custom:subtitle><![CDATA[New Types of Applications]]></custom:subtitle> <description><![CDATA[This is the second of a three-part series by Markus Eisele. Part 1 can be found here. Stay tuned for part 3. Many AI projects fail. The reason is often simple. Teams try to rebuild last decade’s applications but add AI on top: A CRM system with AI. A chatbot with AI. A search engine […]]]></description> <content:encoded><![CDATA[<p class="has-cyan-bluish-gray-background-color has-background"><em>This is the second of a three-part series by Markus Eisele. Part 1 can be found </em><a href="https://www.oreilly.com/radar/the-java-developers-dilemma-part-1/" target="_blank" rel="noreferrer noopener"><em>here</em></a><em>. Stay tuned for part 3.</em></p> <p>Many AI projects fail. The reason is often simple. Teams try to rebuild last decade’s applications but add AI on top: A CRM system with AI. A chatbot with AI. A search engine with AI. The pattern is the same: “X, but now with AI.” These projects usually look fine in a demo, but they rarely work in production. The problem is that AI doesn’t just extend old systems. It changes what applications are and how they behave. If we treat AI as a bolt-on, we miss the point.</p> <h2 class="wp-block-heading">What AI Changes in Application Design</h2> <p>Traditional enterprise applications are built around deterministic workflows. A service receives input, applies business logic, stores or retrieves data, and responds. If the input is the same, the output is the same. Reliability comes from predictability.</p> <p>AI changes this model. Outputs are probabilistic. The same question asked twice may return two different answers. Results depend heavily on context and prompt structure. Applications now need to manage data retrieval, context building, and memory across interactions. They also need mechanisms to validate and control what comes back from a model. In other words, the application is no longer just code plus a database. It’s code plus a reasoning component with uncertain behavior. That shift makes “AI add-ons” fragile and points to a need for entirely new designs.</p> <h2 class="wp-block-heading">Defining AI-Infused Applications</h2> <p>AI-infused applications aren’t just old applications with smarter text boxes. They have new structural elements:</p> <ul class="wp-block-list"><li><strong>Context pipelines</strong>: Systems need to assemble inputs before passing them to a model. 
This often includes retrieval-augmented generation (RAG), where enterprise data is searched and embedded into the prompt. It also includes hierarchical, per-user memory.</li> <li><strong>Memory</strong>: Applications need to persist context across interactions. Without memory, conversations reset on every request. This memory may need to be stored in different ways: in-process, midterm, and even long-term. Who wants to start every support conversation by repeating their name and the products they purchased?</li> <li><strong>Guardrails</strong>: Outputs must be checked, validated, and filtered. Otherwise, hallucinations or malicious responses leak into business workflows.</li> <li><strong>Agents</strong>: Complex tasks often require coordination. An agent can break down a request, call multiple tools or APIs or even other agents, and assemble complex results, executing these steps in parallel or synchronously. Instead of being workflow driven, agents are goal driven. They try to produce a result that satisfies a request. <a href="https://en.wikipedia.org/wiki/Business_Process_Model_and_Notation" target="_blank" rel="noreferrer noopener">Business Process Model and Notation</a> (BPMN) is turning toward goal-context–oriented agent design.</li></ul> <p>These are not theoretical. They’re the building blocks we already see in modern AI systems. What’s important for Java developers is that they can be expressed as familiar architectural patterns: pipelines, services, and validation layers. That makes them approachable even though the underlying behavior is new.</p> <h2 class="wp-block-heading">Models as Services, Not Applications</h2> <p>One foundational thought: AI models should not be part of the application binary. They are services. Whether they’re served through a container locally, served via vLLM, hosted by a model cloud provider, or deployed on private infrastructure, the model is consumed through a service boundary. For enterprise Java developers, this is familiar territory. We have decades of experience consuming external services through fast protocols, handling retries, applying backpressure, and building resilience into service calls. We know how to build clients that survive transient errors, timeouts, and version mismatches. This experience is directly relevant when the “service” happens to be a model endpoint rather than a database or messaging broker.</p> <p>By treating the model as a service, we avoid a major source of fragility. Applications can evolve independently of the model. If you need to swap a local Ollama model for a cloud-hosted GPT or an internal Jlama deployment, you change configuration, not business logic. This separation is one of the reasons enterprise Java is well positioned to build AI-infused systems.</p> <h2 class="wp-block-heading">Java Examples in Practice</h2> <p>The Java ecosystem is beginning to support these ideas with concrete tools that address enterprise-scale requirements rather than toy examples.</p> <ul class="wp-block-list"><li><strong>Retrieval-augmented generation (RAG)</strong>: Context-driven retrieval is the most common pattern for grounding model answers in enterprise data. At scale this means structured ingestion of documents, PDFs, spreadsheets, and more into vector stores.
Projects like <a href="https://github.com/docling-project/docling" target="_blank" rel="noreferrer noopener">Docling</a> handle parsing and transformation, and <a href="https://docs.langchain4j.dev/" target="_blank" rel="noreferrer noopener">LangChain4j</a> provides the abstractions for embedding, retrieval, and ranking. Frameworks such as <a href="https://quarkus.io/" target="_blank" rel="noreferrer noopener">Quarkus</a> then extend those concepts into production-ready services with dependency injection, configuration, and observability. The combination moves RAG from a demo pattern into a reliable enterprise feature.</li></ul> <ul class="wp-block-list"><li><strong>LangChain4j as a standard abstraction</strong>: LangChain4j is emerging as a common layer across frameworks. It offers CDI integration for <a href="https://github.com/langchain4j/langchain4j-cdi" target="_blank" rel="noreferrer noopener">Jakarta EE</a> and <a href="https://docs.quarkiverse.io/quarkus-langchain4j/dev/" target="_blank" rel="noreferrer noopener">extensions for Quarkus</a> but also supports Spring, Micronaut, and Helidon. Instead of writing fragile, low-level OpenAPI glue code for each provider, developers define AI services as interfaces and let the framework handle the wiring. This standardization is also beginning to cover agentic modules, so orchestration across multiple tools or APIs can be expressed in a framework-neutral way.<br></li> <li><strong>Cloud to on-prem portability</strong>: In enterprises, portability and control matter. Abstractions make it easier to switch between cloud-hosted providers and on-premises deployments. With LangChain4j, you can change configuration to point from a cloud LLM to a local Jlama model or Ollama instance without rewriting business logic. These abstractions also make it easier to use more and smaller domain-specific models and maintain consistent behavior across environments. For enterprises, this is critical to balancing innovation with control.</li></ul> <p>These examples show how Java frameworks are taking AI integration from low-level glue code toward reusable abstractions. The result is not only faster development but also better portability, testability, and long-term maintainability.</p> <h2 class="wp-block-heading">Testing AI-Infused Applications</h2> <p>Testing is where AI-infused applications diverge most sharply from traditional systems. In deterministic software, we write unit tests that confirm exact results. With AI, outputs vary, so testing has to adapt. The answer is not to stop testing but to broaden how we define it.</p> <ul class="wp-block-list"><li><strong>Unit tests</strong>: Deterministic parts of the system—context builders, validators, database queries—are still tested the same way. Guardrail logic, which enforces schema correctness or policy compliance, is also a strong candidate for unit tests.</li> <li><strong>Integration tests</strong>: AI models should be tested as opaque systems. You feed in a set of prompts and check that outputs meet defined boundaries: JSON is valid, responses contain required fields, values are within expected ranges.</li> <li><strong>Prompt testing</strong>: Enterprises need to track how prompts perform over time. Variation testing with slightly different inputs helps expose weaknesses. 
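A boundary-style check of that kind might look like the following hypothetical JUnit 5 sketch, where <code>OrderAssistant</code> and <code>TestClients</code> stand in for whatever client the application wraps around its model endpoint; the assertions stay at the level of structure rather than exact wording: <pre class="wp-block-code"><code>import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

class OrderAssistantContractTest {

    // Hypothetical application-level client wrapping the model endpoint.
    private final OrderAssistant assistant = TestClients.orderAssistant();

    @ParameterizedTest
    @ValueSource(strings = {
            "Where is my order 4711?",
            "where's order 4711??",
            "Order 4711 status please"
    })
    void repliesStayInsideTheContract(String prompt) throws Exception {
        String reply = assistant.answer(prompt);

        // Boundary assertions only: valid structure and required fields, no exact wording.
        JsonNode json = new ObjectMapper().readTree(reply);
        assertTrue(json.hasNonNull("status"), "status field is required");
        assertTrue(json.hasNonNull("orderId"), "orderId field is required");
        assertFalse(reply.contains("@"), "no email addresses may leak into replies");
    }
}</code></pre>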
This should be automated and included in the CI pipeline, not left to ad hoc manual testing.</li></ul> <p>Because outputs are probabilistic, tests often look like assertions on structure, ranges, or presence of warning signs rather than exact matches. Hamel Husain stresses that specification-based testing with curated prompt sets is essential, and that <a href="http://hamel.dev/blog/posts/evals-faq" target="_blank" rel="noreferrer noopener">evaluations should be problem-specific rather than generic</a>. This aligns well with Java practices: We design integration tests around known inputs and expected boundaries, not exact strings. Over time, this produces confidence that the AI behaves within defined boundaries, even if specific sentences differ.</p> <h2 class="wp-block-heading">Collaboration with Data Science</h2> <p>Another dimension of testing is collaboration with data scientists. Models aren’t static. They can drift as training data changes or as providers update versions. Java teams cannot ignore this. We need methodologies that surface warning signs, such as sudden drops in accuracy on known inputs or unexpected changes in response style. These signals need to be fed back into monitoring systems that span both the data science and the application side.</p> <p>This requires closer collaboration between application developers and data scientists than most enterprises are used to. Developers must expose signals from production (logs, metrics, traces) to help data scientists diagnose drift. Data scientists must provide datasets and evaluation criteria that can be turned into automated tests. Without this feedback loop, drift goes unnoticed until it becomes a business incident.</p> <p>Domain experts play a central role here. Looking back at Husain, he points out that <a href="https://hamel.dev/blog/posts/evals" target="_blank" rel="noreferrer noopener">automated metrics often fail to capture user-perceived quality</a>. Java developers shouldn’t leave evaluation criteria to data scientists alone. Business experts need to help define what “good enough” means in their context. A clinical assistant has very different correctness criteria than a customer service bot. Without domain experts, AI-infused applications risk delivering the wrong things.</p> <h2 class="wp-block-heading">Guardrails and Sensitive Data</h2> <p>Guardrails belong under testing as well. For example, an enterprise system should never return personally identifiable information (PII) unless explicitly authorized. Tests must simulate cases where PII could be exposed and confirm that guardrails block those outputs. This is not optional. Scrubbing sensitive data is a best practice on the model training side, but RAG and memory in particular carry a real risk of exactly this kind of personally identifiable information crossing boundaries. Regulatory frameworks like GDPR and HIPAA already enforce strict requirements. Enterprises must prove that AI components respect these boundaries, and testing is the way to demonstrate it.</p> <p>By treating guardrails as testable components, not ad hoc filters, we raise their reliability. Schema checks, policy enforcement, and PII filters should all have automated tests just like database queries or API endpoints. This reinforces the idea that AI is part of the application, not a mysterious bolt-on.</p> <h2 class="wp-block-heading">Edge-Based Scenarios: Inference on the JVM</h2> <p>Not all AI workloads belong in the cloud. Latency, cost, and data sovereignty often demand local inference.
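In Java, that switch can be small. The sketch below uses LangChain4j-style wiring (exact class and builder names vary across library versions, and the <code>SupportAssistant</code> interface is an assumption) to show the same business-facing contract backed by either a cloud model or a locally served one: <pre class="wp-block-code"><code>import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.service.AiServices;

public class AssistantFactory {

    // The application-facing contract: plain Java, no model details.
    public interface SupportAssistant {
        String answer(String question);
    }

    public static SupportAssistant cloudAssistant(String apiKey) {
        // Cloud-hosted model behind the same interface.
        ChatLanguageModel model = OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o-mini")
                .build();
        return AiServices.create(SupportAssistant.class, model);
    }

    public static SupportAssistant localAssistant() {
        // Same contract, but inference stays on the local machine or edge box.
        ChatLanguageModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.1")
                .build();
        return AiServices.create(SupportAssistant.class, model);
    }
}</code></pre> Which variant gets wired in becomes a deployment decision, and the pull toward the local option grows wherever data cannot leave the premises.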
This is especially true at the edge: in retail stores, factories, vehicles, or other environments where sending every request to a cloud service is impractical.</p> <p>Java is starting to catch up here. Projects like Jlama allow language models to run directly inside the JVM. This makes it possible to deploy inference alongside existing Java applications without adding a separate Python or C++ runtime. The advantages are clear: lower latency, no external data transfer, and simpler integration with the rest of the enterprise stack. For developers, it also means you can test and debug everything inside one environment rather than juggling multiple languages and toolchains.</p> <p>Edge-based inference is still new, but it points to a future where AI isn’t just a remote service you call. It becomes a local capability embedded into the same platform you already trust.</p> <h2 class="wp-block-heading">Performance and Numerics in Java</h2> <p>One reason Python became dominant in AI is its excellent math libraries like NumPy and SciPy. These libraries are backed by native C and C++ code, which delivers strong performance. Java has historically lacked first-rate numerics libraries of the same quality and ecosystem adoption. Libraries like <a href="https://deeplearning4j.konduit.ai/nd4j/tutorials/quickstart" target="_blank" rel="noreferrer noopener">ND4J</a> (part of <a href="https://deeplearning4j.konduit.ai/" target="_blank" rel="noreferrer noopener">Deeplearning4j</a>) exist, but they never reached the same critical mass.</p> <p>That picture is starting to change. <a href="https://openjdk.org/projects/panama/" target="_blank" rel="noreferrer noopener">Project Panama</a> is an important step. It gives Java developers efficient access to native libraries, GPUs, and accelerators without complex JNI code. Combined with ongoing work on vector APIs and Panama-based bindings, Java is becoming much more capable of running performance-sensitive tasks. This evolution matters because inference and machine learning won’t always be external services. In many cases, they’ll be libraries or models you want to embed directly in your JVM-based systems.</p> <h2 class="wp-block-heading">Why This Matters for Enterprises</h2> <p>Enterprises cannot afford to live in prototype mode. They need systems that run for years, can be supported by large teams, and fit into existing operational practices. AI-infused applications built in Java are well positioned for this. They are:</p> <ul class="wp-block-list"><li><strong>Closer to business logic</strong>: Running in the same environment as existing services</li> <li><strong>More auditable</strong>: Observable with the same tools already used for logs, metrics, and traces</li> <li><strong>Deployable across cloud and edge</strong>: Capable of running in centralized data centers or at the periphery, where latency and privacy matter</li></ul> <p>This is a different vision from “add AI to last decade’s application.” It’s about creating applications that only make sense because AI is at their core.</p> <p>In <a href="https://www.oreilly.com/library/view/applied-ai-for/9781098174491/" target="_blank" rel="noreferrer noopener"><em>Applied AI for Enterprise Java Development</em></a>, we go deeper into these patterns. 
The book provides an overview of architectural concepts, shows how to implement them with real code, and explains how emerging standards like the <a href="https://a2a-protocol.org/latest/" target="_blank" rel="noreferrer noopener">Agent2Agent Protocol</a> and <a href="https://modelcontextprotocol.io/docs/getting-started/intro" target="_blank" rel="noreferrer noopener">Model Context Protocol</a> fit in. The goal is to give Java developers a road map to move beyond demos and build applications that are robust, explainable, and ready for production.</p> <p>The transformation isn’t about replacing everything we know. It’s about extending our toolbox. Java has adapted before, from servlets to EJBs to microservices. The arrival of AI is the next shift. The sooner we understand what these new types of applications look like, the sooner we can build systems that matter.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/the-java-developers-dilemma-part-2/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>A Human-Centered Approach to Competitive Advantage</title> <link>https://www.oreilly.com/radar/a-human-centered-approach-to-competitive-advantage/</link> <comments>https://www.oreilly.com/radar/a-human-centered-approach-to-competitive-advantage/#respond</comments> <pubDate>Mon, 20 Oct 2025 11:25:17 +0000</pubDate> <dc:creator><![CDATA[Kord Davis]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17565</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/A-collaborative-approach-to-AI.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/A-collaborative-approach-to-AI-160x160.jpg" width="160" height="160" /> <custom:subtitle><![CDATA[Unlocking AI's Potential]]></custom:subtitle> <description><![CDATA[In the modern enterprise, information is the new capital. While companies pour resources into artificial intelligence, many discover that technology, standing alone, delivers only expense, not transformation. The true engine of change lies not in the algorithm but in the hands and minds of the people who use it. The greatest asset an organization possesses […]]]></description> <content:encoded><![CDATA[<p>In the modern enterprise, information is the new capital. While companies pour resources into artificial intelligence, many discover that technology, standing alone, delivers only expense, not transformation. The true engine of change lies not in the algorithm but in the hands and minds of the people who use it. The greatest asset an organization possesses is the diverse, domain-specific expertise held within its human teams.</p> <p>Drawing directly from <a href="https://en.wikipedia.org/wiki/Peter_Drucker" target="_blank" rel="noreferrer noopener">Peter Drucker</a>‘s principles, the path to competitive advantage is a human-centered approach. Effective management, Drucker taught, demands a focus on measurable results, fostered through collaboration and the strict alignment of individual efforts with institutional goals. Technology is but a tool; it has no purpose unless it serves the people who use it and the mission they are trying to accomplish. 
This is the only reliable way to generate genuine innovation and tangible outcomes.</p> <h2 class="wp-block-heading"><strong>The Social Reality of Data and The Peril of Silos</strong></h2> <h3 class="wp-block-heading"><strong><strong>Data as a Collective Endeavor</strong></strong></h3> <p>Data analysis is fundamentally a collective effort. We shouldn’t aim to turn everyone into a data scientist; rather, we must empower teams to collaborate effectively with both AI and one another—together. Consider a large retail company seeking to optimize its supply chain. The firm has invested heavily in a sophisticated AI model to forecast demand and automate inventory. The model, however, is failing. It recommends stocking up on products that sit unsold while critical items are frequently out of stock.</p> <p>The problem is not the technology. It’s a failure to apply human intelligence, experience, and expertise. The AI model, built by a team of data scientists, was designed to optimize for cost per unit and speed of delivery. It did not, and could not, account for the deep insights held by the people who actually run the business. The marketing team understands that a sudden social media trend will create a surge in demand for a specific item, while the sales team knows that a key corporate client has just placed a large, unannounced order. The operations manager on the warehouse floor can predict which logistical choke points will delay a shipment, regardless of the model’s prediction. The AI’s diagnosis was based on limited data; the humans had the full picture.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p><em>“The purpose of an organization is to enable ordinary human beings to do extraordinary things.” </em><br>Peter Drucker</p></blockquote> <p>These individuals—the marketing leader, the sales professional, the operations manager—hold the domain expertise that unlocks the AI’s full potential. The purpose of the AI is to augment and amplify this expertise, not to replace it.</p> <h3 class="wp-block-heading"><strong>The Challenge of Silos</strong></h3> <p>This collective effort often fails because of organizational silos. While some silos began as practical necessity—protecting sensitive customer data, for instance—many persist long after their original justification has vanished. More dangerously, silos are often the result of political dynamics and the fear of losing power or influence. Consider a chief marketing officer (CMO) who is reluctant to share a new predictive model for customer lifetime value with the chief information officer (CIO). The CMO views this model as a competitive asset, a tool to justify her department’s budget and influence. By withholding it, she ensures her team remains the sole source of this critical insight.</p> <p>This mindset is toxic; it substitutes internal competition for collective performance. It creates a system where departments focus on territory over results. As Drucker taught, <a href="https://www.academia.edu/43101559/The_Essential_Drucker" target="_blank" rel="noreferrer noopener">the purpose of an organization is to enable ordinary human beings to do extraordinary things</a>. When they are confined to their own small domains, their work becomes ordinary, no matter how advanced their tools.</p> <h3 class="wp-block-heading"><strong>Cultivating a Collaborative Environment</strong></h3> <p>Dismantling these barriers isn’t merely a structural challenge; it’s a fundamental human and cultural imperative. 
Leaders must recognize that silos are symptoms of human challenges that demand a shift in mindset: prioritize collaboration over competition. To do this, they must create an environment where diverse perspectives are actively sought and rewarded.</p> <p>This begins with a shared language and a clear mandate. A leader can facilitate a series of cross-departmental workshops, bringing together marketers, engineers, and financial analysts not to “get trained on AI” but to identify shared problems. A question like “How can we use existing data to reduce customer service call volume?” can be the starting point for a collaboration that organically breaks down barriers. The result isn’t a new algorithm but a new process built on mutual understanding.</p> <h2 class="wp-block-heading"><strong>Strategy: Start Small, Win Big</strong></h2> <p>Many enterprises err by pursuing ambitious, grand-scale technology implementations, such as vast enterprise resource planning (ERP) systems. The intention—to integrate and streamline—is sound, but the result is often disappointment, cost overruns, and fresh confusion. Consider a manufacturing company that invested millions in a new system to automate its entire production line. The initial rollout was chaotic. The system’s inflexible data entry requirements frustrated engineers on the floor who had their own established, practical methods. Production was halted for weeks as frontline workers grappled with a system that complicated, rather than simplified, their work. This is a cautionary tale: Without a people-centered approach, even the most advanced systems fall short.</p> <h3 class="wp-block-heading"><strong>The Power of Incrementalism</strong></h3> <p>The path to AI success isn’t a sweeping, top-down overhaul. It’s about incremental projects that empower teams to tackle small, relevant challenges. This isn’t a retreat; it’s a strategic choice. It’s a recognition that true change happens through a series of manageable, successful steps.</p> <ol class="wp-block-list"><li><strong>Start with a small, strategic project</strong>: Don’t overhaul the entire customer service platform; focus on a single, pressing problem. For a call center, a small project might be using a simple AI model to analyze call transcripts and identify the top five reasons for long hold times. This is manageable, provides immediate, actionable insights, and gives the team a sense of accomplishment. The project is small, but the win is significant: It proves the value of the approach.<br></li> <li><strong>Establish clear objectives</strong>: If the call center project aims to reduce hold times, define success with a clear, measurable goal: reduce the average call handle time by 15% within three months. This clarity is nonnegotiable. It provides a focal point and eliminates ambiguity.<br></li> <li><strong>Prevent scope creep</strong>: This is the silent killer of projects. To prevent it, clear boundaries must be established from the outset. The team might agree: “We will only analyze calls from Q3, and we will only focus on the top five identified root causes. We will not expand to analyze email support tickets during this phase.” This rigid discipline ensures the project remains on track and delivers a tangible outcome.<br></li> <li><strong>Encourage cross-functional collaboration</strong>: The project’s success depends on the human element. 
The team must include a frontline call center representative who understands the nuances of customer conversations, a data analyst to interpret the AI’s output, and a product manager to implement the recommended changes. These cross-functional workshops are where true insights collide and innovation is born.</li></ol> <h2 class="wp-block-heading"><strong>Learning and Scaling</strong></h2> <p>Every incremental project is an opportunity for relentless learning. After completing the call center project and reducing hold times, the team must conduct a thorough retrospective. They should ask: What succeeded? What failed? If a project successfully reduces churn rates, document the strategies that led to this success and apply them broadly. Success isn’t the end; it’s the beginning of a new process. The team can then apply the same methodology to email support, then to their live chat. The small win becomes a repeatable blueprint for progress.</p> <h3 class="wp-block-heading"><strong>The Leadership Imperative</strong></h3> <p>The leader’s role is unambiguous: foster a culture of transparency, trust, and empowerment.</p> <p>A human-centered strategy addresses the root causes of slow AI adoption and siloed data. It encourages a resilient environment where curiosity about data becomes ingrained in the corporate culture. When diverse disciplines actively engage with data, they cultivate a shared language and a collective, data-first mindset.</p> <p>This endeavor isn’t about tool adoption; it’s about nurturing an environment where collaboration is the default setting. It’s about understanding that a silo isn’t a structure; it’s a human behavior that must be managed and redirected toward a common goal. By prioritizing human expertise and actively confronting the political realities underpinning silos, businesses transform AI from a technology expense into a competitive advantage that drives meaningful innovation and secures long-term success.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/a-human-centered-approach-to-competitive-advantage/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>Generative AI in the Real World: Context Engineering with Drew Breunig</title> <link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-context-engineering-with-drew-breunig/</link> <comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-context-engineering-with-drew-breunig/#respond</comments> <pubDate>Thu, 16 Oct 2025 11:18:24 +0000</pubDate> <dc:creator><![CDATA[Ben Lorica and Drew Breunig]]></dc:creator> <category><![CDATA[Generative AI in the Real World]]></category> <category><![CDATA[Podcast]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&p=17562</guid> <enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3" length="0" type="audio/mpeg" /> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" medium="image" type="image/png" width="2560" height="2560" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" width="160" height="160" /> <description><![CDATA[In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what 
comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and […]]]></description> <content:encoded><![CDATA[<p>In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what current crop of LLMs are actually capable of.</p> <p><strong>About the <em>Generative AI in the Real World </em>podcast</strong>: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p> <p>Check out <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*m7f70i*_ga*MTYyODYzMzQwMi4xNzU4NTY5ODYz*_ga_092EL089CH*czE3NTkxNzAwODUkbzE0JGcwJHQxNzU5MTcwMDg1JGo2MCRsMCRoMA.." target="_blank" rel="noreferrer noopener">other episodes</a> of this podcast on the O’Reilly learning platform.</p> <h2 class="wp-block-heading">Transcript</h2> <p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=0" target="_blank" rel="noreferrer noopener">00.00</a>: <strong>All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he’s also in the process of writing a book for O’Reilly called the <em>Context Engineering Handbook</em>. And with that, Drew, welcome to the podcast.</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=23" target="_blank" rel="noreferrer noopener">00.23</a>: Thanks, Ben. Thanks for having me on here. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=26" target="_blank" rel="noreferrer noopener">00.26</a>: <strong>So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What’s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it’s the delicate art and science of filling the context window with just the right information. What’s broken with how teams think about context today? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=56" target="_blank" rel="noreferrer noopener">00.56</a>: I think it’s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the <a href="https://learning.oreilly.com/library/view/prompt-engineering-for/9781098153427/" target="_blank" rel="noreferrer noopener">prompt engineering book</a> for O’Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? 
And I think it has to do with the way the models and the way they’re being built is evolving. I think it also has to deal with the way that we’re learning how to use these models. </p> <p>And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you’re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=120" target="_blank" rel="noreferrer noopener">02.00</a>: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you’re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt. </p> <p>But when we start to have these multiturn systematic editing and preparation of contexts, a new word was needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it’s an evolution. And it shows how we’re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programing, whereas prompt engineering lives in how people use chatbots, which is a different field. It’s not better and not worse. </p> <p>And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context but also setting up systems that fix and edit your context, if that makes sense. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=213" target="_blank" rel="noreferrer noopener">03.33</a>: <strong>Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it’s the prompt. So you’re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=242" target="_blank" rel="noreferrer noopener">04.02</a>: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn’t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stuart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing. 
</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=276" target="_blank" rel="noreferrer noopener">04.36</a>: <strong>So the word “context” means you’re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific.</strong> </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=295" target="_blank" rel="noreferrer noopener">04.55</a>: And I think it also is like, it has to be edited by a person. I’m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it’s a good delineation. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=314" target="_blank" rel="noreferrer noopener">05.14</a>: <strong>And then I think one of the very immediate lessons that people realize is, just because. . . </strong></p> <p><strong>So one of the things that these model providers do when they have a model release, one of the things they note is, What’s the size of the context window? So people started associating context window [with] “I stuff as much as I can in there.” But the reality is actually that, one, it’s not efficient. And two, it also is not useful to the model. Just because you have a massive context window doesn’t mean that the model treats the entire context window evenly.</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=357" target="_blank" rel="noreferrer noopener">05.57</a>: Yeah, it doesn’t treat it evenly. And it’s not a one-size-fits-all solution. So I don’t know if you remember last year, but that was the big dream, which was, “Hey, we’re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don’t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope. </p> <p>And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is. . . Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=421" target="_blank" rel="noreferrer noopener">07.01</a>: Perhaps the thing that clicked for me was in <a href="https://arxiv.org/abs/2507.06261" target="_blank" rel="noreferrer noopener">Google’s Gemini 2.5 paper</a>. Fantastic paper. 
And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.</p> <p>And buried in there was a real “warts and all” case study, which is my favorite kind, when you talk about the hard things and especially when you cite the things you can’t overcome. And Gemini 2.5 had a million-token context window with, eventually, 2 million tokens coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons. They start to hallucinate. One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge.” </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=502" target="_blank" rel="noreferrer noopener">08.22</a>: So inside every model there’s a knowledge base. There’s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=523" target="_blank" rel="noreferrer noopener">08.43</a>: So that’s one way it can fail. We call this “context distraction,” though Kelly Hong at Chroma has written an <a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">incredible paper documenting this</a>, which she calls “context rot,” which is a similar way [of] charting when these benchmarks start to fall apart.</p> <p>Now the cool thing about this is that you can actually use this to your advantage. There’s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for. . . </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=553" target="_blank" rel="noreferrer noopener">09.13</a>: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something that it doesn’t know how to do by providing examples in your context. And those examples illustrate how it should perform. It’s not something that it’s seen before. It’s not in the weights. It’s a completely unique problem. </p> <p>Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context.
And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=597" target="_blank" rel="noreferrer noopener">09.57</a>: And so all of this is just to try to illustrate the complexity of what’s going on here and how I think one of the traps that leads us to this place is that the gift and the curse of LLMs is that we prompt and build contexts that are in the English language or whatever language you speak. And so that leads us to believe that they’re going to react like other people or entities that read the English language.</p> <p>And the fact of the matter is, they don’t—they’re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=635" target="_blank" rel="noreferrer noopener">10.35</a>: <strong>This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights, so they have prior beliefs. They saw something [on] the internet, and they will opine against the precise information you retrieve from the context. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=668" target="_blank" rel="noreferrer noopener">11.08</a>: This is a really big problem. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=669" target="_blank" rel="noreferrer noopener">11.09</a>: <strong>So this is true even if the context window’s small actually.</strong> </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=673" target="_blank" rel="noreferrer noopener">11.13</a>: Yeah, and Ben, you touched on something that’s really important. So in my <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">original blog post</a>, I document four ways that context fails. I talk about “context poisoning.” That’s when you hallucinate something in a long-running task and it stays in there, and so it’s continually confusing it. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren’t relevant to the task inside your context, and suddenly the model thinks that it has to pay attention to this stuff, and that leads it astray. And then the last thing is “context clash,” which is when there’s information in the context that’s at odds with the task that you are trying to perform. </p> <p>A good example of this is, say you’re asking the model to only reply in JSON, but you’re using MCP tools that are defined with XML. And so you’re creating this backwards thing.
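</p>
<p><em>Those failure modes are concrete enough that a harness can screen for some of them before a request ever goes out. The sketch below is deliberately crude, and every threshold and field name in it is an invented placeholder rather than anything from the original blog post:</em></p>
<pre class="wp-block-code"><code># A toy "context lint" pass for the failure modes described above. The token
# estimate, the 200K soft limit, and the field names are invented placeholders;
# real numbers vary by model and have to come from your own evals.
def lint_context(context_text, task, tool_defs, soft_token_limit=200_000):
    warnings = []

    # Context distraction: past some soft limit, the model leans on the
    # context instead of its weights.
    approx_tokens = len(context_text) // 4
    if approx_tokens > soft_token_limit:
        warnings.append("distraction: context is past the soft token budget")

    # Context confusion: tool definitions unrelated to the task invite the
    # model to use them anyway.
    for tool in tool_defs:
        if tool["topic"] not in task.lower():
            warnings.append("confusion: tool '" + tool["name"] + "' looks unrelated")

    # Context clash: conflicting instructions, like demanding JSON replies
    # while the tool definitions in the context are written as XML.
    wants_json = "reply only in json" in context_text.lower()
    xml_tools = [t["name"] for t in tool_defs if t.get("format") == "xml"]
    if wants_json and xml_tools:
        warnings.append("clash: JSON-only replies requested, but tools are defined in XML")

    # Context poisoning is the one a static check can't catch: a hallucinated
    # "fact" usually looks plausible, so it only shows up in evals and traces.
    return warnings</code></pre>
<p><em>A check like this is no substitute for evals, but it turns the vocabulary above into something a harness can act on automatically.</em></p>
<p>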
But I think there’s a fifth piece that I need to write about because it keeps coming up. And it’s exactly what you described.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=743" target="_blank" rel="noreferrer noopener">12.23</a>: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There are three situations you get yourself into when you’re interacting with an LLM. The first is when you’re working with the weights. You’re asking it a question that it knows how to answer. It’s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That’s what I call “working with the weights.” </p> <p>The second is what we referred to earlier, which is that in-context learning, which is you’re doing something that it doesn’t know about and you’re showing an example, and then it does it. And this is great. It’s wonderful. We do it all the time. </p> <p>But then there’s a third example which is, you’re providing it examples. But those examples are at odds with some things that it had learned usually during posttraining, during the fine-tuning or RL stage. A really good example is output formats. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=814" target="_blank" rel="noreferrer noopener">13.34</a>: Recently a friend of mine was updating his pipeline to try out a new model, Moonshot’s. A really great model, and really great for tool use. And so he just changed his model and hit run to see what happened. And he kept failing—his thing couldn’t even work. He’s like, “I don’t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.</p> <p>I looked at his code and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I’ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that. That one change passed every test. Like basically crushed it because it was working with the weights. He wasn’t fighting the weights. Everyone’s experienced this if you build with AI: the stubborn things it refuses to do, no matter how many times you ask it, including formatting. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=875" target="_blank" rel="noreferrer noopener">14.35</a>: [Here’s] my favorite example of this though, Ben: So in ChatGPT’s web interface or their application interface, if you go there and you try to prompt an image, a lot of the images that people prompt—and I’ve talked to user researchers about this—are really boring prompts. They have a text box that can be anything, and they’ll say something like “a black cat” or “a statue of a man thinking.”</p> <p>OpenAI realized this was leading to a lot of bad images because the prompt wasn’t detailed; it wasn’t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and it hands it to another model and says, “Improve this prompt,” and it improves the prompt for you.
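</p>
<p><em>That routing pattern (detect a thin prompt, have another model enrich it before the image model sees it) is simple to sketch. Everything below is an assumption made for illustration; the cutoff, the rewrite instructions, and the <code>call_llm</code> placeholder are not OpenAI’s actual system:</em></p>
<pre class="wp-block-code"><code># A sketch of that routing pattern. call_llm is a stand-in for whatever client
# you use; the length cutoff and the rewrite instructions are invented for
# illustration and are not OpenAI's actual implementation.
def call_llm(prompt):
    raise NotImplementedError("plug in your model client here")

REWRITE_INSTRUCTIONS = (
    "Expand this image request into a detailed visual description: "
    "subject, material, lighting, and composition. "
    "Return only the rewritten prompt, with no commentary."
)

def prepare_image_prompt(user_prompt, min_length=40):
    if len(user_prompt.strip()) < min_length:
        # Too thin to draw from: have another model enrich it first.
        return call_llm(REWRITE_INSTRUCTIONS + "\n\n" + user_prompt)
    return user_prompt</code></pre>
<p>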
And if you inspect in Chrome or Safari or Firefox, whatever, you inspect the developer settings, you can see the JSON being passed back and forth, and you can see your original prompt going in. Then you can see the improved prompt. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=936" target="_blank" rel="noreferrer noopener">15.36</a>: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot or to the LLM that said, “Generate this image and after you generate the image, do not reply. Do not ask follow up questions. Do not ask. Do not make any comments describing what you’ve done. Just generate the image.” And in this prompt, then nine times, some of them in all caps, they say, “Please do not reply.” And the reason is because a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want you to always be asking a follow-up question and they train it. And so now they have to fight the prompts. They have to add in all these statements. And that’s another way that fails. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1002" target="_blank" rel="noreferrer noopener">16.42</a>: So why I bring this up—and this is why I need to write about it—is as an applied AI developer, you need to recognize when you’re fighting the prompt, understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different, because you’re just banging your head against a wall and you’re going to get inconsistent, bad applications and the same statement 20 times over. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1027" target="_blank" rel="noreferrer noopener">17.07</a>: <strong>By the way, the other thing that’s interesting about this whole topic is, people actually somehow have underappreciated or forgotten all of the progress we’ve made in information retrieval. There’s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it’s the kind of thing that people underappreciated. And so by simply loading your context window with massive amounts of garbage, you’re actually, leaving on the field so much progress in information retrieval.</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1084" target="_blank" rel="noreferrer noopener">18.04</a>: I do think it’s hard. And that’s one of the risks: We’re building all this stuff so fast from the ground up, and there’s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.</p> <p>I really do think there’s two pools of developers. 
There’s the “throw everything in the model” pool, and then there’s the “I’m going to take incremental steps and find the most optimal model.” And I often find that latter group, which I called a compound AI group after a <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/" target="_blank" rel="noreferrer noopener">paper that was published out of Berkeley</a>, those tend to be people who have run data pipelines, because it’s not just a simple back and forth interaction. It’s gigabytes or even more of data you’re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is actually incredibly key, if not a total requirement. So there’s a lot of innovation that comes out of that space because of that kind of boundary.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1148" target="_blank" rel="noreferrer noopener">19.08</a>: <strong>If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1169" target="_blank" rel="noreferrer noopener">19.29</a>: Well you’re going to laugh, Ben, because the answer is dependent on the context, and I mean the context in the team and what have you. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1178" target="_blank" rel="noreferrer noopener">19.38</a>: <strong>But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1190" target="_blank" rel="noreferrer noopener">19.50</a>: The first thing I’m gonna do is I’m going to look in the room and I’m going to look at the titles of all the people in there, and I’m going to see if they have any subject-matter experts or if it’s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like. </p> <p>I see a lot of teams that don’t have this, and they end up building fairly brittle prompt systems. And then they can’t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It’s not how they’re used to building. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1238" target="_blank" rel="noreferrer noopener">20.38</a>: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about some of like the old Web 2.0 days and I’m sure you remember. . . 
Remember back at the beginning of the iOS app craze, we’d be at a dinner party and someone would find out that you’re capable of building an app, and you would get cornered by some guy who’s like “I’ve got a great idea for an app,” and he would just talk at you—usually a he. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1275" target="_blank" rel="noreferrer noopener">21.15</a>: <strong>This is back in the Objective-C days. . .</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1277" target="_blank" rel="noreferrer noopener">21.17</a>: Yes, way back when. And this is someone who loves Objective-C. So you’d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec, which usually gets specced out by the evals that it leads itself to more. And you can even see this. OpenAI is arguably creating, and at the forefront of, this stuff. And what are they doing? They’re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks because they can’t do it themselves. And so that’s the first thing. Got to work with the subject-matter expert. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1324" target="_blank" rel="noreferrer noopener">22.04</a>: The second thing is if they’re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of that framework is that they optimize the prompt for you with the help of an LLM and your eval data. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1357" target="_blank" rel="noreferrer noopener">22.37</a>: <strong>Throw in BAML?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1359" target="_blank" rel="noreferrer noopener">22.39</a>: BAML is similar [but it’s] more like the spec for how to describe the entire spec. So it’s similar.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1372" target="_blank" rel="noreferrer noopener">22.52</a>: <strong>BAML and TextGrad? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1375" target="_blank" rel="noreferrer noopener">22.55</a>: TextGrad is more like the prompt optimization I’m talking about. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1377" target="_blank" rel="noreferrer noopener">22.57</a>: <strong>TextGrad plus GEPA plus Regolo?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1382" target="_blank" rel="noreferrer noopener">23.02</a>: Yeah, those things are really important. And the reason I say they’re important is. .
.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1388" target="_blank" rel="noreferrer noopener">23.08</a>: <strong>I mean, Drew, those are kind of advanced topics. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1392" target="_blank" rel="noreferrer noopener">23.12</a>: I don’t think they’re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it’s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There’s a lot of things to like about them.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1413" target="_blank" rel="noreferrer noopener">23.33</a>: <strong>DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1421" target="_blank" rel="noreferrer noopener">23.41</a>: Well. . . I wouldn’t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1428" target="_blank" rel="noreferrer noopener">23.48</a>: The point here is, if it’s a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is as you build, you tend to have a lot of hotfix statements like, “Oh, there’s a bug over here. We’ll say it over here. Oh, that didn’t fix it. So let’s say it again.” It will encourage you to have one person who <em>really</em> understands this prompt. And so you end up being reliant on this prompt magician. Even though they’re written in English, there’s kind of no syntax highlighting. They get messier and messier as you build the application because they become these growing collections of edge cases.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1467" target="_blank" rel="noreferrer noopener">24.27</a>: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you’re doing it against one model, and then at some point there’s going to be a better, cheaper, more effective model. And you’re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.</p> <p>And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1503" target="_blank" rel="noreferrer noopener">25.03</a>: <strong>Although I think right now “agents” is our hot topic, right? So we talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay.
So their agents are really just prompts.” </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1516" target="_blank" rel="noreferrer noopener">25.16</a>: In the loop. . .</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1519" target="_blank" rel="noreferrer noopener">25.19</a>: <strong>So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1530" target="_blank" rel="noreferrer noopener">25.30</a>: Because that context is growing. As that loop goes, you’re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It’s very difficult. And you have to go at it systematically. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1546" target="_blank" rel="noreferrer noopener">25.46</a>: <strong>And then the problem is that, in many situations, the models are not even models that you control, actually. You’re using them through an API like OpenAI or Claude so you don’t actually have access to the weights. So even if you’re one of the super, super advanced teams that can do gradient descent and backprop, you can’t do that. Right? So then, what are your options for being more rigorous in doing optimization?</strong></p> <p><strong>Well, it’s precisely these tools that Drew alluded to, which is the TextGrads of the world, the GEPA. You have these compound systems that are nondifferentiable. So then how do you actually do optimization in a world where you have things that are not differentiable? Right. So these are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art to something with a little more discipline. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1613" target="_blank" rel="noreferrer noopener">26.53</a>: And I think a good example is, even if you aren’t going to use prompt optimization-type tools. . . The prompt optimization is a great solution for what you just described, which is when you can’t control the weights of the models you’re using. But the other thing too, is, even if you aren’t going to adopt that, you need to get evals because that’s going to be step one for anything, which is you need to start working with subject-matter experts to create evals.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1642" target="_blank" rel="noreferrer noopener">27.22</a>: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. 
And the reason they were arguing against evals is they’re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they’d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that’s insane to me. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1677" target="_blank" rel="noreferrer noopener">27.57</a>: And even if you’re doing that—which I think is a good thing and you may not go create coverage and evals, you have some taste. . . And I do think when you’re building more qualitative tools. . . So a good example is like if you’re Character.AI or you’re Portola Labs, who’s building essentially personalized emotional chatbots, it’s going to be harder to create evals and it’s going to require taste as you build them. But having evals is going to ensure that your whole thing doesn’t fall apart because you changed one sentence, which sadly is a risk because this is probabilistic software.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1713" target="_blank" rel="noreferrer noopener">28.33</a>: <strong>Honestly, evals are super important. Number one, because, basically, leaderboards like LMArena are great for narrowing your options. But at the end of the day, you still need to benchmark all of these against your own application use case and domain. And then secondly, obviously, it’s an ongoing thing. So it ties in with reliability. The more reliable your application is, the more likely you’re doing evals properly in an ongoing fashion. And I really believe that evals and reliability are a moat, because basically what else is your moat? Prompt? That’s not a moat. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1761" target="_blank" rel="noreferrer noopener">29.21</a>: So first off, violent agreement there. The only asset teams truly have—unless they’re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.</p> <p>We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn’t hold focus groups, didn’t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I’m the genius who comes in and tweaks these things and ships them. And that often takes you very far.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1813" target="_blank" rel="noreferrer noopener">30.13</a>: I also think it’s a bias of success. We only know about the ones that succeed, but the best ones, when they grow up and they start to serve an audience that’s way bigger than what they could hold in their head, they start to grow up with AB testing and ABX testing throughout their organization.
And a good example of that is Facebook.</p> <p>Facebook stopped being just some choices and started having to do testing and ABX testing in every aspect of their business. Compare that to Snap, which again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “He’s the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1864" target="_blank" rel="noreferrer noopener">31.04</a>: But again, that’s how you get the global reach. I think there’s a lot of people who probably are really great vibe shippers. And they’re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1894" target="_blank" rel="noreferrer noopener">31.34</a>: <strong>So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What’s a buzzword out there that either irks you or you think is undertalked about at this point? So what’s a buzzword out there, Drew? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1917" target="_blank" rel="noreferrer noopener">31.57</a>: [laughs] I mean, I wish you had given me some time to think about it. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1918" target="_blank" rel="noreferrer noopener">31.58</a>: <strong>We are in a hype cycle here. . .</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=1922" target="_blank" rel="noreferrer noopener">32.02</a>: We’re always in a hype cycle. I don’t like anthropomorphizing LLMs or AI for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, which means that we don’t have substantive conversations about these things, and we don’t learn how to build really well with them because we think they’re intelligent. We think they’re a PhD in your pocket. We think they’re all of these things and they’re not—they’re fundamentally different. </p> <p>I’m not against using the way we think the brain works for inspiration. That’s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it’s a PhD in your pocket, and here’s the benchmark to prove it—you’re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going. </p> <p>But I also think it just doesn’t help you build sustainable programs because you aren’t actually understanding how it works. You’re just kind of reducing it down to it. AGI is one of those things.
And superintelligence, but AGI especially.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2001" target="_blank" rel="noreferrer noopener">33.21</a>: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “<a href="https://en.wikipedia.org/wiki/A_Cyborg_Manifesto" target="_blank" rel="noreferrer noopener">A Cyborg Manifesto</a>” in the ’80s. She looks at tech and science history through a kind of feminist lens. You would just sit in that class and your mind would explode, and then at the end, you just have to sit there for like five minutes afterwards, just picking up the pieces. </p> <p>She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don’t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types and things that we understand to be, or we believed to be, important, but they’re not. And big data, another one that is very, very relevant. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2074" target="_blank" rel="noreferrer noopener">34.34</a>: <strong>That’s my handle on Twitter. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2095" target="_blank" rel="noreferrer noopener">34.55</a>: Yeah, there you go. So like it’s, you know, I fill it with Ben Lorica. That’s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn’t understand why it works and how well it worked.</p> <p>And so what happened is, if you looked at the news and you looked at people’s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren’t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don’t work.” NVIDIA bears said, “We’re not going to need any more data centers if it’s going to be this efficient.” And bulls said, “No, we’re going to need tons of them because it’s going to use everything.”</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2144" target="_blank" rel="noreferrer noopener">35.44</a>: And AGI is another term like that, which means everything and nothing. And when the point comes that we’ve reached it, it isn’t.
And compounding that is that it’s in the contract between OpenAI and Microsoft—I forget the exact term, but it’s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.</p> <p>And so it’s a very loaded definition right now that’s being debated back and forth and trying to figure out how to take [Open]AI into being a for-profit corporation. And Microsoft has a lot of leverage because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2188" target="_blank" rel="noreferrer noopener">36.28</a>: So because it’s going to be that thing, and you’ve seen Sam Altman come out and some days he talks about how LLMs are just software. Some days he talks about how it’s a PhD in your pocket, some days he talks about how we’ve already passed AGI, it’s already over. </p> <p>I think Nathan Lambert has some <a href="https://www.interconnects.ai/p/agi-is-what-you-want-it-to-be" target="_blank" rel="noreferrer noopener">great writing about how AGI is a mistake</a>. We shouldn’t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2223" target="_blank" rel="noreferrer noopener">37.03</a>: <strong>The way I think of it is, AGI is great for fundraising, let’s put it that way. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2228" target="_blank" rel="noreferrer noopener">37.08</a>: That’s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don’t want any regulation or if you <em>want</em> regulation—it’s kind of a fuzzy word. And that has some really good properties. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2243" target="_blank" rel="noreferrer noopener">37.23</a>: <strong>So I’ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term, which my friend Ion Stoica is now talking more about “systems engineering.” If you look at particularly the agentic applications, you’re talking about systems.</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2275" target="_blank" rel="noreferrer noopener">37.55</a>: Can I add one thing to this? Violent agreement. I think that is an underrated. . . </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2280" target="_blank" rel="noreferrer noopener">38.00</a>: <strong>Although I think it’s too boring a term, Drew, to take off.</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2283" target="_blank" rel="noreferrer noopener">38.03</a>: That’s fine! 
The reason I like it is because—and you were talking about this when you talk about fine-tuning—is, looking at the way people build and looking at the way I see teams with success build, there’s pretraining, where you’re basically training on unstructured data and you’re just building your base knowledge, your base English capabilities and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills, but you’re teaching reasoning, you’re teaching it validated functions like code and math. You’re teaching it how to chat with you. This is where it learns to converse. You’re teaching it how to use tools and specific sets of tools. And then you’re teaching it alignment, what’s safe, what’s not safe, all these other things. </p> <p>But then after it ships, you can still RL that model, you can still fine-tune that model, and you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing is, I think we’re going to see that posttraining all the way through to a final applied AI product. That’s going to be a real shades-of-gray gradient. It’s going to be. And this is one of the reasons why I think open models have a pretty big advantage in the future is that you’re going to dip down the way throughout that, leverage that. . .</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2372" target="_blank" rel="noreferrer noopener">39.32</a>: The only thing that’s keeping us from doing that now is we don’t have the tools and the operating system to align throughout that posttraining to shipping. Once we do, that operating system is going to change how we build, because the distance between posttraining and building is going to look really, really, really blurry. I really like the systems engineering type of approach, but I also think you can also start to see this yesterday [when] Thinking Machines released their first product.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2404" target="_blank" rel="noreferrer noopener">40.04</a>: And so Thinking Machines is Mira [Murati]. Her very hype thing. They launched their first thing, and it’s called Tinker. And it’s essentially, “Hey, you can write a very simple Python code, and then we will do the RL for you or the fine-tuning for you using our cluster of GPU so you don’t have to manage that.” And that is the type of thing that we want to see in a maturing kind of development framework. And you start to see this operating system emerging. </p> <p>And it reminds me of the early days of O’Reilly, where it’s like I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don’t have to. I can spin up a Docker image, I can ship to render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we’re going to see a similar evolution from that. And I’m really excited. And I think you have picked a great underrated term. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2456" target="_blank" rel="noreferrer noopener">40.56</a>: <strong>Now with that. Thank you, Drew. 
</strong><br><br><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Drew_Breunig.mp3#t=2458" target="_blank" rel="noreferrer noopener">40.58</a>: Awesome. Thank you for having me, Ben.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-context-engineering-with-drew-breunig/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>From Habits to Tools</title> <link>https://www.oreilly.com/radar/from-habits-to-tools/</link> <comments>https://www.oreilly.com/radar/from-habits-to-tools/#respond</comments> <pubDate>Wed, 15 Oct 2025 12:49:38 +0000</pubDate> <dc:creator><![CDATA[Andrew Stellman]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17557</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-colorful-drops_Otherworldly.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Abstract-colorful-drops_Otherworldly-160x160.jpg" width="160" height="160" /> <custom:subtitle><![CDATA[The Future of AI-Assisted Development]]></custom:subtitle> <description><![CDATA[This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI. AI-assisted coding is here to stay. I’ve seen many companies now require all developers to install Copilot extensions in their IDEs, and teams are increasingly being measured on AI-adoption metrics. Meanwhile, the tools themselves have become genuinely […]]]></description> <content:encoded><![CDATA[<p class="has-cyan-bluish-gray-background-color has-background"><em>This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI.</em></p> <p>AI-assisted coding is here to stay. I’ve seen many companies now require all developers to install Copilot extensions in their IDEs, and teams are increasingly being measured on AI-adoption metrics. Meanwhile, the tools themselves have become genuinely useful for routine tasks: Developers regularly use them to generate boilerplate, convert between formats, write unit tests, and explore unfamiliar APIs—giving us more time to focus on solving our real problems instead of wrestling with syntax or going down research rabbit holes.</p> <p>Many team leads, managers, and instructors looking to help developers ramp up on AI tools assume the biggest challenge is learning to write better prompts or picking the right AI tool; that assumption misses the point. The real challenge is figuring out how developers can use these tools in ways that keep them engaged and strengthen their skills instead of becoming disconnected from the code and letting their development skills atrophy.</p> <p>This was the challenge I took on when I developed the Sens-AI Framework. When I was updating <a href="https://learning.oreilly.com/library/view/head-first-c/9781098141776/" target="_blank" rel="noreferrer noopener"><em>Head First C#</em></a> (O’Reilly 2024) to help readers ramp up on AI skills alongside other fundamental development skills, I watched new learners struggle not with the mechanics of prompting but with maintaining their understanding of the code they were producing. 
The framework emerged from those observations—five habits that keep developers engaged in the design conversation: context, research, framing, refining, and critical thinking. These habits address the real issue: making sure the developer stays in control of the work, understanding not just what the code does but why it’s structured that way.</p> <h2 class="wp-block-heading"><strong>What We’ve Learned So Far</strong></h2> <p>When I updated <em>Head First C# </em>to include AI exercises, I had to design them knowing learners would paste instructions directly into AI tools. That forced me to be deliberate: The instructions had to guide the learner while also shaping how the AI responded. Testing those same exercises against Copilot and ChatGPT showed the same kinds of problems over and over—AI filling in gaps with the wrong assumptions or producing code that looked fine until you actually had to run it, read and understand it, or modify and extend it.</p> <p>Those issues don’t only trip up new learners. More experienced developers can fall for them too. The difference is that experienced developers already have habits for catching themselves, while newer developers usually don’t—unless we make a point of teaching them. AI skills aren’t exclusive to senior or experienced developers either; I’ve seen relatively new developers develop their AI skills quickly because they’ve built these habits quickly.</p> <h2 class="wp-block-heading"><strong>Habits Across the Lifecycle</strong></h2> <p>In “<a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">The Sens-AI Framework</a>,” I introduced the five habits and explained how they work together to keep developers engaged with their code rather than becoming passive consumers of AI output. These habits also address specific failure modes, and understanding how they solve real problems points the way toward broader implementation across teams and tools:</p> <p><strong>Context</strong> helps avoid vague prompts that lead to poor output. Ask an AI to “make this code better” without sharing what the code does, and it might suggest adding comments to a performance-critical section where comments would just clutter. But provide the context—“This is a high-frequency trading system where microseconds matter,” along with the actual code structure, dependencies, and constraints—and the AI understands it should focus on optimizations, not documentation.</p> <p><strong>Research</strong> makes sure the AI isn’t your only source of truth. When you rely solely on AI, you risk compounding errors—the AI makes an assumption, you build on it, and soon you’re deep in a solution that doesn’t match reality. Cross-checking with documentation or even asking a different AI can reveal when you’re being led astray.</p> <p><strong>Framing</strong> is about asking questions that set up useful answers. “How do I handle errors?” gets you a try-catch block. “How do I handle network timeout errors in a distributed system where partial failures need rollback?” gets you circuit breakers and compensation patterns. As I showed in “<a href="https://www.oreilly.com/radar/understanding-the-rehash-loop/" target="_blank" rel="noreferrer noopener">Understanding the Rehash Loop</a>,” proper framing can break the AI out of circular suggestions.</p> <p><strong>Refining</strong> means not settling for the first thing the AI gives you. The first response is rarely the best—it’s just the AI’s initial attempt. 
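</p>
<p>One way to picture refining as a loop rather than a single follow-up question is the sketch below; the <code>call_llm</code> placeholder, the checklist wording, and the two-round cutoff are all assumptions for illustration:</p>
<pre class="wp-block-code"><code># A bare-bones refining loop: draft, critique against your own checklist, and
# redraft with the critique attached. call_llm is a placeholder client, and the
# checklist and two-round limit are arbitrary choices for the sketch.
def call_llm(prompt):
    raise NotImplementedError("plug in your model client here")

CHECKLIST = (
    "Does it handle the failure cases that matter for this project? "
    "Does it follow our error-handling and naming conventions? "
    "Is there anything here we would not want to maintain?"
)

def refine(task, rounds=2):
    draft = call_llm(task)
    for _ in range(rounds):
        critique = call_llm("Review this work against the checklist.\n"
                            + CHECKLIST + "\n\n" + draft)
        draft = call_llm(task + "\n\nRevise the draft below using this review.\n"
                         + "Review: " + critique + "\n\nDraft:\n" + draft)
    return draft</code></pre>
<p>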
When you iterate, you’re steering toward better patterns. Refining moves you from “This works” to “This is actually good.”</p> <p><strong>Critical thinking</strong> ties it all together, asking whether the code actually works for your project. It’s debugging the AI’s assumptions, reviewing for maintainability, and asking, “Will this make sense six months from now?”</p> <p>The real power of the Sens-AI Framework comes from using all five habits together. They form a reinforcing loop: Context informs research, research improves framing, framing guides refinement, refinement reveals what needs critical thinking, and critical thinking shows you what context you were missing. When developers use these habits in combination, they stay engaged with the design and engineering process rather than becoming passive consumers of AI output. It’s the difference between using AI as a crutch and using it as a genuine collaborator.</p> <h2 class="wp-block-heading"><strong>Where We Go from Here</strong></h2> <p>If developers are going to succeed with AI, these habits need to show up beyond individual workflows. They need to become part of:</p> <p><strong>Education</strong>: <em>Teaching AI literacy alongside basic coding skills.</em> As I described in “<a href="https://www.oreilly.com/radar/the-ai-teaching-toolkit-practical-guidance-for-teams/" target="_blank" rel="noreferrer noopener">The AI Teaching Toolkit</a>,” techniques like having learners debug intentionally flawed AI output help them spot when the AI is confidently wrong and practice breaking out of rehash loops. These aren’t advanced skills; they’re foundational.</p> <p><strong>Team practice</strong>: <em>Using code reviews, pairing, and retrospectives to evaluate AI output the same way we evaluate human-written code.</em> In my teaching article, I described techniques like AI archaeology and shared language patterns. What matters here is making those kinds of habits part of standard training—so teams develop vocabulary like “I’m stuck in a rehash loop” or “The AI keeps defaulting to the old pattern.” And as I explored in “<a href="https://www.oreilly.com/radar/trust-but-verify/" target="_blank" rel="noreferrer noopener">Trust but Verify</a>,” treating AI-generated code with the same scrutiny as human code is essential for maintaining quality.</p> <p><strong>Tooling</strong>: <em>IDEs and linters that don’t just generate code but highlight assumptions and surface design trade-offs.</em> Imagine your IDE warning: “Possible rehash loop detected: you’ve been iterating on this same approach for 15 minutes.” That’s one direction IDEs need to evolve—surfacing assumptions and warning when you’re stuck. The technical debt risks I outlined in “<a href="https://www.oreilly.com/radar/building-ai-resistant-technical-debt/" target="_blank" rel="noreferrer noopener">Building AI-Resistant Technical Debt</a>” could be mitigated with better tooling that catches antipatterns early.</p> <p><strong>Culture</strong>: <em>A shared understanding that AI is a collaboration tool (and not a teammate)</em>. A team’s measure of success for code shouldn’t revolve around AI. Teams still need to understand that code, keep it maintainable, and grow their own skills along the way. Getting there will require changes in how they work together—for example, adding AI-specific checks to code reviews or developing shared vocabulary for when AI output starts drifting.
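</p>
<p>As one small and entirely hypothetical illustration of what an AI-specific review check could look like, a team might keep its checklist in the repository and have CI surface it on any change flagged as AI-assisted; the label, the items, and the <code>post_comment</code> hook below are placeholders, not a prescribed tool:</p>
<pre class="wp-block-code"><code># A hypothetical sketch of an AI-specific review check. The label name, the
# checklist items, and the post_comment hook are placeholders a team would
# define for itself; nothing here is tied to a particular code host.
AI_REVIEW_CHECKLIST = [
    "Can the author explain why the code is structured this way?",
    "Do the tests assert behavior, or just restate the implementation?",
    "Does it duplicate an existing utility instead of reusing one?",
    "Will this still make sense to a reviewer in six months?",
]

def review_reminder(pr_labels, post_comment):
    """Post the team's checklist on pull requests flagged as AI-assisted."""
    if "ai-assisted" in pr_labels:
        body = "AI-assisted change. Before approving, check:\n" + "\n".join(
            "- " + item for item in AI_REVIEW_CHECKLIST
        )
        post_comment(body)

# Example wiring: in CI you would pass the real labels and a function that
# posts the comment through your code host's API.
review_reminder(["ai-assisted"], post_comment=print)</code></pre>
<p>The automation matters less than the habit it encodes: the team’s shared expectations become visible at the moment of review.</p>
<p>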
This cultural shift connects to the requirements engineering parallels I explored in “<a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>”—we need the same clarity and shared understanding with AI that we’ve always needed with human teams.</p> <p><strong>More convincing output will require more sophisticated evaluation.</strong> Models will keep getting faster and more capable. What won’t change is the need for developers to think critically about the code in front of them.</p> <p>The Sens-AI habits work alongside today’s tools and are designed to stay relevant to tomorrow’s tools as well. They’re practices that keep developers in control, even as models improve and the output gets harder to question. The framework gives teams a way to talk about both the successes and the failures they see when using AI. From there, it’s up to instructors, tool builders, and team leads to decide how to put those lessons into practice.</p> <p>The next generation of developers will never know coding without AI. Our job is to make sure they build lasting engineering habits alongside these tools—so AI strengthens their craft rather than hollowing it out.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/from-habits-to-tools/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>Magic Words: Programming the Next Generation of AI Applications</title> <link>https://www.oreilly.com/radar/magic-words-programming-the-next-generation-of-ai-applications/</link> <comments>https://www.oreilly.com/radar/magic-words-programming-the-next-generation-of-ai-applications/#respond</comments> <pubDate>Wed, 15 Oct 2025 10:06:50 +0000</pubDate> <dc:creator><![CDATA[Tim O’Reilly]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17539</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Cluny_-_Mero_-_Croix-Talisman_motifs_magiques_base_sur_Abracadabra_-_VIe-VII_siecle-_Ag_nielle-scaled.jpg" medium="image" type="image/jpeg" width="2560" height="1467" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Cluny_-_Mero_-_Croix-Talisman_motifs_magiques_base_sur_Abracadabra_-_VIe-VII_siecle-_Ag_nielle-160x160.jpg" width="160" height="160" /> <description><![CDATA[“Strange was obliged to invent most of the magic he did, working from general principles and half-remembered stories from old books.” — Susanna Clarke, Jonathan Strange & Mr Norrell Fairy tales, myths, and fantasy fiction are full of magic spells. You say “abracadabra” and something profound happens.1 Say “open sesame” and the door swings open. […]]]></description> <content:encoded><![CDATA[<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p class="has-cyan-bluish-gray-background-color has-background"><em>“Strange was obliged to invent most of the magic he did, working from general principles and half-remembered stories from old books.”</em><br><br><em>— </em>Susanna Clarke,<em> Jonathan Strange & Mr Norrell</em></p></blockquote> <p>Fairy tales, myths, and fantasy fiction are full of magic spells. 
You say “abracadabra” and something profound happens.<sup>1</sup> Say “open sesame” and the door swings open.</p> <p>It turns out that this is also a useful metaphor for what happens with large language models.</p> <p>I first got this idea from David Griffiths’s O’Reilly course on <a href="https://learning.oreilly.com/live-events/using-generative-ai-to-boost-your-personal-productivity/0636920099736/" target="_blank" rel="noreferrer noopener">using AI to boost your productivity</a>. He gave a simple example. You can tell ChatGPT “Organize my task list using the Eisenhower four-sided box.” And it just knows what to do, even if you yourself know nothing about General Dwight D. Eisenhower’s approach to decision making. David then suggests his students instead try “Organize my task list using Getting Things Done,” or just “Use GTD.” Each of those phrases is shorthand for systems of thought, practices, and conventions that the model has learned from human culture.</p> <p>These are magic words. They’re magic not because they do something unworldly and unexpected but because they have the power to summon patterns that have been encoded in the model. The words act as keys, unlocking context and even entire workflows.</p> <p>We all use magic words in our prompts. We say something like “Update my <em>resume</em>” or “Draft a <em>Substack post</em>” without thinking how much detailed prompting we’d have to do to create that output if the LLM didn’t already know the magic word.</p> <p>Every field has a specialized language whose terms are known only to its initiates. We can be fanciful and pretend they are magic spells, but the reality is that each of them is really a kind of <strong>fuzzy function call </strong>to an LLM, bringing in a body of context and unlocking a set of behaviors and capabilities. When we ask an LLM to write a program in <em>Javascript </em>rather than <em>Python</em>, we are using one of these fuzzy function calls. When we ask for output as an <em>.md</em> file, we are doing the same. Unlike a function call in a traditional programming language, it doesn’t always return the same result, which is why developers have an opportunity to enhance the magic.</p> <h2 class="wp-block-heading"><strong>From Prompts to Applications</strong></h2> <p>The next light bulb went off for me in a conversation with Claire Vo, the creator of an AI application called <a href="http://chatprd.ai" target="_blank" rel="noreferrer noopener">ChatPRD</a>. Claire spent years as a product manager, and as soon as ChatGPT became available, began using it to help her write product requirement documents or PRDs. Every product manager knows what a PRD is. When Claire prompted ChatGPT to “write a PRD,” it didn’t need a long preamble. That one acronym carried decades of professional practice. But Claire went further. She refined her prompts, improved them, and taught ChatGPT how to think like her. Over time, she had trained a system, not at the model level, but at the level of context and workflow.</p> <p>Next, Claire turned her workflow into a product. That product is a software interface that wraps up a number of related magic words into a useful package. It controls access to her customized magic spell, so to speak. Claire added detailed prompts, integrations with other tools, access control, and a whole lot of traditional programming in a next-generation application that uses a mix of traditional software code and “magical” fuzzy function calls to an LLM. 
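</p> <p>To make the idea of a fuzzy function call concrete, here is a minimal sketch of the pattern, assuming a hypothetical <code>llm</code> callable (any client that takes a prompt and returns text) and an invented <code>write_prd</code> helper. It illustrates the shape of the pattern, not ChatPRD’s actual implementation:</p> <pre class="wp-block-code"><code># A magic word ("PRD") wrapped in ordinary code. The llm parameter is a
# hypothetical stand-in for whatever model client you already use.

def write_prd(llm, product_idea: str, audience: str = "product managers") -> str:
    """Invoke the magic word 'PRD' with enough context to get a useful draft."""
    prompt = (
        "You are an experienced product manager.\n"
        f"Write a PRD for: {product_idea}\n"
        f"Audience: {audience}\n"
        "Include sections for problem, goals, non-goals, requirements, and open questions."
    )
    return llm(prompt)

# Usage (with any client): draft = write_prd(call_my_model, "an internal OKR tracker")</code></pre> <p>The traditional code supplies the structure and the magic word supplies the learned behavior; because the call is fuzzy, the same inputs can return different results, which is exactly the opportunity developers have to enhance the magic. <p>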
ChatPRD even interviews users to learn more about their goals, customizing the application for each organization and use case.</p> <p>Claire’s <a href="https://www.chatprd.ai/blog/chatprd-quickstart-guide" target="_blank" rel="noreferrer noopener">quickstart guide to ChatPRD</a> is a great example of what a magic-word (fuzzy function call) application looks like.</p> <figure class="wp-block-embed is-type-rich is-provider-embed-handler wp-block-embed-embed-handler wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper"><iframe title="UPDATED ChatPRD Demo - Product Tour & Updated Features (2025)" width="500" height="281" src="https://www.youtube.com/embed/-V6bzSwYUZY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></div></figure> <p>You can also see how magic words are crafted into magic spells and how these spells are even part of the architecture of applications like Claude Code through the explorations of developers like Jesse Vincent and Simon Willison.</p> <p>In “<a href="https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-in-september-2025/" target="_blank" rel="noreferrer noopener">How I’m Using Coding Agents in September, 2025</a>,” Jesse first describes how his <a href="http://claude.md" target="_blank" rel="noreferrer noopener">claude.md</a> file provides a base prompt that “encodes a bunch of process documentation and rules that do a pretty good job keeping Claude on track.” And then his workflow calls on a bunch of specialized prompts he has created (i.e., “spells” that give clearer and more personalized meaning to specific magic words) like “brainstorm,” “plan,” “architect,” “implement,” “debug,” and so on. Note how inside these prompts, he may use additional magic words like DRY, YAGNI, and TDD, which refer to specific programming methodologies. For example, here’s his planning prompt (boldface mine):</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p><code>Great. I need your help to write out a comprehensive implementation plan.</code></p> <p><code>Assume that the engineer has zero context for our codebase and questionable</code><br><code>taste. document everything they need to know. which files to touch for each</code><br><code>task, code, testing, docs they might need to check. how to test it.give </code><br><code>them the whole plan as bite-sized tasks. <strong>DRY. YAGNI. TDD.</strong> <strong>frequent commits</strong>.</code></p> <p><code>Assume they are a skilled developer, but know almost nothing about our</code><br><code>toolset or problem domain. assume they don't know good test design</code> <code>very</code><br><code>well.</code></p> <p><code>please write out this plan, in full detail, into docs/plans/</code></p></blockquote> <p>But Jesse didn’t stop there. 
He built a project called <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a>, which uses Claude’s <a href="https://docs.claude.com/en/docs/claude-code/plugins" target="_blank" rel="noreferrer noopener">recently announced plug-in architecture</a> to “give Claude Code superpowers with a comprehensive skills library of proven techniques, patterns, and tools.” <a href="https://blog.fsck.com/2025/10/09/superpowers/" target="_blank" rel="noreferrer noopener">Announcing the project</a>, he wrote:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Skills are what give your agents Superpowers. The first time they really popped up on my radar was a few weeks ago when Anthropic rolled out improved Office document creation. When the feature rolled out, I went poking around a bit – I asked Claude to tell me all about its new skills. And it <a href="https://claude.ai/share/0fe5a9c0-4e5a-42a1-9df7-c5b7636dad92" target="_blank" rel="noreferrer noopener">was only too happy to dish</a>…. [Be sure to follow this link! – TOR]</p> <p>One of the first skills I taught Superpowers was <a href="https://raw.githubusercontent.com/obra/superpowers-skills/35c29f0fe22881149a991eca1276c148567a7c29/skills/meta/writing-skills/SKILL.md" target="_blank" rel="noreferrer noopener">How to create skills</a>. That has meant that when I wanted to do something like add git worktree workflows to Superpowers, it was a matter of describing how I wanted the workflows to go…and then Claude put the pieces together and added a couple notes to the existing skills that needed to clue future-Claude into using worktrees.</p></blockquote> <p>After reading Jesse’s post, Simon Willison did a bit more digging into the original document handling skills that Claude had announced and that had sparked Jesse’s brainstorm. He noted:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Skills are more than just prompts though: the repository also includes dozens of pre-written Python scripts for performing common operations.</p> <p> <a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/scripts/fill_fillable_fields.py" target="_blank" rel="noreferrer noopener">pdf/scripts/fill_fillable_fields.py</a> for example is a custom CLI tool that uses <a href="https://pypi.org/project/pypdf/">pypdf</a> to find and then fill in a bunch of PDF form fields, specified as JSON, then render out the resulting combined PDF.</p> <p>This is a really sophisticated set of tools for document manipulation, and I love that Anthropic have made those visible—presumably deliberately—to users of Claude who know how to ask for them.</p></blockquote> <p>You can see what’s happening here. Magic words are being enhanced and given a more rigorous definition, and new ones are being added to what, in fantasy tales, they call a “grimoire,” or book of spells. Microsoft calls such spells “<a href="https://paradox921.medium.com/amplifier-notes-from-an-experiment-thats-starting-to-snowball-ef7df4ff8f97" target="_blank" rel="noreferrer noopener">metacognitive recipes</a>,” a wonderful term that should get widely adopted, though in this article I’m going to stick with my fanciful analogy to magic.</p> <p>At O’Reilly, we’re working with a very different set of magic words. 
For example, we’re building a system for precisely targeted competency-based learning, through which our customers can skip what they already know, master what they need, and prove what they’ve learned. It also gives corporate learning system managers the ability to assign learning goals and to measure the ROI on their investment.</p> <p>It turns out that there are dozens of <em>learning frameworks</em> (and that is itself a magic word). In the design of our own specialized learning framework, we’re invoking Bloom’s taxonomy, SFIA, and the Dreyfus Model of Skill Acquisition. But when a customer says, “We love your approach, but we use LTEM,” we can invoke that framework instead. Every corporate customer also has its own specialized tech stack. So we are exploring how to use magic words to let whatever we build adapt dynamically not only to our end users’ learning needs but to the tech stack and to the learning framework that already exists at each company.</p> <p>That would be a nightmare if we had to support dozens of different learning frameworks using traditional processes. But the problem seems much more tractable if we are able to invoke the right magic words. That’s what I mean when I say that magic words are a crucial building block in the next generation of application programming.</p> <h2 class="wp-block-heading"><strong>The Architecture of Magic</strong></h2> <p>Here’s the important thing: Magic isn’t arbitrary. In every mythic tradition, it has structure, discipline, and cost. The magician’s power depends on knowing the right words, pronounced in the right way, with the right intent.</p> <p>The same is true for AI systems. The effectiveness of our magic words depends on context, grounding, and feedback loops that give the model reliable information about the world.</p> <p>That’s why I find the emerging ecosystem of AI applications so fascinating. It’s about providing the right context to the model. It’s about defining vocabularies, workflows, and roles that expose and make sense of the model’s abilities. It’s about turning implicit cultural knowledge into explicit systems of interaction.</p> <p>We’re only at the beginning. But just as early programmers learned to build structured software without spelling out exact machine instructions, today’s AI practitioners are learning to build structured reasoning systems out of fuzzy language patterns.</p> <p>Magic words aren’t just a poetic image. They’re the syntax of a new kind of computing. As people become more comfortable with LLMs, they will pass around the magic words they have learned as power user tricks. Meanwhile, developers will wrap more advanced capabilities around existing magic words and perhaps even teach the models new ones that haven’t yet had the time to accrete sufficient meaning through wide usage in the training set. Each application will be built around a shared vocabulary that encodes its domain knowledge. Back in 2022, Mike Loukides called these systems “<a href="https://www.oreilly.com/radar/formal-informal-languages/" target="_blank" rel="noreferrer noopener">formal informal languages</a>.” That is, they are spoken in human language, but do better when you apply a bit of rigor.</p> <p>And at least for the foreseeable future, developers will write “shims” between the magic words that control the LLMs and the more traditional programming tools and techniques that interface with existing systems, much as Claire did with ChatPRD. 
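</p> <p>Such a shim can be small. Here is a minimal sketch, assuming the same kind of hypothetical <code>llm</code> callable as above and an invented table of framework notes; it simply routes a customer’s preferred learning framework (one of the magic words above) into the prompt before handing the result back to conventional code:</p> <pre class="wp-block-code"><code># A sketch of a shim between magic words and traditional code. The framework
# notes and the llm callable are illustrative stand-ins, not O'Reilly's system.

FRAMEWORK_NOTES = {
    "Bloom's taxonomy": "Tag each goal with a cognitive level, from remember to create.",
    "SFIA": "Map each goal to an SFIA skill and responsibility level.",
    "LTEM": "Place each goal on the LTEM tiers, from attendance through transfer.",
}

def build_learning_plan(llm, framework: str, goals: list) -> str:
    """Invoke whichever learning framework (magic word) the customer already uses."""
    notes = FRAMEWORK_NOTES.get(framework, "")
    prompt = (
        f"Using {framework}, organize these learning goals into a plan.\n"
        f"{notes}\n"
        "Goals:\n" + "\n".join("- " + g for g in goals)
    )
    return llm(prompt)  # conventional code on either side; the model fills the fuzzy middle</code></pre> <p>Swapping LTEM for Bloom’s taxonomy becomes a change to a lookup table rather than a new integration, which is part of what makes the problem tractable. <p>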
But eventually we’ll see true AI to AI communication.</p> <p>Magic words and the spells built around them are only the beginning. Once people start using them in common, they become <em>protocols</em>. They define how humans and AI systems cooperate, and how AI systems cooperate with each other.</p> <p>We can already see this happening. Frameworks like LangChain or the Model Context Protocol (MCP) formalize how context and tools are shared. Teams build agentic workflows that depend on a common vocabulary of intent. What is an MCP server, after all, but a mapping of a fuzzy function call into a set of predictable tools and services available at a given endpoint?</p> <p>In other words, what was once a set of magic spells is becoming infrastructure. When enough people use the same magic words, they stop being magic and start being standards—the building blocks for the next generation of software.</p> <p>We can already see this progression with MCP. There are three distinct kinds of MCP servers. Some, like <a href="https://github.com/microsoft/playwright-mcp" target="_blank" rel="noreferrer noopener">Playwright MCP</a>, are designed to make it easier for AIs to interface with applications originally designed for interactive human use. Others, like the <a href="https://github.com/github/github-mcp-server" target="_blank" rel="noreferrer noopener">GitHub MCP Server</a>, are designed to make it easier for AIs to interface with existing APIs, that is, with interfaces originally designed to be called by traditional programs. But some are designed as a frontend for a true AI-to-AI conversation. Other protocols, like A2A, are already optimized for this third use case.</p> <p>But in each case, an MCP server is really a dictionary (or in magic terms, a spellbook) that explains the magic words that it understands and how to invoke them. As Jesse Vincent put it to me after reading a draft of this piece:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>The part that feels the most like magic spells is the part that most MCP authors do incredibly poorly. Each tool has a “description” field that tells the LLM how you use the tool. That description field is read and internalized by the LLM and changes how it behaves. Anthropic are particularly good at tool descriptions and most everybody else, in my experience, is…less good.</p></blockquote> <p>In many ways, publishing the prompts, tool descriptions, context, and skills that add functionality to LLMs may be a more important frontier of open source AI than open weights. It’s important that we treat our enhancements to magic words not as proprietary secrets but as shared cultural artifacts. The more open and participatory our vocabularies are, the more inclusive and creative the resulting ecosystem will be.</p> <hr class="wp-block-separator has-alpha-channel-opacity"/> <h2 class="wp-block-heading">Footnotes</h2> <ol class="wp-block-list"><li>While often associated today with stage magic and cartoons, this magic word was apparently used from Roman times as a healing spell. 
One proposed etymology suggests that it comes <a href="https://en.wikipedia.org/wiki/Abracadabra" target="_blank" rel="noreferrer noopener">from the Aramaic for “I create as I speak.”</a></li></ol> <p></p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/magic-words-programming-the-next-generation-of-ai-applications/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>Enlightenment</title> <link>https://www.oreilly.com/radar/enlightenment/</link> <comments>https://www.oreilly.com/radar/enlightenment/#respond</comments> <pubDate>Tue, 14 Oct 2025 11:03:06 +0000</pubDate> <dc:creator><![CDATA[Mike Loukides]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17534</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Student-sleeps-while-AI-works-1.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Student-sleeps-while-AI-works-1-160x160.jpg" width="160" height="160" /> <description><![CDATA[In a fascinating op-ed, David Bell, a professor of history at Princeton, argues that “AI is shedding enlightenment values.” As someone who has taught writing at a similarly prestigious university, and as someone who has written about technology for the past 35 or so years, I had a deep response. Bell’s is not the argument […]]]></description> <content:encoded><![CDATA[<p>In a fascinating op-ed, David Bell, a professor of history at Princeton, argues that “<a href="https://www.nytimes.com/2025/08/02/opinion/artificial-intelligence-enlightenment.html" target="_blank" rel="noreferrer noopener">AI is shedding enlightenment values</a>.” As someone who has taught writing at a similarly prestigious university, and as someone who has written about technology for the past 35 or so years, I had a deep response.</p> <p>Bell’s is not the argument of an AI skeptic. For his argument to work, AI has to be pretty good at reasoning and writing. It’s an argument about the nature of thought itself. Reading is thinking. Writing is thinking. Those are almost clichés—they even turn up in students’ assessments of <a href="https://lithub.com/what-happened-when-i-tried-to-replace-myself-with-chatgpt-in-my-english-classroom/" target="_blank" rel="noreferrer noopener">using AI in a college writing class</a>. It’s not a surprise to see these ideas in the 18th century, and only a bit more surprising to see how far Enlightenment thinkers took them. Bell writes:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p><em>The great political philosopher Baron de Montesquieu wrote: “One should never so exhaust a subject that nothing is left for readers to do. The point is not to make them read, but to make them think.” Voltaire, the most famous of the French “philosophes,” claimed, “The most useful books are those that the readers write half of themselves.”</em></p></blockquote> <p>And in the late 20th century, the great Dante scholar John Freccero would say to his classes “The text reads you”: How you read <a href="https://digitaldante.columbia.edu/dante/divine-comedy/" target="_blank" rel="noreferrer noopener"><em>The Divine Comedy</em></a> tells you who you are. You inevitably find your reflection in the act of reading.</p> <p>Is the use of AI an aid to thinking or a crutch or a replacement? 
If it’s either a crutch or a replacement, then we have to go back to Descartes’s “I think, therefore I am” and read it backward: What am I if I don’t think? What am I if I have offloaded my thinking to some other device? Bell points out that books guide the reader through the thinking process, while AI expects us to guide the process and all too often resorts to flattery. <a href="https://openai.com/index/sycophancy-in-gpt-4o/" target="_blank" rel="noreferrer noopener">Sycophancy isn’t limited to a few recent versions of GPT</a>; “That’s a great idea” has been a staple of AI chat responses since its earliest days. A dull sameness goes along with the flattery—the paradox of AI is that, for all the talk of general intelligence, it really doesn’t think better than we do. It can access a wealth of information, but it ultimately gives us (at best) an unexceptional average of what has been thought in the past. Books lead you through radically different kinds of thought. Plato is not Aquinas is not Machiavelli is not Voltaire (and for great insights on the transition from the fractured world of medieval thought to the fractured world of Renaissance thought, see Ada Palmer’s <a href="https://press.uchicago.edu/ucp/books/book/chicago/I/bo246135916.html" target="_blank" rel="noreferrer noopener"><em>Inventing the Renaissance</em></a>).</p> <p>We’ve been tricked into thinking that education is about preparing to enter the workforce, whether as a laborer who can plan how to spend his paycheck (readin’, writin’, ’rithmetic) or as a potential lawyer or engineer (Bachelor’s, Master’s, Doctorate). We’ve been tricked into thinking of schools as factories—just look at any school built in the 1950s or earlier, and compare it to an early 20th century manufacturing facility. Take the children in, process them, push them out. Evaluate them with exams that don’t measure much more than the ability to take exams—not unlike the benchmarks that the AI companies are constantly quoting. The result is that students who can read Voltaire or Montesquieu as a dialogue with their own thoughts, who could potentially make a breakthrough in science or technology, are rarities. They’re not the students our institutions were designed to produce; they have to struggle against the system, and frequently fail. As one elementary school administrator told me, “They’re handicapped, as handicapped as the students who come here with learning disabilities. But we can do little to help them.”</p> <p>So the difficult question behind Bell’s article is: How do we teach students to think in a world that will inevitably be full of AI, whether or not that AI looks like our current LLMs? In the end, education isn’t about collecting facts, duplicating the answers in the back of the book, or getting passing grades. It’s about learning to think. The educational system gets in the way of education, leading to short-term thinking. If I’m measured by a grade, I should do everything I can to optimize that metric. <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law" target="_blank" rel="noreferrer noopener">All metrics will be gamed</a>. Even if they aren’t gamed, metrics shortcut around the real issues.</p> <p>In a world full of AI, retreating to stereotypes like “AI is damaging” and “AI hallucinates” misses the point, and is a sure route to failure. What’s damaging isn’t the AI, but the set of attitudes that make AI just another tool for gaming the system. 
We need a way of thinking with AI, of arguing with it, of completing AI’s “book” in a way that goes beyond maximizing a score. In this light, so much of the discourse around AI has been misguided. I still hear people say that AI will save you from needing to know the facts, that you won’t have to learn the dark and difficult corners of programming languages—but as much as I personally would like to take the easy route, facts are the skeleton on which thinking is based. Patterns arise out of facts, whether those patterns are historical movements, scientific theories, or software designs. And errors are easily uncovered when you engage actively with AI’s output.</p> <p>AI can help to assemble facts, but at some point those facts need to be internalized. I can name a dozen (or two or three) important writers and composers whose best work came around 1800. What does it take to go from those facts to a conception of the Romantic movement? An AI could certainly assemble and group those facts, but would you then be able to think about what that movement meant (and continues to mean) for European culture? What are the bigger patterns revealed by the facts? And what would it mean for those facts and patterns to reside only within an AI model, without human comprehension? You need to know the shape of history, particularly if you want to think productively about it. You need to know the dark corners of your programming languages if you’re going to debug a mess of AI-generated code. Returning to Bell’s argument, the ability to find patterns is what allows you to complete Voltaire’s writing. AI can be a tremendous aid in finding those patterns, but as human thinkers, we have to make those patterns our own.</p> <p>That’s really what learning is about. It isn’t just collecting facts, though facts are important. Learning is about understanding and finding relationships and understanding how those relationships change and evolve. It’s about weaving the narrative that connects our intellectual worlds together. That’s enlightenment. AI can be a valuable tool in that process, as long as you don’t mistake the means for the end. It can help you come up with new ideas and new ways of thinking. Nothing says that you can’t have the kind of mental dialogue that Bell writes about with an AI-generated essay. ChatGPT may not be Voltaire, but not much is. But if you don’t have the kind of dialogue that lets you internalize the relationships hidden behind the facts, AI is a hindrance. We’re all prone to be lazy—intellectually and otherwise. What’s the point at which thinking stops? What’s the point at which knowledge ceases to become your own? Or, to go back to the Enlightenment thinkers, when do you stop writing your share of the book?</p> <p>That’s not a choice AI makes for you. 
It’s your choice.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/enlightenment/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>The Architect’s Dilemma</title> <link>https://www.oreilly.com/radar/the-architects-dilemma/</link> <comments>https://www.oreilly.com/radar/the-architects-dilemma/#respond</comments> <pubDate>Mon, 13 Oct 2025 11:22:35 +0000</pubDate> <dc:creator><![CDATA[Heiko Hotz]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Deep Dive]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17515</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Robot-Concierge-maximalism.jpg" medium="image" type="image/jpeg" width="960" height="747" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Robot-Concierge-maximalism-160x160.jpg" width="160" height="160" /> <custom:subtitle><![CDATA[Choosing Between Tools and Agents with MCP and A2A]]></custom:subtitle> <description><![CDATA[The agentic AI landscape is exploding. Every new framework, demo, and announcement promises to let your AI assistant book flights, query databases, and manage calendars. This rapid advancement of capabilities is thrilling for users, but for the architects and engineers building these systems, it poses a fundamental question: When should a new capability be a […]]]></description> <content:encoded><![CDATA[<p>The agentic AI landscape is exploding. Every new framework, demo, and announcement promises to let your AI assistant book flights, query databases, and manage calendars. This rapid advancement of capabilities is thrilling for users, but for the architects and engineers building these systems, it poses a fundamental question: When should a new capability be a simple, predictable <em>tool</em> (exposed via the Model Context Protocol, MCP) and when should it be a sophisticated, collaborative <em>agent</em> (exposed via the Agent2Agent Protocol, A2A)?</p> <p>The common advice is often circular and unhelpful: “Use MCP for tools and A2A for agents.” This is like telling a traveler that cars use motorways and trains use tracks, without offering any guidance on which is better for a specific journey. This lack of a clear mental model leads to architectural guesswork. Teams build complex conversational interfaces for tasks that demand rigid predictability, or they expose rigid APIs to users who desperately need guidance. The outcome is often the same: a system that looks great in demos but falls apart in the real world.</p> <p>In this article, I argue that the answer isn’t found by analyzing your service’s internal logic or technology stack. It’s found by looking outward and asking a single, fundamental question: Who is calling your product/service? By reframing the problem this way—as a user experience challenge first and a technical one second—the architect’s dilemma evaporates.</p> <p>This essay draws a line where it matters for architects: the line between MCP tools and A2A agents. I will introduce a clear framework, built around the “Vending Machine Versus Concierge” model, to help you choose the right interface based on your consumer’s needs. 
I will also explore failure modes, testing, and the powerful <em>Gatekeeper Pattern</em> that shows how these two interfaces can work together to create systems that are not just clever but truly reliable.</p> <h2 class="wp-block-heading"><strong>Two Very Different Interfaces</strong></h2> <p>MCP presents tools—named operations with declared inputs and outputs. The caller (a person, program, or agent) must already know what it wants, and provide a complete payload. The tool validates, executes once, and returns a result. If your mental image is a vending machine—insert a well-formed request, get a deterministic response—you’re close enough.</p> <p>A2A presents agents—goal-first collaborators that converse, plan, and act across turns. The caller expresses an outcome (“book a refundable flight under $450”), not an argument list. The agent asks clarifying questions, calls tools as needed, and holds onto session state until the job is done. If you picture a concierge—interacting, negotiating trade-offs, and occasionally escalating—you’re in the right neighborhood.</p> <p>Neither interface is “better.” They are optimized for different situations:</p> <ul class="wp-block-list"><li>MCP is fast to reason about, easy to test, and strong on determinism and auditability.</li> <li>A2A is built for ambiguity, long-running processes, and preference capture.</li></ul> <h2 class="wp-block-heading"><strong>Bringing the Interfaces to Life: A Booking Example</strong></h2> <p>To see the difference in practice, let’s imagine a simple task: booking a specific meeting room in an office.</p> <p><strong>The MCP “vending machine”</strong> expects a perfectly structured, machine-readable request for its book_room_tool. The caller must provide all necessary information in a single, valid payload:</p> <pre class="wp-block-code"><code>{ "jsonrpc": "2.0", "id": 42, "method": "tools/call", "params": { "name": "book_room_tool", "arguments": { "room_id": "CR-104B", "start_time": "2025-11-05T14:00:00Z", "end_time": "2025-11-05T15:00:00Z", "organizer": "user@example.com" } }}</code></pre> <p>Any deviation—a missing field or incorrect data type—results in an immediate error. This is the vending machine: You provide the exact code of the item you want (e.g., “D4”) or you get nothing.</p> <p><strong>The A2A “concierge</strong>,<strong>“</strong> an “office assistant” agent, is approached with a high-level, ambiguous goal. It uses conversation to resolve ambiguity:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p><strong>User:</strong> “Hey, can you book a room for my 1-on-1 with Alex tomorrow afternoon?”<br><strong>Agent:</strong> “Of course. To make sure I get the right one, what time works best, and how long will you need it for?”</p></blockquote> <p>The agent’s job is to take the ambiguous goal, gather the necessary details, and then likely call the MCP tool behind the scenes once it has a complete, valid set of arguments.</p> <p>With this clear dichotomy established—the predictable vending machine (MCP) versus the stateful concierge (A2A)—how do we choose? As I argued in the introduction, the answer isn’t found in your tech stack. 
It’s found by asking the most important architectural question of all: <strong>Who is calling your service?</strong></p> <h3 class="wp-block-heading"><strong>Step 1: Identify your consumer</strong></h3> <ol class="wp-block-list"><li><strong>The machine consumer: A need for predictability</strong><br>Is your service going to be called by another automated system, a script, or another agent acting in a purely deterministic capacity? This consumer requires absolute predictability. It needs a rigid, unambiguous contract that can be scripted and relied upon to behave the same way every single time. It cannot handle a clarifying question or an unexpected update; any deviation from the strict contract is a failure. <em>This consumer doesn’t want a conversation; it needs a vending machine.</em> This nonnegotiable requirement for a predictable, stateless, and transactional interface points directly to designing your service as a tool (MCP).</li> <li><strong>The human (or agentic) consumer: A need for convenience</strong><br>Is your service being built for a human end user or for a sophisticated AI that’s trying to fulfill a complex, high-level goal? This consumer values convenience and the offloading of cognitive load. They don’t want to specify every step of a process; they want to delegate ownership of a goal and trust that it will be handled. They’re comfortable with ambiguity because they expect the service—the agent—to resolve it on their behalf. <em>This consumer doesn’t want to follow a rigid script; they need a concierge. </em>This requirement for a stateful, goal-oriented, and conversational interface points directly to designing your service as an agent (A2A).</li></ol> <p>By starting with the consumer, the architect’s dilemma often evaporates. Before you ever debate statefulness or determinism, you first define the user experience you are obligated to provide. In most cases, identifying your customer will give you your definitive answer.</p> <h3 class="wp-block-heading"><strong>Step 2: Validate with the four factors</strong></h3> <p>Once you have identified who calls your service, you have a strong hypothesis for your design. A machine consumer points to a tool; a human or agentic consumer points to an agent. The next step is to validate this hypothesis with a technical litmus test. This framework gives you the vocabulary to justify your choice and ensure the underlying architecture matches the user experience you intend to create.</p> <ol class="wp-block-list"><li><strong>Determinism versus ambiguity</strong><br>Does your service require a precise, unambiguous input, or is it designed to interpret and resolve ambiguous goals? <em>A vending machine is deterministic.</em> Its API is rigid: <code>GET /item/D4</code>. Any other request is an error. This is the world of MCP, where a strict schema ensures predictable interactions. <em>A concierge handles ambiguity.</em> “Find me a nice place for dinner” is a valid request that the agent is expected to clarify and execute. This is the world of A2A, where a conversational flow allows for clarification and negotiation.</li> <li><strong>Simple execution versus complex process</strong><br>Is the interaction a single, one-shot execution, or a long-running, multistep process? <em>A vending machine performs a short-lived execution.</em> The entire operation—from payment to dispensing—is an atomic transaction that is over in seconds. This aligns with the synchronous-style, one-shot model of MCP. 
<em>A concierge manages a process.</em> Booking a full travel itinerary might take hours or even days, with multiple updates along the way. This requires the asynchronous, stateful nature of A2A, which can handle long-running tasks gracefully.</li> <li><strong>Stateless versus stateful</strong><br>Does each request stand alone or does the service need to remember the context of previous interactions? <em>A vending machine is stateless.</em> It doesn’t remember that you bought a candy bar five minutes ago. Each transaction is a blank slate. MCP is designed for these self-contained, stateless calls. <em>A concierge is stateful. </em>It remembers your preferences, the details of your ongoing request, and the history of your conversation. A2A is built for this, using concepts like a session or thread ID to maintain context.</li> <li><strong>Direct control versus delegated ownership</strong><br>Is the consumer orchestrating every step, or are they delegating the entire goal? <em>When using a vending machine, the consumer is in direct control.</em> You are the orchestrator, deciding which button to press and when. With MCP, the calling application retains full control, making a series of precise function calls to achieve its own goal. <em>With a concierge, you delegate ownership.</em> You hand over the high-level goal and trust the agent to manage the details. This is the core model of A2A, where the consumer offloads the cognitive load and trusts the agent to deliver the outcome.</li></ol> <figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Factor</strong></td><td><strong>Tool (MCP)</strong></td><td><strong>Agent (A2A)</strong></td><td><strong>Key question</strong></td></tr><tr><td><em>Determinism</em></td><td>Strict schema; errors on deviation</td><td>Clarifies ambiguity via dialogue</td><td>Can inputs be fully specified up front?</td></tr><tr><td><em>Process</em></td><td>One-shot</td><td>Multi-step/long-running</td><td>Is this atomic or a workflow?</td></tr><tr><td><em>State</em></td><td>Stateless</td><td>Stateful/sessionful</td><td>Must we remember context/preferences?</td></tr><tr><td><em>Control</em></td><td>Caller orchestrates</td><td>Ownership delegated</td><td>Who drives: the caller or callee?</td></tr></tbody></table></figure> <p><em>Table 1: Four question framework</em></p> <p>These factors are not independent checkboxes; they are four facets of the same core principle. A service that is deterministic, transactional, stateless, and directly controlled is a tool. A service that handles ambiguity, manages a process, maintains state, and takes ownership is an agent. By using this framework, you can confidently validate that the technical architecture of your service aligns perfectly with the needs of your customer.</p> <h3 class="wp-block-heading"><strong>No framework, no matter how clear…</strong></h3> <p>…can perfectly capture the messiness of the real world. While the “Vending Machine Versus Concierge” model provides a robust guide, architects will eventually encounter services that seem to blur the lines. The key is to remember the core principle we’ve established: The choice is dictated by the consumer’s experience, not the service’s internal complexity.</p> <p>Let’s explore two common edge cases.</p> <p><strong>The complex tool: The iceberg</strong><br>Consider a service that performs a highly complex, multistep internal process, like a video transcoding API. A consumer sends a video file and a desired output format. 
This is a simple, predictable request. But internally, this one call might kick off a massive, long-running workflow involving multiple machines, quality checks, and encoding steps. It’s a hugely complex process.</p> <p>However, from the consumer’s perspective, none of that matters. They made a single, stateless, fire-and-forget call. They don’t need to manage the process; they just need a predictable result. This service is like an iceberg: 90% of its complexity is hidden beneath the surface. But because its external contract is that of a vending machine—a simple, deterministic, one-shot transaction—it is, and should be, implemented as a tool (MCP).</p> <p><strong>The simple agent: The scripted conversation</strong><br>Now consider the opposite: a service with very simple internal logic that still requires a conversational interface. Imagine a chatbot for booking a dentist appointment. The internal logic might be a simple state machine: ask for a date, then a time, then a patient name. It’s not “intelligent” or particularly flexible.</p> <p>However, it must remember the user’s previous answers to complete the booking. It’s an inherently stateful, multiturn interaction. The consumer cannot provide all the required information in a single, prevalidated call. They need to be guided through the process. Despite its internal simplicity, the need for a stateful dialogue makes it a concierge. It must be implemented as an agent (A2A) because its consumer-facing experience is that of a conversation, however scripted.</p> <p>These gray areas reinforce the framework’s central lesson. Don’t get distracted by what your service does internally. Focus on the experience it provides externally. That contract with your customer is the ultimate arbiter in the architect’s dilemma.</p> <h2 class="wp-block-heading"><strong>Testing What Matters: Different Strategies for Different Interfaces</strong></h2> <p>A service’s interface doesn’t just dictate its design; it dictates how you validate its correctness. 
Vending machines and concierges have fundamentally different failure modes and require different testing strategies.</p> <p><strong>Testing MCP tools (vending machines):</strong></p> <ul class="wp-block-list"><li><strong>Contract testing:</strong> Validate that inputs and outputs strictly adhere to the defined schema.</li> <li><strong>Idempotency tests:</strong> Ensure that calling the tool multiple times with the same inputs produces the same result without side effects.</li> <li><strong>Deterministic logic tests:</strong> Use standard unit and integration tests with fixed inputs and expected outputs.</li> <li><strong>Adversarial fuzzing:</strong> Test for security vulnerabilities by providing malformed or unexpected arguments.</li></ul> <p><strong>Testing A2A agents (concierges):</strong></p> <ul class="wp-block-list"><li><strong>Goal completion rate (GCR):</strong> Measure the percentage of conversations where the agent successfully achieved the user’s high-level goal.</li> <li><strong>Conversational efficiency:</strong> Track the number of turns or clarifications required to complete a task.</li> <li><strong>Tool selection accuracy:</strong> For complex agents, verify that the right MCP tool was chosen for a given user request.</li> <li><strong>Conversation replay testing:</strong> Use logs of real user interactions as a regression suite to ensure updates don’t break existing conversational flows.</li></ul> <h2 class="wp-block-heading"><strong>The Gatekeeper Pattern</strong></h2> <p>Our journey so far has focused on a dichotomy: MCP or A2A, vending machine or concierge. But the most sophisticated and robust agentic systems do not force a choice. Instead, they recognize that these two protocols don’t compete with each other; they complement each other. The ultimate power lies in using them together, with each playing to its strengths.</p> <p>The most effective way to achieve this is through a powerful architectural choice we can call the Gatekeeper Pattern.</p> <p>In this pattern, a single, stateful A2A agent acts as the primary, user-facing entry point—the concierge. Behind this gatekeeper sits a collection of discrete, stateless MCP tools—the vending machines. The A2A agent takes on the complex, messy work of understanding a high-level goal, managing the conversation, and maintaining state. It then acts as an intelligent orchestrator, making precise, one-shot calls to the appropriate MCP tools to execute specific tasks.</p> <p>Consider a travel agent. A user interacts with it via A2A, giving it a high-level goal: “Plan a business trip to London for next week.”</p> <ul class="wp-block-list"><li>The travel agent (A2A) accepts this ambiguous request and starts a conversation to gather details (exact dates, budget, etc.).</li> <li>Once it has the necessary information, it calls a flight_search_tool (MCP) with precise arguments like origin, destination, and date.</li> <li>It then calls a hotel_booking_tool (MCP) with the required city, check_in_date, and room_type.</li> <li>Finally, it might call a currency_converter_tool (MCP) to provide expense estimates.</li></ul> <p>Each tool is a simple, reliable, and stateless vending machine. The A2A agent is the smart concierge that knows which buttons to press and in what order. This pattern provides several significant architectural benefits:</p> <ul class="wp-block-list"><li><strong>Decoupling:</strong> It separates the complex, conversational logic (the “how”) from the simple, reusable business logic (the “what”). 
The tools can be developed, tested, and maintained independently.</li> <li><strong>Centralized governance:</strong> The A2A gatekeeper is the perfect place to implement cross-cutting concerns. It can handle authentication, enforce rate limits, manage user quotas, and log all activity before a single tool is ever invoked.</li> <li><strong>Simplified tool design:</strong> Because the tools are just simple MCP functions, they don’t need to worry about state or conversational context. Their job is to do one thing and do it well, making them incredibly robust.</li></ul> <h2 class="wp-block-heading"><strong>Making the Gatekeeper Production-Ready</strong></h2> <p>Beyond its design benefits, the Gatekeeper Pattern is the ideal place to implement the operational guardrails required to run a reliable agentic system in production.</p> <ul class="wp-block-list"><li><strong>Observability:</strong> Each A2A conversation generates a unique trace ID. This ID must be propagated to every downstream MCP tool call, allowing you to trace a single user request across the entire system. Structured logs for tool inputs and outputs (with PII redacted) are critical for debugging.</li> <li><strong>Guardrails and security:</strong> The A2A Gatekeeper acts as a single point of enforcement for critical policies. It handles authentication and authorization for the user, enforces rate limits and usage quotas, and can maintain a list of which tools a particular user or group is allowed to call.</li> <li><strong>Resilience and fallbacks:</strong> The Gatekeeper must gracefully manage failure. When it calls an MCP tool, it should implement patterns like timeouts, retries with exponential backoff, and circuit breakers. Critically, it is responsible for the final failure state—escalating to a human in the loop for review or clearly communicating the issue to the end user.</li></ul> <p>The Gatekeeper Pattern is the ultimate synthesis of our framework. It uses A2A for what it does best—managing a stateful, goal-oriented process—and MCP for what it was designed for—the reliable, deterministic execution of a task.</p> <h2 class="wp-block-heading"><strong>Conclusion</strong></h2> <p>We began this journey with a simple but frustrating problem: the architect’s dilemma. Faced with the circular advice that “MCP is for tools and A2A is for agents,” we were left in the same position as a traveler trying to get to Edinburgh—knowing that cars use motorways and trains use tracks but with no intuition on which to choose for our specific journey.</p> <p>The goal was to build that intuition. We did this not by accepting abstract labels, but by reasoning from first principles. We dissected the protocols themselves, revealing how their core mechanics inevitably lead to two distinct service profiles: the predictable, one-shot “vending machine” and the stateful, conversational “concierge.”</p> <p>With that foundation, we established a clear, two-step framework for a confident design choice:</p> <ol class="wp-block-list"><li><strong>Start with your customer.</strong> The most critical question is not a technical one but an experiential one. A machine consumer needs the predictability of a vending machine (MCP). 
A human or agentic consumer needs the convenience of a concierge (A2A).</li> <li><strong>Validate with the four factors.</strong> Use the litmus test of determinism, process, state, and ownership to technically justify and solidify your choice.</li></ol> <p>Ultimately, the most robust systems will synthesize both, using the Gatekeeper Pattern to combine the strengths of a user-facing A2A agent with a suite of reliable MCP tools.</p> <p>The choice is no longer a dilemma. By focusing on the consumer’s needs and understanding the fundamental nature of the protocols, architects can move from confusion to confidence, designing agentic ecosystems that are not just functional but also intuitive, scalable, and maintainable.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/the-architects-dilemma/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>Everyday AI Agents</title> <link>https://www.oreilly.com/radar/everyday-ai-agents/</link> <comments>https://www.oreilly.com/radar/everyday-ai-agents/#respond</comments> <pubDate>Fri, 10 Oct 2025 11:30:16 +0000</pubDate> <dc:creator><![CDATA[David Michelson]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Events]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17512</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Everyday-AI-Agents.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Everyday-AI-Agents-160x160.jpg" width="160" height="160" /> <description><![CDATA[A common misconception about O’Reilly is that we cater only to the deeply technical learner. While we’re proud of our deep roots in the tech community, the breadth of our offerings, both in books and on our learning platform, has always aimed to reach a broader audience of tech-adjacent and tech-curious people who want to […]]]></description> <content:encoded><![CDATA[<p>A common misconception about O’Reilly is that we cater only to the deeply technical learner. While we’re proud of our deep roots in the tech community, the breadth of our offerings, both in books and on our learning platform, has always aimed to reach a broader audience of tech-adjacent and tech-curious people who want to learn new technologies and skills to improve how they work. For this audience, generative AI has opened up a world of new capabilities, making it possible to contribute to technical work that previously required coding knowledge or specialized expertise. As <a href="https://www.oreilly.com/radar/ai-and-programming-the-beginning-of-a-new-era/" target="_blank" rel="noreferrer noopener">Tim O’Reilly has put it,</a> “the addressable surface area of programming has gone up by orders of magnitude. There’s so much more to do and explore.”</p> <p>Over the last few years, many in this less technical audience have become adept at using chatbots in their daily lives for summarizing, writing, data analysis, automating tedious tasks and even prototyping. But this proficiency with chatbots is just the beginning. The underlying technology has evolved beyond simple conversations and outputs to power the next step: AI agents.</p> <p>While chatbots are great for answering questions and generating outputs, AI agents are designed to take action. They are proactive, goal-oriented, and can handle complex, multi-step tasks. 
If we’re often encouraged to think of chatbots as bright but overconfident interns, we can think of AI agents as competent direct reports you can hand an entire project to. They’ve been trained, understand their goals, can make decisions and employ tools to achieve their ends, all with minimal oversight. Across industries, agents are already handling real work, from automating software development to managing complex marketing campaigns and customer service calls. But there’s a gap. Many people who are comfortable with chatbots don’t yet see the path to harnessing the power of agents in their everyday work. They’ve heard the hype, but how can agents impact daily work? How do you get started?</p> <p>This is why we’ve created the October 23rd <a href="https://learning.oreilly.com/live-events/genai-superstream-everyday-ai-agents/0642572213459/" target="_blank" rel="noreferrer noopener">GenAI Superstream: Everyday AI Agents</a>. This event is designed to bridge that gap and show you how to move from simply chatting with AI to building and deploying AI agents that can become valuable co-workers. Kathy Pham (VP of AI at Workday) and Claire Vo (CEO at ChatPRD) will kick off the conference with a fireside chat about how agents are already changing work and why it matters. From there, we’ll get into the specifics. You’ll hear from Jacob Bank of Relay.app, who will help demystify agents and share real patterns for automating your work, and from April Dunnam of Microsoft, who will demonstrate how to build agents directly within Microsoft 365. You’ll also learn how agents can help designers enforce creative governance with Nadia Elinbabi of Lowes and manage complex product workflows with Aman Khan of Arize AI. David Griffiths of HereScreen will explain how thinking like a programmer—without needing to be one—can help you design more intelligent and flexible agents. Finally, Babak Hodjat, CTO of AI at Cognizant, will talk about how individual agents can evolve into multi-agent ecosystems that manage complex operations across an entire enterprise. Together, the goal of these presentations is to show vivid instances of real-world agents in action, inspiring you to imagine how you might use agents to augment your own abilities and work smarter.</p> <p>Democratization of technical capabilities is one of the key benefits of the current sea change ushered in by genAI. We believe that everyone, regardless of their technical background, should have the opportunity to participate in this transformation. Whether you’re new to agents, feel like your experimentation with agents has plateaued, or you just want a measured assessment of the hype, this GenAI Superstream is your chance to get informed, be inspired, and take the first steps toward building your own AI-powered future.
We hope you’ll join us.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/everyday-ai-agents/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item> <title>Control Codegen Spend</title> <link>https://www.oreilly.com/radar/control-codegen-spend/</link> <pubDate>Thu, 09 Oct 2025 11:19:19 +0000</pubDate> <dc:creator><![CDATA[Tim O'Brien]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17506</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Who-Should-Get-Paid-in-the-Age-of-AI.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Who-Should-Get-Paid-in-the-Age-of-AI-160x160.jpg" width="160" height="160" /> <custom:subtitle><![CDATA[Use the Right Model—Then Switch Back]]></custom:subtitle> <description><![CDATA[This article originally appeared on Medium. Tim O’Brien has given us permission to repost here on Radar. When you’re working with AI tools like Cursor or GitHub Copilot, the real power isn’t just having access to different models—it’s knowing when to use them. Some jobs are OK with Auto. Others need a stronger model. And […]]]></description> <content:encoded><![CDATA[<figure class="wp-block-table"><table class="has-cyan-bluish-gray-background-color has-background has-fixed-layout"><tbody><tr><td><em>This article originally appeared on </em><a href="https://medium.com/@tobrien/control-codegen-spend-use-the-right-model-then-switch-back-cf173753d0ae" target="_blank" rel="noreferrer noopener"><em>Medium</em></a><em>. Tim O’Brien has given us permission to repost here on Radar.</em></td></tr></tbody></table></figure> <p>When you’re working with AI tools like Cursor or GitHub Copilot, the real power isn’t just having access to different models—it’s knowing when to use them. Some jobs are OK with Auto. Others need a stronger model. And sometimes you should bail and switch if you continue spending money on a complex problem with a lower-quality model. If you don’t, you’ll waste both time and money.</p> <p>And this is the missing discussion in code generation. There are a few “camps” here; the majority of people writing about this appear to view this as a fantastical and fun “vibe coding” experience, and a few people out there are trying to use this technology to deliver real products. If you are in that last category, you’ve probably started to realize that you can spend a <em>fantastic</em> amount of money if you don’t have a strategy for model selection.</p> <p>Let’s make it very specific—if you sign up for Cursor and drop $20/month on a subscription using Auto and you are happy with the output, there’s not much to worry about. But if you are starting to run agents in parallel and are paying for token consumption atop a monthly subscription, this post will make sense. In my own experience, a single developer working alone can easily spend $200–$300/day (or four times that figure) if they are trying to tackle a project and have opted for the most expensive model.</p> <p><em>And—if you are a company and you give your developers unlimited access to these tools—get ready for some surprises.</em></p> <h2 class="wp-block-heading"><strong>My Escalation Ladder for Models…</strong></h2> <ol class="wp-block-list"><li><strong>Start here: Auto.</strong> Let Cursor route to a strong model with good capacity. 
If output quality degrades or the model gets stuck in a loop, escalate to a stronger model. (Cursor explicitly says Auto selects among premium models and will switch when output is degraded.)<br></li> <li><strong>Medium-complexity tasks: Sonnet 4/GPT‑5/Gemini.</strong> Use for focused tasks on a handful of files: robust unit tests, targeted refactors, API remodels.<br></li> <li><strong>Heavy lift: Sonnet 4 – 1 million. </strong>If I need to do something that requires more context, but I still don’t want to pay top dollar, I’ve been starting to move up to models that don’t quickly max out on context.<br></li> <li><strong>Ultraheavy lift: Opus 4/4.1.</strong> Use this when the task spans multiple projects or requires long context and careful reasoning, then <strong>switch back</strong> once the big move is done. (Anthropic positions Opus 4 as a deep‑reasoning, long‑horizon model for coding and agent workflows.)</li></ol> <p>Auto works fine, but there are times when you can sense that it’s selected the wrong model. If you use these models enough, you know when you’re looking at Gemini Pro output by its verbosity, or at the ChatGPT models by the way they go about solving a problem.</p> <p>I’ll admit that my heavy and ultraheavy choices here are biased towards the models I’ve had more experience with—your own experience might vary. Still, you should also have a similar escalation list. Start with Auto and only upgrade if you need to; otherwise, you are going to learn some lessons about how much this costs.</p> <h2 class="wp-block-heading"><strong>Watch Out for “Thinking” Model Costs</strong></h2> <p>Some models support explicit “thinking” (longer reasoning). Useful, but costlier. Cursor’s docs note that enabling thinking on specific Sonnet versions can count as <strong>two requests</strong> under team request accounting, and in the individual plans, the same idea translates to <strong>more tokens</strong> burned. In short, thinking mode is excellent—use it when you need it.</p> <p>And when do you need it? My rule of thumb here is that when I already understand what needs to be done, when I’m asking for a unit test to be polished or a method to be implemented in the pattern of another… I usually don’t need a thinking model. On the other hand, if I’m asking it to analyze a problem and propose various options for me to choose from, or (something I do often) when I’m asking it to challenge my decisions and play devil’s advocate, I will pay the premium for the best model.</p> <h2 class="wp-block-heading"><strong>Max Mode and When to Use It</strong></h2> <p>If you need giant context windows or extended reasoning (e.g., sweeping changes across 20+ files), <strong>Max Mode</strong> can help—but it will consume more usage. Make Max Mode a <strong>temporary tool</strong>, not your default. If you find yourself constantly requiring Max Mode to be turned on, there’s a good chance you are “overapplying” this technology.</p> <p>If it needs to consume a million tokens for hours on end? That’s usually a hint that you need another programmer. More on that later, but what I’ve seen too often are managers who think this is like the “vibe coding” they are witnessing. Spoiler alert: Vibe coding is that thing that people do in presentations because it takes five minutes to make a silly video game. It’s 100% not programming, and to use codegen, here’s the secret: You have to understand how to program.</p> <p>Max Mode and thinking models are not a shortcut, and neither are they a replacement for good programmers.
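</p> <p>Since neither one is a substitute for judgment, the best you can do is put rough numbers on what a heavy session costs and check them often. Here is a minimal sketch of the kind of back-of-the-envelope math I mean; the model labels, per-token prices, and hourly budget are placeholders I made up for illustration, so substitute whatever your plan actually charges:</p> <pre class="wp-block-code"><code># Rough per-million-token prices. These labels and numbers are placeholders;
# plug in the real rates for the models and plan you are actually using.
PRICE_PER_MTOK = {
    "auto": {"input": 1.00, "output": 4.00},
    "sonnet-thinking": {"input": 3.00, "output": 15.00},
    "opus-max": {"input": 15.00, "output": 75.00},
}

def session_cost(model, input_tokens, output_tokens):
    """Estimate the dollar cost of a session from its token counts."""
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

def should_switch_back(model, input_tokens, output_tokens, hours, hourly_budget=20.0):
    """Flag a session whose burn rate has outgrown the budget you set for it."""
    burn_rate = session_cost(model, input_tokens, output_tokens) / max(hours, 0.1)
    return burn_rate &gt; hourly_budget

# Example: an afternoon of parallel agents on the most expensive model.
cost = session_cost("opus-max", input_tokens=4_000_000, output_tokens=600_000)
print(f"Estimated spend so far: ${cost:.2f}")
print("Time to switch back down?", should_switch_back("opus-max", 4_000_000, 600_000, hours=2))
</code></pre> <p>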
If you think they are, you are going to be paying top dollar for code that will one day have to be rewritten by a good programmer using these same tools.</p> <h2 class="wp-block-heading"><strong>Most Important Tip: Watch Your Bill as It Happens</strong></h2> <p>The most important tip is to regularly monitor your utilization and usage fees in Cursor, since they appear within a minute or two of running something. You can see usage by the minute, the number of tokens consumed, and in some cases, how much you’re being charged beyond your subscription. Make a habit of checking a couple of times a day, especially during heavy sessions, and ideally every half hour. This helps you catch runaway costs—like spending $100 an hour—before they get out of hand, which is entirely possible if you’re running many parallel agents or doing resource-intensive work. Paying attention ensures you stay in control of both your usage and your bill.</p> <h2 class="wp-block-heading"><strong>Keep Track and Avoid Loops</strong></h2> <p>The other thing you need to do is keep track of what works and what doesn’t. Over time, you’ll notice it’s very easy to make mistakes, and the models themselves can sometimes fall into loops. You might give an instruction, and instead of resolving it, the system keeps running the same process again and again. If you’re not paying attention, you can burn through a lot of tokens—and a lot of money—without actually getting sound output. That’s why it’s essential to watch your sessions closely and be ready to interrupt if something looks like it’s stuck.</p> <p>Another pitfall is pushing the models beyond their limits. There are tasks they can’t handle well, and when that happens, it’s tempting to keep rephrasing the request and asking again, hoping for a better result. In practice, that often leads to the same cycle of failure, except you’re footing the bill for every attempt. Knowing where the boundaries are and when to stop is critical.</p> <p>A practical way to stay on top of this is to maintain a running diary of what worked and what didn’t. Record prompts, outcomes, and notes about efficiency so you can learn from experience instead of repeating expensive mistakes. Combined with keeping an eye on your live usage metrics, this habit will help you refine your approach and avoid wasting both time and money.</p>]]></content:encoded> </item> <item> <title>The AI Teaching Toolkit: Practical Guidance for Teams</title> <link>https://www.oreilly.com/radar/the-ai-teaching-toolkit-practical-guidance-for-teams/</link> <pubDate>Wed, 08 Oct 2025 11:12:34 +0000</pubDate> <dc:creator><![CDATA[Andrew Stellman]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17503</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/AI-Teaching-Toolkit.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/AI-Teaching-Toolkit-160x160.jpg" width="160" height="160" /> <description><![CDATA[Teaching developers to work effectively with AI means building habits that keep critical thinking active while leveraging AI’s speed. But teaching these habits isn’t straightforward. Instructors and team leads often find themselves needing to guide developers through challenges in ways that build confidence rather than short-circuit their growth. 
(See “The Cognitive Shortcut Paradox.”) There are […]]]></description> <content:encoded><![CDATA[<p>Teaching developers to work effectively with AI means building habits that keep critical thinking active while leveraging AI’s speed.</p> <p>But teaching these habits isn’t straightforward. Instructors and team leads often find themselves needing to guide developers through challenges in ways that build confidence rather than short-circuit their growth. (See “<a href="https://www.oreilly.com/radar/the-cognitive-shortcut-paradox/" target="_blank" rel="noreferrer noopener">The Cognitive Shortcut Paradox</a>.”) There are the regular challenges of working with AI:</p> <ul class="wp-block-list"><li>Suggestions that look correct while hiding subtle flaws</li> <li>Less experienced developers accepting output without questioning it</li> <li>AI producing patterns that don’t match the team’s standards</li> <li>Code that works but creates long-term maintainability headaches</li></ul> <p>The Sens-AI Framework (see “<a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">The Sens-AI Framework: Teaching Developers to Think with AI</a>”) was built to address these problems. It focuses on five habits—context, research, framing, refining, and critical thinking—that help developers use AI effectively while keeping learning and design judgment in the loop.</p> <p>This toolkit builds on and reinforces those habits by giving you concrete ways to integrate them into team practices. It’s designed for use whether you’re running a workshop, leading code reviews, or mentoring individual developers. The techniques that follow include practical teaching strategies, common pitfalls to avoid, reflective questions to deepen learning, and positive signs that show the habits are sticking.</p> <h2 class="wp-block-heading"><strong>Advice for Instructors and Team Leads</strong></h2> <p>The strategies in this toolkit can be used in classrooms, review meetings, design discussions, or one-on-one mentoring. They’re meant to help new learners, experienced developers, and teams have more open conversations about design decisions, context, and the quality of AI suggestions. The focus is on making review and questioning feel like a normal, expected part of everyday development.</p> <p><strong>Discuss assumptions and context explicitly. </strong>In code reviews or mentoring sessions, ask developers to talk about times when the AI gave them poor or unexpected results. Also try asking them to explain what they think the AI might have needed to know to produce a better answer, and where it might have filled in gaps incorrectly. Getting developers to articulate those assumptions helps spot weak points in design before they’re cemented into the code. (See “<a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>.”)</p> <p><strong>Encourage pairing or small-group prompt reviews. </strong>Make AI-assisted development collaborative, not siloed. Have developers on a team or students in a class share their prompts with each other, and talk through why they wrote them a certain way, just like they’d talk through design decisions in pair or mob programming.
This helps less experienced developers see how others approach framing and refining prompts.</p> <p><strong>Encourage researching idiomatic use of code.</strong> One thing that often holds back intermediate developers is not knowing the idioms of a specific framework or language. AI can help here—if they ask for the <em>idiomatic</em> way to do something, they see not just the syntax but also the patterns experienced developers rely on. That shortcut can speed up their understanding and make them more confident when working with new technologies.</p> <p>Here are two examples of how using AI to research idioms can help developers quickly adapt:</p> <ul class="wp-block-list"><li>A developer with deep experience writing microservices but little exposure to Spring Boot can use AI to see the idiomatic way to annotate a class with <code>@RestController</code> and <code>@RequestMapping</code>. They might also learn that Spring Boot favors constructor injection over field injection with <code>@Autowired</code>, or that <code>@GetMapping("/users")</code> is preferred over <code>@RequestMapping(method = RequestMethod.GET, value = "/users")</code>.</li> <li>A Java developer new to Scala might reach for <code>null</code> instead of Scala’s <code>Option</code> types—missing a core part of the language’s design. Asking the AI for the idiomatic approach surfaces not just the syntax but the philosophy behind it, guiding developers toward safer and more natural patterns.</li></ul> <p><strong>Help developers recognize rehash loops as meaningful signals. </strong>When the AI keeps circling the same broken idea, even developers who have experienced this many times may not realize they’re caught in a rehash loop. Teach them to recognize the loop as a signal that the AI has exhausted its context, and that it’s time to step back. That pause can lead to research, reframing the problem, or providing new information. For example, you might stop and say: “Notice how it’s circling the same idea? That’s our signal to break out.” Then demonstrate how to reset: open a new session, consult documentation, or try a narrower prompt. (See “<a href="https://www.oreilly.com/radar/understanding-the-rehash-loop/" target="_blank" rel="noreferrer noopener">Understanding the Rehash Loop</a>.”)</p> <p><strong>Research beyond AI.</strong> Help developers learn that when hitting walls, they don’t need to just tweak prompts endlessly. Model the habit of branching out: check official documentation, search Stack Overflow, or review similar patterns in your existing codebase. AI should be one tool among many. Showing developers how to diversify their research keeps them from looping and builds stronger problem-solving instincts.</p> <p><strong>Use failed projects as test cases. </strong>Bring in previous projects that ran into trouble with AI-generated code and revisit them with Sens-AI habits. Review what went right and wrong, talk about where it might have helped to break out of the vibe coding loop to do additional research, reframe the problem, and apply critical thinking. Work with the team to write down lessons you learned from the discussion. Holding a retrospective exercise like this lowers the stakes—developers are free to experiment and critique without slowing down current work. It’s also a powerful way to show how reframing, refining, and verifying could have prevented past issues. 
(See “<a href="https://www.oreilly.com/radar/building-ai-resistant-technical-debt/" target="_blank" rel="noreferrer noopener">Building AI-Resistant Technical Debt</a>.”)</p> <p><strong>Make refactoring part of the exercise. </strong>Help developers avoid the habit of deciding the code is finished when it runs and seems to work. Have them work with the AI to clean up variable names, reduce duplication, simplify overly complex logic, apply design patterns, and find other ways to prevent technical debt. By making evaluation and improvement explicit, you can help developers build the muscle memory that prevents passive acceptance of AI output. (See “<a href="https://www.oreilly.com/radar/trust-but-verify/" target="_blank" rel="noreferrer noopener">Trust but Verify</a>.”)</p> <h2 class="wp-block-heading"><strong>Common Pitfalls to Address with Teams</strong></h2> <p>Even with good intentions, teams often fall into predictable traps. Watch for these patterns and address them explicitly, because otherwise they can slow progress and mask real learning.</p> <p><strong>The completionist trap: </strong><em>Trying to read every line of AI output even when you’re about to regenerate it.</em> Teach developers it’s okay to skim, spot problems, and regenerate early. This helps them avoid wasting time carefully reviewing code they’ll never use, and reduces the risk of cognitive overload. The key is to balance thoroughness with pragmatism—they can start to learn when detail matters and when speed matters more.</p> <p><strong>The perfection loop: </strong><em>Endless tweaking of prompts for marginal improvements.</em> Try setting a limit on iteration—for example, if refining a prompt doesn’t get good results after three or four attempts, it’s time to step back and rethink. Developers need to learn that diminishing returns are a sign to change strategy, not to keep grinding, so energy that should go toward solving the problem doesn’t get lost in chasing minor refinements.</p> <p><strong>Context dumping:</strong> <em>Pasting entire codebases into prompts.</em> Teach scoping—What’s the minimum context needed for this specific problem? Help them anticipate what the AI needs, and provide the minimal context required to solve each problem. Context dumping can be especially problematic with limited context windows, where the AI literally can’t see all the code you’ve pasted, leading to incomplete or contradictory suggestions. Teaching developers to be intentional about scope prevents confusion and makes AI output more reliable.</p> <p><strong>Skipping the fundamentals: </strong><em>Using AI for extensive code generation before understanding basic software development concepts and patterns.</em> Ensure learners can solve simple development problems on their own (without the help of AI) before accelerating with AI on more complex ones. This helps reduce the risk of developers building a shallow knowledge platform that collapses under pressure. Fundamentals are what allow them to evaluate AI’s output critically rather than blindly trusting it.</p> <h2 class="wp-block-heading"><strong>AI Archaeology: A Practical Team Exercise for Better Judgment</strong></h2> <p>Have your team do an <strong>AI archaeology</strong> exercise. Take a piece of AI-generated code from the previous week and analyze it together. 
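</p> <p>The artifact doesn’t have to be elaborate. Even a small helper like the one below, a made-up example of typical assistant output rather than code from any real codebase, gives a team plenty to dig into:</p> <pre class="wp-block-code"><code>import time
import urllib.request

def fetch_user_data(user_id):
    """AI-generated-style helper: fetch a user record as raw JSON text."""
    url = "https://api.example.com/users/" + str(user_id)  # base URL is hard-coded
    for attempt in range(5):  # why five retries? nobody on the team chose that number
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")
        except Exception:  # every failure is swallowed silently, with no logging
            time.sleep(2 ** attempt)  # exponential backoff with no jitter
    return None  # callers have to remember to check for None
</code></pre> <p>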
More complex or nontrivial code samples work especially well because they tend to surface more assumptions and patterns worth discussing.</p> <p>Have each team member independently write down their own answers to these questions:</p> <ul class="wp-block-list"><li>What assumptions did the AI make?</li> <li>What patterns did it use?</li> <li>Did it make the right decision for our codebase?</li> <li>How would you refactor or simplify this code if you had to maintain it long-term?</li></ul> <p>Once everyone has had time to write, bring the group back together—either in a room or virtually—and compare answers. Look for points of agreement and disagreement. When different developers spot different issues, that contrast can spark discussion about standards, best practices, and hidden dependencies. Encourage the group to debate respectfully, with an emphasis on surfacing reasoning rather than just labeling answers as right or wrong.</p> <p>This exercise makes developers slow down and compare perspectives, which helps surface hidden assumptions and coding habits. By putting everyone’s observations side by side, the team builds a shared sense of what good AI-assisted code looks like.</p> <p>For example, the team might discover the AI consistently uses older patterns your team has moved away from or that it defaults to verbose solutions when simpler ones exist. Discoveries like that become teaching moments about your team’s standards and help calibrate everyone’s “code smell” detection for AI output. The retrospective format makes the whole exercise more friendly and less intimidating than real-time critique, which helps to strengthen everyone’s judgment over time.</p> <h2 class="wp-block-heading"><strong>Signs of Success</strong></h2> <p>Balancing pitfalls with positive indicators helps teams see what good AI practice looks like. When these habits take hold, you’ll notice developers:</p> <p><strong>Reviewing AI code with the same rigor as human-written code—but only when appropriate.</strong> When developers stop saying “the AI wrote it, so it must be fine” and start giving AI code the same scrutiny they’d give a teammate’s pull request, it demonstrates that the habits are sticking.</p> <p><strong>Exploring multiple approaches instead of accepting the first answer.</strong> Developers who use AI effectively don’t settle for the initial response. They ask the AI to generate alternatives, compare them, and use that exploration to deepen their understanding of the problem.</p> <p><strong>Recognizing rehash loops without frustration.</strong> Instead of endlessly tweaking prompts, developers treat rehash loops as signals to pause and rethink. This shows they’re learning to manage AI’s limitations rather than fight against them.</p> <p><strong>Sharing “AI gotchas” with teammates.</strong> Developers start saying things like “I noticed Copilot always tries this approach, but here’s why it doesn’t work in our codebase.” These small observations become collective knowledge that helps the whole team work together and with AI more effectively.</p> <p><strong>Asking “Why did the AI choose this pattern?” instead of just asking “Does it work?”</strong> This subtle shift shows developers are moving beyond surface correctness to reasoning about design. 
It’s a clear sign that critical thinking is active.</p> <p><strong>Bringing fundamentals into AI conversations:</strong> Developers who are working positively with AI tools tend to relate AI output back to core principles like readability, separation of concerns, or testability. This shows they’re not letting AI bypass their grounding in software engineering.</p> <p><strong>Treating AI failures as learning opportunities:</strong> When something goes wrong, instead of blaming the AI or themselves, developers dig into why. Was it context? Framing? A fundamental limitation? This investigative mindset turns problems into teachable moments.</p> <h2 class="wp-block-heading"><strong>Reflective Questions for Teams</strong></h2> <p>Encourage developers to ask themselves these reflective questions periodically. They slow the process just enough to surface assumptions and spark discussion. You might use them in training, pairing sessions, or code reviews to prompt developers to explain their reasoning. The goal is to keep the design conversation active, even when the AI seems to offer quick answers.</p> <ul class="wp-block-list"><li><strong>What does the AI need to know to do this well?</strong> (Ask this before writing any prompt.)</li> <li><strong>What context or requirements might be missing here?</strong> (Helps catch gaps early.)</li> <li><strong>Do you need to pause here and do some research?</strong> (Promotes branching out beyond AI.)</li> <li><strong>How might you reframe this problem more clearly for the AI?</strong> (Encourages clarity in prompts.)</li> <li><strong>What assumptions are you making about this AI output?</strong> (Surfaces hidden design risks.)</li> <li><strong>If you’re getting frustrated, is that a signal to step back and rethink?</strong> (Normalizes stepping away.)</li> <li><strong>Would it help to switch from reading code to writing tests to check behavior?</strong> (Shifts the lens to validation.)</li> <li><strong>Do these unit tests reveal any design issues or hidden dependencies?</strong> (Connects testing with design insight.)</li> <li><strong>Have you tried starting a new chat session or using a different AI tool for this research?</strong> (Models flexibility with tools.)</li></ul> <p>The goal of this toolkit is to help developers build the kind of judgment that keeps them confident with AI while still growing their core skills. When teams learn to pause, review, and refactor AI-generated code, they move quickly without losing sight of design clarity or long-term maintainability. These teaching strategies give developers the habits to stay in control of the process, learn more deeply from the work, and treat AI as a true collaborator in building better software. 
As AI tools evolve, these fundamental habits—questioning, verifying, and maintaining design judgment—will remain the difference between teams that use AI well and those that get used by it.</p>]]></content:encoded> </item> <item> <title>Radar Trends to Watch: October 2025</title> <link>https://www.oreilly.com/radar/radar-trends-to-watch-october-2025/</link> <pubDate>Tue, 07 Oct 2025 11:17:09 +0000</pubDate> <dc:creator><![CDATA[Mike Loukides]]></dc:creator> <category><![CDATA[Radar Trends]]></category> <category><![CDATA[Commentary]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17499</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-3.png" medium="image" type="image/png" width="1400" height="950" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-3-160x160.png" width="160" height="160" /> <custom:subtitle><![CDATA[Developments in Programming, Operations, Augmented Reality, and More]]></custom:subtitle> <description><![CDATA[This month we have two more protocols to learn. Google has announced the Agent Payments Protocol (AP2), which is intended to help agents to engage in ecommerce—it’s largely concerned with authenticating and authorizing parties making a transaction. And the Agent Client Protocol (ACP) is concerned with communications between code editors and coding agents. When implemented, […]]]></description> <content:encoded><![CDATA[<p>This month we have two more protocols to learn. Google has announced the Agent Payments Protocol (AP2), which is intended to help agents to engage in ecommerce—it’s largely concerned with authenticating and authorizing parties making a transaction. And the Agent Client Protocol (ACP) is concerned with communications between code editors and coding agents. When implemented, it would allow any code editor to plug in any compliant agent.</p> <p>All hasn’t been quiet on the virtual reality front. Meta has announced its new VR/AR glasses, with the ability to display images on the lenses along with capabilities like live captioning for conversations. They’re much less obtrusive than the previous generation of VR goggles.</p> <h2 class="wp-block-heading">AI</h2> <ul class="wp-block-list"><li>Suno has <a href="https://suno.com/studio-welcome" target="_blank" rel="noreferrer noopener">announced</a> an AI-driven digital audio workstation (DAW), a tool for enabling people to be creative with AI-generated music.</li> <li>Ollama has added its own <a href="https://docs.ollama.com/web-search" target="_blank" rel="noreferrer noopener">web search API</a>. Ollama’s search API can be used to augment the information available to models. </li> <li>GitHub Copilot now offers a command-line tool, <a href="https://github.blog/changelog/2025-09-25-github-copilot-cli-is-now-in-public-preview/" target="_blank" rel="noreferrer noopener">GitHub CLI</a>. It can use either Claude Sonnet 4 or GPT-5 as the backing model, though other models should be available soon. Claude 4 is the default.</li> <li>Alibaba has released <a href="https://qwen.ai/blog?id=87dc93fc8a590dc718c77e1f6e84c07b474f6c5a&from=home.latest-research-list" target="_blank" rel="noreferrer noopener">Qwen3-Max</a>, a trillion-plus parameter model. There are reasoning and nonreasoning variants, though the reasoning variant hasn’t yet been released. 
Alibaba <a href="https://thesequence.substack.com/p/the-sequence-radar-727-qwens-oneweek" target="_blank" rel="noreferrer noopener">also released</a> models for <a href="https://qwen.ai/blog?id=b4264e11fb80b5e37350790121baf0a0f10daf82&from=research.latest-advancements-list" target="_blank" rel="noreferrer noopener">speech-to-text</a>, <a href="https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list" target="_blank" rel="noreferrer noopener">vision-language</a>, <a href="https://qwen.ai/blog?id=4266edf7f3718f2d3fda098b3f4c48f3573215d0&from=home.latest-research-list" target="_blank" rel="noreferrer noopener">live translation</a>, and more. They’ve been busy. </li> <li>GitHub has launched its <a href="https://github.blog/ai-and-ml/github-copilot/meet-the-github-mcp-registry-the-fastest-way-to-discover-mcp-servers/" target="_blank" rel="noreferrer noopener">MCP Registry</a> to make it easier to discover MCP servers archived on GitHub. It’s also working with Anthropic and others to build an <a href="https://github.com/modelcontextprotocol/registry/" target="_blank" rel="noreferrer noopener">open source MCP registry</a>, which lists servers regardless of their origin and integrates with GitHub’s registry. </li> <li>DeepMind has published <a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf" target="_blank" rel="noreferrer noopener">version 3.0</a> of its <a href="https://deepmind.google/discover/blog/strengthening-our-frontier-safety-framework/" target="_blank" rel="noreferrer noopener">Frontier Safety Framework</a>, a framework for experimenting with AI-human alignment. They’re particularly interested in scenarios where the AI doesn’t follow a user’s directives, and in behaviors that can’t be traced to a specific reasoning chain.</li> <li>Alibaba has released the <a href="https://github.com/Alibaba-NLP/DeepResearch" target="_blank" rel="noreferrer noopener">Tongyi DeepResearch</a> reasoning model. Tongyi is a 30.5B parameter mixture-of-experts model, with 3.3B parameters active. More importantly, it’s fully open source, with no restrictions on how it can be used. </li> <li><a href="https://locallyai.app/" target="_blank" rel="noreferrer noopener">Locally AI</a> is an iOS app that lets you run large language models on your iPhone or iPad. It works offline; there’s no need for a network connection. </li> <li>OpenAI has added <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/chatgpt-now-gives-you-greater-control-over-gpt-5-thinking-model/" target="_blank" rel="noreferrer noopener">control over the “reasoning” process</a> to its GPT-5 models. Users can choose between four levels: Light (Pro users only), Standard, Extended, and Heavy (Pro only). </li> <li>Google has announced the <a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol" target="_blank" rel="noreferrer noopener">Agent Payments Protocol</a> (AP2), which facilitates purchases. It focuses on authorization (proving that it has the authority to make a purchase), authentication (proving that the merchant is legitimate), and accountability (in case of a fraudulent transaction).</li> <li><a href="https://www.dbreunig.com/2025/09/15/ai-adoption-at-work-play.html" target="_blank" rel="noreferrer noopener">Bring Your Own AI</a>: Employee adoption of AI greatly exceeds official IT adoption. 
We’ve seen this before, on technologies as different as the iPhone and open source.</li> <li>Alibaba has <a href="https://news.smol.ai/issues/25-09-11-qwen3-next/" target="_blank" rel="noreferrer noopener">released</a> the ponderously named <a href="https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list" target="_blank" rel="noreferrer noopener">Qwen3-Next-80B-A3B-Base</a>. It’s a mixture-of-experts model with a high ratio of active parameters to total parameters (3.75%). Alibaba claims that the model cost 1/10 as much to train and is 10 times faster than its previous models. If this holds up, Alibaba is winning on performance where it counts.</li> <li>Anthropic has announced a <a href="https://www.anthropic.com/news/create-files" target="_blank" rel="noreferrer noopener">major upgrade to Claude’s capabilities</a>. It can now execute Python scripts in a sandbox and can create Excel spreadsheets, PowerPoint presentations, PNG files, and other documents. You can upload files for it to analyze. And of course this comes with security risks.</li> <li>The <a href="https://guides.lib.uchicago.edu/c.php?g=1241077&p=9082322" target="_blank" rel="noreferrer noopener">SIFT</a> method—stop, investigate the source, find better sources, and trace quotes to their original context—is a way of structuring your use of AI output that will make you less vulnerable to misinformation. Hint: it’s not just for AI.</li> <li>OpenAI’s <a href="https://help.openai.com/en/articles/10169521-projects-in-chatgpt" target="_blank" rel="noreferrer noopener">Projects</a> feature is now available to <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/chatgpt-makes-projects-feature-free-adds-a-toggle-to-split-chat/" target="_blank" rel="noreferrer noopener">free</a> accounts. Projects is a set of tools for organizing conversations with the LLM. Projects are separate workspaces with their own custom instructions, independent memory, and context. They can be forked. Projects sounds something like Git for LLMs—a set of features that’s badly needed.</li> <li><a href="https://developers.googleblog.com/en/introducing-embeddinggemma/" target="_blank" rel="noreferrer noopener">EmbeddingGemma</a> is a new open weights embedding model (308M parameters) that’s designed to run on devices, requiring as little as 200 MB of memory.</li> <li>An <a href="https://arstechnica.com/science/2025/09/these-psychological-tricks-can-get-llms-to-respond-to-forbidden-prompts/" target="_blank" rel="noreferrer noopener">experiment</a> with GPT-4o-mini shows that language models can fall to psychological manipulation. Is this surprising? After all, they are trained on human output.</li> <li>“<a href="https://www.lukew.com/ff/entry.asp?2117" target="_blank" rel="noreferrer noopener">Platform Shifts Redefine Apps</a>”: AI is a new kind of platform and demands rethinking what applications mean and how they should work. 
Failure to do this rethinking may be why so many AI efforts fail.</li> <li><a href="https://mcpui.dev/" target="_blank" rel="noreferrer noopener">MCP-UI</a> is a protocol that allows MCP servers to <a href="https://thenewstack.io/how-mcp-ui-powers-shopifys-new-commerce-widgets-in-agents/" target="_blank" rel="noreferrer noopener">send React components</a> or Web Components to agents, allowing the agent to build an appropriate browser-based interface on the fly.</li> <li>The <a href="https://agentclientprotocol.com/overview/introduction" target="_blank" rel="noreferrer noopener">Agent Client Protocol</a> (ACP) is a new protocol that standardizes communications between code editors and coding agents. It’s currently supported by the Zed and Neovim editors, and by the Gemini CLI coding agent.</li> <li>Gemini 2.5 Flash is now using a <a href="https://blog.google/products/gemini/updated-image-editing-model/" target="_blank" rel="noreferrer noopener">new image generation model</a> that was internally known as “nano banana.” This new model can edit uploaded images, merge images, and maintain visual consistency across a series of images.</li></ul> <h2 class="wp-block-heading">Programming</h2> <ul class="wp-block-list"><li>Anthropic <a href="https://www.anthropic.com/news/enabling-claude-code-to-work-more-autonomously" target="_blank" rel="noreferrer noopener">released Claude Code 2.0</a>. New features include the ability to checkpoint your work, so that if a coding agent wanders off-course, you can return to a previous state. They have also added the ability to run tasks in the background, call hooks, and use subagents.</li> <li>Suno has <a href="https://suno.com/studio-welcome" target="_blank" rel="noreferrer noopener">announced</a> an AI-driven digital audio workstation (DAW), a tool for enabling people to be creative with AI-generated music.</li> <li>The Wasmer project has <a href="https://wasmer.io/posts/python-on-the-edge-powered-by-webassembly" target="_blank" rel="noreferrer noopener">announced</a> that it now has full Python support in the beta version of Wasmer Edge, its WebAssembly runtime for serverless edge deployment.</li> <li>Mitchell Hashimoto, founder of Hashicorp, has <a href="https://mitchellh.com/writing/libghostty-is-coming" target="_blank" rel="noreferrer noopener">promised</a> that a library for Ghostty (libghostty) is coming! This library will make it easy to embed a terminal emulator into an application. Perhaps more important, libghostty might standardize the code for terminal output across applications. </li> <li>There’s a new benchmark for agentic coding: <a href="https://quesma.com/blog/introducing-compilebench/" target="_blank" rel="noreferrer noopener">CompileBench</a>. CompileBench tests the ability of models to <a href="https://simonwillison.net/2025/Sep/22/compilebench/" target="_blank" rel="noreferrer noopener">solve complex problems in figuring out how to build code</a>. </li> <li>Apple is reportedly <a href="https://medium.com/@yashbatra11111/why-apple-is-quietly-rewriting-ios-in-a-language-youve-never-heard-of-2f70181df3bb" target="_blank" rel="noreferrer noopener">rewriting iOS in a new programming language</a>. Rust would be the obvious choice, but rumors are that it’s something of their own creation. Apple likes languages it can control. 
</li> <li><a href="https://www.oracle.com/news/announcement/oracle-releases-java-25-2025-09-16/" target="_blank" rel="noreferrer noopener">Java 25</a>, the latest long-term support release, has a number of new features that <a href="https://thenewstack.io/java-25-oracle-makes-java-easier-to-learn-ready-for-ai-development/" target="_blank" rel="noreferrer noopener">reduce the boilerplate</a> that makes Java difficult to learn. </li> <li><a href="https://luau.org/" target="_blank" rel="noreferrer noopener">Luau</a> is a new scripting language derived from Lua. It claims to be fast, small, and safe. It’s backward compatible with Version 5.1 of Lua.</li> <li>OpenAI has <a href="https://www.latent.space/p/gpt5-codex" target="_blank" rel="noreferrer noopener">launched</a> <a href="https://openai.com/index/introducing-upgrades-to-codex/" target="_blank" rel="noreferrer noopener">GPT-5 Codex</a>, its generation model trained specifically for software engineering. Codex is now available both in the CLI tool and through the API. It’s clearly intended to challenge Anthropic’s dominant coding tool, Claude Code.</li> <li>Do prompts belong in code repositories? We’ve argued that prompts should be archived. But <a href="https://towardsdatascience.com/why-your-prompts-dont-belong-in-git/" target="_blank" rel="noreferrer noopener">they don’t belong in a source code repo</a> like Git. There are better tools available.</li> <li>This is cool and different. A developer has <a href="https://joshfonseca.com/blogs/animal-crossing-llm" target="_blank" rel="noreferrer noopener">hacked</a> the 2001 game <em>Animal Crossing</em> so that the dialog is generated by LLM rather than coming from the game’s memory.</li> <li>There’s a new programming language, vibe-coded in its entirety with Claude. <a href="https://simonwillison.net/2025/Sep/9/cursed/" target="_blank" rel="noreferrer noopener">Cursed</a> is similar to Claude, but all the keywords are Gen Z slang. It’s not yet on the list, but it’s a worthy addition to <a href="https://esolangs.org/wiki/Main_Page" target="_blank" rel="noreferrer noopener">Esolang</a>. </li> <li>Claude Code is now integrated into the Zed editor (beta), using the <a href="https://agentclientprotocol.com/overview/introduction" target="_blank" rel="noreferrer noopener">Agent Client Protocol</a> <a href="https://agentclientprotocol.com/overview/introduction" target="_blank" rel="noreferrer noopener">(ACP)</a>. </li> <li>Ida Bechtle’s <a href="https://www.youtube.com/watch?v=GfH4QL4VqJ0" target="_blank" rel="noreferrer noopener">documentary on the history of Python</a>, complete with many interviews with Guido van Rossum, is a must-watch.</li></ul> <h2 class="wp-block-heading">Security</h2> <ul class="wp-block-list"><li>The first <a href="https://www.koi.security/blog/postmark-mcp-npm-malicious-backdoor-email-theft" target="_blank" rel="noreferrer noopener">malicious MCP server</a> has been found in the wild. Postmark-MCP, an MCP server for interacting with the Postmark application, suddenly (version 1.0.16) started sending copies of all the email it handles to its developer.</li> <li>I doubt this is the first time, but <a href="https://www.bleepingcomputer.com/news/security/malicious-rust-packages-on-cratesio-steal-crypto-wallet-keys/" target="_blank" rel="noreferrer noopener">supply chain security vulnerabilities have now hit Rust’s package management system, Crates.io</a>. Two packages that steal keys for cryptocurrency wallets have been found. 
It’s time to be careful about what you download.</li> <li><a href="https://embracethered.com/blog/posts/2025/cross-agent-privilege-escalation-agents-that-free-each-other/" target="_blank" rel="noreferrer noopener">Cross-agent privilege escalation</a> is a new kind of vulnerability in which a compromised intelligent agent uses indirect prompt injection to cause a victim agent to overwrite its configuration, granting it additional privileges. </li> <li>GitHub is taking a number of measures to <a href="https://www.bleepingcomputer.com/news/security/github-tightens-npm-security-with-mandatory-2fa-access-tokens/" target="_blank" rel="noreferrer noopener">improve software supply chain security</a>, including requiring two-factor authentication (2FA), expanding <a href="https://repos.openssf.org/trusted-publishers-for-all-package-repositories" target="_blank" rel="noreferrer noopener">trusted publishing</a>, and more.</li> <li>A compromised npm package uses a <a href="https://www.bleepingcomputer.com/news/security/npm-package-caught-using-qr-code-to-fetch-cookie-stealing-malware/" target="_blank" rel="noreferrer noopener">QR code to encode malware</a>. The malware is apparently downloaded in the QR code (which is valid, but too dense to be read by a normal camera), unpacked by the software, and used to steal cookies from the victim’s browser. </li> <li>Node.js and its package manager npm have been in the news because of an ongoing series of supply chain attacks. Here’s the <a href="https://www.bleepingcomputer.com/news/security/self-propagating-supply-chain-attack-hits-187-npm-packages/" target="_blank" rel="noreferrer noopener">latest report</a>.</li> <li>A <a href="https://blogs.cisco.com/security/detecting-exposed-llm-servers-shodan-case-study-on-ollama" target="_blank" rel="noreferrer noopener">study by Cisco</a> has discovered over a thousand unsecured LLM servers running on Ollama. Roughly 20% were actively serving requests. The rest may have been idle Ollama instances, waiting to be exploited. </li> <li>Anthropic has announced that <a href="https://old.reddit.com/r/LocalLLaMA/comments/1n2ubjx/if_you_have_a_claude_personal_account_they_are/" target="_blank" rel="noreferrer noopener">Claude will train on data from personal accounts</a>, effective September 28. This includes Free, Pro, and Max plans. Work plans are exempted. While the company says that training on personal data is opt-in, it’s (currently) enabled by default, so it’s opt-out.</li> <li>We now have “<a href="https://www.bleepingcomputer.com/news/security/malware-devs-abuse-anthropics-claude-ai-to-build-ransomware/" target="_blank" rel="noreferrer noopener">vibe hacking</a>,” the use of AI to develop malware. Anthropic has reported several instances in which Claude was used to create malware that the authors could not have created themselves. Anthropic is banning threat actors and implementing classifiers to detect illegal use.</li> <li>Zero trust is basic to modern security. But groups implementing zero trust have to realize that it’s a project that’s <a href="https://www.bleepingcomputer.com/news/security/why-zero-trust-is-never-done-and-is-an-ever-evolving-process/" target="_blank" rel="noreferrer noopener">never finished</a>. Threats change, people change, systems change.</li> <li>There’s a new technique for jailbreaking LLMs: write prompts with <a href="https://www.theregister.com/2025/08/26/breaking_llms_for_fun/" target="_blank" rel="noreferrer noopener">bad grammar and run-on sentences</a>. 
These seem to prevent guardrails from taking effect. </li> <li>In an attempt to minimize the propagation of malware on the Android platform, Google <a href="https://www.bleepingcomputer.com/news/security/google-to-verify-all-android-devs-to-block-malware-on-google-play/" target="_blank" rel="noreferrer noopener">plans</a> to block “sideloading” apps for Android devices and require developer ID verification for apps installed through Google Play.</li> <li>A <a href="https://research.checkpoint.com/2025/zipline-phishing-campaign/" target="_blank" rel="noreferrer noopener">new phishing attack called ZipLine</a> targets companies using their own “contact us” pages. The attacker then engages in an extended dialog with the company, often posing as a potential business partner, before eventually delivering a malware payload.</li></ul> <h2 class="wp-block-heading">Operations</h2> <ul class="wp-block-list"><li>The <a href="https://blog.google/technology/developers/dora-report-2025/" target="_blank" rel="noreferrer noopener">2025 DORA report</a> is out! DORA may be the most detailed summary of the state of the IT industry. DORA’s authors note that AI is everywhere and that the use of AI now improves end-to-end productivity, something that was ambiguous in last year’s report.</li> <li>Microsoft has <a href="https://www.bleepingcomputer.com/news/microsoft/microsoft-word-will-save-your-files-to-the-cloud-by-default/" target="_blank" rel="noreferrer noopener">announced</a> that Word will save files to the cloud (OneDrive) by default. This (so far) appears to apply only when using Windows. The feature is currently in beta.</li></ul> <h2 class="wp-block-heading">Web</h2> <ul class="wp-block-list"><li>Do we need another browser? <a href="https://helium.computer/" target="_blank" rel="noreferrer noopener">Helium</a> is a Chromium-based browser that is private by default. </li> <li>Are scientists <a href="https://www.psypost.org/scientists-say-x-formerly-twitter-has-lost-its-professional-edge-and-bluesky-is-taking-its-place/" target="_blank" rel="noreferrer noopener">moving from Twitter to Bluesky</a>?</li></ul> <h2 class="wp-block-heading">Virtual and Augmented Reality</h2> <ul class="wp-block-list"><li>Meta has announced a pair of <a href="https://arstechnica.com/gadgets/2025/09/metas-799-ray-ban-display-is-the-companys-first-big-step-from-vr-to-ar/" target="_blank" rel="noreferrer noopener">augmented reality glasses</a> with a small display on one of the lenses, bringing it to the edge of AR. In addition to displaying apps from your phone, the glasses can do “live captioning” for conversations. 
The display is controlled by a wristband.</li></ul>]]></content:encoded> </item> <item> <title>Mapping the Design Space of AI Coding Assistants</title> <link>https://www.oreilly.com/radar/from-autocomplete-to-agents-mapping-the-design-space-of-ai-coding-assistants/</link> <pubDate>Mon, 06 Oct 2025 11:09:27 +0000</pubDate> <dc:creator><![CDATA[Sam Lau and Philip Guo]]></dc:creator> <category><![CDATA[AI & ML]]></category> <category><![CDATA[Research]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17493</guid> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Analysis_of_AI_assistants.jpg" medium="image" type="image/jpeg" width="2304" height="1792" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/Analysis_of_AI_assistants-160x160.jpg" width="160" height="160" /> <custom:subtitle><![CDATA[From Autocomplete to Agents: Analyzing 90 Tools from Industry and Academia]]></custom:subtitle> <description><![CDATA[Just a few years ago, AI coding assistants were little more than autocomplete curiosities—tools that could finish your variable names or suggest a line of boilerplate. Today, they’ve become an everyday part of millions of developers’ workflows, with entire products and startups built around them. Depending on who you ask, they represent either the dawn […]]]></description> <content:encoded><![CDATA[<p>Just a few years ago, AI coding assistants were little more than autocomplete curiosities—tools that could finish your variable names or suggest a line of boilerplate. Today, they’ve become an everyday part of millions of developers’ workflows, with entire products and startups built around them. Depending on who you ask, they represent either the dawn of a new programming era or the end of programming as we know it. Amid the hype and skepticism, one thing is clear: The landscape of coding assistants is expanding rapidly, and it can be hard to zoom out and see the bigger picture.</p> <p>I’m <a href="https://lau.ucsd.edu/" target="_blank" rel="noreferrer noopener">Sam Lau</a> from UC San Diego, and my colleague <a href="https://pg.ucsd.edu/" target="_blank" rel="noreferrer noopener">Philip Guo</a> and I are presenting a <a href="https://lau.ucsd.edu/pubs/2025_analysis-of-90-genai-coding-tools_VLHCC.pdf" target="_blank" rel="noreferrer noopener">research paper</a> at the Visual Languages and Human-Centric Computing conference (VL/HCC) on this very topic. We wanted to know: <strong>How have AI coding assistants evolved over the past few years, and where is the field headed?</strong></p> <p>To answer this question, we analyzed <strong>90 AI coding assistants</strong> created between 2021 and 2025: 58 industry products and 32 academic prototypes. Some were widely used commercial assistants, while others were experimental research systems that explored entirely new ways of working with AI. Rather than focusing on who was “best” or which system was most powerful, we took a different approach. We built a <strong>design space framework</strong>: a kind of map that highlights the major choices designers and researchers make when building coding assistants. 
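</p> <p>To give a feel for what that map looks like in use, here is a toy sketch, written for this post rather than taken from the paper, that tags two hypothetical assistants along the four broad themes described below:</p> <pre class="wp-block-code"><code># Toy illustration only: two hypothetical assistants tagged along the four themes.
# The paper refines these themes into 10 finer-grained dimensions.
speed_focused = {
    "interface":    {"inline autocomplete"},
    "inputs":       {"text prompts", "code analysis"},
    "capabilities": {"run code", "self-correct"},
    "outputs":      {"code blocks"},
}
learning_focused = {
    "interface":    {"chat with proactive suggestions"},
    "inputs":       {"text prompts", "custom project rules"},
    "capabilities": {"call external tools"},
    "outputs":      {"code blocks", "reasoning traces", "references"},
}

# Comparing profiles shows where two designs diverge.
for theme in speed_focused:
    only_speed = sorted(speed_focused[theme] - learning_focused[theme])
    only_learning = sorted(learning_focused[theme] - speed_focused[theme])
    print(f"{theme}: speed-only={only_speed}, learning-only={only_learning}")
</code></pre> <p>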
By comparing industry and academic systems side by side, we hoped to uncover both patterns and blind spots in how these tools are being shaped.</p> <p>The result is the first comprehensive snapshot of the space at this critical moment in 2025 when AI coding assistants are starting to mature but their future directions remain very much in flux.</p> <p>Here’s a summary of our findings:</p> <figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1600" height="1332" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-1600x1332.png" alt="Overview of findings" class="wp-image-17494" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-1600x1332.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-300x250.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-768x640.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-1536x1279.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/10/design_space-2048x1705.png 2048w" sizes="(max-width: 1600px) 100vw, 1600px" /></figure> <h2 class="wp-block-heading"><strong>10 Dimensions That Define These Tools</strong></h2> <p>What makes one coding assistant feel like a helpful copilot and another feel like a clunky distraction? In our analysis, we identified 10 dimensions of design, grouped into four broad themes:</p> <ol class="wp-block-list"><li>Interface: How the assistant shows up (inline autocomplete, proactive suggestions, full IDEs).</li> <li>Inputs: What you can feed it (text, design files, code analysis, custom project rules).</li> <li>Capabilities: What it can do (self-correct, run code, call external tools).</li> <li>Outputs: How it delivers results (code blocks, interactive outputs, reasoning traces, references).</li></ol> <p>For example, some assistants like GitHub Copilot are optimized for speed and minimal friction: autocomplete a few keystrokes, press tab, keep coding. Academic projects like WaitGPT and DBox are designed for exploration and learning by slowing users down to reflect on trade-offs, offering explanations, or scaffolding programming concepts for beginners. (Links to all 90 projects are in our <a href="https://lau.ucsd.edu/pubs/2025_analysis-of-90-genai-coding-tools_VLHCC.pdf" target="_blank" rel="noreferrer noopener">paper PDF</a>.)</p> <p>One of the clearest findings from our survey is a split between industry and academia.</p> <ul class="wp-block-list"><li>Industry products focus on speed, efficiency, and seamless integration. Their pitch is simple: write code faster, with fewer errors. Think of tools like Cursor, Claude Code, or GitHub Copilot, which promise “coding at the speed of thought.”</li> <li>Academic prototypes, by contrast, diverge in many directions. Some deliberately slow down the coding process to encourage reflection. Others focus on scaffolding learning for students, supporting accessibility, or enabling entirely new ways of prompting, like letting users sketch a UI instead of writing a text-based prompt.</li></ul> <p>This divergence reflects two different priorities: one optimized for productivity in professional software engineering, the other for exploring what programming could be or should be. 
Both approaches have value, and to us the most interesting question is whether the two cultures might eventually converge, or at least learn from each other.</p> <h2 class="wp-block-heading"><strong>Six Personas, Six Ways of Coding with AI</strong></h2> <p>Another way to make sense of the space is to ask: Who are these tools really for? We identified six user personas that kept reappearing across systems:</p> <ul class="wp-block-list"><li>Software engineers, who seek tools to accelerate professional workflows</li> <li>HCI researchers and hobbyists, who create prototypes and new ways of working with AI</li> <li>UX designers, who use assistants to quickly prototype and iterate on interface ideas</li> <li><a href="https://pg.ucsd.edu/publications/conversational-programmers-in-industry_CHI-2016.pdf" target="_blank" rel="noreferrer noopener">Conversational programmers</a>, who are nontechnical professionals that engage in vibe coding by describing ideas in natural language</li> <li>Data scientists, who need explainability and quick iterations on code-driven experiments</li> <li>Students learning to code, who benefit from scaffolding, guidance, and explanations</li></ul> <p>Each persona requires different designs, which we highlight within our design space. For example, tools designed for software engineers like Claude Code and Aider are integrated into their existing code editors and terminals, support a high degree of customization, and have autonomy to write and run code without human intervention. In contrast, tools for designers like Lovable and Vercel v0 are browser-based and can create applications using a visual mockup like a Figma design file.</p> <h2 class="wp-block-heading"><strong><strong>What Comes After Autocomplete, Chat, and Agents?</strong></strong></h2> <p>So where does this leave us? Coding assistants are no longer experimental toys. They’re woven into production workflows, classrooms, design studios, and research labs. But their future is far from settled.</p> <p>From our perspective, the central challenge is that academia and industry are innovating in parallel yet rarely in conversation with one another. While industry tools optimize for speed, generating lots of code quickly is not the same as building good software. In fact, recent studies have shown that although AI coding assistants have claimed to boost productivity by 10x, reality so far is closer to incremental improvements. (See <a href="https://addyo.substack.com/p/the-reality-of-ai-assisted-software" target="_blank" rel="noreferrer noopener">Addy Osmani’s recent blog post</a> for a summary.) 
<strong>What if academia and industry could work together to combine rigorous study of real barriers to productivity with the practical experience of scaling tools in production?</strong> If this could happen, we might move beyond simply making code faster to write toward making software development itself more rapid and sustainable.</p> <p>Check out our paper <a href="https://lau.ucsd.edu/pubs/2025_analysis-of-90-genai-coding-tools_VLHCC.pdf" target="_blank" rel="noreferrer noopener">here</a> and email us if you’d like to discuss anything related to it!</p>]]></content:encoded> </item> <item> <title>Generative AI in the Real World: Emmanuel Ameisen on LLM Interpretability</title> <link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-emmanuel-ameisen-on-llm-interpretability/</link> <comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-emmanuel-ameisen-on-llm-interpretability/#respond</comments> <pubDate>Thu, 02 Oct 2025 14:31:22 +0000</pubDate> <dc:creator><![CDATA[Ben Lorica and Emmanuel Ameisen]]></dc:creator> <category><![CDATA[Podcast]]></category> <guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&p=17488</guid> <enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3" length="0" type="audio/mpeg" /> <media:content url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" medium="image" type="image/png" width="2560" height="2560" /> <media:thumbnail url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" width="160" height="160" /> <description><![CDATA[In this episode, Ben Lorica and Anthropic interpretability researcher Emmanuel Ameisen get into the work Emmanuel’s team has been doing to better understand how LLMs like Claude work. Listen in to find out what they’ve uncovered by taking a microscopic look at how LLMs function—and just how far the analogy to the human brain holds. […]]]></description> <content:encoded><![CDATA[<p>In this episode, Ben Lorica and Anthropic interpretability researcher Emmanuel Ameisen get into the work Emmanuel’s team has been doing to better understand how LLMs like Claude work. Listen in to find out what they’ve uncovered by taking a microscopic look at how LLMs function—and just how far the analogy to the human brain holds.</p> <p><strong>About the <em>Generative AI in the Real World </em>podcast</strong>: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p> <p>Check out <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*m7f70i*_ga*MTYyODYzMzQwMi4xNzU4NTY5ODYz*_ga_092EL089CH*czE3NTkxNzAwODUkbzE0JGcwJHQxNzU5MTcwMDg1JGo2MCRsMCRoMA.." 
target="_blank" rel="noreferrer noopener">other episodes</a> of this podcast on the O’Reilly learning platform.</p> <h2 class="wp-block-heading"><strong>Transcript</strong></h2> <p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=0" target="_blank" rel="noreferrer noopener">00.00</a><br><strong>Today we have Emmanuel Ameisen. He works at Anthropic on interpretability research. And he also authored an O’Reilly book called </strong><a href="https://learning.oreilly.com/library/view/building-machine-learning/9781492045106/"><strong><em>Building Machine Learning Powered Applications</em></strong></a><strong>. So welcome to the podcast, Emmanuel. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=22" target="_blank" rel="noreferrer noopener">00.22</a><br>Thanks, man. I’m glad to be here. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=24" target="_blank" rel="noreferrer noopener">00.24</a><br><strong>As I go through what you and your team do, it’s almost like biology, right? You’re studying these models, but increasingly they look like biological systems. Why do you think that’s useful as an analogy? And am I actually accurate in calling this out?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=50" target="_blank" rel="noreferrer noopener">00.50</a><br>Yeah, that’s right. Our team’s mandate is to basically understand how the models work, right? And one fact about language models is that they’re not really written like a program, where somebody sort of by hand described what should happen in that logical branch or this logical branch. Really the way we think about it is they’re almost grown. But what that means is, they’re trained over a large dataset, and on that dataset, they learn to adjust their parameters. They have many, many parameters—often, you know, billions—in order to perform well. And so the result of that is that when you get the trained model back, it’s sort of unclear to you how that model does what it does, because all you’ve done to create it is show it tasks and have it improve at how it does these tasks.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=108" target="_blank" rel="noreferrer noopener">01.48</a><br>And so it feels similar to biology. I think the analogy is apt because for analyzing this, you kind of resort to the tools that you would use in that context, where you try to look inside the model [and] see which parts seem to light up in different contexts. You poke and prod in different parts to try to see, “Ah, I think this part of the model does this.” If I just turn it off, does the model stop doing the thing that I think it’s doing? It’s very much not what you would do in most cases if you were analyzing a program, but it is what you would do if you’re trying to understand how a mouse works. 
</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=142" target="_blank" rel="noreferrer noopener">02.22</a><br><strong>You and your team have discovered surprising ways as to how these models do problem-solving, the strategies they employ. What are some examples of these surprising problem-solving patterns? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=160" target="_blank" rel="noreferrer noopener">02.40</a><br>We’ve spent a bunch of time studying these models. And again I should say, whether it’s surprising or not depends on what you were expecting. So maybe there’s a few ways in which they’re surprising. </p> <p>There’s various bits of common knowledge about, for example, how models predict one token at a time. And it turns out if you actually look inside the model and try to see how it’s sort of doing its job of predicting text, you’ll find that actually a lot of the time it’s predicting multiple tokens ahead of time. It’s sort of deciding what it’s going to say in a few tokens and presumably in a few sentences to decide what it says now. That might be surprising to people who have heard that [models] are predicting one token at a time. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=208" target="_blank" rel="noreferrer noopener">03.28</a><br>Maybe another one that’s sort of interesting to people is that if you look inside these models and you try to understand what they represent in their artificial neurons, you’ll find that there are general concepts they represent.</p> <p>So one example I like is you can say, “Somebody is tall,” and then, inside the model, you can find neurons activating for the concept of something being tall. And you can have them all read the same text, but translated in French: “Quelqu’un est grand.” And then you’ll find the same neurons that represent the concept of somebody being tall or active.</p> <p>So you have these concepts that are shared across languages and that the model represents in one way, which is again, maybe surprising, maybe not surprising, in the sense that that’s clearly the optimal thing to do, or that’s the way that. . . You don’t want to repeat all of your concepts; like in your brain, you don’t want to have a separate French brain, an English brain, ideally. But surprising if you think that these models are mostly doing pattern matching. Then it is surprising that, when they’re processing English text or French text, they’re actually using the same representations rather than leveraging different patterns. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=281" target="_blank" rel="noreferrer noopener">04.41</a><br><strong>[In] the text you just described, is there a material difference between the reasoning and nonreasoning models? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=291" target="_blank" rel="noreferrer noopener">04.51</a><br>We haven’t studied that in depth. 
I will say that the thing that’s interesting about reasoning models is that when you ask them a question, instead of answering right away, for a while they write some text thinking through the problem, oftentimes using math or code. You know, trying to think: “Ah, well, maybe this is the answer. Let me try to prove it. Oh no, it’s wrong.” And so they’ve proven to be good at a variety of tasks that models which immediately answer aren’t good at. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=322" target="_blank" rel="noreferrer noopener">05.22</a><br>And one thing that you might think if you look at reasoning models is that you could just read their reasoning and you would understand how they think. But it turns out that one thing that we did find is that you can look at a model’s reasoning, that it writes down, that it samples, the text it’s writing, right? It’s saying, “I’m now going to do this calculation,” and in some cases when, for example, the calculation is too hard, if at the same time you look inside the model’s brain, inside its weights, you’ll find that actually it could be lying to you.</p> <p>It’s not at all doing the math that it says it’s doing. It’s just kind of doing its best guess. It’s taking a stab at it, just based on either context clues from the rest or what it thinks is probably the right answer—but it’s totally not doing the computation. And so one thing that we found is that you can’t quite always trust the reasoning that is output by reasoning models.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=379" target="_blank" rel="noreferrer noopener">06.19</a><br><strong>Obviously one of the frequent complaints is around hallucination. So based on what you folks have been learning, are we getting close to a, I guess, much more principled mechanistic explanation for hallucination at this point? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=399" target="_blank" rel="noreferrer noopener">06.39</a><br>Yeah. I mean, I think we’re making progress. We study that in our recent paper, and we found something that’s pretty neat. So hallucinations are cases where the model will confidently say something that’s wrong. You might ask the model about some person. You’ll say, “Who’s Emmanuel Ameisen?” And it’ll be like “Ah, it’s the famous basketball player” or something. So it will say something where instead it should have said, “I don’t quite know. I’m not sure who you’re talking about.” And we looked inside the model’s neurons while it’s processing these kinds of questions, and we did a simple test: We asked the model, “Who’s Michael Jordan?” And then we made up some name. We asked it, “Who’s Michael Batkin?” (which it doesn’t know).</p> <p>And if you look inside, there’s something really interesting that happens, which is that basically these models by default—because they’ve been trained to try not to hallucinate—they have this default set of neurons that is just: If you ask me about anyone, I’ll just say no.
I’ll just say, “I don’t know.” And the way that the models actually choose to answer is if you mentioned somebody famous enough, like Michael Jordan, there’s neurons for like, “Oh, this person is famous; I definitely know them” that activate and that turns off the neurons that were going to promote the answer for, “Hey, I’m not too sure.” And so that’s why the model answers in the Michael Jordan case. And that’s why it doesn’t answer by default in the Michael Batkin case.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=489" target="_blank" rel="noreferrer noopener">08.09</a><br>But what happens if instead now you force the neurons for “Oh, this is a famous person” to turn on even when the person isn’t famous, the model is just going to answer the question. And in fact, what we found is in some hallucination cases, this is exactly what happens. It’s that basically there’s a separate part of the model’s brain, essentially, that’s making the determination of “Hey, do I know this person or not?” And then that part can be wrong. And if it’s wrong, the model’s just going to go on and yammer about that person. And so it’s almost like you have a split mechanism here, where, “Well I guess the part of my brain that’s in charge of telling me I know says, ‘I know.’ So I’m just gonna go ahead and say stuff about this person.” And that’s, at least in some cases, how you get a hallucination. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=534" target="_blank" rel="noreferrer noopener">08.54</a><br><strong>That’s interesting because a person would go, “I know this person. Yes, I know this person.” But then if you actually don’t know this person, you have nothing more to say, right? It’s almost like you forget. Okay, so I’m supposed to know Emmanuel, but I guess I don’t have anything else to say. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=555" target="_blank" rel="noreferrer noopener">09.15</a><br>Yeah, exactly. So I think the way I’ve thought about it is there’s definitely a part of my brain that feels similar to this thing, where you might ask me, you know, “Who was the actor in the second movie of that series?” and I know I know; I just can’t quite recall it at the time. Like, “Ah, you know, this is how they look; they were also in that other movie”—but I can’t think of the name. But the difference is, if that happens, I’m going to say, “Well, listen, man, I think I know, but at the moment I just can’t quite recall it.” Whereas the models are like, “I think I know.” And so I guess I’m just going to say stuff. It’s not that the “Oh, I know” [and] “I don’t know” parts [are] separate. That’s not the problem. It’s that they don’t catch themselves sometimes early enough like you would, where, to your point exactly, you’d just be like, “Well, look, I think I know who this is, but honestly at this moment, I can’t really tell you. So let’s move on.” </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=610" target="_blank" rel="noreferrer noopener">10.10</a><br><strong>By the way, this is part of a bigger topic now in the AI space around reliability and predictability, the idea being, I can have a model that’s 95% [or] 99% accurate. 
And if I don’t know when the 5% or the 1% is inaccurate, it’s quite scary. Right? So I’d rather have a model that’s 60% accurate, but I know exactly where that 60% is. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=645" target="_blank" rel="noreferrer noopener">10.45</a><br>Models are getting better at avoiding hallucinations for that reason. That’s pretty important. People are training them to just be better calibrated. If you look at the rates of hallucinations for most models today, they’re so much lower than for previous models. But yeah, I agree. And I think in a sense there’s a hard question there, which is that, at least in some of these examples that we looked at and insofar as what we’ve seen, you can’t necessarily tell just from looking at the inside of the model, oh, the model is hallucinating. What we can see is the model thinks it knows who this person is, and then it’s saying some stuff about this person. And so I think the key bit that would be interesting to do future work on is to then try to understand, well, when it’s saying things about people, when it’s saying, you know, this person won this championship or whatever, is there a way there that we can kind of tell whether those are real facts or those are sort of confabulated in a way? And I think that’s still an active area of research. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=711" target="_blank" rel="noreferrer noopener">11.51</a><br><strong>So in the case where you hook up Claude to web search, presumably there’s some sort of citation trail where at least you can check, right? The model is saying it knows Emmanuel and then says who Emmanuel is and gives me a link. I can check, right? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=732" target="_blank" rel="noreferrer noopener">12.12</a><br>Yeah. And in fact, I feel like it’s even more fun than that sometimes. I had this experience yesterday where I was asking the model about some random detail, and it confidently said, “This is how you do this thing.” I was asking how to change the time on a device—it’s not important. And it was like, “This is how you do it.” And then it did a web search and it said, “Oh, actually, I was wrong. You know, according to the search results, that’s how you do it. The initial advice I gave you is wrong.” And so, yeah, I think grounding results in search is definitely helpful for hallucinations. Although, of course, then you have the other problem of making sure that the model doesn’t trust sources that are unreliable. But it does help. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=770" target="_blank" rel="noreferrer noopener">12.50</a><br><strong>Case in point: science. There’s tons and tons of scientific papers now that get retracted.
So just because it does a web search, what it should do is also cross-verify that search with whatever database there is for retracted papers.</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=788" target="_blank" rel="noreferrer noopener">13.08</a><br>And you know, as you think about these things, I think you get into effort-level questions, where right now, if you go to Claude, there’s a research mode where you can send it off on a quest and it’ll do research for a long time. It’ll cross-reference tens and tens and tens of sources.</p> <p>But that will take, I don’t know, it depends. Sometimes 10 minutes, sometimes 20 minutes. And so there’s a question like, when you’re asking, “Should I buy these running shoes?” you don’t care, [but] when you’re asking about something serious or you’re going to make an important life decision, maybe you do. I always feel like as the models get better, we also want them to get better at knowing when they should spend 10 seconds or 10 minutes on something. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=827" target="_blank" rel="noreferrer noopener">13.47</a><br><strong>There’s a surprising and growing number of people who go to these models to ask for help with medical questions. And as anyone who uses these models knows, a lot of it comes down to your prompt, right? A neurosurgeon will prompt this model about brain surgery very differently than you and me, right? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=848" target="_blank" rel="noreferrer noopener">14.08</a><br>Of course. In fact, that was one of the cases that we studied, where we prompted the model with a case that’s similar to one that a doctor would see. Not in the language that you or I would use, but in language sort of like “This patient is age 35, presenting symptoms A, B, and C,” because we wanted to try to understand how the model arrives at an answer. And so the question had all these symptoms. And then we asked the model, “Based on all these symptoms, answer in only one word: What other tests should we run?” Just to force it to do all of its reasoning in its head; it can’t write anything down. </p> <p>And what we found is that there were groups of neurons that were activating for each of the symptoms. And then there were two different groups of neurons that were activating for two potential diagnoses, two potential diseases. And then those were promoting a specific test to run, which is sort of what a practitioner does in a differential diagnosis: The person either has A or B, and you want to run a test to know which one it is. And then the model suggested the test that would help you decide between A and B. And I found that quite striking because I think again, outside of the question of reliability for a second, there’s a depth of richness to just the internal representations of the model as it does all of this in one word. </p> <p>This makes me excited about continuing down this path of trying to understand the model, like the model’s done a full round of diagnosing someone and proposing something to help with the diagnostic just in one forward pass in its head. As we use these models in a bunch of places, I really want to understand all of the complex behavior like this that happens in its weights.
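</p> <p><em>As a rough illustration of what “looking inside” during a single forward pass can mean, here is a minimal sketch that records every layer’s hidden state while a small open source model (GPT-2 via Hugging Face Transformers) answers a one-word-answer prompt, then lists the most active units per layer. The prompt, the model, and the “most active units” readout are stand-ins for illustration; they are not the feature-based methods Anthropic uses.</em></p> <pre class="wp-block-code"><code># A crude "microscope": record the hidden state at every layer while a
# small open model answers a one-word-answer prompt, then list the units
# that are most active at the final position. Purely illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = (
    "A 35-year-old patient presents with symptom A, symptom B, and "
    "symptom C. Answer in only one word: What test should be run next? "
    "Answer:"
)
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding output plus one tensor per
# layer, each of shape (batch, sequence, hidden_size).
for layer, h in enumerate(out.hidden_states):
    final_state = h[0, -1]
    top = torch.topk(final_state.abs(), k=5)   # rough proxy for "lit up" units
    print("layer", layer, "top units:", top.indices.tolist())

answer_id = int(out.logits[0, -1].argmax())
print("one-word answer:", tok.decode(answer_id))</code></pre> <p><em>The raw readout is only a starting point; the harder step, which comes up next in the conversation, is grouping those units into human-interpretable concepts and validating them by turning them on or off.</em>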
</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=961" target="_blank" rel="noreferrer noopener">16.01</a><br><strong>In traditional software, we have debuggers and profilers. Do you think as interpretability matures our tools for building AI applications, we could have kind of the equivalent of debuggers that flag when a model is going off the rails?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=984" target="_blank" rel="noreferrer noopener">16.24</a><br>Yeah. I mean, that’s the hope. I think debuggers are a good comparison actually, because debuggers mostly get used by the person building the application. If I go to, I don’t know, claude.ai or something, I can’t really use the debugger to understand what’s going on in the backend. And so that’s the first state of debuggers, and the people building the models use it to understand the models better. We’re hoping that we’re going to get there at some point. We’re making progress. I don’t want to be too optimistic, but, I think, we’re on a path here where this work I’ve been describing, the vision was to build this big microscope, basically where the model is doing something, it’s answering a question, and you just want to look inside. And just like a debugger will show you basically the states of all of the variables in your program, we want to see the state of all of the neurons in this model.</p> <p>It’s like, okay. The “I definitely know this person” neuron is on and the “This person is a basketball player” neuron is on—that’s kind of interesting. How do they affect each other? <em>Should</em> they affect each other in that way? So I think in many ways we’re sort of getting to something close where at least you can inspect the execution of your running program like you would with a debugger. You’re inspecting the execution learning model. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1066" target="_blank" rel="noreferrer noopener">17.46</a><br>Of course, then there’s a question of, What do you do with it? That I think is another active area of research where, if you spend some time looking at your debugger, you can say, “Ah, okay, I get it. I initialized this variable the wrong way. Let me fix it.”</p> <p>We’re not there yet with models, right? Even if I tell you “This is exactly how this is happening and it’s wrong,” then the way that we make them again is we train them. So really, you have to think, “Ah, can we give it other examples that <em>I</em> would learn to do that way?” </p> <p>It’s almost like we’re doing neuroscience on a developing child or something. But then our only way to actually improve them is to change the curriculum of their school. So we have to translate from what we saw in their brain to “Maybe they need a little more math. Or maybe they need a little more English class.” I think we’re on that path. I’m pretty excited about it. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1113" target="_blank" rel="noreferrer noopener">18.33</a><br>We also open-sourced the tools to do this a couple months back. And so, you know, this is something that can now be run on open source models. 
And people have been doing a bunch of experiments with them, trying to see if those models show some of the same behaviors that we saw in the Claude models that we studied. And so I think that also is promising. And there’s room for people to contribute if they want to. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1136" target="_blank" rel="noreferrer noopener">18.56</a><br><strong>Do you folks internally inside Anthropic have special interpretability tools—not ones that the interpretability team uses but [ones that] you can now push out to other people in Anthropic as they’re using these models? I don’t know what these tools would be. Could be what you describe, some sort of UX or some sort of microscope into a model. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1162" target="_blank" rel="noreferrer noopener">19.22</a><br>Right now we’re sort of at the stage where the interpretability team is doing most of the microscopic exploration, and we’re building all these tools and doing all of this research, and it mostly happens on the team for now. I think there’s a dream and a vision to have this. . . You know, I think the debugger metaphor is really apt. But we’re still in the early days. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1186" target="_blank" rel="noreferrer noopener">19.46</a><br><strong>You used the example earlier [where] the part of the model “That is a basketball player” lights up. Is that what you would call a concept? And from what I understand, you folks have a lot of these concepts. And by the way, is a concept something that you have to consciously identify, or do you folks have an automatic way of, “Here’s millions and millions of concepts that we’ve identified and we don’t have actual names for some of them yet”?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1221" target="_blank" rel="noreferrer noopener">20.21</a><br>That’s right, that’s right. The latter one is the way to think about it. The way that I like to describe it is basically, the model has a bunch of neurons. And for a second let’s just imagine that we can make the comparison to the human brain, [which] also has a bunch of neurons.</p> <p>Usually it’s groups of neurons that mean something. So it’s like, these five neurons are on; that means that the model’s reading text about basketball or something. And so we want to find all of these groups. And the way that we find them basically is in an automated, unsupervised way.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1255" target="_blank" rel="noreferrer noopener">20.55</a><br>The way you can think about it, in terms of how we try to understand what they mean, is maybe the same way that you would in a human brain, where if I had full access to your brain, I could record all of your neurons. And [if] I wanted to know where the basketball neuron was, probably what I would do is I would put you in front of a screen and I would play some basketball videos, and I would see which part of your brain lights up, you know?
And then I would play some videos of football and I’d hopefully see some common parts, like the sports part and then the football part would be different. And then I play a video of an apple and then it’d be a completely different part of the brain. </p> <p>And that’s basically exactly what we do to understand what these concepts mean in Claude is we just run a bunch of text through and see which part of its weight matrices light up, and that tells us, okay, this is the basketball concept probably. </p> <p>The other way we can confirm that we’re right is just we can then turn it off and see if Claude then stops talking about basketball, for example.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1312" target="_blank" rel="noreferrer noopener">21.52</a><br><strong>Does the nature of the neurons change between model generations or between types of models—reasoning, nonreasoning, multimodal, nonmultimodal?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1323" target="_blank" rel="noreferrer noopener">22.03</a><br>Yeah. I mean, at the base level all the weights of the model are different, so all of the neurons are going to be different. So the sort of trivial answer to your question [is] yes, everything’s changed. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1334" target="_blank" rel="noreferrer noopener">22.14</a><br><strong>But you know, it’s kind of like [in] the brain, the basketball concept is close to the Michael Jordan concept.</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1341" target="_blank" rel="noreferrer noopener">22.21</a><br>Yeah, exactly. There’s basically commonalities, and you see things like that. We don’t at all have an in-depth understanding of anything like you’d have for the human brain, where it’s like “Ah, this is a map of where the concepts are in the model.” However, you do see that, provided that the models are trained on and doing kind of the same “being a helpful assistant” stuff, they’ll have similar concepts. They’ll all have the basketball concept, and they’ll have a concept for Michael Jordan. And these concepts will be using similar groups of neurons. So there’s a lot of overlap between the basketball concept and the Michael Jordan concept. You’re going to see similar overlap in most models.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1383" target="_blank" rel="noreferrer noopener">23.03</a><br><strong>So channeling your previous self, if I were to give you a keynote at a conference and I give you three slides—this is in front of developers, mind you, not ML researchers—what are the one to three things about interpretability research that developers should know about or potentially even implement or do something about today?</strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1410" target="_blank" rel="noreferrer noopener">23.30</a><br>Oh man, it’s a good question. 
My first slide would say something like: Models, language models in particular, are complicated and interesting, they can be understood, and it’s worth spending time to understand them. The point here being, we don’t have to treat them as this mysterious thing. We don’t have to fall back on approximations like, “Oh, they’re just next-token predictors or they’re just pattern matchers. They’re black boxes.” We can look inside, and we can make progress on understanding them, and we can find a lot of rich structure. That would be slide one.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1450" target="_blank" rel="noreferrer noopener">24.10</a><br>Slide two would be the stuff that we talked about at the start of this conversation, which would be, “Here’s three ways your intuitions are wrong.” You know, oftentimes this is, “Look at this example of a model planning many tokens ahead, not just waiting for the next token. And look at this example of the model having these rich representations showing that it’s sort of like actually doing multistep reasoning in its weights rather than just kind of matching to some training data example.” And then I don’t know what my third example would be. Maybe this universal language example we talked about. Complicated, interesting stuff. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1484" target="_blank" rel="noreferrer noopener">24.44</a><br>And then, three: What can you do about it? That’s the third slide. It’s an early research area. There’s not anything that you can take that will make anything that you’re building better today. Hopefully if I’m giving this presentation in six months or a year, maybe this third slide is different. But for now, that’s what it is.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1501" target="_blank" rel="noreferrer noopener">25.01</a><br>If you’re interested in this stuff, there are these open source libraries that let you do this tracing on open source models. Just go grab some small open source model, ask it some weird question, and then just look inside its brain and see what happens.</p> <p>I think the thing that I respect the most and identify [with] the most about just being an engineer or developer is this willingness to understand, this stubbornness to understand: Your program has a bug? Like, I’m going to figure out what it is, and it doesn’t matter what level of abstraction it’s at.</p> <p>And I would encourage people to use that same level of curiosity and tenacity to look inside these very weird models that are everywhere now. Those would be my three slides. </p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1549" target="_blank" rel="noreferrer noopener">25.49</a><br><strong>Let me ask a follow-up question. As you know, most teams are not going to be doing much pretraining. A lot of teams will do some form of posttraining, whatever that might be—fine-tuning, some form of reinforcement learning for the more advanced teams, a lot of prompt engineering, prompt optimization, prompt tuning, some sort of context grounding like RAG or GraphRAG.</strong></p> <p><strong>You know more about how these models work than a lot of people.
How would you approach these various things in a toolbox for a team? You’ve got prompt engineering, some fine-tuning, maybe distillation, I don’t know. So put on your posttraining hat, and based on what you know about interpretability or how these models work, how would you go about, systematically or in a principled way, approaching posttraining? </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1614" target="_blank" rel="noreferrer noopener">26.54</a><br>Lucky for you, I also used to work on the posttraining team at Anthropic. So I have some experience as well. I think it’s funny, what I’m going to say is the same thing I would have said before I studied these model internals, but maybe I’ll say it in a different way or something. The key takeaway I keep on having from looking at model internals is, “God, there’s a lot of complexity.” One, that means they’re able to do very complex reasoning just in latent space inside their weights. There’s a lot of processing that can happen—more than I think most people have an intuition for. And two, that also means that usually, they’re doing a bunch of different algorithms at once for everything they do.</p> <p>So they’re solving problems in three different ways. And a lot of times, the weird mistakes you might see when you’re looking at your fine-tuning or just looking at the resulting model is, “Ah, well, there’s three different ways to solve this thing. And the model just kind of picked the wrong one this time.” </p> <p>Because these models are already so complicated, I find that the first thing to do is just pretty much always to build some sort of eval suite. That’s the thing that people fail at the most. It doesn’t take that long—it usually takes an afternoon. You just write down 100 examples of what you want and what you don’t want. And then you can get incredibly far by just prompt engineering and context engineering, or just giving the model the right context.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1714" target="_blank" rel="noreferrer noopener">28.34</a><br>That’s my experience, having worked on fine-tuning models: It’s something you only want to resort to if everything else fails. I mean, it’s pretty rare that everything else fails, especially with the models getting better. And so, yeah, understanding that, in principle, the models have an immense amount of capacity and it’s just your job to tease that capacity out is the first thing I would say. Or the second thing, I guess, after just building some evals.</p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1740" target="_blank" rel="noreferrer noopener">29.00</a><br><strong>And with that, thank you, Emmanuel. </strong></p> <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Emmanuel_Ameisen.mp3#t=1743" target="_blank" rel="noreferrer noopener">29.03</a><br>Thanks, man.</p>]]></content:encoded> <wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-emmanuel-ameisen-on-llm-interpretability/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> </channel></rss> <!--Performance optimized by W3 Total Cache.
-->