This is a valid RSS feed.
This feed is valid, but interoperability with the widest range of feed readers could be improved by implementing the following recommendations.
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
^
... rel="self" type="application/rss+xml" />
^
line 315, column 0: (12 occurrences) [help]
line 315, column 0: (12 occurrences) [help]
line 405, column 0: (9 occurrences) [help]
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:media="http://search.yahoo.com/mrss/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
xmlns:custom="https://www.oreilly.com/rss/custom"
>
<channel>
<title>Radar</title>
<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
<link>https://www.oreilly.com/radar</link>
<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
<lastBuildDate>Wed, 17 Sep 2025 17:50:22 +0000</lastBuildDate>
<language>en-US</language>
<sy:updatePeriod>
hourly </sy:updatePeriod>
<sy:updateFrequency>
1 </sy:updateFrequency>
<generator>https://wordpress.org/?v=6.8.2</generator>
<image>
<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
<title>Radar</title>
<link>https://www.oreilly.com/radar</link>
<width>32</width>
<height>32</height>
</image>
<item>
<title>Prompt Engineering Is Requirements Engineering</title>
<link>https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/</link>
<comments>https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/#respond</comments>
<pubDate>Wed, 17 Sep 2025 10:27:38 +0000</pubDate>
<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17463</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/10/in-dis-big-bang-10a-1400x950.jpg"
medium="image"
type="image/jpeg"
/>
<custom:subtitle><![CDATA[We’ve Been Here Before]]></custom:subtitle>
<description><![CDATA[In the rush to get the most from AI tools, prompt engineering—the practice of writing clear, structured inputs that guide an AI tool’s output—has taken center stage. But for software engineers, the skill isn’t new. We’ve been doing a version of it for decades, just under a different name. The challenges we face when writing […]]]></description>
<content:encoded><![CDATA[
<p>In the rush to get the most from AI tools, <strong>prompt engineering</strong>—the practice of writing clear, structured inputs that guide an AI tool’s output—has taken center stage. But for software engineers, the skill isn’t new. We’ve been doing a version of it for decades, just under a different name. The challenges we face when writing AI prompts are the same ones software teams have been grappling with for generations. Talking about prompt engineering today is really just continuing a much older conversation about how developers spell out what they need built, under what conditions, with what assumptions, and how to communicate that to the team.</p>
<p>The <em>software crisis</em> was the name given to this problem starting in the late 1960s, especially at the <a href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/NATO_Software_Engineering_Conferences&sa=D&source=docs&ust=1758049259577140&usg=AOvVaw31-sMSlquvRBVJB-EQgmJ3" target="_blank" rel="noreferrer noopener">NATO Software Engineering Conference</a> in 1968, where the term “software engineering” was introduced. The crisis referred to the widespread industry experience that software projects were over budget and late, and often failed to deliver what users actually needed.</p>
<p>There was a common misconception that these failures were due to programmers lacking technical skill or teams who needed more technical training. But the panels at that conference focused on what they saw as the real root cause: Teams and their stakeholders had trouble understanding the problems they were solving and what they actually needed to build; communicating those needs and ideas clearly among themselves; and ensuring the delivered system matched that intent. It was fundamentally a human communication problem.</p>
<p>Participants at the conference captured this precisely. Dr. Edward E. David Jr. from Bell Labs noted there is often <em>no way even to specify in a logically tight way</em> what the software is supposed to do. Douglas Ross from MIT pointed out the pitfall where you can <em>specify what you are going to do, and then do it</em> as if that solved the problem. Prof. W.L. van der Poel summed up the challenge of incomplete specifications: <em>Most problems simply aren’t defined well enough at the start</em>, so you don’t have the information you need to build the right solution.</p>
<p>These are all problems that cause teams to misunderstand the software they’re creating before any code is written. And they should all sound familiar to developers today who work with AI to generate code.</p>
<p>Much of the problem boils down to what I’ve often called the classic “do what I meant, not what I said” problem. Machines are literal—and people on teams often are too. Our intentions are rarely fully spelled out, and getting everyone aligned on what the software is supposed to do has always required deliberate, often difficult work.</p>
<p>Fred Brooks wrote about this in his classic and widely influential “<a href="https://www.google.com/url?q=https://www.researchgate.net/publication/220477127_No_Silver_Bullet_Essence_and_Accidents_of_Software_Engineering&sa=D&source=docs&ust=1758049259576173&usg=AOvVaw3xG4vlWyKDq4_fbxDJWQ7r" target="_blank" rel="noreferrer noopener">No Silver Bullet</a>” essay. He argued there would never be a single magic process or tool that would make software development easy. Throughout the history of software engineering, teams have been tempted to look for that silver bullet that would make the hard parts of understanding and communication go away. It shouldn’t be surprising that we’d see the same problems that plagued software teams for years reappear when they started to use AI tools.</p>
<p>By the end of the 1970s, these problems were being reframed in terms of <em>quality</em>. Philip Crosby, Joseph M. Juran, and W. Edwards Deming, three people who had enormous influence on the field of quality engineering, each had influential takes on why so many products didn’t do the jobs they were supposed to do, and these ideas are especially true when it comes to software. Crosby argued quality was fundamentally <em>conformance to requirements</em>—if you couldn’t define what you needed clearly, you couldn’t ensure it would be delivered. Juran talked about <em>fitness for use</em>—software needed to solve the user’s real problem in its real context, not just pass some checklists. Deming pushed even further, emphasizing that defects weren’t just technical mistakes but symptoms of broken systems, and <strong>especially poor communication and lack of shared understanding</strong>. He focused on the human side of engineering: creating processes that help people learn, communicate, and improve together.</p>
<p>Through the 1980s, these insights from the quality movement were being applied to software development and started to crystallize into a distinct discipline called <strong>requirements engineering</strong>, focused on identifying, analyzing, documenting, and managing the needs of stakeholders for a product or system. It emerged as its own field, complete with conferences, methodologies, and professional practices. The IEEE Computer Society formalized this with its first International Symposium on Requirements Engineering in 1993, marking its recognition as a core area of software engineering.</p>
<p>The 1990s became a heyday for requirements work, with organizations investing heavily in formal processes and templates, believing that better documentation formats would ensure better software. Standards like IEEE 830 codified the structure of software requirements specifications, and process models such as the software development life cycle and CMM/CMMI emphasized rigorous documentation and repeatable practices. Many organizations invested heavily in designing detailed templates and forms, hoping that filling them out correctly would guarantee the right system. In practice, those templates were useful for consistency and compliance, but they didn’t eliminate the hard part: <em>making sure what was in one person’s head matched what was in everyone else’s</em>.</p>
<p>While the 1990s focused on formal documentation, the Agile movement of the 2000s shifted toward a more lightweight, conversational approach. <strong>User stories</strong> emerged as a deliberate counterpoint to heavyweight specifications—short, simple descriptions of functionality told from the user’s perspective, designed to be easy to write and easy to understand. Instead of trying to capture every detail upfront, user stories served as placeholders for conversations between developers and stakeholders. The practice was deliberately simple, based on the idea that shared understanding comes from dialogue, not documentation, and that requirements evolve through iteration and working software rather than being fixed at the project’s start.</p>
<p>All of this reinforced requirements engineering as a legitimate area of software engineering practice and a real career path with its own set of skills. There is now broad agreement that requirements engineering is a vital area of software engineering focused on surfacing assumptions, clarifying goals, and ensuring everyone involved has the same understanding of what needs to be built.</p>
<h2 class="wp-block-heading">Prompt Engineering <em>Is</em> Requirements Engineering</h2>
<p>Prompt engineering and requirements engineering are literally the same skill—using clarity, context, and intentionality to <em>communicate your intent</em> and ensure what gets built matches what you actually need.</p>
<p>User stories were an evolution from traditional formal specifications: a simpler, more flexible approach to requirements but with the same goal of making sure everyone understood the intent. They gained wide acceptance across the industry because they helped teams recognize that requirements are about creating a shared understanding of the project. User stories gave teams a lightweight way to capture intent and then refine it through conversation, iteration, and working software.</p>
<p>Prompt engineering plays the exact same role. The prompt is our lightweight placeholder for a conversation with the AI. We still refine it through iteration, adding context, clarifying intent, and checking the output against what we actually meant. But it’s the full conversation with the AI and its context that matters; the individual prompts are just a means to communicate the intent and context. Just like Agile shifted requirements from static specs to living conversations, prompt engineering shifts our interaction with AI from single-shot commands to an iterative refinement process—though one where we have to infer what’s missing from the output rather than having the AI ask us clarifying questions.</p>
<p>User stories intentionally focused the engineering work back on people and what’s in their heads. Whether it’s a requirements document in Word or a user story in Jira, the most important thing isn’t the piece of paper, ticket, or document we wrote. The most important thing is that what’s in <em>my</em> head matches what’s in <em>your</em> head and matches what’s in the heads of everyone else involved. The piece of paper is just a convenient way to help us figure out whether or not we agree.</p>
<p>Prompt engineering demands the same outcome. Instead of working with teammates to align mental models, we’re communicating to an AI, but the goal hasn’t changed: producing a high-quality product. The basic principles of quality engineering laid out by Deming, Juran, and Crosby have direct parallels in prompt engineering:</p>
<ul class="wp-block-list">
<li><strong>Deming’s focus on systems and communication:</strong> Prompting failures can be traced to problems with the process, not the people. They typically stem from poor context and communication, not from “bad AI.”</li>
<li><strong>Juran’s focus on fitness for use:</strong> When he framed quality as “fitness for use,” Juran meant that what we produce has to meet real needs—not just look plausible. A prompt is useless if the output doesn’t solve the real problem, and failure to create a prompt that’s fit for use will result in hallucinations.</li>
<li><strong>Crosby’s focus on conformance to requirements: </strong>Prompts must specify not just functional needs but also nonfunctional ones like maintainability and readability. If the context and framing aren’t clear, the AI will generate output that conforms to its training distribution rather than the real intent.</li>
</ul>
<p>One of the clearest ways these quality principles show up in prompt engineering is through what’s now called <strong>context engineering</strong>—deciding what the model needs to see to generate something useful, which typically includes surrounding code, test inputs, expected outputs, design constraints, and other important project information. If you give the AI too little context, it fills in the blanks with what seems most likely based on its training data (which usually isn’t what you had in mind). If you give it too much, it can get buried in information and lose track of what you’re really asking for. That judgment call—what to include, what to leave out—has always been one of the deepest challenges at the heart of requirements work.</p>
<p>There’s another important parallel between requirements engineering and prompt engineering. Back in the 1990s, many organizations fell into what we might call the <em>template trap</em>—believing that the right standardized form or requirements template could guarantee a good outcome. Teams spent huge effort designing and filling out documents. But the real problem was never the format; it was whether the underlying intent was truly shared and understood.</p>
<p>Today, many companies fall into a similar trap with <strong>prompt libraries</strong>, or catalogs of prewritten prompts meant to standardize practice and remove the difficulty of writing prompts. Prompt libraries can be useful as references or starting points, but they don’t replace the core skill of framing the problem and ensuring shared understanding. Just like a perfect requirements template in the 1990s didn’t guarantee the right system, canned prompts today don’t guarantee the right code.</p>
<p>Decades later, the points Brooks made in his “No Silver Bullet” essay still hold. There’s no single template, library, or tool that can eliminate the essential complexity of understanding what needs to be built. Whether it’s requirements engineering in the 1990s or prompt engineering today, the hard part is always the same: building and maintaining a shared understanding of intent. Tools can help, but they don’t replace the discipline.</p>
<p>AI raises the stakes on this core communication problem. Unlike your teammates, the AI won’t push back or ask questions—it just generates something that looks plausible based on the prompt that it was given. That makes clear communication of requirements even more important.</p>
<p>The alignment of understanding that serves as the foundation of requirements engineering is even more important when we bring AI tools into the project, <em>because AI doesn’t have judgment</em>. It has a huge model, but it only works effectively when directed well. The AI needs the context that we provide in the form of code, documents, and other project information and artifacts, which means the only thing it knows about the project is what we tell it. That’s why it’s especially important to have ways to check and verify that what the AI “knows” really matches what <em>we</em> know.</p>
<p>The classic requirements engineering problems—especially the poor communication and lack of shared understanding that Deming warned about and that requirements engineers and Agile practitioners have spent decades trying to address—are compounded when we use AI. We’re still facing the same issues of communicating intent and specifying requirements clearly. But now those requirements aren’t just for the team to read; they’re used to establish the AI’s context. Small variations in problem framing can have a profound impact on what the AI produces. Using natural language to increasingly replace the structured, unambiguous syntax of code removes a critical guardrail that’s traditionally helped protect software from failed understanding.</p>
<p>The tools of requirements engineering help us make up for that missing guardrail. Agile’s iterative process of the developer understanding requirements, building working software, and continuously reviewing it with the product owner was a check that ensured misunderstandings were caught early. The more we eliminate that extra step of translation and understanding by having AI generate code directly from requirements, the more important it becomes for everyone involved—stakeholders and engineers alike—to have a truly shared understanding of what needs to be built.</p>
<p>When people on teams work together to build software, they spend a lot of time talking and asking questions to understand what they need to build. Working with an AI follows a different kind of feedback cycle—you don’t know it’s missing context until you see what it produces, and you often need to reverse engineer what it did to figure out what’s missing. But both types of interaction require the same fundamental skills around context and communication that requirements engineers have always practiced.</p>
<p>This shows up in practice in several ways:</p>
<ul class="wp-block-list">
<li><strong>Context and shared understanding are foundational.</strong> Good requirements help teams understand what behavior matters and how to know when it’s working—capturing both functional requirements (what to build) and nonfunctional requirements (how well it should work). The same distinction applies to prompting but with fewer chances to course-correct. If you leave out something critical, the AI doesn’t push back; it just responds with whatever seems plausible. Sometimes that output looks reasonable until you try to use it and realize the AI was solving a different problem.</li>
<li><strong>Scoping takes real judgment.</strong> Developers who struggle to use AI for code typically fall into two extremes: providing too little context (a single sentence that produces something that looks right but fails in practice) or pasting in entire files expecting the model to zoom in on the right method. Unless you explicitly call out what’s important—both functional and nonfunctional requirements—it doesn’t know what matters.</li>
<li><strong>Context drifts, and the model doesn’t know it’s drifted.</strong> With human teams, understanding shifts gradually through check-ins and conversations. With prompting, drift can happen in just a few exchanges. The model might still be generating fluent responses until it suggests a fix that makes no sense. That’s a signal that the context has drifted, and you need to reframe the conversation—perhaps by asking the model to explain the code or restate what it thinks it’s doing.</li>
</ul>
<p>History keeps repeating itself: From binders full of scattered requirements to IEEE standards to user stories to today’s prompts, the discipline is the same. We succeed when we treat it as real engineering. <strong>Prompt engineering is the next step in the evolution of requirements engineering.</strong> It’s how we make sure we have a shared understanding between everyone on the project—including the AI—and it demands the same care, clarity, and deliberate communication we’ve always needed to avoid misunderstandings and build the right thing.</p>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>MCP in Practice</title>
<link>https://www.oreilly.com/radar/mcp-in-practice/</link>
<comments>https://www.oreilly.com/radar/mcp-in-practice/#respond</comments>
<pubDate>Tue, 16 Sep 2025 11:22:59 +0000</pubDate>
<dc:creator><![CDATA[Ilan Strauss, Sruly Rosenblat, Isobel Moure and Tim O’Reilly]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Research]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17440</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/12/in-dis-canyon-5b-1400x950.jpg"
medium="image"
type="image/jpeg"
/>
<custom:subtitle><![CDATA[Mapping Power, Concentration, and Usage in the Emerging AI Developer Ecosystem]]></custom:subtitle>
<description><![CDATA[The following was originally published in Asimov’s Addendum, September 11, 2025. Learn more about the AI Disclosures Project here. 1. The Rise and Rise of MCP Anthropic’s Model Context Protocol (MCP) was released in November 2024 as a way to make tools and platforms model-agnostic. MCP works by defining servers and clients. MCP servers are local or remote end […]]]></description>
<content:encoded><![CDATA[
<p class="has-text-align-center has-cyan-bluish-gray-background-color has-background"><em>The following was <a href="https://asimovaddendum.substack.com/p/read-write-act-inside-the-mcp-server" target="_blank" rel="noreferrer noopener">originally published in </a></em><a href="https://asimovaddendum.substack.com/p/read-write-act-inside-the-mcp-server" target="_blank" rel="noreferrer noopener">Asimov’s Addendum</a><em>,</em> <em>September 11, 2025.</em><br><br><em>Learn more about the AI Disclosures Project <a href="https://www.ssrc.org/programs/ai-disclosures-project/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
<h2 class="wp-block-heading"><strong>1. The Rise and Rise of MCP</strong></h2>
<p>Anthropic’s<a href="https://www.anthropic.com/news/model-context-protocol" target="_blank" rel="noreferrer noopener"> Model Context Protocol</a> (MCP) was released in November 2024 as a way to make tools and platforms model-agnostic. MCP works by defining servers and clients. MCP servers are local or remote end points where tools and resources are defined. For example, GitHub released an MCP server that allows LLMs to both read from and write to GitHub. MCP clients are the connection from an AI application to MCP servers—they allow an LLM to interact with context and tools from different servers. An example of an MCP client is Claude Desktop, which allows the Claude models to interact with thousands of MCP servers.</p>
<p><strong>In a relatively short time, MCP has become the backbone of hundreds of AI pipelines and applications</strong>. Major players like Anthropic and OpenAI have built it into their products. Developer tools such as Cursor (a coding-focused text editor or IDE) and productivity apps like <a href="https://www.raycast.com/EvanZhouDev/mcp" target="_blank" rel="noreferrer noopener">Raycast</a> also use MCP. Additionally, thousands of <a href="https://arxiv.org/abs/2506.13538" target="_blank" rel="noreferrer noopener">developers</a> use it to integrate AI models and access external tools and data without having to build an entire ecosystem from scratch.</p>
<p>In previous work published with <em>AI Frontiers</em>, we argued that <a href="https://ai-frontiers.org/articles/open-protocols-prevent-ai-monopolies" target="_blank" rel="noreferrer noopener">MCP can act</a> as a great unbundler of “context”—the data that helps AI applications provide more relevant answers to consumers. In doing so, it can help decentralize AI markets. <strong>We argued that, for MCP to truly achieve its goals, it requires support from</strong>:</p>
<ol class="wp-block-list">
<li><strong>Open APIs</strong>: So that MCP applications can access third-party tools for agentic use (<em>write</em> actions) and context (<em>read</em>)</li>
<li><strong>Fluid memory</strong>: Interoperable LLM memory standards, accessed via MCP-like open protocols, so that the memory context accrued at OpenAI and other leading developers does not get stuck there, preventing downstream innovation</li>
</ol>
<p>We expand upon these two points in a <a href="https://ssrc-static.s3.us-east-1.amazonaws.com/Protocols-and-Power-Moure-OReilly-Strauss_SSRC_08272025.pdf" target="_blank" rel="noreferrer noopener">recent policy note</a>, for those looking to dig deeper.</p>
<p>More generally, <strong>we argue that protocols</strong>,<strong> like MCP</strong>,<strong> are actually </strong><a href="https://asimovaddendum.substack.com/p/disclosures-i-do-not-think-that-word" target="_blank" rel="noreferrer noopener"><strong>foundational “rules of the road” for AI markets</strong></a>, <em>whereby open disclosure and communication standards are built</em> <em>into the network itself</em>, rather than imposed <em>after the fact</em> by regulators. Protocols are fundamentally market-shaping devices, architecting markets through the permissions, rules, and interoperability of the network itself. They can have a big impact on how the commercial markets built on top of them function too.</p>
<h3 class="wp-block-heading"><strong>1.1 But how is the MCP ecosystem evolving?</strong></h3>
<p><strong>Yet we don’t have a clear idea of the shape of the MCP ecosystem today</strong>.<strong> </strong><em>What are the most common use cases of MCP? What sort of access is being given by MCP servers and used by MCP clients? Is the data accessed via MCP “read-only” for context, or does it allow agents to “write” and interact with it—for example, by editing files or sending emails?</em></p>
<p>To begin answering these questions, we look at the tools and context which AI agents use via <em>MCP servers</em>. This gives us a clue about what is being built and what is getting attention. In this article, we don’t analyze <em>MCP clients</em>—the applications that use MCP servers. We instead limit our analysis to what MCP servers are making available for building.</p>
<p>We assembled a large dataset of MCP servers (n = 2,874), scraped from <a href="https://www.pulsemcp.com/" target="_blank" rel="noreferrer noopener">Pulse</a>.<sup>1</sup> We then enriched it with GitHub star-count data on each server. On GitHub, stars are similar to Facebook “likes,” and <a href="https://homepages.dcc.ufmg.br/~mtov/pub/2018-jss-github-stars.pdf" target="_blank" rel="noreferrer noopener">developers use them</a> to show appreciation, bookmark projects, or indicate usage.</p>
<p>In practice, <em>while there were plenty of MCP servers, we found that the top few garnered most of the attention and, likely by extension, most of the use.</em> <strong>Just the top 10 servers had nearly half of all GitHub stars given to MCP servers</strong>.</p>
<p><strong>Some of our takeaways are:</strong></p>
<ol class="wp-block-list">
<li><em>MCP usage appears to be fairly concentrated</em>. This means that, if left unchecked, a small number of servers and (by extension) APIs could have outsize control over the MCP ecosystem being created.</li>
<li><em>MCP use (tools and data being accessed) is dominated by just three categories</em>: Database & Search (RAG), Computer & Web Automation, and Software Engineering. Together, they received nearly three-quarters (72.6%) of all stars on GitHub (which we proxy for usage).</li>
<li>Most MCP servers support both <em>read </em>(access context) and <em>write</em> (change context) operations, showing that developers want their agents to be able to act on context, not just consume it.</li>
</ol>
<h2 class="wp-block-heading"><strong>2. Findings</strong></h2>
<p><em>To start with, we analyzed the MCP ecosystem for concentration risk.</em></p>
<h3 class="wp-block-heading"><strong>2.1 MCP server use is concentrated</strong></h3>
<p><strong>We found that MCP usage is concentrated among several key MCP servers</strong>, <em>judged by the number of GitHub stars each repo received</em>.</p>
<p>Despite there being thousands of MCP servers, <strong>the top 10 servers make up nearly half (45.7%) of all GitHub stars given to MCP servers</strong> (pie chart below) and the top 10% of servers make up 88.3% of all GitHub stars (not shown).</p>
<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1444" height="742" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2.png" alt="The top 10 servers received 45.7% of all GitHub stars in our dataset of 2,874 servers." class="wp-image-17441" title="Chart" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2.png 1444w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2-300x154.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2-768x395.png 768w" sizes="(max-width: 1444px) 100vw, 1444px" /><figcaption class="wp-element-caption"><em>The top 10 servers received 45.7% of all GitHub stars in our dataset of </em>2,874<em> servers.</em></figcaption></figure>
<p><em>This means that the majority of real-world MCP users are likely relying on the same few services made available via a handful of APIs</em>. This concentration likely stems from network effects and practical utility: All developers gravitate toward servers that solve universal problems like web browsing, database access, and integration with widely used platforms like GitHub, Figma, and Blender. This concentration pattern seems typical of developer-tool ecosystems. A few well-executed, broadly applicable solutions tend to dominate. Meanwhile, more specialized tools occupy smaller niches.</p>
<h3 class="wp-block-heading"><strong>2.2 The top 10 MCP servers really matter</strong></h3>
<p>Next, the top 10 MCP servers are shown in the table below, along with their star count and what they do.</p>
<p><strong>Among the top 10 MCP servers,</strong> <em>GitHub,</em> <em>Repomix</em>, <em>Context7</em>, and <em>Framelink</em> are built to assist with software development: <em>Context7</em> and <em>Repomix</em> by gathering context, <em>GitHub</em> by allowing agents to interact with projects, and <em>Framelink </em>by passing on the design specifications from <em>Figma</em> directly to the model. The <em>Blender</em> server allows agents to create 3D models of anything, using the popular open source <em>Blender</em> application. Finally, <em>Activepieces</em> and <em>MindsDB</em> connect the agent to multiple APIs with one standardized interface: in <em>MindsDB</em>’s case, primarily to read data from databases, and in <em>Activepieces</em> to automate services.</p>
<figure class="wp-block-image size-full"><img decoding="async" width="1296" height="1260" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3.png" alt="The top 10 MCP servers with short descriptions, design courtesy of Claude." class="wp-image-17442" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3.png 1296w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3-300x292.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3-768x747.png 768w" sizes="(max-width: 1296px) 100vw, 1296px" /><figcaption class="wp-element-caption"><em>The top 10 MCP servers with short descriptions, design courtesy of Claude.</em></figcaption></figure>
<p><strong>The dominance of agentic browsing</strong>, <strong>in the form of <em>Browser Use</em> (61,000 stars) and <em>Playwright MCP</em> (18,425 stars)</strong>,<strong> stands out</strong>. This reflects the fundamental need for AI systems to interact with web content. These tools allow AI to navigate websites, click buttons, fill out forms, and extract data just like a human would. <em>Agentic browsing has surged</em>,<em> even though it’s far less token-efficient than calling an API</em>. Browsing agents often need to wade through multiple pages of boilerplate to extract slivers of data a single API request could return. Because many services lack usable APIs or tightly gate them, browser-based agents are often the simplest—sometimes the only—way to integrate, underscoring the limits of today’s APIs.</p>
<p><strong>Some of the top servers are unofficial. </strong>Both the <em>Framelink</em> and <em>Blender MCP</em> are servers that interact with just a single application, but they are both “unofficial” products. This means that they are not officially endorsed by the developers of the application they are integrating with—those who own the underlying service or API (e.g., GitHub, Slack, Google). Instead, they are built by independent developers who create a bridge between an AI client and a service—often by reverse-engineering APIs, wrapping unofficial SDKs, or using browser automation to mimic user interactions.</p>
<p>It is healthy that third-party developers can build their own MCP servers, since this openness encourages innovation. But it also introduces an intermediary layer between the user and the API, which brings risks around trust, verification, and even potential abuse. With open source local servers, the code is transparent and can be vetted. By contrast, remote third-party servers are harder to audit, since users must trust code they can’t easily inspect.</p>
<p><strong>At a deeper level, the repos that currently dominate MCP servers highlight three encouraging facts about the MCP ecosystem:</strong></p>
<ol class="wp-block-list">
<li><strong>First, several prominent MCP servers support multiple third-party services for their functionality. </strong><em>MindsDB</em> and <em>Activepieces</em> serve as gateways to multiple (often competing) service providers through a single server. <em>MindsDB</em> allows developers to query different databases like PostgreSQL, MongoDB, and MySQL through a single interface, while <em>Taskmaster </em>allows the agent to delegate tasks to a range of AI models from OpenAI, Anthropic, and Google, all without changing servers.</li>
<li><strong>Second, agentic browsing MCP servers are being used to get around potentially restrictive APIs.</strong> As noted above, <em>Browser Use </em>and <em>Playwright</em> access internet services through a web browser, helping to bypass API restrictions, but they instead run up against anti-bot protections. This circumvents the limitations that APIs can impose on what developers are able to build.</li>
<li><strong>Third, some MCP servers do their processing on the developer’s computer (locally)</strong>,<strong> making them less dependent on a vendor maintaining API access</strong>.<strong><em> </em></strong><em>Some MCP servers examined here can run entirely on a local computer without sending data to the cloud—meaning that no gatekeeper has the power to cut you off</em>. Of the 10 MCP servers examined above, only <em>Framelink</em>, <em>Context7,</em> and <em>GitHub</em> rely on just a single cloud-only API dependency that can’t be run locally end-to-end on your machine. <em>Blender</em> and <em>Repomix</em> are completely open source and don’t require any internet access to work, while <em>MindsDB</em>, <em>Browser Use, </em>and <em>Activepieces </em>have local open source implementations.</li>
</ol>
<h3 class="wp-block-heading"><strong>2.3 The three categories that dominate MCP use</strong></h3>
<p><em>Next, we grouped MCP servers into different categories based on their functionality</em>.</p>
<p>When we analyzed what types of servers are most popular, we found that three dominated: <strong>Computer & Web Automation (24.8%)</strong>,<strong> Software Engineering (24.7%)</strong>, and <strong>Database & Search (23.1%)</strong>.</p>
<figure class="wp-block-image size-full"><img decoding="async" width="1456" height="683" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4.png" alt="Software engineering, computer and web automation, and database and search received 72.6% of all stars given to MCP servers." class="wp-image-17443" title="Chart" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4-300x141.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4-768x360.png 768w" sizes="(max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption"><em>Software Engineering, Computer & Web Automation, and Database & Search received 72.6% of all stars given to MCP servers.</em></figcaption></figure>
<p>Widespread use of Software Engineering (24.7%) MCP servers aligns with <a href="https://arxiv.org/abs/2503.04761" target="_blank" rel="noreferrer noopener">Anthropic’s economic index</a>, which found that an outsize portion of AI interactions were related to software development.</p>
<p>The popularity of both Computer & Web Automation (24.8%) and Database & Search (23.1%) also makes sense. Before the advent of MCP, web scraping and database search were highly integrated applications across platforms like ChatGPT, Perplexity, and Gemini. With MCP, however, users can now access that same search functionality and connect their agents to any database with minimal effort. In other words, MCP’s <a href="https://ai-frontiers.org/articles/open-protocols-prevent-ai-monopolies" target="_blank" rel="noreferrer noopener">unbundling</a> effect is highly visible here.</p>
<h3 class="wp-block-heading"><strong>2.4 Agents interact with their environments</strong></h3>
<p><em>Lastly, we analyzed the capabilities of these servers</em>: Are they allowing AI applications just to access data and tools (<em>read</em>), or instead do agentic operations with them (<em>write</em>)?</p>
<p><strong>Across all but two of the MCP server categories looked at, the most popular MCP servers supported both <em>reading</em> (access context)<em> </em>and <em>writing</em> (agentic) operations</strong>—shown in turquoise. The prevalence of servers with combined read and write access suggests that agents are not being built just to answer questions based on data but also to take action and interact with services on a user’s behalf.</p>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="974" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5.png" alt="Showing MCP servers by category. Dotted red line at 10,000 stars (likes). The most popular servers support both read and write operations by agents. In contrast, almost no servers support just write operations." class="wp-image-17444" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5-300x201.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5-768x514.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption"><em>Showing MCP servers by category. Dotted red line at 10,000 stars (likes). The most popular servers support both read and write operations by agents. In contrast, almost no servers support just write operations.</em></figcaption></figure>
<p>The two exceptions are Database & Search (RAG) and Finance MCP servers, in which <em>read-only</em> access is a common permission given. This is likely because data integrity is critical to ensuring reliability.</p>
<h2 class="wp-block-heading"><strong>3. The Importance of Multiple Access Points</strong></h2>
<p>A few implications of our analysis can be drawn out at this preliminary stage.</p>
<p><strong>First, concentrated MCP server use compounds the risks of API access being restricted</strong>. As we discussed in “<a href="https://asimovaddendum.substack.com/p/protocols-and-power" target="_blank" rel="noreferrer noopener">Protocols and Power</a>,” MCP remains constrained by “<em>what a particular service (such as GitHub or Slack) happens to expose through its API</em>.” A few powerful digital service providers have the power to shut down access to their servers.</p>
<p><em>One important hedge against API gatekeeping is that many of the top servers try not to rely on a single provide</em>r. <strong>In addition</strong>,<strong> the following two safeguards are relevant</strong>:</p>
<ul class="wp-block-list">
<li><strong>They offer local processing</strong> of data on a user’s machine whenever possible, instead of sending the data for processing to a third-party server. Local processing ensures that functionality cannot be restricted.</li>
<li>If running a service locally is not possible (e.g., email or web search), the server should still <strong>support multiple avenues of getting at the needed context through competing APIs</strong>. For example, <em>MindsDB</em> functions as a gateway to multiple data sources, so instead of relying on just one database to read and write data, it goes to great lengths to support multiple databases in one unified interface, essentially making the backend tools interchangeable.</li>
</ul>
<p><strong>Second, our analysis points to the fact that current restrictive API access policies are not sustainable. </strong>Web scraping and bots, accessed via MCP servers, are probably being used (at least in part) to circumvent overly restrictive API access, complicating the <a href="https://slate.com/technology/2025/08/uk-online-safety-act-reddit-wikipedia-open-internet.html" target="_blank" rel="noreferrer noopener">increasingly common</a> practice of banning bots. Even OpenAI is coloring outside the API lines, using a third-party service to access Google Search’s results through web scraping, thereby <a href="https://www.theinformation.com/articles/openai-challenging-google-using-search-data" target="_blank" rel="noreferrer noopener">circumventing its restrictive API</a>.</p>
<p><strong>Expanding structured API access in a meaningful way is vital</strong>. <em>This ensures that legitimate AI automation runs through stable, documented end points.</em> Otherwise, developers resort to brittle browser automation where privacy and authorization have not been properly addressed. Regulatory guidance <a href="https://ai-frontiers.org/articles/open-protocols-prevent-ai-monopolies" target="_blank" rel="noreferrer noopener">could push</a> the market in this direction, as with open banking in the US.</p>
<p><strong>Finally, encouraging greater transparency and disclosure </strong>could help identify where the bottlenecks in the MCP ecosystem are.</p>
<ul class="wp-block-list">
<li>Developers operating popular MCP servers (above a certain usage threshold) or providing APIs used by top servers should report usage statistics, access denials, and rate-limiting policies. This data would help regulators identify emerging bottlenecks before they become entrenched. <em>GitHub might facilitate this by encouraging these disclosures, for example</em>.</li>
<li>Additionally, MCP servers above certain usage thresholds should clearly list their dependencies on external APIs and what fallback options exist if the primary APIs become unavailable. This is not only helpful in determining the market structure, but also essential information for security and robustness for downstream applications.</li>
</ul>
<p>The goal is not to eliminate all concentration in the network but to ensure that the MCP ecosystem remains contestable, with multiple viable paths for innovation and user choice. By addressing both technical architecture and market dynamics, these suggested tweaks could help MCP achieve its potential as a democratizing force in AI development, rather than merely shifting bottlenecks from one layer to another.</p>
<hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
<h2 class="wp-block-heading">Footnotes</h2>
<ol class="wp-block-list">
<li>For this analysis, we categorized each repo into one of 15 categories using GPT-5 mini. We then human-reviewed and edited the top 50 servers that make up around 70% of the total star count in our dataset.</li>
</ol>
<hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
<h2 class="wp-block-heading"><strong>Appendix</strong></h2>
<h3 class="wp-block-heading"><strong>Dataset</strong></h3>
<p>The full dataset, along with descriptions of the categories, can be found here (constructed by Sruly Rosenblat):</p>
<p><a href="https://huggingface.co/datasets/sruly/MCP-In-Practice" target="_blank" rel="noreferrer noopener">https://huggingface.co/datasets/sruly/MCP-In-Practice</a></p>
<h3 class="wp-block-heading"><strong>Limitations</strong></h3>
<p>There are a few limitations to our preliminary research:</p>
<ul class="wp-block-list">
<li>GitHub stars aren’t a measure of download counts or even necessarily a repo’s popularity.</li>
<li>Only the name and description were used when categorizing repos with the LLM.</li>
<li>Categorization was subject to both human and AI errors and many servers would likely fit into multiple categories.</li>
<li>We only used the Pulse list for our dataset; other lists had different servers (e.g., Browser Use isn’t on mcpmarket.com).</li>
<li>We excluded some repos from our analysis, such as those that had multiple servers and those we weren’t able to fetch the star count for. We may miss some popular servers by doing this.</li>
</ul>
<h3 class="wp-block-heading"><strong>MCP Server Use Over Time</strong></h3>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="916" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6.png" alt="The growth of the top nine repos’ star count over time from MCP’s launch date on November 25, 2024, until September 2025. NOTE: We were only able to track the Browser-Use’s repo until 40,000 stars; hence the flat line for its graph. In reality, roughly 21,000 stars were added over the next few months (the other graphs in this blog are properly adjusted)." class="wp-image-17445" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6-300x189.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6-768x483.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption"><em>The growth of the top nine repos’ star count over time from MCP’s launch date on November 25, 2024, until September 2025. </em><br><br><em>Note: We were only able to track Browser Use’s repo until 40,000 stars; hence the flat line for its graph. In reality, roughly 21,000 stars were added over the next few months. (The other graphs in this post are properly adjusted.)</em></figcaption></figure>
<p></p>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/mcp-in-practice/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>When AI Writes Code, Who Secures It?</title>
<link>https://www.oreilly.com/radar/when-ai-writes-code-who-secures-it/</link>
<comments>https://www.oreilly.com/radar/when-ai-writes-code-who-secures-it/#respond</comments>
<pubDate>Mon, 15 Sep 2025 10:37:10 +0000</pubDate>
<dc:creator><![CDATA[Chloé Messdaghi]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Security]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17436</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/06/hacker-1944688_crop-b34a76e3cab9c07c5900b706c70a12c3-1.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[In early 2024, a striking deepfake fraud case in Hong Kong brought the vulnerabilities of AI-driven deception into sharp relief. A finance employee was duped during a video call by what appeared to be the CFO—but was, in fact, a sophisticated AI-generated deepfake. Convinced of the call’s authenticity, the employee made 15 transfers totaling over […]]]></description>
<content:encoded><![CDATA[
<p>In early 2024, a striking deepfake fraud case in Hong Kong brought the vulnerabilities of AI-driven deception into sharp relief. A finance employee was duped during a video call by what appeared to be the CFO—but was, in fact, a sophisticated AI-generated deepfake. Convinced of the call’s authenticity, the employee made <a href="https://www.ft.com/content/b977e8d4-664c-4ae4-8a8e-eb93bdf785ea?" target="_blank" rel="noreferrer noopener">15 transfers totaling over $25 million</a> to fraudulent bank accounts before realizing it was a scam.</p>
<p>This incident exemplifies more than just technological trickery—it signals how trust in what we see and hear can be weaponized, especially as AI becomes more deeply integrated into enterprise tools and workflows. From embedded LLMs in enterprise systems to autonomous agents diagnosing and even repairing issues in live environments, AI is transitioning from novelty to necessity. Yet as it evolves, so too do the gaps in our traditional security frameworks—designed for static, human-written code—revealing just how unprepared we are for systems that generate, adapt, and behave in unpredictable ways.</p>
<h2 class="wp-block-heading">Beyond the CVE Mindset</h2>
<p>Traditional secure coding practices revolve around known vulnerabilities and patch cycles. AI changes the equation. A line of code can be generated on the fly by a model, shaped by manipulated prompts or data—creating new, unpredictable categories of risk like prompt injection or emergent behavior outside traditional taxonomies.</p>
<p>A 2025 Veracode study found that <a href="https://www.techradar.com/pro/nearly-half-of-all-code-generated-by-ai-found-to-contain-security-flaws-even-big-llms-affected?" target="_blank" rel="noreferrer noopener">45% of all AI-generated code contained vulnerabilities</a>, with common flaws like weak defenses against XSS and log injection. (Some languages performed more poorly than others. Over 70% of AI-generated Java code had a security issue, for instance.) Another 2025 study showed that repeated refinement can make things worse: After just five iterations, critical vulnerabilities rose by <a href="https://arxiv.org/abs/2506.11022?" target="_blank" rel="noreferrer noopener">37.6%</a>.</p>
<p>To keep pace, frameworks like the <a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" target="_blank" rel="noreferrer noopener">OWASP Top 10 for LLMs</a> have emerged, cataloging AI-specific risks such as data leakage, model denial of service, and prompt injection. They highlight how current security taxonomies fall short—and why we need new approaches that model AI threat surfaces, share incidents, and iteratively refine risk frameworks to reflect how code is created and influenced by AI.</p>
<h2 class="wp-block-heading">Easier for Adversaries</h2>
<p>Perhaps the most alarming shift is how AI lowers the barrier to malicious activity. What once required deep technical expertise can now be done by anyone with a clever prompt: generating scripts, launching phishing campaigns, or manipulating models. AI doesn’t just broaden the attack surface; it makes it easier and cheaper for attackers to succeed without ever writing code.</p>
<p>In 2025, researchers unveiled PromptLocker, the first AI-powered ransomware. Though only a proof of concept, it showed how theft and encryption could be automated with a local LLM at remarkably low cost: about <a href="https://www.tomshardware.com/tech-industry/cyber-security/ai-powered-promptlocker-ransomware-is-just-an-nyu-research-project-the-code-worked-as-a-typical-ransomware-selecting-targets-exfiltrating-selected-data-and-encrypting-volumes" target="_blank" rel="noreferrer noopener">$0.70 per full attack using commercial APIs</a>—and essentially free with open source models. That kind of affordability could make ransomware cheaper, faster, and more scalable than ever.</p>
<p>This democratization of offense means defenders must prepare for attacks that are more frequent, more varied, and more creative. The <a href="https://github.com/mitre/advmlthreatmatrix" target="_blank" rel="noreferrer noopener">Adversarial ML Threat Matrix</a>, founded by Ram Shankar Siva Kumar during his time at Microsoft, helps by enumerating threats to machine learning and offering a structured way to anticipate these evolving risks. (He’ll be discussing the difficulty of securing AI systems from adversaries at <a href="https://learning.oreilly.com/live-events/security-superstream-secure-code-in-the-age-of-ai/0642572204099/" target="_blank" rel="noreferrer noopener">O’Reilly’s upcoming Security Superstream</a>.)</p>
<h2 class="wp-block-heading">Silos and Skill Gaps</h2>
<p>Developers, data scientists, and security teams still work in silos, each with different incentives. Business leaders push for rapid AI adoption to stay competitive, while security leaders warn that moving too fast risks catastrophic flaws in the code itself.</p>
<p>These tensions are amplified by a widening skills gap: Most developers lack training in AI security, and many security professionals don’t fully understand how LLMs work. As a result, the old patchwork fixes feel increasingly inadequate when the models are writing and running code on their own.</p>
<p>The rise of “vibe coding”—relying on LLM suggestions without review—captures this shift. It accelerates development but introduces hidden vulnerabilities, leaving both developers and defenders struggling to manage novel risks.</p>
<h2 class="wp-block-heading">From Avoidance to Resilience</h2>
<p>AI adoption won’t stop. The challenge is moving from avoidance to resilience. Frameworks like <a href="https://www.databricks.com/blog/announcing-databricks-ai-security-framework-20" target="_blank" rel="noreferrer noopener">Databricks’ AI Risk Framework (DASF)</a> and the <a href="https://www.nist.gov/itl/ai-risk-management-framework" target="_blank" rel="noreferrer noopener">NIST AI Risk Management Framework</a> provide practical guidance on embedding governance and security directly into AI pipelines, helping organizations move beyond ad hoc defenses toward systematic resilience. The goal isn’t to eliminate risk but to enable innovation while maintaining trust in the code AI helps produce.</p>
<h2 class="wp-block-heading">Transparency and Accountability</h2>
<p>Research shows AI-generated code is often simpler and more repetitive, <a href="https://arxiv.org/abs/2508.21634" target="_blank" rel="noreferrer noopener">but also more vulnerable</a>, with risks like hardcoded credentials and path traversal exploits. Without observability tools such as prompt logs, provenance tracking, and audit trails, developers can’t ensure reliability or accountability. In other words, AI-generated code is more likely to introduce high-risk security vulnerabilities.</p>
<p>AI’s opacity compounds the problem: A function may appear to “work” yet conceal vulnerabilities that are difficult to trace or explain. Without explainability and safeguards, autonomy quickly becomes a recipe for insecure systems. Tools like <a href="https://atlas.mitre.org/matrices/ATLAS" target="_blank" rel="noreferrer noopener">MITRE ATLAS</a> can help by mapping adversarial tactics against AI models, offering defenders a structured way to anticipate and counter threats.</p>
<h2 class="wp-block-heading">Looking Ahead</h2>
<p>Securing code in the age of AI requires more than patching—it means breaking silos, closing skill gaps, and embedding resilience into every stage of development. The risks may feel familiar, but AI scales them dramatically. Frameworks like Databricks’ AI Risk Framework (DASF) and the NIST AI Risk Management Framework provide structures for governance and transparency, while MITRE ATLAS maps adversarial tactics and real-world attack case studies, giving defenders a structured way to anticipate and mitigate threats to AI systems.</p>
<p>The choices we make now will determine whether AI becomes a trusted partner—or a shortcut that leaves us exposed.</p>
<hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
<p class="has-cyan-bluish-gray-background-color has-background"><strong><em>Ensure your systems remain secure in an increasingly AI-driven world</em></strong><br><br><em>Join Chloé Messdaghi and a lineup of top security professionals and technologists for O’Reilly’s Security Superstream: Secure Code in the Age of AI. They’ll share practical insights, real-world experiences, and emerging trends that will help you code more securely, build and deploy secure models, and protect against AI-specific threats. It’s free for O’Reilly members. <a href="https://learning.oreilly.com/live-events/security-superstream-secure-code-in-the-age-of-ai/0642572204099/" target="_blank" rel="noreferrer noopener">Save your seat here</a>.</em><br><br><em>Not a member? <a href="https://www.oreilly.com/start-trial/" target="_blank" rel="noreferrer noopener">Sign up for a free 10-day trial</a> to attend—and check out all the other great resources on O’Reilly.</em></p>
<p></p>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/when-ai-writes-code-who-secures-it/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Taming Chaos with Antifragile GenAI Architecture</title>
<link>https://www.oreilly.com/radar/taming-chaos-with-antifragile-genai-architecture/</link>
<comments>https://www.oreilly.com/radar/taming-chaos-with-antifragile-genai-architecture/#respond</comments>
<pubDate>Thu, 11 Sep 2025 10:49:38 +0000</pubDate>
<dc:creator><![CDATA[Shreshta Shyamsundar]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17429</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/AI-in-Chaos-2.jpg"
medium="image"
type="image/jpeg"
/>
<custom:subtitle><![CDATA[Turn Volatility into Your Greatest Strategic Asset]]></custom:subtitle>
<description><![CDATA[What if uncertainty wasn’t something to simply endure but something to actively exploit? The convergence of Nassim Taleb’s antifragility principles with generative AI capabilities is creating a new paradigm for organizational design powered by generative AI—one where volatility becomes fuel for competitive advantage rather than a threat to be managed. The Antifragility Imperative Antifragility transcends […]]]></description>
<content:encoded><![CDATA[
<p>What if uncertainty wasn’t something to simply endure but something to actively exploit? The convergence of <a href="https://en.wikipedia.org/wiki/Antifragile_(book)" target="_blank" rel="noreferrer noopener">Nassim Taleb’s antifragility principles</a> with generative AI capabilities is creating a new paradigm for organizational design powered by generative AI—one where volatility becomes fuel for competitive advantage rather than a threat to be managed.</p>
<h2 class="wp-block-heading"><strong>The Antifragility Imperative</strong></h2>
<p>Antifragility transcends resilience. While resilient systems bounce back from stress and robust systems resist change, antifragile systems actively improve when exposed to volatility, randomness, and disorder. This isn’t just theoretical—it’s a mathematical property where systems exhibit <strong>positive convexity</strong>, gaining more from favorable variations than they lose from unfavorable ones.</p>
<p>To visualize the concept of positive convexity in antifragile systems, consider a graph where the x-axis represents stress or volatility and the y-axis represents the system’s response. In such systems, the curve is upward bending (convex), demonstrating that the system gains more from positive shocks than it loses from negative ones—by an accelerating margin.</p>
<p>The convex (upward-curving) line shows that small positive shocks yield increasingly larger gains, while equivalent negative shocks cause comparatively smaller losses.</p>
<p>For comparison, a straight line representing a fragile or linear system shows a proportional (linear) response, with gains and losses of equal magnitude on either side.</p>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1066" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1.png" alt="Graph illustrating positive convexity: Antifragile systems benefit disproportionately from positive variations compared to equivalent negative shocks." class="wp-image-17430" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1-300x200.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1-768x512.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1-1536x1023.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Graph illustrating positive convexity: Antifragile systems benefit disproportionately from positive variations compared to equivalent negative shocks.</em></figcaption></figure>
<p>The concept emerged from Taleb’s observation that certain systems don’t just survive Black Swan events—they thrive because of them. Consider how Amazon’s supply chain AI during the 2020 pandemic demonstrated true antifragility. When lockdowns disrupted normal shipping patterns and consumer behavior shifted dramatically, Amazon’s demand forecasting systems didn’t just adapt; they used the chaos as training data. Every stockout, every demand spike for unexpected products like webcams and exercise equipment, every supply chain disruption became input for improving future predictions. The AI learned to identify early signals of changing consumer behavior and supply constraints, making the system more robust for future disruptions.</p>
<p>For technology organizations, this presents a fundamental question: How do we design systems that don’t just survive unexpected events but benefit from them? The answer lies in implementing specific generative AI architectures that can learn continuously from disorder.</p>
<h2 class="wp-block-heading"><strong>Generative AI: Building Antifragile Capabilities</strong></h2>
<p>Certain generative AI implementations can exhibit antifragile characteristics when designed with continuous learning architectures. Unlike static models deployed once and forgotten, these systems incorporate feedback loops that allow real-time adaptation without full model retraining—a critical distinction given the resource-intensive nature of training large models.</p>
<p>Netflix’s recommendation system demonstrates this principle. Rather than retraining its entire foundation model, the company continuously updates personalization layers based on user interactions. When users reject recommendations or abandon content midstream, this negative feedback becomes valuable training data that refines future suggestions. The system doesn’t just learn what users like. It becomes expert at recognizing what they’ll hate, leading to higher overall satisfaction through accumulated negative knowledge.</p>
<p>The key insight is that these AI systems don’t just adapt to new conditions; they actively extract information from disorder. When market conditions shift, customer behavior changes, or systems encounter edge cases, properly designed generative AI can identify patterns in the chaos that human analysts might miss. They transform noise into signal, volatility into opportunity.</p>
<h2 class="wp-block-heading"><strong>Error as Information: Learning from Failure</strong></h2>
<p>Traditional systems treat errors as failures to be minimized. Antifragile systems treat errors as information sources to be exploited. This shift becomes powerful when combined with generative AI’s ability to learn from mistakes and generate improved responses.</p>
<p>IBM Watson for Oncology’s failure has been attributed to synthetic data problems, but it highlights a critical distinction: Synthetic data isn’t inherently problematic—it’s essential in healthcare where patient privacy restrictions limit access to real data. The issue was that Watson was trained exclusively on synthetic, hypothetical cases created by Memorial Sloan Kettering physicians rather than being validated against diverse real-world outcomes. This created a dangerous feedback loop where the AI learned physician preferences rather than evidence-based medicine.</p>
<p>When deployed, Watson recommended potentially fatal treatments—such as prescribing bevacizumab to a 65-year-old lung cancer patient with severe bleeding, despite the drug’s known risk of causing “severe or fatal hemorrhage.” A truly antifragile system would have incorporated mechanisms to detect when its training data diverged from reality—for instance, by tracking recommendation acceptance rates and patient outcomes to identify systematic biases.</p>
<p>This challenge extends beyond healthcare. Consider AI diagnostic systems deployed across different hospitals. A model trained on high-end equipment at a research hospital performs poorly when deployed to field hospitals with older, poorly calibrated CT scanners. An antifragile AI system would treat these equipment variations not as problems to solve but as valuable training data. Each “failed” diagnosis on older equipment becomes information that improves the system’s robustness across diverse deployment environments.</p>
<h2 class="wp-block-heading"><strong>Netflix: Mastering Organizational Antifragility</strong></h2>
<p>Netflix’s approach to chaos engineering exemplifies organizational antifragility in practice. The company’s famous “Chaos Monkey” randomly terminates services in production to ensure the system can handle failures gracefully. But more relevant to generative AI is its content recommendation system’s sophisticated approach to handling failures and edge cases.</p>
<p>When Netflix’s AI began recommending mature content to family accounts rather than simply adding filters, its team created systematic “chaos scenarios”—deliberately feeding the system contradictory user behavior data to stress-test its decision-making capabilities. They simulated situations where family members had vastly different viewing preferences on the same account or where content metadata was incomplete or incorrect.</p>
<p>The recovery protocols the team developed go beyond simple content filtering. Netflix created hierarchical safety nets: real-time content categorization, user context analysis, and human oversight triggers. Each “failure” in content recommendation becomes data that strengthens the entire system. The AI learns what content to recommend but also when to seek additional context, when to err on the side of caution, and how to gracefully handle ambiguous situations.</p>
<p>This demonstrates a key antifragile principle: The system doesn’t just prevent similar failures—it becomes more intelligent about handling edge cases it has never encountered before. Netflix’s recommendation accuracy improved precisely because the system learned to navigate the complexities of shared accounts, diverse family preferences, and content boundary cases.</p>
<h2 class="wp-block-heading"><strong>Technical Architecture: The LOXM Case Study</strong></h2>
<p>JPMorgan’s LOXM (Learning Optimization eXecution Model) represents the most sophisticated example of antifragile AI in production. Developed by the global equities electronic trading team under Daniel Ciment, LOXM went live in 2017 after training on billions of historical transactions. While this predates the current era of transformer-based generative AI, LOXM was built using deep learning techniques that share fundamental principles with today’s generative models: the ability to learn complex patterns from data and adapt to new situations through continuous feedback.</p>
<p><strong>Multi-agent architecture</strong>: LOXM uses a reinforcement learning system where specialized agents handle different aspects of trade execution.</p>
<ul class="wp-block-list">
<li>Market microstructure analysis agents learn optimal timing patterns.</li>
<li>Liquidity assessment agents predict order book dynamics in real time.</li>
<li>Impact modeling agents minimize market disruption during large trades.</li>
<li>Risk management agents enforce position limits while maximizing execution quality.</li>
</ul>
<p><strong>Antifragile performance under stress</strong>: While traditional trading algorithms struggled with unprecedented conditions during the market volatility of March 2020, LOXM’s agents used the chaos as learning opportunities. Each failed trade execution, each unexpected market movement, each liquidity crisis became training data that improved future performance.</p>
<p>The measurable results were striking. LOXM improved execution quality by 50% during the most volatile trading days—exactly when traditional systems typically degrade. This isn’t just resilience; it’s mathematical proof of positive convexity where the system gains more from stressful conditions than it loses.</p>
<p><strong>Technical innovation</strong>: LOXM prevents catastrophic forgetting through “experience replay” buffers that maintain diverse trading scenarios. When new market conditions arise, the system can reference similar historical patterns while adapting to novel situations. The feedback loop architecture uses streaming data pipelines to capture trade outcomes, model predictions, and market conditions in real time, updating model weights through online learning algorithms within milliseconds of trade completion.</p>
<h2 class="wp-block-heading"><strong>The Information Hiding Principle</strong></h2>
<p><a href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Information_hiding&sa=D&source=docs&ust=1757588683474584&usg=AOvVaw0GSAvIFboxsnVPkkNh7N8A" target="_blank" rel="noreferrer noopener">David Parnas’s information hiding principle</a> directly enables antifragility by ensuring that system components can adapt independently without cascading failures. In <a href="https://www.google.com/url?q=https://dl.acm.org/doi/10.1145/361598.361623&sa=D&source=docs&ust=1757588683474722&usg=AOvVaw3U2j64n0pCWduXPdTRr_Mj" target="_blank" rel="noreferrer noopener">his 1972 paper</a>, Parnas emphasized hiding “design decisions likely to change”—exactly what antifragile systems need.</p>
<p>When LOXM encounters market disruption, its modular design allows individual components to adapt their internal algorithms without affecting other modules. The “secret” of each module—its specific implementation—can evolve based on local feedback while maintaining stable interfaces with other components.</p>
<p>This architectural pattern prevents what Taleb calls “tight coupling”—where stress in one component propagates throughout the system. Instead, stress becomes localized learning opportunities that strengthen individual modules without destabilizing the whole system.</p>
<h2 class="wp-block-heading"><strong>Via Negativa in Practice</strong></h2>
<p>Nassim Taleb’s concept of “via negativa”—defining systems by what they’re not rather than what they are—translates directly to building antifragile AI systems.</p>
<p>When Airbnb’s search algorithm was producing poor results, instead of adding more ranking factors (the typical approach), the company applied via negativa: It systematically removed listings that consistently received poor ratings, hosts who didn’t respond promptly, and properties with misleading photos. By eliminating negative elements, the remaining search results naturally improved.</p>
<p>Netflix’s recommendation system similarly applies via negativa by maintaining “negative preference profiles”—systematically identifying and avoiding content patterns that lead to user dissatisfaction. Rather than just learning what users like, the system becomes expert at recognizing what they’ll hate, leading to higher overall satisfaction through subtraction rather than addition.</p>
<p>In technical terms, via negativa means starting with maximum system flexibility and systematically removing constraints that don’t add value—allowing the system to adapt to unforeseen circumstances rather than being locked into rigid predetermined behaviors.</p>
<h2 class="wp-block-heading"><strong>Implementing Continuous Feedback Loops</strong></h2>
<p>The feedback loop architecture requires three components: error detection, learning integration, and system adaptation. In LOXM’s implementation, market execution data flows back into the model within milliseconds of trade completion. The system uses streaming data pipelines to capture trade outcomes, model predictions, and market conditions in real time. Machine learning models continuously compare predicted execution quality to actual execution quality, updating model weights through online learning algorithms. This creates a continuous feedback loop where each trade makes the next trade execution more intelligent.</p>
<p>When a trade execution deviates from expected performance—whether due to market volatility, liquidity constraints, or timing issues—this immediately becomes training data. The system doesn’t wait for batch processing or scheduled retraining; it adapts in real time while maintaining stable performance for ongoing operations.</p>
<h2 class="wp-block-heading"><strong>Organizational Learning Loop</strong></h2>
<p>Antifragile organizations must cultivate specific learning behaviors beyond just technical implementations. This requires moving beyond traditional risk management approaches toward Taleb’s “via negativa.”</p>
<p>The learning loop involves three phases: stress identification, system adaptation, and capability improvement. Teams regularly expose systems to controlled stress, observe how they respond, and then use generative AI to identify improvement opportunities. Each iteration strengthens the system’s ability to handle future challenges.</p>
<p>Netflix institutionalized this through monthly “chaos drills” where teams deliberately introduce failures—API timeouts, database connection losses, content metadata corruption—and observe how their AI systems respond. Each drill generates postmortems focused not on blame but on extracting learning from the failure scenarios.</p>
<h2 class="wp-block-heading"><strong>Measurement and Validation</strong></h2>
<p>Antifragile systems require new metrics beyond traditional availability and performance measures. Key metrics include:</p>
<ul class="wp-block-list">
<li>Adaptation speed: Time from anomaly detection to corrective action</li>
<li>Information extraction rate: Number of meaningful model updates per disruption event</li>
<li>Asymmetric performance factor: Ratio of system gains from positive shocks to losses from negative ones</li>
</ul>
<p>LOXM tracks these metrics alongside financial outcomes, demonstrating quantifiable improvement in antifragile capabilities over time. During high-volatility periods, the system’s asymmetric performance factor consistently exceeds 2.0—meaning it gains twice as much from favorable market movements as it loses from adverse ones.</p>
<h2 class="wp-block-heading"><strong>The Competitive Advantage</strong></h2>
<p>The goal isn’t just surviving disruption—it’s creating competitive advantage through chaos. When competitors struggle with market volatility, antifragile organizations extract value from the same conditions. They don’t just adapt to change; they actively seek out uncertainty as fuel for growth.</p>
<p>Netflix’s ability to recommend content accurately during the pandemic, when viewing patterns shifted dramatically, gave it a significant advantage over competitors whose recommendation systems struggled with the new normal. Similarly, LOXM’s superior performance during market stress periods has made it JPMorgan’s primary execution algorithm for institutional clients.</p>
<p>This creates sustainable competitive advantage because antifragile capabilities compound over time. Each disruption makes the system stronger, more adaptive, and better positioned for future challenges.</p>
<h2 class="wp-block-heading"><strong>Beyond Resilience: The Antifragile Future</strong></h2>
<p>We’re witnessing the emergence of a new organizational paradigm. The convergence of antifragility principles with generative AI capabilities represents more than incremental improvement—it’s a fundamental shift in how organizations can thrive in uncertain environments.</p>
<p>The path forward requires commitment to experimentation, tolerance for controlled failure, and systematic investment in adaptive capabilities. Organizations must evolve from asking “How do we prevent disruption?” to “How do we benefit from disruption?”</p>
<p>The question isn’t whether your organization will face uncertainty and disruption—it’s whether you’ll be positioned to extract competitive advantage from chaos when it arrives. The integration of antifragility principles with generative AI provides the roadmap for that transformation, demonstrated by organizations like Netflix and JPMorgan that have already turned volatility into their greatest strategic asset.</p>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/taming-chaos-with-antifragile-genai-architecture/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Building AI-Resistant Technical Debt</title>
<link>https://www.oreilly.com/radar/building-ai-resistant-technical-debt/</link>
<comments>https://www.oreilly.com/radar/building-ai-resistant-technical-debt/#respond</comments>
<pubDate>Wed, 10 Sep 2025 10:03:48 +0000</pubDate>
<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17422</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-7.jpg"
medium="image"
type="image/jpeg"
/>
<custom:subtitle><![CDATA[When Speed Creates Long-Term Pain]]></custom:subtitle>
<description><![CDATA[Anyone who’s used AI to generate code has seen it make mistakes. But the real danger isn’t the occasional wrong answer; it’s in what happens when those errors pile up across a codebase. Issues that seem small at first can compound quickly, making code harder to understand, maintain, and evolve. To really see that danger, […]]]></description>
<content:encoded><![CDATA[
<p>Anyone who’s used AI to generate code has seen it make mistakes. But the real danger isn’t the occasional wrong answer; it’s in what happens when those errors pile up across a codebase. Issues that seem small at first can compound quickly, making code harder to understand, maintain, and evolve. To really see that danger, you have to look at how AI is used in practice—which for many developers starts with vibe coding.</p>
<p><strong>Vibe coding</strong> is an exploratory, prompt-first approach to software development where developers rapidly prompt, get code, and iterate. When the code seems close but not quite right, the developer describes what’s wrong and lets the AI try again. When it doesn’t compile or tests fail, they copy the error messages back to the AI. The cycle continues—prompt, run, error, paste, prompt again—often without reading or understanding the generated code. It feels productive because you’re making visible progress: errors disappear, tests start passing, features seem to work. You’re treating the AI like a coding partner who handles the implementation details while you steer at a high level.</p>
<p>Developers use vibe coding to explore and refine ideas and can generate large amounts of code quickly. It’s often the natural first step for most developers using AI tools, because it feels so intuitive and productive. Vibe coding offloads detail to the AI, making exploration and ideation fast and effective—which is exactly why it’s so popular.</p>
<p>The AI generates a lot of code, and it’s not practical to review every line every time it regenerates. Trying to read it all can lead to <strong>cognitive overload</strong>—mental exhaustion from wading through too much code—and makes it harder to throw away code that isn’t working just because you already invested time in reading it.</p>
<p>Vibe coding is a normal and useful way to explore with AI, but on its own it presents a significant risk. The models used by LLMs can hallucinate and produce made-up answers—for example, generating code that calls APIs or methods that don’t even exist. Preventing those AI-generated mistakes from compromising your codebase starts with understanding the capabilities and limitations of these tools, and taking an approach to AI-assisted development that takes those limitations into account.</p>
<p>Here’s a simple example of how these issues compound. When I ask AI to generate a class that handles user interaction, it often creates methods that directly read from and write to the console. When I then ask it to make the code more testable, if I don’t very specifically prompt for a simple fix like having methods take input as parameters and return output as values, the AI frequently suggests wrapping the entire I/O mechanism in an abstraction layer. Now I have an interface, an implementation, mock objects for testing, and dependency injection throughout. What started as a straightforward class has become a miniature framework. The AI isn’t wrong, exactly—the abstraction approach is a valid pattern—but it’s overengineered for the problem at hand. Each iteration adds more complexity, and if you’re not paying attention, you’ll end up with layers upon layers of unnecessary code. This is a good example of how vibe coding can balloon into unnecessary complexity if you don’t stop to verify what’s happening.</p>
<h2 class="wp-block-heading">Novice Developers Face a New Kind of Technical Debt Challenge with AI</h2>
<p>Three months after writing their first line of code, a Reddit user going by SpacetimeSorcerer posted a frustrated update: Their AI-assisted project had reached the point where making any change meant editing dozens of files. The design had hardened around early mistakes, and every change brought a wave of debugging. They’d hit the wall known in software design as “shotgun surgery,” where a single change ripples through so much code that it’s risky and slow to work on—a classic sign of <strong>technical debt</strong>, the hidden cost of early shortcuts that make future changes harder and more expensive.</p>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="547" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image.png" alt="I am giving up" class="wp-image-17423" title="I am giving up" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-300x103.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-768x263.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1536x525.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>A Reddit post describing the frustration of AI-accelerated technical debt (used with permission).</em></figcaption></figure>
<p>AI didn’t cause the problem directly; the code worked (until it didn’t). But the speed of AI-assisted development let this new developer skip the design thinking that prevents these patterns from forming. The same thing happens to experienced developers when deadlines push delivery over maintainability. The difference is, an experienced developer often knows they’re taking on debt. They can spot antipatterns early because they’ve seen them repeatedly, and take steps to “pay off” the debt before it gets much more expensive to fix. Someone new to coding may not even realize it’s happening until it’s too late—and they haven’t yet built the tools or habits to prevent it.</p>
<p>Part of the reason new developers are especially vulnerable to this problem goes back to the <strong>Cognitive Shortcut Paradox</strong>.<sup>1</sup> Without enough hands-on experience debugging, refactoring, and working through ambiguous requirements, they don’t have the instincts built up through experience to spot structural problems in AI-generated code. The AI can hand them a clean, working solution. But if they can’t see the design flaws hiding inside it, those flaws grow unchecked until they’re locked into the project, built into the foundations of the code so changing them requires extensive, frustrating work.</p>
<p>The signals of AI-accelerated technical debt show up quickly: highly coupled code where modules depend on each other’s internal details; “God objects” with too many responsibilities; overly structured solutions where a simple problem gets buried under extra layers. These are the same problems that typically reflect technical debt in human-built code; the reason they emerge so quickly in AI-generated code is because it can be generated much more quickly and without oversight or intentional design or architectural decisions being made. AI can generate these patterns convincingly, making them look deliberate even when they emerged by accident. Because the output compiles, passes tests, and works as expected, it’s easy to accept as “done” without thinking about how it will hold up when requirements change.</p>
<p>When adding or updating a unit test feels unreasonably difficult, that’s often the first sign the design is too rigid. The test is telling you something about the structure—maybe the code is too intertwined, maybe the boundaries are unclear. This feedback loop works whether the code was AI-generated or handwritten, but with AI the friction often shows up later, after the code has already been merged.</p>
<p>That’s where the “trust but verify” habit comes in. Trust the AI to give you a starting point, but verify that the design supports change, testability, and clarity. Ask yourself whether the code will still make sense to you—or anyone else—months from now. In practice, this can mean quick design reviews even for AI-generated code, refactoring when coupling or duplication starts to creep in, and taking a deliberate pass at naming so variables and functions read clearly. These aren’t optional touches; they’re what keep a codebase from locking in its worst early decisions.</p>
<p>AI can help with this too: It can suggest refactorings, point out duplicated logic, or help extract messy code into cleaner abstractions. But it’s up to you to direct it to make those changes, which means you have to spot them first—which is much easier for experienced developers who have seen these problems over the course of many projects.</p>
<p>Left to its defaults, AI-assisted development is biased toward adding new code, not revisiting old decisions. The discipline to avoid technical debt comes from building design checks into your workflow so AI’s speed works in service of maintainability instead of against it.</p>
<hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
<h2 class="wp-block-heading">Footnote</h2>
<ol class="wp-block-list">
<li>I’ll discuss this in more detail in a forthcoming Radar article on October 8.</li>
</ol>
<p></p>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/building-ai-resistant-technical-debt/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Megawatts and Gigawatts of AI</title>
<link>https://www.oreilly.com/radar/megawatts-and-gigawatts-of-ai/</link>
<comments>https://www.oreilly.com/radar/megawatts-and-gigawatts-of-ai/#respond</comments>
<pubDate>Tue, 09 Sep 2025 10:54:36 +0000</pubDate>
<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17410</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Smaller-Is-Power.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[We can’t not talk about power these days. We’ve been talking about it ever since the Stargate project, with half a trillion dollars in data center investment, was floated early in the year. We’ve been talking about it ever since the now-classic “Stochastic Parrots” paper. And, as time goes on, it only becomes more of […]]]></description>
<content:encoded><![CDATA[
<p>We can’t not talk about power these days. We’ve been talking about it ever since the <a href="https://en.wikipedia.org/wiki/Stargate_LLC" target="_blank" rel="noreferrer noopener">Stargate</a> project, with half a trillion dollars in data center investment, was floated early in the year. We’ve been talking about it ever since the now-classic “<a href="https://dl.acm.org/doi/10.1145/3442188.3445922" target="_blank" rel="noreferrer noopener">Stochastic Parrots</a>” paper. And, as time goes on, it only becomes more of an issue.</p>
<p>“Stochastic Parrots” deals with two issues: AI’s power consumption and the fundamental nature of generative AI; selecting sequences of words according to statistical patterns. I always wished those were two papers, because it would be easier to disagree about power and agree about parrots. For me, the power issue is something of a red herring—but increasingly, I see that it’s a red herring that isn’t going away because too many people with too much money want herrings; too many believe that a monopoly on power (or a monopoly on the ability to pay for power) is the route to dominance.</p>
<p>Why, in a better world than we currently live in, would the power issue be a red herring? There are several related reasons:</p>
<ul class="wp-block-list">
<li>I have always assumed that the first generation language models would be highly inefficient, and that over time, we’d develop more efficient algorithms.</li>
<li>I have also assumed that the economics of language models would be similar to chip foundries or pharma factories: The first chip coming out of a foundry costs a few billion dollars, everything afterward is a penny apiece.</li>
<li>I believe (now more than ever) that, long-term, we will settle on small models (70B parameters or less) that can run locally rather than giant models with trillions of parameters running in the cloud.</li>
</ul>
<p>And I still believe those points are largely true. But that’s not sufficient. Let’s go through them one by one, starting with efficiency.</p>
<h2 class="wp-block-heading">Better Algorithms</h2>
<p>A few years ago, I saw a fair number of papers about more efficient models. I remember a lot of articles about pruning neural networks (eliminating nodes that contribute little to the result) and other techniques. Papers that address efficiency are still being published—most notably, DeepMind’s recent “<a href="https://arxiv.org/abs/2507.10524" target="_blank" rel="noreferrer noopener">Mixture-of-Recursions</a>” paper—but they don’t seem to be as common. That’s just anecdata, and should perhaps be ignored. More to the point, DeepSeek shocked the world with their R1 model, which they claimed cost roughly 1/10 as much to train as the leading frontier models. A lot of commentary insisted that DeepSeek wasn’t being up front in their measurement of power consumption, but since then several other Chinese labs have released highly capable models, with no gigawatt data centers in sight. Even more recently, OpenAI has released gpt-oss in two sizes (120B and 30B), which were <a href="https://www.theinformation.com/articles/openai-says-gpt-5-one-size-fits-new-open-model-cheap-train?rc=7em78a" target="_blank" rel="noreferrer noopener">reportedly</a> much less expensive to train. It’s not the first time this has happened—I’ve been told that the Soviet Union developed amazingly efficient data compression algorithms because their computers were a decade behind ours. Better algorithms can trump larger power bills, better CPUs, and more GPUs, if we let them.</p>
<p>What’s wrong with this picture? The picture is good, but much of the narrative is US-centric, and that distorts it. First, it’s distorted by our belief that bigger is always better: Look at our cars, our SUVs, our houses. We’re conditioned to believe that a model with a trillion parameters has to be better than a model with a mere 70B, right? That a model that cost <a href="https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,4%20had%201%20trillion%20parameters." target="_blank" rel="noreferrer noopener">a hundred million dollars</a> to train has to be better than one that can be trained economically? That myth is deeply embedded in our psyche. Second, it’s distorted by economics. Bigger is better is a myth that would-be monopolists play on when they talk about the need for ever bigger data centers, preferably funded with tax dollars. It’s a convenient myth, because convincing would-be competitors that they need to spend billions on data centers is an effective way to have no competitors.</p>
<p>One area that hasn’t been sufficiently explored is extremely small models developed for specialized tasks. Drew Breunig <a href="https://www.dbreunig.com/2025/08/01/does-the-bitter-lesson-have-limits.html" target="_blank" rel="noreferrer noopener">writes</a> about the tiny chess model in Stockfish, the world’s leading chess program: It’s small enough to run in an iPhone, and replaced a much larger general-purpose model. And it <a href="https://www.youtube.com/watch?v=yc0bFlW56tY&t=528s" target="_blank" rel="noreferrer noopener">soundly defeated</a> Claude Sonnet 3.5 and GPT-4o.<sup>1</sup> He also writes about the 27 million parameter <a href="https://arxiv.org/pdf/2506.21734" target="_blank" rel="noreferrer noopener">Hierarchical Reasoning Model (HRM)</a> that has beaten models like Claude 3.7 on the ARC benchmark. Pete Warden’s Moonshine does real-time speech-to-text transcription in the browser—and is as good as any high-end model I’ve seen. None of these are general-purpose models. They won’t vibe code; they won’t write your blog posts. But they are extremely effective at what they do. And if AI is going to fulfill its destiny of “disappearing into the walls,” of becoming part of our everyday infrastructure, we will need very accurate, very specialized models. We will have to free ourselves of the myth that bigger is better.<sup>2</sup></p>
<h2 class="wp-block-heading">The Cost of Inference</h2>
<p>The purpose of a model isn’t to be trained; it’s to do inference. This is a gross simplification, but part of training is doing inference trillions of times and adjusting the model’s billions of parameters to minimize error. A single request takes an extremely small fraction of the effort required to train a model. That fact leads directly to the economics of chip foundries: The ability to process the first prompt costs millions of dollars, but once they’re in production, <a href="https://andymasley.substack.com/p/a-cheat-sheet-for-conversations-about" target="_blank" rel="noreferrer noopener">processing a prompt costs fractions of a cent</a>. Google has <a href="https://services.google.com/fh/files/misc/measuring_the_environmental_impact_of_delivering_ai_at_google_scale.pdf" target="_blank" rel="noreferrer noopener">claimed</a> that processing a typical text prompt to Gemini takes 0.24 watt-hours, significantly less than it takes to heat water for a cup of coffee. They also claim that increases in software efficiency have led to a 33x reduction in energy consumption over the past year.</p>
<p>That’s obviously not the entire story: Millions of people prompting ChatGPT adds up, as does usage of newer “reasoning” modules that have an extended internal dialog before arriving at a result. Likewise, driving to work rather than biking raises the global temperature a nanofraction of a degree—but when you multiply the nanofraction by billions of commuters, it’s a different story. It’s fair to say that an individual who uses ChatGPT or Gemini isn’t a problem, but it’s also important to realize that millions of users pounding on an AI service can grow into a problem quite quickly. Unfortunately, it’s also true that increases in efficiency often don’t lead to reductions in energy use but to solving more complex problems within the same energy budget. We may be seeing that with reasoning models, image and video generation models, and other applications that are now becoming financially feasible. Does this problem require gigawatt data centers? No, not that, but it’s a problem that can justify the building of gigawatt data centers.</p>
<p>There is a solution, but it requires rethinking the problem. Telling people to use public transportation or bicycles for their commute is ineffective (in the US), as will be telling people not to use AI. The problem needs to be rethought: redesigning work to eliminate the commute (O’Reilly is 100% work from home), rethinking the way we use AI so that it doesn’t require cloud-hosted trillion parameter models. That brings us to using AI locally.</p>
<h2 class="wp-block-heading">Staying Local</h2>
<p>Almost everything we do with GPT-*, Claude-*, Gemini-*, and other frontier models could be done equally effectively on much smaller models running locally: in a small corporate machine room or even on a laptop. Running AI locally also shields you from problems with availability, bandwidth, limits on usage, and leaking private data. This is a story that would-be monopolists don’t want us to hear. Again, this is anecdata, but I’ve been very impressed by the results I get from running models in the 30 billion parameter range on my laptop. I do vibe coding and get mostly correct code that the model can (usually) fix for me; I ask for summaries of blogs and papers and get excellent results. Anthropic, Google, and OpenAI are competing for tenths of a percentage point on highly gamed benchmarks, but I doubt that those benchmark scores have much practical meaning. I would love to see a study on the difference between Qwen3-30B and GPT-5.</p>
<p>What does that mean for energy costs? It’s unclear. Gigawatt data centers for doing inference would go unneeded if people do inference locally, but what are the consequences of a billion users doing inference on high-end laptops? If I give my local AIs a difficult problem, my laptop heats up and runs its fans. It’s using more electricity. And laptops aren’t as efficient as data centers that have been designed to minimize electrical use. It’s all well and good to scoff at gigawatts, but when you’re using that much power, minimizing power consumption saves a lot of money. Economies of scale are real. Personally, I’d bet on the laptops: Computing with 30 billion parameters is undoubtedly going to be less energy-intensive than computing with 3 trillion parameters. But I won’t hold my breath waiting for someone to do this research.</p>
<p>There’s another side to this question, and that involves models that “reason.” So-called “reasoning models” have an internal conversation (not always visible to the user) in which the model “plans” the steps it will take to answer the prompt. A recent paper <a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/" target="_blank" rel="noreferrer noopener">claims</a> that smaller open source models tend to generate many more reasoning tokens than large models (3 to 10 times as many, depending on the models you’re comparing), and that the extensive reasoning process eats away at the economics of the smaller models. Reasoning tokens must be processed, the same as any user-generated tokens; this processing incurs charges (which the paper discusses), and charges presumably relate directly to power.</p>
<p>While it’s surprising that small models generate more reasoning tokens, it’s no surprise that reasoning is expensive, and we need to take that into account. Reasoning is a tool to be used; it tends to be particularly useful when a model is asked to solve a problem in mathematics. It’s much less useful when the task involves looking up facts, summarization, writing, or making recommendations. It can help in areas like software design but is likely to be a liability for generative coding. In these cases, the reasoning process can actually become misleading—in addition to burning tokens. Deciding how to use models effectively, whether you’re running them locally or in the cloud, is a task that falls to us.</p>
<p>Going to the giant reasoning models for the “best possible answer” is always a temptation, especially when you know you don’t need the best possible answer. It takes some discipline to commit to the smaller models—even though it’s difficult to argue that using the frontier models is less work. You still have to analyze their output and check their results. And I confess: As committed as I am to the smaller models, I tend to stick with models in the 30B range, and avoid the 1B–5B models (including the excellent Gemma 3N). Those models, I’m sure, would give good results, use even less power, and run even faster. But I’m still in the process of peeling myself away from my knee-jerk assumptions.</p>
<p>Bigger isn’t necessarily better; more power isn’t necessarily the route to AI dominance. We don’t yet know how this will play out, but I’d place my bets on smaller models running locally and trained with efficiency in mind. There will no doubt be some applications that require large frontier models—perhaps generating synthetic data for training the smaller models—but we really need to understand where frontier models are needed, and where they aren’t. My bet is that they’re rarely needed. And if we free ourselves from the desire to use the latest, largest frontier model just because it’s there—whether or not it serves your purpose any better than a 30B model—we won’t need most of those giant data centers. Don’t be seduced by the AI-industrial complex.</p>
<hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
<h2 class="wp-block-heading">Footnotes</h2>
<ol class="wp-block-list">
<li>I’m not aware of games between Sockfish and more recent Claude 4, Claude 4.1, and GPT-5 models. There’s every reason to believe the results would be similar.</li>
<li>Kevlin Henney makes a related point in “<a href="https://www.oreilly.com/radar/scaling-false-peaks/" target="_blank" rel="noreferrer noopener">Scaling False Peaks</a>.”</li>
</ol>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/megawatts-and-gigawatts-of-ai/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>A “Beam Versus Dataflow” Conversation</title>
<link>https://www.oreilly.com/radar/a-beam-versus-dataflow-conversation/</link>
<comments>https://www.oreilly.com/radar/a-beam-versus-dataflow-conversation/#respond</comments>
<pubDate>Mon, 08 Sep 2025 10:28:48 +0000</pubDate>
<dc:creator><![CDATA[Aaron Black]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Research]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17406</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Beam-pipeline.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[I’ve been in a few recent conversations about whether to use Apache Beam on its own or run it with Google Dataflow. On the surface, it’s a tooling decision. But it also reflects a broader conversation about how teams build systems. Beam offers a consistent programming model for unifying batch and streaming logic. It doesn’t […]]]></description>
<content:encoded><![CDATA[
<p>I’ve been in a few recent conversations about whether to use Apache Beam on its own or run it with Google Dataflow. On the surface, it’s a tooling decision. But it also reflects a broader conversation about how teams build systems.</p>
<p>Beam offers a consistent programming model for unifying batch and streaming logic. It doesn’t dictate where that logic runs. You can deploy pipelines on Flink or Spark, or you can use a managed runner like Dataflow. Each option outfits the same Beam code with very different execution semantics.</p>
<p>What’s added urgency to this choice is the <a href="https://bostoninstituteofanalytics.org/blog/the-rise-of-real-time-data-science-in-2025-tools-trends-and-techniques/" target="_blank" rel="noreferrer noopener">growing pressure on data systems to support machine learning and AI workloads</a>. It’s no longer enough to transform, validate, and load. Teams also need to feed real-time inference, scale feature processing, and orchestrate retraining workflows as part of pipeline development. Beam and Dataflow are both increasingly positioned as infrastructure that supports not just analytics but active AI.</p>
<p>Choosing one path over the other means making decisions about flexibility, integration surface, runtime ownership, and operational scale. None of those are easy knobs to adjust after the fact.</p>
<p>The goal here is to unpack the trade-offs and help teams make deliberate calls about what kind of infrastructure they’ll want.</p>
<h2 class="wp-block-heading"><strong>Apache Beam: A Common Language for Pipelines</strong></h2>
<p>Apache Beam provides a shared model for expressing data processing workflows. This includes the kinds of batch and streaming tasks most data teams are already familiar with, but it also now includes a growing set of patterns specific to AI and ML.</p>
<p>Developers write Beam pipelines using a single SDK that defines what the pipeline does, not how the underlying engine runs it. That logic can include parsing logs, transforming records, joining events across time windows, and applying trained models to incoming data using built-in inference transforms.</p>
<p>Support for AI-specific workflow steps is improving. Beam now offers the <a href="https://beam.apache.org/documentation/transforms/python/elementwise/runinference/" target="_blank" rel="noreferrer noopener">RunInference API</a>, along with <a href="https://beam.apache.org/documentation/transforms/python/elementwise/mltransform/" target="_blank" rel="noreferrer noopener">MLTransform</a> utilities, to help deploy models trained in frameworks like TensorFlow, PyTorch, and scikit-learn into Beam pipelines. These can be used in batch workflows for bulk scoring or in low-latency streaming pipelines where inference is applied to live events.</p>
<p>Crucially, this isn’t tied to one cloud. Beam lets you define the transformation once and pick the execution path later. You can run the exact same pipeline on Flink, Spark, or Dataflow. That level of portability doesn’t remove infrastructure concerns on its own, but it does allow you to focus your engineering effort on logic rather than rewrites.</p>
<p>Beam gives you a way to describe and maintain machine learning pipelines. What’s left is deciding how you want to operate them.</p>
<h2 class="wp-block-heading"><strong>Running Beam: Self-Managed Versus Managed</strong></h2>
<p>If you’re running Beam on Flink, Spark, or some custom runner, you’re responsible for the full runtime environment. You handle provisioning, scaling, fault tolerance, tuning, and observability. Beam becomes another user of your platform. That degree of control can be useful, especially if model inference is only one part of a larger pipeline that already runs in your infrastructure. Custom logic, proprietary connectors, or non-standard state handling might push you toward keeping everything self-managed.</p>
<p>But building for inference at scale, especially in streaming, <a href="https://openproceedings.org/2024/conf/edbt/paper-156.pdf" target="_blank" rel="noreferrer noopener">introduces friction</a>. It means tracking model versions across pipeline jobs. It means watching watermarks and tuning triggers so inference happens precisely when it should. It means managing restart logic and making sure models fail gracefully when cloud resources or updatable weights are unavailable. If your team is already running distributed systems, that may be fine. But it isn’t free.</p>
<p>Running Beam on Dataflow simplifies much of this by taking infrastructure management out of your hands. You still build your pipeline the same way. But once deployed to Dataflow, scaling and resource provisioning are handled by the platform. Dataflow pipelines can stream through inference using native Beam transforms and benefit from newer features like automatic model refresh and tight integration with Google Cloud services.</p>
<p>This is particularly relevant when <a href="https://cloud.google.com/vertex-ai/docs/pipelines/dataflow-component" target="_blank" rel="noreferrer noopener">working with Vertex AI</a>, which allows hosted model deployment, feature store lookups, and GPU-accelerated inference to plug straight into your pipeline. Dataflow enables those connections with lower latency and minimal manual setup. For some teams, that makes it the better fit by default.</p>
<p>Of course, not every ML workload needs end-to-end cloud integration. And not every team wants to give up control of their pipeline execution. That’s why understanding what each option provides is necessary before making long-term infrastructure bets.</p>
<h2 class="wp-block-heading"><strong>Choosing the Execution Model That Matches Your Team</strong></h2>
<p>Beam gives you the foundation for defining ML-aware data pipelines. Dataflow gives you a specific way to execute them, especially in production environments where responsiveness and scalability matter.</p>
<p>If you’re building systems that require operational control and that already assume deep platform ownership, managing your own Beam runner makes sense. It gives flexibility where rules are looser and lets teams hook directly into their own tools and systems.</p>
<p>If instead you need fast iteration with minimal overhead, or you’re running real-time inference against cloud-hosted models, then Dataflow offers clear benefits. You onboard your pipeline without worrying about the runtime layer and deliver predictions without gluing together your own serving infrastructure.</p>
<p>If inference becomes an everyday part of your pipeline logic, the balance between operational effort and platform constraints starts to shift. The best execution model depends on more than feature comparison.</p>
<p>A well-chosen execution model involves commitment to how your team builds, evolves, and operates intelligent data systems over time. Whether you prioritize fine-grained control or accelerated delivery, both Beam and Dataflow offer robust paths forward. The key is aligning that choice with your long-term goals: consistency across workloads, adaptability for future AI demands, and a developer experience that supports innovation without compromising stability. As inference becomes a core part of modern pipelines, choosing the right abstraction sets a foundation for future-proofing your data infrastructure.</p>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/a-beam-versus-dataflow-conversation/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Generative AI in the Real World: Luke Wroblewski on When Databases Talk Agent-Speak</title>
<link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-luke-wroblewski-on-when-databases-talk-agent-speak/</link>
<comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-luke-wroblewski-on-when-databases-talk-agent-speak/#respond</comments>
<pubDate>Thu, 04 Sep 2025 16:01:45 +0000</pubDate>
<dc:creator><![CDATA[Ben Lorica and Luke Wroblewski]]></dc:creator>
<category><![CDATA[Generative AI in the Real World]]></category>
<category><![CDATA[Podcast]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&p=17394</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png"
medium="image"
type="image/png"
/>
<description><![CDATA[Join Luke Wroblewski and Ben Lorica as they talk about the future of software development. What happens when we have databases that are designed to interact with agents and language models rather than humans? We’re starting to see what that world will look like. It’s an exciting time to be a software developer. About the […]]]></description>
<content:encoded><![CDATA[
<p>Join Luke Wroblewski and Ben Lorica as they talk about the future of software development. What happens when we have databases that are designed to interact with agents and language models rather than humans? We’re starting to see what that world will look like. It’s an exciting time to be a software developer.</p>
<p><strong>About the <em>Generative AI in the Real World</em> podcast:</strong> In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>
<p>Check out <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*16z5k2y*_ga*MTE1NDE4NjYxMi4xNzI5NTkwODkx*_ga_092EL089CH*MTcyOTYxNDAyNC4zLjEuMTcyOTYxNDAyNi41OC4wLjA." target="_blank" rel="noreferrer noopener">other episodes</a> of this podcast on the O’Reilly learning platform.</p>
<h2 class="wp-block-heading">Timestamps</h2>
<ul class="wp-block-list">
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=0" target="_blank" rel="noreferrer noopener">0:00</a>: Introduction to Luke Wroblewski of Sutter Hill Ventures. </li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=36" target="_blank" rel="noreferrer noopener">0:36</a>: You’ve talked about a paradigm shift in how we write applications. You’ve said that all we need is a URL and model, and that’s an app. Has anyone else made a similar observation? Have you noticed substantial apps that look like this?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=68" target="_blank" rel="noreferrer noopener">1:08</a>: The future is here; it’s just not evenly distributed yet. That’s what everyone loves to say. The first websites looked nothing like robust web applications, and now we have a multimedia podcast studio running in the browser. We’re at the phase where some of these things look and feel less robust. And our ideas for what constitutes an application change in each of these phases. If I told you pre-Google Maps that we’d be running all of our web applications in a browser, you’d have laughed at me. </li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=133" target="_blank" rel="noreferrer noopener">2:13</a>: I think what you mean is an MCP server, and the model itself is the application, correct?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=144" target="_blank" rel="noreferrer noopener">2:24</a>: Yes. The current definition of an application, in a simple form, is running code and a database. We’re at the stage where you have AI coding agents that can handle the coding part. But we haven’t really had databases that have been designed for the way those agents think about code and interacting with data.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=177" target="_blank" rel="noreferrer noopener">2:57</a>: Now that we have databases that work the way agents work, you can take out the running-code part almost. People go to Lovable or Cursor and they’re forced to look at code syntax. But if an AI model can just use a database effectively, it takes the role of the running code. And if it can manage data visualizations and UI, you don’t need to touch the code. You just need to point the AI at a data structure it can use effectively. <a href="https://mcpui.dev/" target="_blank" rel="noreferrer noopener">MCP UI</a> is a nice example of people pushing in this direction.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=252" target="_blank" rel="noreferrer noopener">4:12</a>: Which brings us to something you announced recently: AgentDB. You can find it at <a href="http://agentdb.dev" target="_blank" rel="noreferrer noopener">agentdb.dev</a>. What problem is AgentDB trying to solve?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=274" target="_blank" rel="noreferrer noopener">4:34</a>: Related to what we were just talking about: How do we get AI agents to use databases effectively? Most things in the technology stack are made for humans and the scale at which humans operate.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=306" target="_blank" rel="noreferrer noopener">5:06</a>: They’re still designed for a DBA, but eliminating the command line, right? So you still have to have an understanding of DBA principles?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=319" target="_blank" rel="noreferrer noopener">5:19</a>: How do you pick between the different compute options? How do you pick a region? What are the security options? And it’s not something you’re going to do thousands of times a day. Databricks just shared some stats where they said that thousands of databases per agent get made a day. They think 99% of databases being made are going to be made by agents. What is making all these databases? No longer humans. And the scale at which they make them—thousands is a lowball number. It will be way, way higher than that. How do we make a database system that works in that reality?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=382" target="_blank" rel="noreferrer noopener">6:22</a>: So the high-level thesis here is that lots of people will be creating agents, and these agents will rely on something that looks like a database, and many of these people won’t be hardcore engineers. What else?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=405" target="_blank" rel="noreferrer noopener">6:45</a>: It’s also agents creating agents, and agents creating applications, and agents deciding they need a database to complete a task. The explosion of these smart machine uses and workflows is well underway. But we don’t have an infrastructure that was made for that world. They were all designed to work with humans.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=451" target="_blank" rel="noreferrer noopener">7:31</a>: So in the classic database world, you’d consider AgentDB more like OLTP rather than analytics and OLAP.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=462" target="_blank" rel="noreferrer noopener">7:42</a>: Yeah, for analytics you’d probably stick your log somewhere else. The characteristics that make AgentDB really interesting for agents is, number 1: To create a database, all you really need is a unique ID. The creation of the ID manifests a database out of thin air. And we store it as a file, so you can scale like crazy. And all of these databases are fully isolated. They’re also downloadable, deletable, releasable—all the characteristics of a filesystem. We also have the concept of a template that comes along with the database. That gives the AI model or agent all the context it needs to start using the database immediately. If you just point Claude at a database, it will need to look at the structure (schema). It will build tokens and time trying to get the structure of the information. And every time it does this is an opportunity to make a mistake. With AgentDB, when an agent or an AI model is pointed at the database with a template, it can immediately write a query because we have in there a description of the database, the schema. So you save time, cut down errors, and don’t have to go through that learning step every time the model touches a database.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=622" target="_blank" rel="noreferrer noopener">10:22</a>: I assume this database will have some of the features you like, like ACID, vector search. So what kinds of applications have people built using AgentDB? </li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=653" target="_blank" rel="noreferrer noopener">10:53</a>: We put up a little demo page where we allow you to start the process with a CSV file. You upload it, and it will create the database and give you an MCP URL. So people are doing things like personal finance. People are uploading their credit card statements, their bank statements, because those applications are horrendous.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=699" target="_blank" rel="noreferrer noopener">11:39</a>: So it’s the actual statement; it parses it?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=705" target="_blank" rel="noreferrer noopener">11:45</a>: Another example: Someone has a spreadsheet to track jobs. They can take that, upload it, it gives them a template and a database and an MCP URL. They can pop that job-tracking database into Claude and do all the things you can do with a chat app, like ask, “What did I look at most recently?”</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=755" target="_blank" rel="noreferrer noopener">12:35</a>: Do you envision it more like a DuckDB, more embedded, not really intended for really heavy transactional, high-throughput, more-than-one-table complicated schemas?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=769" target="_blank" rel="noreferrer noopener">12:49</a>: We currently support DuckDB and SQLite. But there are a bunch of folks who have made multiple table apps and databases.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=789" target="_blank" rel="noreferrer noopener">13:09</a>: So it’s not meant for you to build your own CRM?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=798" target="_blank" rel="noreferrer noopener">13:18</a>: Actually, one of our go-to-market guys had data of people visiting the website. He can dump that as a spreadsheet. He has data of people starring repos on GitHub. He has data of people who reached out through this form. He has all of these inbound signals of customers. So he took those, dropped them in as CSV files, put it in Claude, and then he can say, “Look at these, search the web for information about these, add it to the database, sort it by priority, assign it to different reps.” It’s CRM-ish already, but super-customized to his particular use case. </li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=867" target="_blank" rel="noreferrer noopener">14:27</a>: So you can create basically an agentic Airtable.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=878" target="_blank" rel="noreferrer noopener">14:38</a>: This means if you’re building AI applications or databases—traditionally that has been somewhat painful. This removes all that friction.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=900" target="_blank" rel="noreferrer noopener">15:00</a>: Yes, and it leads to a different way of making apps. You take that CSV file, you take that MCP URL, and you have a chat app.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=917" target="_blank" rel="noreferrer noopener">15:17</a>: Even though it’s accessible to regular users, it’s something developers should consider, right?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=925" target="_blank" rel="noreferrer noopener">15:25</a>: We’re starting to see emergent end-user use cases, but what we put out there is for developers. </li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=938" target="_blank" rel="noreferrer noopener">15:38</a>: One of the other things you’ve talked about is the notion that software development has flipped. Can you explain that to our listeners?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=956" target="_blank" rel="noreferrer noopener">15:56</a>: I spent eight and a half years at Google, four and a half at Yahoo, two and a half at ebay, and your traditional process of what we’re going to do next is up front: There’s a lot of drawing pictures and stuff. We had to scope engineering time. A lot of the stuff was front-loaded to figure out what we were going to build. Now with things like AI agents, you can build it and then start thinking about how it integrates inside the project. At a lot of our companies that are working with AI coding agents, I think this naturally starts to happen, that there’s a manifestation of the technology that helps you think through what the design should be, how do we integrate into the product, should we launch this? This is what I mean by “flipped.”</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1061" target="_blank" rel="noreferrer noopener">17:41</a>: If I’m in a company like a big bank, does this mean that engineers are running ahead?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1075" target="_blank" rel="noreferrer noopener">17:55</a>: I don’t know if it’s happening in big banks yet, but it’s definitely happening in startup companies. And design teams have to think through “Here’s a bunch of stuff, let me do a wash across all that to fit in,” as opposed to spending time designing it earlier. There are pros and cons to both of these. The engineers were cleaning up the details in the previous world. Now the opposite is true: I’ve built it, now I need to design it.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1135" target="_blank" rel="noreferrer noopener">18:55</a>: Does this imply a new role? There’s a new skill set that designers have to develop?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1147" target="_blank" rel="noreferrer noopener">19:07</a>: There’s been this debate about “Should designers code?” Over the years lots of things have reduced the barrier to entry, and now we have an even more dramatic reduction. I’ve always been of the mindset that if you understand the medium, you will make better things. Now there’s even less of a reason not to do it.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1190" target="_blank" rel="noreferrer noopener">19:50</a>: Anecdotally, what I’m observing is that the people who come from product are able to build something, but I haven’t heard as many engineers thinking about design. What are the AI tools for doing that?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1219" target="_blank" rel="noreferrer noopener">20:19</a>: I hear the same thing. What I hope remains uncommoditized is taste. I’ve found that it’s very hard to teach taste to people. If I have a designer who is a good systems thinker but doesn’t have the gestalt of the visual design layer, I haven’t been able to teach that to them. But I have been able to find people with a clear sense of taste from diverse design backgrounds and get them on board with interaction design and systems thinking and applications.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1262" target="_blank" rel="noreferrer noopener">21:02</a>: If you’re a young person and you’re skilled, you can go into either design or software engineering. Of course, now you’re reading articles saying “forget about software engineering.” I haven’t seen articles saying “forget about design.”</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1291" target="_blank" rel="noreferrer noopener">21:31</a>: I disagree with the idea that it’s a bad time to be an engineer. It’s never been more exciting.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1306" target="_blank" rel="noreferrer noopener">21:46</a>: But you have to be open to that. If you’re a curmudgeon, you’re going to be in trouble.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1313" target="_blank" rel="noreferrer noopener">21:53</a>: This happens with every technical platform transition. I spent so many years during the smartphone boom hearing people say, “No one is ever going to watch TV and movies on mobile.” Is it an affinity to the past, or a sense of doubt about the future? Every time, it’s been the same thing.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1357" target="_blank" rel="noreferrer noopener">22:37</a>: One way to think of AgentDB is like a wedge. It addresses one clear pain point in the stack that people have to grapple with. So what’s next? Is it Kubernetes?</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1389" target="_blank" rel="noreferrer noopener">23:09</a>: I don’t want to go near that one! The broader context of how applications are changing—how do I create a coherent product that people understand how to use, that has aesthetics, that has a personality?—is a very wide-open question. There’s a bunch of other systems that have not been made for AI models. A simple example is search APIs. Search APIs are basically structured the same way as results pages. Here’s your 10 blue links. But an agentic model can suck up so much information. Not only should you be giving it the web page, you should be giving it the whole site. Those systems are not built for this world at all. You can go down the list of the things we use as core infrastructure and think about how they were made for a human, not the capabilities of an enormous large language model.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1479" target="_blank" rel="noreferrer noopener">24:39</a>: Right now, I’m writing an article on enterprise search, and one of things people don’t realize is that it’s broken. In terms of AgentDB, do you worry about things like security, governance? There’s another place black hat attackers can go after.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1520" target="_blank" rel="noreferrer noopener">25:20</a>: Absolutely. All new technologies have the light side and the dark side. It’s always been a codebreaker-codemaker stack. That doesn’t change. The attack vectors are different and, in the early stages, we don’t know what they are, so it is a cat and mouse game. There was an era when spam in email was terrible; your mailbox would be full of spam and you manually had to mark things as junk. Now you use gmail, and you don’t think about it. When was the last time you went into the junk mail tab? We built systems, we got smarter, and the average person doesn’t think about it.</li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1591" target="_blank" rel="noreferrer noopener">26:31</a>: As you have more people building agents, and agents building agents, you have data governance, access control; suddenly you have AgentDB artifacts all over the place. </li>
<li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1626" target="_blank" rel="noreferrer noopener">27:06</a>: Two things here. This is an underappreciated part of this. Two years ago I launched my own personal chatbot that works off my writings. People ask me what model am I using, and how is it built? Those are partly interesting questions. But the real work in that system is constantly looking at the questions people are asking, and evaluating whether or not it responded well. I’m constantly course-correcting the system. That’s the work that a lot of people don’t do. But the thing I’m doing is applying taste, applying a perspective, defining what “good” is. For a lot of systems like enterprise search, it’s like, “We deployed the technology.” How do you know if it’s good or not? Is someone in there constantly tweaking and tuning? What makes Google Search so good? It’s constantly being re-evaluated. Or Google Translate—was this translation good or bad? Baked in early on.</li>
</ul>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-luke-wroblewski-on-when-databases-talk-agent-speak/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>AI Security Takes Center Stage at Black Hat USA 2025</title>
<link>https://www.oreilly.com/radar/ai-security-takes-center-stage-at-black-hat-usa-2025/</link>
<comments>https://www.oreilly.com/radar/ai-security-takes-center-stage-at-black-hat-usa-2025/#respond</comments>
<pubDate>Thu, 04 Sep 2025 09:52:40 +0000</pubDate>
<dc:creator><![CDATA[Simina Calin]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Security]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17388</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-9.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[The security landscape is undergoing yet another major shift, and nowhere was this more evident than at Black Hat USA 2025. As artificial intelligence (especially the agentic variety) becomes deeply embedded in enterprise systems, it’s creating both security challenges and opportunities. Here’s what security professionals need to know about this rapidly evolving landscape. AI systems—and […]]]></description>
<content:encoded><![CDATA[
<p>The security landscape is undergoing yet another major shift, and nowhere was this more evident than at <a href="https://www.blackhat.com/us-25/" target="_blank" rel="noreferrer noopener">Black Hat USA 2025</a>. As artificial intelligence (especially the agentic variety) becomes deeply embedded in enterprise systems, it’s creating both security challenges and opportunities. Here’s what security professionals need to know about this rapidly evolving landscape.</p>
<p>AI systems—and particularly the AI assistants that have become integral to enterprise workflows—are emerging as prime targets for attackers. In one of the most interesting and scariest presentations, Michael Bargury of Zenity demonstrated previously unknown <a href="https://www.blackhat.com/us-25/briefings/schedule/index.html#ai-enterprise-compromise---0click-exploit-methods-46442" target="_blank" rel="noreferrer noopener">“0click” exploit methods</a> affecting major AI platforms including ChatGPT, Gemini, and Microsoft Copilot. These findings underscore how AI assistants, despite their robust security measures, can become vectors for system compromise.</p>
<p>AI security presents a paradox: As organizations expand AI capabilities to enhance productivity, they must necessarily increase these tools’ access to sensitive data and systems. This expansion creates new attack surfaces and more complex supply chains to defend. NVIDIA’s AI red team highlighted this vulnerability, revealing how <a href="https://www.blackhat.com/us-25/briefings/schedule/#from-prompts-to-pwns-exploiting-and-securing-ai-agents-46681" target="_blank" rel="noreferrer noopener">large language models (LLMs) are uniquely susceptible to malicious inputs</a>, and demonstrated several novel exploit techniques that take advantage of these inherent weaknesses.</p>
<p>However, it’s not all new territory. Many traditional security principles remain relevant and are, in fact, more crucial than ever. Nathan Hamiel and Nils Amiet of Kudelski Security showed how AI-powered development tools are <a href="https://www.blackhat.com/us-25/briefings/schedule/index.html#hack-to-the-future-owning-ai-powered-tools-with-old-school-vulns-45871" target="_blank" rel="noreferrer noopener">inadvertently reintroducing well-known vulnerabilities into modern applications</a>. Their findings suggest that basic application security practices remain fundamental to AI security.</p>
<p>Looking forward, threat modeling becomes increasingly critical but also more complex. The security community is responding with new frameworks designed specifically for AI systems such as MAESTRO and NIST’s AI Risk Management Framework. The <a href="https://genai.owasp.org/resource/owasp-gen-ai-agentic-security-top-10-global-kickoff-presentation/" target="_blank" rel="noreferrer noopener">OWASP Agentic Security Top 10 project</a>, launched during this year’s conference, provides a structured approach to understanding and addressing AI-specific security risks.</p>
<p>For security professionals, the path forward requires a balanced approach: maintaining strong fundamentals while developing new expertise in AI-specific security challenges. Organizations must reassess their security posture through this new lens, considering both traditional vulnerabilities and emerging AI-specific threats.</p>
<p>The discussions at Black Hat USA 2025 made it clear that while AI presents new security challenges, it also offers opportunities for innovation in defense strategies. Mikko Hypponen’s opening keynote presented a <a href="https://www.blackhat.com/us-25/briefings/schedule/#keynote-three-decades-in-cybersecurity-lessons-learned-and-what-comes-next-48195" target="_blank" rel="noreferrer noopener">historical perspective on the last 30 years of cybersecurity advancements</a> and concluded that security is not only better than it’s ever been but poised to leverage a head start in AI usage. Black Hat has a way of underscoring the reasons for concern, but taken as a whole, this year’s presentations show us that there are also many reasons to be optimistic. Individual success will depend on how well security teams can adapt their existing practices while embracing new approaches specifically designed for AI systems.</p>
]]></content:encoded>
<wfw:commentRss>https://www.oreilly.com/radar/ai-security-takes-center-stage-at-black-hat-usa-2025/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Looking Forward to AI Codecon</title>
<link>https://www.oreilly.com/radar/looking-forward-to-ai-codecon/</link>
<pubDate>Wed, 03 Sep 2025 17:25:12 +0000</pubDate>
<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Events]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17395</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/06/abstract-2370445_1920_crop-72496aa8e2aa221169247ee80be31a15.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[I’m really looking forward to our second O’Reilly AI Codecon, Coding for the Agentic World, which is happening on September 9, online from 8am to noon Pacific time, with a follow-on day of additional demos on September 16. But I’m also looking forward to how the AI market itself unfolds: the surprising twists and turns […]]]></description>
<content:encoded><![CDATA[
<p>I’m really looking forward to our second O’Reilly AI Codecon, <a href="https://www.oreilly.com/AgenticWorld/" target="_blank" rel="noreferrer noopener">Coding for the Agentic World</a>, which is happening on September 9, online from 8am to noon Pacific time, with a follow-on <a href="https://learning.oreilly.com/live-events/oreilly-demo-day/0642572227968/" target="_blank" rel="noreferrer noopener">day of additional demos</a> on September 16. But I’m also looking forward to how the AI market itself unfolds: the surprising twists and turns ahead as users and developers apply AI to real-world problems.</p>
<p>The pages linked above give details on the program for the events. What I want to give here is a bit of the <strong><em>why</em></strong> behind the program, with a bit more detail on some of the fireside chats I will be leading.</p>
<h2 class="wp-block-heading">From Invention to Application</h2>
<p>There has been so much focus in the past on the big AI labs, the model developers, and their razzle-dazzle about AGI, or even ASI. That narrative implied that we were heading toward something unprecedented. But if this is a “<a href="https://www.oreilly.com/radar/is-ai-a-normal-technology/" target="_blank" rel="noreferrer noopener">normal technology</a>” (albeit one as transformational as electricity, the internal combustion engine, or the internet), we know that LLMs themselves are just the beginning of a long process of discovery, product invention, business adoption, and societal adaptation.</p>
<p>That process of collaborative discovery of the real uses for AI and reinvention of the businesses that use it is happening most clearly in the software industry. It is where AI is being pushed to the limits, where new products beyond the chatbot are being introduced, where new workflows are being developed, and where we understand what works and what doesn’t.</p>
<p>This work is often being pushed forward by individuals, who are “<a href="https://yalebooks.yale.edu/book/9780300195668/learning-by-doing/" target="_blank" rel="noreferrer noopener">learning by doing</a>.” Some of these individuals work for large companies, others for startups, others for enterprises, and others as independent hackers.</p>
<p>Our focus in these AI Codecon events is to smooth adoption of AI by helping our customers cut through the hype and understand what is working. O’Reilly’s mission has always been <a href="https://www.oreilly.com/about/" target="_blank" rel="noreferrer noopener">changing the world by sharing the knowledge of innovators</a>. In our events, we always look for people who are at the forefront of invention. As outlined in the call to action for the first event, I was concerned about the chatter that AI would make developers obsolete. I <a href="https://www.oreilly.com/radar/the-end-of-programming-as-we-know-it/" target="_blank" rel="noreferrer noopener">argued instead</a> that it would profoundly change the process of software development and the jobs that developers do, but that it would make them more important than ever.</p>
<p>It looks like I was right. There is a huge ferment, with so much new to learn and do that it’s a really exciting time to be a software developer. I’m really excited about the practicality of the conversation. We’re not just talking about the “what if.” We’re seeing new AI powered services meeting real business needs. We are witnessing the shift from human-centric workflows to agent-centric workflows, and it’s happening faster than you think.</p>
<p>We’re also seeing widespread adoption of the protocols that will power it all. If you’ve followed my work from open source to Web 2.0 to the present, you know that I believe strongly that the most dynamic systems have “an architecture of participation.” That is, they aren’t monolithic. The barriers to entry need to be low and business models fluid (at least in the early stages) for innovation to flourish.</p>
<p>When AI was framed as a race for superintelligence, there was a strong expectation that it would be winner takes all. The first company to get to ASI (or even just to AGI) would soon be so far ahead that it would inevitably become a dominant monopoly. Developers would all use its APIs, making it into the single dominant platform for AI development.</p>
<p>Protocols like MCP and A2A are instead enabling a decentralized AI future. The explosion of entrepreneurial activity around agentic AI reminds me of the best kind of open innovation, much like I saw in the early days of the personal computer and the internet.</p>
<p>I was going to use my opening remarks to sound that theme, and then I read Alex Komoroske’s marvelous essay, “<a href="https://www.techdirt.com/2025/06/16/why-centralized-ai-is-not-our-inevitable-future/" target="_blank" rel="noreferrer noopener">Why Centralized AI Is Not Our Inevitable Future</a>.” So I asked him to do it instead. He’s going to give an updated, developer-focused version of that as our kickoff talk.</p>
<p>Then we’re going into a section on agentic interfaces. We’ve lived for decades with the GUI (either on computers or mobile applications) and the web as the dominant ways we use computers. AI is changing all that.</p>
<p>It’s not just agentic interfaces, though. It’s really developing true AI-native products, searching out the possibilities of this new computing fabric.</p>
<h2 class="wp-block-heading">The Great Interface Rethink</h2>
<p>In the “normal technology” framing, a fundamental technology innovation is distinct from products based on it. Think of the invention of the LLM itself as electricity, and ChatGPT as the equivalent of Edison’s incandescent light bulb and the development of the distribution network to power it.</p>
<p>There’s a bit of a lesson in the fact that the telegraph was the first large-scale practical application of electricity, over 40 years before Edison’s lightbulb. The telephone was another killer app that used electricity to power it. But despite their scale, these were specialized devices. It was the infrastructure for incandescent lighting that turned electricity into a <a href="https://en.wikipedia.org/wiki/General-purpose_technology" target="_blank" rel="noreferrer noopener">general-purpose technology</a>.</p>
<p>The world soon saw electrical resistance products like irons and toasters, and electric motors powering not just factories but household appliances such as washing machines and eventually refrigerators and air conditioning. Many of these household products were plugged into light sockets, since the pronged plug as we know it today wasn’t introduced until 30 years after the first light bulb.</p>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1086" height="824" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon.png" alt="" class="wp-image-17399" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon.png 1086w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon-300x228.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon-768x583.png 768w" sizes="auto, (max-width: 1086px) 100vw, 1086px" /><figcaption class="wp-element-caption"><em><a href="https://www.facebook.com/photo/?fbid=10158281345021884&set=gm.3012577095721332" target="_blank" rel="noreferrer noopener">Found on Facebook</a>: “Any ideas what this would have been used for? I found it after pulling up carpet – it’s in the corner of a closet in my 1920s ‘fixer-upper’ that I’m slowly bringing back to life. It appears to be for a light bulb and the little flip top is just like floor outlets you see today, but can’t figure out why it would be directly on the floor.”</em></figcaption></figure>
<p>The lesson is that <strong><em>at some point in the development of a general purpose technology, product innovation takes over from pure technology innovation</em></strong>. That’s the phase we’re entering now.</p>
<p>Look at the evolution of LLM-based products: GitHub Copilot embedded AI into Visual Studio Code; the interface was an extension to VS Code, a 10-year-old GUI-based program. Google’s AI efforts were tied into its web-based search products. ChatGPT broke the mold and introduced the first radically new interface since the web browser. Suddenly, chat was the preferred new interface for everything. But Claude took things further with Artifacts and then Claude Code, and once coding assistants gained more complex interfaces, that kicked off today’s fierce competition between coding tools. The next revolution is the construction of a new computing paradigm where software is composed of intelligent, autonomous agents.</p>
<p>I’m really looking forward to Rachel-Lee Nabors’s talk on how, with an agentic interface, we might transcend the traditional browser: AI agents can adapt content directly to users, offering privacy, accessibility, and flexibility that legacy web interfaces cannot match.</p>
<p>But it seems to me that there will be two kinds of agents, which I call “demand side” and “supply side” agents. What’s a “demand side” agent? Instead of navigating complex apps, you’ll simply state your goal. The agent will understand the context, access the necessary tools, and present you with the result. The vision is still science fiction. The reality is often a kludge powered by browser use or API calls, with MCP servers increasingly offering an AI-friendlier interface for those demand-side agents to interact with. But why should it stop there? MCP servers are static interfaces. What if there were agents on both sides of the conversation, in a dynamic negotiation? I suspect that while demand-side agents will be developed by venture funded startups, most server-side agents will be developed by enterprises as a kind of conversational interface for both humans and AI agents that want access to their complex workflows, data, and business models. And those enterprises will often be using agentic platforms tailored for their use. That’s part of the “supply side agent” vision of companies like Sierra. I’ll be talking with Sierra cofounder Clay Bavor about this next step in agentic development.</p>
<p>We’ve grown accustomed to thinking about agents as lonely consumers—“tell me the weather,” “scan my code,” “summarize my inbox.” But that’s only half the story. If we build supply-side agent infrastructure—autonomous, discoverable, governed, negotiated—we unlock agility, resilience, security, and collaboration.</p>
<p>My interest in product innovation, not just advances in the underlying technology, is also why I’m excited about my fireside chat with Josh Woodward, who co-led the team that developed NotebookLM at Google. I’m a huge fan of NotebookLM, which in many ways brought the power of RAG (retrieval-augmented generation) to end users, allowing them to collect a set of documents into a Google drive, and then use that collection to drive chat, audio overviews of documents, study guides, mind maps, and much more.</p>
<p>NotebookLM is also a lovely way to build on the deep collaborative infrastructure provided by Google Drive. We need to think more deeply about collaborative interfaces for AI. Right now, AI interaction is mostly a solitary sport. You can share the outputs with others, but not the generative process. I wrote about this recently in “<a href="https://www.oreilly.com/radar/people-work-in-teams-ai-assistants-in-silos/" target="_blank" rel="noreferrer noopener">People Work in Teams, AI Assistants in Silos</a>.” I think that’s a big miss, and I’m hoping to probe Josh about Google’s plans in this area, and eager to see other innovations in AI-mediated human collaboration.</p>
<p>GitHub is another existing tool for collaboration that has become central to the AI ecosystem. I’m really looking forward to talking with outgoing CEO Thomas Dohmke both about the ways that GitHub already provides a kind of exoskeleton for collaboration when using AI code-generation tools. It seems to me that one of the frontiers of AI-human interfaces will be those that enable not just small teams but eventually large groups to collaborate. I suspect that GitHub may have more to teach us about that future than we now suspect.</p>
<p>And finally, we are now learning that managing context is a critical part of designing effective AI applications. My cochair Addy Osmani will be talking about the emergence of context engineering as a real discipline, and its relevance to agentic AI development.</p>
<h2 class="wp-block-heading">Tool-Chaining Agents and Real Workflows</h2>
<p>Today’s AI tools are largely solo performers—a Copilot suggesting code or a ChatGPT answering a query. The next leap is from single agents to interconnected systems. The program is filled with sessions on “tool-to-tool workflows” and multi-agent systems.</p>
<p>Ken Kousen will showcase the new generation of coding agents, including Claude Code, Codex CLI, Gemini CLI, and Junie, that help developers navigate codebases, automate tasks, and even refactor intelligently. In her talk, Angie Jones takes it further: agents that go beyond code generation to manage PRs, write tests, and update documentation—stepping “out of the IDE” and into real-world workflows.</p>
<p>Even more exciting is the idea of agents collaborating with each other. The Demo Day will showcase a multi-agent coding system where agents share, correct, and evolve code together. This isn’t science fiction; Amit Rustagi’s talk on decentralized AI agent infrastructure using technologies like WebAssembly and IPFS provides a practical architectural framework for making these agent swarms a reality.</p>
<h2 class="wp-block-heading">The Crucial Ingredient: Common Protocols</h2>
<p>How do all these agents talk to each other? How do they discover new tools and use them safely? The answer that echoes throughout the agenda is the Model Context Protocol (MCP).</p>
<p>Much as the distribution network for electricity was the enabler for all of the product innovation of the electrical revolution, MCP is the foundational plumbing, the universal language that will allow this new ecosystem to flourish. Multiple sessions and an entire Demo Day are dedicated to it. We’ll see how Google is using it for agent-to-agent communication, how it can be used to control complex software like Blender with natural language, and even how it can power novel SaaS product demos.</p>
<p>The heavy focus on a standardized protocol signals that the industry is maturing past cool demos and is now building the robust, interoperable infrastructure needed for a true agentic economy.</p>
<p>If the development of the internet is any guide, though, MCP is a beginning, not the end. TCP/IP became the foundation of a layered protocol stack. It is likely that MCP will be followed by many more specialized protocols.</p>
<h2 class="wp-block-heading">Why This Matters</h2>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Theme</th><th><strong>Why It’s Thrilling</strong></th></tr></thead><tbody><tr><td>Autonomous, Distributed AI</td><td>Agents that chain tasks and operate behind the scenes can unlock entirely new ways of building software.</td></tr><tr><td>Human Empowerment & Privacy</td><td>The push against centralized AI systems is a reminder that tools should serve users, not control them.</td></tr><tr><td>Context as Architecture</td><td>Elevating input design to first-class engineering—this will greatly improve reliability, trust, and AI behavior over time.</td></tr><tr><td>New Developer Roles</td><td>We’re seeing developers transition from writing code to orchestrating agents, designing workflows, and managing systems.</td></tr><tr><td>MCP & Network Effects</td><td>The idea of an “AI-native web,” where agents use standardized protocols to talk, is powerful, open-ended, and full of opportunity.</td></tr></tbody></table></figure>
<p>I look forward to seeing you there!</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p class="has-cyan-bluish-gray-background-color has-background"><em>We hope you’ll join us at <strong>AI Codecon: Coding for the Agentic World</strong> on September 9 to explore the tools, workflows, and architectures defining the next era of programming. It’s free to attend. </em><a href="https://www.oreilly.com/AgenticWorld/" target="_blank" rel="noreferrer noopener"><em>Register now to save your seat</em></a><em>.</em> <em>And join us for <a href="https://learning.oreilly.com/live-events/oreilly-demo-day/0642572227968/" target="_blank" rel="noreferrer noopener">O’Reilly Demo Day</a> on September 16 to see how experts are shaping AI systems to work for them via MCP.</em></p>
<p></p>
]]></content:encoded>
</item>
<item>
<title>Understanding the Rehash Loop</title>
<link>https://www.oreilly.com/radar/understanding-the-rehash-loop/</link>
<pubDate>Wed, 03 Sep 2025 10:21:28 +0000</pubDate>
<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17392</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/06/fractal-1832620_1920_crop-c59dcd2418c34f9dccc1469aba4b0ba1.jpg"
medium="image"
type="image/jpeg"
/>
<custom:subtitle><![CDATA[When AI Gets Stuck]]></custom:subtitle>
<description><![CDATA[This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI. In “The Sens-AI Framework: Teaching Developers to Think with AI,” I introduced the concept of the rehash loop—that frustrating pattern where AI tools keep generating variations of the same wrong answer, no matter how you adjust your […]]]></description>
<content:encoded><![CDATA[
<p class="has-cyan-bluish-gray-background-color has-background"><em>This article is part of a series on the </em><a href="https://www.oreilly.com/radar/the-sens-ai-framework/"><em>Sens-AI Framework</em></a><em>—practical habits for learning and coding with AI.</em></p>
<p>In “<a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">The Sens-AI Framework: Teaching Developers to Think with AI</a>,” I introduced the concept of the <strong>rehash loop</strong>—that frustrating pattern where AI tools keep generating variations of the same wrong answer, no matter how you adjust your prompt. It’s one of the most common failure modes in AI-assisted development, and it deserves a deeper look.</p>
<p>Most developers who use AI in their coding work will recognize a rehash loop. The AI generates code that’s almost right—close enough that you think one more tweak will fix it. So you adjust your prompt, add more detail, explain the problem differently. But the response is essentially the same broken solution with cosmetic changes. Different variable names. Reordered operations. Maybe a comment or two. But fundamentally, it’s the same wrong answer.</p>
<h2 class="wp-block-heading"><strong>Recognizing When You’re Stuck</strong></h2>
<p>Rehash loops are frustrating. The model seems so close to understanding what you need but just can’t get you there. Each iteration looks slightly different, which makes you think you’re making progress. Then you test the code and it fails in exactly the same way, or you get the same errors, or you just recognize that it’s a solution that you’ve already seen and dismissed multiple times.</p>
<p>Most developers try to escape through incremental changes—adding details, rewording instructions, nudging the AI toward a fix. These adjustments normally work during regular coding sessions, but in a rehash loop, they lead back to the same constrained set of answers. You can’t tell if there’s no real solution, if you’re asking the wrong question, or if the AI is hallucinating a partial answer and too confident that it works.</p>
<p>When you’re in a rehash loop, the AI isn’t broken. It’s doing exactly what it’s designed to do—generating the most statistically likely response it can, based on the tokens in your prompt and the limited view it has of the conversation. One source of the problem is the <strong>context window</strong>—an architectural limit on how many tokens the model can process at once. That includes your prompt, any shared code, and the rest of the conversation—usually a few thousand tokens total. The model uses this entire sequence to predict what comes next. Once it has sampled the patterns it finds there, it starts circling.</p>
<p>The variations you get—reordered statements, renamed variables, a tweak here or there—aren’t new ideas. They’re just the model nudging things around in the same narrow probability space.</p>
<p>So if you keep getting the same broken answer, the issue probably isn’t that the model doesn’t know how to help. It’s that you haven’t given it enough to work with.</p>
<h2 class="wp-block-heading"><strong>When the Model Runs Out of Context</strong></h2>
<p><em>A rehash loop is a </em><strong><em>signal that the AI ran out of context</em></strong><em>.</em><strong><em> </em></strong>The model has exhausted the useful information in the context you’ve given it. When you’re stuck in a rehash loop, treat it as a signal instead of a problem. Figure out what context is missing and provide it.</p>
<p>Large language models don’t really understand code the way humans do. They generate suggestions by predicting what comes next in a sequence of text based on patterns they’ve seen in massive training datasets. When you prompt them, they analyze your input and predict likely continuations, but they have no real understanding of your design or requirements unless you explicitly provide that context.</p>
<p>The better context you provide, the more useful and accurate the AI’s answers will be. But when the context is incomplete or poorly framed, the AI’s suggestions can drift, repeat variations, or miss the real problem entirely.</p>
<h2 class="wp-block-heading"><strong>Breaking Out of the Loop</strong></h2>
<p><strong>Research</strong> becomes especially important when you hit a rehash loop. You need to learn more before reengaging—reading documentation, clarifying requirements with teammates, thinking through design implications, or even starting another session to ask research questions from a different angle. Starting a new chat with a different AI can help because your prompt might steer it toward a different region of its information space and surface new context.</p>
<p>A rehash loop tells you that the model is stuck trying to solve a puzzle without all the pieces. It keeps rearranging the ones it has, but it can’t reach the right solution until you give it the one piece it needs—that extra bit of context that points it to a different part of the model it wasn’t using. That missing piece might be a key constraint, an example, or a goal you haven’t spelled out yet. You typically don’t need to give it a lot of extra information to break out of the loop. The AI doesn’t need a full explanation; it needs just enough new context to steer it into a part of its training data it wasn’t using.</p>
<p>When you recognize you’re in a rehash loop, trying to nudge the AI and vibe-code your way out of it is usually ineffective—it just leads you in circles. (“Vibe coding” means relying on the AI to generate something that looks plausible and hoping it works, without really digesting the output.) Instead, start investigating what’s missing. Ask the AI to explain its thinking: “What assumptions are you making?” or “Why do you think this solves the problem?” That can reveal a mismatch—maybe it’s solving the wrong problem entirely, or it’s missing a constraint you forgot to mention. It’s often especially helpful to open a chat with a different AI, describe the rehash loop as clearly as you can, and ask what additional context might help.</p>
<p>This is where problem framing really starts to matter. If the model keeps circling the same broken pattern, it’s not just a prompt problem—it’s a signal that your framing needs to shift.</p>
<p><strong>Problem framing</strong> helps you recognize that the model is stuck in the wrong solution space. Your framing gives the AI the clues it needs to assemble patterns from its training that actually match your intent. After researching the actual problem—not just tweaking prompts—you can transform vague requests into targeted questions that steer the AI away from default responses and toward something useful.</p>
<p>Good framing starts by getting clear about the nature of the problem you’re solving. What exactly are you asking the model to generate? What information does it need to do that? Are you solving the right problem in the first place? A lot of failed prompts come from a mismatch between the developer’s intent and what the model is actually being asked to do. Just like writing good code, good prompting depends on understanding the problem you’re solving and structuring your request accordingly.</p>
<h2 class="wp-block-heading"><strong>Learning from the Signal</strong></h2>
<p>When AI keeps circling the same solution, it’s not a failure—it’s information. The rehash loop tells you something about either your understanding of the problem or how you’re communicating it. An incomplete response from the AI is often just a step toward getting the right answer. These moments aren’t failures. They’re signals to do the extra work—often just a small amount of targeted research—that gives the AI the information it needs to get to the right place in its massive information space.</p>
<p><strong><em>AI doesn’t think for you.</em></strong> While it can make surprising connections by recombining patterns from its training, it can’t generate truly new insight on its own. It’s your context that helps it connect those patterns in useful ways. If you’re hitting rehash loops repeatedly, ask yourself: What does the AI need to know to do this well? What context or requirements might be missing?</p>
<p>Rehash loops are one of the clearest signals that it’s time to step back from rapid generation and engage your critical thinking. They’re frustrating, but they’re also valuable—they tell you exactly when the AI has exhausted its current context and needs your help to move forward.</p>
]]></content:encoded>
</item>
<item>
<title>Radar Trends to Watch: September 2025</title>
<link>https://www.oreilly.com/radar/radar-trends-to-watch-september-2025/</link>
<pubDate>Tue, 02 Sep 2025 10:10:37 +0000</pubDate>
<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
<category><![CDATA[Radar Trends]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17379</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-4.png"
medium="image"
type="image/png"
/>
<custom:subtitle><![CDATA[Developments in AI, Security, Web, and More]]></custom:subtitle>
<description><![CDATA[For better or for worse, AI has colonized this list so thoroughly that AI itself is little more than a list of announcements about new or upgraded models. But there are other points of interest. Is it just a coincidence (possibly to do with BlackHat) that so much happened in security in the past month? […]]]></description>
<content:encoded><![CDATA[
<p>For better or for worse, AI has colonized this list so thoroughly that AI itself is little more than a list of announcements about new or upgraded models. But there are other points of interest. Is it just a coincidence (possibly to do with BlackHat) that so much happened in security in the past month? We’re still seeing programming languages—even some new programming languages for writing AI prompts! If you’re into retrocomputing, the much-beloved Commodore 64 is back—with an upgraded audio chip, a new processor, much more RAM, and all your old ports. Heirloom peripherals should still work.</p>
<h2 class="wp-block-heading">AI</h2>
<ul class="wp-block-list">
<li>OpenAI has released their <a href="https://platform.openai.com/docs/guides/realtime" target="_blank" rel="noreferrer noopener">Realtime APIs</a>. The model supports MCP servers, phone calls using the SIP protocol, and image inputs. The release includes <a href="https://openai.com/index/introducing-gpt-realtime/" target="_blank" rel="noreferrer noopener">gpt-realtime</a>, an advanced speech-to-speech model.</li>
<li>ChatGPT now supports <a href="https://help.openai.com/en/articles/6825453-chatgpt-release-notes#h_fb3ac52750" target="_blank" rel="noreferrer noopener">project-only memory</a>. Project memory, which can use previous conversations for additional context, can be limited to a specific project. Project-only memory gives more control over context and prevents one project’s context from contaminating another.</li>
<li><a href="https://arxiv.org/abs/2501.01665" target="_blank" rel="noreferrer noopener">FairSense</a> is a framework for <a href="https://techxplore.com/news/2025-08-fairness-tool-ai-bias-early.html" target="_blank" rel="noreferrer noopener">investigating whether AI systems are fair</a> early on. FairSense runs long-term simulations to detect whether a system will become unfair as it evolves over time.</li>
<li><a href="https://agents4science.stanford.edu/" target="_blank" rel="noreferrer noopener">Agents4Science</a> is a new academic conference in which all the submissions will be <a href="https://www.technologyreview.com/2025/08/22/1122304/ai-scientist-research-autonomous-agents/" target="_blank" rel="noreferrer noopener">researched, written, reviewed, and presented primarily by AI</a> (using text-to-speech for presentations).</li>
<li>Drew Breunig’s mix and match <a href="https://www.dbreunig.com/2025/08/21/a-guide-to-ai-titles.html" target="_blank" rel="noreferrer noopener">cheat sheet for AI job titles</a> is a classic. </li>
<li>Cohere’s <a href="https://cohere.com/blog/command-a-reasoning" target="_blank" rel="noreferrer noopener">Command A Reasoning</a> is another powerful, partially open reasoning model. It is available on <a href="https://huggingface.co/CohereLabs/command-a-reasoning-08-2025?ref=cohere-ai.ghost.io" target="_blank" rel="noreferrer noopener">Hugging Face</a>. It claims to outperform gpt-oss-120b and DeepSeek R1-0528.</li>
<li>DeepSeek has <a href="https://api-docs.deepseek.com/news/news250821" target="_blank" rel="noreferrer noopener">released</a> DeepSeekV3.1. This is a hybrid model that supports reasoning and nonreasoning use. It’s also faster than R1 and has been designed for agentic tasks. It uses reasoning tokens more economically, and it was much less expensive to train than GPT-5.</li>
<li>Anthropic has added the <a href="https://www.anthropic.com/research/end-subset-conversations" target="_blank" rel="noreferrer noopener">ability to terminate chats</a> to Claude Opus. Chats can be terminated if a user persists in making harmful requests. Terminated chats can’t be continued, although users can start a new chat. The feature is currently experimental.</li>
<li>Google has <a href="https://developers.googleblog.com/en/introducing-gemma-3-270m/" target="_blank" rel="noreferrer noopener">released</a> its <a href="https://feedly.com/i/entry/oWwSZ9Xu4Zg49bpJDa0MWTMCyM67pScdNU+gxmUAZuo=_198aa66551c:380d5:c853ad2e" target="_blank" rel="noreferrer noopener">smallest model yet</a>: Gemma 3 270M. This model is designed for fine-tuning and for deployment on small, limited hardware. Here’s a <a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator" target="_blank" rel="noreferrer noopener">bedtime story generator</a> that runs in the browser, built with Gemma 3 270M. </li>
<li>ChatGPT has <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/openai-rolls-out-gmail-calendar-and-contacts-integration-in-chatgpt/" target="_blank" rel="noreferrer noopener">added GMail, Google Calendar, and Google Contacts</a> to its group of connectors, which integrate ChatGPT with other applications. This information will be used to provide additional context—and presumably will be used for training or discovery in ongoing lawsuits. Fortunately, it’s (at this point) opt-in. </li>
<li>Anthropic has <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/claude-gets-1m-tokens-support-via-api-to-take-on-gemini-25-pro/" target="_blank" rel="noreferrer noopener">upgraded</a> Claude Sonnet 4 with a <a href="https://www.anthropic.com/news/1m-context" target="_blank" rel="noreferrer noopener">1M token context window</a>. The larger context window is only available via the API.</li>
<li>OpenAI <a href="https://openai.com/index/introducing-gpt-5/" target="_blank" rel="noreferrer noopener">released</a> GPT-5. Simon Willison’s <a href="https://simonwillison.net/2025/Aug/7/gpt-5/" target="_blank" rel="noreferrer noopener">review</a> is excellent. It doesn’t feel like a breakthrough, but it is quietly better at delivering good results. It is claimed to be less prone to hallucination and incorrect answers. One quirk is that with ChatGPT, GPT-5 determines which model should respond to your prompt.</li>
<li>Anthropic is researching <a href="https://www.anthropic.com/research/persona-vectors" target="_blank" rel="noreferrer noopener">persona vectors</a> as a means of training a language model to behave correctly. Steering a model toward inappropriate behavior during training can be a kind of “vaccination” against that behavior when the model is deployed, without compromising other aspects of the model’s behavior.</li>
<li>The <a href="https://sakana.ai/dgm/" target="_blank" rel="noreferrer noopener">Darwin Gödel Machine</a> is an agent that can read and modify its own code to improve its performance on tasks. It can add tools, re-organize workflows, and evaluate whether these changes have improved its performance.</li>
<li>Grok is at it again: <a href="https://www.theverge.com/report/718975/xai-grok-imagine-taylor-swifty-deepfake-nudes" target="_blank" rel="noreferrer noopener">generating nude deepfakes of Taylor Swift</a> without being prompted to do so. I’m sure we’ll be told that this was the result of an unauthorized modification to the system prompt. In AI, some things are predictable.</li>
<li>Anthropic has <a href="https://www.anthropic.com/news/claude-opus-4-1" target="_blank" rel="noreferrer noopener">released</a> Claude Opus 4.1, an upgrade to its flagship model. We expect this to be the “gold standard” for generative coding.</li>
<li>OpenAI has <a href="https://openai.com/open-models/" target="_blank" rel="noreferrer noopener">released</a> two open-weight models, their first since GPT-2: <a href="https://huggingface.co/openai/gpt-oss-120b" target="_blank" rel="noreferrer noopener">gpt-oss-120b</a> and <a href="https://huggingface.co/openai/gpt-oss-20b" target="_blank" rel="noreferrer noopener">gpt-oss-20b</a>. They are reasoning models designed for use in agentic applications. Claimed <a href="https://openai.com/index/introducing-gpt-oss/" target="_blank" rel="noreferrer noopener">performance</a> is similar to OpenAI’s o3 and o4-mini.</li>
<li>OpenAI has also released a “response format” named <a href="https://cookbook.openai.com/articles/openai-harmony" target="_blank" rel="noreferrer noopener">Harmony</a>. It’s not quite a protocol, but it is a standard that specifies the format of conversations by defining roles (system, user, etc.) and channels (final, analysis, commentary) for a model’s output.</li>
<li>Can AIs <a href="https://techxplore.com/news/2025-07-ai-evolve-guilt-social-environments.html" target="_blank" rel="noreferrer noopener">evolve guilt</a>? Guilt is expressed in human language; it’s in the training data. The AI that deleted a production database because it “panicked” certainly <a href="https://www.pcgamer.com/software/ai/i-destroyed-months-of-your-work-in-seconds-says-ai-coding-tool-after-deleting-a-devs-entire-database-during-a-code-freeze-i-panicked-instead-of-thinking/" target="_blank" rel="noreferrer noopener">expressed guilt</a>. Whether an AI’s expressions of guilt are meaningful in any way is a different question.</li>
<li><a href="https://github.com/musistudio/claude-code-router" target="_blank" rel="noreferrer noopener">Claude Code Router</a> is a tool for routing Claude Code requests to different models. You can choose different models for different kinds of requests.</li>
<li>Qwen has released a thinking version of their flagship model, called <a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507" target="_blank" rel="noreferrer noopener">Qwen3-235B-A22B-Thinking-2507</a>. Thinking cannot be switched on or off. The model was trained with a new reinforcement learning algorithm called <a href="https://www.arxiv.org/abs/2507.18071" target="_blank" rel="noreferrer noopener">Group Sequence Policy Optimization</a>. It burns a lot of tokens, and it’s not very good at <a href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/#atom-everything" target="_blank" rel="noreferrer noopener">pelicans</a>.</li>
<li>ChatGPT is releasing “<a href="https://www.bleepingcomputer.com/news/artificial-intelligence/chatgpt-is-rolling-out-personality-toggles-to-become-your-assistant/" target="_blank" rel="noreferrer noopener">personalities</a>” that control how it formulates its responses. Users can select the personality they want to respond: robot, cynic, listener, sage, and presumably more. </li>
<li>DeepMind has created <a href="https://blog.google/technology/google-deepmind/aeneas/" target="_blank" rel="noreferrer noopener">Aeneas</a>, a new model designed to help scholars understand ancient fragments. In ancient text, large pieces are often missing. Can AI help place these fragments into contexts where they can be understood? Latin only, for now.</li>
</ul>
<h2 class="wp-block-heading">Security</h2>
<ul class="wp-block-list">
<li>The US Cybersecurity and Infrastructure Security Agency (CISA) has <a href="https://www.bleepingcomputer.com/news/security/cisa-warns-of-actively-exploited-git-code-execution-flaw/" target="_blank" rel="noreferrer noopener">warned</a> that a serious <a href="https://nvd.nist.gov/vuln/detail/cve-2025-48384" target="_blank" rel="noreferrer noopener">code execution vulnerability</a> in Git is currently being exploited in the wild.</li>
<li>Is it possible to build an agentic browser that is <a href="https://guard.io/labs/scamlexity-we-put-agentic-ai-browsers-to-the-test-they-clicked-they-paid-they-failed" target="_blank" rel="noreferrer noopener">safe</a> from prompt injection? <a href="https://brave.com/blog/comet-prompt-injection/" target="_blank" rel="noreferrer noopener">Probably</a> <a href="https://simonwillison.net/2025/Aug/25/agentic-browser-security/#atom-everything" target="_blank" rel="noreferrer noopener">not</a>. Separating user instructions from website content isn’t possible. If a browser can’t take direction from the content of a web page, how is it to act as an agent?</li>
<li>The solution to Part 4 of <a href="https://en.wikipedia.org/wiki/Kryptos" target="_blank" rel="noreferrer noopener">Kryptos</a>, the CIA’s decades-old cryptographic sculpture, is <a href="https://www.schneier.com/blog/archives/2025/08/jim-sanborn-is-auctioning-off-the-solution-to-part-four-of-the-kryptos-sculpture.html" target="_blank" rel="noreferrer noopener">for sale</a>! Jim Sanborn, the creator of Kryptos, is auctioning the solution. He hopes that the winner will preserve the secret and take over verifying people’s claims to have solved the puzzle. </li>
<li>Remember XZ, the supply-chain attack that granted backdoor access via a trojaned compression library? It <a href="https://www.binarly.io/blog/persistent-risk-xz-utils-backdoor-still-lurking-in-docker-images" target="_blank" rel="noreferrer noopener">never went away</a>. Although the affected libraries were quickly patched, it’s still active, and propagating, via Docker images that were built with unpatched libraries. Some gifts keep giving.</li>
<li>For August, <a href="https://embracethered.com/blog/" target="_blank" rel="noreferrer noopener"><em>Embrace the Red</em></a> published <a href="https://embracethered.com/blog/posts/2025/announcement-the-month-of-ai-bugs/" target="_blank" rel="noreferrer noopener">The Month of AI Bugs</a>, a daily post about AI vulnerabilities (mostly various forms of prompt injection). This series is essential reading for AI developers and for security professionals.</li>
<li>NIST has finalized a <a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-232.pdf" target="_blank" rel="noreferrer noopener">standard</a> for <a href="https://techxplore.com/news/2025-08-lightweight-cryptography-standard-small-devices.html" target="_blank" rel="noreferrer noopener">lightweight cryptography</a>. Lightweight cryptography is a cryptographic system designed for use by small devices. It is useful both for encrypting sensitive data and for authentication. </li>
<li>The <a href="https://darkpatternstipline.org/" target="_blank" rel="noreferrer noopener">Dark Patterns Tip Line</a> is a site for reporting dark patterns: design features in websites and applications that are designed to trick us into acting against our own interest.</li>
<li>OpenSSH supports <a href="https://www.openssh.com/pq.html" target="_blank" rel="noreferrer noopener">post-quantum key agreement</a>, and in versions 10.1 and later, will warn users when they select a non-post-quantum key agreement scheme.</li>
<li><a href="https://arstechnica.com/security/2025/08/adult-sites-use-malicious-svg-files-to-rack-up-likes-on-facebook/" target="_blank" rel="noreferrer noopener">SVG files can carry a malware payload</a>; pornographic SVGs include JavaScript payloads that automate clicking “like.” That’s a simple attack with few consequences, but much more is possible, including cross-site scripting, denial of service, and other exploits.</li>
<li>Google’s AI agent for discovering security flaws, <a href="https://cloud.google.com/blog/products/identity-security/cloud-ciso-perspectives-our-big-sleep-agent-makes-big-leap" target="_blank" rel="noreferrer noopener">Big Sleep</a>, has <a href="https://techcrunch.com/2025/08/04/google-says-its-ai-based-bug-hunter-found-20-security-vulnerabilities/" target="_blank" rel="noreferrer noopener">found 20 flaws</a> in popular software. DeepMind discovered and reproduced the flaws, which were then verified by human security experts and reported. Details won’t be provided until the flaws have been fixed.</li>
<li>The US CISA (Cybersecurity and Infrastructure Security Agency) has <a href="https://www.bleepingcomputer.com/news/security/cisa-open-sources-thorium-platform-for-malware-forensic-analysis/" target="_blank" rel="noreferrer noopener">open-sourced</a> <a href="https://www.cisa.gov/resources-tools/resources/thorium" target="_blank" rel="noreferrer noopener">Thorium</a>, a platform for malware and forensic analysis.</li>
<li>Prompt injection, again: A new prompt injection attack embeds <a href="https://bdtechtalks.com/2025/07/30/legalpwn-llm-prompt-injection/" target="_blank" rel="noreferrer noopener">instructions in language that appears to be copyright notices and other legal fine print</a>. To avoid litigation, many models are configured to prioritize legal instructions.</li>
<li>Light can be <a href="https://techxplore.com/news/2025-07-secret-codes-fake-videos.html" target="_blank" rel="noreferrer noopener">watermarked</a>; this may be useful as a technique for detecting fake or manipulated video.</li>
<li><a href="https://www.bleepingcomputer.com/news/security/ai-cuts-vciso-workload-by-68-percent-as-demand-skyrockets-new-report-finds/" target="_blank" rel="noreferrer noopener">vCISO (Virtual CISO) services are thriving</a>, particularly among small and mid-size businesses that can’t afford a full security team. The use of AI is cutting the vCISO workload. But who takes the blame when there’s an incident?</li>
<li>A <a href="https://www.bleepingcomputer.com/news/security/hackers-target-python-devs-in-phishing-attacks-using-fake-pypi-site/" target="_blank" rel="noreferrer noopener">phishing attack against PyPI users</a> directs them to a fake PyPI site that tells them to verify their login credentials. Stolen credentials could be used to plant malware in the genuine PyPI repository. Users of <a href="https://www.bleepingcomputer.com/news/security/mozilla-warns-of-phishing-attacks-targeting-add-on-developers/" target="_blank" rel="noreferrer noopener">Mozilla’s add-on repository</a> have also been targeted by phishing attacks.</li>
<li>A new ransomware group named <a href="https://arstechnica.com/security/2025/07/after-blacksuit-is-taken-down-new-ransomware-group-chaos-emerges/" target="_blank" rel="noreferrer noopener">Chaos</a> appears to be a rebranding of the BlackSuit group, which was taken down recently. BlackSuit itself is a rebranding of the Royal group, which in turn is a descendant of the Conti group. Whack-a-mole continues.</li>
<li>Google’s <a href="https://security.googleblog.com/2025/07/introducing-oss-rebuild-open-source.html" target="_blank" rel="noreferrer noopener">OSS Rebuild</a> project is an important step forward in supply chain security. Rebuild provides build definitions along with metadata that can confirm projects were built correctly. OSS Rebuild currently supports the NPM, PyPl, and Crates ecosystems.</li>
<li>The <a href="https://www.bleepingcomputer.com/news/security/npm-package-is-with-28m-weekly-downloads-infected-devs-with-malware/" target="_blank" rel="noreferrer noopener">JavaScript package “is,</a>” which does some simple type checking, has been infected with malware. Supply chain security is a huge issue—be careful what you install!</li>
</ul>
<h2 class="wp-block-heading">Programming</h2>
<ul class="wp-block-list">
<li><a href="https://github.com/automazeio/ccpm" target="_blank" rel="noreferrer noopener">Claude Code PM</a> is a workflow management system for programming with Claude. It manages PRDs, GitHub, and parallel execution of coding agents. It claims to facilitate collaboration between multiple Claude instances working on the same project. </li>
<li>Rust is increasingly used to <a href="https://thenewstack.io/rust-pythons-new-performance-engine/" target="_blank" rel="noreferrer noopener">implement performance-critical extensions</a> to Python, gradually displacing C. Polars, Pydantic, and FastAPI are three libraries that rely on Rust.</li>
<li>Microsoft’s <a href="https://microsoft.github.io/poml/latest/" target="_blank" rel="noreferrer noopener">Prompt Orchestration Markup Language</a> (<a href="https://medium.com/data-science-in-your-pocket/microsoft-poml-programming-language-for-prompting-adfc846387a4" target="_blank" rel="noreferrer noopener">POML</a>) is an HTML-like markup language for writing prompts. It is then compiled into the actual prompt. POML is good at templating and has tags for tabular and document data. Is this a step forward? You be the judge.</li>
<li><a href="https://claudiacode.com/" target="_blank" rel="noreferrer noopener">Claudia</a> is an “elegant desktop companion” for Claude Code; it turns terminal-based Claude Code into something more like an IDE, though it seems to focus more on the workflow than on coding.</li>
<li>Google’s <a href="https://developers.googleblog.com/en/introducing-langextract-a-gemini-powered-information-extraction-library/" target="_blank" rel="noreferrer noopener">LangExtract</a> is a simple but powerful Python library for extracting text from documents. It relies on examples, rather than regular expressions or other hacks, and shows the exact context in which the extracts occur. LangExtract is open source.</li>
<li>Microsoft appears to be <a href="https://www.theverge.com/news/757461/microsoft-github-thomas-dohmke-resignation-coreai-team-transition" target="_blank" rel="noreferrer noopener">integrating GitHub into its AI team</a> rather than running it as an independent organization. What this means for GitHub users is unclear. </li>
<li>Cursor now has a <a href="https://cursor.com/en/cli" target="_blank" rel="noreferrer noopener">command-line interface</a>, almost certainly a belated response to the success of Claude Code CLI and Gemini CLI. </li>
<li><a href="https://thenewstack.io/why-latency-is-quietly-breaking-enterprise-ai-at-scale/" target="_blank" rel="noreferrer noopener">Latency</a> is a problem for enterprise AI. And the root cause of latency in AI applications is usually the database.</li>
<li>The <a href="https://www.musicradar.com/music-tech/weve-been-sleeping-for-30-years-please-excuse-us-the-commodore-64-is-back-packed-with-extra-power-for-chiptune-music-makers" target="_blank" rel="noreferrer noopener">Commodore 64</a> is back. With several orders of magnitude more RAM. And all the original ports, plus HDMI. </li>
<li>Google has <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/" target="_blank" rel="noreferrer noopener">announced</a> Gemini CLI GitHub Actions, an addition to their agentic coder that allows it to work directly with GitHub repositories. </li>
<li><a href="https://www.infoworld.com/article/4029053/jetbrains-working-on-higher-abstraction-programming-language.html" target="_blank" rel="noreferrer noopener">JetBrains is developing a new programming language</a> for use when programming with LLMs. That language may be a dialect of English. (<a href="https://www.oreilly.com/radar/formal-informal-languages/" target="_blank" rel="noreferrer noopener">Formal informal languages</a>, anyone?) </li>
<li><a href="https://www.ponylang.io/discover/" target="_blank" rel="noreferrer noopener">Pony</a> is a new programming language that is type-safe, memory-safe, exception-safe, race-safe, and deadlock-safe. You can <a href="https://playground.ponylang.io/" target="_blank" rel="noreferrer noopener">try</a> it in a browser-based playground.</li>
</ul>
<h2 class="wp-block-heading">Web</h2>
<ul class="wp-block-list">
<li>The AT Protocol is the core of Bluesky. Here’s a <a href="https://mackuba.eu/2025/08/20/introduction-to-atproto/" target="_blank" rel="noreferrer noopener">tutorial</a>; use it to build your own Bluesky services, in turn making Bluesky truly federate. </li>
<li>Social media is broken, and <a href="https://arstechnica.com/science/2025/08/study-social-media-probably-cant-be-fixed/" target="_blank" rel="noreferrer noopener">probably can’t be fixed</a>. Now you know. The surprise is that the problem isn’t “algorithms” for maximizing engagement; take algorithms away and everything stays the same or gets worse. </li>
<li>The <a href="https://waxy.org/2025/08/vote-on-the-2025-tiny-awards-finalists/" target="_blank" rel="noreferrer noopener">Tiny Awards Finalists</a> show just how much is possible on the Web. They’re moving, creative, and playful. For example, the <a href="https://www.trafficcamphotobooth.com/about.html" target="_blank" rel="noreferrer noopener">Traffic Cam Photobooth</a> lets people use traffic cameras to take pictures of themselves, playing with ever-present automated surveillance.</li>
<li>A US federal court has <a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.372884/gov.uscourts.cand.372884.756.0_2.pdf" target="_blank" rel="noreferrer noopener">found</a> that <a href="https://arstechnica.com/tech-policy/2025/08/jury-finds-meta-broke-wiretap-law-by-collecting-data-from-period-tracker-app/" target="_blank" rel="noreferrer noopener">Facebook illegally collected data</a> from the women’s health app Flo. </li>
<li>The <a href="https://www.htmlhobbyist.com/" target="_blank" rel="noreferrer noopener">HTML Hobbyist</a> is a great site for people who want to create their own presence on the web—outside of walled gardens, without mind-crushing frameworks. It’s not difficult, and it’s not expensive.</li>
</ul>
<h2 class="wp-block-heading">Biology and Quantum Computing</h2>
<ul class="wp-block-list">
<li>Scientists have created <a href="https://phys.org/news/2025-08-scientists-cells-biological-qubit-multidisciplinary.html" target="_blank" rel="noreferrer noopener">biological qubits</a>: quantum qubits built from proteins in living cells. These probably won’t be used to break cryptography, but they are likely to give us insight into how quantum processes work inside living things.</li>
</ul>
]]></content:encoded>
</item>
<item>
<title>Working with Contexts</title>
<link>https://www.oreilly.com/radar/working-with-contexts/</link>
<pubDate>Thu, 28 Aug 2025 10:02:43 +0000</pubDate>
<dc:creator><![CDATA[Drew Breunig]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Deep Dive]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17373</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-3.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[The following article comes from two blog posts by Drew Breunig: “How Long Contexts Fail” and “How to Fix Your Contexts.” Managing Your Context Is the Key to Successful Agents As frontier model context windows continue to grow,1 with many supporting up to 1 million tokens, I see many excited discussions about how long-context windows […]]]></description>
<content:encoded><![CDATA[
<p class="has-cyan-bluish-gray-background-color has-background"><em>The following article comes from two blog posts by Drew Breunig: “<a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">How Long Contexts Fail</a>” and “<a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html" target="_blank" rel="noreferrer noopener">How to Fix Your Contexts</a>.”</em></p>
<h2 class="wp-block-heading">Managing Your Context Is the Key to Successful Agents</h2>
<p>As frontier model context windows continue to grow,<sup>1</sup> with many supporting up to 1 million tokens, I see many excited discussions about how long-context windows will unlock the agents of our dreams. After all, with a large enough window, you can simply throw <em>everything</em> into a prompt you might need—tools, documents, instructions, and more—and let the model take care of the rest.</p>
<p>Long contexts kneecapped RAG enthusiasm (no need to find the best doc when you can fit it all in the prompt!), enabled MCP hype (connect to every tool and models can do any job!), and fueled enthusiasm for agents.<sup>2</sup></p>
<p>But in reality, longer contexts do not generate better responses. Overloading your context can cause your agents and applications to fail in surprising ways. Contexts can become poisoned, distracting, confusing, or conflicting. This is especially problematic for agents, which rely on context to gather information, synthesize findings, and coordinate actions.</p>
<p>Let’s run through the ways contexts can get out of hand, then review methods to mitigate or entirely avoid context fails.</p>
<h3 class="wp-block-heading">Context Poisoning</h3>
<p><em>Context poisoning is when a hallucination or other error makes it into the context, where it is repeatedly referenced.</em></p>
<p>The DeepMind team called out context poisoning in the <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf" target="_blank" rel="noreferrer noopener">Gemini 2.5 technical report</a>, which <a href="https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html" target="_blank" rel="noreferrer noopener">we broke down previously</a>. When playing Pokémon, the Gemini agent would occasionally hallucinate, poisoning its context:</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>An especially egregious form of this issue can take place with “context poisoning”—where many parts of the context (goals, summary) are “poisoned” with misinformation about the game state, which can often take a very long time to undo. As a result, the model can become fixated on achieving impossible or irrelevant goals.</p>
</blockquote>
<p>If the “goals” section of its context was poisoned, the agent would develop nonsensical strategies and repeat behaviors in pursuit of a goal that cannot be met.</p>
<h3 class="wp-block-heading">Context Distraction</h3>
<p><em>Context distraction is when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.</em></p>
<p>As context grows during an agentic workflow—as the model gathers more information and builds up history—this accumulated context can become distracting rather than helpful. The Pokémon-playing Gemini agent demonstrated this problem clearly:</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>While Gemini 2.5 Pro supports 1M+ token context, making effective use of it for agents presents a new research frontier. In this agentic setup, it was observed that as the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans. This phenomenon, albeit anecdotal, highlights an important distinction between long-context for retrieval and long-context for multistep, generative reasoning.</p>
</blockquote>
<p>Instead of using its training to develop new strategies, the agent became fixated on repeating past actions from its extensive context history.</p>
<p>For smaller models, the distraction ceiling is much lower. A <a href="https://www.databricks.com/blog/long-context-rag-performance-llms" target="_blank" rel="noreferrer noopener">Databricks study</a> found that model correctness began to fall around 32k for Llama 3.1-405b and earlier for smaller models.</p>
<p>If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarization<sup>3</sup> and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.</p>
<h3 class="wp-block-heading">Context Confusion</h3>
<p><em>Context confusion is when superfluous content in the context is used by the model to generate a low-quality response.</em></p>
<p>For a minute there, it really seemed like <em>everyone</em> was going to ship an <a href="https://www.dbreunig.com/2025/03/18/mcps-are-apis-for-llms.html" target="_blank" rel="noreferrer noopener">MCP</a>. The dream of a powerful model, connected to <em>all</em> your services and <em>stuff</em>, doing all your mundane tasks felt within reach. Just throw all the tool descriptions into the prompt and hit go. <a href="https://www.dbreunig.com/2025/05/07/claude-s-system-prompt-chatbots-are-more-than-just-models.html" target="_blank" rel="noreferrer noopener">Claude’s system prompt</a> showed us the way, as it’s mostly tool definitions or instructions for using tools.</p>
<p>But even if <a href="https://www.dbreunig.com/2025/06/16/drawbridges-go-up.html" target="_blank" rel="noreferrer noopener">consolidation and competition don’t slow MCPs</a>, <em>context confusion</em> will. It turns out there can be such a thing as too many tools.</p>
<p>The <a href="https://gorilla.cs.berkeley.edu/leaderboard.html" target="_blank" rel="noreferrer noopener">Berkeley Function-Calling Leaderboard</a> is a tool-use benchmark that evaluates the ability of models to effectively use tools to respond to prompts. Now on its third version, the leaderboard shows that <em>every</em> model performs worse when provided with more than one tool.<sup>4</sup> Further, the Berkeley team, “designed scenarios where none of the provided functions are relevant…we expect the model’s output to be no function call.” Yet, all models will occasionally call tools that aren’t relevant.</p>
<p>Browsing the function-calling leaderboard, you can see the problem get worse as the models get smaller:</p>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="866" height="358" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models.png" alt="Tool-calling irrelevance score for Gemma models (chart from dbreunig.com, source: Berkeley Function-Calling Leaderboard; created with Datawrapper)" class="wp-image-17374" title="Tool-Calling Irrelevance Score for Gemma Models" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models.png 866w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models-300x124.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models-768x317.png 768w" sizes="auto, (max-width: 866px) 100vw, 866px" /></figure>
<p>A striking example of context confusion can be seen in a <a href="https://arxiv.org/pdf/2411.15399?" target="_blank" rel="noreferrer noopener">recent paper</a> that evaluated small model performance on the <a href="https://arxiv.org/abs/2404.15500" target="_blank" rel="noreferrer noopener">GeoEngine benchmark</a>, a trial that features <em>46 different tools</em>. When the team gave a quantized (compressed) Llama 3.1 8b a query with all 46 tools, it failed, even though the context was well within the 16k context window. But when they only gave the model 19 tools, it succeeded.</p>
<p>The problem is, if you put something in the context, <em>the model has to pay attention to it.</em> It may be irrelevant information or needless tool definitions, but the model <em>will</em> take it into account. Large models, especially reasoning models, are getting better at ignoring or discarding superfluous context, but we continually see worthless information trip up agents. Longer contexts let us stuff in more info, but this ability comes with downsides.</p>
<h3 class="wp-block-heading">Context Clash</h3>
<p><em>Context clash is when you accrue new information and tools in your context that conflicts with other information in the context.</em></p>
<p>This is a more problematic version of <em>context confusion</em>. The bad context here isn’t irrelevant, it directly conflicts with other information in the prompt.</p>
<p>A Microsoft and Salesforce team documented this brilliantly in a <a href="https://arxiv.org/pdf/2505.06120" target="_blank" rel="noreferrer noopener">recent paper</a>. The team took prompts from multiple benchmarks and “sharded” their information across multiple prompts. Think of it this way: Sometimes, you might sit down and type paragraphs into ChatGPT or Claude before you hit enter, considering every necessary detail. Other times, you might start with a simple prompt, then add further details when the chatbot’s answer isn’t satisfactory. The Microsoft/Salesforce team modified benchmark prompts to look like these multistep exchanges:</p>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="867" height="196" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts.png" alt="Microsoft/Salesforce team benchmark prompts" class="wp-image-17375" title="Microsoft/Salesforce team benchmark prompts" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts.png 867w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts-300x68.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts-768x174.png 768w" sizes="auto, (max-width: 867px) 100vw, 867px" /></figure>
<p>All the information from the prompt on the left side is contained within the several messages on the right side, which would be played out in multiple chat rounds.</p>
<p>The sharded prompts yielded dramatically worse results, with an average drop of 39%. And the team tested a range of models—OpenAI’s vaunted o3’s score dropped from 98.1 to 64.1.</p>
<p>What’s going on? Why are models performing worse if information is gathered in stages rather than all at once?</p>
<p>The answer is <em>context confusion</em>: The assembled context, containing the entirety of the chat exchange, contains early attempts by the model to answer the challenge <em>before it has all the information</em>. These incorrect answers remain present in the context and influence the model when it generates its final answer. The team writes:</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.</p>
</blockquote>
<p>This does not bode well for agent builders. Agents assemble context from documents, tool calls, and from other models tasked with subproblems. All of this context, pulled from diverse sources, has the potential to disagree with itself. Further, when you connect to MCP tools you didn’t create there’s a greater chance their descriptions and instructions clash with the rest of your prompt.</p>
<h2 class="wp-block-heading">Learnings</h2>
<p>The arrival of million-token context windows felt transformative. The ability to throw everything an agent might need into the prompt inspired visions of superintelligent assistants that could access any document, connect to every tool, and maintain perfect memory.</p>
<p>But, as we’ve seen, bigger contexts create new failure modes. Context poisoning embeds errors that compound over time. Context distraction causes agents to lean heavily on their context and repeat past actions rather than push forward. Context confusion leads to irrelevant tool or document usage. Context clash creates internal contradictions that derail reasoning.</p>
<p>These failures hit agents hardest because agents operate in exactly the scenarios where contexts balloon: gathering information from multiple sources, making sequential tool calls, engaging in multi-turn reasoning, and accumulating extensive histories.</p>
<p>Fortunately, there are solutions!</p>
<h2 class="wp-block-heading">Mitigating and Avoiding Context Failures</h2>
<p>Let’s run through the ways we can mitigate or avoid context failures entirely.</p>
<p>Everything is about information management. Everything in the context influences the response. We’re back to the old programming adage of “<a href="https://en.wikipedia.org/wiki/Garbage_in,_garbage_out" target="_blank" rel="noreferrer noopener">garbage in, garbage out</a>.” Thankfully, there’s plenty of options for dealing with the issues above.</p>
<h3 class="wp-block-heading">RAG</h3>
<p><em>Retrieval-augmented generation (RAG) is the act of selectively adding relevant information to help the LLM generate a better response.</em></p>
<p>Because so much has been written about RAG, we’re not going to cover it here beyond saying: It’s very much alive.</p>
<p>Every time a model ups the context window ante, a new “RAG is dead” debate is born. The last significant event was when Llama 4 Scout landed with a <em>10 million token window</em>. At that size, it’s <em>really</em> tempting to think, “Screw it, throw it all in,” and call it a day.</p>
<p>But, as we’ve already covered, if you treat your context like a junk drawer, the junk will <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html#context-confusion" target="_blank" rel="noreferrer noopener">influence your response</a>. If you want to learn more, here’s a <a href="https://maven.com/p/569540/i-don-t-use-rag-i-just-retrieve-documents" target="_blank" rel="noreferrer noopener">new course that looks great</a>.</p>
<h3 class="wp-block-heading">Tool Loadout</h3>
<p><em>Tool loadout is the act of selecting only relevant tool definitions to add to your context.</em></p>
<p>The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round. Usually, your loadout is tailored to the context—the character, the level, the rest of your team’s makeup, and your own skill set. Here, we’re borrowing the term to describe selecting the most relevant tools for a given task.</p>
<p>Perhaps the simplest way to select tools is to apply RAG to your tool descriptions. This is exactly what Tiantian Gan and Qiyao Sun did, which they detail in their paper “<a href="https://arxiv.org/abs/2505.03275" target="_blank" rel="noreferrer noopener">RAG MCP</a>.” By storing their tool descriptions in a vector database, they’re able to select the most relevant tools given an input prompt.</p>
<p>When prompting DeepSeek-v3, the team found that selecting the right tools becomes critical when you have more than 30 tools. Above 30, the descriptions of the tools begin to overlap, creating confusion. Beyond <em>100 tools</em>, the model was virtually guaranteed to fail their test. Using RAG techniques to select fewer than 30 tools yielded dramatically shorter prompts and resulted in as much as 3x better tool selection accuracy.</p>
<p>For smaller models, the problems begin long before we hit 30 tools. One paper we touched on previously, “<a href="https://arxiv.org/abs/2411.15399" target="_blank" rel="noreferrer noopener">Less is More</a>,” demonstrated that Llama 3.1 8b fails a benchmark when given 46 tools, but succeeds when given only 19 tools. The issue is context confusion, <em>not</em> context window limitations.</p>
<p>To address this issue, the team behind “Less is More” developed a way to dynamically select tools using an LLM-powered tool recommender. The LLM was prompted to reason about “number and type of tools it ‘believes’ it requires to answer the user’s query.” This output was then semantically searched (tool RAG, again) to determine the final loadout. They tested this method with the <a href="https://gorilla.cs.berkeley.edu/leaderboard.html" target="_blank" rel="noreferrer noopener">Berkeley Function-Calling Leaderboard</a>, finding Llama 3.1 8b performance improved by 44%.</p>
<p>The “Less is More” paper notes two other benefits to smaller contexts—reduced power consumption and speed—crucial metrics when operating at the edge (meaning, running an LLM on your phone or PC, not on a specialized server). Even when their dynamic tool selection method <em>failed</em> to improve a model’s result, the power savings and speed gains were worth the effort, yielding savings of 18% and 77%, respectively.</p>
<p>Thankfully, most agents have smaller surface areas that only require a few hand-curated tools. But if the breadth of functions or the amount of integrations needs to expand, always consider your loadout.</p>
<h3 class="wp-block-heading">Context Quarantine</h3>
<p><em>Context quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs.</em></p>
<p>We see better results when our contexts aren’t too long and don’t sport irrelevant content. One way to achieve this is to break our tasks up into smaller, isolated jobs—each with its own context.</p>
<p>There are <a href="https://arxiv.org/abs/2402.14207" target="_blank" rel="noreferrer noopener">many</a> <a href="https://www.microsoft.com/en-us/research/articles/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/" target="_blank" rel="noreferrer noopener">examples</a> of this tactic, but an accessible write-up of this strategy is Anthropic’s <a href="https://www.anthropic.com/engineering/built-multi-agent-research-system" target="_blank" rel="noreferrer noopener">blog post detailing its multi-agent research system</a>. They write:</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.</p>
</blockquote>
<p>Research lends itself to this design pattern. When given a question, multiple agents can identify and separately prompt several subquestions or areas of exploration. This not only speeds up the information gathering and distillation (if there’s compute available), but it keeps each context from accruing too much information or information not relevant to a given prompt, delivering higher quality results:</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single-agent system failed to find the answer with slow, sequential searches.</p>
</blockquote>
<p>This approach also helps with tool loadouts, as the agent designer can create several agent archetypes with their own dedicated loadout and instructions for how to utilize each tool.</p>
<p>The challenge for agent builders, then, is to find opportunities for isolated tasks to spin out onto separate threads. Problems that require context-sharing among multiple agents aren’t particularly suited to this tactic.</p>
<p>If your agent’s domain is at all suited to parallelization, be sure to <a href="https://www.anthropic.com/engineering/built-multi-agent-research-system" target="_blank" rel="noreferrer noopener">read the whole Anthropic write-up</a>. It’s excellent.</p>
<h3 class="wp-block-heading">Context Pruning</h3>
<p><em>Context pruning is the act of removing irrelevant or otherwise unneeded information from the context.</em></p>
<p>Agents accrue context as they fire off tools and assemble documents. At times, it’s worth pausing to assess what’s been assembled and remove the cruft. This could be something you task your main LLM with or you could design a separate LLM-powered tool to review and edit the context. Or you could choose something more tailored to the pruning task.</p>
<p>Context pruning has a (relatively) long history, as context lengths were a more problematic bottleneck in the natural language processing (NLP) field prior to ChatGPT. Building on this history, a current pruning method is <a href="https://arxiv.org/abs/2501.16214" target="_blank" rel="noreferrer noopener">Provence</a>, “an efficient and robust context pruner for question answering.”</p>
<p>Provence is fast, accurate, simple to use, and relatively small—only 1.75 GB. You can call it in a few lines, like so:</p>
<pre class="wp-block-code"><code>from transformers import AutoModel
provence = AutoModel.from_pretrained("naver/provence-reranker-debertav3-v1", trust_remote_code=True)
# <em>Read in a markdown version of the Wikipedia entry for Alameda, CA</em>
with open('alameda_wiki.md', 'r', encoding='utf-8') as f:
alameda_wiki = f.read()
# <em>Prune the article, given a question</em>
question = 'What are my options for leaving Alameda?'
provence_output = provence.process(question, alameda_wiki)</code></pre>
<p>Provence edited the article, cutting 95% of the content, leaving me with only <a href="https://gist.github.com/dbreunig/b3bdd9eb34bc264574954b2b954ebe83" target="_blank" rel="noreferrer noopener">this relevant subset</a>. It nailed it.</p>
<p>One could employ Provence or a similar function to cull documents or the entire context. Further, this pattern is a strong argument for maintaining a <em>structured</em><sup>5</sup> version of your context in a dictionary or other form, from which you assemble a compiled string prior to every LLM call. This structure would come in handy when pruning, allowing you to ensure the main instructions and goals are preserved while the document or history sections can be pruned or summarized.</p>
<h3 class="wp-block-heading">Context Summarization</h3>
<p><em>Context summarization is the act of boiling down an accrued context into a condensed summary.</em></p>
<p>Context summarization first appeared as a tool for dealing with smaller context windows. As your chat session came close to exceeding the maximum context length, a summary would be generated and a new thread would begin. Chatbot users did this manually in ChatGPT or Claude, asking the bot to generate a short recap that would then be pasted into a new session.</p>
<p>However, as context windows increased, agent builders discovered there are benefits to summarization besides staying within the total context limit. As we’ve seen, beyond 100,000 tokens the context becomes distracting and causes the agent to rely on its accumulated history rather than training. Summarization can help it “start over” and avoid repeating context-based actions.</p>
<p>Summarizing your context is easy to do, but hard to perfect for any given agent. Knowing what information should be preserved and detailing that to an LLM-powered compression step is critical for agent builders. It’s worth breaking out this function as its own LLM-powered stage or app, which allows you to collect evaluation data that can inform and optimize this task directly.</p>
<h3 class="wp-block-heading">Context Offloading</h3>
<p><em>Context offloading is the act of storing information outside the LLM’s context, usually via a tool that stores and manages the data.</em></p>
<p>This might be my favorite tactic, if only because it’s so <em>simple</em> you don’t believe it will work.</p>
<p>Again, <a href="https://www.anthropic.com/engineering/claude-think-tool" target="_blank" rel="noreferrer noopener">Anthropic has a good write-up of the technique</a>, which details their “think” tool, which is basically a scratchpad:</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>With the “think” tool, we’re giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer… This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user.</p>
</blockquote>
<p>I really appreciate the research and other writing Anthropic publishes, but I’m not a fan of this tool’s name. If this tool were called <code>scratchpad</code>, you’d know its function <em>immediately</em>. It’s a place for the model to write down notes that don’t cloud its context and are available for later reference. The name “think” clashes with “<a href="https://www.anthropic.com/news/visible-extended-thinking" target="_blank" rel="noreferrer noopener">extended thinking</a>” and needlessly anthropomorphizes the model… but I digress.</p>
<p>Having a space to log notes and progress <em>works</em>. Anthropic shows pairing the “think” tool with a domain-specific prompt (which you’d do anyway in an agent) yields significant gains: up to a 54% improvement against a benchmark for specialized agents.</p>
<p>Anthropic identified three scenarios where the context offloading pattern is useful:</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<ol class="wp-block-list">
<li>Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;</li>
<li>Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and</li>
<li>Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).</li>
</ol>
</blockquote>
<h2 class="wp-block-heading">Takeaways</h2>
<p>Context management is usually the hardest part of building an agent. Programming the LLM to, as Karpathy says, “<a href="https://x.com/karpathy/status/1937902205765607626" target="_blank" rel="noreferrer noopener">pack the context windows just right</a>,” smartly deploying tools, information, and regular context maintenance, is <em>the</em> job of the agent designer.</p>
<p>The key insight across all the above tactics is that <em>context is not free</em>. Every token in the context influences the model’s behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.</p>
<p>As you build your next agent or optimize an existing one, ask yourself: Is everything in this context earning its keep? If not, you now have six ways to fix it.</p>
<hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
<h2 class="wp-block-heading">Footnotes</h2>
<ol class="wp-block-list">
<li>Gemini 2.5 and GPT-4.1 have 1 million token context windows, large enough to throw <a href="https://en.wikipedia.org/wiki/Infinite_Jest" target="_blank" rel="noreferrer noopener">Infinite Jest</a> in there with plenty of room to spare.</li>
<li>The “<a href="https://ai.google.dev/gemini-api/docs/long-context#long-form-text" target="_blank" rel="noreferrer noopener">Long form text</a>” section in the Gemini docs sum up this optmism nicely.</li>
<li>In fact, in the Databricks study cited above, a frequent way models would fail when given long contexts is they’d return summarizations of the provided context while ignoring any instructions contained within the prompt.</li>
<li>If you’re on the leaderboard, pay attention to the “Live (AST)” columns. <a href="https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html" target="_blank" rel="noreferrer noopener">These metrics use real-world tool definitions contributed to the product by enterprise</a>, “avoiding the drawbacks of dataset contamination and biased benchmarks.”</li>
<li>Hell, this entire list of tactics is a strong argument for why <a href="https://www.dbreunig.com/2025/06/10/let-the-model-write-the-prompt.html" target="_blank" rel="noreferrer noopener">you should program your contexts</a>.</li>
</ol>
]]></content:encoded>
</item>
<item>
<title>MCP Introduces Deep Integration—and Serious Security Concerns</title>
<link>https://www.oreilly.com/radar/mcp-introduces-deep-integration-and-serious-security-concerns/</link>
<pubDate>Wed, 27 Aug 2025 09:52:30 +0000</pubDate>
<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Commentary]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17350</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-color-four.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[MCP—the Model Context Protocol introduced by Anthropic in November 2024—is an open standard for connecting AI assistants to data sources and development environments. It’s built for a future where every AI assistant is wired directly into your environment, where the model knows what files you have open, what text is selected, what you just typed, […]]]></description>
<content:encoded><![CDATA[
<p>MCP—the <em>Model Context Protocol</em> introduced by Anthropic in November 2024—is an open standard for connecting AI assistants to data sources and development environments. It’s built for a future where every AI assistant is wired directly into your environment, where the model knows what files you have open, what text is selected, what you just typed, and what you’ve been working on.</p>
<p>And that’s where the security risks begin.</p>
<p>AI is driven by context, and that’s exactly what MCP provides. It gives AI assistants like GitHub Copilot everything they might need to help you: open files, code snippets, even what’s selected in the editor. When you use MCP-enabled tools that transmit data to remote servers, all of it gets sent over the wire. That might be fine for most developers. But if you work at a financial firm, hospital, or any organization with regulatory constraints where you need to be extremely careful about what leaves your network, MCP makes it really easy to lose control of a lot of things.</p>
<p>Let’s say you’re working in Visual Studio Code on a healthcare app, and you select a few lines of code to debug a query—a routine moment in your day. That snippet might include connection strings, test data with real patient info, and part of your schema. You ask Copilot to help and approve an MCP tool that connects to a remote server—and all of it gets sent to external servers. That’s not just risky. It could be a compliance violation under HIPAA, SOX, or PCI-DSS, depending on what gets transmitted.</p>
<p>These are the kinds of things developers accidentally send every day without realizing it:</p>
<ul class="wp-block-list">
<li>Internal URLs and system identifiers</li>
<li>Passwords or tokens in local config files</li>
<li>Network details or VPN information</li>
<li>Local test data that includes real user info, SSNs, or other sensitive values</li>
</ul>
<p>With MCP, devs on your team could be approving tools that send all of those things to servers outside of your network without realizing it, and there’s often no easy way to know what’s been sent.</p>
<p>But this isn’t just an MCP problem; it’s part of a larger shift where AI tools are becoming more context-aware across the board. Browser extensions that read your tabs, AI coding assistants that scan your entire codebase, productivity tools that analyze your documents—they’re all collecting more information to provide better assistance. <em>With MCP, the stakes are just more visible because the data pipeline is formalized</em><strong>.</strong></p>
<p>Many enterprises are now facing a choice between AI productivity gains and regulatory compliance. Some orgs are building air-gapped development environments for sensitive projects, though achieving true isolation with AI tools can be complex since many still require external connectivity. Others lean on network-level monitoring and data loss prevention solutions that can detect when code or configuration files are being transmitted externally. And a few are going deeper and building custom MCP implementations that sanitize data before transmission, stripping out anything that looks like credentials or sensitive identifiers.</p>
<p>One thing that can help is organizational controls in development tools like VS Code. Most security-conscious organizations can centrally disable MCP support or control which servers are available through group policies and GitHub Copilot enterprise settings. But that’s where it gets tricky, because MCP doesn’t just receive responses. It sends data upstream, potentially to a server outside of your organization, which means every request carries risk.</p>
<p>Security vendors are starting to catch up. Some are building MCP-aware monitoring tools that can flag potentially sensitive data before it leaves the network. Others are developing hybrid deployment models where the AI reasoning happens on-premises but can still access external knowledge when needed.</p>
<p>Our industry is going to have to come up with better enterprise solutions for securing MCP if we want to meet the needs of all organizations. The tension between AI capability and data security will likely drive innovation in privacy-preserving AI techniques, federated learning approaches, and hybrid deployment models that keep sensitive context local while still providing intelligent assistance.</p>
<p>Until then, deeply integrated AI assistants come with a cost: Sensitive context can slip through—and there’s no easy way to know it has happened.</p>
]]></content:encoded>
</item>
<item>
<title>LLM System Design and Model Selection</title>
<link>https://www.oreilly.com/radar/llm-system-design-and-model-selection/</link>
<pubDate>Tue, 26 Aug 2025 10:07:35 +0000</pubDate>
<dc:creator><![CDATA[Louis-François Bouchard and Louie Peters]]></dc:creator>
<category><![CDATA[AI & ML]]></category>
<category><![CDATA[Research]]></category>
<guid isPermaLink="false">https://www.oreilly.com/radar/?p=17336</guid>
<media:content
url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/06/anatomy-1751201_crop-355c0e36608a04c85c14cdb0023bc1e3-1.jpg"
medium="image"
type="image/jpeg"
/>
<description><![CDATA[Choosing the right LLM has become a full-time job. New models appear almost daily, each offering different capabilities, prices, and quirks, from reasoning strengths to cost efficiency to code generation. This competition creates strong incentives for AI labs to carve out a niche and gives new startups room to emerge, resulting in a fragmented landscape […]]]></description>
<content:encoded><![CDATA[
<p>Choosing the right LLM has become a full-time job. New models appear almost daily, each offering different capabilities, prices, and quirks, from reasoning strengths to cost efficiency to code generation. This competition creates strong incentives for AI labs to carve out a niche and gives new startups room to emerge, resulting in a fragmented landscape where one model may excel at reasoning, another at code, and a third at cost efficiency.</p>
<p>AI, in one sense, is getting cheaper faster than any previous technology, at least per <em>unit of intelligence</em>. For example, input tokens for Gemini 2.5 Flash-Lite are approximately 600 times cheaper than what OpenAI’s GPT-3 (davinci-002) cost in August 2022, while outperforming it on every metric. At the same time, access to frontier capabilities is also becoming more expensive than ever. The reason is simple: we can now pay directly for more capability, which has led to the rise of $300+ per month Pro subscription tiers.</p>
<p>Today, any developer can run capable open-weight models locally for negligible marginal cost using tools like Ollama. At the same time, enterprise systems can experience sharp cost increases, depending on the model size (number of parameters, such as 3 billion, 70 billion, or even in the trillions), the number of internal processing steps, and the volume of input data. For developers, these are central system design choices that directly affect feasibility and cost structure. For end users, this complexity explains why a basic subscription differs so much from a premium plan with higher limits on advanced models.</p>
<p>The choices you make in these broader development decisions also determine which LLM and inference settings are optimal for your use case.</p>
<p>At Towards AI, we work across the LLM stack, building applications, designing enterprise systems, and offering online courses (<a href="https://www.oreilly.com/videos/building-and-operating/019283645221/" target="_blank" rel="noreferrer noopener">including one on O’Reilly</a>), custom corporate training, and LLM development consultancy. In our experience, model selection and system design have become central to getting meaningful results from these tools. Much of that, in turn, depends on where today’s models are gaining their capabilities. While scale still plays a role, recent progress has come from a broader mix of factors, including training-data quality, post-training methods, and especially how models are used at inference time.</p>
<h2 class="wp-block-heading"><strong>The Shifting Foundations of Model Capability</strong></h2>
<p>While early gains in LLM performance tracked closely with increases in pretraining compute, larger datasets, bigger models, and more training steps, this approach now yields diminishing returns.</p>
<p>Recent improvements come from a broader mix of strategies. Pretraining-data quality has become just as important as quantity, with better filtering and AI-generated synthetic data contributing to stronger models. Architectural efficiency, like the innovations introduced by DeepSeek, has started to close the gap between size and capability. And post-training techniques, especially instruction tuning and reinforcement learning from human or AI feedback (RLHF/RLAIF), have made models more aligned, controllable, and responsive in practice.</p>
<p>The more fundamental shift, however, is happening at inference time. Since late 2024, with models like OpenAI’s o1, we’ve entered a new phase where models can trade compute for reasoning <em>on demand</em>. Rather than relying solely on what was baked in during training, they can now “think harder” at runtime, running more internal steps, exploring alternative answers, or chaining thoughts before responding. This opens up new capability ceilings, but also introduces new cost dynamics.</p>
<p>These varied improvement strategies have led to a clear divergence among AI labs and models, a rapid expansion in model choice, and in some cases, an explosion in model usage costs.</p>
<h2 class="wp-block-heading"><strong>The Modern Cost Explosion: How Inference Scaling Changed the Game</strong></h2>
<p>Inference-time compute scaling has introduced a new dynamic in LLM system design: We’ve gone from a single lever model size, to at least four distinct ways to trade cost for capability at runtime. The result is a widening gap in inference cost across models and use cases, sometimes by factors of 10,000x or more.</p>
<p><strong>Larger models (size scaling): </strong>The most obvious lever is sheer model size. Frontier LLMs, like GPT-4.5, often built with mixture of experts (MoE) architectures, can have input token costs 750 times higher than streamlined models like Gemini Flash-Lite. Larger parameter counts mean more compute per token, especially when multiple experts are active per query.</p>
<p><strong>Series scaling (“thinking tokens”): </strong>Newer “reasoning” LLMs perform more internal computational steps, or a longer chain of thought, before producing their final answer. For example, OpenAI’s o1 used ~30x more compute than GPT-4o on average, and often 5x more output tokens per task. Agentic systems introduce an additional method of series scaling and an extra layer of cost multiplication. As these agents think, plan, act, reassess, plan, act, and so on, they often make many LLM steps in a loop, each incurring additional cost.</p>
<p><strong>Parallel scaling: </strong>Here, the system runs multiple model instances on the same task and then automatically selects the best output via automated methods, such as majority voting (which assumes the most common answer is likely correct) or self-confidence scores (where the model output claiming the highest confidence in its response is taken as the best). The o3-pro model likely runs 5–10x parallel instances over o3. This multiplies the cost by the number of parallel attempts (with some nuance).</p>
<p><strong>Input context scaling: </strong>In RAG pipelines, the number of retrieved chunks and their size directly influence input token costs and the LLM’s ability to synthesize a good answer. More context can often improve results, but this comes at a higher cost and potential latency. Context isn’t free; it’s another dimension of scaling that developers must budget for.</p>
<p>Taken together, these four factors represent a fundamental shift in how model cost scales. For developers designing systems for high-value problems, <strong>10,000x to 1,000,000x differences in API costs to solve a problem based on architectural choices are now realistic possibilities</strong>. Reasoning LLMs, although only prominent for about nine months, reversed the trend of declining access costs to the very best models. This transforms the decision from “Which LLM should I use?” to include “How much reasoning do I <em>want to pay for</em>?”</p>
<p>This shift changes how we think about selection. Choosing an LLM is no longer about chasing the highest benchmark score; it’s about finding the balance point where capability, latency, and cost align with your use case.</p>
<h2 class="wp-block-heading"><strong>Core Model Selection Criteria</strong></h2>
<p>When choosing a model we find it is important to first clearly identify your use case and the minimum core AI capabilities and attributes needed to deliver it.</p>
<p>A common first step is to take a look at standard benchmark scores (for example LiveBench, MMLU-Pro, SWE-Bench). These benchmarks are a useful starting point, but some models are tuned on benchmark data, and real-world performance on tasks that are actually relevant to you will often vary. Filtering benchmark tests and scores by your industry and task category is a valuable step here. An LLM optimized for software development might perform poorly in creative writing or vice versa. The match between a model’s training focus and your application domain can outweigh general-purpose benchmarks.</p>
<p>Leaderboards like <a href="https://lmarena.ai/leaderboard" target="_blank" rel="noreferrer noopener">LMArena</a> and <a href="https://artificialanalysis.ai/" target="_blank" rel="noreferrer noopener">Artificial Analysis</a> offer broader human‑preference comparisons but still don’t replace custom real-world testing. It helps to have a set of your own example questions or tasks at hand to test out a new model for yourself and see how it performs. This should include a mix of easy tasks to establish a baseline and tough edge cases where it’s easy for a model to make mistakes.</p>
<p>As you move beyond ad hoc testing, for any serious development effort, <strong>custom evaluations are non-negotiable.</strong> They must be tailored to your use case and the types of problems you solve. This is the only way to truly know if a model, or a change to your system, is genuinely improving things for <em>your</em> users and <em>your</em> specific business goals.</p>
<p>Here are some core factors we consider:</p>
<p><strong>Multimodality</strong> is emerging as a major differentiator. Models like GPT-4o and Gemini can handle not just text but also images, audio, and in some cases video, unlocking applications that pure text models can’t support.</p>
<p><strong>Context window</strong> and effective <strong>context window utilization</strong> are also key: How many tokens or documents can the model process and how much of that advertised context window can the LLM <em>actually use</em> effectively without performance degradation relative to tasks that use less context?</p>
<p><strong>Latency</strong> is especially critical for interactive applications. In general, smaller or cheaper models tend to respond faster, while reasoning-heavy models introduce delays due to deeper internal computation.</p>
<p><strong>Reasoning </strong>is the ability to scale inference-time compute and perform multistep problem-solving, planning, or deep analysis.</p>
<p><strong>Privacy and security </strong>are often key considerations here. For example, if you want to keep your intellectual property private, you must use a model that won’t train on your inputs, which often points toward self-hosted or specific enterprise-grade API solutions.</p>
<p><strong>Trustworthiness</strong> is also becoming important and can come down to the reputation and track record of the AI lab. A model that produces erratic, biased, or reputationally damaging outputs is a liability, regardless of its benchmark scores. For instance, Grok has had well-publicized issues with its alignment. Even if such issues are supposedly fixed, it creates a lingering question of trust: How can one be sure it won’t behave similarly in the future?</p>
<p>Additionally, the <strong>knowledge cutoff date</strong> also matters if it is to be used in a fast-moving field.</p>
<p>After working out if a model meets your minimum capability, the next decision is often on optimizing trade-offs among cost, reliability, security, and latency. A key rule of thumb we find useful here: If the reliability gain from a more expensive model or more inference time saves more of your or your users’ time (valued in terms of pay) than the model costs, going with the larger model is a good decision!</p>
<h2 class="wp-block-heading"><strong>The Pros and Cons of Open-Weight and Closed-API LLMs</strong></h2>
<p>The rise of increasingly competitive open-weight LLMs, such as Meta’s Llama series, Mistral, DeepSeek, Gemma, Qwen, and now OpenAI’s GPT-OSS has added a critical dimension to the model selection landscape. Momentum behind this open ecosystem surged with the release of DeepSeek’s R1 reasoning model, competitive with OpenAI’s o1 but priced at roughly 30x lower API <strong>costs</strong>. This sparked debate around efficiency versus scale and intensified the broader AI rivalry between China and the US. Reactions ranged from “OpenAI and Nvidia are obsolete” to “DeepSeek’s costs must be fabricated,” but regardless of hype, the release was a milestone. It showed that architectural innovation, not just scale, could deliver frontier-level performance with far greater cost efficiency.</p>
<p>This open-model offensive has continued with strong contributions from other Chinese labs like Alibaba (Qwen), Kimi, and Tencent (Hunyuan), and has put competitive pressure on Meta after its open-weight Llama models fell behind. China’s recent leadership in open-weight LLMs has raised new security/IP issues with some US- and European-based organizations, though we note accessing these model weights and running the model on your own infrastructure doesn’t require sending data to China.</p>
<p>This brings us back to the pros and cons of open weights. While closed-API LLMs still lead at the frontier of capability, the primary advantage of open-weight models is quick and affordable local testing, unparalleled flexibility, and increased data security when run internally. Organizations can also perform <strong>full fine-tuning</strong>, adapting the model’s core weights and behaviors to their specific domain, language, and tasks. Open models also provide <strong>stability and predictability</strong>—you control the version you deploy, insulating your production systems from unexpected changes or degradations that can sometimes occur with unannounced updates to proprietary API-based models.</p>
<p>Public closed-model APIs from major providers benefit from immense economies of scale and highly optimized GPU utilization by batching requests from thousands of users, an efficiency that is difficult for a single organization to replicate. This often means that using a closed-source API can be cheaper per inference than self-hosting an open model. Security and compliance are also more nuanced than they first appear. While some organizations must use self-hosted models to simplify compliance with regulations like GDPR by keeping data entirely within their own perimeter, this places the entire burden of securing the infrastructure on the internal team—a complex and expensive undertaking. Top API providers also often offer dedicated instances, private cloud endpoints, and contractual agreements that can guarantee data residency, zero-logging, and meet stringent regulatory standards. The choice, therefore, is not a simple open-versus-closed binary.</p>
<p>The boundary between open and closed models is also becoming increasingly blurred. Open-weight models are increasingly offered via API by third-party LLM inference platforms, combining the flexibility of open models with the simplicity of hosted access. This hybrid approach often strikes a practical balance between control and operational complexity.</p>
<h2 class="wp-block-heading"><strong>Leading Closed LLMs</strong></h2>
<p>Below, we present some key costs and metrics for leading closed-source models available via API. Many of these models have additional complexity and varied pricing including options for fast modes, thinking modes, context caching, and longer context.</p>
<p>We present the latest LiveBench benchmark score for each model as one measure for comparison. LiveBench is a continuously updated benchmark designed to provide a “contamination-free” evaluation of large language models by regularly releasing new questions with objective, verifiable answers. It scores models out of 100 on a diverse set of challenging tasks, with a significant focus on capabilities like reasoning, coding, and data analysis. The similar LiveBench scores between GPT-4.5 and 2.5 Flash-Lite, despite 750x input token cost variation, highlights both that smaller models are now very capable but also that not all capabilities are captured in a single benchmark!</p>
<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="1735" height="1650" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing1.png" alt="AI model pricing and specifications comparison" class="wp-image-17340" style="width:619px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing1.png 1735w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing1-300x285.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing1-1600x1522.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing1-768x730.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing1-1536x1461.png 1536w" sizes="auto, (max-width: 1735px) 100vw, 1735px" /><figcaption class="wp-element-caption"><em>Source: Towards AI, Company Reports, <a href="https://livebench.ai/" target="_blank" rel="noreferrer noopener">LiveBench AI</a> </em><br></figcaption></figure>
<h2 class="wp-block-heading"><strong>Leading open-weight LLMs</strong></h2>
<p>Below, we also present key costs, the LiveBench benchmark score, and context length for leading open-weight models available via API. We compare hosted versions of these models for easy comparison. Different API providers may choose to host open-weight models with different levels of quantization, different context lengths, and different pricing, so performance can vary between providers.</p>
<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="520" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing-and-Specifications.png" alt="AI model pricing and specifications 2" class="wp-image-17338" style="width:546px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing-and-Specifications.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/AI-Model-Pricing-and-Specifications-270x300.png 270w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Source: Towards AI, Company Reports, <a href="https://livebench.ai/" target="_blank" rel="noreferrer noopener">LiveBench AI</a></em></figcaption></figure>
<p>Whether hosted or self-deployed, selecting a model only solves part of the problem. In practice, most of the complexity and opportunity lies in how that model is used: how it’s prompted, extended, fine-tuned, or embedded within a broader workflow. These system-level decisions often have a greater impact on performance and cost than the model choice itself.</p>
<h2 class="wp-block-heading"><strong>A Practical Guide to Designing an LLM System</strong></h2>
<p>Simply picking the biggest or newest LLM is rarely the optimal strategy. A more effective approach starts with a deep understanding of the developer’s toolkit: knowing which technique to apply to which problem to achieve the desired capability and reliability without unnecessary cost. This is all part of the constant “<strong>march of nines” as you develop LLM systems modularly to solve for more reliability and capability.</strong> There is a need to prioritize the easiest wins that deliver tangible value before investing in further incremental and often costly accuracy improvements. The reality will always vary on a case-by-case basis, but here is a quick guide to navigating this process.</p>
<h3 class="wp-block-heading"><strong>Step 1: Open Versus Closed?</strong></h3>
<p>This is often your first decision.</p>
<ul class="wp-block-list">
<li><strong>Go with a closed-API model (e.g., from OpenAI, Google, Anthropic) if:</strong> Your priority is accessing the absolute state-of-the-art models with maximum simplicity.</li>
<li><strong>Go with an open-weight model (e.g., Llama, Mistral, Qwen, DeepSeek) if:</strong>
<ul class="wp-block-list">
<li><strong>Data security and compliance are paramount:</strong> If you need to guarantee that sensitive data never leaves your own infrastructure.</li>
<li><strong>You need deep customization and control:</strong> If your goal is to fine-tune a model on proprietary data and to create a specialized expert that you control completely.</li>
</ul>
</li>
</ul>
<p>If you went open, what can you <em>realistically</em> run? Your own GPU infrastructure is a hard constraint. Assess your cluster size and memory to determine if you can efficiently run a large, leading 1 trillion+ parameter MoE model, such as Kimi K2, or if you are better served by a medium-size model such as Gemma 3 27B or a much smaller model like Gemma 3n that can even run on mobile.</p>
<h3 class="wp-block-heading"><strong>Step 2: Gauging the Need for Reasoning</strong></h3>
<p>Does your task require the model to simply blast out a response, or does it need to <em>think</em> first?</p>
<ul class="wp-block-list">
<li><strong>Reasoning:</strong> For tasks that involve complex, multistep problem-solving, brainstorming, strategic planning, intricate code generation, or deep analysis, you need a dedicated reasoning model such as o3, Gemini 2.5 Pro, DeepSeek R1, or Claude 4. In some cases these models can be used in high-reasoning mode, which encourages the model to think for longer before responding.</li>
<li><strong>No reasoning:</strong> For straightforward tasks like simple Q&A, summarization of a single document, data extraction, or classification, a powerful reasoning model is overkill.</li>
<li><strong>The middle ground:</strong> For tasks requiring moderate reasoning, such as generating a structured report from a few data points or performing basic data analysis at scale, a “mini” reasoning model, like OpenAI’s o4-mini or Gemini Flash 2.5, offers a balance of capability and cost.</li>
</ul>
<h3 class="wp-block-heading"><strong>Step 3: Pinpointing Key Model Attributes</strong></h3>
<p>Beyond general intelligence and reasoning, modern LLMs are specialists. Your choice should be guided by the specific attributes and “superpowers” your application needs.</p>
<ul class="wp-block-list">
<li><strong>Prioritize accuracy over cost</strong> for high-value tasks where mistakes are costly or where a human expert’s time is being saved. o3-pro is a standout model here and it can even be used as a fact checker to meticulously check the details of an earlier LLM output.</li>
<li><strong>Prioritize speed and cost over accuracy:</strong> For user-facing, real-time applications like chatbots or high-volume, low-value tasks like simple data categorization, latency and cost are paramount. Choose a hyper-efficient “flash” or “mini” model such as Gemini 2.5 Flash-Lite. Qwen3-235B models can also be a great option here but are too complex to inference yourself.</li>
<li><strong>Do you need a deep, long-context researcher?</strong> For tasks that require synthesizing information from massive documents, entire codebases, or extensive legal contracts, a model with a vast and highly effective context window is crucial. <strong>Gemini 2.5 Pro</strong> excels here.</li>
<li><strong>Is multimodality essential?</strong> If your application needs to understand or generate images, process audio in real time, or analyze video, your choice narrows to models like <strong>GPT-4o</strong> or the <strong>Gemini</strong> family. For one-shot YouTube video processing, Gemini is the standout.</li>
<li><strong>Is it a code-specific task?</strong> While many models can code, some are explicitly tuned for it. In the open world, Codestral and Gemma do a decent job. But Claude has won hearts and minds, at least for now.</li>
<li><strong>Do you need live, agentic web search?</strong> For answering questions about current events or topics beyond the model’s knowledge cutoff, consider a model with a built-in, reliable web search, such as <strong>o3.</strong></li>
<li>Do you need complex <strong>dialogue and emotional nuance?</strong> GPT-4.5, Kimi K2, Claude Opus 4.0, or Grok 4 do a great job.</li>
</ul>
<h3 class="wp-block-heading"><strong>Step 4: Prompting, Then RAG, Then Evaluation</strong></h3>
<p>Before you dive into more complex and costly development, always see how far you can get with the simplest techniques. This is a path of escalating complexity. Model choice for RAG pipelines is often centered around latency for end users, but recently more complex agentic RAG workflows or long-context RAG tasks require reasoning models or longer context capabilities.</p>
<ol class="wp-block-list">
<li><strong>Prompt engineering first:</strong> Your first step is always to maximize the model’s inherent capabilities through clear, well-structured prompting. Often, a better prompt with a more capable model is all you need.</li>
<li><strong>Move to retrieval-augmented generation (RAG):</strong> If your model’s limitation is a lack of specific, private, or up-to-date <em>knowledge</em>, RAG is the next logical step. This is the best approach for reducing hallucinations, providing answers based on proprietary documents, and ensuring responses are current. However, RAG is not a panacea. Its effectiveness is entirely dependent on the quality and freshness of your dataset, and building a retrieval system that consistently finds and uses the <em>most</em> relevant information is a significant engineering challenge. RAG also comes with many associated decisions, such as the quantity of data to retrieve and feed into the model’s context window, and just how much use you make of long-context capabilities and context caching.</li>
<li><strong>Iterate with advanced RAG:</strong> To push performance, you will need to implement more advanced techniques like hybrid search (combining keyword and vector search), re-ranking retrieved results for relevance, and query transformation.</li>
<li><strong>Build custom evaluation</strong>: Ensure iterations on your system design, additions of new advanced RAG techniques, or updates to the latest model are always moving progress forward on your key metrics!</li>
</ol>
<h3 class="wp-block-heading"><strong>Step 5: Fine-Tune or Distill for Deep Specialization</strong></h3>
<p>If the model’s core <em>behavior</em>—not its knowledge—is still the problem, then it’s time to consider fine-tuning. Fine-tuning is a significant undertaking that requires a high-quality dataset, engineering effort, and computational resources. However, it can enable a smaller, cheaper open-weight model to outperform a massive generalist model on a specific, narrow task, making it a powerful tool for optimization and specialization.</p>
<ul class="wp-block-list">
<li><strong>Fine-tuning is for changing behavior, not adding knowledge.</strong> Use it to teach a model a specific skill, style, or format. For example:
<ul class="wp-block-list">
<li>To reliably output data in a complex, structured format like specific JSON or XML schemas.</li>
<li>To master the unique vocabulary and nuances of a highly specialized domain (e.g., legal, medical).</li>
<li>Some closed-source models are available for fine-tuning via API such as Gemini 2.5 Flash and various OpenAI models. Larger models are normally not available.</li>
<li><strong>In open-weight models, </strong>Llama 3.3 70B and Qwen 70B are fine-tuning staples. The process is more complex to fine-tune an open-weight model yourself.</li>
</ul>
</li>
<li>Model <strong>distillation</strong> can also serve as a production-focused optimization step. In its simplest form, this consists of generating synthetic data from larger models to create fine-tuning datasets to improve the capabilities of smaller models.</li>
<li><strong>Reinforcement fine-tuning (RFT) for problem-solving accuracy</strong><br>Instead of just imitating correct answers, the model learns by trial, error, and correction. It is rewarded for getting answers right and penalized for getting them wrong.
<ul class="wp-block-list">
<li><strong>Use RFT to:</strong> Create a true “expert model” that excels at complex tasks with objectively correct outcomes.</li>
<li><strong>The advantage:</strong> RFT is incredibly data-efficient, often requiring only a few dozen high-quality examples to achieve significant performance gains.</li>
<li><strong>The catch:</strong> RFT requires a reliable, automated “grader” to provide the reward signal. Designing this grader is a critical engineering challenge.</li>
</ul>
</li>
</ul>
<h3 class="wp-block-heading"><strong>Step 6: Orchestrated Workflows Versus Autonomous Agents</strong></h3>
<p>The critical decision here is how much freedom to grant. Autonomous agents are also more likely to need more expensive reasoning models with greater levels of inference scaling. Parallel inference scaling methods with multiple agents are also beginning to deliver great results. Small errors can accumulate and multiply during many successive agentic steps so the investment in a stronger more capable model can make all the difference in building a usable product.</p>
<ul class="wp-block-list">
<li><strong>Choose an orchestrated workflow for predictable tasks</strong> <br>You design a specific, often linear, sequence of steps, and the LLM acts as a powerful component at one or more of those steps.
<ul class="wp-block-list">
<li><strong>Use when:</strong> You are automating a known, repeatable business process (e.g., processing a customer support ticket, generating a monthly financial summary). The goal is reliability, predictability, and control.</li>
<li><strong>Benefit:</strong> You maintain complete control over the process, ensuring consistency and managing costs effectively because the number and type of LLM calls are predefined.</li>
</ul>
</li>
<li><strong>Build hybrid pipelines:</strong> Often, the best results will come from combining many LLMs, open and closed, within a pipeline.
<ul class="wp-block-list">
<li>This means using different LLMs for different stages of a workflow: a fast, cheap LLM for initial query routing; a specialized LLM for a specific subtask; a powerful reasoning LLM for complex planning; and perhaps another LLM for verification or refinement.</li>
<li>At Towards AI, we often have 2-3 different LLMs from different companies in an LLM pipeline.</li>
</ul>
</li>
<li><strong>Choose an autonomous agent for open-ended problems.</strong> You give the LLM a high-level goal, a set of tools (e.g., APIs, databases, code interpreters), and the autonomy to figure out the steps to achieve that goal.
<ul class="wp-block-list">
<li><strong>Use when:</strong> The path to the solution is unknown and requires dynamic problem-solving, exploration, or research (e.g., debugging a complex software issue, performing deep market analysis, planning a multistage project).</li>
<li><strong>The critical risk—runaway costs:</strong> An agent that gets stuck in a loop, makes poor decisions, or explores inefficient paths can rapidly accumulate enormous API costs. <strong>Implementing strict guardrails is critical:</strong>
<ul class="wp-block-list">
<li><strong>Budget limits:</strong> Set hard caps on the cost per task.</li>
<li><strong>Step counters:</strong> Limit the total number of “thoughts” or “actions” an agent can take.</li>
<li><strong>Human-in-the-loop:</strong> Require human approval for potentially expensive or irreversible actions.</li>
</ul>
</li>
<li>Gemini 2.5 Pro and o3 are our favourite closed-API models for agent pipelines, while in open-weight models we like Kimi K2.</li>
</ul>
</li>
</ul>
<p>Working through these steps helps translate a vague problem into a concrete implementation plan, one that’s grounded in clear trade-offs and tailored to your needs. This structured approach often yields systems that are not only more capable and reliable but also far more effective for specific tasks than a general-purpose chatbot ever could be.</p>
<h2 class="wp-block-heading"><strong>Conclusion</strong></h2>
<p>The open-versus-closed race gives us rapid access to strong LLMs but also creates complexity. Selecting and deploying them demands both engineering discipline and economic clarity.</p>
<p>Developing in the LLM ecosystem demands a new level of engineering discipline and keen economic awareness. No single LLM is a cure-all. A practical, evolving toolkit is essential, but knowing which tool to pull out for which job is the real art. The challenge isn’t just picking a model from a list; it’s about architecting a solution. This requires a systematic approach, moving from high-level strategic decisions about data and security down to the granular, technical choices of development and implementation.</p>
<p>The success of specialized “LLM wrapper” applications like Anyscale/Cursor for coding or Perplexity for search, some of which are now valued at over $10 billion, underscores the immense value in this tailored approach. These applications aren’t just thin wrappers; they are sophisticated systems that leverage foundation LLMs but add significant value through custom workflows, fine-tuning, data integration, and user experience design.</p>
<p>Ultimately, success hinges on informed pragmatism. Developers and organizations need a sharp understanding of their problem space and a firm grasp of how cost scales across model choice, series and parallel reasoning, context usage, and agentic behavior. Above all, custom evaluation is non-negotiable because your use case, not a benchmark, is the only standard that truly matters.</p>
]]></content:encoded>
</item>
</channel>
</rss>
<!--
Performance optimized by W3 Total Cache. Learn more: https://www.boldgrid.com/w3-total-cache/
Object Caching 246/257 objects using Memcached
Page Caching using Disk: Enhanced (Page is feed)
Minified using Memcached
Served from: www.oreilly.com @ 2025-09-17 18:32:27 by W3 Total Cache
-->
If you would like to create a banner that links to this page (i.e. this validation result), do the following:
Download the "valid RSS" banner.
Upload the image to your own server. (This step is important. Please do not link directly to the image on this server.)
Add this HTML to your page (change the image src
attribute if necessary):
If you would like to create a text link instead, here is the URL you can use:
http://www.feedvalidator.org/check.cgi?url=http%3A//feeds.feedburner.com/oreilly/radar/atom