Congratulations!

[Valid RSS] This is a valid RSS feed.

Recommendations

This feed is valid, but interoperability with the widest range of feed readers could be improved by implementing the following recommendations.

Source: https://www.oreilly.com/radar/feed/index.xml

  1. <?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
  2. xmlns:content="http://purl.org/rss/1.0/modules/content/"
  3. xmlns:media="http://search.yahoo.com/mrss/"
  4. xmlns:wfw="http://wellformedweb.org/CommentAPI/"
  5. xmlns:dc="http://purl.org/dc/elements/1.1/"
  6. xmlns:atom="http://www.w3.org/2005/Atom"
  7. xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
  8. xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
  9. xmlns:custom="https://www.oreilly.com/rss/custom"
  10.  
  11. >
  12.  
  13. <channel>
  14. <title>Radar</title>
  15. <atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
  16. <link>https://www.oreilly.com/radar</link>
  17. <description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
  18. <lastBuildDate>Thu, 18 Sep 2025 10:12:31 +0000</lastBuildDate>
  19. <language>en-US</language>
  20. <sy:updatePeriod>
  21. hourly </sy:updatePeriod>
  22. <sy:updateFrequency>
  23. 1 </sy:updateFrequency>
  24. <generator>https://wordpress.org/?v=6.8.2</generator>
  25.  
  26. <image>
  27. <url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
  28. <title>Radar</title>
  29. <link>https://www.oreilly.com/radar</link>
  30. <width>32</width>
  31. <height>32</height>
  32. </image>
  33. <item>
  34. <title>Generative AI in the Real World: Faye Zhang on Using AI to Improve Discovery</title>
  35. <link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-faye-zhang-on-using-ai-to-improve-discovery/</link>
  36. <comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-faye-zhang-on-using-ai-to-improve-discovery/#respond</comments>
  37. <pubDate>Thu, 18 Sep 2025 10:12:22 +0000</pubDate>
  38. <dc:creator><![CDATA[Ben Lorica and Faye Zhang]]></dc:creator>
  39. <category><![CDATA[AI & ML]]></category>
  40. <category><![CDATA[Generative AI in the Real World]]></category>
  41. <category><![CDATA[Podcast]]></category>
  42.  
  43. <guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=17467</guid>
  44.  
  45.     <media:content
  46. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png"
  47. medium="image"
  48. type="image/png"
  49. />
  50. <description><![CDATA[In this episode, Ben Lorica and AI Engineer Faye Zhang talk about discoverability: how to use AI to build search and recommendation engines that actually find what you want. Listen in to learn how AI goes way beyond simple collaborative filtering—pulling in many different kinds of data and metadata, including images and voice, to get [&#8230;]]]></description>
  51. <content:encoded><![CDATA[
  52. <p>In this episode, Ben Lorica and AI Engineer Faye Zhang talk about discoverability: how to use AI to build search and recommendation engines that actually find what you want. Listen in to learn how AI goes way beyond simple collaborative filtering—pulling in many different kinds of data and metadata, including images and voice, to get a much better picture of what any object is and whether or not it’s something the user would want.</p>
  53.  
  54.  
  55.  
  56. <p><strong>About the</strong> <strong><em>Generative AI in the Real World</em></strong> <strong>podcast:</strong> In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>
  57.  
  58.  
  59.  
  60. <p>Check out <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*16z5k2y*_ga*MTE1NDE4NjYxMi4xNzI5NTkwODkx*_ga_092EL089CH*MTcyOTYxNDAyNC4zLjEuMTcyOTYxNDAyNi41OC4wLjA." target="_blank" rel="noreferrer noopener">other episodes</a> of this podcast on the O’Reilly learning platform.</p>
  61.  
  62.  
  63.  
  64. <h2 class="wp-block-heading">Transcript</h2>
  65.  
  66.  
  67.  
  68. <p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>
  69.  
  70.  
  71.  
  72. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=0" target="_blank" rel="noreferrer noopener">0:00</a>: <strong>Today we have Faye Zhang of Pinterest, where she’s a staff AI engineer. And so with that, very welcome to the podcast.</strong></p>
  73.  
  74.  
  75.  
  76. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=14" target="_blank" rel="noreferrer noopener">0:14</a>: Thanks, Ben. Huge fan of the work. I&#8217;ve been fortunate to attend both the Ray and NLP Summits. I know where you serve as chairs. I also love the O&#8217;Reilly AI podcast. The recent episode on A2A and the one with Raiza Martin on NotebookLM have been really inspirational. So, great to be here. </p>
  77.  
  78.  
  79.  
  80. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=33" target="_blank" rel="noreferrer noopener">0:33</a>: <strong>All right, so let&#8217;s jump right in. So one of the first things I really wanted to talk to you about is this work around </strong><a href="https://arxiv.org/abs/2503.00619" target="_blank" rel="noreferrer noopener"><strong>PinLanding</strong></a><strong>. And you&#8217;ve published papers, but I guess at a high level, Faye, maybe describe for our listeners: What problem is PinLanding trying to address?</strong></p>
  81.  
  82.  
  83.  
  84. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=53" target="_blank" rel="noreferrer noopener">0:53</a>: Yeah, that&#8217;s a great question. I think, in short, trying to solve this trillion-dollar discovery crisis. We&#8217;re living through the greatest paradox of the digital economy. Essentially, there&#8217;s infinite inventory but very little discoverability. Picture one example: A bride-to-be asks ChatGPT, “Now, find me a wedding dress for an Italian summer vineyard ceremony,” and she gets great general advice. But meanwhile, somewhere in Nordstrom&#8217;s hundreds of catalogs, there sits the perfect terracotta Soul Committee dress, never to be found. And that&#8217;s a $1,000 sale that will never happen. And if you multiply this by a billion searches across Google, SearchGPT, and Perplexity, we&#8217;re talking about a $6.5 trillion market, according to Shopify&#8217;s projections, where every failed product discovery is money left on the table. So that&#8217;s what we&#8217;re trying to solve—essentially solve the semantic organization of all platforms versus user context or search. </p>
  85.  
  86.  
  87.  
  88. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=125" target="_blank" rel="noreferrer noopener">2:05</a>: <strong>So, before PinLanding was developed, and if you look across the industry and other companies, what would be the default—what would be the incumbent system? And what would be insufficient about this incumbent system?</strong></p>
  89.  
  90.  
  91.  
  92. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=142" target="_blank" rel="noreferrer noopener">2:22</a>: There have been researchers across the past decade working on this problem; we&#8217;re definitely not the first one. I think number one is to understand the catalog attribution. So, back in the day, there was multitask R-CNN generation, as we remember, [that could] identify fashion shopping attributes. So you would pass in-system an image. It would identify okay: This shirt is red and that material may be silk. And then, in recent years, because of the leverage of large scale VLM (vision language models), this problem has been much easier. </p>
  93.  
  94.  
  95.  
  96. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=183" target="_blank" rel="noreferrer noopener">3:03</a>: And then I think the second route that people come in is via the content organization itself. Back in the day, [there was] research on join graph modeling on shared similarity of attributes. And a lot of ecommerce stores also do, “Hey, if people like this, you might also like that,” and that relationship graph gets captured in their organization tree as well. We utilize a vision large language model and then the foundation model CLIP by OpenAI to easily recognize what this content or piece of clothing could be for. And then we connect that between LLMs to discover all possibilities—like scenarios, use case, price point—to connect two worlds together. </p>
  97.  
  98.  
  99.  
  100. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=235" target="_blank" rel="noreferrer noopener">3:55</a>: <strong>To me that implies you have some rigorous eval process or even a separate team doing eval. Can you describe to us at a high level what is eval like for a system like this? </strong></p>
  101.  
  102.  
  103.  
  104. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=251" target="_blank" rel="noreferrer noopener">4:11</a>: Definitely. I think there are internal and external benchmarks. For the external ones, it&#8217;s the Fashion200K, which is a public benchmark anyone can download from Hugging Face, on a standard of how accurate your model is on predicting fashion items. So we measure the performance using the recall top-k metrics, which says whether the label appears among the top-end prediction attribute accurately, and as a result, we were able to see 99.7% recall for the top ten.</p>
  105.  
  106.  
  107.  
  108. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=287" target="_blank" rel="noreferrer noopener">4:47</a>: <strong>The other topic I wanted to talk to you about is recommendation systems. So obviously there&#8217;s now talk about, “Hey, maybe we can go beyond correlation and go towards reasoning.” Can you [tell] our audience, who may not be steeped in state-of-the-art recommendation systems, how you would describe the state of recommenders these days?</strong></p>
  109.  
  110.  
  111.  
  112. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=323" target="_blank" rel="noreferrer noopener">5:23</a>: For the past decade, [we’ve been] seeing tremendous movement from foundational shifts on how RecSys essentially operates. Just to call out a few big themes I’m seeing across the board: Number one, it’s kind of moving from correlation to causation. Back then it was, hey, a user who likes X might also like Y. But now we actually understand why contents are connected semantically. And our LLM AI models are able to reason about the user preferences and what they actually are. </p>
  113.  
  114.  
  115.  
  116. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=358" target="_blank" rel="noreferrer noopener">5:58</a>: The second big theme is probably the cold start problem, where companies leverage semantic IDs to solve the new item by encoding content, understanding the content directly. For example, if this is a dress, then you understand its color, style, theme, etc. </p>
  117.  
  118.  
  119.  
  120. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=377" target="_blank" rel="noreferrer noopener">6:17</a>: And I think of other bigger themes we&#8217;re seeing; for example, Netflix is merging from [an] isolated system into a unified intelligence. Just this past year, Netflix [updated] their multitask architecture where [they] shared representations, into one they called the UniCoRn system to enable company-wide improvement [and] optimizations. </p>
  121.  
  122.  
  123.  
  124. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=404" target="_blank" rel="noreferrer noopener">6:44</a>: And very lastly, I think on the frontier side—this is actually what I learned at the AI Engineer Summit from YouTube. It&#8217;s a DeepMind collaboration, where YouTube is now using a large recommendation model, essentially teaching Gemini to speak the language of YouTube: of, hey, a user watched this video, then what might [they] watch next? So a lot of very exciting capabilities happening across the board for sure. </p>
  125.  
  126.  
  127.  
  128. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=435" target="_blank" rel="noreferrer noopener">7:15</a>: <strong>Generally it sounds like the themes from years past still map over in the following sense, right? So there&#8217;s content—the difference being now you have these foundation models that can understand the content that you have more granularly. It can go deep into the videos and understand, hey, this video is similar to this video. And then the other source of signal is behavior. So those are still the two main buckets?</strong></p>
  129.  
  130.  
  131.  
  132. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=473" target="_blank" rel="noreferrer noopener">7:53</a>: Correct. Yes, I would say so. </p>
  133.  
  134.  
  135.  
  136. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=475" target="_blank" rel="noreferrer noopener">7:55</a>: <strong>And so the foundation models help you on the content side but not necessarily on the behavior side?</strong><br><br><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=483" target="_blank" rel="noreferrer noopener">8:03</a>: I think it depends on how you want to see it. For example, on the embedding side, which is a kind of representation of a user entity, there have been transformations [since] back in the day with the BERT Transformer. Now it&#8217;s got long context encapsulation. And those are all with the help of LLMS. And so we can better understand users, not to next or the last clicks, but to “hey, [in the] next 30 days, what might a user like?” </p>
  137.  
  138.  
  139.  
  140. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=511" target="_blank" rel="noreferrer noopener">8:31</a>: <strong>I&#8217;m not sure this is happening, so correct me if I&#8217;m wrong. The other thing that I would imagine that the foundation models can help with is, I think for some of these systems—like YouTube, for example, or maybe Netflix is a better example—thumbnails are important, right? The fact now that you have these models that can generate multiple variants of a thumbnail on the fly means you can run more experiments to figure out user preferences and user tastes, correct? </strong></p>
  141.  
  142.  
  143.  
  144. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=545" target="_blank" rel="noreferrer noopener">9:05</a>: Yes. I would say so. I was lucky enough to be invited to one of the engineer network dinners, [and was] speaking with the engineer who actually works on the thumbnails. Apparently it was all personalized, and the approach you mentioned enabled their rapid iteration of experiments, and had definitely yielded very positive results for them. </p>
  145.  
  146.  
  147.  
  148. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=569" target="_blank" rel="noreferrer noopener">9:29</a>: <strong>For the listeners who don&#8217;t work on recommendation systems, what are some general lessons from recommendation systems that generally map to other forms of ML and AI applications? </strong></p>
  149.  
  150.  
  151.  
  152. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=584" target="_blank" rel="noreferrer noopener">9:44</a>: Yeah, that&#8217;s a great question. A lot of the concepts still apply. For example, the knowledge distillation. I know Indeed was trying to tackle this. </p>
  153.  
  154.  
  155.  
  156. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=596" target="_blank" rel="noreferrer noopener">9:56</a>: <strong>Maybe Faye, first define what you mean by that, in case listeners don&#8217;t know what that is.</strong> </p>
  157.  
  158.  
  159.  
  160. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=602" target="_blank" rel="noreferrer noopener">10:02</a>: Yes. So knowledge distillation is essentially, from a model sense, learning from a parent model with larger, bigger parameters that has better world knowledge (and the same with ML systems)—to distill into smaller models that can operate much faster but still hopefully encapsulate the learning from the parent model. </p>
  161.  
  162.  
  163.  
  164. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=624" target="_blank" rel="noreferrer noopener">10:24</a>: So I think what Indeed back then faced was the classic precision versus recall in production ML. Their binary classifier needs to really filter out the batch job that you would recommend to the candidates. But this process is obviously very noisy, and sparse training data can cause latency and also constraints. So I think back in the work they published, they couldn&#8217;t really get effective separate résumé content from Mistral and maybe Llama 2. And then they were happy to learn [that] out-of-the-box GPT-4 achieved something like 90% precision and recall. But obviously GPT-4 is more expensive and has close to 30 seconds of inference time, which is much slower.</p>
  165.  
  166.  
  167.  
  168. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=681" target="_blank" rel="noreferrer noopener">11:21</a>: So I think what they do is use the distillation concept to fine-tune GPT 3.5 on labeled data, and then distill it into a lightweight BERT-based model using the temperature scale softmax, and they&#8217;re able to achieve millisecond latency and a comparable recall-precision trade-off. So I think that&#8217;s one of the learnings we see across the industry that the traditional ML techniques still work in the age of AI. And I think we&#8217;re going to see a lot more in the production work as well. </p>
  169.  
  170.  
  171.  
  172. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=717" target="_blank" rel="noreferrer noopener">11:57</a>: <strong>By the way, one of the underappreciated things in the recommendation system space is actually UX in some ways, right? Because basically good UX for delivering the recommendations actually can move the needle. How you actually present your recommendations might make a material difference. </strong> </p>
  173.  
  174.  
  175.  
  176. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=744" target="_blank" rel="noreferrer noopener">12:24</a>: I think that&#8217;s very much true. Although I can&#8217;t claim to be an expert on it because I know most recommendation systems deal with monetization, so it&#8217;s tricky to put, “Hey, what my user clicks on, like engage, send via social, versus what percentage of that&#8230;</p>
  177.  
  178.  
  179.  
  180. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=762" target="_blank" rel="noreferrer noopener">12:42</a>: <strong>And it&#8217;s also very platform specific. So you can imagine TikTok as one single feed—the recommendation is just on the feed. But YouTube is, you know, the stuff on the side or whatever. And then Amazon is something else. Spotify and Apple [too]. Apple Podcast is something else. But in each case, I think those of us on the outside underappreciate how much these companies invest in the actual interface.</strong></p>
  181.  
  182.  
  183.  
  184. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=798" target="_blank" rel="noreferrer noopener">13:18</a>: Yes. And I think there are multiple iterations happening on any day, [so] you might see a different interface than your friends or family because you&#8217;re actually being grouped into A/B tests. I think this is very much true of [how] the engagement and performance of the UX have an impact on a lot of the search/rec system as well, beyond the data we just talked about. </p>
  185.  
  186.  
  187.  
  188. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=821" target="_blank" rel="noreferrer noopener">13:41</a>: <strong>Which brings to mind another topic that is also something I&#8217;ve been interested in, over many, many years, which is this notion of experimentation. Many of the most successful companies in the space actually have invested in experimentation tools and experimentation platforms, where people can run experiments at scale. And those experiments can be done much more easily and can be monitored in a much more principled way so that any kind of things they do are backed by data. So I think that companies underappreciate the importance of investing in such a platform. </strong></p>
  189.  
  190.  
  191.  
  192. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=868" target="_blank" rel="noreferrer noopener">14:28</a>: I think that&#8217;s very much true. A lot of larger companies actually build their own in-house A/B testing experiment or testing frameworks. Meta does; Google has their own and even within different cohorts of products, if you&#8217;re monetization, social. . . They have their own niche experimentation platform. So I think that thesis is very much true. </p>
  193.  
  194.  
  195.  
  196. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=891" target="_blank" rel="noreferrer noopener">14:51</a>: <strong>The last topic I wanted to talk to you about is context engineering. I&#8217;ve talked to numerous people about this. So every six months, the context window for these large language models expands. But obviously you can&#8217;t just stuff the context window full, because one, it&#8217;s inefficient. And two, actually, the LLM can still make mistakes because it&#8217;s not going to efficiently process that entire context window anyway. So talk to our listeners about this emerging area called context engineering. And how is that playing out in your own work? </strong></p>
  197.  
  198.  
  199.  
  200. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=938" target="_blank" rel="noreferrer noopener">15:38</a>: I think this is a fascinating topic, where you will hear people passionately say, “RAG is dead.” And it&#8217;s really, as you mentioned, [that] our context window gets much, much bigger. Like, for example, back in April, Llama 4 had this staggering 10 million token context window. So the logic behind this argument is quite simple. Like if the model can indeed handle millions of tokens, why not just dump everything instead of doing a retrieval?</p>
  201.  
  202.  
  203.  
  204. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=968" target="_blank" rel="noreferrer noopener">16:08</a>: I think there are quite a few fundamental limitations towards this. I know folks from contextual AI are passionate about this. I think number one is scalability. A lot of times in production, at least, your knowledge base is measured in terabytes or petabytes. So not tokens. So something even larger. And number two I think would be accuracy.</p>
  205.  
  206.  
  207.  
  208. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=993" target="_blank" rel="noreferrer noopener">16:33</a>: The effective context windows are very different. Honestly, what we see and then what is advertised in product launches. We see performance degrade long before the model reaches its “official limits.” And then I think number three is probably the efficiency and that kind of aligns with, honestly, our human behavior as well. Like do you read an entire book every time you need to answer one simple question? So I think the context engineering [has] slowly evolved from a buzzword, a few years ago, to now an engineering discipline. </p>
  209.  
  210.  
  211.  
  212. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=1035" target="_blank" rel="noreferrer noopener">17:15</a>: <strong>I&#8217;m appreciative that the context windows are increasing. But at some level, I also acknowledge that to some extent, it’s also kind of a feel-good move on the part of the model builders. So it makes us feel good that we can put more things in there, but it may not actually help us answer the question precisely. Actually, a few years ago, I wrote kind of a tongue-and-cheek post called “</strong><a href="https://gradientflow.substack.com/p/structure-is-all-you-need" target="_blank" rel="noreferrer noopener"><strong>Structure Is All You Need</strong></a><strong>.” So basically whatever structure you have, you should help the model, right? If it&#8217;s in a SQL database, then maybe you can expose the structure of the data. If it&#8217;s a knowledge graph, you leverage whatever structure you have to provide the model better context. So this whole notion of just stuffing the model with as much information, for all the reasons you gave, is valid. But also, philosophically, it doesn&#8217;t make any sense to do that anyway.</strong></p>
  213.  
  214.  
  215.  
  216. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=1110" target="_blank" rel="noreferrer noopener">18:30</a>: <strong>What are the things that you are looking forward to, Faye, in terms of foundation models? What kinds of developments in the foundation model space are you hoping for? And are there any developments that you think are below the radar? </strong></p>
  217.  
  218.  
  219.  
  220. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=1132" target="_blank" rel="noreferrer noopener">18:52</a>: I think, to better utilize the concept of “contextual engineering,” that they&#8217;re essentially two loops. There&#8217;s number one within the loop of what happened. Yes. Within the LLMs. And then there&#8217;s the outer loop. Like, what can you do as an engineer to optimize a given context window, etc., to get the best results out of the product within the context loop. There are multiple tricks we can do: For example, there&#8217;s the vector plus Excel or regex extraction. There&#8217;s the metadata fillers. And then for the outer loop—this is a very common practice—people are using LLMs as a reranker, sometimes across the encoder. So the thesis is, hey, why would you overburden an LLM with a 20,000 ranking when there are things you can do to reduce it to top hundred or so? So all of this—context assembly, deduplication, and diversification—would help our production [go] from a prototype to something [that’s] more real time, reliable, and able to scale more infinitely. </p>
  221.  
  222.  
  223.  
  224. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=1207" target="_blank" rel="noreferrer noopener">20:07</a>: <strong>One of the things I wish—and I don&#8217;t know, this is wishful thinking—is maybe if the models can be a little more predictable, that would be nice. By that, I mean, if I ask a question in two different ways, it&#8217;ll basically give me the same answer. The foundation model builders can somehow increase predictability and maybe provide us with a little more explanation for how they arrive at the answer. I understand they&#8217;re giving us the tokens, and maybe some of the, some of the reasoning models are a little more transparent, but give us an idea of how these things work, because it&#8217;ll impact what kinds of applications we&#8217;d be comfortable deploying these things in. For example, for agents. If I&#8217;m using an agent to use a bunch of tools, but I can&#8217;t really predict their behavior, that impacts the types of applications I&#8217;d be comfortable using a model for. </strong></p>
  225.  
  226.  
  227.  
  228. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=1278" target="_blank" rel="noreferrer noopener">21:18</a>: Yeah, definitely. I very much resonate with this, especially now most engineers have, you know, AI empowered coding tools like Cursor and Windsurf—and as an individual, I very much appreciate the train of thought you mentioned: why an agent does certain things. Why is it navigating between repositories? What are you looking at while you&#8217;re doing this call? I think these are very much appreciated. I know there are other approaches—look at Devin, that&#8217;s the fully autonomous engineer peer. It just takes things, and you don&#8217;t know where it goes. But I think in the near future there will be a nice marriage between the two. Well, now since Windsurf is part of Devin&#8217;s parent company. </p>
  229.  
  230.  
  231.  
  232. <p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=1325" target="_blank" rel="noreferrer noopener">22:05</a>: <strong>And with that, thank you, Faye.</strong><br><br><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Faye_Zhang_updated.mp3#t=1328" target="_blank" rel="noreferrer noopener">22:08</a>: Awesome. Thank you, Ben.</p>
  233. ]]></content:encoded>
  234. <wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-faye-zhang-on-using-ai-to-improve-discovery/feed/</wfw:commentRss>
  235. <slash:comments>0</slash:comments>
  236. </item>
  237. <item>
  238. <title>Prompt Engineering Is Requirements Engineering</title>
  239. <link>https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/</link>
  240. <comments>https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/#respond</comments>
  241. <pubDate>Wed, 17 Sep 2025 10:27:38 +0000</pubDate>
  242. <dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
  243. <category><![CDATA[AI & ML]]></category>
  244. <category><![CDATA[Commentary]]></category>
  245.  
  246. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17463</guid>
  247.  
  248.     <media:content
  249. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/10/in-dis-big-bang-10a-1400x950.jpg"
  250. medium="image"
  251. type="image/jpeg"
  252. />
  253. <custom:subtitle><![CDATA[We’ve Been Here Before]]></custom:subtitle>
  254. <description><![CDATA[In the rush to get the most from AI tools, prompt engineering—the practice of writing clear, structured inputs that guide an AI tool’s output—has taken center stage. But for software engineers, the skill isn’t new. We’ve been doing a version of it for decades, just under a different name. The challenges we face when writing [&#8230;]]]></description>
  255. <content:encoded><![CDATA[
  256. <p>In the rush to get the most from AI tools, <strong>prompt engineering</strong>—the practice of writing clear, structured inputs that guide an AI tool’s output—has taken center stage. But for software engineers, the skill isn’t new. We’ve been doing a version of it for decades, just under a different name. The challenges we face when writing AI prompts are the same ones software teams have been grappling with for generations. Talking about prompt engineering today is really just continuing a much older conversation about how developers spell out what they need built, under what conditions, with what assumptions, and how to communicate that to the team.</p>
  257.  
  258.  
  259.  
  260. <p>The <em>software crisis</em> was the name given to this problem starting in the late 1960s, especially at the <a href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/NATO_Software_Engineering_Conferences&amp;sa=D&amp;source=docs&amp;ust=1758049259577140&amp;usg=AOvVaw31-sMSlquvRBVJB-EQgmJ3" target="_blank" rel="noreferrer noopener">NATO Software Engineering Conference</a> in 1968, where the term “software engineering” was introduced. The crisis referred to the widespread industry experience that software projects were over budget and late, and often failed to deliver what users actually needed.</p>
  261.  
  262.  
  263.  
  264. <p>There was a common misconception that these failures were due to programmers lacking technical skill or teams who needed more technical training. But the panels at that conference focused on what they saw as the real root cause: Teams and their stakeholders had trouble understanding the problems they were solving and what they actually needed to build; communicating those needs and ideas clearly among themselves; and ensuring the delivered system matched that intent. It was fundamentally a human communication problem.</p>
  265.  
  266.  
  267.  
  268. <p>Participants at the conference captured this precisely. Dr. Edward E. David Jr. from Bell Labs noted there is often <em>no way even to specify in a logically tight way</em> what the software is supposed to do. Douglas Ross from MIT pointed out the pitfall where you can <em>specify what you are going to do, and then do it</em> as if that solved the problem. Prof. W.L. van der Poel summed up the challenge of incomplete specifications: <em>Most problems simply aren’t defined well enough at the start</em>, so you don’t have the information you need to build the right solution.</p>
  269.  
  270.  
  271.  
  272. <p>These are all problems that cause teams to misunderstand the software they’re creating before any code is written. And they should all sound familiar to developers today who work with AI to generate code.</p>
  273.  
  274.  
  275.  
  276. <p>Much of the problem boils down to what I’ve often called the classic “do what I meant, not what I said” problem. Machines are literal—and people on teams often are too. Our intentions are rarely fully spelled out, and getting everyone aligned on what the software is supposed to do has always required deliberate, often difficult work.</p>
  277.  
  278.  
  279.  
  280. <p>Fred Brooks wrote about this in his classic and widely influential “<a href="https://www.google.com/url?q=https://www.researchgate.net/publication/220477127_No_Silver_Bullet_Essence_and_Accidents_of_Software_Engineering&amp;sa=D&amp;source=docs&amp;ust=1758049259576173&amp;usg=AOvVaw3xG4vlWyKDq4_fbxDJWQ7r" target="_blank" rel="noreferrer noopener">No Silver Bullet</a>” essay. He argued there would never be a single magic process or tool that would make software development easy. Throughout the history of software engineering, teams have been tempted to look for that silver bullet that would make the hard parts of understanding and communication go away. It shouldn’t be surprising that we’d see the same problems that plagued software teams for years reappear when they started to use AI tools.</p>
  281.  
  282.  
  283.  
  284. <p>By the end of the 1970s, these problems were being reframed in terms of <em>quality</em>. Philip Crosby, Joseph M. Juran, and W. Edwards Deming, three people who had enormous influence on the field of quality engineering, each had influential takes on why so many products didn’t do the jobs they were supposed to do, and these ideas are especially true when it comes to software. Crosby argued quality was fundamentally <em>conformance to requirements</em>—if you couldn’t define what you needed clearly, you couldn’t ensure it would be delivered. Juran talked about <em>fitness for use</em>—software needed to solve the user’s real problem in its real context, not just pass some checklists. Deming pushed even further, emphasizing that defects weren’t just technical mistakes but symptoms of broken systems, and <strong>especially poor communication and lack of shared understanding</strong>. He focused on the human side of engineering: creating processes that help people learn, communicate, and improve together.</p>
  285.  
  286.  
  287.  
  288. <p>Through the 1980s, these insights from the quality movement were being applied to software development and started to crystallize into a distinct discipline called <strong>requirements engineering</strong>, focused on identifying, analyzing, documenting, and managing the needs of stakeholders for a product or system. It emerged as its own field, complete with conferences, methodologies, and professional practices. The IEEE Computer Society formalized this with its first International Symposium on Requirements Engineering in 1993, marking its recognition as a core area of software engineering.</p>
  289.  
  290.  
  291.  
  292. <p>The 1990s became a heyday for requirements work, with organizations investing heavily in formal processes and templates, believing that better documentation formats would ensure better software. Standards like IEEE 830 codified the structure of software requirements specifications, and process models such as the software development life cycle and CMM/CMMI emphasized rigorous documentation and repeatable practices. Many organizations invested heavily in designing detailed templates and forms, hoping that filling them out correctly would guarantee the right system. In practice, those templates were useful for consistency and compliance, but they didn’t eliminate the hard part: <em>making sure what was in one person’s head matched what was in everyone else’s</em>.</p>
  293.  
  294.  
  295.  
  296. <p>While the 1990s focused on formal documentation, the Agile movement of the 2000s shifted toward a more lightweight, conversational approach. <strong>User stories</strong> emerged as a deliberate counterpoint to heavyweight specifications—short, simple descriptions of functionality told from the user’s perspective, designed to be easy to write and easy to understand. Instead of trying to capture every detail upfront, user stories served as placeholders for conversations between developers and stakeholders. The practice was deliberately simple, based on the idea that shared understanding comes from dialogue, not documentation, and that requirements evolve through iteration and working software rather than being fixed at the project’s start.</p>
  297.  
  298.  
  299.  
  300. <p>All of this reinforced requirements engineering as a legitimate area of software engineering practice and a real career path with its own set of skills. There is now broad agreement that requirements engineering is a vital area of software engineering focused on surfacing assumptions, clarifying goals, and ensuring everyone involved has the same understanding of what needs to be built.</p>
  301.  
  302.  
  303.  
  304. <h2 class="wp-block-heading">Prompt Engineering <em>Is</em> Requirements Engineering</h2>
  305.  
  306.  
  307.  
  308. <p>Prompt engineering and requirements engineering are literally the same skill—using clarity, context, and intentionality to <em>communicate your intent</em> and ensure what gets built matches what you actually need.</p>
  309.  
  310.  
  311.  
  312. <p>User stories were an evolution from traditional formal specifications: a simpler, more flexible approach to requirements but with the same goal of making sure everyone understood the intent. They gained wide acceptance across the industry because they helped teams recognize that requirements are about creating a shared understanding of the project. User stories gave teams a lightweight way to capture intent and then refine it through conversation, iteration, and working software.</p>
  313.  
  314.  
  315.  
  316. <p>Prompt engineering plays the exact same role. The prompt is our lightweight placeholder for a conversation with the AI. We still refine it through iteration, adding context, clarifying intent, and checking the output against what we actually meant. But it&#8217;s the full conversation with the AI and its context that matters; the individual prompts are just a means to communicate the intent and context. Just like Agile shifted requirements from static specs to living conversations, prompt engineering shifts our interaction with AI from single-shot commands to an iterative refinement process—though one where we have to infer what&#8217;s missing from the output rather than having the AI ask us clarifying questions.</p>
  317.  
  318.  
  319.  
  320. <p>User stories intentionally focused the engineering work back on people and what’s in their heads. Whether it’s a requirements document in Word or a user story in Jira, the most important thing isn’t the piece of paper, ticket, or document we wrote. The most important thing is that what’s in <em>my</em> head matches what’s in <em>your</em> head and matches what’s in the heads of everyone else involved. The piece of paper is just a convenient way to help us figure out whether or not we agree.</p>
  321.  
  322.  
  323.  
  324. <p>Prompt engineering demands the same outcome. Instead of working with teammates to align mental models, we’re communicating to an AI, but the goal hasn’t changed: producing a high-quality product. The basic principles of quality engineering laid out by Deming, Juran, and Crosby have direct parallels in prompt engineering:</p>
  325.  
  326.  
  327.  
  328. <ul class="wp-block-list">
  329. <li><strong>Deming’s focus on systems and communication:</strong> Prompting failures can be traced to problems with the process, not the people. They typically stem from poor context and communication, not from “bad AI.”</li>
  330.  
  331.  
  332.  
  333. <li><strong>Juran’s focus on fitness for use:</strong> When he framed quality as “fitness for use,” Juran meant that what we produce has to meet real needs—not just look plausible. A prompt is useless if the output doesn’t solve the real problem, and failure to create a prompt that’s fit for use will result in hallucinations.</li>
  334.  
  335.  
  336.  
  337. <li><strong>Crosby’s focus on conformance to requirements: </strong>Prompts must specify not just functional needs but also nonfunctional ones like maintainability and readability. If the context and framing aren’t clear, the AI will generate output that conforms to its training distribution rather than the real intent.</li>
  338. </ul>
  339.  
  340.  
  341.  
  342. <p>One of the clearest ways these quality principles show up in prompt engineering is through what&#8217;s now called <strong>context engineering</strong>—deciding what the model needs to see to generate something useful, which typically includes surrounding code, test inputs, expected outputs, design constraints, and other important project information. If you give the AI too little context, it fills in the blanks with what seems most likely based on its training data (which usually isn&#8217;t what you had in mind). If you give it too much, it can get buried in information and lose track of what you&#8217;re really asking for. That judgment call—what to include, what to leave out—has always been one of the deepest challenges at the heart of requirements work.</p>
  343.  
  344.  
  345.  
  346. <p>There’s another important parallel between requirements engineering and prompt engineering. Back in the 1990s, many organizations fell into what we might call the <em>template trap</em>—believing that the right standardized form or requirements template could guarantee a good outcome. Teams spent huge effort designing and filling out documents. But the real problem was never the format; it was whether the underlying intent was truly shared and understood.</p>
  347.  
  348.  
  349.  
  350. <p>Today, many companies fall into a similar trap with <strong>prompt libraries</strong>, or catalogs of prewritten prompts meant to standardize practice and remove the difficulty of writing prompts. Prompt libraries can be useful as references or starting points, but they don’t replace the core skill of framing the problem and ensuring shared understanding. Just like a perfect requirements template in the 1990s didn’t guarantee the right system, canned prompts today don’t guarantee the right code.</p>
  351.  
  352.  
  353.  
  354. <p>Decades later, the points Brooks made in his “No Silver Bullet” essay still hold. There’s no single template, library, or tool that can eliminate the essential complexity of understanding what needs to be built. Whether it’s requirements engineering in the 1990s or prompt engineering today, the hard part is always the same: building and maintaining a shared understanding of intent. Tools can help, but they don’t replace the discipline.</p>
  355.  
  356.  
  357.  
  358. <p>AI raises the stakes on this core communication problem. Unlike your teammates, the AI won’t push back or ask questions—it just generates something that looks plausible based on the prompt that it was given. That makes clear communication of requirements even more important.</p>
  359.  
  360.  
  361.  
  362. <p>The alignment of understanding that serves as the foundation of requirements engineering is even more important when we bring AI tools into the project, <em>because AI doesn’t have judgment</em>. It has a huge model, but it only works effectively when directed well. The AI needs the context that we provide in the form of code, documents, and other project information and artifacts, which means the only thing it knows about the project is what we tell it. That’s why it’s especially important to have ways to check and verify that what the AI “knows” really matches what <em>we</em> know.</p>
  363.  
  364.  
  365.  
  366. <p>The classic requirements engineering problems—especially the poor communication and lack of shared understanding that Deming warned about and that requirements engineers and Agile practitioners have spent decades trying to address—are compounded when we use AI. We’re still facing the same issues of communicating intent and specifying requirements clearly. But now those requirements aren’t just for the team to read; they’re used to establish the AI’s context. Small variations in problem framing can have a profound impact on what the AI produces. Using natural language to increasingly replace the structured, unambiguous syntax of code removes a critical guardrail that’s traditionally helped protect software from failed understanding.</p>
  367.  
  368.  
  369.  
  370. <p>The tools of requirements engineering help us make up for that missing guardrail. Agile’s iterative process of the developer understanding requirements, building working software, and continuously reviewing it with the product owner was a check that ensured misunderstandings were caught early. The more we eliminate that extra step of translation and understanding by having AI generate code directly from requirements, the more important it becomes for everyone involved—stakeholders and engineers alike—to have a truly shared understanding of what needs to be built.</p>
  371.  
  372.  
  373.  
  374. <p>When people on teams work together to build software, they spend a lot of time talking and asking questions to understand what they need to build. Working with an AI follows a different kind of feedback cycle—you don&#8217;t know it&#8217;s missing context until you see what it produces, and you often need to reverse engineer what it did to figure out what&#8217;s missing. But both types of interaction require the same fundamental skills around context and communication that requirements engineers have always practiced.</p>
  375.  
  376.  
  377.  
  378. <p>This shows up in practice in several ways:</p>
  379.  
  380.  
  381.  
  382. <ul class="wp-block-list">
  383. <li><strong>Context and shared understanding are foundational.</strong> Good requirements help teams understand what behavior matters and how to know when it&#8217;s working—capturing both functional requirements (what to build) and nonfunctional requirements (how well it should work). The same distinction applies to prompting but with fewer chances to course-correct. If you leave out something critical, the AI doesn&#8217;t push back; it just responds with whatever seems plausible. Sometimes that output looks reasonable until you try to use it and realize the AI was solving a different problem.</li>
  384.  
  385.  
  386.  
  387. <li><strong>Scoping takes real judgment.</strong> Developers who struggle to use AI for code typically fall into two extremes: providing too little context (a single sentence that produces something that looks right but fails in practice) or pasting in entire files expecting the model to zoom in on the right method. Unless you explicitly call out what&#8217;s important—both functional and nonfunctional requirements—it doesn&#8217;t know what matters.</li>
  388.  
  389.  
  390.  
  391. <li><strong>Context drifts, and the model doesn’t know it’s drifted.</strong> With human teams, understanding shifts gradually through check-ins and conversations. With prompting, drift can happen in just a few exchanges. The model might still be generating fluent responses until it suggests a fix that makes no sense. That&#8217;s a signal that the context has drifted, and you need to reframe the conversation—perhaps by asking the model to explain the code or restate what it thinks it&#8217;s doing.</li>
  392. </ul>
  393.  
  394.  
  395.  
  396. <p>History keeps repeating itself: From binders full of scattered requirements to IEEE standards to user stories to today’s prompts, the discipline is the same. We succeed when we treat it as real engineering. <strong>Prompt engineering is the next step in the evolution of requirements engineering.</strong> It’s how we make sure we have a shared understanding between everyone on the project—including the AI—and it demands the same care, clarity, and deliberate communication we’ve always needed to avoid misunderstandings and build the right thing.</p>
  397. ]]></content:encoded>
  398. <wfw:commentRss>https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/feed/</wfw:commentRss>
  399. <slash:comments>0</slash:comments>
  400. </item>
  401. <item>
  402. <title>MCP in Practice</title>
  403. <link>https://www.oreilly.com/radar/mcp-in-practice/</link>
  404. <comments>https://www.oreilly.com/radar/mcp-in-practice/#respond</comments>
  405. <pubDate>Tue, 16 Sep 2025 11:22:59 +0000</pubDate>
  406. <dc:creator><![CDATA[Ilan Strauss, Sruly Rosenblat, Isobel Moure and Tim O’Reilly]]></dc:creator>
  407. <category><![CDATA[AI & ML]]></category>
  408. <category><![CDATA[Research]]></category>
  409.  
  410. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17440</guid>
  411.  
  412.     <media:content
  413. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/12/in-dis-canyon-5b-1400x950.jpg"
  414. medium="image"
  415. type="image/jpeg"
  416. />
  417. <custom:subtitle><![CDATA[Mapping Power, Concentration, and Usage in the Emerging AI Developer Ecosystem]]></custom:subtitle>
  418. <description><![CDATA[The following was originally published in Asimov&#8217;s Addendum, September 11, 2025. Learn more about the AI Disclosures Project here. 1. The Rise and Rise of MCP Anthropic’s Model Context Protocol (MCP) was released in November 2024 as a way to make tools and platforms model-agnostic. MCP works by defining servers and clients. MCP servers are local or remote end [&#8230;]]]></description>
  419. <content:encoded><![CDATA[
  420. <p class="has-text-align-center has-cyan-bluish-gray-background-color has-background"><em>The following was <a href="https://asimovaddendum.substack.com/p/read-write-act-inside-the-mcp-server" target="_blank" rel="noreferrer noopener">originally published in </a></em><a href="https://asimovaddendum.substack.com/p/read-write-act-inside-the-mcp-server" target="_blank" rel="noreferrer noopener">Asimov&#8217;s Addendum</a><em>,</em> <em>September 11, 2025.</em><br><br><em>Learn more about the AI Disclosures Project <a href="https://www.ssrc.org/programs/ai-disclosures-project/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
  421.  
  422.  
  423.  
  424. <h2 class="wp-block-heading"><strong>1. The Rise and Rise of MCP</strong></h2>
  425.  
  426.  
  427.  
  428. <p>Anthropic’s<a href="https://www.anthropic.com/news/model-context-protocol" target="_blank" rel="noreferrer noopener"> Model Context Protocol</a> (MCP) was released in November 2024 as a way to make tools and platforms model-agnostic. MCP works by defining servers and clients. MCP servers are local or remote end points where tools and resources are defined. For example, GitHub released an MCP server that allows LLMs to both read from and write to GitHub. MCP clients are the connection from an AI application to MCP servers—they allow an LLM to interact with context and tools from different servers. An example of an MCP client is Claude Desktop, which allows the Claude models to interact with thousands of MCP servers.</p>
  429.  
  430.  
  431.  
  432. <p><strong>In a relatively short time, MCP has become the backbone of hundreds of AI pipelines and applications</strong>. Major players like Anthropic and OpenAI have built it into their products. Developer tools such as Cursor (a coding-focused text editor or IDE) and productivity apps like <a href="https://www.raycast.com/EvanZhouDev/mcp" target="_blank" rel="noreferrer noopener">Raycast</a> also use MCP. Additionally, thousands of <a href="https://arxiv.org/abs/2506.13538" target="_blank" rel="noreferrer noopener">developers</a> use it to integrate AI models and access external tools and data without having to build an entire ecosystem from scratch.</p>
  433.  
  434.  
  435.  
  436. <p>In previous work published with <em>AI Frontiers</em>, we argued that <a href="https://ai-frontiers.org/articles/open-protocols-prevent-ai-monopolies" target="_blank" rel="noreferrer noopener">MCP can act</a> as a great unbundler of “context”—the data that helps AI applications provide more relevant answers to consumers. In doing so, it can help decentralize AI markets. <strong>We argued that, for MCP to truly achieve its goals, it requires support from</strong>:</p>
  437.  
  438.  
  439.  
  440. <ol class="wp-block-list">
  441. <li><strong>Open APIs</strong>: So that MCP applications can access third-party tools for agentic use (<em>write</em> actions) and context (<em>read</em>)</li>
  442.  
  443.  
  444.  
  445. <li><strong>Fluid memory</strong>: Interoperable LLM memory standards, accessed via MCP-like open protocols, so that the memory context accrued at OpenAI and other leading developers does not get stuck there, preventing downstream innovation</li>
  446. </ol>
  447.  
  448.  
  449.  
  450. <p>We expand upon these two points in a <a href="https://ssrc-static.s3.us-east-1.amazonaws.com/Protocols-and-Power-Moure-OReilly-Strauss_SSRC_08272025.pdf" target="_blank" rel="noreferrer noopener">recent policy note</a>, for those looking to dig deeper.</p>
  451.  
  452.  
  453.  
  454. <p>More generally, <strong>we argue that protocols</strong>,<strong> like MCP</strong>,<strong> are actually </strong><a href="https://asimovaddendum.substack.com/p/disclosures-i-do-not-think-that-word" target="_blank" rel="noreferrer noopener"><strong>foundational “rules of the road” for AI markets</strong></a>, <em>whereby open disclosure and communication standards are built</em> <em>into the network itself</em>, rather than imposed <em>after the fact</em> by regulators. Protocols are fundamentally market-shaping devices, architecting markets through the permissions, rules, and interoperability of the network itself. They can have a big impact on how the commercial markets built on top of them function too.</p>
  455.  
  456.  
  457.  
  458. <h3 class="wp-block-heading"><strong>1.1 But how is the MCP ecosystem evolving?</strong></h3>
  459.  
  460.  
  461.  
  462. <p><strong>Yet we don’t have a clear idea of the shape of the MCP ecosystem today</strong>.<strong> </strong><em>What are the most common use cases of MCP? What sort of access is being given by MCP servers and used by MCP clients? Is the data accessed via MCP “read-only” for context, or does it allow agents to “write” and interact with it—for example, by editing files or sending emails?</em></p>
  463.  
  464.  
  465.  
  466. <p>To begin answering these questions, we look at the tools and context which AI agents use via <em>MCP servers</em>. This gives us a clue about what is being built and what is getting attention. In this article, we don’t analyze <em>MCP clients</em>—the applications that use MCP servers. We instead limit our analysis to what MCP servers are making available for building.</p>
  467.  
  468.  
  469.  
  470. <p>We assembled a large dataset of MCP servers (n = 2,874), scraped from <a href="https://www.pulsemcp.com/" target="_blank" rel="noreferrer noopener">Pulse</a>.<sup>1</sup> We then enriched it with GitHub star-count data on each server. On GitHub, stars are similar to Facebook “likes,” and <a href="https://homepages.dcc.ufmg.br/~mtov/pub/2018-jss-github-stars.pdf" target="_blank" rel="noreferrer noopener">developers use them</a> to show appreciation, bookmark projects, or indicate usage.</p>
  471.  
  472.  
  473.  
  474. <p>In practice, <em>while there were plenty of MCP servers, we found that the top few garnered most of the attention and, likely by extension, most of the use.</em> <strong>Just the top 10 servers had nearly half of all GitHub stars given to MCP servers</strong>.</p>
  475.  
  476.  
  477.  
  478. <p><strong>Some of our takeaways are:</strong></p>
  479.  
  480.  
  481.  
  482. <ol class="wp-block-list">
  483. <li><em>MCP usage appears to be fairly concentrated</em>. This means that, if left unchecked, a small number of servers and (by extension) APIs could have outsize control over the MCP ecosystem being created.</li>
  484.  
  485.  
  486.  
  487. <li><em>MCP use (tools and data being accessed) is dominated by just three categories</em>: Database &amp; Search (RAG), Computer &amp; Web Automation, and Software Engineering. Together, they received nearly three-quarters (72.6%) of all stars on GitHub (which we proxy for usage).</li>
  488.  
  489.  
  490.  
  491. <li>Most MCP servers support both <em>read </em>(access context) and <em>write</em> (change context) operations, showing that developers want their agents to be able to act on context, not just consume it.</li>
  492. </ol>
  493.  
  494.  
  495.  
  496. <h2 class="wp-block-heading"><strong>2. Findings</strong></h2>
  497.  
  498.  
  499.  
  500. <p><em>To start with, we analyzed the MCP ecosystem for concentration risk.</em></p>
  501.  
  502.  
  503.  
  504. <h3 class="wp-block-heading"><strong>2.1 MCP server use is concentrated</strong></h3>
  505.  
  506.  
  507.  
  508. <p><strong>We found that MCP usage is concentrated among several key MCP servers</strong>, <em>judged by the number of GitHub stars each repo received</em>.</p>
  509.  
  510.  
  511.  
  512. <p>Despite there being thousands of MCP servers, <strong>the top 10 servers make up nearly half (45.7%) of all GitHub stars given to MCP servers</strong> (pie chart below) and the top 10% of servers make up 88.3% of all GitHub stars (not shown).</p>
  513.  
  514.  
  515.  
  516. <figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1444" height="742" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2.png" alt="The top 10 servers received 45.7% of all GitHub stars in our dataset of 2,874 servers." class="wp-image-17441" title="Chart" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2.png 1444w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2-300x154.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-2-768x395.png 768w" sizes="(max-width: 1444px) 100vw, 1444px" /><figcaption class="wp-element-caption"><em>The top 10 servers received 45.7% of all GitHub stars in our dataset of </em>2,874<em> servers.</em></figcaption></figure>
  517.  
  518.  
  519.  
  520. <p><em>This means that the majority of real-world MCP users are likely relying on the same few services made available via a handful of APIs</em>. This concentration likely stems from network effects and practical utility: All developers gravitate toward servers that solve universal problems like web browsing, database access, and integration with widely used platforms like GitHub, Figma, and Blender. This concentration pattern seems typical of developer-tool ecosystems. A few well-executed, broadly applicable solutions tend to dominate. Meanwhile, more specialized tools occupy smaller niches.</p>
  521.  
  522.  
  523.  
  524. <h3 class="wp-block-heading"><strong>2.2 The top 10 MCP servers really matter</strong></h3>
  525.  
  526.  
  527.  
  528. <p>Next, the top 10 MCP servers are shown in the table below, along with their star count and what they do.</p>
  529.  
  530.  
  531.  
  532. <p><strong>Among the top 10 MCP servers,</strong> <em>GitHub,</em> <em>Repomix</em>, <em>Context7</em>, and <em>Framelink</em> are built to assist with software development: <em>Context7</em> and <em>Repomix</em> by gathering context, <em>GitHub</em> by allowing agents to interact with projects, and <em>Framelink </em>by passing on the design specifications from <em>Figma</em> directly to the model. The <em>Blender</em> server allows agents to create 3D models of anything, using the popular open source <em>Blender</em> application. Finally, <em>Activepieces</em> and <em>MindsDB</em> connect the agent to multiple APIs with one standardized interface: in <em>MindsDB</em>’s case, primarily to read data from databases, and in <em>Activepieces</em> to automate services.</p>
  533.  
  534.  
  535.  
  536. <figure class="wp-block-image size-full"><img decoding="async" width="1296" height="1260" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3.png" alt="The top 10 MCP servers with short descriptions, design courtesy of Claude." class="wp-image-17442" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3.png 1296w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3-300x292.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-3-768x747.png 768w" sizes="(max-width: 1296px) 100vw, 1296px" /><figcaption class="wp-element-caption"><em>The top 10 MCP servers with short descriptions, design courtesy of Claude.</em></figcaption></figure>
  537.  
  538.  
  539.  
  540. <p><strong>The dominance of agentic browsing</strong>, <strong>in the form of <em>Browser Use</em> (61,000 stars) and <em>Playwright MCP</em> (18,425 stars)</strong>,<strong> stands out</strong>. This reflects the fundamental need for AI systems to interact with web content. These tools allow AI to navigate websites, click buttons, fill out forms, and extract data just like a human would. <em>Agentic browsing has surged</em>,<em> even though it’s far less token-efficient than calling an API</em>. Browsing agents often need to wade through multiple pages of boilerplate to extract slivers of data a single API request could return. Because many services lack usable APIs or tightly gate them, browser-based agents are often the simplest—sometimes the only—way to integrate, underscoring the limits of today’s APIs.</p>
  541.  
  542.  
  543.  
  544. <p><strong>Some of the top servers are unofficial. </strong>Both the <em>Framelink</em> and <em>Blender MCP</em> are servers that interact with just a single application, but they are both “unofficial” products. This means that they are not officially endorsed by the developers of the application they are integrating with—those who own the underlying service or API (e.g., GitHub, Slack, Google). Instead, they are built by independent developers who create a bridge between an AI client and a service—often by reverse-engineering APIs, wrapping unofficial SDKs, or using browser automation to mimic user interactions.</p>
  545.  
  546.  
  547.  
  548. <p>It is healthy that third-party developers can build their own MCP servers, since this openness encourages innovation. But it also introduces an intermediary layer between the user and the API, which brings risks around trust, verification, and even potential abuse. With open source local servers, the code is transparent and can be vetted. By contrast, remote third-party servers are harder to audit, since users must trust code they can’t easily inspect.</p>
  549.  
  550.  
  551.  
  552. <p><strong>At a deeper level, the repos that currently dominate MCP servers highlight three encouraging facts about the MCP ecosystem:</strong></p>
  553.  
  554.  
  555.  
  556. <ol class="wp-block-list">
  557. <li><strong>First, several prominent MCP servers support multiple third-party services for their functionality. </strong><em>MindsDB</em> and <em>Activepieces</em> serve as gateways to multiple (often competing) service providers through a single server. <em>MindsDB</em> allows developers to query different databases like PostgreSQL, MongoDB, and MySQL through a single interface, while <em>Taskmaster </em>allows the agent to delegate tasks to a range of AI models from OpenAI, Anthropic, and Google, all without changing servers.</li>
  558.  
  559.  
  560.  
  561. <li><strong>Second, agentic browsing MCP servers are being used to get around potentially restrictive APIs.</strong> As noted above, <em>Browser Use </em>and <em>Playwright</em> access internet services through a web browser, helping to bypass API restrictions, but they instead run up against anti-bot protections. This circumvents the limitations that APIs can impose on what developers are able to build.</li>
  562.  
  563.  
  564.  
  565. <li><strong>Third, some MCP servers do their processing on the developer’s computer (locally)</strong>,<strong> making them less dependent on a vendor maintaining API access</strong>.<strong><em> </em></strong><em>Some MCP servers examined here can run entirely on a local computer without sending data to the cloud—meaning that no gatekeeper has the power to cut you off</em>. Of the 10 MCP servers examined above, only <em>Framelink</em>, <em>Context7,</em> and <em>GitHub</em> rely on just a single cloud-only API dependency that can’t be run locally end-to-end on your machine. <em>Blender</em> and <em>Repomix</em> are completely open source and don’t require any internet access to work, while <em>MindsDB</em>, <em>Browser Use, </em>and <em>Activepieces </em>have local open source implementations.</li>
  566. </ol>
  567.  
  568.  
  569.  
  570. <h3 class="wp-block-heading"><strong>2.3 The three categories that dominate MCP use</strong></h3>
  571.  
  572.  
  573.  
  574. <p><em>Next, we grouped MCP servers into different categories based on their functionality</em>.</p>
  575.  
  576.  
  577.  
  578. <p>When we analyzed what types of servers are most popular, we found that three dominated: <strong>Computer &amp; Web Automation (24.8%)</strong>,<strong> Software Engineering (24.7%)</strong>, and <strong>Database &amp; Search (23.1%)</strong>.</p>
  579.  
  580.  
  581.  
  582. <figure class="wp-block-image size-full"><img decoding="async" width="1456" height="683" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4.png" alt="Software engineering, computer and web automation, and database and search received 72.6% of all stars given to MCP servers." class="wp-image-17443" title="Chart" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4-300x141.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-4-768x360.png 768w" sizes="(max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption"><em>Software Engineering, Computer &amp; Web Automation, and Database &amp; Search received 72.6% of all stars given to MCP servers.</em></figcaption></figure>
  583.  
  584.  
  585.  
  586. <p>Widespread use of Software Engineering (24.7%) MCP servers aligns with <a href="https://arxiv.org/abs/2503.04761" target="_blank" rel="noreferrer noopener">Anthropic’s economic index</a>, which found that an outsize portion of AI interactions were related to software development.</p>
  587.  
  588.  
  589.  
  590. <p>The popularity of both Computer &amp; Web Automation (24.8%) and Database &amp; Search (23.1%) also makes sense. Before the advent of MCP, web scraping and database search were highly integrated applications across platforms like ChatGPT, Perplexity, and Gemini. With MCP, however, users can now access that same search functionality and connect their agents to any database with minimal effort. In other words, MCP’s <a href="https://ai-frontiers.org/articles/open-protocols-prevent-ai-monopolies" target="_blank" rel="noreferrer noopener">unbundling</a> effect is highly visible here.</p>
  591.  
  592.  
  593.  
  594. <h3 class="wp-block-heading"><strong>2.4 Agents interact with their environments</strong></h3>
  595.  
  596.  
  597.  
  598. <p><em>Lastly, we analyzed the capabilities of these servers</em>: Are they allowing AI applications just to access data and tools (<em>read</em>), or instead do agentic operations with them (<em>write</em>)?</p>
  599.  
  600.  
  601.  
  602. <p><strong>Across all but two of the MCP server categories looked at, the most popular MCP servers supported both <em>reading</em> (access context)<em> </em>and <em>writing</em> (agentic) operations</strong>—shown in turquoise. The prevalence of servers with combined read and write access suggests that agents are not being built just to answer questions based on data but also to take action and interact with services on a user’s behalf.</p>
  603.  
  604.  
  605.  
  606. <figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="974" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5.png" alt="Showing MCP servers by category. Dotted red line at 10,000 stars (likes). The most popular servers support both read and write operations by agents. In contrast, almost no servers support just write operations." class="wp-image-17444" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5-300x201.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-5-768x514.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption"><em>Showing MCP servers by category. Dotted red line at 10,000 stars (likes). The most popular servers support both read and write operations by agents. In contrast, almost no servers support just write operations.</em></figcaption></figure>
  607.  
  608.  
  609.  
  610. <p>The two exceptions are Database &amp; Search (RAG) and Finance MCP servers, in which <em>read-only</em> access is a common permission given. This is likely because data integrity is critical to ensuring reliability.</p>
  611.  
  612.  
  613.  
  614. <h2 class="wp-block-heading"><strong>3. The Importance of Multiple Access Points</strong></h2>
  615.  
  616.  
  617.  
  618. <p>A few implications of our analysis can be drawn out at this preliminary stage.</p>
  619.  
  620.  
  621.  
  622. <p><strong>First, concentrated MCP server use compounds the risks of API access being restricted</strong>. As we discussed in “<a href="https://asimovaddendum.substack.com/p/protocols-and-power" target="_blank" rel="noreferrer noopener">Protocols and Power</a>,” MCP remains constrained by “<em>what a particular service (such as GitHub or Slack) happens to expose through its API</em>.” A few powerful digital service providers have the power to shut down access to their servers.</p>
  623.  
  624.  
  625.  
  626. <p><em>One important hedge against API gatekeeping is that many of the top servers try not to rely on a single provide</em>r. <strong>In addition</strong>,<strong> the following two safeguards are relevant</strong>:</p>
  627.  
  628.  
  629.  
  630. <ul class="wp-block-list">
  631. <li><strong>They offer local processing</strong> of data on a user’s machine whenever possible, instead of sending the data for processing to a third-party server. Local processing ensures that functionality cannot be restricted.</li>
  632.  
  633.  
  634.  
  635. <li>If running a service locally is not possible (e.g., email or web search), the server should still <strong>support multiple avenues of getting at the needed context through competing APIs</strong>. For example, <em>MindsDB</em> functions as a gateway to multiple data sources, so instead of relying on just one database to read and write data, it goes to great lengths to support multiple databases in one unified interface, essentially making the backend tools interchangeable.</li>
  636. </ul>
  637.  
  638.  
  639.  
  640. <p><strong>Second, our analysis points to the fact that current restrictive API access policies are not sustainable. </strong>Web scraping and bots, accessed via MCP servers, are probably being used (at least in part) to circumvent overly restrictive API access, complicating the <a href="https://slate.com/technology/2025/08/uk-online-safety-act-reddit-wikipedia-open-internet.html" target="_blank" rel="noreferrer noopener">increasingly common</a> practice of banning bots. Even OpenAI is coloring outside the API lines, using a third-party service to access Google Search’s results through web scraping, thereby <a href="https://www.theinformation.com/articles/openai-challenging-google-using-search-data" target="_blank" rel="noreferrer noopener">circumventing its restrictive API</a>.</p>
  641.  
  642.  
  643.  
  644. <p><strong>Expanding structured API access in a meaningful way is vital</strong>. <em>This ensures that legitimate AI automation runs through stable, documented end points.</em> Otherwise, developers resort to brittle browser automation where privacy and authorization have not been properly addressed. Regulatory guidance <a href="https://ai-frontiers.org/articles/open-protocols-prevent-ai-monopolies" target="_blank" rel="noreferrer noopener">could push</a> the market in this direction, as with open banking in the US.</p>
  645.  
  646.  
  647.  
  648. <p><strong>Finally, encouraging greater transparency and disclosure </strong>could help identify where the bottlenecks in the MCP ecosystem are.</p>
  649.  
  650.  
  651.  
  652. <ul class="wp-block-list">
  653. <li>Developers operating popular MCP servers (above a certain usage threshold) or providing APIs used by top servers should report usage statistics, access denials, and rate-limiting policies. This data would help regulators identify emerging bottlenecks before they become entrenched. <em>GitHub might facilitate this by encouraging these disclosures, for example</em>.</li>
  654.  
  655.  
  656.  
  657. <li>Additionally, MCP servers above certain usage thresholds should clearly list their dependencies on external APIs and what fallback options exist if the primary APIs become unavailable. This is not only helpful in determining the market structure, but also essential information for security and robustness for downstream applications.</li>
  658. </ul>
  659.  
  660.  
  661.  
  662. <p>The goal is not to eliminate all concentration in the network but to ensure that the MCP ecosystem remains contestable, with multiple viable paths for innovation and user choice. By addressing both technical architecture and market dynamics, these suggested tweaks could help MCP achieve its potential as a democratizing force in AI development, rather than merely shifting bottlenecks from one layer to another.</p>
  663.  
  664.  
  665.  
  666. <hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
  667.  
  668.  
  669.  
  670. <h2 class="wp-block-heading">Footnotes</h2>
  671.  
  672.  
  673.  
  674. <ol class="wp-block-list">
  675. <li>For this analysis, we categorized each repo into one of 15 categories using GPT-5 mini. We then human-reviewed and edited the top 50 servers that make up around 70% of the total star count in our dataset.</li>
  676. </ol>
  677.  
  678.  
  679.  
  680. <hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
  681.  
  682.  
  683.  
  684. <h2 class="wp-block-heading"><strong>Appendix</strong></h2>
  685.  
  686.  
  687.  
  688. <h3 class="wp-block-heading"><strong>Dataset</strong></h3>
  689.  
  690.  
  691.  
  692. <p>The full dataset, along with descriptions of the categories, can be found here (constructed by Sruly Rosenblat):</p>
  693.  
  694.  
  695.  
  696. <p><a href="https://huggingface.co/datasets/sruly/MCP-In-Practice" target="_blank" rel="noreferrer noopener">https://huggingface.co/datasets/sruly/MCP-In-Practice</a></p>
  697.  
  698.  
  699.  
  700. <h3 class="wp-block-heading"><strong>Limitations</strong></h3>
  701.  
  702.  
  703.  
  704. <p>There are a few limitations to our preliminary research:</p>
  705.  
  706.  
  707.  
  708. <ul class="wp-block-list">
  709. <li>GitHub stars aren’t a measure of download counts or even necessarily a repo’s popularity.</li>
  710.  
  711.  
  712.  
  713. <li>Only the name and description were used when categorizing repos with the LLM.</li>
  714.  
  715.  
  716.  
  717. <li>Categorization was subject to both human and AI errors and many servers would likely fit into multiple categories.</li>
  718.  
  719.  
  720.  
  721. <li>We only used the Pulse list for our dataset; other lists had different servers (e.g., Browser Use isn’t on mcpmarket.com).</li>
  722.  
  723.  
  724.  
  725. <li>We excluded some repos from our analysis, such as those that had multiple servers and those we weren’t able to fetch the star count for. We may miss some popular servers by doing this.</li>
  726. </ul>
  727.  
  728.  
  729.  
  730. <h3 class="wp-block-heading"><strong>MCP Server Use Over Time</strong></h3>
  731.  
  732.  
  733.  
  734. <figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="916" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6.png" alt="The growth of the top nine repos’ star count over time from MCP’s launch date on November 25, 2024, until September 2025. NOTE: We were only able to track the Browser-Use’s repo until 40,000 stars; hence the flat line for its graph. In reality, roughly 21,000 stars were added over the next few months (the other graphs in this blog are properly adjusted)." class="wp-image-17445" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6-300x189.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-6-768x483.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption"><em>The growth of the top nine repos’ star count over time from MCP’s launch date on November 25, 2024, until September 2025. </em><br><br><em>Note: We were only able to track Browser Use’s repo until 40,000 stars; hence the flat line for its graph. In reality, roughly 21,000 stars were added over the next few months. (The other graphs in this post are properly adjusted.)</em></figcaption></figure>
  735.  
  736.  
  737.  
  738. <p></p>
  739. ]]></content:encoded>
  740. <wfw:commentRss>https://www.oreilly.com/radar/mcp-in-practice/feed/</wfw:commentRss>
  741. <slash:comments>0</slash:comments>
  742. </item>
  743. <item>
  744. <title>When AI Writes Code, Who Secures It?</title>
  745. <link>https://www.oreilly.com/radar/when-ai-writes-code-who-secures-it/</link>
  746. <comments>https://www.oreilly.com/radar/when-ai-writes-code-who-secures-it/#respond</comments>
  747. <pubDate>Mon, 15 Sep 2025 10:37:10 +0000</pubDate>
  748. <dc:creator><![CDATA[Chloé Messdaghi]]></dc:creator>
  749. <category><![CDATA[AI & ML]]></category>
  750. <category><![CDATA[Security]]></category>
  751. <category><![CDATA[Commentary]]></category>
  752.  
  753. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17436</guid>
  754.  
  755.     <media:content
  756. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/06/hacker-1944688_crop-b34a76e3cab9c07c5900b706c70a12c3-1.jpg"
  757. medium="image"
  758. type="image/jpeg"
  759. />
  760. <description><![CDATA[In early 2024, a striking deepfake fraud case in Hong Kong brought the vulnerabilities of AI-driven deception into sharp relief. A finance employee was duped during a video call by what appeared to be the CFO—but was, in fact, a sophisticated AI-generated deepfake. Convinced of the call’s authenticity, the employee made 15 transfers totaling over [&#8230;]]]></description>
  761. <content:encoded><![CDATA[
  762. <p>In early 2024, a striking deepfake fraud case in Hong Kong brought the vulnerabilities of AI-driven deception into sharp relief. A finance employee was duped during a video call by what appeared to be the CFO—but was, in fact, a sophisticated AI-generated deepfake. Convinced of the call’s authenticity, the employee made <a href="https://www.ft.com/content/b977e8d4-664c-4ae4-8a8e-eb93bdf785ea?" target="_blank" rel="noreferrer noopener">15 transfers totaling over $25 million</a> to fraudulent bank accounts before realizing it was a scam.</p>
  763.  
  764.  
  765.  
  766. <p>This incident exemplifies more than just technological trickery—it signals how trust in what we see and hear can be weaponized, especially as AI becomes more deeply integrated into enterprise tools and workflows. From embedded LLMs in enterprise systems to autonomous agents diagnosing and even repairing issues in live environments, AI is transitioning from novelty to necessity. Yet as it evolves, so too do the gaps in our traditional security frameworks—designed for static, human-written code—revealing just how unprepared we are for systems that generate, adapt, and behave in unpredictable ways.</p>
  767.  
  768.  
  769.  
  770. <h2 class="wp-block-heading">Beyond the CVE Mindset</h2>
  771.  
  772.  
  773.  
  774. <p>Traditional secure coding practices revolve around known vulnerabilities and patch cycles. AI changes the equation. A line of code can be generated on the fly by a model, shaped by manipulated prompts or data—creating new, unpredictable categories of risk like prompt injection or emergent behavior outside traditional taxonomies.</p>
  775.  
  776.  
  777.  
  778. <p>A 2025 Veracode study found that <a href="https://www.techradar.com/pro/nearly-half-of-all-code-generated-by-ai-found-to-contain-security-flaws-even-big-llms-affected?" target="_blank" rel="noreferrer noopener">45% of all AI-generated code contained vulnerabilities</a>, with common flaws like weak defenses against XSS and log injection. (Some languages performed more poorly than others. Over 70% of AI-generated Java code had a security issue, for instance.) Another 2025 study showed that repeated refinement can make things worse: After just five iterations, critical vulnerabilities rose by <a href="https://arxiv.org/abs/2506.11022?" target="_blank" rel="noreferrer noopener">37.6%</a>.</p>
  779.  
  780.  
  781.  
  782. <p>To keep pace, frameworks like the <a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" target="_blank" rel="noreferrer noopener">OWASP Top 10 for LLMs</a> have emerged, cataloging AI-specific risks such as data leakage, model denial of service, and prompt injection. They highlight how current security taxonomies fall short—and why we need new approaches that model AI threat surfaces, share incidents, and iteratively refine risk frameworks to reflect how code is created and influenced by AI.</p>
  783.  
  784.  
  785.  
  786. <h2 class="wp-block-heading">Easier for Adversaries</h2>
  787.  
  788.  
  789.  
  790. <p>Perhaps the most alarming shift is how AI lowers the barrier to malicious activity. What once required deep technical expertise can now be done by anyone with a clever prompt: generating scripts, launching phishing campaigns, or manipulating models. AI doesn’t just broaden the attack surface; it makes it easier and cheaper for attackers to succeed without ever writing code.</p>
  791.  
  792.  
  793.  
  794. <p>In 2025, researchers unveiled PromptLocker, the first AI-powered ransomware. Though only a proof of concept, it showed how theft and encryption could be automated with a local LLM at remarkably low cost: about <a href="https://www.tomshardware.com/tech-industry/cyber-security/ai-powered-promptlocker-ransomware-is-just-an-nyu-research-project-the-code-worked-as-a-typical-ransomware-selecting-targets-exfiltrating-selected-data-and-encrypting-volumes" target="_blank" rel="noreferrer noopener">$0.70 per full attack using commercial APIs</a>—and essentially free with open source models. That kind of affordability could make ransomware cheaper, faster, and more scalable than ever.</p>
  795.  
  796.  
  797.  
  798. <p>This democratization of offense means defenders must prepare for attacks that are more frequent, more varied, and more creative. The <a href="https://github.com/mitre/advmlthreatmatrix" target="_blank" rel="noreferrer noopener">Adversarial ML Threat Matrix</a>, founded by Ram Shankar Siva Kumar during his time at Microsoft, helps by enumerating threats to machine learning and offering a structured way to anticipate these evolving risks. (He’ll be discussing the difficulty of securing AI systems from adversaries at <a href="https://learning.oreilly.com/live-events/security-superstream-secure-code-in-the-age-of-ai/0642572204099/" target="_blank" rel="noreferrer noopener">O’Reilly’s upcoming Security Superstream</a>.)</p>
  799.  
  800.  
  801.  
  802. <h2 class="wp-block-heading">Silos and Skill Gaps</h2>
  803.  
  804.  
  805.  
  806. <p>Developers, data scientists, and security teams still work in silos, each with different incentives. Business leaders push for rapid AI adoption to stay competitive, while security leaders warn that moving too fast risks catastrophic flaws in the code itself.</p>
  807.  
  808.  
  809.  
  810. <p>These tensions are amplified by a widening skills gap: Most developers lack training in AI security, and many security professionals don’t fully understand how LLMs work. As a result, the old patchwork fixes feel increasingly inadequate when the models are writing and running code on their own.</p>
  811.  
  812.  
  813.  
  814. <p>The rise of “vibe coding”—relying on LLM suggestions without review—captures this shift. It accelerates development but introduces hidden vulnerabilities, leaving both developers and defenders struggling to manage novel risks.</p>
  815.  
  816.  
  817.  
  818. <h2 class="wp-block-heading">From Avoidance to Resilience</h2>
  819.  
  820.  
  821.  
  822. <p>AI adoption won’t stop. The challenge is moving from avoidance to resilience. Frameworks like <a href="https://www.databricks.com/blog/announcing-databricks-ai-security-framework-20" target="_blank" rel="noreferrer noopener">Databricks’ AI Risk Framework (DASF)</a> and the <a href="https://www.nist.gov/itl/ai-risk-management-framework" target="_blank" rel="noreferrer noopener">NIST AI Risk Management Framework</a> provide practical guidance on embedding governance and security directly into AI pipelines, helping organizations move beyond ad hoc defenses toward systematic resilience. The goal isn’t to eliminate risk but to enable innovation while maintaining trust in the code AI helps produce.</p>
  823.  
  824.  
  825.  
  826. <h2 class="wp-block-heading">Transparency and Accountability</h2>
  827.  
  828.  
  829.  
  830. <p>Research shows AI-generated code is often simpler and more repetitive, <a href="https://arxiv.org/abs/2508.21634" target="_blank" rel="noreferrer noopener">but also more vulnerable</a>, with risks like hardcoded credentials and path traversal exploits. Without observability tools such as prompt logs, provenance tracking, and audit trails, developers can’t ensure reliability or accountability. In other words, AI-generated code is more likely to introduce high-risk security vulnerabilities.</p>
  831.  
  832.  
  833.  
  834. <p>AI’s opacity compounds the problem: A function may appear to “work” yet conceal vulnerabilities that are difficult to trace or explain. Without explainability and safeguards, autonomy quickly becomes a recipe for insecure systems. Tools like <a href="https://atlas.mitre.org/matrices/ATLAS" target="_blank" rel="noreferrer noopener">MITRE ATLAS</a> can help by mapping adversarial tactics against AI models, offering defenders a structured way to anticipate and counter threats.</p>
  835.  
  836.  
  837.  
  838. <h2 class="wp-block-heading">Looking Ahead</h2>
  839.  
  840.  
  841.  
  842. <p>Securing code in the age of AI requires more than patching—it means breaking silos, closing skill gaps, and embedding resilience into every stage of development. The risks may feel familiar, but AI scales them dramatically. Frameworks like Databricks’ AI Risk Framework (DASF) and the NIST AI Risk Management Framework provide structures for governance and transparency, while MITRE ATLAS maps adversarial tactics and real-world attack case studies, giving defenders a structured way to anticipate and mitigate threats to AI systems.</p>
  843.  
  844.  
  845.  
  846. <p>The choices we make now will determine whether AI becomes a trusted partner—or a shortcut that leaves us exposed.</p>
  847.  
  848.  
  849.  
  850. <hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
  851.  
  852.  
  853.  
  854. <p class="has-cyan-bluish-gray-background-color has-background"><strong><em>Ensure your systems remain secure in an increasingly AI-driven world</em></strong><br><br><em>Join Chloé Messdaghi and a lineup of top security professionals and technologists for O’Reilly’s Security Superstream: Secure Code in the Age of AI. They’ll share practical insights, real-world experiences, and emerging trends that will help you code more securely, build and deploy secure models, and protect against AI-specific threats. It’s free for O’Reilly members. <a href="https://learning.oreilly.com/live-events/security-superstream-secure-code-in-the-age-of-ai/0642572204099/" target="_blank" rel="noreferrer noopener">Save your seat here</a>.</em><br><br><em>Not a member? <a href="https://www.oreilly.com/start-trial/" target="_blank" rel="noreferrer noopener">Sign up for a free 10-day trial</a> to attend—and check out all the other great resources on O’Reilly.</em></p>
  855.  
  856.  
  857.  
  858. <p></p>
  859. ]]></content:encoded>
  860. <wfw:commentRss>https://www.oreilly.com/radar/when-ai-writes-code-who-secures-it/feed/</wfw:commentRss>
  861. <slash:comments>0</slash:comments>
  862. </item>
  863. <item>
  864. <title>Taming Chaos with Antifragile GenAI Architecture</title>
  865. <link>https://www.oreilly.com/radar/taming-chaos-with-antifragile-genai-architecture/</link>
  866. <comments>https://www.oreilly.com/radar/taming-chaos-with-antifragile-genai-architecture/#respond</comments>
  867. <pubDate>Thu, 11 Sep 2025 10:49:38 +0000</pubDate>
  868. <dc:creator><![CDATA[Shreshta Shyamsundar]]></dc:creator>
  869. <category><![CDATA[AI & ML]]></category>
  870. <category><![CDATA[Commentary]]></category>
  871.  
  872. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17429</guid>
  873.  
  874.     <media:content
  875. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/AI-in-Chaos-2.jpg"
  876. medium="image"
  877. type="image/jpeg"
  878. />
  879. <custom:subtitle><![CDATA[Turn Volatility into Your Greatest Strategic Asset]]></custom:subtitle>
  880. <description><![CDATA[What if uncertainty wasn&#8217;t something to simply endure but something to actively exploit? The convergence of Nassim Taleb&#8217;s antifragility principles with generative AI capabilities is creating a new paradigm for organizational design powered by generative AI—one where volatility becomes fuel for competitive advantage rather than a threat to be managed. The Antifragility Imperative Antifragility transcends [&#8230;]]]></description>
  881. <content:encoded><![CDATA[
  882. <p>What if uncertainty wasn&#8217;t something to simply endure but something to actively exploit? The convergence of <a href="https://en.wikipedia.org/wiki/Antifragile_(book)" target="_blank" rel="noreferrer noopener">Nassim Taleb&#8217;s antifragility principles</a> with generative AI capabilities is creating a new paradigm for organizational design powered by generative AI—one where volatility becomes fuel for competitive advantage rather than a threat to be managed.</p>
  883.  
  884.  
  885.  
  886. <h2 class="wp-block-heading"><strong>The Antifragility Imperative</strong></h2>
  887.  
  888.  
  889.  
  890. <p>Antifragility transcends resilience. While resilient systems bounce back from stress and robust systems resist change, antifragile systems actively improve when exposed to volatility, randomness, and disorder. This isn&#8217;t just theoretical—it&#8217;s a mathematical property where systems exhibit <strong>positive convexity</strong>, gaining more from favorable variations than they lose from unfavorable ones.</p>
  891.  
  892.  
  893.  
  894. <p>To visualize the concept of positive convexity in antifragile systems, consider a graph where the x-axis represents stress or volatility and the y-axis represents the system&#8217;s response. In such systems, the curve is upward bending (convex), demonstrating that the system gains more from positive shocks than it loses from negative ones—by an accelerating margin.</p>
  895.  
  896.  
  897.  
  898. <p>The convex (upward-curving) line shows that small positive shocks yield increasingly larger gains, while equivalent negative shocks cause comparatively smaller losses.</p>
  899.  
  900.  
  901.  
  902. <p>For comparison, a straight line representing a fragile or linear system shows a proportional (linear) response, with gains and losses of equal magnitude on either side.</p>
  903.  
  904.  
  905.  
  906. <figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1066" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1.png" alt="Graph illustrating positive convexity: Antifragile systems benefit disproportionately from positive variations compared to equivalent negative shocks." class="wp-image-17430" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1-300x200.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1-768x512.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1-1536x1023.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Graph illustrating positive convexity: Antifragile systems benefit disproportionately from positive variations compared to equivalent negative shocks.</em></figcaption></figure>
  907.  
  908.  
  909.  
  910. <p>The concept emerged from Taleb&#8217;s observation that certain systems don&#8217;t just survive Black Swan events—they thrive because of them. Consider how Amazon&#8217;s supply chain AI during the 2020 pandemic demonstrated true antifragility. When lockdowns disrupted normal shipping patterns and consumer behavior shifted dramatically, Amazon&#8217;s demand forecasting systems didn&#8217;t just adapt; they used the chaos as training data. Every stockout, every demand spike for unexpected products like webcams and exercise equipment, every supply chain disruption became input for improving future predictions. The AI learned to identify early signals of changing consumer behavior and supply constraints, making the system more robust for future disruptions.</p>
  911.  
  912.  
  913.  
  914. <p>For technology organizations, this presents a fundamental question: How do we design systems that don&#8217;t just survive unexpected events but benefit from them? The answer lies in implementing specific generative AI architectures that can learn continuously from disorder.</p>
  915.  
  916.  
  917.  
  918. <h2 class="wp-block-heading"><strong>Generative AI: Building Antifragile Capabilities</strong></h2>
  919.  
  920.  
  921.  
  922. <p>Certain generative AI implementations can exhibit antifragile characteristics when designed with continuous learning architectures. Unlike static models deployed once and forgotten, these systems incorporate feedback loops that allow real-time adaptation without full model retraining—a critical distinction given the resource-intensive nature of training large models.</p>
  923.  
  924.  
  925.  
  926. <p>Netflix&#8217;s recommendation system demonstrates this principle. Rather than retraining its entire foundation model, the company continuously updates personalization layers based on user interactions. When users reject recommendations or abandon content midstream, this negative feedback becomes valuable training data that refines future suggestions. The system doesn&#8217;t just learn what users like. It becomes expert at recognizing what they&#8217;ll hate, leading to higher overall satisfaction through accumulated negative knowledge.</p>
  927.  
  928.  
  929.  
  930. <p>The key insight is that these AI systems don&#8217;t just adapt to new conditions; they actively extract information from disorder. When market conditions shift, customer behavior changes, or systems encounter edge cases, properly designed generative AI can identify patterns in the chaos that human analysts might miss. They transform noise into signal, volatility into opportunity.</p>
  931.  
  932.  
  933.  
  934. <h2 class="wp-block-heading"><strong>Error as Information: Learning from Failure</strong></h2>
  935.  
  936.  
  937.  
  938. <p>Traditional systems treat errors as failures to be minimized. Antifragile systems treat errors as information sources to be exploited. This shift becomes powerful when combined with generative AI&#8217;s ability to learn from mistakes and generate improved responses.</p>
  939.  
  940.  
  941.  
  942. <p>IBM Watson for Oncology&#8217;s failure has been attributed to synthetic data problems, but it highlights a critical distinction: Synthetic data isn&#8217;t inherently problematic—it&#8217;s essential in healthcare where patient privacy restrictions limit access to real data. The issue was that Watson was trained exclusively on synthetic, hypothetical cases created by Memorial Sloan Kettering physicians rather than being validated against diverse real-world outcomes. This created a dangerous feedback loop where the AI learned physician preferences rather than evidence-based medicine.</p>
  943.  
  944.  
  945.  
  946. <p>When deployed, Watson recommended potentially fatal treatments—such as prescribing bevacizumab to a 65-year-old lung cancer patient with severe bleeding, despite the drug&#8217;s known risk of causing &#8220;severe or fatal hemorrhage.&#8221; A truly antifragile system would have incorporated mechanisms to detect when its training data diverged from reality—for instance, by tracking recommendation acceptance rates and patient outcomes to identify systematic biases.</p>
  947.  
  948.  
  949.  
  950. <p>This challenge extends beyond healthcare. Consider AI diagnostic systems deployed across different hospitals. A model trained on high-end equipment at a research hospital performs poorly when deployed to field hospitals with older, poorly calibrated CT scanners. An antifragile AI system would treat these equipment variations not as problems to solve but as valuable training data. Each “failed” diagnosis on older equipment becomes information that improves the system&#8217;s robustness across diverse deployment environments.</p>
  951.  
  952.  
  953.  
  954. <h2 class="wp-block-heading"><strong>Netflix: Mastering Organizational Antifragility</strong></h2>
  955.  
  956.  
  957.  
  958. <p>Netflix&#8217;s approach to chaos engineering exemplifies organizational antifragility in practice. The company’s famous “Chaos Monkey” randomly terminates services in production to ensure the system can handle failures gracefully. But more relevant to generative AI is its content recommendation system&#8217;s sophisticated approach to handling failures and edge cases.</p>
  959.  
  960.  
  961.  
  962. <p>When Netflix&#8217;s AI began recommending mature content to family accounts rather than simply adding filters, its team created systematic “chaos scenarios”—deliberately feeding the system contradictory user behavior data to stress-test its decision-making capabilities. They simulated situations where family members had vastly different viewing preferences on the same account or where content metadata was incomplete or incorrect.</p>
  963.  
  964.  
  965.  
  966. <p>The recovery protocols the team developed go beyond simple content filtering. Netflix created hierarchical safety nets: real-time content categorization, user context analysis, and human oversight triggers. Each “failure” in content recommendation becomes data that strengthens the entire system. The AI learns what content to recommend but also when to seek additional context, when to err on the side of caution, and how to gracefully handle ambiguous situations.</p>
  967.  
  968.  
  969.  
  970. <p>This demonstrates a key antifragile principle: The system doesn&#8217;t just prevent similar failures—it becomes more intelligent about handling edge cases it has never encountered before. Netflix&#8217;s recommendation accuracy improved precisely because the system learned to navigate the complexities of shared accounts, diverse family preferences, and content boundary cases.</p>
  971.  
  972.  
  973.  
  974. <h2 class="wp-block-heading"><strong>Technical Architecture: The LOXM Case Study</strong></h2>
  975.  
  976.  
  977.  
  978. <p>JPMorgan&#8217;s LOXM (Learning Optimization eXecution Model) represents the most sophisticated example of antifragile AI in production. Developed by the global equities electronic trading team under Daniel Ciment, LOXM went live in 2017 after training on billions of historical transactions. While this predates the current era of transformer-based generative AI, LOXM was built using deep learning techniques that share fundamental principles with today&#8217;s generative models: the ability to learn complex patterns from data and adapt to new situations through continuous feedback.</p>
  979.  
  980.  
  981.  
  982. <p><strong>Multi-agent architecture</strong>: LOXM uses a reinforcement learning system where specialized agents handle different aspects of trade execution.</p>
  983.  
  984.  
  985.  
  986. <ul class="wp-block-list">
  987. <li>Market microstructure analysis agents learn optimal timing patterns.</li>
  988.  
  989.  
  990.  
  991. <li>Liquidity assessment agents predict order book dynamics in real time.</li>
  992.  
  993.  
  994.  
  995. <li>Impact modeling agents minimize market disruption during large trades.</li>
  996.  
  997.  
  998.  
  999. <li>Risk management agents enforce position limits while maximizing execution quality.</li>
  1000. </ul>
  1001.  
  1002.  
  1003.  
  1004. <p><strong>Antifragile performance under stress</strong>: While traditional trading algorithms struggled with unprecedented conditions during the market volatility of March 2020, LOXM&#8217;s agents used the chaos as learning opportunities. Each failed trade execution, each unexpected market movement, each liquidity crisis became training data that improved future performance.</p>
  1005.  
  1006.  
  1007.  
  1008. <p>The measurable results were striking. LOXM improved execution quality by 50% during the most volatile trading days—exactly when traditional systems typically degrade. This isn&#8217;t just resilience; it&#8217;s mathematical proof of positive convexity where the system gains more from stressful conditions than it loses.</p>
  1009.  
  1010.  
  1011.  
  1012. <p><strong>Technical innovation</strong>: LOXM prevents catastrophic forgetting through “experience replay” buffers that maintain diverse trading scenarios. When new market conditions arise, the system can reference similar historical patterns while adapting to novel situations. The feedback loop architecture uses streaming data pipelines to capture trade outcomes, model predictions, and market conditions in real time, updating model weights through online learning algorithms within milliseconds of trade completion.</p>
  1013.  
  1014.  
  1015.  
  1016. <h2 class="wp-block-heading"><strong>The Information Hiding Principle</strong></h2>
  1017.  
  1018.  
  1019.  
  1020. <p><a href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Information_hiding&amp;sa=D&amp;source=docs&amp;ust=1757588683474584&amp;usg=AOvVaw0GSAvIFboxsnVPkkNh7N8A" target="_blank" rel="noreferrer noopener">David Parnas&#8217;s information hiding principle</a> directly enables antifragility by ensuring that system components can adapt independently without cascading failures. In <a href="https://www.google.com/url?q=https://dl.acm.org/doi/10.1145/361598.361623&amp;sa=D&amp;source=docs&amp;ust=1757588683474722&amp;usg=AOvVaw3U2j64n0pCWduXPdTRr_Mj" target="_blank" rel="noreferrer noopener">his 1972 paper</a>, Parnas emphasized hiding “design decisions likely to change”—exactly what antifragile systems need.</p>
  1021.  
  1022.  
  1023.  
  1024. <p>When LOXM encounters market disruption, its modular design allows individual components to adapt their internal algorithms without affecting other modules. The “secret” of each module—its specific implementation—can evolve based on local feedback while maintaining stable interfaces with other components.</p>
  1025.  
  1026.  
  1027.  
  1028. <p>This architectural pattern prevents what Taleb calls “tight coupling”—where stress in one component propagates throughout the system. Instead, stress becomes localized learning opportunities that strengthen individual modules without destabilizing the whole system.</p>
  1029.  
  1030.  
  1031.  
  1032. <h2 class="wp-block-heading"><strong>Via Negativa in Practice</strong></h2>
  1033.  
  1034.  
  1035.  
  1036. <p>Nassim Taleb&#8217;s concept of &#8220;via negativa&#8221;—defining systems by what they&#8217;re not rather than what they are—translates directly to building antifragile AI systems.</p>
  1037.  
  1038.  
  1039.  
  1040. <p>When Airbnb&#8217;s search algorithm was producing poor results, instead of adding more ranking factors (the typical approach), the company applied via negativa: It systematically removed listings that consistently received poor ratings, hosts who didn&#8217;t respond promptly, and properties with misleading photos. By eliminating negative elements, the remaining search results naturally improved.</p>
  1041.  
  1042.  
  1043.  
  1044. <p>Netflix&#8217;s recommendation system similarly applies via negativa by maintaining “negative preference profiles”—systematically identifying and avoiding content patterns that lead to user dissatisfaction. Rather than just learning what users like, the system becomes expert at recognizing what they&#8217;ll hate, leading to higher overall satisfaction through subtraction rather than addition.</p>
  1045.  
  1046.  
  1047.  
  1048. <p>In technical terms, via negativa means starting with maximum system flexibility and systematically removing constraints that don&#8217;t add value—allowing the system to adapt to unforeseen circumstances rather than being locked into rigid predetermined behaviors.</p>
  1049.  
  1050.  
  1051.  
  1052. <h2 class="wp-block-heading"><strong>Implementing Continuous Feedback Loops</strong></h2>
  1053.  
  1054.  
  1055.  
  1056. <p>The feedback loop architecture requires three components: error detection, learning integration, and system adaptation. In LOXM&#8217;s implementation, market execution data flows back into the model within milliseconds of trade completion. The system uses streaming data pipelines to capture trade outcomes, model predictions, and market conditions in real time. Machine learning models continuously compare predicted execution quality to actual execution quality, updating model weights through online learning algorithms. This creates a continuous feedback loop where each trade makes the next trade execution more intelligent.</p>
  1057.  
  1058.  
  1059.  
  1060. <p>When a trade execution deviates from expected performance—whether due to market volatility, liquidity constraints, or timing issues—this immediately becomes training data. The system doesn&#8217;t wait for batch processing or scheduled retraining; it adapts in real time while maintaining stable performance for ongoing operations.</p>
  1061.  
  1062.  
  1063.  
  1064. <h2 class="wp-block-heading"><strong>Organizational Learning Loop</strong></h2>
  1065.  
  1066.  
  1067.  
  1068. <p>Antifragile organizations must cultivate specific learning behaviors beyond just technical implementations. This requires moving beyond traditional risk management approaches toward Taleb’s &#8220;via negativa.&#8221;</p>
  1069.  
  1070.  
  1071.  
  1072. <p>The learning loop involves three phases: stress identification, system adaptation, and capability improvement. Teams regularly expose systems to controlled stress, observe how they respond, and then use generative AI to identify improvement opportunities. Each iteration strengthens the system&#8217;s ability to handle future challenges.</p>
  1073.  
  1074.  
  1075.  
  1076. <p>Netflix institutionalized this through monthly &#8220;chaos drills&#8221; where teams deliberately introduce failures—API timeouts, database connection losses, content metadata corruption—and observe how their AI systems respond. Each drill generates postmortems focused not on blame but on extracting learning from the failure scenarios.</p>
  1077.  
  1078.  
  1079.  
  1080. <h2 class="wp-block-heading"><strong>Measurement and Validation</strong></h2>
  1081.  
  1082.  
  1083.  
  1084. <p>Antifragile systems require new metrics beyond traditional availability and performance measures. Key metrics include:</p>
  1085.  
  1086.  
  1087.  
  1088. <ul class="wp-block-list">
  1089. <li>Adaptation speed: Time from anomaly detection to corrective action</li>
  1090.  
  1091.  
  1092.  
  1093. <li>Information extraction rate: Number of meaningful model updates per disruption event</li>
  1094.  
  1095.  
  1096.  
  1097. <li>Asymmetric performance factor: Ratio of system gains from positive shocks to losses from negative ones</li>
  1098. </ul>
  1099.  
  1100.  
  1101.  
  1102. <p>LOXM tracks these metrics alongside financial outcomes, demonstrating quantifiable improvement in antifragile capabilities over time. During high-volatility periods, the system&#8217;s asymmetric performance factor consistently exceeds 2.0—meaning it gains twice as much from favorable market movements as it loses from adverse ones.</p>
  1103.  
  1104.  
  1105.  
  1106. <h2 class="wp-block-heading"><strong>The Competitive Advantage</strong></h2>
  1107.  
  1108.  
  1109.  
  1110. <p>The goal isn&#8217;t just surviving disruption—it&#8217;s creating competitive advantage through chaos. When competitors struggle with market volatility, antifragile organizations extract value from the same conditions. They don&#8217;t just adapt to change; they actively seek out uncertainty as fuel for growth.</p>
  1111.  
  1112.  
  1113.  
  1114. <p>Netflix&#8217;s ability to recommend content accurately during the pandemic, when viewing patterns shifted dramatically, gave it a significant advantage over competitors whose recommendation systems struggled with the new normal. Similarly, LOXM&#8217;s superior performance during market stress periods has made it JPMorgan&#8217;s primary execution algorithm for institutional clients.</p>
  1115.  
  1116.  
  1117.  
  1118. <p>This creates sustainable competitive advantage because antifragile capabilities compound over time. Each disruption makes the system stronger, more adaptive, and better positioned for future challenges.</p>
  1119.  
  1120.  
  1121.  
  1122. <h2 class="wp-block-heading"><strong>Beyond Resilience: The Antifragile Future</strong></h2>
  1123.  
  1124.  
  1125.  
  1126. <p>We&#8217;re witnessing the emergence of a new organizational paradigm. The convergence of antifragility principles with generative AI capabilities represents more than incremental improvement—it&#8217;s a fundamental shift in how organizations can thrive in uncertain environments.</p>
  1127.  
  1128.  
  1129.  
  1130. <p>The path forward requires commitment to experimentation, tolerance for controlled failure, and systematic investment in adaptive capabilities. Organizations must evolve from asking &#8220;How do we prevent disruption?&#8221; to &#8220;How do we benefit from disruption?&#8221;</p>
  1131.  
  1132.  
  1133.  
  1134. <p>The question isn&#8217;t whether your organization will face uncertainty and disruption—it&#8217;s whether you&#8217;ll be positioned to extract competitive advantage from chaos when it arrives. The integration of antifragility principles with generative AI provides the roadmap for that transformation, demonstrated by organizations like Netflix and JPMorgan that have already turned volatility into their greatest strategic asset.</p>
  1135. ]]></content:encoded>
  1136. <wfw:commentRss>https://www.oreilly.com/radar/taming-chaos-with-antifragile-genai-architecture/feed/</wfw:commentRss>
  1137. <slash:comments>0</slash:comments>
  1138. </item>
  1139. <item>
  1140. <title>Building AI-Resistant Technical Debt</title>
  1141. <link>https://www.oreilly.com/radar/building-ai-resistant-technical-debt/</link>
  1142. <comments>https://www.oreilly.com/radar/building-ai-resistant-technical-debt/#respond</comments>
  1143. <pubDate>Wed, 10 Sep 2025 10:03:48 +0000</pubDate>
  1144. <dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
  1145. <category><![CDATA[AI & ML]]></category>
  1146. <category><![CDATA[Commentary]]></category>
  1147.  
  1148. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17422</guid>
  1149.  
  1150.     <media:content
  1151. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-7.jpg"
  1152. medium="image"
  1153. type="image/jpeg"
  1154. />
  1155. <custom:subtitle><![CDATA[When Speed Creates Long-Term Pain]]></custom:subtitle>
  1156. <description><![CDATA[Anyone who’s used AI to generate code has seen it make mistakes. But the real danger isn’t the occasional wrong answer; it’s in what happens when those errors pile up across a codebase. Issues that seem small at first can compound quickly, making code harder to understand, maintain, and evolve. To really see that danger, [&#8230;]]]></description>
  1157. <content:encoded><![CDATA[
  1158. <p>Anyone who’s used AI to generate code has seen it make mistakes. But the real danger isn’t the occasional wrong answer; it’s in what happens when those errors pile up across a codebase. Issues that seem small at first can compound quickly, making code harder to understand, maintain, and evolve. To really see that danger, you have to look at how AI is used in practice—which for many developers starts with vibe coding.</p>
  1159.  
  1160.  
  1161.  
  1162. <p><strong>Vibe coding</strong> is an exploratory, prompt-first approach to software development where developers rapidly prompt, get code, and iterate. When the code seems close but not quite right, the developer describes what&#8217;s wrong and lets the AI try again. When it doesn&#8217;t compile or tests fail, they copy the error messages back to the AI. The cycle continues—prompt, run, error, paste, prompt again—often without reading or understanding the generated code. It feels productive because you&#8217;re making visible progress: errors disappear, tests start passing, features seem to work. You&#8217;re treating the AI like a coding partner who handles the implementation details while you steer at a high level.</p>
  1163.  
  1164.  
  1165.  
  1166. <p>Developers use vibe coding to explore and refine ideas and can generate large amounts of code quickly. It’s often the natural first step for most developers using AI tools, because it feels so intuitive and productive. Vibe coding offloads detail to the AI, making exploration and ideation fast and effective—which is exactly why it’s so popular.</p>
  1167.  
  1168.  
  1169.  
  1170. <p>The AI generates a lot of code, and it’s not practical to review every line every time it regenerates. Trying to read it all can lead to <strong>cognitive overload</strong>—mental exhaustion from wading through too much code—and makes it harder to throw away code that isn’t working just because you already invested time in reading it.</p>
  1171.  
  1172.  
  1173.  
  1174. <p>Vibe coding is a normal and useful way to explore with AI, but on its own it presents a significant risk. The models used by LLMs can hallucinate and produce made-up answers—for example, generating code that calls APIs or methods that don’t even exist. Preventing those AI-generated mistakes from compromising your codebase starts with understanding the capabilities and limitations of these tools, and taking an approach to AI-assisted development that takes those limitations into account.</p>
  1175.  
  1176.  
  1177.  
  1178. <p>Here&#8217;s a simple example of how these issues compound. When I ask AI to generate a class that handles user interaction, it often creates methods that directly read from and write to the console. When I then ask it to make the code more testable, if I don’t very specifically prompt for a simple fix like having methods take input as parameters and return output as values, the AI frequently suggests wrapping the entire I/O mechanism in an abstraction layer. Now I have an interface, an implementation, mock objects for testing, and dependency injection throughout. What started as a straightforward class has become a miniature framework. The AI isn&#8217;t wrong, exactly—the abstraction approach is a valid pattern—but it&#8217;s overengineered for the problem at hand. Each iteration adds more complexity, and if you&#8217;re not paying attention, you&#8217;ll end up with layers upon layers of unnecessary code. This is a good example of how vibe coding can balloon into unnecessary complexity if you don’t stop to verify what’s happening.</p>
  1179.  
  1180.  
  1181.  
  1182. <h2 class="wp-block-heading">Novice Developers Face a New Kind of Technical Debt Challenge with AI</h2>
  1183.  
  1184.  
  1185.  
  1186. <p>Three months after writing their first line of code, a Reddit user going by SpacetimeSorcerer posted a frustrated update: Their AI-assisted project had reached the point where making any change meant editing dozens of files. The design had hardened around early mistakes, and every change brought a wave of debugging. They’d hit the wall known in software design as “shotgun surgery,” where a single change ripples through so much code that it’s risky and slow to work on—a classic sign of <strong>technical debt</strong>, the hidden cost of early shortcuts that make future changes harder and more expensive.</p>
  1187.  
  1188.  
  1189.  
  1190. <figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="547" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image.png" alt="I am giving up" class="wp-image-17423" title="I am giving up" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-300x103.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-768x263.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/image-1536x525.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>A Reddit post describing the frustration of AI-accelerated technical debt (used with permission).</em></figcaption></figure>
  1191.  
  1192.  
  1193.  
  1194. <p>AI didn’t cause the problem directly; the code worked (until it didn’t). But the speed of AI-assisted development let this new developer skip the design thinking that prevents these patterns from forming. The same thing happens to experienced developers when deadlines push delivery over maintainability. The difference is, an experienced developer often knows they’re taking on debt. They can spot antipatterns early because they’ve seen them repeatedly, and take steps to “pay off” the debt before it gets much more expensive to fix. Someone new to coding may not even realize it’s happening until it’s too late—and they haven’t yet built the tools or habits to prevent it.</p>
  1195.  
  1196.  
  1197.  
  1198. <p>Part of the reason new developers are especially vulnerable to this problem goes back to the <strong>Cognitive Shortcut Paradox</strong>.<sup>1</sup> Without enough hands-on experience debugging, refactoring, and working through ambiguous requirements, they don’t have the instincts built up through experience to spot structural problems in AI-generated code. The AI can hand them a clean, working solution. But if they can’t see the design flaws hiding inside it, those flaws grow unchecked until they’re locked into the project, built into the foundations of the code so changing them requires extensive, frustrating work.</p>
  1199.  
  1200.  
  1201.  
  1202. <p>The signals of AI-accelerated technical debt show up quickly: highly coupled code where modules depend on each other’s internal details; “God objects” with too many responsibilities; overly structured solutions where a simple problem gets buried under extra layers. These are the same problems that typically reflect technical debt in human-built code; the reason they emerge so quickly in AI-generated code is because it can be generated much more quickly and without oversight or intentional design or architectural decisions being made. AI can generate these patterns convincingly, making them look deliberate even when they emerged by accident. Because the output compiles, passes tests, and works as expected, it’s easy to accept as “done” without thinking about how it will hold up when requirements change.</p>
  1203.  
  1204.  
  1205.  
  1206. <p>When adding or updating a unit test feels unreasonably difficult, that’s often the first sign the design is too rigid. The test is telling you something about the structure—maybe the code is too intertwined, maybe the boundaries are unclear. This feedback loop works whether the code was AI-generated or handwritten, but with AI the friction often shows up later, after the code has already been merged.</p>
  1207.  
  1208.  
  1209.  
  1210. <p>That’s where the “trust but verify” habit comes in. Trust the AI to give you a starting point, but verify that the design supports change, testability, and clarity. Ask yourself whether the code will still make sense to you—or anyone else—months from now. In practice, this can mean quick design reviews even for AI-generated code, refactoring when coupling or duplication starts to creep in, and taking a deliberate pass at naming so variables and functions read clearly. These aren’t optional touches; they’re what keep a codebase from locking in its worst early decisions.</p>
  1211.  
  1212.  
  1213.  
  1214. <p>AI can help with this too: It can suggest refactorings, point out duplicated logic, or help extract messy code into cleaner abstractions. But it’s up to you to direct it to make those changes, which means you have to spot them first—which is much easier for experienced developers who have seen these problems over the course of many projects.</p>
  1215.  
  1216.  
  1217.  
  1218. <p>Left to its defaults, AI-assisted development is biased toward adding new code, not revisiting old decisions. The discipline to avoid technical debt comes from building design checks into your workflow so AI’s speed works in service of maintainability instead of against it.</p>
  1219.  
  1220.  
  1221.  
  1222. <hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
  1223.  
  1224.  
  1225.  
  1226. <h2 class="wp-block-heading">Footnote</h2>
  1227.  
  1228.  
  1229.  
  1230. <ol class="wp-block-list">
  1231. <li>I&#8217;ll discuss this in more detail in a forthcoming Radar article on October 8.</li>
  1232. </ol>
  1233.  
  1234.  
  1235.  
  1236. <p></p>
  1237. ]]></content:encoded>
  1238. <wfw:commentRss>https://www.oreilly.com/radar/building-ai-resistant-technical-debt/feed/</wfw:commentRss>
  1239. <slash:comments>0</slash:comments>
  1240. </item>
  1241. <item>
  1242. <title>Megawatts and Gigawatts of AI</title>
  1243. <link>https://www.oreilly.com/radar/megawatts-and-gigawatts-of-ai/</link>
  1244. <comments>https://www.oreilly.com/radar/megawatts-and-gigawatts-of-ai/#respond</comments>
  1245. <pubDate>Tue, 09 Sep 2025 10:54:36 +0000</pubDate>
  1246. <dc:creator><![CDATA[Mike Loukides]]></dc:creator>
  1247. <category><![CDATA[AI & ML]]></category>
  1248. <category><![CDATA[Commentary]]></category>
  1249.  
  1250. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17410</guid>
  1251.  
  1252.     <media:content
  1253. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Smaller-Is-Power.jpg"
  1254. medium="image"
  1255. type="image/jpeg"
  1256. />
  1257. <description><![CDATA[We can’t not talk about power these days. We’ve been talking about it ever since the Stargate project, with half a trillion dollars in data center investment, was floated early in the year. We’ve been talking about it ever since the now-classic “Stochastic Parrots” paper. And, as time goes on, it only becomes more of [&#8230;]]]></description>
  1258. <content:encoded><![CDATA[
  1259. <p>We can’t not talk about power these days. We’ve been talking about it ever since the <a href="https://en.wikipedia.org/wiki/Stargate_LLC" target="_blank" rel="noreferrer noopener">Stargate</a> project, with half a trillion dollars in data center investment, was floated early in the year. We’ve been talking about it ever since the now-classic “<a href="https://dl.acm.org/doi/10.1145/3442188.3445922" target="_blank" rel="noreferrer noopener">Stochastic Parrots</a>” paper. And, as time goes on, it only becomes more of an issue.</p>
  1260.  
  1261.  
  1262.  
  1263. <p>&#8220;Stochastic Parrots&#8221; deals with two issues: AI’s power consumption and the fundamental nature of generative AI; selecting sequences of words according to statistical patterns. I always wished those were two papers, because it would be easier to disagree about power and agree about parrots. For me, the power issue is something of a red herring—but increasingly, I see that it’s a red herring that isn’t going away because too many people with too much money want herrings; too many believe that a monopoly on power (or a monopoly on the ability to pay for power) is the route to dominance.</p>
  1264.  
  1265.  
  1266.  
  1267. <p>Why, in a better world than we currently live in, would the power issue be a red herring? There are several related reasons:</p>
  1268.  
  1269.  
  1270.  
  1271. <ul class="wp-block-list">
  1272. <li>I have always assumed that the first generation language models would be highly inefficient, and that over time, we’d develop more efficient algorithms.</li>
  1273.  
  1274.  
  1275.  
  1276. <li>I have also assumed that the economics of language models would be similar to chip foundries or pharma factories: The first chip coming out of a foundry costs a few billion dollars, everything afterward is a penny apiece.</li>
  1277.  
  1278.  
  1279.  
  1280. <li>I believe (now more than ever) that, long-term, we will settle on small models (70B parameters or less) that can run locally rather than giant models with trillions of parameters running in the cloud.</li>
  1281. </ul>
  1282.  
  1283.  
  1284.  
  1285. <p>And I still believe those points are largely true. But that’s not sufficient. Let’s go through them one by one, starting with efficiency.</p>
  1286.  
  1287.  
  1288.  
  1289. <h2 class="wp-block-heading">Better Algorithms</h2>
  1290.  
  1291.  
  1292.  
  1293. <p>A few years ago, I saw a fair number of papers about more efficient models. I remember a lot of articles about pruning neural networks (eliminating nodes that contribute little to the result) and other techniques. Papers that address efficiency are still being published—most notably, DeepMind’s recent “<a href="https://arxiv.org/abs/2507.10524" target="_blank" rel="noreferrer noopener">Mixture-of-Recursions</a>” paper—but they don’t seem to be as common. That’s just anecdata, and should perhaps be ignored. More to the point, DeepSeek shocked the world with their R1 model, which they claimed cost roughly 1/10 as much to train as the leading frontier models. A lot of commentary insisted that DeepSeek wasn’t being up front in their measurement of power consumption, but since then several other Chinese labs have released highly capable models, with no gigawatt data centers in sight. Even more recently, OpenAI has released gpt-oss in two sizes (120B and 30B), which were <a href="https://www.theinformation.com/articles/openai-says-gpt-5-one-size-fits-new-open-model-cheap-train?rc=7em78a" target="_blank" rel="noreferrer noopener">reportedly</a> much less expensive to train. It’s not the first time this has happened—I’ve been told that the Soviet Union developed amazingly efficient data compression algorithms because their computers were a decade behind ours. Better algorithms can trump larger power bills, better CPUs, and more GPUs, if we let them.</p>
  1294.  
  1295.  
  1296.  
  1297. <p>What’s wrong with this picture? The picture is good, but much of the narrative is US-centric, and that distorts it. First, it’s distorted by our belief that bigger is always better: Look at our cars, our SUVs, our houses. We’re conditioned to believe that a model with a trillion parameters has to be better than a model with a mere 70B, right? That a model that cost <a href="https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,4%20had%201%20trillion%20parameters." target="_blank" rel="noreferrer noopener">a hundred million dollars</a> to train has to be better than one that can be trained economically? That myth is deeply embedded in our psyche. Second, it’s distorted by economics. Bigger is better is a myth that would-be monopolists play on when they talk about the need for ever bigger data centers, preferably funded with tax dollars. It’s a convenient myth, because convincing would-be competitors that they need to spend billions on data centers is an effective way to have no competitors.</p>
  1298.  
  1299.  
  1300.  
  1301. <p>One area that hasn’t been sufficiently explored is extremely small models developed for specialized tasks. Drew Breunig <a href="https://www.dbreunig.com/2025/08/01/does-the-bitter-lesson-have-limits.html" target="_blank" rel="noreferrer noopener">writes</a> about the tiny chess model in Stockfish, the world’s leading chess program: It’s small enough to run in an iPhone, and replaced a much larger general-purpose model. And it <a href="https://www.youtube.com/watch?v=yc0bFlW56tY&amp;t=528s" target="_blank" rel="noreferrer noopener">soundly defeated</a> Claude Sonnet 3.5 and GPT-4o.<sup>1</sup> He also writes about the 27 million parameter <a href="https://arxiv.org/pdf/2506.21734" target="_blank" rel="noreferrer noopener">Hierarchical Reasoning Model (HRM)</a> that has beaten models like Claude 3.7 on the ARC benchmark. Pete Warden’s Moonshine does real-time speech-to-text transcription in the browser—and is as good as any high-end model I’ve seen. None of these are general-purpose models. They won’t vibe code; they won’t write your blog posts. But they are extremely effective at what they do. And if AI is going to fulfill its destiny of “disappearing into the walls,” of becoming part of our everyday infrastructure, we will need very accurate, very specialized models. We will have to free ourselves of the myth that bigger is better.<sup>2</sup></p>
  1302.  
  1303.  
  1304.  
  1305. <h2 class="wp-block-heading">The Cost of Inference</h2>
  1306.  
  1307.  
  1308.  
  1309. <p>The purpose of a model isn’t to be trained; it’s to do inference. This is a gross simplification, but part of training is doing inference trillions of times and adjusting the model’s billions of parameters to minimize error. A single request takes an extremely small fraction of the effort required to train a model. That fact leads directly to the economics of chip foundries: The ability to process the first prompt costs millions of dollars, but once they’re in production, <a href="https://andymasley.substack.com/p/a-cheat-sheet-for-conversations-about" target="_blank" rel="noreferrer noopener">processing a prompt costs fractions of a cent</a>. Google has <a href="https://services.google.com/fh/files/misc/measuring_the_environmental_impact_of_delivering_ai_at_google_scale.pdf" target="_blank" rel="noreferrer noopener">claimed</a> that processing a typical text prompt to Gemini takes 0.24 watt-hours, significantly less than it takes to heat water for a cup of coffee. They also claim that increases in software efficiency have led to a 33x reduction in energy consumption over the past year.</p>
  1310.  
  1311.  
  1312.  
  1313. <p>That’s obviously not the entire story: Millions of people prompting ChatGPT adds up, as does usage of newer “reasoning” modules that have an extended internal dialog before arriving at a result. Likewise, driving to work rather than biking raises the global temperature a nanofraction of a degree—but when you multiply the nanofraction by billions of commuters, it’s a different story. It’s fair to say that an individual who uses ChatGPT or Gemini isn’t a problem, but it’s also important to realize that millions of users pounding on an AI service can grow into a problem quite quickly. Unfortunately, it’s also true that increases in efficiency often don’t lead to reductions in energy use but to solving more complex problems within the same energy budget. We may be seeing that with reasoning models, image and video generation models, and other applications that are now becoming financially feasible. Does this problem require gigawatt data centers? No, not that, but it’s a problem that can justify the building of gigawatt data centers.</p>
  1314.  
  1315.  
  1316.  
  1317. <p>There is a solution, but it requires rethinking the problem. Telling people to use public transportation or bicycles for their commute is ineffective (in the US), as will be telling people not to use AI. The problem needs to be rethought: redesigning work to eliminate the commute (O’Reilly is 100% work from home), rethinking the way we use AI so that it doesn’t require cloud-hosted trillion parameter models. That brings us to using AI locally.</p>
  1318.  
  1319.  
  1320.  
  1321. <h2 class="wp-block-heading">Staying Local</h2>
  1322.  
  1323.  
  1324.  
  1325. <p>Almost everything we do with GPT-*, Claude-*, Gemini-*, and other frontier models could be done equally effectively on much smaller models running locally: in a small corporate machine room or even on a laptop. Running AI locally also shields you from problems with availability, bandwidth, limits on usage, and leaking private data. This is a story that would-be monopolists don’t want us to hear. Again, this is anecdata, but I’ve been very impressed by the results I get from running models in the 30 billion parameter range on my laptop. I do vibe coding and get mostly correct code that the model can (usually) fix for me; I ask for summaries of blogs and papers and get excellent results. Anthropic, Google, and OpenAI are competing for tenths of a percentage point on highly gamed benchmarks, but I doubt that those benchmark scores have much practical meaning. I would love to see a study on the difference between Qwen3-30B and GPT-5.</p>
  1326.  
  1327.  
  1328.  
  1329. <p>What does that mean for energy costs? It’s unclear. Gigawatt data centers for doing inference would go unneeded if people do inference locally, but what are the consequences of a billion users doing inference on high-end laptops? If I give my local AIs a difficult problem, my laptop heats up and runs its fans. It’s using more electricity. And laptops aren’t as efficient as data centers that have been designed to minimize electrical use. It’s all well and good to scoff at gigawatts, but when you’re using that much power, minimizing power consumption saves a lot of money. Economies of scale are real. Personally, I’d bet on the laptops: Computing with 30 billion parameters is undoubtedly going to be less energy-intensive than computing with 3 trillion parameters. But I won’t hold my breath waiting for someone to do this research.</p>
  1330.  
  1331.  
  1332.  
  1333. <p>There’s another side to this question, and that involves models that “reason.” So-called “reasoning models” have an internal conversation (not always visible to the user) in which the model “plans” the steps it will take to answer the prompt. A recent paper <a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/" target="_blank" rel="noreferrer noopener">claims</a> that smaller open source models tend to generate many more reasoning tokens than large models (3 to 10 times as many, depending on the models you’re comparing), and that the extensive reasoning process eats away at the economics of the smaller models. Reasoning tokens must be processed, the same as any user-generated tokens; this processing incurs charges (which the paper discusses), and charges presumably relate directly to power.</p>
  1334.  
  1335.  
  1336.  
  1337. <p>While it’s surprising that small models generate more reasoning tokens, it’s no surprise that reasoning is expensive, and we need to take that into account. Reasoning is a tool to be used; it tends to be particularly useful when a model is asked to solve a problem in mathematics. It’s much less useful when the task involves looking up facts, summarization, writing, or making recommendations. It can help in areas like software design but is likely to be a liability for generative coding. In these cases, the reasoning process can actually become misleading—in addition to burning tokens. Deciding how to use models effectively, whether you’re running them locally or in the cloud, is a task that falls to us.</p>
  1338.  
  1339.  
  1340.  
  1341. <p>Going to the giant reasoning models for the “best possible answer” is always a temptation, especially when you know you don’t need the best possible answer. It takes some discipline to commit to the smaller models—even though it’s difficult to argue that using the frontier models is less work. You still have to analyze their output and check their results. And I confess: As committed as I am to the smaller models, I tend to stick with models in the 30B range, and avoid the 1B–5B models (including the excellent Gemma 3N). Those models, I’m sure, would give good results, use even less power, and run even faster. But I’m still in the process of peeling myself away from my knee-jerk assumptions.</p>
  1342.  
  1343.  
  1344.  
  1345. <p>Bigger isn’t necessarily better; more power isn’t necessarily the route to AI dominance. We don’t yet know how this will play out, but I’d place my bets on smaller models running locally and trained with efficiency in mind. There will no doubt be some applications that require large frontier models—perhaps generating synthetic data for training the smaller models—but we really need to understand where frontier models are needed, and where they aren’t. My bet is that they’re rarely needed. And if we free ourselves from the desire to use the latest, largest frontier model just because it’s there—whether or not it serves your purpose any better than a 30B model—we won’t need most of those giant data centers. Don’t be seduced by the AI-industrial complex.</p>
  1346.  
  1347.  
  1348.  
  1349. <hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
  1350.  
  1351.  
  1352.  
  1353. <h2 class="wp-block-heading">Footnotes</h2>
  1354.  
  1355.  
  1356.  
  1357. <ol class="wp-block-list">
  1358. <li>I’m not aware of games between Sockfish and more recent Claude 4, Claude 4.1, and GPT-5 models. There’s every reason to believe the results would be similar.</li>
  1359.  
  1360.  
  1361.  
  1362. <li>Kevlin Henney makes a related point in “<a href="https://www.oreilly.com/radar/scaling-false-peaks/" target="_blank" rel="noreferrer noopener">Scaling False Peaks</a>.”</li>
  1363. </ol>
  1364. ]]></content:encoded>
  1365. <wfw:commentRss>https://www.oreilly.com/radar/megawatts-and-gigawatts-of-ai/feed/</wfw:commentRss>
  1366. <slash:comments>0</slash:comments>
  1367. </item>
  1368. <item>
  1369. <title>A “Beam Versus Dataflow” Conversation</title>
  1370. <link>https://www.oreilly.com/radar/a-beam-versus-dataflow-conversation/</link>
  1371. <comments>https://www.oreilly.com/radar/a-beam-versus-dataflow-conversation/#respond</comments>
  1372. <pubDate>Mon, 08 Sep 2025 10:28:48 +0000</pubDate>
  1373. <dc:creator><![CDATA[Aaron Black]]></dc:creator>
  1374. <category><![CDATA[AI & ML]]></category>
  1375. <category><![CDATA[Research]]></category>
  1376.  
  1377. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17406</guid>
  1378.  
  1379.     <media:content
  1380. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Beam-pipeline.jpg"
  1381. medium="image"
  1382. type="image/jpeg"
  1383. />
  1384. <description><![CDATA[I&#8217;ve been in a few recent conversations about whether to use Apache Beam on its own or run it with Google Dataflow. On the surface, it’s a tooling decision. But it also reflects a broader conversation about how teams build systems. Beam offers a consistent programming model for unifying batch and streaming logic. It doesn’t [&#8230;]]]></description>
  1385. <content:encoded><![CDATA[
  1386. <p>I&#8217;ve been in a few recent conversations about whether to use Apache Beam on its own or run it with Google Dataflow. On the surface, it’s a tooling decision. But it also reflects a broader conversation about how teams build systems.</p>
  1387.  
  1388.  
  1389.  
  1390. <p>Beam offers a consistent programming model for unifying batch and streaming logic. It doesn’t dictate where that logic runs. You can deploy pipelines on Flink or Spark, or you can use a managed runner like Dataflow. Each option outfits the same Beam code with very different execution semantics.</p>
  1391.  
  1392.  
  1393.  
  1394. <p>What’s added urgency to this choice is the <a href="https://bostoninstituteofanalytics.org/blog/the-rise-of-real-time-data-science-in-2025-tools-trends-and-techniques/" target="_blank" rel="noreferrer noopener">growing pressure on data systems to support machine learning and AI workloads</a>. It&#8217;s no longer enough to transform, validate, and load. Teams also need to feed real-time inference, scale feature processing, and orchestrate retraining workflows as part of pipeline development. Beam and Dataflow are both increasingly positioned as infrastructure that supports not just analytics but active AI.</p>
  1395.  
  1396.  
  1397.  
  1398. <p>Choosing one path over the other means making decisions about flexibility, integration surface, runtime ownership, and operational scale. None of those are easy knobs to adjust after the fact.</p>
  1399.  
  1400.  
  1401.  
  1402. <p>The goal here is to unpack the trade-offs and help teams make deliberate calls about what kind of infrastructure they’ll want.</p>
  1403.  
  1404.  
  1405.  
  1406. <h2 class="wp-block-heading"><strong>Apache Beam: A Common Language for Pipelines</strong></h2>
  1407.  
  1408.  
  1409.  
  1410. <p>Apache Beam provides a shared model for expressing data processing workflows. This includes the kinds of batch and streaming tasks most data teams are already familiar with, but it also now includes a growing set of patterns specific to AI and ML.</p>
  1411.  
  1412.  
  1413.  
  1414. <p>Developers write Beam pipelines using a single SDK that defines what the pipeline does, not how the underlying engine runs it. That logic can include parsing logs, transforming records, joining events across time windows, and applying trained models to incoming data using built-in inference transforms.</p>
  1415.  
  1416.  
  1417.  
  1418. <p>Support for AI-specific workflow steps is improving. Beam now offers the <a href="https://beam.apache.org/documentation/transforms/python/elementwise/runinference/" target="_blank" rel="noreferrer noopener">RunInference API</a>, along with <a href="https://beam.apache.org/documentation/transforms/python/elementwise/mltransform/" target="_blank" rel="noreferrer noopener">MLTransform</a> utilities, to help deploy models trained in frameworks like TensorFlow, PyTorch, and scikit-learn into Beam pipelines. These can be used in batch workflows for bulk scoring or in low-latency streaming pipelines where inference is applied to live events.</p>
  1419.  
  1420.  
  1421.  
  1422. <p>Crucially, this isn’t tied to one cloud. Beam lets you define the transformation once and pick the execution path later. You can run the exact same pipeline on Flink, Spark, or Dataflow. That level of portability doesn’t remove infrastructure concerns on its own, but it does allow you to focus your engineering effort on logic rather than rewrites.</p>
  1423.  
  1424.  
  1425.  
  1426. <p>Beam gives you a way to describe and maintain machine learning pipelines. What’s left is deciding how you want to operate them.</p>
  1427.  
  1428.  
  1429.  
  1430. <h2 class="wp-block-heading"><strong>Running Beam: Self-Managed Versus Managed</strong></h2>
  1431.  
  1432.  
  1433.  
  1434. <p>If you’re running Beam on Flink, Spark, or some custom runner, you’re responsible for the full runtime environment. You handle provisioning, scaling, fault tolerance, tuning, and observability. Beam becomes another user of your platform. That degree of control can be useful, especially if model inference is only one part of a larger pipeline that already runs in your infrastructure. Custom logic, proprietary connectors, or non-standard state handling might push you toward keeping everything self-managed.</p>
  1435.  
  1436.  
  1437.  
  1438. <p>But building for inference at scale, especially in streaming, <a href="https://openproceedings.org/2024/conf/edbt/paper-156.pdf" target="_blank" rel="noreferrer noopener">introduces friction</a>. It means tracking model versions across pipeline jobs. It means watching watermarks and tuning triggers so inference happens precisely when it should. It means managing restart logic and making sure models fail gracefully when cloud resources or updatable weights are unavailable. If your team is already running distributed systems, that may be fine. But it isn’t free.</p>
  1439.  
  1440.  
  1441.  
  1442. <p>Running Beam on Dataflow simplifies much of this by taking infrastructure management out of your hands. You still build your pipeline the same way. But once deployed to Dataflow, scaling and resource provisioning are handled by the platform. Dataflow pipelines can stream through inference using native Beam transforms and benefit from newer features like automatic model refresh and tight integration with Google Cloud services.</p>
  1443.  
  1444.  
  1445.  
  1446. <p>This is particularly relevant when <a href="https://cloud.google.com/vertex-ai/docs/pipelines/dataflow-component" target="_blank" rel="noreferrer noopener">working with Vertex AI</a>, which allows hosted model deployment, feature store lookups, and GPU-accelerated inference to plug straight into your pipeline. Dataflow enables those connections with lower latency and minimal manual setup. For some teams, that makes it the better fit by default.</p>
  1447.  
  1448.  
  1449.  
  1450. <p>Of course, not every ML workload needs end-to-end cloud integration. And not every team wants to give up control of their pipeline execution. That’s why understanding what each option provides is necessary before making long-term infrastructure bets.</p>
  1451.  
  1452.  
  1453.  
  1454. <h2 class="wp-block-heading"><strong>Choosing the Execution Model That Matches Your Team</strong></h2>
  1455.  
  1456.  
  1457.  
  1458. <p>Beam gives you the foundation for defining ML-aware data pipelines. Dataflow gives you a specific way to execute them, especially in production environments where responsiveness and scalability matter.</p>
  1459.  
  1460.  
  1461.  
  1462. <p>If you’re building systems that require operational control and that already assume deep platform ownership, managing your own Beam runner makes sense. It gives flexibility where rules are looser and lets teams hook directly into their own tools and systems.</p>
  1463.  
  1464.  
  1465.  
  1466. <p>If instead you need fast iteration with minimal overhead, or you’re running real-time inference against cloud-hosted models, then Dataflow offers clear benefits. You onboard your pipeline without worrying about the runtime layer and deliver predictions without gluing together your own serving infrastructure.</p>
  1467.  
  1468.  
  1469.  
  1470. <p>If inference becomes an everyday part of your pipeline logic, the balance between operational effort and platform constraints starts to shift. The best execution model depends on more than feature comparison.</p>
  1471.  
  1472.  
  1473.  
  1474. <p>A well-chosen execution model involves commitment to how your team builds, evolves, and operates intelligent data systems over time. Whether you prioritize fine-grained control or accelerated delivery, both Beam and Dataflow offer robust paths forward. The key is aligning that choice with your long-term goals: consistency across workloads, adaptability for future AI demands, and a developer experience that supports innovation without compromising stability. As inference becomes a core part of modern pipelines, choosing the right abstraction sets a foundation for future-proofing your data infrastructure.</p>
  1475. ]]></content:encoded>
  1476. <wfw:commentRss>https://www.oreilly.com/radar/a-beam-versus-dataflow-conversation/feed/</wfw:commentRss>
  1477. <slash:comments>0</slash:comments>
  1478. </item>
  1479. <item>
  1480. <title>Generative AI in the Real World: Luke Wroblewski on When Databases Talk Agent-Speak</title>
  1481. <link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-luke-wroblewski-on-when-databases-talk-agent-speak/</link>
  1482. <comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-luke-wroblewski-on-when-databases-talk-agent-speak/#respond</comments>
  1483. <pubDate>Thu, 04 Sep 2025 16:01:45 +0000</pubDate>
  1484. <dc:creator><![CDATA[Ben Lorica and Luke Wroblewski]]></dc:creator>
  1485. <category><![CDATA[Generative AI in the Real World]]></category>
  1486. <category><![CDATA[Podcast]]></category>
  1487.  
  1488. <guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=17394</guid>
  1489.  
  1490.     <media:content
  1491. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png"
  1492. medium="image"
  1493. type="image/png"
  1494. />
  1495. <description><![CDATA[Join Luke Wroblewski and Ben Lorica as they talk about the future of software development. What happens when we have databases that are designed to interact with agents and language models rather than humans? We’re starting to see what that world will look like. It’s an exciting time to be a software developer. About the [&#8230;]]]></description>
  1496. <content:encoded><![CDATA[
  1497. <p>Join Luke Wroblewski and Ben Lorica as they talk about the future of software development. What happens when we have databases that are designed to interact with agents and language models rather than humans? We’re starting to see what that world will look like. It’s an exciting time to be a software developer.</p>
  1498.  
  1499.  
  1500.  
  1501. <p><strong>About the <em>Generative AI in the Real World</em> podcast:</strong> In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>
  1502.  
  1503.  
  1504.  
  1505. <p>Check out <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*16z5k2y*_ga*MTE1NDE4NjYxMi4xNzI5NTkwODkx*_ga_092EL089CH*MTcyOTYxNDAyNC4zLjEuMTcyOTYxNDAyNi41OC4wLjA." target="_blank" rel="noreferrer noopener">other episodes</a> of this podcast on the O’Reilly learning platform.</p>
  1506.  
  1507.  
  1508.  
  1509. <h2 class="wp-block-heading">Timestamps</h2>
  1510.  
  1511.  
  1512.  
  1513. <ul class="wp-block-list">
  1514. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=0" target="_blank" rel="noreferrer noopener">0:00</a>: Introduction to Luke Wroblewski of Sutter Hill Ventures. </li>
  1515.  
  1516.  
  1517.  
  1518. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=36" target="_blank" rel="noreferrer noopener">0:36</a>: You’ve talked about a paradigm shift in how we write applications. You’ve said that all we need is a URL and model, and that’s an app. Has anyone else made a similar observation? Have you noticed substantial apps that look like this?</li>
  1519.  
  1520.  
  1521.  
  1522. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=68" target="_blank" rel="noreferrer noopener">1:08</a>: The future is here; it’s just not evenly distributed yet. That’s what everyone loves to say. The first websites looked nothing like robust web applications, and now we have a multimedia podcast studio running in the browser. We’re at the phase where some of these things look and feel less robust. And our ideas for what constitutes an application change in each of these phases. If I told you pre-Google Maps that we’d be running all of our web applications in a browser, you’d have laughed at me. </li>
  1523.  
  1524.  
  1525.  
  1526. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=133" target="_blank" rel="noreferrer noopener">2:13</a>: I think what you mean is an MCP server, and the model itself is the application, correct?</li>
  1527.  
  1528.  
  1529.  
  1530. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=144" target="_blank" rel="noreferrer noopener">2:24</a>: Yes. The current definition of an application, in a simple form, is running code and a database. We’re at the stage where you have AI coding agents that can handle the coding part. But we haven’t really had databases that have been designed for the way those agents think about code and interacting with data.</li>
  1531.  
  1532.  
  1533.  
  1534. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=177" target="_blank" rel="noreferrer noopener">2:57</a>: Now that we have databases that work the way agents work, you can take out the running-code part almost. People go to Lovable or Cursor and they’re forced to look at code syntax. But if an AI model can just use a database effectively, it takes the role of the running code. And if it can manage data visualizations and UI, you don’t need to touch the code. You just need to point the AI at a data structure it can use effectively. <a href="https://mcpui.dev/" target="_blank" rel="noreferrer noopener">MCP UI</a> is a nice example of people pushing in this direction.</li>
  1535.  
  1536.  
  1537.  
  1538. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=252" target="_blank" rel="noreferrer noopener">4:12</a>: Which brings us to something you announced recently: AgentDB. You can find it at <a href="http://agentdb.dev" target="_blank" rel="noreferrer noopener">agentdb.dev</a>. What problem is AgentDB trying to solve?</li>
  1539.  
  1540.  
  1541.  
  1542. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=274" target="_blank" rel="noreferrer noopener">4:34</a>: Related to what we were just talking about: How do we get AI agents to use databases effectively? Most things in the technology stack are made for humans and the scale at which humans operate.</li>
  1543.  
  1544.  
  1545.  
  1546. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=306" target="_blank" rel="noreferrer noopener">5:06</a>: They’re still designed for a DBA, but eliminating the command line, right? So you still have to have an understanding of DBA principles?</li>
  1547.  
  1548.  
  1549.  
  1550. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=319" target="_blank" rel="noreferrer noopener">5:19</a>: How do you pick between the different compute options? How do you pick a region? What are the security options? And it’s not something you’re going to do thousands of times a day. Databricks just shared some stats where they said that thousands of databases per agent get made a day. They think 99% of databases being made are going to be made by agents. What is making all these databases? No longer humans. And the scale at which they make them—thousands is a lowball number. It will be way, way higher than that. How do we make a database system that works in that reality?</li>
  1551.  
  1552.  
  1553.  
  1554. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=382" target="_blank" rel="noreferrer noopener">6:22</a>: So the high-level thesis here is that lots of people will be creating agents, and these agents will rely on something that looks like a database, and many of these people won’t be hardcore engineers. What else?</li>
  1555.  
  1556.  
  1557.  
  1558. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=405" target="_blank" rel="noreferrer noopener">6:45</a>: It’s also agents creating agents, and agents creating applications, and agents deciding they need a database to complete a task. The explosion of these smart machine uses and workflows is well underway. But we don’t have an infrastructure that was made for that world. They were all designed to work with humans.</li>
  1559.  
  1560.  
  1561.  
  1562. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=451" target="_blank" rel="noreferrer noopener">7:31</a>: So in the classic database world, you’d consider AgentDB more like OLTP rather than analytics and OLAP.</li>
  1563.  
  1564.  
  1565.  
  1566. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=462" target="_blank" rel="noreferrer noopener">7:42</a>: Yeah, for analytics you’d probably stick your log somewhere else. The characteristics that make AgentDB really interesting for agents is, number 1: To create a database, all you really need is a unique ID. The creation of the ID manifests a database out of thin air. And we store it as a file, so you can scale like crazy. And all of these databases are fully isolated. They’re also downloadable, deletable, releasable—all the characteristics of a filesystem. We also have the concept of a template that comes along with the database. That gives the AI model or agent all the context it needs to start using the database immediately. If you just point Claude at a database, it will need to look at the structure (schema). It will build tokens and time trying to get the structure of the information. And every time it does this is an opportunity to make a mistake. With AgentDB, when an agent or an AI model is pointed at the database with a template, it can immediately write a query because we have in there a description of the database, the schema. So you save time, cut down errors, and don’t have to go through that learning step every time the model touches a database.</li>
  1567.  
  1568.  
  1569.  
  1570. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=622" target="_blank" rel="noreferrer noopener">10:22</a>: I assume this database will have some of the features you like, like ACID, vector search. So what kinds of applications have people built using AgentDB? </li>
  1571.  
  1572.  
  1573.  
  1574. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=653" target="_blank" rel="noreferrer noopener">10:53</a>: We put up a little demo page where we allow you to start the process with a CSV file. You upload it, and it will create the database and give you an MCP URL. So people are doing things like personal finance. People are uploading their credit card statements, their bank statements, because those applications are horrendous.</li>
  1575.  
  1576.  
  1577.  
  1578. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=699" target="_blank" rel="noreferrer noopener">11:39</a>: So it’s the actual statement; it parses it?</li>
  1579.  
  1580.  
  1581.  
  1582. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=705" target="_blank" rel="noreferrer noopener">11:45</a>: Another example: Someone has a spreadsheet to track jobs. They can take that, upload it, it gives them a template and a database and an MCP URL. They can pop that job-tracking database into Claude and do all the things you can do with a chat app, like ask, “What did I look at most recently?”</li>
  1583.  
  1584.  
  1585.  
  1586. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=755" target="_blank" rel="noreferrer noopener">12:35</a>: Do you envision it more like a DuckDB, more embedded, not really intended for really heavy transactional, high-throughput, more-than-one-table complicated schemas?</li>
  1587.  
  1588.  
  1589.  
  1590. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=769" target="_blank" rel="noreferrer noopener">12:49</a>: We currently support DuckDB and SQLite. But there are a bunch of folks who have made multiple table apps and databases.</li>
  1591.  
  1592.  
  1593.  
  1594. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=789" target="_blank" rel="noreferrer noopener">13:09</a>: So it’s not meant for you to build your own CRM?</li>
  1595.  
  1596.  
  1597.  
  1598. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=798" target="_blank" rel="noreferrer noopener">13:18</a>: Actually, one of our go-to-market guys had data of people visiting the website. He can dump that as a spreadsheet. He has data of people starring repos on GitHub. He has data of people who reached out through this form. He has all of these inbound signals of customers. So he took those, dropped them in as CSV files, put it in Claude, and then he can say, “Look at these, search the web for information about these, add it to the database, sort it by priority, assign it to different reps.” It’s CRM-ish already, but super-customized to his particular use case. </li>
  1599.  
  1600.  
  1601.  
  1602. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=867" target="_blank" rel="noreferrer noopener">14:27</a>: So you can create basically an agentic Airtable.</li>
  1603.  
  1604.  
  1605.  
  1606. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=878" target="_blank" rel="noreferrer noopener">14:38</a>: This means if you’re building AI applications or databases—traditionally that has been somewhat painful. This removes all that friction.</li>
  1607.  
  1608.  
  1609.  
  1610. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=900" target="_blank" rel="noreferrer noopener">15:00</a>: Yes, and it leads to a different way of making apps. You take that CSV file, you take that MCP URL, and you have a chat app.</li>
  1611.  
  1612.  
  1613.  
  1614. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=917" target="_blank" rel="noreferrer noopener">15:17</a>: Even though it’s accessible to regular users, it’s something developers should consider, right?</li>
  1615.  
  1616.  
  1617.  
  1618. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=925" target="_blank" rel="noreferrer noopener">15:25</a>: We’re starting to see emergent end-user use cases, but what we put out there is for developers. </li>
  1619.  
  1620.  
  1621.  
  1622. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=938" target="_blank" rel="noreferrer noopener">15:38</a>: One of the other things you’ve talked about is the notion that software development has flipped. Can you explain that to our listeners?</li>
  1623.  
  1624.  
  1625.  
  1626. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=956" target="_blank" rel="noreferrer noopener">15:56</a>: I spent eight and a half years at Google, four and a half at Yahoo, two and a half at ebay, and your traditional process of what we’re going to do next is up front: There’s a lot of drawing pictures and stuff. We had to scope engineering time. A lot of the stuff was front-loaded to figure out what we were going to build. Now with things like AI agents, you can build it and then start thinking about how it integrates inside the project. At a lot of our companies that are working with AI coding agents, I think this naturally starts to happen, that there’s a manifestation of the technology that helps you think through what the design should be, how do we integrate into the product, should we launch this? This is what I mean by “flipped.”</li>
  1627.  
  1628.  
  1629.  
  1630. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1061" target="_blank" rel="noreferrer noopener">17:41</a>: If I’m in a company like a big bank, does this mean that engineers are running ahead?</li>
  1631.  
  1632.  
  1633.  
  1634. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1075" target="_blank" rel="noreferrer noopener">17:55</a>: I don’t know if it’s happening in big banks yet, but it’s definitely happening in startup companies. And design teams have to think through “Here’s a bunch of stuff, let me do a wash across all that to fit in,” as opposed to spending time designing it earlier. There are pros and cons to both of these. The engineers were cleaning up the details in the previous world. Now the opposite is true: I’ve built it, now I need to design it.</li>
  1635.  
  1636.  
  1637.  
  1638. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1135" target="_blank" rel="noreferrer noopener">18:55</a>: Does this imply a new role? There’s a new skill set that designers have to develop?</li>
  1639.  
  1640.  
  1641.  
  1642. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1147" target="_blank" rel="noreferrer noopener">19:07</a>: There&#8217;s been this debate about “Should designers code?” Over the years lots of things have reduced the barrier to entry, and now we have an even more dramatic reduction. I’ve always been of the mindset that if you understand the medium, you will make better things. Now there’s even less of a reason not to do it.</li>
  1643.  
  1644.  
  1645.  
  1646. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1190" target="_blank" rel="noreferrer noopener">19:50</a>: Anecdotally, what I’m observing is that the people who come from product are able to build something, but I haven’t heard as many engineers thinking about design. What are the AI tools for doing that?</li>
  1647.  
  1648.  
  1649.  
  1650. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1219" target="_blank" rel="noreferrer noopener">20:19</a>: I hear the same thing. What I hope remains uncommoditized is taste. I’ve found that it’s very hard to teach taste to people. If I have a designer who is a good systems thinker but doesn’t have the gestalt of the visual design layer, I haven’t been able to teach that to them. But I have been able to find people with a clear sense of taste from diverse design backgrounds and get them on board with interaction design and systems thinking and applications.</li>
  1651.  
  1652.  
  1653.  
  1654. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1262" target="_blank" rel="noreferrer noopener">21:02</a>: If you’re a young person and you’re skilled, you can go into either design or software engineering. Of course, now you’re reading articles saying “forget about software engineering.” I haven’t seen articles saying “forget about design.”</li>
  1655.  
  1656.  
  1657.  
  1658. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1291" target="_blank" rel="noreferrer noopener">21:31</a>: I disagree with the idea that it’s a bad time to be an engineer. It’s never been more exciting.</li>
  1659.  
  1660.  
  1661.  
  1662. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1306" target="_blank" rel="noreferrer noopener">21:46</a>: But you have to be open to that. If you’re a curmudgeon, you’re going to be in trouble.</li>
  1663.  
  1664.  
  1665.  
  1666. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1313" target="_blank" rel="noreferrer noopener">21:53</a>: This happens with every technical platform transition. I spent so many years during the smartphone boom hearing people say, “No one is ever going to watch TV and movies on mobile.” Is it an affinity to the past, or a sense of doubt about the future? Every time, it’s been the same thing.</li>
  1667.  
  1668.  
  1669.  
  1670. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1357" target="_blank" rel="noreferrer noopener">22:37</a>: One way to think of AgentDB is like a wedge. It addresses one clear pain point in the stack that people have to grapple with. So what’s next? Is it Kubernetes?</li>
  1671.  
  1672.  
  1673.  
  1674. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1389" target="_blank" rel="noreferrer noopener">23:09</a>: I don’t want to go near that one! The broader context of how applications are changing—how do I create a coherent product that people understand how to use, that has aesthetics, that has a personality?—is a very wide-open question. There’s a bunch of other systems that have not been made for AI models. A simple example is search APIs. Search APIs are basically structured the same way as results pages. Here’s your 10 blue links. But an agentic model can suck up so much information. Not only should you be giving it the web page, you should be giving it the whole site. Those systems are not built for this world at all. You can go down the list of the things we use as core infrastructure and think about how they were made for a human, not the capabilities of an enormous large language model.</li>
  1675.  
  1676.  
  1677.  
  1678. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1479" target="_blank" rel="noreferrer noopener">24:39</a>: Right now, I’m writing an article on enterprise search, and one of things people don’t realize is that it’s broken. In terms of AgentDB, do you worry about things like security, governance? There’s another place black hat attackers can go after.</li>
  1679.  
  1680.  
  1681.  
  1682. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1520" target="_blank" rel="noreferrer noopener">25:20</a>: Absolutely. All new technologies have the light side and the dark side. It’s always been a codebreaker-codemaker stack. That doesn’t change. The attack vectors are different and, in the early stages, we don’t know what they are, so it is a cat and mouse game. There was an era when spam in email was terrible; your mailbox would be full of spam and you manually had to mark things as junk. Now you use gmail, and you don’t think about it. When was the last time you went into the junk mail tab? We built systems, we got smarter, and the average person doesn’t think about it.</li>
  1683.  
  1684.  
  1685.  
  1686. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1591" target="_blank" rel="noreferrer noopener">26:31</a>: As you have more people building agents, and agents building agents, you have data governance, access control; suddenly you have AgentDB artifacts all over the place. </li>
  1687.  
  1688.  
  1689.  
  1690. <li><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_Real_World_with_Luke_Wroblewski.mp3#t=1626" target="_blank" rel="noreferrer noopener">27:06</a>: Two things here. This is an underappreciated part of this. Two years ago I launched my own personal chatbot that works off my writings. People ask me what model am I using, and how is it built? Those are partly interesting questions. But the real work in that system is constantly looking at the questions people are asking, and evaluating whether or not it responded well. I’m constantly course-correcting the system. That’s the work that a lot of people don’t do. But the thing I’m doing is applying taste, applying a perspective, defining what “good” is. For a lot of systems like enterprise search, it’s like, “We deployed the technology.” How do you know if it’s good or not? Is someone in there constantly tweaking and tuning? What makes Google Search so good? It’s constantly being re-evaluated. Or Google Translate—was this translation good or bad? Baked in early on.</li>
  1691. </ul>
  1692. ]]></content:encoded>
  1693. <wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-luke-wroblewski-on-when-databases-talk-agent-speak/feed/</wfw:commentRss>
  1694. <slash:comments>0</slash:comments>
  1695. </item>
  1696. <item>
  1697. <title>AI Security Takes Center Stage at Black Hat USA 2025</title>
  1698. <link>https://www.oreilly.com/radar/ai-security-takes-center-stage-at-black-hat-usa-2025/</link>
  1699. <pubDate>Thu, 04 Sep 2025 09:52:40 +0000</pubDate>
  1700. <dc:creator><![CDATA[Simina Calin]]></dc:creator>
  1701. <category><![CDATA[AI & ML]]></category>
  1702. <category><![CDATA[Security]]></category>
  1703. <category><![CDATA[Commentary]]></category>
  1704.  
  1705. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17388</guid>
  1706.  
  1707.     <media:content
  1708. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-9.jpg"
  1709. medium="image"
  1710. type="image/jpeg"
  1711. />
  1712. <description><![CDATA[The security landscape is undergoing yet another major shift, and nowhere was this more evident than at Black Hat USA 2025. As artificial intelligence (especially the agentic variety) becomes deeply embedded in enterprise systems, it&#8217;s creating both security challenges and opportunities. Here&#8217;s what security professionals need to know about this rapidly evolving landscape. AI systems—and [&#8230;]]]></description>
  1713. <content:encoded><![CDATA[
  1714. <p>The security landscape is undergoing yet another major shift, and nowhere was this more evident than at <a href="https://www.blackhat.com/us-25/" target="_blank" rel="noreferrer noopener">Black Hat USA 2025</a>. As artificial intelligence (especially the agentic variety) becomes deeply embedded in enterprise systems, it&#8217;s creating both security challenges and opportunities. Here&#8217;s what security professionals need to know about this rapidly evolving landscape.</p>
  1715.  
  1716.  
  1717.  
  1718. <p>AI systems—and particularly the AI assistants that have become integral to enterprise workflows—are emerging as prime targets for attackers. In one of the most interesting and scariest presentations, Michael Bargury of Zenity demonstrated previously unknown <a href="https://www.blackhat.com/us-25/briefings/schedule/index.html#ai-enterprise-compromise---0click-exploit-methods-46442" target="_blank" rel="noreferrer noopener">&#8220;0click&#8221; exploit methods</a> affecting major AI platforms including ChatGPT, Gemini, and Microsoft Copilot. These findings underscore how AI assistants, despite their robust security measures, can become vectors for system compromise.</p>
  1719.  
  1720.  
  1721.  
  1722. <p>AI security presents a paradox: As organizations expand AI capabilities to enhance productivity, they must necessarily increase these tools’ access to sensitive data and systems. This expansion creates new attack surfaces and more complex supply chains to defend. NVIDIA&#8217;s AI red team highlighted this vulnerability, revealing how <a href="https://www.blackhat.com/us-25/briefings/schedule/#from-prompts-to-pwns-exploiting-and-securing-ai-agents-46681" target="_blank" rel="noreferrer noopener">large language models (LLMs) are uniquely susceptible to malicious inputs</a>, and demonstrated several novel exploit techniques that take advantage of these inherent weaknesses.</p>
  1723.  
  1724.  
  1725.  
  1726. <p>However, it&#8217;s not all new territory. Many traditional security principles remain relevant and are, in fact, more crucial than ever. Nathan Hamiel and Nils Amiet of Kudelski Security showed how AI-powered development tools are <a href="https://www.blackhat.com/us-25/briefings/schedule/index.html#hack-to-the-future-owning-ai-powered-tools-with-old-school-vulns-45871" target="_blank" rel="noreferrer noopener">inadvertently reintroducing well-known vulnerabilities into modern applications</a>. Their findings suggest that basic application security practices remain fundamental to AI security.</p>
  1727.  
  1728.  
  1729.  
  1730. <p>Looking forward, threat modeling becomes increasingly critical but also more complex. The security community is responding with new frameworks designed specifically for AI systems such as MAESTRO and NIST’s AI Risk Management Framework. The <a href="https://genai.owasp.org/resource/owasp-gen-ai-agentic-security-top-10-global-kickoff-presentation/" target="_blank" rel="noreferrer noopener">OWASP Agentic Security Top 10 project</a>, launched during this year&#8217;s conference, provides a structured approach to understanding and addressing AI-specific security risks.</p>
  1731.  
  1732.  
  1733.  
  1734. <p>For security professionals, the path forward requires a balanced approach: maintaining strong fundamentals while developing new expertise in AI-specific security challenges. Organizations must reassess their security posture through this new lens, considering both traditional vulnerabilities and emerging AI-specific threats.</p>
  1735.  
  1736.  
  1737.  
  1738. <p>The discussions at Black Hat USA 2025 made it clear that while AI presents new security challenges, it also offers opportunities for innovation in defense strategies. Mikko Hypponen’s opening keynote presented a <a href="https://www.blackhat.com/us-25/briefings/schedule/#keynote-three-decades-in-cybersecurity-lessons-learned-and-what-comes-next-48195" target="_blank" rel="noreferrer noopener">historical perspective on the last 30 years of cybersecurity advancements</a> and concluded that security is not only better than it’s ever been but poised to leverage a head start in AI usage. Black Hat has a way of underscoring the reasons for concern, but taken as a whole, this year’s presentations show us that there are also many reasons to be optimistic. Individual success will depend on how well security teams can adapt their existing practices while embracing new approaches specifically designed for AI systems.</p>
  1739. ]]></content:encoded>
  1740. </item>
  1741. <item>
  1742. <title>Looking Forward to AI Codecon</title>
  1743. <link>https://www.oreilly.com/radar/looking-forward-to-ai-codecon/</link>
  1744. <pubDate>Wed, 03 Sep 2025 17:25:12 +0000</pubDate>
  1745. <dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
  1746. <category><![CDATA[AI & ML]]></category>
  1747. <category><![CDATA[Events]]></category>
  1748.  
  1749. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17395</guid>
  1750.  
  1751.     <media:content
  1752. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/06/abstract-2370445_1920_crop-72496aa8e2aa221169247ee80be31a15.jpg"
  1753. medium="image"
  1754. type="image/jpeg"
  1755. />
  1756. <description><![CDATA[I’m really looking forward to our second O’Reilly AI Codecon, Coding for the Agentic World, which is happening on September 9, online from 8am to noon Pacific time, with a follow-on day of additional demos on September 16. But I’m also looking forward to how the AI market itself unfolds: the surprising twists and turns [&#8230;]]]></description>
  1757. <content:encoded><![CDATA[
  1758. <p>I’m really looking forward to our second O’Reilly AI Codecon, <a href="https://www.oreilly.com/AgenticWorld/" target="_blank" rel="noreferrer noopener">Coding for the Agentic World</a>, which is happening on September 9, online from 8am to noon Pacific time, with a follow-on <a href="https://learning.oreilly.com/live-events/oreilly-demo-day/0642572227968/" target="_blank" rel="noreferrer noopener">day of additional demos</a> on September 16. But I’m also looking forward to how the AI market itself unfolds: the surprising twists and turns ahead as users and developers apply AI to real-world problems.</p>
  1759.  
  1760.  
  1761.  
  1762. <p>The pages linked above give details on the program for the events. What I want to give here is a bit of the <strong><em>why</em></strong> behind the program, with a bit more detail on some of the fireside chats I will be leading.</p>
  1763.  
  1764.  
  1765.  
  1766. <h2 class="wp-block-heading">From Invention to Application</h2>
  1767.  
  1768.  
  1769.  
  1770. <p>There has been so much focus in the past on the big AI labs, the model developers, and their razzle-dazzle about AGI, or even ASI. That narrative implied that we were heading toward something unprecedented. But if this is a “<a href="https://www.oreilly.com/radar/is-ai-a-normal-technology/" target="_blank" rel="noreferrer noopener">normal technology</a>” (albeit one as transformational as electricity, the internal combustion engine, or the internet), we know that LLMs themselves are just the beginning of a long process of discovery, product invention, business adoption, and societal adaptation.</p>
  1771.  
  1772.  
  1773.  
  1774. <p>That process of collaborative discovery of the real uses for AI and reinvention of the businesses that use it is happening most clearly in the software industry. It is where AI is being pushed to the limits, where new products beyond the chatbot are being introduced, where new workflows are being developed, and where we understand what works and what doesn’t.</p>
  1775.  
  1776.  
  1777.  
  1778. <p>This work is often being pushed forward by individuals, who are “<a href="https://yalebooks.yale.edu/book/9780300195668/learning-by-doing/" target="_blank" rel="noreferrer noopener">learning by doing</a>.” Some of these individuals work for large companies, others for startups, others for enterprises, and others as independent hackers.</p>
  1779.  
  1780.  
  1781.  
  1782. <p>Our focus in these AI Codecon events is to smooth adoption of AI by helping our customers cut through the hype and understand what is working. O’Reilly’s mission has always been <a href="https://www.oreilly.com/about/" target="_blank" rel="noreferrer noopener">changing the world by sharing the knowledge of innovators</a>. In our events, we always look for people who are at the forefront of invention. As outlined in the call to action for the first event, I was concerned about the chatter that AI would make developers obsolete. I <a href="https://www.oreilly.com/radar/the-end-of-programming-as-we-know-it/" target="_blank" rel="noreferrer noopener">argued instead</a> that it would profoundly change the process of software development and the jobs that developers do, but that it would make them more important than ever.</p>
  1783.  
  1784.  
  1785.  
  1786. <p>It looks like I was right. There is a huge ferment, with so much new to learn and do that it’s a really exciting time to be a software developer. I’m really excited about the practicality of the conversation. We&#8217;re not just talking about the &#8220;what if.&#8221; We’re seeing new AI powered services meeting real business needs. We are witnessing the shift from human-centric workflows to agent-centric workflows, and it&#8217;s happening faster than you think.</p>
  1787.  
  1788.  
  1789.  
  1790. <p>We&#8217;re also seeing widespread adoption of the protocols that will power it all. If you’ve followed my work from open source to Web 2.0 to the present, you know that I believe strongly that the most dynamic systems have “an architecture of participation.” That is, they aren’t monolithic. The barriers to entry need to be low and business models fluid (at least in the early stages) for innovation to flourish.</p>
  1791.  
  1792.  
  1793.  
  1794. <p>When AI was framed as a race for superintelligence, there was a strong expectation that it would be winner takes all. The first company to get to ASI (or even just to AGI) would soon be so far ahead that it would inevitably become a dominant monopoly. Developers would all use its APIs, making it into the single dominant platform for AI development.</p>
  1795.  
  1796.  
  1797.  
  1798. <p>Protocols like MCP and A2A are instead enabling a decentralized AI future. The explosion of entrepreneurial activity around agentic AI reminds me of the best kind of open innovation, much like I saw in the early days of the personal computer and the internet.</p>
  1799.  
  1800.  
  1801.  
  1802. <p>I was going to use my opening remarks to sound that theme, and then I read Alex Komoroske’s marvelous essay, “<a href="https://www.techdirt.com/2025/06/16/why-centralized-ai-is-not-our-inevitable-future/" target="_blank" rel="noreferrer noopener">Why Centralized AI Is Not Our Inevitable Future</a>.” So I asked him to do it instead. He’s going to give an updated, developer-focused version of that as our kickoff talk.</p>
  1803.  
  1804.  
  1805.  
  1806. <p>Then we’re going into a section on agentic interfaces. We’ve lived for decades with the GUI (either on computers or mobile applications) and the web as the dominant ways we use computers. AI is changing all that.</p>
  1807.  
  1808.  
  1809.  
  1810. <p>It’s not just agentic interfaces, though. It’s really developing true AI-native products, searching out the possibilities of this new computing fabric.</p>
  1811.  
  1812.  
  1813.  
  1814. <h2 class="wp-block-heading">The Great Interface Rethink</h2>
  1815.  
  1816.  
  1817.  
  1818. <p>In the “normal technology” framing, a fundamental technology innovation is distinct from products based on it. Think of the invention of the LLM itself as electricity, and ChatGPT as the equivalent of Edison’s incandescent light bulb and the development of the distribution network to power it.</p>
  1819.  
  1820.  
  1821.  
  1822. <p>There’s a bit of a lesson in the fact that the telegraph was the first large-scale practical application of electricity, over 40 years before Edison’s lightbulb. The telephone was another killer app that used electricity to power it. But despite their scale, these were specialized devices. It was the infrastructure for incandescent lighting that turned electricity into a <a href="https://en.wikipedia.org/wiki/General-purpose_technology" target="_blank" rel="noreferrer noopener">general-purpose technology</a>.</p>
  1823.  
  1824.  
  1825.  
  1826. <p>The world soon saw electrical resistance products like irons and toasters, and electric motors powering not just factories but household appliances such as washing machines and eventually refrigerators and air conditioning. Many of these household products were plugged into light sockets, since the pronged plug as we know it today wasn’t introduced until 30 years after the first light bulb.</p>
  1827.  
  1828.  
  1829.  
  1830. <figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1086" height="824" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon.png" alt="" class="wp-image-17399" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon.png 1086w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon-300x228.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/09/Looking-Forward-to-AI-Codecon-768x583.png 768w" sizes="auto, (max-width: 1086px) 100vw, 1086px" /><figcaption class="wp-element-caption"><em><a href="https://www.facebook.com/photo/?fbid=10158281345021884&amp;set=gm.3012577095721332" target="_blank" rel="noreferrer noopener">Found on Facebook</a>: &#8220;Any ideas what this would have been used for? I found it after pulling up carpet &#8211; it&#8217;s in the corner of a closet in my 1920s ‘fixer-upper’ that I&#8217;m slowly bringing back to life. It appears to be for a light bulb and the little flip top is just like floor outlets you see today, but can&#8217;t figure out why it would be directly on the floor.&#8221;</em></figcaption></figure>
  1831.  
  1832.  
  1833.  
  1834. <p>The lesson is that <strong><em>at some point in the development of a general purpose technology, product innovation takes over from pure technology innovation</em></strong>. That’s the phase we’re entering now.</p>
  1835.  
  1836.  
  1837.  
  1838. <p>Look at the evolution of LLM-based products: GitHub Copilot embedded AI into Visual Studio Code; the interface was an extension to VS Code, a 10-year-old GUI-based program. Google’s AI efforts were tied into its web-based search products. ChatGPT broke the mold and introduced the first radically new interface since the web browser. Suddenly, chat was the preferred new interface for everything. But Claude took things further with Artifacts and then Claude Code, and once coding assistants gained more complex interfaces, that kicked off today’s fierce competition between coding tools. The next revolution is the construction of a new computing paradigm where software is composed of intelligent, autonomous agents.</p>
  1839.  
  1840.  
  1841.  
  1842. <p>I’m really looking forward to Rachel-Lee Nabors’s talk on how, with an agentic interface, we might transcend the traditional browser: AI agents can adapt content directly to users, offering privacy, accessibility, and flexibility that legacy web interfaces cannot match.</p>
  1843.  
  1844.  
  1845.  
  1846. <p>But it seems to me that there will be two kinds of agents, which I call “demand side” and “supply side” agents. What’s a “demand side” agent? Instead of navigating complex apps, you&#8217;ll simply state your goal. The agent will understand the context, access the necessary tools, and present you with the result. The vision is still science fiction. The reality is often a kludge powered by browser use or API calls, with MCP servers increasingly offering an AI-friendlier interface for those demand-side agents to interact with. But why should it stop there? MCP servers are static interfaces. What if there were agents on both sides of the conversation, in a dynamic negotiation? I suspect that while demand-side agents will be developed by venture funded startups, most server-side agents will be developed by enterprises as a kind of conversational interface for both humans and AI agents that want access to their complex workflows, data, and business models. And those enterprises will often be using agentic platforms tailored for their use. That’s part of the “supply side agent” vision of companies like Sierra. I’ll be talking with Sierra cofounder Clay Bavor about this next step in agentic development.</p>
  1847.  
  1848.  
  1849.  
  1850. <p>We&#8217;ve grown accustomed to thinking about agents as lonely consumers—“tell me the weather,” “scan my code,” “summarize my inbox.” But that’s only half the story. If we build supply-side agent infrastructure—autonomous, discoverable, governed, negotiated—we unlock agility, resilience, security, and collaboration.</p>
  1851.  
  1852.  
  1853.  
  1854. <p>My interest in product innovation, not just advances in the underlying technology, is also why I’m excited about my fireside chat with Josh Woodward, who co-led the team that developed NotebookLM at Google. I’m a huge fan of NotebookLM, which in many ways brought the power of RAG (retrieval-augmented generation) to end users, allowing them to collect a set of documents into a Google drive, and then use that collection to drive chat, audio overviews of documents, study guides, mind maps, and much more.</p>
  1855.  
  1856.  
  1857.  
  1858. <p>NotebookLM is also a lovely way to build on the deep collaborative infrastructure provided by Google Drive. We need to think more deeply about collaborative interfaces for AI. Right now, AI interaction is mostly a solitary sport. You can share the outputs with others, but not the generative process. I wrote about this recently in “<a href="https://www.oreilly.com/radar/people-work-in-teams-ai-assistants-in-silos/" target="_blank" rel="noreferrer noopener">People Work in Teams, AI Assistants in Silos</a>.” I think that’s a big miss, and I’m hoping to probe Josh about Google’s plans in this area, and eager to see other innovations in AI-mediated human collaboration.</p>
  1859.  
  1860.  
  1861.  
  1862. <p>GitHub is another existing tool for collaboration that has become central to the AI ecosystem. I’m really looking forward to talking with outgoing CEO Thomas Dohmke both about the ways that GitHub already provides a kind of exoskeleton for collaboration when using AI code-generation tools. It seems to me that one of the frontiers of AI-human interfaces will be those that enable not just small teams but eventually large groups to collaborate. I suspect that GitHub may have more to teach us about that future than we now suspect.</p>
  1863.  
  1864.  
  1865.  
  1866. <p>And finally, we are now learning that managing context is a critical part of designing effective AI applications. My cochair Addy Osmani will be talking about the emergence of context engineering as a real discipline, and its relevance to agentic AI development.</p>
  1867.  
  1868.  
  1869.  
  1870. <h2 class="wp-block-heading">Tool-Chaining Agents and Real Workflows</h2>
  1871.  
  1872.  
  1873.  
  1874. <p>Today&#8217;s AI tools are largely solo performers—a Copilot suggesting code or a ChatGPT answering a query. The next leap is from single agents to interconnected systems. The program is filled with sessions on &#8220;tool-to-tool workflows&#8221; and multi-agent systems.</p>
  1875.  
  1876.  
  1877.  
  1878. <p>Ken Kousen will showcase the new generation of coding agents, including Claude Code, Codex CLI, Gemini CLI, and Junie, that help developers navigate codebases, automate tasks, and even refactor intelligently. In her talk, Angie Jones takes it further: agents that go beyond code generation to manage PRs, write tests, and update documentation—stepping “out of the IDE” and into real-world workflows.</p>
  1879.  
  1880.  
  1881.  
  1882. <p>Even more exciting is the idea of agents collaborating with each other. The Demo Day will showcase a multi-agent coding system where agents share, correct, and evolve code together. This isn&#8217;t science fiction; Amit Rustagi&#8217;s talk on decentralized AI agent infrastructure using technologies like WebAssembly and IPFS provides a practical architectural framework for making these agent swarms a reality.</p>
  1883.  
  1884.  
  1885.  
  1886. <h2 class="wp-block-heading">The Crucial Ingredient: Common Protocols</h2>
  1887.  
  1888.  
  1889.  
  1890. <p>How do all these agents talk to each other? How do they discover new tools and use them safely? The answer that echoes throughout the agenda is the Model Context Protocol (MCP).</p>
  1891.  
  1892.  
  1893.  
  1894. <p>Much as the distribution network for electricity was the enabler for all of the product innovation of the electrical revolution, MCP is the foundational plumbing, the universal language that will allow this new ecosystem to flourish. Multiple sessions and an entire Demo Day are dedicated to it. We&#8217;ll see how Google is using it for agent-to-agent communication, how it can be used to control complex software like Blender with natural language, and even how it can power novel SaaS product demos.</p>
  1895.  
  1896.  
  1897.  
  1898. <p>The heavy focus on a standardized protocol signals that the industry is maturing past cool demos and is now building the robust, interoperable infrastructure needed for a true agentic economy.</p>
  1899.  
  1900.  
  1901.  
  1902. <p>If the development of the internet is any guide, though, MCP is a beginning, not the end. TCP/IP became the foundation of a layered protocol stack. It is likely that MCP will be followed by many more specialized protocols.</p>
  1903.  
  1904.  
  1905.  
  1906. <h2 class="wp-block-heading">Why This Matters</h2>
  1907.  
  1908.  
  1909.  
  1910. <figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Theme</th><th><strong>Why It’s Thrilling</strong></th></tr></thead><tbody><tr><td>Autonomous, Distributed AI</td><td>Agents that chain tasks and operate behind the scenes can unlock entirely new ways of building software.</td></tr><tr><td>Human Empowerment &amp; Privacy</td><td>The push against centralized AI systems is a reminder that tools should serve users, not control them.</td></tr><tr><td>Context as Architecture</td><td>Elevating input design to first-class engineering—this will greatly improve reliability, trust, and AI behavior over time.</td></tr><tr><td>New Developer Roles</td><td>We’re seeing developers transition from writing code to orchestrating agents, designing workflows, and managing systems.</td></tr><tr><td>MCP &amp; Network Effects</td><td>The idea of an “AI-native web,” where agents use standardized protocols to talk, is powerful, open-ended, and full of opportunity.</td></tr></tbody></table></figure>
  1911.  
  1912.  
  1913.  
  1914. <p>I look forward to seeing you there!</p>
  1915.  
  1916.  
  1917.  
  1918. <hr class="wp-block-separator has-alpha-channel-opacity"/>
  1919.  
  1920.  
  1921.  
  1922. <p class="has-cyan-bluish-gray-background-color has-background"><em>We hope you’ll join us at <strong>AI Codecon: Coding for the Agentic World</strong> on September 9 to explore the tools, workflows, and architectures defining the next era of programming. It’s free to attend. </em><a href="https://www.oreilly.com/AgenticWorld/" target="_blank" rel="noreferrer noopener"><em>Register now to save your seat</em></a><em>.</em> <em>And join us for <a href="https://learning.oreilly.com/live-events/oreilly-demo-day/0642572227968/" target="_blank" rel="noreferrer noopener">O&#8217;Reilly Demo Day</a> on September 16 to see how experts are shaping AI systems to work for them via MCP.</em></p>
  1923.  
  1924.  
  1925.  
  1926. <p></p>
  1927. ]]></content:encoded>
  1928. </item>
  1929. <item>
  1930. <title>Understanding the Rehash Loop</title>
  1931. <link>https://www.oreilly.com/radar/understanding-the-rehash-loop/</link>
  1932. <pubDate>Wed, 03 Sep 2025 10:21:28 +0000</pubDate>
  1933. <dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
  1934. <category><![CDATA[Commentary]]></category>
  1935.  
  1936. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17392</guid>
  1937.  
  1938.     <media:content
  1939. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2019/06/fractal-1832620_1920_crop-c59dcd2418c34f9dccc1469aba4b0ba1.jpg"
  1940. medium="image"
  1941. type="image/jpeg"
  1942. />
  1943. <custom:subtitle><![CDATA[When AI Gets Stuck]]></custom:subtitle>
  1944. <description><![CDATA[This article is part of a series on the Sens-AI Framework—practical habits for learning and coding with AI. In “The Sens-AI Framework: Teaching Developers to Think with AI,” I introduced the concept of the rehash loop—that frustrating pattern where AI tools keep generating variations of the same wrong answer, no matter how you adjust your [&#8230;]]]></description>
  1945. <content:encoded><![CDATA[
  1946. <p class="has-cyan-bluish-gray-background-color has-background"><em>This article is part of a series on the </em><a href="https://www.oreilly.com/radar/the-sens-ai-framework/"><em>Sens-AI Framework</em></a><em>—practical habits for learning and coding with AI.</em></p>
  1947.  
  1948.  
  1949.  
  1950. <p>In “<a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">The Sens-AI Framework: Teaching Developers to Think with AI</a>,” I introduced the concept of the <strong>rehash loop</strong>—that frustrating pattern where AI tools keep generating variations of the same wrong answer, no matter how you adjust your prompt. It&#8217;s one of the most common failure modes in AI-assisted development, and it deserves a deeper look.</p>
  1951.  
  1952.  
  1953.  
  1954. <p>Most developers who use AI in their coding work will recognize a rehash loop. The AI generates code that&#8217;s almost right—close enough that you think one more tweak will fix it. So you adjust your prompt, add more detail, explain the problem differently. But the response is essentially the same broken solution with cosmetic changes. Different variable names. Reordered operations. Maybe a comment or two. But fundamentally, it&#8217;s the same wrong answer.</p>
  1955.  
  1956.  
  1957.  
  1958. <h2 class="wp-block-heading"><strong>Recognizing When You&#8217;re Stuck</strong></h2>
  1959.  
  1960.  
  1961.  
  1962. <p>Rehash loops are frustrating. The model seems so close to understanding what you need but just can’t get you there. Each iteration looks slightly different, which makes you think you&#8217;re making progress. Then you test the code and it fails in exactly the same way, or you get the same errors, or you just recognize that it’s a solution that you’ve already seen and dismissed multiple times.</p>
  1963.  
  1964.  
  1965.  
  1966. <p>Most developers try to escape through incremental changes—adding details, rewording instructions, nudging the AI toward a fix. These adjustments normally work during regular coding sessions, but in a rehash loop, they lead back to the same constrained set of answers. You can&#8217;t tell if there&#8217;s no real solution, if you&#8217;re asking the wrong question, or if the AI is hallucinating a partial answer and too confident that it works.</p>
  1967.  
  1968.  
  1969.  
  1970. <p>When you&#8217;re in a rehash loop, the AI isn&#8217;t broken. It&#8217;s doing exactly what it&#8217;s designed to do—generating the most statistically likely response it can, based on the tokens in your prompt and the limited view it has of the conversation. One source of the problem is the <strong>context window</strong>—an architectural limit on how many tokens the model can process at once. That includes your prompt, any shared code, and the rest of the conversation—usually a few thousand tokens total. The model uses this entire sequence to predict what comes next. Once it has sampled the patterns it finds there, it starts circling.</p>
  1971.  
  1972.  
  1973.  
  1974. <p>The variations you get—reordered statements, renamed variables, a tweak here or there—aren&#8217;t new ideas. They&#8217;re just the model nudging things around in the same narrow probability space.</p>
  1975.  
  1976.  
  1977.  
  1978. <p>So if you keep getting the same broken answer, the issue probably isn&#8217;t that the model doesn&#8217;t know how to help. It&#8217;s that you haven&#8217;t given it enough to work with.</p>
  1979.  
  1980.  
  1981.  
  1982. <h2 class="wp-block-heading"><strong>When the Model Runs Out of Context</strong></h2>
  1983.  
  1984.  
  1985.  
  1986. <p><em>A rehash loop is a </em><strong><em>signal that the AI ran out of context</em></strong><em>.</em><strong><em> </em></strong>The model has exhausted the useful information in the context you&#8217;ve given it. When you&#8217;re stuck in a rehash loop, treat it as a signal instead of a problem. Figure out what context is missing and provide it.</p>
  1987.  
  1988.  
  1989.  
  1990. <p>Large language models don&#8217;t really understand code the way humans do. They generate suggestions by predicting what comes next in a sequence of text based on patterns they&#8217;ve seen in massive training datasets. When you prompt them, they analyze your input and predict likely continuations, but they have no real understanding of your design or requirements unless you explicitly provide that context.</p>
  1991.  
  1992.  
  1993.  
  1994. <p>The better context you provide, the more useful and accurate the AI&#8217;s answers will be. But when the context is incomplete or poorly framed, the AI&#8217;s suggestions can drift, repeat variations, or miss the real problem entirely.</p>
  1995.  
  1996.  
  1997.  
  1998. <h2 class="wp-block-heading"><strong>Breaking Out of the Loop</strong></h2>
  1999.  
  2000.  
  2001.  
  2002. <p><strong>Research</strong> becomes especially important when you hit a rehash loop. You need to learn more before reengaging—reading documentation, clarifying requirements with teammates, thinking through design implications, or even starting another session to ask research questions from a different angle. Starting a new chat with a different AI can help because your prompt might steer it toward a different region of its information space and surface new context.</p>
  2003.  
  2004.  
  2005.  
  2006. <p>A rehash loop tells you that the model is stuck trying to solve a puzzle without all the pieces. It keeps rearranging the ones it has, but it can’t reach the right solution until you give it the one piece it needs—that extra bit of context that points it to a different part of the model it wasn’t using. That missing piece might be a key constraint, an example, or a goal you haven’t spelled out yet. You typically don’t need to give it a lot of extra information to break out of the loop. The AI doesn’t need a full explanation; it needs just enough new context to steer it into a part of its training data it wasn’t using.</p>
  2007.  
  2008.  
  2009.  
  2010. <p>When you recognize you&#8217;re in a rehash loop, trying to nudge the AI and vibe-code your way out of it is usually ineffective—it just leads you in circles. (“Vibe coding” means relying on the AI to generate something that looks plausible and hoping it works, without really digesting the output.) Instead, start investigating what’s missing. Ask the AI to explain its thinking: “What assumptions are you making?” or “Why do you think this solves the problem?” That can reveal a mismatch—maybe it’s solving the wrong problem entirely, or it’s missing a constraint you forgot to mention. It’s often especially helpful to open a chat with a different AI, describe the rehash loop as clearly as you can, and ask what additional context might help.</p>
  2011.  
  2012.  
  2013.  
  2014. <p>This is where problem framing really starts to matter. If the model keeps circling the same broken pattern, it’s not just a prompt problem—it’s a signal that your framing needs to shift.</p>
  2015.  
  2016.  
  2017.  
  2018. <p><strong>Problem framing</strong> helps you recognize that the model is stuck in the wrong solution space. Your framing gives the AI the clues it needs to assemble patterns from its training that actually match your intent. After researching the actual problem—not just tweaking prompts—you can transform vague requests into targeted questions that steer the AI away from default responses and toward something useful.</p>
  2019.  
  2020.  
  2021.  
  2022. <p>Good framing starts by getting clear about the nature of the problem you&#8217;re solving. What exactly are you asking the model to generate? What information does it need to do that? Are you solving the right problem in the first place? A lot of failed prompts come from a mismatch between the developer’s intent and what the model is actually being asked to do. Just like writing good code, good prompting depends on understanding the problem you’re solving and structuring your request accordingly.</p>
  2023.  
  2024.  
  2025.  
  2026. <h2 class="wp-block-heading"><strong>Learning from the Signal</strong></h2>
  2027.  
  2028.  
  2029.  
  2030. <p>When AI keeps circling the same solution, it&#8217;s not a failure—it&#8217;s information. The rehash loop tells you something about either your understanding of the problem or how you&#8217;re communicating it. An incomplete response from the AI is often just a step toward getting the right answer. These moments aren&#8217;t failures. They&#8217;re signals to do the extra work—often just a small amount of targeted research—that gives the AI the information it needs to get to the right place in its massive information space.</p>
  2031.  
  2032.  
  2033.  
  2034. <p><strong><em>AI doesn&#8217;t think for you.</em></strong> While it can make surprising connections by recombining patterns from its training, it can&#8217;t generate truly new insight on its own. It&#8217;s your context that helps it connect those patterns in useful ways. If you&#8217;re hitting rehash loops repeatedly, ask yourself: What does the AI need to know to do this well? What context or requirements might be missing?</p>
  2035.  
  2036.  
  2037.  
  2038. <p>Rehash loops are one of the clearest signals that it&#8217;s time to step back from rapid generation and engage your critical thinking. They&#8217;re frustrating, but they&#8217;re also valuable—they tell you exactly when the AI has exhausted its current context and needs your help to move forward.</p>
  2039. ]]></content:encoded>
  2040. </item>
  2041. <item>
  2042. <title>Radar Trends to Watch: September 2025</title>
  2043. <link>https://www.oreilly.com/radar/radar-trends-to-watch-september-2025/</link>
  2044. <pubDate>Tue, 02 Sep 2025 10:10:37 +0000</pubDate>
  2045. <dc:creator><![CDATA[Mike Loukides]]></dc:creator>
  2046. <category><![CDATA[Radar Trends]]></category>
  2047. <category><![CDATA[Commentary]]></category>
  2048.  
  2049. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17379</guid>
  2050.  
  2051.     <media:content
  2052. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-4.png"
  2053. medium="image"
  2054. type="image/png"
  2055. />
  2056. <custom:subtitle><![CDATA[Developments in AI, Security, Web, and More]]></custom:subtitle>
  2057. <description><![CDATA[For better or for worse, AI has colonized this list so thoroughly that AI itself is little more than a list of announcements about new or upgraded models. But there are other points of interest. Is it just a coincidence (possibly to do with BlackHat) that so much happened in security in the past month? [&#8230;]]]></description>
  2058. <content:encoded><![CDATA[
  2059. <p>For better or for worse, AI has colonized this list so thoroughly that AI itself is little more than a list of announcements about new or upgraded models. But there are other points of interest. Is it just a coincidence (possibly to do with BlackHat) that so much happened in security in the past month? We’re still seeing programming languages—even some new programming languages for writing AI prompts! If you’re into retrocomputing, the much-beloved Commodore 64 is back—with an upgraded audio chip, a new processor, much more RAM, and all your old ports. Heirloom peripherals should still work.</p>
  2060.  
  2061.  
  2062.  
  2063. <h2 class="wp-block-heading">AI</h2>
  2064.  
  2065.  
  2066.  
  2067. <ul class="wp-block-list">
  2068. <li>OpenAI has released their <a href="https://platform.openai.com/docs/guides/realtime" target="_blank" rel="noreferrer noopener">Realtime APIs</a>. The model supports MCP servers, phone calls using the SIP protocol, and image inputs. The release includes <a href="https://openai.com/index/introducing-gpt-realtime/" target="_blank" rel="noreferrer noopener">gpt-realtime</a>, an advanced speech-to-speech model.</li>
  2069.  
  2070.  
  2071.  
  2072. <li>ChatGPT now supports <a href="https://help.openai.com/en/articles/6825453-chatgpt-release-notes#h_fb3ac52750" target="_blank" rel="noreferrer noopener">project-only memory</a>. Project memory, which can use previous conversations for additional context, can be limited to a specific project. Project-only memory gives more control over context and prevents one project’s context from contaminating another.</li>
  2073.  
  2074.  
  2075.  
  2076. <li><a href="https://arxiv.org/abs/2501.01665" target="_blank" rel="noreferrer noopener">FairSense</a> is a framework for <a href="https://techxplore.com/news/2025-08-fairness-tool-ai-bias-early.html" target="_blank" rel="noreferrer noopener">investigating whether AI systems are fair</a> early on. FairSense runs long-term simulations to detect whether a system will become unfair as it evolves over time.</li>
  2077.  
  2078.  
  2079.  
  2080. <li><a href="https://agents4science.stanford.edu/" target="_blank" rel="noreferrer noopener">Agents4Science</a> is a new academic conference in which all the submissions will be <a href="https://www.technologyreview.com/2025/08/22/1122304/ai-scientist-research-autonomous-agents/" target="_blank" rel="noreferrer noopener">researched, written, reviewed, and presented primarily by AI</a> (using text-to-speech for presentations).</li>
  2081.  
  2082.  
  2083.  
  2084. <li>Drew Breunig’s mix and match <a href="https://www.dbreunig.com/2025/08/21/a-guide-to-ai-titles.html" target="_blank" rel="noreferrer noopener">cheat sheet for AI job titles</a> is a classic.&nbsp;</li>
  2085.  
  2086.  
  2087.  
  2088. <li>Cohere’s <a href="https://cohere.com/blog/command-a-reasoning" target="_blank" rel="noreferrer noopener">Command A Reasoning</a> is another powerful, partially open reasoning model. It is available on <a href="https://huggingface.co/CohereLabs/command-a-reasoning-08-2025?ref=cohere-ai.ghost.io" target="_blank" rel="noreferrer noopener">Hugging Face</a>. It claims to outperform gpt-oss-120b and DeepSeek R1-0528.</li>
  2089.  
  2090.  
  2091.  
  2092. <li>DeepSeek has <a href="https://api-docs.deepseek.com/news/news250821" target="_blank" rel="noreferrer noopener">released</a> DeepSeekV3.1. This is a hybrid model that supports reasoning and nonreasoning use. It’s also faster than R1 and has been designed for agentic tasks. It uses reasoning tokens more economically, and it was much less expensive to train than GPT-5.</li>
  2093.  
  2094.  
  2095.  
  2096. <li>Anthropic has added the <a href="https://www.anthropic.com/research/end-subset-conversations" target="_blank" rel="noreferrer noopener">ability to terminate chats</a> to Claude Opus. Chats can be terminated if a user persists in making harmful requests. Terminated chats can’t be continued, although users can start a new chat. The feature is currently experimental.</li>
  2097.  
  2098.  
  2099.  
  2100. <li>Google has <a href="https://developers.googleblog.com/en/introducing-gemma-3-270m/" target="_blank" rel="noreferrer noopener">released</a> its <a href="https://feedly.com/i/entry/oWwSZ9Xu4Zg49bpJDa0MWTMCyM67pScdNU+gxmUAZuo=_198aa66551c:380d5:c853ad2e" target="_blank" rel="noreferrer noopener">smallest model yet</a>: Gemma 3 270M. This model is designed for fine-tuning and for deployment on small, limited hardware. Here’s a <a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator" target="_blank" rel="noreferrer noopener">bedtime story generator</a> that runs in the browser, built with Gemma 3 270M.&nbsp;</li>
  2101.  
  2102.  
  2103.  
  2104. <li>ChatGPT has <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/openai-rolls-out-gmail-calendar-and-contacts-integration-in-chatgpt/" target="_blank" rel="noreferrer noopener">added GMail, Google Calendar, and Google Contacts</a> to its group of connectors, which integrate ChatGPT with other applications. This information will be used to provide additional context—and presumably will be used for training or discovery in ongoing lawsuits. Fortunately, it’s (at this point) opt-in.&nbsp;</li>
  2105.  
  2106.  
  2107.  
  2108. <li>Anthropic has <a href="https://www.bleepingcomputer.com/news/artificial-intelligence/claude-gets-1m-tokens-support-via-api-to-take-on-gemini-25-pro/" target="_blank" rel="noreferrer noopener">upgraded</a> Claude Sonnet 4 with a <a href="https://www.anthropic.com/news/1m-context" target="_blank" rel="noreferrer noopener">1M token context window</a>. The larger context window is only available via the API.</li>
  2109.  
  2110.  
  2111.  
  2112. <li>OpenAI <a href="https://openai.com/index/introducing-gpt-5/" target="_blank" rel="noreferrer noopener">released</a> GPT-5. Simon Willison’s <a href="https://simonwillison.net/2025/Aug/7/gpt-5/" target="_blank" rel="noreferrer noopener">review</a> is excellent. It doesn’t feel like a breakthrough, but it is quietly better at delivering good results. It is claimed to be less prone to hallucination and incorrect answers. One quirk is that with ChatGPT, GPT-5 determines which model should respond to your prompt.</li>
  2113.  
  2114.  
  2115.  
  2116. <li>Anthropic is researching <a href="https://www.anthropic.com/research/persona-vectors" target="_blank" rel="noreferrer noopener">persona vectors</a> as a means of training a language model to behave correctly. Steering a model toward inappropriate behavior during training can be a kind of “vaccination” against that behavior when the model is deployed, without compromising other aspects of the model’s behavior.</li>
  2117.  
  2118.  
  2119.  
  2120. <li>The <a href="https://sakana.ai/dgm/" target="_blank" rel="noreferrer noopener">Darwin Gödel Machine</a> is an agent that can read and modify its own code to improve its performance on tasks. It can add tools, re-organize workflows, and evaluate whether these changes have improved its performance.</li>
  2121.  
  2122.  
  2123.  
  2124. <li>Grok is at it again: <a href="https://www.theverge.com/report/718975/xai-grok-imagine-taylor-swifty-deepfake-nudes" target="_blank" rel="noreferrer noopener">generating nude deepfakes of Taylor Swift</a> without being prompted to do so. I’m sure we’ll be told that this was the result of an unauthorized modification to the system prompt. In AI, some things are predictable.</li>
  2125.  
  2126.  
  2127.  
  2128. <li>Anthropic has <a href="https://www.anthropic.com/news/claude-opus-4-1" target="_blank" rel="noreferrer noopener">released</a> Claude Opus 4.1, an upgrade to its flagship model. We expect this to be the “gold standard” for generative coding.</li>
  2129.  
  2130.  
  2131.  
  2132. <li>OpenAI has <a href="https://openai.com/open-models/" target="_blank" rel="noreferrer noopener">released</a> two open-weight models, their first since GPT-2: <a href="https://huggingface.co/openai/gpt-oss-120b" target="_blank" rel="noreferrer noopener">gpt-oss-120b</a> and <a href="https://huggingface.co/openai/gpt-oss-20b" target="_blank" rel="noreferrer noopener">gpt-oss-20b</a>. They are reasoning models designed for use in agentic applications. Claimed <a href="https://openai.com/index/introducing-gpt-oss/" target="_blank" rel="noreferrer noopener">performance</a> is similar to OpenAI’s o3 and o4-mini.</li>
  2133.  
  2134.  
  2135.  
  2136. <li>OpenAI has also released a “response format” named <a href="https://cookbook.openai.com/articles/openai-harmony" target="_blank" rel="noreferrer noopener">Harmony</a>. It’s not quite a protocol, but it is a standard that specifies the format of conversations by defining roles (system, user, etc.) and channels (final, analysis, commentary) for a model’s output.</li>
  2137.  
  2138.  
  2139.  
  2140. <li>Can AIs <a href="https://techxplore.com/news/2025-07-ai-evolve-guilt-social-environments.html" target="_blank" rel="noreferrer noopener">evolve guilt</a>? Guilt is expressed in human language; it’s in the training data. The AI that deleted a production database because it “panicked” certainly <a href="https://www.pcgamer.com/software/ai/i-destroyed-months-of-your-work-in-seconds-says-ai-coding-tool-after-deleting-a-devs-entire-database-during-a-code-freeze-i-panicked-instead-of-thinking/" target="_blank" rel="noreferrer noopener">expressed guilt</a>. Whether an AI’s expressions of guilt are meaningful in any way is a different question.</li>
  2141.  
  2142.  
  2143.  
  2144. <li><a href="https://github.com/musistudio/claude-code-router" target="_blank" rel="noreferrer noopener">Claude Code Router</a> is a tool for routing Claude Code requests to different models. You can choose different models for different kinds of requests.</li>
  2145.  
  2146.  
  2147.  
  2148. <li>Qwen has released a thinking version of their flagship model, called <a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507" target="_blank" rel="noreferrer noopener">Qwen3-235B-A22B-Thinking-2507</a>. Thinking cannot be switched on or off. The model was trained with a new reinforcement learning algorithm called <a href="https://www.arxiv.org/abs/2507.18071" target="_blank" rel="noreferrer noopener">Group Sequence Policy Optimization</a>. It burns a lot of tokens, and it’s not very good at <a href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/#atom-everything" target="_blank" rel="noreferrer noopener">pelicans</a>.</li>
  2149.  
  2150.  
  2151.  
  2152. <li>ChatGPT is releasing “<a href="https://www.bleepingcomputer.com/news/artificial-intelligence/chatgpt-is-rolling-out-personality-toggles-to-become-your-assistant/" target="_blank" rel="noreferrer noopener">personalities</a>” that control how it formulates its responses. Users can select the personality they want to respond: robot, cynic, listener, sage, and presumably more.&nbsp;</li>
  2153.  
  2154.  
  2155.  
  2156. <li>DeepMind has created <a href="https://blog.google/technology/google-deepmind/aeneas/" target="_blank" rel="noreferrer noopener">Aeneas</a>, a new model designed to help scholars understand ancient fragments. In ancient text, large pieces are often missing. Can AI help place these fragments into contexts where they can be understood? Latin only, for now.</li>
  2157. </ul>
  2158.  
  2159.  
  2160.  
  2161. <h2 class="wp-block-heading">Security</h2>
  2162.  
  2163.  
  2164.  
  2165. <ul class="wp-block-list">
  2166. <li>The US Cybersecurity and Infrastructure Security Agency (CISA) has <a href="https://www.bleepingcomputer.com/news/security/cisa-warns-of-actively-exploited-git-code-execution-flaw/" target="_blank" rel="noreferrer noopener">warned</a> that a serious <a href="https://nvd.nist.gov/vuln/detail/cve-2025-48384" target="_blank" rel="noreferrer noopener">code execution vulnerability</a> in Git is currently being exploited in the wild.</li>
  2167.  
  2168.  
  2169.  
  2170. <li>Is it possible to build an agentic browser that is <a href="https://guard.io/labs/scamlexity-we-put-agentic-ai-browsers-to-the-test-they-clicked-they-paid-they-failed" target="_blank" rel="noreferrer noopener">safe</a> from prompt injection? <a href="https://brave.com/blog/comet-prompt-injection/" target="_blank" rel="noreferrer noopener">Probably</a> <a href="https://simonwillison.net/2025/Aug/25/agentic-browser-security/#atom-everything" target="_blank" rel="noreferrer noopener">not</a>. Separating user instructions from website content isn’t possible. If a browser can’t take direction from the content of a web page, how is it to act as an agent?</li>
  2171.  
  2172.  
  2173.  
  2174. <li>The solution to Part 4 of <a href="https://en.wikipedia.org/wiki/Kryptos" target="_blank" rel="noreferrer noopener">Kryptos</a>, the CIA’s decades-old cryptographic sculpture, is <a href="https://www.schneier.com/blog/archives/2025/08/jim-sanborn-is-auctioning-off-the-solution-to-part-four-of-the-kryptos-sculpture.html" target="_blank" rel="noreferrer noopener">for sale</a>! Jim Sanborn, the creator of Kryptos, is auctioning the solution. He hopes that the winner will preserve the secret and take over verifying people’s claims to have solved the puzzle.&nbsp;</li>
  2175.  
  2176.  
  2177.  
  2178. <li>Remember XZ, the supply-chain attack that granted backdoor access via a trojaned compression library? It <a href="https://www.binarly.io/blog/persistent-risk-xz-utils-backdoor-still-lurking-in-docker-images" target="_blank" rel="noreferrer noopener">never went away</a>. Although the affected libraries were quickly patched,&nbsp;it’s still active, and propagating, via Docker images that were built with unpatched libraries. Some gifts keep giving.</li>
  2179.  
  2180.  
  2181.  
  2182. <li>For August, <a href="https://embracethered.com/blog/" target="_blank" rel="noreferrer noopener"><em>Embrace the Red</em></a> published <a href="https://embracethered.com/blog/posts/2025/announcement-the-month-of-ai-bugs/" target="_blank" rel="noreferrer noopener">The Month of AI Bugs</a>, a daily post about AI vulnerabilities (mostly various forms of prompt injection). This series is essential reading for AI developers and for security professionals.</li>
  2183.  
  2184.  
  2185.  
  2186. <li>NIST has finalized a <a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-232.pdf" target="_blank" rel="noreferrer noopener">standard</a> for <a href="https://techxplore.com/news/2025-08-lightweight-cryptography-standard-small-devices.html" target="_blank" rel="noreferrer noopener">lightweight cryptography</a>. Lightweight cryptography is a cryptographic system designed for use by small devices. It is useful both for encrypting sensitive data and for authentication.&nbsp;</li>
  2187.  
  2188.  
  2189.  
  2190. <li>The <a href="https://darkpatternstipline.org/" target="_blank" rel="noreferrer noopener">Dark Patterns Tip Line</a> is a site for reporting dark patterns: design features in websites and applications that are designed to trick us into acting against our own interest.</li>
  2191.  
  2192.  
  2193.  
  2194. <li>OpenSSH supports <a href="https://www.openssh.com/pq.html" target="_blank" rel="noreferrer noopener">post-quantum key agreement</a>, and in versions 10.1 and later, will warn users when they select a non-post-quantum key agreement scheme.</li>
  2195.  
  2196.  
  2197.  
  2198. <li><a href="https://arstechnica.com/security/2025/08/adult-sites-use-malicious-svg-files-to-rack-up-likes-on-facebook/" target="_blank" rel="noreferrer noopener">SVG files can carry a malware payload</a>; pornographic SVGs include JavaScript payloads that automate clicking “like.” That’s a simple attack with few consequences, but much more is possible, including cross-site scripting, denial of service, and other exploits.</li>
  2199.  
  2200.  
  2201.  
  2202. <li>Google’s AI agent for discovering security flaws, <a href="https://cloud.google.com/blog/products/identity-security/cloud-ciso-perspectives-our-big-sleep-agent-makes-big-leap" target="_blank" rel="noreferrer noopener">Big Sleep</a>, has <a href="https://techcrunch.com/2025/08/04/google-says-its-ai-based-bug-hunter-found-20-security-vulnerabilities/" target="_blank" rel="noreferrer noopener">found 20 flaws</a> in popular software. DeepMind discovered and reproduced the flaws, which were then verified by human security experts and reported. Details won’t be provided until the flaws have been fixed.</li>
  2203.  
  2204.  
  2205.  
  2206. <li>The US CISA (Cybersecurity and Infrastructure Security Agency) has <a href="https://www.bleepingcomputer.com/news/security/cisa-open-sources-thorium-platform-for-malware-forensic-analysis/" target="_blank" rel="noreferrer noopener">open-sourced</a> <a href="https://www.cisa.gov/resources-tools/resources/thorium" target="_blank" rel="noreferrer noopener">Thorium</a>, a platform for malware and forensic analysis.</li>
  2207.  
  2208.  
  2209.  
  2210. <li>Prompt injection, again: A new prompt injection attack embeds <a href="https://bdtechtalks.com/2025/07/30/legalpwn-llm-prompt-injection/" target="_blank" rel="noreferrer noopener">instructions in language that appears to be copyright notices and other legal fine print</a>. To avoid litigation, many models are configured to prioritize legal instructions.</li>
  2211.  
  2212.  
  2213.  
  2214. <li>Light can be <a href="https://techxplore.com/news/2025-07-secret-codes-fake-videos.html" target="_blank" rel="noreferrer noopener">watermarked</a>; this may be useful as a technique for detecting fake or manipulated video.</li>
  2215.  
  2216.  
  2217.  
  2218. <li><a href="https://www.bleepingcomputer.com/news/security/ai-cuts-vciso-workload-by-68-percent-as-demand-skyrockets-new-report-finds/" target="_blank" rel="noreferrer noopener">vCISO (Virtual CISO) services are thriving</a>, particularly among small and mid-size businesses that can’t afford a full security team. The use of AI is cutting the vCISO workload. But who takes the blame when there’s an incident?</li>
  2219.  
  2220.  
  2221.  
  2222. <li>A <a href="https://www.bleepingcomputer.com/news/security/hackers-target-python-devs-in-phishing-attacks-using-fake-pypi-site/" target="_blank" rel="noreferrer noopener">phishing attack against PyPI users</a> directs them to a fake PyPI site that tells them to verify their login credentials. Stolen credentials could be used to plant malware in the genuine PyPI repository. Users of <a href="https://www.bleepingcomputer.com/news/security/mozilla-warns-of-phishing-attacks-targeting-add-on-developers/" target="_blank" rel="noreferrer noopener">Mozilla’s add-on repository</a> have also been targeted by phishing attacks.</li>
  2223.  
  2224.  
  2225.  
  2226. <li>A new ransomware group named <a href="https://arstechnica.com/security/2025/07/after-blacksuit-is-taken-down-new-ransomware-group-chaos-emerges/" target="_blank" rel="noreferrer noopener">Chaos</a> appears to be a rebranding of the BlackSuit group, which was taken down recently. BlackSuit itself is a rebranding of the Royal group, which in turn is a descendant of the Conti group. Whack-a-mole continues.</li>
  2227.  
  2228.  
  2229.  
  2230. <li>Google’s <a href="https://security.googleblog.com/2025/07/introducing-oss-rebuild-open-source.html" target="_blank" rel="noreferrer noopener">OSS Rebuild</a> project is an important step forward in supply chain security. Rebuild provides build definitions along with metadata that can confirm projects were built correctly. OSS Rebuild currently supports the NPM, PyPl, and Crates ecosystems.</li>
  2231.  
  2232.  
  2233.  
  2234. <li>The <a href="https://www.bleepingcomputer.com/news/security/npm-package-is-with-28m-weekly-downloads-infected-devs-with-malware/" target="_blank" rel="noreferrer noopener">JavaScript package “is,</a>” which does some simple type checking, has been infected with malware. Supply chain security is a huge issue—be careful what you install!</li>
  2235. </ul>
  2236.  
  2237.  
  2238.  
  2239. <h2 class="wp-block-heading">Programming</h2>
  2240.  
  2241.  
  2242.  
  2243. <ul class="wp-block-list">
  2244. <li><a href="https://github.com/automazeio/ccpm" target="_blank" rel="noreferrer noopener">Claude Code PM</a> is a workflow management system for programming with Claude. It manages PRDs, GitHub, and parallel execution of coding agents. It claims to facilitate collaboration between multiple Claude instances working on the same project.&nbsp;</li>
  2245.  
  2246.  
  2247.  
  2248. <li>Rust is increasingly used to <a href="https://thenewstack.io/rust-pythons-new-performance-engine/" target="_blank" rel="noreferrer noopener">implement performance-critical extensions</a> to Python, gradually displacing C. Polars, Pydantic, and FastAPI are three libraries that rely on Rust.</li>
  2249.  
  2250.  
  2251.  
  2252. <li>Microsoft’s <a href="https://microsoft.github.io/poml/latest/" target="_blank" rel="noreferrer noopener">Prompt Orchestration Markup Language</a> (<a href="https://medium.com/data-science-in-your-pocket/microsoft-poml-programming-language-for-prompting-adfc846387a4" target="_blank" rel="noreferrer noopener">POML</a>) is an HTML-like markup language for writing prompts. It is then compiled into the actual prompt. POML is good at templating and has tags for tabular and document data. Is this a step forward? You be the judge.</li>
  2253.  
  2254.  
  2255.  
  2256. <li><a href="https://claudiacode.com/" target="_blank" rel="noreferrer noopener">Claudia</a> is an “elegant desktop companion” for Claude Code; it turns terminal-based Claude Code into something more like an IDE, though it seems to focus more on the workflow than on coding.</li>
  2257.  
  2258.  
  2259.  
  2260. <li>Google’s <a href="https://developers.googleblog.com/en/introducing-langextract-a-gemini-powered-information-extraction-library/" target="_blank" rel="noreferrer noopener">LangExtract</a> is a simple but powerful Python library for extracting text from documents.&nbsp;It relies on examples, rather than regular expressions or other hacks, and shows the exact context in which the extracts occur. LangExtract is open source.</li>
  2261.  
  2262.  
  2263.  
  2264. <li>Microsoft appears to be <a href="https://www.theverge.com/news/757461/microsoft-github-thomas-dohmke-resignation-coreai-team-transition" target="_blank" rel="noreferrer noopener">integrating GitHub into its AI team</a> rather than running it as an independent organization. What this means for GitHub users is unclear.&nbsp;</li>
  2265.  
  2266.  
  2267.  
  2268. <li>Cursor now has a <a href="https://cursor.com/en/cli" target="_blank" rel="noreferrer noopener">command-line interface</a>, almost certainly a belated response to the success of Claude Code CLI and Gemini CLI.&nbsp;</li>
  2269.  
  2270.  
  2271.  
  2272. <li><a href="https://thenewstack.io/why-latency-is-quietly-breaking-enterprise-ai-at-scale/" target="_blank" rel="noreferrer noopener">Latency</a> is a problem for enterprise AI. And the root cause of latency in AI applications is usually the database.</li>
  2273.  
  2274.  
  2275.  
  2276. <li>The <a href="https://www.musicradar.com/music-tech/weve-been-sleeping-for-30-years-please-excuse-us-the-commodore-64-is-back-packed-with-extra-power-for-chiptune-music-makers" target="_blank" rel="noreferrer noopener">Commodore 64</a> is back. With several orders of magnitude more RAM. And all the original ports, plus HDMI.&nbsp;</li>
  2277.  
  2278.  
  2279.  
  2280. <li>Google has <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/" target="_blank" rel="noreferrer noopener">announced</a> Gemini CLI GitHub Actions, an addition to their agentic coder that allows it to work directly with GitHub repositories.&nbsp;</li>
  2281.  
  2282.  
  2283.  
  2284. <li><a href="https://www.infoworld.com/article/4029053/jetbrains-working-on-higher-abstraction-programming-language.html" target="_blank" rel="noreferrer noopener">JetBrains is developing a new programming language</a> for use when programming with LLMs.&nbsp;That language may be a dialect of English. (<a href="https://www.oreilly.com/radar/formal-informal-languages/" target="_blank" rel="noreferrer noopener">Formal informal languages</a>, anyone?)&nbsp;</li>
  2285.  
  2286.  
  2287.  
  2288. <li><a href="https://www.ponylang.io/discover/" target="_blank" rel="noreferrer noopener">Pony</a> is a new programming language that is type-safe, memory-safe, exception-safe, race-safe, and deadlock-safe. You can <a href="https://playground.ponylang.io/" target="_blank" rel="noreferrer noopener">try</a> it in a browser-based playground.</li>
  2289. </ul>
  2290.  
  2291.  
  2292.  
  2293. <h2 class="wp-block-heading">Web</h2>
  2294.  
  2295.  
  2296.  
  2297. <ul class="wp-block-list">
  2298. <li>The AT Protocol is the core of Bluesky. Here’s a <a href="https://mackuba.eu/2025/08/20/introduction-to-atproto/" target="_blank" rel="noreferrer noopener">tutorial</a>; use it to build your own Bluesky services, in turn making Bluesky truly federate.&nbsp;</li>
  2299.  
  2300.  
  2301.  
  2302. <li>Social media is broken, and <a href="https://arstechnica.com/science/2025/08/study-social-media-probably-cant-be-fixed/" target="_blank" rel="noreferrer noopener">probably can’t be fixed</a>. Now you know. The surprise is that the problem isn’t “algorithms” for maximizing engagement; take algorithms away and everything stays the same or gets worse.&nbsp;</li>
  2303.  
  2304.  
  2305.  
  2306. <li>The <a href="https://waxy.org/2025/08/vote-on-the-2025-tiny-awards-finalists/" target="_blank" rel="noreferrer noopener">Tiny Awards Finalists</a> show just how much is possible on the Web. They’re moving, creative, and playful. For example, the <a href="https://www.trafficcamphotobooth.com/about.html" target="_blank" rel="noreferrer noopener">Traffic Cam Photobooth</a> lets people use traffic cameras to take pictures of themselves, playing with ever-present automated surveillance.</li>
  2307.  
  2308.  
  2309.  
  2310. <li>A US federal court has <a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.372884/gov.uscourts.cand.372884.756.0_2.pdf" target="_blank" rel="noreferrer noopener">found</a> that <a href="https://arstechnica.com/tech-policy/2025/08/jury-finds-meta-broke-wiretap-law-by-collecting-data-from-period-tracker-app/" target="_blank" rel="noreferrer noopener">Facebook illegally collected data</a> from the women’s health app Flo.&nbsp;</li>
  2311.  
  2312.  
  2313.  
  2314. <li>The <a href="https://www.htmlhobbyist.com/" target="_blank" rel="noreferrer noopener">HTML Hobbyist</a> is a great site for people who want to create their own presence on the web—outside of walled gardens, without mind-crushing frameworks. It’s not difficult, and it’s not expensive.</li>
  2315. </ul>
  2316.  
  2317.  
  2318.  
  2319. <h2 class="wp-block-heading">Biology and Quantum Computing</h2>
  2320.  
  2321.  
  2322.  
  2323. <ul class="wp-block-list">
  2324. <li>Scientists have created <a href="https://phys.org/news/2025-08-scientists-cells-biological-qubit-multidisciplinary.html" target="_blank" rel="noreferrer noopener">biological qubits</a>: quantum qubits built from proteins in living cells. These probably won’t be used to break cryptography, but they are likely to give us insight into how quantum processes work inside living things.</li>
  2325. </ul>
  2326. ]]></content:encoded>
  2327. </item>
  2328. <item>
  2329. <title>Working with Contexts</title>
  2330. <link>https://www.oreilly.com/radar/working-with-contexts/</link>
  2331. <pubDate>Thu, 28 Aug 2025 10:02:43 +0000</pubDate>
  2332. <dc:creator><![CDATA[Drew Breunig]]></dc:creator>
  2333. <category><![CDATA[AI & ML]]></category>
  2334. <category><![CDATA[Deep Dive]]></category>
  2335.  
  2336. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17373</guid>
  2337.  
  2338.     <media:content
  2339. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-3.jpg"
  2340. medium="image"
  2341. type="image/jpeg"
  2342. />
  2343. <description><![CDATA[The following article comes from two blog posts by Drew Breunig: “How Long Contexts Fail” and “How to Fix Your Contexts.” Managing Your Context Is the Key to Successful Agents As frontier model context windows continue to grow,1 with many supporting up to 1 million tokens, I see many excited discussions about how long-context windows [&#8230;]]]></description>
  2344. <content:encoded><![CDATA[
  2345. <p class="has-cyan-bluish-gray-background-color has-background"><em>The following article comes from two blog posts by Drew Breunig: “<a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">How Long Contexts Fail</a>” and “<a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html" target="_blank" rel="noreferrer noopener">How to Fix Your Contexts</a>.”</em></p>
  2346.  
  2347.  
  2348.  
  2349. <h2 class="wp-block-heading">Managing Your Context Is the Key to Successful Agents</h2>
  2350.  
  2351.  
  2352.  
  2353. <p>As frontier model context windows continue to grow,<sup>1</sup> with many supporting up to 1 million tokens, I see many excited discussions about how long-context windows will unlock the agents of our dreams. After all, with a large enough window, you can simply throw&nbsp;<em>everything</em>&nbsp;into a prompt you might need—tools, documents, instructions, and more—and let the model take care of the rest.</p>
  2354.  
  2355.  
  2356.  
  2357. <p>Long contexts kneecapped RAG enthusiasm (no need to find the best doc when you can fit it all in the prompt!), enabled MCP hype (connect to every tool and models can do any job!), and fueled enthusiasm for agents.<sup>2</sup></p>
  2358.  
  2359.  
  2360.  
  2361. <p>But in reality, longer contexts do not generate better responses. Overloading your context can cause your agents and applications to fail in surprising ways. Contexts can become poisoned, distracting, confusing, or conflicting. This is especially problematic for agents, which rely on context to gather information, synthesize findings, and coordinate actions.</p>
  2362.  
  2363.  
  2364.  
  2365. <p>Let’s run through the ways contexts can get out of hand, then review methods to mitigate or entirely avoid context fails.</p>
  2366.  
  2367.  
  2368.  
  2369. <h3 class="wp-block-heading">Context Poisoning</h3>
  2370.  
  2371.  
  2372.  
  2373. <p><em>Context poisoning is when a hallucination or other error makes it into the context, where it is repeatedly referenced.</em></p>
  2374.  
  2375.  
  2376.  
  2377. <p>The DeepMind team called out context poisoning in the&nbsp;<a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf" target="_blank" rel="noreferrer noopener">Gemini 2.5 technical report</a>, which&nbsp;<a href="https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html" target="_blank" rel="noreferrer noopener">we broke down previously</a>. When playing Pokémon, the Gemini agent would occasionally hallucinate, poisoning its context:</p>
  2378.  
  2379.  
  2380.  
  2381. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  2382. <p>An especially egregious form of this issue can take place with “context poisoning”—where many parts of the context (goals, summary) are “poisoned” with misinformation about the game state, which can often take a very long time to undo. As a result, the model can become fixated on achieving impossible or irrelevant goals.</p>
  2383. </blockquote>
  2384.  
  2385.  
  2386.  
  2387. <p>If the “goals” section of its context was poisoned, the agent would develop nonsensical strategies and repeat behaviors in pursuit of a goal that cannot be met.</p>
  2388.  
  2389.  
  2390.  
  2391. <h3 class="wp-block-heading">Context Distraction</h3>
  2392.  
  2393.  
  2394.  
  2395. <p><em>Context distraction is when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.</em></p>
  2396.  
  2397.  
  2398.  
  2399. <p>As context grows during an agentic workflow—as the model gathers more information and builds up history—this accumulated context can become distracting rather than helpful. The Pokémon-playing Gemini agent demonstrated this problem clearly:</p>
  2400.  
  2401.  
  2402.  
  2403. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  2404. <p>While Gemini 2.5 Pro supports 1M+ token context, making effective use of it for agents presents a new research frontier. In this agentic setup, it was observed that as the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans. This phenomenon, albeit anecdotal, highlights an important distinction between long-context for retrieval and long-context for multistep, generative reasoning.</p>
  2405. </blockquote>
  2406.  
  2407.  
  2408.  
  2409. <p>Instead of using its training to develop new strategies, the agent became fixated on repeating past actions from its extensive context history.</p>
  2410.  
  2411.  
  2412.  
  2413. <p>For smaller models, the distraction ceiling is much lower. A&nbsp;<a href="https://www.databricks.com/blog/long-context-rag-performance-llms" target="_blank" rel="noreferrer noopener">Databricks study</a>&nbsp;found that model correctness began to fall around 32k for Llama 3.1-405b and earlier for smaller models.</p>
  2414.  
  2415.  
  2416.  
  2417. <p>If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarization<sup>3</sup>&nbsp;and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.</p>
  2418.  
  2419.  
  2420.  
  2421. <h3 class="wp-block-heading">Context Confusion</h3>
  2422.  
  2423.  
  2424.  
  2425. <p><em>Context confusion is when superfluous content in the context is used by the model to generate a low-quality response.</em></p>
  2426.  
  2427.  
  2428.  
  2429. <p>For a minute there, it really seemed like&nbsp;<em>everyone</em>&nbsp;was going to ship an&nbsp;<a href="https://www.dbreunig.com/2025/03/18/mcps-are-apis-for-llms.html" target="_blank" rel="noreferrer noopener">MCP</a>. The dream of a powerful model, connected to&nbsp;<em>all</em>&nbsp;your services and&nbsp;<em>stuff</em>, doing all your mundane tasks felt within reach. Just throw all the tool descriptions into the prompt and hit go.&nbsp;<a href="https://www.dbreunig.com/2025/05/07/claude-s-system-prompt-chatbots-are-more-than-just-models.html" target="_blank" rel="noreferrer noopener">Claude’s system prompt</a>&nbsp;showed us the way, as it’s mostly tool definitions or instructions for using tools.</p>
  2430.  
  2431.  
  2432.  
  2433. <p>But even if&nbsp;<a href="https://www.dbreunig.com/2025/06/16/drawbridges-go-up.html" target="_blank" rel="noreferrer noopener">consolidation and competition don’t slow MCPs</a>,&nbsp;<em>context confusion</em>&nbsp;will. It turns out there can be such a thing as too many tools.</p>
  2434.  
  2435.  
  2436.  
  2437. <p>The&nbsp;<a href="https://gorilla.cs.berkeley.edu/leaderboard.html" target="_blank" rel="noreferrer noopener">Berkeley Function-Calling Leaderboard</a>&nbsp;is a tool-use benchmark that evaluates the ability of models to effectively use tools to respond to prompts. Now on its third version, the leaderboard shows that&nbsp;<em>every</em>&nbsp;model performs worse when provided with more than one tool.<sup>4</sup> Further, the Berkeley team, “designed scenarios where none of the provided functions are relevant…we expect the model’s output to be no function call.” Yet, all models will occasionally call tools that aren’t relevant.</p>
  2438.  
  2439.  
  2440.  
  2441. <p>Browsing the function-calling leaderboard, you can see the problem get worse as the models get smaller:</p>
  2442.  
  2443.  
  2444.  
  2445. <figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="866" height="358" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models.png" alt="Tool-calling irrelevance score for Gemma models (chart from dbreunig.com, source: Berkeley Function-Calling Leaderboard; created with Datawrapper)" class="wp-image-17374" title="Tool-Calling Irrelevance Score for Gemma Models" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models.png 866w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models-300x124.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Tool-Calling-Irrelevance-Score-for-Gemma-Models-768x317.png 768w" sizes="auto, (max-width: 866px) 100vw, 866px" /></figure>
  2446.  
  2447.  
  2448.  
  2449. <p>A striking example of context confusion can be seen in a&nbsp;<a href="https://arxiv.org/pdf/2411.15399?" target="_blank" rel="noreferrer noopener">recent paper</a>&nbsp;that evaluated small model performance on the&nbsp;<a href="https://arxiv.org/abs/2404.15500" target="_blank" rel="noreferrer noopener">GeoEngine benchmark</a>, a trial that features&nbsp;<em>46 different tools</em>. When the team gave a quantized (compressed) Llama 3.1 8b a query with all 46 tools, it failed, even though the context was well within the 16k context window. But when they only gave the model 19 tools, it succeeded.</p>
  2450.  
  2451.  
  2452.  
  2453. <p>The problem is, if you put something in the context,&nbsp;<em>the model has to pay attention to it.</em>&nbsp;It may be irrelevant information or needless tool definitions, but the model&nbsp;<em>will</em>&nbsp;take it into account. Large models, especially reasoning models, are getting better at ignoring or discarding superfluous context, but we continually see worthless information trip up agents. Longer contexts let us stuff in more info, but this ability comes with downsides.</p>
  2454.  
  2455.  
  2456.  
  2457. <h3 class="wp-block-heading">Context Clash</h3>
  2458.  
  2459.  
  2460.  
  2461. <p><em>Context clash is when you accrue new information and tools in your context that conflicts with other information in the context.</em></p>
  2462.  
  2463.  
  2464.  
  2465. <p>This is a more problematic version of&nbsp;<em>context confusion</em>. The bad context here isn’t irrelevant, it directly conflicts with other information in the prompt.</p>
  2466.  
  2467.  
  2468.  
  2469. <p>A Microsoft and Salesforce team documented this brilliantly in a&nbsp;<a href="https://arxiv.org/pdf/2505.06120" target="_blank" rel="noreferrer noopener">recent paper</a>. The team took prompts from multiple benchmarks and “sharded” their information across multiple prompts. Think of it this way: Sometimes, you might sit down and type paragraphs into ChatGPT or Claude before you hit enter, considering every necessary detail. Other times, you might start with a simple prompt, then add further details when the chatbot’s answer isn’t satisfactory. The Microsoft/Salesforce team modified benchmark prompts to look like these multistep exchanges:</p>
  2470.  
  2471.  
  2472.  
  2473. <figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="867" height="196" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts.png" alt="Microsoft/Salesforce team benchmark prompts" class="wp-image-17375" title="Microsoft/Salesforce team benchmark prompts" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts.png 867w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts-300x68.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/MicrosoftSalesforce-team-benchmark-prompts-768x174.png 768w" sizes="auto, (max-width: 867px) 100vw, 867px" /></figure>
  2474.  
  2475.  
  2476.  
  2477. <p>All the information from the prompt on the left side is contained within the several messages on the right side, which would be played out in multiple chat rounds.</p>
  2478.  
  2479.  
  2480.  
  2481. <p>The sharded prompts yielded dramatically worse results, with an average drop of 39%. And the team tested a range of models—OpenAI’s vaunted o3’s score dropped from 98.1 to 64.1.</p>
  2482.  
  2483.  
  2484.  
  2485. <p>What’s going on? Why are models performing worse if information is gathered in stages rather than all at once?</p>
  2486.  
  2487.  
  2488.  
  2489. <p>The answer is&nbsp;<em>context confusion</em>: The assembled context, containing the entirety of the chat exchange, contains early attempts by the model to answer the challenge&nbsp;<em>before it has all the information</em>. These incorrect answers remain present in the context and influence the model when it generates its final answer. The team writes:</p>
  2490.  
  2491.  
  2492.  
  2493. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  2494. <p>We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.</p>
  2495. </blockquote>
  2496.  
  2497.  
  2498.  
  2499. <p>This does not bode well for agent builders. Agents assemble context from documents, tool calls, and from other models tasked with subproblems. All of this context, pulled from diverse sources, has the potential to disagree with itself. Further, when you connect to MCP tools you didn’t create there’s a greater chance their descriptions and instructions clash with the rest of your prompt.</p>
  2500.  
  2501.  
  2502.  
  2503. <h2 class="wp-block-heading">Learnings</h2>
  2504.  
  2505.  
  2506.  
  2507. <p>The arrival of million-token context windows felt transformative. The ability to throw everything an agent might need into the prompt inspired visions of superintelligent assistants that could access any document, connect to every tool, and maintain perfect memory.</p>
  2508.  
  2509.  
  2510.  
  2511. <p>But, as we’ve seen, bigger contexts create new failure modes. Context poisoning embeds errors that compound over time. Context distraction causes agents to lean heavily on their context and repeat past actions rather than push forward. Context confusion leads to irrelevant tool or document usage. Context clash creates internal contradictions that derail reasoning.</p>
  2512.  
  2513.  
  2514.  
  2515. <p>These failures hit agents hardest because agents operate in exactly the scenarios where contexts balloon: gathering information from multiple sources, making sequential tool calls, engaging in multi-turn reasoning, and accumulating extensive histories.</p>
  2516.  
  2517.  
  2518.  
  2519. <p>Fortunately, there are solutions!</p>
  2520.  
  2521.  
  2522.  
  2523. <h2 class="wp-block-heading">Mitigating and Avoiding Context Failures</h2>
  2524.  
  2525.  
  2526.  
  2527. <p>Let’s run through the ways we can mitigate or avoid context failures entirely.</p>
  2528.  
  2529.  
  2530.  
  2531. <p>Everything is about information management. Everything in the context influences the response. We’re back to the old programming adage of “<a href="https://en.wikipedia.org/wiki/Garbage_in,_garbage_out" target="_blank" rel="noreferrer noopener">garbage in, garbage out</a>.” Thankfully, there’s plenty of options for dealing with the issues above.</p>
  2532.  
  2533.  
  2534.  
  2535. <h3 class="wp-block-heading">RAG</h3>
  2536.  
  2537.  
  2538.  
  2539. <p><em>Retrieval-augmented generation (RAG) is the act of selectively adding relevant information to help the LLM generate a better response.</em></p>
  2540.  
  2541.  
  2542.  
  2543. <p>Because so much has been written about RAG, we’re not going to cover it here beyond saying: It’s very much alive.</p>
  2544.  
  2545.  
  2546.  
  2547. <p>Every time a model ups the context window ante, a new “RAG is dead” debate is born. The last significant event was when Llama 4 Scout landed with a&nbsp;<em>10 million token window</em>. At that size, it’s&nbsp;<em>really</em>&nbsp;tempting to think, “Screw it, throw it all in,” and call it a day.</p>
  2548.  
  2549.  
  2550.  
  2551. <p>But, as we&#8217;ve already covered, if you treat your context like a junk drawer, the junk will&nbsp;<a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html#context-confusion" target="_blank" rel="noreferrer noopener">influence your response</a>. If you want to learn more, here’s a&nbsp;<a href="https://maven.com/p/569540/i-don-t-use-rag-i-just-retrieve-documents" target="_blank" rel="noreferrer noopener">new course that looks great</a>.</p>
  2552.  
  2553.  
  2554.  
  2555. <h3 class="wp-block-heading">Tool Loadout</h3>
  2556.  
  2557.  
  2558.  
  2559. <p><em>Tool loadout is the act of selecting only relevant tool definitions to add to your context.</em></p>
  2560.  
  2561.  
  2562.  
  2563. <p>The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round. Usually, your loadout is tailored to the context—the character, the level, the rest of your team’s makeup, and your own skill set. Here, we’re borrowing the term to describe selecting the most relevant tools for a given task.</p>
  2564.  
  2565.  
  2566.  
  2567. <p>Perhaps the simplest way to select tools is to apply RAG to your tool descriptions. This is exactly what Tiantian Gan and Qiyao Sun did, which they detail in their paper “<a href="https://arxiv.org/abs/2505.03275" target="_blank" rel="noreferrer noopener">RAG MCP</a>.” By storing their tool descriptions in a vector database, they’re able to select the most relevant tools given an input prompt.</p>
  2568.  
  2569.  
  2570.  
  2571. <p>When prompting DeepSeek-v3, the team found that selecting the right tools becomes critical when you have more than 30 tools. Above 30, the descriptions of the tools begin to overlap, creating confusion. Beyond&nbsp;<em>100 tools</em>, the model was virtually guaranteed to fail their test. Using RAG techniques to select fewer than 30 tools yielded dramatically shorter prompts and resulted in as much as 3x better tool selection accuracy.</p>
  2572.  
  2573.  
  2574.  
  2575. <p>For smaller models, the problems begin long before we hit 30 tools. One paper we touched on previously, “<a href="https://arxiv.org/abs/2411.15399" target="_blank" rel="noreferrer noopener">Less is More</a>,” demonstrated that Llama 3.1 8b fails a benchmark when given 46 tools, but succeeds when given only 19 tools. The issue is context confusion,&nbsp;<em>not</em>&nbsp;context window limitations.</p>
  2576.  
  2577.  
  2578.  
  2579. <p>To address this issue, the team behind “Less is More” developed a way to dynamically select tools using an LLM-powered tool recommender. The LLM was prompted to reason about “number and type of tools it ‘believes’ it requires to answer the user’s query.” This output was then semantically searched (tool RAG, again) to determine the final loadout. They tested this method with the&nbsp;<a href="https://gorilla.cs.berkeley.edu/leaderboard.html" target="_blank" rel="noreferrer noopener">Berkeley Function-Calling Leaderboard</a>, finding Llama 3.1 8b performance improved by 44%.</p>
  2580.  
  2581.  
  2582.  
  2583. <p>The “Less is More” paper notes two other benefits to smaller contexts—reduced power consumption and speed—crucial metrics when operating at the edge (meaning, running an LLM on your phone or PC, not on a specialized server). Even when their dynamic tool selection method&nbsp;<em>failed</em>&nbsp;to improve a model’s result, the power savings and speed gains were worth the effort, yielding savings of 18% and 77%, respectively.</p>
  2584.  
  2585.  
  2586.  
  2587. <p>Thankfully, most agents have smaller surface areas that only require a few hand-curated tools. But if the breadth of functions or the amount of integrations needs to expand, always consider your loadout.</p>
  2588.  
  2589.  
  2590.  
  2591. <h3 class="wp-block-heading">Context Quarantine</h3>
  2592.  
  2593.  
  2594.  
  2595. <p><em>Context quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs.</em></p>
  2596.  
  2597.  
  2598.  
  2599. <p>We see better results when our contexts aren’t too long and don’t sport irrelevant content. One way to achieve this is to break our tasks up into smaller, isolated jobs—each with its own context.</p>
  2600.  
  2601.  
  2602.  
  2603. <p>There are&nbsp;<a href="https://arxiv.org/abs/2402.14207" target="_blank" rel="noreferrer noopener">many</a>&nbsp;<a href="https://www.microsoft.com/en-us/research/articles/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/" target="_blank" rel="noreferrer noopener">examples</a>&nbsp;of this tactic, but an accessible write-up of this strategy is Anthropic’s&nbsp;<a href="https://www.anthropic.com/engineering/built-multi-agent-research-system" target="_blank" rel="noreferrer noopener">blog post detailing its multi-agent research system</a>. They write:</p>
  2604.  
  2605.  
  2606.  
  2607. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  2608. <p>The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.</p>
  2609. </blockquote>
  2610.  
  2611.  
  2612.  
  2613. <p>Research lends itself to this design pattern. When given a question, multiple agents can identify and&nbsp;separately prompt&nbsp;several&nbsp;subquestions or areas of exploration. This not only speeds up the information gathering and distillation (if there’s compute available), but it keeps each context from accruing too much information or information not relevant to a given prompt, delivering higher quality results:</p>
  2614.  
  2615.  
  2616.  
  2617. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  2618. <p>Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&amp;P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single-agent system failed to find the answer with slow, sequential searches.</p>
  2619. </blockquote>
  2620.  
  2621.  
  2622.  
  2623. <p>This approach also helps with tool loadouts, as the agent designer can create several agent archetypes with their own dedicated loadout and instructions for how to utilize each tool.</p>
  2624.  
  2625.  
  2626.  
  2627. <p>The challenge for agent builders, then, is to find opportunities for isolated tasks to spin out onto separate threads. Problems that require context-sharing among multiple agents aren’t particularly suited to this tactic.</p>
  2628.  
  2629.  
  2630.  
  2631. <p>If your agent’s domain is at all suited to parallelization, be sure to <a href="https://www.anthropic.com/engineering/built-multi-agent-research-system" target="_blank" rel="noreferrer noopener">read the whole Anthropic write-up</a>. It’s excellent.</p>
  2632.  
  2633.  
  2634.  
  2635. <h3 class="wp-block-heading">Context Pruning</h3>
  2636.  
  2637.  
  2638.  
  2639. <p><em>Context pruning is the act of removing irrelevant or otherwise unneeded information from the context.</em></p>
  2640.  
  2641.  
  2642.  
  2643. <p>Agents accrue context as they fire off tools and assemble documents. At times, it’s worth pausing to assess what’s been assembled and remove the cruft. This could be something you task your main LLM with or you could design a separate LLM-powered tool to review and edit the context. Or you could choose something more tailored to the pruning task.</p>
  2644.  
  2645.  
  2646.  
  2647. <p>Context pruning has a (relatively) long history, as context lengths were a more problematic bottleneck in the natural language processing (NLP) field prior to ChatGPT. Building on this history, a current pruning method is&nbsp;<a href="https://arxiv.org/abs/2501.16214" target="_blank" rel="noreferrer noopener">Provence</a>, “an efficient and robust context pruner for question answering.”</p>
  2648.  
  2649.  
  2650.  
  2651. <p>Provence is fast, accurate, simple to use, and relatively small—only 1.75 GB. You can call it in a few lines, like so:</p>
  2652.  
  2653.  
  2654.  
  2655. <pre class="wp-block-code"><code>from transformers import AutoModel
  2656.  
  2657. provence = AutoModel.from_pretrained("naver/provence-reranker-debertav3-v1", trust_remote_code=True)
  2658.  
  2659. # <em>Read in a markdown version of the Wikipedia entry for Alameda, CA</em>
  2660. with open('alameda_wiki.md', 'r', encoding='utf-8') as f:
  2661.    alameda_wiki = f.read()
  2662.  
  2663. # <em>Prune the article, given a question</em>
  2664. question = 'What are my options for leaving Alameda?'
  2665. provence_output = provence.process(question, alameda_wiki)</code></pre>
  2666.  
  2667.  
  2668.  
  2669. <p>Provence edited the article, cutting 95% of the content, leaving me with only&nbsp;<a href="https://gist.github.com/dbreunig/b3bdd9eb34bc264574954b2b954ebe83" target="_blank" rel="noreferrer noopener">this relevant subset</a>. It nailed it.</p>
  2670.  
  2671.  
  2672.  
  2673. <p>One could employ Provence or a similar function to cull documents or the entire context. Further, this pattern is a strong argument for maintaining a&nbsp;<em>structured</em><sup>5</sup>&nbsp;version of your context in a dictionary or other form, from which you assemble a compiled string prior to every LLM call. This structure would come in handy when pruning, allowing you to ensure the main instructions and goals are preserved while the document or history sections can be pruned or summarized.</p>
  2674.  
  2675.  
  2676.  
  2677. <h3 class="wp-block-heading">Context Summarization</h3>
  2678.  
  2679.  
  2680.  
  2681. <p><em>Context summarization is the act of boiling down an accrued context into a condensed summary.</em></p>
  2682.  
  2683.  
  2684.  
  2685. <p>Context summarization first appeared as a tool for dealing with smaller context windows. As your chat session came close to exceeding the maximum context length, a summary would be generated and a new thread would begin. Chatbot users did this manually in ChatGPT or Claude, asking the bot to generate a short recap that would then be pasted into a new session.</p>
  2686.  
  2687.  
  2688.  
  2689. <p>However, as context windows increased, agent builders discovered there are&nbsp;benefits to summarization besides staying within the total context limit. As we&#8217;ve seen, beyond 100,000 tokens the context becomes distracting&nbsp;and causes the agent to rely on its accumulated history rather than training. Summarization can help it &#8220;start over&#8221; and avoid repeating context-based actions.</p>
  2690.  
  2691.  
  2692.  
  2693. <p>Summarizing your context is easy to do, but hard to perfect for any given agent. Knowing what information should be preserved and detailing that to an LLM-powered compression step is critical for agent builders. It’s worth breaking out this function as its own LLM-powered stage or app, which allows you to collect evaluation data that can inform and optimize this task directly.</p>
  2694.  
  2695.  
  2696.  
  2697. <h3 class="wp-block-heading">Context Offloading</h3>
  2698.  
  2699.  
  2700.  
  2701. <p><em>Context offloading is the act of storing information outside the LLM’s context, usually via a tool that stores and manages the data.</em></p>
  2702.  
  2703.  
  2704.  
  2705. <p>This might be my favorite tactic, if only because it’s so&nbsp;<em>simple</em>&nbsp;you don’t believe it will work.</p>
  2706.  
  2707.  
  2708.  
  2709. <p>Again, <a href="https://www.anthropic.com/engineering/claude-think-tool" target="_blank" rel="noreferrer noopener">Anthropic has a good write-up of the technique</a>, which details their “think” tool, which is basically a scratchpad:</p>
  2710.  
  2711.  
  2712.  
  2713. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  2714. <p>With the “think” tool, we’re giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer… This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user.</p>
  2715. </blockquote>
  2716.  
  2717.  
  2718.  
  2719. <p>I really appreciate the research and other writing Anthropic publishes, but I’m not a fan of this tool’s name. If this tool were called&nbsp;<code>scratchpad</code>, you’d know its function&nbsp;<em>immediately</em>. It’s a place for the model to write down notes that don’t cloud its context and are available for later reference. The name “think” clashes with “<a href="https://www.anthropic.com/news/visible-extended-thinking" target="_blank" rel="noreferrer noopener">extended thinking</a>” and needlessly anthropomorphizes the model… but I digress.</p>
  2720.  
  2721.  
  2722.  
  2723. <p>Having a space to log notes and progress&nbsp;<em>works</em>. Anthropic shows pairing the “think” tool with a domain-specific prompt (which you’d do anyway in an agent) yields significant gains: up to a 54% improvement against a benchmark for specialized agents.</p>
  2724.  
  2725.  
  2726.  
  2727. <p>Anthropic identified three scenarios where the context offloading pattern is useful:</p>
  2728.  
  2729.  
  2730.  
  2731. <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
  2732. <ol class="wp-block-list">
  2733. <li>Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;</li>
  2734.  
  2735.  
  2736.  
  2737. <li>Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and</li>
  2738.  
  2739.  
  2740.  
  2741. <li>Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).</li>
  2742. </ol>
  2743. </blockquote>
  2744.  
  2745.  
  2746.  
  2747. <h2 class="wp-block-heading">Takeaways</h2>
  2748.  
  2749.  
  2750.  
  2751. <p>Context management is usually the hardest part of building an agent. Programming the LLM to, as Karpathy says, “<a href="https://x.com/karpathy/status/1937902205765607626" target="_blank" rel="noreferrer noopener">pack the context windows just right</a>,” smartly deploying tools, information, and regular context maintenance, is&nbsp;<em>the</em>&nbsp;job of the agent designer.</p>
  2752.  
  2753.  
  2754.  
  2755. <p>The key insight across all the above tactics is that&nbsp;<em>context is not free</em>. Every token in the context influences the model’s behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.</p>
  2756.  
  2757.  
  2758.  
  2759. <p>As you build your next agent or optimize an existing one, ask yourself: Is everything in this context earning its keep? If not, you now have six ways to fix it.</p>
  2760.  
  2761.  
  2762.  
  2763. <hr class="wp-block-separator has-alpha-channel-opacity is-style-wide"/>
  2764.  
  2765.  
  2766.  
  2767. <h2 class="wp-block-heading">Footnotes</h2>
  2768.  
  2769.  
  2770.  
  2771. <ol class="wp-block-list">
  2772. <li>Gemini 2.5 and GPT-4.1 have 1 million token context windows, large enough to throw&nbsp;<a href="https://en.wikipedia.org/wiki/Infinite_Jest" target="_blank" rel="noreferrer noopener">Infinite Jest</a>&nbsp;in there with plenty of room to spare.</li>
  2773.  
  2774.  
  2775.  
  2776. <li>The “<a href="https://ai.google.dev/gemini-api/docs/long-context#long-form-text" target="_blank" rel="noreferrer noopener">Long form text</a>” section in the Gemini docs sum up this optmism nicely.</li>
  2777.  
  2778.  
  2779.  
  2780. <li>In fact, in the Databricks study cited above, a frequent way models would fail when given long contexts is they’d return summarizations of the provided context while ignoring any instructions contained within the prompt.</li>
  2781.  
  2782.  
  2783.  
  2784. <li>If you’re on the leaderboard, pay attention to the “Live (AST)” columns.&nbsp;<a href="https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html" target="_blank" rel="noreferrer noopener">These metrics use real-world tool definitions contributed to the product by enterprise</a>, “avoiding the drawbacks of dataset contamination and biased benchmarks.”</li>
  2785.  
  2786.  
  2787.  
  2788. <li>Hell, this entire list of tactics is a strong argument for why&nbsp;<a href="https://www.dbreunig.com/2025/06/10/let-the-model-write-the-prompt.html" target="_blank" rel="noreferrer noopener">you should program your contexts</a>.</li>
  2789. </ol>
  2790. ]]></content:encoded>
  2791. </item>
  2792. <item>
  2793. <title>MCP Introduces Deep Integration—and Serious Security Concerns</title>
  2794. <link>https://www.oreilly.com/radar/mcp-introduces-deep-integration-and-serious-security-concerns/</link>
  2795. <pubDate>Wed, 27 Aug 2025 09:52:30 +0000</pubDate>
  2796. <dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
  2797. <category><![CDATA[AI & ML]]></category>
  2798. <category><![CDATA[Commentary]]></category>
  2799.  
  2800. <guid isPermaLink="false">https://www.oreilly.com/radar/?p=17350</guid>
  2801.  
  2802.     <media:content
  2803. url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-color-four.jpg"
  2804. medium="image"
  2805. type="image/jpeg"
  2806. />
  2807. <description><![CDATA[MCP—the Model Context Protocol introduced by Anthropic in November 2024—is an open standard for connecting AI assistants to data sources and development environments. It&#8217;s built for a future where every AI assistant is wired directly into your environment, where the model knows what files you have open, what text is selected, what you just typed, [&#8230;]]]></description>
  2808. <content:encoded><![CDATA[
  2809. <p>MCP—the <em>Model Context Protocol</em> introduced by Anthropic in November 2024—is an open standard for connecting AI assistants to data sources and development environments. It&#8217;s built for a future where every AI assistant is wired directly into your environment, where the model knows what files you have open, what text is selected, what you just typed, and what you&#8217;ve been working on.</p>
  2810.  
  2811.  
  2812.  
  2813. <p>And that&#8217;s where the security risks begin.</p>
  2814.  
  2815.  
  2816.  
  2817. <p>AI is driven by context, and that&#8217;s exactly what MCP provides. It gives AI assistants like GitHub Copilot everything they might need to help you: open files, code snippets, even what&#8217;s selected in the editor. When you use MCP-enabled tools that transmit data to remote servers, all of it gets sent over the wire. That might be fine for most developers. But if you work at a financial firm, hospital, or any organization with regulatory constraints where you need to be extremely careful about what leaves your network, MCP makes it really easy to lose control of a lot of things.</p>
  2818.  
  2819.  
  2820.  
  2821. <p>Let&#8217;s say you&#8217;re working in Visual Studio Code on a healthcare app, and you select a few lines of code to debug a query—a routine moment in your day. That snippet might include connection strings, test data with real patient info, and part of your schema. You ask Copilot to help and approve an MCP tool that connects to a remote server—and all of it gets sent to external servers. That&#8217;s not just risky. It could be a compliance violation under HIPAA, SOX, or PCI-DSS, depending on what gets transmitted.</p>
  2822.  
  2823.  
  2824.  
  2825. <p>These are the kinds of things developers accidentally send every day without realizing it:</p>
  2826.  
  2827.  
  2828.  
  2829. <ul class="wp-block-list">
  2830. <li>Internal URLs and system identifiers</li>
  2831.  
  2832.  
  2833.  
  2834. <li>Passwords or tokens in local config files</li>
  2835.  
  2836.  
  2837.  
  2838. <li>Network details or VPN information</li>
  2839.  
  2840.  
  2841.  
  2842. <li>Local test data that includes real user info, SSNs, or other sensitive values</li>
  2843. </ul>
  2844.  
  2845.  
  2846.  
  2847. <p>With MCP, devs on your team could be approving tools that send all of those things to servers outside of your network without realizing it, and there’s often no easy way to know what’s been sent.</p>
  2848.  
  2849.  
  2850.  
  2851. <p>But this isn&#8217;t just an MCP problem; it&#8217;s part of a larger shift where AI tools are becoming more context-aware across the board. Browser extensions that read your tabs, AI coding assistants that scan your entire codebase, productivity tools that analyze your documents—they&#8217;re all collecting more information to provide better assistance. <em>With MCP, the stakes are just more visible because the data pipeline is formalized</em><strong>.</strong></p>
  2852.  
  2853.  
  2854.  
  2855. <p>Many enterprises are now facing a choice between AI productivity gains and regulatory compliance. Some orgs are building air-gapped development environments for sensitive projects, though achieving true isolation with AI tools can be complex since many still require external connectivity. Others lean on network-level monitoring and data loss prevention solutions that can detect when code or configuration files are being transmitted externally. And a few are going deeper and building custom MCP implementations that sanitize data before transmission, stripping out anything that looks like credentials or sensitive identifiers.</p>
  2856.  
  2857.  
  2858.  
  2859. <p>One thing that can help is organizational controls in development tools like VS Code. Most security-conscious organizations can centrally disable MCP support or control which servers are available through group policies and GitHub Copilot enterprise settings. But that&#8217;s where it gets tricky, because MCP doesn&#8217;t just receive responses. It sends data upstream, potentially to a server outside of your organization, which means every request carries risk.</p>
  2860.  
  2861.  
  2862.  
  2863. <p>Security vendors are starting to catch up. Some are building MCP-aware monitoring tools that can flag potentially sensitive data before it leaves the network. Others are developing hybrid deployment models where the AI reasoning happens on-premises but can still access external knowledge when needed.</p>
  2864.  
  2865.  
  2866.  
  2867. <p>Our industry is going to have to come up with better enterprise solutions for securing MCP if we want to meet the needs of all organizations. The tension between AI capability and data security will likely drive innovation in privacy-preserving AI techniques, federated learning approaches, and hybrid deployment models that keep sensitive context local while still providing intelligent assistance.</p>
  2868.  
  2869.  
  2870.  
  2871. <p>Until then, deeply integrated AI assistants come with a cost: Sensitive context can slip through—and there&#8217;s no easy way to know it has happened.</p>
  2872. ]]></content:encoded>
  2873. </item>
  2874. </channel>
  2875. </rss>
  2876.  
  2877. <!--
  2878. Performance optimized by W3 Total Cache. Learn more: https://www.boldgrid.com/w3-total-cache/
  2879.  
  2880. Object Caching 247/249 objects using Memcached
  2881. Page Caching using Disk: Enhanced (Page is feed)
  2882. Minified using Memcached
  2883.  
  2884. Served from: www.oreilly.com @ 2025-09-18 18:11:38 by W3 Total Cache
  2885. -->

If you would like to create a banner that links to this page (i.e. this validation result), do the following:

  1. Download the "valid RSS" banner.

  2. Upload the image to your own server. (This step is important. Please do not link directly to the image on this server.)

  3. Add this HTML to your page (change the image src attribute if necessary):

If you would like to create a text link instead, here is the URL you can use:

http://www.feedvalidator.org/check.cgi?url=https%3A//www.oreilly.com/radar/feed/index.xml

Copyright © 2002-9 Sam Ruby, Mark Pilgrim, Joseph Walton, and Phil Ringnalda