concepts · youtube · 14 min
Don't Build More AI Agents Until You Watch This
Nate B Jones · Jun 21, 2026
YouTube video: Don't build more AI agents until you watch this
Channel: Nate B Jones
Published: 2026-06-17
URL: https://www.youtube.com/watch?v=BOXK2XFLA-E
Description: Full post with Agent Maintenance Guide: https://natesnewsletter.substack.com/p/ai-agent-maintenance?r=1z4sm5&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
OpenAI and Anthropic aren't the only ones who figured this out: Vercel made its AI agent better by deleting most of its tools. The real skill in 2026 is not building agents, it is maintaining the harness around them as the model improves and your data drifts.
Transcript: Vercel made its agent better by deleting 80% of its tools. You heard that right. And that sentence can sound wrong if you've been following a lot of the hype around new tools and new skills for agents. So, I want to set the record straight. The usual story we hear is that agents get better as you give them more stuff, right? More context, more memory, more tools, more integrations, more access, more autonomy. Let the agent touch the CRM, let it use Slack, let it browse the web, let it update the record. Vercel's example is a really healthy counterexample to that story. And no, it's not just about the context window, which is the usual reason people dump tools.
Messages came in. Some were real leads, some were spam, some were support questions dressed up as sales questions. Some came from little companies, some came from accounts that might matter to Vercel. Some deserved a really quick reply, some needed research, and some needed routing. It's the usual messy inbox. So, Vercel studied one of its best reps. They watched the workflow closely enough to turn pieces of it into an agent. I love that part. You have to study what people are already doing. What did the rep ignore? What did they answer? What made a lead real? What research happened before the reply? When was a message actually a support issue? Where did the human still need to make a judgment call?
Then they built the agent around the actual observed workflow, not the paper workflow. Again, love that. The agent filtered inbound messages, it qualified leads, it researched companies, it drafted responses, it routed support questions away from sales. A human still reviewed the work because the goal was not to let a bot roam around the company, right? The goal was to take a repeatable workflow from a strong employee and make that repeatable bit run fast. And that's already a great story. But the more important lesson is what happened after the agent existed.
The agent did not get better when the team kept piling on tools. It got better when they took tools away. And this is something that I think a lot of folks who are excited about agents need to sit with more. And this goes for skills, too. If you've got a pile of skills in your Codex or Claude, pay attention, because most of us are building agents the opposite way in practice. We're so enthused about building, right? We start with one task and add a tool and add another tool and add a memory file and add a Slack integration, add a browser, add a CRM action, add another exception. And after a while, the agent will look super powerful and muscled up, but it's going to become harder to trust. The beginner instinct is to add. The maintenance instinct is to ask what should be removed.
That is the real agent story of 2026. Not "can you build an agent?" Look, I've got a video on that. There are dozens of videos out there on that. Of course you can build an agent. The harder question is whether you can keep the setup around the agent healthy as the work changes and the model evolves. People call that setup a harness. If that word feels super technical, you can call it a workbench. It's kind of the same thing. The agent is the worker, the harness is the workbench. It's what the agent reads, it's what it remembers, it's what tools it can touch, it's what it's allowed to change, it's what proof it has to bring back, it's what stops it when the work gets risky.
Vercel's sales agent had a workbench, or harness. It had a documented workflow from a top performer. It had tools, it had handoffs, it had human review, it had feedback. And then the team learned that part of maintaining that workbench or harness is pruning. And that is a much more important lesson than "AI replaced the sales process," which is what all the headlines were about. The real lesson is that useful agents desperately need good maintenance.
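The harness described above (what the agent reads, remembers, can touch, may change, must prove, and when it stops) can be written down as plain data, which makes pruning a reviewable step rather than a feeling. What follows is a minimal sketch under stated assumptions, not Vercel's actual setup: the `Harness` fields, the tool names, and the usage counts are all hypothetical and made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Harness:
    """Hypothetical description of an agent's workbench (field names are illustrative)."""
    reads: list[str] = field(default_factory=list)           # sources the agent consults
    remembers: list[str] = field(default_factory=list)       # memory files or notes
    tools: list[str] = field(default_factory=list)           # tools it can call
    may_change: list[str] = field(default_factory=list)      # things it is allowed to write
    proof_required: list[str] = field(default_factory=list)  # evidence it must bring back
    stop_when: list[str] = field(default_factory=list)       # conditions that hand off to a human


def prune_candidates(harness: Harness, calls_per_tool: dict[str, int],
                     min_calls: int = 1) -> list[str]:
    """Return tools called fewer than min_calls times in the review window.

    The result is a list for a human to review, not an automatic delete list.
    """
    return [t for t in harness.tools if calls_per_tool.get(t, 0) < min_calls]


def without_tools(harness: Harness, removed: list[str]) -> Harness:
    """Return a copy of the harness with the given tools taken away."""
    kept = [t for t in harness.tools if t not in removed]
    return replace(harness, tools=kept)


# Hypothetical inbound-sales harness; every name and number here is invented.
inbound = Harness(
    reads=["documented workflow from a top rep"],
    tools=["classify_message", "research_company", "draft_reply",
           "route_to_support", "update_crm", "post_to_slack"],
    may_change=["draft replies"],
    proof_required=["sources used for company research"],
    stop_when=["reply is ready for human review"],
)
usage = {"classify_message": 412, "research_company": 175,
         "draft_reply": 160, "route_to_support": 58}

unused = prune_candidates(inbound, usage)
slimmer = without_tools(inbound, unused)
print(unused)         # ['update_crm', 'post_to_slack']
print(slimmer.tools)  # the four tools that were actually called
```

Usage count is only one possible pruning signal. A tool that is called often but confuses the model would need a different check, such as comparing task outcomes with and without it.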
And I think there are four first principles here that I want to lay out that are going to be durable for 2026. The first is that agents themselves are moving. The model underneath the agent is not stable. It's getting better. It's better at tool use, it's better at reasoning across steps, it's better at understanding messy instructions, it's better at reading files, it's better at remembering what matters, it's better at moving through work without needing every step spelled out. So, agents get better, and that sounds purely good. And mostly it is, but it also means yesterday's harness can become very wrong very quickly, at the price of an update. A tool that helped a weaker model can confuse a stronger one. A rule that protected you from an unreliable model's mistakes can trap a better model. A workflow that forced structure around a clumsy agent can become a drag when the model can handle a lot more of the work itself. These are all real examples.
We are used to software breaking when it gets worse. That's our mental model. Agents can also break when the model gets better, and that is a different and new thing. It's a strange new maintenance problem. Imagine the first version of an agent is not very reliable. It overreaches. It invents patterns. It treats one example like a trend. So you build a really careful harness around it. You give it strict tools and narrow the prompt and say: only use these sources. Don't infer. Don't create records. Don't recommend a next step. Just summarize what you see. And that may be exactly right for that model. Then the model improves. Again, real examples here. Now it can compare sources better. It can understand the workflow better. It can tell the difference between a weak signal and a real pattern. It can draft a useful next step. I am describing November to March, within this past 6 or 8 months. But your harness still treats it like the old model. So the agent is underused. Or the opposite happens, right?
The old model was clumsy, so you gave it broad access because you knew a human would catch everything. Then the model gets better. Now it can take 20 plausible actions in a few minutes. Now they look real. They look organized. They create work that a human has to unwind. So the model improved, the harness did not, and that is a massive driver of agent breakage in 2026.
Normal systems drift. Prompts drift. Wikis get stale. Dashboards break. Automations keep running long after the process changes. SOPs describe how the company worked months ago. Slack channels become junk drawers. Templates survive long after the reason for the template disappeared. None of that started with AI, right? Every company already has this problem. The product wiki is a little or a lot wrong. The CRM field means something slightly different than it used to. The dashboard still says "activation," but the team changed what activation means. The support tags have evolved. The roadmap moved, the owner changed, the process changed, the docs didn't. With normal software, this is vaguely annoying and you sometimes get messages saying, "Please update your wiki." With agents, it's very dangerous because agents don't sit. They produce work. They're proactive. That's their job. They summarize, they recommend, they draft, they route, they update, and sometimes, of course, they act. That's the value. So, a stale wiki that is annoying to you is incredibly dangerous to an agent, because the agent doesn't know the wiki is stale and it just keeps on working.
And this is the second principle I want to communicate. Agents inherit all of the crud of the systems around them. If your wiki is stale, your agent reads and ingests stale truth. If your process changed, your agent will follow the old process unless you update your docs. If your prompt is written for last quarter's company and model, that agent may keep serving last quarter's company and not realize everything's changed.
If your dashboard definition is incorrect now, the agent will make the wrong number feel very convincing. This is not a model failure in the simple sense, right? The agent did its job. It's the old maintenance problem, now paired with a machine that can turn that mess into work that is sometimes very convincing.
And this is why Stewart Brand's Maintenance: Of Everything is, I think, the right frame for agents. Brand is writing about sailboats and vehicles and weapons and manuals and corrosion and the work that keeps important systems alive after the launch moment is over. Agents are a lot less like apps and more like sailboats. I love this book. This is like one of my favorite books of the year. You don't just launch agents and walk away. The weather changes, the lines loosen, salt gets into everything, and yes, this is all from that book. The same setup that worked yesterday can be wrong tomorrow. A sailboat is not maintained because it was badly designed. It is maintained because it lives in motion. Agents live in motion, too. The model changes inside them. The world changes around them. In that sense, maintaining them is much more like traditional vehicle maintenance than anything else we've seen in software in a long time. The harness has to keep up with the model changes and the world changes. And so few of us really have a good system for that.
Now, the third principle I want to call out is that the biggest AI companies already know this. A lot of the implicit bet from the frontier labs and platform companies is not just that their models will get better. It is that they can use those better models to ship and evolve the harness faster. And I think that's one reason it's really important to talk about Codex in the context of OpenAI's long-term strategy, and one reason Codex matters so much. Codex is strong not just because the model is strong.
Codex is strong because OpenAI keeps maintaining the harness around the model so it feels intuitive and native as the model and the world evolve around it. It has become closer to an operating surface for work as it's evolved. So, it has a terminal and a desktop app and an IDE and a browser and computer use and files and plugins and memory and automations and approvals and sandboxing and network controls and keychain storage and managed configs and logs. This is way beyond a chat box with a smarter brain. It's a very carefully maintained workbench around machine work. And the Claude Code team is doing the same thing, right? They're investing heavily in their harness. I'm really excited to do the "Claude Code is as amazing as Codex" review, guys, so please give me something that cool.
And to go back to the workbench analogy, every tool in that workbench is carefully chosen with Codex, right? The terminal matters because real work lives in commands and repos and files and tests and local tools. This was Claude Code's original insight, by the way. The browser matters because real work happens on interfaces that humans see. And both Anthropic and OpenAI are building that way. Computer use matters because not every tool has a clean API. Plugins matter because work lives in a bunch of other systems: GitHub, Google Drive, Jira, Slack, etc. Memory matters because preferences and corrections should not have to be rebuilt every day, right? Approvals and sandboxing matter because a capable agent still needs boundaries. Logs matter because when an agent does something weird, someone needs to know what happened. This whole surface together is the harness. It's an art to build a good harness. And there are really two teams in the world building good harnesses right now: the Anthropic team and the OpenAI team. And this is where the hyperscaler and frontier platform bet gets super interesting.
If the model can help you ship the harness and test the harness and refactor the harness and observe the harness and train the harness, then capability gain is going to start to compound real fast. Because better agents can help build more effective harnesses, better harnesses can make the agents more useful, and then better agents can help rebuild that harness once more. That is why the Vercel story is not just a quirky sales automation story. It's a pattern we all need to learn from. The companies that win are not the ones that build the perfect wrapper once. They're the ones that keep rebuilding the wrapper as the model and the work change. They rebuild that workshop. They rebuild that harness.
And this is why the direction of Codex over time feels really significant to me. If Codex keeps getting more capable, and that's an if, and the Codex harness keeps getting closer to the operating system of work, another if, then OpenAI is not only selling intelligence. They are selling the environment in which intelligence becomes useful. And that's the same bet Anthropic is making with Claude Code and Claude Cowork. The harness evolves, the model evolves, the harness lets the model touch more real work over time. More real work creates more pressure to improve the harness, and that loop turns like a flywheel. And that loop matters, and it raises the bar for all of the rest of us. Because if you're building your own agent setup, you are now not just choosing a model. You're choosing how much harness maintenance to own versus how much to outsource.
A light custom harness might be a clean set of instructions and memory and source folders and repeatable methods around Codex or Claude. That can be enough. Here are the sources. Here's the job. Here's what you can't touch. Here's the proof I need. Here's when a human decides. A deeper custom harness is a very different thing.
Because now you have a data feed, a review screen, permission levels, logs, model choice, escalation paths, approval rules, and a plan for what happens when the model changes. And that can be very worth investing in, but now you're not just building an agent. You are investing in the long-term maintenance of an agent and harness system. You are taking responsibility for evolving the system around the agent over time. And the more custom the harness, the more you own the upkeep.
And this is not abstract for me. I'm now thinking about my delegation model differently, and part of it is just the ordinary mess of work, right? Folders move, drafts change, source packets get updated, memory gets stale, and the way I want the agent to use local context changes as the agent gets better. So the thing I maintain is a lot more than a prompt. It's the whole way the agent meets my files. Where should it look first? Which folders are a source of truth? What should it ignore? What should it ask about before touching? What should it remember? What should it forget? When is searching memory the right move? When does it actually go read the file? That is a harness question for me. And that's a tiny personal harness question, right? I'm not even talking about team harnesses here. And it has changed because the agents have changed, because the models have updated.
And this brings me to the fourth principle, and it's the one that I think matters the most. You need to ask, I think all of us need to ask: what is my harness? What is my workshop? Not in a sort of technical way that makes it feel scary, but in a very practical way. If you use ChatGPT or Claude or Codex or any other agentic tool, your harness is the
[transcript truncated]