Context Management in Production AI Agents

@vasuman · Jan 12, 2026

AI agents are not magic, but they are also not as simple as "build an agent, automate everything, profit". Most people don’t understand what an agent is.

Those who do (<5%) try to build one, and it falls apart. The agent hallucinates, forgets what it was doing mid-task, or calls the wrong tool at the wrong time. It works perfectly in demos and breaks immediately in production.

I've deployed agents for over a year now. I started my software career at Meta but left 6 months ago to build a company that does nothing but deploy production agents for enterprise. We're at $3M ARR and growing, not because we're smarter than anyone else, but because we've built and failed enough times to know what the formula is now.

This is everything I've learned about building agents that work. It should apply at any level, whether you’re a beginner, an expert, or somewhere in between.

My goal with this article is to share my biggest learnings from a few years of being in the AI space. My hope is that you walk away with useful information that you can use to build better agents. Let's begin.

Context is everything. Yes, this is super obvious and you’ve probably heard it before. But that's because it's true. Most people think building agents is about chaining tools together. You pick a model, give it access to your database, and let it figure out what to do while you grab a beer. This approach fails immediately for a few different reasons.

The agent doesn't know what matters. It doesn't know what happened five steps ago. It only sees the current step in the process, guesses what to do (often poorly), and hopes for the best. That’s not the way that you want your agents to act, especially when you sell these agents to companies.

Context is often the biggest difference between an agent worth $1M and an agent worth $0. Here are the concepts you need to focus on and optimize for:

What the agent remembers. This means not just the current task, but the history of what led to it. If an agent is handling an invoice exception, for example, it needs to know what triggered the exception, who submitted the original invoice, what policy applies, and what happened the last time this vendor had an issue. Without that history, the agent is just guessing, which is worse than having no agent at all, because then a human would have figured it out. See: "AI sucks".

How information flows. When you have multiple agents, or one agent handling multiple steps, information needs to move between stages without getting lost, corrupted, or misconstrued. The agent that triages incoming requests needs to pass clean, structured context to the agent that resolves them. If that handoff is sloppy, everything downstream breaks. That means structured input and structured output that is verifiable at each stage. An example of this step is /compact in Claude Code, handing off context between LLM sessions.

What the agent knows about the domain. An agent handling legal contract review needs to understand what clauses matter, what risks look like, what the company's actual policies are. You can't just point it at documents and expect it to figure out what's important. That’s your job. But your job also includes being able to provide the resources in a structured format to your agent so that it has domain knowledge.
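The "structured input and structured output that is verifiable at each stage" idea can be sketched as a typed payload that is checked at the boundary between two agents. This is only an illustration under assumptions: `TriageHandoff`, `validate_handoff`, and the field names are hypothetical, borrowed from the invoice example above rather than from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TriageHandoff:
    """Hypothetical payload a triage agent passes to a resolution agent."""
    invoice_id: str
    trigger: str        # what triggered the exception
    submitted_by: str   # who submitted the original invoice
    policy_id: str      # which policy applies
    prior_incidents: list[str] = field(default_factory=list)  # vendor history


def validate_handoff(payload: dict) -> TriageHandoff:
    """Reject the handoff at the stage boundary if required context is missing."""
    required = ("invoice_id", "trigger", "submitted_by", "policy_id")
    missing = [key for key in required if not payload.get(key)]
    if missing:
        raise ValueError(f"handoff missing required context: {missing}")
    return TriageHandoff(
        invoice_id=payload["invoice_id"],
        trigger=payload["trigger"],
        submitted_by=payload["submitted_by"],
        policy_id=payload["policy_id"],
        prior_incidents=list(payload.get("prior_incidents", [])),
    )
```

Failing loudly at the boundary means a sloppy handoff is caught where it happens, not three steps downstream.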

Bad context management is an agent that calls the same tool repeatedly because it forgot it already got the answer, or calls the wrong tool because it was fed the wrong information. Another example is an agent that makes a decision contradictory to something it learned two steps earlier, or an agent that treats every task as brand new even when there's a clear pattern from previous similar tasks.
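One of these failure modes, repeating a tool call because the earlier answer was forgotten, can be guarded against with a small task-scoped memory. The sketch below is a minimal version, assuming tools are plain Python callables taking keyword arguments; `ToolMemory` and its methods are hypothetical names, not part of any agent library.

```python
import json
from typing import Any, Callable


class ToolMemory:
    """Task-scoped record of tool calls, so a repeated call reuses the earlier answer."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}
        self.calls_made = 0  # number of real tool executions

    def call(self, name: str, tool: Callable[..., Any], /, **kwargs: Any) -> Any:
        # Key on tool name plus sorted arguments so identical requests collide.
        key = json.dumps([name, kwargs], sort_keys=True, default=str)
        if key not in self._results:
            self._results[key] = tool(**kwargs)
            self.calls_made += 1
        return self._results[key]

    def summary(self) -> list[str]:
        """Lines that could be fed back into the prompt as "what I already know"."""
        return [f"{key} -> {value!r}" for key, value in self._results.items()]
```

The `summary()` output is the part that matters for context: it gives the agent an explicit record of what it has already learned in this task.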

Good context management means the agent operates like someone with domain knowledge. It connects dots across different pieces of information without explicit instructions on how they relate. This is why, when I sell agents to enterprise, I say we truly can automate everything: we build custom for each business, and we cover its entire existing knowledge base (whether that means reading documents or interviewing employees) to make that happen.

This is the concept that separates agents that just demo well from agents that run and deliver results when in production.

The wrong way to think about agents: "This will do the work so we don't have to hire someone."

The right way is: "This will let three people do what used to require fifteen." Yes, agents are going to replace human labor, and if you say otherwise then, respectfully, you are delusional. The positive is that agents don't eliminate the need for human judgment. They eliminate the friction around human judgment: research, data gathering, cross-referencing, formatting, routing, follow-up. You get the idea.

A finance team still needs to make decisions about exceptions. But instead of spending 70% of close week hunting for missing documentation, they spend 70% of close week actually resolving issues. The agent does the legwork, and the human approves the result. The reality, from what I’ve seen doing this for customers, is that they never fire employees. There’s nearly infinite work for employees to do in place of their previous manual work, at least for now. I do anticipate this will change over time as AI replaces that too.

The companies getting real value from agents aren't the ones trying to remove humans from the loop. Instead they are the ones who realized that most of what humans were doing wasn't actually the valuable part of their job, but rather the overhead required to get to the valuable part.

Build agents this way and accuracy becomes far less of a concern: the agent handles what it is good at, just as employees focus on what they’re good at.

This also means you can deploy faster. You don't need the agent to handle every edge case. You need it to handle the common cases well and route the weird stuff to humans with enough context that the human can resolve it quickly. Again, at least for now…
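Handling the common cases and routing "the weird stuff" to a human with context can be sketched as a simple routing function. This assumes the agent (or a separate checker) emits a confidence score between 0 and 1, which the text does not specify; `AgentResult`, `route`, and the 0.9 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    task_id: str
    proposed_action: str
    confidence: float    # assumed score in [0, 1]; how it is produced is up to you
    evidence: list[str]  # what the agent looked at to reach its proposal


def route(result: AgentResult, threshold: float = 0.9) -> dict:
    """Auto-handle confident results; send the rest to a human with full context."""
    if result.confidence >= threshold:
        return {"queue": "auto", "task_id": result.task_id, "action": result.proposed_action}
    return {
        "queue": "human_review",
        "task_id": result.task_id,
        "suggested_action": result.proposed_action,
        "evidence": result.evidence,
        "reason": f"confidence {result.confidence:.2f} below threshold {threshold:.2f}",
    }
```

The human-review branch carries the suggestion and the evidence, so the reviewer starts from the agent's work instead of from scratch.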

How an agent retains information across a task - and across multiple tasks - determines whether it works at scale.

Three patterns show up constantly:

  1. Solo agents that handle a complete workflow. One agent handling one job, start to finish. These are the easiest to build because all the context stays in one place. The challenge is managing state as the workflow gets longer. The agent needs to remember what it decided at step three when it gets to step ten. If your context window fills up or you're not structuring memory correctly, late-stage decisions get made without early-stage context, and stuff breaks.

  2. Parallel agents that work on different pieces of the same problem simultaneously. Faster, but now you have a coordination problem. How do the results merge? What happens when two agents reach contradictory conclusions? You need a clear protocol for how information comes back together and how conflicts resolve. Often this means a judge (either a human or another LLM) that resolves conflicts or race conditions.

  3. Collaborative agents that hand off to each other in sequence. Agent A does triage, passes to Agent B for research, passes to Agent C for resolution. This works well when the workflow has natural stages, but the handoffs are where things break. Whatever Agent A learns needs to survive the transition to Agent B in a format that Agent B can actually use.

Typically, the agents that we deploy for enterprise are a mix of patterns 2 and 3.
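The merge protocol for the parallel pattern can be sketched in a few lines. This is a minimal illustration, assuming each agent is a plain function from a task string to an answer string and that the judge is any callable (a human queue or another LLM call in practice); `run_parallel` and `merge` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]


def run_parallel(task: str, agents: dict[str, Agent]) -> dict[str, str]:
    """Run each agent on the same task concurrently and collect answers by agent name."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, task) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


def merge(answers: dict[str, str], judge: Callable[[dict[str, str]], str]) -> str:
    """Return the shared answer if all agents agree; otherwise defer to the judge."""
    distinct = set(answers.values())
    if len(distinct) == 1:
        return distinct.pop()
    return judge(answers)
```

The judge receives every answer keyed by agent name, so it can see who disagreed with whom before deciding.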

The mistake most people make is treating these like implementation details, when in reality they're architectural decisions that determine what your agent can and can't do.

If you're building an agent that handles sales deal approvals, you need to decide: Does one agent own the whole process? Or does a routing agent hand off to specialized agents for pricing review, legal review, and executive approval? Only you will know the actual process behind the decision making, which hopefully you can pass on to your fellow agent eventually. You can and should gather the information required to make a more informed decision by talking to the business or employees to figure out what their workflows actually look like, instead of just guessing.

The answer depends on how complex each stage is, how much context needs to carry between stages, and how often the stages need to coordinate in real-time versus sequentially.

If you get this wrong, you'll spend months debugging failures that aren't even bugs; they're architectural mismatches between your design, your problem, and your solution.

The default instinct when building AI systems is to create dashboards. Surface information. Show people what's happening. Please, for the love of every single person on this planet, do not create another dashboard.

Dashboards are useless.

Your finance team already knows there are missing receipts. Your sales team already knows deals are stuck in legal.

Agents should catch problems when they happen and route them to whoever can fix them. With everything needed to actually fix them. Right then.

When an invoice hits without proper documentation, don't add it to a report. Flag it immediately. Figure out who needs to provide what. Route it to them with the full context - the vendor, the amount, the policy that applies, the specific documentation that's missing. Block the transaction from posting until it's resolved. This last part is also crucial, because if you don’t do this, the problem starts leaking all over the org and you won’t have time to fix it.
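The invoice flow above (flag, work out who owes what, route with context, block posting) might look like the following. It is a sketch under assumptions: the `Invoice` fields and the `REQUIRED_DOCS` table mapping each document to the person who supplies it are hypothetical stand-ins for a real policy engine.

```python
from dataclasses import dataclass, field


@dataclass
class Invoice:
    invoice_id: str
    vendor: str
    amount: float
    policy_id: str
    documents: set[str] = field(default_factory=set)
    blocked: bool = False


# Hypothetical policy table: required documents and who supplies each one.
REQUIRED_DOCS = {"receipt": "submitter", "purchase_order": "procurement"}


def check_invoice(invoice: Invoice) -> list[dict]:
    """Block an under-documented invoice and build one routed request per missing document."""
    missing = sorted(doc for doc in REQUIRED_DOCS if doc not in invoice.documents)
    invoice.blocked = bool(missing)  # do not post until everything is present
    return [
        {
            "route_to": REQUIRED_DOCS[doc],
            "invoice_id": invoice.invoice_id,
            "vendor": invoice.vendor,
            "amount": invoice.amount,
            "policy_id": invoice.policy_id,
            "missing_document": doc,
        }
        for doc in missing
    ]
```

Re-running the check after a document arrives clears the block once nothing is missing.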

When a deal approval sits for more than 24 hours, don't surface it in a weekly review. Escalate automatically. Include the deal context so they can approve or reject without digging through systems. You have to move with urgency.
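The 24-hour escalation rule can be sketched as a periodic check over pending approvals. The threshold comes from the text; everything else is an assumption for illustration, including the record shape and the `approver_manager` field naming who the escalation goes to.

```python
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(hours=24)  # threshold taken from the text


def find_escalations(pending: list[dict], now: datetime) -> list[dict]:
    """Return an escalation message, with deal context, for each approval waiting too long."""
    escalations = []
    for deal in pending:
        waiting = now - deal["requested_at"]
        if waiting > ESCALATE_AFTER:
            escalations.append({
                "escalate_to": deal["approver_manager"],  # hypothetical field
                "deal_id": deal["deal_id"],
                "summary": deal["summary"],  # context so they can decide in place
                "hours_waiting": round(waiting.total_seconds() / 3600, 1),
            })
    return escalations
```

Passing `now` in explicitly keeps the check easy to test and to run from whatever scheduler you already have.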

When a supplier misses a milestone, don't wait for someone to notice. Trigger the contingency playbook. Start the response before anyone has to manually reali