AI Agent Memory as Infrastructure Pattern

Rohit · Jan 18, 2026

Three months ago, I was rejected from a technical interview because I couldn’t build an agent that never forgets.

Every approach I knew worked… until it didn’t.

I walked into that room confident. I’d built chatbots. I understood embeddings. I knew how to use vector databases.

But when the interviewer asked me to design an agent that could remember a user’s preferences across weeks, not just within a single conversation, I froze.

My instinct was the standard playbook: Store everything in a vector database and retrieve similar conversations when needed.

The questions that killed me were simple: What about scale? After a thousand sessions, how do you handle conflicting data? How do you stop it from faking memories just to fill the gaps?

I had no answer.

That failure forced me to take a real deep dive, and here is what I found:

Most tutorials about "agents with memory" are teaching how to implement RAG for memory.

The problem isn't embeddings. It isn't token limits. It isn't even retrieval.

The problem is that memory is infrastructure, not a feature.

Here is the entire system I built to solve it and the code I used to do it.

The Real Problem With "Standard" Memory

Here is what I thought memory meant: Keeping the conversation history and stuffing it into the context window.

That works for about 10 exchanges. Then the context window fills up.

So you truncate old messages. Now your agent forgets the user is vegan and recommends a steakhouse.

You realize conversation history isn't memory; it's just a chat log.
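To make that failure concrete, here is a minimal sketch of naive truncation. It counts messages rather than tokens to keep things simple, and `build_context` and `max_messages` are hypothetical names, not part of any framework.

```python
from collections import deque

def build_context(history, max_messages):
    """Keep only the most recent `max_messages` messages (naive truncation)."""
    return list(deque(history, maxlen=max_messages))

# The important fact arrives early in the conversation...
history = ["user: I'm vegan, by the way."]
# ...followed by enough chatter to push it out of the window.
history += [f"user: small talk #{i}" for i in range(1, 11)]
history.append("user: Where should I eat tonight?")

context = build_context(history, max_messages=10)

# The dietary preference is no longer visible to the model.
vegan_fact_survived = any("vegan" in msg for msg in context)
print(vegan_fact_survived)
```

Nothing here is wrong as code. The window simply has no notion of which messages matter.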

"Fine," I thought. "I'll embed every message and retrieve relevant ones using similarity search."

This worked better. For a while.

But after two weeks, the vector database had 500 entries. When the user asked, "What did I tell you about my work situation?" the retrieval system returned fragments from 12 different conversations.

The agent saw:

  1. "I love my job" (Week 1)

  2. "I'm thinking about quitting" (Week 2)

  3. "My manager is supportive" (Week 1)

  4. "My manager micromanages everything" (Week 2)

Which one is true?

The agent had no idea. It hallucinated a synthesis: "You love your supportive manager but you're thinking about quitting because of micromanagement."

Completely wrong. The user had switched jobs between Week 1 and Week 2.

This is the crucial realization: Embeddings measure similarity, not truth.

Vector databases have a blind spot: they don't understand time, context, or updates. They just spit back text that looks mathematically close to what you asked for. That isn’t remembering; it’s guessing.
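A toy version of that failure, as a sketch: the "embedding" below is just a bag-of-words count vector (an assumption to keep the example self-contained, not a real embedding model), and the `week` field is metadata the ranking never looks at.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

memories = [
    {"week": 1, "text": "I love my job"},
    {"week": 2, "text": "I'm thinking about quitting"},
    {"week": 1, "text": "My manager is supportive"},
    {"week": 2, "text": "My manager micromanages everything"},
]

def retrieve(query, k=2):
    """Rank purely by similarity; timestamps play no role."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m["text"])), reverse=True)
    return ranked[:k]

hits = retrieve("What is my manager like?")
weeks = {m["week"] for m in hits}
for m in hits:
    print(m["week"], m["text"])
```

Both manager statements come back, one from each week, and nothing in the result says which one is still true.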

The fix required a mental shift. Memory isn't a hard drive. It’s a process. You can't just store data; you have to give it a lifespan and let it evolve.

Before tackling the hard part (long-term memory), we need to handle short-term continuity.

Short-term memory is the ability to remember what was said 30 seconds ago. This is actually a solved problem.

The solution is Checkpointing.

Every agent operates as a state machine. It receives input, updates internal state, calls tools, generates output, and updates state again. A checkpoint is a snapshot of this entire state at a specific moment.

This gives you three capabilities:

Determinism: Replay any conversation.

Recoverability: Resume exactly where you left off if the agent crashes.

Debuggability: Rewind to inspect the agent's "thoughts."

In production, I use Postgres-backed checkpointers. Here is the pattern:

checkpoint code in python
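As a rough illustration of the checkpointing idea, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for Postgres so it runs anywhere. The class name `SqliteCheckpointer`, the table layout, and the `thread_id` key are assumptions for illustration, not the production code described above.

```python
import json
import sqlite3

class SqliteCheckpointer:
    """Stores one JSON snapshot of agent state per (thread_id, step)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def save(self, thread_id, state):
        """Append a new snapshot and return its step number."""
        row = self.conn.execute(
            "SELECT COALESCE(MAX(step), -1) + 1 FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        step = row[0]
        self.conn.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state)),
        )
        self.conn.commit()
        return step

    def load(self, thread_id, step=None):
        """Return the latest snapshot, or a specific step when rewinding."""
        if step is None:
            row = self.conn.execute(
                "SELECT state FROM checkpoints WHERE thread_id = ? "
                "ORDER BY step DESC LIMIT 1",
                (thread_id,),
            ).fetchone()
        else:
            row = self.conn.execute(
                "SELECT state FROM checkpoints WHERE thread_id = ? AND step = ?",
                (thread_id, step),
            ).fetchone()
        return json.loads(row[0]) if row else None

    def history(self, thread_id):
        """Return every snapshot in order, e.g. to replay a conversation."""
        rows = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ? ORDER BY step",
            (thread_id,),
        ).fetchall()
        return [json.loads(r[0]) for r in rows]

cp = SqliteCheckpointer()
cp.save("user-42", {"messages": ["hi"], "pending_tool": None})
cp.save("user-42", {"messages": ["hi", "hello!"], "pending_tool": None})

resumed = cp.load("user-42")        # recoverability: pick up the latest state
first = cp.load("user-42", step=0)  # debuggability: rewind to an earlier step
```

`history` covers the replay case, `load` with no step covers crash recovery, and `load` with a step covers rewinding.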

This handles the "now." But checkpoints are ephemeral. They don't build wisdom. For that, we need Long-Term Architectures.

After months of failure, I found two architectures that actually work.

The first is file-based memory. It mimics how humans categorize knowledge, and it works best for assistants, therapists, or companions.

The Three-Layer Hierarchy:

Layer 1: Resources (Raw Data). The source of truth. Unprocessed logs, uploads, transcripts. Immutable and timestamped.

Layer 2: Items (Atomic Facts). Discrete facts extracted from resources ("User prefers Python," "User is allergic to shellfish").

Layer 3: Categories (Evolving Summaries). The high-level context. Items are grouped into files like work_preferences.md or personal_life.md.
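One way to picture the three layers is as plain data types. This is a sketch under assumptions: the class and field names (`Resource`, `Item`, `Category`, `source_resource_id`, and so on) are illustrative choices, not a fixed schema.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from datetime import datetime, timezone

@dataclass(frozen=True)
class Resource:
    """Layer 1: raw input. Frozen, so it cannot be edited after creation."""
    resource_id: str
    user_id: str
    text: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class Item:
    """Layer 2: one atomic fact, linked back to the resource it came from."""
    content: str
    category: str
    source_resource_id: str

@dataclass
class Category:
    """Layer 3: an evolving summary plus the items that feed it."""
    name: str
    summary: str = ""
    items: list = field(default_factory=list)

raw = Resource("r1", "u1", "I switched to Rust last month.")
fact = Item("User prefers Rust", "work_preferences", raw.resource_id)
work = Category("work_preferences")
work.items.append(fact)

# Layer 1 is the source of truth: attempts to rewrite it are rejected.
immutable = False
try:
    raw.text = "edited"
except FrozenInstanceError:
    immutable = True
```

The link from `Item` back to `Resource` is what makes every summary line traceable to the raw text it came from.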

The Write Path: Active Memorization

When new information arrives, the system doesn't just file it away; it processes it. It pulls up the existing summary for that category and actively weaves the new detail into the narrative. This handles contradictions automatically: if a user mentions they’ve switched to Rust, the system doesn't just add 'Rust' to the list; it rewrites the profile to replace the old preference.

python

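import json

# `llm` is assumed to be an LLM client defined elsewhere that exposes
# .invoke(prompt) and returns the model's reply as a string.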

class FileBasedMemory:
    def memorize(self, conversation_text, user_id):
        # Stage 1: Resource Ingestion (The Source of Truth)
        # Always save the raw input first. This allows for traceability.
        resource_id = self.save_resource(user_id, conversation_text)
        
        # Stage 2: Extraction
        # Extract atomic facts from the conversation.
        items = self.extract_items(conversation_text)
        
        # Stage 3: Batching (The Fix)
        # Group items by category to avoid opening/writing files multiple times.
        # Structure: { "work_life": ["User hates Java", "User loves Python"], ... }
        updates_by_category = {}
        for item in items:
            cat = self.classify_item(item)
            if cat not in updates_by_category:
                updates_by_category[cat] = []
            updates_by_category[cat].append(item['content'])
            
            # Link item to the specific resource for traceability
            self.save_item(user_id, category=cat, item=item, source_resource_id=resource_id)

        # Stage 4: Evolve Summaries (One Write Per Category)
        for category, new_memories in updates_by_category.items():
            existing_summary = self.load_category(user_id, category)
            
            # We pass the LIST of new memories, not just one
            updated_summary = self.evolve_summary(
                existing=existing_summary,
                new_memories=new_memories
            )
            
            self.save_category(user_id, category, updated_summary)

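    def extract_items(self, text):
        """Use the LLM to extract atomic facts as a list of {"content": ...} dicts."""
        prompt = f"""Extract discrete facts from this conversation.
        Focus on preferences, behaviors, and important details.
        Conversation: {text}
        Return a JSON list of objects, each with a "content" field."""
        # Assumes llm.invoke returns the raw JSON text; parse it so memorize()
        # can iterate over the items and read item['content'].
        return json.loads(llm.invoke(prompt))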

    def evolve_summary(self, existing, new_memories):
        """
        Update category summary with a BATCH of new information.
        """
        # Convert list to bullet points for the prompt
        memory_list_text = "\n".join([f"- {m}" for m in new_memories])
        
        prompt = f"""You are a Memory Synchronization Specialist.
        
        Topic Scope: User Profile
        
        ## Original Profile
        {existing if existing else "No existing profile."}
        
        ## New Memory Items to Integrate
        {memory_list_text}
        
        # Task
        1. Update: If new items conflict with the Original Profile, overwrite the old facts.
        2. Add: If items are new, append them logically.
        3. Output: Return ONLY the updated markdown profile."""
        
        return llm.invoke(prompt)

    # Helper stubs
    def save_resource(self, user_id, text): pass
    def save_item(self, user_id, category, item, source_resource_id): pass
    def save_category(self, user_id, category, content): pass
    def load_category(self, user_id, category): return ""
    def classify_item(self, item): return "general"
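The helper stubs above are left empty. One possible way to back them with real files is sketched below. The `FileStore` class and its on-disk layout (`resources.jsonl`, `items.jsonl`, and one markdown file per category) are assumptions for illustration, and category names are used as file names without sanitization, which a real system would need to handle.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

class FileStore:
    """Hypothetical layout: <root>/<user_id>/{resources.jsonl, items.jsonl, categories/*.md}"""

    def __init__(self, root):
        self.root = Path(root)

    def _user_dir(self, user_id):
        categories = self.root / user_id / "categories"
        categories.mkdir(parents=True, exist_ok=True)
        return categories.parent

    def _append(self, path, record):
        # Append-only writes keep earlier records untouched.
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def save_resource(self, user_id, text):
        """Layer 1: store the raw text with a timestamp and return its id."""
        resource_id = uuid.uuid4().hex
        self._append(self._user_dir(user_id) / "resources.jsonl", {
            "id": resource_id,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "text": text,
        })
        return resource_id

    def save_item(self, user_id, category, item, source_resource_id):
        """Layer 2: store one fact with a pointer back to its resource."""
        self._append(self._user_dir(user_id) / "items.jsonl", {
            "category": category,
            "content": item["content"],
            "source_resource_id": source_resource_id,
        })

    def load_category(self, user_id, category):
        """Layer 3: read the current summary, or '' if none exists yet."""
        path = self._user_dir(user_id) / "categories" / f"{category}.md"
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def save_category(self, user_id, category, content):
        """Layer 3: overwrite the summary with its evolved version."""
        path = self._user_dir(user_id) / "categories" / f"{category}.md"
        path.write_text(content, encoding="utf-8")

store = FileStore(tempfile.mkdtemp())
rid = store.save_resource("u1", "I switched from Java to Rust.")
store.save_item("u1", "work_preferences", {"content": "User prefers Rust"}, rid)
store.save_category("u1", "work_preferences", "# Work preferences\n- Prefers Rust\n")
```

Resources and items are appended, while category files are overwritten. That split mirrors the idea that raw data is immutable and summaries evolve.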

The Read Path (Tiered Retrieval): To save tokens, you don't pull everything.

  1. Pull Category Summaries.

  2. Ask LLM: "Is this enough?"

  3. If yes -> Respond.

  4. If no -> Drill down into specific items.

python

class FileBasedRetrieval:
    def retrieve(self, query, user_id):
        # Stage 1: Category Selection (The Fix)
        # Instead of loading ALL content, we just list category NAMES and ask
        # the LLM which ones might contain the answer.
        all_categories = self.list_categories(user_id)
        relevant_categories = self.select_relevant_categories(query, all_categories)
        
        # Load only the relevant summaries
        summaries = {cat: self.load_category(user_id, cat) 
                     for cat in relevant_categories}
        
        # Stage 2: Sufficiency Check
        # Check if the high-level summaries answer the query
        if self.is_sufficient(query, summaries):
            return summaries
        
        # Stage 3: Hierarchical Search
        # If summaries are vague, generate a specific query to find atomic items
        # or raw resources.
        search_query = self.generate_search_query(query, summaries)
        
        # Search Level 1: Atomic Items (Extracted facts)
        items = self.search_items(user_id, search_query)
        if items:
            return items
            
        # Search Level 2: Raw Resources (Full text search fallback)
        resources = self.search_resources(user_id, search_query)
        return resources

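    def select_relevant_categories(self, query, categories):
        """Filter to only the categories likely to hold the answer"""
        prompt = f"""Query: {query}
        Available Categories: {', '.join(categories)}
        
        Return a JSON list of the categories that are most relevant to this query."""
        # Parse the reply (assumed to be JSON text) into a Python list; iterating
        # over the raw string would yield single characters. `json` is imported
        # in the write-path listing above.
        selected = json.loads(llm.invoke(prompt))
        # Drop any category names the model invented
        return [cat for cat in selected if cat in categories]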

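    def is_sufficient(self, query, summaries):
        prompt = f"""Query: {query}
        Summaries: {summaries}
        Can you answer the query comprehensively with just these summaries? YES/NO"""
        # Match on the start of the reply so an answer like "NO, ... YES ..." is not misread
        return llm.invoke(prompt).strip().upper().startswith('YES')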
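Both the extraction step and the category-selection step ask the model for a JSON list, and in practice models often wrap that JSON in a markdown code fence or reply with plain prose. A defensive parser is one way to cope. This is a sketch, and `parse_json_list` is a hypothetical helper rather than part of the system above.

```python
import json
import re

# Built from single backticks so this listing can itself sit inside a code fence.
FENCE = "`" * 3
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)

def parse_json_list(raw):
    """Best-effort parse of an LLM reply that should contain a JSON list.

    Returns [] when the reply is not valid JSON or is not a list, so callers
    can fall back to a wider search instead of crashing.
    """
    text = raw.strip()
    fenced = FENCED_REPLY.match(text)
    if fenced:
        # Keep only what sits between the opening and closing fence.
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []

clean = parse_json_list('["work_preferences"]')
wrapped = parse_json_list(FENCE + 'json\n["a", "b"]\n' + FENCE)
chatty = parse_json_list("Sure! The most relevant category is work.")
```

Returning an empty list on failure is a design choice: in the read path it naturally pushes retrieval down to the item and resource search instead of raising.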

This works beautifully for narrative coherence. But it struggles with complex relations.