Multi-Turn Memory Collapse: Why LLM Agents Forget Mid-Conversation

June 02, 2026 · 9 min read

You've built an LLM-powered agent that works perfectly in a short demo. Five turns in, it's sharp, contextual, and impressive. Ten turns in, it starts repeating itself. Fifteen turns in, it's forgotten the user's name, their original goal, and half of what was agreed earlier. This is memory collapse, and it's one of the most common failure modes in production LLM agents.

The good news is it's not a mystery. Once you understand why it happens, you have a clear set of tools to push it back.

What you'll learn

  • Why the context window is both the memory system and its own bottleneck
  • How naive conversation history management causes silent degradation
  • The four main strategies for extending effective memory in LLM agents
  • Common implementation pitfalls and how to avoid them
  • A practical decision framework for choosing the right memory approach

Prerequisites

This article assumes you're comfortable with the basics of calling an LLM API (OpenAI, Anthropic, or similar). You don't need to be using a framework like LangChain, though the concepts apply there too. Familiarity with Python and REST APIs is helpful for the code examples.

How LLMs Actually "Remember" Things

Large language models have no persistent memory between API calls. Every single request is stateless. The only way a model knows what happened earlier in a conversation is if you include that history in the prompt you send right now.

That means your agent's entire "memory" is whatever fits inside the context window of a single API call. For modern models that window is large (tens or hundreds of thousands of tokens), but it is still finite, and the costs and performance implications of filling it matter a great deal.

When developers build their first agent, they typically do the obvious thing: append every message to a list and send the whole list with every new request. This works until it doesn't.

# The naive approach: you've probably written this
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment
conversation_history = []

def chat(user_message):
    # Every message ever sent is kept and resent on each call
    conversation_history.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            *conversation_history
        ]
    )

    assistant_message = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

This is fine for short sessions. For anything running longer than a few minutes or handling complex tasks, you're heading toward a cliff edge you won't see until you fall off it.
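To see why the naive approach gets expensive, it helps to do the arithmetic. The sketch below assumes, purely for illustration, that every turn adds a fixed number of tokens; the function name and the 300 and 50 token figures are hypothetical, and real conversations vary, but the shape of the growth is the same.

```python
def naive_prompt_tokens(turns, tokens_per_turn=300, system_tokens=50):
    """Return (tokens in the final prompt, total prompt tokens sent across all turns).

    Assumes each turn (user message plus assistant reply) adds a fixed
    number of tokens, which is a simplification for illustration.
    """
    per_request = [system_tokens + tokens_per_turn * t for t in range(1, turns + 1)]
    return per_request[-1], sum(per_request)


for n in (5, 15, 50):
    last, total = naive_prompt_tokens(n)
    print(f"{n:>2} turns: last prompt ~{last} tokens, cumulative ~{total} tokens")
```

Under these assumptions the size of each prompt grows linearly, but the total number of prompt tokens you send (and pay for) grows roughly quadratically with the number of turns.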

The Mechanics of Memory Collapse

Memory collapse isn't a sudden event. It happens gradually, through several overlapping mechanisms.

Token limit truncation

Once the accumulated conversation history exceeds the model's context window, something has to give. Many naive implementations throw an error and crash. Others silently drop the oldest messages to make room. Either way, the model is now missing critical early context: the user's original intent, constraints they stated upfront, or decisions that were made and should not be revisited.

The problem with silent truncation is that the model doesn't know what it doesn't know. It will answer confidently based on incomplete information, and the user has no idea why it's suddenly acting confused.
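One way to avoid the silent version of this failure is to check the budget yourself and raise a clear error before sending the request. Here is a minimal sketch; `ContextBudgetExceeded`, `estimate_tokens`, and `check_budget` are hypothetical names, and the four-characters-per-token estimate is a rough assumption standing in for a real tokenizer.

```python
class ContextBudgetExceeded(Exception):
    """Raised when a prompt would not fit the configured token budget."""


def estimate_tokens(text):
    # Rough heuristic (about 4 characters per token for English text).
    # This is an assumption for the sketch, not a real tokenizer.
    return max(1, len(text) // 4)


def check_budget(messages, budget):
    """Return the estimated prompt size, or raise if it exceeds the budget."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    if used > budget:
        raise ContextBudgetExceeded(
            f"prompt needs ~{used} tokens but the budget is {budget}"
        )
    return used
```

Catching this exception gives you a single, explicit place to decide what happens next (summarize, trim, or tell the user), rather than letting the model quietly lose context.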

Attention dilution

Even before you hit the hard token limit, very long contexts degrade model performance. Research has consistently shown that models perform worse on tasks requiring information from the middle of a long context compared to information near the start or end. This is sometimes called the "lost in the middle" problem. If a critical constraint was stated in turn three of a thirty-turn conversation, the model may simply pay less attention to it by turn twenty-eight.
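One mitigation that follows from this (a suggestion, not something benchmarked in this article) is to restate critical constraints near the end of the prompt, where they are less likely to be overlooked. The helper below is a hypothetical sketch using the OpenAI-style message format from the earlier example; some providers only accept a system prompt at the start, in which case you could fold the reminder into the final user message instead.

```python
def pin_constraints(messages, constraints):
    """Return a new message list with key constraints restated just before
    the latest message, so they sit near the end of the context."""
    if not constraints or not messages:
        return list(messages)
    reminder = {
        "role": "system",
        "content": "Constraints still in force:\n"
        + "\n".join(f"- {c}" for c in constraints),
    }
    # Insert the reminder immediately before the final message.
    return [*messages[:-1], reminder, messages[-1]]
```

The original list is left untouched, so the reminder is added per request and never accumulates in your stored history.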

Semantic drift

Long conversations naturally drift. Topics shift, the user refines their request, edge cases are introduced. Without active memory management, the model tries to reconcile everything at once. It starts hedging, contradicting itself, or weighting recent tangents more heavily than the user's original goal.

The Four Memory Strategies

There's no single right answer here. The right strategy depends on your conversation length, task complexity, and latency budget. Most production systems combine more than one approach.

1. Sliding window

Keep only the last N messages in the context. Simple, predictable, cheap. The model always has fresh context but permanently loses early history.

MAX_HISTORY_MESSAGES = 20

def chat_with_window(user_message, history):
    history.append({"role": "user", "content": user_message})
    
    # Keep only the last N messages
    windowed_history = history[-MAX_HISTORY_MESSAGES:]
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            *windowed_history
        ]
    )
    
    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message, history

Use this when conversations are short-lived and early context isn't critical: a customer support FAQ bot, for instance, where each question is largely self-contained.
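One detail worth handling: slicing by message count can start the window on an assistant reply whose question has already been dropped. The following is a small sketch of a window that always opens on a user message; `window_on_turn_boundary` is a hypothetical helper, not part of any library.

```python
def window_on_turn_boundary(history, max_messages=20):
    """Keep at most max_messages recent messages, starting on a user message."""
    if max_messages <= 0:
        return []
    window = history[-max_messages:]
    # Drop leading non-user messages so the window never opens mid-exchange.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window
```

The window may end up one message shorter than the limit, which is usually a fair trade for not showing the model an answer without its question.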

2. Summarization

Periodically ask the model to summarize the conversation so far, then replace the raw history with the summary. You preserve semantic content while dramatically reducing token count.

def summarize_history(history, client):
    summary_prompt = [
        {
            "role": "system",
            "content": "Summarize the following conversation. Preserve all key facts, decisions, user preferences, and open questions. Be concise."
        },
        {
            "role": "user",
            "content": "\n".join([f"{m['role'].upper()}: {m['content']}" for m in history])
        }
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model is fine for summarization
        messages=summary_prompt
    )
    
    return response.choices[0].message.content

def chat_with_summary(user_message, history, summary, turn_count, client):
    history.append({"role": "user", "content": user_message})
    turn_count += 1

    # Summarize every 10 turns
    if turn_count % 10 == 0:
        # Summarize everything except the newest user message, and fold the
        # previous summary in so earlier context is not lost
        older = history[:-1]
        if summary:
            older = [{"role": "system", "content": f"Earlier summary: {summary}"}] + older
        summary = summarize_history(older, client)
        # Keep the newest user message so the request still ends with it
        history = history[-1:]

    system_content = "You are a helpful assistant."
    if summary:
        system_content += f"\n\nConversation summary so far:\n{summary}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_content},
            *history
        ]
    )

    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message, history, summary, turn_count

Summarization works well for longer-running conversations where the shape of the discussion matters more than the exact wording. The main cost is an extra API call every N turns, which is worth it for complex tasks.
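A variant worth considering is to keep the most recent messages verbatim and fold only the older ones into the running summary. The sketch below is one way to do that under stated assumptions: `compact_history` is a hypothetical helper, and `summarize_fn` is a callable you supply (for example, a thin wrapper around an LLM call) that takes the previous summary and a list of messages and returns a new summary string.

```python
def compact_history(history, summary, summarize_fn, keep_last=6, max_messages=20):
    """Fold older messages into the running summary once history grows too long.

    summarize_fn(previous_summary, messages) -> str is supplied by the caller.
    Returns a (history, summary) pair.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(history) <= max_messages:
        return history, summary  # Nothing to do yet
    older, recent = history[:-keep_last], history[-keep_last:]
    # Pass the previous summary along so each pass builds on the last one.
    new_summary = summarize_fn(summary, older)
    return recent, new_summary
```

Because the summarizer is injected, you can unit-test the compaction logic with a stub function and no API calls.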

3. External memory with vector search

For agents that need to remember specific facts across very long sessions or across multiple sessions, store important information in an external vector database and retrieve relevant chunks at query time. This is the Retrieval-Augmented Generation (RAG) pattern applied to conversation history.

The key insight is that you don't need to retrieve everything. You retrieve only what's relevant to the current message. A user who mentioned their preferred tech stack in turn two of a hundred-turn session can still get an accurate reference to it in turn ninety-nine, as long as you index and retrieve that fact correctly.

Popular choices for the vector store include Chroma, Pinecone, Weaviate, and pgvector (if you're already using PostgreSQL). The retrieval logic looks roughly like this:

import chromadb
from openai import OpenAI

client = OpenAI()
vector_client = chromadb.Client()
collection = vector_client.get_or_create_collection("conversation_memory")

def store_turn(turn_id, speaker, content):
    embedding_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=content
    )
    embedding = embedding_response.data[0].embedding
    
    collection.add(
        ids=[turn_id],
        embeddings=[embedding],
        documents=[content],
        metadatas=[{"speaker": speaker}]
    )

def retrieve_relevant(query, n_results=5):
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    return results["documents"][0]

This approach has the highest setup cost but scales well to very long-running agents and multi-session memory.
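Retrieval is only half the job: the retrieved snippets still have to reach the model. As a rough sketch, assuming `memories` is the list of strings returned by a retrieval function like the one above, you might assemble the prompt along these lines (`build_messages_with_memory` is a hypothetical helper):

```python
def build_messages_with_memory(user_message, memories, recent_history, max_memories=5):
    """Assemble a prompt from retrieved memories plus recent raw history."""
    # De-duplicate while preserving retrieval order.
    seen, unique = set(), []
    for memory in memories:
        if memory not in seen:
            seen.add(memory)
            unique.append(memory)

    system_content = "You are a helpful assistant."
    if unique:
        system_content += "\n\nRelevant facts from earlier in the conversation:\n"
        system_content += "\n".join(f"- {m}" for m in unique[:max_memories])

    return [
        {"role": "system", "content": system_content},
        *recent_history,
        {"role": "user", "content": user_message},
    ]
```

Pairing retrieved facts with a short window of recent raw messages gives the model both long-range recall and local conversational flow.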

4. Structured state extraction

Instead of storing raw conversation history, maintain a structured state object that gets updated as the conversation progresses. Think of it as a live document that the agent reads and writes.

import json

INITIAL_STATE = {
    "user_goal": None,
    "constraints": [],
    "decisions_made": [],
    "open_questions": [],
    "preferences": {}
}

def update_state(current_state, recent_messages, client):
    update_prompt = f"""
Given the current state and the most recent conversation turns, return an updated JSON state.
Only change fields that genuinely changed. Preserve existing data unless it was explicitly corrected.

Current state:
{json.dumps(current_state, indent=2)}

Recent conversation:
{chr(10).join([f"{m['role'].upper()}: {m['content']}" for m in recent_messages])}

Return only valid JSON matching the state schema."""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": update_prompt}],
        response_format={"type": "json_object"}
    )
    
    return json.loads(response.choices[0].message.content)

This pattern is especially powerful for task-oriented agents (booking systems, coding assistants, research agents) where you can define upfront what state matters and what can be discarded.
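Since the updated state comes back from a model, it is worth validating before you trust it. The sketch below is one defensive option, not the only one: `STATE_SCHEMA` and `merge_state` are hypothetical names, and the schema simply mirrors the fields of the state object shown above.

```python
import copy

# Expected type for each field of the state object (an assumption that
# mirrors the initial state used in this article).
STATE_SCHEMA = {
    "user_goal": (str, type(None)),
    "constraints": list,
    "decisions_made": list,
    "open_questions": list,
    "preferences": dict,
}


def merge_state(current_state, proposed_state):
    """Accept only known, correctly typed fields from the model's proposed state."""
    merged = copy.deepcopy(current_state)
    for key, expected_type in STATE_SCHEMA.items():
        if key in proposed_state and isinstance(proposed_state[key], expected_type):
            merged[key] = proposed_state[key]
    return merged
```

With this in place, a malformed or partial response leaves the existing state intact instead of wiping fields the model forgot to return.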

Choosing the Right Approach

These strategies aren't mutually exclusive. A well-designed production agent typically combines at least two of them. Here's a practical decision guide:

Scenario | Recommended approach
Short FAQ or support bot (<10 turns) | Sliding window
Medium-length task completion (10–30 turns) | Sliding window + summarization
Long-running sessions (30+ turns) | Summarization + structured state
Multi-session memory (user returns days later) | Vector store retrieval
Complex task with well-defined state | Structured state extraction

Start simple. Add complexity only when a simpler approach is clearly failing.
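If you want the guide in executable form, for a design doc or a config check, it can be written as a small function. This is only a sketch of the decision guide above: `recommend_strategy` is a hypothetical name, and treating exactly 30 turns as "medium" is an arbitrary tie-break.

```python
def recommend_strategy(expected_turns, multi_session=False, well_defined_state=False):
    """Map the decision guide to a list of suggested memory strategies."""
    if expected_turns < 10:
        strategies = ["sliding window"]
    elif expected_turns <= 30:
        strategies = ["sliding window", "summarization"]
    else:
        strategies = ["summarization", "structured state"]

    # The last two rows of the guide apply on top of conversation length.
    if well_defined_state and "structured state" not in strategies:
        strategies.append("structured state")
    if multi_session:
        strategies.append("vector store retrieval")
    return strategies


print(recommend_strategy(20))
print(recommend_strategy(50, multi_session=True))
```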

Common Pitfalls

Truncating without a summary boundary. If you drop old messages without first summarizing them, you lose that information permanently. Always summarize before you truncate if the content matters.

Summarizing too aggressively. A summary that compresses ten detailed messages into two sentences will lose specifics. Include facts, numbers, names, and decisions in your summarization prompt explicitly. Don't assume the model knows what to preserve.

Trusting the model to manage its own context. Asking the model to "remember this for later" inside the conversation does nothing. There's no in-weights update happening. The only thing that persists is what you include in the next API call.

Ignoring token counting. Don't wait for a context-length error or silent truncation to discover you've blown the context window. Count tokens proactively using a library like tiktoken and enforce your budget before each API call.

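import tiktoken

def count_tokens(messages, model="gpt-4o"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    total = 0
    for message in messages:
        # Roughly 4 tokens of overhead per message (role, content delimiters);
        # treat the result as an estimate, not an exact count
        total += 4
        total += len(encoding.encode(message["content"] or ""))
    return total

MAX_TOKENS = 100_000  # Leave headroom for the response

def enforce_token_budget(messages, model="gpt-4o"):
    # Work on a copy, and never drop a leading system prompt
    messages = list(messages)
    start = 1 if messages and messages[0]["role"] == "system" else 0
    while count_tokens(messages, model) > MAX_TOKENS and len(messages) > start + 1:
        messages.pop(start)  # Drop the oldest non-system message
    return messages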

Not testing at realistic conversation lengths. Most bugs in memory management only show up after turn fifteen or twenty. Build automated tests that run a full simulated conversation and assert that key facts from early turns are still accessible late in the conversation.
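A test like that does not need a real model to be useful. The sketch below uses a stub in place of the LLM (it can only "know" what appears in the prompt it receives) to check whether a fact from turn one survives a given memory strategy; all function names here are hypothetical, and in a real suite you would swap `stub_model` for a call to your agent.

```python
def stub_model(messages):
    """Stand-in for an LLM: it can only 'know' what is in the prompt."""
    context = " ".join(m["content"] for m in messages)
    return "Your name is Ada." if "Ada" in context else "I don't know your name."


def run_conversation(build_prompt, filler_turns=20):
    """Simulate a long chat and return the reply to a late recall question."""
    history = [
        {"role": "user", "content": "Hi, my name is Ada."},
        {"role": "assistant", "content": "Nice to meet you!"},
    ]
    for i in range(filler_turns):
        history.append({"role": "user", "content": f"Filler question {i}"})
        history.append({"role": "assistant", "content": f"Filler answer {i}"})
    history.append({"role": "user", "content": "What is my name?"})
    return stub_model(build_prompt(history))


def sliding_window(history):
    return history[-10:]


def window_with_pinned_facts(history):
    # In a real agent these facts would come from state extraction or retrieval.
    facts = {"role": "system", "content": "Known facts: the user's name is Ada."}
    return [facts, *history[-10:]]


# The plain window forgets the name; the pinned-facts variant keeps it.
assert "Ada" not in run_conversation(sliding_window)
assert "Ada" in run_conversation(window_with_pinned_facts)
```

The stub makes the failure deterministic, so the test isolates your memory management from model variability.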

A Note on Framework Abstractions

If you're using LangChain, LlamaIndex, or a similar framework, these patterns are often wrapped behind convenience classes like ConversationBufferMemory, ConversationSummaryMemory, or VectorStoreRetrieverMemory. Those abstractions are useful, but they can hide failure modes. It's worth understanding what they're doing underneath so you know what breaks and why when production behaves unexpectedly.

LangChain's ConversationSummaryMemory, for example, updates its summary with an extra LLM call after every exchange, which can be expensive and slow at scale. Knowing this, you might swap it for a version that summarizes only at checkpoints, folding in the delta since the last checkpoint rather than calling the model on every turn.

Wrapping Up

Memory collapse is predictable and preventable once you stop treating the context window as an infinite buffer and start treating it as a constrained resource you have to actively manage. Here's what to do next:

  1. Audit your current agent. Count how many tokens your average conversation consumes by turn fifteen. If you're already above 50% of your context budget, you need a memory strategy now.
  2. Add proactive token counting using tiktoken or your provider's equivalent before each API call. Fail loudly before you fail silently.
  3. Pick one strategy to implement first. Sliding window is the easiest starting point and often enough for many use cases. Add summarization on top once you hit its limits.
  4. Write a multi-turn regression test that simulates a twenty-plus-turn conversation and asserts that facts stated in the first three turns are still correctly referenced near the end.
  5. Consider structured state extraction if your agent has a well-defined task domain. A state object gives you explicit, auditable memory rather than hoping the model inferred the right things from a compressed summary.
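The audit in step 1 can be as simple as logging the prompt size on every turn and checking it at turn fifteen. Here is a minimal sketch, assuming you already record per-turn prompt token counts (for example from the usage figures many providers return with each response); `audit_context_usage` is a hypothetical helper.

```python
def audit_context_usage(prompt_tokens_by_turn, context_budget, check_turn=15, threshold=0.5):
    """Report how much of the context budget is used at a given turn.

    prompt_tokens_by_turn[i] is the prompt size sent on turn i + 1.
    """
    if len(prompt_tokens_by_turn) < check_turn:
        raise ValueError(f"need at least {check_turn} turns of data")
    used = prompt_tokens_by_turn[check_turn - 1]
    fraction = used / context_budget
    return {
        "tokens": used,
        "fraction": fraction,
        "needs_strategy": fraction > threshold,
    }
```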

The context window is not a bug; it's a constraint. Treat it like one, and your agents will stay coherent across far longer conversations.
