Multi-Turn Memory Collapse: Why LLM Agents Forget Mid-Conversation
You've built an LLM-powered agent that works perfectly in a short demo. Five turns in, it's sharp, contextual, and impressive. Ten turns in, it starts repeating itself. Fifteen turns in, it's forgotten the user's name, their original goal, and half of what was agreed earlier. This is memory collapse, and it's one of the most common failure modes in production LLM agents.
The good news is it's not a mystery. Once you understand why it happens, you have a clear set of tools to push it back.
What you'll learn
- Why the context window is both the memory system and its own bottleneck
- How naive conversation history management causes silent degradation
- The four main strategies for extending effective memory in LLM agents
- Common implementation pitfalls and how to avoid them
- A practical decision framework for choosing the right memory approach
Prerequisites
This article assumes you're comfortable with the basics of calling an LLM API (OpenAI, Anthropic, or similar). You don't need to be using a framework like LangChain, though the concepts apply there too. Familiarity with Python and REST APIs is helpful for the code examples.
How LLMs Actually "Remember" Things
Large language models have no persistent memory between API calls. Every single request is stateless. The only way a model knows what happened earlier in a conversation is if you include that history in the prompt you send right now.
That means your agent's entire "memory" is whatever fits inside the context window of a single API call. For modern models that window is large (tens or hundreds of thousands of tokens), but it is still finite, and the costs and performance implications of filling it matter a great deal.
When developers build their first agent, they typically do the obvious thing: append every message to a list and send the whole list with every new request. This works until it doesn't.
# The naive approach: you've probably written this
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment
conversation_history = []

def chat(user_message):
    # Every turn is appended, and the entire list is resent on each request
    conversation_history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            *conversation_history
        ]
    )
    assistant_message = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message
This is fine for short sessions. For anything running longer than a few minutes or handling complex tasks, you're heading toward a cliff edge you won't see until you fall off it.
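To make the cost of resending everything concrete, here is a back-of-the-envelope sketch. It assumes each turn adds a fixed number of tokens, which real conversations won't do exactly; `tokens_per_turn` and `system_tokens` are illustrative values, not measurements.

```python
def prompt_tokens_at_turn(turn, tokens_per_turn=300, system_tokens=50):
    """Approximate prompt size when the full history is resent at a given turn.

    Assumes every turn adds a fixed number of tokens (an illustrative
    simplification, not a measurement).
    """
    return system_tokens + turn * tokens_per_turn


def cumulative_prompt_tokens(turns, tokens_per_turn=300, system_tokens=50):
    """Total prompt tokens sent across all requests up to and including `turns`."""
    return sum(
        prompt_tokens_at_turn(t, tokens_per_turn, system_tokens)
        for t in range(1, turns + 1)
    )


for n in (5, 15, 50):
    # Columns: turn number, prompt size at that turn, total tokens sent so far
    print(n, prompt_tokens_at_turn(n), cumulative_prompt_tokens(n))
```

Under these assumptions the size of each request grows linearly, but the total number of tokens you send (and pay for) grows roughly quadratically with conversation length.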
The Mechanics of Memory Collapse
Memory collapse isn't a sudden event. It happens gradually, through several overlapping mechanisms.
Token limit truncation
Once the accumulated conversation history exceeds the model's context window, something has to give. Many naive implementations throw an error and crash. Others silently drop the oldest messages to make room. Either way, the model is now missing critical early context: the user's original intent, constraints they stated upfront, or decisions that were made and should not be revisited.
The problem with silent truncation is that the model doesn't know what it doesn't know. It will answer confidently based on incomplete information, and the user has no idea why it's suddenly acting confused.
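One way to make truncation less silent is to tell the model that something was dropped. The following is a minimal sketch under that assumption; `truncate_with_notice` is a hypothetical helper, and the wording of the notice is only a starting point.

```python
def truncate_with_notice(history, max_messages):
    """Keep the newest messages and tell the model that older ones were dropped."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    if len(history) <= max_messages:
        return list(history)

    dropped = len(history) - max_messages
    notice = {
        "role": "system",
        "content": (
            f"Note: {dropped} earlier messages were omitted to save space. "
            "If the user refers to something you cannot see, ask them to restate it."
        ),
    }
    # The notice goes first, followed by the most recent messages.
    return [notice] + history[-max_messages:]
```

This does not recover the lost context, but it gives the model a reason to ask rather than guess.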
Attention dilution
Even before you hit the hard token limit, very long contexts can degrade model performance. Research on long-context models has found that they often perform worse on tasks requiring information from the middle of a long context than on information near the start or end. This is sometimes called the "lost in the middle" problem. If a critical constraint was stated in turn three of a thirty-turn conversation, the model may simply pay less attention to it by turn twenty-eight.
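If information near the end of the prompt is attended to more reliably, one mitigation is to restate critical facts just before the latest message. This is a sketch, not a guaranteed fix: `build_messages` and `pinned_facts` are hypothetical names, and whether it helps for your model is something to measure.

```python
def build_messages(system_prompt, pinned_facts, history):
    """Assemble a prompt that restates pinned facts right before the latest message.

    `pinned_facts` is a hypothetical list of short strings your application
    maintains (for example, constraints the user stated early on).
    """
    messages = [{"role": "system", "content": system_prompt}]
    if not pinned_facts or not history:
        return messages + list(history)

    reminder = {
        "role": "system",
        "content": "Key facts to honor:\n" + "\n".join(f"- {fact}" for fact in pinned_facts),
    }
    # Everything except the latest message, then the reminder, then the latest message.
    return messages + list(history[:-1]) + [reminder, history[-1]]
```

Some APIs accept only a single system prompt. In that case the reminder text could be prepended to the latest user message instead.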
Semantic drift
Long conversations naturally drift. Topics shift, the user refines their request, edge cases are introduced. Without active memory management, the model tries to reconcile everything at once. It starts hedging, contradicting itself, or weighting recent tangents more heavily than the user's original goal.
The Four Memory Strategies
There's no single right answer here. The right strategy depends on your conversation length, task complexity, and latency budget. Most production systems combine more than one approach.
1. Sliding window
Keep only the last N messages in the context. Simple, predictable, cheap. The model always has fresh context but permanently loses early history.
MAX_HISTORY_MESSAGES = 20

def chat_with_window(user_message, history):
    history.append({"role": "user", "content": user_message})
    # Keep only the last N messages
    windowed_history = history[-MAX_HISTORY_MESSAGES:]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            *windowed_history
        ]
    )
    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message, history
Use this when conversations are short-lived and early context isn't critical, such as a customer support FAQ bot where each question is largely self-contained.
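A plain slice can cut a window so that it begins with an assistant reply whose question has been dropped. Here is a small sketch of one way to avoid that, assuming messages are dicts with a `role` key; `window_on_user_turn` is a hypothetical helper.

```python
def window_on_user_turn(history, max_messages=20):
    """Return the last `max_messages` messages, trimmed to start on a user message."""
    window = history[-max_messages:] if max_messages > 0 else []
    # Skip leading non-user messages so the model never sees a reply without its question.
    start = 0
    while start < len(window) and window[start]["role"] != "user":
        start += 1
    return window[start:]
```

The window may end up slightly shorter than `max_messages`, which is usually an acceptable trade for a prompt that starts on a coherent turn.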
2. Summarization
Periodically ask the model to summarize the conversation so far, then replace the raw history with the summary. You preserve semantic content while dramatically reducing token count.
def summarize_history(history, client, previous_summary=None):
    # Flatten the turns into a plain-text transcript for the summarizer
    transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in history)
    if previous_summary:
        # Carry the earlier summary forward so older checkpoints aren't lost
        transcript = f"SUMMARY OF EARLIER CONVERSATION: {previous_summary}\n{transcript}"
    summary_prompt = [
        {
            "role": "system",
            "content": "Summarize the following conversation. Preserve all key facts, decisions, user preferences, and open questions. Be concise."
        },
        {"role": "user", "content": transcript}
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model is fine for summarization
        messages=summary_prompt
    )
    return response.choices[0].message.content

def chat_with_summary(user_message, history, summary, turn_count, client):
    turn_count += 1
    # Summarize every 10 turns, before adding the new message, so the
    # current user message is still sent to the model verbatim
    if turn_count % 10 == 0 and history:
        summary = summarize_history(history, client, previous_summary=summary)
        history = []  # Reset raw history after summarizing
    history.append({"role": "user", "content": user_message})
    system_content = "You are a helpful assistant."
    if summary:
        system_content += f"\n\nConversation summary so far:\n{summary}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_content},
            *history
        ]
    )
    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message, history, summary, turn_count
Summarization works well for longer-running conversations where the shape of the discussion matters more than the exact wording. The main cost is an extra API call every N turns, which is usually worth it for complex tasks.
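A common refinement is to keep the most recent messages verbatim and fold only the older ones into a running summary. The sketch below illustrates that idea without calling an API: `compact_history` and `summarize_fn` are hypothetical names, `fake_summarize` is a stand-in for a real LLM call, and `keep_last` is assumed to be at least 1.

```python
def compact_history(history, summary, summarize_fn, keep_last=6, max_messages=20):
    """Fold older messages into a running summary once history grows too long.

    `summarize_fn(previous_summary, messages)` is a hypothetical callable that
    returns the new summary text; in practice it would wrap an LLM call.
    Assumes keep_last >= 1. Returns (history, summary).
    """
    if len(history) <= max_messages:
        return history, summary

    older, recent = history[:-keep_last], history[-keep_last:]
    # Pass the previous summary in so earlier checkpoints are not lost.
    new_summary = summarize_fn(summary, older)
    return recent, new_summary


def fake_summarize(previous_summary, messages):
    """Stand-in summarizer for demonstration: records how many messages were folded in."""
    folded = f"{len(messages)} earlier messages summarized"
    return f"{previous_summary}; {folded}" if previous_summary else folded
```

Keeping a verbatim tail means the model still sees the exact wording of the latest exchange, while the summary carries the older context.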
3. External memory with vector search
For agents that need to remember specific facts across very long sessions or across multiple sessions, store important information in an external vector database and retrieve relevant chunks at query time. This is the Retrieval-Augmented Generation (RAG) pattern applied to conversation history.
The key insight is that you don't need to retrieve everything. You retrieve only what's relevant to the current message. A user who mentioned their preferred tech stack in turn two of a hundred-turn session can still get an accurate reference to it in turn ninety-nine, as long as you index and retrieve that fact correctly.
Popular choices for the vector store include Chroma, Pinecone, Weaviate, and pgvector (if you're already using PostgreSQL). The retrieval logic looks roughly like this:
import chromadb
from openai import OpenAI

client = OpenAI()
vector_client = chromadb.Client()  # In-memory client; data is lost when the process exits
collection = vector_client.get_or_create_collection("conversation_memory")

def store_turn(turn_id, speaker, content):
    # Embed the message text and index it under a unique turn ID
    embedding_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=content
    )
    embedding = embedding_response.data[0].embedding
    collection.add(
        ids=[turn_id],
        embeddings=[embedding],
        documents=[content],
        metadatas=[{"speaker": speaker}]
    )

def retrieve_relevant(query, n_results=5):
    # Embed the query with the same model used for storage
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    # Results are grouped per query; we sent one query, so take the first group
    return results["documents"][0]
This approach has the highest setup cost but scales well to very long-running agents and multi-session memory.
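The retrieval code above stops at returning documents. One plausible way to feed them back to the model is sketched here; `build_prompt_with_memories` is a hypothetical helper, and the wording of the memory header is an assumption you would tune.

```python
def build_prompt_with_memories(user_message, memories, recent_history,
                               system_prompt="You are a helpful assistant."):
    """Combine retrieved memories, recent turns, and the new message into one prompt.

    `memories` is a list of strings, such as the return value of a retrieval
    function. Duplicates are removed while preserving order.
    """
    unique_memories = list(dict.fromkeys(memories))
    system_content = system_prompt
    if unique_memories:
        memory_block = "\n".join(f"- {m}" for m in unique_memories)
        system_content += (
            "\n\nPossibly relevant facts from earlier in the conversation "
            "(may be incomplete or outdated):\n" + memory_block
        )
    return (
        [{"role": "system", "content": system_content}]
        + list(recent_history)
        + [{"role": "user", "content": user_message}]
    )
```

Labeling the memories as possibly incomplete is a deliberate choice: retrieval can surface stale or only loosely related turns, and the model should not treat them as settled facts.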
4. Structured state extraction
Instead of storing raw conversation history, maintain a structured state object that gets updated as the conversation progresses. Think of it as a live document that the agent reads and writes.
import json

INITIAL_STATE = {
    "user_goal": None,
    "constraints": [],
    "decisions_made": [],
    "open_questions": [],
    "preferences": {}
}

def update_state(current_state, recent_messages, client):
    # Render the recent turns as a plain-text transcript
    transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in recent_messages)
    update_prompt = (
        "Given the current state and the most recent conversation turns, return an updated JSON state.\n"
        "Only change fields that genuinely changed. Preserve existing data unless it was explicitly corrected.\n\n"
        f"Current state:\n{json.dumps(current_state, indent=2)}\n\n"
        f"Recent conversation:\n{transcript}\n\n"
        "Return only valid JSON matching the state schema."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": update_prompt}],
        response_format={"type": "json_object"}  # Ask the API for a JSON object
    )
    return json.loads(response.choices[0].message.content)
This pattern is especially powerful for task-oriented agents (booking systems, coding assistants, research agents) where you can define upfront what state matters and what can be discarded.
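Because the updated state comes from a model, it can drop keys or return the wrong type. A defensive merge is one way to guard against that. This is a sketch: `STATE_SCHEMA` and `merge_state` are hypothetical, and the schema simply mirrors the fields of `INITIAL_STATE`.

```python
import copy

# Expected type for each field; mirrors the INITIAL_STATE fields (assumed schema).
STATE_SCHEMA = {
    "user_goal": (str, type(None)),
    "constraints": list,
    "decisions_made": list,
    "open_questions": list,
    "preferences": dict,
}


def merge_state(current_state, proposed_state):
    """Accept only known keys with the expected types; keep old values otherwise."""
    merged = copy.deepcopy(current_state)
    for key, expected_type in STATE_SCHEMA.items():
        if key in proposed_state and isinstance(proposed_state[key], expected_type):
            merged[key] = proposed_state[key]
    return merged
```

With this in place, a malformed update degrades to "no change" for the affected field instead of silently erasing what the agent already knew.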
Choosing the Right Approach
These strategies aren't mutually exclusive. A well-designed production agent typically combines at least two of them. Here's a practical decision guide:
| Scenario | Recommended approach |
|---|---|
| Short FAQ or support bot (<10 turns) | Sliding window |
| Medium-length task completion (10-30 turns) | Sliding window + summarization |
| Long-running sessions (30+ turns) | Summarization + structured state |
| Multi-session memory (user returns days later) | Vector store retrieval |
| Complex task with well-defined state | Structured state extraction |
Start simple. Add complexity only when a simpler approach is clearly failing.
Common Pitfalls
Truncating without a summary boundary. If you drop old messages without first summarizing them, you lose that information permanently. Always summarize before you truncate if the content matters.
Summarizing too aggressively. A summary that compresses ten detailed messages into two sentences will lose specifics. Include facts, numbers, names, and decisions in your summarization prompt explicitly. Don't assume the model knows what to preserve.
Trusting the model to manage its own context. Asking the model to "remember this for later" inside the conversation does nothing on its own. No weights are updated between calls. The only thing that persists is what you include in the next API call.
Ignoring token counting. Don't wait for a context-length error or silent truncation to discover you've blown the context window. Count tokens proactively using a library like tiktoken and enforce your budget before each API call.
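import tiktoken

def count_tokens(messages, model="gpt-4o"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("o200k_base")
    total = 0
    for message in messages:
        # Roughly 4 tokens of overhead per message (role, content delimiters);
        # the exact figure varies by model, so treat the total as an estimate
        total += 4
        total += len(encoding.encode(message["content"]))
    return total

MAX_TOKENS = 100_000  # Leave headroom for the response

def enforce_token_budget(messages, model="gpt-4o"):
    # Keep a leading system prompt; drop the oldest non-system messages first
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    rest = messages[len(system):]
    while len(rest) > 1 and count_tokens(system + rest, model) > MAX_TOKENS:
        rest.pop(0)  # Drop oldest message
    return system + rest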
Not testing at realistic conversation lengths. Most bugs in memory management only show up after turn fifteen or twenty. Build automated tests that run a full simulated conversation and assert that key facts from early turns are still accessible late in the conversation.
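A regression test along these lines can be sketched without calling a real model. Everything here is illustrative: `run_simulated_conversation`, `windowed_chat`, and `fact_visible` are hypothetical helpers, the agent returns a canned reply, and the check only verifies that a fact is still present in the prompt, not that a model would use it.

```python
def run_simulated_conversation(chat_fn, user_turns):
    """Feed scripted user turns through chat_fn and record each prompt it would send.

    `chat_fn(user_message, history)` is a hypothetical function that returns
    (messages_sent_to_model, updated_history).
    """
    history, prompts = [], []
    for user_message in user_turns:
        messages, history = chat_fn(user_message, history)
        prompts.append(messages)
    return prompts


def windowed_chat(user_message, history, max_messages=6):
    """Toy sliding-window agent with a canned reply instead of a real model call."""
    history = history + [{"role": "user", "content": user_message}]
    messages = history[-max_messages:]
    history = history + [{"role": "assistant", "content": "ok"}]
    return messages, history


def fact_visible(messages, fact):
    """True if the fact text appears anywhere in the prompt."""
    return any(fact in m["content"] for m in messages)


turns = ["My name is Priya."] + [f"Question {i}" for i in range(1, 25)]
prompts = run_simulated_conversation(windowed_chat, turns)
# prints: True False (the name is visible at turn 1 but gone by turn 25)
print(fact_visible(prompts[0], "Priya"), fact_visible(prompts[-1], "Priya"))
```

In a real test suite you would assert that the fact is still visible on the final turn. That assertion fails for this toy agent, which is exactly the kind of regression the test is meant to catch.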
A Note on Framework Abstractions
If you're using LangChain, LlamaIndex, or a similar framework, these patterns are often wrapped behind convenience classes like ConversationBufferMemory, ConversationSummaryMemory, or VectorStoreRetrieverMemory. Those abstractions are useful, but they can hide failure modes. It's worth understanding what they're doing underneath so you know what breaks and why when production behaves unexpectedly.
LangChain's ConversationSummaryMemory, for example, makes an extra LLM call after every exchange to update its running summary, which can be expensive and slow at scale. Knowing this, you might swap it for a version that summarizes in batches, folding in only the delta since the last checkpoint every N turns rather than calling the model on each turn.
Wrapping Up
Memory collapse is predictable and preventable once you stop treating the context window as an infinite buffer and start treating it as a constrained resource you have to actively manage. Here's what to do next:
- Audit your current agent. Count how many tokens your average conversation consumes by turn fifteen. If you're already above 50% of your context budget, you need a memory strategy now.
- Add proactive token counting using tiktoken or your provider's equivalent before each API call. Fail loudly before you fail silently.
- Pick one strategy to implement first. Sliding window is the easiest starting point and is often enough. Add summarization on top once you hit its limits.
- Write a multi-turn regression test that simulates a twenty-plus-turn conversation and asserts that facts stated in the first three turns are still correctly referenced near the end.
- Consider structured state extraction if your agent has a well-defined task domain. A state object gives you explicit, auditable memory rather than hoping the model inferred the right things from a compressed summary.
The context window is not a bug; it's a constraint. Treat it like one, and your agents will stay coherent through far longer conversations.