Prevent Token Limit Errors From Silently Truncating LLM Context

You send a long conversation to your LLM API, get a perfectly fluent response back, and only notice hours later that the model completely ignored half the messages. No exception was raised. No warning was logged. The API returned HTTP 200 and your app moved on.

Silent context truncation is one of the most common sources of subtle bugs in LLM-powered applications. Once you understand why it happens and how to guard against it, you can build pipelines that degrade gracefully instead of silently lying to you.

Why token limits cause silent failures instead of obvious crashes
How to count tokens accurately before sending a request
Strategies for trimming or summarizing context to fit within limits
How to set up guardrails in Python that raise real errors when something would be dropped
Patterns for long-running conversations that stay within budget

Why Token Limits Are a Silent Problem

Most developers assume that exceeding a context window raises an exception. Some APIs do return an error, but not every layer of the stack does. An inference server, gateway, or framework memory helper may instead truncate your input from the left (dropping the oldest messages) or from the right (cutting off your final message mid-sentence) before the model processes it. The response comes back clean, and nothing in your code path signals that anything went wrong.

This is especially dangerous in chat applications and agentic pipelines where earlier messages contain critical instructions, tool results, or user data. The model answers the most recent message coherently while having no memory of the context that was cut.

Understanding What a Token Actually Is

Before you can manage tokens, you need a reliable mental model of what they are. Tokenizers split text into subword units — roughly 3–4 characters per token for English prose, though code, non-Latin scripts, and whitespace-heavy formats skew that number significantly.

A rough rule of thumb: 1,000 tokens is about 750 words. But that estimate breaks down fast. A JSON payload with lots of curly braces and quoted keys uses more tokens than plain prose of the same character length. SQL with repeated keywords, Python with indentation, and Markdown with link syntax all have their own token profiles. Always measure; never guess.
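To measure rather than guess, a small helper can report how many characters each token covers for your own content types. This is a minimal sketch: the names chars_per_token and token_profile are hypothetical, and the encode argument is assumed to be any callable that turns a string into a sequence of tokens.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence]) -> float:
    """Return the average number of characters covered by one token."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("encoder returned no tokens; cannot compute a ratio")
    return len(text) / len(tokens)


def token_profile(
    samples: dict[str, str],
    encode: Callable[[str], Sequence],
) -> dict[str, float]:
    """Map each labelled sample to its characters-per-token ratio."""
    return {
        label: round(chars_per_token(text, encode), 2)
        for label, text in samples.items()
    }
```

With tiktoken installed, passing tiktoken.get_encoding("o200k_base").encode as encode gives real numbers for prose, JSON, and code samples taken from your own traffic. A lower ratio means the format is more expensive per character.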

Counting Tokens Before You Send

The right place to catch a token limit problem is before the API call, not after. Every major provider exposes a way to count tokens before running inference, either with a local tokenizer or with a dedicated counting endpoint.

OpenAI models with tiktoken

OpenAI's tiktoken library mirrors the exact tokenizer used server-side, so the count you get locally matches what the API sees.

import tiktoken

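def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to the encoding used by gpt-4o
        enc = tiktoken.get_encoding("o200k_base")
    total = 0
    for message in messages:
        # Each message has a fixed overhead of 3 tokens
        total += 3
        for key, value in message.items():
            if not isinstance(value, str):
                # Fail loudly rather than undercount (None, tool calls, image parts)
                raise TypeError(f"Cannot count non-text value for {key!r}")
            total += len(enc.encode(value))
            if key == "name":
                total += 1  # A name field costs one extra token
    # Every reply is primed with 3 tokens
    total += 3
    return total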

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain recursive CTEs in PostgreSQL."},
]

print(count_tokens(messages))  # prints the estimated prompt token count

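The overhead numbers (3 per message, 3 for reply priming) follow OpenAI's published recipe for counting chat-format tokens. They are small, but they add up in long conversations. They can also change between model versions, so treat the total as a close estimate rather than an exact figure.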

Anthropic Claude

The Anthropic Python SDK exposes a messages.count_tokens method on the client that calls a lightweight endpoint without running inference. This is the most accurate option for Claude models because tokenization details are abstracted behind the API. Unlike tiktoken, it is a network call, so it needs an API key and adds a round trip.

import anthropic

client = anthropic.Anthropic()

response = client.messages.count_tokens(
    model="claude-opus-4-5",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain recursive CTEs in PostgreSQL."}
    ],
)

print(response.input_tokens)

Always check the SDK version you are running — the count_tokens method was added in a relatively recent release. If you are on an older version of the Anthropic SDK, upgrade before relying on it.
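One way to make that check explicit is to test for the method at startup instead of discovering its absence on the first long conversation. The sketch below is an assumption-laden illustration: supports_token_counting and require_token_counting are hypothetical helper names, and they only inspect the client object by attribute, without calling the API.

```python
def supports_token_counting(client) -> bool:
    """Return True if the client exposes a callable messages.count_tokens."""
    messages_api = getattr(client, "messages", None)
    return callable(getattr(messages_api, "count_tokens", None))


def require_token_counting(client) -> None:
    """Raise at startup if token counting is unavailable on this client."""
    if not supports_token_counting(client):
        raise RuntimeError(
            "This SDK client has no messages.count_tokens method; "
            "upgrade the SDK before relying on pre-call token counts."
        )
```

Calling require_token_counting(client) once during application startup turns a missing method into an immediate, readable failure.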

Setting a Hard Budget and Raising Real Errors

Counting tokens is only useful if you act on the count. The pattern below wraps your API call in a budget check that raises an explicit exception before the request leaves your application. You can then catch that exception and apply a trimming strategy rather than silently sending truncated context.

class TokenBudgetExceeded(Exception):
    def __init__(self, token_count: int, limit: int):
        self.token_count = token_count
        self.limit = limit
        super().__init__(
            f"Request uses {token_count} tokens, exceeds budget of {limit}."
        )

def safe_chat(
    messages: list[dict],
    model: str = "gpt-4o",
    context_limit: int = 128_000,
    reserve_for_output: int = 4_096,
) -> str:
    import openai

    budget = context_limit - reserve_for_output
    used = count_tokens(messages, model)

    if used > budget:
        raise TokenBudgetExceeded(used, budget)

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=reserve_for_output,
    )
    return response.choices[0].message.content

The reserve_for_output parameter is important. Many developers forget that the model needs token headroom to generate its response. A 128k context window does not mean you can send 128k tokens of input — you need to leave room for the answer.
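The arithmetic is simple, but it helps to keep it in one named place. The following sketch uses a hypothetical ModelBudget dataclass and a GPT_4O_BUDGET constant; the 128,000 and 4,096 figures are the defaults from safe_chat above, not values looked up from a provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelBudget:
    context_limit: int       # total tokens the model can attend to
    reserve_for_output: int  # tokens held back for the reply

    def __post_init__(self) -> None:
        if not 0 <= self.reserve_for_output < self.context_limit:
            raise ValueError("reserve_for_output must be >= 0 and below context_limit")

    @property
    def input_budget(self) -> int:
        """Maximum number of input tokens that still leaves room for the reply."""
        return self.context_limit - self.reserve_for_output

    def headroom(self, used: int) -> int:
        """Tokens left for input; negative means the request is over budget."""
        return self.input_budget - used


# Assumed figures, matching the safe_chat defaults
GPT_4O_BUDGET = ModelBudget(context_limit=128_000, reserve_for_output=4_096)
```

Here GPT_4O_BUDGET.input_budget is 123,904, so a 125,000-token request is already 1,096 tokens over budget even though it is below the advertised 128k window.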

Trimming Strategies When You Exceed the Budget

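When your budget check fires, you have three practical options: truncate, summarize, or reject. Rejecting means letting TokenBudgetExceeded propagate and asking the user to shorten the input. The other two are covered below.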

Sliding window truncation

The simplest approach is to drop the oldest non-system messages until the conversation fits. Always keep the system prompt intact, and keep the most recent user message — those are the two things the model must have.

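def trim_to_budget(
    messages: list[dict],
    model: str,
    budget: int,
) -> list[dict]:
    system_messages = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]

    # Stop at one remaining message so the most recent turn is never dropped
    while len(conversation) > 1 and count_tokens(system_messages + conversation, model) > budget:
        # Drop the oldest non-system message
        conversation.pop(0)

    trimmed = system_messages + conversation
    used = count_tokens(trimmed, model)
    if used > budget:
        # Nothing left that is safe to drop: fail loudly instead of sending it
        raise TokenBudgetExceeded(used, budget)
    return trimmed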

This is cheap to compute and works well when older messages are genuinely less relevant. It fails when the conversation has critical context in early messages — for example, a document the user pasted at the start of the session.
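When the critical context sits at the start of the session, one possible variation is to protect the first few conversation messages and trim from the middle instead. This is a sketch under assumptions: trim_keeping_head and its keep_first parameter are hypothetical, and count_fn stands in for any function that returns the token count of a message list.

```python
from typing import Callable


def trim_keeping_head(
    messages: list[dict],
    budget: int,
    count_fn: Callable[[list[dict]], int],
    keep_first: int = 1,
) -> list[dict]:
    """Drop the oldest unprotected messages until the list fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]

    head = conversation[:keep_first]  # protected early context
    tail = conversation[keep_first:]  # candidates for removal, oldest first

    # Never drop the final message in the tail: it is the most recent turn
    while len(tail) > 1 and count_fn(system + head + tail) > budget:
        tail.pop(0)

    result = system + head + tail
    if count_fn(result) > budget:
        raise ValueError("protected messages alone exceed the token budget")
    return result
```

The trade-off is a gap in the middle of the conversation, so the model sees the pasted document and the latest turns but not what happened in between.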

Summarization compression

A more robust approach: before dropping old messages, ask the model to summarize them into a single compressed message. Replace the dropped block with a summary message tagged with a clear role label.

def summarize_history(
    history: list[dict],
    model: str = "gpt-4o-mini",
) -> dict:
    import openai

    client = openai.OpenAI()
    transcript = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in history
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize the following conversation history concisely, "
                    "preserving all facts, decisions, and user preferences.\n\n"
                    + transcript
                ),
            }
        ],
        max_tokens=512,
    )
    return {
        "role": "system",
        "content": "[Conversation summary] " + response.choices[0].message.content,
    }

Use a smaller, faster model for summarization to keep latency and cost low. The summary becomes a new system-level message that subsequent turns can reference.
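Putting the pieces together, the older part of the conversation can be swapped for its summary while the most recent turns stay verbatim. The sketch below assumes a hypothetical compress_history helper; count_fn and summarize_fn are injected so that any token counter and any summarizer with the same shape as summarize_history can be used.

```python
from typing import Callable


def compress_history(
    messages: list[dict],
    budget: int,
    count_fn: Callable[[list[dict]], int],
    summarize_fn: Callable[[list[dict]], dict],
    keep_recent: int = 4,
) -> list[dict]:
    """Replace older turns with one summary message when over budget."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if count_fn(messages) <= budget:
        return list(messages)  # already fits; nothing to compress

    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    if len(conversation) <= keep_recent:
        raise ValueError("no messages old enough to summarize")

    old = conversation[:-keep_recent]
    recent = conversation[-keep_recent:]
    compressed = system + [summarize_fn(old)] + recent

    if count_fn(compressed) > budget:
        raise ValueError("still over budget after summarization")
    return compressed
```

Raising when the compressed list is still too large keeps the guard honest: the caller decides whether to shrink keep_recent, trim further, or reject the request.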

Handling Streamed Responses

Streaming complicates token tracking because you do not know the output length until the stream finishes. The safe pattern is to track the usage field returned in the final chunk of a streamed response and log it for post-hoc analysis. Some APIs include a usage object in stream metadata; others require you to accumulate chunk text and count it yourself after the fact.

import openai

client = openai.OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 10 sorting algorithms."}],
    stream=True,
    stream_options={"include_usage": True},
)

full_text = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        full_text += chunk.choices[0].delta.content
    if chunk.usage:
        print(f"Input tokens: {chunk.usage.prompt_tokens}")
        print(f"Output tokens: {chunk.usage.completion_tokens}")

Logging actual usage after every request gives you real data to tune your budgets. If your pre-call estimate and the actual usage diverge significantly, your token-counting function has a bug or the model changed its tokenizer.
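A small comparison helper makes that divergence visible in logs. This is a sketch, not a definitive implementation: estimate_drift, check_estimate, the "token_budget" logger name, and the 5% default tolerance are all assumptions chosen for illustration.

```python
import logging

logger = logging.getLogger("token_budget")


def estimate_drift(estimated: int, actual: int) -> float:
    """Relative error of the pre-call estimate against reported usage."""
    if actual <= 0:
        raise ValueError("actual token usage must be positive")
    return (estimated - actual) / actual


def check_estimate(estimated: int, actual: int, tolerance: float = 0.05) -> bool:
    """Return True if the estimate is within tolerance; log a warning if not."""
    drift = estimate_drift(estimated, actual)
    if abs(drift) > tolerance:
        logger.warning(
            "Token estimate off by %.1f%% (estimated=%d, actual=%d)",
            drift * 100, estimated, actual,
        )
        return False
    return True
```

In the streaming loop above, the value to pass as actual would be chunk.usage.prompt_tokens, with estimated coming from count_tokens on the same messages. A reported count far below the estimate is worth investigating, since it may mean something between your code and the model dropped input.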

Common Pitfalls

Forgetting tool/function definitions count toward input tokens. If you pass a list of function schemas to the API, those schemas are tokenized as part of the input. A large set of tools can consume thousands of tokens before a single message is included. Count them explicitly.

Assuming the context limit is stable across model versions. Providers update context windows when they release new model versions. Pin your context_limit constant to a specific model version, not a model family name, and review it when you upgrade.

Using character length as a proxy for token count. It is fine for rough estimates, but never use it in a budget guard. Text in non-Latin scripts or heavy in emoji can use several times as many tokens per character as ASCII English, so a character-based guard can undercount badly.

Not accounting for the system prompt in multi-turn applications. In a long chat session, the system prompt is re-sent with every request. If your system prompt is 2,000 tokens, that is 2,000 tokens permanently consumed from every single request's budget.

Silently swallowing TokenBudgetExceeded exceptions. It is tempting to catch and suppress these errors in production to avoid noisy logs. Instead, log them at WARNING level and emit a metric. They are a signal that your conversations are getting longer than expected, which is a product insight worth having.
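For the first pitfall in this list, a rough way to account for tool definitions is to serialize each schema and count it with the same tokenizer used for messages. The sketch below assumes a hypothetical estimate_tool_tokens helper and a count_text callable that returns the token count of a string. Providers render tool schemas into their own internal format, so the result is an approximation to be checked against reported usage.

```python
import json
from typing import Callable


def estimate_tool_tokens(
    tools: list[dict],
    count_text: Callable[[str], int],
) -> int:
    """Approximate the input tokens consumed by a list of tool schemas."""
    total = 0
    for tool in tools:
        # Compact JSON avoids counting whitespace the provider never sees
        serialized = json.dumps(tool, separators=(",", ":"))
        total += count_text(serialized)
    return total
```

Adding this estimate to the message count before the budget check keeps large tool lists from quietly eating the headroom reserved for conversation history.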

Wrapping Up

Silent context truncation is fixable, but only if you treat it as a first-class concern from the start. Here are the concrete steps to take after reading this:

Add token counting to every request path in your application. Use tiktoken for OpenAI models or the Anthropic SDK's count_tokens method for Claude.
Define an explicit input budget as a named constant: context limit minus your expected output reserve. Raise a real exception when the budget is exceeded.
Implement a trimming strategy that matches your use case — sliding window for transient conversations, summarization compression when early context matters.
Log actual token usage from every API response and compare it to your pre-call estimates. Use that data to tighten your budgets over time.
Audit your tool definitions and system prompts for token weight. Even static parts of your prompt stack consume budget on every request.
