Batching LLM API Calls Without Blowing Up Latency or Rate Limits

May 23, 2026

You've shipped a feature that calls an LLM API on every user action. It works great in staging with five concurrent users. Then real traffic hits and your logs fill up with 429s, your p95 latency triples, and your costs spike because retries are firing constantly. This is not a scaling problem; it's a batching problem.

Batching LLM requests is different from batching database writes. You're not just grouping identical operations; you're managing token budgets, handling variable response times, and keeping end-users from staring at a spinner for six seconds. Getting this right takes more than wrapping calls in asyncio.gather.

  • How rate limits actually work on major LLM APIs and what counters you're really hitting
  • Request queuing patterns that smooth out burst traffic without adding painful latency
  • Token-aware batching so you stop guessing at request sizes
  • Retry strategies that back off intelligently instead of hammering the endpoint
  • When to use provider-native batch endpoints versus rolling your own queue

How LLM Rate Limits Actually Work

Most LLM providers enforce two separate limit types simultaneously: requests per minute (RPM) and tokens per minute (TPM). Hitting either one triggers a 429. Most engineers watch only the RPM and then get blindsided by TPM limits on long prompts.

Input tokens count against your TPM budget as soon as you send a request. Output tokens are added as they are generated, and some providers reserve an estimate based on max_tokens up front. So a 2,000-token prompt with a 500-token response costs 2,500 tokens of your per-minute budget. If you fire five of those concurrently, you've consumed 12,500 tokens in one burst.

Providers also typically reset these buckets on a rolling one-minute window, not a fixed clock minute. That means a burst at second 55 and another at second 62 can hit the same effective window. Design your throttling around the rolling window assumption, not a clean reset at :00.
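To make the rolling-window point concrete, here is a small sketch that compares a fixed clock-minute counter with a trailing 60-second window. The function name tokens_in_window and the event timestamps are invented for illustration; providers do not expose their counters in this form, and some use token-bucket schemes that behave a little differently.

```python
def tokens_in_window(
    events: list[tuple[float, int]], now: float, window: float = 60.0
) -> int:
    """Sum tokens from events inside the trailing window that ends at `now`."""
    return sum(tokens for ts, tokens in events if 0 <= now - ts < window)


# Two bursts of 12,500 tokens, seven seconds apart but in different clock minutes.
# Timestamps are seconds since an arbitrary start (hypothetical data).
events = [(55.0, 12_500), (62.0, 12_500)]

# A fixed clock-minute counter sees two separate, harmless bursts...
fixed_minute_0 = sum(t for ts, t in events if ts // 60 == 0)
fixed_minute_1 = sum(t for ts, t in events if ts // 60 == 1)

# ...while a rolling window sees both bursts at once.
rolling_at_62 = tokens_in_window(events, now=62.0)
```

With these numbers the fixed counters each report 12,500 tokens while the rolling window reports 25,000, which is the figure your throttle should plan around.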

The Case for a Local Request Queue

The most reliable pattern for sustained throughput is a token-aware request queue that sits between your application logic and the API client. Instead of calling the LLM directly, your application code enqueues a job. A small pool of workers drains the queue while tracking tokens-per-minute consumed in a sliding window.

This decouples your application response time from the LLM call. You can return a job ID immediately, let the LLM call complete asynchronously, and push the result via a callback or webhook. This works well for background tasks like document summarization, batch classification, or nightly report generation.

For interactive use cases where the user expects a response in the same request, you still benefit from a queue, but you need a timeout-aware design so the user doesn't wait forever. Cap your queue wait time, fail fast with a clear message, and let the client retry.
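One way to sketch the timeout-aware design described above is to pair each queued job with a future and bound the caller's wait with asyncio.wait_for. The names QueueTimeout, submit_with_deadline, worker, and handler are hypothetical, and handler stands in for whatever coroutine actually calls the LLM.

```python
import asyncio
from typing import Awaitable, Callable


class QueueTimeout(Exception):
    """Raised when a job is not answered within the caller's deadline."""


async def submit_with_deadline(
    queue: asyncio.Queue, payload: str, max_wait: float
) -> str:
    # Each job carries a future that the worker resolves with the result.
    future: asyncio.Future = asyncio.get_running_loop().create_future()
    await queue.put((payload, future))
    try:
        # wait_for cancels the future if the deadline passes.
        return await asyncio.wait_for(future, timeout=max_wait)
    except asyncio.TimeoutError:
        raise QueueTimeout(f"no response within {max_wait}s") from None


async def worker(
    queue: asyncio.Queue, handler: Callable[[str], Awaitable[str]]
) -> None:
    while True:
        payload, future = await queue.get()
        try:
            if future.cancelled():
                continue  # the caller already gave up; skip the API call
            result = await handler(payload)
            if not future.done():
                future.set_result(result)
        except Exception as exc:
            if not future.done():
                future.set_exception(exc)
        finally:
            queue.task_done()
```

Checking future.cancelled() before calling the handler means a job whose caller has already timed out never spends tokens, which matters most when the queue is backed up.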

Token-Aware Batching in Practice

Before you can throttle by tokens, you need to count them. Most major providers use some form of subword tokenization, typically a byte-pair encoding (BPE) variant, and libraries or endpoints exist to count tokens before you send the request.

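import tiktoken  # works for OpenAI-family models

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    # Look up the encoding tiktoken associates with this model name
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_request_tokens(system_prompt: str, user_message: str, model: str = "gpt-4o") -> int:
    # Count each message separately: concatenating the strings first can merge
    # tokens across the boundary and skew the count
    content_tokens = count_tokens(system_prompt, model) + count_tokens(user_message, model)
    # Rough allowance for chat message framing; the exact overhead varies by model
    return content_tokens + 8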

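For Anthropic's Claude models, the anthropic Python SDK exposes client.messages.count_tokens, which sends the would-be request to a token-counting endpoint and returns the input token count, so you don't need a separate tokenizer library. Because it is a network call, use it where the extra round trip is acceptable. Use whichever counting method matches your provider.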

With token counts in hand, your queue worker can make an informed decision before dispatching: check remaining TPM budget in the current window, and if the request would exceed it, hold the job until the window rolls forward.

import time
from collections import deque

class SlidingWindowRateLimiter:
    def __init__(self, max_tokens_per_minute: int, max_requests_per_minute: int):
        self.max_tpm = max_tokens_per_minute
        self.max_rpm = max_requests_per_minute
        self.token_log: deque = deque()  # (timestamp, token_count)
        self.request_log: deque = deque()  # timestamp

    def _evict_old(self, log: deque, window: float = 60.0):
        now = time.monotonic()
        while log and now - log[0][0] > window:
            log.popleft()

    def _evict_requests(self, window: float = 60.0):
        now = time.monotonic()
        while self.request_log and now - self.request_log[0] > window:
            self.request_log.popleft()

    def can_dispatch(self, token_count: int) -> bool:
        self._evict_old(self.token_log)
        self._evict_requests()
        current_tokens = sum(t for _, t in self.token_log)
        current_requests = len(self.request_log)
        return (
            current_tokens + token_count <= self.max_tpm
            and current_requests + 1 <= self.max_rpm
        )

    def record_dispatch(self, token_count: int):
        now = time.monotonic()
        self.token_log.append((now, token_count))
        self.request_log.append(now)

This class gives your worker a clean can_dispatch check before each API call. If it returns False, sleep briefly and check again.
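The "sleep briefly and check again" step might look like the sketch below. It accepts any object with can_dispatch and record_dispatch methods, so it should work with the class above; the function name wait_for_capacity and the poll_interval and max_wait defaults are assumptions chosen for illustration.

```python
import asyncio
import time
from typing import Protocol


class Limiter(Protocol):
    def can_dispatch(self, token_count: int) -> bool: ...
    def record_dispatch(self, token_count: int) -> None: ...


async def wait_for_capacity(
    limiter: Limiter,
    token_count: int,
    poll_interval: float = 0.1,
    max_wait: float = 30.0,
) -> bool:
    """Poll until the limiter has room, then record the dispatch.

    Returns False if no capacity appears within max_wait seconds.
    """
    deadline = time.monotonic() + max_wait
    while not limiter.can_dispatch(token_count):
        if time.monotonic() + poll_interval > deadline:
            return False
        await asyncio.sleep(poll_interval)
    # There is no await between the successful check and the record, so
    # other coroutines on the same event loop cannot slip in between them.
    limiter.record_dispatch(token_count)
    return True
```

Recording inside the helper keeps check and record together. If workers run in separate threads or processes, this sketch is not enough and the limiter needs a lock or shared store.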

Async Concurrency Without Thundering Herd

Running requests concurrently with asyncio is the right approach for I/O-bound LLM calls, but unbounded concurrency causes the thundering herd problem: every held request fires simultaneously the moment a rate limit window clears.

Use a semaphore to cap concurrency at a number you've calculated from your TPM limit, your average request size, and your typical call latency. If your TPM limit is 100,000 and your average request is 2,000 tokens, your budget is about 50 requests per minute, and those should be spread across the rolling window, not fired all at once. The number of requests in flight at any moment is that rate multiplied by how long each call takes.
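As a rough worked example of that calculation, the sketch below turns a TPM limit, an average request size, and an average latency into a semaphore size. The function name concurrency_cap and the 6-second latency are assumptions for illustration; measure your own latency before trusting the result.

```python
import math


def concurrency_cap(
    tpm_limit: int, avg_tokens_per_request: int, avg_latency_seconds: float
) -> int:
    """Estimate how many requests can be in flight without exceeding TPM."""
    requests_per_minute = tpm_limit // avg_tokens_per_request
    # In-flight requests = arrival rate (per second) x time each one takes.
    in_flight = requests_per_minute * avg_latency_seconds / 60.0
    return max(1, math.floor(in_flight))


# 100,000 TPM / 2,000 tokens = 50 requests per minute; at 6 s per call
# that works out to 5 requests in flight at a time.
cap = concurrency_cap(100_000, 2_000, avg_latency_seconds=6.0)
```

Treat the result as a starting point and leave headroom, since both request size and latency vary from call to call.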

import asyncio
import anthropic

async def call_llm(client: anthropic.AsyncAnthropic, prompt: str, sem: asyncio.Semaphore) -> str:
    async with sem:
        message = await client.messages.create(
            model="claude-opus-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text

async def process_batch(prompts: list[str], max_concurrent: int = 10) -> list[str | BaseException]:
    client = anthropic.AsyncAnthropic()
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [call_llm(client, p, sem) for p in prompts]
    return await asyncio.gather(*tasks, return_exceptions=True)

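The return_exceptions=True argument is important. Without it, the first exception propagates out of gather immediately and you lose access to the other results, even though the remaining tasks are not cancelled and keep running in the background. With it, exceptions are returned as values in the result list and you can handle them per-item.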
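Handling the results per-item could look like the following sketch, which splits the gather output into successes and failures keyed by input position. The helper gather_split and the stand-in coroutine _demo_call are hypothetical; in practice the coroutines would be your real LLM calls.

```python
import asyncio
from typing import Any, Awaitable, Iterable


async def gather_split(
    coros: Iterable[Awaitable[Any]],
) -> tuple[dict[int, Any], dict[int, BaseException]]:
    """Run all awaitables and separate results from exceptions by index."""
    results = await asyncio.gather(*coros, return_exceptions=True)
    succeeded: dict[int, Any] = {}
    failed: dict[int, BaseException] = {}
    for index, result in enumerate(results):
        if isinstance(result, BaseException):
            failed[index] = result
        else:
            succeeded[index] = result
    return succeeded, failed


async def _demo_call(prompt: str) -> str:
    # Stand-in for an LLM call: fails on an empty prompt.
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


async def demo() -> tuple[dict[int, Any], dict[int, BaseException]]:
    return await gather_split([_demo_call(p) for p in ["a", "", "c"]])
```

Keying by index lets you re-enqueue only the failed prompts without re-running the ones that succeeded.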

Retry Logic That Doesn't Make Things Worse

A naive retry on 429 without backoff turns a temporary rate limit into a sustained one. The standard approach is exponential backoff with jitter: wait longer after each failed attempt, and add a random offset so multiple workers don't retry in lockstep.

import asyncio
import random
import anthropic

async def call_with_backoff(
    client: anthropic.AsyncAnthropic,
    prompt: str,
    max_retries: int = 5,
) -> str:
    # Note: the SDK client also retries some errors on its own (see its
    # max_retries option); set that to 0 if this loop should be the only layer
    base_delay = 1.0
    for attempt in range(max_retries):
        try:
            message = await client.messages.create(
                model="claude-opus-4-5",
                max_tokens=512,
                messages=[{"role": "user", "content": prompt}],
            )
            return message.content[0].text
        except anthropic.APIStatusError as e:
            # RateLimitError (429) is a subclass of APIStatusError.
            # Retry 429s and 5xx errors; re-raise every other 4xx immediately
            retryable = e.status_code == 429 or e.status_code >= 500
            if not retryable or attempt == max_retries - 1:
                raise
            # Exponential backoff plus up to one second of jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
    # Only reachable when max_retries < 1
    raise ValueError("max_retries must be at least 1")

Check the Retry-After header when it's present. Some providers include it on 429 responses, telling you exactly how many seconds to wait. Using that value instead of your calculated delay is more accurate and avoids unnecessary sleep time.
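A minimal sketch of that preference order follows: use Retry-After when it parses as a number of seconds, otherwise fall back to exponential backoff with jitter. The function name backoff_delay and the 60-second cap are assumptions. The header can also carry an HTTP date, which this sketch does not parse and treats as absent.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Return how many seconds to sleep before the next retry."""
    if retry_after is not None:
        try:
            # Numeric form: the server says how many seconds to wait.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form or garbage: fall back to backoff
    # Exponential backoff, capped, plus up to one second of jitter.
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)
```

How you read the header depends on your client library, so the function takes the raw header string and leaves the lookup to the caller.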

Using Provider-Native Batch Endpoints

For workloads that aren't latency-sensitive, provider-native batch APIs are the right tool. Anthropic and OpenAI both offer batch processing endpoints designed for offline jobs. You submit a list of requests, get back a batch ID, and poll for results. Throughput limits are much higher, and costs are typically lower per token.

The tradeoff is turnaround time: results may take minutes to hours, not milliseconds. That's fine for nightly classification runs, bulk embedding generation, or pre-computing summaries. It's wrong for anything a user is waiting on.

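Anthropic's Message Batches API accepts large batches: at the time of writing its documentation lists a cap of 100,000 requests or 256 MB per batch, whichever is reached first, so check the current limits before sizing your jobs. Each request in the batch is a full messages payload with its own custom_id. When the batch completes, you retrieve results keyed by that ID and route them to wherever they need to go.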

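import anthropic

client = anthropic.Anthropic()

# Placeholder inputs; in practice these come from your job queue
prompts = ["Summarize document A.", "Summarize document B."]

batch = client.messages.batches.create(
    requests=[
        {
            # custom_id is how you match results back to inputs later
            "custom_id": f"job-{i}",
            "params": {
                "model": "claude-opus-4-5",
                "max_tokens": 256,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
)
print(batch.id)  # Store this, then poll later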

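Build a simple polling loop that calls client.messages.batches.retrieve(batch_id) and waits until processing_status is "ended". Then iterate over client.messages.batches.results(batch_id), which streams results one at a time, so you avoid loading everything into memory at once. Results are not guaranteed to come back in submission order, so match them to your inputs by custom_id.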
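A polling loop along those lines might be sketched as follows. The function name iter_batch_results, the 60-second default interval, and the injectable sleep parameter are assumptions for illustration; client is expected to be an anthropic.Anthropic instance, typed loosely here so the sketch does not depend on the SDK being installed.

```python
import time
from typing import Any, Callable, Iterator


def iter_batch_results(
    client: Any,
    batch_id: str,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[tuple[str, str]]:
    """Wait for a batch to end, then yield (custom_id, result type) pairs."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        sleep(poll_seconds)

    # Results are consumed one at a time instead of being loaded into a list.
    for entry in client.messages.batches.results(batch_id):
        # result.type says whether this request succeeded, errored,
        # was canceled, or expired.
        yield entry.custom_id, entry.result.type
```

A production version would probably add an overall deadline and read the message content from succeeded entries; both are left out to keep the sketch short.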

Common Pitfalls

Counting only input tokens. Your TPM budget covers both input and output. If you size requests based only on the prompt, you'll consistently overshoot. Estimate expected output length and include it in your pre-dispatch token budget check. A safe heuristic is to assume the full max_tokens value will be consumed even if it usually isn't.

Treating all requests as equal priority. Not every LLM call is equally urgent. A user-facing autocomplete suggestion should jump ahead of a background tagging job in the queue. Build priority levels into your queue from day one, because retrofitting priority later is painful.
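One simple way to get priority levels is a heap keyed on (priority, arrival order). The class name PriorityJobQueue and the two priority constants are hypothetical, and this sketch is a plain in-memory structure with no locking or persistence.

```python
import heapq
import itertools
from typing import Any

INTERACTIVE = 0   # lower number = served first
BACKGROUND = 10


class PriorityJobQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, Any]] = []
        # The counter breaks ties so equal-priority jobs stay first in,
        # first out, and so jobs themselves are never compared.
        self._counter = itertools.count()

    def push(self, priority: int, job: Any) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def pop(self) -> Any:
        _priority, _order, job = heapq.heappop(self._heap)
        return job

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityJobQueue()
queue.push(BACKGROUND, "tag-article-17")
queue.push(INTERACTIVE, "autocomplete-user-42")
queue.push(BACKGROUND, "tag-article-18")
first = queue.pop()
```

Here first is the interactive job even though it was enqueued second. Strict priority can starve background work under sustained load, so you may want to reserve a share of the budget for it.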

Ignoring context window limits. Rate limits aren't the only ceiling. If you're batching documents into a single prompt for summarization, you can exceed the model's context window. Chunk documents before enqueueing, not inside the worker, so the queue always holds work that's safe to dispatch without modification.
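Chunking before enqueueing can be sketched as a greedy packer over paragraphs. The function name chunk_by_tokens is hypothetical, and the default counter that splits on whitespace is only a crude stand-in; pass your provider's real token counter in practice. The separator inserted between paragraphs is not counted, so leave some margin in max_tokens.

```python
from typing import Callable


def chunk_by_tokens(
    paragraphs: list[str],
    max_tokens: int,
    count: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Greedily pack paragraphs into chunks that stay within max_tokens."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for paragraph in paragraphs:
        size = count(paragraph)
        if size > max_tokens:
            raise ValueError("paragraph exceeds the budget; split it further")
        # Close the current chunk when adding this paragraph would overflow.
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


chunks = chunk_by_tokens(["a b c", "d e", "f g h i"], max_tokens=5)
```

With a budget of 5 the first two paragraphs share a chunk and the third gets its own, so every item placed on the queue is already safe to send.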

Not tracking per-key limits separately. If you're using multiple API keys to increase throughput, make sure your rate limiter tracks each key's window independently. A single shared limiter across multiple keys will throttle you far below your actual capacity.

Forgetting that streaming changes your token accounting. When you stream a response, you don't know the output token count until the stream finishes. If you're accounting for tokens mid-stream, you need to track the running count as chunks arrive and update your limiter after the stream closes.
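One way to handle this is to reserve a worst-case figure when the request starts and replace it with the real usage once the stream closes. The class TokenLedger and its method names are hypothetical, and the sketch leaves out the time window so the bookkeeping stays visible; in a real limiter each entry would also carry a timestamp.

```python
class TokenLedger:
    """Tracks token usage per request, first as an estimate, then as fact."""

    def __init__(self) -> None:
        self._entries: dict[int, int] = {}
        self._next_id = 0

    def reserve(self, input_tokens: int, max_tokens: int) -> int:
        # Before streaming starts, assume the full max_tokens will be used.
        reservation_id = self._next_id
        self._next_id += 1
        self._entries[reservation_id] = input_tokens + max_tokens
        return reservation_id

    def settle(self, reservation_id: int, input_tokens: int, output_tokens: int) -> None:
        # After the stream closes, swap the estimate for the actual usage.
        self._entries[reservation_id] = input_tokens + output_tokens

    def total(self) -> int:
        return sum(self._entries.values())


ledger = TokenLedger()
rid = ledger.reserve(input_tokens=2_000, max_tokens=512)
reserved_total = ledger.total()
ledger.settle(rid, input_tokens=2_000, output_tokens=130)
settled_total = ledger.total()
```

Reserving high and settling low errs on the side of under-using the budget, which is usually cheaper than a 429.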

Wrapping Up

Batching LLM calls well is mostly about instrumentation and accounting before it's about clever architecture. Once you know your actual token consumption per request and per minute, the rest follows logically.

Here are concrete next steps:

  • Add client-side token counting to every LLM call in your codebase this week. Log the counts so you can see your actual TPM usage before you build a limiter around it.
  • Implement a semaphore-bounded async dispatcher for your interactive paths. Start with a conservative concurrency cap and increase it based on observed 429 rates.
  • Separate your background LLM jobs from your interactive ones. Route background jobs to a provider batch endpoint or a low-priority queue, and stop letting them compete for the same rate limit budget.
  • Add exponential backoff with jitter to every retry path. Check for the Retry-After header and use it when present.
  • Set up a dashboard or at minimum a log aggregation query that shows your RPM and TPM utilization over time. You can't tune a system you can't observe.

