Tool Call Failures in LLM Agents: Fixing Bad Function Args

Your agent confidently calls get_order_status with order_id: null. Or it passes a date string where your function expects a Unix timestamp. Or it invents a parameter that doesn't exist in your schema. The function throws an error, the agent either silently fails or hallucinates a recovery, and your user gets garbage.

Tool call failures are one of the most common and frustrating failure modes when building LLM-powered agents. The good news is that most of them are preventable once you understand the root causes.

What you'll learn

Why LLMs produce malformed tool arguments in the first place
How to write function schemas that reduce ambiguity
How to add a validation layer between the model output and your function execution
How to implement structured retry logic when arguments are bad
How to use prompt engineering to steer the model toward correct argument construction

Prerequisites

This article assumes you're working with a tool-calling API (OpenAI, Anthropic, or a compatible wrapper). Code examples use Python. You should be familiar with how function calling works at a basic level — if you haven't seen it before, read the function calling quickstart for your provider first.

Why the Model Gets Arguments Wrong

The model is not executing code. It's predicting text. When it produces a function call with arguments, it's generating a JSON-shaped string that it believes satisfies the schema you provided — based on patterns in its training data, your system prompt, and the conversation context so far.

This means several things can go wrong:

Ambiguous schema descriptions — the model infers intent incorrectly because your description field was vague or missing.
Type coercion assumptions — the model treats "123" and 123 as interchangeable. Your function does not.
Missing required fields — the model skips a required argument because there's no clear value in context, or because it doesn't understand the field is required.
Hallucinated parameters — the model invents fields it expects your function to accept, often because it's seen similar APIs in training data.
Format mismatches — dates as "MM/DD/YYYY" when you wanted "YYYY-MM-DD", enum values in the wrong case, arrays instead of strings.

Understanding which of these is hitting you is the first diagnostic step.
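As a rough aid for that diagnosis, the sketch below compares a set of model-produced arguments against a tool's parameter schema and labels each problem. The function name `diagnose_args` and the `JSON_TYPES` mapping are illustrative, not part of any provider SDK, and format mismatches are left out because they need field-specific checks.

```python
# Map JSON Schema type names to the Python types that json.loads produces.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def diagnose_args(parameters: dict, args: dict) -> list[str]:
    """Return human-readable problems found in `args` (hypothetical helper)."""
    problems = []
    properties = parameters.get("properties", {})

    # Missing required fields; an explicit null counts as missing here.
    for name in parameters.get("required", []):
        if args.get(name) is None:
            problems.append(f"missing required field: {name}")

    for name, value in args.items():
        # Hallucinated parameters: keys the schema never declared.
        if name not in properties:
            problems.append(f"unknown parameter: {name}")
            continue
        if value is None:
            continue
        expected = properties[name].get("type")
        py_type = JSON_TYPES.get(expected)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if py_type and (not isinstance(value, py_type) or bool_as_number):
            problems.append(
                f"type mismatch for {name}: expected {expected}, got {type(value).__name__}"
            )
    return problems


order_schema = {
    "type": "object",
    "properties": {"order_id": {"type": "integer"}, "date": {"type": "string"}},
    "required": ["order_id", "date"],
}
for problem in diagnose_args(order_schema, {"order_id": "123", "when": "today"}):
    print(problem)
```

Logging these labels for every failed call shows which root cause dominates in your own traffic.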

Write Schemas That Leave No Room for Guessing

The JSON schema you pass to the model is its primary instruction set for building arguments. Most developers write the bare minimum — a name and a type. That's not enough.

Consider this weak schema for a flight search function:

{
  "name": "search_flights",
  "parameters": {
    "type": "object",
    "properties": {
      "origin": { "type": "string" },
      "destination": { "type": "string" },
      "date": { "type": "string" }
    },
    "required": ["origin", "destination", "date"]
  }
}

Now compare it to a schema whose descriptions spell out the expected formats:

{
  "name": "search_flights",
  "description": "Search for available flights between two airports on a specific date. Use IATA airport codes for origin and destination.",
  "parameters": {
    "type": "object",
    "properties": {
      "origin": {
        "type": "string",
        "description": "IATA airport code for the departure airport, e.g. 'JFK', 'LHR', 'SYD'. Always uppercase, exactly 3 characters."
      },
      "destination": {
        "type": "string",
        "description": "IATA airport code for the arrival airport, e.g. 'CDG', 'NRT'. Always uppercase, exactly 3 characters."
      },
      "date": {
        "type": "string",
        "description": "Departure date in ISO 8601 format: YYYY-MM-DD. Example: '2025-09-15'."
      }
    },
    "required": ["origin", "destination", "date"]
  }
}

The second schema gives the model enough context to get the arguments right on the first attempt. Include description fields everywhere, add concrete examples inline, and specify formats explicitly. Treat every field as if someone who has never seen your API is reading it, because in a sense that's exactly what's happening.
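A small audit script can flag parameters that still lack a description. The sketch below assumes tool definitions shaped like the two examples above (a top-level "parameters" object with "properties"); the helper name `find_undocumented_params` is made up for illustration.

```python
def find_undocumented_params(tool: dict) -> list[str]:
    """Return the names of parameters that have no non-empty description."""
    properties = tool.get("parameters", {}).get("properties", {})
    return [
        name
        for name, spec in properties.items()
        if not str(spec.get("description", "")).strip()
    ]


# The weak schema from above: every parameter is undocumented.
weak_tool = {
    "name": "search_flights",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["origin", "destination", "date"],
    },
}

print(find_undocumented_params(weak_tool))  # ['origin', 'destination', 'date']
```

Running a check like this over every tool definition, for example in CI, keeps new parameters from shipping without guidance for the model.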

Validate Before You Execute

Never pass raw model output directly to your function. Always run a validation step in between. This decouples argument parsing from function execution and gives you a clear place to handle errors without crashing your agent loop.

A simple Pydantic-based approach works well here:

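# Written against Pydantic v2; on v1, use `validator` and `.dict()` instead.
from datetime import date as date_type
import re

from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator

class SearchFlightsArgs(BaseModel):
    # Reject parameters the schema never declared instead of silently dropping them
    model_config = ConfigDict(extra="forbid")

    origin: str = Field(..., min_length=3, max_length=3)
    destination: str = Field(..., min_length=3, max_length=3)
    date: str

    @field_validator("origin", "destination")
    @classmethod
    def must_be_uppercase_iata(cls, v):
        if not re.fullmatch(r"[A-Z]{3}", v):
            raise ValueError(f"Expected uppercase IATA code, got: {v}")
        return v

    @field_validator("date")
    @classmethod
    def must_be_iso_date(cls, v):
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
            raise ValueError(f"Expected YYYY-MM-DD format, got: {v}")
        try:
            date_type.fromisoformat(v)  # rejects impossible dates such as 2025-13-45
        except ValueError:
            raise ValueError(f"Not a real calendar date: {v}")
        return v


def safe_execute_tool(tool_name: str, raw_args: dict):
    validators = {
        "search_flights": SearchFlightsArgs,
    }
    schema = validators.get(tool_name)
    if schema is None:
        # Return the error instead of raising, so callers handle it like any other failure
        return {"error": f"Unknown tool: {tool_name}", "raw_args": raw_args}
    try:
        validated = schema(**raw_args)
    except ValidationError as e:
        # One short "field: message" entry per problem keeps the feedback precise
        details = "; ".join(
            f"{'.'.join(map(str, err['loc']))}: {err['msg']}" for err in e.errors()
        )
        return {"error": details, "raw_args": raw_args}
    # Validated arguments, ready to pass to the real function
    return validated.model_dump()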

When validation fails, you get a structured error object instead of an exception flying up through your agent loop. That error becomes the input to your retry logic.

Implement a Retry Loop With Error Feedback

A single failed tool call shouldn't end the agent's run. Feed the validation error back to the model and ask it to try again. Most modern models can self-correct when given precise error messages.

import json

def agent_step_with_retry(client, messages, tools, max_retries=2):
    for attempt in range(max_retries + 1):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            # This loop answers one tool call per turn, so request at most one
            parallel_tool_calls=False,
        )

        message = response.choices[0].message
        if not message.tool_calls:
            return message.content

        tool_call = message.tool_calls[0]
        tool_name = tool_call.function.name
        try:
            raw_args = json.loads(tool_call.function.arguments)
        except json.JSONDecodeError as e:
            # The arguments string itself can be malformed JSON
            result = {"error": f"Arguments were not valid JSON: {e}"}
        else:
            if isinstance(raw_args, dict):
                result = safe_execute_tool(tool_name, raw_args)
            else:
                result = {"error": "Arguments must be a JSON object"}

        if "error" in result:
            if attempt == max_retries:
                return f"Tool call failed after {max_retries + 1} attempts: {result['error']}"
            # Feed the error back as a tool result
            messages = messages + [
                message,
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps({
                        "status": "error",
                        "message": result["error"],
                        "hint": "Please correct the arguments and try again."
                    })
                }
            ]
            continue

        return result

    return "Max retries exceeded"

The key here is that the error message passed back to the model is specific. Telling the model "Expected YYYY-MM-DD format, got: 09/15/2025" gives it exactly what it needs to fix the argument on the next attempt.
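The loop above reads only the first tool call in a response. If your provider can return several calls in one turn, each call ID generally needs its own tool result before the next request. The sketch below builds one message per call; `build_tool_messages` and the `run_tool` callback are illustrative names, and the fake call objects only mimic the attribute layout (`id`, `function.name`, `function.arguments`) used in the loop above.

```python
import json
from types import SimpleNamespace
from typing import Callable


def build_tool_messages(tool_calls, run_tool: Callable[[str, dict], dict]) -> list[dict]:
    """Produce exactly one role="tool" message for every tool call."""
    messages = []
    for call in tool_calls:
        try:
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError as exc:
            result = {"error": f"Arguments were not valid JSON: {exc.msg}"}
        else:
            if isinstance(args, dict):
                result = run_tool(call.function.name, args)
            else:
                result = {"error": "Arguments must be a JSON object"}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


# Stand-ins for SDK objects: one well-formed call and one with truncated JSON.
fake_calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="search_flights", arguments='{"origin": "JFK"}'),
    ),
    SimpleNamespace(
        id="call_2",
        function=SimpleNamespace(name="search_flights", arguments='{"origin": '),
    ),
]


def echo_tool(name: str, args: dict) -> dict:
    # Placeholder for validation plus execution.
    return {"tool": name, "args": args}


tool_messages = build_tool_messages(fake_calls, echo_tool)
for msg in tool_messages:
    print(msg["tool_call_id"], msg["content"])
```

Because a failed call still gets a message, the model sees which specific call went wrong instead of losing the whole turn.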

Use Prompt Engineering to Set Expectations Early

Your system prompt is another lever. Explicitly telling the model how to handle tool arguments can reduce first-attempt failures. You don't need a lengthy paragraph; a few targeted instructions work fine.

You have access to a set of tools. When calling a tool:
- Always check that required arguments are present before making the call.
- Use the exact formats specified in each tool's parameter descriptions.
- If you are unsure of a value (e.g., the user hasn't provided a date), ask the user before calling the tool.
- Never invent or guess argument values that were not supplied or clearly implied by the conversation.

The instruction "ask the user before calling the tool" is particularly important for agents that run in interactive sessions. Many bad tool calls happen because the model tries to fill in a missing argument by guessing rather than surfacing the gap to the user.

Handle Partial and Null Arguments Gracefully

Sometimes the model calls a tool correctly but with a null or undefined value for an optional field. Your function should have sane defaults and explicit null-handling rather than throwing on unexpected input shapes.

Consider this pattern in Python:

def search_flights(origin: str, destination: str, date: str, cabin_class: str = "economy"):
    # Normalize inputs defensively; `or` guards against an explicit null (None)
    origin = (origin or "").strip().upper()
    destination = (destination or "").strip().upper()
    date = (date or "").strip()
    # Lowercase before the membership check so "Business" is not silently downgraded
    cabin_class = (cabin_class or "economy").strip().lower()
    if cabin_class not in ("economy", "business", "first"):
        cabin_class = "economy"

    if not origin or not destination or not date:
        raise ValueError("origin, destination, and date are all required")

    # ... rest of the function

Defensive normalization at the function boundary means a model that passes "Economy" instead of "economy" doesn't cause a downstream failure. Save the strict validation for truly required fields and apply soft normalization to everything else.
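A related trick is to strip null-valued keys before calling the function, so Python's own default arguments apply. The sketch below is a minimal illustration; `drop_nulls` and `search_flights_stub` are hypothetical names, and the stub stands in for the real function.

```python
def drop_nulls(args: dict) -> dict:
    """Remove keys whose value is None so function defaults take over."""
    return {key: value for key, value in args.items() if value is not None}


def search_flights_stub(origin, destination, date, cabin_class="economy"):
    # Stand-in for the real function; it just echoes what it received.
    return {
        "origin": origin,
        "destination": destination,
        "date": date,
        "cabin_class": cabin_class,
    }


model_args = {
    "origin": "JFK",
    "destination": "LHR",
    "date": "2025-09-15",
    "cabin_class": None,  # the model sent an explicit null for an optional field
}

print(search_flights_stub(**drop_nulls(model_args)))
```

Note that dropping a null required argument turns it into a missing argument, so required fields should still be checked by your validation layer first.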

Common Pitfalls to Watch For

Overloaded function names

If you have get_user and get_user_profile as separate tools, the model will sometimes call the wrong one. Rename them to be unambiguous: get_user_basic_info vs get_user_full_profile. Function names are part of the model's input.

Too many tools at once

Passing thirty tools to a single agent call dilutes the model's attention and increases the chance it calls the wrong function or constructs arguments from the wrong schema. If your agent has many capabilities, consider routing between specialized sub-agents that each hold a smaller, focused tool set.
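A lightweight version of this idea is to filter the tool list per request rather than build full sub-agents. The grouping below is hypothetical (`cancel_order` is an invented tool name), and how you pick the group, whether by a classifier, a keyword match, or a first model call, is left open.

```python
# Hypothetical grouping of tool names by domain; adjust to your own tools.
TOOL_GROUPS = {
    "orders": {"get_order_status", "cancel_order"},
    "flights": {"search_flights"},
}


def select_tools(all_tools: list[dict], group: str) -> list[dict]:
    """Return only the tool definitions that belong to `group`."""
    allowed = TOOL_GROUPS.get(group, set())
    return [tool for tool in all_tools if tool["name"] in allowed]


all_tools = [
    {"name": "get_order_status", "parameters": {"type": "object", "properties": {}}},
    {"name": "cancel_order", "parameters": {"type": "object", "properties": {}}},
    {"name": "search_flights", "parameters": {"type": "object", "properties": {}}},
]

flight_tools = select_tools(all_tools, "flights")
print([tool["name"] for tool in flight_tools])  # ['search_flights']
```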

Nested object schemas without examples

Deeply nested parameter schemas are error-prone. If your function accepts a nested object like filters.date_range.start, include a full example in the schema description. Better yet, flatten the schema if you can — fewer nesting levels mean fewer opportunities for the model to lose the thread.

Schema drift

You update your function's signature but forget to update the tool schema you send to the model. The model keeps generating arguments that match the old schema, which your function no longer accepts. Add a test that asserts your Pydantic models and your tool schemas stay in sync.
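One way to sketch such a check with only the standard library is to compare a function's signature against its tool schema. The helper name `schema_signature_mismatches` is illustrative, and the sketch assumes plain named parameters (no *args or **kwargs) and a schema shaped like the examples earlier in this article.

```python
import inspect


def schema_signature_mismatches(func, tool: dict) -> list[str]:
    """List differences between a function signature and its tool schema."""
    params = inspect.signature(func).parameters
    properties = set(tool["parameters"].get("properties", {}))
    required = set(tool["parameters"].get("required", []))

    # Parameters without a default must be supplied by the caller.
    func_required = {
        name for name, p in params.items() if p.default is inspect.Parameter.empty
    }

    problems = []
    for name in sorted(properties - set(params)):
        problems.append(f"schema has '{name}' but the function does not accept it")
    for name in sorted(func_required - required):
        problems.append(f"function requires '{name}' but the schema does not")
    return problems


def search_flights(origin, destination, date, cabin_class="economy"):
    ...


# A schema that drifted: the date parameter was renamed on one side only.
stale_tool = {
    "name": "search_flights",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "departure_date": {"type": "string"},
        },
        "required": ["origin", "destination", "departure_date"],
    },
}

for problem in schema_signature_mismatches(search_flights, stale_tool):
    print(problem)
```

In a test suite, asserting that the returned list is empty for every registered tool turns drift into a failing build rather than a runtime surprise.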

Silent failures

Some agent frameworks catch tool errors quietly and return an empty result. The model then tries to continue with no data. Always surface tool errors explicitly in the message history so the model knows something went wrong and can respond accordingly.
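A thin wrapper around tool execution is one way to guarantee that. The sketch below is a minimal illustration, assuming tool results are JSON-serializable; `run_tool_visibly` and the status labels are invented for this example.

```python
import json


def run_tool_visibly(func, args: dict) -> str:
    """Run a tool and always return a JSON string the model can read."""
    try:
        result = func(**args)
    except Exception as exc:  # broad on purpose: no failure should vanish silently
        return json.dumps({
            "status": "error",
            "error_type": type(exc).__name__,
            "message": str(exc),
        })
    # Distinguish "ran fine but found nothing" from a real result.
    if result is None or result == [] or result == {}:
        return json.dumps({"status": "empty", "message": "Tool ran but returned no data."})
    return json.dumps({"status": "ok", "result": result})


def failing_tool(order_id):
    raise ValueError(f"No order with id {order_id}")


def empty_tool(order_id):
    return []


print(run_tool_visibly(failing_tool, {"order_id": 42}))
print(run_tool_visibly(empty_tool, {"order_id": 42}))
```

The returned string can be used directly as the content of the tool message, so the model can tell an error apart from an empty result and from real data.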

Wrapping Up

Tool call failures are rarely mysterious once you trace them back to their source. Most come from underspecified schemas, missing validation, or a model trying to fill in a gap it should have surfaced to the user instead.

Here are concrete actions you can take right now:

Audit your existing tool schemas — add description fields and format examples to every parameter that doesn't have them.
Add a Pydantic validation layer between the raw model output and your function calls, and return structured errors instead of exceptions.
Implement a bounded retry loop (the example above allows two retries after the first attempt) that feeds the validation error message back to the model as a tool response.
Update your system prompt to explicitly instruct the model to ask the user when a required argument is missing rather than guessing.
Write a test that verifies your tool schemas match your function signatures, so drift doesn't silently reintroduce failures.

Start with the schema improvements — they tend to eliminate the majority of failures before any other code changes are needed.
