Fixing AWS API Gateway Timeout Mismatches That Break Long-Running Requests
Your Lambda function completes successfully. Your backend logs show a 200. But the client gets a 504 Gateway Timeout every time a request runs longer than 29 seconds. The mismatch between API Gateway's hard integration timeout and your backend's actual execution time is one of the most frustrating silent failures in AWS.
This article walks you through exactly why this happens, how to diagnose which layer is responsible, and the concrete patterns you can use to stop it from breaking your users.
What you'll learn
- How API Gateway's integration timeout differs from Lambda's function timeout
- Why 504 errors appear even when your backend succeeds
- How to diagnose which timeout layer is triggering the error
- Async invocation patterns that sidestep the 29-second wall
- How to configure timeouts correctly at each layer
Prerequisites
You should be comfortable with the AWS console and have basic familiarity with Lambda functions and API Gateway REST or HTTP APIs. Code examples use Python and bash. You'll need the AWS CLI configured with appropriate permissions.
Understanding the Timeout Stack
AWS API Gateway caps the integration timeout at 29 seconds for REST APIs and 30 seconds for HTTP APIs. Treat this as a hard limit: no stage or method setting raises it. (Regional and private REST APIs can request a higher value through a service quota increase, usually in exchange for a lower account-level throttle quota; edge-optimized REST APIs and HTTP APIs cannot.) The limit exists because API Gateway is designed for synchronous request-response patterns, not long-running work.
Lambda, on the other hand, supports execution timeouts up to 15 minutes. This creates a natural mismatch: you can configure your Lambda to run for 5 minutes, but API Gateway will cut the connection at 29 seconds and return a 504 to the client, even if Lambda eventually succeeds.
The same problem hits HTTP integrations. If your API Gateway routes to an ALB, ECS container, or any HTTP backend that takes more than 29 seconds to respond, the client sees a 504 regardless of what the backend does.
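To make the stack concrete, here is a minimal Python sketch that works out which layer gives up first. The layer names and timeout values are illustrative assumptions for this example; nothing here is read from AWS.

```python
# Illustrative timeouts in seconds. The layer names and values are assumptions
# for this sketch, not values fetched from AWS.
LAYER_TIMEOUTS = {
    "api_gateway_integration": 29,
    "lambda_function": 300,
    "alb_idle": 60,
}


def first_to_give_up(layers):
    """Return (name, seconds) for the layer with the shortest timeout."""
    name = min(layers, key=layers.get)
    return name, layers[name]


def client_sees_504(backend_seconds, layers=LAYER_TIMEOUTS):
    """True when the backend would outlast the API Gateway integration timeout."""
    return backend_seconds > layers["api_gateway_integration"]


if __name__ == "__main__":
    print(first_to_give_up(LAYER_TIMEOUTS))
    print(client_sees_504(35))
```

Whatever you set on the other layers, the shortest timeout on the synchronous path decides what the client sees.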
Diagnosing Which Layer Is Responsible
Before you fix anything, confirm which timeout is actually firing. The symptoms look similar but the solutions differ.
Check CloudWatch Logs for API Gateway
Enable execution logging on your API Gateway stage. Go to the stage settings, enable CloudWatch Logs, and set the log level to INFO or ERROR. After reproducing the failure, look for entries like this in your log group (for REST APIs, API-Gateway-Execution-Logs_<api-id>/<stage-name>):
Execution failed due to a timeout
Method completed with status: 504
If you see this, API Gateway timed out waiting for your integration. Lambda may have continued running after this point.
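If you export the log events to a file or pull them through the CLI, a short script can count the timeouts for you. This is a rough sketch that searches for the two phrases quoted above; real log lines usually carry extra text around them, and the function name is made up for this example.

```python
import re

# Phrases taken from the log excerpt above. Real log lines usually carry extra
# text around them, which these searches tolerate.
TIMEOUT_RE = re.compile(r"Execution failed due to a timeout")
STATUS_RE = re.compile(r"Method completed with status: (\d{3})")


def summarize_execution_log(lines):
    """Count integration timeouts and final status codes in execution log lines."""
    summary = {"timeouts": 0, "statuses": {}}
    for line in lines:
        if TIMEOUT_RE.search(line):
            summary["timeouts"] += 1
        match = STATUS_RE.search(line)
        if match:
            code = match.group(1)
            summary["statuses"][code] = summary["statuses"].get(code, 0) + 1
    return summary
```

A high timeout count next to a matching number of 504 statuses points at the integration timeout rather than at your backend.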
Check Lambda Duration Metrics
In the Lambda console, open the Monitor tab and look at the Duration metric. If your function's average or maximum duration is close to or exceeding 29 seconds, you've found the mismatch. Cross-reference the timestamps: if Lambda duration is 35 seconds and API Gateway fired a 504 at 29 seconds, that's your problem.
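You can run the same check from a script. The sketch below assumes you pass in a CloudWatch client such as boto3.client('cloudwatch'); the helper names and the one-hour window are choices made for this example.

```python
from datetime import datetime, timedelta, timezone

GATEWAY_LIMIT_MS = 29_000  # API Gateway integration timeout discussed above


def max_duration_ms(cloudwatch, function_name, hours=1):
    """Return the highest Lambda Duration (ms) over the last `hours`, or None."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Duration",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Maximum"],
    )
    points = [p["Maximum"] for p in response.get("Datapoints", [])]
    return max(points) if points else None


def exceeds_gateway_limit(duration_ms, limit_ms=GATEWAY_LIMIT_MS):
    """True when a measured duration would outlast the gateway timeout."""
    return duration_ms is not None and duration_ms >= limit_ms
```

Passing the client in as an argument keeps the function easy to exercise against a stub before you point it at a real account.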
Use X-Ray to Trace the Full Path
Enable AWS X-Ray tracing on both API Gateway and Lambda. X-Ray produces a service map showing exactly where time is spent. You'll see the API Gateway segment end at 29 seconds while the Lambda segment may show it still running. This trace evidence is useful when you're explaining the issue to a team or writing a post-mortem.
aws xray get-service-graph \
  --start-time $(date -d '-1 hour' +%s) \
  --end-time $(date +%s)
The Wrong Fix: Just Raising the Lambda Timeout
The most common mistake is increasing the Lambda function's timeout to match the desired execution time, then being confused when 504s persist. Raising the Lambda timeout does nothing about the API Gateway limit. They are independent settings at different layers of the stack.
Similarly, setting the API Gateway integration timeout to its maximum (29 seconds, which may already be the default) does not give you more headroom. You are already at the ceiling.
Fix 1: Async Invocation with a Job ID Pattern
The cleanest architectural fix is to stop expecting a synchronous response for work that takes more than a few seconds. Instead, your API accepts the request, kicks off the work asynchronously, and immediately returns a job ID. The client polls a status endpoint.
Here's the pattern in practice:
# handler.py: the submission Lambda
import json
import uuid

import boto3

lambda_client = boto3.client('lambda')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('JobStatus')


def submit_job(event, context):
    job_id = str(uuid.uuid4())

    # Write initial status
    table.put_item(Item={
        'job_id': job_id,
        'status': 'PENDING'
    })

    # Invoke the worker Lambda asynchronously (Event invocation type)
    lambda_client.invoke(
        FunctionName='my-long-running-worker',
        InvocationType='Event',  # async: does not wait for a response
        Payload=json.dumps({'job_id': job_id, 'input': event.get('body')})
    )

    return {
        'statusCode': 202,
        'body': json.dumps({'job_id': job_id, 'status': 'PENDING'})
    }

# worker.py: the long-running Lambda
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('JobStatus')


def process_job(event, context):
    job_id = event['job_id']

    # Mark the job as started so pollers can see progress
    table.update_item(
        Key={'job_id': job_id},
        UpdateExpression='SET #s = :s',
        ExpressionAttributeNames={'#s': 'status'},
        ExpressionAttributeValues={':s': 'RUNNING'}
    )

    # ... do the actual long-running work here ...
    # do_heavy_work is a placeholder for your own function; it is not defined here
    result = do_heavy_work(event['input'])

    # Store the result alongside the final status
    table.update_item(
        Key={'job_id': job_id},
        UpdateExpression='SET #s = :s, #r = :r',
        ExpressionAttributeNames={'#s': 'status', '#r': 'result'},
        ExpressionAttributeValues={':s': 'COMPLETE', ':r': result}
    )

The submission endpoint returns HTTP 202 in well under a second. The client then calls a GET /jobs/{job_id} endpoint to check status. This endpoint reads from DynamoDB and returns immediately, with no timeout risk at all.
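The status endpoint itself is not shown above, so here is one way it might look. This is a sketch under a few assumptions: the route is defined as /jobs/{job_id} with Lambda proxy integration, and `table` behaves like the boto3 JobStatus table used by the other handlers. The factory function make_status_handler is a name invented for this example; it exists so the handler can be tried against a stub table.

```python
import json


def make_status_handler(table):
    """Build a Lambda handler for GET /jobs/{job_id} backed by `table`.

    `table` is assumed to behave like a boto3 DynamoDB Table resource, for
    example boto3.resource('dynamodb').Table('JobStatus').
    """
    def get_job_status(event, context):
        # pathParameters is None when the route has no path parameters
        job_id = (event.get("pathParameters") or {}).get("job_id")
        if not job_id:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "job_id is required"})}

        item = table.get_item(Key={"job_id": job_id}).get("Item")
        if item is None:
            return {"statusCode": 404,
                    "body": json.dumps({"error": "job not found"})}

        # default=str keeps DynamoDB Decimal values from breaking json.dumps
        return {"statusCode": 200, "body": json.dumps(item, default=str)}

    return get_job_status
```

In a deployed function you would build the handler once at module load, for example get_job_status = make_status_handler(table), and point the Lambda at that name.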
Fix 2: SQS Decoupling
If you want the submission layer to be even more resilient, replace the direct Lambda-to-Lambda async invocation with an SQS queue. The API Gateway endpoint drops a message onto the queue and returns immediately. A separate Lambda consumes the queue at its own pace.
import json
import uuid

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/my-work-queue'


def submit_via_sqs(event, context):
    job_id = str(uuid.uuid4())

    # API Gateway sends "body": null for empty requests, so guard against None
    body = json.loads(event.get('body') or '{}')

    sqs.send_message(
        QueueUrl=QUEUE_URL,
        # job_id goes last so a client-supplied job_id cannot overwrite ours
        MessageBody=json.dumps({**body, 'job_id': job_id}),
        MessageAttributes={
            'source': {
                'StringValue': 'api-gateway',
                'DataType': 'String'
            }
        }
    )

    return {
        'statusCode': 202,
        'body': json.dumps({'job_id': job_id})
    }

SQS adds a buffer that protects you from traffic spikes and gives you built-in retry logic. If the worker Lambda fails, the message becomes visible again after the visibility timeout and is redelivered, up to the queue's maximum receive count. After that it moves to a dead-letter queue, provided you have configured one.
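On the consuming side, the worker receives messages in batches. The sketch below shows one possible shape for that handler using partial batch responses, so a single bad message does not force the whole batch to be retried. It assumes the event source mapping has ReportBatchItemFailures enabled, and `process` is a hypothetical callable standing in for your own work.

```python
import json


def make_sqs_worker(process):
    """Build an SQS-triggered Lambda handler around `process(message_dict)`."""
    def handler(event, context):
        failures = []
        for record in event.get("Records", []):
            try:
                process(json.loads(record["body"]))
            except Exception:
                # Report only this message as failed so SQS redelivers it alone
                failures.append({"itemIdentifier": record["messageId"]})
        # An empty list tells Lambda that every message in the batch succeeded
        return {"batchItemFailures": failures}

    return handler
```

Without ReportBatchItemFailures on the event source mapping, Lambda ignores this return value, so failures must be raised as exceptions instead.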
Fix 3: WebSockets for Real-Time Progress
When you need the client to receive updates without polling, API Gateway WebSocket APIs are worth considering. Unlike REST/HTTP APIs, WebSocket connections persist, and you can push results back to the client when the work finishes, so the client does not poll and no request has to stay open while the work runs.
The key point: each route integration on a WebSocket API is still subject to the 29-second integration timeout, but the connection itself outlives any single message (up to two hours, with a 10-minute idle timeout). That lets you decouple the message that triggers work from the message that returns results. The Lambda invoked by the $default route can hand the work to a worker (for example through an async invocation) and acknowledge the request immediately; when the worker finishes, it calls the API Gateway Management API to push the result back to the connected client.
import json

import boto3


def push_result_to_client(connection_id, api_endpoint, result):
    apigw = boto3.client(
        'apigatewaymanagementapi',
        endpoint_url=api_endpoint
    )
    apigw.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps({'status': 'complete', 'result': result})
    )

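The function above needs a connection ID and an endpoint URL. Both can usually be derived from the event that the route Lambda receives. This sketch assumes the default execute-api domain; with a custom domain name and base path mapping the callback URL may need to be built differently. The helper name is made up for this example.

```python
def connection_details(event):
    """Extract the connection ID and callback endpoint from a WebSocket event."""
    ctx = event["requestContext"]
    # Callback URL for the Management API: https://{domain}/{stage}
    endpoint = f"https://{ctx['domainName']}/{ctx['stage']}"
    return ctx["connectionId"], endpoint
```

Pass both values along in the payload you hand to the worker so it can call push_result_to_client once the work is done.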
Fix 4: Adjust the Integration Timeout for Shorter Requests
If your request usually completes in, say, 8 seconds but occasionally spikes to 12, and the integration timeout was explicitly set to a lower value (REST APIs default to the 29-second maximum, but a template or an earlier console edit can override it), you can raise it back up to 29 seconds. This is not a workaround for the hard limit, but it handles the case where the timeout was set too low for legitimate workloads.
In the AWS console, go to your API Gateway resource, open the Integration Request, and find the Timeout field. The value is in milliseconds. Set it to 29000 to use the maximum.
Using the AWS CLI:
aws apigateway update-integration \
  --rest-api-id abc123def \
  --resource-id xyz789 \
  --http-method POST \
  --patch-operations op=replace,path=/timeoutInMillis,value=29000
For REST APIs, the change takes effect only after you redeploy the API to the stage. For HTTP APIs (v2), the integration timeout is configured per integration through the aws apigatewayv2 commands (create-integration, update-integration) with the --timeout-in-millis flag, up to 30000.
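If you manage this from Python rather than the CLI, the same patch can be sent through boto3. This is a sketch, assuming `apigw` is a client created with boto3.client('apigateway'); the function name and the range check (50 to 29,000 ms, the default allowed range for REST APIs) are additions for this example.

```python
MIN_TIMEOUT_MS = 50
MAX_TIMEOUT_MS = 29_000  # default ceiling for REST API integrations


def set_integration_timeout(apigw, rest_api_id, resource_id, http_method, timeout_ms):
    """Patch timeoutInMillis on a REST API integration after a range check."""
    if not MIN_TIMEOUT_MS <= timeout_ms <= MAX_TIMEOUT_MS:
        raise ValueError(
            f"timeout_ms must be between {MIN_TIMEOUT_MS} and {MAX_TIMEOUT_MS}"
        )
    return apigw.update_integration(
        restApiId=rest_api_id,
        resourceId=resource_id,
        httpMethod=http_method,
        patchOperations=[
            # Patch values are always sent as strings
            {"op": "replace", "path": "/timeoutInMillis", "value": str(timeout_ms)}
        ],
    )
```

The range check fails fast on a typo such as 290000 instead of leaving you to decode an API error.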
Common Pitfalls and Gotchas
Lambda continues running after a 504. When API Gateway times out, it returns the 504 to the client, but it does not stop your Lambda function. The function keeps running and consuming compute time (and cost) until it either finishes or hits its own timeout. For synchronous integrations, keep the Lambda timeout only slightly above the API Gateway limit so an abandoned invocation is cut off within seconds rather than minutes, or use the async patterns above.
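A function can also watch its own deadline and stop before it is killed. The sketch below uses the context object's get_remaining_time_in_millis() method; the helper name, the step-based structure, and the 2-second margin are assumptions made for this example.

```python
SAFETY_MARGIN_MS = 2_000  # assumed headroom for cleanup; tune for your workload


def run_with_deadline(steps, context, margin_ms=SAFETY_MARGIN_MS):
    """Run callables in order, stopping early when the Lambda deadline is near.

    Returns (results, finished), where finished is False if the loop stopped
    before running every step.
    """
    results = []
    for step in steps:
        if context.get_remaining_time_in_millis() < margin_ms:
            return results, False
        results.append(step())
    return results, True
```

When finished comes back False, the handler still has time to record a partial status or requeue the remaining steps instead of disappearing mid-write.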
ALB has its own timeout setting. If you're using API Gateway in front of an Application Load Balancer, the ALB has an idle timeout (default 60 seconds) that is separate from the API Gateway integration timeout. You can hit the API Gateway 29-second limit even if the ALB would have waited longer. Conversely, if you bypass API Gateway and go directly to ALB, the ALB timeout becomes the constraint.
Async invocations can silently fail. When you invoke Lambda with InvocationType='Event', failures do not bubble back to the caller. Configure a Lambda destination or a dead-letter queue so you know when the async work fails. Without this, jobs can disappear without a trace.
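One way to wire up a failure destination is through the function's event invoke config. This is a sketch, assuming `lambda_client` is boto3.client('lambda') and `failure_arn` points at an SQS queue, SNS topic, or similar target you have already created; the wrapper function name is invented for this example.

```python
def configure_async_failure_handling(lambda_client, function_name, failure_arn, retries=2):
    """Route failed async invocations of `function_name` to `failure_arn`."""
    if retries not in (0, 1, 2):
        raise ValueError("Lambda allows 0, 1, or 2 async retry attempts")
    return lambda_client.put_function_event_invoke_config(
        FunctionName=function_name,
        MaximumRetryAttempts=retries,
        DestinationConfig={"OnFailure": {"Destination": failure_arn}},
    )
```

The function's execution role also needs permission to write to the destination, otherwise the failure record itself is dropped.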
Don't confuse connection timeout with integration timeout. API Gateway also has a concept of connection timeout for HTTP integrations. These are different values. The integration timeout governs how long API Gateway waits for the backend to respond once connected. Read the docs carefully when using HTTP integrations versus Lambda integrations.
CloudFront in front of API Gateway adds another layer. If you put CloudFront in front of API Gateway, CloudFront has its own origin response timeout (default 30 seconds, configurable up to 60 seconds). Clients hitting your CloudFront distribution may see errors shaped differently than expected. Make sure your CloudFront origin timeout is set appropriately.
Wrapping Up
The 29-second API Gateway integration timeout is a fixed constraint, not a misconfiguration you can tune away. Once you accept that, the right question becomes: how do I restructure this work so it fits?
Here are concrete next steps:
- Enable API Gateway execution logging and X-Ray tracing on your stage so you have visibility into where time is actually going before making changes.
- For any operation that can exceed 10 seconds under load, refactor it to the async job-ID pattern or SQS decoupling described above.
- Set up a Lambda dead-letter queue or destination for every async Lambda function so silent failures surface as alerts instead of missing data.
- Audit your Lambda function timeouts and make sure they are set deliberately, not left at the default 3 seconds or inflated to 15 minutes without thought.
- If you need real-time feedback to the user during long work, prototype the WebSocket push pattern; it eliminates polling and handles the timeout constraint cleanly.