Fixing AWS RDS Connection Pool Exhaustion During Sudden Traffic Spikes
Your app is running fine at normal load, then a flash sale hits or a post goes viral and within seconds every database request starts timing out. The error is always the same: too many connections. Your RDS instance is not overloaded on CPU or memory β it simply ran out of connection slots.
Connection pool exhaustion is one of the most common scaling failures for teams running relational databases on AWS RDS. The good news is it's predictable, diagnosable, and fixable before the next spike arrives.
What You'll Learn
- Why RDS has a hard connection limit and where that number comes from
- How to diagnose exhaustion using CloudWatch and database-level queries
- How to tune your application-side connection pool correctly
- How to deploy RDS Proxy or pgBouncer as a multiplexing layer
- How to set up alerting so you catch the next spike before users do
Prerequisites
This guide assumes you're running PostgreSQL or MySQL on Amazon RDS or Aurora. The concepts apply to both engines, though some queries are engine-specific β I'll call those out. You should have access to CloudWatch and at least read access to your RDS parameter groups.
Why RDS Has a Hard Connection Limit
Every connection to PostgreSQL or MySQL is an OS-level process or thread, and it consumes memory. RDS sets max_connections based on the instance's available RAM, not some arbitrary cap. On a db.t3.medium (2 GB RAM), PostgreSQL's default max_connections is around 170. On a db.r6g.large (16 GB RAM) you get closer to 1000.
The formula PostgreSQL uses internally is approximately:
max_connections β (RAM_in_bytes - shared_buffers) / (work_mem * some_factor)You don't need to memorize this. The point is that scaling up your instance gives you more connection headroom, but it's expensive and it doesn't solve the root problem β your application is opening far more connections than it needs.
Diagnosing the Problem
Check CloudWatch first
The metric you want is DatabaseConnections in the RDS namespace. Open CloudWatch, find your DB instance, and plot this metric over the past 24 hours with a 1-minute resolution. A healthy service shows a steady baseline with gradual rises during peak hours. Exhaustion looks like a near-vertical spike that plateaus exactly at your max_connections value.
Also check CPUUtilization and FreeableMemory alongside it. If CPU is low while connections are maxed, the database itself is not struggling β your application connection management is the problem.
Query the database directly
If you can still get a connection through a bastion or RDS Query Editor, run this on PostgreSQL:
SELECT
client_addr,
state,
COUNT(*) AS connection_count
FROM pg_stat_activity
GROUP BY client_addr, state
ORDER BY connection_count DESC;For MySQL, the equivalent is:
SELECT HOST, COMMAND, COUNT(*) AS cnt
FROM information_schema.PROCESSLIST
GROUP BY HOST, COMMAND
ORDER BY cnt DESC;Look for connections in the idle or Sleep state. A large number of idle connections from your application hosts means your pool is holding connections open without actually using them. That's wasted capacity.
Tuning Your Application-Side Connection Pool
Most frameworks ship with connection pool defaults that were sensible for a single-server app but become dangerous at scale. The two most common misconfigurations are pools that are too large and pools with no idle timeout.
The right pool size formula
A widely-cited starting point from the PostgreSQL community is:
pool_size = (number_of_cores * 2) + effective_spindle_count
For most cloud database workloads the spindle count is effectively 1 (SSD). So a 4-vCPU application server should open no more than 9 connections per process. Multiply that by your number of app instances and you can quickly see how Kubernetes scaling can exhaust your database.
Here's an example using SQLAlchemy in Python:
from sqlalchemy import create_engine
engine = create_engine(
"postgresql+psycopg2://user:pass@your-rds-host/dbname",
pool_size=5, # connections kept open per process
max_overflow=10, # extra connections allowed under burst
pool_timeout=30, # seconds to wait before raising an error
pool_recycle=1800, # recycle connections after 30 minutes
pool_pre_ping=True, # discard stale connections automatically
)
The pool_pre_ping=True setting is important on RDS. When RDS performs maintenance or failover, existing connections go stale. Without pre-ping, your application will try to use a dead connection and throw an error rather than gracefully opening a fresh one.
Node.js with pg
const { Pool } = require('pg');
const pool = new Pool({
host: process.env.DB_HOST,
database: process.env.DB_NAME,
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
port: 5432,
max: 10, // maximum pool size per Node process
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
module.exports = pool;
Notice the explicit idleTimeoutMillis. Without it, connections sit open forever even when your traffic drops off, keeping slots occupied that RDS could use for something else.
Using RDS Proxy as a Multiplexing Layer
Application-side tuning helps, but it still means each application process manages its own pool. When you run dozens of Lambda functions or hundreds of container replicas, the math breaks down fast. This is exactly the problem RDS Proxy was built to solve.
RDS Proxy sits between your application and your RDS instance. It maintains a single warm connection pool to the database and multiplexes thousands of application connections across a much smaller set of real database connections. From your application's perspective, it looks like a normal RDS endpoint.
Setting up RDS Proxy
You can provision an RDS Proxy through the console, CLI, or Terraform. Here's the Terraform resource as a reference:
resource "aws_db_proxy" "app_proxy" {
name = "app-rds-proxy"
debug_logging = false
engine_family = "POSTGRESQL"
idle_client_timeout = 1800
require_tls = true
role_arn = aws_iam_role.rds_proxy_role.arn
vpc_security_group_ids = [aws_security_group.rds_proxy_sg.id]
vpc_subnet_ids = var.private_subnet_ids
auth {
auth_scheme = "SECRETS"
iam_auth = "DISABLED"
secret_arn = aws_secretsmanager_secret.db_credentials.arn
}
}
resource "aws_db_proxy_default_target_group" "app_proxy_tg" {
db_proxy_name = aws_db_proxy.app_proxy.name
connection_pool_config {
max_connections_percent = 90
max_idle_connections_percent = 50
connection_borrow_timeout = 120
}
}
Once deployed, update your application's database host to the proxy endpoint instead of the RDS endpoint. No other code changes are required. RDS Proxy is especially impactful for Lambda functions, where each cold-start previously created a new database connection from scratch.
Self-Hosted Pooling with pgBouncer
If RDS Proxy's cost is a concern (it adds roughly 30β50% to the instance cost depending on size), pgBouncer is a mature open-source alternative you run yourself on an EC2 instance or as a sidecar container.
pgBouncer operates in three modes. Session pooling assigns a server connection for the client's entire session β minimal benefit for connection exhaustion. Transaction pooling assigns a server connection only during a transaction β this is the sweet spot for most web apps and gives you the best multiplexing ratio. Statement pooling is the most aggressive but breaks applications that use multi-statement transactions.
A minimal pgbouncer.ini for transaction pooling looks like this:
[databases]
myapp = host=your-rds-host.amazonaws.com port=5432 dbname=myapp
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
log_connections = 0
log_disconnections = 0
With this setup, pgBouncer accepts up to 1000 client connections but forwards only 25 real connections to RDS. Your application points at the pgBouncer host instead of RDS directly.
Common Pitfalls and Gotchas
Prepared statements and transaction pooling don't mix by default. If your ORM uses named prepared statements, pgBouncer in transaction mode will throw errors because prepared statements are session-scoped and pgBouncer doesn't preserve sessions. Disable server-side prepared statements in your ORM, or use pgBouncer's prepared_statement_cache_queries option (available in newer versions).
RDS Proxy requires Secrets Manager. You cannot pass a plain password to RDS Proxy β credentials must be stored in AWS Secrets Manager. Factor this into your setup time if you're not already using it.
Pool size and max_connections must be planned end-to-end. If you have 20 app servers each with a pool of 20, plus pgBouncer with a pool of 50, and RDS max_connections is 170, you're still going to exhaust connections when pgBouncer talks directly to RDS. The proxy or pooler pool size must be smaller than RDS max_connections minus headroom for admin connections.
Aurora Serverless v2 behaves differently. It scales max_connections with capacity units, but it doesn't scale instantly. During the brief scaling window, you can still hit connection limits. RDS Proxy is especially valuable here as a buffer.
Don't forget to reserve connections for yourself. Always leave at least 3β5 connections below max_connections so you can still connect via psql or a bastion host to diagnose the problem when everything is on fire.
Setting Up Alerting
Fixing the problem reactively is good. Knowing about it before users do is better. Create a CloudWatch alarm on the DatabaseConnections metric:
{
"AlarmName": "RDS-HighConnectionCount",
"MetricName": "DatabaseConnections",
"Namespace": "AWS/RDS",
"Statistic": "Maximum",
"Period": 60,
"EvaluationPeriods": 2,
"Threshold": 140,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"]
}
Set the threshold at about 80% of your max_connections value. That gives you time to respond β scale up, shed load, or route traffic β before the database is fully saturated. If you're using RDS Proxy, also monitor the ClientConnections and DatabaseConnectionsCurrentlyBorrowed proxy metrics.
Next Steps
- Run the
pg_stat_activityorPROCESSLISTquery on your current production database right now and see how many idle connections you actually have open. - Review your ORM or database client configuration and set an explicit
pool_size,max_overflow, and idle timeout if you haven't already. - If you're running Lambda functions or more than 50 application instances, evaluate RDS Proxy β provision it in a staging environment first and measure the connection count reduction.
- Set a CloudWatch alarm at 80% of
max_connectionsso you get notified before the next spike causes user-facing errors. - If cost is a constraint, deploy pgBouncer in transaction pooling mode on a small EC2 instance or as a sidecar, and benchmark it against your current setup under load.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!