Fixing AWS RDS Connection Pool Exhaustion During Sudden Traffic Spikes

May 22, 2026 7 min read 98 views
Diagram of database connection pooling architecture showing client connections being multiplexed through a proxy into a smaller pool of database server connections

Your app is running fine at normal load, then a flash sale hits or a post goes viral and within seconds every database request starts timing out. The error is always the same: too many connections. Your RDS instance is not overloaded on CPU or memory β€” it simply ran out of connection slots.

Connection pool exhaustion is one of the most common scaling failures for teams running relational databases on AWS RDS. The good news is it's predictable, diagnosable, and fixable before the next spike arrives.

What You'll Learn

  • Why RDS has a hard connection limit and where that number comes from
  • How to diagnose exhaustion using CloudWatch and database-level queries
  • How to tune your application-side connection pool correctly
  • How to deploy RDS Proxy or pgBouncer as a multiplexing layer
  • How to set up alerting so you catch the next spike before users do

Prerequisites

This guide assumes you're running PostgreSQL or MySQL on Amazon RDS or Aurora. The concepts apply to both engines, though some queries are engine-specific β€” I'll call those out. You should have access to CloudWatch and at least read access to your RDS parameter groups.

Why RDS Has a Hard Connection Limit

Every connection to PostgreSQL or MySQL is an OS-level process or thread, and it consumes memory. RDS sets max_connections based on the instance's available RAM, not some arbitrary cap. On a db.t3.medium (2 GB RAM), PostgreSQL's default max_connections is around 170. On a db.r6g.large (16 GB RAM) you get closer to 1000.

The formula PostgreSQL uses internally is approximately:

max_connections β‰ˆ (RAM_in_bytes - shared_buffers) / (work_mem * some_factor)

You don't need to memorize this. The point is that scaling up your instance gives you more connection headroom, but it's expensive and it doesn't solve the root problem β€” your application is opening far more connections than it needs.

Diagnosing the Problem

Check CloudWatch first

The metric you want is DatabaseConnections in the RDS namespace. Open CloudWatch, find your DB instance, and plot this metric over the past 24 hours with a 1-minute resolution. A healthy service shows a steady baseline with gradual rises during peak hours. Exhaustion looks like a near-vertical spike that plateaus exactly at your max_connections value.

Also check CPUUtilization and FreeableMemory alongside it. If CPU is low while connections are maxed, the database itself is not struggling β€” your application connection management is the problem.

Query the database directly

If you can still get a connection through a bastion or RDS Query Editor, run this on PostgreSQL:

SELECT
  client_addr,
  state,
  COUNT(*) AS connection_count
FROM pg_stat_activity
GROUP BY client_addr, state
ORDER BY connection_count DESC;

For MySQL, the equivalent is:

SELECT HOST, COMMAND, COUNT(*) AS cnt
FROM information_schema.PROCESSLIST
GROUP BY HOST, COMMAND
ORDER BY cnt DESC;

Look for connections in the idle or Sleep state. A large number of idle connections from your application hosts means your pool is holding connections open without actually using them. That's wasted capacity.

Tuning Your Application-Side Connection Pool

Most frameworks ship with connection pool defaults that were sensible for a single-server app but become dangerous at scale. The two most common misconfigurations are pools that are too large and pools with no idle timeout.

The right pool size formula

A widely-cited starting point from the PostgreSQL community is:

pool_size = (number_of_cores * 2) + effective_spindle_count

For most cloud database workloads the spindle count is effectively 1 (SSD). So a 4-vCPU application server should open no more than 9 connections per process. Multiply that by your number of app instances and you can quickly see how Kubernetes scaling can exhaust your database.

Here's an example using SQLAlchemy in Python:

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@your-rds-host/dbname",
    pool_size=5,          # connections kept open per process
    max_overflow=10,      # extra connections allowed under burst
    pool_timeout=30,      # seconds to wait before raising an error
    pool_recycle=1800,    # recycle connections after 30 minutes
    pool_pre_ping=True,   # discard stale connections automatically
)

The pool_pre_ping=True setting is important on RDS. When RDS performs maintenance or failover, existing connections go stale. Without pre-ping, your application will try to use a dead connection and throw an error rather than gracefully opening a fresh one.

Node.js with pg

const { Pool } = require('pg');

const pool = new Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  port: 5432,
  max: 10,              // maximum pool size per Node process
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});

module.exports = pool;

Notice the explicit idleTimeoutMillis. Without it, connections sit open forever even when your traffic drops off, keeping slots occupied that RDS could use for something else.

Using RDS Proxy as a Multiplexing Layer

Application-side tuning helps, but it still means each application process manages its own pool. When you run dozens of Lambda functions or hundreds of container replicas, the math breaks down fast. This is exactly the problem RDS Proxy was built to solve.

RDS Proxy sits between your application and your RDS instance. It maintains a single warm connection pool to the database and multiplexes thousands of application connections across a much smaller set of real database connections. From your application's perspective, it looks like a normal RDS endpoint.

Setting up RDS Proxy

You can provision an RDS Proxy through the console, CLI, or Terraform. Here's the Terraform resource as a reference:

resource "aws_db_proxy" "app_proxy" {
  name                   = "app-rds-proxy"
  debug_logging          = false
  engine_family          = "POSTGRESQL"
  idle_client_timeout    = 1800
  require_tls            = true
  role_arn               = aws_iam_role.rds_proxy_role.arn
  vpc_security_group_ids = [aws_security_group.rds_proxy_sg.id]
  vpc_subnet_ids         = var.private_subnet_ids

  auth {
    auth_scheme = "SECRETS"
    iam_auth    = "DISABLED"
    secret_arn  = aws_secretsmanager_secret.db_credentials.arn
  }
}

resource "aws_db_proxy_default_target_group" "app_proxy_tg" {
  db_proxy_name = aws_db_proxy.app_proxy.name

  connection_pool_config {
    max_connections_percent      = 90
    max_idle_connections_percent = 50
    connection_borrow_timeout    = 120
  }
}

Once deployed, update your application's database host to the proxy endpoint instead of the RDS endpoint. No other code changes are required. RDS Proxy is especially impactful for Lambda functions, where each cold-start previously created a new database connection from scratch.

Self-Hosted Pooling with pgBouncer

If RDS Proxy's cost is a concern (it adds roughly 30–50% to the instance cost depending on size), pgBouncer is a mature open-source alternative you run yourself on an EC2 instance or as a sidecar container.

pgBouncer operates in three modes. Session pooling assigns a server connection for the client's entire session β€” minimal benefit for connection exhaustion. Transaction pooling assigns a server connection only during a transaction β€” this is the sweet spot for most web apps and gives you the best multiplexing ratio. Statement pooling is the most aggressive but breaks applications that use multi-statement transactions.

A minimal pgbouncer.ini for transaction pooling looks like this:

[databases]
myapp = host=your-rds-host.amazonaws.com port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
log_connections = 0
log_disconnections = 0

With this setup, pgBouncer accepts up to 1000 client connections but forwards only 25 real connections to RDS. Your application points at the pgBouncer host instead of RDS directly.

Common Pitfalls and Gotchas

Prepared statements and transaction pooling don't mix by default. If your ORM uses named prepared statements, pgBouncer in transaction mode will throw errors because prepared statements are session-scoped and pgBouncer doesn't preserve sessions. Disable server-side prepared statements in your ORM, or use pgBouncer's prepared_statement_cache_queries option (available in newer versions).

RDS Proxy requires Secrets Manager. You cannot pass a plain password to RDS Proxy β€” credentials must be stored in AWS Secrets Manager. Factor this into your setup time if you're not already using it.

Pool size and max_connections must be planned end-to-end. If you have 20 app servers each with a pool of 20, plus pgBouncer with a pool of 50, and RDS max_connections is 170, you're still going to exhaust connections when pgBouncer talks directly to RDS. The proxy or pooler pool size must be smaller than RDS max_connections minus headroom for admin connections.

Aurora Serverless v2 behaves differently. It scales max_connections with capacity units, but it doesn't scale instantly. During the brief scaling window, you can still hit connection limits. RDS Proxy is especially valuable here as a buffer.

Don't forget to reserve connections for yourself. Always leave at least 3–5 connections below max_connections so you can still connect via psql or a bastion host to diagnose the problem when everything is on fire.

Setting Up Alerting

Fixing the problem reactively is good. Knowing about it before users do is better. Create a CloudWatch alarm on the DatabaseConnections metric:

{
  "AlarmName": "RDS-HighConnectionCount",
  "MetricName": "DatabaseConnections",
  "Namespace": "AWS/RDS",
  "Statistic": "Maximum",
  "Period": 60,
  "EvaluationPeriods": 2,
  "Threshold": 140,
  "ComparisonOperator": "GreaterThanOrEqualToThreshold",
  "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"]
}

Set the threshold at about 80% of your max_connections value. That gives you time to respond β€” scale up, shed load, or route traffic β€” before the database is fully saturated. If you're using RDS Proxy, also monitor the ClientConnections and DatabaseConnectionsCurrentlyBorrowed proxy metrics.

Next Steps

  • Run the pg_stat_activity or PROCESSLIST query on your current production database right now and see how many idle connections you actually have open.
  • Review your ORM or database client configuration and set an explicit pool_size, max_overflow, and idle timeout if you haven't already.
  • If you're running Lambda functions or more than 50 application instances, evaluate RDS Proxy β€” provision it in a staging environment first and measure the connection count reduction.
  • Set a CloudWatch alarm at 80% of max_connections so you get notified before the next spike causes user-facing errors.
  • If cost is a constraint, deploy pgBouncer in transaction pooling mode on a small EC2 instance or as a sidecar, and benchmark it against your current setup under load.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.