Fixing DigitalOcean Managed Postgres Failover Gaps That Drop Connections
Your DigitalOcean Managed Postgres cluster just failed over to the standby node, and now your app is throwing connection refused errors for anywhere from a few seconds to over a minute. DigitalOcean handles the promotion automatically, but nothing tells your app the primary has changed β at least not fast enough. This is a well-known gap, and the fix requires changes on your side.
The good news: the problem is solvable with a combination of connection pooling configuration, retry logic, and a few DNS-aware settings. You don't need to rearchitect anything.
What you'll learn
- What actually happens during a Managed Postgres failover and why connections drop
- How to configure PgBouncer (DigitalOcean's built-in pooler) to reduce the blast radius
- How to write retry logic that handles transient failover errors without corrupting state
- Which PostgreSQL client settings matter most for reconnection speed
- How to test your failover behavior before it happens in production
Prerequisites
This guide assumes you have a DigitalOcean Managed Postgres cluster (any recent version) with at least one standby node enabled. Examples use Python with psycopg2 and SQLAlchemy, but the concepts apply to any language. You should be comfortable reading connection strings and basic database configuration.
What Happens During a Failover
When DigitalOcean detects that the primary node is unhealthy β due to hardware failure, a maintenance event, or a manual trigger β it promotes the standby to primary. This process involves a brief period where neither node is accepting writes.
The cluster's hostname (the one in your connection string) is a DNS CNAME that eventually resolves to the new primary's IP. The operative word is eventually. DNS propagation, even within DigitalOcean's internal network, takes time. Meanwhile, any existing TCP connections to the old primary are dead, and new connections may get refused or time out while DNS catches up.
The practical result: your app holds stale connections in its pool, new connection attempts fail, and queries start throwing errors. The window is typically short β often under 30 seconds for a healthy standby β but a misconfigured app will stay broken long after the cluster is healthy again.
Why PgBouncer Alone Doesn't Save You
DigitalOcean's Managed Postgres includes a built-in PgBouncer connection pooler, accessible on port 25061. A lot of teams assume this insulates them from failover entirely. It doesn't β not by default.
PgBouncer sits in front of the database and manages a pool of server-side connections. During a failover, PgBouncer's own connections to the old primary break. PgBouncer will try to reconnect, but if your client-side code doesn't handle the errors that bubble up during that reconnection window, you still get failures.
That said, using PgBouncer in transaction pooling mode does reduce the number of open server connections, which speeds up the reconnection phase after failover. You should use it β just not as your only defense.
Configuring Your Connection String Correctly
DigitalOcean gives you two hostnames: one for the primary and one that includes the pooler. Always use the pooler endpoint for application traffic. It looks like this:
postgresql://user:password@db-postgres-cluster-do-user-xxxx-0.b.db.ondigitalocean.com:25061/defaultdb?sslmode=requireBeyond that, two connection string parameters matter most for failover recovery: connect_timeout and tcp_keepalives_idle.
import psycopg2
conn = psycopg2.connect(
host="your-pooler-host.db.ondigitalocean.com",
port=25061,
dbname="defaultdb",
user="doadmin",
password="your-password",
sslmode="require",
connect_timeout=10, # Don't wait forever on a dead host
keepalives=1,
keepalives_idle=30, # Detect dead connections after 30s idle
keepalives_interval=5,
keepalives_count=5
)Without keepalives, a TCP connection to a dead node can sit silently broken in your pool for minutes. The OS-level keepalive settings tell the kernel to probe the connection and mark it as dead if the remote end doesn't respond.
Writing Retry Logic That Actually Works
Retry logic for database connections needs to distinguish between transient errors (failover in progress) and permanent errors (wrong password, schema issue). Retrying a syntax error forever is not useful.
During a Postgres failover, you'll see specific error codes. The most common ones are 08006 (connection failure), 08001 (unable to connect), and 57P01 (admin shutdown, which is what happens when the old primary is demoted). These are the ones worth retrying.
import time
import psycopg2
from psycopg2 import OperationalError
RETRYABLE_PGCODES = {"08006", "08001", "08004", "57P01"}
MAX_RETRIES = 5
BASE_DELAY = 1.0 # seconds
def execute_with_retry(conn_factory, query, params=None):
last_error = None
for attempt in range(MAX_RETRIES):
try:
conn = conn_factory()
with conn.cursor() as cur:
cur.execute(query, params)
conn.commit()
return cur.fetchall()
except OperationalError as e:
pgcode = getattr(e, 'pgcode', None)
if pgcode not in RETRYABLE_PGCODES:
raise # Not a failover error, don't retry
last_error = e
delay = BASE_DELAY * (2 ** attempt) # Exponential backoff
print(f"Transient DB error (attempt {attempt + 1}), retrying in {delay}s...")
time.sleep(delay)
finally:
try:
conn.close()
except Exception:
pass
raise last_errorThe exponential backoff gives the cluster time to complete the promotion and for DNS to propagate. Starting at one second and doubling gives you attempts at 1s, 2s, 4s, 8s, and 16s β which covers nearly any realistic failover window.
SQLAlchemy Pool Configuration for Failover Resilience
If you're using SQLAlchemy, the engine's connection pool needs to know how to handle stale connections. The default pool settings are too optimistic for a failover scenario.
from sqlalchemy import create_engine, event
from sqlalchemy.pool import QueuePool
engine = create_engine(
"postgresql+psycopg2://doadmin:password@your-pooler-host:25061/defaultdb",
poolclass=QueuePool,
pool_size=10,
max_overflow=5,
pool_timeout=30,
pool_recycle=300, # Recycle connections every 5 minutes
pool_pre_ping=True, # Check connection health before using it
connect_args={
"sslmode": "require",
"connect_timeout": 10,
"keepalives": 1,
"keepalives_idle": 30,
"keepalives_interval": 5,
"keepalives_count": 5,
}
)pool_pre_ping=True is the most important setting here. Before SQLAlchemy hands a connection from the pool to your code, it runs a lightweight SELECT 1 check. If the connection is dead (because the primary failed over), it discards it and opens a fresh one. This alone eliminates a large category of failover errors.
pool_recycle=300 forces connections to be replaced every five minutes. This prevents the pool from holding onto connections that may have drifted stale for other reasons.
Common Pitfalls
Caching the resolved IP address
Some database drivers or HTTP clients cache DNS lookups and don't re-resolve the hostname when reconnecting. After a failover, the cached IP still points to the old primary. Make sure your application's DNS TTL settings respect DigitalOcean's TTL on the cluster hostname, which is typically low (around 30β60 seconds). In Java-based apps, the JVM DNS cache is a common offender β set networkaddress.cache.ttl=30 in your JVM security properties.
Long-running transactions at failover time
Any transaction that was in-flight when the primary went down is lost. There's no way around this β the data never made it to the standby. Your retry logic needs to be transaction-aware: if a transaction fails mid-way, roll back the entire logical operation before retrying. Partial retries on top of partial writes cause data corruption.
Read-only replica connections during failover
If you're using the standby for read traffic, be aware that during failover the standby becomes the new primary and stops accepting read-only replica connections until a new standby is provisioned. Applications that hard-code the standby endpoint for reads will fail during and after the failover window. Route all read traffic through the primary endpoint (or the pooler) and let your application decide what to do with eventual consistency.
PgBouncer session mode with failover
If you're using PgBouncer in session pooling mode (rather than transaction mode), a client holds a server connection for its entire session. This means all the stale connections from before the failover are still tied up, and new clients queue behind them. Transaction pooling mode releases the server connection after each transaction, which dramatically speeds up the pool's ability to drain stale connections and acquire fresh ones after a failover.
Testing Your Failover Handling
Don't wait for an unplanned outage to find out your retry logic doesn't work. DigitalOcean lets you trigger a manual failover from the control panel under your cluster's Overview tab. Use this in a staging environment with your application running under load.
A basic test looks like this:
- Start a background process that continuously runs a simple write query in a loop, logging errors and retry attempts.
- Trigger the manual failover from the DigitalOcean control panel.
- Watch your logs. You should see a burst of retry log lines, followed by successful queries resuming.
- Verify no unhandled exceptions escaped your retry wrapper.
- Check your database for data integrity β no partial writes, no duplicates from over-eager retries.
If your app crashes instead of retrying, or stays broken after the cluster recovers, you have a configuration problem to fix before the same thing happens in production.
Wrapping Up
DigitalOcean Managed Postgres gives you automatic failover, but survivability is your application's responsibility. Here are the concrete steps to take right now:
- Switch to the PgBouncer pooler endpoint (port 25061) if you're not already using it, and confirm it's running in transaction pooling mode.
- Add TCP keepalive settings to your connection configuration so stale connections are detected quickly rather than sitting silently dead in the pool.
- Enable
pool_pre_pingin SQLAlchemy (or the equivalent health check in your ORM/driver) to discard dead connections before they reach your queries. - Write retry logic that targets specific Postgres error codes for transient failures and uses exponential backoff to give the cluster time to stabilize.
- Trigger a manual failover in staging and watch your application's behavior under real conditions before you trust it in production.
The failover gap itself is a few dozen seconds at most. With the right configuration, your application should handle that window transparently.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!