How I Run Production Postgres at Scale
Eight years of running Postgres for fintech. Here's the operational playbook.
Postgres at low scale is delightful. Postgres at scale requires real operational discipline. I've been running it in production since 2018: a Tier-1 bank, fintechs, and now my own clients. Some installations have sustained several thousand TPS; others have crossed 10 TB. This is the operational playbook I use, distilled.
Sizing the instance
Postgres is undersized out of the box. Stock defaults: max_connections is 100, shared_buffers is 128 MB, work_mem is 4 MB. RDS raises shared_buffers to roughly 25% of RAM for you, but for real workloads the rest still needs tuning.
My starting configuration on a r6i.2xlarge (8 vCPU, 64 GB):
shared_buffers = 16GB            # 25% of RAM
effective_cache_size = 48GB      # 75% of RAM
work_mem = 32MB
maintenance_work_mem = 2GB
max_connections = 200            # with PgBouncer in front
random_page_cost = 1.1           # because EBS is fast
These aren't sacred. Tune from observed workload. But the defaults will leave you at 30% of capacity.
Connection pooling, always
If your app connects directly to Postgres, you have a bug. Postgres connections are expensive (each is a process, with its own memory). At 500 concurrent connections, your database spends more time scheduling than querying.
PgBouncer in transaction-pooling mode in front of Postgres. App connects to PgBouncer. PgBouncer multiplexes onto a small pool of real Postgres connections. This is non-negotiable for me.
The gotcha: transaction pooling breaks prepared statements (unless you're on PgBouncer 1.21+ with max_prepared_statements) and session-level state such as plain SET, session advisory locks, and LISTEN/NOTIFY. SET LOCAL is transaction-scoped and survives. Know what your application assumes.
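A minimal pgbouncer.ini sketch of this setup. Host, database name, and pool sizes are placeholders; tune them to your workload:

```ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; multiplex many app connections onto few server ones
default_pool_size = 20       ; real Postgres connections per user/database pair
max_client_conn = 1000       ; app-side connections PgBouncer will accept
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
```

The app's connection string points at port 6432; Postgres itself only ever sees the small server-side pool.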
Monitor the right things
Most teams monitor CPU and memory. The metrics that actually predict an outage:
- pg_stat_replication lag: replicas falling behind the primary is the early warning of every disaster.
- Long-running queries in pg_stat_activity: any query running over 30 seconds in OLTP is a problem.
- Bloat ratios from pg_stat_user_tables: if dead tuples exceed 20% of live tuples, vacuum is losing.
- Lock waits: a transaction holding locks for more than a few seconds will cause cascading slowness.
- WAL generation rate: a sudden spike means you have a runaway write workload.
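The first three can be pulled with plain SQL against the statistics views; a sketch, with the thresholds from the list above hard-coded for illustration:

```sql
-- Replication lag per replica (run on the primary, Postgres 10+)
SELECT application_name, replay_lag
FROM pg_stat_replication;

-- Queries active for more than 30 seconds
SELECT pid, now() - query_start AS duration, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds';

-- Tables where dead tuples exceed 20% of live tuples
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 0.2 * n_live_tup
ORDER BY n_dead_tup DESC;
```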
I've built a Grafana dashboard with all of these. It's the dashboard I open first during any incident.
Vacuum is not optional
Postgres MVCC means dead tuples accumulate. Autovacuum runs by default but its defaults are conservative. On a write-heavy table, autovacuum will fall behind, dead tuples will accumulate, queries will slow down silently.
Tune autovacuum aggressively on hot tables:
autovacuum_vacuum_scale_factor = 0.05    # default 0.2
autovacuum_analyze_scale_factor = 0.025  # default 0.1
autovacuum_vacuum_cost_limit = 2000      # default 200
If you have a table that's hot and gets bloated anyway, schedule a manual VACUUM (VERBOSE, ANALYZE) during off-peak hours. Yes, it's old-school. Yes, it works.
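The scale factors can also be set per table, which keeps global settings conservative while tightening only the hot spots. A sketch, where "payments" is a placeholder table name:

```sql
-- Tighter autovacuum thresholds for one write-heavy table only
ALTER TABLE payments SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_analyze_scale_factor = 0.025
);

-- Off-peak manual pass for a table that bloats anyway
VACUUM (VERBOSE, ANALYZE) payments;
```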
Indexes are infrastructure, not features
Every index has a write cost. Every missing index has a read cost. The real skill is knowing which cost your workload can afford.
Tools I use weekly:
- pg_stat_user_indexes to find unused indexes (drop them).
- auto_explain extension to log slow query plans.
- pganalyze for ongoing query insights, when budget allows.
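A sketch of the unused-index query I run from that first view. Note that idx_scan counts only since the last stats reset, and unique indexes enforce constraints even when never scanned, so they're excluded:

```sql
-- Candidate indexes to drop: never scanned, largest first
SELECT s.schemaname, s.relname, s.indexrelname,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;
```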
The single most common antipattern I see: creating an index for every WHERE clause without checking whether the optimizer would use it. Postgres is smart enough that small tables don't need indexes. Bigger tables benefit from a small set of correct indexes, not a long list of duplicates.
Backups are not backups until they're tested
A backup you haven't restored is not a backup. It's a guess.
I run a monthly drill: pick a backup at random, restore it to a fresh instance, run a sanity check. If anything fails, the backup process is broken. This drill has caught issues for me three times across different clients.
Failover discipline
Use RDS Multi-AZ or Patroni or whatever you need, but the question that matters is: have you tested failover with production traffic in the last six months?
If not, you don't have failover. You have a hope.
Cost discipline
Postgres on RDS at scale is expensive. Tactics that have saved my clients real money:
- Right-size storage with pg_repack instead of growing the instance.
- Move cold data to S3 with foreign data wrappers.
- Use read replicas for analytics, not the primary.
- Set up partitioning on big append-only tables (events, logs).
I helped one client cut their database bill from $14K/month to $4K/month with three of these tactics, no functionality lost.
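Declarative partitioning on an append-only table is the least invasive of these tactics. A minimal sketch; table and column names are illustrative:

```sql
-- Append-only events table, partitioned by month
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; old months can be detached
-- and archived off the expensive storage
CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
```

Dropping or detaching a partition is a metadata operation, so retention becomes instant instead of a week-long DELETE-and-vacuum cycle.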
Things I do not do
- I don't use Postgres for analytics workloads beyond a few thousand rows. Use ClickHouse for that.
- I don't use logical replication for HA. Use streaming replication.
- I don't store BLOBs in Postgres. Put them in S3 with a reference in the row.
The sharper insight
Most database outages are not caused by load. They're caused by configuration drift over months that nobody noticed until something tipped. Vacuum thresholds creeping up, replication lag normalizing at higher and higher values, slow queries accumulating in the background. Run a quarterly database review. Look for slow drift, not for spikes.
If you'd like a Postgres operations review for your own production, I do that as part of an architecture audit.