
How I Run Production Postgres at Scale

Eight years of running Postgres for fintech. Here's the operational playbook.

Sri Vardhan · Aug 19, 2025 · 12 min read

Postgres at low scale is delightful. Postgres at scale requires real operational discipline. I've been running it in production since 2018: a Tier-1 bank, fintechs, and now my own clients. Some installations have sustained several thousand TPS; others have crossed 10 TB. This is the operational playbook I use, distilled.

Sizing the instance

Stock Postgres defaults are undersized: max_connections is 100, shared_buffers is 128 MB, work_mem is 4 MB. RDS raises shared_buffers to roughly 25% of RAM, but for real workloads every one of those still needs tuning.

My starting configuration on an r6i.2xlarge (8 vCPU, 64 GB RAM):

  • shared_buffers = 16 GB (25%)
  • effective_cache_size = 48 GB (75%)
  • work_mem = 32 MB
  • maintenance_work_mem = 2 GB
  • max_connections = 200 (with PgBouncer in front)
  • random_page_cost = 1.1 (because EBS is fast)

These aren't sacred. Tune from observed workload. But the defaults will leave you at 30% of capacity.
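
On a self-managed box those values are a handful of ALTER SYSTEM statements; on RDS, where ALTER SYSTEM isn't allowed, set the same parameters in a DB parameter group:

```sql
-- Persist the starting configuration above (self-managed Postgres).
ALTER SYSTEM SET shared_buffers = '16GB';        -- takes effect after restart
ALTER SYSTEM SET effective_cache_size = '48GB';
ALTER SYSTEM SET work_mem = '32MB';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
ALTER SYSTEM SET max_connections = 200;          -- takes effect after restart
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();  -- applies the reloadable parameters immediately
```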

Connection pooling, always

If your app connects directly to Postgres, you have a bug. Postgres connections are expensive (each is a process, with its own memory). At 500 concurrent connections, your database spends more time scheduling than querying.

PgBouncer in transaction-pooling mode in front of Postgres. App connects to PgBouncer. PgBouncer multiplexes onto a small pool of real Postgres connections. This is non-negotiable for me.

The gotcha: transaction pooling breaks session state, including named prepared statements and session-level SET. (SET LOCAL is transaction-scoped and survives; plain SET does not.) Audit what your application relies on before you switch.
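
Once PgBouncer is live, its admin console tells you whether the pool is sized right. These are standard admin-console commands, run with psql against the special pgbouncer database (typically on port 6432):

```sql
-- psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer
SHOW POOLS;   -- sustained cl_waiting > 0 means clients are queuing for a server connection
SHOW STATS;   -- per-database query/transaction counts and average durations
```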

Monitor the right things

Most teams monitor CPU and memory. The metrics that actually predict outages:

  • pg_stat_replication lag: replicas falling behind primary is the early warning of every disaster.
  • pg_stat_activity long-running queries: any query running over 30 seconds in OLTP is a problem.
  • Bloat ratios from pg_stat_user_tables: if dead tuples exceed 20% of live tuples, vacuum is losing.
  • Lock waits: a transaction holding locks for more than a few seconds is going to cause cascading slowness.
  • WAL generation rate: a sudden spike means you have a runaway write workload.

I've built a Grafana dashboard with all of these. It's the dashboard I open first during any incident.
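
Sketches of three of those checks as plain SQL, using the standard statistics views (PG 10+ names; the thresholds are the ones from the list):

```sql
-- Replication lag in bytes per replica (run on the primary).
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Queries that have been running for more than 30 seconds.
SELECT pid, now() - query_start AS duration, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds';

-- Tables where dead tuples exceed 20% of live tuples.
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 0.2 * n_live_tup
ORDER BY n_dead_tup DESC;
```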

Vacuum is not optional

Postgres MVCC means every UPDATE and DELETE leaves dead tuples behind. Autovacuum runs out of the box, but its thresholds are conservative: on a write-heavy table it falls behind, dead tuples pile up, and queries slow down silently.

Tune autovacuum aggressively on hot tables:

  • autovacuum_vacuum_scale_factor = 0.05 (default 0.2)
  • autovacuum_analyze_scale_factor = 0.025 (default 0.1)
  • autovacuum_vacuum_cost_limit = 2000 (default 200)

If you have a table that's hot and gets bloated anyway, schedule a manual VACUUM (VERBOSE, ANALYZE) during off-peak hours. Yes, it's old-school. Yes, it works.
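
I apply those settings per table rather than globally, which keeps the aggression where it's needed (table name is illustrative):

```sql
-- Aggressive autovacuum on the hot table only.
ALTER TABLE payments SET (
    autovacuum_vacuum_scale_factor  = 0.05,
    autovacuum_analyze_scale_factor = 0.025,
    autovacuum_vacuum_cost_limit    = 2000
);

-- The off-peak manual pass for a hot table that bloats anyway.
VACUUM (VERBOSE, ANALYZE) payments;
```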

Indexes are infrastructure, not features

Every index has a write cost. Every missing index has a read cost. The real skill is knowing which cost you're paying.

Tools I use weekly:

  • pg_stat_user_indexes to find unused indexes and drop them (query after this list).
  • auto_explain extension to log slow query plans.
  • pganalyze for ongoing query insights, when budget allows.
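
The unused-index check from the first bullet, roughly as I run it (a sketch; idx_scan counts since the last stats reset, and never drop an index that backs a unique or primary-key constraint):

```sql
-- Indexes never scanned since the stats were last reset, largest first.
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```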

The single most common antipattern I see: creating an index for every WHERE clause without checking whether the optimizer would use it. Postgres is smart enough that small tables don't need indexes. Bigger tables benefit from a small set of correct indexes, not a long list of duplicates.

Backups are not backups until they're tested

A backup you haven't restored is not a backup. It's a guess.

I run a monthly drill: pick a backup at random, restore it to a fresh instance, run a sanity check. If anything fails, the backup process is broken. This drill has caught issues for me three times across different clients.
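
The sanity check doesn't need to be clever. Mine is a few queries in this spirit (table name and freshness window are illustrative, not from any client's schema):

```sql
-- On the freshly restored instance: is the data complete and recent?
SELECT count(*) FROM payments;                              -- in the expected range?
SELECT now() - max(created_at) AS data_age FROM payments;   -- within the backup's RPO?
```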

Failover discipline

Use RDS Multi-AZ or Patroni or whatever you need, but the question that matters is: have you tested failover with production traffic in the last six months?

If not, you don't have failover. You have a hope.

Cost discipline

Postgres on RDS at scale is expensive. Tactics that have saved my clients real money:

  • Right-size storage with pg_repack instead of growing the instance.
  • Move cold data to S3 with foreign data wrappers.
  • Use read replicas for analytics, not the primary.
  • Set up partitioning on big append-only tables (events, logs).

I helped one client cut their database bill from $14K/month to $4K/month with three of these tactics, no functionality lost.
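
The partitioning tactic in its simplest form, for an append-only events table (schema is illustrative; declarative partitioning needs PG 10+):

```sql
-- Range-partition by month; cold months can be detached and archived
-- instead of bloating the primary.
CREATE TABLE events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_08 PARTITION OF events
    FOR VALUES FROM ('2025-08-01') TO ('2025-09-01');

-- Retiring a cold month is a metadata operation, not a giant DELETE.
ALTER TABLE events DETACH PARTITION events_2025_08;
```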

Things I do not do

  • I don't use Postgres for analytics workloads beyond a few thousand rows. Use ClickHouse for that.
  • I don't use logical replication for HA. Use streaming replication.
  • I don't store BLOBs in Postgres. Put them in S3 with a reference in the row.

The sharper insight

Most database outages are not caused by load. They're caused by configuration drift over months that nobody noticed until something tipped. Vacuum thresholds creeping up, replication lag normalizing at higher and higher values, slow queries accumulating in the background. Run a quarterly database review. Look for slow drift, not for spikes.

If you'd like a Postgres operations review for your own production, I do that as part of an architecture audit.


