Postgres Concepts I Wish I'd Learned Earlier

Why I Wanted to Understand This

Postgres is one of those tools you can use for years while only really understanding the surface — SELECT, INSERT, indexes, maybe a transaction or two. Then you hit production scale and suddenly every conversation involves terms like MVCC, WAL, ACCESS EXCLUSIVE, bloat, prepared statements breaking under PgBouncer, and you realize there's a whole other layer of the database you've been ignoring.

These are my notes on the concepts that kept coming up when I started looking into things like Row-Level Security, connection pooling, zero-downtime migrations, and replication. They're connected — once you understand MVCC, half of the other weird Postgres behavior starts making sense.

MVCC: The Concept That Explains Everything

MVCC stands for Multi-Version Concurrency Control, and it's the single most important Postgres concept to internalize. A lot of "why does Postgres do that?" questions trace back to it.

The core idea: when you UPDATE a row in Postgres, it doesn't actually update it in place. It marks the old row as dead and writes a new row alongside it. When you DELETE a row, it just marks the row as dead — the data is still physically there.

Before UPDATE:
┌─────────────────────────────────┐
│ id=1, name='Alice', age=30 ✓    │
└─────────────────────────────────┘

After UPDATE orders SET age=31 WHERE id=1:
┌─────────────────────────────────┐
│ id=1, name='Alice', age=30 ✗    │  ← marked dead
│ id=1, name='Alice', age=31 ✓    │  ← new version
└─────────────────────────────────┘

This is how Postgres handles concurrency without locks getting in the way of reads. Two transactions can see different "versions" of the same row depending on when they started. A long-running report can read a consistent snapshot of the database while writes continue around it.

The downside: dead rows accumulate. A table with heavy updates can end up with 50% dead rows, wasting disk space and forcing queries to scan more data than they need. This is called bloat.

VACUUM: Cleaning Up After MVCC

Because MVCC leaves dead rows around, you need a process to reclaim them. That's VACUUM.

VACUUM scans a table, finds dead rows, and marks their space as reusable. Postgres runs this automatically via a background process called autovacuum, which usually keeps up fine on its own. You only need to think about it when:

You just did a huge batch operation (like a backfill) and want to clean up immediately.
Autovacuum is falling behind on a very write-heavy table.
You're investigating why a query that should be fast is suddenly slow.

VACUUM FULL is the nuclear option. It rewrites the entire table from scratch to fully reclaim space and defragment, but it locks the table the entire time. You almost never want to use it in production. The tool pg_repack does the same thing without the lock and is what most teams reach for instead.

Locks and Lock Levels

Postgres has 8 lock levels that can be held on a table. Most operations grab one automatically. The ones worth knowing:

Lock Level	Acquired By	Blocks
`ACCESS SHARE`	`SELECT`	Only `ACCESS EXCLUSIVE`
`ROW EXCLUSIVE`	`INSERT`, `UPDATE`, `DELETE`	DDL
`SHARE UPDATE EXCLUSIVE`	`VACUUM`, `CREATE INDEX CONCURRENTLY`	Other DDL
`ACCESS EXCLUSIVE`	Most `ALTER TABLE`	Everything, including reads

The big one to watch is ACCESS EXCLUSIVE. It blocks every single operation on the table — including SELECT. This is why most ALTER TABLE operations are dangerous in production: while the lock is held, even read traffic stops.

The Lock Queue Problem

Even a "fast" ALTER TABLE can take your app down because of how the lock queue works.

Time
 │
 │  Long-running SELECT holds ACCESS SHARE ────────────────┐
 │                                                          │
 │  ALTER TABLE wants ACCESS EXCLUSIVE → waits             │
 │                                                          │
 │  Every new SELECT, INSERT, UPDATE also waits ─────────► │
 │  (queued behind the ALTER TABLE)                         │
 │                                                          │
 ▼

The ALTER TABLE queues behind the slow SELECT. Then every other query queues behind the ALTER TABLE. Your "instant" migration just froze the application because it got stuck in line.

Mitigation: SET lock_timeout = '2s' before running migrations. If the migration can't grab the lock in 2 seconds, it fails immediately instead of wedging the queue. Better to retry than to take everything down.

CREATE INDEX CONCURRENTLY

A normal CREATE INDEX locks the table for writes for as long as it takes to build — which can be hours on large tables.

CREATE INDEX CONCURRENTLY builds the index without blocking writes. It takes longer overall and uses more resources, but the app keeps working during the build.

-- Bad in production: blocks writes for the entire build
CREATE INDEX idx_orders_tenant ON orders (tenant_id);
 
-- Good: writes continue while the index builds
CREATE INDEX CONCURRENTLY idx_orders_tenant ON orders (tenant_id);

In production, you almost always want CONCURRENTLY. The catch: it can't run inside a transaction, and if it fails partway through, you're left with an invalid index that you have to drop and rebuild.

WAL: The Write-Ahead Log

Every change to Postgres is written to a log file before being applied to the actual data files. This log is the WAL (Write-Ahead Log).

Why this matters:

1. Crash recovery. If the server crashes mid-write, on restart Postgres replays the WAL to reach a consistent state. No data loss.

2. Replication. A replica (or "standby") receives the WAL stream from the primary and replays it locally. That's how replicas stay in sync — they're not running the original SQL, they're applying physical changes from the WAL.

3. Backups. Continuous archival of WAL enables point-in-time recovery.

The thing to know about WAL in practice: a huge UPDATE writes a huge amount of WAL. Replicas have to receive and replay all of it. If they can't keep up, replication lag grows, and if your disk fills up with un-archived WAL, writes will eventually stop.

This is why chunked backfills matter. A single UPDATE on 100M rows generates massive WAL in one go. The same operation in 10K-row chunks generates the same total WAL but spreads it out, giving replicas time to catch up.

-- Watch replica lag during backfills
SELECT
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

Replicas, Primaries, and Read Splitting

A primary accepts writes. Replicas receive WAL from the primary and serve read traffic.

The common pattern:

Writes → primary
Hot transactional reads → primary (to avoid lag issues)
Heavy analytical reads → replica (offload from primary)

The trade-off is replication lag. A row written to the primary at 10:00:00.000 might not appear on the replica until 10:00:00.500. So you can't write to the primary and then immediately read from the replica — you'll miss your own write.

For workloads where stale-by-milliseconds is fine (reports, dashboards, search indexing), replicas are a huge win. For "fetch the order I just created," stay on the primary.

Transactions, SET, and SET LOCAL

A transaction groups statements so they succeed or fail together. BEGIN starts one, COMMIT applies it, ROLLBACK throws it away.

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Either both updates land or neither does. No half-states.

SET changes a configuration setting for the current connection (session). For example:

SET statement_timeout = '30s';  -- kill any query > 30s, for this connection

SET LOCAL does the same, but the change reverts when the transaction ends:

BEGIN;
  SET LOCAL statement_timeout = '5s';  -- only affects this transaction
  SELECT ...;
COMMIT;  -- statement_timeout returns to its previous value

This distinction is critical with connection poolers (covered below). With pooling in transaction mode, your "session" isn't really a session — your connection can change between transactions. SET without LOCAL either leaks settings into the next user of the connection or vanishes entirely. SET LOCAL is always safe because it lives and dies with the transaction.

Connection Pooling and the Pinning Problem

Each Postgres connection costs ~10 MB of RAM and has nontrivial setup cost. max_connections is usually 100 by default. With modern app servers each opening their own pool, you blow through that fast.

Connection poolers like PgBouncer, PgCat, and RDS Proxy sit between the app and the database, maintaining a small pool of real connections that get multiplexed across many client connections.

Three pool modes:

SESSION MODE        ─►  Client owns a connection for its entire lifetime
                        (Minimal multiplexing benefit)

TRANSACTION MODE    ─►  Client owns a connection only during a transaction
                        (Sweet spot for most web apps)

STATEMENT MODE      ─►  Connection returned after every statement
                        (Maximum multiplexing, breaks most apps)

The catch with transaction mode: anything that relies on session state breaks. This includes:

SET without LOCAL
LISTEN / NOTIFY
Temp tables
Prepared statements (until protocol-level support was added in PgBouncer 1.21+)
Advisory locks
WITH HOLD cursors

RDS Proxy handles this via pinning — when it sees a session-state operation, it pins your client to a specific backend connection until you disconnect. Safer than breaking, but you lose the multiplexing benefit on that connection.

Row-Level Security (RLS)

RLS pushes access control into the database itself. Instead of every query in your app remembering WHERE tenant_id = ?, the database enforces it automatically via a policy.

-- Enable RLS on the table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
 
-- Force it even for the table owner (commonly forgotten)
ALTER TABLE orders FORCE ROW LEVEL SECURITY;
 
-- Define the policy
CREATE POLICY tenant_isolation ON orders
  FOR ALL
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

After that, every query against orders — from the app, from a raw SQL session, from a reporting tool — is automatically filtered. Forget the WHERE clause and you still only see your tenant's rows.

A few gotchas:

Table owners bypass RLS by default. This is why FORCE ROW LEVEL SECURITY matters. Without it, the user who created the tables (usually your migration user, often the same as your app user) silently ignores all policies. Many RLS setups are broken this way and nobody notices because everything looks correct.

Policies are additive across PERMISSIVE policies (OR logic), restrictive across RESTRICTIVE policies (AND logic). Adding more permissive policies loosens security, not tightens it.

RLS predicates participate in query planning. A simple tenant_id = ? policy with an index works great. A complex policy with subqueries can cause plan regressions. Always EXPLAIN ANALYZE common queries with RLS enabled.

Constraints, and the NOT VALID Trick

A constraint like NOT NULL or CHECK (total > 0) restricts what values a column can hold. Adding one to an existing table normally requires Postgres to scan every row to verify, holding a strong lock the whole time. Painful on large tables.

The pattern for adding constraints safely:

-- Step 1: Add as NOT VALID — fast, brief strong lock,
--         enforces for new/changed rows but skips existing ones
ALTER TABLE orders
  ADD CONSTRAINT positive_total CHECK (total > 0) NOT VALID;
 
-- Step 2: Validate existing rows — slower scan,
--         but uses a weaker lock that doesn't block writes
ALTER TABLE orders VALIDATE CONSTRAINT positive_total;

This works for CHECK constraints and foreign keys in all modern Postgres versions. For NOT NULL it requires Postgres 18+.

EXPLAIN ANALYZE

Every Postgres user should know how to read EXPLAIN ANALYZE output. It runs a query and shows exactly how Postgres executed it: which indexes it used, how many rows it scanned, where time was spent.

EXPLAIN ANALYZE
SELECT * FROM orders WHERE tenant_id = 'abc' AND status = 'pending';

You're looking for:

Sequential scans on large tables (usually means a missing index)
Rows estimated vs. rows returned (large mismatches mean stale statistics — run ANALYZE)
Where the time actually went (sometimes a query is slow because of a join you didn't expect)

This is the single highest-leverage thing to learn for debugging slow queries.

Schemas: Two Meanings

The word "schema" in Postgres means two unrelated things, which confuses everyone:

1. The structure of a table — its columns, types, constraints. "Schema migration" uses this meaning.

2. A namespace that contains tables. acme.orders and globex.orders can be different tables in different schemas in the same database. "Schema per tenant" uses this meaning.

When someone says "schema," figure out which one they mean from context.

Session Variables

Postgres lets you create arbitrary key-value pairs on a connection:

SET app.tenant_id = 'abc-123';
SELECT current_setting('app.tenant_id');  -- returns 'abc-123'

Any name with a dot (app., myapp., whatever) is allowed for custom variables. This is how the RLS pattern above passes tenant context from the application down to the database — the app sets the variable, the RLS policy reads it.

Combined with SET LOCAL inside a transaction, this gives you a safe way to pass per-request context through a connection pool.

max_connections

The setting that caps how many concurrent connections Postgres accepts. Default is 100. Each connection uses ~10 MB of RAM, plus context-switching overhead — you can't just crank this up infinitely. At some point you run out of memory or throughput collapses.

This single limitation is the entire reason connection poolers exist. If you have 20 app servers each wanting a 50-connection pool, you need either:

A database that can handle 1000 connections (expensive, slow), or
A pooler that gives each app server its own logical pool while multiplexing onto a smaller number of real connections (the right answer)

Advisory Locks

Application-defined locks that Postgres holds for you, not tied to any row or table. You pick a number; Postgres tracks who holds the lock on that number.

SELECT pg_advisory_lock(12345);
-- ... do something only one process should do at a time ...
SELECT pg_advisory_unlock(12345);

Useful for cross-server coordination ("only one worker runs this job"). Postgres doesn't care what 12345 means — the application defines its meaning.

They persist on a connection, so they break under transaction-mode pooling unless handled carefully.

Key Takeaways

MVCC is the foundation. Postgres writes new row versions instead of updating in place. Almost every other weird behavior traces back to this.

Bloat is the consequence of MVCC. Dead rows accumulate. VACUUM cleans them up. Autovacuum usually handles it, but heavy-write tables sometimes need help.

Lock levels determine what can run concurrently. ACCESS EXCLUSIVE blocks everything, including reads. Most ALTER TABLE operations take it. Set lock_timeout before migrations so they fail fast instead of queueing up disasters.

WAL is how Postgres survives crashes and feeds replicas. Big batch operations generate huge WAL volume, which is why chunked backfills exist — to keep replicas in sync and disks from filling.

SET LOCAL inside a transaction is the safe pattern for connection pooling. Plain SET leaks between users of a pooled connection.

Connection poolers are mandatory at scale, but transaction mode breaks anything relying on session state. Prepared statements, temp tables, advisory locks, non-LOCAL SET — all need care.

RLS pushes access control into the database. Add FORCE ROW LEVEL SECURITY or your policies won't apply to the table owner — which is a silent footgun.

EXPLAIN ANALYZE is the most important debugging tool. Learn to read it. Most slow queries become obvious once you see the plan.