So you've been using Postgres for a while.

You wrap your writes in BEGIN ... COMMIT, you read the docs about isolation levels, maybe you even bumped one of your services to REPEATABLE READ because someone said it was safer. The app works. Until one Monday morning you notice the orders table is 18GB on disk while SELECT count(*) says there are about two million rows, and that math doesn't add up. Or a long-running export job from analytics is somehow slowing down OLTP writes that aren't even touching the same tables. Or a query that ran in 4ms last week now takes 600ms with no schema change in between.

That's MVCC saying hello.

Postgres doesn't lock rows the way you probably think it does. It doesn't overwrite them either. It writes new versions, marks the old ones as dead, and then quietly relies on a background process called autovacuum to clean up the mess later. Most of the time you don't have to think about any of that. The other 5% of the time, you can't debug your database without thinking about it, and the official docs assume you already know.

Let's fix that.

What A Postgres Transaction Actually Is

A transaction in Postgres is the same thing it is everywhere else: a group of statements that either all commit or all roll back. The skeleton is boring.

SQL
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;

Two updates. Either both happen or neither does. If anything fails between BEGIN and COMMIT, a ROLLBACK puts the world back the way it was. That part isn't surprising.

What's interesting is what Postgres is doing under the hood while that transaction is open. It's not editing the rows in place. It's not even necessarily blocking other transactions from reading them. What it's doing is writing new row versions next to the old ones and tagging each version with metadata that says who created me and who deleted me. Other transactions that are also running at the same time get a private view of the table that may or may not include those new versions, depending on when each transaction started.

That's MVCC: Multi-Version Concurrency Control. It's the reason readers don't block writers in Postgres. It's also the reason your tables grow.

Every Row Has Two Hidden Columns You Don't See

Open up any Postgres table and run this:

SQL
SELECT xmin, xmax, * FROM accounts LIMIT 5;

You'll get two columns you didn't define: xmin and xmax. They're system columns, present on every heap tuple in every table, and they're how Postgres tracks which transactions can see which versions of a row.

  • xmin is the transaction ID that created this row version.
  • xmax is the transaction ID that deleted or replaced it (or 0 if the row is still live).

When you INSERT a row, Postgres writes a tuple with xmin = current_xid and xmax = 0. When you UPDATE a row, Postgres doesn't change the row in place; it writes a new tuple with a fresh xmin, then sets xmax = current_xid on the old tuple. The old tuple stays on disk. It's just marked as superseded. When you DELETE, Postgres sets xmax on the existing tuple and writes nothing new.

So an UPDATE in Postgres is actually an INSERT plus a logical mark-as-dead on the previous version. That single design decision is the reason concurrency in Postgres works the way it does, and the reason vacuum exists.

Three-frame diagram showing how an UPDATE in PostgreSQL MVCC writes a new tuple version while marking the old one dead, then VACUUM reclaims the dead tuple

Snapshots: The Private View Each Transaction Gets

Now the question becomes: when transaction T runs SELECT * FROM accounts, which row versions does T see?

Postgres answers that with a snapshot. A snapshot is a small piece of metadata that says, roughly:

  • The current transaction ID.
  • The list of transaction IDs that were still in flight when this snapshot was taken.
  • The lowest transaction ID that hadn't started yet.

Every time you run a query, Postgres uses your snapshot to ask, for each tuple it touches: was the creator (xmin) committed before my snapshot, and was the deleter (xmax) either zero or not yet committed when my snapshot was taken? If yes, the tuple is visible. If no, it isn't.

That's the whole visibility check, conceptually. Two transaction IDs and a snapshot. The implementation has more corners (subtransactions, prepared transactions, hint bits, the commit log) but the logic above is what's actually deciding what your SELECT returns.

The interesting part is when the snapshot is taken, because that's the part you control with the isolation level.

Isolation Levels Are Just Snapshot Policies

Postgres officially supports three isolation levels: READ COMMITTED, REPEATABLE READ, and SERIALIZABLE. (READ UNCOMMITTED is accepted as a synonym for READ COMMITTED: Postgres doesn't actually do dirty reads, ever.)

The difference between them is how often you take a new snapshot.

READ COMMITTED (the default)

Each statement in your transaction takes a fresh snapshot when it starts.

SQL
BEGIN;

SELECT balance FROM accounts WHERE id = 1;
-- snapshot 1 — sees whatever was committed before this statement

-- some other transaction commits an UPDATE here

SELECT balance FROM accounts WHERE id = 1;
-- snapshot 2 — sees the new value

COMMIT;

Two SELECTs in the same transaction can return different values. That's not a bug, that's the contract. READ COMMITTED means I will never read uncommitted data, but I make no promise that my view is stable across statements.

For most application workloads this is fine. Your web request opens a transaction, does its thing, commits, and moves on. The window where two statements within the same request see different worlds is small and usually harmless.

REPEATABLE READ

The snapshot is taken once, when the first statement of the transaction runs, and reused for every subsequent statement.

SQL
BEGIN;
-- nothing happens yet

SELECT balance FROM accounts WHERE id = 1;
-- snapshot taken now — frozen for the rest of the transaction

-- other transactions can do whatever they want

SELECT balance FROM accounts WHERE id = 1;
-- same snapshot — returns the same value, even if someone committed an UPDATE

COMMIT;

This is what you want for reports, exports, anything that needs a consistent view of the database for more than one query. Postgres' REPEATABLE READ is also strong enough to prevent phantom reads, by the way; that's a quirk where it's stricter than the SQL standard requires.

The catch: if your transaction tries to UPDATE a row that another transaction has already updated since your snapshot was taken, Postgres throws a serialization failure (ERROR: could not serialize access due to concurrent update) and you have to retry the whole transaction. Your code needs a retry loop or it'll surface as a 500.

SERIALIZABLE

Same snapshot rules as REPEATABLE READ, but Postgres additionally tracks read-write dependencies between concurrent transactions and aborts one of them if their combined effect couldn't have been produced by running them one after the other in some order.

This is the strongest isolation. It's also the only one that makes SELECT FOR UPDATE and other locking gymnastics unnecessary in many cases. The price is more aborts under contention, more retries in your application, and a bit more bookkeeping in the database.

In practice most teams default to READ COMMITTED and only reach for stronger levels when a specific workload demands it. There's nothing wrong with that. Just know what each one buys you.

Why Readers Don't Block Writers

Here's the part that surprises people coming from MySQL or SQL Server: in Postgres, a long SELECT doesn't block a concurrent UPDATE on the same row, and a concurrent UPDATE doesn't block the SELECT either.

The reason is that the SELECT is reading an old version of the tuple, the one that was visible at the time its snapshot was taken, while the UPDATE is writing a brand new version next to it. Both transactions are working on different physical tuples even though they're talking about the same logical row. There's nothing to wait on.

That's a huge property. It means your hot read paths don't get stuck behind long-running writes, and your long analytical queries don't make OLTP writes back up. It's also what lets pg_dump run against a live production database without freezing it.

The flip side: writers do block other writers on the same row. If two transactions both UPDATE id = 1, one of them waits until the other commits or rolls back. That's a row lock, and it's the only kind of lock you'll see in normal workloads. Everything else is invisible MVCC bookkeeping.

The Cost: Bloat And Dead Tuples

You've probably guessed the downside by now. If UPDATE writes a new tuple instead of overwriting, and DELETE doesn't actually free anything, the table grows every time you write to it.

Run an experiment:

SQL
CREATE TABLE bloat_demo (id int PRIMARY KEY, value text);
INSERT INTO bloat_demo SELECT i, 'hello' FROM generate_series(1, 100000) AS i;

SELECT pg_size_pretty(pg_relation_size('bloat_demo'));
-- something like 4448 kB

UPDATE bloat_demo SET value = 'updated';

SELECT pg_size_pretty(pg_relation_size('bloat_demo'));
-- now around 8896 kB — same row count, twice the size

The table doubled because every old version is still on disk. Postgres can't reclaim that space until it's sure no remaining transaction can still see those old tuples. That sureness is what VACUUM is for.

The dead tuples that accumulate between vacuums are called bloat. Bloat affects you in three ways:

  1. Disk usage. Obvious. The table file is bigger than the live data.
  2. Cache hit rate. Postgres reads pages of data from disk into memory. If 30% of your pages are dead tuples, you're wasting 30% of your buffer pool on rows nobody can see.
  3. Query speed. A sequential scan over a bloated table reads every dead tuple too; Postgres still has to look at each one to check visibility before skipping it.

Index bloat is its own headache. When a row gets a new version, Postgres usually has to insert a new index entry too, even if the indexed column didn't change. There's a special case called HOT updates (heap-only-tuple) where the new version stays on the same page and no new index entry is needed, but it only applies when no indexed column was modified and there's free space on the page. If you're updating an indexed column or your pages are already full, every index needs a fresh entry, and your index files grow alongside the heap.

VACUUM: The Cleanup Process

VACUUM is the answer to bloat. Its job is to find dead tuples that no transaction can possibly see anymore and mark their disk space as reusable. Note: reusable, not returned to the OS. A regular VACUUM doesn't shrink the file. The space gets refilled by future inserts and updates instead of allocating new pages.

The "no transaction can possibly see them anymore" check is the key. Postgres tracks a value called the xmin horizon: the lowest transaction ID that any currently-running transaction might still need. Anything older than that is unambiguously dead, anything newer might still be visible to someone, and VACUUM is conservative about the boundary.

You can run vacuum manually:

SQL
VACUUM;                    -- vacuum everything in the current database
VACUUM accounts;           -- just one table
VACUUM (VERBOSE) accounts; -- show what got cleaned

But you almost never should. Postgres ships with autovacuum, a background process that wakes up periodically, looks at the per-table dead-tuple counters, and runs vacuum on tables that have crossed a threshold. The default thresholds are tunable per table:

SQL
ALTER TABLE high_churn_table SET (
  autovacuum_vacuum_scale_factor = 0.05,  -- vacuum when 5% of rows are dead
  autovacuum_vacuum_cost_limit = 1000     -- be more aggressive than the default
);

For most tables the defaults are fine. For tables that get hammered with updates (session stores, queue tables, counters) you usually want to lower the scale factor so autovacuum runs more often and bloat doesn't accumulate.

VACUUM FULL Is Not The Same Thing

There's also VACUUM FULL, and the name is misleading. It doesn't just do a more thorough version of VACUUM. It rewrites the entire table from scratch, copying live tuples into a new file and dropping the old one. That returns disk space to the OS, but it also takes an ACCESS EXCLUSIVE lock on the table: no one can read or write to it while it runs.

For a 100GB table, that's a long outage.

If you need to actually reclaim disk space, the modern answer is usually pg_repack or pg_squeeze, extensions that do the same rewrite-the-table dance but in chunks, online, without holding an exclusive lock. Reserve VACUUM FULL for emergency situations on small tables where you can take the downtime.

Long-Running Transactions: The Hidden Foot-Gun

This is where MVCC, vacuum, and concurrency all collide and break each other.

Remember the xmin horizon: the lowest transaction ID anyone might still need. As long as a transaction is open, the horizon can't advance past its snapshot, because that transaction might still want to read a tuple created or deleted at that point. That's true even if the transaction is idle. Even if it's been idle for nine hours because someone left psql open in another tmux pane.

Now imagine that idle transaction is sitting on REPEATABLE READ. It has a fixed snapshot from nine hours ago. Vacuum can't clean up any tuple that became dead in the last nine hours, because that snapshot might still want to see it. Your tables bloat. Your indexes bloat. Queries get slower. Disk fills up. And the only fix is to find and kill the long-running transaction, after which autovacuum catches up and life returns to normal.

SQL
SELECT
  pid,
  now() - xact_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY duration DESC;

That's the query you want bookmarked. Run it any time bloat or vacuum behavior is mysterious. Anything in idle in transaction for more than a few minutes is a red flag. Anything for more than an hour is almost certainly the cause of whatever you're debugging.

The other classic offender is replication slots. If you use logical replication or streaming replication with replication slots, an idle or disconnected subscriber holds back the wal_horizon and can hold back the xmin horizon too; same effect, different source. pg_replication_slots tells you who's lagging.

Timeline showing a long-running idle transaction holding back the xmin horizon, preventing VACUUM from reclaiming dead tuples across a wide range of transaction IDs

Transaction ID Wraparound: The Other Reason Vacuum Matters

Postgres transaction IDs are 32-bit integers. That's about 4 billion of them, and a busy production database can chew through that in months.

When the counter is about to wrap, every existing tuple's xmin would suddenly look "newer" than the new transactions' IDs, and visibility checks would break catastrophically. To prevent that, Postgres reserves part of the xid space and runs a special operation called freezing during vacuum: rows old enough that everyone can see them get their xmin replaced with a special marker that says "this is older than all possible snapshots". Once frozen, a tuple is safe regardless of where the xid counter is.

Autovacuum runs an anti-wraparound vacuum automatically when a table gets close to the danger zone. You don't normally interact with it. The reason it's worth knowing about: anti-wraparound vacuums can't be skipped or canceled, and they run in foreground mode if normal autovacuum hasn't kept up. If you've ever seen a database go briefly unresponsive while a "VACUUM (FREEZE)" runs against your busiest table, that's why. The fix is to tune autovacuum to be aggressive enough that anti-wraparound never has to step in.

SQL
SELECT
  relname,
  age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;

Tables with the highest xid_age are the ones autovacuum hasn't gotten around to freezing recently. If any of them are pushing past 200 million, your autovacuum is falling behind.

Concurrency Patterns You Should Actually Use

MVCC gives you optimistic-by-default concurrency. Most of the time you don't have to think about it. The cases where you do are predictable, and the patterns are well-known.

SELECT FOR UPDATE

If you need to read a row and then update it based on its current value, and you don't want anyone else changing it underneath you, lock it explicitly:

SQL
BEGIN;

SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- the row is now locked for writes until I commit or rollback

UPDATE accounts SET balance = balance - 100 WHERE id = 1;

COMMIT;

FOR UPDATE makes other writers wait. Other readers (without FOR UPDATE) still see the old version through MVCC and don't block. This is the right primitive for things like "decrement an inventory count safely" or "claim the next pending job from a queue table".

SELECT FOR UPDATE SKIP LOCKED

If you're building a job queue inside Postgres, this is the magic you want:

SQL
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

Multiple workers can run that query at the same time. Each one will lock and return a different row. None of them block each other. It turns Postgres into a competent work queue without an extra Redis or RabbitMQ.

Idempotent retries on serialization failures

If you use REPEATABLE READ or SERIALIZABLE, you have to handle serialization_failure and deadlock_detected errors. They're not bugs. They're the database telling you to retry. The pattern is the same in any language:

Python
import psycopg2

def transfer_funds(conn, src, dst, amount):
    for attempt in range(5):
        try:
            with conn.transaction(isolation_level="serializable"):
                cur = conn.cursor()
                cur.execute("SELECT balance FROM accounts WHERE id = %s", (src,))
                balance = cur.fetchone()[0]
                if balance < amount:
                    raise ValueError("insufficient funds")
                cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, src))
                cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, dst))
            return
        except psycopg2.errors.SerializationFailure:
            continue
    raise RuntimeError("transfer failed after 5 retries")

Same idea in Go, Node, Java: wrap the transaction in a retry loop, catch the specific error class, back off briefly between attempts. Don't retry forever, don't retry on errors that aren't retryable, don't bury the failure in logs and pretend it didn't happen.

Things That Catch People Off Guard

A scattered list of MVCC weirdness that I wish I'd known earlier in my Postgres career.

count(*) is slow on large tables, even with an index. There's no row-count to read off a header; Postgres has to visit each tuple and check visibility against your snapshot, because what counts as "a row" depends on who's asking. The pattern of caching counts in an aggregate table or using pg_class.reltuples for an approximation is real and used widely.

SELECT ... FROM x WHERE id = 1 can do a sequential scan of dead tuples too if a recent vacuum hasn't run. Visibility maps and index-only scans help, but only when the visibility map is up to date, which means after a recent vacuum.

UPDATE on an indexed column that didn't change still rewrites the index entry unless HOT applies. Fillfactor settings on tables (fillfactor = 70 or so) leave room on each page for HOT updates and are worth tuning on heavily-updated tables.

Truncating a table with DELETE FROM big_table instead of TRUNCATE writes one dead tuple per row, which then has to be vacuumed. For "delete everything in this table" workloads, TRUNCATE is dramatically cheaper because it doesn't go through MVCC at all; it just unlinks the table file.

Foreign keys take a row lock on the parent row when you insert a child. That used to be FOR UPDATE-strength and caused contention storms; modern Postgres uses FOR KEY SHARE which is gentler, but the locking still exists. If you're seeing weird waits on a parent table during high-write workloads, this is usually why.

pg_stat_activity.wait_event tells you what each backend is currently waiting on: Lock, BufferPin, IO, anything. Combined with pg_locks, it's how you debug "why is this query just sitting there?"

What This Buys You

MVCC is one of those topics where reading the docs once doesn't give you intuition. You need to have the model in your head: every row has hidden version columns, my transaction has a private snapshot, vacuum cleans up versions no one can see anymore. That model has to click before the rest of the database's behavior starts making sense. Once that model clicks, you can read pg_stat_activity, you can guess why a table is bloated, you can spot the long-running transaction that's holding back vacuum, you can pick an isolation level for the right reason instead of cargo-culting one.

Postgres is a database that rewards engineers who understand it. MVCC is the deepest of those rewards. Spend an hour with xmin, xmax, and a couple of psql sessions side by side, and you'll come out the other end able to debug things your team currently treats as black magic.

That's a pretty good return on an hour.