multitenancy in postgres
i stumbled onto nile a few weeks ago and decided to do a deep dive last week. their tagline said something about postgres re-engineered for multi-tenant apps. i didn’t know what that meant exactly, so i did what any reasonable person does, i opened seventeen tabs and didn’t sleep (sorry mom).
this is what came out of that.
0. intro
multitenancy is one of those things where everyone has written about it and nobody has actually said anything. you look it up, see the same 3 diagrams - database per tenant, schema per tenant, shared schema - and then the blog ends.
i built pg_tenant while going through this. it works. it also showed me exactly where extensions run out of road, which is probably the most useful thing it did.
this blog is the post i wanted to find. something with the actual tradeoffs, the prod hiccups, and the database internals. and a deep look at what nile built, specifically pg_karnak, their distributed DDL system, because once you understand what problem it's solving, you realize it's genuinely impressive engineering.
if you work on databases, there's probably something in here you haven't seen before. if you're building a SaaS and you're about to add a tenant_id column to everything, like i have in the past and would've done again a week ago, you should read this.
1. the three dragons of multitenancy
everybody and their dad knows when it comes to multitenancy you have 3 cards you can draw from:
- database per tenant - separate database for each tenant
- schema per tenant - separate schema within one database
- shared schema + RLS - all tenants share tables, row-level security filters access
1.1. database per tenant
the pitch is clean: full isolation, independent backups, move a noisy tenant to bigger hardware without touching anyone else. you can run one pgbouncer instance that talks to all your tenant databases, so it's not like you need 500 pgbouncer processes. but each pool inside pgbouncer only serves one database, which means 500 tenants is 500 pools to configure and monitor. every postgres backend is an OS process (an architectural choice, not a universal law, but it's postgres's choice). the connection math gets ugly. and migrations? you just made ALTER TABLE a distributed systems problem. what happens when it succeeds on 300 databases and fails on the 301st? you now own a rollout system. that's not a postgres problem anymore.
1.2. schema per tenant
search_path per session is genuinely elegant - no WHERE tenant_id anywhere, queries just route themselves. the trap shows up when you use pgbouncer in transaction mode (which is the mode you want at scale for connection density). in transaction mode, the connection returns to the pool after every transaction, but session-level state persists on the underlying postgres connection. you SET search_path = tenant_xyz, your transaction commits, pgbouncer hands that connection to the next client, and they inherit your search_path. their query runs against your schema. silent. no error. the fix is SET LOCAL inside every transaction, which scopes the setting to that transaction only. miss it once and you’re explaining to a customer why they saw someone else’s data. also: past a certain number of schemas, pg_catalog starts feeling the weight (more on that below).
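to make the trap concrete, here's a minimal sketch of the safe pattern (schema and table names made up):

BEGIN;
SET LOCAL search_path = tenant_xyz;   -- reverts automatically at COMMIT or ROLLBACK
SELECT * FROM invoices;               -- resolves to tenant_xyz.invoices
COMMIT;
-- search_path is back to its pre-transaction value here, so the next
-- client that borrows this server connection inherits nothing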
1.3. shared schema
everyone and their dad (my dad actually did lol) ends up here because it’s cheap and simple. one table, one index, a tenant_id column on everything. the problem is that simple doesn’t stay simple. one tenant doing bulk writes creates dead tuple bloat for the entire table. autovacuum fights their mess on your shared dime. a slow tenant poisons query plans for everyone else. and RLS, which is how you enforce isolation here, has a performance footgun that’s completely invisible until your query plan falls apart. that’s the next section.
2. what they don’t tell you about each one
the VACUUM problem in shared schema
here’s something nobody puts in the comparison chart. in a shared schema, when tenant A does a bulk update or delete, they leave dead tuples in the table. autovacuum comes along and cleans it up, but autovacuum works on the whole table, not per tenant. so tenant A’s mess becomes everyone’s problem. the vacuum worker is burning I/O cleaning up rows that belong to one customer while tenant B’s queries are running slower because the visibility map is stale and index-only scans can’t be used.
partitioning by tenant_id turns this from a global problem into a local one. each partition gets its own vacuum, its own bloat, its own index. one tenant thrashing their data doesn’t degrade the physical state of everyone else’s. most blogs recommend partitioning for query performance. that’s true but it’s the smaller benefit.
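a minimal sketch of what that looks like, assuming a made-up todos table (hash partitioning shown; list partitioning works too if tenants are few and known):

CREATE TABLE todos (
    tenant_id uuid NOT NULL,
    id uuid NOT NULL,
    title text,
    PRIMARY KEY (tenant_id, id)
) PARTITION BY HASH (tenant_id);

-- each partition now has its own autovacuum scheduling, its own bloat,
-- its own visibility map:
CREATE TABLE todos_p0 PARTITION OF todos FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE todos_p1 PARTITION OF todos FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE todos_p2 PARTITION OF todos FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE todos_p3 PARTITION OF todos FOR VALUES WITH (MODULUS 4, REMAINDER 3);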
the SET LOCAL trap with pgbouncer
pgbouncer in transaction mode is the right choice at scale, it gives you the highest connection density. the trap is that transaction mode returns the server connection to the pool after every transaction. any session-level state you set gets inherited by the next client that picks up that connection.
so if you do SET app.tenant_id = 'acme' at session level, and pgbouncer recycles the connection, the next tenant’s query runs with acme’s context. no error. no warning. just a quiet data leak.
the fix is SET LOCAL, which scopes the setting to the current transaction only. but there’s a catch: SET LOCAL only works inside an explicit transaction block. if you’re running with autocommit (which many ORMs do by default), SET LOCAL behaves like regular SET and you’re back to the same problem. so you need SET LOCAL and you need to be in a BEGIN/COMMIT block. miss either one in a code path that rarely runs and you won’t find out until something bad happens. this is also exactly why nile built tenant-aware connection pooling into their SDK instead of leaving it to the application. it’s not a convenience feature. it’s the only safe option.
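for the record, the safe shape, assuming an app.tenant_id setting that your RLS policies read (the two forms are equivalent; set_config's third argument, is_local = true, gives the same transaction scoping as SET LOCAL):

BEGIN;
SET LOCAL app.tenant_id = 'acme';   -- dies with the transaction
SELECT * FROM todos;                -- RLS evaluates under acme's context
COMMIT;

BEGIN;
SELECT set_config('app.tenant_id', 'acme', true);   -- true = transaction-local
SELECT * FROM todos;
COMMIT;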
the ways schema-per-tenant breaks at scale that we don't even think of
the thing that kills schema-per-tenant isn’t what you expect. it’s not performance on individual queries. it’s pg_catalog weight. every schema you create adds objects to the system catalogs: tables, indexes, sequences, constraints, all tracked globally. the actual threshold varies wildly depending on how many objects each schema contains. a schema with 50 tables hits catalog overhead much faster than one with 2. some teams report issues at 1,000 schemas, others run fine past 5,000. but at some point, query planning starts getting slower because the planner has more objects to reason about. \d in psql becomes visibly sluggish. autovacuum has more catalog tables to maintain.
then there’s migrations. you write one ALTER TABLE. now you run it across 3000 schemas. in what order? what’s your rollback story if it fails halfway? do you do it in parallel and risk locking? in sequence and accept that it takes hours? you’ve turned a one-line migration into a deployment pipeline. teams that choose schema-per-tenant early usually don’t feel this until they’re too deep to switch easily.
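to feel the problem, here's the naive sequential version as a single DO block, assuming a tenant_% schema naming convention. note the whole block is one transaction, so it accumulates a lock on every tenant's table until commit - the parallel-vs-sequential tradeoff from above, compressed:

DO $$
DECLARE s text;
BEGIN
    FOR s IN SELECT nspname FROM pg_namespace WHERE nspname LIKE 'tenant_%'
    LOOP
        EXECUTE format(
            'ALTER TABLE %I.todos ADD COLUMN IF NOT EXISTS archived boolean DEFAULT false',
            s);
    END LOOP;
END $$;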
3. RLS isn’t free and the slowness is weirder than you think
the initPlan trick that changes everything
RLS policies are essentially WHERE clauses appended to every query on a protected table. in the simple case, tenant_id = current_setting('app.tenant_id')::uuid, this is nearly free. the planner inlines it, the index gets used, life is good.
the footgun is when you call a function in your policy without marking it STABLE. functions are VOLATILE by default, which prevents postgres from hoisting the evaluation out of the per-row filter. something like:
CREATE POLICY tenant_isolation ON documents
USING (tenant_id = get_current_tenant());
looks fine. but if get_current_tenant() is volatile (the default), postgres can’t hoist it, so it evaluates the function for every row. on a 100k row table, that’s 100k function calls. your query goes from 1ms to 170ms not because the function is expensive, but because of how many times it runs.
here’s the fix:
USING (tenant_id = (SELECT get_current_tenant()))
i only figured this out from a github issue thread, which, great. wrapping the call in a SELECT subexpression triggers what postgres calls an initPlan: the optimizer evaluates it once per statement, caches the result, and uses that cached value for every row in that statement. same function, same result, completely different execution. you can see this in EXPLAIN ANALYZE: without the wrapper you get a function call in the filter condition, with it you get an InitPlan node evaluated once at the top.
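roughly the shape you'll see in each case (abbreviated, not verbatim planner output):

EXPLAIN ANALYZE SELECT * FROM documents;

-- without the wrapper, the call sits in the per-row filter:
--   Seq Scan on documents
--     Filter: (tenant_id = get_current_tenant())
--
-- with the wrapper, an InitPlan evaluates it once and rows compare against $0:
--   Seq Scan on documents
--     Filter: (tenant_id = $0)
--     InitPlan 1 (returns $0)
--       ->  Result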
STABLE helps but isn’t a complete fix
the natural instinct after learning about function call overhead is to mark your functions STABLE. STABLE tells postgres the function can’t modify the database and returns consistent results within a single table scan, which lets the planner inline or fold it in certain contexts.
but STABLE alone doesn’t reliably trigger caching in RLS policies. the Supabase discussions and the Scott Pierce article in the references are pretty clear on this: across postgres 14-16, STABLE functions in RLS contexts almost never get the optimization you’d expect without the SELECT wrapper. the wrapper is what forces the initPlan. if your STABLE function takes row data as a parameter, the planner can’t cache it regardless, each row is different. but even for functions that don’t take row data, the planner often refuses to optimize them in RLS without the wrapper.
STABLE is still worth setting, it helps in non-RLS contexts and occasionally the stars align. but for RLS specifically, assume you need the SELECT wrapper until EXPLAIN ANALYZE proves otherwise.
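for completeness, a plausible definition for get_current_tenant() (assumed - it was only named above, never defined), plus the policy with the wrapper applied:

CREATE OR REPLACE FUNCTION get_current_tenant() RETURNS uuid
LANGUAGE sql STABLE AS $$
    SELECT current_setting('app.tenant_id', true)::uuid
$$;

DROP POLICY IF EXISTS tenant_isolation ON documents;
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = (SELECT get_current_tenant()));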
4. i tried to build my own extension
“it shouldn’t be that hard” - winit at 4:30 AM
i was maybe four tabs deep into the nile docs when i had the thought that ruins my late nights and the following mornings: how hard can it be to just build this as an extension. postgres has hooks. you can intercept queries, manage session state, enforce isolation. maybe you don’t need an entire company’s infrastructure. maybe you just need a clever .so file.
so i built pg_tenant. and it works. and building it taught me exactly where that reasoning falls apart.
what pg_tenant does and how it works
pg_tenant is a postgres extension written in rust using pgrx. the core idea is simple: give postgres first-class tenant context and enforce isolation automatically without the application having to think about it.
-- you initialize it once:
SELECT tenant_init();
-- create a tenant:
SELECT tenant_create('acme-corp', NULL, 'RowLevel');
-- set the context for the session:
SELECT tenant_set_id(tenant_get_by_slug('acme-corp'));
and from that point, every query on protected tables automatically enforces tenant isolation. no WHERE clauses. no application-level filtering.
under the hood, tenant_set_id calls set_config('pg_tenant.current_tenant_id', ...), a session-level configuration variable that RLS policies can read via current_setting(). the extension creates the policies automatically when you register a table, pointing them at that config variable. it supports two isolation modes: RowLevel which uses RLS, and SchemaBased which creates a dedicated schema per tenant and manages search_path. a third mode, DedicatedDatabase, is listed in the code as not yet implemented. that omission is the whole point.
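in RowLevel mode, the generated policy looks roughly like this - reconstructed from the description above, not lifted from the source. note it bakes in the SELECT wrapper from section 3:

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY pg_tenant_isolation ON documents
    USING (tenant_id =
           (SELECT current_setting('pg_tenant.current_tenant_id', true)::uuid));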
ProcessUtility_hook and where extension-land ends
postgres exposes a hook called ProcessUtility_hook that lets you replace the standard utility command handler. when you register a function against this hook, your function becomes responsible for processing utility commands (DDL, GRANT, etc.). if you want the command to actually execute, you call standard_ProcessUtility yourself. if you don't call it, the command never runs locally.
this matters because the hook isn’t just an interceptor, it’s the handler. pg_tenant uses it to intercept DDL and enforce schema routing: inspect the command, maybe modify context, then call through to standard processing. pg_karnak uses the same hook but for something more ambitious: intercept the DDL, don’t execute it locally yet, coordinate with other nodes first, then execute everywhere atomically.
the extension model gets you surprisingly far. you can manage tenant context, automate RLS policy creation, intercept DDL, route to schemas. what you cannot do is own the connection pool. an extension lives inside a single postgres process. it has no visibility into what other connections are doing, no way to ensure that the tenant context set in one transaction survives correctly into the next, no way to enforce that every connection through pgbouncer sets context before running queries. that coordination has to happen outside the database.
the distributed systems wall
the DedicatedDatabase isolation mode is unimplemented because implementing it correctly means solving a different problem entirely. the moment a tenant lives on a separate physical postgres instance, you need:
- a way to route connections to the right instance
- a way to apply schema migrations to all instances atomically
- a way to handle failures mid-migration
- a way to make all of this invisible to the application
none of that is an extension. that’s a transaction coordinator, a metadata store, and a distributed lock manager. you can’t cargo pgrx install your way to it. an extension is code that runs inside postgres. this problem requires something that runs above postgres, across multiple instances, with its own failure recovery.
that’s the wall. pg_tenant hits it at the database boundary. nile built through it.
5. what nile actually built
rls in production: the gotchas they hit shipping it for real
nile started with the same instinct most teams have, append WHERE tenant_id = $1 to everything, ship it, move on. as they added features, they ended up with 40+ different WHERE clauses scattered across endpoints, each one a potential place to forget the filter and leak data. they switched to RLS. the experience of actually shipping it to production is documented in their blog and it’s more honest than most.
two gotchas stand out. first: RLS doesn’t apply to superusers, ever. it also doesn’t apply to table owners by default. when they first deployed to their test environment, they hadn’t changed the database user from the superuser they used for setup. RLS wasn’t applying to any of their queries. everything passed. they caught it before production but only because they were specifically testing for it. you can force RLS on table owners with ALTER TABLE ... FORCE ROW LEVEL SECURITY, but that only fixes the table-owner case. if you’re running as an actual superuser, nothing helps except not being a superuser. if you’re not writing integration tests that explicitly verify tenant isolation from a non-owner, non-superuser role, you are not testing RLS.
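before the second gotcha, here's the table-owner fix plus the kind of test that would catch both cases, assuming a non-owner, non-superuser app_user role and the app.tenant_id convention from earlier:

ALTER TABLE todos FORCE ROW LEVEL SECURITY;   -- owners no longer bypass policies

SET ROLE app_user;                            -- superusers still bypass; don't be one
BEGIN;
SET LOCAL app.tenant_id = 'acme';
SELECT count(*) FROM todos;                   -- must count only acme's rows
ROLLBACK;
RESET ROLE;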
second: they were running on the JVM, using thread-local storage to hold the current user ID for RLS. threads get reused between requests. a previous request’s user ID was leaking into the next request’s context, meaning queries were running with the wrong tenant’s identity. the fix was a post-response handler that explicitly resets thread-local state. this is a standard thread-pooling footgun that happens to bite you hard when you’re using it for tenant isolation. if you’re storing tenant context in thread-locals, you own this problem.
tenant virtualization: why owning the connection pool is not optional
nile’s model is that you connect to one database and set nile.tenant_id, and from that point you’re operating inside a virtual tenant database. the query goes to the right physical location, isolation is enforced, routing is handled. the application just runs queries.
the deceptively hard part is making SET nile.tenant_id safe in a connection pool. the naive version looks like this:
SET nile.tenant_id = 'acme';
SELECT * FROM todos;
and it looks fine. but if your application code returns the connection to the pool between those two statements (an easy mistake in async code or poorly structured pool usage), the SET runs on connection A, A goes back to the pool, the SELECT runs on connection B which has no tenant context set, and returns everything. even if you keep the same connection, the real problem is that session-level SET persists after you return the connection: the next borrower of connection A inherits acme’s context. this is not a theoretical concern. this is what happens with naive pooling code.
the only safe fix is to control the connection pool yourself. not configure it. own it. nile’s SDK implements its own tenant-aware pooling, when you ask for a connection for a specific tenant, the SDK ensures that tenant context is set on that connection before any query runs, and that the connection is properly managed across its lifetime. it’s not a wrapper around pgbouncer. it’s a pool that understands tenants as a first-class concept. this is why their SDK exists. it’s not convenience. it’s correctness.
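the invariant the SDK maintains can be stated in plain SQL, even though their implementation obviously isn't this. either the context is scoped so it can't outlive the borrow, or the connection is scrubbed before it goes back:

-- option 1: transaction-scope the context
BEGIN;
SET LOCAL nile.tenant_id = 'acme';
SELECT * FROM todos;
COMMIT;

-- option 2: session-scope it, but reset all session state on return
SET nile.tenant_id = 'acme';
SELECT * FROM todos;
DISCARD ALL;   -- drops SET values, prepared statements, temp tables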
pg_karnak: distributing DDL across databases atomically
this is the part that gets interesting.
nile’s virtual tenant databases can live on multiple physical postgres instances. tenants get placed on shared compute or dedicated compute depending on their tier. from the developer’s perspective it’s one database. under the hood it might be ten. the problem: when you run CREATE TABLE todos (...), it needs to apply to all of them. simultaneously. atomically. with the same guarantees you’d expect from a single database.
that's pg_karnak. it uses ProcessUtility_hook to intercept DDL (same hook i used in pg_tenant), but instead of just routing locally, it coordinates the DDL across all physical databases that make up a virtual tenant database. the extension hooks into the transaction lifecycle via XactCallback, postgres's transaction state change notifications. when a transaction commits, pg_karnak's coordinator takes over: prepare the DDL on all nodes, and only commit if all prepares succeed. if any node fails, roll back everywhere.
this isn’t just a wrapper around PREPARE TRANSACTION. they built a transaction coordinator that manages the lifecycle across independent postgres instances, handles partial failures, and recovers from crashes mid-commit. the full writeup goes into the failure modes and recovery semantics.
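for reference, the primitive underneath is postgres two-phase commit - DDL is transactional in postgres, so it can sit inside a prepared transaction. this is the building block a coordinator drives on each node (requires max_prepared_transactions > 0), not nile's actual implementation:

-- phase 1, on every physical database:
BEGIN;
CREATE TABLE todos (id uuid PRIMARY KEY, title text);
PREPARE TRANSACTION 'ddl_migration_42';

-- phase 2, if every node acked the prepare:
COMMIT PREPARED 'ddl_migration_42';

-- or, if any node failed to prepare, on the nodes that did:
ROLLBACK PREPARED 'ddl_migration_42';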
6. niche things i found very fancy
some stuff i came across that isn’t directly about the core multitenancy models but felt worth writing down.
cloudflare's TCP Vegas trick for connection pooling
cloudflare runs postgres for thousands of internal tenants on shared infrastructure. the noisy neighbor problem, one tenant’s burst writes starving everyone else’s connections, is unsolvable with static limits. you either cap tenants too low and waste capacity, or too high and let them blow up the cluster.
their solution: adapt TCP Vegas, the congestion avoidance algorithm, to connection pooling. each tenant starts with a small pool. cloudflare samples their transaction round-trip time continuously. as long as RTT stays stable, the pool size grows. the moment RTT degrades, the algorithm backs off, reducing that tenant’s concurrency automatically, without any manual intervention or knowledge of the underlying hardware. the database slowing down is the signal. the connection pool is the throttle.
it’s elegant because it’s self-tuning. no thresholds to configure, no per-tenant policies to maintain. the system finds each tenant’s optimal concurrency on its own and backs off when the shared resource is under pressure. the reason nobody else does this is you need to own the observability layer on top of the pool, you can’t bolt this onto pgbouncer without instrumenting it. cloudflare runs their own infrastructure so they can.
their detailed write-up (performance isolation in a multi-tenant database environment, in the references) is worth reading in full.
logical replication for zero-downtime tenant migration
at some point you’ll have a tenant that’s grown too big for shared infrastructure. you need to move them to a dedicated database without downtime. the naive approach is dump, restore, cutover, which means an outage window.
postgres 15 added row filters to logical replication. you can now do:
CREATE PUBLICATION tenant_migration
FOR TABLE todos, users, documents
WHERE (tenant_id = 'acme-corp');
start replication to the new database, wait for it to catch up, then cut the connection string over. the tenant’s data streams continuously to the new instance while your app keeps running. once replication lag hits zero, you flip the switch. the old rows stick around until you clean them up.
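the receiving side, sketched with made-up names and connection details:

-- on the destination database:
CREATE SUBSCRIPTION tenant_migration_sub
    CONNECTION 'host=old-primary dbname=app user=replicator'
    PUBLICATION tenant_migration;

-- on the source, watch lag until it reaches zero before cutover:
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;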
one caveat: the WHERE filter expression is evaluated against both old and new row versions for UPDATE/DELETE. if both versions are out of scope, the change isn’t replicated. if only one is in scope, it gets converted to an INSERT or DELETE to keep the subscriber consistent. the real limitation is that the filter expression can only use columns in the table’s replica identity (usually the primary key). so you can filter on tenant_id only if it’s part of the replica identity, which for most schemas it won’t be by default. you’d need to set up a custom replica identity that includes it. for the migration use case this usually works out, but it’s not as plug-and-play as the syntax suggests.
this is how you implement the hybrid model properly, start everyone on shared infrastructure, peel off heavy tenants later as they earn dedicated resources. no bespoke migration scripts, no downtime negotiation.
pg_stat_statements is blind to your tenants
pg_stat_statements tracks query performance across your database. it’s the first place you look when something is slow. the problem in a multitenant setup: it normalizes queries and aggregates across all tenants. SELECT * FROM todos WHERE tenant_id = $1 is one row in pg_stat_statements regardless of whether tenant A runs it in 2ms and tenant B runs it in 4 seconds.
you have no idea which tenant is killing your database. you just know something is.
the workaround is query tagging, embedding tenant context in a comment that shows up in the query text:
/* tenant:acme-corp */ SELECT * FROM todos WHERE tenant_id = $1
pg_stat_statements won’t normalize away comments, so you get per-tenant rows. it’s manual and requires discipline. and it’s fragile: some ORMs and query builders strip or rewrite comments, so this can silently stop working. but when it works, it’s the difference between “the database is slow” and “acme-corp is slow because they’re doing full table scans.”
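once tags are flowing, a query along these lines gives you the per-tenant breakdown (the regex assumes the comment format above; the column is total_time instead of total_exec_time before postgres 13):

SELECT substring(query FROM 'tenant:([a-z0-9-]+)') AS tenant,
       sum(calls) AS calls,
       sum(total_exec_time) AS total_ms
FROM pg_stat_statements
WHERE query LIKE '/* tenant:%'
GROUP BY 1
ORDER BY total_ms DESC;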
extended statistics and why your query planner is lying to you
the postgres query planner estimates row counts before executing a query. it uses per-column statistics collected by ANALYZE. the problem: column statistics are independent by default. the planner assumes tenant_id and status are uncorrelated, so its estimate for WHERE tenant_id = 'acme' AND status = 'active' is just the two selectivities multiplied together.
in a multitenant database this is almost always wrong. different tenants have wildly different data distributions. one tenant might have 90% of their rows as status = 'active'. another might have 10%. the planner’s estimate is averaged across all tenants, which means it’s wrong for everyone.
the fix is extended statistics:
CREATE STATISTICS tenant_status_stats ON tenant_id, status FROM todos;
ANALYZE todos;
this tells postgres to track the correlation between those columns explicitly. the planner’s row estimates get dramatically more accurate, which means better query plans, which means you stop scratching your head wondering why postgres chose a seq scan on a table with a perfectly good index.
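you can confirm the statistics object exists and then re-check the planner's row estimate:

-- what was built (kinds shows ndistinct/dependencies/mcv coverage):
SELECT statistics_name, kinds
FROM pg_stats_ext
WHERE statistics_name = 'tenant_status_stats';

-- compare the estimated row count against the plan from before the
-- statistics existed:
EXPLAIN SELECT * FROM todos
WHERE tenant_id = 'acme' AND status = 'active';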
something i don’t fully understand yet
there’s this thing in the nile blog about “tenant placement” - deciding which physical database a tenant lands on based on their tier or usage patterns. the mechanics of it seem straightforward (metadata table mapping tenant to database, routing layer that reads it), but i don’t understand how they handle the transition. like, if a tenant starts on shared compute and needs to move to dedicated, what happens to in-flight transactions? do they drain the old placement first? is there a brief period of unavailability? the blog post doesn’t go into it and i couldn’t find details. might be worth asking them directly.
7. so what does this all mean
honestly i’m not sure i have a clean takeaway.
multitenancy is a tension between isolation and simplicity, and every model is just a different place on that curve. you pick shared schema because it’s simple, then you spend time bolting on the isolation you gave up. you pick database-per-tenant for the isolation, then you spend time building the operational infrastructure to make it manageable. there’s no free lunch, just different bills.
what nile built is interesting because they’re trying to give you the simplicity of shared schema with the isolation of database-per-tenant. whether that actually works in practice, i don’t know. i haven’t run production traffic through it. the pg_karnak stuff is genuinely clever engineering, but clever engineering doesn’t always survive contact with real workloads. i’d want to see more about failure modes, latency overhead from the coordination layer, what happens when the coordinator itself has issues.
i built pg_tenant to understand where the extension model runs out. now i know: it’s the connection pool and the cross-database coordination. extensions can handle context management, RLS automation, schema isolation. they can’t own the pool. they can’t make one database feel like many. that part requires something above postgres, not inside it.
i don’t know if nile is the right answer. i don’t know if there is a right answer, or if multitenancy is just permanently annoying and you pick your flavor of annoyance. but i learned a lot digging into this, and maybe some of it is useful to someone else.
references
nile
- Shipping multi-tenant SaaS using Postgres Row-Level Security
- Tenant Virtualization in Postgres with 5 lines of code
- Introducing pg_karnak: Transactional schema migration across tenant databases
RLS performance
- Optimizing Postgres Row Level Security (RLS) for Performance - Scott Pierce
- RLS Performance and Best Practices - Supabase Discussion
- Stable functions do not seem to be honored in RLS - Supabase Discussion
- Row Level Security - Supabase Docs
- Using PostgreSQL Row-Level Security in a high-growth startup - Dovetail Engineering
connection pooling
- Performance isolation in a multi-tenant database environment - Cloudflare
- Scaling Postgres connections with PgBouncer - PlanetScale
- Supavisor: A cloud-native, multi-tenant Postgres connection pooler
- Database Connection Pooling at Scale: PgBouncer + Multi-Tenant Postgres - DZone
multitenancy models
- Designing Your Postgres Database for Multi-tenancy - Crunchy Data
- Multi-tenant SaaS partitioning models for PostgreSQL - AWS Prescriptive Guidance
- Postgres Multitenancy: RLS vs Schemas vs Separate DBs - Debugg
- The Noisy Neighbor Problem in Multitenant Architectures - Neon
partitioning
- Table Partitioning - PostgreSQL Documentation
- Improving PostgreSQL Performance with Partitioning - Stormatics
postgres internals
- PostgreSQL source: ProcessUtility_hook - tcop/utility.c
- PostgreSQL source: LockTagType enum - src/include/storage/lock.h
- PREPARE TRANSACTION - PostgreSQL Documentation
pg_tenant