Real-time Systems | Complexity: complex

Real-time Collaboration Architecture

Architecture for collaborative applications with live cursors, presence, and conflict-free concurrent editing across multiple users and devices.

Components: 7 | Considerations: 5 | Alternatives: 4 | Complexity: complex

Fit

When this blueprint fits

And when to walk away from it

When to use this

Multiple users edit the same document, board, or canvas simultaneously and you need the experience to feel instant. Document editors, design tools, project management boards, and live whiteboards all sit here.

When NOT to use this

If editing conflicts are rare, last-write-wins with optimistic UI is dramatically simpler. Do not pay the CRDT cost for a CRUD app where two users editing the same record is a once-a-month event.

Architecture

System components

Key building blocks of this architecture, layered from infrastructure up.

01

WebSocket Gateway

Scalable WebSocket infrastructure with connection management, heartbeats, and graceful reconnects. I separate the gateway from the application logic so I can scale connection capacity independently. I get sticky sessions by having the load balancer hash on a stable client ID, and I use a Redis pub/sub backplane to fan messages out across nodes.
Socket.io, uWebSockets, Redis Pub/Sub, Sticky LB
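As a rough illustration of hashing on a stable client ID, here is a minimal sketch using rendezvous (highest-random-weight) hashing. The `GatewayNode` type, the `pickGatewayNode` function, and the choice of FNV-1a are assumptions made for this example; a real load balancer would normally provide its own hashing policy.

```typescript
// Hypothetical gateway node record; a real deployment would load these from service discovery.
type GatewayNode = { id: string; host: string };

// FNV-1a 32-bit hash: small and deterministic, used here only to spread IDs across nodes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Rendezvous hashing: score every node against the client ID and keep the highest.
// The same client always lands on the same node, and removing a node only
// remaps the clients that were assigned to it.
function pickGatewayNode(clientId: string, nodes: GatewayNode[]): GatewayNode {
  let best: GatewayNode | undefined;
  let bestScore = -1;
  for (const node of nodes) {
    const score = fnv1a(`${node.id}:${clientId}`);
    if (score > bestScore) {
      best = node;
      bestScore = score;
    }
  }
  if (!best) throw new Error("no gateway nodes available");
  return best;
}
```

The property that matters for reconnects is stability: a client that drops and comes back with the same ID returns to the node that already holds its room state, as long as that node is still in the pool.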
02

CRDT Engine

Conflict-free replicated data types so concurrent edits converge without a central authority. Yjs is my default for document and rich-text scenarios because the ecosystem (ProseMirror, TipTap, Monaco bindings) is mature. Automerge fits better for structured state. See the collaborative editor lab for a working example.
Yjs, Automerge, TipTap, ProseMirror
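To make "converge without a central authority" concrete, here is a deliberately tiny CRDT: a grow-only counter. It is not how Yjs or Automerge represent documents internally; the `GCounter` type and function names are illustrative only. What it does show is the merge property every CRDT relies on: merging is commutative, associative, and idempotent, so replicas can exchange state in any order and still agree.

```typescript
// Replica ID -> count contributed by that replica.
type GCounter = Record<string, number>;

// A replica only ever increases its own slot.
function increment(counter: GCounter, replicaId: string, by = 1): GCounter {
  return { ...counter, [replicaId]: (counter[replicaId] ?? 0) + by };
}

// Merge takes the per-replica maximum, so applying the same remote state
// twice, or in a different order, yields the same result.
function mergeCounters(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const replicaId of Object.keys(b)) {
    merged[replicaId] = Math.max(merged[replicaId] ?? 0, b[replicaId] ?? 0);
  }
  return merged;
}

// The observable value is the sum of all replica contributions.
function counterValue(counter: GCounter): number {
  let total = 0;
  for (const replicaId of Object.keys(counter)) {
    total += counter[replicaId] ?? 0;
  }
  return total;
}
```

Two replicas that each increment offline and then swap state end up with the same total regardless of which merge runs first, and no server has to decide the winner.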
03

Presence System

Live cursors, selections, typing indicators, and user status. Presence is a separate channel from document edits because it has different durability requirements: lose a cursor update and nobody cares, lose a document edit and you have lost data. Throttle aggressively (one update per client every 50ms is plenty) and never persist presence beyond the session.
WebSocket, Redis, Throttling, Awareness Protocol
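One way to throttle cursor traffic is to send at most one update per interval and keep only the newest position in between. The sketch below assumes a hypothetical `CursorUpdate` shape and takes the current time as an argument so it can be tested without real timers; in practice `flush` would be driven by `setInterval` or an animation frame.

```typescript
// Hypothetical presence payload; real systems usually add selection ranges and user metadata.
type CursorUpdate = { userId: string; x: number; y: number };

class PresenceThrottle {
  private intervalMs: number;
  private lastSentAt = -Infinity;
  private pending: CursorUpdate | null = null;

  constructor(intervalMs = 50) {
    this.intervalMs = intervalMs;
  }

  // Returns the update to send right now, or null if it was held back.
  offer(update: CursorUpdate, nowMs: number): CursorUpdate | null {
    if (nowMs - this.lastSentAt >= this.intervalMs) {
      this.lastSentAt = nowMs;
      this.pending = null;
      return update;
    }
    // Intermediate positions are disposable: the newest one replaces any older one.
    this.pending = update;
    return null;
  }

  // Call from a timer: emits the newest held-back update once the interval has elapsed.
  flush(nowMs: number): CursorUpdate | null {
    if (this.pending === null || nowMs - this.lastSentAt < this.intervalMs) {
      return null;
    }
    const update = this.pending;
    this.pending = null;
    this.lastSentAt = nowMs;
    return update;
  }
}
```

Dropping intermediate positions is safe here precisely because presence has no durability requirement; the same trick would lose data on the document channel.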
04

Sync Protocol

Efficient sync protocol with offline buffer, exponential backoff, and binary compression. Pairs with the local-first sync lab. The protocol design matters more than the framework: every message should be replayable, idempotent, and carry a vector clock or causal stamp.
Binary Protocol, Compression, Vector Clocks, Versioning
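A minimal sketch of the "replayable and idempotent" requirement, assuming each message carries its sender ID and a per-sender sequence number. That pair is a simplified causal stamp (the receiver's map of highest applied sequence per sender is a version vector); the `SyncMessage` envelope, `ReplayableInbox`, and `backoffDelayMs` are hypothetical names, and payloads are strings here rather than compressed binary.

```typescript
// Hypothetical envelope: sender ID plus a per-sender sequence number starting at 1.
type SyncMessage = { clientId: string; seq: number; payload: string };
type ApplyResult = "applied" | "duplicate" | "gap";

class ReplayableInbox {
  // Version vector: clientId -> highest sequence number applied so far.
  private applied: Record<string, number> = {};
  readonly state: string[] = [];

  apply(msg: SyncMessage): ApplyResult {
    const last = this.applied[msg.clientId] ?? 0;
    if (msg.seq <= last) return "duplicate"; // replaying an old message is a no-op
    if (msg.seq > last + 1) return "gap"; // something is missing; the caller should request a resend
    this.state.push(msg.payload);
    this.applied[msg.clientId] = msg.seq;
    return "applied";
  }
}

// Exponential backoff with full jitter. `random` is injectable so tests stay deterministic.
function backoffDelayMs(
  attempt: number,
  baseMs = 250,
  capMs = 30000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Because duplicates are ignored, a client coming back from offline can resend its whole buffer without checking what the server already received.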
05

Persistence Layer

Document snapshots, append-only update log, and time-travel history. I store CRDT updates in PostgreSQL as a binary log with periodic snapshots in S3 for fast cold-start. The combination gives me bounded read amplification and free version history.
PostgreSQL, S3, Binary Logs, Snapshots
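The snapshot-plus-log idea can be sketched in memory. In this illustration an array stands in for the PostgreSQL update log and a single object stands in for the S3 snapshot; `DocumentStore`, `DocUpdate`, and `DocSnapshot` are hypothetical names, updates are strings rather than binary CRDT updates, and "applying" an update just appends it to the state.

```typescript
type DocUpdate = { seq: number; data: string };
type DocSnapshot = { upToSeq: number; state: string[] };

class DocumentStore {
  private log: DocUpdate[] = []; // append-only; stands in for the database log table
  private snapshot: DocSnapshot = { upToSeq: 0, state: [] }; // stands in for the object-store snapshot
  private snapshotEvery: number;

  constructor(snapshotEvery = 100) {
    this.snapshotEvery = snapshotEvery;
  }

  // Append an update and take a snapshot once enough updates have accumulated.
  append(data: string): number {
    const seq = this.log.length + 1;
    this.log.push({ seq, data });
    if (seq - this.snapshot.upToSeq >= this.snapshotEvery) {
      this.snapshot = { upToSeq: seq, state: this.log.map((u) => u.data) };
    }
    return seq;
  }

  // Cold start: latest snapshot plus only the updates logged after it.
  load(): { state: string[]; replayed: number } {
    const tail = this.log.slice(this.snapshot.upToSeq); // seq is 1-based, so this skips snapshotted updates
    return {
      state: [...this.snapshot.state, ...tail.map((u) => u.data)],
      replayed: tail.length,
    };
  }

  // Time travel: rebuild the state as of a given sequence number from the log.
  loadAt(seq: number): string[] {
    return this.log.filter((u) => u.seq <= seq).map((u) => u.data);
  }
}
```

The number of updates replayed on load never exceeds `snapshotEvery - 1`, which is the bounded read amplification, and keeping the log append-only is what makes `loadAt` possible.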
06

Access Control

Per-document permissions, share links, and tenant boundaries. Real-time amplifies access control bugs because a stale token might keep a removed collaborator connected. I revalidate permissions on every reconnect and emit kick events when access changes mid-session.
JWT, Capability Tokens, Revocation Lists
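A minimal in-memory sketch of the two rules above: check permissions on every connect or reconnect, and emit a kick event when access is removed mid-session. `DocumentAccess`, `AccessLevel`, and `KickEvent` are hypothetical names; token verification (JWT signatures, revocation lists) is out of scope and assumed to happen before these calls.

```typescript
type AccessLevel = "none" | "read" | "write";
type KickEvent = { type: "kick"; docId: string; userId: string; reason: string };

class DocumentAccess {
  private grants = new Map<string, AccessLevel>(); // "docId:userId" -> level
  private live = new Map<string, Set<string>>(); // docId -> currently connected userIds

  private key(docId: string, userId: string): string {
    return `${docId}:${userId}`;
  }

  // Change a user's access. If a connected user drops to "none",
  // return a kick event for the gateway to deliver before closing the socket.
  setAccess(docId: string, userId: string, level: AccessLevel): KickEvent[] {
    this.grants.set(this.key(docId, userId), level);
    const sessions = this.live.get(docId);
    if (level === "none" && sessions !== undefined && sessions.has(userId)) {
      sessions.delete(userId);
      return [{ type: "kick", docId, userId, reason: "access revoked" }];
    }
    return [];
  }

  // Called on every connect and reconnect; an earlier successful check is never trusted.
  connect(docId: string, userId: string): boolean {
    const level = this.grants.get(this.key(docId, userId)) ?? "none";
    if (level === "none") return false;
    let sessions = this.live.get(docId);
    if (sessions === undefined) {
      sessions = new Set<string>();
      this.live.set(docId, sessions);
    }
    sessions.add(userId);
    return true;
  }
}
```

The kick closes the gap a stale token leaves open: the removed collaborator is disconnected immediately, and the reconnect check stops them from coming straight back.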
07

Observability

Per-connection metrics, sync conflict counts, and replay debugging. Real-time bugs are often unreproducible without the message stream that caused them. Tag every inbound and outbound message with a session ID, log a 1% sample in production, and switch to full-fidelity logging on errors.
OpenTelemetry, Datadog, Message Replay
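One possible reading of "sample at 1%, full fidelity on errors" is to sample whole sessions rather than individual messages, so a sampled session always has a complete, replayable stream. The sketch below assumes that interpretation. `MessageRecorder`, `LoggedMessage`, and the ring-buffer size are hypothetical, and `sink` stands in for whatever log exporter is in use.

```typescript
type LoggedMessage = { sessionId: string; direction: "in" | "out"; body: string };

// FNV-1a 32-bit hash, so the sampling decision for a session is stable across nodes.
function hashSessionId(sessionId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < sessionId.length; i++) {
    hash ^= sessionId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

function isSampled(sessionId: string, sampleRate: number): boolean {
  return hashSessionId(sessionId) / 0x100000000 < sampleRate;
}

class MessageRecorder {
  private recent = new Map<string, LoggedMessage[]>(); // per-session ring buffer of unlogged messages
  private fullFidelity = new Set<string>(); // sessions that have hit an error
  private sink: (msg: LoggedMessage) => void;
  private sampleRate: number;
  private bufferSize: number;

  constructor(sink: (msg: LoggedMessage) => void, sampleRate = 0.01, bufferSize = 100) {
    this.sink = sink;
    this.sampleRate = sampleRate;
    this.bufferSize = bufferSize;
  }

  record(msg: LoggedMessage): void {
    if (this.fullFidelity.has(msg.sessionId) || isSampled(msg.sessionId, this.sampleRate)) {
      this.sink(msg);
      return;
    }
    // Not logged yet: keep the most recent messages in case this session errors later.
    const buffer = this.recent.get(msg.sessionId) ?? [];
    buffer.push(msg);
    if (buffer.length > this.bufferSize) buffer.shift();
    this.recent.set(msg.sessionId, buffer);
  }

  // On error, flush what led up to it and log everything from this session afterwards.
  reportError(sessionId: string): void {
    for (const msg of this.recent.get(sessionId) ?? []) this.sink(msg);
    this.recent.delete(sessionId);
    this.fullFidelity.add(sessionId);
  }
}
```

The ring buffer is what makes an error report useful: it preserves the messages that arrived just before the failure, even for sessions that fell outside the 1% sample.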

Planning

Critical considerations

The things I have learned the hard way and would not skip on the next build.

WebSocket connection limits force careful capacity planning. I plan around roughly 10,000 connections per Node process; the real ceiling depends on message rate, payload size, and memory per connection. Shard by tenant or by document hash, and use connection draining during deploys.
CRDT choice depends on the data structure. Yjs is the strongest pick for text and rich documents, Automerge is better for structured records, and a custom CRDT might beat both for a narrow domain like spreadsheet cells.
Implement graceful degradation for poor networks. Buffer updates locally, show an offline indicator after 3 seconds of failed sync, and show a clear UI affordance when the user comes back online and the merges land.
Plan for the noisy session. A user who pastes 5MB from the clipboard can clog the broadcast channel for everyone in their document. Per-session rate limits and chunked transfers protect the rest of the room.
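The noisy-session protection can be sketched as a per-session token bucket measured in bytes, plus a helper that splits a large payload into chunks. `SessionRateLimiter`, `chunkPayload`, and the numbers used are assumptions for illustration; time is passed in explicitly so the behaviour is easy to test.

```typescript
class SessionRateLimiter {
  private capacityBytes: number;
  private refillBytesPerSec: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacityBytes: number, refillBytesPerSec: number, nowMs: number) {
    this.capacityBytes = capacityBytes;
    this.refillBytesPerSec = refillBytesPerSec;
    this.tokens = capacityBytes; // start with a full bucket so short bursts pass
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends budget if the session may send `bytes` now.
  tryConsume(bytes: number, nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacityBytes, this.tokens + elapsedSec * this.refillBytesPerSec);
    this.lastRefillMs = nowMs;
    if (bytes > this.tokens) return false;
    this.tokens -= bytes;
    return true;
  }
}

// Split a large payload into fixed-size chunks that can be interleaved with other traffic.
function chunkPayload(payload: Uint8Array, chunkBytes: number): Uint8Array[] {
  if (chunkBytes <= 0) throw new Error("chunkBytes must be positive");
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < payload.length; offset += chunkBytes) {
    chunks.push(payload.subarray(offset, offset + chunkBytes)); // subarray clamps the end index
  }
  return chunks;
}
```

A sender would chunk the paste, then send each chunk only when `tryConsume` allows it, queueing the rest. The room keeps receiving other users' small edits between chunks instead of waiting behind one 5MB frame.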

Options

Alternative approaches

Where I would consider a different shape entirely, with the trade-offs spelled out.

Alternative 01
Liveblocks for managed real-time infrastructure with strong React primitives. Best value for teams who want to focus on product.
Alternative 02
PartyKit for edge-native real-time when low latency to global users matters more than ecosystem maturity.
Alternative 03
Ably or Pusher for enterprise-grade pub/sub when you need SLAs but do not need the full document model.
Alternative 04
Replicache for query-based sync when you want server-authoritative state with optimistic mutations.