Real-time Chat at Scale
Architecture for chat systems that handle millions of concurrent users, covering connection management, fanout, persistence, and moderation.
When this blueprint fits
And when to walk away from it
When to use this
Use this when chat is the core experience: customer support, community platforms, in-game chat, dating apps, social messengers. It is the right blueprint when message latency and reliability are competitive differentiators.
When NOT to use this
If chat is a side feature with a few thousand users, embed a managed service like Stream or Sendbird. The build-vs-buy break-even for chat sits somewhere around 100k monthly active users.
Architecture
System components
Key building blocks of this architecture, layered from infrastructure up.
Connection Layer
Message Bus
Presence Service
Persistence
Push and Email Fallback
Moderation
Search and Indexing
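To make the relationship between a few of these components concrete, here is a minimal in-memory sketch of how the connection layer, message bus, and persistence might interact on a single publish. It is an illustration under stated assumptions, not the blueprint's implementation: the names ChatHub, Connection, and Message are hypothetical, a Python list stands in for the durable store, and a dictionary stands in for a distributed bus and socket registry.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Message:
    room: str
    sender: str
    text: str


@dataclass
class Connection:
    """Hypothetical stand-in for one client socket held by the connection layer."""
    user: str
    inbox: list = field(default_factory=list)

    def send(self, message: Message) -> None:
        # A real connection would write to a WebSocket; here we just buffer.
        self.inbox.append(message)


class ChatHub:
    """Hypothetical in-memory stand-in for the message bus plus persistence."""

    def __init__(self) -> None:
        self.rooms = defaultdict(dict)    # room -> {user: Connection}
        self.history = defaultdict(list)  # room -> [Message], stands in for the store

    def join(self, room: str, conn: Connection) -> None:
        self.rooms[room][conn.user] = conn

    def leave(self, room: str, user: str) -> None:
        self.rooms[room].pop(user, None)

    def publish(self, message: Message) -> int:
        """Persist the message, then fan it out to every other member of the room."""
        # Persist first so a client that reconnects later can backfill.
        self.history[message.room].append(message)
        delivered = 0
        for user, conn in self.rooms[message.room].items():
            if user == message.sender:
                continue  # do not echo the message back to its sender
            conn.send(message)
            delivered += 1
        return delivered


if __name__ == "__main__":
    hub = ChatHub()
    alice, bob = Connection("alice"), Connection("bob")
    hub.join("lobby", alice)
    hub.join("lobby", bob)
    count = hub.publish(Message("lobby", "alice", "hello"))
    print(count, [m.text for m in bob.inbox])
```

Persisting before fanout is one possible ordering, chosen here so that history stays complete even when no recipient is connected. At real scale the dictionary lookups would be replaced by a broker and a connection registry spread across many nodes, and the remaining components (presence, push fallback, moderation, search) would hook into the same publish path.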
Planning
Critical considerations
The things I have learned the hard way and would not skip on the next build.
Options
Alternative approaches
Where I would consider a different shape entirely, with the trade-offs spelled out.
Implementation
Related playbooks
Step-by-step guides for the harder parts of this architecture.
Designing Event-Driven Systems
Event-driven architectures unlock real autonomy between services, and they expose a whole new category of bugs if you do not respect their constraints. This playbook is the design discipline I use: model events as facts, version schemas carefully, choose the right broker, build idempotent consumers, handle ordering and failure, and add the observability that makes async systems debuggable in production.
Production Monitoring & Observability
Observability is not three pillars on a slide, it is the difference between knowing why your system is misbehaving and guessing. This playbook is the monitoring stack I deploy on every production system: error tracking, structured logging, performance metrics, distributed tracing, and the dashboards and alerts that turn raw data into actionable signal without paging everyone at 3 AM.
In practice
Related case studies
Where I have applied this blueprint to real builds and what changed in practice.
Thinking
Related insights
Essays where I argue the trade-offs behind the choices in this blueprint.