Architecture · advanced

Designing Event-Driven Systems

Event-driven architectures unlock real autonomy between services, and they expose a whole new category of bugs if you do not respect their constraints. This playbook is the design discipline I use: model events as facts, version schemas carefully, choose the right broker, build idempotent consumers, handle ordering and failure, and add the observability that makes async systems debuggable in production.

120 min · 7 steps

Technologies used

Kafka, Redis Streams, EventBridge, PostgreSQL

The methodology

The phases, in order

Each phase below is something I actually run in a project. The descriptions are how I think about the work, not abstract definitions.

Phase 1 of 7

Identify Domain Events

Events are facts about things that happened, named in past tense: OrderPlaced, PaymentCaptured, UserSignedUp. They are not commands. If you find yourself naming an event that sounds like an instruction, that is a hint that you are still designing in a request-response shape and missing the benefit of events.
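A minimal sketch of what "event as fact" can look like in code, assuming Python dataclasses. The field names (order_id, customer_id, total_cents) and the looks_like_command helper are hypothetical, and the prefix check is a crude heuristic for catching instruction-shaped names in review, not a complete rule.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)
class OrderPlaced:
    """A fact: this order was placed. Frozen, because facts do not change."""

    order_id: str
    customer_id: str
    total_cents: int
    # Every event carries its own id and the time the fact occurred.
    event_id: str = field(default_factory=lambda: str(uuid4()))
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Hypothetical review heuristic: names that open with an imperative verb
# usually describe a command, not something that already happened.
COMMAND_PREFIXES = ("Send", "Charge", "Notify", "Run", "Do")


def looks_like_command(event_name: str) -> bool:
    return event_name.startswith(COMMAND_PREFIXES)
```

With this in place, looks_like_command("SendWelcomeEmail") is true while looks_like_command("UserSignedUp") is not, which is the renaming conversation the phase is trying to force.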

Phase 2 of 7

Design Event Schemas Carefully

Each event gets a schema with required fields, optional fields, and a version. I use a schema registry so producers and consumers cannot drift silently. Backward compatibility is mandatory: never remove a field, never change a type, and only add fields that are optional. A breaking schema change breaks every consumer, often silently.
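The compatibility rules can be checked mechanically before a new schema version ships. This is a sketch under stated assumptions: schemas are plain dictionaries in a made-up format (not the format of any real schema registry), and a newly added required field is treated as a break because existing producers would not populate it. The names FieldSpec, compatibility_errors, and the ORDER_PLACED_* schemas are hypothetical.

```python
from typing import TypedDict


class FieldSpec(TypedDict):
    type: str
    required: bool


Schema = dict[str, FieldSpec]


def compatibility_errors(old: Schema, new: Schema) -> list[str]:
    """Return every way `new` breaks consumers written against `old`."""
    errors = []
    for name, spec in old.items():
        if name not in new:
            errors.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            errors.append(
                f"changed type of {name}: {spec['type']} -> {new[name]['type']}"
            )
    for name, spec in new.items():
        if name not in old and spec["required"]:
            errors.append(f"new field must be optional: {name}")
    return errors


ORDER_PLACED_V1: Schema = {
    "order_id": {"type": "string", "required": True},
    "total_cents": {"type": "int", "required": True},
}

# Safe evolution: the only change is one added optional field.
ORDER_PLACED_V2: Schema = {
    **ORDER_PLACED_V1,
    "coupon_code": {"type": "string", "required": False},
}
```

Running compatibility_errors(ORDER_PLACED_V1, ORDER_PLACED_V2) returns an empty list. A check like this belongs in CI, so a breaking change fails the build instead of failing a consumer.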

Phase 3 of 7

Choose the Right Transport

Kafka for durable, high-volume streams that need ordering within a partition. Redis Streams for simpler use cases under a million events a day. EventBridge or SQS for AWS-native workloads. The choice matters less than picking one and learning its failure modes deeply. Switching brokers is expensive and rarely worth it.

Phase 4 of 7

Build Idempotent Consumers

Every consumer must handle the same event twice safely. I include an event id on every message, store processed ids with a TTL longer than the window in which the broker can redeliver, and short-circuit re-deliveries. The alternative is doubled-up side effects in production, which is how event-driven systems get their bad reputation.
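A minimal sketch of the dedupe step, assuming an in-memory store that stands in for Redis or a database table. ProcessedIds, handle_once, and the event_id key are hypothetical names. The clock is injectable only so the expiry behaviour can be tested without sleeping.

```python
import time
from typing import Callable


class ProcessedIds:
    """Seen event ids with a TTL. In-memory stand-in for a shared store."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def seen(self, event_id: str) -> bool:
        expiry = self._expires_at.get(event_id)
        return expiry is not None and expiry > self._clock()

    def mark(self, event_id: str) -> None:
        self._expires_at[event_id] = self._clock() + self._ttl


def handle_once(
    event: dict, store: ProcessedIds, side_effect: Callable[[dict], None]
) -> bool:
    """Run side_effect at most once per event id. Returns True if it ran."""
    if store.seen(event["event_id"]):
        return False  # re-delivery: short-circuit
    side_effect(event)
    # Mark only after success, so a failed attempt is retried on redelivery.
    store.mark(event["event_id"])
    return True
```

This sketch is not safe under concurrent consumers: the check and the mark are two separate steps. In a real system I would make them atomic, or write the processed id in the same database transaction as the side effect.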

Phase 5 of 7

Handle Ordering Where It Matters

Most events do not need strict ordering. The ones that do get routed to a partition keyed by the entity id, so events for one user always land in order on a single consumer. I write this down explicitly per event type so nobody assumes ordering they do not have.
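A sketch of keyed routing plus the written-down ordering contract, under stated assumptions: the hash here is SHA-256 chosen for illustration and is not the partitioner any particular broker uses, and the ORDERING_KEY table, partition_for, and route are hypothetical names.

```python
import hashlib
from typing import Optional

# The explicit contract: which field, if any, each event type is ordered by.
# None means "no ordering guarantee, do not assume one".
ORDERING_KEY: dict[str, Optional[str]] = {
    "OrderPlaced": "order_id",
    "PaymentCaptured": "order_id",
    "UserSignedUp": None,
}


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Stable mapping from entity id to partition (same id, same partition)."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(event_type: str, payload: dict, num_partitions: int) -> Optional[int]:
    """Return the partition for ordered events, or None for unordered ones."""
    # A KeyError here is deliberate: undocumented event types should fail loudly.
    key_field = ORDERING_KEY[event_type]
    if key_field is None:
        return None  # let the broker spread these however it likes
    return partition_for(str(payload[key_field]), num_partitions)
```

Because OrderPlaced and PaymentCaptured share order_id as their key, both events for one order land on the same partition and keep their relative order. Note that changing the partition count remaps keys, so ordering only holds while the count stays fixed.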

Phase 6 of 7

Plan for Failure and Replay

Consumers fail. I build a dead-letter queue for messages that fail processing, with retry backoff and a max attempt count. The DLQ has tooling: inspect a failed message, fix the issue, replay. I also keep events around long enough to replay a consumer from scratch when a bug is fixed.
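The retry, dead-letter, and replay loop can be sketched in a few lines. Assumptions: the DLQ is a plain list standing in for a real queue or table, backoff is simple exponential doubling without jitter, and process_with_retry and replay are hypothetical names. The sleep function is injectable so tests do not actually wait.

```python
import time
from typing import Callable


def process_with_retry(
    message: dict,
    handler: Callable[[dict], None],
    dead_letters: list,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try handler up to max_attempts times, then park the message in the DLQ."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = repr(exc)
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    dead_letters.append(
        {"message": message, "error": last_error, "attempts": max_attempts}
    )
    return False


def replay(dead_letters: list, handler: Callable[[dict], None]) -> int:
    """Re-run parked messages after a fix. Messages that still fail stay parked."""
    pending = list(dead_letters)
    dead_letters.clear()
    replayed = 0
    for entry in pending:
        try:
            handler(entry["message"])
            replayed += 1
        except Exception as exc:
            entry["error"] = repr(exc)
            dead_letters.append(entry)
    return replayed
```

Storing the error and attempt count next to the message is what makes the "inspect, fix, replay" workflow possible. Replay only stays safe if the handler is idempotent, which is why the idempotency phase comes first.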

Phase 7 of 7

Observability for Async Systems

I trace events end to end: when produced, when consumed, by whom, with what latency. Without tracing, debugging an event-driven system is guessing. The instrumentation hooks into the monitoring playbook so events appear in the same dashboards as everything else.
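A hand-rolled sketch of the metadata that makes those questions answerable: when produced, when consumed, by whom, with what latency. In practice a tracing library would carry this context. Envelope, record_consumption, and the span fields are hypothetical names, and the spans list stands in for whatever backend the dashboards read from.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Envelope:
    """Wraps every event with the tracing fields the producer stamps on it."""

    event_type: str
    payload: dict
    producer: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    produced_at: float = field(default_factory=time.time)  # epoch seconds


def record_consumption(
    envelope: Envelope,
    consumer: str,
    spans: list,
    now: Optional[float] = None,
) -> dict:
    """Record one consume span: who consumed which event, and how late."""
    consumed_at = time.time() if now is None else now
    span = {
        "trace_id": envelope.trace_id,
        "event_type": envelope.event_type,
        "producer": envelope.producer,
        "consumer": consumer,
        "produced_at": envelope.produced_at,
        "consumed_at": consumed_at,
        "latency_ms": (consumed_at - envelope.produced_at) * 1000.0,
    }
    spans.append(span)
    return span
```

The trace_id must be copied onto any event a consumer emits in response, otherwise the trace ends at the first hop. Latency computed from wall clocks on two machines also includes clock skew, so treat small values as approximate.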

Results

What You'll Achieve

Expected outcomes from implementing this playbook

Loosely coupled services with clean ownership of events

Reliable async workflows with idempotency and replay

End-to-end traces that make async debugging tractable

Clear documentation of which events exist and who owns them

See the event-driven blueprint or contact me for an audit.

Use this playbook

Want me to run this with you?

The playbook is the public version. The private version is me running it for your team against a real deadline. If you have a project on the line, that is usually the faster path.
