Gaming · 5 months pre-launch · Architecture lead with 5 backend engineers

Live-Service Gaming Backend

From beta crashes to a stable live service at 2M concurrent players

A live-service gaming client

A live-service game shipped its closed beta and immediately fell over at 80,000 concurrent players - well below the launch target. Matchmaking timed out, inventory writes were lost, and the leaderboard service became the de facto authoritative source for user data because it was the only one still responding. I came in to redesign the backend before launch: per-region authoritative shards for player state, an event-sourced inventory ledger, a matchmaker that could scale horizontally, and a content-deploy pipeline that didn't require a maintenance window. We shipped on time and held 2M concurrent players at launch with no critical incidents.

This is a representative architecture study based on real project patterns. Specific metrics and client details have been generalized to protect confidentiality.

Results

What changed, in numbers

The metrics the engagement is measured by.

  • 2M peak CCU: concurrent players at launch

  • <12s match wait: p95, down from 90s in beta

  • 0 critical incidents during launch week

  • -35% cost per CCU versus the beta architecture

Challenge

What was broken

A backend designed for a single-region beta could not handle a global launch. The player-state service was a single Postgres primary, matchmaking was a Python script with global locks, and content drops required taking the game offline. Players had already started speedrunning the bugs as a meme, which is fun but bad for retention.

Solution

The shape of the fix

A regionally sharded, event-sourced backend with horizontally scalable matchmaking and a no-downtime content pipeline - sized and load-tested against the actual launch curve, not a synthetic peak.

Approach

How I tackled it

The concrete moves that took the project from broken to shipped.

1. Sharded the player-state service per region, with a global identity layer for cross-region transfers
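A minimal sketch of that routing pattern, assuming a Go backend; the names here (IdentityDirectory, Region, HomeRegion) are illustrative stand-ins, not the engagement's actual services:

```go
// Sketch: route player-state traffic to its home-region shard via a
// global identity lookup. All names are illustrative.
package main

import (
	"fmt"
	"sync"
)

type Region string

// IdentityDirectory is the small, globally replicated layer that owns
// exactly one fact per player: which region is authoritative for them.
type IdentityDirectory struct {
	mu   sync.RWMutex
	home map[string]Region
}

func (d *IdentityDirectory) HomeRegion(playerID string) (Region, bool) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	r, ok := d.home[playerID]
	return r, ok
}

// Transfer moves authority for a player to a new region. In production
// this would be a fenced, two-phase handoff; here it is a single write.
func (d *IdentityDirectory) Transfer(playerID string, to Region) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.home[playerID] = to
}

func main() {
	dir := &IdentityDirectory{home: map[string]Region{"p1": "eu-west"}}
	if r, ok := dir.HomeRegion("p1"); ok {
		fmt.Printf("route p1 state writes to shard in %s\n", r)
	}
	dir.Transfer("p1", "us-east") // cross-region transfer updates one fact
}
```

The directory stays tiny because it owns exactly one fact per player; everything heavy stays regional.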

2. Replaced the inventory service with an event-sourced ledger so duplicate writes became debug data, not lost items
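A hedged sketch of the idempotent, append-only ledger idea; the event schema and names are invented for illustration:

```go
// Sketch: an event-sourced inventory ledger. Idempotency keys make a
// duplicate write a logged no-op instead of a lost or duplicated item.
package main

import "fmt"

type InventoryEvent struct {
	EventID  string // client-generated idempotency key
	PlayerID string
	ItemID   string
	Delta    int // +1 grant, -1 consume
}

type Ledger struct {
	seen   map[string]bool  // event IDs already applied
	events []InventoryEvent // append-only history
}

// Append applies an event exactly once. A replayed event is reported to
// the caller as debug data rather than silently changing state.
func (l *Ledger) Append(e InventoryEvent) (applied bool) {
	if l.seen[e.EventID] {
		return false // duplicate: keep for diagnostics, don't re-apply
	}
	l.seen[e.EventID] = true
	l.events = append(l.events, e)
	return true
}

// Balance derives current state by folding over history, so any bug is
// reproducible from the ledger instead of being an unexplained value.
func (l *Ledger) Balance(playerID, itemID string) int {
	n := 0
	for _, e := range l.events {
		if e.PlayerID == playerID && e.ItemID == itemID {
			n += e.Delta
		}
	}
	return n
}

func main() {
	led := &Ledger{seen: map[string]bool{}}
	e := InventoryEvent{EventID: "evt-1", PlayerID: "p1", ItemID: "sword", Delta: 1}
	led.Append(e)
	if !led.Append(e) { // client retry after a timeout
		fmt.Println("duplicate evt-1 ignored and logged")
	}
	fmt.Println("sword count:", led.Balance("p1", "sword")) // 1, not 2
}
```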

3. Rewrote matchmaking as a horizontally scalable service with per-region pools and skill-based fallback
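A toy version of the skill-based fallback, again in Go; the constants (base window, widening rate) are illustrative, not tuned values from the project:

```go
// Sketch: per-region matchmaking pools with a skill window that widens
// as a ticket ages, so wait time stays bounded.
package main

import (
	"fmt"
	"time"
)

type Ticket struct {
	PlayerID string
	Skill    int
	Enqueued time.Time
}

// skillWindow widens the acceptable skill gap the longer a ticket waits:
// tight matches first, looser fallbacks later, instead of timing out.
func skillWindow(waited time.Duration) int {
	base := 50
	return base + int(waited/time.Second)*25 // +25 rating per second waited
}

// tryMatch scans one region's pool for the first acceptable pair.
func tryMatch(pool []Ticket, now time.Time) (a, b *Ticket) {
	for i := range pool {
		for j := i + 1; j < len(pool); j++ {
			gap := pool[i].Skill - pool[j].Skill
			if gap < 0 {
				gap = -gap
			}
			if gap <= skillWindow(now.Sub(pool[i].Enqueued)) {
				return &pool[i], &pool[j]
			}
		}
	}
	return nil, nil
}

func main() {
	now := time.Now()
	pool := []Ticket{ // one pool per region; this one is "eu-west"
		{"p1", 1500, now.Add(-8 * time.Second)},
		{"p2", 1650, now.Add(-1 * time.Second)},
	}
	if a, b := tryMatch(pool, now); a != nil {
		fmt.Printf("matched %s vs %s\n", a.PlayerID, b.PlayerID)
	}
}
```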

4. Built a content-drop pipeline using progressive rollout and feature flags so live updates didn't require downtime
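A minimal sketch of percentage-based rollout behind a feature flag, with a stable hash keeping bucket assignments consistent; the flag name and percentages are invented:

```go
// Sketch: a progressive-rollout flag check. A stable hash of the player
// ID maps each player into a bucket, so raising the rollout percentage
// only ever adds players, never flips anyone back.
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket deterministically places a player in [0,100) for a given flag.
func bucket(flag, playerID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(flag + ":" + playerID))
	return h.Sum32() % 100
}

// enabled gates new content behind a rollout percentage that ops can
// raise live (5% -> 25% -> 100%) with no maintenance window.
func enabled(flag, playerID string, rolloutPct uint32) bool {
	return bucket(flag, playerID) < rolloutPct
}

func main() {
	for _, pct := range []uint32{5, 25, 100} {
		fmt.Printf("season2_map at %d%%: player p1 -> %v\n",
			pct, enabled("season2_map", "p1", pct))
	}
}
```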

5. Stood up region-aware load testing that simulated launch-day traffic curves, not synthetic peaks
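A sketch of driving load from a curve rather than a flat peak; the curve function below is a made-up illustration of the shape, not the studio's real projection:

```go
// Sketch: replaying a launch-shaped traffic curve instead of a flat
// synthetic peak. It is the shape (ramp rate, unlock spike) that breaks
// systems, not the final height.
package main

import (
	"fmt"
	"math"
)

// launchCurve returns target concurrent users at minute t: a steep
// logistic ramp at unlock, a spike, then settling toward a plateau.
func launchCurve(t, peak float64) float64 {
	ramp := peak / (1 + math.Exp(-(t-30)/6))            // ramp to peak
	spike := 0.3 * peak * math.Exp(-((t-35)*(t-35))/50) // unlock-hour spike
	return ramp + spike
}

func main() {
	// A load driver would read these targets per region and per minute.
	for t := 0.0; t <= 90; t += 15 {
		fmt.Printf("t=%3.0fmin target CCU ≈ %8.0f\n", t, launchCurve(t, 2_000_000))
	}
}
```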

6. Wired observability into the game client so we could see player-side errors, not just server-side ones
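A hedged sketch of client-side error batching; the endpoint and payload fields are hypothetical, not the actual telemetry schema:

```go
// Sketch: a client-side error hook that batches player-visible failures
// and ships them to the backend, so "it failed on my screen" shows up in
// the same dashboards as server errors.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type ClientError struct {
	PlayerID string `json:"player_id"`
	Screen   string `json:"screen"` // where the player saw the failure
	Code     string `json:"code"`   // e.g. "INVENTORY_SYNC_TIMEOUT"
	Detail   string `json:"detail"`
}

// flush sends a batch in one request; in a real client this runs on a
// timer and drops telemetry before it ever drops frames.
func flush(endpoint string, batch []ClientError) error {
	body, err := json.Marshal(batch)
	if err != nil {
		return err
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println("telemetry flush:", resp.Status)
	return nil
}

func main() {
	batch := []ClientError{{
		PlayerID: "p1", Screen: "inventory",
		Code: "INVENTORY_SYNC_TIMEOUT", Detail: "retry 3 exhausted",
	}}
	// Hypothetical endpoint; expected to fail when run offline.
	if err := flush("https://telemetry.example.com/v1/client-errors", batch); err != nil {
		fmt.Println("flush failed:", err)
	}
}
```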

Outcomes

What shipped, and what it changed

Measured results from the engagement, told as a story rather than a scoreboard.

  • Held 2M concurrent players at launch with zero critical incidents

  • Reduced matchmaking p95 wait time from 90s in beta to under 12s at launch

  • Eliminated maintenance-window content drops - now 100% of updates ship live

  • Cut backend infrastructure cost-per-CCU by 35% versus the beta architecture

  • Reduced launch-week support tickets by 60% versus the studio's previous title


Lessons

What I would tell the next team

The takeaways I carry into every similar engagement.

Load test the curve, not the peak. The shape of launch traffic kills more games than the height

Event sourcing for inventory is non-negotiable in a live-service game. Players will find every duplication bug within an hour

Region affinity beats global cleverness. Latency wins arguments

More patterns and playbooks live in Insights.

Have a similar challenge?

If any of this looks like the project on your desk, the conversation is the cheapest part. You can also browse other gaming work or the full service list.
