Live-Service Gaming Backend
From beta crashes to a stable live service at 2M concurrent players
A live-service gaming client
A live-service game shipped its closed beta and immediately fell over at 80,000 concurrent players - well below the launch target. Matchmaking timed out, inventory writes were silently lost, and the leaderboard service became the de facto authoritative source for user data because it was the only service still responding. I came in to redesign the backend before launch: per-region authoritative shards for player state, an event-sourced inventory ledger, a matchmaker that could scale horizontally, and a content-deploy pipeline that didn't require a maintenance window. We shipped on time and held 2M concurrent players at launch with no critical incidents.
This is a representative architecture study based on real project patterns. Specific metrics and client details have been generalized to protect confidentiality.
Results
What changed, in numbers
The metrics the engagement is measured by.
2M
Peak CCU
concurrent players at launch
<12s
Match Wait
p95, down from 90s in beta
0
Critical Incidents
during launch week
-35%
Cost per CCU
versus beta architecture
Challenge
What was broken
A backend designed for a single-region beta could not handle a global launch. The player-state service ran on a single Postgres primary, matchmaking was a Python script with global locks, and content drops required taking the game offline. Players had already started speedrunning the bugs as a meme, which is fun but bad for retention.
Solution
The shape of the fix
A regionally sharded, event-sourced backend with horizontally scalable matchmaking and a no-downtime content pipeline - sized and load-tested against the actual launch curve, not a synthetic peak.
Approach
How I tackled it
The concrete moves that took the project from broken to shipped.
Sharded the player-state service per region with a global identity layer for cross-region transfers (first sketch after this list)
Replaced the inventory service with an event-sourced ledger so duplicate writes became debug data, not lost items (second sketch below)
Rewrote matchmaking as a horizontally scalable service with per-region pools and skill-based fallback (third sketch below)
Built a content-drop pipeline using progressive rollout and feature flags so live updates didn't require downtime (fourth sketch below)
Stood up region-aware load testing that simulated launch-day traffic curves, not synthetic peaks
Wired observability into the game client so we could see player-side errors, not just server-side ones
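The sketches below illustrate the items above in simplified Python. First, region sharding behind a global identity layer: every read and write routes to the player's home-region shard, and a cross-region transfer moves the state before flipping the identity pointer. The names (ShardRouter, PlayerIdentity, transfer_home_region) are hypothetical, and in-memory dicts stand in for the replicated stores the real system used.

```python
# Minimal sketch of region-sharded player state behind a global identity layer.
# All names are illustrative, not the project's actual API; real storage would
# be replicated databases, not Python dicts.
from dataclasses import dataclass, field


@dataclass
class PlayerIdentity:
    player_id: str
    home_region: str          # authoritative region for this player's state


@dataclass
class ShardRouter:
    # Global identity layer: small, replicated, keyed by player_id.
    identities: dict = field(default_factory=dict)
    # One authoritative player-state store per region.
    regional_shards: dict = field(default_factory=dict)

    def register(self, player_id: str, region: str) -> None:
        self.identities[player_id] = PlayerIdentity(player_id, region)
        self.regional_shards.setdefault(region, {})[player_id] = {}

    def shard_for(self, player_id: str) -> dict:
        # All reads and writes for a player go to their home-region shard.
        region = self.identities[player_id].home_region
        return self.regional_shards[region]

    def transfer_home_region(self, player_id: str, new_region: str) -> None:
        # Cross-region transfer: move the state, then flip the identity pointer.
        ident = self.identities[player_id]
        state = self.regional_shards[ident.home_region].pop(player_id)
        self.regional_shards.setdefault(new_region, {})[player_id] = state
        ident.home_region = new_region


router = ShardRouter()
router.register("p123", "eu-west")
router.shard_for("p123")["p123"]["display_name"] = "Alys"
router.transfer_home_region("p123", "us-east")
assert router.shard_for("p123")["p123"]["display_name"] == "Alys"
```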
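Second, the event-sourced inventory ledger. The point of the design is that a retried or duplicated write carries the same idempotency key, so it is recorded as debug data instead of granting the item twice, and current inventory is a fold over the log. InventoryLedger and GrantItem are illustrative names; a real ledger would persist events durably and snapshot for replay speed.

```python
# Minimal sketch of an event-sourced inventory ledger with idempotent appends.
from dataclasses import dataclass
from collections import Counter


@dataclass(frozen=True)
class GrantItem:
    event_id: str     # idempotency key chosen by the producer
    player_id: str
    item_id: str
    quantity: int


class InventoryLedger:
    def __init__(self) -> None:
        self._events: list[GrantItem] = []
        self._seen: set[str] = set()
        self.duplicates: list[GrantItem] = []   # kept as debug data, never applied

    def append(self, event: GrantItem) -> bool:
        # A retried or duplicated write shows up as a repeated event_id:
        # record it for debugging instead of granting the item twice.
        if event.event_id in self._seen:
            self.duplicates.append(event)
            return False
        self._seen.add(event.event_id)
        self._events.append(event)
        return True

    def inventory(self, player_id: str) -> Counter:
        # Current state is a pure fold over the event log.
        counts: Counter = Counter()
        for e in self._events:
            if e.player_id == player_id:
                counts[e.item_id] += e.quantity
        return counts


ledger = InventoryLedger()
ledger.append(GrantItem("evt-1", "p123", "starter_blade", 1))
ledger.append(GrantItem("evt-1", "p123", "starter_blade", 1))  # retry: logged, not applied
assert ledger.inventory("p123")["starter_blade"] == 1
assert len(ledger.duplicates) == 1
```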
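Third, per-region matchmaking pools with skill-based fallback: the acceptable skill spread widens the longer a ticket waits, so low-population regions and off-peak hours still form matches. Ticket, Matchmaker, and the band parameters are assumptions for illustration, not production values, and the real service sharded pools across many workers.

```python
# Minimal sketch of per-region matchmaking pools with a skill band that widens
# the longer a ticket waits.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    player_id: str
    region: str
    skill: float
    waited_s: float = 0.0


def band_for_wait(waited_s: float, base: float = 50.0, per_second: float = 10.0) -> float:
    # Fallback: accept a wider skill spread the longer the player has waited.
    return base + per_second * waited_s


class Matchmaker:
    def __init__(self, team_size: int = 2) -> None:
        self.team_size = team_size
        self.pools: dict[str, list[Ticket]] = {}   # one pool per region

    def enqueue(self, ticket: Ticket) -> None:
        self.pools.setdefault(ticket.region, []).append(ticket)

    def try_match(self, region: str) -> Optional[list[Ticket]]:
        pool = sorted(self.pools.get(region, []), key=lambda t: t.skill)
        # Scan skill-adjacent candidate groups and accept the first one whose
        # spread fits within the widest band any member currently tolerates.
        for i in range(len(pool) - self.team_size + 1):
            group = pool[i : i + self.team_size]
            spread = group[-1].skill - group[0].skill
            if spread <= max(band_for_wait(t.waited_s) for t in group):
                for t in group:
                    self.pools[region].remove(t)
                return group
        return None


mm = Matchmaker()
mm.enqueue(Ticket("a", "eu-west", 1500))
mm.enqueue(Ticket("b", "eu-west", 1540))
print(mm.try_match("eu-west"))   # spread of 40 fits the base band -> matched
```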
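Fourth, the progressive-rollout gate behind the content-drop pipeline. Hashing the player ID into a stable bucket means raising the rollout percentage only ever adds players, so a bad drop can be halted or reverted without a maintenance window. The flag name and in-process dict are placeholders for whatever flag service you run.

```python
# Minimal sketch of a percentage-based feature flag for live content drops.
import hashlib

# Rollout state per content drop, normally served by a flag service.
ROLLOUTS = {"season2_map": 10}   # percent of players who see the new content


def bucket(player_id: str, flag: str) -> int:
    # Deterministic 0-99 bucket per (player, flag) pair.
    digest = hashlib.sha256(f"{flag}:{player_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def content_enabled(player_id: str, flag: str) -> bool:
    return bucket(player_id, flag) < ROLLOUTS.get(flag, 0)


# Raising the percentage only ever adds players; nobody flips back off,
# so the rollout can pause at 10% if error rates spike.
players = ("p1", "p2", "p3", "p4")
enabled_at_10 = {p for p in players if content_enabled(p, "season2_map")}
ROLLOUTS["season2_map"] = 50
enabled_at_50 = {p for p in players if content_enabled(p, "season2_map")}
assert enabled_at_10 <= enabled_at_50
```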
Outcomes
What shipped, and what it changed
Measured results from the engagement, told as a story rather than a scoreboard.
Held 2M concurrent players at launch with zero critical incidents
Reduced matchmaking p95 wait time from 90s in beta to under 12s at launch
Eliminated maintenance-window content drops - now 100% of updates ship live
Cut backend infrastructure cost-per-CCU by 35% versus the beta architecture
Reduced launch-week support tickets by 60% versus the studio's previous title
Stack
Technologies used
Linked entries open the technology page with related studies, playbooks, and notes.
Services
How I helped
The specific services involved in this engagement. Each links to a deeper breakdown.
Lessons
What I would tell the next team
The takeaways I carry into every similar engagement.
Load test the curve, not the peak. The shape of launch traffic kills more games than the height (see the sketch after this list)
Event sourcing for inventory is non-negotiable in a live-service game. Players will find every duplication bug within an hour
Region affinity beats global cleverness. Latency wins arguments
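As a rough illustration of the first lesson, this is the kind of profile a launch load test should replay per region: a steep ramp as the whole player base logs in at unlock, followed by a slow decay. The shape and parameters here are assumptions for illustration, not the project's real launch data; it is the ramp, not the flat peak, that exposes lock contention and cold caches.

```python
# Minimal sketch of a launch-curve load profile instead of a flat synthetic peak.
def launch_curve_ccu(minutes_after_unlock: float, peak_ccu: int = 2_000_000,
                     ramp_minutes: float = 45.0, half_life_hours: float = 6.0) -> int:
    """Approximate concurrent players t minutes after a region unlocks."""
    if minutes_after_unlock < ramp_minutes:
        # Steep, near-linear ramp as the whole player base logs in at once.
        return int(peak_ccu * minutes_after_unlock / ramp_minutes)
    # Slow exponential decay after the peak; the ramp is what breaks systems.
    hours_past_peak = (minutes_after_unlock - ramp_minutes) / 60.0
    return int(peak_ccu * 0.5 ** (hours_past_peak / half_life_hours))


# A load test then replays this curve per region rather than holding a flat peak.
for t in (0, 15, 45, 120, 480):
    print(f"t+{t:>3} min: {launch_curve_ccu(t):>9,} CCU")
```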
Related
Other studies you might recognize
Engagements with overlapping problem shapes, industries, or stacks.
Have a similar challenge?
If any of this looks like the project on your desk, the conversation is the cheapest part. You can also browse other gaming work or the full service list.