GAME LEVELS

Choose a level

40 system design challenges across four formats. Build, fix, optimise, design — each level teaches one clear concept.

BEGINNER
A-01 · SURVIVE · AVAILABLE
First Deploy
One EC2 instance. Traffic ramps 100→800 req/s. Keep uptime above 95% for 90 seconds.
Vertical scaling, EC2, RDS
A-02 · SURVIVE · AVAILABLE
Scalable Web App
Traffic doubled overnight. Add a load balancer and scale horizontally.
ALB, Horizontal scaling, EC2 fleet
B-01 · INCIDENT · AVAILABLE
Database on Fire
RDS connections are maxed. Users timing out. Fix it before SLA breach.
Connection pooling, ElastiCache, RDS
B-02 · INCIDENT · AVAILABLE
The Stampede
Cache expired at 3am. Every user hit the DB simultaneously. Stop the bleeding.
Cache stampede, TTL jitter, ElastiCache
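The TTL jitter idea tagged on this level can be sketched in a few lines. This is a minimal illustration, not the game's solution: `jittered_ttl`, the 300-second base TTL, and the 10% jitter fraction are all assumptions chosen for the example. The point is that keys written at the same moment no longer expire at the same moment.

```python
import random
from typing import Optional


def jittered_ttl(base_ttl_s: float, jitter_fraction: float = 0.1,
                 rng: Optional[random.Random] = None) -> float:
    """Return base_ttl_s shifted by a uniform offset of up to +/- jitter_fraction."""
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    rng = rng if rng is not None else random.Random()
    spread = base_ttl_s * jitter_fraction
    return base_ttl_s + rng.uniform(-spread, spread)


if __name__ == "__main__":
    # Hypothetical keys written together each get a slightly different expiry.
    for key in ["user:1", "user:2", "user:3", "user:4", "user:5"]:
        print(key, round(jittered_ttl(300.0), 1))
```

Jitter spreads expiries out but does not stop many callers from refilling one hot key at once; that usually needs a separate guard such as a per-key lock.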
B-03 · INCIDENT · AVAILABLE
Single Point Down
Your load balancer is a single node. It just crashed. All traffic is dead.
SPOF, ALB, Redundancy
C-01 · COST · AVAILABLE
The AWS Bill
Overprovisioned fleet burning $12k/mo. Cut to $6k without downtime.
Right-sizing, EC2, Reserved instances
C-02 · COST · AVAILABLE
Always-On Dev Env
Dev/staging stack runs 24/7. It only needs 8 hours a day.
Scheduled stop/start, Lambda, Cost lifecycle
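The arithmetic behind this level is small enough to sketch. The function names and the `days_per_week` parameter are hypothetical; the card only states 8 hours a day, so the weekday-only variant is an added assumption. The figures cover instance-hours only, since storage attached to stopped instances typically still bills.

```python
HOURS_PER_WEEK = 24 * 7


def running_fraction(hours_per_day: float, days_per_week: int = 7) -> float:
    """Fraction of the week the stack is actually running."""
    return (hours_per_day * days_per_week) / HOURS_PER_WEEK


def compute_hours_saved(hours_per_day: float, days_per_week: int = 7) -> float:
    """Fraction of instance-hours avoided compared with running 24/7."""
    return 1.0 - running_fraction(hours_per_day, days_per_week)


if __name__ == "__main__":
    print(f"8 h/day, every day: {compute_hours_saved(8):.0%} fewer instance-hours")
    print(f"8 h/day, weekdays:  {compute_hours_saved(8, 5):.0%} fewer instance-hours")
```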
C-03 · COST · AVAILABLE
Reserved vs On-Demand
All your EC2 is on-demand. Baseline load is predictable. Commit and save.
Reserved instances, Savings plans, On-demand
D-01 · DESIGN · AVAILABLE
Blog Platform
50k readers/day, 100 writers. 95% reads. Budget: $500/mo. Uptime: 99.5%.
CloudFront, S3, ALB, RDS
D-02 · DESIGN · AVAILABLE
URL Shortener
1B URLs, 10k writes/s, 100k reads/s, global P99 < 50 ms. Budget: $3k/mo.
DynamoDB, ElastiCache, CloudFront, Lambda
D-03 · DESIGN · AVAILABLE
File Upload Service
10k concurrent uploads up to 5 GB. Virus scan required. Budget: $1k/mo.
S3, Lambda, SQS, API Gateway
INTERMEDIATE
A-03 · SURVIVE · AVAILABLE
Static Asset Storm
80% of requests are images. Your origin is drowning. Offload with CDN.
CloudFront, S3, Origin offload
A-04 · SURVIVE · AVAILABLE
Cache or Die
Read-heavy traffic is hammering RDS directly. Add caching before it collapses.
ElastiCache, Cache-aside, RDS
A-05 · SURVIVE · AVAILABLE
Write Thunderstorm
A viral event sends 5,000 writes/s. RDS is the bottleneck. Buffer them.
SQS, Write buffering, Async processing
A-06 · SURVIVE · AVAILABLE
Flash Sale
10× traffic spike in 60 seconds. Auto-scaling, queue buffering, rate limiting.
Auto-scaling, Flash sale, Traffic spike
B-04 · INCIDENT · AVAILABLE
Cascading Failure
One slow service is timing out callers. Timeouts are backing up everywhere.
Circuit breaker, Retry storm, ElastiCache
B-05 · INCIDENT · AVAILABLE
Hot Cache Flush
Ops flushed the cache for a deploy. All 2M users hit the DB cold.
Cache warming, RDS read replica, Cold start
B-06 · INCIDENT · AVAILABLE
Queue Backup
SQS queue depth at 2 million. Workers can't keep up. Jobs are expiring.
SQS consumers, Dead-letter queue, Queue scaling
B-07 · INCIDENT · AVAILABLE
Read Replica Lag
Read replica is 45 seconds behind primary. Analytics reads are stale. Reports are wrong.
RDS replication, Replica lag, Read topology
C-04 · COST · AVAILABLE
Spot Fleet Swap
Stateless web tier runs on On-Demand instances. Swap to Spot and save 70%.
Spot Instances, Cost optimisation, Interruption handling
C-05 · COST · AVAILABLE
EBS to S3 Migration
50 TB of user uploads on EBS at $0.10/GB. Move them to S3 at $0.023/GB.
S3, EBS, Storage cost, Zero-downtime migration
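The saving implied by this level's two prices can be worked out directly. This is a rough sketch under stated assumptions: the card's prices are treated as per GB-month, 1 TB is taken as 1,000 GB, and request, transfer, and migration costs are ignored. `monthly_storage_cost` is a hypothetical helper name.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_storage_cost(size_tb: float, price_per_gb_month: float) -> float:
    """Storage-only monthly cost in dollars for size_tb at the given unit price."""
    return size_tb * GB_PER_TB * price_per_gb_month


ebs_monthly = monthly_storage_cost(50, 0.10)   # price taken from the level card
s3_monthly = monthly_storage_cost(50, 0.023)   # price taken from the level card
monthly_saving = ebs_monthly - s3_monthly

if __name__ == "__main__":
    print(f"EBS: ${ebs_monthly:,.0f}/mo")
    print(f"S3:  ${s3_monthly:,.0f}/mo")
    print(f"Saving: ${monthly_saving:,.0f}/mo")
```

Under those assumptions the bill drops from about $5,000 to about $1,150 a month, a saving of roughly $3,850.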
C-06 · COST · AVAILABLE
Lambda vs Always-On
Your report generator runs once a night. It lives on a $300/mo EC2.
Lambda, Event-driven, Serverless cost
C-07 · COST · AVAILABLE
Cache Pays for Itself
RDS is over-spec'd to handle read load. One cache node costs less than the DB upgrade.
ElastiCache ROI, Right-sizing, Cache economics
D-04 · DESIGN · AVAILABLE
Social Feed
500k users, reads 20× writes, P95 feed load < 100 ms. Budget: $3k/mo.
Feed caching, Read replicas, Fan-out
D-05 · DESIGN · AVAILABLE
Notification System
Send 10M notifications/day. Guaranteed delivery. At-least-once. Budget: $2k/mo.
SNS fan-out, SQS queues, Lambda consumers
D-06 · DESIGN · AVAILABLE
Real-time Dashboard
100k events/s from IoT sensors. Dashboard refreshes every 5s. 30-day history.
Kinesis, Lambda, DynamoDB, Streaming
D-07 · DESIGN · AVAILABLE
Rate Limiter Service
50k tenants. Per-tenant rate limit: 1k req/s. Enforce globally. P99 overhead < 5 ms.
Token bucket, ElastiCache, Rate limiting
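The token bucket named in this level's tags can be sketched as a small in-process class. This is only an illustration of the algorithm: `TokenBucket` and its parameters are hypothetical names, and the state lives in one process. Enforcing the limit globally, as the level asks, would require keeping that state in a shared store, which this sketch does not attempt.

```python
import time
from typing import Optional


class TokenBucket:
    """Refills at rate_per_s tokens per second, holding at most capacity tokens."""

    def __init__(self, rate_per_s: float, capacity: float,
                 now: Optional[float] = None) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None, cost: float = 1.0) -> bool:
        """Return True and spend cost tokens if enough are available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill for the time elapsed, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # One bucket per tenant, using the card's 1k req/s figure as the rate.
    buckets = {"tenant-a": TokenBucket(rate_per_s=1000, capacity=1000)}
    print(buckets["tenant-a"].allow())
```

The `now` argument exists so the clock can be injected in tests; production callers would normally omit it.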
ADVANCED
A-07 · SURVIVE · AVAILABLE
Multi-AZ Under Fire
Traffic peaks. Then AZ-1 dies mid-ramp. You can't afford downtime.
Multi-AZ, Failover, RDS Multi-AZ
A-08 · SURVIVE · AVAILABLE
Hot Partition
One DynamoDB partition absorbs 90% of requests. The table is throttling.
DynamoDB, DAX, Partition keys
A-09 · SURVIVE · AVAILABLE
Lambda Stampede
Serverless functions cold-starting under load. Concurrency limits hit.
Lambda, Provisioned concurrency, API Gateway
A-10 · SURVIVE · AVAILABLE
Black Friday
Full-stack stress test. 50× normal traffic. Everything must hold.
CDN, Auto Scaling, SQS, ElastiCache
B-08 · INCIDENT · AVAILABLE
Region Down
us-east-1 is degraded. Reroute to failover region before SLA burns.
Route 53, Multi-region, Failover
B-09 · INCIDENT · AVAILABLE
DDoS Under Way
500k req/s incoming. Bot traffic is real. Your origin is drowning.
WAF, CloudFront, Rate limiting
B-10 · INCIDENT · AVAILABLE
Split Brain
Network partition created two RDS primaries. Data is diverging.
Split brain, Leader election, RDS
C-08 · COST · AVAILABLE
Cross-AZ Data Transfer
App in AZ-1 reads from DB in AZ-2. $0.01/GB × 50 TB/mo = surprise bill.
Cross-AZ costs, AZ placement, Data transfer
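The "surprise bill" in this level's description can be made concrete. This sketch uses only the card's own numbers and assumes 1 TB is 1,000 GB and that the $0.01/GB rate is charged once per gigabyte moved; `cross_az_transfer_cost` is a hypothetical helper name.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def cross_az_transfer_cost(tb_per_month: float, price_per_gb: float = 0.01) -> float:
    """Monthly cross-AZ transfer cost in dollars at a flat per-GB rate."""
    return tb_per_month * GB_PER_TB * price_per_gb


monthly_bill = cross_az_transfer_cost(50)  # 50 TB/mo, from the level card

if __name__ == "__main__":
    print(f"Cross-AZ transfer: ${monthly_bill:,.0f}/mo")
```

Under those assumptions the transfer line alone comes to about $500 a month, which is the cost that co-locating the app and database in one AZ would remove.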
C-09 · COST · AVAILABLE
Cold Storage Tiering
200 TB in S3 Standard. Only 5% accessed in the last 90 days. Tier the rest.
S3 Glacier, Lifecycle policy, Storage classes
C-10 · COST · AVAILABLE
Monolith to Serverless
Always-on monolith serves bursty, low-frequency traffic. Replatform it.
Lambda, DynamoDB, API Gateway, Serverless
D-08 · DESIGN · AVAILABLE
Ride-sharing Backend
1M active riders, 100k drivers. Match in < 500 ms. Location updates 1/s.
DynamoDB, Lambda, SNS, Geospatial
D-09 · DESIGN · AVAILABLE
Video Streaming Platform
10M viewers/day, 1M concurrent. Adaptive bitrate. Global < 2s start.
CloudFront, S3, Lambda, HLS
D-10 · DESIGN · AVAILABLE
The Interview
FAANG-style system design. 20-minute clock. Design for 1B users.
All node types, Synthesis, Trade-offs