GAME LEVELS
Choose a level
40 system design challenges across four formats. Build, fix, optimise, design: each level teaches one clear concept. Start with 4 free sample levels; Pro unlocks the remaining 36.
All Levels Survive Incident Cost Design
01
A-01 SURVIVE FREE SAMPLE
First Deploy
One EC2. Traffic ramps 100→800 req/s. Keep uptime above 95% for 90 seconds.
Vertical scaling EC2 RDS
02
A-02 SURVIVE PRO
Scalable Web App
Traffic doubled overnight. Add a load balancer and scale horizontally.
ALB Horizontal scaling EC2 fleet
01
B-01 INCIDENT FREE SAMPLE
Database on Fire
RDS connections are maxed. Users timing out. Fix it before SLA breach.
Connection pooling ElastiCache RDS
02
B-02 INCIDENT PRO
The Stampede
Cache expired at 3am. Every user hit the DB simultaneously. Stop the bleeding.
Cache stampede TTL jitter ElastiCache
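The TTL jitter idea named in this level's tags can be sketched as follows. This is a minimal illustration, not the game's own solution; the base TTL, the jitter fraction and the function name are all hypothetical.

```python
import random

BASE_TTL_SECONDS = 300   # hypothetical base TTL, not taken from the level
JITTER_FRACTION = 0.1    # hypothetical: spread expiry by +/-10%


def jittered_ttl(base=BASE_TTL_SECONDS, fraction=JITTER_FRACTION, rng=random):
    """Return a TTL drawn uniformly from [base*(1-fraction), base*(1+fraction)].

    Keys written at the same moment then expire at slightly different
    times, so they do not all miss the cache in the same instant.
    """
    spread = base * fraction
    return base + rng.uniform(-spread, spread)


# Five keys cached together now get five different expiry times.
ttls = [jittered_ttl() for _ in range(5)]
```

Jitter only spreads expiries out; a single very hot key can still stampede on its own, which would need a separate measure such as request coalescing.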
03
B-03 INCIDENT PRO
Single Point Down
Your load balancer is a single node. It just crashed. All traffic is dead.
SPOF ALB Redundancy
01
C-01 COST FREE SAMPLE
The AWS Bill
Overprovisioned fleet burning $12k/mo. Cut to $6k without downtime.
Right-sizing EC2 Reserved instances
02
C-02 COST PRO
Always-On Dev Env
Dev/staging stack runs 24/7. It only needs 8 hours a day.
Scheduled stop/start Lambda Cost lifecycle
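A rough sketch of the arithmetic behind this level, assuming the stack is needed 8 hours every day (the level does not say whether weekends count) and that only per-hour compute charges stop when instances are stopped. The function name is illustrative.

```python
HOURS_ALWAYS_ON = 24   # current schedule, from the level description
HOURS_NEEDED = 8       # required schedule, from the level description


def compute_hours_saved(needed=HOURS_NEEDED, always_on=HOURS_ALWAYS_ON):
    """Fraction of instance-hours avoided by running only when needed."""
    return 1 - needed / always_on


saved = compute_hours_saved()
print(f"Instance-hours avoided: {saved:.0%}")
```

The result is a ceiling on the saving, not the saving itself: storage attached to stopped instances typically keeps billing.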
03
C-03 COST PRO
Reserved vs On-Demand
All your EC2 is on-demand. Baseline load is predictable. Commit and save.
Reserved instances Savings plans On-demand
01
D-01 DESIGN FREE SAMPLE
Blog Platform
50k readers/day, 100 writers. 95% reads. Budget: $500/mo. Uptime: 99.5%.
CloudFront S3 ALB RDS
02
D-02 DESIGN PRO
URL Shortener
1B URLs, 10k writes/s, 100k reads/s, global P99 < 50 ms. Budget: $3k/mo.
DynamoDB ElastiCache CloudFront Lambda
03
D-03 DESIGN PRO
File Upload Service
10k concurrent uploads up to 5GB. Virus scan required. Budget: $1k/mo.
S3 Lambda SQS API Gateway
03
A-03 SURVIVE PRO
Static Asset Storm
80% of requests are images. Your origin is drowning. Offload with CDN.
CloudFront S3 Origin offload
04
A-04 SURVIVE PRO
Cache or Die
Read-heavy traffic is hammering RDS directly. Add caching before it collapses.
ElastiCache Cache-aside RDS
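The cache-aside pattern in the tags could look roughly like this. It is a sketch under loud assumptions: a plain dict stands in for ElastiCache, and `load_from_db` is a hypothetical callable standing in for an RDS query.

```python
class CacheAsideReader:
    """Read path for cache-aside: check the cache, fall back to the DB, then fill."""

    def __init__(self, load_from_db):
        self.load_from_db = load_from_db  # hypothetical stand-in for an RDS query
        self.cache = {}                   # plain dict standing in for ElastiCache

    def get(self, key):
        if key in self.cache:             # cache hit: the database is not touched
            return self.cache[key]
        value = self.load_from_db(key)    # cache miss: read from the database
        self.cache[key] = value           # populate the cache for later readers
        return value


db_calls = []


def fake_db(key):
    db_calls.append(key)
    return f"row:{key}"


reader = CacheAsideReader(fake_db)
reader.get("user:1")   # miss, goes to the database
reader.get("user:1")   # hit, served from the cache
```

The sketch leaves out expiry and invalidation on write, both of which a real cache-aside setup needs.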
05
A-05 SURVIVE PRO
Write Thunderstorm
A viral event sends 5,000 writes/s. RDS is the bottleneck. Buffer them.
SQS Write buffering Async processing
06
A-06 SURVIVE PRO
Flash Sale
10× traffic spike in 60 seconds. Auto-scaling, queue buffering, rate limiting.
Auto-scaling Flash sale Traffic spike
04
B-04 INCIDENT PRO
Cascading Failure
One slow service is timing out callers. Timeouts are backing up everywhere.
Circuit breaker Retry storm ElastiCache
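A minimal circuit breaker sketch for the failure mode described above. The thresholds, class names and the injected clock are assumptions for illustration; a production breaker would also need thread safety and per-dependency state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of queuing behind a slow dependency.
                raise CircuitOpenError("dependency presumed unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open, or re-open after a failed trial
            raise
        self.failures = 0                           # success closes the circuit
        self.opened_at = None
        return result
```

Callers wrap each outbound request in `breaker.call(...)` and treat `CircuitOpenError` as a cue to serve a fallback rather than retry, which is what keeps a retry storm from forming.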
05
B-05 INCIDENT PRO
Hot Cache Flush
Ops flushed the cache for a deploy. All 2M users hit DB cold.
Cache warming RDS read replica Cold start
06
B-06 INCIDENT PRO
Queue Backup
SQS queue depth at 2 million. Workers can't keep up. Jobs are expiring.
SQS consumers Dead-letter queue Queue scaling
07
B-07 INCIDENT PRO
Read Replica Lag
Read replica is 45 seconds behind primary. Analytics reads are stale. Reports wrong.
RDS replication Replica lag Read topology
04
C-04 COST PRO
Spot Fleet Swap
Stateless web tier runs entirely on On-Demand instances. Swap to Spot and save 70%.
Spot Instances Cost optimisation Interruption handling
05
C-05 COST PRO
EBS to S3 Migration
50 TB of user uploads on EBS at $0.10/GB-month. Move them to S3 at $0.023/GB-month.
S3 EBS Storage cost Zero-downtime migration
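The numbers in this level work out roughly as below. The sketch assumes the quoted prices are per GB per month and that 1 TB is 1,000 GB; it ignores request, transfer and migration costs. The function name is illustrative.

```python
from decimal import Decimal

GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_storage_cost(terabytes, price_per_gb):
    """Monthly cost in dollars for a volume of data at a flat per-GB price."""
    return Decimal(terabytes) * GB_PER_TB * Decimal(price_per_gb)


ebs_cost = monthly_storage_cost(50, "0.10")    # figures from the level text
s3_cost = monthly_storage_cost(50, "0.023")
saving = ebs_cost - s3_cost

print(f"EBS ${ebs_cost:,.0f}/mo, S3 ${s3_cost:,.0f}/mo, saving ${saving:,.0f}/mo")
```

Under those assumptions the storage line drops from $5,000 to $1,150 a month.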
06
C-06 COST PRO
Lambda vs Always-On
Your report generator runs once a night. It lives on a $300/mo EC2.
Lambda Event-driven Serverless cost
07
C-07 COST PRO
Cache Pays for Itself
RDS is over-spec'd to handle read load. One cache node costs less than the DB upgrade.
ElastiCache ROI Right-sizing Cache economics
04
D-04 DESIGN PRO
Social Feed
500k users, reads 20× writes, P95 feed load < 100 ms. Budget: $3k/mo.
Feed caching Read replicas Fan-out
05
D-05 DESIGN PRO
Notification System
Send 10M notifications/day. Guaranteed delivery. At-least-once. Budget: $2k/mo.
SNS fan-out SQS queues Lambda consumers
06
D-06 DESIGN PRO
Real-time Dashboard
100k events/s from IoT sensors. Dashboard refreshes every 5s. 30-day history.
Kinesis Lambda DynamoDB Streaming
07
D-07 DESIGN PRO
Rate Limiter Service
50k tenants. Per-tenant rate limit: 1k req/s. Enforce globally. P99 overhead < 5ms.
Token bucket ElastiCache Rate limiting
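A token bucket, as named in the tags, can be sketched like this. It is a single-process, in-memory illustration only: the level's "enforce globally" requirement would need the bucket state in a shared store such as ElastiCache, which is not shown. The class name, parameters and injected clock are assumptions.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                 # tokens added per second
        self.capacity = capacity         # maximum tokens the bucket can hold
        self.clock = clock
        self.tokens = float(capacity)    # start full
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant, created on first use (hypothetical helper).
buckets = {}


def allow_request(tenant_id, rate=1000, capacity=1000):
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

Refilling lazily on each call avoids a background timer per tenant, which matters when there are 50k of them.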
07
A-07 SURVIVE PRO
Multi-AZ Under Fire
Traffic peaks. Then AZ-1 dies mid-ramp. You can't afford downtime.
Multi-AZ Failover RDS Multi-AZ
08
A-08 SURVIVE PRO
Hot Partition
One DynamoDB partition absorbs 90% of requests. The table is throttling.
DynamoDB DAX Partition keys
09
A-09 SURVIVE PRO
Lambda Stampede
Serverless functions cold-starting under load. Concurrency limits hit.
Lambda Provisioned concurrency API Gateway
10
A-10 SURVIVE PRO
Black Friday
Full-stack stress test. 50× normal traffic. Everything must hold.
CDN Auto Scaling SQS ElastiCache
08
B-08 INCIDENT PRO
Region Down
us-east-1 is degraded. Reroute to failover region before SLA burns.
Route 53 Multi-region Failover
09
B-09 INCIDENT PRO
DDoS Under Way
500k req/s incoming. Bot traffic is real. Your origin is drowning.
WAF CloudFront Rate limiting
10
B-10 INCIDENT PRO
Split Brain
Network partition created two RDS primaries. Data is diverging.
Split brain Leader election RDS
08
C-08 COST PRO
Cross-AZ Data Transfer
App in AZ-1 reads from DB in AZ-2. $0.01/GB × 50TB/mo = surprise bill.
Cross-AZ costs AZ placement Data transfer
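The "surprise bill" can be worked out directly from the level's figures. The sketch assumes 1 TB is 1,000 GB and takes the $0.01/GB rate as quoted; the function name is illustrative.

```python
from decimal import Decimal

GB_PER_TB = 1000  # assumption: decimal terabytes


def transfer_cost(tb_per_month, price_per_gb):
    """Monthly data transfer charge in dollars at a flat per-GB rate."""
    return Decimal(tb_per_month) * GB_PER_TB * Decimal(price_per_gb)


monthly = transfer_cost(50, "0.01")   # figures from the level text
yearly = monthly * 12

print(f"Cross-AZ transfer: ${monthly:,.0f}/mo, ${yearly:,.0f}/yr")
```

Under those assumptions the charge is $500 a month, or $6,000 a year, for traffic that placing the app and the database in the same AZ would avoid.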
09
C-09 COST PRO
Cold Storage Tiering
200TB in S3 Standard. Only 5% accessed in last 90 days. Tier the rest.
S3 Glacier Lifecycle policy Storage classes
10
C-10 COST PRO
Monolith to Serverless
Always-on monolith serves bursty, low-frequency traffic. Replatform it.
Lambda DynamoDB API Gateway Serverless
08
D-08 DESIGN PRO
Ride-sharing Backend
1M active riders, 100k drivers. Match in < 500 ms. Location updates 1/s.
DynamoDB Lambda SNS Geospatial
09
D-09 DESIGN PRO
Video Streaming Platform
10M viewers/day, 1M concurrent. Adaptive bitrate. Global < 2s start.
CloudFront S3 Lambda HLS
10
D-10 DESIGN PRO
The Interview
FAANG-style system design. 20-minute clock. Design for 1B users.
All node types Synthesis Trade-offs