# System Design Evolution: Building a URL Shortener from MVP to Planet-Scale

> "Every simple system starts as a toy and evolves into infrastructure. The art lies in knowing when to evolve."
## Why This Article Exists
Most system design discussions about URL shorteners stop at "put a DB behind an API."
This piece goes all the way – from single-binary MVP to Meta-level, multi-region infrastructure – showing how and why each evolution happens.
It's meant to be both:
- A reference document for system design interviews
- A guide for engineers who actually want to build a scalable URL shortener in Go (TDD + production-grade)
## Table of Contents
- Stage 1 – MVP (Proof of Concept)
- Stage 2 – Single-Instance Production
- Stage 3 – Database Persistence
- Stage 4 – Caching Layer
- Stage 5 – Distributed System
- Stage 6 – Enterprise Scale
- Architectural Decisions
- SLOs and Failure Design
- Interview Framework: How to Use This
- Key Takeaways
- Further Reading
## Stage 1 – MVP (Proof of Concept)
| Aspect | Value |
|---|---|
| Deployment | Single binary |
| Storage | Go map + RWMutex |
| Scalability | ~1K requests/sec |
| Cost | <$50/mo |
| Reliability | Data lost on restart ❌ |
### Example (Go)

```go
type URLShortener struct {
	mu   sync.RWMutex
	data map[string]string
}
```
- ✅ Fast, tiny, deployable in minutes
- ❌ No persistence, single point of failure
**When to stop here:** hackathon projects, demos, coding challenges.
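A minimal sketch of the two operations the struct above needs; method and constructor names are illustrative, not taken from the linked repo:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type URLShortener struct {
	mu   sync.RWMutex
	data map[string]string
}

func NewURLShortener() *URLShortener {
	return &URLShortener{data: make(map[string]string)}
}

// Shorten stores key -> longURL, refusing to overwrite an existing key.
func (s *URLShortener) Shorten(key, longURL string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.data[key]; exists {
		return errors.New("key already in use")
	}
	s.data[key] = longURL
	return nil
}

// Resolve looks up the long URL for a key under a read lock,
// so concurrent redirects don't serialize on a single mutex.
func (s *URLShortener) Resolve(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	longURL, ok := s.data[key]
	return longURL, ok
}

func main() {
	s := NewURLShortener()
	_ = s.Shorten("abc12345", "https://example.com")
	fmt.Println(s.Resolve("abc12345")) // https://example.com true
}
```

`RWMutex` lets many redirects read concurrently while writes take the exclusive lock, which matches the read-heavy workload described later.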
## Stage 2 – Single-Instance Production

### New Concepts
- Graceful shutdown
- Environment-based config (`BASE_URL`, `PORT`)
- URL canonicalization
- Collision retry loop (insert-if-absent)
- ✅ Production-grade hygiene
- ❌ Still no durability or scale
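URL canonicalization deserves a concrete example: without it, trivially different spellings of the same address get distinct keys. A sketch using the standard `net/url` package follows; the exact normalization rules are a policy choice, so treat these as illustrative:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Canonicalize normalizes a URL so equivalent spellings map to one key:
// it lowercases the host, strips default ports, and drops a bare
// trailing slash on the root path.
func Canonicalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	// url.Parse already lowercases the scheme.
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	u.Host = strings.ToLower(u.Host)
	// Strip default ports so example.com:443 == example.com.
	if (u.Scheme == "http" && strings.HasSuffix(u.Host, ":80")) ||
		(u.Scheme == "https" && strings.HasSuffix(u.Host, ":443")) {
		u.Host = u.Hostname()
	}
	if u.Path == "/" {
		u.Path = ""
	}
	return u.String(), nil
}

func main() {
	c, _ := Canonicalize("HTTPS://Example.COM:443/")
	fmt.Println(c) // https://example.com
}
```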
## Stage 3 – Database Persistence

### Architecture

### Schema

```sql
CREATE TABLE urls (
    key         CHAR(8) PRIMARY KEY,
    long_url    TEXT NOT NULL,
    created_at  TIMESTAMP DEFAULT NOW(),
    click_count INT DEFAULT 0
);
CREATE INDEX idx_long_url ON urls(long_url);
```
### Why PostgreSQL?
- Easy to reason about consistency
- Native unique constraints for idempotency
- Works well with Go's `pgxpool`
- Trivial to scale reads via replicas
| Metric | Value |
|---|---|
| Latency | 20–50 ms per read |
| Throughput | ~1K req/s |
| Data durability | ✅ |
| Horizontal scaling | 🚧 Needs more |
- ✅ Durable and correct
- ❌ All reads hit the DB
- ❌ Writes limited by one primary
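With the primary key in place, the collision retry loop from Stage 2 becomes insert-if-absent: generate a key, attempt the insert, and retry on conflict. A sketch against a small `Store` interface (names are illustrative; a Postgres implementation would map `PutIfAbsent` onto `INSERT ... ON CONFLICT (key) DO NOTHING`):

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"math/big"
)

var ErrKeyTaken = errors.New("key already taken")

// Store abstracts the persistence layer behind one idempotent operation.
type Store interface {
	PutIfAbsent(key, longURL string) error
}

const alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// randomKey draws n Base62 characters from crypto/rand.
func randomKey(n int) string {
	b := make([]byte, n)
	for i := range b {
		idx, _ := rand.Int(rand.Reader, big.NewInt(int64(len(alphabet))))
		b[i] = alphabet[idx.Int64()]
	}
	return string(b)
}

// ShortenWithRetry generates keys until one inserts cleanly, bounded so a
// pathological key-space exhaustion can't loop forever.
func ShortenWithRetry(s Store, longURL string, attempts int) (string, error) {
	for i := 0; i < attempts; i++ {
		key := randomKey(8)
		err := s.PutIfAbsent(key, longURL)
		if err == nil {
			return key, nil
		}
		if !errors.Is(err, ErrKeyTaken) {
			return "", err // real failure, don't retry
		}
	}
	return "", fmt.Errorf("no free key after %d attempts", attempts)
}

// memStore is an in-memory stand-in for the database.
type memStore map[string]string

func (m memStore) PutIfAbsent(key, longURL string) error {
	if _, ok := m[key]; ok {
		return ErrKeyTaken
	}
	m[key] = longURL
	return nil
}

func main() {
	s := memStore{}
	key, err := ShortenWithRetry(s, "https://example.com", 3)
	fmt.Println(key, err)
}
```

Because the database enforces uniqueness, two concurrent writers can race on the same key and exactly one wins; the loser simply retries with a fresh key.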
## Stage 4 – Add a Caching Layer

### Architecture
**Read path:** check Redis first; on a miss, read Postgres and populate the cache.
**Write path:** insert into Postgres, then write (or invalidate) the cache entry.
### Cache Policy
- LRU: only the top 1–5% of URLs
- TTL: 24 hours
- Hit rate: 80–90% (Zipfian workloads)
| Metric | Value |
|---|---|
| Avg latency | ~5 ms |
| Peak throughput | 100K req/s |
| Cache memory | A few GB |
| DB load reduction | ~90% |
- ✅ Orders of magnitude faster
- ✅ DB load drops drastically
- ❌ Needs a cache invalidation strategy
- ❌ Cache consistency issues possible
## Stage 5 – Distributed System

### Sharded, Multi-Region Design

### Key generation
- 8-char Base62 key = 62⁸ ≈ 2.18 × 10¹⁴ combinations
- First 2 chars = shard prefix (62² = 3,844 shards)
- Predictable, uniform distribution
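Both numbers above are easy to verify, and prefix routing is a few lines of arithmetic (function name illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

const alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// ShardFor maps a key's two-character prefix onto one of 62² = 3,844 shards.
func ShardFor(key string) int {
	return strings.IndexByte(alphabet, key[0])*62 + strings.IndexByte(alphabet, key[1])
}

func main() {
	// Key-space size: 62^8 combinations.
	n := int64(1)
	for i := 0; i < 8; i++ {
		n *= 62
	}
	fmt.Println(n)                    // 218340105584896 (≈ 2.18 × 10^14)
	fmt.Println(ShardFor("00abcdef")) // 0
	fmt.Println(ShardFor("zzabcdef")) // 3843
}
```

Because keys are generated uniformly at random, every prefix (and hence every shard) receives an even share of writes with no rebalancing logic.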
### Replication Model

| Role | Responsibility |
|---|---|
| Primary | Writes |
| Replicas | Reads |
| Sync replication | Strong consistency |
| Async cross-region | DR + read latency |
### Performance
- ~500K req/s reads
- p99 latency ≈ 20 ms
- Availability ≈ 99.99%
- ✅ Scale by adding shards
- ✅ Regional isolation
- ❌ Operational overhead (failover, backups)
## Stage 6 – Enterprise / Meta-Scale

### Global Topology

### Supporting Infrastructure
- Redis Cluster: 300 shards (multi-region replication)
- Postgres Shards: 3,844 per region × 3 replicas
- Eventing: Kafka for click streams
- Observability: Prometheus, Grafana, ELK, Jaeger
- Disaster Recovery: RPO < 10 s, RTO < 1 min
### Numbers (illustrative)

| Metric | Value |
|---|---|
| Read throughput | 5M req/s |
| Write throughput | 100K req/s |
| p99 latency | 20 ms |
| Cache hit rate | 85% |
| Availability | 99.999% |
| Annual volume | 150T requests/year |
- ✅ Planet-scale durability
- ✅ Real-time analytics
- ✅ Regional failover
- ❌ Huge operational complexity
## Architectural Decisions

| Decision | Choice | Reasoning |
|---|---|---|
| Key length | 8 chars | ~218T unique keys, negligible collision probability |
| Sharding | Prefix (first 2 chars) | Uniform distribution, easy routing |
| Cache strategy | Hot 1–5%, LRU + TTL 24h | High hit rate, bounded memory |
| Metrics | In-mem counters + periodic DB flush | ~500 ns updates, durability via batch |
| Storage API | Interface-based (`Store` interface) | Enables swap: memory → Redis → Postgres |
| Security | Validate URLs, block private IPs, rate-limit writes, sign aliases | Prevent SSRF, spam, and abuse |
| Observability | RED metrics, structured logs, tracing | Enable SLO monitoring & debugging |
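The metrics row above (in-memory counters, periodic DB flush) can be sketched as follows; types and names are illustrative, not taken from the linked repo:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ClickCounter batches per-key click counts in memory. A background
// goroutine would periodically call Flush and persist the snapshot to the
// database in one batched UPDATE, instead of one write per redirect.
type ClickCounter struct {
	mu     sync.Mutex
	counts map[string]*int64
}

func NewClickCounter() *ClickCounter {
	return &ClickCounter{counts: make(map[string]*int64)}
}

// Incr is the hot-path operation: an atomic add, no database round-trip.
func (c *ClickCounter) Incr(key string) {
	c.mu.Lock()
	p, ok := c.counts[key]
	if !ok {
		p = new(int64)
		c.counts[key] = p
	}
	c.mu.Unlock()
	atomic.AddInt64(p, 1)
}

// Flush returns the current counts and resets the map; the caller persists
// them. Counts accumulated since the snapshot land in the next flush.
func (c *ClickCounter) Flush() map[string]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]int64, len(c.counts))
	for k, p := range c.counts {
		out[k] = atomic.LoadInt64(p)
	}
	c.counts = make(map[string]*int64)
	return out
}

func main() {
	cc := NewClickCounter()
	cc.Incr("abc12345")
	cc.Incr("abc12345")
	fmt.Println(cc.Flush()) // map[abc12345:2]
}
```

The trade-off is explicit: a crash loses at most one flush interval of counts, which is acceptable for analytics but would not be for the URL mappings themselves.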
## SLOs and Failure Design

| Tier | Target | Degradation Strategy |
|---|---|---|
| Availability | 99.99% | Serve from cache → read replicas → failover region |
| Latency | p50 ≤ 10 ms, p99 ≤ 50 ms | Prefer cached reads; fall back to replicas |
| Durability | RPO ≤ 10 s, RTO ≤ 1 min | Promote replica, replay Kafka logs |
| Scalability | Linear to ~1M req/s per region | Horizontal app + cache scaling |
## Interview Framework: How to Use This

When asked to "design a URL shortener" in an interview, use this 5-minute outline:

| Step | What to Say | Key Points |
|---|---|---|
| 1. Requirements | Shorten, redirect, delete, analytics | Functional + non-functional (scale, latency) |
| 2. MVP | One Go server + in-memory map | Start small |
| 3. Persistence | Add Postgres for durability | Read/write paths |
| 4. Scale reads | Add Redis read-through cache | Explain cache invalidation |
| 5. Scale writes | Shard DB by key prefix | 62² ≈ 3.8K shards |
| 6. Geo-scale | Multi-region replicas, CDN, rate limiting | Fault tolerance |
| 7. Observability | Metrics, tracing, logging | SLOs & debugging |
| 8. Trade-offs | Latency vs. durability vs. cost | Articulate why |
Hint: keep numbers handy – an 8-char key gives ≈ 2 × 10¹⁴ combinations, a Redis hit costs ~1 ms, a DB read ~50 ms.
## Key Takeaways
- **Scale evolves.** Each layer solves a previous bottleneck.
- **Most traffic is read-heavy.** Optimize for the ~80% of requests that can be served from cache first.
- **Simplicity beats premature scale.** Don't shard until you must.
- **Measure relentlessly.** Latency histograms, p99, error budgets.
- **Design for failure.** Everything breaks; plan for graceful degradation.
## Further Reading
## Closing Thoughts
Building a URL shortener seems trivial – until you have to do it for billions of users.
Start with a single process.
Measure.
Then add layers only when you must.
That's not just how you scale systems – it's how you scale engineering judgment.
## Code
All source code, tests, and diagrams are available here: https://github.com/AkshayContributes/url-shortener
Written by Akshay Thakur (2025)
Staff-level System Design Reference | Go + Distributed Systems + Scale Engineering





