Scalability
An overview of scalability as a core non-functional requirement: what it is, what it isn’t, and the trade-offs behind common scaling techniques.
Context
Scalability is a system’s ability to handle increasing workload predictably and cost‑efficiently by adding resources and/or improving resource utilization.
“Workload” can mean different things depending on the system:
- Requests per second (RPS) or concurrent users
- Data volume (reads/writes per second, storage growth)
- Background job throughput (events/sec, tasks/sec)
- Latency targets under load (e.g., p95/p99 response times)
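Percentile latency targets like p95/p99 are straightforward to compute over a window of samples. A minimal sketch using the nearest-rank method (the function name and sample values are illustrative, not from any particular library):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: sort, then take the ceil(p% * n)-th value."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 20 request latencies (ms): mostly fast, with a slow tail.
latencies_ms = [10] * 18 + [40, 500]
p50 = percentile(latencies_ms, 50)   # typical request
p95 = percentile(latencies_ms, 95)   # near-tail
p99 = percentile(latencies_ms, 99)   # worst-case tail
```

The point of tracking percentiles rather than averages: here p99 is 50× the median, which is exactly the tail behaviour that latency targets under load are meant to catch.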
Scaling is the act of increasing capacity. The two classic approaches are:
- Vertical scaling (scale up): make a single node bigger
- Horizontal scaling (scale out): add more nodes
Important: Scalability is related to, but not the same as, performance (how fast one node is) or elasticity (how quickly capacity adjusts).
Vertical Scaling (Scale Up)
Vertical scaling means upgrading a single machine (more CPU, RAM, faster disk, better NIC).
It’s often the simplest first step, but it has notable drawbacks:
- Cost grows non-linearly (diminishing returns).
  - High-end instances often cost disproportionately more than multiple mid-tier machines.
- Hard limits exist.
  - You can’t scale a single host indefinitely (CPU sockets, memory channels, I/O ceilings).
- Single-node blast radius stays large.
  - One box becomes a bigger single point of failure unless you add redundancy (which pushes you toward horizontal designs anyway).
- Upgrades may require downtime or complex migration.
  - Even with blue/green or live migration, you usually need operational work to move state safely.
Vertical scaling is still valuable when:
- The bottleneck is simple (e.g., CPU-bound service with no shared-state issues)
- The system is early-stage and operational simplicity beats architectural complexity
- Your database benefits from bigger memory for cache hit rate (up to a point)
Horizontal Scaling (Scale Out)
Horizontal scaling means adding more nodes and distributing the load.
Key advantages:
- Better cost/performance with commodity machines (or smaller cloud instances)
- Incremental growth (add capacity gradually)
- Higher availability (a single node failure doesn’t take the whole system down)
- Parallelism for throughput-heavy workloads
However, horizontal scaling tends to move complexity into:
- Traffic distribution (load balancing)
- Data consistency and shared state
- Coordination (leader election, sharding, distributed locks)
- Observability and operational tooling
Stateless vs. Stateful Services
Stateless services
A service is stateless if any instance can handle any request because user/session state is not stored in the service’s memory.
Stateless services are typically straightforward to scale horizontally:
- Add instances
- Put them behind a load balancer
- Ensure instances are replaceable (immutable deploys, autoscaling)
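A minimal sketch of what “any instance can handle any request” means in practice. A plain dict stands in for an external store such as Redis; all class and variable names are illustrative:

```python
class SessionStore:
    """Stand-in for an external session store (e.g., Redis or a database)."""
    def __init__(self):
        self._data = {}

    def get(self, sid):
        return self._data.get(sid)

    def put(self, sid, value):
        self._data[sid] = value


class AppInstance:
    """A stateless app instance: no per-user state lives in instance memory."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, sid, increment):
        # Read-modify-write against the shared store, not local memory.
        count = (self.store.get(sid) or 0) + increment
        self.store.put(sid, count)
        return count


store = SessionStore()
a, b = AppInstance("a", store), AppInstance("b", store)
a.handle("user-1", 1)            # instance a serves the first request
total = b.handle("user-1", 1)    # instance b sees the same session state
```

Because the state is externalized, the load balancer is free to send each request to any instance, and instances can be added or replaced without losing sessions.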
Stateful services
Stateful services keep critical state locally (in-memory session, on-disk data, in-process cache with correctness requirements). These are harder to scale because the state must be replicated, partitioned, or externalized (e.g., to Redis, a database, or an object store).
In practice, the most difficult component to scale is often the database because it couples:
- Storage
- Consistency guarantees
- Read/write throughput
- Contention on shared data
Common database scaling strategies include:
- Read replicas (scale reads, not writes)
- Caching (reduce load; requires careful invalidation strategy)
- Partitioning/sharding (scale writes and data volume, increases complexity)
- CQRS patterns (separate read/write models when necessary)
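To make the caching-plus-invalidation point concrete, here is a cache-aside sketch. In-memory dicts stand in for the primary database and the cache; the class and key names are invented for illustration:

```python
class CacheAsideRepo:
    """Cache-aside: read through the cache; writes invalidate, not update."""
    def __init__(self):
        self.db = {}        # stand-in for the primary database
        self.cache = {}     # stand-in for Redis/memcached
        self.db_reads = 0   # instrumented so the offloading is visible

    def read(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: DB untouched
        self.db_reads += 1
        value = self.db.get(key)
        self.cache[key] = value
        return value

    def write(self, key, value):
        self.db[key] = value
        # Invalidate rather than update in place: updating risks writing a
        # stale value if two writers race.
        self.cache.pop(key, None)


repo = CacheAsideRepo()
repo.write("user:1", "alice")
repo.read("user:1")   # cache miss: one DB read
repo.read("user:1")   # cache hit: no extra DB read
```

This is the “reduce load” half of the trade-off; the “careful invalidation” half is the `pop` on write, and it is the part that goes wrong most often in real systems.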
Load Balancing
A load balancer distributes incoming traffic across multiple nodes to:
- Avoid overloading a single instance
- Improve throughput and tail latency
- Detect unhealthy nodes and route around failures
Common implementations:
- Hardware load balancer (specialized appliance; powerful but expensive and less flexible)
- Managed load balancer (LBaaS) from a cloud provider
- Software load balancer (e.g., NGINX, HAProxy, Envoy) running on your own infrastructure
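The distribution-plus-health-check behaviour described above can be sketched as a round-robin picker that skips nodes marked unhealthy. This is a toy model of the routing decision, not a production load balancer; the backend names are made up:

```python
import itertools

class RoundRobinBalancer:
    """Minimal software LB sketch: round-robin over healthy backends."""
    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        # In a real LB this would be driven by failed health checks.
        self.healthy.discard(backend)

    def pick(self):
        # Advance round-robin, skipping unhealthy nodes; give up only if a
        # full pass finds nothing routable.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
picks = [lb.pick() for _ in range(4)]   # app-2 is routed around
```

Real balancers layer more on top (weights, least-connections, connection draining), but the core idea is the same: the routing decision consults health state on every pick.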
Layer 4 vs. Layer 7
Load balancers are commonly categorised by OSI layers:
- Layer 4 (TCP/UDP)
  - Routes based on connection-level information (IP/port)
  - Usually lower overhead, less visibility into HTTP semantics
- Layer 7 (HTTP)
  - Can route based on path, host, headers, cookies, or even request body (with caution)
  - Can enforce policies (auth checks, rate limiting, WAF integration)
  - Can perform TLS termination
    - Offloads expensive crypto from app nodes and simplifies certificate rotation
    - Beware: “TLS termination everywhere” is not automatically safe; many environments still require mTLS service-to-service depending on threat model and compliance needs.
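To make the Layer 7 routing idea concrete, here is a sketch of prefix-based path routing, the kind of decision an NGINX or Envoy route table expresses. The pool names and prefixes are invented for illustration:

```python
# Route table: checked in order, so more specific prefixes come first
# and "/" acts as the catch-all.
ROUTES = [
    ("/api/", "api-pool"),
    ("/static/", "cdn-pool"),
    ("/", "web-pool"),
]

def route(path):
    """Return the backend pool for a request path, or None if nothing matches."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return None
```

Order matters here: the catch-all would shadow everything if it came first. Real L7 proxies resolve overlapping matches with explicit rules (e.g., longest-prefix wins) rather than relying on list order alone.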
Sticky Sessions (Session Affinity)
Sticky sessions mean the load balancer consistently routes a user’s requests to the same backend instance.
This can be useful when the application stores session state in-memory, but it has trade-offs:
- Reduced balancing effectiveness (hot instances)
- Worse failure behaviour (if the “sticky” node dies, sessions may be lost)
- More difficult deployments (draining, rolling updates)
Implementation approaches:
- Duration-based LB cookies (LB issues a cookie and uses it to route)
- Application-controlled session cookies (LB uses app identifiers to maintain affinity)
A more scalable alternative is to avoid stickiness by:
- Making the service stateless
- Storing session/state in an external system (Redis, DB) with appropriate TTLs
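A sketch of that externalized alternative: a Redis-like store with per-key TTL and lazy expiry on read. The class is a stand-in for a real client, and the injectable clock exists only to make the behaviour testable:

```python
import time

class TTLSessionStore:
    """Stand-in for an external session store (e.g., Redis) with per-key TTL."""
    def __init__(self, clock=time.monotonic):
        self._data = {}      # sid -> (value, absolute expiry time)
        self._clock = clock

    def put(self, sid, value, ttl_seconds):
        self._data[sid] = (value, self._clock() + ttl_seconds)

    def get(self, sid):
        entry = self._data.get(sid)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[sid]   # expire lazily on access
            return None
        return value


store = TTLSessionStore()
store.put("sess-42", {"user": "alice"}, ttl_seconds=30)
```

With sessions held here instead of in app memory, any instance can serve any request, node failures don’t lose sessions, and the TTL bounds how long abandoned state lingers.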