Scalability
An overview of scalability as a core non-functional requirement: what it is, what it isn’t, and the trade-offs behind common scaling techniques.
Context
Scalability is a system’s ability to handle increasing workload predictably and cost‑efficiently by adding resources and/or improving resource utilization.
“Workload” can mean different things depending on the system:
- Requests per second (RPS) or concurrent users
- Data volume (reads/writes per second, storage growth)
- Background job throughput (events/sec, tasks/sec)
- Latency targets under load (e.g., p95/p99 response times)
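Percentile latency targets like p95/p99 are straightforward to compute over a window of samples. A minimal sketch using the nearest-rank method (the function name and sample values are illustrative, not from any particular library):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: sort, then take the ceil(p% * n)-th value."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 20 request latencies (ms): mostly fast, with a slow tail.
latencies_ms = [10] * 18 + [40, 500]
p50 = percentile(latencies_ms, 50)   # typical request
p95 = percentile(latencies_ms, 95)   # near-tail
p99 = percentile(latencies_ms, 99)   # worst-case tail
```

The point of tracking percentiles rather than averages: here p99 is 50× the median, which is exactly the tail behaviour that latency targets under load are meant to catch.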
Scaling is the act of increasing capacity. The two classic approaches are:
- Vertical scaling (scale up): make a single node bigger
- Horizontal scaling (scale out): add more nodes
Important: Scalability is related to, but not the same as, performance (how fast one node is) or elasticity (how quickly capacity adjusts).
Vertical Scaling (Scale Up)
Vertical scaling means upgrading a single machine (more CPU, RAM, faster disk, better NIC).
It’s often the simplest first step, but it has notable drawbacks:
- Cost grows non-linearly (diminishing returns).
  - High-end instances often cost disproportionately more than multiple mid-tier machines.
- Hard limits exist.
  - You can’t scale a single host indefinitely (CPU sockets, memory channels, I/O ceilings).
- Single-node blast radius stays large.
  - One box becomes a bigger single point of failure unless you add redundancy (which pushes you toward horizontal designs anyway).
- Upgrades may require downtime or complex migration.
  - Even with blue/green or live migration, you usually need operational work to move state safely.
Vertical scaling is still valuable when:
- The bottleneck is simple (e.g., CPU-bound service with no shared-state issues)
- The system is early-stage and operational simplicity beats architectural complexity
- Your database benefits from bigger memory for cache hit rate (up to a point)
Horizontal Scaling (Scale Out)
Horizontal scaling means adding more nodes and distributing the load.
Key advantages:
- Better cost/performance with commodity machines (or smaller cloud instances)
- Incremental growth (add capacity gradually)
- Higher availability (a single node failure doesn’t take the whole system down)
- Parallelism for throughput-heavy workloads
However, horizontal scaling tends to move complexity into:
- Traffic distribution (load balancing)
- Data consistency and shared state
- Coordination (leader election, sharding, distributed locks)
- Observability and operational tooling
Stateless vs. Stateful Services
Stateless services
A service is stateless if any instance can handle any request because user/session state is not stored in the service’s memory.
Stateless services are typically straightforward to scale horizontally:
- Add instances
- Put them behind a load balancer
- Ensure instances are replaceable (immutable deploys, autoscaling)
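A minimal sketch of what “any instance can handle any request” means in practice. A plain dict stands in for an external store such as Redis; all class and variable names are illustrative:

```python
class SessionStore:
    """Stand-in for an external session store (e.g., Redis or a database)."""
    def __init__(self):
        self._data = {}

    def get(self, sid):
        return self._data.get(sid)

    def put(self, sid, value):
        self._data[sid] = value


class AppInstance:
    """A stateless app instance: no per-user state lives in instance memory."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, sid, increment):
        # Read-modify-write against the shared store, not local memory.
        count = (self.store.get(sid) or 0) + increment
        self.store.put(sid, count)
        return count


store = SessionStore()
a, b = AppInstance("a", store), AppInstance("b", store)
a.handle("user-1", 1)            # instance a serves the first request
total = b.handle("user-1", 1)    # instance b sees the same session state
```

Because the state is externalized, the load balancer is free to send each request to any instance, and instances can be added or replaced without losing sessions.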
Stateful services
Stateful services keep critical state locally (in-memory session, on-disk data, in-process cache with correctness requirements). These are harder to scale because the state must be replicated, partitioned, or externalized (e.g., to Redis, a database, or an object store).
In practice, the most difficult component to scale is often the database because it couples:
- Storage
- Consistency guarantees
- Read/write throughput
- Contention on shared data
Common database scaling strategies include:
- Read replicas (scale reads, not writes)
- Caching (reduce load; requires careful invalidation strategy)
- Partitioning/sharding (scale writes and data volume, increases complexity)
- CQRS patterns (separate read/write models when necessary)
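To make the caching-plus-invalidation point concrete, here is a cache-aside sketch. In-memory dicts stand in for the primary database and the cache; the class and key names are invented for illustration:

```python
class CacheAsideRepo:
    """Cache-aside: read through the cache; writes invalidate, not update."""
    def __init__(self):
        self.db = {}        # stand-in for the primary database
        self.cache = {}     # stand-in for Redis/memcached
        self.db_reads = 0   # instrumented so the offloading is visible

    def read(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: DB untouched
        self.db_reads += 1
        value = self.db.get(key)
        self.cache[key] = value
        return value

    def write(self, key, value):
        self.db[key] = value
        # Invalidate rather than update in place: updating risks writing a
        # stale value if two writers race.
        self.cache.pop(key, None)


repo = CacheAsideRepo()
repo.write("user:1", "alice")
repo.read("user:1")   # cache miss: one DB read
repo.read("user:1")   # cache hit: no extra DB read
```

This is the “reduce load” half of the trade-off; the “careful invalidation” half is the `pop` on write, and it is the part that goes wrong most often in real systems.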
Load Balancing
A load balancer distributes incoming traffic across multiple nodes to:
- Avoid overloading a single instance
- Improve throughput and tail latency
- Detect unhealthy nodes and route around failures
Common implementations:
- Hardware load balancer (specialized appliance; powerful but expensive and less flexible)
- Managed load balancer (LBaaS) from a cloud provider
- Software load balancer (e.g., NGINX, HAProxy, Envoy) running on your own infrastructure
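The distribution-plus-health-check behaviour described above can be sketched as a round-robin picker that skips nodes marked unhealthy. This is a toy model of the routing decision, not a production load balancer; the backend names are made up:

```python
import itertools

class RoundRobinBalancer:
    """Minimal software LB sketch: round-robin over healthy backends."""
    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        # In a real LB this would be driven by failed health checks.
        self.healthy.discard(backend)

    def pick(self):
        # Advance round-robin, skipping unhealthy nodes; give up only if a
        # full pass finds nothing routable.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
picks = [lb.pick() for _ in range(4)]   # app-2 is routed around
```

Real balancers layer more on top (weights, least-connections, connection draining), but the core idea is the same: the routing decision consults health state on every pick.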
Layer 4 vs. Layer 7
Load balancers are commonly categorised by OSI layers:
- Layer 4 (TCP/UDP)
  - Routes based on connection-level information (IP/port)
  - Usually lower overhead, less visibility into HTTP semantics
- Layer 7 (HTTP)
  - Can route based on path, host, headers, cookies, or even request body (with caution)
  - Can enforce policies (auth checks, rate limiting, WAF integration)
  - Can perform TLS termination
    - Offloads expensive crypto from app nodes and simplifies certificate rotation
    - Beware: “TLS termination everywhere” is not automatically safe; many environments still require mTLS service-to-service depending on threat model and compliance needs.
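To make the Layer 7 routing idea concrete, here is a sketch of prefix-based path routing, the kind of decision an NGINX or Envoy route table expresses. The pool names and prefixes are invented for illustration:

```python
# Route table: checked in order, so more specific prefixes come first
# and "/" acts as the catch-all.
ROUTES = [
    ("/api/", "api-pool"),
    ("/static/", "cdn-pool"),
    ("/", "web-pool"),
]

def route(path):
    """Return the backend pool for a request path, or None if nothing matches."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return None
```

Order matters here: the catch-all would shadow everything if it came first. Real L7 proxies resolve overlapping matches with explicit rules (e.g., longest-prefix wins) rather than relying on list order alone.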
Sticky Sessions (Session Affinity)
Sticky sessions mean the load balancer consistently routes a user’s requests to the same backend instance.
This can be useful when the application stores session state in-memory, but it has trade-offs:
- Reduced balancing effectiveness (hot instances)
- Worse failure behaviour (if the “sticky” node dies, sessions may be lost)
- More difficult deployments (draining, rolling updates)
Implementation approaches:
- Duration-based LB cookies (LB issues a cookie and uses it to route)
- Application-controlled session cookies (LB uses app identifiers to maintain affinity)
A more scalable alternative is to avoid stickiness by:
- Making the service stateless
- Storing session/state in an external system (Redis, DB) with appropriate TTLs
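A sketch of that externalized alternative: a Redis-like store with per-key TTL and lazy expiry on read. The class is a stand-in for a real client, and the injectable clock exists only to make the behaviour testable:

```python
import time

class TTLSessionStore:
    """Stand-in for an external session store (e.g., Redis) with per-key TTL."""
    def __init__(self, clock=time.monotonic):
        self._data = {}      # sid -> (value, absolute expiry time)
        self._clock = clock

    def put(self, sid, value, ttl_seconds):
        self._data[sid] = (value, self._clock() + ttl_seconds)

    def get(self, sid):
        entry = self._data.get(sid)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[sid]   # expire lazily on access
            return None
        return value


store = TTLSessionStore()
store.put("sess-42", {"user": "alice"}, ttl_seconds=30)
```

With sessions held here instead of in app memory, any instance can serve any request, node failures don’t lose sessions, and the TTL bounds how long abandoned state lingers.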