Failure and Resilience

Scope

This topic covers the types of failure that occur in distributed systems and the patterns for building resilient systems that can withstand them.

Why This Topic Exists

In any large-scale system, failures are not exceptional events; they are a normal and expected part of operation. A system’s success is defined not by its ability to avoid failure, but by its ability to gracefully handle failure when it occurs.

Core Tradeoffs

Fail-Fast vs. Graceful Degradation: Is it better to stop and return an error immediately, or to keep operating with reduced functionality (for example, by serving cached or default results)?
Redundancy vs. Cost: Having redundant components increases reliability but also significantly increases infrastructure and operational costs.
Automated Recovery vs. Manual Intervention: Automation can lead to faster recovery, but a faulty automation script can cause a much larger-scale outage than a single component failure.
Consistency vs. Availability: During a network partition, does the system remain available but risk serving stale or incorrect data, or does it become unavailable to ensure correctness?
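
The first tradeoff above can be made concrete with a small sketch. It is only an illustration under stated assumptions: `DependencyError`, `fetch_fail_fast`, and `fetch_degraded` are hypothetical names, and the in-memory dictionary stands in for whatever cache a real system would use.

```python
from typing import Callable, Dict, Optional, Tuple


class DependencyError(Exception):
    """Raised when the (hypothetical) downstream dependency fails."""


def fetch_fail_fast(fetch: Callable[[str], str], key: str) -> str:
    # Fail-fast: let the error propagate so the caller sees it immediately.
    return fetch(key)


def fetch_degraded(
    fetch: Callable[[str], str],
    key: str,
    cache: Dict[str, str],
    default: Optional[str] = None,
) -> Tuple[Optional[str], bool]:
    """Return (value, degraded). On failure, fall back to the last known value.

    The fallback may be stale, which is the price paid for staying available.
    """
    try:
        value = fetch(key)
    except DependencyError:
        # Graceful degradation: serve the cached value, or a default if none.
        return cache.get(key, default), True
    cache[key] = value  # remember the last good value for future fallbacks
    return value, False
```

Returning the `degraded` flag alongside the value lets the caller decide whether to label the response as possibly stale, which keeps the degraded path visible instead of hiding it.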

Common Failure Modes

Cascading Failures: A failure in a downstream dependency triggers failures in the upstream services that call it (for example, through exhausted connection pools or retry storms), and the failure spreads into a widespread outage.
“Gray” or Partial Failures: A system is not completely down but is partially failing—it might be slow, returning errors for a subset of requests, or serving incorrect data. These can be harder to detect and debug than a complete outage.
Network Partitions: A loss of communication between parts of the system, which can lead to “split-brain” scenarios where different parts of the system make conflicting decisions.
Single Points of Failure (SPOFs): A component (e.g., a database, a load balancer, a specific service) whose failure will cause the entire system to fail.
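
One common way to stop a failing dependency from dragging down its callers is a circuit breaker. The sketch below is a minimal, single-threaded illustration, not a production implementation: the class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal circuit breaker sketch (not thread-safe)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: reject immediately instead of waiting on a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get a fast, cheap rejection instead of tying up threads and connections on a dependency that is unlikely to respond, which is what limits the spread of the failure.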

Interview Signals

Strong candidates talk about failure as a certainty and proactively discuss strategies to mitigate it. They should be able to describe patterns like retries (with exponential backoff and jitter), circuit breakers, bulkheads, and timeouts. They will also emphasize the importance of observability in detecting and diagnosing failures.
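
As an illustration of the first pattern named above, here is a minimal sketch of retries with exponential backoff and "full" jitter, where each delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The function names, default values, and the injectable `sleep` and `rng` parameters are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a random delay scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call fn until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Randomized delay spreads out retries from many clients.
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The jitter matters because clients that fail at the same moment would otherwise retry at the same moments too, producing synchronized bursts against a dependency that is already struggling. In practice `retry_on` would be narrowed to errors known to be transient, and the operation being retried should be idempotent.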

Related Topics

Reliability
Observability
Load Balancing
Communication
Backpressure