Reliability and SLAs
Scope
The principles and practices for defining, measuring, and achieving service reliability through Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs).
Why This Topic Exists
Reliability is a critical, non-functional requirement that must be explicitly designed and managed. A framework of SLAs, SLOs, and SLIs provides a quantitative and objective language for aligning user expectations, engineering priorities, and business goals. It allows teams to make data-driven decisions about risk and resource allocation.
Core Tradeoffs
- Reliability vs. Cost & Feature Velocity: Each additional “nine” of availability cuts the permitted downtime by a factor of ten and typically costs disproportionately more to achieve, both in engineering effort and in opportunity cost (i.e., fewer new features).
- Error Budget Utilization: An error budget allows teams to take calculated risks. Spending it too quickly leads to instability; not spending it at all suggests the service is over-engineered and the team is not innovating fast enough.
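- Internal SLO vs. External SLA: The external promise to users (SLA) should be looser than the internal engineering goal (SLO). The error budget itself is derived from the SLO (100% minus the target); the gap between SLO and SLA is a safety margin, so the team can miss its SLO and react before the contractual SLA is breached.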
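The arithmetic behind the "nines" and the error budget can be sketched as follows. This is a minimal illustration, not a standard library or tool: the function names, the 30-day window, and the example traffic numbers are all assumptions chosen for the example.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window.

    Assumes `slo` is a fraction (e.g. 0.999), and a hypothetical 30-day window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    The budget is the number of bad events the SLO tolerates:
    (1 - slo) * total_events.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Each extra nine divides the permitted downtime by ten.
    for slo in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo:.2%} -> {minutes:.2f} min of downtime per 30 days")

    # Hypothetical month: one million requests, 250 of them failed.
    print(f"budget left: {budget_remaining(0.999, 1_000_000, 250):.0%}")
```

Under these assumptions, a 99.9% target permits 43.2 minutes of downtime in 30 days and a 99.99% target permits 4.32 minutes. A team could use the remaining-budget fraction as a simple gate: keep shipping while it is comfortably positive, and slow down or prioritize reliability work as it approaches zero.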
Common Failure Modes
- Focusing on Uptime Alone: A service can be “up” but still be unreliable if it is slow, returning errors, or serving corrupted data. Good SLIs should cover the dimensions users actually notice, such as availability, latency, and correctness.
- “Watermelon” Metrics: Dashboards that are “green” on the outside (e.g., server CPU is fine) but “red” on the inside (users are experiencing errors). This happens when SLIs are not closely tied to the user experience.
- Lack of an Error Budget: Without an explicit error budget, teams either become overly conservative and afraid to make changes, or they are too reckless and repeatedly violate user expectations.
- Unrealistic SLOs: Setting SLOs that are unachievable, that are not tied to user happiness, or that carry no consequences when they are missed.
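One common way to avoid uptime-only and "watermelon" metrics is to define the SLI as good events divided by valid events, where "good" is judged from the user's point of view. The sketch below assumes a hypothetical `Request` record, a 300 ms latency threshold, and the convention that only 5xx responses count as failures; none of these come from a specific monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # end-to-end latency observed for the request


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A request is good only if it succeeded AND was fast enough.

    Assumption: 5xx responses are server failures; everything else is success.
    """
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def sli(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Good events divided by valid events.

    Returns 1.0 when there is no traffic, which is a design choice for this
    sketch: no users were harmed, so no budget is spent.
    """
    if not requests:
        return 1.0
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests)


if __name__ == "__main__":
    # The servers never went down, yet 2 of 10 requests were bad for the user:
    # one returned a 503 and one took 2.5 seconds.
    sample = [Request(200, 120.0)] * 8 + [Request(503, 40.0), Request(200, 2500.0)]
    print(f"SLI = {sli(sample):.1%}")
```

In this sample a host-level uptime check would report 100%, while the request-based SLI reports 80%, which is the gap a watermelon dashboard hides.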
Interview Signals
Strong candidates don’t just define SLA, SLO, and SLI; they can explain the “why” behind them. They should be able to articulate the concept of an error budget and how it drives engineering decisions. Expect them to propose meaningful SLIs for a hypothetical service (e.g., p99 latency for checkout, or success rate for image uploads) and discuss the business implications of a 99.9% vs. 99.99% availability target (roughly 43 minutes vs. 4 minutes of permitted downtime per 30 days).
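When a candidate proposes a latency SLI such as p99, it helps to be precise about what that number means. The sketch below uses the nearest-rank definition of a percentile, which is one of several accepted definitions; the `percentile` helper and the sample latencies are illustrative assumptions.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile.

    Returns the smallest sample such that at least p% of all samples are
    less than or equal to it. Other definitions interpolate between samples.
    """
    if not values:
        raise ValueError("cannot take a percentile of an empty sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical checkout latencies in ms: 98 fast requests, 2 very slow ones.
    latencies = [100.0] * 98 + [2000.0] * 2
    mean = sum(latencies) / len(latencies)
    print(f"mean = {mean:.0f} ms")
    print(f"p50  = {percentile(latencies, 50):.0f} ms")
    print(f"p99  = {percentile(latencies, 99):.0f} ms")
```

The mean and median in this sample look healthy, while the p99 exposes the slow tail that a small share of users actually experience, which is why tail percentiles are a common choice for latency SLIs.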
Related Topics
- Observability
- Failure
- Performance
- Scalability