Error Budget

One-Liner

The maximum acceptable downtime or unreliability of a service within a defined period, derived from its Service Level Objectives (SLOs).

What It Is

The complement of an SLO: the error budget is 100% minus the SLO target. If a service has an SLO of 99.9% availability, its error budget is 0.1% unavailability. This budget is the amount of failure the service can absorb within the SLO period without violating its SLO.
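To make the derivation concrete, the sketch below converts an SLO percentage into an error budget and then into allowed downtime. The function names and the 30-day window are assumptions chosen for illustration; real SLO periods vary by team.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Return the error budget as a fraction, e.g. 99.9 -> roughly 0.001."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("SLO must be a percentage between 0 and 100")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the window (assumed 30 days)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_fraction(slo_percent)


print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```

Each extra "nine" in the SLO shrinks the budget tenfold, which is why the target should be chosen deliberately.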

Why It Exists

To allow teams to balance reliability with innovation. It provides a quantitative, data-driven way to manage risk, allowing teams to ship new features (which inherently carry risk) as long as they stay within their allocated budget for unreliability.

How It Works

  • Teams continuously measure their service’s actual reliability against its SLOs.
  • Any unreliability (downtime, errors, latency spikes beyond SLO) “consumes” the error budget.
  • If the error budget is running low or is exhausted, it signals that the team should prioritize reliability work (e.g., fixing bugs, improving stability) over shipping new features.
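
The steps above can be sketched for a request-based SLO as follows. This is a minimal illustration, not a standard implementation: the `budget_status` function, the `BudgetStatus` type, and the 75% "running low" watermark are all hypothetical, and a real policy would set its own thresholds.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float  # failures the SLO tolerates in this window
    consumed: float          # fraction of the budget used (1.0 = fully spent)
    state: str               # "healthy", "low", or "exhausted"


def budget_status(slo_percent: float, total_requests: int,
                  failed_requests: int,
                  low_watermark: float = 0.75) -> BudgetStatus:
    # Budget in requests: the share of traffic the SLO allows to fail.
    allowed = total_requests * (100.0 - slo_percent) / 100.0
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # No budget at all (100% SLO or no traffic): any failure exhausts it.
        consumed = float("inf") if failed_requests else 0.0

    if consumed >= 1.0:
        state = "exhausted"  # prioritize reliability work over features
    elif consumed >= low_watermark:  # assumed threshold for "running low"
        state = "low"
    else:
        state = "healthy"    # room to ship
    return BudgetStatus(allowed, consumed, state)


print(budget_status(99.9, 1_000_000, 400).state)    # healthy
print(budget_status(99.9, 1_000_000, 800).state)    # low
print(budget_status(99.9, 1_000_000, 1_200).state)  # exhausted
```

With a 99.9% SLO and one million requests, roughly 1,000 failures are tolerated, so 400 failures leave most of the budget intact while 1,200 overspend it.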

Tradeoffs

Pros

  • Encourages data-driven decision making.
  • Aligns engineering incentives with user experience.
  • Provides a clear mechanism for managing risk.

Cons

  • The underlying SLIs and SLOs can be difficult to define and measure accurately.
  • Can lead to internal friction if not managed transparently.

Failure Modes

  • Unrealistic SLOs: If SLOs are too ambitious, the error budget is quickly exhausted, leading to constant reliability “fire drills.”
  • Gaming the budget: Teams might find ways to make the metrics look good without actually improving user experience.

Interview Traps

  • Not being able to explain how an error budget is derived from an SLO.
  • Not understanding its role in balancing reliability and feature development.

Real-World Usage

  • A core practice in Site Reliability Engineering (SRE) at Google and many other technology companies.

Anti-Patterns

  • Treating the error budget as a target for failure (i.e., “we have 0.1% to burn, so let’s burn it”). It’s a ceiling, not a floor.

Related Concepts

  • SLA, SLO, SLI
  • Site Reliability Engineering (SRE)
  • Mean Time To Recovery (MTTR)