Error Budget

One-Liner

The maximum acceptable downtime or unreliability of a service within a defined period, derived from its Service Level Objectives (SLOs).

What It Is

The complement of an SLO: 100% minus the SLO target. If a service has an SLO of 99.9% availability, its error budget is 0.1% unavailability, which over a 30-day window is about 43 minutes of downtime. This budget is the amount of failure the service can experience without violating its SLO.
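The arithmetic behind that derivation can be sketched in a few lines. This is a minimal illustration, assuming a time-based availability SLO and a 30-day window; the function name `error_budget_minutes` and the default window are hypothetical choices, not part of any standard.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction (0.999 means 99.9%). The 30-day default window
    is an assumption; use whatever period the SLO is actually defined over.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The error budget is the complement of the SLO, applied to the window.
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Each additional "nine" shrinks the budget by a factor of ten, which is why the choice of SLO target matters so much.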

Why It Exists

To let teams balance reliability with innovation. It provides a quantitative, data-driven way to manage risk: teams can ship new features (which inherently carry risk) as long as they stay within their allocated budget for unreliability.

How It Works

Teams continuously measure their service’s actual reliability against its SLOs.
Any unreliability (downtime, errors, latency beyond the SLO threshold) “consumes” the error budget.
If the error budget is running low or is exhausted, it signals that the team should prioritize reliability work (e.g., fixing bugs, improving stability) over shipping new features.
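The loop described above can be sketched as follows. This is a hedged illustration only, assuming a request-based SLI (good requests divided by total requests); the names `budget_status` and `release_decision`, the 25% low-water mark, and the decision labels are all hypothetical. Real policies are agreed per team.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float    # failed requests the SLO tolerates in the window
    consumed_fraction: float   # 1.0 means the budget is fully spent
    remaining_fraction: float  # negative once the SLO has been violated


def budget_status(slo_target: float, total_requests: int,
                  failed_requests: int) -> BudgetStatus:
    """Measure how much of the error budget the observed failures consumed."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed
    return BudgetStatus(allowed, consumed, 1.0 - consumed)


def release_decision(status: BudgetStatus, low_water_mark: float = 0.25) -> str:
    """Map remaining budget to an action (thresholds are assumptions)."""
    if status.remaining_fraction <= 0:
        return "freeze features"
    if status.remaining_fraction < low_water_mark:
        return "prioritize reliability"
    return "ship"


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    for failed in (400, 900, 1200):
        status = budget_status(0.999, 1_000_000, failed)
        print(failed, f"{status.remaining_fraction:.0%}", release_decision(status))
```

Keeping measurement separate from the decision lets a team change its policy thresholds without touching how the budget is computed.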

Tradeoffs

Pros

Encourages data-driven decision making.
Aligns engineering incentives with user experience.
Provides a clear mechanism for managing risk.

Cons

SLOs and the metrics behind them can be difficult to define and measure accurately.
Can cause friction between teams if budget decisions are not made transparently.

Failure Modes

Unrealistic SLOs: If SLOs are too ambitious, the error budget is quickly exhausted, leading to constant reliability “fire drills.”
Gaming the budget: Teams might find ways to make the metrics look good without actually improving user experience.

Interview Traps

Not being able to explain how an error budget is derived from an SLO.
Not understanding its role in balancing reliability and feature development.

Real-World Usage

A core practice in Site Reliability Engineering (SRE) at Google and many other technology companies.

Anti-Patterns

Treating the error budget as a target for failure (i.e., “we have 0.1% to burn, so let’s burn it”). It’s a ceiling, not a floor.

Related Concepts

SLA, SLO, SLI
Site Reliability Engineering (SRE)
Mean Time To Recovery (MTTR)