SLA, SLO, SLI

One-Liner

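Three layered terms for defining, measuring, and managing the reliability and performance of a service: the indicator is what gets measured, the objective is the target for it, and the agreement is the contractual commitment built on top.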

What It Is

  • SLI (Service Level Indicator): A quantitative measure of some aspect of the service provided. Examples include the success rate of requests, latency, or system uptime. It is the measurement itself, often expressed as the ratio of good events to total events.
  • SLO (Service Level Objective): A target value or range for an SLI, evaluated over a stated window such as 30 days. It specifies a desired level of service. For example, “99.9% of requests must succeed” or “p99 latency must be under 300ms.”
  • SLA (Service Level Agreement): A formal agreement with a customer or client that defines the level of service expected. SLOs are typically internal targets that are stricter than the SLA, providing a buffer before the contract is breached. SLAs often include penalties for non-compliance, such as service credits or refunds.
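
As a minimal sketch of how an SLI becomes a number that an SLO can be checked against, the Python below computes a success-rate SLI and a p99 latency SLI from a made-up batch of requests and compares them with the two example targets above. The `Request` type, the sample data, and the nearest-rank percentile method are assumptions chosen for illustration, not a prescribed implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request succeeded
    latency_ms: float  # observed latency in milliseconds


def success_rate(requests):
    """SLI: fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def percentile_latency(requests, pct):
    """SLI: nearest-rank percentile of latency, in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 1000 requests, 1 failure, 15 slow successes.
requests = [Request(ok=True, latency_ms=120.0) for _ in range(984)]
requests += [Request(ok=True, latency_ms=450.0) for _ in range(15)]
requests += [Request(ok=False, latency_ms=80.0)]

# SLO targets taken from the examples in the text.
SLO_SUCCESS_RATE = 0.999
SLO_P99_LATENCY_MS = 300.0

sli_success = success_rate(requests)
sli_p99 = percentile_latency(requests, 99)

success_slo_met = sli_success >= SLO_SUCCESS_RATE
latency_slo_met = sli_p99 < SLO_P99_LATENCY_MS

print(f"success rate {sli_success:.3%}, SLO met: {success_slo_met}")
print(f"p99 latency {sli_p99} ms, SLO met: {latency_slo_met}")
```

In this sample the success-rate objective is met while the latency objective is missed, which shows why a service usually needs more than one SLI.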

Why It Exists

To provide a clear, measurable, and objective framework for understanding service quality from a user’s perspective, guiding engineering efforts, and managing expectations.

How It Works

SLIs are measured continuously, usually over a rolling window. Each SLO defines the acceptable range for one SLI over that window. An SLA makes a subset of those objectives, usually with looser thresholds, contractually binding.
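
The chain from measurement to target to contract can be sketched as a single evaluation function. The thresholds `SLO_TARGET` and `SLA_TARGET`, the `Status` names, and the choice to treat a window with no traffic as healthy are all hypothetical values picked for this example.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "SLO met"
    SLO_BREACH = "SLO missed, SLA still met"
    SLA_BREACH = "SLA missed"


# Hypothetical targets: the internal SLO is stricter than the contractual SLA.
SLO_TARGET = 0.999
SLA_TARGET = 0.995


def evaluate(good_events: int, total_events: int) -> Status:
    """Compare a good/total SLI against the SLO first, then the SLA."""
    if total_events == 0:
        # No traffic in the window; treating this as healthy is a policy choice.
        return Status.HEALTHY
    sli = good_events / total_events
    if sli >= SLO_TARGET:
        return Status.HEALTHY
    if sli >= SLA_TARGET:
        return Status.SLO_BREACH
    return Status.SLA_BREACH


for good in (9995, 9970, 9900):
    print(good, evaluate(good, 10_000).value)
```

The middle state is the useful one: the internal objective has been missed, so engineering can react before the customer-facing commitment is broken.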

Tradeoffs

Pros

  • Clear communication about service health.
  • Data-driven decision making.
  • Alignment between engineering and business.

Cons

  • Can be difficult to define meaningful SLIs and SLOs.
  • Poorly chosen metrics invite “gaming”: teams optimize the number rather than the user experience it was meant to represent.

Failure Modes

  • “Watermelon” metrics: Dashboards look green (SLOs are met) but users are unhappy because the SLIs don’t accurately reflect user experience.
  • Unrealistic SLOs: Setting targets that are impossible or too expensive to meet, for example 100% availability.
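
To see why each extra “nine” gets expensive, it helps to convert an availability target into the downtime it permits. The sketch below assumes a 30-day window and a time-based availability SLI; both are illustrative choices, and `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime permitted by an availability target over the window."""
    return window * (1 - slo)


for slo in (0.99, 0.999, 0.9999, 0.99999, 1.0):
    print(f"{slo:.3%} allows {allowed_downtime(slo)} per 30 days")
```

A 99.9% target leaves roughly 43 minutes per 30 days, 99.99% leaves a little over 4 minutes, and 100% leaves nothing, so a single failed deploy or dependency outage breaks it.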

Interview Traps

  • Confusing the three terms.
  • Not being able to give concrete examples of SLIs and SLOs for a typical service.

Real-World Usage

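  • Adopted widely in Site Reliability Engineering (SRE) practices, where SLOs and their error budgets guide decisions such as when to pause feature releases in favor of reliability work.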

Anti-Patterns

  • Only measuring infrastructure metrics (e.g., CPU, memory) instead of user-facing SLIs.
  • Having an SLA that is identical to the SLO, leaving no buffer: any missed objective is immediately a contract breach.

Related Concepts

  • Error Budget
  • Observability
  • Mean Time Between Failures (MTBF)
  • Mean Time To Recovery (MTTR)
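
Error Budget, MTBF, and MTTR, listed above, each tie back to an SLO with simple arithmetic. The sketch below uses the common definitions (error budget as `1 - SLO`, steady-state availability as `MTBF / (MTBF + MTTR)`); the function names and sample numbers are assumptions for illustration.

```python
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_bad = (1 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good        # failures observed
    return 1 - actual_bad / allowed_bad


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability implied by failure and repair times."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO.
remaining = error_budget_remaining(0.999, good=999_500, total=1_000_000)
print(f"error budget remaining: {remaining:.0%}")

# Hypothetical service: fails every 999 hours on average, 1 hour to recover.
print(f"availability: {availability(999, 1):.3%}")
```

With a 99.9% objective, 500 failures in a million requests spends about half the budget, and a service with those MTBF and MTTR figures sits right at 99.9% availability.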