Observability
Scope
The tools and practices for understanding a system’s internal state from its external outputs, covering the three pillars: metrics, logs, and distributed tracing.
Why This Topic Exists
In complex distributed systems, failures are inevitable and often unpredictable. Monitoring tells you when something is wrong; observability lets you explore the system's data and ask why, so you can debug novel problems efficiently.
Core Tradeoffs
- Monitoring vs. Observability: Relying on pre-defined dashboards and alerts for known failure modes versus collecting the raw data needed to investigate unknown issues.
- Data Granularity vs. Cost: High-cardinality metrics and verbose logging provide deep insights but are significantly more expensive to store, transmit, and query.
- Sampling vs. Full Ingestion (for Tracing): Sampling traces reduces overhead, but you risk missing the specific, rare request that exhibits a critical bug.
- Metrics vs. Logs: Metrics are cheap and efficient for aggregatable data, while logs provide rich, high-cardinality context for specific events.
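Two of the tradeoffs above can be made concrete with a little arithmetic. The following is a minimal sketch, not a reference implementation: the label names, the cardinalities, and the helper names `series_count` and `keep_trace` are all hypothetical. It shows how one high-cardinality label multiplies the number of time series a metric can produce, and how a deterministic, hash-based sampler keeps roughly a fixed fraction of traces while making the same decision for a given trace ID every time.

```python
import hashlib
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric.

    Each unique combination of label values is its own series, so the
    bound is the product of the per-label cardinalities.
    """
    return prod(label_cardinalities.values())


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, based only on its ID.

    The trace ID is hashed to a 64-bit integer and compared against a
    threshold proportional to sample_rate, so every service that sees
    the same ID reaches the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(sample_rate * 2**64)
    return bucket < threshold


# Hypothetical label sets for a single request-count metric.
low_cardinality = series_count({"method": 5, "status": 6, "region": 4})
high_cardinality = series_count(
    {"method": 5, "status": 6, "region": 4, "user_id": 100_000}
)

# Sample roughly 10% of 10,000 hypothetical trace IDs.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]
```

Adding the hypothetical `user_id` label raises the bound from 120 series to 12,000,000, which is why identifiers of that kind usually belong in logs or traces rather than metric labels. The sampler illustrates the other side of the tradeoff: about 90% of traces are discarded, and a rare failing request is as likely to be discarded as any other.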
Common Failure Modes
- Alert Fatigue: Poorly configured or overly sensitive alerts fire so often that engineers learn to ignore them, including the critical ones.
- “Garbage In, Garbage Out”: Unstructured logs or poorly named, untagged metrics are difficult to query and provide little value during an incident.
- Lack of Correlation: Having metrics, logs, and traces but no way to link them together (e.g., “show me the logs for this slow trace”).
- Focusing on the Wrong Signals: Monitoring low-level infrastructure metrics (like CPU) without a clear connection to user-facing impact (like error rates or latency).
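The unstructured-log and correlation problems above are often addressed together by emitting structured log lines that carry a shared identifier. The sketch below uses only Python's standard `logging` and `json` modules; the `trace_id` field name, the `checkout` logger name, and the helpers `JsonFormatter`, `build_logger`, and `logs_for_trace` are assumptions made for illustration, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is supplied by the caller via `extra`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def logs_for_trace(lines, trace_id):
    """Return the parsed log entries that belong to one trace."""
    entries = (json.loads(line) for line in lines)
    return [entry for entry in entries if entry["trace_id"] == trace_id]


# Write to an in-memory buffer so the example is self-contained.
buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment authorized", extra={"trace_id": "trace-a"})
log.warning("inventory lookup slow", extra={"trace_id": "trace-b"})
log.info("order confirmed", extra={"trace_id": "trace-a"})

# "Show me the logs for this slow trace."
matching = logs_for_trace(buffer.getvalue().splitlines(), "trace-b")
```

Because every line is machine-parseable and tagged with the same identifier a tracing system would use, the query "show me the logs for this slow trace" reduces to a filter on one field.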
Interview Signals
A strong candidate clearly articulates the difference between monitoring and observability. They can explain the role of each of the “Three Pillars” (metrics, logs, traces) and how the pillars complement each other. They should be able to define Google’s “Four Golden Signals” (latency, traffic, errors, and saturation) and discuss the practical challenges of implementing observability, such as cost and data correlation.
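As one way to make the golden signals concrete, the sketch below derives them from a window of request records. The `Request` record, the `golden_signals` helper, the choice of nearest-rank p99 for latency, the rule that a 5xx status counts as an error, and CPU utilization as the stand-in for saturation are all assumptions made for illustration; real systems choose these definitions per service.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")

    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), in integer math.
    rank = (99 * n + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p99_ms": durations[rank - 1],
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n,
        # Saturation is measured elsewhere and passed through unchanged.
        "saturation": cpu_utilization,
    }


# Hypothetical window: 100 requests over 10 seconds, 5 of them failing.
window = [
    Request(duration_ms=float(i), status=500 if i <= 5 else 200)
    for i in range(1, 101)
]
signals = golden_signals(window, window_seconds=10, cpu_utilization=0.72)
```

For this hypothetical window the summary works out to a p99 latency of 99 ms, 10 requests per second, and a 5% error rate, with saturation reported as supplied.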
Related Topics
- Reliability
- Performance
- Failure
- Debugging