A Guide to System Monitoring & Observability
This document outlines the core concepts, pillars, and practices for effectively monitoring modern software systems.
Table of Contents
- Monitoring vs. Observability
- The Three Pillars of Observability
- The Four Golden Signals
- Application Performance Monitoring (APM)
- Levels of Monitoring
- Key Takeaways
Monitoring vs. Observability
While often used interchangeably, monitoring and observability represent different approaches to understanding a system’s health.
- Monitoring: The process of collecting and analyzing data to watch for predefined problems. It tells you what is happening and when something is wrong based on known failure modes.
  Monitoring is about asking your system, “Are you okay?” based on a checklist of symptoms.
- Observability: The ability to ask new questions about your system without having to ship new code to answer them. It allows you to infer the internal state of a system from its external outputs and helps you discover why something is wrong, especially for unknown or unpredictable failures.
  Observability is about being able to explore and understand system behavior you didn’t predict.
In short: Monitoring tells you whether the system is working. Observability lets you ask why it isn’t.
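The difference can be made concrete with a small sketch. It assumes requests are recorded as structured events; the event fields (`status`, `region`, `version`), the 5% threshold, and both function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical structured request events; the field names are assumptions.
EVENTS = [
    {"status": 200, "region": "eu", "version": "1.4"},
    {"status": 500, "region": "eu", "version": "1.5"},
    {"status": 200, "region": "us", "version": "1.4"},
    {"status": 500, "region": "us", "version": "1.5"},
    {"status": 200, "region": "us", "version": "1.4"},
]


def error_rate(events):
    """Fraction of events with a 5xx status."""
    return sum(e["status"] >= 500 for e in events) / len(events)


def is_alerting(events, threshold=0.05):
    """Monitoring: a predefined check against a known failure mode."""
    return error_rate(events) > threshold


def error_rate_by(events, field):
    """Observability: slice the same data by a dimension chosen after the fact."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event)
    return {key: error_rate(group) for key, group in groups.items()}


print(is_alerting(EVENTS))               # True: something is wrong
print(error_rate_by(EVENTS, "version"))  # {'1.4': 0.0, '1.5': 1.0}
```

The alert only says that the error rate is too high. Grouping by `version`, a question nobody had to anticipate when the events were emitted, points at the 1.5 release as the likely cause.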
The Three Pillars of Observability
Observability is built on three core types of telemetry data that work together to provide a complete picture of system health.
1. Metrics
Metrics are numeric measurements recorded over time. They are efficient to store and process, making them ideal for building dashboards and triggering alerts. Common metric types include:
- Counter: A cumulative metric that only increases (e.g., requests served, errors).
- Gauge: A metric that can go up or down (e.g., current memory usage, active connections).
- Histogram: Samples observations (e.g., request durations) and counts them in configurable buckets, from which quantiles can be estimated at query time.
- Summary: Similar to a histogram, but calculates quantiles on the client side, so the resulting quantiles generally cannot be aggregated across instances.
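As a rough illustration of how the first three types behave, here is a minimal in-memory sketch. The class and method names are assumptions made for this example and do not mirror any particular metrics library; a real client library would also handle labels, concurrency, and export.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Cumulative value that only ever increases."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can move in either direction."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets defined by sorted upper bounds."""
    bounds: list[float]
    counts: list[int] = field(init=False)

    def __post_init__(self) -> None:
        # One extra bucket catches observations above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Coarse estimate: upper bound of the bucket that contains quantile q."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations recorded")
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            if running >= q * total:
                return bound
        return float("inf")


latency = Histogram(bounds=[0.1, 0.5, 1.0])  # bucket bounds in seconds
for seconds in (0.05, 0.2, 0.3, 0.7, 2.0):
    latency.observe(seconds)

print(latency.counts)                     # [1, 2, 1, 1]
print(latency.quantile_upper_bound(0.5))  # 0.5
```

The quantile here is only as precise as the bucket boundaries, which is the trade-off a histogram makes in exchange for cheap storage and aggregation.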
2. Logs
Logs are immutable, timestamped records of discrete events. They provide detailed, contextual information about what occurred at a specific point in time.
Best Practice: Structured Logging
Write logs in a structured format like JSON. This makes them machine-readable, allowing for powerful filtering and analysis.
- Bad (Unstructured):
  User 123 failed to log in.
- Good (Structured):
  {"timestamp": "...", "level": "WARN", "event": "LoginFailure", "userId": 123, "reason": "InvalidPassword"}
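One way to produce output of this shape is a custom formatter on top of Python's standard `logging` module. This is a minimal sketch, not a recommended production setup: the `JsonFormatter` class, the `build_logger` helper, the `auth` logger name, and the convention of passing fields under an `extra={"fields": ...}` key are all assumptions made for this example. Note that the standard library names the level `WARNING` rather than `WARN`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} is attached to the
        # record as an attribute named "fields" (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "auth") -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger()
logger.warning(
    "LoginFailure",
    extra={"fields": {"userId": 123, "reason": "InvalidPassword"}},
)
```

Because every line is valid JSON, a log pipeline can filter on `userId` or `reason` directly instead of parsing free text with regular expressions.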
3. Distributed Tracing
Tracing provides insight into the entire lifecycle of a request as it flows through a distributed system. It is essential for debugging bottlenecks and understanding dependencies in a microservices architecture.
- Trace: The end-to-end journey of a single request.
- Span: A single, named, and timed operation within a trace (e.g., an API call, a database query). A trace is a tree of spans.
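The tree structure can be sketched with a small data model. The `Span` class, its fields, and the span names below are hypothetical and not taken from any tracing library. The self-time calculation also assumes that child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A named, timed operation; child spans form the trace tree."""
    name: str
    start_ms: float
    end_ms: float
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self) -> float:
        # Time spent in this span itself, excluding its children.
        # Assumes children do not overlap in time.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(root: Span) -> Span:
    """Return the span where the most time was actually spent."""
    return max(walk(root), key=lambda s: s.self_time_ms)


# One trace: a root span with two child operations.
trace = Span("GET /checkout", 0, 120, children=[
    Span("auth-service call", 5, 25),
    Span("orders-db query", 30, 110),
])

print(slowest_by_self_time(trace).name)  # orders-db query
```

The root span is the longest in absolute terms, but most of its 120 ms is spent waiting on the database query, which is the kind of bottleneck tracing is meant to expose.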
The Four Golden Signals
Developed by Google’s Site Reliability Engineering (SRE) team, the Four Golden Signals are a set of key metrics that are essential for monitoring any user-facing system.
- Latency: The time it takes to service a request. It’s crucial to distinguish between the latency of successful requests and the latency of failed requests, because errors that fail fast can make overall latency look healthier than it is.
- Traffic: A measure of the demand on your system (e.g., requests per second).
- Errors: The rate of requests that fail, either explicitly (e.g., HTTP 5xx errors) or implicitly (e.g., a 200 OK response with incorrect content).
- Saturation: How “full” your service is. This is a measure of system utilization (e.g., CPU, memory) and warns of impending performance degradation.
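To show how the four signals relate to raw request data, here is a minimal sketch that computes them over one time window. The `Request` record, the function names, and the use of CPU utilization as the saturation figure are assumptions for illustration. Only explicit failures (HTTP 5xx) are counted as errors; detecting implicit failures would need content validation that is not shown.

```python
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def median_or_none(values: list[float]) -> Optional[float]:
    return statistics.median(values) if values else None


def golden_signals(requests: list[Request], window_seconds: float,
                   cpu_utilization: float) -> dict:
    """Summarize one time window of requests as the four signals."""
    if not requests:
        raise ValueError("no requests in window")
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        # Latency is reported separately for successes and failures.
        "latency_ok_p50_ms": median_or_none(ok),
        "latency_failed_p50_ms": median_or_none(failed),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests),
        "saturation": cpu_utilization,
    }


window = [
    Request(100.0, 200),
    Request(120.0, 200),
    Request(140.0, 200),
    Request(30.0, 500),
]
signals = golden_signals(window, window_seconds=2.0, cpu_utilization=0.72)
print(signals["latency_ok_p50_ms"], signals["latency_failed_p50_ms"])  # 120.0 30.0
print(signals["traffic_rps"], signals["error_rate"])                   # 2.0 0.25
```

The failed request completes in 30 ms, so a single combined median would understate what successful users experience.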
Application Performance Monitoring (APM)
APM is the practice of using software tools and telemetry data to monitor the performance of business-critical applications. It brings together metrics, logs, and traces to provide a holistic view of application health.
Key APM Metrics to Watch
- CPU Usage: Shows whether the application has the compute resources it needs.
- Response Times: Measures the latency users are experiencing.
- Error Rates: The frequency of errors (e.g., HTTP 500s, timeouts).
- Request Rate: The number of requests handled per second or per minute.
- Application Instances: The number of running servers or containers, which auto-scaling adjusts to match load.
- Uptime / Availability: The percentage of time the application is operational.
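Availability in particular reduces to simple arithmetic. The sketch below assumes availability is measured as the share of minutes in a window during which the application was operational; the function names and the 30-day window are illustrative choices.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Share of the window during which the application was operational."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(total_minutes: float, target_percent: float) -> float:
    """Downtime budget implied by an availability target."""
    return total_minutes * (100.0 - target_percent) / 100.0


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# 43.2 minutes of downtime in 30 days is roughly 99.9% availability.
print(round(availability_percent(MINUTES_IN_30_DAYS, 43.2), 2))      # 99.9
# A 99.9% target leaves roughly 43.2 minutes of downtime per 30 days.
print(round(allowed_downtime_minutes(MINUTES_IN_30_DAYS, 99.9), 2))  # 43.2
```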
Levels of Monitoring
Monitoring can be applied at different layers of the stack, each providing a different perspective.
| Level | Focus | Key Metrics | Purpose | Tools |
|---|---|---|---|---|
| Infrastructure | Health of the underlying hardware and network. | CPU usage, memory, disk space, network I/O. | Predict impending application failures. | Datadog, New Relic, Prometheus. |
| Service | Health of a specific service or API. | Latency, traffic, errors, saturation (Golden Signals). | Understand the performance of a single component. | Prometheus, Grafana. |
| Application | Overall health of the user-facing application. | Active users, sessions, business-specific metrics (e.g., cart additions). | Measure business impact and user behavior. | Google Analytics, Mixpanel. |
Key Takeaways
- Monitoring is for known failure modes; observability is for unknown or unpredicted ones.
- The Three Pillars of Observability (Metrics, Logs, and Traces) provide a comprehensive view of system health.
- The Four Golden Signals (Latency, Traffic, Errors, Saturation) are essential for monitoring any user-facing system.
- Monitoring should be applied at all levels of the stack: Infrastructure, Service, and Application.