What is Observability?
Observability is the ability to infer the internal state of a system by examining its outputs. Traditional monitoring tells you that something is wrong; observability helps you understand why, by providing the data needed to debug issues you did not anticipate.
Three Pillars of Observability
Logs
- Event records
- Contextual information
- Debugging details
Metrics
- Numerical measurements
- Time-series data
- Performance indicators
Traces
- Request flow
- Distributed systems
- Latency breakdown
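As a rough illustration of how the three signals differ, the sketch below emits one log, one metric sample, and one trace span for a single request. The in-memory lists, the `handle_request` function, and the field names are hypothetical stand-ins for real backends, not any particular library's API.

```python
import json
import time
import uuid
from collections import defaultdict

# Hypothetical in-memory sinks standing in for real log, metric, and trace backends.
LOGS = []
METRICS = defaultdict(list)
SPANS = []


def handle_request(path):
    """Do some placeholder work and record one log, one metric sample, and one span."""
    trace_id = uuid.uuid4().hex  # shared ID that ties the three signals together
    start = time.perf_counter()
    sum(range(1000))  # placeholder for real request handling
    duration_ms = (time.perf_counter() - start) * 1000

    # Log: a discrete event record with contextual fields.
    LOGS.append(json.dumps({"event": "request_handled", "path": path, "trace_id": trace_id}))
    # Metric: a numeric sample with a timestamp, suitable for a time series.
    METRICS["request_duration_ms"].append((time.time(), duration_ms))
    # Trace span: one timed step of the request flow.
    SPANS.append({"trace_id": trace_id, "name": f"GET {path}", "duration_ms": duration_ms})
    return trace_id


trace_id = handle_request("/checkout")
```

Because all three records carry the same `trace_id`, a slow span can be traced back to the matching log line.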
Monitoring vs. Observability
| Monitoring | Observability |
|---|---|
| Known unknowns | Unknown unknowns |
| Predefined dashboards | Ad-hoc exploration |
| Alert on thresholds | Debug novel issues |
| What is broken | Why it's broken |
Key Concepts
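Cardinality: The number of unique values a metric's labels can take. Each distinct label combination creates a separate time series, so high-cardinality labels such as user IDs drive up storage and query cost.
Sampling: Collecting a representative subset of telemetry (most often traces) to limit overhead and cost.
Correlation: Linking logs, metrics, and traces through shared identifiers such as a trace ID.
Context: Metadata and dimensions (for example service, version, region) attached to each signal.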
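Cardinality and sampling are easier to reason about with numbers. The following is a minimal sketch, assuming metric samples are represented as plain label dictionaries and traces are identified by a string ID. `series_cardinality` and `keep_trace` are illustrative names, and the CRC32-based bucketing is only one simple way to sample deterministically.

```python
import zlib


def series_cardinality(samples):
    """Count the distinct label combinations (time series) across metric samples."""
    return len({tuple(sorted(labels.items())) for labels in samples})


def keep_trace(trace_id, rate):
    """Deterministic sampling: the same trace ID always gets the same decision.

    The ID is hashed into one of 10,000 buckets; the trace is kept when its
    bucket falls below the requested rate (0.0 keeps nothing, 1.0 keeps all).
    """
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < rate * 10_000


samples = [
    {"route": "/a", "status": "200"},
    {"route": "/a", "status": "500"},
    {"route": "/a", "status": "200"},  # duplicate combination, not a new series
]
print(series_cardinality(samples))
print(keep_trace("4bf92f3577b34da6", 0.1))
```

Hashing the trace ID rather than drawing a random number means every service that sees the same ID makes the same keep-or-drop decision, so sampled traces stay complete.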
Observability in AI Systems
LLM Observability
- Prompt/response logging
- Token usage
- Latency metrics
- Error tracking
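The four items above can be captured in a single wrapper around the model call. This is a sketch under stated assumptions: `llm_fn` is any hypothetical callable that takes a prompt string and returns a response string, and whitespace word counts stand in for real tokenizer counts, which depend on the model.

```python
import json
import logging
import time

logger = logging.getLogger("llm")


def observed_call(llm_fn, prompt, emit=logger.info):
    """Call llm_fn(prompt) and emit one structured JSON record per call."""
    record = {"prompt": prompt}
    start = time.perf_counter()
    try:
        response = llm_fn(prompt)
        record["response"] = response
        # Assumption: word counts approximate token usage; use the model's tokenizer in practice.
        record["prompt_tokens"] = len(prompt.split())
        record["completion_tokens"] = len(response.split())
        record["status"] = "ok"
        return response
    except Exception as exc:
        # Error tracking: record the failure, then let the caller handle it.
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        # Latency is recorded for both successful and failed calls.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(record))
```

With the default `emit`, records go through the standard `logging` module, so `logging.basicConfig(level=logging.INFO)` is needed to see them. Passing `emit=some_list.append` collects records in memory instead.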
RAG Observability
- Retrieval quality
- Context relevance
- Generation accuracy
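Retrieval quality is the easiest of these to quantify when labeled relevant documents are available. The sketch below shows three common retrieval metrics; the function names and document IDs are illustrative. Context relevance and generation accuracy usually need human or model-based judgments and are not covered here.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["d3", "d1", "d7"]   # ranked retriever output (hypothetical IDs)
relevant = {"d1", "d2"}          # labeled ground truth for the query
print(precision_at_k(retrieved, relevant, 3))
print(recall_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant))
```

Tracking these per query over time makes it possible to tell whether a drop in answer quality starts at retrieval or at generation.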
Tools and Platforms
Open Source
- Prometheus
- Grafana
- Jaeger
- OpenTelemetry
Commercial
- Datadog
- New Relic
- Splunk
- Dynatrace
Best Practices
- Standardize instrumentation
- Use structured logging
- Implement distributed tracing
- Define SLIs and SLOs
- Correlate signals
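To make "Define SLIs and SLOs" concrete, here is a minimal sketch of an availability SLI and the error budget derived from an SLO target. The function names and the request counts are made up for illustration.

```python
def availability_sli(good_requests, total_requests):
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli, slo):
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = 1.0 - slo  # allowed failure rate, e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    return 1.0 - (1.0 - sli) / budget


sli = availability_sli(good_requests=99_950, total_requests=100_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.1%}")
```

In this example 50 of 100,000 requests failed against a 99.9% target that allows 100 failures, so about half of the error budget is left.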