---
name: observability-design
description: "Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.

## Core Principles

### Three Pillars of Observability

- **Logs**: Discrete events with context (who, what, when, where)
- **Metrics**: Numeric measurements aggregated over time (rates, histograms, gauges)
- **Traces**: End-to-end request flow across services and boundaries

### Observability Is Not Monitoring

- Monitoring tells you when something is broken (known unknowns)
- Observability lets you ask questions about why something is broken (unknown unknowns)
- Design for observability: emit enough data to diagnose novel problems

### Observability by Design

- Observability must be designed into the architecture, not bolted on afterward
- Every service must emit structured logs, metrics, and traces from day one
- Every external integration must have observability hooks

## Logs

### Log Levels

- **ERROR**: Something failed that requires investigation (not all errors are ERROR level)
- **WARN**: Something unexpected happened but the system can continue
- **INFO**: Business-significant events (order created, payment processed, user registered)
- **DEBUG**: Detailed information for debugging (only in development, not in production)
- **TRACE**: Very detailed information (almost never used in production)

### Structured Logging

- Use JSON format for all logs
- Every log entry must include: timestamp, level, service name, correlation ID
- Include relevant context: user ID, request ID, entity IDs, error details
- Never log sensitive data: passwords, tokens, PII, secrets

### Log Aggregation

- Send all logs to a centralized log aggregation system
- Define the log retention period based on compliance requirements
- Define log access controls (who can see which logs)
- Consider log volume and cost (log only what you need)

## Metrics

### Metric Types

- **Counter**: Monotonically increasing value (request count, error count)
- **Gauge**: Point-in-time value (active connections, queue depth)
- **Histogram**: Distribution of values (request latency, payload size)
- **Summary**: Pre-calculated quantiles (p50, p90, p99 latency)

### Key Business Metrics

- Orders per minute
- Revenue per minute
- Active users
- Conversion rate
- Cart abandonment rate

### Key System Metrics

- Request rate (requests per second per endpoint)
- Error rate (4xx rate, 5xx rate per endpoint)
- Latency (p50, p90, p99 per endpoint)
- Queue depth and age
- Database connection pool usage
- Cache hit rate
- Memory and CPU usage per service

### Metric Naming Convention

- Use dot-separated names: `service.operation.metric`
- Include units in the name or metadata: `request.duration.milliseconds`
- Use consistent labels: `method`, `endpoint`, `status_code`, `tenant_id`

## Traces

### Distributed Tracing

- Every request gets a trace ID that propagates across all services
- Every operation within a request gets a span with an operation name, start time, and duration
- Span boundaries: service calls, database queries, external API calls, queue operations

### Correlation ID Propagation

- Generate a correlation ID at the request entry point
- Propagate the correlation ID through all service calls (headers, message metadata)
- Include the correlation ID in all logs, metrics, and error responses
- Use the correlation ID to trace a request end-to-end across all services

### Span Design

- Include relevant context in spans: user ID, entity IDs, operation type
- Tag spans with error information when operations fail
- Keep span cardinality reasonable (avoid high-cardinality attributes as tags)

## Alerts

### Alert Design Principles

- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must have a clear runbook or remediation steps
- Every alert must be actionable (if you can't act on it, don't alert on it)
- Avoid alert fatigue: set thresholds based on SLOs, not arbitrary numbers

### Alert Categories

- **Page-worthy**: System is broken, immediate action required (high error rate, service down)
- **Ticket-worthy**: Degradation that needs investigation soon (rising latency, approaching limits)
- **Log-worthy**: Informational, no immediate action (deployment completed, config changed)

### Alert Thresholds

- Base alert thresholds on SLOs, not arbitrary numbers
- Use burn-rate alerting: alert when the error budget is burning too fast
- Define escalation paths: who gets paged, who gets a ticket, who gets an email

## SLOs (Service Level Objectives)

### SLO Design

- Define SLOs based on user impact, not internal metrics
- Typical SLO categories:
  - **Availability**: % of requests that succeed (e.g., 99.9%)
  - **Latency**: % of requests that complete within a threshold (e.g., p99 < 500ms)
  - **Correctness**: % of operations that produce correct results
  - **Freshness**: % of data that is within the staleness threshold

### Error Budget

- Error budget = 100% - SLO target
- If the SLO is 99.9% over a monthly window, the error budget is 0.1% of requests per month
- Track the error budget burn rate: how fast are we consuming the budget?
- When the error budget is exhausted, focus shifts from feature development to reliability

### SLO Framework

- Define the SLO (what we promise)
- Define the SLI (how we measure it)
- Define the error budget (what we can afford to fail)
- Define the alerting (when we're burning budget too fast)

## Anti-Patterns

- **Logging everything**: Generates noise, increases cost, makes debugging harder
- **Missing correlation ID**: Can't trace requests across services
- **Alerting on causes, not symptoms**: Alerts fire but users aren't impacted
- **Missing business metrics**: Can't tell if the system is serving users well
- **High-cardinality metrics**: Explosive metric count, expensive to store and query
- **Missing observability for external calls**: External integration failures are invisible
- **Logging sensitive data**: Passwords, tokens, PII in logs
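The structured-logging rules above (JSON format; mandatory timestamp, level, service name, and correlation ID fields) can be sketched with Python's stdlib `logging` module. This is a minimal illustration, not a prescribed implementation; the service name `order-service` and the field names are examples only:

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a JSON object with the mandatory fields."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            # Attached per-request via logging's `extra` mechanism; falls
            # back to None if a call site forgets to pass it.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generated at the request entry point, then propagated downstream
# (headers, message metadata) so every service logs the same ID.
correlation_id = str(uuid.uuid4())
logger.info("order created", extra={"correlation_id": correlation_id})
```

In a real service the correlation ID would be read from an incoming header (or generated if absent) by middleware rather than created ad hoc, so that every log line in the request's path carries the same value.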
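The latency guidance above insists on p50/p90/p99 rather than averages because tail latency disappears in a mean. A toy sketch using the nearest-rank percentile method makes this concrete; the sample data is invented for illustration:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty sample list (p in (0, 100])."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies: mostly fast, two slow outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 500, 16, 12, 18]

for p in (50, 90, 99):
    print(f"request.duration.milliseconds p{p} = {percentile(latencies_ms, p)}")
# p50 = 14, p90 = 240, p99 = 500 -- while the mean (~85ms) hides
# the fact that 1 in 10 users waits a quarter second or more.
```

This is why the metric conventions above ask for histograms per endpoint: percentiles must be computed from the distribution, and pre-aggregated averages cannot be turned back into p99s.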
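The error-budget and burn-rate arithmetic above can also be sketched in a few lines. This is a simplified single-window model (real burn-rate alerting typically combines multiple lookback windows); the function names and thresholds are illustrative:

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    budget = 1.0 - slo_target                      # e.g. 99.9% SLO -> 0.1% budget
    allowed_failures = budget * total_requests
    return 1.0 - (failed_requests / allowed_failures)


def burn_rate(slo_target, window_requests, window_failures):
    """Observed error rate divided by the budgeted error rate.

    A burn rate of 1.0 consumes exactly the whole budget over one SLO
    window; anything sustained above 1.0 exhausts the budget early.
    """
    budget = 1.0 - slo_target
    observed_error_rate = window_failures / window_requests
    return observed_error_rate / budget


# 99.9% SLO, 1,000,000 requests this month, 300 failures so far:
# 1,000 failures are allowed, so 70% of the budget remains.
print(round(error_budget_remaining(0.999, 1_000_000, 300), 6))  # 0.7

# Last hour: 10,000 requests, 50 failed -> 0.5% error rate,
# burning the 0.1% budget 5x faster than sustainable. Page-worthy.
print(round(burn_rate(0.999, 10_000, 50), 6))  # 5.0
```

Tying the alert threshold to burn rate rather than a raw error count is what makes the alert SLO-based: it fires only when the current failure rate, if sustained, would break the promise made to users.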