| name | description |
|---|---|
| observability-design | Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy. |
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.
## Core Principles

### Three Pillars of Observability
- Logs: Discrete events with context (who, what, when, where)
- Metrics: Numeric measurements aggregated over time (rates, histograms, gauges)
- Traces: End-to-end request flow across services and boundaries
### Observability Is Not Monitoring
- Monitoring tells you when something is broken (known unknowns)
- Observability lets you ask questions about why something is broken (unknown unknowns)
- Design for observability: emit enough data to diagnose novel problems
### Observability by Design
- Observability must be designed into the architecture, not bolted on after
- Every service must emit structured logs, metrics, and traces from day one
- Every external integration must have observability hooks
## Logs

### Log Levels
- ERROR: Something failed that requires investigation (not all errors are ERROR level)
- WARN: Something unexpected happened but the system can continue
- INFO: Business-significant events (order created, payment processed, user registered)
- DEBUG: Detailed information for debugging (only in development, not in production)
- TRACE: Very detailed information (almost never used in production)
### Structured Logging
- Use JSON format for all logs
- Every log entry must include: timestamp, level, service name, correlation ID
- Include relevant context: user ID, request ID, entity IDs, error details
- Never log sensitive data: passwords, tokens, PII, secrets
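The rules above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed implementation: the service name `order-service` and the field names `correlation_id` and `context` are hypothetical choices, and a real service would typically use a structured-logging library instead.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",  # hypothetical service name
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # Merge request context (user ID, entity IDs) -- never secrets or PII.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A business-significant INFO event with correlation ID and entity context.
logger.info("order created",
            extra={"correlation_id": "abc-123", "context": {"order_id": 42}})
```

Each line is a self-describing JSON object, so the aggregation system can index and filter on `correlation_id` without parsing free-form text.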
### Log Aggregation
- Send all logs to a centralized log aggregation system
- Define log retention period based on compliance requirements
- Define log access controls (who can see what logs)
- Consider log volume and cost (log only what you need)
## Metrics

### Metric Types
- Counter: Monotonically increasing value (request count, error count)
- Gauge: Point-in-time value (active connections, queue depth)
- Histogram: Distribution of values (request latency, payload size)
- Summary: Pre-calculated quantiles (p50, p90, p99 latency)
### Key Business Metrics
- Orders per minute
- Revenue per minute
- Active users
- Conversion rate
- Cart abandonment rate
### Key System Metrics
- Request rate (requests per second per endpoint)
- Error rate (4xx rate, 5xx rate per endpoint)
- Latency (p50, p90, p99 per endpoint)
- Queue depth and age
- Database connection pool usage
- Cache hit rate
- Memory and CPU usage per service
### Metric Naming Convention
- Use dot-separated names: `service.operation.metric`
- Include units in the name or metadata: `request.duration.milliseconds`
- Use consistent labels: `method`, `endpoint`, `status_code`, `tenant_id`
## Traces

### Distributed Tracing
- Every request gets a trace ID that propagates across all services
- Every operation within a request gets a span with operation name, start time, duration
- Span boundaries: service calls, database queries, external API calls, queue operations
### Correlation ID Propagation
- Generate a correlation ID at the request entry point
- Propagate correlation ID through all service calls (headers, message metadata)
- Include correlation ID in all logs, metrics, and error responses
- Use correlation ID to trace a request end-to-end across all services
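A minimal sketch of that lifecycle, assuming an HTTP service: the header name `X-Correlation-ID` is a common convention (not mandated here), and `contextvars` makes the ID available to logging without threading it through every function call.

```python
import uuid
from contextvars import ContextVar

# Holds the correlation ID for the current request; async- and thread-safe.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

HEADER = "X-Correlation-ID"  # conventional header name; adjust to your stack

def extract_or_generate(headers: dict) -> str:
    """At the entry point: reuse the caller's ID, or mint one if absent."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def outgoing_headers() -> dict:
    """Attach the current correlation ID to every downstream call."""
    return {HEADER: correlation_id.get()}
```

Middleware calls `extract_or_generate` once per request; every HTTP client, queue producer, and log formatter then reads `correlation_id.get()` so the same ID appears end-to-end.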
### Span Design
- Include relevant context in spans: user ID, entity IDs, operation type
- Tag spans with error information when operations fail
- Keep span cardinality reasonable (avoid high-cardinality attributes as tags)
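As an illustration of span shape only (a real system would use a tracing SDK such as OpenTelemetry rather than this in-memory list), a span carries a name, duration, low-cardinality attributes, and error information on failure:

```python
import time
from contextlib import contextmanager

spans = []  # stand-in for a tracer's exporter

@contextmanager
def span(name, **attributes):
    """Record a named span with duration and low-cardinality tags."""
    start = time.monotonic()
    record = {"name": name, "attributes": attributes, "error": None}
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)  # tag the span when the operation fails
        raise
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        spans.append(record)

# Span boundaries from the list above: e.g. a database query.
with span("db.query", operation="select", user_id="u-1"):
    time.sleep(0.01)  # stand-in for the actual database call
```

Note the attributes are bounded values (`operation="select"`), not free-form strings, keeping tag cardinality reasonable.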
## Alerts

### Alert Design Principles
- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must have a clear runbook or remediation steps
- Every alert must be actionable (if you can't act on it, don't alert on it)
- Avoid alert fatigue: a stream of noisy alerts teaches responders to ignore the ones that matter
### Alert Categories
- Page-worthy: System is broken, immediate action required (high error rate, service down)
- Ticket-worthy: Degradation that needs investigation soon (rising latency, approaching limits)
- Log-worthy: Informational, no immediate action (deployment completed, config changed)
### Alert Thresholds
- Base alert thresholds on SLOs, not arbitrary numbers
- Use burn rate alerting: alert when the error budget is burning too fast
- Define escalation paths: who gets paged, who gets a ticket, who gets an email
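Burn rate is simply the observed error ratio divided by the ratio the SLO allows. A minimal sketch, assuming a 30-day SLO window; the paging threshold of 14.4 is the commonly cited example from Google's SRE Workbook (a rate that would exhaust a 30-day budget in about two days), not a value this contract mandates:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed we are consuming error budget."""
    budget_ratio = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget_ratio

def should_page(error_ratio: float, slo: float, threshold: float = 14.4) -> bool:
    """Page when the short-window burn rate is high enough to exhaust
    a 30-day budget in roughly two days."""
    return burn_rate(error_ratio, slo) >= threshold
```

A burn rate of 1.0 means the budget lasts exactly the SLO window; at 20.0, a 0.1% monthly budget is gone in about 36 hours, which is why such readings are page-worthy while lower rates become tickets.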
## SLOs (Service Level Objectives)

### SLO Design
- Define SLOs based on user impact, not internal metrics
- Typical SLO categories:
- Availability: % of requests that succeed (e.g., 99.9%)
- Latency: % of requests that complete within a threshold (e.g., p99 < 500ms)
- Correctness: % of operations that produce correct results
- Freshness: % of data that is within staleness threshold
### Error Budget
- Error budget = 100% - SLO target
- If the SLO is 99.9% over a 30-day window, the error budget is 0.1% of requests in that window (equivalent to roughly 43 minutes of full downtime)
- Track error budget burn rate: how fast are we consuming the budget?
- When error budget is exhausted, focus shifts from feature development to reliability
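The arithmetic is worth making explicit. A small sketch (the 30-day window is an assumption; use whatever window your SLO defines):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO permits over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_consumed(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget already spent (1.0 = exhausted)."""
    return bad_minutes / error_budget_minutes(slo, window_days)

# 99.9% over 30 days allows about 43.2 minutes of downtime;
# a 20-minute outage therefore consumes roughly 46% of the monthly budget.
```

Tracking `budget_consumed` over time gives the burn rate the alerting section keys on, and crossing 1.0 is the trigger for shifting effort from features to reliability.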
### SLO Framework
- Define the SLO (what we promise)
- Define the SLI (how we measure it)
- Define the error budget (what we can afford to fail)
- Define the alerting (when we're burning budget too fast)
## Anti-Patterns
- Logging everything: Generates noise, increases cost, makes debugging harder
- Missing correlation ID: Can't trace requests across services
- Alerting on causes, not symptoms: Alerts fire but users aren't impacted
- Missing business metrics: Can't tell if the system is serving users well
- High-cardinality metrics: Explosive metric count, expensive to store and query
- Missing observability for external calls: External integration failures are invisible
- Logging sensitive data: Passwords, tokens, PII in logs