---
name: observability-design
description: "Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.

## Core Principles

### Three Pillars of Observability

- **Logs**: Discrete events with context (who, what, when, where)
- **Metrics**: Numeric measurements aggregated over time (rates, histograms, gauges)
- **Traces**: End-to-end request flow across services and boundaries

### Observability Is Not Monitoring

- Monitoring tells you when something is broken (known unknowns)
- Observability lets you ask questions about why something is broken (unknown unknowns)
- Design for observability: emit enough data to diagnose novel problems

### Observability by Design

- Observability must be designed into the architecture, not bolted on afterward
- Every service must emit structured logs, metrics, and traces from day one
- Every external integration must have observability hooks

## Logs

### Log Levels

- **ERROR**: Something failed that requires investigation (not all errors are ERROR level)
- **WARN**: Something unexpected happened but the system can continue
- **INFO**: Business-significant events (order created, payment processed, user registered)
- **DEBUG**: Detailed information for debugging (only in development, not in production)
- **TRACE**: Very detailed information (almost never used in production)

### Structured Logging

- Use JSON format for all logs
- Every log entry must include: timestamp, level, service name, correlation ID
- Include relevant context: user ID, request ID, entity IDs, error details
- Never log sensitive data: passwords, tokens, PII, secrets

### Log Aggregation

- Send all logs to a centralized log aggregation system
- Define the log retention period based on compliance requirements
- Define log access controls (who can see which logs)
- Consider log volume and cost (log only what you need)

## Metrics

### Metric Types

- **Counter**: Monotonically increasing value (request count, error count)
- **Gauge**: Point-in-time value (active connections, queue depth)
- **Histogram**: Distribution of values (request latency, payload size)
- **Summary**: Pre-calculated quantiles (p50, p90, p99 latency)

### Key Business Metrics

- Orders per minute
- Revenue per minute
- Active users
- Conversion rate
- Cart abandonment rate

### Key System Metrics

- Request rate (requests per second per endpoint)
- Error rate (4xx rate, 5xx rate per endpoint)
- Latency (p50, p90, p99 per endpoint)
- Queue depth and age
- Database connection pool usage
- Cache hit rate
- Memory and CPU usage per service

### Metric Naming Convention

- Use dot-separated names: `service.operation.metric`
- Include units in the name or metadata: `request.duration.milliseconds`
- Use consistent labels: `method`, `endpoint`, `status_code`, `tenant_id`

## Traces

### Distributed Tracing

- Every request gets a trace ID that propagates across all services
- Every operation within a request gets a span with an operation name, start time, and duration
- Span boundaries: service calls, database queries, external API calls, queue operations

### Correlation ID Propagation

- Generate a correlation ID at the request entry point
- Propagate the correlation ID through all service calls (headers, message metadata)
- Include the correlation ID in all logs, metrics, and error responses
- Use the correlation ID to trace a request end-to-end across all services

### Span Design

- Include relevant context in spans: user ID, entity IDs, operation type
- Tag spans with error information when operations fail
- Keep span cardinality reasonable (avoid high-cardinality attributes as tags)

## Alerts

### Alert Design Principles

- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must have a clear runbook or remediation steps
- Every alert must be actionable (if you can't act on it, don't alert on it)
- Avoid alert fatigue: set thresholds based on SLOs, not arbitrary numbers

### Alert Categories

- **Page-worthy**: System is broken, immediate action required (high error rate, service down)
- **Ticket-worthy**: Degradation that needs investigation soon (rising latency, approaching limits)
- **Log-worthy**: Informational, no immediate action (deployment completed, config changed)

### Alert Thresholds

- Base alert thresholds on SLOs, not arbitrary numbers
- Use burn-rate alerting: alert when the error budget is burning too fast
- Define escalation paths: who gets paged, who gets a ticket, who gets an email

## SLOs (Service Level Objectives)

### SLO Design

- Define SLOs based on user impact, not internal metrics
- Typical SLO categories:
  - **Availability**: % of requests that succeed (e.g., 99.9%)
  - **Latency**: % of requests that complete within a threshold (e.g., p99 < 500ms)
  - **Correctness**: % of operations that produce correct results
  - **Freshness**: % of data that is within the staleness threshold

### Error Budget

- Error budget = 100% - SLO target
- If the SLO is 99.9% over a monthly window, the error budget is 0.1% of requests per month
- Track the error budget burn rate: how fast are we consuming the budget?
- When the error budget is exhausted, focus shifts from feature development to reliability

### SLO Framework

- Define the SLO (what we promise)
- Define the SLI (how we measure it)
- Define the error budget (what we can afford to fail)
- Define the alerting (when we're burning budget too fast)

## Anti-Patterns

- **Logging everything**: Generates noise, increases cost, makes debugging harder
- **Missing correlation ID**: Can't trace requests across services
- **Alerting on causes, not symptoms**: Alerts fire but users aren't impacted
- **Missing business metrics**: Can't tell if the system is serving users well
- **High-cardinality metrics**: Explosive metric count, expensive to store and query
- **Missing observability for external calls**: External integration failures are invisible
- **Logging sensitive data**: Passwords, tokens, PII in logs
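The structured-logging rules above (JSON format; mandatory timestamp, level, service name, and correlation ID fields) can be sketched with Python's stdlib `logging` module. This is a minimal illustration, not a prescribed implementation; the service name `order-service` and the field names are examples only:

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a JSON object with the mandatory fields."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            # Attached per-request via logging's `extra` mechanism; falls
            # back to None if a call site forgets to pass it.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generated at the request entry point, then propagated downstream
# (headers, message metadata) so every service logs the same ID.
correlation_id = str(uuid.uuid4())
logger.info("order created", extra={"correlation_id": correlation_id})
```

In a real service the correlation ID would be read from an incoming header (or generated if absent) by middleware rather than created ad hoc, so that every log line in the request's path carries the same value.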
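The latency guidance above insists on p50/p90/p99 rather than averages because tail latency disappears in a mean. A toy sketch using the nearest-rank percentile method makes this concrete; the sample data is invented for illustration:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty sample list (p in (0, 100])."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies: mostly fast, two slow outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 500, 16, 12, 18]

for p in (50, 90, 99):
    print(f"request.duration.milliseconds p{p} = {percentile(latencies_ms, p)}")
# p50 = 14, p90 = 240, p99 = 500 -- while the mean (~85ms) hides
# the fact that 1 in 10 users waits a quarter second or more.
```

This is why the metric conventions above ask for histograms per endpoint: percentiles must be computed from the distribution, and pre-aggregated averages cannot be turned back into p99s.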
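The error-budget and burn-rate arithmetic above can also be sketched in a few lines. This is a simplified single-window model (real burn-rate alerting typically combines multiple lookback windows); the function names and thresholds are illustrative:

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    budget = 1.0 - slo_target                      # e.g. 99.9% SLO -> 0.1% budget
    allowed_failures = budget * total_requests
    return 1.0 - (failed_requests / allowed_failures)


def burn_rate(slo_target, window_requests, window_failures):
    """Observed error rate divided by the budgeted error rate.

    A burn rate of 1.0 consumes exactly the whole budget over one SLO
    window; anything sustained above 1.0 exhausts the budget early.
    """
    budget = 1.0 - slo_target
    observed_error_rate = window_failures / window_requests
    return observed_error_rate / budget


# 99.9% SLO, 1,000,000 requests this month, 300 failures so far:
# 1,000 failures are allowed, so 70% of the budget remains.
print(round(error_budget_remaining(0.999, 1_000_000, 300), 6))  # 0.7

# Last hour: 10,000 requests, 50 failed -> 0.5% error rate,
# burning the 0.1% budget 5x faster than sustainable. Page-worthy.
print(round(burn_rate(0.999, 10_000, 50), 6))  # 5.0
```

Tying the alert threshold to burn rate rather than a raw error count is what makes the alert SLO-based: it fires only when the current failure rate, if sustained, would break the promise made to users.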