
---
name: observability-design
description: Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy.
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.

Core Principles

Three Pillars of Observability

  • Logs: Discrete events with context (who, what, when, where)
  • Metrics: Numeric measurements aggregated over time (rates, histograms, gauges)
  • Traces: End-to-end request flow across services and boundaries

Observability Is Not Monitoring

  • Monitoring tells you when something is broken (known unknowns)
  • Observability lets you ask questions about why something is broken (unknown unknowns)
  • Design for observability: emit enough data to diagnose novel problems

Observability by Design

  • Observability must be designed into the architecture, not bolted on after
  • Every service must emit structured logs, metrics, and traces from day one
  • Every external integration must have observability hooks

Logs

Log Levels

  • ERROR: Something failed that requires investigation (not all errors are ERROR level)
  • WARN: Something unexpected happened but the system can continue
  • INFO: Business-significant events (order created, payment processed, user registered)
  • DEBUG: Detailed information for debugging (only in development, not in production)
  • TRACE: Very detailed information (almost never used in production)

Structured Logging

  • Use JSON format for all logs
  • Every log entry must include: timestamp, level, service name, correlation ID
  • Include relevant context: user ID, request ID, entity IDs, error details
  • Never log sensitive data: passwords, tokens, PII, secrets
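The structured-logging rules above can be sketched with the Python standard library's logging module and a JSON formatter. This is a minimal illustration, not a prescribed implementation; the service name and field names shown are example values chosen here, and in practice you would pull the correlation ID from request context.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the required fields."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "checkout",  # example service name
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Business-significant event at INFO level, with context but no sensitive data
logger.info("order created", extra={"correlation_id": "req-123"})
```

Every entry then carries timestamp, level, service name, and correlation ID by construction, so downstream aggregation can filter and join on them.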

Log Aggregation

  • Send all logs to a centralized log aggregation system
  • Define log retention period based on compliance requirements
  • Define log access controls (who can see what logs)
  • Consider log volume and cost (log only what you need)

Metrics

Metric Types

  • Counter: Monotonically increasing value (request count, error count)
  • Gauge: Point-in-time value (active connections, queue depth)
  • Histogram: Distribution of values (request latency, payload size)
  • Summary: Pre-calculated quantiles (p50, p90, p99 latency)
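To make the distinction between the first three metric types concrete, here is a minimal in-memory sketch. Real metric libraries (e.g. Prometheus clients) track cumulative buckets and export formats; this simplified version only illustrates the semantics: counters never decrease, gauges move freely, histograms bucket observations. The bucket bounds are arbitrary example values.

```python
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing value (e.g. request count, error count)."""
    value: int = 0

    def inc(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("counters only go up")
        self.value += n


@dataclass
class Gauge:
    """Point-in-time value (e.g. active connections); may go up or down."""
    value: float = 0.0

    def set(self, v: float) -> None:
        self.value = v


@dataclass
class Histogram:
    """Distribution of observations, bucketed by upper bound (e.g. latency)."""
    buckets: tuple = (0.05, 0.1, 0.5, 1.0, float("inf"))
    counts: dict = field(default_factory=dict)

    def observe(self, v: float) -> None:
        # Count each observation in the first bucket whose bound it fits under
        # (non-cumulative, unlike Prometheus-style cumulative buckets).
        for bound in self.buckets:
            if v <= bound:
                self.counts[bound] = self.counts.get(bound, 0) + 1
                break
```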

Key Business Metrics

  • Orders per minute
  • Revenue per minute
  • Active users
  • Conversion rate
  • Cart abandonment rate

Key System Metrics

  • Request rate (requests per second per endpoint)
  • Error rate (4xx rate, 5xx rate per endpoint)
  • Latency (p50, p90, p99 per endpoint)
  • Queue depth and age
  • Database connection pool usage
  • Cache hit rate
  • Memory and CPU usage per service

Metric Naming Convention

  • Use dot-separated names: service.operation.metric
  • Include units in the name or metadata: request.duration.milliseconds
  • Use consistent labels: method, endpoint, status_code, tenant_id

Traces

Distributed Tracing

  • Every request gets a trace ID that propagates across all services
  • Every operation within a request gets a span with operation name, start time, duration
  • Span boundaries: service calls, database queries, external API calls, queue operations

Correlation ID Propagation

  • Generate a correlation ID at the request entry point
  • Propagate correlation ID through all service calls (headers, message metadata)
  • Include correlation ID in all logs, metrics, and error responses
  • Use correlation ID to trace a request end-to-end across all services
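The propagation rules above can be sketched as a pair of helpers: reuse an inbound correlation ID if the caller supplied one, otherwise generate one at the entry point, and attach it to every outbound call. The header name used here is an assumption for illustration; use whatever header or message-metadata key your platform standardizes on.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # example header name, not a standard


def ensure_correlation_id(incoming_headers: dict) -> str:
    """Reuse the caller's correlation ID, or generate one at the entry point."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())


def outgoing_headers(correlation_id: str) -> dict:
    """Attach the correlation ID to every downstream service call."""
    return {CORRELATION_HEADER: correlation_id}


def handle_request(incoming_headers: dict) -> dict:
    cid = ensure_correlation_id(incoming_headers)
    # ... include cid in every log line, metric label set (if cardinality
    # allows), and error response produced while handling this request ...
    return outgoing_headers(cid)
```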

Span Design

  • Include relevant context in spans: user ID, entity IDs, operation type
  • Tag spans with error information when operations fail
  • Keep span cardinality reasonable (avoid high-cardinality attributes as tags)

Alerts

Alert Design Principles

  • Alert on symptoms, not causes (user impact, not internal metrics)
  • Every alert must have a clear runbook or remediation steps
  • Every alert must be actionable (if you can't act on it, don't alert on it)
  • Avoid alert fatigue: noisy, non-actionable alerts train responders to ignore the ones that matter

Alert Categories

  • Page-worthy: System is broken, immediate action required (high error rate, service down)
  • Ticket-worthy: Degradation that needs investigation soon (rising latency, approaching limits)
  • Log-worthy: Informational, no immediate action (deployment completed, config changed)

Alert Thresholds

  • Base alert thresholds on SLOs, not arbitrary numbers
  • Use burn rate alerting: alert when the error budget is burning too fast
  • Define escalation paths: who gets paged, who gets a ticket, who gets an email

SLOs (Service Level Objectives)

SLO Design

  • Define SLOs based on user impact, not internal metrics
  • Typical SLO categories:
    • Availability: % of requests that succeed (e.g., 99.9%)
    • Latency: % of requests that complete within a threshold (e.g., p99 < 500ms)
    • Correctness: % of operations that produce correct results
    • Freshness: % of data that is within staleness threshold

Error Budget

  • Error budget = 100% - SLO target
  • If the SLO is 99.9% over a 30-day window, the error budget is 0.1% of requests in that window
  • Track error budget burn rate: how fast are we consuming the budget?
  • When error budget is exhausted, focus shifts from feature development to reliability
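The arithmetic above is simple enough to write down directly. A sketch, with illustrative numbers: a 99.9% SLO over a window with one million requests allows 1,000 failures; 250 failures so far leaves 75% of the budget.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests allowed to fail over the SLO window: (1 - target) * total."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means exhausted)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget
```

Tracking this fraction over time gives the burn rate; when it goes negative, the policy above says feature work yields to reliability work.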

SLO Framework

  • Define the SLO (what we promise)
  • Define the SLI (how we measure it)
  • Define the error budget (what we can afford to fail)
  • Define the alerting (when we're burning budget too fast)

Anti-Patterns

  • Logging everything: Generates noise, increases cost, makes debugging harder
  • Missing correlation ID: Can't trace requests across services
  • Alerting on causes, not symptoms: Alerts fire but users aren't impacted
  • Missing business metrics: Can't tell if the system is serving users well
  • High-cardinality metrics: Explosive metric count, expensive to store and query
  • Missing observability for external calls: External integration failures are invisible
  • Logging sensitive data: Passwords, tokens, PII in logs