| name | description |
|---|---|
| observability-design | Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy. |
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.
## Core Principles

### Three Pillars of Observability
- Logs: Discrete events with context (who, what, when, where)
- Metrics: Numeric measurements aggregated over time (rates, histograms, gauges)
- Traces: End-to-end request flow across services and boundaries
### Observability Is Not Monitoring
- Monitoring tells you when something is broken (known unknowns)
- Observability lets you ask questions about why something is broken (unknown unknowns)
- Design for observability: emit enough data to diagnose novel problems
### Observability by Design
- Observability must be designed into the architecture, not bolted on after
- Every service must emit structured logs, metrics, and traces from day one
- Every external integration must have observability hooks
## Logs

### Log Levels
- ERROR: Something failed that requires investigation (not all errors are ERROR level)
- WARN: Something unexpected happened but the system can continue
- INFO: Business-significant events (order created, payment processed, user registered)
- DEBUG: Detailed information for debugging (only in development, not in production)
- TRACE: Very detailed information (almost never used in production)
### Structured Logging
- Use JSON format for all logs
- Every log entry must include: timestamp, level, service name, correlation ID
- Include relevant context: user ID, request ID, entity IDs, error details
- Never log sensitive data: passwords, tokens, PII, secrets
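The rules above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed implementation: the service name `order-service` and the field names `correlation_id` and `context` are hypothetical choices, and a real service would typically use a structured-logging library instead.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",  # hypothetical service name
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # Merge request context (user ID, entity IDs) -- never secrets or PII.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A business-significant INFO event with correlation ID and entity context.
logger.info("order created",
            extra={"correlation_id": "abc-123", "context": {"order_id": 42}})
```

Each line is a self-describing JSON object, so the aggregation system can index and filter on `correlation_id` without parsing free-form text.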
### Log Aggregation
- Send all logs to a centralized log aggregation system
- Define log retention period based on compliance requirements
- Define log access controls (who can see what logs)
- Consider log volume and cost (log only what you need)
## Metrics

### Metric Types
- Counter: Monotonically increasing value (request count, error count)
- Gauge: Point-in-time value (active connections, queue depth)
- Histogram: Distribution of values (request latency, payload size)
- Summary: Pre-calculated quantiles (p50, p90, p99 latency)
### Key Business Metrics
- Orders per minute
- Revenue per minute
- Active users
- Conversion rate
- Cart abandonment rate
### Key System Metrics
- Request rate (requests per second per endpoint)
- Error rate (4xx rate, 5xx rate per endpoint)
- Latency (p50, p90, p99 per endpoint)
- Queue depth and age
- Database connection pool usage
- Cache hit rate
- Memory and CPU usage per service
### Metric Naming Convention
- Use dot-separated names: `service.operation.metric`
- Include units in the name or metadata: `request.duration.milliseconds`
- Use consistent labels: `method`, `endpoint`, `status_code`, `tenant_id`
## Traces

### Distributed Tracing
- Every request gets a trace ID that propagates across all services
- Every operation within a request gets a span with operation name, start time, duration
- Span boundaries: service calls, database queries, external API calls, queue operations
### Correlation ID Propagation
- Generate a correlation ID at the request entry point
- Propagate correlation ID through all service calls (headers, message metadata)
- Include correlation ID in all logs, metrics, and error responses
- Use correlation ID to trace a request end-to-end across all services
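A minimal sketch of that lifecycle, assuming an HTTP service: the header name `X-Correlation-ID` is a common convention (not mandated here), and `contextvars` makes the ID available to logging without threading it through every function call.

```python
import uuid
from contextvars import ContextVar

# Holds the correlation ID for the current request; async- and thread-safe.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

HEADER = "X-Correlation-ID"  # conventional header name; adjust to your stack

def extract_or_generate(headers: dict) -> str:
    """At the entry point: reuse the caller's ID, or mint one if absent."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def outgoing_headers() -> dict:
    """Attach the current correlation ID to every downstream call."""
    return {HEADER: correlation_id.get()}
```

Middleware calls `extract_or_generate` once per request; every HTTP client, queue producer, and log formatter then reads `correlation_id.get()` so the same ID appears end-to-end.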
### Span Design
- Include relevant context in spans: user ID, entity IDs, operation type
- Tag spans with error information when operations fail
- Keep span cardinality reasonable (avoid high-cardinality attributes as tags)
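As an illustration of span shape only (a real system would use a tracing SDK such as OpenTelemetry rather than this in-memory list), a span carries a name, duration, low-cardinality attributes, and error information on failure:

```python
import time
from contextlib import contextmanager

spans = []  # stand-in for a tracer's exporter

@contextmanager
def span(name, **attributes):
    """Record a named span with duration and low-cardinality tags."""
    start = time.monotonic()
    record = {"name": name, "attributes": attributes, "error": None}
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)  # tag the span when the operation fails
        raise
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        spans.append(record)

# Span boundaries from the list above: e.g. a database query.
with span("db.query", operation="select", user_id="u-1"):
    time.sleep(0.01)  # stand-in for the actual database call
```

Note the attributes are bounded values (`operation="select"`), not free-form strings, keeping tag cardinality reasonable.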
## Alerts

### Alert Design Principles
- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must have a clear runbook or remediation steps
- Every alert must be actionable (if you can't act on it, don't alert on it)
- Avoid alert fatigue: a stream of noisy alerts teaches responders to ignore the ones that matter
### Alert Categories
- Page-worthy: System is broken, immediate action required (high error rate, service down)
- Ticket-worthy: Degradation that needs investigation soon (rising latency, approaching limits)
- Log-worthy: Informational, no immediate action (deployment completed, config changed)
### Alert Thresholds
- Base alert thresholds on SLOs, not arbitrary numbers
- Use burn rate alerting: alert when the error budget is burning too fast
- Define escalation paths: who gets paged, who gets a ticket, who gets an email
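Burn rate is simply the observed error ratio divided by the ratio the SLO allows. A minimal sketch, assuming a 30-day SLO window; the paging threshold of 14.4 is the commonly cited example from Google's SRE Workbook (a rate that would exhaust a 30-day budget in about two days), not a value this contract mandates:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed we are consuming error budget."""
    budget_ratio = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget_ratio

def should_page(error_ratio: float, slo: float, threshold: float = 14.4) -> bool:
    """Page when the short-window burn rate is high enough to exhaust
    a 30-day budget in roughly two days."""
    return burn_rate(error_ratio, slo) >= threshold
```

A burn rate of 1.0 means the budget lasts exactly the SLO window; at 20.0, a 0.1% monthly budget is gone in about 36 hours, which is why such readings are page-worthy while lower rates become tickets.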
## SLOs (Service Level Objectives)

### SLO Design
- Define SLOs based on user impact, not internal metrics
- Typical SLO categories:
- Availability: % of requests that succeed (e.g., 99.9%)
- Latency: % of requests that complete within a threshold (e.g., p99 < 500ms)
- Correctness: % of operations that produce correct results
- Freshness: % of data that is within staleness threshold
### Error Budget
- Error budget = 100% - SLO target
- If the SLO is 99.9% over a 30-day window, the error budget is 0.1% of requests in that window (equivalent to roughly 43 minutes of full downtime)
- Track error budget burn rate: how fast are we consuming the budget?
- When error budget is exhausted, focus shifts from feature development to reliability
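The arithmetic is worth making explicit. A small sketch (the 30-day window is an assumption; use whatever window your SLO defines):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO permits over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_consumed(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget already spent (1.0 = exhausted)."""
    return bad_minutes / error_budget_minutes(slo, window_days)

# 99.9% over 30 days allows about 43.2 minutes of downtime;
# a 20-minute outage therefore consumes roughly 46% of the monthly budget.
```

Tracking `budget_consumed` over time gives the burn rate the alerting section keys on, and crossing 1.0 is the trigger for shifting effort from features to reliability.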
### SLO Framework
- Define the SLO (what we promise)
- Define the SLI (how we measure it)
- Define the error budget (what we can afford to fail)
- Define the alerting (when we're burning budget too fast)
## Anti-Patterns
- Logging everything: Generates noise, increases cost, makes debugging harder
- Missing correlation ID: Can't trace requests across services
- Alerting on causes, not symptoms: Alerts fire but users aren't impacted
- Missing business metrics: Can't tell if the system is serving users well
- High-cardinality metrics: Explosive metric count, expensive to store and query
- Missing observability for external calls: External integration failures are invisible
- Logging sensitive data: Passwords, tokens, PII in logs