---
name: observability-design
description: "Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.

## Core Principles

### Three Pillars of Observability

- **Logs**: Discrete events with context (who, what, when, where)
- **Metrics**: Numeric measurements aggregated over time (rates, histograms, gauges)
- **Traces**: End-to-end request flow across services and boundaries

### Observability Is Not Monitoring

- Monitoring tells you when something is broken (known unknowns)
- Observability lets you ask questions about why something is broken (unknown unknowns)
- Design for observability: emit enough data to diagnose novel problems

### Observability by Design

- Observability must be designed into the architecture, not bolted on afterward
- Every service must emit structured logs, metrics, and traces from day one
- Every external integration must have observability hooks

## Logs

### Log Levels

- **ERROR**: Something failed and requires investigation (not every error warrants ERROR level)
- **WARN**: Something unexpected happened, but the system can continue
- **INFO**: Business-significant events (order created, payment processed, user registered)
- **DEBUG**: Detailed diagnostic information (enabled in development, not in production)
- **TRACE**: Extremely fine-grained detail (almost never enabled in production)

### Structured Logging

- Use JSON format for all logs
- Every log entry must include: timestamp, level, service name, correlation ID
- Include relevant context: user ID, request ID, entity IDs, error details
- Never log sensitive data: passwords, tokens, PII, secrets
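
The structured-logging rules above can be sketched with Python's standard `logging` module; the service name `order-service` and the `correlation_id` field name are illustrative assumptions, not prescribed by this contract:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with the required fields."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            # Attached at call sites via the `extra` argument.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service_name="order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# INFO is reserved for business-significant events.
logger.info("order created", extra={"correlation_id": "abc-123"})
```

Sensitive fields (passwords, tokens, PII) should be redacted before they ever reach the formatter.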

### Log Aggregation

- Send all logs to a centralized log aggregation system
- Define log retention periods based on compliance requirements
- Define log access controls (who can see which logs)
- Consider log volume and cost (log only what you need)

## Metrics

### Metric Types

- **Counter**: Monotonically increasing value (request count, error count)
- **Gauge**: Point-in-time value (active connections, queue depth)
- **Histogram**: Distribution of values (request latency, payload size)
- **Summary**: Pre-calculated quantiles (p50, p90, p99 latency)
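
As a rough sketch of how the first three types differ in behavior (a production system would use a metrics library rather than hand-rolled classes like these):

```python
from dataclasses import dataclass, field
from statistics import quantiles

@dataclass
class Counter:
    """Monotonically increasing, e.g. request or error counts."""
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

@dataclass
class Gauge:
    """Point-in-time value that can move both ways, e.g. queue depth."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value

@dataclass
class Histogram:
    """Stores observations; quantiles are derived at read time."""
    samples: list = field(default_factory=list)

    def observe(self, value: float) -> None:
        self.samples.append(value)

    def percentile(self, p: int) -> float:
        # statistics.quantiles with n=100 yields cut points p1..p99.
        return quantiles(self.samples, n=100)[p - 1]

requests_total = Counter()
requests_total.inc()
queue_depth = Gauge()
queue_depth.set(42)
latency_ms = Histogram()
for ms in range(1, 101):
    latency_ms.observe(float(ms))
```

A Summary differs from a Histogram in that its quantiles are pre-calculated by the client rather than derived at query time.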

### Key Business Metrics

- Orders per minute
- Revenue per minute
- Active users
- Conversion rate
- Cart abandonment rate

### Key System Metrics

- Request rate (requests per second per endpoint)
- Error rate (4xx rate, 5xx rate per endpoint)
- Latency (p50, p90, p99 per endpoint)
- Queue depth and age
- Database connection pool usage
- Cache hit rate
- Memory and CPU usage per service

### Metric Naming Convention

- Use dot-separated names: `service.operation.metric`
- Include units in the name or metadata: `request.duration.milliseconds`
- Use consistent labels: `method`, `endpoint`, `status_code`, `tenant_id`
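
To make the convention concrete, a hypothetical registry entry and a name check (the `checkout.create_order` names are invented for illustration):

```python
import re

# Hypothetical metrics following service.operation.metric, with units
# in the name and the consistent label set recommended above.
METRICS = {
    "checkout.create_order.request.duration.milliseconds": "histogram",
    "checkout.create_order.errors.total": "counter",
}

LABELS = ("method", "endpoint", "status_code", "tenant_id")

def is_valid_name(name: str) -> bool:
    """Dot-separated lowercase segments, underscores allowed inside."""
    return bool(re.fullmatch(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+", name))
```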

## Traces

### Distributed Tracing

- Every request gets a trace ID that propagates across all services
- Every operation within a request gets a span with operation name, start time, duration
- Span boundaries: service calls, database queries, external API calls, queue operations

### Correlation ID Propagation

- Generate a correlation ID at the request entry point
- Propagate the correlation ID through all service calls (headers, message metadata)
- Include the correlation ID in all logs, metrics, and error responses
- Use the correlation ID to trace a request end-to-end across all services
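
The propagation rule can be sketched framework-free; the `X-Correlation-ID` header name is a common convention, assumed here rather than mandated:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name

def ensure_correlation_id(headers: dict) -> dict:
    """Reuse an incoming correlation ID, or generate one at the entry point,
    and return the headers to forward on downstream calls."""
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    return {**headers, CORRELATION_HEADER: correlation_id}

# Entry point: no incoming ID, so a fresh one is generated.
outgoing = ensure_correlation_id({"Accept": "application/json"})

# Downstream hop: the existing ID is preserved, never regenerated.
second_hop = ensure_correlation_id(outgoing)
```

The same ID should also be stamped onto every log entry and error response produced while handling the request.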

### Span Design

- Include relevant context in spans: user ID, entity IDs, operation type
- Tag spans with error information when operations fail
- Keep span cardinality reasonable (avoid high-cardinality attributes as tags)
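
A minimal span sketch of the design above (a real system would use a tracing SDK such as OpenTelemetry; this only shows the shape of a span record):

```python
import time
from contextlib import contextmanager

@contextmanager
def span(trace: list, name: str, **attributes):
    """Record operation name, start time, duration, and low-cardinality tags;
    tag the span with error information when the operation fails."""
    record = {"name": name, "start": time.time(), "attributes": attributes}
    try:
        yield record
    except Exception as exc:
        record["error"] = type(exc).__name__
        raise
    finally:
        record["duration_s"] = time.time() - record["start"]
        trace.append(record)

trace: list = []
with span(trace, "db.query", operation="SELECT"):
    pass  # the traced work would run here
```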

## Alerts

### Alert Design Principles

- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must have a clear runbook or remediation steps
- Every alert must be actionable (if you can't act on it, don't alert on it)
- Avoid alert fatigue: set thresholds based on SLOs, not arbitrary numbers

### Alert Categories

- **Page-worthy**: System is broken, immediate action required (high error rate, service down)
- **Ticket-worthy**: Degradation that needs investigation soon (rising latency, approaching limits)
- **Log-worthy**: Informational, no immediate action (deployment completed, config changed)

### Alert Thresholds

- Base alert thresholds on SLOs, not arbitrary numbers
- Use burn rate alerting: alert when the error budget is burning too fast
- Define escalation paths: who gets paged, who gets a ticket, who gets an email
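
Burn-rate alerting reduces to a small calculation; the 14.4x figure below is a commonly cited page-worthy threshold for a one-hour window, used here as an illustrative assumption:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed: a burn rate of 1.0
    exhausts the budget exactly over the full SLO window."""
    error_budget = 1.0 - slo_target
    return error_rate / error_budget

# A 99.9% SLO leaves a 0.1% budget; a sustained 1.44% error rate
# therefore burns budget at 14.4x the sustainable pace.
rate = burn_rate(error_rate=0.0144, slo_target=0.999)
```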

## SLOs (Service Level Objectives)

### SLO Design

- Define SLOs based on user impact, not internal metrics
- Typical SLO categories:
  - **Availability**: % of requests that succeed (e.g., 99.9%)
  - **Latency**: % of requests that complete within a threshold (e.g., p99 < 500ms)
  - **Correctness**: % of operations that produce correct results
  - **Freshness**: % of data that is within its staleness threshold
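
For the availability category, the SLI and the SLO check reduce to simple ratios; the numbers below are illustrative:

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: the fraction of requests that succeeded over the window."""
    return successful / total if total else 1.0

def meets_slo(sli: float, target: float) -> bool:
    return sli >= target

# 999,500 of 1,000,000 requests succeeded: 99.95%, above a 99.9% target.
sli = availability_sli(successful=999_500, total=1_000_000)
```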

### Error Budget

- Error budget = 100% - SLO target
- If the SLO is 99.9%, the error budget is 0.1% of requests per month
- Track the error budget burn rate: how fast are we consuming the budget?
- When the error budget is exhausted, focus shifts from feature development to reliability
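
The budget arithmetic above, expressed as minutes of full downtime per window:

```python
def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Error budget = (100% - SLO target), converted to allowed downtime."""
    return (1.0 - slo_target) * window_minutes

# A 99.9% SLO over a 30-day month (43,200 minutes) allows ~43.2 minutes
# of full downtime before the budget is exhausted.
budget = error_budget_minutes(slo_target=0.999, window_minutes=30 * 24 * 60)
```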

### SLO Framework

- Define the SLO (what we promise)
- Define the SLI (how we measure it)
- Define the error budget (what we can afford to fail)
- Define the alerting (when we're burning budget too fast)

## Anti-Patterns

- **Logging everything**: Generates noise, increases cost, makes debugging harder
- **Missing correlation ID**: Can't trace requests across services
- **Alerting on causes, not symptoms**: Alerts fire but users aren't impacted
- **Missing business metrics**: Can't tell if the system is serving users well
- **High-cardinality metrics**: Explosive metric count, expensive to store and query
- **Missing observability for external calls**: External integration failures are invisible
- **Logging sensitive data**: Passwords, tokens, PII in logs