---
name: observability-design
description: "Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.

## Core Principles

### Three Pillars of Observability

- **Logs**: Discrete events with context (who, what, when, where)
- **Metrics**: Numeric measurements aggregated over time (rates, histograms, gauges)
- **Traces**: End-to-end request flow across services and boundaries

### Observability Is Not Monitoring

- Monitoring tells you when something is broken (known unknowns)
- Observability lets you ask questions about why something is broken (unknown unknowns)
- Design for observability: emit enough data to diagnose novel problems

### Observability by Design

- Observability must be designed into the architecture, not bolted on afterward
- Every service must emit structured logs, metrics, and traces from day one
- Every external integration must have observability hooks

## Logs

### Log Levels

- **ERROR**: Something failed and requires investigation (not every error warrants ERROR level)
- **WARN**: Something unexpected happened, but the system can continue
- **INFO**: Business-significant events (order created, payment processed, user registered)
- **DEBUG**: Detailed diagnostic information (enabled in development, not in production)
- **TRACE**: Extremely fine-grained detail (almost never enabled in production)

### Structured Logging

- Use JSON format for all logs
- Every log entry must include: timestamp, level, service name, correlation ID
- Include relevant context: user ID, request ID, entity IDs, error details
- Never log sensitive data: passwords, tokens, PII, secrets
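
The structured-logging rules above can be sketched with Python's standard `logging` module; the service name `order-service` and the `correlation_id` field name are illustrative assumptions, not prescribed by this contract:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with the required fields."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            # Attached at call sites via the `extra` argument.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service_name="order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# INFO is reserved for business-significant events.
logger.info("order created", extra={"correlation_id": "abc-123"})
```

Sensitive fields (passwords, tokens, PII) should be redacted before they ever reach the formatter.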

### Log Aggregation

- Send all logs to a centralized log aggregation system
- Define log retention periods based on compliance requirements
- Define log access controls (who can see which logs)
- Consider log volume and cost (log only what you need)

## Metrics

### Metric Types

- **Counter**: Monotonically increasing value (request count, error count)
- **Gauge**: Point-in-time value (active connections, queue depth)
- **Histogram**: Distribution of values (request latency, payload size)
- **Summary**: Pre-calculated quantiles (p50, p90, p99 latency)
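
As a rough sketch of how the first three types differ in behavior (a production system would use a metrics library rather than hand-rolled classes like these):

```python
from dataclasses import dataclass, field
from statistics import quantiles

@dataclass
class Counter:
    """Monotonically increasing, e.g. request or error counts."""
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

@dataclass
class Gauge:
    """Point-in-time value that can move both ways, e.g. queue depth."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value

@dataclass
class Histogram:
    """Stores observations; quantiles are derived at read time."""
    samples: list = field(default_factory=list)

    def observe(self, value: float) -> None:
        self.samples.append(value)

    def percentile(self, p: int) -> float:
        # statistics.quantiles with n=100 yields cut points p1..p99.
        return quantiles(self.samples, n=100)[p - 1]

requests_total = Counter()
requests_total.inc()
queue_depth = Gauge()
queue_depth.set(42)
latency_ms = Histogram()
for ms in range(1, 101):
    latency_ms.observe(float(ms))
```

A Summary differs from a Histogram in that its quantiles are pre-calculated by the client rather than derived at query time.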

### Key Business Metrics

- Orders per minute
- Revenue per minute
- Active users
- Conversion rate
- Cart abandonment rate

### Key System Metrics

- Request rate (requests per second per endpoint)
- Error rate (4xx rate, 5xx rate per endpoint)
- Latency (p50, p90, p99 per endpoint)
- Queue depth and age
- Database connection pool usage
- Cache hit rate
- Memory and CPU usage per service

### Metric Naming Convention

- Use dot-separated names: `service.operation.metric`
- Include units in the name or metadata: `request.duration.milliseconds`
- Use consistent labels: `method`, `endpoint`, `status_code`, `tenant_id`
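
To make the convention concrete, a hypothetical registry entry and a name check (the `checkout.create_order` names are invented for illustration):

```python
import re

# Hypothetical metrics following service.operation.metric, with units
# in the name and the consistent label set recommended above.
METRICS = {
    "checkout.create_order.request.duration.milliseconds": "histogram",
    "checkout.create_order.errors.total": "counter",
}

LABELS = ("method", "endpoint", "status_code", "tenant_id")

def is_valid_name(name: str) -> bool:
    """Dot-separated lowercase segments, underscores allowed inside."""
    return bool(re.fullmatch(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+", name))
```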

## Traces

### Distributed Tracing

- Every request gets a trace ID that propagates across all services
- Every operation within a request gets a span with operation name, start time, duration
- Span boundaries: service calls, database queries, external API calls, queue operations

### Correlation ID Propagation

- Generate a correlation ID at the request entry point
- Propagate the correlation ID through all service calls (headers, message metadata)
- Include the correlation ID in all logs, metrics, and error responses
- Use the correlation ID to trace a request end-to-end across all services
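
The propagation rule can be sketched framework-free; the `X-Correlation-ID` header name is a common convention, assumed here rather than mandated:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name

def ensure_correlation_id(headers: dict) -> dict:
    """Reuse an incoming correlation ID, or generate one at the entry point,
    and return the headers to forward on downstream calls."""
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    return {**headers, CORRELATION_HEADER: correlation_id}

# Entry point: no incoming ID, so a fresh one is generated.
outgoing = ensure_correlation_id({"Accept": "application/json"})

# Downstream hop: the existing ID is preserved, never regenerated.
second_hop = ensure_correlation_id(outgoing)
```

The same ID should also be stamped onto every log entry and error response produced while handling the request.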

### Span Design

- Include relevant context in spans: user ID, entity IDs, operation type
- Tag spans with error information when operations fail
- Keep span cardinality reasonable (avoid high-cardinality attributes as tags)
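
A minimal span sketch of the design above (a real system would use a tracing SDK such as OpenTelemetry; this only shows the shape of a span record):

```python
import time
from contextlib import contextmanager

@contextmanager
def span(trace: list, name: str, **attributes):
    """Record operation name, start time, duration, and low-cardinality tags;
    tag the span with error information when the operation fails."""
    record = {"name": name, "start": time.time(), "attributes": attributes}
    try:
        yield record
    except Exception as exc:
        record["error"] = type(exc).__name__
        raise
    finally:
        record["duration_s"] = time.time() - record["start"]
        trace.append(record)

trace: list = []
with span(trace, "db.query", operation="SELECT"):
    pass  # the traced work would run here
```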

## Alerts

### Alert Design Principles

- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must have a clear runbook or remediation steps
- Every alert must be actionable (if you can't act on it, don't alert on it)
- Avoid alert fatigue: set thresholds based on SLOs, not arbitrary numbers

### Alert Categories

- **Page-worthy**: System is broken, immediate action required (high error rate, service down)
- **Ticket-worthy**: Degradation that needs investigation soon (rising latency, approaching limits)
- **Log-worthy**: Informational, no immediate action (deployment completed, config changed)

### Alert Thresholds

- Base alert thresholds on SLOs, not arbitrary numbers
- Use burn rate alerting: alert when the error budget is burning too fast
- Define escalation paths: who gets paged, who gets a ticket, who gets an email
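
Burn-rate alerting reduces to a small calculation; the 14.4x figure below is a commonly cited page-worthy threshold for a one-hour window, used here as an illustrative assumption:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed: a burn rate of 1.0
    exhausts the budget exactly over the full SLO window."""
    error_budget = 1.0 - slo_target
    return error_rate / error_budget

# A 99.9% SLO leaves a 0.1% budget; a sustained 1.44% error rate
# therefore burns budget at 14.4x the sustainable pace.
rate = burn_rate(error_rate=0.0144, slo_target=0.999)
```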

## SLOs (Service Level Objectives)

### SLO Design

- Define SLOs based on user impact, not internal metrics
- Typical SLO categories:
  - **Availability**: % of requests that succeed (e.g., 99.9%)
  - **Latency**: % of requests that complete within a threshold (e.g., p99 < 500ms)
  - **Correctness**: % of operations that produce correct results
  - **Freshness**: % of data that is within its staleness threshold
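
For the availability category, the SLI and the SLO check reduce to simple ratios; the numbers below are illustrative:

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: the fraction of requests that succeeded over the window."""
    return successful / total if total else 1.0

def meets_slo(sli: float, target: float) -> bool:
    return sli >= target

# 999,500 of 1,000,000 requests succeeded: 99.95%, above a 99.9% target.
sli = availability_sli(successful=999_500, total=1_000_000)
```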

### Error Budget

- Error budget = 100% - SLO target
- If the SLO is 99.9%, the error budget is 0.1% of requests per month
- Track the error budget burn rate: how fast are we consuming the budget?
- When the error budget is exhausted, focus shifts from feature development to reliability
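
The budget arithmetic above, expressed as minutes of full downtime per window:

```python
def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Error budget = (100% - SLO target), converted to allowed downtime."""
    return (1.0 - slo_target) * window_minutes

# A 99.9% SLO over a 30-day month (43,200 minutes) allows ~43.2 minutes
# of full downtime before the budget is exhausted.
budget = error_budget_minutes(slo_target=0.999, window_minutes=30 * 24 * 60)
```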

### SLO Framework

- Define the SLO (what we promise)
- Define the SLI (how we measure it)
- Define the error budget (what we can afford to fail)
- Define the alerting (when we're burning budget too fast)

## Anti-Patterns

- **Logging everything**: Generates noise, increases cost, makes debugging harder
- **Missing correlation ID**: Can't trace requests across services
- **Alerting on causes, not symptoms**: Alerts fire but users aren't impacted
- **Missing business metrics**: Can't tell if the system is serving users well
- **High-cardinality metrics**: Explosive metric count, expensive to store and query
- **Missing observability for external calls**: External integration failures are invisible
- **Logging sensitive data**: Passwords, tokens, PII in logs