---
name: error-model-design
description: "Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is defining error handling strategy.
## Core Principle
Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.
## Error Categories
### Client Errors (4xx)
Errors caused by the client sending invalid or incorrect requests.
Common client errors:
- `400 Bad Request` - malformed request body, missing required fields
- `401 Unauthorized` - missing or invalid authentication
- `403 Forbidden` - authenticated but not authorized for this resource
- `404 Not Found` - requested resource does not exist
- `409 Conflict` - request conflicts with current resource state (duplicate, version mismatch)
- `422 Unprocessable Entity` - well-formed request that violates a business rule
- `429 Too Many Requests` - rate limit exceeded
Design principles:
- Client errors are non-retryable (unless 429 with Retry-After)
- Error response must include enough detail for the client to correct the request
- Error codes should be consistent and documented in the API contract (see `api-contract-design`)
### Server Errors (5xx)
Errors caused by the server failing to process a valid request.
Common server errors:
- `500 Internal Server Error` - unexpected server failure
- `502 Bad Gateway` - upstream service failure
- `503 Service Unavailable` - temporary unavailability
- `504 Gateway Timeout` - upstream service timeout
Design principles:
- Server errors may be retryable (see retryable vs non-retryable)
- Error response should not leak internal details in production
- All unexpected server errors must be logged and alerted
- Circuit breakers should protect against cascading server errors
### Business Rule Violations
Errors where the request is valid but violates a business rule.
Design principles:
- Use 409 for conflicts with current resource state and 422 for semantic validation failures
- Include the specific business rule that was violated
- Include enough context for the client to understand and correct the issue
- Map each business rule violation to a PRD functional requirement
### Timeout Errors
Errors where an operation did not complete within the expected time.
Design principles:
- Always distinguish timeout from confirmed failure
- Timeout means "unknown state" not "failed"
- Define timeout values per operation type
- Document recovery procedures for timed-out operations
- See `distributed-system-basics` for timeout vs failure handling
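The timeout-vs-failure distinction can be made explicit in code. The sketch below is a minimal, hypothetical classifier; the exception types shown are Python's standard ones, and the three-way outcome enum is an assumption, not a prescribed API:

```python
import socket
from enum import Enum

class CallOutcome(Enum):
    SUCCESS = "success"
    FAILED = "failed"    # confirmed failure: the operation did not happen
    UNKNOWN = "unknown"  # timeout: the operation may or may not have happened

def classify_exception(exc: Exception) -> CallOutcome:
    # A timeout means the request's fate is unknown; it must not be treated
    # as a confirmed failure, because the server may still have processed it.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return CallOutcome.UNKNOWN
    if isinstance(exc, ConnectionRefusedError):
        # Refused before any bytes were sent: a confirmed failure.
        return CallOutcome.FAILED
    return CallOutcome.FAILED
```

Recovery procedures then branch on `UNKNOWN` (e.g. query the operation's status, or retry only if the operation is idempotent) rather than blindly retrying.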
### Cascading Failures
Failures that propagate from one service to another, potentially bringing down the entire system.
Design principles:
- Use circuit breakers to stop cascade propagation
- Use bulkheads to isolate failure domains
- Define fallback behavior for each dependency failure
- Monitor and alert on circuit breaker state changes
## Error Propagation Strategy
### Fail-Fast
Immediately return an error to the caller when a dependency fails.
Use when:
- The caller cannot proceed without the dependency
- Partial data is worse than no data
- The PRD requires immediate feedback
### Graceful Degradation
Continue serving reduced functionality when a dependency fails.
Use when:
- The PRD allows partial functionality
- Some data is better than no data
- The feature has a clear fallback path
Define for each graceful degradation:
- What functionality is reduced
- What the user sees instead
- How the system recovers when the dependency returns
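As a sketch, graceful degradation can be as simple as catching the dependency failure and serving the reduced response with an explicit flag. The `fetch_recommendations` dependency and the `degraded` field are illustrative assumptions, not a required contract:

```python
import logging

logger = logging.getLogger(__name__)

def get_homepage(user_id: str, fetch_recommendations) -> dict:
    # Serve the page even if the (hypothetical) recommendations
    # dependency is down: the feature degrades to an empty list.
    try:
        recs = fetch_recommendations(user_id)
        degraded = False
    except Exception:
        logger.warning("recommendations unavailable; serving degraded page")
        recs = []        # reduced functionality: no personalization
        degraded = True  # lets the client render a fallback UI
    return {"user_id": user_id, "recommendations": recs, "degraded": degraded}
```

The explicit flag answers "what the user sees instead": the client can render a generic section, and recovery is automatic on the next successful call.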
### Circuit Breaker
Stop calling a failing dependency after a threshold of failures, allowing it time to recover.
Define for each circuit breaker:
- Failure threshold (how many failures before opening)
- Recovery timeout (how long before trying again)
- Half-open behavior (how many requests to allow during recovery)
- Fallback behavior when circuit is open
Use when:
- A dependency is experiencing persistent failures
- Continuing to call will make things worse (cascading failure risk)
- The system can operate with reduced functionality
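The parameters above map directly onto a small state machine. This is a minimal single-threaded sketch (production breakers also need thread safety and metrics); it allows exactly one probe request in the half-open state, which is one possible choice for the half-open behavior:

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; open -> half-open after
    the recovery timeout; half-open -> closed on success, open on failure."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # allow one probe request
            else:
                return fallback()         # fail fast while open
        try:
            result = fn()
        except Exception:
            self._record_failure()
            return fallback()
        self.failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Note that every path through `call` either returns the real result or the fallback: a breaker without fallback behavior is an anti-pattern (see below).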
## Error Response Format
Define a consistent error response format across the entire system:
```json
{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message describing what happened",
    "details": [
      {
        "field": "field_name",
        "code": "SPECIFIC_ERROR_CODE",
        "message": "Specific error description"
      }
    ],
    "request_id": "correlation-id-for-tracing"
  }
}
```
Design principles:
- `code` is a machine-readable string constant (not HTTP status code)
- `message` is human-readable and suitable for display or logging
- `details` provides field-level validation errors when applicable
- `request_id` enables cross-service error tracing
- Never include stack traces, internal paths, or implementation details in production error responses
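A small builder helps keep the envelope consistent across services. This sketch mirrors the field names in the JSON above; generating a fresh correlation id when none is supplied is an assumption about how request ids are sourced:

```python
import uuid

def error_response(code, message, details=None, request_id=None):
    """Build the consistent error envelope described above.
    request_id falls back to a freshly generated correlation id."""
    return {
        "error": {
            "code": code,                 # machine-readable constant
            "message": message,           # safe for display or logging
            "details": details or [],     # field-level validation errors
            "request_id": request_id or str(uuid.uuid4()),
        }
    }
```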
## Retryable vs Non-Retryable Errors
### Retryable Errors
- Server errors (500, 502, 503, 504) with backoff
- Timeout errors with backoff
- Rate limit errors (429) with Retry-After
- Network connectivity errors
### Non-Retryable Errors
- Client errors (400, 401, 403, 404, 422, 409)
- Business rule violations
- Malformed requests
- Authentication failures
Define per endpoint whether an error is retryable. Include this in the API contract.
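The classification above can be encoded once and reused by every client. A minimal sketch, assuming `fn` returns an `(status, body)` pair; the exponential backoff with full jitter is one common policy, not the only valid one:

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def is_retryable(status: int) -> bool:
    # 4xx errors are the client's fault and will fail again unchanged;
    # only 429 (with backoff) and transient 5xx errors are worth retrying.
    return status in RETRYABLE_STATUSES

def call_with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry retryable statuses with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        status, body = fn()
        if status < 400:
            return status, body
        if not is_retryable(status) or attempt == max_attempts - 1:
            return status, body
        # Full jitter: delay drawn uniformly from [0, base * 2^attempt]
        sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

A production version would also honor an explicit `Retry-After` header for 429 responses rather than using the computed backoff.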
## Partial Failure Behavior
Define partial failure behavior for operations that span multiple steps or services:
- **All-or-nothing**: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
- **Best-effort**: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
- **Compensating transaction (saga)**: Each step has a compensating action. If a step fails, previously completed steps are undone by running their compensations in reverse order. Use for multi-service operations that need all-or-nothing semantics but cannot use distributed transactions; note that sagas yield eventual consistency, not true atomicity.
For each partial failure scenario:
- Define what "partial" means in this context
- Define whether partial success is acceptable or must be fully rolled back
- Define the recovery procedure
- Map to a PRD edge case
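The saga pattern above can be sketched as a driver over `(action, compensation)` pairs. The step names in the usage example (reserve inventory, charge payment) are illustrative, and real compensations must themselves be retried or queued if they fail:

```python
def run_saga(steps):
    """Execute (action, compensation) pairs in order; on failure, run the
    compensations of completed steps in reverse order, then re-raise."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                undo()  # best-effort rollback of earlier steps
            raise
        completed.append((action, compensate))
```

For example, if `charge_payment` fails after `reserve_inventory` succeeded, the driver runs `release_inventory` before surfacing the error, leaving the system consistent.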
## Fallback Strategy
For each external dependency, define:
- What happens when the dependency is unavailable
- Fallback behavior (cached data, default response, queue and retry, fail with user message)
- How the system recovers when the dependency returns
- SLA implications of the fallback
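One common fallback is serving cached data with a staleness bound, which makes the SLA implication explicit. A minimal sketch; the `max_staleness` budget and the `(value, stale)` return shape are assumptions to be set per dependency:

```python
import time

class CachedFallback:
    """Serve possibly-stale cached data when the dependency is down;
    surface the error only when no acceptable fallback exists."""

    def __init__(self, fetch, max_staleness=300.0, clock=time.monotonic):
        self.fetch = fetch
        self.max_staleness = max_staleness  # seconds of staleness the SLA allows
        self.clock = clock
        self.cache = None                   # (value, timestamp) or None

    def get(self):
        try:
            value = self.fetch()
            self.cache = (value, self.clock())
            return value, False             # fresh data
        except Exception:
            if self.cache is not None:
                value, ts = self.cache
                if self.clock() - ts <= self.max_staleness:
                    return value, True      # stale fallback, within budget
            raise                           # no acceptable fallback: fail with error
```

Recovery is automatic: the first successful fetch after the dependency returns refreshes the cache and clears the stale flag.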
## Observability
For error model design, define:
- What errors are logged (all unexpected errors, all server errors, sampled client errors)
- What errors trigger alerts (server error rate, DLQ depth, circuit breaker state)
- Error metrics (error rate by code, error rate by endpoint, p99 latency)
- Request tracing (correlation IDs across service boundaries)
Map observability requirements to PRD NFRs.
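The error metrics above depend on emitting one structured record per error. A sketch using Python's standard `logging`; the field names are illustrative, and the function returns the record so callers (or tests) can inspect it:

```python
import json
import logging

logger = logging.getLogger("errors")

def log_error(code, endpoint, request_id, message, level=logging.ERROR):
    """Emit one structured line per error so dashboards can compute
    error rate by code and by endpoint."""
    record = {
        "event": "error",
        "code": code,
        "endpoint": endpoint,
        "request_id": request_id,  # correlation id across service boundaries
        "message": message,
    }
    logger.log(level, json.dumps(record))
    return record
```

Sampled client errors can reuse the same shape at `logging.WARNING`, keeping one schema for all error telemetry.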
## Anti-Patterns
- Returning generic 500 errors for all server failures
- Not distinguishing timeout from failure
- Ignoring partial failure scenarios
- Leaking internal details in error responses
- Using the same error handling strategy for all operations regardless of criticality
- Not defining fallback behavior for external dependencies
- Alerting on every error instead of on actionable thresholds
- Using circuit breakers without fallback behavior