---
name: error-model-design
description: "Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is defining error handling strategy.

## Core Principle

Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.

## Error Categories

### Client Errors (4xx)
Errors caused by the client sending invalid or incorrect requests.

Common client errors:
- `400 Bad Request` - malformed request body, missing required fields
- `401 Unauthorized` - missing or invalid authentication
- `403 Forbidden` - authenticated but not authorized for this resource
- `404 Not Found` - requested resource does not exist
- `409 Conflict` - state conflict (duplicate, version mismatch, business rule violation)
- `422 Unprocessable Entity` - valid format but business rule violation
- `429 Too Many Requests` - rate limit exceeded

Design principles:
- Client errors are non-retryable (unless 429 with Retry-After)
- Error response must include enough detail for the client to correct the request
- Error codes should be consistent and documented in the API contract (see `api-contract-design`)

### Server Errors (5xx)
Errors caused by the server failing to process a valid request.

Common server errors:
- `500 Internal Server Error` - unexpected server failure
- `502 Bad Gateway` - upstream service failure
- `503 Service Unavailable` - temporary unavailability
- `504 Gateway Timeout` - upstream service timeout

Design principles:
- Server errors may be retryable (see retryable vs non-retryable)
- Error response should not leak internal details in production
- All unexpected server errors must be logged and alerted
- Circuit breakers should protect against cascading server errors

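The "log everything, leak nothing" principles above could be applied at the service boundary roughly as follows. This is a minimal sketch: the `handle` wrapper and the `INTERNAL_ERROR` code are hypothetical names chosen for illustration, not part of this contract.

```python
import logging
import uuid

logger = logging.getLogger("error_model_example")


def handle(operation, request_id=None):
    """Run `operation`; turn any unexpected exception into a safe 500 body.

    `handle` and `INTERNAL_ERROR` are hypothetical names for illustration.
    """
    request_id = request_id or str(uuid.uuid4())
    try:
        return 200, operation()
    except Exception:
        # The full stack trace goes to the log, keyed by request_id for tracing.
        logger.exception("unexpected failure request_id=%s", request_id)
        # The client sees only a generic message plus the correlation ID.
        return 500, {
            "error": {
                "code": "INTERNAL_ERROR",
                "message": "An unexpected error occurred.",
                "details": [],
                "request_id": request_id,
            }
        }
```

The exception text never reaches the response body, so support staff would look up the stack trace by `request_id` instead.
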
### Business Rule Violations
Errors where the request is valid but violates a business rule.

Design principles:
- Use 422 or 409 depending on the nature of the violation
- Include the specific business rule that was violated
- Include enough context for the client to understand and correct the issue
- Map each business rule violation to a PRD functional requirement

### Timeout Errors
Errors where an operation did not complete within the expected time.

Design principles:
- Always distinguish timeout from confirmed failure
- Timeout means "unknown state", not "failed"
- Define timeout values per operation type
- Document recovery procedures for timed-out operations
- See `distributed-system-basics` for timeout vs failure handling

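One way to keep "unknown" separate from "failed" in code is a three-valued outcome. The sketch below is illustrative only: `Outcome`, `Rejected`, and `call_with_outcome` are hypothetical names, and it assumes the dependency signals an explicit refusal with its own exception type.

```python
import enum


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"    # the dependency confirmed it did not apply the change
    UNKNOWN = "unknown"  # timed out: the change may or may not have been applied


class Rejected(Exception):
    """Hypothetical error raised when the dependency explicitly refuses a request."""


def call_with_outcome(operation):
    """Classify the result of `operation`, a zero-argument callable."""
    try:
        operation()
        return Outcome.SUCCEEDED
    except TimeoutError:
        # Not a failure: the caller must reconcile, e.g. by querying status.
        return Outcome.UNKNOWN
    except Rejected:
        return Outcome.FAILED
```

Callers that receive `Outcome.UNKNOWN` would follow the documented recovery procedure rather than reporting failure to the user.
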
### Cascading Failures
Failures that propagate from one service to another, potentially bringing down the entire system.

Design principles:
- Use circuit breakers to stop cascade propagation
- Use bulkheads to isolate failure domains
- Define fallback behavior for each dependency failure
- Monitor and alert on circuit breaker state changes

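A bulkhead can be as simple as a per-dependency concurrency cap. The following is a minimal sketch using the standard library; `Bulkhead` and `BulkheadFull` are hypothetical names, and the limit value is something each design would have to choose.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the dependency's concurrency budget is exhausted."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared workers."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Non-blocking acquire: reject at once instead of queueing behind
        # a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return operation()
        finally:
            self._slots.release()
```

Giving each dependency its own `Bulkhead` instance means a stall in one of them rejects only its own callers.
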
## Error Propagation Strategy

### Fail-Fast
Immediately return an error to the caller when a dependency fails.

Use when:
- The caller cannot proceed without the dependency
- Partial data is worse than no data
- The PRD requires immediate feedback

### Graceful Degradation
Continue serving reduced functionality when a dependency fails.

Use when:
- The PRD allows partial functionality
- Some data is better than no data
- The feature has a clear fallback path

Define for each graceful degradation:
- What functionality is reduced
- What the user sees instead
- How the system recovers when the dependency returns

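The three definitions above can be made concrete with a small example. This is a sketch under assumed names: `build_product_page` and its two fetcher parameters are hypothetical, with the product treated as required and recommendations as optional.

```python
def build_product_page(product_id, fetch_product, fetch_recommendations):
    """Assemble a page; all names here are hypothetical illustrations.

    The product itself is required, so a failure there propagates (fail-fast).
    Recommendations are optional, so a failure there degrades the page.
    """
    page = {"product": fetch_product(product_id), "degraded": []}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        # Reduced functionality: show an empty list and record what was dropped.
        page["recommendations"] = []
        page["degraded"].append("recommendations")
    return page
```

Recovery needs no extra step in this sketch: the next request calls the dependency again and the `degraded` list comes back empty once it responds.
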
### Circuit Breaker
Stop calling a failing dependency after a threshold of failures, allowing it time to recover.

Define for each circuit breaker:
- Failure threshold (how many failures before opening)
- Recovery timeout (how long before trying again)
- Half-open behavior (how many requests to allow during recovery)
- Fallback behavior when circuit is open

Use when:
- A dependency is experiencing persistent failures
- Continuing to call will make things worse (cascading failure risk)
- The system can operate with reduced functionality

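The four parameters listed above map onto a small state machine. The class below is a minimal single-threaded sketch, not a production implementation: `CircuitBreaker` and its parameter names are assumptions, and half-open simply admits one probe call at a time.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, operation, fallback):
        if self.state == "open":
            return fallback()  # do not touch the failing dependency
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed half-open probe, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Requiring a `fallback` argument on every call keeps the breaker from being used without defined fallback behavior.
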
## Error Response Format

Define a consistent error response format across the entire system:

```json
{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message describing what happened",
    "details": [
      {
        "field": "field_name",
        "code": "SPECIFIC_ERROR_CODE",
        "message": "Specific error description"
      }
    ],
    "request_id": "correlation-id-for-tracing"
  }
}
```

Design principles:
- `code` is a machine-readable string constant (not HTTP status code)
- `message` is human-readable and suitable for display or logging
- `details` provides field-level validation errors when applicable
- `request_id` enables cross-service error tracing
- Never include stack traces, internal paths, or implementation details in production error responses

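A single helper that builds this envelope is one way to keep the format consistent across services. The sketch below assumes a hypothetical `error_response` function; the `VALIDATION_FAILED` and `REQUIRED` codes are example values only.

```python
import json


def error_response(code, message, request_id, details=None):
    """Build the error envelope; `details` is a list of (field, code, message) tuples."""
    return {
        "error": {
            "code": code,
            "message": message,
            "details": [
                {"field": f, "code": c, "message": m} for f, c, m in (details or [])
            ],
            "request_id": request_id,
        }
    }


# Example: a 400 response body for a missing required field.
body = error_response(
    "VALIDATION_FAILED",
    "Request body failed validation",
    "req-123",
    details=[("email", "REQUIRED", "email is required")],
)
print(json.dumps(body, indent=2))
```

The HTTP status is deliberately not part of the body, matching the rule that `code` is a string constant rather than a status code.
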
## Retryable vs Non-Retryable Errors

### Retryable Errors
- Server errors (500, 502, 503, 504) with backoff, provided the operation is idempotent or carries an idempotency key
- Timeout errors with backoff, under the same idempotency condition, because a timeout leaves the outcome unknown
- Rate limit errors (429) with Retry-After
- Network connectivity errors

### Non-Retryable Errors
- Client errors (400, 401, 403, 404, 409, 422)
- Business rule violations
- Malformed requests
- Authentication failures

Define per endpoint whether an error is retryable. Include this in the API contract.

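The classification above can be expressed as a small retry policy. This is a sketch under assumptions: `RETRYABLE_STATUSES`, `is_retryable`, and `retry_delay` are hypothetical names, the backoff uses exponential growth with full jitter (one common choice, not mandated here), and only the seconds form of `Retry-After` is handled.

```python
import random

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def is_retryable(status):
    return status in RETRYABLE_STATUSES


def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Return seconds to wait before retry number `attempt` (1-based), or None to stop."""
    if not is_retryable(status):
        return None
    if status == 429 and retry_after is not None:
        # Honor the server's Retry-After value (seconds form only in this sketch).
        return float(retry_after)
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return rng() * min(cap, base * 2 ** (attempt - 1))
```

A caller would stop retrying when `retry_delay` returns `None` or when its own attempt limit is reached.
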
## Partial Failure Behavior

Define partial failure behavior for operations that span multiple steps or services:

- **All-or-nothing**: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
- **Best-effort**: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
- **Compensating transaction (saga)**: Each step has a compensating action. If a step fails, previous steps are undone via compensation. Use for multi-service operations where atomicity is required but distributed transactions are not available.

For each partial failure scenario:
- Define what "partial" means in this context
- Define whether partial success is acceptable or must be fully rolled back
- Define the recovery procedure
- Map to a PRD edge case

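The compensating-transaction approach can be illustrated with an in-process sketch. `run_saga` is a hypothetical helper; a real saga would also need durable state and a plan for compensations that themselves fail, which this example leaves out.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure.

    Returns (True, None) on success, or (False, error) after compensating.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception as error:
            # Undo in reverse order so the most recent step is compensated first.
            for undo in reversed(completed):
                undo()
            return False, error
        completed.append(compensate)
    return True, None
```

The failing step's own compensation is not run, because that step never completed.
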
## Fallback Strategy

For each external dependency, define:
- What happens when the dependency is unavailable
- Fallback behavior (cached data, default response, queue and retry, fail with user message)
- How the system recovers when the dependency returns
- SLA implications of the fallback

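As an illustration of the "cached data" option, the sketch below serves the last good value, marked stale, while the dependency is down. `CachedFallback` and `max_stale_seconds` are hypothetical names; the staleness bound stands in for whatever the SLA allows.

```python
import time


class CachedFallback:
    """Serve the last good value, marked stale, when the dependency call fails."""

    def __init__(self, fetch, max_stale_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._value = None
        self._stored_at = None

    def get(self):
        try:
            self._value = self._fetch()
            self._stored_at = self._clock()
            return {"value": self._value, "stale": False}
        except Exception:
            age_ok = (
                self._stored_at is not None
                and self._clock() - self._stored_at <= self._max_stale
            )
            if age_ok:
                return {"value": self._value, "stale": True}
            # No usable cache: re-raise so the caller can fail with a user message.
            raise
```

Recovery is implicit here: the first successful fetch after an outage refreshes the cache and clears the `stale` flag.
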
## Observability

For error model design, define:
- Which errors are logged (all unexpected errors, all server errors, sampled client errors)
- Which errors trigger alerts (server error rate, dead-letter queue (DLQ) depth, circuit breaker state)
- Error metrics (error rate by code, error rate by endpoint, p99 latency)
- Request tracing (correlation IDs across service boundaries)

Map observability requirements to PRD NFRs.

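Error rate by code and by endpoint, with a threshold-based alert, might be tracked along these lines. This is an in-memory sketch with hypothetical names (`ErrorMetrics`, `should_alert`); the 5% threshold and 20-request minimum are placeholder values, and a real system would use its metrics backend instead.

```python
from collections import Counter


class ErrorMetrics:
    def __init__(self):
        self.requests = Counter()  # keyed by endpoint
        self.errors = Counter()    # keyed by (endpoint, error code)

    def record(self, endpoint, error_code=None):
        self.requests[endpoint] += 1
        if error_code is not None:
            self.errors[(endpoint, error_code)] += 1

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        failed = sum(n for (ep, _), n in self.errors.items() if ep == endpoint)
        return failed / total if total else 0.0

    def should_alert(self, endpoint, threshold=0.05, min_requests=20):
        # Alert on a rate over a minimum sample, not on every single error.
        return (
            self.requests[endpoint] >= min_requests
            and self.error_rate(endpoint) > threshold
        )
```

The minimum sample size keeps a single failure on a quiet endpoint from paging anyone.
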
## Anti-Patterns

- Returning generic 500 errors for all server failures
- Not distinguishing timeout from failure
- Ignoring partial failure scenarios
- Leaking internal details in error responses
- Using the same error handling strategy for all operations regardless of criticality
- Not defining fallback behavior for external dependencies
- Alerting on all errors instead of actionable thresholds
- Using circuit breakers without fallback behavior