| name | description |
|---|---|
| error-model-design | Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling. |
This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is defining error handling strategy.
Core Principle
Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.
Error Categories
Client Errors (4xx)
Errors caused by the client sending invalid or incorrect requests.
Common client errors:
- 400 Bad Request - malformed request body, missing required fields
- 401 Unauthorized - missing or invalid authentication
- 403 Forbidden - authenticated but not authorized for this resource
- 404 Not Found - requested resource does not exist
- 409 Conflict - state conflict (duplicate, version mismatch, business rule violation)
- 422 Unprocessable Entity - valid format but business rule violation
- 429 Too Many Requests - rate limit exceeded
Design principles:
- Client errors are non-retryable (unless 429 with Retry-After)
- Error response must include enough detail for the client to correct the request
- Error codes should be consistent and documented in the API contract (see api-contract-design)
Server Errors (5xx)
Errors caused by the server failing to process a valid request.
Common server errors:
- 500 Internal Server Error - unexpected server failure
- 502 Bad Gateway - upstream service failure
- 503 Service Unavailable - temporary unavailability
- 504 Gateway Timeout - upstream service timeout
Design principles:
- Server errors may be retryable (see Retryable vs Non-Retryable Errors)
- Error response should not leak internal details in production
- All unexpected server errors must be logged and alerted
- Circuit breakers should protect against cascading server errors
Business Rule Violations
Errors where the request is valid but violates a business rule.
Design principles:
- Use 422 or 409 depending on the nature of the violation
- Include the specific business rule that was violated
- Include enough context for the client to understand and correct the issue
- Map each business rule violation to a PRD functional requirement
Timeout Errors
Errors where an operation did not complete within the expected time.
Design principles:
- Always distinguish timeout from confirmed failure
- Timeout means "unknown state" not "failed"
- Define timeout values per operation type
- Document recovery procedures for timed-out operations
- See distributed-system-basics for timeout vs failure handling
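The distinction between a timeout and a confirmed failure can be made explicit in the result type. The following is a minimal sketch, not a prescribed implementation; `Outcome` and `call_with_timeout` are hypothetical names introduced only for illustration.

```python
import concurrent.futures
import enum


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"    # the operation reported an error
    UNKNOWN = "unknown"  # timed out: it may or may not have completed


def call_with_timeout(operation, timeout_seconds):
    """Run `operation` and classify the result without conflating timeout and failure."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(operation)
    try:
        future.result(timeout=timeout_seconds)
        return Outcome.SUCCEEDED
    except concurrent.futures.TimeoutError:
        # We stopped waiting, but the operation may still finish on the other side.
        return Outcome.UNKNOWN
    except Exception:
        # The operation itself raised: this is a confirmed failure.
        return Outcome.FAILED
    finally:
        # Do not block on the worker; a timed-out call keeps running in the background.
        pool.shutdown(wait=False)
```

A caller that receives `Outcome.UNKNOWN` would follow the documented recovery procedure (for example, query the state before retrying) rather than treat the operation as failed.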
Cascading Failures
Failures that propagate from one service to another, potentially bringing down the entire system.
Design principles:
- Use circuit breakers to stop cascade propagation
- Use bulkheads to isolate failure domains
- Define fallback behavior for each dependency failure
- Monitor and alert on circuit breaker state changes
Error Propagation Strategy
Fail-Fast
Immediately return an error to the caller when a dependency fails.
Use when:
- The caller cannot proceed without the dependency
- Partial data is worse than no data
- The PRD requires immediate feedback
Graceful Degradation
Continue serving reduced functionality when a dependency fails.
Use when:
- The PRD allows partial functionality
- Some data is better than no data
- The feature has a clear fallback path
Define for each graceful degradation:
- What functionality is reduced
- What the user sees instead
- How the system recovers when the dependency returns
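As an illustration of graceful degradation, the sketch below serves a page without its optional part when that dependency fails, and records which feature was reduced. The product page scenario and the names `ProductPage`, `build_product_page`, `load_product`, and `load_recommendations` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ProductPage:
    product: dict
    recommendations: list = field(default_factory=list)
    degraded: list = field(default_factory=list)  # names of reduced features


def build_product_page(product_id, load_product, load_recommendations):
    # Core data: let any error propagate, the page is useless without it (fail-fast).
    page = ProductPage(product=load_product(product_id))
    try:
        page.recommendations = load_recommendations(product_id)
    except Exception:
        # Optional dependency failed: serve the page without it and record the reduction
        # so the UI can show a placeholder and monitoring can count degraded responses.
        page.degraded.append("recommendations")
    return page
```

Because each request tries the dependency again, this sketch recovers on its own once the dependency returns; a real design would state that recovery path explicitly.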
Circuit Breaker
Stop calling a failing dependency after a threshold of failures, allowing it time to recover.
Define for each circuit breaker:
- Failure threshold (how many failures before opening)
- Recovery timeout (how long before trying again)
- Half-open behavior (how many requests to allow during recovery)
- Fallback behavior when circuit is open
Use when:
- A dependency is experiencing persistent failures
- Continuing to call will make things worse (cascading failure risk)
- The system can operate with reduced functionality
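The parameters listed above can be seen in a small single-threaded sketch. It is a simplification under stated assumptions: the class and parameter names are hypothetical, failures are counted consecutively, and the half-open state lets probe calls through until one succeeds (closing the circuit) or fails (reopening it).

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds before a probe is allowed
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, operation, fallback=None):
        if self.state == "open":
            # Do not touch the failing dependency at all while open.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("dependency circuit is open")
        try:
            result = operation()
        except Exception:
            self._record_failure()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed half-open probe, or reaching the threshold, (re)opens the circuit.
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Passing a `fallback` on every call keeps the breaker from becoming a new source of errors while the circuit is open.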
Error Response Format
Define a consistent error response format across the entire system:
{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message describing what happened",
    "details": [
      {
        "field": "field_name",
        "code": "SPECIFIC_ERROR_CODE",
        "message": "Specific error description"
      }
    ],
    "request_id": "correlation-id-for-tracing"
  }
}
Design principles:
- code is a machine-readable string constant (not HTTP status code)
- message is human-readable and suitable for display or logging
- details provides field-level validation errors when applicable
- request_id enables cross-service error tracing
- Never include stack traces, internal paths, or implementation details in production error responses
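One way to keep the format consistent is to build every error body through a single helper. The sketch below assumes hypothetical helper names (`error_response`, `field_error`) and an assumed choice to emit an empty `details` list when there are no field-level errors; the keys themselves follow the format shown above.

```python
import uuid


def field_error(field, code, message):
    """One entry for the `details` list: a field-level validation error."""
    return {"field": field, "code": code, "message": message}


def error_response(code, message, details=None, request_id=None):
    """Build an error body in the system-wide format."""
    return {
        "error": {
            "code": code,                # machine-readable constant, not the HTTP status
            "message": message,          # human-readable, safe to display or log
            "details": details or [],    # assumption: empty list when not applicable
            # Reuse the incoming correlation ID when there is one; otherwise create one.
            "request_id": request_id or str(uuid.uuid4()),
        }
    }
```

For example, `error_response("VALIDATION_FAILED", "Request validation failed", [field_error("email", "INVALID_FORMAT", "Email address is not valid")], request_id="req-123")` yields a body with one field-level entry and the caller's correlation ID.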
Retryable vs Non-Retryable Errors
Retryable Errors
- Server errors (500, 502, 503, 504) with backoff
- Timeout errors with backoff (the outcome is unknown, so the retried operation must be idempotent)
- Rate limit errors (429) with Retry-After
- Network connectivity errors
Non-Retryable Errors
- Client errors (400, 401, 403, 404, 409, 422)
- Business rule violations
- Malformed requests
- Authentication failures
Define per endpoint whether an error is retryable. Include this in the API contract.
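The classification above can be expressed as a single function that returns how long to wait, or nothing when the error should not be retried. This is a sketch under assumptions: `retry_delay` and its parameters are hypothetical, and exponential backoff with full jitter is one common backoff choice, not one the contract mandates.

```python
import random

# Status codes the contract lists as retryable with backoff.
RETRYABLE_SERVER_ERRORS = {500, 502, 503, 504}


def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to wait before retry number `attempt` (1-based), or None for no retry."""
    if status == 429:
        # Rate limits are retried only when the server supplied Retry-After (seconds).
        return retry_after
    if status in RETRYABLE_SERVER_ERRORS:
        # Exponential backoff with full jitter, capped at `cap` seconds.
        return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
    # Everything else (400, 401, 403, 404, 409, 422, ...) is non-retryable.
    return None
```

A per-endpoint override table could sit in front of this function where the API contract marks a specific error differently.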
Partial Failure Behavior
Define partial failure behavior for operations that span multiple steps or services:
- All-or-nothing: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
- Best-effort: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
- Compensating transaction (saga): Each step has a compensating action. If a step fails, previous steps are undone via compensation. Use for multi-service operations where atomicity is required but distributed transactions are not available.
For each partial failure scenario:
- Define what "partial" means in this context
- Define whether partial success is acceptable or must be fully rolled back
- Define the recovery procedure
- Map to a PRD edge case
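The compensating transaction option can be sketched as a loop that remembers how to undo each completed step. The function name `run_saga` and the step tuple layout are assumptions for illustration; the sketch also assumes compensations themselves succeed, which a real design would need to address in its recovery procedure.

```python
def run_saga(steps):
    """Run steps in order; on failure, undo completed steps in reverse order.

    `steps` is a list of (name, action, compensation) tuples of zero-argument callables.
    """
    completed = []  # (name, compensation) for every step that succeeded
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            # Undo the most recent step first.
            for _, undo in reversed(completed):
                undo()
            return {
                "status": "rolled_back",
                "failed_step": name,
                "error": str(exc),
                "compensated": [done for done, _ in reversed(completed)],
            }
        completed.append((name, compensation))
    return {"status": "completed", "steps": [done for done, _ in completed]}
```

Steps after the failing one never run, so only steps that actually completed are compensated.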
Fallback Strategy
For each external dependency, define:
- What happens when the dependency is unavailable
- Fallback behavior (cached data, default response, queue and retry, fail with user message)
- How the system recovers when the dependency returns
- SLA implications of the fallback
Observability
For error model design, define:
- What errors are logged (all unexpected errors, all server errors, sampled client errors)
- What errors trigger alerts (server error rate, DLQ depth, circuit breaker state)
- Error metrics (error rate by code, error rate by endpoint, p99 latency)
- Request tracing (correlation IDs across service boundaries)
Map observability requirements to PRD NFRs.
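To make "error rate by code" and threshold-based alerting concrete, here is a small in-memory sketch. The class `ErrorMetrics`, the 5% threshold, and the minimum request count are hypothetical values chosen for the example; real thresholds would come from the PRD NFRs.

```python
from collections import Counter


class ErrorMetrics:
    def __init__(self):
        self.requests = Counter()  # endpoint -> total requests
        self.by_code = Counter()   # (endpoint, status) -> error count

    def record(self, endpoint, status):
        self.requests[endpoint] += 1
        if status >= 400:
            self.by_code[(endpoint, status)] += 1

    def server_error_rate(self, endpoint):
        total = self.requests[endpoint]
        if total == 0:
            return 0.0
        errors = sum(
            count
            for (ep, status), count in self.by_code.items()
            if ep == endpoint and status >= 500
        )
        return errors / total

    def should_alert(self, endpoint, threshold=0.05, min_requests=20):
        # Alert on a sustained server error rate, not on individual errors,
        # and only once there is enough traffic for the rate to be meaningful.
        return (
            self.requests[endpoint] >= min_requests
            and self.server_error_rate(endpoint) > threshold
        )
```

Client errors are counted by code but do not feed the alert, which matches alerting on server error rate rather than on every error.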
Anti-Patterns
- Returning generic 500 errors for all server failures
- Not distinguishing timeout from failure
- Ignoring partial failure scenarios
- Leaking internal details in error responses
- Using the same error handling strategy for all operations regardless of criticality
- Not defining fallback behavior for external dependencies
- Alerting on all errors instead of actionable thresholds
- Using circuit breakers without fallback behavior