---
name: error-model-design
description: "Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is defining error handling strategy.

## Core Principle

Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.

## Error Categories

### Client Errors (4xx)

Errors caused by the client sending invalid or incorrect requests.

Common client errors:

- `400 Bad Request` - malformed request body, missing required fields
- `401 Unauthorized` - missing or invalid authentication
- `403 Forbidden` - authenticated but not authorized for this resource
- `404 Not Found` - requested resource does not exist
- `409 Conflict` - state conflict (duplicate, version mismatch, business rule violation)
- `422 Unprocessable Entity` - valid format but business rule violation
- `429 Too Many Requests` - rate limit exceeded

Design principles:

- Client errors are non-retryable (unless 429 with Retry-After)
- Error response must include enough detail for the client to correct the request
- Error codes should be consistent and documented in the API contract (see `api-contract-design`)

### Server Errors (5xx)

Errors caused by the server failing to process a valid request.

Common server errors:

- `500 Internal Server Error` - unexpected server failure
- `502 Bad Gateway` - upstream service failure
- `503 Service Unavailable` - temporary unavailability
- `504 Gateway Timeout` - upstream service timeout

Design principles:

- Server errors may be retryable (see retryable vs non-retryable)
- Error response should not leak internal details in production
- All unexpected server errors must be logged and alerted
- Circuit breakers should protect against cascading server errors

### Business Rule Violations

Errors where the request is valid but violates a business rule.

Design principles:

- Use 422 or 409 depending on the nature of the violation
- Include the specific business rule that was violated
- Include enough context for the client to understand and correct the issue
- Map each business rule violation to a PRD functional requirement

### Timeout Errors

Errors where an operation did not complete within the expected time.

Design principles:

- Always distinguish timeout from confirmed failure
- Timeout means "unknown state" not "failed"
- Define timeout values per operation type
- Document recovery procedures for timed-out operations
- See `distributed-system-basics` for timeout vs failure handling

### Cascading Failures

Failures that propagate from one service to another, potentially bringing down the entire system.

Design principles:

- Use circuit breakers to stop cascade propagation
- Use bulkheads to isolate failure domains
- Define fallback behavior for each dependency failure
- Monitor and alert on circuit breaker state changes

## Error Propagation Strategy

### Fail-Fast

Immediately return an error to the caller when a dependency fails.

Use when:

- The caller cannot proceed without the dependency
- Partial data is worse than no data
- The PRD requires immediate feedback

### Graceful Degradation

Continue serving reduced functionality when a dependency fails.

Use when:

- The PRD allows partial functionality
- Some data is better than no data
- The feature has a clear fallback path

Define for each graceful degradation:

- What functionality is reduced
- What the user sees instead
- How the system recovers when the dependency returns

### Circuit Breaker

Stop calling a failing dependency after a threshold of failures, allowing it time to recover.

Define for each circuit breaker:

- Failure threshold (how many failures before opening)
- Recovery timeout (how long before trying again)
- Half-open behavior (how many requests to allow during recovery)
- Fallback behavior when circuit is open

Use when:

- A dependency is experiencing persistent failures
- Continuing to call will make things worse (cascading failure risk)
- The system can operate with reduced functionality

## Error Response Format

Define a consistent error response format across the entire system:

```json
{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message describing what happened",
    "details": [
      {
        "field": "field_name",
        "code": "SPECIFIC_ERROR_CODE",
        "message": "Specific error description"
      }
    ],
    "request_id": "correlation-id-for-tracing"
  }
}
```

Design principles:

- `code` is a machine-readable string constant (not HTTP status code)
- `message` is human-readable and suitable for display or logging
- `details` provides field-level validation errors when applicable
- `request_id` enables cross-service error tracing
- Never include stack traces, internal paths, or implementation details in production error responses

## Retryable vs Non-Retryable Errors

### Retryable Errors

- Server errors (500, 502, 503, 504) with backoff
- Timeout errors with backoff
- Rate limit errors (429) with Retry-After
- Network connectivity errors

### Non-Retryable Errors

- Client errors (400, 401, 403, 404, 422, 409)
- Business rule violations
- Malformed requests
- Authentication failures

Define per endpoint whether an error is retryable. Include this in the API contract.
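
The retryable and non-retryable lists above can be expressed as a single decision function. The following is a minimal sketch, not a prescribed implementation: `decide_retry`, `RetryDecision`, the default attempt limit, and the delay values are all hypothetical, and it assumes `Retry-After` has already been parsed into seconds. A `status` of `None` stands for "no response received" (timeout or connectivity error).

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# Status codes the retryable list above names as candidates for retry with backoff.
RETRYABLE_STATUSES = {500, 502, 503, 504}


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    delay_seconds: float = 0.0


def decide_retry(
    status: Optional[int],
    attempt: int,
    retry_after: Optional[float] = None,
    max_attempts: int = 4,          # illustrative default, not from the contract
    base_delay: float = 0.5,        # illustrative default, in seconds
    max_delay: float = 30.0,        # illustrative default, in seconds
    rng: Callable[[], float] = random.random,
) -> RetryDecision:
    """Decide whether attempt number `attempt` (1-based) should be retried."""
    if attempt >= max_attempts:
        return RetryDecision(False)

    if status == 429:
        # Rate limits are retried only when the server says how long to wait.
        if retry_after is None:
            return RetryDecision(False)
        return RetryDecision(True, float(retry_after))

    if status is None or status in RETRYABLE_STATUSES:
        # Exponential backoff with full jitter, capped at max_delay.
        cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
        return RetryDecision(True, cap * rng())

    # Remaining client errors and business rule violations are never retried.
    return RetryDecision(False)
```

Because a timeout means "unknown state", a caller would typically consult this function only for operations that are safe to repeat; that per-endpoint choice belongs in the API contract.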

## Partial Failure Behavior

Define partial failure behavior for operations that span multiple steps or services:

- **All-or-nothing**: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
- **Best-effort**: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
- **Compensating transaction (saga)**: Each step has a compensating action. If a step fails, previous steps are undone via compensation. Use for multi-service operations where atomicity is required but distributed transactions are not available.

For each partial failure scenario:

- Define what "partial" means in this context
- Define whether partial success is acceptable or must be fully rolled back
- Define the recovery procedure
- Map to a PRD edge case

## Fallback Strategy

For each external dependency, define:

- What happens when the dependency is unavailable
- Fallback behavior (cached data, default response, queue and retry, fail with user message)
- How the system recovers when the dependency returns
- SLA implications of the fallback

## Observability

For error model design, define:

- What errors are logged (all unexpected errors, all server errors, sampled client errors)
- What errors trigger alerts (server error rate, DLQ depth, circuit breaker state)
- Error metrics (error rate by code, error rate by endpoint, p99 latency)
- Request tracing (correlation IDs across service boundaries)

Map observability requirements to PRD NFRs.
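
As an illustration of the compensating transaction (saga) option under Partial Failure Behavior, the sketch below runs steps in order and, when one fails, undoes the completed steps in reverse. The names `run_saga`, `SagaFailed`, and the `(name, action, compensation)` step shape are hypothetical. It assumes compensations themselves succeed; a real design would also need a recovery procedure for a failed compensation.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation); this shape is an assumption for the sketch.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class SagaFailed(Exception):
    """Raised after a step fails and earlier steps have been compensated."""

    def __init__(self, failed_step: str, compensated: List[str], cause: Exception):
        super().__init__(f"step {failed_step!r} failed: {cause}")
        self.failed_step = failed_step
        self.compensated = compensated
        self.cause = cause


def run_saga(steps: List[Step]) -> List[str]:
    """Run every step; on failure, compensate completed steps in reverse order."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            compensated: List[str] = []
            # Undo the most recent successful step first.
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                compensated.append(done_name)
            raise SagaFailed(name, compensated, exc) from exc
        completed.append((name, compensate))
    return [name for name, _ in completed]
```

The `failed_step` and `compensated` fields give the caller what it needs to log the outcome and report which parts of the operation were rolled back.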

## Anti-Patterns

- Returning generic 500 errors for all server failures
- Not distinguishing timeout from failure
- Ignoring partial failure scenarios
- Leaking internal details in error responses
- Using the same error handling strategy for all operations regardless of criticality
- Not defining fallback behavior for external dependencies
- Alerting on all errors instead of actionable thresholds
- Using circuit breakers without fallback behavior
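
To make the last anti-pattern concrete, here is a minimal single-threaded sketch of a circuit breaker that always requires a fallback. The `CircuitBreaker` class, its parameter names, and its default values are hypothetical. It covers the failure threshold, recovery timeout, and fallback described earlier, and simplifies half-open behavior to a single trial call.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,       # illustrative default
        recovery_timeout: float = 30.0,   # illustrative default, in seconds
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run `operation`, or `fallback` when the circuit is open or the call fails."""
        if self.state == "open":
            return fallback()
        try:
            result = operation()
        except Exception:
            # A real system would also log the error and emit a state-change metric here.
            self._failures += 1
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Making `fallback` a required argument means every call site has to state what happens when the dependency is unavailable, which is the decision the Fallback Strategy section asks for.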