opencode-workflow/skills/error-model-design/SKILL.md

name: error-model-design
description: Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling.

This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is defining error handling strategy.

Core Principle

Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.

Error Categories

Client Errors (4xx)

Errors caused by the client sending invalid or incorrect requests.

Common client errors:

  • 400 Bad Request - malformed request body, missing required fields
  • 401 Unauthorized - missing or invalid authentication
  • 403 Forbidden - authenticated but not authorized for this resource
  • 404 Not Found - requested resource does not exist
  • 409 Conflict - request conflicts with the current resource state (duplicate, version mismatch, state-dependent business rule)
  • 422 Unprocessable Entity - well-formed request whose content violates a business rule
  • 429 Too Many Requests - rate limit exceeded

Design principles:

  • Client errors are non-retryable as sent; the exception is 429, which may be retried after the Retry-After interval
  • Error response must include enough detail for the client to correct the request
  • Error codes should be consistent and documented in the API contract (see api-contract-design)

Server Errors (5xx)

Errors caused by the server failing to process a valid request.

Common server errors:

  • 500 Internal Server Error - unexpected server failure
  • 502 Bad Gateway - upstream service failure
  • 503 Service Unavailable - temporary unavailability
  • 504 Gateway Timeout - upstream service timeout

Design principles:

  • Server errors may be retryable (see Retryable vs Non-Retryable Errors)
  • Error response should not leak internal details in production
  • All unexpected server errors must be logged and alerted
  • Circuit breakers should protect against cascading server errors

Business Rule Violations

Errors where the request is valid but violates a business rule.

Design principles:

  • Use 409 when the violation depends on the current resource state, and 422 when the request content itself breaks the rule
  • Include the specific business rule that was violated
  • Include enough context for the client to understand and correct the issue
  • Map each business rule violation to a PRD functional requirement

Timeout Errors

Errors where an operation did not complete within the expected time.

Design principles:

  • Always distinguish timeout from confirmed failure
  • Timeout means "unknown state", not "failed": the operation may still have completed on the remote side
  • Define timeout values per operation type
  • Document recovery procedures for timed-out operations
  • See distributed-system-basics for timeout vs failure handling
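
The timeout principles above can be sketched as an explicit three-way outcome, so callers never collapse "unknown" into "failed". This is a minimal illustration, not a prescribed implementation: `Outcome`, `CallResult`, and `classify_call` are hypothetical names, and treating every non-timeout exception as a confirmed failure is a simplifying assumption.

```python
import enum
from dataclasses import dataclass


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"    # the dependency confirmed the operation did not happen
    UNKNOWN = "unknown"  # timed out: the operation may or may not have been applied


@dataclass
class CallResult:
    outcome: Outcome
    detail: str = ""


def classify_call(operation) -> CallResult:
    """Run `operation` and map its result to an explicit outcome."""
    try:
        operation()
    except TimeoutError as exc:
        # No answer arrived in time, so the remote state is unknown.
        return CallResult(Outcome.UNKNOWN, str(exc))
    except Exception as exc:
        # Simplification: any other exception is treated as a confirmed failure.
        return CallResult(Outcome.FAILED, str(exc))
    return CallResult(Outcome.SUCCEEDED)
```

A caller that receives `Outcome.UNKNOWN` would typically run the documented recovery procedure (for example, querying the operation's status) rather than blindly reissuing the request.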

Cascading Failures

Failures that propagate from one service to another, potentially bringing down the entire system.

Design principles:

  • Use circuit breakers to stop cascade propagation
  • Use bulkheads to isolate failure domains
  • Define fallback behavior for each dependency failure
  • Monitor and alert on circuit breaker state changes

Error Propagation Strategy

Fail-Fast

Immediately return an error to the caller when a dependency fails.

Use when:

  • The caller cannot proceed without the dependency
  • Partial data is worse than no data
  • The PRD requires immediate feedback

Graceful Degradation

Continue serving reduced functionality when a dependency fails.

Use when:

  • The PRD allows partial functionality
  • Some data is better than no data
  • The feature has a clear fallback path

Define for each graceful degradation:

  • What functionality is reduced
  • What the user sees instead
  • How the system recovers when the dependency returns

Circuit Breaker

Stop calling a failing dependency after a threshold of failures, allowing it time to recover.

Define for each circuit breaker:

  • Failure threshold (how many failures before opening)
  • Recovery timeout (how long before trying again)
  • Half-open behavior (how many requests to allow during recovery)
  • Fallback behavior when circuit is open

Use when:

  • A dependency is experiencing persistent failures
  • Continuing to call will make things worse (cascading failure risk)
  • The system can operate with reduced functionality
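
The four circuit breaker parameters listed above map onto a small state machine. The sketch below is one possible shape under stated assumptions: it is single-threaded, it allows exactly one trial request in the half-open state, and the class name, defaults, and injectable `clock` are illustrative choices rather than part of this contract.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, operation, fallback):
        if self.state() == "open":
            # Do not touch the failing dependency; serve the fallback instead.
            return fallback()
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed half-open trial, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Requiring a `fallback` argument on every call makes it impossible to use the breaker without defining the open-circuit behavior.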

Error Response Format

Define a consistent error response format across the entire system:

{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message describing what happened",
    "details": [
      {
        "field": "field_name",
        "code": "SPECIFIC_ERROR_CODE",
        "message": "Specific error description"
      }
    ],
    "request_id": "correlation-id-for-tracing"
  }
}

Design principles:

  • code is a machine-readable string constant (not HTTP status code)
  • message is human-readable and suitable for display or logging
  • details provides field-level validation errors when applicable
  • request_id enables cross-service error tracing
  • Never include stack traces, internal paths, or implementation details in production error responses
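
As an illustration of the format above, a single helper can build every error body so that the envelope stays consistent across handlers. The function names and the choice to generate a UUID when no correlation ID is supplied are assumptions made for this sketch.

```python
import uuid


def field_error(field, code, message):
    """Build one entry for the `details` list."""
    return {"field": field, "code": code, "message": message}


def build_error_response(code, message, details=None, request_id=None):
    """Build the error envelope; `code` is a string constant, not an HTTP status."""
    return {
        "error": {
            "code": code,
            "message": message,
            "details": list(details or []),
            # Reuse the inbound correlation ID when there is one (assumed behavior).
            "request_id": request_id or str(uuid.uuid4()),
        }
    }
```

For example, `build_error_response("VALIDATION_FAILED", "Request validation failed", [field_error("email", "REQUIRED", "email is required")], request_id="req-123")` yields a body with exactly the four keys shown in the format definition.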

Retryable vs Non-Retryable Errors

Retryable Errors

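  • Server errors (500, 502, 503, 504) with backoff, provided the operation is idempotent or carries an idempotency key
  • Timeout errors with backoff, under the same idempotency condition, because the original attempt may have succeeded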
  • Rate limit errors (429) with Retry-After
  • Network connectivity errors

Non-Retryable Errors

  • Client errors (400, 401, 403, 404, 409, 422)
  • Business rule violations
  • Malformed requests
  • Authentication failures

Define per endpoint whether an error is retryable. Include this in the API contract.
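
The classification above can be expressed as a small client-side policy. The following is a sketch under assumptions: the status set mirrors the lists in this section, `retry_after` is assumed to be already parsed into seconds (the header may also carry an HTTP date), and the backoff uses exponential growth with full jitter, which is one common choice rather than a requirement of this contract.

```python
import random

# Statuses this section lists as retryable.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def is_retryable(status):
    return status in RETRYABLE_STATUSES


def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0):
    """Return seconds to wait before the next attempt, or None if retrying is not allowed.

    `attempt` is zero-based; `base` and `cap` are illustrative defaults.
    """
    if not is_retryable(status):
        return None
    if status == 429 and retry_after is not None:
        # Honor the server's Retry-After value instead of guessing.
        return float(retry_after)
    # Exponential backoff with full jitter, bounded by `cap`.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A per-endpoint override table can wrap `is_retryable` where the API contract marks an endpoint's errors differently.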

Partial Failure Behavior

Define partial failure behavior for operations that span multiple steps or services:

  • All-or-nothing: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
  • Best-effort: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
  • Compensating transaction (saga): Each step has a compensating action. If a step fails, the previously completed steps are undone via compensation, in reverse order. Use for multi-service operations that need an all-or-nothing outcome but cannot use distributed transactions. Intermediate states remain visible to other readers until compensation completes.

For each partial failure scenario:

  • Define what "partial" means in this context
  • Define whether partial success is acceptable or must be fully rolled back
  • Define the recovery procedure
  • Map to a PRD edge case
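
To make the compensating transaction option concrete, the sketch below runs steps in order and, on failure, compensates the completed steps in reverse. `run_saga` and the result dictionary shape are hypothetical; the sketch also assumes compensations themselves succeed, which a real design must not take for granted.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    On the first failing action, run the compensations of all completed
    steps in reverse order and report which step failed.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            compensated = []
            for done_name, done_compensate in reversed(completed):
                done_compensate()  # assumption: compensation does not fail
                compensated.append(done_name)
            return {
                "status": "rolled_back",
                "failed_step": name,
                "compensated": compensated,
                "error": str(exc),
            }
        completed.append((name, compensate))
    return {"status": "completed", "steps": [n for n, _ in completed]}
```

The returned `failed_step` and `compensated` fields give the recovery procedure something concrete to log and to map back to the PRD edge case.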

Fallback Strategy

For each external dependency, define:

  • What happens when the dependency is unavailable
  • Fallback behavior (cached data, default response, queue and retry, fail with user message)
  • How the system recovers when the dependency returns
  • SLA implications of the fallback
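
One of the fallback behaviors listed above, cached data, might look like the following. This is a sketch only: `CachedFallback`, the `degraded` flag, and the staleness bound are illustrative assumptions, and the bound stands in for whatever the SLA permits.

```python
import time


class CachedFallback:
    """Serve the last good value when the dependency is unavailable."""

    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self.fetch = fetch
        self.max_stale_seconds = max_stale_seconds  # how stale the SLA tolerates
        self.clock = clock
        self._value = None
        self._stored_at = None

    def get(self):
        try:
            value = self.fetch()
        except Exception:
            fresh_enough = (
                self._stored_at is not None
                and self.clock() - self._stored_at <= self.max_stale_seconds
            )
            if fresh_enough:
                # Flag the response so callers and metrics can see the degradation.
                return {"data": self._value, "degraded": True}
            raise  # nothing usable is cached: fail with the original error
        # Dependency is healthy again: refresh the cache and leave degraded mode.
        self._value, self._stored_at = value, self.clock()
        return {"data": value, "degraded": False}
```

Recovery is implicit here: the next successful `fetch` refreshes the cache and clears the degraded flag.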

Observability

For error model design, define:

  • Which errors are logged (all unexpected errors, all server errors, sampled client errors)
  • Which conditions trigger alerts (server error rate, DLQ depth, circuit breaker state)
  • Error metrics (error rate by code, error rate by endpoint, p99 latency)
  • Request tracing (correlation IDs across service boundaries)

Map observability requirements to PRD NFRs.

Anti-Patterns

  • Returning generic 500 errors for all server failures
  • Not distinguishing timeout from failure
  • Ignoring partial failure scenarios
  • Leaking internal details in error responses
  • Using the same error handling strategy for all operations regardless of criticality
  • Not defining fallback behavior for external dependencies
  • Alerting on all errors instead of actionable thresholds
  • Using circuit breakers without fallback behavior