opencode-workflow/SKILL.md at e36f0c15cd76d23ceb9e37fb1e46f594cbd6247c


name: error-model-design
description: Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling.

This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is defining error handling strategy.

Core Principle

Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.

Error Categories

Client Errors (4xx)

Errors caused by the client sending invalid or incorrect requests.

Common client errors:

400 Bad Request - malformed request body, missing required fields
401 Unauthorized - missing or invalid authentication
403 Forbidden - authenticated but not authorized for this resource
404 Not Found - requested resource does not exist
409 Conflict - state conflict (duplicate, version mismatch, business rule violation)
422 Unprocessable Entity - valid format but business rule violation
429 Too Many Requests - rate limit exceeded

Design principles:

Client errors are non-retryable unless the client changes the request; the exception is 429, which can be retried after the delay given in Retry-After
Error response must include enough detail for the client to correct the request
Error codes should be consistent and documented in the API contract (see api-contract-design)

Server Errors (5xx)

Errors caused by the server failing to process a valid request.

Common server errors:

500 Internal Server Error - unexpected server failure
502 Bad Gateway - upstream service failure
503 Service Unavailable - temporary unavailability
504 Gateway Timeout - upstream service timeout

Design principles:

Server errors may be retryable (see retryable vs non-retryable)
Error response should not leak internal details in production
All unexpected server errors must be logged and alerted
Circuit breakers should protect against cascading server errors

Business Rule Violations

Errors where the request is valid but violates a business rule.

Design principles:

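Use 409 when the request conflicts with the current state of the resource (duplicate, version mismatch); use 422 when the request is well-formed but its content violates a business rule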
Include the specific business rule that was violated
Include enough context for the client to understand and correct the issue
Map each business rule violation to a PRD functional requirement
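
One way to carry these principles in code is an exception type that records the violated rule, the PRD requirement it traces to, and the context the client needs. The sketch below is illustrative only: the rule names, the requirement IDs (`FR-7`, `FR-12`), the quantity limit, and the `place_order` function are all hypothetical and not taken from any PRD.

```python
class BusinessRuleViolation(Exception):
    """A valid request that breaks a business rule (hypothetical shape)."""

    def __init__(self, rule, message, requirement, state_conflict=False, context=None):
        super().__init__(message)
        self.rule = rule                      # machine-readable rule identifier
        self.requirement = requirement        # PRD functional requirement it maps to
        self.state_conflict = state_conflict  # True when the clash is with current state
        self.context = context or {}          # detail the client needs to correct the request

    @property
    def http_status(self):
        # 409 for conflicts with current resource state, 422 for rule violations in the content
        return 409 if self.state_conflict else 422


MAX_QUANTITY_PER_ORDER = 10  # hypothetical business rule


def place_order(existing_order_ids, order_id, quantity):
    if order_id in existing_order_ids:
        raise BusinessRuleViolation(
            rule="DUPLICATE_ORDER",
            message="An order with this ID already exists",
            requirement="FR-7",
            state_conflict=True,
            context={"order_id": order_id},
        )
    if quantity > MAX_QUANTITY_PER_ORDER:
        raise BusinessRuleViolation(
            rule="QUANTITY_LIMIT_EXCEEDED",
            message="Quantity exceeds the per-order limit",
            requirement="FR-12",
            context={"requested": quantity, "limit": MAX_QUANTITY_PER_ORDER},
        )
    return {"order_id": order_id, "quantity": quantity}
```

An error handler can then translate the exception into the shared error response format without each endpoint choosing its own status code.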

Timeout Errors

Errors where an operation did not complete within the expected time.

Design principles:

Always distinguish timeout from confirmed failure
A timeout means "unknown state", not "failed": the operation may still have completed on the remote side
Define timeout values per operation type
Document recovery procedures for timed-out operations
See distributed-system-basics for timeout vs failure handling
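
The distinction between a timeout and a confirmed failure can be made explicit in the result type. This is a minimal sketch, assuming the wrapped call raises Python's built-in `TimeoutError` when its deadline passes; the `Outcome` enum, the `classify_call` helper, and the timeout values are hypothetical.

```python
import enum


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"    # the dependency confirmed the operation did not happen
    UNKNOWN = "unknown"  # timed out: the operation may or may not have completed


# Hypothetical per-operation-type timeouts, in seconds
TIMEOUTS_SECONDS = {"read": 1.0, "write": 5.0}


def classify_call(operation_type, call):
    """Run call(timeout) and report the outcome without conflating timeout and failure."""
    timeout = TIMEOUTS_SECONDS[operation_type]
    try:
        call(timeout)
    except TimeoutError:
        return Outcome.UNKNOWN
    except Exception:
        return Outcome.FAILED
    return Outcome.SUCCEEDED
```

Callers that receive `Outcome.UNKNOWN` would follow the documented recovery procedure (for example, querying the operation's status) rather than assuming the write did not happen.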

Cascading Failures

Failures that propagate from one service to another, potentially bringing down the entire system.

Design principles:

Use circuit breakers to stop cascade propagation
Use bulkheads to isolate failure domains
Define fallback behavior for each dependency failure
Monitor and alert on circuit breaker state changes
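
A bulkhead can be as simple as a cap on concurrent calls to one dependency, so that a slow dependency cannot consume every worker. The sketch below assumes a thread-based service; the `Bulkhead` class and its `run` method are hypothetical names, and a real deployment would likely use the isolation features of its framework or service mesh instead.

```python
import threading


class Bulkhead:
    """Limit concurrent calls to one dependency; reject the overflow immediately."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, call, fallback):
        # Do not wait for a slot: waiting is how one slow dependency
        # ties up callers and spreads the failure.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return call()
        finally:
            self._slots.release()
```

Each dependency gets its own `Bulkhead` instance, which is what makes the failure domains independent.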

Error Propagation Strategy

Fail-Fast

Immediately return an error to the caller when a dependency fails.

Use when:

The caller cannot proceed without the dependency
Partial data is worse than no data
The PRD requires immediate feedback

Graceful Degradation

Continue serving reduced functionality when a dependency fails.

Use when:

The PRD allows partial functionality
Some data is better than no data
The feature has a clear fallback path

Define for each graceful degradation:

What functionality is reduced
What the user sees instead
How the system recovers when the dependency returns
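
The three questions above can be answered directly in the code path that degrades. A minimal sketch, assuming a product page whose recommendations come from an optional dependency; `build_product_page` and `fetch_recommendations` are hypothetical names.

```python
def build_product_page(product, fetch_recommendations):
    """Serve the page even if the recommendations dependency fails."""
    page = {"product": product, "recommendations": [], "degraded": []}
    try:
        page["recommendations"] = fetch_recommendations(product["id"])
    except Exception:
        # Reduced functionality: no recommendations.
        # What the user sees: the page without that section.
        page["degraded"].append("recommendations")
    return page
```

Recovery here is implicit: the dependency is tried again on the next request, so the section reappears as soon as it responds. Exposing the `degraded` list lets the caller render a notice and lets monitoring count degraded responses.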

Circuit Breaker

Stop calling a failing dependency after a threshold of failures, allowing it time to recover.

Define for each circuit breaker:

Failure threshold (how many failures before opening)
Recovery timeout (how long before trying again)
Half-open behavior (how many requests to allow during recovery)
Fallback behavior when circuit is open

Use when:

A dependency is experiencing persistent failures
Continuing to call will make things worse (cascading failure risk)
The system can operate with reduced functionality
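
The four parameters listed above map onto a small state machine. The following is a simplified, single-threaded sketch rather than a production implementation: the `CircuitBreaker` class is hypothetical, it counts consecutive failures only, and in the half-open state it lets calls through one at a time, reopening on the first failure.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, func, fallback):
        if self.state == "open":
            return fallback()  # do not touch the failing dependency
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call reopens the circuit; so does reaching the threshold.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Any success closes the circuit and resets the count.
        self._failures = 0
        self._opened_at = None
        return result
```

The `clock` parameter exists so the recovery timeout can be tested without sleeping. The `fallback` argument is required on purpose: a breaker with no fallback only changes which error the caller sees.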

Error Response Format

Define a consistent error response format across the entire system:

{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message describing what happened",
    "details": [
      {
        "field": "field_name",
        "code": "SPECIFIC_ERROR_CODE",
        "message": "Specific error description"
      }
    ],
    "request_id": "correlation-id-for-tracing"
  }
}

Design principles:

code is a machine-readable string constant (not HTTP status code)
message is human-readable and suitable for display or logging
details provides field-level validation errors when applicable
request_id enables cross-service error tracing
Never include stack traces, internal paths, or implementation details in production error responses
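
A small set of builder functions helps keep every service on the same envelope. This sketch follows the JSON format above; the function names and the `VALIDATION_FAILED` and `INTERNAL_ERROR` codes are hypothetical examples, not part of the contract.

```python
import json
import uuid


def error_response(code, message, details=None, request_id=None):
    """Build the error envelope in the shared format."""
    return {
        "error": {
            "code": code,
            "message": message,
            "details": list(details or []),
            "request_id": request_id or str(uuid.uuid4()),
        }
    }


def validation_error(field_errors, request_id=None):
    """field_errors is an iterable of (field, code, message) tuples."""
    details = [{"field": f, "code": c, "message": m} for f, c, m in field_errors]
    return error_response(
        "VALIDATION_FAILED", "One or more fields are invalid", details, request_id
    )


def unexpected_error(exc, request_id=None):
    # The exception text is deliberately not copied into the response.
    # It belongs in the server log, keyed by request_id.
    return error_response(
        "INTERNAL_ERROR", "An unexpected error occurred", request_id=request_id
    )
```

For example, `json.dumps(validation_error([("email", "INVALID_FORMAT", "Not a valid email address")], "req-123"))` produces a body with one entry in `details` and the given `request_id`.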

Retryable vs Non-Retryable Errors

Retryable Errors

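Server errors (500, 502, 503, 504) with backoff, provided the operation is idempotent
Timeout errors with backoff, provided the operation is idempotent (a timeout leaves the outcome unknown, so a blind retry can apply the operation twice)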
Rate limit errors (429) with Retry-After
Network connectivity errors

Non-Retryable Errors

Client errors other than 429 (400, 401, 403, 404, 409, 422)
Business rule violations
Malformed requests
Authentication failures

Define per endpoint whether an error is retryable. Include this in the API contract.
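
The classification above can be encoded once and reused by every client. This is a minimal sketch under several assumptions: `send` is a hypothetical callable returning `(status, headers)`, `Retry-After` is given in seconds (the HTTP-date form is not handled), the operation is idempotent, and the base delay, cap, and attempt count are placeholder values.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def is_retryable(status):
    return status in RETRYABLE_STATUS


def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None, rng=random.random):
    """Exponential backoff with full jitter; an explicit Retry-After takes precedence."""
    if retry_after is not None:
        return float(retry_after)
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(send, max_attempts=4, sleep=time.sleep):
    """Call send() until it succeeds, hits a non-retryable error, or runs out of attempts."""
    for attempt in range(max_attempts):
        status, headers = send()
        last_attempt = attempt == max_attempts - 1
        if status < 400 or not is_retryable(status) or last_attempt:
            return status
        sleep(backoff_delay(attempt, retry_after=headers.get("Retry-After")))
```

Jitter is included because clients that back off on the same schedule tend to retry in synchronized waves, which can keep a recovering service overloaded.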

Partial Failure Behavior

Define partial failure behavior for operations that span multiple steps or services:

All-or-nothing: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
Best-effort: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
Compensating transaction (saga): Each step has a compensating action. If a step fails, previous steps are undone via compensation. Use for multi-service operations that need an all-or-nothing outcome but cannot use distributed transactions. Unlike a true atomic transaction, intermediate states are visible to other readers until compensation completes.

For each partial failure scenario:

Define what "partial" means in this context
Define whether partial success is acceptable or must be fully rolled back
Define the recovery procedure
Map to a PRD edge case
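
The compensating-transaction option can be sketched as a loop that remembers which steps completed and undoes them in reverse order on failure. The `run_saga` function and its step tuple shape are hypothetical; the sketch also assumes compensations themselves succeed, which a real design must not take for granted.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order; undo completed steps on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            compensated = []
            # Undo in reverse order so later steps are rolled back before earlier ones.
            for done_name, undo in reversed(completed):
                undo()
                compensated.append(done_name)
            return {
                "status": "rolled_back",
                "failed_step": name,
                "error": str(exc),
                "compensated": compensated,
            }
        completed.append((name, compensate))
    return {"status": "completed", "failed_step": None, "error": None, "compensated": []}
```

The returned record names the failed step and the steps that were compensated, which is the information a recovery procedure or an operator needs when a compensation later turns out to be incomplete.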

Fallback Strategy

For each external dependency, define:

What happens when the dependency is unavailable
Fallback behavior (cached data, default response, queue and retry, fail with user message)
How the system recovers when the dependency returns
SLA implications of the fallback
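
As one illustration of the "cached data" option, the wrapper below serves the last known value while the dependency is down, but only up to a staleness limit. The `CachedFallback` class, the `DependencyUnavailable` exception, and the 300-second default are hypothetical; the staleness limit is where the SLA implication of the fallback becomes an explicit number.

```python
import time


class DependencyUnavailable(Exception):
    """Raised when the dependency is down and no acceptable cached value exists."""


class CachedFallback:
    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        try:
            value = self._fetch(key)
        except Exception as exc:
            entry = self._cache.get(key)
            if entry is not None and now - entry[1] <= self._max_stale:
                return {"value": entry[0], "source": "cache"}
            # Nothing fresh enough to serve: fail with a clear, typed error.
            raise DependencyUnavailable(f"no fresh data for {key!r}") from exc
        self._cache[key] = (value, now)
        return {"value": value, "source": "live"}
```

Because every call tries the dependency first, recovery needs no extra step: the first successful fetch refreshes the cache and the `source` field returns to `"live"`.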

Observability

For error model design, define:

What errors are logged (all unexpected errors, all server errors, sampled client errors)
What errors trigger alerts (server error rate, DLQ depth, circuit breaker state)
Error metrics (error rate by code, error rate by endpoint, p99 latency)
Request tracing (correlation IDs across service boundaries)

Map observability requirements to PRD NFRs.
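
Alerting on a rate with a minimum sample size, rather than on every individual error, might look like the sketch below. The `ErrorMetrics` class, the 5% threshold, and the 20-request minimum are hypothetical values for illustration; a real system would normally compute this in its metrics backend over a time window.

```python
from collections import Counter


class ErrorMetrics:
    """In-memory counters for error rate by status code and by endpoint."""

    def __init__(self):
        self.total = 0
        self.by_status = Counter()
        self.by_endpoint = Counter()

    def record(self, endpoint, status):
        self.total += 1
        if status >= 400:
            self.by_status[status] += 1
            self.by_endpoint[endpoint] += 1

    def server_error_rate(self):
        if self.total == 0:
            return 0.0
        server_errors = sum(n for s, n in self.by_status.items() if s >= 500)
        return server_errors / self.total


def should_alert(metrics, threshold=0.05, min_requests=20):
    # Require a minimum sample so a single failure on a quiet endpoint does not page anyone.
    return metrics.total >= min_requests and metrics.server_error_rate() > threshold
```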

Anti-Patterns

Returning generic 500 errors for all server failures
Not distinguishing timeout from failure
Ignoring partial failure scenarios
Leaking internal details in error responses
Using the same error handling strategy for all operations regardless of criticality
Not defining fallback behavior for external dependencies
Alerting on all errors instead of actionable thresholds
Using circuit breakers without fallback behavior
