opencode-workflow/SKILL.md at 082c9203fa3af7311dbce8fa87c5a737e9c5818c

name	description
distributed-system-basics	Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns.

This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is designing for distributed system concerns.

Delivery Guarantees

At-Most-Once

Message may be lost but never delivered twice
Use when: loss is acceptable, retries are not, and throughput is the priority
Trade-off: simplicity and speed at the cost of reliability

At-Least-Once

Message is never lost but may be delivered more than once
Use when: loss is unacceptable, consumers are idempotent or can deduplicate
Trade-off: reliability at the cost of requiring idempotency handling
Most common default for production systems

Exactly-Once

Message is delivered once and only once
Use when: duplicates are harmful and idempotency is hard or impossible
Trade-off: significant complexity, performance overhead, and coordination cost
Often achieved in practice as at-least-once delivery plus idempotent processing, rather than through a true exactly-once protocol

Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
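
The combination of at-least-once delivery with an idempotent consumer can be sketched as follows. This is a minimal illustration under stated assumptions, not a production design: `Message`, `DedupConsumer`, and the in-memory `seen` set are hypothetical names, and a real system would persist processed IDs atomically with the side effect.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # assumed unique per logical message
    amount: int


@dataclass
class DedupConsumer:
    """Applies each message's effect once, even if it is delivered repeatedly."""
    seen: set = field(default_factory=set)
    total: int = 0

    def handle(self, msg: Message) -> bool:
        # Returns True if the effect was applied, False for a duplicate.
        if msg.message_id in self.seen:
            return False
        self.total += msg.amount
        self.seen.add(msg.message_id)
        return True


consumer = DedupConsumer()
# At-least-once delivery: "m1" arrives twice (for example, the ack was lost).
deliveries = [Message("m1", 10), Message("m2", 5), Message("m1", 10)]
applied = [consumer.handle(m) for m in deliveries]
```

The broker stays simple (at-least-once), and the consumer absorbs redelivery, so the observable effect matches exactly-once without a coordination protocol.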

Retry Behavior

When to Retry

Transient network failures
Temporary resource unavailability (503, timeouts)
Rate limit exceeded (429, with backoff)
Upstream service failures (502, 504)

When NOT to Retry

Client errors other than 429 (400, 401, 403, 404, 422)
Business rule violations
Malformed requests
Non-retryable error codes explicitly defined in the API contract
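
The two lists above can be encoded as a small decision function. This is a sketch only: `should_retry` and its `contract_non_retryable` parameter are hypothetical names, and the status sets simply mirror the codes listed above.

```python
from typing import Optional

# Status codes copied from the lists above; adjust to the API contract in use.
RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})
NON_RETRYABLE_STATUSES = frozenset({400, 401, 403, 404, 422})


def should_retry(status: Optional[int],
                 contract_non_retryable: frozenset = frozenset()) -> bool:
    """Decide whether a failed call is worth retrying.

    status is None for transport-level errors (connection reset, timeout),
    which are treated here as transient.
    """
    if status is None:
        return True
    # Codes the API contract marks as non-retryable take precedence.
    if status in contract_non_retryable or status in NON_RETRYABLE_STATUSES:
        return False
    return status in RETRYABLE_STATUSES
```

Unknown codes default to "do not retry" in this sketch, which is the conservative choice when the contract is silent.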

Retry Strategy Parameters

Maximum retries: define per operation (typically 2-5)
Backoff strategy:
- Fixed interval: predictable but may overwhelm recovering service
- Exponential backoff: increasingly longer waits (recommended default)
- Exponential backoff with jitter: adds randomness to avoid thundering herd
Retry budget: limit total retries per time window to prevent cascading failure
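
Exponential backoff with jitter and a maximum retry count can be sketched as below. The function names, the "full jitter" variant (a uniform delay between zero and the exponential ceiling), and the default values are assumptions chosen for illustration. A retry budget is not shown.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_retries=3, sleep=time.sleep, **backoff):
    """Call operation(); on ConnectionError, wait and retry up to max_retries times."""
    delays = backoff_delays(max_retries, **backoff)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)


attempts = []


def flaky():
    # Hypothetical operation that fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "ok"


result = call_with_retries(flaky, max_retries=3, sleep=lambda seconds: None)
```

Injecting `sleep` and `rng` keeps the helper testable without real waiting or randomness.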

Retry Anti-Patterns

Retrying non-idempotent operations without deduplication
Infinite retries without a circuit breaker
Synchronous retries that block the caller indefinitely
Ignoring Retry-After headers
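
To avoid the last anti-pattern, a client can derive its wait time from the Retry-After header, which carries either a number of seconds or an HTTP date. The helper below is a sketch: `retry_after_seconds` and its `fallback` parameter are hypothetical names, and the fallback would normally be the computed backoff delay.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, fallback=1.0):
    """Return seconds to wait for a Retry-After value (delta-seconds or HTTP-date)."""
    if header_value is None:
        return fallback
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return fallback  # unparseable header: use the fallback delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (when - now).total_seconds())
```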

Duplicate Requests

Duplicates arise from:

Network retries
Client timeouts with successful server processing
Message queue redelivery
User double-submit

Handling strategies:

Idempotency keys (preferred for API operations)
Deduplication at consumer level (for event processing)
Natural idempotency (read operations, certain write patterns)
Idempotency is covered in detail in the idempotency-design knowledge contract
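
A minimal sketch of the idempotency-key strategy for an API operation follows. `PaymentService`, its `charge` method, and the in-memory `results` dictionary are hypothetical. A real service would store keys durably, expire them, and handle concurrent requests that carry the same key.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Replays the stored response when the same idempotency key is seen again."""
    results: dict = field(default_factory=dict)  # idempotency key -> response
    charges: list = field(default_factory=list)

    def charge(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self.results:
            return self.results[idempotency_key]  # duplicate: no second charge
        self.charges.append(amount)
        response = {"charge_id": len(self.charges), "amount": amount}
        self.results[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("key-123", 500)
retry = service.charge("key-123", 500)  # e.g. the client timed out and resent
```

Returning the original response, rather than an error, lets a client that never saw the first reply proceed as if its first attempt had succeeded.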

Timeout vs Failure

Timeout

The operation may have succeeded; you just do not know
Must be handled as "unknown state" not "failed state"
Requires idempotency or state reconciliation

Failure

The operation definitively did not succeed
Can be retried without risk of duplicating the effect, provided the error is a retryable one

Design implications:

Always distinguish between timeout and confirmed failure
For timeouts, retry with an idempotency key or check the operation's state before retrying
Define timeout values per operation type (short for interactive, long for batch)
Document timeout values in API contracts
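
The distinction between a timeout and a confirmed failure can be made explicit in code by modelling three outcomes instead of two. The names `Outcome`, `send`, and `resolve` are hypothetical, and the `already_applied` callback stands in for whatever state check the real API offers.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"  # timed out: the server may or may not have applied it


def send(operation) -> Outcome:
    """Map exceptions to outcomes; a timeout is not treated as a failure."""
    try:
        operation()
        return Outcome.SUCCEEDED
    except TimeoutError:
        return Outcome.UNKNOWN
    except ConnectionRefusedError:
        return Outcome.FAILED  # the request never reached the server


def resolve(outcome: Outcome, already_applied, retry) -> Outcome:
    """For UNKNOWN, check server state first; retry only if nothing was applied."""
    if outcome is Outcome.UNKNOWN:
        return Outcome.SUCCEEDED if already_applied() else retry()
    return outcome


def slow_write():
    raise TimeoutError("no response within deadline")


outcome = send(slow_write)
final = resolve(outcome, already_applied=lambda: True, retry=lambda: Outcome.FAILED)
```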

Partial Failure

Partial failure occurs when:

A multi-step operation fails after some steps succeed
A batch operation partially succeeds
An upstream dependency fails mid-transaction

Handling strategies:

Compensating transactions (saga pattern) for multi-service operations
Partial success responses (207 Multi-Status for batch operations)
Atomic operations where possible (single-service transactions)
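Outbox pattern: write the state change and the outgoing event in one local transaction, so the event is eventually published even if the publisher fails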

Design principles:

Define what "partial" means for each operation
Define whether partial success is acceptable or must be fully rolled back
Document recovery procedures for each partial failure scenario
Map partial failure scenarios to PRD edge cases
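
Compensating transactions for a multi-step operation can be sketched as a list of (action, compensation) pairs. `run_saga` and the order-processing steps are hypothetical. The sketch assumes compensations themselves succeed; a real saga would also need to record progress durably and retry failed compensations.

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.

    Runs actions in order. If one raises, runs the compensations of the
    already-completed steps in reverse order, then re-raises the error.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensation)


log = []


def fail_shipping():
    raise RuntimeError("shipping service unavailable")


steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (fail_shipping, lambda: log.append("cancel shipment")),
]

try:
    run_saga(steps)
except RuntimeError:
    saga_failed = True
else:
    saga_failed = False
```

The failed step's own compensation is not run, because that step never completed. Only the two earlier steps are undone, most recent first.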

Eventual Consistency

Eventual consistency means:

Updates propagate asynchronously
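Reads may return stale data until updates propagate; the staleness window is bounded only if the system guarantees a bound
All replicas converge to the same state once updates stop and propagation completes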

When to use:

Cross-service data synchronization
Read replicas and caching
Event-driven architectures
High-write-volume scenarios where low write latency matters more than immediately consistent reads

When NOT to use:

Financial balances where immediate consistency is required
Inventory counts where overselling is unacceptable
Authorization decisions where stale permissions are harmful
Any scenario the PRD marks as requiring strong consistency
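
A toy model of asynchronous replication shows the stale-read window and the eventual convergence described in this section. `Primary`, `Replica`, and `sync` are hypothetical names, and replication is triggered manually here, whereas a real system would run it in the background.

```python
class Primary:
    def __init__(self):
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value):
        self.log.append((key, value))


class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def sync(self, primary, max_entries=None):
        """Apply pending log entries in order (optionally only some of them)."""
        pending = primary.log[self.applied:]
        if max_entries is not None:
            pending = pending[:max_entries]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)

    def read(self, key):
        return self.data.get(key)


primary, replica = Primary(), Replica()
primary.write("stock", 10)
replica.sync(primary)
primary.write("stock", 9)
stale = replica.read("stock")  # replica has not seen the second write yet
replica.sync(primary)
fresh = replica.read("stock")  # converged after replication caught up
```

A reader that hits the replica between the second write and the second `sync` sees the old value, which is why the "When NOT to use" cases above need reads from the primary or a stronger consistency model.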