opencode-workflow/skills/distributed-system-basics/SKILL.md

---
name: distributed-system-basics
description: "Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing for distributed system concerns.

## Delivery Guarantees

### At-Most-Once
- Message may be lost but never delivered twice
- Use when: loss is acceptable, retries are not acceptable, and throughput is the priority
- Trade-off: simplicity and speed at the cost of reliability

### At-Least-Once
- Message is never lost but may be delivered more than once
- Use when: loss is unacceptable, consumers are idempotent or can deduplicate
- Trade-off: reliability at the cost of requiring idempotency handling
- Most common default for production systems

### Exactly-Once
- Message is processed once and only once: never lost, never applied twice
- Use when: duplicates are harmful and idempotency is hard or impossible
- Trade-off: significant complexity, performance overhead, and coordination cost
- Often achieved as at-least-once delivery plus idempotent processing, rather than through a true exactly-once protocol

Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
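
As an illustration of how at-least-once delivery plus deduplication yields an exactly-once effect, here is a minimal sketch. The `DedupingConsumer` class, its in-memory `seen_ids` set, and the message shape are assumptions made for the example; a real consumer would keep the seen IDs in durable storage and update them atomically with the side effect.

```python
class DedupingConsumer:
    """Applies each message's effect at most once, keyed by message ID."""

    def __init__(self):
        # Hypothetical in-memory store; production code needs durable storage.
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.total += amount
        self.seen_ids.add(message_id)
        return True


consumer = DedupingConsumer()
# At-least-once delivery: message "m1" is redelivered after "m2".
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
results = [consumer.handle(mid, amt) for mid, amt in deliveries]
```

The redelivered `m1` is acknowledged but not applied a second time, so the transport can stay at-least-once while the business effect happens once.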

## Retry Behavior

### When to Retry
- Transient network failures
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff)
- Upstream service failures (502, 504)

### When NOT to Retry
- Client errors (400, 401, 403, 404, 422)
- Business rule violations
- Malformed requests
- Non-retryable error codes explicitly defined in the API contract

### Retry Strategy Parameters
- Maximum retries: define per operation (typically 2-5)
- Backoff strategy:
  - Fixed interval: predictable, but may overwhelm a recovering service
  - Exponential backoff: each wait is longer than the last, typically doubling (recommended default)
  - Exponential backoff with jitter: adds randomness so clients do not retry in lockstep (thundering herd)
- Retry budget: limit total retries per time window to prevent cascading failure
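
The backoff options above can be sketched as a single delay function. This is one possible variant ("full jitter", where the delay is drawn uniformly between zero and the exponential ceiling); the function name `backoff_delay` and the default `base` and `cap` values are assumptions for illustration, not recommended settings.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    source = rng if rng is not None else random
    # Exponential ceiling: base, 2*base, 4*base, ... limited to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point below the ceiling so that many
    # clients failing at the same moment do not retry at the same moment.
    return source.uniform(0, ceiling)
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests.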

### Retry Anti-Patterns
- Retrying non-idempotent operations without deduplication
- Infinite retries without a circuit breaker
- Synchronous retries that block the caller indefinitely
- Ignoring Retry-After headers
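
A minimal sketch of a retry loop that avoids these anti-patterns: it is bounded, retries only retryable status codes, and prefers the server's `Retry-After` hint over its own backoff. The `Response` tuple shape, the `call_with_retries` name, and the injected `sleep` parameter are assumptions for the example; jitter and a circuit breaker are left out to keep it short and deterministic.

```python
import time
from typing import Callable, Optional, Tuple

RETRYABLE = {429, 502, 503, 504}

# Hypothetical response shape: (status code, Retry-After in seconds or None).
Response = Tuple[int, Optional[float]]


def call_with_retries(send: Callable[[], Response],
                      max_retries: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status, retry_after = send()
        if status not in RETRYABLE or attempt >= max_retries:
            return status
        # Honor the server's hint when present; otherwise back off exponentially.
        if retry_after is not None:
            delay = retry_after
        else:
            delay = base_delay * (2 ** attempt)
        sleep(delay)
        attempt += 1


# Usage: two retryable responses, then success. Sleeps are recorded, not slept.
responses = iter([(503, None), (429, 2.0), (200, None)])
slept = []
final_status = call_with_retries(lambda: next(responses), sleep=slept.append)
```

The loop assumes the operation being retried is idempotent or deduplicated on the server side.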

## Duplicate Requests

Duplicates arise from:
- Network retries
- Client timeouts with successful server processing
- Message queue redelivery
- User double-submit

Handling strategies:
- Idempotency keys (preferred for API operations)
- Deduplication at consumer level (for event processing)
- Natural idempotency (read operations, certain write patterns)
- Idempotency is covered in detail in the `idempotency-design` knowledge contract
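
For the API case, an idempotency key can be sketched as a lookup that returns the stored response instead of repeating the side effect. `PaymentService`, `charge`, and the response fields are hypothetical names for illustration only; the `idempotency-design` contract remains the reference for key scope, expiry, and concurrent requests, none of which this sketch handles.

```python
class PaymentService:
    """Toy service that executes each idempotency key at most once."""

    def __init__(self):
        # Hypothetical in-memory stores; a real service would persist both.
        self.responses_by_key = {}
        self.charges = []

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A repeated key returns the original response without charging again.
        if idempotency_key in self.responses_by_key:
            return self.responses_by_key[idempotency_key]
        self.charges.append(amount)
        response = {"charge_id": len(self.charges), "amount": amount}
        self.responses_by_key[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("key-1", 50)
# The client timed out and retried with the same key.
second = service.charge("key-1", 50)
```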

## Timeout vs Failure

### Timeout
- The operation may have succeeded; you just do not know
- Must be handled as "unknown state", not "failed state"
- Requires idempotency or state reconciliation

### Failure
- The operation definitively did not succeed
- Can be retried without risk of duplicating the effect, provided the error is retryable (see When NOT to Retry)

Design implications:
- Always distinguish between timeout and confirmed failure
- For timeouts, retry with idempotency or check state before retrying
- Define timeout values per operation type (short for interactive, long for batch)
- Document timeout values in API contracts
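
One way to keep the distinction explicit in code is a three-valued outcome rather than a success/failure boolean. The sketch below is an assumption-laden illustration: the `Outcome` enum, `classify`, `next_action`, and the action strings are hypothetical, and the mapping from exception types to outcomes would differ per client library.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"      # the operation definitively did not take effect
    UNKNOWN = "unknown"    # it may or may not have taken effect


def classify(exc: Exception) -> Outcome:
    """Map a client-side exception to an outcome (illustrative mapping only)."""
    if isinstance(exc, ConnectionRefusedError):
        # The connection was never established, so nothing was processed.
        return Outcome.FAILED
    # Timeouts and anything unrecognized are treated conservatively.
    return Outcome.UNKNOWN


def next_action(outcome: Outcome, is_idempotent: bool) -> str:
    """Decide what the caller may do next, assuming the error is retryable."""
    if outcome is Outcome.SUCCEEDED:
        return "done"
    if outcome is Outcome.FAILED:
        return "retry"
    # Unknown state: retry only if a duplicate is harmless, else reconcile.
    return "retry" if is_idempotent else "check_state_then_decide"
```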

## Partial Failure

Partial failure occurs when:
- A multi-step operation fails after some steps succeed
- A batch operation partially succeeds
- An upstream dependency fails mid-transaction

Handling strategies:
- Compensating transactions (saga pattern) for multi-service operations
- Partial success responses (207 Multi-Status for batch operations)
- Atomic operations where possible (single-service transactions)
- Outbox pattern for ensuring eventual consistency

Design principles:
- Define what "partial" means for each operation
- Define whether partial success is acceptable or must be fully rolled back
- Document recovery procedures for each partial failure scenario
- Map partial failure scenarios to PRD edge cases
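
The compensating-transaction idea can be sketched as a small runner that undoes completed steps in reverse order when a later step fails. The `run_saga` function, the `Step` tuple shape, and the step names are assumptions for illustration. The sketch does not cover compensations that themselves fail, or persisting saga state across crashes, both of which a real design must define.

```python
from typing import Callable, List, Tuple

# Hypothetical step shape: (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                log.append(f"compensated:{done_name}")
            return False, log
        completed.append((name, compensate))
        log.append(f"done:{name}")
    return True, log


def fail_shipping() -> None:
    raise RuntimeError("shipping service unavailable")


ok, log = run_saga([
    ("reserve_stock", lambda: None, lambda: None),
    ("charge_card", lambda: None, lambda: None),
    ("book_shipping", fail_shipping, lambda: None),
])
```

Here the third step fails, so the card charge and the stock reservation are compensated in that order, and the caller sees a full rollback rather than a partial success.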

## Eventual Consistency

Eventual consistency means:
- Updates propagate asynchronously
- Reads may return stale data until replicas converge; the staleness window is bounded only if the design enforces a bound
- All replicas eventually converge

When to use:
- Cross-service data synchronization
- Read replicas and caching
- Event-driven architectures
- High-write-volume scenarios where low latency matters more than immediate consistency

When NOT to use:
- Financial balances where immediate consistency is required
- Inventory counts where overselling is unacceptable
- Authorization decisions where stale permissions are harmful
- Any scenario the PRD marks as requiring strong consistency

Design implications:
- Define acceptable staleness bounds per data type
- Define how consumers detect and handle stale data
- Define convergence guarantees (time-bound, version-bound)
- Document which data is eventually consistent and which is strongly consistent
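
A version-bound approach to stale data can be sketched as follows. The `Replica` class and its `apply` and `read` methods are hypothetical names for the example. It assumes every update carries a monotonically increasing version number, which the source of truth would have to provide.

```python
class Replica:
    """Toy replica that converges by keeping only the newest version."""

    def __init__(self):
        self.value = None
        self.version = 0

    def apply(self, value, version: int) -> bool:
        """Apply an update only if it is newer than what the replica holds."""
        if version <= self.version:
            return False  # stale or duplicate update: ignore it
        self.value, self.version = value, version
        return True

    def read(self, min_version: int = 0):
        """Return the value, or raise if the replica is behind `min_version`."""
        if self.version < min_version:
            raise LookupError(
                f"replica at v{self.version}, caller needs v{min_version}"
            )
        return self.value


replica = Replica()
# Updates arrive out of order: v3 overtakes v2.
applied = [replica.apply("a", 1), replica.apply("c", 3), replica.apply("b", 2)]
```

A caller that has just written version 4 can pass `min_version=4` to detect that this replica has not caught up yet and fall back to another source.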

## Ordering Guarantees

### Per-Partition Ordering
- Messages within a single partition or queue are ordered
- Use when: operation sequence matters within a context (e.g., per user, per order)
- Ensure: partition key is set to the context identifier

### Global Ordering
- All messages across all partitions are ordered
- Use when: global sequence matters (rare)
- Trade-off: severely limits throughput and availability
- Avoid unless the PRD explicitly requires it

### No Ordering Guarantee
- Messages may arrive in any order
- Use when: operations are independent and order does not matter
- Ensure: consumers can handle out-of-order delivery

Define ordering guarantees per queue/topic:
- State the guarantee clearly
- Define the partition key if per-partition ordering is used
- Define how out-of-order delivery is handled when ordering is expected but not guaranteed
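
To illustrate how a partition key yields per-context ordering, here is a minimal sketch of key-to-partition routing. The `partition_for` function is an assumption for the example; real brokers ship their own partitioners, and this hash is not meant to match any of them. A stable hash is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# All events for one order land on one partition, so their order is preserved.
events = [("order-42", "created"), ("order-7", "created"), ("order-42", "paid")]
routed = [(partition_for(key, 8), key, event) for key, event in events]
```

Changing `num_partitions` remaps keys, so ordering across a repartition needs its own plan.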

## Anti-Patterns

- Assuming network calls never fail
- Retrying without idempotency
- Treating a timeout as a confirmed failure
- Ignoring partial failure scenarios
- Requiring global ordering when per-partition ordering is sufficient
- Using strong consistency when eventual consistency would suffice
- Using eventual consistency when the PRD requires strong consistency