opencode-workflow/skills/distributed-system-basics/SKILL.md

name: distributed-system-basics
description: Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns.

This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is designing for distributed system concerns.

Delivery Guarantees

At-Most-Once

  • Message may be lost but never delivered twice
  • Use when: loss is acceptable, retries are not, and throughput is the priority
  • Trade-off: simplicity and speed at the cost of reliability

At-Least-Once

  • Message is never lost but may be delivered more than once
  • Use when: loss is unacceptable, consumers are idempotent or can deduplicate
  • Trade-off: reliability at the cost of requiring idempotency handling
  • Most common default for production systems

Exactly-Once

  • Message is delivered once and only once
  • Use when: duplicates are harmful and idempotency is hard or impossible
  • Trade-off: significant complexity, performance overhead, and coordination cost
  • Often achieved via idempotency + at-least-once rather than true exactly-once protocol

Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
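
As an illustration of the "idempotency + at-least-once" approach mentioned under Exactly-Once, the following is a minimal sketch of a consumer that deduplicates by message ID. The class name `DedupingConsumer`, the `message_id` field, and the in-memory set are assumptions made for this example; a real system would keep processed IDs in a durable store and update it atomically with the side effect.

```python
from dataclasses import dataclass, field


@dataclass
class DedupingConsumer:
    """Applies each message's effect once, even if the broker redelivers it."""

    seen_ids: set = field(default_factory=set)  # hypothetical; durable in practice
    total: int = 0

    def handle(self, message_id: str, amount: int) -> bool:
        # A redelivered message carries an ID we have already processed.
        if message_id in self.seen_ids:
            return False
        self.total += amount
        self.seen_ids.add(message_id)
        return True


consumer = DedupingConsumer()
# Simulated at-least-once delivery: "m1" arrives twice.
for msg_id, amount in [("m1", 10), ("m2", 5), ("m1", 10)]:
    consumer.handle(msg_id, amount)
```

The broker still delivers "m1" twice, but the observable effect (the running total) reflects it only once.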

Retry Behavior

When to Retry

  • Transient network failures
  • Temporary resource unavailability (503, timeouts)
  • Rate limit exceeded (429, with backoff)
  • Upstream service failures (502, 504)

When NOT to Retry

  • Client errors (400, 401, 403, 404, 422); 429 is the exception and is retryable with backoff
  • Business rule violations
  • Malformed requests
  • Non-retryable error codes explicitly defined in the API contract
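
The two lists above can be encoded as a small allow-list. This is a sketch only: the function name `should_retry` is hypothetical, and treating every unlisted status as non-retryable is an assumption chosen so that retries are opt-in.

```python
# Statuses the "When to Retry" list marks as retryable.
RETRYABLE_STATUS = frozenset({429, 502, 503, 504})


def should_retry(status: int) -> bool:
    """Return True only for statuses explicitly listed as retryable."""
    # Anything not on the allow-list (400, 401, 403, 404, 422, and any
    # unlisted code) is treated as non-retryable.
    return status in RETRYABLE_STATUS
```

An API contract that defines its own non-retryable codes would override this default.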

Retry Strategy Parameters

  • Maximum retries: define per operation (typically 2-5)
  • Backoff strategy:
    • Fixed interval: predictable but may overwhelm recovering service
    • Exponential backoff: increasingly longer waits (recommended default)
    • Exponential backoff with jitter: adds randomness to avoid thundering herd
  • Retry budget: limit total retries per time window to prevent cascading failure
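
A minimal sketch of exponential backoff with jitter and a bounded retry count follows. The names `backoff_delay`, `call_with_retries`, and `TransientError`, as well as the default `base` and `cap` values, are assumptions for illustration. The jitter variant shown draws the delay uniformly between zero and the exponential ceiling.

```python
import random
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return source.uniform(0, ceiling)          # jitter spreads out retries


def call_with_retries(operation: Callable[[], object], max_retries: int = 3,
                      sleep: Callable[[float], None] = time.sleep):
    """Run `operation`, retrying on TransientError up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            sleep(backoff_delay(attempt))
```

Passing `sleep` as a parameter keeps the function testable without real waiting.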

Retry Anti-Patterns

  • Retrying non-idempotent operations without deduplication
  • Infinite retries without a circuit breaker
  • Synchronous retries that block the caller indefinitely
  • Ignoring Retry-After headers
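
To make the circuit breaker mentioned above concrete, here is a deliberately small sketch. The class name, the threshold and cool-down parameters, and the injected `clock` are all assumptions for this example; production breakers usually track more state.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calls after repeated failures, then allows a trial call later."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Open: block until the cool-down has elapsed, then permit a trial.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart the cool-down
```

A caller checks `allow()` before each attempt and reports the result afterwards, so retries stop hammering a dependency that is already failing.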

Duplicate Requests

Duplicates arise from:

  • Network retries
  • Client timeouts with successful server processing
  • Message queue redelivery
  • User double-submit

Handling strategies:

  • Idempotency keys (preferred for API operations)
  • Deduplication at consumer level (for event processing)
  • Natural idempotency (read operations, certain write patterns)
  • Idempotency is covered in detail in the idempotency-design knowledge contract
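
As a brief illustration of idempotency keys for API operations, the sketch below replays the stored response when the same key is seen again. `PaymentService`, `charge`, and the in-memory dictionary are hypothetical names for this example; the idempotency-design contract remains the reference for the full design.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._responses: dict = {}  # idempotency key -> first response
        self.charges: list = []     # side effects actually performed

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A repeated key means a retry or double-submit: replay the original
        # response instead of charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append(amount)
        response = {"charge_id": len(self.charges), "amount": amount}
        self._responses[idempotency_key] = response
        return response
```

A client that times out can resend the same request with the same key and receive the original result without a second charge.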

Timeout vs Failure

Timeout

  • The operation may have succeeded; you just do not know
  • Must be handled as "unknown state", not "failed state"
  • Requires idempotency or state reconciliation

Failure

  • The operation definitively did not succeed
  • Can be retried without risk of duplicating the operation, provided the error itself is retryable

Design implications:

  • Always distinguish between timeout and confirmed failure
  • For timeouts, retry with idempotency or check state before retrying
  • Define timeout values per operation type (short for interactive, long for batch)
  • Document timeout values in API contracts
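
One way to keep timeouts and confirmed failures apart in code is to model three outcomes rather than two. The sketch below assumes timeouts surface as Python's built-in `TimeoutError`; the `Outcome` enum, the function names, and the action strings are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"    # the operation definitively did not succeed
    UNKNOWN = "unknown"  # timed out: it may or may not have succeeded


def classify(error: Optional[BaseException]) -> Outcome:
    """Map the result of a call to one of three outcomes."""
    if error is None:
        return Outcome.SUCCEEDED
    if isinstance(error, TimeoutError):
        return Outcome.UNKNOWN
    return Outcome.FAILED


def next_action(outcome: Outcome, idempotent: bool) -> str:
    """Decide what the caller should do next."""
    if outcome is Outcome.SUCCEEDED:
        return "done"
    if outcome is Outcome.FAILED:
        return "retry"  # assumes the error is of a retryable kind
    # Unknown state: a blind retry is only safe when the operation is idempotent.
    return "retry" if idempotent else "check_state_then_retry"
```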

Partial Failure

Partial failure occurs when:

  • A multi-step operation fails after some steps succeed
  • A batch operation partially succeeds
  • An upstream dependency fails mid-transaction

Handling strategies:

  • Compensating transactions (saga pattern) for multi-service operations
  • Partial success responses (207 Multi-Status for batch operations)
  • Atomic operations where possible (single-service transactions)
  • Outbox pattern for ensuring eventual consistency

Design principles:

  • Define what "partial" means for each operation
  • Define whether partial success is acceptable or must be fully rolled back
  • Document recovery procedures for each partial failure scenario
  • Map partial failure scenarios to PRD edge cases
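
The compensating-transaction idea can be sketched as a list of (action, compensation) pairs, where a failure triggers the compensations of the completed steps in reverse order. The `run_saga` function is a hypothetical, in-process illustration; a real saga also has to persist its progress and cope with compensations that fail.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back only what actually succeeded, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

For example, if "reserve stock" and "charge card" succeed but "ship order" fails, the saga refunds the card and then releases the stock.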

Eventual Consistency

Eventual consistency means:

  • Updates propagate asynchronously
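  • Reads may return stale data until replicas converge; the staleness window is bounded only if the design sets a bound
  • All replicas eventually converge to the same state if no new updates arrive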

When to use:

  • Cross-service data synchronization
  • Read replicas and caching
  • Event-driven architectures
  • High-write scenarios where low latency matters more than immediate consistency

When NOT to use:

  • Financial balances where immediate consistency is required
  • Inventory counts where overselling is unacceptable
  • Authorization decisions where stale permissions are harmful
  • Any scenario the PRD marks as requiring strong consistency

Design implications:

  • Define acceptable staleness bounds per data type
  • Define how consumers detect and handle stale data
  • Define convergence guarantees (time-bound, version-bound)
  • Document which data is eventually consistent and which is strongly consistent
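
Staleness bounds and version-bound convergence can be checked at read time. The sketch below assumes each read carries a version number and the time the replica last applied an update; `VersionedRead`, its fields, and `is_fresh_enough` are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedRead:
    value: str
    version: int          # monotonically increasing per record
    replicated_at: float  # seconds; when the replica last applied an update


def is_fresh_enough(read: VersionedRead, now: float, max_staleness_s: float,
                    min_version: int = 0) -> bool:
    """True if the read is within the time bound and at least `min_version`."""
    within_time_bound = (now - read.replicated_at) <= max_staleness_s
    within_version_bound = read.version >= min_version
    return within_time_bound and within_version_bound
```

A consumer that has just written version 4 can pass `min_version=4` so that it rejects a replica still serving version 3 and falls back to a stronger read.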

Ordering Guarantees

Per-Partition Ordering

  • Messages within a single partition or queue are ordered
  • Use when: operation sequence matters within a context (e.g., per user, per order)
  • Ensure: partition key is set to the context identifier

Global Ordering

  • All messages across all partitions are ordered
  • Use when: global sequence matters (rare)
  • Trade-off: severely limits throughput and availability
  • Avoid unless the PRD explicitly requires it

No Ordering Guarantee

  • Messages may arrive in any order
  • Use when: operations are independent and order does not matter
  • Ensure: consumers can handle out-of-order delivery

Define ordering guarantees per queue/topic:

  • State the guarantee clearly
  • Define the partition key if per-partition ordering is used
  • Define how out-of-order delivery is handled when ordering is expected but not guaranteed
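
The sketch below illustrates two of the points above: deriving a partition from a context identifier, and tolerating out-of-order delivery by using per-key sequence numbers. The function `partition_for`, the class `LatestWinsConsumer`, and the assumption that producers attach a sequence number are all hypothetical. A stable hash is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a context identifier (e.g. an order ID) to a stable partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class LatestWinsConsumer:
    """Ignores updates that arrive after a newer one for the same key."""

    def __init__(self) -> None:
        self.last_seq: dict = {}
        self.state: dict = {}

    def apply(self, key: str, seq: int, value: str) -> bool:
        # Drop stale or duplicate updates instead of overwriting newer state.
        if seq <= self.last_seq.get(key, -1):
            return False
        self.last_seq[key] = seq
        self.state[key] = value
        return True
```

Dropping older updates suits "latest state wins" data; operations that must all be applied in sequence would need buffering and reordering instead.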

Anti-Patterns

  • Assuming network calls never fail
  • Retrying without idempotency
  • Treating a timeout as a confirmed failure
  • Ignoring partial failure scenarios
  • Requiring global ordering when per-partition ordering would suffice
  • Using strong consistency when eventual consistency would suffice
  • Using eventual consistency when the PRD requires strong consistency