---
name: distributed-system-basics
description: "Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing for distributed system concerns.

## Delivery Guarantees

### At-Most-Once

- Message may be lost but never delivered twice
- Use when: loss is acceptable, retries are not, throughput is priority
- Trade-off: simplicity and speed at the cost of reliability

### At-Least-Once

- Message is never lost but may be delivered more than once
- Use when: loss is unacceptable, consumers are idempotent or can deduplicate
- Trade-off: reliability at the cost of requiring idempotency handling
- Most common default for production systems

### Exactly-Once

- Message is delivered once and only once
- Use when: duplicates are harmful and idempotency is hard or impossible
- Trade-off: significant complexity, performance overhead, and coordination cost
- Often achieved via idempotency + at-least-once rather than true exactly-once protocol

Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
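The point that exactly-once is often achieved via idempotency plus at-least-once can be sketched as a consumer that remembers which message IDs it has already applied. This is a minimal in-memory illustration, not a production design: `Message`, `message_id`, and `DedupingConsumer` are hypothetical names, and a real system would need to persist the processed IDs atomically with the side effect.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # hypothetical unique ID assigned by the producer
    amount: int


@dataclass
class DedupingConsumer:
    """Applies each message's effect at most once, even if it is redelivered."""

    processed_ids: set = field(default_factory=set)
    balance: int = 0

    def handle(self, msg: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if msg.message_id in self.processed_ids:
            return False  # duplicate delivery: skip the side effect
        self.balance += msg.amount
        self.processed_ids.add(msg.message_id)
        return True


# Simulated at-least-once delivery: message "m1" arrives twice.
consumer = DedupingConsumer()
deliveries = [Message("m1", 100), Message("m2", 50), Message("m1", 100)]
results = [consumer.handle(m) for m in deliveries]
```

Here the redelivered `m1` is detected and skipped, so the balance reflects each message once even though the transport delivered it twice.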
## Retry Behavior

### When to Retry

- Transient network failures
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff)
- Upstream service failures (502, 504)

### When NOT to Retry

- Client errors (400, 401, 403, 404, 422)
- Business rule violations
- Malformed requests
- Non-retryable error codes explicitly defined in the API contract

### Retry Strategy Parameters

- Maximum retries: define per operation (typically 2-5)
- Backoff strategy:
  - Fixed interval: predictable but may overwhelm recovering service
  - Exponential backoff: increasingly longer waits (recommended default)
  - Exponential backoff with jitter: adds randomness to avoid thundering herd
- Retry budget: limit total retries per time window to prevent cascading failure

### Retry Anti-Patterns

- Retrying non-idempotent operations without deduplication
- Infinite retries without a circuit breaker
- Synchronous retries that block the caller indefinitely
- Ignoring `Retry-After` headers

## Duplicate Requests

Duplicates arise from:

- Network retries
- Client timeouts with successful server processing
- Message queue redelivery
- User double-submit

Handling strategies:

- Idempotency keys (preferred for API operations)
- Deduplication at consumer level (for event processing)
- Natural idempotency (read operations, certain write patterns)

Idempotency is covered in detail in the `idempotency-design` knowledge contract.

## Timeout vs Failure

### Timeout

- The operation may have succeeded; you just do not know
- Must be handled as "unknown state", not "failed state"
- Requires idempotency or state reconciliation

### Failure

- The operation definitively did not succeed
- Can be safely retried if the error is retryable (see Retry Behavior)

Design implications:

- Always distinguish between timeout and confirmed failure
- For timeouts, retry with idempotency or check state before retrying
- Define timeout values per operation type (short for interactive, long for batch)
- Document timeout values in API contracts

## Partial Failure

Partial failure occurs when:

- A multi-step operation fails after some steps succeed
- A batch operation partially succeeds
- An upstream dependency fails mid-transaction

Handling strategies:

- Compensating transactions (saga pattern) for multi-service operations
- Partial success responses (207 Multi-Status for batch operations)
- Atomic operations where possible (single-service transactions)
- Outbox pattern for ensuring eventual consistency

Design principles:

- Define what "partial" means for each operation
- Define whether partial success is acceptable or must be fully rolled back
- Document recovery procedures for each partial failure scenario
- Map partial failure scenarios to PRD edge cases

## Eventual Consistency

Eventual consistency means:

- Updates propagate asynchronously
- Reads may return stale data until replicas converge
- All replicas eventually converge

When to use:

- Cross-service data synchronization
- Read replicas and caching
- Event-driven architectures
- High-write-volume scenarios with low-latency requirements

When NOT to use:

- Financial balances where immediate consistency is required
- Inventory counts where overselling is unacceptable
- Authorization decisions where stale permissions are harmful
- Any scenario the PRD marks as requiring strong consistency

Design implications:

- Define acceptable staleness bounds per data type
- Define how consumers detect and handle stale data
- Define convergence guarantees (time-bound, version-bound)
- Document which data is eventually consistent and which is strongly consistent

## Ordering Guarantees

### Per-Partition Ordering

- Messages within a single partition or queue are ordered
- Use when: operation sequence matters within a context (e.g., per user, per order)
- Ensure: partition key is set to the context identifier

### Global Ordering

- All messages across all partitions are ordered
- Use when: global sequence matters (rare)
- Trade-off: severely limits throughput and availability
- Avoid unless the PRD explicitly requires it

### No Ordering Guarantee

- Messages may arrive in any order
- Use when: operations are independent and order does not matter
- Ensure: consumers can handle out-of-order delivery

Define ordering guarantees per queue/topic:

- State the guarantee clearly
- Define the partition key if per-partition ordering is used
- Define how out-of-order delivery is handled when ordering is expected but not guaranteed

## Anti-Patterns

- Assuming network calls never fail
- Retrying without idempotency
- Treating timeout as failure
- Ignoring partial failure scenarios
- Assuming global ordering when only per-partition ordering is needed
- Using strong consistency when eventual consistency would suffice
- Using eventual consistency when the PRD requires strong consistency
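Two of the anti-patterns above, retrying without idempotency and treating timeout as failure, can be avoided together. The following is a minimal sketch under stated assumptions: `TransientError`, `PermanentError`, `call_with_retry`, and `FakeServer` are hypothetical names for illustration, the built-in `TimeoutError` stands in for a client-side timeout, and the backoff values are arbitrary.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Confirmed failure that is safe to retry (e.g. 503)."""


class PermanentError(Exception):
    """Confirmed failure that must not be retried (e.g. 400, 422)."""


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retry(operation, max_retries=3, sleep=time.sleep):
    """Call operation(idempotency_key), retrying transient errors and timeouts.

    The same idempotency key is sent on every attempt, so a server that
    already processed a timed-out request can recognize the retry.
    """
    key = str(uuid.uuid4())
    for attempt in range(max_retries + 1):
        try:
            return operation(key)
        except PermanentError:
            raise  # client errors and business rule violations: never retry
        except (TransientError, TimeoutError):
            # A timeout is an unknown state; retrying is only safe here
            # because the idempotency key lets the server deduplicate.
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))


class FakeServer:
    """Hypothetical server: processes the first call, but the response is lost."""

    def __init__(self):
        self.results = {}  # idempotency key -> stored result
        self.charges = 0
        self.calls = 0

    def charge(self, key):
        self.calls += 1
        if key not in self.results:
            self.charges += 1
            self.results[key] = f"charge-{self.charges}"
            if self.calls == 1:
                raise TimeoutError("response lost")
        return self.results[key]


server = FakeServer()
result = call_with_retry(server.charge, sleep=lambda seconds: None)
```

In this run the first attempt times out after the server has already recorded the charge; the retry carries the same key, so the server returns the stored result instead of charging again.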