| name | description |
|---|---|
| distributed-system-basics | Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns. |
This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is designing for distributed system concerns.
Delivery Guarantees
At-Most-Once
- Message may be lost but never delivered twice
- Use when: loss is acceptable, retries are not, and throughput is the priority
- Trade-off: simplicity and speed at the cost of reliability
At-Least-Once
- Message is never lost but may be delivered more than once
- Use when: loss is unacceptable, consumers are idempotent or can deduplicate
- Trade-off: reliability at the cost of requiring idempotency handling
- Most common default for production systems
Exactly-Once
- Message is delivered once and only once
- Use when: duplicates are harmful and idempotency is hard or impossible
- Trade-off: significant complexity, performance overhead, and coordination cost
- Often achieved in effect by combining at-least-once delivery with idempotent processing rather than by a true exactly-once protocol
Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
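The at-least-once plus deduplication combination mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `DedupConsumer`, the message IDs, and the in-memory `seen_ids` set are all hypothetical, and a real consumer would persist seen IDs in the same transaction as the side effect.

```python
class DedupConsumer:
    """Applies each message ID once, even when delivery repeats (hypothetical sketch)."""

    def __init__(self):
        self.seen_ids = set()  # assumption: in-memory only; production needs durable storage
        self.total = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.total += amount
        self.seen_ids.add(message_id)
        return True


consumer = DedupConsumer()
# Simulated at-least-once delivery: message "m1" is redelivered.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
results = [consumer.handle(message_id, amount) for message_id, amount in deliveries]
```

The broker still delivers `m1` twice, but the effect on `total` is applied once, which is usually what "exactly-once" means in practice.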
Retry Behavior
When to Retry
- Transient network failures
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff)
- Upstream service failures (502, 504)
When NOT to Retry
- Client errors (400, 401, 403, 404, 422)
- Business rule violations
- Malformed requests
- Non-retryable error codes explicitly defined in the API contract
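The retry and no-retry lists above could be encoded as a small classifier. This is a sketch under assumptions: the function name `should_retry` and the `contract_non_retryable` parameter are hypothetical, and unknown status codes are treated as non-retryable by choice.

```python
# Status codes taken from the "When to Retry" list above.
RETRYABLE_STATUS = frozenset({429, 502, 503, 504})


def should_retry(status_code: int, contract_non_retryable: frozenset = frozenset()) -> bool:
    """Return True only for statuses listed as retryable and not excluded by the API contract."""
    if status_code in contract_non_retryable:
        return False  # the API contract explicitly forbids retrying this code
    return status_code in RETRYABLE_STATUS
```

Defaulting unknown codes to "do not retry" is the conservative option; a design may choose otherwise if the API contract says so.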
Retry Strategy Parameters
- Maximum retries: define per operation (typically 2-5)
- Backoff strategy:
  - Fixed interval: predictable but may overwhelm recovering service
  - Exponential backoff: increasingly longer waits (recommended default)
  - Exponential backoff with jitter: adds randomness to avoid thundering herd
- Retry budget: limit total retries per time window to prevent cascading failure
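A bounded retry loop with exponential backoff and jitter might look like the sketch below. The names `TransientError` and `call_with_retries` and the default values are illustrative assumptions, and the "full jitter" variant (a uniform draw between zero and the exponential ceiling) is one of several jitter schemes.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_retries=3, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=None):
    """Call operation(), retrying TransientError with exponential backoff and full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error to the caller
            # The ceiling doubles per attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng.uniform(0, ceiling))
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting. A retry budget would sit outside this function, for example as a shared counter per time window.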
Retry Anti-Patterns
- Retrying non-idempotent operations without deduplication
- Infinite retries without a circuit breaker
- Synchronous retries that block the caller indefinitely
- Ignoring Retry-After headers
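To illustrate the circuit breaker mentioned above, here is a deliberately small sketch. The class name, thresholds, and the single-trial "half-open" behavior are assumptions for illustration; production breakers usually add per-error filtering and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls fail fast (hypothetical name)."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before allowing a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers fail immediately instead of piling retries onto a struggling dependency.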
Duplicate Requests
Duplicates arise from:
- Network retries
- Client timeouts with successful server processing
- Message queue redelivery
- User double-submit
Handling strategies:
- Idempotency keys (preferred for API operations)
- Deduplication at consumer level (for event processing)
- Natural idempotency (read operations, certain write patterns)
Idempotency is covered in detail in the idempotency-design knowledge contract.
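As a brief illustration of idempotency keys for API operations, the sketch below stores the first response per key and replays it on duplicates. `PaymentService`, `create_charge`, and the in-memory dictionaries are hypothetical; a real service would persist keys with an expiry and guard against concurrent requests carrying the same key.

```python
class PaymentService:
    """Hypothetical service that replays the stored response for a repeated idempotency key."""

    def __init__(self):
        self.responses = {}  # idempotency key -> response of the first successful call
        self.charges = []

    def create_charge(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self.responses:
            # Duplicate request (retry, double-submit): return the original result.
            return self.responses[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount))
        response = {"charge_id": charge_id, "amount": amount}
        self.responses[idempotency_key] = response
        return response


service = PaymentService()
first = service.create_charge("key-1", 100)
second = service.create_charge("key-1", 100)  # e.g. a client retry after a timeout
```

Both calls return the same charge, and only one charge is recorded.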
Timeout vs Failure
Timeout
- The operation may have succeeded; you just do not know
- Must be handled as "unknown state" not "failed state"
- Requires idempotency or state reconciliation
Failure
- The operation definitively did not succeed
- Can be retried without risk of duplicating the effect, provided the error itself is retryable
Design implications:
- Always distinguish between timeout and confirmed failure
- For timeouts, retry with idempotency or check state before retrying
- Define timeout values per operation type (short for interactive, long for batch)
- Document timeout values in API contracts
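The "unknown state" handling described above can be sketched as a three-valued outcome plus a state check before any retry. `Outcome`, `OperationFailed`, `submit_once`, and the `lookup_state` callback are hypothetical names; the built-in `TimeoutError` stands in for whatever timeout signal the transport raises.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"  # timed out: the server may or may not have applied the operation


class OperationFailed(Exception):
    """Hypothetical error for a confirmed, definitive failure."""


def attempt(send) -> Outcome:
    try:
        send()
        return Outcome.SUCCEEDED
    except TimeoutError:
        return Outcome.UNKNOWN
    except OperationFailed:
        return Outcome.FAILED


def submit_once(send, lookup_state) -> Outcome:
    """Send once; on timeout, check server state before deciding whether to retry."""
    outcome = attempt(send)
    if outcome is Outcome.UNKNOWN:
        if lookup_state():
            return Outcome.SUCCEEDED  # the first attempt did land; do not resend
        outcome = attempt(send)
    return outcome
```

If no state lookup is available, the alternative named above applies: retry with an idempotency key so a second delivery is harmless.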
Partial Failure
Partial failure occurs when:
- A multi-step operation fails after some steps succeed
- A batch operation partially succeeds
- An upstream dependency fails mid-transaction
Handling strategies:
- Compensating transactions (saga pattern) for multi-service operations
- Partial success responses (207 Multi-Status for batch operations)
- Atomic operations where possible (single-service transactions)
- Outbox pattern for ensuring eventual consistency
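The compensating-transaction idea can be illustrated with a small in-process saga runner. This is a simplified sketch: `run_saga` and the step names are hypothetical, the steps are plain callables instead of remote calls, and compensations are assumed to succeed.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order; undo completed steps on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo in reverse order so the most recent step is compensated first.
            for _, undo in reversed(completed):
                undo()
            return {"status": "rolled_back", "failed_step": name}
        completed.append((name, compensate))
    return {"status": "completed", "failed_step": None}


log = []


def fail_shipping():
    raise RuntimeError("carrier unavailable")


result = run_saga([
    ("reserve_stock", lambda: log.append("reserve"), lambda: log.append("release")),
    ("charge_card", lambda: log.append("charge"), lambda: log.append("refund")),
    ("book_shipping", fail_shipping, lambda: log.append("cancel_shipping")),
])
```

Here the third step fails, so the card is refunded and the stock released, in that order. In a real multi-service saga the compensations can fail too, which is why recovery procedures need to be documented per scenario.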
Design principles:
- Define what "partial" means for each operation
- Define whether partial success is acceptable or must be fully rolled back
- Document recovery procedures for each partial failure scenario
- Map partial failure scenarios to PRD edge cases
Eventual Consistency
Eventual consistency means:
- Updates propagate asynchronously
- Reads may return stale data until replicas converge; the staleness window is bounded only if the design enforces a bound
- All replicas eventually converge
When to use:
- Cross-service data synchronization
- Read replicas and caching
- Event-driven architectures
- High-write, low-latency-requirement scenarios
When NOT to use:
- Financial balances where immediate consistency is required
- Inventory counts where overselling is unacceptable
- Authorization decisions where stale permissions are harmful
- Any scenario the PRD marks as requiring strong consistency
Design implications:
- Define acceptable staleness bounds per data type
- Define how consumers detect and handle stale data
- Define convergence guarantees (time-bound, version-bound)
- Document which data is eventually consistent and which is strongly consistent
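One way a consumer might detect stale data is a version-bound check: the reader remembers the lowest version it will accept (for example, the version it last wrote) and rejects older replicas. The names `VersionedValue` and `read_fresh_enough` are hypothetical, and monotonically increasing integer versions are an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VersionedValue:
    value: str
    version: int  # assumption: versions increase monotonically per record


def read_fresh_enough(replicas: Sequence[VersionedValue],
                      min_version: int) -> Optional[VersionedValue]:
    """Return the first replica value at or above min_version, or None if all are stale."""
    for candidate in replicas:
        if candidate.version >= min_version:
            return candidate
    return None


stale = VersionedValue("old address", version=3)
fresh = VersionedValue("new address", version=4)
```

A `None` result tells the caller that every reachable replica is behind, so it can wait, fall back to the primary, or surface the staleness, depending on what the design defines.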
Ordering Guarantees
Per-Partition Ordering
- Messages within a single partition or queue are ordered
- Use when: operation sequence matters within a context (e.g., per user, per order)
- Ensure: partition key is set to the context identifier
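Routing by partition key can be sketched with a stable hash, so that every message for the same context lands on the same partition. `partition_for` is a hypothetical helper, not any broker's actual partitioner; SHA-256 is used only because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a context identifier (e.g. a user or order ID) to a stable partition index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Changing `num_partitions` remaps keys, so ordering for a given key holds only while the partition count stays fixed.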
Global Ordering
- All messages across all partitions are ordered
- Use when: global sequence matters (rare)
- Trade-off: severely limits throughput and availability
- Avoid unless the PRD explicitly requires it
No Ordering Guarantee
- Messages may arrive in any order
- Use when: operations are independent and order does not matter
- Ensure: consumers can handle out-of-order delivery
Define ordering guarantees per queue/topic:
- State the guarantee clearly
- Define the partition key if per-partition ordering is used
- Define how out-of-order delivery is handled when ordering is expected but not guaranteed
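Where ordering is expected but not guaranteed, one possible handling is a per-key reorder buffer driven by producer-assigned sequence numbers. The class below is a sketch with assumed names; it assumes sequences start at 1 with no permanent gaps, and a real consumer would bound the buffer and time out on missing messages.

```python
class SequencedConsumer:
    """Applies messages in sequence order, buffering any that arrive early (hypothetical)."""

    def __init__(self):
        self.next_seq = 1   # assumption: the producer numbers messages from 1
        self.pending = {}   # seq -> payload for messages that arrived ahead of order
        self.applied = []

    def receive(self, seq: int, payload: str) -> None:
        if seq < self.next_seq:
            return  # already applied: treat as a duplicate and drop it
        self.pending[seq] = payload
        # Drain every consecutive message that is now ready.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


consumer_in_order = SequencedConsumer()
for seq, payload in [(2, "b"), (1, "a"), (3, "c"), (2, "b")]:
    consumer_in_order.receive(seq, payload)
```

Messages arrive as 2, 1, 3 with a late duplicate of 2, yet they are applied as a, b, c exactly once each.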
Anti-Patterns
- Assuming network calls never fail
- Retrying without idempotency
- Treating a timeout as a confirmed failure instead of an unknown state
- Ignoring partial failure scenarios
- Assuming global ordering when only per-partition ordering is needed
- Using strong consistency when eventual consistency would suffice
- Using eventual consistency when the PRD requires strong consistency