---
name: distributed-system-basics
description: "Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing for distributed system concerns.

## Delivery Guarantees

### At-Most-Once
- A message may be lost but is never delivered more than once
- Use when: loss is acceptable, retries are not, and throughput is the priority
- Trade-off: simplicity and speed at the cost of reliability

### At-Least-Once
- A message is never lost but may be delivered more than once
- Use when: loss is unacceptable and consumers are idempotent or can deduplicate
- Trade-off: reliability at the cost of requiring idempotency handling
- Most common default for production systems

### Exactly-Once
- A message is delivered and processed once and only once
- Use when: duplicates are harmful and idempotency is hard or impossible
- Trade-off: significant complexity, performance overhead, and coordination cost
- Often achieved via idempotency + at-least-once rather than a true exactly-once protocol

Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
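
The point that exactly-once is often built from at-least-once delivery plus idempotency can be sketched as a deduplicating consumer. This is a minimal in-memory illustration under stated assumptions: `DedupingConsumer`, `message_id`, and the handler are hypothetical names, and a real consumer would keep the processed-ID set in durable storage, updated atomically with the side effect.

```python
from typing import Callable


class DedupingConsumer:
    """Turns at-least-once delivery into effectively-once processing.

    Hypothetical sketch: processed IDs live in memory here; a real consumer
    would persist them in the same transaction as the side effect.
    """

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._processed_ids: set[str] = set()

    def on_message(self, message_id: str, payload: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._processed_ids:
            return False  # redelivery: acknowledge without repeating the effect
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failure before
        # this line leads to a redelivery rather than a lost message.
        self._processed_ids.add(message_id)
        return True


applied: list[dict] = []
consumer = DedupingConsumer(applied.append)
first = consumer.on_message("msg-1", {"amount": 10})
second = consumer.on_message("msg-1", {"amount": 10})  # simulated broker redelivery
```

The broker may still deliver `msg-1` twice; only the effect is applied once, which is usually what a PRD means by "exactly-once".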

## Retry Behavior

### When to Retry
- Transient network failures
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff)
- Upstream service failures (502, 504)

### When NOT to Retry
- Client errors (400, 401, 403, 404, 422)
- Business rule violations
- Malformed requests
- Non-retryable error codes explicitly defined in the API contract
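
The two lists above can be folded into a single decision function. The following is a sketch, not a prescribed API: `should_retry`, `error_code`, and `non_retryable_codes` are hypothetical names, and treating every unlisted status as final is a conservative assumption of this example.

```python
from typing import FrozenSet, Optional

# Retryable status codes taken from the "When to Retry" list.
RETRYABLE_STATUS: FrozenSet[int] = frozenset({429, 502, 503, 504})


def should_retry(
    status: Optional[int],
    error_code: Optional[str] = None,
    non_retryable_codes: FrozenSet[str] = frozenset(),
) -> bool:
    """Decide whether a failed call is worth retrying.

    status is None when no HTTP response arrived (treated here as a
    transient network failure). error_code and non_retryable_codes are
    hypothetical stand-ins for codes defined in an API contract.
    """
    if error_code is not None and error_code in non_retryable_codes:
        return False  # the contract explicitly forbids retrying
    if status is None:
        return True  # transport-level failure: no response arrived
    # Anything not explicitly retryable (400, 401, 403, 404, 422, ...) is final.
    return status in RETRYABLE_STATUS
```

Keeping the classification in one place makes it easier to review against the API contract than scattering status checks across call sites.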

### Retry Strategy Parameters
- Maximum retries: define per operation (typically 2-5)
- Backoff strategy:
  - Fixed interval: predictable but may overwhelm a recovering service
  - Exponential backoff: increasingly longer waits (recommended default)
  - Exponential backoff with jitter: adds randomness to avoid a thundering herd
- Retry budget: limit total retries per time window to prevent cascading failure
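
As an illustration of these parameters, here is a minimal sketch of bounded retries with exponential backoff and full jitter. The names `TransientError`, `backoff_delay`, and `call_with_retries`, as well as the default `base` and `cap` values, are assumptions made for this example rather than recommended settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying TransientError at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    """Fails twice, then succeeds (stands in for a recovering dependency)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary")
    return "ok"


delays: list[float] = []
result = call_with_retries(flaky, max_retries=3, sleep=delays.append)
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting. A retry budget would sit above this function and is not shown.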

### Retry Anti-Patterns
- Retrying non-idempotent operations without deduplication
- Infinite retries without a circuit breaker
- Synchronous retries that block the caller indefinitely
- Ignoring Retry-After headers
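
To make the circuit breaker mentioned above concrete, the following is a simplified, single-threaded sketch. `CircuitBreaker`, `CircuitOpenError`, and the threshold and timeout defaults are hypothetical; production breakers usually also limit concurrent trial calls in the half-open state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let a trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a retry loop in a breaker like this bounds the total load sent to a failing dependency, which a per-call retry limit alone does not.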

## Duplicate Requests

Duplicates arise from:
- Network retries
- Client timeouts where the server actually processed the request
- Message queue redelivery
- User double-submit

Handling strategies:
- Idempotency keys (preferred for API operations)
- Deduplication at consumer level (for event processing)
- Natural idempotency (read operations, certain write patterns)

Idempotency is covered in detail in the `idempotency-design` knowledge contract.
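
As a brief illustration of the idempotency-key strategy, here is a sketch in which the first response for each key is stored and replayed. `IdempotentEndpoint` and `create_order` are hypothetical, and the store is in memory with no expiry or concurrency control; the `idempotency-design` contract remains the place for the full design.

```python
from typing import Callable


class IdempotentEndpoint:
    """Replays the stored response when the same idempotency key is seen again."""

    def __init__(self, create: Callable[[dict], dict]) -> None:
        self._create = create
        self._responses: dict[str, dict] = {}

    def handle(self, idempotency_key: str, request: dict) -> dict:
        if idempotency_key in self._responses:
            # Duplicate request: return the original result, do no new work.
            return self._responses[idempotency_key]
        response = self._create(request)
        self._responses[idempotency_key] = response
        return response


created_orders: list[dict] = []


def create_order(request: dict) -> dict:
    """Hypothetical side-effecting operation."""
    created_orders.append(request)
    return {"order_id": len(created_orders), "item": request["item"]}


endpoint = IdempotentEndpoint(create_order)
original = endpoint.handle("key-abc", {"item": "book"})
replayed = endpoint.handle("key-abc", {"item": "book"})  # client retry or double-submit
```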

## Timeout vs Failure

### Timeout
- The operation may have succeeded; you just do not know
- Must be handled as "unknown state", not "failed state"
- Requires idempotency or state reconciliation

### Failure
- The operation definitively did not succeed
- Can be safely retried if the error is retryable

Design implications:
- Always distinguish between timeout and confirmed failure
- For timeouts, retry with idempotency or check state before retrying
- Define timeout values per operation type (short for interactive, long for batch)
- Document timeout values in API contracts
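
One way to keep the timeout and failure cases apart in code is a three-valued outcome. The sketch below assumes a hypothetical `RejectedError` for confirmed failures and a hypothetical `was_applied` callback that checks server state; neither comes from a real library.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"    # confirmed: nothing was applied
    UNKNOWN = "unknown"  # timed out: the server may or may not have applied it


class RejectedError(Exception):
    """Hypothetical: the server explicitly reported that nothing was applied."""


def attempt(operation: Callable[[], None]) -> Outcome:
    """Classify one call without collapsing a timeout into a failure."""
    try:
        operation()
    except TimeoutError:
        return Outcome.UNKNOWN
    except RejectedError:
        return Outcome.FAILED
    return Outcome.SUCCEEDED


def resolve(outcome: Outcome, was_applied: Callable[[], bool]) -> Outcome:
    """Resolve UNKNOWN by checking server state before deciding to retry."""
    if outcome is not Outcome.UNKNOWN:
        return outcome
    return Outcome.SUCCEEDED if was_applied() else Outcome.FAILED
```

Only a resolved `FAILED` should feed a plain retry; an `UNKNOWN` that cannot be resolved needs an idempotent retry instead.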

## Partial Failure

Partial failure occurs when:
- A multi-step operation fails after some steps succeed
- A batch operation partially succeeds
- An upstream dependency fails mid-transaction

Handling strategies:
- Compensating transactions (saga pattern) for multi-service operations
- Partial success responses (207 Multi-Status for batch operations)
- Atomic operations where possible (single-service transactions)
- Outbox pattern for ensuring eventual consistency

Design principles:
- Define what "partial" means for each operation
- Define whether partial success is acceptable or must be fully rolled back
- Document recovery procedures for each partial failure scenario
- Map partial failure scenarios to PRD edge cases
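
The compensating-transaction strategy can be sketched as a list of (action, compensation) pairs that are undone in reverse order when a later step fails. `run_saga` and the step names are hypothetical. The sketch assumes compensations always succeed, whereas a real saga would persist its progress and retry failed compensations.

```python
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: list[Step]) -> bool:
    """Run each action in order; on failure, undo completed steps in reverse."""
    completed: list[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log: list[str] = []


def fail_shipping() -> None:
    raise RuntimeError("shipping service unavailable")


ok = run_saga([
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (fail_shipping, lambda: log.append("cancel shipment")),
])
# log is now: reserve stock, charge card, refund card, release stock
```

The failed step's own compensation is not run, because its action never completed. Whether that holds for a given operation is part of defining what "partial" means.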

## Eventual Consistency

Eventual consistency means:
- Updates propagate asynchronously
- Reads may return stale data until replicas converge
- All replicas eventually converge if no new updates are made

When to use:
- Cross-service data synchronization
- Read replicas and caching
- Event-driven architectures
- High-write-throughput scenarios that need low latency

When NOT to use:
- Financial balances where immediate consistency is required
- Inventory counts where overselling is unacceptable
- Authorization decisions where stale permissions are harmful
- Any scenario the PRD marks as requiring strong consistency

Design implications:
- Define acceptable staleness bounds per data type
- Define how consumers detect and handle stale data
- Define convergence guarantees (time-bound, version-bound)
- Document which data is eventually consistent and which is strongly consistent
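
A consumer-side staleness check is one way to express time-bound and version-bound guarantees. In this sketch, `ReplicaRead`, `synced_at`, and `min_version` are hypothetical: it assumes the replica reports when it last synced and which record version it holds, which not every datastore exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaRead:
    value: str
    version: int      # version of the record as seen by the replica
    synced_at: float  # time (seconds) the replica last synced with the primary


def is_acceptable(
    read: ReplicaRead,
    now: float,
    max_staleness_s: float,
    min_version: int = 0,
) -> bool:
    """Accept a read only if it meets both the time bound and the version bound."""
    fresh_enough = (now - read.synced_at) <= max_staleness_s
    new_enough = read.version >= min_version
    return fresh_enough and new_enough


read = ReplicaRead(value="shipped", version=7, synced_at=100.0)
within_bound = is_acceptable(read, now=103.0, max_staleness_s=5.0)
too_old = is_acceptable(read, now=110.0, max_staleness_s=5.0)
# A client that just wrote version 8 can demand to read its own write:
missing_own_write = is_acceptable(read, now=103.0, max_staleness_s=5.0, min_version=8)
```

When a read is rejected, the caller might fall back to the primary or retry later; which one applies is a per-data-type design decision.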

## Ordering Guarantees

### Per-Partition Ordering
- Messages within a single partition or queue are ordered
- Use when: operation sequence matters within a context (e.g., per user, per order)
- Ensure: partition key is set to the context identifier

### Global Ordering
- All messages across all partitions are ordered
- Use when: global sequence matters (rare)
- Trade-off: severely limits throughput and availability
- Avoid unless the PRD explicitly requires it

### No Ordering Guarantee
- Messages may arrive in any order
- Use when: operations are independent and order does not matter
- Ensure: consumers can handle out-of-order delivery

Define ordering guarantees per queue/topic:
- State the guarantee clearly
- Define the partition key if per-partition ordering is used
- Define how out-of-order delivery is handled when ordering is expected but not guaranteed
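
Two pieces of the checklist above lend themselves to a short sketch: deriving a stable partition from the context identifier, and detecting out-of-order events with a per-key sequence number. `partition_for` and `SequenceGuard` are hypothetical names. The guard simply drops stale events, which is only one possible policy, and it does not detect gaps.

```python
import hashlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a context identifier (e.g. an order ID) to a stable partition.

    Uses SHA-256 rather than the built-in hash(), whose string hashes
    vary between Python processes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class SequenceGuard:
    """Rejects stale or duplicate events using a per-key sequence number."""

    def __init__(self) -> None:
        self._last_seen: dict[str, int] = {}

    def accept(self, key: str, sequence: int) -> bool:
        if sequence <= self._last_seen.get(key, -1):
            return False  # not newer than what was already applied
        self._last_seen[key] = sequence
        return True


guard = SequenceGuard()
results = [
    guard.accept("order-42", 1),
    guard.accept("order-42", 3),
    guard.accept("order-42", 2),  # arrives late, after sequence 3
    guard.accept("order-7", 0),   # different key, tracked independently
]
```

Because every event for `order-42` maps to the same partition, per-partition ordering covers the whole order. The guard is a fallback for paths where ordering is expected but not guaranteed.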

## Anti-Patterns

- Assuming network calls never fail
- Retrying without idempotency
- Treating a timeout as a confirmed failure
- Ignoring partial failure scenarios
- Assuming global ordering when only per-partition ordering is needed
- Using strong consistency when eventual consistency would suffice
- Using eventual consistency when the PRD requires strong consistency