---
name: async-queue-design
description: "Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing asynchronous workflows.

## Core Principle

Asynchronous processing must be justified by a PRD requirement. Do not make an operation asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or non-functional requirement (NFR).

## When to Use Async

Use async when:
- The operation is long-running and cannot complete within the caller's timeout
- The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
- Multiple consumers need to react to the same event
- Throughput requirements exceed synchronous processing capacity
- Decoupling producer and consumer is architecturally necessary (see `system-decomposition`)
- The PRD requires eventual consistency across service boundaries

Do NOT use async when:
- The operation is fast enough for synchronous handling
- The caller needs an immediate result
- The system is simple enough that direct calls suffice
- Async adds complexity without a corresponding PRD requirement

## Queue/Topic Design

For each queue or topic, define:
- Name and purpose (traced to PRD requirement)
- Producer service(s)
- Consumer service(s)
- Message schema (payload format, headers, metadata)
- Ordering guarantee (per-partition ordered, unordered, or globally ordered)
- Delivery guarantee (at-least-once by default; exactly-once only for messages that require it)
- Retention policy (how long messages are kept)

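The checklist above can be captured as one record per queue or topic so that gaps show up in review. The following is a minimal sketch, not a prescribed format: `QueueSpec`, its field names, and the example values (`order-events`, `order-service`, and so on) are hypothetical, and the delivery field stands for the at-least-once / exactly-once choice in the list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Ordering(Enum):
    PER_PARTITION = "per-partition"
    UNORDERED = "unordered"


class Delivery(Enum):
    AT_LEAST_ONCE = "at-least-once"
    EXACTLY_ONCE = "exactly-once"


@dataclass(frozen=True)
class QueueSpec:
    """One record per queue or topic; all field names are illustrative."""
    name: str
    purpose: str
    prd_requirement: str          # requirement ID from the PRD
    producers: Tuple[str, ...]
    consumers: Tuple[str, ...]
    message_schema: str           # reference to the schema definition
    ordering: Ordering
    delivery: Delivery
    retention_hours: int

    def missing(self) -> List[str]:
        """Return the names of fields that were left empty or invalid."""
        gaps = [
            field_name
            for field_name in ("name", "purpose", "prd_requirement", "message_schema")
            if not getattr(self, field_name).strip()
        ]
        if not self.producers:
            gaps.append("producers")
        if not self.consumers:
            gaps.append("consumers")
        if self.retention_hours <= 0:
            gaps.append("retention_hours")
        return gaps


# Hypothetical example: the PRD trace is left blank on purpose.
spec = QueueSpec(
    name="order-events",
    purpose="Notify downstream services of order state changes",
    prd_requirement="",
    producers=("order-service",),
    consumers=("billing-service", "notification-service"),
    message_schema="order-event.v1",
    ordering=Ordering.PER_PARTITION,
    delivery=Delivery.AT_LEAST_ONCE,
    retention_hours=72,
)
print(spec.missing())
```

Here `missing()` reports `prd_requirement`, which is the kind of untraced queue the core principle is meant to catch.
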
### Topic vs Queue

Use a topic (pub/sub) when:
- Multiple independent consumers need the same event
- Consumers have different processing logic
- Adding new consumers should not require changes to the producer

Use a queue (point-to-point) when:
- Exactly one consumer should process each message
- Work distribution across instances of the same service is needed
- Ordering within a partition matters

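The difference in delivery semantics can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: `Topic` and `WorkQueue` are hypothetical classes, not a broker API, and round-robin dispatch stands in for whatever distribution a real broker uses.

```python
from itertools import cycle
from typing import Callable, List


class Topic:
    """Pub/sub: every subscriber receives its own copy of each event."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: str) -> None:
        for handler in self._subscribers:
            handler(event)


class WorkQueue:
    """Point-to-point: each message is handed to exactly one worker."""

    def __init__(self, workers: List[Callable[[str], None]]) -> None:
        self._workers = cycle(workers)  # round-robin is an assumption

    def send(self, message: str) -> None:
        next(self._workers)(message)


billing_seen: List[str] = []
email_seen: List[str] = []
topic = Topic()
topic.subscribe(billing_seen.append)
topic.subscribe(email_seen.append)
topic.publish("order.created")      # both subscribers see the event

worker_a: List[str] = []
worker_b: List[str] = []
work_queue = WorkQueue([worker_a.append, worker_b.append])
for job in ("job-1", "job-2", "job-3", "job-4"):
    work_queue.send(job)            # each job lands on one worker only
```

Adding a third subscriber to `topic` requires no change to the publishing code, which is the property the first list relies on.
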
### Message Schema

Define message schemas explicitly:
- Message type or event name
- Payload schema (with versioning strategy)
- Metadata headers (correlation ID, causation ID, timestamp, source)
- Schema evolution strategy (backward compatibility, versioning)

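One way to make these elements concrete is a message envelope that carries the metadata headers alongside a versioned payload. The sketch below is illustrative only: `Envelope`, `caused_by`, the field names, and the integer `schema_version` are assumptions, not a format this contract mandates.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    """Hypothetical envelope: metadata headers plus a versioned payload."""
    event_type: str
    schema_version: int
    source: str
    payload: dict
    correlation_id: str                   # shared by every message in one workflow
    causation_id: Optional[str] = None    # ID of the message that triggered this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def caused_by(parent: Envelope, event_type: str, schema_version: int,
              source: str, payload: dict) -> Envelope:
    """Create a follow-up message that keeps the parent's correlation ID."""
    return Envelope(
        event_type=event_type,
        schema_version=schema_version,
        source=source,
        payload=payload,
        correlation_id=parent.correlation_id,
        causation_id=parent.message_id,
    )


first = Envelope(
    event_type="order.created",
    schema_version=1,
    source="order-service",
    payload={"order_id": "o-1"},
    correlation_id="c-1",
)
second = caused_by(first, "payment.requested", 1, "payment-service",
                   {"order_id": "o-1"})
```

Keeping `schema_version` in the envelope rather than in the payload lets a consumer choose a parser before it touches the payload, which is one common way to support backward-compatible evolution.
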
## Retry Strategy

For each async operation, define:

### Retry Parameters
- Maximum retries: typically 3-5 for transient failures
- Backoff strategy:
  - Fixed interval: simple but may overwhelm recovering service
  - Exponential backoff: recommended default, increasingly longer waits
  - Exponential backoff with jitter: prevents thundering herd
- Retry budget: maximum concurrent retries per consumer to prevent cascading failure

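As an illustration of the backoff options above, the sketch below computes an exponential ceiling and then applies "full jitter" (a uniformly random delay between zero and that ceiling). The function names, the 1-second base, and the 60-second cap are assumptions chosen for the example, not values this contract prescribes.

```python
import random
from typing import List


def backoff_ceiling(attempt: int, base_seconds: float = 1.0,
                    cap_seconds: float = 60.0) -> float:
    """Exponential backoff without jitter: base * 2**attempt, capped."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    return min(cap_seconds, base_seconds * (2 ** attempt))


def backoff_with_jitter(attempt: int, base_seconds: float = 1.0,
                        cap_seconds: float = 60.0) -> float:
    """Full jitter: pick a random delay in [0, ceiling] so that many
    consumers retrying at once do not hit the recovering service together."""
    return random.uniform(0.0, backoff_ceiling(attempt, base_seconds, cap_seconds))


def retry_schedule(max_retries: int = 5) -> List[float]:
    """Delays (in seconds) to wait before each retry, first retry at index 0."""
    return [backoff_with_jitter(attempt) for attempt in range(max_retries)]


# Ceilings for the first five retries with the assumed defaults.
print([backoff_ceiling(attempt) for attempt in range(5)])
```

With the assumed defaults the ceilings are 1, 2, 4, 8, and 16 seconds; the jittered delays differ on every run, which is the point.
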
### What to Retry
- Transient network errors
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff and Retry-After header)
- Upstream service failures (502, 504)

### What NOT to Retry
- Business rule violations (non-retryable error codes)
- Malformed messages (bad schema, missing required fields)
- Permanent failures (authentication errors, not-found errors)
- Messages that have exceeded maximum retries (route to DLQ)

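The two lists above amount to a classification step that runs before any retry is scheduled. A minimal sketch, assuming failures surface as HTTP-style status codes or as a network-error flag; `next_action` and its return strings are hypothetical names.

```python
from typing import Optional

# Status codes the lists above treat as transient.
RETRYABLE_STATUS = frozenset({429, 502, 503, 504})


def is_transient(status_code: Optional[int], network_error: bool = False) -> bool:
    """True for failures worth retrying. A timeout or dropped connection is
    modelled as network_error=True with no status code."""
    if network_error:
        return True
    return status_code in RETRYABLE_STATUS


def next_action(status_code: Optional[int], retries_so_far: int,
                max_retries: int = 5, network_error: bool = False) -> str:
    """Return "retry" or "dead-letter" for a failed processing attempt."""
    if not is_transient(status_code, network_error):
        return "dead-letter"   # permanent failure: retrying cannot help
    if retries_so_far >= max_retries:
        return "dead-letter"   # retry limit exhausted
    return "retry"


print(next_action(503, retries_so_far=0))   # transient, retries remaining
print(next_action(404, retries_so_far=0))   # permanent failure
print(next_action(503, retries_so_far=5))   # limit reached
```

A 429 response would additionally honor the `Retry-After` header when choosing the delay; that part is omitted here to keep the sketch short.
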
## Dead-Letter Queue (DLQ) Strategy

For each queue/topic with retry, define:
- DLQ name (e.g., `{original-queue}.dlq`)
- Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
- DLQ message retention policy
- Alerting: when messages appear in DLQ, who is notified
- Recovery process: how DLQ messages are inspected, fixed, and reprocessed

DLQ design principles:
- Every retryable queue MUST have a DLQ
- DLQ messages must include original message, error details, and retry count
- DLQ must be monitored and alerted on; silent DLQs are a failure mode
- Recovery from DLQ may require manual intervention or a replay mechanism

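The requirement that DLQ messages carry the original message, error details, and retry count can be sketched as a small record builder. The function names, the record layout, and the reason strings are assumptions for illustration; only the `{original-queue}.dlq` naming pattern comes from the text above.

```python
from datetime import datetime, timezone

# Routing conditions named in the list above.
DLQ_REASONS = frozenset({"max-retries-exceeded", "permanent-failure", "poison-message"})


def dlq_name(queue_name: str) -> str:
    """Apply the {original-queue}.dlq naming convention."""
    return f"{queue_name}.dlq"


def build_dlq_record(queue_name: str, original_message: dict, error: Exception,
                     retry_count: int, reason: str) -> dict:
    """Wrap a failed message with everything needed to inspect and replay it."""
    if reason not in DLQ_REASONS:
        raise ValueError(f"unknown DLQ reason: {reason}")
    return {
        "dlq": dlq_name(queue_name),
        "original_queue": queue_name,
        "original_message": original_message,   # kept intact for replay
        "error": {"type": type(error).__name__, "detail": str(error)},
        "retry_count": retry_count,
        "reason": reason,
        "dead_lettered_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_dlq_record(
    queue_name="order-events",                   # hypothetical queue
    original_message={"order_id": "o-1"},
    error=TimeoutError("billing-service did not respond"),
    retry_count=5,
    reason="max-retries-exceeded",
)
```

Storing `original_queue` next to the untouched `original_message` is what makes a later replay mechanism straightforward: the record already says where the message should go back to.
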
## Ordering Guarantees

For each queue/topic, explicitly state the ordering guarantee:

- **Per-partition ordered**: Messages within the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
- **Unordered**: No ordering guarantee across messages. Use when operations are independent.
- **Globally ordered**: All messages are delivered in order. Avoid unless the PRD explicitly requires it (severely limits throughput).

If ordering is required:
- Define the partition key (e.g., `user_id`, `order_id`)
- Define how out-of-order delivery is handled when it occurs
- Define whether strict ordering or best-effort ordering is acceptable

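Per-partition ordering depends on every message for a given key landing on the same partition, and consumers still need a policy for messages that arrive late. The sketch below shows one possible combination: a stable hash for partition selection and a per-key sequence check that discards stale messages. `partition_for`, `SequenceGuard`, and the discard policy are assumptions; a real broker applies its own partitioner, and buffering or rejecting may suit some workloads better than discarding.

```python
import hashlib
from typing import Dict


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key (e.g. an order_id) to a partition index.
    A stable hash keeps the mapping identical across processes and restarts."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class SequenceGuard:
    """Accept a message only if its sequence number is newer than the last
    one applied for the same key; older messages are treated as stale."""

    def __init__(self) -> None:
        self._last_applied: Dict[str, int] = {}

    def accept(self, key: str, sequence: int) -> bool:
        if sequence <= self._last_applied.get(key, -1):
            return False
        self._last_applied[key] = sequence
        return True


guard = SequenceGuard()
print(guard.accept("order-1", 1))   # True: first message for this key
print(guard.accept("order-1", 3))   # True: newer than 1
print(guard.accept("order-1", 2))   # False: arrived after 3, so it is stale
print(guard.accept("order-2", 1))   # True: a different key is independent
```

Python's built-in `hash()` is avoided on purpose because string hashing is randomized per process by default, which would send the same key to different partitions on different hosts.
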
## Timeout Behavior

For each async operation, define:
- Processing timeout: maximum time a consumer may take to process a message
- Visibility timeout: how long a message is invisible to other consumers while being processed
- What happens on timeout:
  - Message is returned to the queue for retry (if below max retries)
  - Message is routed to DLQ (if max retries exceeded)
  - Alerting is triggered for operational visibility

Timeout design principles:
- Always set timeouts; no infinite waits
- Timeout values must be based on observed processing times, not guesses
- Document timeout values and adjust based on production metrics

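Two of the points above lend themselves to a short sketch: deriving a processing timeout from observed durations, and deciding what happens when a timeout fires. The helper names, the 99th-percentile choice, and the 1.5x headroom factor are assumptions for illustration; appropriate values depend on the workload.

```python
from typing import List


def suggest_processing_timeout(observed_seconds: List[float],
                               percentile: int = 99,
                               headroom: float = 1.5) -> float:
    """Nearest-rank percentile of observed processing times, plus headroom."""
    if not observed_seconds:
        raise ValueError("need at least one observation")
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must be between 1 and 100")
    ordered = sorted(observed_seconds)
    # Integer arithmetic for ceil(percentile * n / 100) avoids float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1] * headroom


def on_timeout(retries_so_far: int, max_retries: int) -> dict:
    """Decide the fate of a message whose processing timed out."""
    if retries_so_far < max_retries:
        action = "requeue"       # return to the queue for another attempt
    else:
        action = "dead-letter"   # retry limit reached: route to the DLQ
    return {"action": action, "alert": True}  # every timeout is alerted on


# 98 fast runs, one slow run, one outlier (made-up measurements).
samples = [1.0] * 98 + [2.0, 10.0]
print(suggest_processing_timeout(samples))
print(on_timeout(retries_so_far=2, max_retries=5))
print(on_timeout(retries_so_far=5, max_retries=5))
```

For the sample data the 99th-percentile value is 2.0 seconds, so the suggested timeout is 3.0 seconds; the single 10-second outlier does not drive the setting.
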
## Cancellation

Define whether async operations can be cancelled and how:
- Cancellation signal mechanism (cancel event, status field, cancel API)
- What happens to in-progress work when cancellation is received
- Whether cancellation is best-effort or guaranteed
- How cancellation is reflected in the operation status

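A best-effort, cooperative form of cancellation can be sketched with a shared flag that the worker checks between steps. This is one possible mechanism among those listed above, not a recommendation: `run_steps` and the status strings are hypothetical, and a `threading.Event` stands in for whatever signal (cancel event, status field, cancel API) the design chooses.

```python
import threading
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step], cancel: threading.Event) -> dict:
    """Run steps in order, checking for cancellation before each one.
    A step that has already started is allowed to finish (best-effort)."""
    completed: List[str] = []
    for name, step in steps:
        if cancel.is_set():
            return {"status": "cancelled", "completed": completed}
        step()
        completed.append(name)
    return {"status": "succeeded", "completed": completed}


cancel_signal = threading.Event()
steps: List[Step] = [
    ("validate", lambda: None),
    ("charge", cancel_signal.set),   # simulates a cancel arriving mid-run
    ("notify", lambda: None),
]
result = run_steps(steps, cancel_signal)
print(result)
```

The returned `completed` list records which steps ran before the cancel took effect, which is the information a design needs in order to decide whether in-progress work must be compensated.
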
## Anti-Patterns

- Making operations async without a PRD requirement
- Not defining a DLQ for retryable queues
- Setting infinite timeouts or no timeouts
- Requiring global ordering when per-partition ordering suffices
- Not versioning message schemas
- Processing messages without idempotency (see `idempotency-design`)
- Ignoring backpressure when consumers are overwhelmed