opencode-workflow/skills/async-queue-design/SKILL.md

---
name: async-queue-design
description: "Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing asynchronous workflows.
## Core Principle
Asynchronous processing must be justified by a PRD requirement. Do not make operations asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or NFR.
## When to Use Async
Use async when:
- The operation is long-running and cannot complete within the caller's timeout
- The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
- Multiple consumers need to react to the same event
- Throughput requirements exceed synchronous processing capacity
- Decoupling producer and consumer is architecturally necessary (see `system-decomposition`)
- The PRD requires eventual consistency across service boundaries
Do NOT use async when:
- The operation is fast enough for synchronous handling
- The caller needs an immediate result
- The system is simple enough that direct calls suffice
- Async adds complexity without a corresponding PRD requirement
## Queue/Topic Design
For each queue or topic, define:
- Name and purpose (traced to PRD requirement)
- Producer service(s)
- Consumer service(s)
- Message schema (payload format, headers, metadata)
- Ordering guarantee (per-partition ordered, unordered)
- Delivery guarantee (at-least-once by default; exactly-once only where the PRD requires it for critical messages)
- Retention policy (how long messages are kept)
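
As a rough illustration, the checklist above can be captured as a small typed record so that a missing field is caught at design-review time. This is only a sketch: `QueueSpec`, its field names, and the `validate` rules are hypothetical and not part of any broker API.

```python
from dataclasses import dataclass
from enum import Enum


class Ordering(Enum):
    PER_PARTITION = "per-partition"
    UNORDERED = "unordered"


class Guarantee(Enum):
    AT_LEAST_ONCE = "at-least-once"
    EXACTLY_ONCE = "exactly-once"


@dataclass(frozen=True)
class QueueSpec:
    """Hypothetical design record for one queue or topic."""
    name: str
    purpose: str
    prd_requirement: str          # assumed PRD reference, e.g. "FR-12"
    producers: tuple
    consumers: tuple
    schema_ref: str               # pointer to the message schema definition
    ordering: Ordering
    guarantee: Guarantee
    retention_hours: int

    def validate(self) -> list:
        """Return a list of design gaps; an empty list means the spec is complete."""
        problems = []
        if not self.prd_requirement:
            problems.append("missing PRD requirement trace")
        if not self.producers:
            problems.append("no producer defined")
        if not self.consumers:
            problems.append("no consumer defined")
        if self.retention_hours <= 0:
            problems.append("retention must be positive")
        return problems
```

A spec that returns a non-empty list from `validate()` is not ready for review.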
### Topic vs Queue
Use a topic (pub/sub) when:
- Multiple independent consumers need the same event
- Consumers have different processing logic
- Adding new consumers should not require changes to the producer
Use a queue (point-to-point) when:
- Exactly one consumer should process each message
- Work distribution across instances of the same service is needed
- Ordering within a partition matters
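
The difference in delivery semantics can be shown with a toy in-memory model. `Topic` and `WorkQueue` below are hypothetical classes for illustration only; a real broker adds persistence, acknowledgements, and partitions.

```python
from collections import deque


class Topic:
    """Pub/sub: every subscriber receives its own copy of each message."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, name):
        self._subscribers[name] = deque()

    def publish(self, message):
        # The producer does not know who is subscribed.
        for inbox in self._subscribers.values():
            inbox.append(message)

    def poll(self, name):
        inbox = self._subscribers[name]
        return inbox.popleft() if inbox else None


class WorkQueue:
    """Point-to-point: each message is handed to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None
```

Adding a third subscriber to `Topic` requires no change to the code that calls `publish`, which is the property the topic criteria above rely on.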
### Message Schema
Define message schemas explicitly:
- Message type or event name
- Payload schema (with versioning strategy)
- Metadata headers (correlation ID, causation ID, timestamp, source)
- Schema evolution strategy (backward compatibility, versioning)
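
One possible shape for such a schema is a versioned envelope around the payload. The sketch below assumes JSON encoding and a "tolerant reader" evolution strategy (unknown headers are ignored, unsupported versions are rejected); `Envelope`, `SUPPORTED_VERSIONS`, and the field names are illustrative choices, not a prescribed format.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field, fields
from datetime import datetime, timezone
from typing import Optional

SUPPORTED_VERSIONS = {1, 2}  # assumed versions this consumer understands


@dataclass(frozen=True)
class Envelope:
    event_type: str
    schema_version: int
    payload: dict
    source: str
    correlation_id: str
    causation_id: Optional[str] = None
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def encode(envelope: Envelope) -> str:
    return json.dumps(asdict(envelope))


def decode(raw: str) -> Envelope:
    data = json.loads(raw)
    if data["schema_version"] not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema version {data['schema_version']}")
    # Tolerant reader: drop headers added by newer producers.
    known = {f.name for f in fields(Envelope)}
    return Envelope(**{k: v for k, v in data.items() if k in known})
```

Rejecting unsupported versions makes the message a candidate for the "malformed message" path rather than a silent misparse.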
## Retry Strategy
For each async operation, define:
### Retry Parameters
- Maximum retries: typically 3-5 for transient failures
- Backoff strategy:
  - Fixed interval: simple, but may overwhelm a recovering service
  - Exponential backoff: recommended default; the wait grows with each attempt
  - Exponential backoff with jitter: randomizes the wait to prevent a thundering herd
- Retry budget: maximum concurrent retries per consumer to prevent cascading failure
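
A minimal sketch of exponential backoff with jitter, using the common "full jitter" variant (a uniform random delay between zero and the exponential ceiling). The `base` and `cap` defaults are placeholder values, not recommendations from this contract.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng=random.random) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    drawn uniformly from [0, ceiling) so that consumers do not retry in
    lockstep. `rng` is injectable to make the function testable.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

A consumer would sleep for `backoff_delay(attempt)` between attempts and stop once the maximum retry count is reached.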
### What to Retry
- Transient network errors
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429; back off and honor the `Retry-After` header when present)
- Upstream service failures (502, 504)
### What NOT to Retry
- Business rule violations (non-retryable error codes)
- Malformed messages (bad schema, missing required fields)
- Permanent failures (authentication errors, not-found errors)
- Messages that have exceeded maximum retries (route to DLQ)
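
The two lists above can be sketched as a single decision function. This example assumes failures are reported as HTTP-style status codes, with `None` standing in for a transient network error that produced no response; `Action`, `RETRYABLE_STATUS`, and `decide` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    DEAD_LETTER = "dead-letter"


# Status codes named in the retry list above.
RETRYABLE_STATUS = {429, 502, 503, 504}


def decide(status: Optional[int], attempts_made: int,
           max_retries: int = 3) -> Action:
    """Choose between retrying and dead-lettering a failed message.

    Anything not explicitly retryable (for example 400, 401, 404) is treated
    as permanent and goes straight to the DLQ without consuming retries.
    """
    retryable = status is None or status in RETRYABLE_STATUS
    if retryable and attempts_made < max_retries:
        return Action.RETRY
    return Action.DEAD_LETTER
```

Defaulting unknown failures to the DLQ is a deliberate choice in this sketch: it avoids retrying an error that can never succeed.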
## Dead-Letter Queue (DLQ) Strategy
For each queue/topic with retry, define:
- DLQ name (e.g., `{original-queue}.dlq`)
- Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
- DLQ message retention policy
- Alerting: when messages appear in DLQ, who is notified
- Recovery process: how DLQ messages are inspected, fixed, and reprocessed
DLQ design principles:
- Every retryable queue MUST have a DLQ
- DLQ messages must include original message, error details, and retry count
- DLQ must be monitored and alerted on; silent DLQs are a failure mode
- Recovery from DLQ may require manual intervention or a replay mechanism
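
As an illustration of the second principle, a dead-letter record might wrap the original message together with the failure context. The `DeadLetter` fields below are an assumed layout; only the `{original-queue}.dlq` naming pattern comes from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


def dlq_name(queue: str) -> str:
    """Derive the DLQ name using the `{original-queue}.dlq` pattern."""
    return f"{queue}.dlq"


@dataclass(frozen=True)
class DeadLetter:
    original_queue: str
    original_message: bytes   # kept byte-for-byte so it can be replayed
    error_type: str
    error_detail: str
    retry_count: int
    failed_at: str


def to_dead_letter(queue: str, raw: bytes, error: Exception,
                   retry_count: int) -> DeadLetter:
    """Build the record to publish to the DLQ after the final failure."""
    return DeadLetter(
        original_queue=queue,
        original_message=raw,
        error_type=type(error).__name__,
        error_detail=str(error),
        retry_count=retry_count,
        failed_at=datetime.now(timezone.utc).isoformat(),
    )
```

Keeping the original bytes untouched means a replay tool can resubmit the message to `original_queue` after the root cause is fixed.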
## Ordering Guarantees
For each queue/topic, explicitly state the ordering guarantee:
- **Per-partition ordered**: Messages within the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
- **Unordered**: No ordering guarantee across messages. Use when operations are independent.
- **Globally ordered**: All messages are delivered in order. Avoid unless the PRD explicitly requires it (severely limits throughput).
If ordering is required:
- Define the partition key (e.g., `user_id`, `order_id`)
- Define how out-of-order delivery is handled when it occurs
- Define whether strict ordering or best-effort ordering is acceptable
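
A sketch of the first two points: a stable hash maps a partition key to a partition, and a per-key sequence check handles out-of-order or duplicate delivery. The sequence number is assumed to be assigned by the producer, and dropping stale messages is only one possible policy (buffering and reordering is another); `partition_for` and `SequenceGuard` are hypothetical names.

```python
import hashlib


def partition_for(key: str, partitions: int) -> int:
    """Map a partition key (e.g. a user_id) to a partition index.

    Uses SHA-256 rather than the built-in hash(), which is salted per
    process and would route the same key differently across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


class SequenceGuard:
    """Rejects messages that arrive at or behind the last seen sequence."""

    def __init__(self):
        self._last_seen = {}

    def accept(self, key: str, sequence: int) -> bool:
        if sequence <= self._last_seen.get(key, -1):
            return False  # stale or duplicate: skip it
        self._last_seen[key] = sequence
        return True
```

Because all messages for one key land on one partition, the guard only has to track state for keys that its own partition owns.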
## Timeout Behavior
For each async operation, define:
- Processing timeout: maximum time a consumer may take to process a message
- Visibility timeout: how long a message is invisible to other consumers while being processed; set it longer than the processing timeout so a message is not redelivered mid-processing
- What happens on timeout:
  - Message is returned to the queue for retry (if below max retries)
  - Message is routed to DLQ (if max retries exceeded)
  - Alerting is triggered for operational visibility
Timeout design principles:
- Always set timeouts; no infinite waits
- Timeout values must be based on observed processing times, not guesses
- Document timeout values and adjust based on production metrics
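
The timeout outcomes described above can be sketched with `asyncio.wait_for`. The outcome strings, the handler signature, and the two example handlers are assumptions for illustration; a real consumer would translate the outcome into the broker's own ack, requeue, or dead-letter call.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


def timeouts_consistent(processing_s: float, visibility_s: float) -> bool:
    """The visibility timeout should outlast the processing timeout."""
    return visibility_s > processing_s


async def process_with_timeout(handler, message, timeout_s: float,
                               attempts_made: int, max_retries: int) -> str:
    """Run `handler(message)` under a processing timeout.

    Returns "ack" on success, "requeue" on timeout while retries remain,
    and "dead-letter" once the retry limit has been reached.
    """
    try:
        await asyncio.wait_for(handler(message), timeout=timeout_s)
        return "ack"
    except asyncio.TimeoutError:
        # Stand-in for the alerting hook.
        log.warning("processing timed out after %.2fs", timeout_s)
        return "requeue" if attempts_made < max_retries else "dead-letter"


async def fast_handler(message):   # example handler: finishes immediately
    return None


async def slow_handler(message):   # example handler: exceeds short timeouts
    await asyncio.sleep(1)
```

For example, `asyncio.run(process_with_timeout(slow_handler, {}, 0.01, 0, 3))` yields a requeue decision, while the same call with `attempts_made=3` yields a dead-letter decision.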
## Cancellation
Define whether async operations can be cancelled and how:
- Cancellation signal mechanism (cancel event, status field, cancel API)
- What happens to in-progress work when cancellation is received
- Whether cancellation is best-effort or guaranteed
- How cancellation is reflected in the operation status
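
One way to answer these questions is cooperative, best-effort cancellation driven by a status field: the caller requests cancellation, and the worker checks the status between units of work. The `Status` values and the `Operation` class below are hypothetical, and the sketch assumes work can be split into discrete steps.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    CANCEL_REQUESTED = "cancel-requested"
    CANCELLED = "cancelled"
    COMPLETED = "completed"


class Operation:
    def __init__(self, total_steps: int):
        if total_steps < 1:
            raise ValueError("total_steps must be at least 1")
        self.status = Status.RUNNING
        self.completed_steps = 0
        self._total_steps = total_steps

    def request_cancel(self) -> None:
        """Best-effort: has no effect once the operation has finished."""
        if self.status is Status.RUNNING:
            self.status = Status.CANCEL_REQUESTED

    def run_step(self) -> bool:
        """Run one unit of work; return True while more work remains."""
        if self.status is Status.CANCEL_REQUESTED:
            # Already completed steps are kept; no further steps start.
            self.status = Status.CANCELLED
            return False
        if self.status is not Status.RUNNING:
            return False
        self.completed_steps += 1
        if self.completed_steps == self._total_steps:
            self.status = Status.COMPLETED
        return self.status is Status.RUNNING
```

In this model a step that has already started always finishes, so cancellation latency is bounded by the length of one step.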
## Anti-Patterns
- Making operations async without a PRD requirement
- Not defining a DLQ for retryable queues
- Setting infinite timeouts or no timeouts
- Assuming global ordering when per-partition ordering suffices
- Not versioning message schemas
- Processing messages without idempotency (see `idempotency-design`)
- Ignoring backpressure when consumers are overwhelmed
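
The last two anti-patterns can be illustrated together: a consumer that deduplicates by message ID and exposes a bounded buffer so the upstream can slow down when it is full. This is a sketch only; the in-memory `set` of seen IDs is an assumption that would need a durable store in practice, and `IdempotentConsumer` is a hypothetical class.

```python
import queue


class IdempotentConsumer:
    def __init__(self, handler, max_in_flight: int):
        self._handler = handler
        self._seen = set()                                # processed message IDs
        self._buffer = queue.Queue(maxsize=max_in_flight)  # bounded on purpose

    def offer(self, message_id: str, payload) -> bool:
        """Accept a message, or return False when the buffer is full.

        A False result is the backpressure signal: the caller should pause
        or slow down instead of piling up more work.
        """
        try:
            self._buffer.put_nowait((message_id, payload))
            return True
        except queue.Full:
            return False

    def drain(self) -> int:
        """Process buffered messages once each; return how many were handled."""
        handled = 0
        while True:
            try:
                message_id, payload = self._buffer.get_nowait()
            except queue.Empty:
                return handled
            if message_id in self._seen:
                continue  # duplicate delivery: skip without side effects
            self._handler(payload)
            self._seen.add(message_id)
            handled += 1
```

Recording the ID only after the handler succeeds means a crash mid-handler leads to a retry rather than a lost message, which matches at-least-once delivery.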