---
name: async-queue-design
description: "Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing asynchronous workflows.

## Core Principle

Asynchronous processing must be justified by a PRD requirement. Do not make operations asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or NFR.

## When to Use Async

Use async when:

- The operation is long-running and cannot complete within the caller's timeout
- The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
- Multiple consumers need to react to the same event
- Throughput requirements exceed synchronous processing capacity
- Decoupling producer and consumer is architecturally necessary (see `system-decomposition`)
- The PRD requires eventual consistency across service boundaries

Do NOT use async when:

- The operation is fast enough for synchronous handling
- The caller needs an immediate result
- The system is simple enough that direct calls suffice
- Async adds complexity without a corresponding PRD requirement

## Queue/Topic Design

For each queue or topic, define:

- Name and purpose (traced to PRD requirement)
- Producer service(s)
- Consumer service(s)
- Message schema (payload format, headers, metadata)
- Ordering guarantee (per-partition ordered, unordered)
- Durability guarantee (at-least-once, exactly-once for important messages)
- Retention policy (how long messages are kept)

### Topic vs Queue

Use a topic (pub/sub) when:

- Multiple independent consumers need the same event
- Consumers have different processing logic
- Adding new consumers should not require changes to the producer

Use a queue (point-to-point) when:

- Exactly one consumer should process each message
- Work distribution across instances of the same service is needed
- Ordering within a partition matters

### Message Schema

Define message schemas explicitly:

- Message type or event name
- Payload schema (with versioning strategy)
- Metadata headers (correlation ID, causation ID, timestamp, source)
- Schema evolution strategy (backward compatibility, versioning)

## Retry Strategy

For each async operation, define:

### Retry Parameters

- Maximum retries: typically 3-5 for transient failures
- Backoff strategy:
  - Fixed interval: simple but may overwhelm recovering service
  - Exponential backoff: recommended default, increasingly longer waits
  - Exponential backoff with jitter: prevents thundering herd
- Retry budget: maximum concurrent retries per consumer to prevent cascading failure

### What to Retry

- Transient network errors
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff and Retry-After header)
- Upstream service failures (502, 504)

### What NOT to Retry

- Business rule violations (non-retryable error codes)
- Malformed messages (bad schema, missing required fields)
- Permanent failures (authentication errors, not-found errors)
- Messages that have exceeded maximum retries (route to DLQ)

## Dead-Letter Queue (DLQ) Strategy

For each queue/topic with retry, define:

- DLQ name (e.g., `{original-queue}.dlq`)
- Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
- DLQ message retention policy
- Alerting: when messages appear in DLQ, who is notified
- Recovery process: how DLQ messages are inspected, fixed, and reprocessed

DLQ design principles:

- Every retryable queue MUST have a DLQ
- DLQ messages must include original message, error details, and retry count
- DLQ must be monitored and alerted on; silent DLQs are a failure mode
- Recovery from DLQ may require manual intervention or a replay mechanism

## Ordering Guarantees

For each queue/topic, explicitly state the ordering guarantee:

- **Per-partition ordered**: Messages within the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
- **Unordered**: No ordering guarantee across messages. Use when operations are independent.
- **Globally ordered**: All messages are delivered in order. Avoid unless the PRD explicitly requires it (severely limits throughput).

If ordering is required:

- Define the partition key (e.g., `user_id`, `order_id`)
- Define how out-of-order delivery is handled when it occurs
- Define whether strict ordering or best-effort ordering is acceptable

## Timeout Behavior

For each async operation, define:

- Processing timeout: maximum time a consumer may take to process a message
- Visibility timeout: how long a message is invisible to other consumers while being processed
- What happens on timeout:
  - Message is returned to the queue for retry (if below max retries)
  - Message is routed to DLQ (if max retries exceeded)
  - Alerting is triggered for operational visibility

Timeout design principles:

- Always set timeouts; no infinite waits
- Timeout values must be based on observed processing times, not guesses
- Document timeout values and adjust based on production metrics

## Cancellation

Define whether async operations can be cancelled and how:

- Cancellation signal mechanism (cancel event, status field, cancel API)
- What happens to in-progress work when cancellation is received
- Whether cancellation is best-effort or guaranteed
- How cancellation is reflected in the operation status

## Anti-Patterns

- Making operations async without a PRD requirement
- Not defining a DLQ for retryable queues
- Setting infinite timeouts or no timeouts
- Assuming global ordering when per-partition ordering suffices
- Not versioning message schemas
- Processing messages without idempotency (see `idempotency-design`)
- Ignoring backpressure when consumers are overwhelmed
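
As an illustration of how the Retry Strategy and Dead-Letter Queue (DLQ) Strategy rules fit together, the following is a minimal Python sketch, not a prescribed implementation. It assumes failures are reported as HTTP-style status codes, and every name in it (`Message`, `backoff_delay`, `next_action`, `MAX_RETRIES`, `BASE_DELAY_S`, `MAX_DELAY_S`) is hypothetical. The numeric values are placeholders; real values should come from the PRD and production metrics.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder values; the contract only says "typically 3-5" retries.
MAX_RETRIES = 5
BASE_DELAY_S = 1.0
MAX_DELAY_S = 60.0

# Status codes listed under "What to Retry" (assumes HTTP-style error reporting).
RETRYABLE_STATUS = {429, 502, 503, 504}


@dataclass
class Message:
    body: dict
    retry_count: int = 0  # number of retries already attempted


def backoff_delay(attempt: int) -> float:
    """Exponential backoff with full jitter for a 1-based retry attempt."""
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))
    # A random delay in [0, ceiling] spreads retries out (avoids thundering herd).
    return random.uniform(0.0, ceiling)


def next_action(
    msg: Message,
    status: int,
    queue: str,
    retry_after_s: Optional[float] = None,
) -> dict:
    """Decide whether a failed message is retried or routed to the DLQ."""
    retryable = status in RETRYABLE_STATUS
    if retryable and msg.retry_count < MAX_RETRIES:
        attempt = msg.retry_count + 1
        delay = backoff_delay(attempt)
        if retry_after_s is not None:
            # Never retry sooner than the upstream's Retry-After hint.
            delay = max(delay, retry_after_s)
        return {"action": "retry", "retry_count": attempt, "delay_s": delay}

    # Non-retryable status, or retries exhausted: route to the DLQ with the
    # original message, error details, and retry count.
    reason = "max_retries_exceeded" if retryable else "permanent_failure"
    return {
        "action": "dlq",
        "queue": f"{queue}.dlq",
        "original": msg.body,
        "error": {"status": status, "reason": reason},
        "retry_count": msg.retry_count,
    }
```

For example, `next_action(Message(body={"order_id": "o-1"}), 404, "orders")` returns a `dlq` action targeting `orders.dlq`, while the same call with status `503` returns a `retry` action with a jittered delay. The sketch leaves out the retry budget, non-HTTP failures such as processing timeouts, and DLQ alerting, all of which the contract still requires.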