opencode-workflow/skills/async-queue-design/SKILL.md

---
name: async-queue-design
description: "Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing asynchronous workflows.

## Core Principle

Asynchronous processing must be justified by a PRD requirement. Do not make operations asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or NFR.

## When to Use Async

Use async when:
- The operation is long-running and cannot complete within the caller's timeout
- The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
- Multiple consumers need to react to the same event
- Throughput requirements exceed synchronous processing capacity
- Decoupling producer and consumer is architecturally necessary (see `system-decomposition`)
- The PRD requires eventual consistency across service boundaries

Do NOT use async when:
- The operation is fast enough for synchronous handling
- The caller needs an immediate result
- The system is simple enough that direct calls suffice
- Async adds complexity without a corresponding PRD requirement

## Queue/Topic Design

For each queue or topic, define:
- Name and purpose (traced to PRD requirement)
- Producer service(s)
- Consumer service(s)
- Message schema (payload format, headers, metadata)
- Ordering guarantee (per-partition ordered, unordered)
- Delivery guarantee (at-least-once, or exactly-once for messages where duplicates are unacceptable)
- Retention policy (how long messages are kept)

### Topic vs Queue

Use a topic (pub/sub) when:
- Multiple independent consumers need the same event
- Consumers have different processing logic
- Adding new consumers should not require changes to the producer

Use a queue (point-to-point) when:
- Exactly one consumer should process each message
- Work distribution across instances of the same service is needed
- Ordering within a partition matters
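
The difference between the two delivery models can be illustrated with a small in-memory sketch. `Topic` and `WorkQueue` are hypothetical stand-ins for a real broker, not the API of any particular product: a topic hands every event to every subscriber, while a queue hands each message to exactly one receiver.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Optional


class Topic:
    """Pub/sub stand-in: every subscriber sees every published event."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Any], None]] = []

    def subscribe(self, handler: Callable[[Any], None]) -> None:
        # New consumers register themselves; the producer code does not change.
        self._subscribers.append(handler)

    def publish(self, event: Any) -> None:
        for handler in self._subscribers:
            handler(event)


class WorkQueue:
    """Point-to-point stand-in: each message is received by one consumer only."""

    def __init__(self) -> None:
        self._items: Deque[Any] = deque()

    def send(self, message: Any) -> None:
        self._items.append(message)

    def receive(self) -> Optional[Any]:
        # Competing consumers call receive(); a message is removed once taken.
        return self._items.popleft() if self._items else None
```

With two subscribers on a `Topic`, one `publish` call reaches both. With two workers polling a `WorkQueue`, two sent messages are split between them and neither is processed twice.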

### Message Schema

Define message schemas explicitly:
- Message type or event name
- Payload schema (with versioning strategy)
- Metadata headers (correlation ID, causation ID, timestamp, source)
- Schema evolution strategy (backward compatibility, versioning)
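
As a rough illustration of the items above, a message envelope might look like the following. The class name, field names, and the `decode` helper are assumptions made for this sketch; the contract itself does not prescribe a wire format.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class MessageEnvelope:
    """Hypothetical envelope: payload plus the metadata headers listed above."""

    event_type: str                      # message type or event name
    schema_version: int                  # version of the payload schema
    payload: Dict[str, Any]
    source: str                          # producing service
    correlation_id: str                  # shared by all messages in one flow
    causation_id: Optional[str] = None   # ID of the message that triggered this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def decode(raw: str, supported_versions: FrozenSet[int]) -> MessageEnvelope:
    """Parse a message and reject schema versions this consumer cannot handle."""
    data = json.loads(raw)
    if data["schema_version"] not in supported_versions:
        raise ValueError(f"unsupported schema_version {data['schema_version']}")
    return MessageEnvelope(**data)
```

Checking the version at the consumer boundary is one possible evolution strategy: an unsupported version fails fast as a malformed message instead of being half-processed.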

## Retry Strategy

For each async operation, define:

### Retry Parameters
- Maximum retries: typically 3-5 for transient failures
- Backoff strategy:
  - Fixed interval: simple, but may overwhelm a recovering service
  - Exponential backoff: recommended default; the wait grows with each attempt
  - Exponential backoff with jitter: randomizes each wait so consumers do not retry in lockstep (thundering herd)
- Retry budget: maximum concurrent retries per consumer to prevent cascading failure
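
A minimal sketch of exponential backoff with jitter follows. It uses the "full jitter" variant (a uniform random delay between zero and the exponential ceiling), which is only one of several common variants. The function names and the `base` and `cap` defaults are illustrative assumptions, not recommended values.

```python
import random
from typing import List


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng=random) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    The ceiling doubles with each attempt and is clamped to `cap`;
    the actual delay is drawn uniformly from [0, ceiling].
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(max_retries: int, base: float = 0.5, cap: float = 30.0,
                   rng=random) -> List[float]:
    """Delays for a full retry sequence, one entry per allowed retry."""
    return [backoff_delay(i, base, cap, rng) for i in range(max_retries)]
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.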

### What to Retry
- Transient network errors
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429; back off and honor the Retry-After header when present)
- Upstream service failures (502, 504)

### What NOT to Retry
- Business rule violations (non-retryable error codes)
- Malformed messages (bad schema, missing required fields)
- Permanent failures (authentication errors, not-found errors)
- Messages that have exceeded maximum retries (route to DLQ)
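
The two lists above can be folded into a single decision function. The sketch below is one possible shape; the exception classes, the `Decision` enum, and the choice to dead-letter unknown errors are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


class MalformedMessageError(Exception):
    """Hypothetical: bad schema or missing required fields."""


class BusinessRuleViolation(Exception):
    """Hypothetical: the request is valid but violates a business rule."""


# Status codes named in the retryable list above.
RETRYABLE_STATUS = frozenset({429, 502, 503, 504})


def classify(error: Exception, attempt: int, max_retries: int,
             status: Optional[int] = None) -> Decision:
    """Decide whether a failed message is retried or dead-lettered.

    `attempt` is the number of retries already used for this message.
    """
    if attempt >= max_retries:
        return Decision.DEAD_LETTER          # retries exhausted
    if isinstance(error, (MalformedMessageError, BusinessRuleViolation)):
        return Decision.DEAD_LETTER          # retrying cannot change the outcome
    if status is not None:
        if status in RETRYABLE_STATUS:
            return Decision.RETRY
        return Decision.DEAD_LETTER          # e.g. 401, 403, 404
    if isinstance(error, (TimeoutError, ConnectionError)):
        return Decision.RETRY                # transient network failure
    return Decision.DEAD_LETTER              # assumption: unknown errors are not retried
```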

## Dead-Letter Queue (DLQ) Strategy

For each queue/topic with retry, define:
- DLQ name (e.g., `{original-queue}.dlq`)
- Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
- DLQ message retention policy
- Alerting: when messages appear in DLQ, who is notified
- Recovery process: how DLQ messages are inspected, fixed, and reprocessed

DLQ design principles:
- Every retryable queue MUST have a DLQ
- DLQ messages must include original message, error details, and retry count
- DLQ must be monitored and alerted on; silent DLQs are a failure mode
- Recovery from DLQ may require manual intervention or a replay mechanism
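
As a sketch of these principles, the snippet below wraps a failed message with its error details and retry count, publishes it to the DLQ, and raises an alert. The `publish` and `alert` callables are placeholders for whatever broker client and alerting channel the system uses; they are assumptions, not a specific library's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


def dlq_name(queue: str) -> str:
    """Naming convention from the list above: `{original-queue}.dlq`."""
    return f"{queue}.dlq"


@dataclass(frozen=True)
class DeadLetter:
    original_queue: str
    original_message: bytes   # the untouched original message
    error_type: str
    error_detail: str
    retry_count: int
    failed_at: str            # ISO 8601 UTC timestamp


def dead_letter(queue: str, message: bytes, error: Exception, retry_count: int,
                publish: Callable[[str, DeadLetter], None],
                alert: Callable[[str], None]) -> DeadLetter:
    """Route a message to its DLQ and notify operators."""
    entry = DeadLetter(
        original_queue=queue,
        original_message=message,
        error_type=type(error).__name__,
        error_detail=str(error),
        retry_count=retry_count,
        failed_at=datetime.now(timezone.utc).isoformat(),
    )
    publish(dlq_name(queue), entry)
    alert(f"{dlq_name(queue)}: {entry.error_type} after {retry_count} retries")
    return entry
```

Keeping the original message bytes unchanged is what makes a later replay possible once the underlying fault is fixed.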

## Ordering Guarantees

For each queue/topic, explicitly state the ordering guarantee:

- **Per-partition ordered**: Messages with the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
- **Unordered**: No ordering guarantee across messages. Use when operations are independent.
- **Globally ordered**: All messages are delivered in order. Avoid unless the PRD explicitly requires it (severely limits throughput).

If ordering is required:
- Define the partition key (e.g., `user_id`, `order_id`)
- Define how out-of-order delivery is handled when it occurs
- Define whether strict ordering or best-effort ordering is acceptable
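
To make the partition-key idea concrete, the sketch below maps a key to a partition with a stable hash and shows one possible policy for out-of-order delivery: discard any message whose sequence number is not newer than the last one applied for that key. Both `partition_for` and `SequenceGuard` are hypothetical helpers, and the per-key sequence number is an assumption about the message schema.

```python
import hashlib
from typing import Dict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key (e.g. a user_id) to a partition index.

    A stable hash is used so every producer instance agrees on the mapping;
    Python's built-in hash() is randomized per process and would not.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class SequenceGuard:
    """Drops stale or duplicate messages based on a per-key sequence number."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def accept(self, key: str, sequence: int) -> bool:
        if sequence <= self._last_seen.get(key, -1):
            return False      # older than, or equal to, what was already applied
        self._last_seen[key] = sequence
        return True
```

Discarding stale messages suits state-replacing events. Where every event must be applied, buffering until the gap is filled would be the alternative policy.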

## Timeout Behavior

For each async operation, define:
- Processing timeout: maximum time a consumer may take to process a message
- Visibility timeout: how long a message is invisible to other consumers while being processed; set it longer than the processing timeout so an in-flight message is not redelivered
- What happens on timeout:
  - Message is returned to the queue for retry (if below max retries)
  - Message is routed to DLQ (if max retries exceeded)
  - Alerting is triggered for operational visibility

Timeout design principles:
- Always set timeouts; no infinite waits
- Timeout values must be based on observed processing times, not guesses
- Document timeout values and adjust based on production metrics
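
A processing timeout and its outcomes could be sketched with `asyncio.wait_for` as shown below. The function name, the string outcomes, and the `on_timeout` alert hook are illustrative assumptions; a real consumer would map these outcomes onto its broker's ack and requeue operations.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def process_with_timeout(
    handler: Callable[[bytes], Awaitable[None]],
    message: bytes,
    attempt: int,            # retries already used for this message
    max_retries: int,
    timeout_s: float,
    on_timeout: Optional[Callable[[bytes], None]] = None,
) -> str:
    """Run the handler under a processing timeout.

    Returns "ack" on success, "retry" when the message should go back to
    the queue, or "dead_letter" once the retries are exhausted.
    """
    try:
        await asyncio.wait_for(handler(message), timeout=timeout_s)
        return "ack"
    except asyncio.TimeoutError:
        if on_timeout is not None:
            on_timeout(message)          # hook for operational alerting
        return "retry" if attempt < max_retries else "dead_letter"
```

`asyncio.wait_for` cancels the handler when the timeout expires, so the handler should be written to tolerate cancellation at any `await` point.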

## Cancellation

Define whether async operations can be cancelled and how:
- Cancellation signal mechanism (cancel event, status field, cancel API)
- What happens to in-progress work when cancellation is received
- Whether cancellation is best-effort or guaranteed
- How cancellation is reflected in the operation status
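
One way to picture best-effort cancellation is a worker that checks a cancel signal between units of work and reports the resulting status. In the sketch below a `threading.Event` stands in for the cancel signal; `Status` and `run_job` are hypothetical names.

```python
import threading
from enum import Enum
from typing import Callable, Iterable, Tuple


class Status(Enum):
    COMPLETED = "completed"
    CANCELLED = "cancelled"


def run_job(steps: Iterable[Callable[[], None]],
            cancel: threading.Event) -> Tuple[Status, int]:
    """Run steps in order, stopping at the first step boundary after cancel is set.

    Returns the final status and the number of steps that finished.
    A step that has already started is allowed to complete, which is
    what makes this best-effort rather than guaranteed.
    """
    completed = 0
    for step in steps:
        if cancel.is_set():
            return Status.CANCELLED, completed
        step()
        completed += 1
    return Status.COMPLETED, completed
```

The returned step count gives the caller what it needs to decide whether partial work must be compensated or can simply be left in place.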

## Anti-Patterns

- Making operations async without a PRD requirement
- Not defining a DLQ for retryable queues
- Setting infinite timeouts or no timeouts
- Assuming global ordering when per-partition ordering suffices
- Not versioning message schemas
- Processing messages without idempotency (see `idempotency-design`)
- Ignoring backpressure when consumers are overwhelmed

feat/architect (#4) Co-authored-by: 王性驊 <danielwang@supermicro.com> Reviewed-on: https://code.30cm.net/daniel.w/opencode-workflow/pulls/4 2026-04-13 01:19:39 +00:00