opencode-workflow/skills/async-queue-design/SKILL.md

name: async-queue-design
description: Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models.

This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is designing asynchronous workflows.

Core Principle

Asynchronous processing must be justified by a PRD requirement. Do not make operations asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or NFR.

When to Use Async

Use async when:

  • The operation is long-running and cannot complete within the caller's timeout
  • The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
  • Multiple consumers need to react to the same event
  • Throughput requirements exceed synchronous processing capacity
  • Decoupling producer and consumer is architecturally necessary (see system-decomposition)
  • The PRD requires eventual consistency across service boundaries

Do NOT use async when:

  • The operation is fast enough for synchronous handling
  • The caller needs an immediate result
  • The system is simple enough that direct calls suffice
  • Async adds complexity without a corresponding PRD requirement

Queue/Topic Design

For each queue or topic, define:

  • Name and purpose (traced to PRD requirement)
  • Producer service(s)
  • Consumer service(s)
  • Message schema (payload format, headers, metadata)
  • Ordering guarantee (per-partition ordered, unordered, or globally ordered)
  • Delivery guarantee (at-least-once, or exactly-once for critical messages)
  • Retention policy (how long messages are kept)
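
The checklist above can be captured as a small, reviewable record. The following is a minimal sketch, not a prescribed format: `QueueSpec`, its field names, and the `validate` rules are hypothetical and only illustrate how each item in the list might be made explicit and checked.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Ordering(Enum):
    PER_PARTITION = "per-partition"
    UNORDERED = "unordered"


class Guarantee(Enum):
    AT_LEAST_ONCE = "at-least-once"
    EXACTLY_ONCE = "exactly-once"


@dataclass(frozen=True)
class QueueSpec:
    """Hypothetical design record for one queue or topic."""
    name: str
    purpose: str
    prd_requirement: str            # identifier of the PRD requirement it traces to
    producers: Tuple[str, ...]
    consumers: Tuple[str, ...]
    schema_ref: str                 # pointer to the message schema definition
    ordering: Ordering
    guarantee: Guarantee
    retention_hours: int

    def validate(self) -> List[str]:
        """Return a list of design gaps; an empty list means none were found."""
        problems = []
        if not self.prd_requirement:
            problems.append("missing PRD requirement trace")
        if not self.producers:
            problems.append("no producer defined")
        if not self.consumers:
            problems.append("no consumer defined")
        if self.retention_hours <= 0:
            problems.append("retention must be positive")
        return problems
```

A spec with an empty `prd_requirement` fails validation, which mirrors the core principle that every async decision must trace to the PRD.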

Topic vs Queue

Use a topic (pub/sub) when:

  • Multiple independent consumers need the same event
  • Consumers have different processing logic
  • Adding new consumers should not require changes to the producer

Use a queue (point-to-point) when:

  • Exactly one consumer should process each message
  • Work distribution across instances of the same service is needed
  • Ordering within a partition matters

Message Schema

Define message schemas explicitly:

  • Message type or event name
  • Payload schema (with versioning strategy)
  • Metadata headers (correlation ID, causation ID, timestamp, source)
  • Schema evolution strategy (backward compatibility, versioning)
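
As an illustration of the fields listed above, here is a minimal sketch of a versioned message envelope. The class name `Envelope`, the field names, and the integer `schema_version` scheme are assumptions made for the example; the actual schema format and versioning strategy are design decisions for each system.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class Envelope:
    """Hypothetical envelope: event name, versioned payload, metadata headers."""
    event_type: str
    schema_version: int
    payload: Dict[str, Any]
    source: str
    correlation_id: str
    causation_id: Optional[str] = None
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def parse_envelope(raw: str, supported_versions=frozenset({1, 2})) -> Envelope:
    """Decode a message and reject schema versions the consumer does not know."""
    data = json.loads(raw)
    if data["schema_version"] not in supported_versions:
        raise ValueError(f"unsupported schema_version {data['schema_version']}")
    return Envelope(**data)
```

Rejecting unknown versions explicitly lets such messages be treated as non-retryable rather than failing in unpredictable ways deeper in the consumer.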

Retry Strategy

For each async operation, define:

Retry Parameters

  • Maximum retries: typically 3-5 for transient failures
  • Backoff strategy:
    • Fixed interval: simple but may overwhelm recovering service
    • Exponential backoff: recommended default, increasingly longer waits
    • Exponential backoff with jitter: prevents thundering herd
  • Retry budget: maximum concurrent retries per consumer to prevent cascading failure
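
Exponential backoff with jitter can be sketched as follows. This uses the "full jitter" variant (a uniformly random delay between zero and the exponential ceiling); the function names and the default values for `base`, `cap`, and `max_retries` are illustrative assumptions, not recommendations. The retry budget is not modeled here.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Delay in seconds before the given 1-based retry attempt (full jitter)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # The ceiling doubles on each attempt and is clamped to `cap`.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    # Jitter spreads retries out so that consumers do not retry in lockstep.
    return rng.uniform(0, ceiling)


def retry_schedule(max_retries: int = 5, **kwargs) -> list:
    """Delays for attempts 1..max_retries."""
    return [backoff_delay(n, **kwargs) for n in range(1, max_retries + 1)]
```

With `base=0.5`, the ceilings for five attempts are 0.5, 1, 2, 4, and 8 seconds; each actual delay falls somewhere between zero and its ceiling.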

What to Retry

  • Transient network errors
  • Temporary resource unavailability (503, timeouts)
  • Rate limit exceeded (429; back off and honor the Retry-After header when present)
  • Upstream service failures (502, 504)

What NOT to Retry

  • Business rule violations (non-retryable error codes)
  • Malformed messages (bad schema, missing required fields)
  • Permanent failures (authentication errors, not-found errors)
  • Messages that have exceeded maximum retries (route to DLQ)
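
The two lists above amount to a classification step in the consumer. A minimal sketch, assuming failures surface as HTTP-style status codes plus two hypothetical flags (`malformed`, `network_error`); real consumers would map their own exception types onto the same decision.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    DEAD_LETTER = "dead-letter"


# Status codes treated as transient in this sketch.
RETRYABLE_STATUS = {429, 502, 503, 504}


def decide(status: Optional[int], attempt: int, max_retries: int = 5,
           malformed: bool = False, network_error: bool = False) -> Action:
    """Choose what to do after a failed attempt (attempt = attempts made so far)."""
    if malformed:
        return Action.DEAD_LETTER      # a bad schema will not fix itself
    if attempt >= max_retries:
        return Action.DEAD_LETTER      # retries exhausted
    if network_error or status in RETRYABLE_STATUS:
        return Action.RETRY            # transient failure
    return Action.DEAD_LETTER          # permanent failure (e.g., 401, 404)
```

Anything not explicitly recognized as transient is treated as permanent here, which avoids retrying failures that can never succeed.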

Dead-Letter Queue (DLQ) Strategy

For each queue/topic with retry, define:

  • DLQ name (e.g., {original-queue}.dlq)
  • Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
  • DLQ message retention policy
  • Alerting: when messages appear in DLQ, who is notified
  • Recovery process: how DLQ messages are inspected, fixed, and reprocessed

DLQ design principles:

  • Every retryable queue MUST have a DLQ
  • DLQ messages must include original message, error details, and retry count
  • DLQ must be monitored and alerted on; silent DLQs are a failure mode
  • Recovery from DLQ may require manual intervention or a replay mechanism
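
A sketch of what a DLQ record might carry, following the principle that it must include the original message, error details, and retry count. The `DeadLetter` class, the `reason` strings, and the JSON encoding are assumptions for illustration; publishing to a broker, alerting, and replay are out of scope here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


def dlq_name(queue: str) -> str:
    """Derive the DLQ name using the {original-queue}.dlq convention."""
    return f"{queue}.dlq"


@dataclass
class DeadLetter:
    original_queue: str
    original_message: str   # the raw message, unmodified, so it can be replayed
    error_type: str
    error_detail: str
    retry_count: int
    reason: str             # e.g. "max-retries-exceeded", "permanent-failure", "poison-message"


def to_dead_letter(queue: str, raw_message: str, exc: Exception,
                   retry_count: int, reason: str) -> Tuple[str, str]:
    """Return (target DLQ name, serialized DLQ record) for a failed message."""
    record = DeadLetter(
        original_queue=queue,
        original_message=raw_message,
        error_type=type(exc).__name__,
        error_detail=str(exc),
        retry_count=retry_count,
        reason=reason,
    )
    return dlq_name(queue), json.dumps(asdict(record))
```

Keeping the original message untouched inside the record is what makes a later replay possible without reconstructing it from logs.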

Ordering Guarantees

For each queue/topic, explicitly state the ordering guarantee:

  • Per-partition ordered: Messages within the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
  • Unordered: No ordering guarantee across messages. Use when operations are independent.
  • Globally ordered: All messages are delivered in order. Avoid unless the PRD explicitly requires it; it effectively forces a single partition and serial consumption, which severely limits throughput.

If ordering is required:

  • Define the partition key (e.g., user_id, order_id)
  • Define how out-of-order delivery is handled when it occurs
  • Define whether strict ordering or best-effort ordering is acceptable
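
To make the partition-key idea concrete, the sketch below maps a key to a partition with a stable hash and shows one possible way to handle out-of-order delivery: a per-key sequence check that rejects stale messages. Both `partition_for` and `SequenceGuard` are hypothetical; brokers generally ship their own partitioner, and dropping stale messages is only one of several valid policies.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key (e.g., user_id) to a partition, stably across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class SequenceGuard:
    """Tracks the highest sequence number seen per key and rejects older ones."""

    def __init__(self):
        self._last = {}

    def accept(self, key: str, sequence: int) -> bool:
        last = self._last.get(key, -1)
        if sequence <= last:
            return False          # stale or duplicate: arrived out of order
        self._last[key] = sequence
        return True
```

All messages with the same key land on the same partition, so per-key order is preserved as long as the partition count does not change.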

Timeout Behavior

For each async operation, define:

  • Processing timeout: maximum time a consumer may take to process a message
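  • Visibility timeout: how long a message is invisible to other consumers while being processed; set it longer than the processing timeout so the message is not redelivered while the first consumer is still working on it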
  • What happens on timeout:
    • Message is returned to the queue for retry (if below max retries)
    • Message is routed to DLQ (if max retries exceeded)
    • Alerting is triggered for operational visibility

Timeout design principles:

  • Always set timeouts; no infinite waits
  • Timeout values must be based on observed processing times (for example, a high percentile plus headroom), not guesses
  • Document timeout values and adjust based on production metrics
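
A processing timeout and its outcomes can be sketched with `asyncio.wait_for`. The function name, the string outcomes, and the `attempt`/`max_retries` parameters are assumptions for the example; alerting and the broker-side visibility timeout are not modeled.

```python
import asyncio


async def process_with_timeout(handler, message, attempt: int, *,
                               timeout_s: float, max_retries: int) -> str:
    """Run an async handler under a processing timeout.

    Returns "ack" on success, "retry" if the handler timed out and retries
    remain, or "dead-letter" if it timed out on the final allowed attempt.
    """
    try:
        # wait_for cancels the handler if it exceeds timeout_s.
        await asyncio.wait_for(handler(message), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "retry" if attempt < max_retries else "dead-letter"
    return "ack"
```

Because the handler is cancelled on timeout, it should be written so that partial work is safe to repeat on the next attempt.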

Cancellation

Define whether async operations can be cancelled and how:

  • Cancellation signal mechanism (cancel event, status field, cancel API)
  • What happens to in-progress work when cancellation is received
  • Whether cancellation is best-effort or guaranteed
  • How cancellation is reflected in the operation status
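
One way to answer these questions is cooperative, best-effort cancellation through a status field: the worker checks for a cancel request between steps and stops at the next checkpoint. The `Operation` class and the `Status` values below are hypothetical and only illustrate that pattern.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    CANCEL_REQUESTED = "cancel-requested"
    CANCELLED = "cancelled"
    COMPLETED = "completed"


class Operation:
    """Long-running operation made of steps, cancellable between steps."""

    def __init__(self, steps):
        self.steps = list(steps)      # each step is a callable taking the operation
        self.status = Status.PENDING
        self.completed_steps = 0

    def request_cancel(self) -> bool:
        """Record a cancel request; returns False if it is too late to cancel."""
        if self.status in (Status.PENDING, Status.RUNNING):
            self.status = Status.CANCEL_REQUESTED
            return True
        return False

    def run(self) -> None:
        if self.status is Status.PENDING:
            self.status = Status.RUNNING
        for step in self.steps:
            # Checkpoint: stop before starting more work if cancel was requested.
            if self.status is Status.CANCEL_REQUESTED:
                self.status = Status.CANCELLED
                return
            step(self)
            self.completed_steps += 1
        # Best-effort: a request that arrives during the final step is too late.
        self.status = Status.COMPLETED
```

In this sketch a step already in progress always finishes, and `completed_steps` records how much work was done before cancellation took effect.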

Anti-Patterns

  • Making operations async without a PRD requirement
  • Not defining a DLQ for retryable queues
  • Setting infinite timeouts or no timeouts
  • Assuming global ordering when per-partition ordering suffices
  • Not versioning message schemas
  • Processing messages without idempotency (see idempotency-design)
  • Ignoring backpressure when consumers are overwhelmed
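
As an illustration of the last anti-pattern, backpressure can be as simple as a bounded buffer that makes the producer wait when the consumer falls behind. The sketch below uses an in-process `asyncio.Queue` as a stand-in for a real broker; `run_pipeline` and its parameters are hypothetical.

```python
import asyncio


async def run_pipeline(items, maxsize: int = 2):
    """Producer/consumer pair over a bounded queue; returns (processed, peak depth)."""
    queue = asyncio.Queue(maxsize=maxsize)
    processed, depths = [], []

    async def producer():
        for item in items:
            await queue.put(item)          # suspends while the queue is full
            depths.append(queue.qsize())
        await queue.put(None)              # sentinel: no more work

    async def consumer():
        while True:
            item = await queue.get()
            if item is None:
                return
            await asyncio.sleep(0)         # stand-in for real processing
            processed.append(item)

    await asyncio.gather(producer(), consumer())
    return processed, max(depths, default=0)
```

The queue depth never exceeds `maxsize`, so a slow consumer slows the producer down instead of letting an unbounded backlog build up.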