opencode-workflow/SKILL.md at 082c9203fa3af7311dbce8fa87c5a737e9c5818c

name: async-queue-design
description: Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models.

This is a knowledge contract, not a workflow skill. It is referenced by design-architecture when the architect is designing asynchronous workflows.

Core Principle

Asynchronous processing must be justified by a PRD requirement. Do not make operations asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or NFR.

When to Use Async

Use async when:

The operation is long-running and cannot complete within the caller's timeout
The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
Multiple consumers need to react to the same event
Throughput requirements exceed synchronous processing capacity
Decoupling producer and consumer is architecturally necessary (see system-decomposition)
The PRD requires eventual consistency across service boundaries

Do NOT use async when:

The operation is fast enough for synchronous handling
The caller needs an immediate result
The system is simple enough that direct calls suffice
Async adds complexity without a corresponding PRD requirement

Queue/Topic Design

For each queue or topic, define:

Name and purpose (traced to PRD requirement)
Producer service(s)
Consumer service(s)
Message schema (payload format, headers, metadata)
Ordering guarantee (per-partition ordered, unordered)
Delivery guarantee (at-least-once, or exactly-once for messages where duplicates are unacceptable)
Retention policy (how long messages are kept)
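
The checklist above can be captured as a structured record so that a missing field is caught at review time. This is a minimal sketch, not a prescribed format: `QueueSpec`, its field names, the allowed-value sets, and the example values (`order-events`, `FR-12`, the service names) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed vocabularies for this sketch; adapt to the broker in use.
ALLOWED_ORDERING = {"per-partition", "unordered", "global"}
ALLOWED_DELIVERY = {"at-least-once", "exactly-once"}


@dataclass(frozen=True)
class QueueSpec:
    """One record per queue or topic, mirroring the checklist fields."""
    name: str
    purpose: str
    prd_requirement: str            # traceability back to the PRD
    producers: Tuple[str, ...]
    consumers: Tuple[str, ...]
    message_schema: str             # reference to the schema definition
    ordering: str
    delivery_guarantee: str
    retention_hours: int

    def problems(self) -> List[str]:
        """Return human-readable gaps; an empty list means the spec is complete."""
        found = []
        if not self.prd_requirement:
            found.append("missing PRD requirement")
        if not self.producers:
            found.append("no producer defined")
        if not self.consumers:
            found.append("no consumer defined")
        if self.ordering not in ALLOWED_ORDERING:
            found.append("unknown ordering guarantee")
        if self.delivery_guarantee not in ALLOWED_DELIVERY:
            found.append("unknown delivery guarantee")
        if self.retention_hours <= 0:
            found.append("retention must be positive")
        return found


# Hypothetical example entry.
order_events = QueueSpec(
    name="order-events",
    purpose="Notify downstream services of order state changes",
    prd_requirement="FR-12",
    producers=("order-service",),
    consumers=("billing-service", "notification-service"),
    message_schema="order-event.v1",
    ordering="per-partition",
    delivery_guarantee="at-least-once",
    retention_hours=168,
)
```

Calling `order_events.problems()` returns an empty list for a complete entry, and a list of gaps otherwise.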

Topic vs Queue

Use a topic (pub/sub) when:

Multiple independent consumers need the same event
Consumers have different processing logic
Adding new consumers should not require changes to the producer

Use a queue (point-to-point) when:

Exactly one consumer should process each message
Work distribution across instances of the same service is needed
Ordering within a partition matters

Message Schema

Define message schemas explicitly:

Message type or event name
Payload schema (with versioning strategy)
Metadata headers (correlation ID, causation ID, timestamp, source)
Schema evolution strategy (backward compatibility, versioning)
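
As one way to make these fields concrete, the sketch below defines a message envelope with the metadata headers listed above and a parser that tolerates unknown fields, which is a common approach to backward-compatible evolution. The names `Envelope` and `parse_envelope`, the JSON encoding, and the integer `schema_version` field are assumptions for illustration, not requirements of this contract.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field, fields
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class Envelope:
    event_type: str                      # message type or event name
    schema_version: int                  # versioning strategy: integer bump per change
    payload: Dict[str, Any]
    source: str                          # producing service
    correlation_id: str                  # ties together one end-to-end flow
    causation_id: Optional[str] = None   # id of the message that caused this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def parse_envelope(raw: str, supported_versions=(1, 2)) -> Envelope:
    """Parse a JSON envelope, ignoring fields this consumer does not know."""
    data = json.loads(raw)
    version = data.get("schema_version")
    if version not in supported_versions:
        # Unsupported versions are treated as non-retryable by the caller.
        raise ValueError(f"unsupported schema_version: {version}")
    known = {f.name for f in fields(Envelope)}
    return Envelope(**{k: v for k, v in data.items() if k in known})
```

Ignoring unknown fields lets a producer add optional fields without breaking older consumers; removing or renaming a field would still need a version bump.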

Retry Strategy

For each async operation, define:

Retry Parameters

Maximum retries: typically 3-5 for transient failures
Backoff strategy:
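- Fixed interval: simple, but may overwhelm a recovering service
- Exponential backoff: recommended default; the wait grows (typically doubles) after each attempt
- Exponential backoff with jitter: adds randomness to each wait so that consumers do not retry in lockstep (prevents thundering herd)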
Retry budget: maximum concurrent retries per consumer to prevent cascading failure
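
The backoff options above can be sketched as follows. This is an illustrative implementation of exponential backoff with "full jitter" (a uniformly random wait between zero and the exponential ceiling); the function names, the default `base` and `cap` values, and the injectable `sleep` parameter are assumptions. A retry budget is not shown.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Return a wait in seconds for the given zero-based attempt.

    The ceiling doubles with each attempt and is limited by `cap`;
    the actual wait is drawn uniformly from [0, ceiling].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def retry_with_backoff(operation, max_retries=4, base=0.5, cap=30.0,
                       is_retryable=lambda exc: True, sleep=time.sleep):
    """Call `operation` until it succeeds or retries are exhausted.

    `max_retries` counts retries, so the operation runs at most
    max_retries + 1 times. Non-retryable errors are re-raised immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_retries or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt, base, cap))
```

Passing `sleep` as a parameter keeps the function testable without real waits; in production the default `time.sleep` (or an async equivalent) would be used.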

What to Retry

Transient network errors
Temporary resource unavailability (503, timeouts)
Rate limit exceeded (429; back off and honor the Retry-After header when present)
Upstream service failures (502, 504)

What NOT to Retry

Business rule violations (non-retryable error codes)
Malformed messages (bad schema, missing required fields)
Permanent failures (authentication errors, not-found errors)
Messages that have exceeded maximum retries (route to DLQ)
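
The two lists above amount to a classification step that runs before any retry is scheduled. The sketch below shows one possible shape, assuming failures surface as HTTP status codes (or `None` for a network error or timeout). The mapping of 400 and 422 to malformed messages and business rule violations, and the choice to treat unknown statuses as non-retryable, are assumptions of this example.

```python
from enum import Enum
from typing import Optional

# Statuses named in the retry list above.
RETRYABLE_STATUS = frozenset({429, 502, 503, 504})
# Assumed mapping for permanent failures (auth, not-found, bad input).
PERMANENT_STATUS = frozenset({400, 401, 403, 404, 422})


class Decision(Enum):
    RETRY = "retry"
    DEAD_LETTER = "dead-letter"


def decide(status: Optional[int], retries_so_far: int, max_retries: int = 4) -> Decision:
    """Classify a failed attempt.

    `status` is None for transient network errors and timeouts.
    """
    if retries_so_far >= max_retries:
        return Decision.DEAD_LETTER          # retry limit reached
    if status is None or status in RETRYABLE_STATUS:
        return Decision.RETRY                # transient failure
    # Known permanent statuses and anything unrecognized: do not retry.
    return Decision.DEAD_LETTER
```

Treating unrecognized statuses as non-retryable is the conservative choice here; a system that prefers availability over fast failure might invert it.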

Dead-Letter Queue (DLQ) Strategy

For each queue/topic with retry, define:

DLQ name (e.g., {original-queue}.dlq)
Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
DLQ message retention policy
Alerting: when messages appear in DLQ, who is notified
Recovery process: how DLQ messages are inspected, fixed, and reprocessed

DLQ design principles:

Every retryable queue MUST have a DLQ
DLQ messages must include original message, error details, and retry count
DLQ must be monitored and alerted on; silent DLQs are a failure mode
Recovery from DLQ may require manual intervention or a replay mechanism
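
To illustrate the principles above, the sketch below wraps a failed message with its error details and retry count, publishes it to the `{original-queue}.dlq` name, and raises an alert in the same step so the DLQ is never silent. The `publish` and `alert` callables stand in for a real broker client and alerting system; they and the record layout are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def dlq_name(queue: str) -> str:
    """Derive the DLQ name using the {original-queue}.dlq convention."""
    return f"{queue}.dlq"


def build_dead_letter(message: Dict[str, Any], error: Exception,
                      retry_count: int, reason: str) -> Dict[str, Any]:
    """Wrap the original message with the context needed for recovery."""
    return {
        "original_message": message,
        "error": {"type": type(error).__name__, "detail": str(error)},
        "retry_count": retry_count,
        "reason": reason,  # e.g. "max_retries_exceeded", "permanent_failure"
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }


def route_to_dlq(queue: str, message: Dict[str, Any], error: Exception,
                 retry_count: int, reason: str,
                 publish: Callable[[str, Dict[str, Any]], None],
                 alert: Callable[[str], None]) -> Dict[str, Any]:
    """Publish the dead letter and notify; both happen on every routing."""
    record = build_dead_letter(message, error, retry_count, reason)
    target = dlq_name(queue)
    publish(target, record)
    alert(f"{target}: {reason} after {retry_count} retries")
    return record
```

Because the original message is stored unmodified inside the record, a replay mechanism can republish `original_message` to the source queue once the underlying fault is fixed.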

Ordering Guarantees

For each queue/topic, explicitly state the ordering guarantee:

Per-partition ordered: Messages within the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
Unordered: No ordering guarantee across messages. Use when operations are independent.
Globally ordered: All messages are delivered in order. Avoid unless the PRD explicitly requires it (severely limits throughput).

If ordering is required:

Define the partition key (e.g., user_id, order_id)
Define how out-of-order delivery is handled when it occurs
Define whether strict ordering or best-effort ordering is acceptable
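
A rough sketch of both steps follows: mapping a partition key to a partition with a stable hash, and one possible policy for out-of-order delivery (drop anything older than the latest sequence number seen for that key). The producer-assigned sequence number, the `SequenceGuard` class, and the drop-stale policy are assumptions of this example; other policies (buffering, parking for later) may fit a given PRD better.

```python
import hashlib
from typing import Dict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key (e.g. user_id, order_id) to a partition index.

    Uses SHA-256 rather than the built-in hash(), which is randomized
    per process for strings and so is not stable across services.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class SequenceGuard:
    """Detect stale or duplicate messages per partition key.

    Assumes each message carries a per-key sequence number that the
    producer increases monotonically.
    """

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def accept(self, key: str, sequence: int) -> bool:
        """Return True if the message is newer than anything seen for `key`."""
        if sequence <= self._last_seen.get(key, -1):
            return False  # stale or duplicate: caller drops or parks it
        self._last_seen[key] = sequence
        return True
```

Note that this guard accepts a message that arrives ahead of a gap (for example, sequence 3 before 2) and then rejects the late one, so it provides best-effort rather than strict ordering.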

Timeout Behavior