opencode-workflow/SKILL.md at main

name	description
integration-boundary-design	Knowledge contract for integration boundary design. Provides principles and patterns for external API integration, webhook handling, polling, retry strategies, rate limiting, and failure mode handling. Referenced by design-architecture when defining integration boundaries.

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing integration boundaries. It does not produce artifacts directly.

Core Principles

Integration Isolation

External system failures must not cascade into failures of internal services
Circuit breakers must protect internal services from external failures
Integration code must be isolated from business logic (anti-corruption layer)

Explicit Contracts

Every external integration must have an explicitly defined contract
Contracts must include request/response schemas, error codes, and SLAs
Changes to contracts must be versioned and backward-compatible whenever possible

Assume Failure

External systems will fail, timeout, return unexpected data, and change without notice
Design for failure: define timeout, retry, and fallback for every integration
Never assume external system availability or correctness

External API Integration

Patterns

Synchronous API call: Request-response, immediate feedback
Asynchronous API call: Request acknowledged, result via callback or polling
Batch API call: Accumulate requests and send in bulk
Streaming API: Continuous stream of data (SSE, WebSocket, gRPC streaming)

Design Considerations

Define timeout for every outbound API call (default: 5-30 seconds depending on SLA)
Define retry strategy for every outbound call (max retries, backoff, jitter)
Define circuit breaker thresholds (error rate, timeout rate, consecutive failures)
Define fallback behavior when circuit is open (cached data, default response, error)
Define data transformation at the boundary (anti-corruption layer)
Monitor all external calls: latency, error rate, circuit breaker state
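As one illustration of data transformation at the boundary, the sketch below translates an external payload into an internal model and rejects anything that violates the contract. The `Customer` fields and the vendor payload shape (`id`, `contact.email`, `status`) are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Internal model. Field names are hypothetical."""
    customer_id: str
    email: str
    is_active: bool


class ExternalContractError(ValueError):
    """Raised when an external payload violates the agreed contract."""


def to_customer(payload: dict) -> Customer:
    """Translate a hypothetical vendor payload into the internal model.

    Only this function knows the vendor's field names, so a vendor-side
    rename is absorbed here instead of leaking into business logic.
    """
    try:
        return Customer(
            customer_id=str(payload["id"]),
            email=payload["contact"]["email"].strip().lower(),
            is_active=payload["status"] == "ACTIVE",
        )
    except (KeyError, TypeError, AttributeError) as exc:
        raise ExternalContractError(f"unexpected payload shape: {exc!r}") from exc
```

Business logic then depends only on `Customer`, and a contract violation surfaces as a single, well-defined error type at the boundary.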

Webhook Handling

Inbound Webhooks (Receiving)

Define webhook signature verification (HMAC, asymmetric)
Define idempotency for webhook processing (external systems may deliver duplicates)
Define webhook ordering assumptions (ordered vs unordered)
Define webhook timeout and response (acknowledge quickly with a 2xx status, then process asynchronously)
Define webhook retry handling (what if processing fails?)
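The inbound considerations above can be sketched as a small receiver that verifies an HMAC-SHA256 signature, deduplicates by event ID, and enqueues the payload so the HTTP response can return quickly. The class name, the hex-encoded signature format, and the in-memory set and queue are assumptions for illustration; a real receiver would use a durable store and follow the sender's documented signature scheme.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature over the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time to avoid leaking match length.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class WebhookReceiver:
    """Hypothetical receiver: verify, deduplicate, enqueue, acknowledge."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen_ids = set()  # assumption: in-memory; use a durable store in practice
        self.queue = []         # stands in for a real work queue

    def handle(self, event_id: str, body: bytes, signature_hex: str) -> int:
        """Return the HTTP status code to send back to the caller."""
        if not verify_signature(self._secret, body, signature_hex):
            return 401
        if event_id in self._seen_ids:
            return 200  # duplicate delivery: acknowledge without reprocessing
        self._seen_ids.add(event_id)
        self.queue.append(body)  # processing happens later, off the request path
        return 200
```

Acknowledging duplicates with a success status keeps the sender from retrying an event that has already been accepted.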

Outbound Webhooks (Sending)

Define webhook delivery guarantee (at-least-once, at-most-once)
Define webhook retry strategy (max retries, backoff, jitter)
Define webhook payload format (versioned, backward-compatible)
Define webhook authentication (HMAC signature, OAuth2, API key)
Define webhook status tracking (delivered, failed, pending)

Polling

When to Use Polling

When the external system doesn't support webhooks or streaming
When the external system has a polling-based API by design
When real-time updates are not required

Design Considerations

Define polling interval based on data freshness requirements
Use incremental polling (ETag, Last-Modified, since parameter) to avoid redundant data transfer
Define how to handle polling failures (skip and retry next interval)
Define how to handle data gaps (missed polls due to downtime)
Consider long-polling as an alternative when supported
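A minimal sketch of incremental polling with an ETag cursor follows. The transport is injected as a `fetch` callable with a hypothetical signature, `fetch(etag) -> (status, new_etag, items)`, so the example stays independent of any HTTP library; a real implementation would send the ETag in an `If-None-Match` header.

```python
from typing import Callable, Optional, Tuple

# Hypothetical transport signature: fetch(etag) -> (status, new_etag, items)
FetchFn = Callable[[Optional[str]], Tuple[int, Optional[str], list]]


class IncrementalPoller:
    """Polls with an ETag cursor so unchanged data is not transferred again."""

    def __init__(self, fetch: FetchFn):
        self._fetch = fetch
        self._etag: Optional[str] = None

    def poll_once(self) -> list:
        try:
            status, etag, items = self._fetch(self._etag)
        except OSError:
            # Polling failure: skip this interval. The cursor is unchanged,
            # so the next poll picks up whatever was missed.
            return []
        if status == 304:
            return []  # not modified since the stored ETag
        if status != 200:
            return []  # treat other statuses as a failed poll
        self._etag = etag  # advance the cursor only after a successful fetch
        return items
```

Because the cursor moves only on success, a missed poll delays data rather than losing it, provided the external system retains the changes.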

Retry Strategy

Retry Decision Tree

Is the error retryable? (network errors, timeouts, 429, 503 are typically retryable)
What is the retry strategy? (exponential backoff with jitter)
What is the max retry count? (3-5 is typical for transient errors)
What is the max total retry time? (prevent infinite retry loops)
What to do after max retries? (DLQ, alert, manual intervention)
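The decision tree above can be sketched as a single wrapper function. `RetryableError`, `call_with_retry`, and the `dead_letter` callback are hypothetical names; the caller is assumed to map network errors, timeouts, 429, and 503 onto `RetryableError`, and every other exception is treated as non-retryable.

```python
import random
import time


class RetryableError(Exception):
    """Assumption: callers raise this for network errors, timeouts, 429, 503."""


def call_with_retry(operation, *, max_retries=3, base_delay=1.0, max_total=30.0,
                    dead_letter=None, sleep=time.sleep, clock=time.monotonic):
    """Run operation(), retrying retryable failures within both limits."""
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except RetryableError as exc:
            # Exponential backoff with jitter for the next wait.
            delay = random.uniform(0.0, base_delay * 2 ** attempt)
            out_of_retries = attempt >= max_retries
            out_of_time = (clock() - start) + delay > max_total
            if out_of_retries or out_of_time:
                if dead_letter is not None:
                    dead_letter(exc)  # hand off for alerting or manual handling
                raise
            sleep(delay)
            attempt += 1
        # Any other exception is non-retryable and propagates immediately.
```

With `max_retries=3` the operation runs at most four times. `sleep` and `clock` are injectable so the behavior can be tested without real waiting.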

Backoff Strategies

Exponential backoff: Delay doubles each retry (1s, 2s, 4s, 8s...)
Exponential backoff with jitter: Add randomness to prevent thundering herd
Linear backoff: Fixed additional delay each retry (1s, 2s, 3s, 4s...)
Fixed retry: Same delay every retry (simple, but gives a struggling system no additional time to recover)
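The four strategies differ only in how the delay is derived from the attempt number. The sketch below uses zero-based attempts; the function names, the `cap` parameter, and the choice of "full jitter" (a uniform draw between zero and the exponential delay) are assumptions, and other jitter variants exist.

```python
import random


def exponential(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay doubles each retry: 1s, 2s, 4s, 8s... up to cap."""
    return min(cap, base * 2 ** attempt)


def exponential_full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Random delay in [0, exponential delay] so clients do not retry in lockstep."""
    return random.uniform(0.0, exponential(attempt, base, cap))


def linear(attempt: int, base: float = 1.0) -> float:
    """Fixed additional delay each retry: 1s, 2s, 3s, 4s..."""
    return base * (attempt + 1)


def fixed(attempt: int, base: float = 1.0) -> float:
    """Same delay every retry."""
    return base
```

For example, `[exponential(n) for n in range(4)]` yields `[1.0, 2.0, 4.0, 8.0]`, matching the schedule listed above.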

Retry Budget

Define maximum retries per time window (prevent retry storms)
Define retry budget per external system (don't overwhelm a recovering system)
Consider separate retry budgets for critical vs non-critical operations
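One way to enforce a retry budget is a sliding-window counter that each retry must acquire from before it is attempted. `RetryBudget` is a hypothetical, single-threaded sketch; one instance per external system (or per criticality tier) is assumed.

```python
import time
from collections import deque


class RetryBudget:
    """Allows at most max_retries retries within any window_seconds span."""

    def __init__(self, max_retries: int, window_seconds: float, clock=time.monotonic):
        self._max = max_retries
        self._window = window_seconds
        self._clock = clock
        self._stamps = deque()  # timestamps of retries still inside the window

    def try_acquire(self) -> bool:
        """Return True and record the retry if budget remains, else False."""
        now = self._clock()
        # Drop retries that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._max:
            return False  # budget exhausted: skip the retry, fail or queue instead
        self._stamps.append(now)
        return True
```

A caller checks `try_acquire()` before each retry and gives up when it returns `False`, which keeps retry traffic bounded while the external system recovers.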

Rate Limiting

Patterns

Token bucket: Fixed rate refill, burst-capable, most common
Leaky bucket: Fixed rate processing, smooths burst
Fixed window: Simple, but allows burst at window boundaries
Sliding window: More accurate than fixed window, slightly more complex
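As an illustration of the token bucket pattern, the sketch below refills lazily on each check rather than with a background timer. `TokenBucket` and its parameters are hypothetical, and the class is not thread-safe; shared or distributed limits would need a lock or an external store.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`; allows bursts."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available and report whether the call may proceed."""
        now = self._clock()
        # Lazy refill: credit the tokens earned since the last check.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

The same class can guard outbound calls to an external system by setting `rate` and `capacity` to that system's published limits.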

Design Considerations

Define rate limits per endpoint, per client, and per system
Define rate limit headers to return (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset)
Define response when rate limited (429 Too Many Requests with Retry-After header)
Define rate limit storage (Redis, memory, external service)
Define rate limit for outbound calls to external systems (respect their limits)

Failure Mode Handling

Failure Mode Classification

Transient: Network timeout, temporary service unavailable (retry with backoff)
Permanent: Invalid request, authentication failure (fail immediately, no retry)
Partial: Some data processed, some failed (compensate or retry partial)
Cascading: Failure in one service causing failures in others (circuit breaker)

Design Decision Matrix

Failure Type	Detection	Response
Timeout	No response within threshold	Retry with backoff, circuit breaker
5xx Error	HTTP 500-599	Retry with backoff, circuit breaker
429 Rate Limited	HTTP 429	Backoff and retry after Retry-After
4xx Client Error (other than 429)	HTTP 400-499, excluding 429	Fail immediately, log and alert
Connection Refused	TCP connection failure	Circuit breaker, fail fast
Invalid Data	Schema validation failure	Fail immediately, DLQ for investigation
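The matrix can be encoded as a small classification function so that every integration applies the same rules. The `Action` names and the `classify` signature are hypothetical; the mapping follows the rows above, with 429 checked before the general 4xx range.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    HONOR_RETRY_AFTER = "honor_retry_after"
    FAIL_IMMEDIATELY = "fail_immediately"
    FAIL_FAST = "fail_fast"
    DEAD_LETTER = "dead_letter"


def classify(status: Optional[int] = None, *, timed_out: bool = False,
             connection_refused: bool = False, schema_valid: bool = True) -> Action:
    """Map an observed outcome of an external call to a response action."""
    if connection_refused:
        return Action.FAIL_FAST            # trip the circuit breaker
    if timed_out:
        return Action.RETRY_WITH_BACKOFF
    if status is None:
        raise ValueError("status is required when no transport failure is flagged")
    if status == 429:
        return Action.HONOR_RETRY_AFTER    # checked before the general 4xx rule
    if 500 <= status <= 599:
        return Action.RETRY_WITH_BACKOFF
    if 400 <= status <= 499:
        return Action.FAIL_IMMEDIATELY     # log and alert, no retry
    if not schema_valid:
        return Action.DEAD_LETTER          # keep the payload for investigation
    return Action.PROCEED
```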

Circuit Breaker States

Closed: Normal operation, requests pass through
Open: Failure threshold exceeded, requests fail fast (fallback)
Half-Open: After the cooldown, allow a test request; if it succeeds, close the circuit; if it fails, reopen and restart the cooldown
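The three states can be sketched as a small state machine. This is a simplified, single-threaded illustration that trips on consecutive failures only; the class name, thresholds, and `CircuitOpenError` are assumptions, and a production breaker would also limit how many test requests run concurrently while half-open.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the external system while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown: float = 30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._clock = clock
        self._failures = 0        # consecutive failures
        self._opened_at = None    # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._cooldown:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; use a fallback")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if state == "half-open" or self._failures >= self._threshold:
                self._opened_at = self._clock()  # (re)open and restart the cooldown
            raise
        self._failures = 0
        self._opened_at = None  # a success closes the circuit
        return result
```

Callers catch `CircuitOpenError` and apply their fallback, so an unavailable external system costs one fast exception rather than a full timeout per request.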

Fallback Strategies

Cached data: Serve stale data from cache (with staleness warning)
Default response: Return a sensible default (for non-critical data)
Graceful degradation: Return partial data if some services are unavailable
Queue and retry: Store the request and process later when the system recovers
Fail fast: Return error immediately (for critical operations that can't be degraded)
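As a brief illustration of the cached-data strategy, the sketch below serves the last known value with an explicit staleness flag when the live call fails. The function name, the dict-based cache, and the returned shape are hypothetical; catching every `Exception` is a simplification, and real code would catch only the integration's own error types.

```python
def fetch_with_fallback(fetch, cache: dict, key):
    """Return live data when possible, else the cached value marked as stale."""
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            # Serve stale data and tell the caller so it can show a warning.
            return {"value": cache[key], "stale": True}
        raise  # nothing cached: fail fast
    cache[key] = value  # refresh the cache on every successful call
    return {"value": value, "stale": False}
```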

Anti-Patterns

Synchronous chain of external calls: Minimize synchronous external calls in request path
Missing timeout on outbound calls: Always set a timeout, never wait indefinitely
Missing circuit breaker for external systems: External failures must not cascade
Missing idempotency for retries: Retries will cause duplicate processing
Missing rate limiting for outbound calls: Will hit external system rate limits
Missing data transformation at boundary: External data models must not leak into internal models
Missing monitoring on external calls: External call latency and error rates must be tracked
