---
name: integration-boundary-design
description: "Knowledge contract for integration boundary design. Provides principles and patterns for external API integration, webhook handling, polling, retry strategies, rate limiting, and failure mode handling. Referenced by design-architecture when defining integration boundaries."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing integration boundaries. It does not produce artifacts directly.

## Core Principles

### Integration Isolation
- External system failures must not cascade into internal system failures
- Circuit breakers must protect internal services from external failures
- Integration code must be isolated from business logic (anti-corruption layer)

### Explicit Contracts
- Every external integration must have an explicitly defined contract
- Contracts must include request/response schemas, error codes, and SLAs
- Changes to contracts must be versioned and backward-compatible whenever possible

### Assume Failure
- External systems will fail, time out, return unexpected data, and change without notice
- Design for failure: define timeout, retry, and fallback for every integration
- Never assume external system availability or correctness

## External API Integration

### Patterns
- **Synchronous API call**: Request-response, immediate feedback
- **Asynchronous API call**: Request acknowledged, result via callback or polling
- **Batch API call**: Accumulate requests and send in bulk
- **Streaming API**: Continuous stream of data (SSE, WebSocket, gRPC streaming)

### Design Considerations
- Define timeout for every outbound API call (default: 5-30 seconds depending on SLA)
- Define retry strategy for every outbound call (max retries, backoff, jitter)
- Define circuit breaker thresholds (error rate, timeout rate, consecutive failures)
- Define fallback behavior when circuit is open (cached data, default response, error)
- Define data transformation at the boundary (anti-corruption layer)
- Monitor all external calls: latency, error rate, circuit breaker state

## Webhook Handling

### Inbound Webhooks (Receiving)
- Define webhook signature verification (HMAC or asymmetric signatures)
- Define idempotency for webhook processing (external systems may deliver duplicates)
- Define webhook ordering assumptions (ordered vs unordered)
- Define webhook timeout and response (acknowledge with a 2xx quickly, process asynchronously)
- Define webhook retry handling (what happens if asynchronous processing fails after the acknowledgement?)

### Outbound Webhooks (Sending)
- Define webhook delivery guarantee (at-least-once, at-most-once)
- Define webhook retry strategy (max retries, backoff, jitter)
- Define webhook payload format (versioned, backward-compatible)
- Define webhook authentication (HMAC signature, OAuth2, API key)
- Define webhook status tracking (delivered, failed, pending)

## Polling

### When to Use Polling
- When the external system doesn't support webhooks or streaming
- When the external system has a polling-based API by design
- When real-time updates are not required

### Design Considerations
- Define polling interval based on data freshness requirements
- Use incremental polling (ETag, Last-Modified, since parameter) to avoid redundant data transfer
- Define how to handle polling failures (skip and retry next interval)
- Define how to handle data gaps (missed polls due to downtime)
- Consider long-polling as an alternative when supported

## Retry Strategy

### Retry Decision Tree
1. Is the error retryable? (network errors, timeouts, 429, 503 are typically retryable)
2. What is the retry strategy? (exponential backoff with jitter)
3. What is the max retry count? (3-5 is typical for transient errors)
4. What is the max total retry time? (prevent infinite retry loops)
5. What to do after max retries? (DLQ, alert, manual intervention)

### Backoff Strategies
- **Exponential backoff**: Delay doubles each retry (1s, 2s, 4s, 8s...)
- **Exponential backoff with jitter**: Add randomness to each delay so that clients do not retry in lockstep (prevents thundering herd)
- **Linear backoff**: Delay grows by a fixed increment each retry (1s, 2s, 3s, 4s...)
- **Fixed retry**: Same delay every retry (simple, but gives a struggling system no extra time to recover)

### Retry Budget
- Define maximum retries per time window (prevent retry storms)
- Define retry budget per external system (don't overwhelm a recovering system)
- Consider separate retry budgets for critical vs non-critical operations

## Rate Limiting

### Patterns
- **Token bucket**: Fixed rate refill, burst-capable, most common
- **Leaky bucket**: Fixed rate processing, smooths bursts
- **Fixed window**: Simple, but allows bursts at window boundaries
- **Sliding window**: More accurate than fixed window, slightly more complex

### Design Considerations
- Define rate limits per endpoint, per client, and per system
- Define rate limit headers to return (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset)
- Define response when rate limited (429 Too Many Requests with Retry-After header)
- Define rate limit storage (Redis, memory, external service)
- Define rate limit for outbound calls to external systems (respect their limits)

## Failure Mode Handling

### Failure Mode Classification
- **Transient**: Network timeout, temporary service unavailable (retry with backoff)
- **Permanent**: Invalid request, authentication failure (fail immediately, no retry)
- **Partial**: Some data processed, some failed (compensate or retry partial)
- **Cascading**: Failure in one service causing failures in others (circuit breaker)

### Design Decision Matrix

| Failure Type | Detection | Response |
|-------------|-----------|----------|
| Timeout | No response within threshold | Retry with backoff, circuit breaker |
| 5xx Error | HTTP 500-599 | Retry with backoff, circuit breaker |
| 429 Rate Limited | HTTP 429 | Backoff and retry after Retry-After |
| Other 4xx Client Error | HTTP 400-499 (except 429) | Fail immediately, log and alert |
| Connection Refused | TCP connection failure | Circuit breaker, fail fast |
| Invalid Data | Schema validation failure | Fail immediately, DLQ for investigation |

### Circuit Breaker States
- **Closed**: Normal operation, requests pass through
- **Open**: Failure threshold exceeded, requests fail fast (fallback)
- **Half-Open**: After cooldown, allow a test request; if it succeeds, close the circuit; if it fails, return to Open and restart the cooldown

### Fallback Strategies
- **Cached data**: Serve stale data from cache (with staleness warning)
- **Default response**: Return a sensible default (for non-critical data)
- **Graceful degradation**: Return partial data if some services are unavailable
- **Queue and retry**: Store the request and process later when the system recovers
- **Fail fast**: Return error immediately (for critical operations that can't be degraded)

## Anti-Patterns
- **Synchronous chain of external calls**: Minimize synchronous external calls in request path
- **Missing timeout on outbound calls**: Always set a timeout, never wait indefinitely
- **Missing circuit breaker for external systems**: External failures must not cascade
- **Missing idempotency for retries**: Retries will cause duplicate processing
- **Missing rate limiting for outbound calls**: Will hit external system rate limits
- **Missing data transformation at boundary**: External data models must not leak into internal models
- **Missing monitoring on external calls**: External call latency and error rates must be tracked
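
Several of the anti-patterns above (missing circuit breaker, unbounded retries) are avoided by combining the circuit breaker states with exponential backoff and jitter. The following is a minimal Python sketch of that combination, not a reference implementation. The names `CircuitBreaker`, `CircuitOpenError`, `backoff_delay`, and `call_with_retry`, along with every default value, are hypothetical and chosen only for illustration. The sketch is single-threaded; a production breaker would also need locking and monitoring hooks.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the external system."""


class CircuitBreaker:
    """Hypothetical three-state breaker: closed, open, half_open."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed: consecutive failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the cooldown can be tested without waiting
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # cooldown elapsed: let one test request through
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        # Any success (including the half-open test request) closes the circuit.
        self.state = "closed"
        self.consecutive_failures = 0
        return result

    def _record_failure(self) -> None:
        self.consecutive_failures += 1
        # A failed test request reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.consecutive_failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with full jitter: rng() scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation: Callable[[], T], breaker: CircuitBreaker,
                    max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient errors through the breaker; never retry while the circuit is open."""
    for attempt in range(max_retries + 1):
        try:
            return breaker.call(operation)
        except CircuitOpenError:
            raise  # fail fast; the caller applies its fallback strategy
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # retries exhausted: hand off to DLQ / alerting
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

A caller would typically catch `CircuitOpenError` and apply one of the fallback strategies listed earlier, for example serving cached data. Errors other than `TimeoutError` and `ConnectionError` propagate without retry, which matches the "fail immediately" response for permanent failures, although they still count toward the breaker's failure threshold in this sketch.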