| name | description |
|---|---|
| integration-boundary-design | Knowledge contract for integration boundary design. Provides principles and patterns for external API integration, webhook handling, polling, retry strategies, rate limiting, and failure mode handling. Referenced by design-architecture when defining integration boundaries. |
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing integration boundaries. It does not produce artifacts directly.
Core Principles
Integration Isolation
- External system failures must not cascade into failures of internal services
- Circuit breakers must protect internal services from external failures
- Integration code must be isolated from business logic (anti-corruption layer)
Explicit Contracts
- Every external integration must have an explicitly defined contract
- Contracts must include request/response schemas, error codes, and SLAs
- Changes to contracts must be versioned and backward-compatible whenever possible
Assume Failure
- External systems will fail, timeout, return unexpected data, and change without notice
- Design for failure: define timeout, retry, and fallback for every integration
- Never assume external system availability or correctness
External API Integration
Patterns
- Synchronous API call: Request-response, immediate feedback
- Asynchronous API call: Request acknowledged, result via callback or polling
- Batch API call: Accumulate requests and send in bulk
- Streaming API: Continuous stream of data (SSE, WebSocket, gRPC streaming)
Design Considerations
- Define timeout for every outbound API call (default: 5-30 seconds depending on SLA)
- Define retry strategy for every outbound call (max retries, backoff, jitter)
- Define circuit breaker thresholds (error rate, timeout rate, consecutive failures)
- Define fallback behavior when circuit is open (cached data, default response, error)
- Define data transformation at the boundary (anti-corruption layer)
- Monitor all external calls: latency, error rate, circuit breaker state
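
The boundary transformation (anti-corruption layer) can be sketched as a single translation function. This is a minimal illustration, not a prescribed implementation: the `Payment` model, the payload field names (`id`, `amount_minor`, `currency`, `status`), and the minor-units convention are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    """Internal model (hypothetical); business logic only ever sees this type."""
    payment_id: str
    amount: Decimal
    currency: str
    succeeded: bool


class ExternalContractError(ValueError):
    """Raised when an external payload does not match the expected contract."""


def to_internal_payment(payload: dict) -> Payment:
    """Translate a hypothetical provider payload into the internal model."""
    try:
        return Payment(
            payment_id=str(payload["id"]),
            # Assumption: the provider reports amounts in minor units (cents).
            amount=Decimal(int(payload["amount_minor"])) / 100,
            currency=str(payload["currency"]).upper(),
            succeeded=payload["status"] == "succeeded",
        )
    except (KeyError, TypeError, ValueError) as exc:
        # Unexpected shapes are rejected at the boundary instead of leaking inward.
        raise ExternalContractError(f"unexpected payload shape: {exc!r}") from exc
```

Keeping the translation in one place means a change in the external schema touches this function only, not the business logic behind it.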
Webhook Handling
Inbound Webhooks (Receiving)
- Define webhook signature verification (HMAC, asymmetric)
- Define idempotency for webhook processing (external systems may deliver duplicates)
- Define webhook ordering assumptions (ordered vs unordered)
- Define webhook timeout and response (acknowledge quickly with a 2xx response, then process asynchronously)
- Define webhook retry handling (what if processing fails?)
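
Signature verification and idempotent handling can be sketched together. This assumes an HMAC-SHA256 signature sent as a hex string and a unique event ID supplied by the sender; the `WebhookReceiver` class, its in-memory set, and its list-based queue are hypothetical stand-ins for a durable store and a real job queue.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute the HMAC over the raw body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class WebhookReceiver:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen_ids = set()  # assumption: a durable store in a real deployment
        self.queue = []        # assumption: stands in for an async job queue

    def handle(self, event_id: str, body: bytes, signature_hex: str) -> int:
        """Return the HTTP status to send back to the webhook sender."""
        if not verify_signature(self.secret, body, signature_hex):
            return 401
        if event_id in self.seen_ids:
            return 200  # duplicate delivery: acknowledge, do not reprocess
        self.seen_ids.add(event_id)
        self.queue.append(body)  # heavy processing happens later, off the request path
        return 200
```

Verifying against the raw request bytes matters: re-serializing parsed JSON can change whitespace or key order and invalidate the signature.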
Outbound Webhooks (Sending)
- Define webhook delivery guarantee (at-least-once, at-most-once)
- Define webhook retry strategy (max retries, backoff, jitter)
- Define webhook payload format (versioned, backward-compatible)
- Define webhook authentication (HMAC signature, OAuth2, API key)
- Define webhook status tracking (delivered, failed, pending)
Polling
When to Use Polling
- When the external system doesn't support webhooks or streaming
- When the external system has a polling-based API by design
- When real-time updates are not required
Design Considerations
- Define polling interval based on data freshness requirements
- Use incremental polling (ETag, Last-Modified, since parameter) to avoid redundant data transfer
- Define how to handle polling failures (skip and retry next interval)
- Define how to handle data gaps (missed polls due to downtime)
- Consider long-polling as an alternative when supported
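
Incremental polling with failure handling can be sketched as one polling step that only advances its cursor on success. The `fetch` callable and its `(items, next_cursor)` return shape are assumptions made for illustration; a real API might use an ETag or a `since` timestamp instead.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetch signature: takes the last cursor, returns (items, next_cursor).
Fetch = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def poll_once(
    fetch: Fetch,
    cursor: Optional[str],
    handle: Callable[[dict], None],
) -> Optional[str]:
    """Run one polling cycle and return the cursor to use for the next one."""
    try:
        items, next_cursor = fetch(cursor)
    except (ConnectionError, TimeoutError):
        # Skip this interval. Keeping the old cursor means the next successful
        # poll also picks up whatever was missed, which closes data gaps.
        return cursor
    for item in items:
        handle(item)
    return next_cursor
```

A scheduler would call `poll_once` at the chosen interval and persist the returned cursor, so that downtime does not lose the position.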
Retry Strategy
Retry Decision Tree
- Is the error retryable? (network errors, timeouts, 429, 503 are typically retryable)
- What is the retry strategy? (exponential backoff with jitter)
- What is the max retry count? (3-5 is typical for transient errors)
- What is the max total retry time? (prevent infinite retry loops)
- What to do after max retries? (DLQ, alert, manual intervention)
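
The decision tree above can be sketched as a single retry loop. The `HttpError` type, the `dead_letter` callback, and the default limits are hypothetical; the retryable set mirrors the examples in the list (network errors, timeouts, 429, 503).

```python
import random
import time

RETRYABLE_STATUS = {429, 503}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    return isinstance(exc, HttpError) and exc.status in RETRYABLE_STATUS


def call_with_retry(operation, *, max_retries=3, base_delay=1.0,
                    max_total_seconds=30.0, sleep=time.sleep,
                    clock=time.monotonic, dead_letter=None):
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # permanent error: fail immediately, no retry
            # Exponential backoff with jitter: random delay up to base * 2^attempt.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            exhausted = attempt >= max_retries
            over_time = (clock() - start) + delay > max_total_seconds
            if exhausted or over_time:
                if dead_letter is not None:
                    dead_letter(exc)  # hand off for alerting or manual handling
                raise
            sleep(delay)
            attempt += 1
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.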
Backoff Strategies
- Exponential backoff: Delay doubles each retry (1s, 2s, 4s, 8s...)
- Exponential backoff with jitter: Add randomness to prevent thundering herd
- Linear backoff: Fixed additional delay each retry (1s, 2s, 3s, 4s...)
- Fixed retry: Same delay every retry (simple, but does not ease load on a struggling system)
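
The four strategies can be compared as delay functions of the zero-based retry number. This is a sketch: the `cap` parameter and the "full jitter" variant (a uniformly random delay between zero and the exponential value) are assumptions, not requirements of this contract.

```python
import random


def exponential(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay doubles each retry: 1s, 2s, 4s, 8s, ... up to cap."""
    return min(cap, base * 2 ** attempt)


def exponential_with_jitter(attempt: int, base: float = 1.0,
                            cap: float = 60.0, rng=random) -> float:
    """Random delay in [0, exponential delay], so clients do not retry in lockstep."""
    return rng.uniform(0, exponential(attempt, base, cap))


def linear(attempt: int, step: float = 1.0, cap: float = 60.0) -> float:
    """Delay grows by a fixed step each retry: 1s, 2s, 3s, 4s, ... up to cap."""
    return min(cap, step * (attempt + 1))


def fixed(attempt: int, delay: float = 1.0) -> float:
    """Same delay on every retry."""
    return delay
```

For example, `[exponential(a) for a in range(4)]` yields `[1.0, 2.0, 4.0, 8.0]`, while `[linear(a) for a in range(4)]` yields `[1.0, 2.0, 3.0, 4.0]`.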
Retry Budget
- Define maximum retries per time window (prevent retry storms)
- Define retry budget per external system (don't overwhelm a recovering system)
- Consider separate retry budgets for critical vs non-critical operations
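
A retry budget can be sketched as a sliding-window counter that each retry must acquire from before it runs. The `RetryBudget` class and its parameters are hypothetical names chosen for illustration.

```python
import time
from collections import deque


class RetryBudget:
    """Allow at most max_retries retries within any window_seconds span."""

    def __init__(self, max_retries: int, window_seconds: float, clock=time.monotonic):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self._clock = clock
        self._stamps = deque()  # timestamps of retries granted inside the window

    def try_acquire(self) -> bool:
        now = self._clock()
        # Drop retries that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_retries:
            return False  # budget spent: fail instead of adding load
        self._stamps.append(now)
        return True
```

One way to get per-system and per-criticality budgets is a dictionary of instances, for example keyed by `(system_name, "critical")` and `(system_name, "non_critical")`.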
Rate Limiting
Patterns
- Token bucket: Fixed rate refill, burst-capable, most common
- Leaky bucket: Fixed rate processing, smooths burst
- Fixed window: Simple, but allows burst at window boundaries
- Sliding window: More accurate than fixed window, slightly more complex
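
As an illustration of the most common pattern, here is a minimal single-process token bucket. It is a sketch only: it is not thread-safe, and a shared store would be needed to enforce limits across several instances.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second and allows bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Add the tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `rate=1.0` and `capacity=2`, two back-to-back requests pass, a third is rejected, and one more is allowed after a second has elapsed.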
Design Considerations
- Define rate limits per endpoint, per client, and per system
- Define rate limit headers to return (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset)
- Define response when rate limited (429 Too Many Requests with Retry-After header)
- Define rate limit storage (Redis, memory, external service)
- Define rate limit for outbound calls to external systems (respect their limits)
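
On the outbound side, respecting an external system's limit usually means honoring the `Retry-After` value it returns with a 429. A minimal sketch, assuming the header carries either a number of seconds or an HTTP date; `retry_after_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: datetime) -> Optional[float]:
    """Return how long to wait, or None if the header cannot be parsed."""
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # assumption: treat naive dates as UTC
    # A date in the past means the caller may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

Callers would typically pass `datetime.now(timezone.utc)` as `now` and fall back to their normal backoff when the result is `None`.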
Failure Mode Handling
Failure Mode Classification
- Transient: Network timeout, temporary service unavailable (retry with backoff)
- Permanent: Invalid request, authentication failure (fail immediately, no retry)
- Partial: Some data processed, some failed (compensate or retry partial)
- Cascading: Failure in one service causing failures in others (circuit breaker)
Design Decision Matrix
| Failure Type | Detection | Response |
|---|---|---|
| Timeout | No response within threshold | Retry with backoff, circuit breaker |
| 5xx Error | HTTP 500-599 | Retry with backoff, circuit breaker |
| 429 Rate Limited | HTTP 429 | Backoff and retry after Retry-After |
| 4xx Client Error | HTTP 400-499 (except 429) | Fail immediately, log and alert |
| Connection Refused | TCP connection failure | Circuit breaker, fail fast |
| Invalid Data | Schema validation failure | Fail immediately, DLQ for investigation |
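
The matrix can be encoded as a small classification function so that every integration applies the same rules. The `Action` names and the `decide` signature are hypothetical; how a failure is counted toward a circuit breaker is left to the caller.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"
    RETRY_WITH_BACKOFF = "retry_with_backoff"  # also counts toward the circuit breaker
    RETRY_AFTER_HEADER = "retry_after_header"
    FAIL_IMMEDIATELY = "fail_immediately"      # log and alert
    FAIL_FAST = "fail_fast"                    # trip or consult the circuit breaker
    DEAD_LETTER = "dead_letter"                # park for investigation


def decide(status: Optional[int] = None,
           exc: Optional[BaseException] = None,
           schema_valid: bool = True) -> Action:
    """Map one call outcome to the response listed in the decision matrix."""
    if isinstance(exc, TimeoutError):
        return Action.RETRY_WITH_BACKOFF
    if isinstance(exc, ConnectionRefusedError):
        return Action.FAIL_FAST
    if status is not None:
        if status == 429:  # checked before the general 4xx rule
            return Action.RETRY_AFTER_HEADER
        if 500 <= status <= 599:
            return Action.RETRY_WITH_BACKOFF
        if 400 <= status <= 499:
            return Action.FAIL_IMMEDIATELY
    if not schema_valid:
        return Action.DEAD_LETTER
    return Action.PROCEED
```

The order of checks matters: 429 must be tested before the general 4xx range, otherwise rate-limited calls would be treated as permanent failures.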
Circuit Breaker States
- Closed: Normal operation, requests pass through
- Open: Failure threshold exceeded, requests fail fast (fallback)
- Half-Open: After a cooldown, allow a test request; if it succeeds, close the circuit; if it fails, return to Open and restart the cooldown
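
The three states can be sketched as a small state machine. This is a single-threaded illustration that trips on consecutive failures; the thresholds, the `CircuitOpenError` type, and the string state names are assumptions, and a production breaker would also need locking and possibly error-rate thresholds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # cooldown elapsed: let one test request through
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _on_failure(self) -> None:
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()  # restart the cooldown
```

Callers catch `CircuitOpenError` and apply their fallback rather than waiting on a system that is known to be failing.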
Fallback Strategies
- Cached data: Serve stale data from cache (with staleness warning)
- Default response: Return a sensible default (for non-critical data)
- Graceful degradation: Return partial data if some services are unavailable
- Queue and retry: Store the request and process later when the system recovers
- Fail fast: Return error immediately (for critical operations that can't be degraded)
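
The cached-data strategy, with fail fast as the last resort, can be sketched as a thin wrapper around the external call. The `CachedFallback` and `Result` names are hypothetical, and the in-memory dictionary stands in for a real cache with expiry.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Result:
    value: Any
    stale: bool  # True when served from cache because the live call failed


class CachedFallback:
    def __init__(self, fetch: Callable[[str], Any]):
        self._fetch = fetch
        self._cache = {}  # assumption: replace with a bounded cache in practice

    def get(self, key: str) -> Result:
        try:
            value = self._fetch(key)
        except (ConnectionError, TimeoutError):
            if key in self._cache:
                # Serve the last known value and flag it so callers can warn users.
                return Result(self._cache[key], stale=True)
            raise  # nothing cached: fail fast
        self._cache[key] = value
        return Result(value, stale=False)
```

Returning the `stale` flag alongside the value lets the caller decide whether to show a staleness warning or degrade further.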
Anti-Patterns
- Synchronous chain of external calls: Minimize synchronous external calls in the request path
- Missing timeout on outbound calls: Always set a timeout, never wait indefinitely
- Missing circuit breaker for external systems: External failures must not cascade
- Missing idempotency for retries: Without idempotency, retries can cause duplicate processing
- Missing rate limiting for outbound calls: Unthrottled calls will hit external system rate limits
- Missing data transformation at boundary: External data models must not leak into internal models
- Missing monitoring on external calls: External call latency and error rates must be tracked