---
name: integration-boundary-design
description: "Knowledge contract for integration boundary design. Provides principles and patterns for external API integration, webhook handling, polling, retry strategies, rate limiting, and failure mode handling. Referenced by design-architecture when defining integration boundaries."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing integration boundaries. It does not produce artifacts directly.

## Core Principles

### Integration Isolation

- External system failures must not cascade into internal system failures
- Circuit breakers must protect internal services from external failures
- Integration code must be isolated from business logic (anti-corruption layer)
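
As an illustration of the anti-corruption layer, here is a minimal sketch that translates an external payload into an internal model at the boundary. The provider field names (`amt_cents`, `cur`, `st`) and the `Payment` and `ContractViolation` types are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    """Internal model; business logic depends only on this type."""
    amount: Decimal
    currency: str
    settled: bool


class ContractViolation(ValueError):
    """Raised when the external payload does not match the expected shape."""


def to_internal_payment(payload: dict) -> Payment:
    """Translate the (hypothetical) provider payload into the internal model."""
    try:
        return Payment(
            amount=Decimal(payload["amt_cents"]) / 100,
            currency=str(payload["cur"]).upper(),
            settled=payload["st"] == "SETTLED",
        )
    except (KeyError, TypeError, ArithmeticError) as exc:
        # Malformed external data stops here instead of leaking inward.
        raise ContractViolation(f"unexpected provider payload: {exc!r}") from exc
```

Because only `to_internal_payment` knows the provider's field names, a change on the provider side should touch this one function rather than the business logic.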

### Explicit Contracts

- Every external integration must have an explicitly defined contract
- Contracts must include request/response schemas, error codes, and SLAs
- Changes to contracts must be versioned and backward-compatible whenever possible

### Assume Failure

- External systems will fail, time out, return unexpected data, and change without notice
- Design for failure: define timeout, retry, and fallback for every integration
- Never assume external system availability or correctness

## External API Integration

### Patterns

- **Synchronous API call**: Request-response, immediate feedback
- **Asynchronous API call**: Request acknowledged, result via callback or polling
- **Batch API call**: Accumulate requests and send in bulk
- **Streaming API**: Continuous stream of data (SSE, WebSocket, gRPC streaming)

### Design Considerations

- Define timeout for every outbound API call (default: 5-30 seconds depending on SLA)
- Define retry strategy for every outbound call (max retries, backoff, jitter)
- Define circuit breaker thresholds (error rate, timeout rate, consecutive failures)
- Define fallback behavior when circuit is open (cached data, default response, error)
- Define data transformation at the boundary (anti-corruption layer)
- Monitor all external calls: latency, error rate, circuit breaker state
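
One way to make these decisions explicit is to record them per integration in a single settings object. The sketch below is illustrative only: `IntegrationPolicy` and all of its field names are assumptions, not part of any library. It also shows how the timeout and retry settings combine into a worst-case latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrationPolicy:
    """Per-integration settings; every field name here is illustrative."""
    name: str
    timeout_seconds: float           # upper bound for one outbound call
    max_retries: int                 # retries after the first attempt
    backoff_base_seconds: float      # first delay for exponential backoff
    breaker_failure_threshold: int   # consecutive failures before opening
    breaker_cooldown_seconds: float  # how long the circuit stays open
    fallback: str                    # "cached", "default", or "error"

    def __post_init__(self) -> None:
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        if self.max_retries < 0:
            raise ValueError("max_retries must not be negative")
        if self.fallback not in {"cached", "default", "error"}:
            raise ValueError(f"unknown fallback: {self.fallback}")

    def worst_case_seconds(self) -> float:
        """Time spent if every attempt times out, plus backoff waits (no jitter)."""
        attempts = self.max_retries + 1
        waits = sum(self.backoff_base_seconds * 2 ** i for i in range(self.max_retries))
        return attempts * self.timeout_seconds + waits
```

For example, a 5-second timeout with 3 retries and a 1-second backoff base gives a worst case of 4 × 5 + (1 + 2 + 4) = 27 seconds, which is worth checking against the caller's own deadline.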

## Webhook Handling

### Inbound Webhooks (Receiving)

- Define webhook signature verification (HMAC, asymmetric)
- Define idempotency for webhook processing (external systems may deliver duplicates)
- Define webhook ordering assumptions (ordered vs unordered)
- Define webhook timeout and response (acknowledge with 200 quickly, process asynchronously)
- Define webhook retry handling (what happens if processing fails after the acknowledgment?)
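
The first, second, and fourth points can be sketched together as follows. This assumes the sender signs the raw body with HMAC-SHA256 and sends a hex digest plus a unique event ID. `WebhookReceiver`, the in-memory set, and the list used as a queue are hypothetical stand-ins for a durable store and a real job queue.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 hex signature using a constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


class WebhookReceiver:
    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._seen_event_ids = set()  # use a durable store in practice
        self.queue = []               # stand-in for a real job queue

    def handle(self, event_id: str, body: bytes, signature_hex: str) -> int:
        """Return the HTTP status to send; heavy work happens off the queue."""
        if not verify_signature(self._secret, body, signature_hex):
            return 401
        if event_id in self._seen_event_ids:
            return 200  # duplicate delivery: acknowledge, do not reprocess
        self._seen_event_ids.add(event_id)
        self.queue.append(body)
        return 200
```

Acknowledging duplicates with 200 keeps the sender from retrying an event that has already been accepted.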

### Outbound Webhooks (Sending)

- Define webhook delivery guarantee (at-least-once, at-most-once)
- Define webhook retry strategy (max retries, backoff, jitter)
- Define webhook payload format (versioned, backward-compatible)
- Define webhook authentication (HMAC signature, OAuth2, API key)
- Define webhook status tracking (delivered, failed, pending)
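
A minimal sketch of at-least-once delivery with status tracking, assuming HMAC-SHA256 signing and a versioned JSON envelope. `Delivery`, `attempt_delivery`, and the `send` callable (which should return `True` on a 2xx response) are hypothetical names. Scheduling the next attempt with backoff is left to the caller.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Delivery:
    payload: dict
    status: str = "pending"  # pending -> delivered | failed
    attempts: int = 0


def sign(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 hex signature the receiver can verify."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def attempt_delivery(
    delivery: Delivery,
    secret: bytes,
    send: Callable[[bytes, str], bool],
    max_attempts: int = 5,
) -> None:
    """Try one delivery and update the tracked status."""
    # Versioned envelope so receivers can handle format changes.
    body = json.dumps({"version": 1, "data": delivery.payload}, sort_keys=True).encode()
    delivery.attempts += 1
    if send(body, sign(secret, body)):
        delivery.status = "delivered"
    elif delivery.attempts >= max_attempts:
        delivery.status = "failed"
    # Otherwise the delivery stays "pending" and is retried later.
```

Since a failed attempt may still have reached the receiver, this gives at-least-once semantics, so receivers need to deduplicate.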

## Polling

### When to Use Polling

- When the external system doesn't support webhooks or streaming
- When the external system has a polling-based API by design
- When real-time updates are not required

### Design Considerations

- Define polling interval based on data freshness requirements
- Use incremental polling (ETag, Last-Modified, since parameter) to avoid redundant data transfer
- Define how to handle polling failures (for example, skip and retry at the next interval)
- Define how to handle data gaps (missed polls due to downtime)
- Consider long-polling as an alternative when supported
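
Incremental polling and failure handling can be sketched with a cursor that only advances on success. The `fetch_since` callable is a hypothetical client that takes the last cursor and returns `(items, new_cursor)`; real APIs may expose this as an ETag, a timestamp, or a `since` parameter instead.

```python
from typing import Callable, Optional

# Hypothetical client signature: fetch_since(cursor) -> (items, new_cursor)
FetchSince = Callable[[Optional[str]], tuple]


class IncrementalPoller:
    def __init__(self, fetch_since: FetchSince) -> None:
        self._fetch_since = fetch_since
        # Persist the cursor in practice so a restart resumes without gaps.
        self.cursor: Optional[str] = None

    def poll_once(self) -> list:
        """Fetch items newer than the cursor; on failure keep the cursor."""
        try:
            items, new_cursor = self._fetch_since(self.cursor)
        except Exception:
            # Skip this interval; the next poll re-requests the same range.
            return []
        if new_cursor is not None:
            self.cursor = new_cursor
        return items
```

Because the cursor is untouched on failure, items that arrived during an outage are picked up by the first successful poll, which covers the data-gap case as long as the external system retains them.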

## Retry Strategy

### Retry Decision Tree

1. Is the error retryable? (network errors, timeouts, 429, 503 are typically retryable)
2. What is the retry strategy? (exponential backoff with jitter)
3. What is the max retry count? (3-5 is typical for transient errors)
4. What is the max total retry time? (prevent infinite retry loops)
5. What happens after max retries? (DLQ, alert, manual intervention)
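
The five questions above map onto a retry loop roughly as follows. This is a sketch under stated assumptions: `ExternalCallError`, `call_with_retry`, and the `send_to_dlq` callback are hypothetical names, and the default limits are placeholders rather than recommendations.

```python
import random
import time
from typing import Callable, Optional

RETRYABLE_STATUS = {429, 503}  # step 1: statuses treated as transient


class ExternalCallError(Exception):
    """Hypothetical error type; status is None for network errors and timeouts."""

    def __init__(self, status: Optional[int] = None) -> None:
        super().__init__(f"external call failed (status={status})")
        self.status = status


def call_with_retry(
    call: Callable[[], object],
    send_to_dlq: Callable[[Exception], None],
    max_retries: int = 3,
    max_total_seconds: float = 30.0,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    started = time.monotonic()
    attempt = 0
    while True:
        try:
            return call()
        except ExternalCallError as exc:
            # Step 1: permanent errors are not retried.
            if exc.status is not None and exc.status not in RETRYABLE_STATUS:
                raise
            # Step 2: exponential backoff with full jitter.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            # Steps 3 and 4: stop on retry count or total time budget.
            out_of_retries = attempt >= max_retries
            out_of_time = (time.monotonic() - started) + delay > max_total_seconds
            if out_of_retries or out_of_time:
                send_to_dlq(exc)  # step 5: hand off for later inspection
                raise
            attempt += 1
            sleep(delay)
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.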

### Backoff Strategies

- **Exponential backoff**: Delay doubles each retry (1s, 2s, 4s, 8s...)
- **Exponential backoff with jitter**: Add randomness to prevent thundering herd
- **Linear backoff**: Delay grows by a fixed increment each retry (1s, 2s, 3s, 4s...)
- **Fixed retry**: Same delay every retry (simple, but gives a struggling system no extra room to recover)
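
The four strategies can be written as small delay functions, where `attempt` is 0 for the first retry. The `cap` parameter and the "full jitter" variant (a uniform draw between 0 and the exponential delay) are assumptions added for illustration.

```python
import random


def exponential(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay doubles each retry: 1s, 2s, 4s, 8s, ... up to cap."""
    return min(cap, base * 2 ** attempt)


def exponential_full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Random delay in [0, exponential delay] so clients do not retry in lockstep."""
    return random.uniform(0, exponential(attempt, base, cap))


def linear(attempt: int, step: float = 1.0) -> float:
    """Delay grows by a fixed increment: 1s, 2s, 3s, 4s, ..."""
    return step * (attempt + 1)


def fixed(attempt: int, delay: float = 1.0) -> float:
    """Same delay on every retry."""
    return delay
```

Jitter matters most when many clients fail at the same moment: without it, they all retry at the same instants and recreate the spike.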

### Retry Budget

- Define maximum retries per time window (prevent retry storms)
- Define retry budget per external system (don't overwhelm a recovering system)
- Consider separate retry budgets for critical vs non-critical operations
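
A retry budget can be sketched as a sliding-window counter that callers consult before each retry. `RetryBudget` is a hypothetical name, and one instance per external system (or per criticality class) is assumed.

```python
import time
from collections import deque
from typing import Callable


class RetryBudget:
    """Allow at most max_retries retries within any window_seconds span."""

    def __init__(
        self,
        max_retries: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_retries
        self._window = window_seconds
        self._clock = clock
        self._spent = deque()  # timestamps of retries still inside the window

    def try_spend(self) -> bool:
        """Return True and record the retry if budget remains, else False."""
        now = self._clock()
        while self._spent and now - self._spent[0] >= self._window:
            self._spent.popleft()  # drop retries that left the window
        if len(self._spent) >= self._max:
            return False
        self._spent.append(now)
        return True
```

When `try_spend` returns `False`, the caller should give up on the retry (fail or fall back) rather than wait, so that a recovering system is not hit by a backlog of retries.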

## Rate Limiting

### Patterns

- **Token bucket**: Fixed rate refill, burst-capable, most common
- **Leaky bucket**: Fixed rate processing, smooths bursts
- **Fixed window**: Simple, but allows bursts at window boundaries
- **Sliding window**: More accurate than fixed window, slightly more complex
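
As an example of the first pattern, a minimal single-process token bucket might look like this. The class and parameter names are illustrative, and a shared store would be needed to enforce a limit across multiple instances.

```python
import time
from typing import Callable


class TokenBucket:
    """Refill at a fixed rate up to capacity; each request spends tokens."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if enough are available."""
        now = self._clock()
        # Add tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

The capacity sets the largest burst allowed, while the refill rate sets the sustained throughput.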

### Design Considerations

- Define rate limits per endpoint, per client, and per system
- Define rate limit headers to return (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset)
- Define response when rate limited (429 Too Many Requests with Retry-After header)
- Define rate limit storage (Redis, memory, external service)
- Define rate limit for outbound calls to external systems (respect their limits)
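
The header and response decisions can be sketched as a pure function over the current window's counters. `rate_limit_response` is a hypothetical helper. Expressing `X-RateLimit-Reset` as epoch seconds is an assumption, since these headers are a convention and some APIs send seconds remaining instead.

```python
import math


def rate_limit_response(limit: int, used: int, reset_at: float, now: float):
    """Return (status, headers) for a request arriving at `now`.

    `used` counts requests already accepted in the current window;
    `reset_at` and `now` are epoch seconds.
    """
    allowed = used < limit
    remaining = limit - used - 1 if allowed else 0
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if allowed:
        return 200, headers
    # Retry-After in whole seconds, rounded up so clients do not retry early.
    headers["Retry-After"] = str(max(0, math.ceil(reset_at - now)))
    return 429, headers
```

Keeping the function free of storage concerns means the same logic applies whether the counters live in memory, Redis, or an external service.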

## Failure Mode Handling

### Failure Mode Classification

- **Transient**: Network timeout, temporary service unavailable (retry with backoff)
- **Permanent**: Invalid request, authentication failure (fail immediately, no retry)
- **Partial**: Some data processed, some failed (compensate or retry partial)
- **Cascading**: Failure in one service causing failures in others (circuit breaker)

### Design Decision Matrix

| Failure Type | Detection | Response |
|--------------|-----------|----------|
| Timeout | No response within threshold | Retry with backoff, circuit breaker |
| 5xx Error | HTTP 500-599 | Retry with backoff, circuit breaker |
| 429 Rate Limited | HTTP 429 | Back off and retry after Retry-After |
| 4xx Client Error | HTTP 400-499 (except 429) | Fail immediately, log and alert |
| Connection Refused | TCP connection failure | Circuit breaker, fail fast |
| Invalid Data | Schema validation failure | Fail immediately, DLQ for investigation |
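
The matrix can be encoded as a single classification function so that every integration applies the same rules. The `Action` enum and the `decide` signature are hypothetical, and the 429 check is placed before the general 4xx check so that rate limiting is not treated as a permanent error.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    RETRY_AFTER_HEADER = "retry_after_header"
    FAIL_IMMEDIATELY = "fail_immediately"
    FAIL_FAST = "fail_fast"
    DEAD_LETTER = "dead_letter"


def decide(
    status: Optional[int] = None,
    timed_out: bool = False,
    connection_refused: bool = False,
    schema_valid: bool = True,
) -> Action:
    """Map one observed outcome of an external call to a response."""
    if connection_refused:
        return Action.FAIL_FAST
    if timed_out:
        return Action.RETRY_WITH_BACKOFF
    if status == 429:
        return Action.RETRY_AFTER_HEADER
    if status is not None and 500 <= status <= 599:
        return Action.RETRY_WITH_BACKOFF
    if status is not None and 400 <= status <= 499:
        return Action.FAIL_IMMEDIATELY
    if not schema_valid:
        return Action.DEAD_LETTER
    return Action.PROCEED
```

Circuit breaker bookkeeping is deliberately left out here; timeouts, 5xx responses, and refused connections would also be reported to the breaker as failures.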

### Circuit Breaker States

- **Closed**: Normal operation, requests pass through
- **Open**: Failure threshold exceeded, requests fail fast (fallback)
- **Half-Open**: After cooldown, allow a test request; if it succeeds, close the circuit; if it fails, reopen it
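
The three states can be sketched as a small state machine. This version assumes a consecutive-failure threshold and a single trial request in the half-open state. The class and method names are illustrative, and it is not thread-safe.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def allow_request(self) -> bool:
        """Return True if the caller may attempt the external call now."""
        if self.state == "open" and self._clock() - self._opened_at >= self._cooldown:
            self.state = "half_open"  # cooldown elapsed: let one trial through
            return True
        # Closed passes traffic; open and half-open (trial in flight) fail fast.
        return self.state == "closed"

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self._failures += 1
        if self.state == "half_open" or self._failures >= self._threshold:
            self.state = "open"  # trip, or reopen after a failed trial
            self._opened_at = self._clock()
```

Callers check `allow_request()` before each call, report the outcome with `record_success()` or `record_failure()`, and apply the fallback when the request is not allowed.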

### Fallback Strategies

- **Cached data**: Serve stale data from cache (with staleness warning)
- **Default response**: Return a sensible default (for non-critical data)
- **Graceful degradation**: Return partial data if some services are unavailable
- **Queue and retry**: Store the request and process later when the system recovers
- **Fail fast**: Return error immediately (for critical operations that can't be degraded)
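
The cached-data strategy, combined with fail fast when no usable cache exists, can be sketched as a wrapper around the external fetch. `CachedFallback` and `max_stale_seconds` are hypothetical names, and the staleness flag is returned so the caller can surface the warning.

```python
import time
from typing import Any, Callable, Optional


class CachedFallback:
    def __init__(
        self,
        fetch: Callable[[], Any],
        max_stale_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cached: Any = None
        self._cached_at: Optional[float] = None

    def get(self) -> tuple:
        """Return (value, is_stale); re-raise if no acceptable cached value exists."""
        try:
            value = self._fetch()
        except Exception:
            fresh_enough = (
                self._cached_at is not None
                and self._clock() - self._cached_at <= self._max_stale
            )
            if fresh_enough:
                return self._cached, True  # serve stale data, flagged as stale
            raise  # nothing usable in cache: fail fast
        self._cached, self._cached_at = value, self._clock()
        return value, False
```

Bounding staleness with `max_stale_seconds` keeps the fallback from silently serving arbitrarily old data during a long outage.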

## Anti-Patterns

- **Synchronous chain of external calls**: Minimize synchronous external calls in the request path
- **Missing timeout on outbound calls**: Always set a timeout, never wait indefinitely
- **Missing circuit breaker for external systems**: External failures must not cascade
- **Missing idempotency for retries**: Without idempotency, retries can cause duplicate processing
- **Missing rate limiting for outbound calls**: Unthrottled calls will hit external system rate limits
- **Missing data transformation at boundary**: External data models must not leak into internal models
- **Missing monitoring on external calls**: External call latency and error rates must be tracked