| name | description |
|---|---|
| migration-rollout-design | Knowledge contract for migration and rollout design. Provides principles and patterns for backward compatibility, rollout strategies, canary deployments, feature flags, schema evolution, and rollback. Referenced by design-architecture when defining migration and rollout strategy. |
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing migration and rollout strategies. It does not produce artifacts directly.
Core Principles
Backward Compatibility First
- New versions must coexist with old versions during migration
- APIs must be backward-compatible until all consumers have migrated
- Database schemas must support both old and new code during migration
- Never break existing functionality during migration
Incremental Over Big-Bang
- Migrate incrementally, one step at a time
- Each step must be independently deployable and reversible
- Test each step before proceeding to the next
- Big-bang migrations have higher risk and harder rollback
Rollback by Default
- Every migration step must have a clear rollback plan
- Practice rollback before you need it
- Automated rollback is preferred over manual rollback
- Feature flags enable instant rollback without deployment
Rollout Strategies
Blue-Green Deployment
- Maintain two identical environments (blue and green)
- Deploy new version to the inactive environment
- Switch traffic from active to inactive environment
- If issues are detected, switch traffic back
- Best for: Infrastructure-level deployments with full environment replication
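A minimal sketch of the blue-green switch described above. The `BlueGreenRouter` class and the version labels are hypothetical; a real setup would flip a load balancer or DNS target rather than an in-memory field.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""

    active: str = "blue"
    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # New versions only ever go to the environment that serves no traffic.
        self.versions[self.inactive] = version

    def switch(self) -> None:
        # Cutover and rollback are the same operation: flip the pointer.
        self.active = self.inactive

    def live_version(self) -> str:
        return self.versions[self.active]


router = BlueGreenRouter()
router.deploy("v2")                      # green holds v2, blue still serves v1
router.switch()                          # cut over to green
after_cutover = router.live_version()
router.switch()                          # issue detected: switch back to blue
after_rollback = router.live_version()
```

Because the old environment is left untouched, switching back needs no redeploy.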
Canary Deployment
- Route a small percentage of traffic to the new version, then increase it in stages (for example 1%, 5%, 10%, 25%, 50%, 100%)
- Monitor metrics at each stage before increasing traffic
- If issues are detected, shift traffic back to the old version
- Best for: Application-level deployments where you want to test with real traffic gradually
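The staged progression can be sketched as a loop with a metric gate. This is an illustration only: `observe_error_rate` and the 1% threshold are assumptions, and a real rollout would also wait for a soak period at each stage.

```python
STAGES = [1, 5, 10, 25, 50, 100]  # percent of traffic on the new version


def run_canary(observe_error_rate, max_error_rate=0.01):
    """Walk the stages; return (final_percent, stages_visited).

    observe_error_rate(percent) is a hypothetical callback that reports the
    new version's error rate while it receives `percent` of traffic.
    """
    visited = []
    for percent in STAGES:
        visited.append(percent)
        if observe_error_rate(percent) > max_error_rate:
            # Gate failed: shift all traffic back to the old version.
            return 0, visited
    return 100, visited


healthy = run_canary(lambda percent: 0.001)
# Simulate a defect that only shows up under heavier load.
degraded = run_canary(lambda percent: 0.05 if percent >= 25 else 0.001)
```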
Rolling Deployment
- Deploy new version to instances one at a time (or in small batches)
- Old and new versions run side by side during the rollout
- If issues are detected, stop the rollout and roll back the updated instances
- Best for: Stateless services where instances can be updated independently
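A rough sketch of a batched rolling update with stop-and-roll-back behavior. The fleet is modeled as a plain dictionary and `is_healthy` is a hypothetical health check; an orchestrator would normally do this work.

```python
def rolling_update(instances, new_version, batch_size, is_healthy):
    """Update instances batch by batch; roll back everything updated on failure.

    instances maps instance name -> running version and is modified in place.
    Returns True if the whole fleet reached new_version.
    """
    names = sorted(instances)
    previous = dict(instances)  # remember what to roll back to
    updated = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            instances[name] = new_version
            updated.append(name)
        if not all(is_healthy(name) for name in batch):
            # Stop the rollout and restore every instance touched so far.
            for name in updated:
                instances[name] = previous[name]
            return False
    return True


fleet_ok = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
ok = rolling_update(fleet_ok, "v2", 2, lambda name: True)

fleet_bad = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
bad = rolling_update(fleet_bad, "v2", 2, lambda name: name != "c")
```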
Feature Flag Deployment
- Deploy new code with features disabled (feature flags set to false)
- Enable features gradually using feature flags
- Can enable per-user, per-tenant, per-percentage
- If issues are detected, disable the feature flag instantly
- Best for: Feature-level deployments where you want to decouple code deployment from feature release
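Per-user, per-tenant, and per-percentage targeting can be sketched with a stable hash bucket. The function names and the flag name `new-checkout` are made up for illustration; a flag management service would provide its own equivalent.

```python
import hashlib


def bucket(flag, user_id):
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, tenant_id, *, percent=0,
               users=frozenset(), tenants=frozenset()):
    # Explicit allow-lists win over the percentage rollout.
    if user_id in users or tenant_id in tenants:
        return True
    # The same user always lands in the same bucket, so raising `percent`
    # only ever adds users; nobody flips back and forth between stages.
    return bucket(flag, user_id) < percent


off = is_enabled("new-checkout", "u1", "t1", percent=0)
everyone = is_enabled("new-checkout", "u1", "t1", percent=100)
by_user = is_enabled("new-checkout", "u1", "t1", users={"u1"})
by_tenant = is_enabled("new-checkout", "u2", "t1", tenants={"t1"})
```

Setting `percent=0` with empty allow-lists is the "deployed but disabled" state, and returning to it is the instant rollback.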
Feature Flags
Types of Feature Flags
- Release flags: Enable/disable new features during rollout (short-lived)
- Operational flags: Enable/disable operational features (circuit breakers, maintenance mode)
- Experiment flags: A/B testing and gradual rollout (medium-lived)
- Permission flags: Enable features for specific users/tenants (long-lived)
Design Considerations
- Feature flags must not add significant latency (evaluate quickly)
- Feature flag evaluation must be consistent within a request (don't re-evaluate mid-request)
- Feature flags must have a defined lifecycle: create, enable, monitor, remove
- Remove feature flags after full rollout to prevent technical debt
- Use a feature flag management service (not hardcoded flags)
- Log feature flag evaluations for debugging
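One way to keep evaluation consistent within a request is to snapshot each flag the first time it is read. The sketch below assumes a hypothetical `lookup` callable standing in for the flag service client, and logs each first evaluation.

```python
import logging

logger = logging.getLogger("feature_flags")


class RequestFlags:
    """Per-request view of flags: each flag is evaluated at most once."""

    def __init__(self, lookup):
        self._lookup = lookup  # hypothetical: flag name -> truthy/falsy value
        self._cache = {}

    def is_enabled(self, flag):
        if flag not in self._cache:
            value = bool(self._lookup(flag))
            self._cache[flag] = value
            # Logged once per request so the decision can be traced later.
            logger.info("flag=%s value=%s", flag, value)
        return self._cache[flag]


store = {"new-checkout": True}           # stands in for the flag service
request_flags = RequestFlags(store.get)
first = request_flags.is_enabled("new-checkout")
store["new-checkout"] = False            # flag flipped mid-request
second = request_flags.is_enabled("new-checkout")
next_request = RequestFlags(store.get).is_enabled("new-checkout")
```

The cached lookup also keeps repeated checks cheap, which helps with the latency concern.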
Feature Flag Rollout
- Start with 0% (flag off)
- Enable for internal users (dogfood)
- Enable for a small percentage of users (canary)
- Enable for all users (full rollout)
- Monitor metrics at each stage
- Remove the flag after full rollout
Schema Evolution
Additive Changes (Safe)
- Add a new column with a default value
- Add a new table
- Add a new index (with caution for large tables)
- Add a new optional field to an API response
- Add a new API endpoint
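To illustrate why an added column with a default is safe, the sketch below uses an in-memory SQLite database. The `users` table and `plan` column are hypothetical; behavior on other databases, especially for large tables, may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")


def old_insert(email):
    # Old code knows nothing about the new column.
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


old_insert("a@example.com")

# Additive change: the default covers existing rows and old writers alike.
conn.execute("ALTER TABLE users ADD COLUMN plan TEXT NOT NULL DEFAULT 'free'")

old_insert("b@example.com")  # the old write path still works unchanged
rows = conn.execute("SELECT email, plan FROM users ORDER BY id").fetchall()
old_read = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
```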
Destructive Changes (Require Migration)
- Remove a column (requires migration)
- Rename a column (requires migration)
- Change a column type (requires migration)
- Remove a table (requires migration)
- Remove an API endpoint (requires consumer migration)
Migration Strategy for Destructive Changes
- Expand: Add the new structure alongside the old (both exist)
- Migrate: Migrate data and code to use the new structure (both exist)
- Contract: Remove the old structure (only new exists)
Example: Renaming a column
- Add the new column, keep the old column, and dual-write to both
- Backfill existing rows from the old column to the new column
- Update all reads to use the new column
- Stop writing to the old column, then remove it
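The rename example can be walked through against an in-memory SQLite database. This is a sketch under assumptions: the column names `fullname` and `display_name` are invented, and in practice each step ships as a separate deployment rather than one script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column next to the old one.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def create_user(name):
    # Dual-write so old and new readers both see complete data.
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user("Grace Hopper")

# Migrate: backfill rows written before dual-writing started.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Reads now use only the new column.
names = [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]

# Contract: only once no deployed code reads or writes `fullname`.
# DROP COLUMN needs SQLite 3.35.0 or newer, so the step is guarded here.
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE users DROP COLUMN fullname")

columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

After the contract step, `create_user` as written would fail, which is why writers must drop the old column from their statements first.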
Database Migration Best Practices
- Every migration must be reversible (up and down migration)
- Test migrations against production-like data volumes
- Run migrations in a transaction when possible
- For large tables, use online schema change tools (pt-online-schema-change, gh-ost)
- Never lock a production table for more than a few seconds during a migration
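A reversible, transactional migration might look like the sketch below. It uses SQLite, where DDL can run inside a transaction; some databases commit DDL implicitly, so this is not universal. The `audit_log` and `scratch` tables and the `apply` helper are hypothetical.

```python
import sqlite3

MIGRATION = {
    "up": ["CREATE TABLE audit_log (id INTEGER PRIMARY KEY, message TEXT NOT NULL)"],
    "down": ["DROP TABLE audit_log"],
}


def apply(conn, statements):
    """Run all statements in one transaction; undo everything on failure."""
    conn.execute("BEGIN")
    try:
        for statement in statements:
            conn.execute(statement)
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


def tables(conn):
    query = "SELECT name FROM sqlite_master WHERE type = 'table'"
    return {row[0] for row in conn.execute(query)}


# isolation_level=None gives explicit control over BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
apply(conn, MIGRATION["up"])
after_up = tables(conn)

# A migration that fails halfway leaves no partial changes behind.
try:
    apply(conn, ["CREATE TABLE scratch (id INTEGER)"] + MIGRATION["up"])
except sqlite3.Error:
    pass
after_failed = tables(conn)

apply(conn, MIGRATION["down"])
after_down = tables(conn)
```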
Rollback
Application Rollback
- Revert to previous deployment version
- Feature flag disable (instant, no deployment needed)
- Blue-green switch (instant, requires environment)
- Canary shift-back (requires redirecting traffic)
- Rolling redeploy of previous version (requires new deployment)
Database Rollback
- Run the down migration (reverse of up migration)
- Restore from backup (for destructive changes without down migration)
- Feature flag to disable new code that uses new schema (code rollback, schema stays)
Rollback Decision Matrix
| What Failed | Rollback Method | Data Loss Risk |
|---|---|---|
| Application bug | Deploy previous version | None |
| Feature bug | Disable feature flag | None |
| Schema migration bug | Run down migration | Low if reversible |
| Data migration bug | Restore from backup | High if not reversible |
| Integration failure | Circuit breaker / fallback | None |
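The matrix can also be encoded as a lookup so runbooks or tooling pick the same method every time. The enum and function names below are illustrative; the values are copied from the table.

```python
from enum import Enum


class Failure(Enum):
    APPLICATION_BUG = "application bug"
    FEATURE_BUG = "feature bug"
    SCHEMA_MIGRATION_BUG = "schema migration bug"
    DATA_MIGRATION_BUG = "data migration bug"
    INTEGRATION_FAILURE = "integration failure"


# failure -> (rollback method, data loss risk), mirroring the matrix rows
ROLLBACK_PLAN = {
    Failure.APPLICATION_BUG: ("Deploy previous version", "None"),
    Failure.FEATURE_BUG: ("Disable feature flag", "None"),
    Failure.SCHEMA_MIGRATION_BUG: ("Run down migration", "Low if reversible"),
    Failure.DATA_MIGRATION_BUG: ("Restore from backup", "High if not reversible"),
    Failure.INTEGRATION_FAILURE: ("Circuit breaker / fallback", "None"),
}


def choose_rollback(failure):
    """Return (method, data_loss_risk) for a classified failure."""
    return ROLLBACK_PLAN[failure]


method, risk = choose_rollback(Failure.FEATURE_BUG)
```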
Anti-Patterns
- Big-bang migration: Migrating everything at once has high risk and hard rollback
- Breaking API changes without versioning: Old clients will break
- Schema migration without backward compatibility: Old code will fail against new schema
- Deploying without feature flags: Can't instantly roll back if issues are detected
- Not testing rollback: Rollback must be tested before you need it
- Removing old code before consumers have migrated: Premature removal breaks dependencies
- Not monitoring during rollout: Issues must be detected quickly to prevent wider impact