---
name: migration-rollout-design
description: "Knowledge contract for migration and rollout design. Provides principles and patterns for backward compatibility, rollout strategies, canary deployments, feature flags, schema evolution, and rollback. Referenced by design-architecture when defining migration and rollout strategy."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing migration and rollout strategies. It does not produce artifacts directly.

## Core Principles

### Backward Compatibility First

- New versions must coexist with old versions during migration
- APIs must be backward-compatible until all consumers have migrated
- Database schemas must support both old and new code during migration
- Never break existing functionality during migration

### Incremental Over Big-Bang

- Migrate incrementally, one step at a time
- Each step must be independently deployable and reversible
- Test each step before proceeding to the next
- Big-bang migrations carry higher risk and are harder to roll back

### Rollback by Default

- Every migration step must have a clear rollback plan
- Practice rollback before you need it
- Automated rollback is preferred over manual rollback
- Feature flags enable instant rollback without deployment

## Rollout Strategies

### Blue-Green Deployment

- Maintain two identical environments (blue and green)
- Deploy new version to the inactive environment
- Switch traffic from active to inactive environment
- If issues are detected, switch traffic back
- **Best for**: Infrastructure-level deployments with full environment replication

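
The blue-green switch can be modeled as a pointer flip between two environments. The following is a minimal sketch, not a real load balancer integration; `BlueGreenRouter` and its methods are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic (illustrative only)."""
    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})
    active: str = "blue"

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # New versions only go to the environment that is not serving traffic.
        self.versions[self.inactive] = version

    def switch(self) -> None:
        # Cut over (or roll back) by flipping the active pointer.
        self.active = self.inactive

    def live_version(self) -> str:
        return self.versions[self.active]


router = BlueGreenRouter()
router.deploy("v2")          # green now holds v2, blue still serves v1
router.switch()              # cut over to green
cutover_version = router.live_version()
router.switch()              # issues detected: switch back to blue
rolled_back_version = router.live_version()
```

Because the old environment is left untouched, switching back is the same operation as cutting over.
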
### Canary Deployment

- Route a small percentage of traffic to the new version, then increase it in stages (for example 1%, 5%, 10%, 25%, 50%, 100%)
- Monitor metrics at each stage before increasing traffic
- If issues are detected, shift traffic back to the old version
- **Best for**: Application-level deployments where you want to test with real traffic gradually

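
The staged progression can be sketched as a loop that checks a health metric before each increase. This is an illustration under assumptions: `run_canary`, its callbacks, and the 1% error-rate threshold are hypothetical, and a real rollout would also wait for a soak period at each stage before reading metrics.

```python
from typing import Callable, Sequence

STAGES = (1, 5, 10, 25, 50, 100)  # percent of traffic sent to the new version


def run_canary(
    set_traffic_percent: Callable[[int], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,   # assumed threshold, tune per service
    stages: Sequence[int] = STAGES,
) -> bool:
    """Advance stage by stage; shift all traffic back on the first unhealthy reading."""
    for percent in stages:
        set_traffic_percent(percent)
        if error_rate() > max_error_rate:
            set_traffic_percent(0)  # old version serves everything again
            return False
    return True


# Stand-in callbacks: record the traffic settings and replay canned metric readings.
history = []
readings = iter([0.001, 0.002, 0.05])
succeeded = run_canary(history.append, lambda: next(readings))
```

In this run the third reading exceeds the threshold, so the rollout stops at 10% and traffic returns to the old version.
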
### Rolling Deployment

- Deploy new version to instances one at a time (or in small batches)
- Old and new versions run side by side during the rollout
- If issues are detected, stop the rollout and roll back the updated instances
- **Best for**: Stateless services where instances can be updated independently

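
A rolling deployment with stop-and-roll-back behavior might look like the sketch below. The instance dictionaries, `rolling_deploy`, and the health check are all hypothetical stand-ins for whatever the orchestrator provides.

```python
from typing import Callable, List


def rolling_deploy(
    instances: List[dict],
    new_version: str,
    healthy: Callable[[dict], bool],
    batch_size: int = 1,
) -> bool:
    """Update instances batch by batch; on failure, restore every updated instance."""
    updated = []  # (instance, previous_version) pairs, kept for rollback
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for instance in batch:
            updated.append((instance, instance["version"]))
            instance["version"] = new_version
        if not all(healthy(instance) for instance in batch):
            # Stop the rollout and roll back everything touched so far.
            for instance, previous in updated:
                instance["version"] = previous
            return False
    return True


fleet = [{"name": f"web-{i}", "version": "v1"} for i in range(4)]


def fails_on_web_2(instance: dict) -> bool:
    # Hypothetical health check: web-2 is unhealthy once it runs v2.
    return not (instance["name"] == "web-2" and instance["version"] == "v2")


ok = rolling_deploy(fleet, "v2", fails_on_web_2)
```
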
### Feature Flag Deployment

- Deploy new code with features disabled (feature flags set to false)
- Enable features gradually using feature flags
- Flags can be enabled per user, per tenant, or for a percentage of traffic
- If issues are detected, disable the feature flag instantly
- **Best for**: Feature-level deployments where you want to decouple code deployment from feature release

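
Per-user, per-tenant, and percentage targeting can be sketched with a stable hash, so the same user always lands in the same bucket. This is one possible approach, not the API of any particular flag service; `FlagRule`, `bucket`, and `is_enabled` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FlagRule:
    enabled: bool = False                        # master switch; False disables for everyone
    users: set = field(default_factory=set)      # explicitly enabled user ids
    tenants: set = field(default_factory=set)    # explicitly enabled tenant ids
    percentage: int = 0                          # 0-100, share of remaining users


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag_name: str, rule: FlagRule, user_id: str, tenant_id: str) -> bool:
    if not rule.enabled:
        return False                             # instant kill switch
    if user_id in rule.users or tenant_id in rule.tenants:
        return True
    return bucket(flag_name, user_id) < rule.percentage


rule = FlagRule(enabled=True, users={"alice"}, tenants={"acme"}, percentage=0)
alice_on = is_enabled("new-checkout", rule, "alice", "other-co")
acme_on = is_enabled("new-checkout", rule, "bob", "acme")
bob_on = is_enabled("new-checkout", rule, "bob", "other-co")
```

Including the flag name in the hash input keeps bucket assignments independent across flags, so the same users are not always the first to receive every new feature.
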
## Feature Flags

### Types of Feature Flags

- **Release flags**: Enable/disable new features during rollout (short-lived)
- **Operational flags**: Enable/disable operational features (circuit breakers, maintenance mode)
- **Experiment flags**: A/B testing and gradual rollout (medium-lived)
- **Permission flags**: Enable features for specific users/tenants (long-lived)

### Design Considerations

- Feature flags must not add significant latency (evaluate quickly)
- Feature flag evaluation must be consistent within a request (don't re-evaluate mid-request)
- Feature flags must have a defined lifecycle: create, enable, monitor, remove
- Remove feature flags after full rollout to prevent technical debt
- Use a feature flag management service (not hardcoded flags)
- Log feature flag evaluations for debugging

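
Consistency within a request and logging of evaluations can both be handled by a per-request wrapper that evaluates each flag at most once. A minimal sketch, assuming a hypothetical `evaluate` callback that stands in for the flag service client:

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("feature_flags")


class RequestFlags:
    """Per-request view of flags: each flag is evaluated at most once."""

    def __init__(self, evaluate: Callable[[str], bool], request_id: str):
        self._evaluate = evaluate
        self._request_id = request_id
        self._cache: Dict[str, bool] = {}

    def is_enabled(self, name: str) -> bool:
        if name not in self._cache:
            value = self._evaluate(name)
            self._cache[name] = value
            # Log once per flag per request, which helps when debugging later.
            logger.info("flag=%s value=%s request=%s", name, value, self._request_id)
        return self._cache[name]


# Simulate a flag being switched off while a request is still in flight.
state = {"new-checkout": True}
flags = RequestFlags(lambda name: state.get(name, False), request_id="req-1")
first = flags.is_enabled("new-checkout")
state["new-checkout"] = False
second = flags.is_enabled("new-checkout")   # still the value seen at first evaluation
```

The cached lookup also keeps repeated checks cheap, which addresses the latency concern for flags read many times in one request.
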
### Feature Flag Rollout

- Start with 0% (flag off)
- Enable for internal users (dogfood)
- Enable for a small percentage of users (canary)
- Enable for all users (full rollout)
- Monitor metrics at each stage
- Remove the flag after full rollout

## Schema Evolution

### Additive Changes (Safe)

- Add a new column with a default value
- Add a new table
- Add a new index (with caution for large tables)
- Add a new optional field to an API response
- Add a new API endpoint

### Destructive Changes (Require Migration)

- Remove a column (requires migration)
- Rename a column (requires migration)
- Change a column type (requires migration)
- Remove a table (requires migration)
- Remove an API endpoint (requires consumer migration)

### Migration Strategy for Destructive Changes

1. **Expand**: Add the new structure alongside the old (both exist)
2. **Migrate**: Migrate data and code to use the new structure (both exist)
3. **Contract**: Remove the old structure (only new exists)

Example: Renaming a column

1. Add new column, keep old column, dual-write to both
2. Migrate existing data from old to new column
3. Update all reads to use new column
4. Stop writing to the old column, then remove it

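
The column rename can be walked through end to end with SQLite from the Python standard library. This is a sketch under assumptions: the `users` table and the `fullname` and `display_name` columns are invented for illustration, and in production each numbered step would ship as a separate deployment rather than run in one script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO users (fullname) VALUES (?)", [("Ada",), ("Grace",)])

# 1. Expand: add the new column next to the old one.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def create_user(name: str) -> None:
    # Dual-write: code that is still reading the old column keeps working.
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user("Linus")

# 2. Migrate: backfill rows written before dual-writing started.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# 3. Reads now use the new column only.
names = [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]

# 4. Contract: once nothing writes or reads the old column, drop it.
#    DROP COLUMN needs SQLite 3.35 or newer, so older builds skip this step.
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE users DROP COLUMN fullname")

columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

On a large table the backfill in step 2 would normally run in small batches to avoid holding locks for long.
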
### Database Migration Best Practices

- Every migration must be reversible (up and down migration)
- Test migrations against production-like data volumes
- Run migrations in a transaction when possible
- For large tables, use online schema change tools (for MySQL: pt-online-schema-change, gh-ost)
- Never lock a production table for more than a few seconds during a migration

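
A reversible migration run inside a transaction can be sketched with SQLite, which supports transactional DDL. The `audit_log` table and the helper names below are hypothetical; real projects would usually rely on their migration framework for this.

```python
import sqlite3


def up(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, message TEXT NOT NULL)"
    )


def down(conn: sqlite3.Connection) -> None:
    # Exact reverse of up().
    conn.execute("DROP TABLE audit_log")


def run_in_transaction(conn: sqlite3.Connection, step) -> None:
    """Apply a migration step atomically: commit on success, roll back on any error."""
    conn.execute("BEGIN")
    try:
        step(conn)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


# isolation_level=None turns off implicit transactions, so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
run_in_transaction(conn, up)
exists_after_up = table_exists(conn, "audit_log")
run_in_transaction(conn, down)
exists_after_down = table_exists(conn, "audit_log")
```

Not every database can roll back DDL inside a transaction, which is why the guidance above says "when possible".
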
## Rollback

### Application Rollback

- Revert to previous deployment version
- Feature flag disable (instant, no deployment needed)
- Blue-green switch (instant, requires environment)
- Canary shift-back (requires redirecting traffic)
- Rolling redeploy of previous version (requires new deployment)

### Database Rollback

- Run the down migration (reverse of up migration)
- Restore from backup (for destructive changes without down migration)
- Feature flag to disable new code that uses new schema (code rollback, schema stays)

### Rollback Decision Matrix

| What Failed | Rollback Method | Data Loss Risk |
|-------------|-----------------|----------------|
| Application bug | Deploy previous version | None |
| Feature bug | Disable feature flag | None |
| Schema migration bug | Run down migration | Low if reversible |
| Data migration bug | Restore from backup | High if not reversible |
| Integration failure | Circuit breaker / fallback | None |

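
If the matrix needs to be consulted from runbook tooling, it can be encoded as plain data. The sketch below copies the rows of the table; the `Failure` enum and `rollback_plan` function are hypothetical names.

```python
from enum import Enum


class Failure(Enum):
    APPLICATION_BUG = "application bug"
    FEATURE_BUG = "feature bug"
    SCHEMA_MIGRATION_BUG = "schema migration bug"
    DATA_MIGRATION_BUG = "data migration bug"
    INTEGRATION_FAILURE = "integration failure"


# (rollback method, data loss risk), one entry per row of the matrix.
ROLLBACK_MATRIX = {
    Failure.APPLICATION_BUG: ("Deploy previous version", "None"),
    Failure.FEATURE_BUG: ("Disable feature flag", "None"),
    Failure.SCHEMA_MIGRATION_BUG: ("Run down migration", "Low if reversible"),
    Failure.DATA_MIGRATION_BUG: ("Restore from backup", "High if not reversible"),
    Failure.INTEGRATION_FAILURE: ("Circuit breaker / fallback", "None"),
}


def rollback_plan(failure: Failure) -> str:
    method, risk = ROLLBACK_MATRIX[failure]
    return f"{method} (data loss risk: {risk})"


plan = rollback_plan(Failure.FEATURE_BUG)
```
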
## Anti-Patterns

- **Big-bang migration**: Migrating everything at once is high-risk and hard to roll back
- **Breaking API changes without versioning**: Old clients will break
- **Schema migration without backward compatibility**: Old code will fail against new schema
- **Deploying without feature flags**: Can't instantly roll back if issues are detected
- **Not testing rollback**: Rollback must be tested before you need it
- **Removing old code before consumers have migrated**: Premature removal breaks dependencies
- **Not monitoring during rollout**: Issues must be detected quickly to prevent wider impact