opencode-workflow/skills/migration-rollout-design/SKILL.md


---
name: migration-rollout-design
description: "Knowledge contract for migration and rollout design. Provides principles and patterns for backward compatibility, rollout strategies, canary deployments, feature flags, schema evolution, and rollback. Referenced by design-architecture when defining migration and rollout strategy."
---
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing migration and rollout strategies. It does not produce artifacts directly.
## Core Principles
### Backward Compatibility First
- New versions must coexist with old versions during migration
- APIs must be backward-compatible until all consumers have migrated
- Database schemas must support both old and new code during migration
- Never break existing functionality during migration
### Incremental Over Big-Bang
- Migrate incrementally, one step at a time
- Each step must be independently deployable and reversible
- Test each step before proceeding to the next
- Big-bang migrations carry higher risk and are harder to roll back
### Rollback by Default
- Every migration step must have a clear rollback plan
- Practice rollback before you need it
- Automated rollback is preferred over manual rollback
- Feature flags enable instant rollback without deployment
## Rollout Strategies
### Blue-Green Deployment
- Maintain two identical environments (blue and green)
- Deploy new version to the inactive environment
- Switch traffic from active to inactive environment
- If issues are detected, switch traffic back
- **Best for**: Infrastructure-level deployments with full environment replication
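
The switch described above can be sketched as a pointer flip between two environments. This is a minimal illustration, not a real load balancer integration; `BlueGreenRouter` and its methods are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic (hypothetical helper)."""

    versions: dict  # environment name -> deployed version
    active: str = "blue"

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # New versions only land on the environment that is not serving traffic.
        self.versions[self.inactive] = version

    def switch(self) -> None:
        # Cutover and rollback are the same operation: flip the active pointer.
        self.active = self.inactive

    def serving(self) -> str:
        return self.versions[self.active]


router = BlueGreenRouter(versions={"blue": "v1", "green": "v1"})
router.deploy("v2")               # green now holds v2, blue still serves v1
router.switch()                   # cutover: green is live
after_cutover = router.serving()
router.switch()                   # issue detected: flip back to blue
after_rollback = router.serving()
```

Because the previous version stays deployed on the other environment, rolling back needs no new deployment.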
### Canary Deployment
- Route a small percentage of traffic to the new version, then increase it in stages (for example 1%, 5%, 10%, 25%, 50%, 100%)
- Monitor metrics at each stage before increasing traffic
- If issues are detected, shift traffic back to the old version
- **Best for**: Application-level deployments where you want to test with real traffic gradually
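
A staged canary with a metric gate might look roughly like the sketch below. The error-rate threshold and the `error_rate_at` callback are assumptions for illustration; a real rollout would read metrics from a monitoring system and wait between stages.

```python
CANARY_STAGES = [1, 5, 10, 25, 50, 100]  # percent of traffic on the new version


def run_canary(error_rate_at, max_error_rate=0.01):
    """Advance through the stages and return (final_percent, completed).

    error_rate_at is a caller-supplied function (hypothetical):
    traffic percent -> observed error rate at that stage.
    """
    for percent in CANARY_STAGES:
        if error_rate_at(percent) > max_error_rate:
            # Gate failed: shift all traffic back to the old version.
            return 0, False
    return 100, True


healthy = run_canary(lambda percent: 0.001)
# Simulated regression that only shows up once 25% of traffic is on the new version.
regressed = run_canary(lambda percent: 0.05 if percent >= 25 else 0.0)
```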
### Rolling Deployment
- Deploy new version to instances one at a time (or in small batches)
- Old and new versions run side by side during the rollout
- If issues are detected, stop the rollout and roll back the updated instances
- **Best for**: Stateless services where instances can be updated independently
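
The batch-by-batch update and the stop-and-roll-back behavior can be sketched as follows. The fleet is modeled as a plain dictionary and `is_healthy` is a hypothetical caller-supplied health check; a real orchestrator would also drain connections and wait for readiness.

```python
def rolling_deploy(instances, new_version, is_healthy, batch_size=1):
    """Update instances batch by batch; roll back the updated ones on failure.

    instances: dict of instance name -> current version (mutated in place).
    is_healthy: caller-supplied check, (name, version) -> bool.
    Returns True if the rollout completed, False if it was rolled back.
    """
    previous = dict(instances)  # remember what each instance ran before
    updated = []
    names = list(instances)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            instances[name] = new_version
            updated.append(name)
        if not all(is_healthy(name, new_version) for name in batch):
            # Stop the rollout and restore only the instances already touched.
            for name in updated:
                instances[name] = previous[name]
            return False
    return True


fleet = {"a": "v1", "b": "v1", "c": "v1"}
completed = rolling_deploy(fleet, "v2", lambda name, version: True)

broken_fleet = {"a": "v1", "b": "v1", "c": "v1"}
# Simulate instance "b" failing its health check on the new version.
rolled_back = rolling_deploy(broken_fleet, "v2", lambda name, version: name != "b")
```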
### Feature Flag Deployment
- Deploy new code with features disabled (feature flags set to false)
- Enable features gradually using feature flags
- Flags can be enabled per user, per tenant, or for a percentage of traffic
- If issues are detected, disable the feature flag instantly
- **Best for**: Feature-level deployments where you want to decouple code deployment from feature release
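
One way to implement per-user, per-tenant, and percentage targeting is a deterministic hash bucket, sketched below. The flag dictionary shape (`enabled`, `users`, `tenants`, `percent`) is an assumption made for this example, not the schema of any particular flag service.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: dict, user_id: str, tenant_id: str) -> bool:
    """Evaluate a flag rule (hypothetical shape): users, tenants, then percentage."""
    if not flag.get("enabled", False):
        return False  # kill switch: turns the feature off for everyone
    if user_id in flag.get("users", ()):
        return True
    if tenant_id in flag.get("tenants", ()):
        return True
    return bucket(flag["name"], user_id) < flag.get("percent", 0)


flag = {
    "name": "new_checkout",
    "enabled": True,
    "users": {"alice"},
    "tenants": {"acme"},
    "percent": 0,
}
alice_sees_it = is_enabled(flag, "alice", "other")   # targeted by user
acme_sees_it = is_enabled(flag, "bob", "acme")       # targeted by tenant
bob_sees_it = is_enabled(flag, "bob", "other")       # 0% rollout, not targeted
```

Hashing the flag name together with the user ID keeps each user's assignment stable across requests while giving different flags independent buckets.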
## Feature Flags
### Types of Feature Flags
- **Release flags**: Enable/disable new features during rollout (short-lived)
- **Operational flags**: Enable/disable operational features (circuit breakers, maintenance mode)
- **Experiment flags**: A/B testing and gradual rollout (medium-lived)
- **Permission flags**: Enable features for specific users/tenants (long-lived)
### Design Considerations
- Feature flags must not add significant latency (evaluate quickly)
- Feature flag evaluation must be consistent within a request (don't re-evaluate mid-request)
- Feature flags must have a defined lifecycle: create, enable, monitor, remove
- Remove feature flags after full rollout to prevent technical debt
- Use a feature flag management service (not hardcoded flags)
- Log feature flag evaluations for debugging
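
Consistency within a request and logging of evaluations can both be handled by resolving each flag once per request and caching the result. The sketch below assumes a caller-supplied `lookup` function standing in for the flag management service; `RequestFlags` is a hypothetical helper.

```python
import logging

logger = logging.getLogger("feature_flags")


class RequestFlags:
    """Caches flag values for the lifetime of one request (hypothetical helper)."""

    def __init__(self, lookup, request_id: str):
        self._lookup = lookup          # caller-supplied: flag name -> bool or None
        self._request_id = request_id
        self._cache = {}

    def is_enabled(self, name: str) -> bool:
        if name not in self._cache:
            # First evaluation in this request: resolve once and log once.
            self._cache[name] = bool(self._lookup(name))
            logger.info("flag=%s value=%s request=%s",
                        name, self._cache[name], self._request_id)
        return self._cache[name]


live_flags = {"new_checkout": True}
request_flags = RequestFlags(live_flags.get, request_id="req-1")
first = request_flags.is_enabled("new_checkout")
live_flags["new_checkout"] = False     # flag flipped while the request is in flight
second = request_flags.is_enabled("new_checkout")  # same answer as the first call
```

A new request would build a fresh `RequestFlags` and pick up the flipped value, so the kill switch still takes effect quickly.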
### Feature Flag Rollout
- Start with 0% (flag off)
- Enable for internal users (dogfood)
- Enable for a small percentage of users (canary)
- Enable for all users (full rollout)
- Monitor metrics at each stage
- Remove the flag after full rollout
## Schema Evolution
### Additive Changes (Safe)
- Add a new column with a default value
- Add a new table
- Add a new index (with caution for large tables)
- Add a new optional field to an API response
- Add a new API endpoint
### Destructive Changes (Require Migration)
- Remove a column (requires migration)
- Rename a column (requires migration)
- Change a column type (requires migration)
- Remove a table (requires migration)
- Remove an API endpoint (requires consumer migration)
### Migration Strategy for Destructive Changes
1. **Expand**: Add the new structure alongside the old (both exist)
2. **Migrate**: Migrate data and code to use the new structure (both exist)
3. **Contract**: Remove the old structure (only new exists)

Example: renaming a column

1. Add the new column, keep the old column, and dual-write to both
2. Backfill existing data from the old column to the new column
3. Update all reads to use the new column
4. Stop writing to the old column, then remove it
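
The column rename above can be walked through end to end with an in-memory SQLite database. This is a sketch under assumptions: the `users` table and the `name` and `full_name` columns are hypothetical, and the final step rebuilds the table because that works on any SQLite version (other databases would typically use `ALTER TABLE ... DROP COLUMN`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada'), ('Grace')")

# Step 1 (expand): add the new column; the old one stays.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def create_user(name):
    # Dual-write: new rows populate both columns while old readers still exist.
    conn.execute("INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name))


create_user("Linus")

# Step 2 (migrate): backfill rows written before dual-writing began.
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# Step 3: reads switch to the new column.
names = [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]

# Step 4 (contract): once nothing reads or writes the old column, remove it.
# Done here as a table rebuild for portability across SQLite versions.
conn.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT);
    INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

In a real rollout each step would ship as its own deployment, so that any single step can be reverted without touching the others.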
### Database Migration Best Practices
- Every migration must be reversible (up and down migration)
- Test migrations against production-like data volumes
- Run migrations in a transaction when possible
- For large tables, use online schema change tools (pt-online-schema-change, gh-ost)
- Never lock a production table for more than a few seconds during a migration
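
A reversible migration that runs inside a transaction can be sketched with SQLite, where DDL is transactional. The `MIGRATION` dictionary and `run_in_transaction` helper are hypothetical; migration frameworks provide their own equivalents, and some databases (MySQL, for example) commit DDL implicitly, so the transaction guarantee does not apply everywhere.

```python
import sqlite3

# Hypothetical migration: the "up" statement is paired with its inverse.
MIGRATION = {
    "up": "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, message TEXT)",
    "down": "DROP TABLE audit_log",
}


def run_in_transaction(conn, statement):
    """Run one statement atomically: commit on success, roll back on error."""
    conn.execute("BEGIN")
    try:
        conn.execute(statement)
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


def tables(conn):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return sorted(row[0] for row in rows)


# isolation_level=None selects autocommit mode, so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
run_in_transaction(conn, MIGRATION["up"])
after_up = tables(conn)
run_in_transaction(conn, MIGRATION["down"])
after_down = tables(conn)
```

Running the up and down migrations back to back like this is also a cheap way to test that the down migration actually works.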
## Rollback
### Application Rollback
- Revert to previous deployment version
- Feature flag disable (instant, no deployment needed)
- Blue-green switch (instant, requires environment)
- Canary shift-back (requires redirecting traffic)
- Rolling redeploy of previous version (requires new deployment)
### Database Rollback
- Run the down migration (reverse of up migration)
- Restore from backup (for destructive changes without down migration)
- Feature flag to disable new code that uses new schema (code rollback, schema stays)
### Rollback Decision Matrix
| What Failed | Rollback Method | Data Loss Risk |
|-------------|----------------|----------------|
| Application bug | Deploy previous version | None |
| Feature bug | Disable feature flag | None |
| Schema migration bug | Run down migration | Low if reversible |
| Data migration bug | Restore from backup | High if not reversible |
| Integration failure | Circuit breaker / fallback | None |
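
The matrix can also be encoded as a lookup so that runbooks or tooling return the same answer every time. This is only a sketch; the `choose_rollback` function and the lowercase keys are assumptions made for the example.

```python
# Failure type -> (rollback method, data loss risk), mirroring the table above.
ROLLBACK_MATRIX = {
    "application bug": ("deploy previous version", "none"),
    "feature bug": ("disable feature flag", "none"),
    "schema migration bug": ("run down migration", "low if reversible"),
    "data migration bug": ("restore from backup", "high if not reversible"),
    "integration failure": ("circuit breaker / fallback", "none"),
}


def choose_rollback(what_failed: str):
    """Return (method, data_loss_risk) for a failure type, case-insensitively."""
    key = what_failed.strip().lower()
    if key not in ROLLBACK_MATRIX:
        raise ValueError(f"no rollback method defined for: {what_failed!r}")
    return ROLLBACK_MATRIX[key]


method, risk = choose_rollback("Feature bug")
```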
## Anti-Patterns
- **Big-bang migration**: Migrating everything at once is high-risk and hard to roll back
- **Breaking API changes without versioning**: Old clients will break
- **Schema migration without backward compatibility**: Old code will fail against new schema
- **Deploying without feature flags**: Can't roll back instantly if issues are detected
- **Not testing rollback**: Rollback must be tested before you need it
- **Removing old code before consumers have migrated**: Premature removal breaks dependencies
- **Not monitoring during rollout**: Issues must be detected quickly to prevent wider impact