---
name: migration-rollout-design
description: "Knowledge contract for migration and rollout design. Provides principles and patterns for backward compatibility, rollout strategies, canary deployments, feature flags, schema evolution, and rollback. Referenced by design-architecture when defining migration and rollout strategy."
---

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing migration and rollout strategies. It does not produce artifacts directly.

## Core Principles

### Backward Compatibility First

- New versions must coexist with old versions during migration
- APIs must be backward-compatible until all consumers have migrated
- Database schemas must support both old and new code during migration
- Never break existing functionality during migration

### Incremental Over Big-Bang

- Migrate incrementally, one step at a time
- Each step must be independently deployable and reversible
- Test each step before proceeding to the next
- Big-bang migrations carry higher risk and are harder to roll back

### Rollback by Default

- Every migration step must have a clear rollback plan
- Practice rollback before you need it
- Automated rollback is preferred over manual rollback
- Feature flags enable instant rollback without deployment

## Rollout Strategies

### Blue-Green Deployment

- Maintain two identical environments (blue and green)
- Deploy new version to the inactive environment
- Switch traffic from active to inactive environment
- If issues are detected, switch traffic back
- **Best for**: Infrastructure-level deployments with full environment replication

### Canary Deployment

- Deploy new version to a small percentage of traffic, then increase in stages (for example 1%, 5%, 10%, 25%, 50%, 100%)
- Monitor metrics at each stage before increasing traffic
- If issues are detected, shift traffic back to the old version
- **Best for**: Application-level deployments where you want to test with real traffic gradually

### Rolling Deployment

- Deploy new version to instances one at a time (or in small batches)
- Old and new versions run side by side during the rollout
- If issues are detected, stop the rollout and roll back the updated instances
- **Best for**: Stateless services where instances can be updated independently

### Feature Flag Deployment

- Deploy new code with features disabled (feature flags set to false)
- Enable features gradually using feature flags
- Flags can be enabled per user, per tenant, or for a percentage of traffic
- If issues are detected, disable the feature flag instantly
- **Best for**: Feature-level deployments where you want to decouple code deployment from feature release

## Feature Flags

### Types of Feature Flags

- **Release flags**: Enable/disable new features during rollout (short-lived)
- **Operational flags**: Enable/disable operational features (circuit breakers, maintenance mode)
- **Experiment flags**: A/B testing and gradual rollout (medium-lived)
- **Permission flags**: Enable features for specific users/tenants (long-lived)

### Design Considerations

- Feature flags must not add significant latency (evaluate quickly)
- Feature flag evaluation must be consistent within a request (don't re-evaluate mid-request)
- Feature flags must have a defined lifecycle: create, enable, monitor, remove
- Remove feature flags after full rollout to prevent technical debt
- Use a feature flag management service (not hardcoded flags)
- Log feature flag evaluations for debugging

### Feature Flag Rollout

- Start with 0% (flag off)
- Enable for internal users (dogfood)
- Enable for a small percentage of users (canary)
- Enable for all users (full rollout)
- Monitor metrics at each stage
- Remove the flag after full rollout

## Schema Evolution

### Additive Changes (Safe)

- Add a new column with a default value
- Add a new table
- Add a new index (with caution for large tables)
- Add a new optional field to an API response
- Add a new API endpoint

### Destructive Changes (Require Migration)

- Remove a column (requires migration)
- Rename a column (requires migration)
- Change a column type (requires migration)
- Remove a table (requires migration)
- Remove an API endpoint (requires consumer migration)

### Migration Strategy for Destructive Changes

1. **Expand**: Add the new structure alongside the old (both exist)
2. **Migrate**: Migrate data and code to use the new structure (both exist)
3. **Contract**: Remove the old structure (only new exists)

Example: Renaming a column

1. Add new column, keep old column, dual-write to both
2. Migrate existing data from old to new column
3. Update all reads to use new column
4. Remove old column

### Database Migration Best Practices

- Every migration must be reversible (up and down migration)
- Test migrations against production-like data volumes
- Run migrations in a transaction when possible
- For large tables, use online schema change tools (pt-online-schema-change, gh-ost)
- Never lock a production table for more than a few seconds during a migration

## Rollback

### Application Rollback

- Revert to previous deployment version
- Feature flag disable (instant, no deployment needed)
- Blue-green switch (instant, requires environment)
- Canary shift-back (requires redirecting traffic)
- Rolling redeploy of previous version (requires new deployment)

### Database Rollback

- Run the down migration (reverse of up migration)
- Restore from backup (for destructive changes without down migration)
- Feature flag to disable new code that uses new schema (code rollback, schema stays)

### Rollback Decision Matrix

| What Failed | Rollback Method | Data Loss Risk |
|-------------|----------------|----------------|
| Application bug | Deploy previous version | None |
| Feature bug | Disable feature flag | None |
| Schema migration bug | Run down migration | Low if reversible |
| Data migration bug | Restore from backup | High if not reversible |
| Integration failure | Circuit breaker / fallback | None |

## Anti-Patterns

- **Big-bang migration**: Migrating everything at once carries high risk and is hard to roll back
- **Breaking API changes without versioning**: Old clients will break
- **Schema migration without backward compatibility**: Old code will fail against new schema
- **Deploying without feature flags**: You cannot roll back instantly when issues are detected
- **Not testing rollback**: Rollback must be tested before you need it
- **Removing old code before consumers have migrated**: Premature removal breaks dependencies
- **Not monitoring during rollout**: Issues must be detected quickly to prevent wider impact
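
The feature flag guidance in this contract (percentage-based rollout, consistent evaluation within a request, and instant disable as a rollback path) can be illustrated with a minimal Python sketch. Everything in it is hypothetical and exists only for illustration: the `Flag` dataclass, the `bucket` and `is_enabled` helpers, and the `RequestFlags` cache are not the API of any real flag service, and a production system would normally use a managed flag service as recommended above. The sketch assumes users are identified by a string ID and that hashing the flag name together with the user ID is an acceptable way to assign stable buckets.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical in-memory flag definition (not a real flag service API)."""
    name: str
    enabled: bool = False   # master switch: setting this to False is the instant rollback
    percentage: int = 0     # share of users (0-100) who receive the feature
    allow_users: set = field(default_factory=set)  # e.g. internal dogfood users


def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    """Evaluate a flag for one user: master switch, then allow list, then percentage."""
    if not flag.enabled:
        return False
    if user_id in flag.allow_users:
        return True
    return bucket(flag.name, user_id) < flag.percentage


class RequestFlags:
    """Evaluates each flag at most once per request and caches the result."""

    def __init__(self, flags: dict, user_id: str):
        self._flags = flags
        self._user_id = user_id
        self._cache = {}

    def enabled(self, name: str) -> bool:
        if name not in self._cache:
            self._cache[name] = is_enabled(self._flags[name], self._user_id)
        return self._cache[name]


# Rollout stages: dogfood first, then widen the percentage, then full rollout.
checkout = Flag("new-checkout", enabled=True, percentage=0, allow_users={"alice"})
dogfood_alice = is_enabled(checkout, "alice")   # internal user sees the feature
dogfood_bob = is_enabled(checkout, "bob")       # external user does not at 0%

checkout.percentage = 100                       # full rollout
full_bob = is_enabled(checkout, "bob")

checkout.enabled = False                        # instant rollback, no deployment
rolled_back_alice = is_enabled(checkout, "alice")
```

Because a user's bucket depends only on the flag name and user ID, anyone who received the feature at a lower percentage keeps it as the percentage grows, so users do not flip back and forth between stages. `RequestFlags` addresses the consistency requirement: a flag changed mid-request does not alter the answer already given to that request.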