opencode-workflow/SKILL.md at b61495d34d2db62663f240ab4b943e2597011a7a


name	description
migration-rollout-design	Knowledge contract for migration and rollout design. Provides principles and patterns for backward compatibility, rollout strategies, canary deployments, feature flags, schema evolution, and rollback. Referenced by design-architecture when defining migration and rollout strategy.

This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing migration and rollout strategies. It does not produce artifacts directly.

Core Principles

Backward Compatibility First

New versions must coexist with old versions during migration
APIs must be backward-compatible until all consumers have migrated
Database schemas must support both old and new code during migration
Never break existing functionality during migration

Incremental Over Big-Bang

Migrate incrementally, one step at a time
Each step must be independently deployable and reversible
Test each step before proceeding to the next
Big-bang migrations carry higher risk and are harder to roll back

Rollback by Default

Every migration step must have a clear rollback plan
Practice rollback before you need it
Automated rollback is preferred over manual rollback
Feature flags enable instant rollback without deployment

Rollout Strategies

Blue-Green Deployment

Maintain two identical environments (blue and green)
Deploy new version to the inactive environment
Switch traffic from active to inactive environment
If issues are detected, switch traffic back
Best for: Infrastructure-level deployments with full environment replication
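
The switch-and-switch-back flow above can be sketched as a small router model. This is a minimal illustration, not a real load balancer API; `BlueGreenRouter` and its methods are hypothetical names.

```python
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic."""

    def __init__(self, active="blue"):
        self.versions = {"blue": None, "green": None}
        self.active = active

    @property
    def inactive(self):
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version):
        """Deploy to the inactive environment only; live traffic is untouched."""
        self.versions[self.inactive] = version
        return self.inactive

    def switch(self):
        """Point traffic at the other environment (also used to switch back)."""
        self.active = self.inactive
        return self.active


router = BlueGreenRouter(active="blue")
router.versions["blue"] = "v1"   # blue currently serves v1
router.deploy("v2")              # v2 lands on green, the inactive side
router.switch()                  # cut over: traffic now goes to green
live_after_cutover = router.versions[router.active]
router.switch()                  # issue detected: traffic returns to blue
live_after_rollback = router.versions[router.active]
```

Because the old environment keeps running the previous version, switching back is the same operation as the cutover.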

Canary Deployment

Route a small percentage of traffic to the new version, then increase it in stages (for example 1%, 5%, 10%, 25%, 50%, 100%)
Monitor metrics at each stage before increasing traffic
If issues are detected, shift traffic back to the old version
Best for: Application-level deployments where you want to test with real traffic gradually
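
A minimal sketch of stage gating for a canary, assuming a single health signal (error rate) and a fixed threshold. `run_canary`, `error_rate_at`, and the 1% threshold are illustrative assumptions; a real rollout would typically watch several metrics over a soak period.

```python
CANARY_STAGES = [1, 5, 10, 25, 50, 100]  # percent of traffic on the new version


def run_canary(stages, error_rate_at, max_error_rate=0.01):
    """Advance through the stages while the observed error rate stays healthy.

    error_rate_at is a hypothetical callable: stage percent -> observed error rate.
    Returns (final_percent, history); final_percent is 0 after a shift-back.
    """
    history = []
    for percent in stages:
        rate = error_rate_at(percent)
        history.append((percent, rate))
        if rate > max_error_rate:
            return 0, history  # shift all traffic back to the old version
    return stages[-1], history


# A healthy release reaches 100%.
healthy_final, _ = run_canary(CANARY_STAGES, lambda percent: 0.001)

# A release that degrades at 10% is pulled back before going wider.
failed_final, failed_history = run_canary(
    CANARY_STAGES, lambda percent: 0.05 if percent >= 10 else 0.001
)
```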

Rolling Deployment

Deploy new version to instances one at a time (or in small batches)
Old and new versions run side by side during the rollout
If issues are detected, stop the rollout and roll back the updated instances
Best for: Stateless services where instances can be updated independently
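
The batch-by-batch update and rollback described above might look like the following sketch. The fleet is modeled as a plain dict and `is_healthy` is a hypothetical health-check callable; both are assumptions for illustration.

```python
def rolling_deploy(instances, new_version, is_healthy, batch_size=1):
    """Update instances in batches; roll back updated ones on a failed check.

    instances: dict of instance name -> current version (mutated in place).
    is_healthy: hypothetical callable (name, version) -> bool.
    Returns True if the rollout completed, False if it was rolled back.
    """
    previous = dict(instances)  # remember versions for rollback
    updated = []
    names = list(instances)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            instances[name] = new_version
            updated.append(name)
        if not all(is_healthy(name, new_version) for name in batch):
            for name in updated:  # stop and restore only what was touched
                instances[name] = previous[name]
            return False
    return True


fleet = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
completed = rolling_deploy(fleet, "v2", lambda name, version: True, batch_size=2)

broken_fleet = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
rolled_back = not rolling_deploy(
    broken_fleet, "v2", lambda name, version: name != "c"
)
```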

Feature Flag Deployment

Deploy new code with features disabled (feature flags set to false)
Enable features gradually using feature flags
Flags can be enabled per user, per tenant, or for a percentage of users
If issues are detected, disable the feature flag instantly
Best for: Feature-level deployments where you want to decouple code deployment from feature release
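
One possible shape for per-user, per-tenant, and percentage targeting is shown below. The flag dictionary layout (`enabled`, `users`, `tenants`, `percentage`) is an assumption for this sketch, not the schema of any particular flag service. Hashing the flag name together with the user ID keeps each user in the same bucket as the percentage grows.

```python
import hashlib


def bucket(flag_name, subject_id):
    """Map (flag, subject) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{subject_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag, user_id, tenant_id=None):
    """Evaluate a flag described by a hypothetical dict of targeting rules."""
    if not flag.get("enabled", False):
        return False  # kill switch: off for everyone
    if user_id in flag.get("users", ()):
        return True
    if tenant_id is not None and tenant_id in flag.get("tenants", ()):
        return True
    return bucket(flag["name"], user_id) < flag.get("percentage", 0)


flag = {
    "name": "new-checkout",
    "enabled": True,
    "users": {"alice"},
    "tenants": {"acme"},
    "percentage": 0,
}
alice_on = is_enabled(flag, "alice")               # targeted user
bob_off = is_enabled(flag, "bob")                  # 0% rollout, not targeted
bob_via_tenant = is_enabled(flag, "bob", "acme")   # targeted tenant
```

Setting `enabled` to `False` overrides every targeting rule, which is what makes disabling the flag an instant rollback.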

Feature Flags

Types of Feature Flags

Release flags: Enable/disable new features during rollout (short-lived)
Operational flags: Enable/disable operational features (circuit breakers, maintenance mode)
Experiment flags: A/B testing and gradual rollout (medium-lived)
Permission flags: Enable features for specific users/tenants (long-lived)

Design Considerations

Feature flags must not add significant latency (evaluate quickly)
Feature flag evaluation must be consistent within a request (don't re-evaluate mid-request)
Feature flags must have a defined lifecycle: create, enable, monitor, remove
Remove feature flags after full rollout to prevent technical debt
Use a feature flag management service (not hardcoded flags)
Log feature flag evaluations for debugging
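
Consistency within a request and evaluation logging can both be handled by a small per-request wrapper. This is a sketch under assumptions: `RequestFlags` is a hypothetical class, and `lookup` stands in for whatever flag service client is in use.

```python
import logging

logger = logging.getLogger("feature_flags")


class RequestFlags:
    """Evaluates each flag at most once per request and logs the result."""

    def __init__(self, lookup, request_id):
        self._lookup = lookup          # hypothetical callable: flag name -> bool
        self._request_id = request_id
        self._cache = {}

    def is_enabled(self, name):
        if name not in self._cache:
            value = bool(self._lookup(name))
            self._cache[name] = value
            logger.info("flag=%s value=%s request=%s", name, value, self._request_id)
        return self._cache[name]


store = {"new-checkout": True}             # stand-in for a flag service
flags = RequestFlags(store.get, "req-1")
first = flags.is_enabled("new-checkout")
store["new-checkout"] = False              # flag flipped while the request runs
second = flags.is_enabled("new-checkout")  # same answer as the first evaluation
```

Caching per request also keeps evaluation cheap, since repeated checks never leave the process.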

Feature Flag Rollout

Start with 0% (flag off)
Enable for internal users (dogfood)
Enable for a small percentage of users (canary)
Enable for all users (full rollout)
Monitor metrics at each stage
Remove the flag after full rollout

Schema Evolution

Additive Changes (Safe)

Add a new column with a default value
Add a new table
Add a new index (with caution for large tables)
Add a new optional field to an API response
Add a new API endpoint

Destructive Changes (Require Migration)

Remove a column (requires migration)
Rename a column (requires migration)
Change a column type (requires migration)
Remove a table (requires migration)
Remove an API endpoint (requires consumer migration)

Migration Strategy for Destructive Changes

Expand: Add the new structure alongside the old (both exist)
Migrate: Migrate data and code to use the new structure (both exist)
Contract: Remove the old structure (only new exists)

Example: Renaming a column

Add new column, keep old column, dual-write to both
Migrate existing data from old to new column
Update all reads to use new column
Stop writing to the old column, then remove it
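
The rename steps above can be walked through with an in-memory SQLite database. The table and column names (`users`, `fullname`, `display_name`) are made up for illustration, and in a real system each step would ship as its own deployment rather than run in one script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# 1. Expand: add the new column; the old column stays in place.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def insert_user(name):
    """Dual-write: new rows populate both columns while old readers exist."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


insert_user("Grace Hopper")

# 2. Migrate: backfill rows written before dual-write started.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# 3. Reads now use only the new column.
names = [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]

# 4. Contract: rebuild the table without the old column. A rebuild is used here
#    because ALTER TABLE ... DROP COLUMN needs SQLite 3.35 or newer.
conn.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
    INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

Up to step 3 the old column is intact, so old code keeps working and each step can be reverted; only the contract step is irreversible without a backup.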

Database Migration Best Practices

Every migration must be reversible (up and down migration)
Test migrations against production-like data volumes
Run migrations in a transaction when possible
For large tables, use online schema change tools (pt-online-schema-change, gh-ost)
Never lock a production table for more than a few seconds during a migration
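
A reversible, transactional migration can be sketched with SQLite, which supports DDL inside transactions (not every database does). The `MIGRATION` structure and `apply_migration` helper are hypothetical and do not correspond to any specific migration tool.

```python
import sqlite3

# Hypothetical migration: each direction is a list of SQL statements.
MIGRATION = {
    "up": [
        "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, event TEXT NOT NULL)",
        "CREATE INDEX idx_audit_event ON audit_log (event)",
    ],
    "down": [
        "DROP INDEX idx_audit_event",
        "DROP TABLE audit_log",
    ],
}


def apply_migration(conn, statements):
    """Run all statements in one transaction; roll back everything on error."""
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


def tables(conn):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return sorted(row[0] for row in rows)


# isolation_level=None disables the module's implicit transactions so that
# the explicit BEGIN/COMMIT above also covers DDL statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
apply_migration(conn, MIGRATION["up"])
after_up = tables(conn)
apply_migration(conn, MIGRATION["down"])
after_down = tables(conn)
```

If any statement fails, the rollback leaves the schema exactly as it was before the migration started, so a partial migration never reaches production state.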

Rollback

Application Rollback

Revert to previous deployment version
Feature flag disable (instant, no deployment needed)
Blue-green switch (instant, requires the previous environment to still be running)
Canary shift-back (requires redirecting traffic)
Rolling redeploy of previous version (requires new deployment)

Database Rollback

Run the down migration (reverse of up migration)
Restore from backup (for destructive changes without down migration)
Feature flag to disable new code that uses new schema (code rollback, schema stays)

Rollback Decision Matrix

What Failed	Rollback Method	Data Loss Risk
Application bug	Deploy previous version	None
Feature bug	Disable feature flag	None
Schema migration bug	Run down migration	Low if reversible
Data migration bug	Restore from backup	High if not reversible
Integration failure	Circuit breaker / fallback	None
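
The matrix above can also be encoded as a lookup table, for example in a runbook script. The key names and the `choose_rollback` helper are illustrative assumptions; the methods and risk levels are copied from the table.

```python
from typing import NamedTuple


class Rollback(NamedTuple):
    method: str
    data_loss_risk: str


ROLLBACK_MATRIX = {
    "application_bug": Rollback("Deploy previous version", "None"),
    "feature_bug": Rollback("Disable feature flag", "None"),
    "schema_migration_bug": Rollback("Run down migration", "Low if reversible"),
    "data_migration_bug": Rollback("Restore from backup", "High if not reversible"),
    "integration_failure": Rollback("Circuit breaker / fallback", "None"),
}


def choose_rollback(failure):
    """Return the rollback entry for a failure type, or raise if it is unknown."""
    try:
        return ROLLBACK_MATRIX[failure]
    except KeyError:
        raise ValueError(f"no rollback defined for failure type: {failure!r}") from None


plan = choose_rollback("schema_migration_bug")
```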

Anti-Patterns

Big-bang migration: Migrating everything at once is high-risk and hard to roll back
Breaking API changes without versioning: Old clients will break
Schema migration without backward compatibility: Old code will fail against new schema
Deploying without feature flags: Can't instantly roll back if issues are detected
Not testing rollback: Rollback must be tested before you need it
Removing old code before consumers have migrated: Premature removal breaks dependencies
Not monitoring during rollout: Issues must be detected quickly to prevent wider impact
