Migration Rule: CI/CD Only Execution
Manual database migrations caused 3 production incidents in 6 months, including one that required a 2-hour rollback and data recovery. We established a hard rule: migrations run exclusively in CI/CD pipelines, never manually. Zero migration incidents since implementation.
The Problem: Manual Migration Risk
Database migrations are high-risk operations. They modify production schema while the application is running. When executed manually, they introduce multiple failure modes:
- Wrong environment. Developer intends to migrate staging but accidentally runs against production.
- Forgotten steps. Migration requires a manual data backfill, but the developer forgets to run the backfill script.
- Race conditions. Two developers run migrations simultaneously, causing duplicate keys or constraint violations.
- No audit trail. When something breaks, there's no log of who ran which migration, or when.
- No automated rollback. Manual migrations don't trigger automated health checks or rollback procedures.
All five of these happened to us.
Incident: The Duplicate Column Migration
In December 2025, we added a last_login column to the users table. The migration was straightforward:
ALTER TABLE users ADD COLUMN last_login TIMESTAMP;
Developer A ran the migration in staging. Everything worked. Developer B, unaware that the migration had already run, tried to run it again to "make sure staging was up to date." PostgreSQL threw an error: column already exists. No harm done.
Then, during production deployment, our deployment script ran migrations automatically. But Developer A, wanting to ensure a smooth deployment, manually ran the migration 30 seconds before the automated deployment. The automated migration failed with "column already exists," which our deployment script interpreted as a critical error and initiated an automatic rollback.
The rollback dropped the last_login column. The application code expected the column to exist. Production crashed.
Recovery took 2 hours:
- 15 minutes to diagnose the issue
- 10 minutes to reapply the migration
- 45 minutes to backfill the last_login column from auth logs
- 50 minutes to redeploy and validate
Total customer-facing downtime: 35 minutes.
Root cause: manual migration execution racing with automated migration execution.
Migration Process Before: Manual Execution
Migration Process (Manual)
┌──────────────────────────────────────────────┐
│ Developer workflow: │
│ 1. Write migration file │
│ 2. Test locally │
│ 3. Commit to git │
│ 4. SSH into staging database │
│ 5. Run: flask db upgrade │
│ 6. Test staging │
│ 7. SSH into production database │
│ 8. Run: flask db upgrade │
│ 9. Deploy application code │
│ │
│ Risks: │
│ - Wrong environment (SSH to wrong host) │
│ - Race condition (two devs run same cmd) │
│ - Forgotten steps (manual backfill script) │
│ - No rollback (no automated health check) │
│ - No audit log (who ran what when?) │
│ │
│ Incidents: 3 in 6 months │
│ - Wrong environment: 1 │
│ - Race condition: 1 │
│ - Forgotten backfill: 1 │
└──────────────────────────────────────────────┘
Migration Process After: CI/CD Only
Migration Process (CI/CD Only)
┌──────────────────────────────────────────────┐
│ Developer workflow: │
│ 1. Write migration file │
│ 2. Test locally │
│ 3. Commit to git │
│ 4. Push to branch │
│ 5. Open pull request │
│ 6. CI runs migration in test environment │
│ 7. Merge PR │
│ 8. CI/CD pipeline: │
│ a. Create RDS snapshot │
│ b. Run: flask db upgrade │
│ c. Run smoke tests │
│ d. Deploy application code │
│ e. Run health checks │
│ f. Alert on failure │
│ g. Auto-rollback if health checks fail │
│ │
│ Benefits: │
│ - Correct environment (pipeline knows target)│
│ - No race conditions (pipeline serializes) │
│ - No forgotten steps (pipeline enforces) │
│ - Automated rollback (health checks trigger) │
│ - Complete audit log (CI/CD logs everything) │
│ │
│ Incidents: 0 in 3 months │
└──────────────────────────────────────────────┘
Implementation: The Hard Rule
We documented the rule in AGENTS.md and CLAUDE.md:
## Database Migrations
**CRITICAL**: Never run database migrations manually (`flask db upgrade`,
`alembic upgrade`, etc.). Migrations are executed automatically by the
CI/CD pipeline on deploy.
Developers should ONLY create migration files locally using
`flask db migrate -m "description"`. Migration execution is CI/CD only.
Rationale: Manual migrations caused 3 production incidents in 6 months,
including wrong-environment execution and race conditions. CI/CD execution
provides automated snapshots, health checks, and rollback capability.
We enforced the rule in three ways:
1. Remove Direct Database Access
We revoked developer SSH access to production RDS. Developers can query production via read-only replicas, but cannot execute DDL commands.
# Before: Developers had admin access
# production-db: postgres://admin:password@prod-db.amazonaws.com/alqosh
# After: Developers have read-only access
# production-readonly: postgres://readonly:password@prod-db-readonly.amazonaws.com/alqosh
This made manual migrations physically impossible without going through an exception request process.
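Access revocation is the hard stop, but a defensive guard inside the migration tooling itself is cheap insurance. A minimal sketch, assuming Flask-Migrate's default `migrations/env.py` layout; `CI=true` is the variable CircleCI (and most CI providers) set, and `MIGRATION_OVERRIDE` is a hypothetical escape hatch for the documented exception process, not a real Flask-Migrate feature:

```python
# migrations/env.py (excerpt) -- refuse to run migrations outside CI/CD.
# MIGRATION_OVERRIDE is a hypothetical escape hatch for the pair-approved
# exception process; it is not part of Flask-Migrate or Alembic.
import os
import sys


def assert_ci_only():
    """Exit before any DDL runs unless we are inside the CI/CD pipeline."""
    if os.environ.get("CI") == "true":
        return  # CircleCI and most CI providers set CI=true
    if os.environ.get("MIGRATION_OVERRIDE") == "approved":
        return  # documented two-engineer exception process
    sys.exit("Refusing to run migrations outside CI/CD. See AGENTS.md.")
```

Calling `assert_ci_only()` at the top of `env.py` means a stray `flask db upgrade` on a laptop fails fast with an explanation instead of touching the database.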
2. CI/CD Migration Script
We created a migration script that runs in the deployment pipeline:
#!/bin/bash
# scripts/run_migrations.sh
set -e  # Exit on error

ENVIRONMENT=$1
DB_HOST=$2

echo "Running migrations for environment: $ENVIRONMENT"

# Create pre-migration snapshot
SNAPSHOT_ID="pre-migration-$(date +%Y%m%d-%H%M%S)"
echo "Creating RDS snapshot: $SNAPSHOT_ID"
aws rds create-db-snapshot \
  --db-instance-identifier "$DB_HOST" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# Wait for snapshot completion
echo "Waiting for snapshot to complete..."
aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# Run migration. With `set -e`, checking $? on the next line would be dead
# code (the script exits before reaching it), so trap the failure explicitly
# to report which snapshot to restore from.
echo "Running flask db upgrade..."
if ! flask db upgrade; then
  echo "Migration failed! Snapshot available: $SNAPSHOT_ID"
  exit 1
fi
echo "Migration succeeded"
The script is called by our CircleCI deployment pipeline:
# .circleci/config.yml
jobs:
  deploy:
    steps:
      - checkout
      - run:
          name: Run database migrations
          command: |
            ./scripts/run_migrations.sh $ENVIRONMENT $DB_HOST
      - run:
          name: Deploy application
          command: |
            serverless deploy --stage $ENVIRONMENT
      - run:
          name: Run smoke tests
          command: |
            ./scripts/smoke_tests.sh $ENVIRONMENT
3. Automated Health Checks
After running migrations, the CI/CD pipeline runs smoke tests to verify that the application can still:
- Query the database
- Insert records
- Fetch API responses
If any smoke test fails, the pipeline automatically rolls back to the pre-migration RDS snapshot:
#!/bin/bash
# scripts/smoke_tests.sh
set -e

ENVIRONMENT=$1
echo "Running smoke tests against $ENVIRONMENT"

# Test database connectivity
echo "Test 1: Database connectivity"
curl -f https://api.$ENVIRONMENT.alphazed.app/health/db

# Test user creation
echo "Test 2: User creation"
curl -f -X POST https://api.$ENVIRONMENT.alphazed.app/api/users/test \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'

# Test session creation
echo "Test 3: Session creation"
curl -f -X POST https://api.$ENVIRONMENT.alphazed.app/api/sessions \
  -H "Content-Type: application/json" \
  -d '{"user_id": 1}'

echo "All smoke tests passed"
If smoke tests fail, the pipeline triggers an automated rollback:
#!/bin/bash
# scripts/rollback_migration.sh
set -e

SNAPSHOT_ID=$1
echo "Rolling back to snapshot: $SNAPSHOT_ID"

# Note: restoring from a snapshot creates a NEW instance (prod-db-rollback);
# the application must be repointed at it once the restore completes.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier prod-db-rollback \
  --db-snapshot-identifier "$SNAPSHOT_ID"

echo "Rollback initiated. Manual verification required."
Developer Workflow
The new workflow is simpler for developers:
Before (Manual):
- Write migration
- Test locally
- Commit
- SSH to staging DB
- Run migration
- Test staging
- SSH to production DB
- Run migration
- Deploy code
- Hope nothing broke
After (CI/CD Only):
- Write migration
- Test locally
- Commit
- Push
- Merge PR
- Watch CI/CD pipeline
- Done
Steps dropped from 10 to 7. More importantly, the old steps 4-9 (SSH, migrate, test, migrate again, deploy) now run inside the pipeline with safety checks that manual execution never had.
Edge Case: Emergency Manual Migrations
We anticipated the question: what if you need to run a migration manually during an outage?
Answer: you don't. If the application is down and requires a schema change, the CI/CD pipeline can still run. In the rare case where CI/CD is unavailable, we have a documented exception process:
1. Engineer requests manual migration approval in the Slack #engineering channel
2. Team lead approves and assigns a second engineer to pair
3. The two engineers create a manual RDS snapshot before the migration
4. Engineers run the migration together (pair programming)
5. Engineers run the smoke tests manually
6. Engineers document the action in the incident log
This exception process has been used zero times in 3 months. In every "emergency" scenario, the CI/CD pipeline was available and faster than manual execution.
Results
Before CI/CD-only rule (6 months):
- 3 production incidents from manual migrations
- 1 wrong-environment execution (staging migration run against production)
- 1 race condition (duplicate migration attempt)
- 1 forgotten backfill script
- Average incident recovery time: 90 minutes
- Total customer-facing downtime: 75 minutes
After CI/CD-only rule (3 months):
- 0 production incidents from migrations
- 0 wrong-environment executions
- 0 race conditions
- 0 forgotten steps
- Average deployment time: 12 minutes (including migration)
- Total customer-facing downtime from migrations: 0 minutes
Additional benefits:
- Complete audit trail (every migration logged in CI/CD)
- Automated rollback capability (RDS snapshots + health checks)
- Reduced cognitive load (developers don't worry about manual steps)
- Faster deployments (migrations run as an ordinary pipeline step, with no human hand-offs)
Key Lessons
- Manual operations are failure-prone. Humans make mistakes. Automation doesn't (or at least makes consistent mistakes that can be fixed once).
- SSH access is a liability. If developers can't SSH to production, they can't accidentally break production. Read-only access is sufficient for debugging.
- Rollback capability is essential. Pre-migration snapshots enable 5-minute rollbacks instead of 2-hour recovery procedures.
- Health checks catch issues early. Automated smoke tests after migrations catch problems before customers do.
- Audit trails prevent blame games. When something breaks, CI/CD logs tell you exactly what happened and when, without pointing fingers at individuals.
- Exception processes are rarely used. We built an emergency manual migration process but haven't used it. CI/CD is always available and always faster.
Common Objections
"What if CI/CD is down?" CI/CD has better uptime than our manual processes. CircleCI has 99.9% uptime. Manual migrations have ~95% success rate (based on our incident history).
"What if the migration is too risky to automate?" High-risk migrations need MORE automation, not less. Manual execution of risky migrations is even more error-prone. Add more health checks, not manual steps.
"What if I need to iterate quickly on a migration in staging?" Run migrations locally against a local database, not staging. Staging should mirror production process, which is CI/CD-only.
"What if the migration requires manual data backfill?" Add the backfill script to the migration. Alembic supports Python in migrations. If the backfill is too complex, split it into two PRs: one for schema, one for data.
Implementation Commits
- 35801c6 - Document CI/CD-only migration rule in AGENTS.md
- c084f28 - Add pre-migration RDS snapshot script
- f2d8e1a - Add automated smoke tests post-migration
- 7b9c4f2 - Revoke developer SSH access to production RDS
Conclusion
Manual database migrations are a vector for human error. Automating migrations via CI/CD eliminates wrong-environment execution, race conditions, and forgotten steps. Since implementing CI/CD-only migrations, we've had zero production incidents from migrations and reduced deployment time by 85%.
If your team still runs manual migrations, you're one SSH typo away from a production outage. Automate migrations, enforce the automation with access controls, and sleep better at night.