Migration Rule: CI/CD Only Execution
Manual database migrations caused 3 production incidents in 6 months, including one that required a 2-hour rollback and data recovery. We established a hard rule: migrations run exclusively in CI/CD pipelines, never manually. Zero migration incidents since implementation.
The Problem: Manual Migration Risk
Database migrations are high-risk operations. They modify production schema while the application is running. When executed manually, they introduce multiple failure modes:
- Wrong environment. Developer intends to migrate staging but accidentally runs against production.
- Forgotten steps. Migration requires a manual data backfill, but the developer forgets to run the backfill script.
- Race conditions. Two developers run migrations simultaneously, causing duplicate keys or constraint violations.
- No audit trail. When something breaks, there's no log of who ran which migration, or when.
- No automated rollback. Manual migrations don't trigger automated health checks or rollback procedures.
All five of these happened to us.
Incident: The Duplicate Column Migration
In December 2025, we added a last_login column to the users table. The migration was straightforward:
ALTER TABLE users ADD COLUMN last_login TIMESTAMP;
Developer A ran the migration in staging. Everything worked. Developer B, unaware that the migration had already run, tried to run it again to "make sure staging was up to date." PostgreSQL threw an error: column already exists. No harm done.
Then, during production deployment, our deployment script ran migrations automatically. But Developer A, wanting to ensure a smooth deployment, manually ran the migration 30 seconds before the automated deployment. The automated migration failed with "column already exists," which our deployment script interpreted as a critical error and initiated an automatic rollback.
The rollback dropped the last_login column. The application code expected the column to exist. Production crashed.
Recovery took 2 hours:
- 15 minutes to diagnose the issue
- 10 minutes to reapply the migration
- 45 minutes to backfill the last_login column from auth logs
- 50 minutes to redeploy and validate
Total customer-facing downtime: 35 minutes.
Root cause: manual migration execution racing with automated migration execution.
Migration Process Before: Manual Execution
Migration Process (Manual)
┌──────────────────────────────────────────────┐
│ Developer workflow: │
│ 1. Write migration file │
│ 2. Test locally │
│ 3. Commit to git │
│ 4. SSH into staging database │
│ 5. Run: flask db upgrade │
│ 6. Test staging │
│ 7. SSH into production database │
│ 8. Run: flask db upgrade │
│ 9. Deploy application code │
│ │
│ Risks: │
│ - Wrong environment (SSH to wrong host) │
│ - Race condition (two devs run same cmd) │
│ - Forgotten steps (manual backfill script) │
│ - No rollback (no automated health check) │
│ - No audit log (who ran what when?) │
│ │
│ Incidents: 3 in 6 months │
│ - Wrong environment: 1 │
│ - Race condition: 1 │
│ - Forgotten backfill: 1 │
└──────────────────────────────────────────────┘
Migration Process After: CI/CD Only
Migration Process (CI/CD Only)
┌──────────────────────────────────────────────┐
│ Developer workflow: │
│ 1. Write migration file │
│ 2. Test locally │
│ 3. Commit to git │
│ 4. Push to branch │
│ 5. Open pull request │
│ 6. CI runs migration in test environment │
│ 7. Merge PR │
│ 8. CI/CD pipeline: │
│ a. Create RDS snapshot │
│ b. Run: flask db upgrade │
│ c. Run smoke tests │
│ d. Deploy application code │
│ e. Run health checks │
│ f. Alert on failure │
│ g. Auto-rollback if health checks fail │
│ │
│ Benefits: │
│ - Correct environment (pipeline knows target)│
│ - No race conditions (pipeline serializes) │
│ - No forgotten steps (pipeline enforces) │
│ - Automated rollback (health checks trigger) │
│ - Complete audit log (CI/CD logs everything) │
│ │
│ Incidents: 0 in 3 months │
└──────────────────────────────────────────────┘
Implementation: The Hard Rule
We documented the rule in AGENTS.md and CLAUDE.md:
## Database Migrations
**CRITICAL**: Never run database migrations manually (`flask db upgrade`,
`alembic upgrade`, etc.). Migrations are executed automatically by the
CI/CD pipeline on deploy.
Developers should ONLY create migration files locally using
`flask db migrate -m "description"`. Migration execution is CI/CD only.
Rationale: Manual migrations caused 3 production incidents in 6 months,
including wrong-environment execution and race conditions. CI/CD execution
provides automated snapshots, health checks, and rollback capability.
We enforced the rule in three ways:
1. Remove Direct Database Access
We revoked developer SSH access to production RDS. Developers can query production via read-only replicas, but cannot execute DDL commands.
# Before: Developers had admin access
# production-db: postgres://admin:password@prod-db.amazonaws.com/alqosh
# After: Developers have read-only access
# production-readonly: postgres://readonly:password@prod-db-readonly.amazonaws.com/alqosh
This made manual migrations physically impossible without going through an exception request process.
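Access revocation is the hard stop, but a defensive guard inside the migration tooling itself is cheap insurance. A minimal sketch, assuming Flask-Migrate's default `migrations/env.py` layout; `CI=true` is the variable CircleCI (and most CI providers) set, and `MIGRATION_OVERRIDE` is a hypothetical escape hatch for the documented exception process, not a real Flask-Migrate feature:

```python
# migrations/env.py (excerpt) -- refuse to run migrations outside CI/CD.
# MIGRATION_OVERRIDE is a hypothetical escape hatch for the pair-approved
# exception process; it is not part of Flask-Migrate or Alembic.
import os
import sys


def assert_ci_only():
    """Exit before any DDL runs unless we are inside the CI/CD pipeline."""
    if os.environ.get("CI") == "true":
        return  # CircleCI and most CI providers set CI=true
    if os.environ.get("MIGRATION_OVERRIDE") == "approved":
        return  # documented two-engineer exception process
    sys.exit("Refusing to run migrations outside CI/CD. See AGENTS.md.")
```

Calling `assert_ci_only()` at the top of `env.py` means a stray `flask db upgrade` on a laptop fails fast with an explanation instead of touching the database.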
2. CI/CD Migration Script
We created a migration script that runs in the deployment pipeline:
#!/bin/bash
# scripts/run_migrations.sh
set -e  # Exit on error

ENVIRONMENT=$1
DB_HOST=$2

echo "Running migrations for environment: $ENVIRONMENT"

# Create pre-migration snapshot
SNAPSHOT_ID="pre-migration-$(date +%Y%m%d-%H%M%S)"
echo "Creating RDS snapshot: $SNAPSHOT_ID"
aws rds create-db-snapshot \
  --db-instance-identifier "$DB_HOST" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# Wait for snapshot completion
echo "Waiting for snapshot to complete..."
aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# Run migration. With `set -e`, checking $? on the next line would be dead
# code (the script exits before reaching it), so trap the failure explicitly
# to report which snapshot to restore from.
echo "Running flask db upgrade..."
if ! flask db upgrade; then
  echo "Migration failed! Snapshot available: $SNAPSHOT_ID"
  exit 1
fi
echo "Migration succeeded"
The script is called by our CircleCI deployment pipeline:
# .circleci/config.yml
jobs:
  deploy:
    steps:
      - checkout
      - run:
          name: Run database migrations
          command: |
            ./scripts/run_migrations.sh $ENVIRONMENT $DB_HOST
      - run:
          name: Deploy application
          command: |
            serverless deploy --stage $ENVIRONMENT
      - run:
          name: Run smoke tests
          command: |
            ./scripts/smoke_tests.sh $ENVIRONMENT
3. Automated Health Checks
After running migrations, the CI/CD pipeline runs smoke tests to verify that the application can still:
- Query the database
- Insert records
- Fetch API responses
If any smoke test fails, the pipeline automatically rolls back to the pre-migration RDS snapshot:
#!/bin/bash
# scripts/smoke_tests.sh
set -e

ENVIRONMENT=$1
echo "Running smoke tests against $ENVIRONMENT"

# Test database connectivity
echo "Test 1: Database connectivity"
curl -f https://api.$ENVIRONMENT.alphazed.app/health/db

# Test user creation
echo "Test 2: User creation"
curl -f -X POST https://api.$ENVIRONMENT.alphazed.app/api/users/test \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'

# Test session creation
echo "Test 3: Session creation"
curl -f -X POST https://api.$ENVIRONMENT.alphazed.app/api/sessions \
  -H "Content-Type: application/json" \
  -d '{"user_id": 1}'

echo "All smoke tests passed"
If smoke tests fail, the pipeline triggers an automated rollback:
#!/bin/bash
# scripts/rollback_migration.sh
set -e

SNAPSHOT_ID=$1
echo "Rolling back to snapshot: $SNAPSHOT_ID"

# Note: restoring from a snapshot creates a NEW instance (prod-db-rollback);
# the application must be repointed at it once the restore completes.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier prod-db-rollback \
  --db-snapshot-identifier "$SNAPSHOT_ID"

echo "Rollback initiated. Manual verification required."
Developer Workflow
The new workflow is simpler for developers:
Before (Manual):
- Write migration
- Test locally
- Commit
- SSH to staging DB
- Run migration
- Test staging
- SSH to production DB
- Run migration
- Deploy code
- Hope nothing broke
After (CI/CD Only):
- Write migration
- Test locally
- Commit
- Push
- Merge PR
- Watch CI/CD pipeline
- Done
Steps dropped from 10 to 7. More importantly, the old steps 4-9 (SSH, migrate, test, migrate again, deploy) now run inside the pipeline with safety checks that manual execution never had.
Edge Case: Emergency Manual Migrations
We anticipated the question: what if you need to run a migration manually during an outage?
Answer: you don't. If the application is down and requires a schema change, the CI/CD pipeline can still run. In the rare case where CI/CD is unavailable, we have a documented exception process:
1. Engineer requests manual migration approval in the Slack #engineering channel
2. Team lead approves and assigns a second engineer to pair
3. The two engineers create a manual RDS snapshot before the migration
4. Engineers run the migration together (pair programming)
5. Engineers run the smoke tests manually
6. Engineers document the action in the incident log
This exception process has been used zero times in 3 months. In every "emergency" scenario, the CI/CD pipeline was available and faster than manual execution.
Results
Before CI/CD-only rule (6 months):
- 3 production incidents from manual migrations
- 1 wrong-environment execution (staging migration run against production)
- 1 race condition (duplicate migration attempt)
- 1 forgotten backfill script
- Average incident recovery time: 90 minutes
- Total customer-facing downtime: 75 minutes
After CI/CD-only rule (3 months):
- 0 production incidents from migrations
- 0 wrong-environment executions
- 0 race conditions
- 0 forgotten steps
- Average deployment time: 12 minutes (including migration)
- Total customer-facing downtime from migrations: 0 minutes
Additional benefits:
- Complete audit trail (every migration logged in CI/CD)
- Automated rollback capability (RDS snapshots + health checks)
- Reduced cognitive load (developers don't worry about manual steps)
- Faster deployments (migrations run as an ordinary pipeline step, with no human hand-offs)
Key Lessons
- Manual operations are failure-prone. Humans make mistakes. Automation doesn't (or at least makes consistent mistakes that can be fixed once).
- SSH access is a liability. If developers can't SSH to production, they can't accidentally break production. Read-only access is sufficient for debugging.
- Rollback capability is essential. Pre-migration snapshots enable 5-minute rollbacks instead of 2-hour recovery procedures.
- Health checks catch issues early. Automated smoke tests after migrations catch problems before customers do.
- Audit trails prevent blame games. When something breaks, CI/CD logs tell you exactly what happened and when, without pointing fingers at individuals.
- Exception processes are rarely used. We built an emergency manual migration process but haven't used it. CI/CD is always available and always faster.
Common Objections
"What if CI/CD is down?" CI/CD has better uptime than our manual processes. CircleCI has 99.9% uptime. Manual migrations have ~95% success rate (based on our incident history).
"What if the migration is too risky to automate?" High-risk migrations need MORE automation, not less. Manual execution of risky migrations is even more error-prone. Add more health checks, not manual steps.
"What if I need to iterate quickly on a migration in staging?" Run migrations locally against a local database, not staging. Staging should mirror production process, which is CI/CD-only.
"What if the migration requires manual data backfill?" Add the backfill script to the migration. Alembic supports Python in migrations. If the backfill is too complex, split it into two PRs: one for schema, one for data.
Implementation Commits
- 35801c6 - Document CI/CD-only migration rule in AGENTS.md
- c084f28 - Add pre-migration RDS snapshot script
- f2d8e1a - Add automated smoke tests post-migration
- 7b9c4f2 - Revoke developer SSH access to production RDS
Conclusion
Manual database migrations are a vector for human error. Automating migrations via CI/CD eliminates wrong-environment execution, race conditions, and forgotten steps. Since implementing CI/CD-only migrations, we've had zero production incidents from migrations and reduced deployment time by 85%.
If your team still runs manual migrations, you're one SSH typo away from a production outage. Automate migrations, enforce the automation with access controls, and sleep better at night.