← Back

Idempotent Migrations: Safe Partial Application Recovery

·database

Idempotent Migrations: Safe Partial Application Recovery

Database migrations failing halfway through execution used to leave our databases in inconsistent states, requiring manual SQL commands to fix. We implemented idempotent migration patterns using IF NOT EXISTS guards, eliminating manual intervention and enabling safe reruns of failed migrations.

The Problem: Partial Migration Failures

Failed migrations are inevitable. Network issues, permission errors, or constraint violations can interrupt a migration mid-execution. When this happens with traditional migrations, you're left with a partially-applied schema that requires manual SQL commands to clean up before retrying.

Consider this migration sequence:

CREATE TABLE users (id INT, email VARCHAR(255));
CREATE INDEX idx_users_email ON users(email);
ALTER TABLE sessions ADD COLUMN user_id INT;
-- ERROR: Column 'user_id' already exists

If the migration fails at the ALTER TABLE step, the database now has the new users table and index, but the migration is marked as failed. Rerunning the migration fails immediately because the table already exists. The only solution: manually inspect the database, figure out which steps succeeded, and either complete the migration by hand or roll back the partial changes.

This happened to us during a Content Duo feature deployment. The migration failed due to a constraint violation, leaving three new tables created but four indexes missing. We spent 45 minutes diagnosing the state and manually applying the remaining DDL commands.

Migration State Before: Manual Recovery Required

Migration Execution (Non-Idempotent)
┌──────────────────────────────────────┐
│ CREATE TABLE users (...);            │ ✓
│ CREATE INDEX idx_users_email (...);  │ ✓
│ ALTER TABLE sessions ...;            │ ✗ ERROR
│                                      │
│ Result: Partial application          │
│ Status: Migration marked FAILED      │
│                                      │
│ Recovery:                            │
│ 1. SSH into database                 │
│ 2. Inspect schema state              │
│ 3. Manually run remaining DDL        │
│ 4. Update migration history table    │
│ Time: 30-60 minutes                  │
└──────────────────────────────────────┘

Solution: Idempotent Migration Patterns

The fix is simple: make every migration operation idempotent using IF NOT EXISTS guards. If a migration fails and you rerun it, already-completed steps are safely skipped while remaining steps execute normally.

We converted all Alembic migrations to use these patterns:

Table Creation:

CREATE TABLE IF NOT EXISTS users (
    id INT PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Index Creation:

CREATE INDEX IF NOT EXISTS idx_users_email
ON users(email);

Column Addition:

-- PostgreSQL
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name = 'sessions'
        AND column_name = 'user_id'
    ) THEN
        ALTER TABLE sessions ADD COLUMN user_id INT;
    END IF;
END $$;

Foreign Key Addition:

DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1 FROM information_schema.table_constraints
        WHERE constraint_name = 'fk_sessions_user'
    ) THEN
        ALTER TABLE sessions ADD CONSTRAINT fk_sessions_user
        FOREIGN KEY (user_id) REFERENCES users(id);
    END IF;
END $$;

Migration State After: Automated Recovery

Migration Execution (Idempotent)
┌──────────────────────────────────────┐
│ CREATE TABLE IF NOT EXISTS users;    │ ✓ (or skipped)
│ CREATE INDEX IF NOT EXISTS idx_...;  │ ✓ (or skipped)
│ ALTER TABLE sessions ADD COLUMN      │ ✓ (or skipped)
│   IF NOT EXISTS ...;                 │
│                                      │
│ Result: Safe to rerun                │
│ Status: Migration completes          │
│                                      │
│ Recovery:                            │
│ 1. Rerun migration command           │
│ Time: 0 minutes (automatic)          │
└──────────────────────────────────────┘

Implementation Details

We converted 47 existing migrations to idempotent patterns and established a rule: all new migrations must use IF NOT EXISTS guards. This required changes in three areas:

1. Alembic Migration Templates

We modified the Alembic migration template to include boilerplate for conditional DDL:

# migrations/script.py.mako
def upgrade():
    """
    All DDL operations must use IF NOT EXISTS guards.

    Examples:
    - CREATE TABLE IF NOT EXISTS ...
    - CREATE INDEX IF NOT EXISTS ...
    - Use DO $$ blocks for ALTER TABLE ADD COLUMN
    """
    pass

2. Migration Review Checklist

We added a CI check that scans migration files for common non-idempotent patterns and fails the build if found:

# scripts/validate_migrations.py
NON_IDEMPOTENT_PATTERNS = [
    r"CREATE TABLE (?!IF NOT EXISTS)",
    r"CREATE INDEX (?!IF NOT EXISTS)",
    r"ALTER TABLE .* ADD COLUMN (?!.*IF NOT EXISTS)",
]

def validate_migration(file_path):
    content = open(file_path).read()
    for pattern in NON_IDEMPOTENT_PATTERNS:
        if re.search(pattern, content, re.IGNORECASE):
            raise ValueError(f"Non-idempotent pattern found: {pattern}")

3. Testing Migration Reruns

We created integration tests that run each migration twice to verify idempotency:

def test_migration_idempotency():
    # Apply migration
    run_migration("head")

    # Capture schema state
    schema_before = get_schema_snapshot()

    # Rerun same migration (should be no-op)
    run_migration("head")

    # Verify schema unchanged
    schema_after = get_schema_snapshot()
    assert schema_before == schema_after

Real-World Impact: Content Duo Launch

The value of idempotent migrations became clear during our Content Duo launch. The migration added 8 new tables, 15 indexes, and modified 4 existing tables. During staging deployment, the migration failed due to a foreign key constraint violation (our fault - we forgot to populate reference data first).

Before idempotent migrations:

  • We would have spent 30-45 minutes manually inspecting which DDL succeeded
  • We would have written ad-hoc SQL to complete the partial migration
  • We would have risked missing a step or introducing inconsistency

With idempotent migrations:

  • We fixed the reference data issue (2 minutes)
  • We reran the migration command (30 seconds)
  • The migration skipped already-completed steps and finished successfully
  • Total recovery time: 3 minutes

Results and Tradeoffs

Before:

  • 5 production incidents from failed migrations (6 months)
  • Average recovery time: 40 minutes
  • Manual intervention required for every failure
  • High-stress deployments

After:

  • 0 production incidents from failed migrations (3 months since implementation)
  • Average recovery time: 2 minutes (automated retry)
  • Zero manual intervention
  • Confident deployments

Tradeoffs:

Idempotent migrations add complexity. Each migration requires conditional logic rather than simple DDL statements. The migration code is longer and harder to read. For developers unfamiliar with the pattern, writing idempotent migrations takes 2-3× longer initially.

However, the tradeoff is overwhelmingly positive. The cost is paid once during migration authoring. The benefit is realized every time a migration fails (which happens more often than you'd expect - we see 1-2 migration failures per month due to network issues, permission problems, or developer errors).

Key Lessons

  1. Migrations fail more than you think. Network issues, timeouts, permission errors, and constraint violations are common. Idempotency isn't paranoia - it's pragmatism.

  2. Manual recovery is expensive. The 40 minutes spent diagnosing and fixing partial migrations is time not spent shipping features. Automation pays for itself quickly.

  3. Test reruns. Don't trust that your migration is idempotent - test it by running it twice in CI. We caught 3 non-idempotent migrations this way before they reached production.

  4. Enforce the pattern. Without automated checks, developers will forget to use IF NOT EXISTS guards. Build linting into CI to catch violations early.

  5. Document the "why." New team members don't understand why migrations are so verbose. Include the rationale in your migration template and onboarding docs.

Implementation Commits

  • b965ce6 - Convert existing migrations to idempotent patterns
  • c5f37a9 - Add migration validation CI check
  • 81e3926 - Update Alembic templates with IF NOT EXISTS examples

Conclusion

Idempotent migrations are verbose and require discipline, but they eliminate manual recovery work when migrations fail. Since implementing this pattern, we've had zero production incidents from failed migrations and recovered from 4 staging migration failures in under 5 minutes each. The time saved on recovery far outweighs the additional effort during migration authoring.

For teams running migrations in CI/CD pipelines (as you should be), idempotent migrations are essential. When a migration fails at 2am during an automated deployment, you want it to be fixable by rerunning the pipeline, not by waking up an engineer to manually inspect database state.