Idempotent Migrations: Safe Partial Application Recovery
Database migrations failing halfway through execution used to leave our databases in inconsistent states, requiring manual SQL commands to fix. We implemented idempotent migration patterns using IF NOT EXISTS guards, eliminating manual intervention and enabling safe reruns of failed migrations.
The Problem: Partial Migration Failures
Failed migrations are inevitable. Network issues, permission errors, or constraint violations can interrupt a migration mid-execution. When this happens with traditional migrations, you're left with a partially-applied schema that requires manual SQL commands to clean up before retrying.
Consider this migration sequence:
CREATE TABLE users (id INT, email VARCHAR(255));
CREATE INDEX idx_users_email ON users(email);
ALTER TABLE sessions ADD COLUMN user_id INT;
-- ERROR: Column 'user_id' already exists
If the migration fails at the ALTER TABLE step, the database now has the new users table and index, but the migration is marked as failed. Rerunning the migration fails immediately because the table already exists. The only solution: manually inspect the database, figure out which steps succeeded, and either complete the migration by hand or roll back the partial changes.
This happened to us during a Content Duo feature deployment. The migration failed due to a constraint violation, leaving three new tables created but four indexes missing. We spent 45 minutes diagnosing the state and manually applying the remaining DDL commands.
Migration State Before: Manual Recovery Required
Migration Execution (Non-Idempotent)
┌──────────────────────────────────────┐
│ CREATE TABLE users (...); │ ✓
│ CREATE INDEX idx_users_email (...); │ ✓
│ ALTER TABLE sessions ...; │ ✗ ERROR
│ │
│ Result: Partial application │
│ Status: Migration marked FAILED │
│ │
│ Recovery: │
│ 1. SSH into database │
│ 2. Inspect schema state │
│ 3. Manually run remaining DDL │
│ 4. Update migration history table │
│ Time: 30-60 minutes │
└──────────────────────────────────────┘
Solution: Idempotent Migration Patterns
The fix is simple: make every migration operation idempotent using IF NOT EXISTS guards. If a migration fails and you rerun it, already-completed steps are safely skipped while remaining steps execute normally.
We converted all Alembic migrations to use these patterns:
Table Creation:
CREATE TABLE IF NOT EXISTS users (
id INT PRIMARY KEY,
email VARCHAR(255) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Index Creation:
CREATE INDEX IF NOT EXISTS idx_users_email
ON users(email);
Column Addition:
-- PostgreSQL
DO $$
BEGIN
IF NOT EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_name = 'sessions'
AND column_name = 'user_id'
) THEN
ALTER TABLE sessions ADD COLUMN user_id INT;
END IF;
END $$;
Foreign Key Addition:
DO $$
BEGIN
IF NOT EXISTS (
SELECT 1 FROM information_schema.table_constraints
WHERE constraint_name = 'fk_sessions_user'
) THEN
ALTER TABLE sessions ADD CONSTRAINT fk_sessions_user
FOREIGN KEY (user_id) REFERENCES users(id);
END IF;
END $$;
Migration State After: Automated Recovery
Migration Execution (Idempotent)
┌──────────────────────────────────────┐
│ CREATE TABLE IF NOT EXISTS users; │ ✓ (or skipped)
│ CREATE INDEX IF NOT EXISTS idx_...; │ ✓ (or skipped)
│ ALTER TABLE sessions ADD COLUMN │ ✓ (or skipped)
│ IF NOT EXISTS ...; │
│ │
│ Result: Safe to rerun │
│ Status: Migration completes │
│ │
│ Recovery: │
│ 1. Rerun migration command │
│ Time: 0 minutes (automatic) │
└──────────────────────────────────────┘
Implementation Details
We converted 47 existing migrations to idempotent patterns and established a rule: all new migrations must use IF NOT EXISTS guards. This required changes in three areas:
1. Alembic Migration Templates
We modified the Alembic migration template to include boilerplate for conditional DDL:
# migrations/script.py.mako
def upgrade():
"""
All DDL operations must use IF NOT EXISTS guards.
Examples:
- CREATE TABLE IF NOT EXISTS ...
- CREATE INDEX IF NOT EXISTS ...
- Use DO $$ blocks for ALTER TABLE ADD COLUMN
"""
pass
2. Migration Review Checklist
We added a CI check that scans migration files for common non-idempotent patterns and fails the build if found:
# scripts/validate_migrations.py
NON_IDEMPOTENT_PATTERNS = [
r"CREATE TABLE (?!IF NOT EXISTS)",
r"CREATE INDEX (?!IF NOT EXISTS)",
r"ALTER TABLE .* ADD COLUMN (?!.*IF NOT EXISTS)",
]
def validate_migration(file_path):
content = open(file_path).read()
for pattern in NON_IDEMPOTENT_PATTERNS:
if re.search(pattern, content, re.IGNORECASE):
raise ValueError(f"Non-idempotent pattern found: {pattern}")
3. Testing Migration Reruns
We created integration tests that run each migration twice to verify idempotency:
def test_migration_idempotency():
# Apply migration
run_migration("head")
# Capture schema state
schema_before = get_schema_snapshot()
# Rerun same migration (should be no-op)
run_migration("head")
# Verify schema unchanged
schema_after = get_schema_snapshot()
assert schema_before == schema_after
Real-World Impact: Content Duo Launch
The value of idempotent migrations became clear during our Content Duo launch. The migration added 8 new tables, 15 indexes, and modified 4 existing tables. During staging deployment, the migration failed due to a foreign key constraint violation (our fault - we forgot to populate reference data first).
Before idempotent migrations:
- We would have spent 30-45 minutes manually inspecting which DDL succeeded
- We would have written ad-hoc SQL to complete the partial migration
- We would have risked missing a step or introducing inconsistency
With idempotent migrations:
- We fixed the reference data issue (2 minutes)
- We reran the migration command (30 seconds)
- The migration skipped already-completed steps and finished successfully
- Total recovery time: 3 minutes
Results and Tradeoffs
Before:
- 5 production incidents from failed migrations (6 months)
- Average recovery time: 40 minutes
- Manual intervention required for every failure
- High-stress deployments
After:
- 0 production incidents from failed migrations (3 months since implementation)
- Average recovery time: 2 minutes (automated retry)
- Zero manual intervention
- Confident deployments
Tradeoffs:
Idempotent migrations add complexity. Each migration requires conditional logic rather than simple DDL statements. The migration code is longer and harder to read. For developers unfamiliar with the pattern, writing idempotent migrations takes 2-3× longer initially.
However, the tradeoff is overwhelmingly positive. The cost is paid once during migration authoring. The benefit is realized every time a migration fails (which happens more often than you'd expect - we see 1-2 migration failures per month due to network issues, permission problems, or developer errors).
Key Lessons
-
Migrations fail more than you think. Network issues, timeouts, permission errors, and constraint violations are common. Idempotency isn't paranoia - it's pragmatism.
-
Manual recovery is expensive. The 40 minutes spent diagnosing and fixing partial migrations is time not spent shipping features. Automation pays for itself quickly.
-
Test reruns. Don't trust that your migration is idempotent - test it by running it twice in CI. We caught 3 non-idempotent migrations this way before they reached production.
-
Enforce the pattern. Without automated checks, developers will forget to use IF NOT EXISTS guards. Build linting into CI to catch violations early.
-
Document the "why." New team members don't understand why migrations are so verbose. Include the rationale in your migration template and onboarding docs.
Implementation Commits
b965ce6- Convert existing migrations to idempotent patternsc5f37a9- Add migration validation CI check81e3926- Update Alembic templates with IF NOT EXISTS examples
Conclusion
Idempotent migrations are verbose and require discipline, but they eliminate manual recovery work when migrations fail. Since implementing this pattern, we've had zero production incidents from failed migrations and recovered from 4 staging migration failures in under 5 minutes each. The time saved on recovery far outweighs the additional effort during migration authoring.
For teams running migrations in CI/CD pipelines (as you should be), idempotent migrations are essential. When a migration fails at 2am during an automated deployment, you want it to be fixable by rerunning the pipeline, not by waking up an engineer to manually inspect database state.