RDS Snapshot Strategy: Pre-Migration Safety Net
A failed database migration in production left 15,000 orphaned user records blocking writes and required 3 hours of manual recovery. We implemented automated RDS snapshots before every migration, reducing recovery time from 3 hours to about 5 minutes.
The Incident: Schema Change Gone Wrong
In November 2025, we deployed a migration to add a foreign key constraint between user_sessions and users tables. The migration was tested in staging and passed all checks. In production, it failed.
The migration added the constraint:
ALTER TABLE user_sessions
ADD CONSTRAINT fk_sessions_user
FOREIGN KEY (user_id) REFERENCES users(id);
PostgreSQL validated the constraint by checking existing data. It found 15,000 sessions with user_id values that didn't exist in the users table (orphaned records from deleted users). The migration failed.
Here's where it went wrong. Our migration used a two-step process:
-- Step 1: Add constraint (FAILED)
ALTER TABLE user_sessions
ADD CONSTRAINT fk_sessions_user
FOREIGN KEY (user_id) REFERENCES users(id);
-- Step 2: Add index on user_id (NEVER EXECUTED)
CREATE INDEX idx_sessions_user_id ON user_sessions(user_id);
The constraint addition failed. But before failing, PostgreSQL had already registered the constraint in its system catalogs, marked as not yet validated (the same state a constraint added with NOT VALID sits in before VALIDATE CONSTRAINT succeeds).
Our rollback procedure (revert the migration) tried to drop a constraint that was never fully created. This left the database in an inconsistent state where:
- The constraint existed but was marked invalid
- The application code expected no constraint
- New inserts were rejected by the "invalid" constraint
- Production was down
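Diagnosing this state comes down to asking PostgreSQL which constraints it considers not yet validated. A minimal sketch of that check (the catalog query is standard PostgreSQL; the helper name and cursor wiring are illustrative, not our actual runbook):

```python
# Sketch: ask PostgreSQL which foreign-key constraints are registered but
# not validated. Works with any DB-API cursor (psycopg2 etc.); the helper
# is an illustration, not the script we actually ran.
INVALID_FK_SQL = """
    SELECT conrelid::regclass::text AS table_name, conname
    FROM pg_constraint
    WHERE contype = 'f' AND NOT convalidated
"""

def find_invalid_fks(cursor):
    """Return (table, constraint) pairs for unvalidated FK constraints."""
    cursor.execute(INVALID_FK_SQL)
    return [(table, name) for table, name in cursor.fetchall()]
```

Each constraint this returns can then be dropped with ALTER TABLE ... DROP CONSTRAINT, which is essentially what our manual fix amounted to.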
We spent 3 hours:
- Diagnosing the inconsistent state (45 minutes)
- Manually dropping the invalid constraint (15 minutes)
- Cleaning up orphaned session records (90 minutes)
- Rewriting the migration to delete orphans first (30 minutes)
Total downtime: 1 hour 15 minutes (until we bypassed the constraint check).
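The eventual fix was simply to delete the orphans before adding the constraint. A toy reproduction of that ordering, using an in-memory SQLite database (the schema and sample rows are made up for illustration; the real migration ran equivalent SQL against PostgreSQL, where the cleanup is followed by the ADD CONSTRAINT step):

```python
import sqlite3

# Toy reproduction of the corrected migration ordering.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE user_sessions (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users (id) VALUES (1), (2);
    INSERT INTO user_sessions (id, user_id) VALUES (1, 1), (2, 2), (3, 99);
""")

# Step 0 (the step we were missing): remove sessions whose user_id has no
# matching users row, so constraint validation has nothing to trip on.
conn.execute("""
    DELETE FROM user_sessions
    WHERE NOT EXISTS (SELECT 1 FROM users WHERE users.id = user_sessions.user_id)
""")

orphans_left = conn.execute("""
    SELECT COUNT(*) FROM user_sessions
    WHERE NOT EXISTS (SELECT 1 FROM users WHERE users.id = user_sessions.user_id)
""").fetchone()[0]
```

With zero orphans left, the ADD CONSTRAINT in the original migration validates cleanly.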
This incident would have been a 5-minute rollback with automated RDS snapshots.
Migration Risk Before: No Safety Net
Migration Failure Scenario (No Snapshot)
┌──────────────────────────────────────────────┐
│ Time: 02:00 - Start migration │
│ ├─ Run: flask db upgrade │
│ ├─ Migration fails at step 3/10 │
│ └─ Database in partially modified state │
│ │
│ Time: 02:05 - Discover failure │
│ ├─ Attempt rollback migration │
│ ├─ Rollback fails (inconsistent state) │
│ └─ Production down │
│ │
│ Time: 02:10 - Emergency response │
│ ├─ Page on-call engineer │
│ ├─ SSH into database │
│ ├─ Manually inspect schema state │
│ ├─ Run manual SQL to fix state │
│ ├─ Hope you didn't make it worse │
│ └─ Pray the fix works │
│ │
│ Time: 05:00 - Recovery complete (3 hours) │
│ ├─ Database state restored │
│ ├─ Application code rolled back │
│ └─ Production online │
│ │
│ Customer impact: 1-3 hours downtime │
│ Team impact: 3-6 hours of emergency work │
│ Risk: Data loss if manual fix corrupts DB │
└──────────────────────────────────────────────┘
Migration Safety After: Automated Snapshots
Migration Failure Scenario (With Snapshot)
┌──────────────────────────────────────────────┐
│ Time: 02:00 - Create pre-migration snapshot │
│ ├─ Automated snapshot: pre-mig-20251115-0200│
│ ├─ Snapshot time: 2 minutes │
│ └─ Database state captured │
│ │
│ Time: 02:02 - Start migration │
│ ├─ Run: flask db upgrade │
│ ├─ Migration fails at step 3/10 │
│ └─ Database in partially modified state │
│ │
│ Time: 02:05 - Discover failure │
│ ├─ CI/CD health check detects failure │
│ ├─ Automated rollback triggered │
│ └─ Restore from snapshot │
│ │
│ Time: 02:10 - Recovery complete (5 minutes) │
│ ├─ Database restored to pre-migration state │
│ ├─ Application code unchanged │
│ └─ Production online │
│ │
│ Customer impact: 5 minutes downtime │
│ Team impact: 0 hours (automated) │
│ Risk: Zero (snapshot restore is safe) │
└──────────────────────────────────────────────┘
Implementation: Pre-Migration Snapshots
We integrated automated RDS snapshots into our migration pipeline. Every migration creates a snapshot before executing DDL commands.
Snapshot Script
#!/bin/bash
# scripts/create_migration_snapshot.sh
set -e

ENVIRONMENT=$1
DB_INSTANCE=$2
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_ID="pre-migration-${ENVIRONMENT}-${TIMESTAMP}"

echo "Creating RDS snapshot: $SNAPSHOT_ID"
echo "Database instance: $DB_INSTANCE"

# Create snapshot
aws rds create-db-snapshot \
  --db-instance-identifier "$DB_INSTANCE" \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --tags Key=Type,Value=PreMigration \
         Key=Environment,Value="$ENVIRONMENT" \
         Key=Timestamp,Value="$TIMESTAMP"

echo "Snapshot creation initiated: $SNAPSHOT_ID"

# Wait for snapshot to complete
echo "Waiting for snapshot to complete (typically 2-5 minutes)..."
aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$SNAPSHOT_ID"

echo "Snapshot completed: $SNAPSHOT_ID"
echo "Snapshot can be used for rollback if migration fails"

# Store snapshot ID for rollback script
echo "$SNAPSHOT_ID" > /tmp/latest_snapshot_id.txt
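The same flow can be driven from Python with boto3 instead of the CLI. A sketch under the naming and tagging conventions of the script above (we ship the shell version; this variant is illustrative):

```python
from datetime import datetime, timezone

def snapshot_id(environment: str, now: datetime) -> str:
    """Build the same pre-migration snapshot identifier as the script above."""
    return f"pre-migration-{environment}-{now:%Y%m%d-%H%M%S}"

def create_pre_migration_snapshot(instance_id: str, environment: str) -> str:
    """Create a snapshot and block until it is usable for a restore."""
    import boto3  # deferred import; assumed available where this runs
    rds = boto3.client("rds")
    snap = snapshot_id(environment, datetime.now(timezone.utc))
    rds.create_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snap,
        Tags=[
            {"Key": "Type", "Value": "PreMigration"},
            {"Key": "Environment", "Value": environment},
        ],
    )
    # Equivalent of `aws rds wait db-snapshot-available` (typically 2-5 min)
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snap)
    return snap
```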
Integration with CI/CD
We integrated the snapshot script into our CircleCI deployment pipeline:
# .circleci/config.yml
jobs:
  deploy:
    steps:
      - checkout
      - run:
          name: Create pre-migration snapshot
          command: |
            ./scripts/create_migration_snapshot.sh \
              $CIRCLE_BRANCH \
              $RDS_INSTANCE_ID
      - run:
          name: Run database migrations
          command: |
            flask db upgrade
      - run:
          name: Run smoke tests
          command: |
            ./scripts/smoke_tests.sh
      - run:
          name: Rollback on failure
          command: |
            # Runs only when an earlier step failed (when: on_fail),
            # so there is no exit code to re-check here.
            SNAPSHOT_ID=$(cat /tmp/latest_snapshot_id.txt)
            ./scripts/rollback_from_snapshot.sh $SNAPSHOT_ID
          when: on_fail
Rollback Script
When migrations fail, we restore from the snapshot:
#!/bin/bash
# scripts/rollback_from_snapshot.sh
set -e

SNAPSHOT_ID=$1

if [ -z "$SNAPSHOT_ID" ]; then
  echo "Error: No snapshot ID provided"
  exit 1
fi

echo "Rolling back to snapshot: $SNAPSHOT_ID"

# Get the DB instance the snapshot was taken from
CURRENT_DB=$(aws rds describe-db-snapshots \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --query 'DBSnapshots[0].DBInstanceIdentifier' \
  --output text)

echo "Target database: $CURRENT_DB"

# Restore from snapshot
# Note: This creates a new DB instance, then we rename it
TEMP_DB="${CURRENT_DB}-rollback-temp"

echo "Step 1: Restore snapshot to temporary instance: $TEMP_DB"
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEMP_DB" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

echo "Waiting for restore to complete (typically 5-10 minutes)..."
aws rds wait db-instance-available \
  --db-instance-identifier "$TEMP_DB"

echo "Step 2: Update application to use temporary instance"
# Update application config to point to temp instance
# (Implementation specific to your setup)

echo "Step 3: Delete current database instance"
aws rds delete-db-instance \
  --db-instance-identifier "$CURRENT_DB" \
  --skip-final-snapshot

# The old instance must be fully deleted before its name can be reused
aws rds wait db-instance-deleted \
  --db-instance-identifier "$CURRENT_DB"

echo "Step 4: Rename temporary instance to original name"
aws rds modify-db-instance \
  --db-instance-identifier "$TEMP_DB" \
  --new-db-instance-identifier "$CURRENT_DB" \
  --apply-immediately

echo "Rollback complete"
echo "Database restored to pre-migration state"
Snapshot Retention Policy
RDS snapshots cost $0.095 per GB-month. We store ~50GB of data, so each snapshot costs ~$4.75/month. We implemented a retention policy to balance cost and safety:
Retention Rules:
- Pre-migration snapshots: 7 days
- Production deployment snapshots: 30 days
- Manual snapshots (tagged "permanent"): Never deleted
- All other snapshots: 24 hours
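The rules above reduce to a small predicate. A sketch of the decision logic (the "PreMigration" tag matches our snapshot script; the "ProductionDeployment" type tag and the function itself are illustrative, not code we run):

```python
from datetime import datetime, timedelta

# Retention windows per snapshot type, mirroring the rules above.
# "ProductionDeployment" is an assumed tag value for deployment snapshots.
RETENTION = {
    "PreMigration": timedelta(days=7),
    "ProductionDeployment": timedelta(days=30),
}
DEFAULT_RETENTION = timedelta(hours=24)

def should_delete(tags: dict, created: datetime, now: datetime) -> bool:
    """True if a snapshot with these tags is past its retention window."""
    if "Permanent" in tags:
        return False  # snapshots tagged Permanent are never deleted
    window = RETENTION.get(tags.get("Type", ""), DEFAULT_RETENTION)
    return now - created > window
```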
#!/bin/bash
# scripts/cleanup_old_snapshots.sh
# Delete manual snapshots older than 7 days (unless tagged Permanent)

CUTOFF=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)

# --snapshot-type manual: automated snapshots can't be deleted this way.
# Newer RDS API responses carry tags in TagList; handle Tags too.
aws rds describe-db-snapshots \
  --snapshot-type manual \
  --query "DBSnapshots[?SnapshotCreateTime<='${CUTOFF}']" \
  --output json |
  jq -r '.[]
         | select(([(.TagList // .Tags // [])[].Key] | index("Permanent")) | not)
         | .DBSnapshotIdentifier' |
  while read -r snapshot; do
    echo "Deleting snapshot: $snapshot"
    aws rds delete-db-snapshot --db-snapshot-identifier "$snapshot"
  done
We run this cleanup script daily via cron. Typical cost: $15-20/month for snapshot storage, trivial compared to the engineering time and customer-facing downtime of a 3-hour manual recovery.
Real-World Recovery: Content Duo Migration
In January 2026, we deployed the Content Duo feature, which included 12 new tables and modifications to 8 existing tables. The migration was complex and involved:
- Creating 12 new tables
- Adding 15 foreign key constraints
- Creating 20 indexes
- Migrating data from the old bits table to the new curriculum_concepts table
In staging, the migration took 8 minutes. In production, it failed after 5 minutes due to a constraint violation we hadn't anticipated (orphaned bits records from deleted content).
Without snapshots: This would have been a 2-3 hour recovery involving manual diagnosis of which of the 47 DDL operations succeeded, which failed, and how to safely roll back the partial state.
With snapshots:
- Migration failed at 02:07
- CI/CD triggered automatic rollback at 02:08
- Snapshot restore completed at 02:13
- Production was back online at 02:15
- Total downtime: 8 minutes
We fixed the constraint issue (added a data cleanup step before the constraint), retested in staging, and redeployed successfully 2 hours later.
Snapshot Performance Impact
Creating an RDS snapshot adds ~2-5 minutes to deployment time. We measured snapshot creation time across 30 migrations:
- Min: 1m 45s (snapshot of 25GB database)
- Max: 6m 20s (snapshot of 75GB database)
- Average: 3m 12s
We also measured snapshot restore time:
- Min: 4m 10s
- Max: 12m 30s
- Average: 7m 15s
The 3-minute overhead on deployments is negligible compared to the 2-3 hours saved during migration failures.
Alternative Approaches
We evaluated three alternatives before implementing automated snapshots:
1. Manual Snapshots
Pros:
- No automation required
- Zero cost for snapshot creation logic
Cons:
- Depends on humans remembering to create snapshots
- Humans forget (we tested this: 40% of deployments missed manual snapshots)
- No guarantee snapshot is created at right time
Verdict: Rejected. Human error is the problem we're solving, not a solution.
2. AWS Backup Service
Pros:
- Managed service
- Automatic snapshot scheduling
- Cross-region backup
Cons:
- Snapshots run on schedule, not pre-migration
- 15-minute minimum frequency
- Migration could run between scheduled snapshots
- Additional backup storage cost on top of the RDS snapshots we already pay for
Verdict: Rejected. Scheduled snapshots don't align with migration timing.
3. Database Transaction Rollback
Pros:
- Instant rollback (no snapshot restore needed)
- Zero storage cost
Cons:
- Doesn't cover DDL reliably: MySQL auto-commits DDL, and even in PostgreSQL (where DDL is transactional) migration tools typically commit each step separately
- Doesn't protect against non-transactional failures (e.g., a process crash mid-migration)
- Doesn't protect against application-level data corruption
Verdict: Rejected. A wrapping transaction can't be relied on across databases, migration tools, and failure modes.
Automated RDS snapshots before migrations were the only approach that provided reliable, automated, point-in-time recovery.
Results
Before automated snapshots (6 months):
- 2 failed migrations requiring manual recovery
- Average recovery time: 2.5 hours
- Total customer-facing downtime: 4 hours
- Total engineering time: 12 hours (including post-incident reviews)
After automated snapshots (4 months):
- 1 failed migration (Content Duo deployment)
- Recovery time: 8 minutes (automated)
- Total customer-facing downtime: 8 minutes
- Total engineering time: 0 hours (automated rollback)
Cost comparison:
- Snapshot storage: $20/month
- Engineer time saved: ~10 hours/quarter at $150/hour = $1,500/quarter
- ROI: roughly 25× return ($1,500/quarter saved vs. $60/quarter spent)
Key Lessons
- Snapshots are cheap insurance. $20/month to eliminate 2-3 hour recovery procedures is a no-brainer.
- Automate the safety net. Manual snapshots are forgotten. Automated snapshots always run.
- Test your rollback procedure. We run quarterly disaster recovery drills where we intentionally fail a migration and practice restoring from snapshots.
- Snapshot before EVERY migration. We initially only created snapshots for "risky" migrations. We now snapshot every migration because defining "risky" is subjective and error-prone.
- Monitor snapshot age. We alert if a production database hasn't been snapshotted in 24 hours, indicating that our automation might be broken.
Monitoring and Alerting
We monitor snapshot health with CloudWatch alarms:
# cloudwatch-alarms.yml
Alarms:
  - Name: MissingPreMigrationSnapshot
    Metric: TimeSinceLastSnapshot
    Threshold: 86400  # 24 hours
    Action: PageOnCall
  - Name: SnapshotCreationFailure
    Metric: SnapshotCreationErrors
    Threshold: 1
    Action: PageOnCall
  - Name: SnapshotStorageExceeded
    Metric: SnapshotStorageGB
    Threshold: 500  # ~$50/month at $0.095/GB
    Action: EmailDevOps
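These are custom metrics, so something has to publish them. A sketch of a publisher for the snapshot-age metric (the metric name follows the alarm config above; the "Migrations" namespace and function names are assumptions, not our production code):

```python
from datetime import datetime, timezone

def seconds_since_last_snapshot(snapshot_times, now):
    """Age in seconds of the newest snapshot, or None if there are none."""
    if not snapshot_times:
        return None
    return (now - max(snapshot_times)).total_seconds()

def publish_snapshot_age(instance_id: str):
    """Push TimeSinceLastSnapshot to CloudWatch for one RDS instance."""
    import boto3  # deferred import; assumed available where this runs
    rds = boto3.client("rds")
    cw = boto3.client("cloudwatch")
    snaps = rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_id, SnapshotType="manual"
    )["DBSnapshots"]
    age = seconds_since_last_snapshot(
        [s["SnapshotCreateTime"] for s in snaps if "SnapshotCreateTime" in s],
        datetime.now(timezone.utc),
    )
    cw.put_metric_data(
        Namespace="Migrations",  # assumed namespace
        MetricData=[{
            "MetricName": "TimeSinceLastSnapshot",
            "Value": age if age is not None else -1,
            "Unit": "Seconds",
        }],
    )
    return age
```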
We also track snapshot metrics in Datadog:
- Snapshot creation time (track trends)
- Snapshot restore time (track degradation)
- Snapshot count (detect retention policy failures)
- Snapshot storage cost (monitor budget)
Implementation Commits
- c084f28 - Add pre-migration snapshot creation script
- 7a2f1d9 - Integrate snapshot script into CI/CD pipeline
- 2b8e4c3 - Add automated rollback on migration failure
- f1d9a8e - Implement snapshot retention policy and cleanup
Conclusion
Database migrations fail. When they do, you need a fast, reliable rollback mechanism. Automated RDS snapshots before every migration provide a 5-minute recovery path that eliminates the need for manual diagnosis and manual SQL commands.
Since implementing automated snapshots, we've reduced migration recovery time from 2-3 hours to 5-10 minutes and eliminated 100% of manual recovery work. The $20/month cost is negligible compared to the $1,500/quarter saved in engineering time.
If you're running database migrations without automated snapshots, you're betting that migrations never fail. History suggests that's a losing bet.