
RDS Snapshot Strategy: Pre-Migration Safety Net


A failed database migration in production corrupted 15,000 user records and required 3 hours of manual data recovery from application logs. We implemented automated RDS snapshots before every migration, reducing recovery time from 3 hours to 5 minutes.

The Incident: Schema Change Gone Wrong

In November 2025, we deployed a migration to add a foreign key constraint between user_sessions and users tables. The migration was tested in staging and passed all checks. In production, it failed.

The migration added the constraint:

ALTER TABLE user_sessions
ADD CONSTRAINT fk_sessions_user
FOREIGN KEY (user_id) REFERENCES users(id);

PostgreSQL validated the constraint by checking existing data. It found 15,000 sessions with user_id values that didn't exist in the users table (orphaned records from deleted users). The migration failed.
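Orphaned rows like these can be detected ahead of time with an anti-join. A pre-flight check sketch against the two tables above:

```sql
-- Count sessions pointing at users that no longer exist
SELECT COUNT(*) AS orphaned_sessions
FROM user_sessions s
LEFT JOIN users u ON u.id = s.user_id
WHERE s.user_id IS NOT NULL
  AND u.id IS NULL;
```

Running a check like this in both staging and production before shipping the constraint would have surfaced the 15,000 orphans early.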

Here's where it went wrong. Our migration used a two-step process:

-- Step 1: Add constraint (FAILED)
ALTER TABLE user_sessions
ADD CONSTRAINT fk_sessions_user
FOREIGN KEY (user_id) REFERENCES users(id);

-- Step 2: Add index on user_id (NEVER EXECUTED)
CREATE INDEX idx_sessions_user_id ON user_sessions(user_id);

The constraint addition failed. But before failing, PostgreSQL had already done some work. It had marked the constraint as "invalid" in the system catalogs and created internal tracking structures.

Our rollback procedure (revert the migration) tried to drop a constraint that was never fully created. This left the database in an inconsistent state where:

  • The constraint existed but was marked invalid
  • The application code expected no constraint
  • New inserts were rejected by the "invalid" constraint
  • Production was down
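In PostgreSQL, a half-created constraint like this can be confirmed and removed through the system catalogs. A diagnostic sketch (table and constraint names taken from the migration above):

```sql
-- List constraints on user_sessions that are marked invalid
SELECT conname, convalidated
FROM pg_constraint
WHERE conrelid = 'user_sessions'::regclass
  AND NOT convalidated;

-- Drop the half-created constraint so inserts succeed again
ALTER TABLE user_sessions DROP CONSTRAINT IF EXISTS fk_sessions_user;
```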

We spent 3 hours:

  1. Diagnosing the inconsistent state (45 minutes)
  2. Manually dropping the invalid constraint (15 minutes)
  3. Cleaning up orphaned session records (90 minutes)
  4. Rewriting the migration to delete orphans first (30 minutes)

Total downtime: 1 hour 15 minutes (until we bypassed the constraint check).
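The rewrite in step 4 can be sketched as follows: delete the orphans first, then add the constraint using PostgreSQL's NOT VALID / VALIDATE CONSTRAINT pattern so the full-table validation scan runs under a weaker lock. The exact migration isn't shown here, so treat this as an illustration:

```sql
-- 1. Remove sessions whose user no longer exists
DELETE FROM user_sessions s
WHERE NOT EXISTS (
  SELECT 1 FROM users u WHERE u.id = s.user_id
);

-- 2. Add the constraint without scanning existing rows
ALTER TABLE user_sessions
  ADD CONSTRAINT fk_sessions_user
  FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;

-- 3. Validate existing rows separately, under a SHARE UPDATE EXCLUSIVE lock
ALTER TABLE user_sessions VALIDATE CONSTRAINT fk_sessions_user;
```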

This incident would have been a 5-minute rollback with automated RDS snapshots.

Migration Risk Before: No Safety Net

Migration Failure Scenario (No Snapshot)
┌──────────────────────────────────────────────┐
│ Time: 02:00 - Start migration                │
│ ├─ Run: flask db upgrade                     │
│ ├─ Migration fails at step 3/10              │
│ └─ Database in partially modified state      │
│                                              │
│ Time: 02:05 - Discover failure               │
│ ├─ Attempt rollback migration                │
│ ├─ Rollback fails (inconsistent state)       │
│ └─ Production down                           │
│                                              │
│ Time: 02:10 - Emergency response             │
│ ├─ Page on-call engineer                     │
│ ├─ SSH into database                         │
│ ├─ Manually inspect schema state             │
│ ├─ Run manual SQL to fix state               │
│ ├─ Hope you didn't make it worse             │
│ └─ Pray the fix works                        │
│                                              │
│ Time: 05:00 - Recovery complete (3 hours)    │
│ ├─ Database state restored                   │
│ ├─ Application code rolled back              │
│ └─ Production online                         │
│                                              │
│ Customer impact: 1-3 hours downtime          │
│ Team impact: 3-6 hours of emergency work     │
│ Risk: Data loss if manual fix corrupts DB    │
└──────────────────────────────────────────────┘

Migration Safety After: Automated Snapshots

Migration Failure Scenario (With Snapshot)
┌──────────────────────────────────────────────┐
│ Time: 02:00 - Create pre-migration snapshot  │
│ ├─ Automated snapshot: pre-mig-20251115-0200 │
│ ├─ Snapshot time: 2 minutes                  │
│ └─ Database state captured                   │
│                                              │
│ Time: 02:02 - Start migration                │
│ ├─ Run: flask db upgrade                     │
│ ├─ Migration fails at step 3/10              │
│ └─ Database in partially modified state      │
│                                              │
│ Time: 02:05 - Discover failure               │
│ ├─ CI/CD health check detects failure        │
│ ├─ Automated rollback triggered              │
│ └─ Restore from snapshot                     │
│                                              │
│ Time: 02:10 - Recovery complete (5 minutes)  │
│ ├─ Database restored to pre-migration state  │
│ ├─ Application code unchanged                │
│ └─ Production online                         │
│                                              │
│ Customer impact: 5 minutes downtime          │
│ Team impact: 0 hours (automated)             │
│ Risk: Zero (snapshot restore is safe)        │
└──────────────────────────────────────────────┘

Implementation: Pre-Migration Snapshots

We integrated automated RDS snapshots into our migration pipeline. Every migration creates a snapshot before executing DDL commands.

Snapshot Script

#!/bin/bash
# scripts/create_migration_snapshot.sh

set -euo pipefail

ENVIRONMENT=${1:?usage: create_migration_snapshot.sh <environment> <db-instance>}
DB_INSTANCE=${2:?usage: create_migration_snapshot.sh <environment> <db-instance>}
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_ID="pre-migration-${ENVIRONMENT}-${TIMESTAMP}"

echo "Creating RDS snapshot: $SNAPSHOT_ID"
echo "Database instance: $DB_INSTANCE"

# Create snapshot
aws rds create-db-snapshot \
  --db-instance-identifier "$DB_INSTANCE" \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --tags Key=Type,Value=PreMigration \
         Key=Environment,Value="$ENVIRONMENT" \
         Key=Timestamp,Value="$TIMESTAMP"

echo "Snapshot creation initiated: $SNAPSHOT_ID"

# Wait for snapshot to complete
echo "Waiting for snapshot to complete (typically 2-5 minutes)..."
aws rds wait db-snapshot-available \
  --db-snapshot-identifier $SNAPSHOT_ID

echo "Snapshot completed: $SNAPSHOT_ID"
echo "Snapshot can be used for rollback if migration fails"

# Store snapshot ID for rollback script
echo $SNAPSHOT_ID > /tmp/latest_snapshot_id.txt

Integration with CI/CD

We integrated the snapshot script into our CircleCI deployment pipeline:

# .circleci/config.yml
jobs:
  deploy:
    steps:
      - checkout

      - run:
          name: Create pre-migration snapshot
          command: |
            ./scripts/create_migration_snapshot.sh \
              $CIRCLE_BRANCH \
              $RDS_INSTANCE_ID

      - run:
          name: Run database migrations
          command: |
            flask db upgrade

      - run:
          name: Run smoke tests
          command: |
            ./scripts/smoke_tests.sh

      # Runs only if a previous step (migration or smoke tests) failed;
      # `when: on_fail` already gates execution, so no exit-code check is needed
      - run:
          name: Rollback on failure
          command: |
            SNAPSHOT_ID=$(cat /tmp/latest_snapshot_id.txt)
            ./scripts/rollback_from_snapshot.sh "$SNAPSHOT_ID"
          when: on_fail

Rollback Script

When migrations fail, we restore from the snapshot:

#!/bin/bash
# scripts/rollback_from_snapshot.sh

set -e

SNAPSHOT_ID=$1

if [ -z "$SNAPSHOT_ID" ]; then
  echo "Error: No snapshot ID provided"
  exit 1
fi

echo "Rolling back to snapshot: $SNAPSHOT_ID"

# Get current DB instance identifier
CURRENT_DB=$(aws rds describe-db-snapshots \
  --db-snapshot-identifier $SNAPSHOT_ID \
  --query 'DBSnapshots[0].DBInstanceIdentifier' \
  --output text)

echo "Target database: $CURRENT_DB"

# Restore from snapshot
# Note: This creates a new DB instance, then we'll rename it
TEMP_DB="${CURRENT_DB}-rollback-temp"

echo "Step 1: Restore snapshot to temporary instance: $TEMP_DB"
# Note: unless you pass --db-subnet-group-name, --vpc-security-group-ids, etc.,
# the restored instance comes up with default network and parameter settings
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEMP_DB" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

echo "Waiting for restore to complete (typically 5-10 minutes)..."
aws rds wait db-instance-available \
  --db-instance-identifier $TEMP_DB

echo "Step 2: Update application to use temporary instance"
# Update application config to point to temp instance
# (Implementation specific to your setup)

echo "Step 3: Delete current database instance"
aws rds delete-db-instance \
  --db-instance-identifier "$CURRENT_DB" \
  --skip-final-snapshot

echo "Waiting for old instance to be deleted..."
aws rds wait db-instance-deleted \
  --db-instance-identifier "$CURRENT_DB"

echo "Step 4: Rename temporary instance to original name"
aws rds modify-db-instance \
  --db-instance-identifier "$TEMP_DB" \
  --new-db-instance-identifier "$CURRENT_DB" \
  --apply-immediately

echo "Rollback complete"
echo "Database restored to pre-migration state"

Snapshot Retention Policy

RDS snapshots cost $0.095 per GB-month. We store ~50GB of data, so each snapshot costs ~$4.75/month. We implemented a retention policy to balance cost and safety:

Retention Rules:

  • Pre-migration snapshots: 7 days
  • Production deployment snapshots: 30 days
  • Manual snapshots (tagged "Permanent"): never deleted
  • All other snapshots: 24 hours

The cleanup script enforces these rules:

#!/bin/bash
# scripts/cleanup_old_snapshots.sh

# Delete manual snapshots older than 7 days (unless tagged Permanent).
# 7-day rule shown; the other retention tiers run with different cutoffs.
# Automated snapshots can't be removed with delete-db-snapshot, hence
# --snapshot-type manual.
CUTOFF=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)

aws rds describe-db-snapshots \
  --snapshot-type manual \
  --query "DBSnapshots[?SnapshotCreateTime<=\`${CUTOFF}\`]" \
  --output json | \
jq -r '.[] | select(((.TagList // []) | map(.Key) | index("Permanent")) | not) | .DBSnapshotIdentifier' | \
while read -r snapshot; do
  echo "Deleting snapshot: $snapshot"
  aws rds delete-db-snapshot --db-snapshot-identifier "$snapshot"
done

We run this cleanup script daily via cron. Typical cost: $15-20/month for snapshot storage, compared to the $5,000+ cost of 3 hours of engineering time for manual recovery.
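The daily run is a single cron entry. A sketch (the path, user, and schedule are assumptions, not from the original):

```
# /etc/cron.d entry: run snapshot cleanup daily at 03:00 UTC
0 3 * * * deploy /opt/app/scripts/cleanup_old_snapshots.sh >> /var/log/snapshot_cleanup.log 2>&1
```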

Real-World Recovery: Content Duo Migration

In January 2026, we deployed the Content Duo feature, which included 12 new tables and modifications to 8 existing tables. The migration was complex and involved:

  1. Creating 12 new tables
  2. Adding 15 foreign key constraints
  3. Creating 20 indexes
  4. Migrating data from old bits table to new curriculum_concepts table

In staging, the migration took 8 minutes. In production, it failed after 5 minutes due to a constraint violation we hadn't anticipated (orphaned bits records from deleted content).

Without snapshots: This would have been a 2-3 hour recovery involving manual diagnosis of which of the 47 DDL operations succeeded, which failed, and how to safely roll back the partial state.

With snapshots:

  • Migration failed at 02:07
  • CI/CD triggered automatic rollback at 02:08
  • Snapshot restore completed at 02:13
  • Production was back online at 02:15
  • Total downtime: 8 minutes

We fixed the constraint issue (added a data cleanup step before the constraint), retested in staging, and redeployed successfully 2 hours later.

Snapshot Performance Impact

Creating an RDS snapshot adds ~2-5 minutes to deployment time. We measured snapshot creation time across 30 migrations:

  • Min: 1m 45s (snapshot of 25GB database)
  • Max: 6m 20s (snapshot of 75GB database)
  • Average: 3m 12s

We also measured snapshot restore time:

  • Min: 4m 10s
  • Max: 12m 30s
  • Average: 7m 15s

The 3-minute overhead on deployments is negligible compared to the 2-3 hours saved during migration failures.

Alternative Approaches

We evaluated three alternatives before implementing automated snapshots:

1. Manual Snapshots

Pros:

  • No automation required
  • Zero cost for snapshot creation logic

Cons:

  • Depends on humans remembering to create snapshots
  • Humans forget (we audited this: 40% of deployments skipped the manual snapshot)
  • No guarantee snapshot is created at right time

Verdict: Rejected. Human error is the problem we're solving, not a solution.

2. AWS Backup Service

Pros:

  • Managed service
  • Automatic snapshot scheduling
  • Cross-region backup

Cons:

  • Snapshots run on schedule, not pre-migration
  • 15-minute minimum frequency
  • Migration could run between scheduled snapshots
  • More expensive overall than native RDS manual snapshots

Verdict: Rejected. Scheduled snapshots don't align with migration timing.

3. Database Transaction Rollback

Pros:

  • Instant rollback (no snapshot restore needed)
  • Zero storage cost

Cons:

  • DDL is not transactional in many databases (MySQL, for example, commits implicitly on each DDL statement)
  • Even with PostgreSQL's transactional DDL, a crashed migration process or operations like CREATE INDEX CONCURRENTLY fall outside transaction protection
  • Doesn't protect against application-level data corruption

Verdict: Rejected. Transactions can't cover every failure mode of a multi-step migration.

Automated RDS snapshots before migrations were the only approach that provided reliable, automated, point-in-time recovery.

Results

Before automated snapshots (6 months):

  • 2 failed migrations requiring manual recovery
  • Average recovery time: 2.5 hours
  • Total customer-facing downtime: 4 hours
  • Total engineering time: 12 hours (including post-incident reviews)

After automated snapshots (4 months):

  • 1 failed migration (Content Duo deployment)
  • Recovery time: 8 minutes (automated)
  • Total customer-facing downtime: 8 minutes
  • Total engineering time: 0 hours (automated rollback)

Cost comparison:

  • Snapshot storage: $20/month
  • Engineer time saved: ~10 hours/quarter at $150/hour = $1,500/quarter
  • ROI: 25× return ($1,500/quarter saved vs. ~$60/quarter of snapshot storage)

Key Lessons

  1. Snapshots are cheap insurance. $20/month to eliminate 2-3 hour recovery procedures is a no-brainer.

  2. Automate the safety net. Manual snapshots are forgotten. Automated snapshots always run.

  3. Test your rollback procedure. We run quarterly disaster recovery drills where we intentionally fail a migration and practice restoring from snapshots.

  4. Snapshot before EVERY migration. We initially only created snapshots for "risky" migrations. We now snapshot every migration because defining "risky" is subjective and error-prone.

  5. Monitor snapshot age. We alert if a production database hasn't been snapshotted in 24 hours, indicating that our automation might be broken.

Monitoring and Alerting

We monitor snapshot health with CloudWatch alarms:

# cloudwatch-alarms.yml
Alarms:
  - Name: MissingPreMigrationSnapshot
    Metric: TimeSinceLastSnapshot
    Threshold: 86400  # 24 hours
    Action: PageOnCall

  - Name: SnapshotCreationFailure
    Metric: SnapshotCreationErrors
    Threshold: 1
    Action: PageOnCall

  - Name: SnapshotStorageExceeded
    Metric: SnapshotStorageGB
    Threshold: 500  # $50/month cost
    Action: EmailDevOps
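TimeSinceLastSnapshot is not a built-in RDS metric, so something has to publish it. A sketch of such a publisher, assuming a Custom/RDS namespace and adding a DRY_RUN switch so the age computation can be exercised without AWS credentials (the script name, namespace, and defaults here are illustrative, not from the original):

```shell
#!/bin/bash
# scripts/publish_snapshot_age.sh (hypothetical name)
# Publishes the age of the newest snapshot as a custom CloudWatch metric.
set -euo pipefail

DB_INSTANCE=${1:-prod-db}   # assumed instance identifier
DRY_RUN=${DRY_RUN:-1}       # default to a local dry run

# Seconds elapsed since an ISO-8601 UTC timestamp (GNU date)
snapshot_age_seconds() {
  echo $(( $(date -u +%s) - $(date -u -d "$1" +%s) ))
}

if [ "$DRY_RUN" = "1" ]; then
  # Exercise the age computation with a timestamp one hour in the past
  AGE=$(snapshot_age_seconds "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)")
  echo "TimeSinceLastSnapshot=${AGE}"
else
  # Newest snapshot for the instance (prints None if no snapshots exist)
  LATEST=$(aws rds describe-db-snapshots \
    --db-instance-identifier "$DB_INSTANCE" \
    --query 'max_by(DBSnapshots, &SnapshotCreateTime).SnapshotCreateTime' \
    --output text)
  AGE=$(snapshot_age_seconds "$LATEST")
  aws cloudwatch put-metric-data \
    --namespace Custom/RDS \
    --metric-name TimeSinceLastSnapshot \
    --dimensions DBInstance="$DB_INSTANCE" \
    --value "$AGE" \
    --unit Seconds
fi
```

Run from cron alongside the cleanup script; the alarm above then fires when the published value exceeds 86400 seconds.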

We also track snapshot metrics in Datadog:

  • Snapshot creation time (track trends)
  • Snapshot restore time (track degradation)
  • Snapshot count (detect retention policy failures)
  • Snapshot storage cost (monitor budget)

Implementation Commits

  • c084f28 - Add pre-migration snapshot creation script
  • 7a2f1d9 - Integrate snapshot script into CI/CD pipeline
  • 2b8e4c3 - Add automated rollback on migration failure
  • f1d9a8e - Implement snapshot retention policy and cleanup

Conclusion

Database migrations fail. When they do, you need a fast, reliable rollback mechanism. Automated RDS snapshots before every migration provide a 5-minute recovery path that eliminates the need for manual diagnosis and manual SQL commands.

Since implementing automated snapshots, we've reduced migration recovery time from 2-3 hours to 5-10 minutes and eliminated 100% of manual recovery work. The $20/month cost is negligible compared to the $1,500/quarter saved in engineering time.

If you're running database migrations without automated snapshots, you're betting that migrations never fail. History suggests that's a losing bet.