
RDS Query Optimization: Database Performance Analysis

·performance

Slow database queries increased Lambda execution time and costs. We analyzed query performance, added indexes, and rewrote inefficient queries, achieving a 64× average speedup across our heaviest queries, with the most critical one improving 167×.

The Slow Query Problem

Our RDS PostgreSQL database served as the source of truth for user progress, content, and analytics. As the user base grew from 5,000 to 50,000, previously acceptable queries slowed to 2-5 seconds, causing Lambda timeouts and degraded user experience.

Symptoms:

  • API endpoints timing out (30s Lambda limit)
  • Increased RDS CPU utilization (70-90%)
  • Slow mobile app response times (3-5s for lesson loading)
  • Rising RDS costs due to increased instance size

The breaking point came when user progress queries took 3+ seconds, making the app feel sluggish and causing users to abandon lessons.

Before: Unoptimized Queries

Slow Query Example: User Attempts History
┌──────────────────────────────────────────────────┐
│ Query:                                           │
│ SELECT * FROM attempts                           │
│   WHERE user_id = 12345                          │
│   ORDER BY created_at DESC;                      │
│                                                  │
│ Execution:                                       │
│ ┌────────────────────────────────────────────┐  │
│ │ 1. Full table scan (attempts table)        │  │
│ │    - Table size: 50M rows                  │  │
│ │    - Rows scanned: 50M rows                │  │
│ │    - Time: 2,000ms                         │  │
│ │                                            │  │
│ │ 2. Filter by user_id                       │  │
│ │    - Matching rows: 5,000                  │  │
│ │    - Time: 300ms                           │  │
│ │                                            │  │
│ │ 3. Sort by created_at                      │  │
│ │    - Sort 5,000 rows                       │  │
│ │    - Time: 200ms                           │  │
│ │                                            │  │
│ │ Total: 2,500ms (2.5 seconds)               │  │
│ └────────────────────────────────────────────┘  │
│                                                  │
│ EXPLAIN output:                                  │
│ Seq Scan on attempts (cost=0..1250000 rows=50M)  │
│   Filter: (user_id = 12345)                      │
│   Rows Removed by Filter: 49,995,000             │
└──────────────────────────────────────────────────┘

Performance Issues:
- Full table scan (no index on user_id)
- SELECT * fetches unnecessary columns
- No query result caching
- Sort operation on 5,000 rows

Query Performance Profile:

Top 10 Slow Queries (7-day period)
┌────────────────────────────────────────────────────────┐
│ Query                          Avg Time  Calls   Total │
│ 1. User attempts history       2,500ms   50k     34.7h│
│ 2. Content by subject          1,800ms   30k     15h  │
│ 3. Leaderboard calculation     3,200ms   10k     8.9h │
│ 4. User progress aggregation   2,100ms   15k     8.8h │
│ 5. Session summary             1,500ms   20k     8.3h │
│ 6. Analytics rollup            5,000ms   5k      6.9h │
│ 7. Badge eligibility check     900ms     25k     6.3h │
│ 8. Streak calculation          1,200ms   18k     6h   │
│ 9. Content recommendations     1,600ms   12k     5.3h │
│ 10. Achievement tracking       800ms     20k     4.4h │
│                                                        │
│ Total time in slow queries: ~105 hours/week           │
└────────────────────────────────────────────────────────┘

After: Optimized Queries with Indexes

Optimized Query: User Attempts History
┌──────────────────────────────────────────────────┐
│ Query:                                           │
│ SELECT id, user_id, bit_id, correct, created_at  │
│   FROM attempts                                  │
│   WHERE user_id = 12345                          │
│   ORDER BY created_at DESC                       │
│   LIMIT 100;                                     │
│                                                  │
│ Index:                                           │
│ CREATE INDEX idx_attempts_user_created           │
│   ON attempts(user_id, created_at DESC);         │
│                                                  │
│ Execution:                                       │
│ ┌────────────────────────────────────────────┐  │
│ │ 1. Index scan (idx_attempts_user_created)  │  │
│ │    - Rows scanned: 100 (LIMIT applied)     │  │
│ │    - Time: 5ms                             │  │
│ │                                            │  │
│ │ 2. Fetch specific columns only             │  │
│ │    - Columns: 5 (vs 15 before)             │  │
│ │    - Time: 10ms                            │  │
│ │                                            │  │
│ │ Total: 15ms (0.015 seconds)                │  │
│ └────────────────────────────────────────────┘  │
│                                                  │
│ EXPLAIN output:                                  │
│ Index Scan using idx_attempts_user_created       │
│   (cost=0..100 rows=100)                         │
│   Index Cond: (user_id = 12345)                  │
│   Limit: 100                                     │
└──────────────────────────────────────────────────┘

Optimizations Applied:
✓ Composite index (user_id, created_at DESC)
✓ LIMIT clause (fetch only needed rows)
✓ Explicit column selection (avoid SELECT *)
✓ Query result caching (5-minute TTL)

Performance: 2,500ms → 15ms (167× faster)
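The 5-minute result cache in the checklist above sits in the API layer, in front of the database. A minimal in-process sketch of the idea (the post doesn't show the real implementation; `ttl_cache` and `fetch_attempts` are illustrative names, and the real cache key would include the query parameters):

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=300):
    """Cache results in-process for a short TTL (5 minutes here)."""
    def decorator(fn):
        store = {}  # args tuple -> (expires_at, result)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            entry = store.get(args)
            if entry and entry[0] > now:
                return entry[1]          # fresh hit: skip the database
            result = fn(*args)
            store[args] = (now + ttl_seconds, result)
            return result
        return wrapper
    return decorator

calls = 0

@ttl_cache(ttl_seconds=300)
def fetch_attempts(user_id):
    global calls
    calls += 1                           # stands in for the real SQL query
    return [("attempt", user_id)]

fetch_attempts(12345)
fetch_attempts(12345)                    # served from cache, no second query
print(calls)                             # 1
```

In Lambda, a module-level cache like this only lives per warm container; hits across instances would need a shared cache such as ElastiCache.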

Implementation Details

Query Analysis Process

Step 1: Enable Query Logging

-- Enable slow query logging in RDS
ALTER SYSTEM SET log_min_duration_statement = 1000;  -- Log queries > 1s
-- Avoid log_statement = 'all' in production: it logs every statement
-- regardless of duration and floods the log volume
SELECT pg_reload_conf();

Step 2: Analyze Query Plans

-- Identify slow queries from pg_stat_statements
SELECT
    query,
    calls,
    total_exec_time / 1000 AS total_time_seconds,
    mean_exec_time AS avg_time_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
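The weekly "Total" figures in the tables above come straight from these columns: total time is calls × mean duration. A tiny helper (hypothetical name) does the unit conversion:

```python
def total_hours(calls, mean_ms):
    """Total time spent in a query: calls × avg duration, in hours."""
    return calls * mean_ms / 3_600_000   # 3.6M ms per hour

# User attempts history: 50k calls at 2,500ms avg
print(round(total_hours(50_000, 2_500), 1))   # 34.7
```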

Step 3: Use EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT * FROM attempts WHERE user_id = 12345 ORDER BY created_at DESC;

-- Output showed:
-- Seq Scan on attempts  (cost=0.00..1250000.00 rows=50000000)
-- Planning Time: 0.5ms
-- Execution Time: 2500.0ms
-- ^ Indicates full table scan, needs index
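Reading many text plans by hand gets tedious. With EXPLAIN (FORMAT JSON) the plan becomes machine-checkable; a small sketch that flags sequential scans (the plan below is hand-abbreviated, and `find_seq_scans` is an illustrative helper, not from the codebase):

```python
import json

def find_seq_scans(plan_node, found=None):
    """Recursively collect tables hit by a Seq Scan in a JSON plan."""
    if found is None:
        found = []
    if plan_node.get("Node Type") == "Seq Scan":
        found.append(plan_node.get("Relation Name"))
    for child in plan_node.get("Plans", []):
        find_seq_scans(child, found)
    return found

# Abbreviated EXPLAIN (FORMAT JSON) output for the attempts query
explain_json = '''
[{"Plan": {"Node Type": "Sort",
           "Plans": [{"Node Type": "Seq Scan",
                      "Relation Name": "attempts",
                      "Plan Rows": 50000000}]}}]
'''

plan = json.loads(explain_json)[0]["Plan"]
print(find_seq_scans(plan))   # ['attempts'] -> candidate for an index
```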

Index Creation Strategy

Composite Indexes for Multi-Column Queries:

-- User attempts (user_id + created_at)
CREATE INDEX idx_attempts_user_created
  ON attempts(user_id, created_at DESC);

-- Content by subject (subject_id + order)
CREATE INDEX idx_content_subject_order
  ON content(subject_id, display_order);

-- Session lookup (user_id + status)
CREATE INDEX idx_sessions_user_status
  ON sessions(user_id, status);

-- Analytics rollup (event_date + user_id)
CREATE INDEX idx_analytics_date_user
  ON analytics_events(event_date, user_id);

Partial Indexes for Filtered Queries:

-- Active sessions only (status = 'active')
CREATE INDEX idx_sessions_active
  ON sessions(user_id, created_at)
  WHERE status = 'active';

-- Recent attempts (partial-index predicates must use IMMUTABLE
-- expressions, so NOW() is not allowed; use a fixed cutoff and
-- recreate the index periodically)
CREATE INDEX idx_attempts_recent
  ON attempts(user_id, created_at)
  WHERE created_at > '2026-01-01';

Query Rewriting Examples

Example 1: Eliminate SELECT *

-- Before
SELECT * FROM users WHERE email = 'user@example.com';
-- Fetches 25 columns, including JSONB fields (slow)

-- After
SELECT id, email, name, created_at FROM users WHERE email = 'user@example.com';
-- Fetches only needed columns (5× faster)

Example 2: Add Pagination

-- Before
SELECT * FROM attempts WHERE user_id = 12345 ORDER BY created_at DESC;
-- Fetches all 5,000 rows (slow)

-- After
SELECT id, bit_id, correct, created_at
FROM attempts
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 100 OFFSET 0;
-- Fetches only 100 rows per page (50× faster)
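One caveat: OFFSET still walks past every skipped row, so deep pages slow back down. Keyset (cursor) pagination uses the same composite index and stays fast at any depth. A sketch that builds the paginated SQL (`attempts_page_sql` is a hypothetical helper; parameters are psycopg2-style placeholders):

```python
def attempts_page_sql(after_created_at=None):
    """Build a keyset-paginated query; served by idx_attempts_user_created."""
    where = "WHERE user_id = %(user_id)s"
    if after_created_at is not None:
        # Resume strictly below the last row of the previous page
        where += " AND created_at < %(after_created_at)s"
    return (
        "SELECT id, bit_id, correct, created_at "
        "FROM attempts "
        f"{where} "
        "ORDER BY created_at DESC "
        "LIMIT 100"
    )

first_page = attempts_page_sql()
next_page = attempts_page_sql(after_created_at="2026-01-20T00:00:00Z")
```

If created_at can tie, add id as a tiebreaker to both the ORDER BY and the cursor condition.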

Example 3: Use JOIN instead of Subquery

-- Before (subquery, slow)
SELECT * FROM users
WHERE id IN (
    SELECT user_id FROM sessions WHERE status = 'active'
);
-- Execution time: 1,800ms

-- After (JOIN, fast)
SELECT DISTINCT u.*
FROM users u
INNER JOIN sessions s ON s.user_id = u.id
WHERE s.status = 'active';
-- Execution time: 120ms (15× faster)

Example 4: Use COUNT(*) Optimization

-- Before (exact count, slow for large tables)
SELECT COUNT(*) FROM attempts WHERE user_id = 12345;
-- Execution time: 800ms

-- After (approximate count for display purposes)
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname = 'attempts';
-- Execution time: 5ms (160× faster)
-- Caveat: reltuples estimates the WHOLE table, not one user's rows,
-- so this suits table-level "~50M attempts" displays only
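Because reltuples is a whole-table figure, a per-user "~5,000 attempts" display needs a different trick: run a plain EXPLAIN (no ANALYZE, so it never executes) and read the planner's row estimate. A sketch (`planner_row_estimate` is an illustrative helper; the EXPLAIN line is hand-abbreviated):

```python
import re

def planner_row_estimate(explain_line):
    """Pull the planner's row estimate out of a text EXPLAIN line."""
    match = re.search(r"rows=(\d+)", explain_line)
    return int(match.group(1)) if match else None

# First line of: EXPLAIN SELECT * FROM attempts WHERE user_id = 12345;
line = ("Index Scan using idx_attempts_user_created on attempts"
        "  (cost=0.56..4890.12 rows=5012 width=64)")
estimate = planner_row_estimate(line)
print(f"~{round(estimate, -3):,} attempts")   # ~5,000 attempts
```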

Database Connection Pooling

Before: New Connection Per Request

# src/core/database.py
import psycopg2

def get_db_connection():
    # New TCP handshake + TLS + auth on every single request
    return psycopg2.connect(
        host=DB_HOST,
        database=DB_NAME,
        user=DB_USER,
        password=DB_PASSWORD
    )
# Connection time: 50-100ms per request

After: Connection Pooling with SQLAlchemy

# src/core/database.py
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,           # 20 persistent connections
    max_overflow=10,        # Up to 30 total connections
    pool_recycle=3600,      # Recycle connections every hour
    pool_pre_ping=True      # Verify connection before use
)

def get_db_connection():
    return engine.connect()
# Connection time: <1ms (reuse existing)
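To see why reuse wins, here is a dependency-free toy pool: the open cost is paid once per connection, then every request is just a queue pop. This is a sketch only; the real code uses SQLAlchemy's QueuePool above, and `ToyPool`/`fake_open` are illustrative stand-ins:

```python
import queue

class ToyPool:
    """Minimal connection pool: open once, reuse on every request."""
    def __init__(self, size, open_connection):
        self._idle = queue.Queue()
        self.opened = 0
        for _ in range(size):
            self._idle.put(open_connection())   # the 50-100ms part
            self.opened += 1

    def acquire(self):
        return self._idle.get()                 # <1ms: just a queue pop

    def release(self, conn):
        self._idle.put(conn)

def fake_open():
    # Stand-in for psycopg2.connect
    return object()

pool = ToyPool(size=2, open_connection=fake_open)
for _ in range(100):                            # 100 requests...
    conn = pool.acquire()
    pool.release(conn)
print(pool.opened)                              # 2 connections ever opened
```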

Performance Impact

Query Execution Time

Top 10 Queries After Optimization:

Optimized Query Performance (7-day period)
┌────────────────────────────────────────────────────────────────┐
│ Query                      Before    After    Improvement      │
│ 1. User attempts history   2,500ms   15ms     167× faster      │
│ 2. Content by subject      1,800ms   25ms     72× faster       │
│ 3. Leaderboard calc        3,200ms   180ms    18× faster       │
│ 4. User progress agg       2,100ms   40ms     52× faster       │
│ 5. Session summary         1,500ms   30ms     50× faster       │
│ 6. Analytics rollup        5,000ms   200ms    25× faster       │
│ 7. Badge eligibility       900ms     12ms     75× faster       │
│ 8. Streak calculation      1,200ms   20ms     60× faster       │
│ 9. Content recommendations 1,600ms   35ms     46× faster       │
│ 10. Achievement tracking   800ms     10ms     80× faster       │
│                                                                │
│ Avg improvement: 64× faster                                    │
│ Time saved: ~103 hours/week (98% reduction)                    │
└────────────────────────────────────────────────────────────────┘

RDS Resource Utilization

CPU and IOPS Metrics:

RDS Database Metrics (30-day comparison)
┌────────────────────────────────────────────────┐
│ Metric                 Before    After   Change │
│ Avg CPU utilization:   82%       28%     -66%   │
│ Peak CPU:              95%       45%     -53%   │
│ Avg IOPS:              8,500     2,100   -75%   │
│ Read IOPS:             7,000     1,500   -79%   │
│ Write IOPS:            1,500     600     -60%   │
│ Network throughput:    120 MB/s  35 MB/s -71%   │
│ Active connections:    180       120     -33%   │
│ Buffer cache hit rate: 45%       92%     +104%  │
└────────────────────────────────────────────────┘

Lambda Execution Impact

Lambda Duration Savings:

Lambda Performance (API endpoints)
┌────────────────────────────────────────────────┐
│ Endpoint              Before    After   Change │
│ GET /user/progress    3,200ms   180ms   -94%  │
│ GET /user/attempts    2,800ms   120ms   -96%  │
│ GET /content/lessons  2,100ms   150ms   -93%  │
│ GET /leaderboard      4,500ms   450ms   -90%  │
│ GET /user/badges      1,500ms   80ms    -95%  │
│                                                │
│ Avg API latency:      2,820ms   196ms   -93%  │
│ P95 latency:          4,200ms   350ms   -92%  │
│ Timeout errors:       120/day   0       -100% │
└────────────────────────────────────────────────┘

Cost Impact

RDS Instance Rightsizing

Instance Downgrade:

RDS Instance Optimization
┌──────────────────────────────────────────────────┐
│                      Before         After        │
│ Instance type:       db.r5.2xlarge  db.r5.large  │
│ vCPUs:               8              2            │
│ RAM:                 64 GB          16 GB        │
│ Cost:                $840/month     $210/month   │
│                                                  │
│ Justification:                                   │
│ - CPU utilization: 82% → 28% (can downgrade)     │
│ - Query optimization reduced resource needs      │
│                                                  │
│ Monthly savings: $630                            │
└──────────────────────────────────────────────────┘

Lambda Cost Reduction

Duration-Based Cost Savings:

Lambda Costs (Monthly)
┌────────────────────────────────────────────────┐
│                        Before    After  Savings│
│ Invocations:           $400      $400   $0    │
│ Duration (DB queries): $680      $120   $560  │
│ Total:                 $1,080    $520   $560  │
│                                                │
│ Breakdown:                                     │
│ - Avg duration: 2.8s → 0.2s (93% faster)      │
│ - Monthly invocations: 3M requests            │
│ - Duration cost: 3M × 2.8s → 3M × 0.2s        │
└────────────────────────────────────────────────┘

Total Cost Savings: $630 (RDS) + $560 (Lambda) = $1,190/month

Monitoring and Alerting

CloudWatch RDS Metrics

Custom Dashboard:

RDS Query Performance Dashboard
┌────────────────────────────────────────────────┐
│ Slow Query Count (>100ms):     8 queries       │
│ Avg Query Duration:            45ms            │
│ P95 Query Duration:            120ms           │
│ P99 Query Duration:            280ms           │
│ Index Usage Rate:              96%             │
│ Sequential Scan Rate:          4%              │
│ Cache Hit Rate:                92%             │
│ Active Connections:            85 / 150        │
└────────────────────────────────────────────────┘

Alerting Rules:

  1. Slow query count > 50 - Alert if many queries exceed 100ms
  2. Sequential scan rate > 10% - Indicates missing indexes
  3. Cache hit rate < 85% - Indicates poor query patterns
  4. CPU utilization > 60% - May need further optimization
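The four rules above can be kept as plain data and evaluated against each metrics snapshot before anything is pushed to CloudWatch. A sketch under assumed names (`ALERT_RULES`, `triggered_alerts`, and the metric keys are all illustrative):

```python
# Alert thresholds from the rules above; first element is the bad direction
ALERT_RULES = {
    "slow_query_count":    ("above", 50),
    "seq_scan_rate_pct":   ("above", 10),
    "cache_hit_rate_pct":  ("below", 85),
    "cpu_utilization_pct": ("above", 60),
}

def triggered_alerts(metrics):
    """Return the names of rules whose threshold is breached."""
    fired = []
    for name, (direction, threshold) in ALERT_RULES.items():
        value = metrics[name]
        if (direction == "above" and value > threshold) or \
           (direction == "below" and value < threshold):
            fired.append(name)
    return fired

# Snapshot matching the dashboard numbers
snapshot = {
    "slow_query_count": 8,
    "seq_scan_rate_pct": 4,
    "cache_hit_rate_pct": 92,
    "cpu_utilization_pct": 28,
}
print(triggered_alerts(snapshot))   # [] -> all healthy
```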

Results Summary

RDS Query Optimization Impact (30-day comparison)
┌────────────────────────────────────────────────┐
│ Metric                 Before    After  Change │
│ Avg query time:        2,820ms   196ms  -93%  │
│ Time in slow queries:  105h/wk   2h/wk  -98%  │
│ RDS CPU utilization:   82%       28%    -66%  │
│ RDS IOPS:              8,500     2,100  -75%  │
│ Lambda timeout errors: 120/day   0      -100% │
│ RDS instance cost:     $840      $210   -75%  │
│ Lambda duration cost:  $680      $120   -82%  │
│ Total monthly savings: -         -      $1,190│
└────────────────────────────────────────────────┘

Quantified Outcomes:

  • 64× average speedup - critical user-attempts query went from 2,500ms → 15ms (167×)
  • 93% reduction in API latency - 2.8s → 0.2s average
  • $1,190/month saved - RDS downsizing + Lambda duration
  • 100% elimination of timeouts - 120/day → 0
  • ~103 hours/week saved - Reduced time in database queries

Key Takeaways

  1. Indexes are non-negotiable for production databases. Every filtered or sorted column should have an appropriate index.

  2. SELECT * is expensive. Fetching only needed columns reduced query time by 5× in some cases.

  3. EXPLAIN ANALYZE reveals truth. Query performance intuition is often wrong—always measure.

  4. Connection pooling matters. 50-100ms connection overhead per request adds up quickly.

  5. Cost optimization follows performance optimization. Faster queries enabled RDS downsizing, compounding savings.

Database query optimization delivered the highest ROI of any performance improvement: 93% latency reduction and $1,190/month in cost savings with a one-time engineering investment.


Related Posts:

  • API Response Caching Strategy: Reduce Database Load
  • Scheduler Query Optimization: Background Job Efficiency
  • Lambda SnapStart Rollout & Disable

Documentation: 2026-01-20-lambda-perf-analysis-findings.md
Impact: 64× average query speedup, $1,190/month saved, zero timeout errors