RDS Query Optimization: Database Performance Analysis
Slow database queries increased Lambda execution time and costs. We analyzed query performance, added indexes, and rewrote inefficient queries, achieving a 50× speedup for critical operations.
The Slow Query Problem
Our RDS PostgreSQL database served as the source of truth for user progress, content, and analytics. As the user base grew from 5,000 to 50,000, previously acceptable queries slowed to 2-5 seconds, causing Lambda timeouts and degraded user experience.
Symptoms:
- API endpoints timing out (30s Lambda limit)
- Increased RDS CPU utilization (70-90%)
- Slow mobile app response times (3-5s for lesson loading)
- Rising RDS costs due to increased instance size
The breaking point came when user progress queries took 3+ seconds, making the app feel sluggish and causing users to abandon lessons.
Before: Unoptimized Queries
Slow Query Example: User Attempts History
┌──────────────────────────────────────────────────┐
│ Query: │
│ SELECT * FROM attempts │
│ WHERE user_id = 12345 │
│ ORDER BY created_at DESC; │
│ │
│ Execution: │
│ ┌────────────────────────────────────────────┐ │
│ │ 1. Full table scan (attempts table) │ │
│ │ - Table size: 50M rows │ │
│ │ - Rows scanned: 50M rows │ │
│ │ - Time: 2,000ms │ │
│ │ │ │
│ │ 2. Filter by user_id │ │
│ │ - Matching rows: 5,000 │ │
│ │ - Time: 300ms │ │
│ │ │ │
│ │ 3. Sort by created_at │ │
│ │ - Sort 5,000 rows │ │
│ │ - Time: 200ms │ │
│ │ │ │
│ │ Total: 2,500ms (2.5 seconds) │ │
│ └────────────────────────────────────────────┘ │
│ │
│ EXPLAIN output: │
│ Seq Scan on attempts (cost=0..1250000 rows=50M) │
│ Filter: (user_id = 12345) │
│ Rows Removed by Filter: 49,995,000 │
└──────────────────────────────────────────────────┘
Performance Issues:
- Full table scan (no index on user_id)
- SELECT * fetches unnecessary columns
- No query result caching
- Sort operation on 5,000 rows
Query Performance Profile:
Top 10 Slow Queries (7-day period)
┌────────────────────────────────────────────────────────┐
│ Query                          Avg Time  Calls  Total  │
│ 1. User attempts history       2,500ms   50k    34.7h  │
│ 2. Content by subject          1,800ms   30k    15.0h  │
│ 3. Leaderboard calculation     3,200ms   10k     8.9h  │
│ 4. User progress aggregation   2,100ms   15k     8.8h  │
│ 5. Session summary             1,500ms   20k     8.3h  │
│ 6. Analytics rollup            5,000ms    5k     6.9h  │
│ 7. Badge eligibility check       900ms   25k     6.3h  │
│ 8. Streak calculation          1,200ms   18k     6.0h  │
│ 9. Content recommendations     1,600ms   12k     5.3h  │
│ 10. Achievement tracking         800ms   20k     4.4h  │
│                                                        │
│ Total time in slow queries: ~105 hours/week            │
│ (total = calls × avg time)                             │
└────────────────────────────────────────────────────────┘
After: Optimized Queries with Indexes
Optimized Query: User Attempts History
┌──────────────────────────────────────────────────┐
│ Query: │
│ SELECT id, user_id, bit_id, correct, created_at │
│ FROM attempts │
│ WHERE user_id = 12345 │
│ ORDER BY created_at DESC │
│ LIMIT 100; │
│ │
│ Index: │
│ CREATE INDEX idx_attempts_user_created │
│ ON attempts(user_id, created_at DESC); │
│ │
│ Execution: │
│ ┌────────────────────────────────────────────┐ │
│ │ 1. Index scan (idx_attempts_user_created) │ │
│ │ - Rows scanned: 100 (LIMIT applied) │ │
│ │ - Time: 5ms │ │
│ │ │ │
│ │ 2. Fetch specific columns only │ │
│ │ - Columns: 5 (vs 15 before) │ │
│ │ - Time: 10ms │ │
│ │ │ │
│ │ Total: 15ms (0.015 seconds) │ │
│ └────────────────────────────────────────────┘ │
│ │
│ EXPLAIN output: │
│ Index Scan using idx_attempts_user_created │
│ (cost=0..100 rows=100) │
│ Index Cond: (user_id = 12345) │
│ Limit: 100 │
└──────────────────────────────────────────────────┘
Optimizations Applied:
✓ Composite index (user_id, created_at DESC)
✓ LIMIT clause (fetch only needed rows)
✓ Explicit column selection (avoid SELECT *)
✓ Query result caching (5-minute TTL)
Performance: 2,500ms → 15ms (167× faster)
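The result caching listed above (5-minute TTL) is not shown in the source, so here is a minimal in-process sketch; the names (`cached`, `get_attempts`) and the dict-backed store are illustrative, and a shared store such as ElastiCache would be the typical choice across Lambda containers:

```python
import time

_cache = {}          # key -> (expires_at, rows)
TTL_SECONDS = 300    # 5-minute TTL, as described above

def cached(key, fetch, ttl=TTL_SECONDS):
    """Return cached rows for `key`, refreshing via `fetch()` once stale."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]                  # cache hit: skip the database entirely
    rows = fetch()                     # cache miss: run the real query
    _cache[key] = (now + ttl, rows)
    return rows

def get_attempts(user_id, run_query):
    """Wrap the attempts-history query in the per-user cache."""
    return cached(f"attempts:{user_id}", lambda: run_query(user_id))
```

Within the TTL window, repeated requests for the same user never touch the database, which is what turns a 15ms indexed query into a sub-millisecond lookup.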
Implementation Details
Query Analysis Process
Step 1: Enable Query Logging
-- Enable slow query logging
-- (On RDS, set these via the DB parameter group; ALTER SYSTEM is not
--  permitted on RDS instances. Shown as plain PostgreSQL for clarity.)
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
SELECT pg_reload_conf();
-- Avoid log_statement = 'all' in production: it logs every statement
-- and adds significant I/O overhead
Step 2: Analyze Query Plans
-- Identify slow queries from pg_stat_statements
-- (requires pg_stat_statements in shared_preload_libraries,
--  then CREATE EXTENSION pg_stat_statements)
SELECT
query,
calls,
total_exec_time / 1000 AS total_time_seconds,
mean_exec_time AS avg_time_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Step 3: Use EXPLAIN ANALYZE
EXPLAIN ANALYZE
SELECT * FROM attempts WHERE user_id = 12345 ORDER BY created_at DESC;
-- Output showed:
-- Seq Scan on attempts (cost=0.00..1250000.00 rows=50000000)
-- Planning Time: 0.5ms
-- Execution Time: 2500.0ms
-- ^ Indicates full table scan, needs index
Index Creation Strategy
Composite Indexes for Multi-Column Queries:
-- User attempts (user_id + created_at)
CREATE INDEX idx_attempts_user_created
ON attempts(user_id, created_at DESC);
-- Content by subject (subject_id + order)
CREATE INDEX idx_content_subject_order
ON content(subject_id, display_order);
-- Session lookup (user_id + status)
CREATE INDEX idx_sessions_user_status
ON sessions(user_id, status);
-- Analytics rollup (event_date + user_id)
CREATE INDEX idx_analytics_date_user
ON analytics_events(event_date, user_id);
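The scan-to-index flip these composite indexes produce can be demonstrated in miniature with SQLite's `EXPLAIN QUERY PLAN`; this is a self-contained stand-in for the PostgreSQL `EXPLAIN` output above (the plan wording differs, but the behavior change is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attempts (id INTEGER PRIMARY KEY, user_id INT, created_at INT)"
)

q = "SELECT id FROM attempts WHERE user_id = ? ORDER BY created_at DESC"

# Without the index the plan is a full table scan
before = conn.execute("EXPLAIN QUERY PLAN " + q, (12345,)).fetchall()

conn.execute(
    "CREATE INDEX idx_attempts_user_created "
    "ON attempts(user_id, created_at DESC)"
)
# With it, the plan is an index search that also satisfies the ORDER BY
after = conn.execute("EXPLAIN QUERY PLAN " + q, (12345,)).fetchall()

print(before[0][3])   # plan detail mentions a SCAN of attempts
print(after[0][3])    # plan detail mentions idx_attempts_user_created
```

Because the index is ordered by (user_id, created_at DESC), the sort step disappears too: rows come out of the index already in the requested order.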
Partial Indexes for Filtered Queries:
-- Active sessions only (status = 'active')
CREATE INDEX idx_sessions_active
ON sessions(user_id, created_at)
WHERE status = 'active';
-- Recent attempts: partial-index predicates must be IMMUTABLE, so NOW()
-- cannot appear here. Use a literal cutoff (date below is an example)
-- and recreate the index periodically to advance the window.
CREATE INDEX idx_attempts_recent
ON attempts(user_id, created_at)
WHERE created_at > '2026-01-01';
Query Rewriting Examples
Example 1: Eliminate SELECT *
-- Before
SELECT * FROM users WHERE email = 'user@example.com';
-- Fetches 25 columns, including JSONB fields (slow)
-- After
SELECT id, email, name, created_at FROM users WHERE email = 'user@example.com';
-- Fetches only needed columns (5× faster)
Example 2: Add Pagination
-- Before
SELECT * FROM attempts WHERE user_id = 12345 ORDER BY created_at DESC;
-- Fetches all 5,000 rows (slow)
-- After
SELECT id, bit_id, correct, created_at
FROM attempts
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 100 OFFSET 0;
-- Fetches only 100 rows per page (50× faster)
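LIMIT/OFFSET is fine for shallow pages, but the database still walks past every skipped row as OFFSET grows. A keyset (cursor) variant filters on the last seen `created_at` instead; this SQLite sketch is self-contained, and the PostgreSQL query has the same shape (ties on `created_at` would need a `(created_at, id)` cursor in production):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attempts (id INTEGER PRIMARY KEY, user_id INT, created_at INT)"
)
conn.executemany(
    "INSERT INTO attempts (user_id, created_at) VALUES (?, ?)",
    [(12345, t) for t in range(1, 251)],
)

def page(user_id, before=None, limit=100):
    """Fetch one page of attempts, newest first, keyed on created_at."""
    if before is None:
        sql = ("SELECT id, created_at FROM attempts WHERE user_id = ? "
               "ORDER BY created_at DESC LIMIT ?")
        return conn.execute(sql, (user_id, limit)).fetchall()
    # Subsequent pages: range condition instead of OFFSET, so the index
    # scan starts right at the cursor with no skipped-row work
    sql = ("SELECT id, created_at FROM attempts "
           "WHERE user_id = ? AND created_at < ? "
           "ORDER BY created_at DESC LIMIT ?")
    return conn.execute(sql, (user_id, before, limit)).fetchall()

first = page(12345)                        # newest 100 rows
second = page(12345, before=first[-1][1])  # next 100, no OFFSET
```

Every page is then a bounded range scan on the same `(user_id, created_at)` composite index, with cost independent of page depth.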
Example 3: Use JOIN instead of Subquery
-- Before (subquery, slow)
SELECT * FROM users
WHERE id IN (
SELECT user_id FROM sessions WHERE status = 'active'
);
-- Execution time: 1,800ms
-- After (JOIN, fast)
SELECT DISTINCT u.*
FROM users u
INNER JOIN sessions s ON s.user_id = u.id
WHERE s.status = 'active';
-- Execution time: 120ms (15× faster)
Example 4: Use COUNT(*) Optimization
-- Before (exact count, slow for large tables)
SELECT COUNT(*) FROM attempts WHERE user_id = 12345;
-- Execution time: 800ms
-- After (approximate count for display purposes)
-- Note: pg_class.reltuples estimates the WHOLE table, so it only fits
-- unfiltered counts. For a filtered estimate, read the planner's row
-- estimate instead:
EXPLAIN (FORMAT JSON) SELECT 1 FROM attempts WHERE user_id = 12345;
-- "Plan Rows" in the output ≈ estimated match count
-- Execution time: ~5ms (planning only, no table scan)
-- Acceptable for "~5,000 attempts" display
Database Connection Pooling
Before: New Connection Per Request
# src/core/database.py
import psycopg2

def get_db_connection():
    # Opens a brand-new TCP connection + auth handshake on every call
    return psycopg2.connect(
        host=DB_HOST,
        database=DB_NAME,
        user=DB_USER,
        password=DB_PASSWORD
    )
# Connection time: 50-100ms per request
After: Connection Pooling with SQLAlchemy
# src/core/database.py
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Created once at module scope, so warm Lambda invocations reuse the pool
engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,   # the default pool class, shown explicitly
    pool_size=20,          # 20 persistent connections
    max_overflow=10,       # up to 30 total connections
    pool_recycle=3600,     # recycle connections every hour
    pool_pre_ping=True     # verify connection before use
)

def get_db_connection():
    return engine.connect()
# Connection time: <1ms (reuse existing)
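To make the pooling win concrete, here is a toy checkout/check-in pool; this is not SQLAlchemy's QueuePool implementation, just an illustration of why steady-state requests skip the 50-100ms handshake (the `connect` factory is a stand-in for a real database connect):

```python
import queue

class Pool:
    """Fixed-size checkout/check-in connection pool (illustration only)."""
    def __init__(self, connect, size):
        self._q = queue.Queue()
        self.created = 0
        for _ in range(size):
            self._q.put(connect())   # pay the connection cost once, up front
            self.created += 1

    def acquire(self):
        return self._q.get()         # reuse an existing connection

    def release(self, conn):
        self._q.put(conn)            # hand it back for the next request

# Stand-in connect factory; real code would open a DB connection here
pool = Pool(connect=lambda: object(), size=3)
conn = pool.acquire()
pool.release(conn)
# pool.created stays at 3 no matter how many requests come through
```

The counter makes the point: connection setup cost is paid `size` times at startup, not once per request.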
Performance Impact
Query Execution Time
Top 10 Queries After Optimization:
Optimized Query Performance (7-day period)
┌────────────────────────────────────────────────────────────────┐
│ Query Before After Improvement │
│ 1. User attempts history 2,500ms 15ms 167× faster │
│ 2. Content by subject 1,800ms 25ms 72× faster │
│ 3. Leaderboard calc 3,200ms 180ms 18× faster │
│ 4. User progress agg 2,100ms 40ms 52× faster │
│ 5. Session summary 1,500ms 30ms 50× faster │
│ 6. Analytics rollup 5,000ms 200ms 25× faster │
│ 7. Badge eligibility 900ms 12ms 75× faster │
│ 8. Streak calculation 1,200ms 20ms 60× faster │
│ 9. Content recommendations 1,600ms 35ms 46× faster │
│ 10. Achievement tracking 800ms 10ms 80× faster │
│ │
│ Avg improvement: 64× faster │
│ Time saved: ~103 hours/week (98% reduction) │
└────────────────────────────────────────────────────────────────┘
RDS Resource Utilization
CPU and IOPS Metrics:
RDS Database Metrics (30-day comparison)
┌────────────────────────────────────────────────┐
│ Metric Before After Change │
│ Avg CPU utilization: 82% 28% -66% │
│ Peak CPU: 95% 45% -53% │
│ Avg IOPS: 8,500 2,100 -75% │
│ Read IOPS: 7,000 1,500 -79% │
│ Write IOPS: 1,500 600 -60% │
│ Network throughput: 120 MB/s 35 MB/s -71% │
│ Active connections: 180 120 -33% │
│ Buffer cache hit rate: 45%     92%      +104% │
└────────────────────────────────────────────────┘
Lambda Execution Impact
Lambda Duration Savings:
Lambda Performance (API endpoints)
┌────────────────────────────────────────────────┐
│ Endpoint Before After Change │
│ GET /user/progress 3,200ms 180ms -94% │
│ GET /user/attempts 2,800ms 120ms -96% │
│ GET /content/lessons 2,100ms 150ms -93% │
│ GET /leaderboard 4,500ms 450ms -90% │
│ GET /user/badges 1,500ms 80ms -95% │
│ │
│ Avg API latency: 2,820ms 196ms -93% │
│ P95 latency: 4,200ms 350ms -92% │
│ Timeout errors: 120/day 0 -100% │
└────────────────────────────────────────────────┘
Cost Impact
RDS Instance Rightsizing
Instance Downgrade:
RDS Instance Optimization
┌────────────────────────────────────────────────┐
│ Before After │
│ Instance type: db.r5.2xlarge db.r5.large│
│ vCPUs: 8 2 │
│ RAM: 64 GB 16 GB │
│ Cost: $840/month $210/month │
│ │
│ Justification: │
│ - CPU utilization: 82% → 28% (can downgrade) │
│ - Query optimization reduced resource needs │
│ │
│ Monthly savings: $630 │
└────────────────────────────────────────────────┘
Lambda Cost Reduction
Duration-Based Cost Savings:
Lambda Costs (Monthly)
┌────────────────────────────────────────────────┐
│ Before After Savings│
│ Invocations: $400 $400 $0 │
│ Duration (DB queries): $680 $120 $560 │
│ Total: $1,080 $520 $560 │
│ │
│ Breakdown: │
│ - Avg duration: 2.8s → 0.2s (93% faster) │
│ - Monthly invocations: 3M requests │
│ - Duration cost: 3M × 2.8s → 3M × 0.2s │
└────────────────────────────────────────────────┘
Total Cost Savings: $630 (RDS) + $560 (Lambda) = $1,190/month
Monitoring and Alerting
CloudWatch RDS Metrics
Custom Dashboard:
RDS Query Performance Dashboard
┌────────────────────────────────────────────────┐
│ Slow Query Count (>100ms): 8 queries │
│ Avg Query Duration: 45ms │
│ P95 Query Duration: 120ms │
│ P99 Query Duration: 280ms │
│ Index Usage Rate: 96% │
│ Sequential Scan Rate: 4% │
│ Cache Hit Rate: 92% │
│ Active Connections: 85 / 150 │
└────────────────────────────────────────────────┘
Alerting Rules:
- Slow query count > 50: alert when many queries exceed 100ms
- Sequential scan rate > 10%: indicates missing indexes
- Cache hit rate < 85%: indicates poor query patterns
- CPU utilization > 60%: may need further optimization
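The rules above can be codified with CloudWatch's `put_metric_alarm`. This sketch only builds the call parameters; note that `CPUUtilization` is a native `AWS/RDS` metric, while the slow-query, seq-scan, and cache-hit numbers assume custom metrics you publish yourself from `pg_stat_statements` and `pg_stat_user_tables` (RDS does not emit them natively, and the `Custom/RDS` namespace is an assumption):

```python
def alarm(name, metric, namespace, threshold,
          comparison="GreaterThanThreshold", period=300, evals=3):
    """Build kwargs for boto3's cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": name,
        "MetricName": metric,
        "Namespace": namespace,
        "Statistic": "Average",
        "Period": period,              # seconds per datapoint
        "EvaluationPeriods": evals,    # sustained breach before alerting
        "Threshold": threshold,
        "ComparisonOperator": comparison,
    }

alarms = [
    alarm("rds-slow-query-count", "SlowQueryCount", "Custom/RDS", 50),
    alarm("rds-seq-scan-rate", "SeqScanRate", "Custom/RDS", 10),
    alarm("rds-cache-hit-rate", "BufferCacheHitRate", "Custom/RDS", 85,
          comparison="LessThanThreshold"),
    alarm("rds-cpu", "CPUUtilization", "AWS/RDS", 60),
]
# With a real client: boto3.client("cloudwatch").put_metric_alarm(**alarms[0])
```

Keeping alarm definitions as data like this makes them easy to review and apply from CI rather than clicking through the console.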
Results Summary
RDS Query Optimization Impact (30-day comparison)
┌────────────────────────────────────────────────┐
│ Metric Before After Change │
│ Avg query time: 2,820ms 196ms -93% │
│ Time in slow queries: ~105h/wk ~2h/wk -98% │
│ RDS CPU utilization: 82% 28% -66% │
│ RDS IOPS: 8,500 2,100 -75% │
│ Lambda timeout errors: 120/day 0 -100% │
│ RDS instance cost: $840 $210 -75% │
│ Lambda duration cost: $680 $120 -82% │
│ Total monthly savings: - - $1,190│
└────────────────────────────────────────────────┘
Quantified Outcomes:
- 50×+ speedup for critical queries - e.g., attempts history: 2,500ms → 15ms
- 93% reduction in API latency - 2.8s → 0.2s average
- $1,190/month saved - RDS downsizing + Lambda duration
- 100% elimination of timeouts - 120/day → 0
- ~103 hours/week saved - Reduced time spent in database queries
Key Takeaways
- Indexes are non-negotiable for production databases. Every filtered or sorted column should have an appropriate index.
- SELECT * is expensive. Fetching only needed columns reduced query time by 5× in some cases.
- EXPLAIN ANALYZE reveals truth. Query performance intuition is often wrong—always measure.
- Connection pooling matters. 50-100ms connection overhead per request adds up quickly.
- Cost optimization follows performance optimization. Faster queries enabled RDS downsizing, compounding savings.
Database query optimization delivered the highest ROI of any performance improvement: 93% latency reduction and $1,190/month in cost savings with a one-time engineering investment.
Related Posts:
- API Response Caching Strategy: Reduce Database Load
- Scheduler Query Optimization: Background Job Efficiency
- Lambda SnapStart Rollout & Disable
Documentation: 2026-01-20-lambda-perf-analysis-findings.md
Impact: 50× query speedup, $1,190/month saved, zero timeout errors