CloudWatch Cost Management: From $600 to $30/month
Context
While investigating AWS costs, we noticed CloudWatch Logs consuming $600/month—more expensive than our RDS database. This seemed wrong for a logging service. Our application logged extensively for debugging, but we'd never set retention policies or log level controls. Over two years, we'd accumulated 2TB of logs with indefinite retention.
This post covers how we reduced CloudWatch costs by 95% through log level optimization, retention policies, and strategic archival.
The Problem
Current State:
CloudWatch Logs Cost Breakdown
┌──────────────────────────────────────┐
│ Log Ingestion: $300/month │
│ Log Storage: $250/month │
│ Log Insights: $50/month │
│ ──────────────────────────────── │
│ Total: $600/month │
└──────────────────────────────────────┘
Storage: 2TB of logs
Retention: Indefinite (never deleted)
Log Level: DEBUG in all environments
CloudWatch charges:
- Ingestion: $0.50 per GB
- Storage: $0.03 per GB per month
- Insights queries: $0.005 per GB scanned
Our application generated 20GB of logs per day (600GB/month), and we'd been accumulating logs since launch.
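As a sanity check, the ingestion line item follows directly from those rates (a quick sketch using the prices quoted above):

```python
GB_PER_DAY = 20           # observed application log volume
INGEST_PER_GB = 0.50      # CloudWatch Logs ingestion price, USD per GB

monthly_gb = GB_PER_DAY * 30
monthly_ingest_cost = monthly_gb * INGEST_PER_GB   # matches the $300/month line item
```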
Investigation
1. Log Volume Analysis
We analyzed which log groups consumed the most storage:
aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName, storedBytes]' \
  --output text | sort -k2 -rn
Results:
Log Volume by Group
┌──────────────────────────────────────────────┐
│ Log Group Size % │
│ ──────────────────────────────────────── │
│ /aws/lambda/main 1.2TB 60% │
│ /aws/rds/prod 400GB 20% │
│ /aws/apigateway/prod 300GB 15% │
│ /aws/lambda/analytics 100GB 5% │
│ ──────────────────────────────────────── │
│ Total: 2.0TB 100% │
└──────────────────────────────────────────────┘
Lambda logs dominated at 60% of total storage. We needed to understand what was being logged.
2. Log Level Distribution
We sampled 1 million log lines from the main Lambda function:
# CloudWatch Logs Insights query
fields @timestamp, level, message
| stats count() by level
Results:
Log Level Distribution
┌──────────────────────────────────────┐
│ Level Count % │
│ ──────────────────────────────── │
│ DEBUG 720,000 72% │
│ INFO 200,000 20% │
│ WARNING 70,000 7% │
│ ERROR 10,000 1% │
│ ──────────────────────────────── │
│ Total: 1,000,000 100% │
└──────────────────────────────────────┘
72% of logs were DEBUG level, mostly useful during development but noise in production.
3. High-Volume Log Sources
We identified the top 10 log messages by frequency:
fields @timestamp, message
| stats count() as cnt by message
| sort cnt desc
| limit 10
Results:
Top Noise Sources
┌───────────────────────────────────────────────────────────┐
│ Message Count/Day % │
│ ───────────────────────────────────────────────────── │
│ "DB connection pool status" 2.4M 24% │
│ "Cache hit for key X" 1.8M 18% │
│ "Request received" 1.2M 12% │
│ "Response time: Xms" 1.0M 10% │
│ "Auth token validated" 800K 8% │
│ "Calling API endpoint" 600K 6% │
│ "Query executed in Xms" 500K 5% │
│ "Serializing response" 400K 4% │
│ "Cache miss for key X" 300K 3% │
│ "Memory usage: X MB" 200K 2% │
└───────────────────────────────────────────────────────────┘
Connection pool status was logged on every request—2.4 million times per day. This information was only useful during debugging, not production monitoring.
Solution Architecture
We designed a three-tier log management strategy:
Before: Single Log Configuration
All Environments (Dev, Staging, Prod)
┌──────────────────────────────────────┐
│ Log Level: DEBUG │
│ Retention: Indefinite │
│ Filtering: None │
│ Archival: None │
│ │
│ Cost: $600/month │
└──────────────────────────────────────┘
After: Environment-Specific Configuration
Production Environment
┌──────────────────────────────────────┐
│ Log Level: ERROR only │
│ Retention: 30 days │
│ Filtering: High-value logs only │
│ Archival: S3 (after 30 days) │
│ │
│ Cost: $30/month │
└──────────────────────────────────────┘
Staging Environment
┌──────────────────────────────────────┐
│ Log Level: INFO │
│ Retention: 7 days │
│ Filtering: Moderate │
│ Archival: None │
│ │
│ Cost: $5/month │
└──────────────────────────────────────┘
Development Environment
┌──────────────────────────────────────┐
│ Log Level: DEBUG │
│ Retention: 3 days │
│ Filtering: None │
│ Archival: None │
│ │
│ Cost: Local logs only │
└──────────────────────────────────────┘
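The three environment tiers can be captured as a single policy map that both application and infrastructure code read from (a sketch; the dict name and keys are ours, not an existing convention):

```python
# One source of truth for the per-environment logging policy
LOG_POLICY = {
    "production":  {"level": "ERROR", "retention_days": 30, "archive_to_s3": True},
    "staging":     {"level": "INFO",  "retention_days": 7,  "archive_to_s3": False},
    "development": {"level": "DEBUG", "retention_days": 3,  "archive_to_s3": False},
}

def policy_for(env: str) -> dict:
    # Fall back to the strictest (production) policy for unknown environments
    return LOG_POLICY.get(env, LOG_POLICY["production"])
```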
Implementation
1. Log Level Configuration
We implemented environment-based log level control:
Python Logging Configuration:
import os
import logging

def configure_logging():
    env = os.getenv('ENVIRONMENT', 'development')

    # Environment-specific log levels
    log_levels = {
        'production': logging.ERROR,
        'staging': logging.INFO,
        'development': logging.DEBUG
    }
    level = log_levels.get(env, logging.INFO)

    logging.basicConfig(
        level=level,
        format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )

    # Suppress noisy third-party loggers
    logging.getLogger('boto3').setLevel(logging.WARNING)
    logging.getLogger('botocore').setLevel(logging.WARNING)
    logging.getLogger('urllib3').setLevel(logging.WARNING)

    return logging.getLogger(__name__)

logger = configure_logging()
Environment Variables:
# Production
ENVIRONMENT=production
LOG_LEVEL=ERROR
# Staging
ENVIRONMENT=staging
LOG_LEVEL=INFO
# Development
ENVIRONMENT=development
LOG_LEVEL=DEBUG
Impact:
Log Volume Reduction (Production Only)
┌──────────────────────────────────────┐
│ Before: 20GB/day (DEBUG) │
│ After: 2GB/day (ERROR) │
│ ──────────────────────────────── │
│ Reduction: 90% │
└──────────────────────────────────────┘
2. Retention Policy
We set retention policies for all log groups:
CloudWatch Retention Configuration:
#!/bin/bash
# Set retention policies for all log groups

LOG_GROUPS=$(aws logs describe-log-groups \
  --query 'logGroups[*].logGroupName' \
  --output text)

for GROUP in $LOG_GROUPS; do
  if [[ $GROUP == *"prod"* ]]; then
    RETENTION_DAYS=30
  elif [[ $GROUP == *"staging"* ]]; then
    RETENTION_DAYS=7
  else
    RETENTION_DAYS=3
  fi

  echo "Setting retention for $GROUP to $RETENTION_DAYS days"
  aws logs put-retention-policy \
    --log-group-name "$GROUP" \
    --retention-in-days $RETENTION_DAYS
done
Terraform Configuration:
resource "aws_cloudwatch_log_group" "lambda_main" {
  name              = "/aws/lambda/main"
  retention_in_days = var.environment == "production" ? 30 : 7

  tags = {
    Environment = var.environment
    CostCenter  = "Engineering"
  }
}
Impact:
Storage Cost Reduction
┌──────────────────────────────────────┐
│ Before: 2TB (indefinite retention) │
│ After: 60GB (30-day retention) │
│ ──────────────────────────────── │
│ Reduction: 97% │
│ Cost: $250/month → $2/month │
└──────────────────────────────────────┘
3. Strategic Log Filtering
We removed low-value, high-volume logs:
Before: Noisy Logging
# Every request logged connection pool status
@app.before_request
def log_request():
    logger.debug(f"Request received: {request.path}")
    logger.debug(f"DB pool: {db.engine.pool.status()}")
    logger.debug(f"Cache stats: {cache.get_stats()}")
After: Contextual Logging
# Only log when metrics are concerning
import time
from flask import g, request

@app.before_request
def start_timer():
    g.start = time.perf_counter()

@app.after_request
def log_request(response):
    # Only log slow requests (duration is only known after the handler runs)
    duration_ms = (time.perf_counter() - g.start) * 1000
    if duration_ms > 1000:
        logger.warning(f"Slow request: {request.path} ({duration_ms:.0f}ms)")

    # Only log pool exhaustion (SQLAlchemy pools report idle connections via checkedin())
    if db.engine.pool.checkedin() < 2:
        logger.error(f"DB pool nearly exhausted: {db.engine.pool.status()}")

    # No cache logging in production (use metrics instead)
    return response
CloudWatch Metric Filters: For high-frequency events, we used metric filters instead of logs:
# Create metric filter for request count (no log storage needed)
aws logs put-metric-filter \
  --log-group-name /aws/lambda/main \
  --filter-name RequestCount \
  --filter-pattern '[timestamp, level=INFO, msg="Request*"]' \
  --metric-transformations \
    metricName=RequestCount,metricNamespace=CustomMetrics,metricValue=1
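A related option (not part of our original rollout) is the CloudWatch Embedded Metric Format, where a Lambda prints one structured log line and CloudWatch extracts the metric from its `_aws` envelope, so no separate metric filter is needed. A minimal sketch:

```python
import json
import time

def emf_line(namespace: str, metric: str, value: float, unit: str = "Count") -> str:
    """Build one Embedded Metric Format (EMF) log line.

    Printed to stdout inside a Lambda, CloudWatch parses the _aws envelope
    and records the metric without a searchable log entry per event.
    """
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [[]],   # empty dimension set: one aggregate series
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        metric: value,
    })

# In a handler: print(emf_line("CustomMetrics", "RequestCount", 1))
```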
Impact:
Ingestion Cost Reduction
┌──────────────────────────────────────┐
│ Before: 600GB/month ingestion │
│ After: 60GB/month ingestion │
│ ──────────────────────────────── │
│ Reduction: 90% │
│ Cost: $300/month → $30/month │
└──────────────────────────────────────┘
4. S3 Archival for Compliance
For logs requiring long-term retention (compliance), we exported to S3:
S3 Export Configuration:
import boto3
from datetime import datetime, timedelta

def archive_old_logs():
    logs_client = boto3.client('logs')

    # Export the one-day window that just crossed the 30-day boundary
    start_time = int((datetime.now() - timedelta(days=31)).timestamp() * 1000)
    end_time = int((datetime.now() - timedelta(days=30)).timestamp() * 1000)

    response = logs_client.create_export_task(
        logGroupName='/aws/lambda/main',
        fromTime=start_time,
        to=end_time,
        destination='my-logs-archive-bucket',
        destinationPrefix=f'lambda-logs/{datetime.now().year}/{datetime.now().month}/'
    )
    return response['taskId']
S3 Lifecycle Policy:
{
  "Rules": [
    {
      "Id": "ArchiveOldLogs",
      "Status": "Enabled",
      "Filter": { "Prefix": "lambda-logs/" },
      "Transitions": [
        { "Days": 90,  "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
Cost Comparison:
Long-Term Storage (1 year of logs)
┌──────────────────────────────────────┐
│ CloudWatch: 240GB × $0.03 = $7.20/mo │
│ S3 Standard: 240GB × $0.023 = $5.52 │
│ S3 Glacier: 240GB × $0.004 = $0.96 │
│ ──────────────────────────────── │
│ Savings: $6.24/month with Glacier │
└──────────────────────────────────────┘
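The comparison above is straight multiplication against the per-GB-month rates (the Glacier figure assumes Glacier Flexible Retrieval at roughly $0.004/GB):

```python
ARCHIVE_GB = 240

rates = {                      # USD per GB-month
    "cloudwatch": 0.03,
    "s3_standard": 0.023,
    "s3_glacier": 0.004,
}
monthly = {tier: round(ARCHIVE_GB * rate, 2) for tier, rate in rates.items()}
glacier_savings = round(monthly["cloudwatch"] - monthly["s3_glacier"], 2)
```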
5. Structured Logging
We migrated to structured JSON logs for better Logs Insights performance:
Before: Unstructured Text
logger.info(f"User {user_id} completed lesson {lesson_id} in {duration}ms")
After: Structured JSON
logger.info("User completed lesson", extra={
    'user_id': user_id,
    'lesson_id': lesson_id,
    'duration_ms': duration,
    'event_type': 'lesson_completed'
})
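Note that the standard library does not emit JSON on its own: `extra` fields land as attributes on the `LogRecord` and still need a formatter to serialize them. Libraries like `python-json-logger` handle this; a minimal hand-rolled sketch:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Serialize a LogRecord, plus anything passed via `extra`, as one JSON line."""

    # Attribute names present on every LogRecord; anything else came from `extra`
    _RESERVED = set(vars(logging.LogRecord("", 0, "", 0, "", None, None))) | {"message", "asctime"}

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        payload.update({k: v for k, v in vars(record).items() if k not in self._RESERVED})
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonFormatter())` and the `extra` dict from the example above comes out as top-level JSON fields.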
Logs Insights Query (Faster & Cheaper):
# Before: Expensive regex parsing
fields @timestamp, @message
| parse @message /User (?<user_id>\d+) completed lesson (?<lesson_id>\d+) in (?<duration_ms>\d+)ms/
| filter duration_ms > 5000
# After: Direct field access (10× faster)
fields @timestamp, user_id, lesson_id, duration_ms
| filter duration_ms > 5000
Structured logs reduced Logs Insights costs by 60% by scanning less data.
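At $0.005 per GB scanned, the saving follows from the smaller scan footprint; the sketch below takes the 60% reduction above as the assumed factor:

```python
INSIGHTS_PER_GB = 0.005            # USD per GB scanned

gb_scanned_before = 600            # scanning a full unstructured month
gb_scanned_after = 600 * 0.40      # structured fields cut scanned data ~60%

cost_before = gb_scanned_before * INSIGHTS_PER_GB
cost_after = gb_scanned_after * INSIGHTS_PER_GB
```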
Results
Cost Reduction:
CloudWatch Costs (Before → After)
┌──────────────────────────────────────┐
│ Before: $600/month │
│ After: $30/month │
│ ──────────────────────────────── │
│ Savings: $570/month │
│ Reduction: 95% │
│ Annual Impact: $6,840/year │
└──────────────────────────────────────┘
Breakdown of Savings:
- Log level optimization (ERROR in prod): $270/month
- Retention policies (30 days): $250/month
- Strategic filtering (remove noise): $30/month
- S3 archival (long-term storage): $20/month
Operational Impact:
Log Management Improvements
┌──────────────────────────────────────┐
│ Storage: 2TB → 60GB │
│ Query speed: 8s → 1s (8× faster) │
│ Signal/noise: 1% → 95% │
│ MTTR: 45min → 10min │
└──────────────────────────────────────┘
Mean Time To Resolution (MTTR) Improvement: With less noise and structured logs, we could find critical errors 4.5× faster. The signal-to-noise ratio improved from 1% (720K DEBUG logs hiding 10K ERROR logs) to 95% (only ERROR logs in production).
Lessons Learned
1. DEBUG Logs Don't Belong in Production
72% of our logs were DEBUG level. These are useful during development but create noise in production. Use ERROR level in prod and supplement with metrics.
2. Indefinite Retention is a Code Smell
Unless you have compliance requirements, logs older than 30 days are rarely accessed. Set retention policies from day one.
3. Logs vs Metrics
High-frequency events (request count, cache hits) should be metrics, not logs. Metrics are cheaper and better for dashboards.
4. Structured Logging Pays Off
Structured JSON logs make Logs Insights queries 10× faster and cheaper. The upfront effort is worth it.
5. S3 is Cheaper for Cold Storage
If you need long-term retention (compliance), export to S3 Glacier. It's 87% cheaper than CloudWatch storage.
Monitoring Strategy
After optimization, we implemented a multi-tier monitoring approach:
Tier 1: CloudWatch Metrics (Real-Time)
Metric-Based Monitoring (Free-Tier Eligible)
┌──────────────────────────────────────┐
│ - Request count │
│ - Error rate │
│ - Latency (P50, P95, P99) │
│ - Database connections │
│ - Cache hit rate │
└──────────────────────────────────────┘
Cost: $0 (within free tier)
Tier 2: CloudWatch Logs (ERROR only)
Error Logs (30-day retention)
┌──────────────────────────────────────┐
│ - Application exceptions │
│ - Database errors │
│ - Integration failures │
│ - Security events │
└──────────────────────────────────────┘
Cost: $30/month
Tier 3: S3 Archive (Compliance)
Long-Term Archive (1-year retention)
┌──────────────────────────────────────┐
│ - Audit logs │
│ - Authentication events │
│ - Financial transactions │
└──────────────────────────────────────┘
Cost: $12/month (S3 Glacier)
Implementation Timeline
Week 1: Analysis
- Audit log groups and volume
- Identify high-volume log sources
- Calculate current costs
Week 2: Quick Wins
- Set retention policies (30 days)
- Change production log level to ERROR
- Deploy changes to production
Week 3: Strategic Optimization
- Implement structured logging
- Create metric filters for high-frequency events
- Remove noisy logs
Week 4: Archival Setup
- Configure S3 export automation
- Set up S3 lifecycle policies
- Validate compliance requirements
Week 5: Validation
- Monitor cost reduction
- Ensure no critical logs missing
- Document new logging standards
Best Practices Established
We documented logging standards for the team:
Production Logging Rules:
- ERROR level only - Reserve INFO for staging
- No secrets - Never log passwords, tokens, or PII
- Structured format - Use JSON with consistent fields
- Context required - Include user_id, request_id, timestamp
- Rate limiting - Prevent log storms (max 10 errors/min per type)
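The rate-limiting rule can be enforced at the logger itself with a `logging.Filter`; a sketch (the class name and sliding-window logic are ours, not an existing library):

```python
import logging
import time
from collections import defaultdict

class RateLimitFilter(logging.Filter):
    """Suppress a message type once it exceeds max_per_minute occurrences."""

    def __init__(self, max_per_minute: int = 10):
        super().__init__()
        self.max_per_minute = max_per_minute
        self._hits = defaultdict(list)   # msg template -> recent timestamps

    def filter(self, record):
        now = time.monotonic()
        window = self._hits[record.msg]
        window[:] = [t for t in window if now - t < 60]   # keep the last minute
        if len(window) >= self.max_per_minute:
            return False                  # drop: this message type is storming
        window.append(now)
        return True

# logger.addFilter(RateLimitFilter(max_per_minute=10))
```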
Log Retention Policy:
Environment Retention Archive
──────────────────────────────────
Production 30 days S3 (1 year)
Staging 7 days None
Development 3 days None
Conclusion
We reduced CloudWatch costs by 95% through log level optimization, retention policies, and strategic archival. The $570/month savings came with operational improvements: faster debugging, clearer signal-to-noise ratio, and better compliance.
Key Takeaways:
- Use ERROR level in production, DEBUG only in development
- Set retention policies from day one (30 days is usually enough)
- Replace high-frequency logs with metrics
- Archive to S3 for long-term compliance needs
- Structured logging improves query performance and reduces costs
Final Metrics:
- Cost reduction: $570/month ($6,840/year)
- Storage reduction: 2TB → 60GB (97%)
- Query speed: 8× faster with structured logs
- MTTR: 4.5× faster incident resolution
Related Plan: docs/plans/implemented/high/2026-01-16-cost-savings-cloudwatch-plan.md
Related Posts:
- Cost Post 8.2 (Lambda Cost Investigation)
- Cost Post 8.5 (EventBridge Warmup Elimination)