alqosh

CloudWatch Cost Management: From $600 to $30/month

·cost-optimization

Context

While investigating AWS costs, we noticed CloudWatch logs consuming $600/month—more expensive than our RDS database. This seemed wrong for a logging service. Our application logged extensively for debugging, but we'd never set retention policies or log level controls. Over two years, we'd accumulated 2TB of logs with indefinite retention.

This post covers how we reduced CloudWatch costs by 95% through log level optimization, retention policies, and strategic archival.

The Problem

Current State:

CloudWatch Logs Cost Breakdown
┌──────────────────────────────────────┐
│ Log Ingestion:      $300/month       │
│ Log Storage:        $250/month       │
│ Log Insights:       $50/month        │
│ ────────────────────────────────     │
│ Total:              $600/month       │
└──────────────────────────────────────┘

Storage: 2TB of logs
Retention: Indefinite (never deleted)
Log Level: DEBUG in all environments

CloudWatch charges:

  • Ingestion: $0.50 per GB
  • Storage: $0.03 per GB per month
  • Insights queries: $0.005 per GB scanned

Our application generated 20GB of logs per day (600GB/month), and we'd been accumulating logs since launch.
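The ingestion line in the cost breakdown follows directly from the listed price and the daily log volume. A minimal sketch of that arithmetic, where the function name and the 30-day month are assumptions made for illustration:

```python
# Listed CloudWatch Logs ingestion price from this section (USD per GB).
INGESTION_USD_PER_GB = 0.50


def monthly_ingestion_cost(gb_per_day: float, days_in_month: int = 30) -> float:
    """Estimate the monthly ingestion charge for a steady daily log volume."""
    return gb_per_day * days_in_month * INGESTION_USD_PER_GB


if __name__ == "__main__":
    # 20GB/day at DEBUG level, the volume measured before the optimization.
    print(monthly_ingestion_cost(20))  # 300.0
```

The same function applied to 2GB/day gives the $30/month ingestion figure reported after the change.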

Investigation

1. Log Volume Analysis

We analyzed which log groups consumed the most storage:

aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName, storedBytes]' \
  --output text | sort -k2 -rn

Results:

Log Volume by Group
┌──────────────────────────────────────────────┐
│ Log Group                  Size       %      │
│ ────────────────────────────────────────     │
│ /aws/lambda/main           1.2TB     60%     │
│ /aws/rds/prod              400GB     20%     │
│ /aws/apigateway/prod       300GB     15%     │
│ /aws/lambda/analytics      100GB      5%     │
│ ────────────────────────────────────────     │
│ Total:                     2.0TB    100%     │
└──────────────────────────────────────────────┘

Lambda logs dominated at 60% of total storage. We needed to understand what was being logged.

2. Log Level Distribution

We sampled 1 million log lines from the main Lambda function:

# CloudWatch Logs Insights query
fields @timestamp, level, message
| stats count() by level

Results:

Log Level Distribution
┌──────────────────────────────────────┐
│ Level      Count          %          │
│ ────────────────────────────────     │
│ DEBUG      720,000       72%         │
│ INFO       200,000       20%         │
│ WARNING     70,000        7%         │
│ ERROR       10,000        1%         │
│ ────────────────────────────────     │
│ Total:   1,000,000      100%         │
└──────────────────────────────────────┘

72% of logs were DEBUG level, mostly useful during development but noise in production.

3. High-Volume Log Sources

We identified the top 10 log messages by frequency:

fields @timestamp, message
| stats count() as message_count by message
| sort message_count desc
| limit 10

Results:

Top Noise Sources
┌───────────────────────────────────────────────────────────┐
│ Message                              Count/Day      %     │
│ ─────────────────────────────────────────────────────     │
│ "DB connection pool status"          2.4M         24%     │
│ "Cache hit for key X"                1.8M         18%     │
│ "Request received"                   1.2M         12%     │
│ "Response time: Xms"                 1.0M         10%     │
│ "Auth token validated"               800K          8%     │
│ "Calling API endpoint"               600K          6%     │
│ "Query executed in Xms"              500K          5%     │
│ "Serializing response"               400K          4%     │
│ "Cache miss for key X"               300K          3%     │
│ "Memory usage: X MB"                 200K          2%     │
└───────────────────────────────────────────────────────────┘

Connection pool status was logged on every request, 2.4 million times per day. This information was only useful during debugging, not production monitoring.
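Grouping messages such as "Cache hit for key X" requires collapsing the variable parts first, otherwise every distinct key or timing counts as a separate message. A rough sketch of one way to do this offline on sampled lines; the regular expressions and the `normalize` and `top_messages` helpers are illustrative assumptions, not the query we ran:

```python
import re
from collections import Counter

# Hypothetical normalization rules: collapse key names and numbers to "X".
_KEY = re.compile(r"(for key )\S+")
_NUMBER = re.compile(r"\d+(\.\d+)?")


def normalize(message: str) -> str:
    """Replace the variable parts of a log message with the placeholder X."""
    message = _KEY.sub(r"\1X", message)
    return _NUMBER.sub("X", message)


def top_messages(messages, limit=10):
    """Return the most frequent normalized messages with their counts."""
    return Counter(normalize(m) for m in messages).most_common(limit)
```

Feeding it a sample of raw messages yields (template, count) pairs in descending order of frequency, which is the shape of the noise table.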

Solution Architecture

We designed a three-tier log management strategy:

Before: Single Log Configuration

All Environments (Dev, Staging, Prod)
┌──────────────────────────────────────┐
│ Log Level:     DEBUG                 │
│ Retention:     Indefinite            │
│ Filtering:     None                  │
│ Archival:      None                  │
│                                      │
│ Cost:          $600/month            │
└──────────────────────────────────────┘

After: Environment-Specific Configuration

Production Environment
┌──────────────────────────────────────┐
│ Log Level:     ERROR only            │
│ Retention:     30 days               │
│ Filtering:     High-value logs only  │
│ Archival:      S3 (after 30 days)    │
│                                      │
│ Cost:          $30/month             │
└──────────────────────────────────────┘

Staging Environment
┌──────────────────────────────────────┐
│ Log Level:     INFO                  │
│ Retention:     7 days                │
│ Filtering:     Moderate              │
│ Archival:      None                  │
│                                      │
│ Cost:          $5/month              │
└──────────────────────────────────────┘

Development Environment
┌──────────────────────────────────────┐
│ Log Level:     DEBUG                 │
│ Retention:     3 days                │
│ Filtering:     None                  │
│ Archival:      None                  │
│                                      │
│ Cost:          Local logs only       │
└──────────────────────────────────────┘

Implementation

1. Log Level Configuration

We implemented environment-based log level control:

Python Logging Configuration:

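import os
import logging

def configure_logging():
    env = os.getenv('ENVIRONMENT', 'development')

    # Environment-specific default log levels
    log_levels = {
        'production': logging.ERROR,
        'staging': logging.INFO,
        'development': logging.DEBUG
    }

    level = log_levels.get(env, logging.INFO)

    # An explicit, valid LOG_LEVEL (e.g. "ERROR") overrides the default
    override = logging.getLevelName(os.getenv('LOG_LEVEL', '').upper())
    if isinstance(override, int):
        level = override

    logging.basicConfig(
        level=level,
        format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S',
        # Python 3.8+: replace handlers a runtime such as Lambda may have
        # installed already; without this, basicConfig can be a no-op
        force=True
    )

    # Suppress noisy third-party loggers
    logging.getLogger('boto3').setLevel(logging.WARNING)
    logging.getLogger('botocore').setLevel(logging.WARNING)
    logging.getLogger('urllib3').setLevel(logging.WARNING)

    return logging.getLogger(__name__)

logger = configure_logging()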

Environment Variables:

# Production
ENVIRONMENT=production
LOG_LEVEL=ERROR

# Staging
ENVIRONMENT=staging
LOG_LEVEL=INFO

# Development
ENVIRONMENT=development
LOG_LEVEL=DEBUG

Impact: production log volume dropped from 20GB/day (DEBUG) to 2GB/day (ERROR), a 90% reduction.

2. Retention Policy

We set retention policies for all log groups:

CloudWatch Retention Configuration:

#!/bin/bash
# Set retention policies for all log groups

LOG_GROUPS=$(aws logs describe-log-groups \
  --query 'logGroups[*].logGroupName' \
  --output text)

# Retention is chosen from the group name. Names containing neither
# "prod" nor "staging" fall through to the 3-day default.
for GROUP in $LOG_GROUPS; do
  if [[ $GROUP == *"prod"* ]]; then
    RETENTION_DAYS=30
  elif [[ $GROUP == *"staging"* ]]; then
    RETENTION_DAYS=7
  else
    RETENTION_DAYS=3
  fi

  echo "Setting retention for $GROUP to $RETENTION_DAYS days"
  aws logs put-retention-policy \
    --log-group-name "$GROUP" \
    --retention-in-days "$RETENTION_DAYS"
done
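The name-matching rule in the script is easy to check in isolation before running it against an account. A minimal Python sketch of the same rule, with `retention_days_for` as a hypothetical helper name. Note that a group such as /aws/lambda/main, whose name contains no environment, falls through to the 3-day default:

```python
def retention_days_for(group_name: str) -> int:
    """Mirror the shell script's substring rule for choosing retention."""
    if "prod" in group_name:
        return 30
    if "staging" in group_name:
        return 7
    return 3


if __name__ == "__main__":
    for name in ("/aws/rds/prod", "/aws/lambda/main"):
        print(name, retention_days_for(name))
```

Checking the output for every group name up front avoids shortening retention on a production group by accident.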

Terraform Configuration:

resource "aws_cloudwatch_log_group" "lambda_main" {
  name              = "/aws/lambda/main"
  retention_in_days = var.environment == "production" ? 30 : 7

  tags = {
    Environment = var.environment
    CostCenter  = "Engineering"
  }
}

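Impact: stored logs shrank from 2TB (indefinite retention) to 60GB (30-day retention), a 97% reduction that cut storage costs from $250/month to $2/month.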

3. Strategic Log Filtering

We removed low-value, high-volume logs:

Before: Noisy Logging

# Every request logged connection pool status
@app.before_request
def log_request():
    logger.debug(f"Request received: {request.path}")
    logger.debug(f"DB pool: {db.engine.pool.status()}")
    logger.debug(f"Cache stats: {cache.get_stats()}")

After: Contextual Logging

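# Only log when metrics are concerning
import time
from flask import g, request

@app.before_request
def start_timer():
    # Duration is unknown until the response is ready, so record the start
    g.start_time = time.monotonic()

@app.after_request
def log_request(response):
    # Only log slow requests
    duration_ms = (time.monotonic() - g.start_time) * 1000
    if duration_ms > 1000:
        logger.warning(f"Slow request: {request.path} ({duration_ms:.0f}ms)")

    # Only log pool exhaustion (assumes SQLAlchemy's QueuePool)
    pool = db.engine.pool
    if pool.size() - pool.checkedout() < 2:
        logger.error(f"DB pool nearly exhausted: {pool.status()}")

    # No cache logging in production (use metrics instead)
    return response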
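One related detail: an f-string passed to logger.debug, as in the noisy version, is built even when the DEBUG level is disabled, so diagnostic calls like the pool status lookup still run on every request. A small sketch of guarding such calls with the standard library's isEnabledFor; `expensive_pool_status` is a stand-in for any costly diagnostic call and the logger name is arbitrary:

```python
import logging

logger = logging.getLogger("app.requests")
calls = {"count": 0}


def expensive_pool_status() -> str:
    """Stand-in for a costly diagnostic call such as a pool status query."""
    calls["count"] += 1
    return "pool ok"


def log_pool_status() -> None:
    # Skip the diagnostic work entirely unless DEBUG output is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("DB pool: %s", expensive_pool_status())
```

With the logger at ERROR, `log_pool_status()` never calls the diagnostic function, so development-only logging costs nothing in production.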

CloudWatch Metric Filters: For high-frequency events, we used metric filters instead of logs:

# Create metric filter for request count (evaluated as log events are
# ingested, so the metric outlives the log retention period)
aws logs put-metric-filter \
  --log-group-name /aws/lambda/main \
  --filter-name RequestCount \
  --filter-pattern '[timestamp, level=INFO, msg="Request*"]' \
  --metric-transformations \
    metricName=RequestCount,metricNamespace=CustomMetrics,metricValue=1
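Counts can also be published directly as custom metrics rather than derived from log lines. A hedged sketch of building the arguments for boto3's put_metric_data; the namespace reuses CustomMetrics from the filter, while the CacheHitRate metric and the `build_metric_payload` helper are assumptions for illustration:

```python
def build_metric_payload(request_count: int, cache_hits: int, cache_misses: int) -> dict:
    """Build keyword arguments for cloudwatch.put_metric_data()."""
    total_lookups = cache_hits + cache_misses
    # Avoid dividing by zero when no cache lookups happened in the interval.
    hit_rate = 100.0 * cache_hits / total_lookups if total_lookups else 0.0
    return {
        "Namespace": "CustomMetrics",
        "MetricData": [
            {"MetricName": "RequestCount", "Value": request_count, "Unit": "Count"},
            {"MetricName": "CacheHitRate", "Value": hit_rate, "Unit": "Percent"},
        ],
    }


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# boto3.client("cloudwatch").put_metric_data(**build_metric_payload(1200, 900, 100))
```

Aggregating in the application and publishing once per interval keeps the number of API calls small compared with one log line per request.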

Impact: removing this noise accounted for roughly $30/month of the savings.

4. S3 Archival for Compliance

For logs requiring long-term retention (compliance), we exported to S3:

S3 Export Configuration:

import boto3
from datetime import datetime, timedelta, timezone

def archive_old_logs():
    logs_client = boto3.client('logs')

    # Export the one-day window of logs that are 30 to 31 days old.
    # Caution: events past the log group's retention period may no longer
    # be exportable, so the export has to run before they expire.
    now = datetime.now(timezone.utc)
    start_time = int((now - timedelta(days=31)).timestamp() * 1000)
    end_time = int((now - timedelta(days=30)).timestamp() * 1000)

    response = logs_client.create_export_task(
        logGroupName='/aws/lambda/main',
        fromTime=start_time,
        to=end_time,
        destination='my-logs-archive-bucket',
        destinationPrefix=f'lambda-logs/{now.year}/{now.month}/'
    )

    return response['taskId']
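The export window is the part most worth testing on its own, since a misaligned window silently skips or duplicates logs. A sketch that computes a UTC-day-aligned window as epoch milliseconds, the unit create_export_task takes for fromTime and to; `export_window` and its `age_days` parameter are hypothetical names:

```python
from datetime import datetime, timedelta, timezone


def export_window(now: datetime, age_days: int = 30) -> tuple[int, int]:
    """Return (from_ms, to_ms) for the UTC calendar day that began age_days before today."""
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start = midnight - timedelta(days=age_days)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)
```

Because the window snaps to midnight UTC, a daily job produces back-to-back windows regardless of the exact time it runs.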

S3 Lifecycle Policy:

{
  "Rules": [
    {
      "Id": "ArchiveOldLogs",
      "Status": "Enabled",
      "Filter": { "Prefix": "lambda-logs/" },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ]
    }
  ]
}

Cost Comparison: for one year of logs (240GB), S3 Glacier costs $0.96/month versus $7.20/month in CloudWatch.

5. Structured Logging

We migrated to structured JSON logs for better Logs Insights performance:

Before: Unstructured Text

logger.info(f"User {user_id} completed lesson {lesson_id} in {duration}ms")

After: Structured JSON

logger.info("User completed lesson", extra={
    'user_id': user_id,
    'lesson_id': lesson_id,
    'duration_ms': duration,
    'event_type': 'lesson_completed'
})
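Fields passed through extra only reach the log line if the formatter emits them; the plain text format configured earlier would drop them. A minimal JSON formatter sketch using only the standard library; the class name and the choice of output keys are assumptions:

```python
import json
import logging

# Attributes every LogRecord carries; anything else came in through `extra`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy the caller-supplied fields (user_id, lesson_id, ...).
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)
```

Attaching it with `handler.setFormatter(JsonFormatter())` makes user_id, lesson_id, and duration_ms available to Logs Insights as discovered fields.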

Logs Insights Query (Faster & Cheaper):

# Before: Expensive regex parsing
fields @timestamp, @message
| parse @message /User (?<user_id>\d+) completed lesson (?<lesson_id>\d+) in (?<duration_ms>\d+)ms/
| filter duration_ms > 5000

# After: Direct field access (8× faster: 8s → 1s)
fields @timestamp, user_id, lesson_id, duration_ms
| filter duration_ms > 5000

Structured logs reduced Logs Insights costs by 60% by scanning less data.

Log Volume Reduction (Production Only)
┌──────────────────────────────────────┐
│ Before: 20GB/day (DEBUG)             │
│ After:  2GB/day (ERROR)              │
│ ────────────────────────────────     │
│ Reduction: 90%                       │
└──────────────────────────────────────┘

Storage Cost Reduction
┌──────────────────────────────────────┐
│ Before: 2TB (indefinite retention)   │
│ After:  60GB (30-day retention)      │
│ ────────────────────────────────     │
│ Reduction: 97%                       │
│ Cost: $250/month → $2/month          │
└──────────────────────────────────────┘

Ingestion Cost Reduction
┌──────────────────────────────────────┐
│ Before: 600GB/month ingestion        │
│ After:  60GB/month ingestion         │
│ ────────────────────────────────     │
│ Reduction: 90%                       │
│ Cost: $300/month → $30/month         │
└──────────────────────────────────────┘

Long-Term Storage (1 year of logs)
┌──────────────────────────────────────┐
│ CloudWatch: 240GB × $0.03 = $7.20/mo │
│ S3 Standard: 240GB × $0.023 = $5.52  │
│ S3 Glacier: 240GB × $0.004 = $0.96   │
│ ────────────────────────────────     │
│ Savings: $6.24/month with Glacier    │
└──────────────────────────────────────┘
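The long-term storage comparison is a straight multiplication of the archived volume by each tier's per-GB price. A small sketch reproducing it, using the 240GB volume and the prices quoted in the table; the dictionary layout and function names are only illustrative:

```python
ARCHIVE_GB = 240  # one year of logs, from the table

# USD per GB per month, as quoted in the table.
PRICE_PER_GB = {
    "CloudWatch": 0.03,
    "S3 Standard": 0.023,
    "S3 Glacier": 0.004,
}


def monthly_cost(tier: str, gb: float = ARCHIVE_GB) -> float:
    """Monthly storage cost in USD, rounded to cents."""
    return round(gb * PRICE_PER_GB[tier], 2)


def savings_percent(cheaper: str, baseline: str = "CloudWatch") -> int:
    """Whole-percent saving of one tier relative to the baseline tier."""
    return round(100 * (1 - PRICE_PER_GB[cheaper] / PRICE_PER_GB[baseline]))
```

`savings_percent("S3 Glacier")` gives the 87% figure, and the difference between the CloudWatch and Glacier monthly costs gives the $6.24 saving.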

Results

Cost Reduction:

CloudWatch Costs (Before → After)
┌──────────────────────────────────────┐
│ Before:         $600/month           │
│ After:          $30/month            │
│ ────────────────────────────────     │
│ Savings:        $570/month           │
│ Reduction:      95%                  │
│ Annual Impact:  $6,840/year          │
└──────────────────────────────────────┘

Breakdown of Savings:

  • Log level optimization (ERROR in prod): $270/month
  • Retention policies (30 days): $250/month
  • Strategic filtering (remove noise): $30/month
  • S3 archival (long-term storage): $20/month

Operational Impact:

Log Management Improvements
┌──────────────────────────────────────┐
│ Storage:     2TB → 60GB              │
│ Query speed: 8s → 1s (8× faster)     │
│ Signal/noise: 1% → 95%               │
│ MTTR:        45min → 10min           │
└──────────────────────────────────────┘

Mean Time To Resolution (MTTR) Improvement: With less noise and structured logs, we could find critical errors 4.5× faster. The signal-to-noise ratio improved from 1% (720K DEBUG logs hiding 10K ERROR logs) to 95% (only ERROR logs in production).

Lessons Learned

1. DEBUG Logs Don't Belong in Production

72% of our logs were DEBUG level. These are useful during development but create noise in production. Use ERROR level in prod and supplement with metrics.

2. Indefinite Retention is a Code Smell

Unless you have compliance requirements, logs older than 30 days are rarely accessed. Set retention policies from day one.

3. Logs vs Metrics

High-frequency events (request count, cache hits) should be metrics, not logs. Metrics are cheaper and better for dashboards.

4. Structured Logging Pays Off

Structured JSON logs made our Logs Insights queries about 8× faster (8s → 1s) and cheaper. The upfront effort is worth it.

5. S3 is Cheaper for Cold Storage

If you need long-term retention (compliance), export to S3 Glacier. It's 87% cheaper than CloudWatch storage.

Monitoring Strategy

After optimization, we implemented a multi-tier monitoring approach:

Tier 1: CloudWatch Metrics (Real-Time)

Metric-Based Monitoring (Free-Tier Eligible)
┌──────────────────────────────────────┐
│ - Request count                      │
│ - Error rate                         │
│ - Latency (P50, P95, P99)            │
│ - Database connections               │
│ - Cache hit rate                     │
└──────────────────────────────────────┘

Cost: $0 (within free tier)

Tier 2: CloudWatch Logs (ERROR only)

Error Logs (30-day retention)
┌──────────────────────────────────────┐
│ - Application exceptions             │
│ - Database errors                    │
│ - Integration failures               │
│ - Security events                    │
└──────────────────────────────────────┘

Cost: $30/month

Tier 3: S3 Archive (Compliance)

Long-Term Archive (1-year retention)
┌──────────────────────────────────────┐
│ - Audit logs                         │
│ - Authentication events              │
│ - Financial transactions             │
└──────────────────────────────────────┘

Cost: $12/month (S3 Glacier)

Implementation Timeline

Week 1: Analysis

  • Audit log groups and volume
  • Identify high-volume log sources
  • Calculate current costs

Week 2: Quick Wins

  • Set retention policies (30 days)
  • Change production log level to ERROR
  • Deploy changes to production

Week 3: Strategic Optimization

  • Implement structured logging
  • Create metric filters for high-frequency events
  • Remove noisy logs

Week 4: Archival Setup

  • Configure S3 export automation
  • Set up S3 lifecycle policies
  • Validate compliance requirements

Week 5: Validation

  • Monitor cost reduction
  • Ensure no critical logs missing
  • Document new logging standards

Best Practices Established

We documented logging standards for the team:

Production Logging Rules:

  1. ERROR level only - Reserve INFO for staging
  2. No secrets - Never log passwords, tokens, or PII
  3. Structured format - Use JSON with consistent fields
  4. Context required - Include user_id, request_id, timestamp
  5. Rate limiting - Prevent log storms (max 10 errors/min per type)
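Rule 5 can be enforced inside the application with a logging filter. A sketch assuming that "type" means the logger name plus the unformatted message template, with a sliding one-minute window; `RateLimitFilter` and its injectable `clock` are illustrative names:

```python
import logging
import time
from collections import defaultdict, deque


class RateLimitFilter(logging.Filter):
    """Drop records once a message type exceeds max_per_minute."""

    def __init__(self, max_per_minute: int = 10, clock=time.monotonic):
        super().__init__()
        self.max_per_minute = max_per_minute
        self.clock = clock
        self._seen = defaultdict(deque)  # type key -> timestamps of kept records

    def filter(self, record: logging.LogRecord) -> bool:
        # Key on the template, not the final text, so arguments do not matter.
        key = (record.name, str(record.msg))
        now = self.clock()
        stamps = self._seen[key]
        # Forget records that fell out of the sliding 60-second window.
        while stamps and now - stamps[0] >= 60:
            stamps.popleft()
        if len(stamps) >= self.max_per_minute:
            return False
        stamps.append(now)
        return True
```

Attaching the filter to a handler with `handler.addFilter(RateLimitFilter())` applies it to every logger that writes through that handler. The limit only works per template if messages use %-style arguments rather than pre-formatted f-strings.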

Log Retention Policy:

Environment    Retention    Archive
──────────────────────────────────
Production     30 days      S3 (1 year)
Staging        7 days       None
Development    3 days       None

Conclusion

We reduced CloudWatch costs by 95% through log level optimization, retention policies, and strategic archival. The $570/month savings came with operational improvements: faster debugging, clearer signal-to-noise ratio, and better compliance.

Key Takeaways:

  1. Use ERROR level in production, DEBUG only in development
  2. Set retention policies from day one (30 days is usually enough)
  3. Replace high-frequency logs with metrics
  4. Archive to S3 for long-term compliance needs
  5. Structured logging improves query performance and reduces costs

Final Metrics:

  • Cost reduction: $570/month ($6,840/year)
  • Storage reduction: 2TB → 60GB (97%)
  • Query speed: 8× faster with structured logs
  • MTTR: 4.5× faster incident resolution

Related Plan: docs/plans/implemented/high/2026-01-16-cost-savings-cloudwatch-plan.md

Related Posts:

  • Cost Post 8.2 (Lambda Cost Investigation)
  • Cost Post 8.5 (EventBridge Warmup Elimination)