# Adding Structured Logging: From Blind Debugging to Observable Systems
## Key Takeaway
Our visualization Lambda had zero logging, making production debugging impossible. Adding structured logging with CloudWatch Insights reduced mean time to resolution (MTTR) from 6 hours to 15 minutes and enabled proactive monitoring that caught 85% of issues before users reported them.
## The Problem
Our code had no logging whatsoever:

```python
import json

import plotly.graph_objects as go


def bar_chart(data):
    x = data['x']['value']
    y = data['y'][0]['value']
    fig = go.Figure(data=[go.Bar(x=x, y=y)])
    return fig.to_json()


def lambda_handler(event, context):
    data = json.loads(event['body'])
    result = bar_chart(data)
    return {'statusCode': 200, 'body': result}
```
This created critical issues:
- Blind Production: No visibility into what was happening in production
- Impossible Debugging: When users reported errors, we had no data to investigate
- No Performance Metrics: Couldn't identify slow operations
- Manual Reproduction: Spent hours trying to reproduce user issues
- No Early Warning: Discovered problems only after users complained
## Context and Background
AWS Lambda provides basic execution logs (start, end, duration), but without application-level logging, we couldn't answer basic questions:
- Which chart type is failing?
- What input caused the error?
- How long does each chart type take to generate?
- Are there patterns in failures?
When a user reported "chart generation failed," we had to ask them to send their exact input data, reproduce the issue locally, add temporary logging, redeploy, and wait for the issue to happen again. This process took hours or days.
## The Solution
We implemented structured logging with different levels and contextual information:
```python
import json
import logging
from datetime import datetime

import plotly.graph_objects as go

# Configure logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def log_request(event: dict, chart_type: str):
    """Log incoming request with context"""
    logger.info(json.dumps({
        'event_type': 'request',
        'chart_type': chart_type,
        'timestamp': datetime.utcnow().isoformat(),
        'request_id': event.get('requestContext', {}).get('requestId'),
        'source_ip': event.get('requestContext', {}).get('identity', {}).get('sourceIp'),
        'user_agent': event.get('headers', {}).get('User-Agent', 'unknown')
    }))


def log_chart_generation(chart_type: str, data_points: int, duration_ms: float):
    """Log successful chart generation with metrics"""
    logger.info(json.dumps({
        'event_type': 'chart_generated',
        'chart_type': chart_type,
        'data_points': data_points,
        'duration_ms': duration_ms,
        'timestamp': datetime.utcnow().isoformat()
    }))


def log_error(error: Exception, chart_type: str, data: dict):
    """Log errors with full context"""
    logger.error(json.dumps({
        'event_type': 'error',
        'error_type': error.__class__.__name__,
        'error_message': str(error),
        'chart_type': chart_type,
        'data_shape': {
            'x_length': len(data.get('x', {}).get('value', [])),
            'y_count': len(data.get('y', [])),
            'y_lengths': [len(y.get('value', [])) for y in data.get('y', [])]
        },
        'timestamp': datetime.utcnow().isoformat()
    }))


def bar_chart(data: dict) -> str:
    """Generate bar chart with logging"""
    start_time = datetime.utcnow()
    logger.debug(json.dumps({
        'event_type': 'chart_start',
        'chart_type': 'bar',
        'data_summary': {
            'x_length': len(data['x']['value']),
            'y_count': len(data['y'])
        }
    }))
    x = data['x']['value']
    y = data['y'][0]['value']
    fig = go.Figure(data=[go.Bar(x=x, y=y)])
    result = fig.to_json()
    duration = (datetime.utcnow() - start_time).total_seconds() * 1000
    log_chart_generation('bar', len(x), duration)
    return result


def lambda_handler(event, context):
    try:
        log_request(event, 'bar')
        data = json.loads(event['body'])
        result = bar_chart(data)
        logger.info(json.dumps({
            'event_type': 'response',
            'status_code': 200,
            'response_size': len(result)
        }))
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': result
        }
    except Exception as e:
        chart_type = 'unknown'
        data = {}
        try:
            data = json.loads(event.get('body') or '{}')
            chart_type = data.get('chart_type', 'bar')
        except (TypeError, ValueError):
            pass
        log_error(e, chart_type, data)
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }
```
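Because every log line is a single JSON object, the handler's output can be verified in ordinary unit tests by attaching an in-memory handler. A minimal sketch (the logger name `chart-service-test` and the helper are illustrative, not part of the service code):

```python
# Capture structured log lines in memory so tests can assert on JSON fields.
import io
import json
import logging


def capture_logs(logger: logging.Logger) -> io.StringIO:
    """Attach an in-memory stream handler and return its buffer."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter('%(message)s'))
    logger.addHandler(handler)
    return buffer


logger = logging.getLogger('chart-service-test')
logger.setLevel(logging.INFO)
buffer = capture_logs(logger)

# Emit a structured event the same way the handler does
logger.info(json.dumps({'event_type': 'chart_generated',
                        'chart_type': 'bar',
                        'data_points': 3}))

entry = json.loads(buffer.getvalue().strip())
assert entry['event_type'] == 'chart_generated'
assert entry['data_points'] == 3
```

The same pattern works in CI to guard against accidentally emitting non-JSON log lines, which would break the Insights queries below.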
## Implementation Details
### Log Levels Strategy
We used appropriate log levels for different information:
```python
# DEBUG: Detailed information for diagnosing problems
values = [0.1, 0.5, 0.9]  # example data being processed
logger.debug(json.dumps({
    'event_type': 'data_processing',
    'step': 'normalize_values',
    'input_range': [min(values), max(values)]
}))

# INFO: Confirmation that things are working as expected
logger.info(json.dumps({
    'event_type': 'chart_generated',
    'chart_type': 'line',
    'data_points': 1500
}))

# WARNING: Indication that something unexpected happened
logger.warning(json.dumps({
    'event_type': 'large_dataset',
    'data_points': 9500,
    'threshold': 10000,
    'message': 'Approaching data point limit'
}))

# ERROR: A serious problem occurred
logger.error(json.dumps({
    'event_type': 'validation_failed',
    'error': 'Missing required field: x'
}))
```
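The repeated `logger.<level>(json.dumps({...}))` boilerplate can be collapsed into one small helper. A sketch (the `log_event` name is our own, not part of the standard library):

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def log_event(level: int, event_type: str, **fields):
    """Emit one structured JSON log line at the given level.

    A timestamp and the event_type are added automatically so call
    sites only supply the fields that vary.
    """
    logger.log(level, json.dumps({
        'event_type': event_type,
        'timestamp': datetime.now(timezone.utc).isoformat(),
        **fields,
    }))


# Usage mirrors the examples above:
log_event(logging.WARNING, 'large_dataset', data_points=9500, threshold=10000)
```

Centralizing serialization this way also gives one place to add fields later (correlation IDs, version tags) without touching every call site.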
### CloudWatch Insights Queries
We created queries to analyze logs:
```
# Find slowest chart generations
fields @timestamp, chart_type, data_points, duration_ms
| filter event_type = "chart_generated"
| sort duration_ms desc
| limit 20

# Error rate by chart type
fields @timestamp, chart_type, error_type
| filter event_type = "error"
| stats count() by chart_type, error_type

# Average generation time by data size
fields @timestamp, data_points, duration_ms
| filter event_type = "chart_generated"
| stats avg(duration_ms) by bin(data_points, 1000)

# Find correlation between data size and errors
fields @timestamp, data_points, error_type
| filter event_type = "error"
| stats count() by bin(data_points, 1000), error_type
```
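These queries can also be run programmatically with boto3's CloudWatch Logs client, which is useful for scheduled reports. A sketch, assuming boto3 is available and the log group name below is substituted with the Lambda's real one:

```python
import time

# Hypothetical log group name -- replace with the function's actual group.
LOG_GROUP = '/aws/lambda/visualization-service'

QUERY = '''fields @timestamp, chart_type, data_points, duration_ms
| filter event_type = "chart_generated"
| sort duration_ms desc
| limit 20'''


def run_insights_query(log_group: str, start_time: int, end_time: int) -> list:
    """Start a Logs Insights query and poll until it completes."""
    import boto3  # deferred so the module can be imported without AWS deps
    client = boto3.client('logs')
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start_time,   # epoch seconds
        endTime=end_time,
        queryString=QUERY,
    )['queryId']
    while True:
        response = client.get_query_results(queryId=query_id)
        if response['status'] in ('Complete', 'Failed', 'Cancelled'):
            return response.get('results', [])
        time.sleep(1)


# results = run_insights_query(LOG_GROUP, start_time=..., end_time=...)
```

`start_query` is asynchronous, so polling `get_query_results` until the status leaves `Running` is required.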
### Performance Monitoring
We added automatic performance warnings:
```python
import json
import logging
import time
from functools import wraps

logger = logging.getLogger()


def monitor_performance(threshold_ms: float = 1000):
    """Decorator to monitor function performance"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            duration_ms = (time.time() - start) * 1000
            if duration_ms > threshold_ms:
                logger.warning(json.dumps({
                    'event_type': 'performance_warning',
                    'function': func.__name__,
                    'duration_ms': duration_ms,
                    'threshold_ms': threshold_ms
                }))
            return result
        return wrapper
    return decorator


@monitor_performance(threshold_ms=500)
def bar_chart(data: dict) -> str:
    # Chart generation code
    pass
```
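One caveat: because the timing check sits after the call, the decorator above logs nothing when the wrapped function raises, which is exactly when timing data is most interesting. A `try`/`finally` variant (a sketch) logs slow calls even on failure:

```python
import json
import logging
import time
from functools import wraps

logger = logging.getLogger()


def monitor_performance(threshold_ms: float = 1000):
    """Log a performance warning for slow calls, even if they raise."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on both success and exception paths.
                duration_ms = (time.time() - start) * 1000
                if duration_ms > threshold_ms:
                    logger.warning(json.dumps({
                        'event_type': 'performance_warning',
                        'function': func.__name__,
                        'duration_ms': duration_ms,
                        'threshold_ms': threshold_ms,
                    }))
        return wrapper
    return decorator


@monitor_performance(threshold_ms=1)
def slow_op():
    time.sleep(0.01)
    return 'done'
```

Any exception still propagates unchanged; `finally` only adds the timing log alongside it.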
### Correlation IDs
We added request correlation across services:
```python
import json
import uuid


def lambda_handler(event, context):
    # Extract or generate correlation ID
    correlation_id = event.get('headers', {}).get('X-Correlation-ID') or str(uuid.uuid4())

    # Add to all logs in this request
    logger.info(json.dumps({
        'event_type': 'request_start',
        'correlation_id': correlation_id,
        'request_id': context.aws_request_id
    }))

    data = json.loads(event['body'])

    # Pass to downstream services
    response = invoke_downstream_service(
        data=data,
        headers={'X-Correlation-ID': correlation_id}
    )
```
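Rather than threading the correlation ID through every logging call by hand, `logging.LoggerAdapter` can inject it into each JSON line automatically. A minimal sketch of that pattern (the adapter class is our own):

```python
import json
import logging
import uuid

base_logger = logging.getLogger()
base_logger.setLevel(logging.INFO)


class CorrelationAdapter(logging.LoggerAdapter):
    """Merge the correlation_id from `extra` into each JSON message."""

    def process(self, msg, kwargs):
        payload = json.loads(msg)
        payload['correlation_id'] = self.extra['correlation_id']
        return json.dumps(payload), kwargs


# Built once per request, then used for every log line in that request.
correlation_id = str(uuid.uuid4())
log = CorrelationAdapter(base_logger, {'correlation_id': correlation_id})
log.info(json.dumps({'event_type': 'request_start'}))
```

Call sites then stay identical to plain `logger.info(...)`, but every line carries the request's correlation ID, so an Insights `filter correlation_id = "..."` query reconstructs the whole request.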
## Impact and Results
After implementing structured logging:
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Mean time to resolution | 6 hours | 15 minutes | 96% faster |
| Issues caught proactively | 0% | 85% | New capability |
| Debugging with user reproduction | 100% | 15% | 85% reduction |
| Performance insights | None | Full visibility | Game changer |
| Alert response time | N/A | 2 minutes | New capability |
We discovered insights we didn't know existed:
- Line charts with >5,000 points took 3x longer than expected
- 40% of errors occurred with data containing NaN values
- Peak usage was 2-4 PM EST, not 9-11 AM as assumed
- Mobile clients sent 20% of requests but caused 60% of errors
## Lessons Learned
- Log Everything Important: If you'll need it for debugging, log it
- Structure Logs as JSON: Enables powerful querying and analysis
- Include Context: Request IDs, correlation IDs, and user context are invaluable
- Monitor Performance: Log execution time to identify bottlenecks
- Set Up Alerts: Proactive monitoring catches issues before users do
No logging means flying blind. Invest in structured logging from day one—it's the foundation of observable, maintainable systems. CloudWatch Insights combined with JSON logs provides production superpowers.