Building Robust Error Handling with Exception Hierarchies

Key Takeaway

Catching a generic Exception for every error made debugging difficult and prevented appropriate error responses. A custom exception hierarchy enabled specific error handling, proper HTTP status codes, and targeted retry logic.

The Problem

The original error handling was too broad:

try:
    monitor_budget()
except Exception as e:
    return {'statusCode': 500}  # Everything is 500

This prevented:

Distinguishing transient from permanent errors
Returning appropriate HTTP status codes
Implementing selective retry logic
Understanding root causes from logs

The Solution

Create an exception hierarchy:

# core/exceptions.py
class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""
    pass

class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""
    pass

class ThresholdEvaluationError(MonitoringError):
    """Cannot evaluate threshold conditions"""
    pass

class NotificationError(Exception):
    """Notification delivery failed"""
    pass

class ConfigurationError(Exception):
    """Invalid configuration"""
    pass
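
To see how this hierarchy behaves in except clauses, the following minimal sketch redefines the classes above and checks which of them a MonitoringError handler would catch. The helper name caught_by_monitoring_handler is hypothetical and exists only for illustration. Note that, as defined here, NotificationError and ConfigurationError derive directly from Exception, so they need their own except clauses.

```python
class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""


class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""


class ThresholdEvaluationError(MonitoringError):
    """Cannot evaluate threshold conditions"""


class NotificationError(Exception):
    """Notification delivery failed"""


class ConfigurationError(Exception):
    """Invalid configuration"""


def caught_by_monitoring_handler(exc):
    """Hypothetical helper: would `except MonitoringError` catch exc?"""
    try:
        raise exc
    except MonitoringError:
        return True
    except Exception:
        return False


print(caught_by_monitoring_handler(MetricRetrievalError("timeout")))     # True
print(caught_by_monitoring_handler(ThresholdEvaluationError("bad op")))  # True
print(caught_by_monitoring_handler(ConfigurationError("missing key")))   # False
```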

Use specific exceptions:

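from botocore.exceptions import ClientError

# cloudwatch is assumed to be a boto3 CloudWatch client created elsewhere
def get_cloudwatch_metrics(metric_name):
    try:
        return cloudwatch.get_metric_statistics(...)
    except ClientError as e:
        # Chain with "from e" so the original AWS error stays in the traceback
        if e.response['Error']['Code'] == 'Throttling':
            raise MetricRetrievalError("CloudWatch throttled, retry later") from e
        raise MetricRetrievalError(f"Failed to get metrics: {e}") from e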
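
Both branches above raise the same exception type, so a caller can only tell throttling apart from other failures by reading the message. One possible refinement, sketched here under assumptions, is to carry a retryable flag on the exception. The retryable attribute, the translate_client_error helper, the TRANSIENT_CODES set, and the FakeClientError stand-in (used so the sketch runs without boto3) are all hypothetical and not part of the original code.

```python
class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""


class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""

    def __init__(self, message, retryable=False):
        super().__init__(message)
        # Assumption: callers consult this flag instead of parsing the message.
        self.retryable = retryable


class FakeClientError(Exception):
    """Stand-in exposing the same `response` dict shape as botocore's ClientError."""

    def __init__(self, code, message):
        super().__init__(message)
        self.response = {'Error': {'Code': code, 'Message': message}}


# Assumption: error codes treated as transient; extend for your own service.
TRANSIENT_CODES = {'Throttling'}


def translate_client_error(error):
    """Build a MetricRetrievalError from a client error, keeping the cause."""
    code = error.response['Error']['Code']
    translated = MetricRetrievalError(
        f"Failed to get metrics ({code}): {error}",
        retryable=code in TRANSIENT_CODES,
    )
    translated.__cause__ = error  # chain the original error for tracebacks
    return translated


throttled = translate_client_error(FakeClientError('Throttling', 'Rate exceeded'))
denied = translate_client_error(FakeClientError('AccessDenied', 'Not authorized'))
print(throttled.retryable, denied.retryable)  # True False
```

With a flag like this, retry code could re-raise immediately when retryable is False instead of backing off on every MetricRetrievalError.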

Handle appropriately in handlers:

import json
import logging

logger = logging.getLogger(__name__)

def lambda_handler(event, context):
    try:
        result = process_monitoring(event)
        return {'statusCode': 200, 'body': json.dumps(result)}

    except ConfigurationError as e:
        logger.error(f"Configuration error: {e}")
        return {'statusCode': 500, 'body': 'Configuration error'}

    except MetricRetrievalError as e:
        logger.warning(f"Metric retrieval failed (retryable): {e}")
        return {'statusCode': 503, 'body': 'Service temporarily unavailable'}

    except NotificationError as e:
        logger.error(f"Notification failed: {e}")
        # Don't fail monitoring if notification fails
        return {'statusCode': 200, 'body': 'Monitoring succeeded, notification failed'}

    except MonitoringError as e:
        logger.error(f"Monitoring error: {e}")
        return {'statusCode': 500, 'body': 'Monitoring error'}
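
The same mapping from exception type to response can also be expressed as data rather than as a chain of except clauses. The sketch below is one way to do that, not the handler's actual implementation: STATUS_BY_EXCEPTION and to_response are hypothetical names, and the generic fallback body for unrecognized exceptions is an assumption.

```python
class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""


class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""


class ThresholdEvaluationError(MonitoringError):
    """Cannot evaluate threshold conditions"""


class NotificationError(Exception):
    """Notification delivery failed"""


class ConfigurationError(Exception):
    """Invalid configuration"""


# Most specific types first: the first matching entry wins,
# mirroring the order of the except clauses in the handler.
STATUS_BY_EXCEPTION = [
    (ConfigurationError, 500, 'Configuration error'),
    (MetricRetrievalError, 503, 'Service temporarily unavailable'),
    (NotificationError, 200, 'Monitoring succeeded, notification failed'),
    (MonitoringError, 500, 'Monitoring error'),
]


def to_response(exc):
    """Hypothetical helper: translate an exception into a response dict."""
    for exc_type, status, body in STATUS_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return {'statusCode': status, 'body': body}
    # Assumption: anything unrecognized becomes a generic 500.
    return {'statusCode': 500, 'body': 'Internal error'}


print(to_response(MetricRetrievalError("throttled")))
print(to_response(ThresholdEvaluationError("bad comparison")))
```

Keeping the base class last matters: if MonitoringError came first, MetricRetrievalError would match it and never reach its 503 entry.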

Implementation Details

Add retry logic based on exception type:

import functools
import time

def with_retry(max_attempts=3):
    """Retry only on transient errors"""
    def decorator(func):
        @functools.wraps(func)  # Preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except MetricRetrievalError:
                    if attempt == max_attempts - 1:
                        raise  # Out of attempts: surface the last error
                    time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
                except (ConfigurationError, ThresholdEvaluationError):
                    # Don't retry permanent errors
                    raise
        return wrapper
    return decorator
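
A short usage sketch can make the retry behavior concrete. The version below is a simplified, self-contained variant of the decorator: the base_delay parameter is a hypothetical addition so the example runs without waiting, the two exception classes are stand-ins for the ones defined earlier, and flaky_fetch and broken_config are made-up functions that simulate a transient and a permanent failure.

```python
import functools
import time


class MetricRetrievalError(Exception):
    """Transient failure (simplified stand-in)"""


class ConfigurationError(Exception):
    """Permanent failure (simplified stand-in)"""


def with_retry(max_attempts=3, base_delay=1.0):
    """Retry on MetricRetrievalError only; base_delay is a hypothetical knob."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except MetricRetrievalError:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


calls = {'flaky': 0, 'broken': 0}


@with_retry(max_attempts=3, base_delay=0)
def flaky_fetch():
    # Fails twice, then succeeds on the third call.
    calls['flaky'] += 1
    if calls['flaky'] < 3:
        raise MetricRetrievalError("CloudWatch throttled, retry later")
    return 'ok'


@with_retry(max_attempts=3, base_delay=0)
def broken_config():
    # Always fails with a permanent error, so it must not be retried.
    calls['broken'] += 1
    raise ConfigurationError("missing threshold")


print(flaky_fetch(), calls['flaky'])  # ok 3
try:
    broken_config()
except ConfigurationError:
    print(calls['broken'])  # 1
```

The transient failure is retried until it succeeds on the third call, while the permanent one propagates after a single attempt.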

Impact and Results

Debugging: Clear error types in logs
Reliability: Transient errors retried, permanent errors fail fast
User Experience: Appropriate HTTP status codes
Monitoring: Different alerts for different error types

Lessons Learned

Exception Hierarchies: Create specific exception types for specific errors
Retry Logic: Only retry transient failures
HTTP Status Codes: Map exception types to appropriate status codes
Logging: Include exception type in logs for easier debugging
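
As a small illustration of the logging point, the sketch below puts the exception's class name at the front of each log message so that log searches and alerts can key on it. The describe helper and the logger name are hypothetical, and the classes are redefined only to keep the example self-contained.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s %(message)s')
logger = logging.getLogger('monitoring')


class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""


class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""


def describe(exc):
    """Hypothetical helper: format an exception as 'TypeName: message'."""
    return f"{type(exc).__name__}: {exc}"


try:
    raise MetricRetrievalError("CloudWatch throttled, retry later")
except MonitoringError as e:
    # Catching the base class still reports the concrete subclass name.
    summary = describe(e)
    logger.warning("monitoring failed: %s", summary)
```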

Custom exception hierarchies transform error handling from reactive debugging to proactive system design. Invest in proper exception design early in your project.