Building Robust Error Handling with Exception Hierarchies
Key Takeaway
Catching the generic Exception for every error made debugging difficult and prevented appropriate error responses. A custom exception hierarchy enabled specific error handling, correct HTTP status codes, and targeted retry logic.
The Problem
The original error handling was too broad:
try:
    monitor_budget()
except Exception as e:
    return {'statusCode': 500}  # Everything is 500
This prevented:
- Distinguishing transient from permanent errors
- Returning appropriate HTTP status codes
- Implementing selective retry logic
- Understanding root causes from logs
The Solution
Create an exception hierarchy:
# core/exceptions.py

class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""
    pass

class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""
    pass

class ThresholdEvaluationError(MonitoringError):
    """Cannot evaluate threshold conditions"""
    pass

# Note: the next two inherit from Exception directly,
# so "except MonitoringError" does not catch them.
class NotificationError(Exception):
    """Notification delivery failed"""
    pass

class ConfigurationError(Exception):
    """Invalid configuration"""
    pass
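The practical effect of the hierarchy is that an except clause for a base class also handles every subclass. The sketch below illustrates this with the same class names. It also shows that NotificationError, which derives from Exception rather than MonitoringError, falls through to a generic handler. The helper catch_label() is hypothetical and exists only to report which clause fires.

```python
# Sketch: catching a base class also catches its subclasses.
# Class names mirror core/exceptions.py; catch_label() is a hypothetical helper.

class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""

class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""

class ThresholdEvaluationError(MonitoringError):
    """Cannot evaluate threshold conditions"""

class NotificationError(Exception):
    """Notification delivery failed (not a MonitoringError)"""

def catch_label(exc: Exception) -> str:
    """Report which except clause would handle exc."""
    try:
        raise exc
    except MonitoringError:
        return "MonitoringError handler"
    except Exception:
        return "generic handler"

print(catch_label(MetricRetrievalError("throttled")))     # MonitoringError handler
print(catch_label(ThresholdEvaluationError("bad rule")))  # MonitoringError handler
print(catch_label(NotificationError("SNS down")))         # generic handler
```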
Raise specific exceptions at the point of failure:
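import boto3
from botocore.exceptions import ClientError
from core.exceptions import MetricRetrievalError

cloudwatch = boto3.client('cloudwatch')

def get_cloudwatch_metrics(metric_name):
    try:
        return cloudwatch.get_metric_statistics(...)
    except ClientError as e:
        # Throttling is transient; other codes are often permanent but share this type.
        if e.response['Error']['Code'] == 'Throttling':
            raise MetricRetrievalError("CloudWatch throttled, retry later") from e
        raise MetricRetrievalError(f"Failed to get metrics: {e}") from e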
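Both branches raise the same type. A caller such as a retry decorator therefore cannot tell throttling apart from a permanent failure such as missing permissions. The sketch below shows one way to keep that distinction: a hypothetical retryable flag set from the AWS error code. TRANSIENT_CODES is an illustrative set, not an authoritative list. Check the codes your client actually returns before relying on it.

```python
# Sketch, assuming a hypothetical `retryable` flag on MetricRetrievalError.
# TRANSIENT_CODES is illustrative only, not an exhaustive list of AWS error codes.

class MonitoringError(Exception):
    """Base exception for monitoring-related errors"""

class MetricRetrievalError(MonitoringError):
    """Cannot retrieve metrics from CloudWatch"""

    def __init__(self, message: str, retryable: bool = False):
        super().__init__(message)
        self.retryable = retryable  # hypothetical: True for transient failures

TRANSIENT_CODES = {"Throttling", "ThrottlingException", "ServiceUnavailable"}

def to_metric_error(error_code: str, detail: str) -> MetricRetrievalError:
    """Wrap an error code, e.g. ClientError.response['Error']['Code']."""
    if error_code in TRANSIENT_CODES:
        return MetricRetrievalError(
            f"CloudWatch transient error ({error_code}): {detail}", retryable=True)
    return MetricRetrievalError(
        f"Failed to get metrics ({error_code}): {detail}", retryable=False)

print(to_metric_error("Throttling", "Rate exceeded").retryable)     # True
print(to_metric_error("AccessDenied", "not authorized").retryable)  # False
```

Inside the except ClientError block this would become raise to_metric_error(e.response['Error']['Code'], str(e)) from e. A retry decorator could then check exc.retryable instead of the exception type alone.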
Map each exception type to a response in the Lambda handler:
import json
import logging

from core.exceptions import (ConfigurationError, MetricRetrievalError,
                             MonitoringError, NotificationError)

logger = logging.getLogger(__name__)

def lambda_handler(event, context):
    try:
        result = process_monitoring(event)
        return {'statusCode': 200, 'body': json.dumps(result)}
    except ConfigurationError as e:
        logger.error(f"Configuration error: {e}")
        return {'statusCode': 500, 'body': 'Configuration error'}
    except MetricRetrievalError as e:
        logger.warning(f"Metric retrieval failed (retryable): {e}")
        return {'statusCode': 503, 'body': 'Service temporarily unavailable'}
    except NotificationError as e:
        logger.error(f"Notification failed: {e}")
        # Don't fail monitoring if notification fails
        return {'statusCode': 200, 'body': 'Monitoring succeeded, notification failed'}
    except MonitoringError as e:
        # Must come after its subclasses, or it would shadow MetricRetrievalError
        logger.error(f"Monitoring error: {e}")
        return {'statusCode': 500, 'body': 'Monitoring error'}
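As the number of exception types grows, the chain of except clauses can be replaced by a lookup table. The sketch below uses a hypothetical STATUS_BY_TYPE table and status_for() helper, with the codes taken from the handler above. It walks the exception's method resolution order (type(exc).__mro__), so the most specific registered class wins. This is the same rule that ordering except clauses subclass-first gives you.

```python
# Sketch: table-driven mapping from exception type to HTTP status code.
# STATUS_BY_TYPE and status_for() are hypothetical names.

class MonitoringError(Exception): ...
class MetricRetrievalError(MonitoringError): ...
class ThresholdEvaluationError(MonitoringError): ...
class NotificationError(Exception): ...
class ConfigurationError(Exception): ...

STATUS_BY_TYPE = {
    ConfigurationError: 500,
    MetricRetrievalError: 503,
    NotificationError: 200,  # monitoring itself succeeded
    MonitoringError: 500,
}

def status_for(exc: BaseException, default: int = 500) -> int:
    """Return the status of the most specific registered class in exc's MRO."""
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_TYPE:
            return STATUS_BY_TYPE[cls]
    return default  # unregistered exception types

print(status_for(MetricRetrievalError("throttled")))     # 503
print(status_for(ThresholdEvaluationError("bad rule")))  # 500, via MonitoringError
print(status_for(KeyError("oops"), default=502))         # 502
```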
Implementation Details
Add retry logic based on exception type:
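import functools
import time

from core.exceptions import (ConfigurationError, MetricRetrievalError,
                             ThresholdEvaluationError)

def with_retry(max_attempts=3):
    """Retry only on transient errors"""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except MetricRetrievalError:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the original error
                    time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
                except (ConfigurationError, ThresholdEvaluationError):
                    # Don't retry permanent errors
                    raise
        return wrapper
    return decorator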
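With max_attempts=3, the decorator waits 2^0 = 1 s after the first failure and 2^1 = 2 s after the second. A third failure is then re-raised with no further sleep, so the worst case adds 3 s of waiting. Because time.sleep is called directly, a test of this schedule would take real time. The sketch below is a variant with a hypothetical sleep parameter, so the delays can be recorded instead of waited out.

```python
# Sketch: the same retry policy with an injectable `sleep` (hypothetical
# parameter) so the backoff schedule can be checked without waiting.
import functools
import time

class MonitoringError(Exception): ...
class MetricRetrievalError(MonitoringError): ...

def with_retry(max_attempts=3, sleep=time.sleep):
    """Retry only on MetricRetrievalError, backing off 1s, 2s, 4s, ..."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except MetricRetrievalError:
                    if attempt == max_attempts - 1:
                        raise  # last attempt: propagate
                    sleep(2 ** attempt)
        return wrapper
    return decorator

delays = []          # records requested sleeps instead of sleeping
calls = {"n": 0}

@with_retry(max_attempts=3, sleep=delays.append)
def flaky_fetch():
    """Hypothetical fetch that is throttled twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise MetricRetrievalError("throttled")
    return "metrics"

print(flaky_fetch(), calls["n"], delays)  # metrics 3 [1, 2]
```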
Impact and Results
- Debugging: Clear error types in logs
- Reliability: Transient errors retried, permanent errors fail fast
- User Experience: Appropriate HTTP status codes
- Monitoring: Different alerts for different error types
Lessons Learned
- Exception Hierarchies: Create specific exception types for specific errors
- Retry Logic: Only retry transient failures
- HTTP Status Codes: Map exception types to appropriate status codes
- Logging: Include exception type in logs for easier debugging
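A minimal sketch of the logging lesson: record the exception's class name as its own field, so log queries and alerts can filter on it. The helper log_error() and the field names error_type and error_message are hypothetical. The standard logging module's extra argument attaches them to the log record. Inside an except block, logger.exception would also include the traceback.

```python
# Sketch: log the exception class name as a structured field.
# log_error(), "error_type" and "error_message" are hypothetical names.
import logging

logger = logging.getLogger("monitoring")

class MonitoringError(Exception): ...
class MetricRetrievalError(MonitoringError): ...

def log_error(exc: Exception) -> dict:
    """Log exc with its type name and return the fields that were attached."""
    fields = {"error_type": type(exc).__name__, "error_message": str(exc)}
    # `extra` adds these keys as attributes on the LogRecord
    logger.error("%s: %s", fields["error_type"], fields["error_message"], extra=fields)
    return fields

print(log_error(MetricRetrievalError("CloudWatch throttled"))["error_type"])  # MetricRetrievalError
```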
Custom exception hierarchies make error handling part of the system's design instead of an after-the-fact debugging exercise. Invest in exception design early in your project.