SQS Queue Purging: The Missing Piece in Cost Control
Key Takeaway
Our budget emergency shutdown stopped ECS services and Lambda functions but left messages sitting in SQS queues, so processing would resume as soon as services recovered. Adding queue purging completed our cost control strategy by preventing queued work from resuming.
The Problem
The original shutdown stopped services but left queued work untouched:
def emergency_shutdown():
    stop_ecs_services()        # Stops running tasks
    disable_lambda_triggers()  # Prevents new invocations
    # Missing: clear SQS queues!
Messages remained in queues, so when services restarted after budget adjustments, all queued work immediately resumed, potentially exceeding budget again.
The Solution
Implement SQS queue purging with proper error handling:
import logging

import boto3
from botocore.exceptions import ClientError

sqs_client = boto3.client('sqs')

def purge_queue(queue_name):
    """
    Purge all messages from an SQS queue.
    Note: AWS allows one purge per queue every 60 seconds,
    and a purge can take up to 60 seconds to complete.
    """
    try:
        queue_url = sqs_client.get_queue_url(QueueName=queue_name)['QueueUrl']
        sqs_client.purge_queue(QueueUrl=queue_url)
        logging.warning(f'Queue purged: {queue_name}')
    except sqs_client.exceptions.PurgeQueueInProgress as e:
        # Must come before ClientError: it is a ClientError subclass
        logging.error(f'Purge already in progress for {queue_name}: {e}')
    except ClientError as e:
        if e.response['Error']['Code'] == 'AWS.SimpleQueueService.NonExistentQueue':
            logging.error(f'Queue not found: {queue_name}')
        else:
            logging.error(f'Failed to purge {queue_name}: {e}')
def purge_all_queues():
    """Purge all application queues"""
    queues = [
        'image-processing-queue',
        'annotation-conversion-queue',
        'report-generation-queue'
    ]
    for queue in queues:
        purge_queue(queue)
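With purging available, the shutdown from The Problem can run all three steps in sequence. The following is a minimal sketch, not our exact implementation: the three arguments are hypothetical zero-argument callables (for example stop_ecs_services, disable_lambda_triggers, and purge_all_queues), and the choice to continue after a failed step is an assumption made so that a failure to stop one component does not leave the queues full.

```python
import logging


def emergency_shutdown(stop_services, disable_triggers, purge_queues):
    """Run the shutdown steps in order and return the names of steps that failed.

    Purging runs last, after services and triggers are stopped, so that
    nothing still running can enqueue new messages behind the purge.
    All three parameters are hypothetical zero-argument callables.
    """
    steps = [
        ('stop_services', stop_services),
        ('disable_triggers', disable_triggers),
        ('purge_queues', purge_queues),
    ]
    failed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            # Assumption: keep going so one failed step does not skip the purge.
            logging.exception('Shutdown step failed: %s', name)
            failed.append(name)
    return failed
```

A caller could treat a non-empty return value as a signal to page someone, since it means part of the shutdown did not complete.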
Implementation Details
Handle AWS Throttling
AWS limits purge operations to once per 60 seconds per queue; a second PurgeQueue call inside that window raises PurgeQueueInProgress:
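import time

def safe_purge_with_retry(queue_name, max_retries=3):
    """Retry the purge if another purge ran within the last 60 seconds"""
    # Call the SQS API directly: purge_queue() above catches
    # PurgeQueueInProgress itself, so it would never reach this loop.
    queue_url = sqs_client.get_queue_url(QueueName=queue_name)['QueueUrl']
    for attempt in range(max_retries):
        try:
            sqs_client.purge_queue(QueueUrl=queue_url)
            logging.warning(f'Queue purged: {queue_name}')
            return True
        except sqs_client.exceptions.PurgeQueueInProgress:
            if attempt < max_retries - 1:
                time.sleep(60)  # Wait 60s before retry
    logging.error(f'Queue {queue_name} purge still blocked after {max_retries} attempts')
    return False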
Monitor Queue Depth
Add CloudWatch alarms for queue depth after purge:
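resources:
  Resources:
    QueueDepthAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        MetricName: ApproximateNumberOfMessagesVisible
        Namespace: AWS/SQS
        Dimensions:  # SQS metrics are published per queue; repeat the alarm for each queue
          - Name: QueueName
            Value: image-processing-queue
        Statistic: Maximum
        Period: 300
        EvaluationPeriods: 1
        Threshold: 100
        ComparisonOperator: GreaterThanThreshold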
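The alarm evaluates over a 300-second period, so it is not an immediate confirmation that a purge worked. For a direct check right after purging, a sketch along these lines could poll the queue attributes. The function names, the timeout, and the polling interval are illustrative assumptions; the client is passed in (for example boto3.client('sqs')) so the logic can be exercised without AWS access. The counts SQS returns are approximate, and a purge can take up to 60 seconds, which is why the default timeout is longer than that.

```python
import time

# Attributes that together cover visible, in-flight, and delayed messages.
DEPTH_ATTRIBUTES = [
    'ApproximateNumberOfMessages',
    'ApproximateNumberOfMessagesNotVisible',
    'ApproximateNumberOfMessagesDelayed',
]


def queue_is_empty(sqs_client, queue_url):
    """Return True when SQS reports zero messages in every depth attribute."""
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=DEPTH_ATTRIBUTES,
    )['Attributes']
    # SQS returns attribute values as strings.
    return all(int(count) == 0 for count in attrs.values())


def wait_until_empty(sqs_client, queue_url, timeout=90, interval=10, sleep=time.sleep):
    """Poll until the queue reports empty or `timeout` seconds of sleeping have passed.

    `timeout`, `interval`, and the injectable `sleep` are illustrative choices.
    """
    waited = 0
    while True:
        if queue_is_empty(sqs_client, queue_url):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

A False result after the timeout suggests that something is still producing messages, or that the purge did not go through.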
Impact and Results
- Complete Shutdown: No residual work after emergency stop
- Cost Savings: Prevented $1,200 in queued work execution
- Recovery Control: Manual approval required to restart processing
Lessons Learned
- Queue State Matters: Stopping workers doesn't clear queued work
- Handle AWS Limits: Purge operations are rate-limited
- Complete Shutdown: Consider all stateful components
- Monitor After Action: Verify queues are actually empty
Cost control requires stopping not just active processes but also preventing queued work from restarting. SQS purging is essential for complete emergency shutdowns.