
SQS Queue Purging: The Missing Piece in Cost Control

Key Takeaway

Our budget emergency shutdown stopped ECS services and Lambda functions but left SQS messages queued, which would restart processing once services recovered. Adding queue purging completed our cost control strategy by preventing queued work from resuming.

The Problem

The original shutdown stopped services but left queued work untouched:

def emergency_shutdown():
    stop_ecs_services()  # Stops tasks
    disable_lambda_triggers()  # Prevents new invocations
    # Missing: Clear SQS queues!

Messages remained in the queues, so when services restarted after budget adjustments, all queued work immediately resumed, potentially exceeding the budget again.

The Solution

Implement SQS queue purging with proper error handling:

import logging

import boto3
from botocore.exceptions import ClientError

sqs_client = boto3.client('sqs')

def purge_queue(queue_name):
    """
    Purge all messages from SQS queue.

    Note: AWS allows purge once per 60 seconds per queue.
    """
    try:
        queue_url = sqs_client.get_queue_url(QueueName=queue_name)['QueueUrl']

        sqs_client.purge_queue(QueueUrl=queue_url)

        logging.warning(f'Queue purged: {queue_name}')

    except sqs_client.exceptions.PurgeQueueInProgress as e:
        logging.error(f'Purge already in progress for {queue_name}: {e}')

    except ClientError as e:
        if e.response['Error']['Code'] == 'AWS.SimpleQueueService.NonExistentQueue':
            logging.error(f'Queue not found: {queue_name}')
        else:
            logging.error(f'Failed to purge {queue_name}: {e}')

def purge_all_queues():
    """Purge all application queues"""
    queues = [
        'image-processing-queue',
        'annotation-conversion-queue',
        'report-generation-queue'
    ]

    for queue in queues:
        purge_queue(queue)

Implementation Details

Handle AWS Throttling

AWS limits purge operations to once per 60 seconds per queue:

import time

def safe_purge_with_retry(queue_name, max_retries=3):
    """Retry the purge if another purge is already in progress."""
    queue_url = sqs_client.get_queue_url(QueueName=queue_name)['QueueUrl']

    for attempt in range(max_retries):
        try:
            sqs_client.purge_queue(QueueUrl=queue_url)
            logging.warning(f'Queue purged: {queue_name}')
            return

        except sqs_client.exceptions.PurgeQueueInProgress:
            if attempt < max_retries - 1:
                time.sleep(60)  # AWS allows one purge per queue per 60s
            else:
                logging.error(
                    f'Gave up purging {queue_name} after {max_retries} attempts')

Monitor Queue Depth

Add a CloudWatch alarm on queue depth so a refilled queue is visible after a purge. The Dimensions block scopes the alarm to a single queue, so declare one alarm per queue:

resources:
  Resources:
    QueueDepthAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        MetricName: ApproximateNumberOfMessagesVisible
        Namespace: AWS/SQS
        Dimensions:
          - Name: QueueName
            Value: image-processing-queue
        Statistic: Maximum
        Period: 300
        EvaluationPeriods: 1
        Threshold: 100
        ComparisonOperator: GreaterThanThreshold

Impact and Results

  • Complete Shutdown: No residual work after emergency stop
  • Cost Savings: Prevented $1,200 in queued work execution
  • Recovery Control: Manual approval required to restart processing

Lessons Learned

  1. Queue State Matters: Stopping workers doesn't clear queued work
  2. Handle AWS Limits: Purge operations are rate-limited
  3. Complete Shutdown: Consider all stateful components
  4. Monitor After Action: Verify queues are actually empty

Cost control requires more than stopping active processes; it also means preventing queued work from resuming once services recover. SQS purging is an essential part of a complete emergency shutdown.