Incomplete ECS Shutdown: Why Desired Count Zero Isn't Enough

Key Takeaway

Setting an ECS service's desiredCount to 0 stopped new task launches but, in our environment, left existing tasks running for hours. A two-phase shutdown (set the desired count to 0, then explicitly stop every running task) terminated the service completely and cut costs immediately.

The Problem

Original shutdown implementation:

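import boto3

ecs_client = boto3.client('ecs')

def stop_ecs_services(cluster_name):
    services = ecs_client.list_services(cluster=cluster_name)

    for service in services['serviceArns']:
        # Only the service configuration changes; no task is stopped explicitly
        ecs_client.update_service(
            cluster=cluster_name,
            service=service,
            desiredCount=0
        )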

This prevented new tasks from launching but left running tasks active, consuming resources and accruing costs for hours.
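
One way to confirm this symptom is to compare each service's desiredCount with its runningCount after the update. The sketch below assumes a boto3 ECS client is passed in; `find_lingering_services` is a hypothetical helper name, not part of the original tooling. Because `describe_services` accepts at most 10 services per request, the ARNs are sent in batches.

```python
def find_lingering_services(ecs, cluster_name):
    """Return (serviceName, runningCount) for services scaled to 0 that still run tasks.

    `ecs` is assumed to be a boto3 ECS client, e.g. boto3.client('ecs').
    """
    arns = []
    for page in ecs.get_paginator('list_services').paginate(cluster=cluster_name):
        arns.extend(page['serviceArns'])

    lingering = []
    # describe_services accepts at most 10 services per request.
    for i in range(0, len(arns), 10):
        response = ecs.describe_services(
            cluster=cluster_name,
            services=arns[i:i + 10],
        )
        for svc in response['services']:
            if svc['desiredCount'] == 0 and svc['runningCount'] > 0:
                lingering.append((svc['serviceName'], svc['runningCount']))
    return lingering
```

An empty result after the shutdown suggests that no service is still running tasks. Any entry names a service worth stopping explicitly.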

The Solution

Two-phase shutdown:

import logging

def stop_ecs_services_complete(cluster_name):
    """
    Complete ECS service shutdown:
    1. Set desired count to 0 (prevent new tasks)
    2. Stop all running tasks explicitly

    ecs_client is a boto3 ECS client: boto3.client('ecs').
    """
    # list_services and list_tasks return results in pages, so paginate both
    service_pages = ecs_client.get_paginator('list_services').paginate(
        cluster=cluster_name
    )

    for page in service_pages:
        for service_arn in page['serviceArns']:
            service_name = service_arn.split('/')[-1]

            # Phase 1: Prevent new tasks
            ecs_client.update_service(
                cluster=cluster_name,
                service=service_name,
                desiredCount=0
            )

            # Phase 2: Collect the running tasks first, then stop them
            task_arns = []
            task_pages = ecs_client.get_paginator('list_tasks').paginate(
                cluster=cluster_name,
                serviceName=service_name
            )
            for task_page in task_pages:
                task_arns.extend(task_page['taskArns'])

            for task_arn in task_arns:
                ecs_client.stop_task(
                    cluster=cluster_name,
                    task=task_arn,
                    reason='Budget alert - emergency shutdown'
                )

                logging.warning('Stopped task: %s', task_arn)

Implementation Details

1. Also Disable Lambda Triggers

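Throttle any Lambda functions that start ECS tasks, so they cannot relaunch work during the shutdown:

import boto3

lambda_client = boto3.client('lambda')

def disable_ecs_lambda_trigger(function_name):
    """Prevent Lambda from starting new ECS tasks"""
    # Reserved concurrency of 0 throttles every invocation of the function
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0
    )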
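
Reserved concurrency of 0 keeps the function throttled until the setting is changed back, so it helps to record the previous value before overwriting it. The following is a minimal sketch, assuming a boto3 Lambda client is passed in. Both helper names are hypothetical.

```python
def disable_and_remember(lambda_client, function_name):
    """Throttle the function and return its previous reserved concurrency (or None)."""
    current = lambda_client.get_function_concurrency(FunctionName=function_name)
    # The key is absent when no reserved concurrency was configured.
    previous = current.get('ReservedConcurrentExecutions')
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0,
    )
    return previous


def restore_trigger(lambda_client, function_name, previous):
    """Undo the throttle once the emergency is over."""
    if previous is None:
        # No reservation existed before: remove the limit entirely.
        lambda_client.delete_function_concurrency(FunctionName=function_name)
    else:
        lambda_client.put_function_concurrency(
            FunctionName=function_name,
            ReservedConcurrentExecutions=previous,
        )
```

The returned value has to be stored somewhere durable between the two calls. If `disable_and_remember` runs twice in a row, the second call records 0 as the previous value, so callers should keep the first result.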

2. Graceful vs Forceful Termination

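stop_task offers no separate forceful mode. ECS always sends SIGTERM first and follows with SIGKILL once the container's stopTimeout (30 seconds by default) expires, so the grace period is configured in the task definition rather than in the call. The flag below therefore only selects the reason recorded on the stopped task:

def stop_ecs_tasks(cluster, tasks, graceful=True):
    """
    Stop ECS tasks and record why they were stopped.

    Every task receives SIGTERM, then SIGKILL after its stopTimeout.

    Args:
        graceful: If True, record a planned shutdown; otherwise an emergency
    """
    reason = 'Planned shutdown' if graceful else 'Emergency shutdown'

    for task in tasks:
        ecs_client.stop_task(cluster=cluster, task=task, reason=reason)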

3. Monitor Termination

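Verify that all tasks stop within a timeout:

import time

def wait_for_tasks_stopped(cluster, task_arns, timeout=300):
    """Wait for all tasks to stop, with timeout"""
    if not task_arns:
        return True  # describe_tasks rejects an empty task list

    deadline = time.monotonic() + timeout

    while time.monotonic() < deadline:
        statuses = []
        # describe_tasks accepts at most 100 tasks per call
        for i in range(0, len(task_arns), 100):
            response = ecs_client.describe_tasks(
                cluster=cluster,
                tasks=task_arns[i:i + 100]
            )
            # Tasks ECS no longer tracks appear under 'failures' and are ignored
            statuses.extend(task['lastStatus'] for task in response['tasks'])

        if all(status == 'STOPPED' for status in statuses):
            return True

        time.sleep(5)

    return False  # Timeout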
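
boto3 also ships a built-in `tasks_stopped` waiter for ECS that polls describe_tasks internally, which could stand in for a hand-written loop. The following is a minimal sketch, assuming a boto3 ECS client is passed in. `wait_with_builtin_waiter` is a hypothetical name, and the delay and attempt values are illustrative.

```python
def wait_with_builtin_waiter(ecs, cluster, task_arns, delay=5, max_attempts=60):
    """Block until every task is STOPPED, using boto3's 'tasks_stopped' waiter.

    Raises botocore.exceptions.WaiterError if a batch is still running
    after roughly delay * max_attempts seconds.
    """
    waiter = ecs.get_waiter('tasks_stopped')
    # describe_tasks (used by the waiter) accepts at most 100 tasks per call.
    for i in range(0, len(task_arns), 100):
        waiter.wait(
            cluster=cluster,
            tasks=task_arns[i:i + 100],
            WaiterConfig={'Delay': delay, 'MaxAttempts': max_attempts},
        )
```

Unlike a loop that returns False on timeout, the waiter signals a timeout by raising, so the caller needs to catch the exception. Batches are awaited one after another, so the overall wait can exceed a single batch's limit.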

Impact and Results

Immediate Termination: Tasks stop within 30 seconds instead of running for hours
Cost Savings: $450/day prevented during emergency shutdown test
Complete Control: No residual compute charges

Lessons Learned

Desired Count != Running Tasks: Updating configuration doesn't stop active work
Two-Phase Shutdown: Prevent new work, then stop current work
Disable Triggers: Stop sources that restart services
Monitor Completion: Verify shutdown actually completes
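
The ordering implied by these lessons can be sketched as a small orchestration function. This is an illustrative outline, not the original tooling: `emergency_shutdown` and its four callable parameters are hypothetical, and the caller is assumed to wire them to functions like the ones shown earlier.

```python
def emergency_shutdown(disable_triggers, scale_to_zero, stop_tasks, wait_stopped):
    """Run the shutdown steps in order; return True only if the stop is verified.

    disable_triggers, scale_to_zero: callables taking no arguments.
    stop_tasks: callable taking no arguments, returning the stopped task ARNs.
    wait_stopped: callable taking the ARN list, returning True once all stopped.
    """
    disable_triggers()            # 1. nothing can restart work
    scale_to_zero()               # 2. no replacement tasks are launched
    task_arns = stop_tasks()      # 3. stop whatever is still running
    if not task_arns:
        return True               # nothing was running
    return wait_stopped(task_arns)  # 4. verify before declaring success
```

Disabling triggers comes first so that nothing relaunches tasks while the later steps run.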

Effective cost control requires understanding the difference between configuration changes and runtime state changes. Complete shutdowns must address both.