Incomplete ECS Shutdown: Why Desired Count Zero Isn't Enough
Key Takeaway
Setting an ECS service's desiredCount to 0 prevents new task launches but, in our case, left already-running tasks active. A two-phase shutdown (set the desired count to 0, then explicitly stop the running tasks) terminated the service completely and cut costs immediately.
The Problem
Original shutdown implementation:
def stop_ecs_services(cluster_name):
    services = ecs_client.list_services(cluster=cluster_name)
    for service in services['serviceArns']:
        ecs_client.update_service(
            cluster=cluster_name,
            service=service,
            desiredCount=0
        )
This prevented new tasks from launching but left running tasks active, consuming resources and accruing costs for hours.
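One way to confirm this gap is to compare each service's desired count with its reported running count. The following is a minimal sketch, not the original tooling: the function name is hypothetical, and it assumes a boto3 ECS client (or any object with a compatible `describe_services` method) is passed in.

```python
def find_services_with_running_tasks(ecs_client, cluster, service_names):
    """Return {service_name: runningCount} for services whose desired
    count is 0 but which still report running tasks."""
    lingering = {}
    # describe_services accepts at most 10 services per call
    for i in range(0, len(service_names), 10):
        response = ecs_client.describe_services(
            cluster=cluster,
            services=service_names[i:i + 10],
        )
        for service in response['services']:
            if service['desiredCount'] == 0 and service['runningCount'] > 0:
                lingering[service['serviceName']] = service['runningCount']
    return lingering
```

A non-empty result after the desired count has been set to 0 means tasks are still running and still being billed.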
The Solution
Two-phase shutdown:
import logging

import boto3

ecs_client = boto3.client('ecs')

def stop_ecs_services_complete(cluster_name):
    """
    Complete ECS service shutdown:
    1. Set desired count to 0 (prevent new tasks)
    2. Stop all running tasks (do not wait for the scheduler)
    """
    # list_services and list_tasks return one page at a time, so paginate
    service_pages = ecs_client.get_paginator('list_services').paginate(
        cluster=cluster_name
    )
    services = [arn for page in service_pages for arn in page['serviceArns']]
    for service_arn in services:
        service_name = service_arn.split('/')[-1]
        # Phase 1: Prevent new tasks
        ecs_client.update_service(
            cluster=cluster_name,
            service=service_name,
            desiredCount=0
        )
        # Phase 2: Stop running tasks
        task_pages = ecs_client.get_paginator('list_tasks').paginate(
            cluster=cluster_name,
            serviceName=service_name
        )
        tasks = [arn for page in task_pages for arn in page['taskArns']]
        for task_arn in tasks:
            ecs_client.stop_task(
                cluster=cluster_name,
                task=task_arn,
                reason='Budget alert - emergency shutdown'
            )
            logging.warning(f'Stopped task: {task_arn}')
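The loop above only reaches tasks that belong to a service. Tasks started directly with RunTask (for example by a Lambda function) are not tied to any service, so a cluster-wide sweep may also be needed. The sketch below is an assumption about how that could look: the function name and return shape are hypothetical, and the client is passed in as a parameter. It collects all task ARNs first and only then stops them, so pagination is not affected by the stops.

```python
def stop_all_cluster_tasks(ecs_client, cluster, reason='Emergency shutdown'):
    """Stop every RUNNING task in the cluster, including tasks started
    outside a service. Returns (stopped_arns, failed_arns)."""
    # Step 1: collect every running task ARN, following nextToken pages
    task_arns = []
    kwargs = {'cluster': cluster, 'desiredStatus': 'RUNNING'}
    while True:
        page = ecs_client.list_tasks(**kwargs)
        task_arns.extend(page.get('taskArns', []))
        token = page.get('nextToken')
        if not token:
            break
        kwargs['nextToken'] = token

    # Step 2: stop each task; one failure should not block the rest
    stopped, failed = [], []
    for task_arn in task_arns:
        try:
            ecs_client.stop_task(cluster=cluster, task=task_arn, reason=reason)
            stopped.append(task_arn)
        except Exception:  # with boto3 this would typically be ClientError
            failed.append(task_arn)
    return stopped, failed
```

Returning the failed ARNs separately lets the caller retry or raise an alert instead of silently leaving tasks running.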
Implementation Details
1. Also Disable Lambda Triggers
Stop Lambda functions that start ECS tasks:
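import boto3

lambda_client = boto3.client('lambda')

def disable_ecs_lambda_trigger(function_name):
    """Prevent Lambda from starting new ECS tasks"""
    # Reserved concurrency of 0 throttles every invocation of the function
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0
    )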
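Setting reserved concurrency to 0 overwrites whatever value the function had before, so it can help to record the previous setting and restore it once the emergency is over. This is a minimal sketch under that assumption: the helper names `pause_lambda` and `resume_lambda` are hypothetical, and the Lambda client is passed in as a parameter.

```python
def pause_lambda(lambda_client, function_name):
    """Throttle the function to zero and return its previous reserved
    concurrency (None if it had no reservation)."""
    current = lambda_client.get_function_concurrency(FunctionName=function_name)
    previous = current.get('ReservedConcurrentExecutions')
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0,
    )
    return previous


def resume_lambda(lambda_client, function_name, previous):
    """Restore the reserved concurrency captured by pause_lambda."""
    if previous is None:
        # No reservation existed before, so remove the limit entirely
        lambda_client.delete_function_concurrency(FunctionName=function_name)
    else:
        lambda_client.put_function_concurrency(
            FunctionName=function_name,
            ReservedConcurrentExecutions=previous,
        )
```

The value returned by `pause_lambda` needs to be stored somewhere durable (a parameter store or a tag, for instance) if the restore happens in a different process.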
2. Graceful vs Forceful Termination
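StopTask has no force flag. ECS always sends SIGTERM to each container, waits for the container's stopTimeout (30 seconds by default), then sends SIGKILL. The grace period is therefore set in the task definition, not per call, and the two modes below differ only in the stop reason they record:
def stop_ecs_tasks(cluster, tasks, graceful=True):
    """
    Stop ECS tasks, recording an emergency reason when requested.
    Args:
        graceful: If False, tag the stop with an emergency reason.
            The signal sequence (SIGTERM, then SIGKILL after
            stopTimeout) is the same in both modes.
    """
    for task in tasks:
        if graceful:
            # SIGTERM now, SIGKILL once the container's stopTimeout expires
            ecs_client.stop_task(cluster=cluster, task=task)
        else:
            # Same signal sequence; only the recorded reason differs
            ecs_client.stop_task(
                cluster=cluster,
                task=task,
                reason='Emergency shutdown'
            )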
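Whether a stop is actually graceful depends on how the container reacts to the SIGTERM that ECS delivers. The sketch below shows one way a Python worker inside the container could handle it. The class and function names are hypothetical, and it assumes the worker processes discrete items and can safely stop between them.

```python
import signal


class ShutdownFlag:
    """Records that SIGTERM arrived so the worker can exit cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread of the process
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.requested = True


def run_worker(flag, work_items, handle):
    """Process items until none remain or shutdown is requested.
    Returns the items that were not processed."""
    pending = list(work_items)
    while pending and not flag.requested:
        handle(pending.pop(0))
    return pending
```

In the container entrypoint, `flag.install()` would be called once at startup. Any items returned by `run_worker` can then be requeued before the stopTimeout window closes.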
3. Monitor Termination
Verify all tasks stop within timeout:
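import time

def wait_for_tasks_stopped(cluster, task_arns, timeout=300):
    """Wait for all tasks to stop, with timeout"""
    if not task_arns:
        return True  # describe_tasks rejects an empty task list
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        statuses = []
        # describe_tasks accepts at most 100 tasks per call
        for i in range(0, len(task_arns), 100):
            batch = ecs_client.describe_tasks(
                cluster=cluster,
                tasks=task_arns[i:i + 100]
            )['tasks']
            statuses.extend(task['lastStatus'] for task in batch)
        if all(status == 'STOPPED' for status in statuses):
            return True
        time.sleep(5)
    return False  # Timeout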
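A boolean result does not say which tasks are stuck. As a possible variation, the sketch below returns the ARNs that have not reached STOPPED by the deadline. The function name and the injectable `clock` and `sleep` parameters are assumptions added for illustration and testing, and it expects full task ARNs as returned by list_tasks.

```python
import time


def wait_and_report(ecs_client, cluster, task_arns, timeout=300, poll=5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until every task is STOPPED or the timeout passes.
    Returns the ARNs not confirmed as STOPPED (empty list on success)."""
    remaining = list(task_arns)
    deadline = clock() + timeout
    while remaining:
        unconfirmed = []
        # describe_tasks accepts at most 100 tasks per call
        for i in range(0, len(remaining), 100):
            batch = remaining[i:i + 100]
            response = ecs_client.describe_tasks(cluster=cluster, tasks=batch)
            status = {t['taskArn']: t['lastStatus'] for t in response['tasks']}
            # Tasks missing from the response count as not yet confirmed
            unconfirmed.extend(a for a in batch if status.get(a) != 'STOPPED')
        remaining = unconfirmed
        if not remaining or clock() >= deadline:
            break
        sleep(poll)
    return remaining
```

The returned list can be logged or passed to an alert, so an operator knows exactly which tasks need manual attention.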
Impact and Results
- Immediate Termination: Tasks stop within about 30 seconds instead of running for hours
- Cost Savings: $450/day in charges avoided during an emergency shutdown test
- Complete Control: No residual compute charges
Lessons Learned
- Desired Count != Running Tasks: Updating configuration doesn't stop active work
- Two-Phase Shutdown: Prevent new work, then stop current work
- Disable Triggers: Stop sources that restart services
- Monitor Completion: Verify shutdown actually completes
Effective cost control requires understanding the difference between configuration changes and runtime state changes. Complete shutdowns must address both.