IAM Permissions Creep: When Silent Failures Hide Missing Permissions

·budget-manager

Key Takeaway

Our budget-control Lambda function could stop ECS services and block S3 operations, but it failed silently when it tried to disable other Lambda functions. Its IAM role granted no Lambda permissions, so the budget controls only partially worked and gave us false confidence in our cost protection.

The Problem

Original IAM configuration:

iamRoleStatements:
  - Effect: Allow
    Action:
      - "ecs:*"
      - "s3:*"
    Resource: "*"

Our function could:

  • ✅ Stop ECS tasks and services
  • ✅ Disable S3 bucket operations
  • ❌ Disable Lambda function triggers (permission denied)

The issue: partial success masked the missing capability. Budget alerts appeared to work because some services shut down, but Lambda functions kept running and kept accruing costs.

Context and Background

Our budget control workflow:

1. AWS Budget threshold exceeded
2. SNS triggers Lambda: stop_services
3. Lambda executes:
   a. Stop all ECS tasks (works)
   b. Update S3 bucket policies (works)
   c. Disable Lambda concurrency (fails silently!)
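
As a rough illustration of step 2, the entry point of stop_services might look like the sketch below. It is not the original handler: the `shutdown` parameter is a hypothetical injection point for whatever routine performs step 3, and the only thing assumed about the event is the standard SNS shape, where each record carries its payload under `Records[].Sns.Message`.

```python
import logging

logger = logging.getLogger(__name__)


def extract_sns_messages(event):
    """Return the message body of every SNS record in a Lambda event."""
    messages = []
    for record in event.get("Records", []):
        # SNS-triggered invocations carry the payload under Records[].Sns.Message
        sns = record.get("Sns")
        if sns is not None:
            messages.append(sns.get("Message", ""))
    return messages


def handler(event, context, shutdown=None):
    """Run the shutdown routine once if the event contains an SNS notification.

    `shutdown` is a hypothetical callable standing in for the routine that
    stops ECS, locks down S3, and throttles Lambda.
    """
    messages = extract_sns_messages(event)
    if not messages:
        logger.warning("Invocation carried no SNS records; nothing to do")
        return {"triggered": False, "results": None}

    for message in messages:
        # Log a truncated copy so the trigger is visible in CloudWatch Logs
        logger.warning("Budget notification received: %s", message[:200])

    results = shutdown() if shutdown is not None else None
    return {"triggered": True, "results": results}
```

Returning the per-service results from the handler, rather than discarding them, keeps a failed step visible to whoever inspects the invocation.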

The Lambda disabling feature sets ReservedConcurrentExecutions=0 on each target function. This throttles every new invocation while letting executions that are already in flight finish normally.
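
The throttle is reversible. A minimal sketch of the reverse operation, assuming a boto3 Lambda client is passed in (for example `boto3.client('lambda')`), could look like this. The function name `restore_lambda_functions` is hypothetical and not part of the original code; it relies on `delete_function_concurrency`, which requires the lambda:DeleteFunctionConcurrency permission.

```python
def restore_lambda_functions(lambda_client, function_names):
    """Remove the reserved-concurrency setting so functions can run again.

    `lambda_client` is expected to behave like boto3.client('lambda').
    """
    results = {"succeeded": [], "failed": []}

    for name in function_names:
        try:
            # Deleting the setting returns the function to the account's
            # unreserved concurrency pool.
            lambda_client.delete_function_concurrency(FunctionName=name)
            results["succeeded"].append(name)
        except Exception as exc:
            # With boto3 this is typically botocore.exceptions.ClientError;
            # a broad catch keeps the sketch free of a botocore import.
            results["failed"].append({"function": name, "error": str(exc)})

    return results
```

One caveat: deleting the setting also discards any non-zero reserved concurrency a function had before the shutdown, so a function with a deliberate limit would need that value recorded first and reapplied with `put_function_concurrency`.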

The Solution

Add comprehensive Lambda permissions:

iamRoleStatements:
  - Effect: Allow
    Action:
      - "ecs:*"
      - "s3:*"
      - "lambda:*"  # Added this
    Resource: "*"

Implement proper error handling:

import logging

import boto3
from botocore.exceptions import ClientError

lambda_client = boto3.client('lambda')

def disable_lambda_triggers(function_names):
    """
    Disable Lambda functions by setting concurrency to 0.

    This prevents new invocations while allowing current
    executions to complete.
    """
    results = {'succeeded': [], 'failed': []}

    for function_name in function_names:
        try:
            lambda_client.put_function_concurrency(
                FunctionName=function_name,
                ReservedConcurrentExecutions=0
            )

            logging.warning(f'Disabled Lambda function: {function_name}')
            results['succeeded'].append(function_name)

        except ClientError as e:
            error_code = e.response['Error']['Code']

            if error_code == 'AccessDeniedException':
                logging.error(f'Permission denied disabling {function_name}. Check IAM role.')
            elif error_code == 'ResourceNotFoundException':
                logging.error(f'Lambda function not found: {function_name}')
            else:
                logging.error(f'Failed to disable {function_name}: {e}')

            results['failed'].append({
                'function': function_name,
                'error': str(e)
            })

    return results

Implementation Details

1. Permission Scoping

While we used lambda:*, a production role should follow least privilege and grant only the actions the function actually calls:

iamRoleStatements:
  - Effect: Allow
    Action:
      - "lambda:PutFunctionConcurrency"
      - "lambda:DeleteFunctionConcurrency"
      - "lambda:GetFunctionConfiguration"
    Resource:
      - "arn:aws:lambda:${aws:region}:${aws:accountId}:function:*"

2. Testing Permissions

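Verify IAM permissions before relying on them. Run the check with the function's execution role rather than your own credentials, and use a disposable function, because a successful call really does throttle it:

# Test Lambda permission (sets the test function's concurrency to 0)
aws lambda put-function-concurrency \
  --function-name test-function \
  --reserved-concurrent-executions 0

# If successful, the permission exists
# If AccessDeniedException, update the IAM role

# Clean up: remove the concurrency limit from the test function
aws lambda delete-function-concurrency \
  --function-name test-function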
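
A non-destructive alternative is to ask IAM to simulate the role's policies instead of calling the real API. The sketch below uses the IAM `simulate_principal_policy` operation and assumes an IAM client is passed in (for example `boto3.client('iam')`). The `REQUIRED_ACTIONS` list is an assumption for illustration: the article does not say which ECS and S3 actions the function calls, so substitute the real ones.

```python
# Hypothetical list of actions the budget function depends on.
REQUIRED_ACTIONS = [
    "lambda:PutFunctionConcurrency",
    "ecs:UpdateService",
    "s3:PutBucketPolicy",
]


def find_denied_actions(iam_client, role_arn, actions, resource_arns=("*",)):
    """Return the actions that the role's policies would not allow.

    `iam_client` is expected to behave like boto3.client('iam').
    """
    response = iam_client.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=list(actions),
        ResourceArns=list(resource_arns),
    )
    # EvalDecision is 'allowed', 'explicitDeny', or 'implicitDeny'
    return [
        result["EvalActionName"]
        for result in response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    ]
```

Running this in a deployment pipeline and failing the build when the returned list is non-empty would have caught the missing Lambda permission before the first budget alert. The caller needs iam:SimulatePrincipalPolicy, and long action lists may need pagination, which this sketch omits.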

3. Monitoring IAM Failures

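Add CloudWatch metrics for permission denials:

import logging

import boto3

# Publishing the metric itself requires cloudwatch:PutMetricData on the role
cloudwatch = boto3.client('cloudwatch')

def log_iam_failure(service, action, error):
    """Log an IAM permission failure and record it as a CloudWatch metric"""
    logging.error(f'IAM permission denied for {service}:{action}: {error}')
    cloudwatch.put_metric_data(
        Namespace='BudgetManager',
        MetricData=[{
            'MetricName': 'IAMPermissionDenied',
            'Value': 1,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Service', 'Value': service},
                {'Name': 'Action', 'Value': action}
            ]
        }]
    )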
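
To decide when log_iam_failure should fire, the caller has to recognize a permission denial, and AWS services do not all use the same error code for it. The helper below is a sketch under that assumption; `is_permission_denied` and `PERMISSION_ERROR_CODES` are hypothetical names, and the set of codes is not exhaustive.

```python
# Error codes that commonly indicate a missing IAM permission.
PERMISSION_ERROR_CODES = frozenset({
    "AccessDenied",            # e.g. S3, IAM
    "AccessDeniedException",   # e.g. Lambda, ECS
    "UnauthorizedOperation",   # e.g. EC2
})


def is_permission_denied(error):
    """Return True if a botocore-style error looks like an IAM denial.

    Works with any exception exposing a ClientError-like `response` dict;
    anything else is treated as not permission-related.
    """
    response = getattr(error, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    return code in PERMISSION_ERROR_CODES
```

Inside an `except ClientError as e:` block, a call such as `if is_permission_denied(e): log_iam_failure('lambda', 'PutFunctionConcurrency', e)` keeps the metric limited to genuine permission problems rather than every failure.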

4. Graceful Degradation

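Continue with partial shutdown even if some operations fail:

def run_step(name, step):
    """Run one shutdown step; turn an unexpected exception into a failed result."""
    try:
        # Each step is expected to return {'succeeded': [...], 'failed': [...]}
        return step()
    except Exception as e:
        logging.error(f'Shutdown step {name} raised: {e}')
        return {'succeeded': [], 'failed': [{'step': name, 'error': str(e)}]}

def emergency_shutdown(function_names):
    """
    Attempt to shut down all billable services.
    Continue even if individual operations fail.
    """
    results = {
        'ecs': run_step('ecs', stop_ecs_services),
        's3': run_step('s3', disable_s3_access),
        'lambda': run_step('lambda', lambda: disable_lambda_triggers(function_names)),
        'sqs': run_step('sqs', purge_queues)
    }

    # Log summary
    total_succeeded = sum(len(r.get('succeeded', [])) for r in results.values())
    total_failed = sum(len(r.get('failed', [])) for r in results.values())

    logging.warning(f'Emergency shutdown: {total_succeeded} succeeded, {total_failed} failed')

    # Alert if any critical failures
    if results['ecs'].get('failed') or results['lambda'].get('failed'):
        send_alert('Partial shutdown failure - manual intervention required')

    return results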

Impact and Results

After adding Lambda permissions:

  • Complete Shutdown: All services are now disabled during budget alerts
  • Cost Savings: An additional $800/month in spend is prevented by stopping Lambda executions
  • Monitoring: CloudWatch metrics track IAM failures
  • Confidence: The team can now verify that every service is actually controlled

Lessons Learned

  1. Test All Permissions: Don't assume partial success means complete functionality
  2. Explicit Error Handling: Catch and log permission denials
  3. Least Privilege: Start with minimal permissions, expand as needed
  4. Permission Audits: Regularly review IAM roles for completeness
  5. Fail Loudly: Permission errors should generate alerts, not fail silently

IAM permission issues often manifest as silent failures rather than loud errors. Comprehensive error handling and monitoring are essential to detect missing permissions before they cause problems in production.