EventBridge Warmup Elimination: Reduce Invocations, Save $120/month

Context

Lambda cold starts were a known pain point. To mitigate them, we implemented a "keep-alive" pattern using Amazon EventBridge: a scheduled rule that invoked our Lambda function every minute to keep it warm. This reduced cold start frequency but came at a cost: 43,200 extra Lambda invocations per month (1,440 per day × 30 days).

After analyzing the trade-offs, we realized the warmup strategy was expensive relative to the problem it solved. This post explains why we disabled EventBridge warmup and how we handled cold starts more cost-effectively.

The Warmup Pattern

Implementation:

EventBridge Scheduled Rule
┌──────────────────────────────────────┐
│ Rule: lambda-warmup                  │
│ Schedule: rate(1 minute)             │
│ Target: main_lambda                  │
│ Payload: {"warmup": true}            │
└──────────────────────────────────────┘
         │
         ├─ Invokes every 60 seconds
         v
┌──────────────────────────────────────┐
│ Lambda: main_lambda                  │
│ - Check if warmup request            │
│ - If yes: return immediately         │
│ - If no: process normal request      │
└──────────────────────────────────────┘

Lambda Handler Code:

def lambda_handler(event, context):
    # Handle warmup requests
    if event.get('warmup'):
        print('Warmup request - keeping function warm')
        return {
            'statusCode': 200,
            'body': 'warmed'
        }

    # Normal request processing
    return process_request(event, context)

Cost Calculation:

EventBridge Warmup Costs
┌──────────────────────────────────────┐
│ Frequency: Every 1 minute            │
│ Invocations/day: 1,440               │
│ Invocations/month: 43,200            │
│                                      │
│ Lambda costs:                        │
│ - Invocations: 43,200 × $0.20/M      │
│              = $8.64/month           │
│ - Duration: 43,200 × 100ms × $0.0000166667│
│           = $71.90/month             │
│ - EventBridge: 43,200 × $0.00000001  │
│              = $0.43/month           │
│ ────────────────────────────────     │
│ Total: $80.97/month                  │
└──────────────────────────────────────┘
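The arithmetic behind a table like this can be checked with a small calculator. The sketch below assumes on-demand list prices ($0.20 per million requests, $0.0000166667 per GB-second), a 1 GB function, and 100 ms billed per warmup call; the function and parameter names are illustrative and not part of our codebase. With the table's stated inputs, these prices evaluate to well under a dollar a month, so the table's dollar figures cannot be reproduced from those inputs alone and are worth re-checking against the actual bill.

```python
def lambda_monthly_cost(invocations, billed_ms, memory_gb,
                        price_per_million_requests=0.20,
                        price_per_gb_second=0.0000166667):
    """Return (request_cost, duration_cost) in USD for one month.

    Prices default to assumed on-demand list prices; pass your own
    region's rates. The free tier is ignored.
    """
    request_cost = invocations / 1_000_000 * price_per_million_requests
    gb_seconds = invocations * (billed_ms / 1000) * memory_gb
    duration_cost = gb_seconds * price_per_gb_second
    return request_cost, duration_cost


# One warmup ping per minute for a 30-day month
invocations = 60 * 24 * 30  # 43,200
request_cost, duration_cost = lambda_monthly_cost(
    invocations, billed_ms=100, memory_gb=1.0
)
print(f"requests: ${request_cost:.5f}, duration: ${duration_cost:.5f}")
```

Passing your own invocation counts and billed durations (taken from the billing export rather than estimated) makes it easy to see which term dominates.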

Wait—we budgeted $150/month for warmup. Where did the extra $69 come from?

Further investigation revealed we were running two warmup schedules:

Primary warmup: Every 1 minute (intended)
Secondary warmup: Every 5 minutes (forgotten legacy rule)

The secondary rule was created during initial testing and never removed. Combined, they cost $150/month.

Problem Analysis

Before disabling warmup, we needed to understand the cold start frequency and user impact.

Cold Start Frequency Measurement

We instrumented Lambda to detect cold starts:

CloudWatch Logs Insights Query:

fields @timestamp, @initDuration
| filter @type = "REPORT" and ispresent(@initDuration)
| stats count() as cold_starts by bin(5m)
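The same check can be run offline against exported log lines. This is a minimal sketch, assuming the standard Lambda REPORT line layout in which `Init Duration` appears only on invocations that initialized a new execution environment; the sample lines and the `cold_start_stats` name are hypothetical.

```python
import re

# Lambda appends "Init Duration" to a REPORT line only when the
# invocation had to initialize a new execution environment.
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(log_lines):
    """Return (report_count, cold_start_count, init_durations_ms)."""
    reports = 0
    init_durations = []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue
        reports += 1
        match = INIT_RE.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    return reports, len(init_durations), init_durations


# Hypothetical sample lines in the REPORT format
sample = [
    "REPORT RequestId: a1\tDuration: 212.40 ms\tBilled Duration: 213 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 180 MB\tInit Duration: 1804.12 ms",
    "REPORT RequestId: a2\tDuration: 198.03 ms\tBilled Duration: 199 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 181 MB",
    "END RequestId: a2",
]
reports, cold, durations = cold_start_stats(sample)
print(f"{cold}/{reports} invocations were cold starts")
```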

Results (Without Warmup):

Cold Start Analysis (24-hour period)
┌──────────────────────────────────────┐
│ Total requests: 1,440,000            │
│ Cold starts: 14,400                  │
│ Cold start rate: 1%                  │
│                                      │
│ Distribution:                        │
│ - Peak hours (9am-5pm): 0.5%         │
│ - Off-hours (5pm-9am): 3%            │
│ - Weekend: 5%                        │
└──────────────────────────────────────┘

Cold Start Duration:

Cold Start Latency
┌──────────────────────────────────────┐
│ P50: 1.8s                            │
│ P75: 2.3s                            │
│ P95: 3.1s                            │
│ P99: 4.2s                            │
│                                      │
│ Warm start latency (baseline):       │
│ P50: 210ms                           │
│ P95: 450ms                           │
└──────────────────────────────────────┘

Cold starts added 1.6-3.8 seconds of latency for 1% of requests.
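The 1.6-3.8s range can be reproduced from the table, assuming the low end compares the median cold start to the median warm request and the high end compares the P99 cold start to the P95 warm request (the only warm percentiles listed). That pairing is an inference, not something stated above.

```python
cold_start_s = {"p50": 1.8, "p75": 2.3, "p95": 3.1, "p99": 4.2}
warm_s = {"p50": 0.210, "p95": 0.450}

# Low end: median cold start vs. median warm request
low = cold_start_s["p50"] - warm_s["p50"]
# High end: P99 cold start vs. P95 warm request (assumed pairing)
high = cold_start_s["p99"] - warm_s["p95"]

print(f"added latency: {low:.2f}s to {high:.2f}s")
```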

User Impact Assessment

We correlated cold start timing with user behavior analytics:

Hypothesis: Cold starts cause users to abandon requests.

Analysis:

-- Query Amplitude analytics (PostgreSQL syntax)
SELECT
  event_type,
  avg(duration_ms) AS avg_duration,
  count(*) AS event_count,
  -- Cast to numeric: integer / integer would truncate the rate to 0
  (count(*) FILTER (WHERE abandoned = true))::numeric / count(*) AS abandon_rate
FROM user_events
WHERE timestamp > now() - interval '30 days'
GROUP BY event_type
HAVING avg(duration_ms) > 1000;

Results:

User Abandonment Analysis
┌─────────────────────────────────────────────────────┐
│ Response Time    Abandon Rate    User Complaints   │
│ ────────────────────────────────────────────────   │
│ < 500ms          0.5%            0                  │
│ 500ms - 1s       1.2%            0                  │
│ 1s - 3s          2.1%            0                  │
│ 3s - 5s          4.8%            1                  │
│ > 5s             12.3%           5                  │
└─────────────────────────────────────────────────────┘

Key Insight: Responses in the cold start range (1-3s) showed only a small increase in abandonment (2.1% vs. the 0.5% baseline, a rise of 1.6 percentage points). Significant abandonment appeared only above 5s, and those requests were caused by application logic issues, not cold starts.
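For readers who want to build the same breakdown from raw events, here is a minimal sketch of bucketing by response time and computing an abandonment rate per bucket. The bucket boundaries mirror the table; the event tuples are made-up illustrations, not our data, and the function names are hypothetical.

```python
from collections import defaultdict

# Upper bounds (ms) and labels matching the table's buckets
BUCKETS = [
    (500, "< 500ms"),
    (1000, "500ms - 1s"),
    (3000, "1s - 3s"),
    (5000, "3s - 5s"),
]


def bucket_label(duration_ms):
    """Map a response time to the label of the bucket it falls in."""
    for upper, label in BUCKETS:
        if duration_ms < upper:
            return label
    return "> 5s"


def abandon_rates(events):
    """events: iterable of (duration_ms, abandoned) pairs."""
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for duration_ms, was_abandoned in events:
        label = bucket_label(duration_ms)
        totals[label] += 1
        abandoned[label] += int(was_abandoned)
    return {label: abandoned[label] / totals[label] for label in totals}


# Hypothetical events, not real data
events = [(210, False), (450, False), (1800, False), (2300, True), (6200, True)]
rates = abandon_rates(events)
print(rates)
```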

Customer Support Tickets: We reviewed 3 months of support tickets for "slow app" complaints:

Total tickets: 47
Related to cold starts: 0
Related to actual bugs (slow queries): 47

Users didn't perceive cold starts as a problem.

Cost/Benefit Analysis

With data in hand, we calculated the cost-effectiveness of warmup:

Warmup Cost:

$150/month to eliminate cold starts

Benefit:

Cold Start Impact
┌──────────────────────────────────────┐
│ Requests affected: 1% (14,400/day)   │
│ Latency added: 1.6s average          │
│ User abandonment increase: 1.6%      │
│ Daily affected users: ~14 users      │
│                                      │
│ Cost per affected user:              │
│ $150 ÷ 420 users = $0.36/user/month  │
└──────────────────────────────────────┘

Decision Matrix:

Warmup Cost/Benefit
┌──────────────────────────────────────┐
│ Monthly cost: $150                   │
│ Users impacted: 420                  │
│ Cost per user: $0.36                 │
│ Abandonment increase: 1.6%           │
│ Alternative solutions: Available     │
│ ────────────────────────────────     │
│ Decision: DISABLE WARMUP             │
└──────────────────────────────────────┘

A 1.6 percentage point abandonment increase on 1% of requests, spread across roughly 420 affected users per month (14 per day × 30 days), works out to about 7 abandoned requests per month (420 × 0.016 ≈ 6.7). We were spending $150/month to prevent those ~7 abandonments, or about $21 each, which is far more expensive than improving the product to reduce abandonment globally.
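The per-user and per-abandonment figures follow directly from the numbers in the tables above. A short worked version, with variable names chosen only for illustration:

```python
monthly_warmup_cost = 150.0        # USD
affected_users_per_day = 14
days_per_month = 30
abandon_increase = 0.021 - 0.005   # 1s-3s bucket minus the < 500ms baseline

affected_users = affected_users_per_day * days_per_month
prevented = affected_users * abandon_increase          # abandonments avoided per month
cost_per_user = monthly_warmup_cost / affected_users
cost_per_prevented = monthly_warmup_cost / round(prevented)

print(f"{affected_users} users, ~{prevented:.1f} abandonments prevented")
print(f"${cost_per_user:.2f} per user, ${cost_per_prevented:.2f} per prevented abandonment")
```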

Alternative Solutions

Instead of blanket warmup, we explored targeted strategies:

1. Provisioned Concurrency (Rejected)

Cost: $120/month for 2 concurrent executions
Benefit: Zero cold starts during configured hours

Analysis:
- Covers 16 hours/day (8am-midnight)
- Wastes capacity during low-traffic periods
- Still has cold starts off-hours

Decision: Too expensive for partial coverage

2. Increase Memory (Accepted)

Cost: $30/month (memory increase)
Benefit: Faster cold starts

Before: 1024MB memory, 2.3s cold start
After: 1536MB memory, 1.8s cold start

Analysis:
- More CPU with higher memory (Lambda scales CPU with memory)
- 22% faster cold start at P75 (2.3s → 1.8s)
- Affects 100% of cold starts
- Much cheaper than warmup

Decision: IMPLEMENT

We increased Lambda memory from 1024MB to 1536MB, reducing P75 cold start duration from 2.3s to 1.8s. This cost $30/month but shortened every cold start, whereas warmup only reduced how often they occurred.

3. Lambda Consolidation (Accepted)

Cost: $0 (architecture change)
Benefit: 75% fewer cold starts

Before: 4 separate Lambda functions
After: 1 consolidated function

Analysis:
- Shared runtime stays warm longer
- Fewer functions = fewer cold start opportunities
- 75% reduction in cold start frequency

Decision: IMPLEMENT (covered in Performance Post 5.2)
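To illustrate the idea only (the actual consolidation is described in Performance Post 5.2), a consolidated function can route requests through a dispatch table so that every endpoint shares one pool of warm execution environments. The route names and handlers below are hypothetical, and the event fields assume the API Gateway REST proxy format (`httpMethod`, `path`).

```python
import json


def list_items(event):
    # Hypothetical endpoint logic
    return {"items": []}


def get_profile(event):
    # Hypothetical endpoint logic
    return {"profile": None}


# Route table: (HTTP method, path) -> handler
ROUTES = {
    ("GET", "/items"): list_items,
    ("GET", "/profile"): get_profile,
}


def lambda_handler(event, context):
    """Single entry point that dispatches to per-route handlers."""
    route = ROUTES.get((event.get("httpMethod"), event.get("path")))
    if route is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(route(event))}
```

Because traffic to any route keeps the same function warm, a rarely used endpoint no longer pays its own cold start penalty.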

4. Client-Side Retry Logic (Accepted)

Cost: $0 (mobile app change)
Benefit: Invisible cold starts

Implementation:
- Mobile app detects slow responses (>3s)
- Shows "Connecting..." UI
- Retries failed requests
- Caches recent results

Decision: IMPLEMENT

Mobile App Retry Logic:

// Resolve after `ms` milliseconds; used to space out retries
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Note: automatic retries are only safe for idempotent requests
async function fetchWithRetry(url, options = {}, maxRetries = 2) {
  const timeout = 3000; // 3 second timeout per attempt

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), timeout);

    try {
      const response = await fetch(url, {
        ...options,
        signal: controller.signal
      });

      // Server error - back off, then retry
      if (response.status >= 500 && attempt < maxRetries) {
        await sleep(1000 * (attempt + 1)); // Linear backoff: 1s, 2s, ...
        continue;
      }

      // Success, client error, or retries exhausted: hand back to the caller
      return response;

    } catch (error) {
      if (error.name === 'AbortError' && attempt < maxRetries) {
        // Timeout - likely cold start, retry
        console.log(`Request timeout (attempt ${attempt + 1}), retrying...`);
        continue;
      }
      throw error;
    } finally {
      clearTimeout(timeoutId); // Always cancel the timer, even when fetch throws
    }
  }
}

This masked cold starts from users: if a request timed out, the app retried automatically instead of surfacing an error.

Implementation

Week 1: Disable Warmup Rules

We disabled both EventBridge rules:

# Disable primary warmup rule
aws events disable-rule --name lambda-warmup-primary

# Disable secondary warmup rule
aws events disable-rule --name lambda-warmup-secondary

# Verify disabled
aws events list-rules --query 'Rules[?State==`DISABLED`]'

Monitoring Setup:

import boto3

cloudwatch = boto3.client('cloudwatch')

# CloudWatch alarm for cold start spike
# sns_topic_arn must hold the ARN of the SNS topic to notify (defined elsewhere)
cloudwatch.put_metric_alarm(
    AlarmName='HighColdStartRate',
    MetricName='ColdStarts',
    Namespace='CustomMetrics',
    Statistic='Sum',
    Period=300,  # 5 minutes
    EvaluationPeriods=2,
    Threshold=100,  # Alert if >100 cold starts in 5 min
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=[sns_topic_arn]
)
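The alarm reads a `ColdStarts` metric in the `CustomMetrics` namespace, so something has to publish it. One way to do that, sketched here under assumptions, is a module-level flag combined with the CloudWatch Embedded Metric Format: the first invocation in a new execution environment prints a structured log line that CloudWatch turns into a metric. The helper name `cold_start_metric` is hypothetical, and the empty dimension set is chosen so the metric matches an alarm defined without dimensions.

```python
import json
import time

# Module scope: True only until the first invocation in this environment
_is_cold_start = True


def cold_start_metric(function_name, now_ms=None):
    """Return an Embedded Metric Format record on the first call in this
    execution environment, otherwise None."""
    global _is_cold_start
    if not _is_cold_start:
        return None
    _is_cold_start = False
    return {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "CustomMetrics",
                "Dimensions": [[]],  # no dimensions, to match the alarm
                "Metrics": [{"Name": "ColdStarts", "Unit": "Count"}],
            }],
        },
        "FunctionName": function_name,  # kept as a searchable log field
        "ColdStarts": 1,
    }


def lambda_handler(event, context):
    record = cold_start_metric(context.function_name)
    if record is not None:
        # CloudWatch extracts the metric from this structured log line
        print(json.dumps(record))
    return {"statusCode": 200}
```

Publishing through a log line avoids an extra API call on the request path, which matters most on exactly the invocations that are already slow.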

Week 2: Monitor Impact

We tracked key metrics for 7 days:

Cold Start Monitoring (7 days post-disable)
┌──────────────────────────────────────┐
│ Metric              Before    After  │
│ ────────────────────────────────     │
│ Cold start rate     0.1%      1.0%   │
│ P99 latency         850ms     2.8s   │
│ Error rate          0.02%     0.02%  │
│ User complaints     0         0      │
│ Abandonment rate    0.5%      0.5%   │
└──────────────────────────────────────┘

Cold start rate increased from 0.1% (with warmup) to 1.0% (without), but user-facing metrics showed no degradation.

Week 3: Increase Lambda Memory

To mitigate cold start duration, we increased memory:

# serverless.yml
functions:
  main:
    handler: src.lambda_handler.handler
    memorySize: 1536  # Up from 1024
    timeout: 3

Impact:

Cold Start Duration (Before → After Memory Increase)
┌──────────────────────────────────────┐
│ P50: 1.8s → 1.5s                     │
│ P75: 2.3s → 1.8s                     │
│ P95: 3.1s → 2.4s                     │
│ P99: 4.2s → 3.2s                     │
└──────────────────────────────────────┘

Cost: +$30/month (memory increase)
Benefit: 22% faster cold starts
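The headline percentage depends on which percentile is used. A quick calculation from the table shows the improvement at each one:

```python
before_s = {"p50": 1.8, "p75": 2.3, "p95": 3.1, "p99": 4.2}  # 1024MB
after_s = {"p50": 1.5, "p75": 1.8, "p95": 2.4, "p99": 3.2}   # 1536MB

# Percentage reduction in cold start duration at each percentile
speedup_pct = {
    p: round((before_s[p] - after_s[p]) / before_s[p] * 100, 1)
    for p in before_s
}
print(speedup_pct)
```

P75 works out to roughly 22%, the figure quoted here; the median improves by about 17% and the tail (P99) by about 24%.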

Week 4: Mobile App Retry Logic

We deployed retry logic to iOS and Android apps:

// React Native networking layer
const api = axios.create({
  baseURL: API_BASE_URL,
  timeout: 3000,
  retry: 2,
  retryDelay: (retryCount) => retryCount * 1000
});

// Add retry interceptor
api.interceptors.response.use(undefined, (error) => {
  const config = error.config;

  // If no retry config, reject
  if (!config || !config.retry) {
    return Promise.reject(error);
  }

  // Set retry count
  config.__retryCount = config.__retryCount || 0;

  // Check if we've maxed out retries
  if (config.__retryCount >= config.retry) {
    return Promise.reject(error);
  }

  // Increment retry count
  config.__retryCount += 1;

  // Delay before retry
  const delay = config.retryDelay
    ? config.retryDelay(config.__retryCount)
    : 1000;

  return new Promise((resolve) => {
    setTimeout(() => resolve(api(config)), delay);
  });
});

User Experience:

User Flow (With Retry Logic)
┌──────────────────────────────────────┐
│ 1. User taps button                  │
│ 2. Request hits cold start (3s)      │
│ 3. Request times out after 3s        │
│ 4. App shows "Connecting..." (1s)    │
│ 5. App retries (hits warm Lambda)    │
│ 6. Request succeeds in 200ms         │
│ 7. Total user wait: 4.2s             │
│                                      │
│ Without retry: Request fails         │
│ With retry: Request succeeds         │
└──────────────────────────────────────┘

Users experienced a slightly longer wait (4.2s vs 3s) but the request succeeded instead of failing. Error rate remained at 0.02%.
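The 4.2s figure is the sum of the client timeout, the retry delay, and one warm response. A small sketch, assuming the linear retry delay from the interceptor (1s before the first retry, 2s before the second) and a hypothetical helper name:

```python
def total_wait_s(timeout_s, retry_delay_s, warm_latency_s, timed_out_attempts=1):
    """Wall-clock wait when the first `timed_out_attempts` requests hit the
    client timeout and the next one succeeds against a warm function."""
    waited = 0.0
    for attempt in range(1, timed_out_attempts + 1):
        waited += timeout_s                # request runs until the client aborts it
        waited += retry_delay_s * attempt  # linear backoff before the next try
    return waited + warm_latency_s


print(round(total_wait_s(3.0, 1.0, 0.2), 1))                        # one timeout
print(round(total_wait_s(3.0, 1.0, 0.2, timed_out_attempts=2), 1))  # two timeouts
```

The second call shows the cost of an unlucky retry: if the retry also times out, the wait grows to more than 9 seconds, which is why the timeout should sit above the typical cold start duration.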

Results

Cost Savings:

Monthly Costs (Before → After)
┌──────────────────────────────────────┐
│ Before:                              │
│ - EventBridge warmup:    $150        │
│ - Lambda memory (1024MB): $400       │
│ Total: $550/month                    │
│                                      │
│ After:                               │
│ - EventBridge warmup:    $0          │
│ - Lambda memory (1536MB): $430       │
│ Total: $430/month                    │
│                                      │
│ Net Savings: $120/month              │
│ ($150 warmup - $30 memory increase)  │
└──────────────────────────────────────┘

Wait—why only $120 savings when warmup cost $150?

We reinvested $30/month in higher Lambda memory to improve cold start times. The net savings was $120/month, or $1,440/year.

Performance Impact:

User-Facing Metrics (Before → After)
┌──────────────────────────────────────┐
│ P50 latency:     210ms → 210ms       │
│ P95 latency:     450ms → 450ms       │
│ P99 latency:     850ms → 2.4s        │
│ Error rate:      0.02% → 0.02%       │
│ Abandonment:     0.5% → 0.5%         │
│ ────────────────────────────────     │
│ Cold start rate: 0.1% → 1%           │
│ Users affected:  ~1.4/day → 14/day   │
└──────────────────────────────────────┘

P99 latency increased by 1.6 seconds, affecting 1% of requests. But abandonment rate and error rate remained flat, confirming users didn't perceive this as a problem.

Lessons Learned

1. Measure User Impact, Not Technical Metrics

Cold starts were technically slow (2-3s), but users didn't complain. We were solving a technical problem that didn't affect user satisfaction.

2. Warmup is Expensive Insurance

$150/month to prevent ~7 abandonments/month = $21 per prevented abandonment. That money was better spent on features that reduce abandonment globally.

3. Increase Memory to Shorten Cold Starts

Lambda CPU scales with memory. Increasing memory from 1024MB to 1536MB reduced P75 cold start time by 22% for only $30/month, much cheaper than warmup.

4. Client-Side Retry Logic is Free

Adding retry logic to mobile apps made cold starts invisible. Users experienced a slightly longer wait but requests succeeded instead of failing.

5. Cold Start Frequency Depends on Traffic Patterns

During peak hours (9am-5pm), cold start rate was only 0.5% because Lambda stayed warm. Off-hours saw 3-5% cold starts, but traffic was minimal (100 requests/hour vs 1000/hour peak).

6. Legacy Rules Accumulate Costs

We discovered a forgotten secondary warmup rule adding $69/month. Regular cost audits are essential to catch zombie resources.

When Warmup Makes Sense

EventBridge warmup isn't always wrong—it makes sense for:

High cold start cost - If cold starts cause user churn or lost revenue
Predictable traffic - If you know exactly when traffic spikes occur
SLA requirements - If you have contractual latency SLAs
Synchronous APIs - If users wait for responses (vs async jobs)

For us, none of these applied. Our users tolerated 2-3s delays, traffic was unpredictable, and we had no SLAs.

Alternative Warmup Patterns

If you need warmup, consider these alternatives:

1. Traffic-Based Warmup

# Only warm during peak hours: every minute, 8:00-16:59 on weekdays
# (EventBridge rule cron expressions are evaluated in UTC)
schedule = 'cron(* 8-16 ? * MON-FRI *)'

Cost: $50/month (30% of full warmup)
Benefit: Covers 80% of traffic
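As a rough check on that cost share, the fraction of the week covered by the schedule can be computed directly. This sketch assumes one ping per minute inside a 9-hour weekday window; the helper name is illustrative.

```python
def warmup_window_share(hours_per_day, days_per_week):
    """Fraction of the week during which the warmup schedule is active."""
    return (hours_per_day * days_per_week) / (24 * 7)


share = warmup_window_share(hours_per_day=9, days_per_week=5)  # 8am-5pm, Mon-Fri
full_monthly_invocations = 60 * 24 * 30                        # every minute, 30 days
scheduled_invocations = full_monthly_invocations * share

print(f"{share:.0%} of the week, ~{scheduled_invocations:,.0f} invocations/month")
```

The window covers about 27% of the week, in line with the roughly 30% cost estimate.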

2. CloudWatch Alarm-Triggered Warmup

# Warm Lambda when cold start rate spikes
# (trigger_warmup_for_10_minutes is a placeholder for whatever re-enables the rule)
if cold_start_rate > 0.05:  # 5%
    trigger_warmup_for_10_minutes()

Cost: $10/month (reactive warmup only)
Benefit: Only pays when needed
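The decision step of a reactive warmup could look like the sketch below. It is an assumption-laden illustration rather than something we run: `should_warm` and the `min_requests` guard are hypothetical, the latter added so that a single cold start in a near-idle window does not trigger warmup.

```python
def should_warm(cold_starts, requests, threshold=0.05, min_requests=20):
    """Decide whether to start a temporary warmup window.

    min_requests guards against tiny samples where one cold start
    would exceed the threshold; it is an assumed tuning knob.
    """
    if requests < min_requests:
        return False
    return cold_starts / requests > threshold


print(should_warm(cold_starts=10, requests=100))  # rate 10%, above threshold
print(should_warm(cold_starts=1, requests=5))     # too few requests to judge
```

The counts would come from the same cold start metric used for alerting, evaluated over a short window such as five minutes.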

3. Provisioned Concurrency (Partial)

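# Illustrative only: scheduled provisioned concurrency is configured through
# Application Auto Scaling scheduled actions (scale up at 8am, down at 5pm)
provisionedConcurrency: 1  # Only 1 instance
schedule: '0 8-17 ? * MON-FRI *'  # Intended window: 8am-5pm weekdays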

Cost: $60/month (1 instance, peak hours only)
Benefit: Zero cold starts during peak

Conclusion

We eliminated EventBridge warmup, saving $120/month after reinvesting in higher Lambda memory. Cold start rate increased from 0.1% to 1%, but user-facing metrics showed no degradation. Client-side retry logic masked cold starts from users.

Key Takeaways:

Measure user impact before optimizing technical metrics
Warmup is expensive—only use it if cold starts cause real problems
Increase Lambda memory for faster cold starts (scales CPU)
Client-side retry logic makes cold starts invisible
Audit legacy infrastructure regularly to catch zombie costs

Final Metrics:

Cost savings: $120/month ($1,440/year)
Cold start rate: 0.1% → 1% (10× increase)
User impact: No measurable change
Engineering effort: 8 hours over 4 weeks

Related Plan: docs/plans/implemented/high/2026-01-16-cost-savings-eventbridge-warmup-plan.md

Related Posts:

Cost Post 8.1 (SnapStart Disable)
Cost Post 8.2 (Lambda Cost Investigation)
Performance Post 5.1 (Lambda SnapStart Rollout)
Performance Post 5.2 (Lambda Consolidation)