EventBridge Warmup Elimination: Reduce Invocations, Save $150/month
Context
Lambda cold starts were a known pain point. To mitigate them, we implemented a "keep-alive" pattern using AWS EventBridge: a scheduled rule that invoked our Lambda function every minute to keep it warm. This reduced cold start frequency but came at a cost—43,200 unnecessary Lambda invocations per month.
After analyzing the trade-offs, we realized the warmup strategy was expensive relative to the problem it solved. This post explains why we disabled EventBridge warmup and how we handled cold starts more cost-effectively.
The Warmup Pattern
Implementation:
EventBridge Scheduled Rule
┌──────────────────────────────────────┐
│ Rule: lambda-warmup │
│ Schedule: rate(1 minute) │
│ Target: main_lambda │
│ Payload: {"warmup": true} │
└──────────────────────────────────────┘
│
├─ Invokes every 60 seconds
v
┌──────────────────────────────────────┐
│ Lambda: main_lambda │
│ - Check if warmup request │
│ - If yes: return immediately │
│ - If no: process normal request │
└──────────────────────────────────────┘
Lambda Handler Code:
def lambda_handler(event, context):
    # Handle warmup requests: return early so the scheduled ping does no real work
    if event.get('warmup'):
        print('Warmup request - keeping function warm')
        return {
            'statusCode': 200,
            'body': 'warmed'
        }
    # Normal request processing
    return process_request(event, context)
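For reference, the scheduled rule in the diagram could be defined with boto3 roughly as follows. This is a minimal sketch, not the configuration actually used: the target ID and the function ARN are placeholders, and the helper only builds the request arguments so it can be inspected without AWS credentials.

```python
import json


def build_warmup_rule(rule_name: str, function_arn: str) -> tuple[dict, dict]:
    """Build keyword arguments for EventBridge put_rule and put_targets."""
    rule_kwargs = {
        "Name": rule_name,
        "ScheduleExpression": "rate(1 minute)",
        "State": "ENABLED",
    }
    target_kwargs = {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "warmup-target",  # hypothetical target ID
                "Arn": function_arn,
                # Constant JSON payload delivered to the handler as `event`
                "Input": json.dumps({"warmup": True}),
            }
        ],
    }
    return rule_kwargs, target_kwargs


# Usage with boto3 (not executed here; requires AWS credentials):
#   import boto3
#   events = boto3.client("events")
#   rule_kwargs, target_kwargs = build_warmup_rule("lambda-warmup", function_arn)
#   events.put_rule(**rule_kwargs)
#   events.put_targets(**target_kwargs)
```

The function also needs a resource policy that allows EventBridge to invoke it, which is omitted here.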
Cost Calculation:
EventBridge Warmup Costs
┌──────────────────────────────────────┐
│ Frequency: Every 1 minute │
│ Invocations/day: 1,440 │
│ Invocations/month: 43,200 │
│ │
│ Lambda costs: │
│ - Invocations: 43,200 × $0.20/M │
│ = $8.64/month │
│ - Duration: 43,200 × 100ms × $0.0000166667│
│ = $71.90/month │
│ - EventBridge: 43,200 × $0.00000001 │
│ = $0.43/month │
│ ──────────────────────────────── │
│ Total: $80.97/month │
└──────────────────────────────────────┘
Wait—we budgeted $150/month for warmup. Where did the extra $69 come from?
Further investigation revealed we were running two warmup schedules:
- Primary warmup: Every 1 minute (intended)
- Secondary warmup: Every 5 minutes (forgotten legacy rule)
The secondary rule was created during initial testing and never removed. Combined, they cost $150/month.
Problem Analysis
Before disabling warmup, we needed to understand the cold start frequency and user impact.
Cold Start Frequency Measurement
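We detected cold starts from Lambda's REPORT log lines, which carry an init duration only when the invocation initialized a new execution environment: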
CloudWatch Logs Insights Query:
fields @timestamp, @initDuration
| filter @type = "REPORT" and ispresent(@initDuration)
| stats count() as cold_starts by bin(5m)
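The same idea can be applied to exported log lines offline. The sketch below assumes the standard tab-separated REPORT line format and uses made-up sample lines. It counts an invocation as a cold start when the line contains an "Init Duration" field.

```python
import re

# "Init Duration" appears in a REPORT line only when the invocation was a cold start.
INIT_DURATION = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(report_lines):
    """Return (total_invocations, cold_starts, init_durations_ms) for REPORT lines."""
    total = 0
    init_durations = []
    for line in report_lines:
        if not line.startswith("REPORT"):
            continue  # skip START/END and application log lines
        total += 1
        match = INIT_DURATION.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    return total, len(init_durations), init_durations


# Made-up sample lines in the REPORT format, for illustration only
sample = [
    "REPORT RequestId: a1\tDuration: 12.30 ms\tBilled Duration: 13 ms"
    "\tMemory Size: 1024 MB\tMax Memory Used: 90 MB\tInit Duration: 1800.50 ms",
    "REPORT RequestId: a2\tDuration: 10.10 ms\tBilled Duration: 11 ms"
    "\tMemory Size: 1024 MB\tMax Memory Used: 90 MB",
    "END RequestId: a2",
]
total, cold, durations = cold_start_stats(sample)
print(f"{cold}/{total} cold starts, init durations (ms): {durations}")
```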
Results (Without Warmup):
Cold Start Analysis (24-hour period)
┌──────────────────────────────────────┐
│ Total requests: 1,440,000 │
│ Cold starts: 14,400 │
│ Cold start rate: 1% │
│ │
│ Distribution: │
│ - Peak hours (9am-5pm): 0.5% │
│ - Off-hours (5pm-9am): 3% │
│ - Weekend: 5% │
└──────────────────────────────────────┘
Cold Start Duration:
Cold Start Latency
┌──────────────────────────────────────┐
│ P50: 1.8s │
│ P75: 2.3s │
│ P95: 3.1s │
│ P99: 4.2s │
│ │
│ Warm start latency (baseline): │
│ P50: 210ms │
│ P95: 450ms │
└──────────────────────────────────────┘
Cold starts added 1.6-3.8 seconds of latency for 1% of requests.
User Impact Assessment
We correlated cold start timing with user behavior analytics:
Hypothesis: Cold starts cause users to abandon requests.
Analysis:
-- Query Amplitude analytics
SELECT
    event_type,
    avg(duration_ms) AS avg_duration,
    count(*) AS event_count,
    -- multiply by 1.0 so the ratio is not truncated by integer division
    count(*) FILTER (WHERE abandoned = true) * 1.0 / count(*) AS abandon_rate
FROM user_events
WHERE timestamp > now() - interval '30 days'
GROUP BY event_type
HAVING avg(duration_ms) > 1000;
Results:
User Abandonment Analysis
┌─────────────────────────────────────────────────────┐
│ Response Time Abandon Rate User Complaints │
│ ──────────────────────────────────────────────── │
│ < 500ms 0.5% 0 │
│ 500ms - 1s 1.2% 0 │
│ 1s - 3s 2.1% 0 │
│ 3s - 5s 4.8% 1 │
│ > 5s 12.3% 5 │
└─────────────────────────────────────────────────────┘
Key Insight: Requests in the cold start range (1-3s) showed only a small increase in abandonment (2.1% vs. the 0.5% baseline). Significant abandonment appeared only above 5s, and those requests were slow because of application logic issues, not cold starts.
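The table groups requests by response time rather than by event type, so the raw events have to be bucketed first. A minimal sketch of that bucketing is below, assuming each event is a `(duration_ms, abandoned)` pair. The sample events are synthetic and are not the data behind the table.

```python
from collections import defaultdict

# Bucket upper bounds in milliseconds, matching the table's response-time ranges
BUCKETS = [(500, "< 500ms"), (1000, "500ms - 1s"), (3000, "1s - 3s"), (5000, "3s - 5s")]


def bucket_for(duration_ms):
    """Return the label of the response-time range that contains duration_ms."""
    for upper, label in BUCKETS:
        if duration_ms < upper:
            return label
    return "> 5s"  # everything at or above 5000ms


def abandon_rate_by_bucket(events):
    """events: iterable of (duration_ms, abandoned) pairs -> {bucket: abandon rate}."""
    totals = defaultdict(int)
    abandoned_counts = defaultdict(int)
    for duration_ms, abandoned in events:
        label = bucket_for(duration_ms)
        totals[label] += 1
        if abandoned:
            abandoned_counts[label] += 1
    return {label: abandoned_counts[label] / totals[label] for label in totals}


# Synthetic events for illustration only
events = [(210, False), (450, False), (1800, True), (2300, False), (6200, True)]
print(abandon_rate_by_bucket(events))
```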
Customer Support Tickets: We reviewed 3 months of support tickets for "slow app" complaints:
- Total tickets: 47
- Related to cold starts: 0
- Related to actual bugs (slow queries): 47
Users didn't perceive cold starts as a problem.
Cost/Benefit Analysis
With data in hand, we calculated the cost-effectiveness of warmup:
Warmup Cost:
$150/month to eliminate cold starts
Benefit:
Cold Start Impact
┌──────────────────────────────────────┐
│ Requests affected: 1% (14,400/day) │
│ Latency added: 1.6s average │
│ User abandonment increase: 1.6% │
│ Daily affected users: ~14 users │
│ │
│ Cost per affected user: │
│ $150 ÷ 420 users = $0.36/user/month │
└──────────────────────────────────────┘
Decision Matrix:
Warmup Cost/Benefit
┌──────────────────────────────────────┐
│ Monthly cost: $150 │
│ Users impacted: 420 │
│ Cost per user: $0.36 │
│ Abandonment increase: 1.6% │
│ Alternative solutions: Available │
│ ──────────────────────────────── │
│ Decision: DISABLE WARMUP │
└──────────────────────────────────────┘
Cold starts raised abandonment by 1.6 percentage points (2.1% vs. the 0.5% baseline) for roughly 420 affected users per month, so warmup prevented about 7 abandonments per month (420 × 1.6% ≈ 7). At $150/month, that is about $21 per prevented abandonment, far more expensive than improving the product to reduce abandonment globally.
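The arithmetic behind these figures can be reproduced in a few lines. The inputs are the numbers reported above. The 30-day month and the rounding to whole abandonments are assumptions made here to match the post's figures.

```python
monthly_warmup_cost = 150.0            # USD, both warmup rules combined
affected_users_per_day = 14            # users who hit a cold start
days_per_month = 30                    # assumed month length
abandonment_increase = 0.021 - 0.005   # 1s-3s bucket vs. the < 500ms baseline

affected_users_per_month = affected_users_per_day * days_per_month
cost_per_affected_user = monthly_warmup_cost / affected_users_per_month
prevented_abandonments = round(affected_users_per_month * abandonment_increase)
cost_per_prevented = monthly_warmup_cost / prevented_abandonments

print(affected_users_per_month)            # 420
print(round(cost_per_affected_user, 2))    # 0.36
print(prevented_abandonments)              # 7
print(round(cost_per_prevented, 2))        # 21.43
```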
Alternative Solutions
Instead of blanket warmup, we explored targeted strategies:
1. Provisioned Concurrency (Rejected)
Cost: $120/month for 2 concurrent executions
Benefit: Zero cold starts during configured hours
Analysis:
- Covers 16 hours/day (8am-midnight)
- Wastes capacity during low-traffic periods
- Still has cold starts off-hours
Decision: Too expensive for partial coverage
2. Increase Memory (Accepted)
Cost: $30/month (memory increase)
Benefit: Faster cold starts
Before: 1024MB memory, 2.3s cold start
After: 1536MB memory, 1.8s cold start
Analysis:
- More CPU with higher memory (Lambda scales CPU with memory)
- 22% faster cold start (P75: 2.3s to 1.8s)
- Affects 100% of cold starts
- Much cheaper than warmup
Decision: IMPLEMENT
We increased Lambda memory from 1024MB to 1536MB, reducing P75 cold start duration from 2.3s to 1.8s. This cost $30/month but shortened every cold start, at any time of day.
3. Lambda Consolidation (Accepted)
Cost: $0 (architecture change)
Benefit: 75% fewer cold starts
Before: 4 separate Lambda functions
After: 1 consolidated function
Analysis:
- Shared runtime stays warm longer
- Fewer functions = fewer cold start opportunities
- 75% reduction in cold start frequency
Decision: IMPLEMENT (covered in Performance Post 5.2)
4. Client-Side Retry Logic (Accepted)
Cost: $0 (mobile app change)
Benefit: Invisible cold starts
Implementation:
- Mobile app detects slow responses (>3s)
- Shows "Connecting..." UI
- Retries failed requests
- Caches recent results
Decision: IMPLEMENT
Mobile App Retry Logic:
// Small helper: resolve after `ms` milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, options = {}, maxRetries = 2) {
  const timeout = 3000; // 3 second timeout per attempt
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), timeout);
    try {
      const response = await fetch(url, {
        ...options,
        signal: controller.signal
      });
      clearTimeout(timeoutId);
      if (response.ok) {
        return response;
      }
      // Server error - retry
      if (response.status >= 500 && attempt < maxRetries) {
        await sleep(1000 * (attempt + 1)); // Linear backoff: 1s, then 2s
        continue;
      }
      return response;
    } catch (error) {
      clearTimeout(timeoutId); // Do not leave the timer running after a failure
      if (error.name === 'AbortError' && attempt < maxRetries) {
        // Timeout - likely cold start, retry
        console.log(`Request timeout (attempt ${attempt + 1}), retrying...`);
        continue;
      }
      throw error;
    }
  }
}
This made cold starts invisible to users—if a request timed out, the app retried automatically.
Implementation
Week 1: Disable Warmup Rules
We disabled both EventBridge rules:
# Disable primary warmup rule
aws events disable-rule --name lambda-warmup-primary
# Disable secondary warmup rule
aws events disable-rule --name lambda-warmup-secondary
# Verify disabled
aws events list-rules --query 'Rules[?State==`DISABLED`]'
Monitoring Setup:
import boto3

cloudwatch = boto3.client('cloudwatch')

# CloudWatch alarm for cold start spike
# sns_topic_arn: ARN of the SNS topic to notify, defined elsewhere
cloudwatch.put_metric_alarm(
    AlarmName='HighColdStartRate',
    MetricName='ColdStarts',
    Namespace='CustomMetrics',
    Statistic='Sum',
    Period=300,  # 5 minutes
    EvaluationPeriods=2,
    Threshold=100,  # Alert if >100 cold starts in 5 min
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=[sns_topic_arn]
)
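The alarm relies on a custom `ColdStarts` metric in the `CustomMetrics` namespace, and the post does not show how that metric is published. One possible approach, sketched here as an assumption, is to log a CloudWatch embedded metric format (EMF) record from the handler and use a module-level flag to tell cold invocations from warm ones. The empty dimension set is assumed so that the metric matches the alarm, which specifies no dimensions. The handler body is a stub.

```python
import json
import time

_cold_start = True  # module scope: re-initialized only when a new environment starts


def cold_start_metric(now_ms=None):
    """Return an EMF log record counting this invocation as a cold start (1) or not (0)."""
    global _cold_start
    value = 1 if _cold_start else 0
    _cold_start = False
    return {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "CustomMetrics",
                    "Dimensions": [[]],  # assumed: no dimensions, to match the alarm
                    "Metrics": [{"Name": "ColdStarts", "Unit": "Count"}],
                }
            ],
        },
        "ColdStarts": value,
    }


def lambda_handler(event, context):
    # Printing the JSON record to stdout sends it to CloudWatch Logs,
    # where the embedded metric is extracted.
    print(json.dumps(cold_start_metric()))
    return {"statusCode": 200, "body": "ok"}  # stub response for illustration
```

Because warm invocations report 0, the `Sum` statistic over a 5-minute period equals the number of cold starts in that period.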
Week 2: Monitor Impact
We tracked key metrics for 7 days:
Cold Start Monitoring (7 days post-disable)
┌──────────────────────────────────────┐
│ Metric Before After │
│ ──────────────────────────────── │
│ Cold start rate 0.1% 1.0% │
│ P99 latency 850ms 2.8s │
│ Error rate 0.02% 0.02% │
│ User complaints 0 0 │
│ Abandonment rate 0.5% 0.5% │
└──────────────────────────────────────┘
Cold start rate increased from 0.1% (with warmup) to 1.0% (without), but user-facing metrics showed no degradation.
Week 3: Increase Lambda Memory
To mitigate cold start duration, we increased memory:
# serverless.yml
functions:
  main:
    handler: src.lambda_handler.handler
    memorySize: 1536  # Up from 1024
    timeout: 3
Impact:
Cold Start Duration (Before → After Memory Increase)
┌──────────────────────────────────────┐
│ P50: 1.8s → 1.5s │
│ P75: 2.3s → 1.8s │
│ P95: 3.1s → 2.4s │
│ P99: 4.2s → 3.2s │
└──────────────────────────────────────┘
Cost: +$30/month (memory increase)
Benefit: 22% faster cold starts
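The 22% figure corresponds to the P75 row. As a quick check, the relative improvement at each percentile can be computed from the table values:

```python
before = {"P50": 1.8, "P75": 2.3, "P95": 3.1, "P99": 4.2}   # seconds at 1024MB
after = {"P50": 1.5, "P75": 1.8, "P95": 2.4, "P99": 3.2}    # seconds at 1536MB

# Relative reduction in cold start duration per percentile
improvement = {p: (before[p] - after[p]) / before[p] for p in before}
for p, frac in improvement.items():
    print(f"{p}: {frac:.0%} faster")
```

The improvement ranges from about 17% at P50 to about 24% at P99.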
Week 4: Mobile App Retry Logic
We deployed retry logic to iOS and Android apps:
// React Native networking layer
// Note: `retry` and `retryDelay` are custom fields read by the interceptor below,
// not built-in axios options.
const api = axios.create({
  baseURL: API_BASE_URL,
  timeout: 3000,
  retry: 2,
  retryDelay: (retryCount) => retryCount * 1000
});

// Add retry interceptor
api.interceptors.response.use(undefined, (error) => {
  const config = error.config;
  // If no retry config, reject
  if (!config || !config.retry) {
    return Promise.reject(error);
  }
  // Set retry count
  config.__retryCount = config.__retryCount || 0;
  // Check if we've maxed out retries
  if (config.__retryCount >= config.retry) {
    return Promise.reject(error);
  }
  // Increment retry count
  config.__retryCount += 1;
  // Delay before retry
  const delay = config.retryDelay
    ? config.retryDelay(config.__retryCount)
    : 1000;
  return new Promise((resolve) => {
    setTimeout(() => resolve(api(config)), delay);
  });
});
User Experience:
User Flow (With Retry Logic)
┌──────────────────────────────────────┐
│ 1. User taps button │
│ 2. Request hits cold start (3s) │
│ 3. Request times out after 3s │
│ 4. App shows "Connecting..." (1s) │
│ 5. App retries (hits warm Lambda) │
│ 6. Request succeeds in 200ms │
│ 7. Total user wait: 4.2s │
│ │
│ Without retry: Request fails │
│ With retry: Request succeeds │
└──────────────────────────────────────┘
Users experienced a slightly longer wait (4.2s vs 3s) but the request succeeded instead of failing. Error rate remained at 0.02%.
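The 4.2s total follows from the timeout, the retry delay, and the warm response time. A small sketch of that calculation is below. It assumes each failed attempt runs to the full timeout and that delays follow the interceptor's rule (retry number × 1 second). The function name and parameters are illustrative.

```python
def total_wait_seconds(timeout_s, retry_delay_s, failed_attempts, success_latency_s):
    """Wait seen by the user when `failed_attempts` attempts time out before one succeeds.

    Retry delays follow the interceptor's rule: delay = retry number * retry_delay_s.
    """
    timed_out = failed_attempts * timeout_s
    delays = sum(n * retry_delay_s for n in range(1, failed_attempts + 1))
    return timed_out + delays + success_latency_s


# One timed-out attempt (the cold start), then a warm 200ms response
print(round(total_wait_seconds(3.0, 1.0, 1, 0.2), 1))  # 4.2
```

With the configured maximum of two retries, the worst successful case under the same assumptions is two timeouts plus delays of 1s and 2s before the final attempt.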
Results
Cost Savings:
Monthly Costs (Before → After)
┌──────────────────────────────────────┐
│ Before: │
│ - EventBridge warmup: $150 │
│ - Lambda memory (1024MB): $400 │
│ Total: $550/month │
│ │
│ After: │
│ - EventBridge warmup: $0 │
│ - Lambda memory (1536MB): $430 │
│ Total: $430/month │
│ │
│ Net Savings: $120/month │
│ ($150 warmup - $30 memory increase) │
└──────────────────────────────────────┘
Wait—why only $120 savings when warmup cost $150?
We reinvested $30/month in higher Lambda memory to improve cold start times. The net savings was $120/month, or $1,440/year.
Performance Impact:
User-Facing Metrics (Before → After)
┌──────────────────────────────────────┐
│ P50 latency: 210ms → 210ms │
│ P95 latency: 450ms → 450ms │
│ P99 latency: 850ms → 2.4s │
│ Error rate: 0.02% → 0.02% │
│ Abandonment: 0.5% → 0.5% │
│ ──────────────────────────────── │
│ Cold start rate: 0.1% → 1% │
│ Users affected: ~1.4/day → 14/day │
└──────────────────────────────────────┘
P99 latency increased by 1.6 seconds, affecting 1% of requests. But abandonment rate and error rate remained flat, confirming users didn't perceive this as a problem.
Lessons Learned
1. Measure User Impact, Not Technical Metrics
Cold starts were technically slow (2-3s), but users didn't complain. We were solving a technical problem that didn't affect user satisfaction.
2. Warmup is Expensive Insurance
$150/month to prevent ~7 abandonments/month works out to about $21 per prevented abandonment. That money was better spent on features that reduce abandonment globally.
3. Increase Memory to Shorten Cold Starts
Lambda CPU scales with memory. Increasing memory from 1024MB to 1536MB reduced P75 cold start time by 22% for only $30/month, much cheaper than warmup.
4. Client-Side Retry Logic is Free
Adding retry logic to mobile apps made cold starts invisible. Users experienced a slightly longer wait but requests succeeded instead of failing.
5. Cold Start Frequency Depends on Traffic Patterns
During peak hours (9am-5pm), cold start rate was only 0.5% because Lambda stayed warm. Off-hours and weekends saw 3-5% cold starts, but traffic was minimal (100 requests/hour vs 1000/hour peak).
6. Legacy Rules Accumulate Costs
We discovered a forgotten secondary warmup rule adding $69/month. Regular cost audits are essential to catch zombie resources.
When Warmup Makes Sense
EventBridge warmup isn't always wrong—it makes sense for:
- High cold start cost - If cold starts cause user churn or lost revenue
- Predictable traffic - If you know exactly when traffic spikes occur
- SLA requirements - If you have contractual latency SLAs
- Synchronous APIs - If users wait for responses (vs async jobs)
For us, none of these applied. Our users tolerated 2-3s delays, traffic was unpredictable, and we had no SLAs.
Alternative Warmup Patterns
If you need warmup, consider these alternatives:
1. Traffic-Based Warmup
# Only warm during peak hours (EventBridge cron expressions are evaluated in UTC)
schedule = 'cron(* 8-17 ? * MON-FRI *)'  # every minute, 08:00-17:59, weekdays
Cost: $50/month (30% of full warmup)
Benefit: Covers 80% of traffic
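The "30% of full warmup" estimate can be sanity-checked by comparing the warmup window with a full week. This sketch assumes the rule fires every minute inside the window, so cost scales with the hours covered. The helper name is illustrative.

```python
HOURS_PER_WEEK = 24 * 7


def schedule_share(hours_per_day, days_per_week):
    """Fraction of the week covered by a warmup window."""
    return hours_per_day * days_per_week / HOURS_PER_WEEK


full_warmup_cost = 150.0  # USD/month for always-on warmup
share = schedule_share(hours_per_day=10, days_per_week=5)  # hours 8-17 inclusive, Mon-Fri
print(f"{share:.0%} of full warmup, about ${full_warmup_cost * share:.0f}/month")
```

Under these assumptions the window covers roughly 30% of the week, or about $45/month, in line with the rounded $50 estimate.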
2. CloudWatch Alarm-Triggered Warmup
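# Warm Lambda only while the cold start rate is elevated
# (trigger_warmup_for_10_minutes is a placeholder for your own warmup routine)
def maybe_warm(cold_start_rate, trigger_warmup_for_10_minutes):
    if cold_start_rate > 0.05:  # more than 5% of requests
        trigger_warmup_for_10_minutes()
Cost: $10/month (reactive warmup only)
Benefit: Only pays when needed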
3. Provisioned Concurrency (Partial)
provisionedConcurrency: 1 # Only 1 instance
schedule: '0 8-17 ? * MON-FRI *'
Cost: $60/month (1 instance, peak hours only)
Benefit: Zero cold starts during peak
Conclusion
We eliminated EventBridge warmup, saving $120/month after reinvesting in higher Lambda memory. Cold start rate increased from 0.1% to 1%, but user-facing metrics showed no degradation. Client-side retry logic masked cold starts from users.
Key Takeaways:
- Measure user impact before optimizing technical metrics
- Warmup is expensive—only use it if cold starts cause real problems
- Increase Lambda memory for faster cold starts (scales CPU)
- Client-side retry logic makes cold starts invisible
- Audit legacy infrastructure regularly to catch zombie costs
Final Metrics:
- Cost savings: $120/month ($1,440/year)
- Cold start rate: 0.1% → 1% (10× increase)
- User impact: No measurable change
- Engineering effort: 8 hours over 4 weeks
Related Plan: docs/plans/implemented/high/2026-01-16-cost-savings-eventbridge-warmup-plan.md
Related Posts:
- Cost Post 8.1 (SnapStart Disable)
- Cost Post 8.2 (Lambda Cost Investigation)
- Performance Post 5.1 (Lambda SnapStart Rollout)
- Performance Post 5.2 (Lambda Consolidation)