Lambda Cost Investigation: From $2,700 to $1,700/month
Context
Our monthly AWS bill showed Lambda costs climbing to $2,700/month, 35% above our budgeted $2,000 target. Without visibility into the cost breakdown, we couldn't identify which optimizations would deliver the highest ROI. We needed a data-driven investigation to find quick wins.
This post details our systematic approach to Lambda cost analysis and the specific optimizations that reduced costs by 37% in one month.
The Investigation Process
Initial State:
Monthly Lambda Bill: $2,700
┌──────────────────────────────────────┐
│ Line Item Amount │
│ ──────────────────────────────── │
│ (Unknown breakdown) │
│ Total: $2,700 │
└──────────────────────────────────────┘
Questions:
- Which functions cost the most?
- What drives duration charges?
- Are we over-provisioned?
We established a three-phase investigation methodology:
Phase 1: Cost Attribution
Used AWS Cost Explorer with Lambda-specific filters to break down costs by dimension:
- Function name
- Memory configuration
- Region
- Time period (hourly patterns)
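The same breakdown can be pulled programmatically. The following is a minimal Python sketch, not the exact tooling we used: it assumes a client that behaves like `boto3.client("ce")`, groups by usage type rather than by function, and ignores pagination. The helper name `lambda_cost_by_usage_type` is illustrative.

```python
def lambda_cost_by_usage_type(ce_client, start, end):
    """Return {usage_type: unblended_cost} for AWS Lambda between two ISO dates.

    `ce_client` is expected to behave like boto3.client("ce").
    """
    response = ce_client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # End date is exclusive
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AWS Lambda"]}},
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )
    costs = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            usage_type = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            # Sum across periods in case the range spans several months.
            costs[usage_type] = costs.get(usage_type, 0.0) + amount
    return costs
```

A per-function split would typically rely on a cost allocation tag, using `{"Type": "TAG", "Key": ...}` in `GroupBy`; the tag key depends on your own tagging scheme.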
Phase 2: Metric Analysis
Queried CloudWatch Logs Insights for 30 days of Lambda execution data:
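# Only REPORT lines carry duration and memory fields.
# @memorySize and @maxMemoryUsed are reported in bytes.
# @log identifies the log group (one per function by default).
fields @timestamp, @duration, @billedDuration, @memorySize, @maxMemoryUsed
| filter @type = "REPORT"
| stats
    count() as invocations,
    avg(@duration) as avg_duration,
    avg(@billedDuration) as avg_billed,
    max(@maxMemoryUsed) as peak_memory,
    avg(@memorySize) as provisioned_memory
  by @log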
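Aggregates like these can be turned into an approximate monthly charge. The Python sketch below is a rough estimate under stated assumptions: public x86 list prices of $0.0000166667 per GB-second and $0.20 per million requests (both vary by region and architecture), and no free tier. The function name and the sample workload are illustrative, not figures from our account.

```python
# Assumed list prices; check the pricing page for your region and architecture.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def estimate_monthly_cost(invocations: int, avg_billed_ms: float, memory_mb: int) -> dict:
    """Approximate Lambda charges from per-function Logs Insights aggregates."""
    gb_seconds = invocations * (avg_billed_ms / 1000) * (memory_mb / 1024)
    duration_cost = gb_seconds * PRICE_PER_GB_SECOND
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return {
        "duration": round(duration_cost, 2),
        "requests": round(request_cost, 2),
        "total": round(duration_cost + request_cost, 2),
    }


# Hypothetical workload: 10M invocations, 200 ms average billed, 1024 MB.
print(estimate_monthly_cost(10_000_000, 200, 1024))
```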
Phase 3: Optimization Modeling
For each cost driver, we calculated:
- Current cost: Monthly spend
- Optimization potential: Expected savings
- Implementation effort: Engineering hours
- Risk level: Production impact risk
Cost Breakdown Discovery
After analyzing 30 days of billing and execution data, we identified the cost drivers:
Lambda Cost Breakdown (Monthly)
┌──────────────────────────────────────┐
│ Category          Cost          %    │
│ ──────────────────────────────────── │
│ Invocations:      $1,200      44%    │
│ Duration:         $800        30%    │
│ SnapStart:        $550        20%    │
│ Data Transfer:    $150         6%    │
│ ──────────────────────────────────── │
│ Total:            $2,700     100%    │
└──────────────────────────────────────┘
Key Insights:
- Invocation costs dominated: 44% from request count alone
- SnapStart was expensive: $550/month for a feature that only helps the 1% of requests that hit a cold start (see Cost Post 8.1)
- Duration had headroom: Functions averaged 45% memory utilization
- Timeout misconfiguration: 30s timeout for operations completing in 800ms
Optimization Opportunities
We identified seven optimization opportunities ranked by ROI:
1. Disable SnapStart
- Current cost: $550/month
- Savings: $550/month
- Effort: 1 hour (config change)
- Risk: Low (only affects 1% of requests)
Analysis:
- Cold starts: 1% of requests
- User impact: Minimal (2.5s increase for cold starts)
- Cost per cold start: $0.0011
- Decision: DISABLE
Impact: $550/month saved
2. Reduce Function Timeout
- Current cost: Contributes to $800 duration charges
- Savings: $200/month
- Effort: 2 hours (testing + deployment)
- Risk: Medium (requires validation)
Current State:
┌──────────────────────────────────────┐
│ Function: main_lambda │
│ Timeout: 30 seconds │
│ P99 duration: 1.2 seconds │
│ Headroom: 28.8s above P99 (unused) │
└──────────────────────────────────────┘
Optimization:
┌──────────────────────────────────────┐
│ Function: main_lambda │
│ Timeout: 3 seconds │
│ P99 duration: 1.2 seconds │
│ Buffer: 1.8s safety margin │
└──────────────────────────────────────┘
Lambda bills actual execution time in 1ms increments, so the unused portion of a timeout is never charged on its own. The savings come from failure cases: a hung or runaway invocation is now cut off, and stops billing, after 3 seconds instead of 30.
Impact: $200/month saved
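The effect of the timeout on a runaway invocation can be bounded with simple arithmetic. This Python sketch assumes the public x86 list price of $0.0000166667 per GB-second and a 1024 MB function; `max_cost_per_invocation` is an illustrative helper, not part of any AWS SDK.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed list price; varies by region/arch


def max_cost_per_invocation(timeout_s: float, memory_mb: int) -> float:
    """Upper bound on the duration charge if an invocation runs until timeout."""
    return timeout_s * (memory_mb / 1024) * PRICE_PER_GB_SECOND


before = max_cost_per_invocation(30, 1024)
after = max_cost_per_invocation(3, 1024)
print(f"30s cap: ${before:.7f}  3s cap: ${after:.7f}  ratio: {before / after:.0f}x")
```

A healthy invocation that finishes in 1.2 seconds costs the same under either setting; only invocations that would otherwise run past 3 seconds get cheaper.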
3. Optimize Memory Configuration
- Current cost: Part of $800 duration charges
- Savings: $150/month
- Effort: 4 hours (benchmarking + testing)
- Risk: Medium (performance testing required)
Memory Analysis (CloudWatch data):
┌──────────────────────────────────────┐
│ Provisioned: 1024 MB │
│ Avg Used: 460 MB (45%) │
│ P95 Used: 580 MB (57%) │
│ P99 Used: 620 MB (60%) │
└──────────────────────────────────────┘
Optimization Path:
┌──────────────────────────────────────┐
│ Test 768 MB: P99 = 650 MB (85%) │
│ Test 640 MB: P99 = 620 MB (97%) ✓ │
│ Decision: 640 MB with monitoring │
└──────────────────────────────────────┘
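We reduced memory from 1024 MB to 640 MB, raising P99 utilization to 97%. That leaves only about 20 MB of headroom, which is why the change shipped with monitoring. Lambda's duration price is linear in configured memory, so at equal billed duration this cuts the per-millisecond rate by 37.5%.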
Impact: $150/month saved
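The 37.5% and 97% figures follow directly from the memory settings. A small Python sketch of that arithmetic, assuming billed duration does not change at the lower setting (Lambda allocates CPU in proportion to memory, so in practice duration can rise); the function names are illustrative:

```python
def memory_savings_fraction(old_mb: int, new_mb: int) -> float:
    """Fraction saved on duration charges if billed duration stays the same."""
    return 1 - new_mb / old_mb


def utilization(p99_used_mb: int, provisioned_mb: int) -> float:
    """Share of provisioned memory used at P99."""
    return p99_used_mb / provisioned_mb


print(f"savings: {memory_savings_fraction(1024, 640):.1%}")  # 37.5%
print(f"utilization: {utilization(620, 640):.0%}")           # 97%
```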
4. Consolidate Lambda Functions
- Current cost: Multiple functions increase cold start frequency
- Savings: $100/month
- Effort: 16 hours (architecture refactor)
- Risk: High (requires testing)
Before: 4 Functions
┌──────────────────┐
│ auth_lambda │ → Cold starts
├──────────────────┤
│ content_lambda │ → Cold starts
├──────────────────┤
│ user_lambda │ → Cold starts
├──────────────────┤
│ analytics_lambda │ → Cold starts
└──────────────────┘
Each function has separate cold starts and
resource allocation
After: 1 Function
┌──────────────────────────────────────┐
│ main_lambda │
│ ├─ /auth/* (routing) │
│ ├─ /content/* (routing) │
│ ├─ /user/* (routing) │
│ └─ /analytics/* (routing) │
└──────────────────────────────────────┘
Shared runtime, reduced cold starts,
better resource utilization
This optimization was covered in detail in Performance Post 5.2. The consolidation reduced cold start frequency by 75% and eliminated redundant initialization overhead.
Impact: $100/month saved
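A consolidated function usually dispatches on the request path. The Python sketch below shows one possible shape, not the routing described in Performance Post 5.2: the four handler functions are hypothetical stand-ins, and it assumes an API Gateway HTTP API (payload format 2.0) event, where the path arrives in `rawPath`.

```python
import json


def handle_auth(event):
    return {"service": "auth"}


def handle_content(event):
    return {"service": "content"}


def handle_user(event):
    return {"service": "user"}


def handle_analytics(event):
    return {"service": "analytics"}


# Path prefix -> handler, mirroring the four former functions.
ROUTES = {
    "/auth/": handle_auth,
    "/content/": handle_content,
    "/user/": handle_user,
    "/analytics/": handle_analytics,
}


def lambda_handler(event, context):
    """Single entry point that routes by path prefix."""
    path = event.get("rawPath", "")
    for prefix, handler in ROUTES.items():
        if path.startswith(prefix):
            return {"statusCode": 200, "body": json.dumps(handler(event))}
    return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
```

Because all routes share one execution environment, a warm container for one route is also warm for the others, which is where the cold start reduction comes from.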
5. API Request Batching
- Current cost: High invocation count
- Savings: Indirect (reduces API Gateway costs more than Lambda)
- Effort: 12 hours (client + server changes)
- Risk: Medium
See Cost Post 8.3 for detailed analysis of API Gateway optimizations.
6. CloudWatch Logs Optimization
- Current cost: Not Lambda directly, but related
- Savings: $570/month (CloudWatch)
- Effort: 3 hours
- Risk: Low
See Cost Post 8.4 for detailed analysis.
7. EventBridge Warmup Elimination
- Current cost: $150/month in warmup invocations
- Savings: $150/month
- Effort: 1 hour
- Risk: Low
See Cost Post 8.5 for detailed analysis.
Implementation Plan
We prioritized optimizations by ROI (monthly savings ÷ engineering hours):
Optimization Roadmap
┌──────────────────────────────────────────────────┐
│ Priority Optimization          ROI      Timeline │
│ ──────────────────────────────────────────────── │
│ 1        Disable SnapStart     $550/hr  Week 1   │
│ 2        EventBridge disable   $150/hr  Week 1   │
│ 3        Reduce timeout        $100/hr  Week 1   │
│ 4        Optimize memory       $38/hr   Week 2   │
│ 5        Lambda consolidation  $6/hr    Week 3   │
└──────────────────────────────────────────────────┘
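The ranking is a sort on savings per engineering hour. A short Python sketch using the savings and effort figures from the sections above; the `Optimization` class is purely illustrative:

```python
from dataclasses import dataclass


@dataclass
class Optimization:
    name: str
    monthly_savings: float  # USD per month
    effort_hours: float

    @property
    def roi(self) -> float:
        """Monthly savings per engineering hour."""
        return self.monthly_savings / self.effort_hours


candidates = [
    Optimization("Disable SnapStart", 550, 1),
    Optimization("Reduce timeout", 200, 2),
    Optimization("EventBridge disable", 150, 1),
    Optimization("Optimize memory", 150, 4),
    Optimization("Lambda consolidation", 100, 16),
]

ranked = sorted(candidates, key=lambda o: o.roi, reverse=True)
for rank, opt in enumerate(ranked, 1):
    print(f"{rank}. {opt.name:<22} ${opt.roi:,.2f}/hr")
```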
Week 1: Low-Hanging Fruit ($900 savings)
- Disabled SnapStart (1 hour, $550/month saved)
- Reduced timeout from 30s to 3s (2 hours, $200/month saved)
- Disabled EventBridge warmup (1 hour, $150/month saved)
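The SnapStart and timeout changes are both function configuration updates. A hedged Python sketch of how they could be applied together, assuming a client that behaves like `boto3.client("lambda")`; the helper name `apply_week1_config` is hypothetical, and this is not necessarily how our deployment pipeline made the change:

```python
def apply_week1_config(lambda_client, function_name, timeout_s=3):
    """Disable SnapStart and tighten the timeout in one configuration update.

    `lambda_client` is expected to behave like boto3.client("lambda").
    """
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_s,              # seconds
        SnapStart={"ApplyOn": "None"},  # turn SnapStart off
    )
```

With real credentials this would be called as `apply_week1_config(boto3.client("lambda"), "main_lambda")`. SnapStart operates on published versions, so a new version would likely need to be published before the change takes effect; treat that step as an assumption about your deployment flow.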
Week 2: Memory Optimization ($150 savings)
- Ran memory benchmarks at 768 MB, 640 MB, 512 MB
- Monitored P99 memory usage for 3 days
- Deployed 640 MB configuration
- Validated performance for 4 days
Week 3: Architecture Refactor ($100 savings)
- Consolidated 4 Lambda functions into 1
- Updated API Gateway routing
- Ran integration test suite
- Phased rollout with traffic splitting
Results
Cost Reduction:
Lambda Costs (Before → After)
┌──────────────────────────────────────┐
│ Before: $2,700/month │
│ After: $1,700/month │
│ ──────────────────────────────── │
│ Savings: $1,000/month │
│ Reduction: 37% │
│ Annual Impact: $12,000/year │
└──────────────────────────────────────┘
Breakdown of Savings:
- SnapStart disabled: $550/month
- Timeout reduction: $200/month
- EventBridge warmup: $150/month
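- Memory optimization: $150/month (blended into duration)
- Function consolidation: $100/month (reduced cold starts)
These per-item estimates sum to $1,150/month, more than the $1,000/month drop on the bill. The duration-related items (timeout, memory, consolidation) likely overlap, so they are not fully additive.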
Performance Impact:
Latency Metrics (Before → After)
┌──────────────────────────────────────┐
│ P50 latency: 210ms → 205ms │
│ P95 latency: 450ms → 440ms │
│ P99 latency: 850ms → 2.8s* │
│ Error rate: 0.02% → 0.02% │
│ ──────────────────────────────── │
│ *P99 increase due to cold starts │
│ affecting 1% of requests │
└──────────────────────────────────────┘
The P99 latency increased from 850ms to 2.8s due to cold starts (SnapStart disabled), but this only affected 1% of requests and didn't correlate with user complaints or increased error rates.
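A toy distribution shows why P99 can jump while P50 and P95 barely move. The Python sketch below uses synthetic latencies, not our production data, with a nearest-rank percentile. It assumes 1.5% of requests hit a cold start, because with nearest-rank the tail has to be slightly larger than 1% before P99 lands in it.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in ms: 985 warm requests and 15 cold starts (1.5%).
latencies = [200] * 985 + [2800] * 15

for pct in (50, 95, 99):
    print(f"P{pct}: {percentile(latencies, pct)} ms")
```

The median and P95 stay at the warm latency, while P99 reports the cold start latency outright.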
Lessons Learned
1. Start with Data, Not Assumptions
We assumed duration costs were the problem. The data showed that invocations ($1,200/month) were the largest line item and that SnapStart ($550/month) could be eliminated outright. Without the Cost Explorer breakdown and CloudWatch Logs Insights analysis, we would have optimized the wrong things.
2. ROI Matters More Than Raw Savings
Lambda consolidation saved $100/month but required 16 hours of engineering time. Disabling SnapStart saved $550/month and took 1 hour, which is why it went first.
3. User Impact Trumps Technical Metrics
Cold starts increased P99 latency by roughly 2 seconds (850ms to 2.8s), but user-facing metrics (bounce rate, session duration, conversion) showed no degradation. Technical perfection isn't always worth the cost.
4. Monitor Before and After
We monitored metrics for 7 days before optimization and 14 days after. This gave us confidence that changes didn't cause regressions and provided data to refine further.
5. Optimize the Whole System
Lambda costs were $2,700/month, but related services (API Gateway, CloudWatch) added another $5,000/month. Optimizing Lambda alone missed the bigger picture (covered in Posts 8.3 and 8.4).
Tools and Techniques
AWS Cost Explorer:
- Lambda cost breakdown by function
- Time-series analysis to identify trends
- Tag-based cost allocation
CloudWatch Logs Insights Queries:
# Find over-provisioned memory (memory fields are reported in bytes)
fields @memorySize, @maxMemoryUsed,
  (@memorySize - @maxMemoryUsed) as waste
| filter @type = "REPORT"
| stats avg(@memorySize) as avg_provisioned,
    avg(@maxMemoryUsed) as avg_used,
    avg(waste) as avg_waste
  by @log

# Identify slow operations (durations are in milliseconds)
fields @timestamp, @duration
| filter @type = "REPORT" and @duration > 1000
| sort @duration desc
| limit 100
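Queries like these can also be run from a script instead of the console. The following Python sketch is one way to do that, assuming a client that behaves like `boto3.client("logs")`; the helper name `run_insights_query` and the polling interval are illustrative choices.

```python
import time


def run_insights_query(logs_client, log_group, query, start_ts, end_ts, poll_seconds=1.0):
    """Run a Logs Insights query and return its rows as dicts.

    `logs_client` is expected to behave like boto3.client("logs").
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_ts,  # epoch seconds
        endTime=end_ts,
        queryString=query,
    )["queryId"]

    # Poll until the query leaves the Scheduled/Running states.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

With real credentials, a call might look like `run_insights_query(boto3.client("logs"), "/aws/lambda/main_lambda", query, start, end)`, where the log group name assumes Lambda's default naming.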
Lambda Power Tuning: We used the open-source Lambda Power Tuning tool to benchmark different memory configurations and find the optimal cost/performance ratio.
Custom Cost Dashboard: Built a CloudWatch dashboard tracking:
- Daily Lambda costs (CloudWatch metric math)
- Invocation count by function
- Average duration by function
- Memory utilization percentiles
Conclusion
Reducing Lambda costs by 37% required data-driven investigation, not guesswork. We identified seven optimization opportunities, prioritized them by ROI, and implemented the five Lambda-specific ones over three weeks. The $12,000/year in savings was reallocated to engineer salaries and infrastructure improvements.
Key Takeaways:
- Use CloudWatch Logs Insights to analyze execution patterns
- Prioritize optimizations by ROI (savings per engineering hour)
- Measure user impact, not just technical metrics
- Monitor before and after to validate assumptions
- Optimize the whole system, not just Lambda in isolation
Final Results:
- Cost reduction: $1,000/month ($12,000/year)
- Engineering effort: 36 hours over 3 weeks
- ROI: about $333 in annual savings per engineering hour ($12,000 ÷ 36 hours)
- User experience: No measurable degradation
Related Plan: docs/plans/implemented/high/2026-01-16-cost-savings-lambda-plan.md
Related Posts:
- Cost Post 8.1 (SnapStart Disable)
- Cost Post 8.5 (EventBridge Warmup)
- Performance Post 5.2 (Lambda Consolidation)