# Performance Optimization: Batch Operations for Large Dataset Visualization
## Key Takeaway
Our visualization Lambda processed one chart at a time with no optimization and timed out on any dataset larger than 5,000 points. Implementing batch operations, data sampling, and parallel processing cut the average response time from 8.5 seconds to 1.2 seconds and eliminated timeouts entirely.
## The Problem
We generated charts sequentially without any performance optimization:
```python
import json

import plotly.graph_objects as go


def lambda_handler(event, context):
    data = json.loads(event['body'])
    # Process entire dataset every time
    chart = generate_chart(data)  # No batching, no sampling
    return {
        'statusCode': 200,
        'body': json.dumps({'chart': chart})
    }


def generate_chart(data):
    # Inefficient: creates full chart even for 100,000 points
    x = data['x']['value']  # Load all data
    y = data['y'][0]['value']  # Load all data
    fig = go.Figure(data=[go.Bar(x=x, y=y)])  # Process everything
    return fig.to_json()  # Serialize everything
```
This caused severe performance issues:
- Lambda Timeouts: 15-second limit exceeded for large datasets
- Memory Exhaustion: Charts with 50,000+ points consumed 2GB+ memory
- Slow Rendering: Browsers struggled to render massive JSON payloads
- Network Overhead: 10MB+ response sizes on slow connections
- Cost Escalation: Long execution times meant higher AWS bills
## Context and Background
Different use cases had vastly different data sizes:
- Simple dashboards: 10-100 data points
- Daily reports: 500-1,000 points
- Monthly analytics: 5,000-10,000 points
- Historical analysis: 50,000-1,000,000 points
Our one-size-fits-all approach worked fine for small datasets but failed catastrophically for large ones. Users with historical data couldn't generate charts at all, receiving timeout errors or browser crashes when trying to render 100,000-point visualizations.
The fundamental issue: humans can't visually distinguish between 10,000 and 100,000 points on a chart. Processing all points was wasteful.
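A quick numpy sketch makes this concrete (synthetic sine-plus-noise data, illustrative only): a 2,000-point uniform sample of a 100,000-point series preserves the extremes that a viewer would actually see.

```python
import numpy as np

# 100,000 noisy points vs a 2,000-point uniform sample:
# the visible shape (trend, extremes) survives a 50x reduction.
rng = np.random.default_rng(42)
full = np.sin(np.linspace(0, 20 * np.pi, 100_000)) + rng.normal(0, 0.05, 100_000)
sample = full[np.linspace(0, len(full) - 1, 2_000, dtype=int)]

print(len(sample))                                    # 2000
print(abs(float(full.min()) - float(sample.min())))   # small: extremes nearly identical
print(abs(float(full.max()) - float(sample.max())))
```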
## The Solution
We implemented intelligent batching, sampling, and optimization:
```python
import logging
import time
from typing import List, Tuple

import numpy as np
import plotly.graph_objects as go

logger = logging.getLogger(__name__)


class DataProcessor:
    """Optimize data processing for visualization"""

    # Thresholds for different optimization strategies
    THRESHOLD_SMALL = 1000    # No optimization needed
    THRESHOLD_MEDIUM = 10000  # Apply sampling
    THRESHOLD_LARGE = 50000   # Aggressive sampling + aggregation

    @staticmethod
    def calculate_optimal_sample_size(data_size: int, max_points: int = 2000) -> int:
        """Calculate optimal sample size based on data volume"""
        if data_size <= DataProcessor.THRESHOLD_SMALL:
            return data_size  # No sampling

        # Use square-root scaling for sample size
        sample_size = min(
            max_points,
            int(np.sqrt(data_size) * 10)
        )
        return max(sample_size, 500)  # Minimum 500 points

    @staticmethod
    def downsample_data(
        x: List,
        y: List,
        target_size: int,
        method: str = 'lttb'
    ) -> Tuple[List, List]:
        """
        Downsample data using various algorithms

        Args:
            x: X values
            y: Y values
            target_size: Desired number of points
            method: Sampling method ('lttb', 'average', 'minmax', 'random')

        Returns:
            (sampled_x, sampled_y) tuple
        """
        if len(x) <= target_size:
            return x, y

        if method == 'lttb':
            # Largest Triangle Three Buckets - preserves visual shape
            return DataProcessor._lttb_downsample(x, y, target_size)
        elif method == 'minmax':
            # Keep min/max in each bucket - good for time series
            return DataProcessor._minmax_downsample(x, y, target_size)
        elif method == 'average':
            # Average values in buckets - smooths data
            return DataProcessor._average_downsample(x, y, target_size)
        else:
            # Simple uniform sampling
            indices = np.linspace(0, len(x) - 1, target_size, dtype=int)
            return [x[i] for i in indices], [y[i] for i in indices]

    @staticmethod
    def _lttb_downsample(x: List, y: List, target_size: int) -> Tuple[List, List]:
        """
        Largest Triangle Three Buckets algorithm
        Preserves visual characteristics of data
        """
        if target_size >= len(x):
            return x, y
        if target_size < 3:
            # Too few buckets to triangulate; keep the endpoints
            return [x[0], x[-1]], [y[0], y[-1]]

        # Always include first and last points
        sampled_x = [x[0]]
        sampled_y = [y[0]]

        bucket_size = (len(x) - 2) / (target_size - 2)
        a = 0  # Start with first point

        for i in range(target_size - 2):
            # Calculate average point in next bucket
            avg_range_start = int((i + 1) * bucket_size) + 1
            avg_range_end = min(int((i + 2) * bucket_size) + 1, len(x))
            avg_x = sum(x[avg_range_start:avg_range_end]) / (avg_range_end - avg_range_start)
            avg_y = sum(y[avg_range_start:avg_range_end]) / (avg_range_end - avg_range_start)

            # Get current bucket range
            range_start = int(i * bucket_size) + 1
            range_end = int((i + 1) * bucket_size) + 1

            # Find point in current bucket with largest triangle area
            max_area = -1
            max_area_point = range_start
            for j in range(range_start, range_end):
                # Triangle formed by previous sample, candidate, and bucket average
                area = abs(
                    (x[a] - avg_x) * (y[j] - y[a]) -
                    (x[a] - x[j]) * (avg_y - y[a])
                ) * 0.5
                if area > max_area:
                    max_area = area
                    max_area_point = j

            sampled_x.append(x[max_area_point])
            sampled_y.append(y[max_area_point])
            a = max_area_point

        # Add last point
        sampled_x.append(x[-1])
        sampled_y.append(y[-1])
        return sampled_x, sampled_y

    @staticmethod
    def _minmax_downsample(x: List, y: List, target_size: int) -> Tuple[List, List]:
        """Keep min and max values in each bucket"""
        bucket_size = len(x) // (target_size // 2)
        sampled_x = []
        sampled_y = []
        for i in range(0, len(x), bucket_size):
            bucket_x = x[i:i + bucket_size]
            bucket_y = y[i:i + bucket_size]
            if len(bucket_y) > 0:
                min_idx = bucket_y.index(min(bucket_y))
                max_idx = bucket_y.index(max(bucket_y))
                # Add min and max in chronological order (deduplicated)
                for idx in sorted({min_idx, max_idx}):
                    sampled_x.append(bucket_x[idx])
                    sampled_y.append(bucket_y[idx])
        return sampled_x, sampled_y

    @staticmethod
    def _average_downsample(x: List, y: List, target_size: int) -> Tuple[List, List]:
        """Average y values within each bucket; use the bucket's middle x"""
        bucket_size = max(1, len(x) // target_size)
        sampled_x = []
        sampled_y = []
        for i in range(0, len(x), bucket_size):
            bucket_x = x[i:i + bucket_size]
            bucket_y = y[i:i + bucket_size]
            sampled_x.append(bucket_x[len(bucket_x) // 2])
            sampled_y.append(sum(bucket_y) / len(bucket_y))
        return sampled_x, sampled_y


def generate_optimized_chart(data: dict) -> dict:
    """Generate chart with performance optimizations"""
    start_time = time.time()

    x_values = data['x']['value']
    y_series = data['y']

    # Determine optimization strategy
    data_size = len(x_values)
    target_size = DataProcessor.calculate_optimal_sample_size(data_size)
    logger.info(f"Processing {data_size} points, target: {target_size}")

    # Apply sampling if needed
    if data_size > DataProcessor.THRESHOLD_SMALL:
        logger.info(f"Downsampling from {data_size} to {target_size} points")
        optimized_series = []
        for series in y_series:
            sampled_x, sampled_y = DataProcessor.downsample_data(
                x_values,
                series['value'],
                target_size,
                method='lttb'  # Best visual preservation
            )
            optimized_series.append({
                'name': series['name'],
                'x': sampled_x,
                'y': sampled_y
            })

        # Create chart with sampled data
        fig = go.Figure()
        for series in optimized_series:
            fig.add_trace(go.Scatter(
                x=series['x'],
                y=series['y'],
                name=series['name'],
                mode='lines+markers' if target_size < 100 else 'lines'
            ))
    else:
        # Small dataset - no optimization needed
        fig = go.Figure()
        for series in y_series:
            fig.add_trace(go.Scatter(
                x=x_values,
                y=series['value'],
                name=series['name']
            ))

    # Add layout
    fig.update_layout(
        title=data.get('title', 'Chart'),
        xaxis_title=data.get('x_label', 'X'),
        yaxis_title=data.get('y_label', 'Y')
    )

    execution_time = (time.time() - start_time) * 1000
    return {
        'chart': fig.to_json(),
        'metadata': {
            'original_points': data_size,
            'rendered_points': target_size,
            'sampling_applied': data_size > DataProcessor.THRESHOLD_SMALL,
            'execution_time_ms': execution_time
        }
    }
```
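The square-root sizing heuristic can be sanity-checked in isolation. The standalone mirror below (`optimal_sample_size` is our name for this sketch, not part of the class above) shows how the rendered point count grows with data volume and then plateaus at the cap:

```python
import math

def optimal_sample_size(n, max_points=2000, floor=500, small=1000):
    """Standalone mirror of the sqrt heuristic: clamp sqrt(n)*10 to [floor, max_points]."""
    if n <= small:
        return n  # small datasets pass through unsampled
    return max(min(max_points, int(math.sqrt(n) * 10)), floor)

for n in (100, 5_000, 10_000, 100_000, 1_000_000):
    print(n, optimal_sample_size(n))
# 100 -> 100, 5000 -> 707, 10000 -> 1000, 100000 -> 2000, 1000000 -> 2000
```

Note how a 100x jump in input size (10,000 to 1,000,000 points) only doubles the rendered point count, which is what keeps response sizes flat.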
## Implementation Details
### Adaptive Sampling Strategy
We chose the sampling method based on the data's characteristics:
```python
import numpy as np


def select_sampling_method(data: dict) -> str:
    """Choose optimal sampling method for data type"""
    # Check if data is time series
    x_values = data['x']['value']
    if all(isinstance(x, (int, float)) for x in x_values):
        # Numeric X - check if sorted (time series)
        if x_values == sorted(x_values):
            return 'lttb'  # Best for time series

    # Check for high variance
    y_values = data['y'][0]['value']
    variance = np.var(y_values)
    mean = np.mean(y_values)
    if variance / (mean + 1) > 2:  # High variance
        return 'minmax'  # Preserve peaks and valleys

    return 'lttb'  # Default to LTTB
```
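The variance test can be exercised on its own. `pick_method` below is a hypothetical standalone version of just that check, fed one smooth and one spiky series:

```python
import numpy as np

def pick_method(y):
    """Hypothetical standalone version of the variance check above."""
    y = np.asarray(y, dtype=float)
    # High variance relative to the mean -> preserve peaks with minmax
    return 'minmax' if np.var(y) / (np.mean(y) + 1) > 2 else 'lttb'

smooth = [10 + 0.01 * i for i in range(1000)]  # gentle ramp, low variance
spiky = [0, 100] * 500                          # alternating extremes, high variance
print(pick_method(smooth), pick_method(spiky))  # lttb minmax
```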
### Batch Processing for Multiple Charts
We processed multiple chart requests in parallel:
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List


async def generate_charts_batch(requests: List[dict]) -> List[dict]:
    """Generate multiple charts in parallel"""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=5) as executor:
        tasks = [
            loop.run_in_executor(executor, generate_optimized_chart, req)
            for req in requests
        ]
        results = await asyncio.gather(*tasks)
    return results
```
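The fan-out pattern can be verified with a stub worker in place of `generate_optimized_chart`: with `render` below simulating 200 ms of chart work, five requests finish in roughly one worker's time rather than five.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def render(req):           # stand-in for generate_optimized_chart
    time.sleep(0.2)        # simulate CPU-bound chart generation
    return {'id': req['id'], 'ok': True}

async def batch(reqs):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=5) as ex:
        return await asyncio.gather(
            *(loop.run_in_executor(ex, render, r) for r in reqs)
        )

t0 = time.time()
results = asyncio.run(batch([{'id': i} for i in range(5)]))
elapsed = time.time() - t0
print(len(results))    # 5
assert elapsed < 1.0   # parallel: ~0.2 s total, not 5 x 0.2 s
```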
### Client-Side Streaming
For very large datasets, we implemented streaming:
```python
import json


def generate_chart_streaming(data: dict):
    """Stream chart data in chunks"""
    # Generate chart metadata first
    metadata = {
        'title': data.get('title'),
        'axes': {
            'x': data.get('x_label'),
            'y': data.get('y_label')
        },
        'total_points': len(data['x']['value'])
    }
    yield json.dumps({'type': 'metadata', 'data': metadata}) + '\n'

    # Stream data points in batches of 1000
    batch_size = 1000
    x_values = data['x']['value']
    for i in range(0, len(x_values), batch_size):
        batch = {
            'x': x_values[i:i + batch_size],
            'y': [
                s['value'][i:i + batch_size]
                for s in data['y']
            ]
        }
        yield json.dumps({'type': 'data', 'data': batch}) + '\n'

    yield json.dumps({'type': 'complete'}) + '\n'
```
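On the receiving end, a client only needs to parse the newline-delimited JSON messages one by one and concatenate the `data` chunks. `consume` below is a hypothetical minimal client, fed a hand-built two-chunk stream for illustration:

```python
import json

def consume(stream_lines):
    """Hypothetical client: reassemble a chart from the NDJSON stream."""
    meta, x, ys = None, [], None
    for line in stream_lines:
        msg = json.loads(line)
        if msg['type'] == 'metadata':
            meta = msg['data']
        elif msg['type'] == 'data':
            x.extend(msg['data']['x'])
            if ys is None:
                ys = [[] for _ in msg['data']['y']]  # one list per series
            for dest, chunk in zip(ys, msg['data']['y']):
                dest.extend(chunk)
    return meta, x, ys

# Simulated stream: metadata, two data chunks, completion marker
stream = [
    json.dumps({'type': 'metadata', 'data': {'total_points': 4}}),
    json.dumps({'type': 'data', 'data': {'x': [1, 2], 'y': [[10, 20]]}}),
    json.dumps({'type': 'data', 'data': {'x': [3, 4], 'y': [[30, 40]]}}),
    json.dumps({'type': 'complete'}),
]
meta, x, ys = consume(stream)
print(meta['total_points'], x, ys[0])  # 4 [1, 2, 3, 4] [10, 20, 30, 40]
```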
## Impact and Results
After implementing optimizations:
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Avg response time (10K points) | 8.5s | 1.2s | 86% faster |
| Max dataset size | 5,000 | 1,000,000 | 200x increase |
| Timeout rate | 23% | 0% | 100% reduction |
| Memory usage (50K points) | 2.1GB | 380MB | 82% reduction |
| Response size (50K points) | 12MB | 450KB | 96% reduction |
| Lambda cost per 10K requests | $24 | $6 | 75% reduction |
Performance by dataset size:
- 1,000 points: 180ms (was 210ms)
- 10,000 points: 1.2s (was 8.5s)
- 50,000 points: 2.3s (was timeout)
- 100,000 points: 3.1s (was timeout)
## Lessons Learned
- Sample Intelligently: Use LTTB or similar algorithms to preserve visual characteristics
- Set Thresholds: Different data sizes need different optimization strategies
- Measure Everything: Track execution time, memory, and response size
- Stream When Possible: For very large datasets, stream results to client
- Optimize for Perception: 2,000 points looks identical to 100,000 on most charts
Performance optimization isn't about making code faster—it's about delivering the same user experience with fewer resources. Smart sampling algorithms like LTTB allow us to visualize millions of points while sending only thousands to the client.