
Performance Optimization: Batch Operations for Large Dataset Visualization

Key Takeaway

Our visualization Lambda processed one chart at a time with no optimization and timed out on any dataset larger than 5,000 points. Implementing batch operations, data sampling, and parallel processing cut average response time from 8.5 seconds to 1.2 seconds and eliminated timeouts entirely.

The Problem

We generated charts sequentially without any performance optimization:

import json

import plotly.graph_objects as go

def lambda_handler(event, context):
    data = json.loads(event['body'])

    # Process entire dataset every time
    chart = generate_chart(data)  # No batching, no sampling

    return {
        'statusCode': 200,
        'body': json.dumps({'chart': chart})
    }

def generate_chart(data):
    # Inefficient: creates full chart even for 100,000 points
    x = data['x']['value']  # Load all data
    y = data['y'][0]['value']  # Load all data

    fig = go.Figure(data=[go.Bar(x=x, y=y)])  # Process everything
    return fig.to_json()  # Serialize everything

This caused severe performance issues:

  1. Lambda Timeouts: 15-second limit exceeded for large datasets
  2. Memory Exhaustion: Charts with 50,000+ points consumed 2GB+ memory
  3. Slow Rendering: Browsers struggled to render massive JSON payloads
  4. Network Overhead: 10MB+ response sizes on slow connections
  5. Cost Escalation: Long execution times = higher AWS costs

Context and Background

Different use cases had vastly different data sizes:

  • Simple dashboards: 10-100 data points
  • Daily reports: 500-1,000 points
  • Monthly analytics: 5,000-10,000 points
  • Historical analysis: 50,000-1,000,000 points

Our one-size-fits-all approach worked fine for small datasets but failed catastrophically for large ones. Users with historical data couldn't generate charts at all, receiving timeout errors or browser crashes when trying to render 100,000-point visualizations.

The fundamental issue: humans can't visually distinguish between 10,000 and 100,000 points on a chart. Processing all points was wasteful.

The Solution

We implemented intelligent batching, sampling, and optimization:

import logging
import time
from typing import List, Tuple

import numpy as np
import plotly.graph_objects as go

logger = logging.getLogger(__name__)

class DataProcessor:
    """Optimize data processing for visualization"""

    # Thresholds for different optimization strategies
    THRESHOLD_SMALL = 1000  # No optimization needed
    THRESHOLD_MEDIUM = 10000  # Apply sampling
    THRESHOLD_LARGE = 50000  # Aggressive sampling + aggregation

    @staticmethod
    def calculate_optimal_sample_size(data_size: int, max_points: int = 2000) -> int:
        """Calculate optimal sample size based on data volume"""
        if data_size <= DataProcessor.THRESHOLD_SMALL:
            return data_size  # No sampling

        # Use logarithmic scaling for sample size
        sample_size = min(
            max_points,
            int(np.sqrt(data_size) * 10)
        )

        return max(sample_size, 500)  # Minimum 500 points

    @staticmethod
    def downsample_data(
        x: List,
        y: List,
        target_size: int,
        method: str = 'lttb'
    ) -> Tuple[List, List]:
        """
        Downsample data using various algorithms

        Args:
            x: X values
            y: Y values
            target_size: Desired number of points
            method: Sampling method ('lttb', 'average', 'minmax', 'random')

        Returns:
            (sampled_x, sampled_y) tuple
        """
        if len(x) <= target_size:
            return x, y

        if method == 'lttb':
            # Largest Triangle Three Buckets - preserves visual shape
            return DataProcessor._lttb_downsample(x, y, target_size)

        elif method == 'minmax':
            # Keep min/max in each bucket - good for time series
            return DataProcessor._minmax_downsample(x, y, target_size)

        elif method == 'average':
            # Average values in buckets - smooths data
            return DataProcessor._average_downsample(x, y, target_size)

        else:
            # Simple uniform sampling
            indices = np.linspace(0, len(x) - 1, target_size, dtype=int)
            return [x[i] for i in indices], [y[i] for i in indices]

    @staticmethod
    def _lttb_downsample(x: List, y: List, target_size: int) -> Tuple[List, List]:
        """
        Largest Triangle Three Buckets algorithm
        Preserves visual characteristics of data
        """
        if target_size >= len(x):
            return x, y

        # Always include first and last points
        sampled_x = [x[0]]
        sampled_y = [y[0]]

        bucket_size = (len(x) - 2) / (target_size - 2)

        a = 0  # Start with first point

        for i in range(target_size - 2):
            # Calculate average point in next bucket
            avg_range_start = int((i + 1) * bucket_size) + 1
            avg_range_end = int((i + 2) * bucket_size) + 1
            avg_range_end = min(avg_range_end, len(x))

            avg_x = sum(x[avg_range_start:avg_range_end]) / (avg_range_end - avg_range_start)
            avg_y = sum(y[avg_range_start:avg_range_end]) / (avg_range_end - avg_range_start)

            # Get current bucket range
            range_start = int(i * bucket_size) + 1
            range_end = int((i + 1) * bucket_size) + 1

            # Find point in current bucket with largest triangle area
            max_area = -1
            max_area_point = range_start

            for j in range(range_start, range_end):
                # Calculate triangle area
                area = abs(
                    (x[a] - avg_x) * (y[j] - y[a]) -
                    (x[a] - x[j]) * (avg_y - y[a])
                ) * 0.5

                if area > max_area:
                    max_area = area
                    max_area_point = j

            sampled_x.append(x[max_area_point])
            sampled_y.append(y[max_area_point])
            a = max_area_point

        # Add last point
        sampled_x.append(x[-1])
        sampled_y.append(y[-1])

        return sampled_x, sampled_y

    @staticmethod
    def _minmax_downsample(x: List, y: List, target_size: int) -> Tuple[List, List]:
        """Keep min and max values in each bucket"""
        # Each bucket contributes up to two points (its min and its max)
        bucket_size = max(1, len(x) // (target_size // 2))

        sampled_x = []
        sampled_y = []

        for i in range(0, len(x), bucket_size):
            bucket_x = x[i:i + bucket_size]
            bucket_y = y[i:i + bucket_size]

            if len(bucket_y) > 0:
                # Add min and max
                min_idx = bucket_y.index(min(bucket_y))
                max_idx = bucket_y.index(max(bucket_y))

                # Add in chronological order
                for idx in sorted([min_idx, max_idx]):
                    sampled_x.append(bucket_x[idx])
                    sampled_y.append(bucket_y[idx])

        return sampled_x, sampled_y
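`downsample_data` also dispatches to an `_average_downsample` method that isn't shown above. A minimal sketch of what it could look like, written here as a standalone function since the original implementation wasn't included:

```python
from typing import List, Tuple

def average_downsample(x: List[float], y: List[float], target_size: int) -> Tuple[List, List]:
    """Replace each bucket of points with its average (smooths noisy data)."""
    bucket_size = max(1, len(x) // target_size)

    sampled_x, sampled_y = [], []
    for i in range(0, len(x), bucket_size):
        bucket_x = x[i:i + bucket_size]
        bucket_y = y[i:i + bucket_size]
        sampled_x.append(sum(bucket_x) / len(bucket_x))
        sampled_y.append(sum(bucket_y) / len(bucket_y))

    return sampled_x, sampled_y

xs, ys = average_downsample(list(range(10)), list(range(10)), 5)
print(xs)   # [0.5, 2.5, 4.5, 6.5, 8.5]
```

Note that averaging requires numeric x values; for categorical or string x axes, one of the index-based methods above is the safer choice.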

def generate_optimized_chart(data: dict) -> dict:
    """Generate chart with performance optimizations"""
    start_time = time.time()

    x_values = data['x']['value']
    y_series = data['y']

    # Determine optimization strategy
    data_size = len(x_values)
    target_size = DataProcessor.calculate_optimal_sample_size(data_size)

    logger.info(f"Processing {data_size} points, target: {target_size}")

    # Apply sampling if needed
    if data_size > DataProcessor.THRESHOLD_SMALL:
        logger.info(f"Downsampling from {data_size} to {target_size} points")

        optimized_series = []

        for series in y_series:
            sampled_x, sampled_y = DataProcessor.downsample_data(
                x_values,
                series['value'],
                target_size,
                method='lttb'  # Best visual preservation
            )

            optimized_series.append({
                'name': series['name'],
                'x': sampled_x,
                'y': sampled_y
            })

        # Create chart with sampled data
        fig = go.Figure()

        for series in optimized_series:
            fig.add_trace(go.Scatter(
                x=series['x'],
                y=series['y'],
                name=series['name'],
                mode='lines+markers' if target_size < 100 else 'lines'
            ))

    else:
        # Small dataset - no optimization needed
        fig = go.Figure()

        for series in y_series:
            fig.add_trace(go.Scatter(
                x=x_values,
                y=series['value'],
                name=series['name']
            ))

    # Add layout
    fig.update_layout(
        title=data.get('title', 'Chart'),
        xaxis_title=data.get('x_label', 'X'),
        yaxis_title=data.get('y_label', 'Y')
    )

    execution_time = (time.time() - start_time) * 1000

    return {
        'chart': fig.to_json(),
        'metadata': {
            'original_points': data_size,
            'rendered_points': target_size,
            'sampling_applied': data_size > DataProcessor.THRESHOLD_SMALL,
            'execution_time_ms': execution_time
        }
    }
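To make the thresholds concrete, here is the square-root scaling from `calculate_optimal_sample_size` evaluated standalone (the same formula, restated outside the class):

```python
import math

def optimal_sample_size(data_size: int, max_points: int = 2000) -> int:
    """sqrt-scaled sample size, capped at max_points and floored at 500."""
    if data_size <= 1000:   # THRESHOLD_SMALL: return data unsampled
        return data_size
    return max(min(max_points, int(math.sqrt(data_size) * 10)), 500)

for n in (500, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} -> {optimal_sample_size(n)}")
# 500 -> 500, 10,000 -> 1,000, 100,000 -> 2,000, 1,000,000 -> 2,000
```

Because sample size grows with the square root of the input, even a million-point series is capped at the 2,000-point ceiling, which is roughly the horizontal pixel budget of a typical chart.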

Implementation Details

Adaptive Sampling Strategy

We chose sampling method based on data characteristics:

def select_sampling_method(data: dict) -> str:
    """Choose optimal sampling method for data type"""

    # Check if data is time series
    x_values = data['x']['value']

    if all(isinstance(x, (int, float)) for x in x_values):
        # Numeric X - check if sorted (time series)
        if x_values == sorted(x_values):
            return 'lttb'  # Best for time series

    # Check for high variance
    y_values = data['y'][0]['value']
    variance = np.var(y_values)
    mean = np.mean(y_values)

    if variance / (mean + 1) > 2:  # High variance
        return 'minmax'  # Preserve peaks and valleys

    return 'lttb'  # Default to LTTB
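The variance-to-mean heuristic above can be sanity-checked in isolation. A small sketch (the example series here are invented for illustration):

```python
import numpy as np

def pick_method(y):
    """Variance-to-mean heuristic from select_sampling_method above."""
    return 'minmax' if np.var(y) / (np.mean(y) + 1) > 2 else 'lttb'

smooth = [10.0 + 0.1 * i for i in range(100)]   # gentle ramp: low variance
spiky = [0.0, 100.0] * 50                        # alternating spikes: high variance
print(pick_method(smooth), pick_method(spiky))   # lttb minmax
```

Smooth series keep LTTB's shape-preserving behavior, while spiky series switch to min/max sampling so peaks and valleys survive downsampling.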

Batch Processing for Multiple Charts

We processed multiple chart requests in parallel:

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def generate_charts_batch(requests: List[dict]) -> List[dict]:
    """Generate multiple charts in parallel"""

    with ThreadPoolExecutor(max_workers=5) as executor:
        # get_event_loop() is deprecated inside coroutines; use the running loop
        loop = asyncio.get_running_loop()


        tasks = [
            loop.run_in_executor(executor, generate_optimized_chart, req)
            for req in requests
        ]

        results = await asyncio.gather(*tasks)

    return results

Client-Side Streaming

For very large datasets, we implemented streaming:

def generate_chart_streaming(data: dict):
    """Stream chart data in chunks"""

    # Generate chart metadata first
    metadata = {
        'title': data.get('title'),
        'axes': {
            'x': data.get('x_label'),
            'y': data.get('y_label')
        },
        'total_points': len(data['x']['value'])
    }

    yield json.dumps({'type': 'metadata', 'data': metadata}) + '\n'

    # Stream data points in batches of 1000
    batch_size = 1000
    x_values = data['x']['value']

    for i in range(0, len(x_values), batch_size):
        batch = {
            'x': x_values[i:i + batch_size],
            'y': [
                s['value'][i:i + batch_size]
                for s in data['y']
            ]
        }

        yield json.dumps({'type': 'data', 'data': batch}) + '\n'

    yield json.dumps({'type': 'complete'}) + '\n'
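On the receiving side, the NDJSON stream is reassembled line by line. A minimal consumer sketch, with a stub producer that mimics the message types emitted by `generate_chart_streaming`:

```python
import json

def stream_stub():
    """Stub producer emitting the same message types as generate_chart_streaming."""
    yield json.dumps({'type': 'metadata', 'data': {'title': 'Demo', 'total_points': 5}}) + '\n'
    yield json.dumps({'type': 'data', 'data': {'x': [1, 2, 3], 'y': [[10, 20, 30]]}}) + '\n'
    yield json.dumps({'type': 'data', 'data': {'x': [4, 5], 'y': [[40, 50]]}}) + '\n'
    yield json.dumps({'type': 'complete'}) + '\n'

def consume(lines):
    """Accumulate metadata and data batches from an NDJSON stream."""
    meta, xs = None, []
    for line in lines:
        msg = json.loads(line)
        if msg['type'] == 'metadata':
            meta = msg['data']
        elif msg['type'] == 'data':
            xs.extend(msg['data']['x'])
    return meta, xs

meta, xs = consume(stream_stub())
print(meta['total_points'], len(xs))   # 5 5
```

Sending metadata first lets the client draw axes and a progress indicator before any data arrives.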

Impact and Results

After implementing optimizations:

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Avg response time (10K points) | 8.5s | 1.2s | 86% faster |
| Max dataset size | 5,000 | 1,000,000 | 200x increase |
| Timeout rate | 23% | 0% | 100% reduction |
| Memory usage (50K points) | 2.1GB | 380MB | 82% reduction |
| Response size (50K points) | 12MB | 450KB | 96% reduction |
| Lambda cost per 10K requests | $24 | $6 | 75% reduction |

Performance by dataset size:

  • 1,000 points: 180ms (was 210ms)
  • 10,000 points: 1.2s (was 8.5s)
  • 50,000 points: 2.3s (was timeout)
  • 100,000 points: 3.1s (was timeout)

Lessons Learned

  1. Sample Intelligently: Use LTTB or similar algorithms to preserve visual characteristics
  2. Set Thresholds: Different data sizes need different optimization strategies
  3. Measure Everything: Track execution time, memory, and response size
  4. Stream When Possible: For very large datasets, stream results to client
  5. Optimize for Perception: 2,000 points looks identical to 100,000 on most charts

Performance optimization isn't about making code faster—it's about delivering the same user experience with fewer resources. Smart sampling algorithms like LTTB allow us to visualize millions of points while sending only thousands to the client.