
Batch Processing at Scale: Solving Memory Issues with Smart Pagination

backend-core


Key Takeaway

Converting annotations to geometric shapes failed when processing large datasets because we tried to load all records into memory at once. Implementing batch processing with 5,000-record chunks reduced memory usage by 80% and enabled processing of millions of annotations without crashes.

The Problem

Our workflow step for converting annotations to circles (later hexagons) crashed when handling projects with large annotation counts. The root causes were:

  1. Memory Overflow: Loading 100,000+ records exhausted Lambda memory limits
  2. Timeout Failures: Processing all records in one operation exceeded function timeouts
  3. Database Strain: Single massive queries locked database connections
  4. No Progress Tracking: Failed operations provided no indication of how far they progressed
  5. All-or-Nothing: Partial failures required complete re-processing
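A back-of-envelope estimate shows why cause 1 alone is fatal. The per-record figure below is an assumption (~80 KB per hydrated ORM object, including geometry payloads), chosen to be consistent with the ~8 GB footprint we observed for 100,000 records; actual sizes depend on the model:

```python
# Back-of-envelope memory estimate for loading everything at once.
# PER_RECORD_BYTES is an assumption (~80 KB per hydrated ORM object);
# actual sizes depend on the model and relationships loaded.

PER_RECORD_BYTES = 80 * 1024

def estimated_memory_gb(record_count, per_record_bytes=PER_RECORD_BYTES):
    """Approximate resident memory for holding record_count rows at once."""
    return record_count * per_record_bytes / 1024 ** 3

# 100,000 records at once lands around 7.6 GB, beyond typical Lambda
# memory limits, while a single 5,000-record batch stays under 0.4 GB.
```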

The original implementation looked like this:

def convert_annotations_to_circles(self, project_id):
    # Load everything at once - dangerous!
    all_annotations = self.annotation_repo.get_all(project_id)

    for annotation in all_annotations:  # Could be 100,000+
        circle = self.create_circle(annotation)
        self.save_circle(circle)

Context and Background

This issue emerged in our workflow engine, which orchestrates multi-step AI inference pipelines. One critical step identifies overlapping annotations and converts them to geometric shapes for spatial analysis.

In development, we tested with sample datasets of 500-1,000 annotations, where the naive approach worked fine. Production datasets containing 50,000-150,000 annotations exposed the fundamental flaw: we hadn't designed for scale.

The Solution

We implemented pagination with configurable batch sizes:

import os

# Configuration: batch size, overridable via environment variable
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "5000"))

def convert_annotations_to_circles(self, project_id):
    offset = 0
    total_processed = 0

    while True:
        # Process in manageable chunks
        batch = self.annotation_repo.get_paginated(
            project_id=project_id,
            limit=BATCH_SIZE,
            offset=offset
        )

        if not batch:
            break  # No more records

        # Process current batch
        for annotation in batch:
            circle = self.create_circle(annotation)
            self.save_circle(circle)

        total_processed += len(batch)
        offset += BATCH_SIZE

        # Optional: Update progress tracking
        self.update_progress(total_processed)

We also created a dedicated pagination utility:

# In /src/core/db/pagination.py
class Pagination:
    @staticmethod
    def paginate_query(query, page_size=5000):
        """Generator that yields query results in batches"""
        offset = 0
        while True:
            batch = query.offset(offset).limit(page_size).all()
            if not batch:
                break
            yield batch
            offset += page_size
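The generator can be consumed with a plain `for` loop. The sketch below shows the consumption pattern in isolation; `MockQuery` is a stand-in for a SQLAlchemy query and implements only the `offset`/`limit`/`all` chain that `paginate_query` relies on:

```python
# MockQuery is a stand-in for a SQLAlchemy query object (an assumption
# for illustration); it supports only the offset/limit/all chain.

class MockQuery:
    def __init__(self, rows, start=0, stop=None):
        self._rows = rows
        self._start = start
        self._stop = stop

    def offset(self, n):
        return MockQuery(self._rows, self._start + n, self._stop)

    def limit(self, n):
        return MockQuery(self._rows, self._start, self._start + n)

    def all(self):
        return self._rows[self._start:self._stop]


class Pagination:
    @staticmethod
    def paginate_query(query, page_size=5000):
        """Generator that yields query results in batches."""
        offset = 0
        while True:
            batch = query.offset(offset).limit(page_size).all()
            if not batch:
                break
            yield batch
            offset += page_size


# 12 rows with page_size=5 yield batches of 5, 5, and 2 rows.
query = MockQuery(list(range(12)))
batches = list(Pagination.paginate_query(query, page_size=5))
```

Because only one batch is materialized per iteration, peak memory is bounded by `page_size` rather than by the total row count.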

Implementation Details

The batch processing solution involved:

  1. Batch Size Tuning: Tested with 1,000, 5,000, and 10,000 record batches

    • 5,000 provided optimal balance of memory usage and query overhead
  2. Progress Tracking: Added checkpoint updates after each batch

    • Enables resumption from last successful batch on failure
  3. Memory Management: Explicitly cleared batch references after processing

    • Python garbage collection freed memory between batches
  4. Database Connection Pooling: Configured SQLAlchemy pool size for batch queries

    • Prevented connection exhaustion under sustained load
  5. Configuration: Made batch size environment-variable configurable

    • Different environments can tune based on available resources
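Items 2 and 5 above can be sketched together as follows. This is a simplified illustration, not our production code: the `BATCH_SIZE` environment variable name and the in-memory `CheckpointStore` (standing in for a persistent checkpoint, e.g. a database row per project) are assumptions:

```python
import os

# Batch size is read from the environment so each deployment can tune it
# to its available memory; the variable name is an assumption.
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "5000"))

class CheckpointStore:
    """Stand-in for a persistent checkpoint (e.g. a DB row per project)."""
    def __init__(self):
        self._offsets = {}

    def get(self, project_id):
        return self._offsets.get(project_id, 0)

    def set(self, project_id, offset):
        self._offsets[project_id] = offset


def process_in_batches(records, project_id, checkpoints, batch_size=BATCH_SIZE):
    """Process records batch by batch, resuming from the last checkpoint."""
    processed = []
    offset = checkpoints.get(project_id)        # resume from last checkpoint
    while offset < len(records):
        batch = records[offset:offset + batch_size]
        processed.extend(batch)                 # stand-in for per-record work
        offset += len(batch)
        checkpoints.set(project_id, offset)     # checkpoint after each batch
    return processed
```

A rerun after a crash picks up at the stored offset, so only unprocessed batches are touched rather than the whole dataset.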

Performance Metrics

Before and after comparison:

| Metric | Before (No Batching) | After (5K Batching) |
|--------|----------------------|---------------------|
| Memory Usage (100K records) | 8GB+ (crash) | ~500MB (stable) |
| Processing Time | N/A (timeout) | 12 minutes |
| Success Rate | 15% | 99.8% |
| Database Connection Time | 45s (one query) | 120s total (24 queries) |
| Resumability | None | Yes |

Impact and Results

After implementing batch processing:

  • Reliability: Eliminated out-of-memory crashes completely
  • Scalability: Successfully processed projects with 250,000+ annotations
  • Predictability: Linear performance scaling with dataset size
  • Resilience: Failed batches can be retried without reprocessing
  • Monitoring: Progress tracking enables better user experience

Lessons Learned

  1. Design for Scale Early: Test with production-scale data during development
  2. Batch Everything: Any operation processing unbounded data needs pagination
  3. Tune Batch Sizes: Optimal size depends on record size and processing complexity
  4. Track Progress: Checkpointing enables resumption and user visibility
  5. Memory is Finite: Never assume you can load entire datasets into memory

Batch processing is not premature optimization; it is essential architecture for scalable systems. The overhead of pagination is negligible compared to the cost of crashes and timeouts.