
Batch Processing at Scale: Solving Memory Issues with Smart Pagination

backend-core


Key Takeaway

Converting annotations to geometric shapes failed when processing large datasets because we tried to load all records into memory at once. Implementing batch processing with 5,000-record chunks reduced memory usage by 80% and enabled processing of millions of annotations without crashes.

The Problem

Our workflow step for converting annotations to circles (later hexagons) crashed when handling projects with large annotation counts. The root causes were:

  1. Memory Overflow: Loading 100,000+ records exhausted Lambda memory limits
  2. Timeout Failures: Processing all records in one operation exceeded function timeouts
  3. Database Strain: Single massive queries locked database connections
  4. No Progress Tracking: Failed operations provided no indication of how far they progressed
  5. All-or-Nothing: Partial failures required complete re-processing
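A back-of-envelope estimate shows why cause 1 alone is fatal. The per-record figure below is an assumption (~80 KB per hydrated ORM object, including geometry payloads), chosen to be consistent with the ~8 GB footprint we observed for 100,000 records; actual sizes depend on the model:

```python
# Back-of-envelope memory estimate for loading everything at once.
# PER_RECORD_BYTES is an assumption (~80 KB per hydrated ORM object);
# actual sizes depend on the model and relationships loaded.

PER_RECORD_BYTES = 80 * 1024

def estimated_memory_gb(record_count, per_record_bytes=PER_RECORD_BYTES):
    """Approximate resident memory for holding record_count rows at once."""
    return record_count * per_record_bytes / 1024 ** 3

# 100,000 records at once lands around 7.6 GB, beyond typical Lambda
# memory limits, while a single 5,000-record batch stays under 0.4 GB.
```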

The original implementation looked like this:

def convert_annotations_to_circles(self, project_id):
    # Load everything at once - dangerous!
    all_annotations = self.annotation_repo.get_all(project_id)

    for annotation in all_annotations:  # Could be 100,000+
        circle = self.create_circle(annotation)
        self.save_circle(circle)

Context and Background

This issue emerged in our workflow engine, which orchestrates multi-step AI inference pipelines. One critical step identifies overlapping annotations and converts them to geometric shapes for spatial analysis.

In development, we tested with sample datasets of 500-1,000 annotations, where the naive approach worked fine. Production datasets containing 50,000-150,000 annotations exposed the fundamental flaw: we hadn't designed for scale.

The Solution

We implemented pagination with configurable batch sizes:

import os

# Configuration: batch size, overridable via environment variable
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "5000"))

def convert_annotations_to_circles(self, project_id):
    offset = 0
    total_processed = 0

    while True:
        # Process in manageable chunks
        batch = self.annotation_repo.get_paginated(
            project_id=project_id,
            limit=BATCH_SIZE,
            offset=offset
        )

        if not batch:
            break  # No more records

        # Process current batch
        for annotation in batch:
            circle = self.create_circle(annotation)
            self.save_circle(circle)

        total_processed += len(batch)
        offset += BATCH_SIZE

        # Optional: Update progress tracking
        self.update_progress(total_processed)

We also created a dedicated pagination utility:

# In /src/core/db/pagination.py
class Pagination:
    @staticmethod
    def paginate_query(query, page_size=5000):
        """Generator that yields query results in batches"""
        offset = 0
        while True:
            batch = query.offset(offset).limit(page_size).all()
            if not batch:
                break
            yield batch
            offset += page_size
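The generator can be consumed with a plain `for` loop. The sketch below shows the consumption pattern in isolation; `MockQuery` is a stand-in for a SQLAlchemy query and implements only the `offset`/`limit`/`all` chain that `paginate_query` relies on:

```python
# MockQuery is a stand-in for a SQLAlchemy query object (an assumption
# for illustration); it supports only the offset/limit/all chain.

class MockQuery:
    def __init__(self, rows, start=0, stop=None):
        self._rows = rows
        self._start = start
        self._stop = stop

    def offset(self, n):
        return MockQuery(self._rows, self._start + n, self._stop)

    def limit(self, n):
        return MockQuery(self._rows, self._start, self._start + n)

    def all(self):
        return self._rows[self._start:self._stop]


class Pagination:
    @staticmethod
    def paginate_query(query, page_size=5000):
        """Generator that yields query results in batches."""
        offset = 0
        while True:
            batch = query.offset(offset).limit(page_size).all()
            if not batch:
                break
            yield batch
            offset += page_size


# 12 rows with page_size=5 yield batches of 5, 5, and 2 rows.
query = MockQuery(list(range(12)))
batches = list(Pagination.paginate_query(query, page_size=5))
```

Because only one batch is materialized per iteration, peak memory is bounded by `page_size` rather than by the total row count.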

Implementation Details

The batch processing solution involved:

  1. Batch Size Tuning: Tested with 1,000, 5,000, and 10,000 record batches

    • 5,000 provided optimal balance of memory usage and query overhead
  2. Progress Tracking: Added checkpoint updates after each batch

    • Enables resumption from last successful batch on failure
  3. Memory Management: Explicitly cleared batch references after processing

    • Python garbage collection freed memory between batches
  4. Database Connection Pooling: Configured SQLAlchemy pool size for batch queries

    • Prevented connection exhaustion under sustained load
  5. Configuration: Made batch size environment-variable configurable

    • Different environments can tune based on available resources
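Items 2 and 5 above can be sketched together as follows. This is a simplified illustration, not our production code: the `BATCH_SIZE` environment variable name and the in-memory `CheckpointStore` (standing in for a persistent checkpoint, e.g. a database row per project) are assumptions:

```python
import os

# Batch size is read from the environment so each deployment can tune it
# to its available memory; the variable name is an assumption.
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "5000"))

class CheckpointStore:
    """Stand-in for a persistent checkpoint (e.g. a DB row per project)."""
    def __init__(self):
        self._offsets = {}

    def get(self, project_id):
        return self._offsets.get(project_id, 0)

    def set(self, project_id, offset):
        self._offsets[project_id] = offset


def process_in_batches(records, project_id, checkpoints, batch_size=BATCH_SIZE):
    """Process records batch by batch, resuming from the last checkpoint."""
    processed = []
    offset = checkpoints.get(project_id)        # resume from last checkpoint
    while offset < len(records):
        batch = records[offset:offset + batch_size]
        processed.extend(batch)                 # stand-in for per-record work
        offset += len(batch)
        checkpoints.set(project_id, offset)     # checkpoint after each batch
    return processed
```

A rerun after a crash picks up at the stored offset, so only unprocessed batches are touched rather than the whole dataset.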

Performance Metrics

Before and after comparison:

| Metric | Before (No Batching) | After (5K Batching) |
|--------|----------------------|---------------------|
| Memory Usage (100K records) | 8GB+ (crash) | ~500MB (stable) |
| Processing Time | N/A (timeout) | 12 minutes |
| Success Rate | 15% | 99.8% |
| Database Connection Time | 45s (one query) | 120s total (24 queries) |
| Resumability | None | Yes |

Impact and Results

After implementing batch processing:

  • Reliability: Eliminated out-of-memory crashes completely
  • Scalability: Successfully processed projects with 250,000+ annotations
  • Predictability: Linear performance scaling with dataset size
  • Resilience: Failed batches can be retried without reprocessing
  • Monitoring: Progress tracking enables better user experience

Lessons Learned

  1. Design for Scale Early: Test with production-scale data during development
  2. Batch Everything: Any operation processing unbounded data needs pagination
  3. Tune Batch Sizes: Optimal size depends on record size and processing complexity
  4. Track Progress: Checkpointing enables resumption and user visibility
  5. Memory is Finite: Never assume you can load entire datasets into memory

Batch processing is not premature optimization; it is essential architecture for scalable systems. The overhead of pagination is negligible compared to the cost of crashes and timeouts.