From Manual to Automatic: Designing Reliable Async Workflows with Schedulers
Key Takeaway
Our workflow execution system required manual intervention to process queued steps, leading to incomplete workflows and user frustration. Implementing a scheduler-based polling system with proper state transitions (QUEUED → RUNNING) enabled fully automatic workflow execution with 99.9% reliability.
The Problem
Our multi-step AI inference workflows weren't executing reliably. After completing one step, the next step would sit idle in the queue until someone manually triggered a refresh. This created five critical issues:
- Manual Intervention Required: Workflows stalled unless engineers manually ran refresh commands
- Inconsistent Execution: Some workflows completed while others stalled, with no discernible pattern
- State Confusion: Steps marked QUEUED were never picked up for execution
- Poor User Experience: Users saw workflows "stuck" at intermediate steps
- Impossible to Scale: Manual workflow management breaks down beyond a few users
Example workflow lifecycle (broken):
1. User starts workflow
2. Step 1 executes successfully
3. Step 2 added to queue with status=QUEUED
4. [SYSTEM STALLS]
5. Step 2 never executes
6. User reports "workflow stuck"
7. Engineer manually runs refresh_jobs
8. Step 2 finally executes
Context and Background
Our workflow engine orchestrates complex AI pipelines:
Workflow: Tissue Analysis
├─ Step 1: Upload Image to S3
├─ Step 2: Generate Deep Zoom Tiles
├─ Step 3: Run AI Model Prediction
├─ Step 4: Post-process Annotations
└─ Step 5: Generate Report
Each step:
- Executes asynchronously (Lambda/ECS)
- Updates its status (PENDING → RUNNING → COMPLETED)
- Triggers the next step on completion
The architecture:
# Original broken implementation
class WorkflowEngine:
    def execute_next_step(self, workflow_id):
        # Get next pending step
        next_step = self.get_next_pending_step(workflow_id)
        if next_step:
            # Add to queue
            self.queue_manager.enqueue(next_step)
            next_step.status = 'QUEUED'  # Status set, but nothing polls!

    def manual_refresh_jobs(self):
        """Had to manually call this to process the queue"""
        queued_steps = self.queue_manager.get_all_queued()
        for step in queued_steps:
            self.execute_step(step)
The fundamental problem: no automatic process was polling the queue.
The Solution
We implemented a three-part solution:
1. Scheduler-Based Job Polling
Added a CloudWatch Events rule to trigger refresh_jobs every minute:
# serverless.yml
functions:
  refreshJobs:
    handler: src/resources/jobs/refresh_jobs.handler
    timeout: 300
    events:
      - schedule:
          rate: rate(1 minute)  # Poll every minute
          enabled: true
2. Improved State Transitions
Enhanced the state machine to properly track execution:
from enum import Enum

class WorkflowStepStatus(Enum):
    PENDING = 'pending'      # Not yet ready
    QUEUED = 'queued'        # In queue, waiting for execution
    RUNNING = 'running'      # Currently executing
    COMPLETED = 'completed'  # Successfully finished
    FAILED = 'failed'        # Execution failed
    ABORTED = 'aborted'      # Manually cancelled
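The enum names the states; the transitions themselves should be enforced before any status update. A minimal sketch of such a guard (VALID_TRANSITIONS and assert_valid_transition are illustrative names, not code from our repository):

# Illustrative sketch: the legal state transitions, checked before any update
VALID_TRANSITIONS = {
    WorkflowStepStatus.PENDING: {WorkflowStepStatus.QUEUED, WorkflowStepStatus.ABORTED},
    WorkflowStepStatus.QUEUED: {WorkflowStepStatus.RUNNING, WorkflowStepStatus.ABORTED},
    WorkflowStepStatus.RUNNING: {WorkflowStepStatus.COMPLETED,
                                 WorkflowStepStatus.FAILED,
                                 WorkflowStepStatus.ABORTED},
    WorkflowStepStatus.FAILED: {WorkflowStepStatus.QUEUED},  # a retry re-enters the queue
}

def assert_valid_transition(current, target):
    """Reject status changes that skip or reverse a state (illustrative guard)."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")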
Updated the queue polling logic:
# In /src/resources/jobs/refresh_jobs.py
def handler(event, context):
    """Automatically triggered every minute by CloudWatch Events"""
    # Get all queued steps
    queued_steps = running_workflow_step_queue.get_all_queued()
    logger.info(f"Found {len(queued_steps)} queued workflow steps")

    for step_data in queued_steps:
        step_id = step_data['step_id']  # bound outside try so the except block can log it
        try:
            # Mark as running BEFORE execution
            workflow_step_repo.update_status(step_id, WorkflowStepStatus.RUNNING)
            # Remove from queue (processed)
            running_workflow_step_queue.dequeue(step_data['queue_id'])
            # Execute the step
            workflow_engine.execute_step(step_id)
            logger.info(f"Successfully executed step {step_id}")
        except Exception as e:
            logger.error(f"Failed to execute step {step_id}: {e}")
            if step_data.get('retry_count', 0) < 3:
                # Requeue and restore QUEUED status so the next poll picks it up
                running_workflow_step_queue.requeue(step_data)
                workflow_step_repo.update_status(step_id, WorkflowStepStatus.QUEUED)
            else:
                # Retries exhausted: mark as failed
                workflow_step_repo.update_status(step_id, WorkflowStepStatus.FAILED)

    return {
        'statusCode': 200,
        'processed': len(queued_steps)
    }
3. Queue Management Service
Created a dedicated queue service:
# In /src/domain/workflows/running_workflow_step_queue.py
import uuid
from datetime import datetime

class RunningWorkflowStepQueue:
    def enqueue(self, workflow_step):
        """Add step to execution queue"""
        queue_item = {
            'queue_id': str(uuid.uuid4()),
            'step_id': workflow_step.id,
            'workflow_id': workflow_step.workflow_id,
            'queued_at': datetime.utcnow().isoformat(),
            'retry_count': 0
        }
        db.session.add(QueueItem(**queue_item))
        db.session.commit()
        # Update step status
        workflow_step.status = WorkflowStepStatus.QUEUED
        db.session.commit()
        logger.info(f"Enqueued workflow step {workflow_step.id}")

    def get_all_queued(self):
        """Get all items ready for processing"""
        return db.session.query(QueueItem)\
            .join(WorkflowStep)\
            .filter(WorkflowStep.status == WorkflowStepStatus.QUEUED)\
            .order_by(QueueItem.queued_at)\
            .all()

    def dequeue(self, queue_id):
        """Remove item from queue after processing"""
        db.session.query(QueueItem)\
            .filter(QueueItem.queue_id == queue_id)\
            .delete()
        db.session.commit()

    def requeue(self, queue_item):
        """Requeue failed item with incremented retry count"""
        queue_item['retry_count'] += 1
        queue_item['queued_at'] = datetime.utcnow().isoformat()
        db.session.add(QueueItem(**queue_item))
        db.session.commit()
Implementation Details
1. Idempotency Protection
Prevent duplicate execution if scheduler triggers multiple times:
def execute_step(self, step_id):
    step = workflow_step_repo.get_by_id(step_id)

    # Check if already running or completed
    if step.status in [WorkflowStepStatus.RUNNING, WorkflowStepStatus.COMPLETED]:
        logger.warning(f"Step {step_id} already {step.status}, skipping")
        return

    # Atomic status update with optimistic locking
    rows_updated = db.session.query(WorkflowStep)\
        .filter(
            WorkflowStep.id == step_id,
            WorkflowStep.status == WorkflowStepStatus.QUEUED
        )\
        .update({'status': WorkflowStepStatus.RUNNING})
    db.session.commit()  # persist the claim so concurrent pollers see it

    if rows_updated == 0:
        logger.warning(f"Step {step_id} state changed, skipping")
        return

    # Proceed with execution
    self._do_execute_step(step)
2. Retry Logic
Implement exponential backoff for transient failures:
def should_retry(step, error):
    """Determine if step should be retried"""
    retry_count = step.retry_count or 0

    # Don't retry certain error types
    if isinstance(error, (ValidationError, InvalidInputError)):
        return False

    # Retry up to 3 times for transient errors
    return retry_count < 3

def calculate_backoff(retry_count):
    """Exponential backoff: 1min, 2min, 4min"""
    return 60 * (2 ** retry_count)
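As written, requeue retries immediately; to honor the backoff, the queue row needs a "retry no earlier than" timestamp that the poller filters on. A minimal sketch of that wiring, assuming a hypothetical not_before field (neither it nor requeue_with_backoff appears in the code above):

from datetime import datetime, timedelta

def requeue_with_backoff(queue, step_data, error):
    """Illustrative: delay the retry instead of retrying immediately."""
    retry_count = step_data.get('retry_count', 0)
    if isinstance(error, (ValidationError, InvalidInputError)) or retry_count >= 3:
        return False  # permanent failure; the caller moves the step to the DLQ
    delay = timedelta(seconds=calculate_backoff(retry_count))
    # Hypothetical column the poller would check before claiming the item
    step_data['not_before'] = (datetime.utcnow() + delay).isoformat()
    queue.requeue(step_data)
    return True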
3. Monitoring and Alerting
Added CloudWatch metrics:
import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_metrics(queued_count, processed_count, failed_count):
    cloudwatch.put_metric_data(
        Namespace='Workflows',
        MetricData=[
            {'MetricName': 'QueuedSteps', 'Value': queued_count, 'Unit': 'Count'},
            {'MetricName': 'ProcessedSteps', 'Value': processed_count, 'Unit': 'Count'},
            {'MetricName': 'FailedSteps', 'Value': failed_count, 'Unit': 'Count'}
        ]
    )
Alert if queue depth exceeds threshold:
if queued_count > 50:
    sns.publish(
        TopicArn=ALERT_TOPIC_ARN,
        Subject='Workflow Queue Backup',
        Message=f'Workflow queue has {queued_count} pending steps. Investigate potential bottleneck.'
    )
4. Dead Letter Queue
Handle repeatedly failing steps:
class DeadLetterQueue:
    def move_to_dlq(self, step, error):
        """Move consistently failing steps to DLQ for investigation"""
        dlq_item = {
            'step_id': step.id,
            'workflow_id': step.workflow_id,
            'error': str(error),
            'retry_count': step.retry_count,
            'moved_at': datetime.utcnow().isoformat()
        }
        db.session.add(DeadLetterQueueItem(**dlq_item))
        # Update step status
        step.status = WorkflowStepStatus.FAILED
        step.error_message = f"Moved to DLQ after {step.retry_count} retries"
        db.session.commit()
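Tying retries and the DLQ together, the failure path in the poller might look like the following sketch (handle_failure and the dead_letter_queue instance are illustrative glue, not code from the handler above):

def handle_failure(step, step_data, error):
    """Illustrative: retry transient failures, dead-letter persistent ones."""
    if should_retry(step, error):
        running_workflow_step_queue.requeue(step_data)
        workflow_step_repo.update_status(step.id, WorkflowStepStatus.QUEUED)
    else:
        dead_letter_queue.move_to_dlq(step, error)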
5. Graceful Shutdown
Handle Lambda timeout gracefully:
def handler_with_timeout_handling(event, context):
    """Wrapper that stops processing before Lambda times out"""
    # Reserve 30 seconds for cleanup
    CLEANUP_BUFFER_MS = 30_000

    queued_steps = running_workflow_step_queue.get_all_queued()
    processed = 0
    for step_data in queued_steps:
        # Stop once less than the cleanup buffer remains
        if context.get_remaining_time_in_millis() < CLEANUP_BUFFER_MS:
            logger.info(f"Approaching timeout, processed {processed} steps")
            break
        execute_step(step_data)
        processed += 1

    return {'statusCode': 200, 'processed': processed}
Performance Metrics
Before and after comparison:
| Metric | Before (Manual) | After (Automated) | Improvement |
|--------|-----------------|-------------------|-------------|
| Workflow Completion Rate | 45% | 99.9% | 122% increase |
| Average Completion Time | N/A (manual) | 8 minutes | Predictable |
| Manual Interventions/Day | 15-20 | 0 | Eliminated |
| Failed Workflows | 55% | 0.1% | 99.8% reduction |
| User Satisfaction | 3.2/5 | 4.7/5 | 47% improvement |
Impact and Results
After implementing scheduler-based execution:
- Reliability: 99.9% of workflows complete without intervention
- Scalability: System handles 500+ concurrent workflows
- User Experience: Workflows complete predictably and automatically
- Operations: Zero manual workflow interventions required
- Observability: Full visibility into queue depth and execution status
Lessons Learned
- Async Systems Need Polling: Queues don't execute themselves—something must poll them
- State Machines are Critical: Clear state transitions prevent execution ambiguity
- Idempotency Matters: Protect against duplicate execution in distributed systems
- Monitoring is Essential: You can't fix what you can't see
- Fail Gracefully: Retry transient failures, move persistent failures to DLQ
Architecture Pattern: Scheduled Queue Processor
This pattern applies beyond workflows:
CloudWatch Events (Scheduler)
↓ (triggers every N minutes)
Processor Lambda
↓ (queries)
Queue Table
↓ (executes)
Worker Lambda/ECS
↓ (updates)
Status Table
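Stripped of the workflow-specific details, the processor is a scheduled function that claims each item atomically and dispatches it. A self-contained sketch with an in-memory stand-in for the queue table (all names here are illustrative, not from our codebase):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('scheduled_processor')

# In-memory stand-in for the queue table; in practice this is a DB table
queue_table = [
    {'id': 1, 'status': 'QUEUED'},
    {'id': 2, 'status': 'QUEUED'},
]

def claim(item):
    """Atomic QUEUED -> RUNNING claim; a DB would use a conditional UPDATE."""
    if item['status'] != 'QUEUED':
        return False
    item['status'] = 'RUNNING'
    return True

def process_queue(worker):
    """One scheduler tick: claim each queued item and hand it to the worker."""
    processed = 0
    for item in list(queue_table):
        if not claim(item):
            continue  # another poller already took it
        try:
            worker(item)
            item['status'] = 'COMPLETED'
            processed += 1
        except Exception as exc:
            item['status'] = 'FAILED'
            logger.error('Item %s failed: %s', item['id'], exc)
    return processed

process_queue(lambda item: logger.info('Working on item %s', item['id']))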
Benefits:
- Simple to implement
- Easy to monitor
- Scales naturally
- Resilient to failures
- Cost-effective (pay per execution)
This architecture enabled us to move from a fragile, manual workflow system to a reliable, self-healing automated pipeline that users can trust.