The PDF Nightmare: When Multiple Systems Must Align

·backend-core

Key Takeaway

Clinical report PDF generation failed due to misaligned configuration across four systems: binary paths, IAM permissions, URL encoding, and package selection. Fixing PDF generation required holistic debugging across infrastructure, dependencies, networking, and security layers.

The Problem

Users who requested clinical PDF reports received HTTP 500 errors. Investigation revealed a cascade of independent failures, made harder to diagnose by poor error reporting:

  1. Missing Binary Path: wkhtmltopdf binary couldn't be found by Lambda
  2. IAM Permission Denied: Lambda couldn't write PDFs to S3
  3. URL Encoding Issues: Special characters in report URLs broke generation
  4. Wrong Package: Selected HTML-to-PDF library didn't work in Lambda environment
  5. Silent Failures: Errors weren't properly logged or surfaced to users

Each issue masked the next, creating a frustrating debugging cycle where fixing one problem revealed another beneath it.

Context and Background

Our platform generates clinical reports summarizing spatial analysis results. The workflow:

1. User requests report
2. Backend generates HTML report with charts/images
3. HTML converted to PDF using wkhtmltopdf
4. PDF uploaded to S3
5. User receives download link

The PDF generation ran in a Lambda function with tight constraints:

  • 3GB memory limit
  • 15-minute timeout
  • Read-only filesystem except /tmp
  • No native package management
  • Restricted network access

The Solution

We fixed four interrelated issues:

Problem 1: Binary Path Configuration

Issue: wkhtmltopdf not found at expected path

# Original broken code
def generate_pdf(html_content):
    config = pdfkit.configuration()  # Uses default path
    pdfkit.from_string(html_content, 'output.pdf', configuration=config)

Error: OSError: No wkhtmltopdf executable found

Fix: Configure explicit path from environment variable

import os
import pdfkit

class ConfigurationError(Exception):
    """Raised when a required setting is missing or invalid"""

def generate_pdf(html_content, output_path):
    """
    Generate PDF from HTML using wkhtmltopdf.

    Environment Variables:
        WKHTMLTOPDF_PATH: Path to wkhtmltopdf binary (required)
    """
    wkhtmltopdf_path = os.environ.get('WKHTMLTOPDF_PATH')

    if not wkhtmltopdf_path:
        raise ConfigurationError("WKHTMLTOPDF_PATH environment variable not set")

    if not os.path.exists(wkhtmltopdf_path):
        raise ConfigurationError(f"wkhtmltopdf not found at {wkhtmltopdf_path}")

    # Configure pdfkit with explicit path
    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)

    # Generate PDF
    pdfkit.from_string(
        html_content,
        output_path,
        configuration=config,
        options={
            'enable-local-file-access': None,  # Required for images
            'quiet': None
        }
    )

    return output_path

Deployment: Added binary layer to Lambda

# serverless.yml
functions:
  generatePDF:
    handler: handler.generate_pdf
    layers:
      - arn:aws:lambda:us-east-1:123456789:layer:wkhtmltopdf:1
    environment:
      WKHTMLTOPDF_PATH: /opt/bin/wkhtmltopdf

Problem 2: IAM Permissions

Issue: Lambda couldn't write to S3

def upload_pdf_to_s3(pdf_path, bucket, key):
    s3_client.upload_file(pdf_path, bucket, key)

Error: ClientError: Access Denied

Fix: Added comprehensive S3 permissions to Lambda role

# serverless.yml IAM configuration (nested under provider)
provider:
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
        - s3:PutObjectAcl
        - s3:GetObject
      Resource:
        - arn:aws:s3:::${self:custom.reportsBucket}/*
    - Effect: Allow
      Action:
        - s3:ListBucket
      Resource:
        - arn:aws:s3:::${self:custom.reportsBucket}

Validation: Added a bucket access check before upload

from botocore.exceptions import ClientError

def validate_s3_permissions(bucket, key):
    """Verify the Lambda role can reach the bucket before attempting upload"""
    try:
        # head_bucket confirms the bucket exists and is accessible to this role;
        # it does not by itself prove that s3:PutObject is granted
        s3_client.head_bucket(Bucket=bucket)
        return True
    except ClientError as e:
        error_code = e.response['Error']['Code']
        if error_code in ['403', 'AccessDenied']:
            raise PermissionError(f"Lambda lacks S3 permissions for bucket: {bucket}")
        raise
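
To verify write access specifically, one option is to attempt a zero-byte write under a dedicated prefix. The following is a minimal sketch under stated assumptions, not the check the team shipped: `probe_write_access` and the `_preflight/` prefix are hypothetical, and the client is passed in so the function can be exercised without AWS credentials. It inspects the `response` dictionary that botocore's `ClientError` carries.

```python
def probe_write_access(s3_client, bucket, prefix="_preflight/"):
    """Try a zero-byte put_object and report whether it was allowed.

    `s3_client` is expected to behave like a boto3 S3 client. The probe key
    and prefix are hypothetical; pick a prefix your lifecycle rules clean up.
    """
    key = f"{prefix}write-probe"
    try:
        s3_client.put_object(Bucket=bucket, Key=key, Body=b"")
        return True, None
    except Exception as exc:  # botocore raises ClientError with a .response dict
        error = getattr(exc, "response", {}).get("Error", {})
        code = error.get("Code")
        if code in ("AccessDenied", "403"):
            return False, f"role cannot write to s3://{bucket}/{prefix}"
        raise
```

The probe leaves a small object behind, because deleting it would need `s3:DeleteObject`, which the role above does not grant.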

Problem 3: URL Encoding

Issue: URLs with special characters broke PDF generation

# Original code - double encoding
def generate_report_url(user_id, report_name):
    encoded_name = urllib.parse.quote(report_name)  # First encoding
    url = f"https://api.example.com/reports/{encoded_name}"
    return url

# Later, URL passed to pdfkit
pdfkit.from_url(url, 'output.pdf')  # URL gets encoded again internally!

For report name "Tissue Sample #42 (2024)":

  • First encoding: Tissue%20Sample%20%2342%20%282024%29
  • Second encoding: Tissue%2520Sample%2520%252342%2520%25282024%2529
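
The two encodings can be reproduced with the standard library alone. This short sketch only demonstrates the effect of calling `quote` twice; it does not model what any particular PDF library does internally.

```python
from urllib.parse import quote, unquote

name = "Tissue Sample #42 (2024)"

once = quote(name)    # encoded a single time
twice = quote(once)   # every '%' from the first pass becomes '%25'

print(once)
print(twice)

# A single decode recovers the name only from the singly encoded form
assert unquote(once) == name
assert unquote(twice) == once
```

A server that decodes the path once would look up a report literally named `Tissue%20Sample%20%2342%20%282024%29`, which does not exist.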

Fix: Encode the URL exactly once, using iri_to_uri

import os
import urllib.parse

import pdfkit
from werkzeug.urls import iri_to_uri

def generate_report_url(user_id, report_name):
    """
    Generate properly encoded URL for PDF generation.

    Uses iri_to_uri to handle Unicode and special characters
    without double-encoding.
    """
    # Create IRI (Internationalized Resource Identifier)
    iri = f"https://api.example.com/reports/{user_id}/{report_name}"

    # Convert to URI with proper encoding
    # (safe_conversion is only accepted by Werkzeug releases before 3.0)
    uri = iri_to_uri(iri, safe_conversion=True)

    return uri

def generate_pdf_from_url(url):
    """Generate PDF, handling URL encoding correctly"""
    # Verify URL is properly formed
    parsed = urllib.parse.urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url}")

    # pdfkit handles encoding internally, pass as-is
    config = pdfkit.configuration(wkhtmltopdf=os.environ['WKHTMLTOPDF_PATH'])

    pdfkit.from_url(
        url,
        '/tmp/output.pdf',
        configuration=config,
        options={
            'enable-local-file-access': None,
            'load-error-handling': 'ignore',  # Handle broken links gracefully
            'quiet': None
        }
    )
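
One caveat with whole-URL conversion: characters that delimit URL parts, such as `#` and `?`, keep their structural meaning, so a report name containing them needs to be escaped as a path segment before the URL is assembled. The sketch below is a standard-library alternative offered as an assumption, not the approach taken above; `build_report_url` and the `user-1` identifier are hypothetical.

```python
from urllib.parse import quote, urlsplit


def build_report_url(base_url, *segments):
    """Join raw (unencoded) path segments onto base_url, encoding each once.

    safe="" means '/', '#', and '?' inside a segment are escaped too, so a
    report name cannot be mistaken for extra path parts, a query, or a fragment.
    """
    encoded = [quote(str(segment), safe="") for segment in segments]
    return "/".join([base_url.rstrip("/"), *encoded])


url = build_report_url(
    "https://api.example.com/reports", "user-1", "Tissue Sample #42 (2024)"
)
print(url)

# Without escaping, everything after '#' is parsed as a fragment
raw = "https://api.example.com/reports/user-1/Tissue Sample #42 (2024)"
print(urlsplit(raw).fragment)
```

The function expects raw names. Passing it an already encoded name would reintroduce double encoding, so encoding should happen at exactly one boundary in the pipeline.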

Problem 4: Package Selection

Issue: Initial package (WeasyPrint) didn't work in Lambda

# Original attempt with WeasyPrint
from weasyprint import HTML

def generate_pdf_weasyprint(html_content):
    HTML(string=html_content).write_pdf('output.pdf')

Error: OSError: cannot load library 'gobject-2.0'

WeasyPrint requires system libraries (cairo, pango, gobject) that aren't available in Lambda's stripped-down environment.

Fix: Switch to wkhtmltopdf with proper layer packaging

| Library | Pros | Cons | Lambda Support |
|---------|------|------|----------------|
| WeasyPrint | Clean API, CSS support | Heavy dependencies | ❌ Difficult |
| ReportLab | Full PDF control | Complex API, no HTML | ✅ Works |
| xhtml2pdf | Pure Python | Limited CSS support | ✅ Works |
| wkhtmltopdf | Excellent rendering | External binary required | ✅ Works with layer |

Selected wkhtmltopdf because:

  • Best HTML/CSS rendering quality
  • Available as pre-compiled Lambda layer
  • Actively maintained when we evaluated it (the upstream repository has since been archived)
  • Production-proven

Implementation Details

Complete PDF Generation Pipeline

import os
import tempfile
import logging

import boto3
import pdfkit
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
s3_client = boto3.client('s3')

class PDFGenerator:
    def __init__(self):
        self.wkhtmltopdf_path = os.environ.get('WKHTMLTOPDF_PATH')
        self.s3_bucket = os.environ.get('REPORTS_BUCKET')

        if not self.wkhtmltopdf_path:
            raise ConfigurationError("WKHTMLTOPDF_PATH not configured")

        if not self.s3_bucket:
            raise ConfigurationError("REPORTS_BUCKET not configured")

        self.config = pdfkit.configuration(wkhtmltopdf=self.wkhtmltopdf_path)

    def generate_from_html(self, html_content: str, report_id: str) -> str:
        """
        Generate PDF from HTML content and upload to S3.

        Args:
            html_content: HTML string to convert
            report_id: Unique identifier for the report

        Returns:
            S3 URL of generated PDF

        Raises:
            PDFGenerationError: If PDF generation fails
            S3UploadError: If S3 upload fails
        """
        pdf_path = None

        try:
            # Create temporary file for PDF
            with tempfile.NamedTemporaryFile(
                delete=False,
                suffix='.pdf',
                dir='/tmp'
            ) as tmp_file:
                pdf_path = tmp_file.name

            # Generate PDF
            logger.info(f"Generating PDF for report {report_id}")

            pdfkit.from_string(
                html_content,
                pdf_path,
                configuration=self.config,
                options=self._get_pdf_options()
            )

            # Validate PDF was created
            if not os.path.exists(pdf_path) or os.path.getsize(pdf_path) == 0:
                raise PDFGenerationError("PDF file empty or not created")

            # Upload to S3
            s3_key = f"reports/{report_id}.pdf"
            s3_url = self._upload_to_s3(pdf_path, s3_key)

            logger.info(f"Successfully generated PDF for report {report_id}")

            return s3_url

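        except (PDFGenerationError, S3UploadError):
            # Already specific: log and re-raise so the matching handler runs
            logger.error(f"Failed to generate PDF for report {report_id}", exc_info=True)
            raise

        except Exception as e:
            logger.error(f"Failed to generate PDF for report {report_id}: {e}")
            raise PDFGenerationError(f"PDF generation failed: {e}") from e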

        finally:
            # Cleanup temporary file
            if pdf_path and os.path.exists(pdf_path):
                try:
                    os.remove(pdf_path)
                except Exception as e:
                    logger.warning(f"Failed to cleanup temporary PDF: {e}")

    def _get_pdf_options(self) -> dict:
        """Configure wkhtmltopdf options"""
        return {
            'page-size': 'A4',
            'margin-top': '0.75in',
            'margin-right': '0.75in',
            'margin-bottom': '0.75in',
            'margin-left': '0.75in',
            'encoding': 'UTF-8',
            'enable-local-file-access': None,
            'no-outline': None,
            'quiet': None
        }

    def _upload_to_s3(self, file_path: str, s3_key: str) -> str:
        """Upload PDF to S3 and return URL"""
        try:
            s3_client.upload_file(
                file_path,
                self.s3_bucket,
                s3_key,
                ExtraArgs={
                    'ContentType': 'application/pdf',
                    'ContentDisposition': 'inline'  # Display in browser
                }
            )

            # Generate presigned URL (valid for 1 hour)
            url = s3_client.generate_presigned_url(
                'get_object',
                Params={'Bucket': self.s3_bucket, 'Key': s3_key},
                ExpiresIn=3600
            )

            return url

        except ClientError as e:
            raise S3UploadError(f"Failed to upload PDF to S3: {e}") from e


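class ConfigurationError(Exception):
    """Raised when a required environment setting is missing"""
    pass

class PDFGenerationError(Exception):
    """Raised when PDF generation fails"""
    pass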

class S3UploadError(Exception):
    """Raised when S3 upload fails"""
    pass
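
The size check in the pipeline catches empty output, but a non-empty file is not necessarily a PDF. A slightly stronger check is to look for the `%PDF-` header that PDF files begin with. The sketch below is a hypothetical addition (`looks_like_pdf` is not part of the pipeline above) and assumes the generator writes the header at byte 0.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF header bytes."""
    try:
        with open(path, "rb") as handle:
            return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        return False


if __name__ == "__main__":
    # Write a stand-in file so the example runs without wkhtmltopdf installed.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(b"%PDF-1.4\n% stand-in content\n")
    print(looks_like_pdf(tmp.name))
    os.remove(tmp.name)
```

This would slot in next to the existing `os.path.getsize` check, before the upload step.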

Error Handling

Error handlers log the full traceback for debugging and return a safe message to the user:

@app.errorhandler(PDFGenerationError)
def handle_pdf_error(e):
    logger.error(f"PDF generation error: {e}", exc_info=True)

    return jsonify({
        'error': 'PDF generation failed',
        'message': 'Unable to generate report PDF. Please try again later.',
        'report_issue': True  # Flag for support team
    }), 500

@app.errorhandler(S3UploadError)
def handle_s3_error(e):
    logger.error(f"S3 upload error: {e}", exc_info=True)

    return jsonify({
        'error': 'Upload failed',
        'message': 'Report generated but upload failed. Please contact support.',
        'report_issue': True
    }), 500

Impact and Results

After fixing all four issues:

  • Success Rate: 0% → 98.5%
  • Error Visibility: Detailed logs instead of generic 500 errors
  • User Satisfaction: From "broken feature" to "works great"
  • Support Tickets: Reduced by 95%
  • Generation Time: ~8 seconds average

Lessons Learned

  1. Holistic Debugging: PDF generation requires alignment across multiple systems
  2. Environment Matters: Libraries that work locally may fail in Lambda
  3. Test Permissions Early: IAM issues appear late in the development cycle
  4. URL Encoding is Complex: Understand when encoding happens at each layer
  5. Layer Management: Binary dependencies need proper packaging for Lambda

PDF generation in serverless environments is complex because it requires coordination between infrastructure (IAM), dependencies (binary layers), networking (URL handling), and application logic. Test each layer independently, then validate the integrated system with real-world data.
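
Testing each layer independently can be sketched as a preflight runner that executes every check and collects all failures, rather than stopping at the first one. This is an illustrative sketch: `run_preflight` and the stand-in checks are hypothetical, and real checks would call into the binary, S3, and URL code.

```python
def run_preflight(checks):
    """Run every check and collect all failures instead of stopping at the first.

    `checks` maps a layer name to a zero-argument callable that raises on failure.
    """
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def _missing_binary():
    # Stand-in for a real binary check
    raise FileNotFoundError("wkhtmltopdf not found at /opt/bin/wkhtmltopdf")


def _denied():
    # Stand-in for a real S3 write check
    raise PermissionError("cannot write to reports bucket")


report = run_preflight({
    "binary": _missing_binary,
    "s3": _denied,
    "url": lambda: None,  # a passing check
})
for layer, problem in report.items():
    print(f"{layer}: {problem}")
```

Because every layer reports in a single run, one failure can no longer hide the next.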