The PDF Nightmare: When Multiple Systems Must Align
Key Takeaway
Clinical report PDF generation failed because four things were misconfigured at once: the wkhtmltopdf binary path, IAM permissions, URL encoding, and the choice of PDF package. Fixing it required debugging infrastructure, dependencies, networking, and security together rather than one layer at a time.
The Problem
Users who requested clinical PDF reports received HTTP 500 errors. Investigation revealed four independent failures, plus missing error reporting that hid them:
- Missing Binary Path: wkhtmltopdf binary couldn't be found by Lambda
- IAM Permission Denied: Lambda couldn't write PDFs to S3
- URL Encoding Issues: Special characters in report URLs broke generation
- Wrong Package: Selected HTML-to-PDF library didn't work in Lambda environment
- Silent Failures: Errors weren't properly logged or surfaced to users
Each issue masked the next, creating a frustrating debugging cycle where fixing one problem revealed another beneath it.
Context and Background
Our platform generates clinical reports summarizing spatial analysis results. The workflow:
1. User requests report
2. Backend generates HTML report with charts/images
3. HTML converted to PDF using wkhtmltopdf
4. PDF uploaded to S3
5. User receives download link
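
Because each failure masked the next, it helps to know which step of the workflow above actually failed. The following is a minimal sketch, not the platform's real code: `StageError`, `run_stages`, and the three stand-in stage functions are hypothetical names introduced here only to illustrate tagging an exception with the stage that raised it.

```python
class StageError(Exception):
    """Wraps a failure with the name of the workflow stage that raised it."""

    def __init__(self, stage, cause):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage
        self.cause = cause


def run_stages(stages, payload):
    """Run (name, function) pairs in order, feeding each result to the next."""
    for name, func in stages:
        try:
            payload = func(payload)
        except Exception as exc:
            # Re-raise with the stage name so logs point at the failing step
            raise StageError(name, exc) from exc
    return payload


# Hypothetical stand-ins for the real stages (steps 2 to 4 of the workflow)
def render_html(report_id):
    return f"<html><body>Report {report_id}</body></html>"


def convert_to_pdf(html):
    return html.encode("utf-8")  # placeholder bytes, not a real PDF


def upload(pdf_bytes):
    return f"s3://example-bucket/reports/{len(pdf_bytes)}.pdf"


STAGES = [
    ("render_html", render_html),
    ("convert_to_pdf", convert_to_pdf),
    ("upload", upload),
]
```

Calling `run_stages(STAGES, report_id)` either returns the final result or raises a `StageError` whose `stage` attribute names the step to investigate first.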
The PDF generation ran in a Lambda function with tight constraints:
- 3GB memory limit
- 15-minute timeout
- Read-only filesystem except /tmp
- No native package management
- Restricted network access
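
Two of these constraints can be checked cheaply at the start of an invocation. The sketch below is illustrative: `writable_scratch_dir` and `has_time_budget` are hypothetical helpers, and the only Lambda API it relies on is the context object's `get_remaining_time_in_millis()` method.

```python
import tempfile


def writable_scratch_dir(candidates=("/tmp",)):
    """Return the first candidate directory we can actually write to, else None."""
    for directory in candidates:
        try:
            # Creating (and auto-deleting) a temp file proves write access
            with tempfile.NamedTemporaryFile(dir=directory):
                return directory
        except OSError:
            continue
    return None


def has_time_budget(context, needed_ms):
    """True if the invocation has at least needed_ms left before the timeout."""
    return context.get_remaining_time_in_millis() >= needed_ms
```

Failing fast on either check gives a specific log line instead of a generic error part-way through PDF generation.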
The Solution
Each of the four issues needed its own fix:
Problem 1: Binary Path Configuration
Issue: wkhtmltopdf not found at expected path
    # Original broken code
    def generate_pdf(html_content):
        config = pdfkit.configuration()  # Uses default path
        pdfkit.from_string(html_content, 'output.pdf', configuration=config)
Error: OSError: No wkhtmltopdf executable found
Fix: Configure explicit path from environment variable
    import os

    import pdfkit


    class ConfigurationError(Exception):
        """Raised when required configuration is missing or invalid."""


    def generate_pdf(html_content, output_path):
        """
        Generate PDF from HTML using wkhtmltopdf.

        Environment Variables:
            WKHTMLTOPDF_PATH: Path to wkhtmltopdf binary (required)
        """
        wkhtmltopdf_path = os.environ.get('WKHTMLTOPDF_PATH')
        if not wkhtmltopdf_path:
            raise ConfigurationError("WKHTMLTOPDF_PATH environment variable not set")
        if not os.path.exists(wkhtmltopdf_path):
            raise ConfigurationError(f"wkhtmltopdf not found at {wkhtmltopdf_path}")

        # Configure pdfkit with explicit path
        config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)

        # Generate PDF (output_path must be under /tmp in Lambda)
        pdfkit.from_string(
            html_content,
            output_path,
            configuration=config,
            options={
                'enable-local-file-access': None,  # Required for images
                'quiet': None
            }
        )
        return output_path
Deployment: Added binary layer to Lambda
    # serverless.yml
    functions:
      generatePDF:
        handler: handler.generate_pdf
        layers:
          - arn:aws:lambda:us-east-1:123456789:layer:wkhtmltopdf:1
        environment:
          WKHTMLTOPDF_PATH: /opt/bin/wkhtmltopdf
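
A file existing at the configured path does not guarantee it runs: the layer may have lost its executable bit, or the binary may depend on shared libraries that are missing. A minimal sketch of a startup check, assuming a hypothetical helper named `check_binary` that runs the binary with `--version` (a flag wkhtmltopdf supports):

```python
import os
import subprocess


def check_binary(path, timeout=10.0):
    """Return the first line of `<path> --version`, or raise with a specific reason."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no file at {path}")
    if not os.access(path, os.X_OK):
        raise PermissionError(f"{path} is not executable")
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # A present binary can still fail, for example on missing shared libraries
        raise RuntimeError(
            f"{path} exited with {result.returncode}: {result.stderr.strip()}"
        )
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""
```

Logging the returned version line once per cold start, for example `check_binary(os.environ['WKHTMLTOPDF_PATH'])`, also records which layer version is actually deployed.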
Problem 2: IAM Permissions
Issue: Lambda couldn't write to S3
    def upload_pdf_to_s3(pdf_path, bucket, key):
        s3_client.upload_file(pdf_path, bucket, key)
Error: ClientError: Access Denied
Fix: Added comprehensive S3 permissions to Lambda role
    # serverless.yml IAM configuration
    iamRoleStatements:
      - Effect: Allow
        Action:
          - s3:PutObject
          - s3:PutObjectAcl
          - s3:GetObject
        Resource:
          - arn:aws:s3:::${self:custom.reportsBucket}/*
      - Effect: Allow
        Action:
          - s3:ListBucket
        Resource:
          - arn:aws:s3:::${self:custom.reportsBucket}
Validation: Added a bucket access check before upload

    from botocore.exceptions import ClientError

    def validate_s3_permissions(bucket, key):
        """Verify Lambda can reach the bucket before attempting upload.

        head_bucket confirms the bucket exists and the role may access it.
        It does not prove that s3:PutObject is granted for `key`.
        """
        try:
            # Cheap reachability check; nothing is written
            s3_client.head_bucket(Bucket=bucket)
            return True
        except ClientError as e:
            error_code = e.response['Error']['Code']
            if error_code in ['403', 'AccessDenied']:
                raise PermissionError(f"Lambda lacks S3 permissions for bucket: {bucket}") from e
            raise
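
Since `head_bucket` only shows that the bucket is reachable, a write probe is one way to confirm `s3:PutObject` itself. The sketch below is an assumption-laden illustration: `probe_write_access` and the probe key are hypothetical, and the function takes the client as a parameter so it can be exercised with a stub. It reads the error code from the exception's `response` attribute, which is where botocore's `ClientError` stores it.

```python
def probe_write_access(s3_client, bucket, probe_key="reports/.write-probe"):
    """Attempt a zero-byte put_object; True on success, False on access denial."""
    try:
        s3_client.put_object(Bucket=bucket, Key=probe_key, Body=b"")
        return True
    except Exception as exc:
        # botocore's ClientError carries the service error in exc.response
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code", "")
        if code in ("403", "AccessDenied"):
            return False
        raise  # anything else (network, missing bucket) is not a permission issue
```

The probe object is left in place because the role shown above has no `s3:DeleteObject` permission; a fixed key means it is overwritten rather than accumulated.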
Problem 3: URL Encoding
Issue: URLs with special characters broke PDF generation
    # Original code - double encoding
    def generate_report_url(user_id, report_name):
        encoded_name = urllib.parse.quote(report_name)  # First encoding
        url = f"https://api.example.com/reports/{encoded_name}"
        return url

    # Later, URL passed to pdfkit
    pdfkit.from_url(url, 'output.pdf')  # URL gets encoded again internally!
For report name "Tissue Sample #42 (2024)":
- First encoding: Tissue%20Sample%20%2342%20%282024%29
- Second encoding: Tissue%2520Sample%2520%252342%2520%25282024%2529
Fix: Use proper URL handling with iri_to_uri
    import os
    import urllib.parse

    import pdfkit
    from werkzeug.urls import iri_to_uri

    def generate_report_url(user_id, report_name):
        """
        Generate properly encoded URL for PDF generation.
        Uses iri_to_uri to handle Unicode and special characters
        without double-encoding.
        """
        # Create IRI (Internationalized Resource Identifier)
        iri = f"https://api.example.com/reports/{user_id}/{report_name}"
        # Convert to URI with proper encoding.
        # Caveat: a literal '#' or '?' in report_name is read as a fragment or
        # query delimiter here, so such names need segment-level quoting first.
        # Note: safe_conversion is only accepted by older Werkzeug releases;
        # if your version rejects it, call iri_to_uri(iri) instead.
        uri = iri_to_uri(iri, safe_conversion=True)
        return uri

    def generate_pdf_from_url(url):
        """Generate PDF, handling URL encoding correctly"""
        # Verify URL is properly formed
        parsed = urllib.parse.urlparse(url)
        if not parsed.scheme or not parsed.netloc:
            raise ValueError(f"Invalid URL: {url}")

        # pdfkit handles encoding internally, pass as-is
        config = pdfkit.configuration(wkhtmltopdf=os.environ['WKHTMLTOPDF_PATH'])
        pdfkit.from_url(
            url,
            '/tmp/output.pdf',
            configuration=config,
            options={
                'enable-local-file-access': None,
                'load-error-handling': 'ignore',  # Handle broken links gracefully
                'quiet': None
            }
        )
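
For names such as "Tissue Sample #42 (2024)", where "#" would otherwise be parsed as a fragment delimiter, one standard-library option is to percent-encode each path segment exactly once and to guard against accidental re-encoding. This is a sketch rather than the fix the team shipped: `build_report_url` and `looks_double_encoded` are hypothetical helpers, and the double-encoding check is a heuristic.

```python
import re
from urllib.parse import quote

# '%25' followed by two hex digits is what an already-encoded '%XX' becomes
_DOUBLE_ENCODED = re.compile(r"%25[0-9A-Fa-f]{2}")


def build_report_url(base, *segments):
    """Join path segments onto base, percent-encoding each segment exactly once."""
    encoded = [quote(segment, safe="") for segment in segments]
    return "/".join([base.rstrip("/"), *encoded])


def looks_double_encoded(url):
    """Heuristic check for a URL that has been percent-encoded twice."""
    return bool(_DOUBLE_ENCODED.search(url))
```

For the report name above, `build_report_url("https://api.example.com/reports", "user-1", name)` produces a path ending in `Tissue%20Sample%20%2342%20%282024%29`. The heuristic can misfire on a name that legitimately contains text like "%20", so treat a positive result as a prompt to inspect the URL rather than as proof.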
Problem 4: Package Selection
Issue: Initial package (WeasyPrint) didn't work in Lambda
    # Original attempt with WeasyPrint
    from weasyprint import HTML

    def generate_pdf_weasyprint(html_content):
        HTML(string=html_content).write_pdf('output.pdf')
Error: OSError: cannot load library 'gobject-2.0'
WeasyPrint requires system libraries (cairo, pango, gobject) that aren't available in Lambda's stripped-down environment.
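
A quick way to see this kind of gap before deploying is to ask the system loader which shared libraries it can find. The sketch below uses `ctypes.util.find_library` from the standard library; `missing_libraries` is a hypothetical helper and the library names are illustrative, taken from the error message and the dependencies named above.

```python
from ctypes.util import find_library


def missing_libraries(names):
    """Return the subset of shared-library names the system loader cannot locate."""
    return [name for name in names if find_library(name) is None]


# Illustrative names based on the error above; exact names vary by version
WEASYPRINT_LIBS = ["gobject-2.0", "pango-1.0", "cairo"]
```

Running `missing_libraries(WEASYPRINT_LIBS)` inside a container image that mirrors the Lambda runtime gives an early hint. Note that `find_library` depends on tools such as `ldconfig`, which minimal images may lack, so a "missing" result is a signal to investigate rather than a definitive answer.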
Fix: Switch to wkhtmltopdf with proper layer packaging
| Library | Pros | Cons | Lambda Support |
|---------|------|------|----------------|
| WeasyPrint | Clean API, CSS support | Heavy dependencies | ❌ Difficult |
| ReportLab | Full PDF control | Complex API, no HTML | ✅ Works |
| xhtml2pdf | Pure Python | Limited CSS support | ✅ Works |
| wkhtmltopdf | Excellent rendering | External binary required | ✅ Works with layer |
Selected wkhtmltopdf because:
- Best HTML/CSS rendering quality
- Available as pre-compiled Lambda layer
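- Widely packaged and documented (note: the upstream wkhtmltopdf repository has since been archived, so verify its maintenance status before adopting it for new work)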
- Production-proven
Implementation Details
Complete PDF Generation Pipeline
    import logging
    import os
    import tempfile
    from typing import Optional

    import boto3
    import pdfkit
    from botocore.exceptions import ClientError

    logger = logging.getLogger(__name__)
    s3_client = boto3.client('s3')


    class ConfigurationError(Exception):
        """Raised when required configuration is missing"""


    class PDFGenerationError(Exception):
        """Raised when PDF generation fails"""


    class S3UploadError(Exception):
        """Raised when S3 upload fails"""


    class PDFGenerator:
        def __init__(self):
            self.wkhtmltopdf_path = os.environ.get('WKHTMLTOPDF_PATH')
            self.s3_bucket = os.environ.get('REPORTS_BUCKET')
            if not self.wkhtmltopdf_path:
                raise ConfigurationError("WKHTMLTOPDF_PATH not configured")
            if not self.s3_bucket:
                raise ConfigurationError("REPORTS_BUCKET not configured")
            self.config = pdfkit.configuration(wkhtmltopdf=self.wkhtmltopdf_path)

        def generate_from_html(self, html_content: str, report_id: str) -> str:
            """
            Generate PDF from HTML content and upload to S3.

            Args:
                html_content: HTML string to convert
                report_id: Unique identifier for the report

            Returns:
                Presigned S3 URL of generated PDF

            Raises:
                PDFGenerationError: If PDF generation fails
                S3UploadError: If S3 upload fails
            """
            pdf_path: Optional[str] = None
            try:
                # Reserve a temporary file name under /tmp (the only writable path)
                with tempfile.NamedTemporaryFile(
                    delete=False,
                    suffix='.pdf',
                    dir='/tmp'
                ) as tmp_file:
                    pdf_path = tmp_file.name

                # Generate PDF
                logger.info(f"Generating PDF for report {report_id}")
                pdfkit.from_string(
                    html_content,
                    pdf_path,
                    configuration=self.config,
                    options=self._get_pdf_options()
                )

                # Validate PDF was created
                if not os.path.exists(pdf_path) or os.path.getsize(pdf_path) == 0:
                    raise PDFGenerationError("PDF file empty or not created")

                # Upload to S3
                s3_key = f"reports/{report_id}.pdf"
                s3_url = self._upload_to_s3(pdf_path, s3_key)
                logger.info(f"Successfully generated PDF for report {report_id}")
                return s3_url
            except (PDFGenerationError, S3UploadError) as e:
                # Keep the specific error type so the matching handler runs
                logger.error(f"Failed to generate PDF for report {report_id}: {e}")
                raise
            except Exception as e:
                logger.error(f"Failed to generate PDF for report {report_id}: {e}")
                raise PDFGenerationError(f"PDF generation failed: {e}") from e
            finally:
                # Cleanup temporary file
                if pdf_path and os.path.exists(pdf_path):
                    try:
                        os.remove(pdf_path)
                    except Exception as e:
                        logger.warning(f"Failed to cleanup temporary PDF: {e}")

        def _get_pdf_options(self) -> dict:
            """Configure wkhtmltopdf options"""
            return {
                'page-size': 'A4',
                'margin-top': '0.75in',
                'margin-right': '0.75in',
                'margin-bottom': '0.75in',
                'margin-left': '0.75in',
                'encoding': 'UTF-8',
                'enable-local-file-access': None,
                'no-outline': None,
                'quiet': None
            }

        def _upload_to_s3(self, file_path: str, s3_key: str) -> str:
            """Upload PDF to S3 and return a presigned URL"""
            try:
                s3_client.upload_file(
                    file_path,
                    self.s3_bucket,
                    s3_key,
                    ExtraArgs={
                        'ContentType': 'application/pdf',
                        'ContentDisposition': 'inline'  # Display in browser
                    }
                )
                # Generate presigned URL (valid for 1 hour)
                url = s3_client.generate_presigned_url(
                    'get_object',
                    Params={'Bucket': self.s3_bucket, 'Key': s3_key},
                    ExpiresIn=3600
                )
                return url
            except ClientError as e:
                raise S3UploadError(f"Failed to upload PDF to S3: {e}") from e
Error Handling
Comprehensive error handling for debugging:
    # Assumes a Flask application object named `app`
    from flask import jsonify

    @app.errorhandler(PDFGenerationError)
    def handle_pdf_error(e):
        logger.error(f"PDF generation error: {e}", exc_info=True)
        return jsonify({
            'error': 'PDF generation failed',
            'message': 'Unable to generate report PDF. Please try again later.',
            'report_issue': True  # Flag for support team
        }), 500

    @app.errorhandler(S3UploadError)
    def handle_s3_error(e):
        logger.error(f"S3 upload error: {e}", exc_info=True)
        return jsonify({
            'error': 'Upload failed',
            'message': 'Report generated but upload failed. Please contact support.',
            'report_issue': True
        }), 500
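
One way to connect what the user sees to what the logs contain is a correlation ID returned in the error payload. The sketch below is framework-independent and purely illustrative: `error_response`, `USER_MESSAGES`, and the `error_id` field are hypothetical additions, not part of the handlers above.

```python
import logging
import uuid

logger = logging.getLogger(__name__)

# Hypothetical mapping from exception class name to a user-facing message
USER_MESSAGES = {
    "PDFGenerationError": "Unable to generate report PDF. Please try again later.",
    "S3UploadError": "Report generated but upload failed. Please contact support.",
}


def error_response(exc):
    """Log the full exception and return (payload, status) with a correlation ID."""
    error_id = uuid.uuid4().hex
    # Internal details go to the log; the user only receives the ID and a safe message
    logger.error(
        "error_id=%s %s: %s", error_id, type(exc).__name__, exc, exc_info=exc
    )
    payload = {
        "error": type(exc).__name__,
        "message": USER_MESSAGES.get(type(exc).__name__, "Unexpected error."),
        "error_id": error_id,
    }
    return payload, 500
```

A support engineer given the `error_id` from a user report can then search the logs for the matching entry and its stack trace.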
Impact and Results
After fixing all four issues:
- Success Rate: 0% → 98.5%
- Error Visibility: Detailed logs instead of generic 500 errors
- User Satisfaction: From "broken feature" to "works great"
- Support Tickets: Reduced by 95%
- Generation Time: ~8 seconds average
Lessons Learned
- Holistic Debugging: PDF generation requires alignment across multiple systems
- Environment Matters: Libraries that work locally may fail in Lambda
- Test Permissions Early: IAM issues appear late in the development cycle
- URL Encoding is Complex: Understand when encoding happens at each layer
- Layer Management: Binary dependencies need proper packaging for Lambda
PDF generation in serverless environments is complex because it requires coordination between infrastructure (IAM), dependencies (binary layers), networking (URL handling), and application logic. Test each layer independently, then validate the integrated system with real-world data.