URL Encoding Issues: When File Names with Spaces Break S3 Access

·wsi-processor

Key Takeaway

Our WSI processor failed to access S3 files with special characters (spaces, parentheses, unicode) in their names because we didn't properly URL-encode object keys. Implementing urllib.parse.quote fixed 100% of "file not found" errors for files with special characters.

The Problem

We used raw file names as S3 keys without encoding:

def download_from_s3(bucket, key):
    # Key contains spaces/special chars: "Slide (1) - Patient A.svs"
    # If the key doesn't exactly match the stored object key
    # (e.g. it arrived URL-encoded from an event), the request 404s
    s3_client.download_file(bucket, key, '/tmp/image.svs')  # Fails!

Issues:

  1. 404 Errors: Files with spaces not found
  2. Invalid URL: Special characters broke S3 URLs
  3. Filename Mismatch: Slide (1).svs vs Slide%20%281%29.svs
  4. Unicode Failures: Non-ASCII characters crashed
  5. Inconsistent Behavior: Worked in console, failed in API
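The mismatch in point 3 is easy to reproduce with urllib.parse alone (the file name below is a hypothetical example):

```python
from urllib.parse import quote, unquote_plus

key = "Slide (1) - Patient A.svs"

# Percent-encoding is what the key looks like inside a URL
encoded = quote(key, safe='/')
print(encoded)  # Slide%20%281%29%20-%20Patient%20A.svs

# S3 event notifications go further and encode spaces as '+';
# unquote_plus reverses both encodings
event_style = encoded.replace('%20', '+')
print(unquote_plus(event_style))  # Slide (1) - Patient A.svs
```

The round trip only works if encode and decode agree on how spaces are represented, which is exactly where our processor went wrong.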

The Solution

import logging
import os
from urllib.parse import quote, unquote_plus

import boto3

logger = logging.getLogger(__name__)

def encode_s3_key(key: str) -> str:
    """Properly encode S3 key for URL usage"""
    # quote() percent-encodes special characters
    # safe='/' preserves path separators
    return quote(key, safe='/')

def decode_s3_key(encoded_key: str) -> str:
    """Decode S3 key from URL encoding.

    S3 event notifications encode spaces as '+', so unquote_plus
    (not plain unquote) is needed to recover the original key.
    """
    return unquote_plus(encoded_key)

def download_from_s3(bucket: str, key: str, local_path: str):
    """Download file from S3 with proper key encoding"""

    # S3 client handles encoding internally
    # But we need to ensure key is properly formatted
    s3_client = boto3.client('s3')

    try:
        logger.info(f"Downloading s3://{bucket}/{key}")

        s3_client.download_file(
            Bucket=bucket,
            Key=key,  # boto3 handles encoding
            Filename=local_path
        )

        logger.info(f"Downloaded to {local_path}")

    except s3_client.exceptions.ClientError as e:
        # download_file raises ClientError with code '404'
        # (not NoSuchKey) when the object is missing
        if e.response['Error']['Code'] not in ('404', 'NoSuchKey'):
            raise

        logger.error(f"File not found: {key}")

        # Try again with a manually encoded key
        encoded_key = encode_s3_key(key)
        logger.info(f"Retrying with encoded key: {encoded_key}")

        s3_client.download_file(
            Bucket=bucket,
            Key=encoded_key,
            Filename=local_path
        )

def lambda_handler(event, context):
    """Handle S3 events with proper URL decoding"""

    for record in event['Records']:
        # S3 event keys arrive URL-encoded (spaces become '+')
        encoded_key = record['s3']['object']['key']

        # Decode to get actual filename
        actual_key = decode_s3_key(encoded_key)

        logger.info(f"Processing: {actual_key}")
        logger.info(f"Encoded as: {encoded_key}")

        bucket = record['s3']['bucket']['name']

        # Download using decoded key
        local_path = f"/tmp/{os.path.basename(actual_key)}"
        download_from_s3(bucket, actual_key, local_path)

        # Process file
        process_wsi(local_path)
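For reference, a trimmed S3 event record looks like this (the bucket and key values here are made up); note that the key field arrives URL-encoded, with spaces as '+':

```python
from urllib.parse import unquote_plus

# Minimal sketch of one record from an S3 event payload;
# bucket name and key are hypothetical
record = {
    "s3": {
        "bucket": {"name": "wsi-uploads"},
        "object": {"key": "Slide+%281%29+-+Patient+A.svs"},
    }
}

# unquote_plus recovers the real object key
actual_key = unquote_plus(record["s3"]["object"]["key"])
print(actual_key)  # Slide (1) - Patient A.svs
```

Passing the encoded key straight to boto3 is what produced our 404s: boto3 treats the key as a literal string, so it looked for an object literally named with `+` and `%28` in it.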

Implementation Details

Handling Different Character Types

import unicodedata

def sanitize_filename(filename: str) -> str:
    """Sanitize filename for S3 compatibility"""

    # Normalize unicode
    filename = unicodedata.normalize('NFKD', filename)

    # Replace problematic characters
    replacements = {
        ' ': '_',
        '(': '',
        ')': '',
        '[': '',
        ']': '',
        '{': '',
        '}': '',
        '#': '',
        '%': '',
        '&': '',
        '+': '',
        '?': '',
    }

    for old, new in replacements.items():
        filename = filename.replace(old, new)

    return filename
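Applied to the example file name from earlier, the replacement table yields the following (a quick inline sketch with the mapping copied from sanitize_filename):

```python
import unicodedata

# Same replacement table as sanitize_filename above
replacements = {
    ' ': '_', '(': '', ')': '', '[': '', ']': '',
    '{': '', '}': '', '#': '', '%': '', '&': '', '+': '', '?': '',
}

name = unicodedata.normalize('NFKD', "Slide (1) - Patient A.svs")
for old, new in replacements.items():
    name = name.replace(old, new)

print(name)  # Slide_1_-_Patient_A.svs
```

Sanitizing on upload like this sidesteps encoding entirely for new files; encoding/decoding is still needed for objects that already exist with special characters in their keys.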

Testing Special Characters

import pytest

@pytest.mark.parametrize("filename,expected", [
    ("simple.svs", "simple.svs"),
    ("with space.svs", "with%20space.svs"),
    ("with(parens).svs", "with%28parens%29.svs"),
    ("with[brackets].svs", "with%5Bbrackets%5D.svs"),
    ("with#hash.svs", "with%23hash.svs"),
    ("日本語.svs", "%E6%97%A5%E6%9C%AC%E8%AA%9E.svs"),
])
def test_url_encoding(filename, expected):
    """Test URL encoding for various special characters"""
    encoded = encode_s3_key(filename)
    assert encoded == expected

    # Test round-trip
    decoded = decode_s3_key(encoded)
    assert decoded == filename

Impact and Results

| Metric | Before | After |
|--------|--------|-------|
| Special character failures | 45/week | 0 |
| File not found errors | 12% | 0.1% |
| Support tickets | 18/week | 1/week |

Lessons Learned

  1. Always URL Encode: S3 keys with special characters need encoding
  2. Boto3 Helps: boto3 client handles some encoding automatically
  3. Decode S3 Events: Event notification keys are URL-encoded, with spaces as '+'
  4. Sanitize Filenames: Prevent issues by normalizing on upload
  5. Test Special Characters: Include unicode, spaces, and symbols in tests