spatialx

URL Encoding Issues: When File Names with Spaces Break S3 Access

·wsi-processor

Key Takeaway

Our WSI processor failed to access S3 files whose names contained special characters (spaces, parentheses, unicode) because we did not URL-encode and decode object keys consistently. Encoding keys with urllib.parse.quote, and decoding the keys delivered in S3 events, eliminated the "file not found" errors for these files (45 per week down to 0).

The Problem

We used raw file names as S3 keys without encoding:

def download_from_s3(bucket, key):
    # Key contains spaces/special chars: "Slide (1) - Patient A.svs"
    # S3 URL becomes invalid
    s3_client.download_file(bucket, key, '/tmp/image.svs')  # Fails!

Issues:

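  1. 404 Errors: Files with spaces in their names were reported as not found
  2. Invalid URLs: Unencoded special characters produced malformed S3 URLs
  3. Filename Mismatch: Slide (1).svs and Slide%20%281%29.svs were treated as different keys
  4. Unicode Failures: Keys with non-ASCII characters crashed the processor
  5. Inconsistent Behavior: The same files opened in the S3 console but failed through the API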
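
The filename mismatch listed above is easy to reproduce with the standard library alone. The sketch below touches no S3 resources and reuses the sample name from the code comment above. It shows that the percent-encoded form is a different string from the raw key, and that encoding an already-encoded key changes it again, so a lookup with either form misses an object stored under the raw name.

```python
from urllib.parse import quote, unquote

raw_key = "Slide (1) - Patient A.svs"

# Percent-encode once, keeping '/' as a path separator
encoded_once = quote(raw_key, safe="/")

# Encoding an already-encoded key escapes the '%' signs themselves
encoded_twice = quote(encoded_once, safe="/")

print(encoded_once)   # Slide%20%281%29%20-%20Patient%20A.svs
print(encoded_twice)  # Slide%2520%25281%2529%2520-%2520Patient%2520A.svs

# Decoding once recovers the raw key only from the singly encoded form
print(unquote(encoded_once) == raw_key)   # True
print(unquote(encoded_twice) == raw_key)  # False
```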

The Solution

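import logging
import os
from urllib.parse import quote, unquote_plus

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

def encode_s3_key(key: str) -> str:
    """Percent-encode an S3 key for use in a URL"""
    # quote() percent-encodes special characters as UTF-8 bytes
    # safe='/' preserves path separators
    return quote(key, safe='/')

def decode_s3_key(encoded_key: str) -> str:
    """Decode a URL-encoded S3 key, such as one from an S3 event"""
    # S3 events write spaces as '+', which unquote() would leave untouched
    return unquote_plus(encoded_key)

def download_from_s3(bucket: str, key: str, local_path: str):
    """Download file from S3 using the raw (decoded) key"""

    # boto3 encodes the key when it builds the request,
    # so the key passed here must be the decoded one
    s3_client = boto3.client('s3')

    try:
        logger.info(f"Downloading s3://{bucket}/{key}")

        s3_client.download_file(
            Bucket=bucket,
            Key=key,  # boto3 handles encoding
            Filename=local_path
        )

        logger.info(f"Downloaded to {local_path}")

    except ClientError as err:
        # download_file reports a missing object as a ClientError with code 404,
        # not as s3_client.exceptions.NoSuchKey
        if err.response['Error']['Code'] not in ('404', 'NoSuchKey'):
            raise
        logger.error(f"File not found: {key}")

        # Fall back to the encoded form, for objects stored under an encoded name
        encoded_key = encode_s3_key(key)
        if encoded_key == key:
            raise  # nothing to encode, so a retry would fail the same way
        logger.info(f"Retrying with encoded key: {encoded_key}")

        s3_client.download_file(
            Bucket=bucket,
            Key=encoded_key,
            Filename=local_path
        )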

def lambda_handler(event, context):
    """Handle S3 events with proper URL decoding"""

    for record in event['Records']:
        # S3 event keys are URL-encoded (a space arrives as '+')
        encoded_key = record['s3']['object']['key']

        # Decode to recover the actual object key
        actual_key = decode_s3_key(encoded_key)

        logger.info(f"Processing: {actual_key}")
        logger.info(f"Encoded as: {encoded_key}")

        bucket = record['s3']['bucket']['name']

        # Download using decoded key
        local_path = f"/tmp/{os.path.basename(actual_key)}"
        download_from_s3(bucket, actual_key, local_path)

        # Process file
        process_wsi(local_path)
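
S3 event notifications deliver object keys URL-encoded, with spaces written as `+`. The following is a minimal sketch of how `unquote` and `unquote_plus` differ on such a key. It uses a hand-written record: the bucket name and key are made up for illustration, and `key_from_record` is a hypothetical helper, not part of the processor.

```python
from urllib.parse import unquote, unquote_plus

# Hand-written stand-in for one entry of event['Records'] (values are illustrative)
record = {
    "s3": {
        "bucket": {"name": "example-bucket"},
        "object": {"key": "slides/Slide+%281%29+-+Patient+A.svs"},
    }
}

def key_from_record(record: dict) -> str:
    """Return the decoded object key from an S3 event record."""
    # unquote_plus turns '+' into a space and decodes %XX escapes
    return unquote_plus(record["s3"]["object"]["key"])

print(key_from_record(record))                 # slides/Slide (1) - Patient A.svs
print(unquote(record["s3"]["object"]["key"]))  # slides/Slide+(1)+-+Patient+A.svs
```

With plain `unquote` the `+` signs survive decoding, so the resulting key does not match the stored object and the download fails with a 404.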

Implementation Details

Handling Different Character Types

import unicodedata

def sanitize_filename(filename: str) -> str:
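    """Sanitize a filename before upload to avoid characters that need URL encoding"""

    # Normalize unicode (NFKD decomposes accents and folds compatibility characters)
    filename = unicodedata.normalize('NFKD', filename)

    # Replace or drop problematic characters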
    replacements = {
        ' ': '_',
        '(': '',
        ')': '',
        '[': '',
        ']': '',
        '{': '',
        '}': '',
        '#': '',
        '%': '',
        '&': '',
        '+': '',
        '?': '',
    }

    for old, new in replacements.items():
        filename = filename.replace(old, new)

    return filename
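
One detail of the normalization step is worth seeing concretely. NFKD decomposes accented characters into a base letter plus a combining mark and folds compatibility characters such as ligatures, so a sanitized name can differ from the original in length and code points even when it looks the same on screen. A small sketch, using made-up filenames:

```python
import unicodedata

composed = "caf\u00e9.svs"  # 'é' as a single code point
normalized = unicodedata.normalize("NFKD", composed)

print(len(composed), len(normalized))   # 8 9
print(normalized == "cafe\u0301.svs")   # True: 'e' followed by a combining acute accent

# Compatibility characters are folded too, e.g. the 'fi' ligature becomes two letters
print(unicodedata.normalize("NFKD", "\ufb01le.svs"))  # file.svs
```

Since the two forms are different strings, the same normalization presumably has to be applied wherever a key is later rebuilt from the original filename, otherwise the lookup can miss.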

Testing Special Characters

import pytest

@pytest.mark.parametrize("filename,expected", [
    ("simple.svs", "simple.svs"),
    ("with space.svs", "with%20space.svs"),
    ("with(parens).svs", "with%28parens%29.svs"),
    ("with[brackets].svs", "with%5Bbrackets%5D.svs"),
    ("with#hash.svs", "with%23hash.svs"),
    ("日本語.svs", "%E6%97%A5%E6%9C%AC%E8%AA%9E.svs"),
])
def test_url_encoding(filename, expected):
    """Test URL encoding for various special characters"""
    encoded = encode_s3_key(filename)
    assert encoded == expected

    # Test round-trip
    decoded = decode_s3_key(encoded)
    assert decoded == filename

Impact and Results

Metric                       Before     After
Special character failures   45/week    0
File not found errors        12%        0.1%
Support tickets              18/week    1/week

Lessons Learned

  1. Encode Keys in URLs: Percent-encode S3 keys with special characters whenever a URL is built by hand
  2. Boto3 Helps: boto3 encodes keys when it builds requests, so pass it the raw key
  3. Decode S3 Events: Event notifications carry URL-encoded keys, with spaces written as '+'
  4. Sanitize Filenames: Prevent issues by normalizing names on upload
  5. Test Special Characters: Include unicode, spaces, and symbols in tests