spatialx
URL Encoding Issues: When File Names with Spaces Break S3 Access
·wsi-processor
Key Takeaway
Our WSI processor failed to access S3 files whose names contained special characters (spaces, parentheses, Unicode) because we did not handle URL encoding of object keys correctly. Handling the encoding and decoding explicitly with urllib.parse eliminated the "file not found" errors for those files (45 per week down to 0).
The Problem
We used raw file names as S3 keys without encoding:
def download_from_s3(bucket, key):
    # Key contains spaces/special chars: "Slide (1) - Patient A.svs"
    # S3 URL becomes invalid
    s3_client.download_file(bucket, key, '/tmp/image.svs')  # Fails!
Issues:
- 404 Errors: Files with spaces not found
- Invalid URL: Special characters broke S3 URLs
- Filename Mismatch: Slide (1).svs vs Slide%20%281%29.svs
- Unicode Failures: Non-ASCII characters caused crashes
- Inconsistent Behavior: Worked in console, failed in API
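The mismatch is easier to see with the three forms of one name side by side. The snippet below is a small sketch using only the standard library; the event-style form is an approximation built with quote_plus (S3 event notifications encode spaces as +), not a captured event payload.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

raw_key = "Slide (1) - Patient A.svs"

# Percent-encoded form, as it would appear in a URL path
percent_encoded = quote(raw_key, safe="/")

# Approximation of the event-notification form: spaces become '+'
event_style = quote_plus(raw_key, safe="/")

print(percent_encoded)            # Slide%20%281%29%20-%20Patient%20A.svs
print(event_style)                # Slide+%281%29+-+Patient+A.svs
print(unquote(event_style))       # Slide+(1)+-+Patient+A.svs  (the '+' survives)
print(unquote_plus(event_style))  # Slide (1) - Patient A.svs
```

Only unquote_plus recovers the stored key from the event-style form, which is why a plain unquote can still end in a 404.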
The Solution
import logging
import os
from urllib.parse import quote, unquote_plus

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)


def encode_s3_key(key: str) -> str:
    """Percent-encode an S3 key for use in a URL."""
    # quote() percent-encodes special characters;
    # safe='/' preserves path separators.
    return quote(key, safe='/')


def decode_s3_key(encoded_key: str) -> str:
    """Decode a URL-encoded S3 key, such as one from an event notification."""
    # S3 event notifications encode spaces as '+', so use unquote_plus();
    # plain unquote() would leave the '+' in place.
    return unquote_plus(encoded_key)


def download_from_s3(bucket: str, key: str, local_path: str):
    """Download a file from S3 using the raw (decoded) object key."""
    # boto3 encodes the key internally when it builds the request,
    # so pass the key exactly as it is stored in S3.
    s3_client = boto3.client('s3')
    try:
        logger.info(f"Downloading s3://{bucket}/{key}")
        s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
        logger.info(f"Downloaded to {local_path}")
    except ClientError as err:
        # download_file reports a missing object as a ClientError (code 404),
        # not as s3_client.exceptions.NoSuchKey.
        if err.response['Error']['Code'] not in ('404', 'NoSuchKey'):
            raise
        logger.error(f"File not found: {key}")
        # Retry with manual encoding. This only helps when the object was
        # stored under a literally percent-encoded name.
        encoded_key = encode_s3_key(key)
        logger.info(f"Retrying with encoded key: {encoded_key}")
        s3_client.download_file(Bucket=bucket, Key=encoded_key, Filename=local_path)


def lambda_handler(event, context):
    """Handle S3 events, decoding the URL-encoded object keys."""
    for record in event['Records']:
        # S3 event keys are URL-encoded
        encoded_key = record['s3']['object']['key']
        # Decode to get the actual object key
        actual_key = decode_s3_key(encoded_key)
        logger.info(f"Processing: {actual_key}")
        logger.info(f"Encoded as: {encoded_key}")
        bucket = record['s3']['bucket']['name']
        # Download using the decoded key
        local_path = f"/tmp/{os.path.basename(actual_key)}"
        download_from_s3(bucket, actual_key, local_path)
        # Process file (process_wsi is defined elsewhere in the processor)
        process_wsi(local_path)
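The decoding step in the handler can be exercised without AWS access by feeding it a hand-built event. The sketch below is illustrative only: iter_s3_objects, the bucket name, and the sample event are hypothetical, and it uses unquote_plus because S3 event notifications encode spaces as +.

```python
from typing import Iterator, Tuple
from urllib.parse import unquote_plus


def iter_s3_objects(event: dict) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, decoded_key) pairs from an S3 event payload.

    Hypothetical helper for illustration; not part of the original processor.
    """
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Event keys are URL-encoded; decode before passing them to boto3
        key = unquote_plus(s3_info["object"]["key"])
        yield bucket, key


# Hand-built event containing only the fields the helper reads
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "slides/Slide+%281%29+-+Patient+A.svs"},
            }
        }
    ]
}

objects = list(iter_s3_objects(sample_event))
print(objects)
```

Keeping the decoding in a small pure function like this makes it testable separately from the download and processing steps.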
Implementation Details
Handling Different Character Types
import unicodedata


def sanitize_filename(filename: str) -> str:
    """Sanitize filename for S3 compatibility"""
    # Normalize unicode
    filename = unicodedata.normalize('NFKD', filename)
    # Replace problematic characters
    replacements = {
        ' ': '_',
        '(': '',
        ')': '',
        '[': '',
        ']': '',
        '{': '',
        '}': '',
        '#': '',
        '%': '',
        '&': '',
        '+': '',
        '?': '',
    }
    for old, new in replacements.items():
        filename = filename.replace(old, new)
    return filename
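As a usage illustration, the same replacements can be expressed as a single translation table. This is a sketch of an equivalent variant, not the processor's actual code; sanitize_filename_translate and _TRANSLATION are hypothetical names. It also shows that a sanitized ASCII name passes through quote() unchanged, which is the point of sanitizing on upload.

```python
import unicodedata
from urllib.parse import quote

# Hypothetical table mirroring the replacements above:
# space -> underscore, every other listed character removed.
_TRANSLATION = str.maketrans({" ": "_", **{ch: None for ch in "()[]{}#%&+?"}})


def sanitize_filename_translate(filename: str) -> str:
    """Variant of sanitize_filename that applies all replacements in one pass."""
    normalized = unicodedata.normalize("NFKD", filename)
    return normalized.translate(_TRANSLATION)


cleaned = sanitize_filename_translate("Slide (1) - Patient A.svs")
print(cleaned)                    # Slide_1_-_Patient_A.svs
print(quote(cleaned, safe="/"))   # Slide_1_-_Patient_A.svs (nothing left to encode)
```

Note that neither version removes non-ASCII characters; NFKD only decomposes them, so names such as 日本語.svs still need encoding when they appear in a URL.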
Testing Special Characters
import pytest


@pytest.mark.parametrize("filename,expected", [
    ("simple.svs", "simple.svs"),
    ("with space.svs", "with%20space.svs"),
    ("with(parens).svs", "with%28parens%29.svs"),
    ("with[brackets].svs", "with%5Bbrackets%5D.svs"),
    ("with#hash.svs", "with%23hash.svs"),
    ("日本語.svs", "%E6%97%A5%E6%9C%AC%E8%AA%9E.svs"),
])
def test_url_encoding(filename, expected):
    """Test URL encoding for various special characters"""
    encoded = encode_s3_key(filename)
    assert encoded == expected
    # Test round-trip
    decoded = decode_s3_key(encoded)
    assert decoded == filename
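Two cases worth adding to a suite like this are names that contain a literal + or %, since those characters carry meaning in URL encoding. The checks below are a minimal sketch with plain assertions and the standard library only; the file names are made up for illustration, and decoding uses unquote_plus on the assumption that keys may arrive in the event-notification form.

```python
from urllib.parse import quote, unquote_plus

# Made-up names containing characters that are special in URL encoding
tricky_names = ["a+b.svs", "100%.svs", "dir/sub dir/file (2).svs"]

encoded_names = [quote(name, safe="/") for name in tricky_names]
round_tripped = [unquote_plus(encoded) for encoded in encoded_names]

for original, encoded, decoded in zip(tricky_names, encoded_names, round_tripped):
    # A literal '+' must be encoded as %2B, otherwise decoding turns it into a space
    assert "+" not in encoded
    assert decoded == original
    print(f"{original!r} -> {encoded!r} -> {decoded!r}")
```

Because quote() encodes a literal + as %2B, decoding with unquote_plus stays lossless for keys produced this way.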
Impact and Results
| Metric | Before | After |
|---|---|---|
| Special character failures | 45/week | 0 |
| File not found errors | 12% | 0.1% |
| Support tickets | 18/week | 1/week |
Lessons Learned
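- Always URL Encode: Percent-encode S3 keys with special characters wherever they appear in a URL
- Boto3 Helps: boto3 encodes keys when it builds requests, so pass it the raw (decoded) key
- Decode S3 Events: Event notifications deliver URL-encoded keys (spaces arrive as +); decode them before use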
- Sanitize Filenames: Prevent issues by normalizing on upload
- Test Special Characters: Include unicode, spaces, and symbols in tests