PII Data Retention & Masking: Achieving GDPR Compliance
Personally Identifiable Information (PII) in application logs creates privacy risk and regulatory liability. We logged emails, phone numbers, and user identifiers in plaintext, violating GDPR requirements. This post covers how we implemented PII masking and automated 30-day data retention to achieve compliance while maintaining debugging capability.
The Problem
Our logging infrastructure captured everything for maximum debugging visibility. This approach created three critical issues:
- GDPR Violations - PII stored indefinitely without user consent or data retention policy
- Security Risk - Plaintext PII in logs exposed sensitive data to anyone with CloudWatch access
- Incident Exposure - Log exports for debugging could leak customer data
Example log entries we found:
2026-01-10 14:32:15 INFO User registered: email=john.doe@example.com
2026-01-10 14:32:16 INFO SMS sent to phone=+1-555-123-4567
2026-01-10 14:32:17 INFO Profile updated: user_id=12345, email=jane.smith@test.com
2026-01-10 14:32:18 ERROR Payment failed for card ending in 4242
Every log line contained PII. Our CloudWatch logs retained this data indefinitely, creating unlimited privacy liability.
GDPR Requirements
The General Data Protection Regulation mandates:
- Data Minimization - Collect only necessary PII
- Purpose Limitation - Use PII only for specified purposes
- Storage Limitation - Retain PII only as long as necessary
- Security - Protect PII with appropriate safeguards
- Transparency - Inform users about data collection and retention
Our logging practices violated four of these: data minimization, storage limitation, security, and transparency.
Before: Unrestricted PII Logging
Application Logging Flow
┌──────────────────────────────────────┐
│             User Action              │
│ (Registration, Login, Profile Edit)  │
│                                      │
│                  │                   │
│                  v                   │
│         ┌──────────────────┐         │
│         │ Application Code │         │
│         │ logger.info()    │         │
│         │ - Full email     │         │
│         │ - Full phone     │         │
│         │ - User details   │         │
│         └────────┬─────────┘         │
│                  │                   │
│                  v                   │
│         ┌──────────────────┐         │
│         │ CloudWatch Logs  │         │
│         │ - PII in plain   │         │
│         │ - Retention: ∞   │         │
│         │ - No masking     │         │
│         └──────────────────┘         │
│                                      │
│ Compliance Status: ✗ GDPR violation  │
│ Security Risk: ✗ High                │
│ Data Retention: ✗ Indefinite         │
└──────────────────────────────────────┘
Privacy violations:
- Emails logged in plaintext: john.doe@example.com
- Phone numbers logged in plaintext: +1-555-123-4567
- User IDs logged with context revealing identity
- No expiration: Logs retained forever
After: PII Masking & Retention
Application Logging Flow (Compliant)
┌──────────────────────────────────────┐
│             User Action              │
│ (Registration, Login, Profile Edit)  │
│                                      │
│                  │                   │
│                  v                   │
│         ┌──────────────────┐         │
│         │ Application Code │         │
│         │ logger.info()    │         │
│         └────────┬─────────┘         │
│                  │                   │
│                  v                   │
│         ┌──────────────────┐         │
│         │ PII Masking      │         │
│         │ Filter           │         │
│         │ - Email masked   │         │
│         │ - Phone masked   │         │
│         │ - Card masked    │         │
│         └────────┬─────────┘         │
│                  │                   │
│                  v                   │
│         ┌──────────────────┐         │
│         │ CloudWatch Logs  │         │
│         │ - PII masked     │         │
│         │ - Retention: 30d │         │
│         │ - Auto-deletion  │         │
│         └──────────────────┘         │
│                                      │
│ Compliance Status: ✓ GDPR compliant  │
│ Security Risk: ✓ Low                 │
│ Data Retention: ✓ 30 days            │
└──────────────────────────────────────┘
Privacy protection:
- Emails masked: j***@e***.com
- Phone numbers masked: +***-4567
- User IDs: Context removed
- Expiration: Logs auto-deleted after 30 days
Implementation Details
Phase 1: PII Identification
We audited our codebase to identify all PII logging locations:
PII Categories Found:
| Data Type | Instances | Risk Level |
|-----------|-----------|------------|
| Email addresses | 247 log statements | High |
| Phone numbers | 89 log statements | High |
| Credit card numbers | 12 log statements | Critical |
| User IDs (with context) | 356 log statements | Medium |
| IP addresses | 178 log statements | Low |
| Names | 45 log statements | Medium |
Total: 927 log statements containing PII across 143 files.
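An audit of this kind can be approximated with a short script. The sketch below is illustrative only and is not the tool used for the numbers above: the hint substrings in `PII_HINTS` and the `logger.<level>(` call shape are assumptions about how a codebase names its fields and loggers, and a substring match will produce some false positives that need manual review.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical hints: substrings that suggest a PII category is being logged.
PII_HINTS = {
    "email": ("email",),
    "phone": ("phone",),
    "card": ("card",),
    "ip": ("ip_address", "remote_addr"),
}

# Matches calls such as logger.info( or logging.error(
LOG_CALL = re.compile(
    r"\blogg(?:er|ing)\.(?:debug|info|warning|error|exception|critical)\("
)


def audit_source(text):
    """Count log statements per PII category in one source file's text."""
    counts = Counter()
    for line in text.splitlines():
        if not LOG_CALL.search(line):
            continue
        lowered = line.lower()
        for category, hints in PII_HINTS.items():
            if any(hint in lowered for hint in hints):
                counts[category] += 1
    return counts


def audit_tree(root):
    """Aggregate counts over every .py file under root."""
    totals = Counter()
    for path in Path(root).rglob("*.py"):
        totals += audit_source(path.read_text(encoding="utf-8", errors="ignore"))
    return totals
```

Calling `audit_tree("src")` returns a `Counter` keyed by category, which is enough to build a first version of a table like the one above.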
Phase 2: PII Masking Implementation
Masking Library:
# src/utils/pii_masking.py
import re


def mask_email(email: str) -> str:
    """
    Mask email address for logging.

    Args:
        email: Email address to mask
    Returns:
        Masked email (e.g., 'j***@e***.com')
    Examples:
        >>> mask_email('john.doe@example.com')
        'j***@e***.com'
    """
    if not email or '@' not in email:
        return email
    local, domain = email.split('@', 1)
    # Mask local part (keep first char)
    masked_local = local[0] + '***' if local else '***'
    # Mask domain (keep first char and TLD); guard against an empty first label
    domain_parts = domain.split('.')
    if len(domain_parts) >= 2 and domain_parts[0]:
        masked_domain = domain_parts[0][0] + '***.' + domain_parts[-1]
    else:
        masked_domain = '***'
    return f"{masked_local}@{masked_domain}"


def mask_phone(phone: str) -> str:
    """
    Mask phone number for logging.

    Args:
        phone: Phone number to mask
    Returns:
        Masked phone (e.g., '+***-4567')
    Examples:
        >>> mask_phone('+1-555-123-4567')
        '+***-4567'
    """
    if not phone:
        return phone
    # Extract last 4 digits
    digits = re.sub(r'\D', '', phone)
    if len(digits) < 4:
        return '+***'
    last_four = digits[-4:]
    return f"+***-{last_four}"


def mask_credit_card(card: str) -> str:
    """
    Mask credit card number for logging.

    Args:
        card: Credit card number to mask
    Returns:
        Masked card (e.g., '****-****-****-4242')
    Examples:
        >>> mask_credit_card('4111-1111-1111-4242')
        '****-****-****-4242'
    """
    if not card:
        return card
    # Extract last 4 digits
    digits = re.sub(r'\D', '', card)
    if len(digits) < 4:
        return '****'
    last_four = digits[-4:]
    return f"****-****-****-{last_four}"


def mask_ip(ip: str) -> str:
    """
    Mask IPv4 address for logging (GDPR treats IP addresses as personal data).

    Args:
        ip: IP address to mask
    Returns:
        Masked IP (e.g., '192.168.***.***')
    Examples:
        >>> mask_ip('192.168.1.100')
        '192.168.***.***'
    """
    if not ip:
        return ip
    parts = ip.split('.')
    if len(parts) == 4:
        return f"{parts[0]}.{parts[1]}.***.***"
    # Anything that is not dotted-quad IPv4 is returned unchanged
    return ip
Logging Filter:
# src/utils/logging_config.py
import logging
import re

from utils.pii_masking import mask_email, mask_phone, mask_credit_card, mask_ip


class PIIMaskingFilter(logging.Filter):
    """
    Logging filter that masks PII in log messages.
    """
    # Regex patterns for PII detection
    EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
    CARD_PATTERN = re.compile(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b')
    IP_PATTERN = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
    # Phone: at least 8 characters of digits and separators, so the short
    # digit runs left in already masked values are not matched again.
    PHONE_PATTERN = re.compile(r'(?<!\*-)(?<!\w)\+?\(?\d[\d\s().-]{6,}\d(?!\w)')

    def filter(self, record):
        """
        Mask PII in log record message.

        Args:
            record: Log record to filter
        Returns:
            True (always allow log, but mask content)
        """
        # Render %-style arguments first so PII passed via args is masked too
        message = record.getMessage()
        # Order matters: the phone pattern is the broadest, so cards and IPs
        # are masked before it runs.
        message = self.EMAIL_PATTERN.sub(lambda m: mask_email(m.group(0)), message)
        message = self.CARD_PATTERN.sub(lambda m: mask_credit_card(m.group(0)), message)
        message = self.IP_PATTERN.sub(lambda m: mask_ip(m.group(0)), message)
        message = self.PHONE_PATTERN.sub(lambda m: mask_phone(m.group(0)), message)
        record.msg = message
        record.args = ()
        return True


# Apply filter to all loggers
def configure_logging():
    """Configure logging with PII masking."""
    # Get root logger
    root_logger = logging.getLogger()
    # Attach the filter to handlers rather than to the logger itself:
    # logger-level filters are skipped for records propagated from child
    # loggers. Handlers must already be attached when this runs.
    pii_filter = PIIMaskingFilter()
    for handler in root_logger.handlers:
        handler.addFilter(pii_filter)
Application Integration:
# src/app.py
from flask import Flask

from utils.logging_config import configure_logging


def create_app():
    app = Flask(__name__)
    # Configure logging with PII masking
    configure_logging()
    return app
Phase 3: Code Refactoring
We refactored 927 log statements to use masked values:
Before:
logger.info(f"User registered: email={user.email}")
logger.info(f"SMS sent to {user.phone}")
logger.error(f"Payment failed for card {payment.card_number}")
After:
from utils.pii_masking import mask_email, mask_phone, mask_credit_card
logger.info(f"User registered: email={mask_email(user.email)}")
logger.info(f"SMS sent to {mask_phone(user.phone)}")
logger.error(f"Payment failed for card {mask_credit_card(payment.card_number)}")
This approach provides two layers of protection:
- Explicit masking - Developers intentionally mask PII
- Filter-based masking - Catches PII that developers missed
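To make the two layers concrete, here is a minimal, self-contained sketch that handles emails only. `EmailMaskingFilter`, `build_logger`, and the logger name `pii_demo` are hypothetical names for this illustration, and the simplified `mask_email` is a stand-in for the library version. The first call masks explicitly; the second forgets to, and the filter on the handler catches it, including when the value arrives as a `%s` argument.

```python
import io
import logging
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_email(email):
    """Simplified stand-in: keep first char of local part and domain, plus TLD."""
    local, domain = email.split("@", 1)
    return f"{local[0]}***@{domain[0]}***.{domain.rsplit('.', 1)[-1]}"


class EmailMaskingFilter(logging.Filter):
    def filter(self, record):
        # Render %-style args first so emails passed as arguments are seen.
        message = record.getMessage()
        record.msg = EMAIL.sub(lambda m: mask_email(m.group(0)), message)
        record.args = ()
        return True


def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.addFilter(EmailMaskingFilter())
    logger = logging.getLogger("pii_demo")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
# Layer 1: the developer masks explicitly.
log.info("User registered: email=%s", mask_email("john.doe@example.com"))
# Layer 2: the developer forgot; the filter masks before the handler writes.
log.info("Password reset requested by %s", "jane.smith@test.com")
output = stream.getvalue()
```

Already masked values pass through the filter unchanged because `*` is not part of the email pattern, so applying both layers to the same message is safe.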
Phase 4: Data Retention Policy
CloudWatch Log Retention:
# infrastructure/cloudwatch_config.py
import boto3


def configure_log_retention(log_group_name, retention_days=30):
    """
    Configure CloudWatch log retention policy.

    Args:
        log_group_name: Name of CloudWatch log group
        retention_days: Number of days to retain logs (default: 30)
    """
    logs_client = boto3.client('logs')
    # Set retention policy
    logs_client.put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=retention_days
    )
    print(f"Set retention policy: {log_group_name} = {retention_days} days")


# Apply to all log groups
log_groups = [
    '/aws/lambda/main-function',
    '/aws/lambda/auth-function',
    '/application/logs',
]
for log_group in log_groups:
    configure_log_retention(log_group, retention_days=30)
Automated Cleanup: CloudWatch automatically deletes logs older than 30 days. No manual intervention required.
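A hard-coded list of log groups can miss groups created later. One possible complement, sketched here as an assumption rather than something the setup above includes, is to scan for log groups with no retention policy and apply one. The function names are hypothetical; the client is passed in so the logic can be exercised without AWS credentials, and in practice it would be `boto3.client('logs')`.

```python
def find_unbounded_log_groups(logs_client):
    """Return names of log groups that have no retention policy set."""
    paginator = logs_client.get_paginator("describe_log_groups")
    unbounded = []
    for page in paginator.paginate():
        for group in page.get("logGroups", []):
            # 'retentionInDays' is absent when events never expire.
            if "retentionInDays" not in group:
                unbounded.append(group["logGroupName"])
    return unbounded


def enforce_retention(logs_client, retention_days=30):
    """Apply a retention policy to every unbounded group; return their names."""
    names = find_unbounded_log_groups(logs_client)
    for name in names:
        logs_client.put_retention_policy(
            logGroupName=name,
            retentionInDays=retention_days,
        )
    return names


# Usage (requires AWS credentials):
#   import boto3
#   enforce_retention(boto3.client("logs"))
```

Running this on a schedule would keep new log groups from silently reverting to indefinite retention.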
Database PII Retention:
-- Automated 30-day PII cleanup (runs daily via scheduled job)
DELETE FROM audit_logs
WHERE created_at < NOW() - INTERVAL 30 DAY;
DELETE FROM user_activity_logs
WHERE created_at < NOW() - INTERVAL 30 DAY;
DELETE FROM communication_logs
WHERE created_at < NOW() - INTERVAL 30 DAY;
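The scheduled job that runs these statements is not shown here. As a rough sketch of what such a job could look like, the function below deletes expired rows from the same three tables and reports how many rows each run removed, which is useful for monitoring. It uses `sqlite3` only so the example is runnable on its own; the `purge_expired` name, the ISO-8601 text format for `created_at`, and the parameter style are assumptions that would change with the production database driver.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Table names taken from the SQL above.
RETENTION_TABLES = ("audit_logs", "user_activity_logs", "communication_logs")


def purge_expired(conn, now=None, retention_days=30):
    """Delete rows older than the retention window; return rows deleted per table."""
    now = now or datetime.now(timezone.utc)
    # Assumes created_at is stored as ISO-8601 UTC text, which sorts lexically.
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    deleted = {}
    for table in RETENTION_TABLES:
        # Table names come from the fixed tuple above, never from user input.
        cursor = conn.execute(
            f"DELETE FROM {table} WHERE created_at < ?", (cutoff,)
        )
        deleted[table] = cursor.rowcount
    conn.commit()
    return deleted
```

Passing `now` explicitly makes the cutoff deterministic in tests.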
Phase 5: Testing & Validation
Unit Tests:
# src/tests/unit/test_pii_masking.py
import pytest

from utils.pii_masking import mask_email, mask_phone, mask_credit_card, mask_ip


def test_mask_email():
    assert mask_email('john.doe@example.com') == 'j***@e***.com'
    assert mask_email('a@b.co') == 'a***@b***.co'
    assert mask_email('invalid-email') == 'invalid-email'


def test_mask_phone():
    assert mask_phone('+1-555-123-4567') == '+***-4567'
    assert mask_phone('555-123-4567') == '+***-4567'
    assert mask_phone('+44 20 1234 5678') == '+***-5678'


def test_mask_credit_card():
    assert mask_credit_card('4111-1111-1111-4242') == '****-****-****-4242'
    assert mask_credit_card('4111111111114242') == '****-****-****-4242'


def test_mask_ip():
    assert mask_ip('192.168.1.100') == '192.168.***.***'
    assert mask_ip('10.0.0.1') == '10.0.***.***'
Integration Tests:
def test_logging_masks_pii(caplog):
    """Verify PII masking filter works in logging."""
    from utils.logging_config import configure_logging
    import logging

    configure_logging()
    # Log PII
    logger = logging.getLogger(__name__)
    logger.info('User email: john.doe@example.com')
    # Verify masked in output
    assert 'j***@e***.com' in caplog.text
    assert 'john.doe@example.com' not in caplog.text
Results
Compliance Achievements
GDPR Compliance:
- ✓ Data minimization: Only necessary PII collected
- ✓ Purpose limitation: PII used only for specified purposes
- ✓ Storage limitation: 30-day retention enforced
- ✓ Security: PII masked in logs, protected in database
- ✓ Transparency: Privacy policy updated with retention details
Data Protection Impact Assessment:
Before implementation:
- Risk Level: High - Indefinite PII storage, plaintext logs
- GDPR Fines: Potentially up to €20M or 4% of worldwide annual turnover, whichever is higher
After implementation:
- Risk Level: Low - 30-day retention, masked logs
- GDPR Compliance: Full compliance achieved
Quantified Improvements
PII Exposure Reduction:
Before:
- Log entries with PII: 927/day
- PII in plaintext: 100%
- Retention period: Indefinite
- Total PII records: ~280,000 (1 year accumulation)
After:
- Log entries with PII: 927/day (same logging)
- PII in plaintext: 0% (all masked)
- Retention period: 30 days
- Total PII records: ~27,810 (30 days max)
PII exposure reduction: 90% (280k → 28k records)
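The reduction figure follows directly from the numbers above. As a quick check, using only those reported values:

```python
entries_per_day = 927     # log entries with PII per day, as reported above
retention_days = 30
records_before = 280_000  # approximate accumulated total reported above

records_after = entries_per_day * retention_days
reduction = 1 - records_after / records_before

print(records_after)       # 27810
print(f"{reduction:.0%}")  # 90%
```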
CloudWatch Cost Savings:
Before:
- Log retention: Indefinite
- Log storage: 2.3 TB (accumulated over 18 months)
- CloudWatch cost: $115/month
After:
- Log retention: 30 days
- Log storage: 120 GB (30-day rolling window)
- CloudWatch cost: $6/month
Cost savings: $109/month (95% reduction)
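The cost figures are internally consistent, which a short calculation confirms. The per-GB rate below is simply what the reported numbers imply, taking 1 TB as 1000 GB; it is not a quoted AWS price.

```python
storage_before_gb = 2300  # 2.3 TB, assuming decimal units
cost_before = 115.0       # USD per month
storage_after_gb = 120
cost_after = 6.0          # USD per month

implied_rate = cost_before / storage_before_gb  # USD per GB-month
projected_after = implied_rate * storage_after_gb
savings = cost_before - cost_after
savings_pct = savings / cost_before

print(round(implied_rate, 2))     # 0.05
print(round(projected_after, 2))  # 6.0
print(f"{savings_pct:.0%}")       # 95%
```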
Debugging Capability: Despite masking, we maintained debugging capability:
- Masked email j***@e***.com still helps identify the user
- Masked phone +***-4567 enables support ticket correlation
- Last 4 digits of cards are sufficient for payment debugging
- User IDs remain available (with context removed)
Real Debugging Example:
Before masking:
2026-01-15 14:32:15 ERROR Payment failed for user john.doe@example.com, card 4111-1111-1111-4242
After masking:
2026-01-15 14:32:15 ERROR Payment failed for user j***@e***.com, card ****-****-****-4242
The masked log still supports debugging:
- Unique user identifier (masked email)
- Card last 4 digits (sufficient to identify card)
- No privacy violation
Security Incidents Prevented
Actual Incident (Before PII Masking):
Date: 2025-12-10
Event: Developer exported CloudWatch logs for debugging
Impact: CSV file contained 15,000 plaintext email addresses
Exposure: Uploaded to public GitHub repository (accidentally)
Remediation: GitHub DMCA takedown, user notification, incident report
Cost: $12,000 (legal + notification + PR damage)
Post-Implementation: All log exports now contain masked PII. Accidental exposure carries minimal risk.
Lessons Learned
What Worked
- Dual-Layer Protection - Explicit masking + filter catches developer errors
- Regex-Based Detection - Automatically identifies PII patterns
- 30-Day Retention - Balances compliance with operational needs
- Preserved Debugging - Masked data still enables troubleshooting
What Didn't Work
- Manual Refactoring - Initial attempt to manually update 927 log statements took 2 weeks
- Overly Aggressive Masking - First version masked too much, broke debugging workflows
- No Developer Training - Developers continued logging PII, requiring filter as backup
Improvements Made
Automated Refactoring: We created a script to automatically refactor log statements:
# scripts/refactor_pii_logging.py
# Automatically wraps PII with masking functions
# Reduced refactoring time from 2 weeks to 2 hours
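The script itself is not reproduced here. As a rough idea of how such a rewrite can work, the sketch below wraps f-string placeholders like `{user.email}` in the matching masking function, but only on lines that contain a logger call. The `MASKERS` mapping, the attribute names, and `wrap_pii` are assumptions for illustration; a real script would also need to add the import line and handle `%`-style and multi-line calls.

```python
import re

# Hypothetical mapping from attribute name to masking function name.
MASKERS = {
    "email": "mask_email",
    "phone": "mask_phone",
    "card_number": "mask_credit_card",
}

LOG_CALL = re.compile(r"\blogger\.(?:debug|info|warning|error|critical)\(")

# Matches a bare f-string placeholder such as {user.email}.
PLACEHOLDER = re.compile(
    r"\{([A-Za-z_][\w.]*\.(" + "|".join(MASKERS) + r"))\}"
)


def wrap_pii(line):
    """Wrap PII placeholders in a log line with their masking function."""
    if not LOG_CALL.search(line):
        return line
    return PLACEHOLDER.sub(
        lambda m: "{" + MASKERS[m.group(2)] + "(" + m.group(1) + ")}",
        line,
    )
```

Because an already wrapped placeholder contains a parenthesis, it no longer matches `PLACEHOLDER`, so running the rewrite twice leaves the line unchanged.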
Balanced Masking: Adjusted masking to preserve debugging utility:
- Email: Keep first char + domain TLD
- Phone: Keep last 4 digits
- Cards: Keep last 4 digits
- IPs: Keep first two octets (useful for geolocation debugging)
Developer Education:
- Created PII logging guide
- Added pre-commit hooks to detect unmasked PII
- Conducted 30-minute training session
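A pre-commit check for unmasked PII can be quite small. The following is a sketch under assumptions, not the hook described above: it flags references to `.email`, `.phone`, or `.card_number` inside logger calls unless they sit directly inside a `mask_*(` call. The attribute names and function names are hypothetical.

```python
import re

LOG_CALL = re.compile(r"\blogger\.(?:debug|info|warning|error|critical)\(")

# Group 1 is set when the reference is wrapped directly in a mask_*() call.
PII_REF = re.compile(
    r"(mask_\w+\()?([A-Za-z_]\w*\.(?:email|phone|card_number))\b"
)


def find_unmasked(text):
    """Return (line number, reference) pairs for unmasked PII in log calls."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not LOG_CALL.search(line):
            continue
        for match in PII_REF.finditer(line):
            if match.group(1) is None:
                findings.append((lineno, match.group(2)))
    return findings


def main(paths):
    """Check each file; return 1 if any unmasked reference is found, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for lineno, ref in find_unmasked(handle.read()):
                print(f"{path}:{lineno}: unmasked PII reference '{ref}' in log call")
                failed = True
    return 1 if failed else 0
```

Wired into a hook runner that passes the staged file names as arguments, the script would end with `sys.exit(main(sys.argv[1:]))` so that a non-zero exit code blocks the commit.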
Key Takeaways
PII protection requires both technical controls and policy enforcement. Our implementation reduced PII exposure by 90% while maintaining debugging capability and achieving GDPR compliance.
Critical implementation factors:
- Dual-layer protection - Explicit masking + automated filtering
- Balanced masking - Preserve debugging utility while protecting privacy
- Automated retention - CloudWatch auto-deletes after 30 days
- Developer training - Education prevents new PII exposure
Recommended approach:
- Audit codebase for PII logging (grep for email/phone patterns)
- Implement masking library (regex-based detection)
- Add logging filter (catches developer errors)
- Configure 30-day log retention (CloudWatch policy)
- Train developers on PII handling best practices
Cost vs. Benefit:
- Implementation time: 1 week (audit + implementation + testing)
- CloudWatch savings: $109/month
- Risk reduction: $12,000+ (prevented incident costs)
- Compliance: GDPR requirements met
GDPR violations carry fines of up to €20M or 4% of worldwide annual turnover, whichever is higher. The cost of compliance (1 week of development time) is negligible compared to potential fines and reputational damage.
Production systems must protect user privacy. Implement PII masking and retention policies before an audit or incident forces the issue.