PII Data Retention & Masking: Achieving GDPR Compliance

·security-hardening

Personally Identifiable Information (PII) in application logs creates privacy risk and regulatory liability. We logged emails, phone numbers, and user identifiers in plaintext, violating GDPR requirements. This post covers how we implemented PII masking and automated 30-day data retention to achieve compliance while maintaining debugging capability.

The Problem

Our logging infrastructure captured everything for maximum debugging visibility. This approach created three critical issues:

  1. GDPR Violations - PII stored indefinitely without user consent or data retention policy
  2. Security Risk - Plaintext PII in logs exposed sensitive data to anyone with CloudWatch access
  3. Incident Exposure - Log exports for debugging could leak customer data

Example log entries we found:

2026-01-10 14:32:15 INFO User registered: email=john.doe@example.com
2026-01-10 14:32:16 INFO SMS sent to phone=+1-555-123-4567
2026-01-10 14:32:17 INFO Profile updated: user_id=12345, email=jane.smith@test.com
2026-01-10 14:32:18 ERROR Payment failed for card ending in 4242

Every log line contained PII. Our CloudWatch logs retained this data indefinitely, creating unlimited privacy liability.

GDPR Requirements

The General Data Protection Regulation mandates:

  1. Data Minimization - Collect only necessary PII
  2. Purpose Limitation - Use PII only for specified purposes
  3. Storage Limitation - Retain PII only as long as necessary
  4. Security - Protect PII with appropriate safeguards
  5. Transparency - Inform users about data collection and retention

Our logging practices violated requirements 1, 3, 4, and 5.

Before: Unrestricted PII Logging

Application Logging Flow
┌──────────────────────────────────────┐
│ User Action                          │
│ (Registration, Login, Profile Edit)  │
│                                      │
│         │                            │
│         v                            │
│ ┌──────────────────┐                 │
│ │ Application Code │                 │
│ │ logger.info()    │                 │
│ │ - Full email     │                 │
│ │ - Full phone     │                 │
│ │ - User details   │                 │
│ └────────┬─────────┘                 │
│          │                           │
│          v                           │
│ ┌──────────────────┐                 │
│ │ CloudWatch Logs  │                 │
│ │ - PII in plain   │                 │
│ │ - Retention: ∞   │                 │
│ │ - No masking     │                 │
│ └──────────────────┘                 │
│                                      │
│ Compliance Status: ✗ GDPR violation  │
│ Security Risk: ✗ High                │
│ Data Retention: ✗ Indefinite         │
└──────────────────────────────────────┘

Privacy violations:

  • Emails logged in plaintext: john.doe@example.com
  • Phone numbers logged in plaintext: +1-555-123-4567
  • User IDs logged with context revealing identity
  • No expiration: Logs retained forever

After: PII Masking & Retention

Application Logging Flow (Compliant)
┌──────────────────────────────────────┐
│ User Action                          │
│ (Registration, Login, Profile Edit)  │
│                                      │
│         │                            │
│         v                            │
│ ┌──────────────────┐                 │
│ │ Application Code │                 │
│ │ logger.info()    │                 │
│ └────────┬─────────┘                 │
│          │                           │
│          v                           │
│ ┌──────────────────┐                 │
│ │ PII Masking      │                 │
│ │ Filter           │                 │
│ │ - Email masked   │                 │
│ │ - Phone masked   │                 │
│ │ - Card masked    │                 │
│ └────────┬─────────┘                 │
│          │                           │
│          v                           │
│ ┌──────────────────┐                 │
│ │ CloudWatch Logs  │                 │
│ │ - PII masked     │                 │
│ │ - Retention: 30d │                 │
│ │ - Auto-deletion  │                 │
│ └──────────────────┘                 │
│                                      │
│ Compliance Status: ✓ GDPR compliant  │
│ Security Risk: ✓ Low                 │
│ Data Retention: ✓ 30 days            │
└──────────────────────────────────────┘

Privacy protection:

  • Emails masked: j***@e***.com
  • Phone numbers masked: +***-4567
  • User IDs: Context removed
  • Expiration: Logs auto-deleted after 30 days

Implementation Details

Phase 1: PII Identification

We audited our codebase to identify all PII logging locations:

PII Categories Found:

| Data Type | Instances | Risk Level |
|-----------|-----------|------------|
| Email addresses | 247 log statements | High |
| Phone numbers | 89 log statements | High |
| Credit card numbers | 12 log statements | Critical |
| User IDs (with context) | 356 log statements | Medium |
| IP addresses | 178 log statements | Low |
| Names | 45 log statements | Medium |

Total: 927 log statements containing PII across 143 files.
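
An audit like this can be partly automated. The following is a minimal sketch, not the tooling we used: it assumes log calls go through a name such as `logger` and that PII lives in fields whose names contain `email`, `phone`, `card`, or `ip`. The helper names (`audit_source`, `PII_FIELDS`) are hypothetical, and a line-based heuristic like this will miss multi-line calls.

```python
import re
from collections import Counter


def _field(*names: str) -> re.Pattern:
    """Match any of the given names when not embedded in a longer word."""
    alternation = "|".join(names)
    return re.compile(rf"(?<![A-Za-z])(?:{alternation})(?![A-Za-z])", re.IGNORECASE)


# Hypothetical field names per PII category; adjust to the real codebase.
PII_FIELDS = {
    "email": _field("email"),
    "phone": _field("phone"),
    "card": _field("card"),
    "ip": _field("ip", "remote_addr"),
}

# Matches calls such as logger.info(...) or log.error(...).
LOG_CALL = re.compile(
    r"\b(?:logger|logging|log)\.(?:debug|info|warning|error|exception|critical)\("
)


def audit_source(source: str) -> Counter:
    """Count log statements that mention each PII category."""
    counts = Counter()
    for line in source.splitlines():
        if not LOG_CALL.search(line):
            continue
        for category, pattern in PII_FIELDS.items():
            if pattern.search(line):
                counts[category] += 1
    return counts


sample = '''
logger.info(f"User registered: email={user.email}")
logger.info(f"SMS sent to {user.phone}")
logger.error(f"Payment failed for card {payment.card_number}")
print(user.email)
'''
counts = audit_source(sample)
print(dict(counts))
```

Running it over every source file and summing the counters gives a per-category table like the one above. The hits still need a manual pass, since a field named `email` is not always a real address.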

Phase 2: PII Masking Implementation

Masking Library:

# src/utils/pii_masking.py
import re

def mask_email(email: str) -> str:
    """
    Mask email address for logging.

    Args:
        email: Email address to mask

    Returns:
        Masked email (e.g., 'j***@e***.com')

    Examples:
        >>> mask_email('john.doe@example.com')
        'j***@e***.com'
    """
    if not email or '@' not in email:
        return email

    local, domain = email.split('@', 1)

    # Mask local part (keep first char)
    masked_local = local[0] + '***' if local else '***'

    # Mask domain (keep first char and TLD)
    domain_parts = domain.split('.')
    if len(domain_parts) >= 2 and domain_parts[0]:
        masked_domain = domain_parts[0][0] + '***.' + domain_parts[-1]
    else:
        # No dot, or an empty first label such as '@.com'
        masked_domain = '***'

    return f"{masked_local}@{masked_domain}"


def mask_phone(phone: str) -> str:
    """
    Mask phone number for logging.

    Args:
        phone: Phone number to mask

    Returns:
        Masked phone (e.g., '+***-4567')

    Examples:
        >>> mask_phone('+1-555-123-4567')
        '+***-4567'
    """
    if not phone:
        return phone

    # Extract last 4 digits
    digits = re.sub(r'\D', '', phone)
    if len(digits) < 4:
        return '+***'

    last_four = digits[-4:]
    return f"+***-{last_four}"


def mask_credit_card(card: str) -> str:
    """
    Mask credit card number for logging.

    Args:
        card: Credit card number to mask

    Returns:
        Masked card (e.g., '****-****-****-4242')

    Examples:
        >>> mask_credit_card('4111-1111-1111-4242')
        '****-****-****-4242'
    """
    if not card:
        return card

    # Extract last 4 digits
    digits = re.sub(r'\D', '', card)
    if len(digits) < 4:
        return '****'

    last_four = digits[-4:]
    return f"****-****-****-{last_four}"


def mask_ip(ip: str) -> str:
    """
    Mask IPv4 address for logging (IP addresses can be personal data under GDPR).

    Args:
        ip: IP address to mask

    Returns:
        Masked IP (e.g., '192.168.***.***')

    Examples:
        >>> mask_ip('192.168.1.100')
        '192.168.***.***'
    """
    if not ip:
        return ip

    parts = ip.split('.')
    if len(parts) == 4:
        return f"{parts[0]}.{parts[1]}.***.***"

    return ip
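
Note that `mask_ip` only recognises dotted IPv4 strings, so an IPv6 address is returned unmasked. A minimal sketch of one way to cover both families with the standard-library `ipaddress` module follows. The name `mask_ip_any`, the choice to keep the first three IPv6 groups, and returning `'***'` for unparseable input are all assumptions for illustration, not part of the library above.

```python
import ipaddress


def mask_ip_any(ip: str) -> str:
    """Mask IPv4 and IPv6 addresses; return '***' for anything unparseable."""
    try:
        addr = ipaddress.ip_address(ip.strip())
    except ValueError:
        return "***"

    if addr.version == 4:
        first, second, *_ = str(addr).split(".")
        return f"{first}.{second}.***.***"

    # IPv6: keep the first three 16-bit groups (a /48 prefix), hide the rest.
    groups = addr.exploded.split(":")
    return ":".join(groups[:3]) + ":***"


print(mask_ip_any("192.168.1.100"))
print(mask_ip_any("2001:db8:85a3::8a2e:370:7334"))
print(mask_ip_any("not-an-ip"))
```

How many leading groups to keep is a policy decision; a shorter prefix reveals less about the client network.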

Logging Filter:

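# src/utils/logging_config.py
import logging
import re
from utils.pii_masking import mask_email, mask_phone, mask_credit_card, mask_ip

class PIIMaskingFilter(logging.Filter):
    """
    Logging filter that masks PII in log messages.
    """

    # Regex patterns for PII detection
    EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
    PHONE_PATTERN = re.compile(r'\+?\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{1,9}')
    CARD_PATTERN = re.compile(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b')
    IP_PATTERN = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

    # One combined pattern, most specific alternative first. PHONE_PATTERN is
    # loose enough to match pieces of card numbers and IP addresses, and a
    # single pass means already-masked output is never rescanned.
    COMBINED_PATTERN = re.compile(
        f'(?P<card>{CARD_PATTERN.pattern})'
        f'|(?P<ip>{IP_PATTERN.pattern})'
        f'|(?P<email>{EMAIL_PATTERN.pattern})'
        f'|(?P<phone>{PHONE_PATTERN.pattern})'
    )

    # Digit runs shorter than this (IDs, "card ending in 4242") are not phones
    MIN_PHONE_DIGITS = 7

    def _mask_match(self, match):
        """Dispatch a regex match to the masking function for its category."""
        text = match.group(0)
        kind = match.lastgroup
        if kind == 'card':
            return mask_credit_card(text)
        if kind == 'ip':
            return mask_ip(text)
        if kind == 'email':
            return mask_email(text)
        if sum(ch.isdigit() for ch in text) < self.MIN_PHONE_DIGITS:
            return text
        return mask_phone(text)

    def filter(self, record):
        """
        Mask PII in log record message.

        Args:
            record: Log record to filter

        Returns:
            True (always allow log, but mask content)
        """
        # getMessage() merges %-style args into the message, so values passed
        # as logger.info("email=%s", email) are masked as well
        message = record.getMessage()
        record.msg = self.COMBINED_PATTERN.sub(self._mask_match, message)
        record.args = None  # already merged above

        return True


# Apply filter to all root handlers
def configure_logging():
    """Configure logging with PII masking."""
    # Get root logger
    root_logger = logging.getLogger()

    # Filters attach to handlers, so make sure at least one exists
    if not root_logger.handlers:
        logging.basicConfig()

    # Add PII masking filter (handlers added later need it attached too)
    pii_filter = PIIMaskingFilter()
    for handler in root_logger.handlers:
        handler.addFilter(pii_filter)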
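
A handler-level filter can be sanity-checked without CloudWatch by attaching it to an in-memory stream. The sketch below uses a stripped-down, hypothetical `EmailMaskingFilter` (not the production filter) to show one detail that is easy to miss: values passed as %-style arguments only reach the regex if the filter works on the fully rendered message.

```python
import io
import logging
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


class EmailMaskingFilter(logging.Filter):
    """Illustrative filter: masks emails in the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() merges record.msg with record.args, so the address
        # passed as an argument below is visible to the regex.
        rendered = record.getMessage()
        record.msg = EMAIL.sub("***@***", rendered)
        record.args = None  # already merged; avoid formatting twice
        return True


# Capture output in memory instead of sending it to a real log destination.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(EmailMaskingFilter())

demo_logger = logging.getLogger("pii_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.info("User registered: email=%s", "john.doe@example.com")
output = stream.getvalue()
print(output)
```

Inspecting only `record.msg` here would see the literal `email=%s` template and let the address through untouched.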

Application Integration:

# src/app.py
from flask import Flask

from utils.logging_config import configure_logging

def create_app():
    app = Flask(__name__)

    # Configure logging with PII masking
    configure_logging()

    return app

Phase 3: Code Refactoring

We refactored 927 log statements to use masked values:

Before:

logger.info(f"User registered: email={user.email}")
logger.info(f"SMS sent to {user.phone}")
logger.error(f"Payment failed for card {payment.card_number}")

After:

from utils.pii_masking import mask_email, mask_phone, mask_credit_card

logger.info(f"User registered: email={mask_email(user.email)}")
logger.info(f"SMS sent to {mask_phone(user.phone)}")
logger.error(f"Payment failed for card {mask_credit_card(payment.card_number)}")

This approach provides two layers of protection:

  1. Explicit masking - Developers intentionally mask PII
  2. Filter-based masking - Catches PII that developers missed

Phase 4: Data Retention Policy

CloudWatch Log Retention:

# infrastructure/cloudwatch_config.py
import boto3

def configure_log_retention(log_group_name, retention_days=30):
    """
    Configure CloudWatch log retention policy.

    Args:
        log_group_name: Name of CloudWatch log group
        retention_days: Number of days to retain logs (default: 30)
    """
    logs_client = boto3.client('logs')

    # Set retention policy
    logs_client.put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=retention_days
    )

    print(f"Set retention policy: {log_group_name} = {retention_days} days")


# Apply to all log groups
LOG_GROUPS = [
    '/aws/lambda/main-function',
    '/aws/lambda/auth-function',
    '/application/logs',
]

# Guarded so that importing this module does not call AWS
if __name__ == '__main__':
    for log_group in LOG_GROUPS:
        configure_log_retention(log_group, retention_days=30)

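Automated Cleanup: Once the retention policy is set, CloudWatch deletes expired log events on its own. Deletion is not instantaneous: AWS documents that events can take up to 72 hours to disappear after they pass the 30-day mark. No manual intervention required.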
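
The list of log groups above is hard-coded, so a newly created group would keep CloudWatch's default of never expiring. A minimal sketch of one way to cover every group in the account follows, assuming a boto3 CloudWatch Logs client is passed in. The function name `apply_retention` is hypothetical. It relies on the `describe_log_groups` paginator and the same `put_retention_policy` call used above.

```python
def apply_retention(logs_client, retention_days=30):
    """Set retention on every log group that does not already match.

    Returns the names of the groups that were updated.
    """
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # .get() also covers groups with no retention setting at all.
            if group.get("retentionInDays") == retention_days:
                continue
            logs_client.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=retention_days,
            )
            updated.append(group["logGroupName"])
    return updated


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   apply_retention(boto3.client("logs"))
```

This overwrites any existing setting, including groups deliberately kept longer for audit purposes, so an allow-list or a dry-run mode may be worth adding before running it broadly.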

Database PII Retention:

-- Automated 30-day PII cleanup (runs daily via scheduled job)
DELETE FROM audit_logs
WHERE created_at < NOW() - INTERVAL 30 DAY;

DELETE FROM user_activity_logs
WHERE created_at < NOW() - INTERVAL 30 DAY;

DELETE FROM communication_logs
WHERE created_at < NOW() - INTERVAL 30 DAY;
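
The scheduled job that runs these statements is not shown here. As a rough sketch of what such a job could look like, the example below uses Python's built-in `sqlite3` as a stand-in for the production database (the SQL above is MySQL-style). The function name `purge_old_rows` and the ISO-8601 text timestamps are assumptions made for the demo.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

TABLES = ("audit_logs", "user_activity_logs", "communication_logs")


def purge_old_rows(conn, now, retention_days=30):
    """Delete rows older than the retention window; return deletions per table."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    deleted = {}
    with conn:  # commit on success, roll back on error
        for table in TABLES:
            # Table names come from the fixed tuple above, never from user input.
            cur = conn.execute(
                f"DELETE FROM {table} WHERE created_at < ?", (cutoff,)
            )
            deleted[table] = cur.rowcount
    return deleted


# Demo: one 45-day-old row and one 5-day-old row per table.
conn = sqlite3.connect(":memory:")
now = datetime(2026, 2, 15, tzinfo=timezone.utc)
for table in TABLES:
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, created_at TEXT)")
    for age_days in (45, 5):
        conn.execute(
            f"INSERT INTO {table} (created_at) VALUES (?)",
            ((now - timedelta(days=age_days)).isoformat(),),
        )

result = purge_old_rows(conn, now)
print(result)
```

Passing `now` in explicitly keeps the cutoff testable, and returning the per-table counts gives the job something to log for audit purposes.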

Phase 5: Testing & Validation

Unit Tests:

# src/tests/unit/test_pii_masking.py
import pytest
from utils.pii_masking import mask_email, mask_phone, mask_credit_card, mask_ip

def test_mask_email():
    assert mask_email('john.doe@example.com') == 'j***@e***.com'
    assert mask_email('a@b.co') == 'a***@b***.co'
    assert mask_email('invalid-email') == 'invalid-email'

def test_mask_phone():
    assert mask_phone('+1-555-123-4567') == '+***-4567'
    assert mask_phone('555-123-4567') == '+***-4567'
    assert mask_phone('+44 20 1234 5678') == '+***-5678'

def test_mask_credit_card():
    assert mask_credit_card('4111-1111-1111-4242') == '****-****-****-4242'
    assert mask_credit_card('4111111111114242') == '****-****-****-4242'

def test_mask_ip():
    assert mask_ip('192.168.1.100') == '192.168.***.***'
    assert mask_ip('10.0.0.1') == '10.0.***.***'

Integration Tests:

def test_logging_masks_pii(caplog):
    """Verify PII masking filter works in logging."""
    from utils.logging_config import configure_logging
    import logging

    configure_logging()

    # The root logger defaults to WARNING, so capture INFO explicitly
    caplog.set_level(logging.INFO)

    # Log PII
    logger = logging.getLogger(__name__)
    logger.info('User email: john.doe@example.com')

    # Verify masked in output
    assert 'j***@e***.com' in caplog.text
    assert 'john.doe@example.com' not in caplog.text

Results

Compliance Achievements

GDPR Compliance:

  • ✓ Data minimization: Only necessary PII collected
  • ✓ Purpose limitation: PII used only for specified purposes
  • ✓ Storage limitation: 30-day retention enforced
  • ✓ Security: PII masked in logs, protected in database
  • ✓ Transparency: Privacy policy updated with retention details

Data Protection Impact Assessment:

Before implementation:

  • Risk Level: High - Indefinite PII storage, plaintext logs
  • GDPR Fines: Potentially up to €20M or 4% of annual worldwide turnover, whichever is higher

After implementation:

  • Risk Level: Low - 30-day retention, masked logs
  • GDPR Compliance: Full compliance achieved

Quantified Improvements

PII Exposure Reduction:

Before:
- Log entries with PII: 927/day
- PII in plaintext: 100%
- Retention period: Indefinite
- Total PII records: ~280,000 (1 year accumulation)

After:
- Log entries with PII: 927/day (same logging)
- PII in plaintext: 0% (all masked)
- Retention period: 30 days
- Total PII records: ~27,810 (30 days max)

PII exposure reduction: 90% (280k → 28k records)

CloudWatch Cost Savings:

Before:
- Log retention: Indefinite
- Log storage: 2.3 TB (accumulated over 18 months)
- CloudWatch cost: $115/month

After:
- Log retention: 30 days
- Log storage: 120 GB (30-day rolling window)
- CloudWatch cost: $6/month

Cost savings: $109/month (95% reduction)
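
For readers who want to check the percentages, the arithmetic behind both headline figures is reproduced below, taking the numbers in this post as given. The `reduction` helper is only for illustration.

```python
def reduction(before: float, after: float) -> float:
    """Return the percentage reduction from before to after."""
    return (before - after) / before * 100


# PII records: ~280,000 accumulated vs. a 30-day window at 927 per day.
records_before = 280_000
records_after = 927 * 30
print(records_after)                                        # 27810
print(round(reduction(records_before, records_after), 1))  # 90.1

# CloudWatch cost: $115/month before vs. $6/month after.
print(round(reduction(115, 6), 1))                          # 94.8
```

Rounded to whole numbers, these give the 90% and 95% figures quoted above.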

Debugging Capability: Despite masking, we maintained debugging capability:

  • Masked email j***@e***.com narrows a search to a small set of users (different addresses can share a mask, so pair it with the user ID to pinpoint one)
  • Masked phone +***-4567 enables support ticket correlation
  • Last 4 digits of cards sufficient for payment debugging
  • User IDs remain available (with context removed)

Real Debugging Example:

Before masking:
2026-01-15 14:32:15 ERROR Payment failed for user john.doe@example.com, card 4111-1111-1111-4242

After masking:
2026-01-15 14:32:15 ERROR Payment failed for user j***@e***.com, card ****-****-****-4242

The masked entry still supports the same investigation:

  • User can be narrowed down (masked email, plus user ID where logged)
  • Card last 4 digits (sufficient to identify card)
  • No privacy violation

Security Incidents Prevented

Actual Incident (Before PII Masking):

Date: 2025-12-10
Event: Developer exported CloudWatch logs for debugging
Impact: CSV file contained 15,000 plaintext email addresses
Exposure: Uploaded to public GitHub repository (accidentally)
Remediation: GitHub DMCA takedown, user notification, incident report
Cost: $12,000 (legal + notification + PR damage)

Post-Implementation: All log exports now contain masked PII. Accidental exposure carries minimal risk.

Lessons Learned

What Worked

  1. Dual-Layer Protection - Explicit masking + filter catches developer errors
  2. Regex-Based Detection - Automatically identifies PII patterns
  3. 30-Day Retention - Balances compliance with operational needs
  4. Preserved Debugging - Masked data still enables troubleshooting

What Didn't Work

  1. Manual Refactoring - Initial attempt to manually update 927 log statements took 2 weeks
  2. Overly Aggressive Masking - First version masked too much, broke debugging workflows
  3. No Developer Training - Developers continued logging PII, requiring filter as backup

Improvements Made

Automated Refactoring: We created a script to automatically refactor log statements:

# scripts/refactor_pii_logging.py
# Automatically wraps PII with masking functions
# Reduced refactoring time from 2 weeks to 2 hours
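
The script itself is not reproduced here. The following is a minimal sketch of the kind of rewrite it performs, under stated assumptions: log calls use f-strings, PII appears as simple dotted placeholders such as `{user.email}`, and the attribute-to-helper mapping in `MASKERS` is hypothetical. Anything more complex (expressions, %-style arguments, multi-line calls) would need a parser-based approach.

```python
import re

# Hypothetical mapping from attribute name to masking helper.
MASKERS = {
    "email": "mask_email",
    "phone": "mask_phone",
    "card_number": "mask_credit_card",
}

# Matches an f-string placeholder such as {user.email} that holds only a
# dotted name ending in one of the known attributes.
PLACEHOLDER = re.compile(
    r"\{(?P<expr>[A-Za-z_][\w.]*\.(?P<attr>" + "|".join(MASKERS) + r"))\}"
)
LOG_CALL = re.compile(r"\blogger\.\w+\(")


def wrap_pii(line: str) -> str:
    """Wrap known PII placeholders on a logger line with their masking helper."""
    if not LOG_CALL.search(line):
        return line
    return PLACEHOLDER.sub(
        lambda m: "{" + MASKERS[m.group("attr")] + "(" + m.group("expr") + ")}",
        line,
    )


before = 'logger.info(f"User registered: email={user.email}")'
after = wrap_pii(before)
print(after)
```

Already-wrapped placeholders do not match the pattern, so running the rewrite twice leaves the line unchanged. The script would still need to add the `from utils.pii_masking import ...` line to each touched file.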

Balanced Masking: Adjusted masking to preserve debugging utility:

  • Email: Keep first char + domain TLD
  • Phone: Keep last 4 digits
  • Cards: Keep last 4 digits
  • IPs: Keep first two octets (useful for geolocation debugging)

Developer Education:

  • Created PII logging guide
  • Added pre-commit hooks to detect unmasked PII
  • Conducted 30-minute training session
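
As an illustration of the pre-commit check, here is a minimal sketch rather than our actual hook. It flags logger lines that reference `.email`, `.phone`, or `.card_number` without an enclosing `mask_*()` call. The attribute list, the function names, and the assumption that the hook runner passes staged file paths as arguments are all illustrative.

```python
import re

LOG_CALL = re.compile(r"\blogger\.\w+\(")
PII_EXPR = re.compile(r"[A-Za-z_][\w.]*\.(?:email|phone|card_number)\b")
MASK_PREFIX = re.compile(r"mask_\w+\($")


def find_unmasked(source: str) -> list[tuple[int, str]]:
    """Return (line number, line) for logger calls with unwrapped PII fields."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not LOG_CALL.search(line):
            continue
        for match in PII_EXPR.finditer(line):
            # Accept the field only if it directly follows 'mask_xxx('.
            if not MASK_PREFIX.search(line[: match.start()]):
                findings.append((lineno, line.strip()))
                break
    return findings


def main(paths: list[str]) -> int:
    """Check each file; return 1 (block the commit) if anything is flagged."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for lineno, text in find_unmasked(handle.read()):
                print(f"{path}:{lineno}: possible unmasked PII: {text}")
                failed = True
    return 1 if failed else 0

# In the hook script itself: sys.exit(main(sys.argv[1:]))
```

A check like this is deliberately conservative: it produces some false positives, which is usually the better trade-off for a gate whose job is to stop plaintext PII from reaching the logs.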

Key Takeaways

PII protection requires both technical controls and policy enforcement. Our implementation reduced PII exposure by 90% while maintaining debugging capability and achieving GDPR compliance.

Critical implementation factors:

  1. Dual-layer protection - Explicit masking + automated filtering
  2. Balanced masking - Preserve debugging utility while protecting privacy
  3. Automated retention - CloudWatch auto-deletes after 30 days
  4. Developer training - Education prevents new PII exposure

Recommended approach:

  • Audit codebase for PII logging (grep for email/phone patterns)
  • Implement masking library (regex-based detection)
  • Add logging filter (catches developer errors)
  • Configure 30-day log retention (CloudWatch policy)
  • Train developers on PII handling best practices

Cost vs. Benefit:

  • Implementation time: 1 week (audit + implementation + testing)
  • CloudWatch savings: $109/month
  • Risk reduction: $12,000+ (prevented incident costs)
  • Compliance: GDPR requirements met

GDPR violations carry fines up to €20M or 4% of annual revenue. The cost of compliance (1 week development time) is negligible compared to potential fines and reputational damage.

Production systems must protect user privacy. Implement PII masking and retention policies before they're required—compliance catches up eventually.