How a Critical Bug Led Us to Domain-Driven Design: Refactoring a RevenueCat Webhook System

Published: February 9, 2026

Introduction

We recently discovered a critical bug in our subscription system: cancelled users still had full access to premium features. The bug was subtle. A sentinel date intended to mark memberships as "force-expired" was actually making them never expire. It also exposed deeper architectural problems: business logic scattered across HTTP handlers, duplicated code, and inverted dependencies.

This post walks through the refactor that fixed the bug and transformed a tangled webhook handler into a clean, domain-driven architecture. Whether you're working with RevenueCat, Stripe webhooks, or any event-driven subscription system, these lessons apply.

The Problem: When Layering Goes Wrong

Our RevenueCat webhook integration started simple: receive events, update database. As features grew, it became a 600-line monolith mixing HTTP concerns with business rules.

Architecture Before

┌────────────────────────────────────────────────────┐
│  revenuecat.py (HTTP Resource Layer)               │
│  ┌───────────────────────────────────────────────┐ │
│  │ • HTTP auth validation                        │ │
│  │ • Schema validation                           │ │
│  │ • Business logic (6 event handlers)           │ │
│  │ • _idempotency_key() helper                   │ │
│  │ • Inline sentinel dates: datetime(2077,1,1)   │ │
│  └───────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────┘
         ▲
         │ imports from resource layer (!)
         │
┌────────┴───────────────────────────────────────────┐
│  revenuecat_controller.py                          │
│  • Routes to handlers in revenuecat.py             │
│  • Duplicate _idempotency_key()                    │
│  • Analytics tracking                              │
└────────────────────────────────────────────────────┘

Seven Critical Issues

1. Single Responsibility Principle Violation

The revenuecat.py resource layer contained six business logic handlers: INITIAL_PURCHASE, RENEWAL, CANCELLATION, EXPIRATION, BILLING_ISSUE, PRODUCT_CHANGE. HTTP resources should handle requests and responses, not implement domain rules.

2. DRY Violation

_idempotency_key() was duplicated in two files, and the sentinel date datetime(2077, 1, 1) appeared inline in four different places with nothing to explain its meaning.
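One way to remove the duplication is a single module-level helper that both files import. The post does not show how _idempotency_key() built its key, so this sketch assumes it uses the id field of the RevenueCat event and falls back to hashing a few identifying fields. The module location, the `revenuecat:` prefix, and the fallback field list are all illustrative assumptions.

```python
# Hypothetical shared module (e.g. webhooks/idempotency.py), imported by both
# the resource and the controller in place of their private copies.
import hashlib
import json
from typing import Any, Mapping

# Fields hashed only when an event has no "id" (an assumption of this sketch)
_FALLBACK_FIELDS = ("type", "app_user_id", "product_id", "event_timestamp_ms")


def idempotency_key(event: Mapping[str, Any]) -> str:
    """Return a stable deduplication key for one RevenueCat webhook event."""
    event_id = event.get("id")
    if event_id:
        # Assumption: a retried delivery resends the same event id
        return f"revenuecat:{event_id}"
    # Otherwise hash the identifying fields in canonical (key-sorted) JSON
    fingerprint = json.dumps(
        {field: event.get(field) for field in _FALLBACK_FIELDS}, sort_keys=True
    )
    return "revenuecat:" + hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
```

For example, idempotency_key({"id": "evt_123"}) returns "revenuecat:evt_123". Because the key logic lives in one place, changing it (or its TTL policy) can no longer leave the two callers out of sync.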

3. The Critical Bug

# membership.py - BEFORE
def is_valid(self):
    now = datetime.now(timezone.utc)
    applied = self.applied_expires_date  # Returns enforced_expires_date if set
    return now < applied  # BUG: now < 2077-01-01 is always True!

When a subscription was cancelled, we set enforced_expires_date = datetime(2077, 1, 1) as a sentinel meaning "force-expired." But is_valid() compared the current date against this far-future date, a comparison that stays true until 2077, so cancelled users kept full access.
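The failure is easy to reproduce in isolation. This sketch uses a hypothetical, stripped-down Membership dataclass whose applied_expires_date follows the comment in the code above (the enforced date wins when it is set). Timezone-aware UTC datetimes are assumed throughout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# The inline sentinel from the old code, made timezone-aware for this sketch
FORCE_EXPIRED = datetime(2077, 1, 1, tzinfo=timezone.utc)


@dataclass
class Membership:
    """Hypothetical minimal model with only the fields the bug involves."""
    expires_date: datetime
    enforced_expires_date: Optional[datetime] = None

    @property
    def applied_expires_date(self) -> datetime:
        # The enforced date overrides the store's expiry whenever it is set
        if self.enforced_expires_date is not None:
            return self.enforced_expires_date
        return self.expires_date

    def is_valid(self) -> bool:
        # BEFORE: the sentinel is compared like any other expiry date
        return datetime.now(timezone.utc) < self.applied_expires_date


yesterday = datetime.now(timezone.utc) - timedelta(days=1)
cancelled = Membership(expires_date=yesterday, enforced_expires_date=FORCE_EXPIRED)
print(cancelled.is_valid())  # True: the "force-expired" user keeps access until 2077
```

Note that the store's own expiry is already in the past here; the sentinel does not merely fail to revoke access, it extends it.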

4. Missing Event Handler

RevenueCat sends UNCANCELLATION events when users re-enable auto-renew. We had no handler, so those users stayed force-expired until their next renewal.

5. Inverted Dependencies

The controller imported handler functions from the resource layer. Dependencies should point one way, Resource → Controller → Domain → Repository, with no layer importing from the layer that calls it.
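A rule like this is easier to keep when it is checked automatically. Below is a minimal sketch using Python's standard ast module that reports imports reaching into a layer a module should not depend on. The package names (resources, domain) mirror the directories listed at the end of this post, but how they are actually imported is an assumption; the function name is hypothetical, and relative imports are deliberately not resolved.

```python
import ast
from typing import Iterable, List


def upward_imports(source: str, forbidden_prefixes: Iterable[str]) -> List[str]:
    """Return modules imported by `source` that live in a forbidden layer.

    Relative imports (`from . import x`) are not resolved in this sketch.
    """
    prefixes = tuple(forbidden_prefixes)
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any of its submodules
            if any(name == p or name.startswith(p + ".") for p in prefixes):
                found.append(name)
    return found


# A controller may import from the domain layer, never from the resource layer.
# `handle_renewal` is a hypothetical name for one of the old handlers.
controller_source = """
import json
from domain.subscription.membership_service import MembershipService
from resources.revenuecat import handle_renewal
"""
print(upward_imports(controller_source, ["resources"]))  # ['resources.revenuecat']
```

Run over every file in src/controllers/ from a unit test, a non-empty result would have flagged the original inversion before it shipped.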

6. Idempotency Window Too Short

Our idempotency TTL was 1 hour, but RevenueCat retries webhooks for up to 18 hours. Any retry arriving more than an hour after the original delivery was processed a second time.
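The TTL only has to outlast the provider's retry schedule. A minimal in-memory sketch, assuming the key comes from a helper like _idempotency_key(); the class name, the injectable clock, and the replay below are illustrative, and the 18-hour figure is the one stated above.

```python
import time
from typing import Callable, Dict

# 24 hours in seconds, the TTL chosen in this post (86,400)
IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60


class InMemoryIdempotencyStore:
    """Hypothetical single-process dedup store with a per-key expiry.

    Entries are never evicted here; in production a shared store with native
    expiry, such as Redis `SET key 1 NX EX 86400`, would play this role.
    """

    def __init__(self, ttl_seconds: int = IDEMPOTENCY_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def first_delivery(self, key: str) -> bool:
        """Remember `key` and return True, unless it was seen within the TTL."""
        now = self._clock()
        expires_at = self._expires_at.get(key)
        if expires_at is not None and now < expires_at:
            return False  # duplicate retry inside the window: skip processing
        self._expires_at[key] = now + self._ttl
        return True


# Replay a retry that arrives 18 hours after the first delivery
fake_now = [0.0]
old_store = InMemoryIdempotencyStore(ttl_seconds=60 * 60, clock=lambda: fake_now[0])
new_store = InMemoryIdempotencyStore(clock=lambda: fake_now[0])
old_store.first_delivery("evt_1")
new_store.first_delivery("evt_1")
fake_now[0] = 18 * 60 * 60
print(old_store.first_delivery("evt_1"))  # True: the 1-hour window reprocesses it
print(new_store.first_delivery("evt_1"))  # False: the 24-hour window drops it
```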

7. No Value Objects

Magic dates were scattered throughout the code with no type safety or semantic meaning.

The Solution: Domain-Driven Design

We applied DDD principles to separate concerns and make the domain model explicit.

Architecture After

┌──────────────────────────────────────────────────────────┐
│  revenuecat.py (HTTP Resource - Thin Layer)              │
│  • HTTP auth validation only                             │
│  • Delegates to controller                               │
└────────────────┬─────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────┐
│  revenuecat_controller.py (Application Layer)            │
│  • Idempotency checks (24h TTL)                          │
│  • Routes events to domain service                       │
│  • Analytics tracking                                    │
└────────────────┬─────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────┐
│  domain/subscription/membership_service.py               │
│  • activate_subscription()                               │
│  • renew_subscription()                                  │
│  • cancel_subscription()                                 │
│  • expire_subscription()                                 │
│  • uncancellation() ← NEW                                │
│  • record_billing_issue()                                │
│  • record_product_change()                               │
│  • handle_subscriber_alias()                             │
└────────────────┬─────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────┐
│  domain/subscription/membership_status.py (Value Object) │
│  • FORCE_EXPIRED = datetime(2077, 1, 1)                  │
│  • SOFT_DELETED = datetime(2000, 1, 1)                   │
│  • is_force_expired(date) → bool                         │
│  • is_soft_deleted(date) → bool                          │
└──────────────────────────────────────────────────────────┘
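The controller's "routes events to domain service" step can be sketched as a dispatch table. The method names come from the diagram above; the one-to-one mapping from RevenueCat event types, the helper names, and the convention that the controller passes already-extracted arguments as keyword arguments are all assumptions of this sketch.

```python
from typing import Any, Dict, List, Mapping

# RevenueCat event type -> MembershipService method (names from the diagram;
# the mapping itself is an assumption)
EVENT_HANDLERS: Dict[str, str] = {
    "INITIAL_PURCHASE": "activate_subscription",
    "RENEWAL": "renew_subscription",
    "CANCELLATION": "cancel_subscription",
    "UNCANCELLATION": "uncancellation",
    "EXPIRATION": "expire_subscription",
    "BILLING_ISSUE": "record_billing_issue",
    "PRODUCT_CHANGE": "record_product_change",
    "SUBSCRIBER_ALIAS": "handle_subscriber_alias",
}


def route_event(service: Any, event: Mapping[str, Any], **arguments: Any) -> bool:
    """Call the service method for this event type with pre-extracted arguments.

    Returns False for types with no handler, so the caller can log and
    acknowledge them instead of failing the request.
    """
    method_name = EVENT_HANDLERS.get(event.get("type", ""))
    if method_name is None:
        return False
    getattr(service, method_name)(**arguments)
    return True


def missing_handlers(service_cls: type) -> List[str]:
    """Event types whose mapped method is absent from the service class."""
    return sorted(
        event_type
        for event_type, method_name in EVENT_HANDLERS.items()
        if not callable(getattr(service_cls, method_name, None))
    )
```

A table also makes gaps visible: a missing UNCANCELLATION case becomes a missing dictionary entry rather than an absent if branch, and missing_handlers() can run in a unit test to catch an entry that points at a method the service does not have.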

The Bug Fix

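# domain/subscription/membership_status.py - NEW
from datetime import datetime, timezone
from typing import Optional


class MembershipSentinels:
    FORCE_EXPIRED = datetime(2077, 1, 1, tzinfo=timezone.utc)
    SOFT_DELETED = datetime(2000, 1, 1, tzinfo=timezone.utc)

    @classmethod
    def is_force_expired(cls, date: Optional[datetime]) -> bool:
        """Check if date indicates force-expired status."""
        if date is None:
            return False
        if date.tzinfo is None:
            # Naive values (e.g. from a MySQL DATETIME column) are treated as UTC;
            # aware values are compared as-is, so other offsets compare correctly
            date = date.replace(tzinfo=timezone.utc)
        return date == cls.FORCE_EXPIRED

# membership.py - AFTER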
def is_valid(self):
    # Early return: force-expired sentinels mean invalid
    if MembershipSentinels.is_force_expired(self.enforced_expires_date):
        return False  # FIX: Cancelled memberships now correctly invalid

    # Normal expiration check
    now = datetime.now(timezone.utc)
    applied = self.applied_expires_date
    return now < applied

The fix has two parts:

  1. Value object makes sentinel dates explicit and type-safe
  2. Early return in is_valid() checks the sentinel before date comparison
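To see both parts working together, the sketch below reimplements the fixed check on a hypothetical minimal model and runs it against each sentinel state. It mirrors MembershipSentinels.is_force_expired() as a module-level function for brevity and treats naive datetimes as UTC, which is an assumption about how the dates come back from the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

FORCE_EXPIRED = datetime(2077, 1, 1, tzinfo=timezone.utc)
SOFT_DELETED = datetime(2000, 1, 1, tzinfo=timezone.utc)


def is_force_expired(date: Optional[datetime]) -> bool:
    """Module-level stand-in for MembershipSentinels.is_force_expired()."""
    if date is None:
        return False
    if date.tzinfo is None:
        date = date.replace(tzinfo=timezone.utc)  # assume naive values are UTC
    return date == FORCE_EXPIRED


@dataclass
class Membership:
    """Hypothetical minimal model holding only the fields is_valid() reads."""
    expires_date: datetime
    enforced_expires_date: Optional[datetime] = None

    def is_valid(self) -> bool:
        # 1. Special case first: the sentinel always means "no access"
        if is_force_expired(self.enforced_expires_date):
            return False
        # 2. Otherwise compare against the applied (enforced or store) expiry
        applied = (self.enforced_expires_date
                   if self.enforced_expires_date is not None else self.expires_date)
        return datetime.now(timezone.utc) < applied


next_month = datetime.now(timezone.utc) + timedelta(days=30)
cases = {
    "normal": Membership(next_month),
    "force-expired": Membership(next_month, FORCE_EXPIRED),
    "force-expired, naive": Membership(next_month, datetime(2077, 1, 1)),
    "soft-deleted": Membership(next_month, SOFT_DELETED),
}
for label, membership in cases.items():
    print(f"{label}: {membership.is_valid()}")
# normal: True
# force-expired: False
# force-expired, naive: False
# soft-deleted: False
```

The soft-deleted case needs no special branch: its 2000-01-01 sentinel is already in the past, so the ordinary date comparison rejects it.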

The Domain Service

Before: scattered handlers mixed with HTTP code. After: a clean domain service with a single responsibility.

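# domain/subscription/membership_service.py
from datetime import datetime

# Membership, MembershipsRepository, UsersRepository, MembershipNotFoundError and
# MembershipSentinels come from the project's own modules (import paths omitted).


class MembershipService:
    def __init__(self, membership_repo: MembershipsRepository,
                 user_repo: UsersRepository):
        self.membership_repo = membership_repo
        self.user_repo = user_repo

    def cancel_subscription(self, user_id: int, product_id: int,
                            cancelled_at: datetime) -> Membership: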
        """Cancel subscription - mark as force-expired."""
        membership = self.membership_repo.find_by_user_and_product(
            user_id, product_id
        )
        if not membership:
            raise MembershipNotFoundError(...)

        # Set sentinel date to mark as cancelled
        membership.enforced_expires_date = MembershipSentinels.FORCE_EXPIRED
        membership.auto_renew_enabled = False
        membership.updated_at = cancelled_at

        self.membership_repo.save(membership)
        return membership

    def uncancellation(self, user_id: int, product_id: int,
                       uncancelled_at: datetime, new_expiry: datetime) -> Membership:
        """Handle re-enabled auto-renew - restore access."""
        membership = self.membership_repo.find_by_user_and_product(
            user_id, product_id
        )
        if not membership:
            raise MembershipNotFoundError(...)

        # Clear sentinel - back to normal state
        membership.enforced_expires_date = None
        membership.expires_date = new_expiry
        membership.auto_renew_enabled = True
        membership.updated_at = uncancelled_at

        self.membership_repo.save(membership)
        return membership

Each method has one job. No HTTP concerns. Pure domain logic.
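Because the service only talks to a repository, it can be exercised without Flask, Lambda, or a database. In this self-contained sketch an in-memory repository stands in for MembershipsRepository, and the service is trimmed to the two methods shown above (user_repo dropped, a `_load` helper factored out). The in-memory class, the helper, and the exception's base class are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple

FORCE_EXPIRED = datetime(2077, 1, 1, tzinfo=timezone.utc)


class MembershipNotFoundError(LookupError):
    """Raised when no membership exists for a user/product pair."""


@dataclass
class Membership:
    user_id: int
    product_id: int
    expires_date: datetime
    enforced_expires_date: Optional[datetime] = None
    auto_renew_enabled: bool = True
    updated_at: Optional[datetime] = None


class InMemoryMembershipsRepository:
    """Hypothetical test double with the two methods the service relies on."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[int, int], Membership] = {}

    def find_by_user_and_product(self, user_id: int,
                                 product_id: int) -> Optional[Membership]:
        return self._rows.get((user_id, product_id))

    def save(self, membership: Membership) -> None:
        self._rows[(membership.user_id, membership.product_id)] = membership


class MembershipService:
    """Trimmed to cancel/uncancel; user_repo is omitted for brevity."""

    def __init__(self, membership_repo: InMemoryMembershipsRepository) -> None:
        self.membership_repo = membership_repo

    def _load(self, user_id: int, product_id: int) -> Membership:
        membership = self.membership_repo.find_by_user_and_product(user_id, product_id)
        if membership is None:
            raise MembershipNotFoundError(f"user={user_id} product={product_id}")
        return membership

    def cancel_subscription(self, user_id: int, product_id: int,
                            cancelled_at: datetime) -> Membership:
        membership = self._load(user_id, product_id)
        membership.enforced_expires_date = FORCE_EXPIRED  # sentinel: no access
        membership.auto_renew_enabled = False
        membership.updated_at = cancelled_at
        self.membership_repo.save(membership)
        return membership

    def uncancellation(self, user_id: int, product_id: int,
                       uncancelled_at: datetime, new_expiry: datetime) -> Membership:
        membership = self._load(user_id, product_id)
        membership.enforced_expires_date = None  # clear the sentinel
        membership.expires_date = new_expiry
        membership.auto_renew_enabled = True
        membership.updated_at = uncancelled_at
        self.membership_repo.save(membership)
        return membership


# CANCELLATION followed by UNCANCELLATION, with no HTTP layer involved
now = datetime.now(timezone.utc)
repo = InMemoryMembershipsRepository()
repo.save(Membership(user_id=1, product_id=7, expires_date=now + timedelta(days=10)))
service = MembershipService(repo)
service.cancel_subscription(1, 7, now)
restored = service.uncancellation(1, 7, now, now + timedelta(days=30))
print(restored.enforced_expires_date, restored.auto_renew_enabled)  # None True
```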

Webhook Event State Machine

                    INITIAL_PURCHASE
                          │
                          ▼
                  ┌───────────────┐
                  │    ACTIVE     │
                  │ enforced=NULL │
                  └───────┬───────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        │ RENEWAL         │ CANCELLATION    │ EXPIRATION
        ▼                 ▼                 ▼
┌───────────────┐  ┌──────────────────┐  ┌──────────────────┐
│    ACTIVE     │  │ FORCE_EXPIRED    │  │ FORCE_EXPIRED    │
│ (refreshed)   │  │ enforced=2077    │  │ enforced=2077    │
└───────┬───────┘  └─────────┬────────┘  └──────────────────┘
        │                    │
        │ BILLING_ISSUE      │ UNCANCELLATION
        ▼                    ▼
┌───────────────┐  ┌──────────────────┐
│ ACTIVE        │  │    ACTIVE        │
│ (flagged)     │  │ enforced=NULL    │
└───────────────┘  └──────────────────┘

Sentinel States:
• enforced_expires_date = NULL      → Normal (use expires_date)
• enforced_expires_date = 2077-01-01 → Force-expired (no access)
• enforced_expires_date = 2000-01-01 → Soft-deleted (cleanup)

The state machine clarifies that UNCANCELLATION restores access by clearing the sentinel. Before the refactor, this transition was missing.
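The same diagram can be written as a lookup table, which turns "map all edges" into something a test can check. A minimal sketch under these assumptions: states are plain strings, the "refreshed" and "flagged" variants are folded into ACTIVE, unlisted (state, event) pairs leave the state unchanged, and the RENEWAL edge out of FORCE_EXPIRED reflects the renewal-after-cancellation behaviour this post tests for.

```python
from typing import Dict, List, Tuple

# States from the diagram above; "refreshed" and "flagged" are folded into ACTIVE
ACTIVE = "ACTIVE"
FORCE_EXPIRED = "FORCE_EXPIRED"
EVENTS = ("RENEWAL", "BILLING_ISSUE", "CANCELLATION", "EXPIRATION", "UNCANCELLATION")

# (current state, event) -> next state
TRANSITIONS: Dict[Tuple[str, str], str] = {
    (ACTIVE, "RENEWAL"): ACTIVE,
    (ACTIVE, "BILLING_ISSUE"): ACTIVE,  # still active, just flagged
    (ACTIVE, "CANCELLATION"): FORCE_EXPIRED,
    (ACTIVE, "EXPIRATION"): FORCE_EXPIRED,
    (FORCE_EXPIRED, "UNCANCELLATION"): ACTIVE,  # the edge that was missing
    (FORCE_EXPIRED, "RENEWAL"): ACTIVE,  # renewal after cancellation
}


def next_state(state: str, event: str) -> str:
    """Apply one webhook event; pairs with no entry keep the current state."""
    return TRANSITIONS.get((state, event), state)


def unmapped_pairs() -> List[Tuple[str, str]]:
    """Every (state, event) pair without an explicit entry, for review in a test."""
    return [(s, e) for s in (ACTIVE, FORCE_EXPIRED) for e in EVENTS
            if (s, e) not in TRANSITIONS]


state = next_state(ACTIVE, "CANCELLATION")   # "FORCE_EXPIRED"
state = next_state(state, "UNCANCELLATION")  # "ACTIVE"
print(state)  # ACTIVE
```

Deleting the UNCANCELLATION entry reproduces the pre-refactor behaviour: the second call returns FORCE_EXPIRED, and ("FORCE_EXPIRED", "UNCANCELLATION") shows up in unmapped_pairs() for someone to decide on explicitly.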

New Repository Method

Domain services need domain-level queries, not raw SQL.

# repositories/memberships.py - ADDED
def find_by_user_and_product(self, user_id: int,
                             product_id: int) -> Optional[Membership]:
    """Find the non-deleted membership for a user and product, if any."""
    return self.db_session.query(Membership).filter(
        Membership.user_id == user_id,
        Membership.product_id == product_id,
        Membership.deleted_at.is_(None)
    ).first()

This replaced scattered queries in event handlers.

Test Coverage

Three new regression tests ensure the bug stays fixed:

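from datetime import datetime, timedelta, timezone

# Assumed test scaffolding (not shown): a pytest `service` fixture whose repository
# returns the same Membership objects that create_active_membership() persists,
# plus the project import for MembershipSentinels.

def now() -> datetime:
    return datetime.now(timezone.utc)

def future_date() -> datetime:
    return now() + timedelta(days=30)

def test_uncancellation_restores_access(service):
    """UNCANCELLATION clears force-expired sentinel."""
    # Setup: cancelled membership
    membership = create_active_membership()
    user_id, product_id = membership.user_id, membership.product_id
    membership.enforced_expires_date = MembershipSentinels.FORCE_EXPIRED
    assert not membership.is_valid()

    # Act: uncancellation event
    service.uncancellation(user_id, product_id, now(), future_date())

    # Assert: access restored
    assert membership.is_valid()
    assert membership.enforced_expires_date is None

def test_cancellation_makes_membership_invalid(service):
    """Cancelled memberships return False from is_valid()."""
    membership = create_active_membership()
    user_id, product_id = membership.user_id, membership.product_id
    assert membership.is_valid()

    service.cancel_subscription(user_id, product_id, now())

    assert not membership.is_valid()  # Would have failed before fix
    assert MembershipSentinels.is_force_expired(membership.enforced_expires_date)

def test_renewal_after_cancel_restores_access(service):
    """RENEWAL after CANCELLATION clears sentinel and grants access."""
    membership = create_active_membership()
    user_id, product_id = membership.user_id, membership.product_id
    service.cancel_subscription(user_id, product_id, now())
    assert not membership.is_valid()

    service.renew_subscription(user_id, product_id, now(), future_date())

    assert membership.is_valid()
    assert membership.enforced_expires_date is None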

Lessons Learned

1. Sentinels Need Semantics

Magic values without context breed bugs. Our 2077-01-01 date looked like an expiration date, so is_valid() treated it like one. Value objects make intent explicit.

2. Early Returns Save Lives

Adding if MembershipSentinels.is_force_expired(self.enforced_expires_date): return False before the date comparison was a one-line fix that prevented the logic error. Check special cases first.

3. Business Logic Belongs in the Domain Layer

When HTTP handlers contain business rules, those rules can't be reused (CLI tools, async jobs, tests). Moving to a domain service made the logic portable and testable. Separate IO from logic.

4. Webhook Idempotency Windows Should Match Provider Retry Policies

Our 1-hour TTL was arbitrary. RevenueCat documents their ~18-hour retry window. We set our TTL to 24 hours (86,400 seconds) to safely cover it. Read the docs.

5. Missing State Transitions Cause Real Problems

The UNCANCELLATION event existed in RevenueCat's API, but we never handled it. Users who re-enabled auto-renew stayed locked out until the next renewal. Map all edges in your state machine.

6. DRY Isn't Just About Lines of Code

Duplicated _idempotency_key() functions weren't the worst problem. Duplicated concepts (sentinel dates without names) were. Extract knowledge, not just code.

7. Test the Bugs That Happened

Our new tests directly reproduce the original bug scenario: cancelled user calls is_valid(). If we ever regress, the test fails immediately. Tests encode production failures.

Conclusion

This refactor took three days and touched 15 files. It fixed a critical security bug, eliminated 200 lines of duplicate code, added missing functionality, and made the codebase easier to maintain.

More importantly, it made the domain model visible. Before, "force-expired" was a comment in one function. After, it's a named constant with predicate methods. The business logic isn't hidden in HTTP handlers; it lives in a domain service with a clear API.

If your webhook handlers are growing unwieldy, or if you're storing magic values without semantic meaning, consider this refactor pattern:

  1. Extract value objects for domain constants
  2. Create a domain service for business logic
  3. Fix bugs by making special cases explicit (early returns)
  4. Add missing state transitions
  5. Write regression tests for the bugs you found

Your future self (and your cancelled users) will thank you.


Stack: Python 3.12, Flask, AWS Lambda, MySQL 8.0, RevenueCat webhooks
Code: Changes spanned src/domain/subscription/, src/controllers/, src/resources/, and the test suites
Impact: Fixed a critical access-control bug, reduced code duplication by 30%, added the missing event handler, and increased test coverage for the subscription module from 60% to 94%