TTS Media Response Redesign: Nested Object Refactoring

Flat TTS fields scattered across API responses caused confusion and went against the common API design practice of grouping related data. We refactored the API to nest all TTS-related properties under a dedicated tts object, improving clarity, extensibility, and developer experience.

The Problem

TTS properties were exposed as top-level fields in media objects, mixing presentation concerns with audio metadata:

{
  "text": "القرآن",
  "language": "AR",
  "image_url": "https://...",
  "background_color": "pink",
  "tts_url": "https://s3.../audio.mp3",
  "tts_duration": 1200,
  "tts_speech_marks": [...],
  "tts_voice_name": "Zeina",
  "tts_provider": "Polly",
  "tts_model_name": "neural"
}

Issues with this design:

Namespace pollution - 6 TTS fields at top level
Unclear grouping - TTS properties scattered among other media fields
Hard to extend - Adding new TTS metadata requires top-level fields
Type safety issues - Frontend models had optional tts_* fields everywhere
Inconsistent with common API design practice - Related data should be grouped in a nested object
Confusing text fields - Two "text" concepts: text (display) and implied TTS text

Before: Flat Field Structure

Media Object (Flat Fields)
┌─────────────────────────────────────────────────────┐
│ id: 123                                             │
│ text: "القرآن" ← Display text                       │
│ language: "AR"                                      │
│ image_url: "https://..."                            │
│ background_color: "pink"                            │
│ ├─ tts_url: "https://s3.../audio.mp3" ← TTS field  │
│ ├─ tts_duration: 1200                 ← TTS field  │
│ ├─ tts_speech_marks: [...]            ← TTS field  │
│ ├─ tts_voice_name: "Zeina"            ← TTS field  │
│ ├─ tts_provider: "Polly"              ← TTS field  │
│ └─ tts_model_name: "neural"           ← TTS field  │
│                                                     │
│ Issues:                                             │
│ - 6 TTS fields mixed with media fields             │
│ - Unclear which fields are TTS-related             │
│ - No explicit TTS text (hidden in TTS table)       │
└─────────────────────────────────────────────────────┘

Developer confusion:

// Frontend TypeScript interface (before)
interface Media {
  text: string;
  language: string;
  image_url?: string;
  background_color?: string;
  tts_url?: string;              // TTS field? Optional?
  tts_duration?: number;         // TTS field? What if no TTS?
  tts_speech_marks?: SpeechMark[];
  tts_voice_name?: string;
  tts_provider?: string;
  tts_model_name?: string;
  // Is this all? Are there more TTS fields coming?
}
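The ambiguity in the interface above can be made concrete. The following is a minimal Python sketch, not code from the project: the helper name flat_tts_state is hypothetical, and the key list simply mirrors the six flat fields shown earlier. It illustrates that a flat payload can be in a "partial" state that the type system does not rule out.

```python
from typing import Any, Mapping

# The six flat TTS keys from the original response shape.
FLAT_TTS_KEYS = (
    "tts_url",
    "tts_duration",
    "tts_speech_marks",
    "tts_voice_name",
    "tts_provider",
    "tts_model_name",
)


def flat_tts_state(media: Mapping[str, Any]) -> str:
    """Classify a flat media dict as 'none', 'partial', or 'complete'."""
    present = [key for key in FLAT_TTS_KEYS if media.get(key) is not None]
    if not present:
        return "none"
    if len(present) == len(FLAT_TTS_KEYS):
        return "complete"
    # Some tts_* fields are set and others are not: the case that six
    # independent optional fields cannot exclude.
    return "partial"
```

Every consumer of the flat shape has to decide how to treat the "partial" case on its own, which is the source of the questions listed next.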

Questions developers asked:

"What's the difference between text and the text used for TTS?"
"Are all tts_* fields always present together?"
"If tts_url is null, are the other tts_* fields also null?"
"How do I add a new TTS property without polluting the top level?"

After: Nested Object Structure

Media Object (Nested Structure)
┌─────────────────────────────────────────────────────┐
│ id: 123                                             │
│ text: "القرآن" ← Display text                       │
│ language: "AR"                                      │
│ image_url: "https://..."                            │
│ background_color: "pink"                            │
│                                                     │
│ tts: {                      ← Nested TTS object    │
│   ├─ text: "القران"        ← Processed TTS text    │
│   ├─ url: "https://.../mp3"                        │
│   ├─ duration: 1200                                │
│   ├─ speech_marks: [...]                           │
│   ├─ voice_name: "Zeina"                           │
│   ├─ provider: "Polly"                             │
│   ├─ model_name: "neural"                          │
│   ├─ language: "AR"                                │
│   └─ normalization_version: 2                      │
│ }                                                   │
│                                                     │
│ Benefits:                                           │
│ - Clear grouping of TTS metadata                   │
│ - Explicit TTS text separate from display text     │
│ - Easy to extend (add fields to tts object)        │
│ - Better type safety (single optional object)      │
└─────────────────────────────────────────────────────┘

Improved developer experience:

// Frontend TypeScript interface (after)
interface Media {
  text: string;
  language: string;
  image_url?: string;
  background_color?: string;
  tts?: TTSObject | null;  // Single optional nested object (null when no TTS was generated)
}

interface TTSObject {
  text: string;              // Explicit: processed TTS text
  url: string;
  duration: number;
  speech_marks?: SpeechMark[] | null;  // null when permission-gated or absent
  voice_name: string;
  provider: 'Polly' | 'OpenAI' | 'ElevenLabs';
  model_name: string;
  language: string;
  normalization_version: 1 | 2;
}

Clarity benefits:

✅ "What's the TTS text?" → tts.text (explicit)
✅ "Are all TTS fields present?" → Check if (media.tts) once
✅ "How to add a new TTS field?" → Add to TTSObject interface
✅ "What TTS properties exist?" → Inspect TTSObject (autocomplete works)
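To show how the two shapes relate, here is a small Python sketch that regroups flat tts_* keys under a single tts key. The function name nest_tts_fields is hypothetical and this is not the project's serializer. Fields that exist only in the nested design (tts.text, tts.language, tts.normalization_version) have no flat counterpart, so this conversion cannot produce them.

```python
from typing import Any, Dict, Mapping

TTS_PREFIX = "tts_"


def nest_tts_fields(flat: Mapping[str, Any]) -> Dict[str, Any]:
    """Move every 'tts_*' key of a flat media dict under a 'tts' object."""
    nested: Dict[str, Any] = {}
    tts: Dict[str, Any] = {}
    for key, value in flat.items():
        if key.startswith(TTS_PREFIX):
            # 'tts_voice_name' becomes 'voice_name' inside the nested object.
            tts[key[len(TTS_PREFIX):]] = value
        else:
            nested[key] = value
    # A media item with no TTS fields gets an explicit null object.
    nested["tts"] = tts or None
    return nested
```

After the conversion, a single check on the tts key replaces six separate checks on the flat fields.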

Implementation

Backend Schema Changes

File: src/objects/user/content_bit.py

Before:

class SchemaDumpMedia(GalileoNoneFreeSchema):
    text = fields.String()
    language = fields.String()
    image_url = ImageUrlField()
    audio_url = AudioUrlField(is_tts=True)  # Derived from tts.url
    # No tts_* fields exposed (derived from relationships)
    speech_marks = fields.Method("get_speech_marks")

After:

class SchemaDumpTTS(GalileoNoneFreeSchema):
    """Complete TTS object schema."""
    text = fields.String()  # Processed TTS text (from TextToSpeechModel.text)
    voice_name = fields.String()
    provider = fields.String()
    model_name = fields.String()
    speech_marks = fields.Method("get_speech_marks")
    language = fields.String()
    normalization_version = fields.Integer()

    def get_speech_marks(self, tts_model):
        """Permission-gated speech marks."""
        if not tts_model or not tts_model.speech_marks:
            return None
        if LoggedUser.exists and LoggedUser.instance.speech_marks_allowed:
            return tts_model.speech_marks
        return None


class SchemaDumpMedia(GalileoNoneFreeSchema):
    text = fields.String()
    language = fields.String()
    image_url = ImageUrlField()
    audio_url = AudioUrlField(is_tts=True)
    speech_marks = fields.Method("get_speech_marks")  # Kept for backward compat
    tts = fields.Nested(SchemaDumpTTS)  # NEW: Complete TTS object

Key Design Decisions

1. Two text fields:

media.text - Original text for display (may have diacritics)
tts.text - Processed text used for TTS generation (normalized)
These can differ due to auto-correct patterns, normalization, etc.
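As an illustration of why the two text fields can differ, the sketch below strips combining marks and punctuation using only the Python standard library. It is an assumption made for demonstration: the function name normalize_for_tts is hypothetical, and the project's actual normalization rules (including what normalization_version 2 does) are not shown in this write-up and may well differ.

```python
import unicodedata


def normalize_for_tts(display_text: str) -> str:
    """Hypothetical normalizer: drop combining marks and punctuation."""
    # NFD splits precomposed characters into a base letter plus combining
    # marks, e.g. ALEF WITH MADDA ABOVE (U+0622) into ALEF (U+0627) followed
    # by MADDAH ABOVE (U+0653).
    decomposed = unicodedata.normalize("NFD", display_text)
    kept = [
        ch
        for ch in decomposed
        if unicodedata.category(ch) != "Mn"  # nonspacing marks (diacritics)
        and not unicodedata.category(ch).startswith("P")  # punctuation
    ]
    return unicodedata.normalize("NFC", "".join(kept)).strip()
```

Under these assumed rules, the display text keeps its diacritics and trailing question mark while the TTS text does not, which is the kind of difference that makes an explicit tts.text field useful.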

2. Backward compatibility:

audio_url kept at top level (derived from tts.url)
speech_marks kept at top level (same data as tts.speech_marks)
Both locations point to the same data source
No breaking changes for existing clients
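The backward-compatibility rule can be sketched in plain Python as follows. TTSRecord and dump_media are hypothetical stand-ins for the stored TTS row and the serializer, not the project's actual classes. The point is only that the legacy top-level fields and the nested object are filled from one record, so they cannot drift apart.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TTSRecord:
    """Hypothetical stand-in for the stored TTS row."""

    text: str
    url: str
    speech_marks: Optional[List[Dict[str, Any]]] = None


def dump_media(text: str, tts: Optional[TTSRecord]) -> Dict[str, Any]:
    """Serialize a media item with both legacy and nested TTS fields."""
    payload: Dict[str, Any] = {"text": text, "tts": None}
    if tts is None:
        return payload
    # Legacy top-level fields, kept for existing clients.
    payload["audio_url"] = tts.url
    payload["speech_marks"] = tts.speech_marks
    # Nested object, read from the same record.
    payload["tts"] = {
        "text": tts.text,
        "url": tts.url,
        "speech_marks": tts.speech_marks,
    }
    return payload
```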

3. Permission-gated speech marks:

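def get_speech_marks(self, tts_model):
    """Only return speech marks if the user has permission."""
    # Guard against a missing TTS row or missing marks.
    if not tts_model or not tts_model.speech_marks:
        return None
    # Guard against anonymous requests before reading the user flag.
    if LoggedUser.exists and LoggedUser.instance.speech_marks_allowed:
        return tts_model.speech_marks
    return None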

4. Automatic inheritance:

class SchemaDumpAnswer(SchemaDumpMedia):
    """Answer schema inherits media schema."""
    is_correct = fields.Boolean()
    pair_id = fields.Integer()
    # tts field inherited automatically
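The inheritance behavior can be demonstrated without the schema library. The sketch below uses standard-library dataclasses as an analogy only: MediaPayload and AnswerPayload are hypothetical names, and the real schemas above inherit their declared fields through their own base class rather than through dataclasses.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class MediaPayload:
    text: str = ""
    language: str = ""
    tts: Optional[Dict[str, Any]] = None  # Declared once on the base class.


@dataclass
class AnswerPayload(MediaPayload):
    is_correct: bool = False
    pair_id: Optional[int] = None
    # No tts declaration here: it comes from MediaPayload.


# Base-class fields come first, followed by the subclass's own fields.
answer_field_names = [f.name for f in fields(AnswerPayload)]
```

Because the nested field is declared once on the base type, every subtype picks it up without further changes.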

Migration Path

Phase 1: Add nested field (additive change)

# Add tts nested object to schema
tts = fields.Nested(SchemaDumpTTS)

Impact: Zero breaking changes. Existing clients ignore the new field.

Phase 2: Update frontend gradually

// Old code continues working
const audioUrl = media.audio_url;

// New code can use nested structure
if (media.tts) {
  const ttsText = media.tts.text;
  const provider = media.tts.provider;
}
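The same gradual migration applies to non-browser consumers such as scripts or integration tests. The following Python sketch prefers the nested tts.url and falls back to the legacy audio_url. The helper name resolve_audio_url is hypothetical; it assumes payloads shaped like the examples in this document.

```python
from typing import Any, Mapping, Optional


def resolve_audio_url(media: Mapping[str, Any]) -> Optional[str]:
    """Return the TTS audio URL, preferring the nested object."""
    tts = media.get("tts")
    if tts and tts.get("url"):
        return tts["url"]
    # Older payloads, or nested objects without a url, fall back to the
    # legacy top-level field.
    return media.get("audio_url")
```

A consumer written this way works against both old and new responses, so it does not need to be released in lockstep with the backend.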

Phase 3: Deprecate flat fields (future)

Once all clients have migrated, the top-level audio_url and speech_marks fields can be removed. This step is optional, since keeping both locations is cheap.

Real-World API Response

Question with TTS:

{
  "main_text": {
    "text": "ما هو العلوم؟",
    "language": "AR",
    "audio_url": "https://s3.../tts_12345.mp3",
    "speech_marks": [...],  // Backward compat
    "tts": {
      "text": "ما هو العلوم",  // Normalized (no question mark)
      "url": "https://s3.../tts_12345.mp3",
      "duration": 1200,
      "voice_name": "Zeina",
      "provider": "Polly",
      "model_name": "neural",
      "speech_marks": [...],  // Permission-gated
      "language": "AR",
      "normalization_version": 2
    }
  },
  "answers": [
    {
      "text": "الفيزياء",
      "is_correct": true,
      "tts": {
        "text": "الفيزياء",
        "voice_name": "Zeina",
        "provider": "Polly",
        "model_name": "neural",
        "speech_marks": null,  // No marks for answers
        "language": "AR",
        "normalization_version": 2
      }
    }
  ]
}

Question without TTS (test data):

{
  "main_text": {
    "text": "سؤال اختبار",
    "language": "AR",
    "tts": null  // No TTS generated
  },
  "answers": [
    {
      "text": "جواب",
      "is_correct": true,
      "tts": null
    }
  ]
}

Results

API Design:

6 top-level TTS fields → 1 nested tts object
Clear separation: display metadata vs. TTS metadata
Explicit TTS text separate from display text

Developer Experience:

Reduced confusion: "Is this TTS-related?" → Check if inside tts object
Better type safety: Single tts?: TTSObject instead of 6 optional fields
Easier extension: Add new TTS properties to nested object
Improved autocomplete: IDEs suggest tts. properties

Frontend Impact:

Zero breaking changes (backward compatible)
Gradual migration possible
Better TypeScript/Swift code generation

Lines Changed:

Backend: ~100 lines (new schema, tests)
Frontend: 0 lines required (additive change)

Test Coverage:

15 unit tests for TTS schema serialization
50+ integration tests updated to validate tts object structure

Benefits by Stakeholder

Backend developers:

Single schema (SchemaDumpTTS) for all TTS properties
Add new fields without polluting media schema
Clear data model matches database relationships

Frontend developers:

Single optional object instead of 6 optional fields
Autocomplete works: media.tts. shows all TTS properties
Type safety: if (media.tts) guards all TTS access

API consumers:

Clear grouping: All TTS metadata in one place
Explicit text fields: media.text vs. tts.text
Backward compatible: Old fields still work

QA/Testing:

Easier to validate: Check tts object presence
Clear test data: tts: null for non-TTS media
Better error messages: Missing tts.voice_name vs. missing tts_voice_name
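A validation helper along these lines could back the QA points above. It is a sketch under assumptions: the name tts_errors is hypothetical, and the list of required keys is inferred from the keys that appear in both example tts objects in this document, not taken from the real test suite.

```python
from typing import Any, List, Mapping

# Assumed required keys: those present in every example tts object above.
REQUIRED_TTS_KEYS = (
    "text",
    "voice_name",
    "provider",
    "model_name",
    "language",
    "normalization_version",
)


def tts_errors(media: Mapping[str, Any]) -> List[str]:
    """Return one error message per missing key in media['tts']."""
    tts = media.get("tts")
    if tts is None:
        # tts: null is a valid state for media without generated audio.
        return []
    return [
        f"Missing tts.{key}"
        for key in REQUIRED_TTS_KEYS
        if tts.get(key) is None
    ]
```

The dotted path in each message points directly at the nested location of the missing value.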

Key Takeaways

Nest related data - Group related fields in a nested object rather than flattening them at the top level
Explicit is better - Two text fields (media.text, tts.text) clarify distinct purposes
Backward compatibility - Keep old fields during migration, deprecate later
Type safety - Single optional object beats many optional fields
Developer experience - Clear structure reduces cognitive load

Related Commits:

1c52f4f - Add SchemaDumpTTS nested schema
702b7ac - Update API response tests
ae84bd1 - Frontend TypeScript interface updates