TTS Media Response Redesign: Nested Object Refactoring
Flat TTS fields scattered across API responses caused confusion and violated REST best practices. We refactored the API to nest all TTS-related properties under a dedicated tts object, improving clarity, extensibility, and developer experience.
The Problem
TTS properties were exposed as top-level fields in media objects, mixing presentation concerns with audio metadata:
{
  "text": "القرآن",
  "language": "AR",
  "image_url": "https://...",
  "background_color": "pink",
  "tts_url": "https://s3.../audio.mp3",
  "tts_duration": 1200,
  "tts_speech_marks": [...],
  "tts_voice_name": "Zeina",
  "tts_provider": "Polly",
  "tts_model_name": "neural"
}
Issues with this design:
- Namespace pollution - 6 TTS fields at top level
- Unclear grouping - TTS properties scattered among other media fields
- Hard to extend - Adding new TTS metadata requires top-level fields
- Type safety issues - Frontend models had optional tts_* fields everywhere
- Inconsistent with REST - Related data should be nested
- Confusing text fields - Two "text" concepts: text (display) and implied TTS text
Before: Flat Field Structure
Media Object (Flat Fields)
┌─────────────────────────────────────────────────────┐
│ id: 123 │
│ text: "القرآن" ← Display text │
│ language: "AR" │
│ image_url: "https://..." │
│ background_color: "pink" │
│ ├─ tts_url: "https://s3.../audio.mp3" ← TTS field │
│ ├─ tts_duration: 1200 ← TTS field │
│ ├─ tts_speech_marks: [...] ← TTS field │
│ ├─ tts_voice_name: "Zeina" ← TTS field │
│ ├─ tts_provider: "Polly" ← TTS field │
│ └─ tts_model_name: "neural" ← TTS field │
│ │
│ Issues: │
│ - 6 TTS fields mixed with media fields │
│ - Unclear which fields are TTS-related │
│ - No explicit TTS text (hidden in TTS table) │
└─────────────────────────────────────────────────────┘
Developer confusion:
// Frontend TypeScript interface (before)
interface Media {
  text: string;
  language: string;
  image_url?: string;
  background_color?: string;
  tts_url?: string; // TTS field? Optional?
  tts_duration?: number; // TTS field? What if no TTS?
  tts_speech_marks?: SpeechMark[];
  tts_voice_name?: string;
  tts_provider?: string;
  tts_model_name?: string;
  // Is this all? Are there more TTS fields coming?
}
Questions developers asked:
- "What's the difference between text and the text used for TTS?"
- "Are all tts_* fields always present together?"
- "If tts_url is null, are the other tts_* fields also null?"
- "How do I add a new TTS property without polluting the top level?"
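The ambiguity behind these questions can be made concrete. The sketch below uses a trimmed copy of the flat interface and a hypothetical helper, hasCompleteTts, to show that the compiler accepts a half-populated object, so every caller has to check each field by hand. The helper name, the placeholder URL, and the sample objects are illustrative assumptions, not code from the project.

```typescript
// Trimmed copy of the flat interface shown above. Every TTS field is
// independently optional, so partial combinations type-check.
interface FlatMedia {
  text: string;
  language: string;
  tts_url?: string;
  tts_duration?: number;
  tts_voice_name?: string;
  tts_provider?: string;
  tts_model_name?: string;
}

// Hypothetical helper: with flat fields, "is TTS usable?" means checking
// each field by hand, and this list must grow whenever a field is added.
function hasCompleteTts(media: FlatMedia): boolean {
  return (
    media.tts_url !== undefined &&
    media.tts_duration !== undefined &&
    media.tts_voice_name !== undefined &&
    media.tts_provider !== undefined &&
    media.tts_model_name !== undefined
  );
}

// Compiles without complaint even though the audio URL is missing.
const halfPopulated: FlatMedia = {
  text: "القرآن",
  language: "AR",
  tts_voice_name: "Zeina",
  tts_provider: "Polly",
};

const complete: FlatMedia = {
  text: "القرآن",
  language: "AR",
  tts_url: "https://example.com/audio.mp3", // placeholder URL
  tts_duration: 1200,
  tts_voice_name: "Zeina",
  tts_provider: "Polly",
  tts_model_name: "neural",
};

console.log(hasCompleteTts(halfPopulated)); // false
console.log(hasCompleteTts(complete)); // true
```

Nothing in the flat type ties the fields together, which is why "are they always present together?" could not be answered from the interface alone.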
After: Nested Object Structure
Media Object (Nested Structure)
┌─────────────────────────────────────────────────────┐
│ id: 123 │
│ text: "القرآن" ← Display text │
│ language: "AR" │
│ image_url: "https://..." │
│ background_color: "pink" │
│ │
│ tts: { ← Nested TTS object │
│ ├─ text: "القران" ← Processed TTS text │
│ ├─ url: "https://.../mp3" │
│ ├─ duration: 1200 │
│ ├─ speech_marks: [...] │
│ ├─ voice_name: "Zeina" │
│ ├─ provider: "Polly" │
│ ├─ model_name: "neural" │
│ ├─ language: "AR" │
│ └─ normalization_version: 2 │
│ } │
│ │
│ Benefits: │
│ - Clear grouping of TTS metadata │
│ - Explicit TTS text separate from display text │
│ - Easy to extend (add fields to tts object) │
│ - Better type safety (single optional object) │
└─────────────────────────────────────────────────────┘
Improved developer experience:
// Frontend TypeScript interface (after)
interface Media {
  text: string;
  language: string;
  image_url?: string;
  background_color?: string;
  tts?: TTSObject; // Single optional nested object
}

interface TTSObject {
  text: string; // Explicit: processed TTS text
  url: string;
  duration: number;
  speech_marks?: SpeechMark[];
  voice_name: string;
  provider: 'Polly' | 'OpenAI' | 'ElevenLabs';
  model_name: string;
  language: string;
  normalization_version: 1 | 2;
}
Clarity benefits:
- ✅ "What's the TTS text?" → tts.text (explicit)
- ✅ "Are all TTS fields present?" → Check if (media.tts) once
- ✅ "How to add a new TTS field?" → Add to TTSObject interface
- ✅ "What TTS properties exist?" → Inspect TTSObject (autocomplete works)
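To illustrate the "check once" point, here is a minimal sketch that uses reduced copies of the interfaces above (speech marks omitted for brevity). The helper describeTts and the sample object are hypothetical and exist only to show how a single guard narrows the whole group of properties.

```typescript
// Reduced copies of the interfaces above.
interface TTSObject {
  text: string;
  url: string;
  duration: number;
  voice_name: string;
  provider: "Polly" | "OpenAI" | "ElevenLabs";
  model_name: string;
  language: string;
  normalization_version: 1 | 2;
}

interface Media {
  text: string;
  language: string;
  tts?: TTSObject;
}

// Hypothetical helper: one guard narrows media.tts, after which every
// TTS property is known to be present and typed.
function describeTts(media: Media): string {
  if (!media.tts) {
    return "no TTS";
  }
  const { voice_name, provider, model_name, duration } = media.tts;
  return `${voice_name} (${provider}/${model_name}), ${duration}`;
}

const withTts: Media = {
  text: "القرآن",
  language: "AR",
  tts: {
    text: "القران",
    url: "https://example.com/audio.mp3", // placeholder URL
    duration: 1200,
    voice_name: "Zeina",
    provider: "Polly",
    model_name: "neural",
    language: "AR",
    normalization_version: 2,
  },
};

console.log(describeTts(withTts)); // "Zeina (Polly/neural), 1200"
console.log(describeTts({ text: "سؤال اختبار", language: "AR" })); // "no TTS"
```

Inside the guarded branch there are no further undefined checks, which is the practical difference from the flat layout.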
Implementation
Backend Schema Changes
File: src/objects/user/content_bit.py
Before:
class SchemaDumpMedia(GalileoNoneFreeSchema):
    text = fields.String()
    language = fields.String()
    image_url = ImageUrlField()
    audio_url = AudioUrlField(is_tts=True)  # Derived from tts.url
    # No tts_* fields exposed (derived from relationships)
    speech_marks = fields.Method("get_speech_marks")
After:
class SchemaDumpTTS(GalileoNoneFreeSchema):
    """Complete TTS object schema."""
    text = fields.String()  # Processed TTS text (from TextToSpeechModel.text)
    voice_name = fields.String()
    provider = fields.String()
    model_name = fields.String()
    speech_marks = fields.Method("get_speech_marks")
    language = fields.String()
    normalization_version = fields.Integer()

    def get_speech_marks(self, tts_model):
        """Permission-gated speech marks."""
        if not tts_model or not tts_model.speech_marks:
            return None
        if LoggedUser.exists and LoggedUser.instance.speech_marks_allowed:
            return tts_model.speech_marks
        return None


class SchemaDumpMedia(GalileoNoneFreeSchema):
    text = fields.String()
    language = fields.String()
    image_url = ImageUrlField()
    audio_url = AudioUrlField(is_tts=True)
    speech_marks = fields.Method("get_speech_marks")  # Kept for backward compat
    tts = fields.Nested(SchemaDumpTTS)  # NEW: Complete TTS object
Key Design Decisions
1. Two text fields:
- media.text - Original text for display (may have diacritics)
- tts.text - Processed text used for TTS generation (normalized)
- These can differ due to auto-correct patterns, normalization, etc.
2. Backward compatibility:
- audio_url kept at top level (derived from tts.url)
- speech_marks kept at top level (same data as tts.speech_marks)
- Both locations point to the same data source
- No breaking changes for existing clients
3. Permission-gated speech marks:
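def get_speech_marks(self, tts_model):
    """Only return speech marks if user has permission."""
    if not tts_model or not tts_model.speech_marks:
        return None
    # Guard against anonymous requests before reading the user instance
    if LoggedUser.exists and LoggedUser.instance.speech_marks_allowed:
        return tts_model.speech_marks
    return None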
4. Automatic inheritance:
class SchemaDumpAnswer(SchemaDumpMedia):
    """Answer schema inherits media schema."""
    is_correct = fields.Boolean()
    pair_id = fields.Integer()
    # tts field inherited automatically
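From the client's side, decisions 2 and 3 mean that a tts object can be present while its speech_marks are null, for example when the user lacks permission. The sketch below shows one way a frontend might choose a playback mode in that situation. PlaybackMode, pickPlaybackMode, and the reduced TtsAudio type are hypothetical names, and the SpeechMark shape is an assumption, since the real type is not shown in this post.

```typescript
// Assumed shape; the post does not define SpeechMark.
interface SpeechMark {
  time: number;
  value: string;
}

// Reduced, hypothetical view of the nested tts object.
interface TtsAudio {
  url: string;
  speech_marks?: SpeechMark[] | null; // null when gated or not generated
}

interface Media {
  text: string;
  tts?: TtsAudio | null;
}

type PlaybackMode = "highlighted" | "audio-only" | "text-only";

// Degrade gracefully: highlighting needs marks, audio needs a tts object.
function pickPlaybackMode(media: Media): PlaybackMode {
  if (!media.tts) {
    return "text-only";
  }
  const marks = media.tts.speech_marks;
  return marks && marks.length > 0 ? "highlighted" : "audio-only";
}

const gated: Media = {
  text: "الفيزياء",
  tts: { url: "https://example.com/a.mp3", speech_marks: null }, // placeholder URL
};

console.log(pickPlaybackMode(gated)); // "audio-only"
console.log(pickPlaybackMode({ text: "جواب", tts: null })); // "text-only"
```

Because the permission check lives on the server, the client only has to react to what it receives and never needs to know why the marks are missing.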
Migration Path
Phase 1: Add nested field (additive change)
# Add tts nested object to schema
tts = fields.Nested(SchemaDumpTTS)
Impact: Zero breaking changes. Existing clients ignore the new field.
Phase 2: Update frontend gradually
// Old code continues working
const audioUrl = media.audio_url;

// New code can use nested structure
if (media.tts) {
  const ttsText = media.tts.text;
  const provider = media.tts.provider;
}
Phase 3: Deprecate flat fields (future)
Once all clients have migrated, the top-level audio_url can be removed. This step is optional - keeping both is cheap.
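During the overlap between phases, a client may talk to responses that carry only the legacy field, only the nested object, or both. A small accessor keeps that decision in one place. The sketch below is an assumption about how a frontend could do this. MediaPayload and resolveAudioUrl are hypothetical names, and the URLs are placeholders.

```typescript
// Hypothetical, reduced payload type covering both locations of the URL.
interface MediaPayload {
  text: string;
  audio_url?: string | null; // legacy top-level field
  tts?: { url?: string | null } | null; // nested object
}

// Prefer the nested value, fall back to the legacy field, else null.
function resolveAudioUrl(media: MediaPayload): string | null {
  return media.tts?.url ?? media.audio_url ?? null;
}

const legacyOnly: MediaPayload = {
  text: "القرآن",
  audio_url: "https://example.com/legacy.mp3",
};

const both: MediaPayload = {
  text: "القرآن",
  audio_url: "https://example.com/legacy.mp3",
  tts: { url: "https://example.com/nested.mp3" },
};

console.log(resolveAudioUrl(legacyOnly)); // "https://example.com/legacy.mp3"
console.log(resolveAudioUrl(both)); // "https://example.com/nested.mp3"
console.log(resolveAudioUrl({ text: "جواب", tts: null })); // null
```

If the top-level field is ever removed in Phase 3, only this accessor needs to change.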
Real-World API Response
Question with TTS:
{
  "main_text": {
    "text": "ما هو العلوم؟",
    "language": "AR",
    "audio_url": "https://s3.../tts_12345.mp3",
    "speech_marks": [...], // Backward compat
    "tts": {
      "text": "ما هو العلوم", // Normalized (no question mark)
      "url": "https://s3.../tts_12345.mp3",
      "duration": 1200,
      "voice_name": "Zeina",
      "provider": "Polly",
      "model_name": "neural",
      "speech_marks": [...], // Permission-gated
      "language": "AR",
      "normalization_version": 2
    }
  },
  "answers": [
    {
      "text": "الفيزياء",
      "is_correct": true,
      "tts": {
        "text": "الفيزياء",
        "voice_name": "Zeina",
        "provider": "Polly",
        "model_name": "neural",
        "speech_marks": null, // No marks for answers
        "language": "AR",
        "normalization_version": 2
      }
    }
  ]
}
Question without TTS (test data):
{
  "main_text": {
    "text": "سؤال اختبار",
    "language": "AR",
    "tts": null // No TTS generated
  },
  "answers": [
    {
      "text": "جواب",
      "is_correct": true,
      "tts": null
    }
  ]
}
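Responses like these can be checked at runtime before the UI relies on them. The sketch below is one possible validator, not the project's own. checkTts and TtsCheck are hypothetical names, and the list of required fields is an assumption drawn from the examples above: url and duration are left out because the answer example omits them.

```typescript
// Assumed required string fields, taken from the fields that appear in
// every non-null tts object in the examples above.
const REQUIRED_STRING_FIELDS: string[] = [
  "text",
  "voice_name",
  "provider",
  "model_name",
  "language",
];

type TtsCheck =
  | { kind: "absent" }
  | { kind: "valid" }
  | { kind: "invalid"; missing: string[] };

// Classify a raw tts value: null/undefined means no TTS was generated.
function checkTts(value: unknown): TtsCheck {
  if (value === null || value === undefined) {
    return { kind: "absent" };
  }
  const record: Record<string, unknown> =
    typeof value === "object" ? (value as Record<string, unknown>) : {};
  const missing = REQUIRED_STRING_FIELDS.filter(
    (field) => typeof record[field] !== "string"
  );
  if (typeof record["normalization_version"] !== "number") {
    missing.push("normalization_version");
  }
  return missing.length === 0 ? { kind: "valid" } : { kind: "invalid", missing };
}

const answerTts = {
  text: "الفيزياء",
  voice_name: "Zeina",
  provider: "Polly",
  model_name: "neural",
  speech_marks: null,
  language: "AR",
  normalization_version: 2,
};

console.log(checkTts(null).kind); // "absent"
console.log(checkTts(answerTts).kind); // "valid"
```

Reporting the missing field names by their nested key keeps failures easy to trace back to the tts object.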
Results
API Design:
- 6 top-level TTS fields → 1 nested tts object
- Clear separation: display metadata vs. TTS metadata
- Explicit TTS text separate from display text
Developer Experience:
- Reduced confusion: "Is this TTS-related?" → Check if inside tts object
- Better type safety: Single tts?: TTSObject instead of 6 optional fields
- Easier extension: Add new TTS properties to nested object
- Improved autocomplete: IDEs suggest tts.* properties
Frontend Impact:
- Zero breaking changes (backward compatible)
- Gradual migration possible
- Better TypeScript/Swift code generation
Lines Changed:
- Backend: ~100 lines (new schema, tests)
- Frontend: 0 lines required (additive change)
Test Coverage:
- 15 unit tests for TTS schema serialization
- 50+ integration tests updated to validate tts object structure
Benefits by Stakeholder
Backend developers:
- Single schema (SchemaDumpTTS) for all TTS properties
- Add new fields without polluting media schema
- Clear data model matches database relationships
Frontend developers:
- Single optional object instead of 6 optional fields
- Autocomplete works: typing media.tts. shows all TTS properties
- Type safety: if (media.tts) guards all TTS access
API consumers:
- Clear grouping: All TTS metadata in one place
- Explicit text fields: media.text vs. tts.text
- Backward compatible: Old fields still work
QA/Testing:
- Easier to validate: Check tts object presence
- Clear test data: tts: null for non-TTS media
- Better error messages: Missing tts.voice_name vs. missing tts_voice_name
Key Takeaways
- Nest related data - Group related fields in nested objects, not flat top level
- Explicit is better - Two text fields (media.text, tts.text) clarify distinct purposes
- Backward compatibility - Keep old fields during migration, deprecate later
- Type safety - Single optional object beats many optional fields
- Developer experience - Clear structure reduces cognitive load
Related Commits:
- 1c52f4f - Add SchemaDumpTTS nested schema
- 702b7ac - Update API response tests
- ae84bd1 - Frontend TypeScript interface updates