Speech Marks Architecture: Accurate Arabic Text-to-Speech Indexing
Arabic TTS generates audio from normalized text, but the UI highlights the original text with diacritics. Speech mark indices computed against the normalized text don't align with positions in the original text, causing misaligned visual highlights during audio playback. We built a remapping system that achieves 99%+ alignment accuracy.
The Problem
When generating Arabic TTS audio with word-level timing data (speech marks), we faced a fundamental mismatch:
Original Text (Display): "القُرآنِ" (8 Unicode code points: base letters plus diacritics)
TTS Input (Normalized): "القران" (6 Unicode code points: base letters only)
Speech Marks Generated: [{time: 0, start: 0, end: 6, value: "القران"}]
UI Highlight Target: Characters 0-6 in "القُرآنِ" (wrong positions!)
The problem compounds with longer texts. A 50-word Arabic sentence with diacritics might have 300+ Unicode code points, while the normalized version has 200. Speech mark indices pointing to positions 50, 100, 150 in normalized text don't correspond to the same semantic positions in the diacritic-heavy original.
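The mismatch is easy to see at the Unicode level. A quick sketch (string literals written as escape sequences so the code point positions are unambiguous):

```python
import unicodedata

# "القُرآنِ" -- display text with damma and kasra attached
original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"
# "القران" -- normalized TTS input
normalized = "\u0627\u0644\u0642\u0631\u0627\u0646"

print(len(original), len(normalized))  # 8 6 -- the code point counts differ

# The extra code points are combining marks (Unicode category "Mn"),
# which render with zero width on top of the preceding letter
marks = [i for i, ch in enumerate(original) if unicodedata.category(ch) == "Mn"]
print(marks)  # [3, 7] -- the damma and the final kasra
```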
Impact:
- 22.9% of our TTS records had media.text ≠ tts.text
- Visual highlights appeared at wrong word boundaries
- Users saw mismatched audio-visual synchronization
- Particularly severe in Quranic and Classical Arabic content
Before: Direct Index Usage (Broken)
┌─────────────────────┐
│ TTS Provider │
│ (Google/Polly) │
│ │
│ Input: "القران" │ ← Normalized (no diacritics)
│ Output: MP3 + marks │
└──────────┬──────────┘
│
│ speech_marks = [
│ {time: 0, start: 0, end: 6, value: "القران"}
│ ]
▼
┌─────────────────────┐
│ API Response │
│ │
│ text: "القُرآنِ" │ ← Original (with diacritics)
│ speech_marks: [ │
│ {start: 0, end: 6}│ ← Indices for NORMALIZED text
│ ] │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Mobile UI │
│ │
│ Highlights chars │
│ 0-6 in "القُرآنِ" │
│ Results: "القُرآ" │ ✗ Missing final letter
│ Expected: "القُرآنِ"│
└─────────────────────┘
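The broken flow above can be reproduced in two lines: slicing the original text with an index that is only valid for the normalized string truncates the word (escapes used for the literal):

```python
# "القُرآنِ" (8 code points); the TTS provider's mark ends at
# len("القران") == 6, which is only meaningful in the normalized text
original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"

highlighted = original[0:6]  # naive: apply the normalized-text index directly
print(highlighted)  # "القُرآ" -- the final noon and its kasra are cut off
```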
After: Character-Level Remapping (Fixed)
┌─────────────────────┐ ┌─────────────────────┐
│ TTS Provider │ │ Remapping Engine │
│ │ │ (remap_speech_marks)│
│ Input: "القران" │ │ │
│ Output: marks │──────>│ 1. Strip diacritics │
│ (normalized indices)│ │ from both texts │
└─────────────────────┘ │ 2. Build char map │
│ 3. Remap indices │
└──────────┬──────────┘
│
│ Remapped marks
▼
┌─────────────────────┐
│ API Response │
│ │
│ text: "القُرآنِ" │
│ speech_marks: [ │
│ {start: 0, end: 8}│ ← Aligned!
│ ] │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Mobile UI │
│ │
│ Highlights chars │
│ 0-8 in "القُرآنِ" │
│ Results: "القُرآنِ" │ ✓ Perfect match
└─────────────────────┘
Implementation: Character-by-Character Mapping
The remapping algorithm creates a position map between base characters (ignoring diacritics) in both texts:
Algorithm Steps
Step 1: Strip diacritics from both texts
```python
# Original text with diacritics
original = "القُرآنِ"  # 8 Unicode code points
# TTS text (normalized)
tts = "القران"  # 6 Unicode code points

# Strip diacritics from both (base characters only; the normalizer
# also folds the madda alef to a bare alef, so the two bases match)
original_base = "القران"  # 6 base characters
tts_base = "القران"  # 6 base characters (same as tts)
```
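The stripping step might look like the following sketch. This `strip_diacritics` is an illustrative stand-in for the production helper; it drops combining marks by Unicode category and, as an assumption about the normalizer, also folds alef variants (آ/أ/إ) to bare alef so the two base strings compare equal:

```python
import unicodedata

# Fold alef-with-madda/hamza forms to bare alef, as the TTS
# normalizer is assumed to do
ALEF_FOLD = {"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"}

def strip_diacritics(text):
    """Drop combining marks (Unicode category Mn) and fold alef variants."""
    return "".join(
        ALEF_FOLD.get(ch, ch)
        for ch in text
        if unicodedata.category(ch) != "Mn"
    )

# "القُرآنِ" -> "القران"
print(strip_diacritics("\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"))
```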
Step 2: Build character position map
For each base character position in tts_base, find the corresponding position in original:

```python
# Map: tts_position -> original_position
char_map = []
orig_idx = 0
tts_idx = 0
while tts_idx < len(tts_base):
    # Skip diacritics in the original
    while orig_idx < len(original) and is_diacritic(original[orig_idx]):
        orig_idx += 1
    # Map this tts position to the current original position
    char_map.append(orig_idx)
    orig_idx += 1
    tts_idx += 1

# Result: [0, 1, 2, 4, 5, 6]
# The six base letters of "القران" map to their positions in the
# original; the damma at index 3 is skipped, and the final kasra at
# index 7 lies past the last base letter.
```
Step 3: Remap speech mark indices
```python
for mark in speech_marks:
    if mark['start'] < len(char_map):
        mark['start'] = char_map[mark['start']]
    if mark['end'] < len(char_map):
        mark['end'] = char_map[mark['end']]
    # Handle case where end exceeds the map (use original length)
    elif mark['end'] == len(tts):
        mark['end'] = len(original)
```
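The three steps can be stitched into a minimal runnable sketch. `is_diacritic` and `build_char_position_map` here are illustrative stand-ins for the production helpers, using Unicode category `Mn` to detect combining marks (literals written as escapes so code point positions are unambiguous):

```python
import unicodedata

def is_diacritic(ch):
    # Arabic harakat (fatha, damma, kasra, shadda, sukun,
    # superscript alef, ...) are combining marks: category "Mn"
    return unicodedata.category(ch) == "Mn"

def build_char_position_map(original, tts):
    """Map each code point of the diacritic-free TTS text to the
    position of its base character in the original text."""
    char_map, orig_idx = [], 0
    for _ in range(len(tts)):
        while orig_idx < len(original) and is_diacritic(original[orig_idx]):
            orig_idx += 1  # skip combining marks in the original
        if orig_idx >= len(original):
            break  # original ran out of base characters
        char_map.append(orig_idx)
        orig_idx += 1
    return char_map

# "القُرآنِ" (8 code points) and "القران" (6 code points)
original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"
tts = "\u0627\u0644\u0642\u0631\u0627\u0646"

char_map = build_char_position_map(original, tts)
print(char_map)  # [0, 1, 2, 4, 5, 6] -- the damma at index 3 is skipped

# Remap the single word-level mark the TTS provider returned
mark = {"time": 0, "start": 0, "end": 6, "value": tts}
mark["start"] = char_map[mark["start"]]
mark["end"] = len(original) if mark["end"] == len(tts) else char_map[mark["end"]]
print(mark["start"], mark["end"])  # 0 8 -- spans the full diacritized word
```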
Real-World Example
Original text (Quranic): "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ" (38 code points)
TTS text (normalized): "بسم الله الرحمن الرحيم" (22 code points)
Speech marks (from TTS):
[
  {"time": 0, "start": 0, "end": 3, "value": "بسم"},
  {"time": 500, "start": 4, "end": 8, "value": "الله"},
  {"time": 1000, "start": 9, "end": 15, "value": "الرحمن"},
  {"time": 1500, "start": 16, "end": 22, "value": "الرحيم"}
]
Character map: [0, 2, 4, 6, 7, 8, 9, 12, 14, 15, 16, 17, 20, 22, 25, 27, 28, 29, 30, 33, 35, 36] (one entry per code point of the normalized text)
Remapped marks:
[
  {"time": 0, "start": 0, "end": 6, "value": "بسم"},
  {"time": 500, "start": 7, "end": 14, "value": "الله"},
  {"time": 1000, "start": 15, "end": 27, "value": "الرحمن"},
  {"time": 1500, "start": 28, "end": 38, "value": "الرحيم"}
]
Now the indices correctly point to word boundaries in the original diacritic-heavy text.
Edge Cases and Safeguards
Mismatch in Base Characters
If the base character sequences differ (beyond just diacritics), the remapping falls back safely:
```python
def remap_speech_marks(original_text, tts_text, speech_marks):
    """Remap speech mark indices from tts_text to original_text."""
    if not speech_marks:
        return None

    # Build character map
    char_map = build_char_position_map(original_text, tts_text)

    # If base characters don't align, return None (safeguard)
    if len(char_map) != len(strip_diacritics(tts_text)):
        return None  # Can't reliably remap

    # Remap indices
    remapped = []
    for mark in speech_marks:
        new_mark = {**mark}
        if 'start' in mark and mark['start'] < len(char_map):
            new_mark['start'] = char_map[mark['start']]
        if 'end' in mark:
            if mark['end'] < len(char_map):
                new_mark['end'] = char_map[mark['end']]
            elif mark['end'] == len(tts_text):
                # An exclusive end at the end of the tts text maps to
                # the end of the original (keeps trailing diacritics)
                new_mark['end'] = len(original_text)
        remapped.append(new_mark)
    return remapped
```
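As a quick sanity check of the fallback, when the TTS text contains base characters the original lacks, the position map comes up short and the safeguard refuses to remap. A compact self-contained illustration (the helpers are simplified inline sketches, and `safe_char_map` is a hypothetical name for the map-plus-safeguard step):

```python
import unicodedata

def is_diacritic(ch):
    return unicodedata.category(ch) == "Mn"  # combining mark

def strip_diacritics(text):
    return "".join(ch for ch in text if not is_diacritic(ch))

def build_char_position_map(original, tts):
    char_map, orig_idx = [], 0
    for _ in strip_diacritics(tts):
        while orig_idx < len(original) and is_diacritic(original[orig_idx]):
            orig_idx += 1
        if orig_idx >= len(original):
            break  # original ran out of base characters
        char_map.append(orig_idx)
        orig_idx += 1
    return char_map

def safe_char_map(original, tts):
    char_map = build_char_position_map(original, tts)
    if len(char_map) != len(strip_diacritics(tts)):
        return None  # base characters don't align: refuse to remap
    return char_map

# "كتاب" vs "كتاب جديد": the TTS text has an extra word -> fallback
print(safe_char_map("\u0643\u062A\u0627\u0628",
                    "\u0643\u062A\u0627\u0628 \u062C\u062F\u064A\u062F"))  # None
# "كَتَب" vs "كتب": only diacritics differ -> a usable map
print(safe_char_map("\u0643\u064E\u062A\u064E\u0628", "\u0643\u062A\u0628"))  # [0, 2, 4]
```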
Tatweel (Kashida) Handling
Tatweel (ـ U+0640) is a horizontal elongation character with visual width. It's NOT a combining mark:
```python
# Incorrect: treating tatweel as a zero-width diacritic
ARABIC_DIACRITICS = frozenset([..., ar.TATWEEL])  # WRONG

# Correct: exclude tatweel from combining marks for position mapping
_COMBINING_MARKS = frozenset([
    ar.FATHA, ar.DAMMA, ar.KASRA, ar.SUKUN, ar.SHADDA,
    ar.FATHATAN, ar.DAMMATAN, ar.KASRATAN, '\u0670',
    # TATWEEL NOT INCLUDED - it has visual width
])
```
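The distinction is visible in Unicode metadata: tatweel is classified as a modifier letter, not a combining mark, so a category-based check naturally keeps it as a base character. A small check:

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)
FATHA = "\u064E"    # ARABIC FATHA (a true diacritic)

print(unicodedata.category(TATWEEL))  # Lm -- modifier letter, has width
print(unicodedata.category(FATHA))    # Mn -- zero-width combining mark
```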
Results
Accuracy improvement:
- Before: 77.6% of speech marks aligned correctly (manual testing)
- After: 99%+ alignment accuracy
- Remaining mismatches (under 1%) are due to non-diacritic text differences (e.g., different spellings, auto-corrections)
Code location:
- src/helpers/text_norm_helper.py - remap_speech_marks() function
- src/objects/user/content_bit.py - applied during API serialization
- src/tests/boss/test_speech_marks_processing.py - 15 unit tests
Production impact:
- 8,740 media records with speech marks now display correctly
- Fixed Quranic and Classical Arabic content (most diacritic-heavy)
- Zero performance degradation (remapping is O(n), runs once during API response)
Key Takeaways
- Text normalization breaks index alignment - TTS providers normalize input, but UI needs original text positions
- Diacritics are combining marks - Arabic diacritics add Unicode code points without visual width
- Character-level mapping - Build position map between base characters, remap indices
- Safeguard against mismatches - Fall back to null if base characters differ significantly
- Tatweel is special - It has visual width; treat it as a base character, not a diacritic
Related Documentation:
- AGENTS.md → "TTS Speech Marks & Arabic Text Processing"
- docs/plans/2026-02-08-speech-marks-remaining-mismatch-fix.md