
Speech Marks Architecture: Accurate Arabic Text-to-Speech Indexing


Arabic TTS generates audio from normalized text, but the UI highlights the original text, which includes diacritics. Speech mark indices computed against the normalized text don't line up with positions in the original, so visual highlights drift out of sync during audio playback. We built a remapping system that achieves 99%+ alignment accuracy.

The Problem

When generating Arabic TTS audio with word-level timing data (speech marks), we faced a fundamental mismatch:

Original Text (Display):  "القُرآنِ" (8 Unicode code points: 6 letters + 2 diacritics)
TTS Input (Normalized):   "القران"  (6 Unicode code points: letters only)
Speech Marks Generated:   [{time: 0, start: 0, end: 6, value: "القران"}]
UI Highlight Target:      Characters 0-6 in "القُرآنِ" (wrong positions!)

The problem compounds with longer texts. A 50-word Arabic sentence with diacritics might have 300+ Unicode code points, while the normalized version has 200. Speech mark indices pointing to positions 50, 100, 150 in normalized text don't correspond to the same semantic positions in the diacritic-heavy original.
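The mismatch is easy to reproduce in a few lines. The sketch below is illustrative only: the strings are written as Unicode escapes so the code point counts are unambiguous, and the mark is a hand-built stand-in for what a provider would return.

```python
# Display text with diacritics: القُرآنِ
original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"
# Normalized TTS input: القران
normalized = "\u0627\u0644\u0642\u0631\u0627\u0646"

# A provider-style mark covering the whole normalized word (exclusive end)
mark = {"time": 0, "start": 0, "end": len(normalized), "value": normalized}

# Applying normalized indices directly to the display text
highlighted = original[mark["start"]:mark["end"]]

print(len(original), len(normalized))     # 8 6
print(highlighted == original)            # False
print(len(original) - len(highlighted))   # 2 (final letter and its kasra are cut off)
```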

Impact:

  • 22.9% of our TTS records had media.text ≠ tts.text
  • Visual highlights appeared at wrong word boundaries
  • Users saw mismatched audio-visual synchronization
  • Particularly severe in Quranic and Classical Arabic content

Before: Direct Index Usage (Broken)

┌─────────────────────┐
│ TTS Provider        │
│ (Google/Polly)      │
│                     │
│ Input: "القران"     │ ← Normalized (no diacritics)
│ Output: MP3 + marks │
└──────────┬──────────┘
           │
           │ speech_marks = [
           │   {time: 0, start: 0, end: 6, value: "القران"}
           │ ]
           ▼
┌─────────────────────┐
│ API Response        │
│                     │
│ text: "القُرآنِ"    │ ← Original (with diacritics)
│ speech_marks: [     │
│   {start: 0, end: 6}│ ← Indices for NORMALIZED text
│ ]                   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Mobile UI           │
│                     │
│ Highlights chars    │
│ 0-6 in "القُرآنِ"   │
│ Results: "القُرآ"   │ ✗ Missing last letter
│ Expected: "القُرآنِ"│
└─────────────────────┘

After: Character-Level Remapping (Fixed)

┌─────────────────────┐       ┌─────────────────────┐
│ TTS Provider        │       │ Remapping Engine    │
│                     │       │ (remap_speech_marks)│
│ Input: "القران"     │       │                     │
│ Output: marks       │──────>│ 1. Strip diacritics │
│ (normalized indices)│       │    from both texts  │
└─────────────────────┘       │ 2. Build char map   │
                              │ 3. Remap indices    │
                              └──────────┬──────────┘
                                         │
                                         │ Remapped marks
                                         ▼
                              ┌─────────────────────┐
                              │ API Response        │
                              │                     │
                              │ text: "القُرآنِ"    │
                              │ speech_marks: [     │
                              │   {start: 0, end: 8}│ ← Aligned!
                              │ ]                   │
                              └──────────┬──────────┘
                                         │
                                         ▼
                              ┌─────────────────────┐
                              │ Mobile UI           │
                              │                     │
                              │ Highlights chars    │
                              │ 0-8 in "القُرآنِ"   │
                              │ Results: "القُرآنِ" │ ✓ Perfect match
                              └─────────────────────┘

Implementation: Character-by-Character Mapping

The remapping algorithm creates a position map between base characters (ignoring diacritics) in both texts:

Algorithm Steps

Step 1: Strip diacritics from both texts

# Original text with diacritics
original = "القُرآنِ"  # 8 Unicode code points
# TTS text (normalized)
tts = "القران"         # 6 Unicode code points

# Strip diacritics from both (get base characters only)
original_base = "القران"  # 6 base characters
tts_base = "القران"       # 6 base characters (same as tts)
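A minimal sketch of the stripping step, assuming the mark set is the short harakat, tanwin, shadda, sukun, and superscript alef (the same members as the combining-mark set shown later in this post). The names `COMBINING_MARKS`, `is_diacritic`, and `strip_diacritics` here are illustrative and may not match the production helpers.

```python
# Assumed mark set: U+064B..U+0652 plus superscript alef (U+0670).
# Tatweel (U+0640) is deliberately absent.
COMBINING_MARKS = frozenset("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")

def is_diacritic(ch: str) -> bool:
    """True for a code point that should not count as a base character."""
    return ch in COMBINING_MARKS

def strip_diacritics(text: str) -> str:
    """Drop every combining mark, keeping base characters in order."""
    return "".join(ch for ch in text if not is_diacritic(ch))

original = "\u0628\u0650\u0633\u0652\u0645\u0650"  # بِسْمِ
print(len(original), len(strip_diacritics(original)))  # 6 3
```

Filtering marks does not fold letter variants: a precomposed letter such as alef with madda (U+0622) is a single base code point and survives this step. Because the position map only counts base characters, the two texts need the same number of base characters, not identical letters.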

Step 2: Build character position map

For each base character position in tts_base, find the corresponding position in original:

# Map: tts_position -> original_position
char_map = []
orig_idx = 0
tts_idx = 0

while tts_idx < len(tts_base):
    # Skip diacritics in original
    while orig_idx < len(original) and is_diacritic(original[orig_idx]):
        orig_idx += 1

    # Original ran out of base characters: the texts don't align, stop here
    if orig_idx >= len(original):
        break

    # Map this tts position to current original position
    char_map.append(orig_idx)

    orig_idx += 1
    tts_idx += 1

# Result: [0, 1, 2, 4, 5, 6]
#          ↑  ↑  ↑  ↑  ↑  ↑
#         "ا" "ل" "ق" "ر" "آ" "ن" in original
# (skipping the damma at position 3; the trailing kasra at position 7 is never mapped)
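The loop above can be wrapped into a self-contained function. This is a sketch under stated assumptions rather than the production code: the signature mirrors the `build_char_position_map(original_text, tts_text)` call used later in this post, and the mark set is assumed.

```python
from typing import List

# Assumed mark set (harakat, tanwin, shadda, sukun, superscript alef)
COMBINING_MARKS = frozenset("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")

def build_char_position_map(original_text: str, tts_text: str) -> List[int]:
    """Map each base-character position of tts_text to its index in original_text."""
    tts_base_len = sum(1 for ch in tts_text if ch not in COMBINING_MARKS)
    char_map: List[int] = []
    orig_idx = 0
    while len(char_map) < tts_base_len:
        # Skip combining marks in the original
        while orig_idx < len(original_text) and original_text[orig_idx] in COMBINING_MARKS:
            orig_idx += 1
        if orig_idx >= len(original_text):
            break  # original has fewer base characters than the TTS text
        char_map.append(orig_idx)
        orig_idx += 1
    return char_map

original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"  # القُرآنِ
tts = "\u0627\u0644\u0642\u0631\u0627\u0646"                    # القران
print(build_char_position_map(original, tts))  # [0, 1, 2, 4, 5, 6]
```

A map shorter than the number of TTS base characters signals that the two texts do not align, which is what a length-based safeguard can test for.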

Step 3: Remap speech mark indices

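for mark in speech_marks:
    if mark['start'] < len(char_map):
        mark['start'] = char_map[mark['start']]
    # 'end' is exclusive: it maps to the next base character's position,
    # so diacritics trailing the last highlighted letter stay inside the range
    if mark['end'] < len(char_map):
        mark['end'] = char_map[mark['end']]
    # A mark that ends at the end of the TTS text ends at the end of the original
    elif mark['end'] == len(tts):
        mark['end'] = len(original)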
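To make the exclusive-end behavior concrete, here is a small non-mutating variant of the same step. `remap_mark` is a hypothetical helper name, and the map is written out by hand for an 8-code-point original whose only mid-word diacritic sits at index 3.

```python
def remap_mark(mark, char_map, original_len):
    """Return a copy of `mark` with start/end translated to original-text indices."""
    new_mark = dict(mark)
    if mark["start"] < len(char_map):
        new_mark["start"] = char_map[mark["start"]]
    if mark["end"] < len(char_map):
        # Exclusive end: jump to where the next base character starts
        new_mark["end"] = char_map[mark["end"]]
    elif mark["end"] == len(char_map):
        # Mark reaches the end of the TTS text
        new_mark["end"] = original_len
    return new_mark

char_map = [0, 1, 2, 4, 5, 6]
first_three = remap_mark({"time": 0, "start": 0, "end": 3}, char_map, original_len=8)
whole_word = remap_mark({"time": 0, "start": 0, "end": 6}, char_map, original_len=8)

print(first_three)  # {'time': 0, 'start': 0, 'end': 4}
print(whole_word)   # {'time': 0, 'start': 0, 'end': 8}
```

The first mark ends at 4 rather than 3, so the diacritic attached to the third letter is highlighted together with it.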

Real-World Example

Original text (Quranic): "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ" (38 code points)

TTS text (normalized): "بسم الله الرحمن الرحيم" (22 code points)

Speech marks (from TTS):

[
  {"time": 0, "start": 0, "end": 3, "value": "بسم"},
  {"time": 500, "start": 4, "end": 8, "value": "الله"},
  {"time": 1000, "start": 9, "end": 15, "value": "الرحمن"},
  {"time": 1500, "start": 16, "end": 22, "value": "الرحيم"}
]

Character map: [0, 2, 4, 6, 7, 8, 9, 12, 14, 15, 16, 17, 20, 22, 25, 27, 28, 29, 30, 33, 35, 36]

Remapped marks:

[
  {"time": 0, "start": 0, "end": 6, "value": "بسم"},
  {"time": 500, "start": 7, "end": 14, "value": "الله"},
  {"time": 1000, "start": 15, "end": 27, "value": "الرحمن"},
  {"time": 1500, "start": 28, "end": 38, "value": "الرحيم"}
]

Each end index is exclusive and lands on the following space (or on the end of the text for the last mark), so the indices now point to word boundaries in the original diacritic-heavy text, trailing diacritics included.
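One way to sanity-check a remapping is a round trip: slice the original with the remapped indices, strip diacritics from the slice, and compare with the spoken word. The sketch below does this for the basmala; the mark set, the escape-spelled string, and the helper names are assumptions made for illustration, and the normalized text is derived locally instead of coming from a provider.

```python
COMBINING_MARKS = frozenset("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")

def strip_diacritics(text):
    return "".join(ch for ch in text if ch not in COMBINING_MARKS)

# The basmala with full diacritics, one word per line (38 code points)
original = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 "
    "\u0627\u0644\u0644\u0651\u064E\u0647\u0650 "
    "\u0627\u0644\u0631\u0651\u064E\u062D\u0652\u0645\u064E\u0670\u0646\u0650 "
    "\u0627\u0644\u0631\u0651\u064E\u062D\u0650\u064A\u0645\u0650"
)
tts = strip_diacritics(original)  # stands in for the normalized TTS input

# Positions of base characters in the original (valid because tts has no marks)
char_map = [i for i, ch in enumerate(original) if ch not in COMBINING_MARKS]

# Provider-style marks: one per space-separated word, exclusive end
marks, pos = [], 0
for word in tts.split(" "):
    marks.append({"start": pos, "end": pos + len(word), "value": word})
    pos += len(word) + 1

def remap(index):
    """Translate a TTS index; an index at the end of the text maps to len(original)."""
    return char_map[index] if index < len(char_map) else len(original)

spans = [(remap(m["start"]), remap(m["end"])) for m in marks]
print(spans)  # [(0, 6), (7, 14), (15, 27), (28, 38)]

# Round trip: each highlighted slice, minus diacritics, equals the spoken word
round_trip_ok = all(
    strip_diacritics(original[s:e]) == m["value"] for (s, e), m in zip(spans, marks)
)
print(round_trip_ok)  # True
```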

Edge Cases and Safeguards

Mismatch in Base Characters

If the base character sequences differ (beyond just diacritics), the remapping falls back safely:

def remap_speech_marks(original_text, tts_text, speech_marks):
    """Remap speech mark indices from tts_text to original_text."""
    if not speech_marks:
        return None

    # Build character map
    char_map = build_char_position_map(original_text, tts_text)

    # If base characters don't align, return None (safeguard)
    if len(char_map) != len(strip_diacritics(tts_text)):
        return None  # Can't reliably remap

    # Remap indices
    remapped = []
    for mark in speech_marks:
        new_mark = {**mark}
        if 'start' in mark and mark['start'] < len(char_map):
            new_mark['start'] = char_map[mark['start']]
        if 'end' in mark:
            if mark['end'] < len(char_map):
                new_mark['end'] = char_map[mark['end']]
            elif mark['end'] == len(char_map):
                # char_map[len(char_map)] would raise IndexError; a mark that
                # ends at the end of the TTS text ends at the end of the original
                new_mark['end'] = len(original_text)
        remapped.append(new_mark)

    return remapped
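A length check on the map catches an original that runs out of base characters, but not one that has extra base characters. A stricter pre-check, sketched here as an assumption and not as the production safeguard, compares base-character counts on both sides before any remapping happens. `bases_align` is a hypothetical helper.

```python
COMBINING_MARKS = frozenset("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")

def strip_diacritics(text):
    return "".join(ch for ch in text if ch not in COMBINING_MARKS)

def bases_align(original_text, tts_text):
    """True when both texts contain the same number of base characters."""
    return len(strip_diacritics(original_text)) == len(strip_diacritics(tts_text))

with_marks = "\u0628\u0650\u0633\u0652\u0645\u0650"   # بِسْمِ
same_word = "\u0628\u0633\u0645"                      # بسم
other_spelling = "\u0628\u0627\u0633\u0645"           # باسم (extra alef)

print(bases_align(with_marks, same_word))       # True
print(bases_align(with_marks, other_spelling))  # False
```

Equal counts are necessary but not sufficient. Comparing the stripped strings for equality would be stricter still, at the cost of rejecting pairs that differ only by letter normalization (for example alef with madda versus plain alef).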

Tatweel (Kashida) Handling

Tatweel (ـ U+0640) is a horizontal elongation character that occupies visual width. It is NOT a combining mark, so it must count as a base character when building the position map:

# Incorrect: Treating tatweel as zero-width diacritic
ARABIC_DIACRITICS = frozenset([..., ar.TATWEEL])  # WRONG

# Correct: Exclude tatweel from combining marks for position mapping
_COMBINING_MARKS = frozenset([
    ar.FATHA, ar.DAMMA, ar.KASRA, ar.SUKUN, ar.SHADDA,
    ar.FATHATAN, ar.DAMMATAN, ar.KASRATAN, '\u0670'
    # TATWEEL NOT INCLUDED - it has visual width
])
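Python's standard `unicodedata` module shows the distinction directly. The snippet below is only a quick check of Unicode properties, not part of the remapping code: tatweel is a modifier letter with combining class 0, while the diacritics are nonspacing marks.

```python
import unicodedata

TATWEEL = "\u0640"
MARKS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670"

# General category "Lm" (modifier letter), canonical combining class 0
print(unicodedata.category(TATWEEL), unicodedata.combining(TATWEEL))  # Lm 0

# Every diacritic in the set is category "Mn" (nonspacing mark)
mark_categories = {unicodedata.category(ch) for ch in MARKS}
print(mark_categories)  # {'Mn'}

# All of them carry a non-zero combining class
all_combining = all(unicodedata.combining(ch) > 0 for ch in MARKS)
print(all_combining)  # True
```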

Results

Accuracy improvement:

  • Before: 77.6% of speech marks aligned correctly (manual testing)
  • After: 99%+ alignment accuracy
  • The remaining <1% comes from non-diacritic text differences (e.g., different spellings, auto-corrections)

Code location:

  • src/helpers/text_norm_helper.py - remap_speech_marks() function
  • src/objects/user/content_bit.py - Applied during API serialization
  • src/tests/boss/test_speech_marks_processing.py - 15 unit tests

Production impact:

  • 8,740 media records with speech marks now display correctly
  • Fixed Quranic and Classical Arabic content (most diacritic-heavy)
  • Negligible overhead (remapping is O(n) in text length and runs once per API response)

Key Takeaways

  1. Text normalization breaks index alignment - TTS providers normalize input, but UI needs original text positions
  2. Diacritics are combining marks - Arabic diacritics add Unicode code points without visual width
  3. Character-level mapping - Build position map between base characters, remap indices
  4. Safeguard against mismatches - Fall back to null if base characters differ significantly
  5. Tatweel is special - It has visual width; treat it as a base character, not a diacritic

Related Documentation:

  • AGENTS.md → "TTS Speech Marks & Arabic Text Processing"
  • docs/plans/2026-02-08-speech-marks-remaining-mismatch-fix.md