alqosh

Speech Marks Architecture: Accurate Arabic Text-to-Speech Indexing

February 1, 2026·tts

Overview

Arabic TTS generates audio from normalized text, but the UI highlights the original text, which carries diacritics. Speech mark indices are computed against the normalized text, so they don't line up with positions in the original, and visual highlights land in the wrong place during audio playback. We built a remapping system that achieves 99%+ alignment accuracy.

The Problem

When generating Arabic TTS audio with word-level timing data (speech marks), we faced a fundamental mismatch:

Original Text (Display):  "القُرآنِ" (8 Unicode code points: 6 base + 2 diacritics)
TTS Input (Normalized):   "القران"  (6 Unicode code points: base only)
Speech Marks Generated:   [{time: 0, start: 0, end: 6, value: "القران"}]
UI Highlight Target:      Code points 0-6 in "القُرآنِ" (wrong: the final ن and its kasra are cut off)

The problem compounds with longer texts. A 50-word Arabic sentence with diacritics might have 300+ Unicode code points, while the normalized version has about 200. Speech mark indices pointing to positions 50, 100, and 150 in the normalized text don't land on the same words in the diacritic-heavy original.
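
A minimal sketch of the mismatch, using Python's len() and slicing, which both operate on Unicode code points. The strings are written as escape sequences so the exact code points are unambiguous; it assumes the display text stores آ as the single precomposed code point U+0622.

```python
# Display text "القُرآنِ": alef, lam, qaf, damma, ra, alef-madda, noon, kasra
original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"
# Normalized TTS input "القران": alef, lam, qaf, ra, alef, noon
tts = "\u0627\u0644\u0642\u0631\u0627\u0646"

# The speech mark spans the whole normalized word.
mark = {"time": 0, "start": 0, "end": len(tts), "value": tts}

# Applying those indices directly to the display text cuts the word short.
naive_highlight = original[mark["start"]:mark["end"]]

print(len(original), len(tts))       # the two code point counts differ
print(naive_highlight == original)   # False: the final noon and kasra are missing
```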

Impact:

22.9% of our TTS records had media.text ≠ tts.text
Visual highlights appeared at wrong word boundaries
Users saw mismatched audio-visual synchronization
Particularly severe in Quranic and Classical Arabic content

Before: Direct Index Usage (Broken)

After: Character-Level Remapping (Fixed)

Implementation: Character-by-Character Mapping

The remapping algorithm creates a position map between base characters (ignoring diacritics) in both texts:

Algorithm Steps

Step 1: Strip diacritics from both texts

# Original text with diacritics
original = "القُرآنِ"  # 8 Unicode code points
# TTS text (normalized)
tts = "القران"         # 6 Unicode code points

# Strip diacritics from both (get base characters only)
original_base = "القرآن"  # 6 base characters (آ stays: it is a letter, not a diacritic)
tts_base = "القران"       # 6 base characters (same as tts)
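
One way to implement the stripping step is to filter out a fixed set of combining marks. The sketch below is an assumption about how helpers like is_diacritic and strip_diacritics might look, not the project's code; the mark set (U+064B through U+0652 plus U+0670) is illustrative. Stripping marks alone leaves letter variants such as آ (U+0622) untouched.

```python
# Hypothetical helpers: the mark set below is an illustrative assumption.
# U+064B..U+0652 covers fathatan, dammatan, kasratan, fatha, damma, kasra,
# shadda, and sukun; U+0670 is the superscript alef.
COMBINING_MARKS = frozenset(chr(cp) for cp in range(0x064B, 0x0653)) | {"\u0670"}


def is_diacritic(ch: str) -> bool:
    """Return True for code points treated as zero-width diacritics."""
    return ch in COMBINING_MARKS


def strip_diacritics(text: str) -> str:
    """Drop diacritics, keeping base characters (including spaces)."""
    return "".join(ch for ch in text if not is_diacritic(ch))


original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"  # "القُرآنِ"
print(len(strip_diacritics(original)))  # 6 base characters remain
```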

Step 2: Build character position map

For each base character position in tts_base, find the corresponding position in original:

# Map: tts_position -> original_position
char_map = []
orig_idx = 0
tts_idx = 0

while tts_idx < len(tts_base):
    # Skip diacritics in original
    while orig_idx < len(original) and is_diacritic(original[orig_idx]):
        orig_idx += 1

    # Original ran out of base characters: the texts do not align
    if orig_idx >= len(original):
        break

    # Map this tts position to current original position
    char_map.append(orig_idx)

    orig_idx += 1
    tts_idx += 1

# Result: [0, 1, 2, 4, 5, 6]
# These are the positions of the six base characters in original.
# The damma at position 3 is skipped; the trailing kasra at position 7 is never reached.
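
The loop above can be wrapped into a function. This is a sketch under assumptions: the name build_char_position_map is borrowed from the call that appears later in this post, but the body, the diacritic set, and the early exit when the original runs out of base characters are illustrative rather than the project's actual code.

```python
from typing import List

# Illustrative diacritic set (assumption): tanween, harakat, shadda, sukun,
# and superscript alef. Tatweel (U+0640) is deliberately absent.
DIACRITICS = frozenset("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")


def build_char_position_map(original: str, tts: str) -> List[int]:
    """Map each base-character index in tts to its index in original.

    The result is shorter than the number of base characters in tts when
    original runs out first, which callers can treat as a mismatch.
    """
    tts_base_len = sum(1 for ch in tts if ch not in DIACRITICS)
    char_map: List[int] = []
    orig_idx = 0
    while len(char_map) < tts_base_len:
        # Skip diacritics in the original text.
        while orig_idx < len(original) and original[orig_idx] in DIACRITICS:
            orig_idx += 1
        if orig_idx >= len(original):
            break  # original has fewer base characters than tts
        char_map.append(orig_idx)
        orig_idx += 1
    return char_map


original = "\u0627\u0644\u0642\u064F\u0631\u0622\u0646\u0650"  # "القُرآنِ"
tts = "\u0627\u0644\u0642\u0631\u0627\u0646"                    # "القران"
print(build_char_position_map(original, tts))  # [0, 1, 2, 4, 5, 6]
```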

Step 3: Remap speech mark indices

for mark in speech_marks:
    if mark['start'] < len(char_map):
        mark['start'] = char_map[mark['start']]
    if mark['end'] < len(char_map):
        mark['end'] = char_map[mark['end']]
    # Handle case where end exceeds map (use original length)
    elif mark['end'] == len(tts):
        mark['end'] = len(original)

Real-World Example

Original text (Quranic): "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ" (38 code points)

TTS text (normalized): "بسم الله الرحمن الرحيم" (22 code points)

Speech marks (from TTS):

[
  {"time": 0, "start": 0, "end": 3, "value": "بسم"},
  {"time": 500, "start": 4, "end": 8, "value": "الله"},
  {"time": 1000, "start": 9, "end": 15, "value": "الرحمن"},
  {"time": 1500, "start": 16, "end": 22, "value": "الرحيم"}
]

Character map (one entry per TTS code point): [0, 2, 4, 6, 7, 8, 9, 12, 14, 15, 16, 17, 20, 22, 25, 27, 28, 29, 30, 33, 35, 36]

Remapped marks:

[
  {"time": 0, "start": 0, "end": 6, "value": "بسم"},
  {"time": 500, "start": 7, "end": 14, "value": "الله"},
  {"time": 1000, "start": 15, "end": 27, "value": "الرحمن"},
  {"time": 1500, "start": 28, "end": 38, "value": "الرحيم"}
]

Now the indices correctly point to word boundaries in the original diacritic-heavy text. Each end index is exclusive and falls after the word's trailing diacritics, so the range 0-6 covers all of "بِسْمِ". The final mark ends at 22, the length of the TTS text, which has no entry in the map and is therefore remapped to the original length, 38.
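
A quick way to sanity-check remapped marks is to slice the original text with them and confirm that each slice, once its diacritics are removed, equals the mark's value. The sketch below does this for the phrase above. It is illustrative only: the helper name strip_marks is hypothetical, the diacritic set is an assumption, the word marks are derived from spaces rather than taken from a TTS provider, and it uses a list of base-character positions with a sentinel at the end instead of a special case for the final word.

```python
# Illustrative diacritic set (assumption); tatweel U+0640 is not included.
DIACRITICS = frozenset("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")

# "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ" spelled out as code points.
original = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 "
    "\u0627\u0644\u0644\u0651\u064E\u0647\u0650 "
    "\u0627\u0644\u0631\u0651\u064E\u062D\u0652\u0645\u064E\u0670\u0646\u0650 "
    "\u0627\u0644\u0631\u0651\u064E\u062D\u0650\u064A\u0645\u0650"
)


def strip_marks(text):
    """Remove diacritics, keeping base characters and spaces."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


tts = strip_marks(original)  # stands in for the provider's normalized text

# Word-level marks over tts, derived from its spaces (timings omitted).
marks, pos = [], 0
for word in tts.split(" "):
    marks.append({"start": pos, "end": pos + len(word), "value": word})
    pos += len(word) + 1

# Positions of base characters in original, plus a sentinel for the end.
base_pos = [i for i, ch in enumerate(original) if ch not in DIACRITICS]
base_pos.append(len(original))

remapped = [
    {**m, "start": base_pos[m["start"]], "end": base_pos[m["end"]]}
    for m in marks
]

for m in remapped:
    # Each remapped slice must reduce to the normalized word.
    assert strip_marks(original[m["start"]:m["end"]]) == m["value"]
    print(m["start"], m["end"])
```

The same check could run as a unit test over stored records to catch marks whose slices no longer match their values.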

Edge Cases and Safeguards

Mismatch in Base Characters

If the base character sequences differ (beyond just diacritics), the remapping falls back safely:

def remap_speech_marks(original_text, tts_text, speech_marks):
    """Remap speech mark indices from tts_text to original_text."""
    if not speech_marks:
        return None

    # Build character map
    char_map = build_char_position_map(original_text, tts_text)

    # If base characters don't align, return None (safeguard)
    if len(char_map) != len(strip_diacritics(tts_text)):
        return None  # Can't reliably remap

    # Remap indices (each mark is copied, so the input list is not mutated)
    remapped = []
    for mark in speech_marks:
        new_mark = {**mark}
        if 'start' in mark and mark['start'] < len(char_map):
            new_mark['start'] = char_map[mark['start']]
        if 'end' in mark:
            if mark['end'] < len(char_map):
                new_mark['end'] = char_map[mark['end']]
            elif mark['end'] == len(char_map):
                # End of text: char_map has no entry here (indexing it would
                # raise IndexError), so use the original length instead
                new_mark['end'] = len(original_text)
        remapped.append(new_mark)

    return remapped

Tatweel (Kashida) Handling

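Tatweel (ـ U+0640) is a horizontal elongation character with visual width. It's NOT a combining mark, so it occupies its own index and must be counted as a base character when building the position map: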

# Incorrect: Treating tatweel as zero-width diacritic
ARABIC_DIACRITICS = frozenset([..., ar.TATWEEL])  # WRONG

# Correct: Exclude tatweel from combining marks for position mapping
_COMBINING_MARKS = frozenset([
    ar.FATHA, ar.DAMMA, ar.KASRA, ar.SUKUN, ar.SHADDA,
    ar.FATHATAN, ar.DAMMATAN, ar.KASRATAN,
    '\u0670',  # superscript alef
    # TATWEEL NOT INCLUDED - it has visual width
])
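
Python's standard unicodedata module offers a way to check this distinction. The sketch below is an illustration, not part of the project's code, and is_combining_mark is a hypothetical helper: tatweel reports the general category Lm (modifier letter) and a combining class of 0, while the short vowels and shadda report the category Mn (nonspacing mark) and a nonzero combining class.

```python
import unicodedata

TATWEEL = "\u0640"
FATHA = "\u064E"
SHADDA = "\u0651"

for name, ch in [("tatweel", TATWEEL), ("fatha", FATHA), ("shadda", SHADDA)]:
    # category(): general category; combining(): canonical combining class
    print(name, unicodedata.category(ch), unicodedata.combining(ch))


def is_combining_mark(ch: str) -> bool:
    """Treat a code point as a diacritic only if it is a nonspacing mark."""
    return unicodedata.category(ch) == "Mn"


# kaf, tatweel, ta, fatha, ba: the tatweel keeps its own position.
word = "\u0643\u0640\u062A\u064E\u0628"
base_positions = [i for i, ch in enumerate(word) if not is_combining_mark(ch)]
print(base_positions)  # [0, 1, 2, 4]
```

A category-based test is broader than the explicit set above, since Mn also covers marks such as Quranic annotation signs, so whether it fits depends on what the TTS normalization actually removes.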

Results

Accuracy improvement:

Before: 77.6% of speech marks aligned correctly (manual testing)
After: 99%+ alignment accuracy
Remaining <1% due to non-diacritic text differences (e.g., different spellings, auto-corrections)

Code location:

src/helpers/text_norm_helper.py - remap_speech_marks() function
src/objects/user/content_bit.py - Applied during API serialization
src/tests/boss/test_speech_marks_processing.py - 15 unit tests

Production impact:

8,740 media records with speech marks now display correctly
Fixed Quranic and Classical Arabic content (most diacritic-heavy)
No measurable performance degradation (remapping is O(n) in text length and runs once per API response)

Key Takeaways

Text normalization breaks index alignment - TTS providers normalize input, but UI needs original text positions
Diacritics are combining marks - Arabic diacritics add Unicode code points without visual width
Character-level mapping - Build position map between base characters, remap indices
Safeguard against mismatches - Return None instead of remapped marks when the base characters do not align
Tatweel is special - It has visual width; treat it as a base character, not a diacritic

Related Documentation:

AGENTS.md → "TTS Speech Marks & Arabic Text Processing"
docs/plans/2026-02-08-speech-marks-remaining-mismatch-fix.md