121 changes: 121 additions & 0 deletions Documentation/fleurs-full-benchmark-baseline.md
@@ -0,0 +1,121 @@
# FLEURS Full Benchmark Results - Parakeet v3 Baseline

**Date:** 2026-04-11
**Branch:** `main`
**Model:** Parakeet TDT v3 (0.6B)
**Samples:** 100 per language × 24 languages = 2,400 total
**Duration:** 21 minutes 39 seconds

## Summary

This benchmark establishes the baseline performance of Parakeet v3 on the FLEURS multilingual dataset before implementing script filtering for issue #512.

**Key Findings:**
- Polish shows 8.98% WER, consistent with the Cyrillic script confusion reported in issue #512
- All languages run faster than real time (RTFx > 40x)
- Average RTFx across all languages: 62.6x
- Best performance: Italian (3.46% WER)
- Worst performance: Greek (38.91% WER)
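
For context, RTFx (real-time factor) is audio duration divided by wall-clock processing time, so RTFx > 1 means faster than real time. A minimal sketch of the arithmetic:

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds

# e.g. 10 s of audio transcribed in 0.2 s of wall-clock time runs at 50x real time
```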

## Complete Results

| Language | Code | WER% | CER% | RTFx | Duration | Samples |
|----------|------|------|------|------|----------|---------|
| English (US) | en_us | 4.57 | 2.46 | 47.9x | 953.9s | 100 |
| Spanish (Latin America) | es_419 | 3.80 | 1.59 | 67.6x | 1200.8s | 100 |
| Italian (Italy) | it_it | 3.46 | 1.35 | 86.1x | 1516.9s | 100 |
| French (France) | fr_fr | 6.59 | 2.86 | 50.0x | 1073.7s | 100 |
| German (Germany) | de_de | 5.92 | 2.69 | 53.8x | 1496.2s | 100 |
| Russian (Russia) | ru_ru | 7.01 | 2.01 | 64.1x | 1136.6s | 100 |
| Dutch (Netherlands) | nl_nl | 8.12 | 3.07 | 52.6x | 1009.6s | 100 |
| **Polish (Poland)** | **pl_pl** | **8.98** | **3.17** | **53.0x** | **964.7s** | **100** |
| Ukrainian (Ukraine) | uk_ua | 7.02 | 2.12 | 59.3x | 1098.1s | 100 |
| Slovak (Slovakia) | sk_sk | 13.96 | 5.39 | 46.2x | 1196.3s | 100 |
| Czech (Czechia) | cs_cz | 11.28 | 3.67 | 68.0x | 1239.0s | 100 |
| Bulgarian (Bulgaria) | bg_bg | 11.78 | 3.74 | 47.8x | 1021.9s | 100 |
| Croatian (Croatia) | hr_hr | 13.52 | 4.06 | 60.0x | 1025.7s | 100 |
| Romanian (Romania) | ro_ro | 15.02 | 4.63 | 68.2x | 1110.8s | 100 |
| Finnish (Finland) | fi_fi | 16.08 | 4.98 | 66.1x | 1348.5s | 100 |
| Hungarian (Hungary) | hu_hu | 19.52 | 6.52 | 84.8x | 1295.2s | 100 |
| Swedish (Sweden) | sv_se | 17.44 | 5.83 | 65.6x | 1079.0s | 100 |
| Estonian (Estonia) | et_ee | 19.66 | 4.31 | 68.8x | 1198.9s | 100 |
| Danish (Denmark) | da_dk | 19.62 | 7.56 | 56.9x | 1125.7s | 100 |
| Lithuanian (Lithuania) | lt_lt | 25.33 | 7.45 | 70.5x | 1055.8s | 100 |
| **Greek (Greece)** | **el_gr** | **38.91** | **15.45** | **72.1x** | **1098.7s** | **100** |
| Maltese (Malta) | mt_mt | 29.59 | 11.23 | 68.1x | 1399.1s | 100 |
| Latvian (Latvia) | lv_lv | 26.20 | 7.35 | 76.1x | 1176.1s | 100 |
| Slovenian (Slovenia) | sl_si | 27.10 | 9.83 | 43.0x | 940.0s | 100 |

**Polish** is highlighted as the target language for issue #512 (Cyrillic script confusion).
**Greek** has the highest WER of all 24 languages and the largest headroom for improvement.

## Performance Categories

### Excellent (WER < 5%)
- 🥇 Italian: 3.46%
- 🥈 Spanish: 3.80%
- 🥉 English: 4.57%

### Very Good (WER 5-8%)
- German: 5.92%
- French: 6.59%
- Russian: 7.01%
- Ukrainian: 7.02%

### Good (WER 8-10%)
- Dutch: 8.12%
- Polish: 8.98% ← **Target for script filtering improvement**

### Moderate (WER 11-16%)
- Czech: 11.28%
- Bulgarian: 11.78%
- Croatian: 13.52%
- Slovak: 13.96%
- Romanian: 15.02%
- Finnish: 16.08%

### Fair (WER 17-20%)
- Swedish: 17.44%
- Hungarian: 19.52%
- Danish: 19.62%
- Estonian: 19.66%

### Lower (WER > 20%)
- Lithuanian: 25.33%
- Latvian: 26.20%
- Slovenian: 27.10%
- Maltese: 29.59%
- Greek: 38.91%

## Methodology

- **Model**: Parakeet TDT v3 (0.6B) with standard JointDecision (argmax only)
- **Dataset**: FLEURS multilingual benchmark
- **Sample Size**: 100 utterances per language
- **Evaluation**: Levenshtein distance for WER/CER calculation
- **Hardware**: Apple Silicon (M-series)
- **Compute Units**: Neural Engine + GPU
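
As an illustration of the evaluation metric only (a sketch, not the project's actual Swift implementation), WER and CER from Levenshtein distance over words and characters respectively:

```python
def levenshtein(a, b) -> int:
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return levenshtein(reference, hypothesis) / len(reference)
```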

## Next Steps

1. Implement script filtering using JointDecisionv3 (top-K outputs)
2. Re-run benchmark on `feat/script-filtering-issue-512` branch
3. Compare WER improvement for Polish and other affected languages
4. Validate no regression on languages without script ambiguity
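
The idea behind step 1 can be sketched as follows (all names here are hypothetical — the real implementation works on the JointDecisionv3 Core ML head's top-K outputs in Swift): at each decoding step, prefer the best-scoring candidate whose text matches the expected script, falling back to the plain argmax when none qualifies.

```python
import unicodedata

def is_latin(token: str) -> bool:
    """True when every alphabetic character in the token is Latin-script."""
    return all(
        "LATIN" in unicodedata.name(ch, "")
        for ch in token
        if ch.isalpha()
    )

def pick_token(top_k, allowed=is_latin):
    """top_k: [(token_text, score), ...] sorted by descending score.
    Return the best-scoring candidate in the allowed script; fall back
    to the argmax when no candidate qualifies."""
    for text, _score in top_k:
        if allowed(text):
            return text
    return top_k[0][0]

# Hypothetical decoding step where the argmax is Cyrillic but a Latin
# candidate appears in the top-K:
# pick_token([("привет", 0.6), ("privet", 0.3)]) -> "privet"
```

This is why the baseline uses "argmax only" JointDecision while the fix requires top-K outputs: with only the argmax available there is no alternative candidate to fall back to.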

## Raw Results

Individual JSON results saved to:
```
benchmark_results/fleurs_*_20260411_224806.json
```

Full benchmark log:
```
benchmark_results/fleurs_full_benchmark_20260411_224806.log
```
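
A sketch of reading these per-language JSON files, with the schema assumed from `extract_metrics` in the benchmark script (`summary.averageWER`, `summary.averageCER`, `summary.averageRTFx`, WER/CER stored as fractions):

```python
import glob
import json

def summarize(result: dict) -> tuple:
    """Extract (WER%, CER%, RTFx) from one per-language result dict.
    averageWER/averageCER are fractions, so scale to percent."""
    s = result["summary"]
    return (
        round(s["averageWER"] * 100, 2),
        round(s["averageCER"] * 100, 2),
        round(s["averageRTFx"], 1),
    )

if __name__ == "__main__":
    for path in sorted(glob.glob("benchmark_results/fleurs_*_20260411_224806.json")):
        with open(path) as f:
            wer, cer, rtfx = summarize(json.load(f))
        print(f"{path:60s} {wer:6.2f}% {cer:6.2f}% {rtfx:6.1f}x")
```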

## Related Issues

- [#512](https://github.com/FluidInference/FluidAudio/issues/512) - Polish utterances transcribed in Cyrillic instead of Latin script
- [#515](https://github.com/FluidInference/FluidAudio/pull/515) - Script filtering implementation (in progress)
167 changes: 167 additions & 0 deletions Scripts/fleurs_parakeet_sub_benchmark.sh
@@ -0,0 +1,167 @@
#!/bin/bash
# Run FLEURS full multilingual benchmark (100 samples x 24 languages = 2,400 samples) with sleep prevention.
#
# Benchmarks all 24 languages supported by Parakeet TDT v3, grouped by
# approximate expected WER tier (see the baseline doc for measured values):
#   Best:     en_us, es_419, it_it, fr_fr, de_de
#   Good:     ru_ru, nl_nl, pl_pl, uk_ua, sk_sk
#   Moderate: cs_cz, bg_bg, hr_hr, ro_ro, fi_fi
#   Lower:    hu_hu, sv_se, et_ee, da_dk, lt_lt, el_gr, mt_mt, lv_lv, sl_si
#
# Usage:
# ./Scripts/fleurs_parakeet_sub_benchmark.sh
#
# The script downloads FLEURS data automatically if needed.
# Uses caffeinate to prevent sleep so you can close the lid.
# Results are saved to benchmark_results/ with timestamps.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
RESULTS_DIR="$PROJECT_DIR/benchmark_results"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$RESULTS_DIR/fleurs_full_benchmark_${TIMESTAMP}.log"
SAMPLES_PER_LANG=100

# All 24 supported languages
LANGUAGES=(
# Best performing (WER < 5%)
"en_us" "es_419" "it_it" "fr_fr" "de_de"
# Good performance (WER 5-10%)
"ru_ru" "nl_nl" "pl_pl" "uk_ua" "sk_sk"
# Moderate performance (WER 10-15%)
"cs_cz" "bg_bg" "hr_hr" "ro_ro" "fi_fi"
# Lower performance (WER > 15%)
"hu_hu" "sv_se" "et_ee" "da_dk" "lt_lt" "el_gr" "mt_mt" "lv_lv" "sl_si"
)

MODELS_DIR="$HOME/Library/Application Support/FluidAudio/Models"

mkdir -p "$RESULTS_DIR"

log() {
echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

# Verify Parakeet v3 models exist
verify_models() {
local v3_dir="$MODELS_DIR/parakeet-tdt-0.6b-v3"
for f in Preprocessor.mlmodelc Encoder.mlmodelc Decoder.mlmodelc JointDecision.mlmodelc parakeet_vocab.json; do
if [[ ! -e "$v3_dir/$f" ]]; then
log "MISSING v3: $v3_dir/$f"
return 1
fi
done
return 0
}

log "=== Verifying Parakeet v3 models ==="
if ! verify_models; then
log ""
log "ERROR: Parakeet v3 models missing."
log "Please run ASR benchmark first to download models."
exit 1
fi
log "Parakeet v3 models verified. FLEURS data will download automatically if needed."

log "=== FLEURS full benchmark: $SAMPLES_PER_LANG samples x ${#LANGUAGES[@]} languages = $(( SAMPLES_PER_LANG * ${#LANGUAGES[@]} )) total ==="
log "Results directory: $RESULTS_DIR"

cd "$PROJECT_DIR"

# Build release if not already built
if [[ ! -x ".build/release/fluidaudiocli" ]]; then
log "Building release binary..."
swift build -c release 2>&1 | tail -1 | tee -a "$LOG_FILE"
fi
CLI="$PROJECT_DIR/.build/release/fluidaudiocli"

# caffeinate -s: prevent sleep even on AC power / lid closed
# caffeinate -i: prevent idle sleep
caffeinate -si -w $$ &
CAFFEINATE_PID=$!
log "caffeinate started (PID $CAFFEINATE_PID) — safe to close the lid"

SUITE_START=$(date +%s)

# Run all languages
LANG_NAMES=(
"English (US)" "Spanish (Spain)" "Italian (Italy)" "French (France)" "German (Germany)"
"Russian (Russia)" "Dutch (Netherlands)" "Polish (Poland)" "Ukrainian (Ukraine)" "Slovak (Slovakia)"
"Czech (Czechia)" "Bulgarian (Bulgaria)" "Croatian (Croatia)" "Romanian (Romania)" "Finnish (Finland)"
"Hungarian (Hungary)" "Swedish (Sweden)" "Estonian (Estonia)" "Danish (Denmark)" "Lithuanian (Lithuania)"
"Greek (Greece)" "Maltese (Malta)" "Latvian (Latvia)" "Slovenian (Slovenia)"
)

for i in "${!LANGUAGES[@]}"; do
lang="${LANGUAGES[$i]}"
name="${LANG_NAMES[$i]}"
label="fleurs_${lang}"
output_file="$RESULTS_DIR/${label}_${TIMESTAMP}.json"

log "--- [$((i+1))/${#LANGUAGES[@]}] $name ($lang): starting ($SAMPLES_PER_LANG samples) ---"
start_time=$(date +%s)

"$CLI" fleurs-benchmark \
--languages "$lang" \
--samples "$SAMPLES_PER_LANG" \
--output "$output_file" \
2>&1 | tee -a "$LOG_FILE"

end_time=$(date +%s)
elapsed=$(( end_time - start_time ))
log "--- $name: finished in ${elapsed}s — $output_file ---"
done

SUITE_END=$(date +%s)
SUITE_ELAPSED=$(( SUITE_END - SUITE_START ))
SUITE_HOURS=$(( SUITE_ELAPSED / 3600 ))
SUITE_MINS=$(( (SUITE_ELAPSED % 3600) / 60 ))
SUITE_SECS=$(( SUITE_ELAPSED % 60 ))

log "=== All benchmarks complete in ${SUITE_HOURS}h ${SUITE_MINS}m ${SUITE_SECS}s ==="
log "Results:"
ls -lh "$RESULTS_DIR"/*_${TIMESTAMP}.json 2>/dev/null | tee -a "$LOG_FILE" || true  # don't abort under pipefail if no files matched

# Extract WER from all results
log ""
log "=== WER Summary (100 samples per language) ==="
log ""
printf "%-30s %10s %10s %10s\n" "Language" "WER%" "CER%" "RTFx" | tee -a "$LOG_FILE"
printf "%-30s %10s %10s %10s\n" "------------------------------" "----------" "----------" "----------" | tee -a "$LOG_FILE"

extract_metrics() {
local json_file="$1"
if [[ -f "$json_file" ]]; then
python3 -c "
import json
d = json.load(open('$json_file'))
wer = round(d['summary']['averageWER']*100, 2)
cer = round(d['summary']['averageCER']*100, 2)
rtfx = round(d['summary']['averageRTFx'], 1)
print(f'{wer}\t{cer}\t{rtfx}')
" 2>/dev/null || printf 'N/A\tN/A\tN/A\n'
else
printf 'N/A\tN/A\tN/A\n'
fi
}

for i in "${!LANGUAGES[@]}"; do
lang="${LANGUAGES[$i]}"
name="${LANG_NAMES[$i]}"
json_file="$RESULTS_DIR/fleurs_${lang}_${TIMESTAMP}.json"

metrics=$(extract_metrics "$json_file")
wer=$(echo "$metrics" | cut -f1)
cer=$(echo "$metrics" | cut -f2)
rtfx=$(echo "$metrics" | cut -f3)

printf "%-30s %9s%% %9s%% %9sx\n" "$name ($lang)" "$wer" "$cer" "$rtfx" | tee -a "$LOG_FILE"
done

log ""
log "✅ Full FLEURS benchmark complete"
log "Total samples processed: $(( SAMPLES_PER_LANG * ${#LANGUAGES[@]} ))"
log "Results saved to: $RESULTS_DIR/*_${TIMESTAMP}.json"

# caffeinate will exit automatically since the parent process ($$) exits
@@ -10,7 +10,8 @@ extension AsrManager {
decoderState: inout TdtDecoderState,
contextFrameAdjustment: Int = 0,
isLastChunk: Bool = false,
globalFrameOffset: Int = 0
globalFrameOffset: Int = 0,
language: Language? = nil
) async throws -> (hypothesis: TdtHypothesis, encoderSequenceLength: Int) {

let preprocessorInput = try await preparePreprocessorInput(
@@ -68,7 +69,8 @@ extension AsrManager {
decoderState: &decoderState,
contextFrameAdjustment: contextFrameAdjustment,
isLastChunk: isLastChunk,
globalFrameOffset: globalFrameOffset
globalFrameOffset: globalFrameOffset,
language: language
)

if let preprocessorAudioArray {
@@ -3,7 +3,7 @@ import Foundation
extension AsrManager {

internal func transcribeWithState(
_ audioSamples: [Float], decoderState: inout TdtDecoderState
_ audioSamples: [Float], decoderState: inout TdtDecoderState, language: Language? = nil
) async throws -> ASRResult {
guard isAvailable else { throw ASRError.notInitialized }
guard audioSamples.count >= config.sampleRate else { throw ASRError.invalidAudioData }
@@ -19,7 +19,8 @@ extension AsrManager {
originalLength: frameAlignedLength,
actualAudioFrames: nil, // Will be calculated from originalLength
decoderState: &decoderState,
isLastChunk: true // Single-chunk: always first and last
isLastChunk: true, // Single-chunk: always first and last
language: language
)

let result = processTranscriptionResult(
@@ -43,7 +44,8 @@ extension AsrManager {
progressHandler: { [weak self] progress in
guard let self else { return }
await self.progressEmitter.report(progress: progress)
}
},
language: language
)

return result
@@ -55,7 +57,8 @@ extension AsrManager {
_ chunkSamples: [Float],
decoderState: inout TdtDecoderState,
previousTokens: [Int] = [],
isLastChunk: Bool = false
isLastChunk: Bool = false,
language: Language? = nil
) async throws -> (tokens: [Int], timestamps: [Int], confidences: [Float], encoderSequenceLength: Int) {
let (alignedSamples, frameAlignedLength) = frameAlignedAudio(
chunkSamples, allowAlignment: previousTokens.isEmpty)
@@ -66,7 +69,8 @@ extension AsrManager {
actualAudioFrames: nil, // Will be calculated from originalLength
decoderState: &decoderState,
contextFrameAdjustment: 0, // Non-streaming chunks don't use adaptive context
isLastChunk: isLastChunk
isLastChunk: isLastChunk,
language: language
)

// Apply token deduplication if previous tokens are provided