# PR #515: Add script filtering for Cyrillic/Latin disambiguation (fixes #512)
**Status:** Open
**Alex-Wengg** wants to merge 11 commits into `main` from `feat/script-filtering-issue-512`
## Commits

- `8d51747` docs: Fix speaker diarization model references from 3.1 to community-1
- `dda36e8` docs: Clarify diarization pipeline version differences
- `7150b55` Merge branch 'main' into docs/clarify-diarization-pipeline-versions
- `828256b` feat: Add script filtering for Cyrillic/Latin disambiguation (fixes #…
- `12ea22c` docs: Add FLEURS baseline benchmark results (main branch, before scri…
- `771e207` Merge branch 'main' of https://github.com/FluidInference/FluidAudio i…
- `eef13c8` fix: Address all 4 critical issues from Devin AI review of PR #515
- `bbf98df` chore: Add FLEURS Parakeet benchmark script and apply swift-format
- `4bdf2cb` feat: Enable script filtering in FLEURS benchmark
- `19cb911` feat: Add language parameter to URL-based transcribe methods
- `923412f` fix: Only apply script filtering when top-1 token is wrong script
## Files changed

### New file (121 lines): FLEURS baseline benchmark results (Markdown)
# FLEURS Full Benchmark Results - Parakeet v3 Baseline

**Date:** 2026-04-11
**Branch:** `main`
**Model:** Parakeet TDT v3 (0.6B)
**Samples:** 100 per language × 24 languages = 2,400 total
**Duration:** 21 minutes 39 seconds

## Summary

This benchmark establishes the baseline performance of Parakeet v3 on the FLEURS multilingual dataset before implementing script filtering for issue #512.

**Key Findings:**
- Polish shows 8.98% WER, confirming the Cyrillic script confusion issue
- All languages maintain real-time performance (RTFx > 40x)
- Average RTFx across all languages: 62.6x
- Best performance: Italian (3.46% WER)
- Lowest performance: Greek (38.91% WER)
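The Cyrillic/Latin confusion behind issue #512 is easy to see at the character level. Below is a minimal sketch, not FluidAudio's implementation (which operates on decoder tokens rather than finished transcripts), that classifies a string by the Unicode script of its letters:

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Classify text as LATIN, CYRILLIC, or OTHER by the majority script
    of its alphabetic characters. Illustrative only."""
    counts = {"LATIN": 0, "CYRILLIC": 0, "OTHER": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names begin with the script name, e.g.
        # "LATIN SMALL LETTER A WITH OGONEK" for Polish 'ą'.
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["LATIN"] += 1
        elif name.startswith("CYRILLIC"):
            counts["CYRILLIC"] += 1
        else:
            counts["OTHER"] += 1
    return max(counts, key=counts.get)

print(dominant_script("szybki brązowy lis"))  # Latin-script Polish
print(dominant_script("быстрая лиса"))        # Cyrillic text
```

A check like this, run over a Polish transcript, flags the failure mode from issue #512: the model emitting Cyrillic tokens for Latin-script audio.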
## Complete Results

| Language | Code | WER% | CER% | RTFx | Duration | Samples |
|----------|------|------|------|------|----------|---------|
| English (US) | en_us | 4.57 | 2.46 | 47.9x | 953.9s | 100 |
| Spanish (Latin America) | es_419 | 3.80 | 1.59 | 67.6x | 1200.8s | 100 |
| Italian (Italy) | it_it | 3.46 | 1.35 | 86.1x | 1516.9s | 100 |
| French (France) | fr_fr | 6.59 | 2.86 | 50.0x | 1073.7s | 100 |
| German (Germany) | de_de | 5.92 | 2.69 | 53.8x | 1496.2s | 100 |
| Russian (Russia) | ru_ru | 7.01 | 2.01 | 64.1x | 1136.6s | 100 |
| Dutch (Netherlands) | nl_nl | 8.12 | 3.07 | 52.6x | 1009.6s | 100 |
| **Polish (Poland)** | **pl_pl** | **8.98** | **3.17** | **53.0x** | **964.7s** | **100** |
| Ukrainian (Ukraine) | uk_ua | 7.02 | 2.12 | 59.3x | 1098.1s | 100 |
| Slovak (Slovakia) | sk_sk | 13.96 | 5.39 | 46.2x | 1196.3s | 100 |
| Czech (Czechia) | cs_cz | 11.28 | 3.67 | 68.0x | 1239.0s | 100 |
| Bulgarian (Bulgaria) | bg_bg | 11.78 | 3.74 | 47.8x | 1021.9s | 100 |
| Croatian (Croatia) | hr_hr | 13.52 | 4.06 | 60.0x | 1025.7s | 100 |
| Romanian (Romania) | ro_ro | 15.02 | 4.63 | 68.2x | 1110.8s | 100 |
| Finnish (Finland) | fi_fi | 16.08 | 4.98 | 66.1x | 1348.5s | 100 |
| Hungarian (Hungary) | hu_hu | 19.52 | 6.52 | 84.8x | 1295.2s | 100 |
| Swedish (Sweden) | sv_se | 17.44 | 5.83 | 65.6x | 1079.0s | 100 |
| Estonian (Estonia) | et_ee | 19.66 | 4.31 | 68.8x | 1198.9s | 100 |
| Danish (Denmark) | da_dk | 19.62 | 7.56 | 56.9x | 1125.7s | 100 |
| Lithuanian (Lithuania) | lt_lt | 25.33 | 7.45 | 70.5x | 1055.8s | 100 |
| **Greek (Greece)** | **el_gr** | **38.91** | **15.45** | **72.1x** | **1098.7s** | **100** |
| Maltese (Malta) | mt_mt | 29.59 | 11.23 | 68.1x | 1399.1s | 100 |
| Latvian (Latvia) | lv_lv | 26.20 | 7.35 | 76.1x | 1176.1s | 100 |
| Slovenian (Slovenia) | sl_si | 27.10 | 9.83 | 43.0x | 940.0s | 100 |

**Polish** is highlighted as the target language for issue #512 (Cyrillic script confusion).
**Greek** shows the highest WER, indicating potential room for improvement.
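For reference, RTFx here is read as audio duration divided by processing time, under the assumption that the Duration column holds audio seconds rather than compute time. The implied compute time per language is then small:

```python
# Sanity check, assuming RTFx = audio_seconds / processing_seconds.
# The inputs are copied from the English and Italian rows above.
def implied_processing_seconds(audio_s: float, rtfx: float) -> float:
    """Wall-clock seconds implied by an RTFx figure."""
    return audio_s / rtfx

print(round(implied_processing_seconds(953.9, 47.9), 1))   # English row
print(round(implied_processing_seconds(1516.9, 86.1), 1))  # Italian row
```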
## Performance Categories

### Excellent (WER < 5%)
- 🥇 Italian: 3.46%
- 🥈 Spanish: 3.80%
- 🥉 English: 4.57%

### Very Good (WER 5-7%)
- German: 5.92%
- French: 6.59%
- Russian: 7.01%
- Ukrainian: 7.02%

### Good (WER 8-10%)
- Dutch: 8.12%
- Polish: 8.98% ← **Target for script filtering improvement**

### Moderate (WER 11-16%)
- Czech: 11.28%
- Bulgarian: 11.78%
- Croatian: 13.52%
- Slovak: 13.96%
- Romanian: 15.02%
- Finnish: 16.08%

### Fair (WER 17-20%)
- Swedish: 17.44%
- Hungarian: 19.52%
- Danish: 19.62%
- Estonian: 19.66%

### Lower (WER > 20%)
- Lithuanian: 25.33%
- Latvian: 26.20%
- Slovenian: 27.10%
- Maltese: 29.59%
- Greek: 38.91%

## Methodology

- **Model**: Parakeet TDT v3 (0.6B) with standard JointDecision (argmax only)
- **Dataset**: FLEURS multilingual benchmark
- **Sample Size**: 100 utterances per language
- **Evaluation**: Levenshtein distance for WER/CER calculation
- **Hardware**: Apple Silicon (M-series)
- **Compute Units**: Neural Engine + GPU
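The WER/CER numbers come from a Levenshtein (edit-distance) comparison. A minimal word-level version, shown as a sketch rather than the project's exact evaluator:

```python
def levenshtein(a: list, b: list) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution
            ))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref = reference.split()
    return levenshtein(ref, hypothesis.split()) / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # one substitution in four words
```

Passing lists of characters instead of word lists gives CER by the same recurrence.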
## Next Steps

1. Implement script filtering using JointDecisionv3 (top-K outputs)
2. Re-run the benchmark on the `feat/script-filtering-issue-512` branch
3. Compare WER improvement for Polish and other affected languages
4. Validate no regression on languages without script ambiguity
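Step 1 can be sketched as a rescoring rule over the top-K joint outputs: keep the argmax token when it is already in the expected script, and otherwise fall back to the best same-script candidate, matching the intent of commit `923412f` (only intervene when the top-1 token is in the wrong script). All names below are hypothetical, not FluidAudio's API:

```python
def is_latin(token: str) -> bool:
    """True if no letter falls in the main Cyrillic block (U+0400..U+04FF).
    A rough check for illustration; a real filter would use full script data."""
    letters = [c for c in token if c.isalpha()]
    return all(not (0x0400 <= ord(c) <= 0x04FF) for c in letters)

def pick_token(candidates: list[tuple[str, float]]) -> str:
    """candidates: hypothetical (token, score) pairs, sorted by descending score."""
    top_token, _ = candidates[0]
    if is_latin(top_token):
        return top_token          # top-1 already in the expected script
    for token, _ in candidates[1:]:
        if is_latin(token):
            return token          # best Latin-script alternative in top-K
    return top_token              # no alternative; keep the argmax choice

print(pick_token([("шиб", 0.41), ("szyb", 0.39), ("шип", 0.11)]))
```

The key property for step 4 (no regression) is that the rule is a no-op whenever the argmax token is already in the expected script.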
## Raw Results

Individual JSON results saved to:

```
benchmark_results/fleurs_*_20260411_224806.json
```

Full benchmark log:

```
benchmark_results/fleurs_full_benchmark_20260411_224806.log
```
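The per-language JSON files can be rolled up with a few lines. The `summary` keys (`averageWER`, `averageCER`, `averageRTFx`, stored as fractions) are the same ones the benchmark script's `extract_metrics` helper reads:

```python
import glob
import json

def summarize(pattern: str) -> list[tuple[str, float, float, float]]:
    """Collect (path, WER%, CER%, RTFx) from each per-language result file."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            s = json.load(f)["summary"]
        rows.append((path,
                     round(s["averageWER"] * 100, 2),
                     round(s["averageCER"] * 100, 2),
                     round(s["averageRTFx"], 1)))
    return rows

for path, wer_pct, cer_pct, rtfx in summarize("benchmark_results/fleurs_*_20260411_224806.json"):
    print(f"{path}: WER {wer_pct}% CER {cer_pct}% RTFx {rtfx}x")
```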
## Related Issues

- [#512](https://github.com/FluidInference/FluidAudio/issues/512) - Polish utterances transcribed in Cyrillic instead of Latin script
- [#515](https://github.com/FluidInference/FluidAudio/pull/515) - Script filtering implementation (in progress)
### New file (167 lines): `Scripts/fleurs_full_benchmark.sh`
```bash
#!/bin/bash
# Run FLEURS full multilingual benchmark (100 samples x 24 languages = 2,400 samples) with sleep prevention.
#
# Benchmarks all 24 languages supported by Parakeet TDT v3:
#   Best (WER < 5%):     en_us, es_419, it_it, fr_fr, de_de
#   Good (5-10%):        ru_ru, nl_nl, pl_pl, uk_ua, sk_sk
#   Moderate (10-15%):   cs_cz, bg_bg, hr_hr, ro_ro, fi_fi
#   Lower (>15%):        hu_hu, sv_se, et_ee, da_dk, lt_lt, el_gr, mt_mt, lv_lv, sl_si
#
# Usage:
#   ./Scripts/fleurs_full_benchmark.sh
#
# The script downloads FLEURS data automatically if needed.
# Uses caffeinate to prevent sleep so you can close the lid.
# Results are saved to benchmark_results/ with timestamps.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
RESULTS_DIR="$PROJECT_DIR/benchmark_results"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$RESULTS_DIR/fleurs_full_benchmark_${TIMESTAMP}.log"
SAMPLES_PER_LANG=100

# All 24 supported languages
LANGUAGES=(
    # Best performing (WER < 5%)
    "en_us" "es_419" "it_it" "fr_fr" "de_de"
    # Good performance (WER 5-10%)
    "ru_ru" "nl_nl" "pl_pl" "uk_ua" "sk_sk"
    # Moderate performance (WER 10-15%)
    "cs_cz" "bg_bg" "hr_hr" "ro_ro" "fi_fi"
    # Lower performance (WER > 15%)
    "hu_hu" "sv_se" "et_ee" "da_dk" "lt_lt" "el_gr" "mt_mt" "lv_lv" "sl_si"
)

MODELS_DIR="$HOME/Library/Application Support/FluidAudio/Models"

mkdir -p "$RESULTS_DIR"

log() {
    echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

# Verify Parakeet v3 models exist
verify_models() {
    local v3_dir="$MODELS_DIR/parakeet-tdt-0.6b-v3"
    for f in Preprocessor.mlmodelc Encoder.mlmodelc Decoder.mlmodelc JointDecision.mlmodelc parakeet_vocab.json; do
        if [[ ! -e "$v3_dir/$f" ]]; then
            log "MISSING v3: $v3_dir/$f"
            return 1
        fi
    done
    return 0
}

log "=== Verifying Parakeet v3 models ==="
if ! verify_models; then
    log ""
    log "ERROR: Parakeet v3 models missing."
    log "Please run the ASR benchmark first to download the models."
    exit 1
fi
log "Parakeet v3 models verified. FLEURS data will download automatically if needed."

log "=== FLEURS full benchmark: $SAMPLES_PER_LANG samples x ${#LANGUAGES[@]} languages = $(( SAMPLES_PER_LANG * ${#LANGUAGES[@]} )) total ==="
log "Results directory: $RESULTS_DIR"

cd "$PROJECT_DIR"

# Build release binary if not already built
if [[ ! -x ".build/release/fluidaudiocli" ]]; then
    log "Building release binary..."
    swift build -c release 2>&1 | tail -1 | tee -a "$LOG_FILE"
fi
CLI="$PROJECT_DIR/.build/release/fluidaudiocli"

# caffeinate -s: prevent system sleep even on AC power / lid closed
# caffeinate -i: prevent idle sleep
# caffeinate -w $$: exit automatically when this script's process exits
caffeinate -si -w $$ &
CAFFEINATE_PID=$!
log "caffeinate started (PID $CAFFEINATE_PID) - safe to close the lid"

SUITE_START=$(date +%s)

# Human-readable names, index-aligned with LANGUAGES
LANG_NAMES=(
    "English (US)" "Spanish (Latin America)" "Italian (Italy)" "French (France)" "German (Germany)"
    "Russian (Russia)" "Dutch (Netherlands)" "Polish (Poland)" "Ukrainian (Ukraine)" "Slovak (Slovakia)"
    "Czech (Czechia)" "Bulgarian (Bulgaria)" "Croatian (Croatia)" "Romanian (Romania)" "Finnish (Finland)"
    "Hungarian (Hungary)" "Swedish (Sweden)" "Estonian (Estonia)" "Danish (Denmark)" "Lithuanian (Lithuania)"
    "Greek (Greece)" "Maltese (Malta)" "Latvian (Latvia)" "Slovenian (Slovenia)"
)

# Run all languages
for i in "${!LANGUAGES[@]}"; do
    lang="${LANGUAGES[$i]}"
    name="${LANG_NAMES[$i]}"
    label="fleurs_${lang}"
    output_file="$RESULTS_DIR/${label}_${TIMESTAMP}.json"

    log "--- [$((i+1))/${#LANGUAGES[@]}] $name ($lang): starting ($SAMPLES_PER_LANG samples) ---"
    start_time=$(date +%s)

    "$CLI" fleurs-benchmark \
        --languages "$lang" \
        --samples "$SAMPLES_PER_LANG" \
        --output "$output_file" \
        2>&1 | tee -a "$LOG_FILE"

    end_time=$(date +%s)
    elapsed=$(( end_time - start_time ))
    log "--- $name: finished in ${elapsed}s -> $output_file ---"
done

SUITE_END=$(date +%s)
SUITE_ELAPSED=$(( SUITE_END - SUITE_START ))
SUITE_HOURS=$(( SUITE_ELAPSED / 3600 ))
SUITE_MINS=$(( (SUITE_ELAPSED % 3600) / 60 ))
SUITE_SECS=$(( SUITE_ELAPSED % 60 ))

log "=== All benchmarks complete in ${SUITE_HOURS}h ${SUITE_MINS}m ${SUITE_SECS}s ==="
log "Results:"
# "|| true" keeps set -e / pipefail from aborting the summary if no files matched
ls -lh "$RESULTS_DIR"/*_"${TIMESTAMP}".json 2>/dev/null | tee -a "$LOG_FILE" || true

# Extract WER/CER/RTFx from all result files
log ""
log "=== WER Summary ($SAMPLES_PER_LANG samples per language) ==="
log ""
printf "%-30s %10s %10s %10s\n" "Language" "WER%" "CER%" "RTFx" | tee -a "$LOG_FILE"
printf "%-30s %10s %10s %10s\n" "------------------------------" "----------" "----------" "----------" | tee -a "$LOG_FILE"

extract_metrics() {
    local json_file="$1"
    if [[ -f "$json_file" ]]; then
        python3 -c "
import json
d = json.load(open('$json_file'))
wer = round(d['summary']['averageWER']*100, 2)
cer = round(d['summary']['averageCER']*100, 2)
rtfx = round(d['summary']['averageRTFx'], 1)
print(f'{wer}\t{cer}\t{rtfx}')
" 2>/dev/null || printf 'N/A\tN/A\tN/A\n'  # printf, not echo: plain echo does not expand \t
    else
        printf 'N/A\tN/A\tN/A\n'
    fi
}

for i in "${!LANGUAGES[@]}"; do
    lang="${LANGUAGES[$i]}"
    name="${LANG_NAMES[$i]}"
    json_file="$RESULTS_DIR/fleurs_${lang}_${TIMESTAMP}.json"

    metrics=$(extract_metrics "$json_file")
    wer=$(echo "$metrics" | cut -f1)
    cer=$(echo "$metrics" | cut -f2)
    rtfx=$(echo "$metrics" | cut -f3)

    printf "%-30s %9s%% %9s%% %9sx\n" "$name ($lang)" "$wer" "$cer" "$rtfx" | tee -a "$LOG_FILE"
done

log ""
log "✅ Full FLEURS benchmark complete"
log "Total samples processed: $(( SAMPLES_PER_LANG * ${#LANGUAGES[@]} ))"
log "Results saved to: $RESULTS_DIR/*_${TIMESTAMP}.json"

# caffeinate exits automatically because -w $$ ties it to this script's lifetime
```