121 changes: 121 additions & 0 deletions Documentation/fleurs-full-benchmark-baseline.md
@@ -0,0 +1,121 @@
# FLEURS Full Benchmark Results - Parakeet v3 Baseline

**Date:** 2026-04-11
**Branch:** `main`
**Model:** Parakeet TDT v3 (0.6B)
**Samples:** 100 per language × 24 languages = 2,400 total
**Duration:** 21 minutes 39 seconds

## Summary

This benchmark establishes the baseline performance of Parakeet v3 on the FLEURS multilingual dataset before implementing script filtering for issue #512.

**Key Findings:**
- Polish shows 8.98% WER, consistent with the Cyrillic script confusion reported in issue #512
- All languages run faster than real time (RTFx > 40x)
- Average RTFx across all languages: 62.6x
- Best performance: Italian (3.46% WER)
- Worst performance: Greek (38.91% WER)
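
For context, RTFx (real-time factor) is audio duration divided by wall-clock processing time, so RTFx > 1 means faster than real time. A minimal sketch of the arithmetic:

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds

# e.g. 10 s of audio transcribed in 0.2 s of wall-clock time runs at 50x real time
```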

## Complete Results

| Language | Code | WER% | CER% | RTFx | Duration | Samples |
|----------|------|------|------|------|----------|---------|
| English (US) | en_us | 4.57 | 2.46 | 47.9x | 953.9s | 100 |
| Spanish (Latin America) | es_419 | 3.80 | 1.59 | 67.6x | 1200.8s | 100 |
| Italian (Italy) | it_it | 3.46 | 1.35 | 86.1x | 1516.9s | 100 |
| French (France) | fr_fr | 6.59 | 2.86 | 50.0x | 1073.7s | 100 |
| German (Germany) | de_de | 5.92 | 2.69 | 53.8x | 1496.2s | 100 |
| Russian (Russia) | ru_ru | 7.01 | 2.01 | 64.1x | 1136.6s | 100 |
| Dutch (Netherlands) | nl_nl | 8.12 | 3.07 | 52.6x | 1009.6s | 100 |
| **Polish (Poland)** | **pl_pl** | **8.98** | **3.17** | **53.0x** | **964.7s** | **100** |
| Ukrainian (Ukraine) | uk_ua | 7.02 | 2.12 | 59.3x | 1098.1s | 100 |
| Slovak (Slovakia) | sk_sk | 13.96 | 5.39 | 46.2x | 1196.3s | 100 |
| Czech (Czechia) | cs_cz | 11.28 | 3.67 | 68.0x | 1239.0s | 100 |
| Bulgarian (Bulgaria) | bg_bg | 11.78 | 3.74 | 47.8x | 1021.9s | 100 |
| Croatian (Croatia) | hr_hr | 13.52 | 4.06 | 60.0x | 1025.7s | 100 |
| Romanian (Romania) | ro_ro | 15.02 | 4.63 | 68.2x | 1110.8s | 100 |
| Finnish (Finland) | fi_fi | 16.08 | 4.98 | 66.1x | 1348.5s | 100 |
| Hungarian (Hungary) | hu_hu | 19.52 | 6.52 | 84.8x | 1295.2s | 100 |
| Swedish (Sweden) | sv_se | 17.44 | 5.83 | 65.6x | 1079.0s | 100 |
| Estonian (Estonia) | et_ee | 19.66 | 4.31 | 68.8x | 1198.9s | 100 |
| Danish (Denmark) | da_dk | 19.62 | 7.56 | 56.9x | 1125.7s | 100 |
| Lithuanian (Lithuania) | lt_lt | 25.33 | 7.45 | 70.5x | 1055.8s | 100 |
| **Greek (Greece)** | **el_gr** | **38.91** | **15.45** | **72.1x** | **1098.7s** | **100** |
| Maltese (Malta) | mt_mt | 29.59 | 11.23 | 68.1x | 1399.1s | 100 |
| Latvian (Latvia) | lv_lv | 26.20 | 7.35 | 76.1x | 1176.1s | 100 |
| Slovenian (Slovenia) | sl_si | 27.10 | 9.83 | 43.0x | 940.0s | 100 |

**Polish** is highlighted as the target language for issue #512 (Cyrillic script confusion).
**Greek** has the highest WER of all 24 languages and the largest headroom for improvement.

## Performance Categories

### Excellent (WER < 5%)
- 🥇 Italian: 3.46%
- 🥈 Spanish: 3.80%
- 🥉 English: 4.57%

### Very Good (WER 5-8%)
- German: 5.92%
- French: 6.59%
- Russian: 7.01%
- Ukrainian: 7.02%

### Good (WER 8-10%)
- Dutch: 8.12%
- Polish: 8.98% ← **Target for script filtering improvement**

### Moderate (WER 11-16%)
- Czech: 11.28%
- Bulgarian: 11.78%
- Croatian: 13.52%
- Slovak: 13.96%
- Romanian: 15.02%
- Finnish: 16.08%

### Fair (WER 17-20%)
- Swedish: 17.44%
- Hungarian: 19.52%
- Danish: 19.62%
- Estonian: 19.66%

### Lower (WER > 20%)
- Lithuanian: 25.33%
- Latvian: 26.20%
- Slovenian: 27.10%
- Maltese: 29.59%
- Greek: 38.91%

## Methodology

- **Model**: Parakeet TDT v3 (0.6B) with standard JointDecision (argmax only)
- **Dataset**: FLEURS multilingual benchmark
- **Sample Size**: 100 utterances per language
- **Evaluation**: Levenshtein distance for WER/CER calculation
- **Hardware**: Apple Silicon (M-series)
- **Compute Units**: Neural Engine + GPU
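
As an illustration of the evaluation metric only (a sketch, not the project's actual Swift implementation), WER and CER from Levenshtein distance over words and characters respectively:

```python
def levenshtein(a, b) -> int:
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return levenshtein(reference, hypothesis) / len(reference)
```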

## Next Steps

1. Implement script filtering using JointDecisionv3 (top-K outputs)
2. Re-run benchmark on `feat/script-filtering-issue-512` branch
3. Compare WER improvement for Polish and other affected languages
4. Validate no regression on languages without script ambiguity
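
The idea behind step 1 can be sketched as follows (all names here are hypothetical — the real implementation works on the JointDecisionv3 Core ML head's top-K outputs in Swift): at each decoding step, prefer the best-scoring candidate whose text matches the expected script, falling back to the plain argmax when none qualifies.

```python
import unicodedata

def is_latin(token: str) -> bool:
    """True when every alphabetic character in the token is Latin-script."""
    return all(
        "LATIN" in unicodedata.name(ch, "")
        for ch in token
        if ch.isalpha()
    )

def pick_token(top_k, allowed=is_latin):
    """top_k: [(token_text, score), ...] sorted by descending score.
    Return the best-scoring candidate in the allowed script; fall back
    to the argmax when no candidate qualifies."""
    for text, _score in top_k:
        if allowed(text):
            return text
    return top_k[0][0]

# Hypothetical decoding step where the argmax is Cyrillic but a Latin
# candidate appears in the top-K:
# pick_token([("привет", 0.6), ("privet", 0.3)]) -> "privet"
```

This is why the baseline uses "argmax only" JointDecision while the fix requires top-K outputs: with only the argmax available there is no alternative candidate to fall back to.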

## Raw Results

Individual JSON results saved to:
```
benchmark_results/fleurs_*_20260411_224806.json
```

Full benchmark log:
```
benchmark_results/fleurs_full_benchmark_20260411_224806.log
```
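
A sketch of reading these per-language JSON files, with the schema assumed from `extract_metrics` in the benchmark script (`summary.averageWER`, `summary.averageCER`, `summary.averageRTFx`, WER/CER stored as fractions):

```python
import glob
import json

def summarize(result: dict) -> tuple:
    """Extract (WER%, CER%, RTFx) from one per-language result dict.
    averageWER/averageCER are fractions, so scale to percent."""
    s = result["summary"]
    return (
        round(s["averageWER"] * 100, 2),
        round(s["averageCER"] * 100, 2),
        round(s["averageRTFx"], 1),
    )

if __name__ == "__main__":
    for path in sorted(glob.glob("benchmark_results/fleurs_*_20260411_224806.json")):
        with open(path) as f:
            wer, cer, rtfx = summarize(json.load(f))
        print(f"{path:60s} {wer:6.2f}% {cer:6.2f}% {rtfx:6.1f}x")
```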

## Related Issues

- [#512](https://github.com/FluidInference/FluidAudio/issues/512) - Polish utterances transcribed in Cyrillic instead of Latin script
- [#515](https://github.com/FluidInference/FluidAudio/pull/515) - Script filtering implementation (in progress)
167 changes: 167 additions & 0 deletions Scripts/fleurs_parakeet_sub_benchmark.sh
@@ -0,0 +1,167 @@
#!/bin/bash
# Run FLEURS full multilingual benchmark (100 samples x 24 languages = 2,400 samples) with sleep prevention.
#
# Benchmarks all 24 languages supported by Parakeet TDT v3, grouped by
# approximate expected WER tier (see the baseline doc for measured values):
#   Best:     en_us, es_419, it_it, fr_fr, de_de
#   Good:     ru_ru, nl_nl, pl_pl, uk_ua, sk_sk
#   Moderate: cs_cz, bg_bg, hr_hr, ro_ro, fi_fi
#   Lower:    hu_hu, sv_se, et_ee, da_dk, lt_lt, el_gr, mt_mt, lv_lv, sl_si
#
# Usage:
# ./Scripts/fleurs_parakeet_sub_benchmark.sh
#
# The script downloads FLEURS data automatically if needed.
# Uses caffeinate to prevent sleep so you can close the lid.
# Results are saved to benchmark_results/ with timestamps.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
RESULTS_DIR="$PROJECT_DIR/benchmark_results"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$RESULTS_DIR/fleurs_full_benchmark_${TIMESTAMP}.log"
SAMPLES_PER_LANG=100

# All 24 supported languages
LANGUAGES=(
# Best performing (WER < 5%)
"en_us" "es_419" "it_it" "fr_fr" "de_de"
# Good performance (WER 5-10%)
"ru_ru" "nl_nl" "pl_pl" "uk_ua" "sk_sk"
# Moderate performance (WER 10-15%)
"cs_cz" "bg_bg" "hr_hr" "ro_ro" "fi_fi"
# Lower performance (WER > 15%)
"hu_hu" "sv_se" "et_ee" "da_dk" "lt_lt" "el_gr" "mt_mt" "lv_lv" "sl_si"
)

MODELS_DIR="$HOME/Library/Application Support/FluidAudio/Models"

mkdir -p "$RESULTS_DIR"

log() {
echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

# Verify Parakeet v3 models exist
verify_models() {
local v3_dir="$MODELS_DIR/parakeet-tdt-0.6b-v3"
for f in Preprocessor.mlmodelc Encoder.mlmodelc Decoder.mlmodelc JointDecision.mlmodelc parakeet_vocab.json; do
if [[ ! -e "$v3_dir/$f" ]]; then
log "MISSING v3: $v3_dir/$f"
return 1
fi
done
return 0
}

log "=== Verifying Parakeet v3 models ==="
if ! verify_models; then
log ""
log "ERROR: Parakeet v3 models missing."
log "Please run ASR benchmark first to download models."
exit 1
fi
log "Parakeet v3 models verified. FLEURS data will download automatically if needed."

log "=== FLEURS full benchmark: $SAMPLES_PER_LANG samples x ${#LANGUAGES[@]} languages = $(( SAMPLES_PER_LANG * ${#LANGUAGES[@]} )) total ==="
log "Results directory: $RESULTS_DIR"

cd "$PROJECT_DIR"

# Build release if not already built
if [[ ! -x ".build/release/fluidaudiocli" ]]; then
log "Building release binary..."
swift build -c release 2>&1 | tail -1 | tee -a "$LOG_FILE"
fi
CLI="$PROJECT_DIR/.build/release/fluidaudiocli"

# caffeinate -s: prevent sleep even on AC power / lid closed
# caffeinate -i: prevent idle sleep
caffeinate -si -w $$ &
CAFFEINATE_PID=$!
log "caffeinate started (PID $CAFFEINATE_PID) — safe to close the lid"

SUITE_START=$(date +%s)

# Run all languages
LANG_NAMES=(
"English (US)" "Spanish (Spain)" "Italian (Italy)" "French (France)" "German (Germany)"
"Russian (Russia)" "Dutch (Netherlands)" "Polish (Poland)" "Ukrainian (Ukraine)" "Slovak (Slovakia)"
"Czech (Czechia)" "Bulgarian (Bulgaria)" "Croatian (Croatia)" "Romanian (Romania)" "Finnish (Finland)"
"Hungarian (Hungary)" "Swedish (Sweden)" "Estonian (Estonia)" "Danish (Denmark)" "Lithuanian (Lithuania)"
"Greek (Greece)" "Maltese (Malta)" "Latvian (Latvia)" "Slovenian (Slovenia)"
)

for i in "${!LANGUAGES[@]}"; do
lang="${LANGUAGES[$i]}"
name="${LANG_NAMES[$i]}"
label="fleurs_${lang}"
output_file="$RESULTS_DIR/${label}_${TIMESTAMP}.json"

log "--- [$((i+1))/${#LANGUAGES[@]}] $name ($lang): starting ($SAMPLES_PER_LANG samples) ---"
start_time=$(date +%s)

"$CLI" fleurs-benchmark \
--languages "$lang" \
--samples "$SAMPLES_PER_LANG" \
--output "$output_file" \
2>&1 | tee -a "$LOG_FILE"

end_time=$(date +%s)
elapsed=$(( end_time - start_time ))
log "--- $name: finished in ${elapsed}s — $output_file ---"
done

SUITE_END=$(date +%s)
SUITE_ELAPSED=$(( SUITE_END - SUITE_START ))
SUITE_HOURS=$(( SUITE_ELAPSED / 3600 ))
SUITE_MINS=$(( (SUITE_ELAPSED % 3600) / 60 ))
SUITE_SECS=$(( SUITE_ELAPSED % 60 ))

log "=== All benchmarks complete in ${SUITE_HOURS}h ${SUITE_MINS}m ${SUITE_SECS}s ==="
log "Results:"
ls -lh "$RESULTS_DIR"/*_${TIMESTAMP}.json 2>/dev/null | tee -a "$LOG_FILE" || true  # don't abort under pipefail if no files matched

# Extract WER from all results
log ""
log "=== WER Summary (100 samples per language) ==="
log ""
printf "%-30s %10s %10s %10s\n" "Language" "WER%" "CER%" "RTFx" | tee -a "$LOG_FILE"
printf "%-30s %10s %10s %10s\n" "------------------------------" "----------" "----------" "----------" | tee -a "$LOG_FILE"

extract_metrics() {
local json_file="$1"
if [[ -f "$json_file" ]]; then
python3 -c "
import json
d = json.load(open('$json_file'))
wer = round(d['summary']['averageWER']*100, 2)
cer = round(d['summary']['averageCER']*100, 2)
rtfx = round(d['summary']['averageRTFx'], 1)
print(f'{wer}\t{cer}\t{rtfx}')
" 2>/dev/null || printf 'N/A\tN/A\tN/A\n'
else
printf 'N/A\tN/A\tN/A\n'
fi
}

for i in "${!LANGUAGES[@]}"; do
lang="${LANGUAGES[$i]}"
name="${LANG_NAMES[$i]}"
json_file="$RESULTS_DIR/fleurs_${lang}_${TIMESTAMP}.json"

metrics=$(extract_metrics "$json_file")
wer=$(echo "$metrics" | cut -f1)
cer=$(echo "$metrics" | cut -f2)
rtfx=$(echo "$metrics" | cut -f3)

printf "%-30s %9s%% %9s%% %9sx\n" "$name ($lang)" "$wer" "$cer" "$rtfx" | tee -a "$LOG_FILE"
done

log ""
log "✅ Full FLEURS benchmark complete"
log "Total samples processed: $(( SAMPLES_PER_LANG * ${#LANGUAGES[@]} ))"
log "Results saved to: $RESULTS_DIR/*_${TIMESTAMP}.json"

# caffeinate will exit automatically since the parent process ($$) exits
@@ -10,7 +10,8 @@ extension AsrManager {
decoderState: inout TdtDecoderState,
contextFrameAdjustment: Int = 0,
isLastChunk: Bool = false,
globalFrameOffset: Int = 0
globalFrameOffset: Int = 0,
language: Language? = nil
) async throws -> (hypothesis: TdtHypothesis, encoderSequenceLength: Int) {

let preprocessorInput = try await preparePreprocessorInput(
@@ -68,7 +69,8 @@ extension AsrManager {
decoderState: &decoderState,
contextFrameAdjustment: contextFrameAdjustment,
isLastChunk: isLastChunk,
globalFrameOffset: globalFrameOffset
globalFrameOffset: globalFrameOffset,
language: language
)

if let preprocessorAudioArray {
@@ -3,7 +3,7 @@ import Foundation
extension AsrManager {

internal func transcribeWithState(
_ audioSamples: [Float], decoderState: inout TdtDecoderState
_ audioSamples: [Float], decoderState: inout TdtDecoderState, language: Language? = nil
) async throws -> ASRResult {
guard isAvailable else { throw ASRError.notInitialized }
guard audioSamples.count >= config.sampleRate else { throw ASRError.invalidAudioData }
@@ -19,7 +19,8 @@ extension AsrManager {
originalLength: frameAlignedLength,
actualAudioFrames: nil, // Will be calculated from originalLength
decoderState: &decoderState,
isLastChunk: true // Single-chunk: always first and last
isLastChunk: true, // Single-chunk: always first and last
language: language
)

let result = processTranscriptionResult(
@@ -43,7 +44,8 @@ extension AsrManager {
progressHandler: { [weak self] progress in
guard let self else { return }
await self.progressEmitter.report(progress: progress)
}
},
language: language
)

return result
@@ -55,7 +57,8 @@ extension AsrManager {
_ chunkSamples: [Float],
decoderState: inout TdtDecoderState,
previousTokens: [Int] = [],
isLastChunk: Bool = false
isLastChunk: Bool = false,
language: Language? = nil
) async throws -> (tokens: [Int], timestamps: [Int], confidences: [Float], encoderSequenceLength: Int) {
let (alignedSamples, frameAlignedLength) = frameAlignedAudio(
chunkSamples, allowAlignment: previousTokens.isEmpty)
@@ -66,7 +69,8 @@ extension AsrManager {
actualAudioFrames: nil, // Will be calculated from originalLength
decoderState: &decoderState,
contextFrameAdjustment: 0, // Non-streaming chunks don't use adaptive context
isLastChunk: isLastChunk
isLastChunk: isLastChunk,
language: language
)

// Apply token deduplication if previous tokens are provided