Fix Japanese TDT model download filename mismatch by Alex-Wengg · Pull Request #522 · FluidInference/FluidAudio

Alex-Wengg · 2026-04-13T02:10:59Z

Fixes the infinite re-download loop for Japanese TDT models reported in #521.

Problem

The download() function was using hardcoded Names.decoderFile and Names.jointFile for all model versions. For .tdtJa, this downloaded:

Decoder.mlmodelc
JointDecision.mlmodelc

But modelsExist() checks for version-specific filenames:

Decoderv2.mlmodelc
Jointerv2.mlmodelc

This mismatch caused the existence check to fail, triggering cache purge and re-download in an infinite loop.

Solution

Use getModelFileNames(version) in the download function to get the correct filenames for each version, matching what modelsExist() expects.

Testing

Build passes
Filenames now match between download and existence check

The download() function was using hardcoded Names.decoderFile and Names.jointFile for all versions, but .tdtJa requires Decoderv2.mlmodelc and Jointerv2.mlmodelc. This caused modelsExist() to fail after download, triggering cache purge and infinite re-download loop. Now uses getModelFileNames(version) to get correct filenames per version.

github-actions · 2026-04-13T02:16:01Z

Kokoro TTS Smoke Test ✅

Check	Result
Build	✅
Model download	✅
Model load	✅
Synthesis pipeline	✅
Output WAV	✅ (634.8 KB)

_{Runtime: 0m42s}

_{Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.}

github-actions · 2026-04-13T02:16:45Z

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric	Value	Description
WER (Avg)	7.03%	Average Word Error Rate
WER (Med)	4.17%	Median Word Error Rate
RTFx	6.68x	Real-time factor (higher = faster)
Total Audio	470.6s	Total audio duration processed
Total Time	71.1s	Total processing time

Streaming Metrics

Metric	Value	Description
Avg Chunk Time	0.071s	Average chunk processing time
Max Chunk Time	0.142s	Maximum chunk processing time
EOU Detections	0	Total End-of-Utterance detections

_{Test runtime: 1m19s • 04/13/2026, 01:34 AM EST}

_{RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O}

github-actions · 2026-04-13T02:17:03Z

VAD Benchmark Results

Performance Comparison

Dataset	Accuracy	Precision	Recall	F1-Score	RTFx	Files
MUSAN	92.0%	86.2%	100.0%	92.6%	349.2x faster	50
VOiCES	92.0%	86.2%	100.0%	92.6%	372.9x faster	50

Dataset Details

MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions · 2026-04-13T02:19:09Z

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric	Value	Target	Status
DER	33.4%	<35%	✅
Miss Rate	24.4%	-	-
False Alarm	0.2%	-	-
Speaker Error	8.8%	-	-
RTFx	10.2x	>1.0x	✅
Speakers	4/4	-	-

_{Sortformer High-Latency • ES2004a • Runtime: 2m 38s • 2026-04-13T05:38:47.990Z}

github-actions · 2026-04-13T02:19:09Z

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric	Value	Target	Status	Description
DER	14.5%	<20%	✅	Diarization Error Rate (lower is better)
RTFx	4.79x	>1.0x	✅	Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage	Time (s)	%	Description
Model Download	13.317	6.1	Fetching diarization models
Model Compile	5.707	2.6	CoreML compilation
Audio Load	0.054	0.0	Loading audio file
Segmentation	23.502	10.7	VAD + speech detection
Embedding	217.914	99.5	Speaker embedding extraction
Clustering (VBx)	0.799	0.4	Hungarian algorithm + VBx clustering
Total	218.902	100	Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method	DER	Mode	Description
FluidAudio (Offline)	14.5%	VBx Batch	On-device CoreML with optimal clustering
FluidAudio (Streaming)	17.7%	Chunk-based	First-occurrence speaker mapping
Research baseline	18-30%	Various	Standard dataset performance

Pipeline Details:

Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
Segmentation: VAD-based voice activity detection
Embeddings: WeSpeaker-compatible speaker embeddings
Clustering: PowerSet with VBx refinement
Accuracy: Higher than streaming due to optimal post-hoc mapping

_{🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 242.2s processing • Test runtime: 4m 5s • 04/13/2026, 01:34 AM EST}

github-actions · 2026-04-13T02:20:42Z

Qwen3-ASR int8 Smoke Test ✅

Check	Result
Build	✅
Model download	✅
Model load	✅
Transcription pipeline	✅
Decoder size	571 MB (vs 1.1 GB f32)

Performance Metrics

Metric	CI Value	Expected on Apple Silicon
Median RTFx	0.05x	~2.5x
Overall RTFx	0.05x	~2.5x

_{Runtime: 4m31s}

_{Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.}

github-actions · 2026-04-13T02:22:28Z

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.57%	0.00%	3.84x	✅
test-other	1.35%	0.00%	2.41x	✅

Parakeet v2 (English-optimized)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.80%	0.00%	3.76x	✅
test-other	1.56%	0.00%	2.57x	✅

Streaming (v3)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.43x	Streaming real-time factor
Avg Chunk Time	2.080s	Average time to process each chunk
Max Chunk Time	2.572s	Maximum chunk processing time
First Token	2.482s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming (v2)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.45x	Streaming real-time factor
Avg Chunk Time	1.988s	Average time to process each chunk
Max Chunk Time	2.641s	Maximum chunk processing time
First Token	2.212s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

_{Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming}

_{25 files per dataset • Test runtime: 7m29s • 04/13/2026, 01:33 AM EST}

_{RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)}

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

_{Testing methodology follows HuggingFace Open ASR Leaderboard}

github-actions · 2026-04-13T02:22:29Z

PocketTTS Smoke Test ✅

Check	Result
Build	✅
Model download	✅
Model load	✅
Synthesis pipeline	✅
Output WAV	✅ (195.0 KB)

_{Runtime: 0m50s}

_{Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.}

github-actions · 2026-04-13T02:33:47Z

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric	Value	Target	Status	Description
DER	15.1%	<30%	✅	Diarization Error Rate (lower is better)
JER	24.9%	<25%	✅	Jaccard Error Rate
RTFx	17.95x	>1.0x	✅	Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage	Time (s)	%	Description
Model Download	10.720	18.3	Fetching diarization models
Model Compile	4.594	7.9	CoreML compilation
Audio Load	0.128	0.2	Loading audio file
Segmentation	17.529	30.0	Detecting speech regions
Embedding	29.214	50.0	Extracting speaker voices
Clustering	11.686	20.0	Grouping same speakers
Total	58.451	100	Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method	DER	Notes
FluidAudio	15.1%	On-device CoreML
Research baseline	18-30%	Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

M2 MacBook Air (2022): Runs at 150 RTFx real-time
Performance scales with Apple Neural Engine capabilities

_{🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 58.4s diarization time • Test runtime: 2m 29s • 04/13/2026, 01:41 AM EST}

Josscii · 2026-04-13T03:14:51Z

And also, these dead code can be removed, since we don't support ctc-ja:

FluidAudio/Sources/FluidAudio/ModelNames.swift

Lines 292 to 309 in 4ef33f0

    
           public enum CTCJa { 
        
               public static let preprocessor = "Preprocessor" 
        
               public static let encoder = "Encoder" 
        
               public static let decoder = "CtcDecoder" 
        
               public static let preprocessorFile = preprocessor + ".mlmodelc" 
        
               public static let encoderFile = encoder + ".mlmodelc" 
        
               public static let decoderFile = decoder + ".mlmodelc" 
        
               // Vocabulary JSON path 
        
               public static let vocabularyFile = "vocab.json" 
        
               public static let requiredModels: Set<String> = [ 
        
                   preprocessorFile, 
        
                   encoderFile, 
        
                   decoderFile, 
        
               ] 
        
           }

Josscii · 2026-04-13T03:24:29Z

These code is also not correct in my opinion:

FluidAudio/Sources/FluidAudio/ModelNames.swift

Lines 675 to 677 in 4ef33f0

    
           case .parakeetJa: 
        
               // Repo contains BOTH CTC and TDT models - return union of both sets 
        
               return ModelNames.CTCJa.requiredModels.union(ModelNames.TDTJa.requiredModels)

@Josscii

The parakeetJa repo contains both CTC and TDT models, but FluidAudio only supports TDT Japanese models. The CTC-only models were being downloaded but never used (no CtcJaManager exists). Changes: - Remove ModelNames.CTCJa enum (dead code) - Update parakeetJa case to only download TDTJa models - Update comment to reflect that CTC models are not supported - Saves bandwidth by not downloading unused CTC models Addresses feedback from @Josscii in PR #522

@Josscii

The parakeetJa repo contains both CTC and TDT models, but FluidAudio only supports TDT Japanese models. The CTC-only models were being downloaded but never used (no CtcJaManager exists). Changes: - Remove ModelNames.CTCJa enum (dead code) - Update parakeetJa case to only download TDTJa models - Update comment to reflect that CTC models are not supported - Saves bandwidth by not downloading unused CTC models Addresses feedback from @Josscii in PR #522

Josscii · 2026-04-13T04:17:13Z

the tdt-ja asr has some bug, here is the transcribed text:

token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072token_3072。token_3072token_3072編集のピアtoken_3072長谷川明token_3072token_3072 token_3072 ありが謗胱胱胱胱胱謗胱謗胱謗莱謗謗謗胱謗胱胱謗莱莱謗莱胱謗胱胱胱胱胱莱莱莱謗莱胱謗莱胱token_3072胱 胱 胱 token_3072 token_3072 胱 token_3072胱 胱 胱 token_3072 token_3072token_3072token_3072 ありが謗 token_3072胱莱胱胱胱謗謗莱謗胱謗 胱莱胱胱胱胱胱胱胱胱謗 胱胱胱胱胱胱胱胱胱胱胱謗 ありが まずは胱胱胱胱胱謗謗謗謗謗謗謗胱謗謗謗 しかし胱謗謗謗謗謗 しかし しかしtoken_3072token_3072胱胱胱胱胱胱謗胱胱胱胱胱胱謗胱 ありが謗胱胱胱胱胱胱 胱莱 token_3072 しかし謗謗莱胱胱胱胱胱謗謗 胱token_3072謗謗謗謗謗謗謗謗謗謗謗謗

新生活・新年度の心得.txt

@Josscii

The vocabulary files were defined but not included in requiredModels, causing an infinite re-download loop because modelsExist() checks for them but download() didn't fetch them. Bug symptoms: - AsrModels.download() would complete without downloading vocab files - modelsExist() would return false (vocab missing) - Download would re-trigger, clearing cache and re-downloading - Infinite loop until user intervention - Models would produce garbage output (e.g., "token_3072") due to missing vocab files Root cause: ModelNames enums defined vocabularyFile/vocabularyPath but didn't include them in their requiredModels sets. This affected: - ASR v2/v3 TDT models (parakeet_vocab.json) - ASR 110m fused models (parakeet_vocab.json) - CTC models (vocab.json) - CTC zh-CN models (vocab.json) - TDT Japanese models (vocab.json) Fix: Add vocabulary files to all affected requiredModels sets so DownloadUtils.downloadRepo() includes them in the download. Fixes the infinite loop bug reported by @Josscii in PR #522. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Josscii · 2026-04-13T04:37:36Z

ea9ca45

This change is wrong, it is not the root cause, please revert.

I think the blankId miss match is the real cause, here we should check the blankId for each model:

FluidAudio/Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager.swift

Lines 213 to 227 in 4ef33f0

    
           // Adapt config's encoderHiddenSize to match the loaded model version 
        
           // (e.g. default config uses 1024 but tdtCtc110m needs 512) 
        
           let adaptedConfig: ASRConfig 
        
           if config.encoderHiddenSize != models.version.encoderHiddenSize { 
        
               adaptedConfig = ASRConfig( 
        
                   sampleRate: config.sampleRate, 
        
                   tdtConfig: config.tdtConfig, 
        
                   encoderHiddenSize: models.version.encoderHiddenSize, 
        
                   parallelChunkConcurrency: config.parallelChunkConcurrency, 
        
                   streamingEnabled: config.streamingEnabled, 
        
                   streamingThreshold: config.streamingThreshold 
        
               ) 
        
           } else { 
        
               adaptedConfig = config 
        
           }

This reverts commit ea9ca45.

@Josscii

The Japanese TDT model uses blankId=3072, but the default TdtConfig uses blankId=8192 (for v3 models). When the config was adapted for encoderHiddenSize in AsrManager, the blankId was not being adapted to match the model's blankId. This caused the decoder to treat blank token 3072 as a regular token, resulting in "token_3072" appearing repeatedly in transcription output (as reported by @Josscii in PR #522). Fix: Adapt both encoderHiddenSize AND blankId when creating the adapted config, using models.version.blankId for the correct value. Fixes #522 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Updated benchmark documentation and baseline to reflect Japanese TDT model performance after blankId fix in PR #522. Changes: - Documentation/ASR/benchmarks100.md: Add Japanese TDT section (7.77% CER) - Scripts/parakeet_subset_benchmark.sh: Update baseline from 6.11% to 7.77% Results (JSUT dataset, 100 files): - Mean CER: 7.77% (down from 11.31% before blankId fix) - Median CER: 6.35% - 46% below 5% CER, 64% below 10% CER, 93% below 20% CER - RTFx: 27.7x The improvement from 11.31% → 7.77% CER (31% relative) is due to fixing the blankId mismatch where the model used blankId=3072 but the decoder was configured for blankId=8192, causing blank tokens to be treated as regular tokens. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Josscii · 2026-04-13T05:13:46Z

Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager.swift

+        let needsHiddenSizeAdaptation = config.encoderHiddenSize != models.version.encoderHiddenSize
+        let needsBlankIdAdaptation = config.tdtConfig.blankId != models.version.blankId
+
+        if needsHiddenSizeAdaptation || needsBlankIdAdaptation {


These two check are unrelated, they should be in separated logic.

@Josscii

Separated blankId and encoderHiddenSize adaptations into independent logic blocks as suggested by @Josscii in review. Changes: - Step 1: Adapt blankId if needed (workingConfig) - Step 2: Adapt encoderHiddenSize if needed (adaptedConfig) Before: Both checks were combined with OR, creating unclear dependency After: Each adaptation is handled independently in sequence This makes the code clearer and follows separation of concerns - blankId and encoderHiddenSize are unrelated model properties that should be adapted independently. Addresses review feedback in PR #522. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Josscii · 2026-04-13T09:48:54Z

#524 please see this issue

This comment was marked as resolved.

Sign in to view

Fix isModelValid to use version-specific decoder/joint filenames

294fe10

Alex-Wengg force-pushed the fix/ja-tdt-download-version-filenames branch from 0225390 to 846924a Compare April 13, 2026 03:35

This comment was marked as resolved.

Sign in to view

Alex-Wengg and others added 3 commits April 13, 2026 00:38

Revert "Fix missing vocabulary files in ASR model downloads"

41b266c

This reverts commit ea9ca45.

Josscii reviewed Apr 13, 2026

View reviewed changes

Conversation

Alex-Wengg commented Apr 13, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Testing

Uh oh!

This comment was marked as resolved.

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Kokoro TTS Smoke Test ✅

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Parakeet EOU Benchmark Results ✅

Performance Metrics

Streaming Metrics

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

VAD Benchmark Results

Performance Comparison

Dataset Details

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Offline VBx Pipeline Timing Breakdown

Speaker Diarization Research Comparison

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Qwen3-ASR int8 Smoke Test ✅

Performance Metrics

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ASR Benchmark Results ✅

Parakeet v3 (multilingual)

Parakeet v2 (English-optimized)

Streaming (v3)

Streaming (v2)

Expected RTFx Performance on Physical M1 Hardware:

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PocketTTS Smoke Test ✅

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Diarization Pipeline Timing Breakdown

Speaker Diarization Research Comparison

Uh oh!

Josscii commented Apr 13, 2026

Uh oh!

Josscii commented Apr 13, 2026

Uh oh!

Josscii commented Apr 13, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Josscii commented Apr 13, 2026

Uh oh!

Josscii Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Josscii commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Alex-Wengg commented Apr 13, 2026 •

edited by devin-ai-integration bot

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading