fix(kokoro): voice loading, method selection, padding, and audio safety fixes#943
yocontra wants to merge 6 commits into software-mansion:main
Conversation
Thanks for the contribution, @yocontra! We'll be taking a closer look at this soon. In the meantime, if you're open to sharing your experience using this or any other APIs in the library, we'd love to hear your feedback!

Right now I'm trying to get TTS working well. I was using an ONNX-based approach and got the quality pretty high (after writing my own phonemizer, because nobody else had theirs quite right). I switched to this library because I now need several kinds of models, and I've been fixing issues to bring the quality back to where that other library was. I'm also sending some PRs to the phonemizer this library uses, since it mispronounces some words; fixing those should get it back to perfect :)
I'm very eager for #936, as I'm trying to build a fully local replacement for Gemini Live/ChatGPT Realtime by doing Speech-to-Text -> Llama 3.2 -> Kokoro TTS.
Hey @yocontra
Very nice contribution overall! I left some comments on the more controversial parts of the code, which (to my knowledge) could have a negative impact on the module in some scenarios. We would be grateful if you could look at those and share your opinion with us!
```cpp
durationsPtr[i] = std::max(static_cast<int64_t>(1),
                           static_cast<int64_t>(shrinking ? std::ceil(scaled)
                                                          : std::floor(scaled)));
```
I think it's better not to do that. The reason is not simple, but I will try to explain it briefly:

The way Kokoro is exported to the ExecuTorch format forces us to set an upper bound on total duration. The current bound is 296 duration ticks, and increasing it would increase inference time as well as memory usage. This is due to a well-known ExecuTorch issue with exporting LSTMs with dynamic shapes, so there is no way to remove the bound without sacrificing some performance.

This is very important, because if we get a total duration higher than 296 and do not scale it down to the 296 cap, the model throws a hard invalid-input-shape exception and the app using it will crash. If we keep this implementation with the condition that scaling cannot reduce a duration to 0, we can accidentally end up with a total duration of 297 or 298, which will cause a crash.

Now, the scaling applied in the scaleDurations() method is used very rarely: only if the sum of predicted durations is below 16 or exceeds 296. From my experience with the model, the first case practically never happens, and the second happens only for very long sentences near the maximum of 128 tokens. Those are also rarely processed, because the partitioning algorithm tries to avoid long sequences. So the scaleDurations() call is rare in itself, and dropping a phoneme because of it is even rarer.

So I strongly believe that allowing single phonemes to be dropped in very, very rare cases is better than allowing an accidental crash.

I hope the explanation wasn't too boring :)
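To make the failure mode concrete, here is a minimal sketch of the trade-off described above. The names (`scaleToCap`, `kMaxTotalDuration`) are illustrative, not the library's; the point is that flooring every duration at 1 while shrinking can push the total back above the hard cap:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical sketch, not the library's scaleDurations(): durations are
// shrunk so their sum fits a hard cap of 296 ticks.
constexpr int64_t kMaxTotalDuration = 296;

std::vector<int64_t> scaleToCap(const std::vector<int64_t>& durations) {
  const int64_t total =
      std::accumulate(durations.begin(), durations.end(), int64_t{0});
  if (total <= kMaxTotalDuration) return durations;
  const double factor =
      static_cast<double>(kMaxTotalDuration) / static_cast<double>(total);
  std::vector<int64_t> scaled;
  scaled.reserve(durations.size());
  for (int64_t d : durations) {
    // Shrinking, so flooring keeps the sum under the cap...
    const int64_t v = static_cast<int64_t>(std::floor(d * factor));
    // ...but a floor-at-1 clamp (the change reverted in this PR) can push
    // the total back over the cap.
    scaled.push_back(std::max<int64_t>(1, v));
  }
  return scaled;
}
```

With 300 tokens of duration 1 each (sum 300 > 296), every scaled value floors to 0 and is clamped back to 1, leaving the total at 300, above the cap, which is exactly the invalid-shape crash scenario described above.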
```cpp
// Duration Predictor's output.
static constexpr size_t kSafeTokensLimit = 60;
context_.inputTokensLimit =
    std::min(context_.inputTokensLimit, kSafeTokensLimit);
```
I would rather not do that. The most destructive thing for the output is cutting the text in the middle of a sentence, which creates a very unnatural-sounding pause. We try to avoid that by cutting the text at end-of-sentence or pause marks (such as ',', '.', '?' or '!'), and reducing the input token limit makes it harder to do so.

When implementing the module, I also found that longer inputs produce better, more natural speech; it's the short inputs that can result in speech that is too quick.

We would be grateful if you could provide a sample input that motivated this change, because it would make the issue easier to understand :)
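For intuition, the pause-mark cutting strategy described above can be sketched as follows. This is a hypothetical stand-in, not the library's actual Partitioner; `partitionAtPauses` and `maxLen` are made-up names:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch: split text into chunks of at most maxLen characters,
// cutting just after the last pause mark (',', '.', '?', '!') before the
// limit so a chunk boundary falls on a natural pause, not mid-sentence.
std::vector<std::string> partitionAtPauses(const std::string& text,
                                           std::size_t maxLen) {
  std::vector<std::string> chunks;
  std::size_t start = 0;
  while (start < text.size()) {
    if (text.size() - start <= maxLen) {
      chunks.push_back(text.substr(start));
      break;
    }
    std::size_t cut = start + maxLen;  // fallback: hard cut at the limit
    for (std::size_t i = start + maxLen; i > start; --i) {
      const char c = text[i - 1];
      if (c == '.' || c == ',' || c == '?' || c == '!') {
        cut = i;  // cut just after the pause mark
        break;
      }
    }
    chunks.push_back(text.substr(start, cut - start));
    start = cut;
  }
  return chunks;
}
```

Lowering `maxLen` shrinks the window in which a pause mark can be found, so hard mid-sentence cuts become more likely, which is the concern raised above.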
```diff
-        "[Kokoro::loadSingleVoice]: failed to read voice weights");
+        "[Kokoro::loadVoice]: failed to read voice weights");
   }
 }
```
We never really loaded the entire voice array, because we don't process inputs longer than 128 tokens, and the rest of the voice data we omit is used specifically for longer inputs.

I guess reading the entire file is also OK, although it's a bit of a memory waste since we don't use the remainder of the voice data.
@IgorSwat This change fixed a bunch of the voice files I was trying to use (the Kokoro-82M voice files in their repo, not the .bin files y'all provided). I don't think the input length affects how many rows of the voice weights are needed, but I can look into it more here.
@IgorSwat Thanks for the feedback. I'm new to this TTS stuff and was mainly hacking and slashing my way through getting this to work in my app. The duration stuff is a bit sketchy; I'll try to revert those changes and find another way to fix the issues I was seeing.
Voice loading:
- Read all rows from the voice file instead of truncating to kMaxInputTokens. Voice files (hexgrad/Kokoro-82M) have 510 rows; upstream discards 382.
- Changed voice_ from a fixed std::array to a std::vector, sized from the file.
- voiceID: three-way min(phonemes-1, dpTokens-1, voice_.size()-1) to prevent OOB. Upstream had a latent OOB with voiceID=noTokens on a 128-element array.

Synthesizer method selection:
- Discover forward_N methods at construction, the same pattern DurationPredictor already uses. Falls back to "forward" for older/single-method models.
- Uses execute() instead of forward() for named method dispatch.

Padding fixes:
- Pad indices to inputDurationLimit before the Synthesizer to prevent an XNNPACK shape mismatch on repeated calls with varying duration predictions.
- When the DP and Synth use the same token count (the common case), pass the DP tensor directly to the Synth instead of copying (~320 KB saved).

Perf:
- Use resize() for silence padding instead of allocating temp vectors.
- Move-capture audio in the streaming callback instead of copying.
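The forward_N selection pattern mentioned in the commit above might look roughly like this sketch. The function and variable names (`pickForwardMethod`, `availableSizes`) are assumptions for illustration, not the PR's actual code, and the discovery of exported method names from the model is omitted:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative selection logic: among discovered forward_N methods, pick the
// smallest N that fits the token count; fall back to plain "forward" for
// older single-method models.
std::string pickForwardMethod(std::vector<int> availableSizes,
                              std::size_t tokenCount) {
  std::sort(availableSizes.begin(), availableSizes.end());
  for (int n : availableSizes) {
    if (tokenCount <= static_cast<std::size_t>(n)) {
      return "forward_" + std::to_string(n);
    }
  }
  return "forward";  // no forward_N fits, or none were exported
}
```

Choosing the smallest method that fits keeps padding, and therefore wasted compute, to a minimum for short inputs.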
The Synthesizer's attention drifts on longer sequences (60+ tokens), causing later phonemes to be spoken progressively faster. Cap inputTokensLimit to 60 so the Partitioner splits text into shorter chunks that stay faithful to the Duration Predictor's timing.

Also switch tokenize()'s std::partition to std::stable_partition so phoneme token order is preserved when invalid tokens are filtered out.
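The partition vs. stable_partition difference is easy to demonstrate. A minimal sketch, where the helper name `filterValidTokens` is illustrative and `-1` stands in for an out-of-vocabulary token:

```cpp
#include <algorithm>
#include <vector>

// std::partition may reorder elements within each group, while
// std::stable_partition keeps the kept tokens in their original relative
// order, which matters when the tokens are phonemes whose order encodes
// the speech. (filterValidTokens is an illustrative name.)
std::vector<int> filterValidTokens(std::vector<int> tokens) {
  auto invalidBegin = std::stable_partition(
      tokens.begin(), tokens.end(), [](int t) { return t >= 0; });
  tokens.erase(invalidBegin, tokens.end());
  return tokens;
}
```

With `std::partition` the surviving tokens could come back in any order; with `std::stable_partition` the valid tokens always keep their original sequence.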
Three bugs found via adversarial audit:
- stripAudio: unsigned underflow when lbound < margin wraps size_t to ~2^64, causing an OOB subspan. Guard the subtraction with a comparison first.
- isStreaming_: a plain bool read and written from two threads (the stream loop vs. streamStop from JS). Changed to std::atomic<bool>.
- scaleDurations: aggressive shrinking can drive individual token durations to zero, dropping phonemes from repeatInterleave. Floor each duration at 1 after scaling.
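The stripAudio underflow is worth seeing in isolation. A minimal sketch of the guarded subtraction (the function name is illustrative, not the library's):

```cpp
#include <cstddef>

// With unsigned arithmetic, lbound - margin wraps to a value near 2^64 when
// lbound < margin; a later std::max against 0 can never repair that, because
// the wrapped value is already huge and positive. Comparing before
// subtracting avoids the wrap entirely.
inline std::size_t safeLowerBound(std::size_t lbound, std::size_t margin) {
  return lbound > margin ? lbound - margin : 0;
}
```

This is the standard idiom for subtracting with a floor of zero on unsigned types; the same pattern guards any `audio.size() - 1` style expression against empty inputs.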
…s correction loop

The indices padding to inputDurationLimit (and the cascading synthTokens repadding, dur tensor copying, and getMethodTokenCount helpers) introduced subtle audio quality regressions: robotic pacing, clipping at chunk boundaries, and timing drift. Root cause: padding indices with zeros changed the Synthesizer's input semantics, and the dur tensor repadding discarded valid DP output rows. Additionally, the scaleDurations min-1 clamp was defeated by the correction loop immediately subtracting values back to 0.

Changes:
- Revert synthesize() to pass DP outputs directly to the Synthesizer (same sizes, no repadding)
- Remove getMethodTokenCount() from DurationPredictor.h and Synthesizer.h
- Guard the scaleDurations correction loop against driving durations below 1
- Revert the scaleDurations min-1 clamp to avoid exceeding the 296 duration cap
- Remove the kSafeTokensLimit=60 cap that cut text mid-sentence
- Use std::ranges::find_if in Synthesizer.cpp
Yeah, that's absolutely fine. There are a lot of hidden nuances here, mostly because of ExecuTorch limitations, which force us to do some things in a way they normally should not be done, or at least to complicate them. However, I saw that the LSTM issue was resolved recently, so there is a chance we will be able to remove the entire duration-scaling part very soon.
Fixes for Kokoro TTS native code. Addresses voice data truncation, missing Synthesizer method selection, progressive speed-up on longer inputs, phoneme token reordering, and several additional safety fixes.
Voice loading reads only 128 of 510 rows
`voice_` was a fixed `std::array<..., kMaxInputTokens>` (128 elements), but hexgrad/Kokoro-82M voice files contain 510 rows. The remaining 382 rows were silently dropped.

Changed `voice_` to `std::vector`, sized dynamically from the file. Also fixed an OOB in `voiceID`: upstream used `std::min(phonemes.size() - 1, noTokens)`, where `noTokens` could equal 128, indexing past the end of a 128-element array. Now uses a three-way `std::min({phonemes.size() - 1, noTokens - 1, voice_.size() - 1})`.

Synthesizer doesn't do method selection
`DurationPredictor` discovers and selects from `forward_8`/`forward_32`/`forward_64`/`forward_128` based on input size, but `Synthesizer` only knew about `forward`. Added the same discovery and selection logic. Falls back to `"forward"` if no `forward_N` methods exist, so older models still work.

Audio progressively speeds up on longer inputs
The Synthesizer's attention mechanism drifts on longer input sequences (60+ tokens), causing later phonemes to be spoken progressively faster than the Duration Predictor intended. The DP's timing predictions are correct, but the Synthesizer compresses later phonemes into fewer samples.
Fixed by capping `inputTokensLimit` to 60, which forces the Partitioner to split text into shorter chunks that the Synthesizer can render faithfully. Each chunk is roughly one sentence (~15-20 words).

`tokenize()` scrambles phoneme order on invalid tokens

`std::partition` was used to filter out invalid (unrecognized) phoneme tokens, but `partition` does not preserve relative order. When any phonemes fall outside the vocabulary, the remaining valid tokens could be reordered, producing garbled audio.

Changed to `std::stable_partition`, which preserves relative order.

`stripAudio` unsigned integer underflow

`lbound - margin` wraps `size_t` to ~2^64 when the audio's first non-silent sample is near the start of the buffer (i.e., `lbound < margin`). `std::max(huge_value, 0)` returns the huge value, and `audio.subspan()` reads out of bounds. This is especially easy to trigger in streaming mode, where `paddingMs = 15` (margin = 360 samples) on short chunks.

Fixed by guarding the subtraction: `lbound > margin ? lbound - margin : 0`. Also guarded `audio.size() - 1` against empty spans.

`isStreaming_` data race

`isStreaming_` is a plain `bool` read by `stream()` on a background thread and written by `streamStop()` from the JS thread. Non-atomic access is undefined behavior: the compiler may optimize away the read, making `streamStop()` ineffective.

Changed to `std::atomic<bool>`.

`scaleDurations` drops phonemes

When aggressively shrinking durations (many tokens, few total ticks), individual token durations can be driven to 0 by the correction loop. A zero-duration token is skipped by `repeatInterleave`, effectively deleting that phoneme from the output.

Fixed by clamping each scaled duration to a minimum of 1, and guarding the correction loop so it never drives a duration below 1. Without the correction-loop guard, the clamp is immediately undone: the priority queue picks clamped entries (they have high remainders) and subtracts 1, defeating the purpose.
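The phoneme-dropping mechanism described above is easy to see with a toy repeat-interleave. This is an illustrative model of the behavior, not the library's `repeatInterleave`:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the interleave step: token i is emitted durations[i] times.
// A token whose duration was scaled to 0 contributes nothing, so that
// phoneme silently disappears from the synthesized output.
std::vector<int> repeatInterleave(const std::vector<int>& tokens,
                                  const std::vector<int64_t>& durations) {
  std::vector<int> out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    for (int64_t j = 0; j < durations[i]; ++j) {
      out.push_back(tokens[i]);
    }
  }
  return out;
}
```

For tokens `{1, 2, 3}` with durations `{2, 0, 1}`, token `2` is never emitted, which is the phoneme-deletion effect the min-1 clamp was meant to prevent.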
Misc perf

- Silence padding uses `resize()` directly on the output

Changes

- `Kokoro.h`: `voice_` from fixed array to vector, `isStreaming_` to `std::atomic<bool>`
- `Kokoro.cpp`: `loadVoice()`, `synthesize()`, `generate()`, `stream()`, constructor token limit cap
- `DurationPredictor.cpp`: `scaleDurations()` min-1 clamp with correction-loop guard
- `Synthesizer.h`: `forwardMethods_` member
- `Synthesizer.cpp`: method discovery and selection
- `Utils.cpp`: `stable_partition` in `tokenize()`, `stripAudio` underflow guard
Kokoro.h—voice_from fixed array to vector,isStreaming_tostd::atomic<bool>Kokoro.cpp—loadVoice(),synthesize(),generate(),stream(), constructor token limit capDurationPredictor.cpp—scaleDurations()min-1 clamp with correction loop guardSynthesizer.h—forwardMethods_memberSynthesizer.cpp— method discovery and selectionUtils.cpp—stable_partitionintokenize(),stripAudiounderflow guard