Merged
1 change: 1 addition & 0 deletions .cspell-wordlist.txt
Original file line number Diff line number Diff line change
@@ -127,3 +127,4 @@ detr
metaprogramming
ktlint
lefthook
espeak
@@ -82,17 +82,24 @@ You need more details? Check the following resources:

## Running the model

The module provides two ways to generate speech:
The module provides two ways to generate speech using either raw text or pre-generated phonemes:

1. [**`forward(text, speed)`**](../../06-api-reference/interfaces/TextToSpeechType.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
### Using Text

1. [**`forward({ text, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
2. [**`stream({ text, speed, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.

### Using Phonemes

If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:

1. [**`forwardFromPhonemes({ phonemes, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#forwardfromphonemes): Generates the complete audio waveform from a phoneme string.
2. [**`streamFromPhonemes({ phonemes, speed, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.

:::note
Since it processes the entire text at once, it might take a significant amount of time to produce an audio for long text inputs.
Since `forward` and `forwardFromPhonemes` process the entire input at once, they might take a significant amount of time to produce audio for long inputs.
:::

2. [**`stream({ text, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator that yields chunks of audio as they are computed.
This is ideal for reducing the "time to first audio" for long sentences.
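
Since `forward` processes the entire input at once (see the note above), one pragmatic workaround for very long text is to split it into sentence-sized pieces first and synthesize each piece separately. The helper below is an illustrative sketch, not part of the library API:

```typescript
// Illustrative helper (not a react-native-executorch export): split long
// text into sentence-sized pieces so each forward() call stays small.
function splitIntoSentences(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/) // break after sentence-ending punctuation
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

Each piece can then be passed to `forward` in turn; alternatively, prefer `stream`, which yields audio chunks as they are computed.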

## Example

### Speech Synthesis
@@ -185,6 +192,48 @@ export default function App() {
}
```

### Synthesis from Phonemes

If you already have a phoneme string obtained from an external source (e.g. the Python `phonemizer` library,
`espeak-ng`, or any custom phonemizer), you can use `forwardFromPhonemes` or `streamFromPhonemes` to synthesize audio directly, skipping the phoneme generation stage.

```tsx
import React from 'react';
import { Button, View } from 'react-native';
import {
useTextToSpeech,
KOKORO_MEDIUM,
KOKORO_VOICE_AF_HEART,
} from 'react-native-executorch';

export default function App() {
const tts = useTextToSpeech({
model: KOKORO_MEDIUM,
voice: KOKORO_VOICE_AF_HEART,
});

const synthesizePhonemes = async () => {
// Example phonemes (IPA) for: "A man who doesn't trust himself, can never really trust anyone else."
const audioData = await tts.forwardFromPhonemes({
phonemes:
'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
});

// ... process or play audioData ...
};

return (
<View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
<Button
title="Synthesize Phonemes"
onPress={synthesizePhonemes}
disabled={!tts.isReady}
/>
</View>
);
}
```
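
The native layer rejects an empty phoneme string, so it can be convenient to fail fast on the JS side before calling `forwardFromPhonemes`. The guard below is hypothetical (the function name is illustrative, not a library export) and only approximates the native-side check:

```typescript
// Hypothetical guard (not part of the library API): approximates the
// native-side validation, which throws on an empty phoneme string.
function assertNonEmptyPhonemes(phonemes: string): string {
  const trimmed = phonemes.trim();
  if (trimmed.length === 0) {
    throw new Error('Phoneme string must not be empty');
  }
  return trimmed;
}
```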

## Supported models

| Model | Language |
@@ -53,16 +53,24 @@ For more information on resource sources, see [loading models](../../01-fundamen

## Running the model

The module provides two ways to generate speech:
The module provides two ways to generate speech using either raw text or pre-generated phonemes:

### Using Text

1. [**`forward(text, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
2. [**`stream({ text, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.

### Using Phonemes

If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:

1. [**`forwardFromPhonemes(phonemes, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forwardfromphonemes): Generates the complete audio waveform from a phoneme string.
2. [**`streamFromPhonemes({ phonemes, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.

:::note
Since it processes the entire text at once, it might take a significant amount of time to produce an audio for long text inputs.
Since `forward` and `forwardFromPhonemes` process the entire input at once, they might take a significant amount of time to produce audio for long inputs.
:::

2. [**`stream({ text, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.

## Example

### Speech Synthesis
@@ -135,3 +143,34 @@ try {
console.error('Streaming failed:', error);
}
```
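
The chunks yielded by `stream` are `Float32Array`s; if you need the complete waveform afterwards (e.g. to save it to a file), the collected chunks can be concatenated. A small self-contained sketch:

```typescript
// Concatenate streamed Float32Array audio chunks into a single waveform.
function concatChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset); // copy chunk into place
    offset += c.length;
  }
  return out;
}
```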

### Synthesis from Phonemes

If you already have a phoneme string (e.g., from an external library), you can use `forwardFromPhonemes` or `streamFromPhonemes` to synthesize audio directly, skipping the internal phonemizer stage.

```typescript
import {
TextToSpeechModule,
KOKORO_MEDIUM,
KOKORO_VOICE_AF_HEART,
} from 'react-native-executorch';

const tts = new TextToSpeechModule();

await tts.load({
model: KOKORO_MEDIUM,
voice: KOKORO_VOICE_AF_HEART,
});

// Example phonemes for "Hello world!"
const waveform = await tts.forwardFromPhonemes('həlˈO wˈɜɹld!', 1.0);

// Or stream from phonemes
for await (const chunk of tts.streamFromPhonemes({
phonemes:
'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
speed: 1.0,
})) {
// ... process chunk ...
}
```
@@ -169,6 +169,14 @@ template <typename Model> class ModelHostObject : public JsiHostObject {
addFunctions(JSI_EXPORT_FUNCTION(ModelHostObject<Model>,
promiseHostFunction<&Model::stream>,
"stream"));
addFunctions(JSI_EXPORT_FUNCTION(
ModelHostObject<Model>,
promiseHostFunction<&Model::generateFromPhonemes>,
"generateFromPhonemes"));
addFunctions(JSI_EXPORT_FUNCTION(
ModelHostObject<Model>,
promiseHostFunction<&Model::streamFromPhonemes>,
"streamFromPhonemes"));
}

if constexpr (meta::HasGenerateFromString<Model>) {
@@ -4,6 +4,7 @@

#include <algorithm>
#include <fstream>
#include <phonemis/utilities/string_utils.h>
#include <rnexecutorch/Error.h>
#include <rnexecutorch/data_processing/Sequential.h>

@@ -73,16 +74,9 @@ void Kokoro::loadVoice(const std::string &voiceSource) {
}
}

std::vector<float> Kokoro::generate(std::string text, float speed) {
if (text.size() > params::kMaxTextSize) {
throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
"Kokoro: maximum input text size exceeded");
}

// G2P (Grapheme to Phoneme) conversion
auto phonemes = phonemizer_.process(text);

// Divide the phonemes string intro substrings.
std::vector<float>
Kokoro::generateFromPhonemesImpl(const std::u32string &phonemes, float speed) {
// Divide the phonemes string into substrings.
// Affects the further calculations only in case of string size
// exceeding the biggest model's input.
auto subsentences =
@@ -98,26 +92,20 @@ std::vector<float> Kokoro::generate(std::string text, float speed) {
size_t pauseMs = params::kPauseValues.contains(lastPhoneme)
? params::kPauseValues.at(lastPhoneme)
: params::kDefaultPause;
std::vector<float> pause(pauseMs * constants::kSamplesPerMilisecond, 0.F);

// Add audio part and pause to the main audio vector
// Add audio part and silence pause to the main audio vector
audio.insert(audio.end(), std::make_move_iterator(audioPart.begin()),
std::make_move_iterator(audioPart.end()));
audio.insert(audio.end(), std::make_move_iterator(pause.begin()),
std::make_move_iterator(pause.end()));
audio.resize(audio.size() + pauseMs * constants::kSamplesPerMilisecond,
0.F);
}

return audio;
}

void Kokoro::stream(std::string text, float speed,
std::shared_ptr<jsi::Function> callback) {
if (text.size() > params::kMaxTextSize) {
throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
"Kokoro: maximum input text size exceeded");
}

// Build a full callback function
void Kokoro::streamFromPhonemesImpl(
const std::u32string &phonemes, float speed,
std::shared_ptr<jsi::Function> callback) {
auto nativeCallback = [this, callback](const std::vector<float> &audioVec) {
if (this->isStreaming_) {
this->callInvoker_->invokeAsync([callback, audioVec](jsi::Runtime &rt) {
@@ -127,21 +115,12 @@ void Kokoro::stream(std::string text, float speed,
}
};

// Mark the beginning of the streaming process
isStreaming_ = true;

// G2P (Grapheme to Phoneme) conversion
auto phonemes = phonemizer_.process(text);

// Divide the phonemes string intro substrings.
// Use specialized implementation to minimize the latency between the
// sentences.
// Use LATENCY strategy to minimize the time-to-first-audio for streaming
auto subsentences =
partitioner_.divide<Partitioner::Strategy::LATENCY>(phonemes);

// We follow the implementation of generate() method, but
// instead of accumulating results in a vector, we push them
// back to the JS side with the callback.
for (size_t i = 0; i < subsentences.size(); i++) {
if (!isStreaming_) {
break;
@@ -151,7 +130,7 @@

// Determine the silent padding duration to be stripped from the edges of
// the generated audio. If a chunk ends with a space or follows one that
// did, it indicates a word boundary split – we use a shorter padding (20ms)
// did, it indicates a word boundary split – we use a shorter padding
// to ensure natural speech flow. Otherwise, we use 50ms for standard
// pauses.
bool endsWithSpace = (subsentence.back() == U' ');
@@ -161,25 +140,67 @@
// Generate an audio vector with the Kokoro model
auto audioPart = synthesize(subsentence, speed, paddingMs);

// Calculate a pause between the sentences
// Calculate and append a pause between the sentences
char32_t lastPhoneme = subsentence.back();
size_t pauseMs = params::kPauseValues.contains(lastPhoneme)
? params::kPauseValues.at(lastPhoneme)
: params::kDefaultPause;
std::vector<float> pause(pauseMs * constants::kSamplesPerMilisecond, 0.F);

// Add pause to the audio vector
audioPart.insert(audioPart.end(), std::make_move_iterator(pause.begin()),
std::make_move_iterator(pause.end()));
audioPart.resize(
audioPart.size() + pauseMs * constants::kSamplesPerMilisecond, 0.F);

// Push the audio right away to the JS side
nativeCallback(audioPart);
}

// Mark the end of the streaming process
isStreaming_ = false;
}

std::vector<float> Kokoro::generate(std::string text, float speed) {
if (text.size() > params::kMaxTextSize) {
throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
"Kokoro: maximum input text size exceeded");
}

// G2P (Grapheme to Phoneme) conversion
auto phonemes = phonemizer_.process(text);

return generateFromPhonemesImpl(phonemes, speed);
}

std::vector<float> Kokoro::generateFromPhonemes(std::string phonemes,
float speed) {
if (phonemes.empty()) {
throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
"Kokoro: phoneme string must not be empty");
}
return generateFromPhonemesImpl(
phonemis::utilities::string_utils::utf8_to_u32string(phonemes), speed);
}

void Kokoro::stream(std::string text, float speed,
std::shared_ptr<jsi::Function> callback) {
if (text.size() > params::kMaxTextSize) {
throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
"Kokoro: maximum input text size exceeded");
}

// G2P (Grapheme to Phoneme) conversion
auto phonemes = phonemizer_.process(text);

streamFromPhonemesImpl(phonemes, speed, callback);
}

void Kokoro::streamFromPhonemes(std::string phonemes, float speed,
std::shared_ptr<jsi::Function> callback) {
if (phonemes.empty()) {
throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
"Kokoro: phoneme string must not be empty");
}
streamFromPhonemesImpl(
phonemis::utilities::string_utils::utf8_to_u32string(phonemes), speed,
callback);
}

void Kokoro::streamStop() noexcept { isStreaming_ = false; }

std::vector<float> Kokoro::synthesize(const std::u32string &phonemes,
@@ -27,11 +27,22 @@ class Kokoro {
// Processes the entire text at once, before sending back to the JS side.
std::vector<float> generate(std::string text, float speed = 1.F);

// Accepts pre-computed phonemes (as a UTF-8 IPA string) and synthesizes
// audio, bypassing the built-in phonemizer. This allows callers to use
// an external G2P system (e.g. the Python `phonemizer` library, espeak-ng,
// or any custom phonemizer).
std::vector<float> generateFromPhonemes(std::string phonemes,
float speed = 1.F);

// Processes text in chunks, sending each chunk individually to the JS side
// with asynchronous callbacks.
void stream(std::string text, float speed,
std::shared_ptr<jsi::Function> callback);

// Streaming variant that accepts pre-computed phonemes instead of text.
void streamFromPhonemes(std::string phonemes, float speed,
std::shared_ptr<jsi::Function> callback);

// Stops the streaming process
void streamStop() noexcept;

@@ -42,6 +53,12 @@
// Helper function - loading voice array
void loadVoice(const std::string &voiceSource);

// Helper function - shared synthesis pipeline (partition + synthesize)
std::vector<float> generateFromPhonemesImpl(const std::u32string &phonemes,
float speed);
void streamFromPhonemesImpl(const std::u32string &phonemes, float speed,
std::shared_ptr<jsi::Function> callback);

// Helper function - generate specialization for given input size
std::vector<float> synthesize(const std::u32string &phonemes, float speed,
size_t paddingMs = 50);