fix: speech to text live transcription#816

Open
IgorSwat wants to merge 14 commits into main from @is/speech-to-text

@IgorSwat
Contributor

Description

Various improvements & adjustments in Speech-to-Text module. The list of changes includes:

  • Adjusting native implementation to the new format of Whisper models (single file, bundled encode & decode methods)
  • Refactoring native implementation in order to support multiple STT models in the future
  • (IN PROGRESS): Fixing incorrect behavior of Whisper streaming

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

You can run the tests defined for Speech-to-Text module, as well as test it manually with the 'speech' demo app (SpeechToText screen).

Screenshots

Related issues

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

IMPORTANT:
This PR is not yet ready to be merged: I am still working on some specific aspects of the streaming algorithm. However, you are welcome to review the architectural design of the code, especially the proposed approach to supporting multiple implementations of the STT module.

@msluszniak msluszniak added the bug fix PRs that are fixing bugs label Feb 20, 2026
@msluszniak msluszniak linked an issue Feb 20, 2026 that may be closed by this pull request
@msluszniak msluszniak changed the title from "@is/speech to text" to "fix: speech to text live transcription" Feb 20, 2026
@IgorSwat IgorSwat force-pushed the @is/speech-to-text branch from 7b1e6ff to 2ee6d1d Compare March 2, 2026 09:21
Member

@msluszniak msluszniak left a comment

Some comments are not needed imo

Collaborator

@chmjkb chmjkb left a comment

Overall solid work, thanks 👏🏻
Left a couple of nits

this->decoder->unload();
: callInvoker_(std::move(callInvoker)) {
// Switch between the ASR implementations based on model name
if (modelName == "whisper") {
Collaborator

food for thought: as we discussed a few days back, think about how we can make it work so that the native side doesn't need the model name, but accepts a bunch of configurable pipeline steps. no need to do this now IMO, but just a note.

Maybe we can have different ASR implementations based on whether the model supports timestamps or not?
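
A rough sketch of that idea, assuming the caller passes capability flags instead of a model name. Everything here (`AsrCapabilities`, `DecodingStrategy`, `makeStrategy`) is hypothetical and only meant to show the selection logic, not an API that exists in the library.

```cpp
#include <memory>
#include <string>

// Hypothetical capability flags supplied by the caller instead of a model name.
struct AsrCapabilities {
  bool supportsTimestamps = false;
};

// Hypothetical base class for the decoding step of the pipeline.
class DecodingStrategy {
public:
  virtual ~DecodingStrategy() = default;
  // Returns a label so the chosen strategy can be inspected.
  virtual std::string strategy() const = 0;
};

// Would use timestamp tokens to align and cut segments.
class TimestampedDecoding : public DecodingStrategy {
public:
  std::string strategy() const override { return "timestamps"; }
};

// Would fall back to plain text decoding without alignment.
class PlainDecoding : public DecodingStrategy {
public:
  std::string strategy() const override { return "plain"; }
};

// Picks the implementation from capabilities alone; no model name is involved.
std::unique_ptr<DecodingStrategy> makeStrategy(const AsrCapabilities &caps) {
  if (caps.supportsTimestamps) {
    return std::make_unique<TimestampedDecoding>();
  }
  return std::make_unique<PlainDecoding>();
}
```

The trade-off is that the set of flags becomes part of the public contract, so each new model family may require a new flag rather than a new name.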

std::shared_ptr<OwningArrayBuffer>
SpeechToText::encode(std::span<float> waveform) const {
- std::vector<float> encoderOutput = this->asr->encode(waveform);
+ std::vector<float> encoderOutput = transcriber_->encode(waveform);
Collaborator

I'm wondering whether we need to return a std::vector from the encoder. Maybe we could just return a span, since we wrap the result in OwningArrayBuffer, which copies the data anyway.

Collaborator

@chmjkb chmjkb left a comment

Two more things:

  1. I wasn't able to compile the app for Android (due to Norbert bumping minSdkVersion in RNET). You have to bump the minSdkVersion in the example app.
  2. Once compiled, it doesn't ask for mic permissions (I'm using a Pixel 10) and silently fails.

@IgorSwat IgorSwat force-pushed the @is/speech-to-text branch from ef854bc to d253381 Compare March 5, 2026 11:25
Development

Successfully merging this pull request may close these issues.

Fix Speech to Text streaming mode

3 participants