update-from-whisper-x#1

Open
croquies wants to merge 167 commits into fika-dev:main from m-bain:main

Conversation

@croquies

No description provided.

m-bain and others added 30 commits July 11, 2024 13:01
Update alignment.py - added alignment for sk and sl languages
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Updated Norwegian Bokmål and Norwegian Nynorsk models

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Force ctranslate2 to version 4.4.0 due to the libcudnn_ops_infer.so.8 error:
SYSTRAN/faster-whisper#729

Co-authored-by: Icaro Bombonato <ibombonatosites@gmail.com>
* Update faster-whisper to 1.0.2 to enable model distil-large-v3

* feat: add hotwords option to default_asr_options

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>


---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
* chore: bump faster-whisper to 1.1.0

* chore: bump pyannote to 3.3.2

* feat: add multilingual option in load_model function

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>


---------

Co-authored-by: Abhishek Sharma <abhishek@zipteams.com>
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
…mode (#867)

Adds a local_files_only parameter (default False, for consistency) to whisperx.load_model so the user can avoid downloading the model and instead use the path to the locally cached file if it exists.
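The cache-or-download behavior described above can be sketched generically (an illustrative pattern, not whisperx's actual internals; `resolve_model` and `downloader` are hypothetical names):

```python
import os

def resolve_model(name, cache_dir, local_files_only=False, downloader=None):
    """Return a cached model path, downloading only when allowed."""
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        return path  # a cached copy is used either way
    if local_files_only:
        raise FileNotFoundError(f"{name} is not cached and downloads are disabled")
    return downloader(name, cache_dir)  # fetch into the cache
```

With local_files_only=True, a missing model fails fast instead of triggering a download, which is what offline workflows need.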

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
feat: restrict Python versions to 3.9 - 3.12
Barabazs and others added 30 commits October 21, 2025 09:13
* fix: pin huggingface-hub<1.0.0 for pyannote-audio compatibility

pyannote-audio uses the deprecated `use_auth_token` parameter which was removed in huggingface-hub v1.0.0

* fix: upgrade yanked dependencies

* chore: update version to 3.7.5
* chore: drop python 3.9 support

- Update requires-python to >=3.10
- Remove onnxruntime constraint (only needed for 3.9)
- Simplify numpy (remove version markers and upper bound)
- Remove pandas upper bound (<2.3.0 was for 3.9 compat)
- Remove av direct dependency (transitive via faster-whisper)

* chore(ci): remove python 3.9 from workflows

- Update build-and-release to use Python 3.10
- Remove 3.9 from python-compatibility matrix

* chore: bump version to 3.7.6
Replace O(n*m) pandas operations with O(n log m) interval tree queries
for speaker assignment, where n = words/segments and m = diarization segments.

Performance improvement:
- 7-minute video (1185 words, 147 segments): 73.9s -> 0.32s (228x faster)
- 3-hour podcast: Minutes of processing -> Seconds

Changes:
- Add IntervalTree class using sorted array + binary search
- Refactor assign_word_speakers to use interval tree for overlap queries
- Maintain backward compatibility with same function signature
- Identical output to original implementation

The interval tree uses numpy arrays for efficient storage and binary search
(np.searchsorted) for O(log n) candidate finding, then filters candidates
for actual overlaps.

Fixes #1335
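The sorted-array approach above can be sketched as follows (a simplified illustration of the idea, not the project's exact IntervalTree; the candidate filter here scans linearly after the binary search, while the real implementation prunes further):

```python
import numpy as np

class IntervalTree:
    """Sorted-array interval index: binary search for candidates, then filter."""

    def __init__(self, starts, ends, labels):
        order = np.argsort(starts)
        self.starts = np.asarray(starts, dtype=float)[order]
        self.ends = np.asarray(ends, dtype=float)[order]
        self.labels = [labels[i] for i in order]

    def best_overlap(self, start, end):
        # Binary search: only segments starting before `end` can overlap.
        hi = np.searchsorted(self.starts, end, side="right")
        best, best_ov = None, 0.0
        for i in range(hi):
            ov = min(end, self.ends[i]) - max(start, self.starts[i])
            if ov > best_ov:
                best, best_ov = self.labels[i], ov
        return best

tree = IntervalTree(
    [0.0, 5.0, 10.0], [5.0, 10.0, 15.0],
    ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
)
```

Speaker assignment then picks, for each word interval, the diarization segment with the largest temporal overlap, replacing the per-word pandas scans.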
…ssignment

Optimize assign_word_speakers with interval tree for 228x speedup
Fix: pass no_repeat_ngram_size and repetition_penalty to CTranslate2 generate()
[BugFix] The variable I removed was not being used anywhere.
[BugFix] Type hint fix in decode_batch: List[str], not str.
* fix: derive SRT/VTT cue times from word-level timestamps (#1315)

Subtitle cue start/end times were sourced from VAD segment boundaries
instead of word-level timestamps from forced alignment. This caused cues
to appear prematurely and could produce backwards chronological ordering
when VAD segments overlap.

Use min(word starts) / max(word ends) for cue timing, falling back to
segment-level times only when all words are unalignable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version to 3.7.7 in pyproject.toml

---------

Co-authored-by: Claude-Assistant <noreply@anthropic.com>
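The min/max cue-timing rule above can be sketched as (the segment dict shape is assumed to mirror whisperx's aligned output):

```python
def cue_times(segment):
    """Derive cue start/end from aligned word timestamps, with a fallback."""
    starts = [w["start"] for w in segment.get("words", []) if "start" in w]
    ends = [w["end"] for w in segment.get("words", []) if "end" in w]
    if starts and ends:
        return min(starts), max(ends)
    # All words unalignable: fall back to the VAD segment boundaries.
    return segment["start"], segment["end"]
```

Because the cue window is bounded by actual word timestamps, cues no longer inherit padded VAD boundaries and cannot start before the first aligned word.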
…-1 (#1349)

* feat: upgrade pyannote-audio dependency to v4

* fix: rename use_auth_token to token for pyannote-audio v4 compatibility

* fix: add omegaconf dep

* fix: use structured output API for pyannote-audio v4 diarization

pyannote-audio 4.x no longer returns a plain Annotation (or a tuple when
return_embeddings=True). It now returns a structured output with
speaker_diarization and speaker_embeddings attributes.

* feat: switch default diarization model to speaker-diarization-community-1

Update default from pyannote/speaker-diarization-3.1 to
pyannote/speaker-diarization-community-1 (pyannote-audio v4),
add CC-BY-4.0 attribution, and update README for v4 API changes.

* fix: correct markdown link formatting for silero-vad in README.md

* chore: update version to 3.8.0


Co-authored-by: Giorgio Azzinnaro <giorgio@azzinna.ro>
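A small compatibility shim illustrates the structured-output change described above (the attribute names come from the commit; the helper itself is hypothetical):

```python
def extract_diarization(output):
    """Normalize pyannote-audio 3.x/4.x pipeline outputs to (annotation, embeddings)."""
    if hasattr(output, "speaker_diarization"):
        # 4.x structured output
        return output.speaker_diarization, getattr(output, "speaker_embeddings", None)
    if isinstance(output, tuple):
        # 3.x with return_embeddings=True
        return output
    # 3.x plain Annotation
    return output, None
```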
…g paths (#1285)

- Add `model_cache_only` param to `load_align_model()`, pass as `local_files_only` to HuggingFace `from_pretrained` calls
- Forward `model_dir` and `model_cache_only` to both `load_align_model` call sites (initial load and language-change reload)
- Add `cache_dir` param to `DiarizationPipeline.__init__`, forward to pyannote `Pipeline.from_pretrained`
- Pass `--model_dir` as `cache_dir` when constructing `DiarizationPipeline` in CLI

Previously only the ASR model respected these flags. Alignment and diarization models would always download from HuggingFace to the default cache, breaking offline and custom-cache workflows.


---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Forward the existing --hf_token CLI argument to faster-whisper's
WhisperModel via a new use_auth_token parameter on load_model(),
enabling downloads of gated/private HuggingFace models.
It works with the initial prompt added.

Ran pdb to verify and checked the output.

Long audio works.

Existing logic is correct without the flag.
Added an `and` condition before streams; existing logic is not changed.
[New File] benchmark testing
Pass through the average log probability (transcription confidence score)
from ctranslate2 to the final segment output. The field is NotRequired
so existing code constructing segments without it remains valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
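The NotRequired field can be expressed with a TypedDict sketch (field names assumed to mirror the segment shape; NotRequired lives in typing on Python 3.11+, typing_extensions earlier):

```python
from typing import TypedDict
try:
    from typing import NotRequired  # Python 3.11+
except ImportError:
    from typing_extensions import NotRequired

class Segment(TypedDict):
    start: float
    end: float
    text: str
    avg_logprob: NotRequired[float]  # absent in segments built by older code

old_style: Segment = {"start": 0.0, "end": 1.0, "text": "hi"}  # still valid
new_style: Segment = {"start": 0.0, "end": 1.0, "text": "hi", "avg_logprob": -0.25}
```

Marking the field NotRequired is what keeps existing segment-constructing code type-correct without changes.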
PR #986 ("support timestamps for numbers") introduced three changes that
together broke CTC forced alignment:

1. Unknown chars (numbers, punctuation) were replaced with '*' wildcards
   mapped to token -1. get_wildcard_emission() scored these using
   torch.max() over all non-blank emissions, so wildcards greedily matched
   any speech-like signal in the segment window.

2. get_trellis() was rewritten with a different shape (num_frame, num_tokens)
   and incompatible initialization, discarding the original SoS-offset design
   from the PyTorch forced alignment tutorial.

3. backtrack() was replaced with backtrack_beam(), which always starts
   backtracking from the last frame of the segment window. The original
   backtrack() used torch.argmax() on the last token column to determine
   the starting frame. With padded segment boundaries (silence before/after
   speech), the new implementation spread all tokens across the full window,
   placing the first word at the start of the silence instead of the speech.

This commit restores the original PyTorch tutorial implementation:
- Unknown chars are skipped; words with only unknown chars become
  unalignable and get no timestamps (handled by interpolate_nans)
- get_trellis: restored (num_frame+1, num_tokens+1) shape with SoS offset
- backtrack: restored torch.argmax-based starting frame
- Removed backtrack_beam, get_wildcard_emission, BeamState, Path

Verified: v3.3.0 (pre-#986) produced correct timestamps with padded
segment boundaries; this fix reproduces that behavior.

Fixes #1220

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ktrack

The original code accepted blank_id as a parameter but used hardcoded 0
in two places, breaking alignment for HuggingFace models where the blank
token is [pad] (not index 0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
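A numpy sketch of the restored trellis, with blank_id threaded through rather than hardcoded (shapes follow the PyTorch forced-alignment tutorial design described above; illustrative, not the project's exact code):

```python
import numpy as np

def get_trellis(emission, tokens, blank_id=0):
    """Viterbi trellis over (frames+1, tokens+1); row/column 0 is the SoS offset."""
    num_frame = emission.shape[0]
    trellis = np.empty((num_frame + 1, len(tokens) + 1))
    trellis[0, 0] = 0.0
    # Staying on blank before the first token: use blank_id, not a hardcoded 0.
    trellis[1:, 0] = np.cumsum(emission[:, blank_id])
    trellis[0, 1:] = -np.inf  # no tokens can have been emitted before frame 0
    for t in range(num_frame):
        trellis[t + 1, 1:] = np.maximum(
            trellis[t, 1:] + emission[t, blank_id],  # stay (emit blank)
            trellis[t, :-1] + emission[t, tokens],   # advance to the next token
        )
    return trellis
```

Backtracking then starts from the argmax over the last token column, which is what keeps timestamps inside the speech region when segment windows are padded with silence.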
Add optional `progress_callback: Callable[[float], None]` parameter
to the three public API functions for real-time progress tracking.
Each callback receives 0-100% for its own stage independently.

Diarization wraps the callback into pyannote's internal hook protocol,
keeping pyannote internals fully encapsulated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
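The per-stage callback contract can be illustrated with a toy stage (the stage body is a stand-in; only the progress_callback signature comes from the commit above):

```python
from typing import Callable, Optional

def run_stage(items, progress_callback: Optional[Callable[[float], None]] = None):
    """Process items, reporting 0-100% progress for this stage only."""
    results = []
    for i, item in enumerate(items, start=1):
        results.append(item.upper())  # stand-in for the real work
        if progress_callback is not None:
            progress_callback(100.0 * i / len(items))
    return results
```

Each stage reports its own 0-100% range, so a caller tracking the full pipeline scales the three stages into an overall percentage itself.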
The open().read() call was left behind when the SHA-256 checksum
validation was removed in 86e2b3e. The resulting model_bytes variable
was never used, and the file descriptor was never closed.

Closes #1376

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
