update-from-whisper-x#1

Open
croquies wants to merge 167 commits into fika-dev:main from m-bain:main

Conversation

@croquies

No description provided.

m-bain and others added 30 commits July 11, 2024 13:01
Update alignment.py - added alignment for sk and sl languages
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Updated Norwegian Bokmål and Norwegian Nynorsk models

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Force ctranslate2 to version 4.4.0 due to the libcudnn_ops_infer.so.8 error:
SYSTRAN/faster-whisper#729

Co-authored-by: Icaro Bombonato <ibombonatosites@gmail.com>
* Update faster-whisper to 1.0.2 to enable model distil-large-v3

* feat: add hotwords option to default_asr_options

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>


---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
* chore: bump faster-whisper to 1.1.0

* chore: bump pyannote to 3.3.2

* feat: add multilingual option in load_model function

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>


---------

Co-authored-by: Abhishek Sharma <abhishek@zipteams.com>
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
…mode (#867)

Adds a local_files_only parameter (default False, for consistency) to whisperx.load_model so the user can avoid downloading the model and instead use the path to the locally cached file if it exists.
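The cache-or-download behavior described above can be sketched generically (an illustrative pattern, not whisperx's actual internals; `resolve_model` and `downloader` are hypothetical names):

```python
import os

def resolve_model(name, cache_dir, local_files_only=False, downloader=None):
    """Return a cached model path, downloading only when allowed."""
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        return path  # a cached copy is used either way
    if local_files_only:
        raise FileNotFoundError(f"{name} is not cached and downloads are disabled")
    return downloader(name, cache_dir)  # fetch into the cache
```

With local_files_only=True, a missing model fails fast instead of triggering a download, which is what offline workflows need.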

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
feat: restrict Python versions to 3.9 - 3.12
Barabazs and others added 30 commits October 21, 2025 09:13
* fix: pin huggingface-hub<1.0.0 for pyannote-audio compatibility

pyannote-audio uses the deprecated `use_auth_token` parameter which was removed in huggingface-hub v1.0.0

* fix: upgrade yanked dependencies

* chore: update version to 3.7.5
* chore: drop python 3.9 support

- Update requires-python to >=3.10
- Remove onnxruntime constraint (only needed for 3.9)
- Simplify numpy (remove version markers and upper bound)
- Remove pandas upper bound (<2.3.0 was for 3.9 compat)
- Remove av direct dependency (transitive via faster-whisper)

* chore(ci): remove python 3.9 from workflows

- Update build-and-release to use Python 3.10
- Remove 3.9 from python-compatibility matrix

* chore: bump version to 3.7.6
Replace O(n*m) pandas operations with O(n log m) interval tree queries
for speaker assignment, where n = words/segments and m = diarization segments.

Performance improvement:
- 7-minute video (1185 words, 147 segments): 73.9s -> 0.32s (228x faster)
- 3-hour podcast: Minutes of processing -> Seconds

Changes:
- Add IntervalTree class using sorted array + binary search
- Refactor assign_word_speakers to use interval tree for overlap queries
- Maintain backward compatibility with same function signature
- Identical output to original implementation

The interval tree uses numpy arrays for efficient storage and binary search
(np.searchsorted) for O(log n) candidate finding, then filters candidates
for actual overlaps.

Fixes #1335
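The sorted-array approach above can be sketched as follows (a simplified illustration of the idea, not the project's exact IntervalTree; the candidate filter here scans linearly after the binary search, while the real implementation prunes further):

```python
import numpy as np

class IntervalTree:
    """Sorted-array interval index: binary search for candidates, then filter."""

    def __init__(self, starts, ends, labels):
        order = np.argsort(starts)
        self.starts = np.asarray(starts, dtype=float)[order]
        self.ends = np.asarray(ends, dtype=float)[order]
        self.labels = [labels[i] for i in order]

    def best_overlap(self, start, end):
        # Binary search: only segments starting before `end` can overlap.
        hi = np.searchsorted(self.starts, end, side="right")
        best, best_ov = None, 0.0
        for i in range(hi):
            ov = min(end, self.ends[i]) - max(start, self.starts[i])
            if ov > best_ov:
                best, best_ov = self.labels[i], ov
        return best

tree = IntervalTree(
    [0.0, 5.0, 10.0], [5.0, 10.0, 15.0],
    ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
)
```

Speaker assignment then picks, for each word interval, the diarization segment with the largest temporal overlap, replacing the per-word pandas scans.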
…ssignment

Optimize assign_word_speakers with interval tree for 228x speedup
Fix: pass no_repeat_ngram_size and repetition_penalty to CTranslate2 generate()
[BugFix] The variable I removed was not being used anywhere.
[BugFix] Type hint fix in decode_batch: List[str], not str.
* fix: derive SRT/VTT cue times from word-level timestamps (#1315)

Subtitle cue start/end times were sourced from VAD segment boundaries
instead of word-level timestamps from forced alignment. This caused cues
to appear prematurely and could produce backwards chronological ordering
when VAD segments overlap.

Use min(word starts) / max(word ends) for cue timing, falling back to
segment-level times only when all words are unalignable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version to 3.7.7 in pyproject.toml

---------

Co-authored-by: Claude-Assistant <noreply@anthropic.com>
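The min/max cue-timing rule above can be sketched as (the segment dict shape is assumed to mirror whisperx's aligned output):

```python
def cue_times(segment):
    """Derive cue start/end from aligned word timestamps, with a fallback."""
    starts = [w["start"] for w in segment.get("words", []) if "start" in w]
    ends = [w["end"] for w in segment.get("words", []) if "end" in w]
    if starts and ends:
        return min(starts), max(ends)
    # All words unalignable: fall back to the VAD segment boundaries.
    return segment["start"], segment["end"]
```

Because the cue window is bounded by actual word timestamps, cues no longer inherit padded VAD boundaries and cannot start before the first aligned word.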
…-1 (#1349)

* feat: upgrade pyannote-audio dependency to v4

* fix: rename use_auth_token to token for pyannote-audio v4 compatibility

* fix: add omegaconf dep

* fix: use structured output API for pyannote-audio v4 diarization

pyannote-audio 4.x no longer returns a plain Annotation (or a tuple when
return_embeddings=True). It now returns a structured output with
speaker_diarization and speaker_embeddings attributes.

* feat: switch default diarization model to speaker-diarization-community-1

Update default from pyannote/speaker-diarization-3.1 to
pyannote/speaker-diarization-community-1 (pyannote-audio v4),
add CC-BY-4.0 attribution, and update README for v4 API changes.

* fix: correct markdown link formatting for silero-vad in README.md

* chore: update version to 3.8.0


Co-authored-by: Giorgio Azzinnaro <giorgio@azzinna.ro>
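A small compatibility shim illustrates the structured-output change described above (the attribute names come from the commit; the helper itself is hypothetical):

```python
def extract_diarization(output):
    """Normalize pyannote-audio 3.x/4.x pipeline outputs to (annotation, embeddings)."""
    if hasattr(output, "speaker_diarization"):
        # 4.x structured output
        return output.speaker_diarization, getattr(output, "speaker_embeddings", None)
    if isinstance(output, tuple):
        # 3.x with return_embeddings=True
        return output
    # 3.x plain Annotation
    return output, None
```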
…g paths (#1285)

- Add `model_cache_only` param to `load_align_model()`, pass as `local_files_only` to HuggingFace `from_pretrained` calls
- Forward `model_dir` and `model_cache_only` to both `load_align_model` call sites (initial load and language-change reload)
- Add `cache_dir` param to `DiarizationPipeline.__init__`, forward to pyannote `Pipeline.from_pretrained`
- Pass `--model_dir` as `cache_dir` when constructing `DiarizationPipeline` in CLI

Previously only the ASR model respected these flags. Alignment and diarization models would always download from HuggingFace to the default cache, breaking offline and custom-cache workflows.


---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Forward the existing --hf_token CLI argument to faster-whisper's
WhisperModel via a new use_auth_token parameter on load_model(),
enabling downloads of gated/private HuggingFace models.
It works with the initial prompt added.

Ran pdb to verify and checked the output.

Long audio works.

Existing logic is correct without the flag.
Added an `and` condition before streams; existing logic is not changed.
[New File] benchmark testing
Pass through the average log probability (transcription confidence score)
from ctranslate2 to the final segment output. The field is NotRequired
so existing code constructing segments without it remains valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
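The NotRequired field can be expressed with a TypedDict sketch (field names assumed to mirror the segment shape; NotRequired lives in typing on Python 3.11+, typing_extensions earlier):

```python
from typing import TypedDict
try:
    from typing import NotRequired  # Python 3.11+
except ImportError:
    from typing_extensions import NotRequired

class Segment(TypedDict):
    start: float
    end: float
    text: str
    avg_logprob: NotRequired[float]  # absent in segments built by older code

old_style: Segment = {"start": 0.0, "end": 1.0, "text": "hi"}  # still valid
new_style: Segment = {"start": 0.0, "end": 1.0, "text": "hi", "avg_logprob": -0.25}
```

Marking the field NotRequired is what keeps existing segment-constructing code type-correct without changes.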
PR #986 ("support timestamps for numbers") introduced three changes that
together broke CTC forced alignment:

1. Unknown chars (numbers, punctuation) were replaced with '*' wildcards
   mapped to token -1. get_wildcard_emission() scored these using
   torch.max() over all non-blank emissions, so wildcards greedily matched
   any speech-like signal in the segment window.

2. get_trellis() was rewritten with a different shape (num_frame, num_tokens)
   and incompatible initialization, discarding the original SoS-offset design
   from the PyTorch forced alignment tutorial.

3. backtrack() was replaced with backtrack_beam(), which always starts
   backtracking from the last frame of the segment window. The original
   backtrack() used torch.argmax() on the last token column to determine
   the starting frame. With padded segment boundaries (silence before/after
   speech), the new implementation spread all tokens across the full window,
   placing the first word at the start of the silence instead of the speech.

This commit restores the original PyTorch tutorial implementation:
- Unknown chars are skipped; words with only unknown chars become
  unalignable and get no timestamps (handled by interpolate_nans)
- get_trellis: restored (num_frame+1, num_tokens+1) shape with SoS offset
- backtrack: restored torch.argmax-based starting frame
- Removed backtrack_beam, get_wildcard_emission, BeamState, Path

Verified: v3.3.0 (pre-#986) produced correct timestamps with padded
segment boundaries; this fix reproduces that behavior.

Fixes #1220

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ktrack

The original code accepted blank_id as a parameter but used hardcoded 0
in two places, breaking alignment for HuggingFace models where the blank
token is [pad] (not index 0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
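A numpy sketch of the restored trellis, with blank_id threaded through rather than hardcoded (shapes follow the PyTorch forced-alignment tutorial design described above; illustrative, not the project's exact code):

```python
import numpy as np

def get_trellis(emission, tokens, blank_id=0):
    """Viterbi trellis over (frames+1, tokens+1); row/column 0 is the SoS offset."""
    num_frame = emission.shape[0]
    trellis = np.empty((num_frame + 1, len(tokens) + 1))
    trellis[0, 0] = 0.0
    # Staying on blank before the first token: use blank_id, not a hardcoded 0.
    trellis[1:, 0] = np.cumsum(emission[:, blank_id])
    trellis[0, 1:] = -np.inf  # no tokens can have been emitted before frame 0
    for t in range(num_frame):
        trellis[t + 1, 1:] = np.maximum(
            trellis[t, 1:] + emission[t, blank_id],  # stay (emit blank)
            trellis[t, :-1] + emission[t, tokens],   # advance to the next token
        )
    return trellis
```

Backtracking then starts from the argmax over the last token column, which is what keeps timestamps inside the speech region when segment windows are padded with silence.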
Add optional `progress_callback: Callable[[float], None]` parameter
to the three public API functions for real-time progress tracking.
Each callback receives 0-100% for its own stage independently.

Diarization wraps the callback into pyannote's internal hook protocol,
keeping pyannote internals fully encapsulated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
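The per-stage callback contract can be illustrated with a toy stage (the stage body is a stand-in; only the progress_callback signature comes from the commit above):

```python
from typing import Callable, Optional

def run_stage(items, progress_callback: Optional[Callable[[float], None]] = None):
    """Process items, reporting 0-100% progress for this stage only."""
    results = []
    for i, item in enumerate(items, start=1):
        results.append(item.upper())  # stand-in for the real work
        if progress_callback is not None:
            progress_callback(100.0 * i / len(items))
    return results
```

Each stage reports its own 0-100% range, so a caller tracking the full pipeline scales the three stages into an overall percentage itself.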
The open().read() call was left behind when the SHA-256 checksum
validation was removed in 86e2b3e. The resulting model_bytes variable
was never used, and the file descriptor was never closed.

Closes #1376

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
