Fast speech recognition with NVIDIA's Parakeet models in pure C++.
Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.
~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU.
| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| tdt-ctc-110m | `ParakeetTDTCTC` | 110M | Offline | English, dual CTC/TDT decoder heads |
| tdt-600m | `ParakeetTDT` | 600M | Offline | Multilingual, TDT decoder |
| eou-120m | `ParakeetEOU` | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | `ParakeetNemotron` | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| sortformer | `Sortformer` | 117M | Streaming | Speaker diarization (up to 4 speakers) |
| diarized | `DiarizedTranscriber` | 110M+117M | Offline | ASR + diarization → speaker-attributed words |
All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.
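A minimal sketch of those stages, using the lower-level API shown in the "full control" section below:

```cpp
#include <parakeet/parakeet.hpp>

// The shared pipeline, spelled out. The high-level Transcriber below
// wraps these same steps.
parakeet::ParakeetTDTCTC model(parakeet::make_110m_config());
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));

auto audio = parakeet::read_audio("audio.wav");            // 16kHz mono WAV
auto features = parakeet::preprocess_audio(audio.samples); // 80-bin Mel spectrogram
auto encoded = model.encoder()(features);                  // FastConformer encoder output
```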
```cpp
#include <parakeet/parakeet.hpp>

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu(); // optional — Metal acceleration

auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;
```

Choose the decoder at the call site:

```cpp
auto result = t.transcribe("audio.wav", parakeet::Decoder::CTC); // fast greedy
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT); // better accuracy (default)
```

Word-level timestamps with confidence:

```cpp
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT, /*timestamps=*/true);
for (const auto &w : result.word_timestamps) {
std::cout << "[" << w.start << "s - " << w.end << "s] "
<< "(" << w.confidence << ") " << w.word << std::endl;
}
// [0.24s - 0.48s] (0.98) Well
// [0.48s - 0.56s] (0.95) I
// [0.56s - 0.96s] (0.87) don't
```

Phrase boosting for domain-specific vocabulary:

```cpp
parakeet::TranscribeOptions opts;
opts.boost_phrases = {"Phoebe", "portrait"};
opts.boost_score = 5.0f; // log-prob bias (default)
auto result = t.transcribe("audio.wav", opts);
```

The default tdt-ctc-110m model (offline English):

```cpp
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();
auto result = t.transcribe("audio.wav");
```

tdt-600m (offline multilingual):

```cpp
parakeet::TDTTranscriber t("model.safetensors", "vocab.txt",
parakeet::make_tdt_600m_config());
auto result = t.transcribe("audio.wav");
```

eou-120m (streaming English with end-of-utterance detection):

```cpp
parakeet::StreamingTranscriber t("model.safetensors", "vocab.txt",
                                 parakeet::make_eou_120m_config());
// Feed audio chunks (e.g., from microphone)
while (auto chunk = get_audio_chunk()) {
auto text = t.transcribe_chunk(chunk);
if (!text.empty()) std::cout << text << std::flush;
}
std::cout << t.get_text() << std::endl;
```

nemotron-600m (streaming multilingual with configurable latency):

```cpp
// Latency modes: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/1);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);
while (auto chunk = get_audio_chunk()) {
auto text = t.transcribe_chunk(chunk);
if (!text.empty()) std::cout << text << std::flush;
}
```

Identify who spoke when — detects up to 4 speakers with per-frame activity probabilities:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));
auto audio = parakeet::read_audio("meeting.wav");
auto features = parakeet::preprocess_audio(audio.samples, {.normalize = false});
auto segments = model.diarize(features);
for (const auto &seg : segments) {
std::cout << "Speaker " << seg.speaker_id
<< ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
}
// Speaker 0: [0.56s - 2.96s]
// Speaker 0: [3.36s - 4.40s]
// Speaker 1: [4.80s - 6.24s]
```
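The segments come from thresholding the per-frame speaker probabilities. As an illustration only (not the library's internals; the threshold, frame duration, and smoothing behavior here are assumptions), turning per-frame activities into segments might look like:

```cpp
#include <vector>

struct Seg { int speaker_id; float start, end; };

// probs[spk][frame] in [0, 1]; one probability per 0.08s encoder frame (assumed).
std::vector<Seg> to_segments(const std::vector<std::vector<float>> &probs,
                             float frame_s = 0.08f, float thr = 0.5f) {
    std::vector<Seg> out;
    for (int spk = 0; spk < (int)probs.size(); ++spk) {
        int run = -1; // start frame of the current active run, -1 if none
        for (int f = 0; f <= (int)probs[spk].size(); ++f) {
            bool active = f < (int)probs[spk].size() && probs[spk][f] > thr;
            if (active && run < 0) run = f;           // a run of activity begins
            if (!active && run >= 0) {                // the run ends: emit a segment
                out.push_back({spk, run * frame_s, f * frame_s});
                run = -1;
            }
        }
    }
    return out;
}
```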
Streaming diarization with arrival-order speaker tracking:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));
parakeet::EncoderCache enc_cache;
parakeet::AOSCCache aosc_cache(4); // max 4 speakers
while (auto chunk = get_audio_chunk()) {
auto features = parakeet::preprocess_audio(chunk, {.normalize = false});
auto segments = model.diarize_chunk(features, enc_cache, aosc_cache);
for (const auto &seg : segments) {
std::cout << "Speaker " << seg.speaker_id
<< ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
}
}
```

`DiarizedTranscriber` combines ASR word timestamps with Sortformer speaker diarization to produce speaker-attributed words:

```cpp
parakeet::DiarizedTranscriber dt("model.safetensors", "sortformer.safetensors",
"vocab.txt");
dt.to_gpu(); // optional
auto result = dt.transcribe("meeting.wav");
```

Consecutive words from the same speaker are grouped automatically:

```text
Speaker 0 [0.08s - 2.56s]: Good morning, how can I help you today?
Speaker 1 [2.88s - 5.44s]: Hi, I'd like to check on my order status please.
Speaker 0 [5.76s - 8.32s]: Sure, can you give me your order number?
Speaker 1 [8.64s - 10.24s]: It's four five six seven eight.
```

Each `DiarizedWord` also carries individual timing and confidence:

```cpp
for (const auto &w : result.words) {
    // Fields: w.speaker_id, w.start, w.end, w.confidence, w.word
    std::cout << "Speaker " << w.speaker_id << " [" << w.start << "s - " << w.end
              << "s] (" << w.confidence << ") " << w.word << std::endl;
}
```

Standalone alignment is also available if you run ASR and Sortformer separately:

```cpp
auto diarized = parakeet::diarize_transcription(asr_result.word_timestamps, segments);
```
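Under the hood, alignment must decide which speaker each word belongs to. One plausible scheme (an illustration; the library's actual policy may differ) assigns each word to the speaker segment with the greatest temporal overlap:

```cpp
#include <algorithm>
#include <vector>

struct Word { float start, end; };                 // from ASR word timestamps
struct Seg  { int speaker_id; float start, end; }; // from Sortformer

// Returns the speaker whose segment overlaps the word the most, or -1 if none.
int assign_speaker(const Word &w, const std::vector<Seg> &segs) {
    int best = -1;
    float best_ov = 0.0f;
    for (const auto &s : segs) {
        float ov = std::min(w.end, s.end) - std::max(w.start, s.start);
        if (ov > best_ov) { best_ov = ov; best = s.speaker_id; }
    }
    return best;
}
```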
For full control over the pipeline:

CTC (English, punctuation & capitalization):

```cpp
auto cfg = parakeet::make_110m_config();
parakeet::ParakeetTDTCTC model(cfg);
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));
auto audio = parakeet::read_audio("audio.wav");
auto features = parakeet::preprocess_audio(audio.samples);
auto encoder_out = model.encoder()(features);
auto log_probs = model.ctc_decoder()(encoder_out);
auto tokens = parakeet::ctc_greedy_decode(log_probs);
parakeet::Tokenizer tokenizer;
tokenizer.load("vocab.txt");
std::cout << tokenizer.decode(tokens[0]) << std::endl;
```

TDT (Token-and-Duration Transducer):

```cpp
auto encoder_out = model.encoder()(features);
auto tokens = parakeet::tdt_greedy_decode(model, encoder_out, cfg.durations);
std::cout << tokenizer.decode(tokens[0]) << std::endl;
```

Timestamps with confidence (CTC, TDT, or RNNT):

```cpp
// CTC timestamps
auto ts = parakeet::ctc_greedy_decode_with_timestamps(log_probs);
// TDT timestamps
auto ts = parakeet::tdt_greedy_decode_with_timestamps(model, encoder_out, cfg.durations);
// RNNT timestamps
auto ts = parakeet::rnnt_greedy_decode_with_timestamps(model, encoder_out);
// Group into word-level timestamps (confidence = min token confidence per word)
auto words = parakeet::group_timestamps(ts[0], tokenizer.pieces());
for (const auto &w : words) {
// w.confidence is in [0, 1]
std::cout << w.word << " (" << w.confidence << ")" << std::endl;
}
```

Phrase boosting (context biasing):

```cpp
// Build a trie from boost phrases
parakeet::ContextTrie trie;
trie.build({"Phoebe", "portrait"}, tokenizer);
// Boosted CTC decode — biases log-probs toward trie-matched tokens
auto tokens = parakeet::ctc_greedy_decode_boosted(log_probs, trie, /*boost_score=*/5.0f);
// Boosted TDT decode
auto tokens = parakeet::tdt_greedy_decode_boosted(model, encoder_out, cfg.durations, trie);
// Also available with timestamps
auto ts = parakeet::ctc_greedy_decode_with_timestamps_boosted(log_probs, trie);
auto ts = parakeet::tdt_greedy_decode_with_timestamps_boosted(model, encoder_out, cfg.durations, trie);
```
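Conceptually, the boosted decoders walk the trie in step with decoding and add `boost_score` to the log-prob of any token that extends a partial phrase match. A toy version of that idea (illustrative only; `parakeet::ContextTrie`'s real interface is richer):

```cpp
#include <map>

// Toy token trie: each node maps a token id to a child node.
struct TrieNode { std::map<int, TrieNode> next; };

// During greedy decode, a token that extends a partial phrase match gets its
// log-prob raised by boost_score; the decoder then advances its trie state
// (or resets to the root on a mismatch).
float biased_logprob(const TrieNode &state, int token, float log_prob,
                     float boost_score = 5.0f) {
    return state.next.count(token) ? log_prob + boost_score : log_prob;
}
```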
GPU acceleration (Metal):

```cpp
model.to(axiom::Device::GPU);
auto features_gpu = features.gpu();
auto encoder_out = model.encoder()(features_gpu);
// Decode on CPU
auto tokens = parakeet::ctc_greedy_decode(
model.ctc_decoder()(encoder_out).cpu()
);
```

CLI usage:

```text
Usage: parakeet <model.safetensors> <audio.wav> [options]
Model types:
--model TYPE Model type (default: tdt-ctc-110m)
Types: tdt-ctc-110m, tdt-600m, eou-120m,
nemotron-600m, sortformer, diarized
Decoder options:
--ctc Use CTC decoder (default: TDT)
--tdt Use TDT decoder
Phrase boost:
--boost PHRASE Boost a phrase (repeatable)
--boost-score N Boost score (default: 5.0)
Other options:
--vocab PATH SentencePiece vocab file
--sortformer-weights PATH Sortformer weights (for diarized mode)
--gpu Run on Metal GPU
--timestamps Show word-level timestamps
--streaming Use streaming mode (eou/nemotron models)
--latency N Right context frames for nemotron (0/1/6/13)
--features PATH Load pre-computed features from .npy file
```

Examples:

```bash
# Basic transcription (TDT decoder, default)
./build/parakeet model.safetensors audio.wav --vocab vocab.txt
# CTC decoder
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --ctc
# GPU acceleration
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --gpu
# Word-level timestamps
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --timestamps
# Phrase boosting for domain-specific terms
./build/parakeet model.safetensors audio.wav --vocab vocab.txt \
--boost "Phoebe" --boost "portrait" --boost-score 5.0
# 600M multilingual TDT model
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model tdt-600m
# Streaming with EOU
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model eou-120m
# Nemotron streaming with configurable latency
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model nemotron-600m --latency 6
# Speaker diarization
./build/parakeet sortformer.safetensors meeting.wav --model sortformer
# Diarized transcription (ASR + Sortformer)
./build/parakeet model.safetensors meeting.wav --model diarized \
--sortformer-weights sortformer.safetensors --vocab vocab.txt
```

Requirements:

- C++20 compiler (Clang 14+ or GCC 12+)
- CMake 3.20+
- macOS 13+ for Metal GPU acceleration
Axiom is the only dependency (included as a submodule).
```bash
git clone --recursive https://github.com/frikallo/parakeet.cpp
cd parakeet.cpp
make build
```

Run the tests:

```bash
make test
```

Download a NeMo checkpoint from NVIDIA and convert it to safetensors:

```bash
# Download from HuggingFace (requires pip install huggingface_hub)
huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .
# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types via the `--model` flag:

```bash
# 110M TDT-CTC (default)
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 110m-tdt-ctc
# 600M multilingual TDT
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt
# 120M EOU streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model eou-120m
# 600M Nemotron streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model nemotron-600m
# 117M Sortformer diarization
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model sortformer
```

Also supports raw .ckpt files and inspection:

```bash
python scripts/convert_nemo.py model_weights.ckpt -o model.safetensors
python scripts/convert_nemo.py --dump model.nemo  # inspect checkpoint keys
```

Grab the SentencePiece vocab from the same HuggingFace repo. The file is inside the .nemo archive, or download directly:

```bash
# Extract from .nemo
tar xf parakeet-tdt_ctc-110m.nemo ./tokenizer.model
# or use the vocab.txt from the HF files page
```

After installing (`make install` or `cmake --install build`):

```cmake
find_package(Parakeet REQUIRED)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```

Add parakeet.cpp as a subdirectory or git submodule:

```cmake
add_subdirectory(third_party/parakeet.cpp)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```

Or build against it with pkg-config:

```bash
g++ -std=c++20 myapp.cpp $(pkg-config --cflags --libs parakeet) -o myapp
```

The offline models are built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):
| Model | Class | Decoder | Use case |
|---|---|---|---|
| CTC | `ParakeetCTC` | Greedy argmax | Fast, English-only |
| RNNT | `ParakeetRNNT` | Autoregressive LSTM | Streaming capable |
| TDT | `ParakeetTDT` | LSTM + duration prediction | Better accuracy than RNNT |
| TDT-CTC | `ParakeetTDTCTC` | Both TDT and CTC heads | Switch decoder at inference |
The streaming models are built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:
| Model | Class | Decoder | Use case |
|---|---|---|---|
| EOU | `ParakeetEOU` | Streaming RNNT | End-of-utterance detection |
| Nemotron | `ParakeetNemotron` | Streaming TDT | Configurable-latency streaming |
| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | `Sortformer` | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |
Measured on an Apple M3 (16GB) with simulated audio input (`Tensor::randn`). Times are per encoder forward pass (Sortformer: full forward pass).
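A minimal sketch of that kind of measurement (the feature shape, `randn` signature, and warm-up policy here are assumptions, not the benchmark's actual code):

```cpp
#include <chrono>
#include <cstdio>
#include <parakeet/parakeet.hpp>

int main() {
    parakeet::ParakeetTDTCTC model(parakeet::make_110m_config());
    model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));
    model.to(axiom::Device::GPU);

    // ~10s of simulated input: 80 Mel bins x ~1000 frames (160-sample hop at
    // 16kHz). Shape and layout are assumed for illustration.
    auto features = axiom::Tensor::randn({1, 80, 1000}).gpu();

    model.encoder()(features); // warm-up: triggers Metal graph compilation

    auto t0 = std::chrono::steady_clock::now();
    auto out = model.encoder()(features);
    out.cpu(); // force completion; a real benchmark must sync the GPU first
    auto t1 = std::chrono::steady_clock::now();

    std::printf("encoder: %lld ms\n", (long long)
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
}
```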
Encoder throughput — 10s audio:
| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |
110m GPU scaling across audio lengths:
| Audio | CPU (ms) | GPU (ms) | RTF (GPU) | Throughput (GPU, × realtime) |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |
GPU acceleration is powered by axiom's Metal graph compiler, which fuses the full encoder into optimized MPSGraph operations.

To run the benchmarks:

```bash
# Full suite
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"
# Single model
make bench-single ARGS="--110m=models/model.safetensors --benchmark_filter=110m"
# Markdown table output
./build/parakeet_bench --110m=models/model.safetensors --markdown
# Skip GPU benchmarks
./build/parakeet_bench --110m=models/model.safetensors --no-gpu
```

Available model flags: `--110m`, `--tdt-600m`, `--rnnt-600m`, `--sortformer`. All Google Benchmark flags (`--benchmark_filter`, `--benchmark_format=json`, `--benchmark_repetitions=N`) are passed through.
- Confidence scores — Per-token and per-word confidence (0.0–1.0) from token log-probs. Available on all decoders (CTC, TDT, RNNT, streaming).
- Phrase boosting (context biasing) — Token-level trie over a boost list. Bias log-probs during decode for domain-specific vocabulary (product names, jargon, proper nouns). Works with greedy decode.
- Beam search decoding — CTC prefix beam search and TDT/RNNT beam search with configurable width. 5–15% relative WER reduction over greedy.
- N-gram LM shallow fusion — Load ARPA language models, score partial hypotheses during beam search. Domain-adapted decoding.
- Multi-format audio loading — WAV (all formats), FLAC, MP3, OGG Vorbis via dr_libs + stb_vorbis. `read_audio(path)` auto-detects the format.
- Automatic resampling — Windowed sinc interpolation (Kaiser, 16-tap, ~80dB stopband). Arbitrary rate conversion with GCD simplification (see the sketch after this list).
- Sample rate validation — `preprocess_audio(AudioData)` validates that the sample rate matches the config.
- Load from memory buffer — `read_audio(bytes, len)`, `read_audio(float*, n, rate)`, `read_audio(int16_t*, n, rate)`.
- Extended WAV support — All WAV formats via dr_wav (8/16/24/32-bit PCM, float, A-law, mu-law).
- Audio duration query — `get_audio_duration(path)` without fully decoding the file; reads the header only.
- Progress callbacks — `transcribe(path, {.on_progress = callback})` for long files. Reports preprocessing / encoder / decode stages.
- Streaming from raw PCM — Helper to feed `int16_t*` or `float*` microphone buffers directly into `StreamingTranscriber` without manual Tensor construction.
- Diarized transcription — Fuse Sortformer speaker segments with ASR word timestamps. `DiarizedTranscriber` composes ASR + Sortformer into speaker-attributed words.
- Long-form audio chunking — Split audio >30s into overlapping windows, run the encoder on each, merge transcriptions at overlap boundaries.
- VAD (voice activity detection) — Skip silent regions to reduce compute. Silero VAD integration or energy-based.
- Batch inference — Pad + length-mask multiple audio files, batch through encoder and decoder for better GPU utilization.
- Neural LM rescoring — N-best reranking with a Transformer LM after beam search.
- C API — Flat C interface (`parakeet_transcribe(...)`) for FFI from Python, Swift, Go, Rust.
- f16 inference — Half-precision weights and compute. 2x memory reduction, faster on Apple Silicon.
- Model quantization — INT8/INT4 weight quantization for mobile deployment.
- Hotword / wake word detection — "Hey Parakeet" trigger phrase detection.
- Speaker embedding extraction — Speaker verification from Sortformer intermediate layers or TitaNet.
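As a sketch of the GCD simplification mentioned in the resampling item above (this is generic rational-resampling arithmetic, not the library's implementation):

```cpp
#include <cstdio>
#include <numeric>

int main() {
    // Converting 44.1kHz audio to the models' 16kHz input:
    int in_rate = 44100, out_rate = 16000;
    int g = std::gcd(in_rate, out_rate);         // 100
    int up = out_rate / g, down = in_rate / g;   // 160 and 441
    // A rational resampler interpolates by `up` and decimates by `down`,
    // evaluating the Kaiser-windowed sinc kernel only at the needed phases.
    std::printf("resample 44100 -> 16000 as ratio %d/%d\n", up, down);
}
```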
Limitations:

- Audio: 16kHz mono WAV (16-bit PCM or 32-bit float)
- Offline models have ~4-5 minute audio length limits; split longer files or use streaming models
- Blank token ID is 1024 (110M) or 8192 (600M)
- GPU acceleration requires Apple Silicon with Metal support
- Timestamps use frame-level alignment: `frame * 0.08s` (8x subsampling × 160-sample hop / 16kHz = 0.08s per frame). Confidence = `exp(max_log_prob)` per token, min-aggregated to word level.
- Sortformer diarization uses unnormalized features (`normalize = false`) — this differs from the ASR models.
MIT