RapidSpeech.cpp is a high-performance, edge-native speech intelligence framework built on top of ggml. It aims to provide pure-C++, zero-dependency, on-device inference for large-scale ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) models.
While the open-source ecosystem already offers powerful cloud-side frameworks such as vLLM-omni, as well as mature on-device solutions like sherpa-onnx, RapidSpeech.cpp introduces a new generation of design choices focused on edge deployment.
**vLLM**
- Designed for data centers and cloud environments
- Strongly coupled with Python and CUDA
- Maximizes GPU throughput via techniques such as PagedAttention

**RapidSpeech.cpp**
- Designed specifically for edge and on-device inference
- Optimized for low latency, a low memory footprint, and lightweight deployment
- Runs on embedded devices, mobile platforms, laptops, and even NPU-only systems
- No Python runtime required
| Aspect | sherpa-onnx (ONNX Runtime) | RapidSpeech.cpp (ggml) |
|---|---|---|
| Memory Management | Managed internally by ORT, relatively opaque | Zero runtime allocation — memory is fully planned during graph construction to avoid edge-side OOM |
| Quantization | Primarily INT8, limited support for ultra-low bit-width | Full K-Quants family (Q4_K / Q5_K / Q6_K), significantly reducing bandwidth and memory usage while preserving accuracy |
| GPU Performance | Relies on execution providers with operator mapping overhead | Native backends (ggml-cuda, ggml-metal) with speech-specific optimizations, outperforming generic onnxruntime-gpu |
| Deployment | Requires shared libraries and external config files | Single binary deployment — model weights and configs are fully encapsulated in GGUF |
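The "zero runtime allocation" claim can be illustrated in miniature: because every tensor shape is known when the graph is constructed, a single arena can be allocated up front and each tensor assigned a fixed offset, so nothing is allocated on the inference hot path. The sketch below is illustrative only (hypothetical `plan_arena`; the real planner lives inside ggml):

```python
# Sketch of graph-time memory planning: all tensor sizes are known before
# inference starts, so one arena is allocated once and sliced by fixed
# offsets -- no allocation happens during inference.
# (Illustrative only; not the actual ggml planner.)

def plan_arena(tensor_shapes, dtype_size=4, align=32):
    """Assign a fixed byte offset to each named tensor; return offsets and total size."""
    offsets, cursor = {}, 0
    for name, shape in tensor_shapes.items():
        nbytes = dtype_size
        for dim in shape:
            nbytes *= dim
        # round the cursor up to the alignment boundary
        cursor = (cursor + align - 1) // align * align
        offsets[name] = cursor
        cursor += nbytes
    return offsets, cursor

# Example: plan a tiny two-tensor graph before any audio arrives
offsets, total = plan_arena({"mel": (80, 300), "enc_out": (512, 75)})
arena = bytearray(total)  # the only allocation for the whole session
```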
Automatic Speech Recognition (ASR)
- SenseVoice-small
- FunASR-nano
- Qwen3-ASR
- FireRedASR2
Text-to-Speech (TTS)
- CosyVoice3
- Qwen3-TTS
RapidSpeech.cpp is not just an inference wrapper — it is a full-featured speech application framework:
- **Core Engine**: A ggml-based computation backend supporting mixed-precision inference from INT4 to FP32.
- **Architecture Layer**: A plugin-style model construction and loading system, with planned support for FunASR-nano, CosyVoice, Qwen3-TTS, and more.
- **Business Logic Layer**: Built-in ring buffers, VAD (voice activity detection), text frontend processing (e.g., phonemization), and multi-session management.
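The ring buffer in the business logic layer can be sketched as follows (hypothetical `AudioRing`; the framework's internal implementation may differ): incoming PCM samples are written at a wrapping head index, and the reader drains fixed-size frames for the recognizer.

```python
# Minimal audio ring buffer sketch (hypothetical AudioRing, not the
# framework's actual class). Samples are written at a wrapping head index;
# the reader pops fixed-size frames for downstream recognition.

class AudioRing:
    def __init__(self, capacity):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.head = 0   # next write position
        self.size = 0   # samples currently stored

    def push(self, samples):
        for s in samples:
            self.buf[self.head] = s
            self.head = (self.head + 1) % self.capacity
            if self.size < self.capacity:
                self.size += 1  # otherwise the oldest sample was overwritten

    def pop_frame(self, frame_len):
        """Return the oldest frame_len samples, or None if not enough are buffered."""
        if self.size < frame_len:
            return None
        start = (self.head - self.size) % self.capacity
        frame = [self.buf[(start + i) % self.capacity] for i in range(frame_len)]
        self.size -= frame_len
        return frame
```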
- Extreme Quantization: Native support for 4-bit, 5-bit, and 6-bit quantization schemes to match diverse hardware constraints.
- Zero Dependencies: Implemented entirely in C/C++, producing a single lightweight binary.
- GPU / NPU Acceleration: Customized CUDA and Metal backends optimized for speech models.
- Unified Model Format: Both ASR and TTS models use an extended GGUF format.
- Python Bindings: Python interface via pybind11, installable via pip, callable with just one line of code.
Download GGUF model files from:
- 🤗 Hugging Face: https://huggingface.co/RapidAI/RapidSpeech
- ModelScope: https://www.modelscope.cn/models/RapidAI/RapidSpeech
```shell
git clone https://github.com/RapidAI/RapidSpeech.cpp
cd RapidSpeech.cpp
git submodule sync && git submodule update --init --recursive
cmake -B build
cmake --build build --config Release
```

🍎 macOS Metal (enabled by default)

Metal acceleration is enabled by default on macOS; no extra configuration is needed:

```shell
cmake -B build
cmake --build build --config Release
```

🖥️ NVIDIA CUDA

```shell
cmake -B build -DRS_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build build --config Release
```

🌋 Vulkan

```shell
cmake -B build -DRS_VULKAN=ON
cmake --build build --config Release
```

⚡ Huawei CANN (Ascend NPU)

```shell
cmake -B build -DRS_CANN=ON
cmake --build build --config Release
```

After building, use `rs-asr-offline` for offline speech recognition:
```shell
# Basic usage
./build/rs-asr-offline \
  -m /path/to/model.gguf \
  -w /path/to/audio.wav

# Specify threads and GPU
./build/rs-asr-offline \
  -m /path/to/model.gguf \
  -w /path/to/audio.wav \
  -t 8 \
  --gpu 1
```

Command-line arguments:
| Argument | Description | Default |
|---|---|---|
| `-m, --model` | Model file path (required) | - |
| `-w, --wav` | WAV audio file path (optional; uses a test sine wave if not provided) | - |
| `-t, --threads` | Number of CPU threads | 4 |
| `--gpu` | Enable GPU acceleration (true/false) | true |
| `-h, --help` | Show help message | - |
Option 1: pip install (recommended)

```shell
# CPU version
pip install rapidspeech

# CUDA version
pip install rapidspeech-cuda

# Metal version (macOS)
pip install rapidspeech-metal
```

Option 2: Build from source

```shell
git clone https://github.com/RapidAI/RapidSpeech.cpp
cd RapidSpeech.cpp
git submodule sync && git submodule update --init --recursive

# Build with Python bindings
pip install .

# Or specify the CUDA backend
RS_BACKEND=cuda pip install .
```

```python
import numpy as np
import rapidspeech

# 1. Initialize the offline ASR recognizer
asr = rapidspeech.asr_offline(
    model_path="/path/to/model.gguf",
    n_threads=4,
    use_gpu=True
)

# 2. Load audio data (float32, 16 kHz, mono)
# pcm should be a NumPy float32 array in the range [-1.0, 1.0]
pcm = load_wav("audio.wav")  # Implement WAV loading yourself, or use soundfile / scipy.io.wavfile

# 3. Push audio data
asr.push_audio(pcm)

# 4. Run inference
asr.process()

# 5. Get the recognition result
text = asr.get_text()
print(f"Result: {text}")
```

A complete offline recognition example script is available at `python-api-examples/asr/asr-offline.py`.

Run the example:

```shell
python python-api-examples/asr/asr-offline.py \
  --model /path/to/model.gguf \
  --audio /path/to/audio.wav \
  --threads 4 \
  --gpu 1
```

| Class / Method | Description |
|---|---|
| `rapidspeech.asr_offline(model_path, n_threads=4, use_gpu=True)` | Create an offline ASR recognizer |
| `asr.push_audio(pcm)` | Push float32 audio data (1-D NumPy array) |
| `asr.process()` | Run inference; returns a status code (0 = no output, 1 = has output, -1 = error) |
| `asr.get_text()` | Get the recognized text result |
RapidSpeech provides a C interface for integration with other languages and projects. Key APIs:
```c
#include "rapidspeech.h"

// Initialization
rs_init_params_t params = rs_default_params();
params.model_path = "/path/to/model.gguf";
params.n_threads = 4;
params.use_gpu = true;
rs_context_t* ctx = rs_init_from_file(params);

// Push audio and run inference
rs_push_audio(ctx, pcm_data, num_samples);
int32_t status = rs_process(ctx);

// Get result
const char* text = rs_get_text_output(ctx);

// Release resources
rs_free(ctx);
```

For the complete C API documentation, see `include/rapidspeech.h`.
| Option | Description | Default |
|---|---|---|
| `RS_BUILD_TESTS` | Build test executables | ON |
| `RS_CUDA` | Enable CUDA acceleration | OFF |
| `RS_METAL` | Enable Metal acceleration (macOS only) | Auto-detect |
| `RS_VULKAN` | Enable Vulkan acceleration | OFF |
| `RS_CANN` | Enable Huawei CANN acceleration | OFF |
| `RS_OPENCL` | Enable OpenCL acceleration | OFF |
| `RS_ENABLE_PYTHON` | Enable Python bindings (pybind11) | OFF |
Use the provided script to convert Hugging Face models to GGUF format:
```shell
python scripts/convert_hf_to_gguf.py --model /path/to/hf-model --output /path/to/output.gguf
```

If you are interested in the following areas, we welcome your PRs or participation in discussions:
- Adapting more models (Qwen3-ASR, CosyVoice3, etc.)
- Refining the framework architecture and performance optimization
- Improving documentation and examples
