Version 1.0 - Training Capabilities
Release Date: November 17, 2025
You can now train production-ready speech-to-speech models from scratch, completely independent of external APIs!
File: src/models/speech_tokenizer_trainable.py
- ✓ Residual Vector Quantization (RVQ) with 8 quantizers
- ✓ Convolutional encoder/decoder for high-quality audio
- ✓ HiFiGAN vocoder integration (MIT licensed)
- ✓ Checkpoint management - save/resume training anytime
- ✓ LibriSpeech-ready - works out-of-box with free datasets
Architecture:
```
Audio → Mel → CNN Encoder → RVQ (8x1024) → CNN Decoder → Mel → HiFiGAN → Audio
                            └───── Discrete Tokens ─────┘
```
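The RVQ stage above can be sketched in a few lines. This is a minimal, dependency-free illustration of the residual quantization idea, not the repo's actual implementation: toy 2-dim vectors and tiny codebooks stand in for the model's 512-dim latents and 1024-entry learned codebooks.

```python
# Minimal sketch of residual vector quantization (RVQ): each stage snaps the
# current residual to its nearest codebook entry, subtracts it, and passes
# what's left to the next stage, so later quantizers refine earlier errors.

def rvq_encode(vector, codebooks):
    """Return one codebook index per quantizer: the discrete tokens."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        # Nearest entry by squared Euclidean distance.
        best = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        indices.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two quantizers, four entries each: the second stage is a finer grid.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]
tokens = rvq_encode([1.2, 0.3], codebooks)   # [1, 3]
approx = rvq_decode(tokens, codebooks)       # [1.25, 0.25]
```

With 8 quantizers of 1024 entries each, the same scheme yields the 1024^8 combinations the full model uses per frame.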
Key Methods:
- `tokenize(audio)` - Convert audio to discrete tokens
- `detokenize(tokens)` - Convert tokens back to audio
- `forward(audio)` - Full training forward pass with losses
- `save_checkpoint()` / `load_checkpoint()` - Checkpoint management
File: training/train_tokenizer.py
- ✓ LibriSpeech dataset loader with automatic preprocessing
- ✓ Multi-GPU support with PyTorch DataLoader
- ✓ WandB integration for real-time monitoring
- ✓ Automatic checkpointing every N epochs
- ✓ Best model tracking based on validation loss
- ✓ Gradient clipping and mixed precision support
Training Loop:
- Load LibriSpeech dataset (auto-download)
- Create train/val split (95/5)
- Train with reconstruction + commitment losses
- Validate every epoch
- Save checkpoints
- Log to WandB/TensorBoard
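The checkpointing and best-model steps above follow a standard pattern, sketched below. Names and file layout are illustrative, not the exact API of train_tokenizer.py; the real script saves PyTorch .pt files, while JSON stands in here to keep the sketch stdlib-only.

```python
# Sketch of "checkpoint every N epochs + track best validation loss".
import json
import os
import tempfile

CKPT_DIR = tempfile.mkdtemp()

def save_checkpoint(state, name):
    with open(os.path.join(CKPT_DIR, name), "w") as f:
        json.dump(state, f)

def train(epochs, save_every=2):
    best_val = float("inf")
    # Stand-in validation losses; a real loop would compute these each epoch.
    fake_val_losses = [0.9, 0.7, 0.8, 0.5, 0.4]
    for epoch, val_loss in enumerate(fake_val_losses[:epochs], start=1):
        state = {"epoch": epoch, "val_loss": val_loss}
        if epoch % save_every == 0:          # periodic checkpoint
            save_checkpoint(state, f"tokenizer_epoch_{epoch}.json")
        if val_loss < best_val:              # best-model tracking
            best_val = val_loss
            save_checkpoint(state, "tokenizer_best.json")
    return best_val

best = train(epochs=5)
```

Keeping the periodic and best-model checkpoints separate means you can always resume from a recent epoch while still knowing which weights scored lowest on validation.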
Usage:
```bash
python training/train_tokenizer.py --config training/configs/tokenizer_config.yaml
```
File: training/configs/tokenizer_config.yaml
- ✓ YAML-based configuration for easy experimentation
- ✓ Model hyperparameters (hidden_dim, codebook_size, etc.)
- ✓ Training settings (batch_size, learning_rate, epochs)
- ✓ Data configuration (dataset paths, splits)
- ✓ Logging options (WandB, TensorBoard)
Example Config:
```yaml
model:
  sample_rate: 24000
  codebook_size: 1024
  hidden_dim: 512
  num_quantizers: 8
training:
  epochs: 100
  batch_size: 16
  learning_rate: 1e-4
```
File: scripts/test_tokenizer.py
- ✓ Audio reconstruction test - Compare original vs reconstructed
- ✓ Codebook utilization analysis - Check if quantizers are used efficiently
- ✓ Latency measurement - Real-time factor (RTF) calculation
- ✓ Quality metrics - MSE, MAE, peak amplitude
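The metrics above reduce to a few simple formulas, sketched here. The formulas are assumed from the metric names; the script's exact implementation (windowing, normalization) may differ.

```python
# Sketch of the quality metrics reported by scripts/test_tokenizer.py.
def mse(a, b):
    """Mean squared error between two equal-length sample sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mae(a, b):
    """Mean absolute error; the guide's rule of thumb is MAE < 0.1."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def peak_amplitude(a):
    return max(abs(x) for x in a)

def real_time_factor(processing_seconds, audio_seconds):
    # RTF < 1.0 means audio is processed faster than real time.
    return processing_seconds / audio_seconds

original      = [0.0, 0.5, -0.5, 1.0]
reconstructed = [0.0, 0.4, -0.6, 0.9]
recon_mae = mae(original, reconstructed)   # ~0.075, under the 0.1 target
rtf = real_time_factor(2.0, 10.0)          # 0.2: 10 s of audio in 2 s
```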
Usage:
```bash
# Test reconstruction quality
python scripts/test_tokenizer.py \
  --checkpoint checkpoints/tokenizer_best.pt \
  --input test.wav \
  --output reconstructed.wav

# Analyze codebook usage
python scripts/test_tokenizer.py \
  --checkpoint checkpoints/tokenizer_best.pt \
  --analyze \
  --test_dir /workspace/data/LibriSpeech/test-clean
```
File: setup_training.sh
- ✓ One-command setup for entire training environment
- ✓ Python/CUDA verification - Check system requirements
- ✓ Dependency installation - PyTorch, torchaudio, all packages
- ✓ Dataset download - Interactive LibriSpeech downloader
- ✓ WandB configuration - Optional monitoring setup
Usage:
```bash
bash setup_training.sh
```
File: TRAINING_GUIDE.md (12KB, 400+ lines)
- ✓ Quick start guide - Get training in 30 minutes
- ✓ Cost breakdowns - Detailed GPU hour estimates
- ✓ Phase-by-phase roadmap - Tokenizer → S2S → Emotion
- ✓ Troubleshooting - Common issues and solutions
- ✓ Dataset preparation - LibriSpeech, Common Voice, synthetic
| Dataset | Size | Training Time | Cost (RunPod A100) | Quality |
|---|---|---|---|---|
| train-clean-100 | 100h | 8-12 hours | $10-15 | Testing |
| train-clean-360 | 360h | 30-40 hours | $36-48 | Production |
| train-other-500 | 500h | 40-50 hours | $48-60 | Excellent |
| All combined | 960h | 80-120 hours | $95-143 | Best |
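The cost column is just GPU-hours times the RunPod A100 hourly rate (roughly $0.89-1.19/hr, per the GPU recommendations below); a quick sanity check of the 960h row:

```python
# Estimate training cost at the high end of the RunPod A100 rate.
def training_cost(gpu_hours, rate_per_hour=1.19):
    return round(gpu_hours * rate_per_hour, 2)

low_end = training_cost(80)     # ~95.2, the "$95" lower bound
high_end = training_cost(120)   # ~142.8, the "$143" upper bound
```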
Full Training Pipeline (All 3 Phases):
- Phase 1 (Tokenizer): $95-143
- Phase 2 (S2S Model): $119-476 (coming soon)
- Phase 3 (Emotions): $60-179 (coming soon)
- Total: $274-798
Compare to:
- OpenAI GPT-4o Realtime: $5-10/hr usage, no ownership
- Luna AI: API-only, pricing TBD
- Your model: One-time cost, full ownership, unlimited use
✓ src/models/speech_tokenizer_trainable.py (13KB)
✓ training/train_tokenizer.py (13KB)
✓ training/configs/tokenizer_config.yaml (1.5KB)
✓ scripts/test_tokenizer.py (7KB)
✓ requirements-training.txt (700 bytes)
✓ setup_training.sh (7KB)
✓ TRAINING_GUIDE.md (12KB)
✓ WHATS_NEW.md (this file)
✓ README.md - Added training sections, cost breakdowns, Luna AI comparison
Testing-S2S/
├── training/ # Training scripts
│ ├── train_tokenizer.py
│ └── configs/
│ └── tokenizer_config.yaml
├── scripts/ # Utility scripts
│ └── test_tokenizer.py
└── checkpoints/ # Model checkpoints (gitignored)
├── tokenizer/
└── s2s/
```bash
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
bash setup_training.sh
```
This will:
- Check system requirements (Python 3.10+, CUDA)
- Create virtual environment
- Install PyTorch with CUDA 12.1
- Install all dependencies
- Download LibriSpeech dataset (optional)
- Configure WandB (optional)
```bash
source venv/bin/activate
python training/train_tokenizer.py
```
Monitor training from multiple terminals:
```bash
# Terminal 1: Training logs
python training/train_tokenizer.py

# Terminal 2: GPU monitoring
watch -n 1 nvidia-smi

# Terminal 3: WandB dashboard
# Visit https://wandb.ai/your-username/luna-speech-tokenizer-
```
1. Set up RunPod environment
   - Launch A100 80GB pod
   - Run `setup_training.sh`
   - Verify GPU with `nvidia-smi`
2. Download dataset
   - Start with train-clean-100 (6GB, $10 training)
   - Verify with `ls /workspace/data/LibriSpeech/`
3. Start tokenizer training
   - Edit config if needed
   - Run training script
   - Monitor first 10 epochs
4. Test quality
   - Use `test_tokenizer.py` on a checkpoint
   - Listen to reconstructed audio
   - Check that MAE < 0.1
- Complete tokenizer training (100 epochs)
- Evaluate reconstruction quality
- Scale up to larger dataset if quality is good
- Begin Phase 2 planning (S2S model)
- Train Hybrid S2S Model (Phase 2)
- Add emotional control (Phase 3)
- Deploy to production
- Fine-tune on Indian languages
If you've been using Testing-S2S for inference only, everything you rely on still works:
✓ Existing inference server (src/server.py)
✓ WebSocket streaming API
✓ Turn-based and stream modes
✓ HiFiGAN vocoder
✓ All API endpoints
What's new for you:
✓ Train your own models instead of using random weights
✓ Replace SpeechTokenizer with a trained version
✓ Full control over model architecture and data
✓ No dependency on external APIs
- Keep using existing setup - Nothing breaks
- Train tokenizer - Follow TRAINING_GUIDE.md
- Update server - Load trained checkpoint:
```python
tokenizer = TrainableSpeechTokenizer(
    checkpoint_path="checkpoints/tokenizer_best.pt"
).to(device)
```
- Test quality - Compare audio before/after
Q: Do I have to train my own models now?
A: No! Existing inference code still works. Training is optional for those who want:
- Full model ownership
- Custom datasets
- Specialized use cases
- Independence from APIs
Q: How long does training take?
A: Depends on the dataset:
- 100h dataset: 8-12 GPU hours ($10-15)
- 360h dataset: 30-40 GPU hours ($36-48)
- 960h dataset: 80-120 GPU hours ($95-143)
Q: Can I pause and resume training?
A: Yes! Checkpoints are saved every N epochs:
```bash
# Resume from checkpoint
python training/train_tokenizer.py \
  --config my_config.yaml \
  --resume checkpoints/tokenizer_epoch_50.pt
```
Q: What GPU do I need?
A:
- Training: A100 80GB (RunPod: $0.89-1.19/hr)
- Testing: A40 48GB or RTX 4090
- Development: Any GPU with 8GB+ VRAM
Q: Is this production-ready?
A: Phase 1 (Tokenizer) is production-ready:
- ✓ Tested architecture (based on Encodec/SpeechTokenizer)
- ✓ Works with LibriSpeech out-of-box
- ✓ Checkpoint management
- ✓ Quality testing tools
Phases 2-3 coming soon (S2S model, emotions).
Encoder:
- 4 convolutional layers with GroupNorm
- Downsampling by 8x (24kHz → 3kHz frame rate)
- Output: 512-dim latent vectors
Quantizer:
- 8 residual vector quantizers (RVQ)
- 1024 codebook entries per quantizer
- Total vocabulary: 1024^8 possible combinations per frame
Decoder:
- 4 transposed convolutional layers
- Upsampling by 8x (3kHz → 24kHz)
- Output: 80-dim mel spectrogram
Vocoder:
- HiFiGAN Universal (pretrained, frozen)
- Mel → 24kHz audio waveform
Loss Function:
Total Loss = Mel_L1 + 0.5 * Mel_MSE + 0.25 * Commitment
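Written out explicitly, the weighting looks like this. A sketch only: the training script computes these terms on mel spectrograms as PyTorch tensors; plain floats are used here just to show how the weights combine.

```python
# The composite tokenizer loss: two reconstruction terms plus a lighter
# commitment term that pulls encoder outputs toward their codebook entries.
def tokenizer_loss(mel_l1, mel_mse, commitment):
    # Total Loss = Mel_L1 + 0.5 * Mel_MSE + 0.25 * Commitment
    return mel_l1 + 0.5 * mel_mse + 0.25 * commitment

# Example values: reconstruction dominates, commitment regularizes.
total = tokenizer_loss(mel_l1=0.4, mel_mse=0.2, commitment=0.08)
```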
Optimizer: AdamW
Learning Rate: 1e-4
Scheduler: CosineAnnealing
Batch Size: 16 (adjustable)
Gradient Clip: 1.0
Weight Decay: 0.01
Epochs: 100-200

- Training Guide: TRAINING_GUIDE.md
- README: README.md
- GitHub: https://github.com/devasphn/Testing-S2S
- RunPod: https://runpod.io
- LibriSpeech: http://www.openslr.org/12/
- WandB: https://wandb.ai
Need help?
- Check documentation: TRAINING_GUIDE.md has troubleshooting
- Open GitHub issue: Include logs and error messages
- Share checkpoints: Use RunPod Network Storage for team collaboration
Version: 1.0.0
Release Date: November 17, 2025
Status: ✓ Tokenizer training ready, S2S & Emotion coming soon
License: MIT - Full commercial use allowed
This training infrastructure was inspired by:
- Luna AI (Pixa) - First Indian speech-to-speech model
- Sparsh Agrawal - Proof that world-class AI can be built with limited resources
- Moshi (Kyutai Labs) - Real-time duplex architecture
- GLM-4-Voice (Tsinghua) - Chinese+English speech model
- LibriSpeech - High-quality free speech dataset
Built with ❤️ by developers who believe in open, independent AI.
⭐ Star the repo to follow Phase 2 (S2S Model) and Phase 3 (Emotions) development!