This project implements speculative decoding for reasoning tasks (e.g., GSM8K) using MiniTorch, a custom PyTorch-style framework. We use models from the DeepSeek-R1 family as the target and draft models for this experiment.
The project builds a modular speculative decoding pipeline with CUDA acceleration, targeting reasoning tasks.
Example output with a 100-token limit:
> Who is better Ronaldo or Messi?
========== Speculative ==========
Out: Okay, so I need to figure out whether Ronaldo or Messi is better. Hmm, both are big names in football, but I'm not sure how to compare them. Let me start by recalling what I know about each of them.
Ronaldo, I think, is from Brazil. He's been playing for a long time, maybe 15 years or more. He's known for his skills, especially in the Air, which is a term I've heard in relation to football.
Acceptance rate: 1.000
Throughput: 19.3 tokens/s
=========== Target AR ===========
Out: Okay, so I need to figure out who is better between Ronaldo and Messi. Hmm, both are incredible players, but I'm not sure how to compare them. Let me think about their styles first. Ronaldo plays as a striker, right? He's known for his speed and powerful strikes. Messi, on the other hand, is a midfielder who plays in a more advanced role but can also score goals.
I guess their strengths are different. Ronaldo's strength is his finishing and ability to run
Throughput: 16.6 tokens/s
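The measured acceptance rate determines the theoretical gain of speculative decoding. As a rough sanity check, the expected number of tokens produced per target-model forward pass follows the geometric-series formula from Leviathan et al. [1]; the sketch below assumes an illustrative draft length `k` (the project's actual draft length may differ):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per target forward pass, given
    per-token acceptance rate alpha and k drafted tokens
    (formula from Leviathan et al., 2023)."""
    if alpha >= 1.0:
        return k + 1  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

At the acceptance rate of 1.000 reported above and, say, `k = 4`, each target pass would yield 5 tokens; the real-world speedup (19.3 vs. 16.6 tokens/s here) is smaller because the draft model's forward passes are not free.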
- MiniTorch Backend: PyTorch-style framework for autodiff, tensor ops, and model building
- CUDA Kernels: Custom fast kernels for softmax, layernorm, and tensor ops
- Benchmark: Evaluation on GSM8K and LIMO
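To make the pipeline concrete, here is a minimal sketch of one speculative decoding step in the style of Leviathan et al. [1]. It is not the project's actual implementation: `target_probs` and `draft_probs` are hypothetical callables returning full next-token distributions, the draft proposes greedily for simplicity, and a real pipeline would score all drafted positions in a single batched target pass rather than one call per token:

```python
import random

def speculative_step(target_probs, draft_probs, prefix, k, rng=random.random):
    """One speculative decoding step: the draft proposes up to k tokens;
    each is accepted with probability min(1, p_target/p_draft); on the
    first rejection we resample from the residual distribution
    max(0, p_target - p_draft), renormalized."""
    out = list(prefix)
    for _ in range(k):
        q = draft_probs(out)
        x = max(range(len(q)), key=q.__getitem__)  # greedy draft proposal
        p = target_probs(out)
        if rng() < min(1.0, p[x] / max(q[x], 1e-12)):
            out.append(x)  # accepted: keep the draft token
        else:
            # rejected: sample from the normalized residual distribution
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(residual) or 1.0
            r, acc = rng() * z, 0.0
            for tok, w in enumerate(residual):
                acc += w
                if w > 0 and r <= acc:
                    out.append(tok)
                    break
            break  # stop at the first rejection
    return out
```

The acceptance test guarantees the output distribution matches the target model exactly, which is why the speculative run above produces target-quality text at higher throughput.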
```bash
conda create -n specdecode python=3.10
conda activate specdecode
pip install -r requirements.extra.txt
pip install -r requirements.txt
pip install -e .
```
Compiling the CUDA kernels is optional and only needed if additional kernels were implemented:

```bash
bash compile_cuda.sh
```
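For reference, the custom softmax kernel fuses the standard numerically stable reduction (subtract the row max before exponentiating). The Python sketch below shows the computation only; it is not the project's kernel code:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row: subtracting the row max
    before exponentiating avoids overflow without changing the result."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

A fused GPU kernel performs the same max, exp, and sum reductions per row in shared memory, avoiding the multiple global-memory passes a naive implementation would need.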
```bash
python project/run_spec_decoding.py
```
[1] Leviathan, Y., Kalman, M., and Matias, Y. (2023). Fast inference from transformers via speculative decoding. Proceedings of the 40th International Conference on Machine Learning, PMLR 202:19274-19286. URL https://proceedings.mlr.press/v202/leviathan23a.html.
[2] Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
[3] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021). Training verifiers to solve math word problems. URL https://arxiv.org/abs/2110.14168.
[4] Rush, S., Gao, G., Abilov, A., and Gokaslan, A. (2021). MiniTorch. URL https://minitorch.org.
[5] Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. (2025). LIMO: Less is more for reasoning. URL https://arxiv.org/abs/2502.03387.
[6] Gao, X., Xie, W., Xiang, Y., and Ji, F. (2024). Falcon: Faster and parallel inference of large language models through enhanced semi-autoregressive drafting and custom-designed decoding tree. URL https://arxiv.org/abs/2412.12639.
We would like to thank Romsto and Feifeibear for their work, which gave us a structure to build on.