- Adept Persimmon (`PersimmonForCausalLM`)
- Apertus (`ApertusForCausalLM`)
  - 8B-Instruct-2509, 70B-Instruct-2509

  Note: Use `--set enable-thinking 1` to enable thinking (as sketched below).
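  For example, a run with thinking enabled might look like the sketch below; `./main -m apertus.bin` is only a placeholder for however you launch the program with your converted model:

  ```sh
  # --set enable-thinking 1 is from the note above; the binary name and
  # model path are placeholders for your own setup.
  ./main -m apertus.bin --set enable-thinking 1
  ```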
- Apriel (`AprielForCausalLM`)
- Aquila (`AquilaForCausalLM`)
- Baichuan (`BaichuanForCausalLM`, `BaichuanM1ForCausalLM`)
  - Chat-7B, Chat-13B
  - M1: Instruct-14B
  - Fine-tunings: Med-R1 (Tip: `--set chat_template im`)
- BlueLM (`BlueLMForCausalLM`)
- ChatGLM (`ChatGLMModel`, `Glm4ForCausalLM`, `Glm4MoeLiteForCausalLM`):
  - ChatGLM: 6B
  - ChatGLM2 family: ChatGLM2 6B, CodeGeeX2 6B, ChatGLM3 6B

    Tip on CodeGeeX2: code completion only, no context. Use the system prompt to specify the language, e.g. `-s "# language: python"`.
  - CharacterGLM: 6B (`-a CharacterGLM`)

    Note: Use additional key-value pair arguments to specify characters, `--kv user_name "..." bot_name "..." user_info "..." bot_info "..."` (see the sketch at the end of this entry).
  - GLM-4: Chat-9B-128k, Chat-9B-1M
  - CodeGeeX4: 9B (`-a CodeGeeX4`)
  - GLM-4: GLM-4-0414, GLM-Z1-9B-0414, GLM-4-32B-0414, GLM-Z1-32B-0414, GLM-Z1-Rumination-32B-0414
  - 4.7-Flash: [GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash/tree/279ecdf8ee35f17f1939f95d6b113d8b806a7b2b)
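  A CharacterGLM character setup could then look like the sketch below; the `./main -m` invocation and the sample values are placeholders:

  ```sh
  # One --kv with key-value pairs, following the note above.
  ./main -m characterglm.bin \
      --kv user_name "Alice" bot_name "Xiao Bai" \
           user_info "a curious traveler" bot_info "an enthusiastic tour guide"
  ```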
- Cohere (`CohereForCausalLM`)
  - C4AI Command-R
  - Aya-23-8B, Aya-23-35B (`-a Aya-23`, fully compatible with Command-R)
  - C4AI Command R7B
- DeciLM (`DeciLMForCausalLM`)
  - Nemotron: Llama-3.3-Nemotron-Super-49B-v1
- DeepSeek (`DeepseekForCausalLM`, `DeepseekV2ForCausalLM`, `DeepseekV3ForCausalLM`)
  - v1: Chat-16B
  - Coder v2: Instruct (💣 not tested), Lite-Instruct
  - Moonlight: Instruct-16B (`-a Moonlight`)
  - GigaChat: Instruct-20B (`-a GigaChat`)

  Two optimization modes are defined: speed (default) and memory. See `BaseMLAttention`.
- ERNIE (`Ernie4_5_ForCausalLM`, `Ernie4_5_MoeForCausalLM`)
- EXAONE (`ExaoneForCausalLM`)
  - v3.5: Instruct-2.4B, Instruct-7.8B, Instruct-32B
  - Deep: 2.4B, 7.8B, 32B
- Gemma (`GemmaForCausalLM`, `Gemma2ForCausalLM`, `Gemma3ForCausalLM`, `Gemma3ForConditionalGeneration`)
  - v1.0: Instruct-2B, Instruct-7B
  - v1.1: Instruct-2B, Instruct-7B
  - CodeGemma v1.1: Instruct-7B
  - v2: Instruct-2B, Instruct-9B, Instruct-27B
  - v3: Instruct-1B

    Note: Only download `tokenizer.model` and DO NOT download `tokenizer.json` when converting.
  - Rnj-1: Instruct
- GPT (`GptOssForCausalLM`)

  Note: Q4_1/Q4_0 quantization won't work. Use Q8 instead.
- Granite (`GraniteForCausalLM`, `GraniteMoeForCausalLM`)
- GroveMoE (`GroveMoeForCausalLM`)
- HunYuan (`HunYuanForCausalLM`, `HunYuanDenseV1ForCausalLM`)
  - Dense: Instruct-7B (lost)
  - Dense: 0.5B-Instruct, 1.8B-Instruct, 4B-Instruct, 7B-Instruct
  - MoE: A13B-Instruct
  - MT1.5: 1.8B, 7B
- Instella (`InstellaForCausalLM`)
- InternLM (`InternLMForCausalLM`, `InternLM2ForCausalLM`)
  - v1: Chat-7B, Chat-7B v1.1, Chat-20B
  - v2: Chat-1.8B, Chat-7B, Chat-20B, Math-Plus-1.8B, Math-Plus-7B, Math-Plus-20B
  - v2.5: Chat-1.8B, Chat-7B, Chat-7B-1M, Chat-20B
  - v3: Instruct-8B
- Jiutian (`JiutianForCausalLM`)
- Ling/Ring (`BailingMoeForCausalLM`)
  - Lite, Coder-Lite
  - v1.5: Ling-lite-1.5-2507, Ring-lite-2507
  - v2: Ling-mini-2.0, Ring-mini-2.0
- LlaMA-like (`LlamaForCausalLM`, `Llama4ForConditionalGeneration`):
  - All LlaMA-1 models
  - LlaMA-2: Chat-7B, etc.
  - LlaMA-3: Instruct-8B, Instruct-70B, and other derivations such as Llama3-8B-Chinese-Chat
  - LlaMA-3.1: Instruct-8B, Instruct-70B
  - LlaMA-3.2: Instruct-1B, Instruct-3B
  - CodeLlaMA: Instruct-7B (`-a CodeLlaMA`)
  - LLM-Compiler: 7B, 7B-FTD, 13B, 13B-FTD
  - DeepSeek: Chat-7B (`-a DeepSeek`), Coder-6.7B (`-a DeepSeekCoder`), Coder-Instruct-1.3B (`-a DeepSeekCoder`) 🔥
  - Yi (`-a Yi`):
    - v1: Chat-6B, Chat-34B
    - v1.5: Chat-6B, Chat-9B, Chat-34B, Chat-9B-16K, Chat-34B-16K
    - Coder: Chat-1.5B, Chat-9B
  - WizardLM: LM 7B (`-a WizardLM`), LM 13B (`-a WizardLM`), Coder Python-7B (`-a WizardCoder`)
  - TigerBot: Chat-7B, Chat-13B (`-a TigerBot`)
  - CodeFuse-DeepSeek: 33B (`-a CodeFuseDeepSeek`)
  - MAP-Neo: Instruct-7B (`-a MAP-Neo`)
  - Index: Chat-1.9B, Character-1.9B, Chat-1.9B-32K
  - NuminaMath: 7B-TIR
  - SmolLM (`-a SmolLM`):
    - v1: Instruct-1.7B
    - v2: Instruct-1.7B
  - Groq: Llama-3-Groq-8B-Tool-Use (`-a Llama-3-Groq-8B-Tool-Use`)
  - Megrez: Instruct-3B (`-a Megrez`)
  - Falcon (`-a Falcon3`)
  - DeepSeek-R1-Distill-LlaMA: 8B, 70B (`-a DeepSeek-R1-Distill-LlaMA`)
  - DeepHermes-3: Llama-3-8B-Preview (use `-s ...` to enable thinking)
  - Watt-tool: 8B, 70B
  - Reka-Flash: Flash-3, Flash-3.1 (`-a Reka-Flash-3`)
  - Nemotron: Llama-3.1-Nemotron-Nano-8B
  - LlaMA-4: Scout-Instruct, Maverick-Instruct
  - Seed-Coder: Instruct-8B, Reasoning-8B (`--name Seed-Coder`)
  - Nanbeige4: 3B-Thinking

  For other models that use the `LlamaForCausalLM` architecture, for example aiXcoder-7B, try `-a Yi` (see the sketch below).

  If there are both `tokenizer.model` and `tokenizer.json`, only download `tokenizer.model`.
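  As a sketch, converting such a model with the Yi adapter could look like this; paths are placeholders:

  ```sh
  # aiXcoder-7B uses the LlamaForCausalLM architecture, so try the Yi adapter.
  python convert.py -i /path/to/aiXcoder-7B -o aixcoder.bin -a Yi
  ```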
- Megrez (`MegrezMoeForCausalLM`)
- MiniCPM (`MiniCPMForCausalLM`, `MiniCPM3ForCausalLM`)
- Mistral (`MistralForCausalLM`, `MixtralForCausalLM`)
  - Mistral: Instruct-7B-v0.2, Instruct-7B-v0.3
  - OpenChat: 3.5 (`-a OpenChat`) 🔥

    Tip: Use the system prompt to select modes: `-s GPT4` (default mode), `-s Math` (mathematical reasoning mode).
  - Starling: 7B-beta (`-a Starling`)

    Note: This is based on OpenChat and is fully compatible with OpenChat GPT4 mode.
  - WizardLM: Math 7B (`-a WizardMath`)
  - Mixtral: Instruct-8x7B 🔥, Instruct-8x22B

    Three implementations of sliding-window attention (see `SlidingWindowAttentionImpl`):
    - Full cache: more RAM is needed.
    - Partial cache: less RAM is needed, and faster than ring cache (default).
    - Ring cache (i.e. rolling cache): least RAM, but the current implementation is naive (slow). 💣

    Note: The precision of these implementations differs, which causes different results.
  - NeuralBeagle14: 7B (`-a NeuralBeagle`)
  - WizardLM-2: WizardLM-2-8x22B (official link is gone) (`-a WizardLM-2-MoE`)

    Note: For `MixtralForCausalLM` models, `--experts ...` is supported to select a subset of experts when converting. For example, `--experts 0,1,2,3` selects the first 4 experts (see the sketch at the end of this entry).
  - Codestral: 22B-v0.1
  - Mistral-Nemo: Nemo-Instruct-2407
  - Small: Instruct-24B
  - DeepHermes-3-Mistral: 24B-Preview (`-a DeepHermes-3-Mistral`; a thinking model by default)
  - Small-3.1: Instruct-24B
  - Devstral: Small-2505, Small-2507

    Note: Please download `tokenizer.json` from here.
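  For example, keeping only the first four experts of a Mixtral model during conversion might look like this; the input and output paths are placeholders:

  ```sh
  # --experts selects a subset of experts at conversion time (MixtralForCausalLM only).
  python convert.py -i /path/to/Mixtral-8x7B-Instruct-v0.1 -o mixtral-4e.bin --experts 0,1,2,3
  ```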
- Olm (`OlmoeForCausalLM`, `Olmo2ForCausalLM`)
  - OLMoE: Instruct-7B
  - OLM-2: Instruct-7B, Instruct-13B, Instruct-32B
- Ouro (`OuroForCausalLM`)

  Note: additional options are supported (`--set ...`):
  - `total_ut_steps`: default 4
  - `exit_threshold`: default 1.0
- Orion (`OrionForCausalLM`)
- Pangu (`PanguProMoEForCausalLM`)
- Phi (`PhiForCausalLM`, `Phi3ForCausalLM`)

  Tip: `--temp 0` is recommended. Don't forget to try `--format qa`.
  - Dolphin Phi-2 (`-a DolphinPhi2`) 🐬
  - Phi-3: Mini-Instruct-4k, Mini-Instruct-128k, Medium-Instruct-4k, Medium-Instruct-128k
  - Phi-3.5: Mini-Instruct, MoE-Instruct
  - Phi-4: Instruct, Mini-Instruct
- QWen (`QWenLMHeadModel`, `Qwen2ForCausalLM`, `Qwen2MoeForCausalLM`, `Qwen3MoeForCausalLM`, `Qwen3ForCausalLM`)
  - v1: Chat-7B, Chat-14B, QAnything-7B
  - v1.5: Chat-0.5B, Chat-1.8B, Chat-4B, Chat-7B, Chat-14B, CodeQwen-Chat-7B (`-a CodeQwen`)
  - v1.5 MoE: Chat-A2.7B
  - v2: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-72B
  - v2 MoE: Instruct-57B-A14B (💣 not tested)
  - v2.5: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-14B, Instruct-32B, Instruct-72B
  - v2.5-Coder: Instruct-1.5B, Instruct-7B
  - v2.5-Math: Instruct-1.5B, Instruct-7B, Instruct-72B
  - Marco-o1 (`-a Marco-o1`)
  - QwQ: 32B-Preview, 32B (`-a QwQ`)
  - ReaderLM-v2 (`-a ReaderLM-v2`)
  - DeepSeek-R1-Distill-QWen: 1.5B, 7B, 14B, 32B, DeepScaleR-1.5B-Preview (`-a DeepSeek-R1-Distill-QWen`)
  - DeepSeek-R1-0528-Qwen3: 8B (`-a DeepSeek-R1-Distill-QWen3`)
  - Skywork-OR1: Math-7B, 7B-Preview, 32B-Preview
  - OlympicCoder: 7B, 32B
  - v3: 235B-A22B (💣 not tested), 30B-A3B, 32B, 14B, 8B, 4B, 1.7B, 0.6B, 30B-A3B-2507, 30B-A3B-Thinking-2507, 4B-2507, 4B-Thinking-2507
  - v3-Coder: 30B-A3B
  - MiMo: 7B-RL
  - Confucius3-Math: 14B (`-a DeepSeek-R1-Distill-QWen`)
  - Jan-Nano: 4B
  - Baichuan-M2: 32B
  - MiroThinker-v1.5: 30B
- Seed (`SeedOssForCausalLM`)
  - OSS: 36B-Instruct

  Note: Use `--set thinking_budget N` to set `thinking_budget`. Default: -1.
- SmolLM-3 (`SmolLM3ForCausalLM`)
- Solar (`SolarForCausalLM`)
- TeleChat (`TeleChat2ForCausalLM`)
- XVERSE (`XverseForCausalLM`)

  Note: The tokenizer's behavior is not 100% identical.
- Youtu (`YoutuForCausalLM`)
- Zhinao (`ZhinaoForCausalLM`)
Please use `--format completion` for these models.
- AlphaGeometry-LM (`-a AlphaGeometry-LM`)
- DeepSeek (`DeepseekV2ForCausalLM`)
  - Coder-V2-Base (💣 not tested), Coder-V2-Lite-Base
- Gemma (`GemmaForCausalLM`)
- Grok-1
- LlaMA-like (`LlamaForCausalLM`):
  - DeepSeek: Coder-Base-1.3B (`-a DeepSeekCoder`), Coder-Base-6.7B (`-a DeepSeekCoder`)
  - Seed-Coder: Base-8B (`--name Seed-Coder`)
- Mistral (`MistralForCausalLM`, `MixtralForCausalLM`)
  - Mistral: Base-7B-v0.1, Base-7B-v0.3
- Stable-LM (`StableLMEpochModel`)
- StarCoder (`Starcoder2ForCausalLM`)
- Maya1
  - [maya1](https://huggingface.co/maya-research/maya1/tree/fbd30e2b3ec92d2e227df20005a73e172bc5d2de)

  SNAC-24kHz is used as the codec. Use these additional command line options when converting: `--name Maya1 -a Maya1 --snac_model /path/to/snac_24kHz`. Use `--set voice XX` to describe the voice. More info.
- Orpheus TTS

  SNAC-24kHz is used as the codec. Use these additional command line options when converting: `--name Orpheus-TTS -a Orpheus-TTS --snac_model /path/to/snac_24kHz`. Use `--set voice XX` to select voice `XX`, such as `tara`. More info. A combined sketch follows below.
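  Putting conversion and voice selection together might look like this; the paths and the `./main -m` invocation are placeholders:

  ```sh
  # Conversion flags are from the note above; paths are placeholders.
  python convert.py -i /path/to/Orpheus-TTS -o orpheus.bin \
      --name Orpheus-TTS -a Orpheus-TTS --snac_model /path/to/snac_24kHz
  # Select the voice "tara" at run time (binary and model are placeholders).
  ./main -m orpheus.bin --set voice tara
  ```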
- OuteTTS

  DAC.speech.v1.0 1.5kbps is used as the codec. Use these additional command line options when converting: `--name OuteTTS -a OuteTTS --dac_model /path/to/dac`. Use `--set speaker /path/to/speaker.json` to select a speaker profile. More info.
- Qwen3-TTS (`Qwen3TTSForConditionalGeneration`):
  - 12Hz-1.7B: CustomVoice, VoiceDesign, Base

  Note: `voice_clone_mode` only supports "xvec" now.

  Additional options (use `--set X Y` to change values):
  - `language`: default `auto`.
  - `speaker`: default `vivian`.
  - `instruct`: default "".
  - `voice_clone_mode`: "xvec" or "icl"; default "xvec".
  - `ref_audio_file`: default "".
  - `ref_text`: default "". Required for "icl" mode (see the sketch below).
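  For example, voice cloning in "icl" mode needs both a reference audio file and its transcript; the `./main -m` invocation and the file paths are placeholders:

  ```sh
  ./main -m qwen3-tts.bin --set voice_clone_mode icl \
      --set ref_audio_file /path/to/ref.wav --set ref_text "transcript of ref.wav"
  ```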
- Fuyu (`FuyuForCausalLM`)
  - Base: 8B
- Gemma (`Gemma3ForConditionalGeneration`)
  - v3: Instruct-4B, Instruct-12B, Instruct-27B
  - MedGemma: Instruct-4B, Instruct-27B
  - TranslateGemma: 4B, 12B, 27B

  Note: Only download `tokenizer.model` and DO NOT download `tokenizer.json` when converting. Use `--set do-pan-and-scan 1` to enable Pan and Scan. Use `--name TranslateGemma` when converting TranslateGemma models to activate translation support: specify language codes in prompts like `/en->zh Hello world.`. The source language code can be set to `auto` when translating texts. Default language codes can be configured by `--set ID code`, such as `--set source-language-code zh` and `--set target-language-code en`. A combined sketch follows below.
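  Putting the conversion and translation options together might look like the sketch below; the paths and the `./main -m` invocation are placeholders:

  ```sh
  # Convert with translation support enabled.
  python convert.py -i /path/to/TranslateGemma-4B -o translategemma.bin --name TranslateGemma
  # Default to auto-detected source language and Chinese target;
  # prompts like "/en->zh Hello world." still override these.
  ./main -m translategemma.bin --set source-language-code auto --set target-language-code zh
  ```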
- GLM (`Glm4vForConditionalGeneration`)
  - v4: 4.6V-Flash

  Additional options are supported (use `--set X Y` to change values), like Kimi.
- Janus (`MultiModalityCausalLM`)

  Note: Use `--set parallel-size N` to generate `N` images in a single run (default: 2); `--set gen-head-temperature T` to set the temperature of `gen-head` (default: 1.0). Add the prefix "/gen " to prompts to generate images (see the sketch below).
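  As a sketch (the `./main -m` invocation is a placeholder), generating four images per run at a lower gen-head temperature:

  ```sh
  ./main -m janus.bin --set parallel-size 4 --set gen-head-temperature 0.9
  # Then prefix the prompt with "/gen ", e.g.:
  #   /gen a watercolor painting of a lighthouse
  ```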
- Kimi (`KimiVLForConditionalGeneration`)

  Additional options (use `--set X Y` to change values):
  - `video_max_frames`: default 20.
  - `native_resolution`: use native resolution or not; default `false` (this seems sensitive to quantization, so it defaults to `false`).
  - `fps`: default 1.0.
- Mistral (`Mistral3ForConditionalGeneration`)
  - Ministral-3: 3B-Instruct-2512, 3B-Reasoning-2512, 8B-Instruct-2512, 8B-Reasoning-2512
  - Devstral-Small-2: 24B-Instruct-2512
- Qwen (`Qwen2AudioForConditionalGeneration`, `Qwen2VLForConditionalGeneration`, `Qwen2_5_VLForConditionalGeneration`, `Qwen3VLForConditionalGeneration`, `Qwen3VLMoeForConditionalGeneration`)
  - Qwen2-Audio: 7B-Instruct
  - Qwen2-VL: 2B-Instruct, 7B-Instruct
  - Qwen2.5-VL: 3B-Instruct, 7B-Instruct
  - MiMo-VL: 7B-RL, 7B-RL-2508
  - Dolphin: v2
  - Qwen3-VL: 2B-Instruct, 4B-Instruct, A3B-Instruct, etc.
- SmolVLM2 (`SmolVLMForConditionalGeneration`)

  Note: Use `--set do-split 1` to enable Split.
- Step-VL (`StepVLForConditionalGeneration`)
  - v3: 10B

  Additional options (use `--set X Y` to change values):
  - `do-pan-and-scan`: default 1 (i.e. true). Set to 0 to use only a global view to reduce the compute.
  - `native-resolution`: default 0 (i.e. false). This model can support native resolution mathematically without pan and scan. (For experiments only.)
- Youtu-VL (`YoutuVLForConditionalGeneration`)
- dots.ocr (`DotsOCRForCausalLM`)

  Note: Prompt for OCR: `{{image:...}}Extract the text content from this image.` Here are other prompts for OCR. Use `+single-turn` to discard history automatically (see the sketch below).
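  A minimal OCR run might look like this sketch; the `./main -m` invocation and the image path are placeholders:

  ```sh
  ./main -m dots-ocr.bin +single-turn
  # Prompt, following the pattern above:
  #   {{image:/path/to/page.png}}Extract the text content from this image.
  ```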
- Nanonets-OCR2 (`Qwen2VLForConditionalGeneration`, `Qwen2_5_VLForConditionalGeneration`)
- GLM-OCR (`GlmOcrForConditionalGeneration`)
- GLM-ASR (`GlmAsrForConditionalGeneration`)
- Qwen3-ASR (`Qwen3ASRForConditionalGeneration`)

  Additional options (use `--set X Y` to change values):
  - `language`: default "auto".

  Additional options (use `--set X Y` to change values):
  - `language`: default "Chinese". This affects how sentences are cut into words. Each character is a "word" for Chinese; for other languages, words are separated by spaces.
  - `delimiter`: default "". Timestamps are reported for "sentences": sentences are separated by this delimiter. For Chinese, when the delimiter is empty, each character is treated as a sentence.
  - `format`: default "srt". Format of the output; "srt" and "json" are supported.
Note: Only dense embedding is implemented.

- Roberta (`XLMRobertaModel`)
  - BCE-Embedding
  - BGE-M3 (`-a BGE-M3`)
- MiniCPM (`MiniCPMModel`)
- Qwen-3 Embedding (`Qwen3ForCausalLM`)
- Roberta (`XLMRobertaForSequenceClassification`)
  - BCE-ReRanker
  - BGE-ReRanker-M3 (`-a BGE-Reranker-M3`)
- MiniCPM (`MiniCPMModel`)
- Qwen-3 Reranker (`Qwen3ForCausalLM`)
- Qwen3-VL Embedding (`Qwen3VLForConditionalGeneration`)
- Qwen3-VL Reranker (`Qwen3VLForConditionalGeneration`)
These LoRA models have been tested:
Tip for diffusion LLMs: they are very sensitive to sampling parameters, and results may be completely unacceptable with improper parameters.
- LLaDA (`LLaDA2MoeModelLM`)
  - Supported options (`--set OPTION VALUE`; see the sketch below):
    - `block_length`: default 32
    - `steps`: default 32
    - `minimal_topk`: default 1
    - `threshold`: default 0.95
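  Because diffusion LLMs are so sensitive to sampling parameters (see the tip above), it is worth setting these options explicitly when experimenting; the `./main -m` invocation is a placeholder:

  ```sh
  # Start from the documented defaults and adjust one option at a time.
  ./main -m llada.bin --set block_length 32 --set steps 32 \
      --set minimal_topk 1 --set threshold 0.95
  ```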
- WeDLM (`WeDLMForCausalLM`)
  - Supported options (`--set OPTION VALUE`):
    - `block_size`: default 16. When set to <= 1, it falls back to auto-regressive decoding.
    - `accept_algo`: default 2
      - 0: entropy algo: https://github.com/Tencent/WeDLM/blob/d4481cab821044b8ebd5f78bc37f23787a6275ed/wedlm/engine/sampler.py#L169
      - 1: prob algo: https://huggingface.co/tencent/WeDLM-8B-Instruct/blob/main/modeling_wedlm.py#L694
      - 2: custom algo: sampling + prob
    - `threshold`: default 0.7. For algo 0, tokens are accepted if entropy is less than the threshold; for the others, tokens are accepted when probability (or confidence level) is larger than this.
    - `pos_penalty_factor`: default 0.02 (used by the entropy algo)
- Meta-AI multi-token prediction model checkpoints

  Download at least one multi-token prediction checkpoint (such as 7B_1T_4). Assume it is stored at /path/to/llama-multi-predict/7B_1T_4. Make sure `tokenizer.model` is downloaded to /path/to/llama-multi-predict.

  To convert it, use `-a llama-multi-token-prediction-ckpt`:

  `python convert.py -i /path/to/llama-multi-predict/7B_1T_4 -o llama-multi.bin -a llama-multi-token-prediction-ckpt`

  This is a base model, so remember to use `--format completion`.

  Tip: Use `--kv n_future_tokens N` to change the number of future tokens, N = [1, 4], as in the sketch below.
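  End to end, conversion and a completion-mode run might look like the sketch below; `convert.py` and the flags are from the steps above, while the `./main -m` invocation is a placeholder for your own setup:

  ```sh
  python convert.py -i /path/to/llama-multi-predict/7B_1T_4 \
      -o llama-multi.bin -a llama-multi-token-prediction-ckpt
  # Base model: completion format, predicting 2 future tokens per step.
  ./main -m llama-multi.bin --format completion --kv n_future_tokens 2
  ```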