Supported Models

Chat/Instruct Models

Adept Persimmon (PersimmonForCausalLM)
- Chat-8B
Apertus (ApertusForCausalLM)
- 8B-Instruct-2509, 70B-Instruct-2509
  
  Note: Use --set enable-thinking 1 to enable thinking.
Apriel (AprielForCausalLM)
- Instruct-5B
Aquila (AquilaForCausalLM)
- Chat2-7B, Chat2-34B, Chat2-7B-16K, Chat2-34B-16K
Baichuan (BaichuanForCausalLM, BaichuanM1ForCausalLM)
- Chat-7B, Chat-13B
- M1: Instruct-14B
- Fine-tunings: Med-R1 (Tip: --set chat_template im)
BlueLM (BlueLMForCausalLM)
- Chat-7B, Chat-7B 32K
ChatGLM (ChatGLMModel, Glm4ForCausalLM, Glm4MoeLiteForCausalLM):
- ~~ChatGLM: 6B~~
- ChatGLM2 family: ChatGLM2 6B, CodeGeeX2 6B, ChatGLM3 6B
  
  Tip on CodeGeeX2: Code completion only, no context. Use system prompt to specify language, e.g. -s "# language: python".
- CharacterGLM: 6B (-a CharacterGLM)
  
  Note: Use additional key-value pair arguments to specify characters, --kv user_name "..." bot_name "..." user_info "..." bot_info "...".
- GLM-4: Chat-9B-128k, Chat-9B-1M
- CodeGeeX4: 9B (-a CodeGeeX4)
- GLM-4: GLM-4-0414, GLM-Z1-9B-0414, GLM-4-32B-0414, GLM-Z1-32B-0414, GLM-Z1-Rumination-32B-0414
- 4.7-Flash: (https://huggingface.co/zai-org/GLM-4.7-Flash/tree/279ecdf8ee35f17f1939f95d6b113d8b806a7b2b)
Cohere (CohereForCausalLM)
- C4AI Command-R
- Aya-23-8B, Aya-23-35B (-a Aya-23, fully compatible with Command-R)
- C4AI Command R7B
DeciLM (DeciLMForCausalLM)
- Nemotron: Llama-3.3-Nemotron-Super-49B-v1
DeepSeek (DeepseekForCausalLM, DeepseekV2ForCausalLM, DeepseekV3ForCausalLM)
- v1: Chat-16B
- v2: Chat (💣 not tested), Lite-Chat
- Coder v2: Instruct (💣 not tested), Lite-Instruct
- Moonlight: Instruct-16B (-a Moonlight)
- GigaChat: Instruct-20B (-a GigaChat)
Two optimization modes are defined: speed (default) and memory. See BaseMLAttention.
ERNIE (Ernie4_5_ForCausalLM, Ernie4_5_MoeForCausalLM)
- Non-thinking: 0.3B, A3B
- Thinking: A3B
EXAONE (ExaoneForCausalLM)
- v3.5: Instruct-2.4B, Instruct-7.8B, Instruct-32B
- Deep: 2.4B, 7.8B, 32B
Gemma (GemmaForCausalLM, Gemma2ForCausalLM, Gemma3ForCausalLM, Gemma3ForConditionalGeneration)
- v1.0: Instruct-2B, Instruct-7B
- v1.1: Instruct-2B, Instruct-7B
- CodeGemma v1.1: Instruct-7B
- v2: Instruct-2B, Instruct-9B, Instruct-27B
- v3: Instruct-1B
Note: Only download tokenizer.model and DO NOT download tokenizer.json when converting.
- Rnj-1: Intruct
GPT (GptOssForCausalLM)
- OSS: 20B, 120B
Note: Q4_1/Q4_0 quantization won't work. Use Q8 instead.
Granite (GraniteForCausalLM, GraniteMoeForCausalLM)
- v3.0: Instruct-1B-A400M, Instruct-3B-A800M, Instruct-2B, Instruct-8B
- v3.1: Instruct-1B-A400M, Instruct-3B-A800M, Instruct-2B, Instruct-8B
- v3.2: Instruct-2B, Instruct-2B, Instruct-8B
GroveMoE (GroveMoeForCausalLM)
- Inst
HunYuan (HunYuanForCausalLM, HunYuanDenseV1ForCausalLM)
- ~~Dense: Instruct-7B~~ (lost)
- Dense: 0.5B-Instruct, 1.8B-Instruct, 4B-Instruct, 7B-Instruct
- MoE: A13B-Instruct
- MT1.5: 1.8B, 7B
Instella (InstellaForCausalLM)
- Instruct-3B
InternLM (InternLMForCausalLM, InternLM2ForCausalLM)
- v1: Chat-7B, Chat-7B v1.1, Chat-20B
- v2: Chat-1.8B, Chat-7B, Chat-20B, Math-Plus-1.8B, Math-Plus-7B, Math-Plus-20
- v2.5: Chat-1.8B, Chat-7B, Chat-7B-1M, Chat-20B
- v3: Instruct-8B
Jiutian (JiutianForCausalLM)
- Math-8B, Math-8B-Thinking, Coder-8B-Instruct, DA-8B
Ling/Ring (BailingMoeForCausalLM)
- Lite, Coder-Lite
- v1.5: Ling-lite-1.5-2507, Ring-lite2507
- v2: Ling-mini-2.0, Ring-mini-2.0
LlaMA-like (LlamaForCausalLM, Llama4ForConditionalGeneration):
- All LlaMA-1 models
- LlaMA-2: Chat-7B, etc
- LlaMA-3: Instruct-8B, Instruct-70B, other derivations such as Llama3-8B-Chinese-Chat
- LlaMA-3.1: Instruct-8B, Instruct-70B
- LlaMA-3.2: Instruct-1B, Instruct-3B
- CodeLlaMA: Instruct-7B (-a CodeLlaMA)
- LLM-Compiler: 7B, 7B-FTD, 13B, 13B-FTD
- DeepSeek: Chat-7B (-a DeepSeek) , Coder-6.7B (-a DeepSeekCoder), Coder-Instruct-1.3B (-a DeepSeekCoder) 🔥
- Yi: (-a Yi)
  - v1: Chat-6B, Chat-34B
  - v1.5: Chat-6B, Chat-9B, Chat-34B, Chat-9B-16K, Chat-34B-16K
  - Coder: Chat-1.5B, Chat-9B
- WizardLM: LM 7B (-a WizardLM), LM 13B (-a WizardLM), Coder Python-7B (-a WizardCoder)
- TigerBot: Chat-7B, Chat-13B (-a TigerBot)
- CodeFuse-DeepSeek: 33B (-a CodeFuseDeepSeek)
- MAP-Neo: Instruct-7B (-a MAP-Neo)
- Index: Chat-1.9B, Character-1.9B, Chat-1.9B-32K
- NuminaMath: 7B-TIR
- SmolLM: (-a SmolLM)
  - v1: Instruct-1.7B
  - v2: Instruct-1.7B
- Groq: Llama-3-Groq-8B-Tool-Use (-a Llama-3-Groq-8B-Tool-Use)
- Megrez: Instruct-3B (-a Megrex)
- Falcon: (-a Falcon3)
  - v3: Instruct-1B, Instruct-3B, Instruct-7B, Instruct-10B
- DeepSeek-R1-Distill-LlaMA: 8B, 70B (-a DeepSeek-R1-Distill-LlaMA)
- DeepHermes-3: Llama-3-8B-Preview (Use -s ... to enable thinking)
- Watt-tool: 8B, 70B
- Reke-Flash: Flash-3, Flash-3.1 (-a Reka-Flash-3)
- Nemotron: Llama-3.1-Nemotron-Nano-8B
- LlaMA-4: Scout-Instruct, Maverick-Instruct
- Seed-Coder: Instruct-8B, Reasoning-8B (--name Seed-Coder)
- Nanbeige4: 3B-Thinking
For other models that using LlamaForCausalLM architecture, for example, aiXcoder-7B, try -a Yi.

If there are both tokenizer.model and tokenizer.json, only download tokenizer.model.
Megrez (MegrezMoeForCausalLM)
- (3x7B-A3B)[https://huggingface.co/Infinigence/Megrez2-3x7B-A3B/tree/3ffc3b7c0ffc0f0b27d71fba2a97dcc14c797bb4]
MiniCPM (MiniCPMForCausalLM, MiniCPM3ForCausalLM)
- DPO-2B, SFT-2B, SFT-1B🔥
- 2B-128k (Note: --temp 0 is recommended.)
- MoE-8x2B
- v3: 4B
- v4: 0.5B, 8B, 8B-Survey, 8B-MCP
Mistral (MistralForCausalLM, MixtralForCausalLM)
- Mistral: Instruct-7B-v0.2, Instruct-7B-v0.3
- Small: Instruct-24B
- OpenChat: 3.5 (-a OpenChat) 🔥
  
  Tip: Use system prompt to select modes: -s GPT4 (default mode), -s Math (mathematical reasoning mode).
- Starling: 7B-beta (-a Starling)
  
  Note: This is based on OpenChat, and is fully compatible with OpenChat GPT4 mode.
- WizardLM: Math 7B (-a WizardMath)
- Mixtral: Instruct-8x7B 🔥, Instruct-8x22B
  
  Three implementations of sliding-window attention (see SlidingWindowAttentionImpl):
  - Full cache: more RAM is needed.
  - Partial cache: less RAM is needed, and faster than ring cache (default).
  - Ring cache (i.e. rolling cache): least RAM, but current implementation is naive (slow). 💣
  Note: precision of these implementations differs, which causes different results.
- NeuralBeagle14: 7B (-a NeuralBeagle)
- WizardLM-2: WizardLM-2-8x22B (official link is gone) (-a WizardLM-2-MoE)
  
  Note: For MixtralForCausalLM models, --experts ... is supported to select a subset of experts when converting. For example, --experts 0,1,2,3 selects the first 4 experts.
- Codestral: 22B-v0.1
- Mistral-Nemo: Nemo-Instruct-2407
- Small: Instruct-24B
- DeepHermes-3-Mistral: 24B-Preview (-a DeepHermes-3-Mistral. Default: Thinking model.)
- Small-3.1: Instruct-24B
- Devstral: Small-2505, Small-2507
  
  Note: Please download tokenizer.json from here.
Olm (OlmoeForCausalLM, Olmo2ForCausalLM)
- OLMoE: Instruct-7B
- OLM-2: Instruct-7B, Instruct-13B, Instruct-32B
Ouro (OuroForCausalLM)
- 2.6B-Thinking
  
  Note: additional options supported (--set ...)
  - total_ut_steps: default 4
  - exit_threshold: default 1.0
Orion (OrionForCausalLM)
- Chat-14B
Pangu (PanguProMoEForCausalLM)
- MoE: Pro-MoE
- Embedded: 7B, 1B
Phi (PhiForCausalLM, Phi3ForCausalLM)
- Phi-2
  
  Tip: --temp 0 is recommended. Don't forget to try --format qa.
- Dolphin Phi-2 (-a DolphinPhi2) 🐬
- Phi-3: Mini-Instruct-4k, Mini-Instruct-128k, Medium-Instruct-4k, Medium-Instruct-128k
- Phi-3.5: Mini-Instruct, MoE-Instruct
- Phi-4: Instruct, Mini-Instruct
QWen (QWenLMHeadModel, Qwen2ForCausalLM, Qwen2MoeForCausalLM, Qwen3MoeForCausalLM, Qwen3ForCausalLM)
- v1: Chat-7B, Chat-14B, QAnything-7B
- v1.5: Chat-0.5B, Chat-1.8B, Chat-4B, Chat-7B, Chat-14B, CodeQwen-Chat-7B (-a CodeQwen)
- v1.5 MoE: Chat-A2.7B
- v2: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-72B
- v2 MoE: Instruct-57B-A14B (💣 not tested)
- v2.5: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-14B, Instruct-32B, Instruct-72B
- v2.5-Coder: Instruct-1.5B, Instruct-7B
- v2.5-Math: Instruct-1.5B, Instruct-7B, Instruct-72B
- Marco-o1 (-a Marco-o1)
- QwQ: 32B-Preview, 32B (-a QwQ)
- ReaderLM-v2 (-a ReaderLM-v2)
- DeepSeek-R1-Distill-QWen: 1.5B, 7B, 14B, 32B, DeepScaleR-1.5B-Preview (-a DeepSeek-R1-Distill-QWen)
- DeepSeek-R1-0528-Qwen3: 8B (-a DeepSeek-R1-Distill-QWen3)
- Skywork-OR1: Math-7B, 7B-Preview, 32B-Preview
- OlympicCoder: 7B, 32B
- v3: 235B-A22B (💣 not tested), 30B-A3B, 32B, 14B, 8B, 4B, 1.7B, 0.6B, 30B-A3B-2507, 30B-A3B-Thinking-2507, 4B-2507, 4B-Thinking-2507
- v3-Coder: 30B-A3B
- MiMo: 7B-RL
- Confucius3-Math: 14B (-a DeepSeek-R1-Distill-QWen)
- Jan-Nano: 4B
- Baichuan-M2: 32B
- MiroThinker-v1.5: 30B
Seed (SeedOssForCausalLM)
- OSS: 36B-Instruct
  
  Note: Use --set thinking_budget N to set thinking_budget. Default: -1.
SmolLM-3 (SmolLM3ForCausalLM)
- 3B
Solor (SolarForCausalLM)
- Pro
TeleChat (TeleChat2ForCausalLM)
- v2: 3B, 7B, 115B
- v2.5 35B, 115B
XVERSE (XverseForCausalLM)
- Chat-7B, Chat-13B, Chat-65B
Note: Tokenizer's behavior is not 100% identical.
Youtu (YoutuForCausalLM)
- 2B
Zhinao (ZhinaoForCausalLM)
- Chat-7B-4K, Chat-7B-32K, Chat-7B-360K

Text Base Models

Please use --format completion for these models.

AlphaGeometry-LM (-a AlphaGeometry-LM)
- geometry.757
DeepSeek (DeepseekV2ForCausalLM)
- Coder-V2-Base (💣 not tested), Coder-V2-Lite-Base
Gemma (GemmaForCausalLM)
- CodeGemma v1.1: Base-2B, Base-7B
Grok-1
- Base
  
  About Grok-1.
LlaMA-like (LlamaForCausalLM):
- DeepSeek: Coder-Base-1.3B (-a DeepSeekCoder), Coder-Base-6.7B (-a DeepSeekCoder)
- Seed-Coder: Base-8B (--name Seed-Coder)
Mistral (MistralForCausalLM, MixtralForCausalLM)
- Mistral: Base-7B-v0.1, Base-7B-v0.3
Stable-LM (StableLMEpochModel)
- Code-3B
StarCoder (Starcoder2ForCausalLM)
- Base-3B, Base-7B, Base-15B

TTS Models

Maya1
- [maya1] (https://huggingface.co/maya-research/maya1/tree/fbd30e2b3ec92d2e227df20005a73e172bc5d2de)
  
  SNAC-24kHz is used as codec. Use these additional command line options when converting: --name Maya1 -a Maya1 --snac_model /path/to/snac_24kHz
  
  Use --set voice XX to describe the voice. More info.
Orpheus TTS
- 3B: EN, ZH, etc
  
  SNAC-24kHz is used as codec. Use these additional command line options when converting: --name Orpheus-TTS -a Orpheus-TTS --snac_model /path/to/snac_24kHz
  
  Use --set voice XX to select voice XX, such as tara. More info.
OuteTTS:
- 1.0: 1B, 0.6B
  
  DAC.speech.v1.0 1.5kbps is used as codec. Use these additional command line options when converting: --name OuteTTS -a OuteTTS --dac_model /path/to/dac
  
  Use --set speaker /path/to/speaker.json to select a speaker profile. More info.
Qwen3-TTS (Qwen3TTSForConditionalGeneration):
- 12Hz-1.7B: CustomVoice, VoiceDesign, Base
Note: voice_clone_mode only support "xvec" now.

Additional options (Use --set X Y to change values):
- language: default auto.
- speaker: default vivian.
- instruct: default "".
- voice_clone_mode: "xvec" or "icl". default "xvec".
- ref_audio_file: default "".
- ref_text: default "". Required for "icl" mode.

Multimodal Models

Fuyu (FuyuForCausalLM)
- Base: 8B
Gemma (Gemma3ForConditionalGeneration)
- v3: Instruct-4B, Instruct-12B, Instruct-27B
- MedGemma: Instruct-4B, Instruct-27B
- TranslateGemma: 4B, 12B, 27B
Note: Only download tokenizer.model and DO NOT download tokenizer.json when converting. Use --set do-pan-and-scan 1 to enable Pan and Scan. Use --name TranslateGemma when converting TranslateGemma models to activate translation support: specify language codes in prompts like /en->zh Hello world.. Source language code can be set to auto when translating texts. Default language codes can be configured by --set ID code, such as --set source-language-code zh and --set target-language-code en.
GLM (Glm4vForConditionalGeneration)
- v4: 4.6V-Flash
Support additional options (Use --set X Y to change values) like Kimi.
Janus (MultiModalityCausalLM)
- Pro: 1B, 7B
Note: Use --set parallel-size N to generate N images in a single run (default: 2); --set gen-head-temperature T to set temperature of gen-head (default: 1.0). Add prefix "/gen " to prompts to generate images.
Kimi (KimiVLForConditionalGeneration)
- VL: A3B-Instruct, A3B-Thinking, A3B-Thinking-2506
Additional options (Use --set X Y to change values):
- video_max_frames: default 20.
- native_resolution: use native resolution or not, default: false (This seems sensitive to quantization, so defaults to false).
- fps: Default 1.0.
Mistral (Mistral3ForConditionalGeneration)
- Ministral-3: 3B-Instruct-2512, 3B-Reasoning-2512, 8B-Instruct-2512, 8B-Reasoning-2512
- Devstral-Small-2: 24B-Instruct-2512
Qwen (Qwen2AudioForConditionalGeneration, Qwen2VLForConditionalGeneration, Qwen2_5_VLForConditionalGeneration, Qwen3VLForConditionalGeneration, Qwen3VLMoeForConditionalGeneration)
- Qwen2-Audio: 7B-Instruct
- Qwen2-VL: 2B-Instruct, 7B-Instruct
- Qwen2.5-VL: 3B-Instruct, 7B-Instruct
- MiMo-VL: 7B-RL, 7B-RL-2508
- Dolphin: v2
- Qwen3-VL: 2B-Instruct, 4B-Instruct, A3B-Instruct, etc
SmolVLM2 (SmolVLMForConditionalGeneration)
- 2.2B-Instruct
Note: Use --set do-split 1 to enable Split.
Step-VL (StepVLForConditionalGeneration)
- v3: 10B
Additional options (Use --set X Y to change values):
- do-pan-and-scan: default 1 (i.e. true). Set to 0 to use only a global view to reduce the compute.
- native-resolution: default 0 (i.e. false). This model can support native resolution mathematically without pan and scan. (for experiment only)
Youtu-VL (YoutuVLForConditionalGeneration)
- 4B-Instruct

OCR Models

dots.ocr (DotsOCRForCausalLM)
- 3B
Note: Prompt for OCR: {{image:...}}Extract the text content from this image. Here are other prompts for OCR. Use +single-turn to discard history automatically.
Nanonets-OCR2 (Qwen2VLForConditionalGeneration, Qwen2_5_VLForConditionalGeneration)
- OCR2: 3B, 1.5B
GLM-OCR (GlmOcrForConditionalGeneration)
- 0.7B

ASR Models

GLM-ASR (GlmAsrForConditionalGeneration)
- Nano-2512
Qwen3-ASR (Qwen3ASRForConditionalGeneration)
- 0.6B, 1.7B
Additional options (Use --set X Y to change values):
- language: default "auto".
- ForcedAligner-0.6B
Additional options (Use --set X Y to change values):
- language: default "Chinese". This affects how sentences are cutted into words. Each character is a "word" for Chinese. For other languages, words are separated by spaces.
- delimiter: default "". Time stamps are reported for "sentences": sentences are separated by this delimiter. For Chinese, when delimiter is empty, each character is treated as a sentence.
- format: default "srt". Format of output. "srt" or "json" are supported.

RAG Models

Text Embedding

Note: Only dense embedding is implemented.

Roberta (XLMRobertaModel)
- BCE-Embedding
- BGE-M3 (-a BGE-M3)
MiniCPM (MiniCPMModel)
- MiniCPM-Embedding-Light
Qwen-3 Embedding (Qwen3ForCausalLM)
- 0.6B, 4B, 8B (-a Qwen3-Embedding)
  
  Note: use --set task ... to specify task/instruction.

Text Ranking

Roberta (XLMRobertaForSequenceClassification)
- BCE-ReRanker
- BGE-ReRanker-M3 (-a BGE-Reranker-M3)
MiniCPM (MiniCPMModel)
- MiniCPM-Reranker-Light
Qwen-3 Reranker (Qwen3ForCausalLM)
- 0.6B, 4B, 8B (-a Qwen3-Reranker)
  
  Note: use --set task ... to specify task/instruction.

Multi-modal Embedding

Qwen3-VL Embedding (Qwen3VLForConditionalGeneration)
- : 2B, 8B (-a Qwen3-VL-Embedding)
  
  Note: use --set task ... to specify task/instruction.

Multi-modal Ranking

Qwen3-VL Reranker (Qwen3VLForConditionalGeneration)
- : 2B, 8B (-a Qwen3-VL-Reranker)
  
  Note: use --set task ... to specify task/instruction.

LoRA Models

These LoRA models have been tested:

Llama-3-Chinese-8B-Instruct

Special Models

Tips for diffusion LLMs, they are very sensitive to sampling parameters. The results may be completely unacceptable with improper parameters.

LLaDA (LLaDA2MoeModelLM)
- mini-preview, mini
  
  Supported options (--set OPTION VALUE):
  - block_length: default 32
  - steps: default 32
  - minimal_topk: default 1
  - threshold: default 0.95
WeDLM (WeDLMForCausalLM)
- 8B-Instruct
  
  Supported options (--set OPTION VALUE):
  - block_size: default 16
    
    When set to <= 1, it falls back to auto regressive decoding.
  - accept_algo: default 2
    - 0: entropy algo: https://github.com/Tencent/WeDLM/blob/d4481cab821044b8ebd5f78bc37f23787a6275ed/wedlm/engine/sampler.py#L169
    - 1: prob algo: https://huggingface.co/tencent/WeDLM-8B-Instruct/blob/main/modeling_wedlm.py#L694
    - 2: custom algo: sampling + prob
  - threshold: default 0.7
    
    For algo 0, tokens are accepted if entropy is less than threshold; for others, tokens are accepted when probability (or condidence level) is larger than this.
  - pos_penalty_factor: default 0.02 (used by entropy algo)
Meta-AI multi-token prediction models checkpoints

Download at least one multi-token prediction checkpoint (such as 7B_1T_4). Assume it is stored at /path/to/llama-multi-predict/7B_1T_4. Make sure tokenizer.model is downloaded to /path/to/llama-multi-predict.

To convert it with -a llama-multi-token-prediction-ckpt:
```
python convert.py -i /path/to/llama-multi-predict/7B_1T_4 -o llama-multi.bin -a llama-multi-token-prediction-ckpt
```
This is a base model, and remember to use --format completion.

Tip: Use --kv n_future_tokens N to change number of future tokens, N = [1, 4].

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Supported Models

Chat/Instruct Models

Text Base Models

TTS Models

Multimodal Models

OCR Models

ASR Models

RAG Models

Text Embedding

Text Ranking

Multi-modal Embedding

Multi-modal Ranking

LoRA Models

Special Models

FilesExpand file tree

models.md

Latest commit

History

models.md

File metadata and controls

Supported Models

Chat/Instruct Models

Text Base Models

TTS Models

Multimodal Models

OCR Models

ASR Models

RAG Models

Text Embedding

Text Ranking

Multi-modal Embedding

Multi-modal Ranking

LoRA Models

Special Models