
Supported Models

Chat/Instruct Models

Text Base Models

Please use --format completion for these models.

TTS Models

  • Maya1

  • Orpheus TTS

    • 3B: EN, ZH, etc

      SNAC-24kHz is used as the codec. Use these additional command-line options when converting: --name Orpheus-TTS -a Orpheus-TTS --snac_model /path/to/snac_24kHz

      Use --set voice XX to select voice XX, such as tara. More info.

  • OuteTTS:

    • 1.0: 1B, 0.6B

      DAC.speech.v1.0 1.5kbps is used as the codec. Use these additional command-line options when converting: --name OuteTTS -a OuteTTS --dac_model /path/to/dac

      Use --set speaker /path/to/speaker.json to select a speaker profile. More info.

  • Qwen3-TTS (Qwen3TTSForConditionalGeneration):

    Note: voice_clone_mode only supports "xvec" for now.

    Additional options (Use --set X Y to change values):

    • language: default auto.
    • speaker: default vivian.
    • instruct: default "".
    • voice_clone_mode: "xvec" or "icl". default "xvec".
    • ref_audio_file: default "".
    • ref_text: default "". Required for "icl" mode.
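
    The options above are set at run time with --set. A minimal sketch for Qwen3-TTS voice cloning in "icl" mode (the ./main runner name and all paths are placeholders, not documented parts of this CLI):

    ```shell
    # "icl" mode needs both the reference audio and its transcript;
    # "xvec" mode (the default) needs only ref_audio_file.
    ./main -m qwen3-tts.bin \
        --set voice_clone_mode icl \
        --set ref_audio_file /path/to/reference.wav \
        --set ref_text "Transcript of the reference audio." \
        --set language auto
    ```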

Multimodal Models

  • Fuyu (FuyuForCausalLM)

    • Base: 8B
  • Gemma (Gemma3ForConditionalGeneration)

    Note: Only download tokenizer.model and DO NOT download tokenizer.json when converting. Use --set do-pan-and-scan 1 to enable Pan and Scan. Use --name TranslateGemma when converting TranslateGemma models to activate translation support: specify language codes in prompts like /en->zh Hello world. The source language code can be set to auto when translating text. Default language codes can be configured with --set ID code, e.g. --set source-language-code zh and --set target-language-code en.

  • GLM (Glm4vForConditionalGeneration)

    Supports the same additional options as Kimi (use --set X Y to change values).

  • Janus (MultiModalityCausalLM)

    Note: Use --set parallel-size N to generate N images in a single run (default: 2); --set gen-head-temperature T to set temperature of gen-head (default: 1.0). Add prefix "/gen " to prompts to generate images.

  • Kimi (KimiVLForConditionalGeneration)

    Additional options (Use --set X Y to change values):

    • video_max_frames: default 20.
    • native_resolution: whether to use native resolution; default: false (this seems sensitive to quantization, so it defaults to false).
    • fps: Default 1.0.
  • Mistral (Mistral3ForConditionalGeneration)

  • Qwen (Qwen2AudioForConditionalGeneration, Qwen2VLForConditionalGeneration, Qwen2_5_VLForConditionalGeneration, Qwen3VLForConditionalGeneration, Qwen3VLMoeForConditionalGeneration)

  • SmolVLM2 (SmolVLMForConditionalGeneration)

    Note: Use --set do-split 1 to enable Split.

  • Step-VL (StepVLForConditionalGeneration)

    Additional options (Use --set X Y to change values):

    • do-pan-and-scan: default 1 (i.e. true). Set to 0 to use only a global view to reduce the compute.
    • native-resolution: default 0 (i.e. false). This model can mathematically support native resolution without pan and scan (experimental).
  • Youtu-VL (YoutuVLForConditionalGeneration)
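
A sketch of the Gemma/TranslateGemma notes above, combining conversion and run-time defaults (the ./main runner name, paths, and output filename are assumptions):

```shell
# Convert a TranslateGemma checkpoint with translation support enabled.
# Remember: only tokenizer.model should be present, not tokenizer.json.
python convert.py -i /path/to/TranslateGemma -o gemma.bin --name TranslateGemma

# Set default language codes; prompts can still override them inline,
# e.g. "/en->zh Hello world".
./main -m gemma.bin --set source-language-code auto --set target-language-code en
```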

OCR Models

  • dots.ocr (DotsOCRForCausalLM)

    Note: Prompt for OCR: {{image:...}}Extract the text content from this image. Here are other prompts for OCR. Use +single-turn to discard history automatically.

  • Nanonets-OCR2 (Qwen2VLForConditionalGeneration, Qwen2_5_VLForConditionalGeneration)

  • GLM-OCR (GlmOcrForConditionalGeneration)
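
A sketch of an OCR session with dots.ocr as described above (the ./main runner name, model filename, and image path are placeholders):

```shell
# +single-turn discards history automatically, so each page is OCRed
# independently without earlier turns leaking into the context.
./main -m dots-ocr.bin +single-turn
# Then, at the prompt:
#   {{image:/path/to/page.png}}Extract the text content from this image.
```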

ASR Models

  • GLM-ASR (GlmAsrForConditionalGeneration)

  • Qwen3-ASR (Qwen3ASRForConditionalGeneration)

    Additional options (Use --set X Y to change values):

    • language: default "Chinese". This affects how sentences are cut into words: for Chinese, each character is a "word"; for other languages, words are separated by spaces.
    • delimiter: default "". Timestamps are reported per "sentence"; sentences are separated by this delimiter. For Chinese, when the delimiter is empty, each character is treated as a sentence.
    • format: default "srt". Output format; "srt" and "json" are supported.
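
For example, a Qwen3-ASR run transcribing English with SRT output might look like this (the ./main runner name and model filename are assumptions):

```shell
# For English, words are split on spaces; use "." as the sentence
# delimiter so timestamps are reported per sentence.
./main -m qwen3-asr.bin \
    --set language English \
    --set delimiter "." \
    --set format srt
```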

RAG Models

Text Embedding

Note: Only dense embedding is implemented.

Text Ranking

Multi-modal Embedding

  • Qwen3-VL Embedding (Qwen3VLForConditionalGeneration)
    • 2B, 8B (-a Qwen3-VL-Embedding)

      Note: use --set task ... to specify task/instruction.

Multi-modal Ranking

  • Qwen3-VL Reranker (Qwen3VLForConditionalGeneration)
    • 2B, 8B (-a Qwen3-VL-Reranker)

      Note: use --set task ... to specify task/instruction.
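
A sketch of converting and running the reranker with a task instruction (paths, the ./main runner name, and the example task string are assumptions):

```shell
# -a selects the reranker architecture override at conversion time.
python convert.py -i /path/to/Qwen3-VL-Reranker-2B -o reranker.bin -a Qwen3-VL-Reranker

# The task/instruction is set at run time.
./main -m reranker.bin --set task "Given a query, rank candidate documents by relevance"
```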

LoRA Models

These LoRA models have been tested:

Special Models

Tip for diffusion LLMs: they are very sensitive to sampling parameters, and results may be completely unacceptable with improper settings.

  • LLaDA (LLaDA2MoeModelLM)

    • mini-preview, mini

      Supported options (--set OPTION VALUE):

      • block_length: default 32
      • steps: default 32
      • minimal_topk: default 1
      • threshold: default 0.95
  • WeDLM (WeDLMForCausalLM)

  • Meta-AI multi-token prediction model checkpoints

    Download at least one multi-token prediction checkpoint (such as 7B_1T_4). Assume it is stored at /path/to/llama-multi-predict/7B_1T_4. Make sure tokenizer.model is downloaded to /path/to/llama-multi-predict.

    To convert it with -a llama-multi-token-prediction-ckpt:

    python convert.py -i /path/to/llama-multi-predict/7B_1T_4 -o llama-multi.bin -a llama-multi-token-prediction-ckpt

    This is a base model, and remember to use --format completion.

    Tip: Use --kv n_future_tokens N to change the number of future tokens, with N in [1, 4].
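
    Putting the steps above together (the ./main runner name is an assumption; the convert.py invocation is taken from above):

    ```shell
    # Convert the 7B_1T_4 checkpoint with the multi-token-prediction arch.
    python convert.py -i /path/to/llama-multi-predict/7B_1T_4 \
        -o llama-multi.bin -a llama-multi-token-prediction-ckpt

    # This is a base model, so run it in completion mode; predict 2
    # future tokens per step (N must be in [1, 4]).
    ./main -m llama-multi.bin --format completion --kv n_future_tokens 2
    ```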