A CLI toolkit for downloading, converting, quantizing, and uploading Hugging Face models. Supports full end-to-end pipelines from HF Hub to S3-compatible storage.
Note: The `convert`, `quantize`, and `pipeline` commands require llama.cpp. Run `mdl bootstrap-llamacpp` first to fetch and build it automatically (requires `git`, `cmake`, `make`).
- Batch downloads from Hugging Face Hub with resume and state tracking
- GGUF conversion via llama.cpp (`convert_hf_to_gguf.py`)
- Quantization to Q4_K_M, Q5_K_M, Q8_0, etc. via `llama-quantize`
- Ollama Modelfile generation — auto-detects chat format from the model's actual config files (supports ChatML, LLaMA-3, Gemma, Phi, Mistral, DeepSeek, and more)
- S3 upload to MinIO or any S3-compatible endpoint
- Pipeline mode — download → convert → quantize → upload in one command
- Bootstrap — fetch and build llama.cpp automatically
- Per-model error handling, dry-run mode, disk-space checks, and YAML configuration
- Python 3.11+
- uv (recommended) or pip
- git, cmake, make (for `bootstrap-llamacpp`)
```bash
git clone https://github.com/fuzzylabs/mdl.git
cd mdl
uv sync    # or: pip install -e .
```

To enable faster Hugging Face downloads:

```bash
uv add hf_transfer    # or: pip install hf_transfer
```

Copy the example files and fill in your values:

```bash
cp .env.example .env
cp models.yaml.example models.yaml
cp pipeline.yaml.example pipeline.yaml
```

See `.env.example` for all available environment variables.
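A minimal `.env` might look like the following sketch. All values here are placeholders; the variable names are the ones documented in the environment-variable tables further down:

```env
# Hugging Face (placeholder token — use your own)
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
HF_HUB_ENABLE_HF_TRANSFER=1

# MinIO / S3 (placeholder endpoint and credentials)
MINIO_ENDPOINT=minio.example.com:9000
MINIO_ACCESS_KEY=your-access-key
MINIO_SECRET_KEY=your-secret-key
MINIO_BUCKET=models
```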
All commands are accessed through the `mdl` entry point:

```bash
mdl [COMMAND] [OPTIONS]
```
Batch-download models from Hugging Face Hub.
```bash
mdl download --config models.yaml            # download all models in config
mdl download -r google/gemma-3-1b-it         # download a single model
mdl download -r org/model-a -r org/model-b   # download multiple by repo ID
mdl download --config models.yaml --dry-run  # preview without downloading
mdl download --clear-state                   # reset download tracking
```

| Option | Description |
|---|---|
| `-r, --repo-id` | Repo ID to download (repeatable) |
| `-c, --config PATH` | YAML config file (models.yaml format) |
| `-n, --dry-run` | Preview what would be downloaded |
| `--clear-state` | Clear download state and exit |
| `--min-disk-space INT` | Minimum free disk space in GB (default: 10) |
| `--delete-after` | Delete model from HF cache after download |
| `-v, --verbose` | Enable debug logging |
```yaml
google:
  - gemma-3-1b-it
  - gemma-3-4b-it
meta-llama:
  - Llama-3.2-1B
  - Llama-3.2-1B-Instruct
```

Each entry becomes the repo ID `org/model` when downloaded.
Convert a downloaded Hugging Face model to F16 GGUF format.
```bash
mdl convert -m /path/to/model-dir                          # output defaults to models/<org>/<model>/<model>.f16.gguf
mdl convert -m /path/to/model-dir -o custom/path/out.gguf  # explicit output path
```

| Option | Description |
|---|---|
| `-m, --model-dir PATH` | Path to the downloaded HF model directory (required) |
| `-o, --output PATH` | Output F16 GGUF path (default: `models/<org>/<model>/<model>.f16.gguf`) |
| `--llama-cpp-dir PATH` | Override `LLAMA_CPP_DIR` env var |
| `-v, --verbose` | Enable debug logging |
Quantize an F16 GGUF file to a smaller representation.
```bash
mdl quantize -i model.f16.gguf                                # output defaults to model.Q4_K_M.gguf alongside input
mdl quantize -i model.f16.gguf -o out.gguf -t Q5_K_M          # explicit output and type
mdl quantize -i model.f16.gguf --model-dir /path/to/hf-model  # also generates Modelfile + config files + README
```

| Option | Description |
|---|---|
| `-i, --input PATH` | Input F16 GGUF file (required) |
| `-o, --output PATH` | Output quantized GGUF path (default: `<input_dir>/<model>.<type>.gguf`) |
| `-t, --type TEXT` | Quantization type — e.g. Q4_K_M, Q5_K_M, Q8_0 (default: Q4_K_M) |
| `--llama-cpp-dir PATH` | Override `LLAMA_CPP_DIR` env var |
| `--model-dir PATH` | HF model directory — generates Ollama Modelfile, copies config files, and creates a MODELFILE_README.md next to the output |
| `-v, --verbose` | Enable debug logging |
Upload a file to MinIO / S3-compatible storage.
```bash
mdl upload -f model.Q4_K_M.gguf
mdl upload -f model.Q4_K_M.gguf -p models/gemma -b my-bucket
```

| Option | Description |
|---|---|
| `-f, --file PATH` | Local file to upload (required) |
| `-k, --s3-key TEXT` | Explicit S3 object key (defaults to filename) |
| `-p, --s3-prefix TEXT` | Prefix (directory) in the bucket |
| `-b, --bucket TEXT` | Override `MINIO_BUCKET` env var |
| `-v, --verbose` | Enable debug logging |
Run the full pipeline: download → convert → quantize → upload.
```bash
mdl pipeline                             # uses pipeline.yaml by default
mdl pipeline -c pipeline.yaml --dry-run  # preview without executing
mdl pipeline --no-upload                 # skip the S3 upload step
mdl pipeline --force                     # reprocess completed models
mdl pipeline --keep-quantized            # keep GGUF files locally after upload
mdl pipeline --clear-state               # reset pipeline state and exit
```

| Option | Description |
|---|---|
| `-c, --config PATH` | Pipeline config file (default: pipeline.yaml) |
| `-n, --dry-run` | Preview actions without executing |
| `--clear-state` | Clear pipeline state and exit |
| `--force` | Reprocess already-completed models |
| `--no-upload` | Skip S3 upload step |
| `--keep-download` | Keep downloaded model files after processing |
| `--keep-quantized` | Keep quantized GGUF files after upload |
| `--min-disk-space INT` | Minimum free disk space in GB (default: 10) |
| `-v, --verbose` | Enable debug logging |

Note: `--no-upload` only skips the S3 upload — it does not keep files on disk. The pipeline works in a temporary directory that is deleted after each model. To retain the quantized GGUF and related files locally, pass `--keep-quantized` (e.g. `mdl pipeline --no-upload --keep-quantized`).
```yaml
models:
  - repo_id: google/gemma-3-1b-it      # required
    # quantize: true                   # default: true
    # upload: true                     # default: true
    # quantization: Q4_K_M             # default: Q4_K_M
    # output_name: custom-name.gguf    # default: model.QTYPE.gguf
    # revision: main                   # pin a git revision / branch / tag
  - repo_id: meta-llama/Llama-3.2-1B
  - repo_id: microsoft/Phi-4-mini-instruct
```

All output files are organised under `models/<org>/<model_name>/`:
- Local (with `--keep-quantized`): `models/google/gemma-3-1b-it/gemma-3-1b-it.Q4_K_M.gguf`
- S3: `s3://<bucket>/models/google/gemma-3-1b-it/gemma-3-1b-it.Q4_K_M.gguf`
The pipeline automatically generates an Ollama Modelfile alongside each quantized GGUF. It reads the model's actual config files — `config.json`, `tokenizer_config.json`, and `generation_config.json` — to derive everything dynamically:
- `FROM` — path to the GGUF file
- `TEMPLATE` — Ollama Go template, detected from the model's `eos_token` (not just `model_type`). This correctly handles fine-tunes that change the chat format (e.g. Dolphin-Mistral uses ChatML despite being `model_type: mistral`)
- `SYSTEM` — default system prompt, extracted from the Jinja2 `chat_template` via regex
- `PARAMETER` — `num_ctx`, `temperature`, `top_p`, `top_k`, `repeat_penalty`, and stop tokens — all from the model's own configs
The raw Jinja2 `chat_template` is also included as comments at the bottom of the Modelfile for cross-reference when editing.
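As an illustration only (not verbatim tool output), a generated Modelfile for a Gemma-style model might look roughly like this — the paths, prompt, and parameter values are placeholders:

```
FROM ./gemma-3-1b-it.Q4_K_M.gguf

TEMPLATE """<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
"""

SYSTEM """You are a helpful assistant."""

PARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER stop <end_of_turn>
```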
Supported model families (via EOS token and model_type detection):
| Format | Models |
|---|---|
| ChatML (`<|im_end|>`) | Qwen, Qwen2, Qwen3, Dolphin, Yi, InternLM2, DeepSeek-V2/V3/R1, Jamba |
| LLaMA-3 (`<|eot_id|>`) | LLaMA-3, LLaMA-3.1, LLaMA-3.2, LLaMA-3.3 |
| Gemma (`<end_of_turn>`) | Gemma, Gemma 2, Gemma 3 |
| Phi (`<|end|>`) | Phi-3, Phi-3.5, Phi-4 |
| Mistral (`</s>`) | Mistral-7B, Mixtral |
| Command-R | Cohere Command-R, Command-R+ |
| Completion | StarCoder2, Falcon |
Unknown models fall back to a generic template with a warning.
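The core of this detection can be pictured as a lookup from EOS token to chat format, with a generic fallback. The sketch below is a simplification, not the actual `modelfile.py` implementation (which also consults `model_type` and the Jinja2 `chat_template`):

```python
# Simplified sketch of EOS-token-based chat-format detection.
# The mapping mirrors the table above; it is illustrative only.
EOS_TO_FORMAT = {
    "<|im_end|>": "chatml",
    "<|eot_id|>": "llama3",
    "<end_of_turn>": "gemma",
    "<|end|>": "phi",
    "</s>": "mistral",
}

def detect_chat_format(tokenizer_config: dict) -> str:
    """Map a model's eos_token to a known chat format, else fall back."""
    eos = tokenizer_config.get("eos_token")
    # eos_token may be a plain string or an AddedToken-style dict
    if isinstance(eos, dict):
        eos = eos.get("content")
    return EOS_TO_FORMAT.get(eos, "generic")  # unknown -> generic template

print(detect_chat_format({"eos_token": "<|im_end|>"}))                  # chatml
print(detect_chat_format({"eos_token": {"content": "<end_of_turn>"}}))  # gemma
print(detect_chat_format({}))                                           # generic
```

This is why a fine-tune like Dolphin-Mistral lands on ChatML: its `tokenizer_config.json` carries `<|im_end|>` as the EOS token even though its `model_type` is still `mistral`.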
Output files (alongside the GGUF):
| File | Purpose |
|---|---|
| `Modelfile` | Ollama-ready model definition |
| `MODELFILE_README.md` | Guide explaining each Modelfile section |
| `config.json` | Model architecture reference (copied from HF) |
| `tokenizer_config.json` | Chat template & tokens reference (copied from HF) |
| `generation_config.json` | Generation params reference (copied from HF) |
| `special_tokens_map.json` | Special tokens reference (copied from HF) |
These files are:
- Uploaded to S3 at `models/<org>/<model>/`
- Saved locally when using `--keep-quantized`
- Logged at DEBUG level when using `--verbose`
To use the generated Modelfile with Ollama:
```bash
cd models/google/gemma-3-1b-it/
ollama create gemma3-1b -f Modelfile
ollama run gemma3-1b
```

The pipeline also writes a URL registry to `model_urls.json` locally and mirrors it to `s3://<bucket>/metadata/model_urls.json`. Each entry includes a `download_url` and a curl download reference.
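An entry in the registry might look roughly like the sketch below. This is illustrative only: the exact schema and field names other than `download_url` are assumptions, and the endpoint is a placeholder:

```json
{
  "google/gemma-3-1b-it": {
    "download_url": "https://minio.example.com/models/google/gemma-3-1b-it/gemma-3-1b-it.Q4_K_M.gguf",
    "curl": "curl -O https://minio.example.com/models/google/gemma-3-1b-it/gemma-3-1b-it.Q4_K_M.gguf"
  }
}
```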
Clone, build, and extract the required llama.cpp binaries. Requires git, cmake, and make.
```bash
mdl bootstrap-llamacpp
```

This fetches llama.cpp from GitHub, builds it with cmake + make, and copies the required binaries and headers to a `llama.cpp-dist/` directory. If `llama.cpp/` already exists, the clone step is skipped.
This is a prerequisite for `mdl convert`, `mdl quantize`, and `mdl pipeline`.
Print the installed version.
```bash
mdl --version
```

All variables are set in `.env` (see `.env.example`).
| Variable | Description | Default |
|---|---|---|
| `HF_TOKEN` | Auth token for private/gated models | — |
| `HF_ENDPOINT` | Custom HF endpoint (mirror/proxy) | https://huggingface.co |
| `HF_HOME` | HF cache directory | `~/.cache/huggingface/` |
| `HF_HUB_DOWNLOAD_TIMEOUT` | Download timeout in seconds | 120 |
| `HF_HUB_ETAG_TIMEOUT` | ETag timeout in seconds | 10 |
| `HF_HUB_ENABLE_HF_TRANSFER` | Enable fast transfers (requires `hf_transfer`) | 0 |
| Variable | Description | Default |
|---|---|---|
| `MINIO_ENDPOINT` | S3 endpoint (host:port) | — |
| `MINIO_ACCESS_KEY` | Access key | — |
| `MINIO_SECRET_KEY` | Secret key | — |
| `MINIO_BUCKET` | Target bucket | models |
| `MINIO_SECURE` | Use HTTPS | true |
| `MINIO_PUBLIC_URL` | Public base URL for downloads | — |
| `MINIO_PRESIGN_DAYS` | Presigned URL expiry in days | 7 |
| Variable | Description | Default |
|---|---|---|
| `LLAMA_CPP_DIR` | Path to llama.cpp directory | `llama.cpp` |
- Load environment — reads `.env` before any HF imports
- Parse config — loads YAML and builds the model list
- Validate — checks credentials, disk space, and config structure
- Process models — download, convert, quantize, and upload each model
- Track state — persists progress to `.download_state.json` / `.pipeline_state.json`
- Handle errors — logs failures per model and continues with the rest
- Summarise — prints totals for successful, failed, and skipped models
Resume is automatic. Completed models are skipped on re-run. Use --clear-state to start fresh or --force to reprocess.
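The skip-on-re-run behaviour can be pictured as a check against the state file before each model is processed. This is a simplified sketch assuming a hypothetical state layout, not the actual `.pipeline_state.json` schema:

```python
import json
import tempfile
from pathlib import Path

def should_skip(repo_id: str, state_path: Path, force: bool = False) -> bool:
    """Skip models already marked completed, unless --force is given."""
    if force or not state_path.exists():
        return False
    state = json.loads(state_path.read_text())
    # hypothetical layout: {"completed": ["org/model", ...]}
    return repo_id in state.get("completed", [])

# demo with a throwaway state file
state = Path(tempfile.gettempdir()) / ".pipeline_state.json"
state.write_text(json.dumps({"completed": ["google/gemma-3-1b-it"]}))

print(should_skip("google/gemma-3-1b-it", state))              # True  (already done)
print(should_skip("meta-llama/Llama-3.2-1B", state))           # False (not yet processed)
print(should_skip("google/gemma-3-1b-it", state, force=True))  # False (--force reprocesses)
```

`--clear-state` corresponds to deleting the state file entirely, so every model is treated as unprocessed on the next run.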
| Problem | Solution |
|---|---|
| `RepositoryNotFoundError` / `GatedRepoError` | Set `HF_TOKEN` in `.env`. For gated models, accept terms on the model page first. |
| Downloads timing out | Increase `HF_HUB_DOWNLOAD_TIMEOUT` in `.env` |
| Disk space errors | Set `HF_HOME` to a larger drive, or use `--min-disk-space` |
| Slow downloads | Install `hf_transfer` and set `HF_HUB_ENABLE_HF_TRANSFER=1` |
| Re-downloading completed models | Don't delete `.download_state.json`. Use `--clear-state` only intentionally. |
```bash
uv sync --all-extras      # install dev dependencies
uv run pytest             # run tests
uv run pytest --cov=mdl   # run tests with coverage
```

```
src/mdl/
├── __init__.py           # package version
├── cli/
│   ├── __init__.py       # Click group & subcommand registration
│   ├── bootstrap.py      # mdl bootstrap-llamacpp
│   ├── convert.py        # mdl convert
│   ├── download.py       # mdl download
│   ├── pipeline.py       # mdl pipeline
│   ├── quantize.py       # mdl quantize
│   └── upload.py         # mdl upload
└── core/
    ├── config.py         # env loading & logging setup
    ├── downloader.py     # HF Hub download logic & state
    ├── modelfile.py      # Ollama Modelfile generator
    ├── quantizer.py      # llama.cpp convert & quantize
    ├── uploader.py       # MinIO / S3 upload client
    └── url_manager.py    # model URL registry
```
See LICENSE.
Contributions welcome. Please follow the existing code style, add tests for new features, and verify with `--dry-run` before submitting.