A streamlined solution for running Large Language Models (LLMs) in batch mode on HPC systems powered by Slurm. LLMFlux uses the OpenAI-compatible API format with a JSONL-first architecture, enabling your prompts to flow efficiently through LLM engines at scale.
```
JSONL Input              Batch Processing              Results
(OpenAI Format)       (Ollama/vLLM + Model)         (JSON Output)
     │                          │                        │
     ▼                          ▼                        ▼
┌──────────┐             ┌──────────────┐           ┌──────────┐
│  Batch   │             │              │           │  Output  │
│ Requests │────────────▶│   Model on   │──────────▶│ Results  │
│ (JSONL)  │             │    GPU(s)    │           │ (JSON)   │
└──────────┘             └──────────────┘           └──────────┘
```
LLMFlux processes JSONL files in a standardized OpenAI-compatible batch API format, enabling efficient processing of thousands of prompts on HPC systems with minimal overhead.
- Configuration Guide - How to configure LLMFlux
- Models Guide - Supported models and requirements
- Repository Structure - Codebase organization
- Testing Guide - How to run tests
```bash
pip install llmflux
```

Or for development:

- Create and activate a Conda environment:

  ```bash
  conda create -n llmflux python=3.11 -y
  conda activate llmflux
  ```

- Install the package:

  ```bash
  pip install -e .
  ```

- Set up the environment:

  ```bash
  cp .env.example .env
  # Edit .env with your SLURM account and model details
  ```
Confirm the installation by running a basic command and checking that it prints the expected output:

```console
$ llmflux -h
usage: llmflux [-h] [--version] {run,benchmark,show-models,jobs,status,logs,cancel} ...

LLMFlux CLI

positional arguments:
  {run,benchmark,show-models,jobs,status,logs,cancel}
    run                 Submit a batch processing job
    benchmark           Run a benchmark job
    show-models         List all available model keys from models.yaml
    jobs                List LLMFlux tracked Slurm jobs
    status              Show detailed status for a job
    logs                Show last lines of stdout and stderr for a tracked job
    cancel              Cancel a tracked running/pending job

options:
  -h, --help            show this help message and exit
  --version, -V         Show llmflux version and exit
```

The primary workflow for LLMFlux is submitting JSONL files for batch processing on SLURM:
```python
from llmflux.slurm import SlurmRunner
from llmflux.core.config import Config

# Set up the SLURM configuration
config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"

# Initialize the runner
runner = SlurmRunner(config=slurm_config)

# Submit a JSONL file directly for processing
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="llama3.2:3b",
    batch_size=4,
)

print(f"Job submitted with ID: {job_id}")
```

The JSONL input format follows the OpenAI Batch API specification:
{"custom_id":"request1","method":"POST","url":"/v1/chat/completions","body":{"messages":[{"role":"system","content":"You are a helpful assistant"},{"role":"user","content":"Explain quantum computing"}],"temperature":0.7,"max_tokens":500}}
{"custom_id":"request2","method":"POST","url":"/v1/chat/completions","body":{"messages":[{"role":"system","content":"You are a helpful assistant"},{"role":"user","content":"What is machine learning?"}],"temperature":0.7,"max_tokens":500}}For advanced options like custom batch sizes, processing settings, or SLURM configuration, see the Configuration Guide.
For advanced model configuration, see the Models Guide.
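As a sketch of how a batch file in this format can be generated programmatically (the field names follow the OpenAI batch format shown above; the prompts and output file name are illustrative):

```python
import json

def build_request(custom_id, user_prompt,
                  system_prompt="You are a helpful assistant",
                  temperature=0.7, max_tokens=500):
    """Build one OpenAI-style batch request."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
    }

prompts = ["Explain quantum computing", "What is machine learning?"]
with open("prompts.jsonl", "w") as f:
    for i, prompt in enumerate(prompts, start=1):
        # One JSON object per line -- the JSONL contract LLMFlux expects
        f.write(json.dumps(build_request(f"request{i}", prompt)) + "\n")
```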
LLMFlux includes a command-line interface for submitting batch processing jobs. It uses vLLM as its default engine, and model configurations rely on the HuggingFace naming scheme. To process your prompts.jsonl file with the Llama 3.2 3B Instruct model, you would run:
```bash
# Process a JSONL file directly (core functionality)
llmflux run --model Llama-3.2-3B-Instruct --input data/prompts.jsonl --output results/output.json
```

In addition to the default vLLM engine, LLMFlux can also be run with Ollama. You can then call models using the names established in the models.yaml file in the templates directory:
```bash
llmflux run \
    --account your-account \
    --partition gpu \
    --model Llama-3.2-3B-Instruct \
    --input data/prompts.jsonl \
    --output results/output.json
```

```bash
# Process a JSONL file using the vLLM backend
llmflux run --model MistralLite --input data/prompts.jsonl --output results/output.json
```

This runs the same as above, using vLLM as the backend interface. If you wanted to run MistralLite, for example, checking the file mistral-lite/7b.yaml reveals the name `"mistrallite:7b"`. Update to the appropriate HuggingFace key and run:
```bash
# Process a JSONL file using the vLLM backend
llmflux run --model MistralLite --input data/prompts.jsonl --output results/output.json
```

This runs the model, as noted in the config, by searching HuggingFace for `hf_name: "amazon/MistralLite"`. You will need to check an existing model file in the src/llmflux/templates folder to find a configuration that matches what you want, and pass its name as the value of the `--model` argument.
Note that some HuggingFace models require an access token from HF. Once you have a token, update your local copy of the .env file and add or change this line:
```
HUGGINGFACE_TOKEN=hf_XXXXXXXXXXXXXXX
```

To use the token, replace the `hf_XXXX` piece with your own token. For some gated repos, you will have to visit the HuggingFace repository directly and activate access (often by accepting a terms-and-conditions agreement). You may also need to adjust the settings on your HF token to ensure that LLMFlux has the rights to access the model. In addition, models are by default stored in your base directory: ~/.cache/huggingface/hub. To change this, add the following parameter to your .env file:

```
HF_HOME=/path/to/dir
```

LLMFlux will automatically download the appropriate models for both Ollama and vLLM.
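To sanity-check that these values are visible to the environment that launches LLMFlux, you can inspect them directly (a hypothetical snippet, assuming the .env values are exported as environment variables):

```python
import os

# Both variables are optional; unset values fall back to defaults
token = os.environ.get("HUGGINGFACE_TOKEN")
hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))

print("token set:", token is not None)
print("model cache:", hf_home)
```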
For detailed command options:
```bash
llmflux --help
```

LLMFlux tracks submitted jobs in a local registry (~/.llmflux/jobs.json) and combines that metadata with Slurm state.
```bash
# List tracked jobs (default: active states only)
llmflux jobs

# Include historical states
llmflux jobs --all

# Filter by one or more states
llmflux jobs --state RUNNING --state FAILED

# Show detailed merged status for one job
llmflux status <job-id>

# Tail logs (default: 100 lines)
llmflux logs <job-id>
llmflux logs <job-id> --tail 200
llmflux logs <job-id> -f

# Cancel a tracked job
llmflux cancel <job-id>
llmflux cancel <job-id> --force
```

Notes:

- `jobs` and `status` derive live state from Slurm JSON output.
- `logs` and `cancel` only operate on jobs present in the LLMFlux registry.
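As an illustration of how the registry can be inspected directly (assuming, per the above, that ~/.llmflux/jobs.json holds the tracked-job records; the record fields are not specified here, so this sketch only iterates over whatever is present):

```python
import json
from pathlib import Path

def load_registry(path=Path.home() / ".llmflux" / "jobs.json"):
    """Return tracked-job records, or an empty list if no registry exists yet."""
    path = Path(path)
    if not path.exists():
        return []
    with open(path) as f:
        return json.load(f)

for job in load_registry():
    print(job)
```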
Results are saved in the user's workspace:
```json
[
  {
    "input": {
      "custom_id": "request1",
      "method": "POST",
      "url": "/v1/chat/completions",
      "body": {
        "messages": [
          {"role": "system", "content": "You are a helpful assistant"},
          {"role": "user", "content": "Original prompt text"}
        ],
        "temperature": 0.7,
        "max_tokens": 1024
      },
      "metadata": {
        "source_file": "example.txt"
      }
    },
    "output": {
      "id": "chat-cmpl-123",
      "object": "chat.completion",
      "created": 1699123456,
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Generated response text"
          },
          "finish_reason": "stop"
        }
      ]
    },
    "metadata": {
      "model": "llama3.2:3b",
      "timestamp": "2023-11-04T12:34:56.789Z",
      "processing_time": 1.23
    }
  }
]
```

LLMFlux provides utility converters to help prepare JSONL files from various input formats:

```bash
# Convert CSV to JSONL
llmflux convert csv --input data/papers.csv --output data/papers.jsonl --template "Summarize: {text}"

# Convert a directory to JSONL
llmflux convert dir --input data/documents/ --output data/docs.jsonl --recursive
```

For code examples of converters, see the examples directory.
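Returning to the results format shown above, here is a minimal post-processing sketch that pairs each request with its generated text (field names are taken from the example output; this helper is not part of the LLMFlux API):

```python
import json

def extract_responses(results_path):
    """Map each request's custom_id to its first assistant message."""
    with open(results_path) as f:
        results = json.load(f)
    return {
        record["input"]["custom_id"]:
            record["output"]["choices"][0]["message"]["content"]
        for record in results
    }
```

Calling `extract_responses("results.json")` on the example above would yield a dict mapping `"request1"` to the generated text.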
LLMFlux ships with a benchmarking workflow that can source prompts, submit the SLURM job, and collect results/metrics for you.
```bash
llmflux benchmark \
    --model Llama-3.2-3B-Instruct \
    --name nightly \
    --num-prompts 60 \
    --account ACCOUNT_NAME \
    --partition PARTITION_NAME \
    --nodes 1
```

- Prompt sources: omit `--input` to automatically download and cache LiveBench categories (`benchmark_data/`). Provide `--input path/to/prompts.jsonl` to reuse an existing JSONL file instead. Use `--num-prompts`, `--temperature`, and `--max-tokens` to control synthetic dataset generation.
- Outputs: results default to `results/benchmarks/<name>_results.json`, alongside a metrics summary (`<name>_metrics.txt`) containing the elapsed SLURM runtime and the number of prompts processed.
- Batch tuning: adjust `--batch-size` for throughput. Pass model arguments such as `--temperature` and `--max-tokens` to forward them to the runner.
- SLURM overrides: forward scheduler settings with `--account`, `--partition`, `--nodes`, `--gpus-per-node`, `--time`, `--mem`, and `--cpus-per-task`.
- Job controls: add `--rebuild` to force an Apptainer image rebuild or `--debug` to keep the generated job script for inspection.
For the complete option reference:
```bash
llmflux benchmark --help
```

We welcome contributions! Please see CONTRIBUTING.md for guidelines.