Merged
10 changes: 10 additions & 0 deletions README.md
@@ -87,6 +87,16 @@ oellm schedule-eval \

Results are written to `./oellm-output/<timestamp>/results/`.

**Air-gapped cluster nodes (no internet):** batch jobs set `HF_HUB_OFFLINE=1` and get `HF_HOME` from your cluster env. With `--local`, the CLI defaults `HF_HOME` to `~/.cache/huggingface` if unset and would otherwise allow Hub access, so on a compute node without network, export your real cache path and the offline flag before running, for example:

```bash
export HF_HOME=/leonardo_work/OELLM_prod2026/users/shaldar0/oellm-evals/hf_data
export HF_HUB_OFFLINE=1
oellm schedule-eval ... --venv_path .venv --local true
```

The `HF_HUB_OFFLINE` value is read when you invoke `oellm` and baked into the generated script.
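
The precedence described above (an explicit `HF_HUB_OFFLINE` wins, otherwise the default depends on `--local`) can be sketched as a standalone function. This is only an illustration of the rule, not the CLI's API: the name `resolve_offline_flag` and the `env` parameter are hypothetical, and the environment is passed in as a mapping so the behaviour is easy to check without touching the real environment.

```python
import logging
from collections.abc import Mapping


def resolve_offline_flag(env: Mapping[str, str], local: bool) -> int:
    """Return the HF_HUB_OFFLINE value to embed in the generated script.

    `env` stands in for os.environ (hypothetical parameter, for illustration).
    """
    raw = env.get("HF_HUB_OFFLINE", "").strip()
    if raw:
        try:
            # An explicit, parseable setting always wins over the defaults.
            return int(raw)
        except ValueError:
            logging.warning("Invalid HF_HUB_OFFLINE=%r; using default", raw)
    # Defaults: online for --local runs, offline for SLURM batch jobs.
    return 0 if local else 1
```

Under these assumptions, `resolve_offline_flag({"HF_HUB_OFFLINE": "1"}, local=True)` returns `1`, while an empty mapping gives `0` with `local=True` and `1` with `local=False`. An unparseable value such as `"yes"` logs a warning and falls back to the same defaults.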

## SLURM Overrides

Override cluster defaults (partition, account, time limit, etc.) with `--slurm_template_var` (JSON object):
18 changes: 17 additions & 1 deletion oellm/main.py
@@ -31,6 +31,22 @@
)


def _resolve_hf_hub_offline(local: bool) -> int:
    """Value embedded in the generated eval script as HF_HUB_OFFLINE.

    If ``HF_HUB_OFFLINE`` is set in the environment when ``oellm`` runs, that
    value wins. Otherwise defaults to online Hub access for ``--local``
    (typical laptop dev) and offline for SLURM jobs (air-gapped workers).
    """
    raw = os.environ.get("HF_HUB_OFFLINE")
    if raw is not None and str(raw).strip() != "":
        try:
            return int(str(raw).strip())
        except ValueError:
            logging.warning("Invalid HF_HUB_OFFLINE=%r; using default", raw)
    return 0 if local else 1


@dataclass
class EvaluationJob:
    model_path: Path | str
@@ -369,7 +385,7 @@ def schedule_evals(
        venv_path=venv_path or "",
        lm_eval_include_path=lm_eval_include_path
        or str(files("oellm.resources") / "custom_lm_eval_tasks"),
-       hf_hub_offline=0 if local else 1,
+       hf_hub_offline=_resolve_hf_hub_offline(local),
        lighteval_model_args="trust_remote_code=True,batch_size=1"
        if local
        else "trust_remote_code=True",
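
To illustrate what "embedded in the generated eval script" means for the `hf_hub_offline` argument, here is a minimal sketch using the standard library's `string.Template`. The template text and the helper `render_offline_export` are hypothetical; the actual oellm template is not part of this diff and may look quite different.

```python
from string import Template

# Hypothetical one-line fragment of a job script template (assumption:
# the real template exports the variable in a similar way).
SCRIPT_TEMPLATE = Template("export HF_HUB_OFFLINE=$hf_hub_offline\n")


def render_offline_export(hf_hub_offline: int) -> str:
    """Render the export line with the value resolved at scheduling time."""
    return SCRIPT_TEMPLATE.substitute(hf_hub_offline=hf_hub_offline)
```

With this sketch, `render_offline_export(1)` yields the line `export HF_HUB_OFFLINE=1`. Because the value is substituted when the script is rendered, changing `HF_HUB_OFFLINE` in your shell afterwards would not affect a script that has already been generated.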