Summary
When running NeMo Run workloads on Kubernetes via SkyPilot, the training pod can remain alive after the workload has already completed or failed, leaving GPUs allocated until the pod or cluster is cleaned up manually.
I am not sure whether this is intended SkyPilot behavior on Kubernetes or a NeMo Run integration gap, so I am filing this as a support/docs request rather than a pure bug.
Environment
- nemo-run version: please fill exact version from `pip show nemo-run`
- skypilot version: please fill exact version from `pip show skypilot`
- Python: 3.11.9
- Backend: SkyPilot API server + Kubernetes
Reproducer
```python
import os

# Point the SkyPilot client at the API server before importing NeMo Run.
os.environ["SKYPILOT_API_SERVER_ENDPOINT"] = "<SKY-PILOT-API-SERVER-URL>"

import nemo_run as run


def skypilot_executor(nodes=1, gpus_per_node=4):
    return run.SkypilotExecutor(
        gpus="H100",
        gpus_per_node=gpus_per_node,
        num_nodes=nodes,
        cloud="kubernetes",
        container_image="nvcr.io/nvidia/nemo:25.07",
        cluster_name="mistral-finetune-option-1",
        # Quote the requirement so `>` is not treated as shell redirection.
        setup="pip install 'mlflow>=1.0.0'",
        autodown=True,
    )
```
Observed Behavior
- The training script finishes, fails, or exits with an error.
- The Kubernetes pod remains up instead of being cleaned up.
- GPU resources remain occupied until we manually terminate the pod or cluster.
- In related `SkypilotJobsExecutor` runs we also see SkyPilot print `Auto-stop is not supported for Kubernetes and RunPod clusters. Skipping.`
Expected Behavior
One of the following should happen clearly and consistently:
- If automatic teardown is supported on Kubernetes, the pod or cluster should be cleaned up when the job reaches a terminal state.
- If automatic teardown is not supported on Kubernetes, NeMo Run documentation should state that explicitly for `SkypilotExecutor` / `SkypilotJobsExecutor`, especially when `autodown=True` is set.
Why This Is Confusing
- `SkypilotExecutor` accepts `autodown=True` and passes it through to SkyPilot.
- NeMo Run docs show Kubernetes as a supported `SkypilotExecutor` target.
- In practice, the workload behaves like fire-and-forget unless we clean it up ourselves.
Request
Please clarify the intended behavior of `autodown` / auto-stop for Kubernetes-backed SkyPilot executors.
If this is unsupported today, it would help to document:
- that Kubernetes jobs may need manual teardown,
- whether `autodown=True` is silently ignored on Kubernetes,
- and the recommended cleanup workflow after success or failure.
If it is supposed to work, then this likely needs a fix so terminal jobs actually release cluster resources.