
Clarify or support automatic teardown for Kubernetes SkypilotExecutor / SkypilotJobsExecutor #483

@jesintharnold

Description

Summary

When running NeMo Run workloads on Kubernetes via SkyPilot, the training pod can remain alive after the workload has already completed or failed, leaving GPUs allocated until the pod or cluster is cleaned up manually.

I am not sure whether this is intended SkyPilot behavior on Kubernetes or a NeMo Run integration gap, so I am filing this as a support/docs request rather than a pure bug.

Environment

  • nemo-run version: to be filled in (output of pip show nemo-run)
  • skypilot version: to be filled in (output of pip show skypilot)
  • Python: 3.11.9
  • Backend: SkyPilot API server + Kubernetes

Reproducer

import os

os.environ["SKYPILOT_API_SERVER_ENDPOINT"] = "<SKY-PILOT-API-SERVER-URL>"

import nemo_run as run


def skypilot_executor(nodes=1, gpus_per_node=4):
    return run.SkypilotExecutor(
        gpus="H100",
        gpus_per_node=gpus_per_node,
        num_nodes=nodes,
        cloud="kubernetes",
        container_image="nvcr.io/nvidia/nemo:25.07",
        cluster_name="mistral-finetune-option-1",
        setup='pip install "mlflow>=1.0.0"',  # quote the spec so ">=" is not treated as shell redirection
        autodown=True,
    )

Observed Behavior

  • The training script finishes, fails, or exits with an error.
  • The Kubernetes pod remains up instead of being cleaned up.
  • GPU resources remain occupied until we manually terminate the pod or cluster.
  • In related SkypilotJobsExecutor runs we also see SkyPilot print "Auto-stop is not supported for Kubernetes and RunPod clusters. Skipping."
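
For reference, the lingering resources can be confirmed with standard tooling. This is a sketch, not part of the original report; the cluster name comes from the reproducer above and the grep pattern is illustrative:

```shell
# Confirm the cluster and pod are still alive after the job has exited.
sky status                           # the cluster still shows as UP
kubectl get pods -A | grep mistral   # the SkyPilot-launched pod is still Running
```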

Expected Behavior

One of the following should happen clearly and consistently:

  • If automatic teardown is supported on Kubernetes, the pod or cluster should be cleaned up when the job reaches a terminal state.
  • If automatic teardown is not supported on Kubernetes, NeMo Run documentation should state that explicitly for SkypilotExecutor / SkypilotJobsExecutor, especially when autodown=True is set.

Why This Is Confusing

  • SkypilotExecutor accepts autodown=True and passes it through to SkyPilot.
  • NeMo Run docs show Kubernetes as a supported SkypilotExecutor target.
  • In practice, the workload behaves like fire-and-forget unless we clean it up ourselves.

Request

Please clarify the intended behavior of autodown / auto-stop for Kubernetes-backed SkyPilot executors.

If this is unsupported today, it would help to document:

  • that Kubernetes jobs may need manual teardown,
  • whether autodown=True is ignored on Kubernetes,
  • and the recommended cleanup workflow after success or failure.
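
In the meantime, our manual cleanup workaround is roughly the following (a sketch; the cluster name comes from the reproducer above, and sky down is the standard SkyPilot teardown command):

```shell
# Manual teardown after the job reaches a terminal state.
sky down mistral-finetune-option-1 --yes   # tear down the cluster, releasing the GPUs
sky status                                 # verify the cluster is gone
```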

If it is supposed to work, then this likely needs a fix so terminal jobs actually release cluster resources.
