
Clarify or support automatic teardown for Kubernetes SkypilotExecutor / SkypilotJobsExecutor #483

@jesintharnold

Description

Summary

When running NeMo Run workloads on Kubernetes via SkyPilot, the training pod can remain alive after the workload has already completed or failed, leaving GPUs allocated until the pod or cluster is cleaned up manually.

I am not sure whether this is intended SkyPilot behavior on Kubernetes or a NeMo Run integration gap, so I am filing this as a support/docs request rather than a pure bug.

Environment

  • nemo-run version: to be filled in (output of pip show nemo-run)
  • skypilot version: to be filled in (output of pip show skypilot)
  • Python: 3.11.9
  • Backend: SkyPilot API server + Kubernetes

Reproducer

import os

os.environ["SKYPILOT_API_SERVER_ENDPOINT"] = "<SKY-PILOT-API-SERVER-URL>"

import nemo_run as run


def skypilot_executor(nodes=1, gpus_per_node=4):
    return run.SkypilotExecutor(
        gpus="H100",
        gpus_per_node=gpus_per_node,
        num_nodes=nodes,
        cloud="kubernetes",
        container_image="nvcr.io/nvidia/nemo:25.07",
        cluster_name="mistral-finetune-option-1",
        setup='pip install "mlflow>=1.0.0"',  # quote the spec so ">=" is not treated as shell redirection
        autodown=True,
    )

Observed Behavior

  • The training script finishes, fails, or exits with an error.
  • The Kubernetes pod remains up instead of being cleaned up.
  • GPU resources remain occupied until we manually terminate the pod or cluster.
  • In related SkypilotJobsExecutor runs we also see SkyPilot print "Auto-stop is not supported for Kubernetes and RunPod clusters. Skipping."
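
For reference, the lingering resources can be confirmed with standard tooling. This is a sketch, not part of the original report; the cluster name comes from the reproducer above and the grep pattern is illustrative:

```shell
# Confirm the cluster and pod are still alive after the job has exited.
sky status                           # the cluster still shows as UP
kubectl get pods -A | grep mistral   # the SkyPilot-launched pod is still Running
```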

Expected Behavior

One of the following should happen clearly and consistently:

  • If automatic teardown is supported on Kubernetes, the pod or cluster should be cleaned up when the job reaches a terminal state.
  • If automatic teardown is not supported on Kubernetes, NeMo Run documentation should state that explicitly for SkypilotExecutor / SkypilotJobsExecutor, especially when autodown=True is set.

Why This Is Confusing

  • SkypilotExecutor accepts autodown=True and passes it through to SkyPilot.
  • NeMo Run docs show Kubernetes as a supported SkypilotExecutor target.
  • In practice, the workload behaves like fire-and-forget unless we clean it up ourselves.

Request

Please clarify the intended behavior of autodown / auto-stop for Kubernetes-backed SkyPilot executors.

If this is unsupported today, it would help to document:

  • that Kubernetes jobs may need manual teardown,
  • whether autodown=True is ignored on Kubernetes,
  • and the recommended cleanup workflow after success or failure.
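
In the meantime, our manual cleanup workaround is roughly the following (a sketch; the cluster name comes from the reproducer above, and sky down is the standard SkyPilot teardown command):

```shell
# Manual teardown after the job reaches a terminal state.
sky down mistral-finetune-option-1 --yes   # tear down the cluster, releasing the GPUs
sky status                                 # verify the cluster is gone
```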

If it is supposed to work, then this likely needs a fix so terminal jobs actually release cluster resources.
