Commit c87f6cc

chore: Add Claude md file (#446)
Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent d94c868 commit c87f6cc

1 file changed

Lines changed: 98 additions & 0 deletions

CLAUDE.md

@@ -0,0 +1,98 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

NeMo Run is a tool for configuring, executing, and managing ML experiments across various computing environments. Its three core pillars are:
1. **Configuration** - Python-native config using Google's Fiddle library
2. **Execution** - Running tasks on local machines, SLURM clusters, Docker, cloud (SkyPilot, DGX Cloud, Lepton)
3. **Management** - Tracking experiment metadata locally in `NEMORUN_HOME` (default: `~/.nemo_run`)
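
The `NEMORUN_HOME` lookup with its `~/.nemo_run` default can be sketched as follows. This is an illustration only, not NeMo Run's own code: the helper name `resolve_nemorun_home` is hypothetical, and treating an empty value as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_nemorun_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the metadata directory: NEMORUN_HOME if set, else ~/.nemo_run."""
    env = os.environ if env is None else env
    # Assumption: an empty value is treated the same as an unset variable.
    raw = env.get("NEMORUN_HOME") or "~/.nemo_run"
    return Path(raw).expanduser()


home = resolve_nemorun_home()
print(f"experiment metadata would live under: {home}")
```

Passing the environment as an argument keeps the helper easy to test without mutating `os.environ`.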

## Commands

```bash
# Install for development
uv sync --extra skypilot

# Run tests (slow tests are skipped by default)
uv run -- pytest test/

# Run a single test
uv run -- pytest test/test_config.py::TestClass::test_method

# Run including slow tests
uv run -- pytest -m "" test/

# Lint
uv run --group lint -- ruff check

# Format
uv run --group lint -- ruff format

# Run with coverage
uv run -- coverage run --branch --source=nemo_run -a -m pytest test/
uv run -- coverage report -m
```

Line length is 100 (configured in `pyproject.toml` under `[tool.ruff]`).

## Architecture

### Core Abstractions

**`Config[T]` / `Partial[T]`** (`nemo_run/config.py`): Built on Fiddle. `Config` instantiates the target directly when built; `Partial` creates a `functools.partial`. `Script` wraps shell commands. These are the primary user-facing types.
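
The difference between the two build modes can be illustrated with a standard-library stand-in. This is only a sketch: `build_now`, `build_deferred`, and `Optimizer` are hypothetical names, and nothing here uses Fiddle or NeMo Run itself. It assumes only the behaviour described above (direct instantiation versus a `functools.partial`).

```python
import functools
from dataclasses import dataclass


@dataclass
class Optimizer:
    lr: float
    momentum: float = 0.9


def build_now(target, **kwargs):
    """Config-like behaviour: call the target immediately and return the object."""
    return target(**kwargs)


def build_deferred(target, **kwargs):
    """Partial-like behaviour: bind the arguments, but leave the call for later."""
    return functools.partial(target, **kwargs)


# Built directly: we get an Optimizer instance right away.
opt = build_now(Optimizer, lr=0.1)

# Built as a partial: we get a callable that still accepts further arguments.
make_opt = build_deferred(Optimizer, lr=0.1)
late = make_opt(momentum=0.5)

print(opt)
print(late)
```

A deferred build is the natural fit for a task that should run later on an executor, while a direct build suits objects the task needs as arguments.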

**`Executor`** (`nemo_run/core/execution/base.py`): Abstract base for all execution environments. Key fields: `packager`, `launcher`, `env_vars`, `retries`. Implementations:
- `LocalExecutor` - direct local execution
- `DockerExecutor` - via Docker
- `SlurmExecutor` - HPC via SLURM + SSH tunnel
- `SkypilotExecutor` / `SkypilotJobsExecutor` - multi-cloud via SkyPilot
- `DGXCloudExecutor` - NVIDIA DGX Cloud
- `LeptonExecutor` - Lepton AI

**`Experiment`** (`nemo_run/run/experiment.py`): Context manager that groups multiple tasks/jobs, handles parallel execution, log syncing, state tracking, and plugin hooks. Uses TorchX (`torchx>=0.7.0`) as the distributed execution backend.

**`Packager`** (`nemo_run/core/packaging/`): Strategies to bundle code for remote execution:
- `GitArchivePackager` - packages via `git archive`
- `PatternPackager` - file glob patterns
- `HybridPackager` - combines strategies
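
To make the glob-pattern strategy concrete, here is a minimal standard-library sketch of bundling files that match a pattern into a tarball. It is not the `PatternPackager` implementation: the function `package_by_pattern`, its parameters, and the `.tar.gz` output format are all assumptions made for illustration.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def package_by_pattern(root: Path, pattern: str, out: Path) -> List[str]:
    """Archive every file under `root` matching `pattern`; return archived names."""
    names = []
    with tarfile.open(out, "w:gz") as tar:
        for path in sorted(root.glob(pattern)):
            if path.is_file():
                # Store paths relative to the root so the archive is relocatable.
                arcname = path.relative_to(root).as_posix()
                tar.add(path, arcname=arcname)
                names.append(arcname)
    return names


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    (src / "pkg").mkdir(parents=True)
    (src / "pkg" / "train.py").write_text("print('train')\n")
    (src / "pkg" / "notes.txt").write_text("not packaged\n")
    archived = package_by_pattern(src, "**/*.py", Path(tmp) / "bundle.tar.gz")
    print(archived)
```

Writing the archive outside the source root avoids the bundle matching its own pattern.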

**`Launcher`** (`nemo_run/core/execution/launcher.py`): Controls how tasks are launched within an executor. Options: `Torchrun`, `FaultTolerance` (NVIDIA), `SlurmRay`, `SlurmTemplate`.

**Tunnels** (`nemo_run/core/tunnel/`): `SSHTunnel` for remote cluster access with rsync for file syncing.

### Data Flow

1. User defines a function/class and wraps it in `run.Config` or `run.Partial`
2. An `Executor` is configured (with `Packager` + optional `Launcher`)
3. `run.run(task, executor)` or `run.Experiment` is used to execute
4. TorchX schedulers (registered as entry points in `pyproject.toml`) dispatch work
5. Metadata is stored in `~/.nemo_run/` for experiment tracking

### CLI

The `nemorun` / `nemo` entry points (built on Typer) provide experiment management and configuration inspection. The CLI uses lazy imports (`nemo_run/cli/lazy.py`) for fast startup and is extensible via the `nemo_run.cli.entrypoints` namespace.
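
As an illustration of why lazy imports help startup time, the sketch below uses the standard library's `importlib.util.LazyLoader`, which defers executing a module until one of its attributes is first accessed. This is the generic Python recipe, not necessarily what `nemo_run/cli/lazy.py` does; the helper name `lazy_import` is hypothetical.

```python
import importlib.util
import sys


def lazy_import(name: str):
    """Return a module whose body runs only on first attribute access."""
    if name in sys.modules:
        # Already imported elsewhere; nothing left to defer.
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    # With LazyLoader this only arms the module; real execution is deferred.
    loader.exec_module(module)
    return module


colorsys = lazy_import("colorsys")       # cheap: module body has not run yet
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))  # first attribute access triggers the load
```

A CLI that imports heavy dependencies this way pays their cost only for the commands that actually use them.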

### Serialization

Configurations can be serialized to YAML (`nemo_run/core/serialization/yaml.py`) or compressed JSON (`zlib_json.py`) for persistence.
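
A compressed-JSON round trip can be sketched with the standard library alone. The function names `dumps_compressed` / `loads_compressed` and the base64 text encoding are assumptions for illustration; the actual format used by `zlib_json.py` may differ.

```python
import base64
import json
import zlib
from typing import Any


def dumps_compressed(obj: Any) -> str:
    """Serialize to JSON, compress with zlib, and encode as ASCII-safe text."""
    raw = json.dumps(obj, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def loads_compressed(text: str) -> Any:
    """Invert dumps_compressed: decode, decompress, and parse the JSON."""
    raw = zlib.decompress(base64.urlsafe_b64decode(text.encode("ascii")))
    return json.loads(raw.decode("utf-8"))


config = {"model": {"hidden_size": 1024, "layers": 24}, "lr": 3e-4}
blob = dumps_compressed(config)
restored = loads_compressed(blob)
print(restored == config)
```

The text encoding step matters when the payload has to travel through places that expect strings, such as environment variables or command-line arguments.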

### Plugin System

`ExperimentPlugin` (`nemo_run/run/plugin.py`) provides hooks into the experiment lifecycle.

## Key Files

- `nemo_run/api.py` - all public exports
- `nemo_run/config.py` - `Config`, `Partial`, `Script` classes
- `nemo_run/run/experiment.py` - `Experiment` context manager
- `nemo_run/core/execution/base.py` - `Executor` base class
- `nemo_run/core/execution/slurm.py` - most complex executor (SLURM + SSH)
- `test/conftest.py` - shared fixtures

## Testing Notes

- Pytest marker `slow` is skipped by default (`addopts = -m "not slow"` in `pyproject.toml`)
- `INCLUDE_WORKSPACE_FILE` env var controls workspace-related test behavior
- Test directory is added to `PYTHONPATH` via `add_test_to_pythonpath` fixture
