
Conversation

@typhoonzero
Contributor

@typhoonzero typhoonzero commented Feb 11, 2026

Summary by CodeRabbit

  • Documentation
    • Rebranded fine-tuning guide to "Workbench": clarified Workbench-driven workflows, configurable model/dataset/output endpoints, container/runtime and PVC guidance, MLflow experiment tracking, non‑NVIDIA GPU notes, and updated publishing/inference guidance.
    • Added PyTorch DDP notebook demonstrating distributed Fashion MNIST training with Kubeflow Trainer.
    • Added Kubeflow Trainer v2 quick-start for distributed PyTorch on Kubernetes.

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Walkthrough

Reframes fine-tuning docs to "Alauda AI Workbench", replaces fixed endpoints with placeholders, switches Dockerfile→Containerfile, expands runtime/MLFlow/env/PVC/GPU/security details, and adds two Kubeflow Trainer how-tos: a PyTorch DDP notebook and a ClusterTrainingRuntime quick-start.

Changes

  • Fine-tuning documentation (`docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`): Rebranded to "Alauda AI Workbench"; swapped hard-coded model/dataset URLs for placeholders; changed Dockerfile → Containerfile; updated PVC/storage keys (e.g., wy-model-cache, sc-topolvm), added env vars (HF_HOME, DO_MERGE, BASE_MODEL_URL, DATASET_URL, OUTPUT_MODEL_URL), adjusted MLflow config, added securityContext/shm settings, clarified VolcanoJob wording and inference publishing notes, and refined non-NVIDIA GPU vendor-neutral guidance.
  • Kubeflow Trainer — notebook (`docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb`): New Jupyter notebook demonstrating PyTorch DDP (model, init, DistributedSampler, DDP), local execution via Kubeflow TrainerClient, runtime discovery, TrainJob/Train step submission, log streaming and monitoring, and admin/cleanup hints.
  • Kubeflow Trainer — quick start (`docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`): New quick-start guide for building a torch-distributed runtime (Containerfile) or using prebuilt image, creating a ClusterTrainingRuntime YAML with security/shared memory, submitting a TrainJob example for MNIST, and notes on networking, runtimes, and next steps.

Sequence Diagram(s)

(Skipped)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • zhaomingkun1030

Poem

🐇 I hopped through docs and nudged each line with care,
Replaced hard links with placeholders fair.
Built runtimes, cached models, mounts held tight,
Kubeflow trains at dawn and logs by night,
Hop — Workbench gardens grow in shared delight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The pull request title accurately describes the main changes: it adds documentation for fine-tuning with notebooks and a Kubeflow Trainer quickstart guide.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



No actionable comments were generated in the recent review. 🎉



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 440: Replace the unhyphenated phrase "fine tuning" with the hyphenated
form "fine-tuning" in the sentence that begins with "model cache PVC
(optional):" (the phrase appears as "fine tuning experiments" and should be
"fine-tuning experiments") so the compound modifier is consistent with the rest
of the document.
- Line 102: Update the Containerfile fenced code block so it includes a language
identifier by replacing the opening triple backticks with ```dockerfile (and
keep the matching closing triple backticks), i.e., locate the Containerfile code
block in the fine_tunning_using_notebooks.mdx content and add the dockerfile
language tag to the opening fence so syntax highlighting is enabled.
- Around line 298-312: The main-container shell uses "set -ex" which will print
commands (including the sensitive git push with ${gitauth}) to logs; change the
shell behavior to avoid tracing secrets by removing or limiting "-x" (use "set
-e" only) or temporally disabling xtrace around the sensitive operation (turn
off xtrace before the git push that uses ${gitauth} and re-enable afterward).
Locate the main container block where "set -ex" is set and the git push that
embeds "${gitauth}" and update it so command tracing does not expose
credentials.
- Around line 228-238: The script currently uses "set -ex" which prints every
command (including the gitauth/GIT_USER:GIT_TOKEN used in the git clone command
and BASE_MODEL_URL) to logs; change this to avoid leaking credentials by
removing the -x (use "set -e") or by disabling tracing around the sensitive
commands: keep tracing off when building gitauth and when running the git clone
and lfs pull (use "set +x" before the git clone
"https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set -x" after), or replace
inline credentials with a git credential helper; update occurrences of set -ex,
gitauth, and the git clone / lfs pull invocations accordingly.
- Line 80: Replace the full-width right parenthesis used in the example call to
datasets.load_dataset (the symbol shown as )) with the standard ASCII right
parenthesis ) so the snippet "ds = datasets.load_dataset(<dataset directory>)"
becomes valid Python; locate the example line in the
fine_tunning_using_notebooks.mdx file and correct the punctuation for the
datasets.load_dataset invocation.
- Line 515: Fix the typos in the Note sentence starting with "**Note:** The
following steps can also be adapt to LLM pre-training and traditional ML
senarios." by changing "adapt" to "adapted" and "senarios" to "scenarios" so the
sentence reads "...can also be adapted to LLM pre-training and traditional ML
scenarios." Update that exact string in the document.
- Around line 243-251: The two unconditional deletions ("rm -rf ${DATASET_NAME}"
and "rm -rf data") make the subsequent existence check around ${DATASET_NAME}
dead code so the dataset is always re-cloned; remove those rm -rf lines (or move
them to run only when you explicitly want to force-refresh) so the if [ -d
${DATASET_NAME} ] { echo "dataset ${DATASET_NAME} already exists skipping
download" } else { git ... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" }
branch can correctly skip cloning when the dataset is present.
- Line 377: The shell conditional uses an unquoted variable (DO_MERGE) which
will break if unset or empty; fix the conditional that currently reads the test
against DO_MERGE by quoting the variable reference and using a POSIX-compatible
string comparison operator (or switch to a bash [[ ... ]] test) so the test
becomes robust when DO_MERGE is empty or unset; update the conditional that
contains DO_MERGE to use the quoted form and a proper "=" comparison (a sketch
follows this list).
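
For illustration, a minimal sketch of the quoted-variable fix described in the Line 377 item above; only the `DO_MERGE` name comes from the doc, the echo messages are hypothetical:

```bash
#!/usr/bin/env bash
# Quote the variable and use a POSIX "=" comparison so the test
# still works when DO_MERGE is unset or empty.
if [ "${DO_MERGE}" = "true" ]; then
    echo "merging the adapter into the base model"
else
    echo "DO_MERGE is not 'true'; skipping merge"
fi
```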

In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb`:
- Around line 295-301: Replace or remove the notebook cell output that exposes
the internal registry URL in the Runtime object's image field; locate the
runtime printout (Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.
- Around line 314-339: The narrative and saved outputs describe training on 3
PyTorch nodes but the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.
- Around line 160-167: Fix the typo in the inline comment above the DataLoader
instantiation: change "Shard the dataset accross workers." to "Shard the dataset
across workers." in the notebook cell where train_loader is created (referencing
DataLoader, dataset, batch_size, and DistributedSampler).

In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 114: Fix the typo in the sentence starting "For more informatoin about
how to use **Kubeflow Trainer v2**..." by replacing "informatoin" with
"information" and add a trailing period at the end of the sentence so it reads:
"For more information about how to use **Kubeflow Trainer v2**, please refer to
[Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)."
- Line 15: Replace the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.
- Around line 17-37: Add the Dockerfile language identifier to the fenced code
block that starts with "FROM python:3.13-trixie" so the block becomes
```dockerfile; edit the markdown section containing the Dockerfile snippet (the
fenced block showing FROM python:3.13-trixie, ARG USERNAME, RUN pip install ...,
etc.) and prepend "dockerfile" immediately after the opening triple backticks to
satisfy markdownlint MD040 and enable proper syntax highlighting.


```dockerfile
FROM nvcr.io/nvidia/pytorch:24.12-py3
```

⚠️ Potential issue | 🟡 Minor

Specify a language for the fenced code block.

Same as in the quick-start doc, the Containerfile code block should have a language identifier (e.g., dockerfile).

-```
+```dockerfile
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 102,
Update the Containerfile fenced code block so it includes a language identifier
by replacing the opening triple backticks with ```dockerfile (and keep the
matching closing triple backticks), i.e., locate the Containerfile code block in
the fine_tunning_using_notebooks.mdx content and add the dockerfile language tag
to the opening fence so syntax highlighting is enabled.

Comment on lines +228 to +238
set -ex
cd /mnt/models
BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
# Download base model
gitauth="${GIT_USER}:${GIT_TOKEN}"
BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//https:\/\/}"
if [ -d ${BASE_MODEL_NAME} ]; then
echo "${BASE_MODEL_NAME} dir already exists, skip downloading"
else
GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
(cd ${BASE_MODEL_NAME} && git -c http.sslVerify=false -c lfs.activitytimeout=36000 lfs pull)

⚠️ Potential issue | 🟠 Major

Credentials may leak to logs via set -ex.

set -ex causes every command to be printed before execution, including the git clone "https://${gitauth}@..." commands on lines 237, 238, and 250. This will expose GIT_USER:GIT_TOKEN in pod logs.

Consider either:

  • Using set -e instead of set -ex (remove trace), or
  • Wrapping sensitive commands with set +x / set -x (sketched below), or
  • Using git credential helpers instead of inline URL credentials.
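
A minimal sketch of the second option, reusing the variable names from the excerpt above; the rest of the init-container script is assumed unchanged:

```bash
set -ex
cd /mnt/models
BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
gitauth="${GIT_USER}:${GIT_TOKEN}"
BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//https:\/\/}"

# Disable xtrace only around the command that embeds credentials in the URL,
# so GIT_USER:GIT_TOKEN is never echoed into pod logs; re-enable it afterwards.
set +x
GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 \
    clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
set -x
```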
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 228 -
238, The script currently uses "set -ex" which prints every command (including
the gitauth/GIT_USER:GIT_TOKEN used in the git clone command and BASE_MODEL_URL)
to logs; change this to avoid leaking credentials by removing the -x (use "set
-e") or by disabling tracing around the sensitive commands: keep tracing off
when building gitauth and when running the git clone and lfs pull (use "set +x"
before the git clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set
-x" after), or replace inline credentials with a git credential helper; update
occurrences of set -ex, gitauth, and the git clone / lfs pull invocations
accordingly.

Comment on lines +243 to +251

rm -rf ${DATASET_NAME}
rm -rf data

if [ -d ${DATASET_NAME} ]; then
echo "dataset ${DATASET_NAME} already exists skipping download"
else
git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi

⚠️ Potential issue | 🔴 Critical

Bug: rm -rf makes the subsequent existence check dead code.

Lines 244–245 unconditionally delete ${DATASET_NAME} and data, so the if [ -d ${DATASET_NAME} ] check on line 247 will never be true. This means the "already exists, skipping download" path is unreachable, and the dataset is always re-downloaded.

It looks like the intent was to skip downloading if already cached (matching the model logic above). Remove the rm -rf lines, or restructure:

Proposed fix
-                rm -rf ${DATASET_NAME}
-                rm -rf data
-                 
                 if [ -d ${DATASET_NAME} ]; then
                     echo "dataset ${DATASET_NAME} already exists skipping download"
                 else
                     git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
                 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rm -rf ${DATASET_NAME}
rm -rf data
if [ -d ${DATASET_NAME} ]; then
echo "dataset ${DATASET_NAME} already exists skipping download"
else
git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
if [ -d ${DATASET_NAME} ]; then
echo "dataset ${DATASET_NAME} already exists skipping download"
else
git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 243 -
251, The two unconditional deletions ("rm -rf ${DATASET_NAME}" and "rm -rf
data") make the subsequent existence check around ${DATASET_NAME} dead code so
the dataset is always re-cloned; remove those rm -rf lines (or move them to run
only when you explicitly want to force-refresh) so the if [ -d ${DATASET_NAME} ]
{ echo "dataset ${DATASET_NAME} already exists skipping download" } else { git
... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" } branch can correctly
skip cloning when the dataset is present.

Comment on lines +298 to +312
- bash
- -c
- |
set -ex
echo "job workers list: ${VC_WORKER_HOSTS}"
if [ "${VC_WORKER_HOSTS}" != "" ]; then
export N_RANKS=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print NF}')
export RANK=$VC_TASK_INDEX
export MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print $1}')
export RANK=$RANK
export WORLD_SIZE=$N_RANKS
export NNODES=$N_RANKS
export NODE_RANK=$RANK
export MASTER_ADDR=${MASTER_HOST}
export MASTER_PORT="8888"

⚠️ Potential issue | 🟠 Major

Same credential leakage concern in the main container's set -ex block.

Lines 301 and 410 have the same issue as the init container — set -ex will trace the git push command (line 410) that embeds ${gitauth} in the URL, leaking credentials to pod logs.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 298 -
312, The main-container shell uses "set -ex" which will print commands
(including the sensitive git push with ${gitauth}) to logs; change the shell
behavior to avoid tracing secrets by removing or limiting "-x" (use "set -e"
only) or temporarily disabling xtrace around the sensitive operation (turn off
xtrace before the git push that uses ${gitauth} and re-enable afterward). Locate
the main container block where "set -ex" is set and the git push that embeds
"${gitauth}" and update it so command tracing does not expose credentials.

Comment on lines +295 to +301
{
"name": "stdout",
"output_type": "stream",
"text": [
"Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)\n"
]
}

⚠️ Potential issue | 🟡 Minor

Output contains an internal registry URL.

The output at line 299 exposes 152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2, which appears to be an internal/private registry address. For public-facing documentation, consider clearing this output or replacing it with a sanitized example using the public image reference (alaudadockerhub/torch-distributed:v2.9.1-aml2).

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 295 -
301, Replace or remove the notebook cell output that exposes the internal
registry URL in the Runtime object's image field; locate the runtime printout
(Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.

Comment on lines +314 to +339
"## Run the Distributed TrainJob\n",
"\n",
"Kubeflow TrainJob will train the above model on 3 PyTorch nodes."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"job_name = client.train(\n",
" trainer=CustomTrainer(\n",
" func=train_fashion_mnist,\n",
" # Set how many PyTorch nodes you want to use for distributed training.\n",
" num_nodes=1,\n",
" # Set the resources for each PyTorch node.\n",
" resources_per_node={\n",
" \"cpu\": 2,\n",
" \"memory\": \"8Gi\",\n",
" # Uncomment this to distribute the TrainJob using GPU nodes.\n",
" # \"nvidia.com/gpu\": 1,\n",
" },\n",
" ),\n",
" runtime=torch_runtime,\n",
")"

⚠️ Potential issue | 🟠 Major

Inconsistency: markdown says "3 PyTorch nodes" but num_nodes=1.

The markdown cells at lines 316, 350, and 404 all reference 3 nodes/GPUs (e.g., "train the above model on 3 PyTorch nodes", "distributed training across 3 nodes", "we run training on 3 GPUs"). However, the code at line 329 sets num_nodes=1. The saved outputs (lines 382–387) also show 3 nodes, suggesting the notebook was run with a different configuration than what's in the code cells.

Either update num_nodes to 3 to match the narrative and outputs, or update the markdown text and clear/update the outputs to match num_nodes=1.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 314 -
339, The narrative and saved outputs describe training on 3 PyTorch nodes but
the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.


## Prepare Runtime Image

Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.

⚠️ Potential issue | 🟡 Minor

Minor grammar: "a image" → "an image".

-Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
+Create a `torch_distributed.Containerfile` from the contents below and build an image. Or you can use the pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
📝 Committable suggestion


Suggested change
Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
Create a `torch_distributed.Containerfile` from the contents below and build an image. Or you can use the pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 15, Replace
the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.

Comment on lines +17 to +37
```
FROM python:3.13-trixie
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources
RUN apt-get update && \
apt-get install -y build-essential

RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip && \
pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
"torch==2.9.1" \
"torchvision==0.24.1"

RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME

WORKDIR /workspace
RUN chown $USERNAME:$USER_GID /workspace
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "kubeflow_trainer_quick_start.md" | head -20

Repository: alauda/aml-docs

Length of output: 117


🌐 Web query:

PyTorch 2.9.1 torchvision compatibility matrix versions

💡 Result:

For PyTorch 2.9.1, the officially paired torchvision version is:

| torch | torchvision |
|-------|-------------|
| 2.9.1 | 0.24.1 |
| 2.9.0 | 0.24.0 |
| 2.8.0 | 0.23.0 |

Sources: PyTorch “Previous Versions” install commands (these reflect the supported matching domain-library versions). [1]


🏁 Script executed:

sed -n '17,37p' ./docs/en/workbench/how_to/kubeflow_trainer_quick_start.md | cat -A

Repository: alauda/aml-docs

Length of output: 696


Add language identifier (dockerfile) to the fenced code block per markdownlint MD040.

The code block requires a language identifier for proper syntax highlighting and markdown linting compliance.

Proposed fix
-```
+```dockerfile
 FROM python:3.13-trixie

Note: The torch and torchvision versions are correctly paired (torch 2.9.1 officially matches torchvision 0.24.1 per PyTorch's official release matrix).

📝 Committable suggestion


Suggested change
```
FROM python:3.13-trixie
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID
RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources
RUN apt-get update && \
apt-get install -y build-essential
RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip && \
pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
"torch==2.9.1" \
"torchvision==0.24.1"
RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME
WORKDIR /workspace
RUN chown $USERNAME:$USER_GID /workspace
```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 17-17: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` around lines 17 -
37, Add the Dockerfile language identifier to the fenced code block that starts
with "FROM python:3.13-trixie" so the block becomes ```dockerfile; edit the
markdown section containing the Dockerfile snippet (the fenced block showing
FROM python:3.13-trixie, ARG USERNAME, RUN pip install ..., etc.) and prepend
"dockerfile" immediately after the opening triple backticks to satisfy
markdownlint MD040 and enable proper syntax highlighting.


Download [kubeflow_trainer_mnist.ipynb](./kubeflow_trainer_mnist.ipynb) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.

For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)

⚠️ Potential issue | 🟡 Minor

Typo: "informatoin" → "information", and add a trailing period.

-For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
+For more information about how to use **Kubeflow Trainer v2**, please refer to the [Kubeflow Documentation](https://www.kubeflow.org/docs/components/trainer/).
📝 Committable suggestion


Suggested change
For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
For more information about how to use **Kubeflow Trainer v2**, please refer to the [Kubeflow Documentation](https://www.kubeflow.org/docs/components/trainer/).
🧰 Tools
🪛 LanguageTool

[grammar] ~114-~114: Ensure spelling is correct
Context: ...t a TrainJob using pytorch. For more informatoin about how to use *Kubeflow Trainer v2...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 114, Fix
the typo in the sentence starting "For more informatoin about how to use
**Kubeflow Trainer v2**..." by replacing "informatoin" with "information" and
add a trailing period at the end of the sentence so it reads: "For more
information about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow
Document](https://www.kubeflow.org/docs/components/trainer/)."


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 185: Replace the personal PVC name and output model URL that use the
`wy-` prefix with generic, documentation-friendly placeholders: change the
claimName value `wy-model-cache` to `model-cache` or `<your-model-cache-pvc>`
(refer to the claimName key), and update the output model URL `wy-sft-output` to
a generic name like `sft-output` or `<your-sft-output>` (refer to the output
model URL string). Ensure both occurrences (the claimName and the output model
URL) are consistent with other placeholders such as
`<your-model-registry-git-address>`.
- Around line 456-478: Update the StorageClass example to fix the likely typo in
mountPermissions and avoid hardcoded IPs: change mountPermissions from "0757" to
a more typical value (e.g., "0755" or "0775") in the StorageClass snippet
(referencing mountPermissions and the ai-nfs StorageClass) and replace the
hardcoded server IP (server: 192.168.17.28) with a placeholder such as
<your-nfs-server-ip> so the example is consistent and safe to reuse.
- Around line 113-115: The shallow clone uses "git clone --depth 1" which may
not include the tag referenced by LLAMA_FACTORY_VERSION, causing "git checkout
${LLAMA_FACTORY_VERSION}" to fail; update the clone invocation that contains
"git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git" to either
include the tag/branch with "--branch ${LLAMA_FACTORY_VERSION} --depth 1" or
perform a full clone (remove "--depth 1") so that "git checkout
${LLAMA_FACTORY_VERSION}" succeeds (this and the "git lfs track" quoting below
are sketched after this list).
- Line 407: The git lfs track command currently uses an unquoted glob (the
pattern *.safetensors) which can be expanded by the shell; update the occurrence
of the git lfs track invocation so the glob pattern is quoted (surround the
*.safetensors pattern with quotes) to ensure the tracking rule is registered
instead of being expanded to filenames.
- Around line 638-647: The YAML example in the non-Nvidia GPU section
incorrectly uses Nvidia-specific resource names (nvidia.com/gpualloc,
nvidia.com/gpucores, nvidia.com/gpumem); update those keys to vendor-agnostic
placeholders (e.g., <vendor>.com/gpualloc, <vendor>.com/gpucores,
<vendor>.com/gpumem or a single generic accelerator resource like
<vendor>.com/accelerator) to match the guidance on line 521 and avoid
Nvidia-specific terminology in this section.
- Around line 305-307: Remove the redundant self-assignment of the environment
variable: delete the line that does "export RANK=$RANK" (the no-op after "export
RANK=$VC_TASK_INDEX"); leave the "export RANK=$VC_TASK_INDEX" and "export
MASTER_HOST=..." lines intact so RANK is set only once from VC_TASK_INDEX.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Feb 11, 2026

Deploying alauda-ai with Cloudflare Pages

Latest commit: 0fa921a
Status: ✅  Deploy successful!
Preview URL: https://8e38e0e4.alauda-ai.pages.dev
Branch Preview URL: https://add-fine-tuning-and-trainer.alauda-ai.pages.dev

View logs


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 291-296: The MLFLOW_TRACKING_URI value is inconsistent across
examples; update both occurrences of the MLFLOW_TRACKING_URI environment
variable (the value currently set to
"http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant) to the
same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.
- Around line 102-116: The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.

In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 112: Update the link text and branding in the sentence that begins
"Download `kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and
replacing the vague link label "github workbench howtos" with a more descriptive
label (for example: "AML Workbench how-tos on GitHub" or "the AML Workbench
how-tos repository on GitHub"), keeping the same URL; locate the exact string
"Download `kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.

Comment on lines +102 to +116
```
ARG LLAMA_FACTORY_VERSION="v0.9.4"
FROM python:3.13-trixie

RUN sed -i 's@//.*archive.ubuntu.com@//mirrors.ustc.edu.cn@g' /etc/apt/sources.list && \
sed -i 's/security.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && \
RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources && \
apt-get update && \
export DEBIAN_FRONTEND=noninteractive && \
apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg && \
apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg default-libmysqlclient-dev build-essential pkg-config && \
apt clean && rm -rf /var/lib/apt/lists/*

RUN cd /opt && \
RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
cd /opt && \
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
cd LLaMA-Factory && \
pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip install --no-cache-dir "transformers==4.51.1" "tokenizers==0.21.1" -i https://pypi.tuna.tsinghua.edu.cn/simple

RUN apt-get update && apt-get install -y default-libmysqlclient-dev build-essential pkg-config && \
pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple \
"sqlalchemy==2.0.30" "pymysql==1.1.1" "loguru==0.7.2" "mysqlclient==2.2.7" "mlflow"
cd LLaMA-Factory && git checkout ${LLAMA_FACTORY_VERSION} && \
pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple

⚠️ Potential issue | 🔴 Critical

Bug: ARG declared before FROM is not available in subsequent RUN instructions.

In a Containerfile/Dockerfile, an ARG before FROM is only usable within FROM directives. After FROM, all build args are reset. So ${LLAMA_FACTORY_VERSION} on line 115 will resolve to an empty string, and git checkout will fail.

Move the ARG after FROM, or re-declare it:

Proposed fix
-ARG LLAMA_FACTORY_VERSION="v0.9.4"
 FROM python:3.13-trixie
+ARG LLAMA_FACTORY_VERSION="v0.9.4"

Additionally, as noted in a previous review, this code block should have a dockerfile language identifier, and git clone --depth 1 won't fetch the tag — use --branch ${LLAMA_FACTORY_VERSION} instead.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 102 -
116, The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.

Comment on lines +291 to +296
# Change the MLFLOW_TRACKING_URI to your actual mlflow service endpoint
- name: MLFLOW_TRACKING_URI
value: 'http://mlflow-tracking-server.aml-system.svc.cluster.local:5000'
value: "http://mlflow-tracking-server.kubeflow:5000"
# Set MLFLOW_EXPERIMENT_NAME to your namespace or your own experiment name
- name: MLFLOW_EXPERIMENT_NAME
value: kubeflow-admin-cpaas-io
value: mlops-demo-ai-test

⚠️ Potential issue | 🟡 Minor

Inconsistent MLFlow tracking URIs between examples.

Line 293 uses http://mlflow-tracking-server.kubeflow:5000 while line 626 uses http://mlflow-tracking-server.aml-system.svc.cluster.local:5000. These point to different namespaces (kubeflow vs aml-system) and use different DNS formats (short vs FQDN). This could confuse users copying the examples.

Consider using a consistent placeholder like http://<mlflow-tracking-server-endpoint>:5000 in both, or at minimum use the same namespace and format.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 291 -
296, The MLFLOW_TRACKING_URI value is inconsistent across examples; update both
occurrences of the MLFLOW_TRACKING_URI environment variable (the value currently
set to "http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant)
to the same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.


> **Note: You need internet access to run below example notebook, since you need to install python packages, download datasets in this notebook.**

Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.

⚠️ Potential issue | 🟡 Minor

Capitalize "GitHub" and improve link text.

"github" should be capitalized per its official branding. The link text "github workbench howtos" is also vague — consider a more descriptive label.

-Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.
+Download `kubeflow_trainer_mnist.ipynb` from the [GitHub workbench how-to directory](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag-and-drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using PyTorch.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~112-~112: The official name of this software platform is spelled with a capital “H”.
Context: ...ad kubeflow_trainer_mnist.ipynb from [github workbench howtos](https://github.com/al...

(GITHUB)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 112, Update
the link text and branding in the sentence that begins "Download
`kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and replacing the
vague link label "github workbench howtos" with a more descriptive label (for
example: "AML Workbench how-tos on GitHub" or "the AML Workbench how-tos
repository on GitHub"), keeping the same URL; locate the exact string "Download
`kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.
