
Conversation

@typhoonzero
Contributor

@typhoonzero typhoonzero commented Feb 11, 2026

Summary by CodeRabbit

  • Documentation
    • Rebranded fine-tuning guide to "Workbench": clarified Workbench-driven workflows, configurable model/dataset/output endpoints, container/runtime and PVC guidance, MLflow experiment tracking, non‑NVIDIA GPU notes, and updated publishing/inference guidance.
    • Added PyTorch DDP notebook demonstrating distributed Fashion MNIST training with Kubeflow Trainer.
    • Added Kubeflow Trainer v2 quick-start for distributed PyTorch on Kubernetes.

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Walkthrough

Reframes fine-tuning docs to "Alauda AI Workbench", replaces fixed endpoints with placeholders, switches Dockerfile→Containerfile, expands runtime/MLFlow/env/PVC/GPU/security details, and adds two Kubeflow Trainer how-tos: a PyTorch DDP notebook and a ClusterTrainingRuntime quick-start.

Changes

  • Fine-tuning documentation (`docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`): Rebranded to "Alauda AI Workbench"; swapped hard-coded model/dataset URLs for placeholders; changed Dockerfile → Containerfile; updated PVC/storage keys (e.g., wy-model-cache, sc-topolvm), added env vars (HF_HOME, DO_MERGE, BASE_MODEL_URL, DATASET_URL, OUTPUT_MODEL_URL), adjusted MLflow config, added securityContext/shm settings, clarified VolcanoJob wording and inference publishing notes, and refined non-NVIDIA GPU vendor-neutral guidance.
  • Kubeflow Trainer — notebook (`docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb`): New Jupyter notebook demonstrating PyTorch DDP (model, init, DistributedSampler, DDP), local execution via Kubeflow TrainerClient, runtime discovery, TrainJob/Train step submission, log streaming and monitoring, and admin/cleanup hints.
  • Kubeflow Trainer — quick start (`docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`): New quick-start guide for building a torch-distributed runtime (Containerfile) or using prebuilt image, creating a ClusterTrainingRuntime YAML with security/shared memory, submitting a TrainJob example for MNIST, and notes on networking, runtimes, and next steps.

Sequence Diagram(s)

(Skipped)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • zhaomingkun1030

Poem

🐇 I hopped through docs and nudged each line with care,
Replaced hard links with placeholders fair.
Built runtimes, cached models, mounts held tight,
Kubeflow trains at dawn and logs by night,
Hop — Workbench gardens grow in shared delight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The pull request title accurately describes the main changes: it adds documentation for fine-tuning with notebooks and a Kubeflow Trainer quickstart guide.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



No actionable comments were generated in the recent review. 🎉



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 440: Replace the unhyphenated phrase "fine tuning" with the hyphenated
form "fine-tuning" in the sentence that begins with "model cache PVC
(optional):" (the phrase appears as "fine tuning experiments" and should be
"fine-tuning experiments") so the compound modifier is consistent with the rest
of the document.
- Line 102: Update the Containerfile fenced code block so it includes a language
identifier by replacing the opening triple backticks with ```dockerfile (and
keep the matching closing triple backticks), i.e., locate the Containerfile code
block in the fine_tunning_using_notebooks.mdx content and add the dockerfile
language tag to the opening fence so syntax highlighting is enabled.
- Around line 298-312: The main-container shell uses "set -ex" which will print
commands (including the sensitive git push with ${gitauth}) to logs; change the
shell behavior to avoid tracing secrets by removing or limiting "-x" (use "set
-e" only) or temporally disabling xtrace around the sensitive operation (turn
off xtrace before the git push that uses ${gitauth} and re-enable afterward).
Locate the main container block where "set -ex" is set and the git push that
embeds "${gitauth}" and update it so command tracing does not expose
credentials.
- Around line 228-238: The script currently uses "set -ex" which prints every
command (including the gitauth/GIT_USER:GIT_TOKEN used in the git clone command
and BASE_MODEL_URL) to logs; change this to avoid leaking credentials by
removing the -x (use "set -e") or by disabling tracing around the sensitive
commands: keep tracing off when building gitauth and when running the git clone
and lfs pull (use "set +x" before the git clone
"https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set -x" after), or replace
inline credentials with a git credential helper; update occurrences of set -ex,
gitauth, and the git clone / lfs pull invocations accordingly.
- Line 80: Replace the full-width right parenthesis used in the example call to
datasets.load_dataset (the symbol shown as )) with the standard ASCII right
parenthesis ) so the snippet "ds = datasets.load_dataset(<dataset directory>)"
becomes valid Python; locate the example line in the
fine_tunning_using_notebooks.mdx file and correct the punctuation for the
datasets.load_dataset invocation.
- Line 515: Fix the typos in the Note sentence starting with "**Note:** The
following steps can also be adapt to LLM pre-training and traditional ML
senarios." by changing "adapt" to "adapted" and "senarios" to "scenarios" so the
sentence reads "...can also be adapted to LLM pre-training and traditional ML
scenarios." Update that exact string in the document.
- Around line 243-251: The two unconditional deletions ("rm -rf ${DATASET_NAME}"
and "rm -rf data") make the subsequent existence check around ${DATASET_NAME}
dead code so the dataset is always re-cloned; remove those rm -rf lines (or move
them to run only when you explicitly want to force-refresh) so the if [ -d
${DATASET_NAME} ] { echo "dataset ${DATASET_NAME} already exists skipping
download" } else { git ... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" }
branch can correctly skip cloning when the dataset is present.
- Line 377: The shell conditional uses an unquoted variable (DO_MERGE) which
will break if unset or empty; fix the conditional that currently reads the test
against DO_MERGE by quoting the variable reference and using a POSIX-compatible
string comparison operator (or switch to a bash [[ ... ]] test) so the test
becomes robust when DO_MERGE is empty or unset; update the conditional that
contains DO_MERGE to use the quoted form and a proper "=" comparison (a sketch
follows this list).
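
For illustration, a minimal sketch of the quoted-variable fix described in the Line 377 item above; only the `DO_MERGE` name comes from the doc, the echo messages are hypothetical:

```bash
#!/usr/bin/env bash
# Quote the variable and use a POSIX "=" comparison so the test
# still works when DO_MERGE is unset or empty.
if [ "${DO_MERGE}" = "true" ]; then
    echo "merging the adapter into the base model"
else
    echo "DO_MERGE is not 'true'; skipping merge"
fi
```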

In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb`:
- Around line 295-301: Replace or remove the notebook cell output that exposes
the internal registry URL in the Runtime object's image field; locate the
runtime printout (Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.
- Around line 314-339: The narrative and saved outputs describe training on 3
PyTorch nodes but the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.
- Around line 160-167: Fix the typo in the inline comment above the DataLoader
instantiation: change "Shard the dataset accross workers." to "Shard the dataset
across workers." in the notebook cell where train_loader is created (referencing
DataLoader, dataset, batch_size, and DistributedSampler).

In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 114: Fix the typo in the sentence starting "For more informatoin about
how to use **Kubeflow Trainer v2**..." by replacing "informatoin" with
"information" and add a trailing period at the end of the sentence so it reads:
"For more information about how to use **Kubeflow Trainer v2**, please refer to
[Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)."
- Line 15: Replace the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.
- Around line 17-37: Add the Dockerfile language identifier to the fenced code
block that starts with "FROM python:3.13-trixie" so the block becomes
```dockerfile; edit the markdown section containing the Dockerfile snippet (the
fenced block showing FROM python:3.13-trixie, ARG USERNAME, RUN pip install ...,
etc.) and prepend "dockerfile" immediately after the opening triple backticks to
satisfy markdownlint MD040 and enable proper syntax highlighting.


```dockerfile
FROM nvcr.io/nvidia/pytorch:24.12-py3
```

⚠️ Potential issue | 🟡 Minor

Specify a language for the fenced code block.

Same as in the quick-start doc, the Containerfile code block should have a language identifier (e.g., dockerfile).

-```
+```dockerfile
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 102,
Update the Containerfile fenced code block so it includes a language identifier
by replacing the opening triple backticks with ```dockerfile (and keep the
matching closing triple backticks), i.e., locate the Containerfile code block in
the fine_tunning_using_notebooks.mdx content and add the dockerfile language tag
to the opening fence so syntax highlighting is enabled.

Comment on lines +228 to +238
set -ex
cd /mnt/models
BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
# Download base model
gitauth="${GIT_USER}:${GIT_TOKEN}"
BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//https:\/\/}"
if [ -d ${BASE_MODEL_NAME} ]; then
echo "${BASE_MODEL_NAME} dir already exists, skip downloading"
else
GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
(cd ${BASE_MODEL_NAME} && git -c http.sslVerify=false -c lfs.activitytimeout=36000 lfs pull)

⚠️ Potential issue | 🟠 Major

Credentials may leak to logs via set -ex.

set -ex causes every command to be printed before execution, including the git clone "https://${gitauth}@..." commands on lines 237, 238, and 250. This will expose GIT_USER:GIT_TOKEN in pod logs.

Consider either:

  • Using set -e instead of set -ex (remove trace), or
  • Wrapping sensitive commands with set +x / set -x (sketched below), or
  • Using git credential helpers instead of inline URL credentials.
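
A minimal sketch of the second option, reusing the variable names from the excerpt above; the rest of the init-container script is assumed unchanged:

```bash
set -ex
cd /mnt/models
BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
gitauth="${GIT_USER}:${GIT_TOKEN}"
BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//https:\/\/}"

# Disable xtrace only around the command that embeds credentials in the URL,
# so GIT_USER:GIT_TOKEN is never echoed into pod logs; re-enable it afterwards.
set +x
GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 \
    clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
set -x
```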
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 228 -
238, The script currently uses "set -ex" which prints every command (including
the gitauth/GIT_USER:GIT_TOKEN used in the git clone command and BASE_MODEL_URL)
to logs; change this to avoid leaking credentials by removing the -x (use "set
-e") or by disabling tracing around the sensitive commands: keep tracing off
when building gitauth and when running the git clone and lfs pull (use "set +x"
before the git clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set
-x" after), or replace inline credentials with a git credential helper; update
occurrences of set -ex, gitauth, and the git clone / lfs pull invocations
accordingly.

Comment on lines +243 to +251

rm -rf ${DATASET_NAME}
rm -rf data

if [ -d ${DATASET_NAME} ]; then
echo "dataset ${DATASET_NAME} already exists skipping download"
else
git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi

⚠️ Potential issue | 🔴 Critical

Bug: rm -rf makes the subsequent existence check dead code.

Lines 244–245 unconditionally delete ${DATASET_NAME} and data, so the if [ -d ${DATASET_NAME} ] check on line 247 will never be true. This means the "already exists, skipping download" path is unreachable, and the dataset is always re-downloaded.

It looks like the intent was to skip downloading if already cached (matching the model logic above). Remove the rm -rf lines, or restructure:

Proposed fix
-                rm -rf ${DATASET_NAME}
-                rm -rf data
-                 
                 if [ -d ${DATASET_NAME} ]; then
                     echo "dataset ${DATASET_NAME} already exists skipping download"
                 else
                     git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
                 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rm -rf ${DATASET_NAME}
rm -rf data
if [ -d ${DATASET_NAME} ]; then
echo "dataset ${DATASET_NAME} already exists skipping download"
else
git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
if [ -d ${DATASET_NAME} ]; then
echo "dataset ${DATASET_NAME} already exists skipping download"
else
git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 243 -
251, The two unconditional deletions ("rm -rf ${DATASET_NAME}" and "rm -rf
data") make the subsequent existence check around ${DATASET_NAME} dead code so
the dataset is always re-cloned; remove those rm -rf lines (or move them to run
only when you explicitly want to force-refresh) so the if [ -d ${DATASET_NAME} ]
{ echo "dataset ${DATASET_NAME} already exists skipping download" } else { git
... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" } branch can correctly
skip cloning when the dataset is present.

Comment on lines +298 to +312
- bash
- -c
- |
set -ex
echo "job workers list: ${VC_WORKER_HOSTS}"
if [ "${VC_WORKER_HOSTS}" != "" ]; then
export N_RANKS=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print NF}')
export RANK=$VC_TASK_INDEX
export MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print $1}')
export RANK=$RANK
export WORLD_SIZE=$N_RANKS
export NNODES=$N_RANKS
export NODE_RANK=$RANK
export MASTER_ADDR=${MASTER_HOST}
export MASTER_PORT="8888"

⚠️ Potential issue | 🟠 Major

Same credential leakage concern in the main container's set -ex block.

Lines 301 and 410 have the same issue as the init container — set -ex will trace the git push command (line 410) that embeds ${gitauth} in the URL, leaking credentials to pod logs.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 298 -
312, The main-container shell uses "set -ex" which will print commands
(including the sensitive git push with ${gitauth}) to logs; change the shell
behavior to avoid tracing secrets by removing or limiting "-x" (use "set -e"
only) or temporarily disabling xtrace around the sensitive operation (turn off
xtrace before the git push that uses ${gitauth} and re-enable afterward). Locate
the main container block where "set -ex" is set and the git push that embeds
"${gitauth}" and update it so command tracing does not expose credentials.

Comment on lines +295 to +301
{
"name": "stdout",
"output_type": "stream",
"text": [
"Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)\n"
]
}

⚠️ Potential issue | 🟡 Minor

Output contains an internal registry URL.

The output at line 299 exposes 152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2, which appears to be an internal/private registry address. For public-facing documentation, consider clearing this output or replacing it with a sanitized example using the public image reference (alaudadockerhub/torch-distributed:v2.9.1-aml2).

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 295 -
301, Replace or remove the notebook cell output that exposes the internal
registry URL in the Runtime object's image field; locate the runtime printout
(Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.

Comment on lines +314 to +339
"## Run the Distributed TrainJob\n",
"\n",
"Kubeflow TrainJob will train the above model on 3 PyTorch nodes."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"job_name = client.train(\n",
" trainer=CustomTrainer(\n",
" func=train_fashion_mnist,\n",
" # Set how many PyTorch nodes you want to use for distributed training.\n",
" num_nodes=1,\n",
" # Set the resources for each PyTorch node.\n",
" resources_per_node={\n",
" \"cpu\": 2,\n",
" \"memory\": \"8Gi\",\n",
" # Uncomment this to distribute the TrainJob using GPU nodes.\n",
" # \"nvidia.com/gpu\": 1,\n",
" },\n",
" ),\n",
" runtime=torch_runtime,\n",
")"

⚠️ Potential issue | 🟠 Major

Inconsistency: markdown says "3 PyTorch nodes" but num_nodes=1.

The markdown cells at lines 316, 350, and 404 all reference 3 nodes/GPUs (e.g., "train the above model on 3 PyTorch nodes", "distributed training across 3 nodes", "we run training on 3 GPUs"). However, the code at line 329 sets num_nodes=1. The saved outputs (lines 382–387) also show 3 nodes, suggesting the notebook was run with a different configuration than what's in the code cells.

Either update num_nodes to 3 to match the narrative and outputs, or update the markdown text and clear/update the outputs to match num_nodes=1.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 314 -
339, The narrative and saved outputs describe training on 3 PyTorch nodes but
the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.


## Prepare Runtime Image

Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.

⚠️ Potential issue | 🟡 Minor

Minor grammar: "a image" → "an image".

-Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
+Create a `torch_distributed.Containerfile` from the contents below and build an image. Or you can use the pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
📝 Committable suggestion


Suggested change
Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
Create a `torch_distributed.Containerfile` from the contents below and build an image. Or you can use the pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 15, Replace
the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.

Comment on lines +17 to +37
```
FROM python:3.13-trixie
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources
RUN apt-get update && \
apt-get install -y build-essential

RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip && \
pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
"torch==2.9.1" \
"torchvision==0.24.1"

RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME

WORKDIR /workspace
RUN chown $USERNAME:$USER_GID /workspace
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "kubeflow_trainer_quick_start.md" | head -20

Repository: alauda/aml-docs

Length of output: 117


🌐 Web query:

PyTorch 2.9.1 torchvision compatibility matrix versions

💡 Result:

For PyTorch 2.9.1, the officially paired torchvision version is:

| torch | torchvision |
|-------|-------------|
| 2.9.1 | 0.24.1 |
| 2.9.0 | 0.24.0 |
| 2.8.0 | 0.23.0 |

Sources: PyTorch “Previous Versions” install commands (these reflect the supported matching domain-library versions). [1]


🏁 Script executed:

sed -n '17,37p' ./docs/en/workbench/how_to/kubeflow_trainer_quick_start.md | cat -A

Repository: alauda/aml-docs

Length of output: 696


Add language identifier (dockerfile) to the fenced code block per markdownlint MD040.

The code block requires a language identifier for proper syntax highlighting and markdown linting compliance.

Proposed fix
-```
+```dockerfile
 FROM python:3.13-trixie

Note: The torch and torchvision versions are correctly paired (torch 2.9.1 officially matches torchvision 0.24.1 per PyTorch's official release matrix).

📝 Committable suggestion


Suggested change
```
FROM python:3.13-trixie
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID
RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources
RUN apt-get update && \
apt-get install -y build-essential
RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip && \
pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
"torch==2.9.1" \
"torchvision==0.24.1"
RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME
WORKDIR /workspace
RUN chown $USERNAME:$USER_GID /workspace
```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 17-17: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` around lines 17 -
37, Add the Dockerfile language identifier to the fenced code block that starts
with "FROM python:3.13-trixie" so the block becomes ```dockerfile; edit the
markdown section containing the Dockerfile snippet (the fenced block showing
FROM python:3.13-trixie, ARG USERNAME, RUN pip install ..., etc.) and prepend
"dockerfile" immediately after the opening triple backticks to satisfy
markdownlint MD040 and enable proper syntax highlighting.


Download [kubeflow_trainer_mnist.ipynb](./kubeflow_trainer_mnist.ipynb) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.

For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)

⚠️ Potential issue | 🟡 Minor

Typo: "informatoin" → "information", and add a trailing period.

-For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
+For more information about how to use **Kubeflow Trainer v2**, please refer to the [Kubeflow Documentation](https://www.kubeflow.org/docs/components/trainer/).
📝 Committable suggestion


Suggested change
For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
For more information about how to use **Kubeflow Trainer v2**, please refer to the [Kubeflow Documentation](https://www.kubeflow.org/docs/components/trainer/).
🧰 Tools
🪛 LanguageTool

[grammar] ~114-~114: Ensure spelling is correct
Context: ...t a TrainJob using pytorch. For more informatoin about how to use *Kubeflow Trainer v2...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 114, Fix
the typo in the sentence starting "For more informatoin about how to use
**Kubeflow Trainer v2**..." by replacing "informatoin" with "information" and
add a trailing period at the end of the sentence so it reads: "For more
information about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow
Document](https://www.kubeflow.org/docs/components/trainer/)."


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 185: Replace the personal PVC name and output model URL that use the
`wy-` prefix with generic, documentation-friendly placeholders: change the
claimName value `wy-model-cache` to `model-cache` or `<your-model-cache-pvc>`
(refer to the claimName key), and update the output model URL `wy-sft-output` to
a generic name like `sft-output` or `<your-sft-output>` (refer to the output
model URL string). Ensure both occurrences (the claimName and the output model
URL) are consistent with other placeholders such as
`<your-model-registry-git-address>`.
- Around line 456-478: Update the StorageClass example to fix the likely typo in
mountPermissions and avoid hardcoded IPs: change mountPermissions from "0757" to
a more typical value (e.g., "0755" or "0775") in the StorageClass snippet
(referencing mountPermissions and the ai-nfs StorageClass) and replace the
hardcoded server IP (server: 192.168.17.28) with a placeholder such as
<your-nfs-server-ip> so the example is consistent and safe to reuse.
- Around line 113-115: The shallow clone uses "git clone --depth 1" which may
not include the tag referenced by LLAMA_FACTORY_VERSION, causing "git checkout
${LLAMA_FACTORY_VERSION}" to fail; update the clone invocation that contains
"git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git" to either
include the tag/branch with "--branch ${LLAMA_FACTORY_VERSION} --depth 1" or
perform a full clone (remove "--depth 1") so that "git checkout
${LLAMA_FACTORY_VERSION}" succeeds (this and the "git lfs track" quoting below
are sketched after this list).
- Line 407: The git lfs track command currently uses an unquoted glob (the
pattern *.safetensors) which can be expanded by the shell; update the occurrence
of the git lfs track invocation so the glob pattern is quoted (surround the
*.safetensors pattern with quotes) to ensure the tracking rule is registered
instead of being expanded to filenames.
- Around line 638-647: The YAML example in the non-Nvidia GPU section
incorrectly uses Nvidia-specific resource names (nvidia.com/gpualloc,
nvidia.com/gpucores, nvidia.com/gpumem); update those keys to vendor-agnostic
placeholders (e.g., <vendor>.com/gpualloc, <vendor>.com/gpucores,
<vendor>.com/gpumem or a single generic accelerator resource like
<vendor>.com/accelerator) to match the guidance on line 521 and avoid
Nvidia-specific terminology in this section.
- Around line 305-307: Remove the redundant self-assignment of the environment
variable: delete the line that does "export RANK=$RANK" (the no-op after "export
RANK=$VC_TASK_INDEX"); leave the "export RANK=$VC_TASK_INDEX" and "export
MASTER_HOST=..." lines intact so RANK is set only once from VC_TASK_INDEX.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Feb 11, 2026

Deploying alauda-ai with Cloudflare Pages

Latest commit: 0fa921a
Status: ✅  Deploy successful!
Preview URL: https://8e38e0e4.alauda-ai.pages.dev
Branch Preview URL: https://add-fine-tuning-and-trainer.alauda-ai.pages.dev

View logs


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 291-296: The MLFLOW_TRACKING_URI value is inconsistent across
examples; update both occurrences of the MLFLOW_TRACKING_URI environment
variable (the value currently set to
"http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant) to the
same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.
- Around line 102-116: The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.

In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 112: Update the link text and branding in the sentence that begins
"Download `kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and
replacing the vague link label "github workbench howtos" with a more descriptive
label (for example: "AML Workbench how-tos on GitHub" or "the AML Workbench
how-tos repository on GitHub"), keeping the same URL; locate the exact string
"Download `kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.

Comment on lines +102 to +116
```
ARG LLAMA_FACTORY_VERSION="v0.9.4"
FROM python:3.13-trixie

RUN sed -i 's@//.*archive.ubuntu.com@//mirrors.ustc.edu.cn@g' /etc/apt/sources.list && \
sed -i 's/security.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && \
RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources && \
apt-get update && \
export DEBIAN_FRONTEND=noninteractive && \
apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg && \
apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg default-libmysqlclient-dev build-essential pkg-config && \
apt clean && rm -rf /var/lib/apt/lists/*

RUN cd /opt && \
RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
cd /opt && \
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
cd LLaMA-Factory && \
pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip install --no-cache-dir "transformers==4.51.1" "tokenizers==0.21.1" -i https://pypi.tuna.tsinghua.edu.cn/simple

RUN apt-get update && apt-get install -y default-libmysqlclient-dev build-essential pkg-config && \
pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple \
"sqlalchemy==2.0.30" "pymysql==1.1.1" "loguru==0.7.2" "mysqlclient==2.2.7" "mlflow"
cd LLaMA-Factory && git checkout ${LLAMA_FACTORY_VERSION} && \
pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple

⚠️ Potential issue | 🔴 Critical

Bug: ARG declared before FROM is not available in subsequent RUN instructions.

In a Containerfile/Dockerfile, an ARG before FROM is only usable within FROM directives. After FROM, all build args are reset. So ${LLAMA_FACTORY_VERSION} on line 115 will resolve to an empty string, and git checkout will fail.

Move the ARG after FROM, or re-declare it:

Proposed fix
-ARG LLAMA_FACTORY_VERSION="v0.9.4"
 FROM python:3.13-trixie
+ARG LLAMA_FACTORY_VERSION="v0.9.4"

Additionally, as noted in a previous review, this code block should have a dockerfile language identifier, and git clone --depth 1 won't fetch the tag — use --branch ${LLAMA_FACTORY_VERSION} instead.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 102 -
116, The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.

Comment on lines +291 to +296
# Change the MLFLOW_TRACKING_URI to your actual mlflow service endpoint
- name: MLFLOW_TRACKING_URI
value: 'http://mlflow-tracking-server.aml-system.svc.cluster.local:5000'
value: "http://mlflow-tracking-server.kubeflow:5000"
# Set MLFLOW_EXPERIMENT_NAME to your namespace or your own experiment name
- name: MLFLOW_EXPERIMENT_NAME
value: kubeflow-admin-cpaas-io
value: mlops-demo-ai-test

⚠️ Potential issue | 🟡 Minor

Inconsistent MLFlow tracking URIs between examples.

Line 293 uses http://mlflow-tracking-server.kubeflow:5000 while line 626 uses http://mlflow-tracking-server.aml-system.svc.cluster.local:5000. These point to different namespaces (kubeflow vs aml-system) and use different DNS formats (short vs FQDN). This could confuse users copying the examples.

Consider using a consistent placeholder like http://<mlflow-tracking-server-endpoint>:5000 in both, or at minimum use the same namespace and format.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 291 -
296, The MLFLOW_TRACKING_URI value is inconsistent across examples; update both
occurrences of the MLFLOW_TRACKING_URI environment variable (the value currently
set to "http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant)
to the same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.


> **Note: You need internet access to run below example notebook, since you need to install python packages, download datasets in this notebook.**

Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.

⚠️ Potential issue | 🟡 Minor

Capitalize "GitHub" and improve link text.

"github" should be capitalized per its official branding. The link text "github workbench howtos" is also vague — consider a more descriptive label.

-Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.
+Download `kubeflow_trainer_mnist.ipynb` from the [GitHub workbench how-to directory](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag-and-drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using PyTorch.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~112-~112: The official name of this software platform is spelled with a capital “H”.
Context: ...ad kubeflow_trainer_mnist.ipynb from [github workbench howtos](https://github.com/al...

(GITHUB)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 112, Update
the link text and branding in the sentence that begins "Download
`kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and replacing the
vague link label "github workbench howtos" with a more descriptive label (for
example: "AML Workbench how-tos on GitHub" or "the AML Workbench how-tos
repository on GitHub"), keeping the same URL; locate the exact string "Download
`kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.
