Add notebook fine tuning and trainer quickstart #98
base: master
Conversation
Walkthrough

Reframes fine-tuning docs to "Alauda AI Workbench", replaces fixed endpoints with placeholders, switches Dockerfile → Containerfile, expands runtime/MLFlow/env/PVC/GPU/security details, and adds two Kubeflow Trainer how-tos: a PyTorch DDP notebook and a ClusterTrainingRuntime quick-start.

Sequence Diagram(s): skipped. Estimated code review effort: 🎯 3 (Moderate), ⏱️ ~20 minutes.
🚥 Pre-merge checks: ✅ 3 of 3 passed.
No actionable comments were generated in the recent review.
Actionable comments posted: 14
🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 440: Replace the unhyphenated phrase "fine tuning" with the hyphenated
form "fine-tuning" in the sentence that begins with "model cache PVC
(optional):" (the phrase appears as "fine tuning experiments" and should be
"fine-tuning experiments") so the compound modifier is consistent with the rest
of the document.
- Line 102: Update the Containerfile fenced code block so it includes a language
identifier by replacing the opening triple backticks with ```dockerfile (and
keep the matching closing triple backticks), i.e., locate the Containerfile code
block in the fine_tunning_using_notebooks.mdx content and add the dockerfile
language tag to the opening fence so syntax highlighting is enabled.
- Around line 298-312: The main-container shell uses "set -ex" which will print
commands (including the sensitive git push with ${gitauth}) to logs; change the
shell behavior to avoid tracing secrets by removing or limiting "-x" (use "set
-e" only) or temporarily disabling xtrace around the sensitive operation (turn
off xtrace before the git push that uses ${gitauth} and re-enable afterward).
Locate the main container block where "set -ex" is set and the git push that
embeds "${gitauth}" and update it so command tracing does not expose
credentials.
- Around line 228-238: The script currently uses "set -ex" which prints every
command (including the gitauth/GIT_USER:GIT_TOKEN used in the git clone command
and BASE_MODEL_URL) to logs; change this to avoid leaking credentials by
removing the -x (use "set -e") or by disabling tracing around the sensitive
commands: keep tracing off when building gitauth and when running the git clone
and lfs pull (use "set +x" before the git clone
"https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set -x" after), or replace
inline credentials with a git credential helper; update occurrences of set -ex,
gitauth, and the git clone / lfs pull invocations accordingly.
- Line 80: Replace the full-width right parenthesis used in the example call to
datasets.load_dataset (the symbol shown as )) with the standard ASCII right
parenthesis ) so the snippet "ds = datasets.load_dataset(<dataset directory>)"
becomes valid Python; locate the example line in the
fine_tunning_using_notebooks.mdx file and correct the punctuation for the
datasets.load_dataset invocation.
- Line 515: Fix the typos in the Note sentence starting with "**Note:** The
following steps can also be adapt to LLM pre-training and traditional ML
senarios." by changing "adapt" to "adapted" and "senarios" to "scenarios" so the
sentence reads "...can also be adapted to LLM pre-training and traditional ML
scenarios." Update that exact string in the document.
- Around line 243-251: The two unconditional deletions ("rm -rf ${DATASET_NAME}"
and "rm -rf data") make the subsequent existence check around ${DATASET_NAME}
dead code so the dataset is always re-cloned; remove those rm -rf lines (or move
them to run only when you explicitly want to force-refresh) so the if [ -d
${DATASET_NAME} ] { echo "dataset ${DATASET_NAME} already exists skipping
download" } else { git ... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" }
branch can correctly skip cloning when the dataset is present.
- Line 377: The shell conditional uses an unquoted variable (DO_MERGE) which
will break if unset or empty; fix the conditional that currently reads the test
against DO_MERGE by quoting the variable reference and using a POSIX-compatible
string comparison operator (or switch to a bash [[ ... ]] test) so the test
becomes robust when DO_MERGE is empty or unset; update the conditional that
contains DO_MERGE to use the quoted form and a proper "=" comparison.
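
To illustrate the `DO_MERGE` quoting fix above, here is a minimal sketch; the variable name comes from the review, while the echoed messages are placeholders:

```shell
#!/bin/sh
# With DO_MERGE unset, an unquoted test like [ $DO_MERGE = "true" ]
# expands to [ = "true" ] and fails with a syntax error.
# Quoting the variable keeps the test well-formed in POSIX sh.
unset DO_MERGE
if [ "${DO_MERGE}" = "true" ]; then
  echo "merging adapter into base model"
else
  echo "skipping merge"
fi
```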
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb`:
- Around line 295-301: Replace or remove the notebook cell output that exposes
the internal registry URL in the Runtime object's image field; locate the
runtime printout (Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.
- Around line 314-339: The narrative and saved outputs describe training on 3
PyTorch nodes but the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.
- Around line 160-167: Fix the typo in the inline comment above the DataLoader
instantiation: change "Shard the dataset accross workers." to "Shard the dataset
across workers." in the notebook cell where train_loader is created (referencing
DataLoader, dataset, batch_size, and DistributedSampler).
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 114: Fix the typo in the sentence starting "For more informatoin about
how to use **Kubeflow Trainer v2**..." by replacing "informatoin" with
"information" and add a trailing period at the end of the sentence so it reads:
"For more information about how to use **Kubeflow Trainer v2**, please refer to
[Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)."
- Line 15: Replace the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.
- Around line 17-37: Add the Dockerfile language identifier to the fenced code
block that starts with "FROM python:3.13-trixie" so the block becomes
```dockerfile; edit the markdown section containing the Dockerfile snippet (the
fenced block showing FROM python:3.13-trixie, ARG USERNAME, RUN pip install ...,
etc.) and prepend "dockerfile" immediately after the opening triple backticks to
satisfy markdownlint MD040 and enable proper syntax highlighting.
```dockerfile
FROM nvcr.io/nvidia/pytorch:24.12-py3
```
Specify a language for the fenced code block.
Same as in the quick-start doc, the Containerfile code block should have a language identifier (e.g., dockerfile).
Proposed fix

-```
+```dockerfile

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 102,
Update the Containerfile fenced code block so it includes a language identifier
by replacing the opening triple backticks with ```dockerfile (and keep the
matching closing triple backticks), i.e., locate the Containerfile code block in
the fine_tunning_using_notebooks.mdx content and add the dockerfile language tag
to the opening fence so syntax highlighting is enabled.
```shell
set -ex
cd /mnt/models
BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
# Download base model
gitauth="${GIT_USER}:${GIT_TOKEN}"
BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//https:\/\/}"
if [ -d ${BASE_MODEL_NAME} ]; then
  echo "${BASE_MODEL_NAME} dir already exists, skip downloading"
else
  GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
  (cd ${BASE_MODEL_NAME} && git -c http.sslVerify=false -c lfs.activitytimeout=36000 lfs pull)
```
Credentials may leak to logs via set -ex.
set -ex causes every command to be printed before execution, including the git clone "https://${gitauth}@..." commands on lines 237, 238, and 250. This will expose GIT_USER:GIT_TOKEN in pod logs.
Consider either:
- Using `set -e` instead of `set -ex` (remove trace), or
- Wrapping sensitive commands with `set +x` / `set -x`, or
- Using git credential helpers instead of inline URL credentials.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 228 -
238, The script currently uses "set -ex" which prints every command (including
the gitauth/GIT_USER:GIT_TOKEN used in the git clone command and BASE_MODEL_URL)
to logs; change this to avoid leaking credentials by removing the -x (use "set
-e") or by disabling tracing around the sensitive commands: keep tracing off
when building gitauth and when running the git clone and lfs pull (use "set +x"
before the git clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set
-x" after), or replace inline credentials with a git credential helper; update
occurrences of set -ex, gitauth, and the git clone / lfs pull invocations
accordingly.
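
As a sketch of the `set +x` wrapping the comment recommends (the credential value is a placeholder, and the actual `git clone` is replaced by an `echo` so the snippet runs offline):

```shell
#!/bin/bash
set -e            # still abort on the first error
set -x            # tracing is fine for non-sensitive commands
cd /tmp

{ set +x; } 2>/dev/null   # disable tracing; the group redirect hides the 'set +x' line itself
gitauth="${GIT_USER:-user}:${GIT_TOKEN:-secret-token}"
# The real script would run: git clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
echo "cloning base model (credentials not traced)"
set -x            # re-enable tracing for the rest of the script
```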
```shell
rm -rf ${DATASET_NAME}
rm -rf data

if [ -d ${DATASET_NAME} ]; then
  echo "dataset ${DATASET_NAME} already exists skipping download"
else
  git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
```
Bug: rm -rf makes the subsequent existence check dead code.
Lines 244–245 unconditionally delete ${DATASET_NAME} and data, so the if [ -d ${DATASET_NAME} ] check on line 247 will never be true. This means the "already exists, skipping download" path is unreachable, and the dataset is always re-downloaded.
It looks like the intent was to skip downloading if already cached (matching the model logic above). Remove the rm -rf lines, or restructure:
Proposed fix
- rm -rf ${DATASET_NAME}
- rm -rf data
-
if [ -d ${DATASET_NAME} ]; then
echo "dataset ${DATASET_NAME} already exists skipping download"
else
git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rm -rf ${DATASET_NAME} | |
| rm -rf data | |
| if [ -d ${DATASET_NAME} ]; then | |
| echo "dataset ${DATASET_NAME} already exists skipping download" | |
| else | |
| git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" | |
| fi | |
| if [ -d ${DATASET_NAME} ]; then | |
| echo "dataset ${DATASET_NAME} already exists skipping download" | |
| else | |
| git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" | |
| fi |
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 243 -
251, The two unconditional deletions ("rm -rf ${DATASET_NAME}" and "rm -rf
data") make the subsequent existence check around ${DATASET_NAME} dead code so
the dataset is always re-cloned; remove those rm -rf lines (or move them to run
only when you explicitly want to force-refresh) so the if [ -d ${DATASET_NAME} ]
{ echo "dataset ${DATASET_NAME} already exists skipping download" } else { git
... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" } branch can correctly
skip cloning when the dataset is present.
```yaml
- bash
- -c
- |
  set -ex
  echo "job workers list: ${VC_WORKER_HOSTS}"
  if [ "${VC_WORKER_HOSTS}" != "" ]; then
    export N_RANKS=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print NF}')
    export RANK=$VC_TASK_INDEX
    export MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print $1}')
    export RANK=$RANK
    export WORLD_SIZE=$N_RANKS
    export NNODES=$N_RANKS
    export NODE_RANK=$RANK
    export MASTER_ADDR=${MASTER_HOST}
    export MASTER_PORT="8888"
```
Same credential leakage concern in the main container's set -ex block.
Lines 301 and 410 have the same issue as the init container — set -ex will trace the git push command (line 410) that embeds ${gitauth} in the URL, leaking credentials to pod logs.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 298 -
312, The main-container shell uses "set -ex" which will print commands
(including the sensitive git push with ${gitauth}) to logs; change the shell
behavior to avoid tracing secrets by removing or limiting "-x" (use "set -e"
only) or temporarily disabling xtrace around the sensitive operation (turn off
xtrace before the git push that uses ${gitauth} and re-enable afterward). Locate
the main container block where "set -ex" is set and the git push that embeds
"${gitauth}" and update it so command tracing does not expose credentials.
```json
{
  "name": "stdout",
  "output_type": "stream",
  "text": [
    "Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)\n"
  ]
}
```
Output contains an internal registry URL.
The output at line 299 exposes 152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2, which appears to be an internal/private registry address. For public-facing documentation, consider clearing this output or replacing it with a sanitized example using the public image reference (alaudadockerhub/torch-distributed:v2.9.1-aml2).
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 295 -
301, Replace or remove the notebook cell output that exposes the internal
registry URL in the Runtime object's image field; locate the runtime printout
(Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.
```json
    "## Run the Distributed TrainJob\n",
    "\n",
    "Kubeflow TrainJob will train the above model on 3 PyTorch nodes."
  ]
},
{
  "cell_type": "code",
  "execution_count": 23,
  "metadata": {},
  "outputs": [],
  "source": [
    "job_name = client.train(\n",
    "    trainer=CustomTrainer(\n",
    "        func=train_fashion_mnist,\n",
    "        # Set how many PyTorch nodes you want to use for distributed training.\n",
    "        num_nodes=1,\n",
    "        # Set the resources for each PyTorch node.\n",
    "        resources_per_node={\n",
    "            \"cpu\": 2,\n",
    "            \"memory\": \"8Gi\",\n",
    "            # Uncomment this to distribute the TrainJob using GPU nodes.\n",
    "            # \"nvidia.com/gpu\": 1,\n",
    "        },\n",
    "    ),\n",
    "    runtime=torch_runtime,\n",
    ")"
```
Inconsistency: markdown says "3 PyTorch nodes" but num_nodes=1.
The markdown cells at lines 316, 350, and 404 all reference 3 nodes/GPUs (e.g., "train the above model on 3 PyTorch nodes", "distributed training across 3 nodes", "we run training on 3 GPUs"). However, the code at line 329 sets num_nodes=1. The saved outputs (lines 382–387) also show 3 nodes, suggesting the notebook was run with a different configuration than what's in the code cells.
Either update num_nodes to 3 to match the narrative and outputs, or update the markdown text and clear/update the outputs to match num_nodes=1.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 314 -
339, The narrative and saved outputs describe training on 3 PyTorch nodes but
the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.
## Prepare Runtime Image

Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
Minor grammar: "a image" → "an image".
-Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
+Create a `torch_distributed.Containerfile` from the contents below and build an image. Or you can use the pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 15, Replace
the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.
```
FROM python:3.13-trixie
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources
RUN apt-get update && \
    apt-get install -y build-essential

RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip && \
    pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
    "torch==2.9.1" \
    "torchvision==0.24.1"

RUN groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME

WORKDIR /workspace
RUN chown $USERNAME:$USER_GID /workspace
```
🧩 Analysis chain

🏁 Script executed:

```shell
find . -type f -name "kubeflow_trainer_quick_start.md" | head -20
```

Repository: alauda/aml-docs
Length of output: 117
🌐 Web query:
PyTorch 2.9.1 torchvision compatibility matrix versions
💡 Result:
For PyTorch 2.9.1, the officially paired torchvision version is:
| torch | torchvision |
|---|---|
| 2.9.1 | 0.24.1 |
| 2.9.0 | 0.24.0 |
| 2.8.0 | 0.23.0 |
Sources: PyTorch “Previous Versions” install commands (these reflect the supported matching domain-library versions). [1]
🏁 Script executed:

```shell
sed -n '17,37p' ./docs/en/workbench/how_to/kubeflow_trainer_quick_start.md | cat -A
```

Repository: alauda/aml-docs
Length of output: 696
Add language identifier (dockerfile) to the fenced code block per markdownlint MD040.
The code block requires a language identifier for proper syntax highlighting and markdown linting compliance.
Proposed fix
-```
+```dockerfile
FROM python:3.13-trixie

Note: The torch and torchvision versions are correctly paired (torch 2.9.1 officially matches torchvision 0.24.1 per PyTorch's official release matrix).
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 17-17: Fenced code blocks should have a language specified (MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` around lines 17 -
37, Add the Dockerfile language identifier to the fenced code block that starts
with "FROM python:3.13-trixie" so the block becomes ```dockerfile; edit the
markdown section containing the Dockerfile snippet (the fenced block showing
FROM python:3.13-trixie, ARG USERNAME, RUN pip install ..., etc.) and prepend
"dockerfile" immediately after the opening triple backticks to satisfy
markdownlint MD040 and enable proper syntax highlighting.
Download [kubeflow_trainer_mnist.ipynb](./kubeflow_trainer_mnist.ipynb) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.

For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
Typo: "informatoin" → "information", and add a trailing period.
-For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
+For more information about how to use **Kubeflow Trainer v2**, please refer to the [Kubeflow Documentation](https://www.kubeflow.org/docs/components/trainer/).
🧰 Tools
🪛 LanguageTool
[grammar] ~114-~114: Ensure spelling is correct
Context: ...t a TrainJob using pytorch. For more informatoin about how to use *Kubeflow Trainer v2...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 114, Fix
the typo in the sentence starting "For more informatoin about how to use
**Kubeflow Trainer v2**..." by replacing "informatoin" with "information" and
add a trailing period at the end of the sentence so it reads: "For more
information about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow
Document](https://www.kubeflow.org/docs/components/trainer/)."
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 185: Replace the personal PVC name and output model URL that use the
`wy-` prefix with generic, documentation-friendly placeholders: change the
claimName value `wy-model-cache` to `model-cache` or `<your-model-cache-pvc>`
(refer to the claimName key), and update the output model URL `wy-sft-output` to
a generic name like `sft-output` or `<your-sft-output>` (refer to the output
model URL string). Ensure both occurrences (the claimName and the output model
URL) are consistent with other placeholders such as
`<your-model-registry-git-address>`.
- Around line 456-478: Update the StorageClass example to fix the likely typo in
mountPermissions and avoid hardcoded IPs: change mountPermissions from "0757" to
a more typical value (e.g., "0755" or "0775") in the StorageClass snippet
(referencing mountPermissions and the ai-nfs StorageClass) and replace the
hardcoded server IP (server: 192.168.17.28) with a placeholder such as
<your-nfs-server-ip> so the example is consistent and safe to reuse.
- Around line 113-115: The shallow clone uses "git clone --depth 1" which may
not include the tag referenced by LLAMA_FACTORY_VERSION, causing "git checkout
${LLAMA_FACTORY_VERSION}" to fail; update the clone invocation that contains
"git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git" to either
include the tag/branch with "--branch ${LLAMA_FACTORY_VERSION} --depth 1" or
perform a full clone (remove "--depth 1") so that "git checkout
${LLAMA_FACTORY_VERSION}" succeeds.
- Line 407: The git lfs track command currently uses an unquoted glob (the
pattern *.safetensors) which can be expanded by the shell; update the occurrence
of the git lfs track invocation so the glob pattern is quoted (surround the
*.safetensors pattern with quotes) to ensure the tracking rule is registered
instead of being expanded to filenames.
- Around line 638-647: The YAML example in the non-Nvidia GPU section
incorrectly uses Nvidia-specific resource names (nvidia.com/gpualloc,
nvidia.com/gpucores, nvidia.com/gpumem); update those keys to vendor-agnostic
placeholders (e.g., <vendor>.com/gpualloc, <vendor>.com/gpucores,
<vendor>.com/gpumem or a single generic accelerator resource like
<vendor>.com/accelerator) to match the guidance on line 521 and avoid
Nvidia-specific terminology in this section.
- Around line 305-307: Remove the redundant self-assignment of the environment
variable: delete the line that does "export RANK=$RANK" (the no-op after "export
RANK=$VC_TASK_INDEX"); leave the "export RANK=$VC_TASK_INDEX" and "export
MASTER_HOST=..." lines intact so RANK is set only once from VC_TASK_INDEX.
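
Two of the fixes above — pinning the shallow clone to the tag, and quoting the `git lfs track` glob — can be sketched in shell. The tag value comes from the reviewed Containerfile; the git commands are prefixed with `echo` so the sketch runs without network access or a repository:

```shell
#!/bin/sh
LLAMA_FACTORY_VERSION="v0.9.4"

# A shallow clone of the default branch may not contain the tag, so a later
# `git checkout ${LLAMA_FACTORY_VERSION}` would fail; pin the clone instead.
echo git clone --branch "${LLAMA_FACTORY_VERSION}" --depth 1 \
  https://github.com/hiyouga/LLaMA-Factory.git

# Quote the glob so the shell cannot expand it against local files;
# git-lfs must receive the literal pattern.
echo git lfs track "*.safetensors"
```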
Deploying alauda-ai with Cloudflare Pages

| Latest commit: | 0fa921a |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://8e38e0e4.alauda-ai.pages.dev |
| Branch Preview URL: | https://add-fine-tuning-and-trainer.alauda-ai.pages.dev |
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 291-296: The MLFLOW_TRACKING_URI value is inconsistent across
examples; update both occurrences of the MLFLOW_TRACKING_URI environment
variable (the value currently set to
"http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant) to the
same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.
- Around line 102-116: The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.
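
A minimal sketch of the `ARG` placement fix (the version and clone URL are the ones from the reviewed file; the other `RUN` steps are elided):

```dockerfile
FROM python:3.13-trixie
# Re-declare the build arg after FROM; an ARG declared before FROM is only
# in scope for the FROM line itself, so ${LLAMA_FACTORY_VERSION} would
# otherwise expand to an empty string in the RUN steps below.
ARG LLAMA_FACTORY_VERSION="v0.9.4"

RUN cd /opt && \
    git clone --branch "${LLAMA_FACTORY_VERSION}" --depth 1 \
        https://github.com/hiyouga/LLaMA-Factory.git
```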
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 112: Update the link text and branding in the sentence that begins
"Download `kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and
replacing the vague link label "github workbench howtos" with a more descriptive
label (for example: "AML Workbench how-tos on GitHub" or "the AML Workbench
how-tos repository on GitHub"), keeping the same URL; locate the exact string
"Download `kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.
````
```
ARG LLAMA_FACTORY_VERSION="v0.9.4"
FROM python:3.13-trixie

RUN sed -i 's@//.*archive.ubuntu.com@//mirrors.ustc.edu.cn@g' /etc/apt/sources.list && \
    sed -i 's/security.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && \
RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources && \
    apt-get update && \
    export DEBIAN_FRONTEND=noninteractive && \
    apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg && \
    apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg default-libmysqlclient-dev build-essential pkg-config && \
    apt clean && rm -rf /var/lib/apt/lists/*

RUN cd /opt && \
RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
    cd /opt && \
    git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
    cd LLaMA-Factory && \
    pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple && \
    pip install --no-cache-dir "transformers==4.51.1" "tokenizers==0.21.1" -i https://pypi.tuna.tsinghua.edu.cn/simple

RUN apt-get update && apt-get install -y default-libmysqlclient-dev build-essential pkg-config && \
    pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
    pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple \
        "sqlalchemy==2.0.30" "pymysql==1.1.1" "loguru==0.7.2" "mysqlclient==2.2.7" "mlflow"
    cd LLaMA-Factory && git checkout ${LLAMA_FACTORY_VERSION} && \
    pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple
```
````
Bug: ARG declared before FROM is not available in subsequent RUN instructions.
In a Containerfile/Dockerfile, an `ARG` declared before `FROM` is only in scope for `FROM` instructions; after `FROM` it is out of scope until re-declared. So `${LLAMA_FACTORY_VERSION}` on line 115 will resolve to an empty string, and `git checkout` will fail.
Move the ARG after FROM, or re-declare it:
Proposed fix

```diff
-ARG LLAMA_FACTORY_VERSION="v0.9.4"
 FROM python:3.13-trixie
+ARG LLAMA_FACTORY_VERSION="v0.9.4"
```

Additionally, as noted in a previous review, this code block should have a `dockerfile` language identifier, and `git clone --depth 1` won't fetch the tag; use `--branch ${LLAMA_FACTORY_VERSION}` instead.
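Combining both fixes, one possible shape for the header of this Containerfile — a sketch only, with the base image, version tag, and repository URL taken from the reviewed hunk and all other build steps elided:

```dockerfile
FROM python:3.13-trixie

# Declared after FROM, so the value is visible to the RUN steps below;
# an ARG placed before FROM is only usable in FROM lines.
ARG LLAMA_FACTORY_VERSION="v0.9.4"

# --branch accepts tags as well as branches, so the exact release can be
# fetched shallowly without a separate `git checkout` afterwards.
RUN cd /opt && \
    git clone --branch ${LLAMA_FACTORY_VERSION} --depth 1 \
        https://github.com/hiyouga/LLaMA-Factory.git
```

With this layout, changing the release is a single `--build-arg LLAMA_FACTORY_VERSION=...` at build time.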
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 102 -
116, The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.
```
# Change the MLFLOW_TRACKING_URI to your actual mlflow service endpoint
- name: MLFLOW_TRACKING_URI
  value: 'http://mlflow-tracking-server.aml-system.svc.cluster.local:5000'
  value: "http://mlflow-tracking-server.kubeflow:5000"
# Set MLFLOW_EXPERIMENT_NAME to your namespace or your own experiment name
- name: MLFLOW_EXPERIMENT_NAME
  value: kubeflow-admin-cpaas-io
  value: mlops-demo-ai-test
```
Inconsistent MLflow tracking URIs between examples.
Line 293 uses `http://mlflow-tracking-server.kubeflow:5000` while line 626 uses `http://mlflow-tracking-server.aml-system.svc.cluster.local:5000`. These point to different namespaces (`kubeflow` vs `aml-system`) and use different DNS formats (short name vs FQDN), which could confuse users copying the examples.
Consider using a consistent placeholder such as `http://<mlflow-tracking-server-endpoint>:5000` in both, or at minimum the same namespace and format.
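A sketch of how both examples could read with a shared placeholder — the actual service name, namespace, and port depend on how MLflow is deployed in your cluster, and the `<...>` values are stand-ins to be replaced:

```yaml
env:
  # Replace with your MLflow tracking server's in-cluster DNS name,
  # e.g. http://<service>.<namespace>.svc.cluster.local:5000
  - name: MLFLOW_TRACKING_URI
    value: "http://<mlflow-tracking-server-endpoint>:5000"
  # Your namespace, or any experiment name you prefer
  - name: MLFLOW_EXPERIMENT_NAME
    value: "<your-experiment-name>"
```

Using one placeholder everywhere keeps the docs copy-paste safe regardless of which namespace MLflow actually runs in.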
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 291 -
296, The MLFLOW_TRACKING_URI value is inconsistent across examples; update both
occurrences of the MLFLOW_TRACKING_URI environment variable (the value currently
set to "http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant)
to the same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.
```md
> **Note: You need internet access to run below example notebook, since you need to install python packages, download datasets in this notebook.**

Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.
```
Capitalize "GitHub" and improve link text.
"github" should be capitalized per its official branding. The link text "github workbench howtos" is also vague — consider a more descriptive label.
-Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.
+Download `kubeflow_trainer_mnist.ipynb` from the [GitHub workbench how-to directory](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag-and-drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using PyTorch.🧰 Tools
🪛 LanguageTool
[uncategorized] ~112-~112: The official name of this software platform is spelled with a capital “H”.
Context: ...ad kubeflow_trainer_mnist.ipynb from [github workbench howtos](https://github.com/al...
(GITHUB)
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 112, Update
the link text and branding in the sentence that begins "Download
`kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and replacing the
vague link label "github workbench howtos" with a more descriptive label (for
example: "AML Workbench how-tos on GitHub" or "the AML Workbench how-tos
repository on GitHub"), keeping the same URL; locate the exact string "Download
`kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.
Summary by CodeRabbit