From 961231a8c472974f64a184933f80fcf7ce4f67b9 Mon Sep 17 00:00:00 2001 From: zgsu Date: Tue, 14 Apr 2026 23:13:28 +0800 Subject: [PATCH 1/3] Add Ascend NPU Workbench training guide --- ...e-tune-and-pretrain-llms-on-ascend-npu.mdx | 145 +++++++ .../how_to/fine_tunning_using_notebooks.mdx | 2 + docs/public/qwen25_pretrain_verify.ipynb | 409 ++++++++++++++++++ docs/public/qwen3_finetune_verify.ipynb | 390 +++++++++++++++++ 4 files changed, 946 insertions(+) create mode 100644 docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx create mode 100644 docs/public/qwen25_pretrain_verify.ipynb create mode 100644 docs/public/qwen3_finetune_verify.ipynb diff --git a/docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx b/docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx new file mode 100644 index 0000000..875e91f --- /dev/null +++ b/docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx @@ -0,0 +1,145 @@ +--- +weight: 27 +--- + +# Fine-tune and Pretrain LLMs on Ascend NPU Using Workbench + +## Background + +This guide describes a Workbench-based solution for running large model fine-tuning and pretraining on `arm64` nodes with Huawei Ascend NPU. The solution uses the `PyTorch CANN` workbench image, which is built for Ascend environments and includes `Python 3.12`, `CANN 8.5.0`, `PyTorch 2.9.0`, and `torch_npu 2.9.0`. + +The workflow is centered on two verification notebooks: + +- [Download `qwen3_finetune_verify.ipynb`](/qwen3_finetune_verify.ipynb) for full-parameter supervised fine-tuning of `Qwen3-8B` +- [Download `qwen25_pretrain_verify.ipynb`](/qwen25_pretrain_verify.ipynb) for pretraining `Qwen2.5-7B` + +Both notebooks use `MindSpeed-LLM` and are designed as validation-first examples. They begin with a lightweight configuration so that you can confirm the runtime, model loading, preprocessing, and distributed launch path before scaling the same workflow to a real training run. 
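Both notebooks size their distributed launch from the number of attached NPUs and the configured tensor (`TP`) and pipeline (`PP`) parallelism. The arithmetic they use — `DP = NPUs // (TP * PP)` and `GBS = DP * MBS` — is worth running as a pre-check before requesting workbench resources. A minimal sketch of that calculation (mirroring the notebooks' parameter cells, not their exact code):

```python
def parallel_layout(npus: int, tp: int, pp: int, mbs: int = 1):
    """Derive data parallelism (DP) and global batch size (GBS) the way the
    verification notebooks do: DP = npus // (tp * pp), GBS = DP * mbs."""
    if npus < tp * pp:
        # Mirrors the notebooks' assertion that NPU count must cover TP*PP.
        raise ValueError(f"NPU count ({npus}) < TP*PP ({tp * pp}); reduce TP or PP")
    dp = npus // (tp * pp)
    return dp, dp * mbs

# Fine-tuning defaults (TP=2, PP=2) on a 4-NPU workbench:
print(parallel_layout(4, 2, 2))   # -> (1, 1)
# Pretraining defaults (TP=1, PP=4) on 8 NPUs:
print(parallel_layout(8, 1, 4))   # -> (2, 2)
```

If the check raises, either attach more NPUs or lower `TP`/`PP` in the first parameter cell before running the conversion and training steps.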
Unlike the VolcanoJob-based flow used in some other examples, this solution runs training directly inside a single Workbench container with multiple Ascend NPUs attached.

## Before You Begin

Make sure the cluster already provides an operational Ascend runtime. In practice, this means the Ascend driver, CANN runtime, and Kubernetes device plugin are already installed and working, and your workbench can be scheduled to an `arm64` node with Ascend NPU resources.

Create the workbench with the `PyTorch CANN` image described in [Creating a Workbench](./create_workbench.mdx). For the default notebook settings, plan for at least `4` NPUs. The verification notebooks also create converted model weights, preprocessing outputs, logs, and checkpoints, so the workspace should use persistent storage with enough free capacity for both the original HuggingFace model and the converted Megatron/MCore weights.

The notebooks clone `MindSpeed-LLM` from `https://gitcode.com/ascend/MindSpeed-LLM.git` during execution. If the workbench cannot reach that repository, place a local copy in the workspace and update the notebook path in the first parameter cell.

## Create the Workbench

Create a JupyterLab workbench on the Ascend node pool and select the `PyTorch CANN` image. Keep the workspace on persistent storage so that notebooks, converted weights, and training outputs remain available after restart. If you follow the notebook defaults, request enough NPU resources to satisfy the configured tensor and pipeline parallelism.

For the detailed creation steps and image selection, see [Creating a Workbench](./create_workbench.mdx).

## Import the Verification Notebooks

Upload the two notebooks into the JupyterLab workspace and open them there. If your image distribution already exposes the notebooks in the workspace, you can use them directly. Otherwise, download them from the links above and upload them through the JupyterLab file browser.
The JupyterLab upload workflow is described in [Creating a Workbench](./create_workbench.mdx). + +## Prepare the Base Model + +Both notebooks expect a HuggingFace-format base model in the workspace. The default paths are: + +| Notebook | Variable | Default path | +|----------|----------|--------------| +| Fine-tuning | `HF_MODEL_DIR` | `/opt/app-root/src/models/Qwen3-8B` | +| Pretraining | `HF_MODEL_DIR` | `/opt/app-root/src/models/Qwen2.5-7B` | + +You can place the model files in those directories or change `HF_MODEL_DIR` in the first parameter cell. Before running the notebook, verify that the target directory contains the expected model configuration, tokenizer files, and weight files. + +If you want the model to be versioned and reusable across workbenches, upload it to the platform model repository first and then clone or copy it into the workspace. The repository-based upload flow is documented in [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx). + +> **Note:** Both notebooks convert HuggingFace weights to Megatron/MCore format before training starts. This conversion creates another large set of files, so storage planning matters. + +## Prepare the Dataset + +The fine-tuning and pretraining notebooks expect different kinds of input data. + +### Fine-tuning Data + +The fine-tuning notebook uses instruction-tuning data. Its validation path is based on Alpaca-style samples, and it can also consume a real dataset when you place the files in the workspace and update the path variables in the parameter cell. + +By default, the notebook looks for: + +- `ALPACA_PARQUET = /opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet` +- `RAW_DATA_FILE = /opt/app-root/src/Qwen3-8B-work-dir/finetune_dataset/alpaca_sample.jsonl` + +If the parquet file exists, the notebook converts it to JSONL automatically. 
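The conversion itself is simple: each Alpaca record becomes one JSON object per line with `instruction`, `input`, and `output` fields. A minimal sketch of that step, using an in-memory record list in place of the parquet rows (the notebook reads the actual file with pandas):

```python
import json
from pathlib import Path

# Stand-in rows; the notebook obtains these from ALPACA_PARQUET via pandas.read_parquet.
records = [
    {"instruction": "Summarize the text.", "input": "Ascend 910B is an AI processor.", "output": "An AI chip."},
]

def write_alpaca_jsonl(rows, out_file: Path) -> int:
    """Write Alpaca-style records as one JSON object per line (UTF-8, no ASCII escaping)."""
    with open(out_file, "w", encoding="utf-8") as f:
        for row in rows:
            # Keep only the three Alpaca fields, defaulting missing ones to "".
            record = {k: row.get(k, "") for k in ("instruction", "input", "output")}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(rows)

n = write_alpaca_jsonl(records, Path("alpaca_sample.jsonl"))
print(f"wrote {n} records")
```

The output file can then stand in for `RAW_DATA_FILE` when you bring your own instruction data in a different source format.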
If you already have JSONL instruction data, place it at `RAW_DATA_FILE` or update the variable to your actual path. + +The expected Alpaca-style JSONL record is: + +```json +{"instruction": "...", "input": "...", "output": "..."} +``` + +The notebook can also be adapted to other instruction formats such as ShareGPT or Pairwise datasets by changing the handler in the parameter section. + +### Pretraining Data + +The pretraining notebook uses raw text data. `MindSpeed-LLM` preprocessing supports `.parquet`, `.json`, `.jsonl`, and `.txt`. For structured formats such as parquet, json, or jsonl, the data should include a `text` field. For plain text input, provide one text segment per line. + +The validation notebook uses the following default input path: + +- `ALPACA_PARQUET = /opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet` + +If that file is absent, the notebook falls back to a small built-in sample so that the preprocessing and training flow can still be verified. + +### Getting Data into the Workspace + +For small test files, uploading directly in JupyterLab is usually enough. For larger datasets, it is more practical to mount a PVC or pull the data from the platform dataset repository into the workspace. If you want a repository-based dataset workflow, see [Fine-tuning LLMs using Workbench](./fine_tunning_using_notebooks.mdx). + +## Run the Fine-tuning Notebook + +Open `qwen3_finetune_verify.ipynb` and start with the first parameter cell. That cell controls the model path, dataset path, output location, sequence length, training iterations, and the tensor and pipeline parallelism used during both weight conversion and training. + +The notebook follows a straightforward progression. It first checks the Ascend runtime and confirms that `torch_npu`, `MindSpeed`, and `MindSpeed-LLM` are available. 
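That first check can also be condensed into a small helper if you prefer to probe the image from a terminal session before opening the notebook. The sketch below degrades gracefully when run outside the Ascend image; the module names are the ones the notebooks import:

```python
import importlib.util

def ascend_runtime_report() -> dict:
    """Report which components of the PyTorch CANN stack are importable,
    plus the visible NPU count when torch_npu is present."""
    report = {name: importlib.util.find_spec(name) is not None
              for name in ("torch", "torch_npu", "mindspeed", "mindspeed_llm")}
    if report["torch"] and report["torch_npu"]:
        import torch
        import torch_npu  # importing torch_npu registers the 'npu' device backend
        report["npu_count"] = torch.npu.device_count()
    return report

for key, value in ascend_runtime_report().items():
    print(f"{key}: {value}")
```

If any of the four components reports `False`, fix the image or environment before continuing; the later conversion and training cells assume all of them are present.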
It then prepares a small Alpaca-style dataset or loads your real one, clones the `MindSpeed-LLM` repository, converts the HuggingFace checkpoint into Megatron/MCore format, preprocesses the data into the format required by `MindSpeed-LLM`, launches full-parameter SFT with `posttrain_gpt.py`, and finally runs an inference check against the generated checkpoint. + +The default configuration is intentionally conservative. It uses a short sequence length and a small number of iterations so that the notebook can serve as an environment verification tool rather than a long production run. Once that path is working, move to a real dataset and tune the parameters for the actual workload. + +The most important parameters to review are: + +- `HF_MODEL_DIR` +- `ALPACA_PARQUET` or `RAW_DATA_FILE` +- `OUTPUT_DIR` +- `TP` and `PP` +- `SEQ_LENGTH` +- `TRAIN_ITERS` +- `ENABLE_THINKING` + +For real fine-tuning, the notebook guidance is to increase `SEQ_LENGTH` to match the model context window, increase `TRAIN_ITERS` to a production-sized value, and adjust parallelism and batch sizing according to the available NPUs and the size of the training set. If you want periodic checkpoints, also update the save interval in the training cell. + +## Run the Pretraining Notebook + +Open `qwen25_pretrain_verify.ipynb` and review the first parameter cell in the same way. This notebook uses a raw text corpus rather than instruction-response records, but the overall structure is similar. + +It begins with an environment check, prepares a sample text dataset or loads your real one, clones the `MindSpeed-LLM` repository, converts the HuggingFace checkpoint into Megatron/MCore format, preprocesses the raw text into `.bin` and `.idx` files, and launches pretraining with `pretrain_gpt.py`. + +The validation configuration is again intentionally small. 
It is useful for verifying that preprocessing, checkpoint conversion, distributed launch, and output writing all work correctly on the Ascend runtime before you commit to a much longer run. + +The main parameters to review are: + +- `HF_MODEL_DIR` +- dataset path variables such as `ALPACA_PARQUET` +- `OUTPUT_DIR` +- `TP` and `PP` +- `SEQ_LENGTH` +- `TRAIN_ITERS` + +When you move from validation to a real pretraining job, increase the sequence length and iteration count, set the global batch size according to the available NPUs and corpus size, and revisit the save interval. If you change `TP` or `PP`, rerun the weight conversion step so that the converted checkpoint matches the training layout. + +## Output and Persistence + +By default, the notebooks write their outputs to the following locations: + +| Notebook | Default output path | +|----------|---------------------| +| Fine-tuning | `/opt/app-root/src/Qwen3-8B-work-dir/output/qwen3_8b_finetuned` | +| Pretraining | `/opt/app-root/src/Qwen2.5-7B-work-dir/output/qwen25_7b_pretrained` | + +Keep these directories on persistent storage. The outputs can be large, and in most real workflows you will want to preserve them after the workbench restarts or publish them for later use. If you want to push the resulting model back to the model repository, follow the Git LFS workflow in [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx). + +## Operational Notes + +- These notebooks are verification examples first. Do not leave the default iteration count and sequence length unchanged for real training. +- The fine-tuning notebook runs full-parameter SFT rather than LoRA. +- The selected parallel configuration affects memory usage, weight conversion, and runtime layout. If you change `TP` or `PP`, reconvert the weights before training. 
+- In offline or restricted environments, prepare the `MindSpeed-LLM` repository and required model and dataset files in advance and place them directly in the workspace. diff --git a/docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx b/docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx index 5764c34..c121908 100644 --- a/docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx +++ b/docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx @@ -525,6 +525,8 @@ Steps: When using a non-Nvidia GPU environment (e.g. NPU, Intel Gaudi, AMD etc.), you can follow the common steps below to fine-tune models, launch training tasks, and manage them in AML Notebook. +For a concrete Huawei Ascend NPU example based on the `PyTorch CANN` workbench image and `MindSpeed-LLM` notebooks, see [Fine-tune and Pretrain LLMs on Ascend NPU Using Workbench](./fine-tune-and-pretrain-llms-on-ascend-npu.mdx). + > **Note:** The following steps can also be adapt to LLM pre-training and traditional ML senarios. These are general steps for converting a vendor solution to run on Alauda AI using Notebook and VolcanoJob. ### Preparation diff --git a/docs/public/qwen25_pretrain_verify.ipynb b/docs/public/qwen25_pretrain_verify.ipynb new file mode 100644 index 0000000..63aa32e --- /dev/null +++ b/docs/public/qwen25_pretrain_verify.ipynb @@ -0,0 +1,409 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b1a2c3d4", + "metadata": {}, + "source": [ + "# Qwen2.5-7B Pretraining Verification\n", + "\n", + "This notebook verifies the pretraining capability of the **Ascend 910B CANN image** by running Qwen2.5-7B pretraining with MindSpeed-LLM.\n", + "\n", + "**Workflow:**\n", + "1. Check the environment\n", + "2. Prepare the pretraining dataset\n", + "3. Clone the MindSpeed-LLM scripts\n", + "4. Convert HF weights to Megatron weights\n", + "5. Preprocess the data\n", + "6. 
Start pretraining\n", + "\n", + "> Training parameters are set for verification mode (few iterations + short sequences). Increase `TRAIN_ITERS` and `SEQ_LENGTH` for production runs." + ] + }, + { + "cell_type": "markdown", + "id": "c2d3e4f5", + "metadata": {}, + "source": [ + "## 0. Parameter Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d3e4f5a6", + "metadata": {}, + "outputs": [], + "source": "import warnings\nwarnings.filterwarnings('ignore', category=DeprecationWarning)\nwarnings.filterwarnings('ignore', category=ImportWarning)\nwarnings.filterwarnings('ignore', category=UserWarning)\n\nfrom pathlib import Path\n\n# ===== Path configuration =====\nHF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen2.5-7B')\nWORK_DIR = Path('/opt/app-root/src/Qwen2.5-7B-work-dir')\nMINDSPEED_LLM_DIR = WORK_DIR / 'MindSpeed-LLM'\nDATA_DIR = WORK_DIR / 'pretrain_dataset'\nOUTPUT_DIR = WORK_DIR / 'output' / 'qwen25_7b_pretrained'\nLOGS_DIR = WORK_DIR / 'logs'\n\n# ===== Optional: real dataset path =====\nALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n\n# ===== Ascend environment scripts =====\nCANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\nATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n\n# ===== Parallel configuration (must match weight conversion) =====\nTP = 1 # Tensor parallelism\nPP = 4 # Pipeline parallelism; requires at least TP*PP=4 NPUs\n# Note: With TP=1, each PP stage has about 1.6-2.2B parameters. 
The AdamW optimizer states\n# (exp_avg + exp_avg_sq, fp32) take about 17.6 GiB, and weights plus gradients exceed the 29 GiB memory on 910B.\n# Enable --use-distributed-optimizer (ZeRO-1) during training to shard optimizer states by the DP dimension and reduce memory usage.\n\n# ===== Weight conversion output (path includes parallel config to avoid reusing old weights after TP/PP changes) =====\nMCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen25_mcore_tp{TP}_pp{PP}'\n\n# ===== Training hyperparameters (verification mode) =====\nSEQ_LENGTH = 512 # Recommended production value: 4096\nTRAIN_ITERS = 50 # Recommended production value: 2000+\nMBS = 1\nLR = 1.25e-6\nMIN_LR = 1.25e-7\n\n# ===== Data preprocessing =====\nPROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\nDATA_PATH = str(DATA_DIR / 'alpaca_text_document') # preprocess_data.py automatically adds the _text_document suffix\n\nprint('Configuration loaded')\nprint(f' Model: {HF_MODEL_DIR}')\nprint(f' Dataset: {ALPACA_PARQUET}' if ALPACA_PARQUET.exists() else ' Dataset: not found; sample data will be used')\nprint(f' TP={TP}, PP={PP}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')" + }, + { + "cell_type": "markdown", + "id": "e4f5a6b7", + "metadata": {}, + "source": [ + "## Helper Functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5a6b7c8", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import subprocess\n", + "\n", + "_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning'\n", + "\n", + "def run_cmd(cmd, cwd=None, check=True):\n", + " 'Run a bash command in the Ascend environment and stream output in real time'\n", + " env_prefix = f'source {CANN_ENV} && source {ATB_ENV}'\n", + " full_cmd = f'{env_prefix} && {cmd}'\n", + " print(f'$ {cmd}\\n')\n", + " run_env = os.environ.copy()\n", + " run_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\n", + " result = subprocess.run(\n", + " ['bash', '-lc', full_cmd],\n", + " cwd=str(cwd or 
WORK_DIR),\n", + " text=True,\n", + " env=run_env,\n", + " )\n", + " if check and result.returncode != 0:\n", + " raise RuntimeError(f'Command failed with return code: {result.returncode}')\n", + " return result\n", + "\n", + "print('Helper function defined: run_cmd()')" + ] + }, + { + "cell_type": "markdown", + "id": "a6b7c8d9", + "metadata": {}, + "source": [ + "## 1. Environment Check" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7c8d9e0", + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "\n", + "with warnings.catch_warnings():\n", + " warnings.simplefilter('ignore', DeprecationWarning)\n", + " warnings.simplefilter('ignore', ImportWarning)\n", + " warnings.simplefilter('ignore', UserWarning)\n", + " import torch\n", + " import torch_npu\n", + "\n", + "print('=' * 60)\n", + "print('Environment check')\n", + "print('=' * 60)\n", + "\n", + "# PyTorch & NPU\n", + "print(f'PyTorch: {torch.__version__}')\n", + "print(f'torch_npu: {torch_npu.__version__}')\n", + "nproc = torch.npu.device_count()\n", + "print(f'NPU count: {nproc}')\n", + "for i in range(nproc):\n", + " print(f' NPU {i}: {torch.npu.get_device_name(i)}')\n", + "\n", + "# MindSpeed\n", + "with warnings.catch_warnings():\n", + " warnings.simplefilter('ignore', DeprecationWarning)\n", + " warnings.simplefilter('ignore', ImportWarning)\n", + " warnings.simplefilter('ignore', UserWarning)\n", + " import mindspeed\n", + " import mindspeed_llm\n", + "\n", + "print('MindSpeed: installed')\n", + "print('MindSpeed-LLM: installed')\n", + "\n", + "# Model files\n", + "print(f'\\nModel directory: {HF_MODEL_DIR}')\n", + "assert HF_MODEL_DIR.exists(), f'Model directory does not exist: {HF_MODEL_DIR}'\n", + "model_files = sorted(HF_MODEL_DIR.glob('*'))\n", + "for f in model_files[:5]:\n", + " if f.is_file():\n", + " print(f' {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\n", + "if len(model_files) > 5:\n", + " print(f' ... 
{len(model_files)} files in total')\n", + "\n", + "# Parallel configuration check\n", + "assert nproc >= TP * PP, f'NPU count({nproc}) < TP*PP({TP*PP}); reduce PP'\n", + "DP = nproc // (TP * PP)\n", + "GBS = DP * MBS\n", + "print(f'\\nParallel configuration: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\n", + "\n", + "assert torch.npu.is_available(), 'NPU is not available'\n", + "print('\\nEnvironment check passed!')" + ] + }, + { + "cell_type": "markdown", + "id": "c8d9e0f1", + "metadata": {}, + "source": [ + "## 2. Prepare the Pretraining Dataset\n", + "\n", + "Pretraining uses raw text data. MindSpeed-LLM's `preprocess_data.py` supports `.parquet`, `.json`, `.jsonl`, `.txt`, and other formats.\n", + "\n", + "This example uses the Alpaca dataset in parquet format, which contains a `text` field. If you use another dataset, make sure it contains a `text` field or uses plain text format." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9e0f1a2", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import warnings\n", + "\n", + "DATA_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "if ALPACA_PARQUET.exists():\n", + " print(f'Dataset is ready: {ALPACA_PARQUET.name}')\n", + " with warnings.catch_warnings():\n", + " warnings.simplefilter('ignore', DeprecationWarning)\n", + " import pandas as pd\n", + " df = pd.read_parquet(ALPACA_PARQUET)\n", + " print(f'{len(df)} samples, columns: {list(df.columns)}')\n", + " print('\\nSample examples:')\n", + " for i, row in df.head(3).iterrows():\n", + " text = str(row.get('text', ''))[:100]\n", + " print(f' [{i}] {text}...')\n", + " DATA_INPUT = str(ALPACA_PARQUET)\n", + "else:\n", + " print('Alpaca dataset not found. 
Creating sample text data\\n')\n", + " sample_texts = [\n", + " {'text': 'Natural language processing is an important branch of artificial intelligence that studies how computers understand and generate human language.'},\n", + " {'text': 'Deep learning uses multilayer neural networks to learn hierarchical data representations and is widely used in computer vision and natural language processing.'},\n", + " {'text': 'Python is a high-level programming language known for its concise, readable syntax and rich ecosystem.'},\n", + " {'text': 'Machine learning is a core artificial intelligence technology that enables computers to learn and improve automatically from data.'},\n", + " {'text': 'Ascend 910B is an artificial intelligence processor from Huawei designed for deep learning training and inference workloads.'},\n", + " {'text': 'Pretraining is the first stage of training large language models and learns statistical patterns from massive text corpora.'},\n", + " {'text': 'The Transformer architecture is a foundation of modern natural language processing and uses self-attention for parallel sequence modeling.'},\n", + " {'text': 'Distributed training spreads large model training workloads across multiple compute devices.'},\n", + " {'text': 'Tensor parallelism and pipeline parallelism are two common parallelism strategies for large model training.'},\n", + " {'text': 'Gradient accumulation simulates large-batch training when memory is limited.'},\n", + " ]\n", + " SAMPLE_FILE = DATA_DIR / 'sample_pretrain.jsonl'\n", + " with open(SAMPLE_FILE, 'w', encoding='utf-8') as f:\n", + " for item in sample_texts:\n", + " f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", + " DATA_INPUT = str(SAMPLE_FILE)\n", + " print(f'Sample data created: {SAMPLE_FILE}')\n", + " print(f'{len(sample_texts)} samples')\n", + "\n", + "print(f'\\nData input path: {DATA_INPUT}')" + ] + }, + { + "cell_type": "markdown", + "id": "e0f1a2b3", + "metadata": {}, + "source": [ + "## 3. 
Clone MindSpeed-LLM\n", + "\n", + "The `mindspeed_llm` Python package is installed during image build, but the training scripts (`convert_ckpt.py`, `preprocess_data.py`, `pretrain_gpt.py`, and others) must run from the repository directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1a2b3c4", + "metadata": {}, + "outputs": [], + "source": [ + "if MINDSPEED_LLM_DIR.exists():\n", + " print(f'Already exists: {MINDSPEED_LLM_DIR}')\n", + "else:\n", + " print('Cloning MindSpeed-LLM (shallow clone)...')\n", + " run_cmd(f'git clone --depth 1 https://gitcode.com/ascend/MindSpeed-LLM.git {MINDSPEED_LLM_DIR}')\n", + "\n", + "# Verify required scripts\n", + "scripts = [\n", + " ('Weight conversion', 'convert_ckpt.py'),\n", + " ('Data preprocessing', 'preprocess_data.py'),\n", + " ('Pretraining', 'pretrain_gpt.py'),\n", + "]\n", + "for name, script in scripts:\n", + " exists = (MINDSPEED_LLM_DIR / script).exists()\n", + " print(f' [{name}] {script}: {\"OK\" if exists else \"MISSING\"}')\n", + "\n", + "assert all((MINDSPEED_LLM_DIR / s).exists() for _, s in scripts), 'Required scripts are missing'\n", + "print('\\nScript check passed!')" + ] + }, + { + "cell_type": "markdown", + "id": "a2b3c4d5", + "metadata": {}, + "source": [ + "## 4. Convert HF Weights to Megatron Weights\n", + "\n", + "Convert HuggingFace-format weights to Megatron-Mcore format, split by TP/PP. The first conversion usually takes 5-10 minutes.\n", + "\n", + "Qwen2.5 is based on the LLaMA2 architecture with QKV bias, so the conversion uses `--model-type-hf llama2` plus `--add-qkv-bias`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c4d5e6", + "metadata": {}, + "outputs": [], + "source": [ + "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "# Check whether conversion already exists\n", + "converted = any(MCORE_WEIGHTS_DIR.glob('iter_*'))\n", + "\n", + "if converted:\n", + " print(f'Weights already exist; skipping conversion: {MCORE_WEIGHTS_DIR}')\n", + " for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n", + " print(f' {p.name}')\n", + "else:\n", + " convert_cmd = ' && '.join([\n", + " f'cd {MINDSPEED_LLM_DIR}',\n", + " 'python convert_ckpt.py'\n", + " ' --use-mcore-models'\n", + " ' --model-type GPT'\n", + " ' --load-model-type hf'\n", + " ' --save-model-type mg'\n", + " f' --target-tensor-parallel-size {TP}'\n", + " f' --target-pipeline-parallel-size {PP}'\n", + " ' --add-qkv-bias'\n", + " f' --load-dir {HF_MODEL_DIR}'\n", + " f' --save-dir {MCORE_WEIGHTS_DIR}'\n", + " f' --tokenizer-model {HF_MODEL_DIR / \"tokenizer.json\"}'\n", + " ' --model-type-hf llama2'\n", + " ' --params-dtype bf16',\n", + " ])\n", + " print('Running weight conversion (about 5-10 minutes)...')\n", + " run_cmd(convert_cmd, cwd=MINDSPEED_LLM_DIR)\n", + " print('Weight conversion completed!')\n", + " for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n", + " print(f' {p.name}')" + ] + }, + { + "cell_type": "markdown", + "id": "c4d5e6f7", + "metadata": {}, + "source": [ + "## 5. Data Preprocessing\n", + "\n", + "Convert text data into the binary format (`.bin` + `.idx`) required by MindSpeed-LLM pretraining.\n", + "\n", + "No handler needs to be specified for pretraining data processing. `preprocess_data.py` automatically extracts the `text` field and generates `alpaca_text_document.bin` and `alpaca_text_document.idx`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5e6f7a8", + "metadata": {}, + "outputs": [], + "source": [ + "preprocess_cmd = ' && '.join([\n", + " f'cd {MINDSPEED_LLM_DIR}',\n", + " 'python preprocess_data.py'\n", + " f' --input {DATA_INPUT}'\n", + " f' --tokenizer-name-or-path {HF_MODEL_DIR}'\n", + " f' --output-prefix {PROCESSED_DATA_PREFIX}'\n", + " ' --tokenizer-type PretrainedFromHF'\n", + " ' --workers 4'\n", + " ' --log-interval 1000',\n", + "])\n", + "\n", + "print('Running data preprocessing...')\n", + "run_cmd(preprocess_cmd, cwd=MINDSPEED_LLM_DIR)\n", + "\n", + "# Verify outputs\n", + "print('\\nPreprocessing outputs:')\n", + "for f in sorted(DATA_DIR.glob('alpaca*')):\n", + " print(f' {f.name} ({f.stat().st_size / 1024:.1f} KB)')\n", + "\n", + "assert (DATA_DIR / 'alpaca_text_document.bin').exists() or (DATA_DIR / 'alpaca_text_document.idx').exists(), \\\n", + " f'Preprocessing outputs not found: {DATA_DIR / \"alpaca_text_document.*\"}'\n", + "print('Data preprocessing completed!')" + ] + }, + { + "cell_type": "markdown", + "id": "e6f7a8b9", + "metadata": {}, + "source": [ + "## 6. Start Pretraining\n", + "\n", + "Run Qwen2.5-7B pretraining with MindSpeed-LLM. Training logs are streamed to the notebook in real time.\n", + "\n", + "> In verification mode, `TRAIN_ITERS=50`. 
For full pretraining, 2000+ iterations are recommended.\n", + "\n", + "**Qwen2.5-7B architecture parameters:**\n", + "- 28 Transformer layers, hidden size 3584, FFN size 18944\n", + "- 28 attention heads, 4 KV heads (GQA)\n", + "- RoPE positional encoding, SwiGLU activation, RMSNorm, QKV bias" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7a8b9c0", + "metadata": {}, + "outputs": [], + "source": "import torch\n\nnproc = torch.npu.device_count()\nDP = nproc // (TP * PP)\nGBS = DP * MBS\n\nLOGS_DIR.mkdir(parents=True, exist_ok=True)\nOUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n\n# Environment variables\nenv = ' && '.join([\n f'cd {MINDSPEED_LLM_DIR}',\n 'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n 'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n])\n\n# torchrun distributed arguments\ndistributed = ' '.join([\n 'torchrun',\n f'--nproc_per_node {nproc}',\n '--nnodes 1 --node_rank 0',\n '--master_addr localhost --master_port 6000',\n])\n\n# Model architecture (Qwen2.5-7B)\nmodel_args = ' '.join([\n '--use-mcore-models',\n f'--tensor-model-parallel-size {TP}',\n f'--pipeline-model-parallel-size {PP}',\n '--sequence-parallel --use-flash-attn',\n '--transformer-impl local',\n '--use-distributed-optimizer',\n '--num-layers 28 --hidden-size 3584 --num-attention-heads 28',\n '--ffn-hidden-size 18944 --max-position-embeddings 131072',\n f'--seq-length {SEQ_LENGTH}',\n '--make-vocab-size-divisible-by 1 --padded-vocab-size 152064',\n '--position-embedding-type rope --rotary-base 1000000 --use-rotary-position-embeddings',\n '--group-query-attention --num-query-groups 4',\n '--add-qkv-bias --disable-bias-linear',\n '--untie-embeddings-and-output-weights',\n '--swiglu --normalization RMSNorm --norm-epsilon 1e-6',\n])\n\n# Training hyperparameters\ntrain_args = ' '.join([\n f'--micro-batch-size {MBS} --global-batch-size {GBS}',\n f'--train-iters {TRAIN_ITERS}',\n '--lr-decay-style cosine --lr-warmup-fraction 0.01',\n '--init-method-std 0.01',\n 
f'--lr {LR} --min-lr {MIN_LR}',\n '--weight-decay 1e-1 --clip-grad 1.0',\n '--adam-beta1 0.9 --adam-beta2 0.95 --initial-loss-scale 4096',\n '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n '--no-masked-softmax-fusion --no-load-optim --no-load-rng',\n '--seed 42 --bf16',\n])\n\n# Activation recomputation: recompute forward activations during backward pass to trade compute for memory.\n# With PP=4, each stage has 7 layers, so recomputing all layers maximizes memory savings.\nrecompute_args = ' '.join([\n '--recompute-granularity full',\n '--recompute-method block',\n '--recompute-num-layers 7',\n])\n\n# Data and outputs\ndata_args = ' '.join([\n f'--data-path {DATA_PATH}',\n '--split 100,0,0',\n '--tokenizer-type PretrainedFromHF',\n f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n '--log-interval 1',\n f'--save-interval {TRAIN_ITERS}',\n f'--eval-interval {TRAIN_ITERS} --eval-iters 0',\n])\n\n# Load and save\noutput_args = ' '.join([\n f'--load {MCORE_WEIGHTS_DIR} --save {OUTPUT_DIR}',\n '--distributed-backend nccl',\n '--exit-on-missing-checkpoint',\n '--no-save-optim --no-save-rng',\n])\n\ncmd = f'{env} && {distributed} pretrain_gpt.py {model_args} {train_args} {recompute_args} {data_args} {output_args}'\n\nprint(f'Training configuration: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}')\nprint(f'GBS={GBS}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\nprint(f'Activation recomputation: full (7 layers per PP stage)')\nprint(f'\\nStarting pretraining...\\n')\nrun_cmd(cmd, cwd=MINDSPEED_LLM_DIR)\nprint(f'\\nPretraining completed! Weights saved to: {OUTPUT_DIR}')" + }, + { + "cell_type": "markdown", + "id": "a8b9c0d1", + "metadata": {}, + "source": [ + "## Use a Real Dataset\n", + "\n", + "After verification succeeds, use a real dataset for full pretraining as follows:\n", + "\n", + "1. 
**Prepare the data**: place the text dataset inside the container\n", + " - Supported formats: `.parquet`, `.json`, `.jsonl`, `.txt`\n", + " - The data must contain a `text` field (parquet/json/jsonl) or one text segment per line (txt)\n", + "\n", + "2. **Adjust parameters**:\n", + " - `SEQ_LENGTH = 4096` to match the model context length\n", + " - `TRAIN_ITERS = 2000+` depending on dataset size\n", + " - `GBS` based on the NPU count and dataset size; it can be set larger than `DP * MBS` to enable gradient accumulation\n", + "\n", + "3. **Save interval**: modify `--save-interval` in the training cell for periodic checkpoints\n", + "\n", + "4. **Weight conversion**: if TP/PP changes, rerun weight conversion" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/docs/public/qwen3_finetune_verify.ipynb b/docs/public/qwen3_finetune_verify.ipynb new file mode 100644 index 0000000..2745368 --- /dev/null +++ b/docs/public/qwen3_finetune_verify.ipynb @@ -0,0 +1,390 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "18655ab8", + "metadata": {}, + "source": [ + "# Qwen3-8B Full-Parameter Fine-Tuning Verification\n", + "\n", + "This notebook verifies the fine-tuning capability of the **Ascend 910B CANN image** by running full-parameter SFT fine-tuning for Qwen3-8B with MindSpeed-LLM.\n", + "\n", + "**Workflow:**\n", + "1. Environment check\n", + "2. Prepare a sample dataset (Alpaca format)\n", + "3. Clone the MindSpeed-LLM scripts\n", + "4. Convert HF weights to Megatron weights\n", + "5. Preprocess the data\n", + "6. Start fine-tuning\n", + "7. 
Run inference validation\n", + "\n", + "> The training parameters are set for verification mode (few iterations + short sequence length). Increase `TRAIN_ITERS` and `SEQ_LENGTH` for production use." + ] + }, + { + "cell_type": "markdown", + "id": "12b48017", + "metadata": {}, + "source": [ + "## 0. Parameter Configuration" + ] + }, + { + "cell_type": "code", + "id": "a0fa2576", + "metadata": {}, + "source": "import warnings\nwarnings.filterwarnings('ignore', category=DeprecationWarning)\nwarnings.filterwarnings('ignore', category=ImportWarning)\nwarnings.filterwarnings('ignore', category=UserWarning)\n\nfrom pathlib import Path\n\n# ===== Path configuration =====\nHF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen3-8B')\nWORK_DIR = Path('/opt/app-root/src/Qwen3-8B-work-dir')\nMINDSPEED_LLM_DIR = WORK_DIR / 'MindSpeed-LLM'\nDATA_DIR = WORK_DIR / 'finetune_dataset'\nRAW_DATA_FILE = DATA_DIR / 'alpaca_sample.jsonl'\nPROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\nOUTPUT_DIR = WORK_DIR / 'output' / 'qwen3_8b_finetuned'\nLOGS_DIR = WORK_DIR / 'logs'\n\n# ===== Optional: real dataset path =====\nALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n\n# ===== Ascend environment scripts =====\nCANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\nATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n\n# ===== Parallelism configuration (must match weight conversion) =====\nTP = 2 # With TP=1, one card holds about 4.1B parameters; fp32 gradient buffers + bf16 weights require about 30 GiB, exceeding the 910B 29 GiB memory limit\nPP = 2 # At least TPxPP=4 NPUs are required; for a single card, set TP=1 and PP=1 (OOM is possible)\n\n# ===== Weight conversion output (path includes parallel settings to avoid reusing stale weights after TP/PP changes) =====\nMCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen3_mcore_tp{TP}_pp{PP}'\n\n# ===== Training hyperparameters (verification mode) =====\nSEQ_LENGTH = 512 # 4096 is recommended for 
production\nTRAIN_ITERS = 50 # 2000+ is recommended for production\nMBS = 1\nLR = 1.25e-6\nMIN_LR = 1.25e-7\n\n# ===== Data preprocessing =====\nHANDLER_NAME = 'AlpacaStyleInstructionHandler'\nTOKENIZER_TYPE = 'PretrainedFromHF'\nPROMPT_TYPE = 'qwen3'\nENABLE_THINKING = 'none'\n\nprint('Configuration loaded')\nprint(f' Model: {HF_MODEL_DIR}')\nprint(f' Dataset: {ALPACA_PARQUET}' if ALPACA_PARQUET.exists() else ' Dataset: not found, using built-in sample data')\nprint(f' TP={TP}, PP={PP}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')", + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "15d10a9a", + "metadata": {}, + "source": [ + "## Helper Function" + ] + }, + { + "cell_type": "code", + "id": "7eb53b45", + "metadata": {}, + "source": "import os\nimport subprocess\n\n_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning'\n\ndef run_cmd(cmd, cwd=None, check=True):\n 'Run a bash command in the Ascend environment and stream output in real time'\n env_prefix = f'source {CANN_ENV} && source {ATB_ENV}'\n full_cmd = f'{env_prefix} && {cmd}'\n print(f'$ {cmd}\\n')\n run_env = os.environ.copy()\n run_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\n result = subprocess.run(\n ['bash', '-lc', full_cmd],\n cwd=str(cwd or WORK_DIR),\n text=True,\n env=run_env,\n )\n if check and result.returncode != 0:\n raise RuntimeError(f'Command failed with return code: {result.returncode}')\n return result\n\nprint('Helper function defined: run_cmd()')", + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "0d2cbf3b", + "metadata": {}, + "source": [ + "## 1. 
Environment Check" + ] + }, + { + "cell_type": "code", + "id": "1643dfe5", + "metadata": {}, + "source": "import warnings\nwith warnings.catch_warnings():\n warnings.simplefilter('ignore', DeprecationWarning)\n warnings.simplefilter('ignore', ImportWarning)\n warnings.simplefilter('ignore', UserWarning)\n import torch\n import torch_npu\n\nprint('=' * 60)\nprint('Environment Check')\nprint('=' * 60)\n\n# PyTorch & NPU\nprint(f'PyTorch: {torch.__version__}')\nprint(f'torch_npu: {torch_npu.__version__}')\nnproc = torch.npu.device_count()\nprint(f'NPU count: {nproc}')\nfor i in range(nproc):\n print(f' NPU {i}: {torch.npu.get_device_name(i)}')\n\n# MindSpeed\nwith warnings.catch_warnings():\n warnings.simplefilter('ignore', DeprecationWarning)\n warnings.simplefilter('ignore', ImportWarning)\n warnings.simplefilter('ignore', UserWarning)\n import mindspeed\n import mindspeed_llm\nprint('MindSpeed: installed')\nprint('MindSpeed-LLM: installed')\n\n# Model files\nprint(f'\\nModel directory: {HF_MODEL_DIR}')\nassert HF_MODEL_DIR.exists(), f'Model directory does not exist: {HF_MODEL_DIR}'\nmodel_files = sorted(HF_MODEL_DIR.glob('*'))\nfor f in model_files[:5]:\n if f.is_file():\n print(f' {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\nif len(model_files) > 5:\n print(f' ... {len(model_files)} files in total')\n\n# Parallelism validation\nassert nproc >= TP * PP, f'NPU count ({nproc}) < TP*PP ({TP*PP}); reduce PP'\nDP = nproc // (TP * PP)\nGBS = DP * MBS\nprint(f'\\nParallelism: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\nassert torch.npu.is_available(), 'NPU is not available'\nprint('\\nEnvironment check passed!')", + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "a194e018", + "metadata": {}, + "source": [ + "## 2. 
Prepare a Sample Dataset\n", + "\n", + "Create sample data in Alpaca format to verify the fine-tuning workflow.\n", + "\n", + "To use a real dataset, place a JSONL file at `RAW_DATA_FILE`, with one JSON object per line:\n", + "```json\n", + "{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n", + "```" + ] + }, + { + "cell_type": "code", + "id": "6d845761", + "metadata": {}, + "source": "import json\nimport warnings\nimport pandas as pd\n\nDATA_DIR.mkdir(parents=True, exist_ok=True)\n\nif ALPACA_PARQUET.exists():\n print(f'Loading Alpaca dataset: {ALPACA_PARQUET.name}')\n with warnings.catch_warnings():\n warnings.simplefilter('ignore', DeprecationWarning)\n df = pd.read_parquet(ALPACA_PARQUET)\n print(f'{len(df)} samples loaded, columns: {list(df.columns)}')\n\n # Convert to JSONL (instruction / input / output)\n with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n for item in df[['instruction', 'input', 'output']].to_dict('records'):\n item['input'] = item.get('input') or ''\n f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n\n print(f'Converted to JSONL: {RAW_DATA_FILE}')\n print('\\nSample records:')\n for item in df[['instruction', 'input', 'output']].head(3).to_dict('records'):\n inp = f' {item[\"input\"]}' if item['input'] else ''\n print(f' Q: {item[\"instruction\"][:80]}{inp[:40]}')\n print(f' A: {str(item[\"output\"])[:80]}')\nelse:\n print('Alpaca dataset not found, using built-in sample data\\n')\n sample_data = [\n {'instruction': 'Translate the following sentence into French', 'input': 'The weather is nice today.', 'output': \"Il fait beau aujourd'hui.\"},\n {'instruction': 'Translate the following sentence into Spanish', 'input': 'I like programming.', 'output': 'Me gusta programar.'},\n {'instruction': 'Summarize the sentence in one short phrase', 'input': 'Machine learning is fascinating and widely used in many fields.', 'output': 'Machine learning is broadly useful.'},\n {'instruction': 'Rewrite the sentence in a more 
formal tone', 'input': 'Hello, how are you?', 'output': 'Good day. I hope you are doing well.'},\n        {'instruction': 'Introduce Python in one sentence', 'input': '', 'output': 'Python is a high-level general-purpose programming language known for its readability and rich ecosystem.'},\n        {'instruction': 'List three common sorting algorithms', 'input': '', 'output': 'Three common sorting algorithms are bubble sort, quicksort, and merge sort.'},\n        {'instruction': 'Explain what deep learning is', 'input': '', 'output': 'Deep learning is a branch of machine learning that uses multi-layer neural networks to learn hierarchical representations of data.'},\n        {'instruction': 'Write a Python function to add two numbers', 'input': '', 'output': 'def add(a, b):\\n    return a + b'},\n        {'instruction': 'Rewrite the sentence to be more concise', 'input': 'Artificial intelligence is changing the world.', 'output': 'AI is transforming the world.'},\n        {'instruction': 'What is a GPU?', 'input': '', 'output': 'A GPU is a graphics processing unit designed to accelerate highly parallel computation, especially for training and inference workloads.'},\n    ]\n    with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n        for item in sample_data:\n            f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n    print(f'Sample dataset created: {RAW_DATA_FILE}')\n    print(f'{len(sample_data)} samples in total')",
    "outputs": [],
    "execution_count": null
   },
   {
    "cell_type": "markdown",
    "id": "9c4692a2",
    "metadata": {},
    "source": [
     "## 3. Clone MindSpeed-LLM\n",
     "\n",
     "The `mindspeed_llm` Python package is already installed in the image, but the training scripts (`convert_ckpt_v2.py`, `preprocess_data.py`, `posttrain_gpt.py`, and others) must be run from the repository directory."
+ ] + }, + { + "cell_type": "code", + "id": "511c1c4d", + "metadata": {}, + "source": [ + "if MINDSPEED_LLM_DIR.exists():\n", + " print(f'Already exists: {MINDSPEED_LLM_DIR}')\n", + "else:\n", + " print('Cloning MindSpeed-LLM (shallow clone)...')\n", + " run_cmd(f'git clone --depth 1 https://gitcode.com/ascend/MindSpeed-LLM.git {MINDSPEED_LLM_DIR}')\n", + "\n", + "# Validate required scripts\n", + "scripts = [\n", + " ('Weight conversion', 'convert_ckpt_v2.py'),\n", + " ('Data preprocessing', 'preprocess_data.py'),\n", + " ('Fine-tuning', 'posttrain_gpt.py'),\n", + " ('Inference', 'inference.py'),\n", + "]\n", + "for name, script in scripts:\n", + " exists = (MINDSPEED_LLM_DIR / script).exists()\n", + " print(f' [{name}] {script}: {\"OK\" if exists else \"MISSING\"}')\n", + "\n", + "assert all((MINDSPEED_LLM_DIR / s).exists() for _, s in scripts), 'Required scripts are missing'\n", + "print('\\nScript check passed!')" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "331e0d10", + "metadata": {}, + "source": [ + "## 4. HF Weight to Megatron Weight Conversion\n", + "\n", + "Convert HuggingFace-format weights to Megatron format, split by TP/PP. The first conversion usually takes about 5-10 minutes." 
+ ] + }, + { + "cell_type": "code", + "id": "463dd7da", + "metadata": {}, + "source": [ + "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "# Check whether conversion has already been completed\n", + "converted = any(MCORE_WEIGHTS_DIR.glob('iter_*'))\n", + "\n", + "if converted:\n", + " print(f'Weights already exist, skipping conversion: {MCORE_WEIGHTS_DIR}')\n", + " for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n", + " print(f' {p.name}')\n", + "else:\n", + " convert_cmd = ' && '.join([\n", + " f'cd {MINDSPEED_LLM_DIR}',\n", + " f'python convert_ckpt_v2.py'\n", + " ' --load-model-type hf'\n", + " ' --save-model-type mg'\n", + " f' --target-tensor-parallel-size {TP}'\n", + " f' --target-pipeline-parallel-size {PP}'\n", + " f' --load-dir {HF_MODEL_DIR}'\n", + " f' --save-dir {MCORE_WEIGHTS_DIR}'\n", + " ' --model-type-hf qwen3',\n", + " ])\n", + " print('Running weight conversion (about 5-10 minutes)...')\n", + " run_cmd(convert_cmd, cwd=MINDSPEED_LLM_DIR)\n", + " print('Weight conversion completed!')\n", + " for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n", + " print(f' {p.name}')" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "419d028a", + "metadata": {}, + "source": [ + "## 5. Data Preprocessing\n", + "\n", + "Convert Alpaca-format JSONL data into the binary format required by MindSpeed-LLM training." 
+ ] + }, + { + "cell_type": "code", + "id": "f68febbf", + "metadata": {}, + "source": [ + "preprocess_cmd = ' && '.join([\n", + " f'cd {MINDSPEED_LLM_DIR}',\n", + " f'python preprocess_data.py'\n", + " f' --input {RAW_DATA_FILE}'\n", + " f' --tokenizer-name-or-path {HF_MODEL_DIR}'\n", + " f' --output-prefix {PROCESSED_DATA_PREFIX}'\n", + " f' --handler-name {HANDLER_NAME}'\n", + " f' --tokenizer-type {TOKENIZER_TYPE}'\n", + " ' --workers 4'\n", + " ' --log-interval 1'\n", + " f' --enable-thinking {ENABLE_THINKING}'\n", + " f' --prompt-type {PROMPT_TYPE}',\n", + "])\n", + "\n", + "print('Running data preprocessing...')\n", + "run_cmd(preprocess_cmd, cwd=MINDSPEED_LLM_DIR)\n", + "\n", + "# Verify outputs\n", + "print('\\nPreprocessing outputs:')\n", + "for f in sorted(PROCESSED_DATA_PREFIX.parent.glob('alpaca*')):\n", + " print(f' {f.name} ({f.stat().st_size / 1024:.1f} KB)')\n", + "print('Data preprocessing completed!')" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "67501275", + "metadata": {}, + "source": [ + "## 6. Start Fine-Tuning\n", + "\n", + "Run full-parameter SFT fine-tuning with MindSpeed-LLM. Training logs are streamed to the notebook in real time.\n", + "\n", + "> In verification mode, `TRAIN_ITERS=50`. For a full fine-tuning run, 2000+ iterations are recommended." 
+ ] + }, + { + "cell_type": "code", + "id": "16c0ef7e", + "metadata": {}, + "source": [ + "import torch\n", + "\n", + "nproc = torch.npu.device_count()\n", + "DP = nproc // (TP * PP)\n", + "GBS = DP * MBS\n", + "\n", + "LOGS_DIR.mkdir(parents=True, exist_ok=True)\n", + "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "# Environment variables\n", + "env = ' && '.join([\n", + " f'cd {MINDSPEED_LLM_DIR}',\n", + " 'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n", + " 'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n", + "])\n", + "\n", + "# Distributed torchrun arguments\n", + "distributed = ' '.join([\n", + " 'torchrun',\n", + " f'--nproc_per_node {nproc}',\n", + " '--nnodes 1 --node_rank 0',\n", + " '--master_addr localhost --master_port 6000',\n", + "])\n", + "\n", + "# Model architecture\n", + "model_args = ' '.join([\n", + " '--use-mcore-models',\n", + " '--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec',\n", + " '--kv-channels 128 --qk-layernorm',\n", + " f'--tensor-model-parallel-size {TP}',\n", + " f'--pipeline-model-parallel-size {PP}',\n", + " '--sequence-parallel --use-distributed-optimizer --use-flash-attn',\n", + " '--num-layers 36 --hidden-size 4096 --num-attention-heads 32',\n", + " '--ffn-hidden-size 12288 --max-position-embeddings 32768',\n", + " f'--seq-length {SEQ_LENGTH}',\n", + " '--make-vocab-size-divisible-by 1 --padded-vocab-size 151936',\n", + " '--rotary-base 1000000 --use-rotary-position-embeddings',\n", + "])\n", + "\n", + "# Training hyperparameters\n", + "train_args = ' '.join([\n", + " f'--micro-batch-size {MBS} --global-batch-size {GBS}',\n", + " '--disable-bias-linear --swiglu',\n", + " f'--train-iters {TRAIN_ITERS}',\n", + " '--tokenizer-type PretrainedFromHF',\n", + " f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n", + " '--normalization RMSNorm --position-embedding-type rope',\n", + " '--norm-epsilon 1e-6 --hidden-dropout 0 --attention-dropout 0',\n", + " '--no-gradient-accumulation-fusion 
--attention-softmax-in-fp32',\n", + " '--exit-on-missing-checkpoint --no-masked-softmax-fusion',\n", + " '--group-query-attention --untie-embeddings-and-output-weights',\n", + " '--num-query-groups 8',\n", + " f'--min-lr {MIN_LR} --lr {LR}',\n", + " '--weight-decay 1e-1 --clip-grad 1.0',\n", + " '--adam-beta1 0.9 --adam-beta2 0.95 --initial-loss-scale 4096',\n", + " '--no-load-optim --no-load-rng --seed 42 --bf16',\n", + "])\n", + "\n", + "# Data and outputs\n", + "data_args = ' '.join([\n", + " f'--data-path {PROCESSED_DATA_PREFIX}',\n", + " '--split 100,0,0',\n", + " '--log-interval 1',\n", + " f'--save-interval {TRAIN_ITERS}',\n", + " f'--eval-interval {TRAIN_ITERS} --eval-iters 0',\n", + "])\n", + "\n", + "# Fine-tuning configuration\n", + "tune_args = ' '.join([\n", + " '--finetune --stage sft --is-instruction-dataset',\n", + " '--prompt-type qwen3 --no-pad-to-seq-lengths',\n", + " '--distributed-backend nccl',\n", + " f'--load {MCORE_WEIGHTS_DIR} --save {OUTPUT_DIR}',\n", + " '--transformer-impl local',\n", + " '--no-save-optim --no-save-rng',\n", + "])\n", + "\n", + "cmd = f'{env} && {distributed} posttrain_gpt.py {model_args} {train_args} {data_args} {tune_args}'\n", + "\n", + "print(f'Training configuration: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}')\n", + "print(f'GBS={GBS}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\n", + "print(f'\\nStarting training...\\n')\n", + "run_cmd(cmd, cwd=MINDSPEED_LLM_DIR)\n", + "print(f'\\nTraining completed! Weights saved to: {OUTPUT_DIR}')" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "d077bc56", + "metadata": {}, + "source": [ + "## 7. Inference Validation\n", + "\n", + "Load the fine-tuned weights and run a generation test." 
+ ] + }, + { + "cell_type": "code", + "id": "09ae43f0", + "metadata": {}, + "source": "import os\n\nnproc = torch.npu.device_count()\n\nenv = ' && '.join([\n f'cd {MINDSPEED_LLM_DIR}',\n 'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n])\n\ndistributed = ' '.join([\n 'torchrun',\n f'--nproc_per_node {nproc}',\n '--nnodes 1 --node_rank 0',\n '--master_addr localhost --master_port 6001',\n])\n\ninfer_args = ' '.join([\n '--use-mcore-models',\n '--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec',\n '--qk-layernorm',\n f'--tensor-model-parallel-size {TP}',\n f'--pipeline-model-parallel-size {PP}',\n '--num-layers 36 --hidden-size 4096 --num-attention-heads 32',\n '--ffn-hidden-size 12288',\n f'--max-position-embeddings {SEQ_LENGTH} --seq-length {SEQ_LENGTH}',\n '--disable-bias-linear',\n '--group-query-attention --num-query-groups 8',\n '--swiglu --use-fused-swiglu',\n '--normalization RMSNorm --norm-epsilon 1e-6 --use-fused-rmsnorm',\n '--position-embedding-type rope --rotary-base 1000000 --use-fused-rotary-pos-emb',\n '--make-vocab-size-divisible-by 1 --padded-vocab-size 151936',\n '--micro-batch-size 1 --max-new-tokens 256',\n '--tokenizer-type PretrainedFromHF',\n f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n '--tokenizer-not-use-fast',\n '--hidden-dropout 0 --attention-dropout 0',\n '--untie-embeddings-and-output-weights',\n '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n '--seed 42',\n f'--load {OUTPUT_DIR}',\n '--exit-on-missing-checkpoint --transformer-impl local',\n])\n\ncmd = f'{env} && {distributed} inference.py {infer_args}'\nfull_cmd = f'source {CANN_ENV} && source {ATB_ENV} && {cmd}'\n\nprint('Starting inference...\\n')\nrun_env = os.environ.copy()\nrun_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\nresult = subprocess.run(\n ['bash', '-lc', full_cmd],\n cwd=str(MINDSPEED_LLM_DIR),\n text=True,\n input='q\\n', # Exit interactive chat mode automatically after inference.py finishes the default 4 generation rounds and enters input(); 
sending q terminates it\n env=run_env,\n)\nif result.returncode != 0:\n print(f'\\nInference return code: {result.returncode}')\nprint('\\nInference completed!')", + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "f87ecc9d", + "metadata": {}, + "source": [ + "## Using a Real Dataset\n", + "\n", + "After verification succeeds, use the following steps for full fine-tuning with a real dataset:\n", + "\n", + "1. **Prepare the data**: place an Alpaca/ShareGPT/Pairwise dataset inside the container\n", + " - Alpaca: `{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}`\n", + " - Change `HANDLER_NAME` to the matching handler\n", + "\n", + "2. **Tune the parameters**:\n", + " - `SEQ_LENGTH = 4096` to match the model context length\n", + " - `TRAIN_ITERS = 2000+` adjusted to the dataset size\n", + " - `GBS` adjusted to the NPU count and dataset size\n", + "\n", + "3. **Checkpoint interval**: change `--save-interval` in the training cell to save checkpoints periodically\n", + "\n", + "4. 
**enable-thinking**:\n", + " - `true` to process all data with slow-thinking mode\n", + " - `false` to process all data with fast-thinking mode\n", + " - `none` to mix fast and slow thinking (default)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.12", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file From 0312ef25518ca0fc512d046929fb954bd8ab23a6 Mon Sep 17 00:00:00 2001 From: zgsu Date: Wed, 15 Apr 2026 11:02:29 +0800 Subject: [PATCH 2/3] docs: add MindSpore CANN workbench guidance --- docs/en/workbench/how_to/create_workbench.mdx | 17 +- ...e-tune-and-pretrain-llms-on-ascend-npu.mdx | 103 ++- docs/public/qwen3_0.6b_finetune_verify.ipynb | 864 ++++++++++++++++++ 3 files changed, 957 insertions(+), 27 deletions(-) create mode 100644 docs/public/qwen3_0.6b_finetune_verify.ipynb diff --git a/docs/en/workbench/how_to/create_workbench.mdx b/docs/en/workbench/how_to/create_workbench.mdx index 3c82a48..68c8b62 100644 --- a/docs/en/workbench/how_to/create_workbench.mdx +++ b/docs/en/workbench/how_to/create_workbench.mdx @@ -82,7 +82,7 @@ The following images are available out of the box: #### Multi-architecture images (`x86_64` and `arm64`) | Image name | Description | Main packages | -| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | **Minimal Python**
[alauda-workbench-jupyter-minimal-cpu-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-jupyter-minimal-cpu-py312-ubi9) | Use this image if you want a lightweight Jupyter workbench and plan to install project-specific packages yourself. | `Python 3.12`
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`JupyterLab Git 0.52.0`
`nbdime 4.0.4`
`nbgitpuller 1.2.2` | | **Standard Data Science**
[alauda-workbench-jupyter-datascience-cpu-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-jupyter-datascience-cpu-py312-ubi9) | Use this image for general data science work that does not require a framework-specific GPU image. | `Python 3.12`
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`NumPy 2.4.3`
`pandas 2.3.3`
`SciPy 1.16.3`
`scikit-learn 1.8.0`
`Matplotlib 3.10.8`
`Plotly 6.5.2`
`KFP 2.15.2`
`Kubeflow Training 1.9.3`
`Feast 0.60.0`
`CodeFlare SDK 0.35.0`
`ODH Elyra 4.3.2` | | **code-server**
[alauda-workbench-codeserver-datascience-cpu-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-codeserver-datascience-cpu-py312-ubi9) | Use this image if you prefer a VS Code-like IDE for data science development. Elyra-based pipelines are not available with this image. | `Python 3.12`
`code-server 4.106.3`
`Python extension 2026.0.0`
`Jupyter extension 2025.9.1`
`ipykernel 7.2.0`
`debugpy 1.8.20`
`NumPy 2.4.3`
`pandas 2.3.3`
`scikit-learn 1.8.0`
`SciPy 1.16.3`
`KFP 2.15.2`
`Feast 0.60.0`
`virtualenv 21.1.0`
`ripgrep 15.0.0` | @@ -96,7 +96,7 @@ The following images are available on Docker Hub but are **not built into the pl These images are intended for `x86_64` nodes with NVIDIA GPU support. | Image name | Description | Main packages | -| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | **TensorFlow**
[alaudadockerhub/odh-workbench-jupyter-tensorflow-cuda-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/odh-workbench-jupyter-tensorflow-cuda-py312-ubi9) | Use this image for TensorFlow model development and training on NVIDIA GPUs. | `Python 3.12`
`CUDA base image 12.9`
`TensorFlow 2.20.0+redhat`
`TensorBoard 2.20.0`
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`NumPy 2.4.3`
`pandas 2.3.3` | | **PyTorch LLM Compressor**
[alaudadockerhub/odh-workbench-jupyter-pytorch-llmcompressor-cuda-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/odh-workbench-jupyter-pytorch-llmcompressor-cuda-py312-ubi9) | Use this image for PyTorch-based LLM compression and optimization on NVIDIA GPUs. | `Python 3.12`
`CUDA base image 12.9`
`PyTorch 2.9.1`
`torchvision 0.24.1`
`TensorBoard 2.20.0`
`llmcompressor 0.9.0.2`
`transformers 4.57.3`
`datasets 4.4.1`
`accelerate 1.12.0`
`compressed-tensors 0.13.0`
`nvidia-ml-py 13.590.44`
`lm-eval 0.4.11` | | **PyTorch**
[alaudadockerhub/odh-workbench-jupyter-pytorch-cuda-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/odh-workbench-jupyter-pytorch-cuda-py312-ubi9) | Use this image for PyTorch model development and training on NVIDIA GPUs. | `Python 3.12`
`CUDA base image 12.9`
`PyTorch 2.9.1`
`torchvision 0.24.1`
`TensorBoard 2.20.0`
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`onnxscript 0.6.2` | @@ -106,17 +106,18 @@ These images are intended for `x86_64` nodes with NVIDIA GPU support. These images are intended for `arm64` nodes with Ascend NPU support. -| Image name | Description | Main packages | -| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **CANN Minimal Python**
[alauda-workbench-jupyter-minimal-cann-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-jupyter-minimal-cann-py312-ubi9) | Use this image if you need a lightweight Jupyter base image with Ascend CANN support. | `Python 3.12`
`CANN 8.5.0`
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`JupyterLab Git 0.51.4`
`nbdime 4.0.4`
`nbgitpuller 1.2.2` | -| **PyTorch CANN**
[alauda-workbench-jupyter-pytorch-cann-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9) | Use this image for PyTorch model development and training on Ascend NPUs. | `Python 3.12`
`CANN 8.5.0`
`PyTorch 2.9.0`
`torch_npu 2.9.0` (Ascend release `7.3.0`)
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`TensorBoard 2.20.0`
`Ray 2.54.0`
`onnxscript 0.6.2`
`NumPy 2.4.3`
`pandas 2.3.3`
`scikit-learn 1.8.0`
`SciPy 1.16.3`
`KFP 2.15.2`
`Feast 0.60.0` | +| Image name | Description | Main packages | +|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **CANN Minimal Python**
[alauda-workbench-jupyter-minimal-cann-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-jupyter-minimal-cann-py312-ubi9) | Use this image if you need a lightweight Jupyter base image with Ascend CANN support. | `Python 3.12`
`CANN 8.5.0`
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`JupyterLab Git 0.51.4`
`nbdime 4.0.4`
`nbgitpuller 1.2.2` | +| **PyTorch CANN**
[alauda-workbench-jupyter-pytorch-cann-py312-ubi9](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9) | Use this image for PyTorch model development and training on Ascend NPUs. | `Python 3.12`
`CANN 8.5.0`
`PyTorch 2.9.0`
`torch_npu 2.9.0` (Ascend release `7.3.0`)
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`TensorBoard 2.20.0`
`Ray 2.54.0`
`onnxscript 0.6.2`
`NumPy 2.4.3`
`pandas 2.3.3`
`scikit-learn 1.8.0`
`SciPy 1.16.3`
`KFP 2.15.2`
`Feast 0.60.0` | +| **MindSpore CANN**
[docker.io/alaudadockerhub/alauda-workbench-jupyter-mindspore-cann-py312-ubi9:v0.1.7](https://hub.docker.com/r/alaudadockerhub/alauda-workbench-jupyter-mindspore-cann-py312-ubi9) | Use this image for MindSpore model development, checkpoint conversion, and training on Ascend NPUs. | `Python 3.12`
`CANN 8.5.0`
`MindSpore 2.8.0`
`JupyterLab 4.5.6`
`Jupyter Server 2.17.0`
`TensorBoard 2.20.0`
`ODH Elyra 4.3.2`
`onnxscript 0.6.2`
`KFP 2.15.2`
`Kubeflow Training 1.9.3`
`pandas 2.3.3`
`scikit-learn 1.8.0`
`SciPy 1.16.3` | To use an additional image, first synchronize it to your own image registry. You can do this with a tool such as `skopeo`, or by using the script described in the next section. ## Docker Hub Image Synchronization Script Guide [sync-from-dockerhub.sh](/sync-from-dockerhub.sh) is an automated tool for synchronizing selected Docker Hub images, especially very large images, to a private image registry such as Harbor. -Large images are more likely to encounter Out-Of-Memory (OOM) or timeout failures during direct transfer because of network fluctuations. To improve reliability, the script uses a relay workflow: **pull locally -> export as a tar archive -> push the tar archive to the target registry**. It also cleans up temporary files automatically when the task completes or exits unexpectedly. +Large images are more likely to encounter Out-Of-Memory (OOM) or timeout failures during direct transfer because of network fluctuations. To improve reliability, the script uses a relay workflow: **pull locally → export as a tar archive → push the tar archive to the target registry**. It also cleans up temporary files automatically when the task completes or exits unexpectedly. 
### Script Prerequisites @@ -133,7 +134,7 @@ The script executes synchronization by reading environment variables, providing ### Required Parameters (Target Private Registry Configuration) | Environment Variable | Description | Example Value | -| :------------------- | :-------------------------------------------------------------------- | :----------------------- | +|:---------------------|:----------------------------------------------------------------------|:-------------------------| | `TARGET_REGISTRY` | Address of the target private image registry | `build-harbor.alauda.cn` | | `TARGET_PROJECT` | Specific project/namespace in the target registry to store the images | `mlops/workbench-images` | | `TARGET_USER` | Username for logging into the target registry | `admin` | diff --git a/docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx b/docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx index 875e91f..375e927 100644 --- a/docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx +++ b/docs/en/workbench/how_to/fine-tune-and-pretrain-llms-on-ascend-npu.mdx @@ -6,14 +6,15 @@ weight: 27 ## Background -This guide describes a Workbench-based solution for running large model fine-tuning and pretraining on `arm64` nodes with Huawei Ascend NPU. The solution uses the `PyTorch CANN` workbench image, which is built for Ascend environments and includes `Python 3.12`, `CANN 8.5.0`, `PyTorch 2.9.0`, and `torch_npu 2.9.0`. +This guide describes Workbench-based solutions for running large model fine-tuning and pretraining on `arm64` nodes with Huawei Ascend NPU. The main validation flow uses the `PyTorch CANN` workbench image, which is built for Ascend environments and includes `Python 3.12`, `CANN 8.5.0`, `PyTorch 2.9.0`, and `torch_npu 2.9.0`. This page also includes a MindSpore-based fine-tuning flow that uses the `MindSpore CANN` workbench image. 
-The workflow is centered on two verification notebooks: +The workflow is centered on three verification notebooks: -- [Download `qwen3_finetune_verify.ipynb`](/qwen3_finetune_verify.ipynb) for full-parameter supervised fine-tuning of `Qwen3-8B` -- [Download `qwen25_pretrain_verify.ipynb`](/qwen25_pretrain_verify.ipynb) for pretraining `Qwen2.5-7B` +- [Download `qwen3_finetune_verify.ipynb`](/qwen3_finetune_verify.ipynb) for full-parameter supervised fine-tuning of `Qwen3-8B` in the `PyTorch CANN` Jupyter image +- [Download `qwen25_pretrain_verify.ipynb`](/qwen25_pretrain_verify.ipynb) for pretraining `Qwen2.5-7B` in the `PyTorch CANN` Jupyter image +- [Download `qwen3_0.6b_finetune_verify.ipynb`](/qwen3_0.6b_finetune_verify.ipynb) for MindSpore-based full-parameter fine-tuning of `Qwen3-0.6B` in the `MindSpore CANN` Jupyter image -Both notebooks use `MindSpeed-LLM` and are designed as validation-first examples. They begin with a lightweight configuration so that you can confirm the runtime, model loading, preprocessing, and distributed launch path before scaling the same workflow to a real training run. +All notebooks are designed as validation-first examples. They begin with a lightweight configuration so that you can confirm the runtime, model loading, preprocessing, and distributed launch path before scaling the same workflow to a real training run. The PyTorch-based notebooks run on top of `MindSpeed-LLM`, while the MindSpore notebook validates the bundled `MindSpeed-Core-MS` and `MindSpeed-LLM` source tree shipped in the image. Unlike the VolcanoJob-based flow used in some other examples, this solution runs training directly inside a single Workbench container with multiple Ascend NPUs attached. @@ -21,34 +22,40 @@ Unlike the VolcanoJob-based flow used in some other examples, this solution runs Make sure the cluster already provides an operational Ascend runtime. 
In practice, this means the Ascend driver, CANN runtime, and Kubernetes device plugin are already installed and working, and your workbench can be scheduled to an `arm64` node with Ascend NPU resources. -Create the workbench with the `PyTorch CANN` image described in [Creating a Workbench](./create_workbench.mdx). For the default notebook settings, plan for at least `4` NPUs. The verification notebooks also create converted model weights, preprocessing outputs, logs, and checkpoints, so the workspace should use persistent storage with enough free capacity for both the original HuggingFace model and the converted Megatron/MCore weights. +Create the workbench with the image that matches the notebook you want to run, as described in [Creating a Workbench](./create_workbench.mdx): -The notebooks clone `MindSpeed-LLM` from `https://gitcode.com/ascend/MindSpeed-LLM.git` during execution. If the workbench cannot reach that repository, place a local copy in the workspace and update the notebook path in the first parameter cell. +- Use `PyTorch CANN` for `qwen3_finetune_verify.ipynb` and `qwen25_pretrain_verify.ipynb`. +- Use `MindSpore CANN` for `qwen3_0.6b_finetune_verify.ipynb`. + +For the default notebook settings, plan for at least `4` NPUs for the PyTorch-based examples. The MindSpore verification notebook is tuned for `2` Ascend `910B` `32G` NPUs with `TP=1`, `PP=1`, and `MBS=2`. All notebooks create converted model weights, preprocessing outputs, logs, and checkpoints, so the workspace should use persistent storage with enough free capacity for both the original HuggingFace model and the converted Megatron/MCore weights. + +The PyTorch-based notebooks clone `MindSpeed-LLM` from `https://gitcode.com/ascend/MindSpeed-LLM.git` during execution. If the workbench cannot reach that repository, place a local copy in the workspace and update the notebook path in the first parameter cell. The MindSpore notebook does not clone extra repositories at runtime. 
Instead, it uses the bundled source tree under `/opt/app-root/share/MindSpeed-Core-MS`. ## Create the Workbench -Create a JupyterLab workbench on the Ascend node pool and select the `PyTorch CANN` image. Keep the workspace on persistent storage so that notebooks, converted weights, and training outputs remain available after restart. If you follow the notebook defaults, request enough NPU resources to satisfy the configured tensor and pipeline parallelism. +Create a JupyterLab workbench on the Ascend node pool and select the image that matches the notebook you plan to run. Use `PyTorch CANN` for the `Qwen3-8B` fine-tuning and `Qwen2.5-7B` pretraining notebooks, or `MindSpore CANN` for the `Qwen3-0.6B` MindSpore fine-tuning notebook. Keep the workspace on persistent storage so that notebooks, converted weights, and training outputs remain available after restart. If you follow the notebook defaults, request enough NPU resources to satisfy the configured tensor and pipeline parallelism. For the detailed creation steps and image selection, see [Creating a Workbench](./create_workbench.mdx). ## Import the Verification Notebooks -Upload the two notebooks into the JupyterLab workspace and open them there. If your image distribution already exposes the notebooks in the workspace, you can use them directly. Otherwise, download them from the links above and upload them through the JupyterLab file browser. The JupyterLab upload workflow is described in [Creating a Workbench](./create_workbench.mdx). +Upload the notebook or notebooks you plan to use into the JupyterLab workspace and open them there. If your image distribution already exposes the notebooks in the workspace, you can use them directly. Otherwise, download them from the links above and upload them through the JupyterLab file browser. The JupyterLab upload workflow is described in [Creating a Workbench](./create_workbench.mdx). 
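Before opening the notebooks, it can help to confirm that the workbench actually sees the NPUs you requested. The sketch below mirrors the environment-variable fallback used by the verification notebooks' environment-check cells; the `TP` and `PP` values are placeholders for whatever parallelism you plan to configure.

```python
import os

def visible_npu_count() -> int:
    """Estimate the visible NPU count from Ascend environment variables.

    This mirrors the fallback the verification notebooks use when the
    framework-level device query is unavailable.
    """
    visible = os.environ.get("ASCEND_RT_VISIBLE_DEVICES") or os.environ.get("ASCEND_VISIBLE_DEVICES")
    if visible:
        return len([d for d in visible.split(",") if d.strip()])
    return int(os.environ.get("RANK_SIZE", "1"))

TP, PP = 2, 2  # example values; use the parallelism configured in your notebook
npus = visible_npu_count()
print(f"visible NPUs: {npus}, required for TP*PP: {TP * PP}")
```

If the reported count is smaller than `TP * PP`, adjust the workbench resource request or reduce the parallelism before running the training cells.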
## Prepare the Base Model -Both notebooks expect a HuggingFace-format base model in the workspace. The default paths are: +All three notebooks expect a HuggingFace-format base model in the workspace. The default paths are: | Notebook | Variable | Default path | |----------|----------|--------------| | Fine-tuning | `HF_MODEL_DIR` | `/opt/app-root/src/models/Qwen3-8B` | | Pretraining | `HF_MODEL_DIR` | `/opt/app-root/src/models/Qwen2.5-7B` | +| MindSpore fine-tuning | `HF_MODEL_DIR` | `/opt/app-root/src/models/Qwen3-0.6B` | You can place the model files in those directories or change `HF_MODEL_DIR` in the first parameter cell. Before running the notebook, verify that the target directory contains the expected model configuration, tokenizer files, and weight files. If you want the model to be versioned and reusable across workbenches, upload it to the platform model repository first and then clone or copy it into the workspace. The repository-based upload flow is documented in [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx). -> **Note:** Both notebooks convert HuggingFace weights to Megatron/MCore format before training starts. This conversion creates another large set of files, so storage planning matters. +> **Note:** All three notebooks convert HuggingFace weights to Megatron/MCore format before training starts. This conversion creates another large set of files, so storage planning matters. 
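As a quick sanity check before running a notebook, you can verify the model directory along the same lines as the notebooks' own environment checks. The helper name below is illustrative, and the path is the fine-tuning default; substitute your own `HF_MODEL_DIR`.

```python
from pathlib import Path

REQUIRED_FILES = ["config.json", "tokenizer.json", "tokenizer_config.json"]

def missing_model_files(model_dir):
    """Return a list of required HuggingFace files missing from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Weight shards may be .safetensors or .bin depending on how the model was exported.
    if not list(root.glob("*.safetensors")) and not list(root.glob("*.bin")):
        missing.append("weight files (*.safetensors or *.bin)")
    return missing

problems = missing_model_files("/opt/app-root/src/models/Qwen3-8B")
print("model directory OK" if not problems else f"missing: {problems}")
```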
## Prepare the Dataset @@ -63,6 +70,11 @@ By default, the notebook looks for: - `ALPACA_PARQUET = /opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet` - `RAW_DATA_FILE = /opt/app-root/src/Qwen3-8B-work-dir/finetune_dataset/alpaca_sample.jsonl` +The MindSpore fine-tuning notebook uses the same Alpaca-style schema but writes the converted JSONL to a different working directory: + +- `ALPACA_PARQUET = /opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet` +- `RAW_DATA_FILE = /opt/app-root/src/Qwen3-0.6B-work-dir/finetune_dataset/alpaca_sample.jsonl` + If the parquet file exists, the notebook converts it to JSONL automatically. If you already have JSONL instruction data, place it at `RAW_DATA_FILE` or update the variable to your actual path. The expected Alpaca-style JSONL record is: @@ -89,7 +101,7 @@ For small test files, uploading directly in JupyterLab is usually enough. For la ## Run the Fine-tuning Notebook -Open `qwen3_finetune_verify.ipynb` and start with the first parameter cell. That cell controls the model path, dataset path, output location, sequence length, training iterations, and the tensor and pipeline parallelism used during both weight conversion and training. +Open `qwen3_finetune_verify.ipynb` in a workbench that uses the `PyTorch CANN` Jupyter image and start with the first parameter cell. That cell controls the model path, dataset path, output location, sequence length, training iterations, and the tensor and pipeline parallelism used during both weight conversion and training. The notebook follows a straightforward progression. It first checks the Ascend runtime and confirms that `torch_npu`, `MindSpeed`, and `MindSpeed-LLM` are available. 
It then prepares a small Alpaca-style dataset or loads your real one, clones the `MindSpeed-LLM` repository, converts the HuggingFace checkpoint into Megatron/MCore format, preprocesses the data into the format required by `MindSpeed-LLM`, launches full-parameter SFT with `posttrain_gpt.py`, and finally runs an inference check against the generated checkpoint. @@ -107,9 +119,60 @@ The most important parameters to review are: For real fine-tuning, the notebook guidance is to increase `SEQ_LENGTH` to match the model context window, increase `TRAIN_ITERS` to a production-sized value, and adjust parallelism and batch sizing according to the available NPUs and the size of the training set. If you want periodic checkpoints, also update the save interval in the training cell. +## Run the MindSpore Fine-Tuning Notebook + +Open `qwen3_0.6b_finetune_verify.ipynb` in a workbench that uses the `MindSpore CANN` Jupyter image. This notebook validates the official Qwen3 MindSpore full-parameter fine-tuning path and mirrors the upstream workflow implemented by the following scripts in `MindSpeed-LLM`: + +- `examples/mindspore/qwen3/ckpt_convert_qwen3_hf2mcore.sh` +- `examples/mindspore/qwen3/data_convert_qwen3_instruction.sh` +- `examples/mindspore/qwen3/tune_qwen3_0point6b_4K_full_ms.sh` + +Unlike the PyTorch-based notebook, this flow uses the image-bundled source tree under `/opt/app-root/share/MindSpeed-Core-MS`. It checks the bundled `MindSpeed-LLM`, `MindSpeed`, `MSAdapter`, and `Megatron-LM` directories, and also validates that the image exposes the Ascend environment scripts and the expected `PYTHONPATH` entries before training starts. 
+ +The default validation configuration in the first parameter cell is: + +- `HF_MODEL_DIR=/opt/app-root/src/models/Qwen3-0.6B` +- `WORK_DIR=/opt/app-root/src/Qwen3-0.6B-work-dir` +- `RAW_DATA_FILE=/opt/app-root/src/Qwen3-0.6B-work-dir/finetune_dataset/alpaca_sample.jsonl` +- `OUTPUT_DIR=/opt/app-root/src/Qwen3-0.6B-work-dir/output/qwen3_0.6b_finetuned` +- `TP=1`, `PP=1`, `MBS=2` +- `SEQ_LENGTH=2048` +- `TRAIN_ITERS=100` +- `ENABLE_THINKING=true` + +The notebook follows this sequence: + +1. It validates the runtime environment. + It checks `mindspore`, `msadapter`, `mindspeed`, and `mindspeed_llm`, confirms the model directory contains `config.json`, tokenizer files, and `.safetensors` weights, and verifies that the available NPU count is compatible with the configured `TP` and `PP`. +2. It prepares the instruction dataset. + If `ALPACA_PARQUET` exists, the notebook converts it into Alpaca-style JSONL. Otherwise, it creates a small built-in sample dataset so that the full pipeline can still be verified. +3. It converts HuggingFace weights to MindSpeed/MCore format. + The notebook calls `mindspeed_llm/mindspore/convert_ckpt.py` with `--load-model-type hf`, `--save-model-type mg`, and `--ai-framework mindspore`, and writes the converted weights to a `TP` and `PP` specific output directory. +4. It preprocesses the fine-tuning dataset. + The notebook runs `preprocess_data.py` with `AlpacaStyleInstructionHandler`, `PretrainedFromHF`, `prompt-type qwen3`, and `enable-thinking true`, then checks that the required `.bin` and `.idx` files were generated. +5. It launches full-parameter SFT training. + The training cell uses `msrun` together with `posttrain_gpt.py`, sets `--finetune`, `--stage sft`, `--is-instruction-dataset`, `--ai-framework mindspore`, and `--ckpt-format msadapter`, and writes logs to `/opt/app-root/src/Qwen3-0.6B-work-dir/logs`. +6. It validates the generated checkpoint. 
+ The final cell checks `latest_checkpointed_iteration.txt`, lists `iter_*` checkpoint directories, and confirms that the training log file exists. + +The most important parameters to review are: + +- `HF_MODEL_DIR` +- `ALPACA_PARQUET` or `RAW_DATA_FILE` +- `OUTPUT_DIR` +- `TP`, `PP`, and `MBS` +- `SEQ_LENGTH` +- `TRAIN_ITERS` +- `ENABLE_THINKING` +- `MASTER_ADDR`, `MASTER_PORT`, `NNODES`, and `NODE_RANK` + +When you move from validation to a real training run, increase `SEQ_LENGTH`, `TRAIN_ITERS`, and dataset size gradually based on the available NPU memory and the target context length. If you change `TP` or `PP`, rerun weight conversion so that the converted checkpoint layout matches the training layout. For multi-node training, update `MASTER_ADDR`, `MASTER_PORT`, `NNODES`, and `NODE_RANK` before running the training cell again. + +The current notebook validates checkpoint generation only. It does not include a stable MindSpore inference step. + ## Run the Pretraining Notebook -Open `qwen25_pretrain_verify.ipynb` and review the first parameter cell in the same way. This notebook uses a raw text corpus rather than instruction-response records, but the overall structure is similar. +Open `qwen25_pretrain_verify.ipynb` in a workbench that uses the `PyTorch CANN` Jupyter image and review the first parameter cell in the same way. This notebook uses a raw text corpus rather than instruction-response records, but the overall structure is similar. It begins with an environment check, prepares a sample text dataset or loads your real one, clones the `MindSpeed-LLM` repository, converts the HuggingFace checkpoint into Megatron/MCore format, preprocesses the raw text into `.bin` and `.idx` files, and launches pretraining with `pretrain_gpt.py`. 
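Both training flows finish by writing a Megatron-style checkpoint tree. A minimal sketch of that final validation step, assuming the MindSpore notebook's default output path:

```python
from pathlib import Path

def checkpoint_status(output_dir):
    """Summarize a Megatron-style checkpoint directory.

    Reads latest_checkpointed_iteration.txt (the iteration tracker written by
    the training loop) and lists the iter_* checkpoint directories.
    """
    root = Path(output_dir)
    tracker = root / "latest_checkpointed_iteration.txt"
    return {
        "latest": tracker.read_text().strip() if tracker.is_file() else None,
        "iterations": sorted(p.name for p in root.glob("iter_*") if p.is_dir()),
    }

print(checkpoint_status("/opt/app-root/src/Qwen3-0.6B-work-dir/output/qwen3_0.6b_finetuned"))
```

A `latest` value of `None` or an empty `iterations` list means training never reached a save point, which usually points at the training log rather than the checkpoint directory.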
@@ -130,16 +193,18 @@ When you move from validation to a real pretraining job, increase the sequence l By default, the notebooks write their outputs to the following locations: -| Notebook | Default output path | -|----------|---------------------| -| Fine-tuning | `/opt/app-root/src/Qwen3-8B-work-dir/output/qwen3_8b_finetuned` | -| Pretraining | `/opt/app-root/src/Qwen2.5-7B-work-dir/output/qwen25_7b_pretrained` | +| Notebook and required Jupyter image | Default output path | +|-------------------------------------|---------------------| +| `qwen3_finetune_verify.ipynb` in the `PyTorch CANN` Jupyter image | `/opt/app-root/src/Qwen3-8B-work-dir/output/qwen3_8b_finetuned` | +| `qwen25_pretrain_verify.ipynb` in the `PyTorch CANN` Jupyter image | `/opt/app-root/src/Qwen2.5-7B-work-dir/output/qwen25_7b_pretrained` | +| `qwen3_0.6b_finetune_verify.ipynb` in the `MindSpore CANN` Jupyter image | `/opt/app-root/src/Qwen3-0.6B-work-dir/output/qwen3_0.6b_finetuned` | Keep these directories on persistent storage. The outputs can be large, and in most real workflows you will want to preserve them after the workbench restarts or publish them for later use. If you want to push the resulting model back to the model repository, follow the Git LFS workflow in [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx). ## Operational Notes - These notebooks are verification examples first. Do not leave the default iteration count and sequence length unchanged for real training. -- The fine-tuning notebook runs full-parameter SFT rather than LoRA. +- The fine-tuning notebooks run full-parameter SFT rather than LoRA. +- `qwen3_0.6b_finetune_verify.ipynb` requires the `MindSpore CANN` workbench image and the bundled `MindSpeed-Core-MS` source tree. - The selected parallel configuration affects memory usage, weight conversion, and runtime layout. If you change `TP` or `PP`, reconvert the weights before training. 
-- In offline or restricted environments, prepare the `MindSpeed-LLM` repository and required model and dataset files in advance and place them directly in the workspace. +- In offline or restricted environments, prepare the `MindSpeed-LLM` repository for the PyTorch-based notebooks and place the required model and dataset files directly in the workspace. The MindSpore notebook uses the bundled source tree from the image. diff --git a/docs/public/qwen3_0.6b_finetune_verify.ipynb b/docs/public/qwen3_0.6b_finetune_verify.ipynb new file mode 100644 index 0000000..689050c --- /dev/null +++ b/docs/public/qwen3_0.6b_finetune_verify.ipynb @@ -0,0 +1,864 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Qwen3-0.6B Full-Parameter Fine-Tuning\n", + "\n", + "This notebook targets the **Ascend 910B CANN MindSpore image** and follows the upstream Qwen3 MindSpore fine-tuning flow from `MindSpeed-LLM/docs/zh/mindspore/quick_start.md`.\n", + "\n", + "To match the image-bundled source tree and this verification scenario, the notebook inlines commands equivalent to the official scripts:\n", + "- `examples/mindspore/qwen3/ckpt_convert_qwen3_hf2mcore.sh`\n", + "- `examples/mindspore/qwen3/data_convert_qwen3_instruction.sh`\n", + "- `examples/mindspore/qwen3/tune_qwen3_0point6b_4K_full_ms.sh`\n", + "\n", + "**Workflow:**\n", + "1. Environment checks\n", + "2. Prepare the instruction dataset\n", + "3. Verify the bundled MindSpeed-Core-MS/MindSpeed-LLM source tree\n", + "4. Convert HF weights to MindSpeed/Mcore format\n", + "5. Preprocess the fine-tuning data\n", + "6. Launch full-parameter SFT fine-tuning\n", + "7. Validate the output checkpoint\n", + "\n", + "> The default parameters are conservative and intended for image validation. 
For longer training closer to the upstream baseline, increase `SEQ_LENGTH`, `TRAIN_ITERS`, and `MBS` as needed.\n" + ], + "id": "bf537aee6af17427" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 0. Parameters\n" + ], + "id": "24a8a8925043c5aa" + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import os\n", + "import warnings\n", + "warnings.filterwarnings('ignore', category=DeprecationWarning)\n", + "warnings.filterwarnings('ignore', category=ImportWarning)\n", + "warnings.filterwarnings('ignore', category=UserWarning)\n", + "warnings.filterwarnings('ignore', category=FutureWarning)\n", + "\n", + "from pathlib import Path\n", + "\n", + "# ===== Path configuration =====\n", + "MINDSPEED_CORE_MS_DEFAULT_PATH = Path('/opt/app-root/share/MindSpeed-Core-MS')\n", + "MINDSPEED_CORE_MS_PATH = Path(\n", + " os.environ.get('MINDSPEED_CORE_MS_PATH', str(MINDSPEED_CORE_MS_DEFAULT_PATH))\n", + ")\n", + "MINDSPEED_LLM_DIR = MINDSPEED_CORE_MS_PATH / 'MindSpeed-LLM'\n", + "MINDSPEED_DIR = MINDSPEED_CORE_MS_PATH / 'MindSpeed'\n", + "MSADAPTER_DIR = MINDSPEED_CORE_MS_PATH / 'MSAdapter'\n", + "MEGATRON_DIR = MINDSPEED_CORE_MS_PATH / 'Megatron-LM'\n", + "SET_PATH_SCRIPT = MINDSPEED_CORE_MS_PATH / 'tests' / 'scripts' / 'set_path.sh'\n", + "\n", + "OFFICIAL_CONVERT_SCRIPT = MINDSPEED_LLM_DIR / 'examples' / 'mindspore' / 'qwen3' / 'ckpt_convert_qwen3_hf2mcore.sh'\n", + "OFFICIAL_PREPROCESS_SCRIPT = MINDSPEED_LLM_DIR / 'examples' / 'mindspore' / 'qwen3' / 'data_convert_qwen3_instruction.sh'\n", + "OFFICIAL_TUNE_SCRIPT = MINDSPEED_LLM_DIR / 'examples' / 'mindspore' / 'qwen3' / 'tune_qwen3_0point6b_4K_full_ms.sh'\n", + "CONVERT_CKPT_ENTRY = MINDSPEED_LLM_DIR / 'mindspeed_llm' / 'mindspore' / 'convert_ckpt.py'\n", + "assert CONVERT_CKPT_ENTRY.exists(), f'Official MindSpore conversion entry not found: {CONVERT_CKPT_ENTRY}'\n", + "\n", + "HF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen3-0.6B')\n", + 
"WORK_DIR = Path('/opt/app-root/src/Qwen3-0.6B-work-dir')\n", + "DATA_DIR = WORK_DIR / 'finetune_dataset'\n", + "RAW_DATA_FILE = DATA_DIR / 'alpaca_sample.jsonl'\n", + "PROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\n", + "OUTPUT_DIR = WORK_DIR / 'output' / 'qwen3_0.6b_finetuned'\n", + "LOGS_DIR = WORK_DIR / 'logs'\n", + "PRECHECK_LOG_DIR = LOGS_DIR / 'preflight'\n", + "\n", + "# ===== Optional: real dataset path =====\n", + "ALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n", + "\n", + "# ===== Ascend environment scripts =====\n", + "CANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\n", + "ATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n", + "\n", + "# ===== Repository and model configuration =====\n", + "MODEL_SPEC = 'mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec'\n", + "MASTER_ADDR = 'localhost'\n", + "MASTER_PORT = 6015\n", + "NNODES = 1\n", + "NODE_RANK = 0\n", + "DISTRIBUTED_BACKEND = 'nccl' # Keep nccl to match the upstream tune_qwen3_0point6b_4K_full_ms.sh behavior.\n", + "\n", + "# ===== Parallelism configuration (must match weight conversion) =====\n", + "# Qwen3-0.6B is small enough for TP=1 and PP=1; use both cards for data parallelism.\n", + "TP = 1\n", + "PP = 1\n", + "MBS = 2 # micro-batch=2 fits on each 32G card.\n", + "\n", + "# ===== Weight conversion output (include TP/PP in the path to avoid stale reuse) =====\n", + "MCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen3_mcore_tp{TP}_pp{PP}'\n", + "\n", + "# ===== Training hyperparameters (for 2x 910B 32G) =====\n", + "SEQ_LENGTH = 2048 # The upstream quick start example uses 4096; reduce to 2048 for validation.\n", + "TRAIN_ITERS = 100 # The upstream quick start example uses 2000; reduce to 100 for validation.\n", + "LR = 1.25e-6\n", + "MIN_LR = 1.25e-7\n", + "\n", + "# ===== Data preprocessing =====\n", + "ENABLE_THINKING = 'true'\n", + "HANDLER_NAME = 'AlpacaStyleInstructionHandler'\n", + "TOKENIZER_TYPE = 'PretrainedFromHF'\n", 
+ "PROMPT_TYPE = 'qwen3'\n", + "DATA_PATH = str(PROCESSED_DATA_PREFIX)\n", + "\n", + "print('Configuration loaded')\n", + "print(f' MindSpeed-Core-MS: {MINDSPEED_CORE_MS_PATH}')\n", + "print(f' MindSpeed-LLM: {MINDSPEED_LLM_DIR}')\n", + "print(f' Official convert script: {OFFICIAL_CONVERT_SCRIPT}')\n", + "print(f' Official preprocess script: {OFFICIAL_PREPROCESS_SCRIPT}')\n", + "print(f' Official fine-tune script: {OFFICIAL_TUNE_SCRIPT}')\n", + "print(f' Official conversion entry: {CONVERT_CKPT_ENTRY}')\n", + "print(f' Model: {HF_MODEL_DIR}')\n", + "print(f' Work dir: {WORK_DIR}')\n", + "print(f' Dataset: {ALPACA_PARQUET}' if ALPACA_PARQUET.exists() else ' Dataset: built-in sample dataset')\n", + "print(f' TP={TP}, PP={PP}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\n", + "print(f' Distributed backend: {DISTRIBUTED_BACKEND}')\n" + ], + "id": "436e89b6617b17e6" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Helpers\n" + ], + "id": "27f668d69a63d358" + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "import shlex\n", + "import subprocess\n", + "\n", + "_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning,ignore::FutureWarning'\n", + "\n", + "def q(value):\n", + " return shlex.quote(str(value))\n", + "\n", + "def run_cmd(cmd, cwd=None, check=True, step_name=None, log_file=None):\n", + " 'Run a bash command inside the Ascend environment, keep pipefail semantics, and optionally tee output to a log file.'\n", + " env_parts = ['set -o pipefail', f'source {q(CANN_ENV)}', f'source {q(ATB_ENV)}']\n", + " if SET_PATH_SCRIPT.exists():\n", + " env_parts.append(f'source {q(SET_PATH_SCRIPT)}')\n", + " env_prefix = ' && '.join(env_parts)\n", + " effective_cwd = Path(cwd or WORK_DIR)\n", + " resolved_log_file = Path(log_file) if log_file else None\n", + " wrapped_cmd = cmd\n", + " if resolved_log_file is not 
None:\n", + " resolved_log_file.parent.mkdir(parents=True, exist_ok=True)\n", + " wrapped_cmd = f'{{\\n{cmd}\\n}} 2>&1 | tee {q(resolved_log_file)}'\n", + " full_cmd = f'{env_prefix} && {wrapped_cmd}'\n", + " print(f'$ {cmd}\\n')\n", + " if resolved_log_file is not None:\n", + " print(f'Log file: {resolved_log_file}\\n')\n", + " run_env = os.environ.copy()\n", + " result = subprocess.run(\n", + " ['bash', '-lc', full_cmd],\n", + " cwd=str(effective_cwd),\n", + " text=True,\n", + " env=run_env,\n", + " )\n", + " if check and result.returncode != 0:\n", + " step_label = f'Step[{step_name}]' if step_name else 'Command'\n", + " log_hint = f', log file: {resolved_log_file}' if resolved_log_file is not None else ''\n", + " raise RuntimeError(f'{step_label} failed with exit code {result.returncode}{log_hint}')\n", + " return result\n", + "\n", + "print('Helper ready: run_cmd()')\n", + "print(f'Preflight log dir: {PRECHECK_LOG_DIR}')\n" ], "id": "24c1197bd5a2b547" }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## 1. 
Environment Checks\n" + ], + "id": "54d5bc600212c226" + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "import os\n", + "import warnings\n", + "\n", + "print('=' * 60)\n", + "print('Environment checks')\n", + "print('=' * 60)\n", + "\n", + "assert Path(CANN_ENV).exists(), f'Ascend CANN environment script not found: {CANN_ENV}'\n", + "assert Path(ATB_ENV).exists(), f'Ascend ATB environment script not found: {ATB_ENV}'\n", + "print(f'Expected MindSpeed-Core-MS path: {MINDSPEED_CORE_MS_PATH}')\n", + "assert MINDSPEED_CORE_MS_PATH.exists(), f'MindSpeed-Core-MS source tree not found: {MINDSPEED_CORE_MS_PATH}'\n", + "assert SET_PATH_SCRIPT.exists(), f'set_path.sh not found: {SET_PATH_SCRIPT}'\n", + "for repo_dir in (MINDSPEED_LLM_DIR, MINDSPEED_DIR, MSADAPTER_DIR, MEGATRON_DIR):\n", + " assert repo_dir.exists(), f'Bundled source tree missing: {repo_dir}'\n", + "for script_path in (OFFICIAL_CONVERT_SCRIPT, OFFICIAL_PREPROCESS_SCRIPT, OFFICIAL_TUNE_SCRIPT):\n", + " assert script_path.exists(), f'Official example script missing: {script_path}'\n", + "\n", + "with warnings.catch_warnings():\n", + " warnings.simplefilter('ignore', DeprecationWarning)\n", + " warnings.simplefilter('ignore', ImportWarning)\n", + " warnings.simplefilter('ignore', UserWarning)\n", + " warnings.simplefilter('ignore', FutureWarning)\n", + " import mindspore as ms\n", + " import msadapter\n", + " import mindspeed\n", + " import mindspeed_llm\n", + "\n", + "print(f'MindSpore: {ms.__version__}')\n", + "print(f'CANN env script: {CANN_ENV}')\n", + "print(f'ATB env script: {ATB_ENV}')\n", + "print(f'MindSpeed-Core-MS: {MINDSPEED_CORE_MS_PATH}')\n", + "print(f'set_path.sh: {SET_PATH_SCRIPT}')\n", + "\n", + "device_target = 'unknown'\n", + "if hasattr(ms, 'get_context'):\n", + " try:\n", + " device_target = ms.get_context('device_target')\n", + " except Exception:\n", + " pass\n", + "\n", + "nproc = None\n", + "hal = getattr(ms, 'hal', None)\n", 
+ "if hal is not None and hasattr(hal, 'device_count'):\n", + " try:\n", + " nproc = int(hal.device_count())\n", + " except Exception:\n", + " nproc = None\n", + "\n", + "if not nproc:\n", + " visible_devices = os.environ.get('ASCEND_RT_VISIBLE_DEVICES') or os.environ.get('ASCEND_VISIBLE_DEVICES')\n", + " if visible_devices:\n", + " nproc = len([d for d in visible_devices.split(',') if d.strip()])\n", + " else:\n", + " nproc = int(os.environ.get('RANK_SIZE', '1'))\n", + "\n", + "print(f'Device target: {device_target}')\n", + "print(f'NPU count: {nproc}')\n", + "\n", + "print(f'\n", + "Model directory: {HF_MODEL_DIR}')\n", + "assert HF_MODEL_DIR.exists(), f'Model directory not found: {HF_MODEL_DIR}'\n", + "required_model_files = [\n", + " HF_MODEL_DIR / 'config.json',\n", + " HF_MODEL_DIR / 'tokenizer.json',\n", + " HF_MODEL_DIR / 'tokenizer_config.json',\n", + "]\n", + "for required_file in required_model_files:\n", + " assert required_file.exists(), f'Required model file not found: {required_file}'\n", + "safetensor_files = sorted(HF_MODEL_DIR.glob('*.safetensors'))\n", + "assert safetensor_files, f'No safetensors weight files found: {HF_MODEL_DIR}'\n", + "print('Required model files:')\n", + "for required_file in required_model_files:\n", + " print(f' {required_file.name}: OK')\n", + "print(f' safetensors: {len(safetensor_files)} file(s)')\n", + "\n", + "model_files = sorted(HF_MODEL_DIR.glob('*'))\n", + "for f in model_files[:5]:\n", + " if f.is_file():\n", + " print(f' {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\n", + "if len(model_files) > 5:\n", + " print(f' ... 
total {len(model_files)} files')\n", + "\n", + "py_path_entries = [Path(p) for p in os.environ.get('PYTHONPATH', '').split(':') if p]\n", + "expected_entries = [\n", + " MINDSPEED_CORE_MS_PATH / 'msadapter',\n", + " MSADAPTER_DIR,\n", + " MINDSPEED_CORE_MS_PATH / 'msadapter' / 'msa_thirdparty',\n", + " MSADAPTER_DIR / 'msa_thirdparty',\n", + " MINDSPEED_LLM_DIR,\n", + " MEGATRON_DIR,\n", + " MINDSPEED_DIR,\n", + "]\n", + "print('\n", + "PYTHONPATH key entries:')\n", + "for entry in expected_entries:\n", + " present = entry in py_path_entries\n", + " print(f' {entry}: {\"OK\" if present else \"missing\"}')\n", + "missing_entries = [str(entry) for entry in expected_entries if entry not in py_path_entries]\n", + "assert not missing_entries, f'PYTHONPATH is missing required entries: {missing_entries}'\n", + "\n", + "assert nproc >= TP * PP, f'NPU count ({nproc}) < TP*PP ({TP * PP}); reduce PP and retry.'\n", + "DP = max(1, nproc // (TP * PP))\n", + "GBS = DP * MBS\n", + "print(f'\n", + "Parallel configuration: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\n", + "\n", + "subprocess_env_check = '\n", + "'.join([\n", + " \"python - <<'PY'\",\n", + " 'import importlib',\n", + " 'import os',\n", + " 'import sys',\n", + " 'from pathlib import Path',\n", + " '',\n", + " \"core_root = Path(os.environ['MINDSPEED_CORE_MS_PATH'])\",\n", + " \"print(f'sys.executable: {sys.executable}')\",\n", + " \"print(f'MINDSPEED_CORE_MS_PATH: {core_root}')\",\n", + " \"for name in ('mindspore', 'msadapter', 'mindspeed', 'mindspeed_llm', 'transformers'):\",\n", + " ' module = importlib.import_module(name)',\n", + " \" print(f'{name}: {getattr(module, \"__file__\", \"\")}')\",\n", + " '',\n", + " \"cfg = core_root / 'MindSpeed-LLM' / 'configs' / 'checkpoint' / 'model_cfg.json'\",\n", + " \"print(f'checkpoint_cfg: {cfg} exists={cfg.exists()}')\",\n", + " 'PY',\n", + " '',\n", + "])\n", + "run_cmd(\n", + " subprocess_env_check,\n", + " step_name='subprocess-import-check',\n", + " 
log_file=PRECHECK_LOG_DIR / 'subprocess_import_check.log',\n",
+ ")\n",
+ "print(f'Subprocess environment check log: {PRECHECK_LOG_DIR / \"subprocess_import_check.log\"}')\n",
+ "print('\\nEnvironment checks passed!')\n"
+ ],
+ "id": "458f673071da592d"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Prepare the Dataset\n",
+ "\n",
+ "Create sample Alpaca-formatted instruction data for fine-tuning flow validation.\n",
+ "\n",
+ "To use a real dataset, place a parquet file at `ALPACA_PARQUET` or write a JSONL file to `RAW_DATA_FILE`, one JSON object per line:\n",
+ "\n",
+ "```json\n",
+ "{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n",
+ "```\n"
+ ],
+ "id": "f29d99ee638e7039"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "import warnings\n",
+ "\n",
+ "DATA_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "\n",
+ "if ALPACA_PARQUET.exists():\n",
+ " print(f'Loading Alpaca dataset: {ALPACA_PARQUET.name}')\n",
+ " with warnings.catch_warnings():\n",
+ " warnings.simplefilter('ignore', DeprecationWarning)\n",
+ " import pandas as pd\n",
+ " df = pd.read_parquet(ALPACA_PARQUET)\n",
+ "\n",
+ " required_columns = {'instruction', 'input', 'output'}\n",
+ " missing_columns = required_columns.difference(df.columns)\n",
+ " assert not missing_columns, f'Parquet dataset is missing required columns: {sorted(missing_columns)}'\n",
+ "\n",
+ " with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n",
+ " for item in df[['instruction', 'input', 'output']].to_dict('records'):\n",
+ " item['input'] = item.get('input') or ''\n",
+ " f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n",
+ "\n",
+ " print(f'Converted parquet to JSONL: {RAW_DATA_FILE}')\n",
+ " preview_records = df[['instruction', 'input', 'output']].head(3).to_dict('records')\n",
+ "else:\n",
+ " print('Alpaca dataset not found, using built-in sample data\\n')\n",
+ " sample_data = [\n",
+ " {'instruction': 'Translate the following sentence into French', 'input': 'The weather is nice today.', 'output': \"Il fait beau aujourd'hui.\"},\n",
+ " {'instruction': 'Translate the following sentence into Spanish', 'input': 'I like programming.', 'output': 'Me gusta programar.'},\n",
+ " {'instruction': 'Summarize the following in one sentence', 'input': 'Machine learning is fascinating and widely used in many fields.', 'output': 'Machine learning is widely used across many fields.'},\n",
+ " {'instruction': 'Rewrite in a more formal tone', 'input': 'Hello, how are you?', 'output': 'Hello, how have you been today?'},\n",
+ " {'instruction': 'Introduce MindSpore in one sentence', 'input': '', 'output': 'MindSpore is an all-scenario AI computing framework for device, edge, and cloud.'},\n",
+ " {'instruction': 'List three common sorting algorithms', 'input': '', 'output': 'Three common sorting algorithms are bubble sort, quicksort, and merge sort.'},\n",
+ " {'instruction': 'Explain what full-parameter fine-tuning means', 'input': '', 'output': 'Full-parameter fine-tuning updates all model weights instead of training only a lightweight adapter.'},\n",
+ " {'instruction': 'Write a Python function that adds two numbers', 'input': '', 'output': 'def add(a, b):\\n return a + b'},\n",
+ " {'instruction': 'Rewrite in a more concise way', 'input': 'Artificial intelligence is changing the world.', 'output': 'AI is changing the world.'},\n",
+ " {'instruction': 'What is Ascend 910B?', 'input': '', 'output': 'Ascend 910B is an AI accelerator chip designed by Huawei for deep learning training and inference.'},\n",
+ " ]\n",
+ " with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n",
+ " for item in sample_data:\n",
+ " f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n",
+ " preview_records = sample_data[:3]\n",
+ " print(f'Sample dataset created: {RAW_DATA_FILE}')\n",
+ " print(f'{len(sample_data)} samples total')\n",
+ "\n",
+ "print('\\nData
preview:')\n", + "for item in preview_records:\n", + " inp = f' {item[\"input\"]}' if item.get('input') else ''\n", + " print(f' Q: {item[\"instruction\"][:80]}{inp[:40]}')\n", + " print(f' A: {str(item[\"output\"])[:80]}')\n" + ], + "id": "d16a7a6cad0a1069" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Verify the Bundled Source Tree\n", + "\n", + "The image keeps the `MindSpeed-Core-MS`, `MindSpeed-LLM`, `MindSpeed`, `MSAdapter`, and `Megatron-LM` source trees in the same layout used by the official `MindSpeed-Core-MS/tests/scripts/set_path.sh`, so the notebook does not need to clone extra repositories.\n", + "\n", + "The bundled source tree lives under `/opt/app-root/share/MindSpeed-Core-MS`, while `/opt/app-root/src` remains available for the workbench PVC, models, datasets, and training outputs.\n", + "\n", + "This cell only performs directory checks and smoke tests for the official entry points. It does not modify the bundled source tree at notebook runtime.\n" + ], + "id": "851e67430d5ed824" + }, + { + "cell_type": "code", + "metadata": {}, + "execution_count": null, + "outputs": [], + "source": [ + "WORK_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "repos = [\n", + " ('MindSpeed-Core-MS', MINDSPEED_CORE_MS_PATH),\n", + " ('MindSpeed-LLM', MINDSPEED_LLM_DIR),\n", + " ('MindSpeed', MINDSPEED_DIR),\n", + " ('MSAdapter', MSADAPTER_DIR),\n", + " ('Megatron-LM', MEGATRON_DIR),\n", + "]\n", + "\n", + "print('Bundled source tree:')\n", + "for name, repo_dir in repos:\n", + " exists = repo_dir.exists()\n", + " print(f' [{name}] {repo_dir}: {\"OK\" if exists else \"missing\"}')\n", + " assert exists, f'Bundled source tree missing: {repo_dir}'\n", + "\n", + "script_checks = [\n", + " ('Official weight conversion script', OFFICIAL_CONVERT_SCRIPT),\n", + " ('Official MindSpore conversion entry', CONVERT_CKPT_ENTRY),\n", + " ('Official instruction data script', OFFICIAL_PREPROCESS_SCRIPT),\n", + " ('Data preprocessing entry', 
MINDSPEED_LLM_DIR / 'preprocess_data.py'),\n",
+ " ('Official fine-tune script', OFFICIAL_TUNE_SCRIPT),\n",
+ " ('Fine-tuning training entry', MINDSPEED_LLM_DIR / 'posttrain_gpt.py'),\n",
+ "]\n",
+ "\n",
+ "print('\\nOfficial scripts and entry points:')\n",
+ "for name, script_path in script_checks:\n",
+ " exists = script_path.exists()\n",
+ " print(f' [{name}] {script_path}: {\"OK\" if exists else \"missing\"}')\n",
+ " assert exists, f'Required script missing: {script_path}'\n",
+ "\n",
+ "repo_smoke_cmd = ' && '.join([\n",
+ " f'cd {q(MINDSPEED_LLM_DIR)}',\n",
+ " f'python {q(CONVERT_CKPT_ENTRY)} --load-model-type hf --help >/dev/null',\n",
+ " 'python ./preprocess_data.py --help >/dev/null',\n",
+ " 'python ./posttrain_gpt.py --help >/dev/null',\n",
+ "])\n",
+ "run_cmd(\n",
+ " repo_smoke_cmd,\n",
+ " cwd=MINDSPEED_LLM_DIR,\n",
+ " step_name='repo-smoke-check',\n",
+ " log_file=PRECHECK_LOG_DIR / 'repo_smoke_check.log',\n",
+ ")\n",
+ "print(f'Repository entry-point check log: {PRECHECK_LOG_DIR / \"repo_smoke_check.log\"}')\n",
+ "print('\\nSource tree checks passed!')\n"
+ ],
+ "id": "24f4339feef7faac"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Convert HF Weights to MindSpeed/Mcore Format\n",
+ "\n",
+ "Convert HuggingFace Qwen3-0.6B weights into the Mcore checkpoint format required by the MindSpore training flow.
The first conversion usually takes a few minutes.\n",
+ "\n",
+ "> If conversion hits device-side OOM, refer to the upstream quick start and run the conversion on the CPU side instead.\n"
+ ],
+ "id": "1e08be6c6932204f"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "\n",
+ "conversion_marker = MCORE_WEIGHTS_DIR / 'latest_checkpointed_iteration.txt'\n",
+ "iter_dirs = sorted(MCORE_WEIGHTS_DIR.glob('iter_*'))\n",
+ "converted = conversion_marker.exists() and bool(iter_dirs)\n",
+ "\n",
+ "if converted:\n",
+ " print(f'Weights already exist, skipping conversion: {MCORE_WEIGHTS_DIR}')\n",
+ " print(f'Latest checkpoint marker: {conversion_marker.read_text().strip()}')\n",
+ "else:\n",
+ " convert_args = [\n",
+ " '--use-mcore-models',\n",
+ " '--model-type', 'GPT',\n",
+ " '--load-model-type', 'hf',\n",
+ " '--save-model-type', 'mg',\n",
+ " '--target-tensor-parallel-size', str(TP),\n",
+ " '--target-pipeline-parallel-size', str(PP),\n",
+ " '--load-dir', str(HF_MODEL_DIR),\n",
+ " '--save-dir', str(MCORE_WEIGHTS_DIR),\n",
+ " '--tokenizer-model', str(HF_MODEL_DIR / 'tokenizer.json'),\n",
+ " '--params-dtype', 'bf16',\n",
+ " '--model-type-hf', 'qwen3',\n",
+ " '--ai-framework', 'mindspore',\n",
+ " ]\n",
+ " if MODEL_SPEC:\n",
+ " convert_args.extend(['--spec', *MODEL_SPEC.split()])\n",
+ "\n",
+ " convert_cmd = ' && '.join([\n",
+ " f'cd {q(MINDSPEED_LLM_DIR)}',\n",
+ " 'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n",
+ " ' '.join(['python', q(CONVERT_CKPT_ENTRY), *[q(arg) for arg in convert_args]]),\n",
+ " ])\n",
+ " print('Converting weights through the official entry point...')\n",
+ " run_cmd(\n",
+ " convert_cmd,\n",
+ " cwd=MINDSPEED_LLM_DIR,\n",
+ " step_name='weight-convert',\n",
+ " log_file=LOGS_DIR / 'convert_qwen3_0.6b_ms.log',\n",
+ " )\n",
+ " print('Weight conversion completed!')\n",
+ "\n",
+ "print('\\nConverted checkpoint
files:')\n",
+ "for p in sorted(MCORE_WEIGHTS_DIR.glob('*'))[:20]:\n",
+ " print(f' {p.name}')\n",
+ "\n",
+ "assert conversion_marker.exists(), f'Checkpoint marker file not found: {conversion_marker}'\n",
+ "assert any(MCORE_WEIGHTS_DIR.glob('iter_*')), f'Converted iter_* checkpoint not found: {MCORE_WEIGHTS_DIR}'\n"
+ ],
+ "id": "c11053fb769c124f"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Data Preprocessing\n",
+ "\n",
+ "Convert Alpaca-formatted instruction data into the packed binary format required by the MindSpore Qwen3 fine-tuning flow.\n"
+ ],
+ "id": "8b4eb7214878fee9"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "preprocess_cmd = ' && '.join([\n",
+ " f'cd {q(MINDSPEED_LLM_DIR)}',\n",
+ " f'mkdir -p {q(DATA_DIR)}',\n",
+ " 'python ./preprocess_data.py'\n",
+ " f' --input {q(RAW_DATA_FILE)}'\n",
+ " f' --tokenizer-name-or-path {q(HF_MODEL_DIR)}'\n",
+ " f' --output-prefix {q(PROCESSED_DATA_PREFIX)}'\n",
+ " f' --handler-name {HANDLER_NAME}'\n",
+ " f' --tokenizer-type {TOKENIZER_TYPE}'\n",
+ " ' --workers 4'\n",
+ " ' --log-interval 1'\n",
+ " f' --enable-thinking {ENABLE_THINKING}'\n",
+ " f' --prompt-type {PROMPT_TYPE}',\n",
+ "])\n",
+ "\n",
+ "print('Preprocessing data...')\n",
+ "run_cmd(\n",
+ " preprocess_cmd,\n",
+ " cwd=MINDSPEED_LLM_DIR,\n",
+ " step_name='data-preprocess',\n",
+ " log_file=LOGS_DIR / 'preprocess_qwen3_0.6b_ms.log',\n",
+ ")\n",
+ "\n",
+ "expected_outputs = [\n",
+ " DATA_DIR / 'alpaca_packed_attention_mask_document.bin',\n",
+ " DATA_DIR / 'alpaca_packed_attention_mask_document.idx',\n",
+ " DATA_DIR / 'alpaca_packed_input_ids_document.bin',\n",
+ " DATA_DIR / 'alpaca_packed_input_ids_document.idx',\n",
+ " DATA_DIR / 'alpaca_packed_labels_document.bin',\n",
+ " DATA_DIR / 'alpaca_packed_labels_document.idx',\n",
+ "]\n",
+ "\n",
+ "print('\\nPreprocessing output files:')\n",
+ "for f in expected_outputs:\n",
+ " size_kb =
f.stat().st_size / 1024 if f.exists() else 0\n",
+ " print(f' {f.name}: {\"OK\" if f.exists() else \"missing\"} ({size_kb:.1f} KB)')\n",
+ "\n",
+ "missing = [str(f) for f in expected_outputs if not f.exists()]\n",
+ "assert not missing, f'Preprocessing output files were not generated: {missing}'\n",
+ "print('\\nData preprocessing completed!')\n"
+ ],
+ "id": "1bbde65b273d888a"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6. Launch Fine-Tuning\n",
+ "\n",
+ "Run full-parameter SFT for Qwen3-0.6B with the MindSpore backend. Training logs stream to the notebook.\n",
+ "\n",
+ "> The current configuration `SEQ_LENGTH=2048, TRAIN_ITERS=100, MBS=2` targets 2x 910B 32G NPUs. For larger-scale training, increase `TRAIN_ITERS`, `SEQ_LENGTH`, and the dataset size.\n"
+ ],
+ "id": "66219c9355e9d630"
+ },
+ {
+ "metadata": {},
+ "cell_type": "code",
+ "outputs": [],
+ "execution_count": null,
+ "source": [
+ "LOGS_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "\n",
+ "WORLD_SIZE = nproc * NNODES\n",
+ "DP = max(1, nproc // (TP * PP))\n",
+ "GBS = DP * MBS\n",
+ "\n",
+ "train_log_file = LOGS_DIR / 'tune_qwen3_0.6b_ms.log'\n",
+ "\n",
+ "env = ' && '.join([\n",
+ " f'cd {q(MINDSPEED_LLM_DIR)}',\n",
+ " 'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n",
+ " 'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n",
+ " f'mkdir -p {q(LOGS_DIR)}',\n",
+ " f'mkdir -p {q(OUTPUT_DIR)}',\n",
+ "])\n",
+ "\n",
+ "distributed = ' '.join([\n",
+ " 'msrun',\n",
+ " f'--local_worker_num {nproc}',\n",
+ " f'--worker_num {WORLD_SIZE}',\n",
+ " f'--node_rank {NODE_RANK}',\n",
+ " f'--master_addr {MASTER_ADDR}',\n",
+ " f'--master_port {MASTER_PORT}',\n",
+ " f'--log_dir={q(LOGS_DIR / \"msrun\")}',\n",
+ " '--join=True',\n",
+ " '--cluster_time_out=300',\n",
+ "])\n",
+ "\n",
+ "model_args = ' '.join([\n",
+ " '--use-mcore-models',\n",
+ " f'--tensor-model-parallel-size {TP}',\n",
+
f'--pipeline-model-parallel-size {PP}',\n", + " '--sequence-parallel',\n", + " f'--spec {MODEL_SPEC}',\n", + " '--kv-channels 128',\n", + " '--qk-layernorm',\n", + " '--use-flash-attn',\n", + " '--num-layers 28',\n", + " '--hidden-size 1024',\n", + " '--use-rotary-position-embeddings',\n", + " '--num-attention-heads 16',\n", + " '--ffn-hidden-size 3072',\n", + " '--max-position-embeddings 32768',\n", + " f'--seq-length {SEQ_LENGTH}',\n", + " '--make-vocab-size-divisible-by 1',\n", + " '--padded-vocab-size 151936',\n", + " '--rotary-base 1000000',\n", + " '--disable-bias-linear',\n", + " '--swiglu',\n", + " '--tokenizer-type PretrainedFromHF',\n", + " f'--tokenizer-name-or-path {q(HF_MODEL_DIR)}',\n", + " '--normalization RMSNorm',\n", + " '--position-embedding-type rope',\n", + " '--norm-epsilon 1e-6',\n", + " '--hidden-dropout 0',\n", + " '--attention-dropout 0',\n", + " '--no-gradient-accumulation-fusion',\n", + " '--attention-softmax-in-fp32',\n", + " '--exit-on-missing-checkpoint',\n", + " '--no-masked-softmax-fusion',\n", + " '--group-query-attention',\n", + " '--num-query-groups 8',\n", + " '--seed 42',\n", + " '--bf16',\n", + " '--transformer-impl local',\n", + " '--ckpt-format msadapter',\n", + "])\n", + "\n", + "train_args = ' '.join([\n", + " f'--train-iters {TRAIN_ITERS}',\n", + " f'--micro-batch-size {MBS}',\n", + " f'--global-batch-size {GBS}',\n", + " f'--lr {LR}',\n", + " f'--min-lr {MIN_LR}',\n", + " '--weight-decay 1e-1',\n", + " '--lr-warmup-fraction 0.01',\n", + " '--clip-grad 1.0',\n", + " '--adam-beta1 0.9',\n", + " '--adam-beta2 0.95',\n", + " '--no-load-optim',\n", + " '--no-load-rng',\n", + "])\n", + "\n", + "data_args = ' '.join([\n", + " f'--data-path {q(DATA_PATH)}',\n", + " '--split 100,0,0',\n", + "])\n", + "\n", + "output_args = ' '.join([\n", + " '--log-interval 1',\n", + " f'--save-interval {TRAIN_ITERS}',\n", + " f'--eval-interval {TRAIN_ITERS}',\n", + " '--eval-iters 0',\n", + " '--log-throughput',\n", + "])\n", + "\n", + 
"tune_args = ' '.join([\n",
+ " '--finetune',\n",
+ " '--stage sft',\n",
+ " '--is-instruction-dataset',\n",
+ " '--prompt-type qwen3',\n",
+ " '--no-pad-to-seq-lengths',\n",
+ "])\n",
+ "\n",
+ "cmd = (\n",
+ " f'{env} && {distributed} posttrain_gpt.py '\n",
+ " f'{model_args} {train_args} {data_args} {output_args} {tune_args} '\n",
+ " f'--distributed-backend {DISTRIBUTED_BACKEND} '\n",
+ " f'--load {q(MCORE_WEIGHTS_DIR)} '\n",
+ " f'--save {q(OUTPUT_DIR)} '\n",
+ " f'--ai-framework mindspore '\n",
+ ")\n",
+ "\n",
+ "print(f'Training config: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\n",
+ "print(f'SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}, MASTER={MASTER_ADDR}:{MASTER_PORT}')\n",
+ "print(f'Log file: {train_log_file}')\n",
+ "print('\\nStarting fine-tuning...\\n')\n",
+ "run_cmd(\n",
+ " cmd,\n",
+ " cwd=MINDSPEED_LLM_DIR,\n",
+ " step_name='train',\n",
+ " log_file=train_log_file,\n",
+ ")\n",
+ "print(f'\\nFine-tuning completed! Weights saved to: {OUTPUT_DIR}')\n"
+ ],
+ "id": "3dd57cef5ae58e69"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 7.
Validate Outputs\n"
+ ],
+ "id": "b765e7120b74eab3"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "log_file = LOGS_DIR / 'tune_qwen3_0.6b_ms.log'\n",
+ "latest_marker = OUTPUT_DIR / 'latest_checkpointed_iteration.txt'\n",
+ "iter_dirs = sorted(OUTPUT_DIR.glob('iter_*'))\n",
+ "\n",
+ "print(f'Output directory: {OUTPUT_DIR}')\n",
+ "print(f'Log file: {log_file}')\n",
+ "\n",
+ "assert OUTPUT_DIR.exists() and any(OUTPUT_DIR.iterdir()), f'No artifacts found in output directory: {OUTPUT_DIR}'\n",
+ "\n",
+ "if latest_marker.exists():\n",
+ " print(f'Latest checkpoint marker: {latest_marker.read_text().strip()}')\n",
+ "else:\n",
+ " print('Latest checkpoint marker not found, listing all checkpoint artifacts instead.')\n",
+ "\n",
+ "if iter_dirs:\n",
+ " print('\\nCheckpoint iteration directories:')\n",
+ " for d in iter_dirs:\n",
+ " print(f' {d.name}')\n",
+ "\n",
+ "print('\\nOutput artifacts (first 20 entries):')\n",
+ "for p in sorted(OUTPUT_DIR.rglob('*'))[:20]:\n",
+ " if p.is_file():\n",
+ " print(f' {p.relative_to(OUTPUT_DIR)} ({p.stat().st_size / 1024:.1f} KB)')\n",
+ " else:\n",
+ " print(f' {p.relative_to(OUTPUT_DIR)}/')\n",
+ "\n",
+ "assert log_file.exists(), f'Training log was not generated: {log_file}'\n",
+ "print('\\nValidation passed!')\n"
+ ],
+ "id": "ae0cdda33955c5d9"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Using a Real Dataset\n",
+ "\n",
+ "Once validation passes, you can switch to a real dataset:\n",
+ "\n",
+ "1. **Prepare the data**\n",
+ " - Place a parquet dataset at `ALPACA_PARQUET`, or write Alpaca-formatted JSONL to `RAW_DATA_FILE`.\n",
+ " - Each record should contain `instruction`, `input`, and `output` fields.\n",
+ "\n",
+ "2.
**Adjust the training scale**\n", + " - Increase `SEQ_LENGTH` gradually based on available accelerator memory and target context length.\n", + " - Increase `TRAIN_ITERS` based on dataset size.\n", + " - Adjust `MBS` to match available memory.\n", + "\n", + "3. **Re-convert weights when TP/PP changes**\n", + " - The `MCORE_WEIGHTS_DIR` path includes `TP` and `PP` to prevent accidental reuse of stale converted weights.\n", + "\n", + "4. **Adjust instruction preprocessing behavior**\n", + " - The upstream example uses `ENABLE_THINKING=true` by default.\n", + " - Change `ENABLE_THINKING` if your dataset or prompt style needs different behavior.\n", + "\n", + "5. **Multi-node training**\n", + " - Update `MASTER_ADDR`, `MASTER_PORT`, `NNODES`, and `NODE_RANK`, then rerun the training cell.\n", + "\n", + "This notebook only validates checkpoint generation. Add inference steps later after the target image and upstream scripts provide a stable MindSpore inference path.\n" + ], + "id": "8c3896d42381d177" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 8c01df58785730beb8c7db5ce0aa734e716360d1 Mon Sep 17 00:00:00 2001 From: zgsu Date: Wed, 15 Apr 2026 11:11:30 +0800 Subject: [PATCH 3/3] docs: replace MinIO wording with S3-compatible Ceph storage --- .../how_to/compressor_by_workbench.mdx | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/en/llm-compressor/how_to/compressor_by_workbench.mdx b/docs/en/llm-compressor/how_to/compressor_by_workbench.mdx index 72eb79f..2b9ac2c 100644 --- a/docs/en/llm-compressor/how_to/compressor_by_workbench.mdx +++ b/docs/en/llm-compressor/how_to/compressor_by_workbench.mdx @@ -83,9 +83,9 @@ ds = ds.map(preprocess, remove_columns=ds.column_names) 5. Preprocess and tokenize into format the model uses. 
-### (Optional) Upload Dataset into S3 Storage +### (Optional) Upload Dataset into S3-Compatible Object Storage -If you wish to upload datasets into S3, you can run those codes in `JupyterLab`. +If you want to upload datasets into S3-compatible object storage, you can run the following code in `JupyterLab`. Alauda AI supports S3-compatible storage access, and in typical product deployments the storage implementation is Ceph object storage, so you can use the standard `boto3` client. ```python import os @@ -104,7 +104,7 @@ config = TransferConfig( for root, dirs, files in os.walk(local_folder): for filename in files: - local_path = os path.join(root, filename) + local_path = os.path.join(root, filename) relative_path = os.path.relpath(local_path, local_folder) s3_key = f"ultrachat_200k/{relative_path.replace(os.sep, '/')}" s3.upload_file(local_path, bucket_name, s3_key, Config=config) @@ -116,9 +116,9 @@ for root, dirs, files in os.walk(local_folder): 2. Configure multipart upload with 100 MB chunks and a maximum of 10 concurrent threads. -### (Optional) Use Dataset in S3 Storage +### (Optional) Use Dataset from S3-Compatible Object Storage -If you wish to use datasets from S3, you can first install the `s3fs` tool and then modify the dataset loading section in the example by following the code below. +If you want to use datasets stored in S3-compatible object storage, first install the `s3fs` tool and then modify the dataset loading section in the example as shown below. In Alauda AI environments, this S3-compatible storage is typically backed by Ceph object storage. ```bash pip install s3fs -i https://pypi.tuna.tsinghua.edu.cn/simple @@ -135,7 +135,7 @@ storage_options = { "key": "07Apples@", "secret": "O7Apples@", "client_kwargs": { - "endpoint_url": "http://minio.minio-system.svc.cluster.local:80" #[!code callout] + "endpoint_url": "https://ceph-obj.example.com" #[!code callout] } } @@ -149,7 +149,7 @@ ds = load_dataset( 1. 
Set environment variables (as a backup, some underlying components will use them). - 2. Define storage configuration; you must explicitly specify the endpoint_url to connect to MinIO. + 2. Define storage configuration; you must explicitly specify the `endpoint_url` for your S3-compatible object storage service, such as a Ceph object storage endpoint. 3. If the dataset is split, this is equivalent to `split="train_sft"` in the example.
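
The object-key layout produced by the upload loop can be checked locally before touching any object storage. The sketch below mirrors that loop's key mapping under the same `ultrachat_200k/` prefix used in the example; the folder contents here are hypothetical, created only for the dry run:

```python
import os
import tempfile

def build_s3_keys(local_folder, prefix="ultrachat_200k"):
    """Mirror the upload loop's key mapping without uploading anything."""
    keys = []
    for root, dirs, files in os.walk(local_folder):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, local_folder)
            # Object keys always use '/', regardless of the local os.sep
            keys.append(f"{prefix}/{relative_path.replace(os.sep, '/')}")
    return sorted(keys)

# Hypothetical layout: one parquet shard plus a README
with tempfile.TemporaryDirectory() as folder:
    os.makedirs(os.path.join(folder, "data"))
    open(os.path.join(folder, "data", "train-00000.parquet"), "w").close()
    open(os.path.join(folder, "README.md"), "w").close()
    print(build_s3_keys(folder))
    # ['ultrachat_200k/README.md', 'ultrachat_200k/data/train-00000.parquet']
```

Running this dry run first makes it easy to confirm that the resulting keys line up with the `s3://<bucket>/ultrachat_200k/...` paths the dataset-loading example expects.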