feat: Refactor training framework with modular architecture#3
What should we do if the pipeline-parallel (pp) type is not GPipe?
If our config specifies a dataset, passing mock data here won't work, will it?
Does this part skip the warmup training iterations?
We should not use print for logging; please use the logging module here instead.
Please apply this to all code in the PR.
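One way the print calls could be replaced is sketched below using the standard-library `logging` module. The logger name `train`, the helper names `setup_logging` and `report_step`, and the message format are assumptions for illustration, not code from this PR.

```python
import logging

# Module-level logger; the name "train" is a placeholder.
logger = logging.getLogger("train")


def setup_logging(level: int = logging.INFO) -> None:
    """Configure logging once, at program start."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def report_step(step: int, loss: float) -> str:
    """Log one training step (instead of print) and return the message."""
    message = f"step={step} loss={loss:.4f}"
    logger.info(message)
    return message
```

Calling `setup_logging()` once in the entry point keeps the level and format in one place, so individual modules only need `logging.getLogger(...)`.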
# Add data configuration
if self.config.train_dataset is None or (isinstance(self.config.train_dataset, str) and self.config.train_dataset.lower() == "mock"):
    megatron_args += ["--mock-data", "--tokenizer-type", "NullTokenizer", "--vocab-size", str(self.config.vocab_size)]
Do we need to use a different tokenizer for each model here?
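A minimal sketch of how the data arguments could branch between mock data and a configured dataset, with the tokenizer taken from the config rather than hard-coded. The function name `build_data_args` and the `tokenizer_type` parameter are hypothetical; `--data-path` and the `GPT2BPETokenizer` value are assumed to match the Megatron-LM version in use, and a real tokenizer would normally need extra flags (for example vocabulary files) that are omitted here.

```python
from typing import List, Optional


def build_data_args(train_dataset: Optional[str], vocab_size: int,
                    tokenizer_type: str = "GPT2BPETokenizer") -> List[str]:
    """Return data-related CLI arguments for the launcher."""
    use_mock = train_dataset is None or (
        isinstance(train_dataset, str) and train_dataset.lower() == "mock"
    )
    if use_mock:
        # Same flags as in the diff above: no real tokenizer is needed.
        return ["--mock-data", "--tokenizer-type", "NullTokenizer",
                "--vocab-size", str(vocab_size)]
    # Assumption: a real dataset is passed by path and the tokenizer
    # is chosen per model through a config field.
    return ["--data-path", train_dataset, "--tokenizer-type", tokenizer_type]
```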
f"--micro-batch-size={self.config.mbs}",
f"--global-batch-size={self.config.gbs}",
f"--seq-length={self.config.seq_len}",
f"--lr={self.config.lr}",
Please also consider learning-rate decay here.
We could add these configs to our JSON file:
f"--lr_scheduler_type=cosine", # recommended: cosine annealing (cosine) or linear
f"--warmup_ratio=0.03", # recommended: use the first 3% of steps for warmup
# or specify the warmup as a number of steps
# f"--warmup_steps=100",
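If these keys were added to the JSON config, they could be translated into launcher arguments roughly as follows. This is a sketch under assumptions: the helper name `build_lr_args` is hypothetical, and the target flags `--lr-decay-style`, `--lr-warmup-iters`, and `--lr-warmup-fraction` are the names Megatron-LM is believed to use, which should be checked against the installed version.

```python
from typing import Any, Dict, List


def build_lr_args(cfg: Dict[str, Any]) -> List[str]:
    """Translate learning-rate settings from the config into CLI flags."""
    args = [f"--lr={cfg['lr']}"]
    style = cfg.get("lr_scheduler_type")
    if style is not None:
        args.append(f"--lr-decay-style={style}")
    # An explicit step count takes priority over a ratio.
    if "warmup_steps" in cfg:
        args.append(f"--lr-warmup-iters={cfg['warmup_steps']}")
    elif "warmup_ratio" in cfg:
        args.append(f"--lr-warmup-fraction={cfg['warmup_ratio']}")
    return args
```

Leaving the scheduler keys optional keeps existing config files valid, since only `lr` is required.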
We should get the precision from our JSON file.
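A possible shape for this, assuming a hypothetical `precision` field in the JSON config with values `bf16`, `fp16`, or `fp32`; the helper name `build_precision_args` is also illustrative. The `--bf16` and `--fp16` flags are the Megatron-LM switches this sketch assumes.

```python
from typing import Dict, List

# Mapping from the (hypothetical) config value to launcher flags.
_PRECISION_FLAGS: Dict[str, List[str]] = {
    "bf16": ["--bf16"],
    "fp16": ["--fp16"],
    "fp32": [],  # full precision: no extra flag
}


def build_precision_args(precision: str) -> List[str]:
    """Return the CLI flags for the configured precision."""
    key = precision.lower()
    if key not in _PRECISION_FLAGS:
        raise ValueError(f"unsupported precision: {precision!r}")
    return list(_PRECISION_FLAGS[key])
```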
Could we get this config from the JSON file?
parser.add_argument("--config", required=True, help="path to config.json")
parser.add_argument("--framework", default="megatron", choices=["megatron", "infinitrain"],
                    help="training framework to use")
parser.add_argument("--gpu-platform", default="nvidia", choices=["nvidia", "other"],
Please add a device argument to the config file.
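One way to read the device from the config while still allowing a command-line override is sketched here. The `device` key, the helper name `resolve_device`, and the reuse of the `--gpu-platform` choices as allowed values are assumptions for illustration.

```python
from typing import Any, Dict, Optional

# Assumption: same values as the --gpu-platform choices in the diff above.
ALLOWED_DEVICES = ("nvidia", "other")


def resolve_device(config: Dict[str, Any], cli_value: Optional[str] = None) -> str:
    """Pick the device: CLI value first, then config, then a default."""
    device = cli_value or config.get("device", "nvidia")
    if device not in ALLOWED_DEVICES:
        raise ValueError(f"unsupported device: {device!r}")
    return device
```

For this to work, `--gpu-platform` would need `default=None`, so that a missing flag can be distinguished from an explicit one.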
baominghelly
left a comment
Comments added in the PR.
zzhfz
left a comment
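@baominghelly The training script changes are complete; please review the latest version.
Main updates
- Standardized output format: run_id/testcase now follow the spec
- Full configuration support
- Resolved 6 review comments
- Passed testing with Megatron-LM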
description
Add Modular Architecture Refactoring script
evidence
./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_result.json
./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv
./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv
./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv