LLM in Practice: Multi-GPU Distributed Training (with DeepSpeed)

Posted by 剑指南天 on 2026/05/16 13:09:00

[Abstract] A hands-on walkthrough of distributed full-parameter fine-tuning of Qwen3-4B

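1. Overview

Accelerate makes it quick to build a training pipeline in multi-GPU or multi-node environments. Because it can switch directly between distributed training backends such as FSDP and DeepSpeed, it lowers the difficulty of implementing distributed training.

The DeepSpeed distributed training framework integrates several distributed training techniques for large models, including data parallelism, pipeline parallelism, tensor parallelism, and expert parallelism. It also provides the ZeRO optimizer, which makes it suitable for training at a wide range of scales. This article focuses on the DeepSpeed training process based on the ZeRO optimizer.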
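To make the benefit of ZeRO concrete, the sketch below estimates per-GPU memory for model states under each ZeRO stage. It assumes the usual accounting for mixed-precision Adam (2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 optimizer states), roughly 4e9 parameters for Qwen3-4B, and 4 GPUs. Activations, buffers, and fragmentation are ignored, so real usage will be higher. The function name is illustrative.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB, 1 GB = 1e9 bytes) for model states only."""
    weights = 2 * num_params   # 16-bit parameters
    grads = 2 * num_params     # 16-bit gradients
    optim = 12 * num_params    # fp32 master weights + Adam momentum + variance

    if stage >= 1:             # stage 1 partitions the optimizer states
        optim /= num_gpus
    if stage >= 2:             # stage 2 also partitions the gradients
        grads /= num_gpus
    if stage >= 3:             # stage 3 also partitions the parameters
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


# Assumed values: about 4e9 parameters, 4 GPUs.
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_model_state_gb(4e9, 4, stage):.0f} GB per GPU")
```

Under these assumptions the estimate drops from 64 GB per GPU without ZeRO to 16 GB per GPU at stage 3, which is why stage 3 is attractive for full-parameter fine-tuning of a model this size.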

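2. Model selection

Qwen/Qwen3-4B, the model loaded in the training script in section 8.

3. Training dataset

The dataset consists mainly of psychological-comfort conversations.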
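For orientation, the training script in section 8 expects each JSONL record to hold a `conversation` list of `human`/`assistant` turns and converts it to the chat-style `messages` format. The record below is made up for illustration (its text and metadata values are not taken from the real dataset); only the field names come from the script.

```python
# Hypothetical record: field names follow the training script, values are invented.
record = {
    "conversation_id": 1,
    "category": "example",
    "conversation": [
        {
            "human": "I have been feeling anxious lately.",
            "assistant": "That sounds hard. What has been on your mind?",
        },
    ],
    "dataset": "example",
}


def to_messages(example):
    """Flatten human/assistant turn pairs into a chat-style message list."""
    messages = []
    for turn in example["conversation"]:
        messages.append({"role": "user", "content": turn["human"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    return {"messages": messages}


converted = to_messages(record)
print([m["role"] for m in converted["messages"]])
```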

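4. Fine-tuning method

Full-parameter fine-tuning based on DeepSpeed, using the ZeRO optimizer.

5. Distributed training

Yes, training on 4 GPUs.

6. GPU selection

7. Fine-tuning tool

The SFT Trainer from TRL, which is designed for supervised fine-tuning.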

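8. Code walkthrough

① Distributed environment setup

(1) Install dependencies

pip install trl accelerate deepspeed

(2) Configure accelerate

accelerate config

Contents of the generated file (passed to accelerate launch below as qwen4B-full-zero3.yaml):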

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
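For readers more familiar with raw DeepSpeed config files, the `deepspeed_config` block above corresponds roughly to the dictionary below. This is a hand-written approximation, not the exact config Accelerate builds internally; the micro-batch size of 8 is taken from `per_device_train_batch_size` in the training script.

```python
import json

# Approximate DeepSpeed-style equivalent of the accelerate YAML above.
ds_config = {
    "bf16": {"enabled": True},                  # mixed_precision: bf16
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 8,        # from the training script
    "zero_optimization": {
        "stage": 3,                             # zero_stage: 3
        "offload_optimizer": {"device": "none"},
        "offload_param": {"device": "none"},
        # zero3_save_16bit_model: false
        "stage3_gather_16bit_weights_on_model_save": False,
    },
}

print(json.dumps(ds_config, indent=2))
```

Because `zero3_save_16bit_model` is false, checkpoints hold sharded ZeRO states rather than a consolidated 16-bit model, which is why a conversion step with zero_to_fp32.py appears later in this walkthrough.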

② Training code (qwen4B_full.py)

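import os

# Set the mirror endpoint and cache directory for pretrained model downloads.
# Do this before importing datasets/transformers: huggingface_hub reads these
# environment variables at import time.
os.environ['HF_ENDPOINT']='https://hf-mirror.com/'
os.environ['HF_HOME']='/root/autodl-tmp/hf'

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
import torch
from transformers import AutoModelForCausalLM,AutoTokenizer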

# Configure model and tokenizer
model_name = "Qwen/Qwen3-4B"
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name,dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Load the dataset and hold out 10% for evaluation
dataset_dict = load_dataset("json",data_files="/root/autodl-tmp/hf/data/psychology_data.jsonl")
dataset_dict = dataset_dict['train'].train_test_split(test_size=0.1,shuffle=True)

def map_func(message):
    conversation = message['conversation']
    messages = []
    for item in conversation:
        messages.append({"role": "user", "content": item['human']})
        messages.append({"role": "assistant", "content":item['assistant']})
    return {"messages":messages}
dataset_dict = dataset_dict.map(map_func,batched=False,remove_columns=['conversation_id','category','conversation','dataset'])

# Configure trainer
training_args = SFTConfig(
    output_dir="/root/autodl-tmp/hf/model/Qwen3-4B/full_zero3/model1/",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    save_steps=20,
    save_total_limit=1,
    eval_strategy="steps",
    eval_steps=20,
    load_best_model_at_end=True,
    logging_dir="/root/tf-logs/",
    logging_strategy='steps',
    logging_steps=10,
    bf16=True,
    warmup_ratio=0.1,
)
# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["test"],
    processing_class=tokenizer,
)

# Start training
trainer.train()
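One thing worth checking before launch is the effective global batch size: the per-device batch size times the gradient accumulation steps times the number of processes. A minimal sketch using the values from the config and script above; the dataset size is a placeholder, and exact step rounding depends on the Trainer version.

```python
import math

per_device_batch = 8   # per_device_train_batch_size in SFTConfig
grad_accum = 8         # gradient_accumulation_steps
num_gpus = 4           # num_processes in the accelerate config

# Samples consumed per optimizer step across all GPUs.
global_batch = per_device_batch * grad_accum * num_gpus
print(global_batch)    # 256


def approx_optimizer_steps(num_train_samples: int, epochs: int = 1) -> int:
    """Rough optimizer step count; assumes a final partial batch still counts."""
    return math.ceil(num_train_samples / global_batch) * epochs


hypothetical_train_size = 10_000   # placeholder, not the real dataset size
print(approx_optimizer_steps(hypothetical_train_size))
```

The gradient accumulation value of 8 appears in both the accelerate YAML and `SFTConfig`; keeping the two in sync avoids ambiguity about which one is in effect.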

③ Start training

accelerate launch --config_file qwen4B-full-zero3.yaml qwen4B_full.py

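(1) Load the model weights

(2) Process the training data

(3) Tokenize with the tokenizer

(4) Train and evaluate

Logs and monitoring data:

GPU

GPU memory utilization

⑤ Save the model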

cd /root/autodl-tmp/hf/model/Qwen3-4B/full_zero3/model1/checkpoint-272

python zero_to_fp32.py ./ ./
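The same consolidation can also be done from Python. The sketch below assumes DeepSpeed is installed and that the checkpoint directory contains the `latest` tag file that DeepSpeed writes next to its step folder. `read_latest_tag` and `load_fp32_state_dict` are illustrative helper names; `get_fp32_state_dict_from_zero_checkpoint` is the utility from the `deepspeed.utils.zero_to_fp32` module.

```python
from pathlib import Path


def read_latest_tag(checkpoint_dir: str) -> str:
    """Return the checkpoint tag stored in the `latest` file."""
    return (Path(checkpoint_dir) / "latest").read_text().strip()


def load_fp32_state_dict(checkpoint_dir: str):
    """Consolidate the ZeRO shards into a single fp32 state dict on CPU."""
    # Imported lazily so read_latest_tag works without DeepSpeed installed.
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    tag = read_latest_tag(checkpoint_dir)
    return get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=tag)


if __name__ == "__main__":
    ckpt = "/root/autodl-tmp/hf/model/Qwen3-4B/full_zero3/model1/checkpoint-272"
    # Only attempt the conversion when the checkpoint is actually present.
    if (Path(ckpt) / "latest").exists():
        state_dict = load_fp32_state_dict(ckpt)
        print(f"{len(state_dict)} tensors consolidated")
```

Consolidation happens in CPU memory, so the machine needs enough RAM to hold the full fp32 weights.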

9. Inference comparison

① Qwen/Qwen3-4B (base model)

from transformers import AutoModelForCausalLM, AutoTokenizer,TextStreamer

model_name = "Qwen/Qwen3-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

prompt = "觉得找心里咨询师很丢脸? 我今年17，因为某些原因觉得要找个心理咨询师咨询一下，但我没法向家长开口，我觉得心理问题就像得了艾滋病一样让人难以启齿，很难堪。而且父母也不会理解，现在花一样的年纪，能有什么心坎。我该怎么办?"
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer,skip_prompt=True)

# Despite returning the usual output, the streamer will also print the generated text to stdout.
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=1024)

② Fine-tuned model

from transformers import AutoModelForCausalLM, AutoTokenizer,TextStreamer

model_name = "/root/autodl-tmp/hf/model/Qwen3-4B/full_zero3/model1/checkpoint-272/"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "觉得找心里咨询师很丢脸? 我今年17，因为某些原因觉得要找个心理咨询师咨询一下，但我没法向家长开口，我觉得心理问题就像得了艾滋病一样让人难以启齿，很难堪。而且父母也不会理解，现在花一样的年纪，能有什么心坎。我该怎么办?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
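The index arithmetic in the `try` block above is a reverse search: it finds the last occurrence of the `</think>` token id and splits just after it. A toy sketch with made-up token ids (only 151668 comes from the script; the helper name is illustrative):

```python
THINK_END = 151668  # token id of </think>, as used in the script above


def split_thinking(output_ids, end_id=THINK_END):
    """Split token ids into (thinking, answer) at the last end-of-thinking token."""
    try:
        # Position just past the last occurrence of end_id.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        index = 0  # no thinking block: everything is the answer
    return output_ids[:index], output_ids[index:]


print(split_thinking([11, 12, THINK_END, 21, 22]))
print(split_thinking([21, 22]))
```

With `enable_thinking=False`, as in the script, the second case is the expected one: no `</think>` token is found and the whole output is treated as the answer.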
