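LLM Principles: TRL (SFT Trainer Edition)

Posted by 剑指南天 on 2026/05/14 09:18:44

[Abstract] TRL's SFT Trainer is a tool for supervised fine-tuning that supports both full-parameter fine-tuning and parameter-efficient fine-tuning.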

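1. Overview

TRL (Transformer Reinforcement Learning) is a full-stack library that provides a set of tools for training Transformer language models, with methods such as supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), and reward modeling.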

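2. Supervised Fine-Tuning

Supervised fine-tuning (SFT) is the process of teaching a pretrained model to follow instructions and to converse in a specific output format. It is a key technique for adapting a general-purpose large language model to a specific task, and its core value lies in three areas:

① Stronger task performance: it significantly improves the model's reliability and accuracy on a specific task, so that it better meets the needs of real applications.

② Customized output behavior: the model generates content in the expected tone and format, keeping output consistent, reliable, and compliant.

③ Injected domain knowledge: domain-specific terminology, rules, or business logic are internalized into the model parameters, improving applicability and performance in specialized scenarios.

For a concrete task, however, lightweight methods such as prompt engineering should be tried first. SFT should be considered only when prompt engineering cannot reach the desired result. The following situations are reasonable grounds for adopting SFT:

① Insufficient model capability: the base model performs unstably on the target task, and optimizing prompts or adding in-context examples still does not meet the performance requirements.

② High output-consistency requirements: the application imposes strict rules on output format or style, and prompting alone cannot guarantee stable output quality over time.

③ Lower usage cost: the total cost (training, deployment, inference time) of fine-tuning and serving a small specialized model is significantly lower than that of using a large general-purpose model directly.

This article introduces the SFT Trainer that TRL provides for supervised fine-tuning.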

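3. Dataset Preprocessing

SFTTrainer supports both standard and conversational dataset formats. When it receives a conversational dataset, the trainer automatically applies the chat template to it.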

# Standard language modeling
{"text": "The sky is blue."}

# Conversational language modeling
{"messages": [{"role": "user", "content": "What color is the sky?"},{"role": "assistant", "content": "It is blue."}]}

# Standard prompt-completion
{"prompt": "The sky is", "completion": " blue."}

# Conversational prompt-completion
{"prompt": [{"role": "user", "content": "What color is the sky?"}], "completion": [{"role": "assistant", "content": "It is blue."}]}
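The four formats above can be told apart by their keys and by whether the prompt is a string or a list of messages. The following is a minimal sketch in plain Python; `detect_format` is a hypothetical helper written for illustration and is not part of TRL.

```python
def detect_format(sample: dict) -> str:
    """Classify one record into the four dataset formats listed above."""
    if "messages" in sample:
        return "conversational language modeling"
    if "text" in sample:
        return "standard language modeling"
    if "prompt" in sample and "completion" in sample:
        # Conversational prompts are lists of {"role", "content"} messages;
        # standard prompts are plain strings.
        if isinstance(sample["prompt"], list):
            return "conversational prompt-completion"
        return "standard prompt-completion"
    raise ValueError(f"unrecognized sample keys: {sorted(sample)}")


samples = [
    {"text": "The sky is blue."},
    {"messages": [{"role": "user", "content": "What color is the sky?"},
                  {"role": "assistant", "content": "It is blue."}]},
    {"prompt": "The sky is", "completion": " blue."},
    {"prompt": [{"role": "user", "content": "What color is the sky?"}],
     "completion": [{"role": "assistant", "content": "It is blue."}]},
]
for sample in samples:
    print(detect_format(sample))
```

A check like this can be run over a few records before training to catch a dataset whose columns do not match any supported layout.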

If a dataset does not match any of the formats above, it must be converted first.

from datasets import load_dataset

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
print(dataset['train'][0])

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["Question"]}],
        "completion": [
            {"role": "assistant", "content": f"<think>{example['Complex_CoT']}</think>{example['Response']}"}
        ],
    }

dataset = dataset.map(preprocess_function, remove_columns=["Question", "Response", "Complex_CoT"])
print(next(iter(dataset["train"])))

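Output: each record now contains only the prompt and completion fields, in the conversational prompt-completion format.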
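To see what the conversion produces without downloading the dataset, the same function can be applied to a single record. This is a small sketch; the record below is a hypothetical example that only mimics the column names (Question, Complex_CoT, Response) used in the code above.

```python
def preprocess_function(example):
    # Same conversion as above: the question becomes the user prompt, and the
    # reasoning chain plus the final answer become the assistant completion.
    return {
        "prompt": [{"role": "user", "content": example["Question"]}],
        "completion": [
            {"role": "assistant",
             "content": f"<think>{example['Complex_CoT']}</think>{example['Response']}"}
        ],
    }


# Hypothetical record, for illustration only.
record = {
    "Question": "What color is the sky?",
    "Complex_CoT": "On a clear day the sky looks blue.",
    "Response": "It is blue.",
}
converted = preprocess_function(record)
print(converted["prompt"])
print(converted["completion"][0]["content"])
```

In the real pipeline, dataset.map merges the returned fields into each row, and remove_columns drops the three original columns.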

4. The SFTTrainer Class

class trl.SFTTrainer(
    model: str | PreTrainedModel | PeftModel,
    args: trl.trainer.sft_config.SFTConfig | transformers.training_args.TrainingArguments | None = None,
    data_collator: collections.abc.Callable[[list[typing.Any]], dict[str, typing.Any]] | None = None,
    train_dataset: datasets.arrow_dataset.Dataset | datasets.iterable_dataset.IterableDataset | None = None,
    eval_dataset: datasets.arrow_dataset.Dataset | datasets.iterable_dataset.IterableDataset | dict[str, datasets.arrow_dataset.Dataset | datasets.iterable_dataset.IterableDataset] | None = None,
    processing_class: transformers.tokenization_utils_base.PreTrainedTokenizerBase | transformers.processing_utils.ProcessorMixin | None = None,
    compute_loss_func: collections.abc.Callable | None = None,
    compute_metrics: collections.abc.Callable[[transformers.trainer_utils.EvalPrediction], dict] | None = None,
    callbacks: list[transformers.trainer_callback.TrainerCallback] | None = None,
    optimizers: tuple = (None, None),
    optimizer_cls_and_kwargs: tuple[type[torch.optim.optimizer.Optimizer], dict[str, typing.Any]] | None = None,
    preprocess_logits_for_metrics: collections.abc.Callable[[torch.Tensor, torch.Tensor], torch.Tensor] | None = None,
    peft_config: PeftConfig | None = None,
    formatting_func: collections.abc.Callable[[dict], str] | None = None,
)

Parameters:

model：(str or PreTrainedModel or PeftModel) — Model to be trained. Can be either:

①A string, being the model id of a pretrained model hosted inside a model repo on huggingface.co, or a path to a directory containing model weights saved using save_pretrained, e.g. './my_model_directory/'. The model is loaded using <ModelArchitecture>.from_pretrained (where <ModelArchitecture> is derived from the model config) with the keyword arguments in args.model_init_kwargs.

from transformers import AutoModelForCausalLM
from trl import SFTConfig
import torch

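# "Qwen/Qwen3-0.6B" is the model id
# The model can be configured here, e.g. dtype=torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.bfloat16)
# It can also be configured through model_init_kwargs in SFTConfig (used when the model is passed as a string)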
training_args = SFTConfig(
    model_init_kwargs={"dtype": torch.bfloat16},
)

②A PreTrainedModel object. Only causal language models are supported.

③A PeftModel object. Only causal language models are supported. If you’re training a model with an MoE architecture and want to include the load balancing/auxiliary loss as a part of the final loss, remember to set the output_router_logits config of the model to True.

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM

config = PeftConfig.from_pretrained("ybelkada/opt-350m-lora")
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
lora_model = PeftModel.from_pretrained(model, "ybelkada/opt-350m-lora")

The dtype used when loading the model parameters affects both GPU memory usage and compute cost.

args: (SFTConfig, optional) — Configuration for this trainer. If None, a default configuration is used. This is the main configuration object for SFT.

data_collator: (DataCollator, optional) — Function to use to form a batch from a list of elements of the processed train_dataset or eval_dataset. Will default to DataCollatorForLanguageModeling if the model is a language model and DataCollatorForVisionLanguageModeling if the model is a vision-language model. Custom collators must truncate sequences before padding; the trainer does not apply post-collation truncation. The collator turns dataset samples into batches the model can consume: for example, sequences longer than max_length are truncated, and each sequence is then padded to the longest sequence in the same batch.
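The truncate-then-pad behaviour described for the collator can be sketched in plain Python. This is a simplified illustration and not TRL's DataCollatorForLanguageModeling; the helper name `collate` and the choice of right-side padding are assumptions made for the example.

```python
def collate(batch, max_length, pad_token_id):
    """Truncate each sequence to max_length, then pad to the longest in the batch."""
    # Step 1: truncate first, so padding is computed on the final lengths.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        pad = longest - len(ids)
        # Step 2: pad on the right; the mask marks real tokens with 1.
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = collate([[5, 6, 7, 8, 9], [1, 2]], max_length=4, pad_token_id=0)
print(out["input_ids"])       # [[5, 6, 7, 8], [1, 2, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

A real collator also builds the labels tensor and returns tensors rather than lists; the sketch keeps only the length handling.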

train_dataset： (Dataset or IterableDataset) — Dataset to use for training. This trainer supports both language modeling type and prompt-completion type. The format of the samples can be either:

Standard: Each sample contains plain text.
Conversational: Each sample contains structured messages (e.g., role and content).

The trainer also supports processed datasets (tokenized) as long as they contain an input_ids field.

eval_dataset： (Dataset, IterableDataset or dict[str, Dataset | IterableDataset]) — Dataset to use for evaluation. It must meet the same requirements as train_dataset.

processing_class： (PreTrainedTokenizerBase, ProcessorMixin, optional) — Processing class used to process the data. If None, the processing class is loaded from the model’s name with from_pretrained. A padding token, tokenizer.pad_token, must be set. If the processing class has not set a padding token, tokenizer.eos_token will be used as the default.

compute_loss_func： (Callable, optional) — A function that accepts the raw model outputs, labels, and the number of items in the entire accumulated batch (batch_size * gradient_accumulation_steps) and returns the loss.

compute_metrics： (Callable[[EvalPrediction], dict], optional) — The function that will be used to compute metrics at evaluation. Must take a EvalPrediction and return a dictionary string to metric values. When passing SFTConfig with batch_eval_metrics set to True, your compute_metrics function must take a boolean compute_result argument. This will be triggered after the last eval batch to signal that the function needs to calculate and return the global summary statistics rather than accumulating the batch-level statistics.

optimizers：( tuple[torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] ,optional, defaults to (None, None) ) — A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup controlled by args .

peft_config： (PeftConfig, optional) — PEFT configuration used to wrap the model. If None, the model is not wrapped.

formatting_func: (Callable, optional) — Formatting function applied to the dataset before tokenization. Applying the formatting function explicitly converts the dataset into a language modeling type.
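As an illustration of formatting_func, the sketch below turns one record into a single training string, matching the Callable[[dict], str] type in the signature. The column names question and answer and the prompt layout are hypothetical; they should be replaced by whatever the actual dataset uses.

```python
def formatting_func(example: dict) -> str:
    """Render one record as a single text string for language modeling."""
    # "question" and "answer" are assumed column names for this example.
    return f"### Question: {example['question']}\n### Answer: {example['answer']}"


text = formatting_func({"question": "What color is the sky?", "answer": "It is blue."})
print(text)
```

Such a function would be passed as SFTTrainer(..., formatting_func=formatting_func); because the result is plain text, the dataset is then treated as a language modeling dataset.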

Code configuration:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load dataset
dataset = load_dataset("HuggingFaceTB/smoltalk", "all")

# Configure model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Configure trainer
training_args = SFTConfig(
    output_dir="./sft_output",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=50,
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)

# Start training
trainer.train()

5. SFTConfig

① Basic training

num_train_epochs：( float , optional, defaults to 3.0) — Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).

max_steps: ( int , optional, defaults to -1) — If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs . For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached. Without gradient accumulation, each processed batch counts as one step.

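per_device_train_batch_size: ( int , optional, defaults to 8) — The batch size per device. The global batch size is computed as: per_device_train_batch_size * number_of_devices in multi-GPU or distributed setups. The group of samples used in one forward and backward pass to compute gradients and update the model parameters (gradient accumulation aside) is called a batch, and that process is called a step. The number of samples in the batch is the batch_size. A smaller value reduces GPU memory usage.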

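gradient_accumulation_steps: ( int , optional, defaults to 1) — Number of updates steps to accumulate the gradients for, before performing a backward/update pass. Gradients from several batches are accumulated before the weights are updated, so a GPU with little memory can approximate the effective batch size of a larger one.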

max_length: ( int or None , optional, defaults to 1024 ) — Maximum length of the tokenized sequence. Sequences longer than max_length are truncated from the right. If None , no truncation is applied. When packing is enabled, this value sets the sequence length. A smaller max_length reduces GPU memory usage.

② Training optimization

bf16: ( bool , optional, defaults to False ) — Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires Ampere or higher NVIDIA architecture or Intel XPU or using CPU (use_cpu) or Ascend NPU.

fp16： ( bool , optional, defaults to False ) — Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.

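These two parameters configure mixed-precision training. BF16 has a wide numeric range but low precision; FP16 has a narrow numeric range but higher precision. Both reduce GPU memory usage and compute cost compared with 32-bit training.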
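The range-versus-precision trade-off follows directly from how each format splits its 16 bits: FP16 uses 5 exponent bits and 10 mantissa bits, while BF16 uses 8 exponent bits and 7 mantissa bits. A small sketch in plain Python (the helper name `float_format_stats` is made up for this example) derives the largest finite value and the spacing between 1.0 and the next representable number:

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int):
    """Return (largest finite value, machine epsilon) for an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1               # also the largest normal exponent
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits                   # gap between 1.0 and the next value
    return max_value, epsilon


fp16_max, fp16_eps = float_format_stats(exponent_bits=5, mantissa_bits=10)
bf16_max, bf16_eps = float_format_stats(exponent_bits=8, mantissa_bits=7)

print(f"fp16: max={fp16_max:.0f}, eps={fp16_eps}")
print(f"bf16: max={bf16_max:.3e}, eps={bf16_eps}")
```

FP16 tops out at 65504, which is why it usually needs loss scaling to avoid overflow and underflow, whereas BF16 covers roughly the same range as FP32 at the cost of a coarser step size.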

half_precision_backend: ( str , optional, defaults to "auto" ) — The backend to use for mixed precision training. Must be one of "auto", "apex", "cpu_amp" . "auto" will use CPU/CUDA AMP or APEX depending on the PyTorch version detected, while the other choices will force the requested backend. With "auto", the mixed-precision backend is selected automatically for the device; the memory savings come from mixed precision itself, not from the backend choice.

activation_offloading: ( bool , optional, defaults to False ) — Whether to offload the activations to the CPU. To save GPU memory, activations from the forward pass are temporarily offloaded to CPU memory.

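gradient_checkpointing: ( bool , optional, defaults to False ) — If True, use gradient checkpointing to save memory at the expense of slower backward pass. Only part of the activations are kept during the forward pass, and the rest are recomputed on demand during the backward pass, which reduces GPU memory usage. It is often used together with ZeRO; the activations in question are the intermediate outputs computed during the forward pass.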

weight_decay: ( float , optional, defaults to 0) — The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in AdamW optimizer. This is the weight decay coefficient.

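max_grad_norm: ( float , optional, defaults to 1.0) — Maximum gradient norm (for gradient clipping). This is the gradient clipping threshold. If the total gradient norm exceeds it, the gradients are scaled down so that their norm equals the threshold, which prevents exploding gradients.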
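Clipping by norm rescales the whole gradient vector rather than capping each element. The sketch below shows the idea on a flat list of numbers; it is a simplified illustration (the helper name `clip_by_global_norm` is hypothetical), and framework implementations additionally add a small epsilon to the denominator.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale grads so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the threshold: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm
    # Every component shrinks by the same factor, so the direction is preserved.
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, clipped, norm_after)
```

With gradients [3.0, 4.0] the norm is 5.0, so both components are multiplied by 0.2 and the direction of the update is unchanged.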

③ Evaluation

eval_strategy： ( str or IntervalStrategy, optional, defaults to "no" ) — The evaluation strategy to adopt during training. Possible values are: "no" : No evaluation is done during training. "steps" : Evaluation is done (and logged) every eval_steps . "epoch" : Evaluation is done at the end of each epoch.

eval_steps： ( int or float , optional) — Number of update steps between two evaluations if eval_strategy="steps" . Will default to the same value as logging_steps if not set. Should be an integer or a float in range [0,1) . If smaller than 1, will be interpreted as ratio of total training steps.

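The number of samples consumed per update step depends on the number of GPUs, the per-device batch size, and the number of accumulated batches: per_device_train_batch_size * number_of_devices * gradient_accumulation_steps. For a fixed dataset, this product therefore determines how many steps one epoch takes.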
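The relationship between these three settings and the step count can be worked through with simple arithmetic. The sketch below uses hypothetical helper names and rounds the last partial batch up, which is an assumption; the exact rounding depends on dataloader settings and the Trainer version.

```python
import math


def effective_batch_size(per_device: int, num_devices: int, grad_accum: int) -> int:
    """Samples consumed by one optimizer update."""
    return per_device * num_devices * grad_accum


def update_steps_per_epoch(num_samples: int, per_device: int,
                           num_devices: int, grad_accum: int) -> int:
    """Approximate optimizer updates per epoch (last partial batch rounded up)."""
    return math.ceil(num_samples / effective_batch_size(per_device, num_devices, grad_accum))


# Hypothetical setup: 10,000 samples, 2 GPUs, batch size 4, 8 accumulation steps.
print(effective_batch_size(4, 2, 8))             # 64
print(update_steps_per_epoch(10_000, 4, 2, 8))   # 157
```

With eval_steps=50 in this setup, evaluation would run roughly three times per epoch; doubling gradient_accumulation_steps would halve the steps per epoch and change that cadence.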

per_device_eval_batch_size： ( int , optional, defaults to 8) — The batch size per device accelerator core/CPU for evaluation.

④ Learning rate (these settings drive the default optimizer and scheduler described under optimizers)

learning_rate： ( float , optional, defaults to 5e-5) — The initial learning rate for AdamW optimizer.

lr_scheduler_type： ( str or SchedulerType, optional, defaults to "linear" ) — The scheduler type to use. See the documentation of SchedulerType for all possible values.

warmup_ratio： ( float , optional, defaults to 0.0) — Ratio of total training steps used for a linear warmup from 0 to learning_rate .

warmup_steps： ( int , optional, defaults to 0) — Number of steps used for a linear warmup from 0 to learning_rate . Overrides any effect of warmup_ratio .

The length of the warmup phase is controlled either by a ratio or by a number of steps.
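The default "linear" schedule with warmup can be written out as a small function: the learning rate climbs linearly from 0 to learning_rate during warmup, then decays linearly to 0 at the final step. This is a sketch in plain Python with hypothetical helper names; the rounding used when converting warmup_ratio into steps is an assumption.

```python
import math


def resolve_warmup_steps(warmup_steps: int, warmup_ratio: float, total_steps: int) -> int:
    """warmup_steps wins when set; otherwise derive it from warmup_ratio."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(total_steps * warmup_ratio)  # rounding up is an assumption


def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


warmup = resolve_warmup_steps(warmup_steps=0, warmup_ratio=0.25, total_steps=1000)
for step in (0, 125, 250, 625, 1000):
    print(step, linear_schedule_lr(step, base_lr=5e-5, warmup_steps=warmup, total_steps=1000))
```

The printed values rise to the peak at step 250 and fall back to zero at step 1000, which is a quick way to sanity-check a warmup setting before launching a long run.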

⑤ Loss computation

completion_only_loss: ( bool , optional) — Whether to compute loss only on the completion part of the sequence. If set to True , loss is computed only on the completion, which is supported only for prompt-completion datasets. If False , loss is computed on the entire sequence. If None (default), the behavior depends on the dataset: loss is computed on the completion for prompt-completion datasets, and on the full sequence for language modeling datasets. During training, only the completion part of a prompt-completion dataset contributes to the loss, so prompt tokens produce no training signal; the forward pass still processes the whole sequence.

assistant_only_loss: ( bool , optional, defaults to False ) — Whether to compute loss only on the assistant part of the sequence. If set to True , loss is computed only on the assistant responses, which is supported only for conversational datasets. If False , loss is computed on the entire sequence. During training, only the assistant turns of a conversational dataset contribute to the loss; the forward pass still processes the whole conversation.
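Both options come down to masking labels: positions that should not contribute to the loss receive the label -100, the default ignore index of PyTorch's cross-entropy loss. The following is a minimal sketch in plain Python; `build_labels` is a hypothetical helper and the token ids are made up.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(input_ids, loss_mask):
    """Keep the token id where loss_mask is 1, otherwise mark it as ignored."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, loss_mask)]


prompt_ids = [11, 12, 13]   # made-up ids for the prompt tokens
completion_ids = [21, 22]   # made-up ids for the completion tokens

input_ids = prompt_ids + completion_ids
loss_mask = [0] * len(prompt_ids) + [1] * len(completion_ids)

labels = build_labels(input_ids, loss_mask)
print(labels)  # [-100, -100, -100, 21, 22]
```

For assistant_only_loss the mask would instead mark the tokens of every assistant turn, which is why that option needs a chat template that can report where those turns start and end.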

⑥ Saving

output_dir： ( str , optional, defaults to "trainer_output" ) — The output directory where the model predictions and checkpoints will be written.

save_strategy: ( str or SaveStrategy , optional, defaults to "steps" ) — The checkpoint save strategy to adopt during training. Possible values are: "no" : No save is done during training. "epoch" : Save is done at the end of each epoch. "steps" : Save is done every save_steps . "best" : Save is done whenever a new best_metric is achieved. If "epoch" or "steps" is chosen, saving will also be performed at the very end of training, always.

save_steps： ( int or float , optional, defaults to 500) — Number of updates steps before two checkpoint saves if save_strategy="steps" . Should be an integer or a float in range [0,1) . If smaller than 1, will be interpreted as ratio of total training steps.
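save_steps, eval_steps, and logging_steps all accept either an absolute step count or a fraction of the total training steps. The sketch below shows one way such a value could be resolved; the helper name is hypothetical, and rounding the fractional case up is an assumption rather than documented behaviour.

```python
import math


def resolve_interval(value, max_steps: int) -> int:
    """Turn an int step count or a float ratio in [0, 1) into a step interval."""
    if value < 1:
        # Interpreted as a ratio of total training steps (rounded up: assumption).
        return math.ceil(max_steps * value)
    return int(value)


print(resolve_interval(100, max_steps=1000))   # absolute: every 100 steps
print(resolve_interval(0.25, max_steps=1000))  # ratio: every 250 steps
```

Expressing intervals as ratios keeps the number of checkpoints and evaluations stable when max_steps changes between runs.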

save_total_limit: ( int , optional) — If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir . When load_best_model_at_end is enabled, the “best” checkpoint according to metric_for_best_model will always be retained in addition to the most recent ones. For example, for save_total_limit=5 and load_best_model_at_end , the four last checkpoints will always be retained alongside the best model. When save_total_limit=1 and load_best_model_at_end , it is possible that two checkpoints are saved: the last one and the best one (if they are different). Set this whenever disk space is limited.

load_best_model_at_end: ( bool , optional, defaults to False ) — Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved. This keeps the best model under the chosen metric.

metric_for_best_model: ( str , optional) — Use in conjunction with load_best_model_at_end to specify the metric to use to compare two different models. Must be the name of a metric returned by the evaluation with or without the prefix "eval_" . If not specified, this will default to "loss" when either load_best_model_at_end == True or lr_scheduler_type == SchedulerType.REDUCE_ON_PLATEAU (to use the evaluation loss). If you set this value, greater_is_better will default to True unless the name ends with “loss”. Don’t forget to set it to False if your metric is better when lower. This is the metric used to decide which model is best; the default is the loss on the evaluation set.

greater_is_better： ( bool , optional) — Use in conjunction with load_best_model_at_end and metric_for_best_model to specify if better models should have a greater metric or not. Will default to: True if metric_for_best_model is set to a value that doesn’t end in "loss" . False if metric_for_best_model is not set, or set to a value that ends in "loss" .

⑦ Logging

logging_dir: ( str , optional) — TensorBoard log directory. Will default to output_dir/runs/CURRENT_DATETIME_HOSTNAME.

logging_strategy: ( str or IntervalStrategy, optional, defaults to "steps" ) — The logging strategy to adopt during training. Possible values are: "no" : No logging is done during training. "epoch" : Logging is done at the end of each epoch. "steps" : Logging is done every logging_steps .

logging_steps: ( int or float , optional, defaults to 500) — Number of update steps between two logs if logging_strategy="steps" . Should be an integer or a float in range [0,1) . If smaller than 1, will be interpreted as ratio of total training steps.

Code configuration:

# Configure trainer
training_args = SFTConfig(
    output_dir="./sft_output",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=50,
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)

6. PeftConfig

Code configuration:

from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias type for LoRA. "none" trains no bias terms; "all" or "lora_only" update the corresponding biases.
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
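To get a feel for how r and lora_alpha affect the adapter, the sketch below counts the parameters LoRA adds to a single linear layer and computes the standard scaling factor lora_alpha / r applied to the low-rank update. The 4096 x 4096 layer size is a hypothetical example, and the helper names are made up for illustration.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


d_in = d_out = 4096                  # hypothetical projection layer size
full = d_in * d_out                  # parameters in the frozen weight matrix
added = lora_added_params(d_in, d_out, r=6)

print(f"frozen weights: {full:,}")
print(f"LoRA weights:   {added:,} ({added / full:.2%} of the layer)")
print(f"scaling:        {lora_scaling(lora_alpha=8, r=6):.3f}")
```

With r=6 the adapter trains well under 1% of the layer's parameters, which is where the memory savings of parameter-efficient fine-tuning come from.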

After training, the adapter weights can be merged back into the base model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# 1. Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "base_model_name", torch_dtype=torch.float16, device_map="auto"
)

# 2. Load the PEFT model with adapter
peft_model = PeftModel.from_pretrained(
    base_model, "path/to/adapter", torch_dtype=torch.float16
)

# 3. Merge adapter weights with base model
merged_model = peft_model.merge_and_unload()

# Save both model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("base_model_name")
merged_model.save_pretrained("path/to/save/merged_model")
tokenizer.save_pretrained("path/to/save/merged_model")

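Alternatively, use the Unsloth library:

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

max_length = 2048  # Example value for the maximum sequence length

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b",
    max_seq_length=max_length,
    dtype=None,  # None for auto-detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,  # Use 4bit quantization to reduce memory usage. Can be False
)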

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Dropout = 0 is currently optimized
    bias="none",  # Bias = "none" is currently optimized
    use_gradient_checkpointing=True,
    random_state=3407,
)

training_args = SFTConfig(output_dir="./output", max_length=max_length)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

7. QLoRA

Bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4-bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

set load_in_4bit=True to quantize the model to 4-bits when you load it
set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution
set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights
set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation

import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig

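# Quantization configuration
config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type="nf4",  # NF4: the 16 quantization levels are derived from quantiles of a standard normal distribution
    bnb_4bit_use_double_quant=True,  # Double (nested) quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute dtype used after dequantization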
)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", quantization_config=config)
# This method wraps the entire protocol for preparing a model before running a training. This includes:
#   1- Cast the layernorm in fp32
#   2- making output embedding layer require grads
#   3- Add the upcasting of the lm head to fp32
#   4- Freezing the base model layers to ensure they are not updated during training
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)

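Note: quantized weights are generally not trained directly, because the reduced precision of weights and activations makes training unstable. In QLoRA the quantized base weights stay frozen and only the LoRA adapter weights are updated.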
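The memory benefit of 4-bit loading can be estimated with simple arithmetic. The sketch below counts weight storage only, assuming a round figure of 7 billion parameters; it ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 7e9  # assumed round figure for a 7B-parameter model

bf16_gib = weight_memory_gib(num_params, bits_per_param=16)
nf4_gib = weight_memory_gib(num_params, bits_per_param=4)

print(f"bf16 weights:  {bf16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
```

The 4x reduction in weight storage is what lets a 7B model fit on a single consumer GPU, leaving room for the small LoRA adapter and its optimizer state.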

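8. Summary

Depending on whether all model parameters are updated, supervised fine-tuning methods fall into full-parameter fine-tuning and parameter-efficient fine-tuning. Full-parameter fine-tuning can be done with SFTTrainer and SFTConfig alone; parameter-efficient fine-tuning relies on the peft or unsloth library.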
