LLM Fundamentals: The Evolution of Sentiment Analysis (Transformer Pre-trained Models)
[Abstract] Building a text sentiment classification model on top of the pre-trained Transformer model BERT
1. Requirements
The goal of this case study is to build a text sentiment classification model on top of the pre-trained Transformer model BERT, and use it to classify review text into two classes (positive or negative).
2. Design
Data source: https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip
Model architecture: the model consists of a pre-trained BERT layer followed by a single linear layer.
Training scheme: the loss function is BCEWithLogitsLoss, which fuses the sigmoid activation with the binary cross-entropy computation; it is numerically stable and well suited to binary classification. The Adam optimizer handles parameter updates for efficient training.
Evaluation scheme: after training, measure accuracy on the held-out test set.
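To see why the fused BCEWithLogitsLoss is more stable than applying sigmoid and then binary cross-entropy separately, here is a minimal pure-Python sketch of the log-sum-exp rewrite it relies on (the helper name `bce_with_logits` is ours, for illustration only):

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed directly on a raw logit.

    Algebraically equal to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))],
    but rewritten as max(x, 0) - x*t + log(1 + exp(-|x|)) so that exp()
    never sees a large positive argument and log() never sees 0.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

print(bce_with_logits(0.0, 1.0))   # log(2) ~ 0.693: a 50/50 prediction
print(bce_with_logits(40.0, 0.0))  # 40.0: the naive sigmoid -> log path hits log(0) here
```

With a confident wrong prediction (logit 40, target 0), the naive path computes `log(1 - sigmoid(40))`, where `1 - sigmoid(40)` already rounds to 0 in float64; the fused form returns the exact loss instead.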
3. Implementation
3.1 Preprocessing the raw training data
from datasets import load_dataset, ClassLabel
from transformers import AutoTokenizer
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.d_config import *

def preprocess():
    print("Starting data preprocessing")
    # Load the data; it is stored in a DatasetDict under the key 'train'
    dataset = load_dataset('csv', data_files=str(RAW_DATA_DIR / RAW_DATA_FILE))['train']
    # Remove the unused column, filter out empty rows, and cast the label column to ClassLabel.
    # ClassLabel is the Hugging Face label type for classification tasks;
    # it automatically maps label names to integer indices.
    dataset = (dataset.remove_columns(['cat'])
               .filter(lambda x: x['review'] is not None)
               .cast_column('label', ClassLabel(names=['n', 'p'])))
    # Split into training and test sets, stratified by label
    dataset_dict = dataset.train_test_split(test_size=0.2, stratify_by_column='label')
    # Load the tokenizer that ships with BERT
    tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_DIR)

    # Encoding strategy for the tokenizer
    def encode(example):
        return tokenizer(
            example['review'],
            padding='max_length',
            max_length=SEQ_LEN,
            truncation=True
        )

    # Encode the data in batches (token ids, padding, attention mask) and drop the raw text column
    dataset_dict = dataset_dict.map(encode, batched=True, remove_columns=['review'])
    dataset_dict.save_to_disk(PROCESSED_DATA_DIR)
    print("Data preprocessing finished")

if __name__ == '__main__':
    preprocess()
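To make the tokenizer's `padding='max_length'` and `truncation=True` behavior concrete, here is a pure-Python sketch of what happens to one sequence (the helper is hypothetical; 101, 102 and 0 are the usual [CLS], [SEP] and [PAD] ids in BERT vocabularies, and SEQ_LEN is shortened from the article's 128 for readability):

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
SEQ_LEN = 8  # shortened from 128 for readability

def encode_ids(token_ids, seq_len=SEQ_LEN):
    """Mimic padding='max_length' + truncation=True for a single sequence."""
    # Truncate, leaving room for the two special tokens
    ids = [CLS_ID] + token_ids[: seq_len - 2] + [SEP_ID]
    # 1 for real tokens, 0 for padding
    attention_mask = [1] * len(ids) + [0] * (seq_len - len(ids))
    # Right-pad token ids up to the fixed length
    ids = ids + [PAD_ID] * (seq_len - len(ids))
    return ids, attention_mask

ids, mask = encode_ids([5, 6, 7])
print(ids)   # [101, 5, 6, 7, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every example ends up exactly SEQ_LEN long, which is what lets the DataLoader stack a batch into one rectangular tensor later.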
3.2 Custom DataLoader (DataLoader automatically splits a Dataset into batches, and can also transform the data within each batch)
from datasets import load_from_disk
from torch.utils.data import DataLoader
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.d_config import *

def get_dataloader(train=True):
    # Load the processed dataset from disk
    path = PROCESSED_DATA_DIR / ('train' if train else 'test')
    dataset = load_from_disk(path)
    # Return PyTorch tensors whenever the dataset is indexed
    dataset.set_format(type='torch')
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
    return dataloader
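What `DataLoader(..., shuffle=True)` does each epoch can be mimicked in a few lines of plain Python, which may help if the batching feels like a black box (this toy generator is ours, not PyTorch's implementation):

```python
import random

def simple_dataloader(dataset, batch_size, shuffle=True, seed=None):
    """Toy sketch of DataLoader's per-epoch behavior:
    permute the indices, then yield consecutive slices of batch_size."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

data = list(range(10))
batches = list(simple_dataloader(data, batch_size=4, shuffle=False))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note the last batch is smaller when the dataset size is not a multiple of the batch size; the real DataLoader behaves the same way unless `drop_last=True` is set.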
3.3 Defining the model
import torch.nn as nn
from transformers import AutoModel
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.d_config import *

class ReviewAnalysisModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = AutoModel.from_pretrained(PRE_TRAINED_DIR)
        self.linear = nn.Linear(in_features=self.bert.config.hidden_size, out_features=1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # These arguments come straight from the tokenized Dataset
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        return self.linear(output.pooler_output).squeeze(-1)
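The classification head is just a dot product: `pooler_output` (BERT's processed [CLS] representation) is projected from `hidden_size` values down to one raw logit, and `squeeze(-1)` drops the trailing size-1 dimension so the output matches the shape of the label vector. A plain-Python analogue (hypothetical helper, toy hidden size of 3):

```python
def linear_head(pooled, weight, bias):
    """One output unit: dot(pooled, weight) + bias -> a single raw logit.
    Plain-Python analogue of nn.Linear(hidden_size, 1) followed by squeeze(-1)."""
    return sum(p * w for p, w in zip(pooled, weight)) + bias

pooled = [0.5, -1.0, 2.0]  # stand-in for one pooler_output vector (hidden_size=3)
logit = linear_head(pooled, weight=[1.0, 0.5, 0.25], bias=0.1)
print(logit)  # 0.5 - 0.5 + 0.5 + 0.1 = 0.6
```

The logit stays raw here on purpose: the sigmoid lives inside BCEWithLogitsLoss during training, and is applied explicitly only at evaluation time.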
3.4 Configuration (create the corresponding folders for these paths)
from pathlib import Path
ROOT_DIR = Path(__file__).parent.parent
RAW_DATA_DIR = ROOT_DIR / 'data' / 'raw'
PROCESSED_DATA_DIR = ROOT_DIR / 'data' / 'processed'
PRE_TRAINED_DIR = ROOT_DIR / 'pretrained'
MODEL_DIR = ROOT_DIR / 'models'
LOG_DIR = ROOT_DIR / 'logs'
RAW_DATA_FILE = 'online_shopping_10_cats.csv'
BEST_MODEL = 'best_model.pt'
BERT_MODEL = 'bert-base-chinese'
SEQ_LEN = 128
BATCH_SIZE = 64
LEARN_RATE = 1e-5
EPOCHS = 2
3.5 Defining the training procedure
import time
import torch
from torch import nn, optim
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.b_dataset import get_dataloader
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.c_model import ReviewAnalysisModel
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.d_config import *

def train_one_epoch(model, train_loader, loss, optimizer, device):
    model.train()
    total_loss = 0.0
    for batch in tqdm(train_loader, desc='Training: '):
        inputs = {k: v.to(device) for k, v in batch.items()}
        # BCEWithLogitsLoss expects float targets
        targets = inputs.pop('label').to(dtype=torch.float)
        output = model(**inputs)
        loss_value = loss(output, targets)
        loss_value.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss_value.item()
    return total_loss / len(train_loader)

def train(continue_train=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    train_loader = get_dataloader()
    model = ReviewAnalysisModel().to(device)
    # Optionally resume training from a previously saved checkpoint
    if continue_train:
        model.load_state_dict(torch.load(MODEL_DIR / BEST_MODEL))
    loss = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(lr=LEARN_RATE, params=model.parameters())
    start = time.time()
    min_loss = float('inf')
    with SummaryWriter(log_dir=LOG_DIR / time.strftime('%Y-%m-%d_%H-%M-%S')) as writer:
        for epoch in range(EPOCHS):
            print("=" * 10, f"EPOCH: {epoch + 1}", "=" * 10)
            time.sleep(0.1)  # let the print flush before the next tqdm bar starts
            this_loss = train_one_epoch(model=model, train_loader=train_loader, loss=loss,
                                        optimizer=optimizer, device=device)
            print("this loss: ", this_loss)
            writer.add_scalar('loss', this_loss, epoch + 1)
            # Keep the checkpoint with the lowest training loss
            if this_loss < min_loss:
                min_loss = this_loss
                torch.save(model.state_dict(), MODEL_DIR / BEST_MODEL)
                print("Model checkpoint saved!")
    print("time elapsed: ", time.time() - start, "s")

if __name__ == '__main__':
    train()
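The forward / loss / backward / step cycle above can be seen end to end on a toy problem. The sketch below trains a single logistic "neuron" on one example with plain gradient descent (Adam and batching omitted for brevity; all names are ours) and shows the loss shrinking, which is exactly what the epoch-level loss tracking in `train()` monitors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w, b = 0.0, 0.0        # parameters
x, target = 2.0, 1.0   # one training example
lr = 0.5
losses = []
for _ in range(20):
    logit = w * x + b                                            # forward
    p = sigmoid(logit)
    losses.append(-math.log(p) if target == 1.0 else -math.log(1 - p))
    grad = p - target           # d(loss)/d(logit) for binary cross-entropy
    w -= lr * grad * x          # backward + optimizer step, fused by hand
    b -= lr * grad
print(losses[0], losses[-1])    # loss shrinks as w, b fit the example
```

The neat `p - target` gradient is a property of pairing sigmoid with cross-entropy, and is one more reason the fused BCEWithLogitsLoss is the standard choice for this setup.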
3.6 Model evaluation
import torch
from tqdm import tqdm
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.b_dataset import get_dataloader
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.c_model import ReviewAnalysisModel
from nlp_tutorial.ch07_pretrain_model.bert_emotion_analysis.src.d_config import *

def predict_batch(model, inputs):
    model.eval()
    with torch.no_grad():
        output = model(**inputs)
        # Map raw logits to probabilities
        output = torch.sigmoid(output)
    return output.tolist()

def evaluate(model, dataloader, device):
    correct_count = 0.0
    total_count = 0.0
    for batch in tqdm(dataloader, desc='Evaluating: '):
        inputs = {k: v.to(device) for k, v in batch.items()}
        targets = inputs.pop('label').tolist()
        batch_results = predict_batch(model, inputs)
        for target, result in zip(targets, batch_results):
            total_count += 1
            # Threshold the probability at 0.5
            result = 1 if result > 0.5 else 0
            if result == target:
                correct_count += 1.0
    return correct_count / total_count

def run_evaluate():
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = ReviewAnalysisModel().to(device)
    model.load_state_dict(torch.load(MODEL_DIR / BEST_MODEL))
    print("Model loaded!")
    test_dataloader = get_dataloader(train=False)
    acc = evaluate(model, test_dataloader, device)
    print("Evaluation result:")
    print("Accuracy: ", acc)

if __name__ == '__main__':
    run_evaluate()
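The decision rule inside `evaluate()` reduces to a few lines that are easy to check in isolation (the helper and the probabilities below are illustrative, not real model outputs):

```python
def accuracy(probs, targets, threshold=0.5):
    """Same rule as evaluate(): sigmoid output above the threshold -> class 1,
    then count the fraction of predictions that match the labels."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for pred, t in zip(preds, targets) if pred == t)
    return correct / len(targets)

probs = [0.9, 0.2, 0.6, 0.4]   # hypothetical sigmoid outputs for one batch
targets = [1, 0, 0, 0]
print(accuracy(probs, targets))  # 3 of 4 correct -> 0.75
```

Since the dataset split was stratified, plain accuracy is a reasonable single metric here; with a heavy class imbalance, precision/recall would be worth reporting alongside it.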
4. Running the data preprocessing code

5. Running the training code


6. Running the evaluation code

[Copyright Notice] This article is original content by a Huawei Cloud community user and may not be reproduced without permission; to reproduce it, please contact the original author for authorization. If you find content in this community suspected of plagiarism, please report it by email with supporting evidence; once verified, the community will immediately remove the infringing content. Report email:
cloudbbs@huaweicloud.com