[NLP] (task7) Sequence labeling with Transformers

野猪佩奇996, posted 2022/01/23 02:07:45

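Study summary

(1) Recap: fine-tuning BERT for a new downstream task takes 5 steps:
1) Prepare the raw text data
2) Convert the raw text into a BERT-compatible input format (key step, as shown in the figure below)
3) Add new layers on top of BERT to form the downstream-task model (key step)
4) Train the downstream-task model
5) Run inference on new samples
With HuggingFace, we add dropout and a linear classifier on top of BERT and output the logits used to predict the class (this is the idea of transfer learning).

(2) This session focuses on named entity recognition (NER), one of the sequence labeling tasks (others include POS tagging and chunking). Traditional neural NER models work at word granularity. When a pretrained language model such as BERT is used for sequence labeling, the input is usually split by a finer-grained tokenizer (such as WordPiece), which breaks the one-to-one correspondence between words and sequence labels.

(3) Solving sequence labeling with a BERT model (predicting a label for every token of the text):

In the data loading stage, we use the CONLL 2003 dataset and look at the entity classes and how they are represented;
In the preprocessing stage, we build the tokenizer, align subtokens, words and labels, and preprocess every sample in the dataset;
In the fine-tuning stage, we set the model parameters, set up the seqeval evaluation method (which computes the NER metrics), build the Trainer, train the model, and compare precision, recall and f1.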

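The Jupyter notebook for this article is in the chapter 4 code repository. If you are opening this notebook in Google Colab, you may need to install the Transformers and 🤗 Datasets libraries. Run the following command to install them.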

!pip install datasets transformers seqeval

  
 
If you are running this notebook locally, make sure the dependencies above are installed. A multi-GPU distributed training version of this notebook is also available.

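The model architecture in this section is essentially the same as the BERT covered in the previous chapter; what is new is the task-specific data processing and model training.

Task: sequence labeling (token-level classification)

Sequence labeling can usually be viewed as token-level classification: classify every token.
In this notebook, we show how to use the transformer models in 🤗 Transformers for token-level classification. Token-level classification means predicting a label for every token in a text. The figure below shows an NER (named entity recognition) task.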

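1. The most common token-level classification tasks:

NER (named-entity recognition): identify the named entities in a text (person, organization, location, ...).
POS (part-of-speech tagging): tag each token with its grammatical part of speech (noun, verb, adjective, ...), as shown in the figure below (figure from Hung-yi Lee's deep learning course slides).
Chunk (chunking): group the tokens that belong to the same phrase.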
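
To make "one label per token" concrete, the small sketch below lines up one sentence with its POS, chunk and NER tags. The values are copied from the random-sample table shown later in this article (the row with id 3923); the printing code is only illustrative.

```python
# One CONLL 2003 sentence with its three label sequences.
tokens = ["Ghent", "3", "Aalst", "2"]
pos_tags = ["NN", "CD", "NNP", "CD"]
chunk_tags = ["B-NP", "I-NP", "I-NP", "I-NP"]
ner_tags = ["B-ORG", "O", "B-ORG", "O"]

# Every task assigns exactly one label to every token.
assert len(tokens) == len(pos_tags) == len(chunk_tags) == len(ner_tags)

rows = list(zip(tokens, pos_tags, chunk_tags, ner_tags))
print(f"{'token':<8}{'pos':<6}{'chunk':<7}{'ner':<6}")
for token, pos, chunk, ner in rows:
    print(f"{token:<8}{pos:<6}{chunk:<7}{ner:<6}")
```

The three tasks differ only in the label set; the input tokens and the "one label per position" shape stay the same.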

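2. Initial setup

For the sequence labeling tasks above, we show how to load a dataset with the Datasets library and fine-tune a pretrained model with the Trainer API from Transformers.

As long as a pretrained transformer model has a token classification layer on top (such as the BertForTokenClassification mentioned in the previous chapter), this notebook can in principle use any of the many transformer models (see the model hub) to solve any token-level classification task. Because of newer tokenizer features in the Transformers library, the model may also need to provide a fast tokenizer (see the table of supported models).

If your task is different, most likely only small changes are needed to use this notebook.
Adjust the fine-tuning batch size to your GPU memory to avoid out-of-memory errors.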

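# Choose the classification task
task = "ner"  # must be "ner", "pos" or "chunk"
# Choose the pretrained checkpoint
model_checkpoint = "distilbert-base-uncased"
# Adjust batch_size to your GPU to avoid running out of memory
batch_size = 16

The code above selects the distilbert-base-uncased pretrained model; for more about this model, see its page on the official site: distilbert-base-uncased.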

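I. Load the data

We will use the 🤗 Datasets library to load the data and the corresponding evaluation metric. Both need only a simple call: load_dataset and load_metric. Note that load_metric was deprecated and later removed from 🤗 Datasets; with recent versions the same metric can be loaded with evaluate.load from the separate evaluate library.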

from datasets import load_dataset, load_metric

  
 
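The example in this notebook uses the CONLL 2003 dataset. The notebook should work with any token classification dataset in the 🤗 Datasets library.
If you use a custom dataset stored in json/csv files, see the Datasets documentation for how to load it. A custom dataset may need some adjustments to the column names used.

1.1 Load the data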

datasets = load_dataset("conll2003")

  
 
Output after loading (this experiment was run on Colab):

Downloading:
9.52k/? [00:00<00:00, 191kB/s]
Downloading:
4.18k/? [00:00<00:00, 73.5kB/s]
Downloading and preparing dataset conll2003/conll2003 (download: 4.63 MiB, generated: 9.78 MiB, post-processed: Unknown size, total: 14.41 MiB) to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6...
Downloading:
3.28M/? [00:00<00:00, 14.8MB/s]
Downloading:
827k/? [00:00<00:00, 10.0MB/s]
Downloading:
748k/? [00:00<00:00, 9.26MB/s]
Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6. Subsequent calls will reuse this data.

  
 
The datasets object is a DatasetDict. For the training, validation and test sets, use the corresponding key (train, validation, test) to get the data.

1.2 Inspect the data

datasets

  
 
Result:

    DatasetDict({
        train: Dataset({
            features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
            num_rows: 14041
        })
        validation: Dataset({
            features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
            num_rows: 3250
        })
        test: Dataset({
            features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
            num_rows: 3453
        })
    })

  
 
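In the training, validation and test sets alike, datasets contains a column named tokens (generally the text split into words) and three label columns (pos_tags, chunk_tags and ner_tags) that hold the annotations for those tokens.

Given a split key (train, validation or test) and an index, you can look at a sample.

# Look at the training sample at index 0 (the first sample)
datasets["train"][0]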

  
 
Result:

    {'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
     'id': '0',
     'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
     'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
     'tokens': ['EU',
      'rejects',
      'German',
      'call',
      'to',
      'boycott',
      'British',
      'lamb',
      '.']}

  
 
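All labels are already encoded as integers, so the pretrained transformer model can use them directly. The class names these integers stand for are stored in features.

# Labels are encoded as integers; the features attribute shows the mapping
datasets["train"].features["ner_tags"]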

  
 
Result:

Sequence(feature=ClassLabel(num_classes=9, 
names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], 
names_file=None, id=None), length=-1, id=None)

  
 
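Taking NER as an example, 0 corresponds to the label "O" (no special entity), 1 corresponds to "B-PER" (the first token of a person entity), and so on.
This dataset has 4 entity classes (PER, ORG, LOC, MISC), and each class comes with a B- prefix (the token that begins an entity) and an I- prefix (a token inside an entity).

What the labels mean:

'PER': person
'ORG': organization
'LOC': location
'MISC': miscellaneous
O: no special entity
B-: the token that begins an entity
I-: a token inside an entity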
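
As a rough illustration of how the B-/I-/O scheme encodes entities, the sketch below decodes the integer ner_tags of the first training sample shown above and groups them into entity spans. The helper extract_entities is hypothetical (it is not part of 🤗 Datasets), and it simply drops a stray I- tag that does not follow a matching B- tag.

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def extract_entities(tokens, tag_ids):
    """Group B-/I- tags into (entity_type, text) spans. Hypothetical helper."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag, in this simplified sketch) closes any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
entities = extract_entities(tokens, ner_tags)
print(entities)  # [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```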

label_list = datasets["train"].features[f"{task}_tags"].feature.names
label_list

  
 
Result:

    ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

  
 
1.3 Look at 10 random samples

To get a better feel for the data, the function below picks a few random examples from the dataset and displays them.

from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

  
 

show_random_elements(datasets["train"])

  
 
The result is shown in the table below:

	id	tokens	pos_tags	chunk_tags	ner_tags
0	2227	[Result, of, a, French, first, division, match, on, Friday, .]	[NN, IN, DT, JJ, JJ, NN, NN, IN, NNP, .]	[B-NP, B-PP, B-NP, I-NP, I-NP, I-NP, I-NP, B-PP, B-NP, O]	[O, O, O, B-MISC, O, O, O, O, O, O]
1	2615	[Mid-tier, golds, up, in, heavy, trading, .]	[NN, NNS, IN, IN, JJ, NN, .]	[B-NP, I-NP, B-PP, B-PP, B-NP, I-NP, O]	[O, O, O, O, O, O, O]
2	10256	[Neagle, (, 14-6, ), beat, the, Braves, for, the, third, time, this, season, ,, allowing, two, runs, and, six, hits, in, eight, innings, .]	[NNP, (, CD, ), VB, DT, NNPS, IN, DT, JJ, NN, DT, NN, ,, VBG, CD, NNS, CC, CD, NNS, IN, CD, NN, .]	[B-NP, O, B-NP, O, B-VP, B-NP, I-NP, B-PP, B-NP, I-NP, I-NP, B-NP, I-NP, O, B-VP, B-NP, I-NP, O, B-NP, I-NP, B-PP, B-NP, I-NP, O]	[B-PER, O, O, O, O, O, B-ORG, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]
3	10720	[Hansa, Rostock, 4, 1, 2, 1, 5, 4, 5]	[NNP, NNP, CD, CD, CD, CD, CD, CD, CD]	[B-NP, I-NP, I-NP, I-NP, I-NP, I-NP, I-NP, I-NP, I-NP]	[B-ORG, I-ORG, O, O, O, O, O, O, O]
4	7125	[MONTREAL, 70, 59, .543, 11]	[NNP, CD, CD, CD, CD]	[B-NP, I-NP, I-NP, I-NP, I-NP]	[B-ORG, O, O, O, O]
5	3316	[Softbank, Corp, said, on, Friday, that, it, would, procure, $, 900, million, through, the, foreign, exchange, market, by, September, 5, as, part, of, its, acquisition, of, U.S., firm, ,, Kingston, Technology, Co, .]	[NNP, NNP, VBD, IN, NNP, IN, PRP, MD, NN, $, CD, CD, IN, DT, JJ, NN, NN, IN, NNP, CD, IN, NN, IN, PRP$, NN, IN, NNP, NN, ,, NNP, NNP, NNP, .]	[B-NP, I-NP, B-VP, B-PP, B-NP, B-SBAR, B-NP, B-VP, B-NP, I-NP, I-NP, I-NP, B-PP, B-NP, I-NP, I-NP, I-NP, B-PP, B-NP, I-NP, B-PP, B-NP, B-PP, B-NP, I-NP, B-PP, B-NP, I-NP, O, B-NP, I-NP, I-NP, O]	[B-ORG, I-ORG, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-LOC, O, O, B-ORG, I-ORG, I-ORG, O]
6	3923	[Ghent, 3, Aalst, 2]	[NN, CD, NNP, CD]	[B-NP, I-NP, I-NP, I-NP]	[B-ORG, O, B-ORG, O]
7	2776	[The, separatists, ,, who, swept, into, Grozny, on, August, 6, ,, still, control, large, areas, of, the, centre, of, town, ,, and, Russian, soldiers, are, based, at, checkpoints, on, the, approach, roads, .]	[DT, NNS, ,, WP, VBD, IN, NNP, IN, NNP, CD, ,, RB, VBP, JJ, NNS, IN, DT, NN, IN, NN, ,, CC, JJ, NNS, VBP, VBN, IN, NNS, IN, DT, NN, NNS, .]	[B-NP, I-NP, O, B-NP, B-VP, B-PP, B-NP, B-PP, B-NP, I-NP, O, B-ADVP, B-VP, B-NP, I-NP, B-PP, B-NP, I-NP, B-PP, B-NP, O, O, B-NP, I-NP, B-VP, I-VP, B-PP, B-NP, B-PP, B-NP, I-NP, I-NP, O]	[O, O, O, O, O, O, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-MISC, O, O, O, O, O, O, O, O, O, O]
8	1178	[Doctor, Masserigne, Ndiaye, said, medical, staff, were, overwhelmed, with, work, ., "]	[NNP, NNP, NNP, VBD, JJ, NN, VBD, VBN, IN, NN, ., "]	[B-NP, I-NP, I-NP, B-VP, B-NP, I-NP, B-VP, I-VP, B-PP, B-NP, O, O]	[O, B-PER, I-PER, O, O, O, O, O, O, O, O, O]
9	10988	[Reuters, historical, calendar, -, September, 4, .]	[NNP, JJ, NN, :, NNP, CD, .]	[B-NP, I-NP, I-NP, O, B-NP, I-NP, O]	[B-ORG, O, O, O, O, O, O]

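II. Preprocess the data

Before feeding the data to the model, we need to preprocess it. The preprocessing tool is the Tokenizer. It first tokenizes the input, then converts the tokens into the token IDs the pretrained model expects, and finally produces the input format the model needs.

To do this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

We get the tokenizer that matches the pretrained model.
When we use the tokenizer for the given model checkpoint, we also download the vocabulary the model needs, more precisely the tokens vocabulary.

The downloaded tokens vocabulary is cached, so it is not downloaded again the next time it is used.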
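
The three tokenizer steps (split into subtokens, map to IDs, build the model input) can be sketched with a toy WordPiece-style tokenizer. This is only an illustration under stated assumptions: the vocabulary and the functions wordpiece and encode are made up for this example, whereas the real distilbert-base-uncased tokenizer has a vocabulary of about 30,000 entries and is implemented in the tokenizers library.

```python
# Toy vocabulary, invented for this sketch.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "sheep": 3, "##me": 4, "##at": 5, "from": 6, "buy": 7}

def wordpiece(word):
    """Greedy longest-match-first split of one lowercased word into subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(words):
    """Tokenize, add special tokens, map to ids and build the attention mask."""
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word.lower()))
    tokens.append("[SEP]")
    input_ids = [vocab[t] for t in tokens]
    return {"tokens": tokens, "input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = encode(["buy", "sheepmeat", "from"])
print(encoded["tokens"])
print(encoded["input_ids"])
```

Three input words become five subtokens plus two special tokens, which is exactly why the word-level labels need to be realigned later.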

2.1 Build the tokenizer for the model

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

  
 
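Note: the code below requires the tokenizer to be a transformers.PreTrainedTokenizerFast, because preprocessing relies on features that only fast tokenizers provide (such as fast multi-threaded tokenization and the word_ids mapping from subtokens back to words used below).

Almost every model has a corresponding fast tokenizer. The table of models in the documentation lists the tokenizer features available for each pretrained model.

import transformers
# Make sure we got a fast tokenizer
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)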

  
 
Check the big table of models to see whether a model has a fast tokenizer.

The tokenizer can preprocess a single text or a pair of texts, and its output matches the input format the pretrained model expects.

tokenizer("Hello, this is one sentence!")

  
 

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

  
 

tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)

  
 

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 3975, 2046, 2616, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

  
 
Note: transformer models are usually pretrained on subwords, so even if our input text is already split into words, the tokenizer will split those words further. For example:

example = datasets["train"][4]
print(example["tokens"])

  
 

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']

  
 

tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

  
 
The tokenizer splits the words further:

    ['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 
    'union', "'", 's', 'veterinary', 'committee', 'werner', 
    'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 
    'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 
    'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']

  
 
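The output above shows that the words "Zwingmann" and "sheepmeat" were each split into 3 subtokens: 'z', '##wing', '##mann' and 'sheep', '##me', '##at'.

Because annotations are usually made at the word level and words get split into subtokens, we also need to align the labels with the subtokens (see the example above).
The input format of the pretrained model also requires some special tokens, such as [CLS] and [SEP].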

len(example[f"{task}_tags"]), len(tokenized_input["input_ids"])

  
 
2.2 Solve the subtoken alignment problem

The tokenizer output has a word_ids method that helps solve the alignment problem described in 2.1.

# Use word_ids to solve the subtoken alignment problem
print(tokenized_input.word_ids())

  
 

[None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]

  
 
As we can see, word_ids maps each subtoken position to the index of a word. For example, position 1 maps to word 0, and positions 2 and 3 map to word 1. Special tokens map to None. With this list we can align the subtokens, the words and the labels.

# Get the word index of every subtoken position
word_ids = tokenized_input.word_ids()

# Align subtokens, words and labels
aligned_labels = [
    -100 if i is None else example[f"{task}_tags"][i] for i in word_ids
]

print(len(aligned_labels), len(tokenized_input["input_ids"]))

  
 
We usually set the label of special tokens to -100; positions labeled -100 are ignored by the model when the loss is computed.
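
The following is a minimal sketch of what "ignored in the loss" means, assuming the usual token-level cross-entropy (PyTorch's cross-entropy loss uses ignore_index=-100 by default). The function token_cross_entropy and the toy logits are made up for illustration; the real loss is computed inside the model.

```python
import math

IGNORE_INDEX = -100

def token_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # special tokens such as [CLS]/[SEP] add nothing to the loss
        # Negative log-softmax of the gold class, computed stably.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0

# Three positions, three classes; the first and last positions are special tokens.
logits = [[9.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 9.0]]
labels = [-100, 1, -100]
loss = token_cross_entropy(logits, labels)
print(round(loss, 4))
```

Whatever the model predicts at the positions labeled -100, the loss stays the same, so only real tokens drive the gradient.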

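Two ways to align the labels:

All subtokens of a word get the word's label.
Only the first subtoken of a word gets the word's label; the other subtokens get -100.

Both are supported; switch between them with the label_all_tokens flag (True selects the first way).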

label_all_tokens = True
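
To see the difference between the two strategies on a tiny input, here is a self-contained sketch. The word_ids list and labels are a hand-written excerpt modeled on "Werner Zwingmann said" from the example above, and align_labels is an illustrative helper rather than a library function.

```python
def align_labels(word_ids, word_labels, label_all_tokens):
    """Spread word-level labels over subtokens using the word index of each subtoken."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special token
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])  # first subtoken of a word
        else:
            # Remaining subtokens: copy the label or mask them out.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned

# Subtokens: [CLS] werner z ##wing ##mann said [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)   # [-100, 1, 2, 2, 2, 0, -100]
print(first_only)   # [-100, 1, 2, -100, -100, 0, -100]
```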

  
 
2.3 Put the preprocessing together

Finally, we combine everything into one preprocessing function. is_split_into_words=True was introduced above.

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        # Word index of every subtoken in the i-th sample
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are
            # automatically ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100,
            # depending on the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        # Aligned labels for this sample
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

  
 
The preprocessing function above works on one sample or on several samples (examples). For several samples, it returns a list of preprocessed results for each key.

tokenize_and_align_labels(datasets['train'][:5])

  
 
This gives the input_ids, attention_mask and labels of the data (these three correspond to the figure below):

    {'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], [101, 2848, 13934, 102], [101, 9371, 2727, 1011, 5511, 1011, 2570, 102], [101, 1996, 2647, 3222, 2056, 2006, 9432, 2009, 18335, 2007, 2446, 6040, 2000, 10390, 2000, 18454, 2078, 2329, 12559, 2127, 6529, 5646, 3251, 5506, 11190, 4295, 2064, 2022, 11860, 2000, 8351, 1012, 102], [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 
    'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100], [-100, 1, 2, -100], [-100, 5, 0, 0, 0, 0, 0, -100], [-100, 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100], [-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]]}

  
 
2.4 Preprocess all samples in datasets

Next we preprocess every sample in datasets by using the map function to apply the preprocessing function tokenize_and_align_labels to all samples.

tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

  
 
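Even better, the results are cached automatically, so they are not recomputed the next time (but be careful: if the inputs have changed, the cache may get in the way).

The Datasets library checks the arguments passed to map to see whether anything has changed: if nothing changed it uses the cached data, otherwise it processes the data again. If the arguments are unchanged but you still want fresh results, bypass the cache by passing load_from_cache_file=False to map.
The batched=True argument used above passes the samples to the function in batches, which lets the fast tokenizer process the texts of a batch concurrently with multi-threading.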
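
As a rough picture of what batched=True changes, the sketch below mimics a batched map over a plain dict of columns: the function receives a dict of lists (one list per column) instead of a single sample. map_batched and count_tokens are toy stand-ins written for this example, not the 🤗 Datasets implementation, and unlike the real map this sketch returns only the newly produced columns.

```python
def map_batched(columns, function, batch_size=1000):
    """Toy stand-in for Dataset.map(function, batched=True) on a dict of columns."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        # The function receives a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        result = function(batch)
        for name, values in result.items():
            output.setdefault(name, []).extend(values)
    return output

def count_tokens(examples):
    # Works on a whole batch at once, like tokenize_and_align_labels does.
    return {"num_tokens": [len(tokens) for tokens in examples["tokens"]]}

columns = {"tokens": [["EU", "rejects"], ["Peter", "Blackburn"], ["BRUSSELS", "1996-08-22"]]}
result = map_batched(columns, count_tokens, batch_size=2)
print(result)  # {'num_tokens': [2, 2, 2]}
```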

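III. Fine-tune the pretrained model

Now that the data is ready, we download and load the pretrained model and then fine-tune it. Since this is a token classification task, we need a model class that can solve it.

We use the AutoModelForTokenClassification class. As with the tokenizer, the from_pretrained method downloads and loads the model and caches it, so the model is not downloaded again.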

3.1 Load the model

# Use the number of labels to size the classification head, and load the pretrained model
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

  
 

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

  
 
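Because we are fine-tuning on a token classification task while loading a pretrained language model, the warning tells us that some weights that do not match were discarded when loading (the language modeling head of the pretrained model was dropped) and that others were newly initialized at random (the token classification head).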
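
The newly initialized head is conceptually just a linear layer applied to every token vector. The sketch below shows that idea with plain Python lists; the hidden size of 8, the standard deviation of 0.02 and the random "encoder output" are assumptions chosen for illustration (DistilBERT's real hidden size is 768, and the real head lives inside DistilBertForTokenClassification).

```python
import random

random.seed(0)
hidden_size, num_labels = 8, 9  # toy hidden size; 9 labels as in label_list

# Newly initialized head: a weight matrix and a bias, analogous to
# classifier.weight and classifier.bias in the warning message.
weight = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

def classify_tokens(hidden_states):
    """Apply the linear head to every token vector and return per-token logits."""
    return [
        [sum(w * h for w, h in zip(row, vector)) + b for row, b in zip(weight, bias)]
        for vector in hidden_states
    ]

# Stand-in encoder output for a 5-token sequence (random here; the pretrained encoder in practice).
hidden_states = [[random.gauss(0.0, 1.0) for _ in range(hidden_size)] for _ in range(5)]
logits = classify_tokens(hidden_states)
print(len(logits), len(logits[0]))  # 5 9
```

Since these weights start out random, the logits carry no information until fine-tuning, which is what the last line of the warning is pointing out.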

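3.2 Set the training arguments

To get a Trainer, we need 3 more things, the most important being the training configuration, TrainingArguments. It holds all the attributes that define the training process.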

args = TrainingArguments(
    f"test-{task}",
    # Run evaluation on the validation set once per epoch
    evaluation_strategy="epoch",
    # Initial learning rate
    learning_rate=2e-5,
    # Training batch size per device
    per_device_train_batch_size=batch_size,
    # Evaluation batch size per device
    per_device_eval_batch_size=batch_size,
    # Number of training epochs
    num_train_epochs=3,
    weight_decay=0.01,
)

  
 
The evaluation_strategy = "epoch" argument above tells the training code to run evaluation on the validation set once per epoch.
The batch_size was defined earlier in this notebook.
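
For a sense of scale, the arithmetic below estimates how many optimizer steps these settings imply. It assumes a single GPU, no gradient accumulation and that the last partial batch is kept; the row count comes from the train split shown earlier.

```python
import math

num_train_rows = 14041   # size of the train split shown earlier
batch_size = 16
num_train_epochs = 3
num_devices = 1          # assumption: a single GPU, no gradient accumulation

steps_per_epoch = math.ceil(num_train_rows / (batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 878 2634
```

With evaluation_strategy="epoch", evaluation then runs 3 times, once after each block of 878 steps.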

Finally, we need a data collator that batches our processed samples and feeds them to the model.

from transformers import DataCollatorForTokenClassification

# The data collator batches the processed samples and feeds them to the model
data_collator = DataCollatorForTokenClassification(tokenizer)
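
Roughly speaking, the collator pads every sample in a batch to the same length, and pads the labels with -100 so the padded positions are ignored by the loss. The sketch below imitates that with plain lists; pad_batch is a hypothetical helper, the pad token id 0 is the [PAD] id assumed for this checkpoint, and the real collator returns tensors rather than lists. The two samples are taken from the preprocessed output shown earlier.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every sample in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad)
        # Padded positions get -100 so the loss ignores them.
        batch["labels"].append(f["labels"] + [label_pad_id] * pad)
    return batch

features = [
    {"input_ids": [101, 2848, 13934, 102], "attention_mask": [1, 1, 1, 1],
     "labels": [-100, 1, 2, -100]},
    {"input_ids": [101, 9371, 2727, 1011, 5511, 1011, 2570, 102], "attention_mask": [1] * 8,
     "labels": [-100, 5, 0, 0, 0, 0, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])  # [101, 2848, 13934, 102, 0, 0, 0, 0]
print(batch["labels"][0])     # [-100, 1, 2, -100, -100, -100, -100, -100]
```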

  
 
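3.3 Set up the evaluation method

The last thing the Trainer needs is an evaluation method. We use the seqeval metric. Before passing the model predictions to the metric, we also do some post-processing: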

metric = load_metric("seqeval")

  
 
The metric takes a list of predictions and a list of labels:

labels = [label_list[i] for i in example[f"{task}_tags"]]
metric.compute(predictions=[labels], references=[labels])

  
 

{'LOC': {'f1': 1.0, 'number': 2, 'precision': 1.0, 'recall': 1.0},
 'ORG': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'PER': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'overall_accuracy': 1.0,
 'overall_f1': 1.0,
 'overall_precision': 1.0,
 'overall_recall': 1.0}
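
Note that the 'number' fields above count entities, not tokens: seqeval scores precision, recall and F1 on whole entity spans. The following simplified sketch reproduces that idea in plain Python. It is not the seqeval implementation; get_entities and entity_scores are illustrative helpers, and a stray I- tag without a preceding B- tag is simply ignored here.

```python
def get_entities(tags):
    """Return the set of (type, start, end) spans encoded by a BIO tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        inside = tag.startswith("I-") and current == tag[2:]
        if current is not None and not inside:
            spans.add((current, start, i - 1))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans

def entity_scores(references, predictions):
    """Entity-level precision, recall and F1 over a list of sentences."""
    true_spans, pred_spans = set(), set()
    for idx, (ref, pred) in enumerate(zip(references, predictions)):
        true_spans |= {(idx,) + span for span in get_entities(ref)}
        pred_spans |= {(idx,) + span for span in get_entities(pred)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = [["B-PER", "I-PER", "O", "B-LOC"]]
prediction = [["B-PER", "O", "O", "B-LOC"]]  # the person span is cut short
scores = entity_scores(reference, prediction)
print(scores)  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Three of the four tokens are tagged correctly, yet only one of the two entities matches exactly, so the entity-level scores are 0.5 while token accuracy would be 0.75.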

  
 
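Post-process the model predictions:

Take the index with the highest predicted probability
Convert the index to a label
Skip the positions labeled -100

The function below combines these steps.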

import numpy as np

def compute_metrics(p):
    predictions, labels = p
    # Take the index of the highest-scoring class for every token
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens):
    # convert indices to labels and skip the positions labeled -100
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

  
 
We report the overall precision/recall/f1 across all classes, so the per-class precision/recall/f1 values are discarded.

3.4 Train the model

Pass the data, model and arguments to the Trainer:

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

  
 
Call the train method to start training:

trainer.train()

  
 
We can then use the evaluate method to evaluate the model; it can also evaluate other datasets.

3.5 Evaluate the model

trainer.evaluate()

  
 
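3.6 Per-class precision, recall and F1

(1) Precision of the positive class is the fraction of samples predicted positive that are actually positive: $\text { Precision }=\frac{t p}{t p+f p}$
(2) Recall of the positive class is the fraction of truly positive samples that were predicted positive: $\text { Recall }=\frac{t p}{t p+f n}$
(3) Accuracy is the fraction of all samples that were predicted correctly. The denominator is the total number of samples, and the definition extends easily to multiple classes. Generally, the higher the accuracy, the better the classifier: $\text { Accuracy }=\frac{t p+t n}{t p+t n+f p+f n}$
(4) The F1 score is the harmonic mean of precision and recall: $\frac{1}{\text { F1 }}=\frac{1}{2}\left(\frac{1}{\text { Precision }}+\frac{1}{\text { Recall }}\right)$, that is, $\text { F1 }=\frac{2 \cdot \text { Precision } \cdot \text { Recall }}{\text { Precision }+\text { Recall }}$
Like precision and recall, the F1 score is defined with respect to one class. It is generally used as a single metric that combines precision and recall. There are also variants of the F1 score that add a weight coefficient so that recall and precision can be weighted differently as needed.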
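
As a quick check of the F1 formula, the sketch below recomputes F1 from the LOC precision and recall reported in the results at the end of this section. The functions f1_score and f_beta are illustrative helpers; f_beta is the standard weighted variant, shown here as one example of the "weight coefficient" mentioned above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_beta(precision, recall, beta):
    """Weighted variant: beta > 1 favors recall, beta < 1 favors precision."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

loc_precision, loc_recall = 0.949718574108818, 0.966768525592055
loc_f1 = f1_score(loc_precision, loc_recall)
print(loc_f1)  # close to the reported 0.9581677077418134
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot get a high F1 by maximizing only precision or only recall.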

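Summary: recall and precision are not tied together by definition, but in practice they constrain each other, so you need to find a balance point based on your actual needs.

When we ask a retrieval system for all the details of some event (by entering a search query), recall measures how many of those details the system can "remember", that is, its "ability to recall". The number of details it can recall divided by the number of details the system knows about the event is the recall. Put simply, it measures how complete the retrieval is.

Now, to get the per-class precision, recall and f1, we feed the predictions to the same metric: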

predictions, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

  
 
Result:

    {'LOC': {'precision': 0.949718574108818,
      'recall': 0.966768525592055,
      'f1': 0.9581677077418134,
      'number': 2618},
     'MISC': {'precision': 0.8132387706855791,
      'recall': 0.8383428107229894,
      'f1': 0.8255999999999999,
      'number': 1231},
     'ORG': {'precision': 0.9055232558139535,
      'recall': 0.9090466926070039,
      'f1': 0.9072815533980583,
      'number': 2056},
     'PER': {'precision': 0.9759552042160737,
      'recall': 0.9765985497692815,
      'f1': 0.9762767710049424,
      'number': 3034},
     'overall_precision': 0.9292672127518264,
     'overall_recall': 0.9391430808815304,
     'overall_f1': 0.9341790463472988,
     'overall_accuracy': 0.9842565968195466}

  
 
Finally, remember to check how to upload your model to the 🤗 Model Hub. You can then use your own uploaded model by name, just as at the start of this notebook.

Reference

(1) datawhale course
(2) BERT usage in detail (hands-on): https://www.jianshu.com/p/bfd0148b292e
(3) HuggingFace official site: https://huggingface.co/transformers/preprocessing.html
(4) A hands-on guide to Pytorch-Transformers: partial source code walkthrough and notes (part 1)
(5) Attack on BERT: https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html
(6) distilbert-base-uncased: https://huggingface.co/distilbert-base-uncased

Source: andyguo.blog.csdn.net, author: 山顶夕景. Copyright belongs to the original author; please contact the author before reposting.

Original link: andyguo.blog.csdn.net/article/details/119950655
