Practical Natural Language Processing: Named Entity Recognition (Part 1)

HWCloudAI · Published 2022/12/19 11:47:03

Practical Natural Language Processing: Named Entity Recognition

BERT (Bidirectional Encoder Representations from Transformers) was released by Google in October 2018. It achieved striking results on SQuAD 1.1, a top-tier machine reading comprehension benchmark, surpassing human performance on both of its evaluation metrics, and set new state-of-the-art results on 11 different NLP tasks, including pushing the GLUE benchmark to 80.4% (an absolute improvement of 7.6%) and reaching 86.7% accuracy on MultiNLI (an absolute improvement of 5.6%). BERT is widely regarded as the start of a new era for NLP: the field finally had a way to do transfer learning the way computer vision does. Anyone who needs to build a language processing model can use this powerful pre-trained model as an off-the-shelf component, saving the time, effort, expertise, and compute required to train a model from scratch. Specifically, BERT can be used for the following NLP tasks:

  • Question answering
  • Named entity recognition
  • Document clustering
  • Email filtering and classification
  • Sentiment analysis

This case study walks you through named entity recognition with the BERT model.

Notes:

  1. Framework used in this case: TensorFlow-1.13.1

  2. Hardware specification used in this case: 8 vCPU + 64 GiB + 1 x Tesla V100-PCIE-32GB

  3. Switching hardware specifications: if needed, you can switch the specification in the workspace panel on the right side of this page

  4. Running the code: click the triangular run button in the menu bar at the top of this page, or press Ctrl+Enter, to run the code in each cell

  5. Detailed usage of JupyterLab: see the "ModelArts JupyterLab User Guide"

  6. Troubleshooting: see the "ModelArts JupyterLab FAQ"

1. Prepare the source code and data

The source code and data required for this case have been saved in OBS; we download them to the local environment with MoXing.

import os
import subprocess
import moxing as mox

print('Downloading datasets and code ...')
if not os.path.exists('./ner'):
    # Copy the archive from OBS, extract it locally, then remove the archive
    mox.file.copy('obs://modelarts-labs-bj4/notebook/DL_nlp_ner/ner.tar.gz', './ner.tar.gz')
    p1 = subprocess.run(['tar xf ./ner.tar.gz;rm ./ner.tar.gz'], stdout=subprocess.PIPE, shell=True, check=True)
    if os.path.exists('./ner'):
        print('Download success')
    else:
        raise Exception('Download failed')
else:
    print('Download success')
INFO:root:Using MoXing-v1.17.3-

INFO:root:Using OBS-Python-SDK-3.20.7


Downloading datasets and code ...

Download success

The archive downloaded from OBS is extracted, and the archive file is deleted after extraction.

2. Import Python libraries

import os
import json
import numpy as np
import tensorflow as tf
import codecs
import pickle
import collections
from ner.bert import modeling, optimization, tokenization
/home/ma-user/anaconda3/envs/TensorFlow-1.13.1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])

/home/ma-user/anaconda3/envs/TensorFlow-1.13.1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])

/home/ma-user/anaconda3/envs/TensorFlow-1.13.1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

  _np_qint16 = np.dtype([("qint16", np.int16, 1)])

/home/ma-user/anaconda3/envs/TensorFlow-1.13.1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])

/home/ma-user/anaconda3/envs/TensorFlow-1.13.1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

  _np_qint32 = np.dtype([("qint32", np.int32, 1)])

/home/ma-user/anaconda3/envs/TensorFlow-1.13.1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

  np_resource = np.dtype([("resource", np.ubyte, 1)])

3. Define paths and parameters

data_dir = "./ner/data"                                              # training data directory
output_dir = "./ner/output"                                          # output directory for checkpoints and TFRecords
vocab_file = "./ner/chinese_L-12_H-768_A-12/vocab.txt"               # BERT vocabulary file
data_config_path = "./ner/chinese_L-12_H-768_A-12/bert_config.json"  # BERT model configuration
init_checkpoint = "./ner/chinese_L-12_H-768_A-12/bert_model.ckpt"    # pre-trained BERT checkpoint
max_seq_length = 128                                                 # maximum sequence length
batch_size = 64                                                      # training batch size
num_train_epochs = 5.0                                               # number of training epochs

4. Define the processor class to load the data and print the labels

tf.logging.set_verbosity(tf.logging.INFO)
from ner.src.models import InputFeatures, InputExample, DataProcessor, NerProcessor

processors = {"ner": NerProcessor }
processor = processors["ner"](output_dir)

label_list = processor.get_labels()
print("labels:", label_list)
labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'X', '[CLS]', '[SEP]']

The labels above mean the following (a short illustrative example follows the list):

  • O: not part of any named entity
  • B-PER: first character of a person name
  • I-PER: subsequent character of a person name
  • B-ORG: first character of an organization name
  • I-ORG: subsequent character of an organization name
  • B-LOC: first character of a location name
  • I-LOC: subsequent character of a location name
  • X: other/unknown (in this case, assigned to non-initial WordPiece sub-tokens; see convert_single_example below)
  • [CLS]: sentence start
  • [SEP]: sentence end
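
To make the BIO tagging scheme concrete, here is a small illustrative example (the sentence and labels are made up for illustration and are not taken from the dataset): each Chinese character receives one label, with B-* marking the first character of an entity and I-* the characters that follow.

# Illustrative only: character-level BIO labels for a made-up sentence
chars  = ['我', '在', '北', '京', '见', '了', '马', '云']
labels = ['O',  'O',  'B-LOC', 'I-LOC', 'O', 'O', 'B-PER', 'I-PER']
for ch, tag in zip(chars, labels):
    print(ch, tag)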

5. Load the pre-trained parameters

data_config = json.load(codecs.open(data_config_path))
train_examples = processor.get_train_examples(data_dir)                       # load the training examples
num_train_steps = int(len(train_examples) / batch_size * num_train_epochs)    # total number of training steps
num_warmup_steps = int(num_train_steps * 0.1)                                 # warm up for the first 10% of steps
data_config['num_train_steps'] = num_train_steps
data_config['num_warmup_steps'] = num_warmup_steps
data_config['num_train_size'] = len(train_examples)

print("Configuration:")
for key, value in data_config.items():
    print('{key}:{value}'.format(key=key, value=value))

bert_config = modeling.BertConfig.from_json_file(data_config_path)
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

# tf.estimator run configuration
run_config = tf.estimator.RunConfig(
    model_dir=output_dir,
    save_summary_steps=1000,
    save_checkpoints_steps=1000,
    session_config=tf.ConfigProto(
        log_device_placement=False,
        inter_op_parallelism_threads=0,
        intra_op_parallelism_threads=0,
        allow_soft_placement=True
    )
)
Configuration:

attention_probs_dropout_prob:0.1

directionality:bidi

hidden_act:gelu

hidden_dropout_prob:0.1

hidden_size:768

initializer_range:0.02

intermediate_size:3072

max_position_embeddings:512

num_attention_heads:12

num_hidden_layers:12

pooler_fc_size:768

pooler_num_attention_heads:12

pooler_num_fc_layers:3

pooler_size_per_head:128

pooler_type:first_token_transform

type_vocab_size:2

vocab_size:21128

num_train_steps:1630

num_warmup_steps:163

num_train_size:20864
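
As a sanity check on the step counts above: with num_train_size = 20864 examples, batch_size = 64, and num_train_epochs = 5.0, num_train_steps = int(20864 / 64 * 5.0) = 1630 and num_warmup_steps = int(1630 * 0.1) = 163, which matches the printed configuration:

# Reproduce the step counts shown above from the training-set size
num_train_size = 20864
steps = int(num_train_size / 64 * 5.0)   # 1630 training steps
warmup = int(steps * 0.1)                # 163 warmup steps
print(steps, warmup)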

6. Read the data and convert it into input features

def convert_single_example(ex_index, example, label_list, max_seq_length, 
                           tokenizer, output_dir, mode):
    # Map each label to an integer id starting from 1 (0 is reserved for padding)
    label_map = {}
    for (i, label) in enumerate(label_list, 1):
        label_map[label] = i
    # Persist the label-to-id mapping so it can be reused later (e.g. at prediction time)
    if not os.path.exists(os.path.join(output_dir, 'label2id.pkl')):
        with codecs.open(os.path.join(output_dir, 'label2id.pkl'), 'wb') as w:
            pickle.dump(label_map, w)

    textlist = example.text.split(' ')
    labellist = example.label.split(' ')
    tokens = []
    labels = []
    for i, word in enumerate(textlist):
        # WordPiece may split a word into several sub-tokens;
        # only the first sub-token keeps the original label, the rest are labeled "X"
        token = tokenizer.tokenize(word)
        tokens.extend(token)
        label_1 = labellist[i]
        for m in range(len(token)):
            if m == 0:
                labels.append(label_1)
            else:
                labels.append("X")
    # Truncate so that [CLS] and [SEP] still fit within max_seq_length
    if len(tokens) >= max_seq_length - 1:
        tokens = tokens[0:(max_seq_length - 2)]
        labels = labels[0:(max_seq_length - 2)]
    ntokens = []
    segment_ids = []
    label_ids = []
    ntokens.append("[CLS]")  # 句子开始设置 [CLS] 标志
    segment_ids.append(0)
    label_ids.append(label_map["[CLS]"])  
    for i, token in enumerate(tokens):
        ntokens.append(token)
        segment_ids.append(0)
        label_ids.append(label_map[labels[i]])
    ntokens.append("[SEP]")  # 句尾添加 [SEP] 标志
    segment_ids.append(0)
    label_ids.append(label_map["[SEP]"])
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)  
    input_mask = [1] * len(input_ids)

    # Pad input_ids / input_mask / segment_ids / label_ids up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        label_ids.append(0)
        ntokens.append("**NULL**")

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(label_ids) == max_seq_length

    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids,
    )
   
    return feature

def filed_based_convert_examples_to_features(
        examples, label_list, max_seq_length, tokenizer, output_file, mode=None):
    writer = tf.python_io.TFRecordWriter(output_file)
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, output_dir, mode)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature(feature.label_ids)
        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())

train_file = os.path.join(output_dir, "train.tf_record")

# Convert the training examples to features and write them to a TFRecord file as the training input
filed_based_convert_examples_to_features(
            train_examples, label_list, max_seq_length, tokenizer, output_file=train_file)
INFO:tensorflow:Writing example 0 of 20864

INFO:tensorflow:Writing example 5000 of 20864

INFO:tensorflow:Writing example 10000 of 20864

INFO:tensorflow:Writing example 15000 of 20864

INFO:tensorflow:Writing example 20000 of 20864
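
To inspect what convert_single_example produces, a small sketch like the one below can be run after the cell above. The sentence is made up, and it assumes that the InputExample class imported from ner.src.models accepts guid, text, and label keyword arguments (its text and label attributes are space-separated strings, as convert_single_example expects).

# Illustrative sketch only: convert one made-up example and inspect the padded feature
demo = InputExample(guid='demo-0', text='我 在 北 京', label='O O B-LOC I-LOC')
feat = convert_single_example(0, demo, label_list, max_seq_length, tokenizer, output_dir, mode=None)
print(len(feat.input_ids))    # 128: every sequence is padded to max_seq_length
print(feat.input_mask[:8])    # [1, 1, 1, 1, 1, 1, 0, 0]: [CLS] + 4 characters + [SEP], then padding
print(feat.label_ids[:8])     # ids for [CLS], O, O, B-LOC, I-LOC, [SEP], then 0-padding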

7. Add a BiLSTM+CRF layer as the downstream model

learning_rate = 5e-5   # initial learning rate
dropout_rate = 1.0     # dropout rate passed to the BLSTM_CRF layer
lstm_size = 1          # LSTM hidden size (the BiLSTM is skipped below because crf_only=True)
cell = 'lstm'          # RNN cell type
num_layers = 1         # number of BiLSTM layers

from ner.src.models import BLSTM_CRF
from tensorflow.contrib.layers.python.layers import initializers

def create_model(bert_config, is_training, input_ids, input_mask,
                 segment_ids, labels, num_labels, use_one_hot_embeddings,
                 dropout_rate=dropout_rate, lstm_size=1, cell='lstm', num_layers=1):
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings
    )
    embedding = model.get_sequence_output()   # BERT token-level output: [batch_size, seq_length, hidden_size]
    max_seq_length = embedding.shape[1].value
    # Derive the true (unpadded) length of each sequence from the non-zero input ids
    used = tf.sign(tf.abs(input_ids))
    lengths = tf.reduce_sum(used, reduction_indices=1)
    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=1, cell_type='lstm', num_layers=1,
                          dropout_rate=dropout_rate, initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
    # crf_only=True: only the CRF layer is applied on top of BERT; the BiLSTM is skipped
    rst = blstm_crf.add_blstm_crf_layer(crf_only=True)
    return rst

def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps,use_one_hot_embeddings=False):
    # Build the model function passed to tf.estimator
    def model_fn(features, labels, mode, params):
        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
            tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]

        print('shape of input_ids', input_ids.shape)
        is_training = (mode == tf.estimator.ModeKeys.TRAIN)

        total_loss, logits, trans, pred_ids = create_model(
            bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
            num_labels, False, dropout_rate, lstm_size, cell, num_layers)

        tvars = tf.trainable_variables()

        if init_checkpoint:
            (assignment_map, initialized_variable_names) = \
                 modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
            tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
        
        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:
            train_op = optimization.create_optimizer(
                 total_loss, learning_rate, num_train_steps, num_warmup_steps, False)
            hook_dict = {}
            hook_dict['loss'] = total_loss
            hook_dict['global_steps'] = tf.train.get_or_create_global_step()
            logging_hook = tf.train.LoggingTensorHook(
                hook_dict, every_n_iter=100)

            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op,
                training_hooks=[logging_hook])

        elif mode == tf.estimator.ModeKeys.EVAL:
            def metric_fn(label_ids, pred_ids):

                return {
                    "eval_loss": tf.metrics.mean_squared_error(labels=label_ids, predictions=pred_ids),   }
            
            eval_metrics = metric_fn(label_ids, pred_ids)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metric_ops=eval_metrics
            )
        else:
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                predictions=pred_ids
            )
        return output_spec

    return model_fn
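
Training and evaluation are covered in the next part of this case. As a preview, a minimal sketch (not the original notebook code) of wiring the pieces above into a tf.estimator.Estimator could look like the following; num_labels is len(label_list) + 1 because label ids start at 1 and 0 is reserved for padding.

# Minimal sketch: build the Estimator from the objects defined above
model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list) + 1,   # label ids start at 1; 0 is the padding id
    init_checkpoint=init_checkpoint,
    learning_rate=learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# estimator.train(input_fn=...) would then consume the train.tf_record file written above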
