NLP in Practice: Named Entity Recognition (Part 1)
BERT (Bidirectional Encoder Representations from Transformers) was released by Google in October 2018. It posted striking results on SQuAD 1.1, the top benchmark for machine reading comprehension, surpassing human performance on both of its metrics, and it set new state-of-the-art results on 11 NLP tasks, including pushing the GLUE benchmark to 80.4% (a 7.6% absolute improvement) and MultiNLI accuracy to 86.7% (a 5.6% absolute improvement). BERT is widely regarded as the start of a new era in NLP: the field finally had a way to do transfer learning the way computer vision does. Anyone building a language-processing model can use this powerful pre-trained model as an off-the-shelf component, saving the time, effort, expertise, and resources needed to train a model from scratch. Concretely, BERT can be applied to NLP tasks such as:
- Question answering
- Named entity recognition
- Document clustering
- Email filtering and classification
- Sentiment analysis
This case walks you through using BERT for named entity recognition.
Notes:
- Framework used in this case: TensorFlow-1.13.1
- Hardware specification used in this case: 8 vCPU + 64 GiB + 1 x Tesla V100-PCIE-32GB
- Switching hardware specifications: if you need a different specification, you can switch it in the workspace on the right side of this page
- Running the code: click the triangular Run button in the menu bar at the top of this page, or press Ctrl+Enter, to run the code in each cell
- Detailed JupyterLab usage: see the "ModelArts JupyterLab Usage Guide"
- Troubleshooting: see the "ModelArts JupyterLab FAQ"
1. Prepare the Source Code and Data
Prepare the source code and data for this case. The resources are stored in OBS, and we download them to the local environment via MoXing.
import os
import subprocess
import moxing as mox

print('Downloading datasets and code ...')
if not os.path.exists('./ner'):
    # Copy the archive from OBS, extract it, then delete the archive
    mox.file.copy('obs://modelarts-labs-bj4/notebook/DL_nlp_ner/ner.tar.gz', './ner.tar.gz')
    p1 = subprocess.run(['tar xf ./ner.tar.gz;rm ./ner.tar.gz'], stdout=subprocess.PIPE, shell=True, check=True)
    if os.path.exists('./ner'):
        print('Download success')
    else:
        raise Exception('Download failed')
else:
    print('Download success')
INFO:root:Using MoXing-v1.17.3-
INFO:root:Using OBS-Python-SDK-3.20.7
Downloading datasets and code ...
Download success
The code above extracts the archive downloaded from OBS and deletes the archive after extraction.
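If you prefer not to shell out to tar, the same extraction can be done with Python's standard library. This is an equivalent sketch, not part of the original case:

import os
import tarfile

# Extract the downloaded archive, then remove it (mirrors the shell command above)
with tarfile.open('./ner.tar.gz') as tar:
    tar.extractall('.')
os.remove('./ner.tar.gz')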
2. Import Python Libraries
import os
import json
import numpy as np
import tensorflow as tf
import codecs
import pickle
import collections
from ner.bert import modeling, optimization, tokenization
(Importing TensorFlow 1.13 prints several NumPy FutureWarning messages from dtypes.py about the deprecated (type, 1) dtype syntax; they are harmless and can be ignored.)
3. Define Paths and Parameters
data_dir = "./ner/data"      # labeled NER training data
output_dir = "./ner/output"  # checkpoints and intermediate files
vocab_file = "./ner/chinese_L-12_H-768_A-12/vocab.txt"               # BERT vocabulary
data_config_path = "./ner/chinese_L-12_H-768_A-12/bert_config.json"  # BERT model config
init_checkpoint = "./ner/chinese_L-12_H-768_A-12/bert_model.ckpt"    # pre-trained BERT weights
max_seq_length = 128  # maximum sequence length after tokenization
batch_size = 64
num_train_epochs = 5.0
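Before going further, it is worth confirming that the pre-trained files are actually in place. A quick sanity check (a sketch; the '.index' suffix assumes the standard TensorFlow checkpoint layout):

# Verify that the vocabulary, config, and checkpoint files exist
for path in [vocab_file, data_config_path, init_checkpoint + '.index']:
    assert os.path.exists(path), path + ' is missing'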
4. Define the Processor Class to Load the Data and Print the Labels
tf.logging.set_verbosity(tf.logging.INFO)
from ner.src.models import InputFeatures, InputExample, DataProcessor, NerProcessor
processors = {"ner": NerProcessor}
processor = processors["ner"](output_dir)
label_list = processor.get_labels()
print("labels:", label_list)
labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'X', '[CLS]', '[SEP]']
The labels above mean the following (a short tagging example follows the list):
- O: not part of any entity
- B-PER: first character of a person name
- I-PER: non-initial character of a person name
- B-ORG: first character of an organization name
- I-ORG: non-initial character of an organization name
- B-LOC: first character of a location name
- I-LOC: non-initial character of a location name
- X: placeholder for trailing sub-tokens when the tokenizer splits a word
- [CLS]: sentence start
- [SEP]: sentence end
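As a concrete illustration of the scheme, here is a hypothetical character-level tagging (the sentence and labels are made-up example data, not taken from the training set):

# Each Chinese character gets one BIO label
demo_chars = ["华", "为", "总", "部", "位", "于", "深", "圳"]
demo_labels = ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC", "I-LOC"]
for char, label in zip(demo_chars, demo_labels):
    print(char, label)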
5. Load the Pre-trained Parameters
data_config = json.load(codecs.open(data_config_path))
train_examples = processor.get_train_examples(data_dir)
num_train_steps = int(len(train_examples) / batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * 0.1)
data_config['num_train_steps'] = num_train_steps
data_config['num_warmup_steps'] = num_warmup_steps
data_config['num_train_size'] = len(train_examples)

print("Configuration:")
for key, value in data_config.items():
    print('{key}:{value}'.format(key=key, value=value))

bert_config = modeling.BertConfig.from_json_file(data_config_path)
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

# Runtime configuration for tf.estimator
run_config = tf.estimator.RunConfig(
    model_dir=output_dir,
    save_summary_steps=1000,
    save_checkpoints_steps=1000,
    session_config=tf.ConfigProto(
        log_device_placement=False,
        inter_op_parallelism_threads=0,
        intra_op_parallelism_threads=0,
        allow_soft_placement=True
    )
)
Configuration:
attention_probs_dropout_prob:0.1
directionality:bidi
hidden_act:gelu
hidden_dropout_prob:0.1
hidden_size:768
initializer_range:0.02
intermediate_size:3072
max_position_embeddings:512
num_attention_heads:12
num_hidden_layers:12
pooler_fc_size:768
pooler_num_attention_heads:12
pooler_num_fc_layers:3
pooler_size_per_head:128
pooler_type:first_token_transform
type_vocab_size:2
vocab_size:21128
num_train_steps:1630
num_warmup_steps:163
num_train_size:20864
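The step counts above follow directly from the dataset size and the hyperparameters set in step 3; a quick check of the arithmetic:

# 20864 training examples / batch size 64 = 326 steps per epoch; 326 x 5 epochs = 1630
assert int(20864 / 64 * 5.0) == 1630  # num_train_steps
assert int(1630 * 0.1) == 163         # num_warmup_steps: 10% linear warmup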
6. Read the Data and Convert Sentences to Features
def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer, output_dir, mode):
    # Map each label to an integer id; ids start at 1, leaving 0 for padding
    label_map = {}
    for (i, label) in enumerate(label_list, 1):
        label_map[label] = i
    # Save the label->id mapping once, for decoding predictions later
    if not os.path.exists(os.path.join(output_dir, 'label2id.pkl')):
        with codecs.open(os.path.join(output_dir, 'label2id.pkl'), 'wb') as w:
            pickle.dump(label_map, w)
    textlist = example.text.split(' ')
    labellist = example.label.split(' ')
    tokens = []
    labels = []
    for i, word in enumerate(textlist):
        # WordPiece may split a word into several sub-tokens; only the first
        # keeps the real label, the rest are marked "X"
        token = tokenizer.tokenize(word)
        tokens.extend(token)
        label_1 = labellist[i]
        for m in range(len(token)):
            if m == 0:
                labels.append(label_1)
            else:
                labels.append("X")
    # Truncate to leave room for [CLS] and [SEP]
    if len(tokens) >= max_seq_length - 1:
        tokens = tokens[0:(max_seq_length - 2)]
        labels = labels[0:(max_seq_length - 2)]
    ntokens = []
    segment_ids = []
    label_ids = []
    ntokens.append("[CLS]")  # [CLS] marks the start of the sentence
    segment_ids.append(0)
    label_ids.append(label_map["[CLS]"])
    for i, token in enumerate(tokens):
        ntokens.append(token)
        segment_ids.append(0)
        label_ids.append(label_map[labels[i]])
    ntokens.append("[SEP]")  # [SEP] marks the end of the sentence
    segment_ids.append(0)
    label_ids.append(label_map["[SEP]"])
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)
    input_mask = [1] * len(input_ids)
    # Zero-pad everything up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        label_ids.append(0)
        ntokens.append("**NULL**")
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(label_ids) == max_seq_length
    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids,
    )
    return feature
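# To see what convert_single_example produces, a minimal sketch (assuming the
# training set loaded in step 5 is non-empty) that converts the first training
# example and inspects the result:
demo_feature = convert_single_example(0, train_examples[0], label_list,
                                      max_seq_length, tokenizer, output_dir, None)
print(len(demo_feature.input_ids))  # 128, i.e. padded to max_seq_length
print(demo_feature.input_ids[:10])  # begins with the id of [CLS]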
def filed_based_convert_examples_to_features(
        examples, label_list, max_seq_length, tokenizer, output_file, mode=None):
    writer = tf.python_io.TFRecordWriter(output_file)
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, output_dir, mode)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature(feature.label_ids)
        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())
    writer.close()

train_file = os.path.join(output_dir, "train.tf_record")
# Convert the training examples to features and write them as the training input
filed_based_convert_examples_to_features(
    train_examples, label_list, max_seq_length, tokenizer, output_file=train_file)
INFO:tensorflow:Writing example 0 of 20864
INFO:tensorflow:Writing example 5000 of 20864
INFO:tensorflow:Writing example 10000 of 20864
INFO:tensorflow:Writing example 15000 of 20864
INFO:tensorflow:Writing example 20000 of 20864
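As an optional sanity check (a sketch using the TF 1.x record iterator), you can count the records that were just written; the total should match the training-set size:

# Iterate over the TFRecord file and count serialized examples
record_count = sum(1 for _ in tf.python_io.tf_record_iterator(train_file))
print('records written:', record_count)  # expected: 20864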
7. Add a BiLSTM+CRF Layer as the Downstream Model
learning_rate = 5e-5
dropout_rate = 1.0
lstm_size = 1
cell = 'lstm'
num_layers = 1
from ner.src.models import BLSTM_CRF
from tensorflow.contrib.layers.python.layers import initializers

def create_model(bert_config, is_training, input_ids, input_mask,
                 segment_ids, labels, num_labels, use_one_hot_embeddings,
                 dropout_rate=dropout_rate, lstm_size=1, cell='lstm', num_layers=1):
    # Upstream: the pre-trained BERT encoder
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings
    )
    # Token-level output: [batch_size, seq_length, hidden_size]
    embedding = model.get_sequence_output()
    max_seq_length = embedding.shape[1].value
    # True sequence lengths: padding positions have input id 0
    used = tf.sign(tf.abs(input_ids))
    lengths = tf.reduce_sum(used, reduction_indices=1)
    # Downstream: BiLSTM+CRF (crf_only=True applies just the CRF layer on top of BERT)
    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=lstm_size, cell_type=cell,
                          num_layers=num_layers, dropout_rate=dropout_rate,
                          initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths,
                          is_training=is_training)
    rst = blstm_crf.add_blstm_crf_layer(crf_only=True)
    return rst
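# The tf.sign(tf.abs(input_ids)) trick above recovers the true (unpadded) length
# of each sequence, because padding positions have id 0. A toy illustration with
# made-up ids (TF 1.x session style):
with tf.Session() as sess:
    demo_ids = tf.constant([[101, 2769, 102, 0, 0]])  # one sequence padded with zeros
    print(sess.run(tf.reduce_sum(tf.sign(tf.abs(demo_ids)), axis=1)))  # [3]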
def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_one_hot_embeddings=False):
    # Build the model_fn consumed by tf.estimator
    def model_fn(features, labels, mode, params):
        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
            tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]
        print('shape of input_ids', input_ids.shape)
        is_training = (mode == tf.estimator.ModeKeys.TRAIN)
        total_loss, logits, trans, pred_ids = create_model(
            bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
            num_labels, False, dropout_rate, lstm_size, cell, num_layers)
        tvars = tf.trainable_variables()
        # Initialize the BERT part of the graph from the pre-trained checkpoint
        if init_checkpoint:
            (assignment_map, initialized_variable_names) = \
                modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
            tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:
            train_op = optimization.create_optimizer(
                total_loss, learning_rate, num_train_steps, num_warmup_steps, False)
            hook_dict = {}
            hook_dict['loss'] = total_loss
            hook_dict['global_steps'] = tf.train.get_or_create_global_step()
            logging_hook = tf.train.LoggingTensorHook(
                hook_dict, every_n_iter=100)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op,
                training_hooks=[logging_hook])
        elif mode == tf.estimator.ModeKeys.EVAL:
            def metric_fn(label_ids, pred_ids):
                return {
                    "eval_loss": tf.metrics.mean_squared_error(labels=label_ids, predictions=pred_ids),
                }
            eval_metrics = metric_fn(label_ids, pred_ids)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metric_ops=eval_metrics
            )
        else:
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                predictions=pred_ids
            )
        return output_spec

    return model_fn
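To show where model_fn_builder fits, here is a minimal sketch of wiring it into tf.estimator using the objects defined above. The actual training call is covered in the next part, and the num_labels value is an assumption based on label ids starting at 1 in convert_single_example, with 0 reserved for padding:

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list) + 1,  # assumption: ids start at 1, 0 reserved for padding
    init_checkpoint=init_checkpoint,
    learning_rate=learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)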