The PyTorch Version of BERT
1. Google BERT
BERT repository: https://github.com/google-research/bert
PyTorch version of BERT: https://github.com/huggingface/pytorch-pretrained-BERT
Requirements: Python 3.5+, PyTorch 0.4.1/1.0.0, pip install pytorch-pretrained-bert, and a downloaded BERT model
2. BERT Models
BERT-Base, Multilingual (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
Chinese model (BERT-Base-Chinese): https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz (a loading sketch follows below)
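Once the package is installed, the Chinese model can be loaded as sketched below; this assumes you use the built-in shortcut name, which downloads and caches the archive above automatically (a local directory with the extracted files also works).
from pytorch_pretrained_bert import BertTokenizer, BertModel
# 'bert-base-chinese' is a built-in shortcut name; from_pretrained() downloads
# and caches the archive above, or you can pass a local directory containing
# the extracted vocabulary and weight files instead.
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese')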
3. Brief Introduction
1) Bidirectional Encoder Representations from Transformers
BERT is the first unsupervised, deeply bidirectional system for pre-training NLP, and its representations are contextual.
It trains a large model (a 12-layer to 24-layer Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1M update steps).
(Earlier approaches such as Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit only learned single-side, unidirectional context.)
2) Learning word representations: BERT masks 15% of the input words, e.g.:
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
3) Learning inter-sentence relationships (a toy construction sketch of both pre-training objectives follows after this example):
Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence
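The following toy sketch shows how these two kinds of pre-training instances could be constructed. It is illustrative only: the helper names mask_tokens and make_nsp_pair are made up here, and the real pipeline additionally applies WordPiece alignment, the 80/10/10 replacement rule, and sequence packing.
import random

def mask_tokens(tokens, mask_prob=0.15):
    # Randomly replace ~15% of tokens with [MASK] and keep the originals as labels.
    masked, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:
            labels[i] = tok
            masked[i] = "[MASK]"
    return masked, labels

def make_nsp_pair(sent_a, sent_b, is_next):
    # Pack two sentences into one sequence with an IsNext/NotNext label.
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, "IsNextSentence" if is_next else "NotNextSentence"

sent_a = "the man went to the store .".split()
sent_b = "he bought a gallon of milk .".split()
tokens, segment_ids, nsp_label = make_nsp_pair(sent_a, sent_b, is_next=True)
masked_tokens, mlm_labels = mask_tokens(tokens)
print(masked_tokens, mlm_labels, nsp_label)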
4. Using BERT
4.1 Two Ways to Use BERT
1) Pre-training: four days on 4 to 16 Cloud TPUs.
2) Fine-tuning: a few hours on a single GPU.
The fine-tuning examples that use BERT-Base should run on a GPU with at least 12GB of RAM using the given hyperparameters.
4.2 How to Resolve Out-of-Memory Issues
When using a GPU with 12GB-16GB of RAM, you are likely to encounter out-of-memory issues if you use the same hyperparameters described in the paper. Adjust the following factors (see the sketch after this list):
max_seq_length: the released models were trained with sequence lengths up to 512, but you can fine-tune with a shorter maximum sequence length to save memory.
train_batch_size: memory usage is directly proportional to the batch size, so a smaller batch size saves memory.
Model type, BERT-Base vs. BERT-Large: The BERT-Large model requires more memory.
Optimizer: the released models were trained with Adam, which requires a lot of extra memory for the m and v vectors. Switching to a more memory-efficient optimizer can reduce memory usage, but can also affect the results.
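A minimal sketch of these knobs in PyTorch; the concrete values, the dummy data, and the gradient-accumulation trick are illustrative assumptions, not settings taken from the repository.
import torch
from torch.utils.data import DataLoader, TensorDataset

MAX_SEQ_LENGTH = 128     # reduced from 512 to save activation memory
TRAIN_BATCH_SIZE = 8     # smaller batches use proportionally less memory
GRAD_ACCUM_STEPS = 4     # effective batch size = 8 * 4 = 32

# Dummy pre-tokenized data: token-id tensors of shape [num_examples, MAX_SEQ_LENGTH].
input_ids = torch.randint(0, 21128, (100, MAX_SEQ_LENGTH))  # 21128 = bert-base-chinese vocab size
labels = torch.randint(0, 2, (100,))
loader = DataLoader(TensorDataset(input_ids, labels), batch_size=TRAIN_BATCH_SIZE)

# Inside the training loop, gradient accumulation keeps the per-step memory of a
# small batch while approximating a larger effective batch, e.g.:
# for step, (ids, y) in enumerate(loader):
#     loss = model(ids, labels=y) / GRAD_ACCUM_STEPS
#     loss.backward()
#     if (step + 1) % GRAD_ACCUM_STEPS == 0:
#         optimizer.step()
#         optimizer.zero_grad()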
4.3 Using PyTorch-BERT
The original BERT runs on TensorFlow; for usage of the TF version, see: https://www.jianshu.com/p/bfd0148b292e
The PyTorch version of BERT consists of the following components:
1) Eight BERT PyTorch models (a loading sketch follows this list):
BertModel - raw BERT Transformer model (fully pre-trained),
BertForMaskedLM - BERT Transformer with the pre-trained masked language modeling head on top (fully pre-trained),
BertForNextSentencePrediction - BERT Transformer with the pre-trained next sentence prediction classifier on top (fully pre-trained),
BertForPreTraining - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (fully pre-trained),
BertForSequenceClassification - BERT Transformer with a sequence classification head on top (BERT Transformer is pre-trained, the sequence classification head is only initialized and has to be trained),
BertForMultipleChoice - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is pre-trained, the multiple choice classification head is only initialized and has to be trained),
BertForTokenClassification - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained),
BertForQuestionAnswering - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained).
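As a quick sketch of the distinction in parentheses above: loading one of the fine-tuning heads restores the pre-trained Transformer weights, while the task head itself starts from random initialization and must be trained; num_labels=2 is an assumed binary-classification setting.
from pytorch_pretrained_bert import BertForSequenceClassification
# The Transformer body is loaded from the pre-trained checkpoint; the sequence
# classification head on top is newly initialized and has to be fine-tuned.
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)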
2) Tokenizers for BERT (using WordPiece) (in the tokenization.py file; a short usage sketch follows this list):
BasicTokenizer - basic tokenization (punctuation splitting, lower casing, etc.),
WordpieceTokenizer - WordPiece tokenization,
BertTokenizer - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
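A short sketch of that pipeline, assuming BertTokenizer exposes its internal basic_tokenizer attribute as in the tokenization.py source:
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# BasicTokenizer alone: punctuation splitting and lower casing.
print(tokenizer.basic_tokenizer.tokenize("Jim Henson was a puppeteer"))
# -> ['jim', 'henson', 'was', 'a', 'puppeteer']
# BertTokenizer end-to-end: basic tokenization followed by WordPiece splitting.
print(tokenizer.tokenize("Jim Henson was a puppeteer"))
# -> ['jim', 'henson', 'was', 'a', 'puppet', '##eer']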
3) Optimizer for BERT (in the optimization.py file):
BertAdam - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
From the source code:
__init__(params, lr=required, warmup=-1, t_total=-1, schedule='warmup_linear', betas=(0.9, 0.999), e=1e-6, weight_decay=0.01, max_grad_norm=1.0, **kwargs)
lr: learning rate
warmup: portion of t_total to use for warmup; -1 means no warmup.
t_total: total number of training steps for the learning rate schedule; -1 means a constant learning rate (the schedule multiplier stays at 1, so there is no warmup regardless of the warmup setting). Default: -1
schedule: schedule to use for the warmup (see above).
Can be `'warmup_linear'`, `'warmup_constant'`, `'warmup_cosine'`, `'none'`, `None` or a `_LRSchedule` object (see below).
If `None` or `'none'`, learning rate is always kept constant. Default : `'warmup_linear'`
Example usage (a fuller setup sketch follows below):
train_optimi_step = int(train_iter_num / args.batch_size) * args.epochs
optimizer = BertAdam([param for _, param in param_optimizer], lr=args.lr, warmup=0.1, t_total=train_optimi_step)
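A fuller setup sketch, following the parameter-grouping pattern used in the repository's example scripts (bias and LayerNorm parameters are excluded from weight decay); the lr, warmup, and step-count values are placeholders, and model is assumed to be an already-loaded BERT model such as one of the heads above.
from pytorch_pretrained_bert import BertAdam

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
]
num_train_steps = 1000  # e.g. (num_examples // batch_size) * num_epochs
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=5e-5,
                     warmup=0.1,            # warm up over the first 10% of steps
                     t_total=num_train_steps)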
4) Five examples on how to use BERT (in the examples folder):
extract_features.py - Show how to extract hidden states from an instance of BertModel,
run_classifier.py - Show how to fine-tune an instance of BertForSequenceClassification on GLUE's MRPC task,
run_squad.py - Show how to fine-tune an instance of BertForQuestionAnswering on SQuAD v1.1 and SQuAD v2.0 tasks.
run_swag.py - Show how to fine-tune an instance of BertForMultipleChoice on the SWAG task.
run_lm_finetuning.py - Show how to fine-tune an instance of BertForPreTraining on a target text corpus.
5. BERT Usage Code
The PyTorch version of BERT can be used as follows:
1) First prepare a tokenized input with BertTokenizer
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# Load the pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
# Convert tokens to their vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define which tokens belong to sentence A (0) and sentence B (1)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
2) Use BertModel to get hidden states
# Load the pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
# If you have a GPU, put everything on CUDA
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Get the hidden states of each layer
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
# bert-base-uncased has 12 layers, so there are 12 layers of hidden states
assert len(encoded_layers) == 12
3) Use BertForMaskedLM
# Load the pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# If you have a GPU, put everything on CUDA
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
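Beyond checking the single argmax, it can be useful to look at the model's top candidates for the masked position; a small follow-up reusing the predictions tensor, masked_index, and tokenizer from above:
# Inspect the top-5 candidates for the masked position
top_values, top_ids = torch.topk(predictions[0, masked_index], k=5)
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))  # 'henson' should rank first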
1. BertModel
Inputs (defined in modeling.py):
input_ids: torch.LongTensor [batch_size, sequence_length] with the word token indices in the vocabulary.
token_type_ids: optional torch.LongTensor [batch_size, sequence_length] with the token types indices selected in [0, 1]. Type 0 corresponds to a sentence A and type 1 corresponds to a sentence B token.
attention_mask: torch.LongTensor [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used when a batch has varying length sentences.
output_all_encoded_layers: controls the content of the encoded_layers output as described below. Default: True.
Outputs:
encoded_layers: depends on the output_all_encoded_layers argument:
output_all_encoded_layers=True:
outputs a list of encoded hidden states, one at the end of each attention block (12 full sequences for BERT-Base, 24 for BERT-Large); each encoded hidden state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size]
output_all_encoded_layers=False:
outputs only the encoded hidden states of the last attention block: a single torch.FloatTensor of size [batch_size, sequence_length, hidden_size]
pooled_output: a torch.FloatTensor of size [batch_size, hidden_size], which is the output of a classifier pretrained on top of the hidden state associated with the [CLS] token, trained on the next-sentence prediction task.
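A small sketch of the optional arguments described above, assuming model is the BertModel instance from step 2) of section 5 and reusing tokens_tensor / segments_tensors; attention_mask is all ones here because nothing is padded.
attention_mask = torch.ones_like(tokens_tensor)
with torch.no_grad():
    last_layer, pooled_output = model(tokens_tensor,
                                      token_type_ids=segments_tensors,
                                      attention_mask=attention_mask,
                                      output_all_encoded_layers=False)
print(last_layer.shape)     # [batch_size, sequence_length, hidden_size]
print(pooled_output.shape)  # [batch_size, hidden_size]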