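TensorRT 8.0.1.6: Installation and Testing
TensorRT 8.0.1 was released quietly not long ago, with NVIDIA claiming a performance breakthrough on BERT and other large models. So I took a first look at how TensorRT 8 behaves in practice.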
I. Installing TensorRT 8.0.1
1. The official TensorRT Release 8 notes at https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel_8-0-1 list the dependencies of TensorRT 8.0 and the versions NVIDIA tested against:
- cuDNN 8.2.1
- TensorFlow 1.15.5
- PyTorch 1.8.1
- ONNX 1.8.0
- CUDA 10.2 or 11.0 update 1 or 11.1 update 1 or 11.2 update 2 or 11.3 update 1
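To see how a local Python environment lines up with the tested versions above, something like the following sketch may help. It assumes Python 3.8+ (for importlib.metadata) and assumes the pip distribution names tensorflow, torch and onnx; neither assumption comes from the release notes, and a GPU build may be published under a different name.

```python
from importlib import metadata

# Versions listed above as tested with TensorRT 8.0.1.
# The pip distribution names used as keys are an assumption.
TESTED = {"tensorflow": "1.15.5", "torch": "1.8.1", "onnx": "1.8.0"}


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(tested=TESTED, lookup=installed_version):
    """Compare installed versions with the tested ones, one row per package."""
    rows = []
    for package, wanted in tested.items():
        found = lookup(package)
        if found is None:
            status = "not installed"
        elif found.split("+")[0] == wanted:  # ignore local tags such as +cu102
            status = "matches"
        else:
            status = "differs"
        rows.append((package, wanted, found, status))
    return rows
```

Calling report() returns tuples of (package, tested version, installed version, status). A "differs" row is not necessarily a problem; it only means the combination is not the one NVIDIA tested.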
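2. There are two ways to set up the environment.
Method 1: use the official image
- Pull the official TensorRT image directly. The image is listed at https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt, and its details are in https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_21-07.html#rel_21-07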
docker pull nvcr.io/nvidia/tensorrt:21.07-py3
- Run the docker container (replace xx.xx with the tag you pulled, e.g. 21.07)
docker run -v /mnt/:/mnt/ -it --cap-add SYS_PTRACE --runtime=nvidia --shm-size=4gb -e NVIDIA_VISIBLE_DEVICES=7 --net=host nvcr.io/nvidia/tensorrt:xx.xx-py3 bash
- Build and run the TensorRT C++ samples inside the container
cd /workspace/tensorrt/samples
make -j4
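Method 2: configure the environment locally
- CUDA 10.2 is already installed locally
- Download the cuDNN 8.2.1 archive from the NVIDIA website and extract it to /usr/local/cudnn_v8.2.1
- Download TensorRT 8.0.1 from the NVIDIA website and extract it to /usr/local/TensorRT-8.0.1.6 (note: TensorRT 8 comes in GA and EA releases; EA, early access, is an early and less stable release, while GA, general availability, is the fully tested stable release)
- Set the environment variables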
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64/:/usr/local/cuda-10.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cudnn_v8.2.1/lib64/:$LD_LIBRARY_PATH
export CPLUS_INCLUDE_PATH=/usr/local/cuda-10.2/include/:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/usr/local/cudnn_v8.2.1/include/:$CPLUS_INCLUDE_PATH
export PATH=/usr/local/cuda-10.2/bin/:$PATH
- Build the TensorRT C++ samples
- Install the TensorRT Python package with pip install tensorrt-8.0.1.6-cp37-none-linux_x86_64.whl (install the uff and onnx-graphsurgeon whl packages as needed)
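After installing the wheel, a quick sanity check is to import the package and read its version. The helper below is only a sketch: the function name is my own, and it returns None instead of raising when the import fails (for example because the wheel is missing or libnvinfer cannot be found on LD_LIBRARY_PATH).

```python
def tensorrt_version():
    """Return the version of the importable tensorrt package, or None."""
    try:
        import tensorrt  # provided by the tensorrt-8.0.1.6 wheel
    except ImportError:
        return None
    return tensorrt.__version__


if __name__ == "__main__":
    version = tensorrt_version()
    if version is None:
        print("tensorrt is not importable; check the wheel and LD_LIBRARY_PATH")
    else:
        print("tensorrt", version)
```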
II. Testing models
1. Set the TensorRT 8 environment variables
export TRT_TOOLKIT_ROOT_DIR=/usr/local/TensorRT-8.0.1.6/
export LD_LIBRARY_PATH=/usr/local/TensorRT-8.0.1.6/targets/x86_64-linux-gnu/lib/:/usr/local/TensorRT-8.0.1.6/targets/x86_64-linux-gnu/lib/stubs/:$LD_LIBRARY_PATH
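A missing entry in LD_LIBRARY_PATH usually only shows up later as a library-load error, so it can be worth checking the variable up front. The sketch below assumes the install locations used in this post; missing_from_path is a hypothetical helper name, not part of any NVIDIA tool.

```python
import os

# Library directories used in this post; adjust to your own install paths.
REQUIRED_DIRS = [
    "/usr/local/cuda-10.2/lib64",
    "/usr/local/cudnn_v8.2.1/lib64",
    "/usr/local/TensorRT-8.0.1.6/targets/x86_64-linux-gnu/lib",
]


def missing_from_path(required_dirs, path_value):
    """Return the required directories that do not appear in a ':'-separated path."""
    entries = {os.path.normpath(p) for p in path_value.split(":") if p}
    return [d for d in required_dirs if os.path.normpath(d) not in entries]


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in missing_from_path(REQUIRED_DIRS, current):
        print("not on LD_LIBRARY_PATH:", directory)
```

The normpath call makes the comparison insensitive to trailing slashes, which the export lines above use.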
2. Tool versions used
- ONNX: 1.9.0
- ONNX IR version:0.0.6
- ONNX opset version: 11
- tf2onnx: 1.8.5
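The IR version and opset are recorded inside the .onnx file itself, so they can be read back before handing the model to trtexec. This is a minimal sketch using the onnx Python package; the function name and the returned dictionary layout are my own choices, and I am assuming that the "0.0.6" above corresponds to the integer ir_version 6 stored in the file.

```python
def describe_onnx_model(path):
    """Return the IR version, producer, and opset versions stored in an ONNX file."""
    import onnx  # imported lazily so the helper can be defined without onnx installed

    model = onnx.load(path)
    # The default operator set is stored with an empty domain string.
    opsets = {(imp.domain or "ai.onnx"): imp.version for imp in model.opset_import}
    return {
        "ir_version": model.ir_version,
        "producer": f"{model.producer_name} {model.producer_version}".strip(),
        "opsets": opsets,
    }
```

For a model exported with the settings listed above, I would expect the default-domain opset entry to be 11.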
3. Testing
- Test with trtexec, the TensorRT command-line tool
./trtexec --onnx=***.onnx --workspace=2048 --warmUp=2000 --iterations=500 --device=4 --avgRuns=500 --explicitBatch --tacticSources=-cublasLt,+cublas
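- Problems encountered
Problem 1: "a known issue with cuBLAS LT 10.2 <https://github.com/NVIDIA/TensorRT/issues/866>". The fix is to pass --tacticSources=-cublasLt,+cublas.
Problem 2: [TRT] ../builder/cudnnBuilderUtils.cpp (360) - Cuda Error in findFastestTactic: 10 (invalid device ordinal)
This happens when export CUDA_VISIBLE_DEVICES="4" is set and ./trtexec is also given --device=4. With CUDA_VISIBLE_DEVICES="4" the process sees only one GPU, renumbered as device 0, so ordinal 4 no longer exists. Setting only one of the two fixes it.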
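The renumbering behind Problem 2 can be illustrated in a few lines. This is only a model of the behaviour, assuming CUDA_VISIBLE_DEVICES holds a comma-separated list of devices; the function names are hypothetical.

```python
def visible_ordinals(cuda_visible_devices):
    """Device ordinals a process can use once CUDA_VISIBLE_DEVICES is set.

    The visible devices are renumbered from 0 in the order they are listed.
    """
    ids = [tok.strip() for tok in cuda_visible_devices.split(",") if tok.strip()]
    return list(range(len(ids)))


def device_flag_is_valid(device, cuda_visible_devices):
    """True if --device=<device> refers to a GPU the process can still see."""
    return device in visible_ordinals(cuda_visible_devices)


if __name__ == "__main__":
    # The failing combination from Problem 2: only ordinal 0 exists.
    print(device_flag_is_valid(4, "4"))
    # Dropping --device (or passing --device=0) selects physical GPU 4.
    print(device_flag_is_valid(0, "4"))
```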
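III. Reproducing the TensorRT 8 BERT benchmark
Note: this follows https://developer.nvidia.com/blog/real-time-nlp-with-bert-using-tensorrt-updated/;
see also https://github.com/NVIDIA/TensorRT/tree/release/8.0/demo/BERT/.
That page builds an image from a Dockerfile, then, inside the image, downloads the BERT model with the ngc command, builds a TensorRT engine with a script, and runs inference.
When I tried to reproduce it, building the image from the Dockerfile kept failing on network-related download errors, so I skipped the image, downloaded the model directly, and reproduced the benchmark in my local environment.
1. Steps for TensorRT inference:
- Download the TensorRT repository, including the BERT demo scripts, to the local machine: git clone --recursive https://github.com/NVIDIA/TensorRT && cd TensorRT
- The NVIDIA NGC site (https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_v2_large_fp16_128/files) provides models for testing. As shown in the figure, click Models in the catalog panel on the left and search for the model keywords you need. For example, search for "bert large", pick the model you want, and copy its wget command to download the model to the local machine.
- This yields the model file directory, which is stored under TensorRT/demo/BERT/models/fine-tuned/bert_tf_v2_large_fp16_128_2.
- Build the TensorRT runtime engine with builder.py from TensorRT/demo/BERT (adapt the script as needed; for example, PyTorch was not installed in my local conda environment, so I commented out load_pytorch_weights_and_quant in builder.py and the PyTorch-related parts of builder_utils.py)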
mkdir -p engines && python3 builder.py -m models/fine-tuned/bert_tf_v2_large_fp16_128_2/model.ckpt-8144 -o engines/bert_large_1_128.engine -b 1 -s 128 --fp16 -c models/fine-tuned/bert_tf_v2_large_fp16_128_2
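where -m is the model checkpoint, -o the path of the output engine, -b the batch size, -s the sequence length, and -c the directory of the checkpoint (the folder that holds bert_config.json)
- Once the engine is built, there are two ways to run inference:
- Command-line inference: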
trtexec --loadEngine=path_to_engine --workspace=2048 --warmUp=2000 --iterations=500 --avgRuns=500 --explicitBatch --tacticSources=-cublasLt,+cublas
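- Inference with perf.py from the TensorRT/demo/BERT directory:
python perf.py -e engines/bert_large_128_${batch}.engine -b ${batch} -s 128 -i 500 -w 2000 -r 0
where -e is the path to the engine (it must match the file written by builder.py), -b the batch size, -s the sequence length, -i the number of inference iterations, -w the number of warm-up runs, and -r the random seed
- This gives the TensorRT inference time for the BERT model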
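When benchmarking several batch sizes, the build and perf commands above have to be repeated with matching engine paths. The sketch below only assembles and prints the command lines; it does not run them. The batch sizes (1, 2, 4) are hypothetical, and the engine naming follows the builder.py example above (bert_large_<batch>_<seq>.engine), which is an assumption since the perf.py example uses a different pattern.

```python
import shlex

MODEL_DIR = "models/fine-tuned/bert_tf_v2_large_fp16_128_2"


def engine_path(batch, seq_len=128):
    """Engine file name, following the builder.py example above (an assumption)."""
    return f"engines/bert_large_{batch}_{seq_len}.engine"


def builder_cmd(batch, seq_len=128):
    """builder.py command line for one batch size, using the flags shown above."""
    return ["python3", "builder.py",
            "-m", f"{MODEL_DIR}/model.ckpt-8144",
            "-o", engine_path(batch, seq_len),
            "-b", str(batch), "-s", str(seq_len),
            "--fp16", "-c", MODEL_DIR]


def perf_cmd(batch, seq_len=128, iterations=500, warmup=2000, seed=0):
    """perf.py command line that benchmarks the engine built for this batch size."""
    return ["python", "perf.py",
            "-e", engine_path(batch, seq_len),
            "-b", str(batch), "-s", str(seq_len),
            "-i", str(iterations), "-w", str(warmup), "-r", str(seed)]


if __name__ == "__main__":
    for batch in (1, 2, 4):  # hypothetical batch sizes
        print(shlex.join(builder_cmd(batch)))
        print(shlex.join(perf_cmd(batch)))
```

Generating both commands from one engine_path function keeps the -o path of the build and the -e path of the benchmark in step.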
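2. Steps for TensorFlow inference
- First, convert the model checkpoint to a pb file. The get_frozen_tftrt_model function in https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_squad.py shows how to convert the ckpt file to a pb file
- Then import the pb file and run inference with TensorFlow and with XLA separately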
- To keep the input data identical, note how perf.py constructs its input:
# Prepare random input
pseudo_vocab_size = 30522
pseudo_type_vocab_size = 2
np.random.seed(args.random_seed)
test_word_ids = np.random.randint(0, pseudo_vocab_size, (max(args.batch_size), args.sequence_length), dtype=np.int32)
test_segment_ids = np.random.randint(0, pseudo_type_vocab_size, (max(args.batch_size), args.sequence_length), dtype=np.int32)
test_input_mask = np.ones((max(args.batch_size), args.sequence_length), dtype=np.int32)
Build the TensorFlow and XLA inputs the same way, with the seed set to 0
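For the TensorFlow and XLA runs, the perf.py input construction can be wrapped in a small function so every framework is fed the same arrays. This is a sketch that mirrors the perf.py lines; the function name and its signature are my own.

```python
import numpy as np


def make_bert_inputs(batch_size, sequence_length, seed=0,
                     vocab_size=30522, type_vocab_size=2):
    """Build random BERT inputs the same way perf.py does.

    Returns (word_ids, segment_ids, input_mask), each an int32 array of
    shape (batch_size, sequence_length).
    """
    np.random.seed(seed)
    word_ids = np.random.randint(
        0, vocab_size, (batch_size, sequence_length), dtype=np.int32)
    segment_ids = np.random.randint(
        0, type_vocab_size, (batch_size, sequence_length), dtype=np.int32)
    input_mask = np.ones((batch_size, sequence_length), dtype=np.int32)
    return word_ids, segment_ids, input_mask
```

Since perf.py sizes its arrays with max(args.batch_size), batch_size here should be the largest batch size passed to perf.py for the values to line up.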
3. Compare the TensorRT 8 inference results with those of TensorFlow and XLA
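For the comparison to be fair, the TensorFlow and XLA runs should be timed with warm-up and a fixed iteration count, as in the TensorRT runs. The harness below is a generic sketch, not the code used for this post: run_once stands for any callable that performs one inference and blocks until the result is available (for example a wrapper around a session run that returns NumPy arrays).

```python
import statistics
import time


def benchmark(run_once, warmup_iters=10, timed_iters=100):
    """Time a blocking callable and return latency statistics in milliseconds."""
    for _ in range(warmup_iters):
        run_once()  # warm-up runs are executed but not timed
    latencies_ms = []
    for _ in range(timed_iters):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }
```

Note that perf.py takes its warm-up as a number of runs while trtexec's --warmUp is a duration in milliseconds, so the two settings are not interchangeable when matching configurations.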
Aside: why is BERT not compared between TensorRT 7 and TensorRT 8?
List the plugin creators in TensorRT 7.0 and 8.0 with the following code:
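import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
# Register the built-in plugins, then print the name of every plugin creator
trt.init_libnvinfer_plugins(TRT_LOGGER, "")
plg_registry = trt.get_plugin_registry()
PLUGIN_CREATORS = plg_registry.plugin_creator_list
for plugin_creator in PLUGIN_CREATORS:
    print(plugin_creator.name)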
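This gives the plugin creators each version supports, shown in the table below. The custom operators BERT needs, such as CustomEmbLayerNormPluginDynamic and CustomSkipLayerNormPluginDynamic, are not available in 7.0, so TensorRT 7 cannot process the BERT model directly.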
Version | Registered plugin creators
TensorRT-8.0 | CustomSkipLayerNormPluginDynamic CustomEmbLayerNormPluginDynamic CustomFCPluginDynamic RnRes2Br1Br2c_TRT GroupNormalizationPlugin CustomQKVToContextPluginDynamic CustomGeluPluginDynamic CgPersistentLSTMPlugin_TRT SingleStepLSTMPlugin RnRes2Br2bBr2c_TRT RnRes2FullFusion_TRT InstanceNormalization_TRT GridAnchor_TRT GridAnchorRect_TRT NMS_TRT Reorg_TRT Region_TRT Clip_TRT LReLU_TRT PriorBox_TRT Normalize_TRT ScatterND RPROI_TRT BatchedNMS_TRT BatchedNMSDynamic_TRT FlattenConcat_TRT CropAndResize DetectionLayer_TRT EfficientNMS_ONNX_TRT EfficientNMS_TRT Proposal ProposalLayer_TRT PyramidROIAlign_TRT ResizeNearest_TRT Split SpecialSlice_TRT InstanceNormalization_TRT
TensorRT-7.0 | RnRes2Br2bBr2c_TRT RnRes2Br1Br2c_TRT CgPersistentLSTMPlugin_TRT SingleStepLSTMPlugin GridAnchor_TRT NMS_TRT Reorg_TRT Region_TRT Clip_TRT LReLU_TRT PriorBox_TRT Normalize_TRT RPROI_TRT BatchedNMS_TRT FlattenConcat_TRT CropAndResize DetectionLayer_TRT Proposal ProposalLayer_TRT PyramidROIAlign_TRT ResizeNearest_TRT Split SpecialSlice_TRT InstanceNormalization_TRT
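Spotting the difference between the two lists by eye is tedious; a set difference makes it explicit. The sketch below pastes in the two lists from the table and prints the creators that appear only in the 8.0 output. The variable names are my own.

```python
# Plugin creator names copied from the table above.
TRT8 = """
CustomSkipLayerNormPluginDynamic CustomEmbLayerNormPluginDynamic CustomFCPluginDynamic
RnRes2Br1Br2c_TRT GroupNormalizationPlugin CustomQKVToContextPluginDynamic
CustomGeluPluginDynamic CgPersistentLSTMPlugin_TRT SingleStepLSTMPlugin RnRes2Br2bBr2c_TRT
RnRes2FullFusion_TRT InstanceNormalization_TRT GridAnchor_TRT GridAnchorRect_TRT NMS_TRT
Reorg_TRT Region_TRT Clip_TRT LReLU_TRT PriorBox_TRT Normalize_TRT ScatterND RPROI_TRT
BatchedNMS_TRT BatchedNMSDynamic_TRT FlattenConcat_TRT CropAndResize DetectionLayer_TRT
EfficientNMS_ONNX_TRT EfficientNMS_TRT Proposal ProposalLayer_TRT PyramidROIAlign_TRT
ResizeNearest_TRT Split SpecialSlice_TRT InstanceNormalization_TRT
""".split()

TRT7 = """
RnRes2Br2bBr2c_TRT RnRes2Br1Br2c_TRT CgPersistentLSTMPlugin_TRT SingleStepLSTMPlugin
GridAnchor_TRT NMS_TRT Reorg_TRT Region_TRT Clip_TRT LReLU_TRT PriorBox_TRT Normalize_TRT
RPROI_TRT BatchedNMS_TRT FlattenConcat_TRT CropAndResize DetectionLayer_TRT Proposal
ProposalLayer_TRT PyramidROIAlign_TRT ResizeNearest_TRT Split SpecialSlice_TRT
InstanceNormalization_TRT
""".split()

# Names registered in 8.0 but absent from 7.0, and the reverse.
only_in_8 = sorted(set(TRT8) - set(TRT7))
only_in_7 = sorted(set(TRT7) - set(TRT8))

if __name__ == "__main__":
    print("only in TensorRT-8.0:")
    for name in only_in_8:
        print("  " + name)
    print("only in TensorRT-7.0:", only_in_7 or "none")
```

The BERT-related Custom*PluginDynamic creators all fall into the 8.0-only group, while every creator listed for 7.0 is also present in 8.0.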
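IV. Conclusion
TensorRT 8.0 does improve on TensorRT 7.0. For typical CV models the gain is roughly 10% in fp32 mode and larger in fp16 mode. The gain is largest on the BERT fp16 model, where performance improves several-fold.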