[Tech Share] Caffe2 Distributed Platform Installation and Deployment Guide

Posted by Hua Xu on 2019/12/31 15:11:21

[Abstract] This article sets up a distributed Caffe2 environment that supports multi-machine, multi-GPU neural network training, and finishes with a distributed training run that uses ResNet50 as the example.

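1 Environment Preparation

1.1 System Preparation

1.1.1 Physical Server Environment

Three servers are used, with the following hardware configuration: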

Server	CPU	Memory	GPU	GPU memory
server1	Xeon E5-2680 v4	440 GB	Tesla P100	16 GB
server2	Xeon E5-2680 v4	440 GB	Tesla P100	16 GB
server3	Xeon E5-2680 v4	440 GB	Tesla P4	8 GB

The system environment is as follows:

Server	OS	IP	CUDA version
server1	Ubuntu 16.04	192.168.133.10	8.0
server2	Ubuntu 16.04	192.168.133.11	8.0
server3	Ubuntu 16.04	192.168.133.12	8.0

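Install Docker on every server, then install the nvidia-docker plugin. The steps are not repeated here; see https://docs.docker.com/install/linux/docker-ce/ubuntu/ and

https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#setting-up

1.1.2 Container Environment Preparation

Pick one physical host (server1 here), create an image on it, and compile Caffe2 inside it. Then save the image and push it to a private registry; the other physical hosts only need to pull the image to get a container that already contains Caffe2.

On server1, pull NVIDIA's official CUDA image from https://hub.docker.com/r/nvidia/cuda/. (The official Caffe2 image is not used directly, both to make it easier to compile Caffe2 ourselves and because the version in the official image is fairly old and may not meet our needs.) The devel variant with CUDA 8.0 is chosen because it allows Caffe2 to be compiled from source; the base and runtime variants cannot build CUDA applications from source and only run prebuilt ones.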

docker pull nvidia/cuda:8.0-devel-ubuntu16.04

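Start the container with nvidia-docker so that it can use the GPU.

nvidia-docker run -it --name caffe2 --net=host --hostname caffe2 --dns 8.8.8.8 -v /mnt:/mnt nvidia/cuda:8.0-devel-ubuntu16.04 bash

Container operating system: Ubuntu 16.04, 64-bit; user: root.

1.2 Network Configuration

The operating system inside the container needs a proxy to reach the apt sources and install packages.

This article uses the proxy http://192.168.5.18:3128. Edit the root user's .bashrc, or /etc/profile, and append the following lines: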

export http_proxy="http://192.168.5.18:3128"

export https_proxy="http://192.168.5.18:3128"

export ip_range=$(echo 192.168.79.{1..255} | sed 's/ /,/g')

export no_proxy="localhost,127.0.0.1,$ip_range,.huawei.com"

Then run source .bashrc or source /etc/profile.

Verify with curl baidu.com; if content is printed, the network is reachable.
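The no_proxy value above is produced by a shell brace expansion. As a minimal sketch of what that expansion yields, the Python below builds the same comma-separated string; the function name build_no_proxy and its parameters are illustrative assumptions, not part of the original setup.

```python
def build_no_proxy(subnet_prefix, leading, trailing, first=1, last=255):
    """Return a comma-separated no_proxy value covering subnet_prefix.first..last.

    build_no_proxy is a hypothetical helper that mirrors the shell lines above.
    """
    ips = ["{}.{}".format(subnet_prefix, i) for i in range(first, last + 1)]
    return ",".join(list(leading) + ips + list(trailing))


no_proxy = build_no_proxy(
    "192.168.79",
    leading=["localhost", "127.0.0.1"],
    trailing=[".huawei.com"],
)
print(no_proxy[:60] + "...")
```

Listing every address is verbose, but it follows the approach used in the article.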

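2 Deploying Caffe2

First compile and install Caffe2 in the container on a single node, then extend to the other nodes.

2.1 Installing Dependencies

2.1.1 Configuring the Package Sources

The image ships without an editor, so install vim first.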

apt-get update && apt-get install vim

Switch to a faster package mirror. First back up the existing sources:

cd /etc/apt && mv sources.list sources.list.bk

Download a sources.list for the 163 or Aliyun mirror, or write one with vim. Because the container runs Ubuntu 16.04, the entries use the xenial codename:

deb http://mirrors.163.com/ubuntu/ xenial main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ xenial-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ xenial-backports main restricted universe multiverse

Refresh the package lists:

apt-get update

2.1.2 Installing Dependency Packages

apt-get install openssh-server

apt-get install -y --no-install-recommends build-essential cmake git libgoogle-glog-dev libgtest-dev libiomp-dev libleveldb-dev liblmdb-dev libopencv-dev libopenmpi-dev libsnappy-dev libprotobuf-dev openmpi-bin openmpi-doc protobuf-compiler protobuf-c-compiler libgflags-dev python-dev python-pip python-setuptools graphviz

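This list adds a few necessary packages to the officially recommended dependencies.

2.1.3 Installing Python Dependencies

sudo pip install flask future hypothesis numpy protobuf pydot python-nvd3 pyyaml requests scikit-image scipy setuptools six tornado jupyter matplotlib

This list adds a few necessary libraries to the officially recommended ones.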

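2.1.4 Installing cuDNN

The image already contains CUDA, so only the cuDNN acceleration library needs to be installed. There are two ways to do this.

cuDNN installation, method 1:

Add the repository: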

echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list

Refresh the package lists and install:

apt-get update

apt-get install -y --no-install-recommends libcudnn7=7.1.2.21-1+cuda8.0 libcudnn7-dev=7.1.2.21-1+cuda8.0

rm -rf /var/lib/apt/lists/*

cuDNN installation, method 2:

Download the archive and unpack it. Get the cuDNN release built for CUDA 8.0 from https://developer.nvidia.com/rdp/cudnn-download

tar -xzvf cudnn-8.0-linux-x64-v7.1.tgz

cp cuda/include/cudnn.h /usr/local/cuda/include/

cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/

rm cudnn-8.0-linux-x64-v7.1.tgz && sudo ldconfig

Configure the environment variables:

export PATH=/usr/local/cuda-8.0/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

2.2 Compiling Caffe2

2.2.1 Downloading the Caffe2 Source Code

Download method 1:

cd /opt && git clone --recursive https://github.com/caffe2/caffe2.git

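Download method 2:

If the company's internal servers cannot download directly, install git on Windows and download there. Remember to set the proxy; see http://3ms.huawei.com/km/blogs/details/5098499. (Downloading the source zip from the browser leads to build errors later, because the zip omits the third-party source code. Those directories are git submodules and are only fetched when cloning with git and --recursive.)

2.2.2 Compiling and Installing Caffe2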

cd /opt/caffe2 && mkdir build && cd build

cmake ..

make install

The build may fail because the cmake version is too old; see Troubleshooting for the fix.

Environment variable settings

Add the following to /etc/profile:

export PYTHONPATH=/usr/local:$PYTHONPATH

export PYTHONPATH=/opt/caffe2/build:$PYTHONPATH

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

2.2.3 Testing Caffe2

cd ~ && python -c 'from caffe2.python import core' 2>/dev/null && echo "Success" || echo "Failure"

Success in the output means the installation works.

Check that the GPU is usable (this requires GPU support and a working CUDA installation).

python caffe2/python/operator_test/relu_op_test.py

python2 -c 'from caffe2.python import workspace; print(workspace.NumCudaDevices())'

A result greater than 0 means the GPU is available to Caffe2; otherwise an error is reported.
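The two one-line checks above can be folded into a small script that separates "Caffe2 is not importable" from "Caffe2 imports but sees no GPU". This is only a sketch: count_cuda_devices is a hypothetical helper name, and the only Caffe2 call it relies on is workspace.NumCudaDevices(), which is already used above.

```python
def count_cuda_devices():
    """Return the number of CUDA devices Caffe2 sees, or None if Caffe2 cannot be imported.

    count_cuda_devices is a hypothetical helper, not part of Caffe2.
    """
    try:
        from caffe2.python import workspace
    except ImportError:
        return None
    return workspace.NumCudaDevices()


n = count_cuda_devices()
if n is None:
    print("Caffe2 is not importable; check PYTHONPATH and LD_LIBRARY_PATH")
elif n == 0:
    print("Caffe2 imported, but no CUDA device is visible")
else:
    print("Caffe2 sees {} CUDA device(s)".format(n))
```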

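2.3 Deploying the NFS Shared Directory

Distributed training in Caffe2 needs a shared location to complete the rendezvous between nodes. Either NFS or Redis can be used; NFS is chosen here to get a working distributed environment quickly. Because the containers cannot mount the NFS share directly, the share is mounted on the physical hosts and mapped into the containers.

2.3.1 Setting Up the NFS Server

Choose one physical host as the NFS server, here 192.168.133.10, and export /export: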

apt-get install nfs-kernel-server

mkdir /export

chmod 777 /export

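So that the share survives a reboot, add it to /etc/exports. The export has to be mountable and writable from all three servers, so it is opened read-write to the 192.168.133.0/24 subnet:

/export 192.168.133.0/24(rw,fsid=0,insecure,no_subtree_check,async)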

Export the directory:

exportfs -a

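2.3.2 Mounting on the NFS Clients

Install the NFS package on all three physical hosts:

apt-get install nfs-kernel-server

Then mount the shared directory:

mount -t nfs 192.168.133.10:/export /mnt

For a container to access the NFS share, map /mnt into it when the container is started.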
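Before starting a training run, it can help to confirm that every node really sees the same writable directory. The sketch below writes one marker file per host and lists the markers that are visible. The helper names and the nfs_check_ file prefix are made up for illustration, and the demo runs against a temporary directory as a stand-in for /mnt.

```python
import os
import socket
import tempfile


def write_marker(shared_dir):
    """Write a small marker file named after this host and return its path."""
    name = "nfs_check_{}.txt".format(socket.gethostname())
    path = os.path.join(shared_dir, name)
    with open(path, "w") as f:
        f.write("ok")
    return path


def list_markers(shared_dir):
    """Return the marker files currently visible in the shared directory."""
    return sorted(n for n in os.listdir(shared_dir) if n.startswith("nfs_check_"))


# Stand-in for /mnt so that the sketch runs anywhere; on a real node pass "/mnt".
demo_dir = tempfile.mkdtemp()
marker = write_marker(demo_dir)
markers = list_markers(demo_dir)
print(markers)
```

Run on each of the three nodes against /mnt, the listing should eventually show one marker per host; a missing marker points at a mount or permission problem on that host.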

Commit the container as an image, tag it, and push it to the private registry so that it can be started on the other hosts.

docker push 192.168.133.11:5000/caffe2:latest

The image now contains the compiled Caffe2 and nfs-kernel.

Stop the earlier container and start a new one from the new image:

nvidia-docker run -it --name caffe2-dis --net=host --hostname caffe2 --dns 8.8.8.8 -v /mnt:/mnt 192.168.133.11:5000/caffe2:latest bash

On the other two nodes, pull the image that was just created:

docker pull 192.168.133.11:5000/caffe2:latest

nvidia-docker run -it --name caffe2-dis --net=host --hostname caffe2 --dns 8.8.8.8 -v /mnt:/mnt 192.168.133.11:5000/caffe2:latest bash

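This completes a distributed environment that contains Caffe2 and can access the NFS shared directory.

2.4 Troubleshooting

(1) The build fails because the cmake version is too old.

Solution: install the latest cmake from source.

Download the latest source package from http://www.cmake.org/download/, for example cmake-3.10.3.tar.gz, then unpack and install it:

tar -xzvf cmake-3.10.3.tar.gz

cd cmake-3.10.3

./bootstrap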

make

make install

(2) On Linux, cloning the Caffe2 source with git may fail with a certificate error.

Solution: add github to the trust list.

export GIT_SSL_NO_VERIFY=1

sudo update-ca-certificates

echo -n | openssl s_client -showcerts -connect github.com:443 2>/dev/null | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p'

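(3) make install fails on the Caffe2 source with

[third_party/onnx/onnx/onnx_onnx_c2.pb.cc] Error 1

Solution:

Download, compile, and install a recent version of Protobuf from https://github.com/google/protobuf/releases/.

First remove the older packages installed earlier:

apt-get remove libprotobuf-dev protobuf-compiler

Then unpack, compile, and install version 3.5.1:

tar -xzvf protobuf-all-3.5.1.tar.gz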

cd protobuf-3.5.1

./autogen.sh

./configure

make

make check

make install

If protoc is not in /usr/bin/, add symbolic links:

ln -s /usr/local/bin/protoc /usr/bin/protoc

ln -s /usr/local/lib/libprotobuf.so /usr/lib64/libprotobuf.so

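Otherwise the build still fails.

(5) ImportError: cannot import name caffe2_pb2

This is an environment variable problem. Set PATH, PYTHONPATH, LD_LIBRARY_PATH and the related variables as described on the official site, and use env to check for problems such as stray extra colons.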
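Stray colons are easy to miss by eye. An export such as PYTHONPATH=/usr/local:$PYTHONPATH leaves a trailing colon whenever the variable was empty beforehand. The sketch below reports the positions of empty entries in the three variables; empty_entries is a hypothetical helper name.

```python
import os


def empty_entries(value):
    """Return the positions of empty entries in a colon-separated path list."""
    if not value:
        return []
    return [i for i, part in enumerate(value.split(":")) if part == ""]


# A value like the one produced when PYTHONPATH was empty before the export.
sample = "/usr/local:/opt/caffe2/build:"
print(empty_entries(sample))

for name in ("PATH", "PYTHONPATH", "LD_LIBRARY_PATH"):
    positions = empty_entries(os.environ.get(name, ""))
    if positions:
        print("{} has empty entries at positions {}".format(name, positions))
```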

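(6) cmake reports that the cuDNN library cannot be found

This happens when the unpacked cuDNN files were not correctly added under /usr/local/cuda/. Run find / -name cudnn.h to check whether the cudnn.h header exists and confirm that cuDNN is installed correctly. Do not use the method from the official site of unpacking directly into the target directory, which leaves the library files missing; copy the files over as described earlier in this article.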
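Once cudnn.h has been located, its version macros show which cuDNN release is actually installed. The sketch below assumes the cuDNN 7 header layout, in which cudnn.h itself defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL. The helper name parse_cudnn_version is hypothetical, and the demo parses an inline sample rather than a real header.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None if a macro is missing."""
    fields = []
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(r"#define\s+CUDNN_" + name + r"\s+(\d+)", header_text)
        if match is None:
            return None
        fields.append(int(match.group(1)))
    return tuple(fields)


# Inline sample standing in for the contents of /usr/local/cuda/include/cudnn.h.
sample_header = (
    "#define CUDNN_MAJOR 7\n"
    "#define CUDNN_MINOR 1\n"
    "#define CUDNN_PATCHLEVEL 2\n"
)
version = parse_cudnn_version(sample_header)
print(version)
```

On a real system the header text would come from open("/usr/local/cuda/include/cudnn.h").read().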

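(7) ImportError: No module named _tkinter, please install the python-tk package

Solution: sudo apt-get install python-tk

(8) With anaconda2 installed, Python dependencies may not be found, so cmake cannot locate numpy and other Python packages.

Solution: add the matching Python path to PYTHONPATH, for example /root/anaconda2/lib/python2.7/site-packages, and in this setup use conda to install and manage the required dependencies.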

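(9) WARNING:root:Debug message: /root/anaconda2/bin/../lib/libstdc++.so.6: version `CXXABI_1.3.8' not found

Solution:

The system copy of libstdc++.so.6 is located at

/usr/lib/x86_64-linux-gnu/libstdc++.so.6

The error occurs because the file also exists elsewhere, for example in Anaconda, and the Anaconda copy is older than the system one (inspect it with strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6). The fix is to replace the Anaconda copy with the system one: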

mv /root/anaconda2/lib/libstdc++.so.6 /root/anaconda2/lib/libstdc++.so.6.bk

cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 /root/anaconda2/lib/
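To compare the two copies of libstdc++ without reading through the whole strings output, the CXXABI version tags can be pulled out directly. This is a rough equivalent of strings piped through grep CXXABI, not an ELF parser. The helper name cxxabi_versions is hypothetical, and the demo scans a small synthetic file instead of a real library.

```python
import os
import re
import tempfile


def cxxabi_versions(path):
    """Return the numeric CXXABI_x.y.z tags found in the file at path, oldest first."""
    with open(path, "rb") as f:
        data = f.read()
    tags = set(t.decode("ascii") for t in re.findall(br"CXXABI_(\d+(?:\.\d+)*)", data))
    return sorted(tags, key=lambda t: [int(p) for p in t.split(".")])


# Synthetic stand-in for a libstdc++.so.6 binary.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00CXXABI_1.3\x00CXXABI_1.3.10\x00CXXABI_1.3.8\x00CXXABI_TM_1\x00")

versions = cxxabi_versions(demo_path)
print(versions)
```

Running it on both the system library and the Anaconda copy shows which one lacks the 1.3.8 tag named in the warning.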

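3 ResNet50 Training in Practice

ResNet50 is the example used in the multi-GPU training guide on the Caffe2 site. The network is used for image recognition and often serves as a benchmark for neural network training performance. The full dataset is ImageNet 1K, but it is very large (about 300 GB) and training takes too long with few GPUs (about a week on two GPUs), so a subset is used here. The training set has 640 car images and 640 boat images, 1280 in total; the test set has 48 car images and 48 boat images, 96 in total. The whole dataset is about 130 MB.

3.1 Single-Machine Training Test

Build and train ResNet50 on a single machine first; this also downloads the data that the distributed run uses later. (Tutorial: https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/Multi-GPU_Training.ipynb) (For a single-machine exercise you can also follow the MNIST handwritten digit recognition tutorial: https://github.com/caffe2/tutorials/blob/master/MNIST.ipynb)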

3.1.1 Importing the Data

First, download and unpack the data in code (this can also be done manually):

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

from caffe2.python import core, workspace, model_helper, net_drawer, memonger, brew
from caffe2.python import data_parallel_model as dpm
from caffe2.python.models import resnet
from caffe2.proto import caffe2_pb2

import numpy as np
import time
import os
from IPython import display

workspace.GlobalInit(['caffe2', '--caffe2_log_level=2'])

# This section checks if you have the training and testing databases
current_folder = os.path.join(os.path.expanduser('~'), 'caffe2_notebooks')
data_folder = os.path.join(current_folder, 'tutorial_data', 'resnet_trainer')

# Train/test data
train_data_db = os.path.join(data_folder, "imagenet_cars_boats_train")
train_data_db_type = "lmdb"
# actually 640 cars and 640 boats = 1280
train_data_count = 1280
test_data_db = os.path.join(data_folder, "imagenet_cars_boats_val")
test_data_db_type = "lmdb"
# actually 48 cars and 48 boats = 96
test_data_count = 96

# Get the dataset if it is missing
def DownloadDataset(url, path):
    import requests, zipfile, StringIO
    print("Downloading {} ... ".format(url))
    r = requests.get(url, stream=True)
    z = zipfile.ZipFile(StringIO.StringIO(r.content))
    z.extractall(path)
    print("Done downloading to {}!".format(path))

# Make the data folder if it doesn't exist
if not os.path.exists(data_folder):
    os.makedirs(data_folder)
else:
    print("Data folder found at {}".format(data_folder))

# See if you already have the db, and if not, download it
if not os.path.exists(train_data_db):
    DownloadDataset("http://download.caffe2.ai/databases/resnet_trainer.zip", data_folder)

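The code downloads the ImageNet subset from the official site and unpacks it into ~/caffe2_notebooks/tutorial_data/resnet_trainer.

Afterwards the folder contains two subfolders, imagenet_cars_boats_train and imagenet_cars_boats_val, which hold the training and test datasets. The code also records the location, database type, and size of both datasets.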
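The download helper in the tutorial code uses the Python 2 StringIO module. If the same step has to run under Python 3, the unpacking part can be written with io.BytesIO instead. The sketch below shows only that part and exercises it on a zip built in memory, so it needs no network access; extract_zip_bytes is a hypothetical helper name.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(content, path):
    """Extract an in-memory zip archive (bytes) into path and return the member names."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(path)
        return archive.namelist()


# Build a tiny zip in memory as a stand-in for the downloaded archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello")

target = tempfile.mkdtemp()
names = extract_zip_bytes(buffer.getvalue(), target)
print(names)
```

In a real download, the bytes would come from requests.get(url).content, as in the tutorial code.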

3.1.2 Configuring the Training Parameters

Configure the parameters used to train the network:

# Configure how you want to train the model and with how many GPUs
# This is set to use one GPU in a single machine; if you have more GPUs, extend the array [0, 1, 2, n]
gpus = [0]

# Batch size of 32 sums up to roughly 5GB of memory per device
batch_per_device = 32
total_batch_size = batch_per_device * len(gpus)

# This model discriminates between two labels: car or boat
num_labels = 2

# Initial learning rate (scale with total batch size)
base_learning_rate = 0.0004 * total_batch_size

# only intends to influence the learning rate after 10 epochs
stepsize = int(10 * train_data_count / total_batch_size)

# Weight decay (L2 regularization)
weight_decay = 1e-4

This sets the GPUs to use, the per-device batch size, the total batch size (equal to the per-device batch size with a single GPU), the number of labels (two classes, boats and cars), the learning rate, the step size, and the weight decay.
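To make the arithmetic behind these settings concrete, the sketch below recomputes the derived values for one and two GPUs using the same formulas. The function training_schedule and its parameter names are illustrative only.

```python
def training_schedule(num_gpus, batch_per_device=32, train_data_count=1280,
                      lr_per_sample=0.0004, epochs_per_step=10):
    """Derive total batch size, base learning rate and stepsize from the GPU count."""
    total_batch_size = batch_per_device * num_gpus
    return {
        "total_batch_size": total_batch_size,
        "base_learning_rate": lr_per_sample * total_batch_size,
        "stepsize": int(epochs_per_step * train_data_count / total_batch_size),
    }


one_gpu = training_schedule(1)
two_gpus = training_schedule(2)
print(one_gpu)
print(two_gpus)
```

With one GPU the total batch size is 32, the base learning rate is about 0.0128, and the step size is 400 iterations; doubling the GPUs doubles the learning rate and halves the step size.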

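3.1.3 Building the Network and Training

Create the model and reset the workspace so that data from a previous training run does not interfere (this matters when running for the second time).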

train_model = model_helper.ModelHelper(name="resnet_test")

workspace.ResetWorkspace()

3.1.4 Reading the Data

Create a database reader that reads the training data from the location specified earlier.

reader = train_model.CreateDB("train_reader", db=train_data_db, db_type=train_data_db_type)

3.1.5 Image Input

Define how the raw input images are processed.

def add_image_input_ops(model):
    # utilize the ImageInput operator to prep the images
    data, label = brew.image_input(
        model,
        reader,
        ["data", "label"],
        batch_size=batch_per_device,
        # mean: to remove color values that are common
        mean=128.,
        # std is going to be modified randomly to influence the mean subtraction
        std=128.,
        # scale to rescale each image to a common size
        scale=256,
        # crop to the square each image to exact dimensions
        crop=224,
        # not running in test mode
        is_test=False,
        # mirroring of the images will occur randomly
        mirror=1
    )
    # prevent back-propagation: optional performance improvement; may not be observable at small scale
    data = model.net.StopGradient(data, data)

3.1.6 Defining the ResNet50 Model Creation Function

def create_resnet50_model_ops(model, loss_scale=1.0):
    # Creates a residual network
    [softmax, loss] = resnet.create_resnet50(
        model,
        "data",
        num_input_channels=3,
        num_labels=num_labels,
        label="label",
    )
    prefix = model.net.Proto().name
    loss = model.net.Scale(loss, prefix + "_loss", scale=loss_scale)
    brew.accuracy(model, [softmax, "label"], prefix + "_accuracy")
    return [loss]

This calls resnet.create_resnet50 to create the network. It is the official Caffe2 implementation; read its source for a deeper understanding.

3.1.7 Defining the Parameter Update Function

def add_parameter_update_ops(model):
    brew.add_weight_decay(model, weight_decay)
    iter = brew.iter(model, "iter")
    lr = model.net.LearningRate(
        [iter],
        "lr",
        base_lr=base_learning_rate,
        policy="step",
        stepsize=stepsize,
        gamma=0.1,
    )
    # Momentum SGD update
    for param in model.GetParams():
        param_grad = model.param_to_grad[param]
        param_momentum = model.param_init_net.ConstantFill(
            [param], param + '_momentum', value=0.0
        )
        # Update param_grad and param_momentum in place
        model.net.MomentumSGDUpdate(
            [param_grad, param_momentum, lr, param],
            [param_grad, param_momentum, param],
            momentum=0.9,
            # Nesterov Momentum works slightly better than standard momentum
            nesterov=1,
        )
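The LearningRate operator above uses the "step" policy with gamma=0.1. As far as I understand that policy, the rate is multiplied by gamma once every stepsize iterations. The plain-Python sketch below illustrates that schedule under this assumption, with the single-GPU values from this article (base rate 0.0128, stepsize 400). The function step_learning_rate is illustrative and is not Caffe2 code.

```python
def step_learning_rate(iteration, base_lr, stepsize, gamma=0.1):
    """Learning rate under a step policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


base_lr = 0.0128   # 0.0004 * 32, the single-GPU value used in this article
stepsize = 400     # 10 epochs of 40 iterations each

for it in (0, 399, 400, 800):
    print(it, step_learning_rate(it, base_lr, stepsize))
```

With 1280 images and a batch size of 32, one epoch is 40 iterations, so a one-epoch run never reaches the first decay step at iteration 400.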

3.1.8 Gradient Memory Optimization

def optimize_gradient_memory(model, loss):
    model.net._net = memonger.share_grad_blobs(
        model.net,
        loss,
        set(model.param_to_grad.values()),
        # Due to memonger internals, we need a namescope here. Let's make one up; we'll need it later!
        namescope="imonaboat",
        share_activations=False)

3.1.9 Creating the Network and Training

Force the network to train on a specific GPU (the first one on this machine), then call the functions defined above to create the network and start training.

# We need to give the network context and force it to run on the first GPU even if there are more.
device_opt = core.DeviceOption(caffe2_pb2.CUDA, gpus[0])

# Here's where that NameScope comes into play
with core.NameScope("imonaboat"):
    # Picking that one GPU
    with core.DeviceScope(device_opt):
        # Run our reader, and create the layers that transform the images
        add_image_input_ops(train_model)
        # Generate our residual network and return the losses
        losses = create_resnet50_model_ops(train_model)
        # Create gradients for each loss
        blobs_to_gradients = train_model.AddGradientOperators(losses)
        # Kick off the learning and managing of the weights
        add_parameter_update_ops(train_model)
    # Optimize memory usage by consolidating where we can
    optimize_gradient_memory(train_model, [blobs_to_gradients[losses[0]]])

# Startup the network
workspace.RunNetOnce(train_model.param_init_net)

# Load all of the initial weights; overwrite lets you run this multiple times
workspace.CreateNet(train_model.net, overwrite=True)

num_epochs = 1
for epoch in range(num_epochs):
    # Split up the images evenly: total images / batch size
    num_iters = int(train_data_count / total_batch_size)
    for iter in range(num_iters):
        # Stopwatch start!
        t1 = time.time()
        # Run this iteration!
        workspace.RunNet(train_model.net.Proto().name)
        t2 = time.time()
        dt = t2 - t1
        # Stopwatch stopped! How'd we do?
        print((
            "Finished iteration {:>" + str(len(str(num_iters))) + "}/{}" +
            " (epoch {:>" + str(len(str(num_epochs))) + "}/{})" +
            " ({:.2f} images/sec)").
            format(iter+1, num_iters, epoch+1, num_epochs, total_batch_size/dt))

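Combine the code above into one script, or enter it step by step in an interactive Python session, to start training.

Note: some method calls in this code differ from the code on the official site, probably because the Caffe2 API has changed while the official documentation has not yet been updated. It is recommended to replace the model.xxx operator-creation methods with the brew.xxx helper functions, passing the model as the first argument.

3.2 Distributed Training Test

3.2.1 Downloading the ResNet50 Code

Once the single-machine test passes on every node, distributed testing can begin. The training data is the data downloaded above, and the code is the official example on GitHub, saved as resnet50_trainer.py.

(Source: https://github.com/caffe2/caffe2/blob/master/caffe2/python/examples/resnet50_trainer.py)

3.2.2 Important Note

For distributed training inside containers, edit /etc/hosts so that the container's hostname resolves to the IP of its own physical host, for example

192.168.133.10 caffe2

Without this change, Gloo cannot find the peer hosts and cannot establish socket connections (which suggests that Gloo communicates over the resolved IP).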
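A quick way to see what a node's hostname currently resolves to is to ask the resolver from Python. The sketch below uses only the standard socket module. The helper names resolve and is_loopback are hypothetical, and the idea that a loopback result is the likely culprit is my reading of the note above, not something Gloo documents here.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address the hostname resolves to, or None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.error:
        return None


def is_loopback(ip):
    """True if the address is in 127.0.0.0/8, which other nodes cannot reach."""
    return ip is not None and ip.startswith("127.")


own_name = socket.gethostname()
own_ip = resolve(own_name)
print("{} resolves to {}".format(own_name, own_ip))
if own_ip is None or is_loopback(own_ip):
    print("Hostname does not resolve to a routable address; check /etc/hosts")
```

On a correctly configured node in this article's setup, the printed address would be the physical host's IP, such as 192.168.133.10.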

3.2.3 Distributed Training

Run the command on each node in turn (with more nodes, use a script). The concrete commands here are:

First node:

time python resnet50_trainer.py --train_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_train/ --test_data ~/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_val/ --gpus 0 --num_labels 2 --base_learning_rate 0.0384 --batch_size 32 --epoch_size 1280 --num_epochs 10 --num_shards 3 --shard_id 0 --run_id 1234 --file_store_path=/mnt/

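Second node: the same command with --shard_id 1.

Third node: the same command with --shard_id 2.

The commands on the three nodes are almost identical; the only difference is the shard_id parameter, which uniquely identifies each node. The other parameters are explained in the next section.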
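Since the per-node commands differ only in shard_id, they can be generated rather than typed. The sketch below rebuilds the command shown for the first node and varies shard_id. The flags and paths are copied from that command, while the function trainer_command and its defaults are illustrative assumptions.

```python
DATA_ROOT = "~/caffe2_notebooks/tutorial_data/resnet_trainer"


def trainer_command(shard_id, num_shards=3, batch_size=32, run_id=1234,
                    file_store_path="/mnt/"):
    """Build the resnet50_trainer.py command line for one node (hypothetical helper)."""
    # Learning rate rule from this article: 0.0004 * total batch size over all nodes.
    base_lr = 0.0004 * batch_size * num_shards
    args = [
        "python resnet50_trainer.py",
        "--train_data {}/imagenet_cars_boats_train/".format(DATA_ROOT),
        "--test_data {}/imagenet_cars_boats_val/".format(DATA_ROOT),
        "--gpus 0",
        "--num_labels 2",
        "--base_learning_rate {:.4f}".format(base_lr),
        "--batch_size {}".format(batch_size),
        "--epoch_size 1280",
        "--num_epochs 10",
        "--num_shards {}".format(num_shards),
        "--shard_id {}".format(shard_id),
        "--run_id {}".format(run_id),
        "--file_store_path={}".format(file_store_path),
    ]
    return " ".join(args)


commands = [trainer_command(i) for i in range(3)]
for cmd in commands:
    print(cmd)
```

The learning rate of 0.0384 in the original command is 0.0004 times the total batch size of 96 (three nodes with 32 each).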

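3.2.4 Parameter Reference

The official site explains the parameters; my own understanding is as follows:

Parameter	My understanding
train_data	Required. Location of the training dataset; a folder is sufficient
test_data	Optional. Location of the test dataset; a folder is sufficient
db_type	Optional. Database type, lmdb by default
gpus	Optional. Comma-separated list of GPU IDs to use on this node, starting from 0
num_gpus	Optional. Number of GPUs on this node; can be used instead of gpus
num_channels	Optional. Number of color channels in the images, 3 by default
image_size	Pixel size of the input images (height or width; images are assumed square), 227 by default; may not cope with small images
num_labels	Number of labels in the data, 1000 by default; depends on the dataset and is set to 2 in the commands here
batch_size	Batch size summed over all GPUs on this node, not over all nodes; 32 per GPU by default, growing with the number of GPUs on the node
epoch_size	Number of inputs per epoch, 1500000 by default; can be customized, for example 1280 for the small dataset provided by the Caffe2 site
num_epochs	Number of epochs
base_learning_rate	Learning rate. The official recommendation is 0.0004 times the sum of batch_size over all nodes. The default of 0.1 assumes a total batch size of 256 across all nodes; adjust it to your own total batch size (not the batch size of this node)
weight_decay	Weight decay
num_shards	Number of machines in distributed training, 1 (single node) by default
shard_id	Shard ID of this node, 0 by default; set the first node to 0 and the following nodes to 1, 2, 3, ...
run_id	Run identifier used in distributed runs so that nodes can identify each other; must be the same on all nodes taking part in the run
redis_host	IP of the Redis server used for rendezvous
redis_port	Port of the Redis server
file_store_path	Location of the shared directory, a temporary folder used to synchronize between nodes as an alternative to Redis (use one or the other); the NFS directory mounted earlier is used here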

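3.3 A Simple Performance Test

Using resnet50_trainer.py, we measured distributed training performance in the physical environment set up above. The results are below (for the single-machine runs, drop num_shards and the other distributed-only options and adjust the batch size):

Servers	GPUs per server	Total GPUs	Batch size per node	Time (s)
1	1 (P100)	1	32	415.5
1	1 (P100)	1	64	585
1	1 (P4)	1	32	671
2	1	2 (P4 + P100)	32	516
2	1	2 (2x P100)	32	283
3	1	3 (2x P100 + P4)	32	410

These are single-run results without repetition, so the numbers are not rigorous; they only probe the distributed training capability, and parts of the GPU acceleration and network distribution were not tuned.

The P100 and the P4 differ considerably in performance, and on the P4 a batch size of 64 already produced an out of memory error. Because of the P4, multi-machine training that includes it is slower than the same setup without it: 516 s for P100 + P4 versus 415.5 s for a single P100, and 410 s for three nodes versus 283 s for two P100s. With synchronous stochastic gradient descent, homogeneous hardware is clearly preferable; otherwise efficiency suffers badly.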
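The effect of the P4 is easier to see as a speedup relative to the single-P100 run. The sketch below computes that ratio, and a per-GPU efficiency, from the batch-size-32 rows of the table. It assumes all runs processed the same amount of data, which the article does not state explicitly, and the helper names are illustrative.

```python
BASELINE_SECONDS = 415.5  # single P100, batch size 32

# (label, total GPUs, wall time in seconds), copied from the table above
runs = [
    ("1x P100", 1, 415.5),
    ("1x P4", 1, 671.0),
    ("P4 + P100", 2, 516.0),
    ("2x P100", 2, 283.0),
    ("2x P100 + P4", 3, 410.0),
]


def speedup(seconds, baseline=BASELINE_SECONDS):
    """Speedup relative to the single-P100 run (values above 1 are faster)."""
    return baseline / seconds


def efficiency(seconds, num_gpus, baseline=BASELINE_SECONDS):
    """Speedup divided by the number of GPUs."""
    return speedup(seconds, baseline) / num_gpus


for label, gpus, seconds in runs:
    print("{:<14} speedup {:.2f}  efficiency {:.2f}".format(
        label, speedup(seconds), efficiency(seconds, gpus)))
```

Under that assumption, two P100s give a speedup of about 1.47, while every configuration that includes the P4 stays close to or below 1.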

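3.4 Troubleshooting

Most problems during distributed training, such as connection error or Aborted (core dumped), are caused by network communication failures between nodes. First make sure each node can resolve its own hostname (caffe2 in this article), because the resolved IP is then used for communication; then make sure the nodes can reach each other. This resolves most training problems.

4 References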

https://caffe2.ai/docs/getting-started.html?platform=ubuntu&configuration=compile

https://caffe2.ai/docs/getting-started.html?platform=centos&configuration=cloud

https://github.com/caffe2/tutorials/blob/master/MNIST.ipynb

https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/Multi-GPU_Training.ipynb

https://github.com/caffe2/caffe2/blob/master/caffe2/python/examples/resnet50_trainer.py

https://blog.csdn.net/zziahgf/article/details/79022490

https://hub.docker.com/r/nvidia/cuda/

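[Notice] This content comes from a blogger of the Huawei Cloud developer community and does not represent the views or position of Huawei Cloud or the Huawei Cloud developer community. Reposts must state the source (Huawei Cloud community), the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find suspected plagiarism in this community, report it by email with supporting evidence to cloudbbs@huaweicloud.com; once verified, the community will remove the infringing content immediately.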
