PyTorch Automatic Mixed Precision (AMP)

Posted by 德鲁瓦 on 2022/04/15 10:24:39


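1. Automatic Mixed Precision Examples

Ordinarily, "automatic mixed precision training" means training with torch.cuda.amp.autocast and torch.cuda.amp.GradScaler together.
Instances of torch.cuda.amp.autocast enable autocasting for chosen regions. Autocasting automatically chooses the precision of GPU operations to improve performance while maintaining the model's overall accuracy.
Instances of torch.cuda.amp.GradScaler help perform the gradient scaling steps. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow.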
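To make "gradient underflow" concrete, here is a minimal sketch that uses only Python's standard library to round values to half precision. The helper `to_float16` and the sample values are illustrative assumptions, not part of PyTorch; the scale of 65536 is chosen because it matches GradScaler's documented default initial scale.

```python
import struct

def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_grad = 1e-8   # smaller than the smallest positive float16 (about 5.96e-8)
scale = 65536.0    # 2**16

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_float16(tiny_grad)

# Scaled first, the same gradient lands in float16's representable range.
scaled = to_float16(tiny_grad * scale)

# Dividing by the scale in higher precision recovers the original magnitude.
recovered = scaled / scale

print(unscaled, scaled, recovered)
```

The recovered value is only approximately equal to the original because float16 keeps about three significant decimal digits, but it is no longer zero, which is the property gradient scaling relies on.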

1.1 Typical Mixed Precision Training

import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
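The comments above say that scaler.step() skips the update when infs or NaNs are found and that scaler.update() adjusts the scale for the next iteration. The toy model below sketches that adjustment, assuming the policy GradScaler documents: multiply the scale by backoff_factor after a step with infs/NaNs, and by growth_factor after growth_interval consecutive clean steps. `ToyScaler` is a hypothetical class written for illustration, not PyTorch's implementation, and growth_interval is set to 3 only to keep the trace short (GradScaler's default is 2000).

```python
class ToyScaler:
    """Simplified model of a dynamic loss scale (illustration only)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=3):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0  # consecutive steps without infs/NaNs

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # The step was skipped: shrink the scale and restart the count.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps == self.growth_interval:
                # Enough clean steps in a row: try a larger scale.
                self.scale *= self.growth_factor
                self._clean_steps = 0

scaler = ToyScaler()
history = []
for found_inf in [False, False, False, True, False]:
    scaler.update(found_inf)
    history.append(scaler.scale)

print(history)
```

In this trace the scale doubles after three clean steps and is halved again by the step that reports an overflow.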

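1.2 Working with Unscaled Gradients

All gradients produced by scaler.scale(loss).backward() are scaled. If you wish to modify or inspect the parameters' .grad attributes between backward() and scaler.step(optimizer), you should unscale them first. For example, gradient clipping manipulates a set of gradients so that their global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) is less than or equal to some user-imposed threshold. If you attempt to clip without unscaling, the gradients' norm or maximum magnitude is also scaled, so the threshold you requested (which was meant for unscaled gradients) is invalid.
scaler.unscale_(optimizer) unscales the gradients held by optimizer's assigned parameters. If your model or models contain other parameters that were assigned to another optimizer (say optimizer2), you may call scaler.unscale_(optimizer2) separately to unscale those parameters' gradients as well.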
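A small numeric sketch of why the order matters. It uses plain Python lists in place of tensors; `clip_by_global_norm` is a hypothetical helper that mimics the arithmetic of global-norm clipping (scale every gradient by min(1, max_norm / total_norm)), and the gradient values and scale factor are made up for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads]

true_grads = [0.3, 0.4]   # global norm 0.5, already below the threshold
max_norm = 1.0
scale = 65536.0           # stand-in for the scaler's current scale

scaled_grads = [g * scale for g in true_grads]

# Wrong order: clip while still scaled, then unscale.
# The scaled norm (32768) looks huge, so the gradients are crushed.
wrong = [g / scale for g in clip_by_global_norm(scaled_grads, max_norm)]

# Right order: unscale first, then clip against the intended threshold.
right = clip_by_global_norm([g / scale for g in scaled_grads], max_norm)

print(math.hypot(*wrong), math.hypot(*right))
```

With the wrong order the effective gradient norm ends up near max_norm / scale instead of the true 0.5, which is why unscale_ is called before clipping.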

1.2.1 Gradient Clipping

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
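Notes: (1) scaler records that scaler.unscale_(optimizer) was already called for this optimizer in this iteration, so scaler.step(optimizer) knows not to unscale the gradients a second time before (internally) calling optimizer.step(). (2) unscale_ should be called only once per optimizer per step, and only after all gradients for that optimizer's assigned parameters have been accumulated. Calling unscale_ twice for the same optimizer within one iteration triggers a RuntimeError.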

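1.3 Working with Scaled Gradients

1.3.1 Gradient Accumulation

Gradient accumulation adds up gradients over an effective batch of size batch_per_iter * iters_to_accumulate (* num_procs if training is distributed). The scale should be calibrated for the effective batch, which means that inf/NaN checking, step skipping when inf/NaN gradients are found, and scale updates should all happen at effective-batch granularity. In addition, gradients should remain scaled, and the scale factor should remain constant, until the gradients for a given effective batch have been accumulated. If gradients are unscaled (or the scale factor changes) before accumulation is complete, the next backward pass adds scaled gradients to unscaled gradients (or to gradients scaled by a different factor), after which the accumulated unscaled gradients needed for the parameter update cannot be recovered.
Therefore, if you want to unscale_ gradients (for example, to allow clipping unscaled gradients), call unscale_ just before step, after all (scaled) gradients for the upcoming step have been accumulated. Likewise, call update only at the end of iterations in which you called step for a full effective batch.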
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
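The claim that unscaling too early makes the accumulated gradient unrecoverable can be checked with a few lines of arithmetic. This sketch uses scalar floats in place of gradient tensors; the values of g1, g2 and scale are arbitrary assumptions chosen so that every result is exact in floating point.

```python
scale = 1024.0
g1, g2 = 0.25, 0.5   # true gradients of two micro-batches

# Correct: keep everything scaled while accumulating, unscale once at the end.
acc = g1 * scale
acc += g2 * scale
correct = acc / scale      # equals g1 + g2

# Broken: unscale after the first micro-batch, then keep accumulating.
acc = g1 * scale
acc /= scale               # premature unscale
acc += g2 * scale          # scaled gradient added onto an unscaled one
broken = acc / scale       # equals g1 / scale + g2, so g1 is nearly lost

print(correct, broken)
```

No single division can repair the second accumulator, because its two terms carry different scale factors.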

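1.3.2 Gradient Penalty

A gradient penalty implementation commonly creates gradients using torch.autograd.grad(), combines them to create the penalty value, and adds the penalty value to the loss. Here is an example of an L2 penalty without gradient scaling or autocasting: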
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Creates gradients
        grad_params = torch.autograd.grad(outputs=loss,
                                          inputs=model.parameters(),
                                          create_graph=True)

        # Computes the penalty term and adds it to the loss
        grad_norm = 0
        for grad in grad_params:
            grad_norm += grad.pow(2).sum()
        grad_norm = grad_norm.sqrt()
        loss = loss + grad_norm

        loss.backward()

        # clip gradients here, if desired

        optimizer.step()
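To implement a gradient penalty with gradient scaling, the output tensor(s) passed to torch.autograd.grad() should be scaled. The resulting gradients are therefore scaled, and should be unscaled before being combined to create the penalty value. Also, the penalty term computation is part of the forward pass, and therefore should be inside an autocast context.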
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales the loss for autograd.grad's backward pass, producing scaled_grad_params
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # Creates unscaled grad_params before computing the penalty. scaled_grad_params are
        # not owned by any optimizer, so ordinary division is used instead of scaler.unscale_:
        inv_scale = 1. / scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # Computes the penalty term and adds it to the loss
        with autocast():
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        # Applies scaling to the backward call as usual.
        # Accumulates leaf gradients that are correctly scaled.
        scaler.scale(loss).backward()

        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

        # step() and update() proceed as usual.
        scaler.step(optimizer)
        scaler.update()
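As a sanity check on the manual unscaling step, the sketch below compares the penalty computed from unscaled gradients with the one computed from scaled-then-divided gradients. It runs on CPU in float32 without autocast or GradScaler; the tiny Linear model, the random data, and the constant `scale` (standing in for scaler.get_scale()) are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)
scale = 1024.0  # stand-in for scaler.get_scale()
params = list(model.parameters())

loss = torch.nn.functional.mse_loss(model(x), y)

# Reference: penalty from gradients of the unscaled loss.
grads = torch.autograd.grad(outputs=loss, inputs=params, create_graph=True)
penalty_ref = torch.sqrt(sum(g.pow(2).sum() for g in grads))

# Scaled path: differentiate the scaled loss, then divide the scale back out.
scaled_grads = torch.autograd.grad(outputs=loss * scale, inputs=params,
                                   create_graph=True)
inv_scale = 1.0 / scale
penalty = torch.sqrt(sum((g * inv_scale).pow(2).sum() for g in scaled_grads))

print(penalty_ref.item(), penalty.item())
```

Both penalties agree, and because create_graph=True keeps the graph, either one can be added to the loss and backpropagated through.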

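1.4 Working with Multiple Models, Losses, and Optimizers

If your network has multiple losses, you must call scaler.scale on each of them individually. If your network has multiple optimizers, you may call scaler.unscale_ on any of them individually, and you must call scaler.step on each of them individually.
However, scaler.update should be called only once, after all optimizers used in this iteration have been stepped.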
scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast():
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # (retain_graph here is unrelated to amp, it's present because in this
        # example, both backward() calls share some sections of graph.)
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # You can choose which optimizers receive explicit unscaling, if you
        # want to inspect or modify the gradients of the params they own.
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)

        scaler.update()
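Each optimizer checks its own gradients for infs/NaNs and decides independently whether to skip the step. This may result in one optimizer skipping the step while the other does not. Since step skipping occurs rarely (once every several hundred iterations), this should not impede convergence. If you observe poor convergence after adding gradient scaling to a multiple-optimizer model, please report a bug.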

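1.5 Working with Multiple GPUs

Only the usage of autocast changes here. GradScaler's usage is unaffected.

1.5.1 DataParallel in a Single Process

torch.nn.DataParallel spawns threads to run the forward pass on each device, and the autocast state is propagated into each of those threads.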
model = MyModel()
dp_model = nn.DataParallel(model)

# Sets autocast in the main thread
with autocast():
    # dp_model's internal threads will autocast.
    output = dp_model(input)
    # loss_fn also autocast
    loss = loss_fn(output)

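1.5.2 DistributedDataParallel, One GPU per Process

The documentation for torch.nn.parallel.DistributedDataParallel recommends one GPU per process for best performance. In this case, DistributedDataParallel does not spawn threads internally, so the usage of autocast and GradScaler is unaffected.

1.5.3 DistributedDataParallel, Multiple GPUs per Process

Here torch.nn.parallel.DistributedDataParallel may spawn a side thread to run the forward pass on each device, like torch.nn.DataParallel. Usage is the same as for torch.nn.DataParallel: apply autocast as part of your model's forward method to ensure it is enabled in the side threads.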

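1.6 Autocast and Custom Autograd Functions

If your network uses custom autograd functions (subclasses of torch.autograd.Function), changes are required for autocast compatibility if any function:
(1) takes multiple floating-point Tensor inputs,
(2) wraps any autocastable op, or
(3) requires a particular dtype (for example, if it wraps CUDA extensions that were compiled only for that dtype).
In all of these cases, if you are importing the function and cannot alter its definition, a safe fallback is to disable autocast and force execution in float32 (or the required dtype) at any point of use where errors occur: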
with autocast():
    ...
    with autocast(enabled=False):
        output = imported_function(input1.float(), input2.float())
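The same fallback pattern can be tried without a GPU. The sketch below assumes a PyTorch version that provides the device-generic torch.autocast context manager and uses CPU autocast with bfloat16 in place of torch.cuda.amp.autocast; `imported_function` is a hypothetical stand-in for a third-party function that only accepts float32.

```python
import torch

def imported_function(a, b):
    # Hypothetical third-party function that only supports float32 inputs.
    assert a.dtype == torch.float32 and b.dtype == torch.float32
    return a.mm(b)

x = torch.randn(4, 4)
w = torch.randn(4, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Inside the autocast region, mm runs in the lower-precision dtype.
    hidden = torch.mm(x, w)

    # Locally disable autocast and cast inputs back to float32 by hand.
    with torch.autocast(device_type="cpu", enabled=False):
        output = imported_function(hidden.float(), w.float())

print(hidden.dtype, output.dtype)
```

The explicit .float() calls matter: disabling autocast stops further casting but does not convert tensors that were already produced in lower precision.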
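If you are the function's author (or can alter its definition), a better solution is to use the torch.cuda.amp.custom_fwd() and torch.cuda.amp.custom_bwd() decorators.

1.6.1 Functions with Multiple Inputs or Autocastable Ops

Apply custom_fwd and custom_bwd (with no arguments) to forward and backward respectively. This ensures that forward executes with the current autocast state and that backward executes with the same autocast state as forward (which can prevent type mismatch errors):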
class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)
Now MyMM can be invoked anywhere, without disabling autocast or manually casting inputs:
mymm = MyMM.apply

with autocast():
    output = mymm(input1, input2)

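1.6.2 Functions That Need a Particular dtype

Consider a custom function that requires torch.float32 inputs. Apply custom_fwd(cast_inputs=torch.float32) to forward and custom_bwd (with no arguments) to backward. If forward runs in an autocast-enabled region, the decorators cast floating-point CUDA Tensor inputs to float32 and locally disable autocast during forward and backward: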
class MyFloat32Func(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, input):
        ctx.save_for_backward(input)
        ...
        return fwd_output

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        ...
Now MyFloat32Func can be invoked anywhere, without disabling autocast or manually casting inputs:
func = MyFloat32Func.apply

with autocast():
    # func will run in float32, regardless of the surrounding autocast state
    output = func(input)

References
(1) https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-models-losses-and-optimizers
