[Ascend] Intermittent mid-training failures when using nohup for background training on an NPU Snt9B bare-metal server, and how to fix them

Posted by modelarts-dev-server on 2023/11/21 16:11:38
[Abstract] When training a large model with PyTorch on a Huawei Cloud Snt9B bare-metal server, jobs launched in the background with nohup occasionally abort mid-run with a torch.distributed.elastic "Received 1 death signal" error. This post explains the root cause and gives a fix.

1. Problem description

When training a large model with PyTorch on a Huawei Cloud Snt9B bare-metal server, with the job started in the background via nohup, training occasionally aborts mid-run with the following error:

{'loss': 0.0759, 'learning_rate': 0.0005298913043478261, 'epoch': 3.15}

 79%|███████▉  | 4640/5888 [2:28:56<5:39:33, 16.32s/it]
 79%|███████▉  | 4641/5888 [2:29:10<5:25:33, 15.66s/it]
 79%|███████▉  | 4642/5888 [2:29:26<5:25:14, 15.66s/it]
 79%|███████▉  | 4643/5888 [2:29:43<5:36:54, 16.24s/it]
 79%|███████▉  | 4644/5888 [2:29:58<5:27:29, 15.80s/it]
 79%|███████▉  | 4645/5888 [2:30:12<5:15:11, 15.21s/it]
                                                       
{'loss': 0.1008, 'learning_rate': 0.0005277683423913043, 'epoch': 3.16}

 79%|███████▉  | 4645/5888 [2:30:12<5:15:11, 15.21s/it]
 79%|███████▉  | 4646/5888 [2:30:31<5:39:45, 16.41s/it]
 79%|███████▉  | 4647/5888 [2:30:46<5:26:51, 15.80s/it]WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637350 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637351 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637352 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637353 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637354 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637355 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637356 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637357 closing signal SIGHUP
Traceback (most recent call last):
  File "/root/miniconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    time.sleep(monitor_interval)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1637267 got signal: 1
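For reference, the traceback shows the job was launched through accelerate under nohup. A command of roughly the following shape reproduces the setup (the script name train.py, the log file, and the redirection are illustrative, not the exact command used):

nohup accelerate launch train.py > train.log 2>&1 &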

2. Root cause

nohup is meant to run a command in the background without it being hung up: it starts the child process with SIGHUP set to be ignored, so exiting the terminal should not affect the program.

However, there is a known issue when PyTorch training is launched this way: torch.distributed.elastic installs its own SIGHUP handler after startup (the _terminate_process_handler visible in the traceback above), which overrides the "ignore" disposition that nohup set. When the terminal session later closes or the SSH connection drops, the resulting SIGHUP (signal 1) is treated as a death signal, the elastic agent shuts down all workers, and multi-process training is intermittently interrupted:

https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720
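One way to confirm this on a live job is to inspect the caught-signal mask of the launcher process. This is a diagnostic sketch; replace <PID> with the PID of the accelerate launcher:

# SigCgt is the bitmask of signals the process catches. SIGHUP is signal 1,
# i.e. the lowest bit, so an odd final hex digit means the process will
# catch SIGHUP even though it was started under nohup.
grep SigCgt /proc/<PID>/status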

3. Solution

Use tmux instead of nohup for background training. A detached tmux session survives the terminal closing without relying on signal dispositions, so the job never receives the fatal SIGHUP. Usage is similar to nohup; a typical workflow is sketched after the link below. For the full command set and installation steps, see:

tmux tutorial (in Chinese): https://www.ruanyifeng.com/blog/2019/10/tmux.html
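A minimal tmux workflow for this case (the session name train is arbitrary; the training command is the same illustrative one as above):

# Create a named session and start training inside it
tmux new -s train
accelerate launch train.py        # your actual training command

# Detach with Ctrl+b then d; the session and the job keep running even if
# the terminal closes or the SSH connection drops.

# Later: reattach to check progress, list sessions, or clean up.
tmux attach -t train
tmux ls
tmux kill-session -t train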
