[Ascend] Solution for intermittent mid-training failures when running background training with nohup on NPU Snt9B bare-metal servers
1. Problem Description
When training a large model with the PyTorch framework on a Huawei Cloud Snt9B bare-metal server, with the job launched in the background via the nohup command, training occasionally aborts midway with the following error:
{'loss': 0.0759, 'learning_rate': 0.0005298913043478261, 'epoch': 3.15}
79%|███████▉ | 4640/5888 [2:28:56<5:39:33, 16.32s/it]
79%|███████▉ | 4641/5888 [2:29:10<5:25:33, 15.66s/it]
79%|███████▉ | 4642/5888 [2:29:26<5:25:14, 15.66s/it]
79%|███████▉ | 4643/5888 [2:29:43<5:36:54, 16.24s/it]
79%|███████▉ | 4644/5888 [2:29:58<5:27:29, 15.80s/it]
79%|███████▉ | 4645/5888 [2:30:12<5:15:11, 15.21s/it]
{'loss': 0.1008, 'learning_rate': 0.0005277683423913043, 'epoch': 3.16}
79%|███████▉ | 4645/5888 [2:30:12<5:15:11, 15.21s/it]
79%|███████▉ | 4646/5888 [2:30:31<5:39:45, 16.41s/it]
79%|███████▉ | 4647/5888 [2:30:46<5:26:51, 15.80s/it]WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637350 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637351 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637352 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637353 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637354 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637355 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637356 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1637357 closing signal SIGHUP
Traceback (most recent call last):
File "/root/miniconda3/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 985, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1637267 got signal: 1
2. Root Cause
nohup is meant to run a command in the background immune to hangups, so that exiting the terminal does not affect the running program.
However, PyTorch training launched with nohup is affected by a known issue: the torch.distributed.elastic launcher (the _terminate_process_handler visible in the traceback above) installs its own handler for SIGHUP, overriding the ignore disposition set by nohup. When the terminal session ends, the agent process still receives signal 1 (SIGHUP) and shuts down all workers, which is why multi-process training is occasionally interrupted.
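For reference, jobs started in the following way are affected. This is a minimal sketch; the script name, process count, and log file are illustrative placeholders rather than the exact command used in this environment:

# Hypothetical launch command: even under nohup, the elastic agent re-registers
# a SIGHUP handler, so closing the terminal can still terminate the workers.
nohup accelerate launch --multi_gpu --num_processes 8 train.py > train.log 2>&1 &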
3. Solution
Run the background training inside a tmux session instead of using nohup. tmux keeps the session alive on the server side, so disconnecting the terminal does not deliver SIGHUP to the training processes. Day-to-day usage is similar to nohup; installation and the basic commands are sketched below.
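A minimal sketch of the tmux workflow. The session name is arbitrary, the training command is the same illustrative placeholder as above, and the package manager depends on the server image (yum for EulerOS/CentOS-based images, apt for Ubuntu-based ones):

# Install tmux with the package manager that matches the OS image
yum install -y tmux          # EulerOS / CentOS
# apt-get install -y tmux    # Ubuntu / Debian

# Start a named session and launch training inside it
tmux new -s train
accelerate launch --multi_gpu --num_processes 8 train.py

# Detach from the session (training keeps running): press Ctrl+b, then d
# Reattach later to check progress
tmux attach -t train
# List existing sessions
tmux ls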