His reply:
The specific error message: the log shows that the p2p connection timed out.

[WARNING] PRE_ACT(56864,ffffb8d23a40,python):2022-09-29-10:03:31.439.039 [mindspore/ccsrc/backend/common/pass/communication_op_fusion.cc:198] GetAllReduceSplitSegment] Split threshold is 0. AllReduce nodes will take default fusion strategy.
[CRITICAL] GE(56864,ffffb8d23a40,python):2022-09-29-10:05:56.138.935 [mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:100] Distribute] davinci_model : load task fail, return ret: 1343225860
[CRITICAL] DEVICE(56864,ffffb8d23a40,python):2022-09-29-10:05:56.139.477 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:567] LoadTask] Distribute Task Failed, error msg: mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:100 Distribute] davinci_model : load task fail, return ret: 1343225860
[ERROR] DEVICE(56864,ffffb8d23a40,python):2022-09-29-10:05:56.139.597 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_context.cc:660] ReportErrorMessage] Ascend error occurred, error message:
EI9999: Inner Error!
EI9999 connected p2p timeout, timeout: 120 s. local logicDevid:0, remote physic id:4
The possible causes are as follows:
1. the connection between this device and the target device is abnormal
2. an exception occurred at the target devices
3. the ranktable is not matched. [FUNC:WaitP2PConnected][FILE:p2p_mgmt.cc][LINE:228]
[CRITICAL] DEVICE(56864,ffffb8d23a40,python):2022-09-29-10:05:56.139.626 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_context.cc:422] PreprocessBeforeRunGraph] Preprocess failed before run graph 0, error msg: mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:567 LoadTask] Distribute Task Failed, error msg: mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:100 Distribute] davinci_model : load task fail, return ret: 1343225860

Traceback (most recent call last):
  File "Distribute_Ascend_Mobilevit_train.py", line 121, in <module>
    MobileViT_train(args)
  File "Distribute_Ascend_Mobilevit_train.py", line 103, in MobileViT_train
    dataset_sink_mode=False)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 906, in train
    sink_size=sink_size)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 87, in wrapper
    func(self, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 542, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 794, in _train_process
    outputs = self._train_network(*next_element)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 586, in __call__
    out = self.compile_and_run(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 964, in compile_and_run
    self.compile(*inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 937, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1006, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
RuntimeError: mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_context.cc:422 PreprocessBeforeRunGraph] Preprocess failed before run graph 0, error msg: mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:567 LoadTask] Distribute Task Failed, error msg: mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:100 Distribute] davinci_model : load task fail, return ret: 1343225860

[ERROR] MD(56864,fffec1ffb1e0,python):2022-09-29-10:06:01.350.188 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg (more detail in info level log): Exception thrown from PyFunc. The actual amount of data read from generator 460 is different from generator.len 160146, you should adjust generator.len to make them match.
Line of code : 217
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
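Cause 3 in the EI9999 message (a mismatched ranktable) is the easiest to rule out first. Below is a minimal sanity-check sketch, assuming the standard HCCL rank table v1.0 layout (server_list → device → {device_id, device_ip, rank_id}); the fallback path is hypothetical:

import json
import os

# Hypothetical fallback path; RANK_TABLE_FILE is the variable the launch
# script normally exports.
rank_table_path = os.environ.get("RANK_TABLE_FILE", "/path/to/hccl_8p.json")
with open(rank_table_path) as f:
    table = json.load(f)

devices = [d for server in table["server_list"] for d in server["device"]]
rank_ids = sorted(int(d["rank_id"]) for d in devices)

# Every rank in [0, n) should appear exactly once, and the table should cover
# every device the job launches (the log shows local device 0 waiting on
# remote physical id 4, so at least 5 devices must be described).
assert rank_ids == list(range(len(devices))), "rank_id gaps or duplicates"
expected = int(os.environ.get("RANK_SIZE", len(devices)))
assert len(devices) == expected, "rank table size != RANK_SIZE"
print(f"rank table describes {len(devices)} devices, rank_ids {rank_ids}")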
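If the rank table itself is consistent, the next thing to verify is that every participating process is launched with matching environment variables and actually reaches init(); device 0 timing out while waiting for remote physical id 4 typically points at the peer process never starting (cause 2 in the message). A minimal per-process setup sketch for MindSpore 1.x on Ascend, assuming the launch script exports DEVICE_ID, RANK_ID, RANK_SIZE, and RANK_TABLE_FILE for each process:

import os
from mindspore import context
from mindspore.communication import init
from mindspore.context import ParallelMode

# Bind this process to its own NPU before initializing HCCL.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend",
                    device_id=int(os.environ["DEVICE_ID"]))
init()  # HCCL reads RANK_TABLE_FILE, RANK_ID and RANK_SIZE from the environment
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=int(os.environ["RANK_SIZE"]))

If any one of the eight processes crashes before init(), the surviving ranks wait out the 120 s p2p window and fail exactly as in the log, so the per-rank logs of the remote device are worth checking too.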
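The trailing MD error is a separate problem: the GeneratorDataset source declared a length of 160146 but only produced 460 items, so the minddata pipeline aborts partway through the epoch. The fix is to make __len__ agree with what the source can actually return. A minimal sketch with a hypothetical random-access source:

import numpy as np
import mindspore.dataset as ds

class MySource:
    """Hypothetical random-access source for GeneratorDataset."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        # Must succeed for every index in range(len(self)).
        return (np.asarray(self.data[index], dtype=np.float32),)

    def __len__(self):
        # This is the value minddata compares against the number of items it
        # actually reads; it must equal the real item count, including the
        # effect of any filtering done inside the source itself.
        return len(self.data)

source = MySource(list(range(460)))
dataset = ds.GeneratorDataset(source, column_names=["x"], shuffle=False)
assert dataset.get_dataset_size() == len(source)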