[Distributed][MindSpore] Error reported after adding the auto-parallel fields

编程大赛最佳伴侣 (original poster), posted on 2021-06-10 16:39:29

When the distributed field below is not enabled, the training job runs normally:

```python
context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=False)
```

But after distributed training is started, a large number of error messages like these appear:

```bash
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.028.805 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.028.868 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.028.884 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.028.950 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.028.965 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.029.018 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.029.031 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
```

The final Traceback is shown below:

```bash
[ERROR] PARALLEL(117302,python3):2021-06-10-16:31:15.034.140 [mindspore/ccsrc/frontend/parallel/step_auto_parallel.cc:411] ConstructCostGraphNodesByUniqueId] The OperatorInfo: ReshapeInfo1919 does not match the Prim: Squeeze. The fullname_with_scope: Default/network-_VirtualDatasetCell/_backbone-WithLabelLossCell/_backbone-Cybertron/model-MolCT/interactions-CellList/2-NeuralInteractionUnit/Squeeze-op102
Traceback (most recent call last):
  File "Tutorial_04.py", line 116, in <module>
    model.train(n_epoch,ds_train,callbacks=[record_cb,ckpoint_cb],dataset_sink_mode=False)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 627, in train
    sink_size=sink_size)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 407, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 536, in _train_process
    outputs = self._train_network(*next_element)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 341, in __call__
    out = self.compile_and_run(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 608, in compile_and_run
    self.compile(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 595, in compile
    _executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 494, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/frontend/parallel/step_auto_parallel.cc:411 ConstructCostGraphNodesByUniqueId] The OperatorInfo: ReshapeInfo1919 does not match the Prim: Squeeze.
The fullname_with_scope: Default/network-_VirtualDatasetCell/_backbone-WithLabelLossCell/_backbone-Cybertron/model-MolCT/interactions-CellList/2-NeuralInteractionUnit/Squeeze-op102
```

What does this error mean? The training parameters we pass in are all converted to float32, and the last dimension is always set to 1, so there are no empty dimensions. However, we noticed that some of MindSpore's own parameters do appear to have an empty dimension, for example the ones below whose shapes end with a comma:

```bash
0 model.atom_embedding.embedding_table (100, 128)
1 model.dis_filter.linear.weight (128, 64)
2 model.dis_filter.linear.bias (128,)
3 model.dis_filter.residual.nonlinear.mlp.0.weight (128, 128)
4 model.dis_filter.residual.nonlinear.mlp.0.bias (128,)
5 model.dis_filter.residual.nonlinear.mlp.1.weight (128, 128)
6 model.dis_filter.residual.nonlinear.mlp.1.bias (128,)
7 model.interactions.0.positional_embedding.norm.gamma (128,)
8 model.interactions.0.positional_embedding.norm.beta (128,)
9 model.interactions.0.positional_embedding.x2q.weight (128, 128)
10 model.interactions.0.positional_embedding.x2k.weight (128, 128)
11 model.interactions.0.positional_embedding.x2v.weight (128, 128)
12 model.interactions.0.multi_head_attention.output.weight (128, 128)
13 model.interactions.1.positional_embedding.norm.gamma (128,)
14 model.interactions.1.positional_embedding.norm.beta (128,)
15 model.interactions.1.positional_embedding.x2q.weight (128, 128)
16 model.interactions.1.positional_embedding.x2k.weight (128, 128)
17 model.interactions.1.positional_embedding.x2v.weight (128, 128)
18 model.interactions.1.multi_head_attention.output.weight (128, 128)
19 model.interactions.2.positional_embedding.norm.gamma (128,)
20 model.interactions.2.positional_embedding.norm.beta (128,)
21 model.interactions.2.positional_embedding.x2q.weight (128, 128)
22 model.interactions.2.positional_embedding.x2k.weight (128, 128)
23 model.interactions.2.positional_embedding.x2v.weight (128, 128)
24 model.interactions.2.multi_head_attention.output.weight (128, 128)
25 readout.decoder.0.output.mlp.0.weight (64, 128)
26 readout.decoder.0.output.mlp.0.bias (64,)
27 readout.decoder.0.output.mlp.1.weight (1, 64)
28 readout.decoder.0.output.mlp.1.bias (1,)
29 readout.decoder.1.output.mlp.0.weight (64, 128)
30 readout.decoder.1.output.mlp.0.bias (64,)
31 readout.decoder.1.output.mlp.1.weight (1, 64)
32 readout.decoder.1.output.mlp.1.bias (1,)
33 readout.decoder.2.output.mlp.0.weight (64, 128)
34 readout.decoder.2.output.mlp.0.bias (64,)
35 readout.decoder.2.output.mlp.1.weight (1, 64)
36 readout.decoder.2.output.mlp.1.bias (1,)
37 readout.decoder.3.output.mlp.0.weight (64, 128)
38 readout.decoder.3.output.mlp.0.bias (64,)
39 readout.decoder.3.output.mlp.1.weight (1, 64)
40 readout.decoder.3.output.mlp.1.bias (1,)
```
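(For reference, a minimal sketch of the distributed setup this run appears to use. Only the `set_auto_parallel_context` call is quoted from the post; the Ascend target, HCCL initialization, and everything else here are assumptions for illustration.)

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init

# Assumed setup: Ascend device with the HCCL communication backend.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()

# The line quoted above: enable fully automatic parallel strategy search.
context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL,
                                  gradients_mean=False)
```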

Tags: Distributed, MindSpore


chengxiaoli (#2), posted on 2021-06-10 17:25:49

Welcome to MindSpore. Your question has been received; we will arrange technical support to help analyze it and get back to you as soon as possible.

编程大赛最佳伴侣 (#3), posted on 2021-06-11 10:14:39
In addition, if I choose `DATA_PARALLEL`, the error message changes to this:

```bash
[ERROR] DEVICE(67706,python3):2021-06-11-10:10:40.065.534 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:478] LoadTask] Distribute Task Failed
Traceback (most recent call last):
  File "Tutorial_04.py", line 116, in <module>
    model.train(n_epoch,ds_train,callbacks=[record_cb,ckpoint_cb],dataset_sink_mode=False)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 627, in train
    sink_size=sink_size)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 407, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 536, in _train_process
    outputs = self._train_network(*next_element)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 341, in __call__
    out = self.compile_and_run(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 608, in compile_and_run
    self.compile(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 595, in compile
    _executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 494, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:478 LoadTask] Distribute Task Failed
```

This is different from the error in `AUTO_PARALLEL` mode.
xiaoda (#4), posted on 2021-06-11 11:14:52

```python
from mindspore.parallel._cost_model_context import _set_algo_single_loop

_set_algo_single_loop(False)
```

You can add a setting like this to your script. The automatic parallel strategy search has the following feature: when the model contains a for loop, we assume the graph structure inside the loop is identical across iterations, which helps the strategy search.

This error means that the graph structures inside the for loop were found to differ. Calling `_set_algo_single_loop(False)` turns that feature off and unrolls the for loop into one large graph before searching for strategies.

Please try the parallel strategy search again with this setting, and let us know if the problem persists. Thanks.
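A minimal sketch of how this setting could sit alongside the parallel context configuration (only `_set_algo_single_loop(False)` comes from this reply; the ordering and the surrounding `set_auto_parallel_context` call are assumptions):

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.parallel._cost_model_context import _set_algo_single_loop

# Disable the "identical loop body" assumption so the for loop is unrolled
# into one large graph before the parallel strategy search (internal API).
_set_algo_single_loop(False)

context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL,
                                  gradients_mean=False)
```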

编程大赛最佳伴侣 (#5), posted on 2021-06-11 11:33:42
In reply to xiaoda (2021-6-11 11:14)
I added these two lines of configuration to the script:

```python
from mindspore.parallel._cost_model_context import _set_algo_single_loop
_set_algo_single_loop(False)
```

But it still fails, with the errors below.

`DATA` parallel mode:

```bash
[ERROR] DEVICE(79015,python3):2021-06-11-11:30:21.352.195 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:478] LoadTask] Distribute Task Failed
Traceback (most recent call last):
  File "Tutorial_04.py", line 119, in <module>
    model.train(n_epoch,ds_train,callbacks=[record_cb,ckpoint_cb],dataset_sink_mode=False)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 627, in train
    sink_size=sink_size)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 407, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 536, in _train_process
    outputs = self._train_network(*next_element)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 341, in __call__
    out = self.compile_and_run(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 608, in compile_and_run
    self.compile(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 595, in compile
    _executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 494, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:478 LoadTask] Distribute Task Failed
```

`AUTO` parallel mode:

```bash
Start training ...
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.370.165 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.370.226 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.370.238 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.370.296 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.370.338 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.370.376 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.370.393 [mindspore/ccsrc/frontend/parallel/ops_info/gather_v2_p_info.cc:268] CheckStrategy] GatherPInfo1212: Last dim of param slice shape need 32Byte aligned.
[ERROR] PARALLEL(78678,python3):2021-06-11-11:26:36.378.626 [mindspore/ccsrc/frontend/parallel/step_auto_parallel.cc:185] IsAutoParallelCareNode] Should implementing OperatorInfo for: Select
Traceback (most recent call last):
  File "Tutorial_04.py", line 119, in <module>
    model.train(n_epoch,ds_train,callbacks=[record_cb,ckpoint_cb],dataset_sink_mode=False)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 627, in train
    sink_size=sink_size)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 407, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 536, in _train_process
    outputs = self._train_network(*next_element)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 341, in __call__
    out = self.compile_and_run(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 608, in compile_and_run
    self.compile(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 595, in compile
    _executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 494, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/frontend/parallel/step_auto_parallel.cc:185 IsAutoParallelCareNode] Should implementing OperatorInfo for: Select
```
xiaoda (#6), posted on 2021-06-11 14:28:16

The "Should implementing OperatorInfo for: Select" error in auto_parallel mode means that a distributed version of the Select operator has not been implemented yet.

Under automatic parallelism, every operator needs a distributed implementation, and so far only part of the operators are covered; see https://www.mindspore.cn/doc/note/zh-CN/r1.2/operator_list_parallel.html . I can see that Select has already been implemented on the MindSpore master branch. Which version are you using? It may simply be that your version does not yet support Select.
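To confirm which release is installed before comparing it against that operator list, a minimal check (illustrative; any standard version check works):

```python
import mindspore

# The installed release, e.g. "1.2.0". The master branch is ahead of any
# released wheel, so an operator implemented there may be missing locally.
print(mindspore.__version__)
```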


编程大赛最佳伴侣 (#7), posted on 2021-06-11 14:53:30
In reply to xiaoda (2021-6-11 14:28)
I installed the latest mindspore-ascend directly with pip:

```bash
HwHiAiUser@ubuntu:~/mindspore/cybertroncode/tutorials$ python3 -m pip show mindspore-ascend
Name: mindspore-ascend
Version: 1.2.0
Summary: MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
Home-page: https://www.mindspore.cn
Author: The MindSpore Authors
Author-email: contact@mindspore.cn
License: Apache 2.0
Location: /usr/local/python3.7.5/lib/python3.7/site-packages
Requires: pillow, psutil, numpy, protobuf, asttokens, scipy, easydict, astunparse, packaging, wheel, setuptools, decorator, cffi, sympy
Required-by:
```

编程大赛最佳伴侣 (#8), posted on 2021-06-11 15:11:20
In reply to xiaoda (2021-6-11 14:28)
Also, my code only uses the following three "operators":

```python
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore.ops import composite as C
```

I checked them against the list you gave and none of them appear there. Does that mean distributed training cannot be used once these operators are involved?

xiaoda (#9), posted on 2021-06-11 15:33:08
Those are not operators. By "operators" I mean things like MatMul and ReLU.
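To make the distinction concrete, a minimal illustration (the variable names are just examples): `operations`, `functional`, and `composite` are namespaces, while the operators in the parallel support list are the primitives defined inside them, for example:

```python
from mindspore.ops import operations as P

# P is a namespace, not an operator; the operators checked against the
# parallel support list are individual primitives such as these:
matmul = P.MatMul()  # an operator
relu = P.ReLU()      # an operator
```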

编程大赛最佳伴侣 (#10), posted on 2021-06-11 15:51:21
In reply to xiaoda (2021-6-11 15:33)
I went through all of the code and compared it against the list you provided. The following are not on it; could you please help check whether these operators are causing the problem?

```python
mindspore.ops.operations.Ones
mindspore.ops.operations.Fill
mindspore.ops.operations.TensorSummary
mindspore.ops.operations.ScalarSummary
mindspore.ops.operations.GatherD
mindspore.ops.functional.Select
mindspore.ops.functional.mixed_precision_cast
mindspore.ops.composite.GradOperation
```