skywalk


Updated on 2021-01-09 12:46:22
[Practical notes] Huawei Geek Week, Ascend Wanli Model King Challenge: getting VQVAE to run


https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=92438

[Ascend Wanli: the Model King Challenge is here!] The Bronze stage is open, with generous prizes and honors waiting!

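I only heard about this competition when Liu Ming told me about the Geek Week "Light Up Shandong" campaign. Entering the model challenge would be good practice, would contribute to Light Up Shandong, and came with generous prizes, so I gladly accepted the Geek Week invitation. At the time I figured that getting a model to run could not be too hard, at least easier than reproducing a paper, so I could count on contributing some points to Light Up Shandong. As for the prizes, those are just passing clouds. Yes, that really is what I thought.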

 

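Challenge overview:

[Bronze stage] Now open. Developers pick one model from the list below, migrate its training script to run on the Ascend AI processor, and complete a single training step.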

 

After talking it over with jeff and other friends, I picked the vqvae model:

https://github.com/hiwonjoon/tf-vqvae/

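The reasons for choosing it:

1 Migrating from pytorch to Ascend means rewriting the code in mindspore, which we felt was harder than migrating from tf, so we only looked at tf models that need no rewrite.

2 The preparation stage requires getting the model script to run first:

Prepare the model script and apply for an Ascend environment:

Try to run the original model script in a local CPU/GPU environment, and send a screenshot of a successful run (it only needs to start successfully, not finish training) to the Ascend assistant, who will then allocate the environment for this event.

So we needed to get one model running locally as quickly as possible in order to apply for the Ascend environment. The Vqvae model uses the cifar-10 dataset, which we know well and which is small, so it is easier to get started with than the obscure datasets of the other models.

3 In addition, several of us planned to spread out over different models. Each model has only one grand prize, and all of us going after the same model would have meant too much infighting.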

 

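Of course, I only found out afterwards that vqvae was one of the less smooth models; the models with fewer pitfalls during migration were EAST and PSENET. But I am a single-core kind of person, so from then on my main effort stayed on VQVAE. It had every flavor, sweet, sour, bitter, and spicy, and is hard to sum up. What follows is the too-long-didn't-read part; if you just want to know whether the migration finally worked, scroll straight to the bottom.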

The battle to take the VQVAE model begins!

Getting VQVAE to run locally

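My usual approach is to grab the code and run it, then fix whatever is missing. Call it impatience.

First, git clone the source locally. Source: https://github.com/hiwonjoon/tf-vqvae

Running python mnist.py directly fails with:

no module named better_exceptions

Fix: install the missing module with !pip install better_exceptions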

 

Error:

module 'tensorflow' has no attribute 'set_random_seed'

Fix: change import tensorflow as tf to: import tensorflow.compat.v1 as tf

Later on, while debugging, I changed this line back in order to touch the source as little as possible and avoid trouble.

 

Error:

---> 21 from tensorflow.examples.tutorials.mnist import input_data

     22 mnist = input_data.read_data_sets(DATA_DIR, one_hot=True)

     23

ModuleNotFoundError: No module named 'tensorflow.examples.tutorials'

 

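Only then did I notice that the model targets tf 1.15, while my local install was a 2.x release (2.3.1, according to the pip output below). I decided to drop 2.x and go back to 1.15, which is also the tf version used on Ascend.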

Successfully uninstalled tensorflow-2.3.1

Successfully installed gast-0.2.2 keras-applications-1.0.8 tensorboard-1.15.0 tensorflow-1.15.0 tensorflow-estimator-1.15.1

 

Now the error is:

--> 194         from layers import GatedCNN

    195         self.X = tf.placeholder(tf.int32,[None,size,size])

    196

ModuleNotFoundError: No module named 'layers'

 

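Checking the code, I found that the local pixelcnn directory had no layers file, although the git source has one. It turns out that on github this directory is a link to another project, and git clone did not download it. The fix was to download layers and the few other files by hand and put them in the pixelcnn directory.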
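An empty directory that shows up on GitHub as a link to another project is what a git submodule looks like after a plain git clone. As a rough sketch, assuming pixelcnn really is a submodule (these notes suggest it but do not confirm it), a check like the one below can flag the problem before the import fails. The function names are made up for illustration.

```python
import os


def is_empty_dir(path):
    """Return True if `path` is an existing directory with no entries."""
    return os.path.isdir(path) and not os.listdir(path)


def submodule_hint(repo_dir, subdir="pixelcnn"):
    """Return a hint string if `subdir` is missing or empty, else None.

    Assumption: an empty directory here means an uninitialised git submodule.
    """
    target = os.path.join(repo_dir, subdir)
    if not os.path.exists(target) or is_empty_dir(target):
        return ("'%s' is missing or empty; if it is a git submodule, try: "
                "git submodule update --init --recursive" % subdir)
    return None
```

Calling `submodule_hint(".")` from the repository root before `python mnist.py` would print nothing useful when the directory is populated and a suggested command when it is not. Cloning with `git clone --recursive` in the first place avoids the situation entirely, if the directory is indeed a submodule.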

 

Error: pydm is missing, so I installed pydm.

 

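And with that it ran. I was thrilled! In fact most projects ship a requirements.txt, so a plain pip install -r requirements.txt does the job. Even when that file exists, though, I prefer adding modules one at a time, mainly because I can see how each install goes and I am not bound by the versions pinned in requirements.txt. Version conflicts between Python modules have always been a big pit: with luck you never fall in, without luck you just give up. So I would rather install them one by one by hand.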
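The run-and-fix-what-is-missing loop can be shortened a little by asking Python up front which modules it cannot find. This is a minimal sketch using only the standard library; the helper name is hypothetical and the module list in the usage note is just the set of names that came up in this post.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located.

    Only top-level names are supported here; nothing is actually imported.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["tensorflow", "better_exceptions"])` returns whichever of the two still needs a `pip install`. It does not check versions, so it would not have caught the tf 2.x versus 1.15 mismatch described above.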

 

Then I sent the screenshot to the assistant and was granted ascend resources. The rest of the migration was done on the Ascend system.

 

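To sum up, this step went fairly smoothly and the problems were all routine ones. What I ran here was the mnist dataset, whereas final acceptance for the competition requires the cifar10 dataset to pass. Still, I recommend getting mnist working first. Going from simple to complex and raising the difficulty gradually makes the task easier to finish, and I stuck to this idea throughout.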

 

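Starting the migration on Ascend

On ascend, python mnist.py ran without any debugging at all, which felt unbelievable at the time. I did hit a problem downloading the dataset, but the dataset is small, so uploading it by hand was fine. It went smoothly because I had already got it running locally, and also because at this point it was still running on the cpu only, with no migration work involved yet.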

 

Rome was not built in a day. Sure enough, after training finished it failed with:

Traceback (most recent call last):

  File "mnist.py", line 291, in <module>

    extract_z(**config)

  File "mnist.py", line 119, in extract_z

    x = tf.placeholder(tf.float32,[None,784])

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2619, in placeholder

    return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 6669, in placeholder

    "Placeholder", dtype=dtype, shape=shape, name=name)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper

    op_def=op_def)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func

    return func(*args, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op

    attrs, op_def, compute_device)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3401, in _create_op_internal

    self._check_not_finalized()

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 2998, in _check_not_finalized

    raise RuntimeError("Graph is finalized and cannot be modified.")

RuntimeError: Graph is finalized and cannot be modified.

 

Now entering random-edit mode: comment out line 87 and try again: no good

Comment out line 230: no good either

 

Looking at it now, the error is at line 119, which has x = tf.placeholder(tf.float32,[None,784])

I read up on why tf uses placeholder:

Why use tf.placeholder?

 

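Because every tensor value is an op on the graph,

when we split the training data into minibatches and feed them into the network for training,

each minibatch would be an op.

That would put far too many ops on one graph and incur a huge overhead;

hence tf.placeholder.

Each time, we can feed one minibatch into x = tf.placeholder(tf.float32,[None,32]),

and the next x fed in replaces the previous one.

So all the minibatches fed in as x produce only a single op

and no other redundant ops, which reduces the overhead of the graph.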

 

Source of the explanation above: https://www.jianshu.com/p/a23cf9be601f

From: Jianshu (简书)

 

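I then spent several days trying to solve this problem, but I could not really understand any of it and did not know what to change.

So I calmed down and looked for an introductory tf mnist example to study. While going through it carefully, I came across this line of code: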

tf.reset_default_graph()

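So I put this reset call before each block of tf training code and, hurrah, the program could at least move on! (The call replaces the default graph with a fresh, empty one, so the next stage no longer tries to add ops to a graph that has already been finalized.)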

 

Then I ran into this error:

2020-12-13 20:02:07,916 - logging at line 304 - INFO - end extract_z and train_prior

Traceback (most recent call last):

  File "mnist.py", line 305, in <module>

    train_prior(config=config,**config)

  File "mnist.py", line 214, in train_prior

    10,NUM_LAYERS,NUM_FEATURE_MAPS)

  File "/home/model_user14/jk/tf-vqvae/model.py", line 195, in __init__

    from layers import GatedCNN

ModuleNotFoundError: No module named 'layers'

 

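This problem also troubled me for a long time. I looked carefully through the code on the machine and could not find a layers library. I even tried exposing tf's layers (that is, import tensorflow.layers), but that raised an error and was clearly wrong. I could not understand it: did nobody else run into this? In model.py, just before layers is used, there is this line: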

sys.path.append('pixelcnn')

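But the pixelcnn directory was empty (all right, I honestly do not know why, having already hit and solved this on the cpu, I failed to understand it all over again here).

So I went to look at the source on github and found that the pixelcnn directory is a linked directory pointing to another project, and that project does contain the layers file. I copied its contents over and, hurrah, the program finally ran to the end (solved once more, happy once more)!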

 

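The question now is:

How to train on the npu

Training currently takes quite long, and the log contains none of the keywords such as tf_adapter and [GEOP], which shows that the npu is not being used. This way of checking comes from the materials we were given, and the final PR submission is also verified by looking for these keywords.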
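The keyword check is easy to automate once the training output has been saved to a file. The sketch below only looks for the two marker strings named above; treating "both present" as the pass condition is an assumption made here, not the competition's official rule, and the function names are hypothetical.

```python
def npu_markers_found(log_text, markers=("tf_adapter", "[GEOP]")):
    """Report, for each marker string, whether it occurs anywhere in the log."""
    return {marker: (marker in log_text) for marker in markers}


def ran_on_npu(log_text):
    """Assumed pass condition: every marker appears at least once."""
    return all(npu_markers_found(log_text).values())
```

A typical use is `ran_on_npu(open("train.log").read())`. A plain `grep -c "GEOP" train.log` gives the same answer on the command line.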

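Add this line to see which device is actually being used:

config.gpu_options.allow_growth = True

What I do not understand now is that this line was already there, so why add it? (In hindsight, this option only controls how GPU memory is allocated. Placement lines like the ones below are printed when the session is created with tf.ConfigProto(log_device_placement=True).)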

 

Looking at the output, it is indeed using the cpu:

valid/strided_slice/stack_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0

2020-12-14 11:34:37.334224: I tensorflow/core/common_runtime/placer.cc:54] valid/strided_slice/stack_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
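With hundreds of ops in the graph, reading placement lines by eye gets tedious. As a sketch, assuming the lines keep the `name: (Type): /job:.../device:KIND:N` layout shown above, the output can be reduced to a per-device count. Each op is counted once even though it is printed twice (once on stdout and once by placer.cc). The helper names are illustrative.

```python
import re
from collections import Counter

# Matches e.g. "valid/strided_slice/stack_2: (Const): /job:localhost/.../device:CPU:0"
PLACEMENT = re.compile(
    r"(?P<op>\S+): \((?P<type>\w+)\): \S*/device:(?P<dev>[A-Za-z_]+):\d+"
)


def devices_by_op(log_text):
    """Map each op name found in the placement log to its device kind."""
    placed = {}
    for match in PLACEMENT.finditer(log_text):
        placed[match.group("op")] = match.group("dev")
    return placed


def device_counts(log_text):
    """Count how many distinct ops were placed on each device kind."""
    return Counter(devices_by_op(log_text).values())
```

If the result only ever contains a CPU entry, the graph is not reaching the accelerator.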

 

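Now the code-rewrite stage begins, following the manual for migrating tf to the Ascend system. First add the library imports at the very top: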

from npu_bridge.estimator import npu_ops

from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

 

 

Add these statements before sess.run:

# Create the session

config = tf.ConfigProto()

custom_op = config.graph_options.rewrite_options.custom_optimizers.add()

custom_op.name = "NpuOptimizer"

custom_op.parameter_map["use_off_line"].b = True  # must be explicitly enabled so that training runs on the Ascend AI processor

config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # remap must be explicitly disabled

config.graph_options.rewrite_options.optimizers.extend(["GradFusionOptimizer"])  # added for distributed training

 

Sure enough,

errors, a whole pile of them!

 

For now, comment out everything just added and first fix (I no longer know what I fixed at that point; without notes, what happened is lost for good)

 

Error:

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/variable_scope.py", line 868, in _get_single_variable

    (err_msg, "".join(traceback.format_list(tb))))

ValueError: Variable net/params/enc/conv2d_1/w already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

 

Prompted by this message, I added this line:

tf.reset_default_graph()

 

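Now the cpu version finally gets through

With the cpu version working, on to debugging the npu version. There is a great deal of error output below: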

unknownshape_format=",".join(unknownshape_format_list))

TypeError: gen_param() got an unexpected keyword argument 'unknownshape_format'

[ERROR] TEFUSION(86372,python):2020-12-14-16:08:51.995.932 [tensor_engine/te_fusion/fusion_op.cc:2078]SelectTbeOpFormat Failed to call op func op_select_format, need to check op info: module name[impl.mul], op name [op_select_format], op inputs: ({'shape': (-1, 64), 'ori_shape': (-1, 64), 'format': 'NHWC', 'ori_format': 'NHWC', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0, 'is_first_layer': False, 'range': ((1, None), (64, 64))}, {'shape': (1,), 'ori_shape': (), 'format': 'NHWC', 'ori_format': 'NHWC', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0, 'is_first_layer': False}), outputs: ({'shape': (-1, 64), 'ori_shape': (-1, 64), 'format': 'NHWC', 'ori_format': 'NHWC', 'dt

[ERROR] TEFUSION(86372,python):2020-12-14-16:08:51.995.974 [tensor_engine/te_fusion/fusion_op.cc:2078]SelectTbeOpFormat ype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0, 'range': ((1, None), (64, 64))},), attrs: ().

[ERROR] TEFUSION(86372,python):2020-12-14-16:08:51.996.001 [tensor_engine/te_fusion/fusion_api.cc:943]SelectTbeOpFormat Failed to select tbe op format. Name=[train/backward/Adam/update_train/params/embed/embed/mul_1], Module=[/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/mul]

[ERROR] FE(86372,python):2020-12-14-16:08:51.996.020 [fusion_engine/adapter/tbe_adapter/tbe_op_store_adapter.cpp:1087]87118 SelectOpFormat:"Op[train/backward/Adam/update_train/params/embed/embed/mul_1,optype[Mul]]: fail to invoke SelectTbeOpFormat."

[ERROR] FE(86372,python):2020-12-14-16:08:51.996.037 [fusion_engine/format_selector/op_customize/format_dtype_op_customize_selector.cpp:130]87118 GetDynamicFormatDtype:"Op[name=train/backward/Adam/update_train/params/embed/embed/mul_1,type=Mul]: fail to select formats and dataTypes."

[ERROR] FE(86372,python):2020-12-14-16:08:51.996.054 [fusion_engine/format_selector/op_customize/format_dtype_op_customize_selector.cpp:41]87118 GetSupportFormatDtype:"Fail to get dynamic format and data type of op[train/backward/Adam/update_train/params/embed/embed/mul_1, Mul]."

[ERROR] FE(86372,python):2020-12-14-16:08:51.996.068 [fusion_engine/format_selector/op_customize/format_dtype_op_customize_selector.cpp:57]87118 GetUnknownShapeSupportFormatDtype:"Op[name=train/backward/Adam/update_train/params/embed/embed/mul_1,type=Mul]: Fail to GetUnknownShapeSupportFormatDtype."

[ERROR] FE(86372,python):2020-12-14-16:08:51.996.083 [fusion_engine/ops_kernel_store/sub_ops_store.cpp:770]87118 CheckSubStoreSupported:"Fail to get the GetSupportFormatDtype, return false."

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/te/platform/fusion_manager.py", line 706, in call_op_func

    return opfunc(*inputs, *outputs, *attrs)

  File "/home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/mul.py", line 316, in op_select_format

    unknownshape_format=",".join(unknownshape_format_list))

TypeError: gen_param() got an unexpected keyword argument 'unknownshape_format'


 

 

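Reading the errors carefully:

TensorSummaryV2 is not in white list, so currently not support

I forgot to note this one down; there did not seem to be much to it.

I filed the earlier error as an issue. It looks like I am not the only one hitting this problem.

An issue filed by someone else: https://gitee.com/ascend/modelzoo/issues/I2958D?from=project-issue

My issue: https://gitee.com/ascend/modelzoo/issues/I29BT6?from=project-issue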
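Before pasting a log like the one above into an issue, it helps to know which components actually raised errors. The sketch below assumes the `[ERROR] COMPONENT(pid,process):timestamp [source_file:line]` prefix seen in these logs and tallies the lines per component, keeping the first source location for each. The function name and return shape are choices made for this example.

```python
import re
from collections import Counter

# Prefix of the Ascend-side error lines as they appear in this post.
ERROR_LINE = re.compile(
    r"^\[ERROR\] (?P<component>\w+)\((?P<pid>\d+),[^)]*\):"
    r"(?P<timestamp>\S+) \[(?P<source>[^\]:]+):(?P<line>\d+)\]"
)


def summarize_errors(log_text):
    """Return (count per component, first 'file:line' seen per component)."""
    counts = Counter()
    first_seen = {}
    for raw in log_text.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            component = match.group("component")
            counts[component] += 1
            first_seen.setdefault(
                component, "%s:%s" % (match.group("source"), match.group("line"))
            )
    return counts, first_seen
```

For the log above this would report TEFUSION and FE entries, which points at the operator-format selection step rather than at the training script itself.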

 

Trying again, the error is:

Traceback (most recent call last):

  File "mnist.py", line 292, in <module>

    train_prior(config=config,**config)

  File "mnist.py", line 194, in train_prior

    vq_net = VQVAE(None,None,BETA,_not_used,K,D,_mnist_arch,params,False)

  File "/home/model_user14/jk/vqvae/tf-vqvae/model.py", line 101, in __init__

    enc_spec,enc_param_scope,dec_spec,dec_param_scope = arch_fn(D)

  File "/home/model_user14/jk/vqvae/tf-vqvae/model.py", line 10, in _mnist_arch

    Conv2d('conv2d_1',1,d//4,data_format='NHWC'),

  File "/home/model_user14/jk/vqvae/tf-vqvae/commons/ops.py", line 9, in __init__

    initializer=tf.truncated_normal_initializer(stddev=stddev))

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1500, in get_variable

    aggregation=aggregation)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1243, in get_variable

    aggregation=aggregation)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/variable_scope.py", line 567, in get_variable

    aggregation=aggregation)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/variable_scope.py", line 519, in _true_getter

    aggregation=aggregation)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/variable_scope.py", line 868, in _get_single_variable

    (err_msg, "".join(traceback.format_list(tb))))

ValueError: Variable net/params/enc/conv2d_1/w already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

 

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__

    self._traceback = tf_stack.extract_stack()

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal

    op_def=op_def)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op

    attrs, op_def, compute_device)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func

    return func(*args, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper

    op_def=op_def)

 

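There was also a stretch where, however I debugged, the final inference step always failed. At my wits' end I even overwrote mnist.py with the original, and it still failed. After very careful checking I finally found the cause, a line I had written into ops.py: import tensorflow.compat.v1 as tf

From then on I never took it upon myself to write that line again, and dutifully wrote import tensorflow as tf, even though that produces a lot of warnings, such as: WARNING  The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead. As everyone knows, in a programmer's world warnings are there to be ignored.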
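A leftover import in a file nobody is looking at is hard to spot by eye. A small scan, sketched below under the assumption that every module binds TensorFlow with an `import ... as tf` line, lists which files use which import target so that a stray compat.v1 import stands out. The function name is hypothetical.

```python
import os
import re

# Captures "tensorflow" or "tensorflow.compat.v1" from "import <target> as tf".
IMPORT_TF = re.compile(
    r"^\s*import\s+(tensorflow(?:\.[\w.]+)?)\s+as\s+tf\b", re.MULTILINE
)


def tf_import_styles(root):
    """Map each distinct import target to the .py files (relative to root) using it."""
    styles = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for target in IMPORT_TF.findall(handle.read()):
                    styles.setdefault(target, []).append(os.path.relpath(path, root))
    return styles
```

If the returned dictionary has more than one key, the project mixes import styles, and the shorter list is usually the file to look at first.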

 

Now the cpu versions of both programs, mnist and cifar10, run, and I am just waiting for the issue to be resolved!

 

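The issue engineer needs more log information

First add this to the program:

    import os

    os.environ['SLOG_PRINT_TO_STDOUT'] = "1"

Before starting training, enter this command in the console:

script -f your_file_name.log

Then start training as usual. Everything that scrolls by in the console, that is, on screen, is recorded into that file.

When training ends, press

Ctrl+D

to stop recording and save the file.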
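Where the `script` utility is not available, roughly the same capture can be done from Python. This is a sketch rather than the method the engineers asked for: it launches the training command as a child process with `SLOG_PRINT_TO_STDOUT` set, echoes every output line to the screen, and writes it to a log file. The function name and parameters are made up for this example.

```python
import os
import subprocess
import sys


def run_and_log(cmd, log_path, extra_env=None):
    """Run `cmd`, mirror its stdout and stderr to the screen and to `log_path`.

    Returns the exit code of the child process.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave errors with normal output
            env=env,
            universal_newlines=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()
```

A call such as `run_and_log([sys.executable, "mnist.py"], "mnist_npu.log", {"SLOG_PRINT_TO_STDOUT": "1"})` would produce one file containing both the Python traceback and the device-side log lines.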

 

 

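After I submitted the log, the reply was:

Hello, I am the developer responsible for the Mul operator. Judging from the error above, the code version you are using may be missing the relevant definition. Please check whether there is a util folder in the same directory as your mul.py file. If there is, check whether the util_select_op_base.py file in it contains the keyword unknownshape_format. If the keyword is absent, your code version is probably old and that file was not updated in step with the operator file; we will update the code shortly and notify you as soon as that happens. If the keyword is present, please obtain the full log using the log collection method above and paste it in the comments. Thank you!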

 

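But where is the mul.py file? Ah, a screenshot shows me the location:

/home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/mul.py

Now check whether there is a util directory:

There is

Then look at the util folder: it is there as well

Then I looked at util_select_op_base.py and found that it does not contain the keyword unknownshape_format.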
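The keyword check the engineer asked for is a one-liner with grep; the Python version below is a sketch that also reports the line numbers, which is handy when quoting the result in an issue. The helper name is made up.

```python
def keyword_lines(path, keyword):
    """Return the 1-based line numbers in `path` that contain `keyword`."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for lineno, line in enumerate(handle, start=1):
            if keyword in line:
                hits.append(lineno)
    return hits
```

An empty list for `keyword_lines(".../impl/util/util_select_op_base.py", "unknownshape_format")` corresponds to the "keyword absent, file is out of date" branch of the reply above (the path is abbreviated here; use the full location from the screenshot).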

 

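The latest reply:

Hello, the util**.py file you depend on is not an operator file, so you may not be able to install and use it directly. There will be a code update within this week, after which you will be able to use the file normally. I will contact you as soon as the update is done. My apologies for the trouble.

Then another reply came:

The latest fix is in the attachment. Replace the file under /home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/util in your environment and try again

The attachment contains a new util_select_op_base.py, but I found there are several directories, and I am bad at choosing: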

/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/util

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/util

 

/home/HwHiAiUser/Ascend/ascend-toolkit/20.1/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/util

/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/util

/home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/op_impl/built-in/ai_core/tbe/impl/util

 

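I checked, and every directory now has the corresponding code. (Only later did I learn that the three directories 20.1, 20.1.rc1, and latest are one and the same: two of them are symlinks pointing to the third.)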
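Whether several install paths are really one directory can be checked by resolving symlinks before copying anything. A minimal sketch with the standard library (the function name is illustrative):

```python
import os


def group_by_real_path(paths):
    """Group the given paths by the location they resolve to after following symlinks."""
    groups = {}
    for path in paths:
        groups.setdefault(os.path.realpath(path), []).append(path)
    return groups
```

Passing the three util directories listed above returns a dictionary with a single key if they all resolve to the same place, in which case the patched file only needs to be copied once. `ls -l` on the parent directory shows the same information.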

 

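After placing util_select_op_base.py in the appropriate directory as instructed, a new error appeared:

Filed a new issue: https://gitee.com/ascend/modelzoo/issues/I29N4C

[Online model challenge] tensorflow VQVAE model error 2

A day later I got a reply, along with a newly built file:

Hello, the problem has been fixed. Please take the so from the attachment and replace the file at the following path in your environment. Back up the original before replacing it.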

/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so

 

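So I backed up the original file and cp'd the new one into place. From here on this became routine: file an issue, get an updated file, install it, test, and file another issue for whatever error comes out, round and round. After replacing this so file, the error changed.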
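Since the backup-then-replace step repeats with every patch, it is worth scripting so the backup is never forgotten. This sketch keeps a timestamped copy next to the target; the naming scheme for the backup is an arbitrary choice made here, and the function name is hypothetical.

```python
import os
import shutil
import time


def replace_with_backup(new_file, target):
    """Copy `target` to a timestamped .bak file, then overwrite it with `new_file`.

    Returns the path of the backup that was created.
    """
    backup = "%s.bak.%s" % (target, time.strftime("%Y%m%d%H%M%S"))
    shutil.copy2(target, backup)    # keep the original, including its metadata
    shutil.copy2(new_file, target)  # install the patched file
    return backup
```

Keeping the patched files in one local folder together with a short script that calls this function for each of them also makes it quick to re-apply everything if the environment is ever reset.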

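Looking at someone else's error:

https://gitee.com/ascend/modelzoo/issues/I23LFO

Thanks to the staff for the careful checking! What was found here is that even when the network's input shape is fixed at the input stage, dynamic code such as tf.reshape(,(-1,)) in the code still raises an error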

 

positive_loss = loss * positive_mask

negative_loss = loss * negative_mask

negative_loss, _ = tf.nn.top_k(tf.reshape(negative_loss, (-1,)), tf.cast(negative_count, tf.int32))

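@zx Yes, our framework currently has a restriction: some operators require certain inputs to be constants. We suggest replacing the second input of reshape with a constant built by tf.constant instead of the dynamic shape (-1,). Thank you!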

 

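Reply comment in the linked thread:

FusedBatchNormV3 is a new operator introduced in 2019. Its fifth output is a cuda-related optimization output and is not supported yet. You can use with compat.forward_compatibility_horizon(2019, 5, 1): to avoid this operator

Reply from 对心:

Dynamic-shape support for the TopKV2 operator has been added to the development backlog. Once the plan and roadmap are settled, we will let you know promptly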

 

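I did not follow all of it, but I got the gist: when there is an error, the developers will locate the problem and give a workaround, and at the very least you just do as they say. They will also put the requirement into the development backlog, so that one day the problem gets fixed at the root. Seen this way, I stopped worrying about all the problems that came up in the model challenge and became full of confidence! Confidence matters, because VQVAE had the most problems and took the longest; several times my mood sank to a low, and several times I climbed out and pressed on!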

VQVAE issue3

 https://gitee.com/ascend/modelzoo/issues/I29T7Z

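Reply: from the log, the cause is that the shape fed to conv2d is not 4D. We need to know which operator sits upstream of conv2d;

You can set export DUMP_GE_GRAPH=1 to dump the GE graph and attach it to the issue, or attach the pb file, to help narrow it down further. Thanks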

 

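This is what the GE graph looks like; I cannot read it yet:

(image.png: screenshot of the GE graph)

After I submitted the GE graph, the reply was:

Analysis shows this issue is the same problem as https://gitee.com/HUAWEI-ASCEND/dashboard?issue_id=I28YYG

and #I29E8U: [Online model challenge] tensorflow-VQVAE migration to Ascend, unknown shape reported in the data-processing stage

Both are caused by a queue-type operator followed by conv

Please follow the two issues above. Thanks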

 

So Issue3 was closed, and I moved on to watching this link:

[Online model challenge] VQ-VAE migration, Conv2 error

https://gitee.com/ascend/modelzoo/issues/I28YYG

 

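Reportedly someone has already got it running. So let's:

Learn from someone else's working code

Modifying mine side by side with theirs, the first training step now comes out,

but at step 97 it exits with an error: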

2020-12-18 17:13:51.049172: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:22, total_bytes:262144, shape: 1 1 256 256, tensor_ptr:281461724729536, output281461705736000

2020-12-18 17:13:51.049349: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:23, total_bytes:1024, shape: 256, tensor_ptr:281461724991808, output281461689349328

2020-12-18 17:13:51.049393: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:24, total_bytes:2359296, shape: 3 3 256 256, tensor_ptr:281461724992960, output281461705998160

2020-12-18 17:13:51.050712: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp7_12[ 2585288us]

[  100] Loss: 0.083                                                                                                                                 

  2%|██                                                                                                         | 97/5000 [00:57<1:59:22,  1.46s/it]2020-12-18 17:13:53.807739: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost

2020-12-18 17:13:53.808222: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_5 success. [0 ms]

 

[ERROR] AICPU(136030,python):2020-12-18-17:13:55.066.334 [aicpu/aicpu_host/aicpu_engine/aicpu_ops_kernel_info/aicpu_ops_kernel_info.cpp:341][GetKernelLibByOpType]:"Operator:Placeholder belongs to relevant info library is not exist."

[ERROR] AICPU(136030,python):2020-12-18-17:13:55.066.439 [aicpu/aicpu_host/aicpu_engine/aicpu_ops_kernel_info/aicpu_ops_kernel_info.cpp:120][CalcOpRunningParam]:"Op Placeholder can't find associated kernel info lib."

[ERROR] GE(136030,python):2020-12-18-17:13:55.066.544 [framework/domi/graph/build/graph_builder.cc:84]136783 CalcOpParam: ErrorNo: 20000016() Calculate op running param failed, node name is Placeholder


 



Posted on 2021-01-09 12:34:46 (first reply, by skywalk)

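Current progress and open problems:

1 After the 依瞳 system is paused and restarted, the patches have to be applied to those files all over again

2 issue3 is still unresolved; keep watching this issue https://gitee.com/ascend/modelzoo/issues/I28YYG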

 

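How Issue3 was resolved

Investigation shows that the mismatch between the output shape of QueueDequeueMany and that of TF is caused by the shapes attribute of the upstream operator RandomShuffleQueue not being passed downstream

The developer responsible for that operator has been contacted to fix it

The problem has been fixed. Please replace the file at the following path with the one in the attachment, and back up the original first.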

/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so

 

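With that, issue3 is resolved, and one more patch file is applied.

Issue4: the error reported by AOE

After applying the issue3 patch, it still failed. Because some time had passed and that error had nothing distinctive about it, I could not tell from the log whether my earlier bug had been fixed. Forgive me for being lazy sometimes and not comparing the error messages carefully.

I found that aoe had already filed an issue. After I applied the issue patch, our error messages were the same, so I kept following that issue and its solution. This counts as issue4: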

https://gitee.com/ascend/modelzoo/issues/I2A7SC

 

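The reply: you can follow the section "Enabling mixed computing in sess.run mode" at the link below and set tf.train.shuffle_batch and tf.train.string_input_producer so that they are not offloaded to the device, then see whether the network runs

https://support.huaweicloud.com/mprtg-A800_9000_9010/atlasprtg_13_0033.html

They suggest using mixed computing. The documentation says that in mixed computing mode iterations_per_loop must be 1. But I could not find that keyword in my code; does that mean I need not worry about the actual value of iterations_per_loop? (Through later discussion I learned that in mixed mode iterations_per_loop is already set to 1.)

Users can also use without_npu_compile_scope to choose which operators are not offloaded.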
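Rather than searching the source for iterations_per_loop, the values the adapter actually used can be read back from the training log, since tf_adapter prints lines of the form "mix_compile_mode is True" and "iterations_per_loop is 1" (both appear in the logs further down this thread). The helper below is a sketch that collects every value reported for those two settings; its name is made up.

```python
import re

# Matches the "<setting> is <value>" lines printed by tf_adapter.
SETTING = re.compile(r"\b(mix_compile_mode|iterations_per_loop) is (\w+)")


def adapter_settings(log_text):
    """Return {setting name: set of values reported in the log}."""
    found = {}
    for key, value in SETTING.findall(log_text):
        found.setdefault(key, set()).add(value)
    return found
```

A result of `{"mix_compile_mode": {"True"}, "iterations_per_loop": {"1"}}` confirms that mixed computing is on and the loop count is 1. More than one value in a set would mean different sessions in the same run were configured differently.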

 

So, following the instructions, I modified line 82 of the code:

# changed so that this op is not offloaded

    with npu_scope.without_npu_compile_scope():

        filename_queue = tf.train.string_input_producer(filenames,num_epochs=num_epochs)

 

AOE's problem was solved and his issue was closed, but my problem was still there, so I filed my own issue4

Issue4 error

Filed the issue

https://gitee.com/ascend/modelzoo/issues/I2AMHH

 

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.FixedLengthRecordDataset`.

WARNING:tensorflow:From cifar10.py:305: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

2020-12-23 22:22:55.165648: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost

2020-12-23 22:22:55.166302: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_5 success. [0 ms]

2020-12-23 22:22:55.166352: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.

2020-12-23 22:22:55.166546: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True

2020-12-23 22:22:55.166574: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1

2020-12-23 22:22:55.166804: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_9 begin.

2020-12-23 22:22:55.166829: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True

2020-12-23 22:22:55.166876: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1

2020-12-23 22:22:55.167661: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0

2020-12-23 22:22:55.167710: I tf_adapter/util/npu_ops_identifier.cc:67] [MIX] Parsing json from /home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/framework/built-in/tensorflow/npu_supported_ops.json

2020-12-23 22:22:55.169692: I tf_adapter/util/npu_ops_identifier.cc:69] 690 ops parsed

2020-12-23 22:22:55.170185: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [2 ms]

2020-12-23 22:22:55.176442: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1

2020-12-23 22:22:55.176485: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp9_0 minGroupSize: 1

2020-12-23 22:22:55.176643: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_9 markForPartition success.

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call

    return fn(*args)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn

    target_list, run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun

    run_metadata)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

 

During handling of the above exception, another exception occurred:

 

Traceback (most recent call last):

  File "cifar10.py", line 517, in <module>

    extract_z(**config)

  File "cifar10.py", line 330, in extract_z

    sess.run(init_op)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

    run_metadata_ptr)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

    feed_dict_tensor, options, run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

    run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

2020-12-23 22:22:56.194723: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin

2020-12-23 22:22:56.194890: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.

2020-12-23 22:22:56.194990: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end

 

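Later it could not be reproduced, so it was closed. But today it reproduced again.

I discovered that on the 依瞳 system, python and python3.7 do not point to the same interpreter.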

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$  which python

/usr/bin/python

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$  which python3.7

/usr/local/bin/python3.7

 

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which pip

/usr/local/bin/pip

So the python under /usr/local/bin should be used, that is, python3.7
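The same comparison that `which` gives on the shell can be scripted, which is useful at the top of a launch script. The sketch below resolves each command name on PATH and follows symlinks, so two names only count as the same interpreter if they end up at the same file. The function names and the default command list are illustrative.

```python
import os
import shutil


def resolve_commands(names=("python", "python3.7", "pip")):
    """Map each command name to its resolved path on PATH, or None if not found."""
    resolved = {}
    for name in names:
        path = shutil.which(name)
        resolved[name] = os.path.realpath(path) if path else None
    return resolved


def same_interpreter(first, second):
    """True only if both commands exist and resolve to the same file."""
    paths = resolve_commands((first, second))
    return paths[first] is not None and paths[first] == paths[second]
```

On the system described above, `same_interpreter("python", "python3.7")` would be expected to return False, which is the cue to call python3.7 explicitly. Printing `sys.executable` at the start of the training script is another quick way to see which interpreter is really running.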

 

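I checked the settings in tf.train.batch, especially allow_smaller_final_batch. It appears in 3 places in total, and the only one that takes effect has already been changed to False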

    images,labels = tf.train.batch(

        [image,label],

        batch_size=BATCH_SIZE,

        num_threads=1,

        capacity=BATCH_SIZE,

        allow_smaller_final_batch=False)

 

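I changed the code to non-aoe mode, because aoe's workaround does not meet the competition requirements.

I changed the other two train.batch calls so that they are not offloaded. In mixed computing, keeping statements off the device means wrapping them in this statement: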

  with npu_scope.without_npu_compile_scope():

 

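Because I changed several places, the program became unrecognizable, so I retested the cpu code and found that even the cifar10 cpu code no longer passed (this is what makes vqvae hard: change one tiny thing and the error is different, and sometimes it feels as though it breaks on its own without my changing anything).

It took a great deal of effort to get the cpu code working again. The cpu program is kept separately as cifar_base.py. I then got the npu program working by comparing against it, which means that with all npu-related code disabled, the npu program also runs.

The Issue5 error: shapes not aligned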

https://gitee.com/ascend/modelzoo/issues/I2AVI5

The cause is probably that the data batching did not drop the final partial batch,

    images,labels = tf.train.batch(

        [image,label],

        batch_size=BATCH_SIZE,

        num_threads=1,

        capacity=BATCH_SIZE,

        allow_smaller_final_batch=False)

Adding the allow_smaller_final_batch=False argument (shown in bold in the original post) makes it work.
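Why the last batch matters is easiest to see without TensorFlow. The plain-Python sketch below is not tf.train.batch itself; it only mimics the effect of the flag on a list of samples. With the flag off every batch has exactly batch_size items, so the leading dimension stays fixed; with it on, a shorter final batch appears whenever the dataset size is not a multiple of batch_size.

```python
def batches(items, batch_size, allow_smaller_final_batch=False):
    """Yield consecutive batches; drop the trailing partial batch unless allowed."""
    full = len(items) - len(items) % batch_size  # number of items in complete batches
    for start in range(0, full, batch_size):
        yield items[start:start + batch_size]
    if allow_smaller_final_batch and full < len(items):
        yield items[full:]
```

With 10 samples and a batch size of 4, the default yields two batches of 4, while `allow_smaller_final_batch=True` adds a third batch of 2. A graph compiled for a fixed batch dimension cannot accept that shorter batch, which matches the shape mismatch reported in this issue.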

 

issue5.1

https://gitee.com/ascend/modelzoo/issues/I2B2US

 

It says: from now on, when a DEVMM error occurs, please run dmesg to capture the kernel log so the problem is easier to locate. Thanks~

 

[ERROR] DEVMM(25538,python3.7):2020-12-27-12:52:10.065.876 [hardware/build/../dev_platform/devmm/devmm/devmm_svm.c:268][devmm_copy_ioctl 268] <curpid:25538,0x66a5> <errno:3> Ioctl(-1060090619) error! ret=-1, dst=0xfffed40af2a0, src=0x1008000bc000, size=112,

 

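But this problem is not easy to reproduce: it reproduces only occasionally on my system, and the developers also have a hard time reproducing it.

Then, on the first working day after New Year's Day: a new year, a fresh start. Today I tweaked the code slightly and it actually ran, to my great surprise.

The data-reading part uses mixed computing without offloading, and in principle the backbone code is unchanged. On New Year's Day it still failed, but today, after a small change to the code, it ran

The problem in this issue is solved. Closed.

The notes above are short, but this issue actually took a great deal of time, from the end of 2020 into the start of 2021, so it spans two calendar years. Along the way the code was changed beyond recognition and the bug looked different every day. You could say that the joy of the final success was as great as the low moods in between were deep.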

 

Issue6, still unfinished

This error has not been filed as issue6

Error message:

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

WARNING:tensorflow:From cn.py:346: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

2021-01-04 16:05:37.573107: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost

2021-01-04 16:05:37.574181: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_15 success. [0 ms]

2021-01-04 16:05:37.574252: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.

2021-01-04 16:05:37.574439: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True

2021-01-04 16:05:37.574461: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1

2021-01-04 16:05:37.574658: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_29 begin.

2021-01-04 16:05:37.574679: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True

2021-01-04 16:05:37.574689: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1

2021-01-04 16:05:37.575437: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0

2021-01-04 16:05:37.575952: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]

2021-01-04 16:05:37.582660: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1

2021-01-04 16:05:37.582750: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp29_0 minGroupSize: 1

2021-01-04 16:05:37.583336: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_29 markForPartition success.

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call

    return fn(*args)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn

    target_list, run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun

    run_metadata)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list

 

During handling of the above exception, another exception occurred:

 

Traceback (most recent call last):

  File "cn.py", line 558, in <module>

    extract_z(**config)

  File "cn.py", line 371, in extract_z

    sess.run(init_op)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

    run_metadata_ptr)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

    feed_dict_tensor, options, run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

    run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list

2021-01-04 16:05:38.559795: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin

2021-01-04 16:05:38.560022: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.

2021-01-04 16:05:38.560042: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end

 

Seeing this warning, should I change the code?

WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

 

The error message is now:

During handling of the above exception, another exception occurred:

 

Traceback (most recent call last):

  File "cn.py", line 559, in <module>

    extract_z(**config)

  File "cn.py", line 372, in extract_z

    sess.run(init_op)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

    run_metadata_ptr)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

    feed_dict_tensor, options, run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

    run_metadata)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

After some searching, I found that this might be caused by incorrect initialization, so I tried adding this line:

sess.graph.finalize()

The same error still occurred.

 

Main code:

    init_op = tf.group(tf.global_variables_initializer(),

                       tf.local_variables_initializer())

       

    # >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!

    config = tf.ConfigProto()

    # config.gpu_options.allow_growth = True

 

    custom_op =  config.graph_options.rewrite_options.custom_optimizers.add()

    custom_op.name =  "NpuOptimizer"

    custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor

    custom_op.parameter_map["mix_compile_mode"].b = True

    config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # turn off the remapping switch

 

 

    sess = tf.Session(config=config)

    # sess.graph.finalize()

    sess.run(init_op)

    print("="*1000, "run sess.run(init_op) OK!")

    summary_writer = tf.summary.FileWriter(LOG_DIR,sess.graph)

    # logging.warning("dch summary_writer")

    summary_writer.add_summary(config_summary.eval(session=sess))

    # logging.warning("dch summary_writer.add")

 

extract_z code:

    with npu_scope.without_npu_compile_scope():

        images,labels = tf.train.batch(

            [image,label],

            batch_size=BATCH_SIZE,

            num_threads=1,

            capacity=BATCH_SIZE,

            allow_smaller_final_batch=False)

    # <<<<<<<

    # images = images.batch(batch_size, drop_remainder=True)

    # >>>>>>> MODEL

    with tf.variable_scope('net'):

        with tf.variable_scope('params') as params:

            pass

        x_ph = tf.placeholder(tf.float32,[BATCH_SIZE,32,32,3])

        net= VQVAE(None,None,BETA,x_ph,K,D,_cifar10_arch,params,False)

 

    init_op = tf.group(tf.global_variables_initializer(),

                    tf.local_variables_initializer())

 

    # >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!

    config = tf.ConfigProto()

    # config.gpu_options.allow_growth = True

    custom_op =  config.graph_options.rewrite_options.custom_optimizers.add()

    custom_op.name =  "NpuOptimizer"

    custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor

    custom_op.parameter_map["mix_compile_mode"].b = True  # test operator sinking (mixed computing)

    config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # turn off the remapping switch

 

    sess = tf.Session(config=config)

    logger.warn('warn sess = tf.Session(config=config)')

    # sess = tf.Session()

    sess.graph.finalize()

    sess.run(init_op)

    logger.warn('warn sess.run(init_op)')

 

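In the end I removed the num_epochs=1 argument from this call, and the error finally went away. Presumably that is because, with num_epochs set, the input producer creates the local epochs counter variable whose Assign node is named in the error; without the argument that variable is never created.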

    # image,label = get_image(num_epochs=1)

    image,label = get_image()

This may not be the final fix, but it will do for now.

Issue 7 error:

train_prior part:  config['TRAIN_NUM'] = 8 # fails once it reaches 9

Issue: https://gitee.com/ascend/modelzoo/issues/I2BUME/

 

Error message:

2021-01-04 20:30:06.952771: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp75_0[ 6456228us]

 50%|██████████████████████████████████████████████                                              | 9/18 [01:36<01:36, 10.67s/it]

Traceback (most recent call last):

  File "cifar10.py", line 531, in <module>

    train_prior(config=config,**config)

  File "cifar10.py", line 476, in train_prior

    sess.run(sample_summary_op,feed_dict={sample_images:sampled_ims}),it)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

    run_metadata_ptr)

  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1156, in _run

    (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))

ValueError: Cannot feed value of shape (20, 32, 32, 3) for Tensor 'misc/Placeholder:0', which has shape '(1, 32, 32, 3)'

 

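This problem is still unsolved, and I have no idea where it comes from. It is probably related to how the data is fed in, but I cannot fix it for now, so I set config['TRAIN_NUM'] = 8 to make sure the whole program runs. This issue stays open for the moment.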
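The ValueError itself is fairly specific: the value being fed has a batch dimension of 20, while the placeholder misc/Placeholder:0 was declared with a batch dimension of 1. The snippet below is a minimal pure-Python sketch of that kind of shape check, not TensorFlow's own code. `shapes_compatible` is a hypothetical helper, and treating `None` as "any size" mirrors how a placeholder dimension declared as `None` behaves. Whether declaring the batch dimension as `None` would be the right fix for cifar10.py is only an assumption, since the root cause was never tracked down.

```python
def shapes_compatible(fed_shape, declared_shape):
    """Return True if fed_shape can be fed into a placeholder of declared_shape.

    A declared dimension of None stands for "any size". This is a hypothetical
    helper, written only to illustrate the check behind the ValueError above.
    """
    if len(fed_shape) != len(declared_shape):
        return False
    return all(d is None or f == d for f, d in zip(fed_shape, declared_shape))


sampled_shape = (20, 32, 32, 3)            # shape of the fed value in the traceback
fixed_placeholder = (1, 32, 32, 3)         # declared shape of 'misc/Placeholder:0'
flexible_placeholder = (None, 32, 32, 3)   # assumed alternative declaration

print(shapes_compatible(sampled_shape, fixed_placeholder))     # False
print(shapes_compatible(sampled_shape, flexible_placeholder))  # True
```

A check like this only explains why the feed is rejected; it does not explain why the sampled batch has 20 images once TRAIN_NUM goes above 8.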

 

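Submitting the PR: the final sprint

After a long, hard fight, the end of the contest was finally in sight: the whole model ran on the Ascend system and essentially met the contest requirements. What remained was minor tuning.

Revision requests from the reviewers

1 The program explicitly keeps tf.train.string_input_producer from being sunk to the NPU.

This setting has to be removed, leaving only mixed computing enabled.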

 

 

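2 A problem in the VQ-VAE network: data preprocessing is driven by the loop below, which raises an exception once the data runs out and relies on that exception to stop. On Ascend products this currently causes a core dump. Developers are asked to switch to a different control flow: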

        while not coord.should_stop():

            x,y = sess.run([images,labels])

            k = sess.run(net.k,feed_dict={x_ph:x})

            ks.append(k)

            ys.append(y)

            print('.', end='', flush=True)

    except tf.errors.OutOfRangeError:

 

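Final fixes for the VQVAE PR

The whole program now enables only mixed computing, with every per-op no-sink setting removed. (This is the intended usage: in theory the system sinks every operator it supports and by default keeps unsupported ones on the host, so the user does not need to configure anything by hand.)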
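As a rough mental model of mixed computing, and not the actual Ascend implementation, the toy sketch below splits a list of ops by a whitelist: ops whose type is on the list are marked for the device, and everything else stays on the host. The whitelist contents, the op names other than the Assign node from the earlier error, and the function name `partition_ops` are all made up for illustration.

```python
# Toy model only: the real whitelist lives inside the Ascend TF adapter and
# is not reproduced here. All names below are illustrative assumptions.
ASSUMED_DEVICE_WHITELIST = {"Conv2D", "Relu", "MatMul"}


def partition_ops(ops, whitelist):
    """Split (name, op_type) pairs into device-side and host-side name lists."""
    on_device, on_host = [], []
    for name, op_type in ops:
        target = on_device if op_type in whitelist else on_host
        target.append(name)
    return on_device, on_host


graph_ops = [
    ("net/conv1", "Conv2D"),
    ("net/relu1", "Relu"),
    ("input_producer/limit_epochs/epochs/Assign", "Assign"),
]

device_ops, host_ops = partition_ops(graph_ops, ASSUMED_DEVICE_WHITELIST)
print("device:", device_ops)
print("host:", host_ops)
```

In this picture the user never lists the host-side ops by hand, which is the behavior the reviewers asked the script to rely on.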

 

Changed the while loop to a for loop:

        for step in tqdm(xrange(TRAIN_NUM), dynamic_ncols=True):

            x,y = sess.run([images,labels])

            k = sess.run(net.k,feed_dict={x_ph:x})

 

            ks.append(k)

            ys.append(y)

 

and set the number of loop steps:

config['TRAIN_NUM'] = 24

 

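I also checked with Shaofang: the second part passes because the argument of the get_image call was removed. Since the number of loop steps is now set explicitly, this should not affect the overall behavior.

Before: # image,label = get_image(num_epochs=1)

After: image,label = get_image()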
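The combined effect of the two changes can be sketched without TensorFlow. In the hypothetical stand-in below, `endless_batches` plays the role of an input queue created without num_epochs (it never runs out), and the for loop stops after a fixed TRAIN_NUM steps instead of waiting for an end-of-data exception. The data, the batch size, and the function name are made up for illustration.

```python
from itertools import cycle, islice


def endless_batches(samples, batch_size):
    """Yield batches forever by cycling over samples (no epoch limit)."""
    stream = cycle(samples)
    while True:
        yield list(islice(stream, batch_size))


TRAIN_NUM = 24              # same value as config['TRAIN_NUM'] in the post
BATCH_SIZE = 4              # illustrative value, not taken from the script
samples = list(range(10))   # stand-in for the CIFAR-10 records

batches = endless_batches(samples, BATCH_SIZE)
ks = []
for step in range(TRAIN_NUM):
    x = next(batches)       # stands in for sess.run([images, labels])
    ks.append(x)            # stands in for collecting net.k

print(len(ks), ks[0])
```

Because the loop bound decides when to stop, the producer no longer needs an epoch limit, and no exception is involved in ending the run.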

 

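Then I submitted the PR, and it finally passed acceptance. Hurrah! I was thrilled. The result is not what matters most; the process of running into problems and solving them is. Still, without a result this write-up would have no reason to exist, the effort along the way might have been wasted, and I probably would not have learned as much or remembered it as vividly.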

 

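Summary of migrating the VQVAE model from TensorFlow to Ascend

The contest went through several stages: registration, model selection, model migration, debugging, and PR submission. The details are in the sections above; it is a long story!

This model migration contest was a great chance to learn and practice. I knew nothing about TensorFlow before. After this contest, whether I truly understand it or not, I have read the code many times and now have some grasp of how a tf program flows. My earlier contact with the Ascend system was limited to ModelArts notebooks and training jobs; this was the first time I could freely install software and fully control the system, here on the 依瞳 (Yitong) system. While debugging I worked directly with Huawei R&D for the first time, and their fast, accurate troubleshooting deserves praise. I am full of confidence in the Ascend system and the MindSpore AI framework!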

 

The Silver stage of the model contest is coming soon, so get ready to sign up!




Jack20

Posted 2021-01-11 13:18:55

Thanks for sharing.

Tornado88

Posted 2021-01-11 20:10:27

Awesome!

Jounce

Posted 2021-01-12 10:18:23

JeffDing

Posted 2021-01-13 06:55:48

A very detailed VQVAE write-up, great work!
