Huawei Geek Week, "Ascend Wanli" Model King Challenge: Getting VQVAE to Run, Part 2
Current status and open problems:
1. After the Yitong system is suspended and restarted, all the patched files have to be patched again.
2. Issue 3 is still unresolved; keep following it at https://gitee.com/ascend/modelzoo/issues/I28YYG
How Issue 3 Was Resolved
After investigation: QueueDequeueMany's output shape differed from TF's because the upstream RandomShuffleQueue operator did not pass its shapes attribute downstream. The developer responsible for that operator was contacted for a fix.
The problem has been fixed. Please copy the file in the attachment to the following path, backing up the original first:
/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so
With that, issue 3 was resolved, at the cost of yet another patch file.
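Since replacing a system library in place is risky, the backup-then-replace step can be scripted defensively. A sketch in Python; the patch filename libopsproto_patched.so is my placeholder, not something named in the issue reply:

```python
import shutil
from pathlib import Path

def patch_with_backup(target, patch):
    """Back up target as <name>.bak (keeping the first backup), then
    copy the patch file over the target. Returns the backup path."""
    target = Path(target)
    backup = target.with_name(target.name + ".bak")
    if not backup.exists():            # never overwrite the original backup
        shutil.copy2(target, backup)
    shutil.copy2(patch, target)
    return backup

# Usage against the path from the issue reply (patch name is hypothetical):
# patch_with_backup(
#     "/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/"
#     "op_proto/built-in/libopsproto.so",
#     "libopsproto_patched.so")
```

Keeping the first backup untouched matters here because, as noted above, the Yitong system forgets patches across restarts and the replacement may be run more than once.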
Issue 4: Error (AOE)
With the issue-3 patch applied, the run still failed. Some time had passed and the error had no distinctive signature, so I could not tell from the log whether my earlier bug had actually been fixed. Forgive the occasional laziness: I did not compare the error messages carefully.
Then I noticed that AOE had already filed an issue, and after applying the issue-3 patch my error message was identical to theirs, so I followed that issue and its resolution. Call it issue 4:
https://gitee.com/ascend/modelzoo/issues/I2A7SC
The response: following the link below, use the "mixed computing in sess.run mode" approach to keep tf.train.shuffle_batch and tf.train.string_input_producer from sinking to the device, then see whether the network runs.
https://support.huaweicloud.com/mprtg-A800_9000_9010/atlasprtg_13_0033.html
So, mixed computing. The docs say that in mixed-compute mode iterations_per_loop must be 1, but I could not find that keyword in the code. Does that mean I need not worry about its actual value? (I later learned that in mixed mode iterations_per_loop is already set to 1.)
Users can also mark specific operators as not-sunk via without_npu_compile_scope.
So, per the instructions, I changed line 82 of the code:
# changed: keep this op on the host (not sunk)
with npu_scope.without_npu_compile_scope():
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs)
Because AOE's problem was solved, their issue was closed; mine was not, so I filed my own issue 4:
Issue 4: Error
Filed issue:
https://gitee.com/ascend/modelzoo/issues/I2AMHH
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.FixedLengthRecordDataset`.
WARNING:tensorflow:From cifar10.py:305: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
2020-12-23 22:22:55.165648: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2020-12-23 22:22:55.166302: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_5 success. [0 ms]
2020-12-23 22:22:55.166352: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2020-12-23 22:22:55.166546: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True
2020-12-23 22:22:55.166574: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2020-12-23 22:22:55.166804: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_9 begin.
2020-12-23 22:22:55.166829: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True
2020-12-23 22:22:55.166876: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2020-12-23 22:22:55.167661: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0
2020-12-23 22:22:55.167710: I tf_adapter/util/npu_ops_identifier.cc:67] [MIX] Parsing json from /home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/framework/built-in/tensorflow/npu_supported_ops.json
2020-12-23 22:22:55.169692: I tf_adapter/util/npu_ops_identifier.cc:69] 690 ops parsed
2020-12-23 22:22:55.170185: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [2 ms]
2020-12-23 22:22:55.176442: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2020-12-23 22:22:55.176485: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp9_0 minGroupSize: 1
2020-12-23 22:22:55.176643: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_9 markForPartition success.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cifar10.py", line 517, in <module>
extract_z(**config)
File "cifar10.py", line 330, in extract_z
sess.run(init_op)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list
2020-12-23 22:22:56.194723: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin
2020-12-23 22:22:56.194890: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.
2020-12-23 22:22:56.194990: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end
Later it could not be reproduced, so the issue was closed. But today it reproduced again.
It turned out that under the Yitong system, python and python3.7 do not even point to the same interpreter.
model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which python
/usr/bin/python
model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which python3.7
/usr/local/bin/python3.7
model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which pip
/usr/local/bin/pip
So the interpreter under /usr/local/bin, i.e. python3.7, is the one to use.
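A quick sanity check for this kind of interpreter mixup is to have the script itself report which binary is running it:

```python
import sys

# Print the interpreter actually executing this script. Under the Yitong
# setup described above this should be /usr/local/bin/python3.7, not
# /usr/bin/python.
print(sys.executable)
print(".".join(map(str, sys.version_info[:3])))
```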
Checking the settings inside tf.train.batch, especially allow_smaller_final_batch: it appears three times in the code, and the only occurrence that takes effect had already been changed to False.
images, labels = tf.train.batch(
    [image, label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)
I switched the code back to non-AOE mode, because AOE's workaround does not comply with the contest rules.
The other two tf.train.batch calls were changed to not-sunk. In mixed computing, "not sinking" just means wrapping the statement in:
with npu_scope.without_npu_compile_scope():
After modifying this many places the program was barely recognizable, so I re-ran the CPU code and found that even the cifar10 CPU version no longer passed. (This is what makes VQVAE hard: change one small thing and the error changes; sometimes it even breaks in places I never touched.)
It took considerable effort to get the CPU code working again. I split the CPU version into its own file, cifar_base.py, and then debugged the NPU code against it, so that with all NPU-related code masked out the program still runs.
Issue 5: Error, shapes not aligned
https://gitee.com/ascend/modelzoo/issues/I2AVI5
It was probably because the data batching did not drop the final partial batch:
images, labels = tf.train.batch(
    [image, label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)
Adding the key line, allow_smaller_final_batch=False, fixed it.
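Why a smaller final batch breaks things on the NPU can be shown in plain Python. This toy batcher is my own illustration, not TF code; it mirrors what allow_smaller_final_batch=False (or drop_remainder=True in tf.data) does:

```python
def batches(items, batch_size, drop_remainder=True):
    """Yield consecutive batches, optionally dropping the short final one.

    The NPU compiles the graph for one fixed batch shape, so a smaller
    final batch (here the 2-element tail) would cause a shape mismatch
    like the one in issue 5; dropping it keeps every batch uniform.
    """
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        if len(batch) == batch_size or not drop_remainder:
            yield batch

# 10 samples, batch size 4:
print([len(b) for b in batches(list(range(10)), 4)])                        # [4, 4]
print([len(b) for b in batches(list(range(10)), 4, drop_remainder=False)])  # [4, 4, 2]
```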
Filed Issue 5.1
https://gitee.com/ascend/modelzoo/issues/I2B2US
The reply noted: next time a DEVMM error appears, please run dmesg and capture the kernel log to help with triage. Thanks!
[ERROR] DEVMM(25538,python3.7):2020-12-27-12:52:10.065.876 [hardware/build/../dev_platform/devmm/devmm/devmm_svm.c:268][devmm_copy_ioctl 268] <curpid:25538,0x66a5> <errno:3> Ioctl(-1060090619) error! ret=-1, dst=0xfffed40af2a0, src=0x1008000bc000, size=112,
But the problem is not easy to reproduce: only occasionally on my system, and hardly at all on the developers' side.
Then, on the first working day after New Year's Day: new year, new luck. I tweaked the code slightly and, to my own surprise, it ran.
The data-loading part uses mixed computing with not-sunk ops, and in principle the backbone code was unchanged. It still failed on New Year's Day, yet after today's small change it ran.
With that, the issue was resolved and closed.
The notes above are short, but this issue consumed an enormous amount of time, from the end of 2020 into the beginning of 2021 (two calendar years, if you count both ends). The code was rewritten beyond recognition and the bugs kept changing shape. The joy of final success was as great as the gloom along the way was deep.
Issue 6: Error, Not Yet Resolved
I did not file an issue for this one.
Error message:
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From cn.py:346: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
2021-01-04 16:05:37.573107: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2021-01-04 16:05:37.574181: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_15 success. [0 ms]
2021-01-04 16:05:37.574252: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2021-01-04 16:05:37.574439: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True
2021-01-04 16:05:37.574461: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2021-01-04 16:05:37.574658: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_29 begin.
2021-01-04 16:05:37.574679: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True
2021-01-04 16:05:37.574689: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2021-01-04 16:05:37.575437: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0
2021-01-04 16:05:37.575952: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]
2021-01-04 16:05:37.582660: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2021-01-04 16:05:37.582750: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp29_0 minGroupSize: 1
2021-01-04 16:05:37.583336: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_29 markForPartition success.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cn.py", line 558, in <module>
extract_z(**config)
File "cn.py", line 371, in extract_z
sess.run(init_op)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list
2021-01-04 16:05:38.559795: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin
2021-01-04 16:05:38.560022: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.
2021-01-04 16:05:38.560042: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end
Seeing this hint, should the code be changed accordingly?
WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
The error message is now:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cn.py", line 559, in <module>
extract_z(**config)
File "cn.py", line 372, in extract_z
sess.run(init_op)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list
Some searching suggested this might be caused by incorrect initialization, so I tried adding:
sess.graph.finalize()
Still the same error.
Main code:
init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
sess = tf.Session(config=config)
# sess.graph.finalize()
sess.run(init_op)
print("="*1000, "run sess.run(init_op) OK!")
summary_writer = tf.summary.FileWriter(LOG_DIR, sess.graph)
# logging.warning("dch summary_writer")
summary_writer.add_summary(config_summary.eval(session=sess))
# logging.warning("dch summary_writer.add")
extract_z code:
with npu_scope.without_npu_compile_scope():
    images, labels = tf.train.batch(
        [image, label],
        batch_size=BATCH_SIZE,
        num_threads=1,
        capacity=BATCH_SIZE,
        allow_smaller_final_batch=False)
# <<<<<<<
# images = images.batch(batch_size, drop_remainder=True)
# >>>>>>> MODEL
with tf.variable_scope('net'):
    with tf.variable_scope('params') as params:
        pass
x_ph = tf.placeholder(tf.float32, [BATCH_SIZE, 32, 32, 3])
net = VQVAE(None, None, BETA, x_ph, K, D, _cifar10_arch, params, False)
init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True  # testing operator sinking
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
sess = tf.Session(config=config)
logger.warn('warn sess = tf.Session(config=config)')
# sess = tf.Session()
sess.graph.finalize()
sess.run(init_op)
logger.warn('warn sess.run(init_op)')
In the end, removing the num_epochs=1 argument from this call finally made it pass:
# image,label = get_image(num_epochs=1)
image, label = get_image()
This may not be the final fix, but it will do for now.
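My reading of why this helps: per the TF docs, string_input_producer with num_epochs set creates a hidden local "epochs" variable, and its Assign op is exactly the input_producer/limit_epochs/epochs/Assign node the whitelist check rejects in the traceback above. A toy Python generator (my illustration, not TF internals) contrasts the two variants:

```python
from itertools import islice

def producer(filenames, num_epochs=None):
    """Toy stand-in for tf.train.string_input_producer.

    With num_epochs set it must maintain a mutable epoch counter (the
    analogue of the ref Variable whose Assign op the NPU rejected);
    without it, it simply cycles forever and keeps no counter state.
    """
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        for name in filenames:
            yield name
        epoch += 1   # the stateful update that num_epochs requires

# one epoch only: three names, then exhausted
print(list(producer(["a.bin", "b.bin", "c.bin"], num_epochs=1)))
# unlimited: the consumer decides how many items to take
print(list(islice(producer(["a.bin", "b.bin", "c.bin"]), 5)))
```

Dropping num_epochs shifts the termination responsibility to the consumer, which is why fixing the loop step count elsewhere (as done later with TRAIN_NUM) makes this safe.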
Issue 7: Error
It is in the train_prior part: config['TRAIN_NUM'] = 8  # errors start from 9
Filed issue: https://gitee.com/ascend/modelzoo/issues/I2BUME/
Error message:
2021-01-04 20:30:06.952771: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp75_0[ 6456228us]
50%|██████████████████████████████████████████████ | 9/18 [01:36<01:36, 10.67s/it]
Traceback (most recent call last):
File "cifar10.py", line 531, in <module>
train_prior(config=config,**config)
File "cifar10.py", line 476, in train_prior
sess.run(sample_summary_op,feed_dict={sample_images:sampled_ims}),it)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1156, in _run
(np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (20, 32, 32, 3) for Tensor 'misc/Placeholder:0', which has shape '(1, 32, 32, 3)'
This one is still unsolved, and I genuinely do not know where the problem lies. It probably has to do with how data is fed in, but for now I can only set config['TRAIN_NUM'] = 8 so that the whole program runs, and leave the issue open.
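The ValueError itself is TF's static feed-shape check firing; a minimal Python sketch of that check (a hypothetical helper, not the TF source) shows exactly why (20, 32, 32, 3) cannot feed a (1, 32, 32, 3) placeholder:

```python
def check_feed_shape(feed_shape, placeholder_shape):
    """Mimic TF's feed-shape check: ranks must match, and every dim the
    placeholder fixes must agree; None in the placeholder means any size.

    In issue 7 the 'misc/Placeholder' tensor fixes the batch dim at 1,
    so feeding a 20-image batch fails this check.
    """
    if len(feed_shape) != len(placeholder_shape):
        return False
    return all(p is None or f == p
               for f, p in zip(feed_shape, placeholder_shape))

print(check_feed_shape((20, 32, 32, 3), (1, 32, 32, 3)))     # False: issue 7
print(check_feed_shape((20, 32, 32, 3), (None, 32, 32, 3)))  # True: batch dim left open
```

On GPU/CPU one would normally declare the placeholder's batch dim as None, but dynamic shapes may not be an option on the NPU, which is presumably why capping TRAIN_NUM was the pragmatic workaround here.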
PR Submission: the Final Sprint
After an arduous struggle, daylight finally broke over the model contest: the whole model runs on the Ascend system and basically meets the contest requirements. What remained was fine-tuning.
Review feedback from the judges:
1. The program marks tf.train.string_input_producer as not-sunk. Remove that and enable mixed computing only.
2. VQ-VAE network issue: the data-preprocessing stage is driven by the loop below, which terminates by throwing an exception once the data runs out. On Ascend hardware this currently causes a core dump. Please change it to a different control flow:
try:
    while not coord.should_stop():
        x, y = sess.run([images, labels])
        k = sess.run(net.k, feed_dict={x_ph: x})
        ks.append(k)
        ys.append(y)
        print('.', end='', flush=True)
except tf.errors.OutOfRangeError:
    pass  # the loop only ends via this exception once the data runs out
Final VQVAE PR Cleanup
1. Enable only mixed computing for the whole program and drop all the per-op not-sunk settings. (This is the intended usage: the system sinks every operator it supports, and unsupported operators stay on the host by default, with no manual marking needed.)
2. Change the while loop to a for loop:
for step in tqdm(xrange(TRAIN_NUM), dynamic_ncols=True):
    x, y = sess.run([images, labels])
    k = sess.run(net.k, feed_dict={x_ph: x})
    ks.append(k)
    ys.append(y)
and set the number of loop steps:
config['TRAIN_NUM'] = 24
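The reviewers' request boils down to replacing exception-driven termination with a fixed step count. A TF-free sketch, with a toy iterator standing in for the input queue and sess.run:

```python
def run_fixed_steps(source, train_num):
    """Fixed-step loop, the shape the reviewers asked for.

    Instead of looping until the queue raises OutOfRangeError (which
    core-dumps on Ascend), iterate a known number of steps; the data
    pipeline just has to hold at least train_num batches.
    """
    results = []
    for _ in range(train_num):
        results.append(next(source))   # analogue of sess.run([images, labels])
    return results

batches = iter(range(100))             # toy stand-in for the input queue
print(len(run_fixed_steps(batches, 24)))  # 24 steps, no exception-driven exit
```

The trade-off is that TRAIN_NUM must be chosen so the loop never outruns the data; with the epoch limit removed from get_image, the producer cycles indefinitely and any step count is safe.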
I checked with Shaofang's side again: the second stage passes precisely because the get_image argument was removed, and since the loop step count is now fixed, this should not affect the overall result.
Change:    # image,label = get_image(num_epochs=1)
to:        image,label = get_image()
Then I submitted the PR, and it finally passed review. Hurrah! I was thrilled. The result itself matters less than the process of hitting and solving problems along the way; but without a result this write-up would have no standing, the effort might have been for nothing, and I would have learned less and remembered less.
Summary: Migrating the VQVAE TensorFlow Model to Ascend
The contest ran through sign-up, model selection, model migration, debugging, and PR submission, as described at length above. What a journey!
The migration contest was a great opportunity to learn and practice. I knew nothing about TensorFlow beforehand; after reading the code countless times during the contest, I now understand at least part of a TF program's flow. I had previously touched the Ascend system only through ModelArts notebooks and training jobs; this was the first time I could freely install software and fully control the system, in the Yitong environment. During debugging I worked directly with Huawei's developers, and their prompt, accurate troubleshooting deserves real credit. I am full of confidence in the Ascend system and the MindSpore AI framework!
The Silver stage of the model contest is coming soon; get ready to sign up!