Notes on the Open-Source Verification Task of Adapting CogVideoX to Ascend

p_xcn, posted 2024/11/28 11:00:31

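[Abstract] CogVideoX is a large video generation model developed by Zhipu AI. Without requiring complex video production skills or tools, it turns a short text description or a still image into a high-quality, visually appealing video. This article describes the problems encountered while setting up the CogVideoX project on Huawei's Ascend NPU, the solutions to them, and the lessons learned.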

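I. Background

CogVideoX is a large video generation model developed by Zhipu AI. Without requiring complex video production skills or tools, it turns a short text description or a still image into a high-quality, visually appealing video. This article describes the problems encountered while setting up the CogVideoX project on Huawei's Ascend NPU, the solutions to them, and the lessons learned.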

II. Resources

Ascend NPU:

Product name	Chip type	CANN version	Driver version	Operating system
Bastion host	Ascend 910B3	CANN 7.0.1.5	23.0.6	Huawei Cloud EulerOS 2.0

III. Problems and Solutions

For the detailed steps, see: https://blog.csdn.net/qq_54958500/article/details/143732793?spm=1001.2014.3001.5502

1. Problem: the operator is called asynchronously, so the stack trace may be inaccurate:

RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is Conv2D.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-11-26-14:42:44 (PID:1154952, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[W1114 14:42:44.066216328 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())

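Solution: set the following environment variable to get an accurate stack trace. This only makes the failing call easier to locate; the underlying error still has to be fixed separately: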

export ASCEND_LAUNCH_BLOCKING=1
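
When the demo is started from a Python script instead of a shell, the same variable can be set from inside the script. The snippet below is a minimal sketch, not the author's code; it assumes the variable is read when the NPU runtime initializes, so it should run before `torch` or `torch_npu` is imported. The helper name `enable_blocking_launch` is made up for illustration.

```python
import os


def enable_blocking_launch(env=None):
    """Set ASCEND_LAUNCH_BLOCKING=1 in the given mapping (os.environ by default)."""
    env = os.environ if env is None else env
    env["ASCEND_LAUNCH_BLOCKING"] = "1"
    return env


# Assumption: this has to happen before torch / torch_npu are imported,
# otherwise the runtime may already have been initialized without it.
enable_blocking_launch()
print("ASCEND_LAUNCH_BLOCKING =", os.environ["ASCEND_LAUNCH_BLOCKING"])
```

Synchronous launches are slower, so it is reasonable to remove the setting again once the failing operator has been identified.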

2. Problem: the tbe module is not found:

RuntimeError: InnerRun:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:218 OPS function error: Conv2D, error code is 500001
[ERROR] 2024-11-26-14:45:31 (PID:1155878, Device:0, RankID:-1) ERR01100 OPS call acl api failed
[Error]: The internal ACL of the system is incorrect.
        Rectify the fault based on the error information in the ascend log.
EC0010: Failed to import Python module [ModuleNotFoundError: No module named 'tbe'.].
        Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
        TraceBack (most recent call last):
        [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1623]
        [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:85]
        [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:126]
        [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:124]                                                                                                                 
        PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:96]
        OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:237]
        GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:165]
        [Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        [Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        [Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        [Init][Env]init env failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        build op model failed, result = 500001[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
                                                                                                                  
[W1114 14:45:31.433200230 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())

Solution: run the environment setup script, as the error message itself suggests:

source /usr/local/Ascend/ascend-toolkit/set_env.sh
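
To confirm that the setup script took effect in the current shell, one option is to check whether Python can locate the `tbe` module before launching the model. This is a small sketch using only the standard library; the helper names `module_available` and `missing_modules` are hypothetical, and the check only tells you whether the module is on the Python path, not whether the rest of CANN is healthy.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(name) is not None


def missing_modules(names=("tbe",)):
    """Return the subset of module names that Python cannot locate."""
    return [name for name in names if not module_available(name)]


missing = missing_modules()
if missing:
    # Mirrors the hint in the EC0010 log line: the CANN environment is not loaded.
    print("Not importable:", ", ".join(missing))
    print("Try: source /usr/local/Ascend/ascend-toolkit/set_env.sh")
else:
    print("tbe is importable from this environment")
```

Because `source` only affects the current shell, the check needs to run in the same session (or service environment) that will start the demo.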

3. Problem: an operator function is not found in libopapi.so:

RuntimeError: aclnnFusedInferAttentionScoreV2 or aclnnFusedInferAttentionScoreV2GetWorkspaceSize not in libopapi.so, or libopapi.sonot found.
[ERROR] 2024-11-26-14:55:15 (PID:1158036, Device:0, RankID:-1) ERR01004 OPS invalid pointer
[W1114 14:55:15.526441221 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())

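Solution: the installed torch version was too new and called an operator that this environment does not yet provide. Downgrade torch: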

pip install torch==2.1.0 torch_npu==2.1.0 torchvision==0.16.0
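
After reinstalling, it can help to verify that the `torch` and `torch_npu` packages actually installed are from the same release line. The sketch below compares only the major and minor version numbers, which is a heuristic chosen here for illustration rather than an official compatibility rule; the function names are hypothetical, and the distribution name `torch_npu` is taken from the pip command above.

```python
import re
from importlib import metadata


def release_prefix(version: str, parts: int = 2) -> tuple:
    """Extract the leading numeric components, e.g. '2.1.0.post3' -> (2, 1)."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:parts])


def versions_aligned(torch_version: str, torch_npu_version: str) -> bool:
    """Heuristic: treat the pair as aligned when major.minor are equal."""
    return release_prefix(torch_version) == release_prefix(torch_npu_version)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


torch_ver = installed_version("torch")
npu_ver = installed_version("torch_npu")
if torch_ver is None or npu_ver is None:
    print("torch or torch_npu is not installed in this environment")
elif not versions_aligned(torch_ver, npu_ver):
    print(f"Version mismatch: torch {torch_ver} vs torch_npu {npu_ver}")
else:
    print(f"torch {torch_ver} and torch_npu {npu_ver} share a release line")
```

A matching pair does not guarantee that every operator exists in the installed CANN version; it only rules out the most obvious mismatch.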

4. Problem: out of memory.

Solution: reduce the number of frames in the generated video and lower the model precision.
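
The post does not show how these two changes were made. As one possible illustration, the sketch below assumes the model is run through the Hugging Face `diffusers` `CogVideoXPipeline`, which may differ from the author's setup. The model ID `THUDM/CogVideoX-2b`, the half-precision dtype, the device string, and the helper names are all assumptions, as is the rule that frame counts should have the form 8N + 1 (taken from the CogVideoX project notes for its first-generation models). The VAE slicing and tiling calls are extra memory savers not mentioned in the post.

```python
def round_down_frames(requested: int, max_frames: int = 49) -> int:
    """Round down to the nearest frame count of the form 8*N + 1 (minimum 9)."""
    requested = max(9, min(requested, max_frames))
    return (requested - 1) // 8 * 8 + 1


def generate_low_memory(prompt, out_path="output.mp4", num_frames=25, device="npu:0"):
    """Generate a short clip with fewer frames and half-precision weights."""
    # Heavy imports are kept inside the function so the helper above
    # can be used without torch or diffusers installed.
    import torch
    import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # Lower precision: load the weights as float16 instead of float32.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-2b",  # assumed model ID, not stated in the post
        torch_dtype=torch.float16,
    )
    pipe.to(device)

    # Optional: decode the video in slices/tiles to lower peak VAE memory.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    # Fewer frames: a shorter clip needs less activation memory.
    frames = pipe(
        prompt=prompt,
        num_frames=round_down_frames(num_frames),
        num_inference_steps=50,
        guidance_scale=6.0,
    ).frames[0]
    export_to_video(frames, out_path, fps=8)
    return out_path
```

Calling `generate_low_memory(prompt, num_frames=25)` would request roughly half the frames of a 49-frame run; whether that is enough depends on the device memory available.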

5. Problem: the following error is raised:

SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats

Solution: this is usually caused by an incompatible Python version or build options. Switching the environment to Python 3.8 resolved it.
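
For context, CPython made the `PY_SSIZE_T_CLEAN` macro mandatory for `#` format codes starting with Python 3.10, so a C extension built without it can raise this error on newer interpreters. The check below is a minimal sketch that flags such interpreters at startup; the function name is hypothetical, and the post does not say which Python version originally produced the error.

```python
import sys


def requires_ssize_t_clean(version_info=sys.version_info) -> bool:
    """True for Python 3.10+, where '#' formats need PY_SSIZE_T_CLEAN defined."""
    return tuple(version_info[:2]) >= (3, 10)


if requires_ssize_t_clean():
    print(
        "Python %d.%d: extensions built without PY_SSIZE_T_CLEAN may fail; "
        "the post's workaround was a Python 3.8 environment." % sys.version_info[:2]
    )
else:
    print("Python %d.%d: not affected by the mandatory macro check" % sys.version_info[:2])
```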

IV. Verification Results

- Demo: load the pretrained CogVideoX model and generate a video from a detailed text description.

Prompt:

"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

The code was run on NPU:1.
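
The author's script is not reproduced in this post. As a rough sketch of how a specific card can be targeted, the helper below returns the device string `npu:1` when `torch_npu` is installed and that card exists, and falls back to `cpu` otherwise. The function name `pick_device` and the fallback behaviour are choices made here for illustration.

```python
def pick_device(index: int = 1) -> str:
    """Return 'npu:<index>' if that Ascend card is usable, else 'cpu'."""
    try:
        import torch
        import torch_npu  # noqa: F401  (adds the torch.npu namespace)
    except (ImportError, OSError):
        # torch / torch_npu missing, or the CANN shared libraries failed to load.
        return "cpu"
    if torch.npu.is_available() and index < torch.npu.device_count():
        return f"npu:{index}"
    return "cpu"


device = pick_device(1)
print("Running on", device)
```

The returned string can then be passed to `.to(device)` on the model or pipeline.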

[Generated video]

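V. Lessons Learned

Throughout the adaptation I ran into several technical problems, most of them related to environment configuration. I resolved them by reading the error messages and making the corresponding changes, and by asking questions in the project's Issues. The process deepened my understanding of the Ascend NPU environment and improved my ability to troubleshoot complex system integration. I hope these notes help other developers deploy and run the CogVideoX project in similar environments.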
