Notes on the Open-Source Verification Task of Adapting CogVideoX to Ascend

p_xcn, posted on 2024/11/28 11:00:31
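[Abstract] CogVideoX is a large video generation model developed by Zhipu AI. Without requiring complex video production skills or tools, it turns short text descriptions or still images into high-quality, visually appealing video. This article describes the problems encountered while setting up the CogVideoX project on Huawei's Ascend NPU, the solutions, and the lessons learned.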

I. Background

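CogVideoX is a large video generation model developed by Zhipu AI. Without requiring complex video production skills or tools, it turns short text descriptions or still images into high-quality, visually appealing video. This article describes the problems encountered while setting up the CogVideoX project on Huawei's Ascend NPU, the solutions, and the lessons learned.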

II. Resource List

Ascend NPU:

Product name: Bastion host
Chip type: Ascend 910B3
CANN version: CANN 7.0.1.5
Driver version: 23.0.6
Operating system: Huawei Cloud EulerOS 2.0

III. Problems Encountered and Solutions

For the detailed steps, see: https://blog.csdn.net/qq_54958500/article/details/143732793?spm=1001.2014.3001.5502

1. Problem: operators are called asynchronously, so the stack trace may be inaccurate:

RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is Conv2D.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-11-26-14:42:44 (PID:1154952, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[W1114 14:42:44.066216328 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())

Solution: set the environment variable to obtain a more accurate stack trace:

export ASCEND_LAUNCH_BLOCKING=1
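
The same switch can also be set from inside a Python script. The sketch below assumes the variable is read when the NPU runtime initializes, so it has to be set before `torch_npu` is imported; the helper name `launch_blocking_enabled` is made up for illustration.

```python
import os

# Assumption: this must run before `import torch_npu`, so that the runtime
# sees the variable when it initializes.
os.environ["ASCEND_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Return True when synchronous operator launch has been requested."""
    return os.environ.get("ASCEND_LAUNCH_BLOCKING") == "1"


print("ASCEND_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

Synchronous launch is a debugging aid; once the failing operator has been located, the variable can be unset again.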

2. Problem: the tbe module is not found:

RuntimeError: InnerRun:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:218 OPS function error: Conv2D, error code is 500001
[ERROR] 2024-11-26-14:45:31 (PID:1155878, Device:0, RankID:-1) ERR01100 OPS call acl api failed
[Error]: The internal ACL of the system is incorrect.
        Rectify the fault based on the error information in the ascend log.
EC0010: Failed to import Python module [ModuleNotFoundError: No module named 'tbe'.].
        Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
        TraceBack (most recent call last):
        [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1623]
        [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:85]
        [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:126]
        [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:124]                                                                                                                 
        PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:96]
        OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:237]
        GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:165]
        [Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        [Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        [Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        [Init][Env]init env failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        build op model failed, result = 500001[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
                                                                                                                  
[W1114 14:45:31.433200230 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())

Solution: run the environment setup script:

source /usr/local/Ascend/ascend-toolkit/set_env.sh
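
A quick way to confirm whether the script took effect is to check, from the same shell, that Python can locate the `tbe` module named in the error. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the hint it prints simply repeats the command above.

```python
import importlib.util
import os


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "tbe" is the module named in the error log; find_spec only looks the module
# up, it does not import it.
missing = missing_modules(["tbe"])
if missing:
    print("Not importable:", missing)
    print("PYTHONPATH =", os.environ.get("PYTHONPATH", "<unset>"))
    print("Try: source /usr/local/Ascend/ascend-toolkit/set_env.sh")
else:
    print("tbe is importable; the CANN Python path looks configured.")
```

Because `source` only affects the current shell, the check needs to run in the same session (or the `source` line needs to go into the shell profile).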

3. Problem: an operator function is not found in libopapi.so:

RuntimeError: aclnnFusedInferAttentionScoreV2 or aclnnFusedInferAttentionScoreV2GetWorkspaceSize not in libopapi.so, or libopapi.sonot found.
[ERROR] 2024-11-26-14:55:15 (PID:1158036, Device:0, RankID:-1) ERR01004 OPS invalid pointer
[W1114 14:55:15.526441221 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())

Solution: the torch version is too new and the operator is not yet supported, so downgrade torch:

pip install torch==2.1.0 torch_npu==2.1.0 torchvision==0.16.0
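
After reinstalling, it can help to verify that the installed torch and torch_npu packages belong to the same release series. The sketch below assumes that torch_npu releases track the torch major.minor version (for example 2.1.0.post3 for torch 2.1.0); the helper names are illustrative, and a matching pair still does not guarantee that the installed CANN version ships every operator.

```python
import re
from importlib import metadata


def release_series(version: str) -> tuple:
    """Reduce '2.1.0.post3' or '2.1.0+cpu' to the (major, minor) pair (2, 1)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_series(a: str, b: str) -> bool:
    """True when both version strings share the same major.minor series."""
    return release_series(a) == release_series(b)


def installed(name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


torch_v, npu_v = installed("torch"), installed("torch_npu")
if torch_v is None or npu_v is None:
    print("torch or torch_npu is not installed:", torch_v, npu_v)
elif not same_series(torch_v, npu_v):
    print(f"Version mismatch: torch {torch_v} vs torch_npu {npu_v}")
else:
    print(f"torch {torch_v} and torch_npu {npu_v} are in the same series")
```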

4. Problem: out of memory.

Solution: reduce the number of frames in the generated video and lower the model precision.
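
To see why lowering the precision helps, the memory needed just to hold the weights can be estimated as parameter count times bytes per parameter. The parameter count below is a hypothetical round number chosen for illustration, not a figure from this verification run, and the estimate ignores activations (which is where the frame count matters).

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed just to hold the weights, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Hypothetical parameter count, for illustration only.
num_params = 5_000_000_000
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```

Going from 32-bit to 16-bit weights halves this part of the footprint; the activation memory saved by generating fewer frames comes on top of that.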

5. Problem: the following error is reported:

SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats

Solution: this is usually caused by an incompatible Python version or build options; the environment was switched to Python 3.8.
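
As far as I know, CPython 3.10 turned the earlier deprecation warning for '#' format codes used without PY_SSIZE_T_CLEAN into this SystemError, which would explain why an older interpreter avoids it. A small guard at the top of the script can flag an unexpected interpreter early; this is only a sketch, and `matches_target` is a hypothetical helper.

```python
import sys

# Interpreter version the environment was switched to, per the fix above.
TARGET = (3, 8)


def matches_target(version_info, target=TARGET) -> bool:
    """Compare only (major, minor), ignoring the patch level."""
    return tuple(version_info[:2]) == tuple(target)


if not matches_target(sys.version_info):
    print(
        f"Running Python {sys.version_info.major}.{sys.version_info.minor}; "
        f"the fix above used Python {TARGET[0]}.{TARGET[1]}."
    )
```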

IV. Verification Results

- Demo: load the pretrained CogVideoX model and generate a video from a detailed text description.

Prompt

"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

Code run on NPU:1:
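
A minimal sketch of such a run follows. It is not the exact script used for this verification: it assumes the diffusers `CogVideoXPipeline` interface, the `THUDM/CogVideoX-2b` checkpoint in half precision, and illustrative sampling settings, none of which are stated in this post. The function names `generation_kwargs` and `run_demo` are made up for the example.

```python
def generation_kwargs(prompt: str, num_frames: int = 49) -> dict:
    """Collect the sampling arguments; the values are illustrative defaults."""
    return {
        "prompt": prompt,
        "num_frames": num_frames,  # fewer frames lowers memory use
        "num_inference_steps": 50,
        "guidance_scale": 6.0,
    }


def run_demo(prompt: str, device: str = "npu:1", out_path: str = "output.mp4"):
    """Generate a video for `prompt` on the given NPU and save it as MP4."""
    # Imported lazily so the helper above can be used without an NPU present.
    import torch
    import torch_npu  # noqa: F401  (registers the "npu" device with torch)
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # Assumption: the 2B checkpoint in float16; the post does not say which
    # variant or dtype was used.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-2b", torch_dtype=torch.float16
    )
    pipe.to(device)

    frames = pipe(**generation_kwargs(prompt)).frames[0]
    export_to_video(frames, out_path, fps=8)
    return out_path
```

On the NPU host, calling `run_demo("A panda, dressed in a small, red jacket ...")` with the full prompt above would write the result to `output.mp4`.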

Generated video:

V. Summary and Takeaways

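Several technical challenges came up during the adaptation, most of them related to environment configuration. Reading the error messages and making the corresponding changes, and asking questions in the project's Issues, eventually resolved all of them. The process deepened my understanding of the Ascend NPU environment and improved my ability to solve problems when integrating complex systems. I hope these notes help other developers deploy and run the CogVideoX project in similar environments.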

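[Copyright notice] This article is original content by a Huawei Cloud community user. When reposting, you must credit the source (Huawei Cloud community), the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community suspected of plagiarism, please report it by email with supporting evidence; once verified, the community will remove the infringing content immediately. Report email: cloudbbs@huaweicloud.com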