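Notes on Adapting CogVideoX to Ascend: An Open-Source Verification Task
I. Background
CogVideoX is a video generation model developed by Zhipu AI. It turns a short text description or a still image into a high-quality, visually appealing video, without requiring complex video production skills or tools. This article describes the problems encountered while setting up the CogVideoX project on Huawei's Ascend NPU, the solutions to them, and the lessons learned.
II. Resource List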
Ascend NPU:
Product name | Chip type | CANN version | Driver version | Operating system
Bastion host | Ascend 910B3 | CANN 7.0.1.5 | 23.0.6 | Huawei Cloud EulerOS 2.0
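III. Problems Encountered and Solutions
For the detailed steps, see: https://blog.csdn.net/qq_54958500/article/details/143732793?spm=1001.2014.3001.5502
1. Problem: operators are called asynchronously, so the stack trace may be inaccurate: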
RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is Conv2D.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-11-26-14:42:44 (PID:1154952, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[W1114 14:42:44.066216328 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())
Solution: set the following environment variable to get a more accurate stack trace:
export ASCEND_LAUNCH_BLOCKING=1
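When the script is started from an IDE or a notebook rather than a shell, the same variable can be set from Python. The following is a minimal sketch, assuming it runs before `torch` and `torch_npu` are imported; the helper name `enable_blocking_launch` is made up for illustration and is not part of any library.

```python
import os


def enable_blocking_launch() -> str:
    """Make NPU operator launches synchronous so stack traces are accurate.

    Hypothetical helper: call it before importing torch / torch_npu so the
    variable is already set when the runtime initialises.
    """
    os.environ["ASCEND_LAUNCH_BLOCKING"] = "1"
    return os.environ["ASCEND_LAUNCH_BLOCKING"]


if __name__ == "__main__":
    print("ASCEND_LAUNCH_BLOCKING =", enable_blocking_launch())
```

Synchronous launches are slower, so this setting is best kept for debugging and removed once the failing call has been located.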
2. Problem: the tbe module is not found:
RuntimeError: InnerRun:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:218 OPS function error: Conv2D, error code is 500001
[ERROR] 2024-11-26-14:45:31 (PID:1155878, Device:0, RankID:-1) ERR01100 OPS call acl api failed
[Error]: The internal ACL of the system is incorrect.
Rectify the fault based on the error information in the ascend log.
EC0010: Failed to import Python module [ModuleNotFoundError: No module named 'tbe'.].
Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
TraceBack (most recent call last):
[GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1623]
[SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:85]
[SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:126]
[FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:124]
PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:96]
OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:237]
GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:165]
[Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
[Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
[Init][Env]init env failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
build op model failed, result = 500001[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
[W1114 14:45:31.433200230 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())
Solution: run the environment setup script:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
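Because the script only affects the shell in which it is sourced, it can be useful to check from Python whether `tbe` is visible before starting a long run. The sketch below uses only the standard library; `module_available` and `check_tbe` are illustrative names, and the path in the hint is simply the one used in the command above.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(name) is not None


def check_tbe() -> str:
    """Report whether the CANN 'tbe' module is visible to this interpreter."""
    if module_available("tbe"):
        return "tbe is importable"
    return (
        "tbe is missing: run "
        "`source /usr/local/Ascend/ascend-toolkit/set_env.sh` in this shell"
    )


if __name__ == "__main__":
    print(check_tbe())
```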
3. Problem: an operator function is not found in libopapi.so:
RuntimeError: aclnnFusedInferAttentionScoreV2 or aclnnFusedInferAttentionScoreV2GetWorkspaceSize not in libopapi.so, or libopapi.sonot found.
[ERROR] 2024-11-26-14:55:15 (PID:1158036, Device:0, RankID:-1) ERR01004 OPS invalid pointer
[W1114 14:55:15.526441221 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())
Solution: the installed torch version is too new and the operator is not yet supported for it; downgrade torch:
pip install torch==2.1.0 torch_npu==2.1.0 torchvision==0.16.0
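After the downgrade, the installed versions can be compared against the pins from the command above. This is a rough sketch using `importlib.metadata` from the standard library; the helper names are hypothetical, and the comparison is deliberately simple (it ignores local build tags such as `+cpu` and nothing else).

```python
from importlib import metadata

# Versions pinned by the pip command above.
PINNED = {"torch": "2.1.0", "torch_npu": "2.1.0", "torchvision": "0.16.0"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to what was found."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # Ignore local build tags such as "+cpu" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = found
    return mismatches


if __name__ == "__main__":
    for package, found in find_mismatches(PINNED).items():
        print(f"{package}: expected {PINNED[package]}, found {found}")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed.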
4. Problem: out of memory.
Solution: reduce the number of frames in the generated video and lower the model precision.
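The two adjustments can be sketched as follows. This is an illustration under several assumptions rather than the code used in this task: it assumes the model is loaded through the `diffusers` `CogVideoXPipeline`, that the checkpoint is `THUDM/CogVideoX-2b`, and that 25 frames and `float16` are acceptable values. Valid frame counts depend on the model and the `diffusers` version, so check the model card before picking one.

```python
def generate_video(
    prompt: str,
    num_frames: int = 25,                  # assumed value, fewer than the default
    dtype_name: str = "float16",           # assumed lower-precision dtype
    device: str = "npu:0",
    model_id: str = "THUDM/CogVideoX-2b",  # assumed checkpoint
    output_path: str = "output.mp4",
) -> str:
    """Generate a short clip with fewer frames and half-precision weights."""
    # Heavy imports stay inside the function so the module loads without an NPU.
    import torch
    import torch_npu  # noqa: F401  (registers the "npu" device with torch)
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # Half precision roughly halves the memory taken by the weights
    # compared with float32.
    dtype = getattr(torch, dtype_name)
    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe.to(device)

    # Fewer frames means smaller latents and activations during sampling.
    result = pipe(prompt=prompt, num_frames=num_frames)
    export_to_video(result.frames[0], output_path, fps=8)
    return output_path
```

A call such as `generate_video("A panda playing guitar", device="npu:1")` would then run on the second NPU.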
5. Problem: the following error is reported:
SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats
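Solution: this is usually caused by an incompatible Python version or build options. Python 3.10 and later raise this error for C extensions built without the macro, while Python 3.8 only issues a deprecation warning, so the environment was switched to Python 3.8.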
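A quick way to see whether the current interpreter falls in the affected range is to inspect `sys.version_info`. The helper below is a hypothetical sketch; it encodes only the version boundary (3.10) at which CPython started raising `SystemError` for `#` formats used without `PY_SSIZE_T_CLEAN`.

```python
import sys


def hash_format_raises(version=None) -> bool:
    """True if this Python turns the missing-macro case into a SystemError.

    Python 3.10 and later raise; 3.8 and 3.9 only emit a DeprecationWarning.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor >= (3, 10)


if __name__ == "__main__":
    if hash_format_raises():
        print("This interpreter raises SystemError for affected extensions.")
    else:
        print("This interpreter only warns for affected extensions.")
```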
IV. Verification Results
- Demo: load the pretrained CogVideoX model and generate a video from a detailed text description.
Prompt:
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
Running the code on NPU:1:
Generated video:
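V. Takeaways
Throughout the adaptation I ran into several technical challenges, most of them related to environment configuration. Reading the error messages closely and making the corresponding changes, together with asking questions in the project's Issues, eventually resolved all of them. The process deepened my understanding of the Ascend NPU environment and improved my ability to solve problems in complex system integration. I hope these notes help other developers deploy and run CogVideoX in similar environments.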