CANN Learning Resources Open-Source Repo: Intermediate Operator Development (2) — Calling Operators via pybind

Posted by 黄生 on 2026/03/26 21:38:52
[Abstract] Calling an operator via pybind still performs the computation through the aclnn interface; the aclnn interface is simply wrapped as a Python function so it can be called conveniently from Python. First install the dependencies (in the atomgit ai environment they are already satisfied): pip install torch==2.9.0;pip install torch-npu==2.9.0;pip install pybind11;pip install setuptools; pi...

Calling an operator via pybind still performs the computation through the aclnn interface; the aclnn interface is simply wrapped as a Python function so it can be called conveniently from Python. First install the dependencies (in the atomgit ai environment they are already satisfied):

pip install torch==2.9.0;pip install torch-npu==2.9.0;pip install pybind11;pip install setuptools; pip install wheel

pybind11 (Python bindings for C++11) is a lightweight C++ library for exposing C++ code to Python by creating Python bindings, letting Python call C++ functions, classes, and objects directly. It is commonly used for low-level operator extensions in frameworks such as PyTorch, providing a seamless bridge between a high-performance C++ implementation and the Python front end.

Next, write the pybind wrapper code that exposes the aclnnAddCustomTemplate interface as a Python function named npu_add_custom_template. Using pybind11's PYBIND11_MODULE macro, the C++ npu_add_custom_template function is packaged into an extension module that Python can import and call. custom_op.cpp is as follows:

#include <torch/extension.h>
#include <torch/csrc/autograd/custom_function.h>
#include "pytorch_npu_helper.hpp"  // defines the EXEC_NPU_CMD macro used to invoke the two-phase aclnn interface

using torch::autograd::Function;
using torch::autograd::AutogradContext;
using tensor_list = std::vector<at::Tensor>;

using namespace at;
at::Tensor npu_add_custom_template(const at::Tensor &x, const at::Tensor &y) {
    at::Tensor z = at::empty_like(x);
    EXEC_NPU_CMD(aclnnAddCustomTemplate, x, y, z);
    return z;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    // m.def takes three arguments: the name exposed to Python,
    // a pointer to the C++ function, and an optional docstring
    m.def("npu_add_custom_template", &npu_add_custom_template, "torch add");
}

An .hpp file is a C++ header; the suffix stands for "C++ Header File". The main difference from .h is that .hpp is conventionally used for headers containing C++-specific constructs (classes, templates, and so on). Here, pytorch_npu_helper.hpp is a helper header from the Ascend NPU adaptation for PyTorch; it defines macros such as EXEC_NPU_CMD to simplify calling the two-phase ACLNN (Ascend Compute Library for Neural Network) interfaces.

Once the code is complete, it must be compiled into an extension module that Python can call. The pattern is fairly fixed: the setup function from the setuptools library specifies the extension module's name, version, author, description, and the source files to compile. Here setup.py lists custom_op.cpp as the source file and names the built module custom_ops_lib, so that in Python we can import custom_ops_lib to use the interface exposed by the PYBIND11_MODULE macro above. With setup.py written, compile and install the pybind-wrapped extension module:
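The article describes setup.py without reproducing it. A minimal sketch along these lines could work; note that the include/library paths and the custom aclnn library name (`cust_opapi`) are assumptions that must be adjusted to match your environment and the build command shown below:

```python
# setup.py — minimal sketch for building the pybind extension.
# ASSUMPTIONS: the torch_npu include path and the custom operator
# library name/location are illustrative, not taken from the article.
import os

import torch_npu
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

PYTORCH_NPU_PATH = os.path.dirname(os.path.abspath(torch_npu.__file__))
# Matches the LD_LIBRARY_PATH used in the build command below
CUSTOM_OP_LIB = os.path.join(os.environ["HOME"], "vendors/customize/op_api/lib")

setup(
    name="custom_ops",                  # wheel name: custom_ops-1.0-*.whl
    version="1.0",
    ext_modules=[
        CppExtension(
            "custom_ops_lib",           # import name: import custom_ops_lib
            sources=["custom_op.cpp"],
            include_dirs=[os.path.join(PYTORCH_NPU_PATH, "include")],
            library_dirs=[CUSTOM_OP_LIB],
            libraries=["cust_opapi"],   # assumed name of the aclnn wrapper library
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

CppExtension takes care of adding the PyTorch include directories and linking against libtorch, which is why the C++ file can include torch/extension.h without any extra flags.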

export LD_LIBRARY_PATH=${HOME}/vendors/customize/op_api/lib/:$LD_LIBRARY_PATH
cd Sources/pybind_op/
python3 setup.py build bdist_wheel
pip3 install dist/custom_ops*.whl --force-reinstall

Output:

...
running install_scripts
creating build/bdist.linux-aarch64/wheel/custom_ops-1.0.dist-info/WHEEL
creating 'dist/custom_ops-1.0-cp311-cp311-linux_aarch64.whl' and adding 'build/bdist.linux-aarch64/wheel' to it
adding 'custom_ops_lib.cpython-311-aarch64-linux-gnu.so'
adding 'custom_ops-1.0.dist-info/METADATA'
adding 'custom_ops-1.0.dist-info/WHEEL'
adding 'custom_ops-1.0.dist-info/top_level.txt'
adding 'custom_ops-1.0.dist-info/RECORD'
removing build/bdist.linux-aarch64/wheel
path string is NULLpath string is NULLDefaulting to user installation because normal site-packages is not writeable
Processing ./dist/custom_ops-1.0-cp311-cp311-linux_aarch64.whl
Installing collected packages: custom-ops
Successfully installed custom-ops-1.0

After installing the pybind-wrapped extension module, import it with import custom_ops_lib and call the custom operator's interface just like an ordinary Python function. Note that the custom operator runs on the NPU, so the input data must be converted to NPU tensors when calling it, and the output must be converted back to a CPU tensor before printing or saving. The code:

# Call the pybind-wrapped npu_add_custom_template function from Python
# to run the custom operator we developed
import torch
import torch_npu
import custom_ops_lib

torch.npu.config.allow_internal_format = False

length = [8, 2048]
x = torch.rand(length, device='cpu', dtype=torch.float16)
y = torch.rand(length, device='cpu', dtype=torch.float16)
golden = x + y
output_npu = custom_ops_lib.npu_add_custom_template(x.npu(), y.npu())

print("is same:", torch.allclose(golden, output_npu.cpu(), rtol=0.001, atol=0.001))

Run:

source ${HOME}/vendors/customize/bin/set_env.bash; python Sources/pybind_op/test_op.py

Output:

path string is NULLpath string is NULL[W326 21:01:05.514101901 compiler_depend.ts:338] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
opType=AddCustomTemplate, DumpHead: AIV-0, CoreType=AIV, block dim=8, total_block_num=8, block_remain_len=1048384, block_initial_space=1048576, rsv=0, magic=5aa5bccd
CANN Version: 8.5.0, TimeStamp: 202507
Core 0 executed 1 times in total
opType=AddCustomTemplate, DumpHead: AIV-1, CoreType=AIV, block dim=8, total_block_num=8, block_remain_len=1048384, block_initial_space=1048576, rsv=0, magic=5aa5bccd
CANN Version: 8.5.0, TimeStamp: 202507
Core 1 executed 1 times in total
...
opType=AddCustomTemplate, DumpHead: AIV-7, CoreType=AIV, block dim=8, total_block_num=8, block_remain_len=1048384, block_initial_space=1048576, rsv=0, magic=5aa5bccd
CANN Version: 8.5.0, TimeStamp: 202507
Core 7 executed 1 times in total
is same: True
[Disclaimer] This content comes from a blogger in the Huawei Cloud Developer Community and does not represent the views or positions of Huawei Cloud or the Huawei Cloud Developer Community. Reposts must credit the source (Huawei Cloud community), the article link, and the author; otherwise the author and the community reserve the right to pursue liability. If you find suspected plagiarism in the community, please report it by email with supporting evidence to cloudbbs@huaweicloud.com; confirmed infringing content will be removed immediately.