- 微信
- 微博
  
  分享文章到微博
- 复制链接
  
  复制链接到剪贴板

CANN学习资源开源仓的自定义算子快速入门

黄生发表于 2026/03/21 21:09:29 2026/03/21

【摘要】这是一个来自于CANN学习资源开源仓(https://gitcode.com/cann/cann-learning-hub)的自定义算子快速入门。验证过的环境是CANN8.5，我用的环境是gitcode提供的notebook，里面镜像提供的就是CANN8.5。为了学习而简化后的算子设计如下：算子类型(OpType) Add 算子输入 name shape ...

这是一个来自于CANN学习资源开源仓(https://gitcode.com/cann/cann-learning-hub)的自定义算子快速入门。验证过的环境是CANN8.5，我用的环境是gitcode提供的notebook，里面镜像提供的就是CANN8.5。为了学习而简化后的算子设计如下：

算子类型(OpType)	Add
算子输入	name	shape	data type	format
	x	(8, 2048)	float	ND
	y	(8, 2048)	float	ND
算子输出	z	(8, 2048)	float	ND
核函数名	add
使用核数	8

简化后，shape是固定的（一般算子可能是多个shape或all不限定），没有泛化能力。还有，数据类型只有float，没有其他的比如fp16和bf16等之类的，处理极度简单。另外虽然有2个输入，但是数据类型一致，也是极度简单，缺乏输入之间不同数据类型组合的复杂情况。输出也是固定的数据类型，整个算子也没有属性。所以本例子可以做为一个了解，但是离真正开发实际的算子，差距还有很多。

本例中，可以学习到核函数的开发（定义和实现类）、tiling只是做个样子（极度简化，与实际tiling设计相差万里），acl对于内存的拷贝，核函数的调用、结果比对等等。下面是笔记。

核函数的参数：

仅支持入参为指针或C/C++内置数据类型（Primitive data types），如：half* s0、float* s1、int32_t c。
为了统一表达，建议使用GM_ADDR宏来修饰入参，其为是编译器中自带的宏，代表的含义为Global Memory中的地址，其定义如下:
```
#define GM_ADDR __gm__ uint8_t*
```

示例如下：

extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)

由于必然涉及tiling结构体，所以核函数中还要在最后增加了一个tiling结构体参数。

核函数实现类的矢量编程范式：

CopyIn，Compute，CopyOut。

CopyIn：将输入数据从Global Memory搬运到Local Memory，完成搬运后执行入队列操作；
- 使用DataCopy接口将GlobalTensor数据拷贝到LocalTensor。
- 使用EnQue将LocalTensor放入VecIn的Queue中。
Compute：完成队列出队后，从Local Memory获取数据并计算，计算完成后执行入队操作；
- 使用DeQue从VecIn中取出LocalTensor。
- 使用Ascend C接口Add完成矢量计算。
- 使用EnQue将计算结果LocalTensor放入到VecOut的Queue中。
- 使用FreeTensor将释放不再使用的LocalTensor。
CopyOut：完成队列出队后，将计算结果从Local Memory搬运到Global Memory。
- 使用DeQue接口从VecOut的Queue中取出LocalTensor。
- 使用DataCopy接口将LocalTensor拷贝到GlobalTensor上。
- 使用FreeTensor将不再使用的LocalTensor进行回收

核函数实现类的初始化函数Init：

设置输入输出Global Tensor的Global Memory内存地址。
通过TPipe内存管理对象为输入输出Queue分配内存。

以上是NPU上执行的部分，然后可以做host侧的调用。

核函数调用：通过<<<…>>>内核调用符进行算子调用。当然在调用之前，先要把数据从host拷贝到device(npu)上。
1. kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);
计算结果比对：比对golden数据核实际输出，验证算子精度
算子验证主程序：生成输入及golden数据，用来进行算子验证

最后编译命令是：

bisheng Sources/add.asc --npu-arch=dav-2201 -o add

运行程序：

./add

附完整代码add.asc：


#include <cstdint>
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include "acl/acl.h"
#include "kernel_operator.h"

constexpr uint32_t BUFFER_NUM = 2; // tensor num for each queue

struct AddCustomTilingData
{
    uint32_t totalLength;
    uint32_t tileNum;
};

class KernelAdd {
public:
    __aicore__ inline KernelAdd(){}
    // 初始化函数，完成内存初始化相关操作
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum);
    // 核心处理函数，实现算子逻辑，调用私有成员函数CopyIn、Compute、CopyOut完成矢量算子的三级流水操作
    __aicore__ inline void Process();

private:
    // 搬入函数，从Global Memory搬运数据至Local Memory，被核心Process函数调用
    __aicore__ inline void CopyIn(int32_t progress);
    // 计算函数，完成两个输入参数相加，得到最终结果，被核心Process函数调用
    __aicore__ inline void Compute(int32_t progress);
    // 搬出函数，将最终结果从Local Memory搬运到Global Memory上，被核心Process函数调用
    __aicore__ inline void CopyOut(int32_t progress);

private:
    AscendC::TPipe pipe;  // TPipe内存管理对象
    AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;  // 输入数据Queue队列管理对象，TPosition为VECIN
    AscendC::TQue<AscendC::TPosition::VECOUT, BUFFER_NUM> outQueueZ;  // 输出数据Queue队列管理对象，TPosition为VECOUT
    AscendC::GlobalTensor<float> xGm;  // 管理输入输出Global Memory内存地址的对象，其中xGm, yGm为输入，zGm为输出
    AscendC::GlobalTensor<float> yGm;
    AscendC::GlobalTensor<float> zGm;
    uint32_t blockLength; // 每个核的计算数据长度
    uint32_t tileNum; // 每个核需要计算的数据块个数
    uint32_t tileLength; // 每个核内每个数据块的长度
};

__aicore__ inline void KernelAdd::Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum)
{

     this->blockLength = totalLength / AscendC::GetBlockNum();     // length computed of each core
     this->tileNum = tileNum;                                      // split data into 8 tiles for each core
     this->tileLength = this->blockLength / tileNum / BUFFER_NUM;  // separate to 2 parts, due to double buffer
     // get start index for current core, core parallel
     xGm.SetGlobalBuffer((__gm__ float *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
     yGm.SetGlobalBuffer((__gm__ float *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
     zGm.SetGlobalBuffer((__gm__ float *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
     // pipe alloc memory to queue, the unit is Bytes
     pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(float));
     pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(float));
     pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(float));
}

__aicore__ inline void KernelAdd::Process()
{
    // loop count need to be doubled, due to double buffer
    int32_t loopCount = this->tileNum * BUFFER_NUM;
    // tiling strategy, pipeline parallel
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);
        Compute(i);
        CopyOut(i);
    }
}

__aicore__ inline void KernelAdd::CopyIn( int32_t progress)
{
    // alloc tensor from queue memory
    AscendC::LocalTensor<float> xLocal = inQueueX.AllocTensor<float>();
    AscendC::LocalTensor<float> yLocal = inQueueY.AllocTensor<float>();
    // copy progress_th tile from global tensor to local tensor
    AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
    AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
    // enque input tensors to VECIN queue
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}

__aicore__ inline void KernelAdd::Compute(int32_t progress)
{
    // deque input tensors from VECIN queue
    AscendC::LocalTensor<float> xLocal = inQueueX.DeQue<float>();
    AscendC::LocalTensor<float> yLocal = inQueueY.DeQue<float>();
    AscendC::LocalTensor<float> zLocal = outQueueZ.AllocTensor<float>();
    // call Add instr for computation
    AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);
    // enque the output tensor to VECOUT queue
    outQueueZ.EnQue<float>(zLocal);
    // free input tensors for reuse
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}

__aicore__ inline void KernelAdd::CopyOut(int32_t progress)
{
    // deque output tensor from VECOUT queue
    AscendC::LocalTensor<float> zLocal = outQueueZ.DeQue<float>();
    // copy progress_th tile from local tensor to global tensor
    AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
    // free output tensor for reuse
    outQueueZ.FreeTensor(zLocal);
}

__global__ __aicore__ void add(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
{
    KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);    // 设置Kernel类型为Vector核（用于矢量计算）
    KernelAdd op;
    op.Init(x, y, z, tiling.totalLength, tiling.tileNum);
    op.Process();
}

std::vector<float> kernel_add(std::vector<float> &x, std::vector<float> &y)
{
    constexpr uint32_t blockDim = 8;
    uint32_t totalLength = x.size();
    size_t totalByteSize = totalLength * sizeof(float);
    int32_t deviceId = 0;
    aclrtStream stream = nullptr;
    AddCustomTilingData tiling = {/*totalLength:*/totalLength, /*tileNum:*/8};
    uint8_t *xHost = reinterpret_cast<uint8_t *>(x.data());
    uint8_t *yHost = reinterpret_cast<uint8_t *>(y.data());
    uint8_t *zHost = nullptr;
    uint8_t *xDevice = nullptr;
    uint8_t *yDevice = nullptr;
    uint8_t *zDevice = nullptr;

    // 初始化
    aclInit(nullptr);
    // 运行管理资源申请
    aclrtSetDevice(deviceId);
    aclrtCreateStream(&stream);
    // 分配Host内存
    aclrtMallocHost((void **)(&zHost), totalByteSize);
    // 分配Device内存
    aclrtMalloc((void **)&xDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **)&yDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **)&zDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    // 将Host上的输入数据拷贝到Device侧
    aclrtMemcpy(xDevice, totalByteSize, xHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);
    aclrtMemcpy(yDevice, totalByteSize, yHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);
    // 用内核调用符<<<...>>>调用核函数完成指定的运算
    add<<<blockDim, nullptr, stream>>>(xDevice, yDevice, zDevice, tiling);
    aclrtSynchronizeStream(stream);
    // 将Device上的运算结果拷贝回Host
    aclrtMemcpy(zHost, totalByteSize, zDevice, totalByteSize, ACL_MEMCPY_DEVICE_TO_HOST);
    std::vector<float> z((float *)zHost, (float *)(zHost + totalLength));
    // 释放申请的资源
    aclrtFree(xDevice);
    aclrtFree(yDevice);
    aclrtFree(zDevice);
    aclrtFreeHost(zHost);
    // 去初始化
    aclrtDestroyStream(stream);
    aclrtResetDevice(deviceId);
    aclFinalize();
    return z;
}

uint32_t VerifyResult(std::vector<float> &output, std::vector<float> &golden)
{
    auto printTensor = [](std::vector<float> &tensor, const char *name) {
        constexpr size_t maxPrintSize = 20;
        std::cout << name << ": ";
        std::copy(tensor.begin(), tensor.begin() + std::min(tensor.size(), maxPrintSize),
            std::ostream_iterator<float>(std::cout, " "));
        if (tensor.size() > maxPrintSize) {
            std::cout << "...";
        }
        std::cout << std::endl;
    };
    printTensor(output, "Output");
    printTensor(golden, "Golden");
    if (std::equal(output.begin(), output.end(), golden.begin())) {
        std::cout << "[Success] Case accuracy is verification passed." << std::endl;
        return 0;
    } else {
        std::cout << "[Failed] Case accuracy is verification failed!" << std::endl;
        return 1;
    }
    return 0;
}

int32_t main(int32_t argc, char *argv[])
{
    constexpr uint32_t totalLength = 8 * 2048;
    constexpr float valueX = 1.2f;
    constexpr float valueY = 2.3f;
    std::vector<float> x(totalLength, valueX);
    std::vector<float> y(totalLength, valueY);

    std::vector<float> output = kernel_add(x, y);

    std::vector<float> golden(totalLength, valueX + valueY);
    return VerifyResult(output, golden);
}

【声明】本内容来自华为云开发者社区博主，不代表华为云及华为云开发者社区的观点和立场。转载时必须标注文章的来源（华为云社区）、文章链接、文章作者等基本信息，否则作者和本社区有权追究责任。如果您发现本社区中有涉嫌抄袭的内容，欢迎发送邮件进行举报，并提供相关证据，一经查实，本社区将立刻删除涉嫌侵权内容，举报邮箱： cloudbbs@huaweicloud.com

点赞
收藏
关注作者

0/1000

抱歉，系统识别当前为高风险访问，暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称，即可参与社区互动！

*长度不超过10个汉字或20个英文字符，设置后3个月内不可修改。

确认取消

加入云驻计划，成为创作者

华为云周边好礼
免费体验产品
特殊身份标识
线下官方门票
内部专家零距离
与10000+优质创作者共同成长

立即加入

CANN学习资源开源仓的自定义算子快速入门

全部回复

设置昵称

关于作者

目录

加入云驻计划，成为创作者

CANN学习资源开源仓的自定义算子快速入门

全部回复

设置昵称

关于作者

目录

热门推荐查看更多

相关文章

加入云驻计划，成为创作者

相关产品