Introduction to CANNBot

Posted by 黄生 on 2026/04/23 20:31:20

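CANNBot is a family of agents built on AI Agent technology for automatically developing operators. It is compatible with a range of large models (though you are advised to use the strongest one available) and Agent frameworks. What the repository provides is the skills module.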

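Three-layer architecture:

Example prompts:

Develop an abs operator for me that supports the float16 data type; the main shapes are [1,128], [4,2048], and [32,4096]
Read requirement.md and develop a custom operator according to the requirements
Develop a softmax operator for me that supports the float16 data type; the main shapes are [1,128], [4,2048], and [32,4096]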

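The hands-on steps follow. I worked from the CANNBot PyPTO livestream, but for reasons I have not figured out, the generated operator is an Ascend C one.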

git clone https://gitcode.com/cann/skills.git
cd skills/ops/teams/pypto-op-orchestrator/
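bash init.sh project opencode # project-level rather than global; generates the .opencode directory

# If the opencode CLI is not installed, this is the install command
curl -fsSL https://opencode.ai/install | bash # slow; the downloaded installer can be cached for the next install, and downloading on the host was faster than inside the container
export TERM=xterm-256color # needed when running in a JupyterLab terminal
opencode # enter the third example prompt above, then watch it work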

Structure of the .opencode directory:

.opencode/
├── agents
│   ├── ascendc-kernel-architect.md -> /home/atomgit/skills/ops/agents/ascendc-kernel-architect.md
│   ├── ascendc-kernel-developer.md -> /home/atomgit/skills/ops/agents/ascendc-kernel-developer.md
│   └── ascendc-kernel-reviewer.md -> /home/atomgit/skills/ops/agents/ascendc-kernel-reviewer.md
├── AGENTS.md -> /home/atomgit/skills/ops/teams/pypto-op-orchestrator/AGENTS.md
├── cannbot-manifest.json
├── node_modules
├── package.json
├── package-lock.json
└── skills
    ├── ascendc-api-best-practices -> /home/atomgit/skills/ops/skills/ascendc-api-best-practices
    ├── ascendc-code-review -> /home/atomgit/skills/ops/skills/ascendc-code-review
    ├── ascendc-direct-invoke-template -> /home/atomgit/skills/ops/skills/ascendc-direct-invoke-template
    ├── ascendc-docs-search -> /home/atomgit/skills/ops/skills/ascendc-docs-search
    ├── ascendc-env-check -> /home/atomgit/skills/ops/skills/ascendc-env-check
    ├── ascendc-npu-arch -> /home/atomgit/skills/ops/skills/ascendc-npu-arch
    ├── ascendc-precision-debug -> /home/atomgit/skills/ops/skills/ascendc-precision-debug
    ├── ascendc-runtime-debug -> /home/atomgit/skills/ops/skills/ascendc-runtime-debug
    ├── ascendc-tiling-design -> /home/atomgit/skills/ops/skills/ascendc-tiling-design
    ├── ops-precision-standard -> /home/atomgit/skills/ops/skills/ops-precision-standard
    └── ops-profiling -> /home/atomgit/skills/ops/skills/ops-profiling

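The softmax operator directory generated after I submitted the request (I believe I interrupted it before generation finished; it used opencode's built-in model rather than glm5.1, currently the strongest):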

ops/softmax/
├── build/
│   ├── autogen/
│   ├── CMakeCache.txt
│   ├── CMakeFiles/
│   ├── cmake_install.cmake
│   └── Makefile
├── CMakeLists.txt
├── CMakeLists.txt.bak
├── data_utils.h
├── docs/
│   ├── DESIGN.md
│   ├── environment.json
│   ├── PLAN.md
│   ├── REVIEW.md
│   └── WALKTHROUGH.md
├── README.md
├── run.sh
├── scripts/
│   ├── gen_data.py
│   └── verify_result.py
├── softmax.asc
└── test/

Below is the README.md it generated, which looks quite presentable:

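Softmax Operator Implementation

Overview

This directory implements a Softmax operator using the Ascend C kernel direct-invocation approach. It applies the normalized exponential to the input along axis=-1.

Operator specification

Property	Value
Data type	float16 (FP16)
Reduction axis	axis = -1 (innermost dimension only)
Shape range	M ≤ 8192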

Mathematical definition

softmax(x)[i] = exp(x[i]) / sum(exp(x[:]))

Numerical stability

Shift the input by x - max(x) before computing exp, so that exp does not overflow on large inputs:

maxVal = max(x)
x' = x - maxVal
expVal = exp(x')
sumVal = sum(expVal)
y = expVal / sumVal
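
The five steps above can be sketched in plain Python (not part of the generated README). The shift is safe because softmax(x - c) equals softmax(x) for any constant c, and choosing c = max(x) keeps every exponent at or below zero. The function names and the sample row are hypothetical, and the FP16 limit of 65504 is used only to show why the unshifted form is risky for half precision.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def softmax_stable(row):
    """Row-wise softmax using the max-shift steps listed above."""
    max_val = max(row)                         # maxVal = max(x)
    shifted = [v - max_val for v in row]       # x' = x - maxVal, so every x' <= 0
    exp_val = [math.exp(v) for v in shifted]   # every term lies in (0, 1]
    sum_val = sum(exp_val)                     # >= 1, since the max element maps to exp(0)
    return [e / sum_val for e in exp_val]      # y = expVal / sumVal


def softmax_naive(row):
    """Unshifted form, shown only to illustrate the overflow risk."""
    exp_val = [math.exp(v) for v in row]
    sum_val = sum(exp_val)
    return [e / sum_val for e in exp_val]


if __name__ == "__main__":
    row = [10.0, 11.0, 12.0]  # hypothetical input row
    # exp(12) already exceeds the float16 range, so an unshifted half-precision
    # kernel would overflow on this modest input.
    print(math.exp(12.0) > FP16_MAX)
    print(softmax_stable(row))
```

In double precision the naive form only fails for much larger inputs (around 710), but the shifted form gives the same result for every row that differs by a constant offset.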

Computation flow

GM [N, M]
   ↓ DataCopyPad
UB [rowsPerCore, rLengthAlign]
   ↓ ReduceMax → maxVal (scalar)
   ↓ Duplicate + Sub (x - maxVal)
   ↓ Exp
   ↓ ReduceSum → sumVal (scalar)
   ↓ Duplicate + Div (exp / sumVal)
   ↓ DataCopyPad → GM

Directory structure

softmax/
├── docs/                   # design documents
│   ├── DESIGN.md           # technical design
│   └── PLAN.md             # development plan
├── build/                  # build output
│   ├── softmax_custom      # executable
│   ├── input/              # test input
│   └── output/             # test output
├── scripts/
│   ├── gen_data.py         # data generation
│   └── verify_result.py    # precision verification
├── softmax.asc             # kernel implementation
├── CMakeLists.txt          # build configuration
├── run.sh                  # run script
└── data_utils.h            # data utilities

Quick start

Build

cd build
cmake ..
make -j4

Run the tests

# Default test [4, 4096]
./softmax_custom

# Specify the shape
./softmax_custom 32 4096

# Full flow (build + test + verify)
bash ../run.sh

Verify precision

python3 scripts/gen_data.py 32 4096
./softmax_custom 32 4096
python3 scripts/verify_result.py output/output.bin output/golden.bin

Test cases

Case	Shape	Precision check
TC_001	[1, 128]	✅ PASS
TC_002	[4, 2048]	✅ PASS
TC_003	[32, 4096]	✅ PASS

Precision standard

dtype	atol	rtol	Actual max error
float16	1e-3	1e-2	1.22e-4
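
The atol and rtol columns suggest an element-wise tolerance test. Below is a minimal Python sketch, assuming the common combined criterion |out - golden| <= atol + rtol * |golden| (the convention used by numpy.allclose). The README does not show whether verify_result.py applies exactly this rule, and the function names and sample values here are hypothetical.

```python
def max_abs_error(output, golden):
    """Largest element-wise absolute difference between output and golden."""
    return max(abs(o - g) for o, g in zip(output, golden))


def within_tolerance(output, golden, atol=1e-3, rtol=1e-2):
    """True if every element satisfies |out - golden| <= atol + rtol * |golden|.

    The defaults are the float16 thresholds from the table; the combined
    criterion itself is an assumption, not taken from verify_result.py.
    """
    return all(
        abs(o - g) <= atol + rtol * abs(g)
        for o, g in zip(output, golden)
    )


if __name__ == "__main__":
    golden = [0.25, 0.25, 0.5]          # hypothetical reference values
    output = [0.2501, 0.2499, 0.5001]   # hypothetical kernel output
    print(max_abs_error(output, golden))
    print(within_tolerance(output, golden))
```

Under this criterion, a maximum error of 1.22e-4 is already below atol on its own, so the check passes regardless of the rtol term.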

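Key implementation points

1. half type limitations

In Ascend C, the half type does not support direct arithmetic inside a kernel function (for example -maxVal or 1.0f / sumVal).

Workaround:

Use Duplicate + Sub in place of Adds
Use Duplicate + Div in place of Muls

2. Tiling strategy

Multi-core split: split along the row dimension, with each core handling multiple rows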

rowsPerCore = ceil(rows / usedCores)

Full UB load: an entire row of data is processed in a single pass

rLengthAlign = ((cols + 15) / 16) * 16  // align to 32 bytes (16 half elements)
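
The two tiling formulas can be worked through for the requested shapes with a short Python sketch. The core count of 8 and the rule usedCores = min(rows, available cores) are assumptions for illustration; the README states neither, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def plan_tiling(rows, cols, available_cores, align_elems=16):
    """Return (used_cores, rows_per_core, r_length_align).

    align_elems=16 because 16 half (2-byte) elements make up 32 bytes.
    Capping used_cores at the row count is an assumption of this sketch.
    """
    used_cores = min(rows, available_cores)
    rows_per_core = ceil_div(rows, used_cores)               # rowsPerCore = ceil(rows / usedCores)
    r_length_align = ceil_div(cols, align_elems) * align_elems  # ((cols + 15) / 16) * 16
    return used_cores, rows_per_core, r_length_align


if __name__ == "__main__":
    for shape in [(1, 128), (4, 2048), (32, 4096)]:
        print(shape, plan_tiling(*shape, available_cores=8))
```

All three requested column counts are already multiples of 16, so the alignment only changes the length for other widths; a hypothetical 100-column row would be padded to 112.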

3. Buffer configuration

Buffer	Size	Purpose
inQueueX	rowsPerCore × rLengthAlign × 2B	Input queue
outQueueY	Same as above	Output queue
rowBuf	rLengthAlign × 2B	Row-level scratch
reduceBuf	32KB	Reduce scratch
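
Adding up the sizes in the table gives the UB footprint of a tiling choice. The Python sketch below does this arithmetic. The UB capacity is hardware-dependent and is not given in the README, so the 192 KiB figure is only a placeholder assumption, and the function names are hypothetical.

```python
HALF_BYTES = 2                 # size of one float16 element
REDUCE_BUF_BYTES = 32 * 1024   # reduceBuf, fixed at 32KB in the table


def ub_bytes(rows_per_core, r_length_align):
    """Total bytes for inQueueX + outQueueY + rowBuf + reduceBuf."""
    in_queue = rows_per_core * r_length_align * HALF_BYTES
    out_queue = in_queue                      # outQueueY is the same size as inQueueX
    row_buf = r_length_align * HALF_BYTES
    return in_queue + out_queue + row_buf + REDUCE_BUF_BYTES


def fits_in_ub(rows_per_core, r_length_align, ub_capacity_bytes):
    """True if the four buffers fit within the given UB capacity."""
    return ub_bytes(rows_per_core, r_length_align) <= ub_capacity_bytes


if __name__ == "__main__":
    ASSUMED_UB_CAPACITY = 192 * 1024  # placeholder; check the target NPU's actual UB size
    for rows_per_core, r_len in [(1, 128), (1, 2048), (4, 4096)]:
        total = ub_bytes(rows_per_core, r_len)
        print(rows_per_core, r_len, total, fits_in_ub(rows_per_core, r_len, ASSUMED_UB_CAPACITY))
```

Because inQueueX and outQueueY scale with rowsPerCore × rLengthAlign, the footprint grows quickly when many long rows are assigned to one core, which is one reason a bound such as M ≤ 8192 is natural for a full-row design.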

API reference

API	Purpose	Parameters
`ReduceMax`	Maximum	`(dst, src, tmp, count, calIndex)`
`ReduceSum`	Sum	`(dst, src, tmp, count)`
`Sub`	Broadcast subtraction	`(dst, src0, src1, count)`
`Div`	Division	`(dst, src0, src1, count)`
`Exp`	Exponential	`(dst, src, count)`
`Duplicate`	Scalar broadcast	`(dst, scalar, count)`

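Performance considerations

Each core processes its rows independently, which suits workloads with many rows
Single buffering is used, avoiding the extra overhead of double buffering
Reduce operations use the Level 2 interface, which requires no alignment

References

Ascend C API best practices ../ascendc-api-best-practices/
Reduce API documentation ../ascendc-api-best-practices/references/api-reduce.md
Operator precision standard ../ops-precision-standard/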
