Ascend C Operator Performance Optimization Practical Tips 02: Memory Optimization

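Posted by 昇腾CANN on 2024/08/29 11:15:54
[Abstract] A growing number of developers are using Ascend C. In this series on Ascend C operator performance optimization, we focus on the part developers care about most, operator performance tuning, and introduce common optimization techniques for Ascend C operators so that developers can build better-performing operators on their own. The series covers pipeline optimization, data-movement optimization, memory optimization, API usage optimization, and Tiling optimization, and presents each through solution walkthroughs, optimization cases, and performance comparisons.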

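Ascend C is the programming language that CANN provides for operator development. It natively supports the C and C++ standards and balances development efficiency with runtime performance. With Ascend C, developers can efficiently implement custom, innovative algorithms on Ascend AI hardware.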

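The previous installment covered "Ascend C Operator Performance Optimization Practical Tips 01: Pipeline Optimization". This installment looks at memory and introduces several practical memory optimization techniques:

  • Fuse computation in the Unified Buffer to run consecutive vector computations
  • Keep intermediate data in the L0C Buffer to accumulate matrix multiplication results efficiently
  • Keep the smaller matrix resident in the L1 Buffer and move only the larger matrix in blocks
  • Use the BT Buffer for efficient bias computation
  • Store quantization parameters in the FP Buffer for efficient on-the-fly quantization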


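1 Overview of Storage Units in the Ascend AI Processor

For the compute resources of an AI processor to deliver their full performance, input data must reach the compute units promptly and correctly. This requires a carefully designed storage system that keeps the compute units supplied with data.

The AI Core in the Ascend AI processor contains multiple levels of internal storage. The AI Core must load data from external storage into internal storage before it can compute on it. The main internal storage units of the AI Core are: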

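  • L1 Buffer: a general-purpose internal storage unit and a relatively large data staging area within the AI Core. It can hold data that the AI Core uses repeatedly, which reduces the number of reads and writes over the bus.
  • L0A Buffer / L0B Buffer: the inputs of Cube instructions.
  • L0C Buffer: the output of Cube instructions; during accumulation it also serves as part of the input.
  • Unified Buffer: the input and output of vector and scalar computation.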

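To support data transfer and movement within the AI Core, the AI Core also includes the MTE (Memory Transfer Engine) unit, which can convert data format or data type on the fly during a transfer.

Figure 1 AI Core architecture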

1.png

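Besides the basic storage units, namely the L1 Buffer, L0 Buffer, and Unified Buffer, some Ascend AI processors that use the separated AI Core architecture add two more buffers: the BT Buffer and the FP Buffer. The separated architecture splits the AI Core into two independent cores, one for matrix computation (AI Cube, AIC) and one for vector computation (AI Vector, AIV). Each core has its own Scalar unit and can load its own code segment independently, which decouples matrix computation from vector computation. Under unified scheduling by the system software, the two cores cooperate to improve computing efficiency.

  • BT Buffer: BiasTable Buffer, which stores the bias.
  • FP Buffer: Fixpipe Buffer, which stores quantization parameters, Relu parameters, and so on.

Figure 2 AI Core architecture (separated architecture)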

2.png

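2 Consecutive Vector Computation Through UB Buffer Fusion

When an operator performs several vector computations and the output of one is the input of the next, the output can stay in the UB (Unified Buffer) and be used directly as the next input, instead of being moved from UB to GM and then back from GM to UB. This UB Buffer fusion reduces the number of copy-ins and copy-outs, enables consecutive vector computation, and improves memory efficiency. The data flows compare as follows:

Figure 2-1 Data flow comparison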

3.png

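For example, the operator below computes Exp and then Abs. It first moves the source operand from GM to UB for the Exp computation and moves the Exp result from UB to GM. It then moves the Exp result from GM back to UB as the input of Abs and, once Abs finishes, moves the destination operand from UB to GM. The whole process crosses the GM boundary 4 times. With n vector computations, it takes 2n transfers to and from GM.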

class KernelSample { 
public: 
    __aicore__ inline KernelSample() {} 
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm) 
    { 
        src0Global.SetGlobalBuffer((__gm__ float*)src0Gm); 
        dstGlobal.SetGlobalBuffer((__gm__ float*)dstGm); 
        pipe.InitBuffer(inQueueSrc0, 1, 1024 * sizeof(float)); 
        pipe.InitBuffer(outQueueDst, 1, 1024 * sizeof(float)); 
    } 
    __aicore__ inline void Process() 
    { 
        CopyIn(); 
        Compute(); 
        CopyOut(); 
        CopyIn1(); 
        Compute1(); 
        CopyOut1(); 
    } 
 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>(); 
        DataCopy(src0Local, src0Global, 1024); 
        inQueueSrc0.EnQue(src0Local); 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>(); 
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>(); 
        Exp(dstLocal, src0Local, 1024); 
        outQueueDst.EnQue<float>(dstLocal); 
        inQueueSrc0.FreeTensor(src0Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> dstLocal = outQueueDst.DeQue<float>(); 
        DataCopy(dstGlobal, dstLocal, 1024); 
        outQueueDst.FreeTensor(dstLocal); 
    } 
    __aicore__ inline void CopyIn1() 
    { 
        PipeBarrier<PIPE_ALL>();
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>(); 
        DataCopy(src0Local, dstGlobal, 1024); 
        inQueueSrc0.EnQue(src0Local); 
    } 
    __aicore__ inline void Compute1() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>(); 
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>(); 
        Abs(dstLocal, src0Local, 1024); 
        outQueueDst.EnQue<float>(dstLocal); 
        inQueueSrc0.FreeTensor(src0Local); 
    } 
    __aicore__ inline void CopyOut1() 
    { 
        LocalTensor<float> dstLocal = outQueueDst.DeQue<float>(); 
        DataCopy(dstGlobal, dstLocal, 1024); 
        outQueueDst.FreeTensor(dstLocal); 
    } 
 
private: 
    TPipe pipe; 
    TQue<QuePosition::VECIN, 1> inQueueSrc0; 
    TQue<QuePosition::VECOUT, 1> outQueueDst; 
    GlobalTensor<float> src0Global, dstGlobal; 
};

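With UB Buffer fusion, consecutive vector computations run in UB, and the result of one computation is used directly as the input of the next. No intermediate copy-in or copy-out is needed: the source operand is moved to UB once at the start, and the final result is moved from UB to GM once at the end, for 2 transfers in total.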

class KernelSample { 
public: 
    __aicore__ inline KernelSample() {} 
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm) 
    { 
        src0Global.SetGlobalBuffer((__gm__ float*)src0Gm); 
        dstGlobal.SetGlobalBuffer((__gm__ float*)dstGm); 
        pipe.InitBuffer(inQueueSrc0, 1, 1024 * sizeof(float)); 
        pipe.InitBuffer(outQueueDst, 1, 1024 * sizeof(float)); 
    } 
    __aicore__ inline void Process() 
    { 
        CopyIn(); 
        Compute(); 
        CopyOut(); 
    } 
 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>(); 
        DataCopy(src0Local, src0Global, 1024); 
        inQueueSrc0.EnQue(src0Local); 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>(); 
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>(); 
        Exp(dstLocal, src0Local, 1024); 
        Abs(dstLocal, dstLocal, 1024); 
        outQueueDst.EnQue<float>(dstLocal); 
        inQueueSrc0.FreeTensor(src0Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> dstLocal = outQueueDst.DeQue<float>(); 
        DataCopy(dstGlobal, dstLocal, 1024); 
        outQueueDst.FreeTensor(dstLocal); 
    } 
 
private: 
    TPipe pipe; 
    TQue<QuePosition::VECIN, 1> inQueueSrc0; 
    TQue<QuePosition::VECOUT, 1> outQueueDst; 
    GlobalTensor<float> src0Global, dstGlobal; 
};
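
The difference between the two kernels can also be checked on the host. The following is a minimal sketch in plain C++, not Ascend C: `std::vector` stands in for GM and UB, and `TransferCounter`, `CopyGmToUb`, `CopyUbToGm`, `ExpAbsUnfused`, and `ExpAbsFused` are hypothetical names introduced only for this illustration. It mirrors the Exp-then-Abs flow and counts how often data crosses the GM boundary.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Host-side model only: std::vector stands in for GM and UB, and every
// helper below is hypothetical (none of them is an Ascend C API).
struct TransferCounter {
    int gmTransfers = 0;  // number of GM<->UB copies performed
};

inline std::vector<float> CopyGmToUb(const std::vector<float>& gm, TransferCounter& counter)
{
    ++counter.gmTransfers;
    return gm;
}

inline void CopyUbToGm(std::vector<float>& gm, const std::vector<float>& ub, TransferCounter& counter)
{
    ++counter.gmTransfers;
    gm = ub;
}

inline void ExpInPlace(std::vector<float>& ub)
{
    for (float& v : ub) {
        v = std::exp(v);
    }
}

inline void AbsInPlace(std::vector<float>& ub)
{
    for (float& v : ub) {
        v = std::fabs(v);
    }
}

// Before: the Exp result is written to GM and read back before Abs (4 copies).
inline std::vector<float> ExpAbsUnfused(const std::vector<float>& src, TransferCounter& counter)
{
    std::vector<float> dstGm;
    std::vector<float> ub = CopyGmToUb(src, counter);
    ExpInPlace(ub);
    CopyUbToGm(dstGm, ub, counter);
    ub = CopyGmToUb(dstGm, counter);
    AbsInPlace(ub);
    CopyUbToGm(dstGm, ub, counter);
    return dstGm;
}

// After: the Exp result stays in UB and feeds Abs directly (2 copies).
inline std::vector<float> ExpAbsFused(const std::vector<float>& src, TransferCounter& counter)
{
    std::vector<float> dstGm;
    std::vector<float> ub = CopyGmToUb(src, counter);
    ExpInPlace(ub);
    AbsInPlace(ub);
    CopyUbToGm(dstGm, ub, counter);
    return dstGm;
}
```

Calling both functions on the same input with separate counters should give identical results, with the unfused counter at 4 and the fused counter at 2, matching the counts described in this section.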

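3 Efficient Accumulation of Matrix Multiplication Results by Keeping Data in L0C

When an operator accumulates matrix multiplication results (for example A1 * B1 + A2 * B2 + ...), the result of the previous matrix multiplication can stay in CO1 (L0C), and the Mmad interface can be called to accumulate onto it. Compared with moving each result from CO1 to GM and then to UB for accumulation, this reduces the number of data transfers and improves memory efficiency.

Figure 3-1 Data flow before optimization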

4.png

Figure 3-2 Data flow after optimization

5.png

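Before optimization, the operator accumulates the results of 2 matrix multiplications as follows:

  • Move the result of the first matrix multiplication from CO1 to the workspace, then from the workspace to UB;
  • Repeat the steps above for the next matrix multiplication so that its result is also in UB;
  • Add the 2 matrix multiplication results in UB.

Accumulating n matrix multiplications this way adds n CO1->workspace transfers, n workspace->UB transfers, and n Add operations.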

... 
// This sample is for illustration only; it is not complete code and omits some synchronization control code
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = m * k; 
        bSize = k * n; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(inQueueSrc0, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(inQueueSrc1, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(outQueueDst, 1, cSize * sizeof(float)); 
 
    } 
    __aicore__ inline void Process() 
    { 
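        // First matrix multiplication
        CopyIn();
        SplitA();
        SplitB();
        Compute();
        // Move the result of the first matrix multiplication out
        CopyOut();
        // Move the result of the first matrix multiplication to UB
        CopyIn1();
        // Second matrix multiplication
        Compute1();
        // Move the result of the second matrix multiplication out
        CopyOut1();
        // Move the result of the second matrix multiplication to UB
        CopyIn2();
        // Accumulate the results of the two matrix multiplications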
        Compute2(); 
        CopyOut2(); 
    } 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
 
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = m; 
        dataCopyA1Params.dValue = k; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = k; 
        dataCopyA1Params.dstNzC0Stride = m; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM, dataCopyA1Params); 
 
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = k; 
        dataCopyB1Params.dValue = n; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = k; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM, dataCopyB1Params); 
 
        inQueueA1.EnQue<half>(a1Local); 
        inQueueB1.EnQue<half>(b1Local); 
    } 
    __aicore__ inline void SplitA() 
    { 
        ... 
    } 
    __aicore__ inline void SplitB() 
    { 
        ... 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        MmadParams mmadParams; 
        mmadParams.m = m; 
        mmadParams.n = n; 
        mmadParams.k = k; 
        // Matrix multiplication
        Mmad(c1Local, a2Local, b2Local, mmadParams); 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.EnQue<half>(a2Local); 
        inQueueB2.EnQue<half>(b2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace); 
        xGm.SetGlobalBuffer((__gm__ float *)(usrWorkspace)); 
        FixpipeParamsV220 fixpipeParams; 
        fixpipeParams.nSize = n; 
        fixpipeParams.mSize = m; 
        fixpipeParams.srcStride = m; 
        fixpipeParams.dstStride = n; 
        fixpipeParams.ndNum = 1; 
        fixpipeParams.srcNdStride = 0; 
        fixpipeParams.dstNdStride = 0; 
        // Move the matrix multiplication result from CO1 to the workspace
        Fixpipe(xGm, c1Local, fixpipeParams); 
        outQueueCO1.EnQue<float>(c1Local); 
    } 
    __aicore__ inline void CopyIn1() 
    { 
        PipeBarrier<PIPE_ALL>(); 
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>(); 
        // Move the matrix multiplication result from the workspace to UB
        DataCopy(src0Local, xGm, cSize); 
        inQueueSrc0.EnQue<float>(src0Local); 
    } 
    __aicore__ inline void Compute1() 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        MmadParams mmadParams; 
        mmadParams.m = m; 
        mmadParams.n = n; 
        mmadParams.k = k; 
        // Matrix multiplication
        Mmad(c1Local, a2Local, b2Local, mmadParams); 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
    } 
    __aicore__ inline void CopyOut1() 
    { 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        FixpipeParamsV220 fixpipeParams; 
        fixpipeParams.nSize = n; 
        fixpipeParams.mSize = m; 
        fixpipeParams.srcStride = m; 
        fixpipeParams.dstStride = n; 
        fixpipeParams.ndNum = 1; 
        fixpipeParams.srcNdStride = 0; 
        fixpipeParams.dstNdStride = 0; 
        // Move the matrix multiplication result from CO1 to the workspace
        Fixpipe(xGm, c1Local, fixpipeParams); 
        outQueueCO1.FreeTensor(c1Local); 
    } 
    __aicore__ inline void CopyIn2() 
    { 
        PipeBarrier<PIPE_ALL>(); 
        LocalTensor<float> src1Local = inQueueSrc1.AllocTensor<float>(); 
        // Move the matrix multiplication result from the workspace to UB
        DataCopy(src1Local, xGm, cSize); 
        inQueueSrc1.EnQue<float>(src1Local); 
    } 
    __aicore__ inline void Compute2() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>(); 
        LocalTensor<float> src1Local = inQueueSrc1.DeQue<float>(); 
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>(); 
        // Add the results of the two matrix multiplications
        Add(dstLocal, src0Local, src1Local, cSize); 
        outQueueDst.EnQue<float>(dstLocal); 
        inQueueSrc0.FreeTensor(src0Local); 
        inQueueSrc1.FreeTensor(src1Local); 
    } 
    __aicore__ inline void CopyOut2() 
    { 
        ... 
    } 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
    TQue<QuePosition::VECIN, 1> inQueueSrc0; 
    TQue<QuePosition::VECIN, 1> inQueueSrc1; 
    TQue<QuePosition::VECOUT, 1> outQueueDst; 
 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    uint16_t m = 32, k = 32, n = 32; 
    uint16_t aSize, bSize, cSize;   
...

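After optimization, the operator keeps the result of the previous matrix multiplication in L0C when accumulating, configures the initial value of the C matrix through the Mmad parameters cmatrixInitVal and cmatrixSource, and accumulates 2 matrix multiplications with just 2 Mmad calls.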

... 
// This sample is for illustration only; it is not complete code and omits some synchronization control code
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = m * k; 
        bSize = k * n; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
    } 
    __aicore__ inline void Process() 
    { 
        CopyIn(); 
        SplitA(); 
        SplitB(); 
        Compute(); 
        CopyOut(); 
    } 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
 
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = m; 
        dataCopyA1Params.dValue = k; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = k; 
        dataCopyA1Params.dstNzC0Stride = m; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM, dataCopyA1Params); 
 
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = k; 
        dataCopyB1Params.dValue = n; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = k; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM, dataCopyB1Params); 
 
        inQueueA1.EnQue(a1Local); 
        inQueueB1.EnQue(b1Local); 
    } 
    __aicore__ inline void SplitA() 
    { 
        ... 
    } 
    __aicore__ inline void SplitB() 
    { 
        ... 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        MmadParams mmadParams; 
        mmadParams.m = m; 
        mmadParams.n = n; 
        mmadParams.k = k; 
        // First matrix multiplication
        Mmad(c1Local, a2Local, b2Local, mmadParams);
        PipeBarrier<PIPE_M>();
        // Second matrix multiplication, accumulated onto the result of the first
        mmadParams.cmatrixInitVal = false; 
        Mmad(c1Local, a2Local, b2Local, c1Local, mmadParams); 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        FixpipeParamsV220 fixpipeParams; 
        fixpipeParams.nSize = n; 
        fixpipeParams.mSize = m; 
        fixpipeParams.srcStride = m; 
        fixpipeParams.dstStride = n; 
 
        fixpipeParams.ndNum = 1; 
        fixpipeParams.srcNdStride = 0; 
        fixpipeParams.dstNdStride = 0; 
        Fixpipe(cGM, c1Local, fixpipeParams); 
        outQueueCO1.FreeTensor(c1Local); 
    } 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    uint16_t m = 32, k = 32, n = 32; 
    uint16_t aSize, bSize, cSize; 
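
The accumulate-in-place idea can be illustrated with a host-side reference in plain C++. This is a sketch under assumptions, not the Mmad implementation: `MatmulAccumulate` and its `initWithZero` flag are hypothetical, the matrices use row-major `std::vector` storage rather than the on-chip layout, and the flag is assumed to play the role that the sample gives to `cmatrixInitVal` (set to false before the accumulating call).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference only. MatmulAccumulate is a hypothetical helper:
// when initWithZero is true the output starts from zero, otherwise the
// product is added to the values that c already holds.
inline void MatmulAccumulate(std::vector<float>& c, const std::vector<float>& a,
                             const std::vector<float>& b, std::size_t m, std::size_t k,
                             std::size_t n, bool initWithZero)
{
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            // Start from zero, or from the previous result kept in the accumulator.
            float acc = initWithZero ? 0.0f : c[row * n + col];
            for (std::size_t i = 0; i < k; ++i) {
                acc += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = acc;
        }
    }
}
```

Calling `MatmulAccumulate(c, a1, b1, m, k, n, true)` and then `MatmulAccumulate(c, a2, b2, m, k, n, false)` leaves A1 * B1 + A2 * B2 in `c` without a separate buffer for the second product or a separate Add pass, which is the effect the optimized kernel obtains by keeping the first result in L0C.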

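4 Keeping the Smaller Matrix Resident in the L1 Buffer and Moving Only the Larger Matrix in Blocks

In cube computation, when L1 cannot hold both the left and right matrices in full, the smaller matrix can stay resident in L1 while only the larger matrix is moved in blocks, which reduces the number of transfers.

Suppose L1 is 512K, the left and right matrices are 992K and 16K respectively, and the data type is half, so the two matrices cannot be loaded into L1 in one pass. The developer's tiling strategy does not split the K axis: the left matrix is split evenly into two blocks A1 and A2, each of shape [992, 256], and the right matrix is split evenly into two blocks B1 and B2, each of shape [256, 16]. The load order is: load A1 into L1, then load B1 and B2 in turn and compute; then load A2 into L1, and again load B1 and B2 in turn and compute.

Figure 4-1 Tiling strategy before optimization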

6.png

 

... 
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = baseM * baseK; 
        bSize = baseK * baseN; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
    } 
    __aicore__ inline void Process() 
    { 
        for (uint32_t i = 0; i < 2; i++) { 
            CopyInA1(i); 
            SplitA(); 
            for (uint32_t j = 0; j < 2; j++) { 
                CopyInB1(j); 
                SplitB(); 
                Compute(i, j); 
            } 
        } 
        CopyOut(); 
    } 
private: 
    __aicore__ inline void CopyInA1(uint32_t i) 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        // Load left-matrix block a1/a2 into A1
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = baseM; 
        dataCopyA1Params.dValue = baseK; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = baseK; 
        dataCopyA1Params.dstNzC0Stride = baseM; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM[i * baseM * baseK], dataCopyA1Params); 
        inQueueA1.EnQue(a1Local); 
    } 
    __aicore__ inline void SplitA() 
    { 
        LocalTensor<half> a1Local = inQueueA1.DeQue<half>(); 
        LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>(); 
        // Move left-matrix block a1/a2 from A1 to A2
        LoadData2dParams loadL0AParams; 
        loadL0AParams.repeatTimes = baseM * baseK * sizeof(half) / 512; 
        loadL0AParams.srcStride = 1; 
        loadL0AParams.dstGap = 0; 
        LoadData(a2Local, a1Local, loadL0AParams); 
        inQueueA2.EnQue(a2Local); 
        inQueueA1.FreeTensor(a1Local); 
    } 
    __aicore__ inline void CopyInB1(uint32_t j) 
    { 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
        // Load right-matrix block b1/b2 into B1
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = baseK; 
        dataCopyB1Params.dValue = baseN; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = baseK; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM[j * baseN], dataCopyB1Params); 
        inQueueB1.EnQue(b1Local); 
    } 
    __aicore__ inline void SplitB() 
    { 
        LocalTensor<half> b1Local = inQueueB1.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>(); 
        // Move right-matrix block b1/b2 from B1 to B2
        LoadData2dTransposeParams loadL0BParams; 
        loadL0BParams.startIndex = 0; 
        loadL0BParams.repeatTimes = baseK / nBlockSize; 
        loadL0BParams.srcStride = 1; 
        loadL0BParams.dstGap = 1; 
        LoadDataWithTranspose(b2Local, b1Local, loadL0BParams); 
        inQueueB2.EnQue(b2Local); 
        inQueueB1.FreeTensor(b1Local); 
    } 
    __aicore__ inline void Compute(uint32_t i, uint32_t j) 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        // Matrix multiplication
        MmadParams mmadParams;
        mmadParams.m = baseM; 
        mmadParams.n = baseN; 
        mmadParams.k = baseK; 
        Mmad(c1Local[i * baseM * baseN + j * m * baseN], a2Local, b2Local, mmadParams); 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        ... 
    } 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    uint16_t m = 1984, k = 256, n = 32; 
    uint16_t baseM = 992, baseK = 256, baseN = 16; 
    uint16_t aSize, bSize, cSize; 
    uint16_t nBlockSize = 16; 
...

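After optimization, the smaller right matrix is moved into L1 once and stays resident there, and only the A matrix is moved inside the loop. With 2 loop iterations, 3 transfers are needed in total.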

... 
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = baseM * baseK; 
        bSize = baseK * n; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
    } 
    __aicore__ inline void Process() 
    { 
        CopyInB1(); 
        SplitB(); 
        for (uint32_t i = 0; i < 2; i++) { 
            CopyInA1(i); 
            SplitA(); 
            for (uint32_t j = 0; j < 2; j++) { 
                Compute(i, j); 
            } 
        } 
        CopyOut(); 
    } 
private: 
    __aicore__ inline void CopyInB1() 
    { 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
        // Load the whole right matrix into B1
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = baseK; 
        dataCopyB1Params.dValue = n; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = baseK; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM, dataCopyB1Params); 
        inQueueB1.EnQue(b1Local); 
    } 
    __aicore__ inline void SplitB() 
    { 
        LocalTensor<half> b1Local = inQueueB1.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>(); 
        // Move the whole right matrix from B1 to B2
        LoadData2dTransposeParams loadL0BParams; 
        loadL0BParams.startIndex = 0; 
        loadL0BParams.repeatTimes = baseK / nBlockSize; 
        loadL0BParams.srcStride = 1; 
        loadL0BParams.dstGap = 1; 
        for (int blockNum = 0; blockNum < (n / nBlockSize); blockNum++) { 
            LoadDataWithTranspose(b2Local[blockNum * 16 * nBlockSize], b1Local[blockNum * baseK * nBlockSize], loadL0BParams); 
        } 
        inQueueB2.EnQue(b2Local); 
        inQueueB1.FreeTensor(b1Local); 
    } 
    __aicore__ inline void CopyInA1(uint32_t i) 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        // Load left-matrix block a1/a2 into A1
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = baseM; 
        dataCopyA1Params.dValue = baseK; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = baseK; 
        dataCopyA1Params.dstNzC0Stride = baseM; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM[i * baseM * baseK], dataCopyA1Params); 
        inQueueA1.EnQue(a1Local); 
    } 
    __aicore__ inline void SplitA() 
    { 
        LocalTensor<half> a1Local = inQueueA1.DeQue<half>(); 
        LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>(); 
        // Move left-matrix block a1/a2 from A1 to A2
        LoadData2dParams loadL0AParams; 
        loadL0AParams.repeatTimes = baseM * baseK * sizeof(half) / 512; 
        loadL0AParams.srcStride = 1; 
        loadL0AParams.dstGap = 0; 
        LoadData(a2Local, a1Local, loadL0AParams); 
        inQueueA2.EnQue(a2Local); 
        inQueueA1.FreeTensor(a1Local); 
    } 
    __aicore__ inline void Compute(uint32_t i, uint32_t j) 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        // Matrix multiplication
        MmadParams mmadParams;
        mmadParams.m = baseM; 
        mmadParams.n = baseN; 
        mmadParams.k = baseK; 
        Mmad(c1Local[i * baseM * baseN + j * m * baseN], a2Local, b2Local, mmadParams); 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        ... 
    } 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    uint16_t m = 1984, k = 256, n = 32; 
    uint16_t baseM = 992, baseK = 256, baseN = 16; 
    uint16_t aSize, bSize, cSize; 
    uint16_t nBlockSize = 16; 
...
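
The sizes and transfer counts in this section follow from simple arithmetic, sketched below in plain C++. The helper names `MatrixKiB`, `LoadsReloadingB`, and `LoadsResidentB` are hypothetical and exist only for this illustration. The count of 6 loads before optimization is derived from the load order described above (A1, B1, B2, A2, B1, B2) rather than stated in the original text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for the arithmetic in this section (not Ascend C APIs).
constexpr std::uint32_t kHalfBytes = 2;  // size of one half element in bytes

// Size in KiB of a rows x cols matrix of half elements.
constexpr std::uint32_t MatrixKiB(std::uint32_t rows, std::uint32_t cols)
{
    return rows * cols * kHalfBytes / 1024;
}

// GM->L1 loads when every A block is loaded once and all B blocks are
// reloaded for each A block.
constexpr std::uint32_t LoadsReloadingB(std::uint32_t aBlocks, std::uint32_t bBlocks)
{
    return aBlocks + aBlocks * bBlocks;
}

// GM->L1 loads when the whole B matrix is loaded once and stays resident
// in L1 while the A blocks are loaded one by one.
constexpr std::uint32_t LoadsResidentB(std::uint32_t aBlocks)
{
    return 1 + aBlocks;
}
```

With m = 1984, k = 256, and n = 32, `MatrixKiB(1984, 256)` is 992 and `MatrixKiB(256, 32)` is 16, matching the sizes given above. One A block takes `MatrixKiB(992, 256)` = 496 KiB, so an A block plus the full right matrix occupies 512 KiB. `LoadsReloadingB(2, 2)` gives 6 and `LoadsResidentB(2)` gives 3.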

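5 Efficient Bias Computation with the BT Buffer

When an operator performs matrix multiplication with bias, the bias data can be moved to C2 (Bias Table Buffer), and a single Mmad call then computes the matrix multiplication plus bias. Compared with moving the matrix multiplication result from CO1 (L0C) to GM and then to UB to add the bias, this reduces the number of data transfers and improves memory efficiency. The data flows compare as follows:

Figure 5-1 Data flow before optimization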

7.png

Figure 5-2 Data flow after optimization

8.png

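Before optimization, the operator performs matrix multiplication with bias as follows:

  • Move the matrix multiplication result from CO1 (L0C) to the workspace;
  • Move it from the workspace to UB;
  • Add the bias in UB;
  • Finally, move the result to GM.

Repeating this process n times adds n CO1->workspace transfers and n workspace->UB transfers.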

// This sample is for illustration only; it is not complete code and omits some synchronization control code
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = m * k; 
        bSize = k * n; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *bias, __gm__ uint8_t *c) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        biasGM.SetGlobalBuffer((__gm__ float *)bias); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(inQueueBias, 1, n * sizeof(float)); 
        pipe.InitBuffer(inQueueSrc0, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(outQueueDst, 1, cSize * sizeof(float)); 
 
    } 
    __aicore__ inline void Process() 
    { 
        CopyIn(); 
        SplitA(); 
        SplitB(); 
        Compute(); 
        CopyOut(); 
        CopyIn1(); 
        Compute1(); 
        CopyOut1(); 
    } 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
        LocalTensor<float> biasLocal = inQueueBias.AllocTensor<float>(); 
 
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = m; 
        dataCopyA1Params.dValue = k; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = k; 
        dataCopyA1Params.dstNzC0Stride = m; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM, dataCopyA1Params); 
 
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = k; 
        dataCopyB1Params.dValue = n; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = k; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM, dataCopyB1Params); 
        // Move the bias to UB
        DataCopy(biasLocal, biasGM, n); 
 
        inQueueA1.EnQue(a1Local); 
        inQueueB1.EnQue(b1Local); 
        inQueueBias.EnQue(biasLocal); 
    } 
    __aicore__ inline void SplitA() 
    { 
        ... 
    } 
    __aicore__ inline void SplitB() 
    { 
        ... 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        MmadParams mmadParams; 
        mmadParams.m = m; 
        mmadParams.n = n; 
        mmadParams.k = k; 
        // Matrix multiplication
        Mmad(c1Local, a2Local, b2Local, mmadParams); // m*n 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace); 
        xGm.SetGlobalBuffer((__gm__ float *)(usrWorkspace)); 
        FixpipeParamsV220 fixpipeParams; 
        fixpipeParams.nSize = n; 
        fixpipeParams.mSize = m; 
        fixpipeParams.srcStride = m; 
        fixpipeParams.dstStride = n; 
        fixpipeParams.ndNum = 1; 
        fixpipeParams.srcNdStride = 0; 
        fixpipeParams.dstNdStride = 0; 
        // Move the matrix multiplication result from CO1 to the workspace
        Fixpipe(xGm, c1Local, fixpipeParams); 
        outQueueCO1.FreeTensor(c1Local); 
    } 
    __aicore__ inline void CopyIn1() 
    { 
        PipeBarrier<PIPE_ALL>(); 
        // Move the matrix multiplication result from the workspace to UB
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>(); 
        DataCopy(src0Local, xGm, cSize); 
        inQueueSrc0.EnQue(src0Local); 
    } 
    __aicore__ inline void Compute1() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>(); 
        LocalTensor<float> biasLocal = inQueueBias.DeQue<float>(); 
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>(); 
        BinaryRepeatParams addRepeatParams; 
        addRepeatParams.dstRepStride = 8; 
        addRepeatParams.src0RepStride = 8; 
        addRepeatParams.src1RepStride = 0; 
        // Add the bias
        Add(dstLocal, src0Local, biasLocal, 32, m, addRepeatParams); 
        outQueueDst.EnQue<float>(dstLocal); 
        inQueueSrc0.FreeTensor(src0Local); 
        inQueueBias.FreeTensor(biasLocal); 
    } 
    __aicore__ inline void CopyOut1() 
    { 
        ... 
    } 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
    TQue<QuePosition::VECIN, 1> inQueueBias; 
    TQue<QuePosition::VECIN, 1> inQueueSrc0; 
    TQue<QuePosition::VECOUT, 1> outQueueDst; 
 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    GlobalTensor<float> biasGM; 
    uint16_t m = 32, k = 32, n = 32; 
    uint16_t aSize, bSize, cSize;   
...
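
To make the cost of this detour concrete, the following host-side C++ sketch adds up the bytes moved for the bias and the result in each variant. It is only an accounting aid, not Ascend C kernel code. The function names are hypothetical, the bias GM->UB copy and the final UB->GM copy are assumed from the elided parts of the sample, and the traffic for the A and B inputs is left out because it is the same in both variants.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved when the matmul result takes the detour through the workspace
// and the bias is added on the vector unit (the flow shown above).
inline std::size_t BiasTrafficViaUb(std::size_t m, std::size_t n) {
    assert(m > 0 && n > 0);
    const std::size_t result = m * n * sizeof(float);
    const std::size_t bias = n * sizeof(float);
    return result   // CO1 -> workspace
         + result   // workspace -> UB
         + bias     // bias GM -> UB (assumed; elided in the sample)
         + result;  // UB -> GM (assumed; CopyOut1 is elided in the sample)
}

// Bytes moved when the bias goes through L1 into the BT Buffer
// and the result leaves CO1 only once.
inline std::size_t BiasTrafficViaBt(std::size_t m, std::size_t n) {
    assert(m > 0 && n > 0);
    const std::size_t result = m * n * sizeof(float);
    const std::size_t bias = n * sizeof(float);
    return bias     // bias GM -> L1
         + bias     // bias L1 -> BT
         + result;  // CO1 -> GM
}
```

For m = n = 32 this gives 12416 bytes for the detour through UB and 4352 bytes for the BT Buffer path.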

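After optimization, the operator first moves the bias to the BT Buffer and then calls the Mmad interface once to compute the matrix multiplication and add the bias in a single step.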

... 
// This sample is for illustration only; it is not complete code and omits some synchronization control code
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = m * k; 
        bSize = k * n; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *bias, __gm__ uint8_t *c) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        biasGM.SetGlobalBuffer((__gm__ float *)bias); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(inQueueC1, 1, n * sizeof(float)); 
        pipe.InitBuffer(outQueueC2, 1, n * sizeof(float)); 
    } 
    __aicore__ inline void Process() 
    { 
        CopyIn(); 
        SplitA(); 
        SplitB(); 
        SplitBias(); 
        Compute(); 
        CopyOut(); 
    } 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
        LocalTensor<float> bias1Local = inQueueC1.AllocTensor<float>(); 
 
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = m; 
        dataCopyA1Params.dValue = k; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = k; 
        dataCopyA1Params.dstNzC0Stride = m; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM, dataCopyA1Params); 
 
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = k; 
        dataCopyB1Params.dValue = n; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = k; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM, dataCopyB1Params); 
        // Move the bias from GM to L1
        DataCopy(bias1Local, biasGM, n); 
 
        inQueueA1.EnQue(a1Local); 
        inQueueB1.EnQue(b1Local); 
        inQueueC1.EnQue(bias1Local); 
    } 
    __aicore__ inline void SplitA() 
    { 
        ... 
    } 
    __aicore__ inline void SplitB() 
    { 
        ... 
    } 
    __aicore__ inline void SplitBias() 
    { 
        LocalTensor<float> bias1Local = inQueueC1.DeQue<float>(); 
        LocalTensor<float> bias2Local = outQueueC2.AllocTensor<float>(); 
        // Move the bias from L1 to BT
        DataCopy(bias2Local, bias1Local, { 1, (uint16_t)(n * sizeof(float) / 64), 0, 0 }); 
        outQueueC2.EnQue<float>(bias2Local); 
        inQueueC1.FreeTensor(bias1Local); 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> bias2Local = outQueueC2.DeQue<float>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        MmadParams mmadParams; 
        mmadParams.m = m; 
        mmadParams.n = n; 
        mmadParams.k = k; 
        mmadParams.cmatrixInitVal = false; 
        // Matrix multiplication; the bias held in BT is added within the same Mmad call
        Mmad(c1Local, a2Local, b2Local, bias2Local, mmadParams); 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
        outQueueC2.FreeTensor(bias2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        FixpipeParamsV220 fixpipeParams; 
        fixpipeParams.nSize = n; 
        fixpipeParams.mSize = m; 
        fixpipeParams.srcStride = m; 
        fixpipeParams.dstStride = n; 
 
        fixpipeParams.ndNum = 1; 
        fixpipeParams.srcNdStride = 0; 
        fixpipeParams.dstNdStride = 0; 
        Fixpipe(cGM, c1Local, fixpipeParams); 
        outQueueCO1.FreeTensor(c1Local); 
    } 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
    TQue<QuePosition::C1, 1> inQueueC1; 
    TQue<QuePosition::C2, 1> outQueueC2; 
 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    GlobalTensor<float> biasGM; 
    uint16_t m = 32, k = 32, n = 32; 
    uint16_t aSize, bSize, cSize; 
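
Numerically, passing the bias to Mmad can be thought of as starting each accumulator from the bias value instead of from zero. The host-side C++ sketch below illustrates that equivalence with plain row-major loops. It is a reference model under that assumption, not Ascend C code, and the function names are made up for illustration. In floating point the two summation orders can differ in the last bits, so a comparison is only exact for inputs such as small integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage

// Two-step flow: C = A * B first, then a separate pass adds bias[j] to column j.
Matrix MatmulThenAddBias(const Matrix& a, const Matrix& b, const Matrix& bias,
                         std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n && bias.size() == n);
    Matrix c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[i * n + j] += bias[j];
    return c;
}

// Fused flow: every accumulator starts from bias[j], so no second pass is needed.
Matrix MatmulWithBias(const Matrix& a, const Matrix& b, const Matrix& bias,
                      std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n && bias.size() == n);
    Matrix c(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = bias[j];
            for (std::size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

The fused form touches the output once, which mirrors why the optimized kernel no longer needs the workspace round trip and the separate Add.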

6 Storing quantization parameters in the FP Buffer for efficient on-the-fly quantization

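When an operator quantizes the result of a matrix multiplication, the quantization parameters can be moved to C2PIPE2GM (Fixpipe Buffer) so that a single Fixpipe call quantizes the matmul result on its way out. Compared with moving the matmul result from CO1 (L0C) to GM, then from GM to UB, and quantizing it in UB, this needs fewer data transfers and uses memory more efficiently.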

Figure 6-1 Data flow before optimization

9.png

Figure 6-2 Data flow after optimization

10.png

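Before optimization, the matmul result is quantized as follows:

  • Move the matmul result from CO1 to the workspace;
  • Move it from the workspace to UB;
  • Move the quantization parameters to UB and run a series of quantization computations on the matmul result in UB;
  • Move the final quantized result from UB to GM.

Compared with the optimized example, this adds the CO1->workspace and workspace->UB transfers as well as the vector computation for quantization.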

... 
// This sample is for illustration only; it is not complete code and omits some synchronization control code
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = m * k; 
        bSize = k * n; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c, __gm__ uint8_t *deqTensor) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        deqGM.SetGlobalBuffer((__gm__ half *)deqTensor); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(inQueueSrc0, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(inQueueTmp, 1, cSize * sizeof(half)); 
        pipe.InitBuffer(inQueueDeq, 1, cSize * sizeof(half)); 
        pipe.InitBuffer(outQueueDst, 1, cSize * sizeof(int8_t)); 
    } 
    __aicore__ inline void Process() 
    { 
        CopyIn(); 
        SplitA(); 
        SplitB(); 
        Compute(); 
        CopyOut(); 
        CopyIn1(); 
        Compute1(); 
        CopyOut1(); 
    } 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
        LocalTensor<half> deqLocal = inQueueDeq.AllocTensor<half>(); 
 
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = m; 
        dataCopyA1Params.dValue = k; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = k; 
        dataCopyA1Params.dstNzC0Stride = m; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM, dataCopyA1Params); 
 
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = k; 
        dataCopyB1Params.dValue = n; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = k; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM, dataCopyB1Params); 
        // Move the quantization parameters to UB
        DataCopy(deqLocal, deqGM, cSize); 
 
        inQueueA1.EnQue(a1Local); 
        inQueueB1.EnQue(b1Local); 
        inQueueDeq.EnQue(deqLocal); 
    } 
    __aicore__ inline void SplitA() 
    { 
        ... 
    } 
    __aicore__ inline void SplitB() 
    { 
        ... 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        MmadParams mmadParams; 
        mmadParams.m = m; 
        mmadParams.n = n; 
        mmadParams.k = k; 
        // Matrix multiplication
        Mmad(c1Local, a2Local, b2Local, mmadParams); // m*n 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace); 
        xGm.SetGlobalBuffer((__gm__ float *)(usrWorkspace)); 
        FixpipeParamsV220 fixpipeParams; 
        fixpipeParams.nSize = n; 
        fixpipeParams.mSize = m; 
        fixpipeParams.srcStride = m; 
        fixpipeParams.dstStride = n; 
        fixpipeParams.ndNum = 1; 
        fixpipeParams.srcNdStride = 0; 
        fixpipeParams.dstNdStride = 0; 
        // Move the matmul result from CO1 to the workspace
        Fixpipe(xGm, c1Local, fixpipeParams); 
        outQueueCO1.FreeTensor(c1Local); 
    } 
    __aicore__ inline void CopyIn1() 
    { 
        PipeBarrier<PIPE_ALL>(); 
        // Move the matmul result from the workspace to UB
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>(); 
        DataCopy(src0Local, xGm, cSize); 
        inQueueSrc0.EnQue(src0Local); 
    } 
    __aicore__ inline void Compute1() 
    { 
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>(); 
        LocalTensor<half> tmpLocal = inQueueTmp.AllocTensor<half>(); 
        LocalTensor<half> deqLocal = inQueueDeq.DeQue<half>(); 
        LocalTensor<int8_t> dstLocal = outQueueDst.AllocTensor<int8_t>(); 
        // Quantization: cast float to half, multiply by the quantization parameters, cast half to int8
        Cast(tmpLocal, src0Local, RoundMode::CAST_NONE, cSize); 
        LocalTensor<half> tmpHalfBuffer = src0Local.ReinterpretCast<half>(); 
        Mul(tmpHalfBuffer, tmpLocal, deqLocal, cSize); 
        Cast(dstLocal, tmpHalfBuffer, RoundMode::CAST_NONE, cSize); 
        outQueueDst.EnQue<int8_t>(dstLocal); 
        inQueueSrc0.FreeTensor(src0Local); 
        inQueueTmp.FreeTensor(tmpLocal); 
        inQueueDeq.FreeTensor(deqLocal); 
    } 
    __aicore__ inline void CopyOut1() 
    { 
        ... 
    } 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
    TQue<QuePosition::VECIN, 1> inQueueDeq; 
    TQue<QuePosition::VECIN, 1> inQueueSrc0; 
    TQue<QuePosition::VECCALC, 1> inQueueTmp; 
    TQue<QuePosition::VECOUT, 1> outQueueDst; 
 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    GlobalTensor<half> deqGM; 
    uint16_t m = 32, k = 32, n = 32; 
    uint16_t aSize, bSize, cSize; 
    ...
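
The vector-side quantization in Compute1 boils down to scaling each matmul output by its quantization parameter and narrowing the result to int8. The C++ sketch below models that on the host with float arithmetic. It is an approximation for intuition only: the helper names are hypothetical, the intermediate half precision of the kernel is not modeled, and rounding to nearest with saturation to [-128, 127] is an assumption rather than the documented behavior of Cast with RoundMode::CAST_NONE.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scale one accumulator value and narrow it to int8.
// Assumption: round to nearest (default FP rounding mode) and saturate.
inline std::int8_t QuantizeOne(float acc, float scale) {
    const float rounded = std::nearbyint(acc * scale);
    const float clamped = std::min(127.0f, std::max(-128.0f, rounded));
    return static_cast<std::int8_t>(clamped);
}

// Element-wise quantization with one parameter per output element,
// matching the cSize-long parameter tensor used in the sample.
std::vector<std::int8_t> QuantizeElementwise(const std::vector<float>& acc,
                                             const std::vector<float>& scale) {
    assert(acc.size() == scale.size());
    std::vector<std::int8_t> out(acc.size());
    for (std::size_t i = 0; i < acc.size(); ++i) {
        out[i] = QuantizeOne(acc[i], scale[i]);
    }
    return out;
}
```

In the unoptimized kernel this arithmetic costs two Cast calls and one Mul on the vector unit, plus the extra transfers needed to bring the float result into UB.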

After optimization, the operator moves the quantization parameters to the FB (Fixpipe Buffer) and calls the Fixpipe interface once to quantize the matmul result.

... 
public: 
    __aicore__ inline KernelSample() 
    { 
        aSize = m * k; 
        bSize = k * n; 
        cSize = m * n; 
    } 
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c, __gm__ uint8_t *deqTensor) 
    { 
        aGM.SetGlobalBuffer((__gm__ half *)a); 
        bGM.SetGlobalBuffer((__gm__ half *)b); 
        cGM.SetGlobalBuffer((__gm__ float *)c); 
        deqGM.SetGlobalBuffer((__gm__ uint64_t *)deqTensor); 
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half)); 
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half)); 
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float)); 
        pipe.InitBuffer(inQueueDeq1, 1, cSize * sizeof(uint64_t)); 
        pipe.InitBuffer(inQueueDeq, 1, cSize * sizeof(uint64_t)); 
    } 
    __aicore__ inline void Process() 
    { 
        CopyIn(); 
        SplitA(); 
        SplitB(); 
        SplitDeq(); 
        Compute(); 
        CopyOut(); 
    } 
private: 
    __aicore__ inline void CopyIn() 
    { 
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>(); 
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>(); 
        LocalTensor<uint64_t> deq1Local = inQueueDeq1.AllocTensor<uint64_t>(); 
 
        Nd2NzParams dataCopyA1Params; 
        dataCopyA1Params.ndNum = 1; 
        dataCopyA1Params.nValue = m; 
        dataCopyA1Params.dValue = k; 
        dataCopyA1Params.srcNdMatrixStride = 0; 
        dataCopyA1Params.srcDValue = k; 
        dataCopyA1Params.dstNzC0Stride = m; 
        dataCopyA1Params.dstNzNStride = 1; 
        dataCopyA1Params.dstNzMatrixStride = 0; 
        DataCopy(a1Local, aGM, dataCopyA1Params); 
 
        Nd2NzParams dataCopyB1Params; 
        dataCopyB1Params.ndNum = 1; 
        dataCopyB1Params.nValue = k; 
        dataCopyB1Params.dValue = n; 
        dataCopyB1Params.srcNdMatrixStride = 0; 
        dataCopyB1Params.srcDValue = n; 
        dataCopyB1Params.dstNzC0Stride = k; 
        dataCopyB1Params.dstNzNStride = 1; 
        dataCopyB1Params.dstNzMatrixStride = 0; 
        DataCopy(b1Local, bGM, dataCopyB1Params); 
        // Move the quantization parameters to L1
        DataCopy(deq1Local, deqGM, cSize); 
 
        inQueueA1.EnQue(a1Local); 
        inQueueB1.EnQue(b1Local); 
        inQueueDeq1.EnQue(deq1Local); 
    } 
    __aicore__ inline void SplitA() 
    { 
        ... 
    } 
    __aicore__ inline void SplitB() 
    { 
        ... 
    } 
    __aicore__ inline void SplitDeq() 
    { 
        LocalTensor<uint64_t> deq1Local = inQueueDeq1.DeQue<uint64_t>(); 
        LocalTensor<uint64_t> deqLocal = inQueueDeq.AllocTensor<uint64_t>(); 
        // Move the quantization parameters from L1 to FB
        DataCopy(deqLocal, deq1Local, { 1, (uint16_t)(cSize * sizeof(uint64_t) / 128), 0, 0 }); 
        inQueueDeq.EnQue<uint64_t>(deqLocal); 
        inQueueDeq1.FreeTensor(deq1Local); 
    } 
    __aicore__ inline void Compute() 
    { 
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>(); 
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>(); 
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>(); 
        MmadParams mmadParams; 
        mmadParams.m = m; 
        mmadParams.n = n; 
        mmadParams.k = k; 
        // Matrix multiplication
        Mmad(c1Local, a2Local, b2Local, mmadParams); // m*n 
        outQueueCO1.EnQue<float>(c1Local); 
        inQueueA2.FreeTensor(a2Local); 
        inQueueB2.FreeTensor(b2Local); 
    } 
    __aicore__ inline void CopyOut() 
    { 
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>(); 
        LocalTensor<uint64_t> deqLocal = inQueueDeq.DeQue<uint64_t>(); 
        SetFixpipeNz2ndFlag(1, 0, 0); 
        DataCopyCO12DstParams dataCopyParams; 
        dataCopyParams.nSize = n; 
        dataCopyParams.mSize = m; 
        dataCopyParams.srcStride = m; 
        dataCopyParams.dstStride = n; 
        dataCopyParams.quantPre = QuantMode_t::VQF322B8_PRE; 
        dataCopyParams.nz2ndEn = true; 
        // Move out the matmul result, quantized on the fly
        DataCopy(cGM, c1Local, dataCopyParams); 
        outQueueCO1.FreeTensor(c1Local); 
        inQueueDeq.FreeTensor(deqLocal); 
    } 
 
private: 
    TPipe pipe; 
    TQue<QuePosition::A1, 1> inQueueA1; 
    TQue<QuePosition::A2, 1> inQueueA2; 
    TQue<QuePosition::B1, 1> inQueueB1; 
    TQue<QuePosition::B2, 1> inQueueB2; 
    TQue<QuePosition::C1, 1> inQueueDeq1; 
    TQue<QuePosition::C2PIPE2GM, 1> inQueueDeq; 
    TQue<QuePosition::CO1, 1> outQueueCO1; 
    GlobalTensor<half> aGM; 
    GlobalTensor<half> bGM; 
    GlobalTensor<dst_T> cGM; 
    GlobalTensor<uint64_t> deqGM; 
    uint16_t m = 32, k = 32, n = 32; 
    uint16_t aSize, bSize, cSize; 
    ...
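
Both SplitBias and SplitDeq pass the copy length as a number of fixed-size units rather than bytes: the samples divide the byte count by 64 for the L1 to BT copy and by 128 for the L1 to FB copy. Integer division silently truncates when the byte count is not a multiple of the unit, so a small checked helper can be useful in host-side tiling code. The following C++ sketch is one way to write it. The function name is hypothetical, and the unit sizes are taken from the samples above, not from a hardware specification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Convert an element count into a block length expressed in unitBytes-sized units.
// Throws instead of truncating when the data does not fill whole units,
// or when the result does not fit the uint16_t field used by the copy parameters.
inline std::uint16_t BlockLenInUnits(std::size_t elemCount, std::size_t elemSize,
                                     std::size_t unitBytes) {
    if (unitBytes == 0) {
        throw std::invalid_argument("unitBytes must be non-zero");
    }
    const std::size_t bytes = elemCount * elemSize;
    if (bytes % unitBytes != 0) {
        throw std::invalid_argument("byte count is not a multiple of the unit size");
    }
    const std::size_t units = bytes / unitBytes;
    if (units > std::numeric_limits<std::uint16_t>::max()) {
        throw std::out_of_range("block length does not fit in uint16_t");
    }
    return static_cast<std::uint16_t>(units);
}
```

With the sample shapes, BlockLenInUnits(32, sizeof(float), 64) yields 2 for the bias and BlockLenInUnits(1024, sizeof(uint64_t), 128) yields 64 for the quantization parameters.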

7 More learning resources

To learn more about Ascend C operator performance optimization techniques and case studies, visit: https://www.hiascend.com/ascend-c

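[Copyright notice] This article is original content from a Huawei Cloud community user. When reposting, you must credit the source (Huawei Cloud community), the article link, and the author; otherwise the author and the community reserve the right to pursue liability. If you find content in this community that appears to be plagiarized, please report it by email with supporting evidence; once verified, the community will promptly remove the infringing content. Report email: cloudbbs@huaweicloud.com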