排列扫描(Permuted Scan)
Brent-Kung的扫描算法在GPGPU中实现时会存在Bank Conflict
传统上通过Padding的方式来解决Bank Conflict的问题 [Mark Harris, Shubhabrata Sengupta, John D. Owens. "Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3 Chapter 39 2007](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html "Mark Harris, Shubhabrata Sengupta, John D. Owens. "Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3 Chapter 39 2007")
Merrill和Grimshaw提出了一种新的方法——排列扫描(Permuted Scan)——可以在不需要Padding的情况下解决Bank Conflict的问题 [Duane Merrill, Andrew Grimshaw. "Parallel Scan for Stream Architectures." University of Virginia, Department of Computer Science, Technical Report 2009.](https://libraopen.lib.virginia.edu/public_view/kd17cs85f "Duane Merrill, Andrew Grimshaw. "Parallel Scan for Stream Architectures." University of Virginia, Department of Computer Science, Technical Report 2009.")
但是,Merrill和Grimshaw并没有找到一种时间复杂度为O(1)的计算内存索引到元素Rank的映射关系的方法 "Unfortunately we are unaware of an O(1) function for performing such a mapping of element ranks to memory indices."——论文原文
我发现可以用倒位序(出自FFT)计算该映射关系 //特别声明:该方法由本人原创,在法律允许的范围内保留一切权利(包括但不限于著作权) HLSL MemoryIndex = reversebits(ElementRank)>>23 //23=(32-log(2,512)) //我们假定单个线程组内有256个线程 //512为局部共享内存维度
GLSL MemoryIndex = bitfieldReverse(ElementRank)>>23 //23=(32-log(2,512))
本文转载自异步社区
原文链接:https://www.epubit.com/articleDetails?id=N6c857ad6-35a4-483c-85d1-2854bf24630b
- 点赞
- 收藏
- 关注作者
评论(0)