跳转至

cuda Assembly:PTX & SASS

两种汇编

  1. parallel thread execution (PTX) 内联汇编有没有关系
  2. PTX是编程人员可以操作的最底层汇编,原因是SASS代码的实现会经常根据GPU架构而经常变换
  3. https://docs.nvidia.com/cuda//pdf/Inline_PTX_Assembly.pdf
  4. ISA指令手册 https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set
  5. SASS
  6. Streaming ASSembly(Shader Assembly?) 没有官方的证明
  7. 没有官方详细的手册,有基本介绍:https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#ampere
  8. https://zhuanlan.zhihu.com/p/161624982
  9. 从可执行程序反汇编SASS
    1. https://www.findhao.net/easycoding/2339.html

SASS 指令基本信息

对于Ampere架构

指令方向

(instruction) (destination) (source1), (source2) ...

各种寄存器说明 * RX for registers * URX for uniform registers * SRX for special system-controlled registers * PX for predicate registers * c[X][Y] for constant memory

SASS 举例说明1

SASS的难点在于指令的后缀。由于手册确实,需要结合PTX的后缀查看

/*0028*/         IMAD R6.CC, R3, R5, c[0x0][0x20]; 
/*0030*/         IMAD.HI.X R7, R3, R5, c[0x0][0x24]; 
/*0040*/         LD.E R2, [R6]; //load

line1

/*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20];
Extended-precision integer multiply-add: multiply R3 with R5, sum with constant in bank 0, offset 0x20, store in R6 with carry-out.

c[BANK][ADDR] is a constant memory。

.CC means “set the flags”

line2

/*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24];
Integer multiply-add with extract: multiply R3 with R5, extract upper half, sum that upper half with constant in bank 0, offset 0x24, store in R7 with carry-in.

line3

/*0040*/         LD.E R2, [R6]; //load
LD.E is a load from global memory using 64-bit address in R6,R7(表面上是R6,其实是R6 与 R7 组成的地址对)

summary

R6 = R3*R5 + c[0x0][0x20], saving carry to CC
R7 = (R3*R5 + c[0x0][0x24])>>32 + CC
R2 = *(R7<<32 + R6)
寄存器是32位的原因是 SMEM的bank是4字节的。c数组将32位的基地址分开存了。

first two commands multiply two 32-bit values (R3 and R5) and add 64-bit value c[0x0][0x24]<<32+c[0x0][0x20],

leaving 64-bit address result in the R6,R7 pair

对应的代码是

kernel f (uint32* x) // 64-bit pointer
{
   R2 = x[R3*R5]
}

SASS Opt Code分析2

  • LDG - Load form Global Memory
  • ULDC - Load from Constant Memory into Uniform register
  • USHF - Uniform Funnel Shift (猜测是特殊的加速shift)
  • STS - Store within Local or Shared Window

流水STS

观察 偏移 * 4 * 2060(delta=2056) * 4116(delta=2056) * 8228(delta=2 * 2056) * 6172(delta=-1 * 2056) * 10284(delta=2 * 2056) * 12340(delta=2056)

可见汇编就是中间写反了,导致不连续,不然能隐藏更多延迟

STS缓存寄存器来源

那么这些寄存器是怎么来的呢?感觉就是写反了

      IMAD.WIDE.U32 R16, R16, R19, c[0x0][0x168] 
      LDG.E R27, [R16.64] 
      IMAD.WIDE R30, R19, c[0x0][0x164], R16 
      LDG.E R31, [R30.64] 
      IMAD.WIDE R32, R19, c[0x0][0x164], R30 
      LDG.E R39, [R32.64] 
      # important R41 R37
      IMAD.WIDE R34, R19, c[0x0][0x164], R32 
      IMAD.WIDE R40, R19, c[0x0][0x164], R34 
      LDG.E R41, [R40.64] 
      LDG.E R37, [R34.64] 

Fix

原因是前面是手动展开的,假如等待编译器自动展开for循环就不会有这个问题

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://forums.developer.nvidia.com/t/solved-sass-code-analysis/41167/2

https://stackoverflow.com/questions/35055014/how-to-understand-the-result-of-sass-analysis-in-cuda-gpu