跳转至

笔记

Assembly Arm

关于X86 与 arm的寄存器的区别写在了arm那篇下

arm

https://developer.arm.com/documentation/dui0068/b/CIHEDHIF

Arm 的四种寻址方式

ldr & str

Aarch64

Arm A64 Instruction Set Architecture https://modexp.wordpress.com/2018/10/30/arm64-assembly/

直接阅读文档 Arm® A64 Instruction Set Architecture Armv8, for Armv8-A architecture profile最有效

指令后缀说明

read from ARMv8 Instruction Set Overview 4.2 Instruction Mnemonics

The container is one of:

The subtype is one of:

combine

<name>{<subtype>}      <container> 

注意后缀的作用主体

指令速查

官网查找指令: https://developer.arm.com/architectures/instruction-sets/intrinsics

https://armconverter.com/?disasm&code=0786b04e

SIMD/vector

几乎每个指令都可以同时作用在不同寄存器和vector或者scalar上。比如add指令,并没有像X86一样设计vadd或者addps等单独 的指令,如果一定要区分,只能从寄存器是不是vector下手。

根据这个图,确实是有做向量操作的add,FADD是float-add的意思,ADDP是将相邻的寄存器相加放入目的寄存器的意思。不影响是标量scalar还是向量vector的操作。addv是将一个向量寄存器里的每个分量归约求和的意思,确实只能用在向量指令。

由于需要满足64或者128位只有下面几种情况

需要额外注意的是另外一种写法,位操作指令,不在乎寄存器形状shape

# 128位and
and %q3 %q7 -> %q3
and v3.16b, v3.16b, v7.16b
是同一个意思,但是不支持and v3.8h, v3.8h, v7.8h
DUP //Duplicate general-purpose register to vector.or Duplicate vector element to vector or scalar.
addp //Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes 
//the scalar result into the destination SIMD&FP register.

calculate

add
addp //Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes the scalar result into the destination SIMD&FP register.
adds // Add , setting flags.
eor // Bitwise Exclusive OR
orr // Move (register) copies the value in a source register to the destination register. Alias of ORR.

Address

ADRP // Form PC-relative address to 4KB page.

Branch

b.cond // branch condition eg. b.ne
bl //Branch with Link branches to a PC-relative offset, setting the register X30 to PC+4 
//带链接的跳转。 首先将当前指令的下一条指令地址保存在LR寄存器,然后跳转的lable。通常用于调用子程序,可通过在子程序的尾部添加mov  pc, lr 返回。
blr //Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
cbnz //Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
tbnz // test and branch not zero
ret //Return from subroutine, branches unconditionally to an address in a register, with a hint that this is a subroutine return.

Load/Store

ldrb // b是byte的意思
ldar // LDAR Load-Acquire(申请锁) Register 
STLR //Store-Release(释放锁) Register 
ldp // load pair(two) register
stp // store pair(two) register
ldr(b/h/sb/sh/sw) // load register , sb/sh/sw is signed byte/half/word
str // store register
ldur // load register (unscaled) unscaled means that in the machine-code, the offset will not be encoded with a scaled offset like ldr uses. or offset is minus.
prfm // prefetch memory

Control/conditional

ccmp // comdition compare
CMEQ // Compare bitwise Equal (vector). This instruction compares each vector element from the frst source SIMD&FP register with the corresponding vector element from the second source SIMD&FP register
CSEL // If the condition is true, Conditional Select writes the value of the frst source register to the destination register. If the condition is false, it writes the value of the second source register to the destination register.
CSINC //Conditional Select Increment returns
CSINV //Conditional Select Invert returns
CSNEG //Conditional Select Negation returns

Logic&Move

ASRV //Arithmetic Shift Right Variable
lsl //logic shift left
orr //bitwise(逐位) or
eor //Bitwise Exclusive OR
TST/ANDS //Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
MOVZ //Move wide with zero moves an optionally-shifted 16-bit immediate value to a register
UBFM // Unigned Bitfield Move. This instruction is used by the aliases LSL (immediate), LSR (immediate), UBFIZ, UBFX, UXTB, and UXTH
BFM //Bitfield Move
BIC (shifted register) //Bitwise Bit Clear
CLZ // Count Leading Zeros counts the number of binary zero bits before the frst binary one bit in the value of the source register, and writes the result to the destination register.
REV, REV16, REVSH, and RBIT // below
REV //Reverse byte order in a word.
REV16 //Reverse byte order in each halfword independently.
REVSH //Reverse byte order in the bottom halfword, and sign extend to 32 bits.
RBIT //Reverse the bit order in a 32-bit word.

Modifier

uxtb // zero extend byte 无符号(Unsigned)扩展一个字节(Byte)到 32位

system

dmb  //data memory barrier
SVC //The SVC instruction causes an exception. This means that the processor mode changes to Supervisor,

ARM no push/pop

PUSH {r3}
POP {r3}

are aliases for

str r3, [sp, #-4]!
ldr r3, [sp], #4

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://www.cs.virginia.edu/~evans/cs216/guides/x86.html

https://blog.csdn.net/gaojinshan/article/details/11534569

Intel® Intrinsics Guide

符号说明

_mm_sin_ps intrinsic is a packed 128-bit vector of four 32-bit precision floating point numbers.The intrinsic computes the sine of each of these four numbers and returns the four results in a packed 128-bit vector.

ISA

AVX2 & AVX

AVX2在AVX的基础上完善了256位寄存器的一些实现

FMA

float-point multiply add/sub

include 128/256 bits regs

AVX_VNNI

AVX-VNNI is a VEX-coded variant of the AVX512-VNNI instruction set extension. It provides the same set of operations, but is limited to 256-bit vectors and does not support any additional features of EVEX encoding, such as broadcasting, opmask registers or accessing more than 16 vector registers. This extension allows to support VNNI operations even when full AVX-512 support is not implemented by the processor.

dpbusd  //_mm_dpbusd_avx_epi32
dpwssd // b 与 w 是 byte 和dword。 us和ss是ab两数是不是signed
dpwssds // 最后的s是 signed saturation饱和计算的意思,计算不允许越界。

AVX-512

有时间再看吧

KNC

current generation of Intel Xeon Phi co-processors (codename "Knight's Corner", abbreviated KNC) supports 512-bit SIMD instruction set called "Intel® Initial Many Core Instructions" (abbreviated Intel® IMCI).

https://stackoverflow.com/questions/22670205/are-there-simdsse-avx-instructions-in-the-x86-compatible-accelerators-intel

AMX

Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two components: * A set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image * An accelerator that is able to operate on tiles; the first implementation of this accelerator is called TMUL (tile matrix multiply unit).

这个不适用于特殊矩阵和稀疏矩阵,这类一般先转换化简再SIMD

SVML

Short Vector Math Library Operations (SVML)

The Intel® oneAPI DPC++/C++ Compiler provides short vector math library (SVML) intrinsics to compute vector math functions. These intrinsics are available for IA-32 and Intel® 64 architectures running on supported operating systems. The prototypes for the SVML intrinsics are available in the immintrin.h file.

Using SVML intrinsics is faster than repeatedly calling the scalar math functions. However, the intrinsics differ from the scalar functions in accuracy.

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

Manual AVX256 SIMD

类型区别

The __m256 data type can hold eight 32-bit floating-point values.

The __m256d data type can hold four 64-bit double precision floating-point values.

The __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values

向量预取

_mm512_mask_prefetch_i32extgather_ps

Load & Store

__m256i _mm256_loadu_epi32 (void const* mem_addr) //读入连续的256位数据,为32位int
_mm256_lddqu_si256 //上面识别不了也可以考虑这个
__m256d _mm256_loadu_pd (double const * mem_addr) // 读入连续4个double
__m256d _mm256_broadcast_sd (double const * mem_addr) // 读取一个double,并复制4份
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale) // 间隔读取
scatter // 类似间隔读取
_mm512_mask_prefetch_i32extgather_ps // 有选择预取
mask // 根据掩码选择不读,0等操作
_mm256_stream_pd // 跳过cache直接写入内存,但是需要对齐
_mm_storeu_si128 // int直接写入内存,不需要对齐

不连续读取

long long int vindexList = [0,2,4,6];
__m256i vindex = __mm256_loadu_epi64(vindexList);
__m256d vj1 = __mm256_i64gather_pd(&rebuiltCoord[jj*k], vindex, 1);

设置每个元素

__m256d _mm256_set_pd (double e3, double e2, double e1, double e0) // 设置为四个元素
__m256d _mm256_set1_pd (double a) // 设置为同一个元素

Arithmetic

_mm256_hadd_epi16 // Horizontally add eg.dst[15:0] := a[31:16] + a[15:0]
_mm256_mulhi_epi16 // Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.
_mm256_sign_epi16 // 根据b的值,将-a/0/a存入dst
// 乘加,乘减,的计算组合也有

横向结果归约

_mm256_reduce_add_ph // 求和

手动实现向量浮点abs绝对值

static const double DP_SIGN_One = 0x7fffffffffffffff;
__m256d vDP_SIGN_Mask = _mm256_set1_pd(DP_SIGN_One);
vj1 = _mm256_and_pd(vj1, vDP_SIGN_Mask);

Shift

_mm_bsrli_si128 // byte shift right 
_mm_slli_epi16 // shift left

logic

_mm_test_all_zeros
_mm_test_all_ones //判断是不是全0或1

Elementary Math Functions

向量化 取反、sqrt

Convert

_mm256_cvtepi32_pd // Convert_Int32_To_FP64

Compare

_mm256_cmp_pd // 按照double 32 bit 比较

Swizzle(混合)

_mm256_blendv_pd // 根据mask结果,从a和b里选择写入dst
_mm_blend_epi32 // 寄存器内数据的移动
_mm256_permute4x64_epi64 // 寄存器高位复制到低位
VEXTRACTF128 __m128d _mm256_extractf128_pd (__m256d a, int offset); // 寄存器内数据的移动
VUNPCKHPD __m512d _mm512_unpackhi_pd( __m512d a, __m512d b); //寄存器内数据的移动

类型转换

__m256d _mm256_undefined_pd (void)
__m128i low = _mm256_castsi256_si128(v);  //__m256i 变 type __m128i,源向量较低的128位不变地传递给结果。这种内在的特性不会向生成的代码引入额外的操作。

Select4(SRC, control) {
CASE (control[1:0]) OF
    0: TMP ←SRC[31:0];
    1: TMP ←SRC[63:32];
    2: TMP ←SRC[95:64];
    3: TMP ←SRC[127:96];
ESAC;
RETURN TMP
}

VSHUFPS (VEX.128 encoded version) ¶
DEST[31:0]  ←Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ←Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ←Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96]←Select4(SRC2[127:0], imm8[7:6]);
DEST[MAXVL-1:128] ←0
之后float类型转换为double,再求和。

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_loadu_pd&ig_expand=4317

Arm - Neon

https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/arm-neon-programming-quick-reference

Arm cpu 向量化支持判断

向量化指令

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://blog.csdn.net/heliangbin87/article/details/79581113?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.no_search_link&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.no_search_link

CSAPP: Machine Programming III: Procedures

stack

register 使用约定

rax 返回/传出寄存器
rdi rsi 传入寄存器
寄存器 %rsp 存放栈顶地址 (lowest stack address) pushq %rsp-8 popq %rsp+8
rip 存call地址

caller 调用者 callee 被调用者

calling procedure

callq 调用
retq 返回

调用控制

https://bkfish.github.io/2018/12/21/CSAPP又双叒叕来一遍之函数调用过程栈帧的变化/

传参数

  1. push到栈里
  2. 递归调用,把上一级的数据及时push保存
  3. 保存在寄存器里

Managing local data

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

SIMD+SSE+AVX

SIMD

SIMD全称Single Instruction Multiple Data,单指令多数据流,能够复制多个操作数,并把它们打包在大型寄存器的一组指令集。

通过使用矢量寄存器,指令译码后几个执行部件同时访问内存,一次性获得所有操作数进行运算。这个特点使SIMD特别适合于多媒体应用等数据密集型运算。如 AMD的3D NOW!技术

MMX

MMX是由57条指令组成的SIMD多媒体指令集,MMX将64位寄存当作2个32位或8个8位寄存器来用,只能处理整形计算,这样的64位寄存器有8组,分别命名为MM0~MM7.这些寄存器不是为MMX单独设置的,而是借用的FPU的寄存器,占用浮点寄存器进行运算(64位MMX寄存器实际上就是浮点数寄存器的别名),以至于MMX指令和浮点数操作不能同时工作。为了减少在MMX和浮点数模式切换之间所消耗的时间,程序员们尽可能减少模式切换的次数,也就是说,这两种操作在应用上是互斥的。

SSE

SSE为Streaming SIMD Extensions的缩写。Intel SSE指令通过128bit位宽的专用寄存器, 支持一次操作128bit数据. float是单精度浮点数, 占32bit, 那么可以使用一条SSE指令一次计算4个float数。注意这些SSE指令要求参数中的内存地址必须对齐于16字节边界。

SSE专用矢量寄存器个数,是每个core一个吗?

SSE有8个128位寄存器,XMM0 ~XMM7。此外SSE还提供了新的控制/状态寄存器MXCSR。为了回答这个问题,我们需要了解CPU的架构。每个core是独占register的

SSE 相关编译命令

addps xmm0, xmm1 ; reg-reg addps xmm0, [ebx] ; reg-mem sse提供了两个版本的指令,其一以后缀ps结尾,这组指令对打包单精度浮点值执行类似mmx操作运算,而第二种后缀ss

SSE 相关函数

  1. load系列 eg.__m128 _mm_load_ss (float *p)
  2. store系列 eg.__m128 _mm_set_ss (float w)
  3. 其他操作 eg.__m128 _mm_add_ss (__m128 a, __m128 b)包括加法、减法、乘法、除法、开方、最大值、最小值、近似求倒数、求开方的倒数等等浮点操作

SSE指令集的发展

1. SSE2则进一步支持双精度浮点数,由于寄存器长度没有变长,所以只能支持2个双精度浮点计算或是4个单精度浮点计算.另外,它在这组寄存器上实现了整型计算,从而代替了MMX. 2. SSE3支持一些更加复杂的算术计算. 3. SSE4增加了更多指令,并且在数据搬移上下了一番工夫,支持不对齐的数据搬移,增加了super shuffle引擎. 4. 由于2007年8月,AMD抢先宣布了SSE5指令集。之后Intel将新出的叫做AVX指令集。由于SSE5和AVX指令集功能类似,并且AVX包含更多的优秀特性,因此AMD决定支持AVX指令集

AVX

Advanced Vector Extensions。较新的Intel CPU都支持AVX指令集, 它可以一次操作256bit数据, 是SSE的2倍,可以使用一条AVX指令一次计算8个float数。AVX指令要求内存地址对齐于32字节边界。

SSE 与 AVX的发展

性能对比

根据参考文章,其中用gcc编译AVX版代码时需要加-mavx选项. 开启-O3选项,一般不用将代码改成多次计算和内存对齐。

判断是否向量化,看汇编

GNU

gcc -march=native -c -Q --help=target # 查看支持的指令集
g++ -O2 -ftree-vectorize -ftree-vectorizer-verbose=9 -S -c foo.cpp -o /dev/stdout | c++filt # 查看汇编
OBJDUMP # 反汇编
c++函数在linux系统下编译之后会变成如下样子
_ZNK4Json5ValueixEPKc
在linux命令行使用c++filter
$ c++filt _ZNK4Json5ValueixEPKc
Json::Value::operator[](char const*) const
可以得到函数的原始名称, 展开后续追踪

intel icpc

clang

-Rpass=loop-vectorize 
identifies loops that were successfully vectorized.

-Rpass-missed=loop-vectorize 
identifies loops that failed vectorization and indicates if vectorization was specified.

-Rpass-analysis=loop-vectorize 
identifies the statements that caused vectorization to fail.

常见汇编代码

xmm 寄存器
movsd

MMX指令

手动向量化

循环展开8次

例子1

SIMD寄存器

需要进一步的研究学习

暂无

遇到的问题

暂无

参考文献

https://www.dazhuanlan.com/2020/02/01/5e3475c89d5bd/

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

LLVM Mca : huawei HiSilicon's TSV110 work

几个对比图

x轴的含义是改变port值的意思,比如tsv110alu2是在tsv110的基础上将alu的值改成2

相关的 git commit

commit c9ca3a3c66a493d72cf7afc7ee975e2de399f2e5
Author: Elvina Yakubova <[email protected]>
Date:   Sat Nov 7 01:50:43 2020 +0300

    [AArch64] Add driver tests for HiSilicon's TSV110

commit 93b99728b1676d23ab5dabc606344230d25e7f4b
Author: Elvina Yakubova <[email protected]>
Date:   Sat Nov 7 01:22:35 2020 +0300

    [AArch64] Add pipeline model for HiSilicon's TSV110

    This patch adds the scheduling and cost model for TSV110.

    Reviewed by: SjoerdMeijer, bryanpkc

    Differential Revision: https://reviews.llvm.org/D89972

commit 123553921f86ac0fad7b742740aa45e8d380be02
Author: Bryan Chan <[email protected]>
Date:   Fri Nov 9 19:32:08 2018 +0000

    [AArch64] Support HiSilicon's TSV110 processor

    Reviewers: t.p.northover, SjoerdMeijer, kristof.beyls

    Reviewed By: kristof.beyls

    Subscribers: olista01, javed.absar, kristof.beyls, kristina, llvm-commits

    Differential Revision: https://reviews.llvm.org/D53908

    llvm-svn: 346546    
只有3个,感觉和2个功能很相关。

最近 Driver commit

类似的llvm check的设置

复现上面的图

要改的地方

应该每次都要重新编译安装

测试的汇编代码

  1. 判断llvm/test/MC/AArch64下的汇编能用吗?选个最大的,neon 不支持, armv8.2也并不支持。感觉有特别要求
    cat neon-diagnostics.s|llvm-mca -timeline -show-encoding -all-stats -all-views
    
  2. 选择osaca的benchmark里的add.c

AArch64SchedTSV110.td

locate at llvm/lib/Target/AArch64/AArch64SchedTSV110.td

td file

tablegen(LLVM class) definitions

部分指令解释

def : InstRW<[TSV110Wr_2cyc_1MDU],   (instregex "^(AND|BIC|EON|EOR|ORN|ORR)[WX]rs$")>;
BIC (bit clear) EON (Exclusive OR) ORR (OR operations on the values in Rn and Operand2)

InstRW的定义

// Map a set of opcodes to a list of SchedReadWrite types. This allows
// the subtarget to easily override specific operations.
//
// SchedModel ties this opcode mapping to a processor.
class InstRW<list<SchedReadWrite> rw, dag instrlist> {
  list<SchedReadWrite> OperandReadWrites = rw;
  dag Instrs = instrlist;
  SchedMachineModel SchedModel = ?;
  // Allow a subtarget to mark some instructions as unsupported.
  bit Unsupported = false;
}
TSV110Wr_2cyc_1MDU的定义
def TSV110Wr_2cyc_1MDU   : SchedWriteRes<[TSV110UnitMDU]>   { let Latency = 2; }

class SchedWriteRes<list<ProcResourceKind> resources> : SchedWrite,
  ProcWriteResources<resources>;

//定义TSV110上可用的每种处理器资源和数量,
//它有8条pipeline管道,每个管道都有自己的队列,微操作在那里等待
//它们的operands和issue将无序地发送到八个执行管道之一。
def TSV110UnitMDU  : ProcResource<1>; // Multi-Cycle

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

Code Migration And Alignment

导言

  • 越靠近一线的研发,更会忙碌于开源代码/特性的迁移工作。
  • 原因主要在于客户发现了效果好的开源成果,就觉得没有复用门槛,反过来催促开发快点实现。
  • 读论文也是为了更好的理解迁移的代码,而较少关注其原理

无论是把 PyTorch代码 迁移到其他框架(e.g.,MindSpore),还是把将代码继承到All IN ONE 框架(e.g., MindSpeed-MM),都经常遇到如下头大的问题:

  1. 一行行代码理解迁移速度太慢,并且要理解的非重要、不相关内容太多。
  2. 一股脑先移植过来,总是遇到channel对不上、触发算子计算维度限制条件 等问题。
  3. 训练推理流程打通之后,也会遇到精度不对齐的问题。

原始的解决办法就是在计算流程上打印关键数据的变化,找到是开始出现了差异(非预期)地方,使用起来非常不方便:

  1. 需要手动加print;
  2. 需要肉眼对比打屏信息;

想寻找/开发一个python工具DataDiffer/TensorDiffer:

  1. 比如通过装饰器等方法,跟踪函数内,指定变量的变化;
  2. 包括shape,tensor内前5个非0值,
  3. 支持将变化信息保存到文件,方便后续对比;