Tutorials¶

2022年5月17日
分类于 Tutorials
需要 4 分钟阅读时间

Css & Scss

CSS (Cascading Style Sheets) 和 SCSS (Sassy CSS) 都是用于样式表的编程语言，用于定义网页的外观和布局。

2022年5月15日
分类于 Tutorials
需要 7 分钟阅读时间

Cuda Optimize

Outline

General optimization guidance
1. Coalescing memory operations
2. Occupancy and latency hiding
3. Using shared memory
Example 1: transpose
1. Coalescing and bank conflict avoidance
Example 2: efficient parallel reductions
1. Using peak performance metrics to guide optimization
2. Avoiding SIMD divergence & bank conflicts
3. Loop unrolling
4. Using template parameters to write general-yet-optimized code
5. Algorithmic strategy: Cost efficiency

CUDA 优化策略

基础

最大化并行独立性
最大化计算密度

减少数据传输

数据可以直接在GPU生成。
一次大传输也比分开的小批次快

zerocopy

如果我们数据只会在 GPU 产生和使用，我们不需要来回进行拷贝。

https://migocpp.wordpress.com/2018/06/08/cuda-memory-access-global-zero-copy-unified/

简而言之，在 host 使用命令：cudaHostRegisterMapped 之后用 cudaHostGetDevicePointer 进行映射最后解除绑定 cudaHostUnregister

即，

// First, pin the memory (or cudaHostAlloc instead)
cudaHostRegister(h_a, …, cudaHostRegisterMapped);
cudaHostRegister(h_b, …, cudaHostRegisterMapped);
cudaHostRegister(h_c, …, cudaHostRegisterMapped);

cudaHostGetDevicePointer(&a, h_a, 0);
cudaHostGetDevicePointer(&b, h_b, 0);
cudaHostGetDevicePointer(&c, h_c, 0);

kernel<<<...>>>(a, b, c);
cudaDeviceSynchronize();

// unpin/release host memory
cudaHostUnregister(h_a);
cudaHostUnregister(h_b);
cudaHostUnregister(h_c);

cuda warp shuffle

只要两个thread在同一个warp中，允许thread直接读其他thread的寄存器值，这种比通过shared Memory进行thread间的通讯效果更好，latency更低，同时也不消耗额外的内存资源来执行数据交换。ref

访存连续性

Optimize for spatial locality in cached texture memory ???
避免bank conflict：如果没有bank冲突的话，共享内存的访存速度将会非常的快，大约比全局内存的访问延迟低100多倍，但是速度没有寄存器快。然而，如果在使用共享内存时发生了bank冲突的话，性能将会降低很多很多。

Global Memory：coalesced access

对齐(Starting address for a region must be a multiple of region size)集体访问，有数量级的差异Coalesced

利用好每个block里的thread，全部每个线程各自读取自己对齐(Starting address for a region must be a multiple of region size 不一定是自己用的)数据到shared memory开辟的总空间。由于需要的数据全部合力读取进来了，计算时正常使用需要的读入的数据。

特别是对于结构体使用SoA(structure of arrays)而不是AoS（array of structures），如果结构体实在不能对齐, 可以使用 __align(X), where X = 4, 8, or 16.强制对齐。

对齐读取 float3 code

对于small Kernel和访存瓶颈的Kernel影响很大

由于需要对齐读取，3float是12字节，所以只能拆成三份。

有无采用对齐shared读取，有10倍的加速。

利用好Shared Memory

比globalMemory快百倍
可以来避免 non-Coalesced access
SM的线程可以共享
Use one / a few threads to load / compute data shared by all threads

隐藏延迟的方法

增加SM上线程数量，
block数> SM数，这样所有的multiprocessors至少有一个block执行
threads/block>128 。原因：机器上一般有最多4个Warp调度器=4*32=128
threadsInblock=N*WarpSize=N*32
在 SM 上的 TB 越多越好，让 Thread Block 不停的跑我们的利用率就会高。
但是如果 Thread Block 太多，我们每一个 SM 能分配的寄存器就会变少，所以就会发生 Register Spill, 使用更高级的 L1、L2 Cache 去代替 Registers。所以 TB 不能太多，需要减少 Register Spill 的次数。
1. 资源占用率不要太高（最多一半？
多使用 __syncthreads
最好的参数需要self-tuning出来

_global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
   unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
   unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
   if (xIndex < width && yIndex < height)
   {
      unsigned int index_in = xIndex + width * yIndex;
      unsigned int index_out = yIndex + height * xIndex;
      odata[index_out] = idata[index_in]; 
   }
}

思想：将大矩阵划分成方块，并且存储在SMEM里。不仅SMEM速度更快，而且每行元素个数变少，跨行访问的间距变小，局部性增强。而且对于大矩阵加速效果会更明显。

__global__ void transpose(float *odata, float *idata, int width, int height)
{
   __shared__ float block[BLOCK_DIM*BLOCK_DIM];
   unsigned int xBlock = blockDim.x * blockIdx.x;
   unsigned int yBlock = blockDim.y * blockIdx.y;
   unsigned int xIndex = xBlock + threadIdx.x;
   unsigned int yIndex = yBlock + threadIdx.y;
   unsigned int index_out, index_transpose;
   if (xIndex < width && yIndex < height)
   {
      unsigned int index_in = width * yIndex + xIndex;
      unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
      block[index_block] = idata[index_in];
      index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
      index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
   }
   __syncthreads();
   if (xIndex < width && yIndex < height)
      odata[index_out] = block[index_transpose]
}

coalescing access

when Block/tile dimensions are multiples of 16 ???

关于bank conflict

https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/

对于一个32 × 32个元素的共享内存块，一列数据中的所有元素都映射到相同的SMEM bank ，导致bank conflict 的最坏情况:读取一列数据会导致32路的存储库冲突。

幸运的是，只需要将tile的元素宽度改为33，而不是32就行。

优化实例2 - 数据归约

具体问题：将长数组的所有元素，归约求和为一个结果。¹²

总体思路

为了避免全局同步的巨大开销，采取分级归约

由于归约的计算密度低 1 flop per element loaded (bandwidth-optimal)

所以优化目标是将访存带宽用满。

384-bit memory interface, 900 MHz DDR
384 * 1800 / 8 = 86.4 GB/s

step0 : baseline - Interleaved Addressing 交错/间隔寻址

__global__ void reduce0(int *g_idata, int *g_odata) {
   extern __shared__ int sdata[];

   // each thread loads one element from global to shared mem
   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
   sdata[tid] = g_idata[i];
   __syncthreads();

   // do reduction in shared mem
   for(unsigned int s=1; s < blockDim.x; s *= 2) {
      if (tid % (s) == 0) {
         sdata[tid] += sdata[tid + s];
      }
      __syncthreads();
   }

   // write result for this block to global mem
   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

工作的线程越来越少。一开始是全部，最后一次只有thread0.

Step1 : 使用连续的index

Just replace divergent branch With strided index and non-divergent branch，但是会带来bank conflict。

原理和Warp发射有关，假如在这里每个Warp并行的线程是2。一个Warp运行耗时为T.

Step0: 4+4+2+1=11T

Step1: 4+2+1+1=8T

for (unsigned int s=1; s < blockDim.x; s *= 2) {
   int index = 2 * s * tid;
   if (index < blockDim.x) {
      sdata[index] += sdata[index + s];
   }
   __syncthreads();
}

Step2: 连续寻址

for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
   if (tid < s) {
      sdata[tid] += sdata[tid + s];
   }
   __syncthreads();
}

原本寻址

现在寻址有一边连续了

Step3 : 弥补浪费的线程

方法：在load SMEM的时候提前做一次规约加法，通过减少一半的block数，将原本两个block里的值load+add存储在sum里。

// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();

step4 : Unrolling the Last Warp

当s< 32的时候，就只有一个Warp工作了。

使用warp的SIMD还省去了__syncthreads()的麻烦

for (unsigned int s=blockDim.x/2; s>32; s>>=1) 
{
   if (tid < s)
      sdata[tid] += sdata[tid + s];
   __syncthreads();
}
if (tid < 32)
{
   sdata[tid] += sdata[tid + 32]; 
   sdata[tid] += sdata[tid + 16]; 
   sdata[tid] += sdata[tid + 8]; 
   sdata[tid] += sdata[tid + 4]; 
   sdata[tid] += sdata[tid + 2]; 
   sdata[tid] += sdata[tid + 1]; 
}

为了保持整洁，最后一个if还做了无效的计算。eg, Warp里的最后一个线程只有第一句命令有用。

Step5 : 根据blockSize完全展开for和去除代码

由于for循环里是二分的，而且小于32的单独处理了，导致for循环里实际运行代码最多就3句。

利用代码模板和编译器的自动优化实现：

template <unsigned int blockSize>
__global__ void reduce5(int *g_idata, int *g_odata)

红色代码会在编译时自动优化。

step6 ：归并算法优化

加速级联？？

Cost= processors × time complexity

我们知道N个元素直接二叉树归约是O(log N) 时间 Cost=N*O(log N).

但是假如只有P个线程先做N/P的串行加法, 然后是log(P)的归约。总cost=P(N/P+log(P))

当P=N/log(N), cost=O(N)

each thread should sum O(log n) elements来设置

比如，1024 or 2048 elements per block vs. 256 线程。每个sum n=4个元素。具体参数要perf

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
   sdata[tid] += g_idata[i] + g_idata[i+blockSize];
   i += gridSize;
}
__syncthreads();

final code

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*(blockSize*2) + tid;
   unsigned int gridSize = blockSize*2*gridDim.x;
   sdata[tid] = 0;

   do { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; } while (i < n);
   __syncthreads();

   if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
   if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
   if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

   if (tid < 32) {
      if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
      if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
      if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
      if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
      if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
      if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
   }
   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
   name = BlackScholesGPU
   lmem = 0    # per thread local memory
   smem = 68   # per thread block shared memory
   reg = 20    # per thread registers

参考文献

SC07 Optimizing Parallel Reduction in CUDA - Mark Harris ↩↩↩
2009 清华邓仰东 cuda lecture pdf 注意也是参考的SC07 Nvidia。 ↩↩↩

2022年5月12日
分类于 Tutorials
需要 1 分钟阅读时间

Nvprof

安装

$ which nvprof 
/usr/local/cuda/bin/nvprof

基本使用

摘要模式

命令行直接运行

nvprof ./myApp

跟踪API

nvprof --print-gpu-trace ./myApp

保存在log里

sudo /usr/local/cuda/bin/nvprof --log-file a.log --metrics achieved_occupancy /staff/shaojiemike/github/cutests/22-commonstencil/common

可视化

nsight可以直接在远程机器上运行
ssh -X host
.ssh/config
1. add
2. XAuthLocation /opt/X11/bin/xauth #for macbookAir
3. ForwardX11Trusted yes
4. ForwardX11 yes
Visual Profiler也可以ssh直接连接远程机器
或者导出分析结果以便可视化, 在Visual Profiler使用

nvprof --export-profile timeline.prof <app> <app args>
nvprof --analysis-metrics -o  nbody-analysis.nvprof ./myApp

profile kernel

sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 /staff/shaojiemike/github/cutests/22-commonstencil/best

ncu-ui是可视化界面，但是没弄懂

带宽profile

上限测量

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:02:08]                                                                                                                                                                      $ ./bin/x86_64/linux/release/bandwidthTest                                                                                                                                                                                           [CUDA Bandwidth Test] - Starting...                                                                                                                                                                                                  Running on...                                                                                                                                                                                                                                                                                                                                                                                                                                                              Device 0: Tesla P40                                                                                                                                                                                                                  Quick Mode                                                                                                                                                                                                                                                                                                                                                                                                                                                                Host to Device Bandwidth, 1 Device(s)                                                                                                                                                                                                PINNED Memory Transfers                                                                                                                                                                                                                Transfer Size (Bytes)        Bandwidth(GB/s)                                                                                                                                                                                         32000000                     11.8                                                                                                                                                                                                                                                                                                                                                                                                                                       Device to Host Bandwidth, 1 Device(s)                                                                                                                                                                                                PINNED Memory Transfers                                                                                                                                                                                                                Transfer Size (Bytes)        Bandwidth(GB/s)                                                                                                                                                                                         32000000                     13.0                                                                                                                                                                                                                                                                                                                                                                                                                                       Device to Device Bandwidth, 1 Device(s)                                                                                                                                                                                              PINNED Memory Transfers                                                                                                                                                                                                                Transfer Size (Bytes)        Bandwidth(GB/s)                                                                                                                                                                                         32000000                     244.3                                                                                                                                                                                                                                                                                                                                                                                                                                     Result = PASS                                                                                                                                                                                                                                                                                                                                                                                                                                                             NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.                                                                                                                                                                                       # shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:03:24]                                                                                                                                                                      $ ./bin/x86_64/linux/release/p2pBandwidthLatencyTest

实际值

nvprof通过指定与dram，L1或者L2 的metrics来实现。具体解释可以参考官网

在 Maxwell 和之后的架构中 L1 和 SMEM 合并

Metric Name	解释
achieved_occupancy	活跃cycle是 Warps 活跃的比例
dram_read_throughput
dram_utilization	在0到10的范围内，相对于峰值利用率，设备内存的利用率水平
shared_load_throughput
shared_utilization
l2_utilization

2022年5月4日
分类于 Tutorials
需要 1 分钟阅读时间

Hugo

Hugo is a Go-based static site generator known for its speed and flexibility in 2013.
Hugo has set itself apart by being fast. More precisely, it has set itself apart by being much faster than Jekyll.
Jekyll uses Liquid as its templating language. Hugo uses Go templating. Most people seem to agree that it is a little bit easier to learn Jekyll’s syntax than Hugo’s.¹

2022年4月24日
分类于 Tutorials
需要 1 分钟阅读时间

shaojiemike@snode6:~$ groups shaojiemike
shaojiemike : staff sudo

所有组

cat /etc/group

 getent passwd

first reboot server

ctrl + alt + F3     #jump into command line
login
su - {user-name}
sudo -s
sudo -i
# If invoked without a user name, su defaults to becoming the superuser
ip a |less          #check ip address fjw弄了静态IP就没这个问题了

限制当前shell用户爆内存

宕机一般是爆内存，进程分配肯定会注意不超过物理核个数。

在zshrc里写入 25*1024*1024 = 25GB的内存上限

ulimit -v 26214400

当前shell程序超内存，会输出Memory Error结束。

测试读取200GB大文件到内存

with open("/home/shaojiemike/test/DynamoRIO/OpenBLASRawAssembly/openblas_utest.log", 'r') as f:
    data= f.readlines()
    print(len(data))

有文章说Linux有些版本内核会失效

2022年3月31日
分类于 Tutorials
需要 2 分钟阅读时间

https://www.cnblogs.com/dufeixiang/p/11624210.html

改shell

复杂还有bug,我还是改profile吧

https://ibug.io/blog/2022/03/linux-openldap-server/#user-chsh

挂载

挂在同一个地方，肯定是一样的

# shaojiemike @ snode2 in ~ [20:18:20]
$ df -h .
Filesystem       Size  Used Avail Use% Mounted on
10.1.13.1:/home   15T   11T  3.1T  78% /staff

# shaojiemike @ snode0 in ~ [20:25:51]
$ mount|grep staff
10.1.13.1:/home on /staff type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.13.50,local_lock=none,addr=10.1.13.1)

tmpfs是磁盘里的虚拟内存的意思。

设置

具体设置要登录到中央机器上去

# shaojiemike @ hades1 in ~ [20:41:06]
$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 hades1
# 222.195.72.30 hades0
# 202.38.72.64 hades1
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

114.214.198.26  synology
10.1.13.1       acsa-nfs
10.1.13.6       discovery
10.1.13.50      snode0
10.1.13.51      snode1
10.1.13.52      snode2
10.1.13.53      snode3
10.1.13.54      snode4
10.1.13.55      snode5
10.1.13.56      snode6
10.1.13.114     swabl
10.1.13.119     node19
10.1.13.102     node2
10.1.13.58      hades0
10.1.13.57      hades1

# shaojiemike @ snode0 in ~ [20:36:26]
$ sudo cat /etc/nslcd.conf
# /etc/nslcd.conf
# nslcd configuration file. See nslcd.conf(5)
# for details.

# The user and group nslcd should run as.
uid nslcd
gid nslcd

# The location at which the LDAP server(s) should be reachable.
uri ldaps://ldap.swangeese.fun

需要进一步的研究学习

总共涉及几台机器

2022年3月30日
分类于 Tutorials
需要 1 分钟阅读时间

2022年3月30日
分类于 Tutorials
需要 5 分钟阅读时间

Python MPI

全局解释器锁（GIL,Global Interpreter Lock)

Python代码的执行由Python虚拟机（解释器）来控制。

对Python虚拟机的访问由全局解释器锁（GIL）来控制，正是这个锁能保证同时只有一个线程在运行。所以就会出现尽管你设置了多线程的任务，但是只能跑一个的情况。

但是I/O密集的程序(爬虫)相对好一点，因为I/O操作会调用内建的操作系统C代码，所以这时会释放GIL锁，达到部分多线程的效果。

通常我们用的解释器是官方实现的CPython，要真正利用多核，除非重写一个不带GIL的解释器。

2022年3月6日
分类于 Tutorials
需要 2 分钟阅读时间

Powershell

导言

在 Windows Terminal 里使用，有些踩坑经验¹

2022年2月22日
分类于 Tutorials
需要 1 分钟阅读时间

pip

pip + 换源，能使用export http_proxy （：多试几次

pip install xxx -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com 或者设置文件

也可以通过环境变量

export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/
export PIP_TRUSTED_HOST=mirrors.aliyun.com