
IPCC Preliminary SLIC Optimization 2

chivier's advice on IPCC amd_256

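| Approach | Description | Time | Speedup | Notes |
|---|---|---|---|---|
| Baseline | Serial program | 21872 ms | 1 | |
| Core-loop OpenMP | Thread count not specified | 8079 ms | | |
| Core-loop OpenMP | Single node, 64 cores | 7690 ms | 2.84 | |
| Switch to Intel `icpc` | Based on the previous step | 3071 ms | 7.12 | |
| `-xHOST` | The other `-x` options do not work; based on the previous step | 4012 ms | | |
| `-O3` | Based on the previous step | 3593 ms | | |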
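The speedup column is the serial baseline time divided by the row's time; the two filled-in entries work out as:

```latex
S = \frac{T_{\text{baseline}}}{T}, \qquad
\frac{21872\ \text{ms}}{7690\ \text{ms}} \approx 2.84, \qquad
\frac{21872\ \text{ms}}{3071\ \text{ms}} \approx 7.12
```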

node5

Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz

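| Approach | Description | Time | Speedup | Notes |
|---|---|---|---|---|
| Baseline | Serial program | 29240 ms | 1 | |
| Core-loop OpenMP | Thread count not specified (htop shows 64 cores) | 12244 ms | | |
| Remove redundant computation + the two `for` loops over `numk` | 080501 | 11953 ms / 10054 ms | | |
| Computation fusion (remove `inv`) | 080502 | 15702 ms / 14923 ms / 15438 ms / 11987 ms | | |
| `maxlab` OpenMP | 080503, based on the third row | 13872 ms / 11716 ms | | |
| Loop unrolling?? | | 14436 ms / 14232 ms / 15680 ms | | |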

-xCOMMON-AVX512: not supported

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT, AVX2, AVX512F, ADX and AVX512CD instructions.
-xCORE-AVX2
Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT and AVX2 instructions
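Not listed: FXSAVE, BMI, LZCNT; listed: BMI1, BMI2. (If these names come from `/proc/cpuinfo` or `lscpu`, note that the kernel reports FXSAVE as `fxsr`, BMI as `bmi1`, and LZCNT under `abm`, so the features are probably present. The Intel `-x` options also add a run-time check that rejects non-Intel processors, which `-march` does not.)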

Use -xAVX, or -xHOST to select the most advanced instruction set available. With -xAVX:

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT and AVX instructions.
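To compare the features these messages list against what the node actually reports, a small sketch can read the `flags` line of `/proc/cpuinfo`. It assumes Linux on x86-64 and relies on a hand-written table mapping the Intel names to the kernel's spellings (an assumption, documented in `kernel_name`); all function names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <fstream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Collect the whitespace-separated tokens after the ':' of a cpuinfo "flags" line.
std::set<std::string> parse_flags(const std::string& line) {
    const std::string::size_type colon = line.find(':');
    std::istringstream in(colon == std::string::npos ? line : line.substr(colon + 1));
    std::set<std::string> flags;
    for (std::string tok; in >> tok;) flags.insert(tok);
    return flags;
}

// Translate a feature name from the Intel error message into the spelling the
// Linux kernel uses in /proc/cpuinfo. Only names that differ are listed
// (assumed mapping); everything else is lower-cased, e.g. SSE4_2 -> sse4_2.
std::string kernel_name(const std::string& intel_name) {
    static const std::map<std::string, std::string> alias = {
        {"X87", "fpu"}, {"SSE3", "pni"}, {"FXSAVE", "fxsr"},
        {"BMI", "bmi1"}, {"LZCNT", "abm"}};
    const auto it = alias.find(intel_name);
    if (it != alias.end()) return it->second;
    std::string lower = intel_name;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lower;
}

// Features from `wanted` (Intel spelling) that the flag set does not report.
std::vector<std::string> missing(const std::set<std::string>& flags,
                                 const std::vector<std::string>& wanted) {
    std::vector<std::string> out;
    for (const auto& w : wanted)
        if (flags.count(kernel_name(w)) == 0) out.push_back(w);
    return out;
}

// First "flags" line of /proc/cpuinfo; empty string if the file is unavailable.
std::string read_cpuinfo_flags() {
    std::ifstream f("/proc/cpuinfo");
    for (std::string line; std::getline(f, line);)
        if (line.compare(0, 5, "flags") == 0) return line;
    return {};
}
```

On a compute node, an empty result from `missing(parse_flags(read_cpuinfo_flags()), {"FXSAVE", "BMI", "LZCNT", "AVX2"})` would suggest the hardware does support those features and the failure comes from the compiler's processor check instead.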

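-fast problems: on Linux, `-fast` implies `-static`, so the link below could not find a static libstdc++ and no executable was produced: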

ld: cannot find -lstdc++
ld: cannot find -lstdc++
/public1/soft/intel/2020u4/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libiomp5.a(ompt-general.o): In function `ompt_pre_init':
(.text+0x2281): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/var/spool/slurm/d/job437118/slurm_script: line 23: ./SLIC_slurm_intel_o3: No such file or directory

AMD EPYC 7~~2

icpc -Ofast -march=core-avx2 -ipo -mdynamic-no-pic -unroll-aggressive -no-prec-div -fp-model fast=2 -funroll-all-loops -falign-loops -fma -ftz -fomit-frame-pointer -std=c++11 -qopenmp SLIC_openmp.cpp -o SLIC_slurm_intel_o3

Follow-up optimizations

OpenMP parallelization of the core loop
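Parallelizing the outer loop over seeds directly lets neighboring search windows race on shared pixels. One race-free alternative, not necessarily the one used here, is to split the image by rows so no two threads ever write the same `distvec`/`klabels` entry. The sketch assumes a SLICO-style distance `distlab / maxlab[n] + distxy * invxywt` and a square search window of half-width `offset` around each seed; `Lab`, `Seed`, and `assign_labels` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

struct Lab { double l, a, b; };         // CIELAB color of one pixel
struct Seed { double l, a, b, x, y; };  // cluster center in color + position

// Assignment step parallelized over rows: each row visits every seed whose
// window covers it. Within one pixel, seeds are still tried in increasing n.
void assign_labels(int width, int height, double offset,
                   const std::vector<Lab>& img, const std::vector<Seed>& seeds,
                   const std::vector<double>& maxlab, double invxywt,
                   std::vector<double>& distvec, std::vector<int>& klabels) {
    assert(img.size() == static_cast<std::size_t>(width) * height);
    assert(distvec.size() == img.size() && klabels.size() == img.size());
    std::fill(distvec.begin(), distvec.end(), DBL_MAX);
    const int numk = static_cast<int>(seeds.size());
#pragma omp parallel for schedule(static)
    for (int y = 0; y < height; ++y) {
        for (int n = 0; n < numk; ++n) {
            const Seed& s = seeds[n];
            const int y1 = std::max(0, static_cast<int>(s.y - offset));
            const int y2 = std::min(height, static_cast<int>(s.y + offset));
            if (y < y1 || y >= y2) continue;  // this seed's window misses row y
            const int x1 = std::max(0, static_cast<int>(s.x - offset));
            const int x2 = std::min(width, static_cast<int>(s.x + offset));
            for (int x = x1; x < x2; ++x) {
                const int i = y * width + x;
                const double dl = img[i].l - s.l, da = img[i].a - s.a, db = img[i].b - s.b;
                const double dx = x - s.x, dy = y - s.y;
                const double dist = (dl * dl + da * da + db * db) / maxlab[n]
                                  + (dx * dx + dy * dy) * invxywt;
                if (dist < distvec[i]) { distvec[i] = dist; klabels[i] = n; }
            }
        }
    }
}
```

Because every pixel still sees the seeds in increasing `n`, ties resolve the same way as in a serial seed-major loop, so the labels do not depend on the thread count. The price is one window test per (row, seed) pair.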

Remove redundant computation

Delete every `maxxy` update, such as:
if(maxxy[klabels[i]] < distxy[i]) maxxy[klabels[i]] = distxy[i];
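With the `maxxy` line gone, the per-pixel pass that remains only tracks `maxlab`. A minimal sketch, assuming a per-pixel color-distance array `distlab` analogous to `distxy` and one `maxlab` slot per cluster; the helper name `update_maxlab` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: per-cluster maximum color distance after the maxxy
// update has been deleted. klabels[i] is the cluster of pixel i and
// distlab[i] its (assumed) color distance; maxlab has one slot per cluster.
void update_maxlab(const std::vector<int>& klabels,
                   const std::vector<double>& distlab,
                   std::vector<double>& maxlab) {
    assert(klabels.size() == distlab.size());
    for (std::size_t i = 0; i < klabels.size(); ++i) {
        double& m = maxlab[klabels[i]];  // look up the cluster's slot once
        if (m < distlab[i]) m = distlab[i];
    }
}
```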

Computation fusion (fewer memory accesses)

  1. Remove `inv` (benefit uncertain)
  2. Parallelize the `maxlab` loop with OpenMP (since it is not compute-intensive, should it be loop-unrolled?)
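Both items can be sketched as below. This assumes the centroid update follows the usual SLIC reference layout (per-cluster sums, empty clusters clamped to size 1, and an `inv` array filled in one `numk` loop and consumed in a second); `Sums`, `Seed`, `update_seeds`, and `update_maxlab_parallel` are hypothetical names. Keeping the reciprocal in a local scalar removes the `inv` array and one pass over `numk` while doing the same arithmetic; the `maxlab` pass uses an OpenMP 4.5 array-section `max` reduction so threads never race on a shared slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sums { double l = 0, a = 0, b = 0, x = 0, y = 0; int count = 0; };  // per-cluster accumulators
struct Seed { double l, a, b, x, y; };                                      // cluster center

// Item 1: fused centroid update. The reciprocal lives in a local scalar instead
// of an inv[] array, so one pass over numk replaces two and inv[] is never
// stored or reloaded. The arithmetic (sum * (1.0 / size)) is unchanged.
void update_seeds(const std::vector<Sums>& sums, std::vector<Seed>& seeds) {
    assert(sums.size() == seeds.size());
    for (std::size_t k = 0; k < seeds.size(); ++k) {
        const int n = sums[k].count > 0 ? sums[k].count : 1;  // empty cluster -> size 1
        const double inv = 1.0 / static_cast<double>(n);
        seeds[k] = Seed{sums[k].l * inv, sums[k].a * inv, sums[k].b * inv,
                        sums[k].x * inv, sums[k].y * inv};
    }
}

// Item 2: per-cluster maximum in parallel. The array-section reduction gives
// each thread a private copy of maxlab[0:numk] and merges the copies with max
// at the end; without -fopenmp the pragma is ignored and the loop runs serially.
void update_maxlab_parallel(const std::vector<int>& klabels,
                            const std::vector<double>& distlab,
                            std::vector<double>& maxlab) {
    assert(klabels.size() == distlab.size());
    double* m = maxlab.data();
    const int numk = static_cast<int>(maxlab.size());
    const long long sz = static_cast<long long>(klabels.size());
#pragma omp parallel for reduction(max : m[0:numk])
    for (long long i = 0; i < sz; ++i) {
        const int k = klabels[i];
        if (m[k] < distlab[i]) m[k] = distlab[i];
    }
}
```

Each thread holds a private copy of all `numk` slots and the copies are merged at the end, so the merge cost grows with thread count times `numk`. For a memory-bound pass like this one, whether that beats the serial loop has to be measured.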

Topics for further study

None yet

Problems encountered

None yet

Motivation, summary, reflections, and rants

References