
Benchmark

Differentiated-Bounded Applications

In the context of data movement, the DAMOV framework identifies six distinct bottleneck classes. While the authors primarily use temporal locality for application classification, Figure 3 (locality-based clustering of 44 representative functions) offers a more comprehensive view and highlights several cases that warrant further testing.

Take, for instance, the four cases situated in the lower right corner:

function short name   benchmark                   class
CHAHsti               Chai-Hito                   1b: DRAM Latency
CHAOpad               Chai-Padding                1c: L1/L2 Cache Capacity
PHELinReg             Phoenix-Linear Regression   1b
PHEStrMat             Phoenix-String Matching     1b

High TLB% -> high TLB miss rate -> memory accesses spanning a large address range -> low spatial locality.
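A back-of-the-envelope sketch of this reasoning (the page size and TLB entry count below are assumptions, typical for recent x86 cores):

```python
# Rough TLB-reach estimate: once the working set spans more pages than the
# TLB has entries, strided/random accesses miss the TLB on nearly every load.
PAGE_SIZE = 4096      # 4 KB pages (assumption)
TLB_ENTRIES = 2000    # ~2000-entry second-level TLB (assumption)

tlb_reach = PAGE_SIZE * TLB_ENTRIES           # bytes the TLB can map at once
print(f"TLB reach: {tlb_reach / 2**20:.1f} MiB")

def touched_pages(span_bytes, stride_bytes):
    """Distinct pages touched when striding through a span of memory."""
    return min(span_bytes // stride_bytes, span_bytes // PAGE_SIZE)

# Page-sized strides over a 1 GiB span touch ~262k pages, far more than the
# TLB's ~2000 entries, so nearly every access walks the page table.
print(touched_pages(1 << 30, PAGE_SIZE) > TLB_ENTRIES)
```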

Benchmarks

Chai Benchmark

The Chai benchmark code can be sourced either from DAMOV or directly from its GitHub repository. Chai stands for "Collaborative Heterogeneous Applications for Integrated Architectures."

Installing Chai is a straightforward process. You can achieve it by executing the command python3 compile.py.

One notable feature of the Chai benchmark is its adaptability in terms of input size. Modifying the input size of the following applications is a simple and flexible task:

# cd application directory
./bfs_00 -t 4 -f input/xxx.input
./hsti -n 1024000             # Image Histogram - Input Partitioning (HSTI)
./hsto -n 1024000             # Image Histogram - Output Partitioning (HSTO)
./ooppad -m 1000 -n 1000   # Padding (PAD)
./ooptrns -m 100 -n 10000  # Transpose / In-place Transposition (TRNS)
./sc -n 1024000            # Stream Compaction (SC)
./sri -n 1024000           # select application

# vector pack , 2048 = 1024 * 2, 1024 = 2^n
./vpack -m 2048 -n 1024 -i 2 
# vector unpack , 2048 = 1024 * 2, 1024 = 2^n
./vupack -m 2048 -n 1024 -i 2 

Parboil (how to run)

The Parboil suite was developed from a collection of benchmarks used at the University of Illinois to measure and compare the performance of computation-intensive algorithms executing on either a CPU or a GPU. Each implementation of a GPU algorithm is either in CUDA or OpenCL, and requires a system capable of executing applications using those APIs.

# compile (see compile.py); per-benchmark compilation would be:
# python2.7 ./parboil compile bfs omp_base
python3 compile.py

# no idea how to run; this command failed (skip):
python2.7 ./parboil run bfs cuda default
# executables are generated under benchmarks/*, but they need input files that are nowhere to be found

Phoenix

Phoenix is a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.

# with code from DAMOV
import os
import sys

os.chdir("phoenix-2.0/tests") # app is changed from sample_apps/*
os.system("make")
os.chdir("../../")

# generates an executable at phoenix-2.0/tests/{app}/{app}

# example run:
./phoenix-2.0/tests/linear_regression/linear_regression ./phoenix-2.0/datasets/linear_regression_datafiles/key_file_500MB.txt 

PolyBench

PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics simulation, dynamic programming, statistics, etc.).

PolyBench features include:

  • A single file, tunable at compile-time, used for the kernel instrumentation. It performs extra operations such as cache flushing before the kernel execution, and can set real-time scheduling to prevent OS interference.
# compile using DAMOV code
python compile.py
# executables are in OpenMP/compiled; all run without parameters

PrIM

Real applications for the first real-world PIM platform (UPMEM).

Rodinia (developed)

Rodinia is a benchmark suite for heterogeneous parallel computing targeting multi-core CPU and GPU platforms; it was first introduced at IISWC 2009.

zsim-hooked code from GitHub

  1. Install the CUDA/OCL drivers, SDK and toolkit on your machine.
  2. Modify the common/make.config file to change the settings of the Rodinia home directory and the CUDA/OCL library paths.
  3. It seems to need Intel OpenCL, but Intel has moved everything to oneAPI.
  4. To compile all the programs of the Rodinia benchmark suite, simply use the universal makefile, or go to each benchmark directory and make individual programs.
  5. The full code with related data can be downloaded from the website.
mkdir -p ./bin/linux/omp
make OMP

Running the zsim hooked apps

cd bin/linux/omp
./pathfinder 100000 100 7
./myocyte.out 100 1 0 4
./lavaMD -cores 4 -boxes1d 10 # -boxes1d  (number of boxes in one dimension, the total number of boxes will be that^3)
./omp/lud_omp -s 8000
./srad 2048 2048 0 127 0 127 2 0.5 2
./backprop 4 65536 # OMP_NUM_THREADS=4


# need to download data or file
./hotspot 1024 1024 2 4 ../../data/hotspot/temp_1024 ../../data/hotspot/power_1024 output.out
./OpenMP/leukocyte 5 4 ../../data/leukocyte/testfile.avi
# streamcluster
./sc_omp k1 k2 d n chunksize clustersize infile outfile nproc
./sc_omp 10 20 256 65536 65536 1000 none output.txt 4
./bfs 4 ../../data/bfs/graph1MW_6.txt 
./kmeans_serial/kmeans -i ../../data/kmeans/kdd_cup
./kmeans_openmp/kmeans -n 4 -i ../../data/kmeans/kdd_cup

dynamic data structures

We chose this specific suite because dynamic data structures are the core of many server workloads (e.g., Memcached's hash table, RocksDB's skip list) and are a great match for near-memory processing.

ASCYLIB + OPTIK

Graph Apps

Graph500 Benchmark Exploration

Official Version

The official version of the Graph500 benchmark can be downloaded from its GitHub repository. Notable features of this version include:

  • Primarily MPI Implementation: The benchmark is built as an MPI (Message Passing Interface) version, without an accompanying OpenMP version. This can be disappointing for those utilizing tools like zsim.
  • Flexible n Value: By default, the value n is set to powers of 2, but it's possible to change this behavior through configuration adjustments.
  • Customization Options: Environment variables can be altered to modify the execution process. For instance, the BFS (Breadth-First Search) portion can be skipped or the destination path for saved results can be changed.

Unofficial Repository

An alternative unofficial repository also exists. However, it requires OpenCL for compilation. The process can be broken down as follows:

  • OpenCL Dependency: The unofficial repository mandates the presence of OpenCL. To set up OpenCL, you can refer to this tutorial.
sudo apt-get install clinfo
sudo apt-get install opencl-headers
sudo apt install opencl-dev

After completing the OpenCL setup and compiling with cmake && make, we obtain an executable named benchmark. By default it appears to use only a single core, even after setting export OMP_NUM_THREADS=32. In this mode a run takes roughly 5 minutes to produce a report on edge/node verification status (or similar). Without deeper familiarity with the code, the report is confusing, especially when trying to locate the BFS (Breadth-First Search) and SSSP (Single-Source Shortest Path) components.

What is even more disheartening is that the TLB (Translation Lookaside Buffer) overhead turns out to be disappointingly low, similar to that of the GUPS (Giga Updates Per Second) OpenMP version.

In order to gain a clearer understanding and potentially address these issues, further investigation and potentially adjustments to the program configuration may be necessary.

$ ./tlbstat -c '/staff/shaojiemike/github/graph500_openmp/build/benchmark' 
command is /staff/shaojiemike/github/graph500_openmp/build/benchmark                       
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%         
20819312   10801013    0.52 7736938    1552557    369122     51902       1.77  0.25
20336549   10689727    0.53 7916323    1544469    354426     48123       1.74  0.24
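As a sanity check, the DTLB%/ITLB% columns can be reproduced from the raw counters: they are the share of core cycles spent on page walks (field order taken from the tlbstat header above):

```python
# Recompute tlbstat's DTLB%/ITLB% from its raw columns:
# DTLB% = K_DTLBCYC / K_CYCLES and ITLB% = K_ITLBCYC / K_CYCLES.
row = "20819312   10801013    0.52 7736938    1552557    369122     51902       1.77  0.25"
(k_cycles, k_instr, ipc, dtlb_walks, itlb_walks,
 k_dtlbcyc, k_itlbcyc, dtlb_pct, itlb_pct) = (float(x) for x in row.split())

dtlb = 100 * k_dtlbcyc / k_cycles
itlb = 100 * k_itlbcyc / k_cycles
print(f"DTLB% = {dtlb:.2f}, ITLB% = {itlb:.2f}")  # matches the reported 1.77 / 0.25
```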

GAP

zsim/LLVM-pass instrumentation code is available from the PIMProf paper's GitHub.

The graph datasets, however, must be generated yourself by following the GitHub instructions:

# 2^17 nodes 
./converter -g17 -k16 -b kron-17.sg
./converter -g17 -k16 -wb kron-17.wsg
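Here -g17 means scale 17 (2^17 vertices) and -k16 means edge factor 16 (16 edges per vertex). A minimal R-MAT-style sketch of how such a Kronecker graph is generated (an illustration with the classic Graph500 probabilities, not the GAP converter's actual generator):

```python
import random

def rmat_edges(scale, edge_factor, a=0.57, b=0.19, c=0.19, seed=42):
    """Generate (2^scale * edge_factor) edges by recursive quadrant choice;
    the fourth quadrant probability is d = 1 - a - b - c."""
    rng = random.Random(seed)
    edges = []
    for _ in range((1 << scale) * edge_factor):
        u = v = 0
        for _ in range(scale):          # one quadrant decision per vertex bit
            r = rng.random()
            if r < a:            ub, vb = 0, 0
            elif r < a + b:      ub, vb = 0, 1
            elif r < a + b + c:  ub, vb = 1, 0
            else:                ub, vb = 1, 1
            u, v = (u << 1) | ub, (v << 1) | vb
        edges.append((u, v))
    return edges

es = rmat_edges(scale=10, edge_factor=16)   # a tiny stand-in for -g17 -k16
print(len(es))  # 16384 edges over 1024 vertices
```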

Kernels Included

  • Breadth-First Search (BFS) - direction optimizing
  • Single-Source Shortest Paths (SSSP) - delta stepping
  • PageRank (PR) - iterative method in pull direction
  • Connected Components (CC) - Afforest & Shiloach-Vishkin
  • Betweenness Centrality (BC) - Brandes
  • Triangle Counting (TC) - Order invariant with possible relabelling

ligra

Code also from DAMOV

# compile
python3 compile.py

# 3 kinds of executables per app; the relevant code is in the /ligra directory
# emd: edgeMapDense()  - presumably the dense-traversal path
# ems: edgeMapSparse() - presumably the sparse (edge-by-edge) path
# compute: the core compute part

The source code is difficult to read; skipping a deeper dive.

The graph format: it seems line count = offsets + edges + 3.

AdjacencyGraph
16777216   # n: number of vertices (number of offset lines that follow)
100000000  # m: number of edges (uintE* edges = newA(uintE,m))
0
470
794        # offsets: monotonically non-decreasing, in [0,m); the edges between offset[i] and offset[i+1] belong to vertex i
……
14680024   # edge targets: each in [0,n-1], the neighbor vertices of each node (so they come in pairs)
16644052
16284631
15850460

$ wc -l  /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M
116777219 /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M
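A small validator sketch for this format (header line, n, m, n monotone offsets, then m edge targets), which also encodes the "line count = offsets + edges + 3" observation above:

```python
def check_adjacency_graph(lines):
    """Validate Ligra's AdjacencyGraph text format; returns (n, m)."""
    assert lines[0].strip() == "AdjacencyGraph"
    n, m = int(lines[1]), int(lines[2])
    assert len(lines) == n + m + 3, "line count = offsets + edges + 3"
    offsets = [int(x) for x in lines[3:3 + n]]
    edges = [int(x) for x in lines[3 + n:]]
    assert all(offsets[i] <= offsets[i + 1] for i in range(n - 1))  # monotone
    assert all(0 <= o <= m for o in offsets)   # offsets index into the edge list
    assert all(0 <= e < n for e in edges)      # targets are vertex ids
    return n, m

# tiny hand-made example: 3 vertices, 4 edges
demo = ["AdjacencyGraph", "3", "4",
        "0", "2", "3",        # vertex 0 owns edges [0,2), vertex 1 [2,3), vertex 2 [3,4)
        "1", "2", "0", "1"]   # edge targets
print(check_adjacency_graph(demo))  # (3, 4)
```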

For the PageRank algorithm, see my other post.

HPC

These are perhaps more computation-intensive than the graph applications.

PARSEC

From DAMOV

The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a collection of parallel programs which can be used for performance studies of multiprocessor machines.

# compile
python3 compile_parsec.py

# exe in pkgs/binaries
./pkgs/binaries/blackscholes 4 ./pkgs/inputs/blackscholes/in_64K.txt black.out
./pkgs/binaries/fluidanimate 4 10 ./pkgs/inputs/fluidanimate/in_300K.fluid

STREAM apps

DAMOV code for memory-bandwidth testing, which references J. D. McCalpin et al., "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE TCCA Newsletter, 1995.

# compile
python3 compile.py

# the default run reports a Failed Validation error (ignored here)
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5072.0     0.205180     0.157728     0.472323
Add:             6946.6     0.276261     0.172746     0.490767
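As a rough illustration of what the Copy kernel measures (bytes moved per second, counting both the read and the write), here is a minimal pure-Python stand-in; absolute numbers will be far below the C version's, so treat it only as a sketch:

```python
import time

N = 1 << 26                    # 64 MiB buffer
src = bytearray(N)

t0 = time.perf_counter()
dst = bytes(src)               # the "Copy" kernel: read N bytes, write N bytes
t1 = time.perf_counter()

mb_moved = 2 * N / 1e6         # STREAM counts both the read and the write
print(f"Copy: {mb_moved / (t1 - t0):.1f} MB/s")
```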

Hardware Effects

Two very interesting repositories of CPU and GPU hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU/GPU and OS architecture.

Test also using DAMOV code

# compile
python3 compile.py

# Every example directory has a README that explains the individual effects.
./build/bandwidth-saturation/bandwidth-saturation 0 1
./build/false-sharing/false-sharing 3 8

Most cases use perf to learn more about your system's behavior.

HPCC

From DAMOV and "The HPC Challenge (HPCC) Benchmark Suite," in SC, 2006.

  • RandomAccess OpenMP version (it is also the GUPS)
# install
python compile.py

# must export
export OMP_NUM_THREADS=32
./OPENMP/ra_omp 26 # 26 need almost 20GB mem, 27 need 30GB mem, and so on.

However, the OpenMP version shows no significant page-walk (PGW) overhead.

GUPS

Measures the system's random-memory-access capability.

From GitHub or the HPCC official website; this is the RandomAccess MPI version.

# download using github
make -f Makefile.linux gups_vanilla
# running
gups_vanilla N M chunk
  • N: the length of the global table is 2^N; thus N = 30 runs with a billion-element table.
  • M: the number of update sets per process; larger M means a longer run (more random accessing).
  • chunk: the number of updates in one set on each process. In the official HPCC benchmark this is specified to be no larger than 1024, but you can run the code with any value you like. GUPS performance will typically decrease for smaller chunk sizes.
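The parameters map onto the RandomAccess core loop roughly as follows: N sets the table size and M * chunk the total update count. A pure-Python sketch of the kernel (the real benchmark uses its own 64-bit generator; a plain LCG stands in here as an assumption for illustration):

```python
def gups_kernel(n_log2, updates, seed=1):
    """Sketch of HPCC RandomAccess: XOR-update random slots of a 2^N table."""
    size = 1 << n_log2
    mask = size - 1
    table = list(range(size))              # HPCC initializes table[i] = i
    ran = seed
    for _ in range(updates):
        ran = (ran * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        table[ran & mask] ^= ran           # random index: almost no locality
    return table

t = gups_kernel(n_log2=10, updates=4096)   # tiny stand-in for an N=30 run
print(len(t))  # 1024
```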

After testing, the best combination is mpirun -n 32 ./gups_vanilla 32 100000 2048, where N is at most 32, otherwise memory is exhausted.

n       32      31      30      29
DTLB%   66.81   45.22   19.50   13.20
ITLB%   0.06    0.07    0.06    0.09

Alternatively, run it standalone as ./gups_vanilla 30 100000 k (N at most 30), e.g. ./tlbstat -c '/usr/bin/mpirun -np 1 /staff/shaojiemike/github/gups/gups_vanilla 30 100000 8192'

n\k   1024    2048    4096    8192
30    44%     90%     80%
27    88%
24    83%     83%
20    58%     62%
15    0.27%   0.3%
Hand-crafted bigJump microbenchmark
#include <bits/stdc++.h>
#include "../zsim_hooks.h"
using namespace std;

#define MOD int(1e9)

// ~2000 TLB entries is a normal CPU configuration;
// with 4 KB pages, each data access jumps over more than 1024 ints (one page)
// to trigger a TLB miss, and wraps around after ~2000 entries,
// so at least 2000 * 4 KB = 8 MB of space is needed.

// #define BASIC_8MB 2000000 * 2
#define BASIC_8MB (1 << 22)

// 1 second program. stream add 6GB/s, int is 4B, repeated 10^9
// #define all_loops 1000000
#define all_loops (1 << 20)

int main(int argc, char* argv[]) {
   if (argc != 4) {
      std::cerr << "Usage: " << argv[0] << " <space scale> <jump distance scale> <loop times>" << std::endl;
      return 1;
   }

   // Convert the second command-line argument (argv[1]) to an integer
   int N = std::atoi(argv[1]);
   int J = std::atoi(argv[2]);
   int M = std::atoi(argv[3]);

   std::cout << "Number read from command line: " << N << " " << J << " (N,J should not big, [0,5] is best.)" <<std::endl;

   const int size = BASIC_8MB << N;
   const int size_mask = size - 1;
   int * jump_space = (int *)malloc(size * sizeof(int));

   zsim_begin();
   int result = 0;
   int mem_access_count = 0;
   int mem_access_index = 0;
   // int mask = (1<<10<<J)-1;
   // int ran = 0x12345678;
   int mask = (1<<J)-1;
   int ran = (1<<30)-1;
   // even without a randomized address, TLB occupancy is also high
   // ran = (ran << 1) ^ ((int) ran < 0 ? 0x87654321 : 0);
   while(mem_access_count++ < all_loops*M){
      // read & write 
      jump_space[mem_access_index] = ran;
      // jump at least one page (1024 ints) plus a masked random offset
      mem_access_index = (mem_access_index + (1024 + (ran & mask))) & size_mask;
      // cout << "mem_access_index = " << mem_access_index << endl;
   }
   zsim_end();

   // print result so the compiler cannot optimize the loop away
   printf("result %d\n", result);
}

HPCG

From DAMOV and High Performance Conjugate Gradient Benchmark (HPCG)

HPCG is a software package that performs a fixed number of multigrid-preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double-precision (64-bit) floating-point values, i.e., a floating-point conjugate-gradient solve.

Follow the instructions in INSTALL and analyze compile.py:

  1. choose a makefile such as setup/Make.GCC_OMP
  2. config values like MPdir can be left alone, because GCC_OMP sets -DHPCG_NO_MPI
  3. add -DHPGSym to CXXFLAGS or HPCG_OPTS
  4. cd build and run ../configure GCC_OMP
  5. run compile.py to compile the executable files
  6. this yields 4 ComputeProlongation executables in build/bin
  7. test the exe using xhpcg 32 24 16 for the three dimensions
  8. or xhpcg --nx=16 --rt=120 for NX=NY=NZ=16 and a 120-second run
  9. change int refMaxIters = 50; to int refMaxIters = 1; to limit the CG iteration count
  10. note that --nx must be a multiple of 8
  11. if there are no geometry arguments on the command line, hpcg will call ReadHpcgDat and use the defaults --nx=104 --rt=60
value\nx        96     240    360        480
mem             17GB   40GB   72.8GB
time (icarus0)  8s     84s    4min40s    core dumped (7 mins)

core dumped for xhpcg_HPGPro: ../src/GenerateProblem_ref.cpp:204: void GenerateProblem_ref(SparseMatrix&, Vector*, Vector*, Vector*): Assertion 'totalNumberOfNonzeros>0' failed.

MPdir        = 
MPinc        = 
MPlib        = 

HPCG_OPTS     = -DHPCG_NO_MPI -DHPGSym

compile error

../src/ComputeResidual.cpp:59:13: error: 'n' not specified in enclosing 'parallel'

The fix: just add n to the shared clause.

Database

Hash Joins

This package provides implementations of the main-memory hash join algorithms described and studied in C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu, “Main-Memory Hash Joins on Modern Processor Architectures,” TKDE, 2015.

Test also in DAMOV

# install
python compile.py

# running
./src/mchashjoins_* -r 12800000 -s 12000000 -x 12345 -y 54321

These cases show significant TLB resource strain.
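The TLB strain is inherent to the algorithm: the build phase materializes a hash table far larger than the TLB reach, and the probe phase hits it at effectively random addresses. A minimal no-partitioning hash-join sketch (an illustration only, not the paper's implementation; -r/-s above set the two relation sizes):

```python
def hash_join(R, S):
    """No-partitioning hash join: build a table on R, probe it with S."""
    table = {}
    for key, payload in R:             # build phase: sequential scan of R
        table.setdefault(key, []).append(payload)
    out = []
    for key, payload in S:             # probe phase: scattered random lookups
        for r_payload in table.get(key, []):
            out.append((key, r_payload, payload))
    return out

R = [(1, "r1"), (2, "r2"), (2, "r3")]
S = [(2, "s1"), (3, "s2")]
print(hash_join(R, S))  # [(2, 'r2', 's1'), (2, 'r3', 's1')]
```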

AI

Darknet for computer vision on multiple cores

From DAMOV; the official documentation is detailed.

shaojiemike @ snode6 in ~/github/DAMOV/workloads/Darknet on git:main x [14:13:02]
$ ./darknet detect cfg/yolo.cfg ./weights/yolo.weights data/dog.jpg
  • models live in the weights files; models of different sizes (28 MB - 528 MB) can be downloaded from the website
  • picture data lives in the data files (70 KB - 100 MB)
  • the executable must be run from its own directory, annoyingly

genomics / DNA

BWA

From DAMOV; BWA maps DNA sequences against a large reference genome, such as the human genome.

  • Download DNA data like ref.fa following the "Data download" steps.
  • Running sratoolkit.3.0.6-ubuntu64/bin/fastq-dump --split-files SRR7733443 got stuck, because it generates several 96GB SRR7733443_X.fastq files, with X from 1 to n.
  • sratool cannot limit the file size, but head -c 100MB SRR7733443_1.fastq > ref_100MB.fastq yields a file of the wanted size.
  • Further run commands are described on the GitHub page.
  • ./bwa index -p abc ref_100MB.fastq generates several abc.suffix files in about 50 seconds.
  • Now you can run ./bwa mem -t 4 abc ref_100MB.fastq or ./bwa aln -t 4 abc ref_100MB.fastq.

GASE

GASE - Generic Aligner for Seed-and-Extend

GASE is a DNA read aligner, developed for measuring the mapping accuracy and execution time of different combinations of seeding and extension techniques. GASE is implemented by extending BWA (version 0.7.13), developed by Heng Li.

Code also from DAMOV, but there seem to be some syntax errors in the program, so we skip this app.

Topics for further study

None for now.

Problems encountered

None for now.

Motivation, summary, reflections, rants

References

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks