According to DAMOV, data movement bottlenecks fall into six different classes. Although the authors use only temporal locality to classify the applications, Figure 3 ("Locality-based clustering of 44 representative functions") gives a broader view of the locality-based clustering and highlights several cases that are worth testing.
Take, for instance, the four cases situated in the lower right corner:
| function short name | benchmark | class |
| --- | --- | --- |
| CHAHsti | Chai-Hito | 1b: DRAM Latency |
| CHAOpad | Chai-Padding | 1c: L1/L2 Cache Capacity |
| PHELinReg | Phoenix-Linear Regression | 1b |
| PHEStrMat | Phoenix-String Matching | 1b |
A high TLB-walk percentage implies a high TLB miss rate, which implies that memory accesses span a large address range; in other words, spatial locality is low.
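As a rough sanity check (my own back-of-the-envelope numbers, assuming a typical second-level DTLB of about 1536 entries and 4 KiB pages): TLB reach ≈ 1536 × 4 KiB ≈ 6 MiB, so once accesses stride across much more than ~6 MiB of address space, the page walker is invoked constantly.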
The Chai benchmark code can be sourced either from DAMOV or directly from its GitHub repository. Chai stands for "Collaborative Heterogeneous Applications for Integrated Architectures."
Installing Chai is straightforward: run python3 compile.py.
One notable feature of the Chai benchmark is that its input sizes are easy to adjust. For the following applications, changing the input size is simple:
# cd application directory
./bfs_00 -t 4 -f input/xxx.input
./hsti -n 1024000 # Image Histogram - Input Partitioning (HSTI)
./hsto -n 1024000 # Image Histogram - Output Partitioning (HSTO)
./ooppad -m 1000 -n 1000 # Padding (PAD)
./ooptrns -m 100 -n 10000 # Transpose / In-place Transposition (TRNS)
./sc -n 1024000 # Stream Compaction (SC)
./sri -n 1024000 # select application
# vector pack , 2048 = 1024 * 2, 1024 = 2^n
./vpack -m 2048 -n 1024 -i 2
# vector unpack , 2048 = 1024 * 2, 1024 = 2^n
./vupack -m 2048 -n 1024 -i 2
The Parboil suite was developed from a collection of benchmarks used at the
University of Illinois to measure and compare the performance of computation-intensive algorithms executing on either a CPU or a GPU. Each
implementation of a GPU algorithm is either in CUDA or OpenCL, and requires
a system capable of executing applications using those APIs.
# compile (see compile.py)
# python2.7 ./parboil compile bfs omp_base
python3 compile.py
# I could not figure out how to run it; the following command failed (skipped):
python2.7 ./parboil run bfs cuda default
# executables are under benchmarks/*, but they need input files that are nowhere to be found.
Phoenix is a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.
# with code from DAMOV
import os
import sys
os.chdir("phoenix-2.0/tests") # the apps now live under tests/ (moved from sample_apps/*)
os.system("make")
os.chdir("../../")
# executables are generated at phoenix-2.0/tests/{app}/{app}
# example run:
./phoenix-2.0/tests/linear_regression/linear_regression ./phoenix-2.0/datasets/linear_regression_datafiles/key_file_500MB.txt
PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics
simulation, dynamic programming, statistics, etc.).
PolyBench features include:
- A single file, tunable at compile-time, used for the kernel
instrumentation. It performs extra operations such as cache flushing
before the kernel execution, and can set real-time scheduling to
prevent OS interference.
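To make the instrumentation idea concrete, here is a minimal sketch of cache flushing before a timed kernel: stream through a buffer larger than the last-level cache so previously cached data is evicted. The 32 MB buffer size and 64-byte line size are my assumptions, and this is not PolyBench's actual instrumentation code.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch: evict cached data by touching a buffer much larger than
// the last-level cache before timing a kernel (not PolyBench's real code).
void flush_cache(std::size_t llc_bytes = 32u << 20) {
    std::vector<char> scratch(llc_bytes, 1);
    volatile long sink = 0;
    for (std::size_t i = 0; i < scratch.size(); i += 64)  // assume 64-byte cache lines
        sink += scratch[i];
    (void)sink;  // keep the loop from being optimized away
}
```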
# compile using DAMOV code
python compile.py
# executables are in OpenMP/compiled, and all of them run without arguments
Real applications for the first real PIM platform.
Rodinia is a benchmark suite for heterogeneous parallel computing that targets multi-core CPU and GPU platforms; it was first introduced at IISWC 2009.
The zsim-hooked code is from GitHub.
- Install the CUDA/OCL drivers, SDK and toolkit on your machine.
- Modify the common/make.config file to change the settings of the Rodinia home directory and the CUDA/OCL library paths.
- It seems to need Intel OpenCL, but Intel has folded everything into oneAPI.
- To compile all the programs of the Rodinia benchmark suite, simply use the universal makefile, or go to each benchmark directory and make the individual programs.
- The full code with the related data can be downloaded from the website.
mkdir -p ./bin/linux/omp
make OMP
Running the zsim-hooked apps
cd bin/linux/omp
./pathfinder 100000 100 7
./myocyte.out 100 1 0 4
./lavaMD -cores 4 -boxes1d 10 # -boxes1d (number of boxes in one dimension, the total number of boxes will be that^3)
./omp/lud_omp -s 8000
./srad 2048 2048 0 127 0 127 2 0.5 2
./backprop 4 65536 # OMP_NUM_THREADS=4
# the following apps need data files to be downloaded first
./hotspot 1024 1024 2 4 ../../data/hotspot/temp_1024 ../../data/hotspot/power_1024 output.out
./OpenMP/leukocyte 5 4 ../../data/leukocyte/testfile.avi
# streamcluster
./sc_omp k1 k2 d n chunksize clustersize infile outfile nproc
./sc_omp 10 20 256 65536 65536 1000 none output.txt 4
./bfs 4 ../../data/bfs/graph1MW_6.txt
./kmeans_serial/kmeans -i ../../data/kmeans/kdd_cup
./kmeans_openmp/kmeans -n 4 -i ../../data/kmeans/kdd_cup
We choose this specific suite because dynamic data structures are the core of many server workloads (e.g., Memcached's hash table, RocksDB's skip list), and are a great match for near-memory processing.
ASCYLIB + OPTIK
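To see why such dynamic data structures stress the memory system, consider a lookup in a chained hash table: after one hashed access, the walk chases pointers to nodes scattered across the heap. The sketch below is purely illustrative (my own minimal structure, not ASCYLIB/OPTIK code).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: a lookup walks a linked list whose nodes are scattered
// across the heap, so almost every hop is a cache/TLB miss -- the access
// pattern DAMOV associates with near-memory processing candidates.
struct Node {
    std::uint64_t key;
    std::uint64_t value;
    Node* next;
};

struct ChainedHashTable {
    std::vector<Node*> buckets;
    explicit ChainedHashTable(std::size_t n) : buckets(n, nullptr) {}

    bool lookup(std::uint64_t key, std::uint64_t& out) const {
        Node* cur = buckets[key % buckets.size()];  // one near-random access
        while (cur != nullptr) {                    // then pointer chasing
            if (cur->key == key) { out = cur->value; return true; }
            cur = cur->next;
        }
        return false;
    }
};
```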
The official version of the Graph500 benchmark can be downloaded from its GitHub repository. Notable features of this version include:
- Primarily MPI Implementation: The benchmark is built as an MPI (Message Passing Interface) version, without an accompanying OpenMP version. This can be disappointing for those utilizing tools like zsim.
- Flexible n value: by default n is set to powers of 2, but this behavior can be changed through configuration adjustments.
- Customization Options: Environment variables can be altered to modify the execution process. For instance, the BFS (Breadth-First Search) portion can be skipped or the destination path for saved results can be changed.
An alternative unofficial repository also exists. However, it requires OpenCL for compilation. The process can be broken down as follows:
- OpenCL Dependency: The unofficial repository mandates the presence of OpenCL. To set up OpenCL, you can refer to this tutorial.
sudo apt-get install clinfo
sudo apt-get install opencl-headers
sudo apt install opencl-dev
After completing the OpenCL setup and building with cmake & make, we obtain an executable named benchmark.
By default, running this executable without any arguments appears to use only a single core, despite setting the environment variable with export OMP_NUM_THREADS=32.
This default behavior led to a runtime of approximately 5 minutes to generate a report related to edges-node-verify status (or similar). However, for someone without an in-depth technical background, this report can be confusing, especially when trying to locate the BFS (Breadth-First Search) and SSSP (Single-Source Shortest Path) components.
What is even more disheartening is that the TLB (Translation Lookaside Buffer) result is disappointingly low, similar to the performance of the GUPS (Giga Updates Per Second) OpenMP version.
In order to gain a clearer understanding and potentially address these issues, further investigation and potentially adjustments to the program configuration may be necessary.
$ ./tlbstat -c '/staff/shaojiemike/github/graph500_openmp/build/benchmark'
command is /staff/shaojiemike/github/graph500_openmp/build/benchmark
K_CYCLES K_INSTR IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC K_ITLBCYC DTLB% ITLB%
20819312 10801013 0.52 7736938 1552557 369122 51902 1.77 0.25
20336549 10689727 0.53 7916323 1544469 354426 48123 1.74 0.24
The zsim(?)/LLVM-pass instrumentation code is from the PIMProf paper's GitHub.
But the graph datasets have to be generated yourself by following the GitHub instructions:
# 2^17 nodes
./converter -g17 -k16 -b kron-17.sg
./converter -g17 -k16 -wb kron-17.wsg
Kernels Included
- Breadth-First Search (BFS) - direction optimizing
- Single-Source Shortest Paths (SSSP) - delta stepping
- PageRank (PR) - iterative method in pull direction
- Connected Components (CC) - Afforest & Shiloach-Vishkin
- Betweenness Centrality (BC) - Brandes
- Triangle Counting (TC) - Order invariant with possible relabelling
Code also from DAMOV
# compile
python3 compile.py
# each app has 3 kinds of executables; the relevant code is in the /ligra directory
# emd: edgeMapDense(), which presumably processes the dense(-frontier) edge data
# ems: edgeMapSparse(), which processes the sparse edge data
# compute: the core compute part, of course
The source code is difficult to read, so I skip it.
The graph format: it seems that lines_num = offsets + edges + 3.
AdjacencyGraph
16777216 # number of offsets (one per vertex), indexing into the edge array
100000000 # number of edges: uintE* edges = newA(uintE,m);
0
470
794 # offsets must be monotonically increasing and lie in [0, edges); the edges that follow belong to the corresponding vertex
……
14680024 # edge targets, seemingly random but in [0, vertices-1]; each node's adjacent nodes (so they form pairs)
16644052
16284631
15850460
$ wc -l /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M
116777219 /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M
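As a sanity check of the format (one header line, a vertex count, an edge count, then n offsets and m edge targets, hence n + m + 3 lines), here is a minimal, hypothetical generator for a toy graph in this layout:

```cpp
#include <cstdio>
#include <vector>

// Writes a toy graph in Ligra's AdjacencyGraph text format:
// line 1: "AdjacencyGraph", line 2: n, line 3: m,
// then n offsets into the edge array, then m edge targets.
int main() {
    // 3 vertices: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
    std::vector<long> offsets = {0, 2, 3};
    std::vector<long> edges   = {1, 2, 2, 0};
    std::printf("AdjacencyGraph\n%zu\n%zu\n", offsets.size(), edges.size());
    for (long o : offsets) std::printf("%ld\n", o);
    for (long e : edges)   std::printf("%ld\n", e);
    return 0;  // total lines written: n + m + 3
}
```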
The PageRank algorithm is described in another post of mine.
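For quick reference, the standard damped PageRank update (with damping factor $d$, typically 0.85, and $N$ vertices) is:

$$PR(v) = \frac{1-d}{N} + d \sum_{u \in \mathrm{in}(v)} \frac{PR(u)}{\mathrm{outdeg}(u)}$$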
These are perhaps more computation-intensive than the graph applications above.
From DAMOV
The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a
collection of parallel programs which can be used for performance studies of
multiprocessor machines.
# compile
python3 compile_parsec.py
# exe in pkgs/binaries
./pkgs/binaries/blackscholes 4 ./pkgs/inputs/blackscholes/in_64K.txt black.out
./pkgs/binaries/fluidanimate 4 10 ./pkgs/inputs/fluidanimate/in_300K.fluid
DAMOV code for memory-bandwidth testing, which references J. D. McCalpin et al., "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE TCCA Newsletter, 1995.
# compile
python3 compile.py
# the default run ends with a Failed Validation error (ignore it)
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 5072.0 0.205180 0.157728 0.472323
Add: 6946.6 0.276261 0.172746 0.490767
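The Copy and Add rates above come from STREAM's simple vector kernels. A minimal sketch of the two measured loops is shown below; the real benchmark also runs Scale and Triad, repeats and times each loop, and validates the results.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of the STREAM Copy and Add kernels reported above.
void stream_copy_add(std::size_t n) {
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i];        // Copy: c = a
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i]; // Add:  c = a + b
}
```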
Two very interesting repositories of CPU and GPU hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU/GPU and OS architecture.
Test also using DAMOV code
# compile
python3 compile.py
# Every example directory has a README that explains the individual effects.
./build/bandwidth-saturation/bandwidth-saturation 0 1
./build/false-sharing/false-sharing 3 8
Most cases use perf to learn more about your system's capabilities.
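As an example of the kind of effect these repositories demonstrate, the sketch below (my own, not taken from the repository) shows false sharing: two threads update different counters that happen to share a cache line, so the line ping-pongs between cores.

```cpp
#include <atomic>
#include <thread>

// Illustrative false-sharing demo (not from the hardware-effects repo):
// both counters live in the same 64-byte cache line, so the two threads
// invalidate each other's copy on every increment.
struct Counters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};   // padding 'a' out to 64 bytes would remove the effect
};

int main() {
    Counters c;
    auto bump = [](std::atomic<long>& x) {
        for (long i = 0; i < 100'000'000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1([&] { bump(c.a); });
    std::thread t2([&] { bump(c.b); });
    t1.join();
    t2.join();
    return 0;
}
```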
From DAMOV and "The HPC Challenge (HPCC) Benchmark Suite," SC, 2006.
- RandomAccess, OpenMP version (this is also GUPS)
# install
python compile.py
# must export
export OMP_NUM_THREADS=32
./OPENMP/ra_omp 26 # 26 needs almost 20GB of memory, 27 needs 30GB, and so on.
However, the OpenMP version shows no big page-walk (PGW) overhead.
GUPS characterizes a system's random-access capability.
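The core of RandomAccess/GUPS is a stream of read-modify-write updates to pseudo-random table locations, which is what defeats both caches and the TLB. A minimal single-threaded sketch (not the HPCC source) is:

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of the GUPS/RandomAccess update loop (not the HPCC code):
// each iteration XORs a pseudo-random value into a pseudo-random table slot,
// so almost every update touches a different page -- hence the TLB pressure.
void gups(std::vector<std::uint64_t>& table, std::uint64_t num_updates) {
    const std::uint64_t mask = table.size() - 1;   // table size must be 2^N
    std::uint64_t ran = 1;
    for (std::uint64_t i = 0; i < num_updates; ++i) {
        // HPCC-style LFSR step over GF(2) with polynomial 0x7
        ran = (ran << 1) ^ ((std::int64_t)ran < 0 ? 7 : 0);
        table[ran & mask] ^= ran;
    }
}
```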
From GitHub or the HPCC official website: the RandomAccess MPI version.
# download using github
make -f Makefile.linux gups_vanilla
# running
gups_vanilla N M chunk
N = length of global table is 2^N
Thus N = 30 would run with a billion-element table.
M = # of update sets per proc
The larger M is, the longer the run (i.e., the longer the random accessing continues).
chunk = # of updates in one set on each proc
- In the official HPCC benchmark this is specified to be no larger than 1024, but you can run the code with any value you like. Your GUPS performance will typically decrease for smaller chunk size.
After testing, the best combination is mpirun -n 32 ./gups_vanilla 32 100000 2048,
where N can be at most 32, otherwise it runs out of memory.
| n | 32 | 31 | 30 | 29 |
| --- | --- | --- | --- | --- |
| DTLB% | 66.81 | 45.22 | 19.50 | 13.20 |
| ITLB% | 0.06 | 0.07 | 0.06 | 0.09 |
Alternatively, run ./gups_vanilla 30 100000 k on its own (N at most 30 here), e.g. ./tlbstat -c '/usr/bin/mpirun -np 1 /staff/shaojiemike/github/gups/gups_vanilla 30 100000 8192'
| n\k | 1024 | 2048 | 4096 | 8192 |
| --- | --- | --- | --- | --- |
| 30 | | 44% | 90% | 80% |
| 27 | | | | 88% |
| 24 | | | 83% | 83% |
| 20 | | | 58% | 62% |
| 15 | | | 0.27% | 0.3% |
Hand-crafting a bigJump microbenchmark
#include <bits/stdc++.h>
#include "../zsim_hooks.h"
using namespace std;
#define MOD int(1e9)
// ~2000 TLB entries is a typical CPU configuration;
// with 4KB pages, jumping over ~1000 ints makes each access trigger a TLB miss,
// and wrapping around after 2000 entries needs at least 2000 * 4KB = 8MB of space
// #define BASIC_8MB 2000000 * 2
#define BASIC_8MB (1 << 22)
// target ~1 second of runtime: STREAM Add is ~6GB/s, an int is 4B, so ~10^9 accesses
// #define all_loops 1000000
#define all_loops (1 << 20)
int main(int argc, char* argv[]) {
if (argc != 4) {
std::cerr << "Usage: " << argv[0] << " <space scale> <jump distance scale> <loop times>" << std::endl;
return 1;
}
// Convert the command-line arguments to integers
int N = std::atoi(argv[1]);
int J = std::atoi(argv[2]);
int M = std::atoi(argv[3]);
std::cout << "Number read from command line: " << N << " " << J << " (N,J should not big, [0,5] is best.)" <<std::endl;
const int size = BASIC_8MB << N;
const int size_mask = size - 1;
int * jump_space = (int *)malloc(size * sizeof(int));
zsim_begin();
int result = 0;
int mem_access_count = 0;
int mem_access_index = 0;
// int mask = (1<<10<<J)-1;
// int ran = 0x12345678;
int mask = (1<<J)-1;
int ran = (1<<30)-1;
// even without a random address, TLB occupancy is also high
// ran = (ran << 1) ^ ((int) ran < 0 ? 0x87654321 : 0);
while(mem_access_count++ < all_loops*M){
// read & write the current location
jump_space[mem_access_index] = ran;
// advance by a bit more than one 4KB page (1024 ints); the parentheses around
// (ran & mask) matter, since '&' binds more loosely than '+'
mem_access_index = (mem_access_index + (1024 + (ran & mask))) & size_mask;
// cout << "mem_access_index = " << mem_access_index << endl;
}
zsim_end();
// print the result (it stays 0 here; the writes to jump_space are the measured work)
printf("result %d\n", result);
return 0;
}
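To see why this defeats the TLB, plug in the defaults: with J = 5 the stride is 1024 + 31 = 1055 ints ≈ 4.2 KB, so consecutive accesses land on different 4 KB pages, and with N = 0 the buffer is (1 << 22) ints = 16 MB ≈ 4096 pages, far more than a typical ~1500-entry second-level DTLB can map, so nearly every access triggers a page walk.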
From DAMOV and High Performance Conjugate Gradient Benchmark (HPCG)
HPCG is a software package that performs a fixed number of multigrid preconditioned
(using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double
precision (64-bit) floating point values, i.e., a conjugate-gradient solve over floating-point data.
Follow the instructions in INSTALL and analyze compile.py:
- choose a makefile such as setup/Make.GCC_OMP
- configure values like MPdir, but we can leave them empty because we use GCC_OMP, which sets -DHPCG_NO_MPI in it
- add -DHPGSym to CXXFLAGS or HPCG_OPTS
- cd build and run ../configure GCC_OMP
- run compile.py to compile the executable files
- you get 4 ComputeProlongation executables in build/bin
- test the exe using xhpcg 32 24 16 for the three dimensions
- or xhpcg --nx=16 --rt=120 for NX=NY=NZ=16 and a run time of 120 seconds
- change int refMaxIters = 50; to int refMaxIters = 1; to limit the number of CG iterations
- note that --nx=16 must be a multiple of 8
- if there are no geometry arguments on the command line, hpcg will ReadHpcgDat and use the default --nx=104 --rt=60
| value \ nx | 96 | 240 | 360 | 480 |
| --- | --- | --- | --- | --- |
| mem | | 17GB | 40GB | 72.8GB |
| time (icarus0) | 8s | 84s | 4min40s | core dumped (7min) |
core dumped for xhpcg_HPGPro: ../src/GenerateProblem_ref.cpp:204: void GenerateProblem_ref(SparseMatrix&, Vector*, Vector*, Vector*): Assertion 'totalNumberOfNonzeros>0' failed.
MPdir =
MPinc =
MPlib =
HPCG_OPTS = -DHPCG_NO_MPI -DHPGSym
../src/ComputeResidual.cpp:59:13: error: 'n' not specified in enclosing 'parallel'
Just add n to the shared variable list.
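The reason is that with default(none), every variable used inside the parallel region, including the loop bound n, must appear in an explicit data-sharing clause. Below is a hedged, self-contained illustration of the pattern; the variable names other than n are placeholders, not necessarily those in ComputeResidual.cpp.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustration of the fix: list 'n' in the shared() clause so that
// default(none) no longer rejects its use inside the region.
int main() {
    int n = 1 << 20;
    std::vector<double> v1(n, 1.0), v2(n, 1.0);
    v2[12345] = 3.0;
    double residual = 0.0;
    // Before (rejected): #pragma omp parallel default(none) shared(v1, v2, residual)
    #pragma omp parallel default(none) shared(n, v1, v2, residual)
    {
        double local = 0.0;
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            double diff = std::fabs(v1[i] - v2[i]);
            if (diff > local) local = diff;
        }
        #pragma omp critical
        if (local > residual) residual = local;
    }
    std::printf("residual = %f\n", residual);
    return 0;
}
```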
This package provides implementations of the main-memory hash join algorithms
described and studied in C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu, “Main-Memory Hash Joins on Modern Processor Architectures,” TKDE, 2015.
Test also in DAMOV
# install
python compile.py
# running
./src/mchashjoins_* -r 12800000 -s 12000000 -x 12345 -y 54321
These cases show strain on TLB resources.
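For intuition about where the TLB strain comes from: a hash join builds a hash table over one relation and then probes it once per tuple of the other relation, and each probe is a near-random access into a table that can far exceed the TLB reach. The sketch below is a minimal no-partitioning join, not the paper's optimized (radix-partitioned) implementations.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal no-partitioning hash join sketch (illustrative only).
struct Tuple { std::uint64_t key; std::uint64_t payload; };

std::size_t hash_join(const std::vector<Tuple>& build_rel,
                      const std::vector<Tuple>& probe_rel) {
    std::unordered_map<std::uint64_t, std::uint64_t> table;
    table.reserve(build_rel.size());
    for (const Tuple& t : build_rel)        // build phase: sequential reads,
        table.emplace(t.key, t.payload);    // random writes into the table
    std::size_t matches = 0;
    for (const Tuple& t : probe_rel)        // probe phase: one near-random
        matches += table.count(t.key);      // lookup per tuple -> TLB pressure
    return matches;
}
```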
From DAMOV; the official documentation is detailed.
shaojiemike @ snode6 in ~/github/DAMOV/workloads/Darknet on git:main
$ ./darknet detect cfg/yolo.cfg ./weights/yolo.weights data/dog.jpg
- Models are stored in weight files; models of different sizes (28MB - 528MB) can be downloaded from the website.
- Picture data is in the data directory (70KB - 100MB).
- The executable must be run from inside that directory, which is annoying.
From DAMOV; BWA is for mapping DNA sequences against a large reference genome, such as the human genome.
- Download DNA data such as ref.fa following the data download steps.
- Running sratoolkit.3.0.6-ubuntu64/bin/fastq-dump --split-files SRR7733443 got stuck because it generates several 96GB SRR7733443_X.fastq files, where X runs from 1 to n.
- sratool cannot limit the file size, but we can use head -c 100MB SRR7733443_1.fastq > ref_100MB.fastq to get the desired file size.
- Further run commands are described on GitHub: ./bwa index -p abc ref_100MB.fastq generates several abc.suffix files in about 50 seconds.
- Now you can run ./bwa mem -t 4 abc ref_100MB.fastq or ./bwa aln -t 4 abc ref_100MB.fastq.
GASE - Generic Aligner for Seed-and-Extend
GASE is a DNA read aligner, developed for measuring the mapping accuracy and execution time of different combinations of seeding and extension techniques. GASE is implemented by extending BWA (version 0.7.13) developed by Heng Li.
The code is also from DAMOV, but there seem to be some syntax errors in the program, so I skip this app.
Nothing here yet.
Nothing here yet.
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks