
2023

Diary 230827: An ACG Trip in Shanghai

Motivation

My Huawei internship is about to end. As an anime fan, how could I not take a proper tour of the "Akihabara of China"?

Targets

  1. 百联ZX and the 外文书店
  2. 百米香榭
  3. 迪美地下城 (香港名街 is closed for renovation)
  4. 第一百货 and 新世界
  5. 华漫潮玩 on 4F of 大丸百货
  6. The SPY×FAMILY pop-up at 静安大悦城
  7. 徐家汇: the JUMP store and Totoro store on 1F, the GSC store on 2F
  8. miHoYo headquarters

Loving Shanghai

Shanghai is an extraordinarily inclusive place. The love for anime I used to keep to myself turns out to be shared by so many people. No need to hide it or stay in disguise all the time; being able to relax and be myself for a while feels wonderful.

On the Affection for 2D Characters

Defining love

Love, or passion, is the most intense emotion. Its object is usually a person one can interact with, though perhaps objects qualify too. At minimum, you must be able to keep building good memories and small moments together to sustain the feeling.

For example, I have always wanted to love my work, which requires creating small staged successes and victories to keep myself going.

Distinguishing affection from lust

  1. First, being with the other person feels comfortable; you enjoy the companionship and want it to last.
  2. It is not about having a head full of lewd thoughts.
  3. Good looks are certainly a plus, but temperament, ideas, and the spiritual side matter more.

3D versus 2D characters

3D figures include idol singers and actors. They need performances and concerts to build shared memories with their fans; actors likewise need film and TV works.

2D characters mostly come from anime, because games generally do not aim at characterization (console games, say), with galgames and anime-style mobile games as the obvious exceptions.

Japanese anime, with subject matter and characterization far more delicate than Western or Chinese animation (worthy of the galgame superpower: the psychological portraits in BanG Dream! It's MyGO!!!!! are simply superb), has created many lovable characters.

Comparative advantages

  1. In terms of the ceiling on expressiveness, animation far exceeds games (why else would games need animated CGs?) and live-action film and TV.
  2. Fan creation around 2D characters has a low barrier to entry (both in reproduction difficulty and in legal constraints, since 3D figures are usually tied to real people) and enjoys high tolerance in public opinion (traditional 2D communities are far cleaner than celebrity fandoms), huge advantages over 3D.
  3. Besides, cosplay brings an approachable, almost tangible realness. Fan works can create memories of and bonds with a character far beyond the original work.
  4. Another point may be the creative joy enabled by that low barrier, which I mentioned before when analyzing the joy of music. Fan works mainly include music, MMD, and iwara animations.
  5. Cosplay can tint an otherwise ordinary life with the extraordinary experiences of a character.
  6. Finally, permanence. As I analyzed before, people like to pursue constancy amid a changing life, or the reverse. A 3D figure or actor ages, but a 2D character can come back to life in a new movie.
  7. And one more: a 2D character will never betray you.

Comparative drawbacks

  1. Affection for a 2D character is one-way over the long run; short of projecting onto the protagonist, you can hardly receive the character's affection back (galgames partially make up for this). Interaction is blocked by the dimensional barrier.
  2. Slightly limited growth potential: once a work has ended, the character's image is basically fixed, apart from a small amount of fan creation, unless you feed the character into an AI model to extend its life.
  3. Loss of surprise: real people are multifaceted and uncontrollable, while a 2D character's plot twists exist only within the show.

Preliminary conclusion

Girlfriend > loving 2D characters (ongoing series > finished ones) >> celebrity worship

Photo dump

23.08.27 to do

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Parts of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.

UnimportantView: Anime Recommendation

Origin and goals

  1. I watch anime without having formed my own taste, so watching the wrong shows can actually backfire.
  2. Which shows can become a spiritual pillar, rather than leaving me more drained after watching? (After Happy Sugar Life I went straight into depression.)

Notes

Unlike classifications such as romance anime or tearjerkers, what I care about is the theme a work wants to express and what the author wants to show the audience, whether that is some truth or simply a particular setting or imaginary world.

Bonds: love for people, in romance, family, and friendship.

A journey in search of what love is

Title | Core theme | Comment | Favorite characters | Music
Happy Sugar Life | "Protecting you is my word of love" | In a world where love is hard to understand, two lost girls meet, redeem each other, and come to know love in a honey-sweet sugar life | Satou (sugar) and Shio (salt) | "Canary", the ED, the sad violin piece

Oshi no Ko (episode 1)

Violet Evergarden

Bonds broken and rebuilt

BanG Dream! It's MyGO!!!!! — the breaking and reunion of first bonds (friendship, yuri, heavy girl drama)

The disease named love

Future Diary (Mirai Nikki)

Domestic Girlfriend, Scum's Wish

Love in little moments

Yuri coming-of-age: Bloom Into You

The Dangers in My Heart

Time loops and fate

"Even across time and space, nothing can stop me from loving you"

Steins;Gate

Re:Zero

"The simple, happy future that can never be reached"

Higurashi: When They Cry

Puella Magi Madoka Magica

Epics

Complex, tense, grand works, mostly not reducible to a single core theme, and mostly ensemble dramas.

Fantasy-world epics

Fate/Zero

Fullmetal Alchemist

EVA

to do

Sword Art Online

Your Lie in April

CLANNAD

Toradora!

Attack on Titan

A Certain Scientific Railgun

The Melancholy of Haruhi Suzumiya

Code Geass

K-On!

Backlog

  1. The Monogatari series

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Parts of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.

TLB: real pagewalk overhead

Introduction

For an introduction to the TLB itself, see:

Page tables

Theory

Roughly speaking, the more random an application's memory accesses and the larger its data set, the higher the page-walk (pgw) overhead.

ISCA 2013 shows the pgw overhead in big-memory servers:

Basu et al., "Efficient Virtual Memory for Big Memory Servers"

or ISCA 2020: Guvenilir and Patt, "Tailored Page Sizes"

Machine configuration

# shaojiemike @ snode6 in ~/github/hugoMinos on git:main x [11:17:05]
$ cpuid -1 -l 2
CPU:
      0x63: data TLB: 2M/4M pages, 4-way, 32 entries
            data TLB: 1G pages, 4-way, 4 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries
      0x76: instruction TLB: 2M/4M pages, fully, 8 entries
      0xff: cache data is in CPUID leaf 4
      0xb5: instruction TLB: 4K, 8-way, 64 entries
      0xf0: 64 byte prefetching
      0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
# if above command turns out empty
cpuid -1 |grep TLB -A 10 -B 5
# will show sth like

L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
    instruction # entries     = 0x40 (64)
    instruction associativity = 0xff (255)
    data # entries            = 0x40 (64)
    data associativity        = 0xff (255)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
    instruction # entries     = 0x40 (64)
    instruction associativity = 0xff (255)
    data # entries            = 0x40 (64)
    data associativity        = 0xff (255)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
    instruction # entries     = 0x200 (512)
    instruction associativity = 2-way (2)
    data # entries            = 0x800 (2048)
    data associativity        = 4-way (4)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
    instruction # entries     = 0x200 (512)
    instruction associativity = 4-way (4)
    data # entries            = 0x800 (2048)
    data associativity        = 8-way (6)

OS config

By default there are no hugepages (here 2 MB, per Hugepagesize below) to use.

$ cat /proc/meminfo | grep huge -i
AnonHugePages:      8192 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB

An explanation is here.

Setting the page size

Other ways: changing the source code

  1. way 1: Linux transparent huge page (THP) support lets the kernel automatically promote regular memory pages into huge pages, see cat /sys/kernel/mm/transparent_hugepage/enabled, but getting this to apply takes some care.
  2. way 2: huge pages are allocated from a reserved pool, which needs a sysctl change, for example echo 20 > /proc/sys/vm/nr_hugepages. And you need to write special C++ code to use the huge pages (see the sketch after the mount command below).
# using mmap system call to request huge page
mount -t hugetlbfs \
    -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
    min_size=<value>,nr_inodes=<value> none /mnt/huge
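
A minimal sketch of that "special C++ code", assuming the pool above has been reserved (an illustration, not the only way): an anonymous mmap with MAP_HUGETLB requests hugepage-backed memory directly, while a file-backed mapping would instead go through the hugetlbfs mount shown above.

#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t len = 2 * 1024 * 1024;   // one 2 MB huge page
    // MAP_HUGETLB draws from the pool reserved via /proc/sys/vm/nr_hugepages
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 0, len);                    // touch the page so it is really backed
    munmap(p, len);
    return 0;
}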

Without recompiling

There is, however, a blog post that uses the unmaintained tool hugeadm together with the iodlr library to do this.

sudo apt install libhugetlbfs-bin
sudo hugeadm --create-global-mounts
sudo hugeadm --pool-pages-min 2M:64

Afterwards meminfo changes:

$ cat /proc/meminfo | grep huge -i
AnonHugePages:      8192 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      64
HugePages_Free:       64
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:          131072 kB

using iodlr library

git clone 

Application measurements

Measurement tools from code

# shaojiemike @ snode6 in ~/github/PIA_huawei on git:main x [17:40:50]
$ ./investigation/pagewalk/tlbstat -c '/staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/sssp.inj -f /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/benchmark/kron-20.wsg -n1'
command is /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/sssp.inj -f /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/benchmark/kron-20.wsg -n1
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%
324088     207256      0.64 733758     3276       18284      130         5.64  0.04
21169730   11658340    0.55 11802978   757866     316625     24243       1.50  0.11

Average per-walk cost (from startup to steady state), computed from the rows above as K_DTLBCYC/DTLB_WALKS and K_ITLBCYC/ITLB_WALKS: a DTLB walk takes roughly 25-27 cycles, an ITLB walk roughly 40 dropping to 32 cycles.

Time breakdown across cases:

  • Reading the input data is a small share of the overhead, around 2.5%.
  • For graph applications like pagerank, it spikes to 22% during parallel compute.
  • bfs peaks at about 5%; its accesses are not that random.
  • But gemv, before exceeding memory at sizes 65000/100000, stays at 0.24% even when fully compute-bound.
  • Reason, the access pattern: graph applications access memory randomly and irregularly. Unlike matrix-vector multiplication (gemv), which accesses memory contiguously, they cannot exploit spatial locality; contiguous accesses reduce TLB misses through prefetching and cache-block granularity.
  • github - GUPS can achieve 90%
  • DAMOV - ligra - pagerank can achieve 90% on the 20M input case

gemm

  • Normal gemm can reach 100% in some situations.
  • When the matrices are too big to fit in cache, matrix2's accesses jump across rows, so it always misses in cache.
  • The O3 flag seems to bring no time reduction, because there is no SIMD assembly in the code.
  • memory access time = pgw + TLB access time + time to load data into cache

gemm

gemm's core loop is:

for(int i=0; i<N; i++){
   // ignore the overflow; it does not influence the running time.
   for(int j=0; j<N; j++){
      for(int l=0; l<N; l++){
            // gemm
            // ans[i * N + j] += matrix1[i * N + l] * matrix2[l * N + j];

            // sequential gemm variant: both matrices are read row-wise
            ans[i * N + j] += matrix1[i * N + l] * matrix2[j * N + l];
      }
   }
}

The real time breakdown is as follows. to do

  1. First, use perf to get the detailed timing.

bigJump

Hand-written code to test whether the TLB entries run out:

$ ./tlbstat -c '../../test/manual/bigJump.exe 1 10 100'
command is ../../test/manual/bigJump.exe 1 10 100
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%
2002404    773981      0.39 104304528  29137      2608079    684        130.25  0.03

$ perf stat -e mem_uops_retired.all_loads -e mem_uops_retired.all_stores -e mem_uops_retired.stlb_miss_loads -e mem_uops_retired.stlb_miss_stores ./bigJump.exe 1 10 500
Number read from command line: 1 10 (N,J should not big, [0,5] is best.)
result 0
 Performance counter stats for './bigJump.exe 1 10 500':

          10736645      mem_uops_retired.all_loads
         532100339      mem_uops_retired.all_stores
             57715      mem_uops_retired.stlb_miss_loads
         471629056      mem_uops_retired.stlb_miss_stores

In this case, the store TLB miss rate reaches 47/53 ≈ 88.6% (stlb_miss_stores / all_stores).

Big bucket hash table

using big hash table

other apps

Any algorithm that does random accesses into a large memory region will likely suffer from TLB misses. Examples are plenty: binary search in a big array, large hash tables, histogram-like algorithms, etc.
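
As a sketch of the binary-search case (array size and query count here are made up for illustration): each probe of a large sorted array lands on a distant page, so with 4 KB pages most probes can also miss in the TLB.

#include <vector>
#include <cstdlib>
#include <cstdio>

int main() {
    const size_t n = 1ul << 28;                // 2^28 ints = 1 GiB of data
    std::vector<int> a(n);
    for (size_t i = 0; i < n; ++i) a[i] = (int)i;
    long hits = 0;
    for (int q = 0; q < 1000000; ++q) {
        int key = rand() % (int)n;
        size_t lo = 0, hi = n;                 // search in [lo, hi)
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;   // each probe touches a far-away page
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        hits += (a[lo] == key);
    }
    printf("%ld\n", hits);                     // run under ./tlbstat -c to see DTLB%
}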

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Parts of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.

AI Compiler

Baidu

At an autumn-recruitment interview I met Gao Tiezhu, who asked some related questions (probably basics for people with an AI background):

  1. Isn't the nvcc compiler good enough? Why develop compilers like TVM?
  2. Answer: first, nvcc is a traditional compiler along the lines of gcc or MSVC (Microsoft Visual C++); it compiles CUDA C/C++ code.
  3. TVM, by contrast, is a tensor compiler: it takes Python-level code, decomposes the network design into operators, and then executes them using cuDNN or efficient machine code for the specific hardware.

NIO

Digital signal processor (DSP)

HLO can be understood, roughly, as the compiler IR.

TVM introduction

https://tvm.apache.org

  1. The problem TVM solves:
  2. In 2017: deploy deep learning (TF, PyTorch) everywhere (any hardware).
  3. Before TVM:
    1. manual tuning: loop tiling for locality
    2. operator fusion: high performance, but inefficient to deploy
  4. Bring compiler-optimization thinking into deep learning.
  5. Define the mapping from operator description to deployment space. The core is making the schedule space explicit, with compute/schedule separation.
  6. TVM's development since then:
  7. High-level graph representations: NNVM, Relay, Relax
  8. Low-level optimization: manual -> AutoTVM (searching for optimal schedule parameters with an AI-based cost model) -> Ansor (no more hand-written AutoTVM templates; code is generated from template rules)
  9. TVM spin-off work:
  10. HeteroCL: TVM + FPGA

  1. output fusion
  2. reducing global-memory copies

Replace the intermediate operator library with a compiler?

Tensor support is still awkward for now.

AI-driven auto-tuning of transformations

Auto-tuning. Drawbacks:

  1. templates must be written by hand
  2. hand-written templates shrink the solution space

Randomly apply optimization strategies (parallelization, loop unrolling, vectorization) at each loop level.

Ansor was introduced; its results are very good.

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Graph Algorithms: Pagerank

Pagerank

  1. Networks and social networks can be modeled as weighted graphs.
  2. PageRank is how to rank their importance.

How should a graph be designed, given that PageRank's execution jumps through the data array so randomly?
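
A minimal sketch of PageRank power iteration on a toy 4-node graph (the graph and constants are made up for illustration); the scattered updates to next[v] are exactly the random jumps through the data array that the question refers to.

#include <vector>
#include <algorithm>
#include <cstdio>

int main() {
    const int n = 4;
    std::vector<std::vector<int>> out = {{1,2},{2},{0},{0,2}};  // out-edges
    std::vector<double> pr(n, 1.0/n), next(n);
    const double d = 0.85;                      // damping factor
    for (int iter = 0; iter < 20; ++iter) {
        std::fill(next.begin(), next.end(), (1.0 - d)/n);
        for (int u = 0; u < n; ++u)
            for (int v : out[u])                // scatter: random writes into next[]
                next[v] += d * pr[u] / out[u].size();
        pr.swap(next);
    }
    for (int u = 0; u < n; ++u) printf("node %d: %.4f\n", u, pr[u]);
}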

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Parts of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.

https://zhuanlan.zhihu.com/p/137561088

Python: DataStructure

check if it is empty?

strings, lists, tuples

# Correct:
if not seq:
if seq:

# Wrong:
if len(seq):
if not len(seq):

debug

import pprint

try:
    ...  # sth
except Exception as e:
    pprint.pprint(data)  # dump the structure you care about
    raise
finally:
    un_set()

for

step

Parameter tuning needs to test values at an interval:

for i in range(1, 101, 3):
    print(i)

Modifying values while iterating

  • Use enumerate with a for loop to traverse the list and modify its elements.
  • enumerate returns an iterator of tuples, each holding the index and the value of the current element; in the loop we modify the list through the index i.
# for a 2-D list appDataDict
baseline = appDataDict[0][0] # CPU Total
for i, line in enumerate(appDataDict):
    for j, entry in enumerate(line):
        appDataDict[i][j] = round(entry/baseline, 7)

itertools

itertools — functions creating iterators for efficient looping

from itertools import permutations
for a, b, c in permutations((1, 2, 3)): print(a, b, c)

Strings

%c  character and its ASCII code
%s  string
%d  signed integer
%u  unsigned integer
%o  unsigned octal
%x  unsigned hexadecimal
%X  unsigned hexadecimal (uppercase)
%f  floating point, with configurable precision after the decimal point
%e  floating point in scientific notation
%E  same as %e, scientific notation
%g  shorthand for %f and %e
%G  shorthand for %F and %E
%p  address of the variable in hexadecimal
print("My name is %s and weight is %d kg!" % ('Zara', 21))

string <-> list

' '.join(pass_list) and pass_list.split(" ")

Alignment: "\n".join(["%-10s" % item for item in List_A])

Checking a string prefix

text = "Hello, world!"

if text.startswith("Hello"):
    print("The string starts with 'Hello'")
else:
    print("The string does not start with 'Hello'")

format

Since Python 2.6, {} with : replaces the old %.

>>>"{} {}".format("hello", "world")    # 不设置指定位置,按默认顺序
'hello world'

>>> "{1} {0} {1}".format("hello", "world")  # 设置指定位置
'world hello world'

# 字符串补齐100位,<表示左对齐
variable = "Hello"
padded_variable = "{:<100}".format(variable)

Number formatting

print("{:.2f}".format(3.1415926)) # keep two decimal places

{:>10d} right-aligned (default, width 10)
{:^10d} centered (width 10)

Decimal places

x = round(x,3) # keep three decimal places

Containers: list

https://www.runoob.com/python/python-lists.html

Initialization and access

list = ['physics', 'chemistry', 1997, 2000]
list = []          ## empty list
print(list[0])

Slicing

Format: [start_index:end_index:step]

The element at end_index is not included.

2-D arrays

list_three = [[0 for i in range(3)] for j in range(3)]

# numpy creates contiguous arrays, enabling automatic vectorization and thread parallelism
import numpy as np
# create a 3x4 array with all values 0
x3 = np.zeros((3, 4), dtype=int)
# create a 3x4 array, filling every element with 2
x5 = np.full((3, 4), 2, dtype=int)

Size

len(day)

Sorting

# sort key: the third field (index 2)
def takeSecond(elem):
    return elem[2]

LCData.sort(key=takeSecond)

# [1740, '黄业琦', 392, '第 196 场周赛'],
# [1565, '林坤贤', 458, '第 229 场周赛'],
# [1740, '黄业琦', 458, '第 229 场周赛'],
# [1509, '林坤贤', 460, '第 230 场周赛'],
# [1740, '黄业琦', 460, '第 230 场周赛'],
# [1779, '黄业琦', 558, '第 279 场周赛'],

Appending corresponding elements into an accumulator

tmp_list = [[],[],[],[]]
# note: the comprehension is evaluated for its side effect; its value needs no assignment
import copy
[x.append(copy.deepcopy(entry)) for x,entry in zip(tmp_list, to_add)]

Element-wise addition of two lists

For equal lengths:

list1 = [1, 2, 3, 4, 5]
list2 = [6, 7, 8, 9, 10]

result = [x + y for x, y in zip(list1, list2)]
print(result)

If the two lists have different lengths, use zip_longest(), which handles unequal lists and fills the missing elements with a specified fill value.

from itertools import zip_longest

list1 = [1, 2, 3, 4, 5]
list2 = [6, 7, 8]

result = [x + y for x, y in zip_longest(list1, list2, fillvalue=0)]
print(result)

For 2-D lists:

list1 = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

list2 = [[10, 11, 12],
         [13, 14, 15]]

rows = max(len(list1), len(list2))
cols = max(len(row) for row in list1 + list2)

result = [[0] * cols for _ in range(rows)]

for i in range(rows):
    for j in range(cols):
        if i < len(list1) and j < len(list1[i]):
            result[i][j] += list1[i][j]
        if i < len(list2) and j < len(list2[i]):
            result[i][j] += list2[i][j]

print(result)

# divide every element of a 2-D list by a number A
result = [[element / A for element in row] for row in list1]

Assignment, shallow copy, and deep copy

Python append() with deep and shallow copies

Assignment in Python only binds a reference, an alias.

list.append('Google')   ## add an element with append()
alist.append( num ) # shallow copy: modifying num afterwards affects the value inside alist

import copy
alist.append( copy.deepcopy( num ) ) # deep copy

# delete
del list[2]

Elements iterated by a for loop are references too

original_list = [1, 2, 3]

for item in original_list:
    item *= 2 # each element is immutable

print(original_list) 

original_list = [[1,2,3], [2], [3]]

for item in original_list:
    item.append("xxx") # each element is mutable

print(original_list) 

# [1, 2, 3]
# [[1, 2, 3, 'xxx'], [2, 'xxx'], [3, 'xxx']]

Function arguments are passed by reference, but slicing gives pointer-like behavior

Parameter passing: a function's formal parameters act like local variables inside the body. Since everything in Python is an object, arguments are passed as object references, but usage splits into two cases: 1. passing a reference to an immutable object (numbers, strings, tuples, functions) behaves like pass-by-value in other languages; 2. passing a reference to a mutable object (dicts, lists, sets, custom objects) behaves like pass-by-reference.

def fun0(a):
    a = [0,0] # rebinding a changes what it points to, effectively creating a new [0,0]

def fun(a):
    a[0] = [1,2]

def fun2(a):
    a[:] = [10,20]

b = [3,4]
fun0(b)
print(b)
fun(b)
print(b)
fun2(b)
print(b)

# [3, 4]
# [[1, 2], 4]
# [10, 20]

Return values

Mutable return values are references too.

def fun1(l):
    l.append("0")
    return l 

def fun2(l):
    return l

if __name__=="__main__":
    l = [1,2,3,4,5]

    rel2 = fun2(l)
    print(rel2)   
    rel1 = fun1(l)
    print(rel1)   
    print(rel2)   
    l.append("xxx")
    print(rel1)   
    print(rel2)   
    del rel1[2]
    print(rel1)   
    print(rel2)  

# [1, 2, 3, 4, 5]
# [1, 2, 3, 4, 5, '0']
# [1, 2, 3, 4, 5, '0']
# [1, 2, 3, 4, 5, '0', 'xxx']
# [1, 2, 3, 4, 5, '0', 'xxx']
# [1, 2, 4, 5, '0', 'xxx']
# [1, 2, 4, 5, '0', 'xxx']

Containers: tuple

  • Tuples are similar to lists, except that a tuple cannot be modified; tuples can be concatenated, and they use parentheses.
  • A tuple containing a single element needs a trailing comma, otherwise the parentheses are treated as an operator.
# create tuples
tup = (1, 2, 3, 4, 5)
tup1 = (23, 78)
tup2 = ('ab', 'cd')
tup3 = tup1 + tup2

Containers: dict

empty dict

a= {}
a=dict()

Keys can be tuples

like C++'s pair<int,int>:

bblHashDict[(tmpHigherHash,tmpLowerHash)]=tmpBBL

But this breaks json.dump: json.dump() cannot serialize a Python dict whose keys are tuples, which makes json.dump() hang or get stuck when writing such data.

Initialization and access

>>> tinydict = {'a': 1, 'b': 2, 'b': '3'}
>>> tinydict['b']
'3'
a_dict = {'color': 'blue'}
for key in a_dict:
 print(key)
# color
for key in a_dict:
    print(key, '->', a_dict[key])
# color -> blue
for item in a_dict.items():
    print(item)
# ('color', 'blue')
for key, value in a_dict.items():
 print(key, '->', value)
# color -> blue

Checking whether a key exists

Two common methods:

Method 1: the in operator. It returns a boolean: True if the key exists, False otherwise.

my_dict = {"key1": "value1", "key2": "value2", "key3": "value3"}

# 判断是否存在指定的键
if "key2" in my_dict:
    print("Key 'key2' exists in the dictionary.")
else:
    print("Key 'key2' does not exist in the dictionary.")

Method 2: dict.get(). It returns the value when the key exists and None otherwise. Pick whichever fits.

my_dict = {"key1": "value1", "key2": "value2", "key3": "value3"}

# 判断是否存在指定的键
if my_dict.get("key2") is not None:
    print("Key 'key2' exists in the dictionary.")
else:
    print("Key 'key2' does not exist in the dictionary.")

Both methods can be used to check whether a key exists in a dict.

Size

len(day)

Update and add

tinydict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

tinydict['Age'] = 8 # update
tinydict['School'] = "RUNOOB" # add

Merging

dict1 = {'a': 10, 'b': 8} 
dict2 = {'d': 6, 'c': 4} 

# dict2 keeps the merged result
dict2.update(dict1)
print(dict2)
{'d': 6, 'c': 4, 'a': 10, 'b': 8}

Deleting

del tinydict['Name']  # delete the entry whose key is 'Name'
tinydict.clear()      # clear all entries
del tinydict          # delete the dict itself
from pprint import pprint
pprint

Containers: set

An unordered sequence without duplicates.

Initialization

a=  set() # empty set

thisset = set(("Google", "Runoob", "Taobao"))
>>> basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
>>> print(basket)                      # demonstrates deduplication

list2set

setL=set(listV)

set2list

my_set = {'Geeks', 'for', 'geeks'}

s = list(my_set)
print(s)
# ['Geeks', 'for', 'geeks']

Adding

thisset.add("Facebook")

Union

x = {"apple", "banana", "cherry"}
y = {"google", "runoob", "apple"}

z = x.union(y) 

print(z)
# {'cherry', 'runoob', 'google', 'banana', 'apple'}

Removing and clearing

s.remove( x )
a.clear()

Modifying the original value

Modifying passed-in arguments

In Python, rebinding a parameter inside a function does not affect variables outside the function.

But several approaches achieve the effect of modifying an argument:

  1. Return the modified value and re-assign it outside the function
def func(x):
    x = x + 1 
    return x

a = 10
a = func(a) 
print(a) # 11
  1. Pass a mutable object and modify its contents
def func(lst):
    lst.append(1)

lst = [1,2,3]
func(lst)
print(lst) # [1,2,3,1]

Here lst is a list; func modifies it internally, and since lst is mutable, the outer lst is changed too.

  1. Use a global variable
count = 0
def func():
    global count
    count += 1

func()
print(count) # 1

Declaring count global with the global keyword lets the function modify the global count.

So, to modify a passed-in value, the main options are:

  1. return the modified value and re-assign it
  2. pass a mutable object and modify its contents
  3. use a global variable

These tricks simulate modifying the parameter.

Modifying the object a for loop iterates over

In Python, a for loop traverses an iterator, assigning its next element to the loop variable on each pass.

To modify the iterated elements inside the loop, you can:

  1. Convert the iterator to a list, then modify the list's elements:
all_for_one = ['a', 'b', 'c']

for app_info in list(all_for_one):
    if app_info == 'b':
        all_for_one[1] = 'x' 

print(all_for_one) # ['a', 'x', 'c']

Here list() turns the iterator into a list, and the list's element is then modified.

  1. Use the loop index instead of the element directly:
all_for_one = ['a', 'b', 'c']

for i in range(len(all_for_one)):
    if i == 1:
        all_for_one[i] = 'x'

print(all_for_one) # ['a', 'x', 'c']

Access and modify the element through the index i.

  1. Use enumerate() to get the index inside the loop:
all_for_one = ['a', 'b', 'c']

for i, app_info in enumerate(all_for_one):
    if i == 1:
        all_for_one[i] = 'x'

print(all_for_one) # ['a', 'x', 'c']

enumerate() iterates over indexes and elements simultaneously.

The main idea: do not modify the loop variable directly; modify the underlying elements through an index or a temporary list.

Modifying a set inside a for loop

A set supports neither indexing nor slicing, so it cannot be modified via indexes or enumerate.

For a set, the following approaches work inside a loop:

  1. Convert the set to a list, modify it, then convert back (a list built from a set has arbitrary order, so replace by value rather than by position):
s = {'a', 'b', 'c'}

lst = list(s)
for i, item in enumerate(lst):
    if item == 'b':
        lst[i] = 'x'
s = set(lst)

print(s) # {'a', 'x', 'c'}
  1. Build a new set, adding the (possibly modified) elements in the loop
s = {'a', 'b', 'c'}
new_s = set()

for item in s:
    if item == 'b':
        new_s.add('x')
    else:
        new_s.add(item)

s = new_s

print(s) # {'a', 'x', 'c'}
  1. Use the set's discard() and add() methods, iterating over a copy (mutating a set while iterating over it raises a RuntimeError)
s = {'a', 'b', 'c'}

for item in list(s):   # iterate over a copy so the set itself can be mutated
    if item == 'b':
        s.discard(item)
        s.add('x')

print(s) # {'a', 'x', 'c'}

The key idea behind all of these:

  1. move the data into a form that supports the modification (a list, or a fresh set)
  2. perform the modification there
  3. convert back to a set

This achieves the effect of modifying a set inside a loop.

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

https://blog.csdn.net/weixin_63719049/article/details/125680242

Benchmark

Differentiated-Bounded Applications

In the context of data movement, the DAMOV framework identifies six distinct types. While the authors primarily use temporal locality for app classification, Figure 3 (locality-based clustering of 44 representative functions) offers a broader view and highlights specific cases that warrant further examination.

Take, for instance, the four cases situated in the lower right corner:

function short name | benchmark | class
CHAHsti | Chai-Hito | 1b: DRAM Latency
CHAOpad | Chai-Padding | 1c: L1/L2 Cache Capacity
PHELinReg | Phoenix-Linear Regression | 1b
PHEStrMat | Phoenix-String Matching | 1b

High TLB percentage -> high TLB miss rate -> memory accesses span a large range -> spatial locality is low.

benchmark

Chai Benchmark

The Chai benchmark code can be sourced either from DAMOV or directly from its GitHub repository. Chai stands for "Collaborative Heterogeneous Applications for Integrated Architectures."

Installing Chai is a straightforward process. You can achieve it by executing the command python3 compile.py.

One notable feature of the Chai benchmark is its adaptability in terms of input size. Modifying the input size of the following applications is a simple and flexible task:

# cd application directory
./bfs_00 -t 4 -f input/xxx.input
./hsti -n 1024000             # Image Histogram - Input Partitioning (HSTI)
./hsto -n 1024000             # Image Histogram - Output Partitioning (HSTO)
./ooppad -m 1000 -n 1000   # Padding (PAD)
./ooptrns -m 100 -n 10000  # Transpose / In-place Transposition (TRNS)
./sc -n 1024000            # Stream Compaction (SC)
./sri -n 1024000           # select application

# vector pack , 2048 = 1024 * 2, 1024 = 2^n
./vpack -m 2048 -n 1024 -i 2 
# vector unpack , 2048 = 1024 * 2, 1024 = 2^n
./vupack -m 2048 -n 1024 -i 2 

Parboil (how to run)

The Parboil suite was developed from a collection of benchmarks used at the University of Illinois to measure and compare the performance of computation-intensive algorithms executing on either a CPU or a GPU. Each implementation of a GPU algorithm is either in CUDA or OpenCL, and requires a system capable of executing applications using those APIs.

# compile , vim compile.py 
# python2.7 ./parboil compile bfs omp_base 
python3 compile.py

# no idea how to run, failed command: (skip)
python2.7 ./parboil run bfs cuda default 
# exe in benchmarks/*, but need some nowhere input.

Phoenix

Phoenix is a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.

# with code from DAMOV
import os
import sys

os.chdir("phoenix-2.0/tests") # app is changed from sample_apps/*
os.system("make")
os.chdir("../../")

# generates executables at phoenix-2.0/tests/{app}/{app}

# running for example
./phoenix-2.0/tests/linear_regression/linear_regression ./phoenix-2.0/datasets/linear_regression_datafiles/key_file_500MB.txt 

PolyBench

PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics simulation, dynamic programming, statistics, etc.).

PolyBench features include:

  • A single file, tunable at compile-time, used for the kernel instrumentation. It performs extra operations such as cache flushing before the kernel execution, and can set real-time scheduling to prevent OS interference.
# compile using DAMOV code
python compile.py
# exe in OpenMP/compiled, and all running in no parameter

PrIM

Real applications for the first real PIM platform.

Rodinia (developed)

Rodinia, a benchmark suite for heterogeneous parallel computing targeting multi-core CPU and GPU platforms, was first introduced at IISWC 2009.

zsim hooked code from github

  1. Install the CUDA/OCL drivers, SDK and toolkit on your machine.
  2. Modify the common/make.config file to change the settings of the Rodinia home directory and the CUDA/OCL library paths.
  3. It seems to need Intel OpenCL, but Intel has folded everything into oneAPI.
  4. To compile all the programs of the Rodinia benchmark suite, simply use the universal makefile, or go to each benchmark directory and make the individual programs.
  5. The full code with its data can be downloaded from the website.
mkdir -p ./bin/linux/omp
make OMP

Running the zsim hooked apps

cd bin/linux/omp
./pathfinder 100000 100 7
./myocyte.out 100 1 0 4
./lavaMD -cores 4 -boxes1d 10 # -boxes1d  (number of boxes in one dimension, the total number of boxes will be that^3)
./omp/lud_omp -s 8000
./srad 2048 2048 0 127 0 127 2 0.5 2
./backprop 4 65536 # OMP_NUM_THREADS=4


# need to download data or file
./hotspot 1024 1024 2 4 ../../data/hotspot/temp_1024 ../../data/hotspot/power_1024 output.out
./OpenMP/leukocyte 5 4 ../../data/leukocyte/testfile.avi
# streamcluster
./sc_omp k1 k2 d n chunksize clustersize infile outfile nproc
./sc_omp 10 20 256 65536 65536 1000 none output.txt 4
./bfs 4 ../../data/bfs/graph1MW_6.txt 
./kmeans_serial/kmeans -i ../../data/kmeans/kdd_cup
./kmeans_openmp/kmeans -n 4 -i ../../data/kmeans/kdd_cup

dynamic data structures

We choose this specific suite because dynamic data structures are the core of many server workloads (e.g., Memcached's hash table, RocksDB's skip list), and are a great match for near-memory processing.

ASCYLIB + OPTIK

Graph Apps

Graph500 Benchmark Exploration

Official Version

The official version of the Graph500 benchmark can be downloaded from its GitHub repository. Notable features of this version include:

  • Primarily MPI Implementation: The benchmark is built as an MPI (Message Passing Interface) version, without an accompanying OpenMP version. This can be disappointing for those utilizing tools like zsim.
  • Flexible n Value: By default, the value n is set to powers of 2, but it's possible to change this behavior through configuration adjustments.
  • Customization Options: Environment variables can be altered to modify the execution process. For instance, the BFS (Breadth-First Search) portion can be skipped or the destination path for saved results can be changed.
Unofficial Repository

An alternative unofficial repository also exists. However, it requires OpenCL for compilation. The process can be broken down as follows:

  • OpenCL Dependency: The unofficial repository mandates the presence of OpenCL. To set up OpenCL, you can refer to this tutorial.
sudo apt-get install clinfo
sudo apt-get install opencl-headers
sudo apt install opencl-dev

After completing the OpenCL setup and the compilation process using cmake & make, we obtain the executable file named benchmark. By default, running this executable without any arguments appears to utilize only a single core, despite attempts to set the environment variable with export OMP_NUM_THREADS=32. This default behavior led to a runtime of approximately 5 minutes to generate a report related to edges-node-verify status (or similar). However, for someone without an in-depth technical background, this report can be confusing, especially when trying to locate the BFS (Breadth-First Search) and SSSP (Single-Source Shortest Path) components.

What is even more disheartening is that the TLB (Translation Lookaside Buffer) result is disappointingly low, similar to the performance of the GUPS (Giga Updates Per Second) OpenMP version.

In order to gain a clearer understanding and potentially address these issues, further investigation and potentially adjustments to the program configuration may be necessary.

$ ./tlbstat -c '/staff/shaojiemike/github/graph500_openmp/build/benchmark' 
command is /staff/shaojiemike/github/graph500_openmp/build/benchmark                       
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%         
20819312   10801013    0.52 7736938    1552557    369122     51902       1.77  0.25
20336549   10689727    0.53 7916323    1544469    354426     48123       1.74  0.24

GAP

zsim(?)/LLVM-pass instrumentation code from the PIMProf paper's GitHub.

But the graph dataset should be generated by yourself following the github:

# 2^17 nodes 
./converter -g17 -k16 -b kron-17.sg
./converter -g17 -k16 -wb kron-17.wsg

Kernels Included

  • Breadth-First Search (BFS) - direction optimizing
  • Single-Source Shortest Paths (SSSP) - delta stepping
  • PageRank (PR) - iterative method in pull direction
  • Connected Components (CC) - Afforest & Shiloach-Vishkin
  • Betweenness Centrality (BC) - Brandes
  • Triangle Counting (TC) - Order invariant with possible relabelling

ligra

Code also from DAMOV

# compile
python3 compile.py

# 3 kinds of exe for each app; the relevant code is in the /ligra directory
# emd: edgeMapDense(), probably processing dense data
# ems: edgeMapSparse(), analysing edge data
# compute: of course, the core compute part

The source code is difficult to read; skipped.

The graph format: it seems lines_num = offsets + edges + 3

AdjacencyGraph
16777216 # number of offsets, i.e. vertices
100000000 #   uintE* edges = newA(uintE,m);
0
470
794 # must be monotonically increasing, in range [0, edges); marks where each vertex's edge list begins
……
14680024 # random values in range [0, vertices-1]; each entry is a neighbor of the corresponding vertex (forming pairs)
16644052
16284631
15850460

$ wc -l  /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M
116777219 /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M

The PageRank algorithm is covered in another post of mine.

HPC

maybe more computation-intensive than graph applications

parsec

From DAMOV

The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a collection of parallel programs which can be used for performance studies of multiprocessor machines.

# compile
python3 compile_parsec.py

# exe in pkgs/binaries
./pkgs/binaries/blackscholes 4 ./pkgs/inputs/blackscholes/in_64K.txt black.out
./pkgs/binaries/fluidanimate 4 10 ./pkgs/inputs/fluidanimate/in_300K.fluid

STREAM apps

DAMOV code for memory-bandwidth testing, referencing J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE TCCA Newsletter, 1995.

# compile
python3 compile.py

# the default run reports a Failed Validation error (ignore it)
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5072.0     0.205180     0.157728     0.472323
Add:             6946.6     0.276261     0.172746     0.490767

Hardware Effects

Two very interesting repositories of CPU and GPU hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU/GPU and OS architecture.

Test also using DAMOV code

# compile
python3 compile.py

# Every example directory has a README that explains the individual effects.
./build/bandwidth-saturation/bandwidth-saturation 0 1
./build/false-sharing/false-sharing 3 8

Most cases use perf to learn more about your system's capabilities.

HPCC

From DAMOV and The HPC Challenge (HPCC) Benchmark Suite,” in SC, 2006

  • RandomAccess, OpenMP version (this is also GUPS)
# install
python compile.py

# must export
export OMP_NUM_THREADS=32
./OPENMP/ra_omp 26 # 26 needs almost 20 GB of memory, 27 needs 30 GB, and so on.

But the OpenMP version shows no big pgw overhead.

GUPS

Measures the system's random-access capability.

From GitHub or the HPCC official site; this is the RandomAccess MPI version.

# download using github
make -f Makefile.linux gups_vanilla
# running
gups_vanilla N M chunk
  • N: the length of the global table is 2^N
  • Thus N = 30 would run with a billion-element table.
  • M: number of update sets per proc; larger means a longer (more random-access-heavy) run
  • chunk: number of updates in one set on each proc
  • In the official HPCC benchmark this is specified to be no larger than 1024, but you can run the code with any value you like. GUPS performance typically decreases for smaller chunk sizes.

After testing, the best combination is mpirun -n 32 ./gups_vanilla 32 100000 2048, where n can be at most 32, otherwise memory is exhausted.

n | 32 | 31 | 30 | 29
DTLB% | 66.81 | 45.22 | 19.50 | 13.20
ITLB% | 0.06 | 0.07 | 0.06 | 0.09

Alternatively, run ./gups_vanilla 30 100000 k standalone (n at most 30), e.g. ./tlbstat -c '/usr/bin/mpirun -np 1 /staff/shaojiemike/github/gups/gups_vanilla 30 100000 8192'

n\k 1024 2048 4096 8192
30 44% 90% 80%
27 88%
24 83% 83%
20 58% 62%
15 0.27% 0.3%
Hand-crafted bigJump
#include <bits/stdc++.h>
#include "../zsim_hooks.h"
using namespace std;

#define MOD int(1e9)

// 2000 TLB entries is a typical CPU config;
// with 4KB pages, make each data access trigger a TLB miss by jumping over 1000+ ints,
// and repeat after 2000 entries, so at least 2000 * 4KB = 8MB of space

// #define BASIC_8MB 2000000 * 2
#define BASIC_8MB (1 << 22)

// ~1 second program: stream add runs at ~6GB/s, int is 4B, repeated 10^9 times
// #define all_loops 1000000
#define all_loops (1 << 20)

int main(int argc, char* argv[]) {
   if (argc != 4) {
      std::cerr << "Usage: " << argv[0] << " <space scale> <jump distance scale> <loop times>" << std::endl;
      return 1;
   }

   // Convert the second command-line argument (argv[1]) to an integer
   int N = std::atoi(argv[1]);
   int J = std::atoi(argv[2]);
   int M = std::atoi(argv[3]);

   std::cout << "Number read from command line: " << N << " " << J << " (N,J should not big, [0,5] is best.)" <<std::endl;

   const int size = BASIC_8MB << N;
   const int size_mask = size - 1;
   int * jump_space = (int *)malloc(size * sizeof(int));

   zsim_begin();
   int result = 0;
   int mem_access_count = 0;
   int mem_access_index = 0;
   // int mask = (1<<10<<J)-1;
   // int ran = 0x12345678;
   int mask = (1<<J)-1;
   int ran = (1<<30)-1;
   // even without random addresses, TLB occupancy is also high
   // ran = (ran << 1) ^ ((int) ran < 0 ? 0x87654321 : 0);
   while(mem_access_count++ < all_loops*M){
      // read & write 
      jump_space[mem_access_index] = ran;
      mem_access_index = (mem_access_index + (1024 + ran & mask) ) & (size_mask);
      // cout << "mem_access_index = " << mem_access_index << endl;
   }
   zsim_end();

   // print the result so the loop is not optimized away
   printf("result %d\n",result);
}

HPCG

From DAMOV and High Performance Conjugate Gradient Benchmark (HPCG)

HPCG is a software package that performs a fixed number of multigrid preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double precision (64 bit) floating point values, i.e. conjugate-gradient solves over floating-point data.

Follow the instructions in INSTALL and analyze compile.py:

  1. choose a makefile like setup/Make.GCC_OMP
  2. configure values like MPdir, but we can leave them alone because we use GCC_OMP, which sets -DHPCG_NO_MPI
  3. add -DHPGSym to CXXFLAGS or HPCG_OPTS
  4. cd build and ../configure GCC_OMP
  5. run compile.py to compile the executables
  6. get 4 ComputeProlongation variants in build/bin
  7. test the exe with xhpcg 32 24 16 for the three dimensions
  8. or xhpcg --nx=16 --rt=120 for NX=NY=NZ=16 and a 120-second run
  9. change int refMaxIters = 50; to int refMaxIters = 1; to limit the number of CG iterations
  10. note: --nx=16 must be a multiple of 8
  11. if there are no geometry arguments on the command line, hpcg will ReadHpcgDat and use the defaults --nx=104 --rt=60
value\nx | 96 | 240 | 360 | 480
mem | 17GB | 40GB | 72.8GB | 
time(icarus0) | 8s | 84s | 4min40s | core dumped (7 mins)

Core dump from xhpcg_HPGPro: ../src/GenerateProblem_ref.cpp:204: void GenerateProblem_ref(SparseMatrix&, Vector*, Vector*, Vector*): Assertion 'totalNumberOfNonzeros>0' failed.

MPdir        = 
MPinc        = 
MPlib        = 

HPCG_OPTS     = -DHPCG_NO_MPI -DHPGSym
compile error
../src/ComputeResidual.cpp:59:13: error: 'n' not specified in enclosing 'parallel'

Just add n to the shared clause.

Database

Hash Joins

This package provides implementations of the main-memory hash join algorithms described and studied in C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu, “Main-Memory Hash Joins on Modern Processor Architectures,” TKDE, 2015.

Test also in DAMOV

# install
python compile.py

# runing
./src/mchashjoins_* -r 12800000 -s 12000000 -x 12345 -y 54321

These cases show the strain on TLB resources.

AI

Darknet for CV using multiple cores

From DAMOV; the official documentation is detailed.

# shaojiemike @ snode6 in ~/github/DAMOV/workloads/Darknet on git:main x [14:13:02]
$ ./darknet detect cfg/yolo.cfg ./weights/yolo.weights data/dog.jpg
  • models are in the weights files; models of different sizes (28MB-528MB) can be downloaded from the website
  • picture data is in the data files (70KB-100MB)
  • the exe must be run from its own directory. SHIT.

genomics / DNA

BWA

From DAMOV; BWA maps DNA sequences against a large reference genome, such as the human genome.

  • Download DNA data like ref.fa following the data-download steps.
  • Running sratoolkit.3.0.6-ubuntu64/bin/fastq-dump --split-files SRR7733443 got stuck, because it generates several 96GB SRR7733443_X.fastq files, X from 1 to n.
  • sratool cannot limit the file size, but head -c 100MB SRR7733443_1.fastq > ref_100MB.fastq produces a file of the wanted size.
  • For further commands, read the GitHub page.
  • ./bwa index -p abc ref_100MB.fastq generates several abc.* files in about 50 seconds.
  • Then run ./bwa mem -t 4 abc ref_100MB.fastq or ./bwa aln -t 4 abc ref_100MB.fastq

GASE

GASE - Generic Aligner for Seed-and-Extend


GASE is a DNA read aligner developed for measuring the mapping accuracy and execution time of different combinations of seeding and extension techniques. GASE is implemented by extending BWA (version 0.7.13) developed by Heng Li.

Code also from DAMOV. But there seem to be some syntax errors in the program; skip this app.

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

Perf

Introduction to perf

Perf is a profiler for Linux, built on the kernel's perf_event subsystem.

Perf can analyze not only PMU (hardware) events but also various software events, such as context switches, page faults, network and file I/O.

Common perf commands

Interval option

-I 100 prints data every 100 ms.

Properties of performance events

Hardware performance events are provided by the processor's PMU. Because modern processors run at very high frequencies with deeply pipelined execution, by the time a performance event fires and the processor responds to the PMI interrupt, the pipeline may already have processed hundreds of instructions.

The instruction address sampled at the PMI is then no longer the address of the instruction that triggered the event, and the skew can be severe.

To solve this, Intel processors implement precise event sampling through the PEBS mechanism. With PEBS, the hardware saves the processor state directly to memory when the counter overflows (instead of saving registers only when the interrupt is serviced), so perf can capture the address of the very instruction that triggered the event, improving sampling precision.

By default, perf does not use PEBS. To get precise sampling, append the suffix ":p" or ":pp" to the event name. Perf defines four precision levels, listed below.

Precision levels

  0 no precision guarantee
  1 the skid between the sampled instruction and the triggering instruction is a constant (:p)
  2 the skid should be 0 wherever possible (:pp)
  3 the skid must be 0 (:ppp)
  Current x86 processors, Intel and AMD alike, implement only the first three precision levels.

Besides the precision level, events have several other modifiers, all specified as "event:X".

Modifiers

  u count only events triggered by user-space code
  k count only events triggered by the kernel; some register-counter-derived values cannot distinguish kernel mode from user mode
  uk count the sum of both
  h count only events triggered by the hypervisor
  G in a KVM virtual machine, count only events triggered by the guest
  H count only events triggered by the host
  p precision level

perf list

Lists the events available for collection (to be used with -e).

Commonly used metric groups: perf list metricgroup

perf stat

Prints basic performance statistics for a program run; specific events can also be requested explicitly.

$ perf stat ./bin/pivot "../run/uniformvector-2dim-5h.txt"
dim = 2, n = 500, k = 2
Using time : 236.736000 ms
max : 143 351 58880.823709
min : 83 226 21884.924801

 Performance counter stats for './bin/pivot ../run/uniformvector-2dim-5h.txt':

          7,445.60 msec task-clock                #   30.814 CPUs utilized
               188      context-switches          #    0.025 K/sec
                33      cpu-migrations            #    0.004 K/sec
               678      page-faults               #    0.091 K/sec
    14,181,698,360      cycles                    #    1.905 GHz                      (75.63%)
    46,455,227,542      instructions              #    3.28  insn per cycle           (74.37%)
     2,008,507,493      branches                  #  269.758 M/sec                    (74.18%)
        13,872,537      branch-misses             #    0.69% of all branches          (75.82%)

       0.241629599 seconds time elapsed

       7.448593000 seconds user
       0.000000000 seconds sys

Using Metric Groups: perf stat -M Summary,TLB --metric-only ./exe

--metric-only prints only the computed metrics, without the raw counts.

perf record

Run a command and record its profile into perf.data

You must use perf record -g xxx to generate the call graph for the perf report topdown view.

Without -e, the default is -e cycles:u. To count all cycles, use -e cycles:uk.

Note that the cycle and instruction counts cover all the cores used; the -a option selects all cores.

cycles and task-clock are roughly interchangeable.

When measuring, add the p suffixes (as in :uppp) to raise precision and stop samples from skidding onto other instructions.

perf record -g -e branch-misses:uppp ./bin/pivot "../run/uniformvector-2dim-5h.txt"
perf record -g -e task-clock:uppp ./bin/pivot "../run/uniformvector-2dim-5h.txt"

The branch misses can be seen to land in this for loop.

echo 0 > /proc/sys/kernel/kptr_restrict is needed.

perf report

Read perf.data (created by perf record) and display the profile

An interactive terminal UI. It not only shows the correspondence between assembly and source code, but also supports automatic jumps.

perf annotate

  1. the program must be compiled with the -g flag
  2. perf annotate can then show the percentage of execution cycles for each source line

percentage at the end of each line

When you select many events with the -e option, there is only limited PMU hardware to go around, so perf uses event multiplexing: each event is measured part of the time and scaled up to estimate the full count. The % shown is parttime / fulltime.

pmc-tools

Practice

Because perf estimates through multiplexing, the arithmetic relationships are more convincing when the counts exceed 10^9.

Integer operations per second (IntOps)

There are no integer-related events in perf list, only floating-point ones:
  1. perf makes it hard to measure integer ops
  2. checking raw hardware event codes (to be used with -rNNN) is also a difficult route

Roofline: PMU & Perf

Maybe we should research integer PMU events further, or just use VTune.


TLB miss rate

A reference inspired me on snode6:

  • Total number of memory references ( X ) = mem_uops_retired.all_loads + mem_uops_retired.all_stores
  • Total number of memory references that missed in TLB ( Y ) = mem_uops_retired.stlb_miss_loads + mem_uops_retired.stlb_miss_stores
  • TLB miss rate = Y/X
perf stat \
    -e mem_uops_retired.all_loads \
    -e mem_uops_retired.all_stores \
    -e mem_uops_retired.stlb_miss_loads \
    -e mem_uops_retired.stlb_miss_stores xxx

According to the following experiment, mem_uops_retired.stlb_miss_stores roughly equals dtlb_store_misses.miss_causes_a_walk.

$ perf stat -e mem_uops_retired.all_loads -e mem_uops_retired.all_stores -e mem_uops_retired.stlb_miss_loads -e mem_uops_retired.stlb_miss_stores -e dtlb_load_misses.miss_causes_a_walk\
    -e dtlb_store_misses.miss_causes_a_walk \
    -e itlb_misses.miss_causes_a_walk  \
    -e dtlb_load_misses.walk_duration \
    -e dtlb_store_misses.walk_duration \
    -e itlb_misses.walk_duration \
    -e instructions:uk \
    -e cycles:uk ./bigJump.exe 1 10 500
Number read from command line: 1 10 (N,J should not big, [0,5] is best.)
result 0
 Performance counter stats for './bigJump.exe 1 10 500':

           3253636      mem_uops_retired.all_loads                                     (41.53%)
         529570049      mem_uops_retired.all_stores                                     (41.62%)
             59111      mem_uops_retired.stlb_miss_loads                                     (41.71%)
         471688964      mem_uops_retired.stlb_miss_stores                                     (33.50%)
            101474      dtlb_load_misses.miss_causes_a_walk                                     (33.56%)
         477591045      dtlb_store_misses.miss_causes_a_walk                                     (33.47%)
             61667      itlb_misses.miss_causes_a_walk                                     (33.37%)
           5591102      dtlb_load_misses.walk_duration                                     (33.28%)
       16489869334      dtlb_store_misses.walk_duration                                     (33.22%)
           2202174      itlb_misses.walk_duration                                     (33.22%)
        3712587926      instructions:uk           #    0.34  insn per cycle           (41.52%)
       10791067051      cycles:uk                                                     (41.52%)

Measuring the page-walk share of time with perf

The script is as follows:

# Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
USER="/staff/qcjiang/codes/PIA_workspace"
APP="$USER/workloads/pagerank/cpp/pagerank $USER/workloads/pagerank/test/barabasi-100000-pr-p.txt"
perf stat \
    -e dtlb_load_misses.miss_causes_a_walk\
    -e dtlb_store_misses.miss_causes_a_walk \
    -e itlb_misses.miss_causes_a_walk  \
    -e dtlb_load_misses.walk_duration \
    -e dtlb_store_misses.walk_duration \
    -e itlb_misses.walk_duration \
    -e instructions:uk \
    -e cycles:uk \
    $APP

# dtlb_load_misses.walk_duration
# [Cycles when PMH(Page Miss Handling) is busy with page walks Spec update: BDM69]

# Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GH
perf stat\
    -e dtlb_load_misses.walk_completed\
    -e dtlb_load_misses.walk_active\
    -e dtlb_store_misses.walk_completed\
    -e dtlb_store_misses.walk_active\
    -e itlb_misses.walk_completed\
    -e itlb_misses.walk_active\


# perf differs on the hades0 AMD cpu
# AMD does not record page-walk duration
perf stat\
 -e ls_l1_d_tlb_miss.all\
 -e l1_dtlb_misses\
 -e l2_dtlb_misses\
 -e l2_itlb_misses\
 -e bp_l1_tlb_miss_l2_tlb_miss\
 -e bp_l1_tlb_miss_l2_tlb_miss.if2m\
 -e bp_l1_tlb_miss_l2_tlb_miss.if4k\
 -e bp_l1_tlb_miss_l2_tlb_miss.if1g\
 -e ls_tablewalker.dc_type0\
 -e ls_tablewalker.dc_type1\
 -e ls_tablewalker.dside\
 -e ls_tablewalker.ic_type0\
 -e ls_tablewalker.ic_type1\
 -e ls_tablewalker.iside -e instructions:uk -e cycles:uk /staff/shaojiemike/github/DAMOV/workloads/gemm/gemm 2000

Counting TLB-miss events helps determine whether a program has an address-translation bottleneck. Many STLB misses usually mean the stores cannot exploit locality and touch too many random addresses; optimizing the data-access pattern reduces TLB misses.

-I 100 prints data every 100 ms.

$ ./investigation/pagewalk/tlbstat -c '/staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/sssp.inj -f /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/benchmark/kron-20.wsg -n1'
command is /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/sssp.inj -f /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/benchmark/kron-20.wsg -n1
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%
186001     83244       0.45 753584     63         27841      1          14.97  0.00
229121     89734       0.39 1009100    213        31103      10         13.58  0.00
6233833    3629907     0.58 1227653    316159     45877      8347        0.74  0.13
16579329   8756681     0.53 10860225   524414     264655     15348       1.60  0.09
  • obs1: the initial data-loading phase has high DTLB overhead.
  • obs2: the cycle counts are single-core at the start and 32-core later; the cores' dynamic frequencies also differ between loading and computing.
  • Per lscpu: max 3.3 GHz, min 1.2 GHz. 2*32*33/12 = 176 ≈ 165, matching expectations.
  • obs3: the kernel can be ignored; kernel work (such as malloc-ing space) fluctuates a lot.

Why "retired" is emphasized

In Intel processors, not every store uop retires. Specifically:

  1. Stores that are reordered away or elided never truly execute, so they do not retire.
  2. Stores cancelled by an exception during execution do not retire either.
  3. Optimized micro-op fusion may cancel redundant stores, which then do not retire.
  4. Hardware may coalesce multiple stores into one; the redundant ones do not retire.

In short, not every store uop reaches retirement; some are cancelled or merged along the way.

So the event "retired store uops" deliberately counts only stores that completed and retired. Counting only retired stores reflects more accurately how many stores the program actually performed; including cancelled ones would add noise and distort the analysis.

Moreover, a TLB miss happens only when an instruction actually executes, at which point it is already certain to retire rather than be cancelled.

Advanced usage:

Vary the data size and count the cycles the function takes; the slope is the CPE (cycles per element).

perf stat example: estimating L1 cache latency via CPE

Vary the linked-list length n and count the cycles spent in the function; the slope is the CPE (cycles per element).
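
A sketch of that kind of kernel, assuming a circular list small enough to fit in L1: each load depends on the previous one, so the cycles per element approximate the L1 load-to-use latency. Run it under perf stat -e cycles:u for several n and fit the slope.

#include <vector>
#include <numeric>
#include <cstdio>

int main() {
    const int n = 1024;                 // vary n and plot cycles versus elements
    std::vector<int> next(n);
    std::iota(next.begin(), next.end(), 1);
    next[n - 1] = 0;                    // circular linked list, laid out in order
    int idx = 0;
    for (long i = 0; i < 100000000L; ++i)
        idx = next[idx];                // serialized dependent loads
    printf("%d\n", idx);                // print so the chase is not optimized away
}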

perf stat example: estimating the branch-misprediction penalty via CPE
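
A sketch of the classic kernel for this, with a random 0/1 array as assumed input: unsorted data makes the branch below mispredict about half the time, while a sorted run makes it predictable, so the CPE difference between the two runs estimates the misprediction penalty.

#include <vector>
#include <algorithm>
#include <cstdlib>
#include <cstdio>

int main(int argc, char **argv) {
    const int n = 1 << 24;
    std::vector<int> v(n);
    for (int &x : v) x = rand() & 1;              // unpredictable 0/1 data
    if (argc > 1) std::sort(v.begin(), v.end());  // any argument: predictable run
    long sum = 0;
    for (int rep = 0; rep < 10; ++rep)
        for (int x : v)
            if (x) sum += x;                      // the data-dependent branch
    printf("%ld\n", sum);  // compare perf stat -e branch-misses for the two runs
}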

A real example

$ perf record -g ./bin/pivot "../run/uniformvector-2dim-5h.txt"
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict and /proc/sys/kernel/perf_event_paranoid.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
dim = 2, n = 500, k = 2
Using time : 240.525000 ms
max : 143 351 58880.823709
min : 83 226 21884.924801
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.794 MB perf.data (30470 samples) ]

Without changing /proc/sys/kernel:

$ sudo perf record ./bin/pivot "../run/uniformvector-2dim-5h.txt"
dim = 2, n = 500, k = 2
Using time : 389.349000 ms
max : 143 351 58880.823709
min : 83 226 21884.924801
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.909 MB perf.data (49424 samples) ]

Using perf on a supercomputer

Sometimes VTune cannot be used on a supercomputer because of the architecture, so perf it is.

srun

srun -p IPCC -N 1 -n 1 -c 64 -t 1  perf record -g ../build/bin/pivot uniformvector-4dim-1h.txt

No perf.data file was generated, and a pile of garbled output appeared at the end of the terminal. (I later learned the garbage was because no -o output file was specified; the default, surprisingly, writes to the terminal.)

srun -p IPCC -N 1 -n 1 -c 64 -t 1  /usr/bin/bash

Requesting a bash shell and running there gives the same result.

salloc

This works:

ipcc22_0029@ln121 ~/github/MarchZnver1/IPCC2022-preliminary/run (main*) [07:38:46]           
> salloc -p IPCC -N 1 -t 10:00                                            
salloc: Granted job allocation 2175282                                       
salloc: Waiting for resource configuration                                         
salloc: Nodes fb0101 are ready for job                                     
bash-4.2$ ls                                                             
check.py  manual.log  refer-2dim-5h.txt  refer-4dim-1h.txt  result.txt  run.2022-08-10.log  run_case1.sh  run-ipcc-mpi.sh  run-ipcc.sh  run-mpi.sh  run.sh  uniformvector-2dim-5h.txt  uniformvector-4dim-1h.txt   
bash-4.2$ perf record -g ../build/bin/pivot uniformvector-4dim-1h.txt

But it actually ran on the login node, and the allocated node could not be reached over ssh.

sbatch

#!/bin/bash
#SBATCH -p IPCC
#SBATCH -t 3:00
#SBATCH --nodes=1
#SBATCH --exclude=
#SBATCH --cpus-per-task=64
#SBATCH --mail-type=FAIL
#SBATCH [email protected]

source /public1/soft/modules/module.sh
module purge

module load gcc/8.1.0
module load mpich/3.1.4-gcc8.1.0

logname=vtune
export OMP_PROC_BIND=close; export OMP_PLACES=cores
perf record -g -e task-clock:uppp /public1/home/ipcc22_0029/shaojiemike/github/IPCC2022-preliminary/build/bin/pivot /public1/home/ipcc22_0029/shaojiemike/slurm/uniformvector-4dim-1h.txt |tee ./$logname

This returns:

ipcc22_0029@ln121 ~/slurm  [11:18:06]
> cat slurm-2180072.out 
Error:
task-clock:uppp: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

Changing the command to perf record -g -o perfData -e task-clock:upp works; pp is supported.

Analyzing CVT

It looks like only about 10 registers are used; the loop could be unrolled once more.

Analyzing CVT turned into sub

To form an add pipeline, the compiler turned the subs into adds.

vmovaps 0xf98(%rip),%ymm9 # 81a0 <blockSize+0x20> presumably reads from a static variable.

Analysis: unrolling float once more

The results are wrong now, and it is not faster either (probably still a resource/pipeline issue; registers are tight and the subs are no longer grouped).

But it does show that the unroll_SumDistance(j) style can achieve unrolling at the assembly level.

Comparison: unrolling double once more

With registers even tighter, the loads can no longer be grouped; the ymm registers run short (ymm9).

The first red vadd could probably be moved down. Maybe add and sub share the same pipeline?

perf stat

Single node

 Performance counter stats for '/public1/home/ipcc22_0029/shaojiemike/github/IPCC2022-preliminary/build/bin/pivot /public1/home/ipcc22_0029/shaojiemike/slurm/case2/uniformvector-4dim-1h.txt':

         73,637.63 msec task-clock:u              #   54.327 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             9,433      page-faults:u             #    0.128 K/sec                  
   192,614,020,152      cycles:u                  #    2.616 GHz                      (83.34%)
     4,530,181,367      stalled-cycles-frontend:u #    2.35% frontend cycles idle     (83.32%)
    25,154,915,770      stalled-cycles-backend:u  #   13.06% backend cycles idle      (83.33%)
   698,720,546,628      instructions:u            #    3.63  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.34%)
    27,780,261,977      branches:u                #  377.256 M/sec                    (83.33%)
        11,900,773      branch-misses:u           #    0.04% of all branches          (83.33%)

       1.355446229 seconds time elapsed

      73.465281000 seconds user
       0.181156000 seconds sys

The perf data for two nodes looks wrong, and the perf record results are also strange (perf likely only saw the local mpirun launcher, not the remote ranks).

Performance counter stats for 'mpirun -n 2 /public1/home/ipcc22_0029/shaojiemike/github/IPCC2022-preliminary/build/bin/pivot /public1/home/ipcc22_0029/shaojiemike/slurm/case2/uniformvector-4dim-1h.txt':

             51.37 msec task-clock:u              #    0.060 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             2,278      page-faults:u             #    0.044 M/sec                  
        39,972,793      cycles:u                  #    0.778 GHz                      (84.56%)
         2,747,434      stalled-cycles-frontend:u #    6.87% frontend cycles idle     (85.16%)
        10,620,259      stalled-cycles-backend:u  #   26.57% backend cycles idle      (88.39%)
        58,479,982      instructions:u            #    1.46  insn per cycle         
                                                  #    0.18  stalled cycles per insn  (89.18%)
        14,068,620      branches:u                #  273.884 M/sec                    (77.31%)
           365,530      branch-misses:u           #    2.60% of all branches          (75.40%)

       0.850258803 seconds time elapsed

       0.015115000 seconds user
       0.038139000 seconds sys

A concrete analysis example

For carefully unrolled, hand-vectorized code (which should saturate AVX2's 16 YMM registers):

-march=znver1 still yields a 1/3 speedup.

A few differences are visible in the faster code (grouping instructions of the same type probably has little to do with it): 1. the form of the loads changed, no longer using insertf128 (0.71 down to 0.06); 2. there are fewer loads (vmaxpd with a memory operand reduces the instruction count); 3. the instructions are less scattered, more compact.

Below is the code with the core unrolled twice. In the fast version: 1. are unified loads faster? Loads are hoisted outside the loop, or disappear entirely (vmax with a memory operand); 2. the sub instructions show the compiler would love to group the loads, but there are only 16 registers (ymm0 holds the running total, ymm9 the mask).

Overall the instruction count dropped considerably (from the screenshots: 36 instructions in the fast version versus 54 in the slow one). Is the bottleneck instruction retirement? perf stat can verify this; IPC turns out essentially unchanged.

This explains why data reuse plus -march=znver1 became faster while the original version did not: data reuse added n*n worth of data reads, and the flag merged away many instructions (folding loads into vmaxpd), hence the speedup. The original implementation had almost no data reads, so there was nothing to optimize away and essentially no speedup.

Factors affecting a machine's IPC

The optimizations change IPC slightly, mainly because different instructions have different throughput (instructions executable per cycle).

The "CPUs utilized" figure is literally userTime/elapsedTime, which is rather crude.

Different runs weight the program's code differently, so IPC varies. (In the small case, 400 of the 600 units of time sit in MPI_Init, so IPC is naturally lower.)

Common problems

Permission restrictions

perf_event_paranoid setting is 4:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling

Edit /etc/sysctl.conf, add kernel.perf_event_paranoid = -1, save and exit; then reload the sysctl configuration to apply it: sudo sysctl -p

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Lab-mate Yuan Fuyan's group-meeting presentation

https://zhuanlan.zhihu.com/p/141694060

Intel SDM(Software Developer's Manual)

Introduction

This set consists of:

volume | description | pages (size)
volume 1 | Basic Architecture | 500 pages (3MB)
volume 2 (combined 2A, 2B, 2C, and 2D) | full instruction set reference | 2522 pages (10.8MB)
volume 3 (combined 3A, 3B, 3C, and 3D) | system programming guide | 1534 pages (8.5MB)
volume 4 | model-specific registers (MSRs) | 520 pages

volume3: Memory management(paging), protection, task management, interrupt and exception handling, multi-processor support, thermal and power management features, debugging, performance monitoring, system management mode, virtual machine extensions (VMX) instructions, Intel® Virtualization Technology (Intel® VT), and Intel® Software Guard Extensions (Intel® SGX).

AMD64 Architecture Programmer's Manual (3336 pages)

It has more figures and is easier to read.

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Parts of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

Git Submodule: Data & Code Repository Separate

Just following my own approach:

  1. the parent repository ignores the subdirectory in .gitignore
  2. use git normally in the child repository, e.g. init, commit, push
  3. what are submodules for?
    1. automatically uploading and initializing multiple sub-repos at a target git commit

submodule

Submodules are hard to use; they stall under big sub-repositories.

I recommend using .gitignore, or just ln -s xxx yyy.

ChatGPT's take:

Adding a Git repository inside another Git repository, often referred to as a "submodule," is a way to manage a separate repository as a part of your main project. This is particularly useful when you want to include another project within your project and keep them separate.

Here's how you can add a submodule and set up .gitignore for the outer repository:

  1. Add a Submodule:

    Navigate to the root of your outer Git repository and run the following command:

    git submodule add <repository_url> <submodule_path>
    

    Replace <repository_url> with the URL of the repository you want to add as a submodule, and <submodule_path> with the relative path within your outer repository where you want to place the submodule.

  2. Update and Initialize Submodule:

    After adding the submodule, you need to update and initialize it:

    git submodule update --init --recursive
    

    This command fetches the submodule's content and initializes it.

  3. Set Up .gitignore:

    To ignore the submodule's content and avoid committing it to the outer repository, you need to add the submodule path to the .gitignore file of the outer repository. For example, if your submodule is located in a directory called "submodule_folder," add the following line to the .gitignore file:

    submodule_folder/
    
  4. Working with Submodules:

    When you clone the outer repository on another system or share it with others, they will need to run git submodule update --init --recursive to fetch and initialize the submodule's content.

    To make changes to the submodule, navigate into the submodule directory and work as you would in a regular Git repository. When you commit changes in the submodule, you will need to push them to its remote repository. After that, commit the updated submodule reference in the outer repository.

    To update the submodule to the latest version available, you can navigate into the submodule directory and run git pull origin master (or the appropriate branch) to update its content. Then, commit the updated reference in the outer repository.

Remember that submodules have their own independent version control, so you need to manage them separately. Submodules provide a way to incorporate external projects into your main project while keeping their development and versioning separate.

Further research and study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

Parts of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.