Conda

conda

Anaconda和Miniconda都是针对数据科学和机器学习领域的Python发行版本,它们包含了许多常用的数据科学包和工具,使得安装和管理这些包变得更加简单。

解决了几个痛点:

  1. 不同python环境的切换(类似VirtualEnv)
  2. 高效的包管理工具(类似pip,特别是在Windows上好用)

anaconda

Anaconda是一个全功能的Python发行版本,由Anaconda, Inc.(前称Continuum Analytics)提供。

  • 它包含了Python解释器以及大量常用的数据科学、机器学习和科学计算的第三方库和工具,如NumPy、Pandas、Matplotlib、SciPy等。
  • Anaconda还包含一个名为Conda的包管理器,用于安装、更新和管理这些库及其依赖项。
  • Anaconda发行版通常较大(500MB),因为它预装了许多常用的包,适用于不希望从头开始搭建环境的用户。

Miniconda

Miniconda是Anaconda的轻量级版本(50MB),它也由Anaconda, Inc.提供。

  • 与Anaconda不同,Miniconda只包含了Python解释器和Conda包管理器,没有预装任何其他包。这意味着用户可以根据自己的需求手动选择要安装的包,从而实现一个精简而高度定制化的Python环境。
  • 对于希望从零开始构建数据科学环境或需要更细粒度控制的用户,Miniconda是一个很好的选择。
Install miniconda

According to the official website,

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# choose local path to install, maybe ~/.local
# init = yes will automatically modify .zshrc to add miniconda to PATH

# If you'd prefer that conda's base environment not be activated on startup,
#    set the auto_activate_base parameter to false:
conda config --set auto_activate_base false

You need to close all terminals (every window in the session, including all split panes) and reopen a terminal for the change to take effect.

Python on windows1

创建与激活

# 激活环境(base),路径为指定的 conda 安装路径下的 `bin/activate` 文件
source /home/m00876805/anaconda3/bin/activate


# 使用以下命令创建一个名为"myenv"的虚拟环境(您可以将"myenv"替换为您喜欢的环境名称):
conda create --name myenv python=3.8

# list existed env
conda env list
/home/m00876805/anaconda3/bin/conda env list

# 查看具体环境的详细信息
conda env export --name <env_name>

# 激活,退出
conda activate <env_name>
conda deactivate

conda pack

  • 目的: conda pack 用于将现有的 Conda 环境打包成一个压缩文件(如 .tar.gz),便于在其他系统上分发和安装。
  • 打包内容: 打包的内容包括环境中的所有依赖、库和包(定制修改包),通常用于在不使用 Anaconda 或 Miniconda 的系统上还原环境。
  • 恢复方式: 打包后的环境可以解压缩到指定位置,之后运行 conda-unpack 来修复路径,使其在新环境中正常工作。

打包

conda-pack 可以将 Conda 环境打包成一个 .tar.gz 文件,以便于跨机器或系统移动和还原环境。以下是使用 conda-pack 打包和还原环境的步骤:

1. 打包环境

假设要打包的环境名为 my_env

conda pack -n my_env -o my_env.tar.gz

这会在当前目录生成一个 my_env.tar.gz 文件。你可以将这个文件复制到其他系统或机器上解压还原。

2. 还原环境

在一个特定的 conda 环境目录(例如 /home/anaconda3)下还原和激活打包的环境,可以按以下步骤操作:

假设场景

  • 目标 conda 激活路径:/home/anaconda3/bin/activate
  • 打包文件:my_env.tar.gz
  • 解压后的环境名称:my_env

步骤

  1. 解压文件到 conda 环境目录

首先,将打包文件解压到指定的 conda 环境目录下的 envs 目录:

mkdir -p /home/anaconda3/envs/my_env
tar -xzf my_env.tar.gz -C /home/anaconda3/envs/my_env --strip-components 1

这里的 --strip-components 1 会去掉 tar.gz 包中的顶层目录结构,使内容直接解压到 my_env 文件夹内。

  2. 激活并修复环境

激活该环境,并运行 conda-unpack 来修复路径:

source /home/anaconda3/bin/activate /home/anaconda3/envs/my_env
conda-unpack

现在,my_env 环境已在 /home/anaconda3 目录下的 envs 文件夹中完成还原,可以正常使用。

conda env export

  • 目的: conda env export > freeze.yml 用于导出当前 Conda 环境的配置,包括所有安装的包和它们的版本信息,以 YAML 格式保存。
  • 导出内容: 导出的内容主要是依赖项和版本号,而不包括包的实际二进制文件。适用于在相同或不同系统上重建环境。
  • 恢复方式: 使用 conda env create -f freeze.yml 可以根据导出的 YAML 文件创建一个新环境。

conda list -e > requirements.txt 和 conda env export > freeze.yml

conda list -e > requirements.txt 和 conda env export > freeze.yml 都是用于记录和管理 Conda 环境中安装的包,但它们之间有一些关键的区别:

conda list -e > requirements.txt
conda install --yes --file requirements.txt

conda list -e

  • 用途: 这个命令生成一个以简单文本格式列出当前环境中所有包及其版本的文件(requirements.txt)。
  • 内容: 列出的内容通常仅包括包的名称和版本,而不包含环境的依赖关系、渠道等信息。
  • 安装方式: 通过 conda install --yes --file requirements.txt 可以尝试使用 Conda 安装这些列出的包。这种方式适合简单的包管理,但可能在处理复杂依赖时存在问题。

conda env export

  • 用途: 这个命令生成一个 YAML 文件(freeze.yml),它包含了当前环境的完整配置,包括所有包、版本、渠道等信息。
  • 内容: 导出的 YAML 文件包含了完整的依赖关系树,可以确保在重建环境时完全匹配原始环境的状态。
  • 安装方式: 通过 conda env create -f freeze.yml 可以根据 YAML 文件创建一个新的环境,确保与原环境一致。

关系与总结

  • 复杂性: conda env export 更加全面和可靠,适合重建相同的环境;而 conda list -e 更简单,适合快速记录包。
  • 使用场景: 对于需要准确重建环境的情况,使用 freeze.yml 是更好的选择;而对于简单的包列表管理,requirements.txt 可能足够用。

因此,如果你的目标是确保环境的一致性,使用 conda env export 导出的 freeze.yml 是推荐的做法;如果只是想快速记录并安装一组包,requirements.txt 是一个方便的选择。

安装

在conda命令无效时使用pip命令来代替

while read requirement; do conda install --yes $requirement || pip install $requirement; done < requirements.txt

The double pipe (“||”) is a control operator that represents the logical OR operation. It is used to execute a command or series of commands only if the previous command or pipeline has failed or has returned a non-zero status code.

复制已有环境(fork)

conda create -n 新环境名称 --clone 原环境名称 --copy

虽然是完全复制,但是pip install -e安装的包会因为源文件的改动而失效

pip install -e 是用于在开发模式下安装 Python 包的命令,允许你在不复制包文件的情况下,将项目源代码直接安装到 Python 环境中,并保持源代码与环境中的包同步更新。这对于开发过程中频繁修改和测试代码非常有用。

以下是 pip install -e 的使用方法:pip install -e /path/to/project

详细解释:

  • /path/to/project:项目的根目录,通常包含 setup.py 文件。setup.py 文件定义了包的名称、依赖、入口点等信息。
  • -e 选项:表示“可编辑安装”(editable),意味着它不会复制项目文件到 Python 环境的 site-packages 目录,而是创建一个符号链接,指向原始项目路径。这样你可以在原路径下修改源代码,Python 环境中的包会实时反映这些修改。
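
下面给出一个最小的 setup.py 草图(包名 my_project、版本号等均为假设的示例值),只是用来说明可编辑安装所需的项目结构:

# setup.py —— 最小示例,包名/版本均为假设值
from setuptools import setup, find_packages

setup(
    name="my_project",          # 假设的包名
    version="0.1.0",
    packages=find_packages(),   # 自动发现源码目录中的包
)

之后在项目根目录执行 pip install -e . 即可完成可编辑安装。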

通过 pip freeze 命令更好地查看

如果你想明确区分哪些包是通过 pip install -e 安装的,可以使用 pip freeze 命令。与 pip list 不同,pip freeze 会将包的版本和安装源显示出来。对于 -e(editable mode)安装的包,pip freeze 会有特殊标记。

运行以下命令:

pip freeze

输出示例:

-e git+https://github.com/example/project.git@abc123#egg=my_project
numpy==1.21.0
requests==2.25.1

在这里,带有 -e 标记的行表示这个包是通过 pip install -e 安装的,后面跟的是包的源代码路径(例如 Git 仓库 URL 或本地路径),而不是直接列出包的版本号。

输出解析

  1. -e 标记:表示这个包是以开发模式安装的。
  2. 普通包:对于直接通过 pip install 安装的包(不是开发模式),它们会以 包名==版本号 的形式列出。
  3. git URL 或本地路径:开发模式下安装的包会指向源代码的路径,通常是 git 仓库 URL 或本地路径(如果是通过本地文件系统安装的)。

参考文献

https://blog.csdn.net/Mao_Jonah/article/details/89502380


  1. ref 

Linux Terminal

导言

对程序员来说,一个好用、易用的terminal,就是和军人手上有把顺手的好枪一样。

基础知识

用户的环境变量和配置文件

在Linux系统中,用户的环境变量和配置文件可以在不同的节点生效。以下是这些文件的功能和它们生效的时机:

  1. /etc/environment:

    • 功能: 设置系统范围的环境变量。
    • 生效时机: 在用户登录时读取,但不会执行shell命令。它主要用于设置变量,如PATH、LANG等。
  2. /etc/profile:

    • 功能: 为系统的每个用户设置环境信息。
    • 生效时机: 当用户登录时,会读取并执行该文件中的配置。它是针对登录shell(例如,通过终端登录或ssh登录)的。
  3. /etc/profile.d/:

    • 功能: 存放多个脚本,这些脚本会被/etc/profile读取和执行。
    • 生效时机: 与/etc/profile相同,登录shell时执行。它使得系统管理员可以将不同的配置分散到多个文件中管理。
  4. /etc/bash.bashrc:

    • 功能: 为所有用户设置bash shell的配置。
    • 生效时机: 对于非登录shell(例如,打开一个新的终端窗口)时会读取并执行。
  5. ~/.profile:

    • 功能: 为单个用户设置环境信息。
    • 生效时机: 用户登录时读取并执行,主要针对登录shell。
  6. ~/.bashrc:

    • 功能: 为单个用户配置bash shell的设置。
    • 生效时机: 用户打开一个新的bash shell(非登录shell)时读取并执行。

总结

  • /etc/environment 和 /etc/profile 主要用于系统范围的环境变量设置,前者不会执行shell命令,后者会执行。
  • /etc/profile.d/ 中的脚本作为 /etc/profile 的扩展,用于更灵活的管理配置。
  • /etc/bash.bashrc 适用于所有用户的bash配置,但只针对非登录shell。
  • ~/.profile 和 ~/.bashrc 适用于单个用户,前者用于登录shell,后者用于非登录shell。

通过这些文件,系统和用户可以灵活地设置和管理环境变量和shell配置,以满足不同的需求和使用场景。

\n \r 回车 换行

符号 | ASCII码 | 意义
\n | 10 | 换行 NL:本义是光标往下一行(不一定到下一行行首)。n 即英文 newline,控制字符可写成 LF(Line Feed)
\r | 13 | 回车 CR:本义是光标重新回到本行开头。r 即英文 return,控制字符可写成 CR(Carriage Return)

在不同的操作系统这几个字符表现不同:

  1. 在WIN系统下,这两个字符就是表现的本义;
  2. 在UNIX类系统,换行\n就表现为光标下一行并回到行首;
  3. 在MAC上,\r就表现为回到本行开头并往下一行。至于ENTER键的定义是与操作系统有关的,通常用的Enter是两个加起来。
\n: UNIX 系统行末结束符
\r\n: Windows 系统行末结束符
\r: 旧版 MAC OS 系统行末结束符
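
下面用一个小的 Python 例子直观展示这几种行结束符(仅作演示,splitlines() 可以同时识别 \n、\r\n、\r 三种行尾):

s = "line1\r\nline2\nline3\rline4"
print(repr(s))          # 'line1\r\nline2\nline3\rline4'
print(s.splitlines())   # ['line1', 'line2', 'line3', 'line4']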

终端命令行代理

在任意层级的SHELL配置文件里添加

export http_proxy=http://yourproxy:port
export https_proxy=http://yourproxy:port

写成bashrc的脚本命令

#YJH proxy
export proxy_addr=localhost
export proxy_http_port=7890
export proxy_socks_port=7890
function set_proxy() {
   export http_proxy=http://$proxy_addr:$proxy_http_port #如果使用git 不行,这两个http和https改成socks5就行
   export https_proxy=http://$proxy_addr:$proxy_http_port
   export all_proxy=socks5://$proxy_addr:$proxy_socks_port
   export no_proxy=127.0.0.1,.huawei.com,localhost,local,.local 
}
function unset_proxy() {
   unset http_proxy
   unset https_proxy
   unset all_proxy
}
function test_proxy() {
   curl -v -x http://$proxy_addr:$proxy_http_port https://www.google.com | egrep 'HTTP/(2|1.1) 200'
   # socks5h://$proxy_addr:$proxy_socks_port
}
# set_proxy # 如果要登陆时默认启用代理则取消注释这句

常用命令

check process create time

ps -eo pid,lstart,cmd |grep bhive
date

kill all process by name

 sudo ps -ef | grep 'bhive-re' | grep -v grep | awk '{print $2}' | sudo xargs -r kill -9

常见问题

鼠标滚轮输出乱码

滚轮乱码,是tmux set mouse on的原因

进入tmux后退出,并运行reset即可

sudo后找不到命令

当你使用sudo去执行一个程序时,出于安全的考虑,这个程序将在一个新的、最小化的环境中执行,也就是说,诸如PATH这样的环境变量,在sudo命令下已经被重置成默认状态了。

添加所需要的路径(如 /usr/local/bin)到/etc/sudoers文件"secure_path"下

Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
用 python 的 curses 写多进程进度条的时候,输出混乱

解决办法如下:

stdscr = curses.initscr() # 不要设置为全局变量
# 而且 使用set_win unset_win 保持区域换行的行为

参考文献

Zsim-tlb: bug

bug

zsim-tlb simulate in icarus0

pinbin: build/opt/zsim.cpp:816: LEVEL_BASE::VOID VdsoCallPoint(LEVEL_VM::THREADID): Assertion `vdsoPatchData[tid].level' failed.
Pin app terminated abnormally due to signal 6.

locate error

VOID VdsoCallPoint(THREADID tid) {
    //level=0,invalid
    assert(vdsoPatchData[tid].level);
    vdsoPatchData[tid].level++;
    // info("vDSO internal callpoint, now level %d", vdsoPatchData[tid].level); //common
}
  • vDSO (virtual dynamic shared object) is a kernel mechanism that exports a carefully chosen set of kernel-space routines (e.g., non-secret APIs such as gettid() and gettimeofday()) to user space, to eliminate the performance penalty of user-kernel mode switches, according to wiki. vDSO
  • You can call routines such as __vdso_getcpu() through the C library, and the kernel automatically maps them into user space.
  • vDSO overcomes the drawbacks of vsyscall (the first Linux kernel mechanism to accelerate syscalls).
  • In zsim, vDSO has only four functions: enum VdsoFunc {VF_CLOCK_GETTIME, VF_GETTIMEOFDAY, VF_TIME, VF_GETCPU};

vDSO simulate part

// Instrumentation function, called for EVERY instruction
VOID VdsoInstrument(INS ins) {
    ADDRINT insAddr = INS_Address(ins); //get ins addr
    if (unlikely(insAddr >= vdsoStart && insAddr < vdsoEnd)) {
        //INS is vdso syscall
        if (vdsoEntryMap.find(insAddr) != vdsoEntryMap.end()) {
            VdsoFunc func = vdsoEntryMap[insAddr];
            //call VdsoEntryPoint function
            //argv are: tid ,func(IARG_UINT32),arg0(LEVEL_BASE::REG_RDI),arg1(LEVEL_BASE::REG_RSI) 
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) VdsoEntryPoint, IARG_THREAD_ID, IARG_UINT32, (uint32_t)func, IARG_REG_VALUE, LEVEL_BASE::REG_RDI, IARG_REG_VALUE, LEVEL_BASE::REG_RSI, IARG_END);
        } else if (INS_IsCall(ins)) {   //call instruction
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) VdsoCallPoint, IARG_THREAD_ID, IARG_END);
        } else if (INS_IsRet(ins)) {    //Ret instruction
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) VdsoRetPoint, IARG_THREAD_ID, IARG_REG_REFERENCE, LEVEL_BASE::REG_RAX /* return val */, IARG_END);
        }
    }

    //Warn on the first vsyscall code translation
    if (unlikely(insAddr >= vsyscallStart && insAddr < vsyscallEnd && !vsyscallWarned)) {
        warn("Instrumenting vsyscall page code --- this process executes vsyscalls, which zsim does not virtualize!");
        vsyscallWarned = true;
    }
}

INS_Address comes from the Pin kit, and INS_InsertCall is a Pin API.

try:

.level just records the nesting level of vsyscalls. I think it is fine to simply comment out the assert, which is triggered when the call point is reached before the entry point.

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

上面回答部分来自ChatGPT-3.5,没有进行正确性的交叉校验。

Diary 230827: 上海二次元之旅

缘由

华为实习要结束了,作为二次元,在中国秋叶原怎么能不好好逛逛呢?

目标

  1. 百联zx,外文书店,
  2. 百米香榭
  3. 迪美地下城(香港名街关门装修了)
  4. 第一百货和新世界
  5. 大丸百货的4F的华漫潮玩
  6. 静安大悦城的间谍过家家的快闪点。
  7. 徐家汇的一楼jump店、龙猫店,二楼GSC店
  8. mihoyo总部

爱上海

上海真是包容性极强的地方。原本内心对二次元的热爱,竟然这么多人也喜欢。不必隐藏,时刻伪装。可以暂时放松自我的感觉真好。

论对二次元人物的喜爱

爱的定义

爱或者热爱是最浓烈的情感。对象一般是可以交互的人物,物体说不定也可以。但是至少要能与他持续产生美好的回忆和点滴,来支持这份情感。

比如说,我一直想让自己能热爱我的工作,就需要创造小的阶段成功和胜利来支持自己走下去。

区分喜爱与贪恋美色

  1. 首先和对方待在一起很舒服,很喜欢陪伴的感觉,想长期走下去。
  2. 其实不是满脑子瑟瑟的想法
  3. 外表美肯定是加分项,但是更关注气质,想法和精神层面的东西。

三次元与二次元人物

三次元的人物包括偶像歌手,和演员。需要演出,演唱会来与粉丝共创回忆,演员也需要影视剧作品。

二次元人物大多数来自于动画,因为游戏一般不以刻画人物为目的,比如主机游戏 当然galgame和二次元手游除外。

日本动画以远超欧美和国创的题材和对人物的细腻刻画(不愧是galgame大国,BanG Dream! It's MyGO!!!!! 的人物心理描写简直一绝),创造了许多令人喜爱的角色。

比较优势

  1. 表现能力的上限来看,动画也是远超游戏(不然游戏里为什么要动画CG)和真人影视剧的。
  2. 二次元人物的二次创作的低门槛(无论从还原难度还是法律约束上来说,毕竟三次元人物经常和真人强绑定)和舆论高包容性(传统二次元社区可比饭圈干净多了)都有远超三次元的优势。
  3. 此外二创Cosplay的平易近人或者说触手可及的真实感。二创能创造出远超原本作品的人物记忆和羁绊
  4. 另一点可能的是二创的低门槛带来的创作快乐,这一点在之前分析音乐的快乐有提到。二创主要有音乐,mmd,iwara动画
  5. cos 可以让原本平凡的人生,染上对应角色不平凡经历的色彩
  6. 最后一点就是永恒性吧,第一点是之前我分析过人们喜欢在变化的生活中追求不变,或者相反。三次元人物或者演员会老去,但是二次元人物能在一部新剧场版下重现活力
  7. 另一点就是不会被背叛。

比较缺点

  1. 对于二次元角色的喜爱在时间的长河里是单向的,除开代入主角,很难收获二次元角色对自己的喜爱(这样看galgame稍微弥补了这点)。交流交互隔着次元的屏障。
  2. 成长可塑性的略微欠缺:如果作品已经完结了,除开少量二创,角色形象基本就确定了。除非输入到AI里训练,使之生命延续。
  3. 惊喜性缺失: 真实人物是多面的,不可控的。但是二次元角色的反转特性只存在于剧集的剧情里。

初步结论

女朋友 > 喜欢二次元(连载 > 完结) >> 追星

图片轰炸

23.08.27 to do

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

上面回答部分来自ChatGPT-3.5,没有进行正确性的交叉校验。

UnimportantView: Anime Recommendation

起源与目标

  1. 看番没有形成自己的喜好,导致看到不对的,反而有副作用,
  2. 什么番可以成为精神支柱,而不是看了之后。反而精神内耗更严重了。(看了happy sugar life后,直接抑郁了)

说明

不同于恋爱番,催泪番,这样的分类。其实我更在意作品想表达的主题,作者想展现给读者什么。 无论是各种道理,还是就是某个环境,虚幻世界。

羁绊:对人的爱,爱情、亲情、友情。

何为爱的寻爱之旅

番剧名 | 精神内核 | 评语 | 喜爱的角色 | 音乐
Happy Sugar Life | 守护你是我的爱语 | 难以理解的爱的世界里,两位迷途少女相遇,救赎,领悟爱的蜜罐生活 | 砂糖、盐 | 金丝雀、ED、悲伤小提琴

我推的孩子(第一集)

Violet Evergarden

羁绊的破碎和重组

BanG Dream It's my go !!!!! 初羁绊(友情,百合,重女)的破碎和reunion

病名为爱

未来日记

家有女友、渣愿

点滴恋爱

百合类的成长:终将成为你,

我心危

轮回宿命类

跨越时空也无法阻止我爱你

命运石之门

RE0

无法抵达的简单幸福未来

寒蝉鸣泣之时

魔法少女小圆

史诗类

复杂、紧张的鸿篇巨制。多非单一的精神内核可以概括。多为群像剧。

奇幻、幻想世界史诗

Fate Zero

钢炼

EVA

to do

刀剑

四谎

CLANNAD

龙与虎

巨人

超炮

凉宫

鲁鲁修

轻音

补番列表

  1. 物语系列

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

上面回答部分来自ChatGPT-3.5,没有进行正确性的交叉校验。

TLB: real pagewalk overhead

简介

TLB的介绍,请看

页表相关

理论基础

大体上是应用访问越随机, 数据量越大,pgw开销越大。

ISCA 2013 shows the pgw overhead in big memory servers.

Basu 等 - Efficient Virtual Memory for Big Memory Servers.pdf

or ISCA 2020 Guvenilir 和 Patt - 2020 - Tailored Page Sizes.pdf

机器配置

# shaojiemike @ snode6 in ~/github/hugoMinos on git:main x [11:17:05]
$ cpuid -1 -l 2
CPU:
      0x63: data TLB: 2M/4M pages, 4-way, 32 entries
            data TLB: 1G pages, 4-way, 4 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries
      0x76: instruction TLB: 2M/4M pages, fully, 8 entries
      0xff: cache data is in CPUID leaf 4
      0xb5: instruction TLB: 4K, 8-way, 64 entries
      0xf0: 64 byte prefetching
      0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
# if above command turns out empty
cpuid -1 |grep TLB -A 10 -B 5
# will show sth like

L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
    instruction # entries     = 0x40 (64)
    instruction associativity = 0xff (255)
    data # entries            = 0x40 (64)
    data associativity        = 0xff (255)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
    instruction # entries     = 0x40 (64)
    instruction associativity = 0xff (255)
    data # entries            = 0x40 (64)
    data associativity        = 0xff (255)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
    instruction # entries     = 0x200 (512)
    instruction associativity = 2-way (2)
    data # entries            = 0x800 (2048)
    data associativity        = 4-way (4)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
    instruction # entries     = 0x200 (512)
    instruction associativity = 4-way (4)
    data # entries            = 0x800 (2048)
    data associativity        = 8-way (6)

OS config

By default there are no huge pages (2MB each on this machine, see Hugepagesize below) available to use.

$ cat /proc/meminfo | grep huge -i
AnonHugePages:      8192 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB

The meaning of these fields is explained here.

设置页表大小

other ways: change source code

  1. way1: Linux transparent huge page (THP) support allows the kernel to automatically promote regular memory pages into huge pages, see cat /sys/kernel/mm/transparent_hugepage/enabled, but achieving this needs some extra configuration.
  2. way2: Huge pages are allocated from a reserved pool, which needs a sys-config change, for example echo 20 > /proc/sys/vm/nr_hugepages. And you need to write special C++ code to use the huge pages.
# using mmap system call to request huge page
mount -t hugetlbfs \
    -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
    min_size=<value>,nr_inodes=<value> none /mnt/huge

without recompile

But there is a blog using unmaintained tool hugeadm and iodlr library to do this.

sudo apt install libhugetlbfs-bin
sudo hugeadm --create-global-mounts
sudo hugeadm --pool-pages-min 2M:64

So meminfo is changed

$ cat /proc/meminfo | grep huge -i
AnonHugePages:      8192 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      64
HugePages_Free:       64
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:          131072 kB

using iodlr library

git clone 

应用测量

Measurement tools from code

# shaojiemike @ snode6 in ~/github/PIA_huawei on git:main x [17:40:50]
$ ./investigation/pagewalk/tlbstat -c '/staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/sssp.inj -f /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/benchmark/kron-20.wsg -n1'
command is /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/sssp.inj -f /staff/shaojiemike/github/sniper_PIMProf/PIMProf/gapbs/benchmark/kron-20.wsg -n1
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%
324088     207256      0.64 733758     3276       18284      130         5.64  0.04
21169730   11658340    0.55 11802978   757866     316625     24243       1.50  0.11

平均单次开销(开始到稳定):dtlb miss 约需 24~50 cycles,itlb miss 约需 27~40 cycles。
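
这个平均开销可以直接由上面 tlbstat 的两行输出估算(K_DTLBCYC/K_ITLBCYC 的单位是千周期),下面是一个简单的 Python 计算草图:

# 用 tlbstat 输出粗略估算单次 page walk 的平均周期数
rows = [
    # (DTLB_WALKS, K_DTLBCYC, ITLB_WALKS, K_ITLBCYC)
    (733758, 18284, 3276, 130),
    (11802978, 316625, 757866, 24243),
]
for dwalks, kdcyc, iwalks, kicyc in rows:
    print(f"dtlb {kdcyc * 1000 / dwalks:.1f} cycles/walk, "
          f"itlb {kicyc * 1000 / iwalks:.1f} cycles/walk")
# 按此估算 dtlb 约 25~27、itlb 约 32~40 cycles/walk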

案例的时间分布:

  • 读数据开销占比不大,2.5%左右
  • pagerank等图应用并行计算时,飙升至 22%
  • bfs 最多就是 5%,没有那么随机的访问。
  • 但是gemv 在65000 100000超内存前,即使是全部在计算,都是0.24%
  • 原因:访存模式:图应用的访存模式通常是随机的、不规则的。它们不像矩阵向量乘法(gemv)等应用那样具有良好的访存模式,后者通常以连续的方式访问内存。连续的内存访问可以利用空间局部性,通过预取和缓存块的方式减少TLB缺失的次数。
  • github - GUPS can achieve 90%
  • DAMOV - ligra - pagerank can achieve 90% in the 20M-input case

gemm

  • normal gemm can achieve 100% in some situations
  • when the matrices are too big to fit in cache, matrix2 is accessed across rows, so it always misses in cache
  • the -O3 flag seems to bring no time reduction, because there is no SIMD assembly in the generated code
  • memory access time = pgw + TLB access time + time to load data into cache

gemm

the gemm's core line is

for(int i=0; i<N; i++){
   // ignore the overflow, do not influence the running time.
   for(int j=0; j<N; j++){
      for(int l=0; l<N; l++){
            // gemm
            // ans[i * N + j] += matrix1[i * N + l] * matrix2[l * N + j];

            // for gemm sequantial
            ans[i * N + j] += matrix1[i * N + l] * matrix2[j * N + l];
      }
   }
}

and the real time breakdown is as follows. (to do)

  1. first need to perf get the detail time

bigJump

manual code to test if tlb entries is run out

$ ./tlbstat -c '../../test/manual/bigJump.exe 1 10 100'
command is ../../test/manual/bigJump.exe 1 10 100
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%
2002404    773981      0.39 104304528  29137      2608079    684        130.25  0.03

$ perf stat -e mem_uops_retired.all_loads -e mem_uops_retired.all_stores -e mem_uops_retired.stlb_miss_loads -e mem_uops_retired.stlb_miss_stores ./bigJump.exe 1 10 500
Number read from command line: 1 10 (N,J should not big, [0,5] is best.)
result 0
 Performance counter stats for './bigJump.exe 1 10 500':

          10736645      mem_uops_retired.all_loads
         532100339      mem_uops_retired.all_stores
             57715      mem_uops_retired.stlb_miss_loads
         471629056      mem_uops_retired.stlb_miss_stores

In this case, tlb miss rate up to 47/53 = 88.6%

Big bucket hash table

using big hash table

other apps

Any algorithm that does random accesses into a large memory region will likely suffer from TLB misses. Examples are plenty: binary search in a big array, large hash tables, histogram-like algorithms, etc.
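
As a rough illustration (not a faithful microbenchmark, since Python interpreter overhead dominates real timing), the histogram-like random-access pattern looks like this:

# Histogram-like random access into a large array; each access likely touches a
# different 4KB page, so the dTLB reach is exceeded quickly.
import random
from array import array

N = 1 << 24                      # assumption: 16M ints (~64MB), far beyond L1 TLB reach
data = array("i", [0]) * N
for _ in range(1 << 20):
    idx = random.randrange(N)    # random page each time -> frequent TLB walks
    data[idx] += 1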

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

上面回答部分来自ChatGPT-3.5,没有进行正确性的交叉校验。

AI Compiler

百度

秋招面试时遇到高铁柱前辈。问了相关的问题(对AI专业的人可能是基础知识)

  1. 问:nvcc编译器不好用吗?为什么要开发tvm之类的编译器?
  2. 答:首先,nvcc是类似于gcc、msvc(Microsoft Visual C++)之类的传统编译器,支持的是CUDA C/C++ 代码;而tvm是张量编译器,支持的是python之类的代码,将其中的网络设计编译拆解成各种算子,然后使用cudnn或者特定硬件的高效机器码来执行。

蔚来

数字信号处理器 (Digital signal processor)

HLO 简单理解为编译器 IR。

TVM介绍

https://tvm.apache.org

  1. TVM解决的问题:2017年,deploy Deep learning(TF,Pytorch) everywhere(hardware).
  2. Before TVM:
    1. 手动调优:loop tiling for locality.
    2. operator fusion 算子融合。虽然性能高,但是部署不高效。
  3. 编译优化思路引入深度学习:定义了算子描述到部署空间的映射。核心是感知调度空间,并且实现compute/schedule 分离。
  4. TVM当前的发展:
    1. 上层计算图表示:NNVM -> Relay -> Relax
    2. 底层优化方式:manual -> AutoTVM(schedule最优参数的搜索,基于AI的cost model) -> Ansor(不再需要手动写AutoTVM模版,使用模版规则生成代码)
  5. TVM的额外工作:HeteroCL: TVM + FPGA

  1. output Fusion
  2. 减少Global Memory Copy

把中间算子库替换成编译器?

暂时不好支持张量

AI自动调整变化来调优

自动调参。缺点:

  1. 需要人工写模版
  2. 人工导致解空间变小

随机各级循环应用优化策略(并行,循环展开,向量化)

介绍了Ansor效果很好

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

Graph Algorithms: Pagerank

Pagerank

  1. Networks and social networks can be modeled as weighted graphs
  2. PageRank is a way to rank the nodes by importance

How should we design a graph so that, while PageRank executes, memory accesses jump around the data array very randomly?
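
A minimal pull-style PageRank sketch (random synthetic graph, plain Python lists, illustrative only) makes the pattern visible: the inner read of rank[u] jumps to essentially random positions of the rank array, which is exactly the TLB-unfriendly behavior discussed above.

# Minimal PageRank power iteration on a random graph (sketch, not the GAP/ligra implementation)
import random

def pagerank(num_nodes, edges, d=0.85, iters=20):
    in_edges = [[] for _ in range(num_nodes)]   # pull direction: incoming edges per node
    out_deg = [0] * num_nodes
    for u, v in edges:
        in_edges[v].append(u)
        out_deg[u] += 1
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iters):
        new_rank = [(1 - d) / num_nodes] * num_nodes
        for v in range(num_nodes):
            for u in in_edges[v]:                          # u values are nearly random
                new_rank[v] += d * rank[u] / out_deg[u]    # random reads into rank[]
        rank = new_rank
    return rank

if __name__ == "__main__":
    n = 1000
    es = [(random.randrange(n), random.randrange(n)) for _ in range(5000)]
    print(sum(pagerank(n, es)))   # dangling nodes are ignored, so the sum may be < 1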

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

上面回答部分来自ChatGPT-3.5,没有进行正确性的交叉校验。

https://zhuanlan.zhihu.com/p/137561088

Python: DataStructure

check if it is empty?

strings, lists, tuples

# Correct:
if not seq:
if seq:

# Wrong:
if len(seq):
if not len(seq):

debug

try:
    # sth
except Exception as e:
    pprint.pprint(list)
    raise e
finally:
    un_set()

for

step

调参需要测试间隔值

for i in range(1, 101, 3):
    print(i)

遍历修改值

  • 使用 enumerate 函数结合 for 循环遍历 list,以修改 list 中的元素。
  • enumerate 函数返回一个包含元组的迭代器,其中每个元组包含当前遍历元素的索引和值。在 for 循环中,我们通过索引 i 修改了列表中的元素。
# 对于 二维list appDataDict
baseline = appDataDict[0][0] # CPU Total
for i, line in enumerate(appDataDict):
    for j, entry in enumerate(line):
        appDataDict[i][j] = round(entry/baseline, 7)

itertools

itertools --- 为高效循环而创建迭代器的函数

from itertools import permutations

for a, b, c in permutations((1, 2, 3)):
    print(a, b, c)

String 字符串

%c  格式化字符及其ASCII码
%s  格式化字符串
%d  格式化整数
%u  格式化无符号整型
%o  格式化无符号八进制数
%x  格式化无符号十六进制数
%X  格式化无符号十六进制数(大写)
%f  格式化浮点数字,可指定小数点后的精度
%e  用科学计数法格式化浮点数
%E  作用同%e,用科学计数法格式化浮点数
%g  %f和%e的简写
%G  %F  %E 的简写
%p  用十六进制数格式化变量的地址
print("My name is %s and weight is %d kg!" % ('Zara', 21))

string <-> list

' '.join(pass_list) and pass_list.split(" ")

对齐"\n".join(["%-10s" % item for item in List_A])

字符串开头判断

text = "Hello, world!"

if text.startswith("Hello"):
    print("The string starts with 'Hello'")
else:
    print("The string does not start with 'Hello'")

format 格式化函数

Python 2.6 开始,通过 {} 和 : 来代替以前的 %

>>>"{} {}".format("hello", "world")    # 不设置指定位置,按默认顺序
'hello world'

>>> "{1} {0} {1}".format("hello", "world")  # 设置指定位置
'world hello world'

# 字符串补齐100位,<表示左对齐
variable = "Hello"
padded_variable = "{:<100}".format(variable)

数字处理

print("{:.2f}".format(3.1415926)) # 保留小数点后两位

{:>10d} 右对齐 (默认, 宽度为10)
{:^10d} 中间对齐 (宽度为10)

小数位

x = round(x,3)# 保留小数点后三位

容器:List

https://www.runoob.com/python/python-lists.html

初始化以及访问

list = ['physics', 'chemistry', 1997, 2000]
list = []          ## 空列表
print(list[0])

切片

格式:[start_index:end_index:step]

不包括end_index的元素
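
例如(lst 为示例列表):

lst = ['physics', 'chemistry', 1997, 2000]
print(lst[1:3])    # ['chemistry', 1997],不包括下标为 3 的元素
print(lst[::2])    # ['physics', 1997],步长为 2
print(lst[::-1])   # 反转列表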

二维数组

list_three = [[0 for i in range(3)] for j in range(3)]

# numpy 创建连续内存的数组,可自动向量化、并行
import numpy as np
# 创建一个 3x4 的数组且所有值全为 0
x3 = np.zeros((3, 4), dtype=int)
# 创建一个 3x4 的数组,然后将所有元素的值填充为 2
x5 = np.full((3, 4), 2, dtype=int)

size 大小

len(day)

排序

# sort by the element at index 2 (the third column)
def takeSecond(elem):
    return elem[2]

LCData.sort(key=takeSecond)

# [1740, '黄业琦', 392, '第 196 场周赛'],
# [1565, '林坤贤', 458, '第 229 场周赛'],
# [1740, '黄业琦', 458, '第 229 场周赛'],
# [1509, '林坤贤', 460, '第 230 场周赛'],
# [1740, '黄业琦', 460, '第 230 场周赛'],
# [1779, '黄业琦', 558, '第 279 场周赛'],

对应元素相加到一个变量

import copy

tmp_list = [[],[],[],[]]
# 注意不需要右值赋值;to_add 是与 tmp_list 等长的列表
[x.append(copy.deepcopy(entry)) for x,entry in zip(tmp_list, to_add)]

两个list对应元素相加

对于等长的

list1 = [1, 2, 3, 4, 5]
list2 = [6, 7, 8, 9, 10]

result = [x + y for x, y in zip(list1, list2)]
print(result)

如果两个列表的长度不同,你可以使用zip_longest()函数来处理它们。zip_longest()函数可以处理不等长的列表,并使用指定的填充值填充缺失的元素。

from itertools import zip_longest

list1 = [1, 2, 3, 4, 5]
list2 = [6, 7, 8]

result = [x + y for x, y in zip_longest(list1, list2, fillvalue=0)]
print(result)

如果是二维list

list1 = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

list2 = [[10, 11, 12],
         [13, 14, 15]]

rows = max(len(list1), len(list2))
cols = max(len(row) for row in list1 + list2)

result = [[0] * cols for _ in range(rows)]

for i in range(rows):
    for j in range(cols):
        if i < len(list1) and j < len(list1[i]):
            result[i][j] += list1[i][j]
        if i < len(list2) and j < len(list2[i]):
            result[i][j] += list2[i][j]

print(result)

# 将一个二维列表的所有元素除以一个数A
result = [[element / A for element in row] for row in list1]

直接赋值、浅拷贝和深度拷贝

Python append() 与深拷贝、浅拷贝

python赋值只是引用,别名

list.append('Google')   ## 使用 append() 添加元素
alist.append( num ) # 浅拷贝 ,之后修改num 会影响alist内的值

import copy
alist.append( copy.deepcopy( num ) ) # 深拷贝

# delete
del list[2]

for循环迭代的元素 也是 引用

original_list = [1, 2, 3]

for item in original_list:
    item *= 2 # 每个元素是不可变的

print(original_list) 

original_list = [[1,2,3], [2], [3]]

for item in original_list:
    item.append("xxx") # 每个元素是可变的

print(original_list) 

# [1, 2, 3]
# [[1, 2, 3, 'xxx'], [2, 'xxx'], [3, 'xxx']]

函数传参是引用,但是能通过切片来得到类似指针

参数的传递 函数声明时的形参,使用时,等同于函数体内的局部变量。由于Python中一切皆为对象。因此,参数传递时直接传递对象的地址,但具体使用分两种类型: 1.传递不可变对象的引用(起到其他语言值传递的效果) 数字,字符串,元组,function等 2.传递可变对象的引用(起到其他语言引用传递的效果) 字典,列表,集合,自定义的对象等

def fun0(a):
    a = [0,0] # a在修改后,指向的地址发生改变,相当于新建了一个值为[0,0]

def fun(a):
    a[0] = [1,2]

def fun2(a):
    a[:] = [10,20]

b = [3,4]
fun0(b)
print(b)
fun(b)
print(b)
fun2(b)
print(b)

# [3, 4]
# [[1, 2], 4]
# [10, 20]

return 返回值

可变的也是引用

def fun1(l):
    l.append("0")
    return l 

def fun2(l):
    return l

if __name__=="__main__":
    l = [1,2,3,4,5]

    rel2 = fun2(l)
    print(rel2)   
    rel1 = fun1(l)
    print(rel1)   
    print(rel2)   
    l.append("xxx")
    print(rel1)   
    print(rel2)   
    del rel1[2]
    print(rel1)   
    print(rel2)  

# [1, 2, 3, 4, 5]
# [1, 2, 3, 4, 5, '0']
# [1, 2, 3, 4, 5, '0']
# [1, 2, 3, 4, 5, '0', 'xxx']
# [1, 2, 3, 4, 5, '0', 'xxx']
# [1, 2, 4, 5, '0', 'xxx']
# [1, 2, 4, 5, '0', 'xxx']

容器:元组Tuple

  • 元组和列表类似,但是不同的是元组不能修改,但可以对元组进行连接组合,元组使用小括号。
  • 元组中只包含一个元素时,需要在元素后面添加逗号,否则括号会被当作运算符使用。
#创建元组
tup = (1, 2, 3, 4, 5)
tup1 = (23, 78);
tup2 = ('ab', 'cd')
tup3 = tup1 + tup2

容器:Dict

empty dict

a= {}
a=dict()

key 支持tuple元组

类似c++ 的 pair<int,int>

bblHashDict[(tmpHigherHash,tmpLowerHash)]=tmpBBL

但是这样就不支持 json.dump:json.dump() 无法序列化以元组(tuple)作为 key 的字典,默认会直接抛出 TypeError(keys must be str, int, float, bool or None)。
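
一个简单的绕过方法(草图,键的拼接格式是自行约定的)是在 dump 之前把元组 key 转成字符串:

import json

bblHashDict = {(0x12, 0x34): "bblA", (0x56, 0x78): "bblB"}   # 示例数据
serializable = {f"{hi},{lo}": v for (hi, lo), v in bblHashDict.items()}
print(json.dumps(serializable))
# 读回时再用 key.split(',') 还原成元组即可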

初始化以及访问

>>> tinydict = {'a': 1, 'b': 2, 'b': '3'}
>>> tinydict['b']
'3'
a_dict = {'color': 'blue'}
for key in a_dict:
 print(key)
# color
for key in a_dict:
    print(key, '->', a_dict[key])
# color -> blue
for item in a_dict.items():
    print(item)
# ('color', 'blue')
for key, value in a_dict.items():
 print(key, '->', value)
# color -> blue

判断key 是否存在

以下是两种常用的方法:

方法一:使用in操作符: in操作符返回一个布尔值,True表示存在,False表示不存在。

my_dict = {"key1": "value1", "key2": "value2", "key3": "value3"}

# 判断是否存在指定的键
if "key2" in my_dict:
    print("Key 'key2' exists in the dictionary.")
else:
    print("Key 'key2' does not exist in the dictionary.")

方法二:使用dict.get()方法: dict.get()方法在键存在时返回对应的值,不存在时返回None。根据需要选择适合的方法进行判断。

my_dict = {"key1": "value1", "key2": "value2", "key3": "value3"}

# 判断是否存在指定的键
if my_dict.get("key2") is not None:
    print("Key 'key2' exists in the dictionary.")
else:
    print("Key 'key2' does not exist in the dictionary.")

这两种方法都可以用来判断字典中是否存在指定的键。

size 大小

len(day)

修改以及添加

tinydict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

tinydict['Age'] = 8 # 更新
tinydict['School'] = "RUNOOB" # 添加

合并

dict1 = {'a': 10, 'b': 8} 
dict2 = {'d': 6, 'c': 4} 

# dict2保留了合并的结果
dict2.update(dict1)
print(dict2)
{'d': 6, 'c': 4, 'a': 10, 'b': 8}

删除

del tinydict['Name']  # 删除键是'Name'的条目
tinydict.clear()      # 清空字典所有条目
del tinydict          # 删除字典
from pprint import pprint
pprint

容器:set

无序不重复序列

初始化

a=  set() # 空set

thisset = set(("Google", "Runoob", "Taobao"))
>>> basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
>>> print(basket)                      # 这里演示的是去重功能

list2set

setL=set(listV)

set2list

my_set = {'Geeks', 'for', 'geeks'}

s = list(my_set)
print(s)
# ['Geeks', 'for', 'geeks']

添加

thisset.add("Facebook")

合并

x = {"apple", "banana", "cherry"}
y = {"google", "runoob", "apple"}

z = x.union(y) 

print(z)
# {'cherry', 'runoob', 'google', 'banana', 'apple'}

删除与清空

s.remove( x )
a.clear()

修改原本的值

修改传入参数

在Python中,函数参数传递的是对象的引用;在函数内部将参数重新赋值(绑定到新对象)不会影响到函数外部的变量。

但是有几种方法可以实现类似修改参数的效果:

  1. 返回修改后的值,在函数外部重新赋值
def func(x):
    x = x + 1 
    return x

a = 10
a = func(a) 
print(a) # 11
  2. 使用可变对象作为参数,修改可变对象的内部值
def func(lst):
    lst.append(1)

lst = [1,2,3]
func(lst)
print(lst) # [1,2,3,1]

这里lst是列表,在func内部修改了lst,由于lst是可变的,所以函数外部的lst也被修改了。

  3. 使用全局变量
count = 0
def func():
    global count
    count += 1

func()
print(count) # 1

通过global关键字声明count为全局变量,这样就可以在函数内部修改全局变量count了。

所以要修改传入参数的值,主要的方法是:

  1. 返回修改后的值并重新赋值
  2. 传入一个可变对象并修改可变对象内部的值
  3. 使用全局变量

这些技巧可以实现模拟修改参数的效果。

修改for循环对象

在Python中,for循环遍历的是一个迭代器,每次循环会将迭代器的下一个元素赋值给循环变量。

如果想在for循环内部修改迭代器中的元素,可以使用以下方法:

  1. 将迭代器转换成列表,然后修改列表中的元素:
all_for_one = ['a', 'b', 'c']

for app_info in list(all_for_one):
    if app_info == 'b':
        all_for_one[1] = 'x' 

print(all_for_one) # ['a', 'x', 'c']

这里通过list()将迭代器转成列表,然后修改列表的元素。

  2. 使用循环的索引而不是直接使用元素:
all_for_one = ['a', 'b', 'c']

for i in range(len(all_for_one)):
    if i == 1:
        all_for_one[i] = 'x'

print(all_for_one) # ['a', 'x', 'c']

通过索引 i 来访问并修改元素。

  3. 使用enumerate()在循环中获取索引:
all_for_one = ['a', 'b', 'c']

for i, app_info in enumerate(all_for_one):
    if i == 1:
        all_for_one[i] = 'x'

print(all_for_one) # ['a', 'x', 'c']

enumerate()可以同时迭代索引和元素。

所以主要思路就是不直接修改循环变量,而是通过索引或临时列表来修改迭代器中的原始元素。

修改for循环的对象 set

对于 set 这样无序、不支持索引的容器,不能直接通过索引或者 enumerate 来修改,因为 set 对象不支持索引和切片操作。

对于 set,可以使用以下方法在循环内部进行修改:

  1. 将 set 转换成 list,修改后再转换回 set
s = {'a', 'b', 'c'}

lst = list(s)              # 先转成 list
for i, item in enumerate(lst):
    if item == 'b':
        lst[i] = 'x'       # 通过索引修改
s = set(lst)               # 再转回 set

print(s) # {'a', 'x', 'c'}
  2. 创建一个新的 set,在循环中添加修改后的元素
s = {'a', 'b', 'c'}
new_s = set()

for item in s:
    if item == 'b':
        new_s.add('x')
    else:
        new_s.add(item)

s = new_s

print(s) # {'a', 'x', 'c'}
  3. 使用 set 的discard()和add()方法在循环中修改
s = {'a', 'b', 'c'}

for item in set(s):        # 遍历副本,避免在遍历时修改原 set
    if item == 'b':
        s.discard(item)
        s.add('x')

print(s) # {'a', 'x', 'c'}

上面这些方法的关键思路都是:

  1. 将 set 转换成支持索引的可变类型(如 list),或遍历它的副本
  2. 在循环中针对可变类型进行修改
  3. 再转换回 set 对象

这样就可以实现循环中修改 Set 的效果。

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://blog.csdn.net/weixin_63719049/article/details/125680242

Benchmark

Differentiated-Bounded Applications

In the context of data movement, the DAMOV framework identifies six distinct types. While the authors primarily utilize temporal locality for app classification, Figure 3 offers a comprehensive view of the locality-based clustering of 44 representative functions, highlighting specific cases that warrant further examination.

Take, for instance, the four cases situated in the lower right corner:

function short name | benchmark | class
CHAHsti | Chai-Hito | 1b: DRAM Latency
CHAOpad | Chai-Padding | 1c: L1/L2 Cache Capacity
PHELinReg | Phoenix-Linear Regression | 1b
PHEStrMat | Phoenix-String Matching | 1b

TLB overhead percentage high -> TLB miss rate high -> memory accesses span a large address range -> spatial locality is low.

benchmark

Chai Benchmark

The Chai benchmark code can be sourced either from DAMOV or directly from its GitHub repository. Chai stands for "Collaborative Heterogeneous Applications for Integrated Architectures."

Installing Chai is a straightforward process. You can achieve it by executing the command python3 compile.py.

One notable feature of the Chai benchmark is its adaptability in terms of input size. Modifying the input size of the following applications is a simple and flexible task:

# cd application directory
./bfs_00 -t 4 -f input/xxx.input
./hsti -n 1024000             # Image Histogram - Input Partitioning (HSTI)
./hsto -n 1024000             # Image Histogram - Output Partitioning (HSTO)
./ooppad -m 1000 -n 1000   # Padding (PAD)
./ooptrns -m 100 -n 10000  # Transpose / In-place Transposition (TRNS)
./sc -n 1024000            # Stream Compaction (SC)
./sri -n 1024000           # select application

# vector pack , 2048 = 1024 * 2, 1024 = 2^n
./vpack -m 2048 -n 1024 -i 2 
# vector unpack , 2048 = 1024 * 2, 1024 = 2^n
./vupack -m 2048 -n 1024 -i 2 

Parboil (how to run)

The Parboil suite was developed from a collection of benchmarks used at the University of Illinois to measure and compare the performance of computation-intensive algorithms executing on either a CPU or a GPU. Each implementation of a GPU algorithm is either in CUDA or OpenCL, and requires a system capable of executing applications using those APIs.

# compile , vim compile.py 
# python2.7 ./parboil compile bfs omp_base 
python3 compile.py

# no idea how to run, failed command: (skip)
python2.7 ./parboil run bfs cuda default 
# exe in benchmarks/*, but need some nowhere input.

Phoenix

Phoenix is a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.

# with code from DAMOV
import os
import sys

os.chdir("phoenix-2.0/tests") # app is changed from sample_apps/*
os.system("make")
os.chdir("../../")

# generate excution in phoenix-2.0/tests/{app}/{app}

# running for example
./phoenix-2.0/tests/linear_regression/linear_regression ./phoenix-2.0/datasets/linear_regression_datafiles/key_file_500MB.txt 

PolyBench

PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics simulation, dynamic programming, statistics, etc.).

PolyBench features include:

  • A single file, tunable at compile-time, used for the kernel instrumentation. It performs extra operations such as cache flushing before the kernel execution, and can set real-time scheduling to prevent OS interference.
# compile using DAMOV code
python compile.py
# exe in OpenMP/compiled, and all running in no parameter

PriM

real apps for the first real PIM platform

Rodinia (developed)

Rodinia, a benchmark suite for heterogeneous parallel computing targeting multi-core CPU and GPU platforms, was first introduced at IISWC 2009.

zsim hooked code from github

  1. Install the CUDA/OCL drivers, SDK and toolkit on your machine.
  2. Modify the common/make.config file to change the settings of the Rodinia home directory and the CUDA/OCL library paths.
  3. It seems to need Intel OpenCL, but Intel has moved everything to oneAPI.
  4. To compile all the programs of the Rodinia benchmark suite, simply use the universal makefile to compile all the programs, or go to each benchmark directory and make individual programs.
  5. The full code with related data can be downloaded from the website.
mkdir -p ./bin/linux/omp
make OMP

Running the zsim hooked apps

cd bin/linux/omp
./pathfinder 100000 100 7
./myocyte.out 100 1 0 4
./lavaMD -cores 4 -boxes1d 10 # -boxes1d  (number of boxes in one dimension, the total number of boxes will be that^3)
./omp/lud_omp -s 8000
./srad 2048 2048 0 127 0 127 2 0.5 2
./backprop 4 65536 # OMP_NUM_THREADS=4


# need to download data or file
./hotspot 1024 1024 2 4 ../../data/hotspot/temp_1024 ../../data/hotspot/power_1024 output.out
./OpenMP/leukocyte 5 4 ../../data/leukocyte/testfile.avi
# streamcluster
./sc_omp k1 k2 d n chunksize clustersize infile outfile nproc
./sc_omp 10 20 256 65536 65536 1000 none output.txt 4
./bfs 4 ../../data/bfs/graph1MW_6.txt 
./kmeans_serial/kmeans -i ../../data/kmeans/kdd_cup
./kmeans_openmp/kmeans -n 4 -i ../../data/kmeans/kdd_cup

dynamic data structures

We choose this specific suite because dynamic data structures are the core of many server workloads (e.g., Memcached's hash table, RocksDB's skip list), and are a great match for near-memory processing.

ASCYLIB + OPTIK1

Graph Apps

Graph500 Benchmark Exploration

Official Version

The official version of the Graph500 benchmark can be downloaded from its GitHub repository. Notable features of this version include:

  • Primarily MPI Implementation: The benchmark is built as an MPI (Message Passing Interface) version, without an accompanying OpenMP version. This can be disappointing for those utilizing tools like zsim.
  • Flexible n Value: By default, the value n is set to powers of 2, but it's possible to change this behavior through configuration adjustments.
  • Customization Options: Environment variables can be altered to modify the execution process. For instance, the BFS (Breadth-First Search) portion can be skipped or the destination path for saved results can be changed.
Unofficial Repository

An alternative unofficial repository also exists. However, it requires OpenCL for compilation. The process can be broken down as follows:

  • OpenCL Dependency: The unofficial repository mandates the presence of OpenCL. To set up OpenCL, you can refer to this tutorial.
sudo apt-get install clinfo
sudo apt-get install opencl-headers
sudo apt install opencl-dev

After completing the OpenCL setup and the compilation process using cmake & make, we obtain the executable file named benchmark. By default, running this executable without any arguments appears to utilize only a single core, despite attempts to set the environment variable with export OMP_NUM_THREADS=32. This default behavior led to a runtime of approximately 5 minutes to generate a report related to edges-node-verify status (or similar). However, for someone without an in-depth technical background, this report can be confusing, especially when trying to locate the BFS (Breadth-First Search) and SSSP (Single-Source Shortest Path) components.

What is even more disheartening is that the TLB (Translation Lookaside Buffer) result is disappointingly low, similar to the performance of the GUPS (Giga Updates Per Second) OpenMP version.

In order to gain a clearer understanding and potentially address these issues, further investigation and potentially adjustments to the program configuration may be necessary.

$ ./tlbstat -c '/staff/shaojiemike/github/graph500_openmp/build/benchmark' 
command is /staff/shaojiemike/github/graph500_openmp/build/benchmark                       
K_CYCLES   K_INSTR      IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB% ITLB%         
20819312   10801013    0.52 7736938    1552557    369122     51902       1.77  0.25
20336549   10689727    0.53 7916323    1544469    354426     48123       1.74  0.24

GAP

Zsim?/LLVM Pass Instrumentation Code from PIMProf paper github

But the graph dataset should be generated by yourself following the github:

# 2^17 nodes 
./converter -g17 -k16 -b kron-17.sg
./converter -g17 -k16 -wb kron-17.wsg

Kernels Included

  • Breadth-First Search (BFS) - direction optimizing
  • Single-Source Shortest Paths (SSSP) - delta stepping
  • PageRank (PR) - iterative method in pull direction
  • Connected Components (CC) - Afforest & Shiloach-Vishkin
  • Betweenness Centrality (BC) - Brandes
  • Triangle Counting (TC) - Order invariant with possible relabelling

ligra

Code also from DAMOV

# compile
python3 compile.py

# 3 kind exe of each app, relative code can be found in /ligra directory
# emd: edgeMapDense() maybe processing related dense-data 
# ems: edgeMapSparse() analyse edge-data
# compute: of course, the core compute part

the source code is difficult to read, skip

the graph format : It seems lines_num = offsets + edges + 3

AdjacencyGraph
16777216 # offsets, and vertex from []
100000000 #   uintE* edges = newA(uintE,m);
0
470
794 # must be monotonically non-decreasing, in range [0,edges); marks where the following vertex's edge list begins
……
14680024 # random values in range [0,vertices-1]; each entry is a neighbor vertex id of the owning vertex (so they form vertex pairs)
16644052
16284631
15850460

$ wc -l  /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M
116777219 /staff/qcjiang/codes/DAMOV/workloads/ligra/inputs/rMat_10M
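
Based on the format guessed above, a small Python checker can sanity-check such a file (only a sketch; the format interpretation may be incomplete):

# Validate an AdjacencyGraph file against the format described above
import sys

def check_adjacency_graph(path):
    with open(path) as f:
        assert f.readline().strip() == "AdjacencyGraph"
        n = int(f.readline())                         # number of offsets (vertices)
        m = int(f.readline())                         # number of edges
        offsets = [int(f.readline()) for _ in range(n)]
        edges = [int(f.readline()) for _ in range(m)]
    assert all(0 <= o <= m for o in offsets)          # offsets fall in [0, edges]
    assert all(offsets[i] <= offsets[i + 1] for i in range(n - 1))  # non-decreasing
    assert all(0 <= e < n for e in edges)             # neighbors are valid vertex ids
    print(f"OK: {n} vertices, {m} edges, {n + m + 3} lines in total")

if __name__ == "__main__":
    check_adjacency_graph(sys.argv[1])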

The PageRank algorithm is covered in another post of mine.

HPC

maybe more computation-intensive than graph applications

parsec

From DAMOV

The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a collection of parallel programs which can be used for performance studies of multiprocessor machines.

# compile
python3 compile_parsec.py

# exe in pkgs/binaries
./pkgs/binaries/blackscholes 4 ./pkgs/inputs/blackscholes/in_64K.txt black.out
./pkgs/binaries/fluidanimate 4 10 ./pkgs/inputs/fluidanimate/in_300K.fluid

STREAM apps

DAMOV code for memory bandwidth testing, which references J. D. McCalpin et al., "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE TCCA Newsletter, 1995.

# compile
python3 compile.py

# default run with Failed Validation error(whatever)
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5072.0     0.205180     0.157728     0.472323
Add:             6946.6     0.276261     0.172746     0.490767

Hardware Effects

Two very interesting repositories of CPU and GPU hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU/GPU and OS architecture.

Test also using DAMOV code

# compile
python3 compile.py

# Every example directory has a README that explains the individual effects.
./build/bandwidth-saturation/bandwidth-saturation 0 1
./build/false-sharing/false-sharing 3 8

Most cases use perf to learn more about your system's capabilities.

HPCC

From DAMOV and "The HPC Challenge (HPCC) Benchmark Suite," in SC, 2006

  • RandomAccess OpenMP version (it is also the GUPS)
# install
python compile.py

# must export
export OMP_NUM_THREADS=32
./OPENMP/ra_omp 26 # 26 need almost 20GB mem, 27 need 30GB mem, and so on.

but the openmp version shows no big pgw overhead

GUPS

描述系统的随机访问能力

From github or hpcc official-web and RandomAccess MPI version

# download using github
make -f Makefile.linux gups_vanilla
# running
gups_vanilla N M chunk
  • N = length of global table is 2^N
  • Thus N = 30 would run with a billion-element table.
  • M = # of update sets per proc 越大代表算得越久(随机访问越久),
  • chunk = # of updates in one set on each proc
  • In the official HPCC benchmark this is specified to be no larger than 1024, but you can run the code with any value you like. Your GUPS performance will typically decrease for smaller chunk size.

测试之后,最佳搭配如下 mpirun -n 32 ./gups_vanilla 32 100000 2048 其中n最多32,不然会爆内存。

n | 32 | 31 | 30 | 29
DTLB% | 66.81 | 45.22 | 19.50 | 13.20
ITLB% | 0.06 | 0.07 | 0.06 | 0.09

或者单独运行 ./gups_vanilla 30 100000 k(n 最多 30),例如 ./tlbstat -c '/usr/bin/mpirun -np 1 /staff/shaojiemike/github/gups/gups_vanilla 30 100000 8192'

n\k 1024 2048 4096 8192
30 44% 90% 80%
27 88%
24 83% 83%
20 58% 62%
15 0.27% 0.3%
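
N / M / chunk 三个参数的含义可以用下面这个 Python 草图直观理解(只示意随机更新的核心逻辑,规模和性能与 MPI 版完全不可比):

# GUPS 风格的随机 read-modify-write 更新
import random

N, M, CHUNK = 20, 100, 1024        # 表长 2^N;每组 CHUNK 次更新,共 M 组
table = [0] * (1 << N)
mask = (1 << N) - 1
for _ in range(M):
    for _ in range(CHUNK):
        idx = random.getrandbits(64) & mask   # 随机索引,访存几乎没有局部性
        table[idx] ^= idx                     # 读-改-写
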
手动构造 bigJump
#include <bits/stdc++.h>
#include "../zsim_hooks.h"
using namespace std;

#define MOD int(1e9)

// 2000 TLB entries is a normal CPU config;
// for 4KB pages, making each data access trigger a TLB miss requires jumping over 1000 ints,
// and repeating after 2000 entries, so we need at least 2000 * 4KB = 8MB of space

// #define BASIC_8MB 2000000 * 2
#define BASIC_8MB (1 << 22)

// 1 second program. stream add 6GB/s, int is 4B, repeated 10^9
// #define all_loops 1000000
#define all_loops (1 << 20)

int main(int argc, char* argv[]) {
   if (argc != 4) {
      std::cerr << "Usage: " << argv[0] << " <space scale> <jump distance scale> <loop times>" << std::endl;
      return 1;
   }

   // Convert the second command-line argument (argv[1]) to an integer
   int N = std::atoi(argv[1]);
   int J = std::atoi(argv[2]);
   int M = std::atoi(argv[3]);

   std::cout << "Number read from command line: " << N << " " << J << " (N,J should not big, [0,5] is best.)" <<std::endl;

   const int size = BASIC_8MB << N;
   const int size_mask = size - 1;
   int * jump_space = (int *)malloc(size * sizeof(int));

   zsim_begin();
   int result = 0;
   int mem_access_count = 0;
   int mem_access_index = 0;
   // int mask = (1<<10<<J)-1;
   // int ran = 0x12345678;
   int mask = (1<<J)-1;
   int ran = (1<<30)-1;
   // without random address, tlb occupancy is alse high
   // ran = (ran << 1) ^ ((int) ran < 0 ? 0x87654321 : 0);
   while(mem_access_count++ < all_loops*M){
      // read & write 
      jump_space[mem_access_index] = ran;
      mem_access_index = (mem_access_index + (1024 + ran & mask) ) & (size_mask);
      // cout << "mem_access_index = " << mem_access_index << endl;
   }
   zsim_end();

   //print first 5 elements
   printf("result %d",result);
}

HPCG

From DAMOV and High Performance Conjugate Gradient Benchmark (HPCG)

HPCG is a software package that performs a fixed number of multigrid preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double precision (64 bit) floating point values. 浮点数的共轭梯度求解

follow the instructions in INSTALL and analyze the compile.py

  1. choose makefile like setup/Make.GCC_OMP
  2. config values like MPdir, but we can leave them beacuse we use GCC_OMP which set -DHPCG_NO_MPI in it
  3. add -DHPGSym to CXXFLAGS or HPCG_OPTS
  4. cd build and ../configure GCC_OMP
  5. run compile.py to compile the executable files
  6. get 4 ComputePrologation in build/bin
  7. test the exe using xhpcg 32 24 16 for three dimension
  8. or xhpcg --nx=16 --rt=120 for NX=NY=NZ=16 and time is 120 seconds
  9. change int refMaxIters = 50; to int refMaxIters = 1; to limit CG iteration number
  10. note that the value of --nx must be a multiple of 8
  11. if there is no geometry arguments on the command line, hpcg will ReadHpcgDat and get the default --nx=104 --rt=60
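
A tiny Python wrapper (in the same spirit as the DAMOV compile.py scripts; the build/bin path below is an assumption) can run the two invocation styles from steps 7-8:

# Run xhpcg with explicit geometry, or with --nx/--rt (paths are assumed)
import subprocess

subprocess.run(["./xhpcg", "32", "24", "16"], cwd="build/bin", check=True)
subprocess.run(["./xhpcg", "--nx=16", "--rt=120"], cwd="build/bin", check=True)
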
value\nx | 96 | 240 | 360 | 480
mem | 17GB | 40GB | 72.8GB |
time(icarus0) | 8s | 84s | 4min40s | core dumped (7mins)

core dumped for xhpcg_HPGPro: ../src/GenerateProblem_ref.cpp:204: void GenerateProblem_ref(SparseMatrix&, Vector*, Vector*, Vector*): Assertion 'totalNumberOfNonzeros>0' failed.

MPdir        = 
MPinc        = 
MPlib        = 

HPCG_OPTS     = -DHPCG_NO_MPI -DHPGSym
compile error
../src/ComputeResidual.cpp:59:13: error: 'n' not specified in enclosing 'parallel'

Just add n to the OpenMP shared clause to fix it.

Database

Hash Joins

This package provides implementations of the main-memory hash join algorithms described and studied in C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu, “Main-Memory Hash Joins on Modern Processor Architectures,” TKDE, 2015.

Test also in DAMOV

# install
python compile.py

# runing
./src/mchashjoins_* -r 12800000 -s 12000000 -x 12345 -y 54321

These cases show TLB resource strain.

AI

Darknet for CV using multi-cores

From DAMOV and official documentation is detailed

# shaojiemike @ snode6 in ~/github/DAMOV/workloads/Darknet on git:main x [14:13:02]
$ ./darknet detect cfg/yolo.cfg ./weights/yolo.weights data/dog.jpg
  • models are stored in weight files; models of different sizes (28MB - 528MB) can be downloaded from the website
  • picture data is in the data files (70KB - 100MB)
  • you must run the exe from that directory. SHIT.

genomics / DNA

BWA

From DAMOV and for mapping DNA sequences against a large reference genome, such as the human genome

  • Download DNA data like ref.fa following the Data download steps
  • It gets stuck when running sratoolkit.3.0.6-ubuntu64/bin/fastq-dump --split-files SRR7733443, because it generates several 96GB SRR7733443_X.fastq files, where X goes from 1 to n.
  • sratoolkit cannot limit the file size, but we can use head -c 100MB SRR7733443_1.fastq > ref_100MB.fastq to get the wanted file size.
  • For further running commands you can read the github
  • ./bwa index -p abc ref_100MB.fastq will generate several abc.suffix files in about 50 seconds.
  • and now you can run ./bwa mem -t 4 abc ref_100MB.fastq or ./bwa aln -t 4 abc ref_100MB.fastq

GASE

GASE - Generic Aligner for Seed-and-Extend


GASE is a DNA read aligner, developed for measuring the mapping accuracy and execution time of different combinations of seeding and extension techniques. GASE is implemented by extending BWA (version 0.7.13) developed by Heng Li.

Code also from DAMOV. But it seems there are some syntax errors in the program, so skip this app.

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks