ML Optimizer
导言
- 优化算法(Optimizer)目标是优化(最小化或最大化)一个损失函数,以调整模型参数,使模型在训练数据上表现得更好。
- 在深度学习中,优化算法是训练神经网络时至关重要的组成部分,它们决定了模型参数如何更新以最小化损失。
- 所以梯度下降、动量法、随机梯度下降、RMSprop、Adam、AdamW、LAMB等算法都是优化算法。
导言
256 bit (32 byte) random key
生成加密后文件secretfile.txt.enc
得到加密过的一次性key : secret.key.enc
openssl rsautl -encrypt -oaep -pubin -inkey <(ssh-keygen -e -f recipients-key.pub -m PKCS8) -in secret.key -out secret.key.enc
<()
是子进程的意思。
加密文件 secretfile.txt.enc和 secret.key.enc
暂无
https://www.bjornjohansen.com/encrypt-file-using-ssh-key
https://blog.csdn.net/makenothing/article/details/54645578
GPU Processing Clusters (GPCs),
Texture Processing Clusters (TPCs),
上面两张图组成一个SM,Special Function Units (SFUs)
图中红框是一个SM
10496个流处理器,核心加速频率1.70GHz,384-bit 24GB GDDR6X显存。
在之前的GA100大核心中,每组SM是64个INT32单元、64个FP32单元及32个FP64单元组成的,但在GA102核心中,FP64单元大幅减少,增加了RT Core,Tensor Core也略微减少。
https://zhuanlan.zhihu.com/p/394352476
OpenMP 4.0 提供 OMP_PLACES
和 OMP_PROC_BIND
环境变量来指定程序中的 OpenMP 线程如何绑定到处理器。这两个环境变量通常结合使用。OMP_PLACES
用于指定线程将绑定到的计算机位置(硬件线程、核心或插槽)。OMP_PROC_BIND
用于指定绑定策略(线程关联性策略),这项策略指定如何将线程分配到位置。
除了 OMP_PLACES
和 OMP_PROC_BIND
这两个环境变量外,OpenMP 4.0 还提供可在 parallel 指令中使用的 proc_bind
子句。proc_bind
子句用于指定如何将执行并行区域的线程组绑定到处理器。
OMP_NUM_THREADS=28 OMP_PROC_BIND=true OMP_PLACES=cores:每个线程绑定到一个 core,使用默认的分布(线程 n 绑定到 core n);
OMP_NUM_THREADS=2 OMP_PROC_BIND=true OMP_PLACES=sockets:每个线程绑定到一个 socket;
OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES=cores:每个线程绑定到一个 core,线程在 socket 上连续分布(分别绑定到 core 0,1,2,3;
OMP_NUM_THREADS=4 OMP_PROC_BIND=spread OMP_PLACES=cores:每个线程绑定到一个 core,线程在 socket 上尽量散开分布(分别绑定到 core 0,7,14,21;
静态扩展 * 文本代码在一个编译制导语句之后,被封装到一个结构块中
孤立语句 * 一个OpenMP的编译制导语句不依赖于其它的语句
并行域中的代码被所有的线程执行
for语句指定紧随它的循环语句必须由线程组并行执行;
sections编译制导语句指定内部的代码被划分给线程组中的各线程
不同的section由不同的线程执行
single编译制导语句指定内部代码只有线程组中的一个线程执行。
线程组中没有执行single语句的线程会一直等待代码块的结束,使用nowait子句除外
来自 https://ppc.cs.aalto.fi/ch3/nowait/
#pragma omp critical [name] newline
#pragma omp atomic
x++;
The fastest way is neither critical nor atomic. Approximately, addition with critical section is 200 times more expensive than simple addition, atomic addition is 25 times more expensive then simple addition.(maybe no so much expensive, the atomic operation will have a few cycle overhead (synchronizing a cache line) on the cost of roughly a cycle. A critical section incurs the cost of a lock.)
The fastest option (not always applicable) is to give each thread its own counter and make reduce operation when you need total sum.
omp critical is for mutual exclusion(互斥), omp ordered refers to a specific loop and ensures that the region executes sequentually in the order of loop iterations. Therefore omp ordered is stronger than omp critical, but also only makes sense within a loop.
omp ordered has some other clauses, such as simd to enforce the use of a single SIMD lane only. You can also specify dependencies manually with the depend clause.
Note: Both omp critical and omp ordered regions have an implicit memory flush at the entry and the exit.
vector<int> v;
#pragma omp parallel for ordered schedule(dynamic, anyChunkSizeGreaterThan1)
for (int i = 0; i < n; ++i){
...
...
...
#pragma omp ordered
v.push_back(i);
}
tid List of Timeline
iterations
0 0,1,2 ==o==o==o
1 3,4,5 ==.......o==o==o
2 6,7,8 ==..............o==o==o
=
shows that the thread is executing code in parallel. o
is when the thread is executing the ordered region. .
is the thread being idle, waiting for its turn to execute the ordered region.
With schedule(static,1)
the following would happen:
见 https://docs.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-clauses?view=msvc-160
#pragma omp parallel for collapse(2)
for( int y = y1; y < y2; y++ )
{
for( int x = x1; x < x2; x++ )
{
------------------------------------------------
| static | static | dynamic | dynamic | guided |
| 1 | 5 | 1 | 5 | |
------------------------------------------------
| 0 | 0 | 0 | 2 | 1 |
| 1 | 0 | 3 | 2 | 1 |
| 2 | 0 | 3 | 2 | 1 |
| 3 | 0 | 3 | 2 | 1 |
| 0 | 0 | 2 | 2 | 1 |
| 1 | 1 | 2 | 3 | 3 |
| 2 | 1 | 2 | 3 | 3 |
| 3 | 1 | 0 | 3 | 3 |
| 0 | 1 | 0 | 3 | 3 |
| 1 | 1 | 0 | 3 | 2 |
| 2 | 2 | 1 | 0 | 2 |
| 3 | 2 | 1 | 0 | 2 |
| 0 | 2 | 1 | 0 | 3 |
| 1 | 2 | 2 | 0 | 3 |
| 2 | 2 | 2 | 0 | 0 |
| 3 | 3 | 2 | 1 | 0 |
| 0 | 3 | 3 | 1 | 1 |
| 1 | 3 | 3 | 1 | 1 |
| 2 | 3 | 3 | 1 | 1 |
| 3 | 3 | 0 | 1 | 3 |
------------------------------------------------
private
variables are not initialised, i.e. they start with random values like any other local automatic variable
firstprivate
initial the value as the before value.
lastprivate
save the value to the after region. 这个last的意思不是实际最后运行的一个线程,而是调度发射队列的最后一个线程。从另一个角度上说,如果你保存的值来自随机一个线程,这也是没有意义的。
firstprivate and lastprivate are just special cases of private
#pragma omp parallel
{
#pragma omp for lastprivate(i)
for (i=0; i<n-1; i++)
a[i] = b[i] + b[i+1];
}
a[i]=b[i];
A private
variable is local to a region and will most of the time be placed on the stack
. The lifetime of the variable's privacy is the duration defined of the data scoping clause
. Every thread (including the master thread) makes a private copy of the original variable
(the new variable is no longer storage-associated with the original variable).
A threadprivate
variable on the other hand will be most likely placed in the heap
or in the thread local storage (that can be seen as a global memory local to a thread). A threadprivate variable persist across regions
(depending on some restrictions). The master thread uses the original variable
, all other threads make a private copy of the original variable (the master variable is still storage-associated with the original variable).
可以指定某一task任务在指定第几个thread运行吗?
简单理解sections其实是for的展开形式,适合于少量的“任务”,并且适合于没有迭代关系的“任务”。每一个section被一个线程去执行。
omp_get_thread_num() //获取线程的num,即ID。在并行区域外,获取的是master线程的ID,即为0。
omp_get_num_threads/omp_set_num_threads() //设置/获取线程数量,用于覆盖OMP_NUM_THREADS环境变量的设置。omp_set_num_threads在串行区域调用才会有效,omp_get_num_threads获取当前线程组的线程数量,一般在并行区域调用,在串行区域调用返回为1。
omp_get_max_threads() //返回OpenMP当前环境下能创建线程的最大数量。
OMP_SCHEDULE:只能用到for,parallel for中。它的值就是处理器中循环的次数
OMP_NUM_THREADS:定义执行中最大的线程数
OMP_DYNAMIC:通过设定变量值TRUE或FALSE,来确定是否动态设定并行域执行的线程数
OMP_NESTED:确定是否可以并行嵌套
#include <omp.h>
int main(int argc, _TCHAR* argv[])
{
printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());
omp_set_num_threads(5);
printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());
#pragma omp parallel num_threads(5)
{
// omp_set_num_threads(6); // Do not call it in parallel region
printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());
}
printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());
omp_set_num_threads(6);
printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());
return 0;
}
♦OpenMP为循环级并行提供了方便的功能。线程由编译器根据用户指令创建和管理。
♦pthread提供了更复杂、更动态的方法。线程由用户显式创建和管理。
暂无
暂无
对子句和制导的关系不清楚
https://blog.csdn.net/gengshenghong/article/details/7004594
https://docs.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-clauses?view=msvc-160