Nvprof

Installation¶

$ which nvprof 
/usr/local/cuda/bin/nvprof
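
If which prints nothing, nvprof is probably installed but not on PATH. The following is a minimal sketch that checks PATH first and then falls back to the toolkit location shown above. The helper name find_nvprof is hypothetical, and the fallback path assumes a default CUDA install under /usr/local/cuda.

```shell
# Hypothetical helper: print the path to nvprof, checking PATH first and
# then the default toolkit location (an assumption; adjust if CUDA lives elsewhere).
find_nvprof() {
    if command -v nvprof >/dev/null 2>&1; then
        command -v nvprof
    elif [ -x /usr/local/cuda/bin/nvprof ]; then
        echo /usr/local/cuda/bin/nvprof
    else
        echo "nvprof not found; is the CUDA toolkit installed?" >&2
        return 1
    fi
}

# Usage: NVPROF=$(find_nvprof) && "$NVPROF" ./myApp
```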

Basic usage¶

Summary mode¶

Run the application under nvprof directly from the command line:

nvprof ./myApp

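GPU trace and API trace¶

--print-gpu-trace lists every kernel launch and memory copy in chronological order:

nvprof --print-gpu-trace ./myApp

To trace CUDA runtime and driver API calls instead, use --print-api-trace.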

Saving output to a log file¶

sudo /usr/local/cuda/bin/nvprof --log-file a.log --metrics achieved_occupancy /staff/shaojiemike/github/cutests/22-commonstencil/common
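
When an application spawns several processes, a fixed log name gets overwritten. A possible variation, sketched below, uses the %p placeholder that nvprof expands to the process ID, together with --csv so the log is easier to post-process. The wrapper name profile_to_log and the NVPROF override variable are hypothetical conveniences, not part of nvprof.

```shell
# Hypothetical wrapper: profile an application and write one CSV log per process.
# NVPROF is an assumed override variable for the nvprof binary path.
profile_to_log() {
    "${NVPROF:-nvprof}" --csv --log-file "profile-%p.csv" \
        --metrics achieved_occupancy "$@"
}

# Usage (on a machine with nvprof installed):
#   profile_to_log ./myApp arg1 arg2
```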

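Visualization¶

Nsight can run directly on the remote machine, with its GUI forwarded over X11:

ssh -X host

For X11 forwarding to work, add the following to .ssh/config on the local machine:

# for MacBook Air
XAuthLocation /opt/X11/bin/xauth
ForwardX11Trusted yes
ForwardX11 yes

Visual Profiler can also connect to the remote machine directly over ssh.
Alternatively, export the profiling results and open them in Visual Profiler: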

nvprof --export-profile timeline.prof <app> <app args>
nvprof --analysis-metrics -o nbody-analysis.nvprof ./myApp

Profiling a kernel¶

sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 /staff/shaojiemike/github/cutests/22-commonstencil/best

ncu-ui is the graphical interface for ncu, but I have not figured out how to use it yet.
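
One way to get started with the GUI is to have ncu write a report file and open that file in ncu-ui afterwards. The sketch below reuses the flags from the command above: -k filters by kernel name, -s skips that many matching launches, and -c limits how many launches are profiled. It adds -o, which exports the results to a .ncu-rep file. The wrapper name profile_kernel and the NCU override variable are hypothetical.

```shell
# Hypothetical wrapper around ncu; NCU is an assumed override for the binary path.
# -k: profile only kernels with this name
# -s: number of matching launches to skip before profiling
# -c: maximum number of launches to profile
# -o: export the results to <report>.ncu-rep
profile_kernel() {
    kernel="$1"
    report="$2"
    shift 2
    "${NCU:-ncu}" -k "$kernel" -s 0 -c 1 -o "$report" "$@"
}

# Usage (on a machine with Nsight Compute installed):
#   profile_kernel stencil_kernel stencil_report ./best
```

The resulting stencil_report.ncu-rep can then be opened in ncu-ui, either on the remote machine or after copying it to a local machine.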

Bandwidth profiling¶

Measuring the upper bound¶

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:02:08]
$ ./bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     244.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:03:24]
$ ./bin/x86_64/linux/release/p2pBandwidthLatencyTest
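
To keep only the three bandwidth figures from a run like the one above, the output can be filtered with awk. This is a minimal sketch that assumes the output layout shown here (a section header ending in "Bandwidth, N Device(s)" followed by a row with the transfer size and the bandwidth). The helper name parse_bandwidth is hypothetical.

```shell
# Hypothetical helper: reduce bandwidthTest output to one line per direction.
parse_bandwidth() {
    awk '
        # Section header, e.g. " Host to Device Bandwidth, 1 Device(s)"
        /Bandwidth, [0-9]+ Device/ {
            sub(/ Bandwidth,.*/, "")
            sub(/^[ \t]+/, "")
            label = $0
            next
        }
        # Data row: transfer size in bytes followed by bandwidth in GB/s
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ {
            printf "%s: %s GB/s\n", label, $2
        }
    '
}

# Sample input copied from the run above; on a GPU machine, pipe
# ./bin/x86_64/linux/release/bandwidthTest into parse_bandwidth instead.
parse_bandwidth <<'EOF'
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     244.3
EOF
```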

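Measured values¶

nvprof reports the bandwidth a kernel actually achieves when you pass the DRAM, L1, or L2 metrics to --metrics. See the official nvprof documentation for the exact definition of each metric.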
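
As an illustration, several memory metrics can be requested in one run as a comma-separated list. The sketch below is one possible selection, not a recommended set. The wrapper name measure_memory_throughput and the NVPROF override variable are hypothetical. Which metrics exist depends on the GPU, and nvprof --query-metrics lists the ones the current device supports.

```shell
# Hypothetical wrapper: collect DRAM, L2 and shared-memory throughput in one run.
# NVPROF is an assumed override variable for the nvprof binary path.
measure_memory_throughput() {
    "${NVPROF:-nvprof}" \
        --metrics dram_read_throughput,dram_write_throughput,l2_read_throughput,l2_write_throughput,shared_load_throughput \
        "$@"
}

# Usage (on a machine with nvprof installed):
#   measure_memory_throughput ./myApp
```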

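On Maxwell and Pascal, the L1 cache is unified with the texture cache while shared memory (SMEM) is a separate unit; from Volta onward, L1, the texture cache, and SMEM share one unified on-chip memory.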

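Metric Name	Description
achieved_occupancy	Ratio of the average number of active warps per active cycle to the maximum number of warps supported on a multiprocessor
dram_read_throughput	Device memory read throughput
dram_utilization	Utilization level of device memory relative to peak utilization, on a scale of 0 to 10
shared_load_throughput	Shared memory load throughput
shared_utilization	Utilization level of shared memory relative to peak utilization, on a scale of 0 to 10
l2_utilization	Utilization level of the L2 cache relative to peak utilization, on a scale of 0 to 10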
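
The throughput metrics are reported in absolute units, so it can help to express them as a fraction of the upper bound measured with bandwidthTest. A minimal sketch follows; the helper name percent_of_peak is hypothetical, and the 100 GB/s input is a made-up value, not a measurement. It also assumes that the device-to-device figure counts both the read and the write side of the copy, in which case the sum of dram_read_throughput and dram_write_throughput is the comparable quantity.

```shell
# Hypothetical helper: express a measured throughput as a percentage of a peak.
percent_of_peak() {
    awk -v measured="$1" -v peak="$2" \
        'BEGIN { printf "%.1f%%\n", 100 * measured / peak }'
}

# 100 GB/s is a made-up kernel throughput; 244.3 GB/s is the device-to-device
# figure reported by bandwidthTest on the Tesla P40 above.
percent_of_peak 100 244.3   # prints 40.9%
```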

Further study needed¶

None yet.

Problems encountered¶

None yet.

Motivation, summary, reflections, and rants¶

References¶

None.