Nvprof
Installation¶
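nvprof ships with the CUDA Toolkit, so there is nothing separate to install; on the machines used below it lives under /usr/local/cuda/bin/.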
Basic usage¶
Summary mode¶
Run nvprof directly on the command line; after the program exits it prints a summary of time spent in each kernel and memory copy (see the sketch below).
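A minimal invocation, with ./myApp standing in for your own executable:

```bash
# summary mode: just prefix the application with nvprof
nvprof ./myApp
```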
Tracing the API¶
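A sketch of the two trace modes, again with ./myApp as a placeholder: --print-gpu-trace lists every kernel launch and memory copy individually, while --print-api-trace lists the CUDA runtime/driver API calls made on the host.

```bash
# per-launch timeline of kernels and memcpys
nvprof --print-gpu-trace ./myApp

# trace of CUDA runtime/driver API calls
nvprof --print-api-trace ./myApp
```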
Saving output to a log file¶
sudo /usr/local/cuda/bin/nvprof --log-file a.log --metrics achieved_occupancy /staff/shaojiemike/github/cutests/22-commonstencil/common
Visualization¶
- nsight can be run directly on the remote machine over X forwarding
    - connect with ssh -X host
    - add the following to ~/.ssh/config (a complete sketch follows the export commands below):
        - XAuthLocation /opt/X11/bin/xauth # for the MacBook Air
        - ForwardX11Trusted yes
        - ForwardX11 yes
- Visual Profiler can also connect to the remote machine directly over ssh
- Alternatively, export the profiling results and open them in Visual Profiler locally:
nvprof --export-profile timeline.prof <app> <app args>
nvprof --analysis-metrics -o nbody-analysis.nvprof ./myApp
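For the X-forwarding route, a minimal sketch of the ~/.ssh/config entry mentioned above, assuming the remote host is called snode0 (the hostname that appears in the shell prompts later in this post):

```bash
# append an X11-forwarding entry for the remote host to ~/.ssh/config
cat >> ~/.ssh/config <<'EOF'
Host snode0
    ForwardX11 yes
    ForwardX11Trusted yes
    XAuthLocation /opt/X11/bin/xauth   # only needed on macOS (XQuartz)
EOF
```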
Profiling a kernel¶
sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 /staff/shaojiemike/github/cutests/22-commonstencil/best
ncu-ui is the graphical front end, but I have not figured it out yet.
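A sketch of collecting specific Nsight Compute metrics from the command line; the metric name below is one of ncu's standard throughput metrics and is only an example:

```bash
# list the metrics ncu supports on this GPU
ncu --query-metrics

# collect one metric for the first launch of stencil_kernel
sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 \
    --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed \
    /staff/shaojiemike/github/cutests/22-commonstencil/best
```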
Bandwidth profiling¶
Measuring the upper bound¶
# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:02:08]
$ ./bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     244.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:03:24]
$ ./bin/x86_64/linux/release/p2pBandwidthLatencyTest
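For reference (hardware numbers assumed from the Tesla P40 spec sheet, not taken from the test output): the P40's theoretical peak memory bandwidth is 384 bit × 7.2 Gbps ÷ 8 ≈ 346 GB/s, so the measured 244.3 GB/s device-to-device figure is roughly 70% of peak, while the host↔device numbers (11.8 / 13.0 GB/s) are bounded by the PCIe 3.0 x16 link (~16 GB/s theoretical).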
Actual values¶
nvprof measures these by requesting metrics tied to DRAM, L1, or L2; see the official documentation for the detailed definitions. An example command that collects the metrics in the table is shown after it.
Note that on Maxwell and later architectures, the L1 cache and shared memory (SMEM) are unified.
| Metric Name | Description |
|---|---|
| achieved_occupancy | Ratio of the average active warps per active cycle to the maximum number of warps supported on an SM |
| dram_read_throughput | Device memory read throughput |
| dram_utilization | Utilization level of the device memory relative to the peak utilization, on a scale of 0 to 10 |
| shared_load_throughput | Shared memory load throughput |
| shared_utilization | Utilization level of the shared memory relative to the peak utilization, on a scale of 0 to 10 |
| l2_utilization | Utilization level of the L2 cache relative to the peak utilization, on a scale of 0 to 10 |
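Putting this together, a sketch of collecting all of the above metrics in one run (./myApp is a placeholder; metric collection may replay kernels, so it runs much slower than a plain execution):

```bash
sudo /usr/local/cuda/bin/nvprof \
    --metrics achieved_occupancy,dram_read_throughput,dram_utilization,shared_load_throughput,shared_utilization,l2_utilization \
    ./myApp
```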
Needs further study¶
None yet
Problems encountered¶
None yet
Why this topic, summary, reflections, rants~~¶
References¶
None