
Nvprof

Installation

nvprof ships with the CUDA Toolkit. Check that it is on your PATH:

$ which nvprof 
/usr/local/cuda/bin/nvprof

Basic usage

Summary mode

Run the application under nvprof directly from the command line:

nvprof ./myApp

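GPU trace

The --print-gpu-trace option lists every kernel launch and memory copy in chronological order. To trace CUDA API calls instead, use --print-api-trace.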

nvprof --print-gpu-trace ./myApp

Save the output to a log file (reading hardware metrics may require root privileges, hence sudo):

sudo /usr/local/cuda/bin/nvprof --log-file a.log --metrics achieved_occupancy /staff/shaojiemike/github/cutests/22-commonstencil/common
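
When the same run is repeated for several binaries or metrics, it can help to assemble the command in a script. The following is a minimal Python sketch, not part of nvprof itself: the helper name build_nvprof_cmd and the example paths are hypothetical, and it relies only on the --log-file and --metrics flags used above.

```python
import shlex


def build_nvprof_cmd(app, metrics=(), log_file=None, nvprof="nvprof"):
    """Assemble an nvprof argv list (hypothetical helper, not an nvprof API)."""
    cmd = [nvprof]
    if log_file:
        cmd += ["--log-file", log_file]
    if metrics:
        # Several metrics are passed as one comma-separated argument.
        cmd += ["--metrics", ",".join(metrics)]
    cmd.append(app)
    return cmd


# Placeholder application and log names; hand the list to subprocess.run() to execute it.
cmd = build_nvprof_cmd(
    "./myApp",
    metrics=["achieved_occupancy", "dram_utilization"],
    log_file="a.log",
)
print(shlex.join(cmd))
```

Building an argv list rather than a single string avoids shell quoting problems when a path contains spaces.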

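Visualization

  1. Nsight can be run directly on the remote machine.
  2. Connect with X11 forwarding so the GUI displays locally: ssh -X host
  3. Add the following to .ssh/config:
    1. XAuthLocation /opt/X11/bin/xauth #for macbookAir
    2. ForwardX11Trusted yes
    3. ForwardX11 yes
  4. Visual Profiler can also connect to the remote machine directly over SSH.
  5. Alternatively, export the profiling results and open them in Visual Profiler: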
nvprof --export-profile timeline.prof <app> <app args>
nvprof --analysis-metrics -o  nbody-analysis.nvprof ./myApp

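Profiling a single kernel

With ncu (Nsight Compute), -k selects the kernel by name, -s sets how many matching launches to skip, and -c limits how many launches are profiled: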

sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 /staff/shaojiemike/github/cutests/22-commonstencil/best

ncu-ui is the graphical front end, but I have not figured out how to use it yet.

Bandwidth profiling

Measuring the upper bound

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:02:08]
$ ./bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     244.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:03:24]
$ ./bin/x86_64/linux/release/p2pBandwidthLatencyTest
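
To reuse these numbers as a reference peak in later comparisons, the bandwidthTest output can be parsed with a few lines of Python. This is a rough sketch that assumes the quick-mode layout shown above (a section header followed by one data row); the function name parse_bandwidth is my own.

```python
import re

# Sample text copied from the bandwidthTest run above (Tesla P40, quick mode).
SAMPLE = """\
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     244.3
"""

HEADER = re.compile(r"^\s*(Host to Device|Device to Host|Device to Device) Bandwidth")
ROW = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_bandwidth(text):
    """Return {direction: GB/s}, taking the first data row of each section."""
    result = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        row = ROW.match(line)
        if row and current and current not in result:
            result[current] = float(row.group(2))
    return result


bandwidth = parse_bandwidth(SAMPLE)
print(bandwidth)
```

Only the first row per section is kept, so the sketch would need adjusting for the range or shmoo modes, which print several transfer sizes.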

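Actual values

nvprof measures the achieved bandwidth through metrics for DRAM, L1, or L2. See the official NVIDIA documentation for the full definitions.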

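On Maxwell and Pascal, L1 is unified with the texture cache while shared memory (SMEM) is a separate, dedicated unit. From Volta onward, L1, the texture cache, and SMEM share a single on-chip memory.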

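Metric Name               Description
achieved_occupancy        Ratio of the average number of active warps per active cycle to the maximum number of warps supported on an SM
dram_read_throughput      Device memory read throughput
dram_utilization          Utilization level of device memory relative to peak utilization, on a scale of 0 to 10
shared_load_throughput    Shared memory load throughput
shared_utilization        Utilization level of shared memory relative to peak utilization, on a scale of 0 to 10
l2_utilization            Utilization level of the L2 cache relative to peak utilization, on a scale of 0 to 10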
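
To make two of these metrics concrete, here is a small Python sketch of the arithmetic behind them. It is only an illustration: the input values are hypothetical, the limit of 64 warps per SM assumes a compute capability 6.1 device such as the Tesla P40, and the 244.3 GB/s reference is the device-to-device figure from the bandwidthTest run earlier in this post rather than the hardware's theoretical peak.

```python
def occupancy(avg_active_warps, max_warps_per_sm):
    """Achieved occupancy: average active warps per active cycle over the SM limit."""
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    return avg_active_warps / max_warps_per_sm


def fraction_of_peak(measured_gbs, peak_gbs):
    """Measured throughput as a fraction of a reference peak bandwidth."""
    if peak_gbs <= 0:
        raise ValueError("peak_gbs must be positive")
    return measured_gbs / peak_gbs


# Hypothetical inputs, for illustration only.
occ = occupancy(avg_active_warps=48.0, max_warps_per_sm=64)  # 64 assumes compute capability 6.1
frac = fraction_of_peak(measured_gbs=120.0, peak_gbs=244.3)  # 244.3 GB/s measured by bandwidthTest

print(f"achieved occupancy: {occ:.2f}")
print(f"fraction of measured peak: {frac:.1%}")
```

The sketch deliberately stops at a plain fraction and does not try to reproduce how nvprof buckets values into its 0 to 10 utilization levels.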

Further study needed

None yet.

Problems encountered

None yet.

Motivation, summary, reflections, rants
