uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures (2019)
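Abstract¶
Modern microarchitectures are among the most complex man-made systems. Predicting, explaining, and optimizing the performance of software running on them is difficult. Doing so requires faithful models of their runtime behavior, but such models are scarce.
This paper designs and implements a tool that builds faithful models of the latency, throughput, and port usage of x86 instructions. It also carefully examines the definitions of these three metrics, in particular how the latency of an instruction is determined for different operand cases.
The results are machine-readable, and all existing Intel architectures were tested.
The results are available on the project website: http://www.uops.info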
We also plan to release the source code of our tool as open source
1 Introduction¶
2 Related Work¶
Information provided by Intel¶
Measurement-based Approaches¶
3 Background¶
Pipeline of Intel Core CPUs¶
Assembler Instructions¶
Hardware Performance Counters¶
4 Definitions¶
Latency¶
Throughput¶
Port Usage¶
5 Algorithms¶
Port Usage¶
- Finding Blocking Instructions
- Port Usage Algorithm
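The idea behind the port usage algorithm can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: the hardware measurement (performance counters plus blocking instructions) is replaced by a hypothetical `simulated_measurement` function that knows the true port usage, and the names `infer_port_usage` and `block_rep` are made up for this example. The loop visits port combinations from smallest to largest, subtracts the µops of the blocking instructions, and then subtracts the µops already attributed to strict subsets.

```python
def simulated_measurement(true_usage, pc, block_rep):
    """Hypothetical stand-in for a hardware measurement.

    Assumption: when all ports in `pc` are saturated by `block_rep`
    blocking uops, only instruction uops that can run nowhere else
    (their port set is a subset of `pc`) are counted on `pc`.
    """
    instr_uops = sum(n for ports, n in true_usage.items() if ports <= pc)
    return block_rep + instr_uops


def infer_port_usage(port_combinations, measure, block_rep=8):
    """Attribute uops to port combinations, smallest combinations first."""
    found = {}
    for pc in sorted(port_combinations, key=len):
        # Remove the contribution of the blocking instructions.
        uops = measure(pc, block_rep) - block_rep
        # Remove uops already attributed to strict subsets of pc.
        uops -= sum(n for prev, n in found.items() if prev < pc)
        if uops > 0:
            found[pc] = uops
    return found


# Illustrative ground truth: 3 uops on ports {0,1,5}, 1 uop on ports {2,3}.
true_usage = {frozenset("015"): 3, frozenset("23"): 1}
combos = [frozenset(s) for s in ("0", "1", "5", "01", "015", "0156", "23", "237", "4")]
result = infer_port_usage(
    combos, lambda pc, rep: simulated_measurement(true_usage, pc, rep)
)
```

In this simulated setting, `result` recovers the ground-truth usage; on real hardware the measurement function would have to be backed by per-port performance counters.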
Latency¶
- Register → Register
  - Both registers are general-purpose registers
  - Both registers are SIMD registers
  - The registers have different types
- Memory → Register
- Status Flags → Register
- Register → Memory
- Divisions
Throughput¶
- Measuring Throughput
- Computing Throughput from Port Usage
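As a rough illustration of computing throughput from port usage, the sketch below takes a port usage string in the `3*p015+1*p23` notation and computes the minimum average number of cycles per instruction that the ports allow. It is a sketch under assumptions, not the paper's method: it assumes µops can be spread evenly across their allowed ports, that ports are single digits, and that execution ports are the only bottleneck. Rather than solving an optimization problem directly, it uses the equivalent brute-force bound: for every subset of ports, the µops confined to that subset divided by the subset size. The function names are invented for this example.

```python
from itertools import combinations


def parse_port_usage(usage: str) -> dict:
    """Parse a string such as '3*p015+1*p23' into {frozenset(ports): uops}."""
    result = {}
    for term in usage.split("+"):
        count, ports = term.strip().split("*p")
        key = frozenset(ports)  # assumes single-digit port numbers
        result[key] = result.get(key, 0) + int(count)
    return result


def throughput_from_port_usage(usage: str) -> float:
    """Lower bound on cycles per instruction implied by port pressure."""
    uops = parse_port_usage(usage)
    all_ports = sorted(set().union(*uops))
    best = 0.0
    for size in range(1, len(all_ports) + 1):
        for subset in combinations(all_ports, size):
            chosen = set(subset)
            # uops that cannot escape to a port outside the subset
            confined = sum(n for ports, n in uops.items() if ports <= chosen)
            best = max(best, confined / size)
    return best


print(throughput_from_port_usage("3*p015+1*p23"))  # 1.0
print(throughput_from_port_usage("1*p0156"))       # 0.25
```

The subset enumeration is exponential in the number of ports, which is acceptable for the handful of ports on these CPUs.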
6 Implementation¶
Details of the x86 Instruction Set¶
Measurements on the Hardware¶
Analysis Using Intel IACA¶
Machine-readable Output¶
7 Evaluation¶
(Details omitted.)
8 Limitations¶
9 Conclusions and Future Work¶
Our tool can be used to improve software such as llvm-mca.
Future work includes adapting our algorithms to AMD x86 CPUs. (The official website has already implemented this.)
We would also like to extend our approach to characterize other undocumented performance-relevant aspects of the pipeline, e.g., regarding micro and macro-fusion, or whether instructions use the simple decoder, the complex decoder, or the Microcode-ROM.
Further Study Needed¶
None for now.
Problems Encountered¶
None for now.
Motivation, Summary, Reflections, and Rants¶
References¶
None