linux – perf stat does not count memory loads but does count memory stores

Published: 2020-12-14 00:58:46 | Category: Linux | Source: compiled from the web


Linux kernel: 4.10.0-20-generic (also tried on 4.11.3)

Ubuntu: 17.04

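I have been trying to collect memory access statistics with perf stat. I can collect counts for memory stores, but the count for memory loads comes back as 0.

Here are the details for memory stores: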

perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100,37 qubits required
Random seed: 33
Measured 3277 (0.200012),fractional approximation is 1/5.
Odd denominator,trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

       158,115,510      cpu/mem-stores/u                                            

       0.559922797 seconds time elapsed

For memory loads I get a count of 0, as shown below:

perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100,trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

                 0      cpu/mem-loads/u                                             

       0.563806170 seconds time elapsed

I cannot understand why the loads are not being counted properly. Should I be using a different event to get the correct data?

Solution

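On Intel processors, the mem-loads event maps to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event. The MEM_TRANS_RETIRED.LOAD_LATENCY_* events are special and can only be counted with the p modifier. That is, you have to pass mem-loads:p to perf to use the event properly.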

MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event, and it only makes sense to count it precisely. According to an Intel article (emphasis mine):

When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application’s data loads can be tracked at any one time.

As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it was never designed for that purpose.
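To make the quoted description concrete, here is a toy model of a sampled load-latency counter. It is only a sketch: the function name, the tracking probability, and the latency mix are all made up for illustration and are not documented hardware figures. It models only the random-tracking behaviour described in the quote, not the reason the plain mem-loads count came back as 0.

```python
import random


def simulate_load_latency_counter(latencies, threshold, track_probability, seed=0):
    """Toy model: return (tracked, counted) for a stream of load latencies.

    Each load is tracked with probability `track_probability` (an assumed
    value, not a documented hardware figure). A tracked load increments the
    counter only if its latency exceeds `threshold` cycles.
    """
    rng = random.Random(seed)
    tracked = 0
    counted = 0
    for latency in latencies:
        if rng.random() >= track_probability:
            continue  # this load was not picked for tracking
        tracked += 1
        if latency > threshold:
            counted += 1
    return tracked, counted


if __name__ == "__main__":
    rng = random.Random(33)
    # Made-up latency mix in cycles; real workloads will differ.
    latencies = [rng.choice([4, 4, 4, 12, 40, 200]) for _ in range(100_000)]
    tracked, counted = simulate_load_latency_counter(
        latencies, threshold=3, track_probability=0.01
    )
    print(f"total loads: {len(latencies)}, tracked: {tracked}, counted: {counted}")
```

In this model the counter can never exceed the number of tracked loads, which is itself only a small fraction of all loads, so the value says little about the total load count.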

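If you want to determine which instructions in your code issue load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, and it achieves that purpose by using this event.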

If you want to count the total number of retired load uops, use L1-dcache-loads, which maps to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.

On the other hand, mem-stores and L1-dcache-stores map to exactly the same performance event on all current Intel processors, namely MEM_UOPS_RETIRED.ALL_STORES, which counts all retired store uops.

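In summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These map to the raw events you used in the answer you posted, but they are more portable because they also work on AMD processors.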
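As a rough sketch of that recommendation, the Python helper below runs perf stat with the two L1-dcache events and reads the counts back. It assumes perf is on the PATH and relies on the `-x` option for CSV-style output, which perf stat writes to stderr. The function names are hypothetical, and the parser assumes the count is the first CSV field and the event name the third.

```python
import csv
import subprocess


def build_perf_stat_command(events, workload):
    """Build a `perf stat` command line that prints comma-separated counts."""
    return ["perf", "stat", "-x", ",", "-e", ",".join(events), "--", *workload]


def parse_perf_stat_csv(text):
    """Map event name -> count from `perf stat -x,` output.

    Assumes each counter line starts with: value, unit, event name.
    Events reported as "<not counted>" or "<not supported>" map to None.
    Any other line (for example, stderr output of the workload) is skipped.
    """
    counts = {}
    for row in csv.reader(text.splitlines()):
        if len(row) < 3:
            continue
        value, event = row[0].strip(), row[2].strip()
        if value.isdigit():
            counts[event] = int(value)
        elif value.startswith("<"):
            counts[event] = None
    return counts


def count_events(events, workload):
    """Run the workload under perf stat and return the parsed counts."""
    result = subprocess.run(
        build_perf_stat_command(events, workload),
        stderr=subprocess.PIPE,  # perf stat reports its counters on stderr
        text=True,
        check=True,
    )
    return parse_perf_stat_csv(result.stderr)
```

For the benchmark in the question, a call might look like `count_events(["L1-dcache-loads:u", "L1-dcache-stores:u"], ["./libquantum_base.arnab", "100"])`, where the `:u` modifier restricts counting to user space as in the original commands.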

(Editor: 李大同)
