c – 如何解释cachegrind输出缓存未命中?
出于好奇,我跑了几个不同版本的矩阵乘法,并运行cachegrind反对它.在下面的结果中,我想知道哪些部分是L1,L2,L3缺失和引用,这些是什么意思?以下是我的矩阵乘法代码,万一有人需要.
#define SLOWEST ==6933== Cachegrind,a cache and branch-prediction profiler ==6933== Copyright (C) 2002-2012,and GNU GPL'd,by Nicholas Nethercote et al. ==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==6933== Command: ./a.out 500 ==6933== --6933-- warning: L3 cache found,using its data for the LL simulation. --6933-- warning: pretending that LL cache has associativity 24 instead of actual 16 Multiplied matrix A and B in 60.7487 seconds. ==6933== ==6933== I refs: 6,039,791,314 ==6933== I1 misses: 1,611 ==6933== LLi misses: 1,519 ==6933== I1 miss rate: 0.00% ==6933== LLi miss rate: 0.00% ==6933== ==6933== D refs: 2,892,704,678 (2,763,005,485 rd + 129,699,193 wr) ==6933== D1 misses: 136,223,560 ( 136,174,705 rd + 48,855 wr) ==6933== LLd misses: 53,675 ( 5,247 rd + 48,428 wr) ==6933== D1 miss rate: 4.7% ( 4.9% + 0.0% ) ==6933== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==6933== ==6933== LL refs: 136,225,171 ( 136,176,316 rd + 48,855 wr) ==6933== LL misses: 55,194 ( 6,766 rd + 48,428 wr) ==6933== LL miss rate: 0.0% ( 0.0% + 0.0% ) #define SLOWER ==8463== Cachegrind,a cache and branch-prediction profiler ==8463== Copyright (C) 2002-2012,by Nicholas Nethercote et al. ==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==8463== Command: ./a.out 500 ==8463== --8463-- warning: L3 cache found,using its data for the LL simulation. --8463-- warning: pretending that LL cache has associativity 24 instead of actual 16 Multiplied matrix A and B in 49.7397 seconds. ==8463== ==8463== I refs: 4,537,213,120 ==8463== I1 misses: 1,571 ==8463== LLi misses: 1,487 ==8463== I1 miss rate: 0.00% ==8463== LLi miss rate: 0.00% ==8463== ==8463== D refs: 2,891,485,608 (2,761,862,312 rd + 129,623,296 wr) ==8463== D1 misses: 59,961,522 ( 59,913,256 rd + 48,266 wr) ==8463== LLd misses: 53,113 ( 5,246 rd + 47,867 wr) ==8463== D1 miss rate: 2.0% ( 2.1% + 0.0% ) ==8463== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==8463== ==8463== LL refs: 59,963,093 ( 59,914,827 rd + 48,266 wr) ==8463== LL misses: 54,600 ( 6,733 rd + 47,867 wr) ==8463== LL miss rate: 0.0% ( 0.0% + 0.0% ) #define SLOW ==9174== Cachegrind,a cache and branch-prediction profiler ==9174== Copyright (C) 2002-2012,by Nicholas Nethercote et al. ==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==9174== Command: ./a.out 500 ==9174== --9174-- warning: L3 cache found,using its data for the LL simulation. --9174-- warning: pretending that LL cache has associativity 24 instead of actual 16 Multiplied matrix A and B in 35.8901 seconds. ==9174== ==9174== I refs: 3,713,059 ==9174== I1 misses: 1,570 ==9174== LLi misses: 1,486 ==9174== I1 miss rate: 0.00% ==9174== LLi miss rate: 0.00% ==9174== ==9174== D refs: 1,893,235,586 (1,112,301 rd + 130,123,285 wr) ==9174== D1 misses: 63,285,950 ( 62,987,684 rd + 298,266 wr) ==9174== LLd misses: 53,867 wr) ==9174== D1 miss rate: 3.3% ( 3.5% + 0.2% ) ==9174== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==9174== ==9174== LL refs: 63,287,520 ( 62,989,254 rd + 298,266 wr) ==9174== LL misses: 54,599 ( 6,732 rd + 47,867 wr) ==9174== LL miss rate: 0.0% ( 0.0% + 0.0% ) #define MEDIUM ==7838== Cachegrind,a cache and branch-prediction profiler ==7838== Copyright (C) 2002-2012,by Nicholas Nethercote et al. ==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==7838== Command: ./a.out 500 ==7838== --7838-- warning: L3 cache found,using its data for the LL simulation. --7838-- warning: pretending that LL cache has associativity 24 instead of actual 16 Multiplied matrix A and B in 23.4097 seconds. ==7838== ==7838== I refs: 2,548,967,151 ==7838== I1 misses: 1,610 ==7838== LLi misses: 1,522 ==7838== I1 miss rate: 0.00% ==7838== LLi miss rate: 0.00% ==7838== ==7838== D refs: 1,399,237,303 (1,267,363,440 rd + 131,873,863 wr) ==7838== D1 misses: 592,807 ( 293,091 rd + 299,716 wr) ==7838== LLd misses: 53,147 ( 5,248 rd + 47,899 wr) ==7838== D1 miss rate: 0.0% ( 0.0% + 0.2% ) ==7838== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==7838== ==7838== LL refs: 594,417 ( 294,701 rd + 299,716 wr) ==7838== LL misses: 54,669 ( 6,770 rd + 47,899 wr) ==7838== LL miss rate: 0.0% ( 0.0% + 0.0% ) #define MEDIUMISH ==8438== Cachegrind,a cache and branch-prediction profiler ==8438== Copyright (C) 2002-2012,by Nicholas Nethercote et al. ==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==8438== Command: ./a.out 500 ==8438== --8438-- warning: L3 cache found,using its data for the LL simulation. --8438-- warning: pretending that LL cache has associativity 24 instead of actual 16 Multiplied matrix A and B in 24.0327 seconds. ==8438== ==8438== I refs: 2,550,211,553 ==8438== I1 misses: 1,576 ==8438== LLi misses: 1,488 ==8438== I1 miss rate: 0.00% ==8438== LLi miss rate: 0.00% ==8438== ==8438== D refs: 1,400,107,343 (1,610,303 rd + 132,497,040 wr) ==8438== D1 misses: 339,977 ( 42,583 rd + 297,394 wr) ==8438== LLd misses: 53,114 ( 5,866 wr) ==8438== D1 miss rate: 0.0% ( 0.0% + 0.2% ) ==8438== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==8438== ==8438== LL refs: 341,553 ( 44,159 rd + 297,394 wr) ==8438== LL misses: 54,602 ( 6,736 rd + 47,866 wr) ==8438== LL miss rate: 0.0% ( 0.0% + 0.0% ) 矩阵乘法码. #if defined(SLOWEST) void multiply (float **A,float **B,float **out,int size) { for (int row=0;row<size;row++) for (int col=0;col<size;col++) for (int in=0;in<size;in++) out[row][col] += A[row][in] * B[in][col]; } // Takes in 1-D arrays,same as before. #elif defined(SLOWER) void multiply (float *A,float *B,float *out,int size) { for (int row=0;row<size;row++) for (int col=0;col<size;col++) for (int in=0;in<size;in++) out[row * size + col] += A[row * size + in] * B[in * size + col]; } // Flips first and second loops #elif defined(SLOW) void multiply (float *A,int size) { for (int col=0;col<size;col++) for (int row=0;row<size;row++) { float curr = 0; // prevents from calculating position each time through for (int in=0;in<size;in++) curr += A[row * size + in] * B[in *size + col]; out[row * size + col] = curr; } } #elif defined(MEDIUM) // Keeps it organized for future codes. float dotProduct(float *A,int size) { float curr = 0; for (int i=0;i<size;i++) curr += A[i] * B[i]; return curr; } void multiply (float *A,int size) { float *temp = new float[size]; for (int col=0;col<size;col++) { for (int i=0;i<size;i++) // stores column into sequential array temp[i] = B[i * size + col]; for (int row=0;row<size;row++) out[row * size + col] = dotProduct(&A[row],temp,size); // uses function above for dot product. } delete[] temp; } #elif defined(MEDIUMISH) float dotProduct(float *A,int size) { for (int i=0;i<size-1;i++) for (int j=i+1;j<size;j++) std::swap(B[i * size + j],B[j * size + i]); for (int col=0;col<size;col++) for (int row=0;row<size;row++) out[row * size + col] = dotProduct(&A[row],&B[row],size); // uses function above for dot product. } #elif defined(FAST) #elif defined(FASTER) #endif 解决方法
根据
documentation cachegrind只能模拟第一个和最后一个级别的缓存:
这意味着您无法获取L2信息,但只能使用L1和L3. cachegrind输出的第一部分报告有关L1指令缓存的信息.在所有示例中,L1指令缓存未命中的数量是不明确的,错过率始终为0%.这意味着您的所有程序都适合您的L1指令缓存. 输出的第二部分报告关于L1和LL(最后一级缓存,在您的情况下为L3)数据高速缓存的信息.使用D1错误率:信息,你应该看到哪个版本的矩阵乘法算法是“最高缓存效率” cachegrind输出的最后部分总结了关于指令和数据的LL(最后一级缓存,在您的情况下为L3)的信息.因此,它给出了缓存服务的存储器访问次数和存储器请求的百分比. (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |