c++ – Why do the same gcc compile options behave differently on different computer architectures?
I use the following two makefiles to compile my program, which performs a Gaussian blur.
> g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp

My two test environments are:

> i7 4710HQ, 4 cores, 8 threads

However, the first output runs at 2x speed on the E5 but at 0.5x speed on the i7. Can anyone give some explanation?

Here is the source code: https://github.com/makeapp007/interpolateFloatImg

I will give more details as soon as possible. The program on the i7 will run on 8 threads.

==== Update ====

I am a teammate of the original author of this project, and here are the results.

    Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
    Kernel kernelSize : 255
    Standard deviation : 20
    Kernel maximum: 0.000397887
    Kernel minimum: 1.22439e-21
    Reading width 20265 height 8533 = 172921245
    Micro seconds: 211199093

    Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':

       1423026.281358      task-clock:u (msec)       #    6.516 CPUs utilized
                    0      context-switches:u        #    0.000 K/sec
                    0      cpu-migrations:u          #    0.000 K/sec
                2,604      page-faults:u             #    0.002 K/sec
    4,167,572,543,807      cycles:u                  #    2.929 GHz                      (46.79%)
    6,713,517,640,459      instructions:u            #    1.61  insn per cycle           (59.29%)
      725,873,982,404      branches:u                #  510.092 M/sec                    (57.28%)
       23,468,237,735      branch-misses:u           #    3.23% of all branches          (56.99%)
      544,480,682,764      L1-dcache-loads:u         #  382.622 M/sec                    (37.00%)
      545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits    (31.44%)
       38,696,703,292      LLC-loads:u               #   27.193 M/sec                    (26.68%)
            1,204,652      LLC-load-misses:u         #    3.11% of all LL-cache hits     (35.70%)

        218.384387536 seconds time elapsed

These are the results from the workstation:

    workstation:~/mossCAP3/repos/liuyh1_liujzh/12$ perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
    Kernel kernelSize : 255
    Standard deviation : 20
    Kernel maximum: 0.000397887
    Kernel minimum: 1.22439e-21
    Reading width 20265 height 8533 = 172921245
    Micro seconds: 133661220

    Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':

       2035379.528531      task-clock (msec)         #   14.485 CPUs utilized
                7,370      context-switches          #    0.004 K/sec
                  273      cpu-migrations            #    0.000 K/sec
                3,123      page-faults               #    0.002 K/sec
    5,272,393,071,699      cycles                    #    2.590 GHz                      [49.99%]
                    0      stalled-cycles-frontend   #    0.00% frontend cycles idle
                    0      stalled-cycles-backend    #    0.00% backend cycles idle
    7,425,570,600,025      instructions              #    1.41  insns per cycle          [62.50%]
      370,199,835,630      branches                  #  181.882 M/sec                    [62.50%]
       47,444,417,555      branch-misses             #   12.82% of all branches          [62.50%]
      591,137,049,749      L1-dcache-loads           #  290.431 M/sec                    [62.51%]
      545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits    [62.51%]
       38,725,975,976      LLC-loads                 #   19.026 M/sec                    [50.00%]
        1,093,840,555      LLC-load-misses           #    2.82% of all LL-cache hits     [49.99%]

        140.520016141 seconds time elapsed

==== Update ====

    workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
         20  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz

    workstation:~$ dmesg | grep cache
    [ 0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
    [ 0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
    [ 0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
    [ 0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
    [ 0.558666] PCI: pci_cache_line_size set to 64 bytes
    [ 0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
    [ 0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
    [ 1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
    [ 1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
    [ 1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [ 1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, doesn't support DPO or FUA
    [ 1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, doesn't support DPO or FUA

Solution
Your program has a very high cache miss rate. Is that good for the program or bad for it?
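    545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits
    545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits

Cache sizes may differ between the i7 and the E5, so that is one source of the difference. Others are: different assembler code, different gcc versions, different gcc options.

You should try to look inside the code, find the hotspot, and analyze how many pixels each instruction processes and which order of processing would be better for the CPU and memory. Rewriting the hotspot (the part of the code where most of the time is spent) is the key to solving your task http://shtech.org/course/ca/projects/3/.

You can use the perf profiler in record / report / annotate mode to find the hotspot (it will be easier if you recompile the project with the -g option):

    # Profile program using cpu cycle performance counter; write profile to perf.data file
    perf record ./test test_arg1 test_arg2
    # Read perf.data file and report functions where time was spent
    # (Do not change ./test file, or recompile it after record and before report)
    perf report
    # Find the hotspot in the top functions by annotation
    # you may use Arrows and Enter to do "annotate" action from report; or:
    perf annotate -s top_function_name
    perf annotate -s top_function_name > annotate_func1.txt

I was able to get a 7x speedup with a small bin file and the arguments 277 10 on my mobile i5-4* (Intel Haswell) with 2 cores (4 virtual cores with HT enabled) and AVX2 + FMA.

Some loops / loop nests need to be rewritten. You should understand how the CPU cache works and what is easier for it: to miss often or to miss rarely. Also, gcc may be dumb and may not always detect the pattern in which data is read; that detection may be needed to process several pixels in parallel.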
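To make the point about loop order concrete, here is a minimal C++11 sketch of a vertical blur pass over a row-major float image, written two ways. It is not taken from interpolateFloatImg.cpp: the function names, the clamped border handling, and the assumption that pixel (x, y) lives at `img[y * width + x]` are all hypothetical and chosen only for illustration. Both versions compute the same result; they differ only in the order in which memory is touched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: clamp a (possibly negative) source row into [0, height - 1].
static std::size_t clampRow(long row, std::size_t height) {
    const long last = static_cast<long>(height) - 1;
    return static_cast<std::size_t>(std::max(0L, std::min(row, last)));
}

// Version A: walks down each column. Consecutive reads in the inner loop are
// `width` floats apart, so on a wide image almost every read hits a new cache line.
std::vector<float> blurVerticalColumnWalk(const std::vector<float>& in,
                                          std::size_t width, std::size_t height,
                                          const std::vector<float>& kernel) {
    assert(in.size() == width * height);
    const long radius = static_cast<long>(kernel.size() / 2);
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t x = 0; x < width; ++x) {
        for (std::size_t y = 0; y < height; ++y) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < kernel.size(); ++k) {
                const std::size_t sy =
                    clampRow(static_cast<long>(y) + static_cast<long>(k) - radius, height);
                sum += kernel[k] * in[sy * width + x];  // stride of `width` floats
            }
            out[y * width + x] = sum;
        }
    }
    return out;
}

// Version B: same arithmetic, but the innermost loop sweeps along a row.
// Reads and writes are contiguous, which is friendlier to the cache and gives
// the compiler a simple pattern to vectorize (several pixels per instruction).
std::vector<float> blurVerticalRowWalk(const std::vector<float>& in,
                                       std::size_t width, std::size_t height,
                                       const std::vector<float>& kernel) {
    assert(in.size() == width * height);
    const long radius = static_cast<long>(kernel.size() / 2);
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t y = 0; y < height; ++y) {
        float* dst = out.data() + y * width;
        for (std::size_t k = 0; k < kernel.size(); ++k) {
            const std::size_t sy =
                clampRow(static_cast<long>(y) + static_cast<long>(k) - radius, height);
            const float* src = in.data() + sy * width;
            const float w = kernel[k];
            for (std::size_t x = 0; x < width; ++x) {
                dst[x] += w * src[x];  // unit stride
            }
        }
    }
    return out;
}
```

To compare them, one could call both on the same large image, time each call, and run each build under `perf stat -d` to see how the L1-dcache-load-misses counter changes. How large the gap is depends on the image width, the kernel size, and the cache sizes of the machine, so this sketch should not be read as a prediction for either the i7 or the E5.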