
c++ – Why do the same gcc compile options behave differently on different computer architectures?

Published: 2020-12-16 10:03:21 | Category: Encyclopedia | Source: collected from the web
I use the following two makefiles to compile my program, which performs a Gaussian blur.

> g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
> g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp

My two test environments are:

> i7 4710HQ, 4 cores, 8 threads
> E5 2650

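However, the binary built with the first command runs at 2x speed on the E5 but at 0.5x speed on the i7.
The binary built with the second command runs faster on the i7 but slower on the E5.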

Can anyone explain this?

Here is the source code: https://github.com/makeapp007/interpolateFloatImg

I will give more details as soon as possible.

On the i7 the program runs on 8 threads.
I don't know how many threads the program spawns on the E5.
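One way to answer the thread-count question is to ask the OpenMP runtime directly. The sketch below is not taken from the project; `default_thread_count` is a hypothetical helper name. It assumes the program relies on OpenMP's default team size (no `num_threads` clause and no `OMP_NUM_THREADS` in the environment), which would normally match the number of logical processors. The `/proc/cpuinfo` output further down lists 20 of them on the E5.

```cpp
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: upper bound on the team size that the next OpenMP
// parallel region would use. When the file is built without -fopenmp it
// falls back to the number of hardware threads reported by the C++ runtime.
int default_thread_count() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    unsigned n = std::thread::hardware_concurrency();  // may be 0 if unknown
    return n == 0 ? 1 : static_cast<int>(n);
#endif
}
```

Printing the value of `default_thread_count()` at program start on both machines would show whether the two runs use different team sizes.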

==== Update ====

I am a teammate of the original author of this project. The results on the i7 are as follows.

Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 211199093
Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
1423026.281358      task-clock:u (msec)       #    6.516 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
         2,604      page-faults:u             #    0.002 K/sec                  
4,167,572,543,807      cycles:u                  #    2.929 GHz                      (46.79%)
6,713,517,640,459      instructions:u            #    1.61  insn per cycle           (59.29%)
725,873,982,404      branches:u                #  510.092 M/sec                    (57.28%)
23,468,237,735      branch-misses:u           #    3.23% of all branches          (56.99%)
544,480,682,764      L1-dcache-loads:u         #  382.622 M/sec                    (37.00%)
545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits    (31.44%)
38,696,703,292      LLC-loads:u               #   27.193 M/sec                    (26.68%)
1,204,652      LLC-load-misses:u         #    3.11% of all LL-cache hits     (35.70%)
218.384387536 seconds time elapsed

These are the results from the workstation:

workstation:~/mossCAP3/repos/liuyh1_liujzh/12$ perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 133661220
Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
2035379.528531      task-clock (msec)         #   14.485 CPUs utilized          
         7,370      context-switches          #    0.004 K/sec                  
           273      cpu-migrations            #    0.000 K/sec                  
         3,123      page-faults               #    0.002 K/sec                  
5,272,393,071,699      cycles                    #    2.590 GHz                     [49.99%]
             0      stalled-cycles-frontend   #    0.00% frontend cycles idle   
             0      stalled-cycles-backend    #    0.00% backend  cycles idle   
7,425,570,600,025      instructions              #    1.41  insns per cycle         [62.50%]
370,199,835,630      branches                  #  181.882 M/sec                   [62.50%]
47,444,417,555      branch-misses             #   12.82% of all branches         [62.50%]
591,137,049,749      L1-dcache-loads           #  290.431 M/sec                   [62.51%]
545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits   [62.51%]
38,725,975,976      LLC-loads                 #   19.026 M/sec                   [50.00%]
 1,093,840,555      LLC-load-misses           #    2.82% of all LL-cache hits    [49.99%]
140.520016141 seconds time elapsed

==== Update ====
Specifications of the E5:

workstation:~$cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
     20  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
workstation:~$dmesg | grep cache
[    0.041489] Dentry cache hash table entries: 4194304 (order: 13,33554432 bytes)
[    0.047512] Inode-cache hash table entries: 2097152 (order: 12,16777216 bytes)
[    0.050088] Mount-cache hash table entries: 65536 (order: 7,524288 bytes)
[    0.050121] Mountpoint-cache hash table entries: 65536 (order: 7,524288 bytes)
[    0.558666] PCI: pci_cache_line_size set to 64 bytes
[    0.918203] VFS: Dquot-cache hash table entries: 512 (order 0,4096 bytes)
[    0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[    1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[    1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[    1.549796] sd 4:0:0:0: [sda] Write cache: enabled,read cache: enabled,doesn't support DPO or FUA
[    1.552711] sd 5:0:0:0: [sdb] Write cache: enabled,doesn't support DPO or FUA
[    1.552955] sd 6:0:0:0: [sdc] Write cache: enabled,doesn't support DPO or FUA

Solution

Your program has a very high cache miss ratio. Is that good for the program or bad for it?

545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits

545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits
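The percentages in these two lines are simply load misses divided by loads. As a small check, the sketch below recomputes them from the counter values in the question's `perf stat` output; the function and constant names are made up for illustration. A ratio above 100% is most likely an artifact of counter multiplexing: the percentages at the end of each perf line show that each counter was active for only part of the run, so the reported counts are scaled estimates.

```cpp
#include <cassert>
#include <cstdint>

// Ratio of two perf counters as a percentage, e.g. misses / loads * 100.
double counter_ratio_percent(std::uint64_t numerator, std::uint64_t denominator) {
    assert(denominator != 0);
    return 100.0 * static_cast<double>(numerator) / static_cast<double>(denominator);
}

// Counter values copied from the perf stat output in the question.
constexpr std::uint64_t kI7Loads  = 544480682764ULL;  // L1-dcache-loads:u
constexpr std::uint64_t kI7Misses = 545000783842ULL;  // L1-dcache-load-misses:u
constexpr std::uint64_t kE5Loads  = 591137049749ULL;  // L1-dcache-loads
constexpr std::uint64_t kE5Misses = 545926505523ULL;  // L1-dcache-load-misses
```

Calling `counter_ratio_percent(kI7Misses, kI7Loads)` and `counter_ratio_percent(kE5Misses, kE5Loads)` gives values that round to the 100.10% and 92.35% that perf reports. Either way, nearly every L1 load misses on both machines.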

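Cache sizes may differ between the i7 and the E5, so that is one source of the difference. Others are: different assembler code, different gcc versions, and different gcc options.

You should look into the code, find the hot spot, and analyze how many pixels each instruction processes and which processing order is friendlier to the CPU and memory. Rewriting the hot spot (the part of the code where most of the time is spent) is the key to solving the task http://shtech.org/course/ca/projects/3/.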

You can use the perf profiler in record / report / annotate mode to find the hot spot (this is easier if you recompile the project with the -g option):

# Profile program using cpu cycle performance counter; write profile to perf.data file
perf record ./test test_arg1 test_arg2
# Read perf.data file and report functions where time was spent
#  (Do not change the ./test file, or recompile it, after record and before report)
perf report
# Find the hotspot in the top functions by annotation
#  you may use Arrows and Enter to do "annotate" action from report; or:
perf annotate -s top_function_name
perf annotate -s top_function_name > annotate_func1.txt

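I was able to get a 7x speedup for a small bin file with the arguments 277 10 on my mobile i5-4* (Intel Haswell) with 2 cores (4 virtual cores with HT enabled) and AVX2+FMA.

Some loops / loop nests had to be rewritten. You should understand how the CPU cache works and what is easier for it: missing often or missing rarely. Also, gcc may be dumb and may not always detect the pattern in which data is read; that detection may be needed to process several pixels in parallel.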
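To make the loop-order point concrete, here is a minimal sketch of one pass of a separable blur written in two ways. It is not the project's code, and the actual loops in the repository may look different. The function names, the row-major layout (`index = y * width + x`) and the zero padding at the borders are all assumptions made for illustration. Both functions compute the same vertical 1-D convolution; only the loop nesting differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-by-column order: the inner loops walk down a column, so consecutive
// reads are `width` floats apart and rarely share a cache line.
std::vector<float> blur_vertical_strided(const std::vector<float>& src,
                                         std::size_t width, std::size_t height,
                                         const std::vector<float>& kernel) {
    assert(src.size() == width * height && kernel.size() % 2 == 1);
    const std::size_t radius = kernel.size() / 2;
    std::vector<float> dst(src.size(), 0.0f);
    for (std::size_t x = 0; x < width; ++x) {
        for (std::size_t y = 0; y < height; ++y) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < kernel.size(); ++k) {
                // Skip taps that fall outside the image (zero padding).
                if (y + k < radius || y + k - radius >= height) continue;
                sum += kernel[k] * src[(y + k - radius) * width + x];
            }
            dst[y * width + x] = sum;
        }
    }
    return dst;
}

// Same arithmetic, but x is the innermost loop: each kernel tap adds one
// weighted source row to one destination row, reading memory contiguously.
std::vector<float> blur_vertical_contiguous(const std::vector<float>& src,
                                            std::size_t width, std::size_t height,
                                            const std::vector<float>& kernel) {
    assert(src.size() == width * height && kernel.size() % 2 == 1);
    const std::size_t radius = kernel.size() / 2;
    std::vector<float> dst(src.size(), 0.0f);
    for (std::size_t y = 0; y < height; ++y) {
        float* out = dst.data() + y * width;
        for (std::size_t k = 0; k < kernel.size(); ++k) {
            if (y + k < radius || y + k - radius >= height) continue;
            const float w = kernel[k];
            const float* in = src.data() + (y + k - radius) * width;
            for (std::size_t x = 0; x < width; ++x) {
                out[x] += w * in[x];  // unit stride, easy to vectorize
            }
        }
    }
    return dst;
}
```

With the image from the question (width 20265, presumably 4-byte floats), one step down a column in the first version jumps roughly 81 KB, so nearly every load lands on a different cache line. The second version reads and writes whole rows, which is also a pattern gcc's vectorizer tends to recognize. Timing both on the i7 and the E5 would show how much of the difference comes from access order.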
