Write-Combining Optimization in x264
Imagine we want to copy a few variables from one array to another. The natural way to write this is to assign the elements one at a time. But a single 32-bit read/write may be no slower than a single 8-bit or 16-bit one, so instead of four operations we can copy the whole array in one. Faster speed-wise and shorter code-wise, too. The alignment requirement is there to ensure we don't copy between unaligned arrays, which could crash on non-x86 architectures (e.g. PowerPC) and would also go slightly slower on x86 (but still faster than the uncombined writes).

But, one might ask, can't the compiler do this for us? Well, there are many reasons it doesn't happen. We'll start from the easiest case and go to the hardest case.

The easiest case is zeroing a simple struct, say one containing two 16-bit integers a and b. Writing zero to a and b is the same as writing a single 32-bit zero to the whole struct. But GCC doesn't even optimize this; it still assigns the zeroes separately! How stupid. The same applies to directly accessed arrays (whose alignment the compiler actually does know) being assigned zero or constant values: write combining here is trivial, but again, GCC doesn't do it.

Now, we get to the harder stuff. What if we're copying between two arrays, both of which are directly accessed? Now the compiler has to be able to detect this sequential copying and merge it. This is basically a simple form of autovectorization; it's no surprise at all that GCC doesn't do it.

The hardest case is pointer arguments: the compiler has no way to know whether the data they point to is aligned (though we as programmers might know that it always is). There are cases where it could make accurate derivations (by annotating pointers passed between functions) as to whether they are aligned or not, in which case it might be able to do write combining; this would of course be very difficult. Of course, on x86 it's still worthwhile to combine even if there's a misalignment risk, since the code will only go slightly slower rather than crash.
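To make the idea concrete, here is a minimal sketch of the technique in C. The names are illustrative, not x264's actual macros; the aligned union is one way to guarantee the 8-byte alignment that makes the combined access safe on strict-alignment architectures like PowerPC.

```c
#include <stdint.h>

/* Hypothetical helper: four 16-bit values overlaid with one 64-bit word.
 * The alignment attribute guarantees the combined access is legal even on
 * architectures that fault on unaligned loads/stores. */
typedef union {
    int16_t  v[4];
    uint64_t w;
} __attribute__((aligned(8))) vec4x16;

/* Naive version: four separate 16-bit stores. */
static inline void copy4_separate(int16_t *dst, const int16_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

/* Write-combined version: a single 64-bit store replaces all four. */
static inline void copy4_combined(vec4x16 *dst, const vec4x16 *src)
{
    dst->w = src->w;
}
```

On a 32-bit target the compiler splits the 64-bit store into two 32-bit ones, which is still half the number of operations of the naive loop.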
The end result of this kind of operation is a massive speed boost in such functions. For example, in the section where motion vectors are cached (in macroblock_cache_save), I got over double the speed by converting 16-bit copies to write-combined copies. This of course is only on a 32-bit system; on a 64-bit system we could do even better. The code uses 64-bit types so that a 64-bit compiled binary will do it as well as it can; the compiler is smart enough to split those copies on 32-bit systems.

We could actually do even better if we were willing to use MMX or SSE, since MMX can be used for 64-bit copies on 32-bit systems and SSE for 128-bit copies. Unfortunately, this would completely sacrifice portability, and at this point the speed boost over the current merged copies would be pretty small.

Since motion vectors are pairs of 16-bit components, it's quite easy to manipulate them as pairs. This allowed me to drastically speed up a lot of the manipulation involved in motion vector prediction and in general copying and storing. The result of all the issues described in this article is a massive diff.
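For completeness, here is a sketch of what the SSE variant mentioned above could look like. This is illustrative only (the article rejects this approach for portability reasons); it copies 16 bytes, e.g. eight 16-bit motion-vector components, with a single SSE2 128-bit load/store pair.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, x86 only */

/* Copy 16 bytes in one 128-bit operation. Both buffers must be 16-byte
 * aligned, or _mm_load_si128/_mm_store_si128 will fault -- exactly the
 * kind of portability constraint the article warns about. */
static inline void copy128_sse2(void *dst, const void *src)
{
    _mm_store_si128((__m128i *)dst, _mm_load_si128((const __m128i *)src));
}
```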