delphi – 如何优化DivMod以获得10的常数除数

发布时间：2020-12-15 04:18:01 所属栏目：大数据来源：网络整理

导读：在Delphi的math.pas单元中有一个DivMod程序,我希望将其转换为内联并优化它,除数总是为10.但我不知道五角大楼ASM的细节.波纹管程序的转换是什么？ procedure DivMod(Dividend: Integer; Divisor: Word; var Result,Remainder: Word);asm PUSH EBX MOV EBX,EDX

在Delphi的math.pas单元中有一个DivMod程序,我希望将其转换为内联并优化它,除数总是为10.但我不知道五角大楼ASM的细节.波纹管程序的转换是什么？

procedure DivMod(Dividend: Integer; Divisor: Word;
  var Result,Remainder: Word);
asm
        PUSH    EBX
        MOV     EBX,EDX
        MOV     EDX,EAX
        SHR     EDX,16
        DIV     BX
        MOV     EBX,Remainder
        MOV     [ECX],AX
        MOV     [EBX],DX
        POP     EBX
end;

解决方法

到目前为止,您可以做的最重要的优化是使用定点乘法逆来除以编译时常量： Why does GCC use multiplication by a strange number in implementing integer division?.

任何体面的C编译器都会为你做这件事,但显然Delphi不会,所以有一个合理的理由用asm来做.

你能在EAX中返回一个值,而不是将商和余数存储到内存中吗？看起来像是浪费来传递2个指针args,并强制调用者从内存中检索值. (更新,是的,我认为你可以通过使它成为一个函数而不是一个过程;但我只是盲目地从其他答案中修改Delphi代码.)

无论如何,幸运的是我们可以得到一个C编译器来完成为我们计算乘法逆和移位计数的艰苦工作.我们甚至可以使用它看起来像Delphi用于内联asm的相同“调用约定”. GCC’s regparm=3 32-bit calling convention在EAX,EDX和ECX中传递args(按此顺序).

您可能想要为只需要商的情况制作单独的版本,因为(与慢速div指令不同),如果使用快速,则必须将余数分别计算为x – (x / y)* y乘法逆.但是现在x86上的速度仍然是原来的两倍到4倍.

或者您可以将剩余计算留在纯Delphi中完成,除非编译器在优化时非常糟糕.

#ifdef _MSC_VER
#define CONVENTION  _fastcall   // not the same,but 2 register args are better than none.
#else
#define CONVENTION __attribute__((regparm(3)))
#endif

// use gcc -Os to get it to emit code with actual div.

divmod10(unsigned x,unsigned *quot,unsigned *rem) {
    unsigned tmp = x/10;
    // *quot = tmp;
    *rem = x%10;
    return tmp;
}

From the Godbolt compiler explorer：

# gcc8.2  -O3 -Wall -m32
div10:    # simplified version without the remainder,returns in EAX
        mov     edx,-858993459     # 0xCCCCCCCD
        mul     edx                 # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax,edx
        shr     eax,3
        ret
      # quotient in EAX

# returns quotient in EAX,stores remainder to [ECX]
# quotient pointer in EDX is unused (and destroyed).
divmod10:
        mov     edx,-858993459
        push    ebx
        mov     ebx,eax
        mul     edx                      # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax,3
        # quotient in EAX = high_half(product) >> 3 = product >> (32+3)
        lea     edx,[eax+eax*4]         # EDX = quotient*5
        add     edx,edx                 # EDX = quot * 10
        sub     ebx,edx                 # remainder = dividend - quot*10
        mov     DWORD PTR [ecx],ebx     # store remainder
        pop     ebx
        ret
        # quotient in EAX

这是C编译器输出.根据需要调整Delphi inline asm;我认为输入在Delphi的正确寄存器中.

如果Delphi inline-asm不允许你破坏EDX,你可以保存/恢复它.或者你想删除未使用的商指针输入,然后你可以调整asm,或调整Godbolt上的C并查看新的编译器输出.

这是比使用div更多的指令,但div非常慢(10 uop,即使在Skylake也有26个周期延迟.)

如果在Delphi中有64位整数类型,则可以在Delphi源代码中执行此操作并避免使用内联asm.或者如MBo所示,您可以使用$CCCD作为乘法反转,用于仅使用32位整数类型的0..2 ^ 16-1范围内的输入.

对于其余部分,存储/重载往返(4到5个周期)与最近具有mov-elimination的Intel CPU上的实际计算具有相似的延迟(3 1到商,另外3个用于lea / add / sub = 7),所以不得不使用内联asm这是相当废话.但它仍然比div指令更好地延迟和吞吐量.见https://agner.org/optimize/和其他性能链接in the x86 tag wiki.

您可以复制/粘贴Delphi版本

(如果我做对了,我不知道Delphi,只是在SO和this site上复制了修改过的例子,基于我推断的调用约定/语法)

我不确定我是否获得了inline-asm的arg-passing权利. This RADStudio documentation说：“除了ESP和EBP之外,asm声明在声明中可以对注册内容一无所知.”但我假设args在EAX和EDX中.

使用64位代码的asm可能很愚蠢,因为在64位中你可以有效地使用纯Pascal进行64位乘法. How do I implement an efficient 32 bit DivMod in 64 bit code.所以在{$IFDEF CPUX64}块中,最好的选择可能是纯粹的pascal使用UInt64(3435973837)* num;

function Div10(Num: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Result := Num div 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
        MOV     EDX,$CCCCCCCD
        MUL     EDX                   // EDX:EAX = Num * fixed-point inverse
        MOV     EAX,EDX               // mov then overwrite is ideal for Intel mov-elimination
        SHR     EAX,3
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
         // TODO: use pure pascal for this; Uint64 is efficient on x86-64
        // Num in ECX,upper bits of RCX possibly contain garbage?
        mov     eax,ecx              // zero extend Num into RAX
        mov     ecx,$CCCCCCCD        // doesn't quite fit in a sign-extended 32-bit immediate for imul
        imul    rax,rcx              // RAX = Num * fixed-point inverse
        shr     rax,35               // quotient = eax
end;
{$ENDIF CPUX64}
{$ENDIF}

 {Remainder is the function return value}
function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Quotient := Num div 10;
  Result := Num mod 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
    // Num in EAX,@Quotient in EDX
    push    esi
    mov     ecx,edx           // save @quotient
    mov     edx,$CCCCCCCD
    mov     esi,eax           // save dividend for use in remainder calc
    mul     edx                // EDX:EAX = dividend * 0xCCCCCCCD
    shr     edx,3             // EDX = quotient
    mov     [ecx],edx         // store quotient into @quotient

    lea     edx,[edx + 4*edx] // EDX = quot * 5
    add     edx,edx           // EDX = quot * 10
    mov     eax,esi                  // off the critical path
    sub     eax,edx           // Num - (Num/10)*10
    pop     esi
    // Remainder in EAX = return value
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
        // TODO: use pure pascal for this?  Uint64 is efficient on x86-64
    // Num in ECX,@Quotient in RDX
    mov     r8d,ecx          // zero-extend Num into R8
    mov     eax,$CCCCCCCD
    imul    rax,r8
    shr     rax,35           // quotient in eax

    lea     ecx,[rax + 4*rax]
    add     ecx,ecx          // ecx = 10*(Num/10)
    mov     [rdx],eax        // store quotient

    mov     eax,r8d          // copy Num again
    sub     eax,ecx          // remainder = Num - 10*(Num/10)
    // we could have saved 1 mov instruction by returning the quotient
    // and storing the remainder.  But this balances latency better.
end;
{$ENDIF CPUX64}
{$ENDIF}

存储商并返回余数意味着两者可能在调用者中几乎同时就绪,因为计算商的余数的额外延迟与存储转发重叠. IDK,如果这是好的,或者如果基于商的某些工作开始无序执行通常更好.我猜想如果你打电话给DivMod10,你可能只想要其余的.

但是在分为十进制数字的循环中,重复除以10,它是形成关键路径的商,因此返回商的一个版本并存储余数将是一个更好的选择.

在这种情况下,你将商数作为EAX中的返回值,并将函数arg重命名为余数.

asm基于此C函数版本(https://godbolt.org/z/qu2kvV)的clang输出,针对Windows x64调用约定.但通过一些调整可以提高效率,例如将mov从关键路径中移出,并使用不同的寄存器来避免REX前缀.用一个ADD替换一个LEA.

unsigned divmod10(unsigned x,unsigned *quot) {
    unsigned qtmp = x/10;
    unsigned rtmp = x%10;
     *quot = qtmp;
     //*rem = rtmp;
    return rtmp;
}

我使用clang的版本代替gcc,因为imul r64,r64在Intel CPU和Ryzen上更快(3个周期延迟/ 1个uop). mul r32是3 uops,Sandybridge系列每2个时钟吞吐量只有1个.我认为乘法硬件自然会产生一个128位的结果,并将其中的低64分成edx：eax需要额外的uop,或类似的东西.

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!