I will provide some code and timings, but I love how you ignored my
main points:
1) The optimizations of the code *around* the function (i.e. the
callers), which Michael also pointed out, cannot be done in ASM.
2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1,
and Nehalem, you will get totally different results with your ASM code,
while the compilers will generate the best possible code for each.
3) The fact that someone will now have to write optimized versions for
every other architecture.
4) The fact that if the loop is what you're truly worried about, you
can optimize it by hand with __builtin_ia32_rep_movsd (and MSVC has a
similar intrinsic), and still keep the rest of the function portable C.
Also, gcc does support profiling, another fact you don't seem to know.
However, with linker optimizations you do not need a profiler; the
linker will do the static analysis.
Also, to everyone saying things like "I was able to save a <operand
name here>", I hope you understand that smaller != faster.
On 4-Aug-09, at 10:13 AM, Timo Kreuzer wrote:
  Michael Steil wrote:
 I wonder, has either of you, Alex or Timo actually *benchmarked* the
 code on some sort of native i386 CPU before you argue whether it
 should be a stosb or a stosd? If not, writing assembly would be a
 clear case of "premature optimization".
 
 I did. On an Athlon 64 X2, I called the function a bunch of times, with a
 100x100 rect, measuring time with rdtsc. The results (in cycles) were
 quite random, but roughly:
 asm: ~580
 gcc 4.2 -march=k8 -fexpensive-optimizations -O3: ~1800
 WDK: /GL /Oi /Ot /O2 : ~2600
 MSVC 2008 express: /GL /Oi /Ot /O2 ~1800
 Using a 50x50 rect shifts the advantage slightly further toward the asm
 implementation.
 I added volatile to the pointer to prevent the loop from being
 optimized away.
 Using memset was a bit slower than a plain loop.
 This is what MSVC produced with the above settings:
 _DIB_32BPP_ColorFill:
    push  ebx
    mov   ebx, [eax+8]
    sub   ebx, [eax]
    test  ebx, ebx
    jg    short label1
    xor   al, al
    pop   ebx
    retn
 label1:
    mov   ecx, [eax+4]
    push  esi
    mov   esi, [eax+0Ch]
    sub   esi, ecx
    test  esi, esi
    jg    short label2
    pop   esi
    xor   al, al
    pop   ebx
    retn
 label2:
    mov   eax, [edx+4]
    imul  ecx, eax
    add   ecx, [edx]
    cdq
    and   edx, 3
    add   eax, edx
    sar   eax, 2
    add   eax, eax
    push  edi
    mov   edi, ecx
    add   eax, eax
    jmp   short label3
 align 10h
 label3:
    mov   ecx, edi
    mov   edx, ebx
 label4:
    mov   dword ptr [ecx], 3039h
    add   ecx, 4
    sub   edx, 1
    jnz   short label4
    dec   esi
    add   edi, eax
    test  esi, esi
    jg    short label3
    pop   edi
    pop   esi
    mov   al, 1
    pop   ebx
    retn
 I thought I must have done something wrong myself. For me, no compiler
 was able to generate code as fast as the asm code.
 I don't know how Alex managed to get better optimizations; maybe he
 knows a secret ninja /Oxxx switch, maybe the Express and WDK versions
 both suck at optimizing, or maybe I'm just too stupid... ;-)
  See above: If all you want to optimize is the loop, then have C code
 with asm("rep movsd") in it, or fix the static inline memcpy() to be
 more efficient (if it isn't efficient in the first place).
 
 I tried __stosd(), which actually resulted in a faster function: at
 ~610, gcc was almost as fast as the asm implementation, and msvc
 actually won with ~590. But that was not pure portable code. It seems
 to be the best solution, although it will probably still be slower
 unless we set our optimization level to max.
 Btw, I already thought about rewriting our dib code some time ago,
 using inline functions instead of a code generator. The idea is to make
 it fully portable, optimizable through inline asm functions where
 useful, and easier to maintain than the current stuff. It's on my
 list...
 Timo
 _______________________________________________
 Ros-dev mailing list
 Ros-dev(a)reactos.org
 http://www.reactos.org/mailman/listinfo/ros-dev