On 4-Aug-09, at 12:36 PM, Timo Kreuzer wrote:

Alex Ionescu wrote:
I will provide some code and timings but I love how you ignored my  
main points:
  
And you ignored my main points:
1) The optimization "around" the function is not important, as the function is not called that often, the loop is much more important.

If the function is not called often, then your use of ASM as an optimization is what we call "premature optimization".

You should spend your time profiling the codebase and identifying real bottlenecks.

2) It doesn't matter if the function performs differently on different machines as long as it's always faster than the portable code.

Which is an assumption you're making. My point is that it won't be.

3) No one is forced to write optimized versions; we have a C version for all other architectures.

So now we have a huge performance delta (if it were to exist) between different architectures, as well as two code bases (and possibly more, as people write more ASM versions), and eventually there are 10 versions of the same function, with different bugs.

Great!

(Please don't write code in a real company's product, kthx).

4) I'm not worried about the loop, the loop is fine the way I wrote it ;-)

I don't get it? You just claimed the loop is "90%", and that yours is better because it's in ASM and uses REP MOVSD. So take the C version, and make an inline REP MOVSD instead of a memset, and you now have your exact code, but written in C.

I just claimed that you couldn't provide a faster C version. Faster in terms of real-life usage. And I'm still waiting for you to prove me wrong.

1) The optimizations of the code *around* the function (ie: the  
callers), which Michael also pointed out, cannot be done in ASM.
2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and  
Nehalem you will get totally different results with your ASM code,  
while the compilers will generate the best possible code.
3) The fact that someone will now have to write optimized versions for  
each other architecture
4) The fact that if the loop is what you're truly worried about, you  
can optimize it by hand with __builtinia32_rep_movsd (and MSVC has a  
similar intrinsic), and still keep the rest of the function portable C.
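
A minimal sketch of what point 4 could look like, not the actual ReactOS code: the `Rect` and `Surface32` types and the names `fill_row32` and `color_fill32` are hypothetical. Only the inner fill is architecture-specific. Note that it uses `rep stosl` (a store) rather than a `movsd`, since a solid fill has no source buffer to copy from; on MSVC it assumes the `__stosd` intrinsic from `<intrin.h>`, and everywhere else it falls back to a plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

/* Hypothetical types: a rectangle and a 32bpp surface. */
typedef struct { long left, top, right, bottom; } Rect;
typedef struct { uint8_t *bits; long stride; /* bytes per scanline */ } Surface32;

/* Fill `count` 32-bit pixels starting at `dst`. This helper is the only
 * architecture-specific piece; the caller stays portable C. */
static inline void fill_row32(uint32_t *dst, uint32_t color, size_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* rep stosl: store EAX at [EDI/RDI], ECX/RCX times. */
    __asm__ volatile("rep stosl"
                     : "+D"(dst), "+c"(count)
                     : "a"(color)
                     : "memory");
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    __stosd((unsigned long *)dst, color, count);
#else
    while (count--)
        *dst++ = color;
#endif
}

/* Returns 0 for an empty rectangle, 1 after filling a non-empty one. */
static int color_fill32(Surface32 *s, const Rect *r, uint32_t color)
{
    assert(s != NULL && r != NULL);

    long width  = r->right - r->left;
    long height = r->bottom - r->top;
    if (width <= 0 || height <= 0)
        return 0;

    uint8_t *row = s->bits + r->top * s->stride + r->left * 4;
    for (long y = 0; y < height; y++) {
        fill_row32((uint32_t *)row, color, (size_t)width);
        row += s->stride;
    }
    return 1;
}
```

The compiler can still inline and optimize `color_fill32` into its callers, which a standalone ASM routine does not allow.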

Also, gcc does support profiling, another fact you don't seem to know.  
However, with linker optimizations, you do not need a profiler, the  
linker will do the static analysis.

Also, to everyone saying things like "I was able to save a <operand  
name here>", I hope you understand that smaller != faster.

On 4-Aug-09, at 10:13 AM, Timo Kreuzer wrote:

  
Michael Steil wrote:
    
I wonder, has either of you, Alex or Timo actually *benchmarked* the
code on some sort of native i386 CPU before you argue whether it
should be a stosb or a stosd? If not, writing assembly would be a
clear case of "premature optimization".

      
I did. On an Athlon X2 64, I called the function a bunch of times with a
100x100 rect, measuring time with rdtsc. The results were quite random,
but roughly:
asm: ~580
gcc 4.2 (-march=k8 -fexpensive-optimizations -O3): ~1800
WDK (/GL /Oi /Ot /O2): ~2600
MSVC 2008 Express (/GL /Oi /Ot /O2): ~1800

Using a 50x50 rect shifts the advantage slightly in the direction of the asm
implementations.
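
For anyone who wants to repeat this kind of measurement, here is a rough sketch of a timing harness. It is not the harness used for the numbers above; `read_ticks`, `fill_plain`, `best_ticks`, and the constants are made-up names. It assumes `__rdtsc()` from `<x86intrin.h>` (GCC/Clang) or `<intrin.h>` (MSVC) on x86, and falls back to `clock()` elsewhere. It keeps the minimum over many runs, because single rdtsc samples are noisy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <x86intrin.h>
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 builds: much coarser, but keeps the sketch portable. */
static inline uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

enum { WIDTH = 100, HEIGHT = 100, RUNS = 1000 };

static uint32_t g_pixels[WIDTH * HEIGHT];

/* Plain C fill. The volatile pointer stops the compiler from discarding
 * the stores when nothing reads the buffer afterwards. */
static void fill_plain(uint32_t color)
{
    volatile uint32_t *p = g_pixels;
    for (size_t i = 0; i < (size_t)WIDTH * HEIGHT; i++)
        p[i] = color;
}

/* Time RUNS calls of `fn` and return the smallest tick count seen, which
 * is less sensitive to interrupts and cold caches than a single sample. */
static uint64_t best_ticks(void (*fn)(uint32_t), uint32_t color)
{
    assert(fn != NULL);

    uint64_t best = UINT64_MAX;
    for (int i = 0; i < RUNS; i++) {
        uint64_t start = read_ticks();
        fn(color);
        uint64_t elapsed = read_ticks() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Calling `best_ticks(fill_plain, 0x3039)` and then the same for each alternative fill gives numbers that can be compared on one machine; they say nothing about other CPUs.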

I added volatile to the pointer to prevent the loop from being optimized away.
Using memset was a bit slower than a normal loop.
This is what msvc produced with the above settings:

_DIB_32BPP_ColorFill:
    push  ebx
    mov   ebx, [eax+8]
    sub   ebx, [eax]
    test  ebx, ebx
    jg    short label1
    xor   al, al
    pop   ebx
    retn

label1:
    mov   ecx, [eax+4]
    push  esi
    mov   esi, [eax+0Ch]
    sub   esi, ecx
    test  esi, esi
    jg    short label2
    pop   esi
    xor   al, al
    pop   ebx
    retn

label2:
    mov   eax, [edx+4]
    imul  ecx, eax
    add   ecx, [edx]
    cdq
    and   edx, 3
    add   eax, edx
    sar   eax, 2
    add   eax, eax
    push  edi
    mov   edi, ecx
    add   eax, eax
    jmp   short label3

align 10h
label3:
    mov   ecx, edi
    mov   edx, ebx

label4:
    mov   dword ptr [ecx], 3039h
    add   ecx, 4
    sub   edx, 1
    jnz   short label4

    dec   esi
    add   edi, eax
    test  esi, esi
    jg    short label3

    pop   edi
    pop   esi
    mov   al, 1
    pop   ebx
    retn



I thought I had done something wrong myself. For me, no compiler was able to
generate code as fast as the asm code.
I don't know how Alex managed to get better optimizations; maybe he
knows a secret ninja /Oxxx switch, or maybe the Express and WDK versions both
suck at optimizing, or maybe I'm just too stupid... ;-)


    
See above: If all you want to optimize is the loop, then have C code
with asm("rep movsd") in it, or fix the static inline memcpy() to be
more efficient (if it isn't efficient in the first place).

      
I tried __stosd(), which actually resulted in a faster function. With
~610, gcc was almost as fast as the asm implementation; msvc actually
won with 590. But that was not using purely portable code. It's the best
solution, it seems, although it will probably still be slower unless we
set our optimization to max.

Btw, I already thought about rewriting our dib code some time ago, using
inline functions instead of a code generator. The idea is to make it
fully portable, optimizable through inline asm functions where useful, and
easier to maintain than the current stuff. It's on my list...
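
One way such an inline-function design might look, purely as an illustration: none of these names (`put_span8`, `put_span16`, `put_span32`, `dib_fill_span`, `dib32_fill_span`) exist in the current dib code. A single generic helper dispatches on the bit depth, and the per-format entry points call it with a constant, so the compiler can fold the switch away. Any of the small span helpers could later be swapped for an inline asm version without touching the callers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-format span helpers, all plain portable C. */
static inline void put_span8(uint8_t *dst, uint32_t color, size_t n)
{
    while (n--)
        *dst++ = (uint8_t)color;
}

static inline void put_span16(uint16_t *dst, uint32_t color, size_t n)
{
    while (n--)
        *dst++ = (uint16_t)color;
}

static inline void put_span32(uint32_t *dst, uint32_t color, size_t n)
{
    while (n--)
        *dst++ = color;
}

/* Generic helper: fill `n` pixels starting at pixel `x` of `row`.
 * With a constant `bpp` and inlining, only one case remains. */
static inline void dib_fill_span(void *row, unsigned bpp,
                                 size_t x, size_t n, uint32_t color)
{
    switch (bpp) {
    case 8:  put_span8((uint8_t *)row + x, color, n);   break;
    case 16: put_span16((uint16_t *)row + x, color, n); break;
    case 32: put_span32((uint32_t *)row + x, color, n); break;
    default: assert(0 && "unsupported bpp");            break;
    }
}

/* Per-format entry point that replaces a generated function. */
static void dib32_fill_span(void *row, size_t x, size_t n, uint32_t color)
{
    dib_fill_span(row, 32, x, n, color);
}
```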

Timo


_______________________________________________
Ros-dev mailing list
Ros-dev@reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev
    

Best regards,
Alex Ionescu


