1) The optimizations of the code *around* the function (i.e. the
callers), which Michael also pointed out, cannot be done in ASM.
2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1
and Nehalem you will get totally different results with your ASM code,
while the compilers will generate the best possible code.
3) The fact that someone will now have to write optimized versions for
each other architecture.
4) The fact that if the loop is what you're truly worried about, you
can optimize it by hand with __builtinia32_rep_movsd (and MSVC has a
similar intrinsic), and still keep the rest of the function
portable C.
Also, gcc does support profiling, another fact you don't seem to know.
However, with linker optimizations, you do not need a profiler; the
linker will do the static analysis.
Also, to everyone saying things like "I was able to save a <operand
name here>", I hope you understand that smaller != faster.
On 4-Aug-09, at 10:13 AM, Timo Kreuzer wrote:
Michael Steil wrote:
I wonder, has either of you, Alex or Timo, actually *benchmarked* the
code on some sort of native i386 CPU before you argue whether it
should be a stosb or a stosd? If not, writing assembly would be a
clear case of "premature optimization".
I did. On an Athlon X2 64, I called the function a bunch of times with a
100x100 rect, measuring time with rdtsc. The results were quite random,
but roughly:
asm: ~580
gcc 4.2 -march=k8 -fexpensive-optimizations -O3: ~1800
WDK /GL /Oi /Ot /O2: ~2600
MSVC 2008 express /GL /Oi /Ot /O2: ~1800
Using a 50x50 rect shifts the advantage slightly in the direction of the
asm implementations.
I added volatile to the pointer to prevent the loop from being optimized
away. Using memset was a bit slower than a normal loop.
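For anyone who wants to repeat the measurement, here is a minimal sketch of such a harness. It is not the code that produced the numbers above. The names read_ticks, fill_rect_plain and bench_fill are made up for illustration, the clock() fallback for non-x86 targets is an assumption, and so is taking the minimum over several runs to damp the randomness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define HAVE_RDTSC 1
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <x86intrin.h>
#define HAVE_RDTSC 1
#endif

/* Read the time stamp counter on x86; fall back to clock() elsewhere
 * (the fallback is an assumption, it only keeps the sketch portable). */
static uint64_t read_ticks(void)
{
#ifdef HAVE_RDTSC
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Plain C fill. The pointer is volatile so the compiler cannot drop
 * the stores even though nothing reads the buffer afterwards. */
static void fill_rect_plain(volatile uint32_t *bits, size_t stride_px,
                            size_t width, size_t height, uint32_t color)
{
    for (size_t y = 0; y < height; ++y) {
        volatile uint32_t *p = bits + y * stride_px;
        for (size_t x = 0; x < width; ++x)
            p[x] = color;
    }
}

/* Time `runs` fills and return the smallest tick count observed. */
static uint64_t bench_fill(uint32_t *bits, size_t stride_px,
                           size_t width, size_t height, unsigned runs)
{
    uint64_t best = UINT64_MAX;

    assert(bits != NULL && runs > 0);
    for (unsigned i = 0; i < runs; ++i) {
        uint64_t start = read_ticks();
        fill_rect_plain(bits, stride_px, width, height, 0x3039u);
        uint64_t elapsed = read_ticks() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Calling bench_fill(buf, 100, 100, 100, 16) on a 100x100 buffer gives one number per build. Compare numbers only across builds run on the same machine.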
This is what msvc produced with the above settings:
_DIB_32BPP_ColorFill:
push ebx
mov ebx, [eax+8]
sub ebx, [eax]
test ebx, ebx
jg short label1
xor al, al
pop ebx
retn
label1:
mov ecx, [eax+4]
push esi
mov esi, [eax+0Ch]
sub esi, ecx
test esi, esi
jg short label2
pop esi
xor al, al
pop ebx
retn
label2:
mov eax, [edx+4]
imul ecx, eax
add ecx, [edx]
cdq
and edx, 3
add eax, edx
sar eax, 2
add eax, eax
push edi
mov edi, ecx
add eax, eax
jmp short label3
align 10h
label3:
mov ecx, edi
mov edx, ebx
label4:
mov dword ptr [ecx], 3039h
add ecx, 4
sub edx, 1
jnz short label4
dec esi
add edi, eax
test esi, esi
jg short label3
pop edi
pop esi
mov al, 1
pop ebx
retn
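Read back into C, the listing corresponds roughly to the sketch below. This is a reconstruction from the register offsets, not the original test source. The struct layouts and the names Rect, Surface and color_fill_32bpp are assumptions, and the offsets only match on a 32-bit build where a pointer is 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts inferred from the offsets in the listing. */
typedef struct {
    int32_t left, top, right, bottom;   /* [eax], [eax+4], [eax+8], [eax+0Ch] */
} Rect;

typedef struct {
    uint8_t *bits;                      /* [edx]   */
    int32_t  stride;                    /* [edx+4], in bytes */
} Surface;

/* Returns 0 for an empty rect, 1 after filling (the "mov al, 1" path). */
static int color_fill_32bpp(const Rect *rc, const Surface *surf)
{
    int32_t width, height, step;
    uint8_t *row;

    assert(rc != NULL && surf != NULL);

    width = rc->right - rc->left;
    if (width <= 0)
        return 0;
    height = rc->bottom - rc->top;
    if (height <= 0)
        return 0;

    /* imul ecx, eax / add ecx, [edx]: the row pointer starts at
     * bits + top * stride. The listing never adds left * 4. */
    row = surf->bits + (ptrdiff_t)rc->top * surf->stride;

    /* cdq / and edx, 3 / add / sar eax, 2 / two adds:
     * signed stride / 4, then multiplied by 4 again. */
    step = (surf->stride / 4) * 4;

    for (; height > 0; --height) {
        /* volatile, as in the benchmark, so the stores are kept */
        volatile uint32_t *p = (volatile uint32_t *)row;
        int32_t x;
        for (x = width; x != 0; --x)
            *p++ = 0x3039u;             /* the constant stored at label4 */
        row += step;
    }
    return 1;
}
```

The inner loop at label4 is a plain one-dword-per-iteration store, which is the part the asm and __stosd variants replace.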
I thought I had done something wrong myself. For me, no compiler was
able to generate code as fast as the asm code.
I don't know how Alex managed to get better optimizations. Maybe he
knows a secret ninja /Oxxx switch, or maybe the express and wdk versions
both suck at optimizing, or maybe I'm just too stupid... ;-)
See above: If all you want to optimize is the loop, then have C code
with asm("rep movsd") in it, or fix the static inline memcpy() to be
more efficient (if it isn't efficient in the first place).
I tried __stosd(), which actually resulted in a faster function. With
~610, gcc was almost as fast as the asm implementation; msvc actually
won with 590. But that was not using pure portable code. It's the best
solution, it seems, although it will probably still be slower unless we
set our optimization to max.
Btw, I already thought about rewriting our dib code some time ago, using
inline functions instead of a code generator. The idea is to make it
fully portable, optimizable through inline asm functions where useful,
and easier to maintain than the current stuff. It's on my list...
Timo
_______________________________________________
Ros-dev mailing list
Ros-dev(a)reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev
Best regards,
Alex Ionescu