Alex Ionescu wrote:
I will provide some code and timings but I love how you ignored my main points:
And you ignored my main points:
1) The optimization "around" the function is not important, as the function is not called that often; the loop is much more important.
2) It doesn't matter if the function performs differently on different machines, as long as it's always faster than the portable code.
3) No one is forced to write optimized versions; we have a C version for all other architectures.
4) I'm not worried about the loop, the loop is fine the way I wrote it ;-) I just claimed that you couldn't provide a faster C version. Faster in terms of real-life usage. And I'm still waiting for you to prove me wrong.
1) The optimizations of the code *around* the function (i.e. the callers), which Michael also pointed out, cannot be done in ASM.
2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and Nehalem, you will get totally different results with your ASM code, while the compilers will generate the best possible code.
3) The fact that someone will now have to write optimized versions for each other architecture.
4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtinia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
Also, gcc does support profiling, another fact you don't seem to know. However, with linker optimizations, you do not need a profiler, the linker will do the static analysis.
Also, to everyone saying things like "I was able to save a <operand name here>", I hope you understand that smaller != faster.
On 4-Aug-09, at 10:13 AM, Timo Kreuzer wrote:
Michael Steil wrote:
I wonder, has either of you, Alex or Timo, actually *benchmarked* the code on some sort of native i386 CPU before you argue whether it should be a stosb or a stosd? If not, writing assembly would be a clear case of "premature optimization".
I did. On an Athlon X2 64, I called the function a bunch of times with a 100x100 rect, measuring time with rdtsc. The results were quite random, but roughly:
asm: ~580
gcc 4.2 -march=k8 -fexpensive-optimizations -O3: ~1800
WDK /GL /Oi /Ot /O2: ~2600
MSVC 2008 Express /GL /Oi /Ot /O2: ~1800
Using a 50x50 rect shifts the advantage slightly in the direction of the asm implementations.
I added volatile to the pointer to prevent the loop from being optimized away. Using memset was a bit slower than a normal loop. This is what MSVC produced with the above settings:
_DIB_32BPP_ColorFill:
    push ebx
    mov ebx, [eax+8]
    sub ebx, [eax]
    test ebx, ebx
    jg short label1
    xor al, al
    pop ebx
    retn

label1:
    mov ecx, [eax+4]
    push esi
    mov esi, [eax+0Ch]
    sub esi, ecx
    test esi, esi
    jg short label2
    pop esi
    xor al, al
    pop ebx
    retn

label2:
    mov eax, [edx+4]
    imul ecx, eax
    add ecx, [edx]
    cdq
    and edx, 3
    add eax, edx
    sar eax, 2
    add eax, eax
    push edi
    mov edi, ecx
    add eax, eax
    jmp short label3

    align 10h
label3:
    mov ecx, edi
    mov edx, ebx

label4:
    mov dword ptr [ecx], 3039h
    add ecx, 4
    sub edx, 1
    jnz short label4

    dec esi
    add edi, eax
    test esi, esi
    jg short label3

    pop edi
    pop esi
    mov al, 1
    pop ebx
    retn
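A measurement along these lines could be set up roughly as in the sketch below. It is not the code that was actually timed: the Rect and Surface layouts and the names fill_rect32, read_cycles and time_fill are assumptions made for illustration, and the clock() fallback for non-x86 targets is an addition as well. The volatile pointer plays the role described above, keeping the compiler from discarding the stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define HAVE_RDTSC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
#define HAVE_RDTSC 1
#endif

/* Hypothetical stand-ins for the real rectangle and surface structures. */
typedef struct { int32_t left, top, right, bottom; } Rect;
typedef struct { uint8_t *bits; int32_t stride; } Surface; /* stride in bytes */

/* Time stamp: rdtsc on x86, clock() ticks elsewhere (assumed fallback). */
static uint64_t read_cycles(void)
{
#ifdef HAVE_RDTSC
    return __rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Plain C fill. The volatile pointer forces every store to be emitted. */
static int fill_rect32(Surface *s, const Rect *r, uint32_t color)
{
    int32_t width = r->right - r->left;
    int32_t height = r->bottom - r->top;
    if (width <= 0 || height <= 0)
        return 0;

    uint8_t *row = s->bits + (ptrdiff_t)r->top * s->stride
                           + (ptrdiff_t)r->left * 4;
    for (int32_t y = 0; y < height; y++) {
        volatile uint32_t *p = (volatile uint32_t *)row;
        for (int32_t x = 0; x < width; x++)
            p[x] = color;
        row += s->stride;
    }
    return 1;
}

/* Keep the smallest of `runs` measurements to damp run-to-run noise. */
static uint64_t time_fill(Surface *s, const Rect *r, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_cycles();
        fill_rect32(s, r, 0x3039);
        uint64_t t1 = read_cycles();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling time_fill() with a 100x100 rect on a 32bpp buffer returns the smallest tick count seen. The absolute values depend on the CPU and the compiler flags, so they are only comparable between builds run on the same machine.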
I thought I had done something wrong myself. For me, no compiler was able to generate code as fast as the asm code. I don't know how Alex managed to get better optimizations; maybe he knows a secret ninja /Oxxx switch, or maybe the Express and WDK versions both suck at optimizing, or maybe I'm just too stupid... ;-)
See above: If all you want to optimize is the loop, then have C code with asm("rep movsd") in it, or fix the static inline memcpy() to be more efficient (if it isn't efficient in the first place).
I tried __stosd(), which actually resulted in a faster function. With ~610, gcc was almost as fast as the asm implementation; MSVC actually won with 590. But that was not pure portable code. It's the best solution, it seems, although it will probably still be slower unless we set our optimization to max.
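The __stosd() approach can be sketched as a small inline helper that leaves the surrounding function in portable C. The names fill_row32 and fill_rect32_stos are hypothetical, and the GCC/Clang branch uses inline assembly as a stand-in for the MSVC intrinsic; this is an illustration of the idea, not the code that produced the timings above.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

/* Hypothetical helper: store `count` copies of a 32-bit value at dst. */
static inline void fill_row32(uint32_t *dst, uint32_t value, size_t count)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    /* MSVC intrinsic that compiles to rep stosd. */
    __stosd((unsigned long *)dst, value, count);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* rep stosl is the AT&T spelling of rep stosd. "+D" and "+c" tell
       the compiler that the destination and count registers change. */
    __asm__ volatile("rep stosl"
                     : "+D"(dst), "+c"(count)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback for every other architecture. */
    while (count--)
        *dst++ = value;
#endif
}

/* The rectangle logic stays in C; only the inner loop is specialized. */
static int fill_rect32_stos(uint8_t *bits, ptrdiff_t stride,
                            int left, int top, int right, int bottom,
                            uint32_t color)
{
    if (right <= left || bottom <= top)
        return 0;

    uint8_t *row = bits + top * stride + (ptrdiff_t)left * 4;
    for (int y = top; y < bottom; y++, row += stride)
        fill_row32((uint32_t *)row, color, (size_t)(right - left));
    return 1;
}
```

Because the helper is static inline, the compiler can still optimize the callers, and targets without a specialized branch simply get the plain loop.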
Btw, I already thought about rewriting our dib code some time ago, using inline functions instead of a code generator. The idea is to make it fully portable, optimizable through inline asm functions where useful, and easier to maintain than the current stuff. It's on my list...
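One way such an inline-function design might look is sketched below. It is an assumption about the general shape only, not the planned rewrite: put_pixel, fill_rect_generic and the dib_fill_* wrappers are invented names. The depth is passed as a compile-time constant, so after inlining the compiler can fold the switch away and emit one specialized loop per format, which is the job a code generator would otherwise do.

```c
#include <stddef.h>
#include <stdint.h>

/* Store one pixel. `bpp` is meant to be a constant at every call site,
   so the switch disappears once the function is inlined. */
static inline void put_pixel(uint8_t *row, int x, uint32_t color, int bpp)
{
    switch (bpp) {
    case 8:  row[x] = (uint8_t)color; break;
    case 16: ((uint16_t *)row)[x] = (uint16_t)color; break;
    case 32: ((uint32_t *)row)[x] = color; break;
    }
}

/* Generic rectangle fill shared by all depths (hypothetical). */
static inline int fill_rect_generic(uint8_t *bits, ptrdiff_t stride,
                                    int left, int top, int right, int bottom,
                                    uint32_t color, int bpp)
{
    if (right <= left || bottom <= top)
        return 0;

    uint8_t *row = bits + top * stride;
    for (int y = top; y < bottom; y++, row += stride)
        for (int x = left; x < right; x++)
            put_pixel(row, x, color, bpp);
    return 1;
}

/* One thin wrapper per format replaces a generated source file. */
int dib_fill_8(uint8_t *bits, ptrdiff_t stride,
               int l, int t, int r, int b, uint32_t c)
{
    return fill_rect_generic(bits, stride, l, t, r, b, c, 8);
}

int dib_fill_16(uint8_t *bits, ptrdiff_t stride,
                int l, int t, int r, int b, uint32_t c)
{
    return fill_rect_generic(bits, stride, l, t, r, b, c, 16);
}

int dib_fill_32(uint8_t *bits, ptrdiff_t stride,
                int l, int t, int r, int b, uint32_t c)
{
    return fill_rect_generic(bits, stride, l, t, r, b, c, 32);
}
```

An architecture-specific row helper could later replace the inner loop for a given depth without touching the other wrappers.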
Timo
Ros-dev mailing list Ros-dev@reactos.org http://www.reactos.org/mailman/listinfo/ros-dev
Best regards, Alex Ionescu