With all due respect, Alex: I agree with your core point, that this does not
justify the disadvantages of asm (portability, readability, etc.) for a tiny
performance difference, if any, but I don't agree with many of your arguments.
-->
1) The optimizations of the code *around* the function (ie: the
callers), which Michael also pointed out, cannot be done in ASM.
<--
Yes, they can. I have always been able to match or outperform a C compiler at
that, and have done so many times (I'm the author of an original PC BIOS,
performance libraries, mission-critical systems, etc.).
I very often pass call parameters in registers, address local storage through
SP instead of BP, make good use and reuse of registers, etc.
In fact, the loop the compiler generated was identical to the asm source
except for the two instructions the compiler added (which serve no purpose;
it is an MSVC issue).
It is actually in the calling overhead and local initialization and storage
where I could easily beat the compiler, since it complies with rules that I
can safely break.
Furthermore, in most cases a compiler won't change the calling convention
unless the source specifies it. In any case, the register-based calling used
by compilers is far more restricted than what can be done in asm, which can
always use more efficient methods (more extensive and intelligent register
allocation).
In any case, the most important optimizations are done equally well in C and
assembly when the programmer knows how to write optimum code and does not
have to comply with a prototype. For example, passing arguments as a pointer
to a struct is always more efficient.
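
A minimal sketch of the struct-pointer idea, in C. The names (fill_args,
fill32, fill32_by_args) are made up for illustration and are not taken from
the code under discussion; how much is actually saved depends on the ABI and
the call site.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block; field names are illustrative only. */
struct fill_args {
    uint32_t *dst;    /* destination buffer */
    size_t    count;  /* number of 32-bit elements to write */
    uint32_t  value;  /* value stored in each element */
};

/* Prototype-bound version: every argument is passed separately. */
void fill32(uint32_t *dst, size_t count, uint32_t value)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}

/* Struct-pointer version: the caller passes a single pointer,
   no matter how many fields the parameter block has. */
void fill32_by_args(const struct fill_args *a)
{
    assert(a != NULL);
    fill32(a->dst, a->count, a->value);
}
```

The caller fills a struct fill_args once and can reuse it across calls,
changing only the fields that differ.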
-->
2) The fact if you try this code on a Core 2, Pentium 4, Pentium 1 and
Nehalem you will get totally different results with your ASM code,
while the compilers will generate the best possible code.
<--
There are only a few, very specific cases where the optimum code differs
between processors, and this is not one of them.
If gcc generates different code for this function on different CPUs, it is
not for a good reason.
There is only one meaningful exception for this function: if the inner loop
can use a 64 bit rep stos instead of a 32 bit one. And that can be done in
asm, while I don't know of any compiler that would use a 64 bit rep stos
instruction for a 32 bit target, regardless of the CPU having 64 bit
registers.
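
To make the 32 vs 64 bit point concrete, here is a rough C sketch, assuming
the inner loop is a plain fill of 32 bit values. The function names are
hypothetical, and whether a given compiler turns either loop into rep stos
(of any width) depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 32-bit store per element. */
void fill_u32(uint32_t *dst, size_t count, uint32_t value)
{
    assert(dst != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}

/* Same result, but two elements are written per 64-bit store. */
void fill_u32_wide(uint32_t *dst, size_t count, uint32_t value)
{
    assert(dst != NULL || count == 0);
    /* Both halves hold the same value, so byte order does not matter. */
    uint64_t pair = ((uint64_t)value << 32) | value;
    size_t i = 0;
    for (; i + 2 <= count; i += 2)
        memcpy(&dst[i], &pair, sizeof pair);  /* sidesteps alignment and aliasing rules */
    if (i < count)
        dst[i] = value;                       /* odd element left over */
}
```

Both functions leave the buffer in the same state; the second one only halves
the number of stores, which is the effect the wider rep stos would have.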
-->
4) The fact that if the loop is what you're truly worried about, you
can optimize it by hand with __builtinia32_rep_movsd (and MSVC has a
similar intrinsic), and still keep the rest of the function portable C.
<--
It is not necessary to use a built-in function like the one you mention,
because any optimizing compiler will use rep movsd anyway, with better
register allocation, if there is any difference at all.
If inline asm is used instead, optimizations for the whole function are
disabled, as the compiler does not analyze what's done in inline assembly.
-->
Also, gcc does support profiling, another fact you don't seem to know.
However, with linker optimizations, you do not need a profiler, the
linker will do the static analysis.
<--
Function-level linking and profile-based optimization are very different
things; the linker cannot in any way perform a similar statistical analysis.
-->
Also, to everyone sayings things like "I was able to save a <operand
name here>", I hope you understand that smaller != faster.
<--
Saving these two instructions improves both speed and size. Note
that the loop the compiler generated was exactly the same as the original
assembly, only with those two instructions added. I can tell whether I am
saving speed, size, both, or neither, in either C or assembly.
I wrote this not to be argumentative or confrontational, but just because I
don't like to read arguments that are not true, and I hope you all take this
as constructive knowledge.
BTW, I hardly ever support the use of assembly except in very specific cases,
and this is not one. I disagreed with Alex's arguments, not with his core point.
Jose Catena
DIGIWAVES S.L.