With all due respect, Alex: although I agree with you on the core point, that a tiny performance difference (if any) does not justify the disadvantages of asm (portability, readability, etc.), I don't agree with many of your arguments.
--> 1) The optimizations of the code *around* the function (ie: the callers), which Michael also pointed out, cannot be done in ASM.
<-- Yes, they can. I could always match or outperform a C compiler at that, and have done so many times (I'm the author of an original PC BIOS, performance libraries, mission-critical systems, etc.). I very often used registers for call parameters, local storage addressed through SP instead of BP, careful use and reuse of registers, etc. In fact, the loop the compiler generated was identical to the asm source except for two instructions the compiler added (they serve no purpose; it is an MSVC issue). It is actually in the calling overhead and in local initialization and storage where I could easily beat the compiler, since it complies with rules that I can safely break. Furthermore, in most cases a compiler won't change the calling convention unless the source specifies it, and in any case the register-based calling used by compilers is far more restricted than what can be done in asm, which can always use more efficient methods (more extensive and intelligent register allocation). In any case, the most important optimizations can be done equally in C and assembly when the programmer knows how to write optimal code and does not have to comply with a fixed prototype. For example, passing arguments as a pointer to a struct is always more efficient.
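As a sketch of the last point, here is what "passing arguments as a pointer to a struct" can look like in C. The names (`fill_args`, `fill_by_args`, `fill_by_struct`) and the strided fill itself are hypothetical, chosen only for illustration; they are not taken from the function under discussion. Whether the struct form is actually faster depends on the calling convention, the compiler and the call site, so this only shows the shape of the two prototypes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block (illustration only). */
struct fill_args {
    uint32_t *dst;    /* destination buffer */
    size_t    count;  /* number of 32-bit elements to write */
    uint32_t  value;  /* pattern to store */
    size_t    stride; /* distance between written elements, in elements */
};

/* Conventional prototype: every argument is passed separately. */
static void fill_by_args(uint32_t *dst, size_t count, uint32_t value, size_t stride)
{
    for (size_t i = 0; i < count; i++)
        dst[i * stride] = value;
}

/* Same work, but the caller passes one pointer to a parameter block. */
static void fill_by_struct(const struct fill_args *a)
{
    uint32_t *p = a->dst;
    for (size_t i = 0; i < a->count; i++, p += a->stride)
        *p = a->value;
}
```

A caller would fill in a `struct fill_args` once and pass `&args`; if the same block is reused across several calls, only one pointer has to be passed each time.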
--> 2) The fact if you try this code on a Core 2, Pentium 4, Pentium 1 and Nehalem you will get totally different results with your ASM code, while the compilers will generate the best possible code.
<-- There are very few, very specific cases where the optimum code differs between processors, and this is not one of them. If gcc generates different code for this function on different CPUs, it is not for a good reason. There is only one meaningful exception for this function: if the inner loop can use a 64 bit rep stos instead of a 32 bit one. And in that case it can be done in asm, while I don't know of any compiler that would use a 64 bit rep stos instruction for a 32 bit target, regardless of the CPU having 64 bit registers.
--> 4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtinia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
<-- It is not necessary to use a built-in function like the one you mention, because any optimizing compiler will use rep movsd anyway, and with better register allocation, if there is any difference. If inline asm is used instead, optimizations for the whole function are disabled, as the compiler does not analyze what is done in inline assembly.
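For reference, the kind of portable loop meant here might look like the sketch below. `copy_u32` is a hypothetical name, not the function under discussion. Whether a given compiler turns this loop into rep movsd, a call to memcpy, or something else depends on the compiler, its version, the optimization flags and the target, so the generated assembly should be checked rather than assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `count` 32-bit words from src to dst.
 * `restrict` tells the compiler the two buffers do not overlap,
 * which leaves it free to pick a block-copy instruction sequence. */
static void copy_u32(uint32_t *restrict dst, const uint32_t *restrict src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}
```

The rest of the function stays plain C, so the compiler keeps full control over register allocation around the loop.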
--> Also, gcc does support profiling, another fact you don't seem to know. However, with linker optimizations, you do not need a profiler, the linker will do the static analysis.
<-- Function-level linking and profile-based optimization are very different things; the linker can in no way perform a similar statistical analysis.
--> Also, to everyone sayings things like "I was able to save a <operand name here>", I hope you understand that smaller != faster.
<-- Saving these two instructions improves both speed and size. Note that the loop the compiler generated was exactly the same as the original assembly, only with those two instructions added. I can tell where I am saving speed, size, both, or neither, in either C or assembly.
I wrote this not to be argumentative or confrontational, but just because I don't like to read arguments that are not true, and I hope you all take this as constructive knowledge. BTW, I hardly ever support the use of assembly except in very specific cases, and this is not one of them. I disagreed with Alex on the arguments, not on the core point.
Jose Catena DIGIWAVES S.L.