Note to everyone else: I just spent some time doing the calculations and have data proving C code can be faster; I will post tonight from home.
Now to get to your argument, Jose..
Best regards, Alex Ionescu
On Tue, Aug 4, 2009 at 2:19 PM, Jose Catena <jc1@diwaves.com> wrote:
With all respect, Alex: although I agree with you in the core, that this does not deserve the disadvantages of asm for a tiny performance difference if any (portability, readability, etc.), I don't agree with many of your arguments.
Also keep in mind Timo admitted "This code is not called often", making ASM optimization useless.
-->
- The optimizations of the code *around* the function (ie: the
callers), which Michael also pointed out, cannot be done in ASM.
<-- Yes, it can. I could always outperform or match a C compiler at that, and did many times (I'm the author of an original PC BIOS, performance libraries, mission-critical systems, etc.). I very often used registers for calling params, local storage through SP instead of BP, good use and reuse of registers, etc.
An optimizing compiler will do this too.
In fact, the loop the compiler generated was identical to the asm source except for the two instructions the compiler added (which serve no purpose; it is an MSVC issue).
Really? Here's sample code from my faster C version:
.text:004013E0 lea eax, [esi+eax*4]
.text:004013E3 lea esi, ds:0[edi*4]
.text:004013EA lea eax, [ebp+eax+0]
.text:004013EE db 66h
.text:004013EE nop
99% of people on this list (and you, probably) will tell me "this is a GCC issue" or that this is "useless code".
Guess what: I compiled with -mtune=core2, and this code sequence is specifically generated before the loop.
Neither Timo nor, I admit, even I would think of adding this kind of code. But once I asked some experts what it does, I understood why it's there.
To quote Michael "if you think the compiler is generating useless code, try to find out what the code is doing." In most cases, your thinking that it is "wrong" or "useless" is probably wrong itself.
As a challenge, can you tell me the point of this code? Why is it written this way? If I build for 486 (which is what ALL OF YOU SEEM TO BE STUCK ON!!!), I get code that looks like Timo's.
It is actually in the calling overhead and local initialization and storage where I could easily beat the compiler, since it complies with rules that I can safely break.
That doesn't make any sense; you are AGREEING with me. My point is that a compiler will break normal calling rules, while the assembly code will have to respect at least some rules, because you won't know a priori all your callers (you might in a BIOS, but not in a giant codebase like win32k). The compiler, on the other hand, DOES know all the callers, and will happily clobber registers, change the calling convention, etc. Please re-read Michael's email.
Furthermore, in most cases a compiler won't change calling convention unless the source specifies it
Completely not true. Compilers will do this. This is not 1994 anymore.
, and in any case the register-based calling used by compilers is far more restricted compared with what can be done in asm, which can always use more efficient methods (more extensive and intelligent register allocation).
Again, simply NOT true. Today's compilers will be able to do things like "All callers of foo must have param 8 in ECX", and will write the code that way, not to save/restore ECX, and to use it as a parameter. You CANNOT do this in assembly unless you have a very small number of callers that you know nobody else will touch. As soon as someone else adds a caller, YOU have to do all the work to make ECX work that way.
You seem to have a very 1990s understanding of how compilers work (respecting calling conventions, saving/restoring registers, not touching ebp, etc.). Probably because you worked on BIOSes, which, yes, in that era, worked that way.
Please read a bit into technologies such as LLVM or Microsoft's link-time code generator (LTCG).
In any case, the most important optimizations are equally done in C and assembly when the programmer knows how to write optimum code and does not have to comply with a prototype.
Again, NO. Unless you control all your call sites and are willing to update the code every single time a call site gets added, the compiler WILL beat you. LLVM and LTCG can even go 2-3 call sites away, such that callers of foo, which calls bar, which calls baz, have some sort of stack frame or register content that will make bar/baz faster.
For example passing arguments as a pointer to an struct is always more efficient.
It actually depends, and again the compiler can make this choice.
--> 2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and Nehalem you will get totally different results with your ASM code, while the compilers will generate the best possible code.
<-- There are very few and specific cases where the optimum code for different processors is different, and this is not the case.
False. I got radically different ASM when building for K8, i7, Core2, and Pentium.
If gcc generates different code for this function and different CPUs, it is not for a good reason.
Excuse me?
There is only one meaningful exception for this function: if the inner loop can use a 64-bit rep stos instead of 32. And in this case it can be done in asm, while I don't know any compiler that would use a 64-bit rep stos instruction for a 32-bit target, regardless of the CPU having 64-bit registers.
Again, this is full of assumptions. You seem to be saying "GCC is stupid, I know better". Yet you don't even understand WHY gcc will generate different code for different CPUs.
Please read into the topics of "pipelines" and "caches" and "micro-operations" as a good starting point.
--> 4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtin_ia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
<-- It is not necessary to use a built-in function like you mention, because any optimizing compiler will use rep movsd anyway, with better register allocation if any different.
Ummm, if you think "rep movsd" is what an optimizing compiler will use, then I'm sorry, but you don't have the credentials to be in this argument, and I'm wasting my time. rep movsd is the SLOWEST way to achieve this loop on modern CPUs. On my Core2 build, for example, gcc used "mov" and a loop instead. Only when building for Pentium 1 did it use a rep movsd.
Please stop thinking that 1 line of ASM is faster than 12 lines, because 12 > 1. On modern CPUs, a "manual" loop will be faster than a rep movsd, nearly ALWAYS.
If inline asm is used instead, optimizations for the whole function are disabled, as the compiler does not analyze what's done in inline assembly.
LOL??? Again, maybe true in the 1990s. But first of all:
1) Built-ins are not "inline asm", and will be optimized.
2) GCC and MSVC both will optimize the inline assembler according to the function the inline is present in. The old mantra that "inline asm disables optimizations" hasn't been true since about 2001...
In fact, when assembly is *required* (for something like a trap save), it is ALWAYS better to use an inline __asm__ block within the C function than to call an external function in an .S or .ASM file, because compilers like gcc will be able to fine-tune the assembly you wrote and modify it to work better with the C code. LTCG will, in some cases, optimize the ASM you wrote by hand in the external .ASM file as well.
--> Also, gcc does support profiling, another fact you don't seem to know. However, with linker optimizations, you do not need a profiler, the linker will do the static analysis.
<-- Function-level linking and profile-based optimization are very different things; the linker in no way can perform a similar statistical analysis.
But it can perform static analysis.
--> Also, to everyone saying things like "I was able to save a <operand name here>", I hope you understand that smaller != faster.
<-- The saving of these two instructions improves both the speed and the size. Note that the loop the compiler generated was exactly the same as the original assembly, only with those two instructions added. I can discern where I save speed, size, both, or neither, in either C or assembly.
I wrote this not to be argumentative or confrontational, but just because I don't like to read arguments that are not true, and I hope you all take this as constructive knowledge. BTW, I hardly ever support the use of assembly except in very specific cases, and this is not one. I disagreed with Alex in the arguments, not in the core.
Thanks Jose, but unfortunately you are wrong. If we were having this argument in:
1) 1986
2) on a 486
3) about BIOS code (which is small and rarely extended, with all calls "controlled")
I would bow down and give you my hat in an instant, but times have changed.
I don't want to waste more time on these arguments, because I know I'm right, and I've asked several people who all agree with me -- people that work closely with Intel, compiler technology, and assembly. I cannot convince people that don't even have the basic knowledge to be able to UNDERSTAND the arguments. Do some reading, then come back.
I will post numbers and charts when I'm home, at least they will provide some "visual" confirmation of what I'm saying, but I doubt that will be enough.
Jose Catena DIGIWAVES S.L.
Ros-dev mailing list
Ros-dev@reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev