On 4 Aug 2009, at 03:06, Timo Kreuzer wrote:
Alex Ionescu wrote:
You see, because one of the #1 reasons why inline ASM loses vs C, is that gcc understood the structure of my code -- it realized that I was calling the function with static parameters, and integrated them into the function itself.
What the compiler does not know is with what parameters this function is generally called, unless you would profile the code and then reuse the profiling data to recompile, but I don't think that gcc supports that. So it has to rely on generic optimization. Hand coded assembly can be optimized for the special usage pattern. Anyway that's theory.
A compiler that does link time optimization like MSVC and LLVM will inline a small function like this one every time it is invoked, not only saving the call and the stack frame, but also to save copying around the arguments. If caller 1 happens to have the argument "iColor" in ebx, it will just inline the function in a way that it can work directly on ebx. And for caller 2, where iColor is in esi, it will inline it in a way so it can work with it in esi.
If you compile the source into an i386 binary .o file and analyze this one, the compiler does not know what registers hold the data *in general*, because you could still link it against anything. But at least it knows it for every single already existing invocation, and generate corresponding code.
But as soon as the compiler/linker knows all invocations (because you create the final executable, and your function is not an exported symbol any more), instead of optimizing by inlining every invocation, it can also decide to choose its own calling conventions between the n callers and the function and go for a register based calling convention, and make sure all callers prepare their arguments so that they end up in the right registers without the need to copy them around.
If you don't compile with a link time optimizing compiler like MSVC or LLVM (yet), you can choose to have your function as a "static inline" in a header file instead of in a C file, and the compiler won't require a link time optimizer in order to do the optimized inlining.
I tried calling the function twice -- gcc actually inlined the function twice, with static parameters twice!
This function is neither static nor called with static parameters.
Not in the general case, but for every individual case, yes. And as soon as all possible callers are known to the toolchain, it might find some static property, like: A size is always divisible by four, a pointer is always aligned to a certain boundary, or a 64 bit argument will never be >= 2^32. You can't make these assumptions in assembly, because adding a single new caller will break your code. But in the C case, if you add a caller for which this property is not true any more, the compiler won't do this optimization any more.
But this proves one of the first points -- gcc will be able to analyze the form of your program, and make minute optimizations that are *impossible* to do in ASM. For example, it could decide "all functions calling this function will store parameter 3 in ECX" and this will optimize the overall speed of the entire program, not only the function, plus save stack space. This is only an example of the many hidden optimizations it could decide to do.
All the small optimizations "around" the function don't matter much in this case. You can assume that the functions spends > 90% of the time inside the loop. So the loop needs to be optimized, everything else is candy.
So you are saying you only care that your rep movs is fast, and you don't care about the rest, because it's only 10%? In this case, please write your function in C, and replace only the memcpy() with inline asm, in order to make the code less error prone, more readable, more maintainable and more portable, without losing any of your goals (=speed in the tight loop).
But if it's really just the memcpy() that you think is slow, then you should fix the memcpy(), and have your compiler inline the version that you think is fastest. Or maybe the version it is inlining today is faster than what you think is faster.
I wonder, has either of you, Alex or Timo actually *benchmarked* the code on some sort of native i386 CPU before you argue whether it should be a stosb or a stosd? If not, writing assembly would be a clear case of "premature optimization".
Depending on how many times/how this function is called, gcc could've done any number of register allocation and tree/loop optimizations based on the code.
Once I fooled gcc into generating "stupid" code, the output was very similar, but more optimized than yours -- partly because ebp was clobbered.
I fixed the function to *not* clobber ebp. Misusing ebp is lame.
If you compile debug, you want ebp to be sane. If you compile release (i.e. perfomance), you usually do not care, In any case, with C code, you can decide whether you want debuggability or speed with a compile time switch. With assembly, you can not. And please don't start with #ifdefs.
In case you're wondering, yes, gcc inlined the rep stosd. However, it chose rep stosb instead, because I did not give it a guarantee of alignment (that would be a simple __attribute__).
I wonder what compiler you are using then. I tried it with our current RosBE with maximum optimization and it didn't do that. Same with gcc 4.4.0 and msc (I tested the one that ships with the WDK 2008) also with maximum optimization for speed.
If the compiled code looks suboptimal, these should be the steps to take: 1. Understand whether the code is actually suboptimal, i.e. is it slow? Maybe the compiler uses stosb for a reason - for example because stosb is faster on unaligned data, and the compiler could not make any assumptions on the alignment of the data. 2. In case the code is bad because the compiler couldn't make an assumption, i.e. it did not know enough, then give it hints. GCC takes lots of hints. 3. In case the compiler was right, because you were making an assumption that cannot me made in general, then you should leave it the way it is, because the compiler is probably doing the right thing. 4. If the compiler is really acting stupid, compare it to other compilers to be really sure, and then file a bug with the compiler. 5. Refrain from hand-coding something that works around a compiler *performance* issue unless it is really really necessary. If you have to do so, clearly mark the code with a bug tracker number and state that this code should be reverted into C as soon as the compiler bug is fixed.
In short: Nowadays, compilers are really really smart. Amazingly smart. If you think you're better, then you're either doing it wrong, making assumptions that you haven't told the compiler about, or it's a compiler bug. In no case, try to do something by hand unless you perfectly understand *why* the compiler is doing it some way.
And yes, I understand that current compilers may have some limitations concerning SIMD instruction sets. (...although even this isn't strictly true any more today, if you know how to give the compiler the right hints and use the right intrinsics.) But this case is not one of these.
I used push/pop in favour of movs to improve the readability, as it doesn't really matter. If I had been up to ultra optimization, I could have quenched out a few cycles more. Optimizing the loop was sufficient for me.
See above: If all you want to optimize is the loop, then have C code with asm("rep movsd") in it, or fix the static inline memcpy() to be more efficient (if it isn't efficient in the first place).
The challenge was obviously the compiler. Please let us know which version of gcc you were using and with what options, it seems to be way more sophisticated than all the compilers/options I know.
While I would indeed be interested in seeing the compiler output both of you get, because otherwise all this is pretty theoretical, I also think that it doesn't matter in the big picture. Even if today's compiler doesn't generate perfect code for a simple task as this one (and ReactOS' toolchain doesn't use the latest GCC today), then tomorrow's compiler will, and it's not worth doing any assembly any more today.
You talked about compiler optimization, and what it could theoretically do here and there, but the only thing that is worth optimizing in this function is the loop and here you managed to get a lousy rep stosb, not a stosd or even SSE stuff?
Now if we care a lot about memcpy() performance, then we could add many more tricks to the codebase. The memcpy() code in the "commpage" of Mac OS X is pretty fascinating, for example. Depending on the CPU type, different versions of memcpy() are in place, and some versions first look at the size of the region, and branch to one of three different implementations. Small regions use rep stosb, bigger ones SSE or the like. (I'm sure NT does something similar.)
If you are more into static optimization, or want to give the compiler hints about the typical size of the region, you could do some clever macro hacks that supply the typical size range as an extra argument to memcpy(), so that the best version can be chosen at preprocessing time, without the need for the size check in the compiled code - if, for example, DIB_32BPP_ColorFill() typically did memcpy()s of smaller sizes.
Michael