On 4 Aug 2009, at 03:06, Timo Kreuzer wrote:
Alex Ionescu wrote:
You see, because one of the #1 reasons why inline ASM loses vs C, is
that gcc understood the structure of my code -- it realized that I
was
calling the function with static parameters, and integrated them into
the function itself.
What the compiler does not know is with what parameters this
function is generally called, unless you would profile the code and
then reuse the profiling data to recompile, but I don't think that
gcc supports that. So it has to rely on generic optimization. Hand
coded assembly can be optimized for the special usage pattern.
Anyway that's theory.
A compiler that does link time optimization like MSVC and LLVM will
inline a small function like this one every time it is invoked, not
only saving the call and the stack frame, but also to save copying
around the arguments. If caller 1 happens to have the argument
"iColor" in ebx, it will just inline the function in a way that it can
work directly on ebx. And for caller 2, where iColor is in esi, it
will inline it in a way so it can work with it in esi.
If you compile the source into an i386 binary .o file and analyze this
one, the compiler does not know what registers hold the data *in
general*, because you could still link it against anything. But at
least it knows it for every single already existing invocation, and
generate corresponding code.
But as soon as the compiler/linker knows all invocations (because you
create the final executable, and your function is not an exported
symbol any more), instead of optimizing by inlining every invocation,
it can also decide to choose its own calling conventions between the n
callers and the function and go for a register based calling
convention, and make sure all callers prepare their arguments so that
they end up in the right registers without the need to copy them around.
If you don't compile with a link time optimizing compiler like MSVC or
LLVM (yet), you can choose to have your function as a "static inline"
in a header file instead of in a C file, and the compiler won't
require a link time optimizer in order to do the optimized inlining.
I tried
calling the function twice -- gcc actually inlined the
function twice, with static parameters twice!
This function is neither static nor called with static parameters.
Not in the general case, but for every individual case, yes. And as
soon as all possible callers are known to the toolchain, it might find
some static property, like: A size is always divisible by four, a
pointer is always aligned to a certain boundary, or a 64 bit argument
will never be >= 2^32. You can't make these assumptions in assembly,
because adding a single new caller will break your code. But in the C
case, if you add a caller for which this property is not true any
more, the compiler won't do this optimization any more.
But this
proves one of the first points -- gcc will be able to
analyze
the form of your program, and make minute optimizations that are
*impossible* to do in ASM. For example, it could decide "all
functions
calling this function will store parameter 3 in ECX" and this will
optimize the overall speed of the entire program, not only the
function, plus save stack space. This is only an example of the many
hidden optimizations it could decide to do.
All the small optimizations "around" the function don't matter much
in this case. You can assume that the functions spends > 90% of the
time inside the loop. So the loop needs to be optimized, everything
else is candy.
So you are saying you only care that your rep movs is fast, and you
don't care about the rest, because it's only 10%? In this case, please
write your function in C, and replace only the memcpy() with inline
asm, in order to make the code less error prone, more readable, more
maintainable and more portable, without losing any of your goals
(=speed in the tight loop).
But if it's really just the memcpy() that you think is slow, then you
should fix the memcpy(), and have your compiler inline the version
that you think is fastest. Or maybe the version it is inlining today
is faster than what you think is faster.
I wonder, has either of you, Alex or Timo actually *benchmarked* the
code on some sort of native i386 CPU before you argue whether it
should be a stosb or a stosd? If not, writing assembly would be a
clear case of "premature optimization".
Depending on
how many times/how this function is called, gcc could've
done any number of register allocation and tree/loop optimizations
based on the code.
Once I fooled gcc into generating "stupid" code, the output was very
similar, but more optimized than yours -- partly because ebp was
clobbered.
I fixed the function to *not* clobber ebp. Misusing ebp is lame.
If you compile debug, you want ebp to be sane. If you compile release
(i.e. perfomance), you usually do not care, In any case, with C code,
you can decide whether you want debuggability or speed with a compile
time switch. With assembly, you can not. And please don't start with
#ifdefs.
In case
you're wondering, yes, gcc inlined the rep stosd.
However, it chose rep stosb instead, because I did not give it a
guarantee of alignment (that would be a simple __attribute__).
I wonder what compiler you are using then. I tried it with our
current RosBE with maximum optimization and it didn't do that. Same
with gcc 4.4.0 and msc (I tested the one that ships with the WDK
2008) also with maximum optimization for speed.
If the compiled code looks suboptimal, these should be the steps to
take:
1. Understand whether the code is actually suboptimal, i.e. is it
slow? Maybe the compiler uses stosb for a reason - for example because
stosb is faster on unaligned data, and the compiler could not make any
assumptions on the alignment of the data.
2. In case the code is bad because the compiler couldn't make an
assumption, i.e. it did not know enough, then give it hints. GCC takes
lots of hints.
3. In case the compiler was right, because you were making an
assumption that cannot me made in general, then you should leave it
the way it is, because the compiler is probably doing the right thing.
4. If the compiler is really acting stupid, compare it to other
compilers to be really sure, and then file a bug with the compiler.
5. Refrain from hand-coding something that works around a compiler
*performance* issue unless it is really really necessary. If you have
to do so, clearly mark the code with a bug tracker number and state
that this code should be reverted into C as soon as the compiler bug
is fixed.
In short: Nowadays, compilers are really really smart. Amazingly
smart. If you think you're better, then you're either doing it wrong,
making assumptions that you haven't told the compiler about, or it's a
compiler bug. In no case, try to do something by hand unless you
perfectly understand *why* the compiler is doing it some way.
And yes, I understand that current compilers may have some limitations
concerning SIMD instruction sets. (...although even this isn't
strictly true any more today, if you know how to give the compiler the
right hints and use the right intrinsics.) But this case is not one of
these.
I used push/pop in favour of movs to improve the
readability, as it
doesn't really matter. If I had been up to ultra optimization, I
could have quenched out a few cycles more. Optimizing the loop was
sufficient for me.
See above: If all you want to optimize is the loop, then have C code
with asm("rep movsd") in it, or fix the static inline memcpy() to be
more efficient (if it isn't efficient in the first place).
The challenge was obviously the compiler. Please let
us know which
version of gcc you were using and with what options, it seems to be
way more sophisticated than all the compilers/options I know.
While I would indeed be interested in seeing the compiler output both
of you get, because otherwise all this is pretty theoretical, I also
think that it doesn't matter in the big picture. Even if today's
compiler doesn't generate perfect code for a simple task as this one
(and ReactOS' toolchain doesn't use the latest GCC today), then
tomorrow's compiler will, and it's not worth doing any assembly any
more today.
You talked about compiler optimization, and what it
could
theoretically do here and there, but the only thing that is worth
optimizing in this function is the loop and here you managed to get
a lousy rep stosb, not a stosd or even SSE stuff?
Now if we care a lot about memcpy() performance, then we could add
many more tricks to the codebase. The memcpy() code in the "commpage"
of Mac OS X is pretty fascinating, for example. Depending on the CPU
type, different versions of memcpy() are in place, and some versions
first look at the size of the region, and branch to one of three
different implementations. Small regions use rep stosb, bigger ones
SSE or the like. (I'm sure NT does something similar.)
If you are more into static optimization, or want to give the compiler
hints about the typical size of the region, you could do some clever
macro hacks that supply the typical size range as an extra argument to
memcpy(), so that the best version can be chosen at preprocessing
time, without the need for the size check in the compiled code - if,
for example, DIB_32BPP_ColorFill() typically did memcpy()s of smaller
sizes.
Michael