Note to everyone else: I just spent some time doing the calculations and have data proving C code can be faster; I will post tonight from home.
Now to get to your argument, Jose..
Best regards, Alex Ionescu
On Tue, Aug 4, 2009 at 2:19 PM, Jose Catena <jc1@diwaves.com> wrote:
With all respect, Alex: although I agree with you in the core, that this does not deserve the disadvantages of asm for a tiny performance difference if any (portability, readability, etc.), I don't agree with many of your arguments.
Also keep in mind Timo admitted "This code is not called often", making ASM optimization useless.
-->
- The optimizations of the code *around* the function (ie: the
callers), which Michael also pointed out, cannot be done in ASM.
<-- Yes, it can. I could always outperform or match a C compiler at that, and did many times (I'm the author of an original PC BIOS, performance libraries, mission-critical systems, etc.). I very often used registers for calling params, local storage through SP instead of BP, good use and reuse of registers, etc.
An optimizing compiler will do this too.
In fact, the loop the compiler generated was identical to the asm source except for the two instructions the compiler added (which serve no purpose; it is an MSVC issue).
Really? Here's sample code from my faster C version:
.text:004013E0 lea eax, [esi+eax*4]
.text:004013E3 lea esi, ds:0[edi*4]
.text:004013EA lea eax, [ebp+eax+0]
.text:004013EE db 66h
.text:004013EE nop
99% of people on this list (and you, probably) will tell me "this is a GCC issue" or that this is "useless code".
Guess what: I compiled with -mtune=core2, and this code sequence is specifically generated before the loop.
Neither Timo nor, I admit, even I would think of adding this kind of code. But once I asked some experts what it does, I understood why it's there.
To quote Michael "if you think the compiler is generating useless code, try to find out what the code is doing." In most cases, your thinking that it is "wrong" or "useless" is probably wrong itself.
As a challenge, can you tell me the point of this code? Why is it written this way? If I build for 486 (which is what ALL OF YOU SEEM TO BE STUCK ON!!!), I get code that looks like Timo's.
It is actually in the calling overhead and local initialization and storage where I could easily beat the compiler, since it complies with rules that I can safely break.
That doesn't make any sense; you are AGREEING with me. My point is that a compiler will break normal calling rules, while the assembly code will have to respect at least some rules, because you won't know a priori all your callers (you might in a BIOS, but not in a giant codebase like win32k). The compiler, on the other hand, DOES know all the callers, and will happily clobber registers, change the calling convention, etc. Please re-read Michael's email.
Furthermore, in most cases a compiler won't change calling convention unless the source specifies it
Completely not true. Compilers will do this. This is not 1994 anymore.
, and in any case the register-based calling used by compilers is far more restricted compared with what can be done in asm, which can always use more efficient methods (more extensive and intelligent register allocation).
Again, simply NOT true. Today's compilers will be able to do things like "All callers of foo must have param 8 in ECX", and will write the code that way, not to save/restore ECX, and to use it as a parameter. You CANNOT do this in assembly unless you have a very small number of callers that you know nobody else will touch. As soon as someone else adds a caller, YOU have to do all the work to make ECX work that way.
You seem to have a very 1990s understanding of how compilers work (respecting calling conventions, saving/restoring registers, not touching ebp, etc.). Probably because you worked on BIOSes, which, yes, in that era, worked that way.
Please read a bit into technologies such as LLVM or Microsoft's link-time code generator (LTCG).
In any case, the most important optimizations are equally done in C and assembly when the programmer knows how to write optimum code and does not have to comply with a prototype.
Again, NO. Unless you control all your call sites and are willing to update the code every single time a call site gets added, the compiler WILL beat you. LLVM and LTCG can even go 2-3 call sites away, such that callers of foo, which calls bar, which calls baz, have some sort of stack frame or register content that will make bar/baz faster.
For example passing arguments as a pointer to an struct is always more efficient.
It actually depends, and again the compiler can make this choice.
--> 2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and Nehalem you will get totally different results with your ASM code, while the compilers will generate the best possible code.
<-- There are very few and specific cases where the optimum code for different processors is different, and this is not the case.
False. I got radically different ASM when building for K8, i7, Core2, and Pentium.
If gcc generates different code for this function and different CPUs, it is not for a good reason.
Excuse me?
There is only one meaningful exception for this function: if the inner loop can use a 64-bit rep stos instead of 32. And in this case it can be done in asm, while I don't know any compiler that would use a 64-bit rep stos instruction for a 32-bit target, regardless of the CPU having 64-bit registers.
Again, this is full of assumptions. You seem to be saying "GCC is stupid, I know better". Yet you don't even understand WHY gcc will generate different code for different CPUs.
Please read into the topics of "pipelines" and "caches" and "micro-operations" as a good starting point.
--> 4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtin_ia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
<-- It is not necessary to use a built-in function like you mention, because any optimizing compiler will use rep movsd anyway, with better register allocation if any different.
Ummm, if you think "rep movsd" is what an optimizing compiler will use, then I'm sorry, but you don't have the credentials to be in this argument, and I'm wasting my time. rep movsd is the SLOWEST way to achieve this loop on modern CPUs. On my Core2 build, for example, gcc used "mov" and a loop instead. Only when building for Pentium 1 did it use a rep movsd.
Please stop thinking that 1 line of ASM is faster than 12 lines, because 12 > 1. On modern CPUs, a "manual" loop will be faster than a rep movsd, nearly ALWAYS.
If inline asm is used instead, optimizations for the whole function are disabled, as the compiler does not analyze what's done in inline assembly.
LOL??? Again, maybe true in the 1990s. But first of all:
1) Built-ins are not "inline asm", and will be optimized.
2) GCC and MSVC both will optimize the inline assembler according to the function the inline is present in. The old mantra that "inline asm disables optimizations" hasn't been true since about 2001...
In fact, when assembly is *required* (for something like a trap save), it is ALWAYS better to use an inline __asm__ block within the C function than to call an external function in an .S or .ASM file, because compilers like gcc will be able to fine-tune the assembly you wrote and modify it to work better with the C code. LTCG will, in some cases, optimize the ASM you wrote by hand in the external .ASM file as well.
--> Also, gcc does support profiling, another fact you don't seem to know. However, with linker optimizations, you do not need a profiler, the linker will do the static analysis.
<-- Function-level linking and profile-based optimization are very different things; the linker in no way can perform a similar statistical analysis.
But it can perform static analysis.
--> Also, to everyone saying things like "I was able to save a <operand name here>", I hope you understand that smaller != faster.
<-- The saving of these two instructions improves both the speed and the size. Note that the loop the compiler generated was exactly the same as the original assembly, only with those two instructions added. I can discern where I save speed, size, both, or neither, in either C or assembly.
I wrote this not to be argumentative or confrontational, but just because I don't like to read arguments that are not true, and I hope you all take this as constructive knowledge. BTW, I hardly ever support the use of assembly except in very specific cases, and this is not one. I disagreed with Alex in the arguments, not in the core.
Thanks Jose, but unfortunately you are wrong. If we were having this argument in:
1) 1986
2) on a 486
3) about BIOS code (which is small and rarely extended, with all calls "controlled")
I would bow down and give you my hat in an instant, but times have changed.
I don't want to waste more time on these arguments, because I know I'm right, and I've asked several people who all agree with me -- people that work closely with Intel, compiler technology, and assembly. I cannot convince people that don't even have the basic knowledge to be able to UNDERSTAND the arguments. Do some reading, then come back.
I will post numbers and charts when I'm home, at least they will provide some "visual" confirmation of what I'm saying, but I doubt that will be enough.
Jose Catena DIGIWAVES S.L.
Ros-dev mailing list
Ros-dev@reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev