Re: [ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

4 Aug 2009


Alex Ionescu wrote:
...
I won.
Did you?
...
I actually had to spend the better part of the hour convincing GCC
*not* to optimize, just to make things fair.
You see, because one of the #1 reasons why inline ASM loses vs C, is
that gcc understood the structure of my code -- it realized that I was
calling the function with static parameters, and integrated them into
the function itself.
What the compiler does not know is with what parameters this function is
generally called, unless you would profile the code and then reuse the
profiling data to recompile, but I don't think that gcc supports that.
So it has to rely on generic optimization. Hand coded assembly can be
optimized for the special usage pattern. Anyway that's theory.
...
I tried calling the function twice -- gcc actually inlined the
function twice, with static parameters twice!
This function is neither static nor called with static parameters.
...
I finally settled on calling it with arguments from the command line,
which are always changing.
But this proves one of the first points -- gcc will be able to analyze
the form of your program, and make minute optimizations that are
*impossible* to do in ASM. For example, it could decide "all functions
calling this function will store parameter 3 in ECX" and this will
optimize the overall speed of the entire program, not only the
function, plus save stack space. This is only an example of the many
hidden optimizations it could decide to do.
All the small optimizations "around" the function don't matter much in
this case. You can assume that the function spends > 90% of the time
inside the loop. So the loop needs to be optimized, everything else is
candy.
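For reference, the loop in question can be sketched in C roughly as follows. This is only an illustration: the name fill_rect_32bpp and its parameters are hypothetical and are not the actual DIB_32BPP_ColorFill signature. It assumes a 32bpp surface whose row stride is given in pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the inner work of a 32bpp color fill:
 * write `color` into a width x height rectangle of 32-bit pixels.
 * `stride_pixels` is the distance between two rows, in pixels
 * (an assumption made for this sketch). */
static void fill_rect_32bpp(uint32_t *bits, size_t stride_pixels,
                            size_t width, size_t height, uint32_t color)
{
    for (size_t y = 0; y < height; y++) {
        uint32_t *row = bits + y * stride_pixels;

        /* The inner loop is where almost all of the time is spent;
         * this is the part that rep stosd replaces in the asm version. */
        for (size_t x = 0; x < width; x++)
            row[x] = color;
    }
}
```

Everything outside the inner loop runs once per row or once per call, so it contributes little to the total for any rectangle of non-trivial width.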
...
Depending on how many times/how this function is called, gcc could've
done any number of register allocation and tree/loop optimizations
based on the code.
Once I fooled gcc into generating "stupid" code, the output was very
similar, but more optimized than yours -- partly because ebp was
clobbered.
I fixed the function to *not* clobber ebp. Misusing ebp is lame.
...
In case you're wondering, yes, gcc inlined the rep stosd.
However, it chose rep stosb instead, because I did not give it a
guarantee of alignment (that would be a simple __attribute__).
I wonder what compiler you are using then. I tried it with our current
RosBE with maximum optimization and it didn't do that. Same with gcc
4.4.0 and msc (I tested the one that ships with the WDK 2008) also with
maximum optimization for speed.
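As a sketch of what an alignment guarantee could look like in C: the example below uses __builtin_assume_aligned, a builtin available in later GCC releases (4.7 and up) and in Clang, so it is not something the compilers tested above offer, and it is not necessarily the __attribute__ referred to in the quoted text. The function name fill_row_aligned is hypothetical. Whether a given compiler then emits rep stosd, a plain store loop, or vector stores is not guaranteed and has to be checked in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `count` 32-bit pixels starting at `dst`.
 * The builtin promises the compiler that `dst` is at least 4-byte
 * aligned; passing a misaligned pointer is undefined behaviour. */
static void fill_row_aligned(void *dst, size_t count, uint32_t color)
{
    uint32_t *p = __builtin_assume_aligned(dst, 4);

    for (size_t i = 0; i < count; i++)
        p[i] = color;
}
```

The hint only removes the need for the compiler to handle unaligned head and tail bytes; it does not by itself select any particular instruction.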
...
More importantly however, once I selected -mtune=core2, gcc destroyed
you. It made uglier (less compact) code, but didn't use any push/pops
at all, and moved data directly into the stack. It also used some more
I used push/pop in favour of movs to improve the readability, as it
doesn't really matter. If I had been after ultra optimization, I could
have squeezed out a few more cycles. Optimizing the loop was sufficient
for me.
...
exotic checks and operands, because it KNEW that this would be faster
on the Core 2 I was testing on. When I used -mtune=486, or -mtune=k6,
I once again got very different looking programs. Because gcc knew
what was best for each chip. You don't, and even if you did, you'd
have to write 50 versions of your assembly code.
Having one version that runs on all x86 machines and is faster than
anything our current gcc can generate should be enough, thanks.
...
I also built it for x64 and ARM, and got fast code for those platforms
too -- your assembly code requires someone to port it.
True, I never said anything else. But this is not the question.
...
Additionally, gcc also aligned the code, and certain parts of the
loop, to best suit the cache settings of the platform, and where the
code was actually located in the binary.
Finally, on certain platforms, gcc chose to call memset instead, and
provided a highly optimized memset implementation which even used SSE
4.1 if required (if it determined it would be fastest for this set of
inputs). Again, your rep movsd, while fast on 486, is slow as molasses
on newer Core processors (or even the P3), because it gets micro-coded
and has to do a lot of pre-setup work.
As already mentioned memset doesn't work. And how does the compiler know
if something is worth the hassle or not? It's about 15 cycles for a rep,
call a subfunction and you quickly get more than 15 cycles overhead. How
does the compiler possibly "determine it would be fastest for this set
of inputs", without profiling? Again, theory.
...
I don't know if you were trying to bait me -- I respect you and I'm
pretty sure you knew these facts, so I'm surprised about this
"challenge".
The challenge was obviously the compiler. Please let us know which
version of gcc you were using and with what options, it seems to be way
more sophisticated than all the compilers/options I know.
I will be the first to replace the asm version with a C implementation, as
soon as we use a proper gcc with decent optimization in ReactOS that
creates faster code. But I currently don't see this.
You talked about compiler optimization, and what it could theoretically
do here and there, but the only thing that is worth optimizing in this
function is the loop and here you managed to get a lousy rep stosb, not
a stosd or even SSE stuff? And what about the rest of the loop? And
where's the code? Where's the disassembly? I don't care if the compiler
"can do" or "could decide to do" something. I only care about what comes
out at the end. Quite disappointing what I've seen so far.
What you are saying is like, no one uses a plane nowadays, because trains
are way faster. That might be true for a transrapid going at 500km/h,
while a crop duster might only make 200 km/h. But that doesn't count
when you plan a journey from Boston to San Francisco. :-P
You do not win before reproducible and usable results are there.
Regards,
Timo
...
Best regards,
Alex Ionescu
On Mon, Aug 3, 2009 at 7:05 PM, WaxDragon <waxdragon@gmail.com> wrote:
...
Your kung-fu is the best, Alex.
On Aug 3, 2009 7:22 PM, "Alex Ionescu" <ionucu@videotron.ca> wrote:
Just got back to San Francisco... I will take you up on the challenge.
Your ass is grass, and I'm the lawnmower.
Best regards,
Alex Ionescu
On Mon, Aug 3, 2009 at 11:15 AM, Timo Kreuzer <timo.kreuzer@web.de> wrote:
...
yeah ;-) > > Dmitr...

Ros-dev mailing list
Ros-dev@reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev
