Alex Ionescu wrote:
I won.
  
Did you?

I actually had to spend the better part of the hour convincing GCC
*not* to optimize, just to make things fair.

You see, because one of the #1 reasons why inline ASM loses vs C, is
that gcc understood the structure of my code -- it realized that I was
calling the function with static parameters, and integrated them into
the function itself.
  
What the compiler does not know is which parameters this function is generally called with, unless you profile the code and then reuse the profiling data to recompile, but I don't think gcc supports that. So it has to rely on generic optimization. Hand-coded assembly can be optimized for the specific usage pattern. Anyway, that's theory.

I tried calling the function twice -- gcc actually inlined the
function twice, with static parameters twice!
  
This function is neither static nor called with static parameters.

I finally settled on calling it with arguments from the command line,
which are always changing.
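
A minimal sketch of that kind of test setup, assuming the function under discussion fills a buffer with 32-bit values. The names fill32 and run_fill_from_args are hypothetical and not taken from the ReactOS tree. Because count and color come from argv, the compiler cannot treat them as constants:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the function under discussion. The GCC
   attribute noinline keeps the compiler from merging it into the caller. */
__attribute__((noinline))
void fill32(uint32_t *dst, size_t count, uint32_t color)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = color;
}

/* Takes count and color from argv, so neither is known at compile time.
   Returns the last value written, or 0 if the arguments are unusable. */
uint32_t run_fill_from_args(int argc, char **argv)
{
    if (argc < 3)
        return 0;

    size_t count = strtoul(argv[1], NULL, 0);
    uint32_t color = (uint32_t)strtoul(argv[2], NULL, 0);
    if (count == 0)
        return 0;

    uint32_t *buf = malloc(count * sizeof *buf);
    if (buf == NULL)
        return 0;

    fill32(buf, count, color);
    assert(buf[0] == color);          /* sanity check on the first element */

    uint32_t last = buf[count - 1];   /* the result is used, so the call is not dead code */
    free(buf);
    return last;
}
```

A main() would simply pass its own argc and argv through and print or return the result.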

But this proves one of the first points -- gcc will be able to analyze
the form of your program, and make minute optimizations that are
*impossible* to do in ASM. For example, it could decide "all functions
calling this function will store parameter 3 in ECX" and this will
optimize the overall speed of the entire program, not only the
function, plus save stack space. This is only an example of the many
hidden optimizations it could decide to do.
  
All the small optimizations "around" the function don't matter much in this case. You can assume that the function spends > 90% of the time inside the loop. So the loop needs to be optimized; everything else is candy.

Depending on how many times/how this function is called, gcc could've
done any number of register allocation and tree/loop optimizations
based on the code.

Once I fooled gcc into generating "stupid" code, the output was very
similar, but more optimized than yours -- partly because ebp was
clobbered.
I fixed the function to *not* clobber ebp. Misusing ebp is lame.

In case you're wondering, yes, gcc inlined the rep stosd.
However, it chose rep stosb instead, because I did not give it a
guarantee of alignment (that would be a simple __attribute__).
  
I wonder what compiler you are using then. I tried it with our current RosBE with maximum optimization and it didn't do that. The same goes for gcc 4.4.0 and msc (I tested the one that ships with the WDK 2008), also with maximum optimization for speed.
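
For reference, one way to give gcc an alignment guarantee might look like the sketch below. It uses __builtin_assume_aligned (available in GCC 4.7 and later, so newer than the compilers tested here) rather than the __attribute__ mentioned above, and the function name fill32_aligned is hypothetical. What the compiler actually emits for the loop (rep stosb, rep stosd, SSE stores or a plain loop) depends on the version and options, so the disassembly still has to be checked:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit fill that promises the optimizer a 16-byte
   aligned destination. Passing a misaligned pointer breaks the promise. */
void fill32_aligned(uint32_t *dst, size_t count, uint32_t color)
{
    /* Debug-build check of the promise made below. */
    assert(((uintptr_t)dst & 15u) == 0);

    /* GCC builtin: returns dst and tells the optimizer it is 16-byte aligned. */
    uint32_t *p = __builtin_assume_aligned(dst, 16);

    for (size_t i = 0; i < count; i++)
        p[i] = color;
}
```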

More importantly however, once I selected -mtune=core2, gcc destroyed
you. It made uglier (less compact) code, but didn't use any push/pops
at all, and moved data directly into the stack. It also used some more
  
I used push/pop instead of movs to improve readability, as it doesn't really matter. If I had been after ultra optimization, I could have squeezed out a few more cycles. Optimizing the loop was sufficient for me.

exotic checks and operands, because it KNEW that this would be faster
on the Core 2 I was testing on. When I used -mtune=486, or -mtune=k6,
I once again got very different looking programs. Because gcc knew
what was best for each chip. You don't, and even if you did, you'd
have to write 50 versions of your assembly code.
  
Having one version that runs on all x86 machines and is faster than anything our current gcc can generate should be enough, thanks.

I also built it for x64 and ARM, and got fast code for those platforms
too -- your assembly code requires someone to port it.
  
True, I never said anything else. But this is not the question.

Additionally, gcc also aligned the code, and certain parts of the
loop, to best suit the cache settings of the platform, and where the
code was actually located in the binary.

Finally, on certain platforms, gcc chose to call memset instead, and
provided a highly optimized memset implementation which even used SSE
4.1 if required (if it determined it would be fastest for this set of
inputs). Again, your rep movsd, while fast on 486, is slow as molasses
on newer Core processors (or even the P3), because it gets micro-coded
and has to do a lot of pre-setup work.
  
As already mentioned, memset doesn't work. And how does the compiler know whether something is worth the hassle or not? It's about 15 cycles for a rep; call a subfunction and you quickly get more than 15 cycles of overhead. How can the compiler possibly "determine it would be fastest for this set of inputs" without profiling? Again, theory.
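
A short sketch of one reason memset cannot stand in for a 32-bit fill in general, assuming the function writes 32-bit values (as the rep stosd discussed above suggests): memset replicates a single byte, so it only matches when all four bytes of the value are equal. Both function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when all four bytes of color are identical, which is the
   only case where a byte-wise memset produces the same buffer. */
int memset_can_replace_fill32(uint32_t color)
{
    uint32_t b = color & 0xFFu;
    return color == b * 0x01010101u;
}

/* Hypothetical 32-bit fill: memset for byte-replicated values,
   a plain store loop for everything else. */
void fill32_any(uint32_t *dst, size_t count, uint32_t color)
{
    if (memset_can_replace_fill32(color)) {
        memset(dst, (int)(color & 0xFFu), count * sizeof *dst);
        return;
    }
    for (size_t i = 0; i < count; i++)
        dst[i] = color;
}
```

A value like 0x00FF8040 takes the loop, while 0x00000000 or 0xABABABAB can go through memset.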

I don't know if you were trying to bait me -- I respect you and I'm
pretty sure you knew these facts, so I'm surprised about this
"challenge".
  
The challenge was obviously the compiler. Please let us know which version of gcc you were using and with what options; it seems to be way more sophisticated than all the compilers/options I know.
I will be the first to replace the asm version with a C implementation as soon as we use a proper gcc with decent optimization in ReactOS that creates faster code. But I currently don't see this.

You talked about compiler optimization, and what it could theoretically do here and there, but the only thing that is worth optimizing in this function is the loop and here you managed to get a lousy rep stosb, not a stosd or even SSE stuff? And what about the rest of the loop? And where's the code? Where's the disassembly? I don't care if the compiler "can do" or "could decide to do" something. I only care about what comes out at the end. Quite disappointing what I've seen so far.

What you are saying is like: no one uses a plane nowadays, because trains are way faster. That might be true for a Transrapid going at 500 km/h, while a crop duster might only make 200 km/h. But that doesn't count when you plan a journey from Boston to San Francisco. :-P

You do not win until reproducible and usable results are there.

Regards,
Timo

Best regards,
Alex Ionescu



On Mon, Aug 3, 2009 at 7:05 PM, WaxDragon<waxdragon@gmail.com> wrote:
  
Your kung-fu is the best, Alex.

On Aug 3, 2009 7:22 PM, "Alex Ionescu" <ionucu@videotron.ca> wrote:

Just got back to San Francisco... I will take you up on the challenge.
Your ass is grass, and I'm the lawnmower.
Best regards,
Alex Ionescu

On Mon, Aug 3, 2009 at 11:15 AM, Timo Kreuzer <timo.kreuzer@web.de> wrote:
    
yeah ;-) > > Dmitr...
      
_______________________________________________
Ros-dev mailing list
Ros-dev@reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev
