I won.
I actually had to spend the better part of an hour convincing GCC *not* to optimize, just to make things fair.
You see, one of the main reasons inline ASM loses to C is that gcc understands the structure of the code -- here it realized that I was calling the function with static parameters, and integrated them into the function itself.
I tried calling the function twice -- gcc actually inlined the function twice, with static parameters twice!
I finally settled on calling it with arguments from the command line, which are always changing.
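
A minimal sketch of the difference, for anyone who wants to look at the generated assembly themselves. This is not the code from the challenge: `fill_words`, `fill_constant` and `fill_runtime` are hypothetical names, and what gcc actually does with them depends on the version and flags used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the benchmarked routine: store `value`
 * into `count` consecutive 32-bit words. */
static void fill_words(uint32_t *dst, uint32_t value, size_t count)
{
    assert(dst != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}

/* Constant arguments: with optimization enabled, gcc is free to inline
 * the call and specialize it for these exact values. */
uint32_t fill_constant(void)
{
    uint32_t buf[16];
    fill_words(buf, 0xDEADBEEFu, 16);
    return buf[15];
}

/* Arguments known only at run time (for example strings taken from
 * argv): the compiler has to keep the general form of the loop. */
uint32_t fill_runtime(const char *value_str, const char *count_str)
{
    uint32_t buf[16] = {0};
    uint32_t value = (uint32_t)strtoul(value_str, NULL, 0);
    size_t count = (size_t)strtoul(count_str, NULL, 0);

    if (count > 16)
        count = 16; /* stay inside the buffer */
    fill_words(buf, value, count);
    return count ? buf[count - 1] : 0;
}
```

Passing `argv[1]` and `argv[2]` to `fill_runtime` from `main` corresponds to the command-line approach described above; comparing the `gcc -O2 -S` output of the two callers shows how much of the function survives in each case.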
But this proves one of the first points -- gcc will be able to analyze the form of your program, and make minute optimizations that are *impossible* to do in ASM. For example, it could decide "all functions calling this function will store parameter 3 in ECX" and this will optimize the overall speed of the entire program, not only the function, plus save stack space. This is only an example of the many hidden optimizations it could decide to do.
Depending on how many times/how this function is called, gcc could've done any number of register allocation and tree/loop optimizations based on the code.
Once I fooled gcc into generating "stupid" code, the output was very similar, but more optimized than yours -- partly because ebp was clobbered. In case you're wondering, yes, gcc inlined the rep stosd. However, it chose rep stosb instead, because I did not give it a guarantee of alignment (that would be a simple __attribute__).
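
For reference, an alignment guarantee could look roughly like the sketch below. It is only one possible form, not necessarily the attribute meant above: the buffer is declared with `__attribute__((aligned(16)))`, and the promise is passed on inside the function with `__builtin_assume_aligned`, a builtin that only exists in newer gcc releases. The names `fill_aligned` and `fill_aligned_demo` and the 16-byte figure are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill routine; the caller promises that dst is
 * 16-byte aligned. */
void fill_aligned(uint32_t *dst, uint32_t value, size_t count)
{
    /* Check the promise in debug builds. */
    assert(((uintptr_t)dst & 15u) == 0);

    /* Tell gcc about the alignment so it may pick wider stores. */
    uint32_t *p = __builtin_assume_aligned(dst, 16);

    for (size_t i = 0; i < count; i++)
        p[i] = value;
}

uint32_t fill_aligned_demo(void)
{
    /* The attribute forces the buffer itself onto a 16-byte boundary. */
    static uint32_t buf[64] __attribute__((aligned(16)));

    fill_aligned(buf, 0x01010101u, 64);
    return buf[63];
}
```

Whether this changes the instruction gcc selects (byte-wise versus dword-wise stores, or something else entirely) depends on the compiler version and the `-mtune` setting.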
More importantly however, once I selected -mtune=core2, gcc destroyed you. It made uglier (less compact) code, but didn't use any push/pops at all, and moved data directly into the stack. It also used some more exotic checks and operands, because it KNEW that this would be faster on the Core 2 I was testing on. When I used -mtune=486, or -mtune=k6, I once again got very different looking programs. Because gcc knew what was best for each chip. You don't, and even if you did, you'd have to write 50 versions of your assembly code.
I also built it for x64 and ARM, and got fast code for those platforms too -- your assembly code requires someone to port it.
gcc also aligned the code, and certain parts of the loop, to best suit the cache settings of the platform, and where the code was actually located in the binary.
Finally, on certain platforms, gcc chose to call memset instead, and provided a highly optimized memset implementation which even used SSE 4.1 if required (if it determined it would be fastest for this set of inputs). Again, your rep stosd, while fast on 486, is slow as molasses on newer Core processors (or even the P3), because it gets micro-coded and has to do a lot of pre-setup work.
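
A small sketch of the kind of code where this can happen. The function names are hypothetical, and whether gcc really turns the plain loop into a memset call depends on the compiler version, the optimization level and the target, so treat it as something to inspect with `gcc -O2 -S` rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plain byte loop: a candidate for gcc to recognize as a memset. */
void zero_loop(unsigned char *dst, size_t n)
{
    assert(dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = 0;
}

/* Explicit library call: the platform's tuned memset does the work. */
void zero_memset(unsigned char *dst, size_t n)
{
    assert(dst != NULL);
    memset(dst, 0, n);
}

/* Returns 1 when both variants leave identical buffer contents. */
int zero_demo(void)
{
    unsigned char a[100], b[100];

    memset(a, 0xFF, sizeof a);
    memset(b, 0xFF, sizeof b);
    zero_loop(a, sizeof a);
    zero_memset(b, sizeof b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both functions produce the same result; the point is only that the C source leaves the choice of instructions to the compiler and the C library.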
I don't know if you were trying to bait me -- I respect you and I'm pretty sure you knew these facts, so I'm surprised about this "challenge".
Best regards, Alex Ionescu
On Mon, Aug 3, 2009 at 7:05 PM, WaxDragon waxdragon@gmail.com wrote:
Your kung-fu is the best, Alex.
On Aug 3, 2009 7:22 PM, "Alex Ionescu" ionucu@videotron.ca wrote:
Just got back to San Francisco... I will take you up on the challenge. Your ass is grass, and I'm the lawnmower. Best regards, Alex Ionescu
On Mon, Aug 3, 2009 at 11:15 AM, Timo Kreuzer timo.kreuzer@web.de wrote:
yeah ;-) > > Dmitr...
Ros-dev mailing list Ros-dev@reactos.org http://www.reactos.org/mailman/listinfo/ros-dev