Your kung-fu is the best, Alex.
On Aug 3, 2009 7:22 PM, "Alex Ionescu" ionucu@videotron.ca wrote:
Just got back to San Francisco... I will take you up on the challenge. Your ass is grass, and I'm the lawnmower.
Best regards, Alex Ionescu
On Mon, Aug 3, 2009 at 11:15 AM, Timo Kreuzer timo.kreuzer@web.de wrote: > yeah ;-) > > Dmitr...
I won.
I actually had to spend the better part of an hour convincing GCC *not* to optimize, just to make things fair.
You see, one of the top reasons why inline ASM loses vs C is that gcc understood the structure of my code -- it realized that I was calling the function with static parameters, and integrated them into the function itself.
I tried calling the function twice -- gcc actually inlined the function twice, with static parameters twice!
I finally settled on calling it with arguments from the command line, which are always changing.
But this proves one of the first points -- gcc will be able to analyze the form of your program, and make minute optimizations that are *impossible* to do in ASM. For example, it could decide "all functions calling this function will store parameter 3 in ECX" and this will optimize the overall speed of the entire program, not only the function, plus save stack space. This is only an example of the many hidden optimizations it could decide to do.
Depending on how many times/how this function is called, gcc could've done any number of register allocation and tree/loop optimizations based on the code.
Once I fooled gcc into generating "stupid" code, the output was very similar, but more optimized than yours -- partly because ebp was clobbered. In case you're wondering, yes, gcc inlined the rep stos. However, it chose rep stosb instead of rep stosd, because I did not give it a guarantee of alignment (that would be a simple __attribute__).
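Something like this is all it would take (a sketch -- the buffer name and size are made up; the attribute is standard gcc syntax):

    #include <string.h>

    /* Promising 16-byte alignment lets gcc expand the fill with
     * aligned dword or SSE stores instead of falling back to a
     * conservative rep stosb. */
    static unsigned char surface[4096] __attribute__((aligned(16)));

    void clear_surface(void)
    {
        memset(surface, 0, sizeof(surface));
    }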
More importantly, however, once I selected -mtune=core2, gcc destroyed you. It made uglier (less compact) code, but didn't use any push/pops at all, and moved data directly onto the stack. It also used some more exotic checks and operands, because it KNEW that this would be faster on the Core 2 I was testing on. When I used -mtune=486 or -mtune=k6, I once again got very different-looking programs, because gcc knew what was best for each chip. You don't, and even if you did, you'd have to write 50 versions of your assembly code.
I also built it for x64 and ARM, and got fast code for those platforms too -- your assembly code requires someone to port it.
Additionally, gcc also aligned the code, and certain parts of the loop, to best suit the cache settings of the platform, and where the code was actually located in the binary.
Finally, on certain platforms, gcc chose to call memset instead, and provided a highly optimized memset implementation which even used SSE 4.1 if required (if it determined it would be fastest for this set of inputs). Again, your rep stosd, while fast on a 486, is slow as molasses on newer Core processors (or even the P3), because it gets micro-coded and has to do a lot of pre-setup work.
I don't know if you were trying to bait me -- I respect you and I'm pretty sure you knew these facts, so I'm surprised about this "challenge".
Best regards, Alex Ionescu
Alex Ionescu wrote:
I won.
Did you?
I actually had to spend the better part of an hour convincing GCC *not* to optimize, just to make things fair.
You see, one of the top reasons why inline ASM loses vs C is that gcc understood the structure of my code -- it realized that I was calling the function with static parameters, and integrated them into the function itself.
What the compiler does not know is with what parameters this function is generally called, unless you profile the code and then reuse the profiling data to recompile, but I don't think that gcc supports that. So it has to rely on generic optimization. Hand-coded assembly can be optimized for the specific usage pattern. Anyway, that's theory.
I tried calling the function twice -- gcc actually inlined the function twice, with static parameters twice!
This function is neither static nor called with static parameters.
I finally settled on calling it with arguments from the command line, which are always changing.
But this proves one of the first points -- gcc will be able to analyze the form of your program, and make minute optimizations that are *impossible* to do in ASM. For example, it could decide "all functions calling this function will store parameter 3 in ECX" and this will optimize the overall speed of the entire program, not only the function, plus save stack space. This is only an example of the many hidden optimizations it could decide to do.
All the small optimizations "around" the function don't matter much in this case. You can assume that the function spends > 90% of the time inside the loop. So the loop needs to be optimized; everything else is candy.
Depending on how many times/how this function is called, gcc could've done any number of register allocation and tree/loop optimizations based on the code.
Once I fooled gcc into generating "stupid" code, the output was very similar, but more optimized than yours -- partly because ebp was clobbered.
I fixed the function to *not* clobber ebp. Misusing ebp is lame.
In case you're wondering, yes, gcc inlined the rep stos. However, it chose rep stosb instead of rep stosd, because I did not give it a guarantee of alignment (that would be a simple __attribute__).
I wonder what compiler you are using then. I tried it with our current RosBE with maximum optimization, and it didn't do that. Same with gcc 4.4.0 and msc (I tested the one that ships with the WDK 2008), also with maximum optimization for speed.
More importantly however, once I selected -mtune=core2, gcc destroyed you. It made uglier (less compact) code, but didn't use any push/pops at all, and moved data directly into the stack. It also used some more
I used push/pop in favour of movs to improve readability, as it doesn't really matter. If I had been up for ultra optimization, I could have squeezed out a few more cycles. Optimizing the loop was sufficient for me.
exotic checks and operands, because it KNEW that this would be faster on the Core 2 I was testing on. When I used -mtune=486, or -mtune=k6, I once again got very different looking programs. Because gcc knew what was best for each chip. You don't, and even if you did, you'd have to write 50 versions of your assembly code.
Having one version that runs on all x86 machines and is faster than anything our current gcc can generate should be enough, thanks.
I also built it for x64 and ARM, and got fast code for those platforms too -- your assembly code requires someone to port it.
True, I never said anything else. But this is not the question.
Additionally, gcc also aligned the code, and certain parts of the loop, to best suit the cache settings of the platform, and where the code was actually located in the binary.
Finally, on certain platforms, gcc chose to call memset instead, and provided a highly optimized memset implementation which even used SSE 4.1 if required (if it determined it would be fastest for this set of inputs). Again, your rep stosd, while fast on a 486, is slow as molasses on newer Core processors (or even the P3), because it gets micro-coded and has to do a lot of pre-setup work.
As already mentioned, memset doesn't work. And how does the compiler know whether something is worth the hassle or not? It's about 15 cycles for a rep; call a subfunction and you quickly get more than 15 cycles of overhead. How could the compiler possibly "determine it would be fastest for this set of inputs" without profiling? Again, theory.
I don't know if you were trying to bait me -- I respect you and I'm pretty sure you knew these facts, so I'm surprised about this "challenge".
The challenge was obviously the compiler. Please let us know which version of gcc you were using and with what options; it seems to be way more sophisticated than all the compilers/options I know. I'll be the first to replace the asm version with a C implementation, as soon as we use a proper gcc in ReactOS with decent optimization that creates faster code. But I currently don't see this.
You talked about compiler optimization and what it could theoretically do here and there, but the only thing worth optimizing in this function is the loop, and here you managed to get a lousy rep stosb, not a stosd or even SSE stuff? And what about the rest of the loop? And where's the code? Where's the disassembly? I don't care if the compiler "can do" or "could decide to do" something. I only care about what comes out at the end. Quite disappointing, what I've seen so far.
What you are saying is like: no one uses a plane nowadays, because trains are way faster. That might be true for a Transrapid going at 500 km/h while a crop duster only makes 200 km/h, but it doesn't count when you plan a journey from Boston to San Francisco. :-P
You do not win before reproducible and usable results are there.
Regards, Timo
On 4 Aug 2009, at 03:06, Timo Kreuzer wrote:
Alex Ionescu wrote:
You see, one of the top reasons why inline ASM loses vs C is that gcc understood the structure of my code -- it realized that I was calling the function with static parameters, and integrated them into the function itself.
What the compiler does not know is with what parameters this function is generally called, unless you profile the code and then reuse the profiling data to recompile, but I don't think that gcc supports that. So it has to rely on generic optimization. Hand-coded assembly can be optimized for the specific usage pattern. Anyway, that's theory.
A compiler that does link-time optimization, like MSVC and LLVM, will inline a small function like this one every time it is invoked, not only saving the call and the stack frame, but also the copying around of the arguments. If caller 1 happens to have the argument "iColor" in ebx, it will just inline the function in a way that works directly on ebx. And for caller 2, where iColor is in esi, it will inline it in a way that works with it in esi.
If you compile the source into an i386 binary .o file and analyze this one, the compiler does not know what registers hold the data *in general*, because you could still link it against anything. But at least it knows it for every single already existing invocation, and can generate corresponding code.
But as soon as the compiler/linker knows all invocations (because you create the final executable, and your function is not an exported symbol any more), instead of optimizing by inlining every invocation, it can also decide to choose its own calling conventions between the n callers and the function and go for a register based calling convention, and make sure all callers prepare their arguments so that they end up in the right registers without the need to copy them around.
If you don't compile with a link time optimizing compiler like MSVC or LLVM (yet), you can choose to have your function as a "static inline" in a header file instead of in a C file, and the compiler won't require a link time optimizer in order to do the optimized inlining.
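That could look like this (a sketch -- the file and function names are illustrative, not the actual ReactOS code):

    /* dib.h: every C file that includes this header gets its own
     * inlinable copy, so the compiler can specialize it per call
     * site without a link-time optimizer. */
    static inline void
    FillRow(unsigned long *p, unsigned long color, unsigned long count)
    {
        while (count--)
            *p++ = color;
    }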
I tried calling the function twice -- gcc actually inlined the function twice, with static parameters twice!
This function is neither static nor called with static parameters.
Not in the general case, but for every individual case, yes. And as soon as all possible callers are known to the toolchain, it might find some static property, like: A size is always divisible by four, a pointer is always aligned to a certain boundary, or a 64 bit argument will never be >= 2^32. You can't make these assumptions in assembly, because adding a single new caller will break your code. But in the C case, if you add a caller for which this property is not true any more, the compiler won't do this optimization any more.
But this proves one of the first points -- gcc will be able to analyze the form of your program, and make minute optimizations that are *impossible* to do in ASM. For example, it could decide "all functions calling this function will store parameter 3 in ECX" and this will optimize the overall speed of the entire program, not only the function, plus save stack space. This is only an example of the many hidden optimizations it could decide to do.
All the small optimizations "around" the function don't matter much in this case. You can assume that the function spends > 90% of the time inside the loop. So the loop needs to be optimized; everything else is candy.
So you are saying you only care that your rep stos is fast, and you don't care about the rest, because it's only 10%? In that case, please write your function in C, and replace only the memset() with inline asm, in order to make the code less error-prone, more readable, more maintainable and more portable, without losing any of your goals (= speed in the tight loop).
But if it's really just the memset() that you think is slow, then you should fix the memset(), and have your compiler inline the version that you think is fastest. Or maybe the version it is inlining today is faster than what you think is fastest.
I wonder: have either of you, Alex or Timo, actually *benchmarked* the code on some sort of native i386 CPU before arguing whether it should be a stosb or a stosd? If not, writing assembly would be a clear case of "premature optimization".
Depending on how many times/how this function is called, gcc could've done any number of register allocation and tree/loop optimizations based on the code.
Once I fooled gcc into generating "stupid" code, the output was very similar, but more optimized than yours -- partly because ebp was clobbered.
I fixed the function to *not* clobber ebp. Misusing ebp is lame.
If you compile debug, you want ebp to be sane. If you compile release (i.e. for performance), you usually do not care. In any case, with C code you can decide whether you want debuggability or speed with a compile-time switch. With assembly, you cannot. And please don't start with #ifdefs.
In case you're wondering, yes, gcc inlined the rep stos. However, it chose rep stosb instead of rep stosd, because I did not give it a guarantee of alignment (that would be a simple __attribute__).
I wonder what compiler you are using then. I tried it with our current RosBE with maximum optimization and it didn't do that. Same with gcc 4.4.0 and msc (I tested the one that ships with the WDK 2008) also with maximum optimization for speed.
If the compiled code looks suboptimal, these should be the steps to take:
1. Understand whether the code is actually suboptimal, i.e. is it slow? Maybe the compiler uses stosb for a reason -- for example because stosb is faster on unaligned data, and the compiler could not make any assumptions about the alignment of the data.
2. In case the code is bad because the compiler couldn't make an assumption, i.e. it did not know enough, then give it hints. GCC takes lots of hints (see the sketch after this list).
3. In case the compiler was right, because you were making an assumption that cannot be made in general, then you should leave it the way it is, because the compiler is probably doing the right thing.
4. If the compiler is really acting stupid, compare it to other compilers to be really sure, and then file a bug with the compiler.
5. Refrain from hand-coding something that works around a compiler *performance* issue unless it is really, really necessary. If you have to do so, clearly mark the code with a bug tracker number and state that it should be reverted to C as soon as the compiler bug is fixed.
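For step 2, a branch-expectation hint is a typical example (a sketch -- __builtin_expect is real gcc syntax; the function around it is made up):

    /* Tell gcc which outcome is the common case, so it lays out the
     * fast path as the fall-through. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int checked_div(int a, int b)
    {
        if (unlikely(b == 0))   /* hinted as the rare error path */
            return 0;
        return a / b;
    }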
In short: nowadays, compilers are really, really smart. Amazingly smart. If you think you're better, then you're either doing it wrong, making assumptions that you haven't told the compiler about, or it's a compiler bug. In no case should you do something by hand unless you perfectly understand *why* the compiler is doing it some way.
And yes, I understand that current compilers may have some limitations concerning SIMD instruction sets. (...although even this isn't strictly true any more today, if you know how to give the compiler the right hints and use the right intrinsics.) But this case is not one of these.
I used push/pop in favour of movs to improve readability, as it doesn't really matter. If I had been up for ultra optimization, I could have squeezed out a few more cycles. Optimizing the loop was sufficient for me.
See above: If all you want to optimize is the loop, then have C code with asm("rep stosd") in it, or fix the static inline memset() to be more efficient (if it isn't efficient in the first place).
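Roughly like this (a sketch that assumes a dword-aligned destination; the helper name is made up, and the fallback branch keeps it portable):

    /* The function body stays portable C; only the inner fill is
     * x86-specific inline asm. */
    static inline void fill_dwords(void *dst, unsigned long color,
                                   unsigned long count)
    {
    #if defined(__GNUC__) && defined(__i386__)
        void *d0;
        unsigned long c0;
        __asm__ __volatile__("rep stosl"   /* AT&T spelling of rep stosd */
                             : "=D"(d0), "=c"(c0)
                             : "0"(dst), "1"(count), "a"(color)
                             : "memory");
    #else
        unsigned long *p = dst;
        while (count--)
            *p++ = color;
    #endif
    }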
The challenge was obviously the compiler. Please let us know which version of gcc you were using and with what options, it seems to be way more sophisticated than all the compilers/options I know.
While I would indeed be interested in seeing the compiler output both of you get, because otherwise all this is pretty theoretical, I also think that it doesn't matter in the big picture. Even if today's compiler doesn't generate perfect code for a task as simple as this one (and ReactOS' toolchain doesn't use the latest GCC today), tomorrow's compiler will, and it's not worth doing any assembly any more today.
You talked about compiler optimization, and what it could theoretically do here and there, but the only thing that is worth optimizing in this function is the loop and here you managed to get a lousy rep stosb, not a stosd or even SSE stuff?
Now if we care a lot about memset() performance, then we could add many more tricks to the codebase. The memset() code in the "commpage" of Mac OS X is pretty fascinating, for example. Depending on the CPU type, different versions of memset() are in place, and some versions first look at the size of the region and branch to one of three different implementations. Small regions use rep stosb, bigger ones SSE or the like. (I'm sure NT does something similar.)
If you are more into static optimization, or want to give the compiler hints about the typical size of the region, you could do some clever macro hacks that supply the typical size range as an extra argument to memset(), so that the best version can be chosen at preprocessing time, without the need for the size check in the compiled code -- if, for example, DIB_32BPP_ColorFill() typically did fills of smaller sizes.
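Something in this direction (a sketch -- every name is hypothetical, and the two fill helpers are assumed to exist elsewhere):

    #define FILL_SMALL 0   /* caller expects small regions */
    #define FILL_LARGE 1   /* caller expects big blocks    */

    /* The hint is a compile-time constant at each call site, so the
     * dead branch folds away and no runtime size check remains. */
    #define fill32_hint(dst, color, count, hint)        \
        do {                                            \
            if ((hint) == FILL_SMALL)                   \
                fill32_loop((dst), (color), (count));   \
            else                                        \
                fill32_sse((dst), (color), (count));    \
        } while (0)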
Michael
Below is my C code, based on the C code previously shown here, and the assembly generated by VC. This function, like most, does not benefit much from asm coding, although some cycles can be saved, most notably inside the loop (a cmp and an additional branch in the VC-generated code). Some algorithms can benefit a lot from asm, though -- for example the Fletcher checksum, or incrementing/decrementing variables larger than the register size, where use of the carry flag can save many cycles. A function whose execution time is very critical may also deserve asm coding, but I think in this case it is not worth it, as the saving is a tiny percentage (any compiler I know will use rep stosd for the inner loop, which has the largest weight in the total time).
BOOLEAN DIB_32BPP_ColorFill(SURFOBJ* pso, RECTL* prcl, ULONG iColor)
{
    LONG lDelta, cx, cy;
    char *pulLine;

    lDelta = pso->lDelta;
    pulLine = (char *)((char *)pso->pvScan0 + prcl->top * lDelta + (prcl->left << 2));

    cx = prcl->right - prcl->left;
    if (cx <= 0) return TRUE;

    cy = prcl->bottom - prcl->top;
    if (cy <= 0) return TRUE;

    ULONG *p;
    ULONG c;
    for (; cy--; pulLine += lDelta)
    {
        for (p = (ULONG *)pulLine, c = cx; c--; )
        {
            *p++ = iColor;
        }
    }

    return TRUE;
}
PUBLIC ?DIB_32BPP_ColorFill@@YAEPAU_SURFOBJ@@PAU_RECTL@@K@Z ; DIB_32BPP_ColorFill
; Function compile flags: /Ogtpy
_TEXT SEGMENT
?DIB_32BPP_ColorFill@@YAEPAU_SURFOBJ@@PAU_RECTL@@K@Z PROC ; DIB_32BPP_ColorFill
; Line 52
    mov ecx, DWORD PTR ds:4
; Line 54
    mov edx, DWORD PTR ds:8
    push ebp
    mov ebp, DWORD PTR ds:36
    imul ecx, ebp
    xor eax, eax
    mov eax, DWORD PTR [eax]
    push esi
    lea esi, DWORD PTR [ecx+eax*4]
    add esi, DWORD PTR ds:32
    sub edx, eax
; Line 55
    test edx, edx
; Line 56
    jle SHORT $LN22@DIB_32BPP_
    push ebx
; Line 58
    mov ebx, DWORD PTR ds:12
    sub ebx, DWORD PTR ds:4
; Line 59
    test ebx, ebx
; Line 60
    jle SHORT $LN21@DIB_32BPP_
    push edi
    npad 4
$LL18@DIB_32BPP_:
; Line 64
    dec ebx
; Line 66
    test edx, edx
    je SHORT $LN2@DIB_32BPP_
    mov ecx, edx
    xor eax, eax
    mov edi, esi
    rep stosd
$LN2@DIB_32BPP_:
    add esi, ebp
    test ebx, ebx
    jne SHORT $LL18@DIB_32BPP_
    pop edi
$LN21@DIB_32BPP_:
    pop ebx
$LN22@DIB_32BPP_:
    pop esi
; Line 72
    mov al, 1
    pop ebp
; Line 73
    ret 0
?DIB_32BPP_ColorFill@@YAEPAU_SURFOBJ@@PAU_RECTL@@K@Z ENDP ; DIB_32BPP_ColorFill
In asm I would write the loop as:

    mov eax, iColor
    mov ebx, pulLine
    mov edx, cy
L1: mov di, bx
    mov cx, _cx
    rep stosd
    add dx, lDelta
    dec dx
    jnz l1
Jose Catena DIGIWAVES S.L.
A correction of my previous msg:
In asm I would write the loop as:

    mov eax, iColor
    mov ebx, pulLine
    mov edx, cy
L1: mov edi, ebx
    mov ecx, _cx
    rep stosd
    add ebx, lDelta
    dec edx
    jnz l1
It is not possible to optimize the loop further AFAIK, and this only saves a cmp and jnz in the outer loop, a tiny gain.
Jose Catena DIGIWAVES S.L.
Jose Catena wrote:
A correction of my previous msg:
In asm I would write the loop as:

    mov eax, iColor
    mov ebx, pulLine
    mov edx, cy
L1: mov edi, ebx
    mov ecx, _cx
    rep stosd
    add ebx, lDelta
    dec edx
    jnz l1
It is not possible to optimize the loop further AFAIK, and this only saves a cmp and jnz in the outer loop, a tiny gain.
That looks almost like the version I wrote, except I didn't use memory access inside the loop, but registers, which should save some cycles.
That looks almost like the version I wrote, except I didn't use memory access inside the loop, but registers, which should save some cycles.
Yes, the same. Register vs. memory makes no difference there; it takes the same time if it is in the L1 cache (both just 1 cycle), and it will always be there.
Jose Catena DIGIWAVES S.L.
Michael Steil wrote:
I wonder: have either of you, Alex or Timo, actually *benchmarked* the code on some sort of native i386 CPU before arguing whether it should be a stosb or a stosd? If not, writing assembly would be a clear case of "premature optimization".
I did. On an Athlon X2 64, I called the function a bunch of times with a 100x100 rect, measuring time with rdtsc. The results were quite random, but roughly:

    asm: ~580
    gcc 4.2 -march=k8 -fexpensive-optimizations -O3: ~1800
    WDK /GL /Oi /Ot /O2: ~2600
    MSVC 2008 Express /GL /Oi /Ot /O2: ~1800
Using a 50x50 rect shifts the advantage slightly further in the direction of the asm implementation.
I added volatile to the pointer to prevent the loop from being optimized away. Using memset was a bit slower than a normal loop. This is what msvc produced with the above settings:
_DIB_32BPP_ColorFill:
    push ebx
    mov ebx, [eax+8]
    sub ebx, [eax]
    test ebx, ebx
    jg short label1
    xor al, al
    pop ebx
    retn

label1:
    mov ecx, [eax+4]
    push esi
    mov esi, [eax+0Ch]
    sub esi, ecx
    test esi, esi
    jg short label2
    pop esi
    xor al, al
    pop ebx
    retn

label2:
    mov eax, [edx+4]
    imul ecx, eax
    add ecx, [edx]
    cdq
    and edx, 3
    add eax, edx
    sar eax, 2
    add eax, eax
    push edi
    mov edi, ecx
    add eax, eax
    jmp short label3

    align 10h
label3:
    mov ecx, edi
    mov edx, ebx

label4:
    mov dword ptr [ecx], 3039h
    add ecx, 4
    sub edx, 1
    jnz short label4

    dec esi
    add edi, eax
    test esi, esi
    jg short label3

    pop edi
    pop esi
    mov al, 1
    pop ebx
    retn
I thought I had done something wrong myself. For me, no compiler was able to generate code as fast as the asm code. I don't know how Alex managed to get better optimizations -- maybe he knows a secret ninja /Oxxx switch, or maybe the Express and WDK versions both suck at optimizing, or maybe I'm just too stupid... ;-)
See above: If all you want to optimize is the loop, then have C code with asm("rep stosd") in it, or fix the static inline memset() to be more efficient (if it isn't efficient in the first place).
I tried __stosd(), which actually resulted in a faster function: at ~610, gcc was almost as fast as the asm implementation, and msvc actually won with ~590. But that was not pure portable code any more. It seems to be the best solution, although it will probably still be slower unless we set our optimization to max.
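For reference, the __stosd() variant looks roughly like this (a sketch -- __stosd is a real MSVC intrinsic declared in <intrin.h>; the wrapper name is made up, and gcc needs an inline-asm equivalent instead):

    #include <intrin.h>

    static void fill_row(unsigned long *p, unsigned long color,
                         unsigned long count)
    {
        __stosd(p, color, count);   /* emits an inline rep stosd */
    }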
Btw, I already thought about rewriting our dib code some time ago, using inline functions instead of a code generator. The idea is to make it fully portable, optimizable through inline asm functions where useful, and easier to maintain than the current stuff. It's on my list...
Timo
I will provide some code and timings but I love how you ignored my main points:
1) The optimizations of the code *around* the function (ie: the callers), which Michael also pointed out, cannot be done in ASM.
2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and Nehalem, you will get totally different results with your ASM code, while the compilers will generate the best possible code.
3) The fact that someone will now have to write optimized versions for each other architecture.
4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtin_ia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
Also, gcc does support profiling, another fact you don't seem to know. However, with linker optimizations you do not need a profiler; the linker will do the static analysis.
Also, to everyone saying things like "I was able to save a <operand name here>", I hope you understand that smaller != faster.
Best regards, Alex Ionescu
Alex Ionescu wrote:
I will provide some code and timings but I love how you ignored my main points:
And you ignored my main points:
1) The optimization "around" the function is not important, as the function is not called that often; the loop is much more important.
2) It doesn't matter if the function performs differently on different machines, as long as it's always faster than the portable code.
3) No one is forced to write optimized versions; we have a C version for all other architectures.
4) I'm not worried about the loop -- the loop is fine the way I wrote it ;-) I just claimed that you couldn't provide a faster C version. Faster in terms of real-life usage. And I'm still waiting for you to prove me wrong.
On 4-Aug-09, at 12:36 PM, Timo Kreuzer wrote:
Alex Ionescu wrote:
I will provide some code and timings but I love how you ignored my main points:
And you ignored my main points:
1) The optimization "around" the function is not important, as the function is not called that often; the loop is much more important.
If the function is not called often, then your using ASM as an optimization is what we call "premature optimization".
You should spend your time profiling the codebase and identifying real bottlenecks.
2) It doesn't matter if the function performs differently on different machines, as long as it's always faster than the portable code.
Which is an assumption you're making. My point is that it won't be.
3) No one is forced to write optimized versions; we have a C version for all other architectures.
So now we have a huge performance delta (if it were to exist) between different architectures, as well as two code bases (and possibly more, as people write more ASM versions), and eventually there are 10 versions of the same function, with different bugs.
Great!
(Please don't write code in a real company's product, kthx).
4) I'm not worried about the loop -- the loop is fine the way I wrote it ;-)
I don't get it. You just claimed the loop is "90%", and that yours is better because it's in ASM and uses REP STOSD. So take the C version, and make it an inline REP STOSD instead of a memset, and you now have your exact code, but written in C.
I just claimed that you couldn't provide a faster C version. Faster in terms of real-life usage. And I'm still waiting for you to prove me wrong.
Best regards, Alex Ionescu
With all respect, Alex: although I agree with you at the core -- that this does not deserve the disadvantages of asm (portability, readability, etc.) for a tiny performance difference, if any -- I don't agree with many of your arguments.
--> 1) The optimizations of the code *around* the function (ie: the callers), which Michael also pointed out, cannot be done in ASM.
<-- Yes, it can. I could always outperform or match a C compiler at that, and did so many times (I'm the author of an original PC BIOS, performance libraries, mission-critical systems, etc.). I very often used regs for calling params, local storage through SP instead of BP, good use and reuse of registers, etc. In fact, the loop the compiler generated was identical to the asm source except for the two instructions the compiler added (which serve no purpose; it is an msvc issue). It is actually in the calling overhead and local initialization and storage where I could easily beat the compiler, since it complies with rules that I can safely break. Furthermore, in most cases a compiler won't change the calling convention unless the source specifies it, and in any case the register-based calling used by compilers is way more restricted than what can be done in asm, which can always use more efficient methods (more extensive and intelligent register allocation). In any case, the most important optimizations are equally done in C and assembly when the programmer knows how to write optimum code and does not have to comply with a prototype. For example, passing arguments as a pointer to a struct is always more efficient.
--> 2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and Nehalem, you will get totally different results with your ASM code, while the compilers will generate the best possible code.
<-- There are very few and specific cases where the optimum code for different processors is different, and this is not the case. If gcc generates different code for this function on different CPUs, it is not for a good reason. There is only one meaningful exception for this function: the inner loop could use a 64-bit rep stos instead of a 32-bit one. And in that case it can be done in asm, while I don't know any compiler that would use a 64-bit rep stos instruction for a 32-bit target, regardless of the CPU having 64-bit registers.
--> 4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtin_ia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
<-- It is not necessary to use a built-in function like you mention, because any optimizing compiler will use rep stosd anyway, with better register allocation if any different. If inline asm is used instead, optimizations for the whole function are disabled, as the compiler does not analyze what's done in inline assembly.
--> Also, gcc does support profiling, another fact you don't seem to know. However, with linker optimizations, you do not need a profiler, the linker will do the static analysis.
<-- Function-level linking and profiling-based optimization are very different things; the linker can in no way perform a similar statistical analysis.
--> Also, to everyone saying things like "I was able to save a <operand name here>", I hope you understand that smaller != faster.
<-- Saving these two instructions improves both speed and size. Note that the loop the compiler generated was exactly the same as the original assembly, only with those two instructions added. I discern where I save speed, size, both, or neither, in either C or assembly.
I wrote this not to be argumentative or confrontational, but because I don't like to read arguments that are not true, and I hope you all take this as constructive knowledge. BTW, I hardly support the use of assembly except in very specific cases, and this is not one. I disagreed with Alex on the arguments, not at the core.
Jose Catena DIGIWAVES S.L.
Note to everyone else: I just spent some time to do the calculations and have data proving C code can be faster -- I will post tonight from home.
Now to get to your argument, Jose..
Best regards, Alex Ionescu
On Tue, Aug 4, 2009 at 2:19 PM, Jose Catena jc1@diwaves.com wrote:
With all respect, Alex: although I agree with you at the core -- that this does not deserve the disadvantages of asm (portability, readability, etc.) for a tiny performance difference, if any -- I don't agree with many of your arguments.
Also keep in mind Timo admitted "This code is not called often", making ASM optimization useless.
--> 1) The optimizations of the code *around* the function (ie: the callers), which Michael also pointed out, cannot be done in ASM.
<-- Yes, it can. I could always outperform or match a C compiler at that, and did many times (I'm the author of an original PC BIOS, performance libraries, mission critical systems, etc). I very often used regs for calling params, local storage through SP instead of BP, good use and reuse of registers, etc.
An optimizing compiler will do this too.
In fact, the loop the compiler generated was identical to the asm source except for the two instructions the compiler added (which serve no purpose; it is an msvc issue).
Really? Here's sample code from my faster C version:
.text:004013E0    lea eax, [esi+eax*4]
.text:004013E3    lea esi, ds:0[edi*4]
.text:004013EA    lea eax, [ebp+eax+0]
.text:004013EE    db 66h
.text:004013EE    nop
99% of people on this list (and you, probably) will tell me "this is a GCC issue" or that this is "useless code".
Guess what: I compiled with -mtune=core2, and this code sequence is specifically generated before the loop.
Neither Timo, nor even myself, I admit, would think of adding this kind of code. But once I asked some experts what this does, I understood why it's there.
To quote Michael "if you think the compiler is generating useless code, try to find out what the code is doing." In most cases, your thinking that it is "wrong" or "useless" is probably wrong itself.
As a challenge, can you tell me the point of this code? Why is it written this way? If I build for 486 (which is what ALL OF YOU SEEM TO BE STUCK ON!!!), I get code that looks like Timo's.
It is actually in the calling overhead and local initialization and storage where I could easily beat the compiler, since it complies with rules that I can safely break.
That doesn't make any sense -- you are AGREEING with me. My point is that a compiler will break normal calling rules, while the assembly code has to respect at least some rules, because you won't know a priori all your callers (you might in a BIOS, but not in a giant codebase like win32k). The compiler, on the other hand, DOES know all the callers, and will happily clobber registers, change the calling convention, etc. Please re-read Michael's email.
Furthermore, in most cases a compiler won't change calling convention unless the source specifies it
Completely not true. Compilers will do this. This is not 1994 anymore.
, and in any case the register-based calling used by compilers is way more restricted than what can be done in asm, which can always use more efficient methods (more extensive and intelligent register allocation).
Again, simply NOT true. Today's compilers will be able to do things like "All callers of foo must have param 8 in ECX", and will write the code that way, not to save/restore ECX, and to use it as a parameter. You CANNOT do this in assembly unless you have a very small number of callers that you know nobody else will touch. As soon as someone else adds a caller, YOU have to do all the work to make ECX work that way.
You seem to have a very 1990s understanding of how compilers work (respecting calling conventions, saving/restoring registers, not touching ebp, etc.) -- probably because you worked on BIOSes, which, yes, at that time worked that way.
Please read a bit into technologies such as LLVM or Microsoft's link-time code generator.
In any case, the most important optimizations are equally done in C and assembly when the programmer knows how to write optimum code and does not have to comply with a prototype.
Again, NO. Unless you control all your call sites and are willing to update the code every single time a call site gets added, the compiler WILL beat you. LLVM and LTCG can even go 2-3 call sites away, such that callers of foo, which call bar, which call baz, have some sort of stack frame or register content that will make bar/baz faster.
For example, passing arguments as a pointer to a struct is always more efficient.
It actually depends, and again the compiler can make this choice.
--> 2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and Nehalem, you will get totally different results with your ASM code, while the compilers will generate the best possible code.
<-- There are very few and specific cases where the optimum code for different processors is different, and this is not the case.
False. I got radically different ASM when building for K8, i7, Core 2 and Pentium.
If gcc generates different code for this function on different CPUs, it is not for a good reason.
Excuse me?
There is only one meaningful exception for this function: the inner loop could use a 64-bit rep stos instead of a 32-bit one. And in that case it can be done in asm, while I don't know any compiler that would use a 64-bit rep stos instruction for a 32-bit target, regardless of the CPU having 64-bit registers.
Again, this is full of assumptions. You seem to be saying "GCC is stupid, I know better". Yet you don't even understand WHY gcc will generate different code for different CPUs.
Please read into the topics of "pipelines" and "caches" and "micro-operations" as a good starting point.
--> 4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtin_ia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
<-- It is not necessary to use a built-in function like you mention, because any optimizing compiler will use rep stosd anyway, with better register allocation if any different.
Ummm, if you think "rep stosd" is what an optimizing compiler will use, then I'm sorry, but you don't have the credentials to be in this argument, and I'm wasting my time. Rep stosd is the SLOWEST way to achieve this loop on modern CPUs. On my Core 2 build, for example, gcc used "mov" and a loop instead. Only when building for the Pentium 1 did it use a rep stosd.
Please stop thinking that 1 line of ASM is faster than 12 lines just because 12 > 1. On modern CPUs, a "manual" loop will be faster than a rep stosd nearly ALWAYS.
If inline asm is used instead, optimizations for the whole function are disabled, as the compiler does not analyze what's done in inline assembly.
LOL??? Again, maybe true in the 1990s. But first of all:
1) Built-ins are not "inline asm", and will be optimized.
2) GCC and MSVC both will optimize the inline assembler according to the function the inline is present in. The old mantra that "inline asm disables optimizations" hasn't been true since about 2001...
In fact, when assembly is *required* (for something like a trap save), it is ALWAYS better to use an inline __asm__ block within the C function than to call an external function in an .S or .ASM file, because compilers like gcc will be able to fine-tune the assembly you wrote and modify it to work better with the C code. LTCG will, in some cases, optimize the ASM you wrote by hand in the external .ASM file as well.
--> Also, gcc does support profiling, another fact you don't seem to know. However, with linker optimizations, you do not need a profiler, the linker will do the static analysis.
<-- Function-level linking and profiling-based optimization are very different things; the linker can in no way perform a similar statistical analysis.
But it can make static analysis.
--> Also, to everyone saying things like "I was able to save a <operand name here>", I hope you understand that smaller != faster.
<-- Saving these two instructions improves both speed and size. Note that the loop the compiler generated was exactly the same as the original assembly, only with those two instructions added. I discern where I save speed, size, both, or neither, in either C or assembly.
I wrote this not to be argumentative or confrontational, but because I don't like to read arguments that are not true, and I hope you all take this as constructive knowledge. BTW, I hardly support the use of assembly except in very specific cases, and this is not one. I disagreed with Alex on the arguments, not at the core.
Thanks Jose, but unfortunately you are wrong. If we were having this argument in:
1) 1986
2) on a 486
3) about BIOS code (which is small and rarely extended, with all calls "Controlled")
I would bow down and give you my hat in an instant, but times have changed.
I don't want to waste more time on these arguments, because I know I'm right, and I've asked several people, who all agree with me -- people that work closely with Intel, compiler technology and assembly. I cannot convince people that don't even have the basic knowledge to UNDERSTAND the arguments. Do some reading, then come back.
I will post numbers and charts when I'm home; at least they will provide some "visual" confirmation of what I'm saying, but I doubt that will be enough.
Jose Catena DIGIWAVES S.L.
Jose Catena wrote:
--> 4) The fact that if the loop is what you're truly worried about, you can optimize it by hand with __builtin_ia32_rep_movsd (and MSVC has a similar intrinsic), and still keep the rest of the function portable C.
<-- It is not necessary to use a built-in function like you mention, because any optimizing compiler will use rep stosd anyway, with better register allocation if any different. If inline asm is used instead, optimizations for the whole function are disabled, as the compiler does not analyze what's done in inline assembly.
- A builtin / intrinsic != inline asm.
- Using inline asm doesn't disable optimization of the function, at least not for gcc. Of course the compiler cannot optimize what's inside the asm, but how would you want to optimize "rep stosd" anyway?
- As I already stated: none of the compilers I tested was able to generate a rep stosd from either a loop or memset, even at maximum optimization. So either I did something wrong in all cases, or the above statement about "any optimizing compiler" is simply not true.
Regards, Timo
- A builtin / intrinsic != inline asm
I never said that. I used "instead". I apologize for not being clear enough.
but how would you want to optimize "rep stosd" anyway?
No way. That's what I said, possibly with the exception of using a 64-bit equivalent if we could assume that the CPU is 64-bit capable. But Alex knows better; he's calling me ignorant. He says that
L1: mov [edi], eax
    add edi, 4
    dec ecx
    jnz L1
Is faster than
rep stosd
Both things do exactly the same thing, the latter much smaller AND FASTER on any CPU from the 386 to the i7. And he shows an irrelevant portion of code to prove nothing regarding what I said; BTW, we don't know what his compiler generated for the loop. In other cases he changes the meaning of what I wrote, corrects something I didn't say at all, or makes unfounded assumptions. I'm not going to answer him, LOL! This would be an endless loop. Anyway, I always agreed with him that asm is not helpful in this and most cases. This discussion is a waste of time. I thought from previous posts that he had better knowledge, and perhaps he has, but he certainly does not know much about assembly and CPU architectures, yet he pretends to, and doesn't like to be corrected... bad for him.
none of the compilers I tested was able to generate a rep stosd from either a loop or memset
LOL, are we really in 2009? Try the C source I posted; it should be compiled as rep stosd. MSVC and Intel certainly do, regardless of the target CPU, and not just in recent versions. Let me know if yours doesn't; I wouldn't like a compiler that doesn't do such a basic and evident optimization. Most often I know pretty well what a compiler will generate without looking at the generated asm; the way the C code is written matters in some cases.
As for memset, MSVC's inline memset will generate rep stosd, and possibly a stosw and/or stosb if the byte count is not a multiple of the max size or not constant, which is OK. The library version also uses the same, with the call overhead. Anyway, memset is not suitable here: it is for 8-bit values and wmemset for 16-bit values, while we want to store 32-bit values.
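To illustrate (a sketch; the function names are made up):

    #include <string.h>

    /* memset replicates a single byte, so it can only produce fills
     * whose four bytes are all equal: */
    void fill_black(unsigned long *buf, size_t n)
    {
        memset(buf, 0, n * sizeof(*buf));   /* OK: 0x00000000 */
    }

    void fill_color_wrong(unsigned long *buf, size_t n)
    {
        /* BUG: memset truncates 0x00FF00FF to the byte 0xFF and
         * replicates it, storing 0xFFFFFFFF instead of the color. */
        memset(buf, 0x00FF00FF, n * sizeof(*buf));
    }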
Jose Catena DIGIWAVES S.L.
On 4 Aug 2009, at 17:37, Jose Catena wrote:
but how would you want to optimize "rep stosd" anyway?
No way. That's what I said, possibly with the exception of using a 64-bit equivalent if we could assume that the CPU is 64-bit capable. But Alex knows better; he's calling me ignorant. He says that
L1: mov [edi], eax
    add edi, 4
    dec ecx
    jnz L1
Is faster than
rep stosd
Both things do exactly the same thing, the latter much smaller AND FASTER on any CPU from the 386 to the i7.
I have done some tests on all generations of Intel CPUs since Yonah, and in all cases, rep stosd was faster than any loop I could craft or GCC would generate from my C code.
But this does *not* mean that
* rep stosd is by definition faster than a scalar loop
* rep stosd is by definition faster than any kind of loop.
Look at the test program at the end of this email. It compares rep stosd with a hand-crafted loop written with SSE instructions and SSE registers (parts borrowed from XNU).
On all tested machines, the SSE version is significantly faster (for big loops):
Yonah (Genuine Intel(R) CPU T2500 @ 2.00GHz): SSE is 3.34x faster than stosl
Merom (Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz): SSE is 4.86x faster than stosl
Penryn (Intel(R) Xeon(R) CPU 5150 @ 2.66GHz): SSE is 4.94x faster than stosl
Nehalem (Intel(R) Xeon(R) CPU E5462 @ 2.80GHz): SSE is 4.62x faster than stosl
So one should not assume that it's always a good idea to just use rep stosd. Use memset(), and have an optimized implementation of memset() somewhere else: one that can be inlined, and that checks the size and branches to the optimal implementation, like XNU does, for example:
http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
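To sketch the dispatch idea in C (this is not XNU code; memset_large() is a made-up stand-in for a tuned large-block implementation, and the 64-byte threshold is illustrative):

#include <string.h>

/* Stand-in for a tuned large-block path (SSE, rep stos, ...). */
static void *memset_large(void *dst, int c, size_t n)
{
    return memset(dst, c, n);
}

/* Inlinable front end: small counts take a simple loop, larger
   counts branch to the optimized path. */
static inline void *my_memset(void *dst, int c, size_t n)
{
    if (n < 64) {
        unsigned char *p = (unsigned char *)dst;
        while (n--)
            *p++ = (unsigned char)c;
        return dst;
    }
    return memset_large(dst, c, n);
}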
Michael
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define MIN(a,b) ((a)<(b)? (a):(b))

#define DATASIZE (1024*1024)
#define TIMES 10000

/* Serialized timestamp counter read ("=A" returns edx:eax on i386). */
static inline long long rdtsc64(void)
{
    long long ret;
    __asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
    return ret;
}

/* Four 16-byte SSE stores per iteration; ecx holds a byte offset
   counting down from DATASIZE/sizeof(int) until it goes negative.
   xmm0 is stored as-is (its content doesn't matter for timing);
   movdqa assumes the buffer is 16-byte aligned. */
static inline void sse(int *p)
{
    int c_new;
    char *p_new;
    asm volatile (
        "1:                            \n"
        "movdqa %%xmm0,(%%edi,%%ecx)   \n"
        "movdqa %%xmm0,16(%%edi,%%ecx) \n"
        "movdqa %%xmm0,32(%%edi,%%ecx) \n"
        "movdqa %%xmm0,48(%%edi,%%ecx) \n"
        "subl $64,%%ecx                \n"
        "jns 1b                        \n"
        : "=D"(p_new), "=c"(c_new)
        : "D"(p), "c"(DATASIZE/sizeof(int))
    );
}

/* rep stosl (AT&T name for rep stosd): stores ecx dwords of eax
   starting at edi. */
static inline void stos(int *p)
{
    int c_new;
    char *p_new;
    asm volatile (
        "rep stosl"
        : "=D"(p_new), "=c"(c_new)
        : "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
    );
}

int main()
{
    void *data = malloc(DATASIZE);
    long long t1, t2, t3, m1, m2;
    int i;

    t1 = rdtsc64();
    for (i = 0; i < TIMES; i++)
        sse(data);
    t2 = rdtsc64();
    for (i = 0; i < TIMES; i++)
        stos(data);
    t3 = rdtsc64();

    m1 = t2 - t1;
    m2 = t3 - t2;

    if (m1 > m2)
        printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
    else
        printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);

    return 0;
}
Also, rep movsd will be slower on small counts. On most processors, fewer than 8 iterations will be faster with discrete moves than with a rep.
This has changed lately: http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-win...
With blocks larger than 512 bytes, SSE/FPU code will always be faster.
Best regards, Alex Ionescu
On most processors, fewer than 8 iterations will be faster with discrete moves than with a rep.
I'd say more like 4 (separate moves), and it's not feasible if the number of iterations is variable, like in our case. A loop with many moves inside (or, even better, SSE stores), followed by a rep stosd for the remainder, would be possible, and indeed faster for large ecx counts. Does any compiler currently generate that automatically? None of the ones I know, but it can be done to some extent by writing it that way in C. Possible in asm? Of course. DMA fill? No joy. GPU-accelerated fill? Perhaps in the future. I keep thinking that this is not important enough to justify asm, not even to justify breaking the loop in two in C. At least not before ROS is complete and stable and we want to optimize every bit. And by then we may very well be thinking about GPU-accelerated GDI too.
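For illustration, "writing it that way in C" could look like this sketch (the name and the unroll factor are arbitrary):

#include <stddef.h>

/* Unrolled main loop, then a simple tail loop; a compiler may
   vectorize the first and emit rep stosd for the second. */
void FillDwords(unsigned int *dst, unsigned int value, size_t count)
{
    while (count >= 8) {
        dst[0] = value; dst[1] = value; dst[2] = value; dst[3] = value;
        dst[4] = value; dst[5] = value; dst[6] = value; dst[7] = value;
        dst += 8;
        count -= 8;
    }
    while (count--)
        *dst++ = value;
}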
Jose Catena DIGIWAVES S.L.
On Aug 5, 2009, at 11:43 AM, Jose Catena wrote:
stable and we want to optimize every bit. And by then we may very well be thinking about GPU-accelerated GDI too.
GDI is GPU-accelerated if the corresponding DDI driver supports it, and base support for this exists in ReactOS by design. In arwinss it is also a key part of the kernel-mode graphics module.
WBR, Aleksey Bragin.
have an optimized implementation of memset() somewhere else: one that can be inlined, and that checks the size and branches to the optimal implementation
Yep, that's a good way to minimize where asm is used (if asm is used at all), making general-purpose fast functions available to any other function. But don't call it memset (that fills byte values only); call it something like MemFill with a size parameter, or MemFill8, MemFill16, MemFill32, MemFill64. Good post, Michael.
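Hypothetical prototypes, just to illustrate the naming (not existing code):

#include <stddef.h>

void MemFill8 (void *dst, unsigned char value, size_t count);
void MemFill16(void *dst, unsigned short value, size_t count);
void MemFill32(void *dst, unsigned int value, size_t count);
void MemFill64(void *dst, unsigned long long value, size_t count);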
Jose Catena DIGIWAVES S.L.
Having faster code is tempting, but it's not yet needed. We are not even halfway to 1.0.
Keeping additional asm code will only limit the number of possible maintainers. Every routine is subject to multiple revisions until 1.0, so try not to make the process more difficult.