Also, rep movsd will be slower on small counts: on most processors, fewer than 8 iterations are faster with plain moves than with a rep.
This has changed lately: http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-win...
With blocks larger than 512 bytes, SSE/FPU code will always be faster.
On 4-Aug-09, at 9:50 PM, Michael Steil wrote:
On 4 Aug 2009, at 17:37, Jose Catena wrote:
but how would you want to optimize "rep stosd" anyway?
No way. That's what I said, possibly with the exception of using a 64-bit equivalent if we could assume that the CPU is 64-bit capable. But Alex knows better; he is calling me ignorant. He says that
l1:	mov [edi], eax
	add edi, 4
	dec ecx
	jnz l1
Is faster than
rep stosd
Both do exactly the same thing, the latter being much smaller AND FASTER on any CPU from the 386 to the i7.
I have done some tests on all generations of Intel CPUs since Yonah, and in all cases, rep stosd was faster than any loop I could craft or GCC would generate from my C code.
But this does *not* mean that
- rep stosd is by definition faster than a scalar loop
- rep stosd is by definition faster than any kind of loop.
Look at the test program at the end of this email. It compares rep stosd with a hand-crafted loop written with SSE instructions and SSE registers (parts borrowed from XNU).
On all tested machines, the SSE version is significantly faster (for big loops):
Yonah:      Genuine Intel(R) CPU T2500 @ 2.00GHz        SSE is 3.34x faster than stosl
Penryn:     Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz  SSE is 4.86x faster than stosl
Woodcrest:  Intel(R) Xeon(R) CPU 5150 @ 2.66GHz         SSE is 4.94x faster than stosl
Harpertown: Intel(R) Xeon(R) CPU E5462 @ 2.80GHz        SSE is 4.62x faster than stosl
So one should not assume that it is always a good idea to just use rep stosd. Use memset(), and have an optimized implementation of memset() somewhere else: one that can be inlined, checks the size, and branches to the optimal implementation, like XNU does, for example:
http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
Michael
/* Note: this is 32-bit x86 code (it uses edi/ecx and the "=A"
   rdtsc constraint); build with e.g. -m32 -msse2. */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define MIN(a,b) ((a)<(b)? (a):(b))

#define DATASIZE (1024*1024)
#define TIMES 10000

/* Read the time stamp counter, fenced so it is not reordered
   around the measured code. ("=A" means edx:eax, 32-bit only.) */
static inline long long rdtsc64(void)
{
	long long ret;
	__asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
	return ret;
}

/* Fill the buffer with 16-byte SSE stores, four per iteration.
   ecx starts at the last 64-byte block and counts down to 0, so the
   whole buffer is covered. (Whatever happens to be in xmm0 is stored;
   only the store speed matters for this test.) */
static inline void sse(int *p)
{
	int c_new;
	char *p_new;
	asm volatile (
		"1: \n"
		"movdqa %%xmm0,(%%edi,%%ecx) \n"
		"movdqa %%xmm0,16(%%edi,%%ecx) \n"
		"movdqa %%xmm0,32(%%edi,%%ecx) \n"
		"movdqa %%xmm0,48(%%edi,%%ecx) \n"
		"subl $64,%%ecx \n"
		"jns 1b \n"
		: "=D"(p_new), "=c"(c_new)
		: "D"(p), "c"(DATASIZE - 64)
		: "memory"
	);
}

/* Fill the buffer with rep stosl: ecx dword stores of eax. */
static inline void stos(int *p)
{
	int c_new;
	char *p_new;
	asm volatile (
		"rep stosl"
		: "=D"(p_new), "=c"(c_new)
		: "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
		: "memory"
	);
}

int main()
{
	void *data;
	long long t1, t2, t3, m1, m2;
	int i;

	/* movdqa requires 16-byte alignment, which plain malloc()
	   does not guarantee on every platform */
	if (posix_memalign(&data, 16, DATASIZE))
		return 1;

	t1 = rdtsc64();
	for (i = 0; i < TIMES; i++)
		sse(data);
	t2 = rdtsc64();
	for (i = 0; i < TIMES; i++)
		stos(data);
	t3 = rdtsc64();

	m1 = t2 - t1;
	m2 = t3 - t2;

	if (m1 > m2)
		printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
	else
		printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);

	return 0;
}
Ros-dev mailing list
Ros-dev@reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev
Best regards, Alex Ionescu