Also, rep movsd is slower for small counts: on most processors, anything
under 8 iterations is faster with plain moves than with a rep. This has
changed lately, though: with blocks larger than 512 bytes, SSE/FPU code
will always be faster.
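As a rough sketch of that trade-off (the cutoff of 8 and the helper name
are only illustrative, not taken from this thread), a fill routine could
use plain stores for tiny counts and the string instruction otherwise:

    #include <stddef.h>

    /* Hypothetical helper: stores "value" into "count" dwords at p.
     * The cutoff of 8 is just an example of the idea discussed above. */
    static inline void
    fill32(int *p, int value, size_t count)
    {
        if (count < 8) {
            /* small counts: the rep setup cost dominates, plain stores win */
            while (count--)
                *p++ = value;
        } else {
            /* larger counts: let rep stosl do the bulk work */
            asm volatile("rep stosl"
                : "+D"(p), "+c"(count)
                : "a"(value)
                : "memory");
        }
    }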
On 4-Aug-09, at 9:50 PM, Michael Steil wrote:
On 4 Aug 2009, at 17:37, Jose Catena wrote:
but how would you want to optimize "rep stosd" anyway?

No way. That's what I said, possibly with the exception of using a 64-bit
equivalent if we could assume that the CPU is 64-bit capable.
But Alex knows better; he's calling me ignorant. He says that

    L1: mov [edi], eax
        add edi, 4
        dec ecx
        jnz L1

is faster than

    rep stosd

Both do exactly the same thing, the latter being much smaller AND FASTER
on any CPU from the 386 to the i7.
I have done some tests on all generations of Intel CPUs since Yonah,
and in all cases, rep stosd was faster than any loop I could craft or
GCC would generate from my C code.
But this does *not* mean that
* rep stosd is by definition faster than a scalar loop (a plain C version
of such a loop is sketched below), or
* rep stosd is by definition faster than any kind of loop.
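For reference, the kind of plain scalar fill loop meant above looks
something like this in C (illustrative only, not necessarily the exact
code used in the tests):

    #include <stddef.h>

    /* straightforward dword fill; depending on target and options, GCC
     * may compile this to a plain loop or to something smarter */
    static void
    fill_scalar(int *p, int value, size_t count)
    {
        size_t i;

        for (i = 0; i < count; i++)
            p[i] = value;
    }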
Look at the test program at the end of this email. It compares rep
stosd with a hand-crafted loop written with SSE instructions and SSE
registers (parts borrowed from XNU).
On all tested machines, the SSE version is significantly faster (for
big loops):
Yonah: Genuine Intel(R) CPU T2500 @ 2.00GHz
SSE is 3.34x faster than stosl
Merom: Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
SSE is 4.86x faster than stosl
Penryn: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
SSE is 4.94x faster than stosl
Nehalem: Intel(R) Xeon(R) CPU E5462 @ 2.80GHz
SSE is 4.62x faster than stosl
So one should not assume that it is always a good idea to just use rep
stosd. Use memset(), and keep an optimized implementation of memset()
somewhere else: one that can be inlined, checks the size, and branches to
the optimal implementation, like XNU does, for example:
http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
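As a rough illustration of that kind of size dispatch (the cutoffs and the
memset() stand-in for the large case are invented for this sketch, not
taken from XNU):

    #include <stddef.h>
    #include <string.h>

    /* Sketch only: the cutoffs are made up; a real implementation (like
     * XNU's commpage routines) tunes them per CPU and uses 16-byte SSE
     * stores for the large case, as in the test program below. */
    static void *
    memset_dispatch(void *dst, int c, size_t n)
    {
        unsigned char *p = dst;
        size_t count = n;

        if (count < 16) {
            /* tiny: plain byte stores, no setup cost */
            while (count--)
                *p++ = (unsigned char)c;
        } else if (count < 512) {
            /* medium: string instruction */
            asm volatile("rep stosb"
                : "+D"(p), "+c"(count)
                : "a"(c)
                : "memory");
        } else {
            /* large: stand-in for a 16-byte SSE store loop */
            memset(dst, c, n);
        }
        return dst;
    }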
Michael
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define MIN(a,b) ((a)<(b)? (a):(b))
#define DATASIZE (1024*1024)
#define TIMES 10000
static inline long long
rdtsc64(void)
{
    long long ret;

    /* lfence on both sides keeps rdtsc from being reordered around the
     * measured code; "=A" (edx:eax) only works as a 64-bit pair on i386,
     * so this program must be built as 32-bit code */
    __asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
    return ret;
}
static inline void
sse(int *p)
{
    int c_new;
    char *p_new;

    /* Fill DATASIZE bytes with 64-byte blocks of 16-byte SSE stores.
     * ecx is a byte offset counting down from the last 64-byte block to
     * 0; the dummy outputs tell GCC that edi and ecx are modified. */
    asm volatile (
        "pxor %%xmm0,%%xmm0 \n"  /* store zeros; the value is irrelevant for timing */
        "1: \n"
        "movdqa %%xmm0,(%%edi,%%ecx) \n"
        "movdqa %%xmm0,16(%%edi,%%ecx) \n"
        "movdqa %%xmm0,32(%%edi,%%ecx) \n"
        "movdqa %%xmm0,48(%%edi,%%ecx) \n"
        "subl $64,%%ecx \n"
        "jns 1b \n"
        : "=D"(p_new), "=c"(c_new)
        : "D"(p), "c"(DATASIZE - 64)  /* byte offset of the last block, not a dword count */
        : "memory", "xmm0"
    );
}
static inline void
stos(int *p)
{
    int c_new;
    char *p_new;

    /* rep stosl stores eax (here 1) into DATASIZE/4 dwords; again the
     * dummy outputs tell GCC that edi and ecx are modified. */
    asm volatile (
        "rep stosl"
        : "=D"(p_new), "=c"(c_new)
        : "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
        : "memory"
    );
}
int
main(void)
{
    void *data;
    long long t1, t2, t3, m1, m2;
    int i;

    /* movdqa needs 16-byte alignment, which plain malloc() does not
     * guarantee everywhere, so ask for it explicitly */
    if (posix_memalign(&data, 16, DATASIZE))
        return 1;

    t1 = rdtsc64();
    for (i = 0; i < TIMES; i++)
        sse(data);
    t2 = rdtsc64();
    for (i = 0; i < TIMES; i++)
        stos(data);
    t3 = rdtsc64();

    m1 = t2 - t1;  /* cycles spent in the SSE version */
    m2 = t3 - t2;  /* cycles spent in the rep stosl version */
    if (m1 > m2)
        printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
    else
        printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);

    free(data);
    return 0;
}
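For reference, this has to be built as 32-bit code, since the inline asm
uses %edi/%ecx directly and the "=A" rdtsc constraint only pairs edx:eax
on i386. Something along these lines should work (the file name is
arbitrary):

    gcc -m32 -O2 -o stos_vs_sse stos_vs_sse.c
    ./stos_vs_sse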