Also, rep movsd will be slower on small counts: on most processors, fewer than 8 iterations are faster with plain moves than with a rep.
This has changed lately: http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-win...
With blocks larger than 512 bytes, SSE/FPU code will always be faster.
On 4-Aug-09, at 9:50 PM, Michael Steil wrote:
On 4 Aug 2009, at 17:37, Jose Catena wrote:
but how would you want to optimize "rep stosd" anyway?
No way. That's what I said, possibly with the exception of using a 64-bit equivalent if we could assume that the CPU is 64-bit capable. But Alex knows better; he is calling me ignorant. He says that
l1:	mov [edi], eax
	add edi, 4
	dec ecx
	jnz l1
Is faster than
rep stosd
Both do exactly the same thing, the latter being much smaller AND FASTER on any CPU from the 386 to the i7.
I have done some tests on all generations of Intel CPUs since Yonah, and in all cases, rep stosd was faster than any loop I could craft or GCC would generate from my C code.
But this does *not* mean that
- rep stosd is by definition faster than a scalar loop
- rep stosd is by definition faster than any kind of loop.
Look at the test program at the end of this email. It compares rep stosd with a hand-crafted loop written with SSE instructions and SSE registers (parts borrowed from XNU).
On all tested machines, the SSE version is significantly faster (for big loops):
Yonah:      Genuine Intel(R) CPU T2500 @ 2.00GHz        SSE is 3.34x faster than stosl
Penryn:     Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz  SSE is 4.86x faster than stosl
Woodcrest:  Intel(R) Xeon(R) CPU 5150 @ 2.66GHz         SSE is 4.94x faster than stosl
Harpertown: Intel(R) Xeon(R) CPU E5462 @ 2.80GHz        SSE is 4.62x faster than stosl
So one should not assume that it is always a good idea to just use rep stosd. Use memset(), and have an optimized implementation of memset() somewhere else: one that can be inlined, checks the size, and branches to the optimal implementation, like XNU does, for example:
http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
Michael
/* Note: this is 32-bit x86 code (it uses edi/ecx and the "=A"
   rdtsc constraint); build with e.g. -m32 -msse2. */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define MIN(a,b) ((a)<(b)? (a):(b))

#define DATASIZE (1024*1024)
#define TIMES 10000

/* Read the time stamp counter, fenced so it is not reordered
   around the measured code. ("=A" means edx:eax, 32-bit only.) */
static inline long long rdtsc64(void)
{
	long long ret;
	__asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
	return ret;
}

/* Fill the buffer with 16-byte SSE stores, four per iteration.
   ecx starts at the last 64-byte block and counts down to 0, so the
   whole buffer is covered. (Whatever happens to be in xmm0 is stored;
   only the store speed matters for this test.) */
static inline void sse(int *p)
{
	int c_new;
	char *p_new;
	asm volatile (
		"1: \n"
		"movdqa %%xmm0,(%%edi,%%ecx) \n"
		"movdqa %%xmm0,16(%%edi,%%ecx) \n"
		"movdqa %%xmm0,32(%%edi,%%ecx) \n"
		"movdqa %%xmm0,48(%%edi,%%ecx) \n"
		"subl $64,%%ecx \n"
		"jns 1b \n"
		: "=D"(p_new), "=c"(c_new)
		: "D"(p), "c"(DATASIZE - 64)
		: "memory"
	);
}

/* Fill the buffer with rep stosl: ecx dword stores of eax. */
static inline void stos(int *p)
{
	int c_new;
	char *p_new;
	asm volatile (
		"rep stosl"
		: "=D"(p_new), "=c"(c_new)
		: "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
		: "memory"
	);
}

int main()
{
	void *data;
	long long t1, t2, t3, m1, m2;
	int i;

	/* movdqa requires 16-byte alignment, which plain malloc()
	   does not guarantee on every platform */
	if (posix_memalign(&data, 16, DATASIZE))
		return 1;

	t1 = rdtsc64();
	for (i = 0; i < TIMES; i++)
		sse(data);
	t2 = rdtsc64();
	for (i = 0; i < TIMES; i++)
		stos(data);
	t3 = rdtsc64();

	m1 = t2 - t1;
	m2 = t3 - t2;

	if (m1 > m2)
		printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
	else
		printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);

	return 0;
}
Ros-dev mailing list
Ros-dev@reactos.org
http://www.reactos.org/mailman/listinfo/ros-dev
Best regards, Alex Ionescu