On most processors, less than 8 iterations will be faster with a move than
with a rep.
I'd say more like 4 (separate moves), and that is not feasible when the
number of iterations is variable, as in our case. It would be possible to
use a loop with many moves inside, or even better SSE stores, followed by a
rep stosd for the remainder, which would indeed be faster for large cx
counts. Does any compiler currently generate that automatically? None of
the ones I know, but it can be done to some extent by writing it that way
in C. Possible in asm? Of course. DMA fill? No joy. GPU accelerated fill?
Perhaps in the future.
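
A rough sketch of what "writing it that way in C" could look like. The
function name fill32_unrolled, the 32-bit element type and the unroll factor
of 4 are assumptions made for illustration only, not existing ROS code, and
the plain tail loop merely stands in for the rep stosd remainder:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store 'value' into 'count' 32-bit elements at 'dst'. */
void fill32_unrolled(uint32_t *dst, uint32_t value, size_t count)
{
    /* Main loop: four separate stores per iteration (assumed unroll factor). */
    while (count >= 4) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst += 4;
        count -= 4;
    }

    /* Remainder: 0 to 3 leftover elements, one store each. */
    while (count > 0) {
        *dst++ = value;
        count--;
    }
}
```

Whether this actually beats a plain rep stosd depends on the processor and
on the typical count, so it would have to be measured before being adopted.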
I keep thinking that this is not important enough to justify asm, or even
to justify breaking the loop in two in C. At least not before ROS is
complete and stable and we want to optimize every bit. And by then we may
very well be thinking about GPU accelerated GDI too.
Jose Catena
DIGIWAVES S.L.