On most processors, fewer than 8 iterations will be faster with a move than
with a rep.
I'd say more like 4 (separate moves), and that is not feasible when the number of iterations is variable, as in our case. What would be possible is a loop with many moves inside, or even better SSE stores, followed by a rep stosd for the remainder; that is indeed faster for large cx counts.

Does any compiler currently generate that automatically? None of the ones I know, but it can be done to some extent by writing it that way in C. Possible in asm? Of course. DMA fill? No joy. GPU accelerated fill? Perhaps in the future.

I keep thinking that this is not important enough to justify asm, or even to break the loop in two in C. At least not before ROS is complete and stable and we want to optimize every bit. And by then we may very well be thinking about GPU accelerated GDI too.
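
For reference, breaking the loop in two in C could look roughly like the sketch below. This is only an illustration, not ReactOS code: the name fill32_unrolled, the 32-bit element type and the unroll factor of 8 are all assumptions. C cannot ask for rep stosd directly, so the remainder is a plain loop and it is up to the compiler what it emits for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the ReactOS tree): store `value` into
 * `count` consecutive 32-bit words starting at `dst`. */
void fill32_unrolled(uint32_t *dst, uint32_t value, size_t count)
{
    assert(dst != NULL || count == 0);

    /* Main loop: eight separate stores per iteration.
     * The factor 8 is an arbitrary choice for this sketch. */
    while (count >= 8) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst[4] = value;
        dst[5] = value;
        dst[6] = value;
        dst[7] = value;
        dst += 8;
        count -= 8;
    }

    /* Remainder: 0 to 7 words left. A simple loop; the compiler
     * decides whether this becomes moves or a string instruction. */
    for (; count > 0; count--) {
        *dst++ = value;
    }
}
```

Whether this beats a single rep stosd depends on the processor and on the typical count, so it would have to be measured before changing anything.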
Jose Catena DIGIWAVES S.L.