have an optimized implementation of memset() somewhere else. One that can
be inlined, and checks the size and branches to the optimal implementation
Yep, that's a good way to minimize where asm is used (if asm is used at all), while making general-purpose fast functions available to any other function. But don't call it memset, since memset fills byte values only. Call it something like MemFill with a size param, or MemFill32, MemFill64, MemFill16, MemFill8. Good post, Michael.
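To make the naming idea concrete, here is a rough sketch in plain C of what such a family could look like. The names MemFill8/16/32/64 and MemFill come from the suggestion above. The signatures, the uint64_t value parameter, and the all-bytes-equal shortcut in MemFill32 are my own assumptions for illustration, not an existing API. The sketch also assumes dst is suitably aligned for the element size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers; names follow the MemFillN suggestion, signatures are assumed. */

static inline void MemFill8(void *dst, uint8_t value, size_t count)
{
    /* A byte fill is exactly what memset does, so reuse it. */
    memset(dst, value, count);
}

static inline void MemFill16(uint16_t *dst, uint16_t value, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}

static inline void MemFill32(uint32_t *dst, uint32_t value, size_t count)
{
    /* If all four bytes of the pattern are equal, a byte fill gives the same result. */
    if ((value >> 16) == (value & 0xFFFFu) &&
        ((value >> 8) & 0xFFu) == (value & 0xFFu)) {
        memset(dst, (int)(value & 0xFFu), count * sizeof *dst);
        return;
    }
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}

static inline void MemFill64(uint64_t *dst, uint64_t value, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}

/* Generic entry point with an element-size parameter (in bytes).
   Assumes dst is aligned for elem_size; unsupported sizes are ignored. */
static inline void MemFill(void *dst, uint64_t value, size_t elem_size, size_t count)
{
    switch (elem_size) {
    case 1: MemFill8(dst, (uint8_t)value, count); break;
    case 2: MemFill16((uint16_t *)dst, (uint16_t)value, count); break;
    case 4: MemFill32((uint32_t *)dst, (uint32_t)value, count); break;
    case 8: MemFill64((uint64_t *)dst, value, count); break;
    default: break;
    }
}
```

Since everything is static inline, a call with a constant element size lets the compiler drop the switch and keep only the branch that applies. The simple loops are placeholders for wherever the optimized (possibly asm) versions would go.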
Jose Catena DIGIWAVES S.L.