Ash wrote:
BSR has a latency of 8-12 cycles on Athlon/P3 but can
be pipelined.
Worse (up to ~80 cycles) on Pentium and other older CPUs.
^^^ I think this might be our "silver bullet".
I don't want to waste 256 bytes of L1 cache (assuming we get a cache
hit), or spend hundreds of cycles once per interrupt waiting for a cache
miss to resolve, so the table-based approach is a bad fit in this
scenario.
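
For reference, here is roughly what I mean by the table-based approach. This is only a sketch: I'm assuming the table maps each byte value to the index of its highest set bit, and the names msb_table, msb_table_init and msb32_table are made up for illustration, not taken from our code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 256-byte table: msb_table[b] is the index (0..7) of the
   highest set bit in byte b. Entry 0 is a placeholder and never used. */
uint8_t msb_table[256];

/* Fill the table once at startup. */
void msb_table_init(void)
{
    msb_table[0] = 0;
    msb_table[1] = 0;
    for (int i = 2; i < 256; i++)
        msb_table[i] = (uint8_t)(msb_table[i / 2] + 1);
}

/* Index of the highest set bit of a non-zero 32-bit word.
   Picks the top non-zero byte, then does one table load. */
unsigned msb32_table(uint32_t x)
{
    assert(x != 0);
    if (x >> 24) return 24u + msb_table[x >> 24];
    if (x >> 16) return 16u + msb_table[x >> 16];
    if (x >> 8)  return 8u + msb_table[x >> 8];
    return msb_table[x];
}
```

The single load at the end is the part that hurts: if the table has been evicted between interrupts, that load is the cache miss described above.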
In some more testing, BSR was never significantly faster than the C code,
but in many scenarios it was equivalent in speed. However, the C code
clobbers too many registers to run in parallel with anything else.
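
To make the comparison concrete, here is a minimal sketch of the two variants. The C version below is an assumed stand-in (a plain shift loop), not necessarily the exact code that was benchmarked, and the function names are hypothetical. The BSR version goes through the GCC/Clang builtin __builtin_clz, which on x86 typically compiles down to a BSR (or LZCNT on newer parts); on other compilers it just falls back to the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Portable C: shift right until the word is empty, counting steps.
   Returns the index (0..31) of the highest set bit of a non-zero word. */
unsigned msb32_loop(uint32_t x)
{
    assert(x != 0);
    unsigned n = 0;
    while (x >>= 1)
        n++;
    return n;
}

/* BSR-style: one bit-scan instruction where the compiler supports it.
   __builtin_clz is undefined for 0, hence the assert. */
unsigned msb32_bsr(uint32_t x)
{
    assert(x != 0);
#if defined(__GNUC__) || defined(__clang__)
    return 31u - (unsigned)__builtin_clz(x);
#else
    return msb32_loop(x);  /* assumption: no bit-scan builtin available */
#endif
}
```

The loop needs a counter plus a working copy of the input and carries a dependency through every iteration, while the bit-scan version is a single instruction with one input and one output register.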
All that being said, I think BSR's ability to be pipelined will make it
our big winner after all. I will try to verify whether BSR pipelines
well on AMD products, too.