Ash wrote:
BSR has a latency of 8-12 cycles on Athlon/P3 but can
be pipelined.
It is worse (up to ~80 cycles) on Pentium and other older CPUs.
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_3748,0…
My tests have shown that you're right and BSR is much too slow.
Don't know about A64; maybe someone can test BSR on an
A64?
I have an AMD64 here but it doesn't run in 64-bit mode.
It doesn't make much sense to put the optimized ASM in
there, nor
is there much hope of GCC having a good day and doing a lot of optimisation.
So far the best option would be the macro with a lookup table (only
one global kernel table, though).
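For readers without the attached sources, here is a minimal sketch of what a lookup-table macro of this kind might look like. It assumes the operation being benchmarked is the one BSR performs, i.e. finding the index of the highest set bit in a 32-bit value. The names `highbit_table`, `highbit_table_init` and `HIGHBIT32` are made up for illustration and are not taken from the sources in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single global table: highbit_table[b] holds the index of
 * the highest set bit in byte b. Entries 0 and 1 stay 0. */
static uint8_t highbit_table[256];

/* Fill the table once at startup. Each entry is one more than the
 * entry for the value shifted right by one bit. */
static void highbit_table_init(void)
{
    unsigned i;
    for (i = 2; i < 256; i++)
        highbit_table[i] = (uint8_t)(highbit_table[i >> 1] + 1);
}

/* Index of the highest set bit of a non-zero uint32_t.
 * Picks the most significant non-zero byte, then looks it up.
 * Note: x is evaluated several times and must be an unsigned
 * 32-bit value without side effects. */
#define HIGHBIT32(x)                                          \
    (((x) >> 16)                                              \
        ? (((x) >> 24) ? 24 + highbit_table[(x) >> 24]        \
                       : 16 + highbit_table[(x) >> 16])       \
        : (((x) >> 8)  ? 8 + highbit_table[(x) >> 8]          \
                       : highbit_table[(x)]))
```

The macro needs at most two comparisons and one table load per call, which is probably why this style of code can beat a slow BSR. The cost is 256 bytes for the one shared table, plus the need to call the init function (or use a precomputed constant table) before first use.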
I've converted your sources to be compilable with GCC (MinGW). I've
attached the sources.
Here are the updated STATS
also available at
http://hackersquest.org/kerneltest.html
result orig function 46ffffe9
it took 1526862 18%
result orig function inlined 46ffffe9
it took 1041460 12%
result second proposal inlined 46ffffe9
it took 1248990 15%
result optimized asm 46ffffe9
it took 1321532 16%
result lookup inlined 46ffffe9
it took 682264 8%
result bsr inlined 46ffffe9
it took 1751088 21%
result macro 46ffffe9
it took 653692 7%
These are my results on the AMD64 using your Release-EXE:
STATS
result orig function 46ffffe9
it took 1272638 18%
result orig function inlined 46ffffe9
it took 875751 12%
result second proposal inlined 46ffffe9
it took 1051861 15%
result optimized asm 46ffffe9
it took 1225282 17%
result lookup inlined 46ffffe9
it took 549861 7%
result bsr inlined 46ffffe9
it took 1410179 20%
result macro 46ffffe9
it took 607638 8%
These are my results using the GCC EXE (-O2):
STATS
result orig function 46ffffe9
it took 1321663 24%
result orig function inlined 46ffffe9
it took 879318 16%
result second proposal inlined 46ffffe9
it took 940285 17%
result lookup inlined 46ffffe9
it took 615267 11%
result bsr inlined 46ffffe9
it took 1103432 20%
result macro 46ffffe9
it took 484450 9%
BTW: I had to remove all functions using the __asm() statement. The
"result bsr inlined" uses my GCC BSR macro. You can see that using BSR
seems to be much too slow ...
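To show what a GCC-side BSR wrapper can look like, here is a small sketch using GCC extended inline assembly. It is not the macro from the attached sources: it is written as an inline function, the name `bsr32` is hypothetical, and the non-x86 fallback loop is only there so the snippet compiles elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* x86 with GCC-style inline asm (AT&T syntax: source, destination).
 * BSR stores the index of the highest set bit of the source operand
 * in the destination register. */
static inline uint32_t bsr32(uint32_t x)
{
    uint32_t index;
    assert(x != 0);  /* BSR leaves the destination undefined for 0 */
    __asm__ ("bsrl %1, %0" : "=r" (index) : "rm" (x) : "cc");
    return index;
}
#else
/* Portable fallback for other targets: shift until the value is 0. */
static inline uint32_t bsr32(uint32_t x)
{
    uint32_t index = 0;
    assert(x != 0);
    while (x >>= 1)
        index++;
    return index;
}
#endif
```

With the asm left without `volatile`, GCC is free to schedule or drop the instruction like any other pure computation, so the measured cost should mostly reflect the instruction itself rather than the wrapper.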
Regards,
Mark