Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() forscheduler)

24 Mar 2005

BSR has a latency of 8-12 Cycles on Athlon/P3 but can be pipelined. Worse
(up to ~80 cycles) on Pentium and other older CPUs.
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_3748,0…
...
  Maximum latency on a Pentium 4 is eight clock cycles.
But its throughput
 is one, which means it is fully pipelined. So when you start this
 instruction eight clock cycles before you need the result, it behaves as
 if it only takes one clock cycle. You're not going to be able to beat that
 with any other code. The closest you can get is to convert the integer to
 float, then extract the exponent. That could be done in less than eight
 clock cycles but throughput will be lower.
http://www.flipcode.com/cgi-bin/fcmsg.cgi?thread_show=16986&msg=113105
Dont know about A64 - maybe someone can test BSR with A64?
I'm very disappointed by its peformance on my K7 system - i might have
messed something up thought.
You could save on the Pushing / Popping but that would be kinda like
cheating and if you do it dirty/lazy I wont get the right returns either.
It doesnt make much sense to put the optimized ASM in there, neither is much
hope of GCC having a good day and doing a lot of optimisation.
So far the best option would be the macro with a lookup table (only one
global kernel table tho).
Here are the updated STATS
also available at http://hackersquest.org/kerneltest.html
result orig function            46ffffe9
it took         1526862         18%
result orig function inlined    46ffffe9
it took         1041460         12%
result second proposal inlined  46ffffe9
it took         1248990         15%
result optimized asm            46ffffe9
it took         1321532         16%
result lookup inlined           46ffffe9
it took         682264          8%
result bsr inlined              46ffffe9
it took         1751088         21%
result macro                    46ffffe9
it took         653692          7%
http://wohngebaeudeversicherung.einsurance.de/
----- Original Message -----
From: "Alex Ionescu" &lt;ionucu(a)videotron.ca&gt;
To: "ReactOS Development List" &lt;ros-dev(a)reactos.com&gt;
Sent: Thursday, March 24, 2005 3:57 AM
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2()
forscheduler)
...
  Ash wrote:
  Hello,
 I'd like to provide a small VS.NET project file with 5 different tests.
 - 46ffffe9 tells everything is alright with the function calculation
 - times are measured with QueryPerformanceCounter
 - loop run cnt: 0x2ffffff
 result orig function            46ffffe9
 it took         1491052
 result orig function inlined    46ffffe9
 it took         1035547
 result second proposal inlined  46ffffe9
 it took         1244434
 result optimized asm            46ffffe9
 it took         1338367
 result debug asm                46ffffe9
 it took         8774815
 The second proposal is the original proposal but more shifts - still
 slower tho.
 Interesting is the inlined version generated by MSVC, shaving off almost
 1/3 of the overall time.
 Also stared as "optimized asm" - no gurantee on register safety tho ;)
 For portability and performance sake it should be considered to create a
 compiler macro.
 This function is terribly small, any optimisations inside are outweighted
 by the calling overhead in this case.
 The most impressive one is the original function inlined, althought the
 ASM would only work on x86.
 Please do not think about using 64k tables, thats what, 1/2 of a Sempron
 L2 cache?
 It would really really trash performance. 
 Hi Ash,
 Thanks a lot for your tests. I don't have much time tonight, but if you'd
 like/can, can you add two more tests?
 One using "bsr", the intel opcode. It does all the work for you and
 returns the index.
 i think it's as simple as "bsr ecx, eax" where "ecx" is the mask
and eax
 is the returned index. Or it might be backwards.
 The second test is using a 256-byte log2 table:
 const char LogTable256[] = {
  0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
  4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
  5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
  5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
 };
 and a lookup code like this:
 Addon = 16;
 if ((IntMask = Mask >> 16)) {
     Addon = 0;
     IntMask = Mask;
 }
 if (IntMask && 0xFFFFFF00) {
     Addon += 8;
 }
 HighBit = LogTable256[(Mask >> Addon)] + Addon;
 _______________________________________________
 Ros-dev mailing list
 Ros-dev(a)reactos.com
 http://reactos.com:8080/mailman/listinfo/ros-dev

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() forscheduler)