Alex,
I managed to shave off another comparison in favor of a shift:
int highest_bit ( int i )
{
    int ret = 0;
    if ( i > 0xffff ) i >>= 16, ret = 16;
    if ( i > 0xff )   i >>= 8,  ret += 8;
    if ( i > 0xf )    i >>= 4,  ret += 4;
    if ( i > 0x3 )    i >>= 2,  ret += 2;
    return ret + (i>>1);
}
FWIW, rdtsc reports that compile for speed is 2x as fast as compile for size.
I've thought about ways to possibly optimize pipeline throughput for this, but too many of the calculations depend on results of immediately previous calculations to get any real advantage.
Here's updated micro-code for BSR, if you really think AMD & Intel will actually do it...
IF r/m = 0
THEN
    ZF := 1;
    register := UNDEFINED;
ELSE
    ZF := 0;
    register := 0;
    temp := r/m;
    IF OperandSize = 32
    THEN
        IF (temp & 0xFFFF0000) != 0
        THEN temp >>= 16; register |= 16;
        FI;
    FI;
    IF (temp & 0xFF00) != 0
    THEN temp >>= 8; register |= 8;
    FI;
    IF (temp & 0xF0) != 0
    THEN temp >>= 4; register |= 4;
    FI;
    IF (temp & 0xC) != 0
    THEN temp >>= 2; register |= 2;
    FI;
    register |= (temp >> 1);
FI;
Royce Mitchell III wrote:
I managed to shave off another comparison in favor of a shift:
If aiming for maximum speed, wouldn't something like the following be even faster? Note that I was too lazy to time it, but as it now contains not a single conditional jump I suspect it could compare favorably. Another thing to keep in mind is that this version should be constant-time irrespective of input, while the version with conditional jumps would likely vary (quite a lot?) depending on the bit patterns of the input.
On the other hand this code is, if possible, even harder to parallelize to make simultaneous use of multiple pipes (simultaneous paths of execution, for superscalar CPUs). I suspect that introducing yet another temporary, probably for (i & 0xf), and interleaving its use with the real i could make pairable instructions possible. As "ret" would then become a synchronization point, perhaps adding a ret2 used by the mentioned temporary, and only summing ret and ret2 at the point of return, could help with that.
Another thing to possibly look into could be to hold the now constant values used for comparison (0xffff, 0xff, ...) in a register and shift that too - to remove the register loads from memory (that is, assuming a shift operation indeed is faster than a memory-read, even if from L1 cache).
Combining these ideas might however be a bad idea on the Z80-on-steroids called IA32, where there actually doesn't exist a single true GPR, and the usable register count is lower than the number of toes on any healthy parrot. :-)
int highest_bit(unsigned int i)
{
    int temp;
    int ret = 0;
    temp = i > 0xffff; i >>= 16*temp; ret = 16*temp;
    temp = i > 0xff;   i >>= 8*temp;  ret += 8*temp;
    temp = i > 0xf;    i >>= 4*temp;  ret += 4*temp;
    temp = i > 0x3;    i >>= 2*temp;  ret += 2*temp;
    ret += (i>>1);
    return ret;
}
/Mike
Mike Nordell wrote:
int highest_bit(unsigned int i)
{
    int temp;
    int ret = 0;
    temp = i > 0xffff; i >>= 16*temp; ret = 16*temp;
    temp = i > 0xff;   i >>= 8*temp;  ret += 8*temp;
    temp = i > 0xf;    i >>= 4*temp;  ret += 4*temp;
    temp = i > 0x3;    i >>= 2*temp;  ret += 2*temp;
    ret += (i>>1);
    return ret;
}
This is almost, but not quite, 50% slower in msvc6 w/ "compile for speed". :(
Perhaps the reordering could help...
Royce Mitchell III wrote:
Mike Nordell wrote:
int highest_bit(unsigned int i)
{
    int temp;
    int ret = 0;
    temp = i > 0xffff; i >>= 16*temp; ret = 16*temp;
    temp = i > 0xff;   i >>= 8*temp;  ret += 8*temp;
    temp = i > 0xf;    i >>= 4*temp;  ret += 4*temp;
    temp = i > 0x3;    i >>= 2*temp;  ret += 2*temp;
    ret += (i>>1);
    return ret;
}
This is almost, but not quite, 50% slower in msvc6 w/ "compile for speed". :(
Perhaps the reordering could help...
FWIW, I also came up with the following code last night, which was only a hair slower than the code I sent to Alex, but perhaps might lend itself to speed-up better?
int highest_bit_v3 ( unsigned int i )
{
    int ret = 0;
    int n = i >> 16;
    if ( n ) i = n, ret = 16;
    n = i >> 8;
    if ( n ) i = n, ret += 8;
    n = i >> 4;
    if ( n ) i = n, ret += 4;
    n = i >> 2;
    if ( n ) i = n, ++ret, ++ret;
    return ret + (i>>1);
}
Hi, I have not test-run this code, but I think this form may be faster than v3 of Royce Mitchell III's code, as it leaves more to the compiler to optimize. Or I may be completely wrong.
int highest_bit ( unsigned int i )
{
    int ret = 0;
    int n;
    int t;
    int x;
    for (t = 0; t > 3; t++) {
        x = (16 << t);
        n = i >> x;
        if ( n ) i = n, ret += x;
    }
    return ret + (i>>1);
}
----- Original Message -----
From: "Royce Mitchell III" royce3@ev1.net
To: "ReactOS Development List" ros-dev@reactos.com
Sent: Wednesday, March 23, 2005 10:08 PM
Subject: Re: [ros-dev] ping Alex regarding log2() for scheduler
Royce Mitchell III wrote:
Mike Nordell wrote:
int highest_bit(unsigned int i)
{
    int temp;
    int ret = 0;
    temp = i > 0xffff; i >>= 16*temp; ret = 16*temp;
    temp = i > 0xff;   i >>= 8*temp;  ret += 8*temp;
    temp = i > 0xf;    i >>= 4*temp;  ret += 4*temp;
    temp = i > 0x3;    i >>= 2*temp;  ret += 2*temp;
    ret += (i>>1);
    return ret;
}
This is almost, but not quite, 50% slower in msvc6 w/ "compile for speed". :(
Perhaps the reordering could help...
FWIW, I also came up with the following code last night, which was only a hair slower than the code I sent to Alex, but perhaps might lend itself to speed-up better?
int highest_bit_v3 ( unsigned int i )
{
    int ret = 0;
    int n = i >> 16;
    if ( n ) i = n, ret = 16;
    n = i >> 8;
    if ( n ) i = n, ret += 8;
    n = i >> 4;
    if ( n ) i = n, ret += 4;
    n = i >> 2;
    if ( n ) i = n, ++ret, ++ret;
    return ret + (i>>1);
}
Ros-dev mailing list Ros-dev@reactos.com http://reactos.com:8080/mailman/listinfo/ros-dev
Magnus Olsen wrote:
Hi, I have not test-run this code, but I think this form may be faster than v3 of Royce Mitchell III's code, as it leaves more to the compiler to optimize. Or I may be completely wrong.
int highest_bit ( unsigned int i )
{
    int ret = 0;
    int n;
    int t;
    int x;
    for (t = 0; t > 3; t++) {
        x = (16 << t);
        n = i >> x;
        if ( n ) i = n, ret += x;
    }
    return ret + (i>>1);
}
Magnus,
Well... I was skeptical at first....
Your code is indeed faster, but it is not outputting correct results, so if you can tweak it to output correct results without sacrificing speed I will indeed be impressed!
/me crosses fingers ;)
Royce3
Royce Mitchell III wrote:
Magnus Olsen wrote:
Hi, I have not test-run this code, but I think this form may be faster than v3 of Royce Mitchell III's code, as it leaves more to the compiler to optimize. Or I may be completely wrong.

int highest_bit ( unsigned int i )
{
    int ret = 0;
    int n;
    int t;
    int x;
    for (t = 0; t > 3; t++) {
        x = (16 << t);
        n = i >> x;
        if ( n ) i = n, ret += x;
    }
    return ret + (i>>1);
}
Unfortunately your code is broken in a couple of ways; most notably, the code inside the for() statement never executes. When I fix it to work correctly, it takes almost 3x longer than my code :(
I will test it tomorrow morning and see if I can fix the broken part while keeping the speed. I wrote it directly in the mailing list without testing whether it really works or not.
----- Original Message -----
From: "Royce Mitchell III" royce3@ev1.net
To: "ReactOS Development List" ros-dev@reactos.com
Sent: Wednesday, March 23, 2005 11:04 PM
Subject: Re: [ros-dev] ping Alex regarding log2() for scheduler
Magnus Olsen wrote:
Hi, I have not test-run this code, but I think this form may be faster than v3 of Royce Mitchell III's code, as it leaves more to the compiler to optimize. Or I may be completely wrong.
int highest_bit ( unsigned int i )
{
    int ret = 0;
    int n;
    int t;
    int x;
    for (t = 0; t > 3; t++) {
        x = (16 << t);
        n = i >> x;
        if ( n ) i = n, ret += x;
    }
    return ret + (i>>1);
}
Magnus,
Well... I was skeptical at first....
Your code is indeed faster, but it is not outputting correct results, so if you can tweak it to output correct results without sacrificing speed I will indeed be impressed!
/me crosses fingers ;)
Royce3
Royce Mitchell III schrieb:
I managed to shave off another comparison in favor of a shift:
What about using a 64k table?
static BYTE HighestBits[65536] =
{
    0x00, 0x01, 0x02, 0x02, ... // and so on
};

// returns 0xFF when no bits set in dwValue
BYTE highest_bit(DWORD dwValue)
{
    BYTE ret;
    if (i > 0xffff) {
        ret = HighestBits[(int) (dwValue >> 16)];
    } else {
        ret = HighestBits[(int) dwValue];
    }
    return ret - 1;
}
Regards, Mark
Mark Junker wrote:
Royce Mitchell III schrieb:
I managed to shave off another comparison in favor of a shift:
What about using a 64k table?
static BYTE HighestBits[65536] =
{
    0x00, 0x01, 0x02, 0x02, ... // and so on
};

// returns 0xFF when no bits set in dwValue
BYTE highest_bit(DWORD dwValue)
{
    BYTE ret;
    if (i > 0xffff) {
        ret = HighestBits[(int) (dwValue >> 16)];
    } else {
        ret = HighestBits[(int) dwValue];
    }
    return ret - 1;
}
Regards, Mark
There are a couple of bugs in your code, but...
The table lookups are a little faster when they take full advantage of available CPU caching. I'm not sure they would still fare well in our target environment (called at least once per interrupt/thread schedule). Besides, it wastes a lot of nonpaged-pool RAM for a slight performance improvement.
Couldn't you just take the binary log and round down? Okay ... just kidding.
Seriously, I think Mark had the right idea ... if you use a smaller table. This is similar to some ECC code I use for counting the number of 1's in a word. Removing the while() loop is left "as an exercise for the reader" ...
static const unsigned char list[16] =
{
    0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
};

// 0 = no bits set, 1 = 1st bit set, etc.
unsigned char highest_bit(int iWord)
{
    unsigned char iResult = 0;
    unsigned char iPass = 0;
    while (iWord != 0) {
        iResult = list[iWord & 0xF];
        iWord >>= 4;
        if (iWord)
            iPass += 4;
    }
    return iResult + iPass;
}
-rick
Rick Parrish wrote:
Couldn't you just take the binary log and round down? Okay ... just kidding.
Seriously, I think Mark had the right idea ... if you use a smaller table. This is similar to some ECC code I use for counting the number of 1's in a word. Removing the while() loop is left "as an exercise for the reader" ...
static const unsigned char list[16] = {0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4};
I have experimented with both 16- and 256-entry tables. The 16-entry table was somewhat slower than no table at all, and the 256-entry table somewhat faster, but both results depend on the code and tables staying in cache. I don't anticipate our particular target getting that benefit... it runs very frequently, but not in a tight loop.
Sorry, but may I ask what you're discussing?
Seems to me like how to implement the page address translation on a MIPS machine.
Royce Mitchell III wrote:
Rick Parrish wrote:
Couldn't you just take the binary log and round down? Okay ... just kidding.
Seriously, I think Mark had the right idea ... if you use a smaller table. This is similar to some ECC code I use for counting the number of 1's in a word. Removing the while() loop is left "as an exercise for the reader" ...
static const unsigned char list[16] = {0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4};
I have experimented with both 16- and 256-entry tables. The 16-entry table was somewhat slower than no table at all, and the 256-entry table somewhat faster, but both results depend on the code and tables staying in cache. I don't anticipate our particular target getting that benefit... it runs very frequently, but not in a tight loop.
Robert Köpferl wrote:
Sorry, but may I ask what you're discussing?
Seems to me like how to implement the page address translation on a MIPS machine.
Alex is working on a scheduler speed-up, and asked me for help with code to find the highest bit set in a mask, so that we don't have to blindly start at the highest priority.