Royce Mitchell III wrote:
I managed to shave off another comparison in favor of
a shift:
If aiming for maximum speed, wouldn't something like the following be even
faster? Note that I was to lazy to time it, but as it now contains not a
single conditional jump I suspect it could compare favorable. Another thing
to keep in mind is that this version should be constant time irrespective of
input, while the version with conditional jumps likely would vary (quite a
lot?) depending on the bit-patterns of the input.
On the other hand this code is, if possible, even harder to parallelize to
make simultaneous use of multiple pipes (simultaneous paths of execution,
for super-scalar CPU's). I suspect introducing yet another temporary,
probably for (i&0xf), and interleave use of that one with the real i could
make pairable instructions possible. As "ret" would then become a
synchronization point, perhaps addition of a ret2 used by mentioned
temporary, and only adding ret and ret2 at the point of return could help
that.
Another thing to possibly look into could be to hold the now constant values
used for comparison (0xffff, 0xff, ...) in a register and shift that too -
to remove the register loads from memory (that is, assuming a shift
operation indeed is faster than a memory-read, even if from L1 cache).
Combining these ideas might however be a bad idea on the Z80-on-stereoids
called IA32, where there actually doesn't exist a single GPR, and the actual
usable register count is lower than the amount of toes on any healthy
parrot. :-)
int highest_bit(unsigned int i)
{
int temp;
int ret = 0;
temp = i > 0xffff; i >>= 16*temp; ret = 16*temp;
temp = i > 0xff; i >>= 8*temp; ret += 8*temp;
temp = i > 0xf; i >>= 4*temp; ret += 4*temp;
temp = i > 0x3; i >>= 2*temp; ret += 2*temp;
ret += (i>>1);
return ret;
}
/Mike