Royce Mitchell III wrote:
I managed to shave off another comparison in favor of a shift:
If aiming for maximum speed, wouldn't something like the following be even faster? Note that I was to lazy to time it, but as it now contains not a single conditional jump I suspect it could compare favorable. Another thing to keep in mind is that this version should be constant time irrespective of input, while the version with conditional jumps likely would vary (quite a lot?) depending on the bit-patterns of the input.
On the other hand this code is, if possible, even harder to parallelize to make simultaneous use of multiple pipes (simultaneous paths of execution, for super-scalar CPU's). I suspect introducing yet another temporary, probably for (i&0xf), and interleave use of that one with the real i could make pairable instructions possible. As "ret" would then become a synchronization point, perhaps addition of a ret2 used by mentioned temporary, and only adding ret and ret2 at the point of return could help that.
Another thing to possibly look into could be to hold the now constant values used for comparison (0xffff, 0xff, ...) in a register and shift that too - to remove the register loads from memory (that is, assuming a shift operation indeed is faster than a memory-read, even if from L1 cache).
Combining these ideas might however be a bad idea on the Z80-on-stereoids called IA32, where there actually doesn't exist a single GPR, and the actual usable register count is lower than the amount of toes on any healthy parrot. :-)
int highest_bit(unsigned int i) { int temp; int ret = 0; temp = i > 0xffff; i >>= 16*temp; ret = 16*temp; temp = i > 0xff; i >>= 8*temp; ret += 8*temp; temp = i > 0xf; i >>= 4*temp; ret += 4*temp; temp = i > 0x3; i >>= 2*temp; ret += 2*temp; ret += (i>>1); return ret; }
/Mike