Alex,
I managed to shave off another comparison in favor of a shift:
int highest_bit ( int i ) { int ret = 0; if ( i > 0xffff ) i >>= 16, ret = 16; if ( i > 0xff ) i >>= 8, ret += 8; if ( i > 0xf ) i >>= 4, ret += 4; if ( i > 0x3 ) i >>= 2, ret += 2; return ret + (i>>1); }
FWIW, rdtsc reports that compile for speed is 2x as fast as compile for size.
I've thought about ways to possibly optimize pipeline throughput for this, but too many of the calculations depend on results of immediately previous calculations to get any real advantage.
Here's updated micro-code for BSR, if you really think AMD & Intel will actually do it...
IF r/m = 0 THEN ZF := 1; register := UNDEFINED; ELSE ZF := 0; register := 0; temp := r/m; IF OperandSize = 32 THEN IF (temp & 0xFFFF0000) != 0 THEN temp >>= 16; register |= 16; FI; FI; IF (temp & 0xFF00) != 0 THEN temp >>= 8; register |= 8; FI; IF (temp & 0xF0) != 0 THEN temp >>= 4; register |= 4; FI; IF (temp & 0xC) != 0 THEN temp >>= 2; register |= 2; FI; register |= (temp>>1); FI;