New subject: ping Alex regarding log2() for scheduler

25 Mar 2005

Hi Royce:
int highest_bit ( int i )
{
    int ret = 0;
    if ( i > 0xffff )
        i >>= 16, ret = 16;
    if ( i > 0xff )
        i >>= 8, ret += 8;
    if ( i > 0xf )
        i >>= 4, ret += 4;
    if ( i > 0x3 )
        i >>= 2, ret += 2;
    return ret + (i>>1);
}
Guys I think I have one of those optimizations manuals for asm that actually use only one
multiplication
by a called ¨magic number¨ to get the first bit set. I actually don´t want to trace this
code in my head
but if you are searching a fast BSF replacement... It was tested using brute force for the
2^32 possible values
on the argumment. Maybe you could check the internet  for ¨Agner Fog optimization¨. I have
a very bad
memory so maybe the name is wrong. This is all the info I can give until
I get home. Ohh well i can do it myself: http://www.df.lth.se/~john_e/gems/gem0039.html It
was Harley´s magic number.
so I wasn´t that far.
or you can read it here... I have a c version but I guess you don´t need that :)
----------------------------------------------------------------------------------------------------------
Fast BSF replacement     ASM/80386
Macro EMBSF5 emulates the BSF instruction for non-zero argument
<------------------------------------ remember to test for this!!!!!!!
This macro utilizes an algorithm published in the news group comp.arch by Robert Harley in
1996. The algorithm converts the general problem of finding the position of the least
significant set bit into a special case of counting the number of bits in a contiguous
block of set bits. By computing x^(x-1), where x is the original input argument, a
right-aligned contiguous group of set bits is created, whose cardinality equals the
position of the least significant set bit in the original input plus 1.
The input x is of the form (regular expression): {x}n1{0}m. x-1 has the form {x}n{0}(m+1),
and x^(x-1) has the form {0}n{1}(m+1). This step is pretty similar to the one used by
macro PREPBSF <http://www.df.lth.se/~john_e/gems/gem002e.html> , only that PREPBSF
<http://www.df.lth.se/~john_e/gems/gem002e.html>  creates a right-aligned group of
set bits whose cardinality equals exactly the position of the least significant set bit.
Harley's algorithm then employs a special method to count the number of bits in the
right-aligned contiguous block of set bits. I am not sure upon which number theoretical
argument it is founded, and it wasn't explained in the news group post.
According to Harley, if a 32-bit number of the form 00...01...11 is multiplied by the
"magic" number (7*255*255*255), then bits <31:26> of the result uniquely
identify the number of set bits in that number. A 64 entry table is used to map that
unique result to the bit count. Here, I have modified the table to reflect the bit
position of the least significant set bit in the original argument, which is one less than
the bit count of the intermediate result.
I have tested the macro EMBSF5 exhaustively for all 2^32-1 possible inputs, i.e. for all
32-bit numbers except zero.
Place the following table in the data segment:
table   db      0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0
        db      0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0
        db     14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13
        db     26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0
And here follows the actual macro:
;
; emulate bsf instruction
;
; input:
;   eax = number to preform a bsf on ( != 0 )
;
; output:
;   edx = result of bsf operation
;
; destroys:
;   ecx
;   eflags
;
MACRO   EMBSF5
        mov     edx,eax         ; do not disturb original argument
        dec     edx             ; n-1
        xor     edx,eax         ; n^(n-1), now EDX = 00..01..11
IFDEF FASTMUL
        imul    edx,7*255*255*255       ; multiply by Harley's magic number
ELSE
        mov     ecx,edx         ; do multiply using shift/add method
        shl     edx,3
        sub     edx,ecx
        mov     ecx,edx
        shl     edx,8
        sub     edx,ecx
        mov     ecx,edx
        shl     edx,8
        sub     edx,ecx
        mov     ecx,edx
        shl     edx,8
        sub     edx,ecx
ENDIF
        shr     edx,26          ; extract bits <31:26>
        movzx   edx,[table+edx] ; translate into bit count - 1
ENDM
Note: FASTMUL can be defined if your CPU has a fast integer multiplicator, like the AMD.
The IMUL replacement should run in about 8-9 cycles.
Gem writer: Norbert Juffa <mailto:norbert.juffa@amd.com>
last updated: 1999-09-02
----------------------------------------------------------------------------------------------------------
...
 > updated micro-code << Wow where I can get
this? Do you have an assembler/disassembler?
I once asked Tigran Aizvian but he told me not to have the information at all.
With that I could save myself some time reversing it by trial and error using the black
box approach.
Someday. As far as I know still Intel CPUs microcode updates are not digitally signed
so it could be reprogrammed.
Here's updated micro-code for BSR, if you really think AMD & Intel will
actually do it...
IF r/m = 0
THEN
   ZF := 1;
   register := UNDEFINED;
ELSE
   ZF := 0;
   register := 0;
   temp := r/m;
   IF OperandSize = 32 THEN
     IF (temp & 0xFFFF0000) != 0 THEN
       temp >>= 16;
       register |= 16;
     FI;
   FI;
   IF (temp & 0xFF00) != 0 THEN
     temp >>= 8;
     register |= 8;
   FI;
   IF (temp & 0xF0) != 0 THEN
     temp >>= 4;
     register |= 4;
   FI;
   IF (temp & 0xC) != 0 THEN
     temp >>= 2;
     register |= 2;
   FI;
   register |= (temp>>1);
FI;

RE: [ros-dev] ping Alex regarding log2() for scheduler