From reactos@einsurance.de Thu Mar 24 03:35:47 2005
From: Ash
To: ros-dev@reactos.org
Subject: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Thu, 24 Mar 2005 03:35:41 +0100

Hello,

I'd like to provide a small VS.NET project file with 5 different tests.

- 46ffffe9 tells everything is alright with the function calculation
- times are measured with QueryPerformanceCounter
- loop run cnt: 0x2ffffff

result orig function 46ffffe9
it took 1491052
result orig function inlined 46ffffe9
it took 1035547
result second proposal inlined 46ffffe9
it took 1244434
result optimized asm 46ffffe9
it took 1338367
result debug asm 46ffffe9
it took 8774815

The second proposal is the original proposal but with more shifts - still slower, though.

Interesting is the inlined version generated by MSVC, shaving off almost 1/3 of the overall time.
Also starred as "optimized asm" - no guarantee on register safety, though ;)

For portability and performance's sake, creating a compiler macro should be considered.
This function is terribly small; any optimisations inside it are outweighed by the calling overhead in this case.
The most impressive one is the original function inlined, although the ASM would only work on x86.

Please do not think about using 64k tables - that's what, 1/2 of a Sempron's L2 cache?
It would really, really trash performance.

Available at: Kernel Test http://wohngebaeudeversicherung.einsurance.de
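(The test project itself is only linked above, not posted. Purely as a rough reconstruction of the methodology Ash describes - QueryPerformanceCounter around a loop bounded by 0x2ffffff, with the per-call results summed into a check value - a minimal harness could look like the sketch below. The reference highest_bit() loop and the exact loop bounds are guesses, although summing floor(log2(i)) for i = 1 .. 0x2fffffe does come out to 46ffffe9.)

    #include <windows.h>
    #include <stdio.h>

    /* Reference implementation - a plain loop, slow but obviously correct. */
    static int highest_bit(unsigned int mask)
    {
        int bit = 0;
        while (mask >>= 1)
            bit++;
        return bit;
    }

    int main(void)
    {
        LARGE_INTEGER start, stop;
        unsigned int checksum = 0;
        unsigned int i;

        QueryPerformanceCounter(&start);
        for (i = 1; i < 0x2ffffff; i++)   /* summing floor(log2(i)) over this range gives 46ffffe9 */
            checksum += highest_bit(i);
        QueryPerformanceCounter(&stop);

        printf("result reference %x\n", checksum);
        printf("it took %I64d\n", stop.QuadPart - start.QuadPart);
        return 0;
    }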
From ionucu@videotron.ca Thu Mar 24 04:02:06 2005
From: Alex Ionescu
To: ros-dev@reactos.org
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Wed, 23 Mar 2005 21:57:51 -0500

Ash wrote:
> [...]

Hi Ash,

Thanks a lot for your tests. I don't have much time tonight, but if you'd like/can, can you add two more tests?

One using "bsr", the Intel opcode. It does all the work for you and returns the index. I think it's as simple as "bsr ecx, eax", where "ecx" is the mask and "eax" is the returned index. Or it might be backwards.

The second test is using a 256-byte log2 table:

const char LogTable256[] = {
    0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
};

and lookup code like this:

Addon = 16;
if ((IntMask = Mask >> 16)) {
    Addon = 0;
    IntMask = Mask;
}
if (IntMask && 0xFFFFFF00) {
    Addon += 8;
}
HighBit = LogTable256[(Mask >> Addon)] + Addon;
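(A note on the operand order asked about above: BSR takes the destination first and the source second, so "bsr eax, ecx" scans ecx and leaves the bit index in eax. A minimal MSVC __asm sketch of the requested test - the function name is illustrative, not taken from the actual test project:)

    /* Index of the highest set bit of mask.  Undefined for mask == 0:
       Intel documents BSR as leaving the destination undefined when the
       source is zero (AMD documents it as unchanged). */
    static __inline int highest_bit_bsr(unsigned int mask)
    {
        int index;
        __asm {
            mov  eax, mask
            bsr  eax, eax
            mov  index, eax
        }
        return index;
    }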
From royce3@ev1.net Thu Mar 24 04:21:32 2005
From: Royce Mitchell III
To: ros-dev@reactos.org
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Wed, 23 Mar 2005 21:24:54 -0600

Alex Ionescu wrote:
>
> Addon = 16;
> if ((IntMask = Mask >> 16)) {
>     Addon = 0;
>     IntMask = Mask;
> }
> if (IntMask && 0xFFFFFF00) {
>     Addon += 8;
> }
> HighBit = LogTable256[(Mask >> Addon)] + Addon;
>

methinks there's bugs there, use this instead:

int highest_bit_tabled ( unsigned int i )
{
    int ret = 0;
    if ( i > 0xffff )
        i >>= 16, ret = 16;
    if ( i > 0xff )
        i >>= 8, ret += 8;
    return ret + LogTable256[i];
}

also, FWIW, I've tried the following three tests:

( i > 0xffff )
( i & 0xffff0000 )
( i >> 16 )

and the first is the fastest on my a64
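(A 256-entry table like the one above is easy to mistype. Purely as a sketch, not something from the thread: the same values can be generated at startup, or generated once offline to verify the hand-written initializer.)

    static char LogTable256[256];

    /* Fill LogTable256 so that LogTable256[i] == floor(log2(i)) for i >= 1;
       LogTable256[0] stays 0, matching the hand-written table above. */
    static void init_log_table(void)
    {
        int i;
        LogTable256[0] = 0;
        LogTable256[1] = 0;
        for (i = 2; i < 256; i++)
            LogTable256[i] = (char)(1 + LogTable256[i / 2]);
    }

(In kernel code a static const initializer, as posted above, avoids the init call; the loop is mainly useful for checking the hand-written values.)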
From reactos@einsurance.de Thu Mar 24 07:01:02 2005
From: Ash
To: ros-dev@reactos.org
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Thu, 24 Mar 2005 07:00:55 +0100

BSR has a latency of 8-12 cycles on Athlon/P3 but can be pipelined. Worse (up to ~80 cycles) on Pentium and other older CPUs.
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_3748,00.html

> Maximum latency on a Pentium 4 is eight clock cycles. But its throughput
> is one, which means it is fully pipelined. So when you start this
> instruction eight clock cycles before you need the result, it behaves as
> if it only takes one clock cycle. You're not going to be able to beat that
> with any other code. The closest you can get is to convert the integer to
> float, then extract the exponent. That could be done in less than eight
> clock cycles but throughput will be lower.

http://www.flipcode.com/cgi-bin/fcmsg.cgi?thread_show=16986&msg=113105

Don't know about A64 - maybe someone can test BSR with A64? I'm very disappointed by its performance on my K7 system - I might have messed something up, though.

You could save on the pushing/popping, but that would be kinda like cheating, and if you do it dirty/lazy I won't get the right returns either.

It doesn't make much sense to put the optimized ASM in there, nor is there much hope of GCC having a good day and doing a lot of optimisation. So far the best option would be the macro with a lookup table (only one global kernel table, though).

Here are the updated STATS, also available at http://hackersquest.org/kerneltest.html

result orig function 46ffffe9
it took 1526862  18%
result orig function inlined 46ffffe9
it took 1041460  12%
result second proposal inlined 46ffffe9
it took 1248990  15%
result optimized asm 46ffffe9
it took 1321532  16%
result lookup inlined 46ffffe9
it took 682264  8%
result bsr inlined 46ffffe9
it took 1751088  21%
result macro 46ffffe9
it took 653692  7%

http://wohngebaeudeversicherung.einsurance.de/

----- Original Message -----
From: "Alex Ionescu"
To: "ReactOS Development List"
Sent: Thursday, March 24, 2005 3:57 AM
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)

> [...]
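(The float-exponent trick mentioned in the flipcode quote above is not spelled out anywhere in the thread. The usual formulation looks roughly like the sketch below; it goes through double rather than float, because converting a 32-bit integer to float can round up just below large powers of two and give an exponent that is one too high. The function name is illustrative, and a 64-bit unsigned long long is assumed - older MSVC would spell that unsigned __int64.)

    /* Highest set bit via the FPU: convert to double (exact for any 32-bit
       integer) and read the biased exponent out of the bit pattern.
       Result is undefined for x == 0. */
    static int highest_bit_fp(unsigned int x)
    {
        union { double d; unsigned long long u; } cvt;
        cvt.d = (double)x;
        return (int)(cvt.u >> 52) - 1023;   /* IEEE-754 double exponent bias */
    }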
From mjscod@gmx.de Thu Mar 24 10:55:42 2005
From: Mark Junker
To: ros-dev@reactos.org
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Thu, 24 Mar 2005 10:54:36 +0100

Ash wrote:
> It doesn't make much sense to put the optimized ASM in there, nor is there
> much hope of GCC having a good day and doing a lot of optimisation.
> So far the best option would be the macro with a lookup table (only
> one global kernel table, though).

What about using inline assembler in a macro - only as a platform-specific optimization, of course?

> result bsr inlined 46ffffe9
> it took 1751088  21%

Using GCC this should be much faster when using

#define get_bits(value) \
({ \
    int bits = -1; \
    __asm("bsr %1, %0\n" \
          : "+r" (bits) \
          : "rm" (value)); \
    bits; \
})

This macro returns -1 when no bits were set. I tested it and it works as expected. When the -1 as "error" isn't suitable, you might want to change it to 0 ... line 3 of the macro.

> result macro 46ffffe9
> it took 653692  7%

The table approach doesn't seem to be bad either.

Regards,
Mark
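(A quick way to sanity-check the macro against the table version from earlier in the thread - this checker is not part of the attached sources, it just exercises every bit position, with and without a low bit set:)

    #include <assert.h>

    /* Cross-check get_bits() against highest_bit_tabled() for every
       bit position; both are defined in the messages above. */
    static void check_get_bits(void)
    {
        unsigned int i;
        for (i = 1; i != 0; i <<= 1) {
            assert(get_bits(i) == highest_bit_tabled(i));
            assert(get_bits(i | 1) == highest_bit_tabled(i | 1));
        }
    }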
From mjscod@gmx.de Thu Mar 24 11:27:01 2005
From: Mark Junker
To: ros-dev@reactos.org
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Thu, 24 Mar 2005 11:25:48 +0100

Ash wrote:
> BSR has a latency of 8-12 cycles on Athlon/P3 but can be pipelined.
> Worse (up to ~80 cycles) on Pentium and other older CPUs.
> http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_3748,00.html

My tests have shown that you're right and BSR is much too slow.

> Don't know about A64 - maybe someone can test BSR with A64?

I have an AMD64 here but it doesn't run in 64-bit mode.

> It doesn't make much sense to put the optimized ASM in there, nor is there
> much hope of GCC having a good day and doing a lot of optimisation.
> So far the best option would be the macro with a lookup table (only
> one global kernel table, though).

I've converted your sources to be compilable with GCC (MinGW). I attached the sources.

> Here are the updated STATS, also available at http://hackersquest.org/kerneltest.html
> [...]

These are my results on the AMD64 using your Release EXE:

STATS
result orig function 46ffffe9
it took 1272638  18%
result orig function inlined 46ffffe9
it took 875751  12%
result second proposal inlined 46ffffe9
it took 1051861  15%
result optimized asm 46ffffe9
it took 1225282  17%
result lookup inlined 46ffffe9
it took 549861  7%
result bsr inlined 46ffffe9
it took 1410179  20%
result macro 46ffffe9
it took 607638  8%

These are my results using the GCC EXE (-O2):

STATS
result orig function 46ffffe9
it took 1321663  24%
result orig function inlined 46ffffe9
it took 879318  16%
result second proposal inlined 46ffffe9
it took 940285  17%
result lookup inlined 46ffffe9
it took 615267  11%
result bsr inlined 46ffffe9
it took 1103432  20%
result macro 46ffffe9
it took 484450  9%

BTW: I had to remove all functions using the __asm() statement. The "result bsr inlined" uses my GCC BSR macro. You can see that using BSR seems to be much too slow ...

Regards,
Mark

[Attachment: SpeedTest.zip]
From michael@fritscher.net Thu Mar 24 14:19:16 2005
From: michael@fritscher.net
To: ros-dev@reactos.org
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Thu, 24 Mar 2005 14:15:48 +0100

Testing on an Athlon 1.2 GHz and a K6 233 MHz, both on Windows 2000:

T:\cvs\_cd>gcc --version
gcc (GCC) 3.4.2 (mingw-special)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Athlon 1.2 GHz:

T:\cvs\_cd>speedtest
STATS
result orig function 46ffffe9
it took 944301  23%
result orig function inlined 46ffffe9
it took 697547  17%
result second proposal inlined 46ffffe9
it took 774974  19%
result lookup inlined 46ffffe9
it took 603607  15%
result bsr inlined 46ffffe9
it took 656956  16%
result macro 46ffffe9
it took 336330  8%

K6 233 MHz:

C:\>speedtest
STATS
result orig function 46ffffe9
it took 5819845  23%
result orig function inlined 46ffffe9
it took 3533468  14%
result second proposal inlined 46ffffe9
it took 4043743  16%
result lookup inlined 46ffffe9
it took 2290520  9%
result bsr inlined 46ffffe9
it took 5961779  24%
result macro 46ffffe9
it took 3001376  12%
From royce3@ev1.net Thu Mar 24 14:32:03 2005
From: Royce Mitchell III
To: ros-dev@reactos.org
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() for scheduler)
Date: Thu, 24 Mar 2005 07:35:04 -0600

Ash wrote:
> BSR has a latency of 8-12 cycles on Athlon/P3 but can be pipelined.
> Worse (up to ~80 cycles) on Pentium and other older CPUs.
  ^^^

I think this might be our "silver bullet". I don't want to waste 256 bytes of L1 cache (assuming we get a cache hit), or spend hundreds of cycles once per interrupt waiting for the cache-miss lookup to go through, so the table-based approach is bad in this scenario.

In some more testing, BSR is never significantly faster than the C code, but in many scenarios it is equivalent in speed. However, the C code trashes too many registers to do in parallel with anything else.

All that being said, I think BSR's ability to be pipelined will make it our big winner after all. I will work on trying to verify whether BSR pipelines well on AMD products, too.
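(The thread leaves the final choice between the table and BSR open. Purely as an illustration of how the two candidates could sit behind one kernel-wide macro - nothing like this is in the attached sources, and the KiHighestBit name is made up - a sketch might look like the one below. Both branches expect a non-zero mask, and the table branch evaluates Mask more than once.)

    #if defined(__GNUC__) && defined(__i386__)
    /* BSR via GCC inline assembly (x86 only). */
    #define KiHighestBit(Mask)                                   \
        ({                                                       \
            int _bit;                                            \
            __asm__("bsr %1, %0" : "=r" (_bit) : "rm" (Mask));   \
            _bit;                                                \
        })
    #else
    /* Portable fallback: the 256-entry LogTable256 lookup. */
    #define KiHighestBit(Mask)                                          \
        ((Mask) > 0xffff                                                \
            ? ((Mask) > 0xffffff ? 24 + LogTable256[(Mask) >> 24]       \
                                 : 16 + LogTable256[(Mask) >> 16])      \
            : ((Mask) > 0xff     ?  8 + LogTable256[(Mask) >>  8]       \
                                 :      LogTable256[(Mask)]))
    #endif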