One of the things which has bothered me a bit is the code duplication we have in our DIB engine (subsys/win32k/dib). Most of the BitBlt routines in there are very similar. With the recent interest in optimizations a bunch of new (almost identical) routines were added. Don't get me wrong, I'm not saying that adding those optimizations was a bad idea, I'm just pointing out that we have a lot of code duplication.
There are 256 possible ROP codes and we support six depths (1bpp, 4bpp, 8bpp, 16bpp, 24bpp and 32bpp), so in theory there could be 256 × 6 = 1536 routines with basically the same structure. I've been playing around with the idea of writing a code generator which would generate the source code for those routines. That would cut down on the duplicated source code and the associated maintenance problems (you only need to change the code generator) while still allowing optimized code for each individual ROP code.
Just to give you an idea of what such a code generator would look like, I've attached my first attempt. Please note that it doesn't really try to optimize the generated code yet; it's just to give an impression. The generated code (16bpp only atm) is rather large. You can get it from ftp://ftp.geldorp.nl/pub/ReactOS/dib16gen.c if you like (or compile the code generator with "gcc -o gendib gendib.c" and run it).
A possible problem is that the generated code is quite large. When using the generated 16bpp code, the size of win32k.sys increases by about 350kb. Extrapolating this to all bpps, it would mean that win32k.sys would triple in size.
So, I'm wondering what you guys are thinking. Should we basically trade memory for speed? Problem is that I can't quantify the speed increase at the moment.
Gé van Geldorp.
Hi, I like the idea, but I see some problems. For example, we would need to change the source for each rop4; BLACKNESS, for instance, calls the color_fill routine. There are also some other problems with a gen.c.
Basic GDI functions should always be optimized for speed rather than size. The entire windowing system and other components rely on these functions, so any bottleneck in them becomes a bottleneck in everything else. If the code from your generator is just as fast as the current implementations, or can be made just as fast if not faster, then I'd say go for it. Otherwise, I'd say don't.
Richard
Ge van Geldorp wrote:
#include <stdio.h>
#include <stdlib.h> /* exit() */

/*
 * A ROP code is an 8-entry truth table. An operand is unused when the
 * result is the same for both values of that operand's bit, which is
 * what these macros test.
 */
#define USES_DEST(RopCode)    ((((RopCode) & 0xaa) >> 1) != ((RopCode) & 0x55))
#define USES_SOURCE(RopCode)  ((((RopCode) & 0xcc) >> 2) != ((RopCode) & 0x33))
#define USES_PATTERN(RopCode) ((((RopCode) & 0xf0) >> 4) != ((RopCode) & 0x0f))

/* Print the routine name for this depth and ROP code (named if known). */
static void
PrintRoutineName(FILE *Out, unsigned Bpp, unsigned RopCode)
{
    static struct
    {
        unsigned RopCode;
        const char *Name;
    }
    KnownCodes[] =
    {
        { 0x00, "BLACKNESS" },
        { 0x11, "NOTSRCERASE" },
        { 0x33, "NOTSRCCOPY" },
        { 0x44, "SRCERASE" },
        { 0x55, "DSTINVERT" },
        { 0x5a, "PATINVERT" },
        { 0x66, "SRCINVERT" },
        { 0x88, "SRCAND" },
        { 0xbb, "MERGEPAINT" },
        { 0xc0, "MERGECOPY" },
        { 0xcc, "SRCCOPY" },
        { 0xee, "SRCPAINT" },
        { 0xf0, "PATCOPY" },
        { 0xfb, "PATPAINT" },
        { 0xff, "WHITENESS" }
    };
    unsigned Index;

    for (Index = 0; Index < sizeof(KnownCodes) / sizeof(KnownCodes[0]); Index++)
    {
        if (RopCode == KnownCodes[Index].RopCode)
        {
            fprintf(Out, "DIB_%uBPP_BitBlt_%s", Bpp, KnownCodes[Index].Name);
            return;
        }
    }
    fprintf(Out, "DIB_%uBPP_BitBlt_%02x", Bpp, RopCode);
}

/* Emit one 16bpp blit routine, leaving out the operands the ROP ignores. */
static void
CreatePrimitive(FILE *Out, unsigned Bpp, unsigned RopCode)
{
    int UsesSource;
    int UsesPattern;
    int UsesDest;

    UsesSource = USES_SOURCE(RopCode);
    UsesPattern = USES_PATTERN(RopCode);
    UsesDest = USES_DEST(RopCode);

    /* Function header and local variables */
    fprintf(Out, "\n");
    fprintf(Out, "static void\n");
    PrintRoutineName(Out, Bpp, RopCode);
    fprintf(Out, "(PBLTINFO BltInfo)\n");
    fprintf(Out, "{\n");
    fprintf(Out, " ULONG DestX, DestY;\n");
    if (UsesSource)
    {
        fprintf(Out, " ULONG SourceX, SourceY;\n");
    }
    if (UsesPattern)
    {
        fprintf(Out, " ULONG PatternY = 0;\n");
    }
    fprintf(Out, " ULONG Dest = 0, Source = 0, Pattern = 0;\n");
    fprintf(Out, " PULONG DestBits;\n");
    fprintf(Out, " ULONG RoundedRight;\n");
    fprintf(Out, "\n");
    fprintf(Out, " RoundedRight = BltInfo->DestRect.right -\n");
    fprintf(Out, " ((BltInfo->DestRect.right - BltInfo->DestRect.left) & 0x1);\n");
    if (UsesSource)
    {
        fprintf(Out, " SourceY = BltInfo->SourcePoint.y;\n");
    }
    fprintf(Out, " DestBits = (PULONG)(\n");
    fprintf(Out, " BltInfo->DestSurface->pvScan0 +\n");
    fprintf(Out, " (BltInfo->DestRect.left << 1) +\n");
    fprintf(Out, " BltInfo->DestRect.top * BltInfo->DestSurface->lDelta);\n");
    fprintf(Out, "\n");
    if (UsesPattern)
    {
        /* A literal '%' in the generated code must be written as "%%" */
        fprintf(Out, " if (BltInfo->PatternSurface)\n");
        fprintf(Out, " {\n");
        fprintf(Out, " PatternY = (BltInfo->DestRect.top + BltInfo->BrushOrigin.y) %%\n");
        fprintf(Out, " BltInfo->PatternSurface->sizlBitmap.cy;\n");
        fprintf(Out, " }\n");
        fprintf(Out, " else\n");
        fprintf(Out, " {\n");
        fprintf(Out, " Pattern = BltInfo->Brush->iSolidColor |\n");
        fprintf(Out, " (BltInfo->Brush->iSolidColor << 16);\n");
        fprintf(Out, " }\n");
    }
    fprintf(Out, "\n");

    /* Row loop; the inner loop handles two 16bpp pixels per ULONG */
    fprintf(Out, " for (DestY = BltInfo->DestRect.top; DestY < BltInfo->DestRect.bottom; DestY++)\n");
    fprintf(Out, " {\n");
    if (UsesSource)
    {
        fprintf(Out, " SourceX = BltInfo->SourcePoint.x;\n");
        fprintf(Out, "\n");
    }
    fprintf(Out, " for (DestX = BltInfo->DestRect.left; DestX < RoundedRight; DestX += 2, DestBits++");
    if (UsesSource)
    {
        fprintf(Out, ", SourceX += 2");
    }
    fprintf(Out, ")\n");
    fprintf(Out, " {\n");
    if (UsesDest)
    {
        fprintf(Out, " Dest = *DestBits;\n");
        fprintf(Out, "\n");
    }
    if (UsesSource)
    {
        fprintf(Out, " Source = DIB_GetSource(BltInfo->SourceSurface, SourceX, SourceY, BltInfo->XlateSourceToDest);\n");
        fprintf(Out, " Source |= DIB_GetSource(BltInfo->SourceSurface, SourceX + 1, SourceY, BltInfo->XlateSourceToDest) << 16;\n");
        fprintf(Out, "\n");
    }
    if (UsesPattern)
    {
        fprintf(Out, " if (BltInfo->PatternSurface)\n");
        fprintf(Out, " {\n");
        fprintf(Out, " Pattern = DIB_GetSource(BltInfo->PatternSurface, (DestX + BltInfo->BrushOrigin.x) %% BltInfo->PatternSurface->sizlBitmap.cx, PatternY, BltInfo->XlatePatternToDest);\n");
        fprintf(Out, " Pattern |= DIB_GetSource(BltInfo->PatternSurface, (DestX + BltInfo->BrushOrigin.x + 1) %% BltInfo->PatternSurface->sizlBitmap.cx, PatternY, BltInfo->XlatePatternToDest) << 16;\n");
        fprintf(Out, " }\n");
        fprintf(Out, "\n");
    }
    fprintf(Out, " *DestBits = DIB_DoRop(BltInfo->Rop4, Dest, Source, Pattern);\n");
    fprintf(Out, " }\n");
    fprintf(Out, "\n");

    /* Trailing single pixel when the width is odd */
    fprintf(Out, " if (DestX < BltInfo->DestRect.right)\n");
    fprintf(Out, " {\n");
    if (UsesDest)
    {
        fprintf(Out, " Dest = *((PUSHORT)DestBits);\n");
        fprintf(Out, "\n");
    }
    if (UsesSource)
    {
        fprintf(Out, " Source = DIB_GetSource(BltInfo->SourceSurface, SourceX, SourceY, BltInfo->XlateSourceToDest);\n");
    }
    if (UsesPattern)
    {
        fprintf(Out, " if (BltInfo->PatternSurface)\n");
        fprintf(Out, " {\n");
        fprintf(Out, " Pattern = DIB_GetSource(BltInfo->PatternSurface, (DestX + BltInfo->BrushOrigin.x) %% BltInfo->PatternSurface->sizlBitmap.cx, PatternY, BltInfo->XlatePatternToDest);\n");
        fprintf(Out, " }\n");
        fprintf(Out, "\n");
    }
    fprintf(Out, " DIB_16BPP_PutPixel(BltInfo->DestSurface, DestX, DestY, DIB_DoRop(BltInfo->Rop4, Dest, Source, Pattern) & 0xFFFF);\n");
    fprintf(Out, " DestBits = (PULONG)((ULONG_PTR)DestBits + 2);\n");
    fprintf(Out, " }\n");
    fprintf(Out, "\n");

    /* Advance to the next row */
    if (UsesSource)
    {
        fprintf(Out, " SourceY++;\n");
    }
    if (UsesPattern)
    {
        fprintf(Out, " if (BltInfo->PatternSurface)\n");
        fprintf(Out, " {\n");
        fprintf(Out, " PatternY++;\n");
        fprintf(Out, " PatternY %%= BltInfo->PatternSurface->sizlBitmap.cy;\n");
        fprintf(Out, " }\n");
    }
    fprintf(Out, " DestBits = (PULONG)(\n");
    fprintf(Out, " (ULONG_PTR)DestBits -\n");
    fprintf(Out, " ((BltInfo->DestRect.right - BltInfo->DestRect.left) << 1) +\n");
    fprintf(Out, " BltInfo->DestSurface->lDelta);\n");
    fprintf(Out, " }\n");
    fprintf(Out, "}\n");
}

/* Emit the table that maps each of the 256 ROP codes to its routine. */
static void
CreateTable(FILE *Out, unsigned Bpp)
{
    unsigned RopCode;

    fprintf(Out, "\n");
    fprintf(Out, "static void (*PrimitivesTable[256])(PBLTINFO) =\n");
    fprintf(Out, " {\n");
    for (RopCode = 0; RopCode < 256; RopCode++)
    {
        fprintf(Out, " ");
        PrintRoutineName(Out, Bpp, RopCode);
        if (RopCode < 255)
        {
            putc(',', Out);
        }
        putc('\n', Out);
    }
    fprintf(Out, " };\n");
}

/* Emit the entry point, which dispatches through the table. */
static void
CreateBitBlt(FILE *Out, unsigned Bpp)
{
    fprintf(Out, "\n");
    fprintf(Out, "BOOLEAN\n");
    fprintf(Out, "DIB_%uBPP_BitBlt(PBLTINFO BltInfo)\n", Bpp);
    fprintf(Out, "{\n");
    fprintf(Out, " PrimitivesTable[BltInfo->Rop4 & 0xff](BltInfo);\n");
    fprintf(Out, "\n");
    fprintf(Out, " return TRUE;\n");
    fprintf(Out, "}\n");
}

int
main(int argc, char *argv[])
{
    FILE *Out;
    unsigned RopCode;
    unsigned Bpp;

    Bpp = 16;
    Out = fopen("dib16gen.c", "w");
    if (NULL == Out)
    {
        perror("Error opening output file");
        exit(1);
    }

    fprintf(Out, "/* This is a generated file. Please do not edit */\n");
    fprintf(Out, "\n");
    /* The quotes around the header name must be escaped */
    fprintf(Out, "#include \"w32k.h\"\n");

    for (RopCode = 0; RopCode < 256; RopCode++)
    {
        CreatePrimitive(Out, Bpp, RopCode);
    }
    CreateTable(Out, Bpp);
    CreateBitBlt(Out, Bpp);

    fclose(Out);

    return 0;
}
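To see what the three USES_* macros in the generator decide, here is a minimal sketch that applies them to a few named ROP codes. The macros are copied from the generator; the helpers DescribeRop and PrintExamples are hypothetical and are not part of gendib.c.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Copied from the generator. A ROP code is a truth table indexed by
 * (Pattern << 2) | (Source << 1) | Dest, so the pattern splits the code
 * into nibbles (0xf0/0x0f), the source into bit pairs (0xcc/0x33) and the
 * destination into single bits (0xaa/0x55). If both halves are equal, the
 * operand has no influence on the result.
 */
#define USES_DEST(RopCode)    ((((RopCode) & 0xaa) >> 1) != ((RopCode) & 0x55))
#define USES_SOURCE(RopCode)  ((((RopCode) & 0xcc) >> 2) != ((RopCode) & 0x33))
#define USES_PATTERN(RopCode) ((((RopCode) & 0xf0) >> 4) != ((RopCode) & 0x0f))

/* Hypothetical helper: fills Buffer with one flag per operand, e.g. "D-P". */
static const char *DescribeRop(unsigned RopCode, char Buffer[4])
{
    assert(RopCode <= 0xff);
    Buffer[0] = USES_DEST(RopCode) ? 'D' : '-';
    Buffer[1] = USES_SOURCE(RopCode) ? 'S' : '-';
    Buffer[2] = USES_PATTERN(RopCode) ? 'P' : '-';
    Buffer[3] = '\0';
    return Buffer;
}

/* Hypothetical helper: prints the operand flags for a few named codes. */
static void PrintExamples(void)
{
    char Buffer[4];

    printf("BLACKNESS 0x00: %s\n", DescribeRop(0x00, Buffer)); /* --- */
    printf("DSTINVERT 0x55: %s\n", DescribeRop(0x55, Buffer)); /* D-- */
    printf("PATINVERT 0x5a: %s\n", DescribeRop(0x5a, Buffer)); /* D-P */
    printf("SRCAND    0x88: %s\n", DescribeRop(0x88, Buffer)); /* DS- */
    printf("SRCCOPY   0xcc: %s\n", DescribeRop(0xcc, Buffer)); /* -S- */
    printf("PATCOPY   0xf0: %s\n", DescribeRop(0xf0, Buffer)); /* --P */
}
```

A routine generated for SRCCOPY can therefore skip both the destination read and the pattern lookup, which is where the generator saves work per pixel.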
Ros-dev mailing list Ros-dev@reactos.com http://reactos.com:8080/mailman/listinfo/ros-dev
I'll let you know of my opinion when I see some numbers ;-)
Casper
-----Original Message-----
From: ros-dev-bounces@reactos.com [mailto:ros-dev-bounces@reactos.com] On Behalf Of Ge van Geldorp
Sent: 10 June 2005 12:03
To: 'ReactOS Development List'
Subject: [ros-dev] DIB code generator
So, I'm wondering what you guys are thinking. Should we basically trade memory for speed? Problem is that I can't quantify the speed increase at the moment.
Gé van Geldorp.
Ge van Geldorp wrote:
a lot of neat things
Well, I'm kind of confused. I always thought a code generator that removes code duplication would *slow down* the code (Because it uses a general approach) and *reduce size* (because it doesn't duplicate code anymore). I am at a loss on why autogenerating code would make win32k faster and 3 times larger? Isn't removing duplicate code going to make it smaller?
Pardon my confusion...anyways, here's what I think:
A 3X size increase is not OK unless the speed improvement is phenomenal. I'm talking at least 3X.
I also think we should leave win32k's dib functions alone for now and simply try to minimize the duplication (but once again I don't see why this increases size). It turns out that XP's actual fill rate is 48. Ours is about 190, possibly 200 with a retail build. We are about 4 times faster. Our HLINE rate is even faster than the actual VMWare Hardware Accelerated driver. In all other tests we beat XP. I think it's time to congratulate everyone who helped on the optimizations, and move on to fixing other things.... I do however, once again, agree that the current win32k dib code is a mess and should be cleaned up...
Best regards, Alex Ionescu
From: Alex Ionescu
Well, I'm kind of confused. I always thought a code generator that removes code duplication would *slow down* the code (Because it uses a general approach) and *reduce size* (because it doesn't duplicate code anymore). I am at a loss on why autogenerating code would make win32k faster and 3 times larger? Isn't removing duplicate code going to make it smaller?
Depends on what you call "duplicated code". I meant hand-written code. The code generator will in itself create a lot of almost-identical-but-slightly-different code. Basically what I'm trying to do is move "if" statements from the innermost loop to the outside. That results in a speed increase (less code in the inner loops) at the expense of bigger code (almost-identical-but-slightly-different code in the if and else parts).
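The trade-off described here, moving the tests out of the innermost loop, can be sketched as follows. This is only an illustration under simplified assumptions: BltGeneric, BltSrcCopy, BltSrcInvert and the flat pixel arrays are hypothetical and are not the actual win32k routines.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short PIXEL16;

/* Generic version: the ROP code is examined again for every pixel. */
static void BltGeneric(unsigned Rop, PIXEL16 *Dest, const PIXEL16 *Source,
                       size_t Count)
{
    size_t i;

    assert(Rop <= 0xff);
    for (i = 0; i < Count; i++)
    {
        if (Rop == 0xcc)        /* SRCCOPY */
            Dest[i] = Source[i];
        else if (Rop == 0x66)   /* SRCINVERT */
            Dest[i] = (PIXEL16)(Dest[i] ^ Source[i]);
        else if (Rop == 0x88)   /* SRCAND */
            Dest[i] = (PIXEL16)(Dest[i] & Source[i]);
        /* all other codes are left out of this sketch */
    }
}

/*
 * Specialized versions: the decision is made once, when the routine is
 * picked, so the inner loop contains no tests. The price is one nearly
 * identical loop per ROP code.
 */
static void BltSrcCopy(PIXEL16 *Dest, const PIXEL16 *Source, size_t Count)
{
    size_t i;

    for (i = 0; i < Count; i++)
        Dest[i] = Source[i];    /* never reads the destination */
}

static void BltSrcInvert(PIXEL16 *Dest, const PIXEL16 *Source, size_t Count)
{
    size_t i;

    for (i = 0; i < Count; i++)
        Dest[i] = (PIXEL16)(Dest[i] ^ Source[i]);
}
```

Both forms produce the same pixels; the specialized loops are shorter per iteration but each one adds code size, which is the memory-for-speed trade discussed in this thread.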
I've been thinking a bit more about it. There are 256 raster operations, some of which are used more often than others. For example, the PATCOPY ("Copies the brush currently selected in hdcDest, into the destination bitmap.") rop is used far more often than the NOTSRCCOPY ("Copies the inverted source rectangle to the destination.") rop. It doesn't make sense to use a lot of memory for code which is seldom executed.
So, I'm going to limit myself to the 15 named (see BitBlt documentation) rop codes. For the other 241 codes the current (generic) code can be used. At 6 depths, this still means 15 × 6 = 90 primitive routines, so I think I'll continue work on the code generator. Of course, before actually committing anything I'll do some timing tests to see if there is actually a significant performance improvement.
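A dispatch scheme along these lines, with specialized routines for a few named rops and a generic fallback for everything else, might look like the sketch below. All names here (BLT_PRIMITIVE, NamedPrimitives, DoRop3, Blt and the Blt* routines) are hypothetical, only three named rops are filled in, and the pattern is reduced to a solid color; it is not the code that was committed.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*BLT_PRIMITIVE)(unsigned *Dest, const unsigned *Source,
                              unsigned Pattern, size_t Count);

/* Generic evaluation: OR together the truth-table rows selected by Rop.
 * Row index = (Pattern << 2) | (Source << 1) | Dest, applied bitwise. */
static unsigned DoRop3(unsigned Rop, unsigned d, unsigned s, unsigned p)
{
    unsigned Result = 0;
    unsigned Index;

    for (Index = 0; Index < 8; Index++)
    {
        if (Rop & (1u << Index))
        {
            Result |= ((Index & 4) ? p : ~p) &
                      ((Index & 2) ? s : ~s) &
                      ((Index & 1) ? d : ~d);
        }
    }
    return Result;
}

static void BltGeneric(unsigned Rop, unsigned *Dest, const unsigned *Source,
                       unsigned Pattern, size_t Count)
{
    size_t i;
    for (i = 0; i < Count; i++)
        Dest[i] = DoRop3(Rop, Dest[i], Source[i], Pattern);
}

/* Specialized routines for three of the named rops */
static void BltDstInvert(unsigned *Dest, const unsigned *Source,
                         unsigned Pattern, size_t Count)
{
    size_t i;
    (void)Source; (void)Pattern;
    for (i = 0; i < Count; i++)
        Dest[i] = ~Dest[i];
}

static void BltPatInvert(unsigned *Dest, const unsigned *Source,
                         unsigned Pattern, size_t Count)
{
    size_t i;
    (void)Source;
    for (i = 0; i < Count; i++)
        Dest[i] ^= Pattern;
}

static void BltSrcCopy(unsigned *Dest, const unsigned *Source,
                       unsigned Pattern, size_t Count)
{
    size_t i;
    (void)Pattern;
    for (i = 0; i < Count; i++)
        Dest[i] = Source[i];
}

/* Entries that are not listed stay NULL and fall back to BltGeneric. */
static const BLT_PRIMITIVE NamedPrimitives[256] =
{
    [0x55] = BltDstInvert,
    [0x5a] = BltPatInvert,
    [0xcc] = BltSrcCopy
};

static void Blt(unsigned Rop, unsigned *Dest, const unsigned *Source,
                unsigned Pattern, size_t Count)
{
    BLT_PRIMITIVE Primitive = NamedPrimitives[Rop & 0xff];

    assert(Dest != NULL && Source != NULL);
    if (Primitive != NULL)
        Primitive(Dest, Source, Pattern, Count);
    else
        BltGeneric(Rop & 0xff, Dest, Source, Pattern, Count);
}
```

With this layout the table costs one pointer per ROP code, and the code size grows only with the number of routines that are actually specialized.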
Gé van Geldorp.
You're confusing a generative approach with a generic approach. It's a game with the meta layers of definition languages and compilers.
Handwriting several similar functions is generative work, and one should think about stepping up one language layer and letting a special compiler (a generator) write them. On the other hand, one could take a generic approach (many ifs).
Hi!
Couldn't this be done using C++ templates or - since ReactOS doesn't use C++ - preprocessor macros? This would make the code just as optimized as with the generator. OTOH I expect that debugging would be a little bit harder.
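As a rough sketch of the preprocessor variant suggested here, one macro can expand into a complete routine per ROP expression. The macro DEFINE_BLT16, the PIXEL16 type and the simplified parameter list are hypothetical; only the DIB_16BPP_BitBlt_<name> naming follows the generator in this thread.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short PIXEL16;

/*
 * Hypothetical macro: expands into one blit routine whose per-pixel
 * expression is Expr, written in terms of D (destination), S (source)
 * and P (pattern). This simplified version always loads all three.
 */
#define DEFINE_BLT16(Name, Expr)                                        \
    static void DIB_16BPP_BitBlt_##Name(PIXEL16 *Dest,                  \
                                        const PIXEL16 *Source,          \
                                        PIXEL16 Pattern, size_t Count)  \
    {                                                                   \
        size_t i;                                                       \
        assert(Dest != NULL && Source != NULL);                         \
        for (i = 0; i < Count; i++)                                     \
        {                                                               \
            PIXEL16 D = Dest[i], S = Source[i], P = Pattern;            \
            (void)D; (void)S; (void)P;                                  \
            Dest[i] = (PIXEL16)(Expr);                                  \
        }                                                               \
    }

DEFINE_BLT16(SRCCOPY,   S)
DEFINE_BLT16(SRCINVERT, D ^ S)
DEFINE_BLT16(PATINVERT, D ^ P)
DEFINE_BLT16(DSTINVERT, ~D)
```

The routine bodies exist only after preprocessing, so a debugger has no real source lines to step through inside them, which is the drawback mentioned above.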
Regards, JJ
I think this size increase is not worth it. If you say win32k will triple in size, that's an increase of 200%, and just for ROPs, which are only rarely used. I am thinking of one generic method that is capable of doing all BitBlt modes, plus a handful of generated or handmade routines which do the usual stuff such as invert and 1:1 copy.
I've committed the first version of a DIB Blt code generator (the generator is in tools/gendib, the file it generates ends up in subsys/win32k/dib/dib16gen.c). For now, it only generates code for 16bpp destination surfaces; other depths will follow. I've decided on a compromise: only code for named rops will be generated. This keeps the size of the code reasonable (win32k.sys grew from 952k to 1128k, an increase of 176k or about 18%), while still speeding up the most used rop codes considerably. For example, the speed of PATINVERT increased by a factor of 7. Optimized code was already present for SRCCOPY and PATCOPY, so the generated code isn't faster for these cases. We totally smoke the Windows XP DIB engine Blt routines now for 16bpp.
Gé van Geldorp.
Excellent work Gé!
I ran some rosperf tests for your new code, the results are at http://waxdragon.homeip.net/~ford/reactos/rosperf/ .
Check the rosperf_dibgen* files, there is a debug and a release run under qemu 0.7.0 and a debug run under vmware 4.5.2 (using VBE, of course).
WD