News:

Simutrans Chat Room
Where cool people of Simutrans can meet up.

New dirty tile management

Started by Markohs, April 23, 2013, 04:05:26 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Markohs

Having a look to the dirty tile management looks a bit sub-optimal. If I'm going to link background redraw with dirty tile management the implementation has to be changed to group dirty tiles in bigger chunks. This can improve performance even without background drawing, because I guess SDL for example or even W32 will prefer to receive less screen updates, with bigger chunks. The current implementation groups them in horizontal strips, but not vertically.

I think I'll try to implement this with a quadtree, even if it ends not being useful for background it will improve simutrans performance I think.


EDIT: Trying a R-Tree instead, we'll see if it performs better than current implementation.
EDIT2: But that will be later and depending how Bayern Munich - Barça ends tonight! Good luck to both teams (specially Barça as I'm a fan ;) ) and may the best of both win! ;)

Mod note: topic split from http://forum.simutrans.com/index.php?topic=10835.msg116296#msg116296 on April 27, 2013 at 08:52:15 PM (UTC) by Prissi
~~IgorEliezer

Yona-TYT


Quote from: Markohs on April 23, 2013, 04:05:26 PM
But that will be later and depending how Bayern Munich - Barça ends tonight! Good luck to both teams (specially Barça as I'm a fan ;) ) and may the best of both win! ;)
Jejejeje...si pero el Barça no la tiene  fácil ya que es  visitante  :::)

Markohs

hehe we'll see, we have messi. if he's in the mood will be hard to beat us. but yes, today munich is favourite.  gl! :)

Yona-TYT

Bayern Munich 4 - 0 Barça......... :o

Markohs

Well, no more offtopic. :) Just saying that even Bayern outplayed Barça the whole match and it's a well deserved win, the refereee helped the germans quite a lot with 2 goals that should not have been goals(clear offside and a foul). ;)

Markohs

#5
R-tree was too much overkill for this, and didn't warranteed non overlapping dirty zones.

At the end I ended adapting the dirty tile management from The Vexi Platform , a library written in java, that codes is designed to be fast an have low overhead, released as apache license 2.0 so should not be much problem. I mainly jjust translated from java and adapted it a bit to our enviroment.

Loooks like it groups dirty tiles better, bigger chunks, not overlapped.

On this screenshot dark green are the old code dirty tiles, and on yellow the new code ones, what do you think?




A patch here, but mixed with the new background code, I can separate it further if you are interested, the main logic is in simdirty.cc/h

I'll take some measurements turfit style, I suspect this will save quite a lot of CPU/bandwidth.

EDIT: Inspecting this further, there are bugs on this, and it actually creates more dirty sections than the old code. It also needs to set a minimum dirty zone dimension, on it.

prissi

Especially moving stuff was deliberately done with bigger moving boxes. But if the smaller do not produce any errors (mostly on oversized airplanes, birds, pedesstrains jumping over crossings, zeppelin, smoke and the like) i.e. leave any traces in non-debug mode it seems quite useful. I will try it too.

Markohs

Needs extra work, the code I got doesn't merge dirty boxes when aligned horizontally or vertically, and that part of the code never got done:


//            if (close(x, _x) && close(w, _w)) {
//                if (close(h, _y)) {
//                    Math.min(x, _x), Math.max(w, _w)
//                }
//                if (close(y, _h)) {
//                   
//                }
//            }


That's just the part that was suposed to merge zones... And it's not easy because if you merge zones you need to make sure the sum of both don't overlap other dirty zones, it's harder than it looks.

Corrected code a bit, w,h thse variable names made no sense, and the dirty zone had to be igonored if .x0 was -1, you'll find this patch easier to read.

I also truncated coordinates from their lower 4 bits to force a minimum size of the zones, as it was with old code.

https://dl.dropboxusercontent.com/u/30024783/dirty_tiles-2.patch

TurfIt

#8
I just finished timing different tile sizes... I'll list the results here along with from your patch. patch v1 only as -2 crashes on startup (v1 crashes on exit).

# oftiles/framems/frame
tilesizemin# avg# max# GDISDL
1205624625646.840.5
2113158349031.427.4
461595175826.227.0
8365372527.027.5
16225730230.126.7
32210312328.326.2
642253728.526.4
patch-211027117525.420.4

There's way too much time spent creating those rectangles in the patch. Visual artifacts abound too... rectangles getting too close to the moving objects. I see artifacts with the non-patch dirty tile stuff when tilesize <= 4 as well. I don't think the moving objects are doing a good job of marking their actual pixels dirty...

Looking at the tiles generated with the non-patch routine, I don't see much room for merging them vertically. There's mostly a stair step pattern, very few vertically aligned tiles with the same left/right. Where I do see vertical dirty tiles is mostly on the station bar graph which is being needlessly marked dirty each frame. Fix that and there will be even less merging opportunity.

EDIT: updated patch numbers for v2.

Markohs

#9
I think I can fix it, gimme some time. thanks for your tests. you have to tell me someday what you use to generate that. ;)
i have a v3 already, should fix most of crashes (switched from new delete [] to malloc remalloc) now triggers stack over-rrcursion sometimes through.

edit:btw does somebody know if there is a esy way to compile a simutrans with vs that keeps the windows console open so i can see the printf?

TurfIt

See the corrected table above.
Now with the x==-1 continue, and 16bit masking.
Also using w-x, h-y instead of w,h since w,h aren't width,height.  Nice trap there!  8)

Fix the new/delete crashes, and artifacts, and this might actually work.  :)

prissi

The vehicles do the dirty masking only after steps, and not in mid tile. Thus very close boxes will lead to artefacts. Best to see this is the Zepplin circling for an occupied airport.

Looking at the table, tilesize 8 (which we have in the current code already) seems the best compromise for GDI and SDL.

And looking at the patch: Why not using the vector_tpl for the dirty_list?

Markohs

#12
 About the vehicle dirty zones masked, I'll leave that to the rest of coders if you don't mind, I'm not interested in that part of the code atm (yet). :)

I'm making some advances to the merging, with little to no extra CPU usage.

About vector_tpl yes I can use that, I just made a direct and fast conversion from java, to try the concept. I'll use vector_tpl ofc if I keep the vector. But I'm experimenting with std::multimap, I think that will give even bigger CPU boost, if it's worth it I'll use the multimap (I think there is no tpl data structure on that), we can always ship a implementation to that for Haiku (gcc 2.95.1), or revert to the vector code using #ifdef.

There are a lot of things to fix still, code formatting, function names etc... I'll try to have it ready soon for you to test and give your opinion.

EDIT: btw, turfit, did you compiled with thread support? the code is not thread safe still.

TurfIt

tilesize 16 is the current in trunk. I repeated the tests for 8, 16, and 32 as the table above was only a single run. Turns out the high GDI 16 time was an anomaly. Times aren't directly comparable to above as the retest was with the r6469 fix, and the attached patch. (and it's a couple hours later which always screws with my timing!)


# oftiles/framems/frame
tilesizemin# avg# max# GDISDL
8342448428.827.1
16218322427.626.8
322759827.526.6

The attached patch stops the station ware bars from being always dirty - needs to cache the old value to avoid artifacts.
It also merges dirty tiles horizontally a bit more. Ugly hack to skip single non-dirty tiles surrounded by dirties. Doesn't seem to affect things measurably, but I'll throw it out for discussion. Moot if Markohs algorithm is used.

I had partial threading compiled in. Disabled in simsys_* due to interfering with my timing. simview threading was enabled. The existing dirty tile code isn't thread safe either  :o   (safe enough to not crash - just possible artifacts)

Ters

Quote from: Markohs on April 24, 2013, 07:02:48 PM
edit:btw does somebody know if there is a esy way to compile a simutrans with vs that keeps the windows console open so i can see the printf?

There should be a setting to compile as a console application in there somewhere, but I only have Visual Studio at work, and only for very rare C++ or C# work.

Dwachs

Quote from: Markohs on April 24, 2013, 07:02:48 PM
edit:btw does somebody know if there is a esy way to compile a simutrans with vs that keeps the windows console open so i can see the printf?
There is this switch:

WIN32_CONSOLE = 1 # adds a console for windows debugging

Dont know whether it is working ...
Parsley, sage, rosemary, and maggikraut.

prissi

That unfourtunately does not work on MSVC when using the internal debugger, if this is the question. You had to built a console project, i.e. change the settings in the project. But not sure about newer MSVC versions.

Markohs

yea, it doesn't work.

I managed to get it working adding:

   AllocConsole();
   freopen("conin$","r",stdin);
   freopen("conout$","w",stdout);
   freopen("conout$","w",stderr);
   printf("Debugging Window:\n");

in WinMain()


TurfIt

WIN32_CONSOLE works just fine with MinGW.

Markohs

yea, but not on vs. I'll add that to simsys if you don't mind, if it detects the macro and visual studio

Markohs

My new version, with just making the algorithm consider as intersecting tiles when they differ just in 1 pixel (this makes the algorithm process sightly more triangles, and just adding a few extra checks I managed to implement zone merging, that was specially needed for seas, since they are animated and cause the rectangle count to skyrocket. Good news also that vertical merge is easy to implement and will do so soon.

Rectangle count is now always lower than the old version.

I *think* the CPU cost it's more than acceptable. The patch here soon, a screenshot now (red means fusioned rectangles(left to right) blue from right to left, and yellow not fusioned rectangles)


prissi

#21
Sorry, why does the animated sea make the rectangle count skyrocket? With the dirty tiles in simgraph this will result on one big dirty tile for the whole band, i.e. almost a single operation.

Markohs

yes, I know. but my previous code didn't managed to compact the tiles so much before. this new does.

Markohs

A preview of the state of the patch, to note:

1) Due to debugging, there is extra code to code a color in each dirty zone to help me debug how it was created, this ofc makes the code less efficient, and needs to be disabled on final patch, or protected with #ifdefine DEBUG_FLUSH_BUFFER. Dirty() takes one extra debugging argument too.
2) Compressing the dirty rectangles makes necessary much extra logic. To make the compiler able to optimize all the expressions when compiling, I used the preprocessor so the compiler can re-arrange the evaluations and use lazy evaluations etc. I tried to make code readable even in this conditions. This might mean the code will run way slower in debug compillarions, but should be good enough on release (or /O3).
3) This algorithm is a recursive one. To be able to compress tiles I had to remove the recursion "fita" or "guard", dunno the name in english, that warranteed the algorithm didn't en in infinite recursion (in this original algorithm it was the "ind", parameter that was allways increased each recursive call). I think the algorithm is correct and will not fail, but I'm not 100% sure.
4) This is *NOT* thread safe still
5) There are a lot of tile invalidations (x0=-1), it whould be nice testing if switching to vector_tpl and compacting the vector on each position invalidation will speed up the program. I'm not sure, since the vector is reseted each frame anyway.
6) Made the patch against the last trunk, background code is removed from this now.
7) I plan passing the bit-mask of the minimum tile dimensions to the object creator, the current masking  in the call in mark_rect_dirty_nc will be removed.

A SS of the result.


TurfIt

I'll see about getting this profiled... But, looking back on my last test of patch-2, I missed the fact a lot of the work has been moved out of the display section of sync_step and into the sync section. While there was an overall reduction in the total time, the sync was slowed by over 350%. This absolutely kills the achievable fast forward rate - using the normal Yoshi profile game, I can normally hit ~220x FF, with patch-2 58x. IMHO, all the rectangle merging must be moved out of sync. i.e. It cannot be done in mark_rect_dirty_nc(). How about retaining the existing dirtytile arrays, and processing them once in display_flush_buffer() to create the merge?

Also, this patch isn't using the double dirty buffer like the existing code - tile_dirty and tile_dirty_old. With the current moving object dirty marking, _old is required to reduce artifacts. Even still, I can induce artifacts at certain zooms, at certain FPS settings, and watching certain vehicles. But, without the _old, artifacts are everywhere!  After a quick look, I think it would be a boatload of work to get the moving objects 'correctly' marking, until then, _old is required IMHO.

Finally, I actually whipped up a quick naive merging algorithm in display_flush_buffer(). Two actually - one that tends to create more horizontal rectangles, and one that spits out verticals. < 10 extra lines of code  ;)  But I've yet to reenable the _old, so will need a few more lines for that. Once done, I'll compare timings and rectangle counts for my patch, your patch, and trunk.

TurfIt

New results below. Tile counts not comparable to previous due to the trunk changes I made eliminating several unnecessary screen updates; Which also has the effect of removing many of the easily merged tiles too... Timing not comparable as I ran this on my old Core2 instead, and it's the total sync_step time, not just display.

# oftiles/framems/sync_step
min# avg# max# SDL
trunk217625543.46
v merge116623443.46
h merge116123143.31
-6120731055.78
-6 mod218927043.59
Patch -6 is increasing the number of dr_texture calls!
For -6 mod, I reenabled the trunk dirty tiles, and fed the merged horizontal stripes to tile_dirty_list.dirty() in display_flush_buffer(). This solves the fast forward problem, and gives much better results, but still not good. I think this approach can be abandoned. I think my simple H Merge is working good enough - not that there's much timing difference on my systems, but it might help Macs out some more...

Perhaps a mod can split all this dirty tile talk to a seperate thread? Prissis last post is not clear if the problem is with patch -6, or the world limits patch which is what this is supposed to be about!

Dwachs

#26
I pretty much get consistently infinite recursion by scrolling around the yoshi game. Here is a stack excerpt:

#1  0x000000000062a05a in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:121
#2  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#3  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122
#4  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#5  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122
#6  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#7  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122

see the repeating arguments ...

Edit: This can be fixed by always doing the recursive calls with the index argument set to i+1 (these are the calls for merged regions). Imho there is no need to call with index=0, as the list maintains disjoint rectangles.

Btw, are the dirty regions defined as x0 <= x < x1 or x0 <= x <= x1 ?
Parsley, sage, rosemary, and maggikraut.

Markohs

#27
Quote from: Dwachs on April 27, 2013, 06:45:59 PM
I pretty much get consistently infinite recursion by scrolling around the yoshi game. Here is a stack excerpt:

#1  0x000000000062a05a in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:121
#2  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#3  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122
#4  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#5  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122
#6  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#7  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122

see the repeating arguments ...

Edit: This can be fixed by always doing the recursive calls with the index argument set to i+1 (these are the calls for merged regions). Imho there is no need to call with index=0, as the list maintains disjoint rectangles.

Btw, are the dirty regions defined as x0 <= x < x1 or x0 <= x <= x1 ?

Sorry guys was busy this weekend. :)

Yes, the infinite recursion can allways be fixed with i+1 in the call. It only has the side effect that the newly created object might have ended merged with a rectangle created before. But we can get rid of the infinte recursion problem that way.

Dirty regions include all values from x0 to x1 , including them both.

What's why on split I use x0 to _x1 -1 and _x1 to x1

Anyway, I think my code might have gone too far there, because it won't ever be so fast as turfit code, I think.

I guess turfit optimzations are more than enough, as he says. they will allways be better than my patch because I was running a algoithm that's not time constant as the old code it. It was a bit crazy, but it deserved a try I think. :) My algorithm was (x is number of dirty zones) at O(x) (with calls i+i)  more or less and Turfit one O(1), impossible to beat that.

The problem of getting the smallest set of dirty rectangles belongs to complexity EXP, if I'm not wrong. We can only aproach to this program with a heuristic, and my aproach it's not fast enough.

It only whould have been clearly better if we fixed the merging a bit, and displaced all the calculations to another thread.

But it's completely inneficient trying to vertically merge rows after a h-row is created in the current code, turfit? I mean because maybe when sea is displayed, there is much vmerge that can be done, after the hmerge has been done.

EDIT: a lot, changed most of the message, sorry. ;)

TurfIt

Simple merging patch attached. Reduces rectangle count by ~10%. Makes no attempt at finding the minimum #, just finds the width of the dirty block, then tries to extend it down.

But, simple went away when I changed to scanning whole uint32s rather than bits. Gives me ~1% faster on a Core2 to ~10% on an i7. If this is too cryptic, removing the ability to span the dirty block width across words cleans things up a lot. But increases the rectangle count substantially. Or, I could just revert to bits...

Also, I'm using a GCC builtin with a slower lookup table for non GCC. There's apparently a MSVC intrinsic that'll do the job too (_BitScanForward), but I can't test what I don't have...


Markohs

Why don't you store dirty bits in an array of bool ? Can't maybe that be way faster? It won't be so much memory. And the speed it's the important here.

TurfIt

The bit manipulations rely strongly on a 32 bit word length, hence the array of uint32. Unless a bool is guaranteed length on all platforms...
The only memory waste is from alignment by forcing the first dirty tile on each row to be the first bit in a word. The pre-patch bit wise algorithm packed them tighter, but my head would explode trying to handle that word wise. Also, the memory isn't much, depends upon screen size but 500-3000 bytes would be expected. Easily within L1 cache.

Markohs

I mean one entire bool for each dirty tile position, not packing multiple tiles in a uint32. You waste quite a lot of memory but don't need to do any boolean arichmetic, you just dirty[y*row_len + x];

It's just a suggestion, but getting rid of the and ors and bitwise logic maybe saves CPU. After all, I doubt this memory structure will be highter than one or two Kb, no? Didn't made the calculations. :)

I guess bool's size is standarized to 32 bits, I'm not sure. But just declaring it as bool and refering to the array as bool positions, maybe the compiler has room to do magic and optimize the algorithm.

But the extra memory I/O and possible cache invalidations can maybe make it slower. I don't know, I'd try it and compare.

TurfIt

An entire int (what a bool ends up as) for each tile! Yikes. Remember, memory access expensive, CPU operations cheap. dirt cheap infact. The structure is already 3kB (full screen retina display) let alone 1-2kb. (bit b vs byte B), a tile per bool would be 80kB... Feel free to try if you like, but I'm sure that would be glacial. Bit ands, ors, shifts, etc. doesn't get any more basic an operation for a CPU that those.

kierongreen

While 80kb might not sound much but difference between 1-2kb and 80kb is fitting into many level 1 or 2 caches (or not), which would have a considerable performance impact.

Markohs

#34
Well, you are both probably right I guess. :)

Using a bool for each tile also makes the code automatically reentrant to multiple threads as a side-effect.

It also might benefit from the compiler optimizations (loop unrolling) more easiliy.

But you are right, it might get the CPU cache dirty. And whould take more CPU to clear the buffer.

EDIT: Looks like in C++ a bool is allways 8 bits. Might not be so big difference then.
EDIT2: I tested its size with sizeof and:

g++ (GCC) 3.4.3 on ultrasparc machine outputs 1 byte.
Visual Studio 2012 1 byte too.

I guess gcc on intel will be one byte too. So I'd say it's just 1 byte on all plattforms.

Anyway the buffer size will be multiplied by 8, so it might end being too big. And bitwise CPU operations are cheap, so you are both prolly right.