
New dirty tile management

Started by Markohs, April 23, 2013, 04:05:26 PM


Markohs

Having a look at the dirty tile management, it looks a bit sub-optimal. If I'm going to link background redraw with dirty tile management, the implementation has to be changed to group dirty tiles into bigger chunks. This could improve performance even without background drawing, because I guess SDL, for example, or even W32 will prefer to receive fewer screen updates with bigger chunks. The current implementation groups them in horizontal strips, but not vertically.
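As a rough illustration of the strip grouping described here (all names and the layout are illustrative only, not the actual simgraph code), marking tiles dirty and flushing them as per-row horizontal runs might look like this:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: one flag per tile, flushed as horizontal runs.
struct dirty_grid_t {
    int tiles_x, tiles_y;
    std::vector<uint8_t> flags;
    dirty_grid_t(int w, int h) : tiles_x(w), tiles_y(h), flags(w * h, 0) {}
    void mark(int x, int y) { flags[y * tiles_x + x] = 1; }
};

struct run_t { int y, x0, x1; };  // inclusive tile range on one row

// Collect maximal horizontal runs of dirty tiles. Vertically adjacent
// runs are never joined -- exactly the limitation discussed above.
std::vector<run_t> flush(const dirty_grid_t &g) {
    std::vector<run_t> runs;
    for (int y = 0; y < g.tiles_y; ++y) {
        for (int x = 0; x < g.tiles_x; ) {
            if (!g.flags[y * g.tiles_x + x]) { ++x; continue; }
            int x0 = x;
            while (x < g.tiles_x && g.flags[y * g.tiles_x + x]) ++x;
            runs.push_back({y, x0, x - 1});
        }
    }
    return runs;
}
```

A 2x2 dirty block therefore comes out as two separate one-row strips instead of one square update, which is what grouping vertically would fix.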

I think I'll try to implement this with a quadtree; even if it ends up not being useful for the background, I think it will improve Simutrans performance.


EDIT: Trying an R-tree instead; we'll see if it performs better than the current implementation.
EDIT2: But that will be later, depending on how Bayern Munich - Barça ends tonight! Good luck to both teams (especially Barça, as I'm a fan ;) ), and may the best team win! ;)

Mod note: topic split from http://forum.simutrans.com/index.php?topic=10835.msg116296#msg116296 on April 27, 2013 at 08:52:15 PM (UTC) by Prissi
~~IgorEliezer

Yona-TYT


Quote from: Markohs on April 23, 2013, 04:05:26 PM
But that will be later, depending on how Bayern Munich - Barça ends tonight! Good luck to both teams (especially Barça, as I'm a fan ;) ), and may the best team win! ;)
Hehehe... yes, but Barça doesn't have it easy, since they're playing away :::)

Markohs

Hehe, we'll see; we have Messi. If he's in the mood it will be hard to beat us. But yes, today Munich is the favourite. Good luck! :)

Yona-TYT

Bayern Munich 4 - 0 Barça......... :o

Markohs

Well, no more offtopic. :) Just saying that even though Bayern outplayed Barça the whole match and it's a well deserved win, the referee helped the Germans quite a lot with 2 goals that should not have been goals (a clear offside and a foul). ;)

Markohs

#5
An R-tree was overkill for this, and it didn't guarantee non-overlapping dirty zones.

In the end I adapted the dirty tile management from The Vexi Platform, a library written in Java whose code is designed to be fast and have low overhead. It's released under the Apache License 2.0, so there should not be much of a problem. I mainly just translated it from Java and adapted it a bit to our environment.

Looks like it groups dirty tiles better: bigger chunks, not overlapping.

In this screenshot, dark green marks the old code's dirty tiles, and yellow the new code's. What do you think?




A patch is here, but mixed with the new background code; I can separate it further if you are interested. The main logic is in simdirty.cc/h.

I'll take some measurements TurfIt-style; I suspect this will save quite a lot of CPU/bandwidth.

EDIT: Inspecting this further, there are bugs in it, and it actually creates more dirty sections than the old code. It also needs a minimum dirty zone dimension.

prissi

Especially moving stuff was deliberately done with bigger moving boxes. But if the smaller ones do not produce any errors (mostly on oversized airplanes, birds, pedestrians jumping over crossings, zeppelins, smoke and the like), i.e. do not leave any traces in non-debug mode, it seems quite useful. I will try it too.

Markohs

It needs extra work; the code I got doesn't merge dirty boxes when they are aligned horizontally or vertically, and that part of the code never got done:


//            if (close(x, _x) && close(w, _w)) {
//                if (close(h, _y)) {
//                    Math.min(x, _x), Math.max(w, _w)
//                }
//                if (close(y, _h)) {
//                   
//                }
//            }


That's just the part that was supposed to merge zones... And it's not easy, because if you merge zones you need to make sure the union of both doesn't overlap other dirty zones; it's harder than it looks.
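To illustrate why this merging is tricky (the types and names here are made up for the example, not taken from the patch): merging two rectangles means replacing them with their bounding box, and for disjoint rectangles that is only lossless when the areas add up exactly:

```cpp
#include <algorithm>

struct rect_t { int x0, y0, x1, y1; };  // inclusive pixel bounds

long area(const rect_t &r) {
    return (long)(r.x1 - r.x0 + 1) * (r.y1 - r.y0 + 1);
}

rect_t bounding(const rect_t &a, const rect_t &b) {
    return { std::min(a.x0, b.x0), std::min(a.y0, b.y0),
             std::max(a.x1, b.x1), std::max(a.y1, b.y1) };
}

// For two DISJOINT rectangles (as the dirty list keeps them), the merge
// is exact iff the bounding box introduces no pixels that were clean.
bool merge_is_exact(const rect_t &a, const rect_t &b) {
    return area(bounding(a, b)) == area(a) + area(b);
}
```

Even an exact pairwise merge can still make the result overlap a third zone in the list, so a real implementation also has to re-check the merged box against the remaining zones -- the part described above as harder than it looks.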

I corrected the code a bit; the variable names w,h made no sense, and the dirty zone had to be ignored if .x0 was -1. You'll find this patch easier to read.

I also masked off the lower 4 bits of the coordinates to force a minimum zone size, as the old code did.
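The lower-4-bit truncation amounts to snapping zone edges to a 16-pixel grid, along these lines (the constants are illustrative, and where exactly the patch applies this may differ):

```cpp
// Snap an inclusive dirty range to a 16-pixel grid: round the start
// down and the end up to the edge of its 16-pixel cell.
void snap_to_grid16(int &x0, int &x1) {
    x0 &= ~15;  // clear the lower 4 bits: round down
    x1 |= 15;   // set the lower 4 bits: round the inclusive end up
}
```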

https://dl.dropboxusercontent.com/u/30024783/dirty_tiles-2.patch

TurfIt

#8
I just finished timing different tile sizes... I'll list the results here along with those from your patch. Patch v1 only, as -2 crashes on startup (v1 crashes on exit).

                 # of tiles/frame       ms/frame
tilesize      min#    avg#    max#    GDI    SDL
1               20    5624    6256   46.8   40.5
2               11    3158    3490   31.4   27.4
4                6    1595    1758   26.2   27.0
8                3     653     725   27.0   27.5
16               2     257     302   30.1   26.7
32               2     103     123   28.3   26.2
64               2      25      37   28.5   26.4
patch-2          1    1027    1175   25.4   20.4

There's way too much time spent creating those rectangles in the patch. Visual artifacts abound too... rectangles getting too close to the moving objects. I see artifacts with the non-patch dirty tile code when tilesize <= 4 as well. I don't think the moving objects are doing a good job of marking their actual pixels dirty...

Looking at the tiles generated with the non-patch routine, I don't see much room for merging them vertically. There's mostly a stair step pattern, very few vertically aligned tiles with the same left/right. Where I do see vertical dirty tiles is mostly on the station bar graph which is being needlessly marked dirty each frame. Fix that and there will be even less merging opportunity.

EDIT: updated patch numbers for v2.

Markohs

#9
I think I can fix it; give me some time. Thanks for your tests. You'll have to tell me someday what you use to generate those. ;)
I have a v3 already that should fix most of the crashes (switched from new/delete[] to malloc/realloc); it now triggers stack over-recursion sometimes, though.

EDIT: btw, does somebody know if there is an easy way to compile Simutrans with VS so that it keeps the Windows console open, so I can see the printf output?

TurfIt

See the corrected table above.
Now with the x==-1 continue, and 16bit masking.
Also using w-x, h-y instead of w,h since w,h aren't width,height.  Nice trap there!  8)

Fix the new/delete crashes, and artifacts, and this might actually work.  :)

prissi

The vehicles do the dirty masking only after steps, and not mid-tile. Thus very close boxes will lead to artefacts. The best way to see this is a Zeppelin circling over an occupied airport.

Looking at the table, tilesize 8 (which we have in the current code already) seems the best compromise for GDI and SDL.

And looking at the patch: why not use vector_tpl for the dirty_list?

Markohs

#12
About the vehicle dirty zone masking, I'll leave that to the rest of the coders if you don't mind; I'm not interested in that part of the code atm. :)

I'm making some advances to the merging, with little to no extra CPU usage.

About vector_tpl, yes, I can use that; I just made a direct and fast conversion from Java to try the concept. I'll use vector_tpl of course if I keep the vector. But I'm experimenting with std::multimap; I think that will give an even bigger CPU boost. If it's worth it I'll use the multimap (I think there is no tpl data structure for that); we can always ship an implementation of it for Haiku (gcc 2.95.1), or revert to the vector code using #ifdef.

There are a lot of things to fix still, code formatting, function names etc... I'll try to have it ready soon for you to test and give your opinion.

EDIT: btw, TurfIt, did you compile with thread support? The code is not thread safe yet.

TurfIt

tilesize 16 is the current one in trunk. I repeated the tests for 8, 16, and 32, as the table above was only a single run. It turns out the high GDI 16 time was an anomaly. Times aren't directly comparable to the above, as the retest was with the r6469 fix and the attached patch (and it's a couple hours later, which always screws with my timing!).


                 # of tiles/frame       ms/frame
tilesize      min#    avg#    max#    GDI    SDL
8                3     424     484   28.8   27.1
16               2     183     224   27.6   26.8
32               2      75      98   27.5   26.6

The attached patch stops the station ware bars from being always dirty - it needs to cache the old value to avoid artifacts.
It also merges dirty tiles horizontally a bit more, with an ugly hack to skip single non-dirty tiles surrounded by dirty ones. It doesn't seem to affect things measurably, but I'll throw it out for discussion. Moot if Markohs' algorithm is used.
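The caching idea mentioned here boils down to redrawing the bar only when its displayed value changes; a minimal sketch (the struct, member and function names are hypothetical, not from the patch):

```cpp
// Only mark the station ware bar dirty when the value it shows changed
// since the last frame; otherwise it would be needlessly dirty every frame.
struct ware_bar_t {
    int cached_level = -1;  // last drawn value; -1 forces the first draw

    bool needs_redraw(int current_level) {
        if (current_level == cached_level) {
            return false;  // unchanged: skip drawing and dirty marking
        }
        cached_level = current_level;
        return true;
    }
};
```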

I had partial threading compiled in. It's disabled in simsys_* due to interfering with my timing; simview threading was enabled. The existing dirty tile code isn't thread safe either :o (safe enough to not crash - just possible artifacts).

Ters

Quote from: Markohs on April 24, 2013, 07:02:48 PM
EDIT: btw, does somebody know if there is an easy way to compile Simutrans with VS so that it keeps the Windows console open, so I can see the printf output?

There should be a setting to compile as a console application in there somewhere, but I only have Visual Studio at work, and only for very rare C++ or C# work.

Dwachs

Quote from: Markohs on April 24, 2013, 07:02:48 PM
EDIT: btw, does somebody know if there is an easy way to compile Simutrans with VS so that it keeps the Windows console open, so I can see the printf output?
There is this switch:

WIN32_CONSOLE = 1 # adds a console for windows debugging

Don't know whether it works ...
Parsley, sage, rosemary, and maggikraut.

prissi

That unfortunately does not work on MSVC when using the internal debugger, if that is the question. You have to build a console project, i.e. change the settings in the project. But I'm not sure about newer MSVC versions.

Markohs

Yeah, it doesn't work.

I managed to get it working by adding:

   AllocConsole();                        // create a console window
   freopen("conin$",  "r", stdin);        // rewire the standard streams to it
   freopen("conout$", "w", stdout);
   freopen("conout$", "w", stderr);
   printf("Debugging Window:\n");

in WinMain().


TurfIt

WIN32_CONSOLE works just fine with MinGW.

Markohs

Yeah, but not on VS. I'll add that to simsys if you don't mind, so that it detects the macro and Visual Studio.

Markohs

My new version: just by making the algorithm consider tiles as intersecting when they differ by only 1 pixel (this makes the algorithm process slightly more rectangles) and adding a few extra checks, I managed to implement zone merging, which was especially needed for seas, since they are animated and cause the rectangle count to skyrocket. Good news also: vertical merging is easy to implement, and I will do so soon.

Rectangle count is now always lower than the old version.

I *think* the CPU cost is more than acceptable. The patch will be here soon; a screenshot for now (red means rectangles merged left to right, blue right to left, and yellow unmerged rectangles).


prissi

#21
Sorry, why does the animated sea make the rectangle count skyrocket? With the dirty tiles in simgraph this will result in one big dirty tile for the whole band, i.e. almost a single operation.

Markohs

Yes, I know, but my previous code didn't manage to compact the tiles as much before. This new one does.

Markohs

A preview of the state of the patch. Things to note:

1) Due to debugging, there is extra code to store a color in each dirty zone to help me see how it was created; this of course makes the code less efficient and needs to be disabled in the final patch, or protected with #ifdef DEBUG_FLUSH_BUFFER. dirty() takes one extra debugging argument too.
2) Compressing the dirty rectangles requires a lot of extra logic. To let the compiler optimize all the expressions, I used the preprocessor so the compiler can re-arrange the evaluations, use lazy evaluation, etc. I tried to keep the code readable even under these conditions. This might mean the code runs much slower in debug compilations, but it should be good enough in release (or /O3).
3) This is a recursive algorithm. To be able to compress tiles I had to remove the recursion bound ("fita" in Catalan; I don't know the name in English) that guaranteed the algorithm didn't end in infinite recursion (in the original algorithm it was the "ind" parameter, which was always increased on each recursive call). I think the algorithm is correct and will not fail, but I'm not 100% sure.
4) This is still *NOT* thread safe.
5) There are a lot of tile invalidations (x0=-1); it would be nice to test whether switching to vector_tpl and compacting the vector on each position invalidation speeds up the program. I'm not sure, since the vector is reset each frame anyway.
6) I made the patch against the latest trunk; the background code is removed from it now.
7) I plan to pass the bit-mask of the minimum tile dimensions to the object creator; the current masking in the call in mark_rect_dirty_nc will be removed.

A screenshot of the result.


TurfIt

I'll see about getting this profiled... But looking back on my last test of patch-2, I missed the fact that a lot of the work has been moved out of the display section of sync_step and into the sync section. While there was an overall reduction in the total time, the sync was slowed by over 350%. This absolutely kills the achievable fast forward rate - using the normal Yoshi profile game, I can normally hit ~220x FF; with patch-2, 58x. IMHO, all the rectangle merging must be moved out of sync, i.e. it cannot be done in mark_rect_dirty_nc(). How about retaining the existing dirty tile arrays, and processing them once in display_flush_buffer() to create the merge?

Also, this patch isn't using the double dirty buffer like the existing code - tile_dirty and tile_dirty_old. With the current moving object dirty marking, _old is required to reduce artifacts. Even so, I can induce artifacts at certain zooms, at certain FPS settings, watching certain vehicles. But without the _old, artifacts are everywhere! After a quick look, I think it would be a boatload of work to get the moving objects marking 'correctly'; until then, _old is required IMHO.
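The role of tile_dirty_old can be shown with a one-word sketch (32 tiles per uint32, mirroring the packed buffers discussed in this thread; the function name is made up): a sprite dirties its new position this frame, but its stale image from the previous frame must also be repainted, so the redraw set is the union of both buffers.

```cpp
#include <cstdint>

// Tiles to repaint this frame: dirty now, or dirty in the previous
// frame (where a moving object may have left its stale image).
uint32_t tiles_to_redraw(uint32_t tile_dirty, uint32_t tile_dirty_old) {
    return tile_dirty | tile_dirty_old;
}
```

For example, if a sprite sat on tile 3 last frame and moved to tile 4 this frame, both tiles end up in the redraw set, which is what erases the leftover image at the old position.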

Finally, I actually whipped up a quick naive merging algorithm in display_flush_buffer(). Two actually - one that tends to create more horizontal rectangles, and one that spits out verticals. Less than 10 extra lines of code ;) But I've yet to re-enable the _old, so it will need a few more lines for that. Once done, I'll compare timings and rectangle counts for my patch, your patch, and trunk.

TurfIt

New results below. Tile counts are not comparable to previous ones due to the trunk changes I made eliminating several unnecessary screen updates, which also has the effect of removing many of the easily merged tiles... Timing is not comparable as I ran this on my old Core2 instead, and it's the total sync_step time, not just display.

              # of tiles/frame    ms/sync_step
             min#    avg#   max#      SDL
trunk           2     176    255     43.46
v merge         1     166    234     43.46
h merge         1     161    231     43.31
-6              1     207    310     55.78
-6 mod          2     189    270     43.59
Patch -6 is increasing the number of dr_texture calls!
For -6 mod, I re-enabled the trunk dirty tiles and fed the merged horizontal stripes to tile_dirty_list.dirty() in display_flush_buffer(). This solves the fast forward problem and gives much better results, but still not good. I think this approach can be abandoned. I think my simple H merge is working well enough - not that there's much timing difference on my systems, but it might help Macs out some more...

Perhaps a mod can split all this dirty tile talk into a separate thread? Prissi's last post is not clear on whether the problem is with patch -6, or the world limits patch, which is what this thread is supposed to be about!

Dwachs

#26
I pretty much consistently get infinite recursion by scrolling around the Yoshi game. Here is a stack excerpt:

#1  0x000000000062a05a in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:121
#2  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#3  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122
#4  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#5  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122
#6  0x000000000062a5d7 in dirtylist_t::dirty (this=0x9fa8e0, x0=656, y0=160, x1=927, y1=175, ind=208, col=171) at simdirty.cc:217
#7  0x000000000062a092 in dirtylist_t::dirty (this=0x9fa8e0, x0=624, y0=160, x1=927, y1=175, ind=0, col=128) at simdirty.cc:122

see the repeating arguments ...

Edit: This can be fixed by always doing the recursive calls with the index argument set to i+1 (these are the calls for merged regions). Imho there is no need to call with index=0, as the list maintains disjoint rectangles.

Btw, are the dirty regions defined as x0 <= x < x1 or x0 <= x <= x1 ?

Markohs

#27
Quote from: Dwachs on April 27, 2013, 06:45:59 PM
I pretty much get consistently infinite recursion by scrolling around the yoshi game. Here is a stack excerpt:

[stack excerpt snipped]

Edit: This can be fixed by always doing the recursive calls with the index argument set to i+1 (these are the calls for merged regions). Imho there is no need to call with index=0, as the list maintains disjoint rectangles.

Btw, are the dirty regions defined as x0 <= x < x1 or x0 <= x <= x1 ?

Sorry guys, I was busy this weekend. :)

Yes, the infinite recursion can always be fixed with i+1 in the call. It only has the side effect that the newly created object might have ended up merged with a rectangle created before. But we can get rid of the infinite recursion problem that way.

Dirty regions include all values from x0 to x1, both inclusive.

That's why on split I use x0 to _x1-1 and _x1 to x1.
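With inclusive bounds, both the overlap test and the split need care about that -1; a small sketch of the 1-D case (the helper names are illustrative):

```cpp
// Inclusive ranges [a0, a1] and [b0, b1] overlap iff each starts no
// later than the other ends; sharing a single pixel already counts.
bool overlaps(int a0, int a1, int b0, int b1) {
    return a0 <= b1 && b0 <= a1;
}

// Splitting the inclusive range [x0, x1] at cut gives [x0, cut-1] and
// [cut, x1]: every pixel covered exactly once, hence the "-1" above.
void split_at(int x0, int x1, int cut, int out[4]) {
    out[0] = x0;  out[1] = cut - 1;  // left part
    out[2] = cut; out[3] = x1;       // right part
}
```

With half-open bounds (x0 <= x < x1) the split would simply be [x0, cut) and [cut, x1), with no adjustment needed.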

Anyway, I think my code might have gone too far there, because it won't ever be as fast as TurfIt's code, I think.

I guess TurfIt's optimizations are more than enough, as he says. They will always be better than my patch, because I was running an algorithm that's not constant-time like the old code is. It was a bit crazy, but it deserved a try, I think. :) My algorithm was more or less O(x) (where x is the number of dirty zones, with calls i+1), and TurfIt's is O(1); impossible to beat that.

The problem of finding the smallest set of dirty rectangles belongs to complexity class EXP, if I'm not wrong. We can only approach it with a heuristic, and my approach isn't fast enough.

It would only have been clearly better if we fixed the merging a bit and moved all the calculations to another thread.

But is it completely inefficient to try to vertically merge rows after an h-row is created in the current code, TurfIt? I mean, because maybe when sea is displayed there is a lot of v-merging that can be done after the h-merge.

EDIT: changed most of the message, sorry. ;)

TurfIt

Simple merging patch attached. It reduces the rectangle count by ~10%. It makes no attempt at finding the minimum number; it just finds the width of the dirty block, then tries to extend it down.

But the simplicity went away when I changed to scanning whole uint32s rather than bits. That gives me ~1% faster on a Core2 to ~10% on an i7. If this is too cryptic, removing the ability to span the dirty block width across words cleans things up a lot, but increases the rectangle count substantially. Or I could just revert to bits...

Also, I'm using a GCC builtin, with a slower lookup table for non-GCC. There's apparently an MSVC intrinsic that'll do the job too (_BitScanForward), but I can't test what I don't have...
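A portable wrapper in the spirit described, with GCC's __builtin_ctz where available, MSVC's _BitScanForward otherwise, and a plain loop as the generic fallback, could look like this (the wrapper's name is mine; both intrinsics are real):

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the lowest set bit of a non-zero 32-bit word.
static inline int bit_scan_forward(uint32_t v) {
#if defined(__GNUC__)
    return __builtin_ctz(v);         // GCC/Clang builtin
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward(&idx, v);        // MSVC intrinsic
    return (int)idx;
#else
    int i = 0;                       // generic fallback: shift until bit 0 set
    while (!(v & 1u)) { v >>= 1; ++i; }
    return i;
#endif
}
```

A lookup-table fallback, as the patch uses, would be faster than the loop; the loop just keeps this sketch short.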


Markohs

Why don't you store the dirty bits in an array of bool? Couldn't that be way faster? It wouldn't take that much memory, and speed is what matters here.

TurfIt

The bit manipulations rely strongly on a 32-bit word length, hence the array of uint32. Unless a bool has a guaranteed length on all platforms...
The only memory waste is from alignment, by forcing the first dirty tile on each row to be the first bit in a word. The pre-patch bit-wise algorithm packed them tighter, but my head would explode trying to handle that word-wise. Also, the memory isn't much; it depends on screen size, but 500-3000 bytes would be expected. Easily within L1 cache.

Markohs

I mean one entire bool for each dirty tile position, not packing multiple tiles into a uint32. You waste quite a lot of memory, but you don't need to do any boolean arithmetic; you just use dirty[y*row_len + x].

It's just a suggestion, but getting rid of the ands, ors and bitwise logic maybe saves CPU. After all, I doubt this memory structure would be larger than one or two KB, no? I didn't do the calculations. :)

I guess bool's size is standardized to 32 bits; I'm not sure. But just declaring it as bool and referring to the array as bool positions, maybe the compiler has room to do magic and optimize the algorithm.

But the extra memory I/O and possible cache invalidations could maybe make it slower. I don't know; I'd try it and compare.

TurfIt

An entire int (what a bool ends up as) for each tile! Yikes. Remember, memory access is expensive, CPU operations are cheap. Dirt cheap, in fact. The structure is already 3kB (full-screen retina display), let alone 1-2kb (note bit b vs byte B); a bool per tile would be 80kB... Feel free to try if you like, but I'm sure that would be glacial. Bit ands, ors, shifts, etc. - it doesn't get any more basic an operation for a CPU than those.

kierongreen

While 80kB might not sound like much, the difference between 1-2kB and 80kB is fitting into many level 1 or 2 caches (or not), which would have a considerable performance impact.

Markohs

#34
Well, you are both probably right I guess. :)

Using a bool for each tile also makes the code automatically reentrant across multiple threads as a side effect.

It also might benefit more easily from compiler optimizations (loop unrolling).

But you are right, it might dirty the CPU cache. And it would take more CPU to clear the buffer.

EDIT: Looks like in C++ a bool is always 8 bits. Might not be such a big difference then.
EDIT2: I tested its size with sizeof:

g++ (GCC) 3.4.3 on an UltraSPARC machine outputs 1 byte.
Visual Studio 2012: 1 byte too.

I guess gcc on Intel will be one byte too. So I'd say it's just 1 byte on all platforms.

Anyway, the buffer size will be multiplied by 8, so it might end up being too big. And bitwise CPU operations are cheap, so you are both probably right.

Ters

Quote from: Markohs on May 06, 2013, 02:00:41 PM
Using a bool for each tile also makes the code automatically reentrant across multiple threads as a side effect.

I'm not so sure about this, especially if multiple threads are interested in the same bool. That reading and writing a bool are each atomic won't help much. If multiple threads are not interested in the same bool, one could just align each thread's area of responsibility with the word size.

Markohs

#36
Well, there is no "clear dirty" operation, just a "set dirty" operation. I think that's enough to guarantee it's reentrant in this particular case.

If two threads set a tile dirty at the same time, the order or any other circumstance doesn't really matter; they will both just write "true" to the same position. This doesn't hold for the current code, where you need to read the packed tile byte, perform the bitwise operations, and write it back to the buffer - and that is far from atomic.
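The distinction can be made concrete (with illustrative globals, not the patch code): the bool version is a plain store, while the packed version compiles to a read-modify-write of the whole word, so a concurrent store by another thread between the read and the write can be overwritten.

```cpp
#include <cstdint>

uint8_t  bool_tiles[64];   // one flag per tile: marking is a single store
uint32_t packed_tiles[2];  // 32 tiles per word: marking is read-modify-write

void mark_bool(int i) {
    bool_tiles[i] = 1;                       // plain store, no read involved
}

void mark_packed(int i) {
    packed_tiles[i >> 5] |= 1u << (i & 31);  // load word, OR in bit, store word:
}                                            // a racing writer's bit can be lost
```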

EDIT: I just implemented these bool tiles and, well, I haven't measured in detail yet, but I couldn't see many problems.

On my 1280 x 1024 screen, the old code used 2x618-byte buffers, the bool implementation 2x4941-byte ones - less than 8 times the size.

Markohs

#37
Well, I couldn't find performance differences, really... I'll just post the patch here in case someone is interested, but it's nothing too interesting. To benchmark it fully I'd need higher-granularity time counters, and maybe it's not really worth it; if there is any difference, it's minimal.

But this code should be reentrant-safe, and the older one is not.

https://dl.dropboxusercontent.com/u/30024783/dirty_bool.patch

EDIT: just for the sake of correcting myself, in one of my previous posts I mentioned that minimizing the number of dirty rectangles by merging was an EXP problem. It's not; it's an NP-complete problem if I'm not wrong, a variation of the knapsack problem. Not really important, but I couldn't stand thinking someone would read that post and think "oh man, that guy has no idea". Even if they're probably right. :) Nothing too important. ;)

TurfIt

sizeof(bool) is defined as implementation-specific. It seems current x86 compilers use 1 byte; older ones used 4.

Using a byte (bool) is no more thread safe than words (uint32). CPUs don't work with bytes, but with words. To modify a byte means reading a word, masking the high bits to 0, shifting right until the desired byte is in the low position, modifying this byte (really still the 32-bit word register), shifting the byte back to the correct word position, combining it back with the 3 other bytes in the word, and finally writing the word back to cache/memory. Of course this is all done in hardware at the microcode level, but it's still being done.

Even using words doesn't guarantee atomicity unless using specific hardware instructions on multicore systems. If you want thread safety here, protect the shared global memory with a mutex. But, IMHO it's not too important. Worst case is a tile doesn't get marked dirty that should have, and it's a very very small window where this could happen.

Trunk code should be giving 640 byte buffers for 1280x1024? My uint32 version should be 768 due to forced alignment.

For timing on Windows, I've found QueryPerformanceCounter to be the only thing with sub-millisecond results. For me, it's finding a 300kHz clock somewhere. Of course YMMV.

Also, you need to consider the effect of the code on other routines. Accessing a larger memory footprint might be evicting other stuff from the caches that would've otherwise remained, and hence cause a slowdown elsewhere... The timing I've reported is the effect on the entire sync_step.

Assuming you're in MSVC land, can you look at adding its bitscan intrinsic to my patch?

Markohs

Quote from: TurfIt on May 06, 2013, 04:43:58 PM
Using a byte (bool) is no more thread safe than words (uint32). CPUs don't work with bytes, but with words. To modify a byte means reading a word, masking the high bits to 0, shifting right until the desired byte is in the low position, modifying this byte (really still the 32-bit word register), shifting the byte back to the correct word position, combining it back with the 3 other bytes in the word, and finally writing the word back to cache/memory. Of course this is all done in hardware at the microcode level, but it's still being done.

After some reading, I'd say you are not right, since from the 486 processor onwards, Intel guarantees atomicity starting from 8 bits.

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Section 8.1.1


The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

- Reading or writing a byte
- Reading or writing a word aligned on a 16-bit boundary
- Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

- Reading or writing a quadword aligned on a 64-bit boundary
- 16-bit accesses to uncached memory locations that fit within a 32-bit data bus



This just applies to x86 processors, anyway. So you are right. Anyway, I'm unsure whether the C compiler will grant you this atomicity elsewhere; my wild guess is that when the hardware doesn't support it, it forces the bool type to the required boundary (32 bits, I guess).

Quote from: TurfIt on May 06, 2013, 04:43:58 PM
Even using words doesn't guarantee atomicity unless using specific hardware instructions on multicore systems. If you want thread safety here, protect the shared global memory with a mutex. But, IMHO it's not too important. Worst case is a tile doesn't get marked dirty that should have, and it's a very very small window where this could happen.

Well, of course you don't get atomicity without mutexes, but as long as you are using an 8-bit register (I know it's just the lower part of a 32-bit register), Intel allows you to access memory at byte level, and it guarantees atomicity reading/writing them.

The thing is, TurfIt, the reentrant-safety I was mentioning comes from the fact that the code doesn't *READ* any position; it just writes. Your current implementation reads a 32-bit value, modifies it, and writes it back. My implementation doesn't. So I think mutexes are not needed, since it's inherently reentrant. But I'm not 100% sure of it. :)

I know the chance of failure is low, but believe me, it exists, and it happens often if you use dirty marking from multiple threads. In some tests I did in the past, it took only a few seconds for something to go wrong.

Quote
Trunk code should be giving 640 byte buffers for 1280x1024? My uint32 version should be 768 due to forced alignment.

For timing on Windows, I've found queryperformancecounter to be the only thing with sub millisecond results. For me, it's finding a 300kHz clock somewhere. Of course YMMV.

Yup, saw it already, thanks. :) Anyway, I've decided to step aside and let you implement this your way; I just wanted to experiment a bit. I'm sorry I bothered you with this. :)

Quote
Also, you need to consider the effect of the code on other routines. Accessing a larger memory footprint might be evicting other stuff from the caches that would've otherwise remained, and hence cause a slowdown elsewhere... The timing I've reported is the effect on the entire sync_step.

Yes, you are right; I didn't measure the impact of that.

Quote
Assuming you're in MSVC land, can you look at adding its bitscan intrinsic to my patch?

Sure, I will. ;)

Just tell me when your work is complete and I'll send you the changes and do the tests you want.

Ters

x86 is a rather odd architecture. Other architectures are more likely to only operate on words.

Markohs

On those architectures, I'm pretty sure a bool will be word-sized.

Ters


Markohs

#43
Sure, we've all fallen into that trap.

But I think this code proves that what I said is correct, and I think it won't fail on any platform. If someone knows why this shouldn't work on all platforms, just tell me. ;)

I'm *pretty sure* the C++ standard guarantees that access to a variable is independent of accesses to other variables no matter what the alignment, and if the hardware can't guarantee that, it will use whatever alignment is necessary for this to work.

It would be impossible to write programs if it weren't that way, no? :)


#include <stdio.h>
#include <pthread.h>

volatile bool booleans[32];
int id[32];

pthread_t threads[32];

pthread_attr_t attr;


void *thread(void *parameter){
        const int id = *(int*)parameter;
        printf("Thread %i alive\n",id);

        int i = 0;
        bool last_state = false;
        booleans[id]=last_state;

        while(true){

                if(booleans[id]!=last_state){
                        printf("******************** UNEXPECTED STATE ********************\n");
                }
                else {
                        printf(".");
                }

                booleans[id]= (i%2);
                last_state = (i%2);

                i++;
        }

        return NULL;
}

int main(){

        printf("sizeof(bool): %d , sizeof(booleans): %d\n", (int)sizeof(bool), (int)sizeof(booleans));

        printf("any key to create threads\n");
        getchar();

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

        for(int i = 0 ; i < 32 ; i++){
                id[i] = i;
                pthread_create(&threads[i], &attr, thread, &id[i]);
        }

        printf("threads created\n");
        getchar();


        return 0;
}


EDIT: Tested on a sparc64 with 2 CPUs and on a SPARC-Enterprise-T5120 with a 32-core chip (adapted the program to run 32 threads, of course). Neither failed. Both CPUs are able to store 8-bit values to memory, even though they are 64-bit architectures. bool was 1 byte long on them, and sizeof the array was num_threads (32).

TurfIt

Quote from: Markohs on May 06, 2013, 05:20:57 PM
I know the chance of failing is low, but believe me it exists and happens often if you use dirty marking on multiple threads. On some tests I made on the past, it just takes some seconds for something to get wrong.
The lower the theoretical failure chance, the more likely a failure is observed. At least that's my experience  ;)
I've no doubt this is failing, I've just never been able to see it, and really have no idea how to objectively test. But, if artifacts aren't all over, there's no reason IMHO to change to a thread-safe version.


Quote from: Markohs on May 06, 2013, 05:20:57 PM
Yup, saw it already, thanks. :) Anyway I decided to step aside and let you implement this your way, I just wanted to experiment a bit, I'm sorry I bothered you with this. :)
Experimenting a bit is all I was doing too. I really didn't expect any measurable difference implementing this, just did it because it was there... Definitely no bother.


Quote from: Markohs on May 06, 2013, 05:20:57 PM
Yes, you are right, I didn't measure the impact of that.
I measured your bool patch at 0.2% faster overall sync_step, but at a 3% hit to the sync section (affects fast-forward rate).
But my patch was 0.5% overall, 6.5% sync. I missed that in all the jitter in my previous tests. My change to the dirty marking is just too heavy; reverted that to bit-wise operations and now at 0.8%, 0.2%. Much better.

Still not sure if it's worthy of committing. The performance difference is truly negligible as expected, and I'll be the first to admit it's cryptic...


Quote from: Markohs on May 06, 2013, 05:20:57 PM
Just tell me when your work is complete and I'll send you the changes and do the tests you want.
I've done all I'm going to in that section. All that's needed is to extend the #if #else to include a MSVC branch with the _BitScanForward intrinsic, and figure out if it needs a -1 or not... (just turn on the DebugFlushBuffer, if needed it'll be obvious!)
I'd played with reducing the tile size for non moving objects, but that's really a separate patch, and has given me enough of a headache to shelve it for a while.

kierongreen

If you were achieving a 50% speed up then the odd graphical glitch might be acceptable. For a few % at best it seems not worthwhile to me.

prissi

I would second that: if there is a clean solution just a little slower, then go for it. Chances are high that a compiler can optimize the easier solution much more effectively than the complex one (recursion especially seems like a tough optimizer challenge).

TurfIt

I sense this meandering discussion has caused confusion...
Once upon a time, there was a thought to try merging dirty tiles into larger rectangles than the current trunk horizontal stripes, and thereby reduce the number of calls to the graphics library (SDL_UpdateRect and StretchDIBits). Apparently, SDL on Mac is averse to many such calls. Markohs came up with a recursive algorithm which replaced the entire current dirty tile structures, but it turned out way too slow. I suggested a very simple merging using the existing structure (never showed the patch for this), but then decided to replace the while loops that search for set bits with the bit scan CPU instructions. I also attempted to replace the bit-by-bit dirty tile marking with a routine that would set multiple bits at once (this is the patch above).

At some point, the potential glitches from multithreading with simple_drawing mode enabled came up. Markohs suggested changing to one bool per dirty tile instead of using the 8 packed bits. I found this causes a 3% slowdown in sync_step, where moving objects mark their positions dirty. If you look at mark_tiles_dirty() in isolation, using bool is 15% slower than the packed 8-bit access in trunk. The other timing point is display_flush_buffer(), where bool and packed 8-bit are equivalent. 3% isn't bad, and simple_drawing causes graphic glitches anyway, so what's potentially a couple more? My patch doesn't change the marking dirty (the posted patch does; I reverted that section since it was 50% slower!), so it has the same chance as trunk of causing glitches.
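For anyone following along, the two marking schemes being compared can be sketched roughly like this (illustrative names only, not the trunk code; the packed variant's read-modify-write on a shared word is exactly where the multithreading race lives):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the two dirty-marking schemes discussed above,
 * not the actual trunk code. Packed: one bit per tile, 32 tiles per
 * uint32 word; bool: one byte per tile. */
enum { TILES = 256 };
static uint32_t tile_dirty_bits[TILES / 32];
static bool     tile_dirty_bool[TILES];

static void mark_dirty_packed(int tile)
{
    /* read-modify-write on a shared word: two threads marking tiles in
     * the same 32-tile word can lose each other's update */
    tile_dirty_bits[tile >> 5] |= 1u << (tile & 31);
}

static void mark_dirty_bool(int tile)
{
    /* a plain byte store: no neighbouring tile's flag is touched */
    tile_dirty_bool[tile] = true;
}
```

The bool store is race-free per tile but touches 8x the memory, which matches the sync_step slowdown measured above.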

For the rectangle merging, trunk takes 23us/frame in display_flush_buffer(). Simple merging increases that to 41us/frame, but the 10% fewer rectangles passed to SDL are giving me a 4% overall decrease in screen update time. Using my bit scan patch drops display_flush_buffer() back to 29us/frame and includes the simple merging, and yields 4.5% better overall update time.
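The simple merging described above was never posted, but it presumably amounts to something like this sketch: grow an existing rectangle downward when a new strip has the same x-range and starts exactly where that rectangle ends (rect_t and merge_strip are made-up names, not the actual patch):

```c
/* Hypothetical sketch of the simple merging idea: a new horizontal
 * strip is folded into an existing rectangle when it has the same
 * x-range and starts exactly where that rectangle ends, so fewer
 * rectangles reach SDL_UpdateRect / StretchDIBits. */
typedef struct { int x, y, w, h; } rect_t;

static int merge_strip(rect_t *rects, int count, rect_t strip)
{
    for (int i = 0; i < count; i++) {
        if (rects[i].x == strip.x && rects[i].w == strip.w &&
            rects[i].y + rects[i].h == strip.y) {
            rects[i].h += strip.h;   /* grow the rectangle downwards */
            return count;            /* no new rectangle needed */
        }
    }
    rects[count] = strip;            /* nothing to merge with: append */
    return count + 1;
}
```

Feeding it the strips top-to-bottom (as display_flush_buffer() scans them) is what makes the single y+h check sufficient.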

Is it all worth it? I can't really say; none of my systems are overly sensitive to rectangle count. I can only hope there's an improvement on those that are...
Attached are the bit scan and 'C' versions. Is either clean?



Markohs

A 4.5% better update time makes it worth including in trunk IMHO. Everything adds up, and if we have some users with speed problems, this can help them.

The new code is only slightly more complex than the old one. I'd include it.

I'd just add some doxygen to the functions to document the implementation better; that's lacking in the code, and now is a perfect moment to document. ;) (I know, it sucks, but it has to be done.)

Markohs

#49
TurfIt, dunno how to post a patch against your patch, so I'll just post the changes to simgraph16.cc here:


#ifdef _MSC_VER
#   include <io.h>
#   define W_OK 2
#include <intrin.h>
#else
#   include <unistd.h>
#endif
...
               else { // dirty block ends in word_x2
#if GCC_ATLEAST(3,2)
                  x2 = __builtin_ffs( ~tile_dirty_old[word_x2] ) - 1; // find edge of dirty block in word_x2
#elif defined(_MSC_VER)
                  if( ! _BitScanForward((unsigned long*) &x2, ~tile_dirty_old[word_x2]) ) {
                     x2 = -1;
                  }
#else
                  const uint32 tv = ~tile_dirty_old[word_x2];
                  x2 = Mod37BitPosition[(-tv & tv) % 37];
#endif
                  masks[word_x2-word_x1] = 0xFFFFFFFF >> (32 - x2);
               }
            }
            else { // dirty block is all within one word - word_x1
#if GCC_ATLEAST(3,2)
               x2 = __builtin_ffs( testval ) - 1;
#elif defined(_MSC_VER)
               if( ! _BitScanForward((unsigned long*) &x2 , testval) ) {
                  x2 = -1;
               }
#else
               x2 = Mod37BitPosition[(-testval & testval) % 37];
#endif
               masks[0] = (0xFFFFFFFF << (32 - x2 + (x1 & 31))) >> (32 - x2);
            }


EDIT: Win32 also has a 64-bit version of this intrinsic, analogous to GCC's int __builtin_ffsll (unsigned long long); do you think we can use those too? BTW, as you saw, it didn't need the -1; that's because the function returns 1 on success and 0 if the argument is zero.
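The return-value difference between the two intrinsics is easy to miss, so here's a tiny GCC-only sketch of it (lowest_set_gcc is a made-up helper; under MSVC one would call _BitScanForward instead, and this won't compile there):

```c
/* __builtin_ffs returns the 1-based position of the lowest set bit,
 * or 0 when no bit is set, hence the -1 in the GCC branch of the
 * patch. _BitScanForward instead writes a 0-based index to its output
 * parameter and returns nonzero only for a nonzero input, so the MSVC
 * branch needs no -1 on the index itself. */
static int lowest_set_gcc(unsigned v)
{
    return __builtin_ffs(v) - 1;   /* -1 now means "no bit set" */
}
```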

Ters

The 64-bit bsf instruction, which I understand these things end up as, is only available on amd64 aka x64. Win32 is normally the name of the basic Windows API from Windows 95 or so onwards.

prissi

Generally we do not do any optimisation except for GCC. Other platforms (like Amiga with PowerPC) will surely need plain C code as well. I would also refrain from optimising the 64-bit code.

By the way: the German c't 7/2013 just had a nice article on bit scans. There are subroutines that are faster than the intrinsics on almost any processor ...


const uint8 MultiplyDeBruijnBitPosition[32] =
{
  // precomputed lookup table
  0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8, 31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

uint8 lowestBitSet(uint32 x)
{
  // leave only lowest bit
  x  &= -((sint32)x);
  // DeBruijn constant
  x  *= 0x077CB531;
  // get upper 5 bits
  x >>= 27;
  // convert to actual position
  return MultiplyDeBruijnBitPosition[x];
}
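In case anyone wants to sanity-check it, here is a self-contained restatement of the routine with stdint types in place of Simutrans' uint8/uint32 (names changed, and the negation done in unsigned arithmetic to sidestep signed overflow on 0x80000000), together with a naive shift loop to compare against:

```c
#include <stdint.h>

/* Standard multiply-and-lookup De Bruijn table, as in the post above. */
static const uint8_t debruijn_pos[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

static uint8_t lowest_bit_debruijn(uint32_t x)
{
    x &= (0u - x);                  /* isolate lowest set bit (unsigned negate) */
    return debruijn_pos[(x * 0x077CB531u) >> 27];
}

static uint8_t lowest_bit_naive(uint32_t x)  /* x must be nonzero */
{
    uint8_t i = 0;
    while (!(x & 1u)) { x >>= 1; i++; }
    return i;
}
```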


EDIT: And I think a 5% gain is a lot considering that this is not much obfuscated code to add (and also likely quite portable).

BTW: Allegro does not require dirty tiles at all.

Markohs

Well prissi, if the optimization for Visual C++ is as easy as the one I posted, I don't see any reason not to include it. TurfIt already implemented a generic C version for non-gcc compilers; it's in the patch and uses a precalculated modulo-37 array.

@Ters: ok, I was not sure of that, thx. :)
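For reference, that modulo-37 fallback looks roughly like this (the standard table from the Bit Twiddling Hacks collection; function name here is illustrative). v & -v isolates the lowest set bit, and each power of two has a unique residue mod 37, so a 37-entry table recovers its position; entry 0 maps a zero input to 32, which is how this version can tell the no-bit-set case apart, as TurfIt noted earlier:

```c
#include <stdint.h>

/* Standard Mod37 bit-position table; index 0 handles a zero input. */
static const uint8_t Mod37BitPosition[37] = {
    32,  0,  1, 26,  2, 23, 27,  0,  3, 16, 24, 30, 28, 11,  0, 13,  4,
     7, 17,  0, 25, 22, 31, 15, 29, 10, 12,  6,  0, 21, 14,  9,  5, 20,
     8, 19, 18
};

static uint8_t lowest_bit_mod37(uint32_t v)
{
    /* unsigned negation is well-defined, so this is safe for all inputs */
    return Mod37BitPosition[(v & (0u - v)) % 37];
}
```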

kierongreen

If 5% is simple, and is consistent across many platforms and systems I'd agree it'd be a good change.

prissi

The intrinsic is way slower than the routine I posted, about 10 times slower on most CPUs (at least according to c't's benchmarking, which rarely fails).

Markohs

Well, if your version is faster, of course we should use it. Is it faster than gcc's too?

Markohs

#56
I guess c't is that German site I've found on the internet. I'm very curious to know why the intrinsic is slower than that algorithm, because a compiler intrinsic is supposed to be compiled specially and turned into a specifically chosen assembler instruction. Doesn't Visual Studio do that? Can you point me to that article, please?

I'm very curious to know why it's like that; not that I doubt what you said.

EDIT: Looks like it's indeed generating an assembler instruction (bsf) just for that test; now I'm even more curious how it can be slower. :)


TurfIt

#57
Doing bit manipulation on uint64s when compiled for a 32-bit target would be rather counterproductive (read: slow), hence sticking with uint32s. Simutrans and 64-bit compilations don't get along anyway regarding optimizations, so no point adding more special cases IMO.

As for GCC specific optimizations, IMHO we should be using more... at least to the extent that they don't harm other compilers. I tend to think it's the most common compiler in use. Seems some like MSVC, but it's already in a huge hole performance wise when it comes to the graphics with it not using the asm blocks in simgraph. If someone truly cares about MSVC performance, they'd port those over.

I originally chose the Mod37 fallback because it distinguishes between the 00 and FF cases, where the DeBruijn routine doesn't. However, I've since ended up adding special handling for them anyway, so I will switch. Timing-wise, there's not a huge difference; it's really at the limit of what I can measure with a 300kHz clock... ffs = 29us/frame, DeBruijn = 29.5us, Mod37 = 30.5us. I can't test the MSVC intrinsic, but looking at the code Markohs posted, it's spitting out the same instruction.

@Markohs: Is wrapping the _BitScanForward in the if() required? i.e. Don't the surrounding ifs take care of the special case?

Is this c't article available anywhere? Certainly x86 has a history of including 'dud' instructions; I rather doubt the bit scan instructions are among them, though, it's just too common a need.
EDIT: http://chessprogramming.wikispaces.com/BitScan is interesting.

I think I'm hearing things aren't too cryptic/obfuscated. I sat on this for a week hoping for a lightning bolt of inspiration containing a more elegant routine, but nothing happened...
Documentation - suggestions on what else is needed? I thought I'd commented sufficiently, and really the only way to understand bit manipulations is to work through them.



GCC code for __builtin_ffs:

movl __ZL14tile_dirty_old, %eax
movl -40(%ebp), %edx
sall $2, %edx
addl %edx, %eax
movl (%eax), %eax
notl %eax
xorl %edx, %edx
bsfl %eax, %eax
sete %dl
negl %edx
orl %edx, %eax
incl %eax
decl %eax
movl %eax, -44(%ebp)
movl -84(%ebp), %eax
movl -40(%ebp), %edx
subl %eax, %edx
movl $32, %eax
subl -44(%ebp), %eax
movl $-1, %esi
movl %esi, %edi
movb %al, %cl
shrl %cl, %edi
movl %edi, -108(%ebp)
movl -76(%ebp), %eax
movl -108(%ebp), %ecx
movl %ecx, (%eax,%edx,4)
jmp L957

Spits out an inline bsf as expected.

Markohs

Quote from: TurfIt on May 09, 2013, 12:50:37 AM
Doing bit manipulation on uint64s when compiled to a 32-bit target would be rather counterproductive - read slow, hence why sticking with uint32s. Simutrans and 64 bit compilations don't get along anyways re optimizations, so no point adding more special cases IMO.

ok, it was just an idea (a bad one, I see). :)

Quote from: TurfIt on May 09, 2013, 12:50:37 AM
As for GCC specific optimizations, IMHO we should be using more... at least to the extent that they don't harm other compilers. I tend to think it's the most common compiler in use. Seems some like MSVC, but it's already in a huge hole performance wise when it comes to the graphics with it not using the asm blocks in simgraph. If someone truly cares about MSVC performance, they'd port those over.

Hum... I thought that assembler was not used with gcc either, that it was old code no longer being compiled! Is the performance difference really that big?

Adding compiler-specific code is a nice thing to do IMHO, if we restrict ourselves to doing it in selected files only, critical for performance. It's okay doing it in simgraph16.cc, certain tpl data structures if useful, and maybe other performance-critical parts of the code.

I think most of you have a wrong idea of MSVC compared to GCC. Looking at benchmarks of those two compilers, they are actually very close in code performance; it's not that gcc clearly generates better code than msvc, and in some circumstances msvc code is even better.

Quote from: TurfIt on May 09, 2013, 12:50:37 AM
@Markohs: Is wrapping the _BitScanForward in the if() required? i.e. Don't the surrounding ifs take care of the special case?

Yup, it's not needed; they can be removed. I just wanted the code to have exactly the same result as your gcc macro, but it's indeed not needed because that case will never happen.

Quote
Documentation - suggestions on what else is needed? I thought I'd commented sufficiently, and really the only way to understand bit manipulations is to work through them.

Documenting the variables, even if the names are quite explicit. And stating that tile_dirty is a bitmap of dirty tile positions, where each bit represents a rectangle of the screen. Also the functions. It's pretty obvious for those of us who know how they work, but for an outsider they are cryptic. Not overdoing it, of course; that outsider is supposed to be able to read code. :)

prissi

MSVC might not be much worse, but optimisations for GCC likely work on more x86 platforms (Linux, MacOS, Haiku) than MSVC ones. If we set the Intel compiler's performance to 1.0, it's 0.95-1.05 for Portland, 1.3 for GCC, and 1.7 for MSVC on the SPEC suite.

bsf involves a lot of microcode, especially the 64-bit version, and especially on older CPUs (where optimisation is mainly an issue). In the c't tests the DeBruijn routine was down to 2.43 cycles in the end (obviously this depends very much on cache size, i.e. whether the table is fully in the cache, and things like that).

The article is not on the net (I can send it to you, but it is obviously in German). But the article Ters gave apparently has similar info.