
New dirty tile management

Started by Markohs, April 23, 2013, 04:05:26 PM


Ters

Quote from: Markohs on May 06, 2013, 02:00:41 PM
Using a bool for each tile also makes the code automatically reentrant to multiple threads as a side-effect.

I'm not so sure about this, especially if multiple threads are interested in the same bool. The fact that reading and writing a bool are each atomic won't help much. If multiple threads are not interested in the same bool, one could just align each thread's area of responsibility with the word size.

Markohs

#36
Well, but there is no "clear dirty" operation, just a "set dirty" operation. I think that's enough to guarantee it's reentrant in this particular case.

If two threads set a tile dirty at the same time, the order and the exact circumstances don't really matter: they will just write "true" at the same position. That's not the case in the current code, where you need to read the packed tile byte, perform the bitwise operations, and write it back to the buffer, which is far from atomic.
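
Roughly, the difference I mean looks like this (just a sketch with made-up names, not the actual simgraph16.cc code):

#include <stdint.h>

// Sketch only - hypothetical buffers, not the real simgraph16.cc ones.
static uint8_t tile_dirty_packed[1024];    // trunk style: 8 tiles per byte
static bool    tile_dirty_bool[8 * 1024];  // proposed: one bool per tile

// Packed marking is a read-modify-write: two threads touching tiles in the
// same byte can race and one of the bits can be lost.
static void mark_dirty_packed(int tile)
{
    tile_dirty_packed[tile >> 3] |= 1 << (tile & 7);
}

// Bool marking is a single store of a constant; nothing is read first, so
// concurrent markers can at worst store the same value twice.
static void mark_dirty_bool(int tile)
{
    tile_dirty_bool[tile] = true;
}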

EDIT: I just implemented these bool tiles and, well, I haven't measured in detail yet, but I couldn't see many problems.

On my 1280 x 1024 screen, the old code used 2x618 byte buffers, the bool implementation 2x4941 byte ones, a bit less than 8 times the size.

Markohs

#37
Well, I couldn't find performance differences, really... I'll just post the patch here in case someone is interested, but it's nothing too interesting. To benchmark it fully I'd need higher-granularity time counters and, well, maybe it's not really worth it; if there is any difference, it's minimal.

But this code should be reentrant safe, and the older one is not.

https://dl.dropboxusercontent.com/u/30024783/dirty_bool.patch

EDIT: just for the sake of correcting myself, in one of my previous posts I mentioned that minimizing the number of dirty rectangles by merging them was an EXP problem. It's not: it's an NP-complete problem if I'm not wrong, a variation of the knapsack problem. Not really important, but I couldn't stand the thought of someone reading that post and thinking "oh man, that guy has no idea". Even though they'd probably be right :).  Nothing too important. ;)

TurfIt

sizeof( bool ) is defined as implementation specific. Seems current x86 compilers are using 1 byte. Older ones were using 4.

Using a byte (bool) is no more thread safe than words (uint32). CPUs don't work with bytes, but words. To modify a byte means reading a word, masking the high bits to 0, shifting right until the desired byte is in the low position, modifying this byte (really still the 32-bit word register), shifting the byte back to the correct word position, combining it back with the 3 other bytes in the word, and finally writing the word back to cache/memory. Of course this is all done in hardware at a microcode level, but it's still being done.

Even using words doesn't guarantee atomicity unless using specific hardware instructions on multicore systems. If you want thread safety here, protect the shared global memory with a mutex. But IMHO it's not too important: the worst case is that a tile doesn't get marked dirty that should have, and the window where this could happen is very, very small.
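
For reference, the mutex approach would look something like this (pthreads sketch with made-up names, not a proposed patch):

#include <pthread.h>
#include <stdint.h>

static uint8_t tile_dirty[1024];   // hypothetical packed dirty buffer
static pthread_mutex_t dirty_mutex = PTHREAD_MUTEX_INITIALIZER;

// Serializes the read-modify-write of the packed byte, at the cost of a
// lock/unlock per marked tile.
static void mark_dirty_locked(int tile)
{
    pthread_mutex_lock(&dirty_mutex);
    tile_dirty[tile >> 3] |= 1 << (tile & 7);
    pthread_mutex_unlock(&dirty_mutex);
}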

Trunk code should be giving 640 byte buffers for 1280x1024? My uint32 version should be 768 due to forced alignment.

For timing on Windows, I've found QueryPerformanceCounter to be the only thing with sub-millisecond results. For me, it's finding a 300kHz clock somewhere. Of course YMMV.
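
Something along these lines (rough sketch; measure_me() just stands in for whatever is being timed):

#include <windows.h>
#include <stdio.h>

static void measure_me() { /* code under test goes here */ }

static void time_with_qpc()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);   // ticks per second
    QueryPerformanceCounter(&start);
    measure_me();
    QueryPerformanceCounter(&end);
    const double us = (end.QuadPart - start.QuadPart) * 1000000.0 / (double)freq.QuadPart;
    printf("elapsed: %.1f us\n", us);
}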

Also, you need to consider the effect of the code on other routines. Accessing a larger memory footprint might be evicting other stuff from the caches that would've otherwise remained, and hence cause a slowdown elsewhere... The timing I've reported is the effect on the entire sync_step.

Assuming you're in MSVC land, can you look at adding its bitscan intrinsic to my patch?

Markohs

Quote from: TurfIt on May 06, 2013, 04:43:58 PM
Using a byte (bool) is no more thread safe than words (uint32). CPUs don't work with bytes, but words. To modify a byte means reading a word, masking high bits to 0, shifting right until the desired byte is in the low position, modify this byte (really still the 32bit word register), shifting the byte back to the correct word position, combining it back with the 3 other bytes in the word, and finally writing the word back to cache/memory. Of course this is all done in hardware at a microcode level, but it's still being done.

After some reading I'd say you are not right: from the 486 processor onwards, Intel guarantees atomicity for accesses as small as 8 bits.

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Section 8.1.1


The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

Reading or writing a byte

Reading or writing a word aligned on a 16-bit boundary

Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

Reading or writing a quadword aligned on a 64-bit boundary

16-bit accesses to uncached memory locations that fit within a 32-bit data bus


This just applies to x86 processors, anyway. So you are right. Anyway, I'm unsure whether the C compiler will grant you this atomicity somewhere else; my wild guess is that when the hardware doesn't support it, it forces the bool type to the required boundary (32 bits, I guess).

Quote from: TurfIt on May 06, 2013, 04:43:58 PM
Even using words doesn't guarantee atomicity unless using specific hardware instructions on multicore systems. If you want thread safety here, protect the shared global memory with a mutex. But, IMHO it's not too important. Worst case is a tile doesn't get marked dirty that should have, and it's a very very small window where this could happen.

Well, ofc you don't get atomicity without mutexes in general, but as long as you are using an 8-bit register (I know it's just the lower part of a 32-bit register), Intel allows you to access memory at byte level and guarantees atomicity when reading/writing it.

The thing is, TurfIt, the reentrant-safety I was mentioning comes from the fact that the code doesn't *READ* any position, it just writes. Your current implementation reads a 32-bit value, modifies it and writes it back. My implementation doesn't. So I think mutexes are not needed, since it's inherently reentrant. But I'm not 100% sure of it. :)

I know the chance of failure is low, but believe me, it exists and shows up often if you use dirty marking on multiple threads. In some tests I made in the past, it only took a few seconds for something to go wrong.

Quote
Trunk code should be giving 640 byte buffers for 1280x1024? My uint32 version should be 768 due to forced alignment.

For timing on Windows, I've found queryperformancecounter to be the only thing with sub millisecond results. For me, it's finding a 300kHz clock somewhere. Of course YMMV.

Yup, saw it already, thanks. :) Anyway, I decided to step aside and let you implement this your way; I just wanted to experiment a bit. I'm sorry I bothered you with this. :)

Quote
Also, you need to consider the effect of the code on other routines. Accessing a larger memory footprint might be evicting other stuff from the caches that would've otherwise remained, and hence cause a slowdown elsewhere... The timing I've reported is the effect on the entire sync_step.

Yes, you are right, I didn't measure the impact of that.

Quote
Assuming you're in MSVC land, can you look at adding its bitscan intrinsic to my patch?

Sure, I will. ;)

Just tell me when your work is complete and I'll send you the changes and do the tests you want.

Ters

x86 is a rather odd architecture. Other architectures are more likely to only operate on words.

Markohs

On those architectures I'm pretty sure a bool will be word-sized.

Ters


Markohs

#43
Sure, we've all fallen into that trap.

But I think this code proves that what I said is correct, and I think it won't fail on any platform. If someone knows why this shouldn't work on all platforms, just tell me. ;)

I'm *pretty sure* the C++ standard guarantees that access to a variable is independent of accesses to other variables no matter what the alignment, and if the hardware can't guarantee that, the compiler will use whatever alignment is necessary for this to work.

It would be impossible to write programs if this weren't the case, no? :)


#include <stdio.h>
#include <pthread.h>

// Each thread toggles only its own element; if stores to neighbouring bools
// were implemented as non-atomic word-wide read-modify-writes, a thread could
// see its own element change unexpectedly and print the message below.
volatile bool booleans[32];
int id[32];

pthread_t threads[32];

pthread_attr_t attr;


void *thread(void *parameter){
        const int id = *(int*)parameter;  // local id shadows the global array
        printf("Thread %i alive\n", id);

        int i = 0;
        bool last_state = false;
        booleans[id] = last_state;

        while(true){

                if(booleans[id] != last_state){
                        printf("******************** UNEXPECTED STATE ********************\n");
                }
                else {
                        printf(".");
                }

                booleans[id] = (i % 2) != 0;
                last_state   = (i % 2) != 0;

                i++;
        }

        return NULL;
}

int main(){

        // cast to int: sizeof yields size_t, which %i cannot print portably
        printf("sizeof(bool): %i , sizeof(booleans): %i\n",
               (int)sizeof(bool), (int)sizeof(booleans));

        printf("any key to create threads\n");
        getchar();

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

        for(int i = 0 ; i < 32 ; i++){
                id[i] = i;
                pthread_create(&threads[i], &attr, thread, &id[i]);
        }

        printf("threads created\n");
        getchar();

        return 0;
}


EDIT: Tested on a sparc64 with 2 CPUs and on a SPARC-Enterprise-T5120 with a chip with 32 cores (adapted the program to work with 32 threads, ofc). Neither failed. Both CPUs are able to store 8-bit values in memory, even though they are 64-bit architectures. bool was 1 byte long on them, and the array sizeof was num_threads (32).

TurfIt

Quote from: Markohs on May 06, 2013, 05:20:57 PM
I know the chance of failing is low, but believe me it exists and happens often if you use dirty marking on multiple threads. On some tests I made on the past, it just takes some seconds for something to get wrong.
The lower the theoretical failure chance, the more likely a failure is observed. At least that's my experience ;)
I've no doubt this is failing, I've just never been able to see it, and really have no idea how to objectively test for it. But, if artifacts aren't all over, there's no reason IMHO to change to a thread-safe version.


Quote from: Markohs on May 06, 2013, 05:20:57 PM
Yup, saw it already, thanks. :) Anyway I decided to step aside and let you implement this your way, I just wanted to experiment a bit, I'm sorry I bothered you with this. :)
Experimenting a bit is all I was doing too. I really didn't expect any measurable difference implementing this, just did it because it was there... Definitely no bother.


Quote from: Markohs on May 06, 2013, 05:20:57 PM
Yes you are right, I didn't measured the impact of that.
I measured your bool patch at 0.2% faster overall sync_step, but at a 3% hit to the sync section (affects fast-forward rate).
But my patch was 0.5% overall, 6.5% sync. I missed that in all the jitter in my previous tests. My change to the dirty marking was just too heavy; I reverted that to bitwise operations and it's now at 0.8%, 0.2%. Much better.

Still not sure if it's worthy of committing. The performance difference is truly negligible as expected, and I'll be the first to admit it's cryptic...


Quote from: Markohs on May 06, 2013, 05:20:57 PM
Just tell me when your work is complete and I'll send you the changes and do the tests you want.
I've done all I'm going to in that section. All that's needed is to extend the #if #else to include an MSVC branch with the _BitScanForward intrinsic, and to figure out if it needs a -1 or not... (just turn on DebugFlushBuffer; if it's needed, it'll be obvious!)
I'd played with reducing the tile size for non moving objects, but that's really a separate patch, and has given me enough of a headache to shelve it for a while.

kierongreen

If you were achieving a 50% speed up then the odd graphical glitch might be acceptable. For a few % at best it seems not worthwhile to me.

prissi

I would second that: if there is a clean solution that is just a little slower, then go for it. Chances are high that a compiler can optimize the simpler solution much more effectively than the complex one (recursion especially seems like a tough optimizer challenge).

TurfIt

I sense this meandering discussion has caused confusion...
Once upon a time, there was a thought to try merging dirty tiles into larger rectangles than the current trunk horizontal stripes, and by this reduce the number of calls to the graphics library (SDL_UpdateRect and StretchDIBits). Apparently, SDL on Mac is averse to many such calls. Markohs came up with a recursive algorithm which replaced the entire current dirty tile structures, but it turned out way too slow. I suggested a very simple merging using the existing structure (never showed the patch for this) but then decided to replace the while loops that search for set bits with the bit scan CPU instructions. I also attempted to replace the bit-by-bit dirty tile marking with a routine that would set multiple bits at once (this is the patch above).
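
To make the bit scan part concrete, the change is essentially this (simplified illustration, not the actual display_flush_buffer() code):

#include <stdint.h>

// Trunk-style search: test bit by bit until the first set bit is found.
static int first_set_bit_loop(uint32_t word)
{
    int pos = 0;
    while (pos < 32 && !(word & (1u << pos))) {
        pos++;
    }
    return pos;   // 32 if no bit is set
}

// Bit scan version: a single bsf instruction on GCC/x86. __builtin_ffs
// returns the 1-based position of the lowest set bit, or 0 for zero input.
static int first_set_bit_scan(uint32_t word)
{
    return word ? __builtin_ffs((int)word) - 1 : 32;
}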

At some point, the potential glitches from multithreading with simple_drawing mode enabled came up. Markohs suggested changing to a bool per dirty tile instead of using the 8 packed bits. I found this causes a 3% slowdown in sync_step, where moving objects mark their positions dirty. If you look at mark_tiles_dirty() in isolation, using bool is 15% slower than the packed 8-bit access in trunk. The other timing point is display_flush_buffer(), where using bool or packed 8-bit is equivalent. 3% isn't bad, but simple_drawing causes graphic glitches anyway, so what's potentially a couple more? My patch doesn't change the dirty marking (the posted patch does, but I reverted that section since it was 50% slower!), so it has an equal chance as trunk to cause glitches.

For the rectangle merging, trunk takes 23us/frame in display_flush_buffer(). Simple merging increases that to 41us/frame, but the 10% fewer rectangles passed to SDL are giving me a 4% overall decrease in screen update time. Using my bit scan patch drops display_flush_buffer() back to 29us/frame and includes the simple merging, and yields 4.5% better overall update time.

Is it all worth it? I can't really say, none of my systems are overly sensitive to rectangle count. I can only hope there's an improvement on those that are...
Attached are the bit scan, and 'C' versions. Either clean?



Markohs

A 4.5% better update time makes it worthy of inclusion in trunk imho. Everything adds up, and if we have some users with speed problems, this can help them.

The new code is just slightly more complex than the old one. I'd include it.

I'd just add some doxygen to the functions to document the implementation better; that's lacking in the code and now is the perfect moment to document it. ;) (I know, it sucks, but it has to be done.)

Markohs

#49
TurfIt, dunno how to post a patch against your patch, so I'll just post the changes to simgraph16.cc here:


#ifdef _MSC_VER
#   include <io.h>
#   define W_OK 2
#   include <intrin.h>
#else
#   include <unistd.h>
#endif
...
               else { // dirty block ends in word_x2
#if GCC_ATLEAST(3,2)
                  x2 = __builtin_ffs( ~tile_dirty_old[word_x2] ) - 1; // find edge of dirty block in word_x2
#elif defined(_MSC_VER)
                  if( ! _BitScanForward((unsigned long*) &x2, ~tile_dirty_old[word_x2]) ) {
                     x2 = -1;
                  }
#else
                  const uint32 tv = ~tile_dirty_old[word_x2];
                  x2 = Mod37BitPosition[(-tv & tv) % 37];
#endif
                  masks[word_x2-word_x1] = 0xFFFFFFFF >> (32 - x2);
               }
            }
            else { // dirty block is all within one word - word_x1
#if GCC_ATLEAST(3,2)
               x2 = __builtin_ffs( testval ) - 1;
#elif defined(_MSC_VER)
               if( ! _BitScanForward((unsigned long*) &x2 , testval) ) {
                  x2 = -1;
               }
#else
               x2 = Mod37BitPosition[(-testval & testval) % 37];
#endif
               masks[0] = (0xFFFFFFFF << (32 - x2 + (x1 & 31))) >> (32 - x2);
            }


EDIT: Win32 also has a 64-bit version of this function (like GCC's int __builtin_ffsll(unsigned long long)); do you think we can use those functions too? Btw, as you saw, it didn't need the -1: the function returns nonzero on success and 0 if the argument is zero, and writes the zero-based index to its output parameter.
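
In other words, the two intrinsics report the position differently; a tiny wrapper (just to illustrate, not part of the patch) would be:

#include <stdint.h>
#ifdef _MSC_VER
#   include <intrin.h>
#endif

// Returns the 0-based position of the lowest set bit, or -1 if v == 0.
static inline int lowest_set_bit(uint32_t v)
{
#if defined(_MSC_VER)
    unsigned long index;
    // _BitScanForward writes a 0-based index and returns nonzero on success.
    return _BitScanForward(&index, v) ? (int)index : -1;
#else
    // __builtin_ffs returns a 1-based position (0 for zero input), hence the -1.
    return __builtin_ffs((int)v) - 1;
#endif
}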

Ters

The 64-bit bsf instruction, which I understand these things end up as, is only available on amd64 aka x64. Win32 is normally the name of the basic Windows API from Windows 95 or so onwards.

prissi

Generally we do not do any optimisations except for GCC. Other platforms (like Amiga with PowerPC) will surely need the plain C code as well. I would also refrain from optimising the 64-bit code.

By the way: the German c't 7/2013 just had a nice article on bitscans. There are subroutines that are faster than the intrinsics on almost any processor ...


const uint8 MultiplyDeBruijnBitPosition[32] =
{
  // precomputed lookup table
  0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8, 31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

uint8 lowestBitSet(uint32 x)
{
  // leave only lowest bit
  x  &= -((sint32)x);
  // DeBruijn constant
  x  *= 0x077CB531;
  // get upper 5 bits
  x >>= 27;
  // convert to actual position
  return MultiplyDeBruijnBitPosition[x];
}
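
A few hand-checked values, assuming the routine above (and the usual uint8/uint32/sint32 typedefs) is compiled together with this check:

#include <assert.h>

int main()
{
    assert(lowestBitSet(0x00000001) == 0);
    assert(lowestBitSet(0x80000000) == 31);
    assert(lowestBitSet(0x00000018) == 3);   // lowest of bits 3 and 4
    // Caveat: lowestBitSet(0) also returns 0, so an all-zero word has to be
    // filtered out by the caller.
    return 0;
}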


EDIT: And I think a 5% gain is a lot considering that this is not much obfuscated code to add (and it's also likely quite portable).

BTW: Allegro does not require dirty tiles at all.

Markohs

Well prissi, if the optimization for Visual C is as easy as the one I posted, I don't see any reason not to include it. TurfIt already implemented a generic C version for non-gcc compilers; it's in the patch and uses a precalculated modulo-37 array (sketched below for reference).
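
That fallback is the standard Bit Twiddling Hacks construction; a sketch of it (using the usual uint8/uint32 typedefs; table values are from that source, so the patch's array should match):

// A power of two taken modulo 37 gives a unique remainder, so a 37-entry
// table maps it straight to the bit position. Index 0 maps to 32, so an
// all-zero input is distinguishable from one with bit 0 set.
static const uint8 Mod37BitPosition[37] =
{
    32,  0,  1, 26,  2, 23, 27,  0,  3, 16, 24, 30, 28, 11,  0, 13,  4,
     7, 17,  0, 25, 22, 31, 15, 29, 10, 12,  6,  0, 21, 14,  9,  5,
    20,  8, 19, 18
};

static uint8 lowest_bit_mod37(uint32 v)
{
    return Mod37BitPosition[(-v & v) % 37];   // -v & v isolates the lowest set bit
}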

@Ters: ok, I was not sure of that, thx. :)

kierongreen

If 5% is simple, and is consistent across many platforms and systems I'd agree it'd be a good change.

prissi

The intrinsic is way slower than the routine I posted, about 10 times on most CPUs (at least according to the benchmarking of c't, which rarely fails).

Markohs

Well, if your version is faster, ofc we should use it. Is it faster than gcc's one too?

Markohs

#56
I guess c't is that German site I found on the internet. I'm very curious to know why the intrinsic is slower than that algorithm, because a compiler intrinsic function is supposed to be compiled specially by the compiler and generate a specifically chosen assembly instruction for the job. Doesn't Visual Studio do that? Can you point me to that article, please?

I'm very curious to know why that is, not that I doubt what you said.

EDIT: Looks like it is indeed generating an assembly instruction (bsf) just for that test; now I'm even more curious to know how it can be slower. :)


TurfIt

#57
Doing bit manipulation on uint64s when compiled for a 32-bit target would be rather counterproductive (read: slow), hence sticking with uint32s. Simutrans and 64-bit compilations don't get along anyway regarding optimizations, so no point adding more special cases IMO.

As for GCC-specific optimizations, IMHO we should be using more... at least to the extent that they don't harm other compilers. I tend to think it's the most common compiler in use. Some seem to like MSVC, but it's already in a huge hole performance-wise when it comes to the graphics, with it not using the asm blocks in simgraph. If someone truly cared about MSVC performance, they'd port those over.

I originally chose the Mod37 fallback because it distinguishes between the 00 and FF cases, where the DeBruijn routine doesn't. However, I've since ended up adding special handling for them anyway, so I will switch. Timing-wise, there's not a huge difference; it's really at the limit of what I can measure with a 300kHz clock...  ffs = 29us/frame, DeBruijn = 29.5us, Mod37 = 30.5us. I can't test the MSVC intrinsic, but looking at the code Markohs posted, it's spitting out the same instruction.

@Markohs: Is wrapping the _BitScanForward in the if() required? i.e. Don't the surrounding ifs take care of the special case?

Is this c't article available anywhere? Certainly x86 has a history of including 'dud' instructions. I rather doubt the bitscan instructions are among them, though; it's just too common a need.
EDIT: http://chessprogramming.wikispaces.com/BitScan is interesting.

I think I'm hearing things aren't too cryptic/obfuscated. I sat on this for a week hoping for a lightning bolt of inspiration containing a more elegant routine, but nothing happened...
Documentation - suggestions on what else is needed? I thought I'd commented sufficiently, and really the only way to understand bit manipulations is to work through them.



GCC code for the __builtin_ffs:

movl __ZL14tile_dirty_old, %eax
movl -40(%ebp), %edx
sall $2, %edx
addl %edx, %eax
movl (%eax), %eax
notl %eax
xorl %edx, %edx
bsfl %eax, %eax
sete %dl
negl %edx
orl %edx, %eax
incl %eax
decl %eax
movl %eax, -44(%ebp)
movl -84(%ebp), %eax
movl -40(%ebp), %edx
subl %eax, %edx
movl $32, %eax
subl -44(%ebp), %eax
movl $-1, %esi
movl %esi, %edi
movb %al, %cl
shrl %cl, %edi
movl %edi, -108(%ebp)
movl -76(%ebp), %eax
movl -108(%ebp), %ecx
movl %ecx, (%eax,%edx,4)
jmp L957

Spits out an inline bsf as expected.

Markohs

Quote from: TurfIt on May 09, 2013, 12:50:37 AM
Doing bit manipulation on uint64s when compiled to a 32-bit target would be rather counterproductive - read slow, hence why sticking with uint32s. Simutrans and 64 bit compilations don't get along anyways re optimizations, so no point adding more special cases IMO.

Ok, it was just an idea (a bad one, I see). :)

Quote from: TurfIt on May 09, 2013, 12:50:37 AM
As for GCC specific optimizations, IMHO we should be using more... at least to the extent that they don't harm other compilers. I tend to think it's the most common compiler in use. Seems some like MSVC, but it's already in a huge hole performance wise when it comes to the graphics with it not using the asm blocks in simgraph. If someone truly cares about MSVC performance, they'd port those over.

Hmm... I thought that assembler was not used with gcc either, that it was old code no longer being compiled! Is it really such a big performance difference?

Adding compiler-specific code is a nice thing to do imho, if we restrict ourselves to doing it only in selected files that are critical for performance. It's okay to do it in simgraph16.cc, certain tpl data structures if useful, and maybe other critical parts of the code.

I think most of you have a wrong idea of MSVC compared to GCC. Looking at benchmarks of those two compilers, they are actually very, very close in code performance; it's not that gcc generates clearly better code than msvc, and in some circumstances msvc's code is even better.

Quote from: TurfIt on May 09, 2013, 12:50:37 AM
@Markohs: Is wrapping the _BitScanForward in the if() required? i.e. Don't the surrounding ifs take care of the special case?

Yup, it's not needed; it can be removed. I just wanted the code to have exactly the same result as your gcc macro, but it's indeed not needed because that case will never happen.

Quote
Documentation - suggestions on what else is needed? I thought I'd commented sufficiently, and really the only way to understand bit manipulations is to work through them.

Documenting the variables, even if the names are quite explicit. And stating that tile_dirty is a bitmap of dirty tile positions, where each bit represents a rectangle of the screen. Also the functions. It's pretty obvious for those of us who know how they work, but for an outsider they are cryptic. Not overdoing it, ofc; that outsider is supposed to be able to read code. :)

prissi

MSVC might be not much worse, but optimisations for GCC likely work on more x86 platforms (Linux, MacOS, Haiku) than MSVC's. If you set the performance to 1.0 for the Intel compiler, it's 0.95-1.05 for Portland, 1.3 for GCC and 1.7 for MSVC on the SPEC suite.

bsf involves a lot of microcode, especially the 64-bit version, and especially on older CPUs (where optimisation is mainly an issue). In the c't tests the DeBruijn routine was down to 2.43 cycles in the end (obviously this depends very much on cache size, i.e. whether the table is fully in the cache, and things like that).

The article is not on the net (I can send it to you, but it is obviously in German). But the article Ters gave apparently has similar info.