News:

SimuTranslator
Make Simutrans speak your language.

Patches for big-endian machines, clang, no /proc

Started by Kernigh, November 10, 2018, 01:52:22 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Kernigh

I am attaching some patches to compile Simutrans for OpenBSD. I would like Simutrans to take these patches because they help platforms that have big-endian processors, use clang, or don't have /proc filesystem.

I want to add Simutrans to OpenBSD Ports, so that more OpenBSD users can play Simutrans. (FreeBSD, NetBSD, and a few Linux distros are already packaging Simutrans.) I had been playing Simutrans on OpenBSD/amd64, but I decided to make the OpenBSD port on my old PowerBook G4 running OpenBSD/macppc, which is a big-endian PowerPC machine. I discovered that Simutrans 120.4.1 doesn't work on big-endian machines.

You can apply each patch like patch -p0 < big-endian.diff. The patches are for Simutrans 120.4.1. I can also apply the patches to git mirror of svn trunk, but I have only compiled 120.4.1, not trunk.

big-endian.diff has the big-endian fixes. An integer uint32 i = 0x11223344; looks like {0x44, 0x33, 0x22, 0x11} in little-endian byte order, or {0x11, 0x22, 0x33, 0x44} in big-endian byte order. Simutrans uses little-endian order in .pak files. Simutrans has code to swap the byte order for big-endian processors (like my PowerPC G4), but this code is broken in a few places.

The first 2 chunks (descriptor/writer/*.cc) fix makeobj. I tested them by building pak64.japan, pak128, and pak128.Britain from their source code. Without the fixes, the .pak files are corrupt and Simutrans crashes. With the fixes, I can start a new game or load my old pak128 game.

The 3rd chunk (display/simgraph16.cc) fixes the display code. The tile_dirty code was assuming a little-endian processor. Further down, an assignment to const *sp was a compiler error. I fixed it the compile the game, and later, I fixed the tile_dirty code because Simutrans wasn't redrawing part of the screen.

clang.diff fixes some errors from clang. My big-endian machine uses gcc and didn't need these fixes. My amd64 machine uses clang. Most OpenBSD users have amd64 machines, so I must fix these clang errors before they can play Simutrans. In my experience, gcc is more compatible with old code, and clang is more likely to give errors.

path-search.diff is a sloppy attempt to fix the code in sysmain() that finds the executable. OpenBSD doesn't have /proc, so the code tries to parse argv[0]. The code was missing a PATH search. PATH (in uppercase) is the environment variable where the Unix shell looks for executables. There are 3 ways to run simutrans:

$ /somewhere/simutrans  # absolute path
$ ./simutrans           # relative path
$ simutrans             # PATH search


In the last case, argv[0] is "simutrans", and we need to repeat the PATH search to find the executable. My code is sloppy because my PATH search isn't exactly like the shell's, but it seems close enough to work. I need this patch because the OpenBSD port will put the simutrans executable in the PATH.

I have 2 other patches that I didn't share here. One of them patches makeobj/Makefile to enable PNG_CFLAGS (which now gets ignored); the other patch is only for OpenBSD Ports and tells simmain.cc where my port installed the data files.

TurfIt

From the progression of patches for the dirty tile change to uint32 from uint8, I see the code was originally as you have it in your patch. Then, after a bunch of timing tests, the cast to uint8* shows up. I don't have the timing results anymore, but can only presume it was shown faster to work on the uint8. Hence, I'd suggest to wrap this change in a #if SIM_BIG_ENDIAN.

There should be many more places throughout the code where -1 is assigned to a unsigned int that the overly picky compiler is missing....


DrSuperGood

QuoteThe tile_dirty code was assuming a little-endian processor.
I assume the problem manifested somewhere else, in that it was trying to treat tile_dirty as a 32bit unsigned but it was constructing it like a little endian 32bit unsigned?

In any case the reason uint8* might be faster is because it only has to read and write 1 byte per iteration (assuming no compiler optimizations), as opposed to 4 bytes if the native type was used as in the solution. Although one could write some clever logic to merge all bit sets into native units, the result is still that 4 bytes will be written, even if less than 8 bits need to be set (a single byte ends up modified).

The question I have to ask is why not make tile_dirty a uint8 array? Is there a speed benefit somewhere else from manipulating it as 32bit words?

Ters


The clang-patch looks correct. I can't comment on the path-search-patch.

In display_img_nc, it should be possible to use regular "non-optimized" (all else) for big-endian systems as well. No need to have that code twice. In fact, I don't think any of the code paths have any dependencies on byte order in the first place.

tile_dirty is no longer cast to uint8 *, but I can't see that its type is being changed. That should mean serious array index out of bounds errors! And that everything else won't work.

Quote from: DrSuperGood on November 10, 2018, 10:17:14 AMThe question I have to ask is why not make tile_dirty a uint8 array? Is there a speed benefit somewhere else from manipulating it as 32bit words?
It must be display_flush_buffer that benefits from being able to process things four bytes at a time, as most other access is for a single-bit. I think the compiler can fix the loop with the comment "combine current with last dirty tiles" to be just as fast either way, but once you reach the comment "dirty block spans words" it might need this kind of help from the coder. It can be that single-byte memory access is slower than four-byte, although I can't exactly see why.

prissi

I think the tile dirty access never ever showed up anywhere. Some extra bytes should be absolutely neglible compared to copying a rectangle to the screen.

Dwachs uses clang, as far as I know. I was not aware, that he has troubles, but those changes are trivial. I am only afraid, I may accidently write again something not clang compatible.

So thank you very much, especially for fixing the endian code. (I am still wondering, why the images do not need fixing, since the are unit16, but well, if it works ... )

All in in r8630

Ters

Quote from: prissi on November 10, 2018, 03:15:16 PMI think the tile dirty access never ever showed up anywhere. Some extra bytes should be absolutely neglible compared to copying a rectangle to the screen.
What extra bytes?

DrSuperGood

QuoteWhat extra bytes?
If an area spanning less than 25 bits is manipulated there is a chance that fewer than 4 bytes will need to be updated. This might be faster.

Unless areas bigger than 32 bits wide are being masked a lot, there will be no performance gains from doing stuff 4 bytes at a time.
QuoteIt must be display_flush_buffer that benefits from being able to process things four bytes at a time, as most other access is for a single-bit.
Technically one could treat it as individual bytes and then combine them for that step. A decent compiler should notice the merger of 4 consecutive bytes as a word read and so there would be no overhead. There might be overhead on Big Endian platforms due to endian change, but if the compiler is smart enough even that code should be highly efficient.

TurfIt

Faulty memory as usual. Using:

((uint8*)tile_dirty)[bit >> 3] |= 1 << (bit & 7);

instead of:

tile_dirty[bit >> 5] |= 1 << (bit &317);

wasn't due to speed, but artifacting!  Hint: multithreaded non-mutexed access in the drawing routines.

Original discussions: New dirty tile management

Ters

Quote from: Ters on November 10, 2018, 05:05:29 PM
What extra bytes?
Never mind. Nobody spotted or corrected my error in not seeing that the shift factor changed.

Quote from: TurfIt on November 10, 2018, 06:39:35 PM
multithreaded non-mutexed access in the drawing routines.
These gotchas are why I avoid the multi-threaded rendering.

Kernigh

Quote from: TurfIt on November 10, 2018, 02:40:56 AMThere should be many more places throughout the code where -1 is assigned to a unsigned int that the overly picky compiler is missing....

The compiler seems to be more strict with initializer lists and switch statements. Given these 3 conversions, only the 3rd is an error in clang 6:

/* example.cc */
extern int i;
char a = i;
char b(i);
char c{i};


$ clang -c example.cc
example.cc:5:8: error: non-constant-expression cannot be narrowed from type   
      'int' to 'char' in initializer list [-Wc++11-narrowing]                 
char c{i};
       ^
example.cc:5:8: note: insert an explicit cast to silence this issue           
char c{i};
       ^
       static_cast<char>( )
1 error generated.


Quote from: prissi on November 10, 2018, 03:15:16 PM(I am still wondering, why the images do not need fixing, since the are unit16, but well, if it works ... )

makeobj (on a big-endian machine) must fix the images by swapping each uint16 from big to little endian. These endian() swaps happened too early, so I moved them later in the code. I removed them from pixrgb_to_pixval(), and added it to encode_image() as *dest++ = endian(uint16(pixval));. This fixes comparisons like pixval >= 0x8020, where pixval must be big-endian (on a big-endian machine).

Quote from: TurfIt on November 10, 2018, 06:39:35 PMHint: multithreaded non-mutexed access in the drawing routines.

I have not tested multiple threads, so I don't know if my patches cause problems with multiple threads. I didn't add MULTI_THREAD=1 to my builds because OpenBSD doesn't have PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP. (FreeBSD and NetBSD might have the same problem.) The only way to initialize the recursive mutex in OpenBSD is to call pthread_mutex_init() with the correct attributes. I would make a patch for this, but I haven't found where to put the pthread_mutex_init() calls.

I have not tested the performance of uint8 versus uint32 in tile_dirty. My guess is that the speed would be about the same on a processor with a wide enough data bus. The MPC7450 RISC Microprocessor Family Reference Manual covers the MPC7447A in my PowerBook G4. Table 6-7 gives 3 cycles latency, 1 cycle throughput for each of lbz (read uint8), lhz (read uint16), lwz (read uint32), stb (write uint8), sth (write uint16), stw (write uint32). There is a misalignment penalty (section 6.4.4.1) if the uint16 or uint32 crosses an 8-byte boundary, but the memory allocator would align tile_dirty to avoid the penalty.

TurfIt

The attached patch was lurking on my HDD, not sure where from/why...  I have built multithreaded sever on FreeBSD before, maybe that's why I have this; Ugly though.

dirty_tile moved from uint8 to uint32 for speed. Except that access that cast back to uint8* for a hacky workaround of the multithreaded access artifacts. The 'correct' fix would be a dirty_tile array per thread if someone is ambitious... otherwise that change needs a revert (for little endian atleast.)

prissi

Does 8 bit access really changes anything notable in timing? In profiling the dirty buffer routines never should up very much, because the copying of 16 bit images to 32 bit buffer will involve much much more overhead ...

Kernigh

Quote from: TurfIt on November 14, 2018, 04:59:10 AMThe attached patch was lurking on my HDD, not sure where from/why...  I have built multithreaded sever on FreeBSD before, maybe that's why I have this; Ugly though.

Thank you. I haven't tried the patch yet. It won't work without editing, because it uses PTHREAD_MUTEX_RECURSIVE_NP, but OpenBSD only has PTHREAD_MUTEX_RECURSIVE. If I edit  your patch, I might post it in a new thread.

Atomic operations like atomic_fetch_or might help tile_dirty with multiple threads; but <atomic> (C++11) and <stdatomic.h> (C11) don't work on plain integers, so they might be difficult to use.

TurfIt

simthread.h should already handle the non _NP version since OSX is in the same boat.

Atomic would fix the problem with marking dirty, but make processing the array difficult from what I see.
By far the easiest fix is to revert the wrong change (for little endian). Two threads accessing different uint8s within the same uint32 works already.
The artifacts from non-atomic 32bit writes are few and seldom, but this is still wrong.

prissi

I submit it with Unit8, but DrSuperGood reverted it. However, the routine did not even show up in profiling, so I think indeed uint8 is the way to go.