Inline assembly in simgraph16

Started by DrSuperGood, February 20, 2017, 07:27:19 AM


DrSuperGood

QuoteFor fillbox, it is still a bad idea.
This is a GCC problem though. MSVC, last I checked (I just updated to 2015 Update 3, so this might have changed), generated assembly for the low level similar to the inline assembly (rep stos), so it should perform equally, if not better on older processors since I added alignment. I am guessing that GCC prioritizes against using rep stos, possibly because it thinks it is targeting an old processor model from before fast string instructions were added or became as optimized as they are on modern processors. For the high level it does better than MSVC (again, not checked with Update 3) because MSVC cannot optimize std::fill well for uint16_t for some reason; I think it lacks/lacked the intrinsics for non-char types and has to fall back to loops rather than optimizing down to a rep stos as in the assembly.

I am guessing the assembly (rep stos) performance difference on Conroe was the result of the non-aligned writes the assembly performs. The high level Pentium 4 code does aligned writes with SSE, which is why it is faster. If the assembly code were optimized to align like the low level code, it might reach performance levels on Conroe similar to the Pentium 4 high level. Modern processors executing string instructions probably align them implicitly at the microcode level, which is why the rep stos cannot be beaten on Ivy Bridge but could be on Conroe.
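
As a rough illustration only (not the actual simgraph16 code, and the function names are made up for this post), these are the two kinds of fill being compared: a plain 16 bit store loop that a compiler may or may not turn into rep stos, and the std::fill based high level path that gets vectorised with SSE when targeting Pentium 4 or later.

#include <algorithm>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the "low level" fillbox inner loop: whether this
// becomes rep stosw/stosd, SSE stores or a plain loop depends entirely on the
// compiler and the -march target.
static void fill_low_level(uint16_t *dest, uint16_t colour, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		dest[i] = colour;
	}
}

// Hypothetical stand-in for the "high level" path built on the standard library.
static void fill_high_level(uint16_t *dest, uint16_t colour, size_t count)
{
	std::fill(dest, dest + count, colour);
}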

QuoteFor text, it depends on the processor. Conroe prefers the assembly, ivybridge does not.
Hence why I suggest removal of the assembly. It might cause a regression on older processors, but it is more future proof, as modern processors gain a performance boost. One can generally rely on the trend that the old Core 2 processors where it regresses will be replaced with i3/i5/i7 and AMD equivalents as time progresses.

kierongreen

I find it a little odd to knowingly reduce performance on older processors when the people using them will be the ones for whom performance is already more of an issue than on more modern machines.

DrSuperGood

QuoteI find it a little odd to knowingly reduce performance on older processors when the people using them will be the ones for whom performance is already more of an issue than on more modern machines.
It is also odd to knowingly reduce performance on newer processors when you know that more and more people will be using them in the future. My justification is entirely based on minimizing future development work, since user metrics will naturally shift towards the processors the game runs better on and away from the ones it runs worse on. If we prefer older processors for speed, then at some point in the future another topic like this one will be created arguing about the code performance, and it will have to be changed to what we could change it to now.

If compiling for Pentium 4 is possible, even the old processors will gain somewhat in speed. It is just that newer processors will have the biggest gains.

prissi

First, having a properly optimised build gave the most acceleration.

The above shows that the assembler has to stay only for fillboxes. Other than that, there is only a very little gain for display_img with recoloring and sometimes with text, both of which are operations that are less frequent compared to plain image display. The shading and blending have become much heavier in current releases and are not shown here; that may become an issue on older machines. So the 5% loss on Conroe for text and recoloring is probably not really measurable in a real game.

By the way, the same argument goes for multitasking. Low powered processors have few real cores, so the overhead can (in principle) quickly outweigh the gain. (Even more so on cheap servers, which are often single-core virtual machines ...)

So I agree, we have to optimise for everything. But since the compiler now does a better job than it did a long time ago, other architectures like ARM and Power will likely also profit from clean and clear C code. Also, since Simutrans is released at least once a year, I am not worried about new processor performance right now. You can always have your own build.

Ters

Quote from: DrSuperGood on February 26, 2017, 07:48:16 PM
This is a GCC problem though. MSVC, last I checked (I just updated to 2015 Update 3, so this might have changed), generated assembly for the low level similar to the inline assembly (rep stos), so it should perform equally, if not better on older processors since I added alignment.

Unfortunately, the fact that Microsoft does better is only useful for Windows users at best. I'm not even sure that is a majority of our users anymore.

Quote from: DrSuperGood on February 27, 2017, 01:21:00 AM
It is also odd to knowingly reduce performance on newer processors when you know that more and more people will be using them in the future. My justification is entirely based on minimizing future development work, since user metrics will naturally shift towards the processors the game runs better on and away from the ones it runs worse on. If we prefer older processors for speed, then at some point in the future another topic like this one will be created arguing about the code performance, and it will have to be changed to what we could change it to now.

If compiling for Pentium 4 is possible, even the old processors will gain somewhat in speed. It is just that newer processors will have the biggest gains.

We didn't reduce performance. Intel did. And even if we optimize for new processors now, they will at some point become old processors and the discussion of switching to target newer processors will come up anyway. The frequency will be the same, it is just the offset that is different.

As far as I know, nobody has complained about worse performance on their new PC than their old, except for the SDL1 vs. SDL2 thing. That turned out to be an SDL problem, apparently, not something with Simutrans itself.

It has been a few years now since people complained that Simutrans (due to a configuration error on the build server) didn't run on a Pentium 3 running Windows 2000. However, it is possible that there is a large, loyal following of impoverished Simutrans fans who are too embarrassed, or simply don't grasp English well enough, to be vocal on this board. (I actually do have a running Conroe computer, and probably will for years to come, unless it breaks down. I don't use it for Simutrans anymore, though. And it runs Linux.)

DrSuperGood

QuoteUnfortunately, the fact that Microsoft does better is only useful for Windows users at best. I'm not even sure that is a majority of our users anymore.
It seems GCC relies on intrinsic functions for memory operations from a standard C/C++ library (usually part of Linux?) rather than coming up with its own. For memory related operations it outsources quite a lot to standard memory intrinsic functions, which it can then inline (though it does not always) and optimize. The problem seems to be that the result of all this is a huge bias towards set loops (or read/write loops) rather than string operations such as "rep stos", something the MSVC compiler does use for the low level set implementation but fails to use for the high level or for memory copies. I am guessing GCC fails to use rep stos for both the low and the high level and only does well with vector extensions because eventually the vectors are large enough to compensate for the inefficiency. The old processors' rep stos implementation must not be self aligning, which is why the assembly is slower than vector instructions there.

I am guessing there has been no major drive to add this optimization because the GCC developers believe bulk memory operations are not their problem (they say it is the problem of the standard library developers) and that the performance of vector extension optimized code is close enough that there is little to gain with such optimizations.
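
For what it is worth, MSVC also exposes the string store directly as a compiler intrinsic, so a rep stosw can be requested without any inline assembly. A minimal sketch, MSVC only (the function name is made up; any alignment handling would still have to be added by hand around it):

#ifdef _MSC_VER
#include <stddef.h>
#include <intrin.h>

// Emits "rep stosw" via the __stosw intrinsic instead of inline assembly.
static void fill_rep_stosw(unsigned short *dest, unsigned short colour, size_t count)
{
	__stosw(dest, colour, count);
}
#endif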

Ters

Intrinsics are by definition provided by the compiler. GCC doesn't use that word, but its description of its built-in functions indicates that the compiler provides them, and library developers are encouraged to use them by providing macros that redirect the standard names to the built-in names. However, the documentation seems to state that some standard library functions are routed to the built-in equivalents regardless of what the library says. This includes memcpy and memset. (I'm not sure how the compiler thinks it can implement malloc on its own, especially when it does not implement free.) Not that it appears to do so in practice: it does recognize a for loop filling a uint8_t array as being the same as memset, but it generates a function call to memset rather than inlining it. This is true for both GCC 6.3.0 on Windows and GCC 4.9.4 on Linux, so its reliance on a closed source, foreign C library on Windows does not appear to be a factor.

Edit:
This was odd. If I actually call memset explicitly, then it gets inlined. It seems some optimization steps are run in the wrong order: for loops appear to be replaced by memset only after memset calls have otherwise been inlined.

The inlined memset does use rep stos, but it uses stosd (stosl in GAS), so unaligned memory access may be an issue. I haven't quite figured out how it deals with an element count that is not divisible by four, but the only part I don't understand from the brief time devoted to it is the bit before the rep-part, so it must be there.
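
For reference, a minimal reconstruction of the experiment described above (the buffer type and function names are my own): with the GCC versions mentioned, the loop version ends up as a call to memset(), while the explicit call is expanded inline as rep stosd plus a small prologue.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Recognised as a memset, but emitted as a call to the library function.
static void zero_with_loop(uint8_t *dest, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		dest[i] = 0;
	}
}

// Expanded inline by GCC (rep stosd with some fix-up code before it).
static void zero_with_memset(uint8_t *dest, size_t count)
{
	memset(dest, 0, count);
}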

DrSuperGood

These optimizations have been merged into main. I did some final tidying up on them, fixing the low level performance for MSVC (which does not merge nearby memory reads but has no problem with an unaligned read) and adding robustness for different platform endianness.


prissi

Your code in the endian check is anyway assuming that the processor has a swap command to exchange the low and high word. Otherwise a 16 bit shift and an or is not necessarily faster than just two copy commands. Since the only big endian platforms for now are the old PowerMac and Amiga targets, I would rather just use the simple copy code and leave the optimisation to the compiler.

Also, other parts of Simutrans use SIM_BIG_ENDIAN for the endianness check. This code should then do so too.

DrSuperGood

QuoteYour code in the endian check is anyway assuming that the processor has a swap command to exchange the low and high word. Otherwise a 16 bit shift and an or is not necessarily faster than just two copy commands. Since the only big endian platforms for now are the old PowerMac and Amiga targets, I would rather just use the simple copy code and leave the optimisation to the compiler.
It should not need a swap instruction on big endian platforms. A good compiler should see that it can optimize the two separate aligned 16 bit reads into a single unaligned 32 bit read (if that is supported and faster), since they are always in the correct order. If unaligned reads are not supported or are slow, then the LOW_LEVEL code will not perform well anyway, and leaving the copy to the compiler is likely the fastest solution. The endian check is needed so that the two aligned 16 bit reads always equate to an unaligned 32 bit read that matches the 32 bit write, because GCC does not really have an understanding of unaligned 32 bit reads. MSVC seems to understand/support unaligned 32 bit reads but also lacks a lot of memory access optimizations, so it will not merge the two aligned 16 bit reads into a single unaligned 32 bit read, hence it needs its own separate case.
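
As an illustration of what I mean, a minimal sketch (not the committed code) of packing two adjacent 16 bit pixels into one 32 bit word so that a single 32 bit store can be used; the shift order has to follow the target byte order so the packed word matches what an unaligned 32 bit load of the two pixels would give:

#include <stdint.h>

static inline uint32_t pack_two_pixels(const uint16_t *src)
{
#ifdef SIM_BIG_ENDIAN
	// big endian: the first pixel occupies the high half of the 32 bit word
	return (uint32_t(src[0]) << 16) | uint32_t(src[1]);
#else
	// little endian: the first pixel occupies the low half
	return (uint32_t(src[1]) << 16) | uint32_t(src[0]);
#endif
}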

QuoteAlso, other parts of Simutrans use SIM_BIG_ENDIAN for the endianness check. This code should then do so too.
However, the MSVC Simutrans build defines LITTLE_ENDIAN for its endian check... Are you saying that macro is useless?

Dwachs

There is LITTLE_ENDIAN in simconst.h (new) and SIM_BIG_ENDIAN in sim_types.h (there for ages). This should be unified somehow.
Parsley, sage, rosemary, and maggikraut.

DrSuperGood

QuoteThere is LITTLE_ENDIAN in simconst.h (new) and SIM_BIG_ENDIAN in sim_types.h (there for ages). This should be unified somehow.
I would recommend using the standard macros of GCC. If they are not defined for a specific compiler, one can manually define them for the platform, or use one of the various standard adapter macros available online (which use compile time comparisons to detect endianness). I believe most platforms are little endian, at least the ones that it is reasonable for Simutrans to target, so assuming it as the default when no such macros are available is a good idea and should work in most cases. For example, MSVC does not have them but as far as I know only targets little endian machines (x86 and ARM processors).
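
A minimal sketch of that approach, using the __BYTE_ORDER__ macros that GCC and clang predefine and falling back to little endian when they are absent (the macro name SIM_BIG_ENDIAN is taken from the existing code; the rest is just an illustration):

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
// GCC/clang tell us the target is big endian
#define SIM_BIG_ENDIAN
#elif defined(__BYTE_ORDER__) && (__BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__)
#error "Unsupported byte order (PDP endian?)"
#else
// either little endian was detected or no macro is available (e.g. MSVC,
// which only targets little endian machines): assume little endian
#endif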

prissi

All Windows versions are indeed little endian, even though Windows runs on processors which could do both. But the Xbox was once a PowerPC machine, as was the PowerMac (Office for Mac), and both were supported by Windows compilers. And as soon as a Unix derivative is involved, all bets are off.

I found "_BIG_ENDIAN" for the andriod SDK, for GCC "__ORDER_BIG_ENDIAN__", for clang "_BIG_ENDIAN" A deeper search suggested the following macro monsters:

#if defined(__BYTE_ORDER) && __BYTE_ORDER == __BIG_ENDIAN || \
    defined(__BIG_ENDIAN__) || \
    defined(_BIG_ENDIAN) || \
    defined(__ARMEB__) || \
    defined(__THUMBEB__) || \
    defined(__AARCH64EB__) || \
    defined(_MIPSEB) || defined(__MIPSEB) || defined(__MIPSEB__) || \
    defined(_M_MPPC) || defined(_M_PPC)
// It's a big-endian target architecture
#define SIM_BIG_ENDIAN
#elif defined(__BYTE_ORDER) && __BYTE_ORDER == __LITTLE_ENDIAN || \
    defined(__LITTLE_ENDIAN__) || \
    defined(__ARMEL__) || \
    defined(__THUMBEL__) || \
    defined(__AARCH64EL__) || \
    defined(_MIPSEL) || defined(__MIPSEL) || defined(__MIPSEL__)
// It's a little-endian target architecture
#else
#error "I don't know what architecture this is!"
#endif

DrSuperGood

I was just going by what was said on various GCC pages about macros...

Since GCC, or variants thereof, seems to be the most common way to build Simutrans, I would have thought that using these standard macros was a good idea. If not, then we will have to define our own set of macros for the job and add them to the build process. If one assumes little endian as the default, only a big endian macro is needed, and it only has to be defined for such platforms. This saves the need for both a big and a little endian macro.

It is worth noting that apparently something called "middle endian" exists, which is what GCC refers to as PDP endian. If the code is to be truly portable this should be supported as well; however, I am unsure if Simutrans will ever encounter such a thing (similar to it encountering platforms with non 8 bit chars/bytes). For the low level C loop, PDP endian will work with the big endian branch.

Ters

If PDP refers to what I think it refers to, Simutrans will never run on those things unless Simutrans is simplified considerably.

prissi

Apart from old PDPs using 9 bits per byte, and the lack of anything beyond a basic K&R C compiler ...

More to the present: Clang is also a common compiler, and it also seems to use _BIG_ENDIAN instead (Dwachs is using it, so he could tell).

prissi

After finishing this, maybe it should go to technical documentation, before the next optimisation round circa 2020?

DrSuperGood

QuoteAfter finishing this, maybe it should go to technical documentation, before the next optimisation round circa 2020?
Possibly, as it does cover some details about assembly and optimizations that might persuade future users against using assembly.