Inline assembly in simgraph16

Started by DrSuperGood, February 20, 2017, 07:27:19 AM


DrSuperGood

This is a more public break out of some internal discussions regarding the inline assembly in simgraph16.

The conversation started when I asked why MSVC Windows release builds were not possible. The reason given was that the MSVC builds lack the inline assembly used in the graphics routines, which apparently makes the GCC builds run much faster than MSVC builds. MSVC does support inline assembly for 32-bit builds (not 64-bit builds), so I thought it would be a good idea to port it.

All I had to do was port it across, an easy task, and everything would be done. After reformatting how inline assembly works in the file, to support multiple inline assembly blocks and different implementations, it was time to add the MSVC inline assembly. So I started by brushing up my x86 assembly and reading how GCC and MSVC inline assembly work. However, why stop there? Looking at how hacky some of it is, and at the fact that MSVC intrinsic assembly does something completely different, perhaps there was a different, better solution. So I set about studying assembly optimization, how processors work and different ways to move memory efficiently.

After doing all this I have come to the following conclusion...
Modern processors have a very efficient rep movsd. It is slow for small copies and has a bad setup time, but it is by far the most scalable and future-proof way to move memory; MSVC uses it internally after performing a few checks to see whether the processor supports such copying (a rough sketch of this approach follows below). This should even include rep movsw and rep movsb, as they all end up doing the same thing: the processor breaks them down into a series of microcoded read and write operations of the most efficient size. The reason this is not being used was that it was "slow", which was the case for older processors before fast-string optimization was added.
Data movement needs to be aligned. Modern processors have no real penalty for misaligned reads, but they still do for misaligned writes. As such, if you are moving 32 bits of data, the destination needs to be 4-byte aligned for optimum performance. The GCC assembly fails to do this. This also affects rep movsd performance slightly, as the microcode is slow and must first align using smaller copies. One needs to pre- and post-align so that the bulk in the middle is aligned properly for faster copying.
The data movement approach used by display_img_nc is bad. The problem is that each movsd implicitly depends on the previous movsd to increment the two address pointers in registers, hence this pipelines extremely badly: chances are only one move can be executed per tick at best. Most copy loops use static offsets on a base address that is changed infrequently, allowing multiple read/write memory instructions to execute in parallel. Even though it looks like a more efficient, unrolled rep movsd, it is not, as the processor will convert rep movsd into a highly efficient series of memory operations, whereas it will not do the same for a long sequence of movsd instructions.
The compiler is very good at choosing memory operations. For example, in display_fb_internal the compiler already produces code similar to the inline assembly for the C++ fallback. Someone really needs to re-benchmark the GCC build on a modern processor to see if the assembly is making a difference, or possibly even lowering performance.
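
To make the rep movs point above concrete, here is a minimal sketch (not the simgraph16 code; copy_pixels and the PIXVAL typedef are made up for the example) of handing the bulk move to the processor's fast-string microcode via the MSVC __movsd intrinsic, with memcpy as the portable fallback that typically ends up in the same path:

#include <cstring>
#include <cstdint>
#include <cstddef>
#ifdef _MSC_VER
#include <intrin.h>   // __movsd
#endif

typedef uint16_t PIXVAL;   // assumption: 16-bit pixels as in simgraph16

// Copy runlen pixels from src to dst.
static void copy_pixels(PIXVAL *dst, const PIXVAL *src, std::size_t runlen)
{
#ifdef _MSC_VER
	if (runlen & 1) {
		*dst++ = *src++;              // odd leading pixel
	}
	// "rep movsd" on the remaining double words; the microcode picks the widths
	__movsd(reinterpret_cast<unsigned long *>(dst),
	        reinterpret_cast<const unsigned long *>(src), runlen >> 1);
#else
	// elsewhere, let the library/compiler pick the copy strategy
	std::memcpy(dst, src, runlen * sizeof(PIXVAL));
#endif
}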

Based on all this I am of mixed opinion as to whether the assembly code is even needed anymore, especially for some of the functions. Someone needs to benchmark the GCC build with and without assembly to see if the assembly is even making a difference. Perhaps more performance would be gained from a better-aligned C++ implementation than from the current assembly. Finally, the performance of that assembly on modern processors needs to be checked, as it is possible it is slower than the C/C++ route.

prissi

You are free to benchmark it using the -times command.

The current assembly was intensively tested on the Pentium 4 and Athlon 64. The Pentium 4 in particular has a very deep pipeline but was very bad at rep movsd; that was actually the slowest way to copy data.

And movsd is very easily pipelined, since it has no side effects. "rep movsd" pipelines badly, since most processors do not unroll the code in the cache, so after each execution it will decrement and test again.

Reading also has a penalty whenever you cross a 64-byte boundary; then a read has to wait very long. The same goes for writing. As long as nobody else uses the 64-byte block, the writing is also cached when in a pipeline.

I also tried the assembler code of various reference memcpy implementations from AMD for the Athlon as inline code, including MMX. But that code only won for really large chunks, and was horribly slow due to all the overhead of aligning and so on for a typical Simutrans image run length of less than 50.

Since the assembly makes the most difference on slow and old devices with few cores, it may still be the better code. If a fast processor is not slowed down by it, then that seems like a good compromise. Still, the -times switch probably needs some tweaking for better testing.

TurfIt

#2
Repeated the 2012 tests from the other thread. Same two computers. Compiler now gcc 6.3.0 rather than 3.7.0. And most importantly DEBUG=1 rather than DEBUG=3.
-times is quite broken - Only showing ones that actually work.
Only one run of each executable too, jitter/anomalies abound, but the basics are there.
'C' is USE_C with ALIGN_COPY disabled. Obviously crashes when -march is selected beyond the i686 default. (i386 was default with gcc3.7.0)
'C align' is regular USE_C, which also defines ALIGN_COPY due to the GCC >4.2.0 clause. This is the memcpy() path that was so horrid before.
'C while' is USE_C, ALIGN_COPY, but with the memcpy() replaced by a simple while(runlen--) loop in display_img_nc(). The ivybridge likes it, the conroe hates it.


ivybridge CPU - columns: 686 (asm, C, C align, C while) | pentium4 (asm, C align, C while) | ivybridge (asm, C align, C while)
display_img():12000000iterationstook:9253408353996715948648674776915259765075
display_flush_buffer():7000000iterationstook:92179297919892129814983110129918091629221
display_text_proportional_len_clip():8000000iterationstook:95447102731973381021773567206875774487129
display_fillbox_wh():8000000iterationstook:90382006620151201489113128391284490891245712448
view->display(true):4800iterationstook:1015586249797978210676999299449946103089997
view->display(true) and flush4000iterationstook:8589731784248329903183658731839789179044
grund_t::get_neighbour():399976698iterationstook:8738869687378741927492129278887288208883
Total:64534651856902570265676116246253908633936308861797

conroe CPU - columns: 686 (asm, C, C align, C while) | pentium4 (asm, C align, C while) | 686 DEBUG=3 (asm, C align)
display_img():12000000iterationstook:131298051877111274142907685101031622110303
display_flush_buffer():7000000iterationstook:123331219012288123171420513772138072477524743
display_text_proportional_len_clip():8000000iterationstook:129341645116414167731360415827158782494129263
display_fillbox_wh():8000000iterationstook:1608822722399023991916110159471593517786246891
view->display(true):4800iterationstook:266522280624529251432687224529268784749845712
view->display(true) and flush4000iterationstook:219161954521177214922307020930229734000439113
grund_t::get_neighbour():399976698iterationstook:198881958319789197592098120170201845979359350
Total:122940121348142870146677129132118860125758231018455375

skylake CPU - columns: 686 (asm, C, C align, C while) | pentium4 (asm, C align, C while) | ivybridge (asm, C align, C while) | skylake (asm, C align, C while with std::fill)
display_img():12000000iterationstook:6609318446034396639245433798623246624146619147714181
display_flush_buffer():7000000iterationstook:7169711071357155736273767561687268436878715171507157
display_text_proportional_len_clip():8000000iterationstook:4837523849895001502451475524480854675702472355925335
display_fillbox_wh():8000000iterationstook:6690153451606916057672977317763669310462104536623106234290
view->display(true):4800iterationstook:95989092978698069861110069785932610087102719398101599987
view->display(true) and flush4000iterationstook:8110770583218318829793028283791285938718796886048576
grund_t::get_neighbour():399976698iterationstook:6754680467656732715671957191685768086851689469066929
Total:49767544785766857465508215230049905487005292253019489485380546455


So, stay away from DEBUG=3, select -march=pentium4, and let the compiler optimizer do its job would be my conclusion (except for display_fillbox_wh(), which seems to quite like the asm).
But, this also becomes moot once you run multiple threads. Inefficient use of the CPU execution units is not important when the bottleneck moves away from them.

EDIT: added the 686 DEBUG=3 for a laugh. The 246891 is not a typo!
If the 120-2 release was indeed with asm, DEBUG=3, no wonder it barely runs...

EDIT2: added skylake results. The skylake 'C while' also has std::fill replacing the while loop in display_fb_internal() as suggested below.

jamespetts

This is very useful information. For testing, I have re-compiled Simutrans-Extended with -DUSE_C enabled and -march=pentium4, and this is the version now available on the nightly build server (it will not be in the .zip file until to-morrow morning, but the lone executable is the latest for Windows). However, this latter switch, -march=pentium4, does not work for the Linux builds, as the Linux builds are 64-bit, whereas the Pentium 4 instruction set is a 32-bit instruction set, so it gives an error when compiling. The same issue will be present for Standard.

Is there an alternative switch that is best for 64-bit Linux builds, or is it best to leave out the march=XXX command for 64-bit Linux builds?

TurfIt

The default with 64-bit builds will already be at the prescott level with SSE3 enabled, so there is not really any need to go further, unless you want to set a higher minimum requirement.
With the 32-bit builds, bringing MMX, SSE and SSE2 into the optimizer's arsenal with pentium4 usually makes sense.
In the earlier tests, gcc 3.7 didn't show any gains moving beyond pentium4 - i.e., it didn't appear to make effective use of SSE3, SSE4, etc. I've not extensively tested with 6.3, but you can see a bit with the pentium4 vs ivybridge above. Some routines better, some worse, nothing earth-shattering.
Be careful going too far if you want to remain compatible with AMD CPUs; even quite recently sold CPUs from them lack parts of the SSE4 series that Intel units had 10 years ago.

Also note: at least with the gcc 5.3.0 I used for the 120-1-2/3 releases last year, -march=pentium4 must be accompanied by -mstackrealign. My computers (I tested on like 6!) worked fine, but once released, uggh. Possibly this still exists with 6.3.0; not tested. Send your executable to Dr.SuperGood - if he still has the same computer he had last year, it truly didn't like my executable without the forced realignment.

DrSuperGood

#5
QuoteThe current assembly was intensively tested for the Pentium 4 and Athlon 64. Especially the pentium 4 has a very deep pipeline but was very bad at rep movsd. That was actually the slowed way to copy data.
The P4 had a very basic rep movsd. The designer of that instruction has openly admitted that the microcode implementation might, in retrospect, have been a bad idea. The problem is that the configuration logic is done in microcode, which has no branch prediction, so deciding how to move the data takes 20-40 cycles of overhead. That said, more modern processors like the i7 have apparently had their implementation improved, and at least for big blocks it is the fastest way to move data.

I suspect the p4 rep movsd needs to be 4 byte address aligned to perform well, as it might not automatically align the destination move.

QuoteAnd movsd is very easy pipelined, since it has no side effect.
Yes, it is pipelined. No, it is not parallel. Whereas a modern processor can execute 3-4 instructions per clock cycle, it can only execute one unrolled movsd per clock cycle, because each movsd depends on register results from the previous movsd. This is why mov with static address offsets is more efficient: the processor can execute multiple such mov instructions in parallel, as they do not depend on each other.

Or at least that is what I would imagine to be the case, and why you rarely see unrolled movsd instructions in modern code. It would require some pretty intelligent/complex logic to merge multiple consecutively executed movsd instructions into a single move operation of larger size and a single larger increment of the address registers. Technically possible, but I think it is highly unlikely to exist in modern processors due to the complexity-for-yield trade-off.
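
As a hedged illustration of that dependency argument (a plain C++ sketch, not Simutrans code): the first loop below carries its pointer updates from one iteration to the next, much like a chain of movsd instructions, while the second uses static offsets from an index so the individual loads and stores inside each block do not depend on each other:

#include <cstddef>
#include <cstdint>

// Dependent form: each iteration needs the pointer values produced by the
// previous one, so at most one element moves per cycle.
static void copy_chained(uint32_t *dst, const uint32_t *src, std::size_t n)
{
	while (n--) {
		*dst++ = *src++;
	}
}

// Indexed form: offsets are known up front and only the index advances per
// block, so the four loads/stores in the block are independent of each other.
static void copy_indexed(uint32_t *dst, const uint32_t *src, std::size_t n)
{
	std::size_t i = 0;
	for (; i + 4 <= n; i += 4) {
		dst[i + 0] = src[i + 0];
		dst[i + 1] = src[i + 1];
		dst[i + 2] = src[i + 2];
		dst[i + 3] = src[i + 3];
	}
	for (; i < n; ++i) {
		dst[i] = src[i];
	}
}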

Quote"rep movsd" has a bad pipelining, since most processor does not unroll the code in the cache. So after each execution it will decrement and test again.
Only on very old processors. Modern processors optimize the rep movsd instruction heavily. Specifically, they abstract the implementation in microcode to mean a general move rather than a repeated loop of movsd. Furthermore, this move can be implemented using large registers, as much as 512 bits on modern processors with AVX, while still being compatible with older processors and potentially using larger registers on tomorrow's processors. One can even say that "rep movsd" no longer has much to do with the movsd instruction, except that the results are the same as originally defined.

The reason for this strange optimization is that there is a need for efficient bulk memory copies. No amount of code can do such a transfer more efficiently than the processor itself, as ultimately it can be viewed as a simple memory operation, moving a block of data from one address range to another, despite technically being a complex instruction. If anything, the instruction is badly named now; it might as well be called something like "movblockb/w/d/q" rather than "rep movsb/w/d/q", as its mechanics have little to do with movs anymore other than producing the same result as if movs were used repeatedly.

The only problem with rep movs is the huge setup cost. For very small moves this is really bad, in which case one might be better off with a few movs or mov instructions, possibly in a loop.

QuoteSo, stay away from DEBUG=3, select -march=pentium4, and let the compiler optimizer do it's job would be my conclusion. (except for display_fillbox_wh() which seems to quite like the asm)
And that is exactly my conclusion. As your benchmark shows, some of the assembly is even slower than letting the compiler do it. The fact that it does not even try to align to 4-byte addresses before performing double-word moves is probably a big reason.

I would say keep and revise some of the C loop implementations, adding better alignment to them, and consider removing much of the assembly.

QuoteIn the earlier tests, gcc 3.7 didn't show any gains moving beyond pentium4 - ie, it didn't appear to make effective use of SSE3, SSE4, etc. I've not extensively tested with 6.3, but you can see a bit with the pentium4 vs ivybridge above. Some routines better, some worse, nothing earthshattering.
Be careful going too far if you want to remain compatible with AMD cpus, even quite recently sold from them are lacking stuff in the SSE4 series that the Intel units had 10 years ago.
Most non-trivial memory move operations now use "rep movsd" or "rep movsq" (x86-64). They have no need to take advantage of vector extensions any more. MSVC's 32-bit memmove quite literally outsources to rep movsd if it detects fast-string support, and only uses SSE2 if that is supported but the processor lacks fast strings. In MSVC 2015 there seems to be a problem with the memmove intrinsic: it does not inline, and one cannot prevent it from performing those tests, hence an inline assembly "rep movsd" might be faster by a few cycles.

QuoteSend your executable to Dr.SuperGood - if he still has the same computer he had last year, it truly didn't like my executable without the forced realignment.
SSE2 instructions can only work on addresses that are multiples of 16 bytes, and this includes the stack. One needs to align the stack frames and local variable locations that are involved with SSE instructions to a multiple of 16 bytes or they will fail. At least that is what I understand of stack alignment. AVX removes this requirement and has full unaligned support, however my processor is from before that time. It is possible that modern processors have back-ported this to SSE instructions so they no longer raise an exception when used misaligned.
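
For reference, a sketch of the distinction with SSE2 intrinsics (not Simutrans code): the unaligned load/store forms work on any address, while the explicitly aligned forms fault unless the address, stack included, is 16-byte aligned.

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

void copy_16_bytes(std::uint8_t *dst, const std::uint8_t *src)
{
	// movdqu: allowed on any address
	__m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src));
	_mm_storeu_si128(reinterpret_cast<__m128i *>(dst), v);

	// movdqa: faults unless the address is 16-byte aligned, which is why stack
	// frames touched by such code have to be kept 16-byte aligned as well
	// __m128i w = _mm_load_si128(reinterpret_cast<const __m128i *>(src));
	// _mm_store_si128(reinterpret_cast<__m128i *>(dst), w);
}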

Ters

display_fillbox_wh is an odd one. It looks like it should be very obvious to the compiler what is going on, and the hand-written assembly code seems very straightforward. Maybe you can try replacing the ALIGN_COPY code with std::fill(p, p + count, colval); and see if the compiler recognizes that code pattern better. However, it might be that the compiler assumes a low number of iterations, which might fit the other cases, but not this one. Maybe I can try to take a look at the generated assembly to see what is going on.
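
Roughly, the suggested replacement would look like this (a sketch only; the names p, w and colval mirror the fill loop quoted later in this thread, and fill_row is a made-up wrapper for illustration):

#include <algorithm>   // std::fill
#include <cstdint>

typedef uint16_t PIXVAL;   // assumption: 16-bit pixels

// sketch of the fill body: p points at the row start, w pixels wide
static PIXVAL *fill_row(PIXVAL *p, int w, PIXVAL colval)
{
	PIXVAL *const fillend = p + w;
	std::fill(p, fillend, colval);   // let the compiler choose the fill strategy
	return fillend;
}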

Quote from: DrSuperGood on February 21, 2017, 04:22:27 AM
SSE2 instructions can only work on addresses that are multiples of 16 bytes, including the stack.

I don't know which instructions are SSE2 and which are SSE, but SSE instructions do not only work on 16-byte-aligned memory addresses. The only ones I found in a hurry with such a requirement are movaps and movapd, but there are corresponding unaligned instructions. Since the aligned instructions are supposed to be faster, the compiler tends to pick them if it thinks alignment is guaranteed. However, GCC can/could easily be confused by casting PIXVAL * to uint32 *, which is what ALIGN_COPY is about.

DrSuperGood

Quotedisplay_fillbox_wh is an odd one. It looks like it should be very obvious to the compiler what is going on and the hand written assembly code seems very straight forward. Maybe you can try replacing the ALIGN_COPY code with std::fill(p, p + count, colval); and see if the compiler recognizes that code pattern better.
MSVC seems to have abysmal optimization for std::fill with sizes other than byte. It turns it into a per-element loop using the size of a word (16 bits); only the byte case translates to an intrinsic memset. On the other hand, the low-level C/C++ (with the set loop) translates into code very similar to the GCC assembly, using a single complex string instruction.

QuoteHowever, it might be the case that the compiler assumes a low number of iterations, which might fit the other cases, but not this one. Maybe I can try to take a look at the generated assembly to see what might be the case.
It is worth noting that GCC can technically mangle inline assembly to some extent as part of optimization. Unlike other compilers, which are object-code compilers, GCC is an assembly compiler, and hence its inline assembly is subject to full optimization and reorganization to a large extent. This contrasts with MSVC inline assembly, which reduces the optimizations performed on a function, as the assembly block has to be included as-is and cannot be changed by the compiler.

Quotedon't know which instructions are SSE2 and which as SSE, but SSE instructions do not only work on 16-byte aligned memory addresses. The only ones I found in a hurry with such a requirement, are movaps and moveapd, but there are corresponding unaligned instructions. Since the aligned instructions are supposed to be faster, the compiler tends to pick them if it thinks alignment is guaranteed. However, GCC can/could easily be confused by casting PIXVAL * to uint32 *, which is what ALIGN_COPY is about.
The problem is that if one does not align appropriately, the performance penalty might outweigh any performance advantage the instructions bring. Although this is not so much the case for memory reads anymore (modern processors support high-speed unaligned reads), it is still very much the case for memory writes. Some older implementations of the unaligned instructions apparently worked by processing up to twice as much from memory (e.g. 32 bytes instead of 16 for a 128-bit register, the 16 bytes processed being a subset of two aligned 16-byte chunks).

Ters

Quote from: DrSuperGood on February 21, 2017, 05:03:11 PM
MSVC seems to have abysmal optimization for std::fill with sizes other than byte. It turns it into a per element loop using the size of a word (16 bits), only with byte translating to an intrinsic memset. On the other hand the low level C/C++ (with the set loop) translates into code very similar to the GCC assembly, using a single complex string instruction.

That is surprising considering wchar_t on Windows is also a 16-bit value type. Sure, there are other functions meant for doing stuff to wchar_t arrays, but still.

DrSuperGood

#9
QuoteThat is surprising considering wchar_t on Windows is also a 16-bit value type. Sure, there are other functions meant for doing stuff to wchar_t arrays, but still.
Surprised me as well. It inlines correctly but does not optimize out the loop, or even try to use a larger word size in the loop. This is fine if one is initializing an array of large complex objects, which cannot sensibly be constructed in parallel within the same loop, but for a primitive integer type it should at least be trying to widen the word size involved.

It might be worth benchmarking a "rep stosw" loop against the current "rep stosd" loop. Modern processors often implicitly convert these repeat string instructions into large, optimized memory operations. As such it is possible that "rep stosw" on modern processors is faster than the "rep stosd" approach, as it cuts down the alignment time. It might also be worth benchmarking the difference between an unaligned "rep stosd" and a 4-byte-aligned "rep stosd", as in theory the 4-byte alignment should benefit all processors, especially old ones, but it does come at the cost of some extra tests.
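
For what it's worth, a word-sized version along the lines of the existing GCC block might look roughly like this (an untested sketch; fill_row_stosw and the PIXVAL typedef are made up for the example, the arguments being the destination pointer, pixel count and fill colour):

#include <cstdint>

typedef uint16_t PIXVAL;   // assumption: 16-bit pixels

static void fill_row_stosw(PIXVAL *p, unsigned int count, PIXVAL colval)
{
	asm volatile (
		"rep stosw"                 // store ax to [(e/r)di], (e/r)cx times
		: "+D" (p), "+c" (count)    // destination pointer and counter
		: "a" (colval)              // pixel value in ax
		: "cc", "memory"
	);
}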

Currently I am inclined towards there being two implementations for some of the algorithms: a high-level implementation, e.g. std::fill C++, aimed at portability and readability, and a low-level implementation, e.g. align, then copy loop, then post-align C/C++ code, aimed at extra performance on x86 computers. For the most part the GCC inline assembly could be removed, as it is less future-proof, less readable, and even seems to perform worse in several situations according to your benchmarks (assuming larger results mean slower).

Here is a rough listing for the std::fill assembly in MSVC:
--- d:\simutransbuild\simutrans\simutrans\trunk\display\simgraph16.cc ----------
  3844:
  3845:
  3846: /**
  3847:  * Draw filled rectangle
  3848:  */
  3849: static void display_fb_internal(KOORD_VAL xp, KOORD_VAL yp, KOORD_VAL w, KOORD_VAL h, PIXVAL colval, bool dirty, KOORD_VAL cL, KOORD_VAL cR, KOORD_VAL cT, KOORD_VAL cB)
  3850: {
00E1DDC0 55                   push        ebp 
00E1DDC1 8B EC                mov         ebp,esp 
00E1DDC3 83 EC 14             sub         esp,14h 
  3851: if (clip_lr(&xp, &w, cL, cR) && clip_lr(&yp, &h, cT, cB)) {
00E1DDC6 FF 75 1C             push        dword ptr [cR] 
00E1DDC9 89 4D FC             mov         dword ptr [xp],ecx 
00E1DDCC 8D 4D FC             lea         ecx,[xp] 
00E1DDCF FF 75 18             push        dword ptr [cL] 
00E1DDD2 89 55 F8             mov         dword ptr [yp],edx 
00E1DDD5 8D 55 08             lea         edx,[w] 
00E1DDD8 E8 E3 71 FF FF       call        clip_lr (0E14FC0h) 
00E1DDDD 83 C4 08             add         esp,8 
00E1DDE0 84 C0                test        al,al 
00E1DDE2 0F 84 AB 00 00 00    je          display_fb_internal+0D3h (0E1DE93h) 
00E1DDE8 FF 75 24             push        dword ptr [cB] 
00E1DDEB 8D 55 0C             lea         edx,[h] 
00E1DDEE FF 75 20             push        dword ptr [cT] 
00E1DDF1 8D 4D F8             lea         ecx,[yp] 
00E1DDF4 E8 C7 71 FF FF       call        clip_lr (0E14FC0h) 
00E1DDF9 83 C4 08             add         esp,8 
00E1DDFC 84 C0                test        al,al 
00E1DDFE 0F 84 8F 00 00 00    je          display_fb_internal+0D3h (0E1DE93h) 
  3852: PIXVAL *p = textur + xp + yp * disp_width;
00E1DE04 8B 55 F8             mov         edx,dword ptr [yp] 
00E1DE07 8B 45 FC             mov         eax,dword ptr [xp] 
00E1DE0A 53                   push        ebx 
00E1DE0B 0F BF 1D 4C AA 09 01 movsx       ebx,word ptr ds:[109AA4Ch] 
00E1DE12 0F BF CA             movsx       ecx,dx 
00E1DE15 0F AF CB             imul        ecx,ebx 
00E1DE18 98                   cwde 
00E1DE19 56                   push        esi 
00E1DE1A 57                   push        edi 
  3853: int dx = disp_width - w;
  3854:
  3855: if (dirty) {
00E1DE1B 8B 7D 0C             mov         edi,dword ptr [h] 
00E1DE1E 03 C8                add         ecx,eax 
00E1DE20 A1 E8 FA 09 01       mov         eax,dword ptr ds:[0109FAE8h] 
00E1DE25 8D 34 48             lea         esi,[eax+ecx*2] 
00E1DE28 8B 45 08             mov         eax,dword ptr [w] 
00E1DE2B 98                   cwde 
00E1DE2C 2B D8                sub         ebx,eax 
00E1DE2E 89 45 F4             mov         dword ptr [ebp-0Ch],eax 
00E1DE31 80 7D 14 00          cmp         byte ptr [dirty],0 
00E1DE35 74 1B                je          display_fb_internal+92h (0E1DE52h) 
  3856: mark_rect_dirty_nc(xp, yp, xp + w - 1, yp + h - 1);
00E1DE37 8B 4D FC             mov         ecx,dword ptr [xp] 
00E1DE3A 8D 47 FF             lea         eax,[edi-1] 
00E1DE3D 03 C2                add         eax,edx 
00E1DE3F 50                   push        eax 
00E1DE40 8B 45 08             mov         eax,dword ptr [w] 
00E1DE43 48                   dec         eax 
00E1DE44 03 C1                add         eax,ecx 
00E1DE46 50                   push        eax 
00E1DE47 E8 74 74 FF FF       call        mark_rect_dirty_nc (0E152C0h) 
00E1DE4C 8B 45 F4             mov         eax,dword ptr [ebp-0Ch] 
00E1DE4F 83 C4 08             add         esp,8 
00E1DE52 66 8B 55 10          mov         dx,word ptr [colval] 
00E1DE56 8D 0C 00             lea         ecx,[eax+eax] 
00E1DE59 8D 04 1B             lea         eax,[ebx+ebx] 
00E1DE5C 89 4D EC             mov         dword ptr [ebp-14h],ecx 
00E1DE5F 89 45 F0             mov         dword ptr [ebp-10h],eax 
  3857: }
  3858:
  3859: do {
  3860: #if !defined NO_ASM && defined __GNUC__ && defined __i386__
  3861: // GCC x86 build inline assembly
  3862: const uint32 longcolval = (colval << 16) | colval;
  3863: unsigned int count = w;
  3864: asm volatile (
  3865: // uneven words to copy?
  3866: "shrl %1\n\t"
  3867: "jnc 0f\n\t"
  3868: // set first word
  3869: "stosw\n\t"
  3870: "0:\n\t"
  3871: // now we set long words ...
  3872: "rep\n\t"
  3873: "stosl"
  3874: : "+D" (p), "+c" (count)
  3875: : "a" (longcolval)
  3876: : "cc", "memory"
  3877: );
  3878: #elif !defined NO_ASM && defined _MSC_VER && defined _M_IX86
  3879: // MSVC x86 build inline assembly
  3880: const uint32 longcolval = (colval << 16) | colval;
  3881: __asm {
  3882: mov edi, p
  3883: xor ecx, ecx
  3884: mov cx, w
  3885: mov eax, longcolval
  3886: shr cx, 1
  3887: jnc aligned
  3888: stosw
  3889: aligned:
  3890: rep stosd
  3891: mov p, edi
  3892: };
  3893: #elif !defined NO_LOW_LEVEL
  3894: // low level c++
  3895: const uint32 longcolval = (colval << 16) | colval;
  3896: KOORD_VAL count = w;
  3897: uint32 *lp;
  3898:
  3899: if(  count & 1  ) {
  3900: *p++ = colval;
  3901: }
  3902: count >>= 1;
  3903: lp = (uint32 *)p;
  3904: while(  count-- != 0  ) {
  3905: *lp++ = longcolval;
  3906: }
  3907: p = (PIXVAL *)lp;
  3908:
  3909: std::fill(p, p + count, colval);
  3910: #else
  3911: // high level c++
  3912: PIXVAL *const fillend = p + w;
00E1DE62 8D 1C 31             lea         ebx,[ecx+esi] 
00E1DE65 33 C0                xor         eax,eax 
00E1DE67 8B CB                mov         ecx,ebx 
00E1DE69 2B CE                sub         ecx,esi 
00E1DE6B 41                   inc         ecx 
00E1DE6C D1 E9                shr         ecx,1 
00E1DE6E 3B F3                cmp         esi,ebx 
00E1DE70 0F 47 C8             cmova       ecx,eax 
  3913: std::fill(p, fillend, colval);
00E1DE73 85 C9                test        ecx,ecx 
00E1DE75 74 0B                je          display_fb_internal+0C2h (0E1DE82h) 
00E1DE77 40                   inc         eax 
00E1DE78 66 89 16             mov         word ptr [esi],dx 
00E1DE7B 8D 76 02             lea         esi,[esi+2] 
00E1DE7E 3B C1                cmp         eax,ecx 
00E1DE80 75 F5                jne         display_fb_internal+0B7h (0E1DE77h) 
  3914: p = fillend;
  3915: #endif
  3916: p += dx;
00E1DE82 8B 75 F0             mov         esi,dword ptr [ebp-10h] 
  3917: } while (--h != 0);
00E1DE85 4F                   dec         edi 
00E1DE86 8B 4D EC             mov         ecx,dword ptr [ebp-14h] 
00E1DE89 03 F3                add         esi,ebx 
00E1DE8B 66 85 FF             test        di,di 
00E1DE8E 75 D2                jne         display_fb_internal+0A2h (0E1DE62h) 
00E1DE90 5F                   pop         edi 
00E1DE91 5E                   pop         esi 
00E1DE92 5B                   pop         ebx 
  3918: }
  3919: }
00E1DE93 8B E5                mov         esp,ebp 
00E1DE95 5D                   pop         ebp 
00E1DE96 C3                   ret 

As one can see, it moves colval into dx and then writes out dx in a loop.

prissi

First, since the last release I have separated the message level from the debug level. So there is no need to have debug builds just for the messages, as was required before.

Also, given the really few occurrences of fillboxes during a screen redraw and the current speedup by the compiler, let's forget about the assembler code altogether. It was a nice exercise in mid-2004 but is apparently obsolete. Since -pentium4 is incompatible with older Athlons, and -i686 even seems faster for some of them, let's stay with -i686.

And finally, the makefile still has to work for people using Simutrans on ARM machines (is there an ARM GCC compiler for Windows on ARM?) and PowerPC (like, hopefully soon, Android and Amiga).

TurfIt

Added the skylake results to the table above.
The final test with skylake and 'C while' also has std::fill() replacing the loop in the ALIGN_COPY section of display_fb_internal() as suggested. A clear win. I didn't test on the other two CPUs.
No surprise there's no change moving from ivybridge to skylake allowed instructions - not really anything added there instruction-wise, just a faster processor.
Skylake continues hating memcpy() in display_img_nc(); wonderful how replacing a standard function with a while loop helps there, while replacing the while with a std::fill helps the fillbox.

DrSuperGood

So anyone object to the assembly being removed?

I will try and whip up a patch for testing, in case there are any major performance regressions.

TurfIt

I think the time for the asm has passed, just like the time for non-P4 compatible cpus...

Anyone have a suggestion to 'fix' the C code in display_text_proportional_len_clip_rgb()? The conroe doesn't like it, and strangely the skylake is slightly objecting too, but it's almost too small a difference to matter with this quick single test run. The function name could use some clipping itself!

DrSuperGood

The following functions are the only functions which have assembly in them, as far as I can tell...
display_img_nc
display_scroll_band
display_fb_internal
display_text_proportional_len_clip_rgb

Attached is a patch that removes the assembly from them and replaces it with appropriate high-level implementations. New, aligned, low-level C implementations for display_img_nc and display_fb_internal can be turned on by defining "LOW_LEVEL". These low-level C implementations might be faster on some compilers and older processors; however, I am unsure, as the extra alignment overhead might be larger than the alignment gains.


Ters

Looking at the assembly for while (--count != 0) { *p++ = colval; } versus the assembly for std::fill(p, p + count, colval);, they look almost exactly identical when using GCC 6.3. With no -march, both seem to start out with some loop unrolling, before doing what appears to be a conditional-jump-based loop doing two pixels at a time, then finishing off with any potential odd pixel. With -march=ivybridge, it uses vector instructions.

DrSuperGood

MSVC seems to have a rather bad optimizer... Apparently it is improved in 2017, but I doubt by that much. Perhaps it needs profile data to work properly.

prissi

I think the GCC C code did better than the std::copy in the above profiling, although this is likely pakset dependent. So I would expect a change in pak192.comic ... Thus I am not sure if the C code should go into low level; it should rather be the standard one.

Furthermore, the texture pointer cannot be odd, since it is a pointer to an uint16 and standard Windows allocation routines always return memory on a 256-byte border (as long as this is not a simgraph24 or simgraph8), so removing that check would be fine and save another 5 cycles or so ;)


DrSuperGood

QuoteI think the GCC C code did better than the std::copy in the above profiling, although this is likely pakset dependent. So I would expect a change in pak192 comic ... Thus I am not sure ig the C code should go into low level but rather be the standard one.
The std::copy should be translated into the same code as the old C memmove/memcpy, the only real difference being that it implicitly handles the type size (as opposed to explicitly) and that it is C++ rather than C.

The low-level code is written to try and take advantage of basic memory behaviour. However, I am unsure how much of a difference the alignment makes over the previous approach.

QuoteFurthermore, the texture pointer cannot be odd since it is a pointer to an uint16 and standard windows allocation routines always returns memory on a 256 byte border. (as long as this is not a simgraph24 or simgraph8), so removing that check would be fine and same another 5 cycles or so
Which check? If it is that check, I might have accidentally left it in by mistake...

Although the buffers are 16-bit (2-byte) aligned, the destination pointer is not necessarily 32-bit (4-byte) aligned, which is apparently important for efficient double-word (32-bit, 4-byte) writes. As such it checks if the address pointer has its 2-bit set (i.e. it is not 4-byte aligned), and if so it performs a pre-alignment word (16-bit, 2-byte) write so that it is 4-byte aligned before the main double-word (32-bit, 4-byte) write loop. After the loop ends, any remaining word (16-bit, 2-byte) is written out.
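
The pattern being described, shown here as a simple fill for brevity (a sketch with assumed names, not the patch code):

#include <cstdint>
#include <cstddef>

typedef uint16_t PIXVAL;   // assumption: 16-bit pixels as in simgraph16

static void fill_pixels(PIXVAL *p, std::size_t count, PIXVAL colval)
{
	// pre-align: if the destination is only 2-byte aligned, write one pixel first
	if ((reinterpret_cast<std::size_t>(p) & 0x2) && count > 0) {
		*p++ = colval;
		--count;
	}
	// aligned middle: one 32-bit write per pair of pixels
	const uint32_t colval2 = (uint32_t(colval) << 16) | colval;
	uint32_t *lp = reinterpret_cast<uint32_t *>(p);
	for (std::size_t n = count >> 1; n != 0; --n) {
		*lp++ = colval2;
	}
	p = reinterpret_cast<PIXVAL *>(lp);
	// post-align: any remaining odd pixel
	if (count & 1) {
		*p = colval;
	}
}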

TurfIt

#19
The LOW_LEVEL code crashes when selecting the P4 target (or higher), same as the trunk USE_C unaligned...
Rejigged show_times() to try and get the rest producing sensible results - and fixed the image used for display_img() too. So these results are not directly comparable to the previous postings.

'align' is the trunk code with USE_C defined (and ALIGN_COPY again too).
'while' is with the patch applied, but display_image_nc() and display_fb_internal() with the while loop.
'low' is LOW_LEVEL from the patch.
'high' is the patch as-is.

686 default, pentium4, and core2 as -march= targets.
Ran only on the skylake CPU.


skylake CPU - columns: 686 (align, while, low, high) | pentium4 (align, while, high) | core2 (align, while, high)
display_img():12000000iterationstook:6541442530614674656138514587666841994445
display_color_img() with recolor:6000000iterationstook:5565503331685503515344445428585548085270
display_flush_buffer():5000000iterationstook:5084508950915094524752665273504550235042
display_text_proportional_len_clip():8000000iterationstook:5274542756056177567054075474500154035468
display_fillbox_wh():8000000iterationstook:16051160471612116819804880567532655963987244
view->display(true):2000iterationstook:4298425238214181427241734228428141444140
view->display(true) and flush2000iterationstook:4372432639154247431142274300435042294215
grund_t::get_neighbour():299956943iterationstook:5155513851705260546955085503519652385199
welt->sync_step(0,0,1):2000iterationstook:4587449740714567451844334522457044274409
welt->sync_step(200,1,0):20000iterationstook:3442345034683512351735063510344334723464
welt->step():40000iterationstook:1640162116401668177517351778161416251629
Total:62008593075513161703545415060652134525814896550525

The fillbox needs another test; perhaps std::fill only gives the good speedup with > core2 instructions... or it's down to the slightly different way it's done in the patch vs what I did last time.

DrSuperGood

QuoteThe LOW_LEVEL code crashes when selecting the P4 target (or higher), same as the trunk USE_C unaligned...
I am guessing this is what is causing the problem...

uint32 *ld = (uint32 *)p;
const uint32 *ls = (const uint32 *)sp;
...
uint32 *lp = (uint32 *)p;

It must be making a false assumption that the uint32 is aligned to 16 bytes or something like that.

I am quite surprised by the "LOW_LEVEL" performance of display_image_nc(). That is a pretty significant performance difference, nearly a 1/4 time saving. I am guessing correct write alignment must be making a pretty big difference. Most of the other implementations are probably using the rep movs approach, which is why they are all about the same time.

I am not happy with the display_fillbox_wh() performance. Something must be broken with optimizing the loop. The core loops should be compiling into rep stos instructions, as with the old assembly. Any suggestions? Possibly moving colvald out of the loop may help with the low-level implementation, as would changing the references to colval to 16-bit (word) assignments.

The problem with display_text_proportional_len_clip_rgb is that the sort of operation being done is not translating efficiently. The assembly worked well because, although it killed the pipeline and program cache, it would then perform all the sets very efficiently. The high-level implementation should be translated into the low-level implementation by the compiler, however it seems compilers are too stupid to unroll a loop of size 8. A massive 256-case statement could achieve performance similar to the old assembly while also being portable, although it would be no more readable than the assembly.
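
To make the 256-way idea concrete, a sketch (draw_glyph_row is a hypothetical helper, only a few cases are shown, and the bit ordering with MSB as the leftmost pixel is an assumption; the real thing would enumerate all 256 patterns, probably via a generator script):

#include <cstdint>

typedef uint16_t PIXVAL;   // assumption: 16-bit pixels

// Write one 8-pixel glyph row: set a pixel wherever the corresponding bit is set.
static void draw_glyph_row(PIXVAL *p, std::uint8_t bits, PIXVAL color)
{
	switch (bits) {
	case 0x00:                               break;
	case 0x01: p[7] = color;                 break;
	case 0x80: p[0] = color;                 break;
	case 0x81: p[0] = color; p[7] = color;   break;
	// ... the remaining patterns would be enumerated the same way ...
	default:
		for (int i = 0; i < 8; i++) {
			if (bits & (0x80u >> i)) {
				p[i] = color;
			}
		}
		break;
	}
}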

Ters

Quote from: DrSuperGood on February 22, 2017, 04:54:43 PM
Which check? If it is checking that I might have accidently left it in by mistake...

I suspect it's this one:

if(  reinterpret_cast<size_t>(p) & 0x1  ) {
return;
}


Quote from: TurfIt on February 23, 2017, 03:21:00 AM
The LOW_LEVEL code crashes when selecting the P4 target (or higher), same as the trunk USE_C unaligned...

And likely for the same reason. Casting uint16_t * to uint32_t * breaks alignment rules. When GCC uses vector instructions it does not insert compensation for alignment on less than 4 bytes, because uint32_t is supposed to be 4-byte aligned.

Quote from: TurfIt on February 23, 2017, 03:21:00 AM
Rejigged show_times() to try and get the rest producing sensible - and fixed the image used for display_img() too. So, these results not directly comparable to the previous postings.

'align' is the trunk code with USE_C defined (and ALIGN_COPY again too).
'while' is with the patch applied, but display_image_nc() and display_fb_internal() with the while loop.
'low' is LOW_LEVEL from the patch.
'high' is patch.

686 default, pentium4, and core2 as -march= targets.
Ran only on the skylake CPU.

Apart from display_img, and display_color_img for low, there is not much difference.

DrSuperGood

QuoteI suspect it's this one:
Oops...

I had it there to check if all such writes were aligned. I suspected they would be but one can never be sure.

QuoteAnd likely for the same reason. Casting uint16_t * to uint32_t * breaks alignment rules. When GCC uses vector instructions it does not insert compensation for alignment on less than 4 bytes, because uint32_t is supposed to be 4-byte aligned.
The destination pointers are now 4-byte aligned; this is required for maximum performance of double-word memory operations and is probably why the new low-level copy loop is so fast (at least for the pakset used to test). It must be the source pointer that is the problem, as I cannot align that: the data would no longer match up, or the destination would no longer be aligned. In theory there is an old SSE2 trick that can be used in such a case, which involves reading aligned, shifting the unaligned data into place in the XMM registers, and then writing aligned; however, I have no idea if one could tell GCC to do such a thing, and it does have a performance overhead.

Here is an updated patch to try.

prissi

I think the low-level code is fast because it only copies a small amount of data. Thus any effort to stream more data requires more initialisation up front. Maybe the std::copy is faster in pak192.

However, I still do not understand why the code crashes on the Pentium 4. A uint32 does not have to be aligned unless MMX is used, and unaligned copying is not forbidden in C. (There are packed structures in memory when loading, which crashed when we did the first PowerPC builds for Mac due to alignment mismatches, but the loops in simgraph.c were never bothered by that.)

DrSuperGood

QuoteI think the low level code is fast, because it only copies a small amount of data. Thus any effort to stream more data requires more initial initialisation. Maybe the sdt:copy is faster in pak192.
Exactly. This is the problem with "rep movs" data moves on modern processors: the instruction has a 20-40 cycle initialization time while the microcode sorts out how it should move the data, and it lacks branch prediction. If the instruction were implemented with actual logic and a state machine, it would be unbeatable, as it would take a fraction of the time to initialize. That said, the low-level code is also not that slow anymore, as it always does aligned 32-bit writes, as opposed to before, where ~50% of the time (as a guess) they were unaligned writes.

QuoteHowever, I still do no understand which the code crashes on the pentium 4. A unit32 does not have to be aligned unless MMX is used. But unaligned copying is not forbidden in C, so  (There are packed structures in memory when loading, which crashed when we did the first power PC builds for MAC due to alignment mismatches, but the loops in simgraph.c were never bothered by that).
It is a problem with the GCC compiler and pointer type casting. It always assumes that pointers to 32-bit types are aligned to 4 bytes, their optimum alignment. This assumption is not unreasonable, as manipulating 32-bit types off alignment is likely to perform terribly anyway, so it is not something one should be doing. Using this assumption, it then tries to align to 16-byte boundaries and use aligned vector instructions (as the unaligned ones are slower). The problem is that the low-level parts of the graphics code cast a pointer to a 16-bit type to a pointer to a 32-bit type, resulting in a pointer to an unaligned 32-bit type. The compiler evidently does not pick up on this, and the alignment code it generates ignores bits 0 and 1, assuming they are 0, so the result is that it tries to use aligned instructions on unaligned memory and hence an exception is generated. Even if the exception were handled automatically by the OS to perform an unaligned memory access (apparently this feature exists and is probably enabled on Linux), the resulting performance would be so abysmal that one might as well let the process crash instead.

The new low-level code aligns the destination pointer to 4 bytes, hence it should be faster in ~50% of cases over the previous implementation, which did no 4-byte pointer alignment. The problem is that it still may perform an unaligned 4-byte read, as one cannot align both destination and source pointers to 4-byte boundaries if they have a 2-byte offset between them; doing so would change the results. On modern CPUs this is not a performance problem, as an aligned write with an unaligned read is one of the fastest ways to move data, thanks to advances in cache access mechanics giving good unaligned read performance. However, GCC will still use aligned read instructions on the pointer, which is not aligned, and hence cause an exception.

If it were to implement the copy using AVX (really modern processors only) instead of SSE2, it would have no problem with read alignment, as those instructions do not have strict alignment requirements. However, doing so would limit the usable audience so much that not even I could play the game, as my i7 predates AVX.
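
For what it is worth, another common way to express the unaligned 32-bit read without tripping GCC's alignment assumptions is to route it through memcpy, which compilers normally fold into a single unaligned load (a sketch of the general technique, not what the patch currently does; read_unaligned_u32 is a made-up helper):

#include <cstring>
#include <cstdint>

static inline std::uint32_t read_unaligned_u32(const void *src)
{
	std::uint32_t v;
	std::memcpy(&v, src, sizeof v);   // folds into a single (unaligned) 32-bit load
	return v;
}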

TurfIt

The revision with patch r2 doesn't fix the crash in display_img_nc() with LOW_LEVEL.

I had a hard time replicating that initial ~4000 result for display_fillbox_wh() when using std::fill(). Finally discovered DEBUG=1 is required. Bizarre.
The results on the 20th were all with DEBUG=1. The tests I posted yesterday were actually also repeated with DEBUG=1 and no DEBUG; there was no difference, so I only posted the no-debug ones. Of course, yesterday I never went beyond the core2 instructions. It seems the AVX and AVX2 are only working with DEBUG=1.


skylake CPU - columns ('high' in each case): westmere | sandybridge | ivybridge | haswell | broadwell | skylake
display_img():12000000iterationstook:441044344451459746004602
display_color_img() with recolor:6000000iterationstook:525152905289513951385157
display_flush_buffer():5000000iterationstook:503049184912510350995121
display_text_proportional_len_clip():8000000iterationstook:536654845492546154565444
display_fillbox_wh():8000000iterationstook:752851785181426842674281
view->display(true):2000iterationstook:335133483354334233413382
view->display(true) and flush2000iterationstook:339934123411339233973498
grund_t::get_neighbour():299956943iterationstook:510050785108509751105260
welt->sync_step(0,0,1):2000iterationstook:360635893646362836623625
welt->sync_step(200,1,0):20000iterationstook:344734433429348534523445
welt->step():40000iterationstook:162416121635162516331625
Total:481124578645908451374515545440

So, std::fill() really needs AVX to get going, and improves further with AVX2. std::copy() is actually slowing down with AVX2.
I think we can agree producing .exes that need AVX2 won't fly.

Ters

The only effect I can imagine DEBUG=1 has on the code (apart for enabling asserts) is aligning the memory differently in order to make buffer overruns easier to catch. That might mean more of the time is spent doing aligned memory accesses as a side-effect.

DrSuperGood

QuoteThe revision with patch r2 doesn't fix the crash in display_img_nc() with LOW_LEVEL.
Fixing this is not exactly easy...

I have attached another patch which should have it fixed. The fix was done by combining two consecutive word accesses from the word-aligned source into a single double word, which is written to the double-word-aligned destination. If the compiler is smart, it should realise this is a single unaligned double-word read, optimize away the logic, and not try to perform aligned vector reads. MSVC is not smart and instead "optimizes" it into 2 aligned word reads... Hopefully GCC does better.

This is meant only for testing on little endian platforms. It will currently break on big endian platforms. If this solution is determined viable then a macro could be used to support both system endian types.
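
The combination being described looks roughly like this in its little-endian form (sp being the 2-byte-aligned source and ld the 4-byte-aligned destination; the names are assumptions following the earlier snippet):

// read two adjacent 16-bit pixels and emit one aligned 32-bit write
const uint32 lo = sp[0];
const uint32 hi = sp[1];
*ld++ = lo | (hi << 16);   // little endian: first pixel lands in the low half
sp += 2;
// (a 4-byte memcpy into a uint32 is another way to express the same single
//  unaligned read)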

Oh for GCC I found these flags which might be worth messing around with...
Quote
-mno-align-stringops
Do not align the destination of inlined string operations. This switch reduces code size and improves performance in case the destination is already aligned, but GCC doesn't know about it.
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
-minline-stringops-dynamically
For string operations of unknown size, use run-time checks with inline code for small blocks and a library call for large blocks.
-mstringop-strategy=alg
Override the internal decision heuristic for the particular algorithm to use for inlining string operations. The allowed values for alg are:
'rep_byte'
'rep_4byte'
'rep_8byte'
Expand using i386 rep prefix of the specified size.
'byte_loop'
'loop'
'unrolled_loop'
Expand into an inline loop.
'libcall'
Always use a library call.

-mmemcpy-strategy=strategy
Override the internal decision heuristic to decide if __builtin_memcpy should be inlined and what inline algorithm to use when the expected size of the copy operation is known. strategy is a comma-separated list of alg:max_size:dest_align triplets. alg is specified in -mstringop-strategy, max_size specifies the max byte size with which inline algorithm alg is allowed. For the last triplet, the max_size must be -1. The max_size of the triplets in the list must be specified in increasing order. The minimal byte size for alg is 0 for the first triplet and max_size + 1 of the preceding range.
-mmemset-strategy=strategy
The option is similar to -mmemcpy-strategy= except that it is to control __builtin_memset expansion.
-momit-leaf-frame-pointer
Don't keep the frame pointer in a register for leaf functions. This avoids the instructions to save, set up, and restore frame pointers and makes an extra register available in leaf functions. The option -fomit-leaf-frame-pointer removes the frame pointer for leaf functions, which might make debugging harder.
Specifically comparing things like rep_4byte with unrolls might be a good idea.

jamespetts

You might all want to have a look at this thread relating to Simutrans-Extended: a number of users with a variety of CPUs reported crashes on zooming in/out when march=pentium4 was enabled, which problem disappeared for at least one user (others have yet to comment) when this was disabled. This is with -DUSE_C also enabled (and remaining enabled in the version that does not crash for some users).

The graphics code in Simutrans-Extended is unchanged from Standard, so a similar result is likely to obtain in Standard if -march=pentium4 is defined.

An_dz

Quote from: DrSuperGood on February 24, 2017, 07:13:11 AM
If the compiler is smart it should realise this is a single unaligned double word read and optimize away the logic and not try to perform aligned vector reads. MSVC is not smart and instead "optimizes" it to 2 aligned word reads... Hopefully GCC does better.
Maybe use pragma directives to tell the compiler how to optimise and deal with the code?

Ters

Quote from: An_dz on February 24, 2017, 07:00:26 PM
Maybe use pragma directives to tell the compiler how to optimise and deal with the code?

These are not portable. And GCC prefers its own concept of function/variable/type attributes to pragmas.

prissi

I would still prefer to release with -i686 as the architecture (i.e. the default setting). With this, there were no crashes, right?

TurfIt

#32
OK, I'm getting about done with testing, but here's one more with the r3 patch. And let's throw in the 64bit too!
686 asm, and x86_64 align would represent the current trunk out of the box state.
LOW_LEVEL now works with the SSE instructions, and quite likes them.
x86_64 asm has the asm modified to work in 64-bit, except in display_text_proportional_len_clip_rgb(), which remains on the USE_C path. Too much headache to convert text_pixel.c.


ivybridge CPU - columns: 686 (asm, high, low) | pent4 (asm, high, low) | x86_64 (asm, align, high, low)
display_img():12000000iterationstook:7407639738177619650135297233436842182484
display_color_img() with recolor:6000000iterationstook:9160642544099485648737638960491247472622
display_flush_buffer():5000000iterationstook:6560655866306985699670136788678767886786
display_text_proportional_len_clip():8000000iterationstook:9427690170849910713975397009699469087002
display_fillbox_wh():8000000iterationstook:89412020920188903693611284490321274299249795
view->display(true):2000iterationstook:4066404436024272421140923787358435253246
view->display(true) and flush2000iterationstook:4153412836854345429941733900368035933314
grund_t::get_neighbour():299956943iterationstook:6247626662666793676368268923891190998926
welt->sync_step(0,0,1):2000iterationstook:4407435839544594452844024131389538563561
welt->sync_step(200,1,0):20000iterationstook:3814382337603871384338573823382538013823
welt->step():40000iterationstook:1860185218561990200220011936191319351921
Total:66042709606525268900621306003865522616105839453481
display_img_nc() - the asm is clearly outdated. LOW_LEVEL is preferred in all cases and should be the new default. The high code could be kept as a define HIGH_LEVEL option in case some future compiler update breaks the low level.
display_scroll_band() - this is actually not tested in any of the -times tests. Since it only applies to the scrolling ticker, which is quite small, performance here isn't something to particularly care about IMHO. Good enough to just drop the asm for the USE_C code.
display_text_proportional_len_clip_rgb() - asm not helping. LOW_LEVEL is slower. Suggest just keep the high level path as the only one. EDIT: Conroe is painting a different picture, asm still good, followed by low. Annoying.
display_fb_internal() - the problem child! For a 686 target, no 'C' construct has yet been discovered that approaches the speed of the asm. Unless one can be found, we should keep the asm. Once you go to P4, the high level basically matches the asm. You need AVX instructions to start beating the asm. Suggest keeping the asm (and enabling it for 64-bit), with HIGH_LEVEL as an option.

EDIT: Conroe is the oldest CPU I still have functional, even then it was retired last year after 10 years of service. It was really starting to drag in everyday tasks. Don't know how many old clunkers people are still trying to run Simutrans on...

conroe CPU - columns: 686 (asm, high, low) | pent4 (asm, high, low) | x86_64 (asm, align, high, low)
display_img():12000000iterationstook:111879112760011747963796411121811182110128230
display_color_img() with recolor:6000000iterationstook:1354192868757139909968114131341312721124819764
display_flush_buffer():5000000iterationstook:873887398675977110126984610029100281002410032
display_text_proportional_len_clip():8000000iterationstook:12751171831619513566164531590814696143061553514212
display_fillbox_wh():8000000iterationstook:15986399772377915998118661571416021152161151511782
view->display(true):2000iterationstook:105821026396891112310753110861026010548106169787
view->display(true) and flush2000iterationstook:1084910545996411397110071132310531107851089410067
grund_t::get_neighbour():299956943iterationstook:14578146441459415076149871511724502244672459124730
welt->sync_step(0,0,1):2000iterationstook:11006106761016211516111851150410665110181100510245
welt->sync_step(200,1,0):20000iterationstook:6979705869366880692669977863775979337912
welt->step():40000iterationstook:3155315831583454343334603274328032973274
Total:119352140642119507124518116342122009132473131308128903120035



Quote from: DrSuperGood on February 24, 2017, 07:13:11 AM
This is meant only for testing on little endian platforms. It will currently break on big endian platforms. If this solution is determined viable then a macro could be used to support both system endian types.

Oh for GCC I found these flags which might be worth messing around with...Specifically comparing things like rep_4byte with unrolls might be a good idea.
I think big endian support has been broken for a long while anyway - not 100% sure.

Applying attributes for target-specific optimizations to each function can work well. But you'd need to revisit their effectiveness quite often as compiler versions change. The biggest problem IMHO is the code clutter. If MSVC support is to be maintained, each attribute needs wrapping in #ifs (and IIRC their required placement doesn't mesh well with the macros currently used for this).


Quote from: jamespetts on February 24, 2017, 12:08:32 PM
The graphics code in Simutrans-Extended is unchanged from Standard, so a similar result is likely to obtain in Standard if -march=pentium4 is defined.
The 120.1.1 and 120.1.2 releases I did last year are both with -march=pentium4. Once -mstackrealign was added, I don't believe there's been widespread crashing. Also, self compiled Sim-Ex with -march=pentium4 works for me, your executable did indeed crash on zooming.


EDIT: Conroe finally finished, results added. Remind me not to run tests with the iterations set for reasonable times on much faster computers!

DrSuperGood

Quotedisplay_img_nc() - the asm is clearly outdated. LOW_LEVEL is preferred in all cases and should be the new default. The high code could be kept as a define HIGH_LEVEL option in case some future compiler update breaks the low level.
I would imagine that high level should always be the default because it is the most readable/maintainable/portable form of the code, while the low level is a preferred hack/optimization for speed. One could then make the low level the default by uncommenting the LOW_LEVEL macro declaration in the one file intended for platform compatibility macro definitions (where USE_C and ALIGN and such came from).
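
A minimal sketch of that arrangement (the macro and function names here are illustrative, not the real simgraph16 code):

#include <cstddef>
#include <cstdint>
#include <cstring>

// In the platform compatibility header, alongside USE_C and ALIGN:
//#define LOW_LEVEL    // uncomment to opt in to the tuned copy path

static void pixcopy(uint16_t *dst, const uint16_t *src, size_t count)
{
#ifdef LOW_LEVEL
    // stand-in for the tuned, alignment-aware copy
    std::memcpy(dst, src, count * sizeof(uint16_t));
#else
    // readable, portable default left entirely to the compiler
    for (size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
#endif
}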

To be honest I am amazed the low level code has given such a speed boost. I am guessing this might only be the case for pak64 and such low resolution paksets as bigger images might start to even out with the generic memcpy approaches the compiler uses. Still it is bound to run a lot faster than the asm. The x86-64 speed is quite astounding, maybe x86-64 builds of Simutrans can be faster than x86 builds with a bit more optimization.

Quotedisplay_text_proportional_len_clip_rgb() - asm not helping. LOW_LEVEL is slower. Suggest just keep the high level path as the only one. EDIT: Conroe is painting a different picture, asm still good, followed by low. Annoying.
Technically a suitable assembly-like implementation could be made by adding a massive 256-branch case statement with hard-coded writes. Compilers should produce something with similar efficiency to the assembly.
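
Roughly, the idea is something like the following sketch (not the real display_text code; it assumes bit 7 of the glyph row is the leftmost pixel, only a few of the 256 cases are spelt out, and the full table would be generated rather than written by hand):

#include <cstdint>

// Each glyph row is an 8 bit mask; giving every mask value its own case with
// hard-coded writes removes the data-dependent branches from the inner loop.
static void put_glyph_row(uint16_t *dst, uint8_t mask, uint16_t colour)
{
    switch (mask) {
        case 0x00:
            break;
        case 0x01:
            dst[7] = colour;
            break;
        case 0x80:
            dst[0] = colour;
            break;
        case 0x81:
            dst[0] = colour;
            dst[7] = colour;
            break;
        case 0xFF:
            dst[0] = colour; dst[1] = colour; dst[2] = colour; dst[3] = colour;
            dst[4] = colour; dst[5] = colour; dst[6] = colour; dst[7] = colour;
            break;
        default:
            // the remaining 251 cases follow the same pattern
            for (int i = 0; i < 8; ++i) {
                if (mask & (0x80 >> i)) {
                    dst[i] = colour;
                }
            }
            break;
    }
}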

The assembly may be faster on Conroe but it is slower on ivybridge. I am guessing it's because ivybridge has better throughput, pipelining, caching, etc., so the loop executes more efficiently than performing discontinuous jumps into a massive sparse instruction table or an unrolled set of tests.

I think your suggestion of only keeping the high path is probably the best. It might cause a performance regression on older processors but it is an optimization for newer ones. As old processors are gradually replaced by new ones the removal of the assembly is the most future proof choice. The difference in performance with low and high level code does not justify the existence of both, especially since it shares similar results to the assembly with respect to newer and older processors.

Quotedisplay_fb_internal() - the problem child! For a 686 target, no 'C' construct has yet been discovered to approach the speed of the asm. Unless one can be found, the asm should be kept. Once you go to P4, the high level basically matches the asm. You need AVX instructions to start beating the asm. Suggest keep the asm (and enable it for 64bit), with HIGH_LEVEL as an option.
GCC seems to not be optimizing it like MSVC does. In MSVC it compiles to a rep stos instruction for the main loop as in the assembly so should perform as fast as the assembly or even faster due to better alignment. As one can see rep stos cannot really be beaten in this case on modern processors due to the micro code implementation. It may even be possible that rep stos of only a word would be even faster as it allows the processor to deal with all the alignment as efficiently as possible. On older processors a SSE loop (P4) seems to be faster and on newer ones it is comparable to rep stos.
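
For what it is worth, the MSVC behaviour can also be requested explicitly through the __stosw intrinsic from <intrin.h>, which emits exactly the rep stos being discussed; this is a sketch rather than the committed code, with a plain loop as the portable fallback:

#include <cstddef>
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// 16 bit fill expressed as rep stosw on MSVC, left to the compiler elsewhere.
static void fill16(uint16_t *dst, uint16_t colour, size_t count)
{
#ifdef _MSC_VER
    __stosw(dst, colour, count);
#else
    for (size_t i = 0; i < count; ++i) {
        dst[i] = colour;
    }
#endif
}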

How viable would it be to compile a release using pent4? The high level implementation performs close enough to the assembly implementation on modern processors that the difference can be considered trivial while on old processors it appears to be a speed boost over the assembly. This would mean only the high level branch would have to be kept and performance regression is trivial at worst.

Ters

I've made some diagrams, because I find it easier to see the differences in that form. (flush_buffer is perhaps not quite as interesting, but it was in the middle of the interesting ones.)

It is interesting that selecting the target processor doesn't make much of a difference, apart from 32-bit vs. 64-bit to a small degree. The exception is fillbox, which gets quite a boost with pentium4 instructions over plain pentium, but fillbox behaves quite differently from the others in every way.

Dropping the assembly for display_img and display_color_img seems a good idea based on these measurements. For fillbox, it is still a bad idea. For text, it depends on the processor. Conroe prefers the assembly, ivybridge does not.

DrSuperGood

QuoteFor fillbox, it is still a bad idea.
This is a GCC problem though. MSVC last I checked (just updated to 2015 update 3 so might have changed) generated assembly for the low level similar to the inline assembly (rep stos) so should perform equally, if not better on older processors as I added alignment. I am guessing that GCC prioritizes against using rep stos, possibly because it thinks it is targeting an old processor model from before fast string instructions were added or as optimized as they are on modern processors. For the high level it does better than MSVC (again, not checked with update 3) because MSVC cannot optimize std::fill well for uint16_t for some reason; I think it lacks/lacked the intrinsics for non-char types, having to default to loops rather than optimizing down to a rep stos as in the assembly.

I am guessing the assembly (rep stos) performance difference with Conroe was the result of the non-aligned writes the assembly performs. The high level Pentium 4 code does aligned writes with SSE, which is why it is faster. If the assembly code were optimized to align like the low level code then it might reach performance levels on Conroe similar to the Pentium 4 high level. Modern processors executing string instructions probably implicitly align them at a micro code level, which is why the rep stos cannot be beaten on ivybridge but could be on Conroe.
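
The pre/post alignment being talked about is essentially the following (a sketch assuming a uint16_t frame buffer; strict-aliasing formalities are ignored for brevity): write one odd leading pixel and one odd trailing pixel separately so that the bulk of the fill happens as 4 byte aligned 32 bit stores.

#include <cstddef>
#include <cstdint>

static void fill16_aligned(uint16_t *dst, uint16_t colour, size_t count)
{
    // head: one 16 bit store if dst is not 4 byte aligned
    if ((reinterpret_cast<uintptr_t>(dst) & 2) && count > 0) {
        *dst++ = colour;
        --count;
    }
    // body: aligned 32 bit stores, two pixels at a time
    // (the packed value is the same on either endianness because both halves are equal)
    const uint32_t colour2 = (static_cast<uint32_t>(colour) << 16) | colour;
    uint32_t *dst32 = reinterpret_cast<uint32_t *>(dst);
    for (size_t i = 0; i < count / 2; ++i) {
        dst32[i] = colour2;
    }
    // tail: one leftover 16 bit store if the remaining count is odd
    if (count & 1) {
        dst[count - 1] = colour;
    }
}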

QuoteFor text, it depends on the processor. Conroe prefers the assembly, ivybridge does not.
Hence why I suggest removal of the assembly. Might cause regression on older processors but it is more future proof as modern processors gain a performance boost. One can generally rely on the trend that old core2 processors where it regresses will be replaced with i3/5/7 and AMD equivalents as time progresses.

kierongreen

I find it a little odd to knowingly reduce performance on older processors when the people using them will be the ones for whom performance is already more of an issue than on more modern machines.

DrSuperGood

QuoteI find it a little odd to knowingly reduce performance on older processors when the people using them will be the ones for whom performance is already more of an issue than on more modern machines.
It is also odd to knowingly reduce performance on newer processors when you know that more and more people in the future will be using them. My justification is entirely based on minimizing future development work, since naturally user metrics will shift towards the processors that it runs better on and away from the ones it runs worse on. If we prefer older processors for speed then it means that some time in the future another such topic as this will be created arguing about the code performance and it will have to be changed to what we could change it to now.

If Pentium 4 compilation is possible even the old processors will gain somewhat with speed. Just newer processors will have the biggest gains.

prissi

First, having a proper optimised build gave the most acceleration.

The above shows that the assembler has to stay only for fillboxes. Other than that there is only a very small gain for display_img with recolor and sometimes with text, both of which are operations that are less frequent compared to image display. The shading and blending have become much heavier in current releases, and are not shown here. That may become an issue on older machines. So the 5% loss on Conroe for text and recoloring is probably not really measurable in a real game.

By the way: the same argument goes for multitasking. Low powered processors have few real cores, so the overhead can (in principle) outweigh the gain quickly. (Even more so on cheap servers, which are often single-core virtual machines ...)

So I agree, we have to optimise for everything. But since the compiler now does a better job than it did a long time ago, other architectures like ARM and Power will likely also profit from clean and clear C code. Also, since Simutrans is released at least once a year, I am not worried about new processor performance now. You can always have your own build.

Ters

Quote from: DrSuperGood on February 26, 2017, 07:48:16 PM
This is a GCC problem though. MSVC last I checked (just updated to 2015 update 3 so might have changed) generated assembly for the low level similar to the inline assembly (rep stos) so should perform equally, if not better on older processors as I added alignment.

Unfortunately, that Microsoft does better is only useful for Windows users at best. I'm not even sure if that is a majority of our users anymore.

Quote from: DrSuperGood on February 27, 2017, 01:21:00 AM
It is also odd to knowingly reduce performance on newer processors when you know that more and more people in the future will be using them. My justification is entirely based on minimizing future development work, since naturally user metrics will shift towards the processors that it runs better on and away from the ones it runs worse on. If we prefer older processors for speed then it means that some time in the future another such topic as this will be created arguing about the code performance and it will have to be changed to what we could change it to now.

If Pentium 4 compilation is possible even the old processors will gain somewhat with speed. Just newer processors will have the biggest gains.

We didn't reduce performance. Intel did. And even if we optimize for new processors now, they will at some point become old processors and the discussion of switching to target newer processors will come up anyway. The frequency will be the same, it is just the offset that is different.

As far as I know, nobody has complained about worse performance on their new PC than their old, except for the SDL1 vs. SDL2 thing. That turned out to be an SDL problem, apparently, not something with Simutrans itself.

It has been a few years now since people complained that Simutrans (due to a configuration error on the build server) didn't run on a Pentium 3 running Windows 2000. However, it is possible that there is a large, loyal following of impoverished Simutrans fans that are too embarrassed, or simply don't grasp English enough, to be vocal on this board. (I actually do have a running Conroe computer, and probably will for years to come, unless it breaks down. I don't use it for Simutrans anymore, though. And it runs Linux.)

DrSuperGood

QuoteUnfortunately, that Microsoft does better is only useful for Windows users at best. I'm not even sure if that is a majority of our users anymore.
It seems GCC relies on intrinsic functions for memory operations from a standard C/C++ library (usually part of Linux?) rather than coming up with its own. For memory-related operations it outsources quite a lot to standard memory intrinsic functions, which it can then inline (though it does not always do so) and optimize. The problem seems to be that the result of all this is a huge bias towards set loops (or read/write loops) rather than string operations such as "rep stos", something the MSVC compiler does use for the low level set implementation but fails to use for the high level or for the memory copy. I am guessing GCC is failing to use rep stos for both low and high level and only does well with vector extensions because eventually the vectors are large enough to compensate for the inefficiency. The old processor rep stos implementation must not be self aligning, which is why the assembly is slower than vector instructions.

I am guessing there has been no major drive to add this optimization because the GCC developers believe bulk memory operations are not their problem (they say it is the problem of the standard library developers) and that the performance of vector extension optimized code is close enough that there is little to gain with such optimizations.

Ters

#41
Intrinsics are by definition provided by the compiler. GCC doesn't use that word, but its description for its built-in functions indicates that the compiler provides them, and library developers are encouraged to use them by providing macros redirecting standard names to the built-in names. However, the documentation seems to state that some standard library functions are routed to the built-in equivalents regardless of what the library says. This includes memcpy and memset. (I'm not sure how the compiler thinks it can implement malloc on its own. Especially when the compiler does not implement free.) Not that it appears to do so in practice. It does recognize a for-loop filling a uint8_t array as being the same as memset, but it generates a function call to memset rather than inline it. This is true for both GCC 6.3.0 on Windows and GCC 4.9.4 on Linux, so its reliance on a closed source, foreign C library on Windows does not appear to be a factor.

Edit:
This was odd. If I actually call memset explicitly, then it gets inlined. Seems like some optimization steps are run in the wrong order. It appears that for loops are replaced by memset after memset has otherwise been inlined.

The inlined memset does use rep stos, but it uses stosd (stosl in GAS), so unaligned memory access may be an issue. I haven't quite figured out how it deals with an element count that is not divisible by four, but the only part I don't understand from the brief time devoted to it is the bit before the rep-part, so it must be there.
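
The two cases described can be reproduced with something as small as this (what each form actually compiles to depends on the GCC version and flags, as noted above):

#include <cstddef>
#include <cstdint>
#include <cstring>

// The loop form: GCC recognises this as a memset, but was observed to emit
// an out-of-line call to it.
void clear_loop(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        buf[i] = 0;
    }
}

// The explicit form: this one was seen being inlined to "rep stos".
void clear_memset(uint8_t *buf, size_t n)
{
    std::memset(buf, 0, n);
}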

DrSuperGood

These optimizations have been merged into main. I did some final tidying up on them, fixing the low level performance for MSVC (which does not merge nearby memory reads but has no problem with an unaligned read) and adding robustness for different platform endians.


prissi

Your code in endiancheck is anyway assuming that the processor has a swap command to swap the low and high word. Otherwise a 16 bit shift and an or is not necessarily faster than just two copy commands. Since the only big endian platforms for now are the old PowerMac and Amiga targets, I would rather suggest using the simple copy code and leaving optimisations to the compiler.

Also, other parts of Simutrans use SIM_BIG_ENDIAN for the endianness check. This code should do so too.

DrSuperGood

QuoteYour code in endiancheck is anyway assuming that the processor has a swap command to swap the low and high word. Otherwise a 16 bit shift and an or is not necessarily faster than just two copy commands. Since the only big endian platforms for now are the old PowerMac and Amiga targets, I would rather suggest using the simple copy code and leaving optimisations to the compiler.
It should not need a swap instruction on big endian platforms. A good compiler should see that it could optimize the 2 separate aligned 16 bit reads into a single unaligned 32 bit read (if supported and faster) as it is always in the correct order. If unaligned reads are not supported or are slow then the LOW_LEVEL code will not perform well anyway and leaving the copy to the compiler is likely the fastest solution. The endian check is needed so that the two aligned 16 bit reads always equate to an unaligned 32 bit read to match the 32 bit write, because GCC does not really have an understanding of unaligned 32 bit reads. MSVC seems to understand/support unaligned 32 bit reads but also lacks a lot of memory access optimizations, so it will not merge the 2 aligned 16 bit reads into a single unaligned 32 bit read, hence it needs its own separate case.
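
Expressed as code, the idea is roughly this (a minimal sketch, not the merged code; SIM_BIG_ENDIAN stands in for whatever endian macro is agreed on, and the aliasing/alignment details of the 32 bit write are glossed over):

#include <cstdint>

// Two aligned 16 bit reads are combined so that the single 32 bit write puts
// src[0] at the lower address and src[1] at the higher address on either endianness.
static inline void copy_two_pixels(uint16_t *dst, const uint16_t *src)
{
#ifdef SIM_BIG_ENDIAN
    const uint32_t v = (static_cast<uint32_t>(src[0]) << 16)
                     |  static_cast<uint32_t>(src[1]);
#else
    const uint32_t v =  static_cast<uint32_t>(src[0])
                     | (static_cast<uint32_t>(src[1]) << 16);
#endif
    *reinterpret_cast<uint32_t *>(dst) = v;
}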

QuoteAlso, other parts of Simutrans use SIM_BIG_ENDIAN for the endianness check. This code should do so too.
However MSVC Simutrans defines LITTLE_ENDIAN for the endian check... Are you saying that macro is useless?

Dwachs

There is LITTLE_ENDIAN in simconst.h (new) and SIM_BIG_ENDIAN in sim_types.h (since ages). This should be unified somehow.
Parsley, sage, rosemary, and maggikraut.

DrSuperGood

#46
QuoteThere is LITTLE_ENDIAN in simconst.h (new) and SIM_BIG_ENDIAN in sim_types.h (since ages). This should be unified somehow.
I would recommend using the standard macros of GCC. If they are not defined for a specific compiler one can manually define them for the platform, or use one of the various standard adapter macros available online (which use compile time comparisons to detect endian). I believe most platforms are little endian, at least the ones that it is reasonable for Simutrans to target, as such assuming it as a default when no such macros are available is a good idea and should work in most cases. For example MSVC does not have them but as far as I know only targets little endian machines (x86 and arm processors).
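
A minimal sketch of that approach, using the compiler's predefined macros where present and falling back to a little endian assumption otherwise:

// sketch only, not committed code
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#  define SIM_BIG_ENDIAN
#elif defined(_MSC_VER)
   /* MSVC currently only targets little endian platforms (x86, x64, ARM) */
#else
   /* unknown compiler: assume little endian unless told otherwise */
#endif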

prissi

All Windows versions are indeed little endian, even though Windows runs on processors which could do both. But the Xbox was once a PowerPC machine, as was the PowerMac (Office for Mac), and those were supported by Windows compilers. And as soon as there is a Unix derivative, all bets are off.

I found "_BIG_ENDIAN" for the Android SDK, "__ORDER_BIG_ENDIAN__" for GCC, and "_BIG_ENDIAN" for clang. A deeper search suggested the following macro monster:

#if defined(__BYTE_ORDER) && __BYTE_ORDER == __BIG_ENDIAN || \
    defined(__BIG_ENDIAN__) || \
    defined(_BIG_ENDIAN) || \
    defined(__ARMEB__) || \
    defined(__THUMBEB__) || \
    defined(__AARCH64EB__) || \
    defined(_MIPSEB) || defined(__MIPSEB) || defined(__MIPSEB__) || \
    defined(_M_MPPC)  ||  defined(_M_PPC)
// It's a big-endian target architecture
#define SIM_BIG_ENDIAN
#elif defined(__BYTE_ORDER) && __BYTE_ORDER == __LITTLE_ENDIAN || \
    defined(__LITTLE_ENDIAN__) || \
    defined(__ARMEL__) || \
    defined(__THUMBEL__) || \
    defined(__AARCH64EL__) || \
    defined(_MIPSEL) || defined(__MIPSEL) || defined(__MIPSEL__)
// It's a little-endian target architecture
#else
#error "I don't know what architecture this is!"
#endif

DrSuperGood

I was just going by what was said on various GCC pages about macros...

Since GCC, or variants thereof, seems to be the most common way to build Simutrans, I would have thought that using these standard macros was a good idea. If not then we will have to define our own set of macros for the job and add them to the build process. If one assumes little endian as the default, only a big endian macro is needed and must be defined for such platforms. This would save the need for both a big and a little endian macro.

It is worth noting that apparently something called "middle endian" exists, which is what GCC refers to as PDP endian. If the code is truly portable this should be supported as well; however, I am unsure if Simutrans will ever encounter such a thing (similar to it encountering platforms with non-8-bit chars/bytes). For the low level C loop, PDP endian will work with the big endian branch.

Ters

If PDP refers to what I think it refers to, Simutrans will never run on those things unless Simutrans is simplified considerably.

prissi

Apart from old PDPs using 9 bits per byte, and the lack of anything beyond a basic K&R C compiler ...

More to the present: Clang is also a common compiler, and also it seems to use _BIG_ENDIAN instead (Dwachs is using it, so he could tell).

prissi

After finishing this, maybe it should go to technical documentation, before the next optimisation round circa 2020?

DrSuperGood

QuoteAfter finishing this, maybe it should go to technical documentation, before the next optimisation round circa 2020?
Possibly, as it does cover some details about assembly and optimizations that might persuade future users against using assembly.