News:

Simutrans Tools
Know our tools that can help you to create add-ons, install and customize Simutrans.

Pull Request: Low Level RIBI Optimizations

Started by PJMack, April 30, 2021, 01:07:58 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

PJMack

This pull request, already on GitHub (https://github.com/jamespetts/simutrans-extended/pull/375) contains targeted performance optimizations as described below:

Modified ribi_t to avoid Look-Up-Tables (LUTs) to hypothetically increase performance.
Also adjusted Makefile to reflect conditional compilation of some such optimizations.
Additionally, enabled Link Time Optimizations (LTO).

As each column of the ribi_t::flags Look-Up Table is 16 bits, such columns are flattened into the ribi_t functions using such flags. After compilation, this replaces the address calculation, load, and with a Load Immediate, shift, and on the x86 and x86_64 architectures. Likewise, for 64 bit machines, the ribi_t::backwards, ribi_t::doppelr, and ribi_t::dirs have their LUTs flattened to 64 bit longs, resulting in a shift, Load Immediate, shift, and.  Additionally, when compiling for a local machine, the user would have the option of use the __builtin_popcount() function which would, on architectures and compilers that support it, reducing is_single(), is_twoway(), and is_threeway() to popcount, compare.  Such modifications would reduce the number of memory accesses in the algorithms that use such functions.

An LTO option is also added to the Makefile.

prissi

Clever idea to use a 64 bit variable as lookup table. There is however an issue: "lu" is not neccessarily a 64 bit variable on a 64 bit system. The C standard (and MSVC or ARM) require "ull" for a long long which is at least 64 bit. I would recommend the INT64_C macro:

static ribi backward(ribi x) {return (CINT64_C(0x01234A628951C84F)>>(x*4))&0xF; }

and so on. That should probably even work on 32 bit systems, although I am not sure if it is faster on 32 bit systems.

Also it is strange that you use >> but *4 for the left shift ... but then modern CPU do multiplication at the same speed as shifts.

prissi

I put this slightly modified version in r9745 in standard, which also works fine in 32 bit code. Thanks for your hard work figuring out the lookup integers.

PJMack

Thank You.  I updated the pull request to use INT64_C

I did, however, leave the Makefile as is since I am unsure of the performance implications of a 64 bit shift on a 32 bit processor, though I suspect it to be no better than the existing LUT.

The *4 is for code readability, it is transformed into a shift during compilation:
  1263dd:       48 bb 4f c8 51 89 62    movabs $0x1234a628951c84f,%rbx
  1263e4:       4a 23 01
  1263e7:       c1 e2 02                shl    $0x2,%edx
  1263ea:       c4 e2 ea f7 d3          sarx   %rdx,%rbx,%rdx
  1263ef:       83 e2 0f                and    $0xf,%edx




TurfIt


ifeq ($(shell getconf LONG_BIT),64)

LONG_BIT is 64 in both MINGW32 and MINGW64 shells...  Not a valid method to determine if the compiled executable is for 32 or 64 bit systems...




@prissi: r9745 breaks the makefile.... elif?

prissi

This is a part which I did not touch using manual editing. Apparently the reverting was missing this one (not sure which though).

On speedup: Also in 32 bit, even if there is more caculation involved  (there are the low level MMX command available though), the data are already in the CPU chache due to being const and the commands have a higher chance being actually inlined. Being in the cache and not waiting for filling the pipeline with external data should help there too.

jamespetts

Excellent, thank you very much for your work on this: now incorporated. It is splendid to see new people adding something worthwhile to the Simutrans-Extended codebase. Also - welcome to the forums!
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

PJMack

Thank You. 

I do however, now have one addendum, which is the ribi_t::nesw array.  I used a dummy struct and operator [] to turn the array into a function.

jamespetts

Quote from: PJMack on May 24, 2021, 09:16:45 PM
Thank You. 

I do however, now have one addendum, which is the ribi_t::nesw array.  I used a dummy struct and operator [] to turn the array into a function.

Thank you - now also incorporated.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi