The International Simutrans Forum

 

Author Topic: Pull Request: Low Level RIBI Optimizations  (Read 509 times)

0 Members and 1 Guest are viewing this topic.

Offline PJMack

  • *
  • Posts: 29
Pull Request: Low Level RIBI Optimizations
« on: April 30, 2021, 01:07:58 AM »
This pull request, already on GitHub (https://github.com/jamespetts/simutrans-extended/pull/375) contains targeted performance optimizations as described below:

Modified ribi_t to avoid Look-Up-Tables (LUTs) to hypothetically increase performance.
Also adjusted Makefile to reflect conditional compilation of some such optimizations.
Additionally, enabled Link Time Optimizations (LTO).

As each column of the ribi_t::flags Look-Up Table is 16 bits, such columns are flattened into the ribi_t functions using such flags. After compilation, this replaces the address calculation, load, and with a Load Immediate, shift, and on the x86 and x86_64 architectures. Likewise, for 64 bit machines, the ribi_t::backwards, ribi_t::doppelr, and ribi_t::dirs have their LUTs flattened to 64 bit longs, resulting in a shift, Load Immediate, shift, and.  Additionally, when compiling for a local machine, the user would have the option of use the __builtin_popcount() function which would, on architectures and compilers that support it, reducing is_single(), is_twoway(), and is_threeway() to popcount, compare.  Such modifications would reduce the number of memory accesses in the algorithms that use such functions.

An LTO option is also added to the Makefile.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10674
  • Languages: De,EN,JP
Re: Pull Request: Low Level RIBI Optimizations
« Reply #1 on: April 30, 2021, 02:33:04 AM »
Clever idea to use a 64 bit variable as lookup table. There is however an issue: "lu" is not neccessarily a 64 bit variable on a 64 bit system. The C standard (and MSVC or ARM) require "ull" for a long long which is at least 64 bit. I would recommend the INT64_C macro:
Code: [Select]
static ribi backward(ribi x) {return (CINT64_C(0x01234A628951C84F)>>(x*4))&0xF; }
and so on. That should probably even work on 32 bit systems, although I am not sure if it is faster on 32 bit systems.

Also it is strange that you use >> but *4 for the left shift ... but then modern CPU do multiplication at the same speed as shifts.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10674
  • Languages: De,EN,JP
Re: Pull Request: Low Level RIBI Optimizations
« Reply #2 on: April 30, 2021, 12:46:37 PM »
I put this slightly modified version in r9745 in standard, which also works fine in 32 bit code. Thanks for your hard work figuring out the lookup integers.

Offline PJMack

  • *
  • Posts: 29
Re: Pull Request: Low Level RIBI Optimizations
« Reply #3 on: April 30, 2021, 10:11:31 PM »
Thank You.  I updated the pull request to use INT64_C

I did, however, leave the Makefile as is since I am unsure of the performance implications of a 64 bit shift on a 32 bit processor, though I suspect it to be no better than the existing LUT.

The *4 is for code readability, it is transformed into a shift during compilation:
Code: [Select]
  1263dd:       48 bb 4f c8 51 89 62    movabs $0x1234a628951c84f,%rbx
  1263e4:       4a 23 01
  1263e7:       c1 e2 02                shl    $0x2,%edx
  1263ea:       c4 e2 ea f7 d3          sarx   %rdx,%rbx,%rdx
  1263ef:       83 e2 0f                and    $0xf,%edx



Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1434
Re: Pull Request: Low Level RIBI Optimizations
« Reply #4 on: April 30, 2021, 10:31:26 PM »
Code: [Select]
ifeq ($(shell getconf LONG_BIT),64)
LONG_BIT is 64 in both MINGW32 and MINGW64 shells...  Not a valid method to determine if the compiled executable is for 32 or 64 bit systems...



@prissi: r9745 breaks the makefile.... elif?

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10674
  • Languages: De,EN,JP
Re: Pull Request: Low Level RIBI Optimizations
« Reply #5 on: May 01, 2021, 12:18:23 PM »
This is a part which I did not touch using manual editing. Apparently the reverting was missing this one (not sure which though).

On speedup: Also in 32 bit, even if there is more caculation involved  (there are the low level MMX command available though), the data are already in the CPU chache due to being const and the commands have a higher chance being actually inlined. Being in the cache and not waiting for filling the pipeline with external data should help there too.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20805
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Pull Request: Low Level RIBI Optimizations
« Reply #6 on: May 01, 2021, 12:33:53 PM »
Excellent, thank you very much for your work on this: now incorporated. It is splendid to see new people adding something worthwhile to the Simutrans-Extended codebase. Also - welcome to the forums!

Offline PJMack

  • *
  • Posts: 29
Re: Pull Request: Low Level RIBI Optimizations
« Reply #7 on: May 24, 2021, 09:16:45 PM »
Thank You. 

I do however, now have one addendum, which is the ribi_t::nesw array.  I used a dummy struct and operator [] to turn the array into a function.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20805
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Pull Request: Low Level RIBI Optimizations
« Reply #8 on: May 29, 2021, 12:20:57 PM »
Thank You. 

I do however, now have one addendum, which is the ribi_t::nesw array.  I used a dummy struct and operator [] to turn the array into a function.

Thank you - now also incorporated.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10674
  • Languages: De,EN,JP
Re: Pull Request: Low Level RIBI Optimizations
« Reply #9 on: May 29, 2021, 01:33:33 PM »
Thank you, put it in standard as well.