GCC and inline functions

Started by ZéQuimTó, September 14, 2014, 01:13:18 AM

ZéQuimTó

Hi.
I noticed that Simutrans has several functions declared as inline, and it makes perfect sense, since some functions like decode_uint16 are called something like 100 million times during loading, and min(int, int) is also called tens of millions of times during gameplay.

However, when compiling with GCC (with -O3, so on and so forth...), GCC does not inline those functions.
Surely this happens because GCC cannot tell how many calls will be made to those functions (probably because that depends on the actual savegame being read).

Bottom line is: there are several "inline" functions that I believe are not being inlined, degrading performance when called millions of times.
I tried to force inlining [by using __attribute__((always_inline))] on a few functions, just to test whether they got inlined and whether that translated into faster code. They did get inlined, and it improved the total time spent in decode_uint16 by 20%. I believe smaller functions would see a bigger improvement.
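
For readers who have not used the attribute, here is a minimal sketch of what forcing inlining can look like. The FORCE_INLINE macro and the decode_u16_le and sum_u16 helpers are hypothetical names made up for illustration, and the little-endian layout is an assumption; this is not the actual Simutrans decode_uint16.

```cpp
#include <cstdint>

// Hypothetical wrapper macro: GCC and Clang accept the always_inline
// attribute; other compilers fall back to a plain inline hint.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// Hypothetical stand-in for a decode_uint16-style helper: reads a
// 16-bit value (assumed little-endian) and advances the read pointer.
FORCE_INLINE std::uint16_t decode_u16_le(const std::uint8_t *&p)
{
    const std::uint16_t value = static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    p += 2;
    return value;
}

// Hot loop that calls the helper once per value; with always_inline
// the call is expanded in place even when inlining is otherwise off.
std::uint32_t sum_u16(const std::uint8_t *data, int count)
{
    std::uint32_t total = 0;
    for (int i = 0; i < count; ++i) {
        total += decode_u16_le(data);
    }
    return total;
}
```

Whether this is actually faster than letting the optimizer decide has to be measured with the same flags as the release build.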

I stress that I tried to make GCC inline those functions using different optimization flags, yet, I could not find a way to get them to be inlined. Only by using the attribute.
Could you please check whether the deployed version has those functions actually inlined? Without debugging symbols, and without the makefile config used for deployment, it's hard to check that.

If not, I would suggest to only add this attribute to the most performance-critical functions (and not just force the compiler to do that on every single "inline" function). If you plan to add this attribute to inline functions, here is a list of the ones that are called more often:
I profiled Simutrans during pak/savegame loading, and during 20 minutes of gameplay (so that loading-related calls tend to be "negligible"; you could always subtract those, though).

EDIT: Forgot to mention (perhaps of little importance): I am referring to the 120.0.1 nightly.


Ters

decode_uint16 and friends are only used when loading the pak set. Not really a place to spend much effort on hunting down performance bottlenecks.

As for min and max, they seem to be inlined just fine with DEBUG=1 and OPTIMIZE=1 in config.default. When PROFILE=1 or DEBUG>=2, -fno-inline is added, overriding what defaults come from -O3 and friends. This is done for a reason.

prissi

If you profile, you had better go by time. But instrumenting the code (as said before) still changes the inlining. Suddenly functions like ding_t::get_flag() become important timewise (3.51% of total time), as do quickstone_tpl is_bound() (1.6%) and the [] operator of the array template (1.59% of time). But then these may be inlined when not profiling, or they are just called very often.

ZéQuimTó

Thanks Ters.
Before your post I did not imagine that DEBUG=3 would change optimization flags. I tried DEBUG=1 and OPTIMIZE= and confirmed that those functions were inlined. It was a false alarm :D

prissi: Yes, when I started profiling Simutrans, I was looking at time. But after looking at the assembly, I noticed the lack of inlining. Since the cost of not inlining is proportional to the number of calls, I sorted the lists that way (by number of calls).

Cheers


Ters

Quote from: ZéQuimTó on September 14, 2014, 10:02:37 PM
Thanks Ters.
Before your post I did not imagine that DEBUG=3 would change optimization flags. I tried DEBUG=1 and OPTIMIZE= and confirmed that those functions were inlined. It was a false alarm :D

It's not immediately obvious, but once you've tried to debug optimized code, you will understand why shutting off optimizations is needed for serious debugging. The chaos, the pain.