Started by ceeac, May 06, 2022, 09:35:29 AM
// padding to increase the size from 32 to 40 bytes for sync_step performance reasons.
Quote from: ceeac on May 06, 2022, 09:35:29 AMAs a result of the improved cache hit rate, for me sync_step is now about 5% faster for the big pak128 map and more than 25% faster for the yoshi map.
Quote from: ceeac on May 07, 2022, 06:23:28 PMi7 6700k (4 GHz), 16 GiB RAM (3200 MT/s). I used a standard Release build (i.e. CMAKE_BUILD_TYPE=Release) without LTO.
Quote from: ceeac on May 07, 2022, 07:57:25 AMI think this is because objlist_t::grow_capacity allocates 4 pointers (32 bytes total) for 2, 3 or 4 objects on a tile, so that this memory gets mixed with wolke_t allocations in the freelist if wolke_t is also 32 bytes. I think there are not many freelist allocations of 40 bytes besides wolke_t, so that memory is nicely packed together.
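The mixing ceeac describes follows from how a size-class freelist recycles memory. Below is a minimal sketch of that idea; it is an assumption about how Simutrans' freelist behaves, not its actual code, and the function names (`freelist_alloc`, `freelist_free`) are illustrative. Freed chunks are keyed only by size, so a recycled 32-byte `objlist_t` pointer array and a 32-byte `wolke_t` would interleave in the same pool:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct node_t { node_t *next; };

// One free list per allocation size. Freed chunks of the same size are
// reused regardless of which type they once held, so 32-byte objlist_t
// pointer arrays and a hypothetical 32-byte wolke_t share one pool.
static node_t *free_heads[64] = {};

void *freelist_alloc(std::size_t size) {
    std::size_t idx = size / 8;          // size classes in 8-byte steps
    if (node_t *n = free_heads[idx]) {   // reuse a freed chunk if available
        free_heads[idx] = n->next;
        return n;
    }
    return std::malloc(size);            // otherwise take fresh memory
}

void freelist_free(void *p, std::size_t size) {
    node_t *n = static_cast<node_t *>(p); // chunk is at least 8 bytes
    n->next = free_heads[size / 8];
    free_heads[size / 8] = n;
}
```

Freeing a 32-byte block and then allocating 32 bytes for a different type hands back the very same address; padding `wolke_t` to 40 bytes moves it to its own list, so clouds stay packed together instead of being scattered between pointer arrays.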
Quote from: TurfIt on May 11, 2022, 02:10:03 AMNewer computer (5950X) sees 16% with yoshi, 4% on 12301. Interestingly the 5950 outperforms the 6700 by 40+% on yoshi, but the 6700 beats it on the big map (by like 0.01%, but still...). When the data fits the big cache, new cpus can fly!
Quote from: DrSuperGood on May 11, 2022, 06:55:18 PMThis will either be due to effective memory latency or some sort of tiny regression with multi threading.
Quote from: prissi on May 12, 2022, 02:33:22 PMMy first attempt with a template using only static members and functions still increased the size of structures due to the zero-size member rule. So one has to use a static freelist_tpl member inside the actual class.
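The pattern prissi lands on can be sketched like this. It is a hedged reconstruction: `freelist_tpl` exists in Simutrans and the names `gimme_node`/`putback_node` echo its freelist, but this body is invented for illustration. Because the `freelist_tpl` member is `static`, it contributes nothing to each object's size, unlike an empty non-static member, which must occupy at least one byte:

```cpp
#include <cstddef>
#include <cstdlib>

// Illustrative per-type freelist; NOT the real Simutrans freelist_tpl.
template <typename T>
class freelist_tpl {
    struct node_t { node_t *next; };
    node_t *head = nullptr;
public:
    void *gimme_node() {
        if (head) { node_t *n = head; head = n->next; return n; }
        // every chunk must be big enough to hold the free-list link
        std::size_t sz = sizeof(T) < sizeof(node_t) ? sizeof(node_t) : sizeof(T);
        return std::malloc(sz);
    }
    void putback_node(void *p) {
        node_t *n = static_cast<node_t *>(p);
        n->next = head;
        head = n;
    }
};

class wolke_t {                          // simplified stand-in class
    static freelist_tpl<wolke_t> fl;     // static: adds no per-object size
    long dummy_payload[4];               // placeholder members
public:
    void *operator new(std::size_t) { return fl.gimme_node(); }
    void operator delete(void *p)   { fl.putback_node(p); }
};

freelist_tpl<wolke_t> wolke_t::fl;

static_assert(sizeof(wolke_t) == 4 * sizeof(long),
              "the static freelist member adds no per-object size");
```

With the class-specific `operator new`/`operator delete` routed through the static member, `new wolke_t` after a `delete` recycles the same block, which is the pooling behavior the thread is tuning.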
Quote from: prissi on May 12, 2022, 02:33:22 PMAlso, regarding TurfIt's idea of reordering the objects in the sync list by memory closeness: I think this is doomed for deterministic network games, since memory proximity may depend on the heap allocation, which is beyond the control of Simutrans on different architectures.
Quote from: Yona-TYT on May 15, 2022, 02:18:05 PMI apologize for asking a silly question. Is it possible that this finds its way to hardware acceleration? Or is it still a distant dream?
Quote from: Ters on July 05, 2022, 03:50:20 PMA bit late answer, as I haven't been reading this forum for a while:

The main obstacle for GPU rendering in Simutrans is that there is no clear separation between dynamic objects and static objects. Having the CPU prepare the draw calls and then send them to the GPU simply takes more time than the CPU just drawing everything itself and sending the result, especially when zoomed out. Changing textures is also expensive, and cramming all the images into a texture atlas that would fit on low-end hardware was difficult even for pak64 with the old limit of 64k images. Maybe the minimum maximum texture size has gone up; I haven't checked that lately either.

Running the simulation on the GPU doesn't seem like a great idea either. GPUs are made for crunching numbers, particularly doing the same operations on many numbers in parallel. Simutrans' logic contains a lot of ifs, and so isn't as suited for GPUs. (The first GPUs didn't even support ifs, I think, but that was quite some time ago now.)