The International Simutrans Forum

 

Author Topic: multithreading lake generation  (Read 782 times)

0 Members and 1 Guest are viewing this topic.

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
multithreading lake generation
« on: June 16, 2020, 02:16:58 PM »
As part of other areas I'm looking at in the code I've been generating some fairly large maps - 10334x19498 tiles to be exact. Due to the iterative nature of the lake code this section takes a disproportionately long time on large maps - not helped by the fact that this hasn't been multithreaded. I'm not going to pretend I'm an expert on multithreaded code but here's my efforts.

I've done some analysis of the map generation times with this patch - while the lake code improves in speed dramatically I do note that the time seems to increase as thread count does for some elements of map generation (I don't think this is directly related to my patch). This could be down to inefficiencies from splitting work up, or just processor hitting thermal throttling - testing on an Core i7 10710U (6 cores, 12 threads) 32gb RAM. I've included several graphs as the scale of the decrease in lake code time masks some of the increases elsewhere.







Not sure if anyone else has any comments or observations on this multithreading effort for lake code specifically, or that of map generation in general.

If this is incorporated into Trunk it might make sense to have maximum lake height changed to be configurable in simuconf.tab - one reason it was limited to 8 when I wrote it originally was due to the length of time taken to run.


Edit:
Also see this graph for timings for a possible more representative map of 1024x1024
« Last Edit: June 16, 2020, 02:44:42 PM by kierongreen »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9992
  • Languages: De,EN,JP
Re: multithreading lake generation
« Reply #1 on: June 16, 2020, 11:50:11 PM »
Anything that decreases the inital wait is worth doing it. Nice effort.

I wonder if some code (like map smoothing) is bandwidth limited and thus with more threads a higher chance that the cached area is not matching the needed area. I mean we are talking of memory objects way larger than any cache. (Not to mention that city generation on such a large map will probably never finish with default settings ... )

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4702
  • Languages: EN, DE, AT
Re: multithreading lake generation
« Reply #2 on: June 17, 2020, 06:25:29 AM »
(If you are at it: all these map generation routines could be moved from simworld to an extra file)

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #3 on: June 17, 2020, 01:53:48 PM »
Updated code - have removed smoothing multithreading and also incorporated ground allocation into that. This removes the slowdown in smoothing code when using multiple threads and also sees increases in speed elsewhere in map generation including in multithreaded lakes. I'm assuming this is down to an issue raised years ago about fragmented ground memory.

Not sure whether want to see about testing on different systems to see about relative performance before incorporating into trunk? Have tested this on creating new maps from heightmaps and randomly, as well as extending size of existing map but bugs can always exist...



Comparing with the initial attempt above which saw these times for map generation of 10334x14498

The new code results in the following times:





Looking at this in more detail (with a different test map size 2440x3816), starting from code currently in trunk:




By combining smooth map and creating grounds we can get a small improvement as we are avoiding looping over same grounds and recreating tiles:



Then by removing the multithreading from the new smoothing code the slowdown from having more threads can be eliminated from a number of sections of code:



Finally by adding back in the lake multithreading the overall map generation time is reduced further - although it should be noted that for single threading this is marginally slower than existing trunk code, this is probably down to the extra overheads of checking for seams in the lake code:






About moving map generation routines into a separate file - I can see some logic in that due to the size of simworld.cc but what sort of file structure were you envisaging?

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9992
  • Languages: De,EN,JP
Re: multithreading lake generation
« Reply #4 on: June 17, 2020, 02:06:31 PM »
One approach on splitting could be that the map generation still implements part of karte_t but just in a seperate file. I know it is not very one object one file like, but since it is relative seperated from the code, it should not matter.

About the above diagrams, is it true that the total time for multithreaded code was long from almost all cases than the single threaded one in current trunk? Or what did "total time" mean?

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #5 on: June 17, 2020, 02:18:34 PM »
One approach on splitting could be that the map generation still implements part of karte_t but just in a seperate file. I know it is not very one object one file like, but since it is relative seperated from the code, it should not matter.
Will think about this more - it does seem like a bit of work for not much gain.

About the above diagrams, is it true that the total time for multithreaded code was long from almost all cases than the single threaded one in current trunk? Or what did "total time" mean?
Yes, it appears that current trunk code was slower the more threads were configured in simuconf... This wasn't just the multithreaded sections - it was the single threaded parts also. At least on this machine - performance will vary dependent on number of cores/threads on processor, memory etc.

I've taken total time as being from the start of karte_t::init to after transitions are calculated in karte_t::enlarge_map (generally - the code after that wasn't multithreaded to begin with - it does show some small speedups after this patch though possibly due to the memory fragmentation issue).



Edit:
The slowdown with more threads wasn't that way when I first multithreaded the code on the old computer (Atom D510 - 2 cores/4 threads) I was using as a desktop then. Have checked now on that though and map generation seems to not vary massively on that computer depending on number of threads defined in simuconf (2 threads is quicker than 1 or 4) - although CPU usage goes up with more threads the time taken doesn't decrease. Any number of other factors might have changed in meantime - other changes to simutrans or even gcc.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5642
  • Languages: EN, NO
Re: multithreading lake generation
« Reply #6 on: June 17, 2020, 05:51:15 PM »
Maybe it's just the heat, but I don't see any threading in either of the patches.

Generally speaking about threading: Multiple threads fighting over shared caches might cause degradation compared to using fewer threads. A guide to using make's ability to run multiple jobs in parallell once said to tell make to run one more job in parallell than you had cores, but it has been changed to "it depends".)

Frequent locking of shared data might be a problem. It is also possible that the data structures are not optimal. Starting and stopping threads each time is also bad.

Very basic stuff, but it's all I've got without actually seeing the problem.

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #7 on: June 17, 2020, 06:53:01 PM »
For each height in turn the patch divides the map into segments separated by seams of one tile - lakes which can be formed within each segment are handled by parallel threads and once these have finished a pass is made along each seam. Because the threads access different parts of the map race conditions are avoided. It uses the already existing xy_thread routines to handle the division of the map.


And yes, multi threading works best in general when there is more processing than reading and writing of data. In this case due to the iterative nature of the lake generation multi threading seems to benefit substantially. For some other parts - like the ground creation routines the writing of data was clearly being limited (on this system at least) by memory speed. Additionally, as was identified several years ago, because the multi threading resulted in neighbouring grounds being scattered in memory this slowed down subsequent accesses due to not being able to be easily cached.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1347
Re: multithreading lake generation
« Reply #8 on: June 17, 2020, 07:03:33 PM »
Attached are my (very ugly) graphs of trunk and your second patch timings.  This was creating a new map with the default settings (except size 2440x3816). Timed sections of karte_t::enlarge_map() only (i.e. total is the time in that routine only, nothing from init())

With trunk, sections do take longer with more threads, wiping out the gains of the multithreaded part, but the total time remains a constant 36.

Single threaded time does jump from 40 to 50 seconds, but the multithreading is much more effective with your patch - should be committed IMO.

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #9 on: June 17, 2020, 08:35:50 PM »
Thanks Ters, I’m just trying to see whether there’s any further useful tweaks, working out an explanation for why the single threaded lake generation slows down so much with patch, and thinking whether should add back in a special case to cover that. It comes down to what machines people are running Simutrans on these days - both age and core count. Always wary of improving performance for people who benefit least and disadvantaging those who are already struggling.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9992
  • Languages: De,EN,JP
Re: multithreading lake generation
« Reply #10 on: June 18, 2020, 02:15:32 AM »
I do not think there is anyon making large maps with a single threaded processor that runs windows10 (or windows7). Even the oldest computer I found that runs windows10 usually have at least hyperthreading. I have an ancient PC in the lab which I upgrade to 2GB and to run Windows10 two zears ago, but it runs at rather glacial speed. It serves the microscope camera well, but you would not want to generate large simutrans maps on it.

Here is the requirement for WIndows10:
https://www.intel.com/content/www/us/en/support/articles/000006105/processors.html

Apparently there are still single Core capable windows10. But I think these will not generate new map, or are rather used to wait. The default build is multithreaded (so these users recieve further punishment from overhead), and onlz server uses singlethreaded builds.

If the situation improves from 95% of the users (which it does right now) better keep the code and do not make it more complex trying to improve single user performance.


Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #11 on: June 18, 2020, 04:12:17 AM »
Right - this is my best effort for now, this offers further speedups across a range of thread numbers particularly for single threaded which should now be close to trunk performance for lakes, and with improvements elsewhere should be quicker overall. Also caught an error which had crept in at one point between the first and second patch - part of the code only worked by accident in the first patch... Have now gone with what was being used accidentally, and now there's a first pass on the seams to divide the map up before multithreaded takes over. This means less changes to code also. Hopefully I've not made any more errors this time...

For my smaller test map here was the performance in the 2nd version:


And now:




and for the larger map the old version:


and the new:

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9992
  • Languages: De,EN,JP
Re: multithreading lake generation
« Reply #12 on: June 18, 2020, 04:58:14 AM »
This looks very good. I would say just commit and find errors from the nightly.

(I think I will throw in the same height level many climate patch too, after some more debugging. So the next version will feature a new map generation.)

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #13 on: June 18, 2020, 09:08:16 AM »
Done in 9145 - now just to hope haven't missed (another) race condition!

Offline ceeac

  • Devotee
  • *
  • Posts: 135
Re: multithreading lake generation
« Reply #14 on: June 18, 2020, 11:28:34 AM »
Access to 'need_to_flood' is not atomic:
Code: [Select]
==================
WARNING: ThreadSanitizer: data race (pid=21603)
  Write of size 1 at 0x000001974380 by main thread:
    #0 karte_t::create_lakes_loop(short, short, short, short) simworld.cc:1400:21 (sim+0xb2be12)
    #1 karte_t::world_xy_loop_thread(void*) simworld.cc:171:4 (sim+0xb1aad6)
    #2 karte_t::world_xy_loop(void (karte_t::*)(short, short, short, short), unsigned char) simworld.cc:251:2 (sim+0xb1b6ce)
    #3 karte_t::create_lakes(int, int) simworld.cc:1454:3 (sim+0xb2cdd5)
    #4 karte_t::enlarge_map(settings_t const*, signed char const*) simworld.cc:1878:3 (sim+0xb29969)
    #5 karte_t::init(settings_t*, signed char const*) simworld.cc:1268:2 (sim+0xb266ed)
    #6 welt_gui_t::action_triggered(gui_action_creator_t*, value_t) gui/welt.cc:519:10 (sim+0x817afa)
    #7 non-virtual thunk to welt_gui_t::action_triggered(gui_action_creator_t*, value_t) gui/welt.cc (sim+0x817e47)
    #8 gui_action_creator_t::call_listeners(value_t) gui/components/gui_action_creator.h:32:11 (sim+0x672484)
    #9 button_t::infowin_event(event_t const*) gui/components/gui_button.cc:276:4 (sim+0x674a35)
    #10 gui_container_t::infowin_event(event_t const*) gui/components/gui_container.cc:200:23 (sim+0x680afb)
    #11 virtual thunk to gui_container_t::infowin_event(event_t const*) gui/components/gui_container.cc (sim+0x681216)
    #12 gui_container_t::infowin_event(event_t const*) gui/components/gui_container.cc:200:23 (sim+0x680afb)
    #13 gui_frame_t::infowin_event(event_t const*) gui/gui_frame.cc:132:34 (sim+0x71e241)
    #14 welt_gui_t::infowin_event(event_t const*) gui/welt.cc:549:15 (sim+0x817ebf)
    #15 check_pos_win(event_t*) gui/simwin.cc:1521:20 (sim+0x7ee713)
    #16 modal_dialogue(gui_frame_t*, long, karte_t*, bool (*)()) simmain.cc:252:5 (sim+0xa989ca)
    #17 simu_main(int, char**) simmain.cc:1452:4 (sim+0xa9fb07)
    #18 sysmain(int, char**) sys/simsys.cc:1098:9 (sim+0xbe6be9)
    #19 main sys/simsys_s2.cc:790:9 (sim+0xc31cfb)

  Previous write of size 1 at 0x000001974380 by thread T5:
    #0 karte_t::create_lakes_loop(short, short, short, short) simworld.cc:1400:21 (sim+0xb2be12)
    #1 karte_t::world_xy_loop_thread(void*) simworld.cc:171:4 (sim+0xb1aad6)

  Location is global 'need_to_flood' of size 1 at 0x000001974380 (sim+0x000001974380)

  Thread T5 (tid=21613, running) created by main thread at:
    #0 pthread_create <null> (sim+0x4325db)
    #1 karte_t::world_xy_loop(void (karte_t::*)(short, short, short, short), unsigned char) simworld.cc:238:9 (sim+0xb1b618)
    #2 karte_t::enlarge_map(settings_t const*, signed char const*) simworld.cc:1808:4 (sim+0xb2911b)
    #3 karte_t::init(settings_t*, signed char const*) simworld.cc:1268:2 (sim+0xb266ed)
    #4 simu_main(int, char**) simmain.cc:1346:9 (sim+0xa9f05f)
    #5 sysmain(int, char**) sys/simsys.cc:1098:9 (sim+0xbe6be9)
    #6 main sys/simsys_s2.cc:790:9 (sim+0xc31cfb)

SUMMARY: ThreadSanitizer: data race simworld.cc:1400:21 in karte_t::create_lakes_loop(short, short, short, short)
==================

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1347
Re: multithreading lake generation
« Reply #15 on: June 18, 2020, 12:44:10 PM »
Doesn't matter, the threads are each only setting the same flag. A semaphore would be more technically correct, but a write only flag (and all writing the same value) is fine.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5642
  • Languages: EN, NO
Re: multithreading lake generation
« Reply #16 on: June 20, 2020, 07:11:50 PM »
Is the flag the only mutable data? Because without an atomic operation with the right memory order, there other changed may not be visible to other threads. Although consumer x86 may be more forgiving in that regard.

Although I think it will be hard to find a single core computer these days, that does not necessarily mean that all cores are available for Simutrans. Maybe some processing power is used for playing music, although that doesn't require much, or watching a tutorial video. Or making a tutorial video.

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #17 on: June 21, 2020, 03:11:05 AM »
Threads don't care whether each other has set the flag while they are running - it's the end result that matters once they've all finished. This may be able to be handled using atomic data types but I haven't looked into how to use those properly yet.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5642
  • Languages: EN, NO
Re: multithreading lake generation
« Reply #18 on: June 28, 2020, 08:56:36 AM »
How can you know that all threads are finished if they all write the same value to the same variable?

And I believe that without atomics, it is theoretically possible that the thread waiting to see if they are finished, will see that they are finished, but not see the data they have generated, or only parts of it.

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #19 on: June 28, 2020, 12:55:22 PM »
How can you know that all threads are finished if they all write the same value to the same variable? And I believe that without atomics, it is theoretically possible that the thread waiting to see if they are finished, will see that they are finished, but not see the data they have generated, or only parts of it.
Because in world_xy_loop we have the following code:
Code: [Select]
world_thread_param[t].keep_running = t < env_t::num_threads - 1;
...
    // and start processing
    simthread_barrier_wait( &world_barrier_start );

    // the last we can run ourselves
    world_xy_loop_thread(&world_thread_param[env_t::num_threads-1]);

    simthread_barrier_wait( &world_barrier_end );

which is also used in world_xy_loop_thread as follows:
Code: [Select]
        if(param->keep_running) {
            simthread_barrier_wait( &world_barrier_start ); // wait for all to start
        }
...
        if(param->keep_running) {
            simthread_barrier_wait( &world_barrier_end ); // wait for all to finish
        }
        else {
            return NULL;
        }

All of this will ensure that the threaded lake code in create_lakes:
Code: [Select]
        memset( stage, -1, sizeof(sint8) * size_x * size_y );
        world_xy_loop(&karte_t::create_lakes_loop, 0);

        if(need_to_flood) {
            flood_to_depth(  h, stage  );
        }
        else if(!threaded_need_to_flood) {
            break;
        }
will only move on from world_xy_loop to flood_to_depth once all threads have finished in world_xy_loop. The check at if(need_to_flood), which is the only time that variable is read will therefore be after all potential writes could have occurred.

Online kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2305
Re: multithreading lake generation
« Reply #20 on: July 03, 2020, 05:02:52 PM »
Just to say that I've applied a fix to trunk - not for a race issue but for a bug in the code. Due to a combination of the behaviour of my existing code in trunk as well as the threaded lakes did not generate as intended over the thread seams:


Unfortunately the fix does result in significantly longer generation times (although still shorter than prior to patch). I've tried to mitigate this through removing unnecessary memory writes - this has actually sped up single threaded generation.

The speed of the multithreaded code will depend on the map. If there are numerous small lakes threads will still help substantially, however a map which is relatively tidy/flat/dominated by ocean there is only a marginal benefit for more than 2 threads.

For a random 1024x1024 map (seed 33, height 700, roughness 4 timings are:


For my 2440x3816 test map which is quite 'clean' in terms of lakes performance is:



While for the 10334x19498 map I'm still working on and which is hence quite rough with numerous lakes timings are: