The International Simutrans Forum

Simutrans Extended => Simutrans-Extended development => Simutrans-Extended bug reports => Topic started by: freddyhayward on September 16, 2020, 10:23:56 PM

Title: Losses of synchronisation with mismatching convoy debug sums
Post by: freddyhayward on September 16, 2020, 10:23:56 PM
In my record of desyncs currently from 12/9/20-17/9/20, nearly ~60% (currently 17) have had the convoy debug sums mismatching (and all other checklist records matching perfectly).
The first (debug_sum[0]) is updated by adding the mantissa of some floating point value when each convoy is processed.
The second (debug_sum[1]) is the same but multiplied with the internal ID of the convoy. In all cases, the first and the second were mismatched together, but I wasn't able to discern the internal ID of the offending convoy(s) in any case.

https://docs.google.com/spreadsheets/d/1Jwa-5G6aXfkrCrPYLQvDbYt7LqYMgYapW8hyOF9IB3s/edit#gid=0
Title: Re: Losses of synchronisation with mismatching convoy debug sums
Post by: freddyhayward on September 27, 2020, 11:15:29 AM
As of now, there have been 35 recorded desyncs (of the total 70) with debug_sum[0] mismatching. What I have found is that these sums overflow so frequently between each check that it is almost always impossible to determine the unsynchronised convoy, assuming there is only one. I've made a change that significantly reduces the size of these sums so that they should be easier to investigate, here: https://github.com/jamespetts/simutrans-extended/pull/285
Title: Re: Losses of synchronisation with mismatching convoy debug sums
Post by: jamespetts on September 27, 2020, 12:03:14 PM
Thank you for your work on this: I have now incorporated your debugging change.

There are such a wide range of things that might cause a code divergence first to manifest itself in a difference in convoy movement that any additional information or precision is likely to be useful.
Title: Re: Losses of synchronisation with mismatching convoy debug sums
Post by: freddyhayward on October 06, 2020, 07:47:53 AM
I have pushed another change (https://github.com/jamespetts/simutrans-extended/pull/287) that adds two new convoy debug sums. The first is the sum of state indexes (e.g. reversing, loading, waiting for clearance) while the second is the sum of these indexes multiplied by the convoy ID. The purpose of this is to discern whether the mismatch results from differing states or small variations of movement within the same state.
Title: Re: Losses of synchronisation with mismatching convoy debug sums
Post by: ceeac on October 08, 2020, 06:04:54 AM
I think I found the issue (or at least a closely related one):

==================
WARNING: ThreadSanitizer: data race (pid=2827)
  Write of size 4 at 0x7b9c021439f0 by main thread:
    #0 convoi_t::warten_bis_weg_frei(int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:2826:11 (simutrans-extended+0xad829b)
    #1 vehicle_t::hop_check() /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:1618:9 (simutrans-extended+0xcb4e10)
    #2 vehicle_base_t::do_drive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:373:45 (simutrans-extended+0xcaa271)
    #3 road_vehicle_t::do_drive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:4419:46 (simutrans-extended+0xccbffb)
    #4 convoi_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1333:30 (simutrans-extended+0xac37e4)
    #5 karte_t::sync_list_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4818:14 (simutrans-extended+0xc48156)
    #6 karte_t::sync_step(unsigned int, bool, bool) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4890:8 (simutrans-extended+0xc485cc)
    #7 karte_t::interactive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:11001:6 (simutrans-extended+0xc6abef)
    #8 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1396:9 (simutrans-extended+0xb8d325)
    #9 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #10 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

  Previous read of size 4 at 0x7b9c021439f0 by thread T35:
    #0 convoi_t::threaded_step() /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1740:6 (simutrans-extended+0xac75d9)
    #1 step_individual_convoy_threaded(void*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:1946:10 (simutrans-extended+0xc3576f)

  Location is heap block of size 13904 at 0x7b9c02143800 allocated by main thread:
    #0 operator new(unsigned long) <null> (simutrans-extended+0x4caeeb)
    #1 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9468:19 (simutrans-extended+0xc60d25)
    #2 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:8823:3 (simutrans-extended+0xc5c2a7)
    #3 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1262:32 (simutrans-extended+0xb8c5e6)
    #4 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #5 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

  Thread T35 (tid=2917, running) created by main thread at:
    #0 pthread_create <null> (simutrans-extended+0x43dfdb)
    #1 karte_t::init_threads() /home/ceeac/Projects/code/simutrans-ex/simworld.cc:2226:8 (simutrans-extended+0xc2d9df)
    #2 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9815:2 (simutrans-extended+0xc639bb)
    #3 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:8823:3 (simutrans-extended+0xc5c2a7)
    #4 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1262:32 (simutrans-extended+0xb8c5e6)
    #5 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #6 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

SUMMARY: ThreadSanitizer: data race /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:2826:11 in convoi_t::warten_bis_weg_frei(int)
==================


The problem is that the state of a convoy is modified non-atomically by the main thead in sync_step(), while a different thread reads the state in step().
To reproduce the issue, compile the non-graphical server with clang and -fsanitize=thread on linux, then load demo.sve from pak128.britain-ex.
Title: Re: Losses of synchronisation with mismatching convoy debug sums
Post by: freddyhayward on October 09, 2020, 07:13:52 AM
Unfortunately, my previous patch caused immediate desyncs on large servers, because convoy states aren't synchronised within steps (for better or worse). Sorry to everyone affected by this. I have made a new patch that uses sums [2] and [3] for convoy speeds and target speeds. I don't expect these to detect desyncs on their own (and some testing shows they don't), but they should hopefully give more information about affected convoys. https://github.com/jamespetts/simutrans-extended/pull/289