The International Simutrans Forum

 

Author Topic: Losses of synchronisation with mismatching convoy debug sums  (Read 406 times)

0 Members and 1 Guest are viewing this topic.

Offline freddyhayward

  • Devotee
  • *
  • Posts: 451
  • Languages: EN
Losses of synchronisation with mismatching convoy debug sums
« on: September 16, 2020, 10:23:56 PM »
In my record of desyncs currently from 12/9/20-17/9/20, nearly ~60% (currently 17) have had the convoy debug sums mismatching (and all other checklist records matching perfectly).
The first (debug_sum[0]) is updated by adding the mantissa of some floating point value when each convoy is processed.
The second (debug_sum[1]) is the same but multiplied with the internal ID of the convoy. In all cases, the first and the second were mismatched together, but I wasn't able to discern the internal ID of the offending convoy(s) in any case.

https://docs.google.com/spreadsheets/d/1Jwa-5G6aXfkrCrPYLQvDbYt7LqYMgYapW8hyOF9IB3s/edit#gid=0

Offline freddyhayward

  • Devotee
  • *
  • Posts: 451
  • Languages: EN
Re: Losses of synchronisation with mismatching convoy debug sums
« Reply #1 on: September 27, 2020, 11:15:29 AM »
As of now, there have been 35 recorded desyncs (of the total 70) with debug_sum[0] mismatching. What I have found is that these sums overflow so frequently between each check that it is almost always impossible to determine the unsynchronised convoy, assuming there is only one. I've made a change that significantly reduces the size of these sums so that they should be easier to investigate, here: https://github.com/jamespetts/simutrans-extended/pull/285

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20326
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Losses of synchronisation with mismatching convoy debug sums
« Reply #2 on: September 27, 2020, 12:03:14 PM »
Thank you for your work on this: I have now incorporated your debugging change.

There are such a wide range of things that might cause a code divergence first to manifest itself in a difference in convoy movement that any additional information or precision is likely to be useful.

Offline freddyhayward

  • Devotee
  • *
  • Posts: 451
  • Languages: EN
Re: Losses of synchronisation with mismatching convoy debug sums
« Reply #3 on: October 06, 2020, 07:47:53 AM »
I have pushed another change (https://github.com/jamespetts/simutrans-extended/pull/287) that adds two new convoy debug sums. The first is the sum of state indexes (e.g. reversing, loading, waiting for clearance) while the second is the sum of these indexes multiplied by the convoy ID. The purpose of this is to discern whether the mismatch results from differing states or small variations of movement within the same state.

Offline ceeac

  • Devotee
  • *
  • Posts: 212
Re: Losses of synchronisation with mismatching convoy debug sums
« Reply #4 on: October 08, 2020, 06:04:54 AM »
I think I found the issue (or at least a closely related one):
Code: [Select]
==================
WARNING: ThreadSanitizer: data race (pid=2827)
  Write of size 4 at 0x7b9c021439f0 by main thread:
    #0 convoi_t::warten_bis_weg_frei(int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:2826:11 (simutrans-extended+0xad829b)
    #1 vehicle_t::hop_check() /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:1618:9 (simutrans-extended+0xcb4e10)
    #2 vehicle_base_t::do_drive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:373:45 (simutrans-extended+0xcaa271)
    #3 road_vehicle_t::do_drive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:4419:46 (simutrans-extended+0xccbffb)
    #4 convoi_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1333:30 (simutrans-extended+0xac37e4)
    #5 karte_t::sync_list_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4818:14 (simutrans-extended+0xc48156)
    #6 karte_t::sync_step(unsigned int, bool, bool) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4890:8 (simutrans-extended+0xc485cc)
    #7 karte_t::interactive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:11001:6 (simutrans-extended+0xc6abef)
    #8 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1396:9 (simutrans-extended+0xb8d325)
    #9 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #10 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

  Previous read of size 4 at 0x7b9c021439f0 by thread T35:
    #0 convoi_t::threaded_step() /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1740:6 (simutrans-extended+0xac75d9)
    #1 step_individual_convoy_threaded(void*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:1946:10 (simutrans-extended+0xc3576f)

  Location is heap block of size 13904 at 0x7b9c02143800 allocated by main thread:
    #0 operator new(unsigned long) <null> (simutrans-extended+0x4caeeb)
    #1 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9468:19 (simutrans-extended+0xc60d25)
    #2 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:8823:3 (simutrans-extended+0xc5c2a7)
    #3 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1262:32 (simutrans-extended+0xb8c5e6)
    #4 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #5 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

  Thread T35 (tid=2917, running) created by main thread at:
    #0 pthread_create <null> (simutrans-extended+0x43dfdb)
    #1 karte_t::init_threads() /home/ceeac/Projects/code/simutrans-ex/simworld.cc:2226:8 (simutrans-extended+0xc2d9df)
    #2 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9815:2 (simutrans-extended+0xc639bb)
    #3 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:8823:3 (simutrans-extended+0xc5c2a7)
    #4 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1262:32 (simutrans-extended+0xb8c5e6)
    #5 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #6 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

SUMMARY: ThreadSanitizer: data race /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:2826:11 in convoi_t::warten_bis_weg_frei(int)
==================

The problem is that the state of a convoy is modified non-atomically by the main thead in sync_step(), while a different thread reads the state in step().
To reproduce the issue, compile the non-graphical server with clang and -fsanitize=thread on linux, then load demo.sve from pak128.britain-ex.

Offline freddyhayward

  • Devotee
  • *
  • Posts: 451
  • Languages: EN
Re: Losses of synchronisation with mismatching convoy debug sums
« Reply #5 on: October 09, 2020, 07:13:52 AM »
Unfortunately, my previous patch caused immediate desyncs on large servers, because convoy states aren't synchronised within steps (for better or worse). Sorry to everyone affected by this. I have made a new patch that uses sums [2] and [3] for convoy speeds and target speeds. I don't expect these to detect desyncs on their own (and some testing shows they don't), but they should hopefully give more information about affected convoys. https://github.com/jamespetts/simutrans-extended/pull/289