News:

Use the "Forum Search"
It may help you to find anything in the forum ;).

Losses of synchronisation with mismatching convoy debug sums

Started by freddyhayward, September 16, 2020, 10:23:56 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

freddyhayward

In my record of desyncs currently from 12/9/20-17/9/20, nearly ~60% (currently 17) have had the convoy debug sums mismatching (and all other checklist records matching perfectly).
The first (debug_sum[0]) is updated by adding the mantissa of some floating point value when each convoy is processed.
The second (debug_sum[1]) is the same but multiplied with the internal ID of the convoy. In all cases, the first and the second were mismatched together, but I wasn't able to discern the internal ID of the offending convoy(s) in any case.

https://docs.google.com/spreadsheets/d/1Jwa-5G6aXfkrCrPYLQvDbYt7LqYMgYapW8hyOF9IB3s/edit#gid=0

freddyhayward

As of now, there have been 35 recorded desyncs (of the total 70) with debug_sum[0] mismatching. What I have found is that these sums overflow so frequently between each check that it is almost always impossible to determine the unsynchronised convoy, assuming there is only one. I've made a change that significantly reduces the size of these sums so that they should be easier to investigate, here: https://github.com/jamespetts/simutrans-extended/pull/285

jamespetts

Thank you for your work on this: I have now incorporated your debugging change.

There are such a wide range of things that might cause a code divergence first to manifest itself in a difference in convoy movement that any additional information or precision is likely to be useful.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

I have pushed another change (https://github.com/jamespetts/simutrans-extended/pull/287) that adds two new convoy debug sums. The first is the sum of state indexes (e.g. reversing, loading, waiting for clearance) while the second is the sum of these indexes multiplied by the convoy ID. The purpose of this is to discern whether the mismatch results from differing states or small variations of movement within the same state.

ceeac

I think I found the issue (or at least a closely related one):

==================
WARNING: ThreadSanitizer: data race (pid=2827)
  Write of size 4 at 0x7b9c021439f0 by main thread:
    #0 convoi_t::warten_bis_weg_frei(int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:2826:11 (simutrans-extended+0xad829b)
    #1 vehicle_t::hop_check() /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:1618:9 (simutrans-extended+0xcb4e10)
    #2 vehicle_base_t::do_drive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:373:45 (simutrans-extended+0xcaa271)
    #3 road_vehicle_t::do_drive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:4419:46 (simutrans-extended+0xccbffb)
    #4 convoi_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1333:30 (simutrans-extended+0xac37e4)
    #5 karte_t::sync_list_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4818:14 (simutrans-extended+0xc48156)
    #6 karte_t::sync_step(unsigned int, bool, bool) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4890:8 (simutrans-extended+0xc485cc)
    #7 karte_t::interactive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:11001:6 (simutrans-extended+0xc6abef)
    #8 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1396:9 (simutrans-extended+0xb8d325)
    #9 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #10 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

  Previous read of size 4 at 0x7b9c021439f0 by thread T35:
    #0 convoi_t::threaded_step() /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1740:6 (simutrans-extended+0xac75d9)
    #1 step_individual_convoy_threaded(void*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:1946:10 (simutrans-extended+0xc3576f)

  Location is heap block of size 13904 at 0x7b9c02143800 allocated by main thread:
    #0 operator new(unsigned long) <null> (simutrans-extended+0x4caeeb)
    #1 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9468:19 (simutrans-extended+0xc60d25)
    #2 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:8823:3 (simutrans-extended+0xc5c2a7)
    #3 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1262:32 (simutrans-extended+0xb8c5e6)
    #4 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #5 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

  Thread T35 (tid=2917, running) created by main thread at:
    #0 pthread_create <null> (simutrans-extended+0x43dfdb)
    #1 karte_t::init_threads() /home/ceeac/Projects/code/simutrans-ex/simworld.cc:2226:8 (simutrans-extended+0xc2d9df)
    #2 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9815:2 (simutrans-extended+0xc639bb)
    #3 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:8823:3 (simutrans-extended+0xc5c2a7)
    #4 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1262:32 (simutrans-extended+0xb8c5e6)
    #5 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:830:9 (simutrans-extended+0xc80fca)
    #6 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:191:10 (simutrans-extended+0xd5afbf)

SUMMARY: ThreadSanitizer: data race /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:2826:11 in convoi_t::warten_bis_weg_frei(int)
==================


The problem is that the state of a convoy is modified non-atomically by the main thead in sync_step(), while a different thread reads the state in step().
To reproduce the issue, compile the non-graphical server with clang and -fsanitize=thread on linux, then load demo.sve from pak128.britain-ex.

freddyhayward

Unfortunately, my previous patch caused immediate desyncs on large servers, because convoy states aren't synchronised within steps (for better or worse). Sorry to everyone affected by this. I have made a new patch that uses sums [2] and [3] for convoy speeds and target speeds. I don't expect these to detect desyncs on their own (and some testing shows they don't), but they should hopefully give more information about affected convoys. https://github.com/jamespetts/simutrans-extended/pull/289