
New loss of synchronisation error

Started by jamespetts, April 14, 2019, 03:23:46 PM


jamespetts

It appears that a new loss of synchronisation error has appeared on the Bridgewater-Brunel server. All development efforts* will need to focus on fixing this until such time as it is resolved. This may take a considerable time, as it is extremely difficult to test properly for this sort of error.

The problem appears first to have become manifest on Friday with occasional losses of synchronisation, but became more severe during yesterday and, to-day, loss of synchronisation occurs within a minute or so of logging in. I have not confirmed whether this occurs on all builds or whether this occurs only on Windows builds connecting to the Linux server.

I am unable to test this using a local copy at home any longer because I cannot fit two instances of the current Bridgewater-Brunel game onto my home computer's memory. I can therefore only test on the live server. If some basic tests do not reveal the source of the problem such that I am able to fix it, I will need to take a backup of the current running game and use the server for testing only as occurred on the last occasion. The backup will be restored when the problem has been confirmed to be fixed.

I have attempted two basic fixes earlier to-day, but neither was successful. The next step will be to disable the path explorer data saving to determine whether the issue is there, as this is somewhat complex code. However, I have found that the simuconf.tab setting does not override the saved setting even on the server, so I will have to modify the code to change the priorities for this before I carry out this test.

In the meantime, any reports as to how long people can stay connected, and on what platform, would be helpful. Also, any information about anything new in the server game that has arisen since Friday and grown since then in a way that might correlate with the reported symptoms would be useful.

* I should note that I will be away from home from this afternoon until the 22nd/23rd of April, and will not be able to do any significant work on the code in that time; I had planned to, and will, be doing some pakset work during this period, however.

Any tests/investigations that anyone can perform to try to determine the source of this issue (especially controlled tests that isolate the issue to some specific subsystem or particular part of the player transport infrastructure on the server game) would be very helpful. Please note that the server is still running a live game, so destructive testing will not be appropriate at present: I will let people know if and when I do suspend and backup the live game and switch to a testing game.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Rollmaterial

Lately I have been connecting a live fish line to a new fishery far away from the fishing port. In that process, I would desync when changing the schedule of a boat right after the "route too complex" message popped up. I managed to make the route work just before the desync started to happen immediately after login.

jamespetts

Quote from: Rollmaterial on April 14, 2019, 03:41:13 PM
Lately I have been connecting a live fish line to a new fishery far away from the fishing port. In that process, I would desync when changing the schedule of a boat right after the "route too complex" message popped up. I managed to make the route work just before the desync started to happen immediately after login.

That is very useful - thank you. I had noticed that your company had some fishing boats that were showing no route messages when I logged in earlier. This suggests that the issue relates to convoy routing. May I ask whether you use Linux or Windows?

Rollmaterial

I use Windows.
jamespetts

Quote from: Rollmaterial on April 14, 2019, 07:12:44 PM
I use Windows.

Thank you for confirming.

Can I ask anyone who uses Linux whether you also experience loss of synchronisation errors within a few minutes of connecting to the server?

ACarlotti

I've examined two desyncs; they occurred due to a mismatch in the speeds of 'Seine netter' ships in the 'F Live fish 3' line, owned by the company 'Crandon and Lakes ...'. Both of them (convoys 6814 and 186) are (on my client) reporting 'No route' to a destination of 'Waypoint'. They are both located at (701,3183), which is one of the waypoints; there are also ships at the next waypoint reporting 'No route (too long/complex)'. There doesn't appear to be a sensible route between these two waypoints, although the addition of a short canal connecting two nearby bodies of water would change this.

jamespetts

Quote from: ACarlotti on April 16, 2019, 07:42:03 PM
I've examined two desyncs; they occurred due to a mismatch in the speeds of 'Seine netter' ships in the 'F Live fish 3' line, owned by the company 'Crandon and Lakes ...'. Both of them (convoys 6814 and 186) are (on my client) reporting 'No route' to a destination of 'Waypoint'. They are both located at (701,3183), which is one of the waypoints; there are also ships at the next waypoint reporting 'No route (too long/complex)'. There doesn't appear to be a sensible route between these two waypoints, although the addition of a short canal connecting two nearby bodies of water would change this.

That is interesting: thank you for testing this. I must say, I am confused as to why the speed is other than zero for either of these vessels if it cannot find a route - is it zero on either client or server?

ACarlotti

So what seems to be happening is that the state of a ship is changing at different times. On my most recent test, when wait_lock reached 0, both the client and the server spent 8 sync_steps in the state NO_ROUTE, and then either 3 (on the client) or 4 (on the server) sync_steps in the state ROUTING_2, before wait_lock became positive again. Since my desync detection code considers a convoy's speed if (and only if) wait_lock has reached (or was already) zero, this triggers the desync detection.

EDIT: wait_lock is set in threaded step when in state ROUTING_2. This means that while the convoy motion isn't out of sync (their state makes them stationary even though their speed hasn't been reset to 0), they could potentially make their next route search in different steps on the client and on the server.

So now I'm trying to investigate the convoy threading model. As far as I can tell, the comments seem to suggest that the threaded step shouldn't be running at the same time as the sync_steps. This seems inconsistent with what is happening.
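
The timing hazard described above can be re-enacted deterministically. The following is a much-simplified sketch (all names and values are invented for illustration, not the real Simutrans-Extended code) showing how threaded_step finishing one sync_step later on the server than on the client changes how many times a per-step speed check samples the convoy:

```cpp
#include <cassert>

// Hypothetical, much-simplified convoy: only the fields relevant to the
// symptom described above.
struct Convoy {
    int wait_lock = 0;      // sync_steps to wait before acting again
    int unlocked_steps = 0; // sync_steps observed with wait_lock == 0
};

// Simulate a run in which the background threaded_step happens to finish
// (and write wait_lock) only after `latency` sync_steps have elapsed.
// Returns how many sync_steps the convoy spent "unlocked" -- i.e. how many
// times the desync-detection code would have sampled its speed.
int run(int latency, int total_sync_steps) {
    Convoy c;
    for (int s = 0; s < total_sync_steps; ++s) {
        if (c.wait_lock == 0) {
            ++c.unlocked_steps;  // speed is checksummed on these steps
        }
        if (s == latency) {
            c.wait_lock = 25;    // threaded_step writes asynchronously
        }
    }
    return c.unlocked_steps;
}
```

With a latency of 3 on one machine and 4 on the other, the two machines sample the convoy a different number of times, so a per-step speed checksum diverges even though the convoy never actually moves.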

jamespetts

Quote from: ACarlotti on April 17, 2019, 10:23:31 PM
So what seems to be happening is that the state of a ship is changing at different times. On my most recent test, when wait_lock reached 0, both the client and the server spent 8 sync_steps in the state NO_ROUTE, and then either 3 (on the client) or 4 (on the server) sync_steps in the state ROUTING_2, before wait_lock became positive again. Since my desync detection code considers a convoy's speed if (and only if) wait_lock has reached (or was already) zero, this triggers the desync detection.

That is very interesting, thank you. The task now will be to track down how this divergence occurs, which I will not be able to do in any practical sense until the middle of next week at the earliest.

However, if it is of any help, here is some background to the phases of routing and some theoretical analysis as to what might be occurring. The reason that we have a ROUTING_2 state is the multi-threading algorithm: the idea is that the single-threaded part of the routing algorithm transitions between ROUTING and ROUTING_2 (to do the preparatory work, if I recall correctly); the multi-threaded algorithm does the actual routefinding, transitioning between ROUTING_2 and ROUTE_JUST_FOUND (the single-threaded algorithms knowing not to do anything to any convoy in the ROUTING_2 status); and the single-threaded algorithms then deal with the follow-up work on detecting the ROUTE_JUST_FOUND state. These three states replace the single ROUTING state from Standard to separate the actual routefinding (which does not change any memory state other than the memory relating to the routing of that specific convoy) from the ancillary pre- and post-routefinding work, which does affect other memory states and which must be single-threaded.

In principle, there should be no distinction so far as the sync_steps are concerned between ROUTING_2 and ROUTE_JUST_FOUND, and the transition between ROUTE_JUST_FOUND and any other state should be handled in step, not sync_step. This is important because the multi-threaded routing algorithm runs in parallel with sync_step, but not step.

However, when the convoys cannot find a route, they will transition, not from ROUTING_2 to ROUTE_JUST_FOUND, but from ROUTING_2 to NO_ROUTE. The question, then, appears to be this: does sync_step do anything different if the status is NO_ROUTE than if it is ROUTING_2, or, at least, anything different that is network sync relevant? If so, then we perhaps need a new state of NO_ROUTE_2 or similar that is transitioned to NO_ROUTE in a step to ensure that it is deterministic.
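
A minimal sketch of the state split described above (illustrative names and a simplified set of states; the real Simutrans-Extended enum has more members and may be spelled differently):

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the routing states described above, capturing
// only the division of labour between the threads.
enum class ConvoyState {
    ROUTING,          // single-threaded: pre-routefinding preparation
    ROUTING_2,        // multi-threaded route search runs while in this state
    ROUTE_JUST_FOUND, // single-threaded step: post-routefinding follow-up
    NO_ROUTE          // the route search failed
};

// Which execution context transitions a convoy *out* of a given state.
// ROUTING_2 is the only state the parallel route-finder may leave (to
// either ROUTE_JUST_FOUND or NO_ROUTE); every other transition belongs in
// the single-threaded step, which is why a worker-driven transition into
// NO_ROUTE is suspect if sync_step treats NO_ROUTE specially.
std::string transition_context(ConvoyState s) {
    return s == ConvoyState::ROUTING_2 ? "worker thread"
                                       : "single-threaded step";
}
```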

I am not in a position here to check easily whether anything network sync relevant happens in sync_step if NO_ROUTE is the current status (there are UI things, but those should not alone be network sync relevant), but if anyone else could look into this, that might be very helpful.

DrSuperGood

Could this be related to all the ships on the server currently showing "no route"? When I last logged in a lot of mine and other player ships were showing no route for some reason, possibly a low bridge over a public river.

jamespetts

Quote from: DrSuperGood on April 17, 2019, 11:25:06 PM
Could this be related to all the ships on the server currently showing "no route"? When I last logged in a lot of mine and other player ships were showing no route for some reason, possibly a low bridge over a public river.

It is probably related in the sense that, given A. Carlotti's investigations, it seems that the loss of synchronisation occurs when vehicles are in the "NO_ROUTE" state (i.e., the NO_ROUTE state probably causes the loss of synchronisation, rather than the other way around or them both being the effect of a third cause).

ACarlotti

It is certainly related - the ships on the left are spending three or four sync_steps searching for a route, which they cannot find. It is this complicated route search which is causing threaded_step to last long enough that the wait_lock is updated after an indeterminate number of sync_steps. (And it is the desync detection I added in January which is causing this to trigger an immediate desync; otherwise, it wouldn't usually lead to a detectable desync).

One fix for this (which I am just testing) is for threaded_step to write to wait_lock_next_step, which is (effectively) added to wait_lock in the next step.
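
The deferred-write fix can be sketched as follows. The field names follow the post (wait_lock, wait_lock_next_step), but the surrounding detail is invented for illustration: threaded_step no longer touches wait_lock directly, so sync_steps running concurrently always observe the old value, and the pending amount is folded in by the single-threaded step, which happens at the same point in time on every machine.

```cpp
#include <cassert>

// Sketch of the deferred wait_lock update (simplified; illustrative values).
struct Convoy {
    int wait_lock = 0;
    int wait_lock_next_step = 0;

    // May run on a worker thread, overlapping sync_steps.
    void threaded_step_no_route() {
        wait_lock_next_step = 25000;  // defer; never write wait_lock here
    }

    // Runs in the deterministic single-threaded step.
    void step() {
        wait_lock += wait_lock_next_step;
        wait_lock_next_step = 0;
    }
};
```

Between threaded_step and the next step, every machine still sees the old wait_lock, so the per-sync_step observations stay identical regardless of when the worker thread happened to finish.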

I don't know for sure why those ships are currently showing "no route", but my guess is that someone has either deleted a canal or river they wanted to use, or has blocked off the channel they wanted to use by land reclamation.

EDIT: Ran for over 5 minutes on a save that was previously giving desyncs in well under a minute - it seems to have fixed the problem. The fix is now on my Github.

jamespetts

Excellent - thank you very much for that: now incorporated.

I should be grateful if people could test with to-morrow's nightly build to see whether this solves the loss of synchronisation issue.

One thing that I do wonder is whether the new variable wait_lock_next_step should be saved, as, if it happens that a client joins when this value is non-zero for some convoy somewhere, a loss of synchronisation might result (unless I have misunderstood how this works and this can never be a non-zero value at the point when a game is saved/loaded?).

ACarlotti

Quote from: jamespetts on April 18, 2019, 12:53:55 AMOne thing that I do wonder is whether the new variable wait_lock_next_step should be saved, as, if it happens that a client joins when this value is non-zero for some convoy somewhere, a loss of synchronisation might result (unless I have misunderstood how this works and this can never be a non-zero value at the point when a game is saved/loaded?).

This shouldn't introduce any new synchronisation bugs. The only concern specific to this patch is that a save/load cycle after threaded_step runs but before the next convoy step will mean that convoys are sent to the depot (emergency_go_to_depot) 25s sooner (or 2 hours sooner in the case of NO_ROUTE_TOO_COMPLEX). The simplest solution to this (if it is a problem) is probably to add wait_lock_next_step to wait_lock when saving the game. I've pushed this change to Github.
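
The save-time workaround amounts to folding the pending deferral into wait_lock when writing the save, so a client joining from that save reconstructs the same effective wait as the server that wrote it. A minimal sketch (names illustrative):

```cpp
#include <cassert>

// Hypothetical serialisation helper: fold the pending deferral in at
// save time instead of persisting wait_lock_next_step separately.
struct ConvoySaveData {
    int wait_lock;
    int wait_lock_next_step;

    int wait_lock_for_save() const {
        return wait_lock + wait_lock_next_step;  // pending value folded in
    }
};
```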

However, there seems to be an existing synchronisation issue here, whereby threaded_step isn't stopped before the game is saved. I could fix this by adding code to karte_t::save (to the same place that the path_explorer is stopped). I think that is unlikely to cause synchronisation errors in practice, because I expect threaded_step will almost always be finished before the corresponding data is written (it's taken us three months to begin triggering desyncs for an equivalent condition checked every sync_step; saves are much less frequent and rather more computationally expensive).

jamespetts

Thank you: this is helpful. If you could write code to force the main thread to pause until threaded_step has finished before the save routine runs, that would be extremely helpful. Your contribution to solving this is very much appreciated.

jamespetts

This seems to have been reported as fixed - thank you very much again to A. Carlotti for working on this: it is very helpful.