Losses of synchronisation in online play

jamespetts · August 27, 2020, 04:48:51 PM

I have now had both clients lose synchronisation to the local server. We can therefore potentially reproduce the loss of synchronisation locally and without interaction, but we can only confirm that any possible change has the effect of avoiding a loss of synchronisation with at least 2 hours' testing.

freddyhayward · August 27, 2020, 11:21:14 PM

I have been able to produce losses of synchronisation for both clients without interaction in under half an hour (possibly shorter, since I didn't time it) by using a save with the lowest possible bits-per-month, which therefore calls new_month and associated code more frequently. I noticed that in these cases the server had advanced its random seed while the clients had not.
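As a rough illustration of why a low bits-per-month value exercises the month-end code more often, here is a minimal sketch. It assumes that one game month lasts 2^bits_per_month simulation ticks; the function names are hypothetical and are not taken from the Simutrans sources.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: one game month lasts 2^bits_per_month simulation ticks.
std::uint64_t ticks_per_month(unsigned bits_per_month)
{
	assert(bits_per_month < 64);
	return std::uint64_t{1} << bits_per_month;
}

// Number of month ends reached after `elapsed_ticks` ticks, counting from tick 0.
std::uint64_t month_ends(std::uint64_t elapsed_ticks, unsigned bits_per_month)
{
	return elapsed_ticks / ticks_per_month(bits_per_month);
}
```

Under that assumption, going from (for example) 20 bits to 16 bits makes the month-end code run 16 times as often over the same number of ticks.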

jamespetts · August 27, 2020, 11:33:45 PM

Quote from: freddyhayward on August 27, 2020, 11:21:14 PM
I have been able to produce losses of synchronisation for both clients without interaction in under half an hour (possibly shorter, since I didn't time it) by using a save with the lowest possible bits-per-month, which therefore calls new_month and associated code more frequently. I noticed that in these cases the server had advanced its random seed while the clients had not.

That is interesting. However, there are reports of losses of synchronisation (and I have experienced this myself) before a month end change (i.e. players logging on during a month lose synchronisation in that same month), which suggests that there is a problem that does not depend on month end issues. It is possible for there to be two separate issues, of course, one based on month ends and one not.

jamespetts · August 28, 2020, 01:05:20 AM

I conducted an additional long-term test this evening, leaving the game running in the background for many hours whilst I did other things. In this case, I had set it up to remove all industry at the beginning of a new month; but checking now, this appears to have failed for reasons which are currently unclear, thus invalidating the test. The clients had lost synchronisation by the time that I checked them again.

freddyhayward · August 28, 2020, 01:26:49 AM

I have been repeatedly desyncing seconds after connecting to bridgewater-brunel and have 'narrowed' down the divergence in this case to between lines 5868 and 5936 of simworld.cc. However, this contains quite a bit of logic, including convoy, path explorer, and time interval signalling.

jamespetts · August 28, 2020, 10:32:15 AM

Thank you for this: that is most helpful. This is the main step() routine, and a complete list of things processed in this block of this routine is:

(1) the path explorer refreshing all categories;
(2) check transferring cargoes;
(3) starting the multi-threaded path explorer;
(4) starting the multi-threaded convoy threads;
(5) checking the MIDI;
(6) recalculating the snowline;
(7) stepping time interval signals;
(8) checking the number of playing clients in a server game and recording whether a player has disconnected; and
(9) checking the functions to be executed by scenario scripts if this is a scripted scenario.

We can realistically rule out 3, 4, 5, 6, 8 and 9 as proximate causes: any losses of synchronisation in the multi-threaded code would only show in the main code after those multi-threaded parts have finished, and the completion of those multi-threaded algorithms has not occurred by the end of this block. That leaves:

(1) the path explorer refreshing all categories;
(2) check transferring cargoes; and
(7) stepping time interval signals.

Unfortunately for narrowing down purposes, there is a high chance that, if (2) is the proximate cause, the actual cause of the divergence is nothing to do with the check transferring cargoes code, since the inconsistency may have arisen much earlier, when the passengers/mail/goods were put into the transferring cargoes list in the first place, and only registers when, for example, passengers leave that list, are registered as arriving somewhere, and pedestrians are generated, giving rise to random number generator calls to determine which pedestrian images to use and in what direction that they should travel.

This would suggest some inconsistency in code relating to one or more of the following:
(1) passenger generation;
(2) transfer time calculation;
(3) vehicle movement (including physics, signalling, road vehicle conflict resolution);
(4) passenger routing;
(5) passenger loading (including overcrowding, class and comfort);
(6) mail routing;
(7) mail loading (including class);
(8) schedules/waiting times; or
(9) industry production logic.
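To make the point about late detection concrete, here is a minimal sketch. It assumes a toy world in which cargo packets sit in a transfer list and each arriving packet triggers one random number call (standing in for pedestrian generation). The types `toy_rng` and `toy_world` are hypothetical and not part of Simutrans. The two worlds differ from step 0, but their seeds only diverge when the packets arrive.

```cpp
#include <cassert>
#include <cstdint>

// Toy linear congruential generator standing in for the game's RNG.
struct toy_rng {
	std::uint32_t seed;

	std::uint32_t next(std::uint32_t max)
	{
		assert(max > 0);
		seed = seed * 1664525u + 1013904223u;
		return seed % max;
	}
};

// Hypothetical miniature world: packets wait in a transfer list and one
// RNG call is made for each packet that arrives.
struct toy_world {
	toy_rng rng;
	int transferring; // packets still in the transfer list
	int arrival_step; // step at which the packets leave the list

	void step(int now)
	{
		if (now == arrival_step) {
			for (int i = 0; i < transferring; ++i) {
				rng.next(8); // e.g. choose a pedestrian image/direction
			}
			transferring = 0;
		}
	}
};

// Returns the first step at which the two worlds' seeds differ, or -1.
int first_seed_divergence(toy_world a, toy_world b, int steps)
{
	for (int now = 0; now < steps; ++now) {
		a.step(now);
		b.step(now);
		if (a.rng.seed != b.rng.seed) {
			return now;
		}
	}
	return -1;
}
```

If one world starts with 3 packets and the other with 4, both arriving at step 50, the seeds match for the first 50 steps even though the state was already inconsistent, which is why the place where a mismatch is detected need not be the place where it was caused.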

Incidentally, I have had another look at the DEBUG_SIMRAND_CALLS code. This appears to have been improved since I first wrote it and it now appears to be suitable for use in a running game. What it will do is give information as to the caller and the range of each random call. However, because of this, it will generate truly gargantuan log files, which means that it is impractical to use on the server. If you are able to reproduce a loss of synchronisation locally within a short period of time - something that I have not been able to do - it may well be worth enabling this on server and client and analysing the resulting log files.
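For anyone unfamiliar with the idea of logging the caller and range of each random call, the following is a minimal sketch of the general technique. It is not the DEBUG_SIMRAND_CALLS implementation: the class `logging_rng`, the struct `rand_call` and the function `first_difference` are hypothetical names used for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One logged random call: who asked, and for what range.
struct rand_call {
	std::string caller;
	std::uint32_t max;
};

// Toy generator that records every call before advancing its seed.
class logging_rng {
public:
	explicit logging_rng(std::uint32_t seed) : seed_(seed) {}

	std::uint32_t next(std::uint32_t max, const char *caller)
	{
		assert(max > 0);
		log_.push_back(rand_call{caller, max});
		seed_ = seed_ * 1664525u + 1013904223u;
		return seed_ % max;
	}

	const std::vector<rand_call> &log() const { return log_; }

private:
	std::uint32_t seed_;
	std::vector<rand_call> log_;
};

// Index of the first call that differs between two logs, or -1 if identical.
int first_difference(const std::vector<rand_call> &a, const std::vector<rand_call> &b)
{
	const std::size_t common = std::min(a.size(), b.size());
	for (std::size_t i = 0; i < common; ++i) {
		if (a[i].caller != b[i].caller || a[i].max != b[i].max) {
			return static_cast<int>(i);
		}
	}
	return a.size() == b.size() ? -1 : static_cast<int>(common);
}
```

Comparing a server log and a client log in this way would give the first call at which the two instances disagree, at the cost of one log entry per random call.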

jamespetts · August 29, 2020, 05:03:24 PM

I continue to have difficulty in reproducing this locally: using the saved game from the server from just before I reset the pakset override setting, which started at circa 3:00 in May 1863, I have progressed with two locally connected clients to circa 4:40 in June 1863 without loss of synchronisation.

Can I check how often people are losing synchronisation to-day in the Bridgewater-Brunel server?

Huitsi · August 29, 2020, 07:47:00 PM

I just desynced three times within less than two hours.

Matthew · August 29, 2020, 10:19:53 PM

This morning (European time) I had immediate desynchs every time I joined.

Early this evening I noticed that Freahk and Huitsi twice desynched as soon as I joined. At that time, I did not desynch, but I did have steadily increasing lag (1 minute plus), even though I had the game windows as small as possible, which I remedied by quitting and restarting the client.

Later this evening, I was the only person on the server for about two hours. I had no or trivial (<10s) lag, even with the game playing full-screen and zoomed out, and desynched only once. §

So there is a possible pattern that I only get lag when the savegame is saved by the Windows client to my hard drive (which I understand to be the case when others join after I do) and not when it's transferred from the server's Linux build (when I join and play alone). Perhaps someone could share a recent Bridgewater-Brunel save made by a Linux client so I can see whether that has different behaviour off-line.

§ It was a weird case like nothing I've seen before: I suddenly jumped twenty (in-game) minutes forward at the same time as I desynched. By the time I resynched (by which I mean, re-joined the online game), I went back fifteen minutes again. My best guess is that's off-topic to this thread and that I accidentally pressed the "fast-forward" shortcut at the same time as I desynched or something.

jamespetts · August 29, 2020, 10:45:34 PM

Thank you both for your reports. I have been carrying out some testing over many hours this evening, leaving the computer unattended whilst undertaking other tasks. I connected two clients to the server after I had manually removed every industry from the May 1863 saved game. Having left it for circa 4-5 hours, it had reached October 1863; one client had lost synchronisation but one remained connected.

This would tend to suggest that industry is not involved in the loss of synchronisation, but there is a possibility of the results being contaminated by new industries spawning which use player transport connexions that already existed.

I will need to re-run the test, probably overnight, using the industry density proportion override to prevent any industry spawning.

jamespetts · August 30, 2020, 10:25:10 AM

I have re-run the test overnight. One client lost synchronisation, the other crashed: when attempting to use the debugger on a crashed version, I find that it crashed because the pointer to a way object in an ordinary crossroads in a town apparently pointed to a deleted object, the ultimate reason for which could not be discerned.

Checking the local server, I see that a number of industry chains had been generated despite setting the industry density proportion to 1, its lowest setting without disabling the system entirely. However, checking these, none of these seemed to be actually interacting with player networks.

This would suggest that industry production and interaction with player networks is not the cause of the loss of synchronisation issue. If Freddy is correct about the part of the code in which this problem arises, this would, in turn, suggest that the problem does not involve industry at all, since the only means of industry involvement in those parts of the code is by means of transport of cargoes.

jamespetts · August 31, 2020, 12:42:37 AM

I have done some further testing: again, I left the simulation running for ~5 hours with one local server and two clients, but I modified the code so as to disable check_transferring_cargoes(). The same result as above obtained: one client lost synchronisation, the other crashed.

The crash is odd, and difficult to track down, partly, I suspect, because I am using an optimised debug build (the optimisations being necessary for the Bridgewater-Brunel game to run at a reasonable speed).

The error that I get is in line 1791 of grund.cc when attempting to check tile 433,570 as to whether it has a depot (called from line 4783 of simvehicle.cc, part of the bool rail_vehicle_t::check_next_tile() method). This is called by convoy 8608 and is recorded as occurring at 4:35:11 of September 1863.

The crash itself is indecipherable. The only details given by the debugger are:

Simutrans-Extended-debug-optimised.exe has triggered a breakpoint. occurred

The code then enters some external code and ultimately ends in issue_debug_notification(), for which no sources are available, having passed through terminate() and abort(); none of these methods are part of Simutrans code, so they must be part of some exception handling library; but what exception is being handled is not clear.

The code being executed is:

depot_t* grund_t::get_depot() const
{
	return dynamic_cast<depot_t *>(first_obj());
}

I cannot inspect the value of first_obj() because of compiler optimisations.

Given that a similar crash occurred on the last occasion, I am wondering whether this is relevant to the cause of the loss of synchronisation. This looks as though it may be some form of memory corruption, but how it is arising is extremely unclear. However, it is notable that check_transferring_cargoes() being disabled appears to have made no difference; that may well rule out a large number of possible causes.

jamespetts · September 02, 2020, 09:18:52 PM

Further testing shows that a loss of synchronisation occurs after circa 2 hours on a local client/server with two clients connected with line 5903 of simworld.cc commented out, suggesting that the problem is unlikely to be specific to time interval signals.

I should note that it is likely to be very difficult for me to undertake any development work until this issue has been identified and fixed, and this could take an unlimited and unpredictable amount of time.

Any assistance in resolving this would be very much appreciated.

jamespetts · September 02, 2020, 11:10:58 PM

I have attempted to conduct further tests to see whether a shorter time period for invoking a loss of synchronisation could be prompted, but without success. There was a point when clients lost synchronisation seconds after connecting to the server, but I have not been able to reproduce this reliably with a saved game from the time when this occurred.

I have tried connecting up to four clients simultaneously, but this does not reliably prompt a loss of synchronisation within a reasonable time for testing.

The failure to isolate the problem to check_transferring_cargoes() or the stepping of time interval signals, the inability to reproduce a loss of synchronisation within the same time after loading the saved game, and reports from users that losses of synchronisation tend to happen more often after players connect together suggest that the problem might be related in some way (whether directly or indirectly, and indirectly might include very indirectly indeed, i.e. by means of an arbitrary number of intermediate steps each of an arbitrary level of complexity) to loading and saving, but this is highly uncertain. There is currently no clue as to where in the loading/saving code the problem might occur.
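To illustrate the kind of loading/saving fault that could produce this pattern, here is a minimal sketch under stated assumptions: a toy state in which one field is not written by save(), so that a client which loads the game starts with a different value from a server which simply carries on running. The struct `toy_state` and the field `retry_delay` are hypothetical; nothing here is taken from the actual save format.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

struct toy_state {
	std::uint32_t seed;
	int waiting;     // written by save()
	int retry_delay; // hypothetical field that save() forgets
};

std::string save(const toy_state &s)
{
	std::ostringstream out;
	out << s.seed << ' ' << s.waiting; // retry_delay deliberately omitted
	return out.str();
}

toy_state load(const std::string &data)
{
	std::istringstream in(data);
	toy_state s{0, 0, 0}; // unsaved fields fall back to defaults
	in >> s.seed >> s.waiting;
	assert(!in.fail());
	return s;
}

// True if saving and reloading reproduces the running state exactly.
bool survives_round_trip(const toy_state &s)
{
	const toy_state r = load(save(s));
	return r.seed == s.seed && r.waiting == s.waiting && r.retry_delay == s.retry_delay;
}
```

A round-trip check of this sort only fails when the forgotten field happens to hold a non-default value at the moment of saving, which would be consistent with a fault that is hard to reproduce from any one saved game.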

I note that Freddy had reported being able to reproduce the loss of synchronisation within 30 minutes, but I have never received information on how this was possible, so I have not been able to do anything with this. If it is possible to reproduce it within that time, that would be extremely helpful and could quadruple the speed with which this problem can be found.

freddyhayward · September 02, 2020, 11:50:59 PM

Quote from: jamespetts on September 02, 2020, 11:10:58 PM
I note that Freddy had reported being able to reproduce the loss of synchronisation within 30 minutes, but I have never received information on how this was possible, so I have not been able to do anything with this. If it is possible to reproduce it within that time, that would be extremely helpful and could quadruple the speed with which this problem can be found.

Unfortunately, I can't precisely remember either. I do remember altering certain settings to make industries grow at the maximum possible rate (I can't remember which settings achieved this), and lowering the bits_per_month to 16. I have no idea whether this would relate to the same error that you had reproduced within 2 hours, or whether this method would even succeed in the current version. I think I used a flat world with many interconnected cities for this as well, but I can't remember its size.

jamespetts · September 02, 2020, 11:56:10 PM

Thank you for your reply. Unfortunately, this information is not sufficiently precise for me to be able to do anything with it.

jamespetts · September 03, 2020, 12:08:02 AM

I have attempted to use Dr. Memory to determine whether there are any items of memory corruption that might be causing the losses of synchronisation. Unfortunately, I have been unable to get Dr. Memory to work at all: with the 64-bit executable, it will freeze after the pakset selector. With the 32-bit executable, it will simply crash shortly after opening. These errors do not occur when not running in Dr. Memory.

If anyone is able to get this to run successfully with either Dr. Memory or Valgrind, especially with the current Bridgewater-Brunel saved game, it would be extremely helpful to see the output of this to determine whether memory corruption is possibly relevant.

jamespetts · September 03, 2020, 12:09:23 AM

Incidentally, Freddy - can I check precisely how you determined that the loss of synchronisation occurred in certain parts of the code? I have not found that disabling any parts of that code makes any difference to whether the losses of synchronisation occur.

freddyhayward · September 03, 2020, 01:13:04 AM

Quote from: jamespetts on September 03, 2020, 12:09:23 AM
Incidentally, Freddy - can I check precisely how you determined that the loss of synchronisation occurred in certain parts of the code? I have not found that disabling any parts of that code makes any difference to whether the losses of synchronisation occur.

I used the checklist mismatch messages - each record of `str=` and `ssr=` is updated at a different point in the code, so it is a matter of finding which entries are mismatched, and where they are updated in the code. We could make this more precise by adding more entries in suspect areas, and removing them from non-suspect areas.

ceeac · September 03, 2020, 06:46:21 AM

I noticed that running the server with MULTI_THREAD=0 and connecting with a client that has MULTI_THREAD=1 (everything else being equal, including settings) results in an immediate loss of synchronization every time.

Phystam · September 03, 2020, 07:17:54 AM

Additionally, if the thread number is different between the server and a client, connection will be lost immediately.
I think it's better to implement the comparison of thread number when connecting.

Matthew · September 03, 2020, 07:24:12 AM

Quote from: ceeac on September 03, 2020, 06:46:21 AM
I noticed that running the server with MULTI_THREAD=0 and connecting with a client that has MULTI_THREAD=1 (everything else being equal, including settings) results in an immediate loss of synchronization every time.

Would Simutrans warn of incompatible paksets if a player tried to do that?

ceeac · September 03, 2020, 08:14:24 AM

Quote from: Phystam on September 03, 2020, 07:17:54 AM
I think it's better to implement the comparison of thread number when connecting.

No. I believe the results should be the same no matter the number of threads or whether multi-threading is enabled. It also helps when debugging multi-threaded code if you are able to compare the results of the multi-threaded code with the results of the single-threaded code.

Quote from: Matthew on September 03, 2020, 07:24:12 AM
Would Simutrans warn of incompatible paksets if a player tried to do that?

I queried my local server by entering the IP address in the "play online" window and there were no warnings.

RESTRICTED ACCOUNT · September 03, 2020, 11:15:29 AM

	if(  get_scenario()->is_scripted() ) {
		get_scenario()->step();
	} // Loss of synchronisation suspected to be in a block of code ending here.

Does the scenario work correctly with extended?

Mariculous · September 03, 2020, 01:01:51 PM

It does not work at all with extended.

TurfIt · September 03, 2020, 01:08:44 PM

Quote from: ceeac on September 03, 2020, 06:46:21 AM
I noticed that running the server with MULTI_THREAD=0 and connecting with a client that has MULTI_THREAD=1 (everything else being equal, including settings) results in an immediate loss of synchronization every time.

That is a known issue with the architecture of the threading that was added. It should really be done as a pool of worker threads grabbing tasks, i.e. break the work into a hundred tasks and let the threads chew. Then it doesn't matter whether there is 1 thread or 30. The problem now is that the work is broken into as many units as there are threads, and the work unit size affects the simulation (maintaining sync). For the threaded display rendering, it is quite important for performance that the number of threads match the available CPU cores. Hence the number of threads must be different for each person's computer.
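A minimal sketch of the worker-pool idea described above, assuming a toy workload: the work is cut into a fixed number of tasks regardless of the thread count, workers claim tasks from a shared counter, and the per-task results are merged in task order. The function `process` and its order-sensitive hash are hypothetical illustrations, not Simutrans code, and building it may require enabling thread support (for example `-pthread` on some toolchains).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Processes `items` as `task_count` fixed tasks using `thread_count` workers.
// The result depends on task_count (the work unit size) but not on thread_count.
std::uint32_t process(const std::vector<std::uint32_t> &items, std::size_t task_count, unsigned thread_count)
{
	assert(task_count > 0 && thread_count > 0);
	std::vector<std::uint32_t> partial(task_count, 0);
	std::atomic<std::size_t> next_task{0};

	auto worker = [&]() {
		for (;;) {
			const std::size_t task = next_task.fetch_add(1);
			if (task >= task_count) {
				return;
			}
			const std::size_t begin = items.size() * task / task_count;
			const std::size_t end = items.size() * (task + 1) / task_count;
			std::uint32_t h = 0;
			for (std::size_t i = begin; i < end; ++i) {
				h = h * 31u + items[i]; // order-sensitive within a task
			}
			partial[task] = h; // each task writes only its own slot
		}
	};

	std::vector<std::thread> pool;
	for (unsigned t = 0; t < thread_count; ++t) {
		pool.emplace_back(worker);
	}
	for (std::thread &t : pool) {
		t.join();
	}

	// Merge in task order, so scheduling cannot change the result.
	std::uint32_t result = 0;
	for (std::uint32_t h : partial) {
		result = result * 1000003u + h;
	}
	return result;
}
```

With the task count fixed at, say, 100 on every machine, a server with 1 thread and a client with 4 would compute the same value. Passing the thread count as the task count reproduces the arrangement described as the current problem, in which the work unit size follows the number of threads.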

Quote from: freddyhayward on September 03, 2020, 01:13:04 AM
I used the checklist mismatch messages - each record of `str=` and `ssr=` is updated at a different point in the code, so it is a matter of finding which entries are mismatched, and where they are updated in the code. We could make this more precise by adding more entries in suspect areas, and removing them from non-suspect areas.

This is exactly how the rands added to the checklist were intended to be used. In practice, they're less effective at finding the desync than desirable, but better than nothing. It does mean for every desync, you must check the checklists and see where the mismatch is. Otherwise just saying a desync occurred is a waste of testing.

A note on the intended design:
   ssr= sync_step_rands. 8 of these to be used throughout sync_step().
   str= step_rands. 32 of these to be used throughout step().
   exr= extra_rands. 8 of these to be stuck in wherever an extra checkpoint is needed.
   debug_sums, don't know, not added by me.

With the current distribution in step(), rands[21] and [22] should really be in the exr range. [23] should be [16], and 16->17, etc. i.e. Keep them in order, much easier when finding the mismatch. And the setting =0 at the end of step(), not in the middle, then you can see what's not used (and see [23] is used even though set =0)

--
As for the current desync in block 19 to 20, the only thing there calling simrand() is pedestrian generation within check_transferring_cargoes(). As previously mentioned, the things that affect cargo are many, i.e. the root source will not likely be in check_transferring_cargoes(), but in those things that affect cargo in general. (And that is why commenting out check_transferring_cargoes() still results in a desync: it's simply detected later/elsewhere, which is why it's important to check the rands on every desync.)
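A minimal sketch of how the three groups of checkpoints could be compared to locate the first mismatching slot. The slot counts (8, 32, 8) follow the design note above, but the struct `toy_checklist`, the function names and the order in which the groups are checked are assumptions for illustration and do not reproduce the real checklist class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the checklist: seed snapshots taken at checkpoints.
struct toy_checklist {
	std::vector<std::uint32_t> ssr = std::vector<std::uint32_t>(8, 0);  // sync_step() checkpoints
	std::vector<std::uint32_t> str = std::vector<std::uint32_t>(32, 0); // step() checkpoints
	std::vector<std::uint32_t> exr = std::vector<std::uint32_t>(8, 0);  // extra checkpoints
};

// Index of the first differing slot, or -1 if the two groups agree.
int first_mismatch(const std::vector<std::uint32_t> &a, const std::vector<std::uint32_t> &b)
{
	assert(a.size() == b.size());
	for (std::size_t i = 0; i < a.size(); ++i) {
		if (a[i] != b[i]) {
			return static_cast<int>(i);
		}
	}
	return -1;
}

// Names the first mismatching slot, e.g. "str[20]", or "" if none.
std::string describe_mismatch(const toy_checklist &server, const toy_checklist &client)
{
	int i = first_mismatch(server.ssr, client.ssr);
	if (i >= 0) {
		return "ssr[" + std::to_string(i) + "]";
	}
	i = first_mismatch(server.str, client.str);
	if (i >= 0) {
		return "str[" + std::to_string(i) + "]";
	}
	i = first_mismatch(server.exr, client.exr);
	if (i >= 0) {
		return "exr[" + std::to_string(i) + "]";
	}
	return "";
}
```

If slots are assigned in the order in which the code runs, as suggested above, a result such as "str[20]" with str[19] still matching points at the code between those two checkpoints as the place where the divergence became visible.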

Matthew · September 04, 2020, 06:29:03 AM

Quote from: TurfIt on September 03, 2020, 01:08:44 PM
This is exactly how the rands added to the checklist were intended to be used. In practice, they're less effective at finding the desync than desirable, but better than nothing. It does mean for every desync, you must check the checklists and see where the mismatch is. Otherwise just saying a desync occurred is a waste of testing.

A note on the intended design:
   ssr= sync_step_rands. 8 of these to be used throughout sync_step().
   str= step_rands. 32 of these to be used throughout step().
   exr= extra_rands. 8 of these to be stuck in wherever an extra checkpoint is needed.
   debug_sums, don't know, not added by me.

With the current distribution in step(), rands[21] and [22] should really be in the exr range. [23] should be [16], and 16->17, etc. i.e. Keep them in order, much easier when finding the mismatch. And the setting =0 at the end of step(), not in the middle, then you can see what's not used (and see [23] is used even though set =0)

Turfit, thank you for adding this feature to Simutrans and drawing James' attention to it. However, I don't have anywhere near enough knowledge of C++ or Simutrans to place your comments in context and make use of them.

I will try to describe the feature in full to get it clear in my head; could you (or anyone else who understands it) please check that I have understood correctly how it works and how to use it?

The main Simutrans random number generator uses a seed generated by the previous call to it, so if all is well there should be an unbroken chain of seeds.

If the Simutrans client has debug warnings enabled, then it periodically runs the do_network_world function, which publishes in the log file the seed available for random number generation at certain points in the previous period. It reports both the seed it used itself and the seed available to the server, which should be the same, otherwise the chain is broken and the gamestates will diverge over time. Such a report is also published to the log once such a divergence is detected and the client desynchs.

The cause of the divergence must lie before the point at which the logged seeds diverged. It must lie after the last point at which the seeds were identical, though since step() and sync_step() are loops, a later point in the code may refer to an earlier point in time.

There are three different sets of log points, for use at different places in the code:

Quote
ssr= sync_step_rands. 8 of these to be used throughout sync_step().
str= step_rands. 32 of these to be used throughout step().
exr= extra_rands. 8 of these to be stuck in wherever an extra checkpoint is needed.

In order to activate the log points, change (for example)

rands[7] = 0;

to

rands[7] = get_random_seed();

And presumably both client & server must be recompiled with this change, otherwise you are guaranteed to have mismatches because the Simutrans instance with the extra activated log point will be making an additional call to the random number generator.

Once it is noticed that divergences are often occurring at the same place(s) in the code, it may be possible to move the log points in order to bisect the code and find the cause of the random seed divergences. Though I doubt it is always as simple as that!
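A minimal sketch of the checkpoint idea described above, using a toy generator. In this toy, `peek()` reads the current seed without advancing it; whether the real get_random_seed() behaves the same way is not established in this thread, so treat that as an assumption. The names `toy_rng`, `peek` and `run_step` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct toy_rng {
	std::uint32_t seed;

	std::uint32_t next(std::uint32_t max)
	{
		assert(max > 0);
		seed = seed * 1664525u + 1013904223u;
		return seed % max;
	}

	// Assumption for this toy: reading the seed does not advance the generator.
	std::uint32_t peek() const { return seed; }
};

// One simulated step with two phases; a snapshot of the seed is stored
// after each phase. Slots that are never written stay at 0.
std::vector<std::uint32_t> run_step(toy_rng &rng, int pedestrians, int factories)
{
	std::vector<std::uint32_t> rands(4, 0);
	rands[0] = rng.peek(); // before any work
	for (int i = 0; i < pedestrians; ++i) {
		rng.next(8);
	}
	rands[1] = rng.peek(); // after the first phase
	for (int i = 0; i < factories; ++i) {
		rng.next(100);
	}
	rands[2] = rng.peek(); // after the second phase
	return rands;
}
```

If the two instances agree on rands[1] but not on rands[2], the differing number of random calls happened in the second phase; adding a further checkpoint inside that phase would narrow it down again, which is the bisection described above.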

jamespetts · September 04, 2020, 09:32:53 AM

Thank you all for your responses. I have been very busy in the last few days, but I will look into updating the logging code to assist in tracking this problem down when I have some time. Thank you again.

Phystam · September 04, 2020, 05:09:16 PM

In my server, I often observe desync when nettool sends a force-sync call. (His client also often shows "data is not enough" error while connecting.)

Matthew · September 05, 2020, 05:50:12 AM

I made a tiny Bash script to search Simutrans log files for desync random-number checklists and save them:

#!/bin/bash
# Log version hash (equals last commit on main branch)
grep "Simutrans version" simu.log >> checklist.log
# For clients, log disconnects and make the random-number-seed checklists line up nicely (if word wrap is turned off) for easy comparison
grep -B2 "network_disconnect()" simu.log | sed -e 's/server=/\nserver=/; s/client=/\nclient=/'  >> checklist.log
# For the server, log and make the random-number-seed checklists line up nicely (if word wrap is turned off) for easy comparison by changing "initiator" to "client"
grep -B2 "kicking" simu.log | sed -e 's/server=/\nserver=/; s/initiator=/\nclient=/'  >> checklist.log

It seems to work on the server log that James posted a few weeks ago and on the Windows client logs.

If James edited something like this into the scripts that he uses to restart the server, then he would have a summary log of RNG mismatches causing desyncs, without having to keep gigabytes of log files.

If James were to edit the path for checklist.log to somewhere that is downloadable (like his /misc/ directory) then others could analyse the data too. For easiest view, turn word wrapping off.

It's only my third Bash script (with lots of help from Stack Exchange) so use at your own risk.

TurfIt · September 05, 2020, 05:51:46 AM

Quote from: Matthew on September 04, 2020, 06:29:03 AM
I will try to describe the feature in full to get it clear in my head; could you (or anyone else who understands it) please check that I have understood correctly how it works and how to use it?

You've got the general gist, not quite the details, but that's not important unless you're combing the code for the cause. As an end user, run simutrans with "-debug 2 -log" args and check the logs for messages like:

Warning: karte_t::interactive:	sync_step=413158 server=[ss=413158 st=25822 nfc=6 rand=1232380772 halt=5216 line=1025 cnvy=8193 ssr=2417508643,1232380772,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0] client=[ss=413158 st=25822 nfc=6 rand=2159140440 halt=5216 line=1025 cnvy=8193 ssr=1442281764,2159140440,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0]
Warning: karte_t::interactive:	disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect():	Lost synchronisation with server. Random flags: 0

Confirmation that you're experiencing a "checklist mismatch" is vital. It represents the existence of a bug in Extended; other 'desync' reasons are more of a disconnection, usually due to the client's flaky internet.
Posting the first line shown above on every checklist mismatch could help track down the bug, at least posting it until too many showing the same are up...
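For anyone wanting to compare the server and client halves of such a line mechanically, here is a minimal sketch of a parser. It assumes only the layout visible in the log line above (`server=[...]` and `client=[...]` sections holding space-separated `key=v1,v2,...` fields); the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Text between "<side>=[" and the next "]", or "" if the side is absent.
std::string section(const std::string &line, const std::string &side)
{
	const std::string open = side + "=[";
	const std::size_t begin = line.find(open);
	if (begin == std::string::npos) {
		return "";
	}
	const std::size_t start = begin + open.size();
	const std::size_t end = line.find(']', start);
	return end == std::string::npos ? "" : line.substr(start, end - start);
}

// Comma-separated numbers stored under `key` (e.g. "ssr") in one section.
std::vector<std::uint32_t> field(const std::string &sec, const std::string &key)
{
	assert(!key.empty());
	std::vector<std::uint32_t> values;
	const std::string padded = " " + sec; // so the first key also follows a space
	const std::string tag = " " + key + "=";
	const std::size_t pos = padded.find(tag);
	if (pos == std::string::npos) {
		return values;
	}
	const std::size_t start = pos + tag.size();
	const std::size_t stop = padded.find(' ', start);
	const std::size_t count = stop == std::string::npos ? std::string::npos : stop - start;
	std::istringstream in(padded.substr(start, count));
	std::string token;
	while (std::getline(in, token, ',')) {
		values.push_back(static_cast<std::uint32_t>(std::stoul(token)));
	}
	return values;
}

// Index of the first slot of `key` that differs between server and client, or -1.
int first_mismatch(const std::string &line, const std::string &key)
{
	const std::vector<std::uint32_t> s = field(section(line, "server"), key);
	const std::vector<std::uint32_t> c = field(section(line, "client"), key);
	const std::size_t common = std::min(s.size(), c.size());
	for (std::size_t i = 0; i < common; ++i) {
		if (s[i] != c[i]) {
			return static_cast<int>(i);
		}
	}
	return s.size() == c.size() ? -1 : static_cast<int>(common);
}
```

Reading the example line above by eye gives the same picture such a comparison would: rand and the first two ssr entries differ between server and client, while the str, exr and sums entries are identical.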

Quote from: Matthew on September 04, 2020, 06:29:03 AM
Though I doubt it is always as simple as that!

If only! Way too many red herrings.

Quote from: Phystam on September 04, 2020, 05:09:16 PM
In my server, I often observe desync when nettool sends a force-sync call. (His client also often shows "data is not enough" error while connecting.)

force-sync puts the server and all clients through a save/reload cycle. If a client is too slow doing so, that could result in the socket closing: a disconnection, not a "desync". If these are confirmed to be checklist mismatches, then something is very wrong in the save/load routines...
"data not enough" == bad connection.

jamespetts · September 05, 2020, 12:03:22 PM

Thank you all for your contributions to this. I am currently working on improving the logging code, and have pushed the first round of improvements. I have added some additional points where rands[] is set, and moved the two places where rands[] had been set with something other than random numbers to some of the unused debug checksums.

In the meantime, I have created a symbolic link to the live log file generated by the server to allow active debugging by the community, both for this loss of synchronisation and any other error. It can be found here. Beware that this is a very, very large file.

I will look into improving the logging further shortly. Note that the improved logging will take effect on the live log from to-morrow's nightly build onwards. Note also that these live logs are reset whenever the server restarts and archives are not currently kept.

prissi · September 05, 2020, 01:39:06 PM

In the other thread, if different units of path explorer chunks are processed on a client, would that not cause desyncs quickly?

jamespetts · September 05, 2020, 02:26:51 PM

Quote from: prissi on September 05, 2020, 01:39:06 PM
In the other thread, if different units of path explorer chunks are processed on a client, would that not cause desyncs quickly?

Not if the paths were not updated until the path explorer had finished a whole round of calculation.

jamespetts · September 05, 2020, 03:13:39 PM

A little earlier, I pushed a fix to the transferring cargoes algorithm in respect of an error that I found when improving the logging. The transferring cargoes had not been properly indexed in a number of places, such that the highest two threads' worth of these may well have been ignored.

I cannot immediately see how this could have caused a loss of synchronisation, but it is not impossible that this was to blame. I should be grateful for feedback on losses of synchronisation from to-morrow's nightly build onwards.
