News:

SimuTranslator
Make Simutrans speak your language.

Losses of synchronisation in online play

Started by freddyhayward, August 19, 2020, 09:59:26 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

jamespetts

Checking the logs this morning, I can see one loss of synchronisation recorded since the latest server update. The log entry is as follows:


Warning: karte_t::process_network_commands: kicking client due to checklist mismatch : sync_step=12734

server=[ss=0 st=0 nfc=0 rand=0 halt=0 line=0 cnvy=0 ssr=0,0,0,0,0,0,0,0 str=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 exr=0,0,0,0,0,0,0,0 sums=0,0,0,0,0,0,0,0

initiator=[ss=12734 st=795 nfc=14 rand=943453193 halt=5647 line=1025 cnvy=8193 ssr=3923501859,0,3923501859,3923501859,943453193,943453193,943453193,0 str=1601740177,1601740177,1601740177,1601740177,1601740177,1601740177,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1157293220 exr=1157293220,4176945506,4176945506,0,0,0,0,0 sums=3623084883,1660530599,0,0,6,0,0,0


For some reason, the server's figures have not been logged. I do not understand how this could have occurred; I am struggling to see how the logging changes that I made yesterday could possibly have affected this. Any assistance in this regard would be much appreciated.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

#71
Quote from: jamespetts on September 05, 2020, 03:13:39 PM
A little earlier, I pushed a fix to the transferring cargoes algorithm in respect of an error that I found when improving the logging. The transferring cargoes had not been properly indexed in a number of places, such that the highest two threads' worth of these may well have been ignored.

I cannot immediately see how this could have caused a loss of synchronisation, but it is not impossible that this was to blame. I should be grateful for feedback on losses of synchronisation from to-morrow's nightly build onwards.
After you joined today, followed by Supertimo and Huitsi, I was desynced and found that the new debug sums [5] (passengers/mail) and [7] (transferring cargo) were off by small amounts, with all other entries equal.
server ... sums=2467091683,1342947152,1222935,141356,6,1062,35088,35213
client ... sums=2467091683,1342947152,1222935,141356,6,1061,35088,35211

edit: this happened again with the following discrepancy (note that this time, the client's record was higher than the server's):
server ... sums=346604347,12621192,1223059,141372,6,1068,35016,35152
client ... sums=346604347,12621192,1223059,141372,6,1069,35016,35157

edit: and another:
server ... sums=809983894,1692448293,1223268,141382,6,1050,34990,35107
client ... sums=809983894,1692448293,1223268,141382,6,1050,34990,35116

Phystam

We got this log.

Warning: karte_t:::do_network_world_command: sync_step=339968  server=[ss=339968 st=42496 nfc=0 rand=1371915518 halt=1931 line=1 cnvy=2049 ssr=3914378526,0,3914378526,3914378526,398461557,398461557,398461557,0 str=398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,2383265876 exr=2383265876,1371915518,1371915518,0,0,0,0,0 sums=2645975564,566767530,1079890,131294,6,255,43104,43136client=[ss=339968 st=42496 nfc=0 rand=1371915518 halt=1931 line=1 cnvy=2049 ssr=3914378526,0,3914378526,3914378526,398461557,398461557,398461557,0 str=398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,2383265876 exr=2383265876,1371915518,1371915518,0,0,0,0,0 sums=2645975564,566767530,1079890,131294,6,267,43093,43122
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0

jamespetts

Thank you both for that.

It seems that almost all of the losses of synchronisation are now being triggered from the client rather than the server, leaving no logs on the server to show the checklists. However, connecting with a client with logging enabled shows this:
Quote
Warning: karte_t:::do_network_world_command:   sync_step=162480 

server=[ss=162480 st=10155 nfc=0 rand=206871005 halt=5647 line=1025 cnvy=8193 ssr=2333617215,0,2333617215,2333617215,1007176688,1007176688,1007176688,0 str=1007176688,1007176688,1007176688,1007176688,1007176688,1007176688,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,379896060 exr=379896060,206871005,206871005,0,0,0,0,0 sums=3960158586,773655635,1223249,141380,6,1051,35002,35124

client=[ss=162480 st=10155 nfc=0 rand=206871005 halt=5647 line=1025 cnvy=8193 ssr=2333617215,0,2333617215,2333617215,1007176688,1007176688,1007176688,0 str=1007176688,1007176688,1007176688,1007176688,1007176688,1007176688,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,379896060 exr=379896060,206871005,206871005,0,0,0,0,0 sums=3960158586,773655635,1223249,141380,6,1051,35002,35123
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Analysing the above, it appears as though the loss of synchronisation occurs somewhere in the passenger generation code. However, this does not mean that the problem ultimately originates in the passenger generation code: it might be a problem connected to finding passenger routes by public transport, which might be a problem with the path explorer, with data structures for classes, halts, lines (and what lines serve what halts and carry which classes), buildings that generate or receive passengers or mail (including city buildings, industries, attractions, town halls and all player buildings), or a large range of settings that affect passenger/mail generation.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

ceeac

I fixed some occurrences of Undefined Behaviour related to factory in-transit calculation, overtaking, and a few additional ones most probably not affecting the desyncs. Check pull request #275.

jamespetts

Ceeac - thank you very much for this.

I have found a possible issue in the passenger generation code, which was that the dynamic tolerance calculation feature introduced better to model Marchetti's Constant was based both on the current year's and the past year's data, whereas the current year's data might have been modified by another thread, and thus it is indeterministic whether this modification would have occurred before or after any given access. Because this modification would only very rarely affect the percentage, the actual loss of synchronisation would occur rarely, but would be increasingly common proportionate to the number of city buildings in the world.

I have fixed this by using only the previous year's data, and have restarted the server with this updated version. I should be grateful if people could re-test and reconnect with the new client to check for losses of synchronisation. I have not yet incorporated Ceeac's changes in order to have a properly controlled test.

Ceeac - I notice that a number of your changes add a dynamic cast. Dynamic casts can have a significant performance impact when used in performance critical sections of code; can I check whether you have profiled the performance of these changes on a large map to see whether there is any noticeable difference?

Thank you again for your work on this, and to everyone who has given feedback and assistance with this issue so far.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

After you updated the server today, I have not been able to locate checklist messages in either the client or server logs. I wonder if this was caused by my change to the message format?

TurfIt

Standard r8395 commit is relevant here - missing, and r8348 not merged correctly:

// time for the next step?
uint32 time = dr_time();
if(  (sint32)next_step_time - (sint32)time <= 0  ) {    <<<<******
if(  step_mode&PAUSE_FLAG  ) {
// only update display
sync_step( 0, false, true );
idle_time = 100;
}
****
else if(  env_t::networkmode  &&  !env_t::server  &&  sync_steps >= sync_steps_barrier  ) {
sync_step( 0, false, true );
next_step_time = time + fix_ratio_frame_time;
}
*****
else {
if(  step_mode==FAST_FORWARD  ) {

Setting next_step_time is important, and note the correct way to take a time diff on values that overflow... (not part of 8348, but missed from some other one...)

Matthew

There's clearly been a lot of work put in to this issue in the last few days. Thank you to everyone. Particular credit to James for this monumental effort:

Quote from: jamespetts on August 29, 2020, 10:45:34 PMafter I had manually removed every industry from the May 1863 saved game

:o

Since the server was reset to build #c2173fa, I have played for two sessions of about a real-life hour each. I did not experience any desynchs during that time, though that is not really any different from the behaviour that I experienced beforehand. However, Freddy Hayward observed that desynchs often occurred about 20-40 in-game minutes after the start of the month, and the game passed through that period without difficulty.

Obviously these are only two data points, but at least we have not gone backward! Nobody else has reported desynching in the chats either, which is a hopeful sign.

Quote from: freddyhayward on September 06, 2020, 03:38:04 PM
After you updated the server today, I have not been able to locate checklist messages in either the client or server logs. I wonder if this was caused by my change to the message format?

I observe the same phenomenon in the Windows client logs.

Staying on the topic of logging changes, the logs for both sessions contained far more divide by zero warnings than I have seen before, usually in clusters and often (always?) involving a limited set of numerators. I have not created a bug report because I don't yet have a reproducible case beyond "play on Bridgewater-Brunel". I guess that a debug build is necessary to create such a case, but at the present time I can't use one due to (to cut a long story short) lack of memory.
(Signature being tested) If you enjoy playing Simutrans, then you might also enjoy watching Japan Railway Journal
Available in English and simplified Chinese
如果您喜欢玩Simutrans的话,那么说不定就想看《日本铁路之旅》(英语也有简体中文字幕)。

Huitsi

I've had hardly any desyncs today, and the few that I have had seemed to be related to players joining.

jamespetts

Thank you all for your feedback and my apologies for not having had more time to look into this to-day. I have just reverted the change that Freddy made to the logging code that is suspected of having caused problems for the logging. If anyone has any losses of synchronisation after to-morrow's nightly build, please let me know so that I can analyse the logs.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

#82
Quote from: jamespetts on September 07, 2020, 05:13:42 PM
Thank you all for your feedback and my apologies for not having had more time to look into this to-day. I have just reverted the change that Freddy made to the logging code that is suspected of having caused problems for the logging. If anyone has any losses of synchronisation after to-morrow's nightly build, please let me know so that I can analyse the logs.
I tested the latest build with my patch reverted, but I still cannot see checklist messages.
EDIT: I also tried the earliest build with my change, and logging worked as expected. I am now in the process of identifying exactly which change has caused the issue.
EDIT: I have identified the issue and submitted a pull request to fix it: https://github.com/jamespetts/simutrans-extended/pull/276

TurfIt

James demoted it from a warning to a message. Will need debug 3 now instead of debug 2 to see it.  (Of course debug 3 spits out so many other issues [that really should be fixed....] that you can't effectively use it.)

freddyhayward

Quote from: TurfIt on September 07, 2020, 11:14:27 PM
James demoted it from a warning to a message. Will need debug 3 now instead of debug 2 to see it.  (Of course debug 3 spits out so many other issues [that really should be fixed....] that you can't effectively use it.)
See the above post: I have kept it as a message unless a mismatch actually occurs, where it is added to the existing warning.

Phystam

#85
What happened? I cannot understand why it was disconnected.
It was an auto-update of my server. please ignore this.

Message: karte_t:::do_network_world_command: sync_step=1265440  server=[ss=1265440 st=158180 nfc=0 rand=2836023522 halt=2028 line=1 cnvy=2049
ssr=1926564120,0,1926564120,1926564120,3558286601,3558286601,3558286601,0
str=3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3439267139
exr=3439267139,2836023522,2836023522,0,0,0,0,0
sums=2251449498,2512648722,1115024,135464,6,396,43956,44015]
client=[ss=1265440 st=158180 nfc=0 rand=2836023522 halt=2028 line=1 cnvy=2049
ssr=1926564120,0,1926564120,1926564120,3558286601,3558286601,3558286601,0
str=3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3439267139
exr=3439267139,2836023522,2836023522,0,0,0,0,0
sums=2251449498,2512648722,1115024,135464,6,396,43956,44015]

Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 50
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 50
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 0
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 50
Warning: network_receive_data: connection [2752] already closed
Message: socket_list_t::remove_client: remove client socket[2752]
Warning: karte_t::process_network_commands: lost connection to server
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0

Phystam

Double post!
I have connected to my server 3 hours ago, and I have not had any desync issues.
Another player did not report the desync issue, too.

Phystam

Triple post ;)
I got a desync:

server=[ss=577568 st=72196 nfc=0 rand=3624669075 halt=2049 line=1 cnvy=2049
ssr=2442508065,0,2442508065,2442508065,4026605948,4026605948,4026605948,0
str=4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,197235402
exr=197235402,3624669075,3624669075,0,0,0,0,0
sums=3912519612,3810850575,1120566,136003,6,406,44647,44703]
client=[ss=577568 st=72196 nfc=0 rand=3624669075 halt=2049 line=1 cnvy=2049
ssr=2442508065,0,2442508065,2442508065,4026605948,4026605948,4026605948,0
str=4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,197235402
exr=197235402,3624669075,3624669075,0,0,0,0,0
sums=4083084865,2854028676,1120566,136003,6,406,44647,44703]

Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset

jamespetts

Thank you for these reports. That is interesting, as the mismatch occurred on the first two debug sums, which indicates an issue related in some way to vehicle movement. This is different to the previously reported problem. This, together with the reduction in reported losses of synchronisation and, suggests that the original problem may well have been solved and that this is a separate problem.

It would be helpful to have any more log reports of losses of synchronisation here to allow this possibly separate issue to be investigated further.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Quote from: TurfIt on September 06, 2020, 07:04:57 PM
Standard r8395 commit is relevant here - missing, and r8348 not merged correctly:

// time for the next step?
uint32 time = dr_time();
if(  (sint32)next_step_time - (sint32)time <= 0  ) {    <<<<******
if(  step_mode&PAUSE_FLAG  ) {
// only update display
sync_step( 0, false, true );
idle_time = 100;
}
****
else if(  env_t::networkmode  &&  !env_t::server  &&  sync_steps >= sync_steps_barrier  ) {
sync_step( 0, false, true );
next_step_time = time + fix_ratio_frame_time;
}
*****
else {
if(  step_mode==FAST_FORWARD  ) {

Setting next_step_time is important, and note the correct way to take a time diff on values that overflow... (not part of 8348, but missed from some other one...)

My apologies: I had not had time to look into this until now.

Thank you very much for this: I have now incorporated these changes.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt


jamespetts

Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt

Almost there, still r8395 to go...

void socket_list_t::send_all(network_command_t* nwc, bool only_playing_clients, uint8 player_nr)

...
if (list[i]->is_active()  &&  list[i]->socket!=INVALID_SOCKET
&&  (!only_playing_clients  ||  list[i]->state == socket_info_t::playing  ||  list[i]->state == socket_info_t::connected)      <<<<<<<<<<<
&&  (player_nr >= PLAYER_UNOWNED  ||  list[i]->is_player_unlocked(player_nr))
...


jamespetts

Thank you for that. I am having trouble working out where to get the player_nr from to pass to this method: how is this done in Standard, do you know?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt

I would skip that part... it's from a much larger commit and is for scripting support which I believe you've decided to remove from Extended.
Just the   state == :: connected that was added in 8395 is needed.

jamespetts

Quote from: TurfIt on September 08, 2020, 07:33:40 PM
I would skip that part... it's from a much larger commit and is for scripting support which I believe you've decided to remove from Extended.
Just the   state == :: connected that was added in 8395 is needed.

Excellent, thank you: that is very helpful.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

Right now I have immediate desync after connecting to stephenson-siemens and bridgewater-brunel. But I was able to stay connected to Karafuto for quite a while.

freddyhayward

Desyncs have decreased overall, but I recently had one similar to Phystam's:
server=[ss=245976 st=15373 nfc=8 rand=1948000736 halt=5737 line=1647 cnvy=10269
ssr=1268985866,0,1268985866,1268985866,1948000736,1948000736,1948000736,0
str=2556624162,2556624162,2556624162,2556624162,2556624162,2556624162,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,3525597508
exr=3525597508,4058585351,4058585351,0,0,0,0,0
sums=2869568541,3060730727,0,0,6,0,0,0]
client=[ss=245976 st=15373 nfc=8 rand=1948000736 halt=5737 line=1647 cnvy=10269
ssr=1268985866,0,1268985866,1268985866,1948000736,1948000736,1948000736,0
str=2556624162,2556624162,2556624162,2556624162,2556624162,2556624162,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,3525597508
exr=3525597508,4058585351,4058585351,0,0,0,0,0
sums=3599377438,2028940010,0,0,6,0,0,0]

Phystam

#98
Currently, it is very stable -- the reported desyncs seem relatively rare cases, so I do not think that this issue must be fixed.

Edit:
Finally I did not get any desync today!

jamespetts

Excellent - thank you all for confirming. It seems that the main issue has now been resolved. If there are any other losses of synchronisation, they are likely to be caused by something different, so a new thread will be required.

Thank you very much to all those who assisted in resolving this problem.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

Quote from: jamespetts on September 09, 2020, 11:06:23 PM
Excellent - thank you all for confirming. It seems that the main issue has now been resolved. If there are any other losses of synchronisation, they are likely to be caused by something different, so a new thread will be required.

Thank you very much to all those who assisted in resolving this problem.
Thanks for your work on this - it's hard to overstate how much the online experience has improved. Could you please split my and Phystam's reports into a new thread - we seem to have reached the same error, fortunately much less frequently than before.

freddyhayward

#101
Feel free to split this into a new thread if you think it necessary. I have been recording reports of losses of synchronisation and the first mismatched sum or random seed. There were three recorded places where mismatches first occurred, and I suggest that we make the checklist logging more granular for these if possible.

edit: link to spreadsheet is here: https://docs.google.com/spreadsheets/d/1Jwa-5G6aXfkrCrPYLQvDbYt7LqYMgYapW8hyOF9IB3s/edit?usp=sharing

Mariculous

There seems to be yet another desync issue, which appears more-or-less immediately after connectngto the server.
It can be reproduced on stephenson-siemens server. Here is the latest save of that map, which allowed me to reliably reproduce the desync locally by simply hosting server and client on the same machine.

Stephenson-siemens save: https://dome.xileks.de/simutrans/save/stephenson-siemens.sve


Pak192 server:
Simutrans extended Version: https://citarofahrer.ch/ST/exe/Simutrans-Extended-Complete.zip  (not the latest because OVH does not support latest nightlies)
Pakset: https://www.simutrans.ovh/server/41/pakset
Save: https://cdn.discordapp.com/attachments/744334274357887016/754782292781826189/savegame.sve

freddyhayward

Quote from: Freahk on September 14, 2020, 10:54:42 AMThere seems to be yet another desync issue, which appears more-or-less immediately after connectngto the server.
It can be reproduced on stephenson-siemens server. Here is the latest save of that map, which allowed me to reliably reproduce the desync locally by simply hosting server and client on the same machine.

Stephenson-siemens save: https://dome.xileks.de/simutrans/save/stephenson-siemens.sve
I tried this and couldn't reproduce the desync.

Mariculous

#104
Wait, I got an idea!
Might be related to *_overrides_savegame_settings

Edit: It seems like something is wrong when the server makes use of *_overrides_savegame_settings=1

Yet tested with the provided pak192 save.
This seems to be a relatively new issue, as stephenson-siemens made use of this for a long time. Searching for the commit that introduced this isue now.

Edit2: Confirmed the same with the stephenson-siemens save. It's running locally without any desyncs now either.