The International Simutrans Forum

 

Author Topic: Instability on the Bridgewater-Brunel server  (Read 12897 times)

0 Members and 1 Guest are viewing this topic.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #210 on: December 09, 2018, 06:38:03 PM »
Thank you for checking.

This is, I have to say, a fantastically and inexplicably complex problem to resolve - no test so far devised seems to have provided any meaningful insight into the nature of the problem. This really is bizarre beyond comprehension. I fear that this may take many, many more months to resolve without some serious progress.

If anyone can devise and execute some tests to narrow this issue down further, that would be most helpful.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2709
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #211 on: December 10, 2018, 12:26:41 AM »
So is it related to the type of locomotive used or was that a premature conclusion?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #212 on: December 10, 2018, 12:37:38 AM »
So is it related to the type of locomotive used or was that a premature conclusion?

It may be - so far, it appears not to lose synchronisation with a GWR Castle class, but it does lose synchronisation with the LNER A4 class (with or without a custom air resistance value). An earlier test showed loss of synchronisation with a Bulleid pacific (I think a Merchant Navy class, but it might have been the similar looking but lighter weight West Country class) and mail carriages. Nothing else has been tested so far - testing other locomotives with everything else the same would be a helpful start to try to narrow this down.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #213 on: December 10, 2018, 11:42:18 PM »
One possible method of testing is to make a copy of the GWR castle class for debugging purposes, and, one by one, change its characteristics to those of the LNER A4 class and re-test each time to see whether there is a loss of synchronisation. Because a full round of testing will involve the trains making the full circuit of the Northern Frontier Express for each modification, this will be an extremely time consuming method of testing.

However, I cannot at present devise any alternative. I am not likely to have the time to do this for a few days, but if anyone else can devise any more efficient test in the meantime, that would be very helpful.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #214 on: December 12, 2018, 10:34:37 PM »
I have just commenced a test of seeing whether the LMS Princess Coronation Royal class when paired with the debugging carriages will provoke a loss of synchronisation or not. Testing several other classes of locomotives would be helpful to have good data as to whether the locomotive is relevant to the loss of synchronisation and, if so, in what way and to what extent.

Edit: A loss of synchronisation can be reproduced with this class of locomotive on departing St. Mary Beddington towards Elmley.
« Last Edit: December 13, 2018, 12:17:27 AM by jamespetts »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #215 on: December 22, 2018, 11:35:10 PM »
I have been very busy making various Christmas preparations, so have not been able much to test this recently. However, I am now staying with my parents for the festive season with my Linux NUC and have tried logging into the server with that. I find that that also loses synchronisation.

This is significant, as I had previously thought from earlier testing that this problem cannot be reproduced when a Linux client connects to a Linux server: it seems that that is not the case. This significantly broadens the type of error that might cause this behaviour and makes it in some respects less confusing than it appeared to be hitherto.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #216 on: December 27, 2018, 01:39:54 PM »
Further testing shows some even more perplexing results.

Running on the Linux NUC with a server and a client connected through the loopback interface, I cannot reproduce the loss of synchronisation at all, even after several hours of running.

I am wondering whether this is a checklist mismatch at all. It may well be necessary to add a feature to the UI to display, when a client loses synchronisation, the type of cause (e.g. checklist mismatch, command executed in the past, etc.) to assist with diagnostics.

Edit: Further testing shows that this is indeed a checklist mismatch: using a debug output and connecting to the server in that way produces the following debugging output (edited to show the relevant part):
Code: [Select]
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 66
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 66
Warning: network_check_activity():    received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: karte_t:::do_network_world_command:    sync_step=5128  server=[ss=5128 st=1282 nfc=0 rand=2876650628 halt=4097 line=1025 cnvy=4097 ssr=1905360171,3649131816,0,0,0,0,0,0 str=3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,2876650628,2876650628,2556660269,644001429,3649131816 exr=0,0,0,0,0,0,0,0  client=[ss=5128 st=1282 nfc=0 rand=2876650628 halt=4097 line=1025 cnvy=4097 ssr=1905360171,3649131816,0,0,0,0,0,0 str=3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,2876650628,2876650628,2556660269,644001429,3649131816 exr=0,0,0,0,0,0,0,0 
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: network_check_activity():    received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: karte_t:::do_network_world_command:    sync_step=5132  server=[ss=5132 st=1283 nfc=0 rand=2026024821 halt=4097 line=1025 cnvy=4097 ssr=3465126746,205948590,0,0,0,0,0,0 str=205948590,205948590,205948590,205948590,205948590,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2563437453,644638429,2026024821 exr=0,0,0,0,0,0,0,0  client=[ss=5132 st=1283 nfc=0 rand=205948590 halt=4097 line=1025 cnvy=4097 ssr=3465126746,205948590,0,0,0,0,0,0 str=205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,2563437453,644638429,205948590 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command:    disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect():    Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:    all static variables are reset
World finished ...
Show banner ...
Edit 2: Extracting from that, we seem to get the following pattern under "str=" for the failed step:
Client:
Code: [Select]
205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,2563437453,644638429,205948590Server
Code: [Select]
205948590,205948590,205948590,205948590,205948590,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2563437453,644638429,2026024821This is in circumstances where the rand= are as follows on the failed step:
Client
Code: [Select]
rand=205948590
Server
Code: [Select]
rand=2026024821
« Last Edit: December 27, 2018, 02:22:28 PM by jamespetts »

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Instability on the Bridgewater-Brunel server
« Reply #217 on: December 27, 2018, 07:56:22 PM »
That means something in either convoi_t::threaded_step() or convoi_t::step() as called from karte_t::step() is grabbing a random on the server but not client...
There's still 'spare' checklist rands in this debug version (that you were supposed to remove after the last big desync hunt - 4? years ago), you can insert in between to disambiguate threaded_step from _step.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #218 on: December 27, 2018, 08:06:56 PM »
That is a very interesting and useful analysis: thank you. I earlier experimented with disabling the convoy multi-threading, which did not affect the desync, so whatever is causing the problem is either in convoi_t::step(), or in that part of convoi_t::threaded_step() that is replicated in the single thread only version. 

Of course, it is entirely possible that the ultimate problem is not in either of those methods, but that those methods are where the divergence first manifests itself as a call to simrand() on the server but not on the client.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Instability on the Bridgewater-Brunel server
« Reply #219 on: December 27, 2018, 08:25:37 PM »
Finding the actual source of the problem is a problem itself with this style of debugging... end up down a lot of false paths.

I'd first start by ensuring the desync is consistently triggering from rands[13] (and hope you don't have more than one intermittent desync bug at play!), and then determine where all the simrand calls in convoi stepping are, and then determine what all conditions affect whether those calls are made.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #220 on: December 30, 2018, 12:25:16 AM »
Thank you for your thoughts on this. As to a problem with this style of debugging, so far as I know, there is no better way of debugging an extremely difficult problem of this nature; unless you are aware of one?

In any event, I am now having trouble reproducing the loss of synchronisation. I have just been connected for ~2 hours with logging enabled and the trains on the Northern Frontier Express have made more than one complete circuit and there was no loss of synchronisation at all. I very much doubt that the problem has fixed itself (or that Ranran's UI improvements have had any effect), so this seems to be more intermittency. It is all very odd, but makes finding the problem much harder. I had been hoping to obtain some more logs of loss of synchronisation to test whether or not the loss of synchronisation always happens in the same place as revealed by the logs. I will try again to-morrow.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2709
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #221 on: December 30, 2018, 02:18:33 PM »
Any sort of code change can move data around in memory.

Might be worth testing a reverted build to see if it comes back. This could explain some of the consistency issues.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #222 on: January 01, 2019, 10:50:11 PM »
It is hard to see how having different memory locations would affect the loss of synchronisaiton bug - can you clarify how this would actually work?

I have re-tested and am still unable to reproduce this. I should be very grateful if anyone else could try connecting to the server to see whether the loss of synchronisation can still be reproduced. If not, I will try reverting the server saved game version to an earlier version in which time had not advanced so much and the issue could reliably be reproduced.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2709
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #223 on: January 02, 2019, 12:44:15 AM »
Quote
It is hard to see how having different memory locations would affect the loss of synchronisaiton bug - can you clarify how this would actually work?
If the problem is the result of memory corruption or reading memory garbage then moving the code around or changing the size of a struct might change how things get packed in memory. This is likely different between Linux and Windows and possibly even other Linux builds which would explain how random the out of sync is.

The fact this bug is so obscure and hard to reliably recreate means that the cause must be something borderline insane.

The custom memory allocation used by Simutrans for performance might not be helping detecting such issues. Specifically standard instrumentation such as is often used to check for correct memory usage might not be compatible with the custom allocation model. Have we tried debugging a build where it is bypassed for normal malloc/free (or new/delete) calls?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #224 on: January 02, 2019, 01:08:17 AM »
This bug is very obscure, but not so random: there was a time when it would very reliably be triggered whenever a train arrived at St. Mary Beddington heading to Elmley. Removing the St. Mary Beddington stop then produced the exact same symptoms (reliably) two stops down the line. This suggests that the cause is not random, but chaotic: deterministic between different machines and sessions, but ultra-complex in cause (at least in the specific instance). Corrupt memory would be far more likely to cause desyncs at random whenever a particular part of the code were used. The symptoms suggest a logic bug, albeit a highly complex and obscure one, rather than something lower level such as memory corruption.

As to disabling the custom memory allocation, the trouble is that any debug build with this map is almost unusably slow, and one with memory optimisations disabled would be worse still (there is a preprocessor directive that will do this, I think).

I am planning to build a new high performance computer this year, which might well make it easier to debug large maps, but how much easier remains to be seen.

Offline ACarlotti

  • *
  • Posts: 474
Re: Instability on the Bridgewater-Brunel server
« Reply #225 on: January 02, 2019, 04:01:56 AM »
I have now got myself a new (powerful) laptop, so, if time permits, I might now be able to investigate the desyncs. I'd like to start by taking a known bad save and seeing if I can reproduce the desync locally (probably running one of the client or server on a Windows virtual machine). It sounds like the current save might not be causing desyncs, so perhaps it would be helpful if you posted a suitable version of the savegame somewhere. This would also make it easier to access the savegame without having to build a matching pakset and client to connect to the server.

Incidentally, one of my issues seems to have been due to an unexpected version mismatch arising from my client having a nine-character revision while the server had a seven-character revision. I'm not sure why the lengths are different by two characters (one character would be more understandable). There also appears to be an assumption in revision.jse that the revision length is nine characters, but the server breaks this assumption.
A fix for this would be simple - replace the two instances of "git rev-parse --short HEAD" with "git rev-parse --short=9 HEAD"
« Last Edit: January 02, 2019, 04:15:02 AM by ACarlotti »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #226 on: January 02, 2019, 12:13:21 PM »
Thank you for looking into this: that is most kind. I am currently away from home, where I have a few more versions on my computer there, so the only version that I have available is the network game as it was actually running a short while (~ 2 game years) after the network issue was first detected. It can be downloaded here. You can increase the performance of this game by liquidating all players except Bay Transport from the command line by running the game as a server and using nettool.

Note, however, that I have never been able to reproduce the loss of synchronisation without actually connecting to the server (be it with a Linux or Windows client): the usual testing method of running a local server and connecting to it with a local client on the loopback interface has never been able to reproduce this error. That seems to be a clue itself as to the cause, but to what it is pointing I have no idea.

As to the server and client giving different numbers of characters for the revision, that is rather odd, especially as they are both compiled on the same machine (albeit the Windows ones are cross-compiled with different libraries). I will, however, implement your suggested fix: thank you for that.

Edit: Being away from my normal development environment, I am having trouble finding where git rev-parse --short HEAD is located: can you remind me? Thank you.

Offline ACarlotti

  • *
  • Posts: 474
Re: Instability on the Bridgewater-Brunel server
« Reply #227 on: January 02, 2019, 01:22:41 PM »
especially as they are both compiled on the same machine

Actually they weren't, because I was using a self-compiled version. Perhaps that explains why the issue hasn't previously been encountered.

I've pushed the revision name fix to Github.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #228 on: January 02, 2019, 01:52:00 PM »
Splendid, thank you: now incorporated.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #229 on: January 02, 2019, 11:12:23 PM »
I am now again able to reproduce the loss of synchronisation on the game as it is currently on the server when a train arrives at St. Mary Beddington heading for Elmley. This was reproduced consistently twice in a row, and the second time, I logged it. Here is the relevant extract from the log:

Code: [Select]
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Warning: karte_t:::do_network_world_command: sync_step=73568  server=[ss=73568 st=18392 nfc=0 rand=1576148348 halt=4097 line=1025 cnvy=4097 ssr=2991505728,2840015283,0,0,0,0,0,0 str=2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2501980820,1576148348,2536695661,641877393,2840015283 exr=0,0,0,0,0,0,0,0  client=[ss=73568 st=18392 nfc=0 rand=1576148348 halt=4097 line=1025 cnvy=4097 ssr=2991505728,2840015283,0,0,0,0,0,0 str=2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2501980820,1576148348,2536695661,641877393,2840015283 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Warning: karte_t:::do_network_world_command: sync_step=73572  server=[ss=73572 st=18393 nfc=0 rand=4030280172 halt=4097 line=1025 cnvy=4097 ssr=390225914,3158243328,0,0,0,0,0,0 str=3158243328,3158243328,3158243328,3158243328,3158243328,2091714030,2091714030,2091714030,2091714030,2091714030,2091714030,4030280172,4030280172,2543500209,642516720,2091714030 exr=0,0,0,0,0,0,0,0  client=[ss=73572 st=18393 nfc=0 rand=2998080384 halt=4097 line=1025 cnvy=4097 ssr=390225914,3158243328,0,0,0,0,0,0 str=3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,2998080384,2998080384,2543500209,642516720,3158243328 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
World finished ...
Show banner ...

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Instability on the Bridgewater-Brunel server
« Reply #230 on: January 03, 2019, 01:19:13 AM »
Same spot, but you should be running checks every frame for debugging.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #231 on: January 03, 2019, 01:53:57 AM »
Interesting - thank you. By checks every frame, you mean the server setting? I can change this fairly easily if so.
Edit: Done - this should take effect when the server restarts to-morrow morning.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #232 on: January 03, 2019, 03:59:49 PM »
Another round of testing with the check set to every frame rather then every 32 frames: the first batch of trains to pass St. Mary Beddington did not trigger the loss of synchronisation. However, waiting for the trains to go all the way around the line and back again did trigger it. Here is the relevant part of the debug output:

Penultimate message
Code: [Select]
Message: network_command_t::rdwr:   read packet_id=9, client_id=0
Warning: network_check_activity():   received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK:   time difference to server -66
Warning: karte_t:::do_network_world_command:   sync_step=129523  server=[ss=129523 st=32380 nfc=3 rand=1421316284 halt=4097 line=1025 cnvy=4097 ssr=4080282562,1421316284,0,0,0,0,0,0 str=3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3257590584,1337000643,1883002831,3548628848 exr=0,0,0,0,0,0,0,0  client=[ss=129523 st=32380 nfc=3 rand=1421316284 halt=4097 line=1025 cnvy=4097 ssr=4080282562,1421316284,0,0,0,0,0,0 str=3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3257590584,1337000643,1883002831,3548628848 exr=0,0,0,0,0,0,0,0 

Final message
Code: [Select]
Warning: karte_t:::do_network_world_command:   sync_step=129524  server=[ss=129524 st=32381 nfc=0 rand=2141574950 halt=4097 line=1025 cnvy=4097 ssr=1421316284,3615887852,0,0,0,0,0,0 str=3615887852,3615887852,3615887852,3615887852,3615887852,1588385680,1588385680,1588385680,1588385680,1588385680,1588385680,46953431,2141574950,1343811081,1883642638,1588385680 exr=0,0,0,0,0,0,0,0  client=[ss=129524 st=32381 nfc=0 rand=1588385680 halt=4097 line=1025 cnvy=4097 ssr=1421316284,3615887852,0,0,0,0,0,0 str=3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,2997474060,1588385680,1343811081,1883642638,3615887852 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command:   disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect():   Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:   all static variables are reset
World finished ...
Show banner ...

Offline ACarlotti

  • *
  • Posts: 474
Re: Instability on the Bridgewater-Brunel server
« Reply #233 on: January 06, 2019, 05:17:40 AM »
I have been attempting to reproduce a deync 'locally', using the save from the 5th of October that was linked a few days ago. I have tested running a server in Linux on my new laptop and connecting to it from a Windows VM on the same laptop, but have so far been unable to reproduce a desync by this method. I have also tried running a server on my old laptop (over seven years old and with only 2GB of RAM) which is also running Linux. It seems to be able to run the server, but even with a heavily reduced save it runs very slowly. I do, however, eventually get a desync, which I think happens almost immediately after the server starts running.

Code: [Select]
Warning: karte_t:::do_network_world_command:    sync_step=8  server=[ss=8 st=1 nfc=0 rand=1302837895 halt=2049 line=1025 cnvy=4097 ssr=552184448,3172636355,0,0,0,0,0,0 str=3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,2818197206,2818197206,2818197206,2818197206,1302837895,1302837895,138807,13216,2818197206 exr=0,0,0,0,0,0,0,0  client=[ss=8 st=1 nfc=0 rand=221087431 halt=2049 line=1025 cnvy=4097 ssr=552184448,3172636355,0,0,0,0,0,0 str=3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,221087431,221087431,138807,13216,3172636355 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command:    disconnecting due to checklist mismatch

My most recent testing has been using the save here. It is produced by modifying the code to remove almost all of the vehicles, stations and objects outside a bounding box for the Northern Frontier Express, along with a few bits that were manually removed. This seems to have led to some slight glitches - in particular, attempting to delete the partially removed bridge in the northern chain crashed the game. I have, however, achieved a significant reduction in memory usage (3GB, down from about 3.7GB after removing other companies, or 6GB with other companies included), and running speed, which may help with debugging efforts.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #234 on: January 06, 2019, 12:26:51 PM »
Interesting - thank you for your work on this. It is also interesting that you, too, cannot reproduce a loss of synchronisation on the loopback interface. I suspect that there must be some significance in this, but it is very hard to imagine what it is at this stage.

It would be very interesting to know whether the loss of synchronisation that you are able to reproduce occurs when trains on the Northern Frontier Express arrive at St. Mary Beddington station heading towards Elmley.

Offline ACarlotti

  • *
  • Posts: 474
Re: Instability on the Bridgewater-Brunel server
« Reply #235 on: January 07, 2019, 01:01:02 AM »
I've had a closer look at the checklists, and it seems that we are looking at different causes of desyncs. My desync happens on the first step and arises during passenger generation, where the server seems to request one or more random numbers while the client does not.

Your desync however occurs (as might be expected) in the part of the code handling convoy stepping. This is between lines 5455 and 5490 in simworld.cc - random seeds written in line 5455 match, but the random seeds written in line 5490.

I don't know whether my desync is triggered by the extreme conditions I'm running under (i.e. a very underresourced laptop), or something inconsistent in the save that I am using - perhaps you could see what desync you get if you run that save on your server.

Offline ACarlotti

  • *
  • Posts: 474
Re: Instability on the Bridgewater-Brunel server
« Reply #236 on: January 07, 2019, 07:26:54 AM »
I've pushed some debugging changes to Github:
1) Change goods->cargo so that we can compile with DEBUG_SIMRAND_CALLS
2) Add the thread-local seed to the debugging output, to allow calls from different threads to be distinguished (using the seed seemed to be the easiest way to produce an identifier that always exists and ought to be unique*).
This latter change should help me to debug the desync I'm getting in the passenger generation, since it seems to be caused by an unwanted call to the main thread simrand.

*I've also changed the seeds used for the passenger generation thread to make them actually unique - previously the first passenger generation thread used the same seed as the main thread.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #237 on: January 07, 2019, 08:59:45 PM »
Thank you for this work and my apologies for the delay in replying: I have been returning from my holiday to-day.

I have now incorporated and pushed your patch, so this should be running on the server from to-morrow morning onwards. Do you still need me to run your saved game on the server?

Thank you again for all your work on this.

Offline ACarlotti

  • *
  • Posts: 474
Re: Instability on the Bridgewater-Brunel server
« Reply #238 on: January 07, 2019, 09:47:01 PM »
Do you still need me to run your saved game on the server?
I suspect that it wouldn't be of much use - I'm haven't managed to reproduce that desync myself. On my latest run I managed to reach sync_step=46 before the client was somehow aborted (my mistake probably). On the other hand, my old laptop reached sync_step=8000(ish) before something happened to it, so it might just be feasible to run it as far as a desync.

I would suggest that the next step should be to run both a client and a server with DEBUG_SIMRAND_CALLS defined, so that we can hopefully get a better idea of where the deviation is arising. However, that might not be practical without further simplifying the game, due to the volume of debugging output that would give. In particular, for about 8000 sync steps I produced a log of over 1GB - a similar log for the duration that your test ran for would write over 16GB.

I'm going to try commenting out chunks of code and see what happens.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #239 on: January 07, 2019, 10:57:56 PM »
Thank you - that is very helpful.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #240 on: January 09, 2019, 12:11:12 AM »
I have been trying to deduce what calls to simrand() that there might be during convoi_t::step(). There are actually very few places where simrand() is called within the convoy stepping, but one place is in void haltestelle_t::liefere_an(ware_t ware, uint8 walked_between_stations) at line 3269, where pedestrian_t::generate_pedestrians_at() is called.

This has in the past been a place where a loss of synchronisation first manifested itself as a result of convoys stopping at stops at a different time. It is consistent with observed behaviour that this is what is happening: the loss of synchronisation occurs fractionally before the train arrives at St. Mary Beddington (on the client) and the server is recorded as having a call to the random number generator that the client does not have: what I infer is probably happening is that, for some reason, the train is arriving at the station on the server a few steps ahead of the step at which it arrives on the client.

There are two possible causes for this type of loss of synchronisation:
(1) the train leaves earlier on the server than on the client; or
(2) the train travels faster on the server than it travels on the client.

The latter seems unlikely, as this would be an issue with the physics engine, and it is very unlikely that this would consistently occur in exactly one place and one place only (and then migrate further down the line when the stop at that station is removed).

That leaves the possibility that the train leaves the previous station slightly earlier on the server than it leaves on the client. I did wonder whether this might have been the trouble some time ago, but investigations into the code for deciding when the convoy would leave the previous station did not reveal anything that looked as if they might possibly cause any loss of synchronisation, and I note that that part of the code is run in the main thread.

However, I do note that there has been a bug report relating to overloading of trains in some cases beyond their overcrowded capacity. It is just possible that this might be related, as there is code that is intended to increase loading time as the number of passengers or amount of goods the board a vehicle increases. I will have to look into this bug and see whether solving it assists this issue.

Offline ACarlotti

  • *
  • Posts: 474
Re: Instability on the Bridgewater-Brunel server
« Reply #241 on: January 09, 2019, 05:26:06 AM »
It sounds like a useful test would be to simultaneously run a client that typically desyncs, and a client that typically does not desync. When a desync happens, have a look at the passengers on the train that just arrived at St. Mary Beddington, as well as those listed as 'transferring' to St. Mary Beddington in the station window, and see if there are any differences.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #242 on: January 09, 2019, 09:09:37 AM »
Thank you for that suggestion; however, I am not sure that there is a specific client that does and a specific client that does not reliably lose synchronsation at the same time; I am not aware of any test that has identified any such client.

Probably easier at this stage would simply be to fix the known issue with passenger loading and see whether that assists - even if it does not, it is not work wasted. It appears that the  link for the file containing the reproduction case is broken, so I am waiting on Spenk009 to re-upload that before proceeding.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2709
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #243 on: January 09, 2019, 09:19:08 PM »
One could rule out the first case being a problem by printing the step that all convoys depart and arrive at stations. In worst case this could go to a separate log file. Unlike random calls themselves, this log should be a lot more manageable and could even be filtered for the specific convoys that are causing problems.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18683
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #244 on: January 09, 2019, 09:28:23 PM »
Thank you for that suggestion. I am not sure whether it will work, but I have added a line of code that will show a debug output when a convoy is set to depart, but only with the debug level set to 4, the highest setting. Hopefully, this will assist.