Desync issue (devel-new-2) with Linux Server/Windows client

Started by Ves, October 22, 2016, 09:03:44 PM


Felix

The issue is also definitely savegame dependent. With a copy of the bridgewater-brunel savegame that I used in a local game for some time, I do not get the immediate disconnect with a local host.

Felix

As said earlier, I am not sure whether the extra package with the experimental-specific simuconf.tab etc. is still needed. I am currently using the configuration files from the devel-new-2 branch.

jamespetts

Quote from: Felix on January 14, 2017, 08:31:14 PM
The issue is also definitely savegame dependent. With a copy of the bridgewater-brunel savegame that I used in a local game for some time, I do not get the immediate disconnect with a local host.

Interesting - how long does it take before you desync?

There may be multiple, separate desync issues, of course.

What do you mean about the package with the experimental-specific simuconf.tab? Do you mean the .zip file distributed with the old release binaries from long ago? This ought not in principle to be an issue, since all of the configuration settings are saved with the saved game and transferred to the client when it first connects to the server, overriding any configuration settings in the client's simuconf.tab. In any event, the simuconf.tab from Github should be the most up to date version.

The Bridgewater-Brunel server has its own modified simuconf.tab to allow for settings specific to that server, such as the administrator's (i.e. my) e-mail address, a description, etc.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

With my local savegame the client desyncs only after like 10 min, but also with a mismatch of the random numbers.


ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1289,1610,9 to (1306,1717,9) at 1289,1615, best = 1590, cost = 50, heur = 1620, dist = 109, turns = 1461

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command: sync_step=11776  server=[ss=11776 st=1472 nfc=0 rand=3328960089 halt=1 line=1 cnvy=1025 ssr=3461460419,3328960089,0,0,0,0,0,0 str=3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,1688219608,143636616,3328960089 exr=0,0,0,0,0,0,0,0  client=[ss=11776 st=1472 nfc=0 rand=3461460419 halt=1 line=1 cnvy=1025 ssr=3461460419,3461460419,0,0,0,0,0,0 str=3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,1688219608,143636616,3461460419 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...
Show banner ...
Message: karte_t::reset_timer(): called, mode=$0
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1336,2179,5 to (1337,2182,5) at 1337,2182, best = 70, cost = 70, heur = 700, dist = 0, turns = 630

Felix

And yes, I was talking about that zip file. But if it is not needed, I should have a correct configuration.

jamespetts

It is a mismatch of the random numbers that is the more usual type of desync that is hard to track down, especially if it only happens every 10 minutes (meaning that each small change needs 10 minutes to be tested to see whether it makes a difference).

Either the cause of the desyncs on the Bridgewater-Brunel server is different to that on a local server (which are still seriously problematic), or they are both related to the same thing, but for some reason causing a desync more quickly on the Bridgewater-Brunel server.

Felix

Where are the numbers in rand[] ("ssr" in the message) calculated? It is interesting that on the server the value for rand[1] is identical to the seed ("rand" in the message), while on the client both rand[0] and rand[1] are identical to the seed. The server's value for rand[0] matches the seed value from the client.

TurfIt

ssr = sync_step randoms  karte_t::sync_step()
str = step randoms  karte_t::step()
These were just extra check points added to help track down desyncs 2 (3? 4?) years ago. I'd rather have expected them to have been removed once troubleshooting was over...

To make use of them, you'll want "server_frames_between_checks = 1" on the server. And then shuffle around where the current state of the randoms is captured into the checklist. IIRC the previous desyncs were all in the step - str numbers, so ssr is just showing the state of the random numbers at the beginning and end of the sync_step. For the log posted, it would indicate the server is using a random number somewhere in a sync_stepped object that the client is not. You'd need to break up the sync step to be by object and add more capturing to try and use these to find the possible issue.
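As a rough aid for reading such log lines, the sketch below splits a checklist-mismatch line into its server and client halves and reports, for each field that differs, the first index at which the values diverge. It is a minimal sketch and not part of Simutrans: the function names are hypothetical, the sample line is abbreviated from the log posted earlier in this thread, and the parsing assumes the "name=[key=value ..." layout seen there (including the missing closing bracket).

```python
import re

# Abbreviated from the log posted above; most fields are dropped for brevity.
SAMPLE = (
    "sync_step=11776  "
    "server=[ss=11776 st=1472 rand=3328960089 "
    "ssr=3461460419,3328960089,0,0 exr=0,0  "
    "client=[ss=11776 st=1472 rand=3461460419 "
    "ssr=3461460419,3461460419,0,0 exr=0,0 "
)


def parse_checklists(line):
    """Split a checklist log line into {side: {field: [ints]}}."""
    sides = {}
    # Each side starts with e.g. "server=["; a section runs until the next
    # "name=[" or the end of the line, since the posted logs never close it.
    parts = re.split(r"(\w+)=\[", line)
    # parts looks like [prefix, name1, body1, name2, body2, ...]
    for name, body in zip(parts[1::2], parts[2::2]):
        fields = {}
        for key, value in re.findall(r"(\w+)=([\d,]+)", body):
            fields[key] = [int(v) for v in value.split(",") if v]
        sides[name] = fields
    return sides


def first_differences(a, b):
    """Map each differing field to the first index where a and b disagree."""
    diffs = {}
    for key in a:
        if key in b and a[key] != b[key]:
            shortest = min(len(a[key]), len(b[key]))
            diffs[key] = next(
                (i for i in range(shortest) if a[key][i] != b[key][i]),
                shortest,  # lists agree up to the shorter length
            )
    return diffs


sides = parse_checklists(SAMPLE)
print(first_differences(sides["server"], sides["client"]))
```

For the abbreviated sample this reports that "rand" differs at index 0 and "ssr" at index 1, which is the pattern discussed above: the first ssr value agrees and the second does not.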

jamespetts

The "rand=..." is the random seed on the client and server respectively. I am not entirely sure what the st and ssr are (I did not write this code), nor quite where the long list of numbers comes from. (Thank you TurfIt for answering whilst I was typing this reply - that is most helpful).

Normally, a desync of this sort is caused by divergence between server and client somewhere (it is usually extremely hard to find where), normally caused by some sort of indeterminism (which could be caused by undefined behaviour, incorrect implementation of multi-threading, a reference to an indeterminate variable or a failure to transmit all of the necessary information from the server to the client in the first place).

I usually find that the best way to fix this sort of problem is to try to narrow down the part of the code in which it occurs either by testing to see into which part of the code that it was introduced, or by selectively disabling parts of the code using preprocessor directives and seeing which parts need to be disabled in order for client and server to stay in sync.
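To illustrate why a single stray use of the random number generator is enough to cause this kind of desync, here is a minimal sketch using Python's own seeded generator rather than Simutrans's simrand. The function name, seed, and step counts are all arbitrary assumptions for the illustration.

```python
import random


def simulate(steps, extra_draw_at=None, seed=2017):
    """Return the value drawn at each step from a seeded generator."""
    rng = random.Random(seed)
    drawn = []
    for step in range(steps):
        if step == extra_draw_at:
            rng.getrandbits(32)  # the stray draw that only one side makes
        drawn.append(rng.getrandbits(32))
    return drawn


server = simulate(10)
client = simulate(10, extra_draw_at=5)

# Both sides agree up to the step with the stray draw.
print(server[:5] == client[:5])
# From then on the client is simply one draw ahead of the server.
print(client[5:] == simulate(11)[6:])
```

Both sides start from the same seed and agree until the step with the extra draw; after that the client's stream is permanently shifted by one, so every later comparison of generator state fails even though nothing else has gone wrong.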

Felix

I tried something slightly different, by logging all calls of simrand. Sadly, the information is quite difficult to interpret. At first glance, it seems that karte_t::generate_passengers_and_mail gets called on the client at some point while it does not get called on the server at the same time. From that point on, the random numbers seem to be out of sync.
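One way to make such call logs easier to interpret is to compare the server and client logs line by line and stop at the first difference. The sketch below assumes a hypothetical log format with one simrand call per line; the real logging output was not posted, so the line contents here are invented purely for illustration.

```python
from itertools import zip_longest


def first_divergence(server_lines, client_lines):
    """Return (index, server_line, client_line) at the first mismatch,
    or None if both logs are identical. A missing line shows up as None."""
    pairs = zip_longest(server_lines, client_lines)
    for index, (server_line, client_line) in enumerate(pairs):
        if server_line != client_line:
            return index, server_line, client_line
    return None


# Hypothetical log lines: "sync_step caller".
server_log = [
    "11770 convoi_t::sync_step",
    "11771 convoi_t::sync_step",
    "11772 convoi_t::sync_step",
]
client_log = [
    "11770 convoi_t::sync_step",
    "11771 karte_t::generate_passengers_and_mail",
    "11771 convoi_t::sync_step",
    "11772 convoi_t::sync_step",
]

print(first_divergence(server_log, client_log))
```

Only the first mismatch is meaningful; everything after it is shifted and will differ as a consequence, which is why the raw logs look so hard to read.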

jamespetts

Interesting. Was this a single-threaded or multi-threaded build?

Felix

This was in a multithreaded build. I was not really able to replicate this in a singlethreaded build. The interpretation might also be plainly wrong.

jamespetts

This could be a problem with the multi-threaded passenger generation, in that case. For me, a single threaded build on the loopback interface has stayed connected for some time.

How long did it take to desync in the multi-threaded build?

Edit: Could you try to see whether it desyncs with a multi-thread build with the preprocessor directive FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined?

Felix

With the bridgewater-brunel savegame I also had desyncs with singlethreaded builds, usually leading to an immediate crash of either the server or client.

Still trying with the flag, now.

jamespetts

Thank you - that is helpful. Did you get desyncs with the britain-3.sve file with single threaded mode? I could not get desyncs with that despite running it for about four hours this afternoon/evening in a single threaded build.

(The trick to increasing the efficiency of fixing this bug is to find a saved game that will reliably cause a desync quickly).

Felix

A build with the flag set (-DFORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE) still gives me an immediate desync on connect.

Multithreading was accidentally disabled in that build. So even without it, I get an immediate disconnect with the britain-3 savegame.

jamespetts

With britain-3.sve or with the very similar but perhaps subtly different saved game saved from the Bridgewater-Brunel server?

Edit: I should note that I have been connected with that flag enabled since before I wrote the last message and am still connected now - on the loopback server.


jamespetts


Felix

Same result :-(

If the britain-3 or the server's savegame is involved, I seem to get an immediate disconnect no matter what. With a copy of the server's savegame that I used locally for some hours, I only get a delayed disconnect after like 10 min.

jamespetts

I have to say, I am finding it exceedingly difficult to understand why you are getting a different result to that which I am getting. The only thing that I can think to suggest now is for you to try older versions to see where this problem first arose. If this involves going any further back than late December, this will get very complicated indeed because from about October to December, I was adding multi-threading features, which involved lots of commits adding, disabling, then re-enabling (often many times over) a set of about four or five independent sets of multi-threading code, so it will not be a simple matter of going backwards and finding a version in which a desync does not occur.

I do find it very perplexing that you are getting rapid desyncs in a single threaded build, however, which I have not had (with a multi-threaded build) when testing connecting a Linux machine to the Bridgewater-Brunel server. I really cannot understand this at all.

Felix

I already noticed that it gets complicated before December. One interesting aspect is also that it seems to be savegame related. With other savegames, I do not get the immediate disconnect.

Maybe I am, or we both are, overlooking something. Could my setup differ in any significant aspect?

jamespetts

I cannot think of anything configuration specific that could make a difference; but could you post your config.default file just in case?

Felix

Sure (I added the .txt extension to be able to attach it).

jamespetts

I cannot see anything in there that seems to be problematic.

I should say that I was just about to test whether I could reproduce your results under Linux using my NUC which runs Ubuntu, but that device failed (to the extent that I am now organising a warranty return) as I was doing that, so I am afraid that I will not be able to do any Linux testing myself for a few weeks until the replacement item is sent to me and I am able to set it up.

Edit: It is rather a long shot, but do you think that you could try with SDL rather than SDL2 to see whether this makes any difference?

Felix

Might be worth a try. SDL2 also has another issue ;-) When I resize the window, the game crashes.

jamespetts

Thank you - do let me know how you get on.

I should say that the debug build Windows versions with the multi-threading of passenger generation disabled are still connected. I shall set up a release build to try overnight to see whether that makes any difference.

Felix

The result with SDL 1 is pretty similar:


ERROR: route_t::intern_calc_route():    Problem with heuristic:  from 1021,1369,5 to (1036,1459,8) at 1022,1369, best = 1554, cost = 10, heur = 3340, dist = 96, turns = 3234

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command:    skipping command due to checklist mismatch : sync_step=280 server=[ss=280 st=35 nfc=0 rand=786208633 halt=1 line=1 cnvy=1025 ssr=1005465115,1005465115,0,0,0,0,0,0 str=1005465115,1005465115,1005465115,1005465115,1005465115,1005465115,1005465115,786208633,786208633,786208633,786208633,786208633,786208633,1235700,105473,786208633 exr=0,0,0,0,0,0,0,0  executor=[ss=280 st=35 nfc=0 rand=4269245549 halt=1 line=1 cnvy=1025 ssr=2623138960,2623138960,0,0,0,0,0,0 str=2623138960,2623138960,2623138960,2623138960,2623138960,2623138960,2623138960,3881930485,3881930485,3881930485,3881930485,4269245549,4269245549,1235700,105473,3881930485 exr=0,0,0,0,0,0,0,0 
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:      all static variables are reset
Message: karte_t::reset_timer():        called, mode=$0
Segmentation fault

jamespetts

That is exceedingly odd. Are you able to check older versions to see when this fault was first introduced? A good start might be the 1st of January: after the implementation of all the multi-threading, but before some of the work that I have done this year.

Felix

It looks like it is a configuration issue. The pakset name that the savegame uses actually matches a different pakset (the one from the nightly builds page) in my setup, while the client uses the custom-built one. So the desync is likely caused by the pakset mismatch. The crashes are still valid issues. Sorry for the wasted time :-(


This really solves the immediate desync :-/

Felix

Sadly, connecting to bridgewater-brunel.me.uk still results in a desync, but the server also claims to be a different version (the commit id seems to not exist, though).

jamespetts

Thank you for testing this. Can you clarify the circumstances, if any, in which you now get (1) a crash; and (2) a desync using the loopback interface?

Also, I have not encountered this issue before with the name used by the saved game for the pakset causing desyncs, and I am not really sure why it would do this. Can you let me know more about how you traced the problem to this issue? Are you sure that it is the name itself causing it? It is hard to see any means by which this could happen.

As to the Bridgewater-Brunel server, I am having problems with getting the correct version to work on that: see here for an explanation, including a description of a very bizarre problem that I am currently unable to resolve, preventing me from having usable version numbers on this server.

Felix

On the loopback interface I did not observe any more desyncs with the latest version.

In my test setup, the server was accidentally loading a different version of the pakset, which had the name expected by the savegame. The client was running with a newer version built from the sources. This caused an immediate desync right after connecting. I was setting everything on the command line to simplify testing. In-game, both versions of the pakset are unfortunately referenced by the same name, which made me miss the error.
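A mix-up like this, where two folders carry the same pakset name but different contents, can be caught outside the game by hashing both folders and comparing the results. The sketch below is a generic check and not something Simutrans provides; the function name is hypothetical and the folder paths in the usage comment are placeholders.

```python
import hashlib
import os


def directory_digest(root):
    """Return a SHA-256 digest over the relative paths and contents of
    every file under root, visited in a deterministic order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order across platforms
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            relative = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(relative.encode("utf-8"))
            digest.update(b"\0")
            with open(path, "rb") as handle:
                # Read in chunks so large pak files are not loaded whole.
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            digest.update(b"\0")
    return digest.hexdigest()


# Usage with placeholder paths:
# same = directory_digest("/path/to/server/pakset") == directory_digest("/path/to/client/pakset")
```

If the two digests differ, the server and client are not loading the same objects, whatever name the pakset reports in-game.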

jamespetts

You had two different versions of the same pakset with the same name installed?

In any event, running overnight with release builds on Windows with FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined and the britain-3.sve saved game, I get no desyncs either.

When you say that you get no desyncs with the latest version, is that with or without FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined, and after how long a time of running is that?

Felix

I had two versions of the same pakset in different folders.

The game was running with and without the flag and neither raised an immediate desync. I only ran the game for 15 min in both tests, though.