The International Simutrans Forum

Topic started by: Ves on October 22, 2016, 09:03:44 PM

Title: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 22, 2016, 09:03:44 PM
Moderator note: This has been split from the server topic to focus the discussion on the desync issue. For an introduction to the issue, see the post here (http://forum.simutrans.com/index.php?topic=15841.msg155448#msg155448).

I have tried to connect now, and I get desyncs quite often. I have not yet tracked down what is causing it, but the game appears to lag a bit in general (the messages at the bottom are quite jerky).
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: jamespetts on October 22, 2016, 09:17:03 PM
I have noticed the desyncs, too, and am trying to track them down, which is never easy, as this is the hardest type of bug to solve. That was one reason that I was keen for the other server to be running, to see whether you get desyncs on that. Do you think that you could try the saved game from the Bridgewater-Brunel server on your server to see whether you also get desyncs?
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: Ves on October 22, 2016, 09:20:46 PM
I don't run the server; that is Vladki. I only upload "devel-new" builds to it :)
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: jamespetts on October 22, 2016, 09:31:47 PM
Ahh - my apologies. I am afraid that I confuse the two of you sometimes.

Edit: I am unable to reproduce the desyncs on a local server (i.e. client and server running on my computer at home), so it would be extremely helpful if Vladki could run a controlled test by uploading the same map to his server and seeing whether we get desyncs there.
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: Ves on October 22, 2016, 11:03:10 PM
I don't know anything special about servers, but I made a quick comparison between your server and the other one:

(http://server.exp.simutrans.com/screenshots/bridgewater-brunel.me.uk.jpg)

(http://server.exp.simutrans.com/screenshots/server.exp.simutrans.com.jpg)

I don't know whether anybody can make anything of this?
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: jamespetts on October 22, 2016, 11:16:34 PM
Thank you for testing that. I do not think that it is a performance issue: I think that it is an issue with the code of some sort.
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: DrSuperGood on October 22, 2016, 11:37:56 PM
I would recommend giving the full path to the server (port included) as well as where to download the pak and the executable to run the server. It has to be very user friendly for people to be able to join.
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: jamespetts on October 22, 2016, 11:41:05 PM
The port is the default port, so it should not need to be specified manually. The port will become relevant if I decide to run a second game on this server.

As to the download link, I do not want to encourage players other than testers at present, as there will at present be frequent changes to the server and erasing of saved games without notice.

Edit: Having rebuilt the server and cleared out all the old files that had accumulated over time, taking a fresh commit from Github (and upgrading the version of Linux that it runs in the process), I still get desyncs, so this is not a simple issue of the server having a slightly incorrect version of one of the source code files. This may take considerable work to fix.
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: DrSuperGood on October 23, 2016, 03:54:16 AM
Quote
I do not want to encourage players other than testers at present
Which is difficult if no one knows where to download...

Quote
Edit: Having rebuilt the server and cleared out all the old files that had accumulated over time, taking a fresh commit from Github (and upgrading the version of Linux that it runs in the process), I still get desyncs, so this is not a simple issue of the server having a slightly incorrect version of one of the source code files. This may take considerable work to fix.
I am guessing a lot of object types were added to Simutrans Experimental to support the new signalling. Make sure such objects are loaded deterministically between clients since Simutrans loads in parallel as far as I can tell.

Make sure everything which alters game state (commands) is synchronized and not run locally.

Also maybe it is a false positive. It might be worth disabling the forced disconnect and seeing what or where stuff starts to go out of sync.
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: jamespetts on October 23, 2016, 12:38:22 PM
The only new object types are signalboxes; but I do not think that this is an object types issue, as client and server stay in sync even on a dense map (with lots of signalboxes) for many hours (overnight) when both client and server are on the same computer connecting via the loopback interface.

I do not think that multi-threading is an issue, as the desync occurs even when the server is single threaded, but does not occur when both client and server (on the loopback interface locally) are multi-threaded, nor when both local client and remote server are multi-threaded.

This is not the old electrification issue, since this occurs even when there are no power lines or substations on the map, and does not occur when there are power lines and substations on the map when both client and server are connected by the loopback interface locally.

It is also not an interaction issue (i.e. one related to not properly sending/receiving commands), as the desync occurs when idle without any commands being sent by any player.

The Bridgewater-Brunel server runs Linux whereas my computer at home runs Windows - I wonder whether this might be relevant. This was an issue once many years ago, when the problem transpired to be that Windows and Linux builds dealt with imprecision in floating point arithmetic differently, but all floating point in running code was abandoned after that, so it is hard to see what the problem might be.

Can anyone try connecting with a Linux machine to see whether that remains stable? I might try to connect using my Linux NUC that I use in work, but I think that the only cable that I can use to connect it to my monitors at home may have failed, so this may not be possible.

Edit: I have been able to get my NUC working (the HDMI cable issue seems to be intermittent), and confirm that I can connect to both the Bridgewater-Brunel and the server.exp.simutrans.com servers without desyncs from a Linux build of the latest devel-new-2 branch, whereas I cannot connect stably to either from a Windows build. This is very odd: I cannot think of anything other than floating point arithmetic, which has been eliminated, which might cause this. Has anyone any ideas of what might run differently in Linux and Windows?

Edit 2: Running it through Dr. Memory, I get the following suspicious entry (but it is odd, as no actual crash is encountered in the game):

Code:
Error #1: UNADDRESSABLE ACCESS: reading 0x4d6f8d40-0x4d6f8d44 4 byte(s)
# 0 _longest_match                         [F:\Develop\vs140\build\zlib-1.2.8\contrib\masmx86\match686.asm:375]
# 1 inflateUndermine               
# 2 deflate                         
# 3 gzungetc                       
# 4 gzwrite                         
# 5 loadsave_t::flush_buffer               [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:605]
# 6 loadsave_thread                        [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:63]
# 7 pthreadVCE2.dll!pthread_setcanceltype +0x4bce   (0x71445eef <pthreadVCE2.dll+0x5eef>)
# 8 MSVCR100.dll!endthreadex              +0x39     (0x77f4c6de <MSVCR100.dll+0x5c6de>)
# 9 MSVCR100.dll!endthreadex              +0xe3     (0x77f4c788 <MSVCR100.dll+0x5c788>)
#10 KERNEL32.dll!BaseThreadInitThunk      +0x11     (0x7517336a <KERNEL32.dll+0x1336a>)
Note: @0:10:00.187 in thread 4556
Note: refers to 0 bytes(s) beyond last valid byte in prior malloc
Note: prev lower malloc: 0x4d6e8d40-0x4d6f8d40 here:
Note: # 0 replace_malloc                     [d:\drmemory_package\common\alloc_replace.c:2292]
Note: # 1 zcalloc                         
Note: # 2 deflateInit2_                   
Note: # 3 gzungetc                       
Note: # 4 gzwrite                         
Note: # 5 loadsave_t::write                  [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:587]
Note: # 6 loadsave_t::wr_open                [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:413]
Note: # 7 karte_t::save                      [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simworld.cc:6319]
Note: # 8 karte_t::interactive               [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simworld.cc:8765]
Note: # 9 simu_main                          [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simmain.cc:1363]
Note: #10 sysmain                            [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simsys.cc:805]
Note: #11 WinMain                            [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simsys_w.cc:968]
Note: instruction: xor    0x04(%edx,%edi,1) %eax -> %eax
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: DrSuperGood on October 23, 2016, 09:54:58 PM
Quote
Has anyone any ideas of what might run differently in Linux and Windows?
I assume you are getting a hash-based OOS and not one caused by a command in the past?

It can be anything from API calls acting slightly differently to differences in type sizes which are assumed the same. For example size_t and other API related structs can be different sizes on Windows and on Linux.

Rule out a 32-bit versus 64-bit issue by making sure that both the Linux and Windows builds use the same word size.
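
As an illustration of the type size point, here is a minimal sketch (not taken from the Simutrans sources; `pack_u32_le` is a hypothetical helper) of writing a value in a fixed layout, so that whatever is hashed or sent over the network does not depend on the platform's idea of `long` or `size_t`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-width types have the same size on every platform that provides them,
// whereas long and size_t need not: for example, long is 4 bytes on 64-bit
// Windows (LLP64) but 8 bytes on 64-bit Linux (LP64).
static_assert(sizeof(std::uint32_t) == 4, "uint32_t is 4 bytes");
static_assert(sizeof(std::int64_t) == 8, "int64_t is 8 bytes");

// Hypothetical helper: store a 32-bit value in a fixed little-endian layout,
// independent of the host's type sizes and byte order.
inline void pack_u32_le(std::uint32_t value, unsigned char out[4])
{
    out[0] = static_cast<unsigned char>(value & 0xFFu);
    out[1] = static_cast<unsigned char>((value >> 8) & 0xFFu);
    out[2] = static_cast<unsigned char>((value >> 16) & 0xFFu);
    out[3] = static_cast<unsigned char>((value >> 24) & 0xFFu);
}
```

Anything that ends up in the sync hash would ideally pass through fixed-width types like this rather than through types whose size depends on the compiler and target.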
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: jamespetts on October 23, 2016, 11:39:29 PM
I have checked, and this is the usual checklist mismatch desync, not a command being executed in the past (the desyncs occur even with no interaction by any player).
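
For readers unfamiliar with this kind of desync, the comparison can be pictured roughly as follows. This is only a sketch under assumptions: `sync_checklist` and its fields are hypothetical stand-ins, not the actual structure used in the code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the summary of game state that client and server
// compare at the same simulation step. The field names are assumptions.
struct sync_checklist
{
    std::uint32_t random_seed = 0;
    std::uint16_t convoy_count = 0;
    std::uint16_t halt_count = 0;
    std::uint16_t line_count = 0;
};

inline bool operator==(const sync_checklist& a, const sync_checklist& b)
{
    return a.random_seed == b.random_seed
        && a.convoy_count == b.convoy_count
        && a.halt_count == b.halt_count
        && a.line_count == b.line_count;
}

// Returns true if the client's state no longer matches the server's, which is
// the condition under which the client would be disconnected.
inline bool is_desynced(const sync_checklist& server, const sync_checklist& client)
{
    return !(server == client);
}
```

Because the random seed is part of the comparison in this sketch, any divergence in the simulation shows up even when no player sends a command.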

As to APIs, what APIs are there apart from pthreads (which it seems reasonable to infer is not relevant here because multi-threading was ruled out as a cause as set out above), libpng (which is not relevant as the server does not use graphics and graphics could not cause a desync in any event) and the various compression libraries? The Simutrans-Experimental unique code does not tend to reference these APIs directly in any event, being focussed on altering gameplay rather than the underlying, lower level simulation.

In relation to size_t, I have carefully looked over the instances of this in the code: all but one such instance was either unchanged from Standard or only in the GUI (which would not cause a desync). The one instance that was neither of these I have just changed and testing shows that the desync still occurs.

As to 32/64 bit issues, this is harder, as I suspect that I shall have considerable difficulty now compiling a 64-bit Windows version, as the processes used to do this (especially library references) have been deprecated, particularly when I upgraded to MSVS 2015.

I do note, however, that, now that Vladki has confirmed that his server runs Linux, it was quite recently that Windows clients were connecting to that server with either no or only very occasional desyncs related to interaction, whereas these desyncs occur almost inevitably within a few seconds of connecting, with or without interaction. The issue therefore is likely to have arisen in a recent change to the code (https://github.com/jamespetts/simutrans-experimental/tree/devel-new-2), but I am struggling to find anything that might be relevant.

It does not help that I do not have an exact date for when I was last able to connect to server.exp.simutrans.com with a Windows client and not desync in less than a minute. If Ves or Vladki are able to be more specific about when they or either of them were last able to do so, I should be very grateful.
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: DrSuperGood on October 24, 2016, 01:07:19 AM
Check for uninitialized variables. Windows builds generally give them different values from Linux builds.
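
A minimal sketch of the kind of fix meant here, assuming a hypothetical `vehicle_state` struct (not from the Simutrans sources): giving every member a default initializer means that a freshly constructed object has the same value in every build, instead of whatever happened to be in memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical game-state struct. Every member has a default initializer, so
// no build can read an indeterminate value from a new object.
struct vehicle_state
{
    std::int32_t speed = 0;
    std::uint16_t delay_ticks = 0;
    bool reversed = false;
};

// Hypothetical checksum over the logical fields only (never over raw memory,
// which may contain padding bytes with unspecified contents).
inline std::uint32_t state_checksum(const vehicle_state& v)
{
    return static_cast<std::uint32_t>(v.speed) * 31u
         + static_cast<std::uint32_t>(v.delay_ticks) * 7u
         + (v.reversed ? 1u : 0u);
}
```

Reading a member that was never initialized is undefined behaviour in C++, so two builds are free to disagree about its value; that is exactly the sort of difference that a state hash would pick up.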
Title: Re: Re: bridgewater-brunel.me.uk - Simutrans-Experimental (devel-new-2) - testing
Post by: jamespetts on October 24, 2016, 01:18:48 AM
I did run Dr. Memory for this purpose, but it yielded no results other than memory leaks and potential memory leaks in code common to Experimental and Standard, except for the somewhat odd error reproduced in post no. 10 above.
Title: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 24, 2016, 07:26:11 PM
Unfortunately, it seems as though there is a new desync bug. As some may know, these are extremely hard to fix, so any help from any source would be greatly appreciated, either from experienced coders who are able to offer advice, or regular players who can run some basic testing to save me a lot of time (in particular, it would be helpful to know from Vladki and/or Ves when they were last able to connect to the server at server.exp.simutrans.com from a Windows client without a quick desync, as this should help me very much to pin down when the problem arose).

This one is especially hard to fix, as it can only be reproduced (so far) when the server is running Linux and the client is running Windows. (I have no access to a Windows server, so it is hard to test the other way around, and I cannot seem to get it working on my home network to test between my Windows and Linux PCs).

I am about to join (or have joined, depending on when you are reading this message) the previous discussions about this from the thread relating to the Bridgewater-Brunel server being refreshed with the devel-new-2 build for testing, which is where this discussion started.

Anyone who is able to compile and run the code on both Windows and Linux can help me greatly by going back through the Github history and compiling the client and server from the same historical commits for the last few months (I suspect that this has arisen recently) and, on the same map, seeing whether the problem can be reproduced with any given commit. Any suggestions or ideas for helping to track this down would be much appreciated.

I am afraid that this might interfere with other development priorities until it is fixed, although I am minded to undertake some performance tuning work whilst trying to think of what to do in relation to this.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 24, 2016, 07:36:47 PM
I'm running purely on Linux. My previous desync problems were solved by reconfiguring my home Wi-Fi from 2.4 GHz to 5 GHz (to avoid interference from neighbours).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 24, 2016, 08:04:53 PM
Thank you for letting me know: that is helpful. I know that I connected to your server (which I presume was running Linux at the time) a month or two ago with my Windows client and had few if any desyncs, but I cannot now recall when it was updated.

Can you perhaps assist by letting me know a list of dates since August on which you updated the server with the latest code from devel-new-2?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 24, 2016, 08:51:03 PM
Quote
(in particular, it would be helpful to know from Vladki and/or Ves when they were last able to connect to the server at server.exp.simutrans.com from a Windows client without a quick desync, as this should help me very much to pin down when the problem arose)

I remember specifically that it worked flawlessly for me until the introduction of MSVS 2015. After that, I had trouble compiling the game and connected only occasionally to test. I seem to recall that it worked a number of commits later, as I reported bugs from the Swedish server for a time. The date of my latest bug report comment (and therefore the latest known time that I could connect without any desyncs that I remember) is 14 October, in this thread: http://forum.simutrans.com/index.php?topic=15766.msg155207#msg155207

That would suggest that the problem is from commit e8d0c80db89a5e1cd373b87f6aa232ec3c93db2b onwards, which is still a considerable number of commits.

edit:
If you want, I can compile a version of each commit from that commit to the present one and upload them to the server.exp.simutrans.com server for bug testing. However, I cannot provide any Linux server builds, so someone else would have to provide those.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 24, 2016, 08:54:36 PM
Thank you - that is very helpful. Just to be sure, however, can you check whether you can connect to the Swedish server now (using the same map) without desyncing? This problem may occur only in certain situations and therefore on certain maps. I suggest that you wait for 2 minutes without doing anything to see whether it desyncs.

Thank you very much for your help.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 24, 2016, 09:18:19 PM
I connected to the Swedish server but got a desync, without doing anything, after about 5:30 minutes. This was using 7f94e9bd831807c52d807f66ab5979189c7ac84e, which the server is also running.

Now I'm compiling the newest version from GitHub (15851bfba5e7977f67f4ea4edd9590b3f0ace236). Testing the server now...

edit1: it runs on the server at least, passing 1:30 without crashing at the moment...

edit2: Nope, at 5:20 minutes it desynced. This is now the third desync that I have had at around 5:30 minutes (the one before was also just short of 5:30).
Incidentally, did you see my edit in my previous post?

Checking again to see if it happens the same time again...
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 24, 2016, 09:24:01 PM
Thank you very much for testing. Do you think that you might be able to test with 1fb72c008c82e5c75bff1430531c1d6cdef3c038 on both client and server, which is the version immediately before the 14th of October?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 24, 2016, 09:28:40 PM
I'm editing my posts too much :P

I cannot compile servers, unfortunately.

Read my edits one and two posts up for some details.

I'm currently running the latest test again to check the time. I can try connecting with the 14 October version afterwards.

edit: Now I got a desync after 3:38 minutes with 15851bfba5e7977f67f4ea4edd9590b3f0ace236

edit2: I cannot connect with 1fb72c008c82e5c75bff1430531c1d6cdef3c038 currently and I can't compile the server. It is Vladki who does that.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 25, 2016, 01:06:23 AM
Please note that the test is only valid if client and server are using exactly the same versions; Vladki, are you able to assist with this?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 25, 2016, 05:46:37 AM
I suspected that!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 25, 2016, 06:44:17 AM
I'm reading the forum from my phone, so I cannot recompile now. I usually recompile whenever I have a free evening. Unfortunately, I do not have a log of server updates. By the way, do you know the command line arguments for git to pull a specific commit?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 25, 2016, 10:59:53 AM
You need git reset --hard [commit ID] to do that.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 25, 2016, 12:03:12 PM
I usually use git checkout commit

I will start compiling executables later today!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 25, 2016, 03:43:45 PM
I have tried:
Code:
git checkout 15851bfba5e7977f67f4ea4edd9590b3f0ace236
make clean
make
And then restarted the server with new binary, but got the following error:
Code:
Reading menu configuration ...
Warning: tool_t::read_menu():   toolbar[11][5]: replaced way-builder(id=14) with default param=cityroad by cityroad builder(id=36)
Midi disabled ...
Calculating textures ...done
Message: karte_t::load():       Prepare for loading
World destroyed.
Warning: karte_t::load: Fileversion: 120008
Message: nwc_auth_player_t::init_player_lock_server:    new = 32767
*** stack smashing detected ***: /home/vladki/simutrans/simutrans-experimental terminated
Aborted (core dumped)
Same error for the British or Swedish pakset, so I restarted them with the previous server binary (7f94e9bd831807c52d807f66ab5979189c7ac84e)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 25, 2016, 04:00:49 PM
15851bfba5e7977f67f4ea4edd9590b3f0ace236 is the latest commit - you get "stack smashing detected" with that? That is very odd, since I cannot reproduce this. Do you get this both on the client and server? Would you be able to try the immediately previous commit (6256ca284e2f6752676d500404dbbef8fa68d2cd) and see whether this also causes the "stack smashing detected" error (which I understand to be caused by a stack buffer overflow, which is treated in this way because such overflows can be a security vulnerability)?

Incidentally, are you also able to connect with a Linux client to the bridgewater-brunel.me.uk server and see whether you are able to run the game without desyncing? I have just tried it with my Linux client, and was able to stay connected for a considerable period of time with both client and server running the latest commit (6256ca284e2f6752676d500404dbbef8fa68d2cd).

Finally, are you able to set up your server to run a rather earlier commit, 1fb72c008c82e5c75bff1430531c1d6cdef3c038, so that I and/or Ves can then connect to it with a Windows client and see whether this will stay in sync in circumstances (i.e. with a saved game) where it would not on a later commit?

Thank you very much for your help; these are fantastically challenging problems to track down, alas, and are also critical and game-breaking in severity, so all possible help is very much appreciated indeed.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 25, 2016, 04:03:26 PM
Sorry, I copied and pasted from the wrong place. Stack smashing happens with:
$ git status
HEAD detached at 1fb72c0
nothing to commit, working directory clean

I tried only on server, I'm not at home at the moment so cannot try client.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 25, 2016, 04:07:29 PM
Ahh, that makes more sense. Can you try instead the immediately previous commit in that case, SHA 1cb53908a88570042717df64be86828fe917c8d6 ? Thank you very much.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 25, 2016, 04:59:48 PM
The first batch of executables is now on server.exp.simutrans.com, in the devel-new section.

They are named by date and the first 7 characters of the commit hash.
The period covered is:
14-20 October

Also, there is the newest build that I have compiled so far (161024_15851bf)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 25, 2016, 08:10:58 PM
Splendid, that is very helpful. Are these in both graphical and command line versions, and are they in 64- and/or 32-bit?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 25, 2016, 08:54:47 PM
They are ONLY Win32 executables.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: DrSuperGood on October 25, 2016, 10:48:58 PM
Might be worth pointing out that building via GCC targeting Windows is different from building via MSVC. It is completely possible that a GCC Windows client might sync with a GCC Linux server whereas an MSVC client will not.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 26, 2016, 01:50:49 AM
Dr. Supergood makes an interesting point. Ves - are these GCC builds? If so, I will have a go with the latest one connecting to the servers and see whether there are any desyncs.

Edit: Unfortunately, testing http://server.exp.simutrans.com/Devel-new-builds/Simutrans-Experimental_161024_15851bf.exe by connecting to the Bridgewater-Brunel server results in a desync within a few seconds of connecting.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 26, 2016, 01:11:24 PM
They are compiled with VS2015 on Windows 10. I might be able to produce some more builds later today.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 26, 2016, 01:49:57 PM
Ahh, yes, I see. Thank you: that is most helpful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 26, 2016, 06:36:05 PM
Quote
Ahh, that makes more sense. Can you try instead the immediately previous commit in that case, SHA 1cb53908a88570042717df64be86828fe917c8d6 ? Thank you very much.

So the stack smashing was probably caused by trying to load a new savegame with an older version. Now the British test server is running commit 1fb72c0.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 26, 2016, 08:57:20 PM
Thank you very much for testing this: this is most helpful. However, the issue with the saved game is preventing a proper comparison: this (http://bridgewater-brunel.me.uk/saves/server-test-control.sve) is the saved game from your server as running now, which produces no desync even on the latest version. This (http://bridgewater-brunel.me.uk/saves/server-test-control-3.sve) saved game, however, which is a more developed and later version than what is on the server but still, I think, old enough to be of the old saved game version that should work with the old version that you are running, does produce desyncs with the latest version.

Can you try this version on your server so that I can connect with the binary of the old version and see whether it makes any difference? Thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 26, 2016, 11:13:24 PM
Server restarted with the requested savegame
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 27, 2016, 12:21:23 AM
Thank you very much for uploading that. Connecting to that with the 1fb70... Windows executable compiled by Ves produces a near instant desync, so the problem does not appear to be of recent origin at all.

The next step is painstakingly to work out which features of the saved game cause the desync by testing either with certain features disabled in the code, or with saved games with those features not used. The latter may be a good place to start, as we have two very similar saved games, one which does and one which does not cause a desync.

Vladki - I wonder whether you could start by sending all the trains to the depot (you will be able to connect with Linux), and let me know when you have done that so that I can attempt to connect again.

Can anyone remember whether the game in the state in which it now exists has ever been playable on the server without desyncs when connected from a Windows client?

For reference, the date of this saved game was originally the 18th of September, and is the saved game that was uploaded to demonstrate the problems with trains apparently teleporting, which was since fixed.

Thank you both very much for your help so far.

Edit: One thought occurs to me - does anyone have a Windows executable from the last month or two built with Visual Studio 2012? It would be worthwhile trying that with a matching server version to see whether that makes any difference.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 27, 2016, 06:47:34 AM
I won't be at my home PC until Sunday evening. So if someone else with Linux is able to help, please do.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 27, 2016, 08:07:08 AM
James, I will have quite limited time in the forthcoming days. Do you have any specific commits that you want me to compile, or a range of them?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 27, 2016, 11:09:43 AM
Can you try the very last commit of the original devel-2 (using Visual Studio 2012), which is ff5373424e0fee6b5163ea59ae464fe943c2ed60, and also 197f14910b8395cc72a6c76595d1205b02b30120 from devel-new-2 in Visual Studio 2015?

Thank you very much.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 27, 2016, 11:01:09 PM
They are both online now (they were in fact already compiled, just removed from the server!).

Compiled using msvs2015, win10.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 28, 2016, 01:31:00 AM
Thank you very much. Vladki, are you able to set your server to run ff5373424e0fee6b5163ea59ae464fe943c2ed60?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 28, 2016, 09:45:18 AM
Server commit ff5373... cannot load the savegame server-test-control-3:

FATAL ERROR: loadsave_t::read - savegame corrupt, not enough data. Restarted again with 1fb72c...
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 28, 2016, 10:59:01 AM
That is unfortunate. Does anyone have any saved games created on or before the 1st of September based on the map from this same server game? Edit: I have found one myself here (http://bridgewater-brunel.me.uk/saves/server-test-3.sve). Can we try with ff5373424e0fee6b5163ea59ae464fe943c2ed60 and that saved game?

Thank you very much for trying.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 28, 2016, 11:36:21 AM
Quote
That is unfortunate. Does anyone have any saved games created on or before the 1st of September based on the map from this same server game? Edit: I have found one myself here (http://bridgewater-brunel.me.uk/saves/server-test-3.sve). Can we try with ff5373424e0fee6b5163ea59ae464fe943c2ed60 and that saved game?

Thank you very much for trying.

Running.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 28, 2016, 12:11:40 PM
Splendid, thank you very much.

Edit: This has been extremely helpful: connecting to Vladki's server running ff5373424e0fee6b5163ea59ae464fe943c2ed60, I am able to stay in sync. However, connecting to the Bridgewater-Brunel server running the 1585... build on server and client, I am not able to stay in sync with the same saved game (server-test-3.sve).

This has narrowed down the difficulties considerably; thank you.
 
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 28, 2016, 01:33:28 PM
Glad that there is progress on the issue! :)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 28, 2016, 08:19:01 PM
Vladki - to narrow this down further, can you build commit 197f14910b8395cc72a6c76595d1205b02b30120 on the server and run it with the same saved game as is on there now?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 30, 2016, 10:23:10 AM
server restarted with commit 197f149
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 30, 2016, 11:01:12 AM
Thank you very much - this is most helpful. This might well be a slow process of trying various builds on client/server to find the point at which it stops working. The usual technique is to find the point half-way between the known working and not working point and see whether that works, giving a new known working or not working point, and then repeating the process until the exact commit causing the failure is identified.
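
The halving procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the commits are listed oldest to newest, the first is known to work, the last is known to fail, and every commit after the first failing one also fails. The names first_bad_commit and desyncs are hypothetical; in practice "desyncs" is a person building that commit and testing it against the server. (git bisect automates the same search.)

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the index of the first commit for which `desyncs` reports a failure.
// Assumptions (hypothetical helper, not part of Simutrans):
//  - `commits` is ordered oldest to newest and has at least two entries;
//  - commits.front() is known to work and commits.back() is known to fail;
//  - once a commit fails, all later commits fail too.
std::size_t first_bad_commit(const std::vector<std::string>& commits,
                             const std::function<bool(const std::string&)>& desyncs)
{
    assert(commits.size() >= 2);
    std::size_t good = 0;                   // last known working index
    std::size_t bad = commits.size() - 1;   // first known non-working index
    while (bad - good > 1) {
        // Test the commit half-way between the two known points.
        const std::size_t mid = good + (bad - good) / 2;
        if (desyncs(commits[mid])) {
            bad = mid;    // failure: the culprit is at mid or earlier
        } else {
            good = mid;   // success: the culprit is after mid
        }
    }
    return bad;
}
```

Each test halves the remaining range, so a range of N commits needs roughly log2(N) builds rather than N.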

Edit: Ves - do you have an archive Windows build for this commit?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 30, 2016, 11:46:22 AM
If you have a list of commits that are interesting for you, I may compile them at once, and run on different ports.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 30, 2016, 11:48:35 AM
That is helpful; but the trouble is that which commits will need to be compiled next will depend on the result of testing each previous one on the basis of the system that I describe above (which I believe is a well-known fault finding methodology).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 30, 2016, 12:12:56 PM
Yes, but if you have now at least two commits - one working and one not, I could compile several commits in between to more precisely find the midpoint. This will narrow the range more quickly than waiting for you to test and then waiting for me to compile one more chosen commit.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 30, 2016, 01:34:36 PM
Ahh, I see. That is helpful. ff5373424e0fee6b5163ea59ae464fe943c2ed60 is the last known working build at present. 1fb72c0... is, I think, the first known non-working build; so, if you can interpolate between those and produce builds at the half, quarter, and eighth points between them (it need not be exact), that would be helpful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 30, 2016, 02:39:49 PM
Uff that's quite a lot of commits. I noticed that there are some commits from BerndGabriel...
And I think he made a typo in 30574d19e928f908425b0f584f5e30987f0acfe4

Windows build for the currently running commit is here: http://server.exp.simutrans.com/Devel-new-builds/
Server commit dd8d3b2 running on port 13354, windows binary is available for download on the above link as well.
And commit 9b7bfe9 running on port 13355 (swedish pak server stopped now.)

Server commit 197f149 crashed with:
Code: [Select]
FATAL ERROR: vector_tpl<T>::[] - 7koord3d: index out of bounds: 72 not in 0..70
Aborting program execution ...

For help with this error or to file a bug report please see the Simutrans forum at
http://forum.simutrans.com
FATAL ERROR: vector_tpl<T>::[] - 7koord3d: index out of bounds: 72 not in 0..70
Aborting program execution ...

For help with this error or to file a bug report please see the Simutrans forum at
http://forum.simutrans.com
Aborted (core dumped)

I think I have seen this error before...
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 30, 2016, 03:09:04 PM
Yes, there are a lot of commits; those were mainly things from Standard in that period. That is why I suggested doing them one by one rather than all at once.

However, this appears to work without disconnecting, although I cannot test it for very long, as I get a crash after a few minutes (a bug since fixed, I think).

We now need to go to the (approximate) half-way point between this working build (17 September) and the earliest known non-working build (13 October). May I suggest that we try 69ff5d7d2d1bface984c5c0546bc4004e90b63c4, which is the first build using Visual Studio 2015 from the 25th of September?

Thank you very much for your help with this.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 31, 2016, 06:52:42 PM
commit 69ff5d7 running on the standard port.

How about the previously posted commits?
./simutrans-experimental-dd8d3b2 -server 13354
./simutrans-experimental-9b7bfe9 -server 13355
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 31, 2016, 07:00:48 PM
Thank you very much for your help: this is much appreciated. There does not seem to be a Windows build for 69ff5d7 available, and I have had trouble compiling older versions myself for some reason; would anyone be able to produce a copy?

As for the other two builds:

dd8d3b2 does desync; and
9b7bfe9 also does desync.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 31, 2016, 07:01:02 PM
./simutrans-experimental-dd8d3b2
./simutrans-experimental-9b7bfe9

Are in the devel-new folder.

Commit 69ff5d7 I cannot compile, due to (now solved) timespec issues.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 31, 2016, 07:06:23 PM
dd8d3b2 is from the 8th of October and 9b7bfe9 is from the 3rd of October, so the problem occurred somewhere between the 17th of September and the 3rd of October. May I suggest trying 41d2457cc0763c8d6449fd0a5e398b6ddd1a119d, which is one commit before the change to Visual Studio 2015, from the 19th of September?

Thank you very much.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 31, 2016, 07:12:28 PM
The 41d2457cc0763c8d6449fd0a5e398b6ddd1a119d is now on devel-new!

I realize, the patch that solved the timespec issue for me was 9b7bfe9 (3 oct) and you changed two files. Do you think it is valid if I fetch one of those commits in the middle (eg 69ff5d7) and change those two files? Will it disturb anything you think?

edit:
Anyway, I took the liberty to do it and have added 69ff5d7 (the msvs2015 upgrade) to the server!

edit2:
Get immediate desync with my version of 69ff5d7!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 31, 2016, 07:43:38 PM
Excellent, thank you very much for testing: that is extremely helpful. The next thing to check is the immediately previous version to see whether the problem is the switch to Visual Studio 2015 (i.e. 41d2457cc0763c8d6449fd0a5e398b6ddd1a119d).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on October 31, 2016, 08:36:05 PM
./simutrans-experimental-41d2457 is now running on standard port
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 31, 2016, 08:48:01 PM
Quote
./simutrans-experimental-41d2457 is now running on standard port

7 mins, still running.....
12 mins, still running....

edit
I accidentally closed the game window, but by then the game had been running for 34 minutes without a desync!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 31, 2016, 09:18:06 PM
Thank you - this is very interesting. Either the problem is something that occurs as a result of the one substantive change in the code in commit 69ff5d7d2d1bface984c5c0546bc4004e90b63c4 (https://github.com/jamespetts/simutrans-experimental/commit/69ff5d7d2d1bface984c5c0546bc4004e90b63c4), or the problem is with Visual Studio 2015 itself. Let me check reverting the one substantive change, which is:

Code: [Select]
-                        schiene_t* const sch = obj_cast<schiene_t>(way);
+         //schiene_t* const sch = obj_cast<schiene_t>(way);
+         schiene_t* const sch = way->is_rail_type() ? (schiene_t*)way : NULL;

and see whether that helps.
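
For readers unfamiliar with the second form: the "+" lines above test a type flag on the way and then apply a plain C-style cast, instead of going through the obj_cast helper. A minimal sketch of that pattern, using hypothetical stand-in classes (way_sketch_t, rail_sketch_t and as_rail are illustrative names, not Simutrans code, and the real class hierarchy may differ):

```cpp
#include <cstddef>

// Hypothetical stand-in for a generic way object.
struct way_sketch_t {
    virtual ~way_sketch_t() {}
    // Type flag: overridden by rail-like ways.
    virtual bool is_rail_type() const { return false; }
};

// Hypothetical stand-in for a rail way with extra state.
struct rail_sketch_t : way_sketch_t {
    bool is_rail_type() const override { return true; }
    int reserved_by = 0;
};

// Downcast guarded by the type flag: returns NULL for anything that is
// not a rail-type way (or for a NULL input), otherwise the same pointer
// viewed as the derived type.
rail_sketch_t* as_rail(way_sketch_t* way)
{
    return (way != NULL && way->is_rail_type())
        ? static_cast<rail_sketch_t*>(way)
        : NULL;
}
```

The cast is only as safe as the flag: if a class reports is_rail_type() without actually deriving from the rail class, the result points at the wrong type.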

Edit: I have just pushed the change to Github - let us see whether this helps.

Edit 2: Latest version now running on the Bridgewater-Brunel server.

Edit 3: It does stay in sync between instances on a Windows machine, so if it does not work between Windows and Linux, this is the original problem continuing, not some new problem subsequently introduced.

Edit 4: A desync still occurs between the Windows client and Linux server (Bridgewater-Brunel) with the latest version. I wonder whether the problem could be caused by Visual Studio 2015 itself?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on October 31, 2016, 09:29:15 PM
8a59a0d52e501ab26c4badbba2f45f196584856a executable is now on devel-new, but created a desync on server before I got to finish this post
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on October 31, 2016, 09:46:43 PM
I am having trouble reverting to Visual Studio 2012, as it does not support the thread_local keyword that I have used in implementing multi-threading of the private car route finder.

Does anyone knowledgeable about these matters have any idea what might be the cause of a desync specific to Visual Studio 2015? This seems to be a very difficult sort of problem. Has anyone tried compiling the latest Standard build in Visual Studio 2015 and seeing whether that stays in sync with the same Standard build in Linux?

Edit: Also, is anyone able to compile in MinGW to see whether this will stay in sync with a Linux client?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on November 01, 2016, 07:56:06 PM
So, the server is restarted with last version. Linux-Linux connection is fine.

I tried to connect to bridgewater-brunel.me.uk, and got almost instant desync or even crash:
*** Error in `/home/vladki/simutrans/simutrans-experimental': double free or corruption (!prev): 0x0000000018f40b10 ***

Perhaps just because they run different versions?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on November 01, 2016, 08:15:57 PM
Perhaps - I have pushed a further (minor) change to devel-new-2 and am currently rebuilding the Bridgewater-Brunel server with this newer version now. The newer version will almost certainly not sync with the immediately previous version.

If anyone is able to test a Windows build made with MinGW to see whether that stays in sync, that would be very helpful, as it would help to narrow down why Visual Studio 2015 builds are not staying in sync.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on November 02, 2016, 08:16:16 PM
As you probably remember, I only have msvs 2015 on windows 10, so I can't assist with that. However, the newest builds (using msvs2015) are on devel-new for testing.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on November 02, 2016, 08:37:54 PM
I am slightly confused by what you mean by the last part of the post; when you refer to the newest builds using Visual Studio 2015 being on devel-new, do you mean that you have builds from the (old) devel-new branch made with Visual Studio 2015, as opposed to Visual Studio 2012, with which they would have been built when those commits were current?

Edit: Also, Vladki, would you be able to revert the server to 69ff5d7d2d1bface984c5c0546bc4004e90b63c4 (https://github.com/jamespetts/simutrans-experimental/commit/69ff5d7d2d1bface984c5c0546bc4004e90b63c4)? Ters has built a Windows binary from that version using MinGW, and it would be instructive to test whether this desyncs with a Linux build or not. Thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on November 02, 2016, 09:07:37 PM
I'm sorry if I caused any confusion! I just stated that I had compiled the commit 9a7696b.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on November 04, 2016, 12:07:40 PM
simutrans-experimental-69ff5d7 running on port 13354
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 20, 2016, 10:55:24 PM
I have not had time to look at this recently, and am now visiting my parents for Christmas, so will not have the opportunity to look properly into this for a few weeks. However, what I do notice while I am here (with a Linux desktop computer and no Windows machine) is that I cannot stay in sync between my own Linux machine and either the Bridgewater-Brunel server or server.exp.simutrans.com, which also both run on Linux with the current latest commit of devel-new-2.

Does anyone else notice this, or can others connect properly?

The last testing for desyncs that I did was a few weeks ago when fixing the multi-threaded passenger generation, testing with a Windows client connecting to a Windows server over the loopback interface, which appeared at the time to work correctly. I do not think that any of the changes made since then will affect network synchronisation without any interaction, yet I disconnect almost instantly from the Bridgewater-Brunel server, and not only disconnect but sometimes crash when trying to connect to server.exp.simutrans.com.

This issue is likely to require lengthy investigation in the new year. However, it would greatly reduce the amount of time that I spend on this (and therefore increase the amount of time that I am able to spend on other things for Simutrans) if anyone could run tests to see which is the last Github commit in which a Linux client can connect to a Linux server without desyncing.

I should be most grateful if anyone could have a go at this test to assist me greatly in advance of the possibly gargantuan task of trying to fix this problem in the new year.

Edit: Very oddly, I cannot reproduce this issue when I am testing on my own Linux desktop over the loopback interface, for reasons that I cannot at present fathom. Either there is something different between the client and the server (I cannot see what as I have downloaded and built the latest pakset and code sources on both), or the desync arises from the act of actually connecting over the network (which seems unlikely as I have been able to get a stable connexion in the fairly recent past from my Linux desktop to the Bridgewater-Brunel server, over wifi, no less).

It would still be very helpful if anyone running a Linux client could let me know whether they can connect and stay in sync with the Bridgewater-Brunel server, however.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 21, 2016, 07:43:36 AM
Server.exp.... was not updated for several days. Can you try it with the linux client and pakset provided there? Also check for interference on wifi. My problems with desync disappeared when I switched from 2.4 GHz to 5 GHz
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 21, 2016, 12:40:50 PM
Quote
Server.exp.... was not updated for several days. Can you try it with the linux client and pakset provided there? Also check for interference on wifi. My problems with desync disappeared when I switched from 2.4 GHz to 5 GHz

With the executable from server.exp.simutrans.com (but without changing the pakset; I am not aware of having made any changes that will affect sync since the 13th of this month), the behaviour is the same as with the latest executable, viz. it will crash within a second of connecting.

As to wifi, I am on a wired connexion, so that will not be an issue.

Can I ask whether others are able to connect to either server without desyncing or crashing?

Edit: I have now tried connecting to the Bridgewater-Brunel server from a newly installed Debian package, and it still desyncs. It is unclear why.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 21, 2016, 05:12:54 PM
Can you try changing the pakset as well? It reminds me of the problem with not completely disabled rescaled bus, which created broken pak file ".routemaster.dat"
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 21, 2016, 05:48:41 PM
I skip the pakset checks by loading using net:server.exp.simutrans.com in the load dialogue, so the change would have to produce an actual desync. Can I check - are you able to connect to bridgewater-brunel.me.uk with the latest binary and pakset?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 21, 2016, 07:42:29 PM
I have tried first my old binary (9ea227ca62aecc2ea326744d9467324cc91e4c58). I can connect to server.exp.simutrans.com just fine. I get immediate desync or crash with bridgewater...
I'll now compile fresh version and see EDIT: it is the same with latest build: 617dd75fc13f62c6cc715ae873ecc68467b2ccfe
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 21, 2016, 09:51:02 PM
Quote
I have tried first my old binary (9ea227ca62aecc2ea326744d9467324cc91e4c58). I can connect to server.exp.simutrans.com just fine. I get immediate desync or crash with bridgewater...
I'll now compile fresh version and see EDIT: it is the same with latest build: 617dd75fc13f62c6cc715ae873ecc68467b2ccfe

Thank you very much for testing: that is most helpful.

When you say that "it is the same", can you clarify what you mean by that? You described two different behaviours on connecting to two different servers; are both the same, or just one?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 21, 2016, 10:55:31 PM
With both versions (9ea... and 617...) I could connect and play without problems to server.exp.simutrans.com (both british and swedish pakset).
If I connect to bridgewater-brunel, I get immediate desync and crash upon second try - again with both versions of client (Linux 64-bit).

I still have the feeling that the problem may be in the pakset. I have this fix to remove the funny ".routemaster-rescaled.pak" file:
Code: [Select]
diff --git a/bus/routemaster.dat b/bus/routemaster.dat
index 1535554..b20d268 100644
--- a/bus/routemaster.dat
+++ b/bus/routemaster.dat
@@ -39,7 +39,7 @@ EmptyImage[N]=./images/routemaster.0.6
 EmptyImage[NE]=./images/routemaster.0.7
 ---
 # For TESTing of rescaled vehicles only - delete when testing complete.
-#obj=vehicle
+obj=vehicle
 name=Routemaster-rescaled
 copyright=JamesPetts&JamesHood
 waytype=road
With the patch applied I have a proper file: "vehicle.routemaster-rescaled.pak"
Before this patch I had problems even with server.exp.simutrans.com.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 21, 2016, 11:02:36 PM
Thank you - that is a useful clarification. There may be an issue with the Bridgewater-Brunel server. I am currently working on improving the road vehicles, so am on the road-vehicle-rescaling branch, whereas the server is running the half-heights branch. However, installing afresh on a completely different computer (my father's, whom I am encouraging to take up Simutrans; he has built a few stage coach lines in 1750 already) also produced a desync, and that install was from the .deb package on my nightly server, which is from the half-heights branch.

It is hard to see how the problem can be the Routemaster 'bus, which is introduced in 1956, when the game on the Bridgewater-Brunel server is currently in 1909, however.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 22, 2016, 07:44:35 AM
I have to say that my client pakset matches with server.exp.... but does not match bridgewater... (not only rescaled bus, also pedestrians have changed)
Can you try connecting with my pakset? I have the gut feeling that pakset mismatch may be the cause.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 22, 2016, 11:44:21 AM
It could be. The Bridgewater-Brunel server should have the latest pakset on the half-heights branch, but it may be that the updating is not working: if it has different pedestrians, that would suggest that the pakset is a few weeks old. I will have a go at this when I have a moment.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 22, 2016, 01:52:52 PM
Do not forget to clean up old .pak files. I remember that some objects were removed/renamed.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 22, 2016, 02:00:24 PM
Quote
Do not forget to clean up old .pak files. I remember that some objects were removed/renamed.

Yes, this is a particular issue with this system of having one .pak file per object. I will have to look into automating this by deleting the pakset folder entirely before rebuilding it.

Edit: I have found that using "make clean" works for this purpose. I will have to test it on the server.

Edit 2: The nightly pakset build on the server is now set to "make clean" before it "make"s, so, as of to-morrow, there should be a proper clean pakset in place. I should be interested to see whether this makes any difference.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 31, 2016, 12:26:49 AM
I have tried to connect to bridgewater-brunel server, but the pakset used by server does not match the nightly build
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 31, 2016, 12:33:00 AM
Quote
I have tried to connect to bridgewater-brunel server, but the pakset used by server does not match the nightly build

I will have to look into this when I get a chance.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 31, 2016, 08:30:05 AM
Just one more note, as you work on transparent vehicles, you could completely remove the partial definition of rescaled routemaster bus to avoid the pak file starting with dot, which is in the nightlies. Sync (delete obsolete paks) the pakset for bridgewater server and try if it helped. Or try connecting to the swedish pak server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 31, 2016, 11:50:38 AM
I have removed the "routemaster-rescaled" from the road-vehicles-rescaling branch. When work on this branch is complete, I will be able to merge it back into the half-heights branch and hopefully we will then be able to test whether this helps.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on December 31, 2016, 02:40:56 PM
I made some tests today to see whether I could log on to the bridgewater-brunel.me.uk servergame. It desynced and crashed quite heavily, but in a curious pattern:

1st attempt: lost synchronisation after 26 seconds
2nd attempt: crash to desktop after 12 seconds

after restarting simutrans:
3rd attempt: lost synchronisation after 23 seconds
4th attempt: crash to desktop after 13 seconds

after restarting simutrans again:
5th attempt: lost synchronisation after 21 seconds
6th attempt: crash to desktop after 13 seconds

The pattern seems to be that when the servergame is accessed for the first time in a game session, it lasts for around 20-25 seconds before it desyncs. When connecting again (without restarting Simutrans), you can only be there for 12-13 seconds before the entire game crashes, as if the first attempt to log onto the servergame influences the second attempt. Also note that the first attempt after each restart appears to trigger the desync earlier and earlier (26, 23, then 21 seconds).

Using:
Windows 10
Executable compiled with msvs 2015: 2d60c8ffe5ecbc6192e5a846fc59907e7a6442d7
Pakset (half height branch) compiled with corresponding makeobj: 65b85f3f8c231057f35bdfb7deb8b7dd1b8f02e3

I don't know if you can use this in any way, but thought I should report it to you anyway!

Happy new year!  ;D
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on December 31, 2016, 02:44:59 PM
Thank you for letting me know, and happy new year to you, too!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on December 31, 2016, 09:03:36 PM
I saw the same pattern as Ves a few weeks ago (desync, crash, desync, crash, ...).

Linux 64-bit
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 09, 2017, 03:19:07 PM
I have just pushed a fix which involves removing an instance of casting away const, which is undefined behaviour. I have not had a chance to test the effect on networking yet, but I wonder whether anyone could test whether this helps when a Linux server has a Windows client connected to it?
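
As a general illustration of the kind of change involved (this is not the actual fix, which is in the Github commit): writing through a pointer or reference obtained by casting away const is undefined behaviour if the object was itself declared const, and different compilers are free to treat such code differently. One common way to avoid the cast is to mark the member that genuinely needs to change as mutable. The class route_cache_t and its members below are hypothetical names used only for this sketch:

```cpp
// Hypothetical cache class, not taken from the Simutrans sources.
class route_cache_t {
public:
    // Logically const lookup that still needs to update a hit counter.
    // No const_cast is needed because `hits` is declared mutable.
    int lookup(int key) const
    {
        ++hits;
        return key * 2;   // placeholder for a real cached result
    }

    int get_hits() const { return hits; }

private:
    mutable int hits = 0; // may be modified even through a const object
};
```

With this layout, calling lookup() on a const route_cache_t is well defined, whereas incrementing a non-mutable counter through const_cast would not be.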
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on January 09, 2017, 10:56:33 PM
Can you check windows client against server.exp.simutrans.com?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on January 09, 2017, 11:26:09 PM
May I ask, where do I see which version the server is? is it always the latest nightly? Can I rely on the date that is shown as the "last modified"?

My results:

Connecting to server.exp.... swedish server with 99d6634f2c44dacc986c84715fb91f942e7a79f3 yields no problems whatsoever.
Connecting to server.exp.... british server with the same commit lets me stay synced. However, it feels a bit unstable: if I tamper with some of the deadlocks, for instance, it might desync me.
Connecting to bridgewater server with a8ab51179693fa413f17c9d5040724e895c34ae8 yields the same results as described previously in this thread (desync after 20 sec on first attempt, crash after 12 sec on second)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 09, 2017, 11:38:32 PM
Hmm - interesting. I thought that server.exp.simutrans.com had previously had desyncs without interaction within seconds of connecting?

Perhaps you could try using the same map from the Bridgewater-Brunel server on server.exp.simutrans.com to see whether you can reproduce the desyncs there?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on January 09, 2017, 11:50:29 PM
Quote
Hmm - interesting. I thought that server.exp.simutrans.com had previously had desyncs without interaction within seconds of connecting?

Perhaps you could try using the same map from the Bridgewater-Brunel server on server.exp.simutrans.com to see whether you can reproduce the desyncs there?

Here you are - port 13354 - connected just fine

Quote
May I ask, where do I see which version the server is? is it always the latest nightly? Can I rely on the date that is shown as the "last modified"?
You can rely on "last modified" or the info in README.txt. However, if the file is very fresh, it may be that the server is running the previous version and will be restarted soon. All is done manually, and the order of operations (upload/restart) may be random.

Quote
Connecting to server.exp.... swedish server with 99d6634f2c44dacc986c84715fb91f942e7a79f3 yields no problems whatsoever.
Connecting to server.exp.... british server with the same commit lets me stay synced. However, it feels a bit unstable: if I tamper with some of the deadlocks, for instance, it might desync me.
Connecting to bridgewater server with a8ab51179693fa413f17c9d5040724e895c34ae8 yields the same results as described previously in this thread (desync after 20 sec on first attempt, crash after 12 sec on second)
Yeah, I have the same with linux client.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on January 10, 2017, 12:02:37 AM
Quote
You can rely on "last modified" or the info in README.txt. However, if the file is very fresh, it may be that the server is running the previous version and will be restarted soon. All is done manually, and the order of operations (upload/restart) may be random.
I'm sorry, I did not mean the server.exp... I meant the bridgewater-brunel server.

Quote
Hmm - interesting. I thought that server.exp.simutrans.com had previously had desyncs without interaction within seconds of connecting?
Perhaps you could try using the same map from the Bridgewater-Brunel server on server.exp.simutrans.com to see whether you can reproduce the desyncs there?
Connecting to port 13354 with 99d6634f2c44dacc986c84715fb91f942e7a79f3 only caused "normal" desyncs after around 30 seconds for me. No crashes though!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 10, 2017, 12:28:52 AM
Do I recall correctly that Vladki, you are running Linux, and Ves, you are running Windows? And do I correctly understand that Ves is referring to a "normal" desync as one that occurs without any user interaction?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on January 10, 2017, 12:37:25 AM
Yes I'm running Linux. I think the "normal desync" happens when you tamper with something - typically schedule, or vehicles in depot. But if you just watch the game it should be ok. And most importantly you can reconnect again.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on January 10, 2017, 01:28:32 AM
Normal desync was in contrast to the crashes mentioned earlier. They occurred without doing anything. Yes I run windows.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 10, 2017, 10:20:32 PM
On server.exp.simutrans.com:13354 the game runs without the initial disconnect. I also tried to build some stuff, which also works.

I am running the game in Linux (64-bit). The pakset is actually newer than the one on the server (9a8d1a8e61c296a1d65303643f1f268f0ad24e40). I forced the connection via "load" dialog.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 10, 2017, 11:31:04 PM
Testing again, I do desync within <10 seconds of connecting both on the Bridgewater-Brunel server and on server.exp.simutrans.com:13354. I notice that server.exp.simutrans.com:13354 reports having the routemaster-rescaled object whereas the Bridgewater-Brunel server does not (and nor does my client).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 11, 2017, 06:17:22 PM
Testing with my Linux client and compiling from the latest commit on devel-new-2 for the code and half-heights for the pakset, I desync shortly after connecting both to the Bridgewater-Brunel server and to server.exp.simutrans.com:13354 (although not as quickly as with my Windows machine). I have connected to the default port of server.exp.simutrans.com and have not desynced yet (after circa 2-3 minutes). Edit: Still connected about 20 minutes later.

However, I am not sure that the Bridgewater-Brunel server or server.exp.simutrans.com are running the latest versions.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 12, 2017, 10:16:16 PM
On the Bridgewater-Brunel server and with a build of the latest sources, my client crashes right after the connect with a segmentation fault.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 12, 2017, 10:59:31 PM
Thank you for testing. Are you able to run GDB to find where the fault occurs? I should note that I have been integrating a lot of updates from Standard this evening, so the server will not be up to date with the latest code on Github; desyncs are known to cause crashes in some cases for reasons that remain elusive (the crash occurs in code that is unmodified from Standard).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 08:43:10 PM
I hope the following is already helpful for you. If you need something else, please, let me know.

Code: [Select]
Program received signal SIGSEGV, Segmentation fault.
0x000000000040e57e in player_t::get_player_nr (this=0x0) at bauer/../player/simplay.h:302
302             sint8 get_player_nr() const {return player_nr; }
(gdb) bt
#0  0x000000000040e57e in player_t::get_player_nr (this=0x0) at bauer/../player/simplay.h:302
#1  0x000000000076a7ad in tool_add_message_t::init (this=0x14fb3cd0, player=0x0) at simtool.cc:9102
#2  0x00000000005ea5af in nwc_tool_t::do_command (this=0x153dc310, welt=0x35bd830)
    at network/network_cmd_ingame.cc:1339
#3  0x00000000007975c5 in karte_t::do_network_world_command (this=0x35bd830, nwc=0x153dc310) at simworld.cc:9714
#4  0x0000000000796fe3 in karte_t::process_network_commands (this=0x35bd830, ms_difference=0x7fffffffb158)
    at simworld.cc:9659
#5  0x0000000000797c3a in karte_t::interactive (this=0x35bd830, quit_month=2147483647) at simworld.cc:9820
#6  0x000000000072ffaa in simu_main (argc=2, argv=0x7fffffffda68) at simmain.cc:1370
#7  0x0000000000742cd0 in sysmain (argc=2, argv=0x7fffffffda68) at simsys.cc:805
#8  0x00000000007ffb6f in main (argc=2, argv=0x7fffffffda68) at simsys_s2.cc:800
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 09:00:16 PM
Thank you for that. It is difficult to tell exactly what is going on there and therefore the ultimate cause, but I have just pushed a fix that might help to deal with the immediate cause. Are you able to re-test? I should be grateful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 09:14:17 PM
Will do so immediately.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 09:18:35 PM
Thank you - that is very kind.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 09:26:26 PM
Now the game runs very slowly when I try to connect, and I get "no response from server" when trying to force the connection via the load dialog.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 09:29:23 PM
That is very odd indeed, and extremely difficult to understand. Is the game running slowly for you even in single player mode? What map are you running?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 09:32:02 PM
Sorry, I should have been clearer. The game starts to respond slowly once I click on the "play online" button or try to connect to a server via the load dialog. I am currently on the devel-new-2 branch. I got quite a number of patched files when I pulled, so it might be related to something else that was pushed. Of course, my build configuration might also be less than optimal.

I also cannot see any server in the server browser, even if I select to also display mismatched. I can ping the server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 09:35:19 PM
Taking a long time to respond when actually selecting an individual server in the list is usually caused by that server being slow to respond. Is this what you are experiencing?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 09:37:38 PM
No, the game gets slow and consumes an unusual amount of CPU time once I open the server dialog. Before selecting anything within the dialog window.


But I think it is unlikely to be caused by the change you committed. Trying to bisect the other commits now.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 09:42:27 PM
That is very odd indeed. It performs normally before attempting to connect to the server?

Edit: Even more oddly, I cannot reproduce this in Windows.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 09:45:09 PM
Yes, it performs normally when not trying to connect to a server.

By the way, are you using IRC or something?



ecb9712b19b66298062464702410c69702119a26 does not have the issue. Still no response from the server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 09:50:39 PM
I am currently in the process of doing some work on the Bridgewater-Brunel server to try to get it up to the latest version, enable automatic version numbering, and make sure that the server is always running the latest nightly build. Because of my poor multi-tasking skills, IRC may not work well at present in any event. However, as a result of that work, the Bridgewater-Brunel server is not currently running.

Edit: Incidentally, are you able to identify the latest version that does not have this problem?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 09:52:00 PM
Ok, I will continue to try to investigate the issue.


Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 09:53:21 PM
That is very kind - thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 09:54:08 PM
The issue with the slow speed after opening the network dialog was introduced with commit 959405a1da3cc2b569a508e371ed10b4f9f55720.


Might be a setup issue on my side, but reintroducing "local_hints.ai_socktype = SOCK_STREAM;" in line 237 of network.cc fixes the slowness problem for me.
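
For context, a minimal sketch of why that hint can matter, assuming a POSIX system. When ai_socktype is left at zero, getaddrinfo accepts any socket type and may return several entries per address (for example stream, datagram and raw), so code that walks the whole list can make several times as many attempts. Whether that is what caused the slowness here is only a guess. The helper name count_addrinfo_results is hypothetical and is not taken from network.cc.

```cpp
#include <cassert>
#include <cstring>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper (not from network.cc): counts how many candidate
// addresses getaddrinfo() returns for a numeric host with a given
// socket type hint. Returns -1 if the lookup fails.
int count_addrinfo_results(const char* host, const char* port, int socktype)
{
    assert(host != nullptr && port != nullptr);

    struct addrinfo hints;
    std::memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      // accept IPv4 or IPv6
    hints.ai_socktype = socktype;     // 0 = any type, SOCK_STREAM = TCP only
    hints.ai_flags = AI_NUMERICHOST;  // numeric address only, no DNS lookup

    struct addrinfo* results = nullptr;
    if (getaddrinfo(host, port, &hints, &results) != 0) {
        return -1;
    }

    int count = 0;
    for (const struct addrinfo* p = results; p != nullptr; p = p->ai_next) {
        ++count;
    }
    freeaddrinfo(results);
    return count;
}
```

Calling count_addrinfo_results("127.0.0.1", "13353", 0) and then again with SOCK_STREAM shows how many extra candidates the missing hint produces on a given system.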


I can connect to the bridgewater-brunel copy on server.exp.simutrans.com:13354 without crashing, now, but I still get the desync after some seconds.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 10:50:14 PM
I have just been looking into this. That was the commit in which I merged in some of Dr. Supergood's network changes; however, they seem to have been merged into an already outdated network code-base. I have now managed to bring it fully up-to-date with the latest code in Standard. Do you think that you could re-test with this new code?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 10:57:31 PM
Sure
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 10:59:00 PM
Thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 11:05:55 PM
This change also fixes the slowness issue. The crash issue seems also to be resolved. Still get the desync, though.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 11:07:06 PM
Excellent - thank you for re-testing. That is some progress at least.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 11:12:25 PM
Definitely! Let me know, when you need something tested.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 11:36:14 PM
Thank you again.

I have now restarted the Bridgewater-Brunel server with the latest version. Do you think that you could re-test for the desync? Thank you again.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 11:39:28 PM
I seem to have a different revision than the server. If I force the connection via load, I still get the desync :-(

(just trying to get some more debug information)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 11:46:48 PM
I think that the different revision is caused by an issue discussed here (http://forum.simutrans.com/index.php?topic=15523.msg158095#msg158095).

Are you able to run Simutrans-Experimental as a server on your own computer (use the "-server" command line flag), load the same saved game as is on the Bridgewater-Brunel server, and then connect to it with another instance of Simutrans-Experimental running on your own computer using the "net:127.0.0.1" command in the load dialogue to see whether that stays in sync?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 11:51:30 PM
With a local server I could also run the map before.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 13, 2017, 11:53:02 PM
Can you test whether this is still the case with the latest build?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 13, 2017, 11:59:45 PM
Do you happen to have a fresh dump of the savegame? Mine might be outdated.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 12:01:15 AM
The best thing to get the exact same thing as is running on the server is to connect to the server, wait for it to desync, then save the game.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 12:01:46 AM
ok, will do.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 12:03:10 AM
Splendid, thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 12:11:43 AM
No desync so far after a couple of minutes.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 12:12:59 AM
Thank you - that is very helpful. It is very odd that you get a desync with the Bridgewater-Brunel server but not locally - there may be some anomaly on the server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 12:15:14 AM
Of course, latency is almost zero on a local server. Maybe it is a timing issue?


By the way, I also have a log-file with full debug from a session where I got the desync, but it is too large to upload.

The following looks slightly suspicious
Code: [Select]
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: route_t::intern_calc_route():    Problem with heuristic:  from 1009,1999,6 to (1007,1999,6) at 1008,1999, best = 20, cost = 10, heur = 110, dist = 1, turns = 99

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Similar errors as the first one are repeated numerous times. The second one is actually the last error before the world was destroyed.

I just checked. I get similar errors as the first one also on the local server (with -debug 5).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 12:25:53 AM
After several minutes and connecting a second client, I also got a desync locally. Sadly on the client running without "-debug 5 -log" :-/
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 12:27:27 AM
Desync errors are rarely caused by latency unless it is extreme. They are normally caused by the game state diverging on the client and the server. What happens is that the client and server compare their random number seed every so often. If these are different, the client will disconnect. They will become different if there is almost any deviation in the execution of game code between the client and server, but the difference between them will not be detected until an indeterminate time later. This means that errors of this sort are extremely difficult to resolve. They become orders of magnitude harder again to resolve when they occur only in a setting (a server running one operating system and a client running another) where it is very difficult to reproduce the problem reliably.
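
To illustrate the delayed detection described above, here is a toy sketch only. It is not the real checklist code: toy_rng_t, toy_checklist_t and first_detected_mismatch are hypothetical names, and the real comparison covers more than the random state. Both sides start from the same seed, the client makes a single extra random call at one step, and the mismatch is only noticed at the next periodic comparison.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the game's random number generator: a small
// deterministic LCG, so equal seeds plus equal call sequences give equal states.
struct toy_rng_t {
    uint32_t state;
    uint32_t next() { state = state * 1664525u + 1013904223u; return state; }
};

// Hypothetical, much reduced "checklist": only the sync step and the RNG state.
struct toy_checklist_t {
    uint32_t sync_step;
    uint32_t rand;
    bool operator==(const toy_checklist_t& o) const {
        return sync_step == o.sync_step && rand == o.rand;
    }
};

// Runs a client and a server side by side. The client makes one extra random
// call at diverge_step; the states are compared only every check_interval
// steps. Returns the step at which the mismatch is detected, or 0 if it never is.
uint32_t first_detected_mismatch(uint32_t diverge_step, uint32_t check_interval, uint32_t max_steps)
{
    assert(check_interval > 0);
    toy_rng_t server = { 12345u };
    toy_rng_t client = { 12345u };
    for (uint32_t step = 1; step <= max_steps; ++step) {
        server.next();
        client.next();
        if (step == diverge_step) {
            client.next(); // the single deviation in game code execution
        }
        if (step % check_interval == 0) {
            const toy_checklist_t s = { step, server.state };
            const toy_checklist_t c = { step, client.state };
            if (!(s == c)) {
                return step; // the client would disconnect here
            }
        }
    }
    return 0;
}
```

With a divergence at step 3 and a comparison every 8 steps, the mismatch is reported at step 8, by which time nothing in the report points back to step 3. That gap is what makes the original deviation so hard to locate.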

Edit: Thank you for checking that. Did you also get a local desync after several minutes when you last tested this?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 12:43:40 AM
As for the desync, I might not have tested long enough last time.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 12:48:05 AM
Can you check with this (http://bridgewater-brunel.me.uk/saves/britain-3.sve) saved game? It is the same as should be on the server, but the one on the server may not be identical. This is the one that I use for testing including very long-term (i.e. hours long, sometimes overnight) network synchronisation testing.

If this still fails, would you be able to re-test with this saved game and the older version that you used before to see if we can narrow down what is causing the desync that you are getting with a Linux server and a Linux client? The latter issue seems new.

In the meantime, I am re-testing with this saved game locally on Windows over a longer-term run to see whether it remains as stable as it was when I last tested it.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 12:55:41 AM
Ok, will test with the local server again. Might take a while to get the desync, if I get it at all.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 01:09:58 AM
Interesting, now I can reproduce the behavior locally. I immediately get a desync with that game. Will try with logging, now.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 01:20:41 AM
Maybe the following is helpful:
Code: [Select]
Message: wayobj_t::fill_menu(): try to add TramElectrification(0x7d7ab10)
Message: wayobj_t::fill_menu(): try to add DCThirdRail(0x9431a60)
Message: wayobj_t::fill_menu(): try to add DCRailCatenary(0xa8ff0a0)
Message: wayobj_t::fill_menu(): try to add DCFourthRail(0x623eb90)
Message: hausbauer_t::fill_menu(): maximum 139
Message: hausbauer_t::fill_menu(): maximum 139
Message: hausbauer_t::fill_menu(): maximum 139
Message: hausbauer_t::fill_menu(): maximum 139
Message: toolbar_t::update(): update toolbar SPECIALTOOLS
Message: toolbar_t::update(): update toolbar EDITTOOLS
Message: toolbar_t::update(): update toolbar LISTTOOLS
Message: network_command_t::rdwr: read packet_id=7, client_id=0
Warning: network_check_activity(): received cmd id=7 nwc_ready_t from socket[10]
Warning: nwc_ready_t::execute: set sync_step=952 where map_counter=96770
Warning: karte_t::network_game_set_pause: steps=119 sync_steps=952 pause=0
Message: karte_t::reset_timer(): called, mode=$4
Message: network_command_t::rdwr: read packet_id=12, client_id=0
Warning: network_check_activity(): received cmd id=12 nwc_auth_player_t from socket[10]
Message: nwc_auth_player_t::execute: plnr = 255  unlock = 32767  our_client_id = 0
Message: network_command_t::rdwr: read packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr: rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Welcome, Client#2! init=1 flags=0
Warning: network_check_activity(): received cmd id=8 nwc_tool_t from socket[10]
Message: network_world_command_t::execute: do_command 8 at sync_step 953 world now at 952
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 132
Message: network_world_command_t::execute: do_command 9 at sync_step 961 world now at 952
Message: network_command_t::rdwr: read packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr: rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Now 1 clients connected. init=1 flags=0
Warning: network_check_activity(): received cmd id=8 nwc_tool_t from socket[10]
Message: network_world_command_t::execute: do_command 8 at sync_step 961 world now at 953
Message: nwc_tool_t::do_command: steps 953 tool 8224 init
Message: nwc_tool_t::do_command: id=32 init=1 defpar=32768,Welcome, Client#2! flag=0
Message: message_t::add_msg():                       Welcome, Client#2! (at -1,-1)
Message: message_t::add_msg(): New world record for
railways:
 13.0 km/h
by (1376) UERL 1906 Tube Stock. (at 1269,2189)
Message: haltestelle_t::reserve_position(): failed for gr=1534,1672, cnv=132
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1227,2181,5 to (1235,2170,5) at 1235,2170, best = 229, cost = 229, heur = 2290, dist = 0, turns = 2061

and slightly later

Code: [Select]
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1262,2211,7 to (1263,2208,7) at 1262,2210, best = 71, cost = 10, heur = 150, dist = 2, turns = 138

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command: sync_step=960  server=[ss=960 st=120 nfc=0 rand=4206555837 halt=1 line=1 cnvy=1025 ssr=4206555837,4206555837,0,0,0,0,0,0 str=4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,1235697,105474,4206555837 exr=0,0,0,0,0,0,0,0  client=[ss=960 st=120 nfc=0 rand=2648001267 halt=1 line=1 cnvy=1025 ssr=2193165484,2193165484,0,0,0,0,0,0 str=2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2648001267,2648001267,1235697,105474,2193165484 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...
Show banner ...
Message: karte_t::reset_timer(): called, mode=$0
Message: message_t::add_msg(): New world record for
road vehicles:
 7.0 km/h
by (558) Pair of horses. (at 1277,2197)
Message: haltestelle_t::reserve_position(): failed for gr=741,2278, cnv=170
Message: haltestelle_t::reserve_position(): failed for gr=1534,1672, cnv=132
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1230,2217,5 to (1239,2217,5) at 1239,2217, best = 90, cost = 90, heur = 900, dist = 0, turns = 810

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1230,2217,5 to (1239,2217,5) at 1238,2217, best = 90, cost = 80, heur = 810, dist = 1, turns = 729
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 01:21:25 AM
That is very helpful. Do you think that you could try with older versions to see whether the problem exists with older versions also or whether it was introduced recently, and, if so, when?

Unfortunately, it seems that there is now a desync even between Windows clients, at least with release builds, after a few minutes. I will test again with debug builds to see whether the problem is confined to release builds (which will give a clue, if a vague one, about the nature of the problem) or whether it is also present with debug builds.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 01:24:17 AM
Were there some more significant changes at some point? Might be a good idea to try something older than that.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 01:48:20 AM
The trouble is that there have been a great many changes, and it is hard to tell what is significant and what is not. I should note that I have, since I posted the last message, been running locally with a debug build without desync. I will continue to run it overnight: if this is stable, the issue is something to do with the release build, possibly some undefined behaviour that happens to be deterministic in a non-optimised build.

Unfortunately, this sort of error is almost impossibly hard to find, especially since, for reasons that I cannot fathom, running Dr. Memory causes the game to crash in a way that is very hard in itself to diagnose and does not occur when the game is not run with Dr. Memory (which in any event only works properly in debug mode without optimisations).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 02:02:41 AM
I did only build with debugging enabled.

It seems that I get the desync even with pretty old versions (October/November). With even older versions, I have trouble to even load the savegame. Perhaps, it would also make sense to double check my configuration to rule out possible errors in that area.

Anyway, I will go to bed, now, and will continue tomorrow. We are one hour ahead ;-)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 10:56:34 AM
Having run an overnight test on the Windows debug build (having also gone to bed after my last post), it seems that even this now will not stay in sync, although it is not clear how long it took to fail beyond that it did so sometime overnight.

In around October to December, I had spent an enormous amount of time implementing multi-threading, and, in particular, making it work when connected over a network (testing with the loopback interface). After a huge amount of testing, I thought that I had resolved such problems as there were in around December.

It might help, therefore, if you could run further tests with MULTI_THREAD=1 disabled. If multi-threading turns out to be the problem, I shall have to add preprocessor directives disabling different parts of the multi-threaded code so that those different parts can be disabled one by one to test each in isolation.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 02:16:08 PM
With the latest version compiled without the MULTI_THREAD flag, the server crashes as soon as I try to connect with a client. Again with the britain-3.sve savegame.

Code: [Select]
*** Error in `/home/felix/simutrans/simutrans/simutrans-experimental': free(): invalid pointer: 0x0000000012d72690 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x6f263)[0x7ffff6b21263]
/lib64/libc.so.6(+0x748d6)[0x7ffff6b268d6]
/lib64/libc.so.6(+0x750de)[0x7ffff6b270de]
/home/felix/simutrans/simutrans/simutrans-experimental[0x78ea7c]
/home/felix/simutrans/simutrans/simutrans-experimental[0x78a4a5]
/home/felix/simutrans/simutrans/simutrans-experimental[0x5e52c8]
/home/felix/simutrans/simutrans/simutrans-experimental[0x79283d]
/home/felix/simutrans/simutrans/simutrans-experimental[0x79225b]
/home/felix/simutrans/simutrans/simutrans-experimental[0x792eb2]
/home/felix/simutrans/simutrans/simutrans-experimental[0x72d087]
/home/felix/simutrans/simutrans/simutrans-experimental[0x73fee8]
/home/felix/simutrans/simutrans/simutrans-experimental[0x7fa6e7]
/lib64/libc.so.6(__libc_start_main+0xf0)[0x7ffff6ad2790]
/home/felix/simutrans/simutrans/simutrans-experimental[0x405a69]

Code: [Select]
Program received signal SIGABRT, Aborted.
0x00007ffff6ae5107 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6ae5107 in raise () from /lib64/libc.so.6
#1  0x00007ffff6ae655a in abort () from /lib64/libc.so.6
#2  0x00007ffff6b21268 in ?? () from /lib64/libc.so.6
#3  0x00007ffff6b268d6 in ?? () from /lib64/libc.so.6
#4  0x00007ffff6b270de in ?? () from /lib64/libc.so.6
#5  0x000000000078ea7c in karte_t::load (this=0xf767cf0, file=0x7fffffff97a0) at simworld.cc:8795
#6  0x000000000078a4a5 in karte_t::load (this=0xf767cf0, filename=0x7fffffff9e20 "server13353-network.sve")
    at simworld.cc:7865
#7  0x00000000005e52c8 in nwc_sync_t::do_command (this=0x25a8a4e0, welt=0xf767cf0) at network/network_cmd_ingame.cc:754
#8  0x000000000079283d in karte_t::do_network_world_command (this=0xf767cf0, nwc=0x25a8a4e0) at simworld.cc:9704
#9  0x000000000079225b in karte_t::process_network_commands (this=0xf767cf0, ms_difference=0x7fffffffb068)
    at simworld.cc:9649
#10 0x0000000000792eb2 in karte_t::interactive (this=0xf767cf0, quit_month=2147483647) at simworld.cc:9810
#11 0x000000000072d087 in simu_main (argc=4, argv=0x7fffffffd988) at simmain.cc:1369
#12 0x000000000073fee8 in sysmain (argc=4, argv=0x7fffffffd988) at simsys.cc:805
#13 0x00000000007fa6e7 in main (argc=4, argv=0x7fffffffd988) at simsys_s2.cc:798
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 02:17:53 PM
Is this a local server or the Bridgewater-Brunel server? Connecting a multi-thread build to a single thread build will not work properly given the way in which multi-threading has been implemented.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 02:21:20 PM
It is a local server and actually the server is crashing, not the client.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 02:34:16 PM
That is very odd. This backtrace suggests that the crash actually occurs in some pre-compiled library for which it cannot load symbols. It is extremely odd that it is crashing in this way. Are you able to test to see the version in which this crash was first introduced? The Bridgewater-Brunel server ran single threadedly in the past, so this is not a long-term bug.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 02:41:09 PM
The crash seems to be due to trying to free some invalid memory. Likely a double free or a similar issue. It is more obvious in the game's error message.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 03:10:03 PM
The trouble is that the attempt to free invalid memory appears to be made by code that is not actually part of Simutrans-Experimental at all, but in an external library, which makes debugging extraordinarily difficult, as it is not even possible to see what is happening that might have caused this.

Are you able to test to see the version in which this error was first introduced?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 03:26:08 PM
I am working on it.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 03:30:23 PM
In 7cb45bd without multithreading, I don't get the server crash, but still get the desync. I still need to pinpoint the exact commit that introduced the problematic change. The crash seems to be related to the delete[] in the following code in simworld.cc (line numbers change with revisions; it's somewhere between line 8000 and 9000):
Code: [Select]
#ifdef MULTI_THREAD
destroy_threads();
init_threads();
#else
delete[] transferring_cargoes;
transferring_cargoes = new vector_tpl<transferring_cargo_t>[1];
#endif
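
As an illustration only, one common way to make such a delete[] tolerant of being reached more than once is to reset the pointer immediately after freeing it, since delete[] on a null pointer does nothing. This is a sketch under assumptions, not the actual fix: toy_world_t and toy_cargo_list_t are hypothetical stand-ins, and only the member name transferring_cargoes is taken from the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the cargo list type; not the real class.
typedef std::vector<int> toy_cargo_list_t;

struct toy_world_t {
    toy_cargo_list_t* transferring_cargoes;
    std::size_t list_count;

    toy_world_t() : transferring_cargoes(nullptr), list_count(0) {}
    ~toy_world_t() { release_cargoes(); }

    // Copying would leave two owners of the same array, so it is disabled.
    toy_world_t(const toy_world_t&) = delete;
    toy_world_t& operator=(const toy_world_t&) = delete;

    // Safe to call any number of times: delete[] on a null pointer is a
    // no-op, and the pointer is reset at once so it never dangles.
    void release_cargoes()
    {
        delete[] transferring_cargoes;
        transferring_cargoes = nullptr;
        list_count = 0;
    }

    // Mirrors the single-threaded branch: drop the old array, allocate anew.
    void reset_cargoes(std::size_t count)
    {
        assert(count > 0);
        release_cargoes();
        transferring_cargoes = new toy_cargo_list_t[count];
        list_count = count;
    }
};
```

A pointer that is never initialised, or that is freed elsewhere without being reset, would still produce the "invalid pointer" or "double free" abort, so this pattern only helps if every path that frees the array goes through the same function.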

I will need to first do some groceries shopping, now.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 03:42:02 PM
Thank you very much for testing - that is most helpful. I will have a look at that transferring cargoes code when I have finished trying to make cross-compiling work.

Edit: I have pushed some fixes for possible double free errors associated with this code. Are you able to test whether this assists?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 06:26:48 PM
I will test in a minute.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 06:40:23 PM
I am sorry, the issue seems to remain.  :-(   ... Wait, this time it is the client that crashes. Something new.

Code: [Select]
Message: haltestelle_t::reserve_position():     failed for gr=24,1784, cnv=659
Message: haltestelle_t::reserve_position():     failed for gr=1526,2901, cnv=123
Message: network_command_t::rdwr:       read packet_id=9, client_id=0
Warning: network_check_activity():      received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK:     time difference to server 1452
Message: network_world_command_t::execute:      do_command 9 at sync_step 6153 world now at 6104
Warning: karte_t:::do_network_world_command:    skipping command due to checklist mismatch : sync_step=6104 server=[ss=6104 st=763 nfc=0 rand=1333123480 halt=1 line=1 cnvy=1025 ssr=4101835982,4101835982,0,0,0,0,0,0 str=4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,1333123480,1333123480,1333123480,1333123480,1333123480,1333123480,1235659,105478,1333123480 exr=0,0,0,0,0,0,0,0  executor=[ss=6104 st=763 nfc=0 rand=2142232825 halt=1 line=1 cnvy=1025 ssr=4101835982,4101835982,0,0,0,0,0,0 str=4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,1333123480,1333123480,1333123480,1333123480,2142232825,2142232825,1235659,105478,1333123480 exr=0,0,0,0,0,0,0,0 
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:      all static variables are reset
Message: karte_t::reset_timer():        called, mode=$0
*** Error in `/home/felix/simutrans/simutrans/simutrans-experimental': double free or corruption (!prev): 0x0000000013efd040 ***

Trying again in gdb, I get the following:
Code: [Select]
Message: haltestelle_t::reserve_position():     failed for gr=1526,2901, cnv=123
Warning: karte_t:::do_network_world_command:    skipping command due to checklist mismatch : sync_step=13272 server=[ss=13272 st=1659 nfc=0 rand=1494109945 halt=1 line=1 cnvy=1025 ssr=1526326882,1526326882,0,0,0,0,0,0 str=1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1494109945,1494109945,1494109945,1494109945,1494109945,1494109945,1235681,105479,1494109945 exr=0,0,0,0,0,0,0,0  executor=[ss=13272 st=1659 nfc=0 rand=1188062987 halt=1 line=1 cnvy=1025 ssr=1526326882,1526326882,0,0,0,0,0,0 str=1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1494109945,1494109945,1494109945,1494109945,1188062987,1188062987,1235681,105479,1494109945 exr=0,0,0,0,0,0,0,0 
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:      all static variables are reset
Message: karte_t::reset_timer():        called, mode=$0

Program received signal SIGSEGV, Segmentation fault.
0x0000000000004061 in ?? ()
(gdb) bt
#0  0x0000000000004061 in ?? ()
#1  0x00000000007922ea in karte_t::process_network_commands (this=0x13616bb0, ms_difference=0x7fffffffb038)
    at simworld.cc:9654
#2  0x0000000000792f1a in karte_t::interactive (this=0x13616bb0, quit_month=2147483647) at simworld.cc:9814
#3  0x000000000072d0dd in simu_main (argc=8, argv=0x7fffffffd958) at simmain.cc:1369
#4  0x000000000073ff3e in sysmain (argc=8, argv=0x7fffffffd958) at simsys.cc:805
#5  0x00000000007fa74f in main (argc=8, argv=0x7fffffffd958) at simsys_s2.cc:798
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 06:47:28 PM
By the way, I still would like to ensure that my configuration (simuconf.tab) is not causing the problems. Are the separately downloaded experimental config files still needed?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on January 14, 2017, 07:09:26 PM
I have updated all servers running on server.exp.simutrans.com to f5e72b5c6a41f6ca064bb1d46de8e1cc29d37d16

I have no problems with those. I still get desync with bridgewater-brunel.me.uk.
Just to make it clear: the British pakset on my servers differs a little bit from git:

diff --git a/bus/routemaster.dat b/bus/routemaster.dat
index 1535554..b20d268 100644
--- a/bus/routemaster.dat
+++ b/bus/routemaster.dat
@@ -39,7 +39,7 @@ EmptyImage[N]=./images/routemaster.0.6
 EmptyImage[NE]=./images/routemaster.0.7
 ---
 # For TESTing of rescaled vehicles only - delete when testing complete.
-#obj=vehicle
+obj=vehicle
 name=Routemaster-rescaled
 copyright=JamesPetts&JamesHood
 waytype=road
---

I'm quite convinced that this is the root cause of the desync problems. If you comment out only the obj=vehicle line, makeobj creates an invalid pak file, ".Routemaster-rescaled.pak", instead of nothing as you might expect. So the trick is either to uncomment the obj= line, or to comment out all lines that belong to the definition of the object.
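As a rough way to spot this pattern before running makeobj, the sketch below scans the text of a .dat file for object blocks that still have active lines but no active obj= line. It assumes that blocks are separated by lines starting with "-" and that comments start with "#"; the function names are made up for this illustration and are not part of makeobj.

```python
def split_blocks(dat_text):
    """Yield the lines of each object block (assumed separator: a line starting with '-')."""
    block = []
    for line in dat_text.splitlines():
        if line.startswith("-"):
            yield block
            block = []
        else:
            block.append(line)
    yield block


def orphaned_blocks(dat_text):
    """Return the name= values of blocks with active lines but no active obj= line."""
    orphans = []
    for block in split_blocks(dat_text):
        # Active lines are non-empty lines that are not commented out.
        active = [ln.strip() for ln in block
                  if ln.strip() and not ln.lstrip().startswith("#")]
        if not active:
            continue  # fully commented-out or empty block: nothing is emitted
        if any(ln.lower().startswith("obj=") for ln in active):
            continue  # normal object definition
        name = next((ln.split("=", 1)[1] for ln in active
                     if ln.lower().startswith("name=")), "?")
        orphans.append(name)
    return orphans
```

Calling orphaned_blocks() on the contents of bus/routemaster.dat with only the obj= line commented out would be expected to list Routemaster-rescaled.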
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 07:32:17 PM
Thank you all for your testing: that is most helpful.

I have just tried unsuccessfully to reproduce this on Windows: a single threaded client can connect to a single threaded server, both compiled by Visual Studio (as debug builds), without either crashing (or indeed desynchronising, in the short-term at least: I have not yet run a long-term test).

This is all rather odd: it does potentially suggest a memory management issue somewhere, but Dr. Memory (when I was able to get it working without crashing Simutrans-Experimental by loading a map with no factories) found nothing but minor memory leaks in code shared with Standard and that have been there for a long time.

The backtrace on this occasion suggests an error arising ultimately from process_network_commands, code that is common between Experimental and Standard, which is a crash that is known to occur simultaneously to a desync in some cases for reasons that have never yet been ascertained.

It is still not clear why a single threaded build in Linux should desync/crash in this way, however, when it does not do so in Windows. Are you able to run Valgrind to see whether there are any detectable memory management issues?

Vladki - I am afraid that the problem cannot be the routemaster-rescaled object, as my own local tests, where I have now been able to reproduce a desync between client and server running the same build and pakset (two copies of the identical executable and pakset folder on the same computer) were performed using the road-vehicle-rescaling branch, which does not have the routemaster-rescaled object at all (the default Routemaster being rescaled instead).

If you really think that the routemaster-rescaled object is responsible for at least some of the desyncs, however, you might want to try tests with both client and server using the road-vehicle-rescaling branch instead of the half-heights branch to see whether that makes any difference.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 08:18:08 PM
I can confirm that building the pakset without the offending routemaster still does not solve the desync issue for me.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 08:23:57 PM
Thank you for testing that: that is helpful. Are you able to test single threaded builds of older commits to see whether the desync and/or crash issues occur there?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 08:26:55 PM
I still would like to ensure that this is not a configuration issue.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 08:30:27 PM
I still would like to ensure that this is not a configuration issue.

What sort of configuration issue had you in mind?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 08:31:14 PM
The issue is also definitely savegame dependent. With a copy of the bridgewater-brunel savegame that I used in a local game for some time, I do not get the immediate disconnect with a local host.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 08:32:37 PM
As said earlier, I am not sure if the extra package with the experimental-specific simuconf.tab etc. is still needed. I am currently using the configuration files from the devel-new-2 branch.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 08:36:57 PM
The issue is also definitely savegame dependent. With a copy of the bridgewater-brunel savegame that I used in a local game for some time, I do not get the immediate disconnect with a local host.

Interesting - how long does it take before you desync?

There may be multiple, separate desync issues, of course.

What do you mean about the package with the experimental-specific simuconf.tab? Do you mean the .zip file distributed with the old release binaries from long ago? This ought not in principle to be an issue, since all of the configuration settings are saved with the saved game and transferred to the client when it first connects to the server, overriding any configuration settings in the client's simuconf.tab. In any event, the simuconf.tab from Github should be the most up to date version.

The Bridgewater-Brunel server has its own modified simuconf.tab to allow for settings specific to that server, such as the administrator's (i.e. my) e-mail address, a description, etc.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 08:38:10 PM
With my local savegame the client desyncs only after like 10 min, but also with a mismatch of the random numbers.

Code: [Select]
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1289,1610,9 to (1306,1717,9) at 1289,1615, best = 1590, cost = 50, heur = 1620, dist = 109, turns = 1461

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command: sync_step=11776  server=[ss=11776 st=1472 nfc=0 rand=3328960089 halt=1 line=1 cnvy=1025 ssr=3461460419,3328960089,0,0,0,0,0,0 str=3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,3328960089,1688219608,143636616,3328960089 exr=0,0,0,0,0,0,0,0  client=[ss=11776 st=1472 nfc=0 rand=3461460419 halt=1 line=1 cnvy=1025 ssr=3461460419,3461460419,0,0,0,0,0,0 str=3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,3461460419,1688219608,143636616,3461460419 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...
Show banner ...
Message: karte_t::reset_timer(): called, mode=$0
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1336,2179,5 to (1337,2182,5) at 1337,2182, best = 70, cost = 70, heur = 700, dist = 0, turns = 630
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 08:39:19 PM
And yes, I was talking about that zip file. But if it is not needed, I should have a correct configuration.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 08:44:09 PM
It is a mismatch of the random numbers that is the more usual type of desync that is hard to track down, especially if it only happens every 10 minutes (meaning that each small change needs 10 minutes to be tested to see whether it makes a difference).

Either the desyncs on the Bridgewater-Brunel server have a different cause to those on a local server (which are still seriously problematic), or they are both related to the same thing, but for some reason causing a desync more quickly on the Bridgewater-Brunel server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 08:46:58 PM
Where are the numbers in the rand[] ("ssr" in the message) calculated? It is somehow interesting that on the server the value for rand[1] is identical to the seed ("rand" in the message), while on the client rand[0] and rand[1] are identical to the seed. The value for rand[0] matches the seed value from the client.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: TurfIt on January 14, 2017, 09:07:01 PM
ssr = sync_step randoms  karte_t::sync_step()
str = step randoms  karte_t::step()
These were just extra check points added to help track down desyncs 2 (3? 4?) years ago; I'd rather have expected them to have been removed once troubleshooting was over...

To make use of them, you'll want "server_frames_between_checks = 1" on the server. And then shuffle around where the current state of the randoms is captured into the checklist. IIRC the previous desyncs were all in the step - str numbers, so ssr is just showing the state of the random numbers at the beginning and end of the sync_step. For the log posted, it would indicate the server is using a random number somewhere in a sync_stepped object that the client is not. You'd need to break up the sync step to be by object and add more capturing to try and use these to find the possible issue.
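To read these checklists without comparing the numbers by eye, a small log-parsing sketch along these lines may help. It assumes the layout of the "checklist mismatch" lines posted in this thread (a server=[...] part followed by a client=[...] or executor=[...] part, with no closing bracket); the function names are invented for this example and do not exist in Simutrans.

```python
import re

# Matches "key=1" or "key=1,2,3" pairs inside a checklist dump.
FIELD = re.compile(r"(\w+)=(\d+(?:,\d+)*)")
# The posted logs do not close the server bracket, so split on the second label.
SIDES = re.compile(r"server=\[(.*?)\s+(?:client|executor)=\[(.*)")


def parse_checklist(text):
    """Turn 'ss=1 rand=2 ssr=3,4' into {'ss': [1], 'rand': [2], 'ssr': [3, 4]}."""
    return {key: [int(v) for v in values.split(",")]
            for key, values in FIELD.findall(text)}


def checklist_mismatches(log_line):
    """Return (field, index, server_value, other_value) for every differing entry."""
    match = SIDES.search(log_line)
    if match is None:
        return []
    server = parse_checklist(match.group(1))
    other = parse_checklist(match.group(2))
    diffs = []
    for key, server_values in server.items():
        for i, (s, o) in enumerate(zip(server_values, other.get(key, []))):
            if s != o:
                diffs.append((key, i, s, o))
    return diffs
```

Going by the description above, a first mismatch at ssr index 1 with ssr index 0 still equal would suggest that the random state diverged during the sync_step, whereas a first mismatch somewhere in str would point at the step.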
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 09:09:00 PM
The "rand=..." is the random seed on the client and server respectively. I am not entirely sure what the st and ssr are (I did not write this code), nor quite where the long list of numbers comes from. (Thank you TurfIt for answering whilst I was typing this reply - that is most helpful).

Normally, a desync of this sort is caused by divergence between server and client somewhere (it is usually extremely hard to find where), normally caused by some sort of indeterminism (which could be caused by undefined behaviour, incorrect implementation of multi-threading, a reference to an indeterminate variable or a failure to transmit all of the necessary information from the server to the client in the first place).

I usually find that the best way to fix this sort of problem is to try to narrow down the part of the code in which it occurs either by testing to see into which part of the code that it was introduced, or by selectively disabling parts of the code using preprocessor directives and seeing which parts need to be disabled in order for client and server to stay in sync.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 09:32:37 PM
I tried something slightly different, by logging all calls of simrand. Sadly, the information is quite difficult to interpret. At first glance, it seems like karte_t::generate_passengers_and_mail gets called on the client at some point while it does not get called on the server at the same time. From that point on, the random numbers seem to be out of sync.
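For anyone repeating this experiment, the comparison can be automated once both sides write one line per simrand call. The sketch below is only an illustration: the one-line-per-call format (for example "caller value") is an assumption, since the actual logging patch was not posted.

```python
from itertools import zip_longest


def first_mismatch(server_calls, client_calls):
    """Return (index, server_line, client_line) of the first differing call, or None.

    Each argument is a sequence of log lines, one per simrand call, in call order.
    A side that ends early shows up as None in the returned tuple.
    """
    for index, (server_line, client_line) in enumerate(
            zip_longest(server_calls, client_calls)):
        if server_line != client_line:
            return index, server_line, client_line
    return None
```

The lines just before the reported index show the last calls on which both sides still agreed, which is usually where to start reading.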
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 10:30:06 PM
Interesting. Was this a single-threaded or multi-threaded build?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 10:34:45 PM
This was in a multithreaded build. I was not really able to replicate this in a singlethreaded build. The interpretation might also be plainly wrong.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 10:41:53 PM
This could be a problem with the multi-threaded passenger generation, in that case. A single threaded build on the loopback interface has been connected with me for some time.

How long did it take to desync in the multi-threaded build?

Edit: Could you try to see whether it desyncs with a multi-thread build with the preprocessor directive FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 11:20:19 PM
With the bridgewater-brunel savegame I also had desyncs with singlethreaded builds, usually leading to an immediate crash of either the server or client.

Still trying with the flag, now.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 11:23:45 PM
Thank you - that is helpful. Did you get desyncs with the britain-3.sve file with single threaded mode? I could not get desyncs with that despite running it for about four hours this afternoon/evening in a single threaded build.

(The trick to increasing the efficiency of fixing this bug is to find a saved game that will reliably cause a desync quickly).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 11:26:55 PM
A build with the flag set (-DFORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE) still gives me an immediate desync on connect.

Multithreading was accidentally disabled. So, even without multithreading, I get an immediate disconnect with the britain-3 savegame.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 11:28:27 PM
With britain-3.sve or with the very similar but perhaps subtly different saved game saved from the Bridgewater-Brunel server?

Edit: I should note that I have been connected with that flag enabled since before I wrote the last message and am still connected now - on the loopback server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 11:30:08 PM
Now trying with only the flag ...
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 11:31:01 PM
Thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 11:37:47 PM
Same result :-(

If the britain-3 or the server's savegame is involved, I seem to get an immediate disconnect no matter what. With a copy of the server's savegame that I used locally for some hours, I only get a delayed disconnect after like 10 min.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 14, 2017, 11:43:10 PM
I have to say, I am finding it exceedingly difficult to understand why you are getting a different result to that which I am getting. The only thing that I can think to suggest now is for you to try older versions to see where this problem first arose. If this involves going any further back than late December, this will get very complicated indeed because from about October to December, I was adding multi-threading features, which involved lots of commits adding, disabling, then re-enabling (often many times over) a set of about four or five independent sets of multi-threading code, so it will not be a simple matter of going backwards and finding a version in which a desync does not occur.

I do find it very perplexing that you are getting rapid desyncs in a single threaded build, however, which I have not had (with a multi-threaded build) when testing connecting a Linux machine to the Bridgewater-Brunel server. I really cannot understand this at all.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 14, 2017, 11:52:40 PM
I already noticed that it gets complicated before December. One interesting aspect is also that it seems to be savegame related. With other savegames, I do not get the immediate disconnect.

Maybe I am, or we are, also overlooking something. Could my setup differ in any significant aspect?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 12:03:46 AM
I cannot think of anything configuration specific that could make a difference; but could you post your config.default file just in case?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 01:13:53 AM
sure (I added the .txt extension to be able to attach it)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 01:19:13 AM
I cannot see anything in there that seems to be problematic.

I should say that I was just about to test whether I could reproduce your results under Linux using my NUC which runs Ubuntu, but that device failed (to the extent that I am now organising a warranty return) as I was doing that, so I am afraid that I will not be able to do any Linux testing myself for a few weeks until the replacement item is sent to me and I am able to set it up.

Edit: It is rather a long shot, but do you think that you could try with SDL rather than SDL2 to see whether this makes any difference?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 01:31:01 AM
Might be worth a try. SDL2 also has another issue ;-) When I resize the window, the game crashes.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 01:40:12 AM
Thank you - do let me know how you get on.

I should say that the debug build Windows versions with the multi-threading of passenger generation disabled are still connected. I shall set up a release build to try overnight to see whether that makes any difference.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 01:54:47 AM
The result with SDL 1 is pretty similar:

Code: [Select]
ERROR: route_t::intern_calc_route():    Problem with heuristic:  from 1021,1369,5 to (1036,1459,8) at 1022,1369, best = 1554, cost = 10, heur = 3340, dist = 96, turns = 3234

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command:    skipping command due to checklist mismatch : sync_step=280 server=[ss=280 st=35 nfc=0 rand=786208633 halt=1 line=1 cnvy=1025 ssr=1005465115,1005465115,0,0,0,0,0,0 str=1005465115,1005465115,1005465115,1005465115,1005465115,1005465115,1005465115,786208633,786208633,786208633,786208633,786208633,786208633,1235700,105473,786208633 exr=0,0,0,0,0,0,0,0  executor=[ss=280 st=35 nfc=0 rand=4269245549 halt=1 line=1 cnvy=1025 ssr=2623138960,2623138960,0,0,0,0,0,0 str=2623138960,2623138960,2623138960,2623138960,2623138960,2623138960,2623138960,3881930485,3881930485,3881930485,3881930485,4269245549,4269245549,1235700,105473,3881930485 exr=0,0,0,0,0,0,0,0 
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:      all static variables are reset
Message: karte_t::reset_timer():        called, mode=$0
Segmentation fault
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 01:58:58 AM
That is exceedingly odd. Are you able to check older versions to see when this fault was first introduced? A good start might be the 1st of January: after the implementation of all the multi-threading, but before some of the work that I have done this year.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 02:01:38 AM
It looks like it is a configuration issue. The name the savegame uses for the pakset actually matches a different one (the one from the nightly builds page) in my setup, while the client uses the custom-built one. So it is likely caused by the pakset mismatch. The crashes are still valid issues. Sorry for the wasted time :-(


This really solves the immediate desync :-/
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 02:44:48 AM
Sadly, connecting to bridgewater-brunel.me.uk still results in a desync, but the server also claims to be a different version (the commit id seems to not exist, though).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 02:47:43 AM
Thank you for testing this. Can you clarify the circumstances, if any, in which you now get (1) a crash; and (2) a desync using the loopback interface?

Also, I have not encountered this issue before with the name used by the saved game for the pakset causing desyncs, and I am not really sure why it would do this. Can you let me know more about how you traced the problem to this issue? Are you sure that it is the name itself causing it? It is hard to see any means by which this could happen.

As to the Bridgewater-Brunel server, I am having problems with getting the correct version to work on that: see here (http://forum.simutrans.com/index.php?topic=15523.msg158095#msg158095) for an explanation, including a description of a very bizarre problem that I am currently unable to resolve, preventing me from having usable version numbers on this server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 02:52:50 AM
On the loopback interface I did not observe any more desyncs with the latest version.

In my test setup the server was accidentally loading a different version of the pakset, which had the name expected by the savegame. The client was running with a newer version built from the sources. This caused an immediate desync right after the connect. I was setting everything on the command line to simplify testing. In game, both versions of the pakset are unfortunately referenced by the same name, which made me miss the error.
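A mix-up like this can be checked from outside the game by fingerprinting the pakset folder that each side actually loads. The following is a minimal sketch, assuming the pakset consists of .pak files under one folder; pakset_digest is a name chosen for this example and not part of Simutrans.

```python
import hashlib
from pathlib import Path


def pakset_digest(folder):
    """Return a SHA-256 hex digest over the names and contents of all .pak files."""
    digest = hashlib.sha256()
    # Sort so that the result does not depend on directory listing order.
    for path in sorted(Path(folder).rglob("*.pak")):
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

If the digest of the server's pakset folder differs from the client's, the two sides are not running the same pakset, even when both folders report the same pakset name in game.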
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 12:35:25 PM
You had two different versions of the same pakset with the same name installed?

In any event, running overnight with release builds on Windows with FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined in the britain-3.sve, I get no desyncs either.

When you say that you get no desyncs with the latest version, is that with or without FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined, and after how long a time of running is that?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 01:03:42 PM
I had two versions of the same pakset in different folders.

The game was running with and without the flag and neither raised an immediate desync. I only ran the game for 15 min in both tests, though.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 01:07:14 PM
Hmm - I think that we need to find a map that triggers the desync with the multi-threaded passenger generation more quickly, as this must have been why I missed it in December when I was trying to make it work.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 01:11:23 PM
I still need to ensure that this is not also related to the configuration issue. Will let the game run for a longer time, now, without the flag and on loopback.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 01:13:11 PM
Thank you - that is helpful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 01:25:37 PM
I got some other stuff to do anyway. I will give you an intermediate result in an hour.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 01:27:49 PM
Splendid, thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 02:31:44 PM
Takes much longer, now:

Code: [Select]
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1323,2161,2 to (1313,2153,2) at 1323,2160, best = 201, cost = 10, heur = 260, dist = 13, turns = 237

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command: sync_step=98240  server=[ss=98240 st=12280 nfc=0 rand=218288632 halt=1 line=1 cnvy=1025 ssr=2422921232,2422921232,0,0,0,0,0,0 str=2422921232,2422921232,2422921232,2422921232,2422921232,218288632,218288632,218288632,218288632,218288632,218288632,218288632,218288632,3688248822,681432194,218288632 exr=0,0,0,0,0,0,0,0  client=[ss=98240 st=12280 nfc=0 rand=2852418295 halt=1 line=1 cnvy=1025 ssr=2422921232,2422921232,0,0,0,0,0,0 str=2422921232,2422921232,2422921232,2422921232,2422921232,2852418295,2852418295,2852418295,2852418295,2852418295,2852418295,2852418295,2852418295,3688248822,681432194,2852418295 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...
Show banner ...
Message: karte_t::reset_timer(): called, mode=$0
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1336,2179,5 to (1337,2182,5) at 1337,2182, best = 70, cost = 70, heur = 700, dist = 0, turns = 630

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com

This was the latest version, without the flag. Please note that I modified the build of the pakset to get rid of the offending routemaster (dot-file).

I will now retry with a build with the flag.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 02:37:29 PM
Thank you - that is helpful. I wonder whether testing with an optimised build might reproduce the problem more quickly?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 03:02:20 PM
With the flag, it seems to desync faster, but that might be a coincidence.

Code: [Select]
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
Warning: karte_t:::do_network_world_command: sync_step=8544  server=[ss=8544 st=1068 nfc=0 rand=2793569096 halt=1 line=1 cnvy=1025 ssr=2130149267,345191757,0,0,0,0,0,0 str=345191757,345191757,345191757,345191757,345191757,345191757,345191757,2793569096,2793569096,2793569096,2793569096,2793569096,2793569096,1224606908,104526904,2793569096 exr=0,0,0,0,0,0,0,0  client=[ss=8544 st=1068 nfc=0 rand=2793569096 halt=1 line=1 cnvy=1025 ssr=2130149267,345191757,0,0,0,0,0,0 str=345191757,345191757,345191757,345191757,345191757,345191757,345191757,2793569096,2793569096,2793569096,2793569096,2793569096,2793569096,1224606908,104526904,2793569096 exr=0,0,0,0,0,0,0,0 
Message: haltestelle_t::reserve_position(): failed for gr=24,1784, cnv=339
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
Message: haltestelle_t::reserve_position(): failed for gr=24,1784, cnv=339
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
Message: haltestelle_t::reserve_position(): failed for gr=24,1784, cnv=339
Message: haltestelle_t::reserve_position(): failed for gr=741,2278, cnv=170
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 0
Message: network_world_command_t::execute: do_command 9 at sync_step 8577 world now at 8572
Message: haltestelle_t::reserve_position(): failed for gr=24,1784, cnv=339
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
Warning: karte_t:::do_network_world_command: sync_step=8576  server=[ss=8576 st=1072 nfc=0 rand=2202801700 halt=1 line=1 cnvy=1025 ssr=4174812530,4174812530,0,0,0,0,0,0 str=4174812530,4174812530,4174812530,4174812530,4174812530,4174812530,4174812530,2202801700,2202801700,2202801700,2202801700,2202801700,2202801700,1229549888,104948820,2202801700 exr=0,0,0,0,0,0,0,0  client=[ss=8576 st=1072 nfc=0 rand=3894976443 halt=1 line=1 cnvy=1025 ssr=3344621506,3344621506,0,0,0,0,0,0 str=3344621506,3344621506,3344621506,3344621506,3344621506,3344621506,3344621506,3894976443,3894976443,3894976443,3894976443,3894976443,3894976443,1229549888,104948820,3894976443 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...
Show banner ...
Message: karte_t::reset_timer(): called, mode=$0
Message: haltestelle_t::reserve_position(): failed for gr=24,1784, cnv=339
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123

Please note that I removed the log messages regarding heuristic errors to make the log more readable.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 03:20:21 PM
Thank you for testing that. I am in the process of trying to find a map with which it will desync more quickly, but have encountered thread deadlocks when using that map that I am spending a long time now trying to resolve.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 03:38:16 PM
Currently running with "-debug 5" instead of "-debug 3". Already got to 10000 steps without desync.

Good luck with the deadlocks!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 03:39:49 PM
Thank you!
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 04:43:29 PM
Running for almost an hour, now, and at 130000 steps without desync (with flag, and "-debug 5").
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 04:46:36 PM
Excellent, thank you for that. The problem does appear to be in the multi-threaded passenger generation. The trouble is finding out where. I have just managed to fix the multi-threading deadlock, however, which might help in being able to use saved games that might test this more thoroughly.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 05:21:23 PM
Ok, I will switch to a build without the flag, but with the setting "-debug 5", now. Let's see what happens.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 05:50:42 PM
The desync seems more related to not using the command line argument "-debug 5" than the compile time flag. The version without the flag is already running for 20 min (36000 steps) without issues, too.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 06:31:33 PM
With the above settings it takes a really long time to desync (~1 hour):

client (simu.log):
Code: [Select]
Warning: NWC_CHECK: time difference to server 0
Message: network_world_command_t::execute: do_command 9 at sync_step 104001 world now at 103996
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Warning: karte_t:::do_network_world_command: sync_step=104000  server=[ss=104000 st=13000 nfc=0 rand=2001339770 halt=1 line=1 cnvy=1025 ssr=2001339770,2001339770,0,0,0,0,0,0 str=2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2307280173,1296641621,2001339770 exr=0,0,0,0,0,0,0,0  client=[ss=104000 st=13000 nfc=0 rand=2001339770 halt=1 line=1 cnvy=1025 ssr=2001339770,2001339770,0,0,0,0,0,0 str=2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2001339770,2307280173,1296641621,2001339770 exr=0,0,0,0,0,0,0,0 
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Message: haltestelle_t::reserve_position(): failed for gr=741,2278, cnv=170
Message: vehicle_t::remove_stale_cargo(): called
Message: vehicle_t::remove_stale_cargo(): called
Message: vehicle_t::remove_stale_cargo(): called
Message: vehicle_t::remove_stale_cargo(): called
Message: vehicle_t::remove_stale_cargo(): called
Message: vehicle_t::remove_stale_cargo(): called
Message: route_t::calc_route(): No route from 1213,2305 to 1249,2324 found
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 0
Message: network_world_command_t::execute: do_command 9 at sync_step 104033 world now at 104028
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Warning: karte_t:::do_network_world_command: sync_step=104032  server=[ss=104032 st=13004 nfc=0 rand=1664088824 halt=1 line=1 cnvy=1025 ssr=1664088824,1664088824,0,0,0,0,0,0 str=1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,2312225405,1297063697,1664088824 exr=0,0,0,0,0,0,0,0  client=[ss=104032 st=13004 nfc=0 rand=1664088824 halt=1 line=1 cnvy=1025 ssr=1664088824,1664088824,0,0,0,0,0,0 str=1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,1664088824,2312225405,1297063697,1664088824 exr=0,0,0,0,0,0,0,0 
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 33
Message: network_world_command_t::execute: do_command 9 at sync_step 104065 world now at 104059
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=194
Warning: karte_t:::do_network_world_command: sync_step=104064  server=[ss=104064 st=13008 nfc=0 rand=3200587526 halt=1 line=1 cnvy=1025 ssr=76729760,3560747292,0,0,0,0,0,0 str=3560747292,3560747292,3560747292,3560747292,3560747292,3560747292,3560747292,3560747292,3560747292,3560747292,3560747292,3200587526,3200587526,2317170637,1297485773,3560747292 exr=0,0,0,0,0,0,0,0  client=[ss=104064 st=13008 nfc=0 rand=1355265147 halt=1 line=1 cnvy=1025 ssr=600745941,862477084,0,0,0,0,0,0 str=862477084,862477084,862477084,862477084,862477084,862477084,862477084,862477084,862477084,862477084,862477084,1355265147,1355265147,2317170637,1297485773,862477084 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...

Server (console):
Code: [Select]
or help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
Message: packet_t::send:        sent 169 bytes to socket[5]; id=9, size=169
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
Message: packet_t::send:        sent 169 bytes to socket[5]; id=9, size=169
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
Message: packet_t::send:        sent 169 bytes to socket[5]; id=9, size=169
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
Message: packet_t::send:        sent 169 bytes to socket[5]; id=9, size=169
Warning: network_receive_data:  connection [5] already closed
Message: socket_list_t::remove_client:  remove client socket[5]
Warning: __ChatLog__:   disconnect,2,0.0.0.0,Client#2
Message: network_command_t::rdwr:       write packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr:      rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Client#2 has left. init=1 flags=0
Message: network_command_t::rdwr:       write packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr:      rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Now 0 clients connected. init=1 flags=0
Warning: nwc_tool_t::clone:     send sync_steps=0  tool=8224 init
Message: network_command_t::rdwr:       write packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr:      rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Client#2 has left. init=1 flags=0
Warning: nwc_tool_t::clone:     send sync_steps=13009  tool=8224 init
Message: network_command_t::rdwr:       write packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr:      rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Now 0 clients connected. init=1 flags=0
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: network_command_t::rdwr:       write packet_id=9, client_id=0
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 07:29:25 PM
It looks as though we will need to find a map that reproduces this more quickly. Unfortunately, it seems that I have not fixed the deadlock after all, and I am still working on that.

Edit: I think that I have now fixed the deadlock, and I am having trouble reproducing the desync on even a very dense and well developed map. Can you try again with the latest commit to see whether this makes any difference?

If there is a desync associated with the passenger generation, it must be a very rare one given the difficulties that we are having reproducing it.

Edit 2: The desync seems to occur a little faster with this (http://bridgewater-brunel.me.uk/saves/britain-1971.sve) saved game, but it still takes a long time.

Edit 3: Going back to the original topic, recompiling a fresh game on the Bridgewater-Brunel server and downloading the pakset from that server to make sure that the paksets were identical, connecting to that server results in a desync within <10 seconds of unpausing after connecting. It seems rather unlikely that this could be the same issue as produces the desync on even very complex maps only after about 20 minutes. There seems to be some additional problem, but it is not clear what it is. Felix - are you able to connect to the Bridgewater-Brunel server without desyncing more or less immediately?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 15, 2017, 11:53:37 PM
Yes, I also get the immediate desync. Nevertheless, this might be related to a version mismatch. I used the latest revision from git. I will do more tests in the evening.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 15, 2017, 11:57:43 PM
Can you check whether the server listing registers your pakset as matching or non-matching using the in-game network browser? For reasons already explained, I have not been able to get consistent version numbers for the executable on the server yet, but it should be the latest version from Github currently.

Edit: I have pushed a small fix for the multi-threaded passenger generation. I do not know whether this will affect synchronisation, but would you be able to re-test?

Edit 2: Having tested again, it still desyncs after a while with multi-threaded passenger generation enabled when using the loopback interface. Are you able to log the random calls again and see whether the problem is consistently that there is an extra call to generate_passengers_and_mail() on the client compared with the server?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 16, 2017, 07:41:08 PM
I would not trust the information from that analysis too much. As said, the data was very difficult to interpret.

With multithreaded passenger generation enabled, the client still desyncs immediately after connecting. I will now try without.

Compiled without multithreaded passenger generation enabled, I get the same desync.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 16, 2017, 09:05:42 PM
So it seems that there are two different causes of desyncs here. How did you analyse the data such as to give rise to the analysis relating to passenger generation? Since tests demonstrate that disabling multi-threaded passenger generation does in fact result in no desyncs when using the loopback interface, it does suggest that this analysis has some merit.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 16, 2017, 09:14:01 PM
I just printed a log message for each call to the simrand function, including the caller, and later compared the logs from the server and the client. Sadly, multithreading makes it difficult to actually map calls to one another. To be useful, one would probably need to improve the approach based on a better understanding of how randomness is handled by the game. I think the rand array used by the network check is doing something like this, but I do not yet fully understand it.
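
The comparison described above can be sketched roughly as follows. This is only an illustration, not code from the game: rand_call_t and first_divergence are hypothetical names, and the sketch assumes that the two logs have already been parsed into (caller, value) records. As noted, it is only meaningful for calls made by a single thread, so logs from a multi-threaded run would have to be split per thread first.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one logged simrand() call:
// the name of the caller and the value that was returned.
struct rand_call_t {
    std::string caller;
    unsigned long value;
};

// Returns the index of the first entry at which the two logs differ.
// If one log is a strict prefix of the other, returns the length of the
// shorter log. Returns -1 if both logs are identical.
long first_divergence(const std::vector<rand_call_t>& server,
                      const std::vector<rand_call_t>& client)
{
    const std::size_t n = server.size() < client.size() ? server.size() : client.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (server[i].caller != client[i].caller || server[i].value != client[i].value) {
            return static_cast<long>(i);
        }
    }
    return server.size() == client.size() ? -1 : static_cast<long>(n);
}
```

The entries just before the returned index are usually the interesting ones, since they show the last calls on which server and client still agreed.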
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 16, 2017, 10:03:10 PM
Yes - and that array was written by the Standard developers; Standard does not use any multi-threading in the actual simulation code (only in the loading/saving and in the graphics), so it will not have been designed to be used for multi-threading purposes in any event.

As to the random number generator, of all of the parts of the simulation that are multi-threaded, the only part where threads other than the main thread access the random number generator is the passenger generation section. In order to do this in a way that is deterministic between client and server, each thread generates its random number seed from its thread number (each thread being assigned a thread number on creation). The random number generator seed is then stored as a thread-local variable. The main thread's random number generator seed is saved with the saved game and transmitted over the network so as to remain in sync between the server and client. Every call to simrand() (but not to sim_async_rand(), which is used for things such as graphical effects that need not be in sync) both produces a random number and changes the seed. I am not sure exactly what the logging system logs - I suspect that it may log the state of the seed over time.
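
A minimal sketch of the per-thread seeding idea, for illustration only: thread_seed, thread_rand and worker_last_draw are hypothetical names, the generator is a simple linear congruential step rather than the one the game actually uses, and the rule "seed = thread number + 1" is an assumption made for the example.

```cpp
#include <cstdint>

// Hypothetical per-thread generator state. Each thread sees its own copy,
// so draws made in one thread never disturb the sequence of another.
thread_local std::uint32_t thread_seed = 0;

// Produces a number and advances the calling thread's seed
// (a plain linear congruential step, used here only as a stand-in).
std::uint32_t thread_rand()
{
    thread_seed = thread_seed * 1664525u + 1013904223u;
    return thread_seed;
}

// Body that a worker thread would run: seed from the thread number
// (assumed rule: thread number + 1), draw `draws` numbers, and return
// the last one (0 if no numbers were drawn).
std::uint32_t worker_last_draw(std::uint32_t thread_number, unsigned draws)
{
    thread_seed = thread_number + 1u;
    std::uint32_t last = 0;
    for (unsigned i = 0; i < draws; ++i) {
        last = thread_rand();
    }
    return last;
}
```

With this arrangement, the sequence seen by a given thread number depends only on that number and on how many draws the thread makes, not on how the threads are scheduled. It follows that if a thread makes a different number of draws on the server and on the client, the two will diverge.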

Every few steps, the server and client compare their random number seeds. If they do not match, the client is disconnected from the game: it is this that we usually term a "desync" (i.e., the client and server have got out of synchronisation with one another, resulting in a forced disconnexion). This can happen in a number of ways, all hard to trace because we can only detect the remote consequence of the problem rather than the problem itself. It might be caused by failing to transmit all the necessary data; it might be caused by non-deterministic multi-threading code; it might be caused by undefined behaviour somewhere in the code (e.g. taking the value of an uninitialised variable); it might be caused by something non-deterministic between platforms (to mitigate this problem, the entire physics engine once needed re-writing to use a special float class made using only int data, after it was found that native float types are not deterministic across platforms); or it might be caused by reference to some variable that will differ between server and client, such as using pointers as random number seeds or as keys in a hashtable (as in the built-in ptrhashtable class, which cannot be used where the order of iteration must be deterministic between client and server) or using a machine's local time (e.g. dr_get_time()) rather than the game's internal time (get_zeit_ms()).
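
To illustrate the point about pointers, here is a small sketch of the usual remedy: ordering objects by a stable identifier that is the same on server and client, rather than by their addresses. The type stop_t and the function deterministic_order are hypothetical and stand in for whatever objects are being iterated over; this is not code from the game.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical game object with an identifier that is saved with the game
// and is therefore identical on server and client.
struct stop_t {
    std::uint16_t id;
    int waiting;
};

// Returns the ids of the given objects in ascending id order.
// The result does not depend on where the objects live in memory
// or on the order in which the pointers were collected.
std::vector<std::uint16_t> deterministic_order(const std::vector<const stop_t*>& stops)
{
    std::vector<const stop_t*> sorted(stops);
    std::sort(sorted.begin(), sorted.end(),
              [](const stop_t* a, const stop_t* b) { return a->id < b->id; });

    std::vector<std::uint16_t> ids;
    for (const stop_t* s : sorted) {
        ids.push_back(s->id);
    }
    return ids;
}
```

Any processing that consumes random numbers or otherwise affects the game state can then walk the objects in this order and behave identically on every machine.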

I suspect that the passenger generation desync is caused by some non-deterministic artefact in the multi-threading code, although quite what it is I have as yet no idea. If it consistently manifests itself by causing generate_passengers_and_mail() to be called a different number of times on the server and the client, then I should strongly suspect that the problem is somewhere in step_passengers_and_mail_threaded (which is the code that determines the number of times that each thread calls generate_passengers_and_mail(), each instance being called in a separate thread), but, having spent some time looking at this yesterday, I cannot immediately see what the problem is. I did fix an incidental bug that probably had only a very small effect (the mechanism being self-correcting as to the number of passengers generated from step to step), which would cause too many passengers to be generated on an individual step, but this has not affected the desync issue.
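
For comparison, this is roughly the property that the call distribution needs to have in order to stay in sync: the number of calls given to each thread must be a pure function of values that are identical on server and client. The sketch below is not the actual logic of step_passengers_and_mail_threaded; split_calls is a hypothetical helper, and it assumes that both the total number of calls for the step and the number of threads are the same on both sides.

```cpp
#include <cstdint>
#include <vector>

// Splits total_calls among n_threads so that the result depends only on
// the two arguments. The first (total_calls % n_threads) threads receive
// one extra call each. Returns an empty vector if n_threads is 0.
std::vector<std::uint32_t> split_calls(std::uint32_t total_calls, std::uint32_t n_threads)
{
    std::vector<std::uint32_t> per_thread(n_threads, 0u);
    if (n_threads == 0) {
        return per_thread;
    }
    const std::uint32_t base = total_calls / n_threads;
    const std::uint32_t remainder = total_calls % n_threads;
    for (std::uint32_t t = 0; t < n_threads; ++t) {
        per_thread[t] = base + (t < remainder ? 1u : 0u);
    }
    return per_thread;
}
```

If anything that can differ between machines feeds into this calculation (timing, the order in which threads finish, or a thread count taken from the local hardware rather than from the transmitted settings), the per-thread counts, and with them the random number sequences, can diverge.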

The desync caused when connecting a Windows client to a Linux server I initially thought was caused by a problem of the platform type, but if you on a Linux client are getting the same problem connecting to the Bridgewater-Brunel server, then I am very confused as to what the problem might be. Client and server do not need to have identical simuconf.tab files (etc.) as these configuration settings are saved with the saved game and transmitted from server to client when the client first joins (precisely in order to avoid that sort of problem), and although a substantive difference in pakset (e.g. a vehicle in one pakset being faster or more powerful than the vehicle with the same name in the other) could easily cause a divergence between server and client, a mere difference in version number in and of itself (caused by the issues described on the other thread to which I linked earlier) or difference in pakset name cannot, so far as I can tell, do any such thing.

Edit: There is a method (enabled by defining DEBUG_SIMRAND_CALLS) to log calls to simrand() and print the most recent of them in the in-game message windows of both server and client as soon as a desync happens, but this appears not to be working properly at present. I will see about trying to restore it to working order to help the process.

Edit 2: I have now fixed DEBUG_SIMRAND_CALLS - it turns out that an important line had been commented out.

Edit 3: Unfortunately, this does not seem to be reliable enough to be used in a complex game as it causes crashes.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 08:59:16 PM
Thank you very much for the extensive explanation. I will definitely continue to investigate the issue. Sadly, this will take quite a while, as my spare time is currently rather limited.

On the weekend, I already tried to restore the debug functionality for the random generation, but that code seems to be affected by significant bit rot. Next weekend, I will probably try again.

The problem is likely not platform related. But it is definitely affected by the specific save game. An interesting experiment might be to run bridgewater-brunel with a fresh save game. Might at least give some new insight.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 09:08:29 PM
Thank you for that suggestion, and for all your work so far: it is most appreciated. I am currently in the process of trying to restart the Bridgewater-Brunel server with the 1971 version of Rollermaterial's map to see whether this makes any difference.

Edit: Having done this, I still get a desync - I should be interested in whether you still get a desync with this server, and also with server.exp.simutrans.com.

Incidentally, would it interfere with the investigations that you plan to carry out in this regard if I were to start integrating some of the translation patches from Standard?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 09:48:19 PM
Don't worry about integrating anything! First, I will need to understand the code, anyway, and merging commits should not be too much of a problem.

I am currently building the latest version and will then try to connect.

Edit 1: Ok, sadly, with the same result. But, at least, now, we know that it is not the map itself.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 09:52:55 PM
Thank you - that is helpful. The latest changes from Standard change the names of quite a few files/variables to make the code more accessible to those who do not speak German, although I notice that you speak German in any case.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 09:54:36 PM
Yes, the German in the code will not negatively affect me ;-) Still, it feels strange to have variable and function names in German.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 10:06:30 PM
Having undertaken some testing in step_passengers_and_mail_threaded using the debugger to try to find the cause of the infrequent desync reproducible on the loopback interface, I cannot find anything resembling an anomaly so far.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 10:14:27 PM
The infrequent desyncs will be significantly more difficult to track down.

The immediate desync seems to happen at the first check. Something is already wrong right at the start. Another minor observation is that the reported time difference between server and client continuously increases, while on my local setup it usually also starts with a significant difference, but quickly decreases. Not sure if this has any significance at all.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 10:19:20 PM
The first check? That is interesting. That suggests some problem with data transmission (i.e. the load/save routines) rather than with the running simulation code. Is the instant desync reporting divergence in the random numbers?

As to the time difference, the code for this is unchanged from Standard; so far as I am aware, the idea is that the client and server try to run in time with one another. The client getting behind the server will cause input lag but will not cause desyncs, while the client getting ahead of the server will cause desyncs (which is why the system will always try to keep the client slightly behind the server according to the simuconf.tab setting server_frames_behind). The commonest cause of the client getting behind the server by a significant amount is the client running on a slower machine than the server. May I ask what your system specifications are?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 10:25:46 PM
The disconnect happens at the first check appearing in the log. It is the usual difference in random number seeds.

I am running the game on an Intel Core i7-4800MQ at 2.70GHz with 32 GB RAM. The operating system is a pretty up to date 64-bit Gentoo Linux. The machine should not be part of the problem. This is no longer the one from years ago, when I had issues keeping up with the multiplayer game ;-)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 10:28:28 PM
Hmm - that looks like a more powerful machine than mine, an Intel i7 950 (overclocked to 4.2GHz) with 12GB of RAM, so that is probably not the issue. If you are falling behind the server, it is not clear to me why in those circumstances - unless perhaps it is because you are running a non-optimised build and the server is (as it indeed is) running an optimised build? The difference between the two in performance is very large.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 10:30:34 PM
That is something I can quickly test out. But I would assume the initial lag is just an artifact caused by some underlying cause.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 10:32:03 PM
That is also a possibility - I should be interested in the results of your tests. In theory, Simutrans is supposed to be able to be compiled with -O3.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 10:44:40 PM
I only activated the optimization in the config file and removed the debug flag. I did not manually change the CFLAGS. The result is still the same. The log clearly shows that the first check already fails:

Code: [Select]
Message: karte_t::load(): Prepare for loading
World destroyed.
Warning: karte_t::load: Fileversion: 120003
Warning: convoi_t::finish_rd(): convoi ((681) BR Class 50) is broken => realign
Warning: convoi_t::finish_rd(): convoi ((889) BR-307(front)-AC) is broken => realign
Warning: rail_vehicle_t::set_convoi(): convoi 996 had a too high route index! (25 of max 22)
Warning: convoi_t::finish_rd(): convoi ((1032) BR-501(DMB)-3rdRail) is broken => realign
Warning: rail_vehicle_t::set_convoi(): convoi 1434 had a too high route index! (15 of max 10)
Warning: rail_vehicle_t::set_convoi(): convoi 1412 had a too high route index! (17 of max 13)
Warning: rail_vehicle_t::set_convoi(): convoi 1128 had a too high route index! (40 of max 21)
Warning: rail_vehicle_t::set_convoi(): convoi 1654 had a too high route index! (20 of max 9)
Warning: convoi_t::finish_rd(): convoi ((1724) LUL C Stock) is broken => realign
Warning: rail_vehicle_t::set_convoi(): convoi 1751 had a too high route index! (18 of max 17)
Warning: rail_vehicle_t::set_convoi(): convoi 1502 had a too high route index! (16 of max 12)
Warning: convoi_t::finish_rd(): convoi ((2017) BR Class 86) is broken => realign
Warning: convoi_t::finish_rd(): convoi ((1190) BR Class 33) is broken => realign
Warning: karte_t::load(): loaded savegame from 9/1971, next month=-1719664640, ticks=-1721195101 (per month=1<<22)
Message: network_command_t::rdwr: read packet_id=7, client_id=0
Warning: network_check_activity(): received cmd id=7 nwc_ready_t from socket[10]
Warning: nwc_ready_t::execute: set sync_step=1951 where map_counter=5031530
Warning: karte_t::network_game_set_pause: steps=243 sync_steps=1951 pause=0
Message: network_command_t::rdwr: read packet_id=12, client_id=0
Warning: network_check_activity(): received cmd id=12 nwc_auth_player_t from socket[10]
Message: nwc_auth_player_t::execute: plnr = 255  unlock = 32767  our_client_id = 0
Message: network_command_t::rdwr: read packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr: rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Welcome, Felix! init=1 flags=0
Warning: network_check_activity(): received cmd id=8 nwc_tool_t from socket[10]
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server -198
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command: sync_step=1952  server=[ss=1952 st=244 nfc=0 rand=217478360 halt=1025 line=1 cnvy=1025 ssr=2939411524,64761495,0,0,0,0,0,0 str=64761495,64761495,64761495,64761495,64761495,4049239263,4049239263,4049239263,4049239263,4049239263,4049239263,217478360,217478360,3415153,149758,4049239263 exr=0,0,0,0,0,0,0,0  client=[ss=1952 st=244 nfc=0 rand=217478360 halt=1025 line=1 cnvy=1025 ssr=2939411524,64761495,0,0,0,0,0,0 str=64761495,64761495,64761495,64761495,64761495,4049239263,4049239263,4049239263,4049239263,4049239263,4049239263,217478360,217478360,3415153,149758,4049239263 exr=0,0,0,0,0,0,0,0 
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: sint64 convoi_t::calc_revenue: Average speed (164) for (1902) BR-307(front)-AC exceeded maximum speed (110); falling back to overall average

Edit 1 Interestingly, we actually start out with a negative time difference, now! Which seems to mean the client is ahead.

Edit 2 Please note, I removed all warnings regarding the heuristic error.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 10:47:40 PM
I was not expecting the optimisation to affect the desyncs, but rather the timing issue - but of course, I was slightly confused, because you were having the timing issue when running on the loopback interface, not when connected to the server, so both would have been debug builds, and optimisation should not make a difference there, either.

Let me see whether I can increase the frequency of checks on the server so that we have a better idea of what is happening with these early desyncs.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 10:48:37 PM
I will try with the debug build again. Maybe the different behavior is caused by something else.


Edit 1 This is definitely strange. A debug build survives a little longer, more than one check, and I see the increasing time difference, again.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 17, 2017, 10:55:31 PM
I am just restarting the server now with server_frames_between_checks set to 1 rather than 32 (and with the latest commit from devel-new-2) to see whether this gives more precise diagnostics.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 17, 2017, 11:25:03 PM
I get a lot more checks before the server desyncs ;-) I am currently trying to find the description of the data displayed by the check. It was somewhere within this thread. But for tonight, I will also need to quit.
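
Until the description of the fields turns up, the server and client halves of a check line can at least be compared mechanically. The following is only a rough sketch based on the layout of the log lines posted in this thread ("server=[" followed by space-separated key=value pairs, then "client=[" followed by the same); parse_fields and differing_fields are hypothetical helper names, not functions from the game.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits whitespace-separated "key=value" tokens into a map.
// Tokens without '=' are ignored.
std::map<std::string, std::string> parse_fields(const std::string& text)
{
    std::map<std::string, std::string> fields;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::string::size_type eq = token.find('=');
        if (eq != std::string::npos) {
            fields[token.substr(0, eq)] = token.substr(eq + 1);
        }
    }
    return fields;
}

// Returns the names (in alphabetical order) of all fields in the server
// half of a check line whose value is missing or different in the client
// half. Returns an empty list if the line does not have the expected layout.
std::vector<std::string> differing_fields(const std::string& line)
{
    std::vector<std::string> diff;
    const std::string server_tag = "server=[";
    const std::string client_tag = "client=[";
    const std::string::size_type s = line.find(server_tag);
    const std::string::size_type c = line.find(client_tag);
    if (s == std::string::npos || c == std::string::npos || c < s) {
        return diff;
    }
    const std::string::size_type server_begin = s + server_tag.size();
    const std::map<std::string, std::string> server =
        parse_fields(line.substr(server_begin, c - server_begin));
    const std::map<std::string, std::string> client =
        parse_fields(line.substr(c + client_tag.size()));

    for (const auto& kv : server) {
        const auto it = client.find(kv.first);
        if (it == client.end() || it->second != kv.second) {
            diff.push_back(kv.first);
        }
    }
    return diff;
}
```

Run over each "do_network_world_command: sync_step=..." line of simu.log, this shows at a glance which fields differ at the failing check, without having to compare the long rows of numbers by eye.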

Just in case it helps, the last couple lines from the latest log:
Code: [Select]
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 100 world now at 94
Warning: karte_t:::do_network_world_command: sync_step=93  server=[ss=93 st=11 nfc=5 rand=187636870 halt=1025 line=1 cnvy=1025 ssr=2187775812,187636870,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0  client=[ss=93 st=11 nfc=5 rand=187636870 halt=1025 line=1 cnvy=1025 ssr=2187775812,187636870,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 101 world now at 95
Warning: karte_t:::do_network_world_command: sync_step=94  server=[ss=94 st=11 nfc=6 rand=890309775 halt=1025 line=1 cnvy=1025 ssr=187636870,890309775,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0  client=[ss=94 st=11 nfc=6 rand=890309775 halt=1025 line=1 cnvy=1025 ssr=187636870,890309775,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0 
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: route_t::calc_route(): No route from 1413,2338 to 1414,2342 found
Message: air_vehicle_t::find_route_to_stop_position(): no free position found!
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: haltestelle_t::reserve_position(): failed for gr=1528,1702, cnv=612
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=212
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=9
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 102 world now at 96
Warning: karte_t:::do_network_world_command: sync_step=95  server=[ss=95 st=11 nfc=7 rand=710456930 halt=1025 line=1 cnvy=1025 ssr=890309775,710456930,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0  client=[ss=95 st=11 nfc=7 rand=710456930 halt=1025 line=1 cnvy=1025 ssr=890309775,710456930,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 103 world now at 97
Warning: karte_t:::do_network_world_command: sync_step=96  server=[ss=96 st=12 nfc=0 rand=3405237637 halt=1025 line=1 cnvy=1025 ssr=710456930,1023255373,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0  client=[ss=96 st=12 nfc=0 rand=3405237637 halt=1025 line=1 cnvy=1025 ssr=710456930,1023255373,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 104 world now at 98
Warning: karte_t:::do_network_world_command: sync_step=97  server=[ss=97 st=12 nfc=1 rand=3274686897 halt=1025 line=1 cnvy=1025 ssr=3405237637,3274686897,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0  client=[ss=97 st=12 nfc=1 rand=2249708390 halt=1025 line=1 cnvy=1025 ssr=3405237637,2249708390,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...
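
For anyone reading these checklist lines by eye: in the last one above (sync_step=97), the server and client halves differ in the rand field and in the second ssr entry, while everything else matches. A minimal sketch of how such a line could be compared automatically is below. It assumes only the layout visible in this log (a "server=[" half followed by a "client=[" half, each made of space-separated key=value tokens); the function names are made up for illustration and are not part of Simutrans.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using field_t = std::pair<std::string, std::string>; // (key, value)

// Split "ss=97 st=12 rand=..." into ordered (key, value) pairs.
// Tokens without '=' are ignored.
std::vector<field_t> parse_fields(const std::string& text)
{
    std::vector<field_t> fields;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::size_t eq = token.find('=');
        if (eq != std::string::npos) {
            fields.emplace_back(token.substr(0, eq), token.substr(eq + 1));
        }
    }
    return fields;
}

// Return the keys whose values differ between the server and client halves
// of one checklist log line (empty if they match or the line has no checklist).
std::vector<std::string> mismatched_fields(const std::string& line)
{
    const std::string server_tag = "server=[";
    const std::string client_tag = "client=[";
    const std::size_t s = line.find(server_tag);
    const std::size_t c = line.find(client_tag);
    std::vector<std::string> result;
    if (s == std::string::npos || c == std::string::npos || c < s) {
        return result; // not a checklist line
    }
    const std::size_t server_begin = s + server_tag.size();
    const std::vector<field_t> server = parse_fields(line.substr(server_begin, c - server_begin));
    const std::vector<field_t> client = parse_fields(line.substr(c + client_tag.size()));
    for (std::size_t i = 0; i < server.size() && i < client.size(); ++i) {
        if (server[i].first == client[i].first && server[i].second != client[i].second) {
            result.push_back(server[i].first);
        }
    }
    return result;
}
```

Feeding each "do_network_world_command" line of a log to mismatched_fields and printing the first non-empty result would show the sync step at which the two sides diverge and which fields are involved.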
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 18, 2017, 11:18:41 AM
Hmm - by itself, this does not seem very revealing.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 19, 2017, 07:22:36 PM
I honestly did not expect it to be too helpful, either.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 19, 2017, 11:47:59 PM
There is no harm in trying, I suppose.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on January 26, 2017, 11:05:22 PM
I have lost track of what you have tried or not, but if you want to test, I have three servers running on server.exp.simutrans.com, all running commit f49e8a11d3694003ca199e439e920ac7bebdc694.

port: 13353 - small map, pak128.Britain - half-heights branch: http://server.exp.simutrans.com/pak128.Britain-Ex.zip
port: 13354 - copy of bridgewater brunel, pak128.Britain - rescaled-road-vehicles: http://server.exp.simutrans.com/pak128.Brunel-Ex.zip
port: 13355 - small map, pak128.Sweden: http://server.exp.simutrans.com/pak128.Sweden-Ex.zip

I have played with all of them for quite some time, and got only one desync (on each), which I think was related to some attempt to mess with vehicle schedules. Could you try with Windows clients?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on January 26, 2017, 11:39:09 PM
Connecting with a Windows client, I get desyncs on both of the British paksets. The big map on port 13354 desyncs after only about 10-12 seconds, while the smaller one on 13353 desyncs after maybe half a minute. With the Swedish pakset, it appears that I can stay in sync indefinitely.

Could it be something of the following:

* Map size - The bigger a map is, the faster it desyncs. Is the Swedish map so small that it never reaches whatever threshold is needed to trigger a desync?
Possible tests to perform:
- A big Swedish map to see if that also desyncs
- A very small British map to see if it can stay in sync

* Pakset - Since, for me, it is only the British paksets that are desyncing, could it be that the British pakset contains some objects, or settings in simuconf.tab or similar, that are bound to cause a desync?
Possible tests to perform:
- Restart a bare British map, populating it with a single (kind of) object at a time, and see if it starts to desync

Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 27, 2017, 01:24:16 AM
Thank you both very much for your testing - this seems helpful. Small maps (especially maps that are not well developed) often fail to trigger desyncs as they often do not use the feature that causes it, or use it very rarely.

I should be most interested to see the result of these tests.

Edit: I have restarted the Bridgewater-Brunel server with the latest version, too.

Edit 2: I have been testing a little more with this to try to track down the rare desync associated with passenger generation. So far, I have tried fixing "next_step_passenger_this_thread" and "next_step_mail_this_thread" at 50 each to see whether desynchronisation between these values on client and server is the cause of the problem. Testing this on the britain-1971 map (one of Rollermaterial's uploads), the game desynchronised just before the end of October (the loading time being mid-October). Next, I will try disabling all mail generation to see whether the problem is specific to mail.

Edit 3: Trying again with the mail generation disabled resulted in a desync even sooner than before, although it is not clear whether this difference is other than random.

Edit 4: I have realised that I had not properly set up my preprocessor directives with the result that the first test was actually run with the code as normal, not with the fixed numbers for passenger generation per step. This first run, however, is still useful as a control to gauge the time within which a desync may be expected to occur with the ordinary code. I am about to try again with the fixed numbers enabled properly.

Edit 5: Testing again with this properly enabled, and the desync still occurs, which suggests that the problem is not anything to do with the step_passengers_and_mail_threaded method itself (which calls the generate_passengers_or_mail method and determines the number of times that it is called on each thread).

Edit 6: Testing with the following defined:

Code: [Select]
#define FORBID_SYNC_OBJECTS
#define FORBID_PRIVATE_CARS 
#define FORBID_PEDESTRIANS
#define FORBID_CONGESTION_EFFECTS
#define DISABLE_JOB_EFFECTS


still produces a desync.

Edit 7: Defining DISABLE_GLOBAL_WAITING_LIST did not prevent the desync.

Edit 8: Preventing step_passengers_and_mail_threaded from running multiple passenger generation threads in parallel (but allowing it to run concurrently with some small parts of karte_t::step) does prevent the desync from occurring - but this does not say much, as it only runs concurrently with some very minor pieces of code in karte_t::step updating statistics.

Edit 9: Defining FORBID_PUBLIC_TRANSPORT does appear to prevent the desyncs: on the britain-1971 map, it runs until well into November (and is still running now without a desync), whereas, with this not defined, it desyncs by the end of October. This suggests that the error is related specifically to an element of the code dealing in particular with public (i.e. player) transport, and not pedestrians, private cars or passenger generation statistics.

Edit 10: Defining FORBID_RETURN_TRIPS does seem to prevent the desync, however (running now at nearly the end of November in the britain-1971 map with no desync). I have found a possible cause of this relating to the interaction between a mutex lock and a goto command, and will test to see whether this fixes the problem now.

Edit 11: The mutex error (a fix for which is now pushed) did not seem to prevent the desync. I will have to investigate further.

Edit 12: Although used only for returning passengers, disabling the city->set_generated_passengers() function does not affect the desync, which still occurs before the end of October.

Edit 13: Even disabling the code that starts the returning passengers on their way (without which they are simply discarded as unused local variables at the end of each passenger generation step) does not affect the desync.

Edit 14: Testing again with the returning passengers code disabled, this time for a number of hours (into December 1971 in game time) to make sure that the last test of this was not a fluke, and still no desync in this condition. There is something odd about the situation with returning passengers that causes this desync that has yet eluded me.

Edit 15: Defining all of the following preprocessor directives (added to the code recently for the purposes of testing as to the cause of this issue) seems to prevent the desync:

FORBID_SWITCHING_TO_RETURN_ON_FOOT
FORBID_SET_GENERATED_PASSENGERS
FORBID_RECORDING_RETURN_FACTORY_PASSENGERS
FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS
FORBID_STARTE_MIT_ROUTE_FOR_RETURNING_PASSENGERS

even if FORBID_RETURN_TRIPS is not defined. This is the first time that it has been possible to run without a desync without FORBID_RETURN_TRIPS being defined; but this might just be because the combined effect of all of these preprocessor directives is that the returning passengers code has virtually no effect.

Edit 16: Undefining FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS and leaving all of the above defined allows it to run without a desync (well into November at least on the britain-1971 map used for testing).

Edit 17: Undefining FORBID_STARTE_MIT_ROUTE_FOR_RETURNING_PASSENGERS and FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS together allows the britain-1971 saved game to run well into November without a desync.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Felix on January 29, 2017, 06:52:55 PM
Good to see that you seem to be making some progress with analyzing the issue. Sadly, I cannot currently find sufficient time to get more involved and start digging into the code.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 29, 2017, 08:00:44 PM
Don't worry - such time as you are able to spare to do such things as you are able in that time to do is very much appreciated.

Edit 1: I had made an error with the FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS preprocessor directive, with the result that it did precisely the opposite of what was intended. I will re-test with this error corrected and see whether it works with only this defined (correctly). The latest test showed that there were no desyncs into December with FORBID_SET_GENERATED_PASSENGERS and FORBID_RECORDING_RETURN_FACTORY_PASSENGERS defined and none other (but the error being the equivalent of also having FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS defined).

Edit 2: With only FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS defined (after the fix is applied), the britain-1971 map reaches December without a desync.

Edit 3: As FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS prevented two separate parts of code from compiling, I broke that down into two segments, FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1 and FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_2. Further testing shows that only FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1 is the one that makes the decisive difference. That refers to the following code:

Code: [Select]
// Now try to add them to the target halt
ware_t test_passengers;
test_passengers.set_ziel(start_halts[best_bad_start_halt].halt);
#ifndef FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1
const bool overcrowded_route = ret_halt->find_route(test_passengers) < UINT32_MAX_VALUE;
#else
const bool overcrowded_route = false;
#endif

Edit 4: Only defining FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1 made a difference to whether the desync occurred: with it defined, it did not occur, and with it undefined, it did occur. Looking into the code more closely, this appeared to be in the code relating to overcrowding. It seems as though updating the stop's statistics can in some cases cause the stop to recalculate its status immediately, including whether it is overcrowded. This has the potential to cause a desync if the updates happen in a different order on the client and server, since a particular stop may then be recorded as overcrowded on one and not on the other at the moment when this is checked. I have now disabled the immediate updating of status (it is updated every step in any event), and this appears to have prevented the desync even with none of the testing preprocessor definitions defined.
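
To illustrate the general mechanism described in Edit 4 (not the actual Simutrans code), here is a minimal sketch. The type stop_sketch_t and its members are hypothetical stand-ins, and the numbers are arbitrary. It shows that, when the status is recalculated immediately on every statistics update, a check made between two steps can see a different answer depending on whether the update ran before or after the check; when the status is only refreshed in the step, the check sees the same cached value in either order.

```cpp
#include <cstdint>

// Hypothetical, much simplified stand-in for a stop; not the real haltestelle_t.
struct stop_sketch_t {
    std::uint32_t capacity;
    std::uint32_t waiting = 0;
    bool overcrowded = false; // cached status

    explicit stop_sketch_t(std::uint32_t cap) : capacity(cap) {}

    void recalc_status() { overcrowded = waiting > capacity; }

    // Immediate variant: every statistics update also refreshes the status.
    void add_waiting_immediate(std::uint32_t n) { waiting += n; recalc_status(); }

    // Deferred variant: only the counter changes; the status waits for step().
    void add_waiting_deferred(std::uint32_t n) { waiting += n; }

    // Called once per simulation step, at a fixed point in the step.
    void step() { recalc_status(); }
};

// What a check made between two steps observes, depending on whether the
// statistics update happens before or after the check.
bool observed_status(bool immediate, bool update_first)
{
    stop_sketch_t stop(50);
    stop.step(); // status as of the last step: not overcrowded

    if (update_first) {
        if (immediate) { stop.add_waiting_immediate(60); } else { stop.add_waiting_deferred(60); }
    }
    const bool seen = stop.overcrowded; // the check
    if (!update_first) {
        if (immediate) { stop.add_waiting_immediate(60); } else { stop.add_waiting_deferred(60); }
    }
    return seen;
}
```

With immediate == true, observed_status gives different results for the two orderings; with immediate == false, both orderings give the same result, which is the property that matters when client and server may not perform the updates in the same order.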

However, in looking at this closely, I have found some logic bugs with the system for dealing with overcrowding of return journeys (which may also have made this part of the code slower than necessary). I have written what I hope will be a fix to this, but need now to run the desync test again to see whether it remains desync free with this hopefully fixed code.

Edit 5: I think that I have what is now a fix for this issue (just pushed to Github). After testing overnight with the corrected code as described above, the britain-1971 map gets to June 1972 without disconnecting.

That appears to deal with the rare desync that occurred even between client and server of the same build: this does not address the very different, nearly instant desync that occurs for mysterious reasons between client and server of different builds.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on January 30, 2017, 10:53:36 PM
James, you will tell us if we need to do something, right? :)

Should we connect with this or that (which is basically the only thing at least I can do), just give a hint about it!

Otherwise, fun to read your comments!  ;D
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on January 31, 2017, 12:28:58 PM
Yes - nothing for you to do at this stage: all that I have done is fix a desync that would occur occasionally.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 01, 2017, 05:53:51 PM
I have managed to get my Linux computer working again, so I have been able to test connecting to the servers using the Linux machine.

Connecting to bridgewater-brunel.me.uk with the Linux machine still results in a desync after only a few seconds, even from the Linux machine. This is different to previous behaviour, and I do not understand what has caused this difference. I tried deleting the .Routemaster-rescaled.pak from both client and server and ran the test again, with the identical result.

Connecting to server.exp.simutrans.com with the Linux machine allows it to stay connected for some time (it is still connected now). On Windows, it desyncs after a few seconds, although it did crash when first I tried to connect.

It is a shame that the Bridgewater-Brunel mirror server has been taken down just when I am able to make use of testing that again.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on February 01, 2017, 10:38:22 PM
The copy of Bridgewater game must have crashed then...



Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on February 04, 2017, 10:57:44 AM
server.exp.simutrans.com is updated to commit (37569a4117e2580003e3450a11605369d5f77643) and all 3 instances running again
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 04, 2017, 12:59:07 PM
Splendid, thank you for that.

Testing using my Linux NUC, the britain-1971 saved game running on the loopback interface remained connected overnight, so there is nothing inherent about running that saved game on Linux that causes a desync. There must be something odd about the Bridgewater-Brunel server. I will try again connecting to Vladki's server with the Linux machine and see whether that makes any difference.

Edit: Attempting this causes a desync within about 10 seconds, albeit this is the britain-3 (circa 1911) rather than the britain-1971 map. I will re-test on the loopback with this map.

Edit 2: Testing over several hours, this saved game also runs stably without a desync on the loopback interface in Linux.

Edit 3: Running a server on Windows and a client on Linux over my local network, the britain-1971 game causes a desync within less than a minute of connecting (but not as fast as connecting to the Bridgewater-Brunel server).

Edit 4: Some very interesting results: running with a MinGW built Windows client (not the cross-compile at present, which is still not working, but a MinGW executable built on Windows), I am able to connect to a Linux server running the britain-1971 game running on my local network for hours at a time without desync (it has not desynced once).

I am still getting crashes sometimes with the MinGW build that I am not getting with the Visual Studio build. These seem mainly to be related to loading games, and I cannot debug them because the debug symbols are not compatible with Visual Studio and, for some reason that I cannot fathom, MinGW's version of GDB also seems to crash when the game crashes, making it impossible to perform a backtrace. All that I know so far is that there is a segmentation fault/access violation caused by attempting to read data from a null pointer somewhere.

However, trying to connect to server.exp.simutrans.com (the Bridgewater-Brunel copy) with this executable results in a desync, as does attempting to connect to the Bridgewater-Brunel server itself.

Edit 5: I have now managed to set up the Bridgewater-Brunel server to use the correct nightly build, which carries the same revision number as the MinGW build that I am using (except that, for some reason, the MinGW build uses two more characters in its shortened version, so the in-game server selecting interface thinks that they are not the same), which is a controlled test to make sure that the versions used on both are identical. Connecting to the Bridgewater-Brunel server with either MinGW Windows or Linux clients will result in a desync very quickly. Both MinGW Windows and Linux clients can connect to one another without difficulty, as noted in edit 4 above.
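
On the shortened revision numbers: two abbreviations of the same commit hash that differ only in length share a common prefix, so one way to treat them as equal is to compare only as many characters as the shorter one has. The sketch below is purely illustrative; same_revision is a hypothetical helper, not the check that the server selection interface actually performs, and the example hashes are just abbreviations of commits quoted earlier in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Treat two abbreviated commit hashes as the same revision if the shorter
// one is a prefix of the longer one. Empty strings never match.
bool same_revision(const std::string& a, const std::string& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    return n > 0 && a.compare(0, n, b, 0, n) == 0;
}
```

For example, same_revision("37569a4", "37569a411") is true, while same_revision("37569a4", "f49e8a11d") is false. In practice a minimum prefix length would be sensible, since very short prefixes can match unrelated commits.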

Edit 6: Even making sure that the pakset is identical on both client and server, the desync still occurs. I am wondering whether there might be some issue with the simuconf.tab file - but it is hard to see how this could be when all of the relevant settings are saved.

Edit 7: I have updated the simuconf.tab file on the server to match the latest on Github (with the exception of some server specific settings, which were retained from the old simuconf.tab and which are necessary). I cannot get the MinGW client to load any saved game without crashing at present, for reasons that are extremely unclear. However, the Linux client was able to remain connected for about 5-10 minutes before desyncing.

Edit 8: Trying again, and the Linux machine still desyncs from the Bridgewater-Brunel server after a few minutes. This is extremely perplexing, as this is the exact same version of both binary and pakset used on my own Linux computer on my local network, which remains connected stably to a Windows build.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on February 05, 2017, 10:58:52 AM
If I understand correctly, if you connect over loopback or LAN it is OK, but connection over Internet, to both servers desyncs within minutes?

Is it possible that different versions of shared libraries (libc) could cause it? I mean, e.g., inconsistent behaviour of the PRNG in different versions?
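
As background to the PRNG question: the sequence produced by the C library's rand() is implementation-defined, so it can differ between platforms and library versions, whereas a generator whose algorithm is fully specified produces the same sequence everywhere. Whether the game relies on the C library's generator at all is not established in this thread, so the following is only a sketch of the general point. It uses std::mt19937, whose output the C++ standard fixes exactly (the 10000th output of a default-seeded engine must be 4123659995); nth_output is a made-up helper name.

```cpp
#include <cstdint>
#include <random>

// Return the n-th output (1-based) of a Mersenne Twister seeded with `seed`.
// The algorithm is fixed by the C++ standard, so the result does not depend
// on the platform or on the version of the standard library.
std::uint32_t nth_output(std::uint32_t seed, unsigned long long n)
{
    std::mt19937 engine(seed);
    if (n > 1) {
        engine.discard(n - 1); // skip the first n-1 outputs
    }
    return static_cast<std::uint32_t>(engine());
}
```

Note that this guarantee covers the engine only: distributions such as std::uniform_int_distribution are not required to give identical results across implementations, so code that must stay in sync between different builds would need to derive ranges from the raw engine output itself.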



Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 05, 2017, 01:33:05 PM
There are a number of different possibilities, and I doubt that I know enough to enumerate all of them at this stage. So far, I have only tried connecting from my local computers to the Bridgewater-Brunel server, so it is possible that some specific feature of that server causes problems. When I changed the simuconf.tab file, the desync took a lot longer to occur. One thing that I have not tried, which I should try, is using the binary on my Linux NUC compiled on the Bridgewater-Brunel server, rather than one locally compiled, although I doubt that I have different versions of libraries on each.

Edit: I have now tested this (i.e. using the server compiled binary on my own Linux machine at home to connect to the server): it still seems to desync, but only after a long time (I do not know exactly how long unfortunately because my Linux computer uses the same monitor as my main computer, on which I spent a long time adjusting the height of single decker 'buses, after finishing which I discovered that it had desynchronised), but it lasted tens of minutes before desyncing this time. It is not clear why it is stable overnight on a local network but not over the internet, however, and it is exceptionally difficult to find any satisfactory way of discovering what the problem is here.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on February 05, 2017, 08:43:32 PM
I was just now playing with the copy of bridgewater brunel - resolved some blockages, upgraded a few tracks, and it was quite stable. One server crash, probably due to running out of memory, and one or two desyncs - when I was fiddling with schedules.

In the debug output of the server there are a lot of messages like this:

ERROR: void convoi_t::laden():  Trying to load at halt City of Westminster Doll Street Stop when not at a halt

I resolved some of these by selling road vehicles or sending them to depot.

Also, occasionally passengers are delivered to a factory which was deleted in the meantime...

Is it possible that these things may affect the stability or desyncs?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 05, 2017, 09:35:27 PM
These things ought not to affect desyncs, especially as the exact same map with the exact same executable does not desync over a local network. My Linux computer is currently connected by wifi - I can try again sometime next week connected with a wired connexion, but given that it works fine over my local network with wifi, this is unlikely to be the problem. One possibility is that these desyncs are no longer checklist mismatch desyncs, but are caused by the client running ahead of the server, but I would need to look at log outputs for that, which I have not had a chance to do yet.

One thing that I am trying is running the binary downloaded from the server on the loopback interface (previously when trying this, I used a locally compiled binary). However, leaving this to run for about one game month when filling in my VAT return it seems to be stable and has not disconnected.

Edit: I tried running a MinGW build connected to the server overnight: it ran for a while (at least half an hour, before I went to bed), but I found that it had crashed in the morning - presumably from the load/save bug affecting only MinGW builds that I discuss in the cross-compiling thread, so I do not know for how long that it ran in total, but it seems to have stayed connected for a decent period of time at least until the next load/save cycle on the server (which would be at least an hour, I think).

Edit 2: I have now managed to get cross-compiling (mostly: with the exception of makeobj) to work: see here (http://bridgewater-brunel.me.uk/downloads/nightly/windows/) for the binaries. This does stay in sync with the Bridgewater-Brunel server - until somebody else connects, when it will desync shortly after being unpaused. I cannot, for reasons that I do not understand, reproduce this using the Visual Studio build with the debug interface.

Would anyone running a Linux build (please make sure that you have the latest executable from the Bridgewater-Brunel server this evening, or built yourself from devel-new-2 after this message was posted) be able to test this to see whether this also applies to Linux builds?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on February 08, 2017, 10:30:26 PM
Wow, this appears promising! Am now connected to the bridgewater-brunel game for more than 20 seconds using your crossover windows build! :D
Connecting with my own compiles still disconnects me within those 20 seconds last I checked.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 08, 2017, 11:24:38 PM
Further testing reveals as follows: even with four clients connected to a server on the loopback interface, this latest desync issue does not occur with Visual Studio builds for both client and server. Windows MinGW builds seem to behave in exactly the same way as Linux cross-compiled MinGW builds: that is, any clients already connected to the server when the server re-saves the game for a joining client will, after a period of time, desync from the server, while the latest connected client will stay in sync.

What I do not know yet is whether this same behaviour can be replicated with a native Linux version of the clients. I did notice, however, that for the MinGW builds, it sometimes took longer to manifest, the older clients desyncing quite a few minutes after the new one connected.

It is still very unclear why this is happening.

Edit: Re-testing with 5 Visual Studio clients all with release (rather than, as before, debug) builds confirms that this problem cannot be reproduced with Visual Studio clients and a Visual Studio server.

Edit 2: In further testing, there has now been one instance in which a Visual Studio build has desynced after another Visual Studio client has joined, but this appears to be extremely rare on Visual Studio builds and almost universal on MinGW builds.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: prissi on February 09, 2017, 06:24:39 AM
I would look for uninitialised variables or memory then. With MSVC, freshly allocated memory in non-debug builds is usually initialised with zero (and 0xEE for debug builds), while it tends to be more random with GCC.
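
As a reminder of what explicit initialisation looks like in C++ (a generic sketch, not code from Simutrans; the names are invented for illustration): a plain "new T[n]" of a built-in type leaves the elements indeterminate, so what they contain depends on the allocator and build, whereas "new T[n]()" value-initialises them to zero on every compiler. Default member initialisers do the same job for struct members.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical statistics record: the default member initialisers guarantee
// defined starting values regardless of compiler or build type.
struct stats_sketch_t {
    std::uint32_t waiting = 0;
    bool overcrowded = false;
};

// Allocate an array whose elements are guaranteed to start at zero.
// The trailing () value-initialises the array; without it, the contents
// would be indeterminate and could differ between builds.
std::unique_ptr<std::uint32_t[]> make_zeroed(std::size_t n)
{
    return std::unique_ptr<std::uint32_t[]>(new std::uint32_t[n]());
}
```

If some value that feeds into the simulation is read before being written, the zero-filled case and the "more random" case would compute different results, which would fit a desync that depends on the compiler used.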
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 09, 2017, 12:09:23 PM
I think that Visual Studio spots uninitialised variables automatically when running in debug mode (it breaks the program and throws a "runtime check" error), and I think that Dr. Memory finds this sort of error, too.

Edit: When running Dr. Memory, I seem again to have the unaccountable problem in which it will crash immediately upon loading, which crash cannot be reproduced when Dr. Memory is not running. Here is the output from the cross-compiled version:

Code: [Select]
Dr. Memory version 1.11.0 build 2 built on Aug 29 2016 02:42:07
Dr. Memory results for pid 12140: "simutrans-experimental-cross-compile.exe"
Application cmdline: "C:\Users\James\Documents\Development\Simutrans\simutrans-experimental-sources\simutrans\simutrans-experimental-cross-compile.exe"
Recorded 115 suppression(s) from default C:\Program Files (x86)\Dr. Memory\bin\suppress-default.txt

Error #1: UNADDRESSABLE ACCESS beyond heap bounds: reading 0x0ec0be2c-0x0ec0be30 4 byte(s)
# 0 gebaeude_t::get_visitor_demand                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 1 fabrik_t::update_scaled_pax_demand               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 fabrik_t::set_base_production                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 karte_t::load                                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 simu_main                                        [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 sysmain                                          [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 __tmainCRTStartup                                [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 7 KERNEL32.dll!BaseThreadInitThunk                +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)
Note: @0:00:31.053 in thread 9316
Note: refers to 4 byte(s) beyond last valid byte in prior malloc
Note: prev lower malloc:  0x0ec0be18-0x0ec0be28
Note: allocated here:
Note: # 0 replace_operator_new_array                 [d:\drmemory_package\common\alloc_replace.c:2928]
Note: # 1 fabrik_t::get_tile_list                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 2 _Unwind_SjLj_Unregister                    [/root/bzip2-1.0.6/blocksort.c:1086]
Note: # 3 _Unwind_SjLj_Unregister                    [/root/bzip2-1.0.6/blocksort.c:1086]
Note: # 4 haltestelle_t::add_grund                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 5 haltestelle_t::rdwr                        [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 6 haltestelle_t::haltestelle_t               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 7 haltestelle_t::create                      [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 8 karte_t::load                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 9 karte_t::load                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: #10 simu_main                                  [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: #11 sysmain                                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: instruction: mov    0x14(%ecx) -> %eax

Error #2: UNADDRESSABLE ACCESS: reading 0xf1fdf104-0xf1fdf108 4 byte(s)
# 0 gebaeude_t::get_visitor_demand                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 1 fabrik_t::update_scaled_pax_demand               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 fabrik_t::set_base_production                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 karte_t::load                                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 simu_main                                        [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 sysmain                                          [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 __tmainCRTStartup                                [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 7 KERNEL32.dll!BaseThreadInitThunk                +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)
Note: @0:00:31.119 in thread 9316
Note: instruction: mov    0x04(%eax) -> %eax

Error #3: LEAK 0 direct bytes 0x0372a368-0x0372a368 + 0 indirect bytes
# 0 replace_operator_new_array                  [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 pthreadGC2-w32.dll!?                       +0x0      (0x65103b48 <pthreadGC2-w32.dll+0x3b48>)
# 2 freelist_t::gimme_node                      [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 imagelist_reader_t::read_node               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 obj_reader_t::read_file                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 9 loadingscreen_t::set_progress               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#10 simu_main                                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#11 sysmain                                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]

Error #4: LEAK 67 direct bytes 0x0372b978-0x0372b9bb + 0 indirect bytes
# 0 replace_operator_new                                  [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 std::__cxx11::basic_string<>::_M_create               [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 std::__cxx11::basic_string<>::_M_mutate               [/root/bzip2-1.0.6/blocksort.c:1086]
# 3 std::__cxx11::basic_string<>::_M_append               [/root/bzip2-1.0.6/blocksort.c:1086]
# 4 sysmain                                               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 __tmainCRTStartup                                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 6 KERNEL32.dll!BaseThreadInitThunk                     +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #5: LEAK 3 direct bytes 0x03738258-0x0373825b + 0 indirect bytes
# 0 replace_malloc                     [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 gzread                             [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 loadsave_t::read                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 loadsave_t::rdwr_str               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 KERNELBASE.dll!ReadFile           +0x117    (0x764dde95 <KERNELBASE.dll+0xde95>)
# 5 KERNELBASE.dll!ReadFile           +0x169    (0x764ddee7 <KERNELBASE.dll+0xdee7>)
# 6 KERNELBASE.dll!CreateSemaphoreExW +0x77     (0x764e0fe9 <KERNELBASE.dll+0x10fe9>)
# 7 KERNELBASE.dll!ReadFile           +0x169    (0x764ddee7 <KERNELBASE.dll+0xdee7>)
# 8 KERNEL32.dll!ReadFile             +0x53     (0x75c73ec7 <KERNEL32.dll+0x13ec7>)
# 9 KERNEL32.dll!ReadFile             +0x58     (0x75c73ecc <KERNEL32.dll+0x13ecc>)
#10 KERNEL32.dll!ReadFile             +0x58     (0x75c73ecc <KERNEL32.dll+0x13ecc>)
#11 msvcrt.dll!_read_nolock

Error #6: POSSIBLE LEAK 8 direct bytes 0x03738500-0x03738508 + 0 indirect bytes
# 0 replace_malloc                      [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 loadsave_t::start_tag               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 simu_main                           [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 sysmain                             [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 __tmainCRTStartup                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 5 KERNEL32.dll!BaseThreadInitThunk   +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #7: LEAK 0 direct bytes 0x0376c4a0-0x0376c4a0 + 0 indirect bytes
# 0 replace_operator_new_array                  [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 image_reader_t::read_node                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 msvcrt.dll!_unlock   
# 3 msvcrt.dll!_unlock_file
# 4 msvcrt.dll!fread_s   
# 5 msvcrt.dll!fread_s   
# 6 msvcrt.dll!_unlock_file
# 7 msvcrt.dll!fread     
# 8 read_node_info                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 9 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#10 imagelist_reader_t::read_node               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#11 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]

Error #8: LEAK 71 direct bytes 0x037954e0-0x03795527 + 0 indirect bytes
# 0 replace_operator_new_array                [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 pthreadGC2-w32.dll!?                     +0x0      (0x6510332c <pthreadGC2-w32.dll+0x332c>)
# 2 freelist_t::gimme_node                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 _Unwind_SjLj_Register                     [/root/bzip2-1.0.6/blocksort.c:1086]
# 4 _Unwind_SjLj_Unregister                   [/root/bzip2-1.0.6/blocksort.c:1086]
# 5 savegame_frame_t::fill_list               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 ask_objfilename                           [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 sysmain                                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 __tmainCRTStartup                         [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 9 KERNEL32.dll!BaseThreadInitThunk         +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #9: LEAK 28 direct bytes 0x037cc230-0x037cc24c + 0 indirect bytes
# 0 replace_operator_new                              [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 _Unwind_SjLj_Register                             [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 3 image_t::copy_image                               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 5 image_t::copy_rotate                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 create_alpha_tile                                 [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 ground_desc_t::init_ground_textures               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 WINMM.dll!midiOutGetNumDevs                      +0x7c5    (0x74926144 <WINMM.dll+0x6144>)
# 9 WINMM.dll!midiOutGetNumDevs                      +0xa54    (0x749263d3 <WINMM.dll+0x63d3>)
#10 KERNELBASE.dll!FindCloseChangeNotification       +0x68     (0x764ea353 <KERNELBASE.dll+0x1a353>)
#11 KERNELBASE.dll!FindFirstFileExW                  +0x531    (0x764eab95 <KERNELBASE.dll+0x1ab95>)

Error #10: LEAK 28 direct bytes 0x037cc738-0x037cc754 + 0 indirect bytes
# 0 replace_operator_new                              [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 _Unwind_SjLj_Register                             [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 3 image_t::copy_image                               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 5 image_t::copy_rotate                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 create_alpha_tile                                 [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 ground_desc_t::init_ground_textures               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 WINMM.dll!midiOutGetNumDevs                      +0x7c5    (0x74926144 <WINMM.dll+0x6144>)
# 9 WINMM.dll!midiOutGetNumDevs                      +0xa54    (0x749263d3 <WINMM.dll+0x63d3>)
#10 KERNELBASE.dll!FindCloseChangeNotification       +0x68     (0x764ea353 <KERNELBASE.dll+0x1a353>)
#11 KERNELBASE.dll!FindFirstFileExW                  +0x531    (0x764eab95 <KERNELBASE.dll+0x1ab95>)

Error #11: LEAK 0 direct bytes 0x037e32d8-0x037e32d8 + 0 indirect bytes
# 0 replace_operator_new_array                  [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 image_reader_t::read_node                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 imagelist_reader_t::read_node               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 obj_reader_t::read_file                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 loadingscreen_t::set_progress               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 9 simu_main                                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#10 sysmain                                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#11 __tmainCRTStartup                           [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]

Reached maximum leak report limit (-report_leak_max). No further leaks will be reported.

===========================================================================
FINAL SUMMARY:

DUPLICATE ERROR COUNTS:
    Error #   3:      2
    Error #   6:      2
    Error #   9:      9
    Error #  10:      9

SUPPRESSIONS USED:

ERRORS FOUND:
      2 unique,     2 total unaddressable access(es)
      0 unique,     0 total invalid heap argument(s)
      0 unique,     0 total GDI usage error(s)
      0 unique,     0 total warning(s)
      8 unique,    25 total,    645 byte(s) of leak(s)
      1 unique,     2 total,   2508 byte(s) of possible leak(s)
ERRORS IGNORED:
     24 potential leak(s) (suspected false positives)
         (details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental-cross-compile.exe.12140.000\potential_errors.txt)
    924 unique,  9709 total, 2307668 byte(s) of still-reachable allocation(s)
         (re-run with "-show_reachable" for details)
  425493 leak(s) beyond -report_leak_max
Details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental-cross-compile.exe.12140.000\results.txt

Edit: Apparently, the output from the Visual Studio debug version (which is very similar) is too long to fit into one message.

However, I have managed to get it to run by deleting demo.sve.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 09, 2017, 12:34:20 PM
Apologies for double-posting, but I cannot fit two Dr. Memory outputs into one post, it seems.

Here is the output from running without having loaded demo.sve on the same saved game as the server uses, using the cross-compiled build:

Code: [Select]
Dr. Memory version 1.11.0 build 2 built on Aug 29 2016 02:42:07
Dr. Memory results for pid 9372: "simutrans-experimental.exe"
Application cmdline: "C:\msys32\home\James\simutrans\simutrans-experimental\build\default\simutrans-experimental.exe"
Recorded 115 suppression(s) from default C:\Program Files (x86)\Dr. Memory\bin\suppress-default.txt

Error #1: LEAK 72 direct bytes 0x03973de8-0x03973e30 + 0 indirect bytes
# 0 replace_RtlAllocateHeap                  [d:\drmemory_package\common\alloc_replace.c:3770]
# 1 ntdll.dll!RtlDosSearchPath_Ustr         +0x385    (0x77282bf3 <ntdll.dll+0x52bf3>)
# 2 replace_native_xfer_target               [d:\drmemory_package\dynamorio\ext\drwrap\drwrap.c:1560]
# 3 ntdll.dll!RtlUniform                    +0x27     (0x7727bec1 <ntdll.dll+0x4bec1>)
# 4 NULL.dll!?                              +0x0      (0x2ae593de <NULL.dll+0x93de>)
# 5 SHELL32.dll!SHGetDiskFreeSpaceExW       +0x181    (0x74e5a804 <SHELL32.dll+0xaa804>)
# 6 SHELL32.dll!SHRestricted                +0x5af9   (0x74e34abd <SHELL32.dll+0x84abd>)
# 7 SHELL32.dll!SHRestricted                +0x5c1a   (0x74e34bde <SHELL32.dll+0x84bde>)
# 8 SHELL32.dll!SHRestricted                +0x5b78   (0x74e34b3c <SHELL32.dll+0x84b3c>)
# 9 SHELL32.dll!SHRestricted                +0x5cfe   (0x74e34cc2 <SHELL32.dll+0x84cc2>)
#10 SHELL32.dll!SHGetDiskFreeSpaceExW       +0x23d1   (0x74e5ca54 <SHELL32.dll+0xaca54>)
#11 SHELL32.dll!SHGetFolderPathEx           +0x2c     (0x74e35555 <SHELL32.dll+0x85555>)

Error #2: LEAK 16 direct bytes 0x04982bc0-0x04982bd0 + 96 indirect bytes
# 0 replace_operator_new                   [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 __pformat_int.isra.0                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_pformat.c:780]
# 2 msvcrt.dll!_getptd_noexit
# 3 msvcrt.dll!_getptd_noexit
# 4 tabfileobj_t::get_string               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 5 stadt_t::cityrules_init                [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 simu_main                              [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 sysmain                                [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 8 __tmainCRTStartup                      [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:334]
# 9 KERNEL32.dll!BaseThreadInitThunk      +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #3: LEAK 3 direct bytes 0x049986e0-0x049986e3 + 0 indirect bytes
# 0 replace_malloc                    [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 gz_read                           [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
# 2 gzread                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
# 3 loadsave_t::read                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 4 gzread                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
# 5 simu_main                         [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 sysmain                           [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 __tmainCRTStartup                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:334]
# 8 KERNEL32.dll!BaseThreadInitThunk +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #4: POSSIBLE LEAK 8 direct bytes 0x04998c88-0x04998c90 + 0 indirect bytes
# 0 replace_malloc                       [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 pthread_mutex_lock                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 2 rwlock_static_init                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 3 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 4 rwlock_gain_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 5 rwlock_free_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 __pthread_self_lite                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 8 pthread_setspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 9 pthread_getspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
#10 __emutls_get_address                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
#11 loadsave_t::rdwr_bool                [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]

Error #5: POSSIBLE LEAK 2500 direct bytes 0x04998cc8-0x0499968c + 0 indirect bytes
# 0 replace_malloc                       [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 pthread_mutex_lock                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 2 rwlock_static_init                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 3 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 4 rwlock_gain_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 5 rwlock_free_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 __pthread_self_lite                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 8 pthread_getspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 9 __emutls_get_address                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
#10 pthread_getspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
#11 MTgenerate                           [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]

Error #6: LEAK 16 direct bytes 0x049bdbf8-0x049bdc08 + 6 indirect bytes
# 0 replace_operator_new                      [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 __pformat_int.isra.0                      [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_pformat.c:780]
# 2 msvcrt.dll!_getptd_noexit
# 3 msvcrt.dll!_getptd_noexit
# 4 msvcrt.dll!_strtoui64_l
# 5 msvcrt.dll!_chdir   
# 6 NULL.dll!?                               +0x0      (0x2ae59480 <NULL.dll+0x9480>)
# 7 KERNELBASE.dll!FindCloseChangeNotification+0x68     (0x764ea353 <KERNELBASE.dll+0x1a353>)
# 8 KERNELBASE.dll!FindFirstFileExW          +0x531    (0x764eab95 <KERNELBASE.dll+0x1ab95>)
# 9 KERNELBASE.dll!SetErrorMode              +0x36     (0x764d7577 <KERNELBASE.dll+0x7577>)
#10 KERNEL32.dll!QueryActCtxW                +0x660    (0x75c7d163 <KERNEL32.dll+0x1d163>)
#11 KERNEL32.dll!GetDriveTypeW               +0x5a     (0x75c74186 <KERNEL32.dll+0x14186>)

Reached maximum leak report limit (-report_leak_max). No further leaks will be reported.

===========================================================================
FINAL SUMMARY:

DUPLICATE ERROR COUNTS:
   Error #   6:      3

SUPPRESSIONS USED:

ERRORS FOUND:
      0 unique,     0 total unaddressable access(es)
      0 unique,     0 total invalid heap argument(s)
      0 unique,     0 total GDI usage error(s)
      0 unique,     0 total warning(s)
      4 unique,     6 total,    343 byte(s) of leak(s)
      2 unique,     2 total,   2508 byte(s) of possible leak(s)
ERRORS IGNORED:
     89 potential leak(s) (suspected false positives)
         (details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental.exe.9372.000\potential_errors.txt)
   1146 unique,  9775 total, 2717758 byte(s) of still-reachable allocation(s)
         (re-run with "-show_reachable" for details)
  434940 leak(s) beyond -report_leak_max
Details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental.exe.9372.000\results.txt

All of these issues appear to be memory leaks, all of them apparently originating in code unchanged from Standard.

Edit: In case the problems loading demo.sve were silently causing unseen problems, I tried connecting to the Bridgewater-Brunel server with two clients which had not started by loading demo.sve, but the same failure state (the original client lost synchronisation shortly after the new one connected) was observed.

Edit 2: The desync on joining issue cannot be reproduced on three MinGW builds on the loopback interface built without multi-threading, so this must be an issue relating to multi-threading somewhere.

Edit 3: Re-enabling multi-threading, but disabling multi-threading of the convoys (using the -DFORBID_MULTI_THREAD_CONVOYS preprocessor directive), produces an executable with the same desync on another player connecting as the fully multi-threaded build.

Edit 4: Additionally defining -DFORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE has no effect on this behaviour.

Edit 5: Additionally defining -DFORBID_MULTI_THREAD_PATH_EXPLORER makes no difference: the desync on another player joining still occurs (and quickly).

Edit 6: Enabling assume_everywhere_connected_by_road (which has the effect of disabling the multi-threaded private car route finder, the only other multi-threaded part of the program to use the route finder, route_t) has no effect: the desync on connexion still occurs.

Edit 7: Additionally defining -DFORBID_MULTI_THREAD_ROUTE_UNRESERVER (which disables the multi-threaded route reservation) makes no difference to the player joining desync. This is significant, as this means - bizarrely - that this error is caused by some part of the multi-threading code that is not specific to Experimental (which has additional multi-threading to Standard for convoy routing, passenger generation, the path explorer, private car route finding and the block unreserver). Only the loading/saving and the graphics are multi-threaded in Standard. The graphics are very unlikely to make a difference here, so there must be some issue with the multi-threaded loading/saving routine (which is also present in Standard; but Standard does not seem to have this problem, which I do not really understand).
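
For readers unfamiliar with this kind of elimination test, here is a small sketch of how a compile-time switch such as -DFORBID_MULTI_THREAD_CONVOYS can force one subsystem back onto a single thread. The macro names MULTI_THREAD and FORBID_MULTI_THREAD_CONVOYS appear in this thread; how the real code consumes them is an assumption, and `convoys_threaded` and `convoy_worker_count` are hypothetical names.

```cpp
// Assumption: the FORBID_* macro only matters when MULTI_THREAD is defined.
#if defined(MULTI_THREAD) && !defined(FORBID_MULTI_THREAD_CONVOYS)
constexpr bool convoys_threaded = true;
#else
constexpr bool convoys_threaded = false;
#endif

// Hypothetical helper: number of worker threads the convoy code may use.
constexpr unsigned convoy_worker_count(unsigned available_threads) {
    return convoys_threaded && available_threads > 1 ? available_threads : 1u;
}
```

Building once with and once without the -D flag yields two executables that differ only in this subsystem, which is what makes the flag-by-flag elimination above meaningful.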
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: prissi on February 10, 2017, 02:59:28 AM
How often do you check for sync in your debug builds? But if the desync happens after saving and reloading, then that means your saving routines do not save the entire state. It would be very easy to compare the just-transferred game (maybe after setting the format to xml_zipped) with the locally cached savegame of the first client. If these are not identical, find the differences (just switch off pedestrians first, because they are allowed to differ within reason; the same goes for smoke).

The crash you are showing is likely in the multithreaded loadsave routines. If these are identical to Standard, then the error is likely occurring during the obj_t::obj_t(file) constructors.

If this error is indeed in the multithreaded loading, then it might be a clash between a single-threaded stdlib for bzlib and a multithreaded stdlib for the rest. Or your library has some issues with semaphores (i.e. simthread does not work as intended, for whatever reason).
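
The savegame comparison suggested above could be sketched as follows, assuming both games have already been written out as uncompressed text (for example XML). The function names are hypothetical and nothing here uses Simutrans code; it simply reports the first line at which two text dumps diverge.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Returns the 1-based number of the first differing line, or 0 if the streams match.
std::size_t first_difference(std::istream& a, std::istream& b) {
    std::string line_a, line_b;
    std::size_t line_no = 0;
    for (;;) {
        const bool got_a = static_cast<bool>(std::getline(a, line_a));
        const bool got_b = static_cast<bool>(std::getline(b, line_b));
        ++line_no;
        if (!got_a && !got_b) return 0;      // both ended together: identical
        if (got_a != got_b) return line_no;  // one dump is longer than the other
        if (line_a != line_b) return line_no;
    }
}

// Convenience wrapper for in-memory text.
std::size_t first_difference_in_text(const std::string& a, const std::string& b) {
    std::istringstream sa(a), sb(b);
    return first_difference(sa, sb);
}

// Small usage example.
void example_first_difference() {
    assert(first_difference_in_text("city\nhalt 1\n", "city\nhalt 2\n") == 2);
}
```

With real files, two `std::ifstream` objects can be passed to `first_difference` directly; the reported line number then points at the first piece of state that the server and the client disagree on.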
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 10, 2017, 11:35:25 AM
Thank you for that: that is helpful. I will have to look into each of these issues carefully and will report progress on this thread.

Edit: Updating the code in loadsave_t to the latest code from Standard does not seem to fix this issue.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: TurfIt on February 10, 2017, 09:34:59 PM
so there must be some issue with the multi-threaded loading/saving routine
That seems improbable, at least for the actual loadsave routine. Missing mutexes to allow the multithreaded laden_abschliessen(), on the other hand...
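
As a generic illustration of the kind of mutex meant here (not the actual laden_abschliessen() code; every name below is hypothetical), this sketch has several worker threads finish slices of objects and fold their results into shared state under a lock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for state shared by all objects while loading is finished.
struct shared_totals_t {
    std::mutex guard;
    long long sum = 0;
};

// Finishes objects[begin, end) and records the result under the lock.
void finish_slice(const std::vector<int>& objects, std::size_t begin,
                  std::size_t end, shared_totals_t& totals) {
    long long local = 0;
    for (std::size_t i = begin; i < end; ++i) {
        local += objects[i];
    }
    std::lock_guard<std::mutex> lock(totals.guard);  // without this, the update is a data race
    totals.sum += local;
}

// Splits the objects into roughly equal slices, one per worker thread.
long long finish_all(const std::vector<int>& objects, unsigned thread_count) {
    assert(thread_count > 0);
    shared_totals_t totals;
    std::vector<std::thread> workers;
    const std::size_t chunk = (objects.size() + thread_count - 1) / thread_count;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(objects.size(), t * chunk);
        const std::size_t end = std::min(objects.size(), begin + chunk);
        workers.emplace_back([&objects, &totals, begin, end] {
            finish_slice(objects, begin, end, totals);
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return totals.sum;
}
```

The lock only removes the data race. The result here is the same on every machine because integer addition does not depend on the order in which the threads finish; shared state whose value depends on that order (for example, appending to a list) would need more than a mutex to stay in sync across clients.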
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on February 10, 2017, 11:52:19 PM
Testing on my Linux computer, I can confirm that the issue can be reproduced on a pure Linux 64-bit build, so the problem is not specific to MinGW.

Something very odd is happening: this problem occurs with GCC builds, GCC builds will not stay in sync with Visual Studio builds, and yet the problem does not occur with Visual Studio builds.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on March 25, 2017, 12:14:51 AM
For reasons entirely beyond explanation at this juncture, this problem appears to have become worse: now, the MinGW build will not stay in sync with the Linux build even if no other clients are connected, whereas a Linux build will stay in sync with a Linux build and a MinGW build will stay in sync with a MinGW build. This is a new manifestation of this issue since I last tested (apparently in early February), but it is entirely unclear what sort of thing could cause this behaviour and how this change from the previous less faulty behaviour could have occurred.

Fixing this is likely to take many weeks if not months, during which time I will not be able to work on other issues. If anyone has any thoughts as to how best to track this down, that would be much appreciated, as it could potentially save a gargantuan amount of time.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: A.Badger on March 27, 2017, 05:08:49 AM
If you now have a setup where you can test reliably, and it seems like you can tell the previous behaviour from the current behaviour, I'd use git bisect to find out when things got much worse.  The last commit in January was 444aea695046e591bc37285fb037521a9ec948e0. git bisect tells me there are 81 commits between there and now, and it will take roughly 6 steps to narrow down to a single commit.
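
To show where the "roughly 6 steps" figure comes from, here is a sketch of a plain binary search over a linear list of commits. This is not git's own algorithm (git also handles branching history), and `bisect` and `bisect_result_t` are hypothetical names; the predicate stands in for "build this commit and check whether it desyncs".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

struct bisect_result_t {
    std::size_t first_bad;  // index of the first commit showing the new behaviour
    unsigned tests;         // number of commits that had to be built and tested
};

// Commits are numbered 0..count-1 in history order. The last one is known to be
// bad, and is_bad must be false for every commit before the first bad one.
bisect_result_t bisect(std::size_t count, const std::function<bool(std::size_t)>& is_bad) {
    assert(count > 0);
    std::size_t lo = 0;          // lowest index that could still be the first bad commit
    std::size_t hi = count - 1;  // known-bad commit
    unsigned tests = 0;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        ++tests;
        if (is_bad(mid)) {
            hi = mid;       // the culprit is mid or earlier
        } else {
            lo = mid + 1;   // the culprit is after mid
        }
    }
    return {lo, tests};
}
```

Each test halves the remaining range, so 81 candidates need about log2(81) ≈ 6.3 tests: six or seven builds in this sketch, rather than eighty-one.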

I don't know anything about Simutrans's wire protocol but if Linux <=> Linux and MinGW <=> MinGW works fine but Linux <=> MinGW does not I'd think that LP64 vs LLP64 might be the issue.  grep shows that there are some uses of signed/unsigned long in the code but I don't know if any of those variables end up being passed over the wire so I don't know if my guess is correct or not.
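
A short sketch of the LP64 vs LLP64 concern, with hypothetical function names: `long` is 64 bits wide on 64-bit Linux (LP64) but 32 bits wide on Windows, including MinGW builds (LLP64), so the same arithmetic can overflow on one platform and not on the other. Whether any such value actually affects the simulation state in Simutrans is not established here.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Width of `long` in this build: 64 on LP64 targets, 32 on LLP64 and 32-bit targets.
constexpr std::size_t long_bits = sizeof(long) * CHAR_BIT;

// The intermediate product wraps at 2^32 where `unsigned long` is 32 bits wide.
constexpr unsigned long scale_with_long(unsigned long value) {
    return value * 100000ul / 100000ul;
}

// A fixed-width type behaves identically under every data model.
constexpr std::uint64_t scale_fixed(std::uint64_t value) {
    return value * 100000u / 100000u;
}
```

For an input of 100000, `scale_fixed` returns 100000 everywhere, while `scale_with_long` returns 100000 where `long` is 64 bits and 14100 where it is 32 bits, because the intermediate product of 10^10 wraps modulo 2^32. A difference of that kind in any value that feeds the simulation would be enough to desynchronise a Linux server and a Windows client.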
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on March 27, 2017, 11:24:43 AM
Thank you for that. Almost all of the instances of unsigned long, and all of the instances of signed long, in the code are identical in Extended to Standard, and Standard works over the network without desynchronising in this way. There was one place, in path_explorer.cc, where I modified unsigned long to uint32 just now, but this part of the code has been unchanged for many years, during which time it has managed to stay in sync, so I doubt that that is the problem.

Running the bisection test that you suggest may be the only way to proceed, but it is likely to be very time consuming having to do it on two separate computers, so I may have to set aside a week-end just for that. If anyone else can assist by running this test in the meantime, I should be most grateful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on March 27, 2017, 07:26:05 PM
I'm afraid that this change (commit 7d09ea4d9d1f18a9f9b237f8a7167259eb8f4c6f) has broken compilation on linux:

Code: [Select]
===> CXX obj/leitung2.cc
g++ -std=gnu++11 -O -DNDEBUG -DMULTI_THREAD -DREVISION="7d09ea4" -Wall -W -Wcast-qual -Wpointer-arith -Wcast-align -DUSE_C -fno-delete-null-pointer-checks -fno-strict-aliasing  -I/usr/include/SDL -D_GNU_SOURCE=1 -D_REENTRANT -DCOLOUR_DEPTH=16 -c -MMD -o build/default/obj/leitung2.o obj/leitung2.cc
In file included from obj/../vehicle/../boden/grund.h:19:0,
                 from obj/../vehicle/../simplan.h:12,
                 from obj/../vehicle/../simworld.h:34,
                 from obj/../vehicle/simvehicle.h:18,
                 from obj/../vehicle/simroadtraffic.h:15,
                 from obj/../simcity.h:22,
                 from obj/leitung2.h:16,
                 from obj/leitung2.cc:17:
obj/../vehicle/../boden/wege/weg.h:60:41: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
  static const uint32 get_all_ways_count();
                                         ^
In file included from obj/leitung2.cc:23:0:
obj/../simfab.h:131:32: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
  const sint32 get_in_transit() const { return statistics[0][FAB_GOODS_TRANSIT]; }
                                ^
obj/leitung2.cc:395:44: warning: unused parameter ‘dummy’ [-Wunused-parameter]
 void leitung_t::info(cbuffer_t & buf, bool dummy) const
                                            ^
obj/leitung2.cc: In member function ‘virtual void leitung_t::rdwr(loadsave_t*)’:
obj/leitung2.cc:457:27: error: cast from ‘powernet_t*’ to ‘uint32 {aka unsigned int}’ loses precision [-fpermissive]
   value = (uint32)get_net(); //  This seems to be functionless, but should be preserved for compatibility. It likewise appears functionless in Standard.
                           ^
obj/leitung2.cc: At global scope:
obj/leitung2.cc:687:42: warning: unused parameter ‘dummy’ [-Wunused-parameter]
 void pumpe_t::info(cbuffer_t & buf, bool dummy) const
                                          ^
obj/leitung2.cc:1130:42: warning: unused parameter ‘dummy’ [-Wunused-parameter]
 void senke_t::info(cbuffer_t & buf, bool dummy) const
                                          ^
common.mk:50: recipe for target 'build/default/obj/leitung2.o' failed
make: *** [build/default/obj/leitung2.o] Error 1


Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on March 27, 2017, 08:05:44 PM
Ah - I think that I have pushed what amounts to a fix for this. Would you be able to re-test?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on March 27, 2017, 10:23:13 PM
fixed
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 12, 2017, 12:30:00 AM
Testing again with the latest nightly build from the server, this still stays in sync with identical builds. I have not fully re-tested for synchronisation between different builds, but it is useful to test this periodically just to make sure that nothing has broken network synchronisation in new ways. The problem still seems to be confined to mixing Linux/Windows builds. (I have briefly re-tested with the Bridgewater-Brunel server, and the current nightly MinGW build still desynchronises with that virtually instantly).
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 23, 2017, 10:21:00 PM
I am developing a new method to try to test the cause of this and other desyncs: I have just pushed a change to the code in which, if the preprocessor directive DISABLE_RANDOMNESS is defined, the random number generator will always simply return half of the maximum value, and the function for getting the random number generator seed will always return 1. I am in the process of rebuilding the Bridgewater-Brunel server with this option enabled. This means that it will not stay in sync with any normal client, but should be able to stay in sync with a client compiled with DISABLE_RANDOMNESS.

When a client with DISABLE_RANDOMNESS connects with the server, it should be possible to see in the game itself where the desyncs arise without this affecting lots of other unconnected areas of the game at random by changing the random number generator seed. Also, because the random number generator seed is fixed, the client should not be kicked from the game unless and until a more major difference (such as in the number of convoys) emerges. This system is intended just for testing (it would be no good in a real game, as there would be no randomness), but it should help to highlight where the problems are.
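
As an illustration of the idea only, the switch described above might look roughly like this. The names sketch_rand and sketch_get_seed are hypothetical stand-ins rather than the actual Simutrans functions, and the non-deterministic branch is just a placeholder:

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Assumption: the directive is defined at build time; it is defined here so the sketch is self-contained.
#define DISABLE_RANDOMNESS

// Placeholder generator, only used when randomness is enabled.
static std::mt19937 sketch_engine(12345u);

// Returns a value in the range [0, max).
uint32_t sketch_rand(uint32_t max)
{
#ifdef DISABLE_RANDOMNESS
    return max / 2; // always half of the maximum value
#else
    return max == 0 ? 0 : static_cast<uint32_t>(sketch_engine() % max);
#endif
}

// Returns the value that would be compared between server and client.
uint32_t sketch_get_seed()
{
#ifdef DISABLE_RANDOMNESS
    return 1; // fixed, so a seed comparison can never fail
#else
    return static_cast<uint32_t>(sketch_engine());
#endif
}
```

With the directive defined, sketch_rand(100) always gives 50 and sketch_get_seed() always gives 1, so any divergence that remains between two such builds cannot come from the random number generator.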

It would be very helpful if anyone with the ability to compile the game were to connect to the Bridgewater-Brunel server with two separate clients (both built with DISABLE_RANDOMNESS) to try to spot how, if at all, they diverge from one another. It might be hard to spot, as the game currently running on the server is a big map; if anyone can find a smaller, simpler map which will reliably desync with a normal build between Windows and Linux, it would be helpful to use that for testing, too.

Thank you all in advance for any help with this: it would be much appreciated.

Edit: Having now tested this briefly, the special DISABLE_RANDOMNESS build does appear to stay in sync, as expected. Any help in tracking down actual divergence would be very much appreciated.
 
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: prissi on April 23, 2017, 11:36:32 PM
The random number seed is not used in simutrans at all (unless experimental added some code using it). The internal random generator uses the Mersenne Twister algorithm and is seeded with a complex seed depending on the time in ms since 1970 and some mathematical operations.

If the differences are really compiler/machine dependent, then it may be quite well the rounding of a float/double to an integer, which can give different results on different compilers and machines.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 23, 2017, 11:41:42 PM
Is it not the random seed that is checked between server and client to test for a mismatch?

As to the floating point thing, this was a problem a long time ago, but all the floating point arithmetic other than in the GUI was removed circa 2011/2012 to solve this problem, and no further floating point arithmetic has been added since, so I do not think that it is this. Bernd Gabriel even wrote an elaborate class to simulate floating point arithmetic using integers to get around this.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on April 23, 2017, 11:56:21 PM
I have restarted all instances at server.exp.simutrans.com to commit 58181b37d85561c.... and DISABLE_RANDOMNESS.
However I had desync with both British pak games after 5-10 minutes. Pak sweden seems to be stable.
You are welcome to test them.

Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 24, 2017, 12:51:34 AM
Quote from: Vladki on April 23, 2017, 11:56:21 PM
I have restarted all instances at server.exp.simutrans.com to commit 58181b37d85561c.... and DISABLE_RANDOMNESS.
However I had desync with both British pak games after 5-10 minutes. Pak sweden seems to be stable.
You are welcome to test them.

Thank you - can you see if you can capture and post the debug output from the desync that you get? That would be most helpful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on April 24, 2017, 10:35:21 PM
Playing bridgewater (copy) for maybe an hour - without desync (linux/linux)
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 24, 2017, 11:38:01 PM
Thank you for checking that, although the linux/linux connexion was working before in any event. Can anyone running Windows connect to it, run it for a while, then after perhaps 5-10 minutes, connect a second client, and search for any differences in the map? That would be extremely helpful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on April 25, 2017, 06:21:03 AM
Just a quick note, if testing games on server.exp.simutrans.com, use the Pakset provided there. It has a few modifications.



Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 25, 2017, 12:03:01 PM
Interesting - may I ask what the modifications are?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on April 25, 2017, 08:47:15 PM
Removed sound from crossings (reported in another thread).



Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 25, 2017, 09:51:05 PM
Interesting - what effect have you found that that has had?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on April 26, 2017, 06:13:44 AM
It causes pakset mismatch when connecting to network game. http://forum.simutrans.com/index.php?topic=16996.0



Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 29, 2017, 10:51:38 PM
I think that I have fixed a bug that might have caused this (albeit this requires recompiling makeobj). Would you be able to test? I should be most grateful.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on April 30, 2017, 10:14:49 AM
Pakset mismatch fixed
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on April 30, 2017, 11:47:11 AM
Splendid, thank you for confirming.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on May 22, 2017, 12:40:32 AM
I have been doing some further testing on this issue to-day.

As readers of this thread may remember, there are essentially two separate problems: (1) the original problem of a GCC build immediately desynchronising with a Visual Studio build; and (2) a desync occurring, irrespective of the build, when more than one client connects to a server (the earlier connected clients desyncing after a delay).

A week or two ago (I cannot remember exactly when), I fixed a bug relating to the post-loading code for vehicles that had the potential to cause the second issue. I had not had time to test whether this did fix this issue at the time, however.

Recently, I have been working on the code for passenger and mail classes. In testing some of that code, I found a thread deadlock in the path explorer code. This turns out to have been caused, not by a bug in the new passenger and mail classes code, but by a pre-existing bug in the multi-threading path explorer code. Looking very carefully into the documentation for pthreads, it transpired that I had misunderstood the relationship between the pthread_cond_wait command and the mutex that it requires as a parameter. The existing path explorer multi-threading code is classed as having undefined behaviour by the pthreads standard as it calls a mutex lock multiple times in succession.
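
For reference, the usual pattern for pthread_cond_wait, as far as I understand the POSIX documentation, is sketched below. All of the names (work_state_t, worker, run_once) are hypothetical and this is not the path explorer code: the point is only that the mutex is locked exactly once before waiting, and that pthread_cond_wait itself releases the mutex while blocked and re-acquires it before returning.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical shared state for illustration only.
struct work_state_t {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool ready;
    int  result;
};

static void* worker(void* arg)
{
    work_state_t* s = static_cast<work_state_t*>(arg);
    pthread_mutex_lock(&s->mutex);          // lock once; do not lock again before waiting
    while (!s->ready) {                     // loop guards against spurious wake-ups
        // Releases the mutex while blocked; the mutex is held again on return.
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    s->result = 42;                         // still holding the mutex here
    pthread_mutex_unlock(&s->mutex);
    return nullptr;
}

int run_once()
{
    work_state_t s;
    pthread_mutex_init(&s.mutex, nullptr);
    pthread_cond_init(&s.cond, nullptr);
    s.ready = false;
    s.result = 0;

    pthread_t thread;
    pthread_create(&thread, nullptr, worker, &s);

    pthread_mutex_lock(&s.mutex);
    s.ready = true;                         // change the predicate under the mutex
    pthread_mutex_unlock(&s.mutex);
    pthread_cond_signal(&s.cond);

    pthread_join(thread, nullptr);
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.mutex);
    return s.result;
}
```

Because the predicate is checked in a loop while the mutex is held, it does not matter whether the signal arrives before or after the worker starts waiting.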

I have coded an initial attempt at a fix to the multi-threaded path explorer on this (https://github.com/jamespetts/simutrans-extended/tree/path-explorer-multi-threading-fix) dedicated branch.

However, testing shows that this code gives rise to a network desync when instances of the same build are connected on the loopback interface for testing, albeit only after some considerable time has lapsed. This does not occur when the path explorer multi-threading is disabled or on the master branch.

Meanwhile, testing on the master branch seems to show that the other problem (a desync occurring a short while after a second or subsequent client connects) seems to have been fixed, which I suspect is to do with the post-loading code fix to which I refer above.

I have now run out of time for further testing (as each individual test cycle requires running the whole thing for over an hour, fast forwarding not being possible in network mode) this week-end, but will look into refining the new code further. Because the current code has undefined behaviour, this is a prime suspect for inter-platform desyncs, so I am keen to fix this as soon as possible.

If anyone can spot any immediate problems in my new multi-threading code, I should be grateful for any feedback.

Edit: Some further testing seems to show that the multi-threaded passenger generation code seems to be responsible for a desync between a Visual Studio client and a GCC (Msys) client (both Windows): when this is disabled, the two will stay in sync for far longer than when this is enabled. I have not yet tested long enough to see whether it will stay in sync permanently, however.

Edit 2: With the (modified) path explorer multi-threading enabled but the passenger generation multi-threading disabled, it still desyncs between an Msys GCC client and the Visual Studio client, but only after a very long time; the same sort of time as it takes to desync between two Visual Studio clients with the new path explorer multi-threading algorithm. This suggests that the passenger generation multi-threading may well be responsible somehow for the desync between differently compiled versions of Extended.

Edit 3: I have now run a long-term test connecting a single Msys/GCC compiled client to a Visual Studio compiled server all day to-day with the passenger generation multi-threading disabled and the (existing) path explorer multi-threading enabled, and the two are still in sync even now. I have also had an answer on Stack Exchange that might help to explain the problem that I have been having with the new path explorer multi-threading code.

Edit 4: Using the 2010 edition of Rollmaterial's map, in order to use a more challenging test, with the latest code on the passenger-generation-multi-threading-fix (currently, just minor updates to the path explorer multi-threading from the master branch, and disabling the passenger generation multi-threading entirely), a Visual Studio client will stay in sync with another Visual Studio client for longer than I have so far measured, but an Msys/GCC client, connected second, will desync after approximately one game hour (with no interaction).

Edit 5: Using the mutex error checking, no mutex errors can be found running the britain-2010 map in the passenger generation multi-threading.
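
Assuming that "mutex error checking" here means the POSIX error-checking mutex type (that is my reading, not something confirmed above), the following minimal sketch shows what it detects. The function name relock_errorcheck_mutex is hypothetical. A second lock from the same thread, which is undefined behaviour with a default mutex, is reported as an error instead:

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Locks an error-checking mutex twice from the same thread and
// returns the result of the second attempt.
int relock_errorcheck_mutex()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    const int first  = pthread_mutex_lock(&mutex); // expected to succeed
    const int second = pthread_mutex_lock(&mutex); // same thread again: reported, not a deadlock

    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return first == 0 ? second : first;
}
```

Checking the return value of every lock and unlock call against 0 while running a test map is one way of confirming that no such double lock occurs.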

Edit 6: Repeating the test from edit 3 using the saved game from edit 4 produces a desync, but only after running for about one game month.

Edit 7: Repeating the test from edit 6 with the path explorer multi-threading disabled produces a desync after a long time, but slightly short of a game month (i.e. before the crossing of a month boundary since loading the game).

Edit 8: Very oddly indeed, when testing with FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined, I get a near-instant desync between a GCC/Msys and Visual Studio client, although two Visual Studio clients will happily stay in sync. The difference between FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE and FORBID_PARALLELL_PASSENGER_GENERATION is that the former uses the old code for single-threaded passenger generation in the main thread, whereas the latter (i.e., the one that works) uses a separate thread for the passenger generation, but only actually runs the passenger generation on a single thread rather than all of the passenger generation threads. This is very bizarre, as it suggests that there is an error with the single threaded passenger generation code that is not present in the multi-threaded passenger generation code when it is restricted to running with only one thread.

Edit 9: Even running entirely single-threadedly, the GCC/Msys build will desync from the Visual Studio build in seconds with the britain-2010 saved game. It seems that only when the passenger generation multi-threading is operational but it is set to use only one of the threads does this work (for a while) without desyncing. This is extremely odd.

Edit 10: Defining FIXED_PASSENGER_NUMBERS_PER_STEP_FOR_TESTING with the passenger generation multi-threading fully enabled does not prevent the Visual Studio/GCC desync.

Edit 11: Defining DISABLE_JOB_EFFECTS prevents the very quick desync between the Visual Studio and GCC/Msys builds with the britain-2010 saved game.

Edit 12: I have reverted the DISABLE_RANDOMNESS setting on the server's version and recompiled it so that people can again try to connect with an unmodified client for testing purposes.

Edit 13: Attempting to connect to the Bridgewater-Brunel server from the cross-compiled client (both from the master branch) still results in a near instant desync.

Edit 14: Further testing shows that the cause of the short desync between a Visual Studio client and a GCC/Msys (Windows) client appears to have been using the min() and max() methods with 64-bit integers when in fact they are defined as using signed 32-bit integers.  I have added a special 64-bit version of min() and max() (called min_64() and max_64()) to handle these where they appeared in the code relating to job effects, and I can now connect, with job effects and passenger generation multi-threading both enabled, an Msys/GCC client to a Visual Studio server for a considerable time (crossing a month boundary in the britain-2010 game) before a desync occurs. The long desync, however, after the month boundary is crossed, is still present.

This opens up a new line of enquiry into all cross platform desyncs, however, as there may be other sync critical places in the code with 64-bit integers using these methods. Also, I wonder whether it is safe for unsigned 32-bit integers to use these methods.
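
To illustrate the kind of problem in question, here is a minimal sketch with hypothetical helpers (min_32 and the two wrapper functions are not the actual Simutrans definitions). It shows how a value is changed when it is narrowed to a signed 32-bit parameter. It does not by itself explain why the two compilers disagreed, and the narrowing result is implementation-defined before C++20, although the common compilers wrap modulo 2^32:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper taking signed 32-bit parameters.
static inline int32_t min_32(int32_t a, int32_t b) { return a < b ? a : b; }

// 64-bit counterpart, analogous to the min_64() described above.
static inline int64_t min_64(int64_t a, int64_t b) { return a < b ? a : b; }

// What effectively happens when 64-bit values reach the 32-bit helper.
int64_t narrowed_min(int64_t a, int64_t b)
{
    return min_32(static_cast<int32_t>(a), static_cast<int32_t>(b));
}

// The same question for unsigned 32-bit values above INT32_MAX.
int32_t unsigned_through_signed_min(uint32_t a, uint32_t b)
{
    return min_32(static_cast<int32_t>(a), static_cast<int32_t>(b));
}
```

For example, narrowed_min(5000000000, 1000000000) gives 705032704 on such compilers, whereas min_64 gives 1000000000. Likewise unsigned_through_signed_min(3000000000, 1) gives -1294967296, which suggests that unsigned 32-bit values above 2147483647 are not safe with a signed 32-bit helper either.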

Edit 15: I cannot find any more instances of the min()/max() methods being passed 64-bit integers, although I have slightly improved some code. This has not prevented the long desync between Visual Studio and Msys/GCC clients, but normally connexions will be made between GCC/Msys and GCC/Linux clients in any event, so it is possible that this desync is not important.

I have now integrated the above fix into the master branch as this is clearly an improvement on the previous code and fixes a specific issue. There will be further testing on the Bridgewater-Brunel server.

Edit 16: There is still an immediate desync when connecting to the Bridgewater-Brunel server with the client cross-compiled on the Bridgewater-Brunel server. It is not clear at present what the cause of this is.

Edit 17: The same result obtains with both the Msys/GCC and Visual Studio builds connecting to the Bridgewater-Brunel server: an instant desync.

Edit 18: Testing with my Linux computer, this seems to desync from the Bridgewater-Brunel server instantly, too, but be able to stay in sync with a Windows server. However, the problem of one client joining causing all other clients to desync shortly after connecting appears to have returned, and it is not clear why.

Edit 19: The client kick desync can be reproduced with the Visual Studio and the GCC/Msys builds connecting to a build of the same type.

Edit 20: Testing again with all multi-threading disabled, the client kick desync cannot be reproduced. This appears to be the same issue as was investigated some months ago relating apparently to multi-threading of the load/save routines. The long desync also appears not to be reproducible in this configuration, but this needs a longer period of testing to confirm.

Edit 21: With multi-threading disabled entirely, three clients can remain connected to a local server for many hours and many in-game months without desyncing from the server.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 11, 2017, 07:54:58 PM
Re-testing this again now, this can still be reproduced on the current master branch.

Analysing this carefully, this must be a problem with saving rather than loading. This is because the online gaming works in the following way: the server starts out with no clients connected. When the first client connects, it saves the game, sends the saved game as a file to the client requesting connexion, then loads that saved game, along with the client. When another client connects, the same procedure is applied for the newly joining client, but, in order to save bandwidth, the existingly connected clients save their games locally and re-load their own saved games without needing this to be transmitted from the server.

Thus, for all of the existingly connected clients to desync (at exactly the same time, as I have just confirmed) and for the most recently connected client not to desync, the problem must be that the files that the already connected clients are loading are not identical to the one being loaded by the newly connected client and the server. Because the newly connected client loads but does not save, any non-determinism in the saving mechanism is not relevant, as both server and the most recently connected client will in any event be loading exactly the same file. Only the existingly connected clients have to load a file that has been saved other than by the server, and thus potentially get something non-identical when they save.

The problem cannot be (or, at least, is very unlikely to be) in the code for loading (including the post-processing after loading), nor an inadequacy in what data are saved, since a problem with either of things would equally affect the newly connected client, whereas tests show that the most recently connected client always reliably stays in sync.
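
The reasoning above can be put as a toy model. Everything here is hypothetical (world_t, instance_t, join and the drops_cargo_on_save flag are invented for illustration and greatly simplified): each instance has its own save routine, the newcomer and the server load the server's file, and the existing clients reload their own files.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, greatly simplified game state.
struct world_t {
    int convoys;
    int transferring_cargo;
    bool operator==(const world_t& o) const {
        return convoys == o.convoys && transferring_cargo == o.transferring_cargo;
    }
};

struct instance_t {
    world_t world;
    bool drops_cargo_on_save; // models a save routine that omits some state

    world_t save() const {
        world_t file = world;
        if (drops_cargo_on_save) {
            file.transferring_cargo = 0;
        }
        return file;
    }
    void load(const world_t& file) { world = file; }
};

// A new client joins: returns the newcomer after the load/save cycle.
instance_t join(instance_t& server, std::vector<instance_t>& existing)
{
    const world_t server_file = server.save();
    instance_t newcomer{server_file, false};   // receives the server's file
    for (instance_t& client : existing) {
        client.load(client.save());            // saves and reloads locally
    }
    server.load(server_file);                  // server reloads its own file
    return newcomer;
}
```

If the server's save differs from what an existing client saves, the newcomer still matches the server exactly, while the existing client no longer does. A fault in loading would instead affect all three alike.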

Thus, I need to look for problems specifically in the code for saving that is not shared with the code for loading. Much of the save/load code is shared (the computer being instructed to save or load a particular datum in one line of code by the "rdwr" method, and whether it saves or loads is determined by whether the game is currently in the process of loading or saving generally), but there are some places where the code differs between saving and loading, being in simworld.cc and at various places that use branching logic depending on whether the file is loading or saving.
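
For readers unfamiliar with the "rdwr" idiom, here is a minimal sketch of the idea. The classes stream_sketch_t and stop_sketch_t are hypothetical miniatures, and the real loadsave_t class differs considerably: one line handles a datum in both directions, and only explicit branches on the direction behave differently between saving and loading.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical miniature of a bidirectional save/load stream.
class stream_sketch_t {
    std::vector<int32_t> data;
    std::size_t pos;
    bool saving;
public:
    explicit stream_sketch_t(bool is_saving) : pos(0), saving(is_saving) {}
    bool is_saving() const { return saving; }

    // Writes v when saving, overwrites v when loading.
    void rdwr_long(int32_t& v) {
        if (saving) {
            data.push_back(v);
        } else {
            v = data.at(pos++);
        }
    }
    void rewind_for_loading() { saving = false; pos = 0; }
};

struct stop_sketch_t {
    int32_t waiting;
    int32_t capacity;
    bool post_processed;

    void rdwr(stream_sketch_t& file) {
        file.rdwr_long(waiting);   // shared line: saves or loads
        file.rdwr_long(capacity);  // shared line: saves or loads
        if (!file.is_saving()) {
            post_processed = true; // direction-specific branch: loading only
        }
    }
};
```

Code of the first kind cannot easily get out of step between saving and loading, so the direction-specific branches are the natural places to look.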

Edit 1: Forcing the use of bzip2 rather than zip has no effect on the desync issue.

Edit 2: The problem does not appear to be any error with loading/saving the "parallell_operations" datum.

Edit 3: It appears that this cannot be reproduced on a map where the only transport infrastructure consists of airports and roads usable by private cars (I have generally used one of Rollmaterial's large, complex maps to test this, which has air, water, rail and road transport).

Edit 4: If I add a fairly substantial 'bus network to the airport only map, I can reproduce the desync problem relatively quickly. This eliminates the possibility that the problem is caused by industries or electricity supplies, or that it is specific to rail transport.

Edit 5: Running overnight with the aircraft only, three instances of the client remain connected to the server (over the loopback interface). However, none of the airports were in range of any town buildings, so no passengers were transported. From this, I infer that the problem is likely to be connected to the actual transporting of passengers by player provided transport (as opposed to walking or the use of private cars).

Edit 6: Removing the aircraft but retaining the 'buses, the client kick desync can still be reproduced. I have noticed, however, that the desync only occurs if the later clients join a little while after the first client joined.

Edit 7: Further testing seems to have ruled out the loading/saving of transferring passengers at stops. Also, I have found that what counts as far as timing is concerned appears to be the time between the last load/save cycle and the new client joining, as a series of rapid load/save cycles one after another where a number of clients join in quick succession allow a large number of clients to connect simultaneously, but if, after a period of time after the last load/save cycle, another client tries to connect, all of the previously connected clients will desync very quickly.

Edit 8: The problem appears to be in the world (rather than the stop) transferring cargo/passengers list: when I comment out line no. 9087 in simworld.cc, being

Code: [Select]
transferring_cargoes[0].append(tc);

I cannot reproduce the desync, and was able to connect 8 client instances to a server over the loopback interface without any of them desyncing. Obviously, this is not a solution, as it breaks the transferring cargo functionality, but it means that I have at least narrowed down the problem to this area of the code.

Edit 9: Editing the following part of the code so as not to compile the parts dependent on "MULTI_THREAD" being defined (by modifying that string in both cases to something else) prevents the desync from occurring:

Code: [Select]
if (file->get_extended_version() >= 13 || file->get_extended_revision() >= 15)
{
    uint32 count;
    sint64 ready;
    ware_t ware;
#ifdef MULTI_THREAD
    // Total number of entries across all of the per-thread lists
    count = 0;
    for (sint32 i = 0; i < parallel_operations; i++)
    {
        count += transferring_cargoes[i].get_count();
    }
#else
    // Only the first list is in use
    count = transferring_cargoes[0].get_count();
#endif

    file->rdwr_long(count);

    // Number of lists to write out
    sint32 po;
#ifdef MULTI_THREAD
    po = parallel_operations;
#else
    po = 1;
#endif

    for (sint32 i = 0; i < po; i++)
    {
        for (uint32 j = 0; j < transferring_cargoes[i].get_count(); j++)
        {
            // Each entry is written as its ready time followed by the cargo itself
            ready = transferring_cargoes[i][j].ready_time;
            ware = transferring_cargoes[i][j].ware;

            file->rdwr_longlong(ready);
            ware.rdwr(file);
        }
    }
}

This modification has the effect of saving only one of the total of 4 sets of transferring cargoes/passengers generated by the multi-threaded passenger generation system (an arbitrary cross-section of all transferring cargoes which cross-section should be the same between server and clients).

Given that I have verified that the number of parallel operations is consistent between server and client (by default 5), it is not clear to me what is occurring here or why this should make a difference.

Edit 10: I think that I have - eventually - managed to fix this (which fix has just been pushed). The problem was that the parallel_operations variable was not always of the same value as was returned by the get_parallel_operations() method: in the case of the server, the parallel_operations value was 0, whereas get_parallel_operations() would return 5, but on the client, both would be 5. The consequence of this was that the server would in effect discard transferring cargo when saving whereas the clients would save correctly, thus causing a desync when there was any transferring cargo and a subsequent client joined.

I have fixed this by replacing parallel_operations in the above code with get_parallel_operations(), and I was able to connect three clients on the loopback interface to a local server running Rollmaterial's game in 2010 and I was able to confirm that they ran overnight without desyncing.
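
The effect can be shown with a minimal sketch. The struct world_sketch_t is hypothetical, and in particular the way that get_parallel_operations() obtains its value here (the number of lists) is only an assumption for illustration; the real method may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct world_sketch_t {
    int32_t parallel_operations;                        // cached value; stale (0) on the server in this case
    std::vector<std::vector<int>> transferring_cargoes; // one list per thread

    // Assumption: the authoritative value is the number of lists.
    int32_t get_parallel_operations() const {
        return static_cast<int32_t>(transferring_cargoes.size());
    }

    // Count as computed by the old saving code.
    uint32_t count_with_member() const {
        uint32_t count = 0;
        for (int32_t i = 0; i < parallel_operations; i++) {
            count += static_cast<uint32_t>(transferring_cargoes[static_cast<std::size_t>(i)].size());
        }
        return count;
    }

    // Count as computed after the fix.
    uint32_t count_with_getter() const {
        uint32_t count = 0;
        for (int32_t i = 0; i < get_parallel_operations(); i++) {
            count += static_cast<uint32_t>(transferring_cargoes[static_cast<std::size_t>(i)].size());
        }
        return count;
    }
};
```

With five lists of two entries each and the cached value at 0, count_with_member() gives 0 while count_with_getter() gives 10: the instance with the stale value writes out no transferring cargo at all, while the others write all of it.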
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on June 16, 2017, 06:35:11 PM
I have updated my servers, and was connected to the big map for quite a while. But just a minute ago I got a desync. So I connected to the smaller sandbox map and it desynced almost immediately - perhaps due to someone else connecting to the game
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 16, 2017, 08:27:55 PM
It would be helpful if you could be more specific about the circumstances in which this occurs.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on June 17, 2017, 09:20:28 PM
I tried to connect to both the bridgewater-brunel server and the two British games on server.exp.simutrans.com, but got desyncs within a second in every case. The Swedish server game, however, seems to stay online indefinitely without desyncing.

Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 17, 2017, 09:23:51 PM
Can I check which version that you are using (in terms of the abbreviated Github hash) and whether you have downloaded the latest pakset from the server?
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on June 17, 2017, 09:27:32 PM
I used this executable:
e18d813edb17e52282d6c94863b57ea81aa0576b (16.06.2017)

and updated to this pakset:
122fa645ae91ecbc5eb9a55b5c939064bed21cab (16.06.2017)

I compiled the pakset using the latest makeobj with the same commit as the executable.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 17, 2017, 09:31:52 PM
Did you download the executable and pakset from the server or compile them yourself? If the latter, I should be grateful if you could re-test with downloaded executable and pakset.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on June 17, 2017, 09:35:25 PM
I compiled everything myself.
I will try downloading everything and see if that changes.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 17, 2017, 09:37:48 PM
Splendid, thank you.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Ves on June 17, 2017, 10:04:28 PM
Aarghh, my computer won't let me open the downloaded nightly build because it needs to be sent to the AVG-center first and be confirmed or something similar, which would take around 2-3 hours!  :o
When pressing the "I trust this program" button, I just get a small window telling me that I don't have permission to access the location where that file is located. I hate it when I don't have control over my own computer...

Anyway, I could test with the downloaded pakset, and that stayed in sync with the small British map on server.exp.simutrans.com for around 30 seconds (I didn't try the other) and has been connected to bridgewater-brunel while I'm writing without desyncing at all, now for at least 5 minutes.
Connecting with my own compiled pakset to the bridgewater-brunel server at the same time (so I had two instances of the game connected to the same map) generated a desync within 30 seconds for that pakset, leaving the instance with the nightly downloaded pakset online!

So, something has to be said for the different paksets. I believe they are the same paksets, it should only be the makeobj's that are different for them. I have uploaded the makeobj I use to http://server.exp.simutrans.com/Devel-new-builds/
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 17, 2017, 11:26:31 PM
I am not going to be able to work out the differences in the pakset either from a compiled makeobj or a compiled pakset - this problem looks as though it is more easily dealt with simply by using the pakset downloaded from the server rather than trying to work out where the divergence lies, I think.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on June 18, 2017, 10:12:44 AM
The only difference between my and nightly pakset is the rolling resistance of hackney carriage...

I had an immediate desync after connecting to "british sandbox" game, but on the second try it was OK.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 18, 2017, 11:03:12 AM
I should note that the British sandbox game is not updated to the latest nightly version. I have a shell script on the Bridgewater-Brunel server that updates it to the latest nightly version every night (which had not been working properly, but which I have just now fixed), which should ensure that the same version is running on the server as is available to download.

It is always better to test with the latest version to make sure that no errors that have since been fixed are causing the trouble.
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on June 18, 2017, 11:15:29 AM
Could you share that script? I update the sandbox game manually, and that takes quite a lot of time...
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on June 18, 2017, 11:24:32 AM
The scripts come in a number of parts, all in ~/. The first is nightly.sh, which builds the executables and paksets from source, copies them to the download directories and then copies them also (or links them in the case of the paksets) to the directory from where the game is actually run on the server:

Code: [Select]
echo "***"
echo "Nightly build for Linux"
echo "***"
cd /usr/share/games/nightly/simutrans-experimental
echo "Fetching new version of the code"
echo "***"
git pull origin master --no-edit
echo "***"
echo "Building the main executable"
echo "***"
# Linux
env CFG=default make clean
env CFG=default make -j3
strip build/default/simutrans-extended
chmod +x build/default/simutrans-extended
env CFG=server make clean
env CFG=server make -j3
strip build/server/simutrans-extended
chmod +x build/server/simutrans-extended
# Windows
env CFG=mingw make clean
env CFG=mingw make -j3
echo "***"
echo "Building makeobj"
echo "***"
cd makeobj
# Linux
env CFG=default make clean
env CFG=default make -j3
strip /usr/share/games/nightly/simutrans-experimental/build/default/makeobj-extended/makeobj-extended
chmod +x /usr/share/games/nightly/simutrans-experimental/build/default/makeobj-extended/makeobj-extended
# Windows
env CFG=mingw make clean
env CFG=mingw make -j3
echo "***"
echo "Linking Makeobj to the pakset directories"
echo "***"
ln -s /usr/share/games/nightly/simutrans-experimental/build/default/makeobj-extended/makeobj-extended /usr/share/games/nightly/simutrans-pak128.britain/makeobj-extended
ln -s /usr/share/games/nightly/simutrans-experimental/build/default/makeobj-extended/makeobj-extended /usr/share/games/nightly/Pak128.Sweden-Ex/makeobj
echo "***"
echo "Building nettool"
echo "***"
cd /usr/share/games/nightly/simutrans-experimental/nettools
# Linux
make clean
env CFG=default make -j3
strip /usr/share/games/nightly/simutrans-experimental/build/default/nettool/nettool
chmod +x /usr/share/games/nightly/simutrans-experimental/build/default/nettool/nettool
# Windows
env CFG=mingw make clean
env CFG=mingw make -j3
# Paksets
echo "***"
echo "Fetching the new version of the paksets"
echo "***"
cd /usr/share/games/nightly/simutrans-pak128.britain
git pull origin master --no-edit
cd /usr/share/games/nightly/Pak128.Sweden-Ex
git pull origin half-height --no-edit
echo "***"
echo "Building the paksets"
echo "***"
cd /usr/share/games/nightly/simutrans-pak128.britain
make clean; make -j3
cd /usr/share/games/nightly/Pak128.Sweden-Ex
make clean; make -j3
echo "***"
echo "Copying the files for download and the game server"
echo "***"
cp /usr/share/games/nightly/simutrans-experimental/build/default/simutrans-extended /var/www/downloads/nightly/linux-x64
cp /usr/share/games/nightly/simutrans-experimental/build/mingw/simutrans-extended /var/www/downloads/nightly/windows/Simutrans-Extended.exe

cp /usr/share/games/nightly/simutrans-experimental/build/default/makeobj-extended/makeobj-extended /var/www/downloads/nightly/linux-x64
cp /usr/share/games/nightly/simutrans-experimental/build/mingw/makeobj-extended/makeobj-extended /var/www/downloads/nightly/windows/Makeobj-Extended.exe

cp /usr/share/games/nightly/simutrans-experimental/build/default/nettool/nettool /var/www/downloads/nightly/linux-x64
cp /usr/share/games/nightly/simutrans-experimental/build/mingw/nettool/nettool /var/www/downloads/nightly/windows/Nettool-Extended.exe

rm /usr/share/games/simutrans-extended/nettool
cp /usr/share/games/nightly/simutrans-experimental/build/default/nettool/nettool /usr/share/games/simutrans-extended/nettool

rm /usr/share/games/simutrans-extended/simutrans-extended
cp /usr/share/games/nightly/simutrans-experimental/build/server/simutrans-extended /usr/share/games/simutrans-extended/simutrans-extended
cp /usr/share/games/nightly/simutrans-experimental/build/server/simutrans-extended /var/www/downloads/nightly/linux-x64/command-line-server-build

tar -zcvf /var/www/downloads/nightly/pakset/pak128.britain-ex-nightly.tar.gz --directory "/usr/share/games/nightly/simutrans-pak128.britain/pak128.Britain-Ex" .
tar -zcvf /var/www/downloads/nightly/pakset/pak128.sweden-ex-nightly.tar.gz --directory "/usr/share/games/nightly/Pak128.Sweden-Ex/pak128.Sweden-Ex" .
echo "***"
echo "Cleaning up the pakset folders"
rm /usr/share/games/nightly/Pak128.Sweden-Ex/makeobj
rm /usr/share/games/nightly/simutrans-pak128.britain/makeobj-extended
echo "***"
echo "Copying files to the /simutrans folder to make the Windows complete .zip file"
echo "***"
rm /usr/share/games/nightly/simutrans-experimental/simutrans/Simutrans-Extended.exe
cp /usr/share/games/nightly/simutrans-experimental/build/mingw/simutrans-extended /usr/share/games/nightly/simutrans-experimental/simutrans/Simutrans-Extended.exe
rm /usr/share/games/nightly/simutrans-experimental/simutrans/Makeobj-Extended.exe
cp /usr/share/games/nightly/simutrans-experimental/build/mingw/makeobj-extended/makeobj-extended /usr/share/games/nightly/simutrans-experimental/simutrans/Makeobj-Extended.exe
rm -rf /usr/share/games/nightly/simutrans-experimental/simutrans/Pak128.Britain-Ex
cp -R /usr/share/games/nightly/simutrans-pak128.britain/pak128.Britain-Ex /usr/share/games/nightly/simutrans-experimental/simutrans/Pak128.Britain-Ex
rm -rf /usr/share/games/nightly/simutrans-experimental/simutrans/Pak128.Sweden-Ex
cp -R /usr/share/games/nightly/Pak128.Sweden-Ex/pak128.Sweden-Ex /usr/share/games/nightly/simutrans-experimental/simutrans/Pak128.Sweden-Ex
echo "***"
echo "Zipping the Windows Simutrans-Extended-Complete file"
echo "***"
rm /var/www/downloads/nightly/packages/Simutrans-Extended-Complete.zip
cd /usr/share/games/nightly/simutrans-experimental/
zip -r /var/www/downloads/nightly/packages/Simutrans-Extended-Complete.zip ./simutrans
echo "***"
echo "Completed"
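
The script above carries on to the copying stage even if one of the builds fails. As a sketch only (this is not what runs on the Bridgewater-Brunel server), each step could be wrapped in a small helper that prints the same style of banner and reports a failure, so that a broken build is not copied to the download directories. The name run_step is hypothetical. The make invocation in the usage note is taken from the script above.

```shell
#!/bin/sh
# Hypothetical helper: print a banner, run the given command, and return
# a non-zero status with a message on stderr if the command fails.
run_step() {
    description=$1
    shift
    echo "***"
    echo "$description"
    echo "***"
    "$@" || { echo "FAILED: $description" >&2; return 1; }
}

# Usage, mirroring one of the builds in nightly.sh:
# run_step "Building the main executable" env CFG=default make -j3 || exit 1
```

Stopping at the first failure leaves the previous night's files in place for download.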

Next is warn-save.sh, a script which terminates the Simutrans-Extended process on the server after making sure that the game is saved and that players are warned of the impending reset:

Code: [Select]

# Shell script to run a force-sync on the running Simutrans-Experimental server
# and then stop it, but only after warning players and waiting 1 minute at each stage.
# Written by James E. Petts, February 2017

echo "Saving/restarting the server"
echo
date
echo
echo "Warning players of impending save/restart"

/usr/share/games/simutrans-extended/nettool -s bridgewater-brunel.me.uk -p Billinton1890 say "WARNING: Server about to be reset and updated to the latest version. All changes after the next save will be lost. Will save in 1 minute from now. This is an automated message."
sleep 1m

echo "Running a force-sync..."

/usr/share/games/simutrans-extended/nettool -s bridgewater-brunel.me.uk -p Billinton1890 clients

echo

/usr/share/games/simutrans-extended/nettool -s bridgewater-brunel.me.uk -p Billinton1890 -q force-sync

sleep 1m
echo "Stopping the server"
echo
/usr/share/games/simutrans-extended/nettool -s bridgewater-brunel.me.uk -p Billinton1890 say "WARNING: Server will shut down for restart and update to the latest version in 1 minute. No further progress will be saved. The server will restart within 2-5 minutes. You may need to download a new version to continue to connect. Download from http://bridgewater-brunel.me.uk/download/nightly. This is an automated message."
sleep 1m
/root/simctrl brit stop
# No need to restart manually here, as there is a cron job running simctrl brit restart every minute.

That relies on force-sync.sh, which does the actual saving:

Code: [Select]
# Shell script to run a force-sync on the running Simutrans-Experimental server
# Written by James E. Petts, December 2012

echo "Running a force-sync..."

/usr/share/games/simutrans-extended/nettool -s bridgewater-brunel.me.uk -p Billinton1890 clients

echo

/usr/share/games/simutrans-extended/nettool -s bridgewater-brunel.me.uk -p Billinton1890 -q force-sync

Then, I run a number of cron jobs to make sure that they run every night. Here is the output of crontab -e on my server:

Code: [Select]
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
*/1 * * * * /root/simctrl brit check >> /var/log/simutrans/check.log
00 */1 * * * /root/rotate-backup.sh >> /var/log/simutrans/rotate-backup.log
00 03 * * * /usr/sbin/logrotate /etc/logrotate.conf > /dev/null 2>&1
00 05 * * * bash -x /root/nightly.sh >> /var/log/simutrans/nightly-linux.log 2>&1
00 06 * * * /root/warn-save.sh >> /var/log/simutrans/warn-save.log
30 05 * * * /root/package.sh >> /var/log/simutrans/nightly-package.log

The first of these jobs checks every minute whether the game is running and restarts it if it is not. Note that this uses the simctrl script, written by Timothy a long time ago - do you have that set up?
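
The simctrl script itself is not reproduced in this thread, so the following is only a rough sketch of what a once-a-minute check might do, not Timothy's script. It assumes that the server's process ID is recorded in a PID file when the server is started. The function names and the PID file are hypothetical.

```shell
#!/bin/sh
# Hypothetical watchdog, NOT the real simctrl.

# Succeed if the PID file exists and the recorded process is still there.
is_alive() {
    pidfile=$1
    [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null
}

# Start the given command in the background unless it is already running,
# and record its process ID in the PID file.
ensure_running() {
    pidfile=$1
    shift
    if is_alive "$pidfile"; then
        echo "already running"
    else
        echo "not running, starting"
        "$@" &
        echo $! > "$pidfile"
    fi
}

# Usage from cron (placeholder paths and command):
# ensure_running /var/run/simutrans-brit.pid /path/to/server-start-command
```

A PID file is used here instead of matching on the process name because name matching can hit unrelated processes.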
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: Vladki on July 02, 2017, 06:45:09 PM
Thanks for the scripts. I have modified them a little to suit my needs, and moved the sandbox servers to a new place. Now they update automatically, using the nightlies from the Bridgewater-Brunel server. Restart is scheduled 10 minutes after the nightly build. (5:20)
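
The modified scripts are not shown here. As a rough sketch only, an update from the nightly downloads might look something like the following. It assumes that the pakset tarballs built by nightly.sh are reachable under the download address mentioned in the warning message, which is a guess at the web layout. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical update helpers; the URL layout below is an assumption.
NIGHTLY_URL="http://bridgewater-brunel.me.uk/download/nightly"

# Download one file (path relative to the nightly directory) to a local path.
fetch_nightly() {
    curl --fail --silent --show-error -o "$2" "$NIGHTLY_URL/$1"
}

# Replace a pakset directory with the contents of a tarball.
install_pakset() {
    tarball=$1
    target=$2
    staging=$(mktemp -d) || return 1
    # Unpack into a staging directory first, so that a corrupt download
    # does not destroy the pakset currently in use.
    tar -xzf "$tarball" -C "$staging" || { rm -rf "$staging"; return 1; }
    rm -rf "$target"
    mv "$staging" "$target"
    # mktemp -d creates a private directory; make it readable again.
    chmod 755 "$target"
}

# Usage (placeholder local paths):
# fetch_nightly pakset/pak128.britain-ex-nightly.tar.gz /tmp/pak.tar.gz &&
#     install_pakset /tmp/pak.tar.gz /path/to/simutrans/pak128.Britain-Ex
```

Unpacking straight into the target works because nightly.sh creates the tarballs with the pakset files at the archive root.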
Title: Re: Desync issue (devel-new-2) with Linux Server/Windows client
Post by: jamespetts on July 02, 2017, 07:56:52 PM
Excellent - it is good to have a number of different servers running!