Desync issue (devel-new-2) with Linux Server/Windows client

Started by Ves, October 22, 2016, 09:03:44 PM

Ves

Moderator note: This has been split from the server topic to focus the discussion on the desync issue. For an introduction to the issue, see the post here.

I have tried to connect now, and I get desyncs quite often. I have not yet tracked down what is causing them, but the game appears to lag a bit in general (the messages at the bottom scroll quite jerkily).

jamespetts

I have noticed the desyncs, too, and am trying to track them down, which is never easy, as this is the hardest type of bug to solve. That was one reason that I was keen for the other server to be running, to see whether you get desyncs on that. Do you think that you could try the saved game from the Bridgewater-Brunel server on your server to see whether you also get desyncs?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ves

I don't have the server; that is Vladki's. I only upload the "devel-new" builds to it :)

jamespetts

Ahh - my apologies. I am afraid that I confuse the two of you sometimes.

Edit: I am unable to reproduce the desyncs on a local server (i.e. client and server running on my computer at home), so it would be extremely helpful if Vladki could run a controlled test by uploading the same map to his server and seeing whether we get desyncs there.

Ves

I don't know anything special about servers, but I made a quick comparison between your server and the other one:





I don't know whether anybody can make something out of this?

jamespetts

Thank you for testing that. I do not think that it is a performance issue: I think that it is an issue with the code of some sort.

DrSuperGood

I would recommend giving the full path to the server (port included) as well as where to download the pak and the executable to run the server. It has to be very user friendly for people to be able to join.

jamespetts

The port is the default port, so it should not need to be specified manually. The port will become relevant if I decide to run a second game on this server.

As to the download link, I do not want to encourage players other than testers at present, as there will at present be frequent changes to the server and erasing of saved games without notice.

Edit: Having rebuilt the server and cleared out all the old files that had accumulated over time, taking a fresh commit from Github (and upgrading the version of Linux that it runs in the process), I still get desyncs, so this is not a simple issue of the server having a slightly incorrect version of one of the source code files. This may take considerable work to fix.

DrSuperGood

Quote
I do not want to encourage players other than testers at present
Which is difficult if no one knows where to download...

Quote
Edit: Having rebuilt the server and cleared out all the old files that had accumulated over time, taking a fresh commit from Github (and upgrading the version of Linux that it runs in the process), I still get desyncs, so this is not a simple issue of the server having a slightly incorrect version of one of the source code files. This may take considerable work to fix.
I am guessing a lot of object types were added to Simutrans Experimental to support the new signalling. Make sure such objects are loaded deterministically between clients since Simutrans loads in parallel as far as I can tell.
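
To illustrate what deterministic loading could mean here, a minimal sketch follows. The function and the object names are hypothetical and are not taken from the Simutrans source. The idea is that if IDs are handed out in whatever order a parallel loader happens to finish, two machines can end up with different IDs for the same objects, whereas sorting by a stable key first removes that dependence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: assigns numeric IDs to object names in a way that
// does not depend on the order in which the names were loaded.
std::map<std::string, std::uint16_t> assign_object_ids(std::vector<std::string> names)
{
    // Sort and de-duplicate so that every machine sees the same sequence.
    std::sort(names.begin(), names.end());
    names.erase(std::unique(names.begin(), names.end()), names.end());

    // IDs are stored in 16 bits in this sketch, so guard the range.
    assert(names.size() <= 65535u);

    std::map<std::string, std::uint16_t> ids;
    std::uint16_t next_id = 0;
    for (const std::string& name : names) {
        ids[name] = next_id++;
    }
    return ids;
}
```

Whatever order the names arrive in, the resulting map is the same, so any state derived from the IDs matches on client and server.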

Make sure everything which alters game state (commands) is synchronized and not run locally.

Also maybe it is a false positive. It might be worth disabling the forced disconnect and seeing what or where stuff starts to go out of sync.
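
One way to see where things start to go out of sync is to compare per-subsystem checksums instead of a single combined value. The sketch below assumes a hypothetical snapshot structure. The field names are illustrative only and are not the actual Simutrans checklist fields.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-subsystem checksums recorded at one sync step.
struct sync_snapshot_t {
    std::uint32_t sync_step;
    std::uint32_t random_seed;
    std::uint32_t convoy_hash;
    std::uint32_t halt_hash;
    std::uint32_t line_hash;
};

// Returns the name of every field that differs between server and client.
std::vector<std::string> diff_snapshots(const sync_snapshot_t& server,
                                        const sync_snapshot_t& client)
{
    // Comparing snapshots taken at different steps would be meaningless.
    assert(server.sync_step == client.sync_step);

    std::vector<std::string> mismatches;
    if (server.random_seed != client.random_seed) mismatches.push_back("random_seed");
    if (server.convoy_hash != client.convoy_hash) mismatches.push_back("convoy_hash");
    if (server.halt_hash != client.halt_hash) mismatches.push_back("halt_hash");
    if (server.line_hash != client.line_hash) mismatches.push_back("line_hash");
    return mismatches;
}
```

Logging the result for each step instead of disconnecting would show which subsystem diverges first, which narrows the search considerably.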

jamespetts

The only new object types are signalboxes; but I do not think that this is an object types issue, as client and server stay in sync even on a dense map (with lots of signalboxes) for many hours (overnight) when both client and server are on the same computer connecting via the loopback interface.

I do not think that multi-threading is an issue, as the desync occurs even when the server is single threaded, but does not occur when both client and server (on the loopback interface locally) are multi-threaded, nor when both local client and remote server are multi-threaded.

This is not the old electrification issue, since this occurs even when there are no power lines or substations on the map, and does not occur when there are power lines and substations on the map when both client and server are connected by the loopback interface locally.

It is also not an interaction issue (i.e. one related to not properly sending/receiving commands), as the desync occurs when idle without any commands being sent by any player.

The Bridgewater-Brunel server runs Linux whereas my computer at home runs Windows - I wonder whether this might be relevant. This was an issue once many years ago, when the problem transpired to be that Windows and Linux builds dealt with imprecision in floating point arithmetic differently, but all floating point in running code was abandoned after that, so it is hard to see what the problem might be.

Can anyone try connecting with a Linux machine to see whether that remains stable? I might try to connect using my Linux NUC that I use in work, but I think that the only cable that I can use to connect it to my monitors at home may have failed, so this may not be possible.

Edit: I have been able to get my NUC working (the HDMI cable issue seems to be intermittent), and confirm that I can connect to both the Bridgewater-Brunel and the server.exp.simutrans.com servers without desyncs from a Linux build of the latest devel-new-2 branch, whereas I cannot connect stably to either from a Windows build. This is very odd: I cannot think of anything other than floating point arithmetic, which has been eliminated, which might cause this. Has anyone any ideas of what might run differently in Linux and Windows?

Edit 2: Running it through Dr. Memory, I get the following suspicious entry (but it is odd, as no actual crash is encountered in the game):

Error #1: UNADDRESSABLE ACCESS: reading 0x4d6f8d40-0x4d6f8d44 4 byte(s)
# 0 _longest_match                         [F:\Develop\vs140\build\zlib-1.2.8\contrib\masmx86\match686.asm:375]
# 1 inflateUndermine               
# 2 deflate                         
# 3 gzungetc                       
# 4 gzwrite                         
# 5 loadsave_t::flush_buffer               [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:605]
# 6 loadsave_thread                        [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:63]
# 7 pthreadVCE2.dll!pthread_setcanceltype +0x4bce   (0x71445eef <pthreadVCE2.dll+0x5eef>)
# 8 MSVCR100.dll!endthreadex              +0x39     (0x77f4c6de <MSVCR100.dll+0x5c6de>)
# 9 MSVCR100.dll!endthreadex              +0xe3     (0x77f4c788 <MSVCR100.dll+0x5c788>)
#10 KERNEL32.dll!BaseThreadInitThunk      +0x11     (0x7517336a <KERNEL32.dll+0x1336a>)
Note: @0:10:00.187 in thread 4556
Note: refers to 0 bytes(s) beyond last valid byte in prior malloc
Note: prev lower malloc: 0x4d6e8d40-0x4d6f8d40 here:
Note: # 0 replace_malloc                     [d:\drmemory_package\common\alloc_replace.c:2292]
Note: # 1 zcalloc                         
Note: # 2 deflateInit2_                   
Note: # 3 gzungetc                       
Note: # 4 gzwrite                         
Note: # 5 loadsave_t::write                  [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:587]
Note: # 6 loadsave_t::wr_open                [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\dataobj\loadsave.cc:413]
Note: # 7 karte_t::save                      [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simworld.cc:6319]
Note: # 8 karte_t::interactive               [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simworld.cc:8765]
Note: # 9 simu_main                          [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simmain.cc:1363]
Note: #10 sysmain                            [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simsys.cc:805]
Note: #11 WinMain                            [c:\users\james\documents\development\simutrans\simutrans-experimental-sources\simsys_w.cc:968]
Note: instruction: xor    0x04(%edx,%edi,1) %eax -> %eax

DrSuperGood

Quote
Has anyone any ideas of what might run differently in Linux and Windows?
I assume you are getting a hash-based OOS (out of sync) and not a command-in-the-past one?

It can be anything from API calls acting slightly differently to differences in type sizes which are assumed the same. For example size_t and other API related structs can be different sizes on Windows and on Linux.
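
As a concrete illustration of the type size point: unsigned long is 32 bits wide on Windows but 64 bits wide on 64-bit Linux, and size_t follows the pointer width of the build. The sketch below uses hypothetical helper names (nothing here is from the Simutrans source) to show how a checksum with a native-width accumulator can differ between platforms, while a fixed-width one cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Narrows a size_t to a fixed 32-bit value before it enters any state that
// both sides must agree on. size_t is 32 bits in a 32-bit build and 64 bits
// in a 64-bit build.
std::uint32_t to_fixed_width(std::size_t count)
{
    assert(count <= std::numeric_limits<std::uint32_t>::max());
    return static_cast<std::uint32_t>(count);
}

// Fixed-width accumulator: wraps modulo 2^32 on every platform.
std::uint32_t checksum_fixed(const std::vector<std::uint32_t>& values)
{
    std::uint32_t sum = 0;
    for (std::uint32_t v : values) {
        sum = sum * 31u + v;
    }
    return sum;
}

// Native-width accumulator: wraps at 2^32 where unsigned long is 32 bits
// and at 2^64 where it is 64 bits, so large inputs give different results.
unsigned long checksum_native(const std::vector<std::uint32_t>& values)
{
    unsigned long sum = 0;
    for (std::uint32_t v : values) {
        sum = sum * 31ul + v;
    }
    return sum;
}
```

The two functions agree for small inputs on every platform, which is why this kind of mismatch can go unnoticed until values grow large enough to overflow 32 bits.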

Rule out a 32-bit versus 64-bit issue by making sure that the Linux and Windows builds both use the same bitness.

jamespetts

I have checked, and this is the usual checklist mismatch desync, not a command being executed in the past (the desyncs occur even with no interaction by any player).

As to APIs, what APIs are there apart from pthreads (which it seems reasonable to infer is not relevant here because multi-threading was ruled out as a cause as set out above), libpng (which is not relevant as the server does not use graphics and graphics could not cause a desync in any event) and the various compression libraries? The Simutrans-Experimental unique code does not tend to reference these APIs directly in any event, being focussed on altering gameplay rather than the underlying, lower level simulation.

In relation to size_t, I have carefully looked over the instances of this in the code: all but one such instance was either unchanged from Standard or only in the GUI (which would not cause a desync). The one instance that was neither of these I have just changed and testing shows that the desync still occurs.

As to 32/64 bit issues, this is harder, as I suspect that I shall have considerable difficulty now compiling a 64-bit Windows version, as the processes used to do this (especially library references) have been deprecated, particularly when I upgraded to MSVS 2015.

I do note, however, that, now that Vladki has confirmed that his server runs Linux, it was quite recently that Windows clients were connecting to that server with either no desyncs or only very occasional desyncs related to interaction, whereas these desyncs now occur almost inevitably within a few seconds of connecting, with or without interaction. The issue is therefore likely to have arisen in a recent change to the code, but I am struggling to find anything that might be relevant.

It does not help that I do not have an exact date for when I was last able to connect to server.exp.simutrans.com with a Windows client and not desync in less than a minute. If Ves or Vladki are able to be more specific about when they or either of them were last able to do so, I should be very grateful.

DrSuperGood

Check for uninitialized variables. Windows builds generally give them different values from Linux builds.
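
A minimal sketch of the kind of fix this implies, using a hypothetical state structure (not a real Simutrans type): default member initialisers give every field a defined starting value, and the hash reads the fields one by one so that indeterminate padding bytes never enter the result.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical piece of game state. Without the "= 0" / "= false"
// initialisers, a default-constructed object would hold indeterminate
// values that can differ between compilers and platforms.
struct vehicle_state_t {
    std::uint32_t speed = 0;
    std::uint32_t wait_ticks = 0;
    bool reversed = false;
};

// FNV-style mixing over the individual fields (not over raw memory, since
// padding bytes inside the struct are not guaranteed to have any value).
std::uint32_t state_hash(const vehicle_state_t& s)
{
    std::uint32_t h = 2166136261u;
    auto mix = [&h](std::uint32_t v) { h = (h ^ v) * 16777619u; };
    mix(s.speed);
    mix(s.wait_ticks);
    mix(s.reversed ? 1u : 0u);
    return h;
}

// Usage sketch: two freshly constructed states always hash identically.
void check_default_states_match()
{
    vehicle_state_t a;
    vehicle_state_t b;
    assert(state_hash(a) == state_hash(b));
}
```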

jamespetts

I did run Dr. Memory for this purpose, but it yielded no results other than memory leaks and potential memory leaks in code common to Experimental and Standard, except for the somewhat odd error reproduced in post no. 10 above.

jamespetts

Unfortunately, it seems as though there is a new desync bug. As some may know, these are extremely hard to fix, so any help from any source would be greatly appreciated, either from experienced coders who are able to offer advice, or regular players who can run some basic testing to save me a lot of time (in particular, it would be helpful to know from Vladki and/or Ves when they were last able to connect to the server at server.exp.simutrans.com from a Windows client without a quick desync, as this should help me very much to pin down when the problem arose).

This one is especially hard to fix, as it can only be reproduced (so far) when the server is running Linux and the client is running Windows. (I have no access to a Windows server, so it is hard to test the other way around, and I cannot seem to get it working on my home network to test between my Windows and Linux PCs).

I am about to join (or have joined, depending on when you are reading this message) the previous discussions about this from the thread relating to the Bridgewater-Brunel server being refreshed with the devel-new-2 build for testing, which is where this discussion started.

Anyone who is able to compile and run the code on both Windows and Linux can help me greatly by going back through the Github history and compiling the client and server from the same historical commits for the last few months (I suspect that this has arisen recently) and, on the same map, seeing whether the problem can be reproduced with any given commit. Any suggestions or ideas for helping to track this down would be much appreciated.

I am afraid that this might interfere with other development priorities until it is fixed, although I am minded to undertake some performance tuning work whilst trying to think of what to do in relation to this.

Vladki

I'm running purely on Linux. My previous desync problems were solved by reconfiguring my home Wi-Fi from 2.4 GHz to 5 GHz (to avoid interference from neighbours).

jamespetts

Thank you for letting me know: that is helpful. I know that I connected to your server (which I presume was running Linux at the time) a month or two ago with my Windows client and had few if any desyncs, but I cannot now recall when it was updated.

Can you perhaps assist by letting me know a list of dates since August on which you updated the server with the latest code from devel-new-2?

Ves

Quote
(in particular, it would be helpful to know from Vladki and/or Ves when they were last able to connect to the server at server.exp.simutrans.com from a Windows client without a quick desync, as this should help me very much to pin down when the problem arose)

I remember specifically that it worked flawlessly for me until the introduction of MSVS 2015. After that, I had trouble compiling the game and connected only occasionally to test. I seem to recall that it worked a number of commits later, as I reported bugs from the Swedish server for a time. The date of my latest bug report comment (and therefore the latest time I know I could connect without any memorable desyncs) is 14 October, in this thread: http://forum.simutrans.com/index.php?topic=15766.msg155207#msg155207

That would suggest that the problem is from commit e8d0c80db89a5e1cd373b87f6aa232ec3c93db2b onwards, which is still a considerable number of commits.

Edit:
If you want, I can compile a version of each commit from that one to the present and upload them to the server.exp.simutrans.com server for bug testing. However, I cannot provide any Linux server builds, so someone else would have to provide those.

jamespetts

Thank you - that is very helpful. Just to be sure, however, can you check whether you can connect to the Swedish server now (using the same map) without desyncing? This problem may occur only in certain situations and therefore on certain maps. I suggest that you wait for 2 minutes without doing anything to see whether it desyncs.

Thank you very much for your help.

Ves

I can connect to the Swedish server, but I get a desync after about 5:30 minutes without doing anything. This is using 7f94e9bd831807c52d807f66ab5979189c7ac84e, which the server is also running.

I am now compiling the newest executable version from GitHub (15851bfba5e7977f67f4ea4edd9590b3f0ace236). Testing the server now...

Edit 1: It runs on the server at least, passing 1:30 without crashing at the moment...

Edit 2: Nope, at 5:20 minutes it desyncs. This is now the third desync I have had at around 5:30 minutes (the one before was also just short of 5:30).
Incidentally, did you see my edit in my previous post?

Checking again to see whether it happens at the same time again...

jamespetts

Thank you very much for testing. Do you think that you might be able to test with 1fb72c008c82e5c75bff1430531c1d6cdef3c038 on both client and server, which is the version immediately before the 14th of October?

Ves

I'm editing my posts too much :P

I cannot compile servers, unfortunately.

Read my edits one and two posts up for some details.

I'm currently running the latest test again to check the time. I can try connecting with the 14 October version afterwards.

Edit: Now I got a desync after 3:38 minutes with 15851bfba5e7977f67f4ea4edd9590b3f0ace236.

Edit 2: I cannot connect with 1fb72c008c82e5c75bff1430531c1d6cdef3c038 currently, and I can't compile the server. It is Vladki who does that.

jamespetts

Please note that the test is only valid if client and server are using exactly the same versions; Vladki, are you able to assist with this?


Vladki

I'm reading the forum from my phone, so I cannot recompile now. I usually recompile whenever I have a free evening. Unfortunately, I do not have a log of server updates. Btw, do you know the command line arguments for git to pull a specific commit?

jamespetts


Ves

I usually use "git checkout" followed by the commit hash.

I will start compiling executables later today!

Vladki

I have tried:

git checkout 15851bfba5e7977f67f4ea4edd9590b3f0ace236
make clean
make

And then restarted the server with new binary, but got the following error:

Reading menu configuration ...
Warning: tool_t::read_menu():   toolbar[11][5]: replaced way-builder(id=14) with default param=cityroad by cityroad builder(id=36)
Midi disabled ...
Calculating textures ...done
Message: karte_t::load():       Prepare for loading
World destroyed.
Warning: karte_t::load: Fileversion: 120008
Message: nwc_auth_player_t::init_player_lock_server:    new = 32767
*** stack smashing detected ***: /home/vladki/simutrans/simutrans-experimental terminated
Aborted (core dumped)

I get the same error with both the British and Swedish paksets, so I restarted them with the previous server binary (7f94e9bd831807c52d807f66ab5979189c7ac84e).

jamespetts

15851bfba5e7977f67f4ea4edd9590b3f0ace236 is the latest commit - you get "stack smashing detected" with that? That is very odd, since I cannot reproduce this. Do you get this both on the client and server? Would you be able to try the immediately previous commit (6256ca284e2f6752676d500404dbbef8fa68d2cd) and see whether this also causes the "stack smashing detected" error (which I understand to be caused by a buffer overflow on the stack, which is treated in this way because such overflows can be a security vulnerability)?

Incidentally, are you also able to connect with a Linux client to the bridgewater-brunel.me.uk server and see whether you are able to run the game without desyncing? I have just tried it with my Linux client, and was able to stay connected for a considerable period of time with both client and server running the latest commit (6256ca284e2f6752676d500404dbbef8fa68d2cd).

Finally, are you able to set up your server to run a rather earlier commit, 1fb72c008c82e5c75bff1430531c1d6cdef3c038, so that I and/or Ves can then connect to it with a Windows client and see whether this will stay in sync in circumstances (i.e. with a saved game) where it would not on a later commit?

Thank you very much for your help; these are fantastically challenging problems to track down, alas, and are also critical and game-breaking in severity, so all possible help is very much appreciated indeed.

Vladki

Sorry, I copied and pasted from the wrong place. The stack smashing happens with:
$ git status
HEAD detached at 1fb72c0
nothing to commit, working directory clean

I tried only on server, I'm not at home at the moment so cannot try client.

jamespetts

Ahh, that makes more sense. Can you try instead the immediately previous commit in that case, SHA 1cb53908a88570042717df64be86828fe917c8d6? Thank you very much.

Ves

The first batch of executables is now on server.exp.simutrans.com, in the devel-new section.

They are named by date and the first 7 characters of the commit hash.
The period covered is 14-20 October.

There is also the newest build that I have compiled so far (161024_15851bf).

jamespetts

Splendid, that is very helpful. Are these in both graphical and command line versions, and are they in 64- and/or 32-bit?

Ves


DrSuperGood

It might be worth pointing out that building with GCC targeting Windows is different from building with MSVC. It is entirely possible that a GCC Windows client might sync with a GCC Linux server whereas an MSVC client will not.