News:

Want to praise Simutrans?
Your feedback is important for us ;D.

Desync issue (devel-new-2) with Linux Server/Windows client

Started by Ves, October 22, 2016, 09:03:44 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.


jamespetts

Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix


jamespetts

Thank you - that is very helpful. It is very odd that you get a desync with the Bridgewater-Brunel server but not locally - there may be some anomaly on the server.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

Of course, latency is almost zero on a local server. Maybe it is a timing issue?


By the way, I also have a log-file with full debug from a session where I got the desync, but it is to large to upload.

The following looks slightly suspicious

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: route_t::intern_calc_route():    Problem with heuristic:  from 1009,1999,6 to (1007,1999,6) at 1008,1999, best = 20, cost = 10, heur = 110, dist = 1, turns = 99

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal():        could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com

Similar errors as the first one are repeated numerous times. The second one is actually the last error before the world was destroyed.

I just checked. I get similar errors as the first one also on the local server (with -debug 5).

Felix

After several minutes an connecting a second client, I also got a desync locally. Sadly on the client running without "-debug 5 -log" :-/

jamespetts

Desync errors are rarely caused by latency unless it is extreme. They are normally caused by the game state diverging on the client and the server. What happens is that the client and server compare their random number seed every so often. If these are different, the client will disconnect. They will become different if there is almost any deviation in the execution of game code between the client and server, but the difference between them will not be detected until an indeterminate time later. This means that errors of this sort are extremely difficult to resolve. They become orders of magnitude harder again to resolve when they occur only in a setting (a server running one operating system and a client running another) where it is very difficult to reproduce the problem reliably.

Edit: Thank you for checking that. Did you also get a local desync after several minutes when you last tested this?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

For the desnc, I might not have tested long enough, last time.

jamespetts

Can you check with this saved game? It is the same as should be on the server, but the one on the server may not be identical. This is the one that I use for testing including very long-term (i.e. hours long, sometimes overnight) network synchronisation testing.

If this still fails, would you be able to re-test with this saved game and the older version that you used before to see if we can narrow down what is causing the desync that you are getting with a Linux server and a Linux client? The latter issue seems new.

In the meantime, I am re-testing with this saved game locally on Windows over a longer-term run to see whether it remains as stable as it was when I last tested it.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

Ok, will test with the local server again. Might take a while to get the desync, if I get it at all.

Felix

Interesting, now I can reproduce the behavior locally. I immediately get a desync with that game. Will try with logging, now.

Felix

Maybe the following is helpful:

Message: wayobj_t::fill_menu(): try to add TramElectrification(0x7d7ab10)
Message: wayobj_t::fill_menu(): try to add DCThirdRail(0x9431a60)
Message: wayobj_t::fill_menu(): try to add DCRailCatenary(0xa8ff0a0)
Message: wayobj_t::fill_menu(): try to add DCFourthRail(0x623eb90)
Message: hausbauer_t::fill_menu(): maximum 139
Message: hausbauer_t::fill_menu(): maximum 139
Message: hausbauer_t::fill_menu(): maximum 139
Message: hausbauer_t::fill_menu(): maximum 139
Message: toolbar_t::update(): update toolbar SPECIALTOOLS
Message: toolbar_t::update(): update toolbar EDITTOOLS
Message: toolbar_t::update(): update toolbar LISTTOOLS
Message: network_command_t::rdwr: read packet_id=7, client_id=0
Warning: network_check_activity(): received cmd id=7 nwc_ready_t from socket[10]
Warning: nwc_ready_t::execute: set sync_step=952 where map_counter=96770
Warning: karte_t::network_game_set_pause: steps=119 sync_steps=952 pause=0
Message: karte_t::reset_timer(): called, mode=$4
Message: network_command_t::rdwr: read packet_id=12, client_id=0
Warning: network_check_activity(): received cmd id=12 nwc_auth_player_t from socket[10]
Message: nwc_auth_player_t::execute: plnr = 255  unlock = 32767  our_client_id = 0
Message: network_command_t::rdwr: read packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr: rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Welcome, Client#2! init=1 flags=0
Warning: network_check_activity(): received cmd id=8 nwc_tool_t from socket[10]
Message: network_world_command_t::execute: do_command 8 at sync_step 953 world now at 952
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 132
Message: network_world_command_t::execute: do_command 9 at sync_step 961 world now at 952
Message: network_command_t::rdwr: read packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr: rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Now 1 clients connected. init=1 flags=0
Warning: network_check_activity(): received cmd id=8 nwc_tool_t from socket[10]
Message: network_world_command_t::execute: do_command 8 at sync_step 961 world now at 953
Message: nwc_tool_t::do_command: steps 953 tool 8224 init
Message: nwc_tool_t::do_command: id=32 init=1 defpar=32768,Welcome, Client#2! flag=0
Message: message_t::add_msg():                       Welcome, Client#2! (at -1,-1)
Message: message_t::add_msg(): New world record for
railways:
13.0 km/h
by (1376) UERL 1906 Tube Stock. (at 1269,2189)
Message: haltestelle_t::reserve_position(): failed for gr=1534,1672, cnv=132
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1227,2181,5 to (1235,2170,5) at 1235,2170, best = 229, cost = 229, heur = 2290, dist = 0, turns = 2061


and slightly later


For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1262,2211,7 to (1263,2208,7) at 1262,2210, best = 71, cost = 10, heur = 150, dist = 2, turns = 138

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command: sync_step=960  server=[ss=960 st=120 nfc=0 rand=4206555837 halt=1 line=1 cnvy=1025 ssr=4206555837,4206555837,0,0,0,0,0,0 str=4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,4206555837,1235697,105474,4206555837 exr=0,0,0,0,0,0,0,0  client=[ss=960 st=120 nfc=0 rand=2648001267 halt=1 line=1 cnvy=1025 ssr=2193165484,2193165484,0,0,0,0,0,0 str=2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2193165484,2648001267,2648001267,1235697,105474,2193165484 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...
Show banner ...
Message: karte_t::reset_timer(): called, mode=$0
Message: message_t::add_msg(): New world record for
road vehicles:
7.0 km/h
by (558) Pair of horses. (at 1277,2197)
Message: haltestelle_t::reserve_position(): failed for gr=741,2278, cnv=170
Message: haltestelle_t::reserve_position(): failed for gr=1534,1672, cnv=132
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=123
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1230,2217,5 to (1239,2217,5) at 1239,2217, best = 90, cost = 90, heur = 900, dist = 0, turns = 810

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: route_t::intern_calc_route(): Problem with heuristic:  from 1230,2217,5 to (1239,2217,5) at 1238,2217, best = 90, cost = 80, heur = 810, dist = 1, turns = 729

jamespetts

That is very helpful. Do you think that you could try with older versions to see whether the problem exists with older versions also or whether it was introduced recently, and, if so, when?

Unfortunately, it seems that there is now a desync even between Windows clients, at least with release builds, after a few minutes. I will test again with debug builds to see whether the problem is confined to release builds (which will give a clue, if a vague one, about the nature of the problem) or whether it is also present with debug builds.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

Were there some more significant changes at some point? Might be a good idea to try something older than that.

jamespetts

The trouble is that there have been a great many changes, and it is hard to tell what is significant and what is not. I should note that I have, since I posted the last message, been running locally with a debug build without desync. I will continue to run it overnight: if this is stable, the issue is something to do with the release build, possibly some undefined behaviour that happens to be deterministic in a non-optimised build.

Unfortunately, this sort of error is almost impossibly hard to find, especially since, for reasons that I cannot fathom, running Dr. Memory causes the game to crash in a way that is very hard in itself to diagnose and does not occur when the game is not run with Dr. Memory (which in any event only works properly in debug mode without optimisations).
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

I did only build with debugging enabled.

It seems that I get the desync even with pretty old versions (October/November). With even older versions, I have trouble to even load the savegame. Perhaps, it would also make sense to double check my configuration to rule out possible errors in that area.

Anyway, I will go to bed, now, and will continue tomorrow. We are one hour ahead ;-)

jamespetts

Having run an overnight test on the Windows debug build (having also gone to bed after my last post), it seems that even this now will not stay in sync, although it is not clear how long that it took to fail beyond that it did so sometime overnight.

In around October to December, I had spend an enormous amount of time implementing multi-threading, and, in particular, making it work when connected over a network (testing with the loopback interface). After a huge amount of testing, I thought that I had resolved such problems as there were in around December.

It might help, therefore, if you could run further tests with MULTI_THREAD=1 disabled. If multi-threading turns out to be the problem, I shall have to add preprocessor directives disabling different parts of the multi-threaded code so that those different parts can be disabled one by one to test each in isolation.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

With the latest version compiled without the MULTI_THREAD flag, the server crashes as soon as I try to connect with a client. Again with the britain-3.sve savegame.


*** Error in `/home/felix/simutrans/simutrans/simutrans-experimental': free(): invalid pointer: 0x0000000012d72690 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x6f263)[0x7ffff6b21263]
/lib64/libc.so.6(+0x748d6)[0x7ffff6b268d6]
/lib64/libc.so.6(+0x750de)[0x7ffff6b270de]
/home/felix/simutrans/simutrans/simutrans-experimental[0x78ea7c]
/home/felix/simutrans/simutrans/simutrans-experimental[0x78a4a5]
/home/felix/simutrans/simutrans/simutrans-experimental[0x5e52c8]
/home/felix/simutrans/simutrans/simutrans-experimental[0x79283d]
/home/felix/simutrans/simutrans/simutrans-experimental[0x79225b]
/home/felix/simutrans/simutrans/simutrans-experimental[0x792eb2]
/home/felix/simutrans/simutrans/simutrans-experimental[0x72d087]
/home/felix/simutrans/simutrans/simutrans-experimental[0x73fee8]
/home/felix/simutrans/simutrans/simutrans-experimental[0x7fa6e7]
/lib64/libc.so.6(__libc_start_main+0xf0)[0x7ffff6ad2790]
/home/felix/simutrans/simutrans/simutrans-experimental[0x405a69]



Program received signal SIGABRT, Aborted.
0x00007ffff6ae5107 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6ae5107 in raise () from /lib64/libc.so.6
#1  0x00007ffff6ae655a in abort () from /lib64/libc.so.6
#2  0x00007ffff6b21268 in ?? () from /lib64/libc.so.6
#3  0x00007ffff6b268d6 in ?? () from /lib64/libc.so.6
#4  0x00007ffff6b270de in ?? () from /lib64/libc.so.6
#5  0x000000000078ea7c in karte_t::load (this=0xf767cf0, file=0x7fffffff97a0) at simworld.cc:8795
#6  0x000000000078a4a5 in karte_t::load (this=0xf767cf0, filename=0x7fffffff9e20 "server13353-network.sve")
    at simworld.cc:7865
#7  0x00000000005e52c8 in nwc_sync_t::do_command (this=0x25a8a4e0, welt=0xf767cf0) at network/network_cmd_ingame.cc:754
#8  0x000000000079283d in karte_t::do_network_world_command (this=0xf767cf0, nwc=0x25a8a4e0) at simworld.cc:9704
#9  0x000000000079225b in karte_t::process_network_commands (this=0xf767cf0, ms_difference=0x7fffffffb068)
    at simworld.cc:9649
#10 0x0000000000792eb2 in karte_t::interactive (this=0xf767cf0, quit_month=2147483647) at simworld.cc:9810
#11 0x000000000072d087 in simu_main (argc=4, argv=0x7fffffffd988) at simmain.cc:1369
#12 0x000000000073fee8 in sysmain (argc=4, argv=0x7fffffffd988) at simsys.cc:805
#13 0x00000000007fa6e7 in main (argc=4, argv=0x7fffffffd988) at simsys_s2.cc:798

jamespetts

Is this a local server or the Bridgewater-Brunel server? Connecting a multi-thread build to a single thread build will not work properly given the way in which multi-threading has been implemented.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

It is a local server and actually the server is crashing, not the client.

jamespetts

That is very odd. This backtrace suggests that the crash actually occurs in some pre-compiled library for which it cannot load symbols. It is extremely odd that it is crashing in this way. Are you able to test to see the version in which this crash was first introduced? The Bridgewater-Brunel server ran single threadedly in the past, so this is not a long-term bug.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

The crash seems to be due to trying to free some invalid memory. Likely a double free or a similar issue. It is more obvious in the game's error message.

jamespetts

The trouble is that the attempt to free invalid memory appears to be made by code that is not actually part of Simutrans-Experimental at all, but in an external library, which makes debugging extraordinarily difficult, as it is not even possible to see what is happening that might have caused this.

Are you able to test to see the version in which this error was first introduced?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.


Felix

In 7cb45bd without multithreading, I don't get the server crash, but still get the desync. Still need to pinpoint the exact commit that introduced the problematic change. The crash seems to be related to the delete[] in the following code in simworld.cc (line numbers change with revisions, its somewhere between line 8000 and 9000)

#ifdef MULTI_THREAD
destroy_threads();
init_threads();
#else
delete[] transferring_cargoes;
transferring_cargoes = new vector_tpl<transferring_cargo_t>[1];
#endif


I will need to first do some groceries shopping, now.

jamespetts

#165
Thank you very much for testing - that is most helpful. I will have a look at that transferring cargoes code when I have finished trying to make cross-compiling work.

Edit: I have pushed some fixes for possible double free errors associated with this code. Are you able to test whether this assists?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.


Felix

I am sorry, the issue seems to remain.  :-(   ... Wait, this time it is the client that crashes. Something new.


Message: haltestelle_t::reserve_position():     failed for gr=24,1784, cnv=659
Message: haltestelle_t::reserve_position():     failed for gr=1526,2901, cnv=123
Message: network_command_t::rdwr:       read packet_id=9, client_id=0
Warning: network_check_activity():      received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK:     time difference to server 1452
Message: network_world_command_t::execute:      do_command 9 at sync_step 6153 world now at 6104
Warning: karte_t:::do_network_world_command:    skipping command due to checklist mismatch : sync_step=6104 server=[ss=6104 st=763 nfc=0 rand=1333123480 halt=1 line=1 cnvy=1025 ssr=4101835982,4101835982,0,0,0,0,0,0 str=4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,1333123480,1333123480,1333123480,1333123480,1333123480,1333123480,1235659,105478,1333123480 exr=0,0,0,0,0,0,0,0  executor=[ss=6104 st=763 nfc=0 rand=2142232825 halt=1 line=1 cnvy=1025 ssr=4101835982,4101835982,0,0,0,0,0,0 str=4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,4101835982,1333123480,1333123480,1333123480,1333123480,2142232825,2142232825,1235659,105478,1333123480 exr=0,0,0,0,0,0,0,0 
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:      all static variables are reset
Message: karte_t::reset_timer():        called, mode=$0
*** Error in `/home/felix/simutrans/simutrans/simutrans-experimental': double free or corruption (!prev): 0x0000000013efd040 ***


Trying again in gdb, I get the following:

Message: haltestelle_t::reserve_position():     failed for gr=1526,2901, cnv=123
Warning: karte_t:::do_network_world_command:    skipping command due to checklist mismatch : sync_step=13272 server=[ss=13272 st=1659 nfc=0 rand=1494109945 halt=1 line=1 cnvy=1025 ssr=1526326882,1526326882,0,0,0,0,0,0 str=1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1494109945,1494109945,1494109945,1494109945,1494109945,1494109945,1235681,105479,1494109945 exr=0,0,0,0,0,0,0,0  executor=[ss=13272 st=1659 nfc=0 rand=1188062987 halt=1 line=1 cnvy=1025 ssr=1526326882,1526326882,0,0,0,0,0,0 str=1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1526326882,1494109945,1494109945,1494109945,1494109945,1188062987,1188062987,1235681,105479,1494109945 exr=0,0,0,0,0,0,0,0 
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:      all static variables are reset
Message: karte_t::reset_timer():        called, mode=$0

Program received signal SIGSEGV, Segmentation fault.
0x0000000000004061 in ?? ()
(gdb) bt
#0  0x0000000000004061 in ?? ()
#1  0x00000000007922ea in karte_t::process_network_commands (this=0x13616bb0, ms_difference=0x7fffffffb038)
    at simworld.cc:9654
#2  0x0000000000792f1a in karte_t::interactive (this=0x13616bb0, quit_month=2147483647) at simworld.cc:9814
#3  0x000000000072d0dd in simu_main (argc=8, argv=0x7fffffffd958) at simmain.cc:1369
#4  0x000000000073ff3e in sysmain (argc=8, argv=0x7fffffffd958) at simsys.cc:805
#5  0x00000000007fa74f in main (argc=8, argv=0x7fffffffd958) at simsys_s2.cc:798

Felix

By the way, I still would like to ensure that my configuration (simuconf.tab) is not causing the problems. Are the separately downloaded experimental config files still needed?

Vladki

I have updated all servers runnig on server.exp.simutrans.com to f5e72b5c6a41f6ca064bb1d46de8e1cc29d37d16

I have no problems with those. I still get desync with bridgewater-brunel.me.uk.
Just to make it clear. British pakset on my servers differs a little bit from git:

diff --git a/bus/routemaster.dat b/bus/routemaster.dat
index 1535554..b20d268 100644
--- a/bus/routemaster.dat
+++ b/bus/routemaster.dat
@@ -39,7 +39,7 @@ EmptyImage[N]=./images/routemaster.0.6
EmptyImage[NE]=./images/routemaster.0.7
---
# For TESTing of rescaled vehicles only - delete when testing complete.
-#obj=vehicle
+obj=vehicle
name=Routemaster-rescaled
copyright=JamesPetts&JamesHood
waytype=road
---

I'm quite convinced that this is the root cause of desync problems. If you comment only the obj=vehicle line, then makeobj creates an invalid pak file ".Routemaster-rescaled.pak" instead of nothing as you might expect. So the trick is either to uncomment the obj= line, or comment all lines that belong to the definition of the object

jamespetts

Thank you all for your testing: that is most helpful.

I have just tried unsuccessfully to reproduce this on Windows: a single threaded client can connect to a single threaded server, both compiled by Visual Studio (as debug builds), without either crashing (or indeed desynchronising, in the short-term at least: I have not yet run a long-term test).

This is all rather odd: it does potentially suggest a memory management issue somewhere, but Dr. Memory (when I was able to get it working without crashing Simutrans-Experimental by loading a map with no factories) found nothing but minor memory leaks in code shared with Standard and that have been there for a long time.

The backtrace on this occasion suggests an error arising ultimately from process_network_commands, code that is common between Experimental and Standard, which is a crash that is known to occur simultaneously to a desync in some cases for reasons that have never yet been ascertained.

It is still not clear why a single threaded build in Linux should desync/crash in this way, however, when it does not do so in Windows. Are you able to run Valgrind to see whether there are any detectable memory management issues?

Vladki - I am afraid that the problem cannot be the routemaster-rescaled object, as my own local tests, where I have now been able to reproduce a desync between client and server running the same build and pakset (two copies of the identical executable and pakset folder on the same computer) were performed using the road-vehicle-rescaling branch, which does not have the routemaster-rescaled object at all (the default Routemaster being rescaled instead).

If you really think that the routemaster-rescaled object is responsible for at least some of the desyncs, however, you might want to try tests with both client and server using the road-vehicle-rescaling branch instead of the half-heights branch to see whether that makes any difference.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

I can confirm, that building the pakset without the offending routemaster still does not solve the desync issue for me.

jamespetts

Thank you for testing that: that is helpful. Are you able to test single threaded builds of older commits to see whether the desync and/or crash issues occur there?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

I still would like to ensure, that this is not a configuration issue.

jamespetts

Quote from: Felix on January 14, 2017, 08:26:55 PM
I still would like to ensure, that this is not a configuration issue.

What sort of configuration issue had you in mind?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.