News:

Simutrans.com Portal
Our Simutrans site. You can find everything about Simutrans from here.

Desync issue (devel-new-2) with Linux Server/Windows client

Started by Ves, October 22, 2016, 09:03:44 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Felix

I only activate the optimization in the config file and removed the debug flag. I did not manually change the CFLAGs. The result is still the same. The log clearly shows that the first check already fails:


Message: karte_t::load(): Prepare for loading
World destroyed.
Warning: karte_t::load: Fileversion: 120003
Warning: convoi_t::finish_rd(): convoi ((681) BR Class 50) is broken => realign
Warning: convoi_t::finish_rd(): convoi ((889) BR-307(front)-AC) is broken => realign
Warning: rail_vehicle_t::set_convoi(): convoi 996 had a too high route index! (25 of max 22)
Warning: convoi_t::finish_rd(): convoi ((1032) BR-501(DMB)-3rdRail) is broken => realign
Warning: rail_vehicle_t::set_convoi(): convoi 1434 had a too high route index! (15 of max 10)
Warning: rail_vehicle_t::set_convoi(): convoi 1412 had a too high route index! (17 of max 13)
Warning: rail_vehicle_t::set_convoi(): convoi 1128 had a too high route index! (40 of max 21)
Warning: rail_vehicle_t::set_convoi(): convoi 1654 had a too high route index! (20 of max 9)
Warning: convoi_t::finish_rd(): convoi ((1724) LUL C Stock) is broken => realign
Warning: rail_vehicle_t::set_convoi(): convoi 1751 had a too high route index! (18 of max 17)
Warning: rail_vehicle_t::set_convoi(): convoi 1502 had a too high route index! (16 of max 12)
Warning: convoi_t::finish_rd(): convoi ((2017) BR Class 86) is broken => realign
Warning: convoi_t::finish_rd(): convoi ((1190) BR Class 33) is broken => realign
Warning: karte_t::load(): loaded savegame from 9/1971, next month=-1719664640, ticks=-1721195101 (per month=1<<22)
Message: network_command_t::rdwr: read packet_id=7, client_id=0
Warning: network_check_activity(): received cmd id=7 nwc_ready_t from socket[10]
Warning: nwc_ready_t::execute: set sync_step=1951 where map_counter=5031530
Warning: karte_t::network_game_set_pause: steps=243 sync_steps=1951 pause=0
Message: network_command_t::rdwr: read packet_id=12, client_id=0
Warning: network_check_activity(): received cmd id=12 nwc_auth_player_t from socket[10]
Message: nwc_auth_player_t::execute: plnr = 255  unlock = 32767  our_client_id = 0
Message: network_command_t::rdwr: read packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr: rdwr id=8 client=0 plnr=255 pos=koord3d invalid tool_id=8224 defpar=32768,Welcome, Felix! init=1 flags=0
Warning: network_check_activity(): received cmd id=8 nwc_tool_t from socket[10]
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server -198
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: karte_t:::do_network_world_command: sync_step=1952  server=[ss=1952 st=244 nfc=0 rand=217478360 halt=1025 line=1 cnvy=1025 ssr=2939411524,64761495,0,0,0,0,0,0 str=64761495,64761495,64761495,64761495,64761495,4049239263,4049239263,4049239263,4049239263,4049239263,4049239263,217478360,217478360,3415153,149758,4049239263 exr=0,0,0,0,0,0,0,0  client=[ss=1952 st=244 nfc=0 rand=217478360 halt=1025 line=1 cnvy=1025 ssr=2939411524,64761495,0,0,0,0,0,0 str=64761495,64761495,64761495,64761495,64761495,4049239263,4049239263,4049239263,4049239263,4049239263,4049239263,217478360,217478360,3415153,149758,4049239263 exr=0,0,0,0,0,0,0,0 
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Warning: sint64 convoi_t::calc_revenue: Average speed (164) for (1902) BR-307(front)-AC exceeded maximum speed (110); falling back to overall average


Edit 1 Interestingly, we actually start out with a negative time difference, now! Which seems to mean the client is ahead.

Edit 2 Please note, I remove all warnings regarding the heuristic error.

jamespetts

I was not expecting the optimisation to affect the desyncs, but rather the timing issue - but of course, I was slightly confused, because the timing issue you were having when you were running on the loopback interface, not the server, so they would both have been debug builds, so optimisation should not make a difference there, either.

Let me see whether I can increase the frequency of checks on the server so that we have a better idea of what is happening with these early desyncs.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

I will try with the debug build again. Maybe the different behavior is caused by something else.


Edit 1 This is definitely strange. A debug build survives a little longer, more than one check, and I see the increasing time difference, again.

jamespetts

I am just restarting the server now with server_frames_between_checks set to 1 rather than 32 (and with the latest commit from devel-new-2) to see whether this gives more precise diagnostics.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

I get a lot more checks before the server desyncs ;-) I am currently trying to find the description of the data displayed by the check. It was somewhere within this thread. But for tonight, I will also need to quit.

Just in case it helps, the last couple lines from the latest log:

Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 100 world now at 94
Warning: karte_t:::do_network_world_command: sync_step=93  server=[ss=93 st=11 nfc=5 rand=187636870 halt=1025 line=1 cnvy=1025 ssr=2187775812,187636870,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0  client=[ss=93 st=11 nfc=5 rand=187636870 halt=1025 line=1 cnvy=1025 ssr=2187775812,187636870,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 101 world now at 95
Warning: karte_t:::do_network_world_command: sync_step=94  server=[ss=94 st=11 nfc=6 rand=890309775 halt=1025 line=1 cnvy=1025 ssr=187636870,890309775,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0  client=[ss=94 st=11 nfc=6 rand=890309775 halt=1025 line=1 cnvy=1025 ssr=187636870,890309775,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0 
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: route_t::calc_route(): No route from 1413,2338 to 1414,2342 found
Message: air_vehicle_t::find_route_to_stop_position(): no free position found!
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
ERROR: rail_vehicle_t::activate_choose_signal(): could not reserved route after find_route!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: haltestelle_t::reserve_position(): failed for gr=1528,1702, cnv=612
Message: haltestelle_t::reserve_position(): failed for gr=3,1913, cnv=212
Message: haltestelle_t::reserve_position(): failed for gr=1526,2901, cnv=9
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 102 world now at 96
Warning: karte_t:::do_network_world_command: sync_step=95  server=[ss=95 st=11 nfc=7 rand=710456930 halt=1025 line=1 cnvy=1025 ssr=890309775,710456930,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0  client=[ss=95 st=11 nfc=7 rand=710456930 halt=1025 line=1 cnvy=1025 ssr=890309775,710456930,0,0,0,0,0,0 str=1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,1260555990,37566683,1647338,1260555990 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 103 world now at 97
Warning: karte_t:::do_network_world_command: sync_step=96  server=[ss=96 st=12 nfc=0 rand=3405237637 halt=1025 line=1 cnvy=1025 ssr=710456930,1023255373,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0  client=[ss=96 st=12 nfc=0 rand=3405237637 halt=1025 line=1 cnvy=1025 ssr=710456930,1023255373,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[10]
Warning: NWC_CHECK: time difference to server 66
Message: network_world_command_t::execute: do_command 9 at sync_step 104 world now at 98
Warning: karte_t:::do_network_world_command: sync_step=97  server=[ss=97 st=12 nfc=1 rand=3274686897 halt=1025 line=1 cnvy=1025 ssr=3405237637,3274686897,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0  client=[ss=97 st=12 nfc=1 rand=2249708390 halt=1025 line=1 cnvy=1025 ssr=3405237637,2249708390,0,0,0,0,0,0 str=1023255373,1023255373,1023255373,1023255373,1023255373,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,3405237637,40981836,1797096,3405237637 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Message: karte_t::reset_timer(): called, mode=$0
World finished ...

jamespetts

Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

I honestly did not expect it to be to helpful, either.

jamespetts

Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

I have lost track of what you have tried or not. But If you want to test I have three servers running on server.exp.simutrans.com, all running commit f49e8a11d3694003ca199e439e920ac7bebdc694.

port:13353 - small map, pak128.Britain - half-heights branch http://server.exp.simutrans.com/pak128.Britain-Ex.zip
port: 13354 - copy of bridgewater brunel, pak128.Britain - rescaled-road-vehicles http://server.exp.simutrans.com/pak128.Brunel-Ex.zip
port: 13355 - small map, pak128.Sweden http://server.exp.simutrans.com/pak128.Sweden-Ex.zip

I have played with all of them for quite some time, and got only one desync (on each), which I think was related to some attempt to mess with vehicle schedules. Could you try with windows clients?

Ves

Connecting with windows client, I get desyncs on both of the british paksets. The big map on port 13354 desyncs after only like 10-12 seconds, while the smaller one on 13353 desyncs after maybe half a minute. The swedish pakset, I can stay synced for ever it appears.

Could it be something of the following:

* Map size - The bigger a map is, the faster it desyncs. The swedish map being so small that it never reach whatever the treasholds are needed to create a desync?
Possible tests to perform:
- A big swedish map to see if that also desyncs
- A very small british map to see if it can stay in sync

* Pakset - Since, for me, it only is the brittish paksets that are desyncing, could it be that the brittish pakset either contains some objects, or settings in simuconf.tab or similar, that are obliged to desync?
Possible tests to perform:
- Restart a naked brittish map, populating it with a single (kind of) object at a time and see if it starts to desync


jamespetts

#255
Thank you both very much for your testing - this seems helpful. Small maps (especially maps that are not well developed) often fail to trigger desyncs as they often do not use the feature that causes it, or use it very rarely.

I should be most interested to see the result of these tests.

Edit: I have restarted the Bridgewater-Brunel server with the latest version, too.

Edit 2: I have been testing a little more with this to try to track down the rare desync associated with passenger generation. So far, I have tried fixing "next_step_passenger_this_thread" and "next_step_mail_this_thread" at 50 each to see whether desynchronisation between these values on client and server is the cause of the problem. Testing this on the britain-1971 map (one of Rollermaterial's uploads), the game desynchronised just before the end of October (the loading time being mid-October). Next, I will try disabling all mail generation to see whether the problem is specific to mail.

Edit 3: Trying again with the mail generation disabled resulted in a desync even sooner than before, although it is not clear whether this difference is other than random.

Edit 4: I have realised that I had not properly set up my preprocessor directives with the result that the first test was actually run with the code as normal, not with the fixed numbers for passenger generation per step. This first run, however, is still useful as a control to gauge the time within which a desync may be expected to occur with the ordinary code. I am about to try again with the fixed numbers enabled properly.

Edit 5: Testing again with this properly enabled, and the desync still occurs, which suggests that the problem is not anything to do with the step_passengers_and_mail_threaded method itself (which calls the generate_passengers_or_mail method and determines the number of times that it is called on each thread).

Edit 6: Testing with the following defined:


#define FORBID_SYNC_OBJECTS
#define FORBID_PRIVATE_CARS 
#define FORBID_PEDESTRIANS
#define FORBID_CONGESTION_EFFECTS
#define DISABLE_JOB_EFFECTS


still produces a desync.

Edit 7: Defining DISABLE_GLOBAL_WAITING_LIST did not prevent the desync.

Edit 8: Preventing step_passengers_and_mail_threaded from running multiple passenger generation threads in parallel (but allowing it to run concurrently with some small parts of karte_t::step) does prevent the desync from occurring - but this does not say much, as it only runs concurrently with some very minor pieces of code in karte_t::step updating statistics.

Edit 9: Defining FORBID_PUBLIC_TRANSPORT does appear to prevent the desyncs (on the britain-1971 map, it runs until well into November (and is still running now without a desync), whereas, with this not defined, it desyncs by the end of October. This suggests that the error is related specifically to an element of the code dealing in particular with public (i.e player) transport, and not pedestrians, private cars or passenger generation statistics.

Edit 10: Defining FORBID_RETURN_TRIPS does seem to prevent the desync, however (running now at nearly the end of November in the britain-1971 map with no desync). I have found a possible cause of this relating to the interaction between a mutex lock and a goto command, and will test to see whether this fixes the problem now.

Edit 11: The mutex error (a fix for which is now pushed) did not seem to prevent the desync. I will have to investigate further.

Edit 12: Although used only for returning passengers, disabling the city->set_generated_passengers() function does not affect the desync, which still occurs before the end of October.

Edit 13: Even disabling the code that starts the returning passengers on their way (without which they are simply discarded as unused local variables at the end of each passenger generation step) does not affect the desync.

Edit 14: Testing again with the returning passengers code disabled, this time for a number of hours (into December 1971 in game time) to make sure that the last test of this was a fluke, and still no desync in this condition. There is something odd about the situation with returning passengers that causes this desync that has yet eluded me.

Edit 15: Defining all of the following preprocessor directives (added to the code recently for the purposes of testing as to the cause of this issue) seems to prevent the desync:

FORBID_SWITCHING_TO_RETURN_ON_FOOT
FORBID_SET_GENERATED_PASSENGERS
FORBID_RECORDING_RETURN_FACTORY_PASSENGERS
FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS
FORBID_STARTE_MIT_ROUTE_FOR_RETURNING_PASSENGERS

even if FORBID_RETURN_TRIPS is not defined. This is the first time that it has been possible to run without a desync without FORBID_RETURN_TRIPS being defined; but this might just be because the combined effect of all of these preprocessor directives is that the returning passengers code has virtually no effect.

Edit 16: Undefining FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS and leaving all of the above defined allows it to run without a desync (well into November at least on the britain-1971 map used for testing).

Edit 17: Undefining FORBID_STARTE_MIT_ROUTE_FOR_RETURNING_PASSENGERS and FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS together allows the britain-1971 saved game to run well into November without a desync.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Felix

Good to see, that you seem to make some progress with analyzing the issue. Sadly, I do not currently find sufficient time to get more involved and start to dig into the code.

jamespetts

#257
Don't worry - such time as you are able to spare to do such things as you are able in that time to do is very much appreciated.

Edit 1: I had made an error with the FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS preprocessor directive, with the result that it did precisely the opposite of what was intended. I will re-test with this error corrected and see whether it works with only this defined (correctly). The latest test showed that there were no desyncs into December with FORBID_SET_GENERATED_PASSENGERS and FORBID_RECORDING_RETURN_FACTORY_PASSENGERS defined and none other (but the error being the equivalent of also having FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS defined).

Edit 2: With only FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS defined (after the fix is applied), the britain-1971 map reaches December without a desync.

Edit 3: As FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS prevented two separate parts of code from compiling, I broke that down into two segments, FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1 and FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_2. Further testing shows that only FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1 is the one that makes the decisive difference. That refers to the following code:


// Now try to add them to the target halt
                ware_t test_passengers;
                test_passengers.set_ziel(start_halts[best_bad_start_halt].halt);
#ifndef FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1
                const bool overcrowded_route = ret_halt->find_route(test_passengers) < UINT32_MAX_VALUE;
#else
                const bool overcrowded_route = false;
#endif


Edit 4: Only defining FORBID_FIND_ROUTE_FOR_RETURNING_PASSENGERS_1made a difference to whether the desync occurred: with it defined, it did not occur, and with it undefined, it did occur. Looking into the code more closely, this appeared to be in the code relating to overcrowding. It seems as though updating the stop's statistics can in some cases cause the stop to recalculate its status immediately, including whether it is overcrowded, which has the potential to cause a desync if it makes a difference as to whether a particular stop is recorded as being overcrowded in a different order on the client and server. I have now disabled the immediate updating of status (it is updated every step in any event), and this appears to have prevented the desync even with none of the testing preprocessor definitions defined.

However, in looking at this closely, I have found some logic bugs with the system for dealing with overcrowding of return journeys (which may also have made this part of the code slower than necessary). I have written what I hope will be a fix to this, but need now to run the desync test again to see whether it remains desync free with this hopefully fixed code.

Edit 5: I think that I have what is now a fix for this issue (just pushed to Github). After testing overnight with the corrected code as described above, the britain-1971 map gets to June 1972 without disconnecting.

That appears to deal with the rare desync that occurred even between client and server of the same build: this does not address the very different, nearly instant desync that occurs for mysterious reasons between client and server of different builds.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ves

James, you are telling us if we need to do something, right? :)

Should we connect with thit or that (which is basically the only thing at least I can do), just give a hint about it!

Otherwise, fun to read your comments!  ;D

jamespetts

Yes - nothing for you to do at this stage:all that I have done is fix a desync that would occur occasionally.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have managed to get my Linux computer working again, so I have been able to test connecting to the servers using the Linux machine.

Connecting to bridgewater-brunel.me.uk with the Linux machine still results in a desync after only a few seconds, even from the Linux machine. This is different to previous behaviour, and I do not understand what has caused this difference. I tried deleting the .Routemaster-rescaled.pak from both client and server and ran the test again, with the identical result.

Connecting to server.exp.simutrans.com with the Linux machine allows it to stay connected for some time (it is still connected now). On Windows, it desyncs after a few seconds, although it did crash when first I tried to connect.

It is a shame that the Bridgewater-Brunel mirror server has been taken down just when I am able to make use of testing that again.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

The copy of Bridgewater game must have crashed then...

Sent from my ONEPLUS A3003 using Tapatalk


Vladki

server.exp.simutrans.com is updated to commit (37569a4117e2580003e3450a11605369d5f77643) and all 3 instances running again

jamespetts

#263
Splendid, thank you for that.

Testing using my Linux NUC, the britain-1971 saved game running on the loopback interface remained connected overnight, so there is nothing inherent about running that saved game on Linux that causes a desync. There must be something odd about the Bridgewater-Brunel server. I will try again connecting to Vladki's server with the Linux machine and see whether that makes any difference.

Edit: Attempting this causes a desync within about 10 seconds, albeit this is the britain-3 (circa 1911) rather than the britain-1971 map. I will re-test on the loopback with this map.

Edit 2: Testing over several hours, this saved game also runs stably without a desync on the loopback interface in Linux.

Edit 3: Running a server on Windows and a client on Linux over my local network, the britain-1971 game causes a desync within less than a minute of connecting (but not as fast as connecting to the Bridgewater-Brunel server).

Edit 4: Some very interesting results: running with a MinGW built Windows client (not the cross-compile at present, which is still not working, but a MinGW executable built on Windows), I am able to connect to a Linux server running the britain-1971 game running on my local network for hours at a time without desync (it has not desynced once).

I am still getting crashes sometimes with the MinGW build that I am not getting with the Visual Studio build that seem mainly to be related to loading games, and which I cannot debug because the debug symbols are not compatible with Visual Studio and, for some reason that I cannot fathom, MinGW's version of GDB also seems to crash when the game crashes making it impossible to perform a backtrace: all that I know so far is that there is a segmentation fault/access violation caused by attempting to read data from a null pointer somewhere.

However, trying to connect to server.exp.simutrans.com (the Bridgewater-Brunel copy) with this executable results in a desync, as does attempting to connect to the Bridgewater-Brunel server itself.

Edit 5: I have now managed to set up the Bridgewater-Brunel server to use the correct nightly build, which carries the same revision number as the MinGW build that I am using (except that, for some reason, the MinGW build uses two more characters in its shortened version, so the in-game server selecting interface thinks that they are not the same), which is a controlled test to make sure that the versions used on both are identical. Connecting to the Bridgewater-Brunel server with either MinGW Windows or Linux clients will result in a desync very quickly. Both MinGW Windows and Linux client can connect to one another without difficulty, as noted in edit 4 above.

Edit 6: Even making sure that the pakset is identical on both client and server, the desync still occurs. I am wondering whether there might be some issue with the simuconf.tab file - but it is hard to see how this could be when all of the relevant settings are saved.

Edit 7: I have updated the simuconf.tab file on the server to match the latest on Github (with the exception of some server specific settings, which were retained from the old simuconf.tab and which are necessary). I cannot get the MinGW client to load any saved game without crashing at present for reasons that are extremely unclear, however the Linux client is able to remain connected for about 5-10 minutes without desyncing, but has just desynced after about that time of remaining connected.

Edit 8: Trying again, and the Linux machine still desyncs from the Bridgewater-Brunel server after a few minutes. This is extremely perplexing, as this is the exact same version of both binary and pakset used on my own Linux computer on my local network, which remains connected stably to a Windows build.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

If I understand correctly, if you connect over loopback or LAN it is OK, but connection over Internet, to both servers desyncs within minutes?

Is it possible that different versions of shared libraries (LibC), could cause it? I mean e.g. inconsistent behavior of prng, in different versions?

Sent from my ONEPLUS A3003 using Tapatalk


jamespetts

#265
There are a number of different possibilities, and I doubt that I know enough to enumerate all of them at this stage. So far, I have only tried connecting from my local computers to the Bridgewater-Brunel server, so it is possible that some specific feature of that server causes problems. When I changed the simuconf.tab file, the desync took a lot longer to occur. One thing that I have not tried, which I should try, is using the binary on my Linux NUC compiled on the Bridgewater-Brunel server, rather than one locally compiled, although I doubt that I have different versions of libraries on each.

Edit: I have now tested this (i.e. using the server compiled binary on my own Linux machine at home to connect to the server): it still seems to desync, but only after a long time (I do not know exactly how long unfortunately because my Linux computer uses the same monitor as my main computer, on which I spent a long time adjusting the height of single decker 'buses, after finishing which I discovered that it had desynchronised; but it lasted tens of minutes before desyncing this time. It is not clear why it is stable overnight on a local network but not over the internet, however, and it is exceptionally difficult to find any satisfactory way of discovering what the problem is here.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

I was now playing with the copy of bridgewater brunel - resolved some blockages, upgraded a few track, and it was quite stable. One server crash, probably due to out-of memory, and one or two desyncs - when I fas fiddling with schedule.

In the debug output of server there's a lot of messages like this:

ERROR: void convoi_t::laden():  Trying to load at halt City of Westminster Doll Street Stop when not at a halt

I resolved some of these by selling road vehicles or sending them to depot.

Also occasionaly passengers are delivered to a factory which was deleted in the meantime...

Is it possible that these things may affect the stability or desyncs?

jamespetts

#267
These things ought not to affect desyncs, especially as the exact same map with the exact same executable does not desync over a local network. My Linux computer is currently connected by wifi - I can try again sometime next week connected with a wired connexion, but given that it works fine over my local network with wifi, this is unlikely to be the problem. One possibility is that these desyncs are no longer checklist mismatch desyncs, but are caused by the client running ahead of the server, but I would need to look at log outputs for that, which I have not had a chance to do yet.

One thing that I am trying is running the binary downloaded from the server on the loopback interface (previously when trying this, I used a locally compiled binary). However, leaving this to run for about one game month when filling in my VAT return it seems to be stable and has not disconnected.

Edit: I tried running a MinGW build connected to the server overnight: it ran for a while (at least half an hour, before I went to bed), but I found that it had crashed in the morning - presumably from the load/save bug affecting only MinGW builds that I discuss in the cross-compiling thread, so I do not know for how long that it ran in total, but it seems to have stayed connected for a decent period of time at least until the next load/save cycle on the server (which would be at least an hour, I think).

Edit 2: I have now managed to get cross-compiling (mostly: with the exception of makeobj) to work: see here for the binaries. This does stay in sync with the Bridgewater-Brunel server - until somebody else connects, when it will desync shortly after being unpaused. I cannot, for reasons that I do not understand, reproduce this using the Visual Studio build with the debug interface.

Would anyone running a Linux build (please make sure that you have the latest executable from the Bridgewater-Brunel server this evening, or built yourself from devel-new-2 after this message was posted) be able to test this to see whether this also applies to Linux builds?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ves

Wow, this appears promising! Am now connected to the bridgewater-brunel game for more than 20 seconds using your crossover windows build! :D
Connecting with my own compiles still disconnects me within those 20 seconds last I checked.

jamespetts

#269
Further testing reveals as follows: even with four clients connected to a server on the loopback interface, this latest desync issue does not occur with Visual Studio builds for both client and server. Windows MinGW builds seem to behave in exactly the same way as Linux cross-compiled MinGW builds: that is, any clients already connected to the server when the server re-saves the game for a joining client will, after a period of time, desync from the server, while the latest connected client will stay in sync.

What I do not know yet is whether this same behaviour can be replicated with a native Linux version of the clients. I did notice, however, that for the MinGW builds, it sometimes took longer to manifest, the older clients desyncing quite a few minutes after the new one connected.

It is still very unclear why this is happening.

Edit: Re-testing with 5 Visual Studio clients all with release (rather than, as before, debug) builds confirms that this problem cannot be reproduced with Visual Studio clients and a Visual Studio server.

Edit 2: In further testing, there has now been one instance in which a Visual Studio build has desynced after another Visual Studio client has joined, but this appears to be extremely rare on Visual Studio builds and almost universal on MinGW builds.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

I look for uninitialised variables or memory then. MSVC freshly allocated memory is non-debug builds is usually initialised with zero (and 0xEE for debug builds), while it tends to be more random with GCC.

jamespetts

#271
I think that Visual Studio spots uninitialised variables automatically when running in debug mode (it breaks the program and throws a "runtime check" error), and I think that Dr. Memory finds this sort of error, too.

Edit: When running Dr. Memory, I seem again to have the unaccountable problem in which it will crash immediately upon loading, which crash cannot be reproduced when Dr. Memory is not running. Here is the output from the cross-compiled version:


Dr. Memory version 1.11.0 build 2 built on Aug 29 2016 02:42:07
Dr. Memory results for pid 12140: "simutrans-experimental-cross-compile.exe"
Application cmdline: "C:\Users\James\Documents\Development\Simutrans\simutrans-experimental-sources\simutrans\simutrans-experimental-cross-compile.exe"
Recorded 115 suppression(s) from default C:\Program Files (x86)\Dr. Memory\bin\suppress-default.txt

Error #1: UNADDRESSABLE ACCESS beyond heap bounds: reading 0x0ec0be2c-0x0ec0be30 4 byte(s)
# 0 gebaeude_t::get_visitor_demand                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 1 fabrik_t::update_scaled_pax_demand               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 fabrik_t::set_base_production                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 karte_t::load                                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 simu_main                                        [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 sysmain                                          [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 __tmainCRTStartup                                [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 7 KERNEL32.dll!BaseThreadInitThunk                +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)
Note: @0:00:31.053 in thread 9316
Note: refers to 4 byte(s) beyond last valid byte in prior malloc
Note: prev lower malloc:  0x0ec0be18-0x0ec0be28
Note: allocated here:
Note: # 0 replace_operator_new_array                 [d:\drmemory_package\common\alloc_replace.c:2928]
Note: # 1 fabrik_t::get_tile_list                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 2 _Unwind_SjLj_Unregister                    [/root/bzip2-1.0.6/blocksort.c:1086]
Note: # 3 _Unwind_SjLj_Unregister                    [/root/bzip2-1.0.6/blocksort.c:1086]
Note: # 4 haltestelle_t::add_grund                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 5 haltestelle_t::rdwr                        [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 6 haltestelle_t::haltestelle_t               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 7 haltestelle_t::create                      [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 8 karte_t::load                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: # 9 karte_t::load                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: #10 simu_main                                  [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: #11 sysmain                                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
Note: instruction: mov    0x14(%ecx) -> %eax

Error #2: UNADDRESSABLE ACCESS: reading 0xf1fdf104-0xf1fdf108 4 byte(s)
# 0 gebaeude_t::get_visitor_demand                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 1 fabrik_t::update_scaled_pax_demand               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 fabrik_t::set_base_production                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 karte_t::load                                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 simu_main                                        [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 sysmain                                          [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 __tmainCRTStartup                                [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 7 KERNEL32.dll!BaseThreadInitThunk                +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)
Note: @0:00:31.119 in thread 9316
Note: instruction: mov    0x04(%eax) -> %eax

Error #3: LEAK 0 direct bytes 0x0372a368-0x0372a368 + 0 indirect bytes
# 0 replace_operator_new_array                  [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 pthreadGC2-w32.dll!?                       +0x0      (0x65103b48 <pthreadGC2-w32.dll+0x3b48>)
# 2 freelist_t::gimme_node                      [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 imagelist_reader_t::read_node               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 obj_reader_t::read_file                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 9 loadingscreen_t::set_progress               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#10 simu_main                                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#11 sysmain                                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]

Error #4: LEAK 67 direct bytes 0x0372b978-0x0372b9bb + 0 indirect bytes
# 0 replace_operator_new                                  [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 std::__cxx11::basic_string<>::_M_create               [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 std::__cxx11::basic_string<>::_M_mutate               [/root/bzip2-1.0.6/blocksort.c:1086]
# 3 std::__cxx11::basic_string<>::_M_append               [/root/bzip2-1.0.6/blocksort.c:1086]
# 4 sysmain                                               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 __tmainCRTStartup                                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 6 KERNEL32.dll!BaseThreadInitThunk                     +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #5: LEAK 3 direct bytes 0x03738258-0x0373825b + 0 indirect bytes
# 0 replace_malloc                     [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 gzread                             [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 loadsave_t::read                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 loadsave_t::rdwr_str               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 KERNELBASE.dll!ReadFile           +0x117    (0x764dde95 <KERNELBASE.dll+0xde95>)
# 5 KERNELBASE.dll!ReadFile           +0x169    (0x764ddee7 <KERNELBASE.dll+0xdee7>)
# 6 KERNELBASE.dll!CreateSemaphoreExW +0x77     (0x764e0fe9 <KERNELBASE.dll+0x10fe9>)
# 7 KERNELBASE.dll!ReadFile           +0x169    (0x764ddee7 <KERNELBASE.dll+0xdee7>)
# 8 KERNEL32.dll!ReadFile             +0x53     (0x75c73ec7 <KERNEL32.dll+0x13ec7>)
# 9 KERNEL32.dll!ReadFile             +0x58     (0x75c73ecc <KERNEL32.dll+0x13ecc>)
#10 KERNEL32.dll!ReadFile             +0x58     (0x75c73ecc <KERNEL32.dll+0x13ecc>)
#11 msvcrt.dll!_read_nolock

Error #6: POSSIBLE LEAK 8 direct bytes 0x03738500-0x03738508 + 0 indirect bytes
# 0 replace_malloc                      [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 loadsave_t::start_tag               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 simu_main                           [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 sysmain                             [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 __tmainCRTStartup                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 5 KERNEL32.dll!BaseThreadInitThunk   +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #7: LEAK 0 direct bytes 0x0376c4a0-0x0376c4a0 + 0 indirect bytes
# 0 replace_operator_new_array                  [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 image_reader_t::read_node                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 msvcrt.dll!_unlock   
# 3 msvcrt.dll!_unlock_file
# 4 msvcrt.dll!fread_s   
# 5 msvcrt.dll!fread_s   
# 6 msvcrt.dll!_unlock_file
# 7 msvcrt.dll!fread     
# 8 read_node_info                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 9 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#10 imagelist_reader_t::read_node               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#11 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]

Error #8: LEAK 71 direct bytes 0x037954e0-0x03795527 + 0 indirect bytes
# 0 replace_operator_new_array                [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 pthreadGC2-w32.dll!?                     +0x0      (0x6510332c <pthreadGC2-w32.dll+0x332c>)
# 2 freelist_t::gimme_node                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 _Unwind_SjLj_Register                     [/root/bzip2-1.0.6/blocksort.c:1086]
# 4 _Unwind_SjLj_Unregister                   [/root/bzip2-1.0.6/blocksort.c:1086]
# 5 savegame_frame_t::fill_list               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 ask_objfilename                           [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 sysmain                                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 __tmainCRTStartup                         [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]
# 9 KERNEL32.dll!BaseThreadInitThunk         +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #9: LEAK 28 direct bytes 0x037cc230-0x037cc24c + 0 indirect bytes
# 0 replace_operator_new                              [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 _Unwind_SjLj_Register                             [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 3 image_t::copy_image                               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 5 image_t::copy_rotate                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 create_alpha_tile                                 [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 ground_desc_t::init_ground_textures               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 WINMM.dll!midiOutGetNumDevs                      +0x7c5    (0x74926144 <WINMM.dll+0x6144>)
# 9 WINMM.dll!midiOutGetNumDevs                      +0xa54    (0x749263d3 <WINMM.dll+0x63d3>)
#10 KERNELBASE.dll!FindCloseChangeNotification       +0x68     (0x764ea353 <KERNELBASE.dll+0x1a353>)
#11 KERNELBASE.dll!FindFirstFileExW                  +0x531    (0x764eab95 <KERNELBASE.dll+0x1ab95>)

Error #10: LEAK 28 direct bytes 0x037cc738-0x037cc754 + 0 indirect bytes
# 0 replace_operator_new                              [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 _Unwind_SjLj_Register                             [/root/bzip2-1.0.6/blocksort.c:1086]
# 2 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 3 image_t::copy_image                               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 _Unwind_SjLj_Unregister                           [/root/bzip2-1.0.6/blocksort.c:1086]
# 5 image_t::copy_rotate                              [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 create_alpha_tile                                 [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 ground_desc_t::init_ground_textures               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 WINMM.dll!midiOutGetNumDevs                      +0x7c5    (0x74926144 <WINMM.dll+0x6144>)
# 9 WINMM.dll!midiOutGetNumDevs                      +0xa54    (0x749263d3 <WINMM.dll+0x63d3>)
#10 KERNELBASE.dll!FindCloseChangeNotification       +0x68     (0x764ea353 <KERNELBASE.dll+0x1a353>)
#11 KERNELBASE.dll!FindFirstFileExW                  +0x531    (0x764eab95 <KERNELBASE.dll+0x1ab95>)

Error #11: LEAK 0 direct bytes 0x037e32d8-0x037e32d8 + 0 indirect bytes
# 0 replace_operator_new_array                  [d:\drmemory_package\common\alloc_replace.c:2928]
# 1 image_reader_t::read_node                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 2 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 3 imagelist_reader_t::read_node               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 4 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 5 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 6 obj_reader_t::read_nodes                    [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 7 obj_reader_t::read_file                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 8 loadingscreen_t::set_progress               [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
# 9 simu_main                                   [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#10 sysmain                                     [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:212]
#11 __tmainCRTStartup                           [/build/mingw-w64-_1w3Xm/mingw-w64-4.0.4/mingw-w64-crt/crt/crtexe.c:332]

Reached maximum leak report limit (-report_leak_max). No further leaks will be reported.

===========================================================================
FINAL SUMMARY:

DUPLICATE ERROR COUNTS:
    Error #   3:      2
    Error #   6:      2
    Error #   9:      9
    Error #  10:      9

SUPPRESSIONS USED:

ERRORS FOUND:
      2 unique,     2 total unaddressable access(es)
      0 unique,     0 total invalid heap argument(s)
      0 unique,     0 total GDI usage error(s)
      0 unique,     0 total warning(s)
      8 unique,    25 total,    645 byte(s) of leak(s)
      1 unique,     2 total,   2508 byte(s) of possible leak(s)
ERRORS IGNORED:
     24 potential leak(s) (suspected false positives)
         (details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental-cross-compile.exe.12140.000\potential_errors.txt)
    924 unique,  9709 total, 2307668 byte(s) of still-reachable allocation(s)
         (re-run with "-show_reachable" for details)
  425493 leak(s) beyond -report_leak_max
Details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental-cross-compile.exe.12140.000\results.txt


Edit: Apparently, the output from the Visual Studio debug version (which is very similar) is too long to fit into one message.

However, I have managed to get it to run by deleting demo.sve.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

#272
Apologies for double-posting, but I cannot fit two Dr. Memory outputs into one post, it seems.

Here is the output from running without having loaded demo.sve on the same saved game as the server uses, using the cross-compiled build:


Dr. Memory version 1.11.0 build 2 built on Aug 29 2016 02:42:07
Dr. Memory results for pid 9372: "simutrans-experimental.exe"
Application cmdline: "C:\msys32\home\James\simutrans\simutrans-experimental\build\default\simutrans-experimental.exe"
Recorded 115 suppression(s) from default C:\Program Files (x86)\Dr. Memory\bin\suppress-default.txt

Error #1: LEAK 72 direct bytes 0x03973de8-0x03973e30 + 0 indirect bytes
# 0 replace_RtlAllocateHeap                  [d:\drmemory_package\common\alloc_replace.c:3770]
# 1 ntdll.dll!RtlDosSearchPath_Ustr         +0x385    (0x77282bf3 <ntdll.dll+0x52bf3>)
# 2 replace_native_xfer_target               [d:\drmemory_package\dynamorio\ext\drwrap\drwrap.c:1560]
# 3 ntdll.dll!RtlUniform                    +0x27     (0x7727bec1 <ntdll.dll+0x4bec1>)
# 4 NULL.dll!?                              +0x0      (0x2ae593de <NULL.dll+0x93de>)
# 5 SHELL32.dll!SHGetDiskFreeSpaceExW       +0x181    (0x74e5a804 <SHELL32.dll+0xaa804>)
# 6 SHELL32.dll!SHRestricted                +0x5af9   (0x74e34abd <SHELL32.dll+0x84abd>)
# 7 SHELL32.dll!SHRestricted                +0x5c1a   (0x74e34bde <SHELL32.dll+0x84bde>)
# 8 SHELL32.dll!SHRestricted                +0x5b78   (0x74e34b3c <SHELL32.dll+0x84b3c>)
# 9 SHELL32.dll!SHRestricted                +0x5cfe   (0x74e34cc2 <SHELL32.dll+0x84cc2>)
#10 SHELL32.dll!SHGetDiskFreeSpaceExW       +0x23d1   (0x74e5ca54 <SHELL32.dll+0xaca54>)
#11 SHELL32.dll!SHGetFolderPathEx           +0x2c     (0x74e35555 <SHELL32.dll+0x85555>)

Error #2: LEAK 16 direct bytes 0x04982bc0-0x04982bd0 + 96 indirect bytes
# 0 replace_operator_new                   [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 __pformat_int.isra.0                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_pformat.c:780]
# 2 msvcrt.dll!_getptd_noexit
# 3 msvcrt.dll!_getptd_noexit
# 4 tabfileobj_t::get_string               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 5 stadt_t::cityrules_init                [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 simu_main                              [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 sysmain                                [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 8 __tmainCRTStartup                      [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:334]
# 9 KERNEL32.dll!BaseThreadInitThunk      +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #3: LEAK 3 direct bytes 0x049986e0-0x049986e3 + 0 indirect bytes
# 0 replace_malloc                    [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 gz_read                           [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
# 2 gzread                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
# 3 loadsave_t::read                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 4 gzread                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
# 5 simu_main                         [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 sysmain                           [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 __tmainCRTStartup                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:334]
# 8 KERNEL32.dll!BaseThreadInitThunk +0x11     (0x75c7336a <KERNEL32.dll+0x1336a>)

Error #4: POSSIBLE LEAK 8 direct bytes 0x04998c88-0x04998c90 + 0 indirect bytes
# 0 replace_malloc                       [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 pthread_mutex_lock                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 2 rwlock_static_init                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 3 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 4 rwlock_gain_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 5 rwlock_free_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 __pthread_self_lite                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 8 pthread_setspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 9 pthread_getspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
#10 __emutls_get_address                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
#11 loadsave_t::rdwr_bool                [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]

Error #5: POSSIBLE LEAK 2500 direct bytes 0x04998cc8-0x0499968c + 0 indirect bytes
# 0 replace_malloc                       [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 pthread_mutex_lock                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 2 rwlock_static_init                   [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 3 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 4 rwlock_gain_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 5 rwlock_free_both_locks               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 6 rwl_unref                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 7 __pthread_self_lite                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 8 pthread_getspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
# 9 __emutls_get_address                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/libsrc/ws2tcpip/gai_strerrorW.c:17]
#10 pthread_getspecific                  [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]
#11 MTgenerate                           [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:212]

Error #6: LEAK 16 direct bytes 0x049bdbf8-0x049bdc08 + 6 indirect bytes
# 0 replace_operator_new                      [d:\drmemory_package\common\alloc_replace.c:2899]
# 1 __pformat_int.isra.0                      [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_pformat.c:780]
# 2 msvcrt.dll!_getptd_noexit
# 3 msvcrt.dll!_getptd_noexit
# 4 msvcrt.dll!_strtoui64_l
# 5 msvcrt.dll!_chdir   
# 6 NULL.dll!?                               +0x0      (0x2ae59480 <NULL.dll+0x9480>)
# 7 KERNELBASE.dll!FindCloseChangeNotification+0x68     (0x764ea353 <KERNELBASE.dll+0x1a353>)
# 8 KERNELBASE.dll!FindFirstFileExW          +0x531    (0x764eab95 <KERNELBASE.dll+0x1ab95>)
# 9 KERNELBASE.dll!SetErrorMode              +0x36     (0x764d7577 <KERNELBASE.dll+0x7577>)
#10 KERNEL32.dll!QueryActCtxW                +0x660    (0x75c7d163 <KERNEL32.dll+0x1d163>)
#11 KERNEL32.dll!GetDriveTypeW               +0x5a     (0x75c74186 <KERNEL32.dll+0x14186>)

Reached maximum leak report limit (-report_leak_max). No further leaks will be reported.

===========================================================================
FINAL SUMMARY:

DUPLICATE ERROR COUNTS:
   Error #   6:      3

SUPPRESSIONS USED:

ERRORS FOUND:
      0 unique,     0 total unaddressable access(es)
      0 unique,     0 total invalid heap argument(s)
      0 unique,     0 total GDI usage error(s)
      0 unique,     0 total warning(s)
      4 unique,     6 total,    343 byte(s) of leak(s)
      2 unique,     2 total,   2508 byte(s) of possible leak(s)
ERRORS IGNORED:
     89 potential leak(s) (suspected false positives)
         (details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental.exe.9372.000\potential_errors.txt)
   1146 unique,  9775 total, 2717758 byte(s) of still-reachable allocation(s)
         (re-run with "-show_reachable" for details)
  434940 leak(s) beyond -report_leak_max
Details: C:\Users\James\AppData\Roaming\Dr. Memory\DrMemory-simutrans-experimental.exe.9372.000\results.txt


All of these issues appear to be memory leaks, all of them apparently originating in code unchanged from Standard.

Edit: In case the problems loading demo.sve were silently causing unseen problems, I tried connecting to the Bridgewater-Brunel server with two clients which had not started by loading demo.sve, but the same failure state (the original client lost synchronisation shortly after the new one connected) was observed.

Edit 2: The desync on joining issue cannot be reproduced on three MinGW builds on the loopback interface built without multi-threading, so this must be an issue relating to multi-threading somewhere.

Edit 3: Re-enabling multi-threading, but disabling multi-threading of the convoys (using the -DFORBID_MULTI_THREAD_CONVOYS preorpcessor directive) produces an executable with the same desync on another player connecting as the fully multi-threaded build.

Edit 4: Additionally defining -DFORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE has no effect on this behaviour.

Edit 5: Additionally defining -DFORBID_MULTI_THREAD_PATH_EXPLORER makes no difference: the desync on another player joining still occurs (and quickly).

Edit 6: Enabling assume_everywhere_connected_by_road (which has the effect of disabling the (multi-threaded) private car route finder, the only other multi-threaded part of the program to use the route finder (route_t) has no effect: the desync on connexion still occurs.

Edit 7: Additionally defining -DFORBID_MULTI_THREAD_ROUTE_UNRESERVER (which disables the multi-threaded route reservation) makes no difference to the player joining desync. This is significant, as this means - bizarrely - that this error is caused by some part of the multi-threading code that is not specific to Experimental (which has additional multi-threading to Standard for convoy routing, passenger generation, the path explorer, private car route finding and the block unreserver). Only the loading/saving and the graphics are multi-threaded in Standard. The graphics are very unlikely to make a difference here, so there must be some issue with the multi-threaded loading/saving routine (which is also present in Standard; but Standard does not seem to have this problem, which I do not really understand).
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

How often do you check for sync in your debug builds? But if the desync happens after saving and reloading, than that means your saving routines do either not save the entire state. It would be very easy to compare the just transferred game (maybe after setting the format to xml_zipped) with the locally cached savegame of the first client. If these are not identical, find the differences (just switch off pedestrains first, because they are allowed to be different within reasons, same for smoke).

The crash you are showing is likely in the multithreaded loadsave routines. If these are identical to standard, then the error is like occuring during obj_t::obj_t(file) contructors.

If this errors is indeed in the multithreaded loading, then either it might be clash of stdlib single threaded for bzlib versus a multithreaded stdlib for the rest. Or your library has some issues with semaphores (i.e. simtrhead does not work as intended, for whateve reasons.)

jamespetts

#274
Thank you for that: that is helpful. I will have to look into each of these issues carefully and will report progress on this thread.

Edit: Updating the code in loadsave_t to the latest code from Standard does not seem to fix this issue.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt

Quote from: jamespetts on February 09, 2017, 12:34:20 PM
so there must be some issue with the multi-threaded loading/saving routine
That seems improbable,  at least for the actual loadsave routine. Now missing mutexes to allow the multithreaded laden_abschliessen() on the other hand...

jamespetts

Testing on my Linux computer, I can confirm that the issue can be reproduced on a pure Linux 64-bit build, so the problem is not specific to MinGW.

There is something very odd happening that this problem happens with GCC builds and GCC builds will not stay in sync with Visual Studio builds and this problem does not occur with Visual Studio builds.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

For reasons entirely beyond explanation at this juncture, this problem appears to have become worse: now, the MinGW build will not stay in sync with the Linux build even if no other clients are connected, whereas a Linux build will stay in sync with a Linux build and a MinGW build will stay in sync with a MinGW build. This is a new manifestation of this issue since I last tested (apparently in early February), but it is entirely unclear what sort of thing could cause this behaviour and how this change from the previous less faulty behaviour could have occurred.

Fixing this is likely to take many weeks if not months, during which time I will not be able to work on other issues. If anyone has any thoughts as to how best to track this down, that would be much appreciated, as it could potentially save a gargantuan amount of time.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

A.Badger

If you now have a setup where you can test reliably and it seems like you can tell the previous behaviour from the current behaviour I'd use git bisect to find out when things got much worse.  The last commit in January was 444aea695046e591bc37285fb037521a9ec948e0. git bisect tells me there's 81 commits between there and it will take roughly 6 steps to narrow down to a single commit.

I don't know anything about simutrans's wire protocol but if Linux <=> Linux and MingW <=> MingW works fine but Linux <=> MingW does not I'd think that LP64 vs LLP64 might be the issue.  grep shows that there are some uses of signed/unsigned long in the code but I don't know if any of those variables end up being passed over the wire so I don't know if my guess is correct or not.

jamespetts

Thank you for that. Almost all of the instances of unsigned long, and all of the instances of signed long in the code are identical in Extended to Standard, and Standard works over the network without desynchronising in this way. There was one place, in path_explorer.cc, where I modified unsigned_long to uint32 just now, but this part of the code has been unchanged for many years, during which time it has managed to stay in sync, so I doubt that that is the problem.

Running the bifurcation test that you suggest may be the only way to proceed, but it is likely to be very time consuming having to do it on two separate computers, so I may have to set aside a week-end just for that. If anyone else can assist by running this test in the meantime, I should be most grateful.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.