I have been doing some further testing on this issue to-day.
As readers of this thread may remember, there are essentially two separate problems: (1) the original problem of a GCC build immediately desynchronising with a Visual Studio build; and (2) a desync occurring, irrespective of the build, when more than one client connects to a server (the earlier connected clients desyncing after a delay).
A week or two ago (I cannot remember exactly when), I fixed a bug relating to the post-loading code for vehicles that had the potential to cause the second issue. I did not have time then, however, to test whether this actually fixed the issue.
Recently, I have been working on the code for passenger and mail classes. In testing some of that code, I found a thread deadlock in the path explorer code. This turns out to have been caused, not by a bug in the new passenger and mail classes code, but by a pre-existing bug in the multi-threaded path explorer code. On looking very carefully into the documentation for pthreads, I found that I had misunderstood the relationship between the pthread_cond_wait function and the mutex that it requires as a parameter. The existing path explorer multi-threading code has undefined behaviour according to the pthreads standard, as it locks the same mutex multiple times in succession.
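For anyone not familiar with the rule in question: pthread_cond_wait must be called with the mutex locked exactly once by the calling thread; it releases that mutex while the thread is blocked and re-acquires it before returning. The following is only a minimal sketch of that pattern, not the path explorer code itself: work_signal_t, wait_for_payload, publish_payload and run_demo are hypothetical names invented for illustration.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical shared state; none of these names come from the actual source.
struct work_signal_t {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            ready;    // predicate protected by mutex
    int             payload;  // data protected by mutex
};

void init_signal(work_signal_t& s) {
    pthread_mutex_init(&s.mutex, nullptr);
    pthread_cond_init(&s.cond, nullptr);
    s.ready = false;
    s.payload = 0;
}

void destroy_signal(work_signal_t& s) {
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.mutex);
}

// Waiter: lock once, then wait in a loop on the predicate.
int wait_for_payload(work_signal_t& s) {
    pthread_mutex_lock(&s.mutex);
    while (!s.ready) {
        // Releases s.mutex while blocked and re-acquires it before returning.
        // The loop guards against spurious wake-ups.
        pthread_cond_wait(&s.cond, &s.mutex);
    }
    int result = s.payload;
    s.ready = false;
    pthread_mutex_unlock(&s.mutex);  // exactly one unlock for the one lock
    return result;
}

// Signaller: change the predicate under the same mutex, then signal.
void publish_payload(work_signal_t& s, int value) {
    pthread_mutex_lock(&s.mutex);
    s.payload = value;
    s.ready = true;
    pthread_cond_signal(&s.cond);
    pthread_mutex_unlock(&s.mutex);
}

static void* producer_main(void* arg) {
    publish_payload(*static_cast<work_signal_t*>(arg), 42);
    return nullptr;
}

// Starts one producer thread and waits for its value on the calling thread.
int run_demo() {
    work_signal_t s;
    init_signal(s);
    pthread_t producer;
    int rc = pthread_create(&producer, nullptr, producer_main, &s);
    assert(rc == 0);
    int value = wait_for_payload(s);
    pthread_join(producer, nullptr);
    destroy_signal(s);
    return value;
}
```

Because the waiter tests the ready flag rather than relying on the signal alone, it does not matter whether the producer signals before or after the waiter reaches pthread_cond_wait.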
I have coded an initial attempt at a fix to the multi-threaded path explorer on this dedicated branch.
However, testing shows that this code gives rise to a network desync when instances of the same build are connected on the loopback interface for testing, albeit only after some considerable time has elapsed. This does not occur when the path explorer multi-threading is disabled or on the master branch.
Meanwhile, testing on the master branch suggests that the other problem (a desync occurring a short while after a second or subsequent client connects) has been fixed, which I suspect is owing to the post-loading code fix to which I refer above.
I have now run out of time for further testing this week-end (each individual test cycle requires running the whole thing for over an hour, fast forwarding not being possible in network mode), but will look into refining the new code further. Because the current code has undefined behaviour, it is a prime suspect for inter-platform desyncs, so I am keen to fix it as soon as possible.
If anyone can spot any immediate problems in my new multi-threading code, I should be grateful for any feedback.
Edit: Some further testing suggests that the multi-threaded passenger generation code is responsible for a desync between a Visual Studio client and a GCC (Msys) client (both Windows): when this is disabled, the two will stay in sync for far longer than when this is enabled. I have not yet tested for long enough to see whether they will stay in sync permanently, however.
Edit 2: With the (modified) path explorer multi-threading enabled but the passenger generation multi-threading disabled, it still desyncs between an Msys GCC client and the Visual Studio client, but only after a very long time; the same sort of time as it takes to desync between two Visual Studio clients with the new path explorer multi-threading algorithm. This suggests that the passenger generation multi-threading may well be responsible somehow for the desync between differently compiled versions of Extended.
Edit 3: I have now run a long-term test connecting a single Msys/GCC compiled client to a Visual Studio compiled server all day to-day with the passenger generation multi-threading disabled and the (existing) path explorer multi-threading enabled, and the two are still in sync even now. I have also had an answer on Stack Exchange that might help to explain the problem that I have been having with the new path explorer multi-threading code.
Edit 4: For a more challenging test, I used the 2010 edition of Rollmaterial's map with the latest code on the passenger-generation-multi-threading-fix branch (currently, just minor updates to the path explorer multi-threading from the master branch, with the passenger generation multi-threading disabled entirely). A Visual Studio client will stay in sync with another Visual Studio client for longer than I have so far measured, but an Msys/GCC client, connected second, will desync after approximately one game hour (with no interaction).
Edit 5: With mutex error checking enabled, no mutex errors are reported in the passenger generation multi-threading when running the britain-2010 map.
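For reference, one standard way of checking for mutex misuse in pthreads is to create the mutex with the PTHREAD_MUTEX_ERRORCHECK type, which makes relocking and stray unlocking return error codes instead of being undefined behaviour. Whether this is exactly the mechanism used in the test above is an assumption on the part of this sketch; lock_report_t and probe_errorcheck_mutex are hypothetical names.

```cpp
#include <pthread.h>
#include <cerrno>
#include <cassert>

// Hypothetical record of the return codes seen during the probe.
struct lock_report_t {
    int first_lock;    // expected 0
    int second_lock;   // expected EDEADLK: same thread relocks without unlocking
    int stray_unlock;  // expected EPERM: unlocking a mutex this thread does not hold
};

// Exercises an error-checking mutex from a single thread and reports the results.
lock_report_t probe_errorcheck_mutex() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    lock_report_t report;
    report.first_lock = pthread_mutex_lock(&mutex);
    // With a default mutex this second lock would be undefined behaviour;
    // with an error-checking mutex it fails cleanly.
    report.second_lock = pthread_mutex_lock(&mutex);

    pthread_mutex_unlock(&mutex);                       // matches first_lock
    report.stray_unlock = pthread_mutex_unlock(&mutex); // mutex no longer held

    pthread_mutex_destroy(&mutex);
    return report;
}
```

Checking the return value of every lock and unlock call in a debug build is what makes this useful: a non-zero result pinpoints the offending call.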
Edit 6: Repeating the test from edit 3 using the saved game from edit 4 produces a desync, but only after running for about one game month.
Edit 7: Repeating the test from edit 6 with the path explorer multi-threading disabled produces a desync after a long time, but slightly short of a game month (i.e. before the crossing of a month boundary since loading the game).
Edit 8: Very oddly indeed, when testing with FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE defined, I get a near-instant desync between a GCC/Msys and Visual Studio client, although two Visual Studio clients will happily stay in sync. The difference between FORBID_MULTI_THREAD_PASSENGER_GENERATION_IN_NETWORK_MODE and FORBID_PARALLELL_PASSENGER_GENERATION is that the former uses the old code for single-threaded passenger generation in the main thread, whereas the latter (i.e., the one that works) uses a separate thread for the passenger generation, but only actually runs the passenger generation on a single thread rather than all of the passenger generation threads. This is very bizarre, as it suggests that there is an error with the single threaded passenger generation code that is not present in the multi-threaded passenger generation code when it is restricted to running with only one thread.
Edit 9: Even running entirely single-threadedly, the GCC/Msys build will desync from the Visual Studio build in seconds with the britain-2010 saved game. It seems that only when the passenger generation multi-threading is operational but it is set to use only one of the threads does this work (for a while) without desyncing. This is extremely odd.
Edit 10: Defining FIXED_PASSENGER_NUMBERS_PER_STEP_FOR_TESTING with the passenger generation multi-threading fully enabled does not prevent the Visual Studio/GCC desync.
Edit 11: Defining DISABLE_JOB_EFFECTS prevents the very quick desync between the Visual Studio and GCC/Msys builds with the britain-2010 saved game.
Edit 12: I have reverted the DISABLE_RANDOMNESS setting on the server's version and recompiled it so that people can again try to connect with an unmodified client for testing purposes.
Edit 13: Attempting to connect to the Bridgewater-Brunel server from the cross-compiled client (both from the master branch) still results in a near instant desync.
Edit 14: Further testing shows that the cause of the short desync between a Visual Studio client and a GCC/Msys (Windows) client appears to have been using the min() and max() methods with 64-bit integers when in fact they are defined as using signed 32-bit integers. I have added a special 64-bit version of min() and max() (called min_64() and max_64()) to handle these where they appeared in the code relating to job effects, and I can now connect, with job effects and passenger generation multi-threading both enabled, an Msys/GCC client to a Visual Studio server for a considerable time (crossing a month boundary in the britain-2010 game) before a desync occurs. The long desync, however, after the month boundary is crossed, is still present.
This opens up a new line of enquiry into all cross platform desyncs, however, as there may be other sync critical places in the code with 64-bit integers using these methods. Also, I wonder whether it is safe for unsigned 32-bit integers to use these methods.
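To illustrate the class of problem, here is a minimal sketch of what happens when a helper declared for signed 32-bit integers is handed a wider or unsigned value. The names min_32 and max_32 are hypothetical stand-ins, not the actual definitions in the code base; min_64 and max_64 here merely mirror the idea of the new functions. Note that the result of an out-of-range conversion to a signed type is implementation-defined before C++20; the values in the comments assume the usual wrap-around modulo 2^32.

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical 32-bit-only helpers: arguments are silently narrowed at the call.
inline std::int32_t min_32(std::int32_t a, std::int32_t b) { return a < b ? a : b; }
inline std::int32_t max_32(std::int32_t a, std::int32_t b) { return a > b ? a : b; }

// 64-bit counterparts that compare and return the full-width values.
inline std::int64_t min_64(std::int64_t a, std::int64_t b) { return a < b ? a : b; }
inline std::int64_t max_64(std::int64_t a, std::int64_t b) { return a > b ? a : b; }

// Passes a 64-bit value through the 32-bit helper, as the old call sites did.
inline std::int64_t narrowed_max(std::int64_t a, std::int64_t b) {
    // a and b are converted to std::int32_t here: the upper 32 bits are lost.
    return max_32(static_cast<std::int32_t>(a), static_cast<std::int32_t>(b));
}

// Passes an unsigned 32-bit value through the signed 32-bit helper.
inline std::int64_t narrowed_min_unsigned(std::uint32_t a, std::uint32_t b) {
    // Values above INT32_MAX become negative after conversion.
    return min_32(static_cast<std::int32_t>(a), static_cast<std::int32_t>(b));
}
```

With wrap-around conversion, narrowed_max(5000000000, 10) yields 705032704 rather than 5000000000, and narrowed_min_unsigned(3000000000, 1) yields a negative number rather than 1. This suggests that unsigned 32-bit values are only safe with such helpers so long as they never exceed the signed 32-bit maximum.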
Edit 15: I cannot find any more instances of the min()/max() methods being passed 64-bit integers, although I have slightly improved some code. This has not prevented the long desync between Visual Studio and Msys/GCC clients, but normally connexions will be made between GCC/Msys and GCC/Linux clients in any event, so it is possible that this desync is not important.
I have now integrated the above fix into the master branch as this is clearly an improvement on the previous code and fixes a specific issue. There will be further testing on the Bridgewater-Brunel server.
Edit 16: There is still an immediate desync when connecting to the Bridgewater-Brunel server with the client cross-compiled on the Bridgewater-Brunel server. It is not clear at present what the cause of this is.
Edit 17: The same result obtains with both the Msys/GCC and Visual Studio builds connecting to the Bridgewater-Brunel server: an instant desync.
Edit 18: Testing with my Linux computer, the Linux client also seems to desync from the Bridgewater-Brunel server instantly, although it is able to stay in sync with a Windows server. However, the problem of one client joining causing all other clients to desync shortly after connecting appears to have returned, and it is not clear why.
Edit 19: The client kick desync can be reproduced with the Visual Studio and the GCC/Msys builds connecting to a build of the same type.
Edit 20: Testing again with all multi-threading disabled, the client kick desync cannot be reproduced. This appears to be the same issue as was investigated some months ago relating apparently to multi-threading of the load/save routines. The long desync appears also not to be reproducible in this configuration, but this needs a longer period of testing to confirm.
Edit 21: With multi-threading disabled entirely, three clients can remain connected to a local server for many hours and many in-game months without desyncing from the server.