News:

SimuTranslator
Make Simutrans speak your language.

Multiplayer oddities

Started by Junna, April 08, 2015, 07:00:56 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Junna

Trying to connect to myself running a server I get this... (compiled dev ver)



Obviously this is not true, because it is the exact same.

jamespetts

May I ask what pakset version that you are using - your own or the version compiled from half-heights? If the former, can you try it with the latter to see whether that makes any difference? Otherwise, it may be a bug in the pakset comparison code.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Junna

I was using my own, but it looks the same with a new compile of the original pakset without any changes (though what appears as mismatched seems to vary almost randomly).

jamespetts

I see - this does look like a bug in the system for checking for mismatches. Does it work without desyncing when connected manually?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Junna

How would I connect manually? In OTTD one does this via console, but Simutrans has no such, so I'm not sure how one would connect manually?

prissi

as filename type "net:ipnumber:port" for instance net:192.168.2.224 for a local game in standard port 13353

Junna

Ah, right.

Yeah, I de-synch immediately.

jamespetts

Sorry for missing this earlier - for some reason, the notification goes into my bulk folder. Do you desync only with your own pakset or do you also desync with the half-heights compiled pakset?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Junna

Hmm, it does not de-synch with the clean pak-set. It does however still display discrepancies.

Edit: Spoke too soon. De-synched as I wrote that.

jamespetts

Desyncs are very difficult to track down, as there is no easy way of telling at what exact point the code path taken by the two machines varies. One thing that can help greatly, however, is working out what aspect(s) of the game cause desyncs. We currently know that electricity supply to cities has this effect (this bug has not been fixed yet). If your game had electricity supply to cities, this might be a cause. If it did not, it might be helpful to use the clean pakset (which seems not to have desynchronised immediately, which suggests that the cause might be different) and experiment with lots of different games, each using different aspects (passenger transport by road, passenger transport by rail, supplying electricity to factories, goods transport, airports, etc.) to see whether the desyncs happen with but not without any given feature, which should help to narrow down where the problem lies.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Junna

The desynch (new game, clean pak) occured after a small test bus route had been set up in a city, so that might have been the case (with the other pak, that desynched immediately, a bus line already existed, I guess this might be relevant.)

jamespetts

Did it desync without the 'bus test route? Did it desync as 'buses arrived at the stops, or at other times?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

DrSuperGood

A while ago I tracked down a few bugs with the checksum process for various data elements in Simutrans standards. Specifically the checksum of un-initialized members are used. The result is that default "garbage" values are checksumed which are compiler specific. As such MSVC client would show a pakset mismatch with a GCC server and a GCC client would show a pakset mismatch with a MSVC server even though they were running from the very same pakset data.

It is possible that this fix has not been ported to Experimental and would certainly explain the pakset mismatch. The synchronization problem is likely a separate issue however.

Junna

^But, it was my compile and my client, so shouldn't that be fine?

DrSuperGood

Quote^But, it was my compile and my client, so shouldn't that be fine?
If the compiler has consistent behaviour for uninitialized members then yes.

jamespetts

#15
Dr. Supergood - when was this fixed (can you remember the approximate date)? I will need to check whether this was properly backported.

Edit: I have spent some considerable time this evening trying to discover what aspects of a game cause the desynchronisation. I am able to reproduce the problem with some maps, but not others. I have yet to find any general characteristic that is different about maps where I can reproduce the desynchronisaiton problem than where I cannot. I have ruled out the problem being with industry (at least, only with industry), as maps without any industry exhibit the problem. Maps without any transporting happening (e.g., with no towns or industries) do not exhibit this problem, but some maps with towns and where passengers are transported do not exhibit this problem.

The last time that there was a significant desynchronisation bug, it took about four months of intensive work to find and fix it, and that was with somebody helping: this type of bug is, sadly, fantastically difficult to find.

One thing that would help greatly would be for people to try lots of different maps of increasing complexity (starting with extremely simple arrangements with only one or two lines) and find whether there are any particular things that reliably cause desyncs, which, when removed, solve the problem. Please post the results of any such findings on this thread.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

O01eg

If the desync bugs caused by uninitialized values it can be searched by valgrind or memory sanitizers. I found two places https://gist.github.com/o01eg/09bb728831830365b837 but not sure about default values especially in haltestelle_t::haltestelle_t(loadsave_t* file)

jamespetts

Thank you for that. I have applied your fixes, but this does not help the desync problem, sadly.

Incidentally, for testing purposes, this is the saved game that does not cause desyncs, and this is the not much more complicated one that does.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.


jamespetts

Fixes applied, but there are still desyncs, alas.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

O01eg

Does desync happen when MULTI_THREAD=0? I don't get any valgrind errors about uninitialized values while leave Tamworth_1484205993_1806-11 map running for several hours. But it was debug build. In release build I found and fixed some alignment issues when optimized string function read 4 bytes at once but allocation return unaligned buffer.

Is some actions required? Or maybe it requires only network game?

jamespetts

#21
I have yet to test the desyncing with multi-threading disabled or with both client and server in a debug build (I normally test with the client as a release build and the server as a debug build running in the MSVC 2012 debugger). I will have a go at testing both of those things and revert to you on this thread.

As to whether actions are required, do you mean whether the player has to take any action in the game before the desync happens? If you mean to ask that, the answer is no: it will desync after a few seconds in the Tamworth map even if no actions are taken.

In the meantime, can you let me have the .diff files for your fixes?

Edit: Tested with single-threaded release build: still desyncs.

Edit 2: Tested with single-threded debug build: does not desync.

Edit 3: Tested with multi-threaded debug build: still desyncs.

Edit 4: Incidentally, the single-threaded builds were built without MULTI_THREAD defined at all.

Edit 5: Actually, that is not quite correct. The debug single threaded build had not had MULTI_THREAD defined at all: the release build had had MULTI_THREAD=0 defined. Recompiling the release build without MULTI_THREAD defined at all, it also does not desync. This is very interesting.

Edit 6: Multi-threading is used for two things: graphics and loading/saving. Graphics will not cause desyncs, but loading/saving may very well if, for example, things are loaded in a different order between client and server.

Edit 7: This is interesting: the desync caused by supplying electricity to towns (which is generally reproducible even with an otherwise empty map with one power station and one town) also cannot be reproduced with the single threaded version.

Edit 8: Trying to run Dr. Memory on the multi-thread debug version with the Tamworth save, I get nothing other than leaks in areas of code shared in common with Standard. Valgrind may be more sensitive, but it is Linux only and I run Windows.

Edit 9: The Tamworth map desyncs in a multi-threaded debug build even when all industries and player vehicles are removed: saved game here for testing.

Edit 10: However, further removing attractions prevents the desyncs in a multi-threaded debug build.

Edit 11: Removing all attractions whilst keeping industry and player vehicles also results in no desyncs on a multi-threaded debug build. I also had a crash when deleting an attraction at one point to prepare this test, but could not track it, as I was not running it in a debugger. The tram-test map, however, does have some attractions, so attractions do not inevitably cause desyncs.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

The threading during load does only the decompression in a seperate therad. However, after world generation all tiles need to be passed once which is done multithreaded only on non-multiplayer build. Maybe here is the error, i.e. world_xy_loop (in simworld) in threaded even for multiplayer games. (Since standard use these too, it could mean that something is added to the sync loops, which would lead to a different order.) Also I think experimental uses prthashtable for something. While debug builds always have the same based address, client build start at random addresses.

jamespetts

That's interesting. I cleared out the ptrhashtables from Experimental a long time ago to make it multi-player compatible, so it's not that, but the single-threaded post-processing after loading for multiplayer games is something that I will need to check. Where in the current Standard code does it check for networkmode for this purpose?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt

Standard uses, and very much works fine with the multithreaded post-load processing single or multi player. i.e. There is no such check to run single threaded in multiplayer.

Given that Experimentals last two desync issues were due to lack of merging a fix from Standard, I again say someone really needs to go through each and every commit in Standard to make sure it's properly merged into Experimental. A huge task, but it's immensely frustrating to spend an inordinate number of hours hunting a bug that's already fixed...

Also, I suspect a multithreading safety assessment was never done on the routines in Experimental that have been modified from standard, but are being called multithreaded. Or even having variables correctly initialized in any modified routines as shown by O01eg's recent fixes.

Junna

Would these loading sequence issues also relate to the strange errors about how certain paks were missing when they clearly were not?

O01eg

Quote from: jamespetts on July 25, 2015, 09:14:51 AM
As to whether actions are required, do you mean whether the player has to take any action in the game before the desync happens? If you mean to ask that, the answer is no: it will desync after a few seconds in the Tamworth map even if no actions are taken.

I'll try to build it with the gcc or clang sanitizers due under valgrind running is too slow for network interaction.

Quote from: jamespetts on July 25, 2015, 09:14:51 AM
In the meantime, can you let me have the .diff files for your fixes?

Yes, but it mostly to keep silent valgrind it should not fix any desync issues. It is better to refactor this changes or to don't use them: https://gist.github.com/o01eg/2f04966a42f451d61895#file-unaligned-allocation-diff

Quote from: jamespetts on July 25, 2015, 09:14:51 AM
Edit: Tested with single-threaded release build: still desyncs.

Edit 2: Tested with single-threded debug build: does not desync.

Edit 3: Tested with multi-threaded debug build: still desyncs.

Edit 4: Incidentally, the single-threaded builds were built without MULTI_THREAD defined at all.

Edit 5: Actually, that is not quite correct. The debug single threaded build had not had MULTI_THREAD defined at all: the release build had had MULTI_THREAD=0 defined. Recompiling the release build without MULTI_THREAD defined at all, it also does not desync. This is very interesting.

Edit 6: Multi-threading is used for two things: graphics and loading/saving. Graphics will not cause desyncs, but loading/saving may very well if, for example, things are loaded in a different order between client and server.

Edit 7: This is interesting: the desync caused by supplying electricity to towns (which is generally reproducible even with an otherwise empty map with one power station and one town) also cannot be reproduced with the single threaded version.

Edit 8: Trying to run Dr. Memory on the multi-thread debug version with the Tamworth save, I get nothing other than leaks in areas of code shared in common with Standard. Valgrind may be more sensitive, but it is Linux only and I run Windows.

Edit 9: The Tamworth map desyncs in a multi-threaded debug build even when all industries and player vehicles are removed: saved game here for testing.

Edit 10: However, further removing attractions prevents the desyncs in a multi-threaded debug build.

Edit 11: Removing all attractions whilst keeping industry and player vehicles also results in no desyncs on a multi-threaded debug build. I also had a crash when deleting an attraction at one point to prepare this test, but could not track it, as I was not running it in a debugger. The tram-test map, however, does have some attractions, so attractions do not inevitably cause desyncs.

Looks like it better to run with -fsanitizer=thread and valgrind --tool=helgrind. Maybe it really better to merge with the upstream.

jamespetts

#27
TurfIt: I have already spent quite a lot of time going through Standard changes since the last automatic merge (and some before then), incorporating most of them (including any relevant bug fixes, and excluding only things that either do not apply to Experimental, or some UI features that will need a huge amount of work to merge). Indeed, because of the degree of divergence, automatic merges are no longer practicable, so all future merging will be done manually: solving merge conflicts in automatic merging became harder than manual merging.

Junna: I doubt that the two are connected, since the consistency checking is a specific piece of code and is not something that affects the synchronisation. Incidentally, I cannot reproduce this with the current development version of Pak128.Britain-Ex (half-heights branch).

O01eg: All of the changes in your .diff are to code that is shared with Standard, so this should not affect the desyncs. I did incorporate your changes of "return false" to "return NULL" however.

What do you mean here by "merge with the upstream;" are you referring to Standard?

Edit 1: Disabling private car routing in the Tamworth map (enabling "assume_everywhere_connected_by_road") does not affect the desync issue: desyncs still occur.

Edit 2: Deleting only country attractions from the Tamworth map, leaving town attractions, does not cure the desyncs.

Edit 3: I think that I have fixed the attraction related bug: the Tamworth map will now run in multi-threaded mode without desyncing. What was happening was that non-town attractions were being added to the world list for passenger generation/destinations (which is Experimental specific, for the new passenger generation system) in the threaded part of the code. I have altered this so that, when it is in network mode, it will instead add them in the non-threaded part of the code: this appears to avoid the desyncs.

Edit 4: I am now having trouble reproducing the electricity related desync - can anyone confirm whether it is still present in the latest code from devel-new?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt

multithreaded calling of add_building_to_world_list() in gebaeude_t::finish_rd() is a perfect example of unsafe multithreading in Experimental. Multiplayer or single. Any changes to such global lists need to done under mutex protection. See a couple lines higher where add_gebaeude_to_stadt() is wrapped in the add_to_city_mutex. Incidentally, city building lists are order sensitive, hence the addition of the 'ordered' parameter in add_gebaeude_to_stadt() to allow multithreaded insertion, yet maintain multiplayer sync. Similar may be possible for lists in Experimental rather than moving the insertion outside the multithreading scope.

jamespetts

#29
Splendid, thank you, that is helpful. I have now fixed that particular thread safety issue, and will look into ordered adding to the world list.

Edit: Ordered adding works! Thank you.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Junna

I can not replicate the stuff with the list of erroneously mismatched pak with a current version, so it appears fixed.

jamespetts

Interesting - thank you for checking.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.