News:

Want to praise Simutrans?
Your feedback is important for us ;D.

Online checklist mismatches (was: Clients keep disconnecting)

Started by DrSuperGood, January 08, 2016, 03:50:48 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

DrSuperGood

Quote
It can bounce around a bit, but should mostly track 0. When -'ve, if a command comes in from a player, you're at risk of a desync.
A desync because of this will show up as :
Surely your client should wait if it is running ahead of the server? I think that is how RTS games do it. They must use some sort of checkpoint system where the client refuses to execute passed a certain time frame when the client is no longer sure the server is still ahead.

To better describe it the server receives a user command, then assigns it a certain time slot. It then sends out a packet to all clients saying that between a start and end time these commands were issued at ceretain times and in a certain order. The client receives the packet and knows it can safely advance until the time specified by the server, after which it has to pause until another such command covering more time is issued. If there is a bad connection a client will get periodic stutter as the game pauses often to wait for packets and then has to advance faster than normal to catch up. Games like Warcraft III and StarCraft II go a step further by potentially pausing the game state to "wait" for a player if that player falls too far behind, that functionality is not appropriate for a game like Simutrans and instead the client running as fast as possible to catch up is probably more friendly.

You trade some input latency (a result of the checkpoints) for an absolutely stable server client system.


Edit: (TurfIt) split topic from http://forum.simutrans.com/index.php?topic=15109.0

Dwachs

I am not aware of any bugs you mention in these posts:
Quote from: DrSuperGood on January 08, 2016, 01:39:12 AM
There are issues with the netcode. Specifically "send to depot" is notorious for causing out of sync problems. That said many of the issues should have been fixed.
Is the 'send to depot' bug fixed with the latest commits addressing uninitialized variables in vehicle routefinding?
Quote from: TurfIt on January 08, 2016, 01:53:48 AM
Note: there are some bugs in 120.1.1 that will cause this. Some are fixed with 120.1.2, ....
Are there more known bugs that are not fixed in 120.1.2? I am more than happy to hunt them down.
Parsley, sage, rosemary, and maggikraut.

DrSuperGood

Quote
Is the 'send to depot' bug fixed with the latest commits addressing uninitialized variables in vehicle routefinding?
I have no idea, I have not been following that bug. I only really knew about it from personal experience and some chat messages.

Dwachs

There is indeed a desyncing bug in the go-to-depot code.
Parsley, sage, rosemary, and maggikraut.

TurfIt

My observations with the go-to-depot are if a train is waiting at a signal to enter a tile, and the train occupying that tile is then told to goto depot, and it 'jumps' when doing so, then frequently on the client, the waiting train proceeds through the signal, the two trains overlap, and sometime later a checklist mismatch desync is triggered.  If I manually reconnect to the server immediately to see its state, I see that waiting train is actually still waiting for the depot bound one to clear. I've not been able to duplicate under controlled circumstances with a graphical server and client both running...

In general, it appears this, and several other issues all have block reservations in common as the desync culprit.

Phantom signal - using the track way removal tool to disconnect a piece of track from an adjacent one containing a signal sometimes results in the signal vanishing from view. It also does not show up as an object when clicking on the tile. Trains proceeding over the spot where the signal was enter a state where they reserve tile by tile rather than to the next signal. Often a save/reload clears this phantom signal. Bulldozing everything on the tile, and rebuilding the track results in the same behaviour returning. Building a signal on the tile is allowed, but it never clears (goes permissive). Deleting the signal does not fix things.

Trains running red signals - once a train is in a tile by tile reservation mode, it seems to ignore all further signals server side - sometimes client side.  You can watch it follow a train - it's always one reserved block back in the same signal section. Once it reaches the next station on its schedule, it returns to normal behaviour obeying signals. If the platform at the station is occupied, and it's a terminal station (same way in/out), then it will block the exit and deadlock the system.
Usually you don't see this deadlock client side, instead the train is correctly waiting at the signal. Reconnect to the server, and suddenly there's a train sitting where it's impossible to be except by running a red.

If a train has a path reserved from a signal to the next, and you delete the next signal (bulldoze) before it goes permissive, the train will enter the tile-by-tile mode. Client side after a couple blocks of this tile-by-tile, it reverts to signal-signal reservations. Server side it seems to frequently stay stuck in tile-by-tile, ultimately leading to the deadlock above.

Anything that results in a train jumping and redoing it's reservations is also likely to trigger a desync. Editing schedules, especially editing the line schedule is quite dangerous. If you add a new waypoint, quite often trains change their destination to the waypoint even when they're already beyond it in the schedule. The resultant jump when they turnaround results in behavior much like when they jump due to the goto depot command. Related, it would be nice if vehicles were smarter about line schedule changes so they didn't turn back... it's a pain to have to visit every single convoi in a line, and open its individual schedule to correct the current destination after you add/remove a waypoint... That would minimize jumping too...



Dwachs

The bug with the go-to-depot button was: immediately upon pressing the button, a route-search was triggered. This route-search immediately cleared reserved tiles and reserved stop positions. On all other clients the effects were only visible after the convoi in question got its new schedule. I tried to fix this with r7724-7726. The attempted fix was to trigger the route search on all clients synchronously. Changing the route-search to not change state of the convoi seems to be impossible as the route search for airplanes changes state of the convoi (flying vs on ground).


Will have to check the strange behavior you mention in connection with signals.
Parsley, sage, rosemary, and maggikraut.

Dwachs

I could not reproduce phantom signals: If the signal object is gone, it is gone. The way is clean and free. I cannot imagine that this produces strange behavior in network mode.

I could reproduce the step-by-step reserving of trains if signals are removed. However, this behavior reverts back to signal-to-signal reservation at the tile that follows the tile, where the signal was removed.
Parsley, sage, rosemary, and maggikraut.

Ters

I have observed trains doing tile-by-tile reservations, although I don't remember the circumstances, since it has been a while now. It lasted quite a while, I think, probably until the next waypoint, or perhaps rather the next time route is calculated. Changing the next destination back-and-forth in the schedule would cure it.

Dwachs

I saw the step-by-step reservation but only triggered by removing signals. When the train arrives at the tile after the removed signal, it proceeds with signal-signal reservations. Then it never travels through red signals.

Maybe I miss something, but I would love to have a procedure to get step-by-step reservation over more than two tiles.
Parsley, sage, rosemary, and maggikraut.

Ters

I notice that when removing the signal the train is waiting at, the train starts moving, reserving tile-by-tile, but the train reserves its entire length. In the cases I remember, the train only reserved a single tile at the front. It may be a bug that has been fixed by now.

Dwachs

Removing the signal the train is waiting at leeds to this behavior if the train cannot reserve the next block. Confirmed.
Parsley, sage, rosemary, and maggikraut.

TurfIt

Removing the signal the train has reserved a path upto, but not yet reached results in the tile-by-tile behaviour if there's no more signals between it and the destination.

Connecting to a mulitplayer server and observing what's in the screenshot is far too common.

I did catch a train moving down the tracks ignoring all signals once. It ended up exactly as shown (this was a recreation). I did not however catch it originally enter the tile-by-tile mode. I do know track construction was performed along its route, hundreds of tiles away from where it finally ended up. The construction changed signals, but not the final track layout. i.e. The train did not have to re-search the route to continue.

I wish there was a nice procedure to induce this. I do note server and client quite often differ with one of them having the trains tile-by-tile, the other not.

prissi

Hmm, there had been once code to unreserve for the train and trigger a new search.

Ters

Quote from: TurfIt on January 10, 2016, 08:13:29 PM
I did catch a train moving down the tracks ignoring all signals once. It ended up exactly as shown (this was a recreation). I did not however catch it originally enter the tile-by-tile mode.

I think my trains have gone completely "car-ish" like that on several occasions. The sad part is that I believe I saw it starting, but I apparently didn't investigate or report anything back then, and now I've forgotten the details.

TurfIt

I just induced many trains to all enter the tile-by-tile / ignore all signals mode by using the signal dragging tool to change from 12 spacing to 8.

DrSuperGood

Quote
12 spacing to 8.
Could be the removal or addition of a signal at a certain stage in the journey. For example if the train is already interacting with a signal and it gets removed or if a train is passing through a block which gets a signal suddenly.

Dwachs

Quote from: TurfIt on January 10, 2016, 08:13:29 PM
I did catch a train moving down the tracks ignoring all signals once. It ended up exactly as shown (this was a recreation). I did not however catch it originally enter the tile-by-tile mode. I do know track construction was performed along its route, hundreds of tiles away from where it finally ended up. The construction changed signals, but not the final track layout. i.e. The train did not have to re-search the route to continue.
This happens if the signals were constructed after the train entered the block up to the station. Combined with the step-by-step reservation if a signal at the end of the reserved block is removed, this can lead to the situation in the screenshot.

I think I have a patch for this, will post it tonight.

However, I do not understand how this would lead to desyncs. While this behavior is weird and annoying, it should occur synchronously on all clients/server.
Parsley, sage, rosemary, and maggikraut.

TurfIt

Putting the trains into tile-by-tile ignore reds mode with the signal dragging tool results in the pictured deadlock with trains that are many 10's of tiles away from the signal block at the station. This one does appears to occur synchronously, I presume there's more ways to induce that result in the desyncs. My suspicions still point towards modifying signals with the track removal tool.



Forgot the other desync that's repeatable: Using the raise/lower land tools is blocked on the server when you're bankrupt. The tool executes on the client however.

Dwachs

Tried to fix the desync with terraforming in r7730.

Comitted a fix to the train signalling bug in r7731. If a signal vanishes, the train will try to reserve the next block. If this fails the train will stop without signal. Maybe we can allow it to drive on up to the next crossing. Everything else would result in the deadlocks as mentioned above.

Please test and report back.
Parsley, sage, rosemary, and maggikraut.

TurfIt

Once again one of the game instances on my server has started throwing checklist mismatches immediately upon joining. And also again resaving the server.sve on a Windows client solves the problem.

The client was connected and idling for ~30 mins before the desync. No one joined, no commands issued. Rejoining gave the checklist mismatches immediately until the server save was replace with the resaved one. Restarting the server with the original save results in the mismatches continuing.

The troublesome .sve works with no issue with a Windows server (both graphics and posix) and client, all 32bit on 64bit OS. But the Windows client connecting to the BSD64 server disconnects immediately upon the first checklist check - 100% repeatable. Saves FWIW: desyn128_2.7z

Issues with BSD? 32 vs 64? Something else? Very frustrating...

EDIT:  BSD server was built DEBUG=3, OPTIMIZE=1.  Windows server and client are #DEBUG, OPTIMIZE=1. Changing the Windows server to DEBUG=3 results in the same checklist desync now...

DrSuperGood

Quote
EDIT:  BSD server was built DEBUG=3, OPTIMIZE=1.  Windows server and client are #DEBUG, OPTIMIZE=1. Changing the Windows server to DEBUG=3 results in the same checklist desync now...
As far as I could tell the DEBUG and OPTIMIZE compiler flags were mutually exclusive, especially as far as version logic goes.

It is possible this is a GCC bug as if you are building "optimized" then it is using very different logic from "un-optimized" to build with.

It would be very useful if servers/clients were able to perform an "autopsy" when they out of sync from a checklist mismatch. This would flag up which objects the client has out of synced with, possibly even what fields. Might be completely impractical to implement though and is not useful outside of debugging.

Ters

Quote from: DrSuperGood on January 20, 2016, 12:32:09 AM
As far as I could tell the DEBUG and OPTIMIZE compiler flags were mutually exclusive, especially as far as version logic goes.

It is possible this is a GCC bug as if you are building "optimized" then it is using very different logic from "un-optimized" to build with.

DEBUG=3 is mutually exclusive with OPTIMIZE, since it adds -O0 after OPTIMIZE's -O3. Lower DEBUG levels can coexist with OPTIMIZE, although DEBUG=2 does turn off inlining.

In addition to possible GCC bugs, we must also consider the possibility that some Simutrans code has crossed into undefined behavior territory.

prissi

The only thing causing different logic would be a calculation only done for a debug message (or an assert). I though we ironed those out years ago :(

Ters

Quote from: prissi on January 20, 2016, 10:12:04 PM
The only thing causing different logic would be a calculation only done for a debug message (or an assert). I though we ironed those out years ago :(

Other possibilities are something like this (not exactly different logic):


#include <cstdio>

int main() {
int value;
printf("%i\n", value);
return 0;
}


For me, this prints 0 when compiled with -O3 and some other "random" but constant value when compiled with -g. GCC generates a warning about this, and the only such warning I get when building Simutrans now is from squirrel code, but there might be more stealthy ways of doing something similar.

Dwachs

@Ters: such errors can be found with valgrind. I am pretty sure that this kind of uninitialized variables does not happen anymore.

The cause could also be static variables: They are initialized, their value is not random, but their value depends on the savegame that was loaded before the current game.

Parsley, sage, rosemary, and maggikraut.

Ters

Quote from: Dwachs on January 21, 2016, 07:37:51 AM
@Ters: such errors can be found with valgrind. I am pretty sure that this kind of uninitialized variables does not happen anymore.

The cause could also be static variables: They are initialized, their value is not random, but their value depends on the savegame that was loaded before the current game.

How often are networked games run in valgrind? And since there is a difference between optimized and debug builds, some value lingering in static fields seems less likely to be the reason for this.

prissi

MSVC warns at compile time about all unititialised variable. Furthermore debug builds of simutrans with MSVC would report accessing unitialise variables. So I think that is not an issue, or at least it is one variable which is allocated.

Ters

Quote from: prissi on January 21, 2016, 10:53:34 PM
MSVC warns at compile time about all unititialised variable.
As I wrote: GCC also warns, and it does issue a warning when building Simutrans (although not in a part that seems relevant here). There used to be more, I think, so someone has been doing a good job weeding them out.

Quote from: prissi on January 21, 2016, 10:53:34 PM
Furthermore debug builds of simutrans with MSVC would report accessing unitialise variables.
Exactly the kind of difference between release builds and debug builds I was talking about, although the "reporting" might just be that uninitialized variables are filled with an easily recognizable value when debugging.

Quote from: prissi on January 21, 2016, 10:53:34 PM
So I think that is not an issue, or at least it is one variable which is allocated.
I wrote that if it was an uninitialized variable thing, it would be more hidden than in my simple example showing how release builds and debug builds can differ in behavior when invoking undefined behavior.

Not that this is to only possible reason, but I generally consider most things more likely than a compiler bug, which is the only other hypothesis on the table. Especially given how little we have to go on so far.

prissi

Another thing is that allocated memory is zeroed in debug builds (or with a defined values) while it is random on release builds.

jamespetts

These are painful to resolve, I know from experience. I took four months to find one once, with much help from Bernd Gabriel. Release/debug differences I have found are often down to undefined behaviour somewhere (undefined variables are usually easy to track down; these issues tend to be more subtle).

One (rather blunt) technique that I have found that does work to narrow the problem down a little is to disable parts of the code one by one while testing to see which area of the code contains the problem. (This is less effective, of course, when there are multiple and diverse causes). Logging all calls to simrand() can help, too: there is some code in Experimental to do this (enabled by a preprocessor definition) which you might want to look at to help with this. This can be useful in tracking down exactly where the random numbers diverge, which can help to narrow the problem down a little.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt

Is there anything in r7731 that could've fixed this?

Current symptoms - using the save previously linked, and a posix debug server (both Windows and BSD), graphical debug and non-debug clients (Windows) desync immediately with 7730 and older, remain connected with 7731. Graphical servers (Windows), and non-debug posix servers (Windows and BSD) work with any version.

Resaving the game with any of the working configurations results in it working also with the posix debug servers (past experience is the problem will reoccur in several days if the game is left to run and with clients joining/leaving). Resaving the game with the posix debug server results in the problem persisting.

I somewhat expect this is just a fluke with the newer versions rearranging things enough to mask the problem... If 7731 hasn't actually fixed it, then I really don't know how to even start tracking this down; Attempting to add some debugging code will also likely change the .exes enough to mask it. But first, can anyone replicate the behaviour?


DrSuperGood

It does have to do with block reservations, so it is possible it was trying to touch invalid values before I guess (since the logic was not working as intended?). One also cannot rule out that that small change in logic has just spontaneously "fixed" the problem. Without knowing the exact cause there is no guarantee that it will not re-appear later with another change that spontaneously breaks it.

Dwachs

r7731 was targetted to cure the step-by-step reserving of track. It is just a change in reserving behavior.

However this change broke choose signals, i.e., trains waiting at choose signals treated the signal as normal signal and thus were not choosing free platforms. You might want to try out if the desync come back in r7742, which fixed this.

I could not reproduce the desyncs with graphic/no-graphic, optimized/non-optimized, gcc 4.6/clang 3.0 builds (with recent r7744).
Parsley, sage, rosemary, and maggikraut.

TurfIt

I'm aware 7731 wasn't trying to necessarily fix desyncs, but it's possible the game has a convoi doing a step-by-step somewhere (unlikely given no players had done actions for a quite a while {but then one player has his entire network jammed for a decade...}).

Using a posix debug server, game desyncs with a non-debug graphical client for 7720, 7723, 7727, 7729, 7730. Does not desync for 7731, 7732, 7736, 7740, 7746. 100% repeatable.
A debug client will sometimes desync, sometimes not. Graphic servers never have clients desyncing. Posix non-debug, also works.

A posix debug server, graphic non-debug client, built in mingw w/ gcc4.6.2 also works.  With 5.3.0 desyncs. The BSD servers were compiled with 4.9.4, and exhibit the desyncs.
Presumably the stone age 4.6 optimizer doesn't trip over whatever bad code we have... Or a gcc bug persisting through many very different versions... but bad Simutrans is more likely.

So, can anyone replicate the desyncs not using an ancient compiler?
(and note that after a client successfully stays joined, you have to revert to the original server13356-network.sve file since resaving it with a working server during the join process results in it continuing to work with the non-working setup. i.e. use the original save everytime starting the server, and restart the server after every join.)

Dwachs

Could replicate the immediate desync locally with r7730

-server: posix, non-debug, optimized, gcc-4.6
-client: graphical, non-debug, optimized, gcc-4.9

Edit: r7744 does not give the immediate desync.

Not sure what this story tells us.

Edit2: Using the new gcc compiler option -fsanitize=undefined to enable UndefinedBehaviorSanitizer does not give any clue. The reported undefined behavior is in simgraph16.....

Edit3: The good news are that this desync is completely reproducible, and even adding my desync-debugging code does not alter the outcome. It seems to be indeed related to trains and reserving: there is a train that goes from loading to driving state on the server, but to waiting on the client.
Parsley, sage, rosemary, and maggikraut.