The International Simutrans Forum

Simutrans Extended => Simutrans-Extended bug reports => Simutrans-Extended development => Simutrans-Extended closed bug reports => Topic started by: freddyhayward on August 19, 2020, 09:59:26 PM

Title: Losses of synchronisation in online play
Post by: freddyhayward on August 19, 2020, 09:59:26 PM
James, can you update as to whether you are still attempting to set up logging for the server including checklist mismatches at desyncs? This will be very useful in my opinion and I'm sure Freahk will help with any difficulties with the script.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 19, 2020, 10:39:56 PM
Quote from: freddyhayward on August 19, 2020, 09:59:26 PM
James, can you update as to whether you are still attempting to set up logging for the server including checklist mismatches at desyncs? This will be very useful in my opinion and I'm sure Freahk will help with any difficulties with the script.

I had not looked into this for a while - I will have to spend a considerable time getting this working; but I am grateful for your offer of assistance.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 19, 2020, 10:52:02 PM
Quote from: jamespetts on August 19, 2020, 10:39:56 PM
I had not looked into this for a while - I will have to spend a considerable time getting this working; but I am grateful for your offer of assistance.
To be clear, I have not offered assistance myself! I meant that Freahk, whose restart script you have seen, will probably be willing to help.
Title: Re: Losses of synchronisation in online play
Post by: Mariculous on August 20, 2020, 12:38:17 AM
Sure I will help.
If it takes too much of your time, just send your script to me and I might add the logging stuff.
Title: Re: Losses of synchronisation in online play
Post by: Matthew on August 20, 2020, 07:47:28 AM
I have the same unhappy symptoms reported by others last night. Frequent desyncs and I just had a non-pause freeze (time does not tick, animations do not move and the fps appears to go through the roof) for the first time.

(http://i.imgur.com/SkSLZ5F.png)

In the latter case, I was only reading chat and the Path Explorer was not running. The desyncs happened both in that situation and when I was doing some signalling.

This is on Windows, updated using the Nightly Updater to today.

(I have not posted this as a bug report because I have no reproducible case.)
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 20, 2020, 12:04:45 PM
I have split the discussion relating to losses of synchronisation from the general server topic, as this is a discussion about a problem with the game rather than something specific to the Bridgewater-Brunel server.

I am looking into logging at the present. However, this is extremely complex: the restart script that I currently use is one that I adapted slightly from one written many years ago. That script is quite sophisticated, and I have made only minimal alterations to it. I have very limited familiarity with shell script syntax, so I do not fully understand how this script works.

In previous versions of the game, the logging worked correctly. However, in more recent versions, and without, to my recollection, any changes to this shell script, it stopped working.

The current restart script is as follows:

#!/bin/sh

VERBOSE=1
# Number of seconds to allow for server to start/stop
startstopgracetime=5

pathtosimutrans=/usr/share/games/simutrans-extended

log() {
    if [ "$1" != "ERROR" ]; then
        if [ "$VERBOSE" -eq "1" ]; then
            echo "$1: $2" >&2
        fi
    else
        echo "$1: $2" >&2
    fi
}

show_usage() {
    echo "Usage: simctrl <tag> {status|start|restart|check|stop}" >&2
}

# Check we are running as root
if [ `id -u` != "0" ]; then
        log "ERROR" "You need to be superuser"
        exit 1
fi

# Check command line parameters, there must be exactly two of them
if [ "$#" -ne "2" ]; then
    log "ERROR" "Wrong number of arguments"
    show_usage
    exit 1
fi

tag=$1
action=$2

pidfile="$pathtosimutrans/$tag.pid"
configfile="$pathtosimutrans/etc/$tag.config"

if [ -f $configfile ]; then
    . $configfile
else
    log "ERROR" "No config file available for tag: $tag at location: $configfile"
    exit 1
fi

# Need to check that the following variables are actually present (e.g. that config file is valid)
if [ -z "${port:+x}" ]; then
    log "ERROR" "Config, parameter: 'port' not set for tag: $tag"
    exit 1
fi
if [ -z "${revision:+x}" ]; then
    log "ERROR" "Config, parameter: 'revision' not set for tag: $tag"
    exit 1
fi
if [ -z "${objects:+x}" ]; then
    log "ERROR" "Config, parameter: 'objects' not set for tag: $tag"
    exit 1
fi
if [ -z "${save:+x}" ]; then
    log "ERROR" "Config, parameter: 'save' not set for tag: $tag"
    exit 1
fi
if [ -z "${server_id:+x}" ]; then
    log "ERROR" "Config, parameter: 'server_id' not set for tag: $tag"
    exit 1
fi
if [ -z "${server_name:+x}" ]; then
    log "ERROR" "Config, parameter: 'server_name' not set for tag: $tag"
    exit 1
fi
if [ -z "${server_comment:+x}" ]; then
    log "ERROR" "Config, parameter: 'server_comment' not set for tag: $tag"
    exit 1
fi
if [ -z "${debug:+x}" ]; then
    # Debug defaults to 2
    debug=2
fi


# We do not use the $revision system at present for the Extended server.

simpath="$pathtosimutrans/simutrans-extended"
#simpath="$pathtosimutrans/$revision/simutrans-extended"

simutrans_stop() {
    # Check if process is running
    # $pid will be either the pid, or true value
    PID=$(pidof_simutrans)

    if [ "$PID" -gt "0" ]; then
        # PID found and it is running, take steps to stop the server

        logger -t $tag Attempting to stop server...
        echo -n "Attempting to stop simutrans instance: $tag ($PID)."

        kill $PID

        dead=0
        count=0

        while [ $dead -eq "0" ] && [ $count -le $startstopgracetime ]
        do
            sleep 1

            # Update status
            echo -n "."

            # Increment count by 1
            count=$(($count+1))

            # Check if process is still running, pidof_simutrans returns -2 if it cannot find the pid
            # it will also remove the pidfile in this instance
            if [ $(pidof_simutrans) -ne "-2" ]; then dead=1; fi
        done

        # Finish up status with newline
        echo ""

        if [ $dead -eq "0" ]; then
            log "ERROR" "Kill didn't work, server is still running"
            logger -t $tag Server failed to stop!
            exit 1
        fi

        logger -t $tag Server successfully stopped
    else
        log "INFO" "Server not running, not performing stop action."
        echo "Server is not running"
    fi
}

simutrans_start() {
    # Check if this tagged instance is already running by checking PID file
    PID=$(pidof_simutrans)
    if [ "$PID" -eq "-2" ]; then
        log "INFO" "Pidfile existed at: $pidfile, but no process with that ID was running. Continuing with startup of instance: $tag"
    elif [ "$PID" -gt "0" ]; then
        log "ERROR" "Pidfile exists at: $pidfile and a process is running with that ID, cannot start duplicate instance of server: $tag"
        exit 1
    fi

    echo -n "Attempting to start Simutrans-Extended instance: $tag."
    logger -t $tag Attempting to start server...

    # Start up the server; running as non-root user disabled for the time being, as, oddly, when run as non-root, nobody can connect to the server.
    if [ $1 = "start" ]; then
        /bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects -load $save 2>&1 & echo \$! >&3 ) 3>$pidfile | logger -i -t $tag &"
    elif [ $1 = "restart" ]; then
        /bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects 2>&1 & echo \$! >&3 ) 3>$pidfile | logger -i -t $tag &"
    fi

    alive=0
    count=0

    while [ $alive -eq "0" ] && [ $count -le $startstopgracetime ]
    do
        sleep 1

        # Update status
        echo -n "."

        # Increment count by 1
        count=$(($count+1))

        # Check if process is running yet, pidof_simutrans returns > 0 (the PID) if it is running
        if [ $(pidof_simutrans) -gt "0" ]; then alive=1; fi
    done

    # Finish up status with newline
    echo ""

    # Check if simutrans has actually started
    if [ "$alive" -eq "0" ]; then
        log "ERROR" "Startup appears to have failed, no process is running with the pid specified in the pidfile for this instance ($pidfile)"
        logger -t $tag Server startup failed
        exit 1
    else
        log "INFO" "Server startup appears to have worked, instance: $tag is now running with pid: $(pidof_simutrans)"
        logger -t $tag Server startup script completed
    fi

    # Return the pidfile
    cat $pidfile
}

pidof_simutrans() {
    # If pidfile exists and there is a simutrans process with the pid specified
    # print the pid and return 0
    if [ -e "$pidfile" ]; then
        pid=`cat $pidfile`
        kill -0 $pid 2>/dev/null
        if [ "$?" -eq "1" ]; then
            log "INFO" "Pidfile specifies a process that doesn't exist (pid: $pid), removing pidfile"
            rm $pidfile
            echo -2
        else
            log "INFO" "Pidfile found, pid is a running process (pid: $pid)"
            echo $pid
        fi
    else
        log "INFO" "Pidfile not found, this process either doesn't exist or wasn't started with this script"
        echo -1
    fi
}


case $action in
    status)
        # Check status of tagged instance
        PID=$(pidof_simutrans)

        if [ "$PID" -gt "0" ]; then
            echo "Simutrans instance: $tag is currently running with pid: $PID"
            echo "Output of \"ps uww -p $PID\":"
            ps uww -p $PID
        else
            echo "Simutrans instance: $tag is not currently running."
        fi
    ;;
    start)
        simutrans_start "start"
    ;;
    stop)
        simutrans_stop
    ;;
    restart)
        # If server running, stop it first
        simutrans_stop

        simutrans_start "restart"
   ;;
    check)
     # Check status of tagged instance
        PID=$(pidof_simutrans)

        # Check whether the server is running, and restart it if not.
   date
        if [ "$PID" -gt "0" ]; then
            echo "Simutrans instance: $tag is currently running. No action necessary."
        else
            echo "Simutrans instance: $tag is not currently running. Restarting (after archiving the log)..."
      now=$(date +"%m_%d_%Y")
      filename_log="/var/log/simutrans/simu-server13353-$now.log"
      logpath="$simpath/simu-server13353.log"
      cp "$logpath" "$filename_log"
      simutrans_start "restart"
        fi
    ;;
    *)
        echo "ERROR" "Invalid command"
        show_usage
    ;;
esac

The critical line appears to be:
/bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects 2>&1 & echo \$! >&3 ) 3>$pidfile | logger -i -t $tag &"
I am not sure why this does not work.
I am grateful to Freahk for sending his logging code, but I am having trouble working out any way to integrate this into the existing script, as the latter is very complex and it is difficult to for me to understand that complexity with a limited understanding of shell script syntax, and thus difficult for me to understand what details are significant and what are not.

I should therefore be grateful for Freahk's assistance as offered. Thank you in advance.

Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 20, 2020, 04:30:52 PM
I notice that Ceeac has just pushed some fixes for possible undefined behaviour and uninitialised variables, which I have incorporated. The code in question, I believe, originates in Standard. I am not sure whether this will assist, but I should be interested in any reports on to-morrow's nightly build as to whether the losses of synchronisation are notably less frequent or not.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 22, 2020, 02:33:37 PM
I have been testing and have been unable to reproduce this loss of synchronisation locally with a server running an optimised debug Visual Studio build and one identical client connecting and remaining connected without any interaction.

However, when I log into the Bridgewater-Brunel server, I find that I do lose synchronisation quite frequently. It would very much help me in narrowing down the issue to know the following:

(1) has anyone suffered losses of synchronisation when alone on the server and not changing anything;
(2) has anyone suffered losses of synchronisation when alone on the server and changing things; and
(3) what things were being changed when people lost synchronisation?

Unfortunately, I cannot test whether there are any losses of synchronisaiton on the Stephenson-Seimens server, as this is offline as the list server is currently down. Being able to test this would help to narrow down whether the losses in synchronisation on the Bridgewater-Brunel server arise from some state of affairs that is present on that server and not on the Stephenson-Seimens server, or whether instead it arises from some change in the code recently that affects all servers.
Title: Re: Losses of synchronisation in online play
Post by: Huitsi on August 22, 2020, 03:00:30 PM
I have certainly had desyncs while alone. As for whether I was doing anything those times, I'll have to pay attention next time. In general I have had desyncs both while doing nothing (often when I'm AFK or just observing things like right now), when I'm just about to do something (picked a tool but not used it yet) as well as after doing something. The non-server-crash desyncs have never, AFAIK, caused any loss of my actions, so if I have just build a signal, for instance, it's there when I rejoin.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 22, 2020, 03:14:12 PM
Quote from: jamespetts on August 22, 2020, 02:33:37 PM(1) has anyone suffered losses of synchronisation when alone on the server and not changing anything;
Yes, very frequently.
Quote from: jamespetts on August 22, 2020, 02:33:37 PM(2) has anyone suffered losses of synchronisation when alone on the server and changing things; and
Yes, very frequently.
Quote from: jamespetts on August 22, 2020, 02:33:37 PM(3) what things were being changed when people lost synchronisation?
Apart from doing nothing, almost anything. Building/removing ways, terraforming, editing lines and schedules, buying vehicles, selling vehicles, sending vehicles to the depot, placing/deleting signs or signals.
Note that these desyncs are not new, so there is no possibility of recent code changes being the cause. Again, I think that checklist mismatch messages will be useful. How exactly has the difficulty in logging presented? Is there an error, or does it simply not create a log file at all? Is there any output at all (e.g. echo statements from the script) that you are able to view?
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 22, 2020, 03:56:46 PM
Thank you for the feedback: that is helpful. I have been undertaking some of my own testing on the server, including building a small railway to test losses of synchronisation on interaction.

As to the logs, I have realised that the problem was that the log file was being created in the wrong place. I have now found it and have uploaded the latest logs to here (http://bridgewater-brunel.me.uk/misc/simu-server13353-22-aug-2020.log). Please note that this is a ~200mb text file.

In relation to the statement that the losses of synchronisation are not new, may I ask when people first started to notice them more often than about once a day? Do they occur on the Stephenson-Seimens server (when it is running), or only on the Bridgewater-Brunel server?
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on August 22, 2020, 04:21:45 PM
Quote from: jamespetts on August 20, 2020, 12:04:45 PM
The current restart script is as follows:
Why so complicated? .... supervisord/crond and done. Supervisor handles starting, stopping, log rotating, and monitoring process running. A simple cron task to force a sync and have a daily save archive, or update a nightly program version, etc.


Quote from: jamespetts on August 20, 2020, 12:04:45 PM

# Check we are running as root

Right there, I'm done with this script. Running Simutrans as root? That server needs cleansing with fire, stat.
In no way shape form can Simutrans be deemed safe to run as root.

Quote from: jamespetts on August 20, 2020, 12:04:45 PM

        /bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects -load $save 2>&1 & echo \$! >&3 ) 3>$pidfile |

There are no levels to -log. What is the intent of '-log 2'?
'| logger'  Simutrans has built in syslog support (compile time enable required) if you want the logs there, but it looks like the script trying to combine Simutrans normal logging (simu-serverXXX.log) with syslog?
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 22, 2020, 05:41:23 PM
I did not write the script - I adapted it slightly from one written by somebody else who no longer visits these forums many years ago. As to running in root, the reason for that, I believe, is that when it is set to run as a non-root user, nobody was able to join the server.
Title: Re: Losses of synchronisation in online play
Post by: Matthew on August 22, 2020, 08:15:13 PM
Quote from: jamespetts on August 22, 2020, 02:33:37 PM
(1) has anyone suffered losses of synchronisation when alone on the server and not changing anything;
(2) has anyone suffered losses of synchronisation when alone on the server and changing things; and

I think yes to both, but I wasn't specifically looking out for it.

And I think I've also had two reasonably long periods alone without desynchs. Just now I was on the server, changing things, and didn't desynch for at least an hour. The server said that 2 clients were connected, but no one replied to my chat message, so I think I was alone. When someone else did definitely connect, I lost synch within 10 minutes. I think that (~an hour without desyncs) has happened once before in the last ten days. I'm not sure at one point in the various changes it was, but I remember being surprised that I lasted so long without a desynch.

Quote(3) what things were being changed when people lost synchronisation?

As others have said, everything. I just can't notice any pattern here. But I will continue trying to spot one, as I am sure everyone else is.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 22, 2020, 10:49:20 PM
Quote from: jamespetts on August 22, 2020, 03:56:46 PMAs to the logs, I have realised that the problem was that the log file was being created in the wrong place. I have now found it and have uploaded the latest logs to here. Please note that this is a ~200mb text file.
Thank you for uploading this! I have located two mismatch warnings in the file:
Warning: karte_t::process_network_commands: kicking client due to checklist mismatch : sync_step=413158 server=[ss=413158 st=25822 nfc=6 rand=1232380772 halt=5216 line=1025 cnvy=8193 ssr=2417508643,1232380772,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0
initiator=[ss=413158 st=25822 nfc=6 rand=2159140440 halt=5216 line=1025 cnvy=8193 ssr=1442281764,2159140440,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0

Warning: karte_t::process_network_commands: kicking client due to checklist mismatch : sync_step=415405 server=[ss=0 st=0 nfc=0 rand=0 halt=0 line=0 cnvy=0 ssr=0,0,0,0,0,0,0,0 str=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 exr=0,0,0,0,0,0,0,0 sums=0,0,0,0,0,0,0,0
initiator=[ss=415405 st=25962 nfc=13 rand=3700814870 halt=5216 line=1025 cnvy=8193 ssr=3021926069,3700814870,0,0,0,0,0,0 str=4158350141,4158350141,4158350141,4158350141,4158350141,3425583775,3425583775,3425583775,3425583775,3425583775,3425583775,1565161018,3974084165,146709321,17347525,3425583775 exr=0,0,0,0,0,0,0,0 sums=767240836,1331425206,0,0,0,0,0,0

The causes of both of these (though, not their location in the code) are immediately clear:
1. in the first case, the problem seemed to be inconsistent random number generator behaviour. Either the generator itself has a problem, or it was not called at the same time for the server and client.
2. In the second, the server's checklist is zeroed while the client's is valid, indicating that there is somewhere in the code where the server clears its checklist but the client does not - this is a similar case to the one I had previously fixed, and should be slightly easier than the above.
I will look into this further and see if I can identify any fixes.
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on August 22, 2020, 11:00:16 PM
Quote from: freddyhayward on August 22, 2020, 10:49:20 PM
1. in the first case, the problem seemed to be inconsistent random number generator behaviour. Either the generator itself has a problem, or it was not called at the same time for the server and client.
No, it shows the client and server executing diverging code paths resulting in the random number generator being called a different number of times.
Since the str= lists match server and client, the calls from anything in ::step() are ok. The first ssr= number shows the mismatch - something in the ::sync_step() routines is going wrong. This would typically be vehicle movement - citycars movement. The problem is trying to figure out what has put the vehicles in different spots client vs server. It could be a player convoi that's left sooner and got in the way of a citycar on one, or many many other possibilities. Hence very hard to track these problems down.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 22, 2020, 11:14:07 PM
Quote from: TurfIt on August 22, 2020, 11:00:16 PM
No, it shows the client and server executing diverging code paths resulting in the random number generator being called a different number of times.
Since the str= lists match server and client, the calls from anything in ::step() are ok. The first ssr= number shows the mismatch - something in the ::sync_step() routines is going wrong. This would typically be vehicle movement - citycars movement. The problem is trying to figure out what has put the vehicles in different spots client vs server. It could be a player convoi that's left sooner and got in the way of a citycar on one, or many many other possibilities. Hence very hard to track these problems down.
I don't think this particular case is directly related to convoy movement, since the convoy debug sums are matching. However your first statement is probably correct.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 23, 2020, 12:00:30 AM
Matthew - thank you for your feedback/data: this is very helpful. I have started a small railway company on the server to test this, and have found no particular relationship between such interactions as I have been making (albeit relatively modest of late as I have relatively modest profits so far) and losses of synchronisation. There might be a higher rate of losses of synchronisation when other players are online, but the pattern is not strong enough for me to be sure of this. There can be long periods without any loss of synchronisation, and losses of synchronisation within minutes of logging in.

Thank you to Freddy for analysing the logs: this is most helpful. TurfIt is correct about at least the first of the two types of loss of synchronisation: this type originates from diverging code paths and is fantastically difficult to track down, as this can arise virtually anywhere in the simulation code. The second type is more curious: I do not know what might cause a server to reset its checklists; I have not modified code relating to this, so I do not really know how this code works. Does anyone who knows anything about the checklist code have any idea whether there are any circumstances in which it ought to be resetting like this?

Would it be helpful if I were to symlink the live log file to a place that you can read it so that you can analyse this dynamically?

Edit: The convoy check sums matching does not tell us much: those numbers only tell us how many convoys that there are and what handle numbers that they have been allocated, not what they are actually doing. An almost unimaginably vast number of things relating to convoy movement might cause this sort of loss of synchronisation, including (amongst many, many, many other things) private cars, railway signalling, loading behaviour (including loading times, divergence in which might be caused in turn by divergence in passenger numbers, which might, in turn, be caused by something else, and so on for an arbitrary number of steps), physics, industry production (which would affect loading), road junction logic, pathfinding, and a myriad other things.

The only way that I have found to track down these problems is to reproduce the problem locally then disable game features one by one until the problem can be narrowed down to one specific part of the code; or alternatively, remove player networks one by one until the problem can be narrowed down to a specific piece of in-game infrastructure. Unfortunately, this method is only effective when the losses of synchronisation occur with a high frequency. I know of no workable method of finding losses of synchronisation that occur at such a low frequency as these have been reported, especially since I have not been able to reproduce it at all locally, since there is no reliable way of telling whether disabling the feature or removing the infrastructure has suppressed the problem, or whether there just happen not to be any losses of synchronisation in any event for the last hour.

It might be that the only thing to do is wait for the losses of synchronisation to become so frequent that they are more readily reproducible.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 23, 2020, 12:14:31 AM
Quote from: jamespetts on August 23, 2020, 12:00:30 AM
Edit: The convoy check sums matching does not tell us much: those numbers only tell us how many convoys that there are and what handle numbers that they have been allocated, not what they are actually doing. An almost unimaginably vast number of things relating to convoy movement might cause this sort of loss of synchronisation, including (amongst many, many, many other things) private cars, railway signalling, loading behaviour (including loading times, divergence in which might be caused in turn by divergence in passenger numbers, which might, in turn, be caused by something else, and so on for an arbitrary number of steps), physics, industry production (which would affect loading), road junction logic, pathfinding, and a myriad other things.
As far as I am aware, the convoy checksums (that is the first two elements of sums=1744600149,2506396596,0,0,0,0,0,0) do count precisely the movement of the convoys. I should have made this clearer to distinguish them form the convoy handle sum.
Edit: it's also worth noting that the random seed is out of sync, not simply the result of generated random numbers. I couldn't find anything in the code that actually modifies this apart from the seed generation algorithm itself. It might also be possible that integer overflow of some kind is happening.
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on August 23, 2020, 04:25:00 AM
Quote from: freddyhayward on August 23, 2020, 12:14:31 AM
Edit: it's also worth noting that the random seed is out of sync, not simply the result of generated random numbers. I couldn't find anything in the code that actually modifies this apart from the seed generation algorithm itself. It might also be possible that integer overflow of some kind is happening.
The seed changes after each call to get a random number... how a mersenne twister works.

Quote from: freddyhayward on August 23, 2020, 12:14:31 AM
As far as I am aware, the convoy checksums (that is the first two elements of sums=1744600149,2506396596,0,0,0,0,0,0) do count precisely the movement of the convoys.
If that's what those sums= mean, then unless some new code has been added in sync_step() requiring randoms, it's the citycars as the only random users in sync_step() that are exhibiting the fault.

I'd added the str= and ssr= lists to the checklist some time ago in Extended to try helping debug a desync issue. Rather not as helpful as hoped, but the idea was to see where they diverged, and further subdivide the problem adding/ changing the lists to narrow in on the bad code. Of course they were temporary debugging code that seems permanent now!  i.e. Their positions in the code were for helping find a specific desync years ago, they need to be redone each time a new desync is hunted, and likely redone many times as one narrows in (hopefully!). The sums= lists must be someone else's attempt at tracking desyncs - I don't remember them.

EDIT: forget to mention - for these to have any meaning, frames between checks must be set to 1 - i.e. send checklists every frame.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 23, 2020, 09:13:41 AM
Here are three (edit: now eight) desync messages logged on my end:
Context: placing signs, not really doing anything.
Warning: karte_t:::do_network_world_command: sync_step=30216 
server=[ss=30216 st=1888 nfc=8 rand=1557250333 halt=5226 line=1025 cnvy=8193 ssr=1241857636,1557250333,0,0,0,0,0,0 str=4103595329,4103595329,4103595329,4103595329,4103595329,4121655257,4121655257,4121655257,4121655257,4121655257,4121655257,2645973647,1196416888,1871716010,220948049,4121655257 exr=0,0,0,0,0,0,0,0 sums=1726200471,949113653,0,0,0,0,0,0
client=[ss=30216 st=1888 nfc=8 rand=3538964740 halt=5226 line=1025 cnvy=8193 ssr=1911461731,3538964740,0,0,0,0,0,0 str=4103595329,4103595329,4103595329,4103595329,4103595329,4121655257,4121655257,4121655257,4121655257,4121655257,4121655257,2645973647,1196416888,1871716010,220948049,4121655257 exr=0,0,0,0,0,0,0,0 sums=1726200471,949113653,0,0,0,0,0,0

Context: immediately upon connection.
Warning: karte_t:::do_network_world_command: sync_step=30408
server=[ss=30408 st=1900 nfc=8 rand=1828447704 halt=5226 line=1025 cnvy=8193 ssr=3587605122,1828447704,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0
client=[ss=30408 st=1900 nfc=8 rand=2437304339 halt=5226 line=1025 cnvy=8193 ssr=1693602772,2437304339,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0

Context: terraforming.
Warning: karte_t:::do_network_world_command: skipping command due to checklist mismatch : sync_step=39093
server=[ss=39093 st=2443 nfc=5 rand=3369424330 halt=5226 line=1025 cnvy=8193 ssr=3486250843,3369424330,0,0,0,0,0,0 str=491313374,491313374,491313374,491313374,491313374,1561926749,1561926749,1561926749,1561926749,1561926749,1561926749,4228572398,3577287383,623968325,73653577,1561926749 exr=0,0,0,0,0,0,0,0 sums=3005605747,3294018549,0,0,0,0,0,0
executor=[ss=39093 st=2443 nfc=5 rand=2985375289 halt=5226 line=1025 cnvy=8193 ssr=1285812580,2985375289,0,0,0,0,0,0 str=491313374,491313374,491313374,491313374,491313374,1561926749,1561926749,1561926749,1561926749,1561926749,1561926749,4228572398,3577287383,623968325,73653577,1561926749 exr=0,0,0,0,0,0,0,0 sums=3005605747,3294018549,0,0,0,0,0,0

Context: placing signal/signalbox, Huitsi had just joined.
Warning: karte_t:::do_network_world_command: sync_step=57992 
server=[ss=57992 st=3624 nfc=8 rand=4241491095 halt=5226 line=1025 cnvy=8193 ssr=3703758365,4241491095,0,0,0,0,0,0 str=3086837121,3086837121,3086837121,3086837121,3086837121,352968718,352968718,352968718,352968718,352968718,352968718,358863249,2315238162,267133988,31528861,352968718 exr=0,0,0,0,0,0,0,0 sums=892798643,1570647053,0,0,0,0,0,0
client=[ss=57992 st=3624 nfc=8 rand=2837245729 halt=5226 line=1025 cnvy=8193 ssr=3085442846,2837245729,0,0,0,0,0,0 str=3086837121,3086837121,3086837121,3086837121,3086837121,352968718,352968718,352968718,352968718,352968718,352968718,358863249,2315238162,267133988,31528861,352968718 exr=0,0,0,0,0,0,0,0 sums=892798643,1570647053,0,0,0,0,0,0

Context: selecting the signalbox and about to place a signal. There were two other players online.
Warning: karte_t:::do_network_world_command: sync_step=30408 
server=[ss=30408 st=1900 nfc=8 rand=1828447704 halt=5226 line=1025 cnvy=8193 ssr=3587605122,1828447704,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0
client=[ss=30408 st=1900 nfc=8 rand=2437304339 halt=5226 line=1025 cnvy=8193 ssr=1693602772,2437304339,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0

Context: was AFK, there were previously 4 player connected including jamespetts and VOLVO who had disconnected (presumably desynced). I believe at least one of them had attempted to reconnect before I desynced, but cannot tell for sure.
Warning: karte_t:::do_network_world_command: sync_step=74288 
server=[ss=74288 st=4643 nfc=0 rand=2147361007 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,2147361007,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0
client=[ss=74288 st=4643 nfc=0 rand=1389728691 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,1389728691,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0

Context: editing a schedule. jamepetts had recently disconnected (presumably desynced) and may have attempted to reconnect, but I cannot tell.
Warning: karte_t:::do_network_world_command: sync_step=74288 
server=[ss=74288 st=4643 nfc=0 rand=2147361007 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,2147361007,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0
client=[ss=74288 st=4643 nfc=0 rand=1389728691 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,1389728691,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0

Context: inspecting vehicles in the depot (probably unrelated)
Warning: karte_t:::do_network_world_command: sync_step=93496 
server=[ss=93496 st=5843 nfc=8 rand=1332702392 halt=5226 line=1667 cnvy=8193 ssr=78849053,1332702392,0,0,0,0,0,0 str=162498849,162498849,162498849,162498849,162498849,1572107109,1572107109,1572107109,1572107109,1572107109,1572107109,1056427539,1584585818,468776815,55319441,1572107109 exr=0,0,0,0,0,0,0,0 sums=2075641198,334357099,0,0,0,0,0,0
client=[ss=93496 st=5843 nfc=8 rand=1846799280 halt=5226 line=1667 cnvy=8193 ssr=1120227439,1846799280,0,0,0,0,0,0 str=162498849,162498849,162498849,162498849,162498849,1572107109,1572107109,1572107109,1572107109,1572107109,1572107109,1056427539,1584585818,468776815,55319441,1572107109 exr=0,0,0,0,0,0,0,0 sums=2075641198,334357099,0,0,0,0,0,0

Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 23, 2020, 09:52:28 AM
Thank you both for your work on this. The server frames between checks had been set to 8; I have re-set it to 1 to aid with debugging this if anyone really can track down the problem using these numbers rather than the elimination test process that I described above.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 24, 2020, 01:36:07 PM
Now that the Stephenson-Seimens server is up and running again, I should be very grateful if people could confirm whether or not they get regular losses of synchronisation when playing that online saved game.

Thank you in advance.
Title: Re: Losses of synchronisation in online play
Post by: Vladki on August 25, 2020, 07:20:19 PM
Regarding running as root or not. Stephenson-Siemens is NOT running as root. (I would not dare doing that...)
My scripts are attached below - simupdate run daily, simrestore run hourly (rename from *.txt to *.sh or whatewer)
Feel free to modify

Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 25, 2020, 08:43:44 PM
Thank you for that - that is helpful. I do not think that the problem was the script, but rather the inability to connect to the server howsoever started if it was stated by a non-root user. However, this was such a long time ago that I do not recall much about the details, save that I spent a large amount of time attempting to deal with this and found the problem entirely intractable.




Regarding the losses of synchronisation: I tested this afternoon with the Stephenson-Seimens server: with nobody else online, and without interaction (while doing other things in the background), I was able to stay connected for > 2 hours.

This suggests that some feature or aspect of the saved game on the Bridgewater-Brunel server has the effect of invoking the troubled part of the code whereas that is not true of the Stephenson-Seimens server.

The most obvious things that are present on the Bridgewater-Brunel server but not the Stephenson-Seimens server are time interval signalling and industry connected by players. The former was present on the previous Bridgewater-Brunel server, and did not appear to cause losses of synchronisation then, although the code may have changed since then.

I have just pushed a change to the Github repository that should allow the industry density proportion override to work as intended when modified only using simuconf.tab. This should have the effect that the amount of industry on the map will increase significantly its quantity. This should result, hopefully, in more players using freight routes. If, over the next 30 or so game years, the frequency of losses of synchronisation increase significantly, then we can infer that the cause is likely to be industry and conduct further testing on that basis. If, conversely, the losses of synchronisation cease within that period, we can infer that time interval signalling is a likely culprit. In any other case, further and more difficult testing work will be required.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 25, 2020, 09:56:11 PM
Might it be worth temporarily re-enabling DEBUG_SIMRAND_CALLS? This could help to identify the source of inconsistent simrand calls between the server and client.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 25, 2020, 11:31:15 PM
Quote from: freddyhayward on August 25, 2020, 09:56:11 PM
Might it be worth temporarily re-enabling DEBUG_SIMRAND_CALLS? This could help to identify the source of inconsistent simrand calls between the server and client.

I have not looked at this code for a while, but, the last time that I looked at it, it was not suitable for use on a running server and was very crude. It might have been enhanced since then, but it would need some looking into to check this.
Title: Re: Losses of synchronisation in online play
Post by: ceeac on August 26, 2020, 10:08:39 AM
I found several potential sources of desyncs when using Windows builds (compiled by MSVC/MinGW). Not sure if this is the same problem, but pull request #266 (https://github.com/jamespetts/simutrans-extended/pull/266) fixes these.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 26, 2020, 10:40:34 AM
Quote from: ceeac on August 26, 2020, 10:08:39 AM
I found several potential sources of desyncs when using Windows builds (compiled by MSVC/MinGW). Not sure if this is the same problem, but pull request #266 (https://github.com/jamespetts/simutrans-extended/pull/266) fixes these.

Thank you for that - that is helpful. This is now incorporated.

I suspect that this is not the cause of at least many of the losses of synchronisation reported, since this sort of loss of synchronisation would only occur shortly after players used these specific tools and players have reported losses of synchronisation in circumstances where nobody has used any tools since the player losing synchronisation connected, but it is very helpful to have this fix all the same, as this may well have caused at least some losses of synchronisation, which it is worth fixing.

Thank you again.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 26, 2020, 10:45:12 AM
I have noticed that quite a few of these desyncs were caused by local settings and environment variables not being properly overriden by the server. I wonder if there could be more hiding in this category of errors?
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 26, 2020, 10:48:26 AM
Quote from: freddyhayward on August 26, 2020, 10:45:12 AM
I have noticed that quite a few of these desyncs were caused by local settings and environment variables not being properly overriden by the server. I wonder if there could be more hiding in this category of errors?

It is possible in theory - along with myriad other possibilities.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 26, 2020, 04:08:35 PM
User experiences reported this afternoon suggest that losses of synchronisation occur more often shortly after new players join. This suggests that the issue may be related to saved data not being consistent with data held in memory by the server.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 27, 2020, 02:42:58 PM
I have undertaken some further testing on this to-day. Using the Bridgewater-Brunel saved game from January 1858, I used an optimised debug build and connected 2 clients to the server locally for ~1 hour. No loss of synchronisation was observed.

I then repeated the experiment with a build downloaded from the server, and one of the two clients lost synchronisation after somewhere between 1 and 2 hours, approximately. The other client has remained synchronised.

Note that it has long been the case that the Visual Studio builds cannot remain synchronised with the Linux or cross-compiled builds.

This may enable some narrowing down of the problem to occur, but, at this rate, it will take a very, very large amount of time to make progress given that each change requires up to 2 hours of testing to see whether it has had any effect.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 27, 2020, 03:05:52 PM
Since the surface-level causes of losses of synchronisation is usually mismatching random seeds (presumably, simrand being called an inconsistent number of times), we could try recording in each checklist the number of calls to simrand since the previous checklist. The size of the inconsistency might offer useful clues.
I should also like to repeat my suggestion that disconnections due to checklist mismatches be treated as errors rather than warnings. I recall that you were concerned that these were not always errors, but those would be a tiny minority of cases. In practice, it is inconvenient to monitor checklist mismatches on debug level 2 because of the less-severe warning and message spam.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 27, 2020, 04:29:08 PM
I have been re-running the test with the optimised debug builds, connecting two local clients, this time with a version of the Bridgewater-Brunel saved game from yesterday evening in 1862. I started this shortly after my post above, and both clients are still connected to the (local) server without loss of synchronisation. I will keep this running for longer to see whether this persists for an extended period.

As to errors versus warnings of checklist mismatches, may I ask why it is difficult simply to use a search tool to find the text for these in log files?
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 27, 2020, 04:48:51 PM
I have now had both clients lose synchronisation to the local server. We can therefore potentially reproduce the loss of synchronisation locally and without interaction, but we can only confirm that any possible change has the effect of avoiding a loss of synchronisation with at last 2 hours' testing.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 27, 2020, 11:21:14 PM
I have been able to to produce losses of synchronisation for both clients without interaction in under half an hour (possibly shorter, since I didn't time it) by using a save with the lowest possible bits-per-month which therefore calls new_month and associated code more frequently. I noticed that in these cases the server had advanced its random seed while the clients had not.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 27, 2020, 11:33:45 PM
Quote from: freddyhayward on August 27, 2020, 11:21:14 PM
I have been able to to produce losses of synchronisation for both clients without interaction in under half an hour (possibly shorter, since I didn't time it) by using a save with the lowest possible bits-per-month which therefore calls new_month and associated code more frequently. I noticed that in these cases the server had advanced its random seed while the clients had not.

That is interesting. However, there are reports of losses of synchronisation (and I have experienced this myself) before a month end change (i.e. players logging on during a month lose synchronisation in that same month), which suggests that there is a problem that does not depend on month end issues. It is possible for there to be two separate issues, of course, one based on month ends and one not.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 28, 2020, 01:05:20 AM
I conducted an additional long-term test this evening, leaving the game running in the background for many hours whilst I did other things. In this case, I had set it up to remove all industry at the beginning of a new month; but checking now, this appears to have failed for reasons which are currently unclear, thus invalidating the test. The clients had lost synchronisation by the time that I checked them again.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on August 28, 2020, 01:26:49 AM
I have been repeatedly desyncing seconds after connecting to bridgewater-brunel and have 'narrowed' down the divergence in this case to between lines 5868 and 5936 of simworld.cc. however, this contains quite a bit of logic including convoy, path explorer, and time interval signalling.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 28, 2020, 10:32:15 AM
Thank you for this: that is most helpful. This is the main step() routine, and a complete list of things processed in this block of this routine is:

(1) the path explorer refreshing all categories;
(2) check transferring cargoes;
(3) starting the multi-threaded path explorer;
(4) starting the multi-threaded convoy threads;
(5) checking the MIDI;
(6) recalculating the snowline;
(7) stepping time interval signals;
( 8) checking the number of playing clients in a server game and recording whether a player has disconnected; and
(9) checking the functions to be executed by scenario scripts if this is a scripted scenario.

We can realistically rule out 3, 4, 5, 6, 8 and 9 as proximate causes: any losses of synchronisation in the multi-threaded code would only show in in the main code after those multi-threaded parts have finished, and the completion of those multi-threaded algorithms has not occurred by the end of this block. That leaves:

(1) the path explorer refreshing all categories;
(2) check transferring cargoes; and
(3) stepping time interval signals;

Unfortunately for narrowing down purposes, there is a high chance that, if (2) is the proximate cause, the actual cause of the divergence is nothing to do with the check transferring cargoes code, since the inconsistency may have arisen much earlier, when the passengers/mail/goods were put into the transferring cargoes list in the first place, and only registers when, for example, passengers leave that list, are registered as arriving somewhere, and pedestrians are generated, giving rise to random number generator calls to determine which pedestrian images to use and in what direction that they should travel.

This would suggest some inconsistency in code relating to one or more of the following:
(1) passenger generation;
(2) transfer time calculation;
(3) vehicle movement (including physics, signalling, road vehicle conflict resolution);
(4) passenger routing;
(5) passenger loading (including overcrowding, class and comfort);
(6) mail routing;
(7) mail loading (including class);
( 8) schedules/waiting times; or
(9) industry production logic.





Incidentally, I have had another look at the DEBUG_SIMRAND_CALLS code. This appears to have been improved since I first wrote it and it now appears to be suitable for use in a running game. What it will do is give information as to the caller and the range of each random call. However, because of this, it will generate truly gargantuan log files that means that it is impractical to use on the server. If you are able to reproduce a loss of synchronisation locally within a short period of time - something that I have not been able to do - it may well be worth enabling this on server and client and analysing the resulting log files.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 29, 2020, 05:03:24 PM
I continue to have difficulty in reproducing this locally: using the saved game from the server from just before I reset the pakset override setting, which started at circa 3:00 in May 1863, I have progressed with two locally connected clients to circa 4:40 in June 1863 without loss of synchronisation.

Can I check how often that people are losing synchronisation to-day in the Bridgewater-Brunel server?
Title: Re: Losses of synchronisation in online play
Post by: Huitsi on August 29, 2020, 07:47:00 PM
I just desynced three times within less than two hours.
Title: Re: Losses of synchronisation in online play
Post by: Matthew on August 29, 2020, 10:19:53 PM
This morning (European time) I had immediate desynchs every time I joined.

Early this evening I noticed that Freahk and Huitsi twice desynched as soon as I joined. At that time, I did not desynch, but I did have steadily increasing lag (1 minute plus), even though I had the game windows as small as possible, which I remedied by quitting and restarting the client.

Later this evening, I was the only person on the server for about two hours. I had no or trivial (<10s) lag, even with the game playing full-screen and zoomed out, and desynched only once. §

So there is a possible pattern that I only get lag when the savegame is saved by the Windows client to my hard drive (which I understand to be the case when others join after I do) and not when it's transferred from the server's Linux build (when I join and play alone). Perhaps someone could share a recent Bridgewater-Brunel save made by a Linux client so I can see whether that has different behaviour off-line.

§ It was a weird case like nothing I've seen before: I suddenly jumped twenty (in-game) minutes forward at the same time as I desynched. By the time I resynched (by which I mean, re-joined the online game), I went back fifteen minutes again. My best guess is that's off-topic to this thread and  that I accidentally pressed the "fast-forward" shortcut at the same time as I desynched or something.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 29, 2020, 10:45:34 PM
Thank you both for your  reports. I have been carrying out some testing over many hours this evening, leaving the computer unattended whilst undertaking other tasks. I connected two clients to the server after I had manually removed every industry from the May 1863 saved game. Having left it for circa 4-5 hours, it had reached October 1863; one client had lost synchronisaiton but one remained connected.

This would tend to suggest that industry is not involved in the loss of synchronisation, but there is a possibility of the results being contaminated by new industries spawning which use player transport connexions that already existed.

I will need to re-run the test, probably overnight, using the industry density proportion override to prevent any industry spawning.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 30, 2020, 10:25:10 AM
I have re-run the test overnight. One client lost synchronisation, the other crashed: when attempting to use the debugger on a crashed version, I find that it crashed because the pointer to a way object in an ordinary crossroads in a town apparently pointed to a deleted object, the ultimate reason for which could not be discerned.

Checking the local server, I see that a number of industry chains had been generated despite setting the industry density proportion to 1, its lowest setting without disabling the system entirely. However, checking these, none of these seemed to be actually interacting with player networks.

This would suggest that industry production and interaction with player networks is not the cause of the loss of synchronisation issue. If Freddy is correct about the part of the code in which this problem arises, this would, in turn, suggest that the problem does not involve industry at all, since the only means of industry involvement in those parts of the code is by means of transport of cargoes.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on August 31, 2020, 12:42:37 AM
I have done some further testing: again, I left the simulation running for ~5 hours with one local server and two clients, but I modified the code so as to disable check_transferring_cargoes(). The same result as above obtained: one client lost synchronisation, the other crashed.

The crash is odd, and difficult to track down, partly, I suspect, because I am using an optimised debug build (the optimisations being necessary for the Bridgewater-Brunel game to run at a reasonable speed).

The error that I get is in line 1791 of grund.cc when attempting to check tile 433,570 as to whether it has a depot (called from line 4783 of simvehicle.cc, part of the bool rail_vehicle_t::check_next_tile() method. This is called by convoy 8608 and is recorded as occurring at 4:35:11 of September 1863.

The crash itself is indecipherable. The only details given by the debugger are:


Simutrans-Extended-debug-optimised.exe has triggered a breakpoint. occurred


The code then enters some external code and ultimately ends in issue_debug_notification(), for which no sources are available, having passed through terminate() and abort(); none of these methods are part of Simutrans code, so they must be part of some exception handling library; but what exception is being handled is not clear.

The code being executed is:


depot_t* grund_t::get_depot() const
{
return dynamic_cast<depot_t *>(first_obj());
}


I cannot inspect the value of first_obj() because of compiler optimisations.

Given that a similar crash occurred on the last occasion, I am wondering whether this is relevant to the cause of the loss of synchronisation. This looks as though it may be some form of memory corruption, but how it is arising is extremely unclear. However, it is notable that check_transferring_cargoes() being disabled appears to have made no difference; that may well rule out a large number of possible causes.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 02, 2020, 09:18:52 PM
Further testing shows that a loss of synchronisation occurs after circa 2 hours on a local client/server with two clients connected with line 5903 of simworld.cc commented out, suggesting that the problem is unlikely to be specific to time interval signals.

I should note that it is likely to be very difficult for me to undertake any development work until this issue has been identified and fixed, and this could take an unlimited and unpredictable amount of time.

Any assistance in resolving this would be very much appreciated.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 02, 2020, 11:10:58 PM
I have attempted to conduct further tests to see whether a shorter time period for invoking a loss of synchronisation could be prompted, but without success. There was a point when clients lost synchronisation seconds after connecting to the server, but I have not been able to reproduce this reliably with a saved game from the time when this occurred.

I have tried connecting up to four clients simultaneously, but this does not reliably prompt a loss of synchronisation within a reasonable time for testing.

A failure to reproduce the problem for check_transferring_cargoes() and stepping the time interval signals, the inability to reproduce loss of synchronisation within the same time of loading the saved game and reports from users that losses of synchronisation tend more often to happen after players connect tends to suggest that the problem might be related in some way (whether directly or indirectly - and indirectly might include very indirectly indeed, i.e., by means of an arbitrary number of intermediate steps each of an arbitrary level of complexity) to loading and saving, but this is highly uncertain. There is currently no clue as to where in the loading/saving code that the problem might occur.

I note that Freddy had reported being able to reproduce the loss of synchronisation within 30 minutes, but I have never received information on how this was possible, so I have not been able to do anything with this. If it is possible to reproduce it within that time, that would be extremely helpful and could quadruple the speed with which this problem can be found.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 02, 2020, 11:50:59 PM
Quote from: jamespetts on September 02, 2020, 11:10:58 PMI note that Freddy had reported being able to reproduce the loss of synchronisation within 30 minutes, but I have never received information on how this was possible, so I have not been able to do anything with this. If it is possible to reproduce it within that time, that would be extremely helpful and could quadruple the speed with which this problem can be found.
Unfortunately, I can't precisely remember either. I do remember altering certain settings to make industries grow at the maximum possible rate (I can't remember which settings achieved this), and lowering the bits_per_month to 16. I have no idea whether this would relate to the same error that you had reproduced within 2 hours, or whether this method would even succeed in the current version. I think I used a flat world with many interconnected cities for this as well, but I can't remember its size.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 02, 2020, 11:56:10 PM
Thank you for your reply. Unfortunately, this information is not sufficiently precise for me to be able to do anything with it.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 03, 2020, 12:08:02 AM
I have attempted to use Dr. Memory to determine whether there are any items of memory corruption that might be causing the losses of synchronisation. Unfortunately, I have been unable to get Dr. Memory to work at all: with the 64-bit executable, it will freeze after the pakset selector. With the 32-bit executable, it will simply crash shortly after opening. These errors do not occur when not running in Dr. Memory.

If anyone is able to get this to run successfully with either Dr. Memory or Valgrind, especially with the current Bridgewater-Brunel saved game, it would be extremely helpful to see the output of this to determine whether memory corruption is possibly relevant.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 03, 2020, 12:09:23 AM
Incidentally, Freddy - can I check precisely how you determined that the loss of synchronisation occurred in certain parts of the code? I have not found that disabling any parts of that code makes any difference to whether the losses of synchronisation occur.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 03, 2020, 01:13:04 AM
Quote from: jamespetts on September 03, 2020, 12:09:23 AM
Incidentally, Freddy - can I check precisely how you determined that the loss of synchronisation occurred in certain parts of the code? I have not found that disabling any parts of that code makes any difference to whether the losses of synchronisation occur.
I used the checklist mismatch messages - each record of `str=` and `ssr=` is updated at a different point in the code, so it is a matter of finding which entries are mismatched, and where they are updated in the code. We could make this more precise by adding more entries in suspect areas, and removing them from non-suspect areas.
Title: Re: Losses of synchronisation in online play
Post by: ceeac on September 03, 2020, 06:46:21 AM
I noticed that running the server with MULTI_THREAD=0 and connecting with a client that has MULTI_THREAD=1 (everything else being equal, including settings) results in an immediate loss of synchronization every time.
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 03, 2020, 07:17:54 AM
Additionally, if the thread number is different between the server and a client, connection will be lost immediately.
I think it's better to implement the comparison of thread number when connecting.
Title: Re: Losses of synchronisation in online play
Post by: Matthew on September 03, 2020, 07:24:12 AM
Quote from: ceeac on September 03, 2020, 06:46:21 AM
I noticed that running the server with MULTI_THREAD=0 and connecting with a client that has MULTI_THREAD=1 (everything else being equal, including settings) results in an immediate loss of synchronization every time.

Would Simutrans warn of incompatible paksets if a player tried to do that?
Title: Re: Losses of synchronisation in online play
Post by: ceeac on September 03, 2020, 08:14:24 AM
Quote from: Phystam on September 03, 2020, 07:17:54 AMI think it's better to implement the comparison of thread number when connecting.
No. I believe the results should be the same no matter the number of threads or whether multi-threading is enabled. It also helps when debugging multi-threaded code if you are able to compare the results of the multi-threaded code with the results of the single-threaded code.

Quote from: Matthew on September 03, 2020, 07:24:12 AMWould Simutrans warn of incompatible paksets if a player tried to do that?
I queried my local server by entering the IP address in the "play online" window and there were no warnings.
Title: Re: Losses of synchronisation in online play
Post by: RESTRICTED ACCOUNT on September 03, 2020, 11:15:29 AM

if(  get_scenario()->is_scripted() ) {
get_scenario()->step();
} // Loss of synchronisation suspected to be in a block of code ending here.

Does the scenario work correctly with extended?
Title: Re: Losses of synchronisation in online play
Post by: Mariculous on September 03, 2020, 01:01:51 PM
It does not work at all with extended.
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 03, 2020, 01:08:44 PM
Quote from: ceeac on September 03, 2020, 06:46:21 AM
I noticed that running the server with MULTI_THREAD=0 and connecting with a client that has MULTI_THREAD=1 (everything else being equal, including settings) results in an immediate loss of synchronization every time.
That is a known issue with the architecture of the threading that was added. It should really be done as a pool of worker threads grabbing tasks. i.e. Break the work into a hundred tasks and let the threads chew. Then it doesn't matter 1 thread or 30. The problem now is the work is broken into the number of threads, and work unit size affects the simulation (maintaining sync).  For the threaded display rendering, it is quite important for performance the number of threads match the available CPU cores. Hence the number of threads must be different for each persons computer.


Quote from: freddyhayward on September 03, 2020, 01:13:04 AM
I used the checklist mismatch messages - each record of `str=` and `ssr=` is updated at a different point in the code, so it is a matter of finding which entries are mismatched, and where they are updated in the code. We could make this more precise by adding more entries in suspect areas, and removing them from non-suspect areas.
This is exactly how the rands added to the checklist were intended to be used. In practice, they're less effective at finding the desync than desirable, but better than nothing. It does mean for every desync, you must check the checklists and see where the mismatch is. Otherwise just saying a desync occurred is a waste of testing.

A note on the intended design:
   ssr=  sync_step_rands.  8 of these to be used throughout sync_step().
   str= step_rands. 32 of these to used throughout step().
   exr=  extra_rands. 8 of these to be stuck in whereever an extra checkpoint is needed.
   debug_sums, don't know, not added by me.

With the current distribution in step(),  rands[21] and [22] should really be in the exr range. [23] should be [16], and 16->17, etc. i.e. Keep them in order, much easier when finding the mismatch. And the setting =0 at the end of step(), not in the middle, then you can see what's not used (and see [23] is used even though set =0)


--
As for the current desync in block 19 to 20, the only thing there calling simrand() is pedestrian generation within check_transferring_cargoes(). As previously mentioned, things that affect cargo are many. i.e. the root source will not likely be in check_transferring_cargoes(), but those things that affect cargo in general. (any why commenting out check_transferring_cargoes() still results in a desync - it's simply detected later/elsewhere - why it's important to check the rands every desync)

Title: Re: Losses of synchronisation in online play
Post by: Matthew on September 04, 2020, 06:29:03 AM
Quote from: TurfIt on September 03, 2020, 01:08:44 PMThis is exactly how the rands added to the checklist were intended to be used. In practice, they're less effective at finding the desync than desirable, but better than nothing. It does mean for every desync, you must check the checklists and see where the mismatch is. Otherwise just saying a desync occurred is a waste of testing.

A note on the intended design:
   ssr=  sync_step_rands.  8 of these to be used throughout sync_step().
   str= step_rands. 32 of these to used throughout step().
   exr=  extra_rands. 8 of these to be stuck in whereever an extra checkpoint is needed.
   debug_sums, don't know, not added by me.

With the current distribution in step(),  rands[21] and [22] should really be in the exr range. [23] should be [16], and 16->17, etc. i.e. Keep them in order, much easier when finding the mismatch. And the setting =0 at the end of step(), not in the middle, then you can see what's not used (and see [23] is used even though set =0)

Turfit, thank you for adding this feature to Simutrans and drawing James' attention to it. However, I don't have anywhere near enough knowledge of C++ or Simutrans to place your comments in context and make use of them.  ??? I will try to describe the feature in full to get it clear in my head; could you (or anyone else who understands it) please check that I have understood correctly how it works and how to use it?

The main Simutrans random number generator uses a seed generated by the previous call to it, so if all is well there should be an unbroken chain of seeds.

If the Simutrans client has debug warnings enabled, then it periodically runs the do_network_world function, which publishes in the log file the seed available for random number generation at certain points in the previous period. It reports both the seed it used itself and the seed available to the server, which should be the same, otherwise the chain is broken and the gamestates will diverge over time. Such a report is also published to the log once such such a divergence is detected and the client desynchs.

The cause of the divergence must lie before the point at which the logged seeds diverged. It must lie after the last point at which the seeds were identical, though since step() and sync_step() are loops, so a later point in the code may refer to an earlier point in time.

There are three different sets of log points, for use at different places in the code:
Quotessr=  sync_step_rands.  8 of these to be used throughout sync_step().
   str= step_rands. 32 of these to used throughout step().
   exr=  extra_rands. 8 of these to be stuck in whereever an extra checkpoint is needed.

In order to activate the log points, change (for example) rands[7] = 0; to rands[7] = get_random_seed(); And presumably both client & server must be recompiled with this change, otherwise you are guaranteed to have mismatches because the Simutrans instance with the extra activated log point will be making an additional call to the random number generator.

Once it is noticed that divergences are often occurring at the same place(s) in the code, it may be possible to move the log points in order to bisect the code and find the cause of the random seed divergences. Though I doubt it as always as simple as that!  ;)
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 04, 2020, 09:32:53 AM
Thank you all for your responses. I have been very busy in the last few days, but I will look into updating the logging code to assist in tracking this problem down when I have some time. Thank you again.
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 04, 2020, 05:09:16 PM
In my server, I often observe desync when nettool sends a force-sync call. (His client also often shows "data is not enough" error while connecting.)
Title: Re: Losses of synchronisation in online play
Post by: Matthew on September 05, 2020, 05:50:12 AM
I madea tiny Bash script to search Simutrans log files for desync random-number checklists and save them:

#!/bin/bash
# Log version hash (equals last commit on main branch)
grep "Simutrans version" simu.log >> checklist.log
# For clients, log disconnects and make the random-number-seed checklists line up nicely (if word wrap is turned off) for easy comparison
grep -B2 "network_disconnect()" simu.log | sed -e 's/server=/\nserver=/; s/client=/\nclient=/'  >> checklist.log
# For the server, log and make the random-number-seed checklists line up nicely (if word wrap is turned off) for easy comparison by changing "initiator" to "client"
grep -B2 "kicking" simu.log | sed -e 's/server=/\nserver=/; s/initiator=/\nclient=/'  >> checklist.log


It seems to work on the server log (http://bridgewater-brunel.me.uk/misc/simu-server13353-22-aug-2020.log) that James posted a few weeks ago and on the Windows client logs.

If James edited something like this into the scripts that he uses to restart the server, then he would have a summary log of RNG mismatches causing desyncs, without having to keep gigabytes of log files.

If James were to edit the path for checklist.log to somewhere that is downloadable (like his /misc/ directory) then others could analyse the data too. For easiest view, turn word wrapping off.

It's only my third Bash script (with lots of help from Stack Exchange) so use at your own risk.
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 05, 2020, 05:51:46 AM
Quote from: Matthew on September 04, 2020, 06:29:03 AM
??? I will try to describe the feature in full to get it clear in my head; could you (or anyone else who understands it) please check that I have understood correctly how it works and how to use it?
You've got the general gist, not quite the details, but not important unless you're combing the code for the cause. As an end user, running simutrans with "-debug 2 -log" args, and checking the logs for messages:

Warning: karte_t::interactive: sync_step=413158 server=[ss=413158 st=25822 nfc=6 rand=1232380772 halt=5216 line=1025 cnvy=8193 ssr=2417508643,1232380772,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0] client=[ss=413158 st=25822 nfc=6 rand=2159140440 halt=5216 line=1025 cnvy=8193 ssr=1442281764,2159140440,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0]
Warning: karte_t::interactive: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0

Confirmation that you're experiencing a "checklist mismatch" is vital. It represents the existence of a bug in Extended, other 'desync' reasons are more of a disconnection, usually due to the clients flaky internet.
Posting the first line shown above on every checklist mismatch could help track down the bug, atleast posting it until too many showing the same are up...


Quote from: Matthew on September 04, 2020, 06:29:03 AM
Though I doubt it as always as simple as that!  ;)
If only! Way too many red herrings.


Quote from: Phystam on September 04, 2020, 05:09:16 PM
In my server, I often observe desync when nettool sends a force-sync call. (His client also often shows "data is not enough" error while connecting.)
force-sync causes server and all clients through a save/reload cycle. If a client is too slow doing so, could result in socket closing - disconnection, not "desync". If confirmed to be checklist mismatches, something very wrong in the save/load routines then...
"data not enough" == bad connection.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 05, 2020, 12:03:22 PM
Thank you all for your contributions to this. I am currently working on improving the logging code, and have pushed the first round of improvements. I have added some additional points where rands[] is set, and moved the two places where rands[] had been set with something other than random numbers to some of the unused debug checksums.

In the meantime, I have created a symbolic link to the live log file genreated by the server to allow active debugging by the community, both for this loss of synchronisation and any other error. It can be found here (http://bridgewater-brunel.me.uk/misc/simu-server13353.log). Beware that this is a very, very large file.

I will look into improving the logging further shortly. Note that the improved logging will take effect on the live log from to-morrow's nightly build onwards. Note also that these live logs are reset whenever the server restarts and archives are not currently kept.
Title: Re: Losses of synchronisation in online play
Post by: prissi on September 05, 2020, 01:39:06 PM
In the other thread, if a different units of path explore chunks are processed on a client, would that not cause desyncs quickly?
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 05, 2020, 02:26:51 PM
Quote from: prissi on September 05, 2020, 01:39:06 PM
In the other thread, if a different units of path explore chunks are processed on a client, would that not cause desyncs quickly?
Not if the paths were not updated until the path explorer had finished a whole round of calculation.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 05, 2020, 03:13:39 PM
A little earlier, I pushed a fix to the transferring cargoes algorithm in respect of an error that I found when improving the logging. The transferring cargoes had not been properly indexed in a number of places, such that the highest two threads' worth of these may well have been ignored.

I cannot immediately see how this could have caused a loss of synchronisation, but it is not impossible that this was to blame. I should be grateful for feedback on losses of synchronisation from to-morrow's nightly build onwards.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 06, 2020, 10:43:09 AM
Checking the logs this morning, I can see one loss of synchronisation recorded since the latest server update. The log entry is as follows:


Warning: karte_t::process_network_commands: kicking client due to checklist mismatch : sync_step=12734

server=[ss=0 st=0 nfc=0 rand=0 halt=0 line=0 cnvy=0 ssr=0,0,0,0,0,0,0,0 str=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 exr=0,0,0,0,0,0,0,0 sums=0,0,0,0,0,0,0,0

initiator=[ss=12734 st=795 nfc=14 rand=943453193 halt=5647 line=1025 cnvy=8193 ssr=3923501859,0,3923501859,3923501859,943453193,943453193,943453193,0 str=1601740177,1601740177,1601740177,1601740177,1601740177,1601740177,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1311543576,1157293220 exr=1157293220,4176945506,4176945506,0,0,0,0,0 sums=3623084883,1660530599,0,0,6,0,0,0


For some reason, the server's figures have not been logged. I do not understand how this could have occurred; I am struggling to see how the logging changes that I made yesterday could possibly have affected this. Any assistance in this regard would be much appreciated.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 06, 2020, 10:52:08 AM
Quote from: jamespetts on September 05, 2020, 03:13:39 PM
A little earlier, I pushed a fix to the transferring cargoes algorithm in respect of an error that I found when improving the logging. The transferring cargoes had not been properly indexed in a number of places, such that the highest two threads' worth of these may well have been ignored.

I cannot immediately see how this could have caused a loss of synchronisation, but it is not impossible that this was to blame. I should be grateful for feedback on losses of synchronisation from to-morrow's nightly build onwards.
After you joined today, followed by Supertimo and Huitsi, I was desynced and found that the new debug sums [5] (passengers/mail) and [7] (transferring cargo) were off by small amounts, with all other entries equal.
server ... sums=2467091683,1342947152,1222935,141356,6,1062,35088,35213
client ... sums=2467091683,1342947152,1222935,141356,6,1061,35088,35211

edit: this happened again with the following discrepancy (note that this time, the client's record was higher than the server's):
server ... sums=346604347,12621192,1223059,141372,6,1068,35016,35152
client ... sums=346604347,12621192,1223059,141372,6,1069,35016,35157

edit: and another:
server ... sums=809983894,1692448293,1223268,141382,6,1050,34990,35107
client ... sums=809983894,1692448293,1223268,141382,6,1050,34990,35116
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 06, 2020, 12:05:58 PM
We got this log.

Warning: karte_t:::do_network_world_command: sync_step=339968  server=[ss=339968 st=42496 nfc=0 rand=1371915518 halt=1931 line=1 cnvy=2049 ssr=3914378526,0,3914378526,3914378526,398461557,398461557,398461557,0 str=398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,2383265876 exr=2383265876,1371915518,1371915518,0,0,0,0,0 sums=2645975564,566767530,1079890,131294,6,255,43104,43136client=[ss=339968 st=42496 nfc=0 rand=1371915518 halt=1931 line=1 cnvy=2049 ssr=3914378526,0,3914378526,3914378526,398461557,398461557,398461557,0 str=398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,398461557,2383265876 exr=2383265876,1371915518,1371915518,0,0,0,0,0 sums=2645975564,566767530,1079890,131294,6,267,43093,43122
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 06, 2020, 12:09:07 PM
Thank you both for that.

It seems that almost all of the losses of synchronisation are now being triggered from the client rather than the server, leaving no logs on the server to show the checklists. However, connecting with a client with logging enabled shows this:
Quote
Warning: karte_t:::do_network_world_command:   sync_step=162480 

server=[ss=162480 st=10155 nfc=0 rand=206871005 halt=5647 line=1025 cnvy=8193 ssr=2333617215,0,2333617215,2333617215,1007176688,1007176688,1007176688,0 str=1007176688,1007176688,1007176688,1007176688,1007176688,1007176688,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,379896060 exr=379896060,206871005,206871005,0,0,0,0,0 sums=3960158586,773655635,1223249,141380,6,1051,35002,35124

client=[ss=162480 st=10155 nfc=0 rand=206871005 halt=5647 line=1025 cnvy=8193 ssr=2333617215,0,2333617215,2333617215,1007176688,1007176688,1007176688,0 str=1007176688,1007176688,1007176688,1007176688,1007176688,1007176688,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,114449090,379896060 exr=379896060,206871005,206871005,0,0,0,0,0 sums=3960158586,773655635,1223249,141380,6,1051,35002,35123
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 06, 2020, 12:24:26 PM
Analysing the above, it appears as though the loss of synchronisation occurs somewhere in the passenger generation code. However, this does not mean that the problem ultimately originates in the passenger generation code: it might be a problem connected to finding passenger routes by public transport, which might be a problem with the path explorer, with data structures for classes, halts, lines (and what lines serve what halts and carry which classes), buildings that generate or receive passengers or mail (including city buildings, industries, attractions, town halls and all player buildings), or a large range of settings that affect passenger/mail generation.
Title: Re: Losses of synchronisation in online play
Post by: ceeac on September 06, 2020, 01:13:06 PM
I fixed some occurrences of Undefined Behaviour related to factory in-transit calculation, overtaking, and a few additional ones most probably not affecting the desyncs. Check pull request #275 (https://github.com/jamespetts/simutrans-extended/pull/275).
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 06, 2020, 01:20:18 PM
Ceeac - thank you very much for this.

I have found a possible issue in the passenger generation code, which was that the dynamic tolerance calculation feature introduced better to model Marchetti's Constant was based both on the current year's and the past year's data, whereas the current year's data might have been modified by another thread, and thus it is indeterministic whether this modification would have occurred before or after any given access. Because this modification would only very rarely affect the percentage, the actual loss of synchronisation would occur rarely, but would be increasingly common proportionate to the number of city buildings in the world.

I have fixed this by using only the previous year's data, and have restarted the server with this updated version. I should be grateful if people could re-test and reconnect with the new client to check for losses of synchronisation. I have not yet incorporated Ceeac's changes in order to have a properly controlled test.

Ceeac - I notice that a number of your changes add a dynamic cast. Dynamic casts can have a significant performance impact when used in performance critical sections of code; can I check whether you have profiled the performance of these changes on a large map to see whether there is any noticeable difference?

Thank you again for your work on this, and to everyone who has given feedback and assistance with this issue so far.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 06, 2020, 03:38:04 PM
After you updated the server today, I have not been able to locate checklist messages in either the client or server logs. I wonder if this was caused by my change to the message format?
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 06, 2020, 07:04:57 PM
Standard r8395 commit is relevant here - missing, and r8348 not merged correctly:

// time for the next step?
uint32 time = dr_time();
if(  (sint32)next_step_time - (sint32)time <= 0  ) {    <<<<******
if(  step_mode&PAUSE_FLAG  ) {
// only update display
sync_step( 0, false, true );
idle_time = 100;
}
****
else if(  env_t::networkmode  &&  !env_t::server  &&  sync_steps >= sync_steps_barrier  ) {
sync_step( 0, false, true );
next_step_time = time + fix_ratio_frame_time;
}
*****
else {
if(  step_mode==FAST_FORWARD  ) {

Setting next_step_time is important, and note the correct way to take a time diff on values that overflow... (not part of 8348, but missed from some other one...)
Title: Re: Losses of synchronisation in online play
Post by: Matthew on September 07, 2020, 12:25:10 PM
There's clearly been a lot of work put in to this issue in the last few days. Thank you to everyone. Particular credit to James for this monumental effort:

Quote from: jamespetts on August 29, 2020, 10:45:34 PMafter I had manually removed every industry from the May 1863 saved game

:o

Since the server was reset to build #c2173fa, I have played for two sessions of about a real-life hour each. I did not experience any desynchs during that time, though that is not really any different from the behaviour that I experienced beforehand. However, Freddy Hayward observed that desynchs often occurred about 20-40 in-game minutes after the start of the month, and the game passed through that period without difficulty.

Obviously these are only two data points, but at least we have not gone backward! Nobody else has reported desynching in the chats either, which is a hopeful sign.

Quote from: freddyhayward on September 06, 2020, 03:38:04 PM
After you updated the server today, I have not been able to locate checklist messages in either the client or server logs. I wonder if this was caused by my change to the message format?

I observe the same phenomenon in the Windows client logs.

Staying on the topic of logging changes, the logs for both sessions contained far more divide by zero warnings than I have seen before, usually in clusters and often (always?) involving a limited set of numerators. I have not created a bug report because I don't yet have a reproducible case beyond "play on Bridgewater-Brunel". I guess that a debug build is necessary to create such a case, but at the present time I can't use one due to (to cut a long story short) lack of memory.
Title: Re: Losses of synchronisation in online play
Post by: Huitsi on September 07, 2020, 03:46:26 PM
I've had hardly any desyncs today, and the few that I have had seemed to be related to players joining.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 07, 2020, 05:13:42 PM
Thank you all for your feedback and my apologies for not having had more time to look into this to-day. I have just reverted the change that Freddy made to the logging code that is suspected of having caused problems for the logging. If anyone has any losses of synchronisation after to-morrow's nightly build, please let me know so that I can analyse the logs.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 07, 2020, 10:55:48 PM
Quote from: jamespetts on September 07, 2020, 05:13:42 PM
Thank you all for your feedback and my apologies for not having had more time to look into this to-day. I have just reverted the change that Freddy made to the logging code that is suspected of having caused problems for the logging. If anyone has any losses of synchronisation after to-morrow's nightly build, please let me know so that I can analyse the logs.
I tested the latest build with my patch reverted, but I still cannot see checklist messages.
EDIT: I also tried the earliest build with my change, and logging worked as expected. I am now in the process of identifying exactly which change has caused the issue.
EDIT: I have identified the issue and submitted a pull request to fix it: https://github.com/jamespetts/simutrans-extended/pull/276
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 07, 2020, 11:14:27 PM
James demoted it from a warning to a message. Will need debug 3 now instead of debug 2 to see it.  (Of course debug 3 spits out so many other issues [that really should be fixed....] that you can't effectively use it.)
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 07, 2020, 11:22:23 PM
Quote from: TurfIt on September 07, 2020, 11:14:27 PM
James demoted it from a warning to a message. Will need debug 3 now instead of debug 2 to see it.  (Of course debug 3 spits out so many other issues [that really should be fixed....] that you can't effectively use it.)
See the above post: I have kept it as a message unless a mismatch actually occurs, where it is added to the existing warning.
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 08, 2020, 05:28:52 AM
What happened? I cannot understand why it was disconnected.
It was an auto-update of my server. please ignore this.

Message: karte_t:::do_network_world_command: sync_step=1265440  server=[ss=1265440 st=158180 nfc=0 rand=2836023522 halt=2028 line=1 cnvy=2049
ssr=1926564120,0,1926564120,1926564120,3558286601,3558286601,3558286601,0
str=3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3439267139
exr=3439267139,2836023522,2836023522,0,0,0,0,0
sums=2251449498,2512648722,1115024,135464,6,396,43956,44015]
client=[ss=1265440 st=158180 nfc=0 rand=2836023522 halt=2028 line=1 cnvy=2049
ssr=1926564120,0,1926564120,1926564120,3558286601,3558286601,3558286601,0
str=3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3558286601,3439267139
exr=3439267139,2836023522,2836023522,0,0,0,0,0
sums=2251449498,2512648722,1115024,135464,6,396,43956,44015]

Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 50
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 50
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 0
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[2752]
Warning: NWC_CHECK: time difference to server 50
Warning: network_receive_data: connection [2752] already closed
Message: socket_list_t::remove_client: remove client socket[2752]
Warning: karte_t::process_network_commands: lost connection to server
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 08, 2020, 10:52:16 AM
Double post!
I have connected to my server 3 hours ago, and I have not had any desync issues.
Another player did not report the desync issue, too.
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 08, 2020, 01:27:08 PM
Triple post ;)
I got a desync:

server=[ss=577568 st=72196 nfc=0 rand=3624669075 halt=2049 line=1 cnvy=2049
ssr=2442508065,0,2442508065,2442508065,4026605948,4026605948,4026605948,0
str=4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,197235402
exr=197235402,3624669075,3624669075,0,0,0,0,0
sums=3912519612,3810850575,1120566,136003,6,406,44647,44703]
client=[ss=577568 st=72196 nfc=0 rand=3624669075 halt=2049 line=1 cnvy=2049
ssr=2442508065,0,2442508065,2442508065,4026605948,4026605948,4026605948,0
str=4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,4026605948,197235402
exr=197235402,3624669075,3624669075,0,0,0,0,0
sums=4083084865,2854028676,1120566,136003,6,406,44647,44703]

Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 08, 2020, 04:21:07 PM
Thank you for these reports. That is interesting, as the mismatch occurred on the first two debug sums, which indicates an issue related in some way to vehicle movement. This is different to the previously reported problem. This, together with the reduction in reported losses of synchronisation and, suggests that the original problem may well have been solved and that this is a separate problem.

It would be helpful to have any more log reports of losses of synchronisation here to allow this possibly separate issue to be investigated further.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 08, 2020, 04:47:36 PM
Quote from: TurfIt on September 06, 2020, 07:04:57 PM
Standard r8395 commit is relevant here - missing, and r8348 not merged correctly:

// time for the next step?
uint32 time = dr_time();
if(  (sint32)next_step_time - (sint32)time <= 0  ) {    <<<<******
if(  step_mode&PAUSE_FLAG  ) {
// only update display
sync_step( 0, false, true );
idle_time = 100;
}
****
else if(  env_t::networkmode  &&  !env_t::server  &&  sync_steps >= sync_steps_barrier  ) {
sync_step( 0, false, true );
next_step_time = time + fix_ratio_frame_time;
}
*****
else {
if(  step_mode==FAST_FORWARD  ) {

Setting next_step_time is important, and note the correct way to take a time diff on values that overflow... (not part of 8348, but missed from some other one...)

My apologies: I had not had time to look into this until now.

Thank you very much for this: I have now incorporated these changes.
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 08, 2020, 04:56:05 PM
Quote from: jamespetts on September 08, 2020, 04:47:36 PM
if(  step_mode&PAUSE_FLAG  ) {
Don't forget ^^^^
Or r8395...
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 08, 2020, 05:00:35 PM
Thank you for that.
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 08, 2020, 05:29:25 PM
Almost there, still r8395 to go...

void socket_list_t::send_all(network_command_t* nwc, bool only_playing_clients, uint8 player_nr)

...
if (list[i]->is_active()  &&  list[i]->socket!=INVALID_SOCKET
&&  (!only_playing_clients  ||  list[i]->state == socket_info_t::playing  ||  list[i]->state == socket_info_t::connected)      <<<<<<<<<<<
&&  (player_nr >= PLAYER_UNOWNED  ||  list[i]->is_player_unlocked(player_nr))
...

Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 08, 2020, 07:24:47 PM
Thank you for that. I am having trouble working out where to get the player_nr from to pass to this method: how is this done in Standard, do you know?
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 08, 2020, 07:33:40 PM
I would skip that part... it's from a much larger commit and is for scripting support which I believe you've decided to remove from Extended.
Just the   state == :: connected that was added in 8395 is needed.
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 08, 2020, 07:40:26 PM
Quote from: TurfIt on September 08, 2020, 07:33:40 PM
I would skip that part... it's from a much larger commit and is for scripting support which I believe you've decided to remove from Extended.
Just the   state == :: connected that was added in 8395 is needed.

Excellent, thank you: that is very helpful.
Title: Re: Losses of synchronisation in online play
Post by: Vladki on September 08, 2020, 08:03:20 PM
Right now I have immediate desync after connecting to stephenson-siemens and bridgewater-brunel. But I was able to stay connected to Karafuto for quite a while.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 09, 2020, 03:05:53 AM
Desyncs have decreased overall, but I recently had one similar to Phystam's:
server=[ss=245976 st=15373 nfc=8 rand=1948000736 halt=5737 line=1647 cnvy=10269
ssr=1268985866,0,1268985866,1268985866,1948000736,1948000736,1948000736,0
str=2556624162,2556624162,2556624162,2556624162,2556624162,2556624162,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,3525597508
exr=3525597508,4058585351,4058585351,0,0,0,0,0
sums=2869568541,3060730727,0,0,6,0,0,0]
client=[ss=245976 st=15373 nfc=8 rand=1948000736 halt=5737 line=1647 cnvy=10269
ssr=1268985866,0,1268985866,1268985866,1948000736,1948000736,1948000736,0
str=2556624162,2556624162,2556624162,2556624162,2556624162,2556624162,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,2447664526,3525597508
exr=3525597508,4058585351,4058585351,0,0,0,0,0
sums=3599377438,2028940010,0,0,6,0,0,0]
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 09, 2020, 07:52:17 AM
Currently, it is very stable -- the reported desyncs seem relatively rare cases, so I do not think that this issue must be fixed.

Edit:
Finally I did not get any desync today!
Title: Re: Losses of synchronisation in online play
Post by: jamespetts on September 09, 2020, 11:06:23 PM
Excellent - thank you all for confirming. It seems that the main issue has now been resolved. If there are any other losses of synchronisation, they are likely to be caused by something different, so a new thread will be required.

Thank you very much to all those who assisted in resolving this problem.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 09, 2020, 11:24:12 PM
Quote from: jamespetts on September 09, 2020, 11:06:23 PM
Excellent - thank you all for confirming. It seems that the main issue has now been resolved. If there are any other losses of synchronisation, they are likely to be caused by something different, so a new thread will be required.

Thank you very much to all those who assisted in resolving this problem.
Thanks for your work on this - it's hard to overstate how much the online experience has improved. Could you please split my and Phystam's reports into a new thread - we seem to have reached the same error, fortunately much less frequently than before.
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 13, 2020, 11:56:24 PM
Feel free to split this into a new thread if you think it necessary. I have been recording reports of losses of synchronisation and the first mismatched sum or random seed. There were three recorded places where mismatches first occurred, and I suggest that we make the checklist logging more granular for these if possible.
(https://i.imgur.com/ljdisnV.png)
edit: link to spreadsheet is here: https://docs.google.com/spreadsheets/d/1Jwa-5G6aXfkrCrPYLQvDbYt7LqYMgYapW8hyOF9IB3s/edit?usp=sharing
Title: Re: Losses of synchronisation in online play
Post by: Mariculous on September 14, 2020, 10:54:42 AM
There seems to be yet another desync issue, which appears more-or-less immediately after connectngto the server.
It can be reproduced on stephenson-siemens server. Here is the latest save of that map, which allowed me to reliably reproduce the desync locally by simply hosting server and client on the same machine.

Stephenson-siemens save: https://dome.xileks.de/simutrans/save/stephenson-siemens.sve


Pak192 server:
Simutrans extended Version: https://citarofahrer.ch/ST/exe/Simutrans-Extended-Complete.zip  (not the latest because OVH does not support latest nightlies)
Pakset: https://www.simutrans.ovh/server/41/pakset
Save: https://cdn.discordapp.com/attachments/744334274357887016/754782292781826189/savegame.sve
Title: Re: Losses of synchronisation in online play
Post by: freddyhayward on September 14, 2020, 11:11:58 AM
Quote from: Freahk on September 14, 2020, 10:54:42 AMThere seems to be yet another desync issue, which appears more-or-less immediately after connectngto the server.
It can be reproduced on stephenson-siemens server. Here is the latest save of that map, which allowed me to reliably reproduce the desync locally by simply hosting server and client on the same machine.

Stephenson-siemens save: https://dome.xileks.de/simutrans/save/stephenson-siemens.sve
I tried this and couldn't reproduce the desync.
Title: Re: Losses of synchronisation in online play
Post by: Mariculous on September 14, 2020, 12:04:17 PM
Wait, I got an idea!
Might be related to *_overrides_savegame_settings

Edit: It seems like something is wrong when the server makes use of *_overrides_savegame_settings=1

Yet tested with the provided pak192 save.
This seems to be a relatively new issue, as stephenson-siemens made use of this for a long time. Searching for the commit that introduced this isue now.

Edit2: Confirmed the same with the stephenson-siemens save. It's running locally without any desyncs now either.
Title: Re: Losses of synchronisation in online play
Post by: Phystam on September 14, 2020, 01:08:38 PM
Ah yes, I have observed the same issue related to the override settings...
I thought that the desync is specification then...
Title: Re: Losses of synchronisation in online play
Post by: Vladki on September 14, 2020, 07:25:10 PM
Stephenson Siemens is running with override = 0
It disconnects immediately, debug says:

Warning: karte_t:::do_network_world_command:    disconnecting due to checklist mismatch:
server=[ss=88 st=11 nfc=0 rand=391693251 halt=4135 line=1 cnvy=2049
        ssr=1212207547,0,1212207547,1212207547,4173020357,4173020357,4173020357,0
        str=4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,618531044
        exr=618531044,391693251,391693251,0,0,0,0,0
        sums=1593254915,2629383893,284093,25440,4,49,14971,14982]
client=[ss=88 st=11 nfc=0 rand=391693251 halt=4135 line=1 cnvy=2049
        ssr=1212207547,0,1212207547,1212207547,4173020357,4173020357,4173020357,0
        str=4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,4173020357,618531044
        exr=618531044,391693251,391693251,0,0,0,0,0
        sums=1593254915,2629383893,284093,25440,6,49,14971,14982]

Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0

Diffference is in 5th field of sums: 4 vs 6
Title: Re: Losses of synchronisation in online play
Post by: TurfIt on September 14, 2020, 08:03:59 PM
Server set for 4 threads, client 6...
Title: Re: Losses of synchronisation in online play
Post by: Vladki on September 14, 2020, 08:23:04 PM
Quote from: TurfIt on September 14, 2020, 08:03:59 PM
Server set for 4 threads, client 6...

Oh, thanks that helped... I'm just wondering if that was not checked before, as I have not touched threads config for long time.
However - would it be possible to send this value from server to clients and override clients setting with server config?