News:

Congratulations!
 You've won the News Item Lottery! Your prize? Reading this news item! :)

Losses of synchronisation in online play

Started by freddyhayward, August 19, 2020, 09:59:26 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

freddyhayward

James, can you update as to whether you are still attempting to set up logging for the server including checklist mismatches at desyncs? This will be very useful in my opinion and I'm sure Freahk will help with any difficulties with the script.

jamespetts

Quote from: freddyhayward on August 19, 2020, 09:59:26 PM
James, can you update as to whether you are still attempting to set up logging for the server including checklist mismatches at desyncs? This will be very useful in my opinion and I'm sure Freahk will help with any difficulties with the script.

I had not looked into this for a while - I will have to spend a considerable time getting this working; but I am grateful for your offer of assistance.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

Quote from: jamespetts on August 19, 2020, 10:39:56 PM
I had not looked into this for a while - I will have to spend a considerable time getting this working; but I am grateful for your offer of assistance.
To be clear, I have not offered assistance myself! I meant that Freahk, whose restart script you have seen, will probably be willing to help.

Mariculous

Sure I will help.
If it takes too much of your time, just send your script to me and I might add the logging stuff.

Matthew

#4
I have the same unhappy symptoms reported by others last night. Frequent desyncs and I just had a non-pause freeze (time does not tick, animations do not move and the fps appears to go through the roof) for the first time.



In the latter case, I was only reading chat and the Path Explorer was not running. The desyncs happened both in that situation and when I was doing some signalling.

This is on Windows, updated using the Nightly Updater to today.

(I have not posted this as a bug report because I have no reproducible case.)
(Signature being tested) If you enjoy playing Simutrans, then you might also enjoy watching Japan Railway Journal
Available in English and simplified Chinese
如果您喜欢玩Simutrans的话,那么说不定就想看《日本铁路之旅》(英语也有简体中文字幕)。

jamespetts

I have split the discussion relating to losses of synchronisation from the general server topic, as this is a discussion about a problem with the game rather than something specific to the Bridgewater-Brunel server.

I am looking into logging at the present. However, this is extremely complex: the restart script that I currently use is one that I adapted slightly from one written many years ago. That script is quite sophisticated, and I have made only minimal alterations to it. I have very limited familiarity with shell script syntax, so I do not fully understand how this script works.

In previous versions of the game, the logging worked correctly. However, in more recent versions, and without, to my recollection, any changes to this shell script, it stopped working.

The current restart script is as follows:

#!/bin/sh

VERBOSE=1
# Number of seconds to allow for server to start/stop
startstopgracetime=5

pathtosimutrans=/usr/share/games/simutrans-extended

log() {
    if [ "$1" != "ERROR" ]; then
        if [ "$VERBOSE" -eq "1" ]; then
            echo "$1: $2" >&2
        fi
    else
        echo "$1: $2" >&2
    fi
}

show_usage() {
    echo "Usage: simctrl <tag> {status|start|restart|check|stop}" >&2
}

# Check we are running as root
if [ `id -u` != "0" ]; then
        log "ERROR" "You need to be superuser"
        exit 1
fi

# Check command line parameters, there must be exactly two of them
if [ "$#" -ne "2" ]; then
    log "ERROR" "Wrong number of arguments"
    show_usage
    exit 1
fi

tag=$1
action=$2

pidfile="$pathtosimutrans/$tag.pid"
configfile="$pathtosimutrans/etc/$tag.config"

if [ -f $configfile ]; then
    . $configfile
else
    log "ERROR" "No config file available for tag: $tag at location: $configfile"
    exit 1
fi

# Need to check that the following variables are actually present (e.g. that config file is valid)
if [ -z "${port:+x}" ]; then
    log "ERROR" "Config, parameter: 'port' not set for tag: $tag"
    exit 1
fi
if [ -z "${revision:+x}" ]; then
    log "ERROR" "Config, parameter: 'revision' not set for tag: $tag"
    exit 1
fi
if [ -z "${objects:+x}" ]; then
    log "ERROR" "Config, parameter: 'objects' not set for tag: $tag"
    exit 1
fi
if [ -z "${save:+x}" ]; then
    log "ERROR" "Config, parameter: 'save' not set for tag: $tag"
    exit 1
fi
if [ -z "${server_id:+x}" ]; then
    log "ERROR" "Config, parameter: 'server_id' not set for tag: $tag"
    exit 1
fi
if [ -z "${server_name:+x}" ]; then
    log "ERROR" "Config, parameter: 'server_name' not set for tag: $tag"
    exit 1
fi
if [ -z "${server_comment:+x}" ]; then
    log "ERROR" "Config, parameter: 'server_comment' not set for tag: $tag"
    exit 1
fi
if [ -z "${debug:+x}" ]; then
    # Debug defaults to 2
    debug=2
fi


# We do not use the $revision system at present for the Extended server.

simpath="$pathtosimutrans/simutrans-extended"
#simpath="$pathtosimutrans/$revision/simutrans-extended"

simutrans_stop() {
    # Check if process is running
    # $pid will be either the pid, or true value
    PID=$(pidof_simutrans)

    if [ "$PID" -gt "0" ]; then
        # PID found and it is running, take steps to stop the server

        logger -t $tag Attempting to stop server...
        echo -n "Attempting to stop simutrans instance: $tag ($PID)."

        kill $PID

        dead=0
        count=0

        while [ $dead -eq "0" ] && [ $count -le $startstopgracetime ]
        do
            sleep 1

            # Update status
            echo -n "."

            # Increment count by 1
            count=$(($count+1))

            # Check if process is still running, pidof_simutrans returns -2 if it cannot find the pid
            # it will also remove the pidfile in this instance
            if [ $(pidof_simutrans) -ne "-2" ]; then dead=1; fi
        done

        # Finish up status with newline
        echo ""

        if [ $dead -eq "0" ]; then
            log "ERROR" "Kill didn't work, server is still running"
            logger -t $tag Server failed to stop!
            exit 1
        fi

        logger -t $tag Server successfully stopped
    else
        log "INFO" "Server not running, not performing stop action."
        echo "Server is not running"
    fi
}

simutrans_start() {
    # Check if this tagged instance is already running by checking PID file
    PID=$(pidof_simutrans)
    if [ "$PID" -eq "-2" ]; then
        log "INFO" "Pidfile existed at: $pidfile, but no process with that ID was running. Continuing with startup of instance: $tag"
    elif [ "$PID" -gt "0" ]; then
        log "ERROR" "Pidfile exists at: $pidfile and a process is running with that ID, cannot start duplicate instance of server: $tag"
        exit 1
    fi

    echo -n "Attempting to start Simutrans-Extended instance: $tag."
    logger -t $tag Attempting to start server...

    # Start up the server; running as non-root user disabled for the time being, as, oddly, when run as non-root, nobody can connect to the server.
    if [ $1 = "start" ]; then
        /bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects -load $save 2>&1 & echo \$! >&3 ) 3>$pidfile | logger -i -t $tag &"
    elif [ $1 = "restart" ]; then
        /bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects 2>&1 & echo \$! >&3 ) 3>$pidfile | logger -i -t $tag &"
    fi

    alive=0
    count=0

    while [ $alive -eq "0" ] && [ $count -le $startstopgracetime ]
    do
        sleep 1

        # Update status
        echo -n "."

        # Increment count by 1
        count=$(($count+1))

        # Check if process is running yet, pidof_simutrans returns > 0 (the PID) if it is running
        if [ $(pidof_simutrans) -gt "0" ]; then alive=1; fi
    done

    # Finish up status with newline
    echo ""

    # Check if simutrans has actually started
    if [ "$alive" -eq "0" ]; then
        log "ERROR" "Startup appears to have failed, no process is running with the pid specified in the pidfile for this instance ($pidfile)"
        logger -t $tag Server startup failed
        exit 1
    else
        log "INFO" "Server startup appears to have worked, instance: $tag is now running with pid: $(pidof_simutrans)"
        logger -t $tag Server startup script completed
    fi

    # Return the pidfile
    cat $pidfile
}

pidof_simutrans() {
    # If pidfile exists and there is a simutrans process with the pid specified
    # print the pid and return 0
    if [ -e "$pidfile" ]; then
        pid=`cat $pidfile`
        kill -0 $pid 2>/dev/null
        if [ "$?" -eq "1" ]; then
            log "INFO" "Pidfile specifies a process that doesn't exist (pid: $pid), removing pidfile"
            rm $pidfile
            echo -2
        else
            log "INFO" "Pidfile found, pid is a running process (pid: $pid)"
            echo $pid
        fi
    else
        log "INFO" "Pidfile not found, this process either doesn't exist or wasn't started with this script"
        echo -1
    fi
}


case $action in
    status)
        # Check status of tagged instance
        PID=$(pidof_simutrans)

        if [ "$PID" -gt "0" ]; then
            echo "Simutrans instance: $tag is currently running with pid: $PID"
            echo "Output of \"ps uww -p $PID\":"
            ps uww -p $PID
        else
            echo "Simutrans instance: $tag is not currently running."
        fi
    ;;
    start)
        simutrans_start "start"
    ;;
    stop)
        simutrans_stop
    ;;
    restart)
        # If server running, stop it first
        simutrans_stop

        simutrans_start "restart"
   ;;
    check)
     # Check status of tagged instance
        PID=$(pidof_simutrans)

        # Check whether the server is running, and restart it if not.
   date
        if [ "$PID" -gt "0" ]; then
            echo "Simutrans instance: $tag is currently running. No action necessary."
        else
            echo "Simutrans instance: $tag is not currently running. Restarting (after archiving the log)..."
      now=$(date +"%m_%d_%Y")
      filename_log="/var/log/simutrans/simu-server13353-$now.log"
      logpath="$simpath/simu-server13353.log"
      cp "$logpath" "$filename_log"
      simutrans_start "restart"
        fi
    ;;
    *)
        echo "ERROR" "Invalid command"
        show_usage
    ;;
esac

The critical line appears to be:
/bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects 2>&1 & echo \$! >&3 ) 3>$pidfile | logger -i -t $tag &"
I am not sure why this does not work.
I am grateful to Freahk for sending his logging code, but I am having trouble working out any way to integrate this into the existing script, as the latter is very complex and it is difficult to for me to understand that complexity with a limited understanding of shell script syntax, and thus difficult for me to understand what details are significant and what are not.

I should therefore be grateful for Freahk's assistance as offered. Thank you in advance.

Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I notice that Ceeac has just pushed some fixes for possible undefined behaviour and uninitialised variables, which I have incorporated. The code in question, I believe, originates in Standard. I am not sure whether this will assist, but I should be interested in any reports on to-morrow's nightly build as to whether the losses of synchronisation are notably less frequent or not.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have been testing and have been unable to reproduce this loss of synchronisation locally with a server running an optimised debug Visual Studio build and one identical client connecting and remaining connected without any interaction.

However, when I log into the Bridgewater-Brunel server, I find that I do lose synchronisation quite frequently. It would very much help me in narrowing down the issue to know the following:

(1) has anyone suffered losses of synchronisation when alone on the server and not changing anything;
(2) has anyone suffered losses of synchronisation when alone on the server and changing things; and
(3) what things were being changed when people lost synchronisation?

Unfortunately, I cannot test whether there are any losses of synchronisaiton on the Stephenson-Seimens server, as this is offline as the list server is currently down. Being able to test this would help to narrow down whether the losses in synchronisation on the Bridgewater-Brunel server arise from some state of affairs that is present on that server and not on the Stephenson-Seimens server, or whether instead it arises from some change in the code recently that affects all servers.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Huitsi

I have certainly had desyncs while alone. As for whether I was doing anything those times, I'll have to pay attention next time. In general I have had desyncs both while doing nothing (often when I'm AFK or just observing things like right now), when I'm just about to do something (picked a tool but not used it yet) as well as after doing something. The non-server-crash desyncs have never, AFAIK, caused any loss of my actions, so if I have just build a signal, for instance, it's there when I rejoin.

freddyhayward

Quote from: jamespetts on August 22, 2020, 02:33:37 PM(1) has anyone suffered losses of synchronisation when alone on the server and not changing anything;
Yes, very frequently.
Quote from: jamespetts on August 22, 2020, 02:33:37 PM(2) has anyone suffered losses of synchronisation when alone on the server and changing things; and
Yes, very frequently.
Quote from: jamespetts on August 22, 2020, 02:33:37 PM(3) what things were being changed when people lost synchronisation?
Apart from doing nothing, almost anything. Building/removing ways, terraforming, editing lines and schedules, buying vehicles, selling vehicles, sending vehicles to the depot, placing/deleting signs or signals.
Note that these desyncs are not new, so there is no possibility of recent code changes being the cause. Again, I think that checklist mismatch messages will be useful. How exactly has the difficulty in logging presented? Is there an error, or does it simply not create a log file at all? Is there any output at all (e.g. echo statements from the script) that you are able to view?

jamespetts

Thank you for the feedback: that is helpful. I have been undertaking some of my own testing on the server, including building a small railway to test losses of synchronisation on interaction.

As to the logs, I have realised that the problem was that the log file was being created in the wrong place. I have now found it and have uploaded the latest logs to here. Please note that this is a ~200mb text file.

In relation to the statement that the losses of synchronisation are not new, may I ask when people first started to notice them more often than about once a day? Do they occur on the Stephenson-Seimens server (when it is running), or only on the Bridgewater-Brunel server?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TurfIt

Quote from: jamespetts on August 20, 2020, 12:04:45 PM
The current restart script is as follows:
Why so complicated? .... supervisord/crond and done. Supervisor handles starting, stopping, log rotating, and monitoring process running. A simple cron task to force a sync and have a daily save archive, or update a nightly program version, etc.


Quote from: jamespetts on August 20, 2020, 12:04:45 PM

# Check we are running as root

Right there, I'm done with this script. Running Simutrans as root? That server needs cleansing with fire, stat.
In no way shape form can Simutrans be deemed safe to run as root.

Quote from: jamespetts on August 20, 2020, 12:04:45 PM

        /bin/sh -c "( $simpath -server $port -server_id $server_id -server_name $server_name -server_comment $server_comment -debug $debug -log 2 -lang en -nosound -nomidi -objects $objects -load $save 2>&1 & echo \$! >&3 ) 3>$pidfile |

There are no levels to -log. What is the intent of '-log 2'?
'| logger'  Simutrans has built in syslog support (compile time enable required) if you want the logs there, but it looks like the script trying to combine Simutrans normal logging (simu-serverXXX.log) with syslog?

jamespetts

I did not write the script - I adapted it slightly from one written by somebody else who no longer visits these forums many years ago. As to running in root, the reason for that, I believe, is that when it is set to run as a non-root user, nobody was able to join the server.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Matthew

Quote from: jamespetts on August 22, 2020, 02:33:37 PM
(1) has anyone suffered losses of synchronisation when alone on the server and not changing anything;
(2) has anyone suffered losses of synchronisation when alone on the server and changing things; and

I think yes to both, but I wasn't specifically looking out for it.

And I think I've also had two reasonably long periods alone without desynchs. Just now I was on the server, changing things, and didn't desynch for at least an hour. The server said that 2 clients were connected, but no one replied to my chat message, so I think I was alone. When someone else did definitely connect, I lost synch within 10 minutes. I think that (~an hour without desyncs) has happened once before in the last ten days. I'm not sure at one point in the various changes it was, but I remember being surprised that I lasted so long without a desynch.

Quote(3) what things were being changed when people lost synchronisation?

As others have said, everything. I just can't notice any pattern here. But I will continue trying to spot one, as I am sure everyone else is.
(Signature being tested) If you enjoy playing Simutrans, then you might also enjoy watching Japan Railway Journal
Available in English and simplified Chinese
如果您喜欢玩Simutrans的话,那么说不定就想看《日本铁路之旅》(英语也有简体中文字幕)。

freddyhayward

Quote from: jamespetts on August 22, 2020, 03:56:46 PMAs to the logs, I have realised that the problem was that the log file was being created in the wrong place. I have now found it and have uploaded the latest logs to here. Please note that this is a ~200mb text file.
Thank you for uploading this! I have located two mismatch warnings in the file:
Warning: karte_t::process_network_commands: kicking client due to checklist mismatch : sync_step=413158 server=[ss=413158 st=25822 nfc=6 rand=1232380772 halt=5216 line=1025 cnvy=8193 ssr=2417508643,1232380772,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0
initiator=[ss=413158 st=25822 nfc=6 rand=2159140440 halt=5216 line=1025 cnvy=8193 ssr=1442281764,2159140440,0,0,0,0,0,0 str=775003360,775003360,775003360,775003360,775003360,3751946317,3751946317,3751946317,3751946317,3751946317,3751946317,2444611105,2792299441,343840496,40657385,3751946317 exr=0,0,0,0,0,0,0,0 sums=1744600149,2506396596,0,0,0,0,0,0

Warning: karte_t::process_network_commands: kicking client due to checklist mismatch : sync_step=415405 server=[ss=0 st=0 nfc=0 rand=0 halt=0 line=0 cnvy=0 ssr=0,0,0,0,0,0,0,0 str=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 exr=0,0,0,0,0,0,0,0 sums=0,0,0,0,0,0,0,0
initiator=[ss=415405 st=25962 nfc=13 rand=3700814870 halt=5216 line=1025 cnvy=8193 ssr=3021926069,3700814870,0,0,0,0,0,0 str=4158350141,4158350141,4158350141,4158350141,4158350141,3425583775,3425583775,3425583775,3425583775,3425583775,3425583775,1565161018,3974084165,146709321,17347525,3425583775 exr=0,0,0,0,0,0,0,0 sums=767240836,1331425206,0,0,0,0,0,0

The causes of both of these (though, not their location in the code) are immediately clear:
1. in the first case, the problem seemed to be inconsistent random number generator behaviour. Either the generator itself has a problem, or it was not called at the same time for the server and client.
2. In the second, the server's checklist is zeroed while the client's is valid, indicating that there is somewhere in the code where the server clears its checklist but the client does not - this is a similar case to the one I had previously fixed, and should be slightly easier than the above.
I will look into this further and see if I can identify any fixes.

TurfIt

Quote from: freddyhayward on August 22, 2020, 10:49:20 PM
1. in the first case, the problem seemed to be inconsistent random number generator behaviour. Either the generator itself has a problem, or it was not called at the same time for the server and client.
No, it shows the client and server executing diverging code paths resulting in the random number generator being called a different number of times.
Since the str= lists match server and client, the calls from anything in ::step() are ok. The first ssr= number shows the mismatch - something in the ::sync_step() routines is going wrong. This would typically be vehicle movement - citycars movement. The problem is trying to figure out what has put the vehicles in different spots client vs server. It could be a player convoi that's left sooner and got in the way of a citycar on one, or many many other possibilities. Hence very hard to track these problems down.

freddyhayward

Quote from: TurfIt on August 22, 2020, 11:00:16 PM
No, it shows the client and server executing diverging code paths resulting in the random number generator being called a different number of times.
Since the str= lists match server and client, the calls from anything in ::step() are ok. The first ssr= number shows the mismatch - something in the ::sync_step() routines is going wrong. This would typically be vehicle movement - citycars movement. The problem is trying to figure out what has put the vehicles in different spots client vs server. It could be a player convoi that's left sooner and got in the way of a citycar on one, or many many other possibilities. Hence very hard to track these problems down.
I don't think this particular case is directly related to convoy movement, since the convoy debug sums are matching. However your first statement is probably correct.

jamespetts

Matthew - thank you for your feedback/data: this is very helpful. I have started a small railway company on the server to test this, and have found no particular relationship between such interactions as I have been making (albeit relatively modest of late as I have relatively modest profits so far) and losses of synchronisation. There might be a higher rate of losses of synchronisation when other players are online, but the pattern is not strong enough for me to be sure of this. There can be long periods without any loss of synchronisation, and losses of synchronisation within minutes of logging in.

Thank you to Freddy for analysing the logs: this is most helpful. TurfIt is correct about at least the first of the two types of loss of synchronisation: this type originates from diverging code paths and is fantastically difficult to track down, as this can arise virtually anywhere in the simulation code. The second type is more curious: I do not know what might cause a server to reset its checklists; I have not modified code relating to this, so I do not really know how this code works. Does anyone who knows anything about the checklist code have any idea whether there are any circumstances in which it ought to be resetting like this?

Would it be helpful if I were to symlink the live log file to a place that you can read it so that you can analyse this dynamically?

Edit: The convoy check sums matching does not tell us much: those numbers only tell us how many convoys that there are and what handle numbers that they have been allocated, not what they are actually doing. An almost unimaginably vast number of things relating to convoy movement might cause this sort of loss of synchronisation, including (amongst many, many, many other things) private cars, railway signalling, loading behaviour (including loading times, divergence in which might be caused in turn by divergence in passenger numbers, which might, in turn, be caused by something else, and so on for an arbitrary number of steps), physics, industry production (which would affect loading), road junction logic, pathfinding, and a myriad other things.

The only way that I have found to track down these problems is to reproduce the problem locally then disable game features one by one until the problem can be narrowed down to one specific part of the code; or alternatively, remove player networks one by one until the problem can be narrowed down to a specific piece of in-game infrastructure. Unfortunately, this method is only effective when the losses of synchronisation occur with a high frequency. I know of no workable method of finding losses of synchronisation that occur at such a low frequency as these have been reported, especially since I have not been able to reproduce it at all locally, since there is no reliable way of telling whether disabling the feature or removing the infrastructure has suppressed the problem, or whether there just happen not to be any losses of synchronisation in any event for the last hour.

It might be that the only thing to do is wait for the losses of synchronisation to become so frequent that they are more readily reproducible.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

#18
Quote from: jamespetts on August 23, 2020, 12:00:30 AM
Edit: The convoy check sums matching does not tell us much: those numbers only tell us how many convoys that there are and what handle numbers that they have been allocated, not what they are actually doing. An almost unimaginably vast number of things relating to convoy movement might cause this sort of loss of synchronisation, including (amongst many, many, many other things) private cars, railway signalling, loading behaviour (including loading times, divergence in which might be caused in turn by divergence in passenger numbers, which might, in turn, be caused by something else, and so on for an arbitrary number of steps), physics, industry production (which would affect loading), road junction logic, pathfinding, and a myriad other things.
As far as I am aware, the convoy checksums (that is the first two elements of sums=1744600149,2506396596,0,0,0,0,0,0) do count precisely the movement of the convoys. I should have made this clearer to distinguish them form the convoy handle sum.
Edit: it's also worth noting that the random seed is out of sync, not simply the result of generated random numbers. I couldn't find anything in the code that actually modifies this apart from the seed generation algorithm itself. It might also be possible that integer overflow of some kind is happening.

TurfIt

#19
Quote from: freddyhayward on August 23, 2020, 12:14:31 AM
Edit: it's also worth noting that the random seed is out of sync, not simply the result of generated random numbers. I couldn't find anything in the code that actually modifies this apart from the seed generation algorithm itself. It might also be possible that integer overflow of some kind is happening.
The seed changes after each call to get a random number... how a mersenne twister works.

Quote from: freddyhayward on August 23, 2020, 12:14:31 AM
As far as I am aware, the convoy checksums (that is the first two elements of sums=1744600149,2506396596,0,0,0,0,0,0) do count precisely the movement of the convoys.
If that's what those sums= mean, then unless some new code has been added in sync_step() requiring randoms, it's the citycars as the only random users in sync_step() that are exhibiting the fault.

I'd added the str= and ssr= lists to the checklist some time ago in Extended to try helping debug a desync issue. Rather not as helpful as hoped, but the idea was to see where they diverged, and further subdivide the problem adding/ changing the lists to narrow in on the bad code. Of course they were temporary debugging code that seems permanent now!  i.e. Their positions in the code were for helping find a specific desync years ago, they need to be redone each time a new desync is hunted, and likely redone many times as one narrows in (hopefully!). The sums= lists must be someone else's attempt at tracking desyncs - I don't remember them.

EDIT: forget to mention - for these to have any meaning, frames between checks must be set to 1 - i.e. send checklists every frame.

freddyhayward

#20
Here are three (edit: now eight) desync messages logged on my end:
Context: placing signs, not really doing anything.
Warning: karte_t:::do_network_world_command: sync_step=30216 
server=[ss=30216 st=1888 nfc=8 rand=1557250333 halt=5226 line=1025 cnvy=8193 ssr=1241857636,1557250333,0,0,0,0,0,0 str=4103595329,4103595329,4103595329,4103595329,4103595329,4121655257,4121655257,4121655257,4121655257,4121655257,4121655257,2645973647,1196416888,1871716010,220948049,4121655257 exr=0,0,0,0,0,0,0,0 sums=1726200471,949113653,0,0,0,0,0,0
client=[ss=30216 st=1888 nfc=8 rand=3538964740 halt=5226 line=1025 cnvy=8193 ssr=1911461731,3538964740,0,0,0,0,0,0 str=4103595329,4103595329,4103595329,4103595329,4103595329,4121655257,4121655257,4121655257,4121655257,4121655257,4121655257,2645973647,1196416888,1871716010,220948049,4121655257 exr=0,0,0,0,0,0,0,0 sums=1726200471,949113653,0,0,0,0,0,0

Context: immediately upon connection.
Warning: karte_t:::do_network_world_command: sync_step=30408
server=[ss=30408 st=1900 nfc=8 rand=1828447704 halt=5226 line=1025 cnvy=8193 ssr=3587605122,1828447704,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0
client=[ss=30408 st=1900 nfc=8 rand=2437304339 halt=5226 line=1025 cnvy=8193 ssr=1693602772,2437304339,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0

Context: terraforming.
Warning: karte_t:::do_network_world_command: skipping command due to checklist mismatch : sync_step=39093
server=[ss=39093 st=2443 nfc=5 rand=3369424330 halt=5226 line=1025 cnvy=8193 ssr=3486250843,3369424330,0,0,0,0,0,0 str=491313374,491313374,491313374,491313374,491313374,1561926749,1561926749,1561926749,1561926749,1561926749,1561926749,4228572398,3577287383,623968325,73653577,1561926749 exr=0,0,0,0,0,0,0,0 sums=3005605747,3294018549,0,0,0,0,0,0
executor=[ss=39093 st=2443 nfc=5 rand=2985375289 halt=5226 line=1025 cnvy=8193 ssr=1285812580,2985375289,0,0,0,0,0,0 str=491313374,491313374,491313374,491313374,491313374,1561926749,1561926749,1561926749,1561926749,1561926749,1561926749,4228572398,3577287383,623968325,73653577,1561926749 exr=0,0,0,0,0,0,0,0 sums=3005605747,3294018549,0,0,0,0,0,0

Context: placing signal/signalbox, Huitsi had just joined.
Warning: karte_t:::do_network_world_command: sync_step=57992 
server=[ss=57992 st=3624 nfc=8 rand=4241491095 halt=5226 line=1025 cnvy=8193 ssr=3703758365,4241491095,0,0,0,0,0,0 str=3086837121,3086837121,3086837121,3086837121,3086837121,352968718,352968718,352968718,352968718,352968718,352968718,358863249,2315238162,267133988,31528861,352968718 exr=0,0,0,0,0,0,0,0 sums=892798643,1570647053,0,0,0,0,0,0
client=[ss=57992 st=3624 nfc=8 rand=2837245729 halt=5226 line=1025 cnvy=8193 ssr=3085442846,2837245729,0,0,0,0,0,0 str=3086837121,3086837121,3086837121,3086837121,3086837121,352968718,352968718,352968718,352968718,352968718,352968718,358863249,2315238162,267133988,31528861,352968718 exr=0,0,0,0,0,0,0,0 sums=892798643,1570647053,0,0,0,0,0,0

Context: selecting the signalbox and about to place a signal. There were two other players online.
Warning: karte_t:::do_network_world_command: sync_step=30408 
server=[ss=30408 st=1900 nfc=8 rand=1828447704 halt=5226 line=1025 cnvy=8193 ssr=3587605122,1828447704,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0
client=[ss=30408 st=1900 nfc=8 rand=2437304339 halt=5226 line=1025 cnvy=8193 ssr=1693602772,2437304339,0,0,0,0,0,0 str=743529120,743529120,743529120,743529120,743529120,3243039856,3243039856,3243039856,3243039856,3243039856,3243039856,1673307103,582538048,12663145,1494801,3243039856 exr=0,0,0,0,0,0,0,0 sums=685466241,4130127585,0,0,0,0,0,0

Context: was AFK, there were previously 4 player connected including jamespetts and VOLVO who had disconnected (presumably desynced). I believe at least one of them had attempted to reconnect before I desynced, but cannot tell for sure.
Warning: karte_t:::do_network_world_command: sync_step=74288 
server=[ss=74288 st=4643 nfc=0 rand=2147361007 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,2147361007,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0
client=[ss=74288 st=4643 nfc=0 rand=1389728691 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,1389728691,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0

Context: editing a schedule. jamepetts had recently disconnected (presumably desynced) and may have attempted to reconnect, but I cannot tell.
Warning: karte_t:::do_network_world_command: sync_step=74288 
server=[ss=74288 st=4643 nfc=0 rand=2147361007 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,2147361007,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0
client=[ss=74288 st=4643 nfc=0 rand=1389728691 halt=5226 line=1025 cnvy=8193 ssr=1314354187,4204329809,0,0,0,0,0,0 str=4204329809,4204329809,4204329809,4204329809,4204329809,2370101764,2370101764,2370101764,2370101764,2370101764,2370101764,170758243,1389728691,36851628,4349120,2370101764 exr=0,0,0,0,0,0,0,0 sums=683012918,1291649801,0,0,0,0,0,0

Context: inspecting vehicles in the depot (probably unrelated)
Warning: karte_t:::do_network_world_command: sync_step=93496 
server=[ss=93496 st=5843 nfc=8 rand=1332702392 halt=5226 line=1667 cnvy=8193 ssr=78849053,1332702392,0,0,0,0,0,0 str=162498849,162498849,162498849,162498849,162498849,1572107109,1572107109,1572107109,1572107109,1572107109,1572107109,1056427539,1584585818,468776815,55319441,1572107109 exr=0,0,0,0,0,0,0,0 sums=2075641198,334357099,0,0,0,0,0,0
client=[ss=93496 st=5843 nfc=8 rand=1846799280 halt=5226 line=1667 cnvy=8193 ssr=1120227439,1846799280,0,0,0,0,0,0 str=162498849,162498849,162498849,162498849,162498849,1572107109,1572107109,1572107109,1572107109,1572107109,1572107109,1056427539,1584585818,468776815,55319441,1572107109 exr=0,0,0,0,0,0,0,0 sums=2075641198,334357099,0,0,0,0,0,0


jamespetts

Thank you both for your work on this. The server frames between checks had been set to 8; I have re-set it to 1 to aid with debugging this if anyone really can track down the problem using these numbers rather than the elimination test process that I described above.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Now that the Stephenson-Seimens server is up and running again, I should be very grateful if people could confirm whether or not they get regular losses of synchronisation when playing that online saved game.

Thank you in advance.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

Regarding running as root or not. Stephenson-Siemens is NOT running as root. (I would not dare doing that...)
My scripts are attached below - simupdate run daily, simrestore run hourly (rename from *.txt to *.sh or whatewer)
Feel free to modify


jamespetts

Thank you for that - that is helpful. I do not think that the problem was the script, but rather the inability to connect to the server howsoever started if it was stated by a non-root user. However, this was such a long time ago that I do not recall much about the details, save that I spent a large amount of time attempting to deal with this and found the problem entirely intractable.




Regarding the losses of synchronisation: I tested this afternoon with the Stephenson-Seimens server: with nobody else online, and without interaction (while doing other things in the background), I was able to stay connected for > 2 hours.

This suggests that some feature or aspect of the saved game on the Bridgewater-Brunel server has the effect of invoking the troubled part of the code whereas that is not true of the Stephenson-Seimens server.

The most obvious things that are present on the Bridgewater-Brunel server but not the Stephenson-Seimens server are time interval signalling and industry connected by players. The former was present on the previous Bridgewater-Brunel server, and did not appear to cause losses of synchronisation then, although the code may have changed since then.

I have just pushed a change to the Github repository that should allow the industry density proportion override to work as intended when modified only using simuconf.tab. This should have the effect that the amount of industry on the map will increase significantly its quantity. This should result, hopefully, in more players using freight routes. If, over the next 30 or so game years, the frequency of losses of synchronisation increase significantly, then we can infer that the cause is likely to be industry and conduct further testing on that basis. If, conversely, the losses of synchronisation cease within that period, we can infer that time interval signalling is a likely culprit. In any other case, further and more difficult testing work will be required.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

Might it be worth temporarily re-enabling DEBUG_SIMRAND_CALLS? This could help to identify the source of inconsistent simrand calls between the server and client.

jamespetts

Quote from: freddyhayward on August 25, 2020, 09:56:11 PM
Might it be worth temporarily re-enabling DEBUG_SIMRAND_CALLS? This could help to identify the source of inconsistent simrand calls between the server and client.

I have not looked at this code for a while, but, the last time that I looked at it, it was not suitable for use on a running server and was very crude. It might have been enhanced since then, but it would need some looking into to check this.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

ceeac

I found several potential sources of desyncs when using Windows builds (compiled by MSVC/MinGW). Not sure if this is the same problem, but pull request #266 fixes these.

jamespetts

Quote from: ceeac on August 26, 2020, 10:08:39 AM
I found several potential sources of desyncs when using Windows builds (compiled by MSVC/MinGW). Not sure if this is the same problem, but pull request #266 fixes these.

Thank you for that - that is helpful. This is now incorporated.

I suspect that this is not the cause of at least many of the losses of synchronisation reported, since this sort of loss of synchronisation would only occur shortly after players used these specific tools and players have reported losses of synchronisation in circumstances where nobody has used any tools since the player losing synchronisation connected, but it is very helpful to have this fix all the same, as this may well have caused at least some losses of synchronisation, which it is worth fixing.

Thank you again.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

I have noticed that quite a few of these desyncs were caused by local settings and environment variables not being properly overriden by the server. I wonder if there could be more hiding in this category of errors?

jamespetts

Quote from: freddyhayward on August 26, 2020, 10:45:12 AM
I have noticed that quite a few of these desyncs were caused by local settings and environment variables not being properly overriden by the server. I wonder if there could be more hiding in this category of errors?

It is possible in theory - along with myriad other possibilities.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

User experiences reported this afternoon suggest that losses of synchronisation occur more often shortly after new players join. This suggests that the issue may be related to saved data not being consistent with data held in memory by the server.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have undertaken some further testing on this to-day. Using the Bridgewater-Brunel saved game from January 1858, I used an optimised debug build and connected 2 clients to the server locally for ~1 hour. No loss of synchronisation was observed.

I then repeated the experiment with a build downloaded from the server, and one of the two clients lost synchronisation after somewhere between 1 and 2 hours, approximately. The other client has remained synchronised.

Note that it has long been the case that the Visual Studio builds cannot remain synchronised with the Linux or cross-compiled builds.

This may enable some narrowing down of the problem to occur, but, at this rate, it will take a very, very large amount of time to make progress given that each change requires up to 2 hours of testing to see whether it has had any effect.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

freddyhayward

Since the surface-level causes of losses of synchronisation is usually mismatching random seeds (presumably, simrand being called an inconsistent number of times), we could try recording in each checklist the number of calls to simrand since the previous checklist. The size of the inconsistency might offer useful clues.
I should also like to repeat my suggestion that disconnections due to checklist mismatches be treated as errors rather than warnings. I recall that you were concerned that these were not always errors, but those would be a tiny minority of cases. In practice, it is inconvenient to monitor checklist mismatches on debug level 2 because of the less-severe warning and message spam.

jamespetts

I have been re-running the test with the optimised debug builds, connecting two local clients, this time with a version of the Bridgewater-Brunel saved game from yesterday evening in 1862. I started this shortly after my post above, and both clients are still connected to the (local) server without loss of synchronisation. I will keep this running for longer to see whether this persists for an extended period.

As to errors versus warnings of checklist mismatches, may I ask why it is difficult simply to use a search tool to find the text for these in log files?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.