News:

Simutrans Tools
Know our tools that can help you to create add-ons, install and customize Simutrans.

Networking oddity: diagnosing desyncs

Started by jamespetts, January 30, 2011, 12:35:44 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

jamespetts

I'd appreciate the input of other developers on this one. I noticed when I tested my 9.x branch recently in network mode many more desyncs than I had had with 9.2. The log files indicated that the problems arose, as usual, with the random seed being unsynchronised. To try to diagnose the problem, I implemented a debugging mechanism that would log every call to simrand with the name of the caller and the current state of the random seed so that I could see exactly where they diverge.

What I found was very odd indeed: my client log files would inevitably look something like this around the crucial point:


Warning: simrand: stadt_t::destination stadt_t::finde_passagier_ziel (init) (825978845)
Warning: simrand: stadt_t::destination stadt_t::finde_passagier_ziel (town) (2015962805)
Warning: simrand: koord stadt_t::get_zufallspunkt (349613883)
Warning: simrand: void verkehrsteilnehmer_t::hop() (1107522216)
Warning: network_check_activity(): received cmd id=7 nwc_check_t from socket[468]
Warning: simrand: void verkehrsteilnehmer_t::hop() (982162114)
Warning: simrand: void verkehrsteilnehmer_t::hop() (50510621)
Warning: karte_t::interactive: sync_step=4976  server=[rand=3765579299 halt=7 line=2 cnvy=4] client=[rand=50510621 halt=7 line=2 cnvy=4]
Warning: karte_t::interactive: disconnecting due to checklist mismatch


When comparing this to the server's log, I found that the divergence inevitably occurs immediately or almost immediately after the Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468] part. In this instance, for example, the server logged identical callers and random seeds up to that line, but the second call after the network_check_activity() call was ware_t haltestelle_t::hole_ab with random seed 50510621 instead of void verkehrsteilnehmer_t::hop(), and the divergence continued from there until the desync very shortly afterwards. In most other tests, the divergence appeared immediately afterwards.

Is the network check causing a desync somehow? I don't understand what that method does that might cause a divergence in the code paths followed by the server and client thereafter. Any thoughts on this issue would be much appreciated.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

The network_check_Activity should not interfer with random numbers at all. The check-command is precisely the data sent by server that contains the random number information.

Could you post the part of the server log file that corresponds to this clinet log file? Any chance that you missed to log a call to the simrand method?

Edit: reread your post: the desync comes from the different calls to hole_ab(). The recevied command does not change anything in the world. Only client receives that command, so it does not appear in server log.
Parsley, sage, rosemary, and maggikraut.

jamespetts

Thank you for your reply: for some reason, the server logging isn't working properly (the logfiles appear completely blank even when "-log" is specified at the command line), so I generated psudo server logs by outputting the last 64 calls to random to the message window, which I can't copy and paste.

The odd thing is, though, that over the several times that I have run this test, the divergent call on the server has always been almost or actually immediately after a network_activity_check and was different each time. I'll run the test again, though, and see what other calls that I manage to find.

Incidentally, am I to take it from that that, where there is a differential call on the client and the server, one should look to whatever called the method on the server, rather than the client, to find the fault?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

Quote from: jamespetts on January 30, 2011, 11:42:04 AM
Incidentally, am I to take it from that that, where there is a differential call on the client and the server, one should look to whatever called the method on the server, rather than the client, to find the fault?
yes, but most of the time there are too many possibilities that could have triggered the wrong call: convois being one step too fast etc.
Parsley, sage, rosemary, and maggikraut.

jamespetts

I have done some more testing, with the following results (all on the same map, with city cars disabled and one very simple horse and cart line):

1. Diverged before the improvised server log before started, but a further oddity is that there were no network checks for a long, long time before the checklist fail, then a number came all at once, as follows on the client log:

Quote
Warning: simrand:   koord stadt_t::get_zufallspunkt (487005750)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (-300666770)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (factory 1) (1670274006)
Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468]
Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468]
Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468]
Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468]
Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468]
Warning: karte_t::interactive:   sync_step=5688  server=[rand=3112167803 halt=7 line=2 cnvy=4] client=[rand=1593448315 halt=7 line=2 cnvy=4]
Warning: karte_t::interactive:   disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect():   Lost synchronisation with server. Random flags: 0

2. Could not find any random seeds on the last 64 entries on the server that matched any of the entries in the client log.

3. Could not find any random seeds on the last 64 entries on the server that matched any of the entries in the client log.

4. Divergence similar to the original test:

Quote
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (town) (951560665)
Warning: simrand:   koord stadt_t::get_zufallspunkt (1425376236)
Warning: karte_t::interactive:   sync_step=4960  server=[rand=1835219286 halt=7 line=2 cnvy=4] client=[rand=1835219286 halt=7 line=2 cnvy=4]
Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468]
Warning: simrand:   void stadt_t::step_passagiere() (mail or passengers?) (1835219286)
Warning: simrand:   void stadt_t::step_passagiere() (has private car?) (536130351)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (380845718)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (town) (-1234976902)
Warning: simrand:   koord stadt_t::get_zufallspunkt (799599297)
Warning: karte_t::interactive:   sync_step=4968  server=[rand=1883251353 halt=7 line=2 cnvy=4] client=[rand=2468749903 halt=7 line=2 cnvy=4]
Warning: karte_t::interactive:   disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect():   Lost synchronisation with server. Random flags: 0

On the server, the random call with seed no. 1835219286 was a call to ware_t haltestelle::hole_ab.

5. The divergence appeared here:

Quote
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (567923432)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (factory 1) (-954931593)
Warning: simrand:   void verkehrsteilnehmer_t::hop() (-2096304932)
Warning: simrand:   void verkehrsteilnehmer_t::hop() (1076746324)
Warning: simrand:   void stadt_t::step_passagiere() (mail or passengers?) (1092295953)
Warning: simrand:   verkehrsteilnehmer_t::verkehrsteilnehmer_t (weg_next) (-1512961783)
Warning: simrand:   verkehrsteilnehmer_t::verkehrsteilnehmer_t (fahrtrichtung 1) (-107701950)
Warning: simrand:   fussgaenger_t::fussgaenger_t (927021654)
Warning: simrand:   fussgaenger_t::fussgaenger_t (-1885603865)

On the server, the random call with seed no. 1092295953 was a call to ware_t haltestelle::hole_ab.

6. The divergence here appeared in a different method:

Quote
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (491522433)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (town) (187348537)
Warning: simrand:   koord stadt_t::get_zufallspunkt (847918714)
Warning: karte_t::interactive:   sync_step=17728  server=[rand=1056166756 halt=9 line=2 cnvy=4] client=[rand=1056166756 halt=9 line=2 cnvy=4]
Warning: simrand:   void verkehrsteilnehmer_t::hop() (1056166756)
Warning: simrand:   void karte_t::neuer_monat() (1289307762)
Warning: simrand:   const fabrik_besch_t *fabrikbauer_t::get_random_consumer (-776604678)
Warning: simrand:   fabrikbauer_t::increase_industry_density() (1059616191)
Warning: simrand:   fabrikbauer_t::increase_industry_density() (353425880)
Warning: simrand:   fabrik_t::fabrik_t (-1304597293)

On the server, the random call with seed no. 1289307762 was a call to koord stadt_t::get_zufallspunkt.

7. On this occasion, the divergence occurred immediately after the network_check_activity:

Quote
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (-1801099213)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (town) (874618070)
Warning: simrand:   koord stadt_t::get_zufallspunkt (1562158299)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (1443248572)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (town) (132057872)
Warning: simrand:   koord stadt_t::get_zufallspunkt (-1799658836)
Warning: network_check_activity():   received cmd id=7 nwc_check_t from socket[468]
Warning: simrand:   void karte_t::neuer_monat() (1079200276)
Warning: simrand:   const fabrik_besch_t *fabrikbauer_t::get_random_consumer (-1384646780)
Warning: simrand:   fabrikbauer_t::increase_industry_density() (-594738221)
Warning: simrand:   fabrikbauer_t::increase_industry_density() (-1078294607)
Warning: simrand:   fabrik_t::fabrik_t (-112295256)

On the server, the random call with seed no. 1079200276 was a call to koord stadt_t::get_zufallspunkt.

8. On this occasion, the divergence occurred here:

Quote
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (-846909330)
Warning: simrand:   const gebaeude_t *karte_t::get_random_ausflugsziel() const (-1614753566)
Warning: simrand:   const gebaeude_t *karte_t::get_random_ausflugsziel() const (1880193729)
Warning: simrand:   const gebaeude_t *karte_t::get_random_ausflugsziel() const (2103517493)
Warning: simrand:   void verkehrsteilnehmer_t::hop() (669199564)
Warning: simrand:   void verkehrsteilnehmer_t::hop() (-2077928220)
Warning: simrand:   void karte_t::neuer_monat() (107680356)
Warning: simrand:   const fabrik_besch_t *fabrikbauer_t::get_random_consumer (2107780404)
Warning: simrand:   fabrikbauer_t::increase_industry_density() (1265088580)
Warning: simrand:   fabrikbauer_t::increase_industry_density() (909777696)

On the server, the random call with seed no. 107680356 was a call to koord stadt_t::get_zufallspunkt.

9. Could not find any random seeds on the last 64 entries on the server that matched any of the entries in the client log.

10. The divergence occurred in what is becoming a familiar pattern:

Quote
fussgaenger_t::fussgaenger_t (-1429537522)
Warning: simrand:   fussgaenger_t::fussgaenger_t (1824130102)
Warning: simrand:   void stadt_t::step_passagiere() (has private car?) (251936868)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (init) (-851005284)
Warning: simrand:   stadt_t::destination stadt_t::finde_passagier_ziel (town) (-2059737591)
Warning: simrand:   koord stadt_t::get_zufallspunkt (-2089517545)
Warning: karte_t::interactive:   sync_step=59672  server=[rand=2643417478 halt=10 line=2 cnvy=4] client=[rand=2643417478 halt=10 line=2 cnvy=4]
Warning: simrand:   void karte_t::neuer_monat() (-1651549818)
Warning: simrand:   const fabrik_besch_t *fabrikbauer_t::get_random_consumer (-64801254)
Warning: simrand:   fabrikbauer_t::increase_industry_density() (-1355772246)

On the server, the random call with seed no. -1651549818 was a call to koord stadt_t::get_zufallspunkt.


***

There are some patterns here, but it is not entirely clear what they reveal - any thoughts would be appreciated!
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have conducted some furhter research on some simpler maps, and made the following findings. Firstly, I found that the convoy odometer was not synchronised for some reason: the client would gradually drift ahead of the server. The problem appeared to be a rounding error. I have now fixed that issue, but it does not alter the frequency of desyncs.

The problem appears to be that the server runs ahead of the client such that the server executes more steps than the client, with the result that, when the synchronisation comes to be checked, the server has called the random number generator a greater number of times, and hence the seeds are out of synchronisation. This appears to occur only when convoys arrive at halts.

The typical pattern is that, on the client, the logs will show a large number of simrand calls from simcity.cc, which will be synchronised with the server. The server will then show several additional calls to hole_ab(), or to verkehrsteilnehmer_t::verkehrsteilnehmer_t(). Turning pedestrians off results in the additional calls always being to hole_ab().

I find this a little difficult to understand, since, from what I understand of the code, the sequence is something like this:

Check sync
Step
- Convoys
- Towns
- Check clients


In that sequence, "check sync" is the method that the client and server both use to tell whether the client or server is out of synchronisation at that point, and "check clients" is the method where the server checks to see whether the number of clients have reduced.

In each case, it is the client that detects the desync and disconnects. The logs are generated at the "check sync" stage for the client, therefore, and the "check clients" stage for the server. In each case, the difference is the random number seed, e.g.:


server=[rand=2127004698 halt=11 line=4 cnvy=4]
client=[rand=2295517672 halt=11 line=4 cnvy=4]


The odd thing is, however, that, on some occasions, the client's random number seed is not the next number to appear on the server's list of random seeds at each simrand call.

It should be noted that I have set server_frames_between_checks = 1 and server_frames_ahead = 1 in simuconf.tab. I should be grateful for any thoughts that anyone might have as to how to track down the problem from here.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

Quote from: jamespetts on February 19, 2011, 02:48:29 PM
The problem appears to be that the server runs ahead of the client such that the server executes more steps than the client, with the result that, when the synchronisation comes to be checked, the server has called the random number generator a greater number of times, and hence the seeds are out of synchronisation. This appears to occur only when convoys arrive at halts.
In network mode the server and client should perform the same number of sync_steps() before a step() is done. That should be easy to spot and fix.

QuoteThe typical pattern is that, on the client, the logs will show a large number of simrand calls from simcity.cc, which will be synchronised with the server. The server will then show several additional calls to hole_ab(), or to verkehrsteilnehmer_t::verkehrsteilnehmer_t(). Turning pedestrians off results in the additional calls always being to hole_ab().
I guess there is still something wrong with either the convoi moving code or the goods routing (ie not synchronized). You have to analyze at which point the server and clients diverges. That means, if the first diverging simrand-call is in hole_ab, then log all calls to hole_ab, all calls to methods that call hole_ab etc. Log also values of variables that are used in the decisions.

I did this procedure once, to find out that some gui interaction triggered rerouting of goods on the client independently...
Parsley, sage, rosemary, and maggikraut.

jamespetts

Quote from: Dwachs on February 19, 2011, 04:34:01 PM
In network mode the server and client should perform the same number of sync_steps() before a step() is done. That should be easy to spot and fix.

Hmm, I'm not sure of how sync_steps interact here. The tests show that, on this particular map, the desyncs always happen at exactly the same time (when the one and only convoy calls at one of two halts for the second time, after having called at that halt once already, called at the second halt, then returned to the first), and that both client and server call the verkehrsteilnehmer_t::verkehrsteilnehmer_t() method when the convoy arrives at that halt, but it seems to happen that the server calls the method first, the client then checks for synchronisation, finds that the server's random seed is different, disconnects, then calls the method itself. I am not sure what would cause this.

QuoteI guess there is still something wrong with either the convoi moving code or the goods routing (ie not synchronized). You have to analyze at which point the server and clients diverges. That means, if the first diverging simrand-call is in hole_ab, then log all calls to hole_ab, all calls to methods that call hole_ab etc. Log also values of variables that are used in the decisions.

I did this procedure once, to find out that some gui interaction triggered rerouting of goods on the client independently...

Hmm... the convoy moving code has not changed in Experimental (the reversing affects the vehicle graphics orientation, not the moving code), and Knightly has checked the routing code carefully. I suspect that the hole_ab() calls will show the same pattern as the verkehrsteilnehmer_t::verkehrsteilnehmer_t() calls, however.

Edit: Further testing has revealed that something similar does indeed happen with the hole_ab calls: hole_ab gets called on the server, the server detects a disconnexion from the client, then the client calls hole_ab. That test showed something else interesting: the client had, while the server was calling hole_ab, executed another step(), not calling hole_ab (or presumably any of the convoy loading code that ultimately calls hole_ab()), but stepping over the town's passenger generation code. So, it seems that the convoy loading is called at least one step later on clients than the server for reasons that I cannot quite fathom. This doesn't happen every time that a convoy unloads, but is deterministic in that, for any given saved game, the desync will always happen when the convoy arrives at a particular halt.

This is a somewhat confusing problem, since I am not sure what would cause the convoy unloading code to be called in a different step on the server and client, especially since the code for stepping objects has not been changed in Experimental.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

Quote from: jamespetts on February 20, 2011, 01:54:34 PM
Edit: Further testing has revealed that something similar does indeed happen with the hole_ab calls: hole_ab gets called on the server, the server detects a disconnexion from the client, then the client calls hole_ab.
If the server detects a desync of random number calls, then the client did something wrong earlier. The server checks, whether at syncstep X the client has the same state of the random number generator as the server. This syncstep X is already processed both for server and client. If that checks fails, it tells you that server and client did diverge earlier. You have to identify the first instruction that happens to be different.

QuoteThat test showed something else interesting: the client had, while the server was calling hole_ab, executed another step(), not calling hole_ab (or presumably any of the convoy loading code that ultimately calls hole_ab()), but stepping over the town's passenger generation code. So, it seems that the convoy loading is called at least one step later on clients than the server for reasons that I cannot quite fathom. This doesn't happen every time that a convoy unloads, but is deterministic in that, for any given saved game, the desync will always happen when the convoy arrives at a particular halt.

This is a somewhat confusing problem, since I am not sure what would cause the convoy unloading code to be called in a different step on the server and client, especially since the code for stepping objects has not been changed in Experimental.
The point of divergence may be completely unrelated to stepping of convois. It could be some desynchronisation in goods routing, making the convoi waiting longer for full load etc.

I have no other advice that the following:
(1) Log everything related to random numbers
(2) Run until disconnect
(3) Find diverging call to simrand()
(4) Log all calls / variables / decisions that call simrand or it callers. Each iteration of this procedure you have to go one level deeper in the code (log the calls to more functions, log more variables)
(5) Goto (1)
Parsley, sage, rosemary, and maggikraut.

jamespetts

Thank you very much for your advice! I have revealed something interesting with further testing. The immediate point of divergence appears to be in karte_t::interactive(). When convoi_t::ziel_erreicht() is called after the convoy reaches the halt, tracing back up the stack of callers in the MSVC++ debugger, I find the following code in karte_t::interactive which triggers it:

sync_step( fix_ratio_frame_time, true, true );
if(  ++network_frame_count==einstellungen->get_frames_per_step()  ) {
// ever fourth frame
set_random_mode( STEP_RANDOM );
step();
clear_random_mode( STEP_RANDOM );
network_frame_count = 0;
}


convoi_t::ziel_erreicht(), which sets the convoy's state to loading, is called from a sync_step. However, the actual unloading (and the calls which ultimately reach hole_ab()) are done in a step(). When  convoi_t::ziel_erreicht() is called on the karte_t::client, network_frame_count is 0. However, when it is called on the server, network_frame_count is 1. This divergence only happens with the particular stop that causes the problems: I tested for another stop, where no desync occurs, and both client and server reported network_frame_count as 3.

What might cause the divergence in network_frame_count, however, I have yet to fathom. Edit: Is it possible that differing numbers of calls on the server and client to INT_CHECK might cause this?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

The network_frame_count has no deeper meaning as to ensure that server and client perform the same number of sync_step()s between steps().

INT_CHECK should not do any harm, as it should immediately return since 'enabled' is false, ie network games are run with intr_disabled() called before the game starts. This ensures that the sync_steps are done with constant time steps.

I doubt that these to things trigger any desync. Of course you should check whether simutrans-exp calls intr_enable() somewhere during a network game.

In order to help further, can you please show the part of your log for server and client that shows the point of divergence? Please include some lines before and after the divergence.
Parsley, sage, rosemary, and maggikraut.

jamespetts

Thank you very much for your reply. I'm having some trouble with logs, though, as Simutrans-Experimental seems to be generating blank logs at present, which is making it harder to diagnose these things. Is there anywhere in particular in the code that I should look to try to fix that problem? I've not really look into the logfile code before.

Some updated findings from my research. Firstly, the stop based desync that I describe above happens only when the vehicle has had to go around a corner to get to the stop. It does not occur when a vehicle travels entirely in a straight line (including reversing). It does not make any difference whether the vehicle was bought and placed after the network game started or whether it was already there when the client joined, which seems to rule out any problems in the loading/saving code. Cornering speeds are affected by Experimental, so that might well be somewhere to look.

I have identified another (probably separate) desync issue, which occurs at the beginnings of new months. When the game is running in network mode, at the beginning of a month, I notice a long pause on both client and server, but the pause seems to be much longer on the client. When the client unpauses, it desyncs. I notice that, often, an industry chain has been built, sometimes not the same industry chain: most recently, an industry chain was built on the client but not the server. This is not consistent, however, I have noticed sometimes when the same industry chain is built on both and yet there is a desync; sometimes when the same industry chain is built on both, but there is no desync (although the client runs a long way behind the server for a while, but slowly catches up), and one occasion when both client and server built different industry chains. The random caller logs, when they have provided useful information (i.e., where the point of divergence is not too far in the past) seem to show that nuer_monat() is called at an earlier point on the client than the server, although, unlike with the previous issue, network_frame_count is the same, as is sync_steps on both the client and the server.

Edit: I think that I have finally solved the first issue: some of the code in the Simutrans-Experimental section on corner handling defined a number of variables read from the configuration file as static rather than const, meaning that their values when that part of code was executed were not updated to match the configuration data held by the server. Changing the variables from static to const has resulted in the desyncs in question no longer occurring. I have not yet traced the source of the new month desyncs beyond narrowing it down to industry creation: new months rolled past without incident when no new industry was created.

Edit 2: I think that I may have found the problem with the new month/industry issues: the actual_industry_density was not saved, so the value may have been different on the client and on the server. After re-testing it, even large and complicated games now seem to work reliably over several months over the network (at least, over the loopback interface used for testing).
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

Congratulations to find these errors :)

You should ensure that all data / settings / etc that is relevant for the simulation is saved to the savegame (maybe as part of einstellungen_t). Then the client will receive the right data from server.

In case you did not do that: all experimental properties in pak-files / object descriptors (*_besch_t)  that are relevant for simulation needs to be taken into account when calculating pak-checksums.
Parsley, sage, rosemary, and maggikraut.