The International Simutrans Forum

 

Author Topic: Instability on the Bridgewater-Brunel server  (Read 13653 times)

0 Members and 1 Guest are viewing this topic.

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #245 on: January 11, 2019, 02:12:24 AM »
I've added some basic checksums to the checklist - the first one sums the mantissae of convoy speeds, and the second sums the convoy weights. (Actually, it isn't quite over all convoys - I think convoys might be excluded for one sync step after exiting a depot, and maybe some other occasional sync steps). I've pushed this to Github.

The upshot is that this should detect the desync much earlier (hopefully identifying the sync step in which it happens), and perhaps even give us more indication of what is causing it.

EDIT: I don't think it's working properly, so probably best to leave this until I've got it working.
« Last Edit: January 11, 2019, 03:07:24 AM by ACarlotti »

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1326
Re: Instability on the Bridgewater-Brunel server
« Reply #246 on: January 11, 2019, 03:29:20 AM »
Might want to rdwr debug_sum instead of rand   ;)
Also, while the extra checklist items are called rand, you can stick anything in them (well anything uint32). Many are not currently used... and IIRC I intended the extra ones (exr=)  24-31 to be more general purpose to narrow in on a location like you're attempting with convoi speeds and weights. i.e. Didn't really need to add debug_sum.

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #247 on: January 11, 2019, 03:34:57 AM »
You spotted the issue* about the time that I did.

I noticed there were some unused rands, but I thought I'd go with something different to avoid having a misleading variable name - at the moment I think everything in rands is a random number seed recorded at some point, and I think it would be less intuitive to change that.

EDIT:
*Actually, I didn't read what you wrote properly - we actually found similar problems in two different places.

I think I've now pushed a correct version (along with separate desync bug fix that was very helpful in testing).

EDIT 2:
It turns out convoy weight isn't suitable for inclusion in a checksum, because it is updated only when it is needed, which includes displaying it in gui windows. So I've pushed another commit to remove this. (There's also another commit on my master which is just refactoring some loop logic for simplicity.)
« Last Edit: January 11, 2019, 11:53:30 AM by ACarlotti »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #248 on: January 11, 2019, 04:56:23 PM »
Splendid, thank you: now incorporated.

Offline Vladki cz

  • Devotee
  • *
  • Posts: 2715
    • My addons, mostly roadsigns
  • Languages: EN, CS
Re: Instability on the Bridgewater-Brunel server
« Reply #249 on: January 11, 2019, 10:37:06 PM »
Hello. Out of curiosity I have connected to Bridgewater Brunel server. Player "Bay transport Group" was probably using track network of some other player who went bankrupt, and all his stations and bridges were removed, while tracks were left mothballed. So many of his trains are stuck now. However  I have found one funny place: (1861, 3011) - it is a diagonal tunnel entrance. I thought that such thing is impossible. So I just thought if that could not be the cause of the problems?

Offline Rollmaterial fi

  • Devotee
  • *
  • Posts: 572
  • Languages: EN, FR, DE, FI, SE
Re: Instability on the Bridgewater-Brunel server
« Reply #250 on: January 11, 2019, 11:33:49 PM »
I've been using diagonal tunnel entrances quite extensively for quite some time so I doubt they have anything to do with the desync.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #251 on: January 11, 2019, 11:40:41 PM »
Hello. Out of curiosity I have connected to Bridgewater Brunel server. Player "Bay transport Group" was probably using track network of some other player who went bankrupt, and all his stations and bridges were removed, while tracks were left mothballed. So many of his trains are stuck now. However  I have found one funny place: (1861, 3011) - it is a diagonal tunnel entrance. I thought that such thing is impossible. So I just thought if that could not be the cause of the problems?

That seems most unlikely as it does not appear to be connected in any way with the circumstances in which the loss of synchronisation occurs (i.e. a train arriving in a specific direction at a specific location on a line that does not involve passing through such a tunnel).

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #252 on: January 11, 2019, 11:49:23 PM »
So many of his trains are stuck now.
You could try using this save on the server - it has most of the objects in the worlds removed (most of what's left is the cities served by the Northern Frontier Express and the surrounding infrastructure). If it's possible to reproduce the desync on that then it will probably be much easier to look through the logs to debug it, given that there's very little left going on. I've genrally begun by reducing the frequency of the line and sending half the trains to the depot (newly built at Bealdon Rye), to increase passenger loading per train, in case that's relevant, but I don't have a good save for that and haven't managed to trigger this desync.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #253 on: January 11, 2019, 11:52:04 PM »
Interesting. Would it be worth using that save on the server itself, do you think?

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #254 on: January 12, 2019, 12:04:42 AM »
If you can trigger a desync using that save, then it will probably easier to debug because of how much less is going on. The reduced network will reduce passenger loading, which could be relevant; hence my suggestion to reduce the service frequency. And if we can't get a desync after a while, then we can switch back to using a bigger save.

Incidentally, has anyone made any observations on how full the trains are when they desync?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #255 on: January 12, 2019, 07:25:35 AM »
I have not observed that, but I did use special debugging carriages with overcrowding disabled in the current server save.

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #256 on: January 13, 2019, 03:31:02 AM »
With the extra checksum, I am now getting relatively quick desyncs when conencting to the server (within a couple of minutes). These are not obviously linked to activity around St. Mary Beddington or nearby stations. Unfortunately, I have no way of working out the which convoy is out of sync.

I've pushed another checksum to Github which should solve this issue - with both the sum of speeds and the sum (speeds times id), it should usually be possible to derive the convoy's internal id (assuming that only one desyncs at a time).

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #257 on: January 13, 2019, 05:04:06 PM »
Splendid, thank you: now incorporated.

Thank you very much for your work on this.

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #258 on: January 14, 2019, 07:56:02 AM »
RIght, I now have some diagnostics from a desync.

Desync occurred at sync_step 27245, step 6811, and was due to a speed mismatch with convoy 4233:
Code: [Select]
Warning: karte_t:::do_network_world_command:    sync_step=27245  server=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1871189474,1645687888,0,0,0,0,0,0client=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1849259230,3304245548,0,0,0,0,0,0
Looking later in the log I find:
Code: [Select]
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4233) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 6818So it looks like the convoy departed on the server approximately 7 steps earlier than on the client.

Since the convoy was probably stationary on the client when the desync occurred, I can conclude that the speed on the server must have had a mantissa of 4273037052, which makes the speed about 0.5% less than a power of 2 (which may just be a coincidence).

So the question is: why are trains departing from Bickstable Fields at such different times on the client and the server?


Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #259 on: January 14, 2019, 09:45:38 AM »
Thank you for that: that is very helpful and interestingly confirms the hypothesis set out earlier in this thread regarding departure times.

The reason for the differing departure times is not straightforward to deduce. I had hoped that the issue would have related to the overcrowding bug, but that has now been fixed and turned out to be deterministic in nature (an integer underflow), so I suspect that this is not now likely to be the cause.

The moment of departure is calculated in convoi_t::hat_gelhalten(). The relevant part of the code is here:

Code: [Select]

const sint64 now = welt->get_ticks();
   if(arrival_time > now || arrival_time == WAIT_INFINITE)
   {
      // This is a workaround for an odd bug the origin of which is as yet unclear.
      go_on_ticks = WAIT_INFINITE;
      arrival_time = now;
      if (arrival_time < WAIT_INFINITE)
      {
         dbg->error("void convoi_t::hat_gehalten(halthandle_t halt)", "Arrival time is in the future for convoy %u at stop %u", self.get_id(), halt.get_id());
      }
   }
   const sint64 reversing_time = schedule->get_current_entry().reverse > 0 ? (sint64)calc_reverse_delay() : 0ll;
   bool running_late = false;
   sint64 go_on_ticks_waiting = WAIT_INFINITE;
   const sint64 earliest_departure_time = arrival_time + ((sint64)current_loading_time - reversing_time);
   if(go_on_ticks == WAIT_INFINITE)
   {
      if(haltestelle_t::get_halt(get_pos(), get_owner()) != haltestelle_t::get_halt(schedule->get_current_entry().pos, get_owner()))
      {
         // Sometimes, for some reason, the loading method is entered with the wrong schedule entry. Make sure that this does not cause
         // convoys to become stuck trying to get a full load at stops where this is not possible (freight consumers, etc.).
         loading_limit = 0;
      }
      if((!loading_limit || loading_level >= loading_limit) && !wait_for_time)
      {
         // Simple case: do not wait for a full load or a particular time.
         go_on_ticks = std::max(earliest_departure_time, arrival_time);
      }
      else
      {
         // Wait for a % load or a spacing slot.
         sint64 go_on_ticks_spacing = WAIT_INFINITE;

         if(line.is_bound() && schedule->get_spacing() && line->count_convoys())
         {
            // Departures/month
            const sint64 spacing = welt->ticks_per_world_month / (sint64)schedule->get_spacing();
            const sint64 spacing_shift = (sint64)schedule->get_current_entry().spacing_shift * welt->ticks_per_world_month / (sint64)welt->get_settings().get_spacing_shift_divisor();
            const sint64 wait_from_ticks = ((now + reversing_time - spacing_shift) / spacing) * spacing + spacing_shift; // remember, it is integer division
            sint64 queue_pos = halt.is_bound() ? halt->get_queue_pos(self) : 1ll;
            go_on_ticks_spacing = (wait_from_ticks + spacing * queue_pos) - reversing_time;
         }

         if(schedule->get_current_entry().waiting_time_shift > 0)
         {
            // Maximum wait time
            go_on_ticks_waiting = now + (welt->ticks_per_world_month >> (16ll - (sint64)schedule->get_current_entry().waiting_time_shift)) - (sint64)reversing_time;
         }

         if (schedule->get_spacing() && !line.is_bound())
         {
            // Spacing should not be possible without a line, but this can occasionally occur. Without this, the convoy will wait forever.
            go_on_ticks_spacing = earliest_departure_time;
         }

         go_on_ticks = std::min(go_on_ticks_spacing, go_on_ticks_waiting);
         go_on_ticks = std::max(earliest_departure_time, go_on_ticks);
         running_late = wait_for_time && (go_on_ticks_waiting < go_on_ticks_spacing);
         if(running_late)
         {
            go_on_ticks = earliest_departure_time;
         }
      }
   }

   // loading is finished => maybe drive on
   bool can_go = false;

   can_go = loading_level >= loading_limit && (now >= go_on_ticks || !wait_for_time);
   //can_go = can_go || (now >= go_on_ticks_waiting && !wait_for_time); // This is pre-14 August 2016 code
   can_go = can_go || (now >= go_on_ticks && !wait_for_time);
   can_go = can_go || running_late;
   can_go = can_go || no_load;
   can_go = can_go && state != WAITING_FOR_CLEARANCE && state != WAITING_FOR_CLEARANCE_ONE_MONTH && state != WAITING_FOR_CLEARANCE_TWO_MONTHS;
   can_go = can_go && now > earliest_departure_time;
   if(can_go) {

      if(withdraw  &&  (loading_level==0  ||  goods_catg_index.empty())) {
         // destroy when empty
         self_destruct();
         return;
      }

      // add available capacity after loading(!) to statistics
      for (unsigned i = 0; i<vehicle_count; i++) {
         book(get_vehicle(i)->get_cargo_max()-get_vehicle(i)->get_total_cargo(), CONVOI_CAPACITY);
      }

      // Advance schedule
      advance_schedule();
      state = ROUTING_1;
      dbg->message("void convoi_t::hat_gehalten(halthandle_t halt)", "Convoy %s departing from stop %s at step %i", get_name(), halt.is_bound() ? halt->get_name() : "unknown", welt->get_steps());
   }

   // reset the wait_lock
   if(state == ROUTING_1)
   {
      wait_lock = 0;
   }
   else
   {

      if (loading_limit > 0 && !wait_for_time)
      {
         wait_lock = (earliest_departure_time - now) / 2;
      }
      else
      {
         wait_lock = (go_on_ticks - now) / 2;
      }
      // The random extra wait here is designed to avoid processing every convoy at once
      wait_lock += (self.get_id()) % 1024;
      if (wait_lock < 0 )
      {
         wait_lock = 0;
      }
      else if(wait_lock > 8192 && go_on_ticks == WAIT_INFINITE)
      {
         // This is needed because the above calculation (from Standard) produces excessively
         // large numbers on occasions due to the conversion in Extended of certain values
         // (karte_t::ticks and go_on_ticks) to sint64. It would be better ultimately to fix that,
         // but this seems to work for now.
         wait_lock = 8192;
      }
   }


This code works by calculating the go_on_ticks number, comparing this with the current number of ticks and setting the wait_lock to 0 and state to ROUTING_1 if the go_on_ticks value is equal to or less than the current number of ticks. I have just pushed a change to the code which will give the go_on_ticks value in the convoy departure debug message so that this can be compared between client and server. We will then know whether the problem is that the go_on_ticks value is calculated differently, or whether the problem is that a divergence occurs after this number has been calculated.

Edit: One thing that might be worth checking: does the log contain any messages of the following type:
Quote
void convoi_t::hat_gehalten(halthandle_t halt): Arrival time is in the future for convoy 4233 at stop [whatever the ID of Brickstable Fields is]
?
« Last Edit: January 14, 2019, 09:58:43 AM by jamespetts »

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #260 on: January 14, 2019, 01:58:19 PM »
I am now getting desyncs *before* convoys reach Bickstablewood Fields Railway Station. The latest one happened when the train was still on the viaduct about 1km from the edge of the station, and about 99 steps before departure. I am ... surprised.

Code: [Select]
Warning: karte_t:::do_network_world_command:    sync_step=27245  server=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1871189474,1645687888,0,0,0,0,0,0client=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1849259230,3304245548,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4233) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 6818
Convoy 4233 out of sync
Client speed mantissa is 21930244 less than server speed mantissa

Convoys 4225,4223,4221,4220 passed (I believe) while offline.

Convoys 4233,4225,4223,4221,4220 all passed once without desyncs.

Warning: karte_t:::do_network_world_command:    sync_step=175613  server=[ss=175613 st=43903 nfc=1 rand=2729897307 halt=4097 line=1025 cnvy=4097 ssr=3466879092,2729897307,0,0,0,0,0,0 str=931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,2659737695,3466879092
,1858232909,2730814151,931275724 exr=0,0,0,0,0,0,0,0 sums=3985059028,4134635295,0,0,0,0,0,0client=[ss=175613 st=43903 nfc=1 rand=2729897307 halt=4097 line=1025 cnvy=4097 ssr=3466879092,2729897307,0,0,0,0,0,0 str=931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275
724,2659737695,3466879092,1858232909,2730814151,931275724 exr=0,0,0,0,0,0,0,0 sums=3963128784,1498225659,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4233) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 43912
Convoy 4233 out of sync
Client speed mantissa is 21930244 less than server speed mantissa

Convoys 4225,4223 passed while offline.

Warning: karte_t:::do_network_world_command:    sync_step=260259  server=[ss=260259 st=65064 nfc=3 rand=3777191495 halt=4097 line=1025 cnvy=4097 ssr=520123040,3777191495,0,0,0,0,0,0 str=3921714216,3921714216,3921714216,3921714216,3921714216,3227137456,3227137456,3227137456,3227137456,3227137456,3227137456,3227137456,
3993339209,3531713451,331365175,3227137456 exr=0,0,0,0,0,0,0,0 sums=3742272234,3644090924,0,0,0,0,0,0client=[ss=260259 st=65064 nfc=3 rand=3777191495 halt=4097 line=1025 cnvy=4097 ssr=520123040,3777191495,0,0,0,0,0,0 str=3921714216,3921714216,3921714216,3921714216,3921714216,3227137456,3227137456,3227137456,322713745
6,3227137456,3227137456,3227137456,3993339209,3531713451,331365175,3227137456 exr=0,0,0,0,0,0,0,0 sums=3715599885,2729255491,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4221) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 65084
Convoy 4221 out of sync
Client speed mantissa is 26672349 less than server speed mantissa

Warning: karte_t:::do_network_world_command:    sync_step=262259  server=[ss=262259 st=65564 nfc=3 rand=2224400415 halt=4097 line=1025 cnvy=4097 ssr=1010227916,2224400415,0,0,0,0,0,0 str=3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,1451021687
,1867168680,2503084853,234854611,3548066625 exr=0,0,0,0,0,0,0,0 sums=87831369,260559438,0,0,0,0,0,0client=[ss=262259 st=65564 nfc=3 rand=2224400415 halt=4097 line=1025 cnvy=4097 ssr=1010227916,2224400415,0,0,0,0,0,0 str=3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625
,3548066625,3548066625,1451021687,1867168680,2503084853,234854611,3548066625 exr=0,0,0,0,0,0,0,0 sums=63690176,1463940082,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4220) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 65653
Convoy 4220 out of sync
Client speed mantissa is 24141193 less than server speed mantissa
This was well before arrival at Bickstablewood Fields Railway Station!

Convoys 4233,4225,4223,4221,4220 all passed again without desync (last one passed at about 4:30pm).

Convoy 4233 passed again without desync. However, there was a desync shortly before convoy 4225 arrived. It does not seem to be consistent with the speed of only one convoy being out of sync.
Warning: karte_t:::do_network_world_command:    sync_step=399930  server=[ss=399930 st=99982 nfc=2 rand=2054363945 halt=4097 line=1025 cnvy=4097 ssr=585326164,2054363945,0,0,0,0,0,0 str=2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,797446143,9
85621664,199545851,1364821627,2846632917 exr=0,0,0,0,0,0,0,0 sums=1266174254,4190039820,0,0,0,0,0,0client=[ss=399930 st=99982 nfc=2 rand=693739107 halt=4097 line=1025 cnvy=4097 ssr=2343166075,693739107,0,0,0,0,0,0 str=617314065,617314065,617314065,617314065,617314065,617314065,617314065,617314065,617314065,617314065,
617314065,4010552739,2385058720,199545769,1364821586,617314065 exr=0,0,0,0,0,0,0,0 sums=3896737550,310673654,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4225) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 100039

Convoys 4223,4221,4220 approached and passed while I was disconnected.

This is a summary of the two key lines from the logs, and some numbers derived from them. I may edit this to add more later.

EDIT: There seems to be a pattern of desyncs occurring only on every second round trip. It could just be a coincidence.

EDIT: Hmm, last bit of data seems to break even more patterns. The debug sums do not seem to be consistent with a single convoy being out of sync, but it would be very strange for two convoys to go out of sync at the same time. This just gets even more confusing.
« Last Edit: January 14, 2019, 07:40:17 PM by ACarlotti »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #261 on: January 14, 2019, 10:31:47 PM »
I have also noticed the pattern with the loss of synchronisation occurring at every other round trip. Quite what should be causing this is, however, entirely obscure.

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #262 on: January 15, 2019, 01:39:59 AM »
I presume you have access to the server logs - perhaps you could look over them to see what was reported during the desync events above. The most useful thing is probably to find all (relevant) lines relating to convoy departure from Bickstablewood Fields Railway Station to compare the convoy departure times.

EDIT: Some more debugging info, quite useful this time.

I enabled the debug output within #ifdef DEBUG_ACCELERATION, and also added some for sp_soll and sp_hat in convoi_t::sync_step. The conclusion is that my most recent desync happened while the convoy was slowing for the semaphore signal before Bickstablewood Fields station, one tile before the signal cleared to allow the train into the station.

Debug output, filtered for relevant lines:
Code: [Select]
Warning: convoi_t::calc_acceleration 1: 153) at tile 163 next limit of   0 km/h, current speed 113 km/h,   187 steps til brake,  2304 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=97969, sp_hat=94208
Warning: karte_t::interactive: sync_step=624721  chklist=[ss=624721 st=156180 nfc=1 rand=2946203112 halt=4097 line=1025 cnvy=4097 ssr=2263788107,2946203112,0,0,0,0,0,0 str=4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,2263788107,3511651656,732188843,4164642788 exr=0,0,0,0,0,0,0,0 sums=1399152953,439014957,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -72 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=99317, sp_hat=98304
Warning: karte_t::interactive: sync_step=624722  chklist=[ss=624722 st=156180 nfc=2 rand=920875045 halt=4097 line=1025 cnvy=4097 ssr=2946203112,920875045,0,0,0,0,0,0 str=4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,2263788107,3511651656,732188843,4164642788 exr=0,0,0,0,0,0,0,0 sums=1310593905,793669416,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -72 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=96599, sp_hat=94208
Warning: karte_t::interactive: sync_step=624723  chklist=[ss=624723 st=156180 nfc=3 rand=3515893034 halt=4097 line=1025 cnvy=4097 ssr=920875045,3515893034,0,0,0,0,0,0 str=4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,2263788107,3511651656,732188843,4164642788 exr=0,0,0,0,0,0,0,0 sums=4031560277,3499533278,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -75 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98008, sp_hat=94208
Warning: karte_t::interactive: sync_step=624724  chklist=[ss=624724 st=156181 nfc=0 rand=3574004632 halt=4097 line=1025 cnvy=4097 ssr=3515893034,3090030907,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=355198068,323033399,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -75 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=99447, sp_hat=98304
Warning: karte_t::interactive: sync_step=624725  chklist=[ss=624725 st=156181 nfc=1 rand=1144307454 halt=4097 line=1025 cnvy=4097 ssr=3574004632,1144307454,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=8624033,1041624339,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -75 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=96820, sp_hat=94208
Warning: karte_t::interactive: sync_step=624726  chklist=[ss=624726 st=156181 nfc=2 rand=3716921189 halt=4097 line=1025 cnvy=4097 ssr=1144307454,3716921189,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=316715136,1043566946,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -78 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98319, sp_hat=98304
Warning: karte_t::interactive: sync_step=624727  chklist=[ss=624727 st=156181 nfc=3 rand=3675663255 halt=4097 line=1025 cnvy=4097 ssr=3716921189,3675663255,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=2301287915,37053678,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -78 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=95753, sp_hat=94208
Warning: karte_t::interactive: sync_step=624728  chklist=[ss=624728 st=156182 nfc=0 rand=1400799860 halt=4097 line=1025 cnvy=4097 ssr=3675663255,3931085262,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=2896912166,1038970125,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -81 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=97149, sp_hat=94208
Warning: karte_t::interactive: sync_step=624729  chklist=[ss=624729 st=156182 nfc=1 rand=1556871974 halt=4097 line=1025 cnvy=4097 ssr=1400799860,1556871974,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=1050489881,1571753344,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -60 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98087, sp_hat=94208
Warning: karte_t::interactive: sync_step=624730  chklist=[ss=624730 st=156182 nfc=2 rand=3410090894 halt=4097 line=1025 cnvy=4097 ssr=1556871974,3410090894,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=3675369905,3865676177,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 112 km/h,   -40 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98458, sp_hat=98304
Warning: karte_t::interactive: sync_step=624731  chklist=[ss=624731 st=156182 nfc=3 rand=1696428903 halt=4097 line=1025 cnvy=4097 ssr=3410090894,1696428903,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=453163847,2708091143,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 112 km/h,   -17 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=94161, sp_hat=90112
Warning: karte_t::interactive: sync_step=624732  chklist=[ss=624732 st=156183 nfc=0 rand=726217411 halt=4097 line=1025 cnvy=4097 ssr=1696428903,1249187829,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=3759666517,1546734706,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 111 km/h,  -247 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=97588, sp_hat=94208
Warning: karte_t::interactive: sync_step=624733  chklist=[ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2735623852,2582478221,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 110 km/h,  -227 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=96351, sp_hat=94208
Warning: karte_t::interactive: sync_step=624734  chklist=[ss=624734 st=156183 nfc=2 rand=3125117933 halt=4097 line=1025 cnvy=4097 ssr=332093103,3125117933,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=1847295082,4139723,0,0,0,0,0,0

Warning: NWC_CHECK: time difference to server 7920
Message: network_world_command_t::execute: do_command 9 at sync_step 624859 world now at 624734
Warning: karte_t:::do_network_world_command: sync_step=624733  server=[ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2757554096,748478609,0,0,0,0,0,0client=[ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2735623852,2582478221,0,0,0,0,0,0
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 110 km/h,  -201 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=48456, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 109 km/h,  -193 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=49574, sp_hat=49152

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 109 km/h,  -181 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=46453, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 109 km/h,  -170 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=47285, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 108 km/h,  -156 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=48017, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 108 km/h,  -148 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=48610, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 108 km/h,  -136 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=705049, sp_hat=704512

Warning: convoi_t::calc_acceleration 1: 156) at tile 177 next limit of   0 km/h, current speed 103 km/h,  3364 steps til brake,  5120 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=289299, sp_hat=286720
(Convoy 4225 was out of sync. The client speed mantissa was 21930244 less than server speed mantissa)

I enabled the debug output within #ifdef DEBUG_ACCELERATION (for specific convoys only), and also added some for sp_soll and sp_hat in convoi_t::sync_step. The conclusion is that this desync (at around 4am) happened while the convoy was slowing for the semaphore signal before Bickstablewood Fields station, one tile before the signal cleared to allow the train into the station. The nature of the desync is consistent with the signal clearing on the server one hop before it cleared on the client.

« Last Edit: January 15, 2019, 05:30:24 AM by ACarlotti »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #263 on: January 15, 2019, 10:07:14 AM »
Thank you very much for your work on this: this is most interesting. What I was planning to do was to check to see whether the convoy arriving in the future error message was triggered and try to work out what might cause that, but this suggests that this is not relevant to the cause of the issue.

I am currently away from home and do not have access to the server logs; those logs are not generasted at log level 4 in any event (log level 2 or 3 is used instead, I believe, which shows warnings but not messages), but I can check when I get home if this degree of logging would be helpful.

This suggests that the problem is either that there is something divergent between client and server in the signalling code or that the train is leaving the previous station earlier on the server than on the client.

One thing that is interesting to note is that the train slows for a signal ahead of Brickstable Fields station at all: if the signalling is well designed (i.e. with a distant signal far enough in front of the stop signal), the trians should not have to slow for this signal at all, which might suggest a bug with the signalling; but the distant signal may not be placed far enough away, so this slowing may result simply from sub-opimal signalling design.

Does your logging enable us to tell whether the trains departed the station before Brickstable Fields at the same time? If they did, we will need to look intensively at the signalling code itself, although it is not at all clear to me at present how this code might be divergent between client and server (but seemingly in a deterministic way).

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #264 on: January 15, 2019, 11:18:22 AM »
The debug_sums in the checklist would diverge at the previous station if the convoy was departing at different times, since the speeds wouldn't be the same. I am confident that the first mismatch of speeds was on the approach to that signal.

I will note, however, that the symptoms we have experienced suggest that that is not the only place that things can go wrong. Hopefully fixing what goes wrong at that point will fix the problem elsewhere too.

I had a quick look at what seems to be the relevant bit of signalling code, and my first thoughts were that it's sufficiently hard to read that I'd eventually want to refactor it to help me understand it. It seems I might have to dig into it soon anyway (at least on some superficial level).

I think detailed logs from the server would be less useful now that we have evidence of a problem with the signal clearing.

Offline Spenk009

  • *
  • Posts: 251
Re: Instability on the Bridgewater-Brunel server
« Reply #265 on: January 15, 2019, 12:07:00 PM »
Does the convoy resume acceleration immediately after the signal checking completes? If either machine finishes at a different time or the route polling occurs at different times (the polling rate seems random, when vehicles re-route or re-check) the convoy would find itself slightly shifted in time/space/speed.

Without knowing much about the code, do both make the same routing (full reroute vs reserving track ahead)?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #266 on: January 15, 2019, 12:17:37 PM »
Thank you for that: I have enough information now that I can look into this part of the code in detail to try to detect any problems, although this may take some time.

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #267 on: January 15, 2019, 01:30:24 PM »
Does the convoy resume acceleration immediately after the signal checking completes?
I am fairly confident that there is no race condition involving the signal check (which is in hop_check) and the convoy movement, since they are both run as part of the convoy sync_step. If there is a race condition, then it would probably be between the signal check, and something that modifies data used by the signal check (but isn't part of the sync_step).

Quote
the polling rate seems random, when vehicles re-route or re-check
If you're looking at data from a single-player game, then you should note that the step and sync_step durations are handled completely differently in single player and network play.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #268 on: January 15, 2019, 09:26:39 PM »
Looking at the signalling at this location, I immediately notice two linked anomalies. First of all, the area is generally signalled by track circuit block signals, but the last signals before the station approaching from the Bealden Rye end are older absolute block mechanical signals. One of these, at 6331,1260 would presumably be the signal that is the issue in your testing.

Secondly, these signals purport to be connected to a mechanical signalbox at 6345,1259. However, there is no such signalbox: instead, part of Bickstable Fields railway station stands on that tile. One can imagine that a mechanical signalbox once stood there before the station was widened to four platforms and resignalled.

It should not be possible for signals to exist where their signalbox has been destroyed, as the destruction of the signalbox should automatically delete all connected signals. I notice that these signals are on an elevated way, so it is possible that this in some way caused the failure to delete these signals properly.

I have set a breakpoint for the point in the code when the block reserver first engages with the signal at 6331,1260, and I will see if I can work out how these anomalies affect the code.

Edit: Note for internal reference that, in a save running in single player mode, the signal first clears when the train is at 6273,1255.

Edit 2: The signalbox had not been deleted after all: it is still present, underneath the bridge.

Edit 3: The check at 6273,1255 is with route_index at 100 and next_block at 162.

Edit 4: This seems to be consistent in position over a round of trains passing Brickstable Fields.

Edit 5: When running as a server in debug mode, the signal first clears  6273,1255 is with route_index at 100 and next_block at 162 just as in single player mode.


Edit 6: The above is repeatable with multiple passes of trains.


I think that that is as far as I can go to-night before going to bed. I have yet to find any actual anomalies in the running code, but it is noteworthy that the problem occurs with (1) signals placed in a very odd way (it appears as though whoever was playing Bay Transport forgot to delete the old signals when resignalling the station); and (2) a signalbox placed under a single height elevated rail line, which should not be possible.

The signalling system has not been tested for the case of an absolute block combined signal sandwiched between two track circuit block signals each controlled by the same signalbox as each other. However, the signal appears to be working as intended (i.e., as a stand-alone stop signal, the distant aspect not working as the next signal is not of the absolute block type and in any event showing danger as the train is stopping at the station), so there is no immediate clue as to what the issue could be here.
« Last Edit: January 16, 2019, 12:37:26 AM by jamespetts »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #269 on: January 16, 2019, 10:43:16 PM »
Looking now at when the trains first attempt to reserve the wayward semaphore signal, my first result shows a route_index of 74, next_signal of 162, tiles_to_check of 7 and last_index of 176. The train had reserved up to this signal already, however, so I will need to check again with some other trains before being sure of the significance of these figures.

Edit: As is usual with the signalling system, it checks the route a number of times as it progresses. The next check seems to be at route_index 86. The next_signal is at 145 before the block reserver has finished running and 162 afterwards. modified_sighting_distance_tiles and tiles_to_check remain at 7.


Edit 2: The next check is at route_index 100 with the same figures otherwise save that now next_signal is 65530 (i.e., the semaphore has cleared).

Edit 3: The next check is at route_index 140 with tiles_to_check being 5, last_index 176 and the next_signal being still at 145 before running the block reserver and 162 afterwards. At this point, the route is shown reserved into the station.


Edit 4: It then continues to re-check every tile up to and including 144 and then again at 155, with variation only in modified_sighting_distance_tiles. At route_index 155 it is called again, this time with the next_signal having not been reset to 145 before it is called, but being at 65535 (meaning that the train is clear to run to the end of its calculated route without stopping first).


Edit 5: It then repeats this up to and including route_index 161 (one tile before the semaphore signal) with variations in the modified_sighting_distance_tiles.

Edit 6: I should note that this is running locally in server mode.



Edit 7: Re-loading an slightly older saved game where the trains had yet to approach this signal, the first hit appears now to be at route_index 19, with next_signal at 145 before and 162 after running the block_reserver, modified_sighting_distance_tiles at 7, last_index at 176 and tiles_to_check at 6. This is when the reservation up to (but not beyond) the semaphore signal is first made.

Edit 8: The next check seems to be at route_index 74 with the usual 145/162 pattern.

Edit 9: The next check is now at 86.

Edit 10: The next are at 100, 140, 141, 142, 143, 144, 155, 156, 157, 158, 159, 160 and 161: all the same as previously.


Edit 11: For the next train, the checks are at 19, 74, 86, 100, 140, 141, 142, 143, 144, 155, 156, 157, 158, 159, 160 and 161: in other words, the same check points as the previous train.
The next step is to attempt the same exercise with a debug client connected to a local server on the loopback interface, but that will have to await another evening.
« Last Edit: January 16, 2019, 11:43:43 PM by jamespetts »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #270 on: February 24, 2019, 01:21:20 PM »
Turning back to this and comparing my results above with the server logs: the server logs are somewhat misleading because the text for generating the messages is incorrect. The actual log outputs read as follows, e.g.,

Code: [Select]
Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 110 km/h,  -227 steps til brake,  1792 steps til stop

This reads as though the convoy is at "tile 163". This is slightly misleading in that 163 is not a tile in an absolute sense (this would have an x,y co-ordinate), but is rather the current route index (i.e. the 163rd tile of the convoy's current route to its next immediate destination), but is also in error, as the 163 is not, in fact, the current route index at all, but rather the index of the route where the convoy next needs to stop. The 155, which is unlabelled in the current output, is the current route index.

I have modified the debug output to correct this and also to add the actual absolute tile co-ordinates for each step.

In any event, putting that aside, we can see that the loss of synchronisation appears to have occurred at route index 156 (i.e. 155 in the debug output, which deducts 1 from the route index; it is not immediately clear why it does this: I did not write this code) at a time when the next_stop_index is 163.

156 is one of the many different checks and does not seem to be particularly noteworthy. The server logs suggest that route index 155 (recorded as 154 because of the subtraction of 1) is the first route tile at which braking occurs, and a number of furhter steps (each tile has 16 steps) are traversed before the loss of synchronisation recorded by A. Carlotti occurred.

However, it is interesting that the next stop index ("next_stop_index") is 163 in A. Carlotti's logs. The above tests show this to be 162. This is when the game was running in server mode. I cannot immediately find an explanation for this. The next step is to try to calculate whether this is one and the same thing as "The nature of the desync is consistent with the signal clearing on the server one hop before it cleared on the client" as reported by A. Carlotti above.

A hop is not the same as a step: a hop is the transition from one tile to another, whereas there are 16 steps in a tile. This inconsistency of the next_stop_index of 1 tile suggests that the convoys may act on the basis that the signal is in different places between client and server.

However, this may be a false signal: there may be another -1 or +1 somewhere that explains the difference in the figures without showing any actual difference between client and server.

The next step, I think, is to run the test again locally in stand alone mode and see whether I get a next_stop_index of 163 or 162.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #271 on: February 24, 2019, 02:58:32 PM »
Running this on the client, the next_stop_index is recorded as 163 consistently. I note also that the semaphore signal on the bridge clears long before the train enters the bridge on the client.

Edit: The semaphore signal on the bridge also clears long before the train enters the bridge on the server.
« Last Edit: February 24, 2019, 03:13:50 PM by jamespetts »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #272 on: February 24, 2019, 09:51:33 PM »
Running this test again with a debug build running as the server, I am unable to reproduce any next_stop_index in this vicinity other than 163.

Edit: Looking at the earlier posts carefully, it was next_signal that was 162. This is converted to next_stop_index in convoi_t::set_next_stop_index(), which, at line 6685, adds 1 to the number. So, unfortunately, this is not a significant result.
« Last Edit: February 24, 2019, 10:08:54 PM by jamespetts »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #273 on: February 25, 2019, 09:33:50 AM »
Further testing reveals some additional complexity in this issue which is very difficult to track down.

Recent signalling fixes appear to have altered the behaviour of the train approaching the signal at 6331,1260. As briefly noted above, the trains no longer slow down for this signal: instead, they always reserve through from this signal to the station some considerable distance in advance of this signal, when they reach the previous signal a long way before the bridge.

Running a test last night, I could not reproduce the loss of synchronisation with a train departing from St. Mary Beddington (it will be recalled that this is where the loss of synchronisation would always manifest itself, when trains arrived at this station). On the server at present, the trains on this route are bunched together, and it had been the case that, every other time that any of the bunch of trains would pass through St. Mary Beddington, the game would lose synchronisation. Last night, on testing, no loss of synchronisation occurred when any of the bunch of trains passed through St. Mary Beddington on the first occasion after connecting, nor did any loss of synchronisation occur on the first train of the second bunch passing through St. Mary Beddington.

However, leaving the game running overnight demonstrated that it is still not fully stable: a loss of synchronisation had occurred by this morning.

I am trying to analyse and understand the logs posted above by A. Carlotti to try to deduce how one can calculate from these that the issue was the deceleration of the convoy at the signal near Bickstable Fields.

The actual checklists where synchronisation seems to have been lost are as follows; on the server:

Code: [Select]
server= [ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2757554096,748478609,0,0,0,0,0,0

and on the client:

Code: [Select]
client= [ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2735623852,2582478221,0,0,0,0,0,0
The difference lies in the latter part, after "sums=":

Code: [Select]
sums=2757554096,748478609,0,0,0,0,0,0
on the server and

Code: [Select]
sums=2735623852,2582478221,0,0,0,0,0,0
on the client.

These two entries that are non-zero are defined in the code as "ss" and "st", and I think were added by A. Carlotti recently. I have not yet been able to calculate exactly how these variables are populated.

A. Carlotti - are you able to assist with exactly how these work and how you were able to deduce the immediate cause of the loss of synchronisation in your tests above?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #274 on: February 28, 2019, 01:15:48 AM »
Further testing has narrowed this somewhat more, although I am getting some very odd results in some places. There is definitely an inconsistency in the way in which the game handles the signal at 6331,1260 clearing, however.

In particular, with this game, in a Visual Studio debug build, the next convoy to approach the signal at 6331,1260 (being convoy 4233) will, at route_index 100 (long before the train has reached the bridge), reserve the track up to the station and therefore not have to slow in advance of the signal, but with one of the automated builds from the server, this will not happen: the train will not reserve the track beyond that signal to the station until a few tiles before it reaches it, by which time it has slowed down.

I have also noticed anomalies between the release builds: there was one occasion when I was testing a release build when I had waited a long time with the game running for the train to get to the end of its line and return; I had then saved it in just the right place to reproduce this behaviour quickly in the future, and continued running; when I did so, the train reserved early as in the Visual Studio build. However, when I re-loaded the game that I had saved, this time, it reserved late. That saved game is the same as the one to which I link above.

This difference between builds makes this very difficult to debug using conventional methods, as I cannot follow the code path using a debugger to find out why there is ever a situation in which the signal is not cleared when the convoy reaches route_index 100; my test report above, which was based on debug builds, consistently showed the trains clearing the signal well in advance.

This sort of inconsistency of behaviour is consistent with precisely the sort of thing that would cause a loss of synchronisation, but what sort of error could lead to inconsistent behaviour between two different builds (and inconsistent behaviour within one build but not another, and not the only one that I can probe to see what is happening) running with an otherwise identical dataset is complex in the extreme.

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #275 on: February 28, 2019, 03:05:42 AM »
The difference lies in the latter part, after "sums=":

Code: [Select]
sums=2757554096,748478609,0,0,0,0,0,0

on the server and

Code: [Select]
sums=2735623852,2582478221,0,0,0,0,0,0

on the client.

These two entries that are non-zero are defined in the code as "ss" and "st", and I think were added by A. Carlotti recently. I have not yet been able to calculate exactly how these variables are populated.

A. Carlotti - are you able to assist with exactly how these work and how you were able to deduce the immediate cause of the loss of synchronisation in your tests above?

The first number is a sum of speed mantissas (the speed being a floating point number). The second is a sum of (speed mantissa * convoy id). The combination of these two values allows us to identify the particular convoy that desynced, and what the discrepancy is.

I wrote the following Python function to compute these details:
Code: [Select]
def debug(servera,serverb,clienta,clientb):
diffa=clienta-servera
diffb=clientb-serverb
for i in range(10000):
if diffa*i%2**32==diffb%2**32:
print("Convoy", i, "out of sync")
print("Client speed mantissa is", (-diffa)%2**32, "less than server speed mantissa")
print("If client mantissa is 0, then server mantissa is", diffa%2**32)
which gives the results:
Code: [Select]
>>> debug(2757554096,748478609,2735623852,2582478221)
Convoy 4225 out of sync
Client speed mantissa is 21930244 less than server speed mantissa
If client mantissa is 0, then server mantissa is 4273037052

The last line probably isn't particularly helpful as written, though it does effectively give the difference in the opposite direction. Note that if the number in the second line is close to 2**32, then it probably means that the client speed is a bit faster. And if the difference is close to 2**31, then that probably means that the speed is close to a power of two and the client and server mantissa are using different exponents. If testing in a game with convoy ids higher than 10000 then you'd need to increase the range to cover all possible values.

In my testing, it seemed that the discrepancies tended to take one of a small number of values (with one value being particularly common).

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #276 on: February 28, 2019, 10:46:37 AM »
Thank you for the information. This seems potentially consistent with the train slowing for the signal on the server and not on the client. I will have to carry out further extensive testing to narrow down which parts of the code are being not invoked on the optimised release build that are being invoked on the Visual Studio debug build, but that may take some time.
Edit: I have pushed a small change to a part of the code that may be implicated in this, replacing the built-in Simutrans min() with std::min(), as I am aware that the built-in min/max has given trouble in the past. I do not know whether this will help, but it may be worth re-testing with to-morrow's nightly build.
« Last Edit: February 28, 2019, 10:58:49 AM by jamespetts »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #277 on: March 02, 2019, 12:09:29 AM »
Further testing shows that the min/max changes do not affect the incorrect differentiation between debug and release builds regarding when the signal clears. My time is very limited over the next week, so I am not likely to be able to make progress until about this time next week.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #278 on: March 03, 2019, 02:36:59 PM »
I have added some temporary debugging output to aid testing. The divergence appears to occur in rail_vehicle_t::block_reserver() not in rail_vehicle_t::can_enter_tile().

It seems that block_reserver() is called at the same point (at route_index 100) on both debug and release builds; however, on the release build, it returns with a next_signal of 162 reserving only to the semaphore signal on the bridge, whereas on the debug build, it returns with a next_signal of 65530 (a special number denoting that the train need prepare to stop at no signal before the end of its route) and reserves all the way to the end of its route, the far end of the station platform.

I have added debugging output to places in the code in block_reserver() where the reservation might terminate, but the ones to which I have added the code are triggered on debug nor release builds.

Further exploration as to the origin of this divergence is required.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9558
  • Languages: De,EN,JP
Re: Instability on the Bridgewater-Brunel server
« Reply #279 on: March 08, 2019, 01:53:20 PM »
MSVC Debug builds initialise values to non-zero (often 0xCACA or so), while many variables allocated in releases are often either 0 (when freshly allocated) or random values (when taken from the stack).

The other thing is that debug builds often have some padding around arrays (to catch overrun), while this maz lack in release builds. So mazbe some structure overrun. Just some input for further thoughs.