The International Simutrans Forum

 

Author Topic: Instability on the Bridgewater-Brunel server  (Read 13704 times)

0 Members and 1 Guest are viewing this topic.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #280 on: March 08, 2019, 02:32:16 PM »
Thank you for your thoughts: that is helpful.

The current pattern of different behaviour seems to be deterministic (i.e., it always behaves in one way in the release build and always another way in the debug build), which tends to suggest that it is probably not an uninitialised variable or writing beyond the end of an array; I am continuing to narrow down in the code where the divergence occurs, but this is a very slow process since I cannot use a debugger to check both builds.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #281 on: March 08, 2019, 11:56:08 PM »
Further testing shows that the anomaly seems to be related to the treatment of combined signals and whether they are added to the list of pre-signals. More testing with updated testing code will be necessary before this can be narrowed down further.

Offline Vladki cz

  • Devotee
  • *
  • Posts: 2716
    • My addons, mostly roadsigns
  • Languages: EN, CS
Re: Instability on the Bridgewater-Brunel server
« Reply #282 on: March 09, 2019, 10:19:16 AM »
Further testing shows that the anomaly seems to be related to the treatment of combined signals and whether they are added to the list of pre-signals. More testing with updated testing code will be necessary before this can be narrowed down further.
Would it be helpful if I try using combined signals on the other server game?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #283 on: March 09, 2019, 11:43:10 AM »
I think at this stage the best thing to do is for me to track down the divergence with the existing known setup, but thank you for offering.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #284 on: March 09, 2019, 12:08:02 PM »
I have now narrowed down the release/debug divergence as first occurring at this line in simvehicle.cc:

Code: [Select]
if(sb && sb->can_add_signal(signal) && !directional_only)

On the debug build, this line evaluates to true at the critical point, whereas on the release build, it evaluates to false.

This is very odd, as there is nothing that would appear to be the sort of thing there that is the sort of thing that one would expect to behave differently between builds.

There are three elements to this: whether the variable "sb" (a pointer to a signalbox) is NULL or not, whether sb->can_add_signal(signal) is true, and whether directional_only is false.

We can ignore the latter, as this will always be false in this case (which we can confirm by noting that the reservation colour is red and not blue on both builds).

The sb pointer comes from the following code a few lines up:

Code: [Select]
const signalbox_t* sb = NULL;
const grund_t* gr_signalbox = welt->lookup(signal->get_signalbox());
if(gr_signalbox)
{
const gebaeude_t* gb = gr_signalbox->get_building();
if(gb && gb->get_tile()->get_desc()->is_signalbox())
{
sb = (signalbox_t*)gb;
}
}

The can_add_signal function is as follows:

Code: [Select]
bool signalbox_t::can_add_signal(const signal_t* s) const
{
if(!s || (s->get_owner() != get_owner()))
{
return false;
}

return can_add_signal(s->get_desc());
}

The last call is not recursive, but calls an overloaded version, which is here:

Code: [Select]
bool signalbox_t::can_add_signal(const roadsign_desc_t* d) const
{
uint32 group = d->get_signal_group();

if(group) // A signal with a group of 0 needs no signalbox and does not work with signalboxes
{
uint32 my_groups = get_first_tile()->get_tile()->get_desc()->get_clusters();
if(my_groups & group)
{
// The signals form part of a matching group: allow addition
return true;
}
}
return false;
}

None of those functions look suspicious for a loss of synchronisaiton between different builds (or between different instances of the same build, as on the server game), so it is rather perplexing how this arises.

My next test is to add debugging output to the release build to show whether sb is true and, if it is, what its co-ordinates are on the release build, which should narrow down which one of these two pieces of code is the culprit.

Offline Vladki cz

  • Devotee
  • *
  • Posts: 2716
    • My addons, mostly roadsigns
  • Languages: EN, CS
Re: Instability on the Bridgewater-Brunel server
« Reply #285 on: March 09, 2019, 01:24:30 PM »
Just a logical question? Why does simvehicle, check if some signal can be added to some signalbox? That should be checked only when signal is built or reconnected to new signalbox.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #286 on: March 09, 2019, 01:31:33 PM »
The function of this test is to check whether a combined signal is close enough and compatible with the next signalbox: the idea is that a combined signal must be able to be controlled by both the signalbox to which it is attached and the next signalbox along the line.



In any event, I have narrowed this down further: in the release build, the "sb" variable is NULL, which is why the line

Code: [Select]
if(sb && sb->can_add_signal(signal) && !directional_only)

evaluates to false.

The divergence, therefore, must arise in the following code, although I must confess I am currently struggling to see how:

Code: [Select]
const signalbox_t* sb = NULL;
const grund_t* gr_signalbox = welt->lookup(signal->get_signalbox());
if(gr_signalbox)
{
  const gebaeude_t* gb = gr_signalbox->get_building();
  if(gb && gb->get_tile()->get_desc()->is_signalbox()) 
  {   
   sb = (signalbox_t*)gb; 
  }
}

Edit: This line refers to the signalbox at 6345,1259 which is the signalbox controlling the signal on the bridge. On the release build, clicking on this signal reveals that it has a signalbox of "none" (which is incorrect), whereas clicking on this signal on the debug version correctly shows it connected to this signalbox.

Edit 2: Further testing reveals some more details, some very odd. The problem appears to be in this line, which is replicated in the UI output described above:

Code: [Select]
const gebaeude_t* gb = gr_signalbox->get_building();

The coordinate for the signalbox is correctly saved and loaded, but, on some occasions, this succeeds and on some occasions this fails to return a building.
« Last Edit: March 09, 2019, 02:25:36 PM by jamespetts »

Offline Ranran jp

  • *
  • Posts: 483
  • Languages: ja
Re: Instability on the Bridgewater-Brunel server
« Reply #287 on: March 09, 2019, 02:31:37 PM »
Is it related whether signalbox is multiple tiles?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #288 on: March 09, 2019, 02:35:59 PM »
The issue appears to be that the signalbox in question is actually beneath a viaduct and on that viaduct is a station platform. The code for finding the building is:

Code: [Select]
signalbox_t* grund_t::get_signalbox() const
{
   return dynamic_cast<signalbox_t *>(first_obj());
}

gebaeude_t *grund_t::get_building() const
{
   gebaeude_t *gb = find<gebaeude_t>();
   if (gb) {
      return gb;
   }

   gb = get_signalbox();
   if(gb)
   {
      return gb;
   }

   return get_depot();
}

The critical section appears to be:

Code: [Select]
return dynamic_cast<signalbox_t *>(first_obj());

This line was copied exactly from the get_depot() method from Standard:

Code: [Select]
depot_t* grund_t::get_depot() const
{
   return dynamic_cast<depot_t *>(first_obj());
}

In each case, what appears to happen is that the object that is returned is the first object in the object list. However, the sequence in which objects are stored in the object list appears to be indeterministic in release builds. So, if there is more than one object on the tile, it is indeterministic whether get_signalbox() will, in fact, return a signalbox or not, even if one of those objects is a signalbox. I wonder whether the same problem may apply to depots.

I will experiment with changing the code thus:

Code: [Select]
signalbox_t* grund_t::get_signalbox() const
{
   return dynamic_cast<signalbox_t *>(suche_obj(obj_t::signalbox));
}

although I worry that this might affect performance.
Edit: To answer Ranran's question, all of the signalboxes in question are on single tiles, so I do not believe that this is relevant.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #289 on: March 09, 2019, 03:40:37 PM »
Initial tests of this fix are promising: the modified/fixed vesrion is now running on the Bridgewater-Brunel server, and I am logged in running a stability test. Feel free to login to join in the testing.

Offline Phystam jp

  • *
  • Posts: 248
  • Pak256.Ex developer
  • Languages: JP, EN, EO
Re: Instability on the Bridgewater-Brunel server
« Reply #290 on: March 09, 2019, 05:02:03 PM »
I am now in the server. I will try to keep connecting the server during night.

edit:
The desync has happened during night.
« Last Edit: March 10, 2019, 05:05:36 AM by Phystam »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #291 on: March 09, 2019, 05:38:37 PM »
Splendid, thank you.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #292 on: March 09, 2019, 05:54:53 PM »
I have been connected to the server constantly since my  last post on this thread with no loss of synchronisation.

Offline Ves

  • Devotee
  • *
  • Posts: 1677
  • Languages: EN, SV, DK
Re: Instability on the Bridgewater-Brunel server
« Reply #293 on: March 09, 2019, 08:17:45 PM »
This is great news, James!
I will attempt to keep logged on for as long as possible too.

Offline Ves

  • Devotee
  • *
  • Posts: 1677
  • Languages: EN, SV, DK
Re: Instability on the Bridgewater-Brunel server
« Reply #294 on: March 10, 2019, 12:01:10 AM »
Great news James, had not a single dissync for 3,5 hours!

Skickat från min ONEPLUS A6003 via Tapatalk


Offline Phystam jp

  • *
  • Posts: 248
  • Pak256.Ex developer
  • Languages: JP, EN, EO
Re: Instability on the Bridgewater-Brunel server
« Reply #295 on: March 10, 2019, 02:20:58 AM »
Is the server revision #8076566?
I think the latest revision is #fff2765...

Anyway, I could connect at least for 1 hour, without desync.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #296 on: March 10, 2019, 02:49:36 AM »
I have left my computer running for many hours whilst I was in the shed doing something else, and it had lost synchronisation on my return. However, it had been connected successfully for a very long time.

I should note, incidentally, that you will all lose synchronisation at around 0600h GMT when the server resets for its nightly rebuild.

Offline Phystam jp

  • *
  • Posts: 248
  • Pak256.Ex developer
  • Languages: JP, EN, EO
Re: Instability on the Bridgewater-Brunel server
« Reply #297 on: March 10, 2019, 03:32:53 AM »
Thank you for clarifying this.
Now 2 hours, still connected.

edit:
3 hours. very stable.

edit2:
4 hours.
« Last Edit: March 10, 2019, 05:24:58 AM by Phystam »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18745
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Instability on the Bridgewater-Brunel server
« Reply #298 on: March 10, 2019, 10:57:07 AM »
Thank you all very much for your testing, that is most helpful. I think that it is now safe to conclude that this particular problem has been solved. I will now revert the server to its status before the problem occurred. I suggest that people re-set their passwords swiftly after I do this. Once this has been done, I will post in the server's main thread.

Offline Phystam jp

  • *
  • Posts: 248
  • Pak256.Ex developer
  • Languages: JP, EN, EO
Re: Instability on the Bridgewater-Brunel server
« Reply #299 on: March 10, 2019, 11:06:38 AM »
Thank you, now no desync is occured after resetting of the server!

Offline ACarlotti

  • *
  • Posts: 483
Re: Instability on the Bridgewater-Brunel server
« Reply #300 on: April 23, 2019, 01:11:49 AM »
I discovered the underlying bug, which was causing signalboxes to be sorted in objlist with an inconsistent priority (due to undefined behaviour). The fix is discussed here.