News:

Use the "Forum Search"
It may help you to find anything in the forum ;).

Instability on the Bridgewater-Brunel server

Started by DrSuperGood, September 06, 2018, 03:21:35 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

jamespetts

Thank you for your thoughts: that is helpful.

The current pattern of different behaviour seems to be deterministic (i.e., it always behaves in one way in the release build and always another way in the debug build), which tends to suggest that it is probably not an uninitialised variable or writing beyond the end of an array; I am continuing to narrow down in the code where the divergence occurs, but this is a very slow process since I cannot use a debugger to check both builds.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Further testing shows that the anomaly seems to be related to the treatment of combined signals and whether they are added to the list of pre-signals. More testing with updated testing code will be necessary before this can be narrowed down further.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

Quote from: jamespetts on March 08, 2019, 11:56:08 PM
Further testing shows that the anomaly seems to be related to the treatment of combined signals and whether they are added to the list of pre-signals. More testing with updated testing code will be necessary before this can be narrowed down further.
Would it be helpful if I try using combined signals on the other server game?

jamespetts

I think at this stage the best thing to do is for me to track down the divergence with the existing known setup, but thank you for offering.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have now narrowed down the release/debug divergence as first occurring at this line in simvehicle.cc:


if(sb && sb->can_add_signal(signal) && !directional_only)


On the debug build, this line evaluates to true at the critical point, whereas on the release build, it evaluates to false.

This is very odd, as there is nothing that would appear to be the sort of thing there that is the sort of thing that one would expect to behave differently between builds.

There are three elements to this: whether the variable "sb" (a pointer to a signalbox) is NULL or not, whether sb->can_add_signal(signal) is true, and whether directional_only is false.

We can ignore the latter, as this will always be false in this case (which we can confirm by noting that the reservation colour is red and not blue on both builds).

The sb pointer comes from the following code a few lines up:


const signalbox_t* sb = NULL;
const grund_t* gr_signalbox = welt->lookup(signal->get_signalbox());
if(gr_signalbox)
{
const gebaeude_t* gb = gr_signalbox->get_building();
if(gb && gb->get_tile()->get_desc()->is_signalbox())
{
sb = (signalbox_t*)gb;
}
}


The can_add_signal function is as follows:


bool signalbox_t::can_add_signal(const signal_t* s) const
{
if(!s || (s->get_owner() != get_owner()))
{
return false;
}

return can_add_signal(s->get_desc());
}


The last call is not recursive, but calls an overloaded version, which is here:


bool signalbox_t::can_add_signal(const roadsign_desc_t* d) const
{
uint32 group = d->get_signal_group();

if(group) // A signal with a group of 0 needs no signalbox and does not work with signalboxes
{
uint32 my_groups = get_first_tile()->get_tile()->get_desc()->get_clusters();
if(my_groups & group)
{
// The signals form part of a matching group: allow addition
return true;
}
}
return false;
}


None of those functions look suspicious for a loss of synchronisaiton between different builds (or between different instances of the same build, as on the server game), so it is rather perplexing how this arises.

My next test is to add debugging output to the release build to show whether sb is true and, if it is, what its co-ordinates are on the release build, which should narrow down which one of these two pieces of code is the culprit.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

Just a logical question? Why does simvehicle, check if some signal can be added to some signalbox? That should be checked only when signal is built or reconnected to new signalbox.

jamespetts

#286
The function of this test is to check whether a combined signal is close enough and compatible with the next signalbox: the idea is that a combined signal must be able to be controlled by both the signalbox to which it is attached and the next signalbox along the line.




In any event, I have narrowed this down further: in the release build, the "sb" variable is NULL, which is why the line


if(sb && sb->can_add_signal(signal) && !directional_only)


evaluates to false.

The divergence, therefore, must arise in the following code, although I must confess I am currently struggling to see how:


const signalbox_t* sb = NULL;
const grund_t* gr_signalbox = welt->lookup(signal->get_signalbox());
if(gr_signalbox)
{
  const gebaeude_t* gb = gr_signalbox->get_building();
  if(gb && gb->get_tile()->get_desc()->is_signalbox()) 
  {   
   sb = (signalbox_t*)gb; 
  }
}


Edit: This line refers to the signalbox at 6345,1259 which is the signalbox controlling the signal on the bridge. On the release build, clicking on this signal reveals that it has a signalbox of "none" (which is incorrect), whereas clicking on this signal on the debug version correctly shows it connected to this signalbox.

Edit 2: Further testing reveals some more details, some very odd. The problem appears to be in this line, which is replicated in the UI output described above:


const gebaeude_t* gb = gr_signalbox->get_building();


The coordinate for the signalbox is correctly saved and loaded, but, on some occasions, this succeeds and on some occasions this fails to return a building.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ranran(retired)

Is it related whether signalbox is multiple tiles?
ひめしという日本人が開発者達の助言を無視して自分好みの機能をextendedに"強引に"実装し、
コードをぐちゃぐちゃにしてメンテナンスを困難にし(とりわけ道路と建物関連)、
挙句にバグを大量に埋め込み、それを知らんぷりして放置し(隠居するなどと言って)別のところに逃げ隠れて自分のフォーク(OTRP)は開発を続けている
その事実と彼の無責任さに日本人プレイヤーは目を向けるべき。らんらんはそれでやる気をなくした(´・ω・`)
他人の振り見て我が振り直せ。ひめしのようにならないために、らんらんが生み出したバグや問題は自分で修正しなくちゃね(´・ω・`)

jamespetts

The issue appears to be that the signalbox in question is actually beneath a viaduct and on that viaduct is a station platform. The code for finding the building is:


signalbox_t* grund_t::get_signalbox() const
{
   return dynamic_cast<signalbox_t *>(first_obj());
}

gebaeude_t *grund_t::get_building() const
{
   gebaeude_t *gb = find<gebaeude_t>();
   if (gb) {
      return gb;
   }

   gb = get_signalbox();
   if(gb)
   {
      return gb;
   }

   return get_depot();
}


The critical section appears to be:


return dynamic_cast<signalbox_t *>(first_obj());


This line was copied exactly from the get_depot() method from Standard:


depot_t* grund_t::get_depot() const
{
   return dynamic_cast<depot_t *>(first_obj());
}


In each case, what appears to happen is that the object that is returned is the first object in the object list. However, the sequence in which objects are stored in the object list appears to be indeterministic in release builds. So, if there is more than one object on the tile, it is indeterministic whether get_signalbox() will, in fact, return a signalbox or not, even if one of those objects is a signalbox. I wonder whether the same problem may apply to depots.

I will experiment with changing the code thus:


signalbox_t* grund_t::get_signalbox() const
{
   return dynamic_cast<signalbox_t *>(suche_obj(obj_t::signalbox));
}


although I worry that this might affect performance.
Edit: To answer Ranran's question, all of the signalboxes in question are on single tiles, so I do not believe that this is relevant.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Initial tests of this fix are promising: the modified/fixed vesrion is now running on the Bridgewater-Brunel server, and I am logged in running a stability test. Feel free to login to join in the testing.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Phystam

#290
I am now in the server. I will try to keep connecting the server during night.

edit:
The desync has happened during night.

jamespetts

Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have been connected to the server constantly since my  last post on this thread with no loss of synchronisation.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ves

This is great news, James!
I will attempt to keep logged on for as long as possible too.

Ves

Great news James, had not a single dissync for 3,5 hours!

Skickat från min ONEPLUS A6003 via Tapatalk


Phystam

Is the server revision #8076566?
I think the latest revision is #fff2765...

Anyway, I could connect at least for 1 hour, without desync.

jamespetts

I have left my computer running for many hours whilst I was in the shed doing something else, and it had lost synchronisation on my return. However, it had been connected successfully for a very long time.

I should note, incidentally, that you will all lose synchronisation at around 0600h GMT when the server resets for its nightly rebuild.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Phystam

#297
Thank you for clarifying this.
Now 2 hours, still connected.


edit:
3 hours. very stable.


edit2:
4 hours.

jamespetts

Thank you all very much for your testing, that is most helpful. I think that it is now safe to conclude that this particular problem has been solved. I will now revert the server to its status before the problem occurred. I suggest that people re-set their passwords swiftly after I do this. Once this has been done, I will post in the server's main thread.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Phystam

Thank you, now no desync is occured after resetting of the server!

ACarlotti

I discovered the underlying bug, which was causing signalboxes to be sorted in objlist with an inconsistent priority (due to undefined behaviour). The fix is discussed here.