The International Simutrans Forum

 

Author Topic: Server in unstable state (Pure Virtual Function Call Fatal Error) [11.34]  (Read 4522 times)

0 Members and 1 Guest are viewing this topic.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2732
  • Languages: EN
Trying to play on the multiplayer server is throwing a fatal "Pure Virtual Function Call" C++ Runtime Library error after a minute odd. Obviously this crash affects both client and server as all progression made during this time is lost on re-join.

I cannot get anything sensible out of the log next to synchronization info.

I have no idea what your Linux server is doing but obviously the Windows client will be throwing R6025 Visual C++ error.

The cause of this is described by MS as the following...
Quote
This error occurs when your application indirectly calls a pure virtual member function in a context where a call to the function is not valid. In most cases, the compiler detects this and reports the error when building the application. But depending on how your code is written, sometimes the problem is detected only at run-time.

 The error occurs when you indirectly call a pure virtual function within the context of a call to the constructor or destructor of an abstract base class. This is illustrated in the sample code below, along with some more description.

I need a debugger in order to find out where the virtual function call is occurring. I will see if I can get experimental compiling locally for this.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18753
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Thank you for the report: that us helpful. I have been rather preoccupied of late with finding somewhere to live, but will look into this when I get a chance. If you can make any progress in the meantime, that would best helpful.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9585
  • Languages: De,EN,JP
This is pointing often to a wrong implementation, i.e. there is an object with a virtual function but with no type, i.e. new *obj_t() with somewhere in the definition virtual char *remove( spieler_t *) = NULL;

Usually you define then a a derived type class  house_t(): public obj_t(); Then all things are fine. However, if this class have a destructor, which is not virtual, and you delete a house_t, then there might be an object remain without a type. Similar during destruction you are not allowed to call virtual functions, as they may be want to access stuff already deleted. I would rather expect something aroud this.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2732
  • Languages: EN
Interesting that Simutrans experimental (not sure about standard) uses OpenTTD as an additional library directory...
Also it would be nice if each build could inherit from a single build version instead of having them separate. I do not have 110_xp so I have to manually change each build to 100.

Have builds working (I think) and they seem to reproduce the crash offline as well. Will see if I can pull any light on the cause.

Ok assuming the debug build is crashing at the exact same spot as the normal build (which below sounds like it is)...

The error seems to be coming from "karte_t::sync_way_eyecandy_step", the one loop is calling "if(!ss->sync_step(delta_t)) {" on ss where sync_step is a virtual method of sync_steppable (the type of ss). The virtual function pointer of ss is pointing at 0xfeeefeee (looks like garbage? Probably the object has been destroyed?) thus raising a virtual memory access violation exception which will eventually throw a 6025 virtual function call error.

The looks of what the loop covers (sync_way_eyecandy_list) means this is probably related to something visual with way types going wrong.

The list seems filled with such invalid objects.

The comment provided for why the function is called...
Quote
      /* pedestrians do not require exact sync and are added/removed frequently
       * => they are now in a hastable!
       */

EDIT

I can 100% confirm that this is the problem. Through some real-time hackery I managed to get the game back working past the point which it crashes. The hack performed was to discard the entire list contents during the swap procedure (since it is only graphic related, it should not be critical for progression). After discarding the list the game can progress although with visual bugs (smoke and things left standing still forever). All new smoke continues to flow and move like expected so this error has a very low chance of occurring.

Obviously this is not a solution, only the start of problem identification. I will now need to look why the list contains invalid objects. I am guessing these objects were destroyed at an inappropriate time maybe? Or could it be random memory corruption from something else?

Further investigation shows that only two types of object should ever be added to this list. "Abgas" (German for emissions or exhaust fumes) from vehicles such as steam engines and accelerating diesel engines and factory smoke from factories. Both of these use wolke (German for cloud).

Both of these seem to have correct destruction procedure with no ability to be in the list and destroyed at the same time. This makes me concerned about possible memory corruption.

Further tests show that programmatically removing all smoke from trains and factories (I commented them out...) stops this crash. However at the time when it crashes there is noticeably poor performance. It really looks like an obscure bug and I am more and more convinced it may not have anything to do with this system (something else is corrupting it).

Yet more investigation shows that it is caused by the vehicle smoke. Factory smoke can be enabled and the crash never occurs. However this may be due to disparity in numbers (I am pretty sure there are a lot more vehicle smoke particles on the map than factory smoke). Again it happens as the game enters a demanding state which is more and more convincing of some sort of memory corruption error. This does raise a optimization question why convoys not in sight are making smoke? But that is best for another day.

« Last Edit: June 23, 2014, 06:15:18 PM by DrSuperGood »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18753
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Dr. Supergood - thank you very much indeed for all your work on this: it is very much appreciated, especially as my time available for Simutrans is very limited at present. A pointer to 0xfeeefeee in MSVC almost certainly means that the object to which it is pointing has been deleted. The odd thing is that this system itself, relating to smoke and "eye candy" generally, has not changed at all in Experimental from Standard. Either, therefore, this is a bug that exists also in Standard, or there is a bug elsewhere in Experimental causing memory corruption, although a bug causing generalised memory corruption would not be so exactly repeatable.

As to the factory smoke, we cannot rule that out, as I believe that Pak128.Britain does not have any graphics for factory smoke, so the problem might be smoke more generally, although in this instance it must be triggered by vehicle smoke from a railway engine or steamship.

As to the optimisation question: I cannot answer that, I am afraid, as that is more an issue for the Standard developers.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2732
  • Languages: EN
A large number of bytes around the target address are set to 0xfeee (this exact same pattern repeating). The size of these blocks is variable and pretty randomly distributed. There are about 10-15 cases in the way eye candy list which are all spread apart over a variety of addresses. Further more there appears to be no remainder of "memory garbage" around those addresses which one would expect for freshly de-allocated structures (you should see old structure fields). Or are you telling me that computers actually waste time writing 16 bit patterns across freed memory? It cannot be that it only does it to virtual function calls as there would still be sensible data around the pointer. The size of these blocks are too small to be pages.

Unfortunately this is looking more and more like some random memory corruption. Especially if you tie in the fact that when it occurs there is a noticeable increase in processor load (frames are dropped) it is likely some complex periodic thing has a problem. However on the other hand one would imagine problems sprouting up all over the place if this was the case. Possibly it is an object allocated at the same time as clouds?

When the cloud is destroyed it should be removed from the way eye candy list. The only way a destroyed cloud can exist in the list is if it was in the list multiple times so is destroyed by the time the iterator passes it a second time. This seems pretty unlikely but may require further investigation.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9585
  • Languages: De,EN,JP
0xFeeefeee is used by MSVC to mark an invalid address. All non-initialized pointer will return these addresses in debug builds. Yes it writes that there in order for debugging.

The problem is a wrong deletion in in loop, i.e. an object deletes itself while the list is still iterating. If you manage to delete smoke outside the sync_step (i.e. by clearing a tile) then it will be already deleted (i.e. marked with 0xfeeefeee). The only case I see is if a too old cloud is deleted. (The if(zeit==2499) since this value will be only set in the sync_step() call.)

Also this stuff should be in a list, a hashtable is 20% slower (due to slow inserting/deletion). Also it works in standard, so I fail to see why it does not in experimental.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2732
  • Languages: EN
Quote
Also it works in standard, so I fail to see why it does not in experimental.
It has worked in experimental as well... Until now...

The reason a hashtable is being used is to allow more efficient list mergers. Removal of clouds is done twice per sync step by removing elements from a delete list. If a normal vector list was used it would require O(n) per element to remove to search it and find the elements. A hashtable solves this much faster as it will always have about O(1) complexity per element to remove. The reason this is being done is because immediate removal during list iteration would break the loop and in any case would require a O(n) search due to how it is related with the list.

That said there is probably a reasonable optimization to be had in this area simply by chaining clouds together as linked list nodes (inherit from a doubly linked list node). That way insertion to the list and removal from the list could be done both in O(1) time very simply and there would be no copies and merger operations. Let us not forget by using small dissociated nodes it would avoid the complexity of having to manage and move a largish block of memory (in this case it contains over 30,000 elements, several pages of memory). Would this make a difference? Probably not as there are bigger optimizations to be had. The best would be making smoke completely asynchronous and only created when you are viewing a convoy, that way instead of the game engine having to process 30,000 elements continuously, it would only have to deal with 100 at most and that would need a wide view of a big railway junction.

Quote
The problem is a wrong deletion in in loop, i.e. an object deletes itself while the list is still iterating. If you manage to delete smoke outside the sync_step (i.e. by clearing a tile) then it will be already deleted (i.e. marked with 0xfeeefeee). The only case I see is if a too old cloud is deleted. (The if(zeit==2499) since this value will be only set in the sync_step() call.)
This cannot happen as far as I can tell. Since Simutrans is single threaded one can assume that while it is stepping the way eye-candy other sources not related to the way eye-candy step cannot delete way eye-candy in parallel. When the destructor is called for way eye-candy it adds it to the remove list which is emptied twice per step, once before stepping and once after stepping. The step loop itself destroys the way eye-candy in response to it returning a certain value and excludes it from the list through the copy and swap operation it performs during iteration.

That said is the hashtable completely correct? How many people have run games with 30,000 clouds? Probably an area to look into.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18753
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Do we mean vehicle smoke by "clouds" or random moving objects? If the latter, this cannot be affecting these crashes, as the pakset does not have them.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2732
  • Languages: EN
The type of object is a wolke (which is German for Cloud). If the game was writing in English the object would be called cloud.

The objects in the way eye candy list are created from two sources (as far as I can tell).
1. Convoys. This is referred to internally as "Abgas" (German for emissions / exhaust fumes). Steam engines produce this mostly as the smoke rising from them. Diesels also produce it under high acceleration conditions.
2. Factory emissions referred to internally as smoke. The idea is that coal power stations and other infamous smoke producers can emit them.

Obviously we have a lot of case 1 going on in the game seeing how we are in the age of steam and people have massive passenger networks.

Also it appears that no hashtable is used. Instead it literally does perform a O(n) removal search for every element specified for removal. This is why it is doing copy then swap for removing way eye candy.

After further investigation I do not think it is because of a wolke being terminated pre-maturely. I added a debug break statement in the destructor for the case of it being terminated without fully decaying and the results were that the game still crashes with the same error.

It crashes exactly on the April -> May transition at around 20-25 seconds past the new month. This again points towards object corruption caused by something that runs monthly.
« Last Edit: June 24, 2014, 01:28:13 AM by DrSuperGood »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9585
  • Languages: De,EN,JP
wolke are removed during the sync_step in the iterator. Hence it cost almost nothing to remove from euither list or hastable, since we can use the iterator directly. But it takes much more time to iterate a hashtable, compared to a list. And iteration is the most action.

But you are right, way_eyecandy is still a hastable in standard. Should also be a list like way_eyecandy.

In standard this part was debugged and worked on a lot, since it is the part that cost most performance after display and routing. I had a game with 8000 buses for profiling; believe me, they do a lot of clouds. Also the hashtable had handled all objects before we broke them into three classes to cater for frequent deletion (hashtables), many entry extits (lists) or fast iterating (vectors). But anyway, have a look. The hashtable implementation is short. It could be only the erase iterator in question, since the other parts cannot cause corruptions.

Even if there are too many objects on a tile, the clouds will not be added. gr->obj_add fails.

But since the lists are one after each other in memory, and out of bounds access to the list is possible.

I do not think it is needed but I think in principle ~wolke_t needs to be virtual, since it inherits for sync_steppable, which has a virtual destructor (but htis does nothing for now). And in the iterator, everything is fine. If however a wolke_t is deleted, then this would not be called.

Another way of optimisation might be using freelist for wolke_t, since the are added and removed freuqently. But for now, with the current memory management by the programm, you may want to try to run the game in valgrind or drmemory to find a corruption.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18753
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
I wonder whether this has to do with a problem that TurfIt spotted a while ago (which is not exclusive to Experimental, but perhaps exacerbated there due to increased maximum convoy lengths): the ship hull/hold system in Pak128.Britain (both Standard and Experimental versions) means that there are often lots of "vehicles" on one tile for a single ship (the hull and lots of holds). In a game with very many ships, all going to the same dock at the same time, this great number can approach the per tile limit of objects. Steamships emit smoke, which interacts with this mechanism. I should note, however, that I cannot immediately see how this would lead to the bug in question, and, on TurfIt's advice, I made some changes to the code lately to reduce the number of objects that may actually enter a tile to stop it from approaching its hard limit set by an 8-bit data structure.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1327
IIRC Experimental has placed more objects into the freelist than Standard. Is wolke_t one of them? Maybe overflowing a freelist limit - I remember seeing a sint16 in there... Can't check the code myself right now.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18753
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
I have not altered the code relating to wolke_t, and I do not think that (although am not entirely sure whether) anyone else working on Experimental has done so. This conceivably could be freelist corruption, of course, which has the potential to be deterministic in a way in which general memory corruption is not.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2732
  • Languages: EN
Quote
The hashtable implementation is short. It could be only the erase iterator in question, since the other parts cannot cause corruptions.
I do not see it using a hashtable. The comments indicate it does but the underlying structure is not a hashtable. Linked lists are used for add and remove and a vector is used for the iterator.

Quote
I do not think it is needed but I think in principle ~wolke_t needs to be virtual, since it inherits for sync_steppable, which has a virtual destructor (but htis does nothing for now). And in the iterator, everything is fine. If however a wolke_t is deleted, then this would not be called.
I have no idea about how GCC/G++ treats virtual but MSDN (Visual C++) states the following.
Quote
The virtual keyword can be used when declaring overriding functions in a derived class, but it is unnecessary; overrides of virtual functions are always virtual.
So at least for Windows builds made with Visual C++ this should have no affect at all.

Quote
I wonder whether this has to do with a problem that TurfIt spotted a while ago (which is not exclusive to Experimental, but perhaps exacerbated there due to increased maximum convoy lengths): the ship hull/hold system in Pak128.Britain (both Standard and Experimental versions) means that there are often lots of "vehicles" on one tile for a single ship (the hull and lots of holds). In a game with very many ships, all going to the same dock at the same time, this great number can approach the per tile limit of objects. Steamships emit smoke, which interacts with this mechanism. I should note, however, that I cannot immediately see how this would lead to the bug in question, and, on TurfIt's advice, I made some changes to the code lately to reduce the number of objects that may actually enter a tile to stop it from approaching its hard limit set by an 8-bit data structure.
I do admit that I did order 40 odd steam ships into a single tile (fishing convoys are so inefficient...). However they should not all move at the same time.

I do not even see where wolke (cloud) is being bound to a tile.

EDIT (IMPORTANT)

Found the problem. It is caused by smoke removed on bridges (as a result of AK's old company being liquidated) never being removed from the list. The reason for this is because the wolke destructor only ever removed from the list if the underlying dinge is on the map. The bridge object removal hides the underlying dinge before deleting it meaning it is not on the map.

This is the problematic function...
Code: [Select]
wolke_t::~wolke_t()
{
mark_image_dirty( get_bild(), 0 );
if(  insta_zeit != 2499  ) {
if(  !get_flag(not_on_map)  &&  !welt->sync_way_eyecandy_remove( this )  ) {
dbg->error( "wolke_t::~wolke_t()", "wolke not in the correct sync list" );
}
}
}

The remove function will only be called if !get_flag(not_on_map) evaluates true due to the nature of &&. This never happens when smoke is removed from a bridge as it is specifically set to be not on map before removal.
« Last Edit: June 24, 2014, 03:34:33 PM by DrSuperGood »

Offline Sarlock

  • Devotee
  • *
  • Posts: 1340
  • Languages: EN
Wonderful... thanks for investing the time to locate where the bug resides!

I've been able to catch up on yard work while the server is down, so it has been a nice interlude.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2732
  • Languages: EN
I can confirm this was fixed in standard, just the change was never ported to experimental.

This is how Standard addressed the problem...
Code: [Select]
wolke_t::~wolke_t()
{
mark_image_dirty( get_bild(), 0 );
if(  insta_zeit != 2499  ) {
if(  !welt->sync_way_eyecandy_remove( this )  ) {
dbg->error( "wolke_t::~wolke_t()", "wolke not in the correct sync list" );
}
}
}
They removed the check if the object was on the map or not.

I am guessing this was related to the crash with removing tunnels with smoke inside. Seems logical and similar to this, which was removing bridges with smoke ontop.

I will test with this code substitution to see if it resolves the issue and respond back. James it would be nice if you could take 15-30 minutes in the near future to roll this out onto the server and build some clients so that we can continue playing.

EDIT

I can confirm the above code change will fix the problem. It is a shame that experimental and standard are so separate.

Offline Sarlock

  • Devotee
  • *
  • Posts: 1340
  • Languages: EN
Usually the patches between the two are aligned on a more timely basis but I believe that the half height implementation on standard has made the migration process more involved this time around.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18753
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Thank you very much for that: I will look into adding this fix when I have time. As Sarlock says, it is normal for Standard code to be merged into experimental, but for a variety of reasons (in summary, major feature changes in both), the current release version of Experimental is lagging. I will backport this specific fix, however, to get the server working again.


Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 18753
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
I have now incorporated this fix into the 11.x branch - thank you very much indeed for all your help.