News:

Do you need help?
Simutrans Wiki Manual can help you to play and extend Simutrans. In 9 languages.

Segmentation fault in depot search during convoy routing

Started by ACarlotti, April 21, 2019, 02:12:28 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

ACarlotti

Quote from: jamespetts on April 20, 2019, 08:41:35 PMRunning fast forwarded with this (somewhat old) saved game gives a segmentation fault after a few minutes sometimes (but not consistently - perhaps about 2/3rds of the time).

So I can reproduce these segfaults (or at least I can get my own segfaults). But it depends on the type of build. To test this I used this save, and sent the 3-car DLR train (that was triggering out-of-range errors) to the depot once it had reached Bank. The game starts in early August

With a debug build ("DEBUG=3" in the config), running in GDB, I can reach mid October with no segfault. Running outside of GDB I have reached mid September with no segfault.

With a non-debug build (as above, except with DEBUG commented out in the config), running outside of GDB, I consistently get a segfault approximately 1-2 hours into September.

With a mostly non-debug build (i.e. all the same optimisations), but with debugging symbols added (appending "-g" to "FLAGS"), I got a segfault almost straight away in GDB. The backtrace is:

0x00005555555f9123 in objlist_t::suche (this=this@entry=0x55557fef5340, typ=typ@entry=obj_t::strassendepot, start=start@entry=0 '\000')
    at dataobj/../boden/../dataobj/../simobj.h:247
247             inline typ get_typ() const { return type; }
(gdb) backtrace
#0  0x00005555555f9123 in objlist_t::suche (this=this@entry=0x55557fef5340, typ=typ@entry=obj_t::strassendepot, start=start@entry=0 '\000')
    at dataobj/../boden/../dataobj/../simobj.h:247
#1  0x00005555555e8f33 in grund_t::suche_obj (typ=obj_t::strassendepot, this=0x55557fef5338) at boden/../obj/../boden/grund.h:564
#2  grund_t::get_depot (this=this@entry=0x55557fef5338) at boden/grund.cc:1760
#3  0x00005555558ac119 in rail_vehicle_t::check_next_tile (this=0x55559fae0df0, bd=0x55557fef5338) at vehicle/simvehicle.cc:4551
#4  0x00005555556145b5 in route_t::intern_calc_route (this=this@entry=0x55559fadd118, welt=welt@entry=0x5555585735d0, start=..., ziel=...,
    tdriver=tdriver@entry=0x55559fae0e58, max_speed=max_speed@entry=160, max_cost=<optimized out>, axle_load=<optimized out>, convoy_weight=<optimized out>,
    is_tall=<optimized out>, tile_length=<optimized out>, avoid_tile=..., start_dir=<optimized out>) at dataobj/route.cc:704
#5  0x000055555561568b in route_t::calc_route (this=this@entry=0x55559fadd118, welt=0x5555585735d0, start=..., ziel=...,
    tdriver=tdriver@entry=0x55559fae0e58, max_khm=max_khm@entry=160, axle_load=11, is_tall=false, max_len=8, max_cost=9223372036854775807,
    convoy_weight=438, avoid_tile=..., direction=15 '\017') at dataobj/route.cc:1105
#6  0x00005555558c0507 in rail_vehicle_t::calc_route (this=this@entry=0x55559fae0df0, start=..., ziel=..., max_speed=max_speed@entry=160,
    is_tall=<optimized out>, route=route@entry=0x55559fadd118) at vehicle/../simworld.h:2743
#7  0x00005555557e0f4d in convoi_t::calc_route (this=this@entry=0x55559fadd080, start=..., ziel=..., max_speed=160) at tpl/array_tpl.h:95
#8  0x00005555557e955c in convoi_t::drive_to (this=0x55559fadd080) at simconvoi.h:1047
#9  0x00005555557e9fa7 in convoi_t::threaded_step (this=<optimized out>) at simconvoi.cc:1707
#10 0x000055555586aa10 in step_individual_convoy_threaded (args=<optimized out>) at simworld.cc:1856
#11 0x00007ffff7f93a9d in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff7820b23 in clone () from /usr/lib/libc.so.6


frame 2 involves the get_depot function that you have been modifying recently, but it is ultimately being called by convoi_t::threaded_step. So, at first glance, it could potentially have been caused by either of our changes.

Edit: I've got another fairly quick segfault, from the same bit of code. In both cases, get_depot was being called on a bit of plain track. In the first case, there were two objects at that point (after the segfault) - the track and the third rail. In the second case there were seven objects - the track, two rail vehicles and four smokes.

I think what is happening is that objects are being removed from the objlist while the get_depot is (indirectly) searching the objlist. This results in it trying to look up the type of a object referenced by a null pointer.

This problem was probably introduced a week ago when you changed get_depot to use suche_obj instead of first_obj. I think reverting that change (and the subsequent ones) will help.


jamespetts

Thank you for that detailed analysis - that is extremely helpful.

The reason that I changed this to suche_obj from first_obj is that the equivalent code for signalboxes caused loss of synchronisation errors (the one that took 6 months to find) because, where there was more than one object on a tile, whether the signalbox or the other object (in the case of the server game, a viaduct support) was the first object was indeterministic. The same could easily happen with depots. Thus, reverting to the original code is probably not sensible.

Some possible solutions to this include:

(1) using a mutex;
(2) moving depot searches back to a single threaded step (but I should rather not do this if it can be avoided, as this would adversely affect performance and also take a long time to code); or
(3) call await_convoy_threads before removing anything from objlist_t (but I suspect that this would impact on performance, too).

I should be grateful for your views on the above, as this is not an entirely simple problem to solve.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

ACarlotti

Quote from: jamespetts on April 21, 2019, 09:38:06 AMwhere there was more than one object on a tile, whether the signalbox or the other object (in the case of the server game, a viaduct support) was the first object was indeterministic.
I'm pretty sure this is a bug in itself - the 'fix' you made last month looks to me like a workaround to deal with one of the symptoms, rather than a fix to the underlying problem.

This underlying problem is something I intend to investigate further as soon as I have time (and will probably begin by investigating the objlists containing that signalbox that was giving us trouble before).

In the absence of confirmed bugs involving the use of first_obj for get_depot, and the presence of confirmed bugs when suche_obj is used instead, I think it is best that we switch back to the former for now, while remaining aware that there is still some unexplained indeterminacy here that I am going to investigate.

My investigations will also include trying to establish whether the use of first_obj could lead to segfaults due to race condition. I think that is much more likely to be safe (or able to be efficiently made safe) than suche_obj, but it definitely needs checking.

jamespetts

Interesting - I have not altered the basic objlist code from Standard so far as I can recall. May I ask why you think that the lack of determinacy here is a bug? Might is also be a bug in Standard? It seems very rarely to arise.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Standard assumes that objlist is only accessed single threaded. Otherwise during a sync_step smoke and vehicle will vanish for the list. (For the same reason also route searches in thread should not working stable. The introduction of steps was exactly to avoid these kinds of issues. I fone does searches threaded, then the reason to have steps is almost gone.)

ACarlotti

I think the lack of determinacy is probably indicative of one or more underlying bugs, because it means that either there is the potential for a race condition leading to an inconsistent state, or there is some prior non-determinacy which could in itself be a problem. So there are multiple ways in which there could be underlying synchronisation issues.

There is also the issue that get_signalbox was returning an incorrect result only inconsistently, which suggests something strange was going on with the ordering in those objlists. At a first glance I would have expected the priority ordering to prevent that particular desync from happening, so either I've made an incorrect assumption somewhere, or the code is broken (independent of synchronisation considerations).

prissi: I think it will probably be possible to make access to objlist threadsafe in specific circumstances; however it will take a little work. I think this will primarily entail only accessing the first (or a fixed constant number) of objects, and ensuring that in those circumstances the compact representation for a single object is not being used.

jamespetts

Thank you for this. I have now reverted the depot search code as suggested, although it is sensible to monitor the situation in case any loss of synchronisation arises. I have also fixed the issue with the out of range crash relating to choose signals.

Thank you very much for your helpful analysis on this: the threading code is, by its nature, very complex (I had no previous experience of working in multi-threaded programming before starting the multi-threading project in 2016, so do forgive any errors), but it is important from a performance perspective and does make a real difference compared to single threaded code. Indeed, this difference is likely to increase in the future as increasing core counts rather than speed per core is the way in which computational power will grow.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

ACarlotti

Quote from: ACarlotti on April 22, 2019, 10:53:23 AMThere is also the issue that get_signalbox was returning an incorrect result only inconsistently, which suggests something strange was going on with the ordering in those objlists. At a first glance I would have expected the priority ordering to prevent that particular desync from happening, so either I've made an incorrect assumption somewhere, or the code is broken (independent of synchronisation considerations).
The code was broken - see here for details.