News:

Do you need help?
Simutrans Wiki Manual can help you to play and extend Simutrans. In 9 languages.

Data race in objlist_t::remove when running headless server

Started by ceeac, July 03, 2021, 02:46:02 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

ceeac

This bug is probably related to the desyncs related to convoys reported in the desync report thread.
The issue is that convoi_t::threaded_step and sync_step are not synchronized correctly which causes convoi_t::threaded_step to run during sync_step.


==================
WARNING: ThreadSanitizer: data race (pid=35805)
  Write of size 1 at 0x7bb0203a7659 by main thread:
    #0 objlist_t::remove(obj_t const*) /home/ceeac/Projects/code/simutrans-ex/dataobj/objlist.cc:506:7 (simutrans-extended+0x585e86)
    #1 grund_t::obj_remove(obj_t const*) /home/ceeac/Projects/code/simutrans-ex/vehicle/../boden/../boden/grund.h:597:54 (simutrans-extended+0xbac420)
    #2 vehicle_base_t::leave_tile() /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:280:52 (simutrans-extended+0xbac420)
    #3 vehicle_t::leave_tile() /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:1649:18 (simutrans-extended+0xbdf62a)
    #4 rail_vehicle_t::leave_tile() /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:7735:13 (simutrans-extended+0xbdf62a)
    #5 vehicle_t::hop(grund_t*) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:1675:2 (simutrans-extended+0xbb4823)
    #6 vehicle_base_t::do_drive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:375:4 (simutrans-extended+0xbacfb5)
    #7 convoi_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1335:18 (simutrans-extended+0xa071c3)
    #8 karte_t::sync_list_t::sync_step(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4879:14 (simutrans-extended+0xb5cc68)
    #9 karte_t::sync_step(unsigned int, bool, bool) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:4954:8 (simutrans-extended+0xb5cc68)
    #10 karte_t::interactive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:11395:6 (simutrans-extended+0xb7ece6)
    #11 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1606:9 (simutrans-extended+0xabfa91)
    #12 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:1097:9 (simutrans-extended+0xb89ff5)
    #13 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:195:9 (simutrans-extended+0xc41bc6)

  Previous read of size 1 at 0x7bb0203a7659 by thread T8:
    #0 objlist_t::bei(unsigned char) const /home/ceeac/Projects/code/simutrans-ex/boden/../dataobj/objlist.h:64:13 (simutrans-extended+0x55bd9c)
    #1 grund_t::obj_bei(unsigned char) const /home/ceeac/Projects/code/simutrans-ex/boden/grund.h:600:50 (simutrans-extended+0x55bd9c)
    #2 grund_t::get_weg_nr(int) const /home/ceeac/Projects/code/simutrans-ex/boden/grund.h:636:87 (simutrans-extended+0x55bd9c)
    #3 grund_t::get_weg(waytype_t) const /home/ceeac/Projects/code/simutrans-ex/boden/grund.h:643:24 (simutrans-extended+0x55bd9c)
    #4 grund_t::get_weg_ribi_unmasked(waytype_t) const /home/ceeac/Projects/code/simutrans-ex/boden/grund.cc:1820:15 (simutrans-extended+0x55bd9c)
    #5 grund_t::get_neighbour(grund_t*&, waytype_t, unsigned char) const /home/ceeac/Projects/code/simutrans-ex/boden/grund.cc:2772:39 (simutrans-extended+0x55b627)
    #6 route_t::intern_calc_route(karte_t*, koord3d, koord3d, test_driver_t*, int, long long, unsigned int, unsigned int, bool, int, koord3d, unsigned char, route_t::find_route_flags) /home/ceeac/Projects/code/simutrans-ex/dataobj/route.cc:902:19 (simutrans-extended+0x5b8589)
    #7 route_t::calc_route(karte_t*, koord3d, koord3d, test_driver_t*, int, unsigned int, bool, int, long long, unsigned int, koord3d, unsigned char, route_t::find_route_flags) /home/ceeac/Projects/code/simutrans-ex/dataobj/route.cc:1305:22 (simutrans-extended+0x5baa58)
    #8 rail_vehicle_t::calc_route(koord3d, koord3d, int, bool, route_t*) /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:4720:37 (simutrans-extended+0xbd42de)
    #9 convoi_t::calc_route(koord3d, koord3d, int) /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:940:45 (simutrans-extended+0xa04c98)
    #10 convoi_t::drive_to() /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1481:36 (simutrans-extended+0xa08b26)
    #11 convoi_t::threaded_step() /home/ceeac/Projects/code/simutrans-ex/simconvoi.cc:1742:3 (simutrans-extended+0xa0abda)
    #12 step_individual_convoy_threaded(void*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:1947:10 (simutrans-extended+0xb494d8)

  Location is heap block of size 32744 at 0x7bb0203a0000 allocated by main thread:
    #0 malloc <null> (simutrans-extended+0x439874)
    #1 xmalloc(unsigned long) /home/ceeac/Projects/code/simutrans-ex/simmem.cc:15:18 (simutrans-extended+0xac0768)
    #2 freelist_t::gimme_node(unsigned long) /home/ceeac/Projects/code/simutrans-ex/dataobj/freelist.cc:114:20 (simutrans-extended+0x5a7f64)
    #3 grund_t::operator new(unsigned long) /home/ceeac/Projects/code/simutrans-ex/boden/grund.cc:161:9 (simutrans-extended+0x550645)
    #4 planquadrat_t::rdwr(loadsave_t*, koord) /home/ceeac/Projects/code/simutrans-ex/simplan.cc:275:39 (simutrans-extended+0xad7bb1)
    #5 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9660:33 (simutrans-extended+0xb7378d)
    #6 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9147:3 (simutrans-extended+0xb7100f)
    #7 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1462:32 (simutrans-extended+0xabe924)
    #8 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:1097:9 (simutrans-extended+0xb89ff5)
    #9 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:195:9 (simutrans-extended+0xc41bc6)

  Thread T8 (tid=35866, running) created by main thread at:
    #0 pthread_create <null> (simutrans-extended+0x43b13b)
    #1 karte_t::init_threads() /home/ceeac/Projects/code/simutrans-ex/simworld.cc:2224:8 (simutrans-extended+0xb419cd)
    #2 karte_t::load(loadsave_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:10138:2 (simutrans-extended+0xb778b8)
    #3 karte_t::load(char const*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:9147:3 (simutrans-extended+0xb7100f)
    #4 simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1462:32 (simutrans-extended+0xabe924)
    #5 sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/sys/simsys.cc:1097:9 (simutrans-extended+0xb89ff5)
    #6 main /home/ceeac/Projects/code/simutrans-ex/sys/simsys_posix.cc:195:9 (simutrans-extended+0xc41bc6)

SUMMARY: ThreadSanitizer: data race /home/ceeac/Projects/code/simutrans-ex/dataobj/objlist.cc:506:7 in objlist_t::remove(obj_t const*)
==================


To reproduce, compile with clang/gcc without graphics and with Thread Sanitizer enabled (-fsanitize=thread), then load a save file from the command line.

jamespetts

Interesting - thank you for this report. Do I understand that you mean that this data race can only be reproduced with the command line server build?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

It means it is a debug build which checks for this. Normal builds do not check it and just crash or desync. It cause quite slowdown and overhead:
https://clang.llvm.org/docs/ThreadSanitizer.html
So it may not work with the bridgewater save on most computers.

ceeac

Quote from: jamespetts on July 03, 2021, 03:08:50 PMDo I understand that you mean that this data race can only be reproduced with the command line server build?
The bug can be reproduced on any version, however with a non-graphical build no warnings originating from the graphics code are emitted which makes it much easier to look for issues potentially affecting client-server synchronization.

jamespetts

I have just been looking at this now. This is potentially complex. It is intended that the threaded step for convoys run at the same time as sync_step, as this is where the performance improvement comes from. The data race appears to be in objlist_t: on the one hand, a vehicle is being removed from objlist_t because it is moving to the next tile. On the other hand, the path finder is looking up whether a given way is in objlist_t to see whether it can find a route through it.

In principle, these two operations should not conflict. However, if they both use objlist_t, then I can see that it is possible for the path finding code to read an instance of objlist_t just as sync_step is writing it, and thus to read it in a corrupted state and produce an undefined result.

I wonder whether a mutex for objlist_t would suffice for this? The only alternative is making the vehicle path-finding not concurrent at all, which would have a significant adverse impact on performance.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

I did not look into the threading step code very deeply. But sync_step and route reservation can interfere a lot. How to avoid a citycar entering a crossing in sync_step when a train is just trying to reserve this route in a thread? Even a semaphore for the crossing will not help, because in some case it may be faster than the citycar and in other user PCs not. The outcome would be different depending on the CPU speed and load.

I think a semaphore before the call to the sync_step with the convois and the one for citycars is the only proper way of doing this. It is much more efficient since the call to enter a tile is most performance sensitve (show up as most called function in profiling) and a semaphore for protecting the objlist would be quite a slowdown on so many frequent calls.

jamespetts

Quote from: prissi on May 25, 2022, 08:31:07 AMI did not look into the threading step code very deeply. But sync_step and route reservation can interfere a lot. How to avoid a citycar entering a crossing in sync_step when a train is just trying to reserve this route in a thread? Even a semaphore for the crossing will not help, because in some case it may be faster than the citycar and in other user PCs not. The outcome would be different depending on the CPU speed and load.

I think a semaphore before the call to the sync_step with the convois and the one for citycars is the only proper way of doing this. It is much more efficient since the call to enter a tile is most performance sensitve (show up as most called function in profiling) and a semaphore for protecting the objlist would be quite a slowdown on so many frequent calls.
The convoi_t threaded step does not include route reservation, or any other operations involving affecting the game state directly. Route reservation is done in the non-threaded step. Only path-finding is done in the threaded step (calc_route(), etc.). Can I check whether your suggestion in respect of the semaphore still applies in light of this?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Well, the case of a vehicle entering a tile and converting the objlist from one to minivect would be indeed a problem, because one then rather points to the vector and not the first obj. But that should be a very rare occurence.

I still suspect find route for choose signs, since a route would be different for road vehicles depending on which tiles are occupied (and at least in standard the cost function changes with vehicles passing as well). Also the routing is done threaded while find_route seems to be done from step.

Moreover, since only a few convois are handled each time by the thread routine, (and most are not routing) it takes a long time for the actual routing to be handled. I see very little speedup here. I would rather expect to add that convoi to a routing list (protected by semaphores) and then process each time n convois from the list before the next step. Also the chooseing seems to be done in the normal step.

It would be still important to protect a convoi from deletion while in ROUTING_2 state.

I see several challenges with this code, but I have not really time not motivation to dig into the deeply.