The International Simutrans Forum

Simutrans Extended => Simutrans-Extended bug reports => Simutrans-Extended development => Simutrans-Extended closed bug reports => Topic started by: Mariculous on September 13, 2019, 02:01:05 PM

Title: [Bug] Placement of signals sometimes crashes the server
Post by: Mariculous on September 13, 2019, 02:01:05 PM
This is a draft as I can not yet reproduce this at a minimal setup nor I can confirm that it is only related to track circuit block.

What happens:
When a train is waiting at a signal and you place a new signal behind that one, so the train can continue, the server sometime crashes, while the client does not.
Sometimes the server does not crash, but client gets out of sync and most times this simply works as expected.

As always, I will update this thread when I can reliably reproduce this, including a stacktrace.
Until then, this is only a placeholder, so anyone who observes the same can help locating the problem.
Title: Re: [Bug] Placement of Track circuit block signals sometimes crashes the server
Post by: jamespetts on September 13, 2019, 02:05:15 PM
Thank you for letting me know. Crash bugs are always a high priority.

At present, I am not able to run the server game at home; if you can reproduce this bug on a much smaller/simpler game, that would be very helpful. Thank you for your work on this.
Title: Re: [Bug] Placement of Track circuit block signals sometimes crashes the server
Post by: Mariculous on September 26, 2019, 01:23:06 PM
Small update, it frequently happened again, but I did not yet et it caught :( Maybe I should run the server with gdb.
Whenever it happened, there was a reservation at that tile, however, not whenever you place a track circuit block at an existing reservation, the server will crash.
In fact I tried to artificially create this situation but never got it crashing when I wanted to(running a debug build in gdb)

However, it seems to happen more frequently when there are very many (stuck) waiting vehicles at the time you build the signal.
Title: Re: [Bug] Placement of signals sometimes crashes the server
Post by: Mariculous on October 09, 2019, 02:51:21 PM
I updated the title as I encontered this not being related to track circuit block placement.

It just happened again, when I downgraded some of my randomly deadlocking, (directional) track circuit block signalled tracks to absolute block.
Still couldn't get directional reservations relyably working for track circuit block signalling. On some it works, on the others I simply replace it with token block now, but that's another issue.

As I still couldn't relyably reproduce nor catch this bug, I want to at least share what I tryed and why it is so hard to catch this one in gdb.

Base situation: many rail vehicles are waiting for clearance at a deadlocked system. Doesn't matter if they are actually "stuck" or yet just "waiting for clearance"

What did I do?
I removed old (track circuit block) signals and placed new (token block) signals on one of te reserved tiles => server crashed
So I reloaded the same savegame and dis just the same thing again => server crashed
So I thought I finally got a save to reproduce and catch the error.
  I compiled a (client) debug build, loaded the same save on the client and did the same again => worked
  I retryed it a few times but no matter what I tried, I couldn't get the client crashing.
So I thought it is maybe posix backend related.
  I compiled a headless server debug build and loaded the save with that build on my server again => server crashed... wait? Nah, it's stil up! It just disconnected the client!
So I thought maybe I can at least collect useful informations about the disconnect in the logs.
  I launched the server with -debug 5 and tryed to connect => can't connect at all, I guess it's due to server being too slow.
  Same for -debug 3

I really don't know how I could catch that one but it frequently crashes my server so I really need to catch it.


After yet another try I could connect with -debug 3 set, created some signal placements disconnects but there is nothing suspicious in the log, apart from spamming something like
"Warning: route_t::intern_calc_route():  Problem with heuristic:  from 1169,867,3 to (1152,871,4) at 1167,869, best = 15603, cost = 1867, heur = 18860, dist = 16, turns = 16977"
"Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (1783) London County Council double deck E/1 class tram departing from stop Bremerhaven Town Stop at step 259. Its departure time is calculated as"
and
"ERROR: void convoi_t::laden():  Trying to load at halt Rostock Hill Stop when not at a halt"
but I don't expect these to be related to this issue as they always get spammed to the log.

At the third try it crashed, instead of simply disconnecting, so I'll give it another try in gdb now!

Finally, I caught it in gdb!
However, the trace is not as useful (to me) as I was hoping for.
It seems to be related to posix backend.


Thread 1 "simutrans-exten" received signal SIGSEGV, Segmentation fault.
0x0000000000786c24 in checklist_t::print (this=0x727865203630335d, buffer=0x7fffffffc21d "", entity=0x87cdb6 "initiator") at simworld.cc:378
378     simworld.cc: No such file or directory.
(gdb) bt full
#0  0x0000000000786c24 in checklist_t::print (this=0x727865203630335d, buffer=0x7fffffffc21d "", entity=0x87cdb6 "initiator") at simworld.cc:378
No locals.
#1  0x00000000007aee81 in karte_t::process_network_commands (this=0x1347af30, ms_difference=0x7fffffffc27c) at simworld.cc:10237
        buf = "server=[ss=23466 st=2933 nfc=2 rand=2165059994 halt=1025 line=1 cnvy=1025 ssr=1458512605,2165059994,0,0,0,0,0,0 str=748925306,748925306,748925306,748925306,748925306,748925306,748925306,748925306,7489"...
        offset = 333
        nwt = 0x7278652036303335
        nwcid = 13109
        ms = 741354548
        time_to_next_step = 959920185
        nwc = 0x302c302c30
        next_command_step = 741354544
#2  0x00000000007afcb8 in karte_t::interactive (this=0x1347af30, quit_month=2147483647) at simworld.cc:10424
        time = 1632520
        hashes_ok = {data = 0x0, size = 0, count = 0}
        ms_difference = 0
#3  0x00000000007462db in simu_main (argc=4, argv=0x7fffffffeaf8) at simmain.cc:1382
        pause_after_load = false
        welt = 0x1347af30
        view = 0x8439540
        eventmanager = 0x84a20d0
        resolutions = {{640, 480}, {800, 600}, {1024, 768}, {1280, 1024}, {704, 560}}
        disp_width = 0
        disp_height = 0
        fullscreen = 0
        quit_month = 2147483647
        path_sep = 0x86d75e "/"
        pak_diagonal_multiplier = 724
        pak_tile_height = 8 '\b'
        pak_height_conversion_factor = 2 '\002'
        found_settings = true
        found_simuconf = true
        multiuser = true
        simuconf = {file = 0x0}
        path_to_simuconf = "config/simuconf.tab\000\000\000\000"
        version = 0x86d830 "Simutrans version 120.2.1 Extended Nightly development build 14.5 from Oct  9 2019 #59f9d9d\n"
        cli_syslog_enabled = false
        cli_syslog_tag = 0x0
        file = {mode = 6, saving = false, buffered = false, curr_buff = 4160741744, buf_pos = {32767, 0}, buf_len = {0, 0}, ls_buf = {0x0, 0x7ffff7ffe4c8 "(\344\377\367\377\177"}, version = 120004, extended_version = 14, extended_revision = 12, ident = 1,
          pak_extension = "settings only\000\000\000=4@\000\000\000\000\000\066\340\343\371\000\000\000\000\377\377\377\377", '\000' <repeats 12 times>, "\030\017\266\366\377\177\000\000\000\240\376\367\377\177", '\000' <repeats 90 times>, "\020\301\264\000\000\000\000\000P^@\000\000\000\000\000\360\352\377\377\377\177", '\000' <repeats 18 times>, "\360\351\377\377\377\177\000\000"..., filename = "settings-extended-debug.xml", fd = 0xcc37f0, static save_mode = loadsave_t::zipped, static autosave_mode = loadsave_t::zipped}
        xml_filename = "settings-extended-debug.xml\000\000\000\000"
        xml_settings_found = true
        obj_conf = "/path/with/the/same/number/of/charssssss/simuconf.tab"
        themes_ok = true
        parameter = {0, 0}
        new_world = false
        loadgame = ""
#4  0x000000000075939c in sysmain (argc=4, argv=0x7fffffffeaf8) at simsys.cc:825
        buffer2 = 0x0
        buffer = "/some/path/with/exactly/the/same/amount/of/charss/here/simutrans-extended-debug", '\000' <repeats 554 times>...
        length = 79
#5  0x0000000000823bdc in main (argc=4, argv=0x7fffffffeaf8) at simsys_posix.cc:197
No locals.
Title: Re: [Bug] Placement of signals sometimes crashes the server
Post by: jamespetts on December 19, 2019, 12:15:33 AM
Thank you for your analysis in relation to this. This is an odd issue - the line of code in question is the single line of a method used for producing the network checklists to check whether a client remains in synchronisation with the server:


int checklist_t::print(char *buffer, const char *entity) const
{
return sprintf(buffer, "%s=[ss=%u st=%u nfc=%u rand=%u halt=%u line=%u cnvy=%u ssr=%u,%u,%u,%u,%u,%u,%u,%u str=%u,%u,%u,%u,%u,%u,%u,%u,%u,%u,%u,%u,%u,%u,%u,%u exr=%u,%u,%u,%u,%u,%u,%u,%u sums=%u,%u,%u,%u,%u,%u,%u,%u",
entity, ss, st, nfc, random_seed, halt_entry, line_entry, convoy_entry,
rand[0], rand[1], rand[2], rand[3], rand[4], rand[5], rand[6], rand[7],
rand[8], rand[9], rand[10], rand[11], rand[12], rand[13], rand[14], rand[15], rand[16], rand[17], rand[18], rand[19], rand[20], rand[21], rand[22], rand[23],
rand[24], rand[25], rand[26], rand[27], rand[28], rand[29], rand[30], rand[31],
debug_sum[0], debug_sum[1], debug_sum[2], debug_sum[3], debug_sum[4], debug_sum[5], debug_sum[6], debug_sum[7]
);
}


I did not write this code nor do I know much of its workings. I would need a reliable reproduction case to be able to track this one down.
Title: Re: [Bug] Placement of signals sometimes crashes the server
Post by: Mariculous on January 13, 2020, 10:34:46 AM
I just got it reproduced, it only happens in network games, that's the reson why I could not reproduce it ever before.
Open the attached save and follow the instructions:
1. start a simutrans (localhost) server and load the map within.
2. start a simutrans client and connect to localhost.
3. There will be some stuck trams on the map.
4. On the server, place any kind of sign or signal e.g. oneway sign, within the trams reservations and the client will immediately desync.
5. Connect the client again.
6. On the client, again place any kind of signal or sign within the reservation and there is a rather small chanche the client will desync, otherwise, the server will crash (and client gets disconnected for sure)
Title: Re: [Bug] Placement of signals sometimes crashes the server
Post by: jamespetts on January 13, 2020, 11:15:38 AM
Excellent, that is very helpful indeed, thank you. I shall look into this when I get home. This is a high priority item.
Title: Re: [Bug] Placement of signals sometimes crashes the server
Post by: jamespetts on January 13, 2020, 10:08:39 PM
I am afraid that I am having trouble reproducing this. In particular, step 5 appears not to work as described: when I place a one way sign on the server instance, I notice no loss of synchronisation. Can I ask that you be more specific as to the tile on which you place the sign?
Title: Re: [Bug] Placement of signals sometimes crashes the server
Post by: Mariculous on January 13, 2020, 10:41:49 PM
You can skip stel 4 and 5 it is not required to reproduce the crash.
The exact location does not matter as long as you place it on top of any tram waiting for clearance, but there is indeed the description is wrong in one point, I was not aware of!
You don't need to actually place it, you need to drag it along. If you still can't reproduce it, try dragging from (31,29) towards north.
Same goes for step 6 (the server crash) simply click placing it will work finde but dragging is the isue.
Title: Re: [Bug] Placement of signals sometimes crashes the server
Post by: jamespetts on January 14, 2020, 12:15:47 AM
Splendid, the issue was indeed with the dragging: what would happen is that the dragging function created a dummy vehicle to test the route, and, when this vehicle was deleted, it would clear the reservation in the starting tile - but only on the client (assuming that the client was the one doing the dragging), as no command would be sent over the network as it was assumed that creating and deleting the dummy vehicle would have no side effects. Clearing the reservation then allowed one of the trams to move, causing a loss of synchronisation.

I have now fixed this by adding a check in the destructor to see whether the thing being deleted is a real train or a dummy object, and only clearing the reservation in the former case.

Thank you for the reproduction case for this: it is much appreciated.