News:

Simutrans Sites
Know our official sites. Find tools and resources for Simutrans.

Command line server issues

Started by jamespetts, May 26, 2013, 10:50:44 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

neroden

I was having some weird integer overflow crashes in neuer_monat a while back, which I never tracked down.

Does this work with a full graphics *server*, hosting clients?  Perhaps we can sort out whether this is really in the "no graphics" code, or actually in the server code.

Also, I just fixed your other savegame bug (very sorry about that one -- the finance code logic was really, really hairy).  So -- can you see if you can reproduce this bug on that branch (emergency-depot-bugfix)?  I did several rounds of cleanups there.  The code I was cleaning up could easily have had memory management errors in it, and I suspect some of it did.  If the bug goes away on the emergency-depot-bugfix branch, we should probably just merge that to the 11.x branch and make it the new release candidate.

prissi

Are you compiling a 64 bit command line option by accident, but still using ptrhashtables with int somewhere?

jamespetts

Nathaneal - thank you for those fixes: will test with your latest branch presently. Prissi - I don't think that this can be the issue, since my most recent tests used the -m64 switch, which compiles "int" as 64-bit.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have now tried Nathaneal's latest work (on my 112.x-private-car-merge branch) to see whether this alleviates the problem, especially in light of the work that has been done which Nathaneal tells me impacts on things in simgraph16.cc and simgraph0.cc. The result is that, after this patch, the behaviour in the non-graphics server version and the full graphics client version have indeed converged: now both crash immediately on a new month in this saved game.

The crash is in line 1779 of simconvoi.cc and is a segmentation fault/access violation: the problem appears to be that get_schedule() returns a NULL pointer. It is not clear why at the beginning of a month, convoys should suddenly have a NULL pointer for their schedules: although that particular line of code is not called except at the ends of months, get_schedule() is called many times elsewhere in the code for convoy objects, and this would surely have caused a crash elsewhere earlier (especially as the game in question is at the very beginning of a month) if the problem did not begin on a new month.

This suggests to me wider and quite significant memory management issues associated with some recent merging work, in particular whatever code was changed recently to align the behaviour of the full graphics and no graphics versions.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

Quote from: jamespetts on June 12, 2013, 01:04:51 PM
I have now tried Nathaneal's latest work (on my 112.x-private-car-merge branch) to see whether this alleviates the problem, especially in light of the work that has been done which Nathaneal tells me impacts on things in simgraph16.cc and simgraph0.cc. The result is that, after this patch, the behaviour in the non-graphics server version and the full graphics client version have indeed converged:
Hooray!  :-)

Quotenow both crash immediately on a new month in this saved game.
Believe it or not, this is progress.

QuoteThe crash is in line 1779 of simconvoi.cc and is a segmentation fault/access violation: the problem appears to be that get_schedule() returns a NULL pointer.
Oho, I found and fixed that bug today (now fixed on ncn-devel and on merge-from-standard).  It was giving me a beginning of month crash.

The important thing to realize is that this section of code is called on convois which are *sitting in the depot and have not yet launched*, and as such do not have schedules. 

The call to get_schedule is solely to get the waytype for accounting purposes -- this is another one of the problems arising out of the accounting merger.  I changed the code to get the waytype in a far more tedious manner which should be completely reliable, hopefully.

However, I believe that convoi_t objects should internally store their waytype.  This would allow for removal of great masses of gunk, not just here but in a bunch of other places.

Please try the merge-from-standard branch again and see if there's still *another* lurking crash hiding behind this one.  (Merge-from-standard has the loading screen code fixes which affect simgraph0.cc and simgraph16.cc, ncn-devel doesn't).

--Nathanael

jamespetts

Hmm, I suspect that the month end but that you have recently fixed might rather have got in the way of seeing this issue: compiling a Windows command line server in MSVC++ from the 112.x-private-car-merge produces an access violation at line 149 of image_reader.cc, the problem being that "dest" is a bad (not null) pointer. This is code unchanged from Standard, so something is still very wrong somewhere.

I have not had chance to test this again with a Linux server yet (have spent the evening baking a cake), but will revert when I get a chance to do so.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

#41
Quote from: jamespetts on June 12, 2013, 11:57:12 PM
Hmm, I suspect that the month end but that you have recently fixed might rather have got in the way of seeing this issue: compiling a Windows command line server in MSVC++ from the 112.x-private-car-merge produces an access violation at line 149 of image_reader.cc, the problem being that "dest" is a bad (not null) pointer. This is code unchanged from Standard, so something is still very wrong somewhere.
This code is changed from standard.  You changed the memory management on April 10th (commit 77fc2bdf691) after finding memory leaks...

(Edit: for reference, I used "git diff standard HEAD -- filename" to find out what differs from standard, "git blame filename" to figure out what commit made the change.)

Unfortunately, there's something worse than memory leaks.  If you call delete too many times, you get bad pointers.  GCC is remarkably good at protecting us from that in debug builds, but in non-debug builds all bets are off, and I'm sure MSVC is different.

Worse, this code compiles differently depending on whether COMMAND_LINE_SERVER is defined, so it may actually be at fault.

Quote
I have not had chance to test this again with a Linux server yet (have spent the evening baking a cake), but will revert when I get a chance to do so.

I'm going to remove the bad references to delete[] (two of them are clearly erroneous).

By the way, I'm going to merge up my ncn-devel branch to merge-from-standard now, since the stuff on merge-from-standard doesn't seem to be hurting anything.

----
To your credit, the patch to fix the memory leaks was *mostly* correct -- there was a memory leak which needed to be fixed, which you fixed correctly in the destructor for bild_besch.  Unfortunately I think you ended up with a "double delete".

jamespetts

Ahh - hadn't remembered that: thank you for working this out. I had used Dr. Memory to try to find errors in the code, and had fixed a number of memory leaks as a result. Had you implemented the fix for this yet? I ask because, having pulled from the latest merge-from-standard, I am still getting this error.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

#43
Quote from: jamespetts on June 13, 2013, 08:12:11 AM
Ahh - hadn't remembered that: thank you for working this out. I had used Dr. Memory to try to find errors in the code, and had fixed a number of memory leaks as a result. Had you implemented the fix for this yet? I ask because, having pulled from the latest merge-from-standard, I am still getting this error.

I have implemented an attempted fix for this, yes. (commit 49920ee52).

So, at the moment, can I confirm, the graphics version is working fine; the command-line version is crashing?

I noticed that the command-line version compiles with MIDI on by default (!!!), so I ran it with -nomidi -nosound.  That should probably be fixed regardless of whether it's relevant to this bug.

I'm going to try working on a fix-command-line branch since some of the things I try may break further stuff.
----
I'm having some problems tracking this down because the command-line server is now working fine for me.  I'll leave it running for a while in the debugger, see if it crashes..   Perhaps if I tried the Bridgewater-Brunel game I could track this down.

neroden

OK, so I managed to get some trouble running a command line server.

First, a desync:

Desync due to checklist mismatch
sync_step=56620  server=[rand=2339077297 halt=1 line=1 cnvy=1 ind_dns_prop=5585 act_ind_dens=1127 traffic=480473] client=[rand=3039275210 halt=1 line=1 cnvy=1 ind_dns_prop=5585 act_ind_dens=1127 traffic=480502] World finished ...

Then, when I attempted to reconnect, a crash with this backtrace:

#0  0x0000000000434d82 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
#1  0x0000000000466d21 in dingliste_t::rdwr (this=0x74da360, welt=0x16c1600, file=0x7fffffffb8c0,
    current_pos=...) at dataobj/dingliste.cc:1062
#2  0x000000000045363a in grund_t::rdwr (this=0x74da358, file=0x7fffffffb8c0) at boden/grund.cc:368
#3  0x0000000000451309 in boden_t::rdwr (this=0x74da358, file=0x7fffffffb8c0) at boden/boden.cc:55
#4  0x00000000006afe04 in planquadrat_t::rdwr (this=0x7fffe3dc5a98, welt=0x16c1600, file=0x7fffffffb8c0,
    pos=...) at simplan.cc:240
#5  0x00000000006f4746 in karte_t::save (this=0x16c1600, file=0x7fffffffb8c0, silent=false)
    at simworld.cc:4974
#6  0x00000000006f3ed4 in karte_t::save (this=0x16c1600, filename=0x7fffffffbba0 "server1234-network.sve",
    savemode=loadsave_t::zipped, version_str=0x75ffc5 "0.112.5", ex_version_str=0x75ffc1 ".11", silent=false)
    at simworld.cc:4850
#7  0x0000000000485251 in nwc_sync_t::do_command (this=0x7a9f930, welt=0x16c1600)
    at dataobj/network_cmd_ingame.cc:730
#8  0x00000000006fde04 in karte_t::interactive (this=0x16c1600, quit_month=2147483647) at simworld.cc:7103
#9  0x00000000006a341c in simu_main (argc=9, argv=0x7fffffffe108) at simmain.cc:1261
#10 0x00000000006b2c71 in sysmain (argc=9, argv=0x7fffffffe108) at simsys.cc:703
#11 0x00000000007537ab in main (argc=9, argv=0x7fffffffe108) at simsys_s.cc:720

I'm not sure where to start on debugging this; perhaps someone else has an idea.  I'm also not sure this is your crash!

jamespetts

Nathaneal,

ahh, you have hit upon the same crash as I did on the Linux server - see above. It is very different to the crash that I get on the Windows command line server, but the same as reported here. The desync is I think an old issue not yet tracked down (because tracking down desyncs is excruciatingly difficult and months are long enough for this to be a tolerable bug for the time being) that often occurs at the beginning/end of months on developed games; the crash happens on reconnexion because the game is set to pause when no clients are connected: I strongly suspect that the crash would occur immediately after the desync were the server not set to pause when no clients are connected.

Edit: Actually, looking at your backtrace, the crash seems to occur in the save/load routine used when a client connects, so the above might not necessarily be right: however, using nettool to bring about a force-sync before the month boundary works fine without any crashes, so this is still end of month related. Sadly, I am not quite sure where to start with this one, either, and it's rather a show-stopper, as they say.

Thank you for the tip re: -nosound and -nomidi: I have added these parameters to my startup script. It may not make much of a difference, but there is no harm in possibly saving a small amount of memory/CPU power.

To answer your question:

Quote from: neroden
So, at the moment, can I confirm, the graphics version is working fine; the command-line version is crashing?

Yes, subject to latent bugs on the graphics version.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have just attempted to run the command line server on Linux with Valgrind. It was, as I suppose is inevitable, excurciatingly slow (taking over an hour just to load the game), and, as soon as (or perhaps even before) it had loaded the saved game, it immediately quit without any explanation or error message. I cannot explain why this behaviour with Valgrind attached is different than the behaviour without Valgrind.

However, it did generate a log, which is available for inspection here. To those who know more about Valgrind than I do - are any of these errors significant, and might any of them be related to the troubles described in this thread?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

Imho, you should try to hunt down the reason for this block:

==16951== Conditional jump or move depends on uninitialised value(s)
==16951==    at 0x43A4EE: haus_besch_t::calc_checksum(checksum_t*) const (haus_besch.cc:189)
==16951==    by 0x438079: fabrik_besch_t::calc_checksum(checksum_t*) const (fabrik_besch.cc:82)
==16951==    by 0x40BDBD: fabrikbauer_t::alles_geladen() (fabrikbauer.cc:229)
==16951==    by 0x43ED32: factory_reader_t::successfully_loaded() const (factory_reader.cc:446)
==16951==    by 0x441816: obj_reader_t::laden_abschliessen() (obj_reader.cc:73)
==16951==    by 0x6A0C99: simu_main(int, char**) (simmain.cc:964)
==16951==    by 0x6B17A0: sysmain(int, char**) (simsys.cc:703)
==16951==    by 0x7540E9: main (simsys_posix.cc:147)

... you have uninitialized data in your haus_besch_t ... which might even be the cause of a desync.

I cannot see anything related to the crash. The program dies due to a SIGPIPE, but I have no idea what could have triggered this.


Parsley, sage, rosemary, and maggikraut.

jamespetts

#48
Thank you for the pointer (no pun intended). Will investigate.

Edit: I think that I have found and fixed this uninitialised variable bug (on my 112.x-private-car-merge): testing now. It related to whether buildings in old Experimental versions of paksets were marked as control towers or allowed underground.

Edit 2: Updated to the latest 112.x-private-car-merge commit on the Linux server, and ran to the end of the month with the Bridgewater-Brunel saved game from July 1956. Same behaviour as before: desynched fractionally after the end of the month: on reloading, crashed with a segmentation fault. Reloading the same saved game afresh results in this loading error:


Warning: karte_t::laden:        Fileversion: 112005
Calculating textures ...done
ERROR: grund_t::rdwr(): invalid waytype 48!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
FATAL ERROR: dingliste_t::laden() - During loading: Unknown object type '48'
Aborting program execution ...

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
FATAL ERROR: dingliste_t::laden() - During loading: Unknown object type '48'
Aborting program execution ...

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Aborted


It looks as though something in the new month routines is damaging dingliste_t members - but only in the command line server version. This is all very odd.

Edit 3: Trying again with a graphics Windows server, the month transition in the July 1956 Bridgewater-Brunel saved game occurs without crashing, although it does desync: nevertheless, once the client logs in again, it continues to run without errors.

Edit 4: I tried running it with Valgrind again, after fixing the error with the uninitialised variables: the result was the same as before, with it quitting before actually getting to run properly. I wonder whether the SIGPIPE error might be caused by the fact that Valgrind makes the program so slow that it takes about two hours to load a map, by which time some socket or the like might have timed out or somehow become stale? In any event, here is the log.

Edit 5: I am now trying to run it with Valgrind without "-server" specified to see whether I can run it in the no graphics version without getting a SIGPIPE error. Because Valgrind makes it so slow, it might take days before I have any results, however.

Edit 6: The non-server version exited with a segmentation fault: Valgrind log here.

Edit 7: Running the no graphics version in Linux without the server setting eventually produces a segmentation fault with the following backtrace:


Program received signal SIGSEGV, Segmentation fault.
0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
188             inline bool is_moving() const { return flags&is_vehicle; }
(gdb) backtrace
#0  0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000464c64 in dingliste_t::intern_add_moving (this=0x1727e760, ding=0x2b774680) at dataobj/dingliste.cc:289
#2  0x0000000000465330 in dingliste_t::add (this=0x1727e760, ding=0x2b774680) at dataobj/dingliste.cc:488
#3  0x0000000000408a93 in grund_t::obj_add (this=0x1727e758, obj=0x2b774680) at bauer/../boden/grund.h:544
#4  0x00000000007117ad in vehikel_basis_t::betrete_feld (this=0x2b774680) at vehicle/simvehikel.cc:303
#5  0x000000000071595b in vehikel_t::betrete_feld (this=0x2b774680) at vehicle/simvehikel.cc:1405
#6  0x00000000007220df in waggon_t::betrete_feld (this=0x2b774680) at vehicle/simvehikel.cc:4278
#7  0x0000000000715dbd in vehikel_t::hop (this=0x2b774680) at vehicle/simvehikel.cc:1450
#8  0x000000000071191f in vehikel_basis_t::fahre_basis (this=0x2b774680, distance=6688768) at vehicle/simvehikel.cc:348
#9  0x0000000000650a4b in convoi_t::sync_step (this=0x2b772560, delta_t=8164) at simconvoi.cc:1116
#10 0x00000000006ee019 in karte_t::sync_step (this=0x157e550, delta_t=8164, sync=true, display=true) at simworld.cc:3457
#11 0x0000000000698250 in interrupt_check (caller_info=0x795690 "simworld.cc:3805") at simintr.cc:102
#12 0x00000000006ef8f6 in karte_t::new_month (this=0x157e550) at simworld.cc:3805
#13 0x00000000006f0e93 in karte_t::step (this=0x157e550) at simworld.cc:4102
#14 0x00000000006ff103 in karte_t::interactive (this=0x157e550, quit_month=2147483647) at simworld.cc:7196
#15 0x00000000006a20e7 in simu_main (argc=13, argv=0x7fffffffe538) at simmain.cc:1261
#16 0x00000000006b1cc1 in sysmain (argc=13, argv=0x7fffffffe538) at simsys.cc:703
#17 0x000000000075432e in main (argc=13, argv=0x7fffffffe538) at simsys_posix.cc:147


Preliminary conclusions so far:


       
  • the problem is associated with COLOUR_DEPTH=0, not with running as a server;
  • the problem occurs after the new month routines have run;
  • the exact manifestation of the problem is different each time; but
  • the problems are all caused by memory corruption in dingliste_t.
The conclusion to draw from that is that, in one of the new month routines (or a routine called by it), some code causes memory corruption in dingliste_t, and that code is only active when COLOR_DEPTH=0.

This is rather perplexing to me, since I have never altered code in dingliste_t and it is not an area of the code with which I am familiar. Can anyone possibly enlighten me a little as to how, in overview, the dingliste_t class works?

Edit 8: I wonder whether the problem is with the merging of this Standard commit of the 5th of October 2012 with the comment, "CHG: reduced memory consumption when compiled with COLOUR_DEPTH=0, no copies of object images kept in memory"?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

I'm trying to debug one of the crashes in dingliste_t.  I managed to get one with an ordinary graphics game, by just turning it on and running it for a long time.


#0  0x0000000000463f9a in ding_t::is_moving (this=0x0)
    at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000469401 in local_display_dinge_bg (ding=0x0, xpos=1044, ypos=700) at dataobj/dingliste.cc:1178
#2  0x0000000000468878 in dingliste_t::display_dinge_bg (this=0xfcdf670, xpos=1044, ypos=700,
    start_offset=0 '\000') at dataobj/dingliste.cc:1200
#3  0x0000000000459375 in grund_t::display_dinge_bg (this=0xfcdf668, xpos=1044, ypos=700, is_global=false,
    draw_ways=true, visible=true) at boden/grund.cc:1367
#4  0x0000000000458a4e in grund_t::display_dinge_all (this=0xfcdf688, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false) at boden/grund.cc:1292
#5  0x00000000006c03f7 in planquadrat_t::display_dinge (this=0x7fffe12301f8, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false, hmin=0 '\000', hmax=127 '\177') at simplan.cc:426
#6  0x00000000004e6bed in world_view_t::internal_draw (this=0x1448e680, offset=..., ding=0x0)
    at gui/components/gui_world_view_t.cc:170
#7  0x000000000051a6b3 in location_view_t::zeichnen (this=0x1448e680, offset=...)
    at gui/components/gui_location_view_t.h:25
#8  0x000000000051b65f in gui_container_t::zeichnen (this=0x1448e5d8, offset=...) at gui/gui_container.cc:226
#9  0x000000000051cff7 in gui_frame_t::zeichnen (this=0x1448e5d0, pos=..., gr=...) at gui/gui_frame.cc:209
#10 0x00000000006eaa9b in display_win (win=9) at simwin.cc:842
#11 0x00000000006eacb4 in display_all_win () at simwin.cc:875
#12 0x00000000006ecde0 in win_display_flush (konto=-101781.95) at simwin.cc:1431
#13 0#0  0x0000000000463f9a in ding_t::is_moving (this=0x0)
    at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000469401 in local_display_dinge_bg (ding=0x0, xpos=1044, ypos=700) at dataobj/dingliste.cc:1178
#2  0x0000000000468878 in dingliste_t::display_dinge_bg (this=0xfcdf670, xpos=1044, ypos=700,
    start_offset=0 '\000') at dataobj/dingliste.cc:1200
#3  0x0000000000459375 in grund_t::display_dinge_bg (this=0xfcdf668, xpos=1044, ypos=700, is_global=false,
    draw_ways=true, visible=true) at boden/grund.cc:1367
#4  0x0000000000458a4e in grund_t::display_dinge_all (this=0xfcdf688, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false) at boden/grund.cc:1292
#5  0x00000000006c03f7 in planquadrat_t::display_dinge (this=0x7fffe12301f8, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false, hmin=0 '\000', hmax=127 '\177') at simplan.cc:426
#6  0x00000000004e6bed in world_view_t::internal_draw (this=0x1448e680, offset=..., ding=0x0)
    at gui/components/gui_world_view_t.cc:170
#7  0x000000000051a6b3 in location_view_t::zeichnen (this=0x1448e680, offset=...)
    at gui/components/gui_location_view_t.h:25
#8  0x000000000051b65f in gui_container_t::zeichnen (this=0x1448e5d8, offset=...) at gui/gui_container.cc:226
#9  0x000000000051cff7 in gui_frame_t::zeichnen (this=0x1448e5d0, pos=..., gr=...) at gui/gui_frame.cc:209
#10 0x00000000006eaa9b in display_win (win=9) at simwin.cc:842
#11 0x00000000006eacb4 in display_all_win () at simwin.cc:875
#12 0x00000000006ecde0 in win_display_flush (konto=-101781.95) at simwin.cc:1431
#13 0x00000000006a905b in intr_refresh_display (dirty=false) at simintr.cc:75
#14 0x00000000006fe628 in karte_t::sync_step (this=0xd05c4a0, delta_t=31, sync=true, display=true)
    at simworld.cc:3502
#15 0x00000000006a91ac in interrupt_check (caller_info=0x7a72e2 "simworld.cc:7045") at simintr.cc:102
#16 0x000000000070e270 in karte_t::interactive (this=0xd05c4a0, quit_month=2147483647) at simworld.cc:7045
#17 0x00000000006b2dc8 in simu_main (argc=1, argv=0x7fffffffe1d8) at simmain.cc:1261
#18 0x00000000006c261d in sysmain (argc=1, argv=0x7fffffffe1d8) at simsys.cc:703
#19 0x00000000007640ef in main (argc=1, argv=0x7fffffffe1d8) at simsys_s.cc:720x00000000006a905b in intr_refresh_display (dirty=false) at simintr.cc:75
#14 0x00000000006fe628 in karte_t::sync_step (this=0xd05c4a0, delta_t=31, sync=true, display=true)
    at simworld.cc:3502
#15 0x00000000006a91ac in interrupt_check (caller_info=0x7a72e2 "simworld.cc:7045") at simintr.cc:102
#16 0x000000000070e270 in karte_t::interactive (this=0xd05c4a0, quit_month=2147483647) at simworld.cc:7045
#17 0x00000000006b2dc8 in simu_main (argc=1, argv=0x7fffffffe1d8) at simmain.cc:1261
#18 0x00000000006c261d in sysmain (argc=1, argv=0x7fffffffe1d8) at simsys.cc:703
#19 0x00000000007640ef in main (argc=1, argv=0x7fffffffe1d8) at simsys_s.cc:720


Poking around, all the code is doing exactly what it's supposed to, except that there is a null pointer lurking in the list of dings.  This is also the cause of the previous crash in ::intern_add_moving -- it's running through objects and finding a crashin

I did three things on ncn-devel
(1) I turned on assertions in the debug build.  I realize this will make things slower, but it needs to be done, and frankly it needs to be done routinely.  If we have particular assertions which are too slow, we can remove them from the code.
(2) I added an assertion during dingliste_t:add, asserting that it is not being asked to add a null pointer. This is the only way to figure out how a null pointer is getting into the dingliste_t. 
(3) I added a check for what is apparently a missing case.  It doesn't seem to be the case causing the problem, though.

Run with assertions on for a while and get the crash again.  Most likely  you'll get a crash based on the assertion -- in which case we can figure out who is trying to add a null pointer.  If you get the same crash as before, then we have to dig further into the dingliste_t code to figure out how it's getting a null pointer when nobody is trying to add one.

jamespetts

#50
Thank you very much for this - this is very helpful. It is interesting that you have been able to reproduce this with a graphics build. I have pulled and compiled your code on the Linux server (albeit running it as a non-server build for testing purposes), and I am waiting to see whether it will still crash, and, if so, with what backtrace.

Turning assertions on in debug builds seems sensible enough; but when running the server with an actual saved game, we need all the speed that we can get, and should turn off debugging in this situation.

Edit 1: My new test with your new version also crashed with the following backtrace:


Program received signal SIGSEGV, Segmentation fault.
0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
188             inline bool is_moving() const { return flags&is_vehicle; }
(gdb) backtrace
#0  0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000464c64 in dingliste_t::intern_add_moving (this=0x171d8460, ding=0x2b9db650) at dataobj/dingliste.cc:289
#2  0x0000000000465330 in dingliste_t::add (this=0x171d8460, ding=0x2b9db650) at dataobj/dingliste.cc:488
#3  0x0000000000408a93 in grund_t::obj_add (this=0x171d8458, obj=0x2b9db650) at bauer/../boden/grund.h:544
#4  0x00000000007117c1 in vehikel_basis_t::betrete_feld (this=0x2b9db650) at vehicle/simvehikel.cc:310
#5  0x000000000071596f in vehikel_t::betrete_feld (this=0x2b9db650) at vehicle/simvehikel.cc:1412
#6  0x00000000007220f3 in waggon_t::betrete_feld (this=0x2b9db650) at vehicle/simvehikel.cc:4285
#7  0x0000000000715dd1 in vehikel_t::hop (this=0x2b9db650) at vehicle/simvehikel.cc:1457
#8  0x0000000000711933 in vehikel_basis_t::fahre_basis (this=0x2b9db650, distance=3497984) at vehicle/simvehikel.cc:355
#9  0x0000000000650a4b in convoi_t::sync_step (this=0x2b9d9230, delta_t=4996) at simconvoi.cc:1116
#10 0x00000000006ee019 in karte_t::sync_step (this=0x157e550, delta_t=4996, sync=true, display=true) at simworld.cc:3457
#11 0x0000000000698250 in interrupt_check (caller_info=0x7956b0 "simworld.cc:3805") at simintr.cc:102
#12 0x00000000006ef8f6 in karte_t::new_month (this=0x157e550) at simworld.cc:3805
#13 0x00000000006f0e93 in karte_t::step (this=0x157e550) at simworld.cc:4102
#14 0x00000000006ff103 in karte_t::interactive (this=0x157e550, quit_month=2147483647) at simworld.cc:7196
#15 0x00000000006a20e7 in simu_main (argc=13, argv=0x7fffffffe538) at simmain.cc:1261
#16 0x00000000006b1cc1 in sysmain (argc=13, argv=0x7fffffffe538) at simsys.cc:703
#17 0x0000000000754342 in main (argc=13, argv=0x7fffffffe538) at simsys_posix.cc:147
(gdb)
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

#51
Excellent.  This is an extremely helpful backtrace.  The problem I've been having is that I've been catching the bug after the null pointer has been inserted into the dingliste_t.  You managed to catch it *during* the add process; I think I may be able to fix it now.
---

Spoke too soon.  Nope, the null pointer is already in there when your crash happens.  ****.

I can't figure out how the null pointers are getting in.  Nobody is calling add() with a null pointer, which means that there's a logic error in the data structure handling.

I tried completely rewriting it, but I haven't quite gotten that working yet.

neroden

#52
Wait wait wait.  Did someone try to make simutrans multithreaded?

If so, that's the problem.  Everything in dingliste.cc has to be protected by synchronization locks of some sort.

If simutrans is being compiled multithreaded by default, turn that off and see what the results are.  Just an idea...
---
Sigh.  Sadly, no.   I just checked and threading is off in my test executable.   

Having been staring at this file for a week, I think I see how to convert dingliste_t to handle multiple rotations simultaneously :-) but first I have to figure out how to fix the basic memory management code!

prissi

The backtrace almost rules out that someone called interrupt check during a sync_step. (In such a case one routine may have stale elements like the max still in its structures.) This leaves an error by wrong memory allocation using freelist as most likeliest candidate. Smaller dinglist (up to 16 items) are allocated from freelist. If an item is accidently freed (or gotten from the wrong list) it will be overwritten. Same obviously if somethign writes outidie its boundaries. You can remove the handling of dinglist be freelist by changing


static void dl_free(ding_t** p, uint8 size)
{
assert(size > 1);
size = 0;
guarded_free(p);
}


static ding_t** dl_alloc(uint8 size)
{
assert(size > 1);
ding_t** p;
p = MALLOCN(ding_t*, size);
return p;
}


Apart from slowing down the game, this will only fix symptoms; it rather points to an error in your dingliste handling (or another object with similar size). It could by also a problem with another object which memory was obtained by freelist and then wrote outside its borders.

To actually debug this by valgrind, you need to define USE_VALGRIND_MEMCHECK

jamespetts

Thank you both for your assistance with this - that is most helpful. Simutrans is indeed now multi-threaded, and has been for a while, but my server version is built without MULTI_THREAD defined (as my (virtual) server has only one allocated core), so the version that is crashing for me should be a single threaded compile.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

#55
I think it's got to be an error somewhere in the insertion/removal routines in dingliste_t.  I tried rewriting them entirely along somewhat different and cleaner (and hopefully slightly faster) lines, but I got a very weird crash, so I'm going to have to go back and figure out what was going on.

Quote from: prissi on June 18, 2013, 08:57:11 PM
it rather points to an error in your dingliste handling (or another object with similar size). It could by also a problem with another object which memory was obtained by freelist and then wrote outside its borders.
I've been debugging these crashes.  The lists appear to be *mostly* valid pointers to ding_t.  They just have null pointer holes in the middle.  The crashing instances have capacity 16, "top" somewhere around 14 or 11, and the "null pointer hole" a few entries back from that, around 11 or 8.  The early entries appear to be proper ding_t pointers, which makes it seem unlikely that the memory has been overwritten by something writing outside its borders. 

I suppose it's possible that they were overwritten by something which was writing lists of ding_t pointers.... but there aren't that many things using the freelist, and most of them aren't lists of ding_t pointers.

This feels like improper list management to me.  Specifically, it feels like an insertion error, where something is inserted in the wrong slot when the list is expanded, so that the new item overwrites an old item (which gets lost) rather than being appended. 

But I'm having trouble finding such an error in the code.  Even though the code is different between standard and experimental, I have not been able to spot any logic errors. 

The basic logic for the type is baroque and complicated, so there could be something I missed.  It's complicated enough that I tried rewriting it, to use "top" for the count at all times, and to use "capacity" strictly for the capacity, not for the count.  But as I said I've had some very mysterious trouble with that.  I'll have to try it again and see if I can spot what's going on....

----

Okay.  New day.  I still don't know what was going wrong with the original implementation, but I do know what went wrong with my revised implementation.

The revised implementation is on my "dinglist-cleanup" branch.  James, I would appreciate it if you would test it to see if I've finally eliminated the error.  (Don't merge it yet, though!)

If I *have* eliminated the error, I'll make an adaptation of the patch for standard.  I believe that it makes the code significantly more readable, and therefore less prone to future fencepost (off-by-one) errors.

If I haven't... well, at least I have more readable code to debug.  :-)

I believe that the revised implementation should be marginally faster as well as significantly cleaner.  Probably not measurably faster, but it does create what I think is a highly optimizable logic based entirely in comparisons against zero. I have some further ideas for improvement but let's see if I got the bugs out first.

prissi

Honestly, I would be very surprised if is an error in standard. dingliste is used a zillion times on the servers. If there would be buggy code there, any large game would trigger it quickly. However the standard servers usually run for weeks without crashing via dingliste.

But one never knows.

jamespetts

Thank you both for your comments on this and Nathaneal for all the work on this - it is most appreciated. I have not had chance to test this yet with my server, but hope to do so soon. If this fix does not solve the problem, the next place to look, I think, is that set of various new month (either new_month() or neuer_monat()) methods, as the crash always occurs almost immediately after the beginning of a new month; disabling them one by one and checking whether the crash recurs should help to narrow down the cause.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Ok, I stand corrected: There was the one off bug fixed in bei(): Albeit this was never triggered, as this is called from for loop limited by top ...

But still I do not see how to get a NULL pointer into the list. It still look to me as there is someting writing out of its bound and hitting a dinglist. You can easily check this by only using malloc.

neroden

#59
Quote from: prissi on June 19, 2013, 12:57:47 PM
Ok, I stand corrected: There was the one off bug fixed in bei(): Albeit this was never triggered, as this is called from for loop limited by top ...
(Edit:) One of the reasons I was willing to suspect an error in dingliste.cc is that it had diverged somewhat between experimental and standard.  Not in big ways, but the code was so confusing that I figured it might have generated an off-by-one. 

There *was* a bug in the deletion of "dying trees", which do not get removed from the list and leave bad pointers in the list at the end of the month...

Anyway, my revision is still cleaner architecturally and quite likely to be faster, depending on how inlining works out.

QuoteBut still I do not see how to get a NULL pointer into the list. It still look to me as there is someting writing out of its bound and hitting a dinglist. You can easily check this by only using malloc.

I wonder if there's something writing out of bounds BACKWARDS?  That would explain the behavior I'm seeing. 

The 0 end of the array is fine, but the "top" end of the array is broken.  So I'm looking for a loop which writes into an array and counts DOWN using memory from the freelist.

---
(Edit:) Actually, come to think of it, I'm only seeing problems with lists of size 16 specifically (!!!), which is a memory unit of 128 bytes (since I'm running 64-bit).  Since the freelist uses pools of memory where all the allocations of size 16 come from the same pool, I'm looking specifically for something allocated out of the 128 byte pool.

::gimme_node is called in a very short list of places in experimental.

       
  • operator new in grund_t -- sizeof(grund_t) is 32 and the same for all subclasses
  • dl_alloc in dingliste_t (which is what we've been debugging)
  • dings/groundobj.cc -- called with sizeof(groundobj_t) which is 32
  • dings/baum.cc -- called with sizeof(baum_t) which is 32
  • player/simplay.cc -- called with sizeof(spieler_t::income_message_t) which is 56
  • simhalt.cc -- called with sizeof(haltestelle_t::connexion) -- this is a struct with six elements: four uint16s and two quickstones.  My debugger won't give me the size, but it should be 48 in the least-packed case, more likely 12
  • tpl/slist_tpl.h -- probably the worst use to track down; this is used for node_t, so it's one pointer (8) plus the data type.
  • To hit 128, given the automatic four-byte alignment, this would require a datatype of a very specific and unlikely size.  It's also quite unlikely that anyone would scribble backwards across a node and we'd probably see very bad results from the corrupted linked lists.
  • vehicle/movingobj.cc -- called with sizeof(movingobj_t) which is 72
  • vehicle/simverkehr.cc -- called with sizeof(stadtauto_t) which is 128
This says to me we have to look at backwards loops in the city car code, which certainly does differ between experimental and standard.  James?

neroden

Okay!

So I removed the custom allocators from stadtauto_t, so that it allocates the "normal" way.

This triggered an exciting new crash.  A double free.

*** Error in `/home/neroden/simutrans-ex-new-alternate/sim-carsout': free(): invalid pointer: 0x00000000149b13c0 ***


#0  0x00007ffff6933295 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6936438 in __GI_abort () at abort.c:90
#2  0x00007ffff696e86b in __libc_message (do_abort=do_abort@entry=2,
    fmt=fmt@entry=0x7ffff6a6ad28 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:199
#3  0x00007ffff6978ac6 in malloc_printerr (action=3, str=0x7ffff6a66cca "free(): invalid pointer",
    ptr=<optimized out>) at malloc.c:4902
#4  0x00007ffff6979843 in _int_free (av=<optimized out>, p=0x149b13b0, have_lock=0) at malloc.c:3758
#5  0x000000000064f43c in sparse_tpl<unsigned char>::resize_data (this=0xde44df0, new_size=0)
    at tpl/sparse_tpl.h:199
#6  0x000000000064bd16 in sparse_tpl<unsigned char>::clear (this=0xde44df0) at tpl/sparse_tpl.h:60
#7  0x000000000063cdfd in stadt_t::neuer_monat (this=0xde44da0, check=false) at simcity.cc:2526
#8  0x0000000000703d77 in karte_t::new_month (this=0xde78f0) at simworld.cc:3755
#9  0x000000000070565a in karte_t::step (this=0xde78f0) at simworld.cc:4110
#10 0x000000000071341f in karte_t::interactive (this=0xde78f0, quit_month=2147483647) at simworld.cc:7203
#11 0x00000000006b6cb6 in simu_main (argc=1, argv=0x7fffffffe1a8) at simmain.cc:1261
#12 0x00000000006c6805 in sysmain (argc=1, argv=0x7fffffffe1a8) at simsys.cc:703
#13 0x000000000076905b in main (argc=1, argv=0x7fffffffe1a8) at simsys_s.cc:720


I'm not entirely sure how to interpret this, but I think it supports my theory that there's a failure of memory management for stadtauto_t objects.

jamespetts

Ahh, thank you! I had started to wonder whether the cars might be an issue, since that is the only area of the code that is exclusive to Experimental that causes objects to be removed from the world at the end of the month. Can you put your testing code for using normal allocation for city cars on a separate branch for me to test?

Incidentally, I cannot compile the Linux version of your dingliste-cleanup branch: I get:


build/default/gui/components/gui_world_view_t.o: In function `world_view_t::internal_draw(koord, ding_t const*)':
/usr/share/games/simutrans-experimental/gui/components/gui_world_view_t.cc:61: undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
build/default/simgraph0.o:(.data+0x28): undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
collect2: error: ld returned 1 exit status
make: *** [simutrans/sim] Error 1
root@438242:/usr/share/games/simutrans-experimental#


Also, I am still getting the graphical problems with the ticker.

Edit: Compiling with COLOUR_DEPTH=0 also fails in Windows/MSVC++:


1>simgraph0.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2019: unresolved external symbol "void __cdecl display_rezoomed_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_rezoomed_img_alpha@@YAXIIIFFCGHH@Z) referenced in function "protected: void __thiscall world_view_t::internal_draw(class koord,class ding_t const *)" (?internal_draw@world_view_t@@IAEXVkoord@@PBVding_t@@@Z)
1>.\simutrans\Simutrans-Experimental-debug.exe : fatal error LNK1120: 2 unresolved externals
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have had a go at working out what the error might be, and have pushed a possible (but as yet untested) fix to my dingliste-cleanup branch. I cannot test this yet because of the above compile errors - could you look into fixing those? I am not quite sure how they come about except that it seems that the unresolved symbol relates to something ultimately found in simgraph16.cc but not simgraph0.cc - it is not clear why your dingliste changes should have caused this, though.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

Quote from: jamespetts on June 19, 2013, 08:56:09 PM
Ahh, thank you! I had started to wonder whether the cars might be an issue, since that is the only area of the code that is exclusive to Experimental that causes objects to be removed from the world at the end of the month. Can you put your testing code for using normal allocation for city cars on a separate branch for me to test?
Oh, that was easy -- I just commented out the "operator new" and "operator delete" for stadtauto_t, in both the header file and the .cc file.  I didn't bother to make a branch for that.

Quote
Incidentally, I cannot compile the Linux version of your dingliste-cleanup branch: I get:


build/default/gui/components/gui_world_view_t.o: In function `world_view_t::internal_draw(koord, ding_t const*)':
/usr/share/games/simutrans-experimental/gui/components/gui_world_view_t.cc:61: undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
build/default/simgraph0.o:(.data+0x28): undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
collect2: error: ld returned 1 exit status
make: *** [simutrans/sim] Error 1
root@438242:/usr/share/games/simutrans-experimental#


Also, I am still getting the graphical problems with the ticker.

Edit: Compiling with COLOUR_DEPTH=0 also fails in Windows/MSVC++:


1>simgraph0.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2019: unresolved external symbol "void __cdecl display_rezoomed_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_rezoomed_img_alpha@@YAXIIIFFCGHH@Z) referenced in function "protected: void __thiscall world_view_t::internal_draw(class koord,class ding_t const *)" (?internal_draw@world_view_t@@IAEXVkoord@@PBVding_t@@@Z)
1>.\simutrans\Simutrans-Experimental-debug.exe : fatal error LNK1120: 2 unresolved externals

I'll go back and fix that.  Currently, however, I'm working on fixing the zero mail revenue, and uncovering all manner of lurking bugs.

jamespetts

#64
Thank you for fixing this error: it now compiles, but on Windows I still get an access violation in line 147 of image_reader.cc for some odd reason.

Edit: Having somewhat odd problems testing the latest dingliste-cleanup branch on the Linux server: running it in non-server mode, it appears to freeze without crashing: it stops generating debug messages shortly after loading. Because it is not running as a server, there is no definitive way to test whether it is working or not. I do not know how to debug an infinite loop with gdb, if that is what the problem is. I tried running it as a server with gdb, and that seemed to work for a while, but it desynced after about one game hour: trying to connect again ended up resulting in the thing terminating with a SIGPIPE error (perhaps due to gdb making the server extremely slow?). I have yet to try a server without gdb, which might be the next step.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

Running in gdb should not slow-down much, as long as your are not using any conditional breakpoints. Pressing ctrl+c in the terminal will break the program and you can analyze the backtrace.
Parsley, sage, rosemary, and maggikraut.

jamespetts

#66
Hmm - if running in gdb does not slow things down much, I do wonder why I get a SIGPIPE. Hmm.

Edit: I can confirm that running this, even without GDB, I seem to get some sort of infinite loop after around 5 game hours on the upload-refresh.sve, which results in a desync for the client (which does not get the infinite loop) and an eventual SIGPIPE error for the server (which I am guessing is caused by a timeout in consequence of the infinite loop?).

Edit 2: Further testing seems to show that the original problem is not solved: I still get a crash at the month boundary. Have not had chance to run it with gdb yet to get a backtrace.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

neroden

A few things:
(1) I don't fully understand what's going on here, but I think there are some serious problems with dingliste_t management even in standard. 

In short, objects are deleted (in simwerkz.cc, in wkz_remover_t::wkz_remover_intern) but left in the list, so there are going to be bad references in the list. 

Tracing through the code, I don't see any call to loesche_alle to clean the list up.  Maybe I'm missing something.

This shouldn't be creating the problems which we're currently seeing, but I don't understand the memory management here, and I don't like that.

(2) The primary thing going wrong is probably in the citycar code, which appears to be writing to released memory (memory which has been "delete"d), and I haven't had a chance to even look into that code.  It seems to be quite complicated, and tracking down the memory management is going to be serious work.  Perhaps you can find someone else to do it...

(3) The imagereader.cc crash on line 147 is more understandable.  The command line server relies on a really awful hack when loading the image data, to avoid actually loading the image data, and I think the hack isn't fully implemented.  This is a bug present in standard.  It would be highly desirable to implement this more cleanly (in standard), but it will take quite a lot of work.

(4) I've been spending my time doing some more substantial cleanups of the revenue code in hopes of not only fixing the mail revenue bug, but of making it much more maintainable and comprehensible.  (This is over on my "mail-revenue" branch.  Don't merge it yet!)  This has led to some substantial work which I will mention elsewhere.

prissi

It could very weel be that some cleanup code is called during deletion, which passed the this pointer to other code (which is not valid at this point any more). Different compiler can be more forgiving on this. Any cleanup hence is in entferne().

All things are removed from the dinglist on deletion anyway, unless they have the "not_on_map" flag. If experimetnal did not touch this, then there are no remains. But should be very easy to add debug code for this; it will just slow down world deletion a little more.

neroden

Quote from: prissi on June 22, 2013, 08:53:11 PM
It could very weel be that some cleanup code is called during deletion, which passed the this pointer to other code (which is not valid at this point any more).
This does not seem to happen, at least not for fussgaenger_t (which is the first one I checked, out of several).
QuoteDifferent compiler can be more forgiving on this. Any cleanup hence is in entferne().
Which does not get called.

QuoteAll things are removed from the dinglist on deletion anyway,
On deletion of *what*?  Sure, on deletion of the grund_t object, but that's not happening.

"grund_t::obj_remove" needs to be called to delete something from the dinglist. 

Instead, raw "delete" is being called.  This happens several times in wkz_remover_t::wkz_remover_intern. 

In some other places ::obj_remove *is* called.

I am pretty close to being convinced that this is simply a programming error.  I will write a patch.