The International Simutrans Forum

 

Author Topic: Command line server issues  (Read 16269 times)

0 Members and 1 Guest are viewing this topic.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Command line server issues
« on: May 26, 2013, 10:50:44 PM »
In trying to test the server running the latest release candidate, I have found that the command line server (the version with no graphics) seems to have stopped working: although compiling it with the usual settings produces a valid executable, it fails on startup both on my Linux server and on Windows; the cause of the failure on Windows is not clear, but the Linux version complains at not being able to find ground.outside.pak and not being able to call a modal dialogue (which error is, from what I recall, the expected error if one starts a normal version with full graphics in a non-graphical environment). The Windows version just starts up and shuts down after a few seconds. When run as a server in the normal, full graphics mode (in Windows), it works fine and does not desync.

I am not entirely sure where to look to try to find this problem: I should be very grateful for any suggestions from anyone who has been working on the code recently, or might have an idea as to what could cause this sort of issue.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #1 on: May 27, 2013, 12:02:07 PM »
I have found that the crash occurs at line 149 of image_reader.cc. I wonder whether this could be anything to do with the integration of the world edge changes, as I have not touched the graphics code myself; does anyone know whether these changes affect image_reader.cc directly or indirectly?

Edit: Further tests show that it cannot be the map edge changes merged by Neroden, as the same error occurs in 10.9004. I wonder whether the issue might be related to this commit, which has been merged in since 10.27? Do any of the Standard developers have any idea what might cause an access violation in this code:

Quote
            // skip rest of the line
            do {
                dest++;
                dest += *dest + 1;
            } while (*dest);
            dest++;    // skip trailing zero

Edit 2: Further enquiry has shown that reverting this commit does not fix the problem, but rather moves the access violation to a different place:

Quote
#ifdef MULTI_THREAD
    // now the thread is finished ...
   EnterCriticalSection( &redraw_underway );
#endif

Does anyone have any clues about this one?

Edit 3: Disabling multi-threading, in Windows at least, by undefining MULTI_THREAD seems to fix the crash (setting MULTI_THREAD=0 has no effect), at least with the abovementioned commit reverted.

Edit 4: De-reverting the commit, whilst undefining MULTI_THREAD, however, does not work: the original crash still appears. There seem to be two separate, unrelated bugs here.

Edit 5: I notice in any event that my Linux build did not have MULTI_THREAD defined.

Edit 6: The debug output that I get from trying to run the command line server on Linux is:

Code: [Select]
...[lots of similar messages about vehicles' hashes]
Message: pakset_info_t::debug:  WagonLong -> sha1 = 964E308B68717BB634065A89659AF6548BB91889
Reading menu configuration ...
Warning: werkzeug_t::read_menu():       toolbar[11][5]: replaced way-builder(id=14) with default param=cityroad by cityroad builder(id=36)
Warning: ask_language:  No language selected, will use english!
loading savegame "upload-refresh.sve"
Midi disabled ...
Warning: karte_t::laden:        disconnecting all clients
Warning: nwc_routesearch_t::reset:      all static variables are reset
Warning: karte_t::laden():      Cannot load saved game!
Warning: nwc_routesearch_t::reset:      all static variables are reset
Message: grund_besch_t::calc_water_level():     Last image nr 1
Calculating textures ...done
Message: nwc_auth_player_t::init_player_lock_server:    new = 32767
Warning: karte_t::create_rivers():      Too many rivers requested! (only 0 rivers placed)
Creating cities ...
Creating cities: 1
Creating factories ...
Distributing 1 tourist attractions ...
Preparing startup ...
Warning: fabrikbauer_t::increase_industry_density():    No suitable city industry found => pak missing something?
Message: network_command_t::rdwr:       write packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr:      rdwr id=8 client=0 plnr=255 pos=koord3d invalid wkzid=8224 defpar=Now 0 clients connected. init=1 flags=0
ERROR: modal_dialogue():        called without a display driver => nothing will be shown!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
World destroyed.
Warning: nwc_routesearch_t::reset:      all static variables are reset

The actual error appears to be "ERROR: modal_dialogue():        called without a display driver => nothing will be shown!". It is not clear why this is being displayed, as COLOUR_DEPTH=0 is defined.
« Last Edit: May 28, 2013, 01:31:16 AM by jamespetts »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #2 on: May 28, 2013, 08:51:11 PM »
About the linux server:

Warning: karte_t::laden:        disconnecting all clients
   -> tried starting loading a map
Warning: karte_t::laden():      Cannot load saved game!
  -> could not load (maybe the cityroad error?)

Not map loaded => create default world (next lines) and then show the banner window (which obviously could not be shown on the server). Hence the server works fine, but the savegame is broken.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #3 on: May 29, 2013, 12:03:25 AM »
Hmm - that's odd: the saved game works on my local copy. What is the highest debug level? Might a level higher than 3 give me more of a clue what is happening here?

Thank you very much for your reply.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #4 on: May 29, 2013, 02:13:18 AM »
Extremely high debug levels generate some very frequent, very early output which prevents the game from really running, unfortunately; I found this out a while back.  You'll have to debug the debug code in order to use the high debug levels.

I just tested the command-line version of the server (from my 11x-fixes branch) on my savegame -- it works.

Your savegame is broken in some interesting way.  Have you tried loading it and re-saving it?

(The 112.x branch is probably broken because I know I screwed something up related to savegames in the accounting merger.)

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #5 on: May 31, 2013, 03:59:50 PM »
Your savegame is broken in some interesting way.  Have you tried loading it and re-saving it?

It is possible that it is broken for the same reason mine was broken.  (The "partial implementation" of the savefile format change.)  Have you tried applying the same fix?

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #6 on: May 31, 2013, 05:05:58 PM »
Neroden,

thank you for testing this. Apologies for not having relied to this and other messages earlier: have been (and am still) away from home, so do not have full access to my development environment, although can log into my Linux server, as am staying with Parents, and have a Linux terminal set up here.

Starting with your most recent hypothesis, I tested this by running the original saved game from 10.27: if there was an issue with partial implementation of a new file format, this game would load without difficulty. This did not assist. The output that I got from that was:

Code: [Select]
[Lots of stuff about road rules, not reproduced here]
Reading speedbonus configuration ...
ReadiReading speedbonus configuration ...
Reading private car ownership configuration ...
Reading electricity consumption configuration ...
Reading menu configuration ...
Reading object data from Pak128.Britain-Ex-0.8.4/...
Reading menu configuration ...
loading savegame "upload-refresh.sve"
Midi disabled ...
Calculating textures ...done
Creating cities ...
Creating cities: 1
Creating factories ...
Distributing 1 tourist attractions ...
Preparing startup ...
World destroyed.
root@438242:/usr/share/games/simutrans-experimental/simutrans# ./simutrans-experimental -server -objects Pak128.Britain-Ex-0.8.4 -load pre-upgrade.sve -log 1 - debug 4 > test-problem.log
Simutrans version 112.2 Experimental Release Candidate 11.9006 from May 26 2013 r119006
root@438242:/usr/share/games/simutrans-experimental/simutrans# psng private car ownership configuration ...
Reading electricity consumption configuration ...
Reading menu configuration ...
Reading object data from Pak128.Britain-Ex-0.8.4/...
Reading menu configuration ...
loading savegame "upload-refresh.sve"
Midi disabled ...
Calculating textures ...done
Creating cities ...
Creating cities: 1
Creating factories ...
Distributing 1 tourist attractions ...
Preparing startup ...
World destroyed.

The same output was generated with the refreshed saved game saved in the new version. The intermediate save compatibility issue was not likely to be the issue in any event, as the game that I have uploaded to the server is a copy of a saved game that I can open without problems on my machine at home. Further, an archived 10.27 build can run the pre-upgrade game that I tested above without trouble.

I am wondering whether the problem might in part be some changes to the makefile that have occurred since 10.27: I notice that there must have been changes, as the default executable name has reverted to "sim" from the previous default of "simutrans-experimental".

I wonder whether you could upload your binary (if it is a 64-bit binary) to http://files.simutrans-germany.com so that I can test whether the problem is in my Linux build environment somewhere? That would at least go some way towards narrowing the problem.

Incidentally, what saved game do you use, and what command do you use to start the server?

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #7 on: May 31, 2013, 09:59:20 PM »
It is not loading your savegame, it is instead generating the default island at teh start. Is the savegame in the rigth folder  ~/simutrans/save/ ?

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #8 on: May 31, 2013, 11:10:26 PM »
Ahh, no, it's not in ~/simutrans/save - it's just in ~/simutrans. Is the addition of /save in Linux fairly recent (after about February 2012)? 10.27 (based on Standard from February 2012) works with the saved games in ~/simutrans.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #9 on: June 01, 2013, 07:02:55 AM »
I have copied my executable into a working directory under a name based on the build date (so that I can compare several).

I use the command
Code: [Select]
./sim-20130601 -objects ./pak-20130601 -server 1234 -load test.sve

Yes, I think the relocation of save files to the "save" subdirectory is new since February.  In fact, I found that test.sve had to be in the directory:
Code: [Select]
~/.simutrans-ex/save/
Even if I had a different working directory for simutrans.  Which I do (I keep the executable and pak in ~/simutrans-ex-new ).

This may be what is giving you problems.

I think the directory search code should probably be cleaned up, but I haven't looked at it.
« Last Edit: June 01, 2013, 05:27:12 PM by neroden »

Offline Vonjo

  • *
  • Posts: 273
Re: Command line server issues
« Reply #10 on: June 01, 2013, 10:58:56 AM »
It looks like you want to load "pre-upgrade.sve", but simutrans-experimental tried to load "upload-refresh.sve".

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #11 on: June 01, 2013, 01:35:44 PM »
Hmm - I do not think that the saved game directory issue is the ultimate problem - I get the same problem:

Quote
Reading speedbonus configuration ...
Reading private car ownership configuration ...
Reading electricity consumption configuration ...
Reading menu configuration ...
Reading object data from ./Pak128.Britain-Ex-0.8.4/...
Reading menu configuration ...
loading savegame "upload-refresh.sve"
Midi disabled ...
Calculating textures ...done
Creating cities ...
Creating cities: 1
Creating factories ...
Distributing 1 tourist attractions ...
Preparing startup ...
ERROR: modal_dialogue():   called without a display driver => nothing will be shown!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
World destroyed.

when I expressly try to call a specific saved game from the command line using "-load upload-refresh.sve", and even copying upload-refresh.sve into ../simutrans-ex/save did not assist.

Edit: Having compiled a debug version and using debug level 4, this is the output that I get (warning: large file).

Edit 2: The very last part of that large file (for those who do not want to download the whole thing) reads as follows:

Code: [Select]
Message: network_http_post: received header: HTTP/1.1 202 Accepted
Message: network_http_post: received header: Content-Type: text/html
Message: network_http_post: received header: Date: Sat, 01 Jun 2013 13:37:39 GMT
Message: network_http_post: received header: Connection: keep-alive
Message: network_http_post: received header: Transfer-Encoding: chunked
Message: network_http_post: all headers received, message body follows
Message: network_http_post: 5
Message: network_command_t::rdwr: write packet_id=8, client_id=0
Warning: nwc_tool_t::rdwr: rdwr id=8 client=0 plnr=255 pos=koord3d invalid wkzid=8224 defpar=Now 0 clients connected. init=1 flags=0
Debug: karte_t::step: end
Debug: karte_t::step: start step
Debug: karte_t::step: time calculations
Debug: interrupt_check: called from (simworld.cc:4150)
Debug: karte_t::step: pending_season_change. 4096 tiles.
Debug: interrupt_check: called from (simworld.cc:4161)
Debug: interrupt_check: called from (simworld.cc:4161)
Debug: interrupt_check: called from (simworld.cc:4161)
Debug: interrupt_check: called from (simworld.cc:4161)
Debug: interrupt_check: called from (simworld.cc:4172)
Debug: interrupt_check: called from (simworld.cc:4176)
Debug: karte_t::step 4: step 0 convois
Debug: karte_t::step 6: step cities
Debug: karte_t::step: step factories
Debug: karte_t::step: step poweline stuff
Debug: karte_t::step: step players
Debug: karte_t::step: step halts
Debug: interrupt_check: called from (simworld.cc:4243)
Debug: karte_t::step: end
Message: karte_t::reset_timer(): called, mode=$1
ERROR: modal_dialogue(): called without a display driver => nothing will be shown!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Message: karte_t::destroy(): destroying world
Message: karte_t::destroy(): label clear
Message: karte_t::destroy(): convois destroyed
Message: karte_t::destroy(): stops destroyed
Message: karte_t::rem_stadt(): Coatwater
Debug: karte_t::rem_stadt(): reduce city to 0
Debug: karte_t::rem_stadt(): fab_list 4
Message: karte_t::rem_stadt(): delete
Message: karte_t::destroy(): towns destroyed
Message: karte_t::destroy(): sync list cleared
Message: karte_t::destroy(): planquadrat destroyed
Message: karte_t::destroy(): marker destroyed
Message: karte_t::destroy(): player destroyed
Message: karte_t::destroy(): factories destroyed
Message: karte_t::destroy(): attraction list destroyed
Message: karte_t::destroy(): world destroyed
World destroyed.
Warning: nwc_routesearch_t::reset: all static variables are reset
« Last Edit: June 01, 2013, 01:44:57 PM by jamespetts »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #12 on: June 01, 2013, 11:06:54 PM »
I still see
Creating cities: 1
Creating factories ...
Distributing 1 tourist attractions

That means the standard island is creating. i.e. your map is not loading. or generates an error during loading.

The save directory was needed since version 0.78 or so, even before my time as programmer. Just the path ~/simutrans/save/ for standard is newer, albeit not that new. Once sucessfully loaded, you should see a server13353-netowrk.sve as soon as somebody joined in the simutrans directory. (If simutrans experimental uses another directory, then obviously you need to use this.)

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #13 on: June 02, 2013, 11:30:37 AM »
Am I correct, though, that server13353-network.sve does not load from the /save directory but from the same directory as the executable, and also that an explicit command "load -some-saved-game.sve" can specify the directory in which it is to be found?

I have been specifying -load ./upload-refresh.sve (for example) and still getting this error. It is most odd.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1434
Re: Command line server issues
« Reply #14 on: June 02, 2013, 04:08:39 PM »
No. You cannot change the save root like that. You may specify sub directories below save/ however.
-load aa/upload-refresh.sve   will load ~/simutrans/save/aa/up-refresh.sve (or ./simutrans/save/aa/up-refresh.sve in single_user mode)

The server13353-network ( and -client) saves do save outside save/ as special cases, but you must put them into save/ if you wish to load them manually.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #15 on: June 02, 2013, 04:54:42 PM »
If that is so, my tests are not working, and there is an unidentified problem with the server, as it ought to be loading server13353.network.sve in any event (and, if I remember correctly, by default if I specify -server at the command line without also specifying -load?).

Edit: Also, this doesn't explain the problem, does it? Even if it creates the little test island, why should it have to display a modal dialogue?
« Last Edit: June 02, 2013, 06:25:42 PM by jamespetts »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #16 on: June 02, 2013, 08:24:32 PM »
It may find an old version of the setting. Did you try to add "-lang en" on the commandline to avoid showing the language dialogue?

Well my server installation look like this:

~/src/simutrans/trunk/... <- here I compile etc.
~/simserv/ <- here are config dir, pak dir, executable ...
~/simutrans/save <- here goes the intial savegames
~/simutrans/pak/config/simuconf.tab <- here I put the savegame specific server settings
~/simutrans/ <- here the server123-network.sve, pashword hashes and logfiles will be created

The server is initialized by:
./simstable -server 65124 -log 1 -debug 3 -lang en -objects pak/ -nosound -nomidi -server_dns xyz.co.uk -load netzwerk7.sve -server_name "pak64 112.3 I" -server_admin_pw XXX &

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Curiouser and curiouser
« Reply #17 on: June 02, 2013, 09:05:35 PM »
Thank you for your help.

I had not previously tried "-lang en", but I have tried it, and sadly get the same result. The server was previously working with an automatic startup script when it was running 10.27, which does not work now, so the settings that I use cannot be the ultimate problem unless something has changed in how those settings are interpreted.

Incidentally, since I am staying with my parents and have access to a Linux desktop, I tried compiling it there; I had the same result as for the server running the no graphics version, and additionally, the program will crash when compiled in the normal graphics mode, with this backtrace:

Code: [Select]
Reading speedbonus configuration ...
Reading private car ownership configuration ...
Reading electricity consumption configuration ...
Reading menu configuration ...
Reading object data from /home/james/simutrans/Pak128.Britain-Ex-0.8.4/...

Program received signal SIGSEGV, Segmentation fault.
0x082902bc in display_set_base_image_offset (bild=1, xoff=11768, yoff=-16604) at simgraph16.cc:2005
2005        assert(images[bild].base_w > 0);
(gdb) backtrace
#0  0x082902bc in display_set_base_image_offset (bild=1, xoff=11768, yoff=-16604) at simgraph16.cc:2005
#1  0x08290aac in display_img_aux (n=0, xp=351, yp=139, use_player=0 '\000', dirty=1) at simgraph16.cc:2339
#2  0x08290d5d in display_color_img (n=0, xp=351, yp=139, player_nr=0 '\000', daynight=0, dirty=1) at simgraph16.cc:2413
#3  0x082aed7f in loadingscreen_t::display_logo (this=0xbfffcf70) at simloadingscreen.cc:54
#4  0x082aec6d in loadingscreen_t::loadingscreen_t (this=0xbfffcf70, w=0x8384373 "Loading paks ...", max_p=28, logo=true,
    continueflag=false) at simloadingscreen.cc:37
#5  0x080807ce in obj_reader_t::load (path=0x855c8ec "/home/james/simutrans/Pak128.Britain-Ex-0.8.4/", message=0x8384373 "Loading paks ...")
    at besch/reader/obj_reader.cc:144
#6  0x082b2060 in simu_main (argc=1, argv=0xbffff364) at simmain.cc:952
#7  0x082c0dff in sysmain (argc=1, argv=0xbffff364) at simsys.cc:703
#8  0x08353e09 in main (argc=1, argv=0xbffff364) at simsys_s.cc:623

Most perplexing.

Offline Vonjo

  • *
  • Posts: 273
Re: Curiouser and curiouser
« Reply #18 on: June 03, 2013, 09:19:45 AM »
Confirmed.
Even running it with only:
Code: [Select]
$ ./sim -objects pak/...does not work. It returns the same modal_dialogue() error.

The graphics version (simgraph16.cc) works fine for me though.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #19 on: June 03, 2013, 11:22:31 AM »
Vonjo,

thank you for testing. How did you compile your graphics version?

Offline Vonjo

  • *
  • Posts: 273
Re: Command line server issues
« Reply #20 on: June 03, 2013, 04:02:28 PM »
Further investigation reveals that the modal_dialogue() error also occurs with simutrans-standard when there is no -load command line switch provided.
So I tried this:
Code: [Select]
$ ./sim -server -objects pak/ -load abc.sve...and it works fine. I can join the server and play.
It looks like the problem is related to your savegame or directory structure.

Vonjo,

thank you for testing. How did you compile your graphics version?
ostype=linux
color_depth=16
backend=mixer_sdl
debug=3
optimize=1
multi_thread=4

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #21 on: June 03, 2013, 04:30:47 PM »
Thank you very much for testing. Would you mind trying again with a version compiled without MULTI_THREAD defined, and let me know how you get on?

Edit: Also, can you give me your compile settings for the no graphics version?

Offline Vonjo

  • *
  • Posts: 273
Re: Command line server issues
« Reply #22 on: June 03, 2013, 06:30:28 PM »
Ok, I may try it later.
Edit: Also, can you give me your compile settings for the no graphics version?
The settings are the same as the graphics version, but with backend posix and color_depth 0.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #23 on: June 03, 2013, 06:47:23 PM »
The problem is definitely in the savegame or the directory structure.  And yes, something changed in the last six months regarding how those are interpreted.  I'm not sure exactly what. 

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #24 on: June 03, 2013, 11:48:16 PM »
Thank you for your help, both of you. I have found the error in my locally compiled normal graphics version: I used "SDL" instead of "SDL_MIXER". What, I wonder, is straight "SDL" for?

As to the loading of saved games from the command line, now that I am able to get a normal graphics version running, I have been able to test this. Loading a previous server saved game (server13353-network.sve from the 24th of December last year, which is from the Bridgewater-Brunel online game), it gave me a fatal error message that a certain industry (sometimes a greengrocer, other times a power station) could not be located on the map: this was from the recalc_nearby_halts method. I strongly suspect that it is the inability to display this fatal error message that is causing the exiting in the non-graphics version: in effect, the ultimate problem is the trigger of this crash; but it is not something that I have been able to recreate in my Windows debugger back home.

This also points to a possible design issue with error messages: in the no graphics version, where a fatal error message is to be displayed, this should be done by text ouput to the command line rather than by generating an error that it cannot display a modal dialogue.

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2346
Re: Command line server issues
« Reply #25 on: June 04, 2013, 07:08:13 AM »
SDL was the original version. SDL_Mixer adds support from playing midi files (or indeed, other sound files) as music. If you don't need that plain SDL will be fine.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #26 on: June 04, 2013, 11:03:40 AM »
Hmm - I wonder why plain SDL crashes, then?

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #27 on: June 04, 2013, 02:28:52 PM »
This also points to a possible design issue with error messages: in the no graphics version, where a fatal error message is to be displayed, this should be done by text ouput to the command line rather than by generating an error that it cannot display a modal dialogue.
This would be a definite improvement!

Actually, for various reasons, even in the graphical version I'd like to see fatal errors sent to the command line as well as to a modal dialogue; sometimes the modal dialogue is obscured and goes away (when the game subsequently crashes, for example), and so it's much easier for debugging to look at the command line.

This is how I'd have it work: the error should always go to the command line, and in addition it should go to the modal dialogue if graphics are enabled.

---
Regarding the failure with plain SDL, this *is* interesting.

The main differences in the Makefile are these.  mixer_sdl has:
  SOURCES += sound/sdl_mixer_sound.cc
  SOURCES += music/sdl_midi.cc
While plain sdl on Linux has:
    SOURCES  += sound/sdl_sound.cc
    SOURCES += music/no_midi.cc

(Now, plain SDL has different behavior on mac, cygwin, and mingw, so if you compiled a windows version, there's more going on.)
These aren't very different, so I'm startled that it is causing a crash.

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2346
Re: Command line server issues
« Reply #28 on: June 04, 2013, 05:43:39 PM »
Plain SDL only uses SDL for video - music is played using windows routines (if on Linux/Mac then it is disabled). SDL_Mixer uses SDL to play music on all platforms.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #29 on: June 04, 2013, 11:02:31 PM »
Hmm, all very curious.

Nathaneal - any thoughts on why it fails in recalc_nearby_halts when loading from the command line?

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #30 on: June 09, 2013, 06:18:41 PM »
I have been spending some time looking into this to-day. I have made progress, but do not have a full resolution yet. Firstly, the problem with loading saved games manually on the command line was caused by the /save directory thing, but the problem with the automatic loading was not caused by this, as the automatic server saved game is still stored in the base directory. This problem was caused by a loading bug (the ultimate cause of which is still not clear to me) which I have fixed on the 11.x branch.

However, a new problem emerges. The server - but only the Linux command line server, not a server running in Windows or the client running in Windows (I have not had the chance to test a Linux client) crashes on a new month in the current Bridgewater-Brunel saved game. The backtrace is as follows:

Code: [Select]

Program received signal SIGSEGV, Segmentation fault.
0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
242             inline typ get_typ() const { return type; }
(gdb) net:br
Undefined command: "net".  Try "help".
(gdb) backtrace
#0  0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
#1  0x0000000000464af8 in bsearch_dings_by_prio (pri=200, dings=0x11a982e8, count=4) at dataobj/dingliste.cc:454
#2  0x0000000000464cc4 in dingliste_t::add (this=0x11746650, ding=0x348a5260) at dataobj/dingliste.cc:501
#3  0x0000000000408a6d in grund_t::obj_add (this=0x11746648, obj=0x348a5260) at bauer/../boden/grund.h:521
#4  0x0000000000710175 in vehikel_t::rauche (this=0x2cb95ac0) at vehicle/simvehikel.cc:1854
#5  0x000000000064ac57 in convoi_t::sync_step (this=0x2cb940c0, delta_t=83) at simconvoi.cc:1141
#6  0x00000000006e727d in karte_t::sync_step (this=0x1576380, delta_t=83, sync=true, display=true) at simworld.cc:3450
#7  0x00000000006f7ed9 in karte_t::interactive (this=0x1576380, quit_month=2147483647) at simworld.cc:7106
#8  0x000000000069b308 in simu_main (argc=8, argv=0x7fffffffe568) at simmain.cc:1245
#9  0x00000000006aaecd in sysmain (argc=8, argv=0x7fffffffe568) at simsys.cc:703
#10 0x000000000074d852 in main (argc=8, argv=0x7fffffffe568) at simsys_posix.cc:147
(gdb)

Any thoughts as to what could be causing this would be most appreciated whilst I continue my investigations.

Edit: The problem appears to occur in the code for vehicle smoke. Strictly, this is not needed in a no-graphics server implementation; but is it safe simply to do this:

Code: [Select]
void vehikel_t::rauche() const
{
#if COLOUR_DEPTH == 0
    // We do not need smoke if there are no graphics.
    return;
#else
    // raucht ueberhaupt ?
    if(  rauchen  &&  besch->get_rauch()  ) {

or will the differing number of objects on a tile possibly cause desynchronisation?

Edit 2: Having tested, not only does this change cause desynchronisation, it also fails to fix the crash. I now get:

Code: [Select]
Program received signal SIGSEGV, Segmentation fault.
0x0000000000462baa in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
188             inline bool is_moving() const { return flags&is_vehicle; }
(gdb) backtrace
#0  0x0000000000462baa in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000709c8f in ding_cast<vehikel_basis_t> (d=0x0) at vehicle/../vehicle/simvehikel.h:198
#2  0x000000000070b755 in vehikel_basis_t::no_cars_blocking (this=0x2bbf3a10, gr=0x11747de8, cnv=0x2bbf1ff0,
    current_fahrtrichtung=4 '\004', next_fahrtrichtung=12 '\f', next_90fahrtrichtung=8 '\b') at vehicle/simvehikel.cc:620
#3  0x00000000007151fa in automobil_t::ist_weg_frei (this=0x2bbf3a10, restart_speed=@0x7fffffffb680: -1, second_check=false)
    at vehicle/simvehikel.cc:3006
#4  0x000000000064ca63 in convoi_t::step (this=0x2bbf1ff0) at simconvoi.cc:1610
#5  0x00000000006ea934 in karte_t::step (this=0x1575380) at simworld.cc:4189
#6  0x00000000006f7f3d in karte_t::interactive (this=0x1575380, quit_month=2147483647) at simworld.cc:7113
#7  0x000000000069b308 in simu_main (argc=8, argv=0x7fffffffe568) at simmain.cc:1245
#8  0x00000000006aaecd in sysmain (argc=8, argv=0x7fffffffe568) at simsys.cc:703
#9  0x000000000074d57e in main (argc=8, argv=0x7fffffffe568) at simsys_posix.cc:147

Edit 3: Hmm - this seems inconsistent. I reverted the above change and ran again. At the month boundary, it desynched (as it often does at month boundaries). On reconnecting from the client (but not before), I had this error on the server (which is set to pause when no clients are connected):

Code: [Select]

Program received signal SIGSEGV, Segmentation fault.
0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
242             inline typ get_typ() const { return type; }
(gdb) backtrace
#0  0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
#1  0x00000000004669d0 in dingliste_t::rdwr (this=0x68eef20, welt=0x1576380, file=0x7fffffffb450, current_pos=...)
    at dataobj/dingliste.cc:1062
#2  0x00000000004532b2 in grund_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/grund.cc:368
#3  0x0000000000450f89 in boden_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/boden.cc:55
#4  0x00000000006a7ff0 in planquadrat_t::rdwr (this=0x7ffff0e5a9d8, welt=0x1576380, file=0x7fffffffb450, pos=...)
    at simplan.cc:240
#5  0x00000000006ee331 in karte_t::save (this=0x1576380, file=0x7fffffffb450, silent=false) at simworld.cc:4972
#6  0x00000000006eda05 in karte_t::save (this=0x1576380, filename=0x7fffffffb770 "server13353-network.sve",
    savemode=loadsave_t::zipped, version_str=0x759a25 "0.112.4", ex_version_str=0x759a21 ".11", silent=false)
    at simworld.cc:4848
#7  0x0000000000485991 in nwc_sync_t::do_command (this=0x348b4d90, welt=0x1576380) at dataobj/network_cmd_ingame.cc:730
#8  0x00000000006f7cb5 in karte_t::interactive (this=0x1576380, quit_month=2147483647) at simworld.cc:7062
#9  0x000000000069b308 in simu_main (argc=8, argv=0x7fffffffe568) at simmain.cc:1245
#10 0x00000000006aaecd in sysmain (argc=8, argv=0x7fffffffe568) at simsys.cc:703
#11 0x000000000074d852 in main (argc=8, argv=0x7fffffffe568) at simsys_posix.cc:147
(gdb)

Edit 4: I noticed in that last backtrace that some of the pointers appeared to be of 32-bit length, so I added the -m64 and -mtune=native options to the build command in config.default. It did not desync at the end of the month, but crashed very shortly thereafter (circa 30-40 game seconds later) with this:

Code: [Select]
Program received signal SIGSEGV, Segmentation fault.
0x0000000000462baa in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
188             inline bool is_moving() const { return flags&is_vehicle; }
(gdb) backtrace
#0  0x0000000000462baa in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x00000000004645e2 in dingliste_t::intern_add_moving (this=0xe520390, ding=0x2ba3b500) at dataobj/dingliste.cc:293
#2  0x0000000000464c14 in dingliste_t::add (this=0xe520390, ding=0x2ba3b500) at dataobj/dingliste.cc:488
#3  0x0000000000408a6d in grund_t::obj_add (this=0xe520388, obj=0x2ba3b500) at bauer/../boden/grund.h:521
#4  0x000000000070aa65 in vehikel_basis_t::betrete_feld (this=0x2ba3b500) at vehicle/simvehikel.cc:302
#5  0x000000000070ebab in vehikel_t::betrete_feld (this=0x2ba3b500) at vehicle/simvehikel.cc:1396
#6  0x000000000071b3ad in waggon_t::betrete_feld (this=0x2ba3b500) at vehicle/simvehikel.cc:4268
#7  0x000000000070f00d in vehikel_t::hop (this=0x2ba3b500) at vehicle/simvehikel.cc:1441
#8  0x000000000070abd7 in vehikel_basis_t::fahre_basis (this=0x2ba3b500, distance=4096) at vehicle/simvehikel.cc:347
#9  0x000000000064abb3 in convoi_t::sync_step (this=0x2ba39230, delta_t=83) at simconvoi.cc:1131
#10 0x00000000006e727d in karte_t::sync_step (this=0x1576380, delta_t=83, sync=true, display=true) at simworld.cc:3450
#11 0x00000000006f7ed9 in karte_t::interactive (this=0x1576380, quit_month=2147483647) at simworld.cc:7106
#12 0x000000000069b308 in simu_main (argc=8, argv=0x7fffffffe568) at simmain.cc:1245
#13 0x00000000006aaecd in sysmain (argc=8, argv=0x7fffffffe568) at simsys.cc:703
#14 0x000000000074d852 in main (argc=8, argv=0x7fffffffe568) at simsys_posix.cc:147

Edit 5: This problem seems curiously inconsistent: running again without rebuilding either server or client, I get a desync at the month boundary, and this error on reconnecting:

Code: [Select]

Program received signal SIGSEGV, Segmentation fault.
0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
242             inline typ get_typ() const { return type; }
(gdb) backtrace
#0  0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
#1  0x00000000004669d0 in dingliste_t::rdwr (this=0x68eef20, welt=0x1576380, file=0x7fffffffb450, current_pos=...)
    at dataobj/dingliste.cc:1062
#2  0x00000000004532b2 in grund_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/grund.cc:368
#3  0x0000000000450f89 in boden_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/boden.cc:55
#4  0x00000000006a7ff0 in planquadrat_t::rdwr (this=0x7ffff0e5a9d8, welt=0x1576380, file=0x7fffffffb450, pos=...)
    at simplan.cc:240
#5  0x00000000006ee331 in karte_t::save (this=0x1576380, file=0x7fffffffb450, silent=false) at simworld.cc:4972
#6  0x00000000006eda05 in karte_t::save (this=0x1576380, filename=0x7fffffffb770 "server13353-network.sve",
    savemode=loadsave_t::zipped, version_str=0x759a25 "0.112.4", ex_version_str=0x759a21 ".11", silent=false)
    at simworld.cc:4848
#7  0x0000000000485991 in nwc_sync_t::do_command (this=0x2be68500, welt=0x1576380) at dataobj/network_cmd_ingame.cc:730
#8  0x00000000006f7cb5 in karte_t::interactive (this=0x1576380, quit_month=2147483647) at simworld.cc:7062
#9  0x000000000069b308 in simu_main (argc=8, argv=0x7fffffffe568) at simmain.cc:1245
#10 0x00000000006aaecd in sysmain (argc=8, argv=0x7fffffffe568) at simsys.cc:703
#11 0x000000000074d852 in main (argc=8, argv=0x7fffffffe568) at simsys_posix.cc:147
(gdb)

Edit 6: Adding checks for NULL at the appropriate lines (see the latest commit on the 11.x branch) has not fixed this issue: I now get this crash:

Code: [Select]
Program received signal SIGSEGV, Segmentation fault.
0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
242             inline typ get_typ() const { return type; }
(gdb) backtrace
#0  0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
#1  0x0000000000466a18 in dingliste_t::rdwr (this=0x68eef20, welt=0x1576380, file=0x7fffffffb450, current_pos=...)
    at dataobj/dingliste.cc:1062
#2  0x00000000004532b2 in grund_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/grund.cc:368
#3  0x0000000000450f89 in boden_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/boden.cc:55
#4  0x00000000006a8038 in planquadrat_t::rdwr (this=0x7ffff0e5a9d8, welt=0x1576380, file=0x7fffffffb450, pos=...)
    at simplan.cc:240
#5  0x00000000006ee379 in karte_t::save (this=0x1576380, file=0x7fffffffb450, silent=false) at simworld.cc:4972
#6  0x00000000006eda4d in karte_t::save (this=0x1576380, filename=0x7fffffffb770 "server13353-network.sve",
    savemode=loadsave_t::zipped, version_str=0x759a85 "0.112.4", ex_version_str=0x759a81 ".11", silent=false)
    at simworld.cc:4848
#7  0x00000000004859d9 in nwc_sync_t::do_command (this=0x2cb64cb0, welt=0x1576380) at dataobj/network_cmd_ingame.cc:730
#8  0x00000000006f7cfd in karte_t::interactive (this=0x1576380, quit_month=2147483647) at simworld.cc:7062
#9  0x000000000069b350 in simu_main (argc=8, argv=0x7fffffffe568) at simmain.cc:1245
#10 0x00000000006aaf15 in sysmain (argc=8, argv=0x7fffffffe568) at simsys.cc:703
#11 0x000000000074d89e in main (argc=8, argv=0x7fffffffe568) at simsys_posix.cc:147

Edit 7: Running again for consistency checking produces this error (again on reload after desync):

Code: [Select]
Program received signal SIGSEGV, Segmentation fault.
0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
242             inline typ get_typ() const { return type; }
(gdb) backtrace
#0  0x00000000004340f0 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
#1  0x0000000000466a18 in dingliste_t::rdwr (this=0x68eef20, welt=0x1576380, file=0x7fffffffb450, current_pos=...)
    at dataobj/dingliste.cc:1062
#2  0x00000000004532b2 in grund_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/grund.cc:368
#3  0x0000000000450f89 in boden_t::rdwr (this=0x68eef18, file=0x7fffffffb450) at boden/boden.cc:55
#4  0x00000000006a8038 in planquadrat_t::rdwr (this=0x7ffff0e5a9d8, welt=0x1576380, file=0x7fffffffb450, pos=...)
    at simplan.cc:240
#5  0x00000000006ee379 in karte_t::save (this=0x1576380, file=0x7fffffffb450, silent=false) at simworld.cc:4972
#6  0x00000000006eda4d in karte_t::save (this=0x1576380, filename=0x7fffffffb770 "server13353-network.sve",
    savemode=loadsave_t::zipped, version_str=0x759a85 "0.112.4", ex_version_str=0x759a81 ".11", silent=false)
    at simworld.cc:4848
#7  0x00000000004859d9 in nwc_sync_t::do_command (this=0x2cb77df0, welt=0x1576380) at dataobj/network_cmd_ingame.cc:730
#8  0x00000000006f7cfd in karte_t::interactive (this=0x1576380, quit_month=2147483647) at simworld.cc:7062
#9  0x000000000069b350 in simu_main (argc=8, argv=0x7fffffffe568) at simmain.cc:1245
#10 0x00000000006aaf15 in sysmain (argc=8, argv=0x7fffffffe568) at simsys.cc:703
#11 0x000000000074d89e in main (argc=8, argv=0x7fffffffe568) at simsys_posix.cc:147
(gdb)
« Last Edit: June 09, 2013, 08:21:06 PM by jamespetts »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #31 on: June 09, 2013, 09:07:41 PM »
Either your game is broken or the version has some serious problems. "this=0x0" means you are accessing an initialized pointer.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #32 on: June 09, 2013, 09:19:24 PM »
It is the latter, I fear, as I have tested the game in Windows without difficulty: but I cannot work out why these problems only show themselves at the end of a month on a Linux server. It is quite the mystery.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #33 on: June 10, 2013, 05:43:57 PM »
*Sigh*.  There's definitely a dangling pointer of some sort.  I hate that sort of error.

I particularly couldn't tell you what's wrong given that I've been having no problems with the full-graphics Linux version, even over many months.

I strongly suspect that saved game corruption is involved.  Have you had trouble with anything other than the Bridgewater-Brunel game?  Can you get it to run for long periods using a different saved game?  Try a brand new game; also try a game originally from Standard.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #34 on: June 10, 2013, 07:30:34 PM »
Thank you for the suggestions. There is definitely something going wrong somewhere with memory management: null pointers suddenly pop up all over the place where they don't in a normal full graphics version. What I don't understand is why this happens only: (1) in a command line server version; (2) after a month boundary; and (3) in code that has never been touched by Experimental specific code (at least anywhere near directly). You may recall that I earlier had trouble with access violations in a totally different place when running the Windows command line server.

As to saving games; I did send a force-sync command with nettool, and this worked, and the saved game produced by that command is perfectly loadable and works fine. I also have no trouble with the Bridgewater-Brunel saved game either in earlier versions or in a full graphics client.

However, I will try a Standard save to eliminate this issue and post the result if you think that that would help to eliminate the saved game itself as an issue.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #35 on: June 10, 2013, 08:38:32 PM »
I was having some weird integer overflow crashes in neuer_monat a while back, which I never tracked down.

Does this work with a full graphics *server*, hosting clients?  Perhaps we can sort out whether this is really in the "no graphics" code, or actually in the server code.

Also, I just fixed your other savegame bug (very sorry about that one -- the finance code logic was really, really hairy).  So -- can you see if you can reproduce this bug on that branch (emergency-depot-bugfix)?  I did several rounds of cleanups there.  The code I was cleaning up could easily have had memory management errors in it, and I suspect some of it did.  If the bug goes away on the emergency-depot-bugfix branch, we should probably just merge that to the 11.x branch and make it the new release candidate.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #36 on: June 10, 2013, 10:03:42 PM »
Are you compiling a 64 bit command line option by accident, but still using ptrhashtables with int somewhere?

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #37 on: June 10, 2013, 10:26:59 PM »
Nathaneal - thank you for those fixes: will test with your latest branch presently. Prissi - I don't think that this can be the issue, since my most recent tests used the -m64 switch, which compiles "int" as 64-bit.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #38 on: June 12, 2013, 01:04:51 PM »
I have now tried Nathaneal's latest work (on my 112.x-private-car-merge branch) to see whether this alleviates the problem, especially in light of the work that has been done which Nathaneal tells me impacts on things in simgraph16.cc and simgraph0.cc. The result is that, after this patch, the behaviour in the non-graphics server version and the full graphics client version have indeed converged: now both crash immediately on a new month in this saved game.

The crash is in line 1779 of simconvoi.cc and is a segmentation fault/access violation: the problem appears to be that get_schedule() returns a NULL pointer. It is not clear why at the beginning of a month, convoys should suddenly have a NULL pointer for their schedules: although that particular line of code is not called except at the ends of months, get_schedule() is called many times elsewhere in the code for convoy objects, and this would surely have caused a crash elsewhere earlier (especially as the game in question is at the very beginning of a month) if the problem did not begin on a new month.

This suggests to me wider and quite significant memory management issues associated with some recent merging work, in particular whatever code was changed recently to align the behaviour of the full graphics and no graphics versions.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #39 on: June 12, 2013, 10:15:01 PM »
I have now tried Nathaneal's latest work (on my 112.x-private-car-merge branch) to see whether this alleviates the problem, especially in light of the work that has been done which Nathaneal tells me impacts on things in simgraph16.cc and simgraph0.cc. The result is that, after this patch, the behaviour in the non-graphics server version and the full graphics client version have indeed converged:
Hooray!  :-)

Quote
now both crash immediately on a new month in this saved game.
Believe it or not, this is progress.

Quote
The crash is in line 1779 of simconvoi.cc and is a segmentation fault/access violation: the problem appears to be that get_schedule() returns a NULL pointer.
Oho, I found and fixed that bug today (now fixed on ncn-devel and on merge-from-standard).  It was giving me a beginning of month crash.

The important thing to realize is that this section of code is called on convois which are *sitting in the depot and have not yet launched*, and as such do not have schedules. 

The call to get_schedule is solely to get the waytype for accounting purposes -- this is another one of the problems arising out of the accounting merger.  I changed the code to get the waytype in a far more tedious manner which should be completely reliable, hopefully.

However, I believe that convoi_t objects should internally store their waytype.  This would allow for removal of great masses of gunk, not just here but in a bunch of other places.

Please try the merge-from-standard branch again and see if there's still *another* lurking crash hiding behind this one.  (Merge-from-standard has the loading screen code fixes which affect simgraph0.cc and simgraph16.cc, ncn-devel doesn't).

--Nathanael

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #40 on: June 12, 2013, 11:57:12 PM »
Hmm, I suspect that the month end but that you have recently fixed might rather have got in the way of seeing this issue: compiling a Windows command line server in MSVC++ from the 112.x-private-car-merge produces an access violation at line 149 of image_reader.cc, the problem being that "dest" is a bad (not null) pointer. This is code unchanged from Standard, so something is still very wrong somewhere.

I have not had chance to test this again with a Linux server yet (have spent the evening baking a cake), but will revert when I get a chance to do so.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #41 on: June 13, 2013, 03:45:27 AM »
Hmm, I suspect that the month end but that you have recently fixed might rather have got in the way of seeing this issue: compiling a Windows command line server in MSVC++ from the 112.x-private-car-merge produces an access violation at line 149 of image_reader.cc, the problem being that "dest" is a bad (not null) pointer. This is code unchanged from Standard, so something is still very wrong somewhere.
This code is changed from standard.  You changed the memory management on April 10th (commit 77fc2bdf691) after finding memory leaks...

(Edit: for reference, I used "git diff standard HEAD -- filename" to find out what differs from standard, "git blame filename" to figure out what commit made the change.)

Unfortunately, there's something worse than memory leaks.  If you call delete too many times, you get bad pointers.  GCC is remarkably good at protecting us from that in debug builds, but in non-debug builds all bets are off, and I'm sure MSVC is different.

Worse, this code compiles differently depending on whether COMMAND_LINE_SERVER is defined, so it may actually be at fault.

Quote
I have not had chance to test this again with a Linux server yet (have spent the evening baking a cake), but will revert when I get a chance to do so.

I'm going to remove the bad references to delete[] (two of them are clearly erroneous).

By the way, I'm going to merge up my ncn-devel branch to merge-from-standard now, since the stuff on merge-from-standard doesn't seem to be hurting anything.

----
To your credit, the patch to fix the memory leaks was *mostly* correct -- there was a memory leak which needed to be fixed, which you fixed correctly in the destructor for bild_besch.  Unfortunately I think you ended up with a "double delete".
« Last Edit: June 13, 2013, 04:06:34 AM by neroden »

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #42 on: June 13, 2013, 08:12:11 AM »
Ahh - hadn't remembered that: thank you for working this out. I had used Dr. Memory to try to find errors in the code, and had fixed a number of memory leaks as a result. Had you implemented the fix for this yet? I ask because, having pulled from the latest merge-from-standard, I am still getting this error.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #43 on: June 13, 2013, 03:13:25 PM »
Ahh - hadn't remembered that: thank you for working this out. I had used Dr. Memory to try to find errors in the code, and had fixed a number of memory leaks as a result. Had you implemented the fix for this yet? I ask because, having pulled from the latest merge-from-standard, I am still getting this error.

I have implemented an attempted fix for this, yes. (commit 49920ee52).

So, at the moment, can I confirm, the graphics version is working fine; the command-line version is crashing?

I noticed that the command-line version compiles with MIDI on by default (!!!), so I ran it with -nomidi -nosound.  That should probably be fixed regardless of whether it's relevant to this bug.

I'm going to try working on a fix-command-line branch since some of the things I try may break further stuff.
----
I'm having some problems tracking this down because the command-line server is now working fine for me.  I'll leave it running for a while in the debugger, see if it crashes..   Perhaps if I tried the Bridgewater-Brunel game I could track this down.
« Last Edit: June 13, 2013, 04:02:13 PM by neroden »

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #44 on: June 13, 2013, 07:50:32 PM »
OK, so I managed to get some trouble running a command line server.

First, a desync:
Code: [Select]
Desync due to checklist mismatch
sync_step=56620  server=[rand=2339077297 halt=1 line=1 cnvy=1 ind_dns_prop=5585 act_ind_dens=1127 traffic=480473] client=[rand=3039275210 halt=1 line=1 cnvy=1 ind_dns_prop=5585 act_ind_dens=1127 traffic=480502] World finished ...
Then, when I attempted to reconnect, a crash with this backtrace:
Code: [Select]
#0  0x0000000000434d82 in ding_t::get_typ (this=0x0) at bauer/../boden/../dataobj/../simdings.h:242
#1  0x0000000000466d21 in dingliste_t::rdwr (this=0x74da360, welt=0x16c1600, file=0x7fffffffb8c0,
    current_pos=...) at dataobj/dingliste.cc:1062
#2  0x000000000045363a in grund_t::rdwr (this=0x74da358, file=0x7fffffffb8c0) at boden/grund.cc:368
#3  0x0000000000451309 in boden_t::rdwr (this=0x74da358, file=0x7fffffffb8c0) at boden/boden.cc:55
#4  0x00000000006afe04 in planquadrat_t::rdwr (this=0x7fffe3dc5a98, welt=0x16c1600, file=0x7fffffffb8c0,
    pos=...) at simplan.cc:240
#5  0x00000000006f4746 in karte_t::save (this=0x16c1600, file=0x7fffffffb8c0, silent=false)
    at simworld.cc:4974
#6  0x00000000006f3ed4 in karte_t::save (this=0x16c1600, filename=0x7fffffffbba0 "server1234-network.sve",
    savemode=loadsave_t::zipped, version_str=0x75ffc5 "0.112.5", ex_version_str=0x75ffc1 ".11", silent=false)
    at simworld.cc:4850
#7  0x0000000000485251 in nwc_sync_t::do_command (this=0x7a9f930, welt=0x16c1600)
    at dataobj/network_cmd_ingame.cc:730
#8  0x00000000006fde04 in karte_t::interactive (this=0x16c1600, quit_month=2147483647) at simworld.cc:7103
#9  0x00000000006a341c in simu_main (argc=9, argv=0x7fffffffe108) at simmain.cc:1261
#10 0x00000000006b2c71 in sysmain (argc=9, argv=0x7fffffffe108) at simsys.cc:703
#11 0x00000000007537ab in main (argc=9, argv=0x7fffffffe108) at simsys_s.cc:720
I'm not sure where to start on debugging this; perhaps someone else has an idea.  I'm also not sure this is your crash!

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #45 on: June 13, 2013, 09:25:50 PM »
Nathaneal,

ahh, you have hit upon the same crash as I did on the Linux server - see above. It is very different to the crash that I get on the Windows command line server, but the same as reported here. The desync is I think an old issue not yet tracked down (because tracking down desyncs is excruciatingly difficult and months are long enough for this to be a tolerable bug for the time being) that often occurs at the beginning/end of months on developed games; the crash happens on reconnexion because the game is set to pause when no clients are connected: I strongly suspect that the crash would occur immediately after the desync were the server not set to pause when no clients are connected.

Edit: Actually, looking at your backtrace, the crash seems to occur in the save/load routine used when a client connects, so the above might not necessarily be right: however, using nettool to bring about a force-sync before the month boundary works fine without any crashes, so this is still end of month related. Sadly, I am not quite sure where to start with this one, either, and it's rather a show-stopper, as they say.

Thank you for the tip re: -nosound and -nomidi: I have added these parameters to my startup script. It may not make much of a difference, but there is no harm in possibly saving a small amount of memory/CPU power.

To answer your question:

Quote from: neroden
So, at the moment, can I confirm, the graphics version is working fine; the command-line version is crashing?

Yes, subject to latent bugs on the graphics version.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #46 on: June 15, 2013, 05:08:34 PM »
I have just attempted to run the command line server on Linux with Valgrind. It was, as I suppose is inevitable, excurciatingly slow (taking over an hour just to load the game), and, as soon as (or perhaps even before) it had loaded the saved game, it immediately quit without any explanation or error message. I cannot explain why this behaviour with Valgrind attached is different than the behaviour without Valgrind.

However, it did generate a log, which is available for inspection here. To those who know more about Valgrind than I do - are any of these errors significant, and might any of them be related to the troubles described in this thread?

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4870
  • Languages: EN, DE, AT
Re: Command line server issues
« Reply #47 on: June 15, 2013, 06:21:01 PM »
Imho, you should try to hunt down the reason for this block:
Code: [Select]
==16951== Conditional jump or move depends on uninitialised value(s)
==16951==    at 0x43A4EE: haus_besch_t::calc_checksum(checksum_t*) const (haus_besch.cc:189)
==16951==    by 0x438079: fabrik_besch_t::calc_checksum(checksum_t*) const (fabrik_besch.cc:82)
==16951==    by 0x40BDBD: fabrikbauer_t::alles_geladen() (fabrikbauer.cc:229)
==16951==    by 0x43ED32: factory_reader_t::successfully_loaded() const (factory_reader.cc:446)
==16951==    by 0x441816: obj_reader_t::laden_abschliessen() (obj_reader.cc:73)
==16951==    by 0x6A0C99: simu_main(int, char**) (simmain.cc:964)
==16951==    by 0x6B17A0: sysmain(int, char**) (simsys.cc:703)
==16951==    by 0x7540E9: main (simsys_posix.cc:147)
... you have uninitialized data in your haus_besch_t ... which might even be the cause of a desync.

I cannot see anything related to the crash. The program dies due to a SIGPIPE, but I have no idea what could have triggered this.



Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #48 on: June 15, 2013, 07:10:59 PM »
Thank you for the pointer (no pun intended). Will investigate.

Edit: I think that I have found and fixed this uninitialised variable bug (on my 112.x-private-car-merge): testing now. It related to whether buildings in old Experimental versions of paksets were marked as control towers or allowed underground.

Edit 2: Updated to the latest 112.x-private-car-merge commit on the Linux server, and ran to the end of the month with the Bridgewater-Brunel saved game from July 1956. Same behaviour as before: desynched fractionally after the end of the month: on reloading, crashed with a segmentation fault. Reloading the same saved game afresh results in this loading error:

Code: [Select]
Warning: karte_t::laden:        Fileversion: 112005
Calculating textures ...done
ERROR: grund_t::rdwr(): invalid waytype 48!
For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
FATAL ERROR: dingliste_t::laden() - During loading: Unknown object type '48'
Aborting program execution ...

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
FATAL ERROR: dingliste_t::laden() - During loading: Unknown object type '48'
Aborting program execution ...

For help with this error or to file a bug report please see the Simutrans forum:
http://forum.simutrans.com
Aborted

It looks as though something in the new month routines is damaging dingliste_t members - but only in the command line server version. This is all very odd.

Edit 3: Trying again with a graphics Windows server, the month transition in the July 1956 Bridgewater-Brunel saved game occurs without crashing, although it does desync: nevertheless, once the client logs in again, it continues to run without errors.

Edit 4: I tried running it with Valgrind again, after fixing the error with the uninitialised variables: the result was the same as before, with it quitting before actually getting to run properly. I wonder whether the SIGPIPE error might be caused by the fact that Valgrind makes the program so slow that it takes about two hours to load a map, by which time some socket or the like might have timed out or somehow become stale? In any event, here is the log.

Edit 5: I am now trying to run it with Valgrind without "-server" specified to see whether I can run it in the no graphics version without getting a SIGPIPE error. Because Valgrind makes it so slow, it might take days before I have any results, however.

Edit 6: The non-server version exited with a segmentation fault: Valgrind log here.

Edit 7: Running the no graphics version in Linux without the server setting eventually produces a segmentation fault with the following backtrace:

Code: [Select]
Program received signal SIGSEGV, Segmentation fault.
0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
188             inline bool is_moving() const { return flags&is_vehicle; }
(gdb) backtrace
#0  0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000464c64 in dingliste_t::intern_add_moving (this=0x1727e760, ding=0x2b774680) at dataobj/dingliste.cc:289
#2  0x0000000000465330 in dingliste_t::add (this=0x1727e760, ding=0x2b774680) at dataobj/dingliste.cc:488
#3  0x0000000000408a93 in grund_t::obj_add (this=0x1727e758, obj=0x2b774680) at bauer/../boden/grund.h:544
#4  0x00000000007117ad in vehikel_basis_t::betrete_feld (this=0x2b774680) at vehicle/simvehikel.cc:303
#5  0x000000000071595b in vehikel_t::betrete_feld (this=0x2b774680) at vehicle/simvehikel.cc:1405
#6  0x00000000007220df in waggon_t::betrete_feld (this=0x2b774680) at vehicle/simvehikel.cc:4278
#7  0x0000000000715dbd in vehikel_t::hop (this=0x2b774680) at vehicle/simvehikel.cc:1450
#8  0x000000000071191f in vehikel_basis_t::fahre_basis (this=0x2b774680, distance=6688768) at vehicle/simvehikel.cc:348
#9  0x0000000000650a4b in convoi_t::sync_step (this=0x2b772560, delta_t=8164) at simconvoi.cc:1116
#10 0x00000000006ee019 in karte_t::sync_step (this=0x157e550, delta_t=8164, sync=true, display=true) at simworld.cc:3457
#11 0x0000000000698250 in interrupt_check (caller_info=0x795690 "simworld.cc:3805") at simintr.cc:102
#12 0x00000000006ef8f6 in karte_t::new_month (this=0x157e550) at simworld.cc:3805
#13 0x00000000006f0e93 in karte_t::step (this=0x157e550) at simworld.cc:4102
#14 0x00000000006ff103 in karte_t::interactive (this=0x157e550, quit_month=2147483647) at simworld.cc:7196
#15 0x00000000006a20e7 in simu_main (argc=13, argv=0x7fffffffe538) at simmain.cc:1261
#16 0x00000000006b1cc1 in sysmain (argc=13, argv=0x7fffffffe538) at simsys.cc:703
#17 0x000000000075432e in main (argc=13, argv=0x7fffffffe538) at simsys_posix.cc:147

Preliminary conclusions so far:

  • the problem is associated with COLOUR_DEPTH=0, not with running as a server;
  • the problem occurs after the new month routines have run;
  • the exact manifestation of the problem is different each time; but
  • the problems are all caused by memory corruption in dingliste_t.
The conclusion to draw from that is that, in one of the new month routines (or a routine called by it), some code causes memory corruption in dingliste_t, and that code is only active when COLOR_DEPTH=0.

This is rather perplexing to me, since I have never altered code in dingliste_t and it is not an area of the code with which I am familiar. Can anyone possibly enlighten me a little as to how, in overview, the dingliste_t class works?

Edit 8: I wonder whether the problem is with the merging of this Standard commit of the 5th of October 2012 with the comment, "CHG: reduced memory consumption when compiled with COLOUR_DEPTH=0, no copies of object images kept in memory"?
« Last Edit: June 16, 2013, 02:47:47 PM by jamespetts »

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #49 on: June 16, 2013, 04:13:23 PM »
I'm trying to debug one of the crashes in dingliste_t.  I managed to get one with an ordinary graphics game, by just turning it on and running it for a long time.

Code: [Select]
#0  0x0000000000463f9a in ding_t::is_moving (this=0x0)
    at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000469401 in local_display_dinge_bg (ding=0x0, xpos=1044, ypos=700) at dataobj/dingliste.cc:1178
#2  0x0000000000468878 in dingliste_t::display_dinge_bg (this=0xfcdf670, xpos=1044, ypos=700,
    start_offset=0 '\000') at dataobj/dingliste.cc:1200
#3  0x0000000000459375 in grund_t::display_dinge_bg (this=0xfcdf668, xpos=1044, ypos=700, is_global=false,
    draw_ways=true, visible=true) at boden/grund.cc:1367
#4  0x0000000000458a4e in grund_t::display_dinge_all (this=0xfcdf688, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false) at boden/grund.cc:1292
#5  0x00000000006c03f7 in planquadrat_t::display_dinge (this=0x7fffe12301f8, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false, hmin=0 '\000', hmax=127 '\177') at simplan.cc:426
#6  0x00000000004e6bed in world_view_t::internal_draw (this=0x1448e680, offset=..., ding=0x0)
    at gui/components/gui_world_view_t.cc:170
#7  0x000000000051a6b3 in location_view_t::zeichnen (this=0x1448e680, offset=...)
    at gui/components/gui_location_view_t.h:25
#8  0x000000000051b65f in gui_container_t::zeichnen (this=0x1448e5d8, offset=...) at gui/gui_container.cc:226
#9  0x000000000051cff7 in gui_frame_t::zeichnen (this=0x1448e5d0, pos=..., gr=...) at gui/gui_frame.cc:209
#10 0x00000000006eaa9b in display_win (win=9) at simwin.cc:842
#11 0x00000000006eacb4 in display_all_win () at simwin.cc:875
#12 0x00000000006ecde0 in win_display_flush (konto=-101781.95) at simwin.cc:1431
#13 0#0  0x0000000000463f9a in ding_t::is_moving (this=0x0)
    at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000469401 in local_display_dinge_bg (ding=0x0, xpos=1044, ypos=700) at dataobj/dingliste.cc:1178
#2  0x0000000000468878 in dingliste_t::display_dinge_bg (this=0xfcdf670, xpos=1044, ypos=700,
    start_offset=0 '\000') at dataobj/dingliste.cc:1200
#3  0x0000000000459375 in grund_t::display_dinge_bg (this=0xfcdf668, xpos=1044, ypos=700, is_global=false,
    draw_ways=true, visible=true) at boden/grund.cc:1367
#4  0x0000000000458a4e in grund_t::display_dinge_all (this=0xfcdf688, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false) at boden/grund.cc:1292
#5  0x00000000006c03f7 in planquadrat_t::display_dinge (this=0x7fffe12301f8, xpos=980, ypos=668,
    raster_tile_width=128, is_global=false, hmin=0 '\000', hmax=127 '\177') at simplan.cc:426
#6  0x00000000004e6bed in world_view_t::internal_draw (this=0x1448e680, offset=..., ding=0x0)
    at gui/components/gui_world_view_t.cc:170
#7  0x000000000051a6b3 in location_view_t::zeichnen (this=0x1448e680, offset=...)
    at gui/components/gui_location_view_t.h:25
#8  0x000000000051b65f in gui_container_t::zeichnen (this=0x1448e5d8, offset=...) at gui/gui_container.cc:226
#9  0x000000000051cff7 in gui_frame_t::zeichnen (this=0x1448e5d0, pos=..., gr=...) at gui/gui_frame.cc:209
#10 0x00000000006eaa9b in display_win (win=9) at simwin.cc:842
#11 0x00000000006eacb4 in display_all_win () at simwin.cc:875
#12 0x00000000006ecde0 in win_display_flush (konto=-101781.95) at simwin.cc:1431
#13 0x00000000006a905b in intr_refresh_display (dirty=false) at simintr.cc:75
#14 0x00000000006fe628 in karte_t::sync_step (this=0xd05c4a0, delta_t=31, sync=true, display=true)
    at simworld.cc:3502
#15 0x00000000006a91ac in interrupt_check (caller_info=0x7a72e2 "simworld.cc:7045") at simintr.cc:102
#16 0x000000000070e270 in karte_t::interactive (this=0xd05c4a0, quit_month=2147483647) at simworld.cc:7045
#17 0x00000000006b2dc8 in simu_main (argc=1, argv=0x7fffffffe1d8) at simmain.cc:1261
#18 0x00000000006c261d in sysmain (argc=1, argv=0x7fffffffe1d8) at simsys.cc:703
#19 0x00000000007640ef in main (argc=1, argv=0x7fffffffe1d8) at simsys_s.cc:720x00000000006a905b in intr_refresh_display (dirty=false) at simintr.cc:75
#14 0x00000000006fe628 in karte_t::sync_step (this=0xd05c4a0, delta_t=31, sync=true, display=true)
    at simworld.cc:3502
#15 0x00000000006a91ac in interrupt_check (caller_info=0x7a72e2 "simworld.cc:7045") at simintr.cc:102
#16 0x000000000070e270 in karte_t::interactive (this=0xd05c4a0, quit_month=2147483647) at simworld.cc:7045
#17 0x00000000006b2dc8 in simu_main (argc=1, argv=0x7fffffffe1d8) at simmain.cc:1261
#18 0x00000000006c261d in sysmain (argc=1, argv=0x7fffffffe1d8) at simsys.cc:703
#19 0x00000000007640ef in main (argc=1, argv=0x7fffffffe1d8) at simsys_s.cc:720

Poking around, all the code is doing exactly what it's supposed to, except that there is a null pointer lurking in the list of dings.  This is also the cause of the previous crash in ::intern_add_moving -- it's running through objects and finding a crashin

I did three things on ncn-devel
(1) I turned on assertions in the debug build.  I realize this will make things slower, but it needs to be done, and frankly it needs to be done routinely.  If we have particular assertions which are too slow, we can remove them from the code.
(2) I added an assertion during dingliste_t:add, asserting that it is not being asked to add a null pointer. This is the only way to figure out how a null pointer is getting into the dingliste_t. 
(3) I added a check for what is apparently a missing case.  It doesn't seem to be the case causing the problem, though.

Run with assertions on for a while and get the crash again.  Most likely  you'll get a crash based on the assertion -- in which case we can figure out who is trying to add a null pointer.  If you get the same crash as before, then we have to dig further into the dingliste_t code to figure out how it's getting a null pointer when nobody is trying to add one.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #50 on: June 16, 2013, 04:19:50 PM »
Thank you very much for this - this is very helpful. It is interesting that you have been able to reproduce this with a graphics build. I have pulled and compiled your code on the Linux server (albeit running it as a non-server build for testing purposes), and I am waiting to see whether it will still crash, and, if so, with what backtrace.

Turning assertions on in debug builds seems sensible enough; but when running the server with an actual saved game, we need all the speed that we can get, and should turn off debugging in this situation.

Edit 1: My new test with your new version also crashed with the following backtrace:

Code: [Select]
Program received signal SIGSEGV, Segmentation fault.
0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
188             inline bool is_moving() const { return flags&is_vehicle; }
(gdb) backtrace
#0  0x000000000046327e in ding_t::is_moving (this=0x0) at dataobj/../vehicle/../boden/../dataobj/../simdings.h:188
#1  0x0000000000464c64 in dingliste_t::intern_add_moving (this=0x171d8460, ding=0x2b9db650) at dataobj/dingliste.cc:289
#2  0x0000000000465330 in dingliste_t::add (this=0x171d8460, ding=0x2b9db650) at dataobj/dingliste.cc:488
#3  0x0000000000408a93 in grund_t::obj_add (this=0x171d8458, obj=0x2b9db650) at bauer/../boden/grund.h:544
#4  0x00000000007117c1 in vehikel_basis_t::betrete_feld (this=0x2b9db650) at vehicle/simvehikel.cc:310
#5  0x000000000071596f in vehikel_t::betrete_feld (this=0x2b9db650) at vehicle/simvehikel.cc:1412
#6  0x00000000007220f3 in waggon_t::betrete_feld (this=0x2b9db650) at vehicle/simvehikel.cc:4285
#7  0x0000000000715dd1 in vehikel_t::hop (this=0x2b9db650) at vehicle/simvehikel.cc:1457
#8  0x0000000000711933 in vehikel_basis_t::fahre_basis (this=0x2b9db650, distance=3497984) at vehicle/simvehikel.cc:355
#9  0x0000000000650a4b in convoi_t::sync_step (this=0x2b9d9230, delta_t=4996) at simconvoi.cc:1116
#10 0x00000000006ee019 in karte_t::sync_step (this=0x157e550, delta_t=4996, sync=true, display=true) at simworld.cc:3457
#11 0x0000000000698250 in interrupt_check (caller_info=0x7956b0 "simworld.cc:3805") at simintr.cc:102
#12 0x00000000006ef8f6 in karte_t::new_month (this=0x157e550) at simworld.cc:3805
#13 0x00000000006f0e93 in karte_t::step (this=0x157e550) at simworld.cc:4102
#14 0x00000000006ff103 in karte_t::interactive (this=0x157e550, quit_month=2147483647) at simworld.cc:7196
#15 0x00000000006a20e7 in simu_main (argc=13, argv=0x7fffffffe538) at simmain.cc:1261
#16 0x00000000006b1cc1 in sysmain (argc=13, argv=0x7fffffffe538) at simsys.cc:703
#17 0x0000000000754342 in main (argc=13, argv=0x7fffffffe538) at simsys_posix.cc:147
(gdb)
« Last Edit: June 16, 2013, 05:26:25 PM by jamespetts »

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #51 on: June 18, 2013, 08:27:23 PM »
Excellent.  This is an extremely helpful backtrace.  The problem I've been having is that I've been catching the bug after the null pointer has been inserted into the dingliste_t.  You managed to catch it *during* the add process; I think I may be able to fix it now.
---

Spoke too soon.  Nope, the null pointer is already in there when your crash happens.  ****.

I can't figure out how the null pointers are getting in.  Nobody is calling add() with a null pointer, which means that there's a logic error in the data structure handling.

I tried completely rewriting it, but I haven't quite gotten that working yet.
« Last Edit: June 18, 2013, 08:35:36 PM by neroden »

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #52 on: June 18, 2013, 08:38:28 PM »
Wait wait wait.  Did someone try to make simutrans multithreaded?

If so, that's the problem.  Everything in dingliste.cc has to be protected by synchronization locks of some sort.

If simutrans is being compiled multithreaded by default, turn that off and see what the results are.  Just an idea...
---
Sigh.  Sadly, no.   I just checked and threading is off in my test executable.   

Having been staring at this file for a week, I think I see how to convert dingliste_t to handle multiple rotations simultaneously :-) but first I have to figure out how to fix the basic memory management code!
« Last Edit: June 19, 2013, 12:19:26 AM by neroden »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #53 on: June 18, 2013, 08:57:11 PM »
The backtrace almost rules out that someone called interrupt check during a sync_step. (In such a case one routine may have stale elements like the max still in its structures.) This leaves an error by wrong memory allocation using freelist as most likeliest candidate. Smaller dinglist (up to 16 items) are allocated from freelist. If an item is accidently freed (or gotten from the wrong list) it will be overwritten. Same obviously if somethign writes outidie its boundaries. You can remove the handling of dinglist be freelist by changing

Code: [Select]
static void dl_free(ding_t** p, uint8 size)
{
assert(size > 1);
size = 0;
guarded_free(p);
}


static ding_t** dl_alloc(uint8 size)
{
assert(size > 1);
ding_t** p;
p = MALLOCN(ding_t*, size);
return p;
}

Apart from slowing down the game, this will only fix symptoms; it rather points to an error in your dingliste handling (or another object with similar size). It could by also a problem with another object which memory was obtained by freelist and then wrote outside its borders.

To actually debug this by valgrind, you need to define USE_VALGRIND_MEMCHECK

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #54 on: June 18, 2013, 11:21:30 PM »
Thank you both for your assistance with this - that is most helpful. Simutrans is indeed now multi-threaded, and has been for a while, but my server version is built without MULTI_THREAD defined (as my (virtual) server has only one allocated core), so the version that is crashing for me should be a single threaded compile.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #55 on: June 19, 2013, 12:50:20 AM »
I think it's got to be an error somewhere in the insertion/removal routines in dingliste_t.  I tried rewriting them entirely along somewhat different and cleaner (and hopefully slightly faster) lines, but I got a very weird crash, so I'm going to have to go back and figure out what was going on.

it rather points to an error in your dingliste handling (or another object with similar size). It could by also a problem with another object which memory was obtained by freelist and then wrote outside its borders.
I've been debugging these crashes.  The lists appear to be *mostly* valid pointers to ding_t.  They just have null pointer holes in the middle.  The crashing instances have capacity 16, "top" somewhere around 14 or 11, and the "null pointer hole" a few entries back from that, around 11 or 8.  The early entries appear to be proper ding_t pointers, which makes it seem unlikely that the memory has been overwritten by something writing outside its borders. 

I suppose it's possible that they were overwritten by something which was writing lists of ding_t pointers.... but there aren't that many things using the freelist, and most of them aren't lists of ding_t pointers.

This feels like improper list management to me.  Specifically, it feels like an insertion error, where something is inserted in the wrong slot when the list is expanded, so that the new item overwrites an old item (which gets lost) rather than being appended. 

But I'm having trouble finding such an error in the code.  Even though the code is different between standard and experimental, I have not been able to spot any logic errors. 

The basic logic for the type is baroque and complicated, so there could be something I missed.  It's complicated enough that I tried rewriting it, to use "top" for the count at all times, and to use "capacity" strictly for the capacity, not for the count.  But as I said I've had some very mysterious trouble with that.  I'll have to try it again and see if I can spot what's going on....

----

Okay.  New day.  I still don't know what was going wrong with the original implementation, but I do know what went wrong with my revised implementation.

The revised implementation is on my "dinglist-cleanup" branch.  James, I would appreciate it if you would test it to see if I've finally eliminated the error.  (Don't merge it yet, though!)

If I *have* eliminated the error, I'll make an adaptation of the patch for standard.  I believe that it makes the code significantly more readable, and therefore less prone to future fencepost (off-by-one) errors.

If I haven't... well, at least I have more readable code to debug.  :-)

I believe that the revised implementation should be marginally faster as well as significantly cleaner.  Probably not measurably faster, but it does create what I think is a highly optimizable logic based entirely in comparisons against zero. I have some further ideas for improvement but let's see if I got the bugs out first.
« Last Edit: June 19, 2013, 01:32:12 AM by neroden »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #56 on: June 19, 2013, 10:03:56 AM »
Honestly, I would be very surprised if is an error in standard. dingliste is used a zillion times on the servers. If there would be buggy code there, any large game would trigger it quickly. However the standard servers usually run for weeks without crashing via dingliste.

But one never knows.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #57 on: June 19, 2013, 10:30:33 AM »
Thank you both for your comments on this and Nathaneal for all the work on this - it is most appreciated. I have not had chance to test this yet with my server, but hope to do so soon. If this fix does not solve the problem, the next place to look, I think, is that set of various new month (either new_month() or neuer_monat()) methods, as the crash always occurs almost immediately after the beginning of a new month; disabling them one by one and checking whether the crash recurs should help to narrow down the cause.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #58 on: June 19, 2013, 12:57:47 PM »
Ok, I stand corrected: There was the one off bug fixed in bei(): Albeit this was never triggered, as this is called from for loop limited by top ...

But still I do not see how to get a NULL pointer into the list. It still look to me as there is someting writing out of its bound and hitting a dinglist. You can easily check this by only using malloc.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #59 on: June 19, 2013, 05:37:48 PM »
Ok, I stand corrected: There was the one off bug fixed in bei(): Albeit this was never triggered, as this is called from for loop limited by top ...
(Edit:) One of the reasons I was willing to suspect an error in dingliste.cc is that it had diverged somewhat between experimental and standard.  Not in big ways, but the code was so confusing that I figured it might have generated an off-by-one. 

There *was* a bug in the deletion of "dying trees", which do not get removed from the list and leave bad pointers in the list at the end of the month...

Anyway, my revision is still cleaner architecturally and quite likely to be faster, depending on how inlining works out.

Quote
But still I do not see how to get a NULL pointer into the list. It still look to me as there is someting writing out of its bound and hitting a dinglist. You can easily check this by only using malloc.

I wonder if there's something writing out of bounds BACKWARDS?  That would explain the behavior I'm seeing. 

The 0 end of the array is fine, but the "top" end of the array is broken.  So I'm looking for a loop which writes into an array and counts DOWN using memory from the freelist.

---
(Edit:) Actually, come to think of it, I'm only seeing problems with lists of size 16 specifically (!!!), which is a memory unit of 128 bytes (since I'm running 64-bit).  Since the freelist uses pools of memory where all the allocations of size 16 come from the same pool, I'm looking specifically for something allocated out of the 128 byte pool.

::gimme_node is called in a very short list of places in experimental.
  • operator new in grund_t -- sizeof(grund_t) is 32 and the same for all subclasses
  • dl_alloc in dingliste_t (which is what we've been debugging)
  • dings/groundobj.cc -- called with sizeof(groundobj_t) which is 32
  • dings/baum.cc -- called with sizeof(baum_t) which is 32
  • player/simplay.cc -- called with sizeof(spieler_t::income_message_t) which is 56
  • simhalt.cc -- called with sizeof(haltestelle_t::connexion) -- this is a struct with six elements: four uint16s and two quickstones.  My debugger won't give me the size, but it should be 48 in the least-packed case, more likely 12
  • tpl/slist_tpl.h -- probably the worst use to track down; this is used for node_t, so it's one pointer (8) plus the data type.
  • To hit 128, given the automatic four-byte alignment, this would require a datatype of a very specific and unlikely size.  It's also quite unlikely that anyone would scribble backwards across a node and we'd probably see very bad results from the corrupted linked lists.
  • vehicle/movingobj.cc -- called with sizeof(movingobj_t) which is 72
  • vehicle/simverkehr.cc -- called with sizeof(stadtauto_t) which is 128
This says to me we have to look at backwards loops in the city car code, which certainly does differ between experimental and standard.  James?
« Last Edit: June 19, 2013, 06:24:19 PM by neroden »

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #60 on: June 19, 2013, 08:49:55 PM »
Okay!

So I removed the custom allocators from stadtauto_t, so that it allocates the "normal" way.

This triggered an exciting new crash.  A double free.

*** Error in `/home/neroden/simutrans-ex-new-alternate/sim-carsout': free(): invalid pointer: 0x00000000149b13c0 ***

Code: [Select]
#0  0x00007ffff6933295 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6936438 in __GI_abort () at abort.c:90
#2  0x00007ffff696e86b in __libc_message (do_abort=do_abort@entry=2,
    fmt=fmt@entry=0x7ffff6a6ad28 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:199
#3  0x00007ffff6978ac6 in malloc_printerr (action=3, str=0x7ffff6a66cca "free(): invalid pointer",
    ptr=<optimized out>) at malloc.c:4902
#4  0x00007ffff6979843 in _int_free (av=<optimized out>, p=0x149b13b0, have_lock=0) at malloc.c:3758
#5  0x000000000064f43c in sparse_tpl<unsigned char>::resize_data (this=0xde44df0, new_size=0)
    at tpl/sparse_tpl.h:199
#6  0x000000000064bd16 in sparse_tpl<unsigned char>::clear (this=0xde44df0) at tpl/sparse_tpl.h:60
#7  0x000000000063cdfd in stadt_t::neuer_monat (this=0xde44da0, check=false) at simcity.cc:2526
#8  0x0000000000703d77 in karte_t::new_month (this=0xde78f0) at simworld.cc:3755
#9  0x000000000070565a in karte_t::step (this=0xde78f0) at simworld.cc:4110
#10 0x000000000071341f in karte_t::interactive (this=0xde78f0, quit_month=2147483647) at simworld.cc:7203
#11 0x00000000006b6cb6 in simu_main (argc=1, argv=0x7fffffffe1a8) at simmain.cc:1261
#12 0x00000000006c6805 in sysmain (argc=1, argv=0x7fffffffe1a8) at simsys.cc:703
#13 0x000000000076905b in main (argc=1, argv=0x7fffffffe1a8) at simsys_s.cc:720

I'm not entirely sure how to interpret this, but I think it supports my theory that there's a failure of memory management for stadtauto_t objects.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #61 on: June 19, 2013, 08:56:09 PM »
Ahh, thank you! I had started to wonder whether the cars might be an issue, since that is the only area of the code that is exclusive to Experimental that causes objects to be removed from the world at the end of the month. Can you put your testing code for using normal allocation for city cars on a separate branch for me to test?

Incidentally, I cannot compile the Linux version of your dingliste-cleanup branch: I get:

Code: [Select]
build/default/gui/components/gui_world_view_t.o: In function `world_view_t::internal_draw(koord, ding_t const*)':
/usr/share/games/simutrans-experimental/gui/components/gui_world_view_t.cc:61: undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
build/default/simgraph0.o:(.data+0x28): undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
collect2: error: ld returned 1 exit status
make: *** [simutrans/sim] Error 1
root@438242:/usr/share/games/simutrans-experimental#

Also, I am still getting the graphical problems with the ticker.

Edit: Compiling with COLOUR_DEPTH=0 also fails in Windows/MSVC++:

Code: [Select]
1>simgraph0.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2019: unresolved external symbol "void __cdecl display_rezoomed_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_rezoomed_img_alpha@@YAXIIIFFCGHH@Z) referenced in function "protected: void __thiscall world_view_t::internal_draw(class koord,class ding_t const *)" (?internal_draw@world_view_t@@IAEXVkoord@@PBVding_t@@@Z)
1>.\simutrans\Simutrans-Experimental-debug.exe : fatal error LNK1120: 2 unresolved externals

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #62 on: June 19, 2013, 09:26:36 PM »
I have had a go at working out what the error might be, and have pushed a possible (but as yet untested) fix to my dingliste-cleanup branch. I cannot test this yet because of the above compile errors - could you look into fixing those? I am not quite sure how they come about except that it seems that the unresolved symbol relates to something ultimately found in simgraph16.cc but not simgraph0.cc - it is not clear why your dingliste changes should have caused this, though.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #63 on: June 20, 2013, 04:57:21 AM »
Ahh, thank you! I had started to wonder whether the cars might be an issue, since that is the only area of the code that is exclusive to Experimental that causes objects to be removed from the world at the end of the month. Can you put your testing code for using normal allocation for city cars on a separate branch for me to test?
Oh, that was easy -- I just commented out the "operator new" and "operator delete" for stadtauto_t, in both the header file and the .cc file.  I didn't bother to make a branch for that.

Quote
Incidentally, I cannot compile the Linux version of your dingliste-cleanup branch: I get:

Code: [Select]
build/default/gui/components/gui_world_view_t.o: In function `world_view_t::internal_draw(koord, ding_t const*)':
/usr/share/games/simutrans-experimental/gui/components/gui_world_view_t.cc:61: undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
build/default/simgraph0.o:(.data+0x28): undefined reference to `display_base_img_alpha(unsigned int, unsigned int, unsigned int, short, short, signed char, unsigned short, int, int)'
collect2: error: ld returned 1 exit status
make: *** [simutrans/sim] Error 1
root@438242:/usr/share/games/simutrans-experimental#

Also, I am still getting the graphical problems with the ticker.

Edit: Compiling with COLOUR_DEPTH=0 also fails in Windows/MSVC++:

Code: [Select]
1>simgraph0.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2001: unresolved external symbol "void __cdecl display_base_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_base_img_alpha@@YAXIIIFFCGHH@Z)
1>gui_world_view_t.obj : error LNK2019: unresolved external symbol "void __cdecl display_rezoomed_img_alpha(unsigned int,unsigned int,unsigned int,short,short,signed char,unsigned short,int,int)" (?display_rezoomed_img_alpha@@YAXIIIFFCGHH@Z) referenced in function "protected: void __thiscall world_view_t::internal_draw(class koord,class ding_t const *)" (?internal_draw@world_view_t@@IAEXVkoord@@PBVding_t@@@Z)
1>.\simutrans\Simutrans-Experimental-debug.exe : fatal error LNK1120: 2 unresolved externals
I'll go back and fix that.  Currently, however, I'm working on fixing the zero mail revenue, and uncovering all manner of lurking bugs.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #64 on: June 20, 2013, 10:02:09 AM »
Thank you for fixing this error: it now compiles, but on Windows I still get an access violation in line 147 of image_reader.cc for some odd reason.

Edit: Having somewhat odd problems testing the latest dingliste-cleanup branch on the Linux server: running it in non-server mode, it appears to freeze without crashing: it stops generating debug messages shortly after loading. Because it is not running as a server, there is no definitive way to test whether it is working or not. I do not know how to debug an infinite loop with gdb, if that is what the problem is. I tried running it as a server with gdb, and that seemed to work for a while, but it desynced after about one game hour: trying to connect again ended up resulting in the thing terminating with a SIGPIPE error (perhaps due to gdb making the server extremely slow?). I have yet to try a server without gdb, which might be the next step.
« Last Edit: June 21, 2013, 09:31:41 AM by jamespetts »

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4870
  • Languages: EN, DE, AT
Re: Command line server issues
« Reply #65 on: June 21, 2013, 02:18:59 PM »
Running in gdb should not slow-down much, as long as your are not using any conditional breakpoints. Pressing ctrl+c in the terminal will break the program and you can analyze the backtrace.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #66 on: June 21, 2013, 10:30:47 PM »
Hmm - if running in gdb does not slow things down much, I do wonder why I get a SIGPIPE. Hmm.

Edit: I can confirm that running this, even without GDB, I seem to get some sort of infinite loop after around 5 game hours on the upload-refresh.sve, which results in a desync for the client (which does not get the infinite loop) and an eventual SIGPIPE error for the server (which I am guessing is caused by a timeout in consequence of the infinite loop?).

Edit 2: Further testing seems to show that the original problem is not solved: I still get a crash at the month boundary. Have not had chance to run it with gdb yet to get a backtrace.
« Last Edit: June 24, 2013, 12:15:09 AM by jamespetts »

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #67 on: June 22, 2013, 07:39:50 PM »
A few things:
(1) I don't fully understand what's going on here, but I think there are some serious problems with dingliste_t management even in standard. 

In short, objects are deleted (in simwerkz.cc, in wkz_remover_t::wkz_remover_intern) but left in the list, so there are going to be bad references in the list. 

Tracing through the code, I don't see any call to loesche_alle to clean the list up.  Maybe I'm missing something.

This shouldn't be creating the problems which we're currently seeing, but I don't understand the memory management here, and I don't like that.

(2) The primary thing going wrong is probably in the citycar code, which appears to be writing to released memory (memory which has been "delete"d), and I haven't had a chance to even look into that code.  It seems to be quite complicated, and tracking down the memory management is going to be serious work.  Perhaps you can find someone else to do it...

(3) The imagereader.cc crash on line 147 is more understandable.  The command line server relies on a really awful hack when loading the image data, to avoid actually loading the image data, and I think the hack isn't fully implemented.  This is a bug present in standard.  It would be highly desirable to implement this more cleanly (in standard), but it will take quite a lot of work.

(4) I've been spending my time doing some more substantial cleanups of the revenue code in hopes of not only fixing the mail revenue bug, but of making it much more maintainable and comprehensible.  (This is over on my "mail-revenue" branch.  Don't merge it yet!)  This has led to some substantial work which I will mention elsewhere.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #68 on: June 22, 2013, 08:53:11 PM »
It could very weel be that some cleanup code is called during deletion, which passed the this pointer to other code (which is not valid at this point any more). Different compiler can be more forgiving on this. Any cleanup hence is in entferne().

All things are removed from the dinglist on deletion anyway, unless they have the "not_on_map" flag. If experimetnal did not touch this, then there are no remains. But should be very easy to add debug code for this; it will just slow down world deletion a little more.

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #69 on: June 22, 2013, 10:10:49 PM »
It could very weel be that some cleanup code is called during deletion, which passed the this pointer to other code (which is not valid at this point any more).
This does not seem to happen, at least not for fussgaenger_t (which is the first one I checked, out of several).
Quote
Different compiler can be more forgiving on this. Any cleanup hence is in entferne().
Which does not get called.

Quote
All things are removed from the dinglist on deletion anyway,
On deletion of *what*?  Sure, on deletion of the grund_t object, but that's not happening.

"grund_t::obj_remove" needs to be called to delete something from the dinglist. 

Instead, raw "delete" is being called.  This happens several times in wkz_remover_t::wkz_remover_intern. 

In some other places ::obj_remove *is* called.

I am pretty close to being convinced that this is simply a programming error.  I will write a patch.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #70 on: June 22, 2013, 10:36:11 PM »
Hmm - thank you all for your input on this: it is much appreciated. It is most annoying that it is so difficult to solve. Because this occurs reproducably in Experimental and is not known to occur in Standard, and because Neroden's investigations suggest that it occurs in the city car code (which has been changed in some places from Standard), I am inclined to think that this error is likely to be Experimental specific. I am minded to re-test by temporarily disabling end of month city car related code to see whether that has any effect. It is a shame that my previous attempt to fix this based on what I hoped was an informed guess as to what the problem is did not work, but, whether it fixes this or not, I think that I found and fixed a real bug, so that is progress at least. I shall have a go at testing to-morrow when less tired (spent the day testing the comfort of various early 20th century railway carriages at the Bluebell Railway, which was most pleasant, especially since it seemed to show that the values that I had guessed for Pak128.Britain-Ex were more or less correct).

Edit 1: Some further testing has shown simultaneously puzzling and promising results. Taking the original saved game that I used in my tests, which would regularly crash at the end of the month with the command line server build, I fast forwarded it with a Windows graphical client for a few months in an unrelated test, then took that saved game, and fast forwarded it to the end of the month so that I could test for crashes at the end of that month without having to wait for the whole month as with upload-refresh.sve. Bizarrely, I can no longer reproduce this crash with the new saved game: it goes over the month boundary without difficulty, and has now done so on two separate months.

Either this crash is idiosyncratic in the extreme, or there was some sort of corruption of the original saved game file which sorted itself out somehow when run for a few months on the Windows version.

It is difficult to know at this juncture quite what to do with this information.

Edit 2: Testing the same thing now for a longer period on the current dingliste-cleanup branch - I have now run from January to July in game time, with no trouble at the month boundaries at all. Indeed, not only does it no longer crash at the beginning of a new month, it also no longer desyncs (even with two clients connected, and even after regular loading/saving as a result of my script's regular calling of nettool's force-synce command).

The inference that I draw from that is that this commit fixed it, but that the original upload-refresh.sve contained corrupted data which had to be flushed out of the dingliste by running it in a non-affected system before these problems would no longer show on an affected Linux system.

That commit fixed a problem whereby, when private car objects were deleted at the end of a month by a routine called to delete private cars if there were too many of them for the level of traffic generated, then it would delete them by a different method than the standard method of deletion (setting current_list to NULL rather than setting time_to_life [sic] to 0). The trouble was that there was more than one place in the code where the car object would check whether time_to_life was zero, and I had not added the extra check for current_list being NULL in one place. It would not have made sense to check for that there in any event, so I simply reverted to deleting by setting time_to_life as NULL. Additionally, cars which were deleted for reasons other than being excess in number at the end of the month were not removed from current_list, with the result that these cars would be deleted again at the end of the month when their numbers were deemed excessive - quite possibly causing the double free errors that Nathaneal detected with the very helpful test involving disabling freelist. Indeed, it was these tests (diagnosis of double free connected to the city car code) that put me onto this bug in the first place.

This testing having been done, the next question is: should the next (and hopefully final, subject perhaps to considerations relating to the revenue cleanup and testing of that) release candidate be based on the dingliste-cleanup branch or not? My fix applies both to the branch affected by dingliste-cleanup and that not. I have so far only tested with the dingliste-cleanup branch: although I suspect that the results would be the same with 112.x-private-car-merge, that has not been tested. Is it worth doing that test, or is it safe enough to use dingliste-cleanup? Indeed, is there any advantage in doing so?

Edit 3: I wonder whether this has all taught us that either dingliste or freelist should try to detect double free errors, if possible? The trouble with using freelist is that things like Valgrind and Dr. Memory don't spot when double free (and possibly other) errors are occurring with them, so memory management errors can be hard to find. I don't know how either work well enough, however, to be able to know whether doing this is possible (1) at all; or (2) without overly compromising performance. (If it is possible but only at the expense of performance, it might be worth putting it behind conditional compilation code for debugging purposes).

Edit 4
: It got to nearly the end of August, when I manually stopped it, without crashing or desyncing at all.
« Last Edit: June 24, 2013, 01:56:59 AM by jamespetts »

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #71 on: June 24, 2013, 06:09:40 AM »
 
Quote
Is it worth doing that test, or is it safe enough to use dingliste-cleanup?
The current version of dinglist-cleanup should be safe enough.  It's basically just a cleanup of the code so that all of the functions run through the cases in the same manner (previously, each function in dingliste_t ran through the cases in a different manner, making it hard to tell whether they were all doing the right thing). 

Quote
Indeed, is there any advantage in doing so?

It should also be slightly faster.
(1) the old version had some very poorly thought-out logic related to "set_capacity"; that might have been optimized out, but it might not have been.
(2) the new version has slightly more carefully thought-out code for expanding the list and for recovering wasted memory, based on the patterns of usage which happen in-game.

----

While you're at it, the current version of mail-revenue is actually recognizing mail revenue and I think there are no bugs in actual revenue...

So you might want to merge it.  There are still several lurking bugs related to the display of items in the goods list, however, which I'd like a chance to fix first.

Quote
Edit 3: I wonder whether this has all taught us that either dingliste or freelist should try to detect double free errors, if possible? The trouble with using freelist is that things like Valgrind and Dr. Memory don't spot when double free (and possibly other) errors are occurring with them, so memory management errors can be hard to find. I don't know how either work well enough, however, to be able to know whether doing this is possible (1) at all;
Well, yeah.  Write freelist_t to check, on "putback", whether it's putting back something which is actually free already.
Quote
(2) without overly compromising performance. (If it is possible but only at the expense of performance, it might be worth putting it behind conditional compilation code for debugging purposes).

I'm not sure how much it will compromise performance.  Simutrans actually doesn't free its objects very often, particularly the small objects.  The performance hit would probably mostly be at the end of the month.  Or, I could be wrong, because slist_tpl uses the freelist, and it may be freeing frequently as slist_tpl items go out of scope.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #72 on: June 24, 2013, 09:51:12 AM »
Thank you for that indication - that is helpful. I shall await your mail revenue fixes with interest, in that case.

As to the freelist changes, should discussion about that perhaps go into the Standard development boards?

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #73 on: June 24, 2013, 10:26:57 AM »
freelist makes loading 300% faster! So this is really needed (and obviously some of the speed is because of not checking for the right size and some stuff.) But as mentioned earlier, you could set flags to debug freelsite even with valgrind. You could even add another 4 byte left of the pointer to include information about freeed, or size and generate errors. Very trivial to do this, an only a memory penalty.

And again: since you distribute experimetnal as release builts without a lot of other safegouards (asserts, range checks) whose impact is reather small compared to freelsit, I think it would make most sence to keep freelist.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #74 on: June 24, 2013, 10:43:26 AM »
I don't think that we were proposing to get rid of freelist entirely, but to consider adding checks.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10641
  • Languages: De,EN,JP
Re: Command line server issues
« Reply #75 on: June 24, 2013, 08:57:25 PM »
It will be about 10-15% increase in memory, which might be acceptable.

Offline jamespetts

  • Simutrans-Extended project coordinator
  • Administrator
  • *
  • Posts: 20776
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Command line server issues
« Reply #76 on: June 24, 2013, 10:03:34 PM »
Thank you - that is useful.

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4870
  • Languages: EN, DE, AT
Re: Command line server issues
« Reply #77 on: June 27, 2013, 08:04:12 AM »
This does not seem to happen, at least not for fussgaenger_t (which is the first one I checked, out of several).Which does not get called.
On deletion of *what*?  Sure, on deletion of the grund_t object, but that's not happening.
You might want to check the destructor ding_t::~ding_t(), which DOES the cleanup. And this IS called as soon as any such ding object is deleted.

I get the impression that you seem to assume that any code not written by yourself is a-priori full of bugs.

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4870
  • Languages: EN, DE, AT
Re: Command line server issues
« Reply #78 on: June 27, 2013, 09:48:37 AM »
(3) The imagereader.cc crash on line 147 is more understandable.  The command line server relies on a really awful hack when loading the image data, to avoid actually loading the image data, and I think the hack isn't fully implemented.  This is a bug present in standard.  It would be highly desirable to implement this more cleanly (in standard), but it will take quite a lot of work.
Thanks for reporting, seems that for old pak files the reader tries to read (and even write if unlucky) out of bounds...

Offline neroden

  • Devotees (Inactive)
  • *
  • Posts: 831
  • Nathanael Nerode
Re: Command line server issues
« Reply #79 on: June 29, 2013, 06:50:40 PM »
You might want to check the destructor ding_t::~ding_t(), which DOES the cleanup. And this IS called as soon as any such ding object is deleted.

I get the impression that you seem to assume that any code not written by yourself is a-priori full of bugs.
This is because it's usually true :-) 

(Except when I'm working with people who have *proven* their competence such as the GCC developers.  Or when the code is *self-documenting* or *properly commented*.  This is not to say that my code is free of bugs, obviously.)

Attempting to do this cleanup in the destructor creates race conditions.  I guess as long as simutrans is single-threaded, that doesn't matter, but.... it can also require the *entire map* to be searched.  Yeech.