News:

Simutrans Chat Room
Where cool people of Simutrans can meet up.

compiled with -O1 64bit binary crashes on startup

Started by sdog, December 30, 2010, 04:55:19 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

sdog

When compiling 9.1 from the github 9.x repository with umpteen optimisation (-O1 ... -O3) the binary crashes immediately after startup.


Program received signal SIGSEGV, Segmentation fault.
0x00000000005ff03b in display_fb_internal (xp=333, yp=233, w=186, h=12,
    color=<value optimized out>, dirty=<value optimized out>,
    cL=<value optimized out>, cR=<value optimized out>,
    cT=<value optimized out>, cB=<value optimized out>) at simgraph16.cc:3131
3131 *lp++ = longcolval;
(gdb) bt
#0  0x00000000005ff03b in display_fb_internal (xp=333, yp=233, w=186, h=12,
    color=<value optimized out>, dirty=<value optimized out>,
    cL=<value optimized out>, cR=<value optimized out>,
    cT=<value optimized out>, cB=<value optimized out>) at simgraph16.cc:3131
#1  0x00000000005ff154 in display_fillbox_wh_clip (xp=90, yp=5170, w=5170,
    h=1, color=<value optimized out>, dirty=92) at simgraph16.cc:3165
#2  0x0000000000491c03 in button_t::zeichnen (this=0xa537f0,
    offset=<value optimized out>) at gui/components/gui_button.cc:480
#3  0x000000000050216a in gui_file_table_button_column_t::paint_cell (
    this=0xa537d8, offset=..., row=...) at gui/savegame_frame.cc:590
#4  0x00000000004a62c2 in gui_table_t::paint_cells (this=0xa53210,
    offset=<value optimized out>) at gui/components/gui_table.cc:167
#5  0x00000000004a6838 in gui_table_t::zeichnen (this=0xa53210, offset=...)
    at gui/components/gui_table.cc:350
#6  0x00000000004a3f07 in gui_scrollpane_t::zeichnen (this=0xa53410, pos=...)
    at gui/components/gui_scrollpane.cc:180
#7  0x00000000004d1713 in gui_container_t::zeichnen (
    this=<value optimized out>, offset=...) at gui/gui_container.cc:223
#8  0x00000000004f880c in pakselector_t::zeichnen (this=0x5a, p=...,
    gr=<value optimized out>) at gui/pakselector.cc:85
#9  0x0000000000595407 in modal_dialogue (gui=0xa52c20,
    welt=<value optimized out>, quit=<value optimized out>) at simmain.cc:238
#10 0x0000000000595e43 in simu_main (argc=1, argv=0x7fffffffe258)
---Type <return> to continue, or q <return> to quit---
    at simmain.cc:717
#11 0x000000000060818a in main (argc=1, argv=0x7fffffffe258) at simsys_s.cc:770


This is the same behaviour as the pre-compiled linux 64 binary for 9.0 shows. The 32 bit binary runs without faults. I remember this from standard quite a while ago, but the most recent nightly 64 bit binary works flawlessly. (i don't know however with what optimisation it was compiled.

Btw, what's with 64 bit windows binaries?

jamespetts

Hmm. 64-bit crashes are hard to debug for me, as I only have a 32-bit machine, and therefore can't test them. I am about to build new computer, however, which will be 64-bit (just waiting for one last component to arrive), so this might be easier in a short while.

From what I can tell from the backtrace, the problem does indeed originate in a Simutrans-Experimental specific piece of code, being some GUI code written by Bernd Gabriel a while ago for the loading and saving game window (which is also used, I think, for the pakset selector). The method is:


// BG, 26.03.2010
void gui_file_table_button_column_t::paint_cell(const koord &offset, coordinate_t /*x*/, coordinate_t /*y*/, const gui_table_row_t &row)
{
gui_file_table_row_t &file_row = (gui_file_table_row_t&)row;
koord size = koord(get_width(), row.get_height());
btn.set_groesse(size);
koord mouse(get_maus_x() - offset.x, get_maus_y() - offset.y);
if (0 <= mouse.x && mouse.x < size.x && 0 <= mouse.y && mouse.y < size.y){
btn.set_typ(button_t::roundbox);
}
else
{
btn.set_typ(button_t::box);
}
btn.pressed = pressed && file_row.pressed;
btn.zeichnen(offset);
}


The crash actually occurs in a bit of Standard code in simgraph16.cc, here:


/**
* Zeichnet gefuelltes Rechteck
*/
static void display_fb_internal(KOORD_VAL xp, KOORD_VAL yp, KOORD_VAL w, KOORD_VAL h, int color, int dirty, KOORD_VAL cL, KOORD_VAL cR, KOORD_VAL cT, KOORD_VAL cB)
{
if (clip_lr(&xp, &w, cL, cR) && clip_lr(&yp, &h, cT, cB)) {
PIXVAL *p = textur + xp + yp * disp_width;
int dx = disp_width - w;
const PIXVAL colval = specialcolormap_all_day[color & 0xFF];
const unsigned int longcolval = (colval << 16) | colval;

if (dirty) mark_rect_dirty_nc(xp, yp, xp + w - 1, yp + h - 1);

do {
unsigned int count = w;
#ifdef USE_C
uint32 *lp;

if (count & 1) *p++ = colval;
count >>= 1;
lp = (uint32 *)p;
while (count-- != 0) {
*lp++ = longcolval;


That last line is the line that triggers the crash; the pointer "lp" must be defective in some way, either NULL, deleted, or otherwise out of range. This code is identical in Standard.

The odd thing is, it does look a little suspicious from a platform compatibility perspective. The line "*lp++ = longcolval;" assigns "longcolval" (an "unsigned int" type) to the value of lp, an unsigned 32-bit integer. I wonder whether, on certain platforms, an "unsigned int" might be treated as 64-bit? (It's odd, however, that this works in Standard).

To test this, can you change the declaration of "longcolval" from "unsigned int" to "uint32" and see whether that makes any difference to your crash when you've compiled? Thank you very much for reporting this.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

longcolval does not matter. unsigned int is eitehr 32 bit or 64 bit. In the latter case the superflous bits are cutted during assignment. (Having a wrong value there woudl anyway only generate wrong colors.)

Ters

It looks much like the trouble I had with display_fb_internal when compiling standard simutrans for my 64-bit machine (http://forum.simutrans.com/index.php?topic=3566.0). Which compiler (version) is being used to create the crashing executables?

sdog

sorry forgot to write down the compiler version
gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)

jamespetts

Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

There is a patch in the thread I linked to. I just got a conflict when I upgraded the source code on my 64-bit machine to the latest revision, and either myself or subversion messed up trying to resolve it so I had to revert all my changes. Not sure if the patch will still apply without human intervention.

Tried to compile and run the unmodified code, and Simutrans standard still crashed on startup (the select pak screen) on my 64-bit Linux machine. Reapplied my changes to display_fb_internal manually by looking at the patch, and now it's working again.

There is a second fix (actually, it's the first) in the patch that I have not reapplied after reverting. I don't remember what triggered that crash.

The core idea of my fix is simply to unoptimize the code. Instead of casting the pointer to operate on 32-bit, I changed the loop to do 16-bit at a time. GCC 4.3 and later will apply it's own optimizations at compile time, turning out code that does 128-bit at a time, if I remember correctly.

jamespetts

Ters,

that's very interesting. Will other compilers (such as MSVC++ and MingW) perform this sort of optimisation, too, or do we need conditional compilation directives to apply your de-optimisation only to GCC?

Given that this is a Standard issue, I'm transferring this to the Standard board.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

However the 64bit nightlies work well ... I would rather say this is a compiler bug in GCC 4.3 or up.

jamespetts

Perhaps there needs to be some means of compensating for the compiler bug? If Ters is right about the optimisation, would it not be better to use his system in any event?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

The trouble with applying my patch is that executables compiled with older compilers might get a significant performance drop. That can possibly be solved by using #ifdef, but that means that there will be three versions of display_fb_internal to maintain and test.

jamespetts

There being an extra version might be worthwhile; but what #ifdef can detect a GCC compiler automatically; or would it have to be manual?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

The problem is not confined to 64-bit executables. I managed to reproduce the crash by compiling a 32-bit executable with -march=pentium4 using Mingw32 gcc 4.4.

jamespetts

Oh I see. Interesting. Hmm. Is there an #ifdef that can detect GCC automatically, though, or do we need to tell all people compiling with GCC to add -DGCC to their makefiles if there's going to be conditional compilation?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.


jamespetts

#15
Hmm - this creates complexities. I'm not sure how to go about making the following part of the patch conditional:


@@ -1597,7 +1597,7 @@
// jetzt kommen farbige pixel
runlen = *sp++;
#if USE_C
-#if 1
+#if 0


That would require nested #ifdefs, which the precompiler does not seem to cope with. What exactly is the purpose of #if 1 and #if 0 in any event? Are they symbols that can be defined, or do they effectively mean "if true" and "if false"?

Edit: Never mind - I've found the answer. Can you try compiling from my latest devel branch commit and see if it works?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

TygerFish

James linked me in from here: http://forum.simutrans.com/index.php?topic=6478

sdog: Did you try with -O2 as well?  I remember coming across this recently: http://forum.simutrans.com/index.php?topic=2412.0

If there's any way I can help test as well, I'm happy to, but I still can't figure out how to compile the project under Linux...

prissi

#17
What I do not understand: display_img_nc uses a very simular code (copying uint32 instead unit16 ... )

Anyway, for 32 bit code, USE_C is usually undefined ...

Ters

My patch changes display_img_nc too (by inverting the #if), but the two functions only crash if xp is odd. If display_img_nc is usually called with an even value for xp, then nothing bad happens.

If I understand correctly, the changes have been applied to a branch of simutrans experimental. Unfortunately, I have only used simutrans standard and lack the time to set up experimental, especially on my 64-bit machine.

Dwachs

#19
What I do not understand is why the pointers ls and lp should have the same alignment in this part of display_img_nc:

ls = (const uint32 *)sp;
ld = (uint32 *)p;
runlen >>= 1;
while (runlen--) {
*ld++ = *ls++;
}

It could be that sp points to 0x...04 whereas p points to 0x...02, increasing both pointers at the same time will not change that. I think we should stick to copying 16bit values here (or simply use memcpy as the original patch of Ters).

Edit: The same holds for display_fb_internal:

PIXVAL *p = textur + xp + yp * disp_width;

Why should p be aligned to a 4-byte address as the code afterwards assumes?? textur is uint16*, hence aligned to 2-byte addresses. The alignment of p depends on the alignment of textur, and the values of xp,yp,disp_widht.

Edit2: If these routines have to copy an odd number of 16bit values, it is not guaranteed that every time the 16bit value 'that is too much' always comes first when copying, which is another assumption in these routines.
Parsley, sage, rosemary, and maggikraut.

Ters

Quote from: Dwachs on January 01, 2011, 01:32:45 PM
Why should p be aligned to a 4-byte address as the code afterwards assumes??

I am not sure the code assumes 4-byte alignment, as non-aligned memory access is legal, though slower, on x86. When optimizing the code, newer versions of GCC do however assume that uint32* points to a 4-byte aligned address. I am not sure if this is intended or not.

prissi

@Dwachs:
p++ depends on pointer size: It is +2 for unit16 and +4 for uint32. The actual address is correct at the end of copying. Memcpy is about 50% slower, and make really an impact in profiling, as this is one of the most often called routines. Even a simple

while (runlen--) {
*p++ = *sp++;
}

was faster than memcpy.  (Furthermore since 64 bit is slower than 32 bit simutrans, and we compile for plain pentium only, issue persists only for self compile. I added a switch for conditional compiling.)

@Ters:
p is aligned to random values, also target and source can be easily misaligned, i.e. source at 4 byte and target at 2 byte border in display_img_nc. Thus not crashing there (and also not with the 64 bit code compiled on MSVC or nightly linux server) really points to an error of the GCC optimizer in the higehr 4.x versions. Could you post the assembler output of the routines after compiling? (gcc -S for simgraph16.cc) That way I could submit a bug to GCC.

Assuming uint32 is 4 byte aligned is not valid on x86 with windows, as parts of the OS could return unalinged DWORD addresses, especially drivers or when working with parts of bitmaps etc.) This would break very much of existing code. Btw. is there a statement somewhere stating this explicitely?

The only option I found was "-mpreferred-stack-boundary=num"

Ters

It is my impression that developers like a smallest possible example that reproduces the problem, so I wrote a small example based on display_fb_internal. Interestingly, the example program does not crash if I put the calling and the called method in the same c-file, but I get serveral warnings about dereferencing lp. Even if I split them into two files, it does not crash for all odd values for offset. It crashes for 1, but not for 3. The crash occurs at a movdqa instruction, just like simutrans.

To produce a crashing executable, I used gcc -Wall -O3 -march=pentium4 test.c test2.c -o test.exe

test.c is the crash producing method
test.s.txt is the optimized assembly (.txt to avoid attachment rules)
test2.c contains main calling the crash producing method, first without crashing and then causing a crash

Hanczar

I came across this problem ( crash after compiling x86-64 version ) today , and nice to see that problem is already diagnosed, with patch to workaround it , thanks :)

I want to add that I think it's not compiler fault, only code fault ( I appreciate that it improves performance with some compilers/platforms ).
As this code behaviour which crashes is undefined (breaking alignment rules and aliasing too) in C/C++. As
undefined code it can work or not - especially on most combinations compiler/version with x86 arch works.
It's not true that on x86 it's legal, because it's not say in any C/C++ specification that on x86 it's legal. It's compilator task to generate adequate instructions and on x86-64 ( and x86 with sse ) not all instruction are alignment-safe.

I found in gcc bugzillia with discussion about similar issue:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35653

and part from C spec which I think is about this case:
Quote
Among the
invalid values for dereferencing a pointer by the unary * operator are
a null pointer, an address inappropriately aligned for the type of
object pointed to, or the address of an object that has automatic
storage duration when execution of the block in which the object is
declared and of all enclosed blocks has terminated.

Dwachs

@prissi: Given the discussion in the link above, we should replace these offending lines by simple copying 16bit-values in a loop. The compiler will optimize this and copy more bytes in one instruction. The intructions that triggers the crash (movdqa) moves 128bit-values after all, but requires aligned data.
Parsley, sage, rosemary, and maggikraut.

prissi

Imho this is still a compiler bug (at least not giving a warning).

But I already commited a fix uncommenting this for GCC >4.2.x and/or manually if ALIGN_COPY is defined.

Ters

There is a bug in display_img_nc (patch attached). Other than that, the GCC 4 stuff seems to work fine. I have now started to optimize my 32-bit windows executable for pentium 4 with USE_C defined since it appears to run smoother.

jamespetts

The latest version of Experimental (9.2) with these fixes (thanks to all who have worked on them) is now pushed to the Master Github branch, and should be compiled automatically this evening; any reports on the stability of this version would be much appreciated!
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

sdog

#28
Compiles and works flawlessly for me now with -O3 compiler parameter. Thanks for all the effort.

edit
Testing simutrans-experimental 9.2, i think i have a massive performace hit. However there is also a new path-finding algorithm, and possible outside factors on my machine. The performance also improved considerably quite a while after loading a massive save-game.

prissi

Experimental is much more CPU intensive than standard, since features in Experimental were not included in Stadard for exactly the reason of being too CPU intensive for their gain in gaming.

jamespetts

SDog,

the performance issue on your saved game "156" is not specific to 9.2: I can reproduce it with my 9.0 Windows binary. It is caused by the private cars searching for their routes. Currently, private routes between cities are not saved, so they all have to be recalculated when the game restarts. Once most of them have been recalculated, things then settle down until the game is next reloaded.

In my initial testing, I had not found this to be as much of a performance issue as it is transpiring to be, and I have since been thinking of ways to reduce its impact, but this issue deserves a separate thread, I think.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

sdog

James,
i think there is more to the performance loss than just the path-finding for the car-ownership model.
I've just tested 64 self compiled 9.2 with -DNDEBUG and -O3 and your 9.1 32 bit binary. In both a performance drop is noticeable shortly after loading a map. This brings down FPS to about 10 to 14 in the 9.1/32 and 4 in 9.2/64 the latter measured only after 30 to 60 seconds as the display dialog did lag too much. After about one more minute 64 seems to be done with those computations and the FPS recover to about 8 to 12. In 9.1/32 i have steady 16 FPS.

The precompiled 32 bit version of 9.2 shows the same behaviour as the 64 bit version mentioned above.

To conclude, 9.2 has much lower performance than 9.1, but it does not depend on the binary being compiled for 32 or 64 bit.

Tested in 64 bit ubuntu 10.10, compiled with gcc 4.4.4-14. Binaries tested with Pak128.Britain-Ex 7.1 with my somewhat complex map to be found in the other thread.

jamespetts

Interesting test result; but why do you think that this is not due to the car route finding?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

sdog

Car route finding is present in 9.1 already, isn't it?
So it can not solely be responsible.

My hunch is route finding in general as it influences both performance after loading and normal operation.

prissi

The new centralized route search system seems most likely, as route search is anyway the most compute intensive task in standard (apart from drawing).