News:

Simutrans Tools
Know our tools that can help you to create add-ons, install and customize Simutrans.

[r2785] Crash with GCC 4.3

Started by Ters, October 17, 2009, 07:37:03 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Ters

While I'm in a bug reporting mode, I might as well dig up another. When compiled with GCC 4.3 for x86-64, simutrans crashes with a segmentation fault at line 2339 in simgraph16.cc, on a movdqa instruction to be precise. GCC 4.1 does not cause any problems. I replaced that code block (line 2333-2341) with a simple call to std::fill, and the problem went away.

My guess is that the coder and the compiler are outsmarting each other at trying to optimize the loop in question, and the casting of pointers result in a misaligned memory access. I think it is best to drop this manual optimization and let the compiler do the job, as they have gotten quite good at it. They will then scale with the hardware and copy eight PIXVAL values at a time on processor with SSE or similar, compared to just two with the manual loop optimization. The code will also become more readable. Unless you insist on supporting old compilers less capable of optimizing well.

prissi

#1
SInce it is build for WIndows (80%+ user base) with GCC 3.x as not GCC 4.x is available as production stable, I would like to keep it. For 64 code, the USE_C switch is mandtory anyway, since the instructions of the assember are 32 bit code, and the address of the arrays could be anywhere with 64 bit code (and should be anywhere, using a decent OS).

I did a lot of optimisation and profiling (of course using non SSE code, as everybody should be able to run this, even my vinatge laptop without; and also since GCC almost never makes use of it). I applying assemble only where neccessary. But again, for ARM, PowerPC or x64 you need -D USE_C for compilation. Before GCC 4.x did not even compile, when I tested last time withotu USE_C.

Ters

GCC 3.x can keep the assembly code, I was focusing on the C/C++ code you get when you define USE_C. It seems to be tailored for the same architecture as the assembly code anyway.

And public binary releases can target older processor when compiling and therefore contain no SSE instructions, but it would be nice if one could compile executables for newer architectures if one wanted.

By the way, I think I got GCC 4.something to compile without USE_C when targeting 32-bit x86.

prissi

Yes, for 32bit wihtout USE_C should work; but for x64 USE_C is mandantory. And since I compile for BeOS too, std is very incomplete there. Also it is still 2.96x very of the GCC compilere with int!=signed int and so on. So this was the most save construct for them all.

Ters

I understand. Nevertheless, I attach a patch here for those who run in to the same problem I did. The call to std::fill is replaced with a simple loop doing the same job so as not to introduce a dependency on the C++ standard library. Both result in similar, if not the same, optimized assembly code. I also changed another function containing a similar potential problem. This should make the code friendlier to architectures not as forgiving when it comes to unaligned memory access as the x86.

prissi

It works and compiles on arm, where unalinged access causes bus error. Thus the optimisation of two words at once is save, since the variables are 2 Byte each for the screen. Odd address cannot occur, only by programm errors.

And word copy was 20% slower compared to long copy as C-code when profiling. Thus I think I will keep it this way, since I feel this is the best compromise between speed and size. Sorry.

Ters

I was not requesting that this be included in trunk, so that's okay. And copying two words at once might be safe, but GCC 4.3 turned it into a 16-byte copy. I am not even sure misaligned memory accesses were to blame, it was just my best guess as the code otherwise looks safe. It might be a compiler bug. Whatever the cause, this patch fixed my problem, even if it is voodoo debugging.

Here is the call that crashes, in case it may tell whether something else is wrong (the odd xp looks dangerous to me):
display_fb_internal (xp=189, yp=237, w=358, h=102,
color=<value optimized out>, dirty=0, cL=<value optimized out>,
cR=<value optimized out>, cT=<value optimized out>,
cB=<value optimized out>)


It is called from the pakselector, so it is possible that this bug only arises for those who have multiple paks, but have not set which to use in simuconf.tab.

prissi

Yes, but xp is a coordinate in PIXVAL (which is a 16 bit variable, so the memory accessed is still even) Thus it should not matter. Turning this into 16 Byte copy by GCC is another piece of cake, although).

Ters

Even, yes, but not a multiple of four, which the compiler might assume that pointer to uint32 is. When code generated for 4-byte aligned memory accesses is fed a pointer that is only 2-byte aligned, bad things might happen.

prissi

Are there machines who want 4 byte aligned pointers? I know of some SSE commands, that assumes 16 or 256 alignment, but 4 byte was new to me. Odd adresses was a big problem on 68000 so I am aware of this.

Ters

I don't remember any names, but I rembember learning that some architectures require that memory addresses are aligned to the size of the data type being used. That means that int16 pointers must be even, int32 pointers divisible by four, and so on. I have even read that non-aligned memory access is bad even on x86, because the processor will use two bus cycles to fetch the non-aligned data. Modern processors have however become so complex that I will not even try to predict if this will have a speed impact in this case. If your profiling indicates that the current code is the fastest for the targetted architecture(s), then it is.

prissi

Ok, getting into very vintage computers here. The old IBM I think had 36 bit words, while char was 9 bit. But and char pointer was then a 36 bit pointer. Also I know some (more recent) DSP I programmed for, that all data needs to be either 20 bit 40 bit or 48 bits, with the latter two put into five 20 bit words and two bits unused. There each sub quad needed to be accessed in the correct order. But I doubt any of those will ever need to run simutrans, as this may break the code in some other places too. Apart from that only ARM and MC68k have issues with odd addresses.

Modern (insanely long pipelined) CPU like pentium have usually 16 (or even 256) Byte alignement for code (there is even a special command to have all routines aligned there) since one bus burst usually reads 16 bytes. And thoses are put into the cache too. So the slowest operation would be an or to a 16 byte word at and odd address (modulo is 15), which usually results in reading two 16 bytes blocks and then wrting two bytes. ARM does this better, since it forbids such operations at all ;)