OK, I'm getting about done with testing, but here's one more with the r3 patch. And let's throw in the 64bit too!
686 asm and x86_64 align represent the current out-of-the-box state of trunk.
LOW_LEVEL now works with the SSE instructions, and quite likes them.
x86_64 asm has the asm modified to work in 64bit, except in display_text_proportional_len_clip_rgb(), which stays on the USE_C path. Too much headache to convert text_pixel.c.
|display_color_img() with recolor:||6000000||iterations||took:||9160||6425||4409||9485||6487||3763||8960||4912||4747||2622|
|view->display(true) and flush||2000||iterations||took:||4153||4128||3685||4345||4299||4173||3900||3680||3593||3314|
display_img_nc() - the asm is clearly outdated. LOW_LEVEL is preferred in all cases and should be the new default. The high-level code could be kept behind a HIGH_LEVEL define in case some future compiler update breaks the low-level path.
display_scroll_band() - this is actually not tested in any of the -times tests. Since it only applies to the scrolling ticker, which is quite small, performance here isn't something to particularly care about IMHO. Good enough to just drop the asm for the USE_C code.
display_text_proportional_len_clip_rgb() - asm not helping. LOW_LEVEL is slower. Suggest keeping the high-level path as the only one. EDIT: Conroe is painting a different picture, asm still good, followed by low. Annoying.
display_fb_internal() - the problem child! For a 686 target, no 'C' construct has yet been discovered that approaches the speed of the asm. Unless one can be found, the asm should stay. Once you go to P4, the high level basically matches the asm. You need AVX instructions to start beating the asm. Suggest keeping the asm (and enabling it for 64bit), with HIGH_LEVEL as an option.
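To illustrate the "keep the other path behind a define" idea from the notes above, here is a minimal sketch. It is not the code from the patch: `copy_pixels` is a made-up helper, the `PIXVAL` typedef is only borrowed as a name for a 16-bit pixel, and the low-level variant shown (two pixels per 32-bit access) is just one plausible shape such a path might take.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef uint16_t PIXVAL; // 16-bit pixel, name used for illustration only

// Hypothetical switch: build with -DHIGH_LEVEL to fall back to the plain loop.
#ifdef HIGH_LEVEL
// High-level path: one pixel per iteration, the compiler decides how to optimise.
static inline void copy_pixels(PIXVAL *dst, const PIXVAL *src, size_t count)
{
	assert(dst != nullptr && src != nullptr);
	for (size_t i = 0; i < count; i++) {
		dst[i] = src[i];
	}
}
#else
// Low-level path (default): move two pixels per 32-bit access,
// then handle a possible odd trailing pixel.
static inline void copy_pixels(PIXVAL *dst, const PIXVAL *src, size_t count)
{
	assert(dst != nullptr && src != nullptr);
	size_t i = 0;
	for (; i + 2 <= count; i += 2) {
		uint32_t pair;
		std::memcpy(&pair, src + i, sizeof pair); // memcpy avoids unaligned/aliasing UB
		std::memcpy(dst + i, &pair, sizeof pair);
	}
	if (i < count) {
		dst[i] = src[i];
	}
}
#endif
```

Both variants have the same signature, so callers do not change when the define is flipped, and the two builds can be compared directly in the -times tests.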
EDIT: Conroe is the oldest CPU I still have functional, even then it was retired last year after 10 years of service. It was really starting to drag in everyday tasks. Don't know how many old clunkers people are still trying to run Simutrans on...
|display_color_img() with recolor:||6000000||iterations||took:||13541||9286||8757||13990||9968||11413||13413||12721||12481||9764|
|view->display(true) and flush||2000||iterations||took:||10849||10545||9964||11397||11007||11323||10531||10785||10894||10067|
This is meant only for testing on little endian platforms. It will currently break on big endian platforms. If this solution is determined viable then a macro could be used to support both system endian types.
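As a rough idea of what such a macro could look like, the sketch below packs two 16-bit pixels into one 32-bit word so that a single store writes them in memory order on either byte order. `PACK_PIXEL_PAIR` and `store_pixel_pair` are hypothetical names, not anything from the patch, and the detection relies on the `__BYTE_ORDER__` macros predefined by GCC and Clang, assuming little endian when they are absent.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Byte order detection via GCC/Clang predefined macros.
// Assumption: compilers that lack them (e.g. MSVC on x86) target little endian.
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
// Big endian: the high half of the word lands at the lower address.
#define PACK_PIXEL_PAIR(first, second) (((uint32_t)(first) << 16) | (uint32_t)(second))
#else
// Little endian: the low half of the word lands at the lower address.
#define PACK_PIXEL_PAIR(first, second) (((uint32_t)(second) << 16) | (uint32_t)(first))
#endif

// Writes `first` to dst[0] and `second` to dst[1] with one 32-bit store.
static inline void store_pixel_pair(uint16_t *dst, uint16_t first, uint16_t second)
{
	assert(dst != nullptr);
	const uint32_t pair = PACK_PIXEL_PAIR(first, second);
	std::memcpy(dst, &pair, sizeof pair);
}
```

The calling code stays identical on both platform types; only the macro body differs.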
Oh, for GCC I found these flags which might be worth messing around with... Specifically, comparing things like rep_4byte with unrolls might be a good idea.
I think big endian support is broken for a long while anyway - not 100% sure.
Applying attributes to target specific optimizations at each function can work well. But, you'd need to revisit their effectiveness quite often as compiler versions change. The biggest problem IMHO is the code clutter. If MSVC support is to be maintained, each attribute needs wrapping in #ifs (and IIRC their required placement doesn't mesh well with the macros currently used for this).
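For a sense of the clutter involved, one way the wrapping might look is sketched below. `TARGET_SSE2` and `fill_pixels` are invented for this example and are not existing Simutrans macros or functions; the GCC/Clang `target` function attribute is real, but whether this placement fits the macros currently in the code base is exactly the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper: expands to a per-function target attribute on
// GCC/Clang for x86, and to nothing elsewhere (MSVC, other architectures).
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#define TARGET_SSE2 __attribute__((target("sse2")))
#else
#define TARGET_SSE2
#endif

// Only this function is compiled with SSE2 allowed; the rest of the
// translation unit keeps whatever -march the build specified.
TARGET_SSE2
static void fill_pixels(uint16_t *dst, uint16_t colour, size_t count)
{
	assert(dst != nullptr || count == 0);
	for (size_t i = 0; i < count; i++) {
		dst[i] = colour;
	}
}
```

Every function tuned this way needs its own annotation, and each one would have to be re-benchmarked as compiler versions change.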
The graphics code in Simutrans-Extended is unchanged from Standard, so a similar result is likely to obtain in Standard if -march=pentium4 is defined.
The 120.1.1 and 120.1.2 releases I did last year were both built with -march=pentium4. Once -mstackrealign was added, I don't believe there's been widespread crashing. Also, self-compiled Sim-Ex with -march=pentium4 works for me; your executable did indeed crash on zooming.
EDIT: Conroe finally finished, results added. Remind me not to run tests with the iterations set for reasonable times on much faster computers!