
Changing to new CPU compiler setting

Started by Ters, September 12, 2012, 04:46:10 AM


Ters

I was thinking of suggesting pentium4 as the CPU to target, too. If we do that, we can possibly drop unicows from the dependencies, as that means we've probably moved beyond Win95/98/ME (ME might possibly support some SSE, but it's a horrible OS anyway).

prissi

Is there any real evidence for the claim that SSE is used by optimizers? As far as I know, less than 1% of larger code uses those instructions, and often not in a way that brings much performance. Especially since MMX is not much different for small data sizes.

And the image drawing code is already hand-optimized (at least on the 32-bit builds) and was faster than using MMX opcodes.

When we support IPv6, we can drop unicows anyway, as that requires Windows XP. If you show me a decent speedup for pentium3 versus pentium or pentium2, well, we could set this target too.

I just remembered when I used MOVSD (or so, forgot which) and it crashed for a user with an ancient Pentium. I mean, Simutrans is played in Indonesia and can run on Geode processors.
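The crash described above is what happens when a binary executes an instruction the CPU does not implement. A minimal sketch of a runtime check follows, assuming a GCC or Clang build on x86 (the `__builtin_cpu_supports` builtin is available from GCC 4.8 on); the function name `cpu_has_sse2` is hypothetical and not part of Simutrans.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether the running CPU supports SSE2.
 * Assumes GCC or Clang on x86; on other compilers or architectures
 * it conservatively returns false. */
bool cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();                       /* set up CPU feature data */
    return __builtin_cpu_supports("sse2") != 0; /* non-zero means supported */
#else
    return false;
#endif
}
```

Such a check only protects code that is compiled separately with the newer instruction set. If the whole program is built with -march=pentium4, the compiler is free to emit SSE instructions anywhere, including before the check runs.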


TurfIt

When I was playing around with the 'USE_C' image drawing routines, GCC's optimizer definitely spat out SSE instructions - just compile to assembly and look. I think it was >60% faster for pentium4 vs pentium instructions - it's not just SSE being enabled. Note: pentium isn't even enabling MMX. I didn't try MMX vs SSE. Of course, for most builds the brute-force .asm code is used, which is the fastest. (And pity those systems that fall into the 'memcpy' fallback - I think you can watch the pixels being drawn using that!)
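To try the "compile to assembly and look" check on something small, a loop like the one below can be used. This is only an illustrative sketch, not the Simutrans 'USE_C' code: the function name `blend_half_rgb565` and the RGB565 pixel layout are assumptions. Whether SSE instructions actually show up depends on the GCC version and optimization level.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example loop: 50% blend of two rows of 16-bit pixels,
 * assuming an RGB565 layout. The mask 0xF7DE clears the lowest bit of
 * each channel so the halved values cannot spill into a neighbouring
 * channel.
 *
 * To inspect the generated code (on a 64-bit host, -m32 is needed for
 * the 32-bit -march values):
 *   gcc -O3 -m32 -march=pentium  -S blend.c -o blend_pentium.s
 *   gcc -O3 -m32 -march=pentium4 -S blend.c -o blend_p4.s
 * and compare the two .s files, e.g. for xmm register usage. */
void blend_half_rgb565(uint16_t *dst, const uint16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint16_t)(((dst[i] & 0xF7DEu) >> 1) +
                            ((src[i] & 0xF7DEu) >> 1));
    }
}
```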

I can't find my profiling results, but I remember quite significant gains across the board ditching the pentium target. I could repeat them...
Even removing -fno-schedule-insns is good for 0.5% - why's that there?

Really, the P4 is from 12 years ago; it should be called legacy, never mind the Pentiums. IMHO it's reasonable to set the default target to pentium4 and possibly have a legacy pentium build.

prissi

"-fno-schedule-insns" was way faster when enabled. This was rather a bug of the 3.4.x series of GCC, which until very recently was building the releases ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11488 ). Again, testing the same stuff on two machines gave very different results.

Could you qualify 0.5% faster? What did you run, the one month yoshi87-102 game profiling? What target, OS and compiler did you optimize for?

In my tests I found that MSVC actually produced slightly faster code for Simutrans using the release switch for my test games. I would be very interested in the difference, just using the -times switch after starting, between different target architectures on your CPU using GCC. I will do the same at home. (By the way, this is being written on an old Pentium II 450 MHz, which is the control computer for the machine in our lab. You know, ISA bus and so on ...)

TurfIt

Rather heading off the nightly topic....

instruction scheduling bug: I guess that's the problem with working around a bug: when it gets fixed, the workaround remains. Not doing this optimization should really hurt CPUs with long pipelines - like the pentium4.

0.5% - was testing last December, can't remember specifics. 0.5% is insignificant, rather almost lost in the profiling noise. It just seemed strange to be purposely turning off an optimization for no apparent reason, and that actually had an almost measurable effect. Target would be SDL, MinGW, Win7, GCC 4.6.2, on my old conroe Core2.

-times: show_times() needs more iterations to get meaningful results...

Message: test: display_img(): 300000 iterations took 21 ms
Message: test: display_color_img(): 300000 iterations took 18 ms
Message: test: display_color_img(): next AI: 300000 iterations took 18 ms
Message: test: display_color_img(), other AI: 300000 iterations took 18 ms
Message: test: display_flush_buffer(): 300 iterations took 4 ms
Message: test: display_text_proportional_len_clip(): 300000 iterations took 387 ms
Message: test: display_fillbox_wh(): 300000 iterations took 430 ms
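With totals of only 18 to 21 ms, timer resolution and noise make up a noticeable share of each result, which is why more iterations are needed. A rough sketch of one way to handle this is to keep doubling the iteration count until a run lasts long enough to measure. This is not how show_times() is implemented; all names below are hypothetical, and clock() measures CPU time with platform-dependent resolution.

```c
#include <limits.h>
#include <time.h>

/* Hypothetical callback type for the routine being measured. */
typedef void (*bench_fn)(void *ctx);

/* Runs fn the given number of times and returns the elapsed CPU time in ms. */
double time_iterations_ms(bench_fn fn, void *ctx, long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        fn(ctx);
    }
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Doubles the iteration count until one run takes at least min_ms,
 * then returns the average cost per call in nanoseconds. */
double ns_per_call(bench_fn fn, void *ctx, double min_ms)
{
    long iterations = 1000;
    for (;;) {
        double ms = time_iterations_ms(fn, ctx, iterations);
        if (ms >= min_ms || iterations > LONG_MAX / 2) {
            return ms * 1e6 / (double)iterations;
        }
        iterations *= 2;
    }
}

/* Trivial demo workload: increments the long counter that ctx points to. */
void count_call(void *ctx)
{
    ++*(long *)ctx;
}
```

Reporting time per call rather than a total also makes runs with different iteration counts comparable.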

prissi

Well, it really depends on the image (a random image is taken at the moment; maybe one should rather take a ground tile ...). And on my Atom this runs for over ten seconds in total.

The scheduling bug was worst on PowerPC, where it could easily slow down the executable by a factor of two. And it never got fixed according to the bugzilla entry. (Maybe it's gone in the 4.x series due to the different optimizer.) And indeed the pentium4 has a very long pipeline.

So I split off the topic

wernieman

I use a precompiled SDL, so this is not a performance tip.
But the next nightly will build with
CCFLAGS = -march=pentium4
CXXFLAGS = -march=pentium4
I hope you understand my English
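One way to confirm what a given -march setting enables is to look at the compiler's predefined macros: GCC and Clang define `__MMX__`, `__SSE__` and `__SSE2__` when the corresponding instruction sets are switched on. The sketch below assumes one of those compilers; the function name `compiled_simd_level` is hypothetical.

```c
/* Hypothetical helper: reports the highest SIMD level this translation
 * unit was compiled for, based on GCC/Clang predefined macros.
 * The full macro list can also be dumped without any source file:
 *   gcc -m32 -march=pentium4 -dM -E - < /dev/null
 */
const char *compiled_simd_level(void)
{
#if defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#elif defined(__MMX__)
    return "MMX";
#else
    return "none";
#endif
}
```

This describes the build, not the machine it runs on: a binary reporting "SSE2" here will still fail on a CPU without SSE2.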

whoami

I think that Pentium 4 is a good minimum requirement, as one of my computers (that are actually in use) has one ;) , and ST runs fast enough on it. My last experiences with ST on a Pentium III-933 (but with lowest-level on-board graphics without dedicated RAM) showed very poor performance, making only tiny maps with pak64 playable.
If official releases are automated, supplying additional binaries for older hardware and OS versions (also slow portable CPUs) might cause little additional work.

(This topic comes from http://forum.simutrans.com/index.php?topic=10432.0, where - at the end - there are more posts regarding CPU target.)