The International Simutrans Forum

 

Author Topic: Why is SDL with USE_HW not default?  (Read 8524 times)


nebulon

  • Guest
Why is SDL with USE_HW not default?
« on: April 10, 2013, 06:29:27 AM »
Hi,

I couldn't find a lot of documentation or discussion about this, so I'll simply ask here. This will be my first post anyway :-)

Since I moved to a new low power PC recently, I was unable to play simutrans anymore, as the framerate was around 2~5fps.
I suspected that the Atom CPU was simply overloaded, as simutrans only uses one core and that one was fully loaded all the time.

After playing around with allegro, which did not improve anything, I looked through the SDL backend code and found the USE_HW define.
Enabling it gives me a nice 25-30fps!

So it took me quite some time to find this option, and of course the package from my distribution (I use ArchLinux; I don't know how it is compiled on Ubuntu or the like) didn't have the option set.

Is there a specific reason for not enabling this by default? It would improve performance for most people on Linux.
It also helps quite a lot with power consumption on my laptop.

-----
simsys_s.cc:25
// try to use hardware double buffering ...
// this is equivalent on 16 bpp and much slower on 32 bpp
// #define USE_HW
-----

This comment seems to refer only to double buffering, but not to the hardware surface flag, which is also set by USE_HW.
The double buffering itself had no visible impact on the framerate.

I currently use the open-source nouveau driver on Linux, which still often has issues, but in this case I couldn't detect any visual problems with the hardware support enabled. Maybe driver issues are the reason for it being disabled by default?

Thanks,
Johannes

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4863
  • Languages: EN, DE, AT
Re: Why is SDL with USE_HW not default?
« Reply #1 on: April 10, 2013, 07:45:30 AM »
Welcome to the forum :)

Thank you for your investigation. To be honest, I have no idea with respect to the USE_HW flag. All the code enclosed by '#ifdef USE_HW' is about 7 years old and may be broken by now.

Did you make any other changes besides enabling USE_HW?

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #2 on: April 10, 2013, 08:21:19 AM »
When I tried this on my original eeePC on Windows, it went from 10 fps to 2 fps with USE_HW. It may depend very much on your hardware and its drivers. Additionally, the SDL version might be important. But honestly, copying 16 bit bitmaps with the CPU should be as fast as doing it in hardware, all the more since with USE_HW the full screen is always copied. It may be that USE_HW prevents certain driver bugs from showing up in certain configurations.

It will also halt simutrans for buffer switching (vsync), which may confuse the screen timer at higher fps settings, and can give an 8-11 fps jitter in network games. When I tested it a long time ago, I also had issues with resizing the window.
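The jitter band can be illustrated with a little arithmetic (my own sketch, not Simutrans code; a 60 Hz refresh is assumed):

```c
/* Illustration of the vsync jitter described above: with double
   buffering each flip waits for the next 60 Hz vblank, so only frame
   times that are whole multiples of ~16.7 ms are reachable. */
double fps_at_vsync_steps(int n_vsyncs)
{
    const double vsync_ms = 1000.0 / 60.0;   /* one 60 Hz refresh */
    return 1000.0 / (n_vsyncs * vsync_ms);   /* fps when each frame waits n vblanks */
}

/* Around a 10 fps target (6 vblanks per frame), the neighbouring
   steps are fps_at_vsync_steps(5) = 12 fps and
   fps_at_vsync_steps(7) ~ 8.6 fps, roughly the jitter band above. */
```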

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4863
  • Languages: EN, DE, AT
Re: Why is SDL with USE_HW not default?
« Reply #3 on: April 10, 2013, 08:37:57 AM »
From a quick glance through the code, we can skip the dirty-buffer stuff, which is quite an overhead.

It will also halt simutrans for buffer switching (vsync), which may confuse the screen timer at higher fps settings, and can give an 8-11 fps jitter in network games.
Where does this happen?

I have to test this myself first. I can imagine that parts of this code are broken; it has been pretty much unmaintained for years.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #4 on: April 10, 2013, 09:38:02 AM »
The dirty-buffer stuff helped (and imho still helps; just look at the CPU load with the whole window dirty every frame), especially on large screens with separate graphics adapters, but it is obviously less effective on integrated graphics. Or do you mean when "USE_HW" is defined?

During the double buffering, SDL on some computers (not all, when I tested it) waited for a vsync, i.e. 1/60 second. This can obviously confuse the frame timing, and can lead to updates every 5 and 7 frames alternatingly when targeting 6 (for 10 fps in some network games or fast forward).

Offline Markohs

  • DevTeam, Coder/patcher
  • Devotees (Inactive)
  • *
  • Posts: 1559
  • Languages: EN,ES,CAT
Re: Why is SDL with USE_HW not default?
« Reply #5 on: April 10, 2013, 09:50:44 AM »
It will also halt simutrans for buffer switching (vsync), which may confuse the screen timer at higher fps settings, and can give an 8-11 fps jitter in network games. When I tested it a long time ago, I also had issues with resizing the window.

 Seeing the behaviour we get on speed-scaling architectures, I'd say the fps and timing management Simutrans does has to be rewritten.

 It should not try to schedule frames as it does now; it should just compute and simulate as fast as it can until it reaches a maximum fps. In all the books and articles on game development I've seen, timing is done very differently from how simutrans does it. That code is buggy and should be changed. Our current code tries to scale the game speed to the available CPU, and that has to be removed.

 About disabling/removing the tile buffer management I'm not 100% sure, since it saves lots of I/O bandwidth and really speeds up the game. But if we switch to double buffering, of course that's not needed. And as prissi comments, that HW support might not work well on all platforms.

 This is not a simple question; I'm unsure myself how it would be best done.

I also had issues with resizing the window.

 I think that should be fixed with my last update to svn. This also happened when the new background was active.
« Last Edit: April 10, 2013, 10:37:48 AM by Markohs »

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #6 on: April 10, 2013, 01:42:30 PM »
Those resizing troubles were from when I uncommented the USE_HW code back in 2007 or so ...

Simutrans has to do this timing anyway, otherwise network games would be impossible, as everything is moved during a frame. Different frame rates, different movement patterns, desync.

One could argue for moving that work outside the frames; but there would still be internal code needed to ensure 10 movement cycles per second, and cutting down drawing is the most likely way to achieve low CPU usage. Running the simulation without syncing vehicles and display would just waste CPU without any improvement of the display, as nothing would move between frame updates.

NB: OpenTTD actually does the same, although there the frame rate is rigidly set at 20 fps.

But what exactly do you propose?

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5691
  • Languages: EN, NO
Re: Why is SDL with USE_HW not default?
« Reply #7 on: April 10, 2013, 01:54:47 PM »
As for dirty rectangles or not, or USE_HW or not, there is only one way to figure out what is best, and that is to try and see. The downside of being cross-platform is that there are so many platforms to test on.

Personally, I sometimes spend a lot of time following trains around. That requires full screen redraws, and I don't notice any slowdown from it. But then my machine isn't the kind that chokes the FPS down to 2 under normal circumstances either.
Multiplayer games that run in sync step must enforce a fixed, usually slightly low, frame rate. Such games aren't the most popular to write about, not since the 90s anyway, which explains the prevalence of texts about "First Person Shooter-type" time management. One could of course uncouple the rendering from the logic and only keep the logic frame rate fixed, but what's the point of drawing the same image more than once? It only burns power.
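A minimal sketch of that uncoupling (all names hypothetical, not Simutrans code): the logic rate stays fixed, and a frame is drawn only when the world actually advanced, so no identical image is ever rendered twice.

```c
/* Fixed logic rate with rendering coupled to actual world updates.
   Hypothetical helper for illustration only. */
typedef struct { int logic_steps; int frames_drawn; } loop_stats;

loop_stats run_loop(int total_ms, int logic_dt_ms)
{
    loop_stats s = {0, 0};
    int accumulator = 0;
    for (int t = 0; t < total_ms; ++t) {       /* one ms of wall time */
        accumulator += 1;
        int stepped = 0;
        while (accumulator >= logic_dt_ms) {   /* fixed logic rate */
            accumulator -= logic_dt_ms;
            s.logic_steps++;
            stepped = 1;
        }
        /* draw only when the world advanced: redrawing an unchanged
           image just burns power, as argued above */
        if (stepped) {
            s.frames_drawn++;
        }
    }
    return s;
}
```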

Offline Markohs

  • DevTeam, Coder/patcher
  • Devotees (Inactive)
  • *
  • Posts: 1559
  • Languages: EN,ES,CAT
Re: Why is SDL with USE_HW not default?
« Reply #8 on: April 10, 2013, 02:14:16 PM »
 I don't know the details yet, since I had planned to look deeper at that part of the code once my current patches are finished, but there is certainly a problem with the current implementation, and you can check it yourself:

 1) Start a game (a cpu hungry one preferably)
 2) I see frame time 37 ms idle 25 fps 26 simloops 5.0
 3) Halt program execution with the debugger or with a breakpoint
 4) Resume execution after some seconds (20 in my case)
 5) Frame time 250 ms idle 25 fps 0 simloops 1.2
 6) The game will slowly realize it has more CPU available and will decrement the frame time by 1 each frame, starting at 1 fps, until it reaches 25fps again. This process lasts about 1 minute.

 This shows simutrans overreacts to CPU scarcity, and I suspect this is behind the CPU scaling problems we have seen on Ubuntu, for example.

 I don't know if the current algorithm can just be tweaked to recover faster from these situations, but it certainly has to, because even though my example is an extreme case, similar things can happen in the game naturally: when seasons change, when you scroll large areas of the map, or when your computer starts a background cron job that temporarily diminishes the available CPU. This has to be improved.
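Some back-of-envelope arithmetic on why the recovery is so slow, assuming the adaptation really does shave only 1 ms off the frame time per frame (my reading of step 6 above, not verified against the code):

```c
/* If the frame time only shrinks by 1 ms per frame, going from
   250 ms back to 40 ms (25 fps) costs the sum of all intermediate
   frame times. Assumed model, see step 6 above. */
int recovery_ms(int from_ms, int to_ms)
{
    int total = 0;
    for (int ft = from_ms; ft > to_ms; --ft) {
        total += ft;   /* one frame is rendered at each length */
    }
    return total;
}

/* recovery_ms(250, 40) comes out around 30 seconds of wall time,
   the same order of magnitude as the ~1 minute observed. */
```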

 What do I have in mind? Just ideas, and I might be wrong in some or all of them, but:

 1) I'd enforce a fixed simulation time rate as you said, and simutrans would just stop when it's reached, not more, not less.
 2) I'd also enforce a target fps of maybe 25 for screen redraws, in the time the world simulation leaves the CPU free.
 3) Only past this point can simutrans go to sleep and yield the CPU.
 

 I guess the current algorithm does more or less this; I don't know if tweaking the timings will be enough, or if more coding will be needed.
« Last Edit: April 10, 2013, 03:32:56 PM by Markohs »

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #9 on: April 10, 2013, 06:23:12 PM »
Enabling USE_HW is disastrous on any system I've tried... ~25% slower frame times (and some strange interactions where it takes >5 secs to load a game that takes <0.5 without USE_HW...).
I've yet to see hw_available report true, i.e. I always get:
Code:
hw_available=0, video_mem=0, blit_sw=0
bpp=32, bytes=4

If there are platforms out there that are actually helped by this, I think we need to move to a runtime selection rather than compile time...
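For reference, the flag arithmetic such a runtime switch would involve (the flag values are copied from the SDL 1.2 headers; the helper itself is a hypothetical sketch, not the actual backend code):

```c
#include <stdint.h>

/* Flag values from the SDL 1.2 headers; video_flags() sketches a
   runtime -use_hw switch replacing the compile-time USE_HW define. */
#define SDL_SWSURFACE  0x00000000u
#define SDL_HWSURFACE  0x00000001u
#define SDL_RESIZABLE  0x00000010u
#define SDL_DOUBLEBUF  0x40000000u

uint32_t video_flags(int use_hw)
{
    uint32_t flags = SDL_RESIZABLE;   /* always request a resizable window */
    if (use_hw) {
        /* hardware surface with double buffering; SDL_Flip() then
           swaps buffers and may wait for vsync */
        flags |= SDL_HWSURFACE | SDL_DOUBLEBUF;
    }
    else {
        /* software surface, pushed with SDL_UpdateRect()/SDL_UpdateRects() */
        flags |= SDL_SWSURFACE;
    }
    return flags;   /* pass to SDL_SetVideoMode(w, h, 16, flags) */
}
```

This is consistent with the screen-flag values quoted later in the thread: 10 without the option, 40000011 or so with it.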

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #10 on: April 10, 2013, 11:58:55 PM »
Thanks TurfIt; it seems nothing has changed in the last 7 years, apart from a single platform. For SDL we apparently have to try USE_HW as well as OpenGL before settling for normal SDL ... Maybe we should revive the allegro backend. ;)

nebulon

  • Guest
Re: Why is SDL with USE_HW not default?
« Reply #11 on: April 11, 2013, 11:35:44 PM »
A lot of replies :-) And a lot of stuff around the frame handling that I have no clue about ;-)

Enabling hardware buffers in SDL improves things by a huge amount on Linux when running an OpenGL compositor like compositing KWin or compiz. If SDL does not use hardware surfaces, it needs to copy every pixel once more into a texture... which in my case is the main bottleneck.
I also hacked the code to initialize SDL with '0' as color depth so it chooses the native one, which, again in compositing mode, is 32bpp. Of course the whole rendering is off and colors are broken, but it also squeezed out another 10fps ;-) This of course does not make any noticeable difference on one of my beefier machines...

Those pixel manipulations/copies every frame are still very expensive, as resolutions are often much higher nowadays.

Looking at the output you posted, it might be possible to switch to the code hidden behind this flag depending on the hw_available field? Let's see if I can find some time on the weekend to prepare something.
Or maybe at least make it a configure step rather than a compilation step?

Thanks

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #12 on: April 12, 2013, 01:03:33 PM »
Making it configurable is easy. That 32 bit is faster than 16 bit is another indication that SDL on your configuration is to blame. (Or more exactly, that compiz has poor support for bitblt, the most fundamental operation of them all.) However, since everything goes through OpenGL anyway, why not use that backend?

And, as TurfIt mentioned, does SDL actually report "hw_available=1"? Since this is usually zero, enabling USE_HW only when 1 is returned seems like a good choice.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #13 on: April 13, 2013, 05:30:05 AM »
Find attached a patch to add a -use_hw command line option, replacing the USE_HW compile time option.

I also found SDL_UpdateRect was being called a lot... The SDL documentation suggests SDL_UpdateRects is more efficient, so the patch includes a quick and dirty implementation of that. I'm seeing ~15% faster at normal zoom, 5% faster when fully zoomed out. I suspect there are overlapping rects slowing things down too...
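The overlapping/adjacent rect point can be illustrated with a tiny merge pass (a hypothetical helper, not part of the patch): horizontally adjacent dirty tiles in the same row collapse into one wider rect before the single SDL_UpdateRects() call.

```c
/* Hypothetical dirty-rect merge: collapse horizontally adjacent tiles
   in the same row so that one SDL_UpdateRects(screen, m, out) call
   pushes fewer, larger rects. Not part of the actual patch. */
typedef struct { int x, y, w, h; } rect_t;

int merge_row_rects(const rect_t *in, int n, rect_t *out)
{
    int m = 0;
    for (int i = 0; i < n; ++i) {
        if (m > 0
            && out[m - 1].y == in[i].y
            && out[m - 1].h == in[i].h
            && out[m - 1].x + out[m - 1].w == in[i].x) {
            out[m - 1].w += in[i].w;   /* extend the previous rect */
        }
        else {
            out[m++] = in[i];          /* start a new rect */
        }
    }
    return m;   /* number of rects actually passed to SDL */
}
```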

Also, the GDI backend had a multithreaded update added, but not SDL, so I added it to SDL too. Now 33% faster at normal zoom, 10% faster at max zoom out. But UpdateRects is slower than UpdateRect when multithreaded, so it is only used single threaded. Interestingly, when multithreaded, -use_hw no longer severely impacts performance, only a 0-3% slowdown. Note: my platform does not return hw_available=true, so I don't know/can't test whether the multithreading works when hardware accel is actually available.

Another interesting test would be to try hardware surfaces but not double buffered, i.e. simply set the SDL_HWSURFACE flag and don't use the -use_hw option. Of course testing this requires a suitable platform...

Further, the frame time stat is completely unreliable. It was stuck at ~30ms no matter what I did; even as FPS and simloops plummet, frametime=30. To test this patch, timing needs to be obtained elsewhere.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #14 on: April 13, 2013, 08:46:00 PM »
I would suggest submitting this, so we can get more feedback from the nightlies. It looks like a good idea.

nebulon

  • Guest
Re: Why is SDL with USE_HW not default?
« Reply #15 on: April 13, 2013, 10:42:06 PM »
Tested the patch on two of my setups and it works just fine!

On a side note, I have attached a patch which just changes some lines in readme.txt, as they were a bit confusing for me when I tried to build the project the first time.

Thanks a lot,
Johannes

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #16 on: April 13, 2013, 10:53:42 PM »
If you don't mind, could you quantify "works just fine!"? I.e. provide some performance numbers... Ideally pre-patch, post-patch, with -use_hw, with -use_hw and -async, single threaded, multithreaded, ... all combinations.
I'm concerned -use_hw, -async, and MULTI_THREAD>1 will not work together.

Also, do you mind posting the first few lines of the console output for each combination? Specifically the screen flags and SDL driver sections.

Re: the patch, the coding style document is in desperate need of an update...

Edit: Also, have you tried the OpenGL backend?

nebulon

  • Guest
Re: Why is SDL with USE_HW not default?
« Reply #17 on: April 14, 2013, 02:50:52 AM »
Hi,

Some stats. All numbers were taken while following a train, so with ongoing scrolling.
If I don't scroll, the framerates are roughly 5fps higher.


With your patch, normal zoomlevel and MULTI_THREAD = 4:
    With compositor (compiz):
        - no arguments:     ~3fps
        - use_hw:           ~12fps
        - use_hw and async: ~14fps
    Without compositor (no windowmanager):
        - no arguments:     ~18fps
        - use_hw:           ~19fps
        - use_hw and async: ~14fps

Without your patch, normal zoomlevel and MULTI_THREAD = 4:
    With compositor (compiz):
        - no arguments:     ~4fps
        - async:            ~5fps
    Without compositor (no windowmanager):
        - no arguments:     ~17fps
        - async:            ~11fps

Without your patch but USE_HW defined:
    With compositor:
        -no arguments:      ~14fps



The async option is somewhat strange... it does not seem to help at all on my system; it makes things worse.
Only defining USE_HW did not achieve as good an fps as your patch with -use_hw.
I don't know if this makes any sense, as I didn't fully understand the threading part of the patch.

According to /proc/cpuinfo, I am running an "Intel(R) Atom(TM) CPU  330   @ 1.60GHz" with 4 "cores",
thus I thought MULTI_THREAD = 4 is ok? The clock speed is a bit misleading, it's a really slow box,
but it has an nvidia gpu which works fine for all my other use cases ;-)


Baseline startup with patch:

~/projects/simutrans/simutrans (git)-[master] % ../build/default/sim -use_workdir . -objects ../PAK128.german -screensize 1280x1024
Use work dir /home/jzellner/projects/simutrans/simutrans/
Reading low level config data ...
parse_simuconf() at config/simuconf.tab: Reading simuconf.tab successful!
Preparing display ...
SDL_driver=x11, hw_available=0, video_mem=0, blit_sw=0, bpp=32, bytes=4
Screen Flags: requested=10, actual=10
dr_os_open(SDL): SDL realized screen size width=1280, height=1024 (requested w=1280, h=1024)Loading font 'font/prop.fnt'
font/prop.fnt successfully loaded as old format prop font!
Init done.
....


As you can see, the hardware flag is actually off. I was pretty sure the other day it was on...
I checked my system update logs, but it didn't look like anything could have affected it, besides a kernel update
(I use the nouveau driver, which enjoys a lot of development currently), but I doubt that changed anything,
so I guess I just looked at the wrong output.

As for the opengl backend, I had to add "-lGLEW" at Makefile:552 to get it linked.
I get ~4fps, regardless of whether I run in compositing or non-compositing mode.
It must hit some highly unsupported path in my setup, as in comparison I get 25-30fps in Team Fortress using the native
Source engine port for Linux. But my card also does not have a fixed pipeline anymore,
and I couldn't see any shader usage in the gl backend. So the plain SDL version is still much better.

Hopefully the data provides some help,
Johannes

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #18 on: April 14, 2013, 04:17:00 AM »
Thanks for the results. It's definitely strange that use_hw helps you so much; with the way simutrans updates the screen, single buffered dirty-tile updates should be the fastest, especially for weaker machines. According to the SDL docs, even with a HW buffer, it's continually being copied from system to video memory and back due to the way simutrans accesses the screen.

One test I see missing is -use_hw, -async, and MULTI_THREAD not defined, i.e. single threaded.

According to the spec sheet, the Atom 330 is a 2 core processor with hyperthreading. Might be worth trying MULTI_THREAD=2 as well. I think kierongreen had/has a slow Atom as well and found that some parts of simutrans' multithreading were better with hyperthreads, some not...

Was the baseline startup info you posted with -use_hw? It looks not, since the requested screen flags=10; it should be 40000011 or so with -use_hw.

Also, the bpp=32 looks strange; simutrans should be asking for 16 since that's all it can handle. I moved where it queries this to after the SDL_SetVideoMode call, as otherwise it was returning my desktop bpp rather than the simutrans SDL screen bpp. If your platform is refusing to give simutrans a 16bpp screen, that could be a huge slowdown, as it has to take simutrans' 16bpp writes and convert them to a 32bpp surface.
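Rough numbers for that conversion cost (plain arithmetic, not measured, using the resolution from the log above):

```c
/* Back-of-envelope for the bpp point: bytes moved by one
   full-screen copy at 1280x1024. */
long frame_bytes(long w, long h, long bytes_per_pixel)
{
    return w * h * bytes_per_pixel;
}

/* frame_bytes(1280, 1024, 2) is 2.5 MiB per frame at 16 bpp;
   frame_bytes(1280, 1024, 4) is 5 MiB at 32 bpp, and a forced
   16->32 conversion additionally touches every pixel on the way. */
```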

As it looks like MULTI_THREAD, -use_hw, and -async are getting along, I'll commit this. Hopefully that lets you continue to play... Although that's one weak system! 1.6GHz, 1MB L2 cache, and 533 FSB. I think I have a 12 year old Athlon around that might still boot with more grunt than that  ;)

Edit: One more thing. Are you compiling for a 32bit target or 64? There's some inline asm that only works in 32bit mode which is critical to good performance. The fallback is thousands of calls to memcpy for 1-5 byte copies, which is absolutely glacial with GCC (yet apparently ok for MSVC, which is why it's there...).
« Last Edit: April 14, 2013, 04:22:12 AM by TurfIt »

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5691
  • Languages: EN, NO
Re: Why is SDL with USE_HW not default?
« Reply #19 on: April 14, 2013, 07:59:33 AM »
Edit: One more thing. Are you compiling for a 32bit target or 64? There's some inline asm that only works in 32bit mode which is critical to good performance. The fallback is thousands of calls to memcpy for 1-5 byte copies, which is absolutely glacial with GCC (yet apparently ok for MSVC, which is why it's there...).

The assembly I've seen GCC 4.2+ generate for the C code is just as good as the hand-written assembly. GCC knows what memcpy is, and inlines it efficiently. Because of this, it might even be better, as it makes good use of SSE to shuffle 16 bytes at a time if you let it. Nor do I have performance issues with my 64 bit Simutrans. I also think I'm the one who suggested using memset() on 64 bit GCC because it gave the best performance.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #20 on: April 14, 2013, 04:58:16 PM »
My experience was GCC absolutely refusing to inline that memcpy call. I tried every tag, attribute, flag, etc. going to try and force it, but no way no how was that going to be inlined. Maybe it's just an issue with the MinGW version...

Replacing in display_image_nc():
Code:
// some architectures: faster with inline of memory functions!
memcpy( p, sp, runlen*sizeof(PIXVAL) );
sp += runlen;
p += runlen;
with:
Code:
while(  runlen--  ) {
	*p++ = *sp++;
}
and compiling with -O3 and -march=core2 let the optimizer vectorize the loop and spit out SSE2 instructions. IIRC that was ~5x faster than all the non-inlined memcpy function calls. But that truly beautiful brute force asm block was still ~15% faster. I don't think you'll ever see a compiler that spits out code quite like it!

As for 64-bit not performing badly, you presumably have a system with enough power to mask it. Even in 32bit, I've found a few data structures just overflowing a cache line. I just trimmed stadtauto from 68 bytes to 64, making it fit, and now see a 40% decrease in the sync_step time for citycars (running a patch I've been working on that completely rearranges sync_step). With compiling for 64bit blowing up all pointers, several of the structures that barely fit a cache line in 32bit will see significant performance reductions.
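The cache-line point in numbers (the fields below are hypothetical, just to show the sizeof effect, not the actual stadtauto layout):

```c
#include <stdint.h>

/* Hypothetical layouts illustrating the cache-line point: a 68-byte
   object straddles two 64-byte lines, a 64-byte object fits exactly
   one, so trimming 4 bytes can halve the lines touched per access. */
struct fat_car  { char payload[64]; uint32_t last_field; };   /* 68 bytes */
struct slim_car { char payload[60]; uint32_t last_field; };   /* 64 bytes */
```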

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2346
Re: Why is SDL with USE_HW not default?
« Reply #21 on: April 14, 2013, 05:41:11 PM »
On my Atom system, 2 threads generally increased speed to 150-180% of that of a single thread; 4 threads took this to about 250-300%.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5691
  • Languages: EN, NO
Re: Why is SDL with USE_HW not default?
« Reply #22 on: April 14, 2013, 05:48:53 PM »
My experience was GCC absolutely refusing to inline that memcpy call. [...]

As for 64-bit not performing badly, you presumably have a system with enough power to mask it. [...]

That computer isn't particularly powerful, and it was built from cheap components even back then. However, it is and was a Linux machine, so I might have avoided some issues MinGW has or had. That 64-bit pointers ruin carefully crafted alignments is another issue.

Speaking of alignment, I wonder if the hand-written assembly could make better use of alignment. Although x86 allows misaligned memory access, it at least used to be slower. The problem is that the two pointers may be aligned differently. Maybe that's what slows down GCC's assembly.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #23 on: April 14, 2013, 08:46:27 PM »
I note the makefile for the Mac contains -DUSE_HW. I wonder if this is a cause of the slowdowns reported with Macs?

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #24 on: April 14, 2013, 10:45:18 PM »
That is a very good observation, and I think this should be removed ...

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #25 on: April 15, 2013, 08:02:19 PM »
-DUSE_HW for Mac OS removed in r6458.
Seems this was a truly ancient addition from r251!

I also see from r1965:
Code:
  CCFLAGS += -Os -fast
No idea what -fast does. If it's a typo and was supposed to be -Ofast, that seems at odds with -Os. Also, no other platform has optimize params set here... Although -Os is probably better than nothing if the nightly server doesn't have OPTIMIZE set in its config.default. But better of course would be -O3...


Offline ArthurDenture

  • Coder/patcher
  • *
  • Posts: 86
Re: Why is SDL with USE_HW not default?
« Reply #26 on: April 16, 2013, 03:39:00 AM »
I just tried out the latest HEAD on my macbook, and I observe the same performance with or without -use_hw: about 180ms per frame (5 fps). So it appears it was fine to remove USE_HW from the mac makefile. Over at http://forum.simutrans.com/index.php?topic=11418.msg114930#msg114930 I had reported awful performance in the non-USE_HW code path that called SDL_UpdateRect repeatedly; it looks like the change to use SDL_UpdateRects helped that dramatically.

Incidentally, enabling threaded rendering (which requires a hacky patch[1] to work around the lack of pthread barriers on the mac) improves this to about 160ms per frame (6 fps).


I think it's still the case that achieving reasonable performance on the mac will require hardware acceleration, whether via the opengl backend (argued against in the thread linked above) or by porting to SDL2.

[1] Submitting this patch is blocked on the fact that it relies on some barrier code from Stack Overflow, and SO code is under the cc-by-sa license, which means it can't really go into simutrans.


Edit: as for '-Os -fast', the manpage for mac gcc says that for Intel macs, -fast is an alias for "-O3 -fomit-frame-pointer -fstrict-aliasing -momit-leaf-frame-pointer -fno-tree-pre -falign-loops", and -Os means "Optimize for size, but not at the expense of speed", equivalent to -O2 with some flags turned off. Just going with -O3 sure sounds reasonable to me.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #27 on: April 16, 2013, 11:09:33 AM »
Since the screen drawing does not need to be multithreaded if the high quality rendering is needed (i.e. !(  umgebung_t::simple_drawing  &&  can_multithreading  ) in simview.cc), wouldn't it make sense to use UpdateRects on those occasions too, i.e. only use the multithreaded code when needed?

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #28 on: April 16, 2013, 08:02:23 PM »
I found using UpdateRects vs UpdateRect to be the same speed when multithreaded in simsys_s on my i7, but ~2% slower on the Core2. It's easy enough to change... And it sounds like Macs truly hate UpdateRect.

What do you mean by using multithreaded code only if needed? The routines in simview which switch to a multithreaded simple drawing mode are quite independent from the simsys_* multithreaded background screen blit.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 10565
  • Languages: De,EN,JP
Re: Why is SDL with USE_HW not default?
« Reply #29 on: April 16, 2013, 10:48:51 PM »
I mean the routines in simview, obviously. Was MT UpdateRect tested on the Mac? I am starting to lose the overview ...

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1432
Re: Why is SDL with USE_HW not default?
« Reply #30 on: April 16, 2013, 10:56:58 PM »
Sorry, I'm still not following what you wanted with simview.

MT UpdateRect vs UpdateRects on the Mac was not tested by me. Perhaps ArthurDenture did?
Also, it sounds like we need to reinvent our own barriers for the Mac if its pthread port is missing them, for any multithreading there.
Would be so much easier if we actually had a Mac  ;D