The OpenGL thing works for me much better than both SDL and SDL2. On both SDL and SDL2 using "transparent instead of hidden" in addition to "Smart hide objects" leaves some artefacts (shaded building elements)
Confirmed. The 'smart hider' is not marking things dirty correctly. As Ters alluded to, Simutrans only copies to the screen the things that have changed. This provides a good speed boost in all cases I've seen, i.e. I've not seen the hypothetical system that chokes on the smaller copies.
The only reason OpenGL isn't showing this is that the 'hackish' GL backend doesn't use the dirty system; it just copies everything every frame. Hence it is much (50%+) slower than the other backends.
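To illustrate what "marking things dirty" means, here is a minimal sketch of a tile-based dirty tracker. All names (DirtyTracker, mark, flush) and the 16-pixel tile size are assumptions made up for this example, not Simutrans' actual code. The point is only that anything which changes pixels without calling mark() leaves stale pixels on screen, which is the kind of artefact described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical dirty-tile tracker; names and tile size are illustrative only.
struct DirtyTracker {
    static constexpr int TILE = 16;
    int width, height, tiles_x, tiles_y;
    std::vector<bool> dirty;

    DirtyTracker(int w, int h)
        : width(w), height(h),
          tiles_x((w + TILE - 1) / TILE), tiles_y((h + TILE - 1) / TILE),
          dirty(static_cast<std::size_t>(tiles_x) * tiles_y, false) {
        assert(w > 0 && h > 0);
    }

    // Mark every tile touched by the rectangle (x, y, w, h), clipped to the screen.
    void mark(int x, int y, int w, int h) {
        const int x0 = std::max(x, 0), y0 = std::max(y, 0);
        const int x1 = std::min(x + w, width), y1 = std::min(y + h, height);
        if (x0 >= x1 || y0 >= y1) return;  // nothing visible
        for (int ty = y0 / TILE; ty <= (y1 - 1) / TILE; ++ty)
            for (int tx = x0 / TILE; tx <= (x1 - 1) / TILE; ++tx)
                dirty[static_cast<std::size_t>(ty) * tiles_x + tx] = true;
    }

    // Copy only the dirty tiles from back buffer to front buffer, clear the
    // flags, and return the number of pixels copied.
    std::size_t flush(const std::uint16_t* back, std::uint16_t* front) {
        std::size_t copied = 0;
        for (int ty = 0; ty < tiles_y; ++ty) {
            for (int tx = 0; tx < tiles_x; ++tx) {
                const std::size_t idx = static_cast<std::size_t>(ty) * tiles_x + tx;
                if (!dirty[idx]) continue;
                const int x0 = tx * TILE, y0 = ty * TILE;
                const int w = std::min(TILE, width - x0);
                const int h = std::min(TILE, height - y0);
                for (int row = 0; row < h; ++row) {
                    const std::size_t off = static_cast<std::size_t>(y0 + row) * width + x0;
                    std::memcpy(front + off, back + off, w * sizeof(std::uint16_t));
                }
                copied += static_cast<std::size_t>(w) * h;
                dirty[idx] = false;
            }
        }
        return copied;
    }
};
```

A backend that ignores the flags and copies the whole buffer every frame never shows stale pixels, but it also pays for width * height pixels each frame instead of only the tiles that changed.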
And the SDL2 version crashes on window maximize (SDL maximizes OK), even if it's the first thing done after starting Simutrans.
I vaguely remember such from developing the SDL2 backend. IIRC it was solved by updating the version of the SDL2 library. What version are you using?
Also, the SDL2 backend was created to solve the performance issues with OSX and SDL1. It's not really tested on platforms other than Mac.
In addition - it seems that OpenGL works "smoother" - especially when "following" a vehicle. So it is a very good hack.
The only problem with OpenGL is the scrolling text on the "message bar" (the bar seems to be "flashing" and the text appears "moved" by several letters, making it unreadable). Since there is an option to disable messages on that bar - and this setting is saved on game quit - it's acceptable to me.
vsync? In general, Simutrans isn't very smooth. Out of the box it's set for 25fps which doesn't play nice with 60Hz screen refreshes. Of course unless you're in a place with 50Hz... But even then vehicle movement is jerky due to the way their map positions are translated to the screen. Hence the slinky accordion effect on trains too.
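The 25fps versus 60Hz mismatch can be made concrete with a little arithmetic. The sketch below assumes (and this is only an assumption for illustration) that each game frame is shown at the first screen refresh at or after the moment it is ready; refreshes_per_frame is a made-up helper name.

```cpp
#include <cassert>
#include <vector>

// For each game frame, count how many screen refreshes it stays visible,
// assuming a frame is shown at the first refresh at or after it is ready.
std::vector<int> refreshes_per_frame(int fps, int hz, int frames) {
    assert(fps > 0 && hz > 0 && frames >= 0);
    std::vector<int> held;
    held.reserve(frames);
    for (int i = 0; i < frames; ++i) {
        // Frame i is ready at time i / fps; refresh k happens at time k / hz.
        // The first refresh at or after that time is ceil(i * hz / fps).
        const int shown = (i * hz + fps - 1) / fps;
        const int next = ((i + 1) * hz + fps - 1) / fps;
        held.push_back(next - shown);
    }
    return held;
}
```

For 25fps on a 60Hz screen this gives 3, 2, 3, 2, 2 refreshes for the first five frames, so frames are held for uneven lengths of time. On a 50Hz screen every frame is held for exactly 2 refreshes.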
There is also one more issue when building from source provided via zip file - since the flag "WITH_REVISION" is set to 1, it seems that during compilation it tries to use svn/cvs/etc. to get the revision number and fails, resulting in "rNiewersjonowany katalog" (with English locale settings at compilation time it could be something like "rUnversioned directory").
No flags are set in the source .zip. config.template is provided with all options commented out. You have to create config.default yourself, and uncomment the relevant lines. If you try to use WITH_REVISION when compiling from a non-SVN working directory, then you'll get this error.
Maybe the SDL backend also has some concurrency issues if multi-threading is enabled (I've never built Simutrans multi-threaded). I also see that batch copying is disabled when rendering is multi-threaded. The OpenGL backend turns a blind eye to the issue of multi-threading.
Not that I know of...
What batch copy disabled?
Allowing a modern compiler to use SSE and stuff results in what I think is faster machine code, although Simutrans will complain that it is using the slowest of the three copy algorithms (the hand-optimized assembly only works with 32-bit GCC, and the middle one confuses GCC when vector optimizations are turned on).
SSE and stuff works in 32bit too... The current compiler target is supposed to be 'pentium4' which enables the first SSE instructions. Benchmarking with the cpu target set newer (and hence able to use newer SSE instructions) resulted in minimal gains. That was gcc 4.5 though, perhaps newer versions have finally got better optimizing for the new instructions.
No compiler optimization will ever beat the brute force RLE blitting the assembly does. No sane compiler programmer would ever write a compiler that puts out such an ugly but very effective string of instructions.
Since people insist on building 64 bit, the slow copy path should be replaced by a simple while loop rather than that memcpy call which just kills gcc. Or even better, just enable this section of assembly. I can't find anything wrong with it in 64 bit mode. There is another section of assembly which is indeed not 64 bit safe.
I've just sent a bug report.
In my opinion, if one can compile Simutrans against OpenGL on Linux (regardless of "bitness") without errors, it should do so without digging through Stack Exchange for valid flags for a library that is more or less portable. Whether it works worse or better is a matter of taste.
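For readers unfamiliar with the technique, here is a rough C++ sketch of a run-length blit that copies the opaque runs with a plain loop instead of a memcpy call. The row layout used here (skip count, pixel count, pixels, repeated, with a 0 terminating the row) and the name blit_rle_row are assumptions for illustration only; this is not the actual Simutrans routine or its assembly.

```cpp
#include <cassert>
#include <cstdint>

// Blit one run-length encoded row of 16-bit pixels onto dst.
// Assumed layout: [skip][count][count pixels] ... [0]
// The first skip may be 0; any later skip of 0 ends the row.
// No bounds checking: the caller must guarantee the row fits in dst.
// Returns a pointer just past the row terminator.
const std::uint16_t* blit_rle_row(const std::uint16_t* src, std::uint16_t* dst) {
    assert(src != nullptr && dst != nullptr);
    do {
        dst += *src++;                // transparent run: leave dst untouched
        std::uint16_t count = *src++; // opaque run length
        while (count != 0) {          // plain loop, no memcpy call
            *dst++ = *src++;
            --count;
        }
    } while (*src != 0);
    return src + 1;                   // step over the terminator
}
```

With short runs, which are common in sprite data, the call overhead of memcpy can outweigh the copy itself, which is presumably why a simple loop is suggested for this path.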
The OpenGL backend should be considered abandoned.
What library flags?
I'm more of a sysadmin than a programmer and do not know too much about Simutrans internals, but I made a kind of experiment:
- I ran 4 instances of simutrans:
- 32bit binary (from SF) as a server, with 1024x1024 map
- same binary as client - "simutrans"
- 64bit binary with SDL backend (client) - "sim-sdl"
- 64bit binary with openGL backend (client) - "sim"
Apples and oranges, I think... Do note that the official binaries are compiled in debug mode - that roughly halves performance.
Please self-compile the 32bit using the exact settings you used for the 64bit. Also compile all libraries with the same settings.
Then, you'll find the 32bit significantly faster than 64, especially at drawing the screen. Try running 2560x1600 or greater, single threaded, and with pak64 zoomed out. That'll really highlight the difference.
Looking at the process table, I can say that the 64bit version consumes much more memory (about triple the virtual memory) but less CPU. Looking closer at physical memory usage, the 64bit SDL binary uses 1.5 times more memory than its 32bit counterpart, and the 64bit OpenGL version uses twice as much as the 32bit SDL one. That could be acceptable, especially if one has plenty of memory, so that swapping to a physical device doesn't occur.
The problem isn't not having enough memory, it's being able to access it. RAM is slow. Very, very slow compared to modern processors. Simutrans' working set blows far past even the biggest L3 caches, hence it has to read almost everything from RAM every single frame. I have unfinished work that mucks about with the memory layout of the game's objects and yields 20-50% faster performance, but it's far from ready.
"doubled". Same occurs on system bus - in very low level burst transfer of some memory region (As far as I remember from computer architecture lectures it's done nearly entirely via MMU) will be much faster than calculating which bytes changed and should be copied (using CPU time and initiating bunch of small transfers)
You can actually do an awful lot of calculations in the time it takes to wait for a cache miss to be fulfilled. Packing things tightly together and using CPU to break them apart when needed is far better than having them separate in memory and wasting bits due to alignment.
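As a small illustration of "packing tightly and using CPU to break things apart", compare a naturally aligned struct with the same fields packed into one 32-bit word. The field names and bit widths are invented for this example and do not describe any real Simutrans object.

```cpp
#include <cassert>
#include <cstdint>

// Loose layout: alignment padding typically makes this 12 bytes.
struct Loose {
    std::uint8_t type;
    std::uint32_t owner;
    std::uint8_t flags;
    std::uint16_t height;
};

// Packed layout (hypothetical widths): type 6 bits, owner 4 bits,
// flags 6 bits, height 16 bits, all in a single 32-bit word.
inline std::uint32_t pack(unsigned type, unsigned owner, unsigned flags, unsigned height) {
    assert(type < 64 && owner < 16 && flags < 64 && height < 65536);
    return static_cast<std::uint32_t>(type)
         | (static_cast<std::uint32_t>(owner) << 6)
         | (static_cast<std::uint32_t>(flags) << 10)
         | (static_cast<std::uint32_t>(height) << 16);
}

// Unpacking costs a shift and a mask per field: a few cycles each,
// tiny compared with waiting on a cache miss.
inline unsigned get_type(std::uint32_t w)   { return w & 0x3Fu; }
inline unsigned get_owner(std::uint32_t w)  { return (w >> 6) & 0xFu; }
inline unsigned get_flags(std::uint32_t w)  { return (w >> 10) & 0x3Fu; }
inline unsigned get_height(std::uint32_t w) { return w >> 16; }
```

The packed form fits at least two to three times as many objects into each cache line, at the price of a shift and mask on every access.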