News:

SimuTranslator
Make Simutrans speak your language.

when simutrans can load images over 65534?

Started by Manche, December 11, 2014, 12:00:30 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Ters

Supporting both 32-bit and 64-bit can be very difficult. Optimizing for one might be mutually exclusive with optimizing for the other.

DrSuperGood

Quote
Well, but that assembly can be rewritten to 64-bit equivalent no? I suspect the poorer performance has also much to do on the extra memory bandwith required to move those memory pointers, that turned 64 bits, with the increase in certain memory structures too.

Anyway the only way of really solving this, is using hardware acceleration on the display, that's quite a ingent amount of work that needs to be devoted to it, including a rework of how simutrans manages its inner data structures.
I looked at the assembly yesterday and came to the following conclusions. Unless you build using GCC it is always turned off since the structure used is not supported by Visual C++. Porting it to support Visual C++ should be pretty easy as it has its own inline assembly structure (which is slightly more readable as it is not processed like a string). Visual C++ ignores all inline assembly when building for x64 target (64bit build) so it would need to be disabled with such target anyway. Assembly can still be used in 64bit builds but you need to link it in as opposed to compile with it. I am not sure what GCC does to the assembly when building for 64bit.

QuoteI have to agree with Ters here, I believe the compiler will generate even better code than hand-written assembly most of the times, if it's given enough information, and not forcing what we think they are optimizations with our code. Compiler will do better than a human most of times. Even it looks like the assembler lines in simgraph are quite a significant performance boost, giving benchmarks, I can't deny that.
It all depends if the compiler can see past some of the explicit behaviour shown and do the implicit behaviour. I am trying to test the performance so I can compare between 32 and 64 bit builds to see where the performance improves/worsens.

Ters

Quote from: DrSuperGood on December 16, 2014, 04:55:53 PM
I looked at the assembly yesterday and came to the following conclusions. Unless you build using GCC it is always turned off since the structure used is not supported by Visual C++.

I think that is pretty much the reason why releases for all platforms are made using GCC, even though most Windows developers seem to have Visual C++ now. (I don't have Visual Studio, but I do have the Windows SDK, which includes the entire compiler toolchain.) Unless being able to depend only on msvcrt.dll, which comes bundled with Windows, is the main reason, because that is actually a bad thing to do.

DrSuperGood

The VisualC++ project was badly out of date. With a bit of messing around I have finally got SDL2.0 builds working (which are not even offered as nightlies).

The compiler reported an error with "sdl_sound.cc". Specifically it makes reference to the function "printf" yet fails to "#include <stdio.h>" anywhere so the function is never declared.

prissi

SDL was never ever build with MSVC, only GDI.

ANd yesterday -Wl,--large-address-aware did not worked, but today it does?!? Anyway, comitted.

DrSuperGood

QuoteSDL was never ever build with MSVC, only GDI.
Yes however I have SLD2 building and working apparently fine. As far as I am aware no build is offered for SDL2 for Windows ATM (only SDL aka SDL1).

I might upload the project files later since they could be useful to people wanting to build with MSVC.

Ters

Quote from: DrSuperGood on December 16, 2014, 10:26:34 PM
The compiler reported an error with "sdl_sound.cc". Specifically it makes reference to the function "printf" yet fails to "#include <stdio.h>" anywhere so the function is never declared.

stdio.h is one of those headers that almost always gets dragged in by some other header. Which headers pull in others varies from vendor to vendor, or even version to version.

Quote from: DrSuperGood on December 16, 2014, 10:26:34 PM
With a bit of messing around I have finally got SDL2.0 builds working (which are not even offered as nightlies).

I think Mac nightlies use SDL 2.0. That's pretty much the entire reason there is SDL 2.0 support. On everything else, SDL 1.2 is just as fast as SDL 2.0, or even slightly faster and/or with lesser requirements. SDL 1.2 is also well-known and tested.

DrSuperGood

#77
Quotestdio.h is one of those headers that almost always gets dragged in by some other header. Which headers pull in others varies from vendor to vendor, or even version to version.
It is still good practice to include things you use where you use them for this reason. You do not want to couple your code to a header you have no control over (as must be happening here).

QuoteI think Mac nightlies use SDL 2.0. That's pretty much the entire reason there is SDL 2.0 support. On everything else, SDL 1.2 is just as fast as SDL 2.0, or even slightly faster and/or with lesser requirements. SDL 1.2 is also well-known and tested.
I am getting similar performance for both GDI and SDL 2.0 as far as frame updates goes which is surprisingly higher than the current release + nightly versions for Windows. There is some issue with SDL 2.0 however that causes maps to load extremely slowly (about 3-5 times longer to load a map compared with GDI). I still need to get a SDL1 build going for comparisons.

EDIT:

It also appears that a 64bit build using GDI achieves higher frame rates than the standard Windows distribution (20-23 range in what is 15-17 with standard distribution). This is most strange...

Ters

Quote from: DrSuperGood on December 17, 2014, 01:56:14 PM
It is still good practice to include things you use where you use them for this reason. You do not want to couple your code to a header you have no control over (as must be happening here).

As for me personally, I don't remember which C header provides what (beyond printf and FILE-stuff being in stdio.h), so I just try to use the function, and if it isn't defined work, I have to google what to include. I find the STL-headers somewhat easier to remember, or deduce.

Quote from: DrSuperGood on December 17, 2014, 01:56:14 PM
It also appears that a 64bit build using GDI achieves higher frame rates than the standard Windows distribution (20-23 range in what is 15-17 with standard distribution). This is most strange...

It could be because the compiler can use instructions moving bigger chunks of data when targetting x86-64. Maybe the newer C runtimes are faster than the old msvcrt.dll as well. Standard distributions also have a bit more debugging in them, which might have some effect. Microsoft puts much of the debug information in a separate file.

Dwachs

Here is a patch. It changes image-ids to 32bit, except for the ground images, which are per construction at the lower end of the spectrum. Not much memory wasted for my 64bit build.

Here are the sizes of structures suspected to increase in size (64bit builds).

After patch: curiously vehikel_t and stadtauto_t do not have increased size - maybe there was already enough padding space wasted.

Message: Debug: size of structures
Message: sizes: weg_t: 48
Message: sizes: stadtauto_t: 88
Message: sizes: grund_t: 32
Message: sizes: vehikel_t: 112

Before patch:

Message: Debug: size of structures
Message: sizes: weg_t: 40
Message: sizes: stadtauto_t: 88
Message: sizes: grund_t: 32
Message: sizes: vehikel_t: 112
Parsley, sage, rosemary, and maggikraut.

DrSuperGood

QuoteAs for me personally, I don't remember which C header provides what (beyond printf and FILE-stuff being in stdio.h), so I just try to use the function, and if it isn't defined work, I have to google what to include. I find the STL-headers somewhat easier to remember, or deduce.
Yeh it can get annoying at times.

QuoteIt could be because the compiler can use instructions moving bigger chunks of data when targetting x86-64. Maybe the newer C runtimes are faster than the old msvcrt.dll as well. Standard distributions also have a bit more debugging in them, which might have some effect. Microsoft puts much of the debug information in a separate file.
It might also have to do with SSE2 instructions being enabled by default on the more recent versions of the MSVC compilers. I am not sure of the standard GCC builds have them disabled for compatibility with old processors. In any case the performance difference I am noticing in regards to frame rate between the standard GCC build and self-made MSVC build when running a well developed pak64 map at maximum view distance (you can see 100s of convoys) is quite substantial. The performance difference between x86 and x64 builds in this case is more trivial.

QuoteMessage: Debug: size of structures
Message: sizes: weg_t: 40
Message: sizes: stadtauto_t: 88
Message: sizes: grund_t: 32
Message: sizes: vehikel_t: 112
The amount of memory spent on convoys, factories etc is trivial. A simple 256*256 map (small for some servers) has 65,536 instances of grund_t however might max out at only 1,000-5,000 convoys (if lucky). Ways are also very important since you can easily accumulate tens of thousands of tiles of way in a well developed map. This is how JIT2 could get away with increasing the size of factories without anyone even noticing lol.

What really needs to be done is look at what the difference is between performance of x86 and x64 builds and try and improve it where possible. It could well be the case that the only non-trivial difference occurs in very massive maps in which case there really is no reason the project should remain officially x86 only.

Ters

Quote from: Dwachs on December 17, 2014, 05:36:25 PM
curiously vehikel_t and stadtauto_t do not have increased size - maybe there was already enough padding space wasted.

vehikel_t doesn't increase in size because there is a 32-bit field (speed_limit) following the image_id (bild in vehikel_basis_t). There is therefore, by default, 2 bytes of padding between bild and speed_limit. Changing image_id to 32-bit just fills this padding, as you suspected. It's the same with stadtauto_t, except that the 32-bit field is weg_next.

As for grund_t, summing up the sizes of the fields brings me to 22, not 32. If my calculations are correct, that should mean there are two free bytes available up to a size of 24 (requires some field reordering), except for brueckenboden_t and maybe wasser_t (depends on how large the compiler treats an enum). Was it this prissi saw, or the image_id16 trick? (22 or 24 bytes in not cache line aligned, which I thought grund_t was. 32 is right for this.)

Quote from: DrSuperGood on December 17, 2014, 05:52:13 PM
It might also have to do with SSE2 instructions being enabled by default on the more recent versions of the MSVC compilers. I am not sure of the standard GCC builds have them disabled for compatibility with old processors.

SSE2 was among the things I meant with "instructions moving bigger chunks". I was originally more specific in what I wrote, but changed it. GCC has an option for specifying target CPU, plus individual options for MMX, SSE, SSE2 and so on. It's not 100% clear if the former option controls the defaults for the latter, but I think they do. Target CPU for Simutrans nightlies is Pentium II or III (might be Pentium II because wernieman became tired of going tiny steps down to figure how to make the nightly run on someone's machine, each step taking a day to verify), so no SSE for sure there.

Dwachs

Quote from: Ters on December 17, 2014, 07:27:47 PM
As for grund_t, summing up the sizes of the fields brings me to 22, not 32. If my calculations are correct, that should mean there are two free bytes available up to a size of 24 (requires some field reordering), except for brueckenboden_t and maybe wasser_t (depends on how large the compiler treats an enum). Was it this prissi saw, or the image_id16 trick? (22 or 24 bytes in not cache line aligned, which I thought grund_t was. 32 is right for this.)
The numbers I posted were for a 64bit build. There is a pointer in dataobj that throws the padding off if being 8byte. Also the pointer to the virtual table is 8 bytes.
Parsley, sage, rosemary, and maggikraut.

prissi

The images in ground can be all over the range, even addons like tunnels with their own ground could throw this off or you could overlay a foundation. I started a patch to use a second lookup table. But then it went quickly overboard, so for now I just increased the image_id.

Moreover, the images are generated after all other images are loaded, so they most likely do not end up in the 65535 range.

MSVC never does packed structures. You can enable it, but then you loose the ability to load files via stdio, since the FILE structure (and many more) get also packed. I gave up on packing structures with MSVC.

The grund is returned from freelist. In principle, one can add more freelist to have a 2 byte granularity, if the compiler supports it.

DrSuperGood

QuoteMSVC never does packed structures. You can enable it, but then you loose the ability to load files via stdio, since the FILE structure (and many more) get also packed. I gave up on packing structures with MSVC.
The issue is that technically packed structs are kind of optional in this day and age. Although the C/C++ standard does mentioned that they have to compile (be supported by the language), it does not explicitly define their physical behaviour and memory model (only their logical behaviour is defined which the compiler still obeys) instead leaving it up to the compiler developers to decide. Since they are inherently incompatible with synchronization and have their own performance penalties Microsoft decided to ignore them. I was actually quite surprised to find that GCC does obey them a while back.

Ters

Quote from: prissi on December 17, 2014, 10:15:19 PM
MSVC never does packed structures. You can enable it, but then you loose the ability to load files via stdio, since the FILE structure (and many more) get also packed. I gave up on packing structures with MSVC.

#pragma pack allows fine-grained controll over structure packing. It's so widely used that GCC actually understands this Microsoft extension in addition to it's own packing directives.

Quote from: prissi on December 17, 2014, 10:15:19 PM
The grund is returned from freelist. In principle, one can add more freelist to have a 2 byte granularity, if the compiler supports it.

If you're referring to the two free bytes at the end of grund_t, then you might save memory that way, but I think the two bytes going from 22 to 24 would still be cheaper than going from 24 to 26, as the latter would start using a new native machine word (what x86 for legacy reasons calls a double-word). This is however just speculation.

DrSuperGood

Quote#pragma pack allows fine-grained controll over structure packing. It's so widely used that GCC actually understands this Microsoft extension in addition to it's own packing directives.
Hmmm I might try to look into applying this to some of the objects. The memory saving could raise performance further for MSCV builds.

I can confirm there is a noticeable performance penalty for using x64 builds. In my test map this was about 2-3 fps. The map was very well developed but pretty small so the penalty could increase with larger maps.

I have been trying to benchmark the game to discover where the bottlenecks are occurring (what works faster and slower in x64) however I cannot get the instrumentation to attach. I am guessing its because all the dependant libraries require being built with instrumentation which is a lot of work and annoying.

Ters

Quote from: DrSuperGood on December 18, 2014, 01:48:00 PM
Hmmm I might try to look into applying this to some of the objects. The memory saving could raise performance further for MSCV builds.

Most objects in Simutrans have fields that pack perfectly already, as long as pointers are 32-bit. Forcing packing otherwise will lead to lots of misaligned reads, which might be worse than not packing. 32- and 64-bit builds might require different arrangements of fields to be optimal, although solutions that suit both might also be possible. The latter depends somewhat on how important it is to keep related fields close together.

prissi

#88
With MSVC using packed structures requires that any pointer to such strcuture (like a planquadrat_t is declared as   __unaligned  planquadrat_t *p;)

EDIT: However, it compiles normally with the patch. Maybe this is from ARM and MIPS support times of MSVC. (And save 5% of memory).

TurfIt

Benchmark results: 32bit GCC build.

|r7427 - 16bit||r7428 - 32bit||r7427 w/ Dwachs 32bit patch|
   sizes:   
     koord      4      4      4   
     koord3d      5      5      5   
     ribi_t::ribi      1      1      1   
     karte_t      6408      6408      6408   
     planquadrat_t      12      12      12   
     grund_t      24      24      24   
     boden_t      24      24      24   
     wasser_t      24      28      24   
     obj_t      12      12      12   
     baum_t      16      16      16   
     gebaeude_t      32      32      32   
     senke_t      52      56      56   
     stadtauto_t      64      68      68   
     roadsign_t      32      36      36   
     wolke_t      20      20      20   
     movingobj_t      44      48      48   
     fussaenger_t      40      44      44   
     weg_t      32      36      36   
     convoihandle_t      2      2      2   
     convoi_t      1016      1016      1016   
     convoi_sync_t      104      104      104   
     vehikelbasis_t      28      32      32   
     vehikel_t      84      88      88   
     automobil_t      84      88      88   
     waggon_t      84      88      88   
     monorail_waggon_t      84      88      88   
     maglev_waggon_t      84      88      88   
     narrowgauge_waggon_t      84      88      88   
     schiff_t      84      88      88   
     aircraft_t      120      128      128   
     fabrik_t      1568      1568      1568   
     ware_t      12      12      12   
     haltestelle_t      896      896      896   
     halthandle_t      2      2      2   
     route_t      12      12      12   
     linehandle_t      2      2      2   
     schedule_t      12      12      12   
     spieler_t      348      348      348   
     finance_t      116120      116120      116120   
   timings: (usec/ss,step,or frame)   
    sync_step() sync      498      504      500   
    step()      404      405      405   
    sync_step() display      26972      26789      26898   

Structures larger. Simulation marginally slower. Screen display marginally faster  :o. Overall insignificant difference.
sync_step() timings per object type show citycars and people taking the biggest hit. When I shrunk stadtauto_t from 72 to 64 bytes there was a ~25% speedup. This change putting it back to 68 bytes is showing ~15% slower again. Obviously these results are from some unreleased work rearranging these structures - trunk is larger and slower...

Ters

Quote from: prissi on December 18, 2014, 10:45:55 PM
With MSVC using packed structures requires that any pointer to such strcuture (like a planquadrat_t is declared as   __unaligned  planquadrat_t *p;)

From what I can read out of MSDN __unaligned means the the structure itself isn't aligned. When I think of packed structures, it's the fields inside the structure that is packed. The structure as a whole would remain aligned. If there is a structure packed inside another structure, pointers to that must however be marked __unaligned. The freelist might need some adapting to operate on multiples of 8 instead of 4 for 64-bit builds.

Rearranging the fields to avoid wasting space on padding should be a far better idea than force-packing. Unaligned access would cause anything for mild (typically x86) to severe (mostly everything else, including x86 in some cases) performance implications. Manual packing has been done successfully for Simutrans so far from what I've read, including TurfIt's post right above.

Quote from: TurfIt on December 19, 2014, 01:43:55 AM
Structures larger. Simulation marginally slower. Screen display marginally faster  :o. Overall insignificant difference.
sync_step() timings per object type show citycars and people taking the biggest hit. When I shrunk stadtauto_t from 72 to 64 bytes there was a ~25% speedup. This change putting it back to 68 bytes is showing ~15% slower again. Obviously these results are from some unreleased work rearranging these structures - trunk is larger and slower...

Apart from wasser_t, it doesn't seem like Dwachs' image_id16 gains anything should it actually work. There might be a lot of water, though. On the other hand, I would expect this to affect rendering most. The differences are so small that they could be nothing but noise from other processes on the machine for all I know. I also don't know if this is a low, middle or high end computer. If it's in the higher end, the question of how low end computers are affected remains.

Dwachs

I think, we can forget my patch. prissi's brute force 32bit approach did not increase grund_t for my 64 bit builds. Nothiing to gain by going to 16bit image id's.

@TurftIt: can you post your patch for rearranging data structures?

btw, the 65534 image limit is still in simgraph16.cc ;)
Parsley, sage, rosemary, and maggikraut.

Ters

I've done some more experiments, and see now that the size of a structure is rounded up to a multiple of the size of the biggest primitive field. Earlier, this didn't seem to be the case when I did a quick test. Must have been a stale file somewhere.

I also see from what has been committed that prissi seems to be thinking of the size of structures that are embedded in other structures, and that packing will indeed not only affect the padding between fields, but also at the end of the structure. My comments about __unaligned should still be valid, though, as long as we ensure that nested structures are properly aligned in the composite structure. Things only get problematic once these packed structures are also used by themselves in arrays.

prissi

pragma pack(1) works fine on Intel; but for instance a 16 bit word at an uneven address on Motorola or Arm will cause a bus error resp. address error under some processors/modes.

Anyway, since this has gone in, maybe under incorporated for now.

Ters

Quote from: prissi on December 19, 2014, 11:38:54 PM
pragma pack(1) works fine on Intel; but for instance a 16 bit word at an uneven address on Motorola or Arm will cause a bus error resp. address error under some processors/modes.

I just thought the compiler could remember itself that the structure has been packed without care for alignment. Furthermore, Simutrans should not put 16-bit words on uneven addresses. I figured the purpose of packing koord3d was to make its size 5 bytes rather than 6, so that a structure containing a koord3d, an uint8 and an sint16 (in that order) would be 8 bytes, not 10. Yet nothing in this structure would not be properly aligned, nor need forced packing to remove empty spaces. A packed struct of an uint8, and sint16 and a koord3d would however normally be bad, whether packed or not.

sheldon_cooper

Those settings will be in the next version pak128 that will come out ?? I'm having this problem now. :P

DrSuperGood

QuoteThose settings will be in the next version pak128 that will come out ?? I'm having this problem now. :P
This is a simutrans problem and not one of its paksets that it uses. You can try downloading the nightly builds of simutrans now and testing if the problem has been fixed.

Ters

I see no noticeable performance drop on my Intel Pentium E2200 dual-core 64-bit Linux machine. On it, Simutrans is compiled as 64-bit, which means alignment is sub-optimal anyway. GCC is however allowed to use the full instruction set available on such a basic x86-64 CPU, which includes SSE3. It only has a 1280x1024 screen, which means Simutrans only deals with 1280x915 pixels. Yet when following a train through dense forest at normal zoom level in a well developed 1024x1024 pak64 map, things still move along at 25 FPS, almost 5 simloops, with idle time to spare and only about 50% utilization of a single core (no multi-threading).

jamespetts

I have applied the commits from the SVN to Experimental: I have been worried for some time about Pak128.Britain-Ex reaching the limit, as it has so many vehicles, many with multiple livery variants. Do the changes on the SVN actually extend the limit? That is not entirely clear from reading this thread. I see that Dwachs' patch goes further than that, but he later points out that the limit is still in simgraph16.cc, and his patch only changes an error message in that file.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

I'm not sure what Dwachs was referring to. There are however several functions in simgraph16.cc, or perhaps rather simgraph.h, that do not use image_id as the type for image numbers. unsigned should however be a 32-bit type.

jamespetts

So, is the 65534 image limit now gone in Standard (in the latest SVN/nightlies)?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

Quote from: jamespetts on December 25, 2014, 10:46:06 PM
So, is the 65534 image limit now gone in Standard (in the latest SVN/nightlies)?

This is in Incorporated patches, is it not? Some bugs might linger, though. I am also not sure anyone has tried loading more than 65534 images yet.

jamespetts

Ahh, I was unsure whether the patch incorporated actually had that effect given Dwachs' comments.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

DrSuperGood

If people encounter any bugs they can be fixed there and then. For now it seems stable so any progress towards removing the limit is better than none.