News:

Simutrans Tools
Know our tools that can help you to create add-ons, install and customize Simutrans.

Reducing code size/optimising

Started by kierongreen, August 06, 2013, 01:59:19 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

kierongreen

Anyone who's been looking at SVN log over the last day might have noticed some small commits aimed at optimising/reducing code size. Well, here's a larger one. I've rolled up a few sections of code in simwerkz into loops - the benefits being:
a) Smaller code so should be more readable
b) Less duplicated code so should be more difficult to make copy and paste errors.

However before I committed this I just wanted to check that noone had any objections and that I've not made a silly error in the process. If I get more time I'll be looking for more cases like this of repeated sections of code and try and simplify them also.

Incidentally, the various changes I've made over today have reduced the compiled executable size by 7.5kb here (0.2%) - hopefully that slightly balances out my usual tendency to add add add add... :p

Ters

In one instance, you eliminated creating a koord just for passing it to karte_t::access, which can take x and y directly. But in another place, you also switch to the non-koord access even though you have and need a koord anyway. Seems a bit pointless. As for karte_t::access, would it be possible to have the koord version call the non-koord version to remove duplication of implementation? Or will some compilers fail to inline two levels deep?

kierongreen

QuoteIn one instance, you eliminated creating a koord just for passing it to karte_t::access, which can take x and y directly. But in another place, you also switch to the non-koord access even though you have and need a koord anyway. Seems a bit pointless.
This happened mainly because I first eliminated both koords then discovered that one was needed. When inlined it code will be slightly smaller I think for access(x,y) than access(k) though (assuming you already have x and y as separate variables which we do here (although this might depend on compiler?):

   inline bool is_within_limits(koord k) const {
      return (k.x|k.y|(cached_size.x-k.x)|(cached_size.y-k.y))>=0;
   }

   inline bool is_within_limits(sint16 x, sint16 y) const {
      return (x|y|(cached_size.x-x)|(cached_size.y-y))>=0;
   }
...
   inline planquadrat_t *access(int i, int j) const {
      return is_within_limits(i, j) ? &plan[i + j*cached_grid_size.x] : NULL;
   }

   inline planquadrat_t *access(koord k) const {
      return is_within_limits(k) ? &plan[k.x + k.y*cached_grid_size.x] : NULL;
   }


Incidentally as you can see from this code inline is already 2 levels here.

Ters

3 levels then (access, access, is_within_limits). is_within_limits is also a duplication of implementation. If compilers won't inline enough on their own, one can always turn to the preprocessor, but that may make it more difficult for IDEs to understand what's going on.

kierongreen

Right - made the following changes to trunk:
removed duplicate code in x,y and k variants of functions - inlining k variants to x,y ones and provided x,y ones where they weren't before.

removed the welt->lookup(koord) (returns plan) as welt->access(koord) (returns plan) was identical. Also having lookup sometimes returning ground, sometimes plan was a bit confusing. Now welt->lookup always returns a ground.

I've also changed all calls to welt->abc(koord(x,y)) to welt->abc(x,y) throughout the program to simplify the code. Note this is only welt->abc(koord(x,y)) that I've replaced - calls to welt->abc(k) I've kept as is.


This means that the patch above needed updating so new version attached.


Edit: Note that with new variant functions compiled code size is increased as there are more function names - however when you strip out symbols it is smaller than before.

prissi

Did not koord already had nearest neighbor offsets, so one could just count from 0 to 9? Since leitung2.cc needed this too.

The change in simworld is somewhat meaningless, since you made a koord for min_hgt again ... I find

koord k(ix,iy);
access(k)->kartenboden_setzen( new boden_t( this, koord3d( k, max( min_hgt(k), get_water_hgt(k) ) ), 0 ) );

even more clear.

Ters

Quote from: kierongreen on August 06, 2013, 12:29:35 PM
removed the welt->lookup(koord) (returns plan) as welt->access(koord) (returns plan) was identical. Also having lookup sometimes returning ground, sometimes plan was a bit confusing. Now welt->lookup always returns a ground.

These methods might need better names. There is nothing obvious about that access is 2d (returns planquadrat) and lookup is 3d (returns ground). Perhaps call then simply plan_at and ground_at respectively. But renaming such central functions is cruel to patches.

kierongreen

QuoteDid not koord already had nearest neighbor offsets, so one could just count from 0 to 9? Since leitung2.cc needed this too.
Well, neighbours allows you to count from 0 to 7. However there are interesting side effects of using that vs looping over x,y -1 to 1 which I encountered...

QuoteThe change in simworld is somewhat meaningless, since you made a koord for min_hgt again
Ok, reverted in latest commit.

Anyway, committed more changes (including the patch above) aimed at reducing large code duplication in boden/grund.cc, boden/wasser.cc and boden/wege/weg.cc as well as simwerkz.cc. Final tally for code reduction was around 12kb (0.35%) off a stripped executable, and more readable code (in my opinion anyway!). That'll be last of changes for a while so feel free to start translating...

Incidentally, as a result of the code tidy I finally (after over a year) managed to track down and get rid of a very annoying bug regarding back bild slope images being displayed incorrectly :)

kierongreen

Interesting discovery - 99% of code calling lookup_hgt (and get_water_hgt) already knows that the location is within the map. So why make an additional check in these functions? Taking it out (and modifying the very small number of calls that might have an invalid location to perform a check first) decreases drain_tile time by 20% (which on my mega heightmap means saving of 2.5 minutes on generating game from this function alone - along with smaller savings in the rest of enlarge karte too!).

prissi

#9
The idea of simutrans was that even wrong parameter should get an idea of an error. But with todays MMU an accidently wrong address will most likely terminate the program anyway. But for calls directly from karte_t you can access planquadrat directly anyway, or even planquadrat and then the ground at zero.

I would really limit this to thing like map creating, since in the past we had from time to time maps with holes, which did not crashed simutrans but could be recovered by loading and saving.

kierongreen

Yes map generation is a major place for this (looping over all tiles and checking each one to see whether it is in the world... several times!). However there are others..... As I said, 99% of calls to these either have explicitly checked that the position is valid, or have code around them which assumes that it is (e.g. welt->access(k)->.... which would crash simutrans if k was invalid).

I appreciate that simutrans is designed to be foolsafe (largely). Aside from map creation the other place where I think there'd be a reasonable performance boost would be in karte_t raise and lower (and associated function). These can get called repeatedly, including in map creation so even small savings would add up.

In my local copy currently I've made lookup_hgt_nocheck, get_water_hgt_nocheck, min_hgt_nocheck and max_hgt_nocheck functions - with the normal versions of these keeping checks. Hence it's up to caller to decide whether they need the check or not.

Ters

It would make sense that validity of coordinates are only checked when entering the first karte_t function. For calls internally in karte_t, one can perhaps assume coordinate is valid. Are there any bottlenecks that would be unable to use _nocheck if they are private?

kierongreen

So create lookup_hgt_intern, get_water_hgt_intern, min_hgt_intern and max_hgt_intern - make these all private and only use them internally in karte_t? That could work and should cover most cases (including the ones with large performance boosts). Could even inline the normal ones to those functions.

Ters

I'd keep the names as originally proposed. They are more informative. Implementing the existing functions in terms of these new ones seems like a good idea, as long as a critical inline nesting level isn't crossed. Preferrably with the nocheck next to the original function, althought that looks better in Java and similar languages with per-member access modifiers than in C++.

prissi

SInce the actual lookup is trivial, one could write them out explicitely. And yes, ..._nocheck sound better.

kierongreen

Right here's a work in progress tidy/optimise patch (I'll submit final version when SVN is back up...).

It's rather big as there are a lot of changes and renaming going on. I've standardised names for variables in simworld.cc (and simwerkz.cc). k now should always refer to koord, pos to koord3d. Previously the majority tended to conform to this but there used somewhat randomly.

nocheck function list:
lookup_hgt_nocheck
get_water_hgt_nocheck
min_hgt_nocheck
max_hgt_nocheck
access_nocheck
lookup_kartenboden_nocheck

min_hgt_nocheck and max_hgt_nocheck use optimised code to lookup height values of the grid in each corner previously used in calc_natural_slope. min_hgt and max_hgt themselves keep the existing easier to read code.

Times in seconds for map generation of my huge heightmap: Was 926 - now 430 (46% of previous). Most of this saving comes from drain tile: was 763 - now 271 (35% of previous) however other parts also have significant savings - creating grounds now takes 28 seconds rather than 33 (84% of previous), cleaning the map takes 39 seconds rather than 49 (79% of previous).

Smaller maps have similar speedups - in general map generation now takes half the time it used to.

One point that should be noted is that the depth of the highest lake does have a huge impact on performance. For example - changing this from grundwasser+8 to grundwasser+40 results in map generation with the huge heightmap taking 1200 seconds or so (with drain tile taking 1000 seconds).

A section of code that still takes lots of time is river generation on maps with lots of steep slopes - it seems to have to try lots of times before finally giving up. I might look at this...

Of course, there's always multithreading to help matters...

kierongreen

Added multithreading to this patch for perlin noise, ground creation and calculating climate transitions and ground image. Speed is now 2 or 3 times quicker through these sections with 4 cores. There's other sections that could be multithreaded very easily but because they don't take as long to run the improvement probably wouldn't be noticeable. The major one to try and multithread would be the tile drain function - but that would require a rethink.

Generation times for maps are pretty much consistently 42% of what they originally were (or 2.3x faster).

TurfIt

There's an awful lot of loops in enlarge_map that are iterating x/y rather than y/x. i.e. The memory layout is such that the x loops should be inner.

What does the whole drain tiles section do? From the comments about creating lakes, I assume it's supposed to actually create lakes above sealevel. But I've *never* seen a single lake created...

A quick look at this routine, and I'd guess the massive slow down is from those two arrays (from_dir and stage) being new'd and worse, memset for every tile on the map! I think that needs a rethunking to make do with smaller arrays, or reusing them with only re-memsetting the modified sections, or something.

Ters

Although I have not yet fully grasped what exactly drain_tile does, I suspect that drain_tile is doing much of the same calculations over and over again. Makes me think of dynamic programming, though I don't remember much of it.

kierongreen

#19
Ok addressing several points:

QuoteThere's an awful lot of loops in enlarge_map that are iterating x/y rather than y/x. i.e. The memory layout is such that the x loops should be inner.
Indeed - not sure why I didn't spot that one as I remember it being one of the first optimisations I learned (at the age of 10, from my dad boasting about how he'd massively improved performance in a program at work by doing just that). Implemented for a small but measurable saving of up to 5% on overall map generation times

QuoteI'd guess the massive slow down is from those two arrays (from_dir and stage) being new'd and worse, memset for every tile on the map!
This is actually very quick - should take a second at most even on huge maps. On 256x256 maps you're talking about a few milliseconds for this.

QuoteI suspect that drain_tile is doing much of the same calculations over and over again.
The nature of the algorithm does result in tiles being visited repeatedly (up to the maximum lake height ( 8 ) times potentially although less for many tiles). It is a naive implementation so there is potential for improvement for sure.

QuoteBut I've *never* seen a single lake created...
Lakes are formed in basins with no drainage - although prissi complained about the number so I made tweaks to reduce them. If you generate a 256x256 map with height 320 and a noise of 6 or more you should start to get some.



I'm going to look at lake generation to see if I can find a way to improve it - both in terms of speed and appearance. There's probably scope for configuring various parameters via the gui - I'm going to leave that aspect for now though because I know other people are playing around with the gui (and I think we probably need to completely overhaul the layout of the landscape settings dialogue into tabs as it's far too cluttered currently).


Incidentally it's worth pointing out the optimisations and multithreading between them result in map generation times up to 2x faster even taking out the lake generation code - for example with my huge height map:
Start (s)Optimised (s)Optimised+multithreaded (s)% overall saving
Total926430370.560%
Lakes76328327364%
Total without lakes16314710337%

Although as lakes still take 2/3 of the overall time clearly any improvement will have a significant effect on overall times.

kierongreen

#20
Ok - I rewrote lake generation from the bottom up... Literally - now rather than flooding the world and draining off the water the code tries to flood each tile, starting at grundwasser+1 ending at grundwasser+8. Although that may seem inefficient when it fails it marks those tiles to be excluded from subsequent attempts so performance is improved on previous code (time reduced by least 20% compared to previous optimised code - in one case reduced by 60%). Compared to the non optimised code lake generation is between 3 and 8 times quicker - overall landscape generation is 2 to 5 times quicker.

Along with being fast it also produces rather nice looking lakes too with a wide range of map settings - try and see!

Anyway, I've committed changes in revision 6640 - SVN back up :)

Edit - and after spending 2 hours trying to multithread lake generation (it worked - but performance was no better) discovered further scope for increasing the speed of searching tiles (hello goto, I thought I'd left you behind in the 1990s with BASIC...). Time down by another 25% over previous best...

prissi

About going over tiles xy versus yx: Would it the proper way to just add an iterator over the planquadrat class? That would also take care of valid planquadrats and tile order in memory and so on (especially since it would just iterate from zero to max planquadrat count witthout any multiplications). One could even have a second iterator for all grounds.

Ters

I would assume most code wants to know the coordinates of the tiles it is iterating over. A linear iterator would in that case trade a multiplication for a division or a peek into the grounds, which is not the right way to go. There is another way to get rid of the multiplication, but I'm not sure it's worth it.

prissi

Some code (like laden abschliessen) iterates over all tiles ...

By the way, simthread.cc does not for some more people. Imho we should avoid emulating the OS with simutrans. And if doing so, why not rename the function then according to the real pthred names?

So what was the reason? THe xy_loop worked fine before. Or was there need for a fix.

TurfIt

Quote from: prissi on August 12, 2013, 07:50:40 PM
By the way, simthread.cc does not for some more people. Imho we should avoid emulating the OS with simutrans. And if doing so, why not rename the function then according to the real pthread names?
'does not for some more people' ??

simthread.cc was just supposed to be an experiment in seeing if some of the OSX speed woes could be helped by getting the multithreading working - wasn't really meant for trunk yet...
I agree simutrans shouldn't be emulating the OS, but when one of the supposed supported simutrans platforms doesn't provide a critical feature, what do we do?  barriers are such a fundamental construct I don't understand why it's optional, or worse why OSX doesn't implement it. But it doesn't so....  Is there some other multiplatform threading library we should be using? Or extend the whole simthread concept to be wrappers to each OS native threading?

I purposely avoided using the real pthread names in case the pthread library actually defined them but just had stub implementations. If there's no pthread library out there doing that, then could rename (and get rid of the ugly #define hack too...)


Quote from: prissi on August 12, 2013, 07:50:40 PM
So what was the reason? THe xy_loop worked fine before. Or was there need for a fix.
xy works. yx works too, and slightly faster. No downside I know of, so why not?


---
Multithreading the create grounds will result in the bodens being scattered about memory - is there anything than iterates over these? There's already a measurable performance hit on new maps vs reloaded ones, presumably something else allocating differently on new vs load. A save/reload fixes things, so not too important. Really all the allocations need to be done single threaded, then a multithreaded init routine called - constructors need to be nuked.


---
Lakes, there's sure a lot of lakes now... With the old code I finally seen a couple lakes but setting roughness to 6. on 4, no lakes; Plenty of basins where lakes could've been but weren't. Now the new code seems to find every basin and fill it. Maybe there needs to be a configurable % of basins left without water?


---
When adding new files to svn (simthread.*), please remember svn:eol-style=native.   I get CR, CR/LF nightmares when this property is missing.

kierongreen

#25
QuoteAbout going over tiles xy versus yx: Would it the proper way to just add an iterator over the planquadrat class?
I suspect that would be a lot slower. Iterating is a good solution where there are only a few tiles to process, but as the number increases it becomes slower relative to xy/yx.

QuoteI would assume most code wants to know the coordinates of the tiles it is iterating over. A linear iterator would in that case trade a multiplication for a division or a peek into the grounds, which is not the right way to go. There is another way to get rid of the multiplication, but I'm not sure it's worth it.
For really speed critical code you can loop over memory locations instead of x. So:

   // loop over map to find marked tiles
   for(  int y = 0;  y<size_y;  y++  ) {
      for(  int x = 0;  x<size_x;  x++  ) {
         if(  stage[array_koord(x,y)] > -1  ) {
            set_water_hgt(x, y, new_water_height );
         }
      }
   }

becomes

#define array_koord(px,py) (px + py * size_x)

   // loop over map to find marked tiles
   for(  int y = 0;  y<size_y;  y++  ) {
      uint32 offset_max = array_koord(size_x - 1,y);
      for(  uint32 offset = array_koord(0,y);  offset < offset_max;  offset++  ) {
         if(  stage[offset] == -1  ) continue;
         water_hgts[offset] = new_water_height;
      }
   }


This latter code is very efficient where small segments of code are to be executed for each tile that can operate directly on arrays rather than through function.

QuoteBy the way, simthread.cc does not for some more people. Imho we should avoid emulating the OS with simutrans. And if doing so, why not rename the function then according to the real pthred names?
What wasn't working? To me it looked like for Windows/Linux there should be no change as it just defines to pthread. Certainly here it didn't make a difference. For OSX it wasn't working anyway. Getting multithreading working on OSX really has to be a priority especially with the other poor performance people are getting on that.

Anyway, if I'd thought it would have caused problems for people I wouldn't have committed so I'm sorry.

Quotesimthread.cc was just supposed to be an experiment in seeing if some of the OSX speed woes could be helped by getting the multithreading working - wasn't really meant for trunk yet...
Well, many people will only report issues or improvements in nightlies - they are unlikely to download patches and compile the sources themselves. Hence if something looks like it is a reasonable solution without obvious problems it should get into trunk quickly for testing.

QuoteMultithreading the create grounds will result in the bodens being scattered about memory - is there anything than iterates over these? There's already a measurable performance hit on new maps vs reloaded ones, presumably something else allocating differently on new vs load. A save/reload fixes things, so not too important. Really all the allocations need to be done single threaded, then a multithreaded init routine called - constructors need to be nuked.
That's interesting, and something I hadn't necessarily considered... Perhaps for map creation we could just new an array of grounds then set the kartenboden on each to that. Wouldn't matter then whether multithreaded or not. Should be even faster than current too... Might even be worth it for map load too?

QuoteI agree simutrans shouldn't be emulating the OS, but when one of the supposed supported simutrans platforms doesn't provide a critical feature, what do we do?  barriers are such a fundamental construct I don't understand why it's optional, or worse why OSX doesn't implement it. But it doesn't so....  Is there some other multiplatform threading library we should be using? Or extend the whole simthread concept to be wrappers to each OS native threading?
Extending it may be the way to go I think.

QuoteLakes, there's sure a lot of lakes now... With the old code I finally seen a couple lakes but setting roughness to 6. on 4, no lakes; Plenty of basins where lakes could've been but weren't. Now the new code seems to find every basin and fill it. Maybe there needs to be a configurable % of basins left without water?
There's plenty of scope here for sure. All basins should in general be filled by water (or silted up over time). Only in very dry areas are lakes non existent or lower than the outlet to a basin (think Dead Sea or northwestern China).

I find good results with lower noise and higher heights overall which is why I gave more flexibility to the balance between the two when generating maps.

Edit:
Some more thoughts regarding lakes - rather than % (which might be difficult with current code) maybe regions of simutrans world could be defined - or lake generation could be linked into a more sophisticated climate generation model. After all in Scandinavia there are lakes everywhere where water can collect, in Saudi Arabia not so many...

At very least it could be made an on/off option like trees. I want to improve tools for lowering and raising water level too at some point.

Unfortunately the amount of code I'll be writing from now on will be a bit less as I'm back at work from tomorrow. Oh how the summer holidays pass so quickly...

Ters

Quote from: kierongreen on August 12, 2013, 11:25:52 PM
I suspect that would be a lot slower. Iterating is a good solution where there are only a few tiles to process, but as the number increases it becomes slower relative to xy/yx.

How can two nested for-loops be faster than one? I assume what prissi mean essentially boils down to:


for (planquadrat_t *iter = world->plan; iter != (world->plan + w * h); ++iter) {
  ...
}


But as I pointed out, it works best when you need to process all tiles blindly.

kierongreen

Ah my bad that should be quick indeed.