The International Simutrans Forum

 

Author Topic: Server performance: preliminary findings  (Read 1601 times)

0 Members and 1 Guest are viewing this topic.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Server performance: preliminary findings
« on: December 23, 2019, 11:51:35 PM »
I have not had time to do a great deal of investigation into the server performance issues in the lead-up to Christmas, and I am now staying with my parents for some weeks so will not be able to do much else until sometime into the new year, but I have been able to undertake a preliminary analysis of some of the issues relating to the performance of the server by running a performance profiler on my own local machine.

First of all, by far the greatest issue appears to be the amount of time spent saving a game. On the server, this has to happen whenever a new player joins or once every hour, whichever is less frequent at any given time. Even on my new build Ryzen 3900x, saving the server game takes a seriously long time. I have not timed it exactly, but it is many minutes. The actual server, which will almost certainly be slower than my new Ryzen at home, will take even longer.

The performance profiler shows that virtually all of the saving time is spent compressing the file: only a few seconds are spent actually writing the data, and than many, many minutes are spent compressing it. The compression uses the zlib algorithm. The bzip algorithm is an alternative option, but this is even slower as it has a higher compression ratio, and was dropped from server games even in Standard some years ago as it was realised that any transmission time saved by having a smaller file was more than obliterated by the compression time.

The zlib algorithm is based on an open source library, and, at least in so far as is implemented in Simutrans at present, is single threaded. Since multi-core machines are now commonplace, and the server has access to 6 cores, if there is a multi-threaded alternative to this, this could very significantly increase performance on the server in larger games.

That the server is spending almost all of its time saving the game would account for the experience reported by players, namely, taking an extremely long time to join and the server being unavailable (because others are trying to join, requiring saving of the game) most of the time.

If anyone knows of any suitable multi-threaded compression libraries with a compatible licence (Simutrans uses the Artistic Licence, so, sadly, the GPL will not be compatible unless a pre-compiled library be used; but the LGPL should be fine), or (even better) just a multi-threaded implementation of the current compression algorithm, that would be very helpful.

As to the running game, most of the time is still spent with passenger generation (as discussed previously) and finding a suitable public transport route for passengers to get from any given origin to any given destination. I did make a few small optimising changes a week or so ago once I was able to use Visual Studio 2019 on my new computer and take advantage of its line by line CPU utilisation data, but these will be fairly minor in effect.

However, increasing the number of threads will assist with this, as this is a fully threaded workload, and can scale proportionally to the number of available CPU cores. Since the server now has access to 6 cores, I have increased the number of threads on the server to 6.

Increasing the number of threads to 12 on my local machine (a new Ryzen with 12 cores) resulted in a somewhat uneven performance improvement; the game would perform somewhat jerkily, slowing down for a fraction of a second every 1-2 seconds. This suggests that the single threaded workloads are of some importance to the performance.

On the current server game, whose year is at present 2004, the greatest single threaded workloads are the sync_step operations for private cars and pedestrians (and especially the former), hop_check for private cars being the greatest single consumer of resources. Neither private cars nor pedestrians do any actual route finding, so it is just some fairly basic checks being done an extremely large number of times that is taking all the CPU time.

I wonder, therefore, whether any of this can be multi-threaded. Since pedestrians do not affect anything else directly, in theory those could be multi-threaded concurrently quite easily, but there could possibly be some situations where the number of objects on a tile including pedestrians exceeds the in-built limit and it would then affect other game states and might cause a loss of synchronisation if these were multi-threaded. I am not sure of whether or not there is a robust way around this at present.
As to private cars, since these do affect other things in the game (and each other), these cannot be fully multi-threaded concurrently, but I wonder whether some of the more intensive parts of their algorithms might be multi-threaded in parallel in much the same way as the player vehicle route finding is done. I have not looked into this in detail, however.

Additionally, there is the matter of memory consumption. Memory consumption on the server is again ~95%, suggesting that this is causing a problem for the server which may well impair performance. The only remedy to this is to have a smaller map (smaller, that is, not in land area, but in number of cities). There are 768 on the current server game - the next one might have perhaps circa 450. By way of comparison, the current Stevenson-Seimens server has 37 towns. I have not tested this, but I suspect that the large number of private cars will be largely responsible for the high memory consumption.

Finally, there is the matter of the number of sync_steps per step. I had experimentally increased this whilst also increasing the framerate to attempt to get a smoother play experience. However, unfortunately, a combination of the problems with accessing the server caused by excessive saving time, together with a pakset compiling problem at home which caused me to lose synchronisation the only time that I was able to get a spare joining slot has meant that I have been unable to test the efficacy of these steps. It would be useful to have detailed feedback from anyone who has managed to log in, however.

Edit: There are multi-threaded and open source implementations of bzip2 (lbzip2 and pbzip2) which claim to scale linearly with the number of threads; but, although I have found the source code, there appears to be no equivalent of the function calls in bzip2 and no obvious way of substituting this algorithm for the bzip2 algorithm in loadsave.cc.

If anyone knows anything more about this than I do, any assistance in this regard would be much appreciated.
« Last Edit: December 24, 2019, 12:51:11 AM by jamespetts »

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1334
Re: Server performance: preliminary findings
« Reply #1 on: December 24, 2019, 01:37:27 AM »
If anyone knows of any suitable multi-threaded compression libraries with a compatible licence (Simutrans uses the Artistic Licence, so, sadly, the GPL will not be compatible unless a pre-compiled library be used; but the LGPL should be fine), or (even better) just a multi-threaded implementation of the current compression algorithm, that would be very helpful.

Adapting "pigz" would likely be the best bet. When I implemented the current 2 threaded save/load routines I'd investigated expanding them to handle more threading, but decided it was not worth the effort. When compressing with zlib, you can run 2 compression threads per serialization thread. i.e. compression is half the speed of serializing. To run more than one serialization requires a significant rewrite of the save format and routines.  And don't forget the loading time... things are even worse there with object allocation taking more than 5 times the decompression time. Parallel object allocation is another massive effort. Best idea - don't save so much data!

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2791
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #2 on: December 24, 2019, 02:29:38 AM »
Currently the main performance issue is the lack of free memory. Compression is slow, but compression is considerably more slow when having to deal with page faults during the process. Under these conditions the saving takes so long that the clients timeout trying to join.

Offline Freahk

  • *
  • Posts: 422
  • Languages: DE, EN
Re: Server performance: preliminary findings
« Reply #3 on: December 24, 2019, 11:58:49 AM »
As mentioned, we can have a look at pigz for a multithreaded implementation of gzip.

Alternatively we can have a look at lzop/LZO or lz4, both are much faster compared to zlib. Compression won't be that good but I guess still sufficient.
Sadly it is distributed under GPL but preinstalled or at least available in system repos of any commonly used Linux distro and for Windows there are Precompiled binaries available.
It is pretty likely to even outperform pigz up to at least 4 cores in compression and decompression time but at slightly worse compression rate.

Another relatively new thing in compression is Zstandard. While being slightly slower than lzo when using only one thread, it has multithread support and will outperform all the the here mentioned libs in that case at least in compression time at low compression levels, while compression rate will be roughly the same as for zlib.
It is using the BSD license. I don't know in how far that license is compatible to artistic license.

So far about compression these days, when enough memory is available.
Lack of free memory and pagefaults will for sure greatly decrease IO and thus slow down any compression algorithm.
Zlib is most likely the one consuming the lowest additional amount of memory while compressing and decompressing.
Lzop and zstd single-threaded will consume a little bit more memory but it's still not too much.
Pigz and lz4 will consume relatively huge amounts of additionally memory compared to gzip but it's still pretty small compared to the overall amount of data to be compressed.
Zstd seems to consume quite a lot more memory when increasing the number of threads. Thus, we have to use multithreading carefully here but when using only 2 or 4 cores it should be fine.
« Last Edit: December 24, 2019, 01:06:03 PM by Freahk »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #4 on: December 24, 2019, 12:20:31 PM »
I have just looked at pigz - this appears to have the same problem in that the source code does not provide a replacement library for zlib where the same calls, such as gzopen, are replicated, and there is no documentation on how, if at all, this can be used inside existing code rather than as a stand-alone application.

Offline Vladki cz

  • Devotee
  • *
  • Posts: 2798
    • My addons, mostly roadsigns, pak128.cs
  • Languages: EN, CS
Re: Server performance: preliminary findings
« Reply #5 on: December 24, 2019, 03:15:03 PM »
I would also recommend lzo or lz4. These are used for vpn and filesystem (zfs) compression, so speed is most important for them, and compression ratio still quite good.

Regarding pedestrians, as they are purely decorative, wouldn't it be good to switch them off completely for very large games? I also turn off water animation.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #6 on: December 24, 2019, 03:19:02 PM »
Do lzo or lz4 have an API for implementing in existing codebases, do you know?

Offline Freahk

  • *
  • Posts: 422
  • Languages: DE, EN
Re: Server performance: preliminary findings
« Reply #7 on: December 24, 2019, 04:21:15 PM »
lzop/lzo has roughly the same relation as gzip/zlib, zlib and lzo are libraries, so you can use/link these from existing codebases, whereas gzip and lzop are command line tools using that lib.

In case of lz4, both have the same name but there is also a lib and a commandline tool using that lib.
Quote
These are used for vpn and filesystem (zfs) compression, so speed is most important for them, and compression ratio still quite good.
zlib, lzo and zstd can all be used for compression of btrfs filesystems, so speed is most important for them!
Well, I guess you got the point.

Single threaded lzo will be the fastest of these, however, zstd will compress roughly as good as zlib and will be faster than lzo for #cores>=2, so we should prefer zstd imho.
Here is the API http://facebook.github.io/zstd/zstd_manual.html

Their BSD license seems to be fine.
https://github.com/facebook/zstd/blob/dev/LICENSE

Quote
Regarding pedestrians, as they are purely decorative, wouldn't it be good to switch them off completely for very large games? I also turn off water animation.
Yes, please!
« Last Edit: December 24, 2019, 05:06:30 PM by Freahk »

Offline CK

  • *
  • Posts: 26
  • Languages: NL, EN, FR
Re: Server performance: preliminary findings
« Reply #8 on: December 26, 2019, 12:24:49 PM »
I was able to find this on license compatibility between the Zlib one and the Artistic License. The main difference between the AL and BSD licenses appears to be the requirement to note changes within the file. Isn't it possible to put the compression library into its own DLL file?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #9 on: December 26, 2019, 12:54:46 PM »
Thank you both for your replies. We already use zlib, so I assume that they are as compatible as they need to be for our purposes. If the lzo licence is the same as the zlib licence, then this should be compatible to the same extent.

As to lzo, it is useful to find something that apparently has a library - I will have to investigate that further when I return home after the Christmas holidays. Thank you for looking into this.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #10 on: January 15, 2020, 01:16:22 AM »
I have looked into this a little further: testing confirms that it is indeed the compression that is taking all the time. Saving in binary mode (i.e. uncompressed) takes a tiny fraction of the time taken by saving in the default zipped mode, but produces a file ~4gb instead of ~1gb. Saving in bzip mode seems to take slightly longer again but produces a smaller file.

The thing that takes all the effort when saving is the routing data for passengers, mail and goods: turning off saving these produces a very short save time even when zipped (comparable to not compressing at all), but the route refresh on loading takes an extreme amount of time.

Running a performance analysis shows that 91.36% of time spent saving is spent by the line

Code: [Select]
lsp->loadsave_routine->flush_buffer(buf);

which ultimately points to "deflate", which consumes 90.73% of time. Loading, by contrast, takes a fraction of the time taken by saving.

In some senses, looking into multi-threaded compression might make this process somewhat faster, but if the rebuilding of paths takes tens of real time minutes in any event, then the game is likely to be excessively slow to update paths even if it is fast to access, so the ultimate issue here is simply that the game is beyond the size that can be handled by modern computers. Thus, faster compression may well be worthwhile, but it is probably not a priority. What is a priority is working out how to size the next Bidgewater-Brunel map to prevent this from recurring without limiting gameplay excessively.

The more complex question is how to measure "size". It is not size of the map, of course, as it is not the empty land tiles that take all the effort. What takes all the time in loading/saving (and what also takes by far the most time during running) is dealing with the pathing data for passengers, mail and goods. These data scale more or less exponentially with the number of stops, assuming that there are not large networks completely isolated from one another, which it is reasonable to assume will always be the case in a developed game. Thus, we need to think of "size" for the purpose of computational intensity in terms of stops.

The Bridgewater-Brunel game currently (as of in-game time May 2004) has 9,860 stops. By contrast, the Stephenson-Seimens server game has 1,021 stops and performs without any problems at all even on a less powerful server.

In the spring of 2019, the Bridgewater-Brunel game was still playable. This was the 1950s/1960s in game time. I have a saved game from May 2019. It is not particularly easy to obtain the number of stops in-game (perhaps this should be changed one day), but using a debugger, the number is revealed as 6,945. This suggests that we should look roughly at a limit of 7,000 stops (perhaps 7,500) as the maximum extent of what can be coped with computationally.

So, how to set up a map such that it does not exceed 7,000 - 7,500 stops in a developed late game where 'buses are easy to set up and generally profitable (a cursory inspection of the saved games suggests that a proliferation of local road transport is responsible for the increase)? One approach would be simple scaling using town numbers. So, if we know that a map with 768 towns produces 9,860 stops in 2004, we can, very crudely, extrapolate downwards: 7,000 / 9.680 = 0.723; 0.723 * 768 gives 555 towns to the nearest integer. So, a very basic analysis suggests that a long-term server game should be limited to circa 550 towns.

However, we should perhaps be a little more cautious than this, since, probably owing to the difficulties in joining, some towns do not have as well developed transport network as they might have. Low Emingpole, for example, is very sparesely served, as is Heten. Harkly and Dartby have no local transport at all: only one railway station each. Thus, something in the region of perhaps 400-450 towns is perhaps more sensible.

I have recently fixed a bug in which the minimum spacing distance between towns was ignored. This had been present for some time. Having the minimum spacing working properly should at least allow a smaller number of towns to be better spread across a map. Reducing the map size is not ideal, as it suppresses the opportunities for long distance transport; aircraft in particular are limited by too small a map, and railways and shipping, too, albeit to a somewhat lesser extent.

Hopefully, this heuristic approach to finding a playable limit to size in terms of number of towns will pay off, but I suspect that, once we start a new game on the Bridgewater-Brunel server, it will be at least a year and possibly more before we find out whether this calculation is correct.

In the meantime, the rule of thumb would seem to be that 400-500 towns is probably the most that should be generated for an online game expected to continue to the modern era and result in a well developed map with plentiful local passenger transport.

Offline freddyhayward au

  • *
  • Posts: 95
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #11 on: January 15, 2020, 03:00:39 AM »
What if a small optional 'stop tax' (say 10.00c per stop, per month, separate from maintainence) were introduced via settings to encourage players to build fewer stops, particularly within local transport networks. A player building a 10km bus line might then choose to build 10 stops, 1 per km, rather than 20-30. Even better, the money might eventually pay off the public service debt!

Additionally, players could request or offer to merge their stops with each other via some asynchronous mechanism rather than deleting their stop/station and waiting for the other player to rebuild it.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2791
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #12 on: January 15, 2020, 05:44:41 AM »
What if a small optional 'stop tax' (say 10.00c per stop, per month, separate from maintainence) were introduced via settings to encourage players to build fewer stops, particularly within local transport networks. A player building a 10km bus line might then choose to build 10 stops, 1 per km, rather than 20-30. Even better, the money might eventually pay off the public service debt!
People generally are not building them more than 1 per km due to the distance between stops being too small for decent average speed. If one spaces them too far out then their operation becomes unrealistic.
which ultimately points to "deflate", which consumes 90.73% of time. Loading, by contrast, takes a fraction of the time taken by saving.
You can try tuning the buffer size. This is tricky as ideally the buffer should remain in the lowest level cache for optimal performance. One also wants it as large as possible to minimize call count without evicting regularly used data from the cache. There might be a few percent of performance to gain by doing, assuming the value is not already highly optimal. Since this value is technically CPU specific if there are gains to be had it would need to be broken out to a setting of sorts which server owners could customize for their processor. If the difference is not more than a few seconds for such a large complex save then it is not worth investigating further.
In the meantime, the rule of thumb would seem to be that 400-500 towns is probably the most that should be generated for an online game expected to continue to the modern era and result in a well developed map with plentiful local passenger transport.
I would suggest 400. Better to be on the side of caution as far as accessibility and performance goes as more people regularly playing is better than a few because it is starting to reach performance thresholds that turn them away. If it performs well enough to the end then one could consider raising it for future games.

Reduced city count will reduce the city density if the map is kept the same size. This may or may not be a good thing as far as gameplay goes.

Offline Freahk

  • *
  • Posts: 422
  • Languages: DE, EN
Re: Server performance: preliminary findings
« Reply #13 on: January 15, 2020, 11:02:52 AM »
People generally are not building them more than 1 per km due to the distance between stops being too small for decent average speed.
I don't know about bridgewater-brunel but in the stephenson-siemens game there are quite dense bus networks in some cities with stop spacings around 625m which, in combination with express lines competed other players bus services within cities quite well and increased the transported/passenger stat ratio in the townhalls quite a bit.

However, number of stops will scale by square with grid density, so I gues the point why people on bridgewater-brunel did not build such dense networks is simply the required time setting this up can better be spent interlinking cities to each other.

On stephenson-siemens, there are only 37 towns, so there is not much possibility in further improving inter-city links except for setting up immediately competing links in between cities.
« Last Edit: January 15, 2020, 12:36:13 PM by Freahk »

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #14 on: January 16, 2020, 02:30:26 AM »
Freddy - thank you for the thought. However, I do not believe that altering the economic balance is the right approach to managing performance issues, as this conflicts with a fundamental design principle of Simutrans-Extended, viz. that the economics should be as realistic as possible. The better approach is simply to size the map based on what a computer can cope with.

Dr. Supergood - thank you for the suggestion regarding the buffer size. May I ask where in zlib's API this can be modified?

One important factor to consider in relation to computational intensity is not merely the number of towns, but their size. On the Bridgewater-Brunel server, the towns all grew to be rather large after a long time. I believe that it is likely that this is due to the system for town growth calibration used in Simutrans (both Standard and Extended), in which town growth rates are divided into three categories, "capitals", "cities" and "villages", the larger categories having significantly more growth than the smaller categories. These categories are based on absolute sizes, so, in a long game, many towns will pass into the higher categories eventually, and thus thereafter grow at a higher rate.

I have just pushed a modification to this system (which is optional and can be disabled in simuconf.tab) to use a relative system rather than an absolute system. In this system, the three categories of Standard are retained, but their boundaries are determined by the town's rank in population size. The top ranked town will always be a "capital" (this is hard-coded to avoid rounding errors meaning that no town counts as a "capital" in games with few towns). Simuconf.tab settings can then set the percentage rank above which a town will be a "city" and a "capital", with values of 20% and 2% respectively set in the default simuconf.tab (the actual defaults if no values be specified are 0, which reverts to the old behaviour using the absolute sizes).

This should help to ensure a realistic distribution of town sizes throughout a long game, and should also (hopefully) mean that, in a later stage game, there will be a large disparity between the largest and smallest towns, allowing for interesting metropolitan transport opportunities in the larger towns. I have also increased the disparity between the growth rates of the smallest and largest towns with the intention of having a small number of huge metropolitan areas, a modest number of sizeable towns, and a large number of villages or small towns in the later game.

If we have 400 towns, at 2% and 20% respectively, we will have:
8 "capitals";
80 "cities"; and
312 "villages".

This should help our town size distribution to retain an approximately Zipfian form.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2791
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #15 on: January 16, 2020, 04:12:02 AM »
Dr. Supergood - thank you for the suggestion regarding the buffer size. May I ask where in zlib's API this can be modified?
I was referring to the buffer used to hold the saved data that is flushed by compression. Some amount of data should be getting written to a recycled buffer which when full gets handed to the compression library to compress effectively emptying it. The larger this buffer, the fewer compression calls are made but also the more likely it is to evict data from CPU caches.

Offline Phystam jp

  • *
  • Posts: 312
  • Pak256.Ex developer
    • Pak256 wiki page
  • Languages: JP, EN, EO
Re: Server performance: preliminary findings
« Reply #16 on: January 16, 2020, 05:03:02 AM »
If we can use zstd which is very fast and effective compression library without any dependencies, it would be better, as freahk proposed. And more, they provide also pzstd — parallelized zstd. It is much effective.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9633
  • Languages: De,EN,JP
Re: Server performance: preliminary findings
« Reply #17 on: January 16, 2020, 05:45:47 AM »
I think the profiling (especially when done single_threaded) is overestimating the compression. The compression algorithm must be fed data, and that data come from many rdwr calls to different things (which are then less noted). However, if the data is always similar (which I assume for the vast amount) then a Huffman coding with precalculated tables can be extremely fast without consuming any extra memory. (https://en.wikipedia.org/wiki/Huffman_coding)

To caluclate that static table of probabilities of 16 bit words (the binary tree), one would just take an uncompressed very large savegame and determine the probabililities for each 16 word. I assume 0, 1, 2 or so will be more common. Then one will use this table as encoding for all games. Compression is probably slightly worse than zlib, but speed is as fast as it can get (because of simple lookup). Actually decompression may be slower than compression.

To elaborate on DrSuperGood, Simutrans compresses already in the background, i.e. one buffer is written to while the already filled buffer is compressed. Increasing the size means a longer delay before starting to compress, but then longer continutation at max writing speed. But many servers (usually virtual ones) do not profit from that at all, since the only offer a single one. Only on the next tier four core are standard and thus here it would help.

EDIT: There is a very nice compression benchmark table at https://quixdb.github.io/squash-benchmark/#results suggestion snappy is almost 7 times faster than zlib. https://google.github.io/snappy/ resp. https://github.com/google/snappy That sound like very worthwhile implementing for Simutrans servers and also not so diffcult to actually implement (since the actual writing is up to the application).
« Last Edit: January 16, 2020, 07:01:41 AM by prissi »

Offline Matthew gb

  • *
  • Posts: 260
    • Japan Railway Journal
  • Languages: EN, some ZH, DE & SQ
Re: Server performance: preliminary findings
« Reply #18 on: January 16, 2020, 08:20:31 AM »
I have just pushed a modification to this system (which is optional and can be disabled in simuconf.tab) to use a relative system rather than an absolute system. In this system, the three categories of Standard are retained, but their boundaries are determined by the town's rank in population size. The top ranked town will always be a "capital" (this is hard-coded to avoid rounding errors meaning that no town counts as a "capital" in games with few towns). Simuconf.tab settings can then set the percentage rank above which a town will be a "city" and a "capital", with values of 20% and 2% respectively set in the default simuconf.tab (the actual defaults if no values be specified are 0, which reverts to the old behaviour using the absolute sizes).

...

This should help our town size distribution to retain an approximately Zipfian form.

James, thank you for improving the town growth system and taking the time to make it optional too. I don't know whether it will solve the Bridgewater-Brunel problem, but it should certainly improve the realism of the economic simulation.  :thumbsup:

Offline Vladki cz

  • Devotee
  • *
  • Posts: 2798
    • My addons, mostly roadsigns, pak128.cs
  • Languages: EN, CS
Re: Server performance: preliminary findings
« Reply #19 on: January 16, 2020, 04:03:44 PM »
I don't know about bridgewater-brunel but in the stephenson-siemens game there are quite dense bus networks in some cities with stop spacings around 625m which, in combination with express lines competed other players bus services within cities quite well and increased the transported/passenger stat ratio in the townhalls quite a bit.
Also note the real-world density of stops. I have measured the path I commute daily, and distances between stops are from 150 m to almost 1 km. Average is somewhere at 400-500 m. And as we aim for reality, we should expect that inner city transport may have stops at similar density.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2791
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #20 on: January 17, 2020, 12:05:39 AM »
Also note the real-world density of stops. I have measured the path I commute daily, and distances between stops are from 150 m to almost 1 km. Average is somewhere at 400-500 m. And as we aim for reality, we should expect that inner city transport may have stops at similar density.
Very problematic seeing how tiles are 125m. A 150m stop interval would cause multiple stops to merge together.

This is a situation when one wants it to look realistic rather than use realistic distance.
To elaborate on DrSuperGood, Simutrans compresses already in the background, i.e. one buffer is written to while the already filled buffer is compressed. Increasing the size means a longer delay before starting to compress, but then longer continutation at max writing speed. But many servers (usually virtual ones) do not profit from that at all, since the only offer a single one. Only on the next tier four core are standard and thus here it would help.
I was referring to buffer tuning which is the process of optimizing by finding a buffer size which provides the highest throughput or best trade-off. In the case of a multi thread implementation where 1 thread populates data and the other compresses and writes it then the optimum buffer size should be some fraction of the highest level cache size since that way the data can be passed between the threads without being written to and read from memory which is significantly slower than the cache. If this performance actually matters is another question entirely...

I was not aware that there was multi threading for this task. If there is one might be able to benchmark the compression performance by timing how long the compression function calls take by measuring the sum of time between the start of the call and the call returning. This is likely more accurate to real times than statistical, instrument or debug related profiling.

If people want I can try looking into changing the compression used.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #21 on: January 17, 2020, 12:53:54 AM »
Thank you all for your responses: this is most helpful. I should note for reference that the Bridgewater-Brunel server, while a virtual server, has access to six dedicated cores, and thus will benefit from multi-threading the compression algorithm.

Dr. Supergood - it would be extremely helpful if you could look into changing the compression used; improving the speed of saving games by optimising the compression could make a very big difference to the playability of online games when they get large (and even large single player games). On the face of it, Prissi's suggestion of Snappy looks promising, but I will leave it to your discretion what to look into.

As to stop distances, we do want to have realism in so far as possible, but stops actually being immediately next to each other cannot work, so the game engine and scale limits us a little in terms of the density of stops, but, within that limitation, the passenger generation algorithm is designed to give as realistic a possible consequence to stops being too sparse (i.e. longer walking times, therefore longer journey times and thus lower ridership).

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1334
Re: Server performance: preliminary findings
« Reply #22 on: January 17, 2020, 01:42:18 AM »
Speed of compression can vary drastically depending on the data being compressed. i.e. for best results the compression algorithm can be matched with the data. Feed it 'bad' data, and it chokes.

The gzip library is currently being used at default settings, the compression level could be lowed (defaults to 7/9 iirc), and the strategy parameter adjusted. One of the strategies is plain Huffman as Prissi mentioned, there's also a filtered strategy which is a hybrid of Huffman and the default.

I also see that atleast for the savegame from the Stevens server that saving it with Bzip2 isn't much slower than zip. Possibly the path explorer data that Extended saves is 'liked' by bzip2 so it flies through it even being set to maximum compression. Again lowering this setting, and possibly adjusting the 'try' parameter for when it fallsback on its strategies might be worth a try and beat gzip.  This could mean splitting the save in two would be warranted to save the part of the file similar the Standard that bzip is an order of magnitude slower on using gzip, and the rest with bzip.

Can you make the bridgewater save available somewhere?  (The server refuses to serve it up...)

Offline Freahk

  • *
  • Posts: 422
  • Languages: DE, EN
Re: Server performance: preliminary findings
« Reply #23 on: January 17, 2020, 01:50:03 AM »
Before implementing one of these, we should maybe take a huge savegame as bridgewater, save it uncompressed and compress it using the commandline tools. That will give us a rough guess of how well the algorithm behind the tool will work for this kind of data.

Sadly, I am not able to connect to bridgewater, thus I can't fetch the whole save from the server.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9633
  • Languages: De,EN,JP
Re: Server performance: preliminary findings
« Reply #24 on: January 17, 2020, 03:25:48 AM »
Since it is relatively straight forward, I will play around with snappy in standard too.

EDIT: Some tests with the largest developed savegames from standard (not jsut large empty maps):
$ ./build/snappy_unittest.exe  testdata/a3-88-10-05.sve testdata/yoshi87-102-2-2.sve
testdata/a3-88-10-05.sve                 :
ZLIB:   [b 1M] bytes 39740055 -> 9874538 24.8%  comp  19.8 MB/s  uncomp 323.9 MB/s
SNAPPY: [b 4M] bytes 39740055 -> 15889790 40.0%  comp 413.4 MB/s  uncomp 644.5 MB/s
testdata/yoshi87-102-2-2.sve             :
ZLIB:   [b 1M] bytes 21014243 -> 4751418 22.6%  comp  24.4 MB/s  uncomp 323.7 MB/s
SNAPPY: [b 4M] bytes 21014243 -> 7539681 35.9%  comp 444.2 MB/s  uncomp 761.2 MB/s

So while Zlib delivers 11 MB/s Snappy gives 413 MB/s, but size is 1.5-2 as big. However, even 19MB/s is more than 100 Mbit/s and faster than most clients can recieve over longer distances. (This is a 32bit built with -O3 and no assertions though!)

But then is is a i7-7500U. I will run the same tonight on my Celeron laptop to get something closer to many virtual servers ...
« Last Edit: January 17, 2020, 07:28:18 AM by prissi »

Offline Vladki cz

  • Devotee
  • *
  • Posts: 2798
    • My addons, mostly roadsigns, pak128.cs
  • Languages: EN, CS
Re: Server performance: preliminary findings
« Reply #25 on: January 17, 2020, 08:06:38 AM »
Regarding the stop density. I know that 150 m is too close for the game. That is indeed extreme even in real world. What is more important is the average value <500m which is 4 tiles, and is still quite dense for the game.

But looking at the minimap, in Stephenson Siemens game, you can see that while commuting is green everywhere, visiting is not, even in well served cities. Wonder why mail map is red, even if mail network coverage is quiet large.

Offline Phystam jp

  • *
  • Posts: 312
  • Pak256.Ex developer
    • Pak256 wiki page
  • Languages: JP, EN, EO
Re: Server performance: preliminary findings
« Reply #26 on: January 17, 2020, 08:19:45 AM »
I tried comparing gzip, bzip2 and zstd.
I used current Stephenson-siemens server save data without compression, which has 96 MB.

machine power: i7-7700HQ CPU @ 2.80GHz, using M.2 SSD

BZIP2: achieved ~22MB
Code: [Select]
//compression
$ time bzip2 binary.sve

real    0m12.544s
user    0m12.016s
sys     0m0.438s

//decompression
$ time bzip2 -d binary.sve.bz2

real    0m5.882s
user    0m5.047s
sys     0m0.766s

GZIP: achieved ~28MB
Code: [Select]
//compression
$ time gzip binary.sve

real    0m6.170s
user    0m6.094s
sys     0m0.078s
//decompression
$ time gunzip binary.sve.gz

real    0m0.739s
user    0m0.656s
sys     0m0.094s

ZSTD -3: achieved ~26MB
Code: [Select]
//compression
$ time zstd -3 binary.sve
binary.sve           : 26.66%   (98183317 => 26175069 bytes, binary.sve.zst)

real    0m1.042s
user    0m0.891s
sys     0m0.125s
//decompression
$ time zstd -d binary.sve.zst
binary.sve.zst      : 98183317 bytes

real    0m0.445s
user    0m0.281s
sys     0m0.156s

PZSTD -3 with 2 processes:
Code: [Select]
$ time pzstd -3 -p 2 binary.sve
//compression
binary.sve           : 26.70%   (98183317 => 26213185 bytes, binary.sve.zst)

real    0m0.474s
user    0m0.844s
sys     0m0.141s
//decompression
$ time pzstd -d -p 2 binary.sve.zst
binary.sve.zst      : 98183317 bytes

real    0m0.170s
user    0m0.297s
sys     0m0.094s
PZSTD -3 with 4 processes:
Code: [Select]
//compression
$ time pzstd -3 -p 4 binary.sve
binary.sve           : 26.70%   (98183317 => 26213185 bytes, binary.sve.zst)

real    0m0.398s
user    0m1.344s
sys     0m0.109s
//decompression
$ time pzstd -d -p 4 binary.sve.zst
binary.sve.zst      : 98183317 bytes

real    0m0.145s
user    0m0.297s
sys     0m0.188s

From these result, pzstd with 2 process has very high efficiency. Compared to bzip2, the compression time was only 3%.
« Last Edit: January 17, 2020, 08:37:49 AM by Phystam »

Offline Qayyum

  • *
  • Posts: 148
Re: Server performance: preliminary findings
« Reply #27 on: January 17, 2020, 08:37:01 AM »
I pick ZSTD -3, Time too short leads to issues regarding memory.

Offline Phystam jp

  • *
  • Posts: 312
  • Pak256.Ex developer
    • Pak256 wiki page
  • Languages: JP, EN, EO
Re: Server performance: preliminary findings
« Reply #28 on: January 17, 2020, 12:41:39 PM »
Furthermore, I checked the lz4 compression.
Code: [Select]
//compression
$ time lz4 binary.sve
Compressed filename will be : binary.sve.lz4
Compressed 98183317 bytes into 44430881 bytes ==> 45.25%

real    0m0.345s
user    0m0.250s
sys     0m0.094s
//decompression
$ time lz4 -d binary.sve.lz4
Decoding file binary.sve
Successfully decoded 98183317 bytes

real    0m0.276s
user    0m0.141s
sys     0m0.125s
It is very fast, but the compression rate is not so good.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9633
  • Languages: De,EN,JP
Re: Server performance: preliminary findings
« Reply #29 on: January 17, 2020, 03:09:55 PM »
I did some measurement (in standard under windows). Binary (no compression) delivers data (on a 2.9 i7-7500 CPU) at about 34.6 MB/s, with zlib this is then 28.9 MB/s (and level 1 "wb1" 31.4 MB/s) and with bzip2 it is down to 11.9 MB/s.

If I use a the largest heavily populated map I have (uncompressed 377 MB 3074x3074 tiles), then the results are: binary 23.6 MB/s, zlib 14.9 MB/s default (and 22 MB/s "wb1")) and bzip2 10.9 MB/s

I think the conclusion is that it is not worthwhile to look for a compression algorithm which can crunsh (on my machine) more than zlib level 1 "wb1". Furthermore, the https://quixdb.github.io/squash-benchmark/ suggest that in the region of zlib level 1 the only better one is zstd. However, comparing the effort of chaning one string to the hassle of adding a new library to crosscompiling, I think for standard I will just change the method of zlib writing. This even keeps the savegames compatible ...

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #30 on: January 17, 2020, 03:54:07 PM »
Thank you all for your extensive investigations into this. Would it be of assistance if I were to make available the current Bridgewater-Brunel saved game for testing to see whether Prissi's results in Standard also hold for the extremely large amount of pathing data in that saved game, of a type which is not saved in Standard?

Offline Freahk

  • *
  • Posts: 422
  • Languages: DE, EN
Re: Server performance: preliminary findings
« Reply #31 on: January 17, 2020, 05:07:25 PM »
If I had the save I could throw a few compresion algorithms at it and see what happens.
It's up to you deciding if that would be of assistance to you or not.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Moderator
  • *
  • Posts: 19078
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #32 on: January 17, 2020, 05:47:42 PM »
If I had the save I could throw a few compresion algorithms at it and see what happens.
It's up to you deciding if that would be of assistance to you or not.

That would indeed be helpful - here is the link to the current saved game.

Offline Phystam jp

  • *
  • Posts: 312
  • Pak256.Ex developer
    • Pak256 wiki page
  • Languages: JP, EN, EO
Re: Server performance: preliminary findings
« Reply #33 on: January 17, 2020, 09:08:04 PM »
I tested with the file --- the decompressed size is ~4GB.

machine: Ryzen 7 3700X, M.2 SSD

Result:


This result shows that we can compress such a big save data only within 10 seconds using multi-process!
Generally speaking, option -1 to -4 provides almost the same results, and their compressed sizes are almost the same as gzip.
I think that the option -3 or -4 with 4 cores has the most effective.

Offline DrSuperGood

  • Dev Team
  • Devotee
  • *
  • Posts: 2791
  • Languages: EN
Re: Server performance: preliminary findings
« Reply #34 on: January 17, 2020, 09:57:37 PM »
I pick ZSTD -3, Time too short leads to issues regarding memory.
Issues regarding memory? Please elaborate.
So while Zlib delivers 11 MB/s Snappy gives 413 MB/s, but size is 1.5-2 as big. However, even 19MB/s is more than 100 Mbit/s and faster than most clients can recieve over longer distances. (This is a 32bit built with -O3 and no assertions though!)
Too bad that at the moment the entire save must complete first before the server starts to transfer to clients. If it started transferring the instant data started to be compressed this topic would likely have not been started lol.