The International Simutrans Forum

Simutrans Extended => Simutrans-Extended development => Topic started by: jamespetts on December 23, 2019, 11:51:35 PM

Title: Server performance: preliminary findings
Post by: jamespetts on December 23, 2019, 11:51:35 PM
I have not had time to do a great deal of investigation into the server performance issues in the lead-up to Christmas, and I am now staying with my parents for some weeks so will not be able to do much else until sometime into the new year, but I have been able to undertake a preliminary analysis of some of the issues relating to the performance of the server by running a performance profiler on my own local machine.

First of all, by far the greatest issue appears to be the amount of time spent saving a game. On the server, this has to happen whenever a new player joins, or once every hour, whichever is less frequent at any given time. Even on my newly built Ryzen 3900X, saving the server game takes a seriously long time. I have not timed it exactly, but it is many minutes. The actual server, which will almost certainly be slower than my new Ryzen at home, will take even longer.

The performance profiler shows that virtually all of the saving time is spent compressing the file: only a few seconds are spent actually writing the data, and then many, many minutes are spent compressing it. The compression uses the zlib algorithm. The bzip2 algorithm is an alternative option, but this is even slower, as it has a higher compression ratio, and was dropped from server games even in Standard some years ago when it was realised that any transmission time saved by having a smaller file was more than obliterated by the compression time.

The zlib algorithm is based on an open source library and, at least as implemented in Simutrans at present, is single-threaded. Since multi-core machines are now commonplace, and the server has access to 6 cores, a multi-threaded alternative to this, if one exists, could very significantly increase performance on the server in larger games.

That the server is spending almost all of its time saving the game would account for the experience reported by players, namely, taking an extremely long time to join and the server being unavailable (because others are trying to join, requiring saving of the game) most of the time.

If anyone knows of any suitable multi-threaded compression libraries with a compatible licence (Simutrans uses the Artistic Licence, so, sadly, the GPL will not be compatible unless a pre-compiled library be used; but the LGPL should be fine), or (even better) just a multi-threaded implementation of the current compression algorithm, that would be very helpful.

As to the running game, most of the time is still spent with passenger generation (as discussed previously) and finding a suitable public transport route for passengers to get from any given origin to any given destination. I did make a few small optimising changes a week or so ago once I was able to use Visual Studio 2019 on my new computer and take advantage of its line by line CPU utilisation data, but these will be fairly minor in effect.

However, increasing the number of threads will assist with this, as this is a fully threaded workload, and can scale proportionally to the number of available CPU cores. Since the server now has access to 6 cores, I have increased the number of threads on the server to 6.

Increasing the number of threads to 12 on my local machine (a new Ryzen with 12 cores) resulted in a somewhat uneven performance improvement; the game would perform somewhat jerkily, slowing down for a fraction of a second every 1-2 seconds. This suggests that the single threaded workloads are of some importance to the performance.

On the current server game, whose in-game year is at present 2004, the greatest single-threaded workloads are the sync_step operations for private cars and pedestrians (especially the former), with hop_check for private cars being the greatest single consumer of resources. Neither private cars nor pedestrians do any actual route finding, so it is just some fairly basic checks, done an extremely large number of times, that are taking all the CPU time.

I wonder, therefore, whether any of this can be multi-threaded. Since pedestrians do not affect anything else directly, in theory those could be multi-threaded concurrently quite easily; but there could possibly be situations where the number of objects on a tile, including pedestrians, exceeds the built-in limit, in which case they would affect other game state and might cause a loss of synchronisation if multi-threaded. I am not sure whether there is a robust way around this at present.
As to private cars, since these do affect other things in the game (and each other), these cannot be fully multi-threaded concurrently, but I wonder whether some of the more intensive parts of their algorithms might be multi-threaded in parallel in much the same way as the player vehicle route finding is done. I have not looked into this in detail, however.

Additionally, there is the matter of memory consumption. Memory consumption on the server is again at ~95%, suggesting that this is causing a problem for the server which may well impair performance. The only remedy for this is to have a smaller map (smaller, that is, not in land area, but in number of cities). There are 768 on the current server game; the next one might have perhaps circa 450. By way of comparison, the current Stephenson-Siemens server has 37 towns. I have not tested this, but I suspect that the large number of private cars is largely responsible for the high memory consumption.

Finally, there is the matter of the number of sync_steps per step. I had experimentally increased this whilst also increasing the framerate, to attempt to get a smoother play experience. However, unfortunately, a combination of the problems with accessing the server caused by excessive saving time, together with a pakset compiling problem at home (which caused me to lose synchronisation the only time that I was able to get a spare joining slot), has meant that I have been unable to test the efficacy of these changes. It would be useful to have detailed feedback from anyone who has managed to log in, however.

Edit: There are multi-threaded, open source implementations of bzip2 (lbzip2 and pbzip2) which claim to scale linearly with the number of threads; but, although I have found the source code, neither appears to provide an equivalent of bzip2's library function calls, and there is no obvious way of substituting either of these for the bzip2 algorithm in loadsave.cc.

If anyone knows anything more about this than I do, any assistance in this regard would be much appreciated.
Title: Re: Server performance: preliminary findings
Post by: TurfIt on December 24, 2019, 01:37:27 AM
Quote from: jamespetts on December 23, 2019, 11:51:35 PM
If anyone knows of any suitable multi-threaded compression libraries with a compatible licence (Simutrans uses the Artistic Licence, so, sadly, the GPL will not be compatible unless a pre-compiled library be used; but the LGPL should be fine), or (even better) just a multi-threaded implementation of the current compression algorithm, that would be very helpful.

Adapting "pigz" would likely be the best bet. When I implemented the current 2 threaded save/load routines I'd investigated expanding them to handle more threading, but decided it was not worth the effort. When compressing with zlib, you can run 2 compression threads per serialization thread. i.e. compression is half the speed of serializing. To run more than one serialization requires a significant rewrite of the save format and routines.  And don't forget the loading time... things are even worse there with object allocation taking more than 5 times the decompression time. Parallel object allocation is another massive effort. Best idea - don't save so much data!
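The double-buffer arrangement described here (one thread serialising into a buffer while another compresses the previously filled one) can be sketched in outline as below. This is only a toy stdlib model of the idea, not the actual loadsave implementation; the "compression" stage is a stand-in that just counts bytes:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Toy two-slot pipeline: the producer ("serialiser") fills buffers in
// alternation while the consumer ("compressor") drains them in the same
// order. Illustrative only; the real save routines are organised differently.
class double_buffer_pipe {
    std::vector<char> buf[2];
    bool full[2] = {false, false};
    bool done = false;
    std::mutex m;
    std::condition_variable cv;
public:
    std::size_t consumed = 0;   // total bytes "compressed"

    void produce(const std::vector<char> &chunk, int slot) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !full[slot]; });   // wait for slot to drain
        buf[slot] = chunk;
        full[slot] = true;
        cv.notify_all();
    }
    void finish() {
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_all();
    }
    void consume_loop() {
        int slot = 0;
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return full[slot] || done; });
            if (!full[slot]) return;            // done, nothing pending
            std::vector<char> work = std::move(buf[slot]);
            full[slot] = false;
            cv.notify_all();
            lk.unlock();
            consumed += work.size();            // stand-in for compress+write
            slot ^= 1;                          // consume slots in order
        }
    }
};

std::size_t run_pipeline(int chunks, std::size_t chunk_size) {
    double_buffer_pipe p;
    std::thread consumer(&double_buffer_pipe::consume_loop, &p);
    for (int i = 0; i < chunks; ++i) {
        p.produce(std::vector<char>(chunk_size, 'x'), i % 2);
    }
    p.finish();
    consumer.join();
    return p.consumed;
}
```

As noted above, the catch is that only the consumer side parallelises easily: running more than one serialisation thread would require rewriting the save format itself.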
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on December 24, 2019, 02:29:38 AM
Currently the main performance issue is the lack of free memory. Compression is slow, but compression is considerably slower when it has to deal with page faults during the process. Under these conditions, saving takes so long that clients time out when trying to join.
Title: Re: Server performance: preliminary findings
Post by: Mariculous on December 24, 2019, 11:58:49 AM
As mentioned, we can have a look at pigz for a multithreaded implementation of gzip.

Alternatively we can have a look at lzop/LZO or LZ4; both are much faster than zlib. Compression won't be as good, but I guess still sufficient.
Sadly, LZO is distributed under the GPL, but it is preinstalled, or at least available in the system repositories, on any commonly used Linux distro, and for Windows there are precompiled binaries available.
It is pretty likely to outperform even pigz in compression and decompression time up to at least 4 cores, though at a slightly worse compression ratio.

Another relatively new development in compression is Zstandard (zstd). While slightly slower than LZO when using only one thread, it has multithreading support, and in that case will outperform all the libraries mentioned here, at least in compression time at low compression levels, while the compression ratio will be roughly the same as for zlib.
It uses the BSD licence. I don't know to what extent that licence is compatible with the Artistic Licence.

So much for compression when enough memory is available.
A lack of free memory, and the resulting page faults, will certainly reduce I/O throughput greatly and thus slow down any compression algorithm.
Zlib is most likely the one consuming the smallest additional amount of memory while compressing and decompressing.
Lzop and zstd single-threaded will consume a little more memory, but still not too much.
Pigz and LZ4 will consume relatively large amounts of additional memory compared to gzip, but that is still pretty small compared to the overall amount of data to be compressed.
Zstd seems to consume quite a lot more memory as the number of threads increases. Thus, we have to use multithreading carefully here, but when using only 2 or 4 cores it should be fine.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on December 24, 2019, 12:20:31 PM
I have just looked at pigz: this appears to have the same problem, in that the source code does not provide a replacement library for zlib in which the same calls, such as gzopen, are replicated, and there is no documentation on how, if at all, it can be used inside existing code rather than as a stand-alone application.
Title: Re: Server performance: preliminary findings
Post by: Vladki on December 24, 2019, 03:15:03 PM
I would also recommend lzo or lz4. These are used for vpn and filesystem (zfs) compression, so speed is most important for them, and compression ratio still quite good.

Regarding pedestrians, as they are purely decorative, wouldn't it be good to switch them off completely for very large games? I also turn off water animation.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on December 24, 2019, 03:19:02 PM
Do lzo or lz4 have an API for implementing in existing codebases, do you know?
Title: Re: Server performance: preliminary findings
Post by: Mariculous on December 24, 2019, 04:21:15 PM
lzop/LZO has roughly the same relationship as gzip/zlib: zlib and LZO are libraries, so you can use/link these from existing codebases, whereas gzip and lzop are command-line tools using those libraries.

In the case of LZ4, both have the same name, but there is likewise a library and a command-line tool using that library.
QuoteThese are used for vpn and filesystem (zfs) compression, so speed is most important for them, and compression ratio still quite good.
zlib, lzo and zstd can all be used for compression of btrfs filesystems, so speed is most important for them!
Well, I guess you got the point.

Single-threaded LZO will be the fastest of these; however, zstd compresses roughly as well as zlib and will be faster than LZO for #cores >= 2, so we should prefer zstd, imho.
Here is the API: http://facebook.github.io/zstd/zstd_manual.html

Their BSD license seems to be fine.
https://github.com/facebook/zstd/blob/dev/LICENSE

QuoteRegarding pedestrians, as they are purely decorative, wouldn't it be good to switch them off completely for very large games? I also turn off water animation.
Yes, please!
Title: Re: Server performance: preliminary findings
Post by: CK on December 26, 2019, 12:24:49 PM
I was able to find this on license compatibility between the Zlib one and the Artistic License (https://opensource.stackexchange.com/questions/6812/are-the-artistic-licence-1-0-and-the-zlib-licence-compatible-with-one-another-fo). The main difference between the AL and BSD licenses appears to be the requirement to note changes within the file. Isn't it possible to put the compression library into its own DLL file?
Title: Re: Server performance: preliminary findings
Post by: jamespetts on December 26, 2019, 12:54:46 PM
Thank you both for your replies. We already use zlib, so I assume that they are as compatible as they need to be for our purposes. If the lzo licence is the same as the zlib licence, then this should be compatible to the same extent.

As to lzo, it is useful to find something that apparently has a library - I will have to investigate that further when I return home after the Christmas holidays. Thank you for looking into this.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 15, 2020, 01:16:22 AM
I have looked into this a little further: testing confirms that it is indeed the compression that is taking all the time. Saving in binary mode (i.e. uncompressed) takes a tiny fraction of the time taken by saving in the default zipped mode, but produces a file of ~4 GB instead of ~1 GB. Saving in bzip2 mode seems to take slightly longer again, but produces a smaller file.

The thing that takes all the effort when saving is the routing data for passengers, mail and goods: turning off saving these produces a very short save time even when zipped (comparable to not compressing at all), but the route refresh on loading takes an extreme amount of time.

Running a performance analysis shows that 91.36% of the time spent saving is spent on the line


lsp->loadsave_routine->flush_buffer(buf);


which ultimately points to "deflate", which consumes 90.73% of time. Loading, by contrast, takes a fraction of the time taken by saving.

In some senses, looking into multi-threaded compression might make this process somewhat faster; but if rebuilding the paths takes tens of real-time minutes in any event, then the game is likely to be excessively slow to update paths even if it is fast to access, so the ultimate issue here is simply that the game is beyond the size that modern computers can handle. Thus, faster compression may well be worthwhile, but it is probably not a priority. What is a priority is working out how to size the next Bridgewater-Brunel map to prevent this from recurring without limiting gameplay excessively.

The more complex question is how to measure "size". It is not the size of the map, of course, as it is not the empty land tiles that take all the effort. What takes all the time in loading/saving (and what also takes by far the most time during running) is dealing with the pathing data for passengers, mail and goods. These data scale roughly quadratically with the number of stops (each stop may need routing data for every other reachable stop), assuming that there are no large networks completely isolated from one another, which it is reasonable to assume will always be the case in a developed game. Thus, we need to think of "size", for the purpose of computational intensity, in terms of stops.

The Bridgewater-Brunel game currently (as of in-game time May 2004) has 9,860 stops. By contrast, the Stephenson-Siemens server game has 1,021 stops and performs without any problems at all, even on a less powerful server.

In the spring of 2019, the Bridgewater-Brunel game was still playable. This was the 1950s/1960s in game time. I have a saved game from May 2019. It is not particularly easy to obtain the number of stops in-game (perhaps this should be changed one day), but using a debugger, the number is revealed as 6,945. This suggests that we should look roughly at a limit of 7,000 stops (perhaps 7,500) as the maximum extent of what can be coped with computationally.

So, how to set up a map such that it does not exceed 7,000 - 7,500 stops in a developed late game where 'buses are easy to set up and generally profitable (a cursory inspection of the saved games suggests that a proliferation of local road transport is responsible for the increase)? One approach would be simple scaling using town numbers. So, if we know that a map with 768 towns produces 9,860 stops in 2004, we can, very crudely, extrapolate downwards: 7,000 / 9,860 = 0.710; 0.710 * 768 gives 545 towns to the nearest integer. So, a very basic analysis suggests that a long-term server game should be limited to circa 550 towns.
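This crude extrapolation can be expressed mechanically (a sketch only; the constants are the figures quoted in this thread, and the function name is illustrative):

```cpp
#include <cmath>

// Scale the current town count by the ratio of the target stop count to
// the observed stop count, on the crude assumption that stops grow in
// proportion to towns.
int scaled_town_count(int target_stops, int observed_stops, int current_towns)
{
    return static_cast<int>(std::lround(
        static_cast<double>(target_stops) / observed_stops * current_towns));
}
```

With the figures above, 7,000 / 9,860 ≈ 0.710, giving roughly 545 towns, i.e. circa 550.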

However, we should perhaps be a little more cautious than this, since, probably owing to the difficulties in joining, some towns do not have as well-developed a transport network as they might have. Low Emingpole, for example, is very sparsely served, as is Heten. Harkly and Dartby have no local transport at all: only one railway station each. Thus, something in the region of perhaps 400-450 towns is perhaps more sensible.

I have recently fixed a bug in which the minimum spacing distance between towns was ignored. This had been present for some time. Having the minimum spacing working properly should at least allow a smaller number of towns to be better spread across a map. Reducing the map size is not ideal, as it suppresses the opportunities for long distance transport; aircraft in particular are limited by too small a map, and railways and shipping, too, albeit to a somewhat lesser extent.

Hopefully, this heuristic approach to finding a playable limit to size in terms of number of towns will pay off, but I suspect that, once we start a new game on the Bridgewater-Brunel server, it will be at least a year and possibly more before we find out whether this calculation is correct.

In the meantime, the rule of thumb would seem to be that 400-500 towns is probably the most that should be generated for an online game expected to continue to the modern era and result in a well developed map with plentiful local passenger transport.
Title: Re: Server performance: preliminary findings
Post by: freddyhayward on January 15, 2020, 03:00:39 AM
What if a small optional 'stop tax' (say 10.00c per stop, per month, separate from maintenance) were introduced via settings to encourage players to build fewer stops, particularly within local transport networks? A player building a 10 km bus line might then choose to build 10 stops, 1 per km, rather than 20-30. Even better, the money might eventually pay off the public service debt!

Additionally, players could request or offer to merge their stops with each other via some asynchronous mechanism rather than deleting their stop/station and waiting for the other player to rebuild it.
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 15, 2020, 05:44:41 AM
Quote from: freddyhayward on January 15, 2020, 03:00:39 AMWhat if a small optional 'stop tax' (say 10.00c per stop, per month, separate from maintenance) were introduced via settings to encourage players to build fewer stops, particularly within local transport networks? A player building a 10 km bus line might then choose to build 10 stops, 1 per km, rather than 20-30. Even better, the money might eventually pay off the public service debt!
People generally are not building more than 1 per km, due to the distance between stops otherwise being too small for a decent average speed. If one spaces them too far apart, then their operation becomes unrealistic.
Quote from: jamespetts on January 15, 2020, 01:16:22 AMwhich ultimately points to "deflate", which consumes 90.73% of time. Loading, by contrast, takes a fraction of the time taken by saving.
You can try tuning the buffer size. This is tricky, as ideally the buffer should remain in the lowest-level cache for optimal performance; one also wants it as large as possible to minimise the call count, without evicting regularly used data from the cache. There might be a few percent of performance to gain by doing so, assuming the value is not already highly optimal. Since this value is technically CPU-specific, if there are gains to be had it would need to be broken out into a setting of sorts, which server owners could customise for their processor. If the difference is not more than a few seconds for such a large, complex save, then it is not worth investigating further.
Quote from: jamespetts on January 15, 2020, 01:16:22 AMIn the meantime, the rule of thumb would seem to be that 400-500 towns is probably the most that should be generated for an online game expected to continue to the modern era and result in a well developed map with plentiful local passenger transport.
I would suggest 400. Better to err on the side of caution as far as accessibility and performance go: more people playing regularly is better than a few remaining once performance reaches thresholds that turn the others away. If it performs well enough to the end, then one could consider raising it for future games.

Reduced city count will reduce the city density if the map is kept the same size. This may or may not be a good thing as far as gameplay goes.
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 15, 2020, 11:02:52 AM
Quote from: DrSuperGood on January 15, 2020, 05:44:41 AMPeople generally are not building them more than 1 per km due to the distance between stops being too small for decent average speed.
I don't know about Bridgewater-Brunel, but in the Stephenson-Siemens game there are quite dense bus networks in some cities, with stop spacings of around 625 m, which, in combination with express lines, competed quite well with other players' bus services within cities and increased the transported/passenger ratio in the town halls quite a bit.

However, the number of stops scales quadratically with grid density, so I guess the reason people on Bridgewater-Brunel did not build such dense networks is simply that the time required to set them up can better be spent interlinking cities with each other.

On Stephenson-Siemens, there are only 37 towns, so there is not much scope for further improving inter-city links, except by setting up immediately competing links between cities.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 16, 2020, 02:30:26 AM
Freddy - thank you for the thought. However, I do not believe that altering the economic balance is the right approach to managing performance issues, as this conflicts with a fundamental design principle of Simutrans-Extended, viz. that the economics should be as realistic as possible. The better approach is simply to size the map based on what a computer can cope with.

Dr. Supergood - thank you for the suggestion regarding the buffer size. May I ask where in zlib's API this can be modified?

One important factor to consider in relation to computational intensity is not merely the number of towns, but their size. On the Bridgewater-Brunel server, the towns all grew to be rather large after a long time. I believe that it is likely that this is due to the system for town growth calibration used in Simutrans (both Standard and Extended), in which town growth rates are divided into three categories, "capitals", "cities" and "villages", the larger categories having significantly more growth than the smaller categories. These categories are based on absolute sizes, so, in a long game, many towns will pass into the higher categories eventually, and thus thereafter grow at a higher rate.

I have just pushed a modification to this system (which is optional and can be disabled in simuconf.tab) to use a relative system rather than an absolute system. In this system, the three categories of Standard are retained, but their boundaries are determined by the town's rank in population size. The top ranked town will always be a "capital" (this is hard-coded to avoid rounding errors meaning that no town counts as a "capital" in games with few towns). Simuconf.tab settings can then set the percentage rank above which a town will be a "city" and a "capital", with values of 20% and 2% respectively set in the default simuconf.tab (the actual defaults if no values be specified are 0, which reverts to the old behaviour using the absolute sizes).

This should help to ensure a realistic distribution of town sizes throughout a long game, and should also (hopefully) mean that, in a later stage game, there will be a large disparity between the largest and smallest towns, allowing for interesting metropolitan transport opportunities in the larger towns. I have also increased the disparity between the growth rates of the smallest and largest towns with the intention of having a small number of huge metropolitan areas, a modest number of sizeable towns, and a large number of villages or small towns in the later game.

If we have 400 towns, at 2% and 20% respectively, we will have:
8 "capitals";
80 "cities"; and
312 "villages".

This should help our town size distribution to retain an approximately Zipfian form.
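In outline, the rank-based classification described above might look like the following sketch (the percentages and rounding here are illustrative, not the exact simuconf.tab semantics, and the names are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>

enum class town_category { village, city, capital };

// Classify a town by its population rank: the top capital_pct% of towns
// are capitals (always at least one, to avoid rounding leaving none),
// the next city_pct% are cities, and the rest are villages.
town_category classify_by_rank(std::size_t rank /* 0 = largest town */,
                               std::size_t town_count,
                               int capital_pct = 2, int city_pct = 20)
{
    std::size_t capitals = std::max<std::size_t>(1, town_count * capital_pct / 100);
    std::size_t cities = town_count * city_pct / 100;
    if (rank < capitals) return town_category::capital;
    if (rank < capitals + cities) return town_category::city;
    return town_category::village;
}
```

With 400 towns and the defaults of 2% and 20%, this yields the 8 capitals, 80 cities and 312 villages listed above, and in a 10-town game the top-ranked town is still guaranteed to be a capital.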
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 16, 2020, 04:12:02 AM
Quote from: jamespetts on January 16, 2020, 02:30:26 AMDr. Supergood - thank you for the suggestion regarding the buffer size. May I ask where in zlib's API this can be modified?
I was referring to the buffer used to hold the saved data that is flushed by compression. Some amount of data should be getting written to a recycled buffer which when full gets handed to the compression library to compress effectively emptying it. The larger this buffer, the fewer compression calls are made but also the more likely it is to evict data from CPU caches.
Title: Re: Server performance: preliminary findings
Post by: Phystam on January 16, 2020, 05:03:02 AM
If we can use zstd, a very fast and effective compression library without any dependencies, that would be better, as Freahk proposed. Moreover, they also provide pzstd, a parallelised zstd, which is much more effective.
Title: Re: Server performance: preliminary findings
Post by: prissi on January 16, 2020, 05:45:47 AM
I think the profiling (especially when done single-threaded) is overestimating the compression. The compression algorithm must be fed data, and those data come from many rdwr calls to different things (which are then less visible in the profile). However, if the data are always similar (which I assume is true for the vast majority), then Huffman coding with precalculated tables can be extremely fast without consuming any extra memory. (https://en.wikipedia.org/wiki/Huffman_coding)

To calculate that static table of probabilities of 16-bit words (the binary tree), one would just take a very large uncompressed savegame and determine the probabilities of each 16-bit word. I assume 0, 1, 2 and so on will be the more common ones. Then one would use this table as the encoding for all games. Compression is probably slightly worse than zlib, but speed is as fast as it can get (because of the simple lookup). Actually, decompression may be slower than compression.
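The first step of this static-table idea, counting how often each 16-bit word occurs in an uncompressed savegame buffer, could be sketched as follows (illustrative names; tree construction from the frequency table is omitted, and little-endian word packing is assumed):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count occurrences of each 16-bit word in a raw buffer. The resulting
// frequency table is what a precalculated Huffman tree would be built from.
std::vector<std::size_t> count_word_frequencies(const std::vector<std::uint8_t> &raw)
{
    std::vector<std::size_t> freq(65536, 0);
    for (std::size_t i = 0; i + 1 < raw.size(); i += 2) {
        std::uint16_t word =
            static_cast<std::uint16_t>(raw[i] | (raw[i + 1] << 8));
        ++freq[word];
    }
    return freq;
}

// The most frequent word; for sparse savegame data one would expect 0.
std::uint16_t most_common_word(const std::vector<std::size_t> &freq)
{
    return static_cast<std::uint16_t>(
        std::max_element(freq.begin(), freq.end()) - freq.begin());
}
```

Running this once over a representative savegame would give the fixed probability table from which the static Huffman tree is built and then shared by all games.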

To elaborate on DrSuperGood's point: Simutrans already compresses in the background, i.e. one buffer is written to while the already-filled buffer is compressed. Increasing the buffer size means a longer delay before compression starts, but then a longer continuation at maximum writing speed. But many servers (usually virtual ones) do not profit from that at all, since they only offer a single core. Only on the next tier are four cores standard, and thus only there would it help.

EDIT: There is a very nice compression benchmark table at https://quixdb.github.io/squash-benchmark/#results suggesting snappy is almost 7 times faster than zlib: https://google.github.io/snappy/ resp. https://github.com/google/snappy That sounds very worthwhile to implement for Simutrans servers, and also not so difficult to actually implement (since the actual writing is up to the application).
Title: Re: Server performance: preliminary findings
Post by: Matthew on January 16, 2020, 08:20:31 AM
Quote from: jamespetts on January 16, 2020, 02:30:26 AMI have just pushed a modification to this system (which is optional and can be disabled in simuconf.tab) to use a relative system rather than an absolute system. In this system, the three categories of Standard are retained, but their boundaries are determined by the town's rank in population size. The top ranked town will always be a "capital" (this is hard-coded to avoid rounding errors meaning that no town counts as a "capital" in games with few towns). Simuconf.tab settings can then set the percentage rank above which a town will be a "city" and a "capital", with values of 20% and 2% respectively set in the default simuconf.tab (the actual defaults if no values be specified are 0, which reverts to the old behaviour using the absolute sizes).

...

This should help our town size distribution to retain an approximately Zipfian form.

James, thank you for improving the town growth system and taking the time to make it optional too. I don't know whether it will solve the Bridgewater-Brunel problem, but it should certainly improve the realism of the economic simulation.  :thumbsup:
Title: Re: Server performance: preliminary findings
Post by: Vladki on January 16, 2020, 04:03:44 PM
Quote from: Freahk on January 15, 2020, 11:02:52 AMI don't know about bridgewater-brunel but in the stephenson-siemens game there are quite dense bus networks in some cities with stop spacings around 625m which, in combination with express lines competed other players bus services within cities quite well and increased the transported/passenger stat ratio in the townhalls quite a bit.
Also note the real-world density of stops. I have measured the path I commute daily, and the distances between stops range from 150 m to almost 1 km. The average is somewhere around 400-500 m. And as we aim for realism, we should expect inner-city transport to have stops at a similar density.
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 17, 2020, 12:05:39 AM
Quote from: Vladki on January 16, 2020, 04:03:44 PMAlso note the real-world density of stops. I have measured the path I commute daily, and the distances between stops range from 150 m to almost 1 km. The average is somewhere around 400-500 m. And as we aim for realism, we should expect inner-city transport to have stops at a similar density.
Very problematic, seeing how tiles are 125 m: a 150 m stop interval would cause multiple stops to merge together.

This is a situation when one wants it to look realistic rather than use realistic distance.
Quote from: prissi on January 16, 2020, 05:45:47 AMTo elaborate on DrSuperGood's point: Simutrans already compresses in the background, i.e. one buffer is written to while the already-filled buffer is compressed. Increasing the buffer size means a longer delay before compression starts, but then a longer continuation at maximum writing speed. But many servers (usually virtual ones) do not profit from that at all, since they only offer a single core. Only on the next tier are four cores standard, and thus only there would it help.
I was referring to buffer tuning, which is the process of optimising by finding a buffer size which provides the highest throughput or the best trade-off. In the case of a multi-threaded implementation where one thread populates the data and the other compresses and writes it, the optimum buffer size should be some fraction of the highest-level cache size, since that way the data can be passed between the threads without being written to and read from main memory, which is significantly slower than the cache. Whether this performance actually matters is another question entirely...

I was not aware that there was multi-threading for this task. If there is, one might be able to benchmark the compression performance by timing how long the compression function calls take, i.e. summing the time between the start of each call and the call returning. This is likely closer to real times than statistical, instrumented or debug-related profiling.

If people want I can try looking into changing the compression used.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 17, 2020, 12:53:54 AM
Thank you all for your responses: this is most helpful. I should note for reference that the Bridgewater-Brunel server, while a virtual server, has access to six dedicated cores, and thus will benefit from multi-threading the compression algorithm.

Dr. Supergood - it would be extremely helpful if you could look into changing the compression used; improving the speed of saving games by optimising the compression could make a very big difference to the playability of online games when they get large (and even large single player games). On the face of it, Prissi's suggestion of Snappy looks promising, but I will leave it to your discretion what to look into.

As to stop distances, we do want realism in so far as possible, but stops actually being immediately next to each other cannot work, so the game engine and scale limit us a little in terms of the density of stops. Within that limitation, however, the passenger generation algorithm is designed to give as realistic a consequence as possible to stops being too sparse (i.e. longer walking times, therefore longer journey times and thus lower ridership).
Title: Re: Server performance: preliminary findings
Post by: TurfIt on January 17, 2020, 01:42:18 AM
Speed of compression can vary drastically depending on the data being compressed, i.e. for best results the compression algorithm should be matched to the data. Feed it 'bad' data, and it chokes.

The gzip library is currently being used at default settings; the compression level could be lowered (defaults to 7/9 iirc), and the strategy parameter adjusted. One of the strategies is plain Huffman, as Prissi mentioned; there's also a filtered strategy which is a hybrid of Huffman and the default.

I also see, at least for the savegame from the Stephenson-Siemens server, that saving it with bzip2 isn't much slower than zip. Possibly the path explorer data that Extended saves is 'liked' by bzip2, so it flies through it even when set to maximum compression. Again, lowering this setting, and possibly adjusting the 'try' parameter for when it falls back on its strategies, might be worth a try and beat gzip. This could mean splitting the save in two would be warranted: save the part of the file similar to Standard (on which bzip2 is an order of magnitude slower) using gzip, and the rest with bzip2.

Can you make the bridgewater save available somewhere?  (The server refuses to serve it up...)
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 17, 2020, 01:50:03 AM
Before implementing one of these, we should maybe take a huge savegame such as Bridgewater's, save it uncompressed, and compress it using the command-line tools. That will give us a rough idea of how well the algorithm behind each tool will work for this kind of data.

Sadly, I am not able to connect to bridgewater, thus I can't fetch the whole save from the server.
Title: Re: Server performance: preliminary findings
Post by: prissi on January 17, 2020, 03:25:48 AM
Since it is relatively straightforward, I will play around with snappy in Standard too.

EDIT: Some tests with the largest developed savegames from Standard (not just large empty maps):
$ ./build/snappy_unittest.exe  testdata/a3-88-10-05.sve testdata/yoshi87-102-2-2.sve
testdata/a3-88-10-05.sve                 :
ZLIB:   [b 1M] bytes 39740055 -> 9874538 24.8%  comp  19.8 MB/s  uncomp 323.9 MB/s
SNAPPY: [b 4M] bytes 39740055 -> 15889790 40.0%  comp 413.4 MB/s  uncomp 644.5 MB/s
testdata/yoshi87-102-2-2.sve             :
ZLIB:   [b 1M] bytes 21014243 -> 4751418 22.6%  comp  24.4 MB/s  uncomp 323.7 MB/s
SNAPPY: [b 4M] bytes 21014243 -> 7539681 35.9%  comp 444.2 MB/s  uncomp 761.2 MB/s

So while zlib delivers 11 MB/s, Snappy gives 413 MB/s, but the size is 1.5-2 times as big. However, even 19 MB/s is more than 100 Mbit/s and faster than most clients can receive over longer distances. (This is a 32-bit build with -O3 and no assertions though!)

But then this is an i7-7500U. I will run the same tonight on my Celeron laptop to get something closer to many virtual servers...
Title: Re: Server performance: preliminary findings
Post by: Vladki on January 17, 2020, 08:06:38 AM
Regarding the stop density: I know that 150 m is too close for the game. That is indeed extreme even in the real world. What is more important is the average value of <500 m, which is 4 tiles, and is still quite dense for the game.

But looking at the minimap in the Stephenson-Siemens game, you can see that while commuting is green everywhere, visiting is not, even in well-served cities. I wonder why the mail map is red, even though mail network coverage is quite large.
Title: Re: Server performance: preliminary findings
Post by: Phystam on January 17, 2020, 08:19:45 AM
I tried comparing gzip, bzip2 and zstd.
I used the current Stephenson-Siemens server save data without compression, which is 96 MB.

machine power: i7-7700HQ CPU @ 2.80GHz, using M.2 SSD

BZIP2: achieved ~22MB

//compression
$ time bzip2 binary.sve

real    0m12.544s
user    0m12.016s
sys     0m0.438s

//decompression
$ time bzip2 -d binary.sve.bz2

real    0m5.882s
user    0m5.047s
sys     0m0.766s


GZIP: achieved ~28MB

//compression
$ time gzip binary.sve

real    0m6.170s
user    0m6.094s
sys     0m0.078s
//decompression
$ time gunzip binary.sve.gz

real    0m0.739s
user    0m0.656s
sys     0m0.094s


ZSTD -3: achieved ~26MB

//compression
$ time zstd -3 binary.sve
binary.sve           : 26.66%   (98183317 => 26175069 bytes, binary.sve.zst)

real    0m1.042s
user    0m0.891s
sys     0m0.125s
//decompression
$ time zstd -d binary.sve.zst
binary.sve.zst      : 98183317 bytes

real    0m0.445s
user    0m0.281s
sys     0m0.156s


PZSTD -3 with 2 processes:

$ time pzstd -3 -p 2 binary.sve
//compression
binary.sve           : 26.70%   (98183317 => 26213185 bytes, binary.sve.zst)

real    0m0.474s
user    0m0.844s
sys     0m0.141s
//decompression
$ time pzstd -d -p 2 binary.sve.zst
binary.sve.zst      : 98183317 bytes

real    0m0.170s
user    0m0.297s
sys     0m0.094s

PZSTD -3 with 4 processes:

//compression
$ time pzstd -3 -p 4 binary.sve
binary.sve           : 26.70%   (98183317 => 26213185 bytes, binary.sve.zst)

real    0m0.398s
user    0m1.344s
sys     0m0.109s
//decompression
$ time pzstd -d -p 4 binary.sve.zst
binary.sve.zst      : 98183317 bytes

real    0m0.145s
user    0m0.297s
sys     0m0.188s


From these results, pzstd with 2 processes has very high efficiency: compared to bzip2, the compression time was only about 4% (0.47 s vs 12.5 s).
Title: Re: Server performance: preliminary findings
Post by: Qayyum on January 17, 2020, 08:37:01 AM
I pick ZSTD -3, Time too short leads to issues regarding memory.
Title: Re: Server performance: preliminary findings
Post by: Phystam on January 17, 2020, 12:41:39 PM
Furthermore, I checked the lz4 compression.

//compression
$ time lz4 binary.sve
Compressed filename will be : binary.sve.lz4
Compressed 98183317 bytes into 44430881 bytes ==> 45.25%

real    0m0.345s
user    0m0.250s
sys     0m0.094s
//decompression
$ time lz4 -d binary.sve.lz4
Decoding file binary.sve
Successfully decoded 98183317 bytes

real    0m0.276s
user    0m0.141s
sys     0m0.125s

It is very fast, but the compression ratio is not so good.
Title: Re: Server performance: preliminary findings
Post by: prissi on January 17, 2020, 03:09:55 PM
I did some measurements (in Standard under Windows). Binary (no compression) delivers data (on a 2.9 GHz i7-7500 CPU) at about 34.6 MB/s; with zlib this drops to 28.9 MB/s (and at level 1 "wb1", 31.4 MB/s), and with bzip2 it is down to 11.9 MB/s.

If I use the largest heavily populated map I have (uncompressed 377 MB, 3074x3074 tiles), then the results are: binary 23.6 MB/s, zlib 14.9 MB/s at default (22 MB/s with "wb1") and bzip2 10.9 MB/s.

I think the conclusion is that it is not worthwhile to look for a compression algorithm which can crunch (on my machine) more than zlib level 1 "wb1". Furthermore, https://quixdb.github.io/squash-benchmark/ (https://quixdb.github.io/squash-benchmark/) suggests that in the region of zlib level 1 the only better one is zstd. However, comparing the effort of changing one string to the hassle of adding a new library to cross-compiling, I think for Standard I will just change the method of zlib writing. This even keeps the savegames compatible...
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 17, 2020, 03:54:07 PM
Thank you all for your extensive investigations into this. Would it be of assistance if I were to make available the current Bridgewater-Brunel saved game for testing to see whether Prissi's results in Standard also hold for the extremely large amount of pathing data in that saved game, of a type which is not saved in Standard?
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 17, 2020, 05:07:25 PM
If I had the save, I could throw a few compression algorithms at it and see what happens.
It's up to you to decide whether that would be of assistance to you or not.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 17, 2020, 05:47:42 PM
Quote from: Freahk on January 17, 2020, 05:07:25 PM
If I had the save, I could throw a few compression algorithms at it and see what happens.
It's up to you to decide whether that would be of assistance to you or not.

That would indeed be helpful - here (http://bridgewater-brunel.me.uk/saves/bb-17-jan-2020.sve) is the link to the current saved game.
Title: Re: Server performance: preliminary findings
Post by: Phystam on January 17, 2020, 09:08:04 PM
I tested with the file --- the decompressed size is ~4GB.

machine: Ryzen 7 3700X, M.2 SSD

Result:
(https://cdn.discordapp.com/attachments/665986412641779712/667839328441073664/unknown.png)

This result shows that we can compress such a big save in only about 10 seconds using multiple processes!
Generally speaking, options -1 to -4 provide almost the same results, and their compressed sizes are almost the same as gzip's.
I think that option -3 or -4 with 4 cores is the most effective.
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 17, 2020, 09:57:37 PM
Quote from: Qayyum on January 17, 2020, 08:37:01 AMI pick ZSTD -3, Time too short leads to issues regarding memory.
Issues regarding memory? Please elaborate.
Quote from: prissi on January 17, 2020, 03:25:48 AMSo while zlib delivers 11 MB/s, Snappy gives 413 MB/s, but the size is 1.5-2 times as big. However, even 19 MB/s is more than 100 Mbit/s and faster than most clients can receive over longer distances. (This is a 32-bit build with -O3 and no assertions though!)
Too bad that at the moment the entire save must complete before the server starts to transfer to clients. If it started transferring the instant data started to be compressed, this topic would likely never have been started, lol.
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 17, 2020, 11:20:57 PM
Let's start slow and get faster.
I did all of these in RAM, so read/write will not be limited by the HDD. The CPU was an i7-6700HQ.

gzip fastest: ~1,3G

time gzip -1 bb-17-jan-2020.sve.uncompressed
real    0m54,607s
user    0m53,413s
sys     0m1,168s



time gunzip bb-17-jan-2020.sve.uncompressed.gz

real    0m29,111s
user    0m28,022s
sys     0m1,088s


lz4 fastest: ~2,0G

time lz4 -1 ./bb-17-jan-2020.sve.uncompressed

real    0m14,402s
user    0m10,699s
sys     0m1,420s



time lz4 ./bb-17-jan-2020.sve.uncompressed.lz4

real    0m35,982s
user    0m4,392s
sys     0m1,291s

Note that I was trying to achieve the same compression ratio as gzip -1, but it is totally not worth it!
lz4 -7 was around ~1.5G, while already taking over 2 minutes of compression time.

zstd fastest: ~1,1G

time zstd -1 ./bb-17-jan-2020.sve.uncompressed

real    0m15,672s
user    0m16,081s
sys     0m0,899s



time zstd -d ./bb-17-jan-2020.sve.uncompressed.zst

real    0m11,133s
user    0m8,162s
sys     0m1,088s


pzstd 2 threads: ~1,3G

time pzstd -1 -p2 ./bb-17-jan-2020.sve.uncompressed

real    0m11,082s
user    0m16,794s
sys     0m1,139s


time pzstd -1 -d -p2 ./bb-17-jan-2020.sve.uncompressed.zst

real    0m6,517s
user    0m8,834s
sys     0m1,899s



pzstd 4 threads: ~1,3G

time pzstd -1 -p4 ./bb-17-jan-2020.sve.uncompressed

real    0m6,582s
user    0m18,278s
sys     0m1,399s


time pzstd -1 -d -p4 ./bb-17-jan-2020.sve.uncompressed.zst

real    0m4,334s
user    0m9,052s
sys     0m2,096s


Conclusion: pzstd is quite fast and has a good compression ratio, so we should give it a try.
Title: Re: Server performance: preliminary findings
Post by: freddyhayward on January 17, 2020, 11:26:25 PM
How is memory usage for all of these options - would the server be affected by additional swapping?
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 17, 2020, 11:43:50 PM
Good point, I'll have a look at it. I expect gzip to consume the least memory; however, none of these should have a significant memory overhead.

Note pzstd will add information required for parallel decompression to the compressed files, whilst zstd used with -T will compress in parallel but won't add that kind of information, so decompression won't profit from multithreading.
However, the pzstd approach does not seem to be part of the lib itself, so its results are misleading.

Here is zstd -T: ~1,1G
time zstd -1 -T2 ./bb-17-jan-2020.sve.uncompressed

real    0m11,130s
user    0m17,928s
sys     0m0,981s


time zstd -1 -T4 ./bb-17-jan-2020.sve.uncompressed

real    0m6,586s
user    0m18,287s
sys     0m1,139s


time zstd -d -T4 ./bb-17-jan-2020.sve.uncompressed.zst

real    0m10,966s
user    0m7,833s
sys     0m1,144s
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 12:08:00 AM
Freahk - this is very interesting, thank you. I should indeed like to see memory consumption, too.

Dr. Supergood - it would indeed be splendid if someone had the time and knowledge to write code for streaming the saved games. Would anyone like to volunteer?
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 18, 2020, 12:21:52 AM
James, do you have older versions of the Bridgewater game?
Zstd can be trained on a specific file type to perform even better and achieve higher compression ratios. I am really interested in trying this out for Extended saves.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 12:58:15 AM
How much earlier were you after? I have one from August here (http://bridgewater-brunel.me.uk/saves/server-crash-19-aug-2019.sve).
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 18, 2020, 01:07:10 AM
I guess how old does not matter as long as the file structure remains the same, so I would expect it to work best with relatively recent saves, e.g. learning from yesterday's (uncompressed) savegames to compress today's savegames.

However, that's just an idea. I don't know if training will give any advantage for such large files but without trying we won't know.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 01:08:58 AM
Is the one from August any use?
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 18, 2020, 01:16:37 AM
Quote from: jamespetts on January 18, 2020, 12:08:00 AMDr. Supergood - it would indeed be splendid if someone had the time and knowledge to write code for streaming the saved games. Would anyone like to volunteer?
The changes needed are not trivial. The entire net code part is a mess with respect to such features. Ideally it should be made asynchronous however this is difficult due to standard's requirements to support single thread only builds. This means one cannot use dedicated receive and transmission threads, instead having to rely on polling.

While at it one could also implement other features... Like more than 1 player joining from the same save file. Possibly even allowing the game to progress during the transfer depending on server administrator preferences.
Ability to query server status asynchronously. Currently unresponsive servers cause the client to freeze until a timeout occurs.

The solution is not as simple as piping/nesting the compression results via socket since the server also has to write out the results to file. While saving it would have to read this written out data and send it to the client(s) that are joining, which would either require another thread to perform or greatly complicate the compression code, at least tying it in directly with net transfers.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 01:20:28 AM
 
Quote from: DrSuperGood on January 18, 2020, 01:16:37 AM
The changes needed are not trivial. The entire net code part is a mess with respect to such features. Ideally it should be made asynchronous however this is difficult due to standard's requirements to support single thread only builds. This means one cannot use dedicated receive and transmission threads, instead having to rely on polling.

While at it one could also implement other features... Like more than 1 player joining from the same save file. Possibly even allowing the game to progress during the transfer depending on server administrator preferences.
Ability to query server status asynchronously. Currently unresponsive servers cause the client to freeze until a timeout occurs.

The solution is not as simple as piping/nesting the compression results via socket since the server also has to write out the results to file. While saving it would also have to read this written out data and send it to the client(s) that are joining.

Indeed this is not trivial - this is why I asked whether there are any volunteers, since I certainly will not have the time to implement such non-trivial changes for the foreseeable future given the almost unimaginably long list of higher priority items.

As to the need to support single threaded builds - these are sometimes useful for debugging purposes, especially detecting whether a loss of synchronisation in an online game is caused by a multi-threading issue. However, if there were to be a massive gain from such a feature, and someone were only realistically able to code it within the time available by mandating multi-threading, then it may be worth dropping this requirement.
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 18, 2020, 02:43:18 AM
Quote from: jamespetts on January 18, 2020, 01:08:58 AMIs the one from August any use?
One alone is not, no. It would require a few to learn from the samples.
Title: Re: Server performance: preliminary findings
Post by: TurfIt on January 18, 2020, 04:39:44 AM
Terrible table, but my results:


type     level         time (s)   size (MiB)
bin      -               16.9       4021
zipped   6 (default)    225.3       1154
         6f             249.9       1334
         6h              56.1       2095
         6R              54.3       2078
         6F             227.8       1422
         6, 64K buf     220.9       1154
         3               70.0       1298
         1               45.6       1319
         1, 64K buf      42.0       1319
bzip2    9,30           380.6        953
         9,150          429.6        953
         1,30           307.5        984
zstd     6 (default)     26.4       1143
         1               22.0       1144
         3               26.5       1143
         7               75.5       1040
         8               90.0       1006
         9              130.0       1008
         -1              23.6       1143
         -2              18.2       1409
         -3              17.5       1478
         1, 64K buf      20.3       1144
         1, 256K buf     19.1       1143

Taken from in-game, since compressing a stream is rather different from pointing a command-line tool at an existing file...

Zstandard would be a good choice to change to. Same filesize for 1/10 the time. And only 20mins to hack it into the Simutrans code thanks to a provided zlib compatibility shim (and no thanks to the documentation of that shim which is .....) Parallel zstd does not have compatibility shims - would take hours/days to do anything there IMO. And you're limited by the 17 seconds it takes to serialize the data anyway.

Both zipped and zstd like having their buffer raised a bit from the 8K default:

fd->gzfp = gzopen(filename, "wb2");
gzbuffer(fd->gzfp, 65536);


Random thoughts:
   The default zipped is taking *only* 225 seconds. >> Bridgewater-Brunel must be running on a potato to be timing out clients at 10 mins and still not sending the file...
   Once loaded, the game runs at half speed. >> then how glacial it must be running on the server...
   Bridgewater serves up the savegame from its HTTP server quite slowly too - 10 Mbit/s average, peak 15-ish, with many extended slowdowns to < 5. Spending extra CPU for better compression is very much warranted at these slow transfer rates.
   WhyTF does Extended crash upon loading the pakset if you've moved the mouse while it's loading?? What sort of bug have you got in there...


Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 18, 2020, 08:04:28 AM
Quote from: TurfIt on January 18, 2020, 04:39:44 AMThe default zipped is taking *only* 225 seconds. >> Bridgewater-Brunel must be running on a potato to be timing out clients at 10 mins and still not sending the file...
This has been raised a lot. It is very low on memory so the server performs very poorly due to page faults. I would not be surprised if saving causes page thrashing.
Quote from: TurfIt on January 18, 2020, 04:39:44 AMWhyTF does Extended crash upon loading the pakset if you've moved the mouse while it's loading?? ?? What sorta bug you got in there...
Standard could suffer the same bug; pak128.Britain-Ex is probably the largest pakset Simutrans has ever seen, so it would be the first to show it. It could also be an SDL2-related bug.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 11:41:22 AM
TurfIt - thank you very much for that: that is most interesting. Are you able to share the code that you used to integrate zstd?

In respect of the Bridgewater-Brunel server, it is indeed low on memory, which I suspect is why the saving is taking even longer than indicated in your tests; the running game is not as slow as the slowness of the saving would suggest.

As to the crashing on mouse movement, I have never seen this and it has never been reported before. Moreover, I cannot reproduce this. Are you able to give, preferably in a new thread, detailed steps to reproduce this reliably?

Also, Freahk - how many would you need?
Title: Re: Server performance: preliminary findings
Post by: prissi on January 18, 2020, 01:43:16 PM
Reading some more on zstd, it seems that zstd can also read zlib files, so it would have backward compatibility without any extra effort. That is nice.

I wonder though how TurfIt gets more than four times the data rate on writing binary files. Does Experimental dump large tables at some point?
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 02:08:48 PM
Quote from: prissi on January 18, 2020, 01:43:16 PM
Reading some more on zstd, it seems that zstd can also read zlib files, so it would have backward compatibility without any extra effort. That is nice.

I wonder though how TurfIt gets more than four times the data rate on writing binary files. Does Experimental dump large tables at some point?

I suspect that it does: the routing data in large games can be very large, and these are in hashtables.
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 18, 2020, 03:03:02 PM
Quote from: jamespetts on January 18, 2020, 11:41:22 AMAlso, Freahk - how many would you need?
I don't know exactly, but a few, ~5-10, should give us an idea of whether it's worth investigating zstd training further for our purposes.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 04:04:19 PM
How different from each other would they need to be; would a set of backups a few hours apart work?
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 18, 2020, 04:30:05 PM
QuoteHow different from each other would they need to be; would a set of backups a few hours apart work?
The idea of training is detecting typical patterns within files of a given type, so I guess a set of backups would work perfectly well as long as there were players actually playing the game at the time.

Otherwise, such a comparison would be quite pointless. It would be a kind of "calculate the best way to compress a set of 95% similar files", "now use the gathered information on another file that is 95% the same".
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 04:55:42 PM
If it is of any help, have a look at the games currently in the saves directory of the Bridgewater-Brunel fileserver here (http://bridgewater-brunel.me.uk/saves/?C=S;O=A). You will see that the largest of them (order by size to see this easily) are from the latest server game. Is this of any help? I have not yet installed FileZilla on my new computer, so it may be a while before I can easily manipulate files on the server at present.
Title: Re: Server performance: preliminary findings
Post by: Mariculous on January 18, 2020, 06:01:29 PM
I don't know but I will give it a try.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 07:41:37 PM
Quote from: Freahk on January 18, 2020, 06:01:29 PM
I don't know but I will give it a try.

Splendid, thank you.
Title: Re: Server performance: preliminary findings
Post by: TurfIt on January 18, 2020, 07:46:33 PM
Quote from: jamespetts on January 18, 2020, 11:41:22 AM
TurfIt - thank you very much for that: that is most interesting. Are you able to share the code that you used to integrate zstd?

Quick and dirty patch attached. Most of the makefile changes are just to get it to compile; you have it very corrupted (wrong logic and TABS!!!). I did not fix it 'cleanly'.
The zlibwrapper directory is straight from the zstd source, with some extra headers mushed into one directory. Again, not very 'clean', but good enough for the test at hand.


Quote from: DrSuperGood on January 18, 2020, 08:04:28 AM
This has been raised a lot. It is very low on memory so the server performs very poorly due to page faults. I would not be surprised if saving causes page thrashing.

This game requires 8.5GB on a server (for the Simutrans process alone, OS overhead extra). If the server only has 8GB then that's the first thing that needs "fixing". While zstd is still frugal memory-wise, especially at low compression settings, it does still add to the memory footprint, and might make things even worse. Swapping == death.


Quote from: prissi on January 18, 2020, 01:43:16 PM
If I read some more on zstd, It seems that the zstd can read also zlib file, so it would have backward compatibility without any extra effort. That is nice.

Yes, it can read and write zlib files.  A single function call to select zstd or zlib for writing if desired.


Quote from: prissi on January 18, 2020, 01:43:16 PM
I wonder though how TurfIt gets more than four times of the datarate on writing binary files. Does experimental dump large tables at one point?

Probably by not using a laptop; you posted having an i7-7500, which doesn't exist, so I'm assuming it's actually an i7-7500U, which is a low-power laptop chip. Laptops also tend to have slow memory (i.e. sticking with spec, which is a mere DDR4-2133 for that 7500U [or even DDR3-1600, which is also supported, and a cheap laptop manufacturer might sneak that in there!]), and memory speed plays a huge factor in compression, and even in running Simutrans in general, which is memory-latency bound.

For reference I did these tests on a Ryzen 7 2700X and mine has crippled memory support so stuck at DDR4-3066 (it'll do quite tight timings at that, just will not do any more frequency). All in memory too - ramdrive.

Experimental's save time problem is entirely due to saving the path explorer data, something that doesn't exist in Standard; i.e. your change to wb1 is counterproductive IMHO unless you expect all clients to be able to download at 100 Mbit/s from all servers, with servers running on poor (laptop) CPUs...

The Bridgewater-Brunel game without path data is 677MB, which compresses to 137MB in 16.4 seconds with the default zipped, using 1.5 cores with the limited multithreading. With the path data, there's an extra 3344MB to save, taking 185 seconds - CPU usage drops to a single core since zip chokes on this bitstream, at less than half the compression rate. Zstandard seems to handle it fine, otherwise I'd have suggested reassessing the structure of this data.

---
I also tried a maximum compression with lzma2: 16 threads, ultra level, 192MB dictionary. It takes the same 225 seconds as zipped but produces a file half the size, at 620MB. But it consumes a whopping 21GB of RAM just for the compression!
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 09:29:58 PM
Thank you for this - that is most helpful. Can I check what dependencies are required for this in both Linux and Windows?
Title: Re: Server performance: preliminary findings
Post by: TurfIt on January 18, 2020, 10:08:44 PM
Zstandard.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 18, 2020, 11:39:28 PM
Quote from: TurfIt on January 18, 2020, 10:08:44 PM
Zstandard.

Splendid, thank you.
Title: Re: Server performance: preliminary findings
Post by: prissi on January 19, 2020, 10:54:50 AM
TurfIt, most servers for Standard games are really weak (1-core vservers, at least in my case); otherwise bz2 would not be off by default for servers. Most clients will at least have a faster download than 10 Mbit/s nowadays. Moreover, the switch (zipped/bzip) in simuconf.tab is there exactly to have a choice between speed and size.

(My laptop was the most expensive model from Panasonic in 2017, about $3000. The CPU is a specially made model with 6 cores and 2.9 GHz nominal which does not exist in databases. The resource monitor shows 6 threads with speedup to 3.6 GHz. So I was a little surprised about a factor of four. But this is getting off-topic fast.)

Btw. the Debian package name is libzstd-dev, and it seems to be already installed with pacman on mingw64.
Title: Re: Server performance: preliminary findings
Post by: Phystam on January 19, 2020, 11:33:23 AM
Even if you want to cross-compile the library, you can do so easily, since the library does not have any dependencies. It was very easy to introduce.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 19, 2020, 05:39:56 PM
Dr. Supergood - can I check whether you are still interested in working on this? It would help me to know whether to spend my time on this in the near future or whether to spend the same time working on bug fixes and other enhancements.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 25, 2020, 02:12:22 AM
I have been looking at attempting to implement Zstd this evening, but cannot find any precompiled libraries for Visual Studio, nor a Visual Studio build file. Am I right to assume that the only way to build this for Windows without spending a very large amount of time is to use Cygwin or similar? Has anyone tried compiling this on Windows?
Title: Re: Server performance: preliminary findings
Post by: Phystam on January 25, 2020, 04:05:29 AM
I usually use the Windows Subsystem for Linux (WSL) and the MinGW compiler set. At least, I have succeeded in compiling the Zstd library.
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 25, 2020, 05:34:48 AM
Quote from: jamespetts on January 19, 2020, 05:39:56 PM
Dr. Supergood - can I check whether you are still interested in working on this? It would help me to know whether to spend my time on this in the near future or whether to spend the same time working on bug fixes and other enhancements.
Currently it does not seem worthwhile to spend time on changing the compression. It only became an issue because the server game is so large that the server ran out of memory, so any kind of compression would still be very slow. The better solution, involving parallel transfer and compression, is not trivial to implement, and is something to do more towards a polish phase of development, or when there are no important features left to implement. It would also be worthwhile to implement in Standard for similar reasons.

Parallel saving and transfer of save file data would effectively eliminate the compression time altogether for most algorithms, since the bottleneck becomes the upload speed of the server or the download speed of the client. One could even use slower, better compression then, since, as long as the compression speed is faster than the server upload rate, any reduction in file size represents a reduction in load times.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 25, 2020, 06:35:54 PM
It probably is worth spending time to do this, since even in single player mode, the amount of time spent saving the current Bridgewater-Brunel saved game is unreasonable, whereas it is reasonable when compression is entirely disabled. Thus, the slowness on the Bridgewater-Brunel server is not exclusively down to it being low on memory, although this very probably makes it worse.

It is of course up to Dr. Supergood what he would like to spend his time doing; but I will proceed when I am able to integrate this, if I can overcome the library issue.

Incidentally, I notice that it is possible to compile the Zstd library so as to enable multi-threading, which would, on the face of it, seem to render pzstd redundant.

Phystam - did you enable multi-threading when you compiled the library? This would seem to be a worthwhile thing to do.
Title: Re: Server performance: preliminary findings
Post by: DrSuperGood on January 26, 2020, 03:26:32 AM
Quote from: prissi on January 19, 2020, 10:54:50 AM(My laptop was the most expensive model from Panasonic in 2017, about $3000. The CPU is a specially made model with 6 cores and a nominal 2.9 GHz which does not appear in CPU databases; the resource monitor shows 6 threads boosting up to 3.6 GHz. So I was a little surprised about a factor of four. But this gets offtopic fast.)
I assume that it is still an Intel CPU? That is likely an AMD vs. Intel issue.

AMD processors, such as that used by TurfIt, generally score higher in compression/decompression tests than comparable Intel desktop ones. AMD heavily optimizes for this kind of workload, for some reason. Zen 2 (the current generation) took this to a new level: for example, the Ryzen 5 3600 delivers near-identical performance in these tests to a Core i9 9900K, despite having lower clocks, a much lower power limit and 2 fewer cores, and costing under half the price.

Assuming that compression takes long enough that the boost duration expires on the Intel system, that could be 3x performance from clock speed and better compression IPC, plus extra gains from better memory bandwidth and latency.
Quote from: jamespetts on January 25, 2020, 06:35:54 PMIt is of course up to Dr. Supergood what he would like to spend his time doing; but I will proceed when I am able to integrate this, if I can overcome the library issue.
So, to get the task straight: the existing zip library (zlib?) is to be entirely replaced by Zstd, operating in multi-threaded mode, and the changes have to be applied to both the makefile (for Linux) and Visual Studio?
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 26, 2020, 12:14:38 PM
Quote from: DrSuperGood on January 26, 2020, 03:26:32 AM
So to get the task straight. The existing zip library (zlib?) is to be entirely replaced by Zstd which is to operate in multi threaded mode? And the changes have to be applied to both makefile (for Linux) and VisualStudio?

Yes, for the most part: TurfIt has done much of the work with the patch (assuming static linking, on which see below), but I have so far not been able to set myself up to compile the necessary libraries for Windows at least, and I have not yet tried with Linux. Ultimately, this will need to be able to be compiled in three ways: (1) Windows native with Visual Studio for development/debugging; (2) Windows cross-compile with GCC (for building on the server); and (3) Linux native with GCC (for building on the server).

If at all possible, the library will need to be compiled with the multi-thread option enabled, as this could make a big difference to performance as suggested by the tests of pzstd earlier in the thread.

One thing that will need to be checked, which I have not yet had time to review, is licence compatibility. Simutrans (and thus Simutrans-Extended) uses a weird licence, the Artistic Licence 1.0, as was specified by Hajo back in 2004. This is not compatible with the GPL. Zstd is licenced under both the GPL v. 2 and the BSD licences. I have not yet checked the compatibility of the latter.

If the licence is incompatible, we will not be permitted to incorporate this library by static linking (as I believe is the implementation in TurfIt's patch), but will need to use dynamic linking. A .dll file will then need to be generated for Windows builds (and whatever the equivalent of that is for Linux builds), and this, together with a copy of the GPL and a link to the sources, will need to be distributed with the executable builds.
Title: Re: Server performance: preliminary findings
Post by: prissi on January 26, 2020, 12:32:07 PM
I think the biggest speedup is from comparing standard to experimental. If you write large hashtables in experimental, cache misses are far less frequent than in standard, where the data delivery rate is limited by collecting the objects from everywhere in memory.

To the task at hand:
If you can live with a 10-20% larger file but still want maximum speed, just change the mode string from "wb" to "wb1" in gzopen. It compresses almost as fast as Simutrans can create the binary data, but is 25% larger than zstd compression. The games keep compatibility, and the time saved can be used for other stuff. (It would also avoid all the hassle of requiring a self-compiled library on Macintosh, with MSVC, and on ARM systems. This is why I did not consider it for standard right now.)

The library TurfIt uses is not multithreaded, as he wrote. Anyway, that is not needed either, since apparently Simutrans cannot deliver the data as fast as zstd can compress it single-threaded. And you still need zlib to read old savegames via the zlib wrapper; you just need to copy the zlib-wrapper files from zstd. (I am not sure whether you can automatically pull only those files into that directory.)
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 26, 2020, 02:23:07 PM
Prissi - thank you for that information. I have tried experimentally modifying "wb" to "wb1", but, when I do so and when I try to open the Bridgewater-Brunel saved game from December, I get an assert failure in some external code, the substance of which seems to be, "Invalid file open mode", 0. Trying to continue after this causes a segfault in loadsave.cc.

Can I check whether I have done this correctly - should I change all instances of "wb" to "wb1" in loadsave.cc, or only some of them?
Title: Re: Server performance: preliminary findings
Post by: TurfIt on January 26, 2020, 05:13:39 PM
The zstd hack patch I had posted already has zlib set to level 1; you can look there for what to change. It is not important to the zstd integration, just a leftover from the timing tests. And those timing tests show clearly that zstd produces the Extended savefile from the Bridgewater server at 90% of the size of zlib '1' in half the time. Very much worth it for Extended, which saves the routing data that Standard does not have and that zlib does not like. I have not seen a save from Standard that is worth changing from the zlib '6' default: you at best save a few seconds on compression, and spend a couple of extra minutes on file transfer. Similarly, it is not really worth the hassle of another library in Standard, since you save a few seconds at best.

I was going to mention the licensing, but decided to ostrich instead. Unless we have some lawyer join the team... it is a mess. From what I see, the Artistic licence was deemed too vague, and hence is blackballed, so all the other libraries being used in Simutrans are in a grey mess too. Also note that we are trying to use the zstd zlib wrapper here - actually including their source files directly, not just using the zstd library.

The zstd project comes with VS project files, per the README; are those not sufficient to build a library? I have no experience with VS.
Title: Re: Server performance: preliminary findings
Post by: prissi on January 27, 2020, 09:23:21 AM
Proper zstd support was very easy to add without touching the wrapper at all. I will submit it for standard today. The buffer system is already in place, so one just had to discard some old code with mixed reads and writes and consistently use lsgetc() and read() from loadsave_t. libzstd builds straightforwardly in MSVC.

At higher compression levels, the standard game size seems even smaller with zstd than with bz2, by about 5%, and of course saving and loading are much faster than with bz2 (with the single-threaded library, but with compression in a background thread). More quantitative data will follow.
Title: Re: Server performance: preliminary findings
Post by: prissi on January 27, 2020, 02:29:14 PM
Below is a patch for loadsave_t (which should apply without too many problems to experimental, I guess). It uses the native zstd API, so no source from the zstd project needs to be included.

However, I did some tests, and zstd was really disappointing. At the default setting (level 3), it is, for standard Simutrans games, the same speed as bzip2 but larger (or, at compression level 9, almost on par in size but much slower). On the other hand, zlib "wb1" is larger still, but much, much faster.

My times:

zlib1 load world 2902 ms (zlib default 3012 ms)
bz2   load world 3079 ms
zstd9 load world 3085 ms
zstd3 load world 2996 ms

(The load times are the same without threading, as on a simple server: loading is dominated by object allocation.)

zlib1 save world 910 ms, size 5666 kB (zlib default 1260 ms, 4644 kB)
zstd9 save world 14107 ms, 3298 kB
zstd3 save world 2962 ms, 4432 kB
bzip2 save world 2239 ms, 3177 kB

My conclusion for standard: zstd might be much faster on decompression, but that is not the limiting step. On compression, it is much slower than zlib "wb1" and even default zlib, while not generating smaller files than bz2, nor much smaller ones than default zlib. Maybe the standard savegames are too small, or I am too stupid to do it correctly.
Title: Re: Server performance: preliminary findings
Post by: jamespetts on January 27, 2020, 03:14:05 PM
This is interesting - thank you. I suspect that TurfIt may be right that the pathing data, when saved in Extended, does not work well with Zlib, so this may well make a substantial difference to Extended even if not to Standard.

I have been looking into the licence situation. ZStd is dual licensed - under the GPL v. 2.0 and the BSD modified licence. The former is known to be incompatible with the Artistic Licence 1.0. However, the latter is compatible. The terms of the licence are very short and simple, and I reproduce them below in full:

Quote from: The licence
BSD License

For Zstandard software

Copyright (c) 2016-present, Facebook, Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

* Neither the name Facebook nor the names of its contributors may be used to
   endorse or promote products derived from this software without specific
   prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Thus, what we need to do is make sure that we include the above text in both source and binary distributions; provided that we do that, we will be compliant. I suggest that this BSD licence be added to the Standard sources and binary distributables once Zstd is integrated, if this integration is to occur.

See here for more information on the BSD licences.

This also means that we do not need to look into the more complex process of dynamic linking as would have been necessary if this were available only under an incompatible copyleft style licence.
Title: Re: Server performance: preliminary findings
Post by: Phystam on January 27, 2020, 03:39:43 PM
I tried TurfIt's implementation on my big map (the same map that I sent you --- Java Island). It is not a very developed map, but it takes around 6-7 seconds to save with zlib. After including zstd, it takes around 3 seconds (without multi-threading!).
Title: Re: Server performance: preliminary findings
Post by: jamespetts on February 01, 2020, 11:25:34 PM
I have now attempted to apply Prissi's version of this using the patch. Unfortunately, the patch refused to apply automatically (I think that the base code is too different), so I have had to apply it manually; however, I believe that I must have made some error, as attempting to save a game results in a fatal error indicating that, on reading a saved game, the save file is corrupt - this error happens before any actual saving takes place (or, I think, just about the time that the version is saved).

For reference, the Github branch on which I have attempted to implement this is here (https://github.com/jamespetts/simutrans-extended/tree/zstd-native).

I am not quite sure what features of Zstd called for some of the code changes that were in the patch, so I am unsure whether I have interpreted or applied the patch correctly in light of the differences in code between Extended and Standard.

Any assistance would be much appreciated.

Edit: I have also retried the wrapper version. I cannot get this to compile, as it fails at the linker stage, complaining of unresolved external symbols, despite pointing to the same library file with which I was able to compile the native version successfully. This is rather perplexing.

Edit 2: I have changed the zlib parameter from "wb" to "wb1". This seems to help a little, but it is still quite slow. I should be grateful if people could let me know with to-morrow's nightly build whether this is any better with the Bridgewater-Brunel server.
Title: Re: Server performance: preliminary findings
Post by: prissi on February 02, 2020, 12:44:33 PM
Standard's zstd support will come in during merging, in -r 888x and following (where it is optional), because I was really disappointed after more tests. (I have the exact data on the other computer.) It seems that zstd is geared rather towards highly repetitive ASCII data (or many zeros/0xFFs in data sets) and performs rather poorly on binary data.

(That was also the conclusion of the "squash" benchmark binary test, where Firefox was the target; I think this is not so far from a Simutrans game.)

zstd with a decent compression level is about as fast (or slow) as zlib, albeit with file sizes 10-25% smaller for zstd (for the same compression time). While bzip is much slower, its files were 15-50% smaller still (and it also has a smaller memory footprint). If memory is an issue, and the games are really large, bzip could still compress faster than the data is sent.

Also, libzstd needs manual compiling for MSVC, which may also put people off. And I have no clue how to force multithreading when compiling manually with MSVC. (So I benchmarked with Mingw64.)