The International Simutrans Forum

 

Author Topic: Long-term thoughts on multi-threading in Standard and Experimental  (Read 20587 times)

0 Members and 1 Guest are viewing this topic.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
First of all, I am not sure whether this is the right forum for this discussion, but it more or less fits with the theme "extension requests", so this seemed to be the safest place to put it. If I am mistaken in my choice, please do move this to a more appropriate place.

The thinking behind this discussion is borne out of my experience with Experimental, and is probably more directly relevant to that fork, with its higher demand for computing power than Standard, but it is of relevance to Standard, too, so it is worthwhile discussing it in a general, rather than an Experimental specific forum.

There are quite a few features that it would be good to be able to add (such as multi-factorial route finding for passengers/goods or specific route finding for private cars) which are currently not practicable to add because the very large computational effort that they would involve would give unacceptable performance on games with large map sizes. These things are not high priorities in the short-term, but it would be good to be able to aim for them within a period of a few years from now, as they could potentially make a very real difference to the depth of the simulation and overcome quite significant limitations in the current model.

However, as many who follow the trends of such things here may be aware, Moore's Law does not have a great many years left. Transistor gates are already only scores of atoms wide. Before long, making them much smaller will not be possible, and the age of ever increasing computational speed will be at an end for the time being. Quantum computers are interesting, but, even if they prove a practical proposition for general use one day, are only actually faster in some very specific applications which are unlikely to assist (I might be wrong about the latter, but I suspect that it will be decades, if at all, before quantum computers reach the home in any event).

For some years, the IT industry's response has been parallellisation: the use of multiple processor cores to perform tasks at the same time. Even mid-range tablets now come with 8 core CPUs. Making use of this additional computing power, however, requires multi-threading, and this presents particular challenges for Simutrans, especially connected to multi-player gaming. The Standard developers have made great progress in recent years in applying multi-threading to Simutrans, and it covers now, I believe, loading/saving (along with compression) and graphics rendering. The difficulty is that it cannot easily be made to be compatible with online play: because online play requires the server and all clients to be exactly synchronised, the non-deterministic nature of multi-threading makes it very challenging to use this technique in the actual simulation algorithms, places where it is often needed the most.

It is worthwhile considering, which is why I have started this thread, some ways in which we might rise to that challenge and devise means of multi-threading certain critical algorithms that are able to work reliably in a synchronised multi-player environment. This is a rather long-term aim, and I wonder whether any of the Standard developers have already considered any of these ideas, but I thought it worthwhile to discuss them, at least in general terms, at this juncture.

Firstly, one thought that I had some time ago in connexion with the private car routing system was the use of concurrent multi-threading only on the server, or even distributed between clients and servers. The idea was that each thread would perform calculations (in my original idea, calculations of private car routes) independently and in a way that, until committed, did not affect any other data in the game, and then commit the results using the same system as a human player sending a command: in other words, a system that is intrinsically designed to deal with non-deterministic input. The trouble that I found with this approach is that my initial calculations showed that it would consume far too much bandwidth to send these data between clients and server, which is probably fatal to this technique being of much use at all unless some use can be found of it where the computational effort involved in the calculations are very great, but the data comprising the results of the calculations are very slight. I cannot think of such a situation, however. (As an aside, the concurrent method of multi-threading would be ideal in many situations in Simutrans were it not for the need to synchronise with online play. In theory, such code could be written and made to work only in single player mode, although this would seem somewhat of a wasted effort when the largest and most complicated maps are usually to be found in multi-player games).

The more obvious means of useful multi-threading in Simutrans is parallel multi-threading, where an algorithm is performed many times over on a given dataset by different threads at the same time whilst no other activity is occurring and the results committed when all threads have completed their work. The difficulty is finding a dataset on which the same algorithm needs to be performed many times and where the results of performing it on one member of the set do not need to be taken into account when performing it on another member of the set.

One possible candidate for this is routing. Routing in Experimental is perhaps more suited to parallel multi-threading than routing in Standard because the routes are calculated all at once and cached, whereas, in Standard, they are cached. Knightly wrote the current Simutrans-Experimental's routing system, which is currently single threaded. He made it very efficient indeed using the Floyd-Warshall algorithm and a spot of co-operative multi-tasking. It works very well and performs acceptably even on large maps. However, on the very largest and most complicated of maps, it does have a noticeable performance impact when running. Also, it is currently a single factor routing system, taking into account only speed. Ideally, it would be good to have multi-factorial routing, taking into account comfort and possibly even cost. One way of doing this whilst maintaining a cached routing system would be to have different classes of passenger, each of which has a different balance of factors which, when taken together favour one route over another, and store different routes to the same destination for each of several classes of passenger. It might be possible to use parallel multi-threading to calculate these several different routing options in the same time as it would take a single threaded system to calculate one of them. However, even a single set of multi-factorial routes would be considerably more computationally expensive to calculate than single factor routes, as the cost calculation in the Floyd-Warshall algorithm, which is called extremely frequently, and is currently just a simple integer comparison, would instead become a complex equation almost certainly involving computationally expensive integer division. What I do not know is whether it is possible to use parallel multi-threading on a single pass of the Floyd-Warshall algorithm - I rather suspect not. It is possible that such gains as there are left to be had with shrinking transistor size will make a single set of multi-factorial routes calculable in the same time as single-factorial routes are calculated currently, which might assist there, but any means of multi-threading route finding would be very useful.

The second area that could possibly be multi-threaded using the parallel technique is passenger generation, which is known to be computationally intensive, and to which in Experimental I have added numerous features. This would entail generating packets of passengers in batches in multiple threads, then committing the results in a single thread to avoid desynchronisation. I think that this is probably technically feasible (as the generation of one batch of passengers does not inherently depend on what has happened to another batch of passengers just generated, with the exception of the limit on the number of jobs available in the latest unreleased Experimental versions, but that is easy to work around by the use of negative numbers of jobs, as already done, to achieve balance overall), but it would be quite challenging in practice to separate the generation of batches of passengers from committing the results of that generation to the game at large, the two operations of which are currently closely interwoven.

A third area in which multi-threading might conceivably be of assistance is in vehicle route finding: ship route finding especially takes a large computational toll on a large and busy map at present. This is theoretically possible because the route chosen by one convoy is not affected by the route chosen by another convoy (the multi-threaded part would not include the block reservation for trains, just the finding of the route itself). The trouble with this, however, is that multiple vehicles rarely want to find a route at exactly the same time. A solution which I think is technically feasible but would require rather a lot of difficult work is to anticipate in advance which loading vehicles will need route finding soon, and process them in batches. There would need to be a way of working out which convoys already have their routes set when the time comes to leave and a sensible way of scheduling this. This could, of course, adversely impact on the performance of single threaded clients and servers, for which doing difficult things in batches makes things worse rather than better, so it should be disabled when a single player game or server is single threaded (is there a way of detecting the available number of CPU cores, I wonder?).

I should be very interested in any views on these matters, and any plans that the Standard developers have to multi-thread parts of the simulation code in Standard, as well as any more general thoughts on long-term solutions to performance issues as the complexity and size of the simulation increases.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #1 on: March 27, 2014, 05:34:38 AM »
I've been thinking much of the same, although I know almost nothing about Simutrans' simulation logic. (I've mostly been messing around  in simgraph and simsys, and the besch-es.) But as for calculation all kinds of routes, one must keep in mind that these routes not only need to be calculated, they must also be stored and, more importantly, transferred between CPU and RAM.

Offline ӔO

  • Devotees (Inactive)
  • *
  • Posts: 2345
  • Hopefully helpful
  • Languages: en, jp
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #2 on: March 27, 2014, 06:44:20 AM »
ships are typically seen as not having any 'road' to travel on, but they do have some major lanes that are followed with little deviance.

http://www.marine-knowledge.com/wp-content/uploads/2012/01/Ocean-Shipping-Lanes.jpg

perhaps if ships were always funnelled into using a cpu or human decided path, that could lighten the path finding load, as a certain length will always be known?

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4587
  • Languages: EN, DE, AT
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #3 on: March 27, 2014, 09:24:35 AM »
First of all, you should identify the parts that are the performance hogs by using a profiler.

(1) Cargo Routing: Google founds a lot of pages about 'parallelize floyd warshall', so the algorithm itself seems to be parallizable.
I am unsure whether Standard's routing would benefit from parallelizing: the threads will work on the same data after all.

(2) Passenger generation: Also here, cargo routing has to be done (with the same caveats as above, as threads have to work on the same data in memory). Moreover the result may depend on the current game state: overcrowding of stations.

(3) Vehicle routing: Effect of routing depends on the game state: roads get deleted, routing weight depends on traffic.

Maybe one can try to parallelize the stuff that happens in karte_t::step:
(a) Passenger and factory cargo generation in one thread
(b) Vehicle routing in another thread
Both tasks do not interfere with each other. However, both are access a lot of data, so memory bandwidth may become an issue.

Movement of pedestrians (and clouds) can be parallelized as well: Each thread moves pedestrians in some sectors of the map. These will not interfere with the pedestrians in other sectors. In a last step, pedestrians at the boundaries of the sectors are moved.

Offline Sarlock

  • Devotee
  • *
  • Posts: 1340
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #4 on: March 27, 2014, 09:48:51 AM »
I think you can fairly safely assume that players choosing to play on very large maps (4000x4000 or larger) are probably using higher-end processors that are very likely multi-core CPUs.  Adding a small performance impact to the smaller maps on single core processors may be worth it if the gain on larger maps with multi-core CPUs is significant.

As to how to design a system that allows a second processor to perform all of the wayfinding while not holding up the primary CPU... and not having to rewrite half the game, is certainly no small task.

With a very large map (such as the current Experimental server game), how much of the system load is pure processor-based and how much of the load is RAM access?  I would think that the amount of RAM accessed increases at a larger exponential rate than the processor load... until the player(s) fill the world up with 5000+ convoys, at least...

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #5 on: March 27, 2014, 10:33:16 AM »
Some very interesting thoughts - thank you for all the feedback.

Firstly, to reply to some of Dwach's observations: passenger generation in Experimental is separate from route finding for passengers (although I realise that this is not separate in Standard), as it uses the cached routes that have already been found in a different step. In Experimental (again, different from Standard), the overcrowding of a transfer station does not affect the routing, albeit the overcrowding of an origin station does. The latter point is a surmountable problem if one accepts some degree of flexibility in the overcrowding limit for origin stations: the passengers can be generated in worker threads but not actually added to the stations until the worker threads have finished, and then added all at once by the main thread whether the station is overcrowded or not. The next batch of passengers to be generated will then not be able to add any more to that station until it is back below its limit once again.

As to routing, although this does depend on the game state, if a batch of routing was done at the same time (in parallel threads at the same point in the same step), this would not matter, as the routing would not itself change the game state and all threads would use the same game state to route.

The movement of pedestrians and moving objects can probably be multi-threaded on the concurrent basis as these things do not affect the game state at all so far as I am aware (unless, perhaps, they need random numbers generated: but, if this is so, they can perhaps be given a batch of random numbers on creation to be getting on with).

As to CPU usage versus memory bandwidth, parallelisation does not, of course, assist with memory bandwidth. However, at present, CPU usage seems to be the greatest limitation on performance in Simutrans. If we move to a state where memory bandwidth is the greatest limitation, we will have made real progress, not least since I suspect that there are fewer obstacles to the increase of memory bandwidth in the future than there are to the increase of base processing power in the future, and greater parallelisation will give manufacturers an incentive to increase memory bandwidth as this will increasingly become the primary limitation in heavily multi-threaded applications.

Shipping lanes are an interesting alternative idea for alleviating performance troubles from ships, although I wonder how difficult that this might be for players for whom it is unlikely to be obvious that, to send a ship over open water, one must first define a shipping lane.

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4587
  • Languages: EN, DE, AT
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #6 on: March 27, 2014, 11:00:19 AM »
The movement of pedestrians and moving objects can probably be multi-threaded on the concurrent basis as these things do not affect the game state at all so far as I am aware (unless, perhaps, they need random numbers generated: but, if this is so, they can perhaps be given a batch of random numbers on creation to be getting on with).
There is a subtle effect on game state: only 250+x objects are allowed on a particular tile, if there are a lot of pedestrians, this may become an issue.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #7 on: March 27, 2014, 11:13:47 AM »
There is a subtle effect on game state: only 250+x objects are allowed on a particular tile, if there are a lot of pedestrians, this may become an issue.

Hmm - would it not be possible for the game to give automatic priority to game objects over pedestrians, deleting excess pedestrians as necessary, so that the presence of pedestrians can never affect the simulation state?

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #8 on: March 27, 2014, 12:13:48 PM »
First thing: Multithreading will not help much. Image you can split something in 3 threads (like padestrians, clouds etc, routing, of vehicles, routing of goods). This will just allow for an increase of map size of 1.5 in both directions, as the amount of stuff then is increased exponentially.

Passenger generation is dirt cheap, it is just a few calls to simrand (which must happen in certain order). Routing those is the expensive part.

You can certainly have more than one thread. Maybe at the start of each step, generate a random number for each city and run then there own counter, so you could thread as many task as you use similar starting random counter and their own random function.

Routing of ships (and other vehicles) can be cached, until a way with this waytype is altered. However, this happens almost all the time, so I found this not very helpful.

The sync_step has already three task for eyecandy (animations), pedestrians, and the rest; I tried threading them but got no gain. In programming there should be some stuff to this. The main stuff (vehicles) eat up most time anyway.

As said, the pedestrians (as long as you make sure there are less than 200 on a tile and access the map not at the same time as any other) can have their own thread. Same for building animations.

As discussed before, you may be able to split water, rail, street, and maglev waytype networks, with special care on the state of crossings (must be set once afterwards. So a queue for a crossing, and the state is assumed close until next step). Airplanes only after all other are finished. This will (at most) give you three main savings, rail and road, and ships.

In the end you end up with something which may allow you to run two times more ships, or more cars, and on 1.4*1.4 larger maps (i.e. going from 2048 x 2048 to 2850 x 2850). You get about the same gain by doing nothing and buying you computer a year or two later.

And the thing is: That is it. No further progress, without starting a completely new approach. If you really want to use parallel speedup, I think you need to go back a lot and use different data structures to access the world, and build a systems with lots of queues and callbacks.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #9 on: March 27, 2014, 03:21:56 PM »
Thank you for your insights: this is most helpful. The idea of separating networks' route finding is especially interesting.

I suspect that there is some divergence between Standard and Experimental of the extent to which multi-threading may be of assistance. Instead of allowing larger maps, threading can allow for greater simulation depth and complexity with the same size of maps; as already observed, passenger generation in Experimental is more onerous than in Standard even though routing is cached, and routing itself is more onerous in Experimental.

The area of private car route finding especially is one that might well benefit from threading. Currently, a very approximate system is used because it takes too much computational effort to find a route from each origin building to each destination building: instead, a route is found from each townhall road to each other townhall road (and each road connected to an industry or attraction not in a town), and the time that it takes to traverse that route approximated to give a time per straight line tile, which is then used to extrapolate the probable length of a private car trip from any individual point to any other individual point. This does not allow the simulation of actual traffic flow, which would be a very desirable thing to do, and even the system currently in use causes some noticeable effect on performance and has to be spread over the course of a game month to minimise the impact. Multi-threading alone may well not be enough here (although might well make a real difference): a more sophisticated approach may well be needed, involving the calculation of the route from each city building to each other city building in the same city and also each road tile on the city boundary, then calculating the route from each boundary tile to each other city boundary tile in a different city as a separate step, then, for each inter-city journey, concatenating the three routes (from origin city to boundary tile, from origin city boundary tile to destination city boundary tile, and from destination city boundary tile to destination city building). Private car graphics could then be assigned these as actual routes to drive, and the number of vehicles passing over tiles can be used to compute the congestion rather than a general statistical scheme at present based on the total number of private car trips in a town as against the number of kilometres of road.

Quote
In the end you end up with something which may allow you to run two times more ships, or more cars, and on 1.4*1.4 larger maps (i.e. going from 2048 x 2048 to 2850 x 2850). You get about the same gain by doing nothing and buying you computer a year or two later.

The point is rather that this will not hold true for very much longer. In about 5-6 years' time, according to current estimates, buying a computer a year or two later will not result in faster base performance (although other things, such as memory bandwidth or the number of cores may well increase; indeed, we might well see 16, 32 or even 64 core processors a decade from now).

Offline ӔO

  • Devotees (Inactive)
  • *
  • Posts: 2345
  • Hopefully helpful
  • Languages: en, jp
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #10 on: March 27, 2014, 04:27:31 PM »
Intel did announce a 20th yr edition, fully unlocked Haswell Pentium (G3xxx) in the works.

Some guesses are 4.8Ghz overclocked and since these are dual core, it's very likely to be super easy to achieve that with fewer less than ideal transistors. It might be the ideal? CPU for a game like simutrans, where single core speed is preferred over multiple cores.

---

As for the ship paths on sea, if manually done, it could be the same as building roads on ground, except free to construct and use by anyone.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #11 on: March 27, 2014, 05:16:25 PM »
Parallelizing route finding within a step is likely the easiest thing, as you don't alter anything but the route being generated, which nobody else cares about yet, and the world stays static for that part of the step. Anything which actually updates the world is more difficult, as all writes must potentially be synchronized between threads. As prissi points out, a complete rethinking with new data structures may be needed. The gains of doing things in parallel might easily be outweighed by the cost of synchronizing the results in the end or along the way.

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2269
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #12 on: March 27, 2014, 07:55:29 PM »
Memory access is also a problem with parallel tasks as different tasks are accessing different sections of memory. This lowers the effectiveness of caching significantly. This, together the cost that Ters has aleady pointed out of syncronising can result in slower code execution.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #13 on: March 27, 2014, 08:09:23 PM »
Memory access is also a problem with parallel tasks as different tasks are accessing different sections of memory. This lowers the effectiveness of caching significantly. This, together the cost that Ters has aleady pointed out of syncronising can result in slower code execution.

The underlying premises for my original post are that Simutrans is limited more by CPU power than memory bandwidth and that the growth of computer power in the future is likely to come from parellisation as before long it becomes impossible to shrink transistors any further, so base CPU power will not increase much after about 5-6 years from now (and perhaps even before then), whereas memory bandwidth may well increase, as that is not subject to the same inherent limitation to the same extent. As stated above, if we are able to move the limitation from CPU power to memory bandwidth, we will have achieved something worthwhile; certainly, whilst it might be hard to predict when the point will be crossed at which memory bandwidth becomes the dominant limiting factor, it does not seem to me to make sense to dismiss any attempt to deploy parellisation on the ground that memory bandwidth is also limited (unless, of course, tests have already shown it to fail to result in improvements in performance when deployed in the ways suggested); but I am not sure whether or not that is what you were suggesting. The individual proposals for paralleisation need to be considered on their own merits, I think, which individual consideration was one of the aims of having this thread in the first place.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #14 on: March 27, 2014, 08:48:30 PM »
To make it clear: What I write is from my experience of Simutrans being I/O rather than CPU limited. That's the only world I know. The I/O problem was related to rendering, though, and cured with either a driver update or something along those lines.

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2269
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #15 on: March 27, 2014, 09:09:37 PM »
I think james hopes that any IO issues will disappear in time with increased memory speeds. I'm not sure how likely this is.

In anycase, back to parallel tasks - most of the simutrans code is interdependent. Vehicles interact with each other at signals and crossings meaning that they always have to be processed in the same order. Passenger and goods routing is similarly dependent on routing that has already taken place so must also be processed in the same order. Failure to do so will result in desynchronisation.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #16 on: March 27, 2014, 09:32:02 PM »
[...]same order. Failure to do so will result in desynchronisation.

True, the way networked games works adds another level of problem. Even if we do manage to get threads to produce a consistent result on one machine, that doesn't mean they produce the same consistent result on all participating machines.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #17 on: March 27, 2014, 10:08:02 PM »
I am well aware of the interdependency - that is why the original suggestions focussed on tasks that can be done in batches and committed to the game state in a single thread. This would be relevant to algorithms that are mainly CPU limited and that can be broken down into independent batches - I suggest that passenger generation, vehicle route finding and private car route finding are possible examples of things that meet all three of those criteria.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #18 on: March 27, 2014, 10:10:22 PM »
You could decouple vehicles types (road, etc) by freezing crossing states at the beginning of a sync_step, give each their own random counter and process those waytypes together but in fixed order for one waytype. This is quite feasible. However, 80% is eaten up by road traffic on most maps (most convois, citycars, larger networks).

Route search is cheap for trains as there is never a very large network and never ever no routes. The only time I see significant contribution of route search is with ships or broken routes. On the other hand, lines could cache the default route for a waytype, until a tile is altered with said waytype (i.e. land is raise etc.) Hence a vehicle would first ask the line for its route, and if unknown (first vehicle on that line, invalid etc) calculates its route and give it to the line.

You can tried to parallize the waytypes, and the passenger generations in cities. The latter is also quite easy in standard, if a special deterministic random per city is used (making the mersenne twister a class and use one per city, initialised on loading). Instead on one counter one would need citynum[array] counts for waysearch to have a routing in parallel. Routing in Standard can easily give also the second best route (which is probably just changing a station earlier or later, computer are dumb ... ).

The same concurrent route counters could be used for a parallel routing after arrival at stops. Although with the joining and so on it has become hard to saturate standard on todays computer CPU wise. On large maps, I rather run into bandwidth issues.

Private car route finding is an experimental problem, so I do not have any advice on that (apart form don't do it ...)

Also, larger maps in Simutrans are often also bandwidth limited. Accessing 2 GB will invalid any cache there is; also the elements sync step lists are not in the same place in memory. Accessing them in parallel will even worsen caching, as larger caches tend to be joined (and hence less efficient).

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #19 on: March 28, 2014, 05:55:27 AM »
You can tried to parallize the waytypes, and the passenger generations in cities. The latter is also quite easy in standard, if a special deterministic random per city is used (making the mersenne twister a class and use one per city, initialised on loading). Instead on one counter one would need citynum[array] counts for waysearch to have a routing in parallel. Routing in Standard can easily give also the second best route (which is probably just changing a station earlier or later, computer are dumb ... ).

Isn't passenger generation dependent on there actually being a route? Most importantly there must be free capacity at the first stop. If two different threads generate passengers within the catchment area of the same stop, and that stop is one passenger away from being full, there will be a race as to which thread manages to generate its passenger first to fill that slot. The result of this race will be different on different computers, as well as when the same computer tries to reproduce something from a save game. As long as overcrowded stops along the way is ignored, one might solve it somehow, although with some processing overhead. If passengers are to avoid overcrowding at all costs, I can't see how this can be parallelized, at least without making it slower.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #20 on: March 28, 2014, 10:46:44 AM »
You probably a lock for each station. But those cases (overlapping cities) are not that common. I though of one thread per city. Make it one thread per connected cityarea+stations coverage, and you are done. Or maybe one thread per 256x256 areas on the map. Similarily factory production can be parallellize in chunks of the same size.

Apart from this effort in step(), all the other efforts (seperating waytypes etc.) would not scale with map size.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #21 on: March 28, 2014, 11:05:47 AM »
The problem to which Ters refers could be solved by not actually adding the passengers to the stop during the parallel phase, but storing them separately ready to be added, and doing the actual adding in the single threaded commit phase after the parallel generation phase. The single threaded commit phase would not refuse to add passengers to a stop that is overcrowded; however the parallel phase would not generate passengers to be added at that stop unless it is not already overcrowded. This might result in a limited number of passengers being added to a stop beyond its overcrowding limit, but I do not think that a somewhat flexible approach to overcrowding is unrealistic or problematic.

As to having one thread per city - this would not work in Experimental, where the latest branch combines all passenger generation into karte_t, although it would probably work well in Standard as Prissi describes. In Experimental, there would have to be batches of passengers generated in parallel, each with their own thread and, as Prissi pointed out earlier, their own random seed.

As to factory production, how computationally intensive is that?

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #22 on: March 28, 2014, 04:50:34 PM »
The problem to which Ters refers could be solved by not actually adding the passengers to the stop during the parallel phase, but storing them separately ready to be added, and doing the actual adding in the single threaded commit phase after the parallel generation phase. The single threaded commit phase would not refuse to add passengers to a stop that is overcrowded; however the parallel phase would not generate passengers to be added at that stop unless it is not already overcrowded. This might result in a limited number of passengers being added to a stop beyond its overcrowding limit, but I do not think that a somewhat flexible approach to overcrowding is unrealistic or problematic.

Your into what I meant when I wrote "at all costs". If the game is less strict about routing over overcrowded stations, one does not need a single thread to finish up. It's enough to know how full the stops were at the beginning. This can be done in parallel as a prelude. In all cases, one must synchronize on updating a station, or spend much time, possibly in vain, trying to divide the world so that each station fully belongs to one thread. Alternatively move passenger generation from buildings to stops, but that changes a lot.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #23 on: March 28, 2014, 06:35:05 PM »
Synchronising on updating a station was the reason that I suggested actually committing the results in the main thread at the end of the parallel threads' tasks...

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #24 on: March 28, 2014, 07:59:29 PM »
Synchronising on updating a station was the reason that I suggested actually committing the results in the main thread at the end of the parallel threads' tasks...

Then each thread needs its own copy of every station's inventory, not necessarily with the existing contents, but it will be a similar data structure. Synchronization might be cheaper, depending on how the threads divide the world between them. There might be relatively few stations falling in under the authority of different threads.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #25 on: March 29, 2014, 06:50:17 PM »
A centralized generation of passengers which are then parallized again does not sound like a sensible division of data.

As I said, the aim is to scaled the game speed parallelism with map size and convoi number. Hence the suggestion to have regions of the map (like 256x256 tiles) which are in parallel and a queue for everything which traverses between these regions. Hence all passenger generation, productions, moving of vehicles etc. in these areas is done in parallel without the need to lock the map. If one is ok with the overhead in the route search, also each region get the own route array. As a result the game should run as fast as a 256x256 (plus overhead).

This approach would scale nicely until the number of cores equals the number of tiles. I will be tedious (also not so difficult) to distribute stuff in the right region (i.e. sync_list etc for each region). A little tougher might be a synchronized execution, i.e. in network games same area sizes are needed. Most difficult is the correct treatment of moving between region, imho.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #26 on: March 29, 2014, 07:08:43 PM »
Dividing the game into regions won't really help, because vehicles will inevitably want to move from one region to the next. However, a vehicle can only move from one tile to the next if that tile is free. Whether that tile is free is subject to race conditions, as the other thread may or may not have processed movement in that area yet. For a single player game, this isn't that much of a problem, but for multiplayer, desynchronizatin will occur sooner or later, unless the threads are forced into lockstep somehow.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #27 on: March 29, 2014, 07:16:59 PM »
Hence the suggestion to have regions of the map (like 256x256 tiles) which are in parallel and a queue for everything which traverses between these regions. Hence all passenger generation, productions, moving of vehicles etc. in these areas is done in parallel without the need to lock the map.
This is exactly (well close enough) what I've done with my work to thread sync_step(). It's still very much a work in progress however... And I've not really looked too in-depth at threading step() yet.

I'm not sure where the idea Simutrans is CPU bound comes from, although I guess that depends on the definition of CPU bound. All my testing points to the CPU cores being stalled waiting on data, and multithreading doesn't help too much in that case. It does help hide the memory access latency a bit given the ineffectiveness of the hardware prefetchers at handling the access patterns of Simutrans. But you're only talking ~15% improvement. Using explicit prefetch instructions I can get a ~20% improvement, but there's no way I can make this portable between even different CPUs let alone OSs. But, rearranging the data to align with Simutrans use of it, I'm seeing a ~15-30% gain. Double what multithreading got, hence I've been working at doing that first.


Dividing the game into regions won't really help, because vehicles will inevitably want to move from one region to the next.
Vehicles can only move so far each step... 

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #28 on: March 29, 2014, 08:12:38 PM »
Vehicles can only move so far each step... 

It's enough that they can cross the tile boundary separating one region from the next. And the effect isn't confined to the boundary region, because if a vehicle can't cross the boundary, it won't free up the tile it came from. This in turn affects other vehicles wanting to pass into that tile and so on.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #29 on: March 29, 2014, 08:24:24 PM »
So you need more regions than threads. Threads independently process regions that are always 3 apart, and regions are sized so that a vehicle can never move father than one region.

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2269
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #30 on: March 29, 2014, 08:53:18 PM »
It's not just vehicles moving into different regions though, it's vehicles reserving ways across different regions that might be an issue.

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #31 on: March 29, 2014, 08:58:11 PM »
Yes that is a problem. Thankfully it's relatively rare, and hence I just defer them to a single threaded cleanup.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #32 on: March 29, 2014, 09:46:03 PM »
If one is worrying over access pattern, then I am not sure much can be improved: The goods, and vehikels have already their own memory management (by size), which keeps them as close as possible.

About vehicles leaving regions: Those could be transferred to a semaphore protected list. This list is then handled after everything else is handled. (And on this lists they are moved back into their oncoming region, or they stay there, if they fail to move to an adjacent tile).

Step would handle to regions similarely. Actually the climate change is already handled in regions. But also productions, passenger generation routing etc. could be done regionwise.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #33 on: March 29, 2014, 10:01:14 PM »
So you need more regions than threads. Threads independently process regions that are always 3 apart, and regions are sized so that a vehicle can never move father than one region.

If a tile at the edge of region A has vehicles attempting to enter it from both region A and region B, how is it supposed to prioritize which vehicles enter that tile when it's free? If there is a steady stream of vehicles, processing of region A will only fill that free tile with vehicles from region A, since vehicles in B won't be processed until A is finished. Simutrans currently seems to have some sort of fairness, whether accidental or intentional. Processing A or B first alternately won't help if the time it takes for vehicles to clear the tile is an integer multiple of this frequency.

About vehicles leaving regions: Those could be transferred to a semaphore protected list. This list is then handled after everything else is handled. (And on this lists they are moved back into their oncoming region, or they stay there, if they fail to move to an adjacent tile).

Then you can't know if the tile is free until the next tick. And there is a different variant of the problem above: If vehicles crossing between regions are processed after everything else, they won't have a chance entering busy tiles. This list must also be explicity ordered somehow, as threads will add vehicles into this list in different orders on different machines (or otherwise repeated runs from the same state).

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2269
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #34 on: March 29, 2014, 10:07:12 PM »
Quote
This list must also be explicity ordered somehow
Meaning even more overhead...

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #35 on: March 29, 2014, 10:48:18 PM »
If a tile at the edge of region A has vehicles attempting to enter it from both region A and region B, how is it supposed to prioritize which vehicles enter that tile when it's free? If there is a steady stream of vehicles, processing of region A will only fill that free tile with vehicles from region A, since vehicles in B won't be processed until A is finished. Simutrans currently seems to have some sort of fairness, whether accidental or intentional. Processing A or B first alternately won't help if the time it takes for vehicles to clear the tile is an integer multiple of this frequency.
There is currently no built in fairness/prioritization. The first convoi to be processed after the tile becomes free takes it. The order convois are processed depends on the order they were created. In the region setup, if the convoi in A waiting for the tile is processed after the convoi leaving the tile is, then it will always take it. If the A waiting convoi is processed before the A vacating convoi, then the B waiting convoi will take it. There may be a slight bias towards preferring A when there's multiple A waiters, but I don't see any real issue here.

EDIT: missed this.
If one is worrying over access pattern, then I am not sure much can be improved: The goods, and vehikels have already their own memory management (by size), which keeps them as close as possible.
There's quite a bit to be gained fixing the terrible memory access patterns that now take place. Many objects can be shrunk a bit by reordering the members to fill the padding gaps. Some members can be changed to smaller data types too. Reorder the members so items accessed at the same general time are on the same cache line. Then allocate the objects packed together in the order they're iterated across. And once done convoi_t::sync_step() ends up 30% faster. I'm sure there's another 10% or so to be had as well, I just need to win the battle I'm having with C++ not letting me do what I want... bloody type safety - it's bits'n'bytes, let me at em!
« Last Edit: March 30, 2014, 03:16:30 AM by TurfIt »

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #36 on: March 30, 2014, 09:21:33 AM »
There is currently no built in fairness/prioritization. The first convoi to be processed after the tile becomes free takes it. The order convois are processed depends on the order they were created. In the region setup, if the convoi in A waiting for the tile is processed after the convoi leaving the tile is, then it will always take it. If the A waiting convoi is processed before the A vacating convoi, then the B waiting convoi will take it. There may be a slight bias towards preferring A when there's multiple A waiters, but I don't see any real issue here.

The issue is that junctions at region boundaries will without explaination (in the eyes of regular player) give priority in certain directions, whereas all the others don't. But I see that the time at which the vacanting vehicle is processed makes it less of an issue than I first thought.

There's quite a bit to be gained fixing the terrible memory access patterns that now take place. Many objects can be shrunk a bit by reordering the members to fill the padding gaps. Some members can be changed to smaller data types too. Reorder the members so items accessed at the same general time are on the same cache line. Then allocate the objects packed together in the order they're iterated across. And once done convoi_t::sync_step() ends up 30% faster. I'm sure there's another 10% or so to be had as well, I just need to win the battle I'm having with C++ not letting me do what I want... bloody type safety - it's bits'n'bytes, let me at em!

I also think there is much to be gained here. And C++ does let you do what you want. You just have to say the magic words. Most other languages won't let you do that. The biggest problem is that different compilers don't lay out the bits and bytes in the same way. Arrays of polymorphic types, rather than arrays of pointers to such, is probaby the most difficult things to do.

Offline Dwachs

  • DevTeam, Coder/patcher
  • Administrator
  • *
  • Posts: 4587
  • Languages: EN, DE, AT
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #37 on: March 30, 2014, 04:31:26 PM »
What strikes me most when profiling is that karte_t::sync_step itself takes 20% of processing (self-) time (without graphics). It only has to process a list, resolve virtual functions, and release some objects.

Turfit, do you have any work-in-progress patches?

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #38 on: March 30, 2014, 05:07:36 PM »
The issue is that junctions at region boundaries will without explaination (in the eyes of regular player) give priority in certain directions, whereas all the others don't. But I see that the time at which the vacanting vehicle is processed makes it less of an issue than I first thought.
Worst case, a vehicle could get stuck for a while, but then it can with the current single threaded routine too. That's what traffic lights are for!


I also think there is much to be gained here. And C++ does let you do what you want. You just have to say the magic words. Most other languages won't let you do that. The biggest problem is that different compilers don't lay out the bits and bytes in the same way. Arrays of polymorphic types, rather than arrays of pointers to such, is probaby the most difficult things to do.
So far my incantations have given my compiler a sense of humor. I particularly liked when it claimed "error: vector_tpl is not a template" when I hadn't gone any where near modifying vector_tpl, and it was very much a template the day before.

If you can point me towards the magic to enable arrays of polymorphic objects embeded in other objects that are themselves heap placement allocated arrays of objects that would be great. I was attempting to experiment with just that - sticking a vehicle or two right inside the convoi object.


What strikes me most when profiling is that karte_t::sync_step itself takes 20% of processing (self-) time (without graphics). It only has to process a list, resolve virtual functions, and release some objects.
Processing the lists themselves is quite time consuming. There's many many add/deletes every frame. Splitting them into more or less static vs heavily modified lists helps greatly, and since I ended up with 5 lists, it makes sense to just go right to 8 and have a separate list per sync stepped object type. Combine this with a memory allocation slab per object to pack them in tightly, and order the list as per the allocation, and performance++. Just needs a little tweaking to work for multiplayer though...


Turfit, do you have any work-in-progress patches?
Not really, it's quite the meandering experimental mess. It's been 8-10 months since I last really looked at it, planned to return to it, but got distracted (as usual) mucking about with Experimentals network game woes. Perhaps next up.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #39 on: March 30, 2014, 05:54:54 PM »
So far my incantations have given my compiler a sense of humor. I particularly liked when it claimed "error: vector_tpl is not a template" when I hadn't gone any where near modifying vector_tpl, and it was very much a template the day before.

If you can point me towards the magic to enable arrays of polymorphic objects embeded in other objects that are themselves heap placement allocated arrays of objects that would be great. I was attempting to experiment with just that - sticking a vehicle or two right inside the convoi object.

Modifying vector_tpl is exactly what you have to do. Or rather make something like vector_tpl, but with the necessary hacks. It is a bit unsafe, as one has to explicitly specify the size of the largest polymorphic type at compile time.

There is one aspect I have no solution for, and that is compacting the array when an element is removed in the middle. This is however a problem anyway, if there are references (either pointer or index) to the object from elsewhere. And this is perhaps the biggest problem with putting objects, rather than pointer to objects, into arrays.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #40 on: March 30, 2014, 06:55:57 PM »
Returning for a moment to passenger generation, one thing that strikes me is that Experimental may benefit rather more from multi-threading this part of the code than Standard on account of Experimental's multiple destinations feature, which multiplies the number of runs through the code checking for destinations by potentially a great many times per packet of passengers. The code for committing the results to the world is actually already somewhat separated from the code for generating the passengers (at least, it is in the more recent versions, the latest of which is currently on the way-improvements branch); all that would be required is a means of separating these things into two methods and transferring the data between them in order to achieve a full separation necessary for multi-threading. The amount of CPU time taken by this code is noticeable, as increasing the number of alternative destinations substantially reduces the fast forward speed on a large map, so this might achieve a worthwhile result.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #41 on: March 30, 2014, 08:54:14 PM »
Why one passenger to multiple destinations? I could understand multiple routes, but the thing about simutrans is that passengers does not choose their destination on availability. If there is no route, well no travel. I wonder why to change this?

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #42 on: March 30, 2014, 09:32:45 PM »
Ahh, the alternative destinations feature has been with Simutrans-Experimental for a while, although the as yet unreleased overhaul of the passenger generation system  has changed its scope somewhat.

The basic idea is that passengers make trips for all sorts of reasons: some to go to visit relatives, some to have a walk in the park, some to go to work, some to go to the shops to buy bread, and so on. Each of those destination types might or might not have more than one actual destinations: a trip to see relatives, for example, has only one possible destination, whereas a trip to go to the shops has many possible destinations, and people will generally choose whichever is closer. As to trips to work, although people work in one specific place, people tend to choose to work in places near where they live and will not apply for jobs too far away.

To simulate this, passengers in Simutrans-Experimental have between zero and n alternative destinations (n being defined in simuconf.tab). The actual destination is picked at weighted random from the list of buildings (which is, in the latest unreleased version, stored in the world rather than in cities), and the algorithm checks whether any possible means of getting there (walking, player transport starting from any stop whose coverage extends over the starting building or private car, if these passengers are deemed to have access to one) takes little enough time to be within that packet of passengers' journey time tolerance, a figure set at random between a minimum and maximum specified in simuconf.tab for every packet of passengers generated. If the passengers can reach their destination within the journey time tolerance, they take whichever is the means available to them that takes the least time overall (walking, player transport or private car). If they cannot reach that destination within their journey time tolerance or at all, the algorithm finds another destination and runs the same check on that, until the number of alternative destinations has been exhausted or the passengers have found a satisfactory route to their destination, whichever comes first. It is this repeated checking that could potentially be parallelised by performing it on several packets of passengers at once, as this is considerably more computationally intensive, I believe, then setting the passengers on their way by adding them to their origin stop.

Edit: Incidentally, does Simutrans use thread pooling? I do wonder whether there might be issues with creating and destroying threads very quickly as this sort of feature might require.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #43 on: March 30, 2014, 09:52:37 PM »
About multiple destinations: This would change very little (if nothing) to a player, since the destinations are anyway random; or rather it would add some more toyish behaviour as in OpenTTD, where passengers are generated anyway. Incidentally, I wrote a patch for OpenTTD to achieve this. Each stop had a list of all connected stop within a certain depth (or journey time, if you like). This consume quite some memory, but with nowadays computers memory is more available than CPU.

Also: you approach is much more useful for a routing as in standard. Since you can store the mean travel time for each leg between stations in a tree, you can quickly search until the time/distance/transfer or whatever in weight is reached.

Back to sync_step:

I played around a lot with different lists, vector hashtable, ... in sync_step. First building up and deleting sync_step take some time (especially for vectors). For meaningful profiling some month of fast forward are needed.

There are lists which change almost constantly (smoke, pedestrians). Since here the processing order does not matter but adding and deletion happens freuqently, they are a hashtable. That is why the eyecandy (buildings, smoke) is a ptrhashtable.

I had to brush up the old code a bit. Use r7124 for trials. I found vector_tpl the fastest, although each sync_step it needs to copy the entire vector. However, I could only profile for a few gamedays before it froze. Something really messed up with the profiling with my GCC. Since it works with MSVC, I suspect a problem with my GCC (4.6.2) and the profiler.

Offline kierongreen

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 2269
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #44 on: March 30, 2014, 10:03:25 PM »
Quote
Incidentally, does Simutrans use thread pooling? I do wonder whether there might be issues with creating and destroying threads very quickly as this sort of feature might require.
Certain sections of the multithreaded code (display for example) do have threads which just idle when inactive yes. Originally threads were created and destroyed every time they were called but this was very slow on some platforms.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #45 on: March 30, 2014, 10:11:54 PM »
Prissi - the alternative destinations feature does make quite a difference to the player, as the destinations to which the passengers actually go (taking into account the journey time as described above) are not totally random: instead, the passengers will tend to go more often to destinations closer to home, but will sometimes (when their journey time tolernace is high and they pick a far-away destination) travel further afield. This reflects reality, in which most journeys are local journeys: in real life, one is far more likely to travel to the next town than the other side of the country. The consequential effect on this is a tempering of the network effect of connecting additional towns to a passenger network: in Standard, the number of passengers increases exponentially when this is done, whereas, in Experimental, the alternative destinations feature makes this much less so, which is more realistic.

Kieron - thank you for that insight: that is useful. I suppose that thread pooling would not be too hard in this case: just have a fixed maximum number of threads for this work and generate them when the game starts, each with their own random seed.

Edit: Looking into the code, one thing that would complicate the process of multi-threading passenger generation is the onward trip algorithm, in which calculating the onward trip is closely woven with committing the generated results to the world.
« Last Edit: March 30, 2014, 10:19:05 PM by jamespetts »

Offline TurfIt

  • Dev Team, Coder/patcher
  • Devotee
  • *
  • Posts: 1324
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #46 on: March 30, 2014, 11:55:16 PM »
The amount of CPU time taken by this code is noticeable, as increasing the number of alternative destinations substantially reduces the fast forward speed on a large map, so this might achieve a worthwhile result.
But what is the bottleneck in the CPU for that routine? Until you know that, you can't know what where to fix. Multithreading on a multicore processor brings more execution units and more L1/2 cache into play. If your algorithm isn't being limited by those, then multithreading isn't going to gain you much.


This consume quite some memory, but with nowadays computers memory is more available than CPU.
I'd argue the opposite. Sure, lots of memory is available, but actually using it brings things to a halt. In the 386 days running out of memory and waiting for a page to swap in from the HDD was painful. Now, overflowing the L3 cache and waiting for a memory read is just as painful, and should you actually page swap - unbearable.


I had to brush up the old code a bit. Use r7124 for trials. I found vector_tpl the fastest, although each sync_step it needs to copy the entire vector. However, I could only profile for a few gamedays before it froze. Something really messed up with the profiling with my GCC. Since it works with MSVC, I suspect a problem with my GCC (4.6.2) and the profiler.
Using vector_tpl for everything is the fastest in my tests as well. I also modified vector_tpl to have the swap_erase method we'd talked about before (1 or 2 years ago I think), and use that instead of copying the vector each cycle. Anyways, if everyone is happy with the state of the double height code, instead of looking into it once I finish up with Experimental, I'll try to get my sync_step patch into a presentable shape (minus the mutithreading aspect as I think that'll still take some months).

I gave up on statistical profiling for Simutrans long ago. I just can't get consistent enough results between runs. When everything varies by 30+% everytime, you can't draw any conclusions. Instead I've been embedding explicit timing checks into the code being tested, much better results that way.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #47 on: March 31, 2014, 10:43:00 AM »
But what is the bottleneck in the CPU for that routine? Until you know that, you can't know what where to fix. Multithreading on a multicore processor brings more execution units and more L1/2 cache into play. If your algorithm isn't being limited by those, then multithreading isn't going to gain you much.

How would one test this? Simple profiling will tell one how long that this function is taking, but not where the performance limit is.

Offline Markohs

  • DevTeam, Coder/patcher
  • Devotees (Inactive)
  • *
  • Posts: 1559
  • Languages: EN,ES,CAT
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #48 on: March 31, 2014, 11:16:56 AM »
I just did read all this thread lightly, so I might have ignored comments.

 But I think as prissi expressed, a much more in-depth refactoring of the memory structures and code is needed to scale the code to multithreaded.

 Imho, if the code is to scale good, we'd have to think about more than 4 threads. And that whould require consider all the world static in that cycle, using just read-only access to the world, by all threads, and those threads should generate a list of changes that need be to applied to the world. A transaction manager whould apply the changes sequentially in a periodic timing, as milestones. It can reject the changes that whould break the game coherencies (two trains on same track, for example). So they will just be re-calculated on deand on next cycles.

 That's the only way the game can scale good, imho, and it's the way to go. It's quite a lot of work.

 You can't have mutual exclusions and hope the game to perform good, you have to remove all dependancies and exclusions, and resolver them later. What you can use it's a heuristic approach to keep the number of "rollbacks/rejects" to the minimum.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #49 on: March 31, 2014, 11:42:13 AM »
That would really be a vast amount of work - years, possibly. I wonder what sort of performance gain that that might enable compared to a more narrowly focussed parallelisation of specific algorithms in the way discussed above?

Offline Markohs

  • DevTeam, Coder/patcher
  • Devotees (Inactive)
  • *
  • Posts: 1559
  • Languages: EN,ES,CAT
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #50 on: March 31, 2014, 11:51:19 AM »
The moment you need mutual exclusion and sync when accessing to the world structures you commit yourself to not scale much more than 3-4 threads imho. Seeing how current computers (even portable devices) won't scale much in terms of single-thread speed, but in memory bandwidth and number of cores, this option has no future.

 I'd chose an option that has future, and not a temporary increase in performance that I doubt it will be enough boost to justify itself, *and* with the amounts of complexity that's going to add to the code (with the added maintenance). In fact, even the current threaded display code, it's not performing good enough, imho, specially if we take into account the complexity it added to the code.

 Ofc, maybe a hybrid implementation, or a pure MP implementation, with heuristics, can be enough, or *maybe* dividing the world in chunks, and post-process the interactions between boundaries.

 It's hard to say, I'm just guessing here, maybe my option doesn't scale good enough when you have less than 4 cores, and it's not good enough.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #51 on: March 31, 2014, 12:13:55 PM »
Interesting analysis - scaling to 3-4 threads can potentially make significant gains if applied in the right places. It is, I suppose, a matter of working out what the main limitations on performance are now and will be in the future and carrying out a cost-benefit analysis for the three broad options: (1) no significant multi-threading of the simulation code (as distinct from, say, graphics or load/save); (2) limited multi-threading as discussed previously in this thread; and (3) the total rewrite necessary for fully integrated multi-threading as discussed here. These are difficult choices indeed.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5523
  • Languages: EN, NO
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #52 on: March 31, 2014, 03:36:42 PM »
That would really be a vast amount of work - years, possibly. I wonder what sort of performance gain that that might enable compared to a more narrowly focussed parallelisation of specific algorithms in the way discussed above?

Which algorithms? Doing parallelization internally in an algorithm may run into too high start-up and merge costs. At work, we recently did some fork-join on a pathfinding-ish operation, and while we were suddenly able to handle bigger maps, memory consumption went through the roof as each subtask had to have a copy of the preliminary result.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #53 on: March 31, 2014, 03:51:09 PM »
Memory consumption per se has traditionally been one of Simutrans's least significant limiting factors, behind CPU use and memory bandwidth. As to which algorithms - this has been discussed earlier in the thread. It is, of course, important for parallelisation to be used in areas where the computational cost of producing a dataset for subsequent merging significantly exceeds the computational cost of the merging itself; but I think that there are some areas, in Experimental at least, where this is so (passenger generation, private car route finding, and possibly certain parts of the goods route-finding, being the initial construction of the table of routes and times from stops to directly connected stops).

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9507
  • Languages: De,EN,JP
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #54 on: March 31, 2014, 04:09:20 PM »
Implicit parallelisation (?) does not really work well in real live problems beyond simple loops. There are more than 15+ year of research into it.

I have somewhere a patch to use OpenMP instead thread for display. It was even slower than single threaded and crashed occasionally ...

About passenger generation: With the amount of passenger generated and transported, I hardly doubt the player will notice a passenger going closer or farther; statistically there is a very good chance that on a well developed map it will even out. I am not sure of the implementation in experimental, but in standard the distance is taken into account, depending on simuconf.tab setting; i.e. in earlier year passenger already travel shorter distances than in later years.

I would not add CPU there. Rather fix the amount of time in the physic module.

Offline jamespetts gb

  • Simutrans-Extended project coordinator
  • Devotee
  • *
  • Posts: 18679
  • Cake baker
    • Bridgewater-Brunel
  • Languages: EN
Re: Long-term thoughts on multi-threading in Standard and Experimental
« Reply #55 on: March 31, 2014, 04:31:28 PM »
Thank you for your helpful input. By "implicit parellisation", do you mean the sort of parallelisation to which Markhos was referring?

As to passenger generation, the current release version of Experimental uses distance ranges and a far lower number of alternative destinations. However, this has lead to significant problems: in early eras, passengers are completely unwilling to take the stage coach, which might take many hours to make a short journey, as the range of journey time tolerances associated with local or medium range trips are well below the stage coach's travelling time. Research shows that people's destinations are selected in large part in consequence of the travelling time to those destinations, which is the way in which the latest (unreleased) development version of Experimental works.

As to the time spent in the physics module, Bernd has done some optimisation work recently, but I am not sure how significant that the results are. The ultimate difficulty is the inability to use native floating point types because they cannot be synchronised reliably between different platforms in network use, and Bernd had to spend months writing an integer based faux floating point class, which, whilst reliable, is much slower than native floating point. I am not familiar with the inner workings of the physics code, but if you (or anyone else) has any tips for Bernd on how to make it any faster, then I shall gladly pass them onto him.