News:

Do you need help?
Simutrans Wiki Manual can help you to play and extend Simutrans. In 9 languages.

Long-term thoughts on multi-threading in Standard and Experimental

Started by jamespetts, March 27, 2014, 01:41:09 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

jamespetts

First of all, I am not sure whether this is the right forum for this discussion, but it more or less fits with the theme "extension requests", so this seemed to be the safest place to put it. If I am mistaken in my choice, please do move this to a more appropriate place.

The thinking behind this discussion is borne out of my experience with Experimental, and is probably more directly relevant to that fork, with its higher demand for computing power than Standard, but it is of relevance to Standard, too, so it is worthwhile discussing it in a general, rather than an Experimental specific forum.

There are quite a few features that it would be good to be able to add (such as multi-factorial route finding for passengers/goods or specific route finding for private cars) which are currently not practicable to add because the very large computational effort that they would involve would give unacceptable performance on games with large map sizes. These things are not high priorities in the short-term, but it would be good to be able to aim for them within a period of a few years from now, as they could potentially make a very real difference to the depth of the simulation and overcome quite significant limitations in the current model.

However, as many who follow the trends of such things here may be aware, Moore's Law does not have a great many years left. Transistor gates are already only scores of atoms wide. Before long, making them much smaller will not be possible, and the age of ever increasing computational speed will be at an end for the time being. Quantum computers are interesting, but, even if they prove a practical proposition for general use one day, are only actually faster in some very specific applications which are unlikely to assist (I might be wrong about the latter, but I suspect that it will be decades, if at all, before quantum computers reach the home in any event).

For some years, the IT industry's response has been parallellisation: the use of multiple processor cores to perform tasks at the same time. Even mid-range tablets now come with 8 core CPUs. Making use of this additional computing power, however, requires multi-threading, and this presents particular challenges for Simutrans, especially connected to multi-player gaming. The Standard developers have made great progress in recent years in applying multi-threading to Simutrans, and it covers now, I believe, loading/saving (along with compression) and graphics rendering. The difficulty is that it cannot easily be made to be compatible with online play: because online play requires the server and all clients to be exactly synchronised, the non-deterministic nature of multi-threading makes it very challenging to use this technique in the actual simulation algorithms, places where it is often needed the most.

It is worthwhile considering, which is why I have started this thread, some ways in which we might rise to that challenge and devise means of multi-threading certain critical algorithms that are able to work reliably in a synchronised multi-player environment. This is a rather long-term aim, and I wonder whether any of the Standard developers have already considered any of these ideas, but I thought it worthwhile to discuss them, at least in general terms, at this juncture.

Firstly, one thought that I had some time ago in connexion with the private car routing system was the use of concurrent multi-threading only on the server, or even distributed between clients and servers. The idea was that each thread would perform calculations (in my original idea, calculations of private car routes) independently and in a way that, until committed, did not affect any other data in the game, and then commit the results using the same system as a human player sending a command: in other words, a system that is intrinsically designed to deal with non-deterministic input. The trouble that I found with this approach is that my initial calculations showed that it would consume far too much bandwidth to send these data between clients and server, which is probably fatal to this technique being of much use at all unless some use can be found of it where the computational effort involved in the calculations are very great, but the data comprising the results of the calculations are very slight. I cannot think of such a situation, however. (As an aside, the concurrent method of multi-threading would be ideal in many situations in Simutrans were it not for the need to synchronise with online play. In theory, such code could be written and made to work only in single player mode, although this would seem somewhat of a wasted effort when the largest and most complicated maps are usually to be found in multi-player games).

The more obvious means of useful multi-threading in Simutrans is parallel multi-threading, where an algorithm is performed many times over on a given dataset by different threads at the same time whilst no other activity is occurring and the results committed when all threads have completed their work. The difficulty is finding a dataset on which the same algorithm needs to be performed many times and where the results of performing it on one member of the set do not need to be taken into account when performing it on another member of the set.

One possible candidate for this is routing. Routing in Experimental is perhaps more suited to parallel multi-threading than routing in Standard because the routes are calculated all at once and cached, whereas, in Standard, they are cached. Knightly wrote the current Simutrans-Experimental's routing system, which is currently single threaded. He made it very efficient indeed using the Floyd-Warshall algorithm and a spot of co-operative multi-tasking. It works very well and performs acceptably even on large maps. However, on the very largest and most complicated of maps, it does have a noticeable performance impact when running. Also, it is currently a single factor routing system, taking into account only speed. Ideally, it would be good to have multi-factorial routing, taking into account comfort and possibly even cost. One way of doing this whilst maintaining a cached routing system would be to have different classes of passenger, each of which has a different balance of factors which, when taken together favour one route over another, and store different routes to the same destination for each of several classes of passenger. It might be possible to use parallel multi-threading to calculate these several different routing options in the same time as it would take a single threaded system to calculate one of them. However, even a single set of multi-factorial routes would be considerably more computationally expensive to calculate than single factor routes, as the cost calculation in the Floyd-Warshall algorithm, which is called extremely frequently, and is currently just a simple integer comparison, would instead become a complex equation almost certainly involving computationally expensive integer division. What I do not know is whether it is possible to use parallel multi-threading on a single pass of the Floyd-Warshall algorithm - I rather suspect not. It is possible that such gains as there are left to be had with shrinking transistor size will make a single set of multi-factorial routes calculable in the same time as single-factorial routes are calculated currently, which might assist there, but any means of multi-threading route finding would be very useful.

The second area that could possibly be multi-threaded using the parallel technique is passenger generation, which is known to be computationally intensive, and to which in Experimental I have added numerous features. This would entail generating packets of passengers in batches in multiple threads, then committing the results in a single thread to avoid desynchronisation. I think that this is probably technically feasible (as the generation of one batch of passengers does not inherently depend on what has happened to another batch of passengers just generated, with the exception of the limit on the number of jobs available in the latest unreleased Experimental versions, but that is easy to work around by the use of negative numbers of jobs, as already done, to achieve balance overall), but it would be quite challenging in practice to separate the generation of batches of passengers from committing the results of that generation to the game at large, the two operations of which are currently closely interwoven.

A third area in which multi-threading might conceivably be of assistance is in vehicle route finding: ship route finding especially takes a large computational toll on a large and busy map at present. This is theoretically possible because the route chosen by one convoy is not affected by the route chosen by another convoy (the multi-threaded part would not include the block reservation for trains, just the finding of the route itself). The trouble with this, however, is that multiple vehicles rarely want to find a route at exactly the same time. A solution which I think is technically feasible but would require rather a lot of difficult work is to anticipate in advance which loading vehicles will need route finding soon, and process them in batches. There would need to be a way of working out which convoys already have their routes set when the time comes to leave and a sensible way of scheduling this. This could, of course, adversely impact on the performance of single threaded clients and servers, for which doing difficult things in batches makes things worse rather than better, so it should be disabled when a single player game or server is single threaded (is there a way of detecting the available number of CPU cores, I wonder?).

I should be very interested in any views on these matters, and any plans that the Standard developers have to multi-thread parts of the simulation code in Standard, as well as any more general thoughts on long-term solutions to performance issues as the complexity and size of the simulation increases.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

I've been thinking much of the same, although I know almost nothing about Simutrans' simulation logic. (I've mostly been messing around  in simgraph and simsys, and the besch-es.) But as for calculation all kinds of routes, one must keep in mind that these routes not only need to be calculated, they must also be stored and, more importantly, transferred between CPU and RAM.

ӔO

ships are typically seen as not having any 'road' to travel on, but they do have some major lanes that are followed with little deviance.

http://www.marine-knowledge.com/wp-content/uploads/2012/01/Ocean-Shipping-Lanes.jpg

perhaps if ships were always funnelled into using a cpu or human decided path, that could lighten the path finding load, as a certain length will always be known?
My Sketchup open project sources
various projects rolled up: http://dl.dropbox.com/u/17111233/Roll_up.rar

Colour safe chart:

Dwachs

First of all, you should identify the parts that are the performance hogs by using a profiler.

(1) Cargo Routing: Google founds a lot of pages about 'parallelize floyd warshall', so the algorithm itself seems to be parallizable.
I am unsure whether Standard's routing would benefit from parallelizing: the threads will work on the same data after all.

(2) Passenger generation: Also here, cargo routing has to be done (with the same caveats as above, as threads have to work on the same data in memory). Moreover the result may depend on the current game state: overcrowding of stations.

(3) Vehicle routing: Effect of routing depends on the game state: roads get deleted, routing weight depends on traffic.

Maybe one can try to parallelize the stuff that happens in karte_t::step:
(a) Passenger and factory cargo generation in one thread
(b) Vehicle routing in another thread
Both tasks do not interfere with each other. However, both are access a lot of data, so memory bandwidth may become an issue.

Movement of pedestrians (and clouds) can be parallelized as well: Each thread moves pedestrians in some sectors of the map. These will not interfere with the pedestrians in other sectors. In a last step, pedestrians at the boundaries of the sectors are moved.
Parsley, sage, rosemary, and maggikraut.

Sarlock

I think you can fairly safely assume that players choosing to play on very large maps (4000x4000 or larger) are probably using higher-end processors that are very likely multi-core CPUs.  Adding a small performance impact to the smaller maps on single core processors may be worth it if the gain on larger maps with multi-core CPUs is significant.

As to how to design a system that allows a second processor to perform all of the wayfinding while not holding up the primary CPU... and not having to rewrite half the game, is certainly no small task.

With a very large map (such as the current Experimental server game), how much of the system load is pure processor-based and how much of the load is RAM access?  I would think that the amount of RAM accessed increases at a larger exponential rate than the processor load... until the player(s) fill the world up with 5000+ convoys, at least...
Current projects: Pak128 Trees, blender graphics

jamespetts

Some very interesting thoughts - thank you for all the feedback.

Firstly, to reply to some of Dwach's observations: passenger generation in Experimental is separate from route finding for passengers (although I realise that this is not separate in Standard), as it uses the cached routes that have already been found in a different step. In Experimental (again, different from Standard), the overcrowding of a transfer station does not affect the routing, albeit the overcrowding of an origin station does. The latter point is a surmountable problem if one accepts some degree of flexibility in the overcrowding limit for origin stations: the passengers can be generated in worker threads but not actually added to the stations until the worker threads have finished, and then added all at once by the main thread whether the station is overcrowded or not. The next batch of passengers to be generated will then not be able to add any more to that station until it is back below its limit once again.

As to routing, although this does depend on the game state, if a batch of routing was done at the same time (in parallel threads at the same point in the same step), this would not matter, as the routing would not itself change the game state and all threads would use the same game state to route.

The movement of pedestrians and moving objects can probably be multi-threaded on the concurrent basis as these things do not affect the game state at all so far as I am aware (unless, perhaps, they need random numbers generated: but, if this is so, they can perhaps be given a batch of random numbers on creation to be getting on with).

As to CPU usage versus memory bandwidth, parallelisation does not, of course, assist with memory bandwidth. However, at present, CPU usage seems to be the greatest limitation on performance in Simutrans. If we move to a state where memory bandwidth is the greatest limitation, we will have made real progress, not least since I suspect that there are fewer obstacles to the increase of memory bandwidth in the future than there are to the increase of base processing power in the future, and greater parallelisation will give manufacturers an incentive to increase memory bandwidth as this will increasingly become the primary limitation in heavily multi-threaded applications.

Shipping lanes are an interesting alternative idea for alleviating performance troubles from ships, although I wonder how difficult that this might be for players for whom it is unlikely to be obvious that, to send a ship over open water, one must first define a shipping lane.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

Quote from: jamespetts on March 27, 2014, 10:33:16 AM
The movement of pedestrians and moving objects can probably be multi-threaded on the concurrent basis as these things do not affect the game state at all so far as I am aware (unless, perhaps, they need random numbers generated: but, if this is so, they can perhaps be given a batch of random numbers on creation to be getting on with).
There is a subtle effect on game state: only 250+x objects are allowed on a particular tile, if there are a lot of pedestrians, this may become an issue.
Parsley, sage, rosemary, and maggikraut.

jamespetts

Quote from: Dwachs on March 27, 2014, 11:00:19 AM
There is a subtle effect on game state: only 250+x objects are allowed on a particular tile, if there are a lot of pedestrians, this may become an issue.

Hmm - would it not be possible for the game to give automatic priority to game objects over pedestrians, deleting excess pedestrians as necessary, so that the presence of pedestrians can never affect the simulation state?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

First thing: Multithreading will not help much. Image you can split something in 3 threads (like padestrians, clouds etc, routing, of vehicles, routing of goods). This will just allow for an increase of map size of 1.5 in both directions, as the amount of stuff then is increased exponentially.

Passenger generation is dirt cheap, it is just a few calls to simrand (which must happen in certain order). Routing those is the expensive part.

You can certainly have more than one thread. Maybe at the start of each step, generate a random number for each city and run then there own counter, so you could thread as many task as you use similar starting random counter and their own random function.

Routing of ships (and other vehicles) can be cached, until a way with this waytype is altered. However, this happens almost all the time, so I found this not very helpful.

The sync_step has already three task for eyecandy (animations), pedestrians, and the rest; I tried threading them but got no gain. In programming there should be some stuff to this. The main stuff (vehicles) eat up most time anyway.

As said, the pedestrians (as long as you make sure there are less than 200 on a tile and access the map not at the same time as any other) can have their own thread. Same for building animations.

As discussed before, you may be able to split water, rail, street, and maglev waytype networks, with special care on the state of crossings (must be set once afterwards. So a queue for a crossing, and the state is assumed close until next step). Airplanes only after all other are finished. This will (at most) give you three main savings, rail and road, and ships.

In the end you end up with something which may allow you to run two times more ships, or more cars, and on 1.4*1.4 larger maps (i.e. going from 2048 x 2048 to 2850 x 2850). You get about the same gain by doing nothing and buying you computer a year or two later.

And the thing is: That is it. No further progress, without starting a completely new approach. If you really want to use parallel speedup, I think you need to go back a lot and use different data structures to access the world, and build a systems with lots of queues and callbacks.

jamespetts

Thank you for your insights: this is most helpful. The idea of separating networks' route finding is especially interesting.

I suspect that there is some divergence between Standard and Experimental of the extent to which multi-threading may be of assistance. Instead of allowing larger maps, threading can allow for greater simulation depth and complexity with the same size of maps; as already observed, passenger generation in Experimental is more onerous than in Standard even though routing is cached, and routing itself is more onerous in Experimental.

The area of private car route finding especially is one that might well benefit from threading. Currently, a very approximate system is used because it takes too much computational effort to find a route from each origin building to each destination building: instead, a route is found from each townhall road to each other townhall road (and each road connected to an industry or attraction not in a town), and the time that it takes to traverse that route approximated to give a time per straight line tile, which is then used to extrapolate the probable length of a private car trip from any individual point to any other individual point. This does not allow the simulation of actual traffic flow, which would be a very desirable thing to do, and even the system currently in use causes some noticeable effect on performance and has to be spread over the course of a game month to minimise the impact. Multi-threading alone may well not be enough here (although might well make a real difference): a more sophisticated approach may well be needed, involving the calculation of the route from each city building to each other city building in the same city and also each road tile on the city boundary, then calculating the route from each boundary tile to each other city boundary tile in a different city as a separate step, then, for each inter-city journey, concatenating the three routes (from origin city to boundary tile, from origin city boundary tile to destination city boundary tile, and from destination city boundary tile to destination city building). Private car graphics could then be assigned these as actual routes to drive, and the number of vehicles passing over tiles can be used to compute the congestion rather than a general statistical scheme at present based on the total number of private car trips in a town as against the number of kilometres of road.

Quote
In the end you end up with something which may allow you to run two times more ships, or more cars, and on 1.4*1.4 larger maps (i.e. going from 2048 x 2048 to 2850 x 2850). You get about the same gain by doing nothing and buying you computer a year or two later.

The point is rather that this will not hold true for very much longer. In about 5-6 years' time, according to current estimates, buying a computer a year or two later will not result in faster base performance (although other things, such as memory bandwidth or the number of cores may well increase; indeed, we might well see 16, 32 or even 64 core processors a decade from now).
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

ӔO

Intel did announce a 20th yr edition, fully unlocked Haswell Pentium (G3xxx) in the works.

Some guesses are 4.8Ghz overclocked and since these are dual core, it's very likely to be super easy to achieve that with fewer less than ideal transistors. It might be the ideal? CPU for a game like simutrans, where single core speed is preferred over multiple cores.

---

As for the ship paths on sea, if manually done, it could be the same as building roads on ground, except free to construct and use by anyone.
My Sketchup open project sources
various projects rolled up: http://dl.dropbox.com/u/17111233/Roll_up.rar

Colour safe chart:

Ters

Parallelizing route finding within a step is likely the easiest thing, as you don't alter anything but the route being generated, which nobody else cares about yet, and the world stays static for that part of the step. Anything which actually updates the world is more difficult, as all writes must potentially be synchronized between threads. As prissi points out, a complete rethinking with new data structures may be needed. The gains of doing things in parallel might easily be outweighed by the cost of synchronizing the results in the end or along the way.

kierongreen

Memory access is also a problem with parallel tasks as different tasks are accessing different sections of memory. This lowers the effectiveness of caching significantly. This, together the cost that Ters has aleady pointed out of syncronising can result in slower code execution.

jamespetts

Quote from: kierongreen on March 27, 2014, 07:55:29 PM
Memory access is also a problem with parallel tasks as different tasks are accessing different sections of memory. This lowers the effectiveness of caching significantly. This, together the cost that Ters has aleady pointed out of syncronising can result in slower code execution.

The underlying premises for my original post are that Simutrans is limited more by CPU power than memory bandwidth and that the growth of computer power in the future is likely to come from parellisation as before long it becomes impossible to shrink transistors any further, so base CPU power will not increase much after about 5-6 years from now (and perhaps even before then), whereas memory bandwidth may well increase, as that is not subject to the same inherent limitation to the same extent. As stated above, if we are able to move the limitation from CPU power to memory bandwidth, we will have achieved something worthwhile; certainly, whilst it might be hard to predict when the point will be crossed at which memory bandwidth becomes the dominant limiting factor, it does not seem to me to make sense to dismiss any attempt to deploy parellisation on the ground that memory bandwidth is also limited (unless, of course, tests have already shown it to fail to result in improvements in performance when deployed in the ways suggested); but I am not sure whether or not that is what you were suggesting. The individual proposals for paralleisation need to be considered on their own merits, I think, which individual consideration was one of the aims of having this thread in the first place.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

To make it clear: What I write is from my experience of Simutrans being I/O rather than CPU limited. That's the only world I know. The I/O problem was related to rendering, though, and cured with either a driver update or something along those lines.

kierongreen

I think james hopes that any IO issues will disappear in time with increased memory speeds. I'm not sure how likely this is.

In anycase, back to parallel tasks - most of the simutrans code is interdependent. Vehicles interact with each other at signals and crossings meaning that they always have to be processed in the same order. Passenger and goods routing is similarly dependent on routing that has already taken place so must also be processed in the same order. Failure to do so will result in desynchronisation.

Ters

Quote from: kierongreen on March 27, 2014, 09:09:37 PM
[...]same order. Failure to do so will result in desynchronisation.

True, the way networked games works adds another level of problem. Even if we do manage to get threads to produce a consistent result on one machine, that doesn't mean they produce the same consistent result on all participating machines.

jamespetts

I am well aware of the interdependency - that is why the original suggestions focussed on tasks that can be done in batches and committed to the game state in a single thread. This would be relevant to algorithms that are mainly CPU limited and that can be broken down into independent batches - I suggest that passenger generation, vehicle route finding and private car route finding are possible examples of things that meet all three of those criteria.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

You could decouple vehicles types (road, etc) by freezing crossing states at the beginning of a sync_step, give each their own random counter and process those waytypes together but in fixed order for one waytype. This is quite feasible. However, 80% is eaten up by road traffic on most maps (most convois, citycars, larger networks).

Route search is cheap for trains as there is never a very large network and never ever no routes. The only time I see significant contribution of route search is with ships or broken routes. On the other hand, lines could cache the default route for a waytype, until a tile is altered with said waytype (i.e. land is raise etc.) Hence a vehicle would first ask the line for its route, and if unknown (first vehicle on that line, invalid etc) calculates its route and give it to the line.

You can tried to parallize the waytypes, and the passenger generations in cities. The latter is also quite easy in standard, if a special deterministic random per city is used (making the mersenne twister a class and use one per city, initialised on loading). Instead on one counter one would need citynum[array] counts for waysearch to have a routing in parallel. Routing in Standard can easily give also the second best route (which is probably just changing a station earlier or later, computer are dumb ... ).

The same concurrent route counters could be used for a parallel routing after arrival at stops. Although with the joining and so on it has become hard to saturate standard on todays computer CPU wise. On large maps, I rather run into bandwidth issues.

Private car route finding is an experimental problem, so I do not have any advice on that (apart form don't do it ...)

Also, larger maps in Simutrans are often also bandwidth limited. Accessing 2 GB will invalid any cache there is; also the elements sync step lists are not in the same place in memory. Accessing them in parallel will even worsen caching, as larger caches tend to be joined (and hence less efficient).

Ters

Quote from: prissi on March 27, 2014, 10:10:22 PM
You can tried to parallize the waytypes, and the passenger generations in cities. The latter is also quite easy in standard, if a special deterministic random per city is used (making the mersenne twister a class and use one per city, initialised on loading). Instead on one counter one would need citynum[array] counts for waysearch to have a routing in parallel. Routing in Standard can easily give also the second best route (which is probably just changing a station earlier or later, computer are dumb ... ).

Isn't passenger generation dependent on there actually being a route? Most importantly there must be free capacity at the first stop. If two different threads generate passengers within the catchment area of the same stop, and that stop is one passenger away from being full, there will be a race as to which thread manages to generate its passenger first to fill that slot. The result of this race will be different on different computers, as well as when the same computer tries to reproduce something from a save game. As long as overcrowded stops along the way is ignored, one might solve it somehow, although with some processing overhead. If passengers are to avoid overcrowding at all costs, I can't see how this can be parallelized, at least without making it slower.

prissi

You probably a lock for each station. But those cases (overlapping cities) are not that common. I though of one thread per city. Make it one thread per connected cityarea+stations coverage, and you are done. Or maybe one thread per 256x256 areas on the map. Similarily factory production can be parallellize in chunks of the same size.

Apart from this effort in step(), all the other efforts (seperating waytypes etc.) would not scale with map size.

jamespetts

The problem to which Ters refers could be solved by not actually adding the passengers to the stop during the parallel phase, but storing them separately ready to be added, and doing the actual adding in the single threaded commit phase after the parallel generation phase. The single threaded commit phase would not refuse to add passengers to a stop that is overcrowded; however the parallel phase would not generate passengers to be added at that stop unless it is not already overcrowded. This might result in a limited number of passengers being added to a stop beyond its overcrowding limit, but I do not think that a somewhat flexible approach to overcrowding is unrealistic or problematic.

As to having one thread per city - this would not work in Experimental, where the latest branch combines all passenger generation into karte_t, although it would probably work well in Standard as Prissi describes. In Experimental, there would have to be batches of passengers generated in parallel, each with their own thread and, as Prissi pointed out earlier, their own random seed.

As to factory production, how computationally intensive is that?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

Quote from: jamespetts on March 28, 2014, 11:05:47 AM
The problem to which Ters refers could be solved by not actually adding the passengers to the stop during the parallel phase, but storing them separately ready to be added, and doing the actual adding in the single threaded commit phase after the parallel generation phase. The single threaded commit phase would not refuse to add passengers to a stop that is overcrowded; however the parallel phase would not generate passengers to be added at that stop unless it is not already overcrowded. This might result in a limited number of passengers being added to a stop beyond its overcrowding limit, but I do not think that a somewhat flexible approach to overcrowding is unrealistic or problematic.

Your into what I meant when I wrote "at all costs". If the game is less strict about routing over overcrowded stations, one does not need a single thread to finish up. It's enough to know how full the stops were at the beginning. This can be done in parallel as a prelude. In all cases, one must synchronize on updating a station, or spend much time, possibly in vain, trying to divide the world so that each station fully belongs to one thread. Alternatively move passenger generation from buildings to stops, but that changes a lot.

jamespetts

Synchronising on updating a station was the reason that I suggested actually committing the results in the main thread at the end of the parallel threads' tasks...
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ters

Quote from: jamespetts on March 28, 2014, 06:35:05 PM
Synchronising on updating a station was the reason that I suggested actually committing the results in the main thread at the end of the parallel threads' tasks...

Then each thread needs its own copy of every station's inventory, not necessarily with the existing contents, but it will be a similar data structure. Synchronization might be cheaper, depending on how the threads divide the world between them. There might be relatively few stations falling in under the authority of different threads.

prissi

A centralized generation of passengers which are then parallized again does not sound like a sensible division of data.

As I said, the aim is to scaled the game speed parallelism with map size and convoi number. Hence the suggestion to have regions of the map (like 256x256 tiles) which are in parallel and a queue for everything which traverses between these regions. Hence all passenger generation, productions, moving of vehicles etc. in these areas is done in parallel without the need to lock the map. If one is ok with the overhead in the route search, also each region get the own route array. As a result the game should run as fast as a 256x256 (plus overhead).

This approach would scale nicely until the number of cores equals the number of tiles. I will be tedious (also not so difficult) to distribute stuff in the right region (i.e. sync_list etc for each region). A little tougher might be a synchronized execution, i.e. in network games same area sizes are needed. Most difficult is the correct treatment of moving between region, imho.

Ters

Dividing the game into regions won't really help, because vehicles will inevitably want to move from one region to the next. However, a vehicle can only move from one tile to the next if that tile is free. Whether that tile is free is subject to race conditions, as the other thread may or may not have processed movement in that area yet. For a single player game, this isn't that much of a problem, but for multiplayer, desynchronizatin will occur sooner or later, unless the threads are forced into lockstep somehow.

TurfIt

Quote from: prissi on March 29, 2014, 06:50:17 PM
Hence the suggestion to have regions of the map (like 256x256 tiles) which are in parallel and a queue for everything which traverses between these regions. Hence all passenger generation, productions, moving of vehicles etc. in these areas is done in parallel without the need to lock the map.
This is exactly (well close enough) what I've done with my work to thread sync_step(). It's still very much a work in progress however... And I've not really looked too in-depth at threading step() yet.

I'm not sure where the idea Simutrans is CPU bound comes from, although I guess that depends on the definition of CPU bound. All my testing points to the CPU cores being stalled waiting on data, and multithreading doesn't help too much in that case. It does help hide the memory access latency a bit given the ineffectiveness of the hardware prefetchers at handling the access patterns of Simutrans. But you're only talking ~15% improvement. Using explicit prefetch instructions I can get a ~20% improvement, but there's no way I can make this portable between even different CPUs let alone OSs. But, rearranging the data to align with Simutrans use of it, I'm seeing a ~15-30% gain. Double what multithreading got, hence I've been working at doing that first.


Quote from: Ters on March 29, 2014, 07:08:43 PM
Dividing the game into regions won't really help, because vehicles will inevitably want to move from one region to the next.
Vehicles can only move so far each step... 

Ters

Quote from: TurfIt on March 29, 2014, 07:16:59 PM
Vehicles can only move so far each step... 

It's enough that they can cross the tile boundary separating one region from the next. And the effect isn't confined to the boundary region, because if a vehicle can't cross the boundary, it won't free up the tile it came from. This in turn affects other vehicles wanting to pass into that tile and so on.

TurfIt

So you need more regions than threads. Threads independently process regions that are always 3 apart, and regions are sized so that a vehicle can never move father than one region.

kierongreen

It's not just vehicles moving into different regions though, it's vehicles reserving ways across different regions that might be an issue.

TurfIt

Yes that is a problem. Thankfully it's relatively rare, and hence I just defer them to a single threaded cleanup.

prissi

If one is worrying over access pattern, then I am not sure much can be improved: The goods, and vehikels have already their own memory management (by size), which keeps them as close as possible.

About vehicles leaving regions: Those could be transferred to a semaphore protected list. This list is then handled after everything else is handled. (And on this lists they are moved back into their oncoming region, or they stay there, if they fail to move to an adjacent tile).

Step would handle to regions similarely. Actually the climate change is already handled in regions. But also productions, passenger generation routing etc. could be done regionwise.

Ters

Quote from: TurfIt on March 29, 2014, 08:24:24 PM
So you need more regions than threads. Threads independently process regions that are always 3 apart, and regions are sized so that a vehicle can never move father than one region.

If a tile at the edge of region A has vehicles attempting to enter it from both region A and region B, how is it supposed to prioritize which vehicles enter that tile when it's free? If there is a steady stream of vehicles, processing of region A will only fill that free tile with vehicles from region A, since vehicles in B won't be processed until A is finished. Simutrans currently seems to have some sort of fairness, whether accidental or intentional. Processing A or B first alternately won't help if the time it takes for vehicles to clear the tile is an integer multiple of this frequency.

Quote from: prissi on March 29, 2014, 09:46:03 PM
About vehicles leaving regions: Those could be transferred to a semaphore protected list. This list is then handled after everything else is handled. (And on this lists they are moved back into their oncoming region, or they stay there, if they fail to move to an adjacent tile).

Then you can't know if the tile is free until the next tick. And there is a different variant of the problem above: If vehicles crossing between regions are processed after everything else, they won't have a chance entering busy tiles. This list must also be explicity ordered somehow, as threads will add vehicles into this list in different orders on different machines (or otherwise repeated runs from the same state).

kierongreen

QuoteThis list must also be explicity ordered somehow
Meaning even more overhead...