Long-term thoughts on multi-threading in Standard and Experimental

Started by jamespetts, March 27, 2014, 01:41:09 AM

TurfIt

Quote from: Ters on March 29, 2014, 10:01:14 PM
If a tile at the edge of region A has vehicles attempting to enter it from both region A and region B, how is it supposed to prioritize which vehicles enter that tile when it's free? If there is a steady stream of vehicles, processing of region A will only fill that free tile with vehicles from region A, since vehicles in B won't be processed until A is finished. Simutrans currently seems to have some sort of fairness, whether accidental or intentional. Processing A or B first alternately won't help if the time it takes for vehicles to clear the tile is an integer multiple of this frequency.
There is currently no built-in fairness/prioritization. The first convoi to be processed after the tile becomes free takes it. The order convois are processed in depends on the order they were created. In the region setup, if the convoi in A waiting for the tile is processed after the convoi leaving the tile, then it will always take it. If the A waiting convoi is processed before the A vacating convoi, then the B waiting convoi will take it. There may be a slight bias towards preferring A when there are multiple A waiters, but I don't see any real issue here.

EDIT: missed this.
Quote from: prissi on March 29, 2014, 09:46:03 PM
If one is worrying over access patterns, then I am not sure much can be improved: the goods and vehikels already have their own memory management (by size), which keeps them as close as possible.
There's quite a bit to be gained fixing the terrible memory access patterns that now take place. Many objects can be shrunk a bit by reordering the members to fill the padding gaps. Some members can be changed to smaller data types too. Reorder the members so items accessed at the same general time are on the same cache line. Then allocate the objects packed together in the order they're iterated across. And once done convoi_t::sync_step() ends up 30% faster. I'm sure there's another 10% or so to be had as well, I just need to win the battle I'm having with C++ not letting me do what I want... bloody type safety - it's bits'n'bytes, let me at em!

Ters

Quote from: TurfIt on March 29, 2014, 10:48:18 PM
There is currently no built-in fairness/prioritization. The first convoi to be processed after the tile becomes free takes it. The order convois are processed in depends on the order they were created. In the region setup, if the convoi in A waiting for the tile is processed after the convoi leaving the tile, then it will always take it. If the A waiting convoi is processed before the A vacating convoi, then the B waiting convoi will take it. There may be a slight bias towards preferring A when there are multiple A waiters, but I don't see any real issue here.

The issue is that junctions at region boundaries will, without explanation (in the eyes of a regular player), give priority in certain directions, whereas all the others don't. But I see that the time at which the vacating vehicle is processed makes it less of an issue than I first thought.

Quote from: TurfIt on March 29, 2014, 10:48:18 PM
There's quite a bit to be gained fixing the terrible memory access patterns that now take place. Many objects can be shrunk a bit by reordering the members to fill the padding gaps. Some members can be changed to smaller data types too. Reorder the members so items accessed at the same general time are on the same cache line. Then allocate the objects packed together in the order they're iterated across. And once done convoi_t::sync_step() ends up 30% faster. I'm sure there's another 10% or so to be had as well, I just need to win the battle I'm having with C++ not letting me do what I want... bloody type safety - it's bits'n'bytes, let me at em!

I also think there is much to be gained here. And C++ does let you do what you want; you just have to say the magic words. Most other languages won't let you do that. The biggest problem is that different compilers don't lay out the bits and bytes in the same way. Arrays of polymorphic types, rather than arrays of pointers to such, are probably the most difficult thing to do.

Dwachs

What strikes me most when profiling is that karte_t::sync_step itself takes 20% of processing (self-) time (without graphics). It only has to process a list, resolve virtual functions, and release some objects.

Turfit, do you have any work-in-progress patches?

TurfIt

Quote from: Ters on March 30, 2014, 09:21:33 AM
The issue is that junctions at region boundaries will, without explanation (in the eyes of a regular player), give priority in certain directions, whereas all the others don't. But I see that the time at which the vacating vehicle is processed makes it less of an issue than I first thought.
Worst case, a vehicle could get stuck for a while, but then it can with the current single threaded routine too. That's what traffic lights are for!


Quote from: Ters on March 30, 2014, 09:21:33 AM
I also think there is much to be gained here. And C++ does let you do what you want; you just have to say the magic words. Most other languages won't let you do that. The biggest problem is that different compilers don't lay out the bits and bytes in the same way. Arrays of polymorphic types, rather than arrays of pointers to such, are probably the most difficult thing to do.
So far my incantations have given my compiler a sense of humor. I particularly liked when it claimed "error: vector_tpl is not a template" when I hadn't gone anywhere near modifying vector_tpl, and it was very much a template the day before.

If you can point me towards the magic to enable arrays of polymorphic objects embedded in other objects that are themselves heap placement-allocated arrays of objects, that would be great. I was attempting to experiment with just that - sticking a vehicle or two right inside the convoi object.


Quote from: Dwachs on March 30, 2014, 04:31:26 PM
What strikes me most when profiling is that karte_t::sync_step itself takes 20% of processing (self-) time (without graphics). It only has to process a list, resolve virtual functions, and release some objects.
Processing the lists themselves is quite time consuming. There are many, many adds/deletes every frame. Splitting them into more-or-less-static vs heavily-modified lists helps greatly, and since I ended up with 5 lists, it makes sense to just go right to 8 and have a separate list per sync-stepped object type. Combine this with a memory allocation slab per object to pack them in tightly, order the list as per the allocation, and performance++. Just needs a little tweaking to work for multiplayer though...


Quote from: Dwachs on March 30, 2014, 04:31:26 PM
Turfit, do you have any work-in-progress patches?
Not really; it's quite the meandering experimental mess. It's been 8-10 months since I last really looked at it. I planned to return to it, but got distracted (as usual) mucking about with Experimental's network game woes. Perhaps next up.

Ters

Quote from: TurfIt on March 30, 2014, 05:07:36 PM
So far my incantations have given my compiler a sense of humor. I particularly liked when it claimed "error: vector_tpl is not a template" when I hadn't gone anywhere near modifying vector_tpl, and it was very much a template the day before.

If you can point me towards the magic to enable arrays of polymorphic objects embedded in other objects that are themselves heap placement-allocated arrays of objects, that would be great. I was attempting to experiment with just that - sticking a vehicle or two right inside the convoi object.

Modifying vector_tpl is exactly what you have to do. Or rather make something like vector_tpl, but with the necessary hacks. It is a bit unsafe, as one has to explicitly specify the size of the largest polymorphic type at compile time.

There is one aspect I have no solution for, and that is compacting the array when an element is removed in the middle. This is, however, a problem anyway if there are references (either pointers or indices) to the object from elsewhere. And this is perhaps the biggest problem with putting objects, rather than pointers to objects, into arrays.

jamespetts

Returning for a moment to passenger generation, one thing that strikes me is that Experimental may benefit rather more from multi-threading this part of the code than Standard, on account of Experimental's multiple destinations feature, which multiplies the number of runs through the destination-checking code by potentially a great many times per packet of passengers. The code for committing the results to the world is actually already somewhat separated from the code for generating the passengers (at least, it is in the more recent versions, the latest of which is currently on the way-improvements branch); all that would be required is a means of separating these things into two methods and transferring the data between them in order to achieve the full separation necessary for multi-threading. The amount of CPU time taken by this code is noticeable, as increasing the number of alternative destinations substantially reduces the fast forward speed on a large map, so this might achieve a worthwhile result.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Why one passenger to multiple destinations? I could understand multiple routes, but the thing about Simutrans is that passengers do not choose their destination based on availability. If there is no route, well, no travel. I wonder why change this?

jamespetts

Ahh, the alternative destinations feature has been with Simutrans-Experimental for a while, although the as-yet-unreleased overhaul of the passenger generation system has changed its scope somewhat.

The basic idea is that passengers make trips for all sorts of reasons: some to visit relatives, some to have a walk in the park, some to go to work, some to go to the shops to buy bread, and so on. Each of those destination types might or might not have more than one actual destination: a trip to see relatives, for example, has only one possible destination, whereas a trip to go to the shops has many possible destinations, and people will generally choose whichever is closest. As to trips to work, although people work in one specific place, people tend to choose to work in places near where they live and will not apply for jobs too far away.

To simulate this, passengers in Simutrans-Experimental have between zero and n alternative destinations (n being defined in simuconf.tab). The actual destination is picked at weighted random from the list of buildings (which is, in the latest unreleased version, stored in the world rather than in cities), and the algorithm checks whether any possible means of getting there (walking, player transport starting from any stop whose coverage extends over the starting building, or private car, if these passengers are deemed to have access to one) takes little enough time to be within that packet of passengers' journey time tolerance, a figure set at random between a minimum and maximum specified in simuconf.tab for every packet of passengers generated. If the passengers can reach their destination within the journey time tolerance, they take whichever of the means available to them (walking, player transport or private car) takes the least time overall. If they cannot reach that destination within their journey time tolerance, or at all, the algorithm finds another destination and runs the same check on that, until the number of alternative destinations has been exhausted or the passengers have found a satisfactory route to their destination, whichever comes first. It is this repeated checking that could potentially be parallelised by performing it on several packets of passengers at once, as it is considerably more computationally intensive, I believe, than setting the passengers on their way by adding them to their origin stop.

Edit: Incidentally, does Simutrans use thread pooling? I do wonder whether there might be issues with creating and destroying threads very quickly as this sort of feature might require.

prissi

About multiple destinations: This would change very little (if anything) for a player, since the destinations are random anyway; or rather, it would add some more toyish behaviour, as in OpenTTD, where passengers are generated anyway. Incidentally, I wrote a patch for OpenTTD to achieve this. Each stop had a list of all connected stops within a certain depth (or journey time, if you like). This consumes quite some memory, but with today's computers, memory is more available than CPU.

Also: your approach is much more useful for routing as in Standard. Since you can store the mean travel time for each leg between stations in a tree, you can quickly search until the limit on time/distance/transfers, or whatever the weight is, is reached.

Back to sync_step:

I played around a lot with different lists, vectors, hashtables, ... in sync_step. First, building up and deleting the sync_step lists takes some time (especially for vectors). For meaningful profiling, some months of fast forward are needed.

There are lists which change almost constantly (smoke, pedestrians). Since the processing order does not matter here, but adding and deleting happen frequently, they use a hashtable. That is why the eyecandy (buildings, smoke) is a ptrhashtable.

I had to brush up the old code a bit. Use r7124 for trials. I found vector_tpl the fastest, although each sync_step it needs to copy the entire vector. However, I could only profile for a few game days before it froze. Something is really messed up with the profiling with my GCC. Since it works with MSVC, I suspect a problem with my GCC (4.6.2) and the profiler.

kierongreen

Quote
Incidentally, does Simutrans use thread pooling? I do wonder whether there might be issues with creating and destroying threads very quickly as this sort of feature might require.
Certain sections of the multithreaded code (display, for example) do have threads which just idle when inactive, yes. Originally, threads were created and destroyed every time they were called, but this was very slow on some platforms.

jamespetts

Prissi - the alternative destinations feature does make quite a difference to the player, as the destinations to which the passengers actually go (taking into account the journey time as described above) are not totally random: instead, the passengers will tend to go more often to destinations closer to home, but will sometimes (when their journey time tolerance is high and they pick a far-away destination) travel further afield. This reflects reality, in which most journeys are local: in real life, one is far more likely to travel to the next town than to the other side of the country. The consequential effect of this is a tempering of the network effect of connecting additional towns to a passenger network: in Standard, the number of passengers increases exponentially when this is done, whereas, in Experimental, the alternative destinations feature makes this much less so, which is more realistic.

Kieron - thank you for that insight: that is useful. I suppose that thread pooling would not be too hard in this case: just have a fixed maximum number of threads for this work and generate them when the game starts, each with their own random seed.

Edit: Looking into the code, one thing that would complicate the process of multi-threading passenger generation is the onward trip algorithm, in which calculating the onward trip is closely woven with committing the generated results to the world.

TurfIt

Quote from: jamespetts on March 30, 2014, 06:55:57 PM
The amount of CPU time taken by this code is noticeable, as increasing the number of alternative destinations substantially reduces the fast forward speed on a large map, so this might achieve a worthwhile result.
But what is the bottleneck in the CPU for that routine? Until you know that, you can't know where to fix. Multithreading on a multicore processor brings more execution units and more L1/L2 cache into play. If your algorithm isn't being limited by those, then multithreading isn't going to gain you much.


Quote from: prissi on March 30, 2014, 09:52:37 PM
This consumes quite some memory, but with today's computers, memory is more available than CPU.
I'd argue the opposite. Sure, lots of memory is available, but actually using it brings things to a halt. In the 386 days, running out of memory and waiting for a page to swap in from the HDD was painful. Now, overflowing the L3 cache and waiting for a memory read is just as painful, and should you actually page swap - unbearable.


Quote from: prissi on March 30, 2014, 09:52:37 PM
I had to brush up the old code a bit. Use r7124 for trials. I found vector_tpl the fastest, although each sync_step it needs to copy the entire vector. However, I could only profile for a few game days before it froze. Something is really messed up with the profiling with my GCC. Since it works with MSVC, I suspect a problem with my GCC (4.6.2) and the profiler.
Using vector_tpl for everything is the fastest in my tests as well. I also modified vector_tpl to have the swap_erase method we'd talked about before (1 or 2 years ago, I think), and use that instead of copying the vector each cycle. Anyways, if everyone is happy with the state of the double height code, then instead of looking into it once I finish up with Experimental, I'll try to get my sync_step patch into a presentable shape (minus the multithreading aspect, as I think that'll still take some months).

I gave up on statistical profiling for Simutrans long ago; I just can't get consistent enough results between runs. When everything varies by 30+% every time, you can't draw any conclusions. Instead I've been embedding explicit timing checks into the code being tested - much better results that way.

jamespetts

Quote from: TurfIt on March 30, 2014, 11:55:16 PM
But what is the bottleneck in the CPU for that routine? Until you know that, you can't know where to fix. Multithreading on a multicore processor brings more execution units and more L1/L2 cache into play. If your algorithm isn't being limited by those, then multithreading isn't going to gain you much.

How would one test this? Simple profiling will tell one how long this function is taking, but not where the performance limit is.

Markohs

I just read through this thread lightly, so I might have missed some comments.

But I think, as prissi expressed, a much more in-depth refactoring of the memory structures and code is needed to scale the code to multiple threads.

Imho, if the code is to scale well, we'd have to think about more than 4 threads. And that would require considering the whole world static within a cycle, with read-only access to the world by all threads; those threads would generate a list of changes that need to be applied to the world. A transaction manager would apply the changes sequentially at periodic points, as milestones. It can reject the changes that would break the game's coherence (two trains on the same track, for example), so they will just be re-calculated on demand in the next cycles.

That's the only way the game can scale well, imho, and it's the way to go. It's quite a lot of work.

You can't have mutual exclusions and hope the game will perform well; you have to remove all dependencies and exclusions, and resolve them later. What you can use is a heuristic approach to keep the number of "rollbacks/rejects" to a minimum.

jamespetts

That would really be a vast amount of work - years, possibly. I wonder what sort of performance gain that might enable compared to a more narrowly focussed parallelisation of specific algorithms in the way discussed above?

Markohs

The moment you need mutual exclusion and synchronisation when accessing the world structures, you commit yourself to not scaling much beyond 3-4 threads, imho. Seeing how current computers (even portable devices) won't scale much in terms of single-thread speed, but rather in memory bandwidth and number of cores, this option has no future.

I'd choose an option that has a future, not a temporary increase in performance that I doubt will be enough of a boost to justify itself, *and* with the amount of complexity it's going to add to the code (with the added maintenance). In fact, even the current threaded display code is not performing well enough, imho, especially if we take into account the complexity it added to the code.

Of course, maybe a hybrid implementation, or a pure MP implementation with heuristics, can be enough; or *maybe* dividing the world into chunks and post-processing the interactions between the boundaries.

It's hard to say; I'm just guessing here. Maybe my option doesn't scale well enough when you have fewer than 4 cores, and it's not good enough.

jamespetts

Interesting analysis - scaling to 3-4 threads can potentially make significant gains if applied in the right places. It is, I suppose, a matter of working out what the main limitations on performance are now and will be in the future and carrying out a cost-benefit analysis for the three broad options: (1) no significant multi-threading of the simulation code (as distinct from, say, graphics or load/save); (2) limited multi-threading as discussed previously in this thread; and (3) the total rewrite necessary for fully integrated multi-threading as discussed here. These are difficult choices indeed.

Ters

Quote from: jamespetts on March 31, 2014, 11:42:13 AM
That would really be a vast amount of work - years, possibly. I wonder what sort of performance gain that might enable compared to a more narrowly focussed parallelisation of specific algorithms in the way discussed above?

Which algorithms? Doing parallelization internally in an algorithm may run into too-high start-up and merge costs. At work, we recently did some fork-join on a pathfinding-ish operation, and while we were suddenly able to handle bigger maps, memory consumption went through the roof, as each subtask had to have a copy of the preliminary result.

jamespetts

Memory consumption per se has traditionally been one of Simutrans's least significant limiting factors, behind CPU use and memory bandwidth. As to which algorithms - this has been discussed earlier in the thread. It is, of course, important for parallelisation to be used in areas where the computational cost of producing a dataset for subsequent merging significantly exceeds the computational cost of the merging itself; but I think that there are some areas, in Experimental at least, where this is so: passenger generation, private car route finding, and possibly certain parts of the goods route-finding, namely the initial construction of the table of routes and times from stops to directly connected stops.

prissi

Implicit parallelisation does not really work well on real-life problems beyond simple loops. There are more than 15 years of research into it.

I have somewhere a patch to use OpenMP instead of threads for the display. It was even slower than single-threaded, and crashed occasionally...

About passenger generation: With the amount of passengers generated and transported, I highly doubt the player will notice a passenger going closer or farther; statistically, there is a very good chance that on a well-developed map it will even out. I am not sure of the implementation in Experimental, but in Standard the distance is taken into account, depending on the simuconf.tab setting; i.e. in earlier years passengers already travel shorter distances than in later years.

I would not add CPU there. Rather, fix the amount of time spent in the physics module.

jamespetts

Thank you for your helpful input. By "implicit parallelisation", do you mean the sort of parallelisation to which Markohs was referring?

As to passenger generation, the current release version of Experimental uses distance ranges and a far lower number of alternative destinations. However, this has led to significant problems: in early eras, passengers are completely unwilling to take the stage coach, which might take many hours to make a short journey, as the range of journey time tolerances associated with local or medium-range trips is well below the stage coach's travelling time. Research shows that people's destinations are selected in large part in consequence of the travelling time to those destinations, which is the way in which the latest (unreleased) development version of Experimental works.

As to the time spent in the physics module, Bernd has done some optimisation work recently, but I am not sure how significant the results are. The ultimate difficulty is the inability to use native floating point types, because they cannot be synchronised reliably between different platforms in network use, and Bernd had to spend months writing an integer-based faux floating point class which, whilst reliable, is much slower than native floating point. I am not familiar with the inner workings of the physics code, but if you (or anyone else) have any tips for Bernd on how to make it any faster, then I shall gladly pass them on to him.