Multi-threading in simulation code

Started by jamespetts, October 24, 2016, 07:42:45 PM


jamespetts

I have finally found a way of making use of multi-threading in actual simulation code in a way that, at least in Experimental, makes a non-trivial difference to performance.

As is well known, what can be achieved with multi-threading in Simutrans is very limited because of the need for client and server to stay in sync in a network game, and multi-threading is non-deterministic in many significant ways. Until now, only graphics and loading/saving have been multi-threaded, both of them inherited from Standard. Whilst these two areas are very worthwhile, Experimental demands rather more work from the CPU, and it is worth seeing whether other parts of the code can be multi-threaded.

One part of the code that profiling in Visual Studio 2015 showed to take up a significant amount of CPU time in devel-new-2 was the unreserve_route() function, which is called whenever a train stops and iterates over all the way tiles in the game, unreserving every tile reserved by the current convoy. This was done to avoid leaving arbitrary tiles reserved, as was sometimes happening after the implementation of the new signalling system.

What I have done is to break down this task into chunks and put each of these chunks into its own thread. The threads are set running in parallel, and the game will only continue past the end of the unreserve_route() function that creates the threads once all the threads have finished their work. Because unreserving any one tile does not in any way affect unreserving any other tile in any circumstances, doing this in parallel is network safe. I have tested this with a network client/server on a loopback interface (both running Windows; the desync bug with Linux servers and Windows clients is a different matter), and synchronisation is maintained. For anyone interested, the commit on Github with the relevant code is here.
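To illustrate the pattern in outline, here is a minimal, self-contained sketch of splitting such a task into per-thread chunks with pthreads. The names (tile_t, unreserve_tile(), unreserve_route_parallel()) are illustrative stand-ins, not the actual Simutrans-Experimental code, which works on the real way-tile structures.

#include <pthread.h>
#include <algorithm>
#include <vector>

// Illustrative stand-ins for the real way tile data and the per-tile work.
struct tile_t { int reserved_by; };

static void unreserve_tile(tile_t &t, int convoy_id)
{
   if (t.reserved_by == convoy_id)
   {
      t.reserved_by = -1;
   }
}

struct chunk_args_t
{
   tile_t *tiles;
   size_t begin, end; // Half-open range of tiles handled by this thread
   int convoy_id;
};

static void *unreserve_chunk(void *arg)
{
   chunk_args_t *a = static_cast<chunk_args_t*>(arg);
   for (size_t i = a->begin; i < a->end; ++i)
   {
      unreserve_tile(a->tiles[i], a->convoy_id); // No tile depends on any other tile
   }
   return NULL;
}

// Split the tiles into one chunk per thread, run the chunks in parallel, and only
// return once every chunk has finished (pthread_join acts as the synchronisation point).
static void unreserve_route_parallel(std::vector<tile_t> &tiles, int convoy_id, int num_threads)
{
   std::vector<pthread_t> threads(num_threads);
   std::vector<chunk_args_t> args(num_threads);
   const size_t chunk = (tiles.size() + num_threads - 1) / num_threads;

   for (int t = 0; t < num_threads; ++t)
   {
      args[t].tiles = tiles.data();
      args[t].begin = std::min(tiles.size(), (size_t)t * chunk);
      args[t].end = std::min(tiles.size(), (size_t)(t + 1) * chunk);
      args[t].convoy_id = convoy_id;
      pthread_create(&threads[t], NULL, unreserve_chunk, &args[t]);
   }
   for (int t = 0; t < num_threads; ++t)
   {
      pthread_join(threads[t], NULL); // Do not continue until all chunks are done
   }
}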

Profiling tells me that over 6% of total CPU time (excluding loading/saving) is now consumed by this new set of threads, meaning that there is a potential for a non-trivial performance improvement on multi-core systems.

There might be potential to multi-thread more discrete operations of this sort. I am considering applying similar logic to the path explorer (running several categories of goods at the same time, as the routing for any one category of goods should not affect any other in any way), which takes a significant amount of computational resources when it runs. This will be more complex, as significant chunks of the path explorer code assume single threading, although I believe that these can be modified without affecting the underlying logic of the system.

I have also been working on some other performance optimisations which should improve some performance issues that had been introduced since the last release version of Experimental.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

DrSuperGood

Quote
One part of the code that profiling in Visual Studio 2015 showed to take up a significant amount of CPU time in devel-new-2 was the unreserve_route() function, which is called whenever a train stops and iterates over all the way tiles in the game, unreserving every tile reserved by the current convoy. This was done to avoid leaving arbitrary tiles reserved, as was sometimes happening after the implementation of the new signalling system.
I think you just found an optimization for a hacky fix...

I strongly recommend finding a more efficient way than iterating every way tile in the game every time a train stops. No matter how much you optimize, the operation is still O(n) with way tile number so will never scale well.

jamespetts

Hacky fix or not, it is actually useful to see that multi-threading can work in simulation code in certain defined circumstances. The path explorer is certainly not a hacky fix, and takes considerable computational resources, so this will be a worthwhile improvement.

jamespetts

I have been progressing with multi-threading to more useful areas, outside the realms of things that might be described as hackish fixes.

Have a look at the latest commits on the devel-new-2 branch (for those reading this later, look for commits around the 29th/30th of October 2016).

I have now multi-threaded the private car route finder. The private car route finder is the function that finds road routes for private cars between every pair of cities in the game. In very large maps with complex road networks, this can be very computationally intensive, so optimising it is worthwhile.

Unlike the route unreserver, which is a purely parallel multi-threading implementation, the private car route finder multi-threading is both parallel and concurrent. It processes a set number of cities in parallel in each step in which the private car route finder is called and allows them to work in the background whilst selected other parts of karte_t::step() run in the main thread. There is a thread barrier to make sure that they are not still running when the part of karte_t::step() that actually makes use of the data in the private car routes (i.e., the generation of mail and passengers, which in Simutrans-Experimental has moved from the city objects to the world object) is called, but the convoys and the cities are now stepped concurrently with private car route calculation.
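As a rough illustration of this parallel-plus-concurrent pattern (a simplified sketch only, not the actual implementation, which uses the game's barrier wrappers and a pool of workers): a worker thread calculates routes in the background while the main thread carries on with unrelated stepping, and a barrier guarantees that the worker has finished before the data it produces are consumed.

#include <pthread.h>

// One worker thread plus the main thread meet at this barrier.
static pthread_barrier_t route_barrier;

static void *find_private_car_routes(void *)
{
   // ... background route calculation for a batch of cities would go here ...
   pthread_barrier_wait(&route_barrier); // Signal: this batch is complete
   return NULL;
}

int main()
{
   pthread_barrier_init(&route_barrier, NULL, 2); // 1 worker + the main thread

   pthread_t worker;
   pthread_create(&worker, NULL, find_private_car_routes, NULL);

   // The main thread continues with work that neither reads nor writes the
   // private car route data (e.g. stepping convoys and cities).

   // Before anything uses the route data (e.g. passenger generation), wait for the worker.
   pthread_barrier_wait(&route_barrier);

   pthread_join(worker, NULL);
   pthread_barrier_destroy(&route_barrier);
   return 0;
}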

I have tested this for network safety, and, with a Windows client and Windows server on the loopback interface, it remains connected. I know this to be an effective test scenario because the client and server did not stay connected when I had failed to deal properly with a static variable relating to the path-finding in an earlier attempt to make this work.

This implementation also re-uses a fixed pool of threads rather than creating new threads on demand, which is much more efficient, especially in relation to initialising the node data in the route finding algorithm, which now only has to be done on creating/loading a map, as in the single threaded implementation.

One thing that has helped enormously (especially with all the static variables in route_t) is the thread_local qualifier: although this is part of the C++11 standard, it is also implemented in pthreads, so it should be fully compatible with all Simutrans platforms. Because it can be omitted without any other change of syntax in a single-threaded build, the keyword can simply be defined away with a preprocessor directive in the event that the MULTI_THREAD flag is not present.
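The compile-out trick might look something like this (a sketch only; the actual macro name and header layout in the codebase may differ):

// In a common header: in a build without MULTI_THREAD, the qualifier simply
// disappears and the variable reverts to an ordinary static.
#ifdef MULTI_THREAD
#define THREAD_LOCAL thread_local
#else
#define THREAD_LOCAL
#endif

// Usage: one copy per thread in a multi-threaded build, a single static otherwise.
static THREAD_LOCAL unsigned int route_node_count = 0;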

This gives me good reason to think that rather more of the core simulation code of Simutrans can be multi-threaded than had previously been thought. The thread_local qualifier should make it much easier to multi-thread the path explorer (which also uses many static variables) than I had previously envisaged, and that should yield a noticeable performance improvement. Anywhere where the data modified by one part of karte_t::step() are not modified or accessed by anything in karte_t::sync_step() or another specific part of karte_t::step() is a prime candidate for concurrent multi-threading. In many cases, only concurrent multi-threading (where a call to sync_step() in the main thread is made whilst the background threads are working) can really improve the perception of responsiveness. In some cases, where it is not possible to use concurrent multi-threading, parallel multi-threading might still be of value if, for example, a single operation or set of operations must be performed on a large set of data, the outcome of any one operation does not depend on the outcome of another, and no data are shared. The unreserving of routes is a good example (this is not exclusive to Experimental, although Standard uses it less; on a large map, however, it might still cause a momentary loss of responsiveness when, for example, a convoy enters a depot). Another good example is where companies are liquidated, and the game has to iterate through every single tile of way to remove those belonging to the liquidated company; again, this might cause momentary unresponsiveness in a single-threaded implementation on a very large map.

As the rate at which single-core processor speeds increase continues to decline, and hardware manufacturers offer parallel processing as an alternative to increased serial speed, a multi-threaded implementation becomes ever more important. Although probably more important for Experimental than Standard because of Experimental's greater computational complexity, this may well nonetheless be useful for Standard, for example, in ship pathfinding: the paths of numerous vehicles might be computed in parallel whilst parts of the code that do not in any way depend on or influence the routes of convoys in the route-calculating state (including a sync step) are executed concurrently.

The ability to have extremely large maps with decent performance is one of the things that really distinguishes Simutrans from much more modern games such as Transport Fever, and so optimising performance in the case of such large maps is a worthwhile thing to do.

DrSuperGood

QuoteOne thing that has helped enormously (especially with all the static variables in route_t) is the thread_local qualifier: although this is part of the C++11 standard, it is also implemented in pthreads, so it should be fully compatible with all Simutrans platforms. Because it can simply be omitted without any other change of syntax in a single threaded build, this keyword can simply be defined as a space with a preprocessor directive in the event that the MULTI_THREAD flag is not present.
Thread locals often have slower access speed than global and local variables because of how they are implemented.

Also, removing the option of a single-threaded build has been suggested in the past. In this day and age one really should not be making single-threaded applications. It even violates Windows application guidelines to be single-threaded, as the event-processing thread must be separate from any heavy computation thread for responsiveness.

Fundamentally one could do something like divide a map into X areas where each area is run by a separate thread. Anything that occurs across the areas then has a well-defined handling between the threads. One can expand on that by having clusters of objects rather than areas so it can apply to any number of threads. The idea is that as long as there is no interaction between an object and an object owned by another thread, the owning thread can process it at full speed, with slower handling for objects with cross-boundary interaction. Since most objects' interaction is limited to related objects (e.g. a train may only ever remain in a small area of the map), for most of the time the threads will execute very efficiently.
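A minimal sketch of that ownership idea (illustrative only; none of these types exist in Simutrans): each worker thread steps the objects it owns without any locking, while interactions that cross an ownership boundary go through a slower, serialised path.

#include <pthread.h>
#include <vector>

// Illustrative object: owned by exactly one worker thread.
struct sim_object_t
{
   int owner_thread;
   int state;
};

// Work package for one map area / worker thread.
struct area_t
{
   std::vector<sim_object_t*> owned;            // Stepped lock-free by this thread
   std::vector<sim_object_t*> boundary_touched; // Objects owned elsewhere: handled under a lock
   pthread_mutex_t *boundary_mutex;             // Shared; protects cross-boundary updates
};

static void *step_area(void *arg)
{
   area_t *area = static_cast<area_t*>(arg);

   // Fast path: objects that only interact within this area need no locking at all.
   for (size_t i = 0; i < area->owned.size(); ++i)
   {
      area->owned[i]->state += 1;
   }

   // Slow path: interactions that cross the ownership boundary are serialised.
   pthread_mutex_lock(area->boundary_mutex);
   for (size_t i = 0; i < area->boundary_touched.size(); ++i)
   {
      area->boundary_touched[i]->state += 1;
   }
   pthread_mutex_unlock(area->boundary_mutex);
   return NULL;
}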

The main problem is that the entire engine is not written for multi-thread support. It's a joke that single thread is even an option, so everything related to multi-threading needs to be wrapped in ugly macros. When one clicks a server that is not responding, the game should not become unresponsive until the TCP connect times out.

I think a big speed-up for multi-threading might be if one migrated the graphics to an actor based system. Currently objects like a factory handle their own graphics and so have graphics code mixed in with the production code and such. If instead the graphics were handled by a factory actor, with the factory itself handling only the production code and sending messages to the actor on events (e.g. "PRODUCTION_ON" or "PRODUCTION_OFF"), in theory all overhead related to the graphics could be deferred to an actor thread processing the messages. Since all graphics code would be in one place, it would potentially allow for efficient use of hardware acceleration for graphics rasterisation, which would allow for support of features like 32-bit or HDR colour, 3D models, higher frame rates or JIT-rendered objects.
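To make the message-passing idea concrete, here is a minimal, self-contained sketch of such an actor with pthreads (the event names mirror the examples above; the types are illustrative and not Simutrans code):

#include <pthread.h>
#include <deque>
#include <cstdio>

// Messages the simulation code would send instead of touching graphics directly.
enum factory_event_t { PRODUCTION_ON, PRODUCTION_OFF, SHUTDOWN };

static std::deque<factory_event_t> event_queue;
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

// Simulation side: post an event and return immediately.
static void post_event(factory_event_t ev)
{
   pthread_mutex_lock(&queue_mutex);
   event_queue.push_back(ev);
   pthread_cond_signal(&queue_cond);
   pthread_mutex_unlock(&queue_mutex);
}

// Graphics actor: drains events and updates display state on its own thread.
static void *graphics_actor(void *)
{
   for (;;)
   {
      pthread_mutex_lock(&queue_mutex);
      while (event_queue.empty())
      {
         pthread_cond_wait(&queue_cond, &queue_mutex);
      }
      factory_event_t ev = event_queue.front();
      event_queue.pop_front();
      pthread_mutex_unlock(&queue_mutex);

      if (ev == SHUTDOWN) break;
      printf("graphics: factory %s\n", ev == PRODUCTION_ON ? "animating" : "idle");
   }
   return NULL;
}

int main()
{
   pthread_t actor;
   pthread_create(&actor, NULL, graphics_actor, NULL);
   post_event(PRODUCTION_ON);
   post_event(PRODUCTION_OFF);
   post_event(SHUTDOWN);
   pthread_join(actor, NULL);
   return 0;
}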

Another approach would be physical map fragmentation. Currently Experimental looks at maps like 2000*2000. Instead, what could be done is to fragment that 2000*2000 map into 16 (4*4) different 500*500 maps, each of which is run on a separate thread by a server. Through the use of portals and some rough communication the maps could interact with each other to some extent. Clients only ever download and play a single 500*500 map at a time, meaning faster download times and less overhead for them. Fundamentally this goes back to the thread object ownership idea, except it is even stricter, with physical separation.

killwater

That sounds like a huge step forward. It would be awesome.

jamespetts

Dr. Supergood - those are interesting ideas, but would that not require a very fundamental re-write of more or less the whole code? What I am implementing is intended to make relatively few changes to the fundamentals. I am not sure that an area based approach would work; what about, for example, route searching and block reservation for trains if a route spans multiple areas? It is difficult to see how that could be efficient to code, maintain or execute. Would it not make more sense to have concurrent and/or parallel operations for different types of object?

I have never really looked into the graphics code in detail, but it is not ideal being integrated with objects. I have had to interact with it in terms of swapping images for signals when I added extra aspects to signals for the signalling code. Is it really practical to go through the entire code and unpick this now?

It certainly does seem sensible to multi-thread the querying of servers; this unresponsiveness is a long-time problem, and there is no issue of network safety in this either. This, I imagine, would be much more straightforward.

In terms of removing the option for a single threaded build, that would certainly make the code much cleaner and easier to maintain. Does anyone know why this option still remains and whether it has a useful purpose to serve any longer? I should probably want Experimental to follow Standard on this issue.

Edit: As it happens, it would make multi-threading the path explorer much easier if the single threaded builds need no longer be supported, as I am having trouble at present imagining how it might practically be accomplished in a way that works properly between single and multi-threaded builds. I know that the compile-time flag for multi-threading was originally a sensible thing when the feature was new and in testing, but both it and pthreads seem stable now. Is there any compelling reason of which anyone is aware to maintain the ability to have a strictly single threaded build (e.g. compatibility with certain platforms)?

kierongreen

Performance of certain code fragments will likely be highest when optimised to the number of cores available. Hence the overheads of having 16 threads on a machine with only 2 cores may lead to worse performance than 1 (or 2 threads).

Splitting the map up may lead to too many overheads swapping between sections.

Note the "may" here - when we started putting multithreading into Standard there were tests to try to work out what was going to give the best results. The complication is that performance will change from one platform and processor to another.

jamespetts

I have tried to respect the env_t::threads value in determining the number of parallel operations, although of course in some cases, this will need to be synchronised between client and server.

Do you think that there is any longer a need to have single threaded builds? The code could be cleaned up considerably if not.

kierongreen

Comes down to the effect it has on performance. For Standard this is more an issue than Experimental as some players have very old hardware. Would need people with single core processors to test this.

DrSuperGood

Quote
I have never really looked into the graphics code in detail, but it is not ideal being integrated with objects. I have had to interact with it in terms of swapping images for signals when I added extra aspects to signals for the signalling code. Is it really practical to go through the entire code and unpick this now?
It would have to be a long term revision.

Quote
It certainly does seem sensible to multi-thread the querying of servers; this unresponsiveness is a long-time problem, and there is no issue of network safety in this either. This, I imagine, would be much more straightforward.
One would think so... I tried in the past but due to overuse of static variables it is not really possible. The best one could do is query 1 server at a time currently.

Quote
In terms of removing the option for a single threaded build, that would certainly make the code much cleaner and easier to maintain. Does anyone know why this option still remains and whether it has a useful purpose to serve any longer? I should probably want Experimental to follow Standard on this issue.
In the past multithreading was not really a major design consideration. Games on consoles only used 1 thread. As multi-threading was not supported natively in a platform-independent way, the only way to achieve multi-threading support was with heavy libraries like pthreads or with platform-specific mechanisms such as the Windows API on Windows operating systems.

However, with the rise of multi-core systems and multi-thread orientated operating systems, it has become very important to have multi-threading support.

Quote
Performance of certain code fragments will likely be highest when optimised to the number of cores available. Hence the overheads of having 16 threads on a machine with only 2 cores may lead to worse performance than 1 (or 2 threads).
But it will still lead to better performance than only 1 thread. Also, if not all threads are heavy (i.e. needing to run 100% of the time), the performance difference might be 5% or less.

Quote
Comes down to the effect it has on performance. For Standard this is more an issue than Experimental as some players have very old hardware. Would need people with single core processors to test this.
Should be minimal. Especially if number of worker threads is tuned.

jamespetts

I have had a go at multi-threading the passenger generation code, but have encountered so far unfathomable problems with it: most particularly and most bizarrely, a client (but not a server) in network mode will usually (but not always) crash in this block of code in simworld.cc:


// process enqueued network world commands
while(  !command_queue.empty()  &&  (next_command_step<=sync_steps/*  ||  step_mode&PAUSE_FLAG*/)  ) {
   network_world_command_t *nwc = command_queue.remove_first();
   if (nwc) {
      do_network_world_command(nwc); // <<< access violation occurs here
      delete nwc;
   }
   next_command_step = get_next_command_step();
}


with an access violation for nwc, with nwc being a non-null value (indicating that it has already been deleted). I do not have a full understanding of how the command queue works; can anyone assist with what kinds of things might cause what I infer to be a double deletion of an nwc object in the command queue?

TurfIt

Don't know about the nwc issue, but:

+#ifdef MULTI_THREAD
+ pthread_mutex_lock(&step_passengers_and_mail_mutex);
+#endif
const uint32 units_this_step = simrand((uint32)settings.get_passenger_routing_packet_size(), "void karte_t::step_passengers_and_mail(uint32 delta_t) passenger/mail packet size") + 1;
-
+#ifdef MULTI_THREAD
+ pthread_mutex_unlock(&step_passengers_and_mail_mutex);
+#endif

wut? No. Just No. Speechless.

Combuijs

Quote from: jamespetts on October 31, 2016, 07:31:54 PM
I have had a go at multi-threading the passenger generation code, but have encountered so far unfathomable problems with it: most particularly and most bizarrely, a client (but not a server) in network mode will usually (but not always) crash in this block of code in simworld.cc:


// process enqueued network world commands
while(  !command_queue.empty()  &&  (next_command_step<=sync_steps/*  ||  step_mode&PAUSE_FLAG*/)  ) {
   network_world_command_t *nwc = command_queue.remove_first();
   if (nwc) {
      do_network_world_command(nwc); // <<< access violation occurs here
      delete nwc;
   }
   next_command_step = get_next_command_step();
}


with an access violation for nwc, with nwc being a non-null value (indicating that it has already been deleted). I do not have a full understanding of how the command queue works; can anyone assist with what kinds of things might cause what I infer to be a double deletion of an nwc object in the command queue?

I suppose

command_queue.remove_first()

might not be fit for multi-threading? If this function gets the first entry and then removes it, without a lock forcing these two statements to execute without being interrupted, the order of statements may be

get_first_entry
get_first_entry (in another thread)
remove_first
remove_first (in another thread)

This would mean the first original command will be returned twice (in separate threads) and the second original command will disappear completely.
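If the queue ever were shared between threads, the minimal fix would be to make "get first" and "remove first" a single atomic operation under a lock. A generic sketch (not Simutrans' actual list class):

#include <pthread.h>
#include <deque>

template<class T>
class locked_queue_t
{
   std::deque<T> data;
   pthread_mutex_t mutex;
public:
   locked_queue_t() { pthread_mutex_init(&mutex, NULL); }
   ~locked_queue_t() { pthread_mutex_destroy(&mutex); }

   void append(const T &item)
   {
      pthread_mutex_lock(&mutex);
      data.push_back(item);
      pthread_mutex_unlock(&mutex);
   }

   // "Get first" and "remove first" happen under one lock, so two threads can
   // never pop the same element or skip one.
   bool remove_first(T &out)
   {
      pthread_mutex_lock(&mutex);
      bool ok = !data.empty();
      if (ok)
      {
         out = data.front();
         data.pop_front();
      }
      pthread_mutex_unlock(&mutex);
      return ok;
   }
};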

James, getting code fit for multi-threading while its original intention was single-threading is a hell of a job. Even when it seems to work right, you might stumble over unexpected crashes every now and then which are very hard to find. In general, trying to improve the single-thread code is often much more effective and far more stable.
Bob Marley: No woman, no cry

Programmer: No user, no bugs



jamespetts

Thank you both very much for your help.

Turfit - forgive my ignorance, but may I ask what is wrong with the mutex example in the quoted segment? I have not yet had a great deal of experience with multi-threading, so if I am doing something terribly wrong with this use of mutexes, it would help me greatly to know what it is.

I have in any event moved on somewhat in the implementation of multi-threading this part of the code. Although I still do not really understand why I was getting crashes in the do_network_world_command(nwc) calls (as that part of the code was not itself multi-threaded), I am not getting them any longer. I found a problem that was capable of causing network desyncs which, when removed, also stopped the crashes from occurring; the problem was related to the threading of the algorithm that counts the number of times that the passenger generation code is to be run in any one step.

However, I am now having an even more bizarre problem with some modified code (not yet committed because it is not working), in that all the threads seem simultaneously to be waiting at the same barrier, causing a deadlock, which makes no sense to me, as I understand the position to be that the threads will synchronise at a barrier and will move on when (and only when) the specified number of threads reach that point.

DrSuperGood

Quote
I have had a go at multi-threading the passenger generation code, but have encountered so far unfathomable problems with it: most particularly and most bizarrely, a client (but not a server) in network mode will usually (but not always) crash in this block of code in simworld.cc:

with an access violation for nwc, with nwc being a non-null value (indicating that it has already been deleted). I do not have a full understanding of how the command queue works; can anyone assist with what kinds of things might cause what I infer to be a double deletion of an nwc object in the command queue?
I ran into this when debugging standard the other day... Same place, same conclusion of cause (double deletion). Only my client crashed, with the server continuing fine.

It happened when I was testing my network core changes. However, seeing as those have nothing to do with that piece of code, and when I tried to continue the game continued perfectly ("you have been disconnected"), my only conclusion is that the crash was a one-off. Either that, or it occurs when running in debug mode and a server drops the client at a dumb time, such as in the middle of a command packet. It could also possibly only occur in debug builds, as a double free in release versions might not throw an error (instead the game continues to execute).

In any case this is a general Simutrans problem.
Quote
might not be fit for multi-threading? If this function gets the first entry and then removes it, without a lock forcing these two statements to execute without being interrupted, the order of statements may be

get_first_entry
get_first_entry (in another thread)
remove_first
remove_first (in another thread)

This would mean the first original command will be returned twice (in separate threads) and the second original command will disappear completely.
That piece of code is only ever run single-threaded as far as I know. It has to be, as it is what processes multiplayer commands.

Quote
Turfit - forgive my ignorance, but may I ask what is wrong with the mutex example in the quoted segment? I have not yet had a great deal of experience with multi-threading, so if I am doing something terribly wrong with this use of mutexes, it would help me greatly to know what it is.
I think the problem is with determinism. A race condition is created with access to the random number generator.

Say we have two threads executing on objects A and B which each request a number from the RNG in the range of 1 to 9 at the same time. The mutex allows this to happen by only letting 1 thread use the RNG at a time so its internal state is consistent but that is not the problem. On client 1 object A gets 3 due to reaching the mutex first and object B gets 6. On client 2 object B is processed first so gets 3 and object A gets 6. Client 1 and client 2 are now out of sync with each other as objects A and B do not match. If the random range of the objects were different the RNG would be in a different state and so all future RNG calls will give different results between clients.

The solution in this case is to give objects A and B separate random number generator states. Another solution would be some kind of ordering logic that forces A to be processed before B, but that pretty much amounts to reverting to single-threaded execution logic, as not much can then happen in parallel.
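A minimal sketch of the per-object RNG idea (illustrative only; simrand() and the real generator in Simutrans work differently): each object carries its own deterministic generator state, seeded from something stable such as its id, so the result for a given object is the same on every client no matter which thread reaches it first.

#include <cstdint>

// A small xorshift generator as a stand-in for the game's RNG.
struct object_rng_t
{
   uint32_t state;

   explicit object_rng_t(uint32_t seed) : state(seed) {}

   uint32_t next(uint32_t range)
   {
      state ^= state << 13;
      state ^= state >> 17;
      state ^= state << 5;
      return state % range;
   }
};

// Each simulated object owns its own generator, so thread scheduling cannot
// change its random results.
struct passenger_building_t
{
   uint32_t id;
   object_rng_t rng;

   explicit passenger_building_t(uint32_t id_) : id(id_), rng(id_ * 2654435761u + 1u) {}

   uint32_t pick_packet_size()
   {
      // Same sequence for this building on every client, regardless of which
      // thread steps it first.
      return rng.next(9) + 1;
   }
};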

Quote
However, I am now having an even more bizarre problem with some modified code (not yet committed because it is not working), in that all the threads seem simultaneously to be waiting at the same barrier, causing a deadlock, which makes no sense to me, as I understand the position to be that the threads will synchronise at a barrier and will move on when (and only when) the specified number of threads reach that point.
Depends on how the barrier is implemented. Some of them act as simple nodes any number of threads can sleep on and the last thread must then signal that object to wake up all waiting threads. Some higher level barriers literally allow one to specify a number of threads to wait for before resuming.

jamespetts

Dr. Supergood - thank you: that is very helpful. It does look as though it will be very difficult to multi-thread anything with random calls without a system of assigning a different RNG state to each thread, which will require a fair bit of work. I shall have to look into that. If that be done, however, it would I suppose remove the need to have a mutex around each random call. Perhaps what we need is a simrand_threaded() function, with a thread_local random seed? I wonder, though, how this could then be kept in sync with a single threaded client (i.e. a client built without the MULTI_THREAD preprocessor directive defined)? These threading considerations are not easy.
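For what it is worth, such a simrand_threaded() might look something like this (a hypothetical sketch; no such function currently exists in the code, and whether it could be kept deterministic between clients is exactly the open question):

#include <cstdint>

#ifdef MULTI_THREAD
#define THREAD_LOCAL thread_local
#else
#define THREAD_LOCAL
#endif

// One generator state per thread; in a single-threaded build this collapses to
// an ordinary static, keeping the calling syntax identical.
static THREAD_LOCAL uint32_t simrand_threaded_seed = 1;

static uint32_t simrand_threaded(uint32_t range)
{
   // Linear congruential step, as a stand-in for the real generator.
   simrand_threaded_seed = simrand_threaded_seed * 1664525u + 1013904223u;
   return (simrand_threaded_seed >> 8) % range;
}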

The barrier system that pthreads uses is, I believe, exactly the type that you describe in which one specifies the number of threads to wait for; but this value is set when the barrier is initialised, and is set to the same number as the total number of threads created plus one (i.e. the main thread, which also has a barrier call), so this is still rather baffling (and made harder to work out owing to the lack of a way that I can find to probe the status of this number in the debugger).

DrSuperGood

Quote
Dr. Supergood - thank you: that is very helpful. It does look as though it will be very difficult to multi-thread anything with random calls without a system of assigning a different RNG state to each thread, which will require a fair bit of work. I shall have to look into that. If that be done, however, it would I suppose remove the need to have a mutex around each random call. Perhaps what we need is a simrand_threaded() function, with a thread_local random seed? I wonder, though, how this could then be kept in sync with a single threaded client (i.e. a client built without the MULTI_THREAD preprocessor directive defined)? These threading considerations are not easy.
Each object needs its own random state then, not each thread. That is because the order which threads run and the objects they process will vary between clients. Random results for each object must not vary.

The only time one can really multi-thread randomness is when dealing with client-local stuff, mostly output. For example, which animation is used or what sound plays can use a thread-local RNG because the results can differ between clients without them going out of sync. However, for tasks such as passenger destinations and passenger packet size, the results have to be the same between clients, which means that either each object needs its own random source, independent of the others, or a very strict order of execution has to be established.

jamespetts

Yes, I see that this may be difficult. In Experimental, the passenger generation is no longer specific to cities, but is performed in karte_t using global lists of passenger origins and destinations. I am not quite sure to which objects the random number generation would be specific here.

It seems as though the passenger and mail generation may be one of the more difficult and less (in performance terms) rewarding parts of the code to multi-thread. The convoys' step() function might be rather better, as there seems to be only one call to simrand() there, which is easily replaced by a very basic quasi-random number that is easily made deterministic between threads, and, when there are a lot of routes (especially sea based routes), this can consume a great deal of processing power.


jamespetts

I wonder whether anyone can assist with some trouble that I have been having of late trying to set up multi-threading for Knightly's path explorer. I am trying to use a form of concurrent multi-threading for this, without doing it in parallel (which would involve multiple categories of goods at once) as this should in principle be easier, and is the thing that will improve responsiveness, whereas parallel multi-threading will not.

The idea is for the thread in which the path explorer is to run to start late in karte_t::step(), and continue to run until early in the next karte_t::step(), when it will be stopped just before anything that needs to use its data runs its step. This will then let it run in the background while several sync_steps are called, greatly improving responsiveness. Indeed, testing shows that, on a large map where the path explorer is called frequently, it really does make a big difference to responsiveness.

However, there is a complication the solution to which is causing some problems. The complication is that there are a number of things that can happen in sync_step() that affect the data set on which the path explorer can work. The solution in theory seems to be simple: to insert code before any code in the various methods in sync_step() that will wait for the path explorer thread to finish its step (each call to karte_t::step() should correspond to exactly one iteration of path_explorer_t::step(), as is the case in the single-threaded implementation) before proceeding. To that end, I have set up a condition variable.

However, testing has consistently shown that this does not work as expected. In particular, when, for testing, the commands to start and stop the path explorer thread (both of which have their own methods) are run in sequence in karte_t::step() (i.e. as follows:)


   start_path_explorer();
   stop_path_explorer();


the path explorer is in some instances still found to be running after the stop_path_explorer() command is given, leading to crashes as it runs concurrently with things (i.e. the single threaded part of stepping the convoys) that use the data that it is in the process of manipulating. This is so even where the start_path_explorer() call above is the only call to this method in the whole code.

The relevant part of the implementation of the start/stop code and the threaded code on which it acts I reproduce below. path_explorer_t::step() is the method in which all the substantive work is done: the single threaded version simply calls this straight from karte_t::step().


void* path_explorer_threaded(void* args)
{
   karte_t* world = (karte_t*)args;
   path_explorer_t::allow_path_explorer_on_this_thread = true;

   do
   {
      simthread_barrier_wait(&start_path_explorer_barrier);
      pthread_mutex_lock(&path_explorer_mutex);
      //pthread_cond_wait(&path_explorer_conditional_begin, &path_explorer_mutex);
      //if (karte_t::path_explorer_thread_active)
      if(true) // For TESTing
      {
         karte_t::path_explorer_step_progress = 0;
         
         if (karte_t::world->is_terminating_threads())
         {
            karte_t::path_explorer_step_progress = 2;
            pthread_mutex_unlock(&path_explorer_mutex);
            break;
         }
   
         path_explorer_t::step();
      }

      karte_t::path_explorer_step_progress = 1;

      pthread_cond_signal(&path_explorer_conditional_end);
      karte_t::path_explorer_thread_active = false; // In case of spurious wakeups.
      karte_t::path_explorer_step_progress = 2;
      pthread_mutex_unlock(&path_explorer_mutex);
   } while (!karte_t::world->is_terminating_threads());
     
   pthread_exit(NULL);
   return args;
}

void karte_t::stop_path_explorer()
{
   pthread_mutex_lock(&path_explorer_mutex);
   path_explorer_thread_active = false;
   
   if (path_explorer_step_progress < 1)
   {
      pthread_cond_wait(&path_explorer_conditional_end, &path_explorer_mutex);
      pthread_mutex_unlock(&path_explorer_mutex);
   }
   else
   {
      pthread_mutex_unlock(&path_explorer_mutex);
   }
}

void karte_t::start_path_explorer()
{
   pthread_mutex_lock(&path_explorer_mutex);
   path_explorer_thread_active = true;
   pthread_mutex_unlock(&path_explorer_mutex);

   //do
   //{
   //   // The while loop avoids lost wakeups
   //   pthread_mutex_lock(&path_explorer_mutex);
   //   pthread_cond_signal(&path_explorer_conditional_begin);
   //   pthread_mutex_unlock(&path_explorer_mutex);
   //} while (karte_t::path_explorer_step_progress != 0 && !karte_t::terminating_threads);
   //// This makes sure that stop_path_exporer() cannot be called until
   //// path_explorer_step_progress == 0.
   
   simthread_barrier_wait(&start_path_explorer_barrier);
}


I have tried a number of ways of achieving this, all producing errors. I have in many cases managed to find the errors in my previous implementations, but the implementation with which I replace it seems to contain no fewer (just different and equally problematic) errors. Can anyone think of a robust way of doing this?

What is needed is code that allows stop_path_explorer() to be called at any time from any thread, blocking the calling thread until it is guaranteed that path_explorer_t::step() will not be called again before the next call to start_path_explorer(). start_path_explorer() can be assumed to be called only on the main thread, only once per step, only in karte_t::step(), and only after stop_path_explorer() has already been called by the main thread earlier in the same instance of karte_t::step().

Any thoughts from masters of multi-threading would be gratefully received.

DrSuperGood

path_explorer_step_progress is never less than 1. As such the stop calling thread never waits.

EDIT: Thought I should add an explanation...

Only 1 thread can enter code guarded by the same mutex at any given time. If another thread is already in such code then the thread will block on the pthread_mutex_lock call until the mutex is unlocked. The path_explorer_threaded thread takes control of the mutex at the start of the do loop, and during execution sets path_explorer_step_progress to 0. However, before it unlocks the mutex it then sets path_explorer_step_progress to 2. As such, when the mutex is unlocked and stop_path_explorer continues execution, it evaluates "if (2 < 1)", which is always false, so it immediately unlocks the mutex as opposed to waiting for the condition.

jamespetts

Thank you. The solution seems to be to set the variable to zero in the start method in the main thread.
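In terms of the code quoted above, the fix amounts to something like this (a sketch of the change only, not the exact commit): resetting the progress flag in start_path_explorer() so that a later stop_path_explorer() sees 0 rather than the stale 2 from the previous step and actually waits.

void karte_t::start_path_explorer()
{
   pthread_mutex_lock(&path_explorer_mutex);
   path_explorer_thread_active = true;
   // Reset here, in the main thread, before the worker is released by the barrier:
   // a subsequent stop_path_explorer() will then wait on the condition variable
   // until the worker signals that path_explorer_t::step() has finished.
   path_explorer_step_progress = 0;
   pthread_mutex_unlock(&path_explorer_mutex);

   simthread_barrier_wait(&start_path_explorer_barrier);
}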

wlindley

Report: On Linux 64-bit (Mint/Ubuntu), the latest compiled versions work splendidly and much more smoothly, although sometimes on loading they will freeze forever after "Message: karte_t::load(): Prepare for loading" (message with debug set to 5 on the command line) and it may take numerous abort-and-retry attempts to get it running.


jamespetts

Thank you very much for the report - that is most helpful. The loading issue sounds like a thread deadlock - can you upload a saved game in which this can be reproduced?

DrSuperGood

Quote
The loading issue sounds like a thread deadlock - can you upload a saved game in which this can be reproduced?
Thread deadlocks can be very difficult to reproduce reliably. On some systems the sequence needed to deadlock might never occur.

jamespetts

It may at least be worthwhile seeing the saved game in which this occurs for wlindley in case it can be reproduced on my system without too much ado. If it happens regularly for wlindley, it might be reproducible for me.

In any event, I have pushed some changes to the multi-threading system since the report, which conceivably might correct the problem, so I should be interested to see whether it still occurs.

Edit: By way of update, I have now multi-threaded:

(1) route unreserving (sync_step() mostly, parallel only);
(2) private car route-finding (step(), both parallel and concurrent);
(3) convoy route finding (step(), both parallel and concurrent; the route-finding part of the convoys' step is detached from step() and run between steps);
(4) the path explorer (step(), concurrent only; a single thread for this is run between steps); and
(5) passenger/mail generation (step(), mostly parallel only, currently automatically disabled in network mode as this does not stay in sync due to use of the random number generator).

Together (especially (1), (3) and (4)), these seem to make a very significant difference to the performance of Simutrans-Experimental in larger maps, even in network mode. This is of particular importance given the probable lack of real increases in single threaded performance in the near future of computing, and the importance of large, dense maps in multi-player mode for what I want to achieve with Experimental.

wlindley

With commit 86bd7d, I can no longer reproduce the freeze. Will advise and upload savegame in the event it reoccurs.  A definite performance improvement overall to be sure.

jamespetts

Excellent - glad that it is working well. The remaining tasks are to make the multi-threading of the passenger and mail generation work in networked multi-player mode, and to test to see the optimum number of threads to specify in simuconf.tab.

jamespetts

I have been working of late on the multi-threading of the passenger and mail generation in an attempt to make it network safe, and have encountered some complexities on which I should appreciate others' views as to possible solutions, and on the solutions that I propose below.

Firstly, I have dealt with the RNG problem by making the random seed and associated variables thread local, and re-initialising them on each thread when the threads are initialised. This will, however, have the effect that a single-threaded build will not stay in sync with a multi-threaded build. I will therefore need to retain the option of keeping the passenger generation single-threaded in network mode in case anyone wants to connect a single-threaded client to a multi-threaded server or vice versa. There will need to be an explanation in the comments to simuconf.tab that disabling multi-threading of the passenger and mail generation will be necessary if single and multi-threaded clients/servers are to be connected together. I have already changed config.template to make enabling multi-threading the default.

Secondly, I have also changed the way in which the amount of mail and passengers to generate each step is calculated, again to make it network safe. It is somewhat difficult to make this as accurate when done in parallel as it was when single-threaded, but the overall results should be deterministic and scale properly. Again, however, this will make single threaded clients incompatible with multi-threaded servers and vice versa. I do wonder whether it might be worthwhile indicating in the splash screen whether a given build is single-threaded.

The further issues are more complex. The next issue relates to the generation of pedestrians and private cars. Both of these are generated during the stepping of mail and passengers when passengers take a private car to their destination (in which case, a city car graphic is created) or when passengers walk either to their destination or to a stop, in which case pedestrian graphics are created.

Pedestrians are more straightforward, as, once created, they do not interact with any other objects. They do, however, call simrand() in various places when they are sync-stepped; but am I correct in deducing that, because they will always be sync-stepped from the main thread, simrand() will always be called in the same order even if each individual pedestrian object may be created in a different order (and therefore in the sync list in a different order) on each client?

Private cars are more tricky. They have the same issues as pedestrians, but also another issue: they interact with each other and with player vehicles on the roads. A particular problem might occur if one particular building during any one step creates multiple city cars. These would potentially be created in a different order on different clients. When multiple city cars are generated on a single tile in a single step, they form a stack, and the cars drive off the stack one by one, each waiting for the last to leave the next tile before moving, until none remains. I infer that the order in which they leave the stack will depend on the order in which they are inserted into this stack. This will lead to desynchronisations if different cars have different destinations (which will affect the probability with which they turn in any given way at junctions), different maximum speeds and different "time_to_life" durations (which, in experimental, are all based on the actual expected journey time of the particular packet of passengers taking the car rather than being a fixed number). This seems to me to be the trickiest problem at present, and is not one to which I have thought of a solution: any deferred generation of the private cars by creating a queue of pending private car generation tasks would, it seems to me, have exactly the same problem as it is designed to solve, viz. that the order in which the tasks were inserted into the queue would affect the order in which private cars were generated.

There is a related issue with private cars and pedestrians, which is that it is theoretically possible that, if there are more than 250 objects on a tile, further private cars/pedestrians may not be generated, which would have further consequences (probably for private cars only) that depend on the order of generation and could introduce desyncs. However, I suspect that the occurrence of this is likely to be so rare that it can safely be ignored.

The next problem is that, in Experimental, whether commuting passengers set out for a particular destination building depends on whether there are any job slots available at that building. Passengers who walk or travel by private car are currently added to that building directly in the function for generating passengers, immediately affecting the number of job slots. Thus, the order in which packets of passengers are processed is critical (as whether a packet of passengers goes to a particular building will depend on the number of packets that have previously walked or taken a private car to it in that step), and this will cause desyncs when the generation of passengers and mail is multi-threaded.

The solution to this, it seems to me, is to implement a feature that Nathanael ("Neroden") suggested some years ago, which is a delayed input queue for stops and buildings. This would work by, instead of adding passengers directly to a stop or directly registering them at a building when they are generated, putting them in an input queue with a timestamp. That will be the time in internal ticks at which the passengers (indeed, this would work for mail and goods, too) should leave the queue, based on the relevant timings. The relevant timings would be, in the case of passengers/mail/goods disembarking at a stop, the transfer or transshipment time; in the case of passengers arriving at a stop from having walked there, the walking time plus the transfer time; in the case of passengers arriving at a building having walked or gone by private car, the total journey time; and in the case of goods arriving at an industry, the transshipment time plus any hand-carting time between the stop and the industry.

For stops, the input queue could simply be a member variable of the stop itself, and likewise for industries. For city buildings, having a single input queue per building would concern me as to performance, especially in terms of memory bandwidth, so it may be better to have a single world queue, which, together with the stepping of stops and industries, would be processed single-threadedly. This should be straightforward, as each ware/cargo packet has its final destination position already encoded.
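A minimal sketch of such a delayed input queue (the types below are simplified stand-ins, not the game's ware or halt classes): packets are enqueued with the tick at which they become available, and a single-threaded pass each step releases everything whose time has come, in an order-independent way.

#include <cstdint>
#include <vector>

// Simplified stand-in for a cargo/passenger packet and its destination.
struct packet_t
{
   int destination_building;
   int amount;
};

struct queued_packet_t
{
   packet_t packet;
   uint32_t ready_tick; // Walking/transfer/transshipment delay already added
};

struct delayed_input_queue_t
{
   std::vector<queued_packet_t> entries;

   // Called from the generation code (under a lock, or with one queue per
   // thread, if that code is multi-threaded).
   void enqueue(const packet_t &p, uint32_t now, uint32_t delay_ticks)
   {
      queued_packet_t q = { p, now + delay_ticks };
      entries.push_back(q);
   }

   // Called once per step, single-threadedly: release everything that is due.
   // The order of release does not matter if arrival only adjusts counters
   // (e.g. job slots going negative) and never refuses a packet.
   template<class DeliverFn>
   void process(uint32_t now, DeliverFn deliver)
   {
      size_t i = 0;
      while (i < entries.size())
      {
         if (entries[i].ready_tick <= now)
         {
            deliver(entries[i].packet);
            entries[i] = entries.back(); // Swap-and-pop: order irrelevant
            entries.pop_back();
         }
         else
         {
            ++i;
         }
      }
   }
};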

Stops and industries could show in their GUI the mail, passengers and goods currently transferring in addition to those already there or waiting.

This should have the effect that adding mail/passengers/goods to a stop or a building in the passenger generation code would not change the world state in a way causing the order to make a difference, as processing items in the queue would be order independent: all of the items whose exit time is earlier than or equal to the current time in each step would be processed and the order of their processing should not be critical (as buildings do not refuse to accept passengers when their job slots are exhausted; rather, the job slots become negative, take a longer time to re-charge, and therefore stop any passengers bound for them starting out for longer).

I do not think that the order in which passengers/mail/goods are added to a stop affects anything about the game state which could cause a desync, but please correct me if I am wrong about this.

The input queue system should also incidentally solve another issue, which is that the actual time that it takes passengers, mail and goods to get to their destinations (rather than the notional time used for pathing and whether-to-travel calculations) does not take into account walking time, transfer/transshipment time or time travelling by private car. This can have an effect on logistics, especially for goods.

I should be grateful for any feedback on these issues, and most particularly any way of solving the issue with private cars that I discuss above, which seems to me to be the most tricky.

killwater

I will not be able to help with any of the problems, but as I am a big fan of multithreading, I got curious: what is multi-threaded in Experimental now?

jamespetts

So far:

* route unreserving;
* convoy path finding;
* the path explorer;
* private car (town to town) path finding; and
* passenger generation (single player only for now).

This is additional to the things multi-threaded in Standard, being:

* the graphics; and
* loading/saving.

killwater

Sounds great. Does it mean that there is a big stream of private cars between cities like real world traffic not just random turn choosing? This would be very nice.
Also what is the difference between path finding and path explorer?

jamespetts

Quote from: killwater on November 25, 2016, 10:25:37 PM
Sounds great. Does it mean that there is a big stream of private cars between cities like real world traffic not just random turn choosing? This would be very nice.
Also what is the difference between path finding and path explorer?

The simulation of private car route finding has not changed, and remains very basic: it has just been made more efficient by multi-threading. As before, a route will be calculated from each town hall road to each other town hall road, and actual journeys between different buildings will be approximated based on the travel time between the two town hall roads (with car travel not possible if there is no such route). The actual private cars do not follow a particular path.

The path explorer is the system for computing the paths taken by passengers, mail and goods from their origin to their destination. The path-finding (in this context) is the system for computing the path of vehicles from one stop to the next, or the private cars as described above.

killwater

Citycars following the route would be a nice addition and provide some additional depth to road transport due to the necessity of increasing road capacity to take into account intercity private car traffic. I know there is a looong list of more important projects in line :) If I remember correctly there was some mechanism implemented to include this feature but the performance impact was considered too big. Would multithreading solve this issue?

Thank you for clarifying the difference.

jamespetts

Quote from: killwater on November 26, 2016, 09:05:26 AM
Citycars following the route would be a nice addition and provide some additional depth to road transport due to the necessity of increasing road capacity to take into account intercity private car traffic. I know there is a looong list of more important projects in line :) If I remember correctly there was some mechanism implemented to include this feature but the performance impact was considered too big. Would multithreading solve this issue?

Thank you for clarifying the difference.

It is something that I should like to see, too, but the amount of computational effort and coding effort involved would be enormous. Multi-threading can only, at best, multiply computational ability by the number of cores compared to single threading. For the simplest implementation of private car route finding, where each time that a private car is generated, it tries to find a route to its destination and then follows that route, much more than 4 or even 8 times the processing power of a modern powerful CPU running single-threadedly would be needed. A more complex system that might be more efficient has been discussed in detail, but any such system would be fantastically complex to implement.

It may be done one day, but it represents a huge challenge to make this work effectively for large maps (i.e. those many thousands of tiles square and with many hundreds of cities, as the computational effort increases exponentially with size and complexity).

jamespetts

I have spent some time to-day re-working the passenger and mail generation, as well as industry, systems to allow for multi-threading of passenger generation when in multi-player mode.

I have implemented a suggestion given by Neroden some years ago to introduce a delay based on the walking/carting/transfer/transshipment time(s) between passengers and goods arriving at a stop and being able to depart from it, between passengers starting out from their origin building and arriving at the origin stop, and between passengers and goods arriving at their destination stop and arriving at their destination building. I have done this by having a vector in each stop (and a further general vector in karte_t for goods/passengers that walk, take a private car, or are hand-carted to their destination and do not go via any stop). During the multi-threaded part, the passengers and goods are added to this vector, together with a timestamp in ticks of the earliest time at which they may be removed. Every step, in a single threaded part of the game, the code iterates through this vector, and finishes the delivery of any passengers or goods that are ready to move on according to their timestamp.

For the pedestrians/private cars, I have made the multi-threaded parts of the code add these objects, not directly to the sync list, where the order would be indeterminate, but to a vector of sync objects specific to each thread. Each thread (which has a thread-local random number generator seed) is assigned a sequential thread number when it is first created, and that number is used to index the array of vectors of sync_steppable objects. Then, in a single-threaded part of the code after the passenger generation has finished running, the code iterates through the array of vectors in order so that, no matter in what order the sync_steppable objects were generated, they are added to the interactive part of the game world in exactly the same order on all clients.
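In outline, that ordering trick looks something like the following sketch (simplified; sync_steppable_t here is a stand-in, and the real code of course holds actual pedestrian/private car objects): each worker appends only to its own slot, and a later single-threaded pass walks the slots in thread-number order.

#include <vector>

// Stand-in for an object that will later be sync-stepped (pedestrian, city car, ...).
struct sync_steppable_t { int id; };

// One vector per worker thread, indexed by that thread's fixed sequential number.
static std::vector<std::vector<sync_steppable_t*> > per_thread_sync_objects;

// Called from worker thread number thread_no: no locking needed, because each
// thread only ever touches its own slot.
static void queue_sync_object(int thread_no, sync_steppable_t *obj)
{
   per_thread_sync_objects[thread_no].push_back(obj);
}

// Called single-threadedly after the workers have finished: merge in a fixed
// order so that the sync list is built identically on every client.
static void merge_into_sync_list(std::vector<sync_steppable_t*> &sync_list)
{
   for (size_t t = 0; t < per_thread_sync_objects.size(); ++t)
   {
      std::vector<sync_steppable_t*> &v = per_thread_sync_objects[t];
      for (size_t i = 0; i < v.size(); ++i)
      {
         sync_list.push_back(v[i]);
      }
      v.clear();
   }
}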

However, at present, I am still getting desyncs when I run with the multi-threaded passenger generation code enabled in network mode. Testing shows that this does make a significant difference to performance, so I will continue to investigate.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

O01eg

I think that it would be possible to test whether it is caused by RNG access races by fixing the RNG result to a constant value.
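For example (a sketch only; the actual RNG wrapper in the code is different), a temporary debug switch in the random number function would show whether races on the generator's state are responsible:

// Debug-only sketch, not the actual Simutrans RNG code.
#include <cstdint>

// Switch on while hunting the desync: every caller then receives the same
// value, so a race on the generator state can no longer alter the outcome.
static bool pin_rng_for_debugging = false;

static uint32_t rand_state = 12345;

uint32_t simrand_sketch(uint32_t max)
{
    if (pin_rng_for_debugging) {
        return 0;                       // constant result on every thread
    }
    // Placeholder linear congruential step standing in for the real generator.
    rand_state = rand_state * 1664525u + 1013904223u;
    return max != 0 ? rand_state % max : 0;
}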

jamespetts

I have been spending an extremely large amount of time working on this of late. Fixing any network mode desync is exceptionally time-consuming, as it involves repeatedly testing a very large saved game (which takes minutes to load each time). In this instance, the desync occurs only after being connected for about 10 minutes, so each cycle of testing and adjustment is extremely lengthy. This is the reason that I have not had a chance lately to work on code translation or cross-compiling.

I have so far found two problems. The first is in the convoy multi-threading code that I had thought was working, but which causes a desync when tested with a different saved game (the difference might be that the one that I am now using has ships; I was previously using one of Rollermaterial's reconstructions of the UK, which has rail and road vehicles, but no ships); I have not looked into this yet. The second is a consistent problem with the passenger generation multi-threading, which I have narrowed down to the generation of pedestrians and city cars (with both of those disabled and with the convoy multi-threading disabled, it runs without a desync).

I had thought that the issue with the generation of pedestrians and city cars was that, although I had made the order in which they were inserted into the sync list deterministic as described above, sometimes there would be too many objects on a tile for any more pedestrians/city cars to be generated, and which ones were skipped for that reason and which were generated would not be deterministic.

To remedy this, I changed the code (not yet on Github) so as to add the objects to the ground tiles at the same time as they are added to the sync list, single-threadedly in karte_t::step().

Doing this seems to increase the amount of time before a desync occurs, but does not appear to prevent it from happening. I have tried with private car generation disabled but with pedestrians enabled (private cars being the more likely culprit, since they interact with player road vehicles), but the desync still occurs. I have yet to try private cars enabled and pedestrians disabled with this new system, but I had tried both combinations with the previous version (adding the objects to the ground multi-threadedly and adding them to the sync list single-threadedly), and that failed (i.e. caused a desync).

Does anyone who is more familiar with the code than I have any idea whether the creation of private car/pedestrian objects themselves (as distinct from adding them to ground tiles or the sync list) changes the game state in any way aside from: (1) the state so far as it relates to those objects themselves; or (2) the random number generator's seed (which is now thread local)?

I should be very grateful for any assistance.

Edit: Oddly, the desync seems not to occur with private cars enabled and pedestrians disabled.

Edit 2: The private cars were not causing desyncs because, in the particular testing scenario that I was using, they were not appearing at all. Using Rollermaterial's saved game (with pedestrians turned back on and private cars set to the default level rather than a very low level as they are in the game as saved), the desync occurs very quickly indeed - more quickly than in the very old saved game that I was using previously. This has made it a little easier to test.

I have also realised that making the RNG seed thread local has meant that the RNG seeds other than on the main thread are not checked in the checklist for mismatches. This leads me to believe that the actual problem could well be outside the realm of pedestrians and private cars themselves, relating more fundamentally to the passenger generation system, but only appearing as a desync (within a reasonable time) when pedestrians and private cars are enabled because any differences between server and client would then change the state of the RNG seed on the main thread much more quickly.

Indeed, I have noticed that there do appear to be differences in the number of transferring passengers as between server and client at King's Cross in Rollermaterial's 1943 saved game arising after a few seconds.

I will therefore have to look with greater precision at the other aspects of multi-threaded passenger generation, which may take some considerable time.

Edit 3:  The problem with the passenger generation code is proving extraordinarily hard to find. With pedestrians and private cars disabled,  I notice that the number of passengers arriving at a station (King's Cross in Rollermaterial's saved game is what I use to test) visibly gets out of sync between client and server, but it takes a long time for this to happen, usually starting with there being 1 more or fewer transferring units on the client or server, and then slowly escalating. Disabling the congestion system has no effect, and I have tried many times without success to find any code in the passenger generation method that has any side effects.

I have also spent some time looking into the convoy multi-threading desync issues, and, again, this proves extremely difficult to understand. On one saved game, a very old game from the Bridgewater-Brunel server, it will desync within a few minutes of connecting if convoy multi-threading is enabled, but will remain connected for a long time if both it and passenger generation multi-threading are disabled. However, on Rollermaterial's saved game, with convoy multi-threading enabled and passenger generation multi-threading disabled, it will stay in sync when run overnight.

I had thought that the difference between them was that the older game had ships and Rollermaterial's game did not, but, checking just now, I see that Rollermaterial's game indeed has not only ships but also aircraft. Disabling the JPS system for ships also made no difference to the desync in the older game. Both games have electricity supply, the desync problem with that having been fixed a long time ago.

Edit 4: I think that I have finally fixed the issue with the passenger generation: the trouble seems to have been that the packets were being added to the lists in a non-deterministic order, and the order in which they were taken out of the lists clearly mattered for keeping the game in sync. I have now used the same solution for the halt and world lists as I had used for the pedestrian and private car lists: separate vectors in an array indexed by the thread number, which are then each iterated through in sequence to produce a deterministic output. The code with this fix is now on Github, and the multi-threaded passenger generation is, I think, finally working in network mode.
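The general pattern, sketched generically (a hypothetical helper, not the actual committed class), is simply a per-thread staging array that is drained in a fixed thread order:

// Sketch of the general fix: stage items per thread, drain in thread order.
#include <vector>

template <typename T, int N_THREADS>
class per_thread_staging_t
{
    std::vector<T> slots[N_THREADS];

public:
    // Multi-threaded phase: each thread appends only to its own slot.
    void stage(int thread_number, const T& item)
    {
        slots[thread_number].push_back(item);
    }

    // Single-threaded phase: visiting the slots in ascending thread number
    // yields the same overall order on the server and on every client.
    template <typename Consumer>
    void drain(Consumer consume)
    {
        for (int t = 0; t < N_THREADS; ++t) {
            for (const T& item : slots[t]) {
                consume(item);
            }
            slots[t].clear();
        }
    }
};

The same mechanism can then serve the pedestrian, private car, halt and world lists alike, with only the payload type differing.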

Thank you all for your patience whilst I have worked to deal with this issue. Hopefully, this will bring great improvements in performance in the future.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.