Thank you for that profiling - that is very interesting, especially the consequences of the number of threads. I have now fixed the problem of the number of threads having to be set in simuconf.tab to be identical to that on the server: the game will simply defer to the server's value when joining a network game and retain that value until it starts or loads a new game. The reason that the number of threads has to be the same on the client and on the server is that the passenger generation is multi-threaded, and each thread has its own random number generation seed, so the number of threads affects what actually happens in the game. I am in the process of modifying the threading system so that the passenger generation system (which does not run concurrently with anything substantial in the main thread because it would cause conflicts if it were to do so) uses the number of threads specified in num_threads, but other multi-threaded systems, such as the private car and convoy routing, use one fewer thread than the specified num_threads, as these do run concurrently with the main thread in places.
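To illustrate why the thread count matters for network games, here is a minimal sketch, not the actual Simutrans code. It assumes each worker seeds its own generator from a shared base seed plus its index and handles every num_threads-th passenger. The function name generate_destinations and all of its parameters are hypothetical, and the workers run in a plain loop because only the seeding scheme matters for the point being made.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for passenger generation. Each worker owns a
// generator seeded from the shared base seed plus its own index, so the
// sequence of random numbers a given passenger sees depends on how many
// workers there are.
std::vector<std::uint32_t> generate_destinations(std::uint32_t base_seed,
                                                 unsigned num_threads,
                                                 unsigned total_passengers,
                                                 std::uint32_t num_buildings)
{
    assert(num_threads > 0 && num_buildings > 0);
    std::vector<std::uint32_t> destinations(total_passengers);

    // Workers are run one after another here (an assumption of this sketch);
    // real worker threads would write to the same disjoint slots.
    for (unsigned t = 0; t < num_threads; ++t) {
        std::mt19937 rng(base_seed + t); // per-worker seed
        // Worker t handles passengers t, t + num_threads, t + 2 * num_threads, ...
        for (unsigned p = t; p < total_passengers; p += num_threads) {
            destinations[p] = rng() % num_buildings;
        }
    }
    return destinations;
}
```

Calling this twice with the same seed and the same num_threads gives identical results, whereas changing num_threads from 4 to 8 changes which generator serves which passenger, so a client and a server with different values would drift apart.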
Dr. Supergood is correct about the significance of graphical performance in Extended, incidentally - the much greater map size combined with the much greater depth/complexity of simulation makes the graphics performance far more critical in Extended than in Standard. I did not mean to be critical of those who worked on the current graphics system: it is of its time and I doubt that anything better could sensibly have been written in 1997-1999. If it were possible to recast the engine now to take advantage of modern methods and technology, that would be splendid, although that is rather beyond my abilities.
On a map as large as the Bridgewater-Brunel map, most of the time is spent in passenger generation as a result of the large number of alternative destinations that it is necessary for passengers to have, together with the time spent by the game in hashtable lookups for routes on player networks. The performance of the hashtable and of the weighted vectors (for the buildings, which are accessed when passengers pick somewhere at random to go for each iteration of their destination search) is therefore critical, although I am not sure whether either can be improved.
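For anyone unfamiliar with the idea of a weighted vector, the following is a rough sketch of one common way to do a weighted random pick: store cumulative weights and binary-search them. It is not the Simutrans container itself; the class weighted_buildings and its members are hypothetical names, and the real implementation may well differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical weighted list of building ids. A lookup costs O(log n)
// because the cumulative weights are sorted in ascending order.
class weighted_buildings {
public:
    void append(int building_id, std::uint32_t weight)
    {
        assert(weight > 0);
        total_ += weight;
        ids_.push_back(building_id);
        cumulative_.push_back(total_); // running total up to and including this entry
    }

    std::uint32_t total_weight() const { return total_; }

    // Returns the building whose weight range contains `target`,
    // where 0 <= target < total_weight().
    int at_weight(std::uint32_t target) const
    {
        assert(target < total_);
        // First entry whose cumulative weight is strictly greater than target.
        auto it = std::upper_bound(cumulative_.begin(), cumulative_.end(), target);
        return ids_[static_cast<std::size_t>(it - cumulative_.begin())];
    }

private:
    std::vector<int> ids_;
    std::vector<std::uint32_t> cumulative_;
    std::uint32_t total_ = 0;
};
```

A passenger would then pick a destination with something like at_weight(random_value % total_weight()). Since this runs once per iteration of each passenger's destination search, even a small saving per lookup adds up on a very large map.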
As to the relationship between threads and execution units, my approach to threading in Simutrans has been to have a bespoke multi-threading model for each subsystem that requires multi-threading, and to have a set pool of threads for each of those subsystems (e.g. passenger generation and convoy routing). The number of threads in each pool is equal to the configured number of threads (num_threads) minus one, to take into account the main thread; although I am now changing that for the passenger generation to make it equal to the configured number of threads, as that does not run concurrently with anything else of any significance. Some of these multi-threaded subsystems run concurrently with all sync steps (e.g. the path explorer, which uses only one thread, the convoy routing and the private car route finding) and some do not (the route unreserver for railways and the passenger generation). This is all independent of the number of threads allocated for the graphics, of course. Quite how this fits into the relationship between the number of threads and the number of execution units I do not know, but it is certainly considerably better in performance terms than it was when all of these things were single-threaded.
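The pool sizing described above could be summarised roughly as follows. This is only a sketch: the enum subsystem and the function pool_size are hypothetical names rather than anything in the code base, and clamping the concurrent pools to a minimum of one worker when num_threads is 1 is my assumption for the example.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical labels for the subsystems mentioned in the post.
enum class subsystem {
    passenger_generation, // does not overlap significant main-thread work
    convoy_routing,       // runs concurrently with sync steps
    private_car_routing,  // runs concurrently with sync steps
    path_explorer         // always a single worker thread
};

// Returns how many worker threads a subsystem's pool would get for a
// given num_threads setting.
unsigned pool_size(subsystem s, unsigned num_threads)
{
    assert(num_threads > 0);
    switch (s) {
    case subsystem::passenger_generation:
        return num_threads; // the main thread is idle, so use every thread
    case subsystem::path_explorer:
        return 1;
    case subsystem::convoy_routing:
    case subsystem::private_car_routing:
        // Leave one thread for the main thread; the floor of one worker
        // is an assumption of this sketch.
        return std::max(1u, num_threads - 1);
    }
    return 1; // not reached for valid enum values
}
```

With num_threads set to 8, for instance, this gives 8 workers for passenger generation, 7 each for convoy and private car routing, and 1 for the path explorer.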