News:

Want to praise Simutrans?
Your feedback is important for us ;D.

[BUG] frequent desyncs when modifying roads

Started by Mariculous, April 10, 2020, 09:12:10 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Mariculous

It keeps desyncing immediately after connection, whilst there is nobody else connected, so I don't think this is closely tied to other players connecting to the game.

Vladki

Yeah, I noticed - seems that disconnects are more frequent when looking on busy part of map, even if running alone.

CPU and RAM usage does not seem to be an issue. I'm not sure if it could be due to different number of threads? I reduced from default 6 to 4, to match the number of available cpu threads.

Freahk (or James, or anyone) could you try running a copy of simutrans-siemens for a while on some other server, so that we can compare and rule out if it is a hardware or connectivity issue? For connectivity check - do we have some players in USA? (where the server is probably located?)

Mariculous

Quote from: Vladki on April 10, 2020, 02:49:58 PMcould you try running a copy of simutrans-siemens for a while on some other server, so that we can compare and rule out if it is a hardware or connectivity issue?

There will be a new one on my server as soon as the map was voted and someone created the heightmap.
If there was no decision until Sunday, I will roll a dice.
So I guess it will be online on monday, maybe earlyer if we get at least 5 votes before Saturday afternoon.

Vladki

It was OK when I was playing alone, but the desyncs when we are there together are insane...

Mariculous

#4
Yep it's quite unplayable when there is more than one player connected.
Edit: it's also desyncing when playing allone but not that frequent.

Mariculous

This one is not a detailled report as I still cannot reproduce this reliably.

Currently, there are quite a lot desyncs on stephenson-siemens.
These desyncs seem to be related to roads, likely to privatecar routing. The server ran quite well without any desyncs for many hours today, even when more than one player was connected.
Once I removed a road, the desyncing began.
After that, there were quite frequent desyncs again even without any active modifications to roads.

Further, these desyncs seem to happen much more often if two or more players are connected.

jamespetts

Thank you for your report: this is helpful in narrowing down the causality for this issue. This is the priority issue, but because this sort of bug is extremely difficult to diagnose, may well to take a very long time to solve.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I am sorry for the trouble caused by the losses of synchronisation: these are, as many of you will know, exceedingly difficult to track down. I have just pushed a fix for a memory corruption bug. It is possible that this was the or a cause of the losses of synchronisation, as it was related to the private car route finding code.

I should be grateful if people could re-test with to-morrow's nightly build whether there is any change to the losses of synchronisation compared to the current build.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ranran

#8
Experimentation on localhost seems to have not solved the instability.
A desync will occur in 0-5 minutes even if the player is not doing anything on the new map.
However when max_route_tiles_to_process_in_a_step = 1 is changed max_route_tiles_to_process_in_a_step is changed to 1, it's very stable, and now about 2 hours have passed and desync has not occurred yet.

EDIT:Translation error

Mariculous

stephenson-siemens still celebrates desync parties.

jamespetts

Thank you for confirming. I have split this from the discussion relating to the server itself so that I can find information on this more easily. It is unfortunate indeed that the memory corruption fix did not fix this, as this is likely to take a huge amount of time to track down and fix, during which little other progress can be made.

I should note that I did manage to connect and stay connected for a good long time this morning to the Stevenson-Seimens server, even building a road, but nobody else was online at the time.

Ranran - I note that you get a different result changing max_route_tiles_to_process_in_a _step. Can I ask: what was the value of this variable when the losses of synchornisation were occurring, and what was the value of the variable when they were not occurring?

Vladki - can you let me know what the value of this variable is on the Stephenson-Siemens server?

Freahk (and others) - can anyone confirm whether this loss of synchronisaiton occurs despite everyone on the server simply observing, or does it occur only after (certain sorts of) interactions?
Edit: Having just attempted to joint the Stephenson-Seimens server with two other players online now, what seems to be happening is that, when I join, somebody else disconnects, and when that somebody else reconects, I lose synchronisation. Is that what other people notice, too? Freahk - is this what you mean by a "desync party"?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

max_route_tiles_to_process_in_a _step was 2048. Now I have changed it to 1 and restarted Stephendon siemens.

yes - desync party is when you can't keep connected more than 1 minute and players kick each other out

Ranran

#12
Quote from: jamespetts on May 01, 2020, 06:47:48 PMRanran - I note that you get a different result changing max_route_tiles_to_process_in_a_step. Can I ask: what was the value of this variable when the losses of synchornisation were occurring, and what was the value of the variable when they were not occurring?
The first post, I did a bad translation. I edited it.
I've posted my experiments and observations here.

To that I add a supplement:
The map is a new map with no player network built yet.
I tried changing max_route_tiles_to_process_in_a_step to the following value with the citycar deleted.
0, 1, 1024, 2048, 4096, 65535
The default was 1024. This will trigger a desync in 0-5 minutes. 2048 and 4096 are almost the same.
0, 65535 is very CPU intensive. And desync happens very soon. In most cases it will be desynced within 30 seconds.

EDIT:
In the case of 1, it was very stable and no desync occurred even if I left 2 clients connected for 2 hours.

EDIT2:
If the value is other than 1, there are many cases where one of the two clients is disconnected immediately after connecting them, and the timing of disconnecting the two clients is not always the same. (That is, if the two are still connected, they will be disconnected individually instead of simultaneously.)

Mariculous

Quote from: jamespetts on May 01, 2020, 06:47:48 PMcan anyone confirm whether this loss of synchronisaiton occurs despite everyone on the server simply observing, or does it occur only after (certain sorts of) interactions?
It does also occur when only one player is connected and does no interaction at all, although desyncs are more frequent when more people are connected.
I could not find any relation in between interacting with the server and desyncs.
However, after modifying rorads, desyncs seem to happen more frequently.

Futher, sometimes the server runs fine without any desyncs for an hour or something like that, but once a desync happens, many further desyncs will follow within short periods of time.

Due to that behavior, we came to the conclusion the desyncs might be related to missmatches in stored private car routes between server and client.

jamespetts

Quote from: Vladki on May 01, 2020, 07:00:04 PM
max_route_tiles_to_process_in_a _step was 2048. Now I have changed it to 1 and restarted Stephendon siemens.

yes - desync party is when you can't keep connected more than 1 minute and players kick each other out

Thank you. Can I check whether not being able to keep connected for more than 1 minute applies to the latest player to join, or only players other than the latest player to join?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

All players are affected.

The change of max_route_tiles_to_process_in_a_step does not seem to help in our case

jamespetts

Thank you all for your replies. Setting max_route_tiles_to_process_in_a_step to 1 will make private car route processing extremely slow so that very little updating of private car routes will be done. This will mean that any loss of synchronisation relating to private car routes will be delayed, and does not by itself solve the problem. It does, however, suggest that the problem is clearly related to private car routing.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ranran

Quote from: Vladki on May 01, 2020, 07:16:51 PMThe change of max_route_tiles_to_process_in_a_step does not seem to help in our case
Did you set progdir_overrides_savegame_settings = 1 or pak_overrides_savegame_settings = 1 correctly?

jamespetts

Quote from: Freahk on May 01, 2020, 07:11:28 PM
Due to that behavior, we came to the conclusion the desyncs might be related to missmatches in stored private car routes between server and client.

One possibility arising from this observation is that these issues might possibly be caused by whatever is causing some of the reported issues with private car routes not being updated properly. A bug in an area related to the loss of synchronisation can sometimes be caused by the same thing as is causing the loss of synchronisation. It may be that trying to fix these bugs may fix the loss of synchronisation, although, as with fixing the memory corruption bug, this is not certain.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Quote from: Ranran on May 01, 2020, 07:23:14 PM
Did you set progdir_overrides_savegame_settings = 1 or pak_overrides_savegame_settings = 1 correctly?

This is not necessary if the advanced settings dialogue ("i" key) be use; but I am not sure whether it was in this case.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Vladki

And that was indeed the case - load offline, use "i", save, load online

I cannot use the override. It will change other config options. I would need some way to dump simuconf.tab from the running game

jamespetts

I have merged this with the earlier bug report relating to this, as this appears to be the same issue.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ranran

I think the title of the new thread is misleading.
Because desync happens even without editing the road as I already reported.
But I don't know if it will make the problem more serious.

Japan has golden stay home week so I did some experiments with loacal host today.
config settings are below:

max_route_tiles_to_process_in_a_step = 0
assume_everywhere_connected_by_road = 0



First, I made a small map without a city and connected two clients to it, but I could not confirm desync in an hour.
Next, I made a 512x512 map that started in the year 2000. I had two clients connecting to the map and had been watching for a while initially, but it didn't disconnect immediately. And I left my seat. After about 30 minutes and returning to the room, both clients were already disconnected.
This indicates that desync will occur even if the player does not delete or lay the road.

And I made and observed maps with the same settings several times. I first noticed that there was no citycar in the new city.
I don't know how this happens, but when loading the map, there may or may not be a citycar initially.
It appeared to be stable when there were no citycars on the map. So I waited for the citycar to show up on the map without it.

It turns out that the client disconnects in about 10-30 seconds when the citycar appears.
I repeated this experiment about 3 times, but I think that the appearance of citycar is the trigger because it is cut off immediately after the appearance of citycar.


Load this saved game in networks server mode.
No one has touched the map, and there is no citycar yet. However, after about 1 minute (around 30:00), a city car will appear.
Please connect with another client immediately after launching
You can see a lot of desyncs after the appearance of citycars on the new map.


The citycar will spawn little by little from around 29:45.
Then the three clients are disconnected in sequence. Not at the same time.
It was possible to connect stably until the appearance of citycars.

This has some fluctuations. The number of citycars increases little by little, but the disconnection does not occur immediately after it appears, and the timing of disconnecting each client is also different.
Note that the map is set to maximum private car routing frequency.
This seems to cause random desyncs, but this experiment is more frequent and the probability of occurrence is intentionally increased.
Certainly it could make things worse if deleting the road would cause the calculation to occur again.

ceeac

I think I managed to track it down:

=================================================================
==54921==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fff1a0be9a1 at pc 0x000001217c6e bp 0x7fff1a0be2d0 sp 0x7fff1a0be2c8
READ of size 1 at 0x7fff1a0be9a1 thread T0
    #0 0x1217c6d in vehicle_desc_t::get_engine_type() const /home/ceeac/Projects/code/simutrans-ex/vehicle/../descriptor/vehicle_desc.h:785:42
    #1 0x1217c6d in road_vehicle_t::check_next_tile(grund_t const*) const /home/ceeac/Projects/code/simutrans-ex/vehicle/simvehicle.cc:3402:70
    #2 0x6e68b0 in route_t::intern_calc_route(karte_t*, koord3d, koord3d, test_driver_t*, int, long long, unsigned int, unsigned int, bool, int, koord3d, unsigned char) /home/ceeac/Projects/code/simutrans-ex/dataobj/route.cc:660:17
    #3 0x6ed3f9 in route_t::calc_route(karte_t*, koord3d, koord3d, test_driver_t*, int, unsigned int, bool, int, long long, unsigned int, koord3d, unsigned char) /home/ceeac/Projects/code/simutrans-ex/dataobj/route.cc:1199:22
    #4 0x108e3bf in tool_wayremover_t::calc_route(route_t&, player_t*, koord3d const&, koord3d const&) /home/ceeac/Projects/code/simutrans-ex/simtool.cc:3615:14
    #5 0x1091f04 in tool_wayremover_t::do_work(player_t*, koord3d const&, koord3d const&) /home/ceeac/Projects/code/simutrans-ex/simtool.cc:3683:7
    #6 0x10343c6 in two_click_tool_t::work(player_t*, koord3d) /home/ceeac/Projects/code/simutrans-ex/simmenu.cc:974:12
    #7 0xb1190a in nwc_tool_t::do_command(karte_t*) /home/ceeac/Projects/code/simutrans-ex/network/network_cmd_ingame.cc:1358:27
    #8 0x118d11b in karte_t::do_network_world_command(network_world_command_t*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:10506:8
    #9 0x118b45c in karte_t::process_network_commands(int*) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:10452:4
    #10 0x118ee73 in karte_t::interactive(unsigned int) /home/ceeac/Projects/code/simutrans-ex/simworld.cc:10610:4
    #11 0x100cba0 in simu_main(int, char**) /home/ceeac/Projects/code/simutrans-ex/simmain.cc:1387:9
    #12 0x1054f72 in sysmain(int, char**) /home/ceeac/Projects/code/simutrans-ex/simsys.cc:830:9
    #13 0x7f24137770b2 in __libc_start_main /build/glibc-YYA7BZ/glibc-2.31/csu/../csu/libc-start.c:308:16
    #14 0x446d6d in _start (/media/ceeac/Projects/code/simutrans-ex/build/default/simutrans-extended+0x446d6d)

Address 0x7fff1a0be9a1 is located in stack of thread T0 at offset 161 in frame
    #0 0x108d27f in tool_wayremover_t::calc_route(route_t&, player_t*, koord3d const&, koord3d const&) /home/ceeac/Projects/code/simutrans-ex/simtool.cc:3588

  This frame has 1 object(s):
    [32, 272) 'remover_desc' (line 3605) <== Memory access at offset 161 is inside this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-use-after-scope /home/ceeac/Projects/code/simutrans-ex/vehicle/../descriptor/vehicle_desc.h:785:42 in vehicle_desc_t::get_engine_type() const
Shadow bytes around the buggy address:
  0x10006340fce0: f8 f2 f2 f2 f8 f2 f8 f2 f8 f2 f8 f3 00 00 00 00
  0x10006340fcf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10006340fd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10006340fd10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10006340fd20: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
=>0x10006340fd30: f8 f8 f8 f8[f8]f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x10006340fd40: f8 f8 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00
  0x10006340fd50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10006340fd60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10006340fd70: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
  0x10006340fd80: 00 00 00 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==54921==ABORTING
Aborted (core dumped)


The problem is that the variable 'remover_desc' in tool_wayremover_t::calc_route is allocated on the stack and used after scope (via the 'test_driver' variable). Moving the 'remover_desc' variable out of the if statement seems to solve the problem.

jamespetts

Thank you both very much: that is extremely helpful. I have now incorporated this. I should be very grateful if people could test with to-morrow's nightly build to see whether this solves the issue.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Ranran

The symptom reported in the last post hasn't been resolved in today's build(#fcbb5f0) - the desync party will start with the appearance of citycar on the localhost testing.

jamespetts

Quote from: Ranran on May 04, 2020, 12:53:24 PM
The symptom reported in the last post hasn't been resolved in today's build(#fcbb5f0) - the desync party will start with the appearance of citycar on the localhost testing.

Thank you for confirming - that is helpful.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Preliminary investigations into this issue show as follows:

(1) the problem does not appear to be related to data being incorrectly saved/loaded, as the problem can be reproduced even where the data in private_car_routes are erased on loading;
(2) enabling assume_everywhere_connected_by_road prevents the loss of synchronisation (but, obviously, disables private car routing features);
(3) there are two separate parts of the code where assume_everywhere_connected_by_road is checked:
(a) in determining which route that actual private car objects will follow (simroadtraffic.cc, hop_check()); and
(b) in determining whether passengers generated in cities take private cars to their destinations (simcity.cc, check_road_connexion_to()):
disabling both of these is necessary to prevent the loss of synchronisation, from which it can be concluded that two separate data structures, being:
(i) private_car_routes; and
(ii) connected_cities/connected_attractions/connected_industries
end up being different on the client compared to the server even assuming the same data structures at the outset, ultimately causing the losses of synchronisation.

Further investigation will be needed to trace the cause of these divergences.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

ceeac

route.cc line 416ff:

private_car_route_step_counter++;
w = tmp->gr->get_weg(road_wt);

if (w)
{
// The route is added here in a different array index to the set of routes
// that are currently being read.

// Also, the route is iterated here *backwards*.
w->add_private_car_route(destination_pos, previous);
}

// Old route storage - we probably no longer need this.
//route.store_at(tmp->count, tmp->gr->get_pos());

previous = tmp->gr->get_pos();
tmp = tmp->parent;


There is a race condition on the call to to weg_t::add_private_car_route, causing the route stored in the way to diverge.
This can happen when finding a route from two different cities to a common destination city, and the two routes cross each other.

jamespetts

Quote from: ceeac on May 10, 2020, 05:56:02 PM
route.cc line 416ff:

private_car_route_step_counter++;
w = tmp->gr->get_weg(road_wt);

if (w)
{
   // The route is added here in a different array index to the set of routes
   // that are currently being read.

   // Also, the route is iterated here *backwards*.
   w->add_private_car_route(destination_pos, previous);
}
         
// Old route storage - we probably no longer need this.
//route.store_at(tmp->count, tmp->gr->get_pos());

previous = tmp->gr->get_pos();
tmp = tmp->parent;


There is a race condition on the call to to weg_t::add_private_car_route, causing the route stored in the way to diverge.
This can happen when finding a route from two different cities to a common destination city, and the two routes cross each other.

Thank you: this is exceedingly helpful. Testing confirms that this is indeed the problem.

This is a subtle and complex problem. What happens appears to be that the order in which private car routes are added to the world is not (necessarily) the same between the server and client. This would not be a problem if what were actually being added were the same, but there are times when it might not be the same, for example, if a road had been added or deleted (whether by a player or automatically by a town) between the time when the first route had been found and the time when a second route had been found. Because only one route to a destination is stored in any given tile, all routes to that destination from that tile will be based on whatever route was set most recently. This has the effect that it is possible for there to be substantive private car related differences between client and server, which is what causes a loss of synchronisation.

I have now pushed a basic fix for this, consisting of only allowing one concurrent background thread to find private car routes when the game is running in network mode. Testing shows that this prevents the loss of synchronisation, but unfortunately has the effect of significantly reducing the speed with which private car routes are updated in network mode. This should not make a noticeable difference on smaller maps, but is likely to make a very significant difference on larger maps, where it might be several game years between updates of roads.

I would therefore suggest that larger network games be split across several large islands so that there are a number of separate road networks rather than a single massive road network, the latter of which would greatly increase the processing power required for route finding.

I have yet to think of any more sophisticated way of doing this that allows full multi-threaded route finding in network mode. The previous system worked because it wrote the data in a single thread, thus the order in which routes would be written was always the same, but this ended up consuming too much memory with a larger road network (~600 cities in one continent all connected to each other by road). I should be grateful if anyone has any solutions to this. Until then, however, we will have to be content with slower updating of private car routes in network games.

Incidentally, thank you to Ranran for the very helpful reproduction case, which is what I used for my initial testing in my previous post, and thank you again to ceeac for having spotted the cause of the problem: this has saved me an enormous amount of time.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Mariculous

#30
Oh well race conditions.

The following thoughts will be based on the following claims:
Claim: If each thread operates on its individual set of memory, we can ensure there won't be any concurrent writes.
Claim: If we don't write to a set of memory, we can safely read from that memory concurrently.

I am kind of wondering why we didn't have such desyncs before private car routing.
Different routes will result in different journey times, so there must have been passengers deciding to use the car on the server, whilst deciding to use public transport on the client (or the other way round)
Maybe these desyncs simply appeared much less frequently as a slight difference in journey times will only affect very few pasengers, whilst different stored routes can likely affect all passengers that decided to take the car.

Anyway, here are two possible solutions to calculate the same routes on the server and all clients.
I will share my thoughts of writing these to the world in the second section.

When calculating the routes, we are only reading data from shared memory. The results will be stored to individual memory, so we need to ensure the data we read does not change while we are calculating the routes.

We can achieve this the following way:
1. Define time intervals in which a given number of routes is calculated.
2. Schedule any kind of roadworks to be applied outside of route calculation time frames.
Player triggered roadworks are already scheduled by network packets to a specific, consistent point in time.
Any other roadworks need to be scheduled as well.

Alternatively, if route calculations take too long and new roads are built rather rarely, we can do the following:
1. Again, define time intervals in which a given number of routes is calculated.
2. if any roadworks is done within that timeframe, refuse all routes calculated in that time frame or validate them in between server and clients using hashes of the result. Refuse routes that differ on any client and recalculate that route later on again.

A mix of both might work best i.e. if the scheduled time is not too far in the future, wait for that time, otherwise cancel the calculations and refuse eventual results.

Keep in mind, we will need to ensure that congestion data does not change within our calculation timeframes.
Using old congestion data should work well. We already maintain such "stats" for the current and the previous month.
Maintaining a third, more recent but not live congestion "stat" should not be an issue.



Let's get to the described issue of race conditions when writing our calculated routes from invidual memory to shared memory (the world)
We will again try to ensure that we never read from memory that is written to in that time and two processes will never write to the same set of memory.
This results in writing the results to the world being a quite time-critical thing, as we need to block the main thread for this, so we want to keep that time short.

I see two approcahes here:
1. A two-step aproach:
First, we used Dijkstra to calculate all-to-one route graphs. We can do this simultaneously. (With the above described measure to ensure consistent calculations of the routes)

The second will start at defined point in time, consistent to server and clients. We neeed to ensure that tfe first step is finished at that time.
Either we pre-define that point in time (guess that's the way it works currently) or we send a packet to the server, just like player interactions, as soon as we are ready and the server will send a "start step two at tick n", wen all clients are ready.

In the second step, we will iterate the resulting graphs and write it to the world.
Note that we can safely use multiple threads to simultaneously write the data of one graph to the world but we can not iterate multiple such graphs at the same time.
We might restrict the number of tiles written per-thread in a single step.

Note that we don't need to wait for the second step to be finished to stat a new first step.
The only thing we have to take care of is that there is never more than one second step active at the same time
We should further take care that we never have more than one waiting second step to prevent pseudo memory leaks if the first step produces new routes faster than the second can write them to the world.


2. A n-step pipeline
We can further reduce the work that needs to be done in the main thread by shifting in a further intermediate step to do some preparation works.

First, we use Dijskstra again to calculate multiple all-to-one route graphs. We will ensure constsitency of this step as above.

The second step can start as soon as the first has finished. We don't need to synch this time across server and clients.
A single thread will virtually cut the world into chunks of e.g. 1024x64 tiles, create a vector for each chunk and sort the routing data into these vectors.
Optionally, to optimise processing times in the last, time critical step, we can sort the contents of each list by koordinates and sort the lists themselves by their number of elements. We want to start with large chunks, so only small ones are left in the end, when there are less chunks left to process than the number of our threads.

The third step will again be scheduled to start at a consistent time.
Multiple threads will be used to write the data from the chunks to the world, just as in the first approach.

Note that we can interleave multiple first and second steps to run simultaneously on different data, so we can circumvent the single-threaded nature of the second step by simply running multiple second steps at the same time.
The only thing we need to ensure here is that we never have multiple third steps running at the same time and that we start the third step in a consistent order.

Another option to reduce processing time in the last step is comparing newly calculated routes to what is stored in the world. If it's identical, we can skip refuse the newly calculated route entirely, otherwise we might only write instructions that actually changed to the chunk vectors.
This has the drawback that we cannot run step two and three simultaneously as we will read routes in step 2 and write thos in step 3.

Ranran

I'm glad that the big problem has been cleared up :)

jamespetts

Can I confirm whether people have managed to remain connected after the latest fix? I was able to stay connected online for 20-30 minutes just now, but nobody else was online at the time.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Mariculous

#33
We will see, likely this evening ;)
So far, I can confirm being connected for an hour without desyncs.

jamespetts

Thank you for confirming that this appears to have fixed the losses of synchronisation: this is helpful.

It might help to give some further refinement to the explanation as to how the losses of synchronisation occurred, which might help to refine your suggestions, as, from a brief reading of these, they may depend on assumptions which might have been reasonable to make from what I wrote above, but which are not quite correct.

The important thing to note is that, contrary to what I stated above, it is not just player changes to road networks that will cause differences in routing from the same point to any given destination depending on the time at which the routing takes place: it is also congestion. Because congestion affects the cost of traversing a tile so far as the routing algorithm is concerned, it has the potential to affect which route is selected from that point to a destination. Congestion changes with the movement of cars that stop in a tile, and this can be affected every sync_step.

Thus, if the routes be calculated at different sync_steps, they will potentially record different levels of congestion and make different routing decisions. This might possibly be dealt with by using only the previous month's congestion to affect routing.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Mariculous

Yes, I had stated that as a notice at the last paragraph of section one.

Quote from: Freahk on May 11, 2020, 09:32:39 PMKeep in mind, we will need to ensure that congestion data does not change within our calculation timeframes.
Using old congestion data should work well. We already maintain such "stats" for the current and the previous month.
Maintaining a third, more recent but not live congestion "stat" should not be an issue.

Using last-months data might be too outdated to actually react on congestion changes properly.
Running a preparation step before step 1 to initialise persistent congestion should be quite fast. We do this in karte_t::new_month, which is called each new month in step, besides many further things.


About your question from yesterday:
The server seems to be running quite stable in terms of desynchs. There had been a few crashes that could not be located yet.
Further, private car routes seem to be gone entirely. I'd expect them to get recalculated soon. Will file a bugreport if they don't seem to be recalculated within the next few ingame years.

jamespetts

Thank you for clarifying.

I am not sure that I understand the point that the routes "seem to be gone entirely" - can you elaborate? So far as I can see from my debug build, which lists all the routes going through a tile of road in the road's information window, the routes are all still present.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have split the post replying to the question about private car routes being "gone entirely" into a different thread, as that is a separate issue.

Returning to the losses of synchronisation, I have been testing further. It seems that divergences can occur even if there are no changes to congestion or to the roads themselves between the first and second calculation. I have tested this by temporarily moving the place where the multiple threads for the private car routing start and stop to make sure that this is not concurrent with anything other than player vehicle movement and testing on a map (Ranran's saved game) with no player vehicles.

This seems to occur where two routes have the same cost, and it then appears to be indeterminate which of the two routes be chosen. I am wondering whether there is a way to deal with this and force Dijekstra's Algorithm to be deterministic about routes of the same cost.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Mariculous

#38
That's strange, I'd need to look into how exactly the Dijkstra is implemented and what exactly we are calculating.

Generally, Dijkstra itself is deterministic. There is no randomisation anywhere in the algorithm itself.
For sure, one could implement the iteration of neighbors in a randomised way without breaking Dijkstra, but there is no reason in doing so.
The step of selecting the next node might also be implemented in a randomised way in case there are two different nodes with the same weight, but again there is no reason in doing so.

So either the weight, the processing order of neighbors or the processing order of two nodes with the same weight is kind of randomised.
The first would break Dijkstra in a way that it does not neccessarily return the shortest path, the latter two would only result in different routes if both have the same weight.

In any case, if there is any randomisation involved, I am really wondering why there are no desyncs if the calculation runs in a single thread.

jamespetts

I do not think that there is any actual randomisation: I believe that it is deterministic when the ultimate starting point of the route is the same, but not between any given mid point in the route and any given destination. There is code that takes into account the starting direction. Thus, if two routes to the same destination converge on a tile, and have arrived at that tile from different directions, they might take a different route out of that tile. Because the private car routing system recognises only one route from a tile to the destination, whichever is written last will be the route that private cars take, and so the order of writing must be the same between server and all clients.

I have started experimenting with this in my new testing branch (parallel-private-car-routing or somesuch: I am currently in my model railway shed and cannot recall the precise name). I have there disabled the code that obviously diverges depending on the starting direction (i.e. the code that adds cost penalties for turning) in the case of private car routing, but this seems not to be enough, so I suspect that there is also some other code that has this effect, but I have not found it.

Any assistance in finding this code so that it can be disabled for private car routing would be very helpful; ultimately, making this deterministic will be far more efficient than any means of making sure that the routes are all written in the same order.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Mariculous

So the weight in between two koors is kind of randomised.
Hopefully it is modelled in a way that relates multiple nodes to a koord, otherwise Dijkstra itself would break.
At least from the descriptions it sounds like it's relating multiple nodes to the same koord.
I did not study route search code in simutrans so far, but I guess I should do this just now...

jamespetts

Quote from: Freahk on May 14, 2020, 11:05:14 AM
So the weight in between two koors is kind of randomised.
Hopefully it is modelled in a way that relates multiple nodes to a koord, otherwise Dijkstra itself would break.
At least from the descriptions it sounds like it's relating multiple nodes to the same koord.
I did not study route search code in simutrans so far, but I guess I should do this just now...

That would be very helpful, as my attempts did not discover what was happening.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

I have undertaken some further investigation on this, but have not found anything clear yet. I have eliminated all possible places where direction per se might have any effect on private car routing, and yet the loss of synchronisation still occurs.

Testing carefully inside the routing algorithm, it seems that the tiles are always inserted into the queue (queue.insert(k)) in the same order (in the case of Ranran's saved game, when the starting tile is 75,293, tile 75,294 is always inserted before 74,293), so there is no lack of determinism in this part of the code. Testing when the route is relaxed (previous = tmp->gr->get_pos()) seems to show that it is always 75,294 and never 74,293 accessed here, too, but it is difficult to be sure of this since the sample size for this test was very low because of the very large time between hits on my conditional breakpoint, which significantly slowed my code. It therefore remains very obscure where and how these indeterminisms are occurring.

Any assistance would be much appreciated.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.