The International Simutrans Forum

Simutrans Extended => Simutrans Extended Development => Simutrans Extended Solved Bug Reports => Topic started by: DrSuperGood on September 06, 2018, 03:21:35 PM

Title: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 06, 2018, 03:21:35 PM
Clients seem to be going out of sync with the server a lot.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 06, 2018, 04:31:23 PM
Have you any way of telling at present whether these issues are consistent with the server crashing or whether the issue is relating to synchronisation? I am afraid that, if it is the latter, the sort of bug that would cause failures to remain synchronised only on a map as large and developed as this one is likely to be so subtle and complex that it would be virtually impossible to solve, unless people could isolate some specific feature of the game that is only now being used for the first time (although it is hard to think what that might be, since electricity supply and air travel have been in use for an extended period).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on September 06, 2018, 11:20:27 PM
I am having trouble staying in sync as well.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 06, 2018, 11:26:27 PM
I connected briefly earlier and noticed no difficulty. Can I check that you all have the latest version of the pakset, which was updated last night?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Junna on September 07, 2018, 03:47:46 AM
It reset again when I was replacing 800 road vehicles, and once again it reset to way before I started the replacement... all that work lost... ugh, why do I bother. I even forced a save by reconnecting before starting the replacement in case this happened, but it still reset to a save older than the one I forced, and older than the hourly save it had done...
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 07, 2018, 04:21:56 AM
Quote
Have you any way of telling at present whether these issues are consistent with the server crashing or whether the issue is relating to synchronisation? I am afraid that, if it is the latter, the sort of bug that would cause failures to remain synchronised only on a map as large and developed as this one is likely to be so subtle and complex that it would be virtually impossible to solve, unless people could isolate some specific feature of the game that is only now being used for the first time (although it is hard to think what that might be, since electricity supply and air travel have been in use for an extended period).
Surely the crashes should be logged on the server like all application crashes?

I cannot tell from the client side, since two cases produce the same "cannot connect to server" error in the game list.
  • Server has crashed and not yet restarted.
  • Someone beat me to selecting the server and is taking an extremely long time to join it.
Quote
I connected briefly earlier and noticed no difficulty. Can I check that you all have the latest version of the pakset, which was updated last night?
I use my updater so yes it updated the pakset files.

It is very sporadic... For example sometimes I connect and remain connected and in sync for well over an hour before I eventually have to quit to do something else. Other times I connect just to be immediately dropped out of sync the instant the map finishes loading. I can only assume it is out of sync because my client does not crash after I have been disconnected, however that need not be the case if the server is crashing due to running out of resources such as memory.

I know of one case that is highly likely to cause an OOS after a while. Changing vehicle schedules such that the vehicle will be displaced (e.g. a train on a diagonal or over a junction) seems not to be fully coupled to the server: although the order does go through to the server, it begins earlier or does something different on the client that issued the order, so eventually a checksum mismatch will occur. This is inherited from Standard, where I first noticed it. You can log in to find that trains you thought you had fixed are still stuck over junctions.

I suspect another case might be related to sending the server orders at the same time as the server gets scheduled for a save/load cycle due to someone joining. I am not sure that the orders are being properly discarded from the queues, and the result might be that all clients get kicked because these borderline orders are executed inconsistently between client and server. It could also be that the server crashed during the load cycle; it is hard to tell.

It would be nice if the two cases were distinguished better. If you are disconnected from the server it should say you are disconnected from the server. If you checksum mismatch the server it should say you went out of sync with the server. Currently it uses the same message for both cases. Apparently there is a way to get the clients to log this in console, but I cannot seem to get it to work...
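
The distinction asked for above could look roughly like the sketch below. This is not the existing Simutrans network code: the `drop_reason_t` enum, the `drop_message` function and the message texts are all hypothetical, chosen only to illustrate reporting each cause with its own message.

```cpp
#include <string>

// Hypothetical reasons for a client leaving a network game (not Simutrans identifiers).
enum class drop_reason_t {
	connection_lost,   // socket closed or timed out, e.g. because the server crashed
	checksum_mismatch  // client and server game states diverged
};

// Returns a distinct user-facing message for each reason.
inline std::string drop_message(drop_reason_t reason)
{
	switch (reason) {
		case drop_reason_t::connection_lost:
			return "You have been disconnected from the server.";
		case drop_reason_t::checksum_mismatch:
			return "You went out of sync with the server.";
	}
	return "Unknown network error.";
}
```

The point is only that the place which detects the problem passes the reason along, instead of both paths ending in the same text.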
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 08, 2018, 10:20:40 PM
I have split this from the general topic about the server as this is perhaps more properly treated as a bug report rather than discussion about the maintenance of the server generally.

Firstly, to answer Dr. Supergood's questions: the logs will show crashes, but it is hard to match these to specific instances of reported desyncs. I have just amended one of the scripts to add the date and time of each check whether the server game is still running, so these should in future be slightly easier to correlate.

I suspect that the problem may well be server crashes, but it is difficult to be sure at this juncture. In any event, there appear to be an increasing number of reports of server crashes losing people's work and it is important to try to resolve this.

These crashes appear to be occurring in non-reproducible ways. There is one possible candidate issue currently known that might be the cause of this: there is an occasional crash bug that has persisted for some time which I have never been able to fix because I have never been able to reproduce it reliably enough. It manifests in the following piece of code in simworld.cc:

Code:
FOR(minivec_tpl<const planquadrat_t*>, const& current_tile_3, current_destination.building->get_tiles())
{
   const nearby_halt_t* halt_list = current_tile_3->get_haltlist();
   if (!halt_list)
   {
      continue;
   }
   for (int h = current_tile_3->get_haltlist_count() - 1; h >= 0; h--)
   {
      halthandle_t halt = halt_list[h].halt;
      if (halt->is_enabled(wtyp))
      {
         // Previous versions excluded overcrowded halts here, but we need to know which
         // overcrowded halt would have been the best start halt if it was not overcrowded,
         // so do that below.
#ifdef MULTI_THREAD
         destination_list[passenger_generation_thread_number].append(halt);
#else
         destination_list.append(halt);
#endif
      }
   }
}

It manifests specifically at this line:

Code:
halthandle_t halt = halt_list[h].halt;

The way in which it manifests is that halt_list[h] is an invalid memory location of a non-NULL value, indicating that this memory location has been deleted. This usually happens when the variable "h" is a non-zero number.

It is known to occur in the demo game loaded automatically on starting the game: sometimes it occurs a few seconds after starting the game; at other times, it can run for hours without occurring at all. I spent some considerable time earlier this year trying to track it down, and made some progress, but was not able to complete the work because there came a time when this became impossible to reproduce no matter how many times that I loaded the demo game and ran it on fast-forward (which had for a time reproduced this fairly reliably). I did note that it occurred at least sometimes (and possibly all of the time) on multi-tile buildings with holes in them (i.e., which are not contiguous quadrilaterals), but it is not clear whether this is related to the cause (and I did spend some time investigating this aspect of things to no avail).

I have been attempting to reproduce this error whilst writing this post, but to no avail. If anyone can assist in trying to track down the (or a possible) cause of this error, or suggest what might be done to do so (beyond things that I have already tried, such as using Dr. Memory) that would be much appreciated.

As to discerning the cause of server disconnects, the code in this respect is the same as in Standard: I do not think that I know enough about networking to be able to modify this easily. I agree that this would be useful, however.

Incidentally, how has stability been to-day?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 08, 2018, 10:44:32 PM
Quote
Incidentally, how has stability been to-day?
Left server running while I made dinner and came back to find client closed (crashed without error message).

Tried to join server just now and was immediately prompted with "you have lost synchronization" or whatever after load and the server resuming.

Finally managed to rejoin just to find all my work I made before making dinner was lost!

Yeh it is one of those days...
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on September 08, 2018, 11:32:59 PM
Can confirm that joining the server is a PITA at the moment...
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 08, 2018, 11:45:15 PM
I am not sure what the joining issue is, unless the server is now crashing so frequently that people are constantly trying to rejoin. My apologies that the game experience has deteriorated.

I am currently running the server game offline to attempt to reproduce this, although I have not had any success so far.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 09, 2018, 02:01:18 AM
The server appears resource constrained at the moment. Frequently it hangs or stutters for a few seconds before resuming. This is not connection related since when play resumes I am not far behind the server, instead it must be my client waiting for the server which is falling behind.

Many of the crashes during loading could easily be explained by running out of memory or similar resources.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 09, 2018, 10:49:43 AM
I have noticed that the game on the server is running at ~94% of RAM, which suggests that this is the issue - however, the amount of RAM on the server is 6GB, whereas loading the game afresh locally gives rise to memory utilisation of only 4GB, suggesting a leak: indeed, when I run Simutrans-Extended locally, the amount of memory used steadily increases with time, also suggesting a leak.

However, running Dr. Memory fails to diagnose this. Whilst it reports many suspected leaks:

Edit: Code too long to paste here

All of these are in code shared with Standard, and almost all of them are in code run only on startup (mainly the translator and object file reader code). There is one more suspicious leak, which is this:

Code:
Error #70: POSSIBLE LEAK 32744 direct bytes 0x0000000008c6f820-0x0000000008c77808 + 82702 indirect bytes
# 0 replace_malloc                                  [d:\drmemory_package\common\alloc_replace.c:2576]
# 1 _chvalidator_l                                  [d:\th\minkernel\crts\ucrt\src\appcrt\convert\isctype.cpp:48]
# 2 _chvalidchk_l                                   [d:\th\minkernel\crts\ucrt\inc\ctype.h:161]
# 3 __acrt_locale_changed                           [d:\th\minkernel\crts\ucrt\inc\corecrt_internal.h:658]
# 4 _ischartype_l                                   [d:\th\minkernel\crts\ucrt\inc\ctype.h:186]
# 5 __crt_strtox::is_space                          [d:\th\minkernel\crts\ucrt\inc\corecrt_internal_strtox.h:145]
# 6 __crt_strtox::parse_integer<>                   [d:\th\minkernel\crts\ucrt\inc\corecrt_internal_strtox.h:282]
# 7 xmalloc                                         [c:\users\james\documents\development\simutrans\simutrans-extended-sources\simmem.cc:156]
# 8 freelist_t::gimme_node                          [c:\users\james\documents\development\simutrans\simutrans-extended-sources\dataobj\freelist.cc:106]
# 9 slist_tpl<>::node_t::operator new               [c:\users\james\documents\development\simutrans\simutrans-extended-sources\tpl\slist_tpl.h:42]
#10 slist_tpl<>::insert                             [c:\users\james\documents\development\simutrans\simutrans-extended-sources\tpl\slist_tpl.h:425]
#11 hashtable_tpl<>::set                            [c:\users\james\documents\development\simutrans\simutrans-extended-sources\tpl\hashtable_tpl.h:346]

This is in a part of the code that is used very frequently in Extended (the "set" command in the hashtable); however, this is also code from Standard. Perhaps the leak actually originates in code from Standard?

Does anyone know how the hashtable "set" implementation works well enough to try to find what is causing this trouble?
Edit: I should note that I am unable to load the Stevenson-Seimens saved game when running Dr. Memory, as this will crash with the following error when doing so (but will load without difficulty when not using Dr. Memory):
Code:
Dr. Memory version 1.11.0 build 2 built on Aug 29 2016 02:41:18
Dr. Memory results for pid 9212: "Simutrans-Extended-debug.exe"
Application cmdline: "C:\Users\James\Documents\Development\Simutrans\simutrans-extended-sources\simutrans\Simutrans-Extended-debug.exe"
Recorded 115 suppression(s) from default C:\Program Files (x86)\Dr. Memory\bin64\suppress-default.txt

Error #1: UNADDRESSABLE ACCESS: writing 0x0000000000000000-0x0000000000000008 8 byte(s)
# 0 inflate_fast               
# 1 inflate                     
# 2 adler32_combine64           
# 3 gzread                     
# 4 loadsave_t::fill_buffer               [c:\users\james\documents\development\simutrans\simutrans-extended-sources\dataobj\loadsave.cc:777]
# 5 load_thread                           [c:\users\james\documents\development\simutrans\simutrans-extended-sources\dataobj\loadsave.cc:80]
# 6 pthreadVC2.dll!sched_get_priority_max+0x56b2   (0x000007fefbd769b0 <pthreadVC2.dll+0x69b0>)
# 7 MSVCR100.dll!_callthreadstartex       [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c:314]
# 8 MSVCR100.dll!_threadstartex           [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c:292]
Note: @0:01:21.106 in thread 3800
Note: instruction: mov    %r8 -> (%rbx)
This makes it very difficult to assess whether the leak is caused by something present in the Stevenson-Seimens saved game (where I had observed the leak) not present in the demo game from which the output in the original, unedited version of this post was generated.

Edit 2: Compiling in 32-bit rather than 64-bit allows Dr. Memory to work without errors - but I can find no significant leaks using the Stevenson-Seimens saved game with this method, even though memory usage is increasing constantly as reported by Windows - all of the reported leaks originate from the translator or object reader.

The USE_VALGRIND_MEMCHECK preprocessor directive appears to be deprecated, as there is no /valgrind/memcheck.h.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 09, 2018, 07:38:26 PM
I do not think this will detect logic leaks (if that is the correct term). A conventional leak refers to something being allocated, used and discarded, but not destroyed to free resources. A logical leak is when something is allocated and used, then neither discarded nor destroyed, and never used again.

The problem with logical leaks is that they will not be detected by memory allocation watchers and such. This is because they might even be cleaned up at certain times, e.g. on game closure or map reload.
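
To illustrate the difference, here is a minimal sketch that is not taken from the Simutrans sources: `message_log_t` is a made-up class. An allocation tracker sees nothing wrong with it, because the vector still owns every entry and frees them all in its destructor, yet memory grows for as long as the program runs because old entries are never discarded.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical example class, not part of Simutrans.
class message_log_t {
public:
	// Every call keeps the message forever, even though only the latest one is read again.
	void add(const std::string &msg) { entries.push_back(msg); }

	// The only entry that is ever used again (precondition: at least one add() call).
	const std::string &latest() const { return entries.back(); }

	// Number of entries still held in memory.
	std::size_t held() const { return entries.size(); }

private:
	std::vector<std::string> entries; // freed on destruction, so a leak checker reports nothing
};
```

After a thousand calls to `add`, `held()` reports a thousand entries although only one of them can still be reached through `latest()`.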
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 09, 2018, 09:17:43 PM
I think that this is what the USE_VALGRIND_MEMCHECK is for: to detect at least some of what you describe as "logic leaks" in freelist and the like.

It would be helpful if anyone could attempt running Simutrans-Extended using the saved game from the Stevenson-Seimens server and Valgrind with this enabled (if this is still possible - does anyone know where /valgrind/memcheck.h is?) to see whether this is a detectable leak.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 13, 2018, 11:30:24 AM
MSVC 2017 has a heap profiling feature which allows you to inspect the number of objects allocated and, most importantly, compare snapshots of the heap for deltas. Using this it should be possible to spot the leak simply by noting which object types always have large positive deltas. Or at least that is the theory...
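
The snapshot comparison itself is simple enough to sketch. The code below is only an illustration of the idea, not how the Visual Studio tool is implemented or driven: the `heap_snapshot_t` format (a map from type name to live object count) and the `growing_types` function are assumptions made up for this example.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Live object count per type name at one point in time (hypothetical format).
using heap_snapshot_t = std::map<std::string, std::int64_t>;

// Returns the names of all types whose live count grew between two snapshots,
// in alphabetical order (the iteration order of std::map).
inline std::vector<std::string> growing_types(const heap_snapshot_t &before, const heap_snapshot_t &after)
{
	std::vector<std::string> result;
	for (const auto &entry : after) {
		const auto it = before.find(entry.first);
		// A type missing from the first snapshot counts as zero instances.
		const std::int64_t old_count = (it == before.end()) ? 0 : it->second;
		if (entry.second > old_count) {
			result.push_back(entry.first);
		}
	}
	return result;
}
```

A type that shows up in this list after every save/load cycle, always with the same increase, is a strong leak candidate.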

I am trying to run the server game but, as you can guess, it is not going too well. 22 minutes in and still stuck at calculating paths.

I assume the Stevenson-Seimens server is smaller? Anyone got a link to the save game as apparently the server that hosts it is currently down.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 13, 2018, 11:39:20 AM
Dr. Supergood - that is most helpful. The Stevenson-Seimens game is much, much smaller and yet still exhibits this problem. I may well have to repair my computer (you may well be right about the PSU - but replacing one of those takes a lot of time, which is somewhat limited at present; I will have to do it sooner or later, however) before looking into this intensively. If you are able to make progress with heap profiling with MSVC 2017, that would be most useful. Indeed, if this feature assists, it might be worth my while upgrading.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Vladki on September 13, 2018, 01:01:03 PM
Quote
I assume the Stevenson-Seimens server is smaller? Anyone got a link to the save game as apparently the server that hosts it is currently down.
Try this one: https://uran.webstep.net/~vladki/simutrans/autosave08.sve
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 13, 2018, 04:52:20 PM
I cannot detect any gameplay leaks in the Stevenson-Seimens save. After 30 minutes of fast forward simulation it gained 294 MB.

About 288MB of this was allocated by...
void route_t::INIT_NODES(uint32 max_route_steps, const koord &world_size)
This happened after roughly 12 minutes and came in the form of 6 route_t::ANode[] arrays, each of exactly 48,000,244 bytes, which were ultimately generated by stadt_t::check_all_private_car_routes(). The game ended with 14 of these identically sized arrays, the other 8 having been generated around map loading from sources such as player vehicles. I am guessing this memory is recycled, so it cannot be considered a leak.

However due to the server game being so massive and getting more and more public cars it could quite well be possible that dozens more of such arrays are needed/used. If that is the case those would push up memory usage over time potentially using many gigabytes of memory. If these represent cached routes there might need to be a line drawn as to how many routes are allowed to be cached at any given time. To keep performance reasonable more staggering could be performed as more routes are required.

6.6MB was allocated in the form of transferring_cargo_t[] as part of vector expansion and 3.8MB in the form of void as part of tile object list expansion. These can potentially be considered logic leaks since, from what I can tell, the vectors and object list do not dynamically shrink. The transferring_cargo_t[] expansion produces a few large arrays, as each element is a complex object of well over 32 bytes. This could be combated to some extent by turning it into an array of pointers and having each transferring_cargo_t allocated from the free lists, so that each element is at most 8 bytes (pointer length). Object list expansion seems to suffer the most from pedestrian creation and movement, with some object lists being observed to hit the 255 element maximum limit, which is over 4kb for a single tile.

Both these logical leaks could be solved by adding shrinking logic to the vectors. Transferring cargo could be made to shrink the backing list if it is larger than 8 elements and is currently utilizing less than 1/4 (bit right shift of 2) of its capacity. The logic behind this is that a busy transfer will always be busy transferring, so one would only want to downsize in response to exceptional loads. The object list could use something similar but can likely be set more aggressively, culling large arrays when they have 4 free capacity, since outside of pedestrians most tiles will have under 4 objects on them. The numbers could obviously be fine-tuned for performance and memory trade-offs. I doubt this will make much of a difference, as these lists are likely purged during save/load, which occurs frequently. It still might cause a fluctuation of 100MB or so on the server.
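
A minimal sketch of that shrinking rule follows. It uses std::vector rather than the Simutrans container templates, and the function name `shrink_if_sparse` is made up. The thresholds (more than 8 elements of capacity, less than a quarter in use) are the ones suggested above; halving the capacity rather than fitting it exactly is an extra assumption, made to avoid growing again straight away.

```cpp
#include <cstddef>
#include <vector>

// Shrinks the backing storage of `v` when it is mostly empty.
// Returns true if the storage was replaced by a smaller one.
template <typename T>
bool shrink_if_sparse(std::vector<T> &v, std::size_t min_capacity = 8)
{
	const std::size_t cap = v.capacity();
	if (cap <= min_capacity || v.size() >= (cap >> 2)) {
		return false; // already small, or at least a quarter of the capacity is in use
	}
	std::vector<T> smaller;
	smaller.reserve(cap >> 1);          // halve instead of fitting exactly, to limit churn
	smaller.assign(v.begin(), v.end()); // copy the surviving elements
	v.swap(smaller);                    // the old, larger buffer is freed when `smaller` is destroyed
	return true;
}
```

Calling this after removals, for example once per month of game time, would let a list that grew during an exceptional load return to a modest size.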

I ran save/load cycles and have spotted a leak... Every save/load cycle quickstone_hashtable_tpl<haltestelle_t,haltestelle_t::connexion *>[] increased in allocation by 1,243 instances totalling 3,087,612 bytes. This leak is easily repeatable, with each save/load cycle of the map adding exactly the same number.

The allocations below were observed...
Code:
Simutrans-Extended Normal x64 Debug (SDL 2).exe!path_explorer_t::compartment_t::step() - Line 674 C++
  Simutrans-Extended Normal x64 Debug (SDL 2).exe!haltestelle_t::haltestelle_t() - Line 427 C++
Both sources appear to be leaking between save/load cycles. As the Bridgewater-Brunel server is a lot more complex and is subject to a lot of save/load cycling, this could easily explain why it runs out of memory. In this simple map 3 MB is leaked per save/load cycle. In a more complex map that could easily be 30 or even 100MB. At 100MB per cycle, 4 players joining would be 400MB, etc...

It is worth noting that the route_t::ANode[] were not destroyed between save/load despite being created initially after first load ran for a bit. This is probably not a leak as I am guessing it gets recycled and reused.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 15, 2018, 05:27:06 PM
Thank you - that is very helpful. I have just pushed two possible fixes for memory leaks on saving/loading. The first is relating to private car route finding: the cause of this was, I think, that the ANODE arrays were thread local, created on the heap, and not explicitly destroyed when the threads were terminated (as they are on loading a saved game).
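
For readers unfamiliar with this class of leak, the sketch below shows one general way to avoid it; it is not the fix that was pushed. A heap array reached only through a thread-local raw pointer is lost when its thread ends, whereas one held in a `thread_local std::unique_ptr` is freed automatically at thread exit. `node_t`, `thread_nodes` and `run_worker_and_count_leftovers` are hypothetical names, and the sketch uses std::thread rather than the threading code in Simutrans.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <thread>

// Hypothetical stand-in for a route-finding node; counts live instances.
struct node_t {
	static std::atomic<int> live;
	node_t() { ++live; }
	~node_t() { --live; }
};
std::atomic<int> node_t::live{0};

// Each worker thread lazily allocates its own scratch array.
// The unique_ptr's destructor frees the array when the owning thread exits.
inline node_t *thread_nodes(std::size_t count)
{
	thread_local std::unique_ptr<node_t[]> nodes;
	if (!nodes) {
		nodes.reset(new node_t[count]);
	}
	return nodes.get();
}

// Runs one worker thread to completion and returns how many nodes are still alive.
inline int run_worker_and_count_leftovers()
{
	std::thread worker([] { thread_nodes(100); });
	worker.join();
	return node_t::live.load();
}
```

With a raw `thread_local node_t *` in place of the `unique_ptr`, the same function would report 100 leftover nodes for every thread that is started and terminated, which is the pattern seen across save/load cycles.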

The second was, I think, some logic errors in the code for deleting the connexions* object which prevented this from occurring when loading a saved game.

Additionally, I have now had a chance to check the system log messages, which confirms that at least most of the crashes are associated with running out of memory. I shall be very interested to see whether there is an improvement with the next nightly build.

Thank you again very much for your investigative work, and apologies for the difficulties on the server.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 16, 2018, 01:21:08 PM
Some testing to-day: players are reporting greater stability with this version. However, there may still be some memory leaks left: what I notice after joining the game multiple times is that memory usage gradually increases after a player joins until it reaches a plateau, where it fluctuates slightly. When another player joins, memory usage falls considerably again, but not quite back to the level where it was when the previous player joined, and then increases to a slightly higher plateau.

Some examples from recent testing are that the free memory on the server was 376MB after the first join, falling to 128-129MB free after a few minutes. After the second join, it increased to 371MB free, falling to 115-116MB free after a few minutes.

However, curiously, a third join seems to have increased free memory to 493MB immediately after the join, falling to 257MB, so the pattern does not seem to be entirely consistent.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on September 17, 2018, 05:14:22 AM
In light of the above, it seems likely (to me, at least) that a primary cause of the incomplete saves was that the server was running out of memory mid-save. However, there must have been other contributing factors involved. In particular, these incomplete saves should not be replacing complete saves. In karte_t::save there is code to support using a temporary filename while writing the save; once the save has been successfully written, this file is then renamed to replace the previous save. However, this temporary file is used only if the filename passed to karte_t::save begins with "save/". So I have two questions.
1. Why is it only used in those cases? (Prissi, who wrote that feature in 2012, would presumably know.)
2. Are the saves that are made by the server for backup purposes written to a filename beginning "save/"?
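
The write-then-rename pattern described above can be sketched as follows. This is not the code in karte_t::save: the `safe_save` function and the ".tmp" suffix are assumptions made for illustration. Note also that the behaviour of std::rename when the target already exists is implementation-defined: POSIX systems replace the target atomically, whereas on Windows the call fails, so a Windows build would need a different replace step.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Writes `data` to `filename` without ever leaving a half-written file under that name.
// Returns true on success; on failure the previous file (if any) is left untouched.
inline bool safe_save(const std::string &filename, const std::string &data)
{
	const std::string temp_name = filename + ".tmp"; // hypothetical naming scheme
	{
		std::ofstream out(temp_name, std::ios::binary | std::ios::trunc);
		out.write(data.data(), static_cast<std::streamsize>(data.size()));
		out.flush();
		if (!out) {
			out.close();
			std::remove(temp_name.c_str()); // discard the partial file, keep the old save
			return false;
		}
	} // the temporary file is closed here

	// Only a completely written file ever takes the place of the old save.
	if (std::rename(temp_name.c_str(), filename.c_str()) != 0) {
		std::remove(temp_name.c_str());
		return false;
	}
	return true;
}
```

If the process dies part-way through writing, only the temporary file is damaged and the previous save remains loadable.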
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 17, 2018, 09:51:07 PM
Thank you for this - I was not aware of this restriction to the save file failsafe code, nor do I understand the reason for it. I do know, however, that the server saved games are not stored in a "/save" folder.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: prissi on September 18, 2018, 12:56:53 PM
Writing to other directories usually does not overwrite an existing saved game; or rather, it is meaningless to keep an old save for the next client. For an existing game, however, you save under the same filename, and if Simutrans crashes during saving, you have lost everything. Hence the old file is deleted only after saving has finished.

Servers usually do not use autosave, since saving causes an out of sync error with the clients (unless they are saving at the same time).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 22, 2018, 08:24:26 PM
I think that the memory leak issue has now gone, as there have been no reports of out of memory crashes since last week. I note the reasons given by Prissi for restricting the safe saving behaviour to filenames with /save. However, on a server, the saved game file is overwritten whenever that game is loaded/saved when a player joins (or by an autosave facility created by use of a script with nettool, as is used on the Bridgewater-Brunel server, to avoid desyncing when saving), so it is useful to avoid overwriting the saved game until the save has been confirmed to have completed successfully. I have therefore removed the code restricting the use of this feature to files in the /save directory.

There is a new source of instability on the server, however: the game crashes with a segfault shortly after loading. I have not been able to reproduce this on a Windows debug build, but it can be reproduced (without being able to be traced) on the cross-compiled Windows release build.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 22, 2018, 09:59:17 PM
Quote
There is a new source of instability on the server, however: the game crashes with a segfault shortly after loading. I have not been able to reproduce this on a Windows debug build, but it can be reproduced (without being able to be traced) on the cross-compiled Windows release build.
?!

It is not crashing at all for me. I just get dropped out of sync after 5-10 minutes of being connected. I can then immediately rejoin and it resumes where it left off (no progress lost). Makes playing with more than 1 person connected impossible.

Using nightly build from server as downloaded by my tool, as always.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 23, 2018, 09:11:37 AM
This is the first confirmation that there is a desync issue independent of the fixed memory leak issue. When I tested this yesterday, the game crashed immediately after losing synchronisation with the server, so I inferred that the problem was that the server was crashing; however, testing again to-day, I see that synchronisation with the server is lost when neither server nor client crashes.

Finding and fixing this problem will be a gargantuan undertaking (as any desync bug subtle enough to manifest itself only now is likely to be almost impossible to find), which will be likely to delay by a very long time progress on fixing other bugs or adding new features. I am not likely to be able to start in earnest until my own computer hardware problems are resolved, which I hope that they will be when I have time to install the new power supply that I ordered last week, whenever it arrives.

Any assistance from anyone in narrowing this down in any way would be very much appreciated (for example, testing whether old versions of the saved games remain in sync with the current build, or whether the current version of the saved game remains in sync when loaded with an older build, checking whether Linux clients are equally affected, checking whether the problem remains after deleting all aircraft on an offline test platform, checking whether the problem remains after removing city and/or industry electrification, etc.).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 23, 2018, 09:24:55 AM
I could possibly do the tracking down, but I lack the hardware for it. The best I could do would be a Windows server with a Windows configuration. However, I would need exactly the server configuration (sync time, time per step etc.) to rule that out as a variable.

Another more interactive approach would involve running the server with builds with particular components disabled. For example all power net updating could be turned off, or all aircraft movement disabled, etc. This could potentially be quicker than manually deleting all aircraft or modifying the save.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 23, 2018, 09:39:46 AM
That would be very helpful - thank you.
For reference, here is the server's simuconf.tab:
Code:
# simuconf.tab
#
# Low-Level values and constants
#
# Lines starting with '#' or any non_character letter will be ignored!
# To actually set a value, remove the leading '#'!
#
# This file can be found in many places:
#
# simutrans/config/simuconf.tab (this one)
# ~/simutrans/simuconf.tab (in the user directory, if singleuser_install == 0 or not present, first only pak-file path will be read)
# simutrans/pakfilefolder/config/simuconf.tab
# ~/simutrans/simuconf.tab (readed a second time, s.a.)
#
################################# Base settings ##################################
#
# This simuconf.tab will be read first => we set meaningful defaults here.
#
# load/save the files in the users or the program directory directory? (default: 0 = user directory)
# ATTENTION!
# will be only used if this file is located in the program directory at config/simuconf.tab!
singleuser_install = 1
#
#
#
# Do not delete these comment line! (Needed for installer)

progdir_overrides_savegame_settings = 0
pak_overrides_savegame_settings = 0
userdir_overrides_savegame_settings = 0

# Default pak file path
# which graphics you want to play?
# Nothing means automatic selection
# ATTENTION!
# This will be only used if this file is located in the program directory at config/simuconf.tab!
# and will be overwritten by the settings from simutrans/simuconf.tab in the user directory
#
#pak_file_path = pak/
#pak_file_path = pak.german/
#pak_file_path = pak128/
#pak_file_path = pak.japan/
#pak_file_path = pak.winter/
#pak_file_path = pak.ttd/

# The maximum number of position tested during a way search
# Consumes 16*x Bytes main memory, where x is the "max_route_steps" value.
max_route_steps = 1500000

# How many tiles to check before giving up on finding a free bay at a stop or free alternative route?
# Default: 200
# Unlimited: 0
max_choose_route_steps = 0

# size of catchment area of a station (default 2)
# older game size was 3
# savegames with another catch area will give strange results
station_coverage = 16

# Max number of steps in goods pathfinding
# This should be equal or greater than the biggest group
# of interconnected stations in your game.
#
# If you set it too low, some goods might not find a route
# if the route is too complex. If you set it too high, the
# search will take a lot of CPU power, particularly if searches
# often fail because there is no route.
#
# Depending on your CPU power, you might want to limit the search
# depth.
#
# prissi: On a 512x512 map with more than 150000 people daily, the saturation
# value for "no route" was higher, around 8000. Using 300 instead almost doubled
# the value of "no route"
#
max_hops = 2000

# Passengers and goods will change vehicles at most "max_transfer"
# times to reach their destination.
#
# It seems that in good networks, the number of transfers is below
# 5. I think 6 is a good compromise that allows complex networks but
# cuts off the more extreme cases
#
# You can set this lower to save a little bit of CPU time, but at some
# point this means you'll get less passengers to transport
#
# This value is less critical than "max_hops" from above.
#
# T. Kubes: I would say 8 for now, but this definitely should be difficulty
# dependent setting.
#
max_transfers = 12

# way builder internal weights (defaults)
# a higher weight make it more unlikely
# make the curves negative, and the waybuilder will built strange tracks ...
#
way_straight=1
way_curve=2
way_double_curve=6
way_90_curve=15
way_slope=10
way_tunnel=8
way_max_bridge_len=15
way_leaving_road=25

# These settings are used to calculate adjusted figures based on the length of the month.
# To assume a base month length based on the settings in Simutrans-Standard, use 1,000
# base_meters_per_tile and 18 base_bits_per_month.
#
# To assume a base month length of 24 hours (to allow various settings to be calibrated
# as if months were days), use a base meters per tile of 7500
base_meters_per_tile = 1000
base_bits_per_month = 18

# This setting determines the rate at which jobs replenish. At the default, 100, all
# of a building's jobs replenish in exactly 1 month. At 200, it takes 2 months, and at
# 50, it takes 1/2 a month. This is useful when calibrating the month/year scale as
# against the hour/minute scale. If a month is timed as 6.4 hours (6:24), for example,
# a day (24 hours) consists of 3.75 "months". For jobs to replenish once every 24 hours,
# therefore, set this to 375 if the length of a month is 6:24.
job_replenishment_per_hundredths_of_months = 100

# Save the current game when quitting and reload it upon reopening
reload_and_save_on_quit = 1

############################### Passenger and mail settings ##############################
# also pak dependent

# town growth multiplier factors
# The greater the factor, the greater the extent to which the thing to which the factor
# makes reference influences growth
passenger_multiplier = 40
mail_multiplier = 15
goods_multiplier = 20
electricity_multiplier = 20

# town growth is size dependent. There are three different sizes (<1000, 1000-10000, >10000)
# the idea is, that area increase by square but growth is linear
growthfactor_villages = 400
growthfactor_cities = 200
growthfactor_capitals = 100

# Enable this to use the old algorithm for city growth from Standard.
# NOTE: The renovation_percentage in cityrules.tab should be increased if this be done.
quick_city_growth = 0

# if enabled (default = 0 off) stops may have different capacities for passengers, mail, and  freight
seperate_halt_capacities = 0

# three modes (default = 0)
# 1: the payment is only relative to the distance to next interchange, 2 to the trips destination (default 0 is distance since last stop)
pay_for_total_distance = 0

# things to overcrowded destinations won't load if active (default off)
avoid_overcrowding = 0

# do not create goods/passenger/mail when the only route is over an overcrowded stop
no_routing_over_overcrowded = 0

# These settings determine the population, visitor demand, jobs and mail per "level" of building.
# Each of these things can be set independently in the buildings' .dat files, but for older paksets
# or paksets from Standard, only a "level" will be supplied, so these conversion factors are
# important in those cases.
population_per_level = 3
visitor_demand_per_level = 3
jobs_per_level = 2
mail_per_level = 1

# These settings determine the number of passenger trips that each person makes per game month
# and the number of items of mail that each unit of mail demand produces per month, in 1/100ths.
# This does *not* include onward and return trips, however, and is *before* adjustment for the
# meters per tile and bits per month scales.
passenger_trips_per_month_hundredths = 200
mail_packets_per_month_hundredths = 10

# This setting determines the maximum number of onward trips that passengers may make on a journey.
# The actual number of onward trips for any given packet of passengers is a random number of anywhere
# between 1 and this figure. This is only applicable if passengers are in fact going to make an onward
# trip, the distribution_weight of which is determined by a different setting (see below).
max_onward_trips = 3

# This figure determines how likely that it is that passengers will make any onward trips at all. It
# is expressed in percentage terms.
onward_trip_chance_percent = 25

# This is the distribution_weight, in percentage, that any given passenger journey will be a commuting trip. Any
# trip that is not a commuting trip is classed as a visiting trip.
commuting_trip_chance_percent = 67

# The following settings determine the way in which individual packets of passengers decide
# what their actual journey time tolerance is, within the above ranges. The options are:
#
# 0 - Even distribution
# Every point between the minimum and maximum is equally likely to be selected
#
# 1 - Normal distribution (http://en.wikipedia.org/wiki/Normal_distribution)
# Points nearer the middle of the range between minimum and maximum are more likely
# to be selected than points nearer the end of the ranges.
#
# 2 - Positively skewed normal distribution (squared) (http://en.wikipedia.org/wiki/Skewness)
# Points nearer the a point in the range that is not the middle but is nearer to the lower
# end of the range are more likely to be selected. The distance from the middle is the skew.
#
# 3 - Positively skewed normal distribution (cubed)
# As with no. 2, but the degree of skewness (the extent to which the modal point in the range
# is nearer the beginning than the end) is considerably greater.
#
# 4 - Positively skewed normal distribution (squared recursive)
# As with nos. 2 and 3 with an even greater degree of skew.
#
# 5 - Positively skewed normal distribution (cubed recursive)
# As with nos. 2, 3 and 4 with a still greater degree of skew.
#
# 6 and over - Positively skewed normal distribution (cubed multiple recursive)
# As with nos. 2, 3, 4, and 5 with an ever more extreme degree of skew. Use with caution.

random_mode_commuting = 2
random_mode_visiting = 2

################################## Industry settings #################################

# when a city reaches 2^n times of this number
# then a factory is extended, or a new factory chain is spawned
#industry_increase_every = 2000

# smallest distance between two adjacent factories
min_factory_spacing = 6

# max distance for connected factories
# if percentage>0, it will be in percent of the largest map dimension
# percentage also overrides the absolute value
max_factory_spacing_percentage = 25
#max_factory_spacing = 40

# allow all possible supplier to connect to your factories?
# best to leave it in default position. (only on for simuTTD)
crossconnect_factories = 0

# how big is the distribution_weight for crossconnections in percent
# (100% will give nearly the same results as crossconnect_factories=1)
crossconnect_factories_percentage = 33

# how much is the total electric power available (in relation to total production) in parts per thousand
electric_promille = 1100

# true if transformers are allowed underground (default)
allow_underground_transformers = 1

# with this switch on (default), overcrowded factories will not receive goods any more
just_in_time = 1

# How much amount in transport is sent before further distribution stops
# This is only enabled when "just_in_time" is enabled
# The limit is given in percent of factory storage (0=off)
#
# This number is (as of Simutrans-Extended 11.16) scaled to the ratio
# of the time that it takes the industry to consume its stock to the average
# lead time for new deliveries. Values of slightly over 100% are recommended.
maximum_intransit_percentage = 110

# use beginner mode for new maps (can be switched anyway on the new map dialog)
first_beginner = 0

# number of periods for averaging the amount of arrived pax/mail at factories for boost calculation
# one period represents a fixed interval of 2^18 ms in-game time
# value can range from 1 to 16, inclusive; 1 means no averaging; default is 4
factory_arrival_periods = 4

# whether factory's pax/mail demands are enforced or not; default is on
factory_enforce_demand = 1

################################# Display settings ################################

# if defined, the default tool will try to calculate straight ways (like OpenTTD)
straight_way_without_control = 0

# player color can be fixed for several players or set totally random
# in the latter case each player will get unique but random coloring
# (default 0)
random_player_colors = 0

# when set here, the player will always get these player colors, even
# if random colors were enabled.
# color values range from 0 to 27 for first and second player color
# and there are 0...15 player in the game
player_color[1] = 1,4

# remove companies without convois after x month (0=off, 6=default)
remove_dummy_player_months = 6

# remove password of abandoned companies (without any building activity) after x month (0=off default)
unprotect_abondoned_player_months = 0

# how long is a diagonal (512: factor 2=1024/512, old way, 724: sqrt(2)=1024/724
# THIS WILL BE ONLY USED, WHEN THIS FILE IS IN THE pakxyz/config!
#diagonal_multiplier = 724

# how height is a tile in z-direction (default 16, TTD 8)
# THIS WILL BE ONLY USED, WHEN THIS FILE IS IN THE pakxyz/config!
#tile_height = 16

# (=1) drive on the left side of the road
drive_left = 0

# (=1) signals on the left side
signals_on_left = 0

# Do you want to have random pedestrians in town? Look nice but needs some
# CPU time to process them. (1=on, 0=off)
# Impact on frame time: ~10% (16 cities on a new standard map)
random_pedestrians = 1

# Do you want to have random pedestrians after pax are reaching this
# destination? May generate quite a lot. (1=on, 0=off)
stop_pedestrians = 1

# there are some other grounds (like rocky, lakes etc. )
# which could be added to the map (default 10)
# show random objects to break uniformity (every n suited tiles)
random_grounds_probability = 10

# show random moving animals (n = every n suited tiles, default 1000)
random_wildlife_probability = 1000

# animate the water each interval (if images available)
# costs some time for the additional redraw (~1-3%)
water_animation_ms = 250

# Show info windows for private cars and pedestrians?
# (1=on, 0=off)
pedes_and_car_info = 0

# How much citycars will be generated
citycar_level = 5

# After how many month a citycar breaks (and will be forever gone) ...
# default is ten years
default_citycar_life = 36

# Show infos on trees?
# (1=on, 0=off)
tree_info = 1

# Show infos also on bare ground?
# (1=on, 0=off)
ground_info = 1

# Show passenger level of townhalls?
# (1=on, 0=off)
townhall_info = 1

# do not show the button to add an inverted schedule for rail based convois
# (1=hide, 0=show anyway)
hide_rail_return_ticket = 1

# always open only a single info window for the ground,
# even if there are more objects on this tile
#only_single_info = 1

# show a tooltip on convois at several conditions
# 0 no messages
# 1 (default) only no_route and stuck
# loading and waiting at signals too
#show_vehicle_states = 1

# show (default) tiles with a halt when editing a schedule
#visualize_schedule = 1

# Should stations get numbered names? (1=yes, 0=no)
#numbered_stations = 0

# Show name signs and statistic?
# 0 = don't show anything
# 1 = station names
# 2 = statistics
# 3 = names and statistics
# The visual style is added to this number:
#   0 = black name in color box
#   4 = name in player color with outline
#   8 = box left of name in yellow outline
show_names = 3

# Color of cursor overlay, which is blended over marked ground tiles
# The available colors and their numbers can be found on
#    http://simutrans-germany.com/wiki/wiki/tiki-index.php?page=en_FactoryDef#mapcolor
# Suggested values (155 is the default)
# -- pak64, pak.german, pak128
#cursor_overlay_color = 155
# -- pak128.japan
#cursor_overlay_color = 149

# Color of background default: COL_GREY2 (210)
#background_color = 210

# there are three different ways to indicate an active window

# first: draw a frame with titlebar color around active window
#window_frame_active = 0

# second: draw the title with a different brightness (0: dark ... 6: bright)
front_window_bar_color = 1
bottom_window_bar_color = 3

# third (best together with 2nd):use different text color for bar
# some colors are 215-white, 240-black, 208-214 all shades of gray
front_window_text_color = 215
bottom_window_text_color = 209

# when moving, you can use windows to snap onto each other seamlessly
# if you do not like it, set the catch radius to zero
window_snap_distance = 8

# show tooltips (default 1=show)
show_tooltips = 1

# tooltip background color (+-1 around this index is used), taken from playercolor table
tooltip_background_color = 4

# tooltip text color (240=black, 215=white)
tooltip_text_color = 240

# delay before showing tooltip in ms (default 500ms)
tooltip_delay = 500

# duration in ms during tooltip is visible (default 5000ms=5s)
tooltip_duration = 5000

# show graphs old style (right to left) or rather left to right (default)
left_to_right_graphs  = 1

# if run on a mobile device, setting hide_keyboard=1 will only show the keyboard
# when there is text to input in a dialogue. Other than that text input will not
# be possible (no keyboard short cuts).
# Currently only supported with SDL2
hide_keyboard = 0

# 1=top, 2=vertical centre, 3=bottom, 4=left, 8=horizontal centre, 12=right
# default for minimap is 1+12=13
compass_screen_position = 0

# Should either account (default) or net wealth be shown below the screen?
player_finance_display_account = 1

################################### Finance settings ##################################
#
# These values are usually set in the pak files
# You can adjust all the cost in the game, that are not inside some pak file
#

# Starting money of the player. Given in Credit cents (1/100 Cr)
#starting_money = 20000000

# New system of year dependent starting money. Up to ten triplets are
# possible. The entries are of the form:
# startingmoney[i]=year,money(in 1/10 credits),interpolate (1) or step(0)
# starting_money[0]=1930,20000000,1
# starting_money[1]=2030,35000000,1

# allow buying obsolete vehicles (=1) in depot
allow_buying_obsolete_vehicles = 1

# vehicles can lose a part of their value once they have been used
# the loss is given in 1/1000th, i.e. 300 means the value will be 70%
used_vehicle_reduction = 0

# lowest possible income with speedbonus (1000=1) default 125 (=1/8th)
bonus_basefactor = 125

# if a convoi runs on a way that belongs to another player, toll may
# be charged. The number given is the percentage of the running cost
# of the convoi or the way cost (include electrification if needed).
# (default 0)
toll_runningcost_percentage = 0
toll_waycost_percentage = 0

# Maintenance costs of buildings
#maintenance_building = 2000

# first stops: the actual cost is (cost*level*width*height)
#cost_multiply_dock=500
#cost_multiply_station=600
#cost_multiply_roadstop=400
#cost_multiply_airterminal=3000
#cost_multiply_post=300
#cost_multiply_headquarter=1000

# cost for depots
#cost_depot_air=5000
#cost_depot_rail=1000
#cost_depot_road=1300
#cost_depot_ship=2500

# other construction/destruction stuff
#cost_buy_land=100
#cost_alter_land=1000
#cost_set_slope=2500
#cost_found_city=5000000
#cost_multiply_found_industry=20000
#cost_remove_tree=100
#cost_multiply_remove_haus=1000
#cost_multiply_remove_field=5000
#cost_transformer=2500
#cost_maintain_transformer=20

# in beginner mode, all good prices are multiplied by a factor (default 1500=1.5)
beginner_price_factor = 1500

# The number of months of maintenance that the make public tool costs to use
cst_make_public_months = 60

################################### Miscellaneous settings ##################################
#
# also pak dependent
#

# minimum distance between two townhalls
#minimum_city_distance = 16

# Minimum distance of a city attraction to other special buildings
#special_building_distance = 3

# Max. length of initial intercity road connections
# If you want to speed up map creation, lower this value.
# If you want more initial intercity roads, raise this value.
# If the value is too large then very long bridges might be created.
# 200 seems to be a good compromise between speed and road number
# note: this will slow down map creation dramatically!
#
#intercity_road_length = 200

# This is the maximum number of tiles that a road, river or canal
# that is a public right of way may be diverted for an existing
# public right of way to be deleted. Diversion allows players to
# change the course of public rights of way to accommodate, for
# example, railways whilst protecting their integrity.

max_diversion_tiles = 16

# This is the fraction of the way's total wear capacity below which
# the way will count as degraded and be automatically renewed or,
# if the player has insufficient money or auto-renewal has been
# disabled for the way in question, will enter a degraded state
# in which the speed limit will be reduced. (At a state of 0,
# the way will become totally impassable).
# Default: 7 (approx. 14%).

way_degradation_fraction = 7

# These two settings determine the default relationship between
# the weight of vehicles and their way wear factors. Air and
# road vehicles use the "road_type", and all others apart from
# maglev and water (which are hard coded to zero) use the
# "rail_type". This only applies to vehicles whose way wear
# factor is not specified in the individual vehicle definitions.
# The default for road is 4, based on the "fourth power law":
# http://www.pavementinteractive.org/article/equivalent-single-axle-load/
# The default for rail is 1.

way_wear_power_factor_road_type = 4
way_wear_power_factor_rail_type = 1

# This is the setting to calibrate the way wear system. This
# is only effective for vehicles which do not have their way
# wear factor set in their individual .dat files. For an
# explanation of the standard axle load (in tonnes), see
# the link above.
# The default is 8, which is the UK standard for road.

standard_axle_load = 8

# This is the way wear factor exerted on roads by all
# "citycars" (that is, the automatically generated
# private road traffic) in the game, measured in
# 1/10,000ths of a standard axle load.
# Default = 2

citycar_way_wear_factor = 2

# Type of intercity roads - must be available as PAK file.
# Intercityroad with timeline: intercity_road[number]=name,intro-year,retire-year
# .. number = 0..9 - up to ten different intercity_roads possible
# .. name = name of an existing pak file
#intercity_road[0] = asphalt_road,0,1990
# default: city_road
# (old, 102.2.2 and before) intercity_road_type = asphalt_road

# Type of city roads - must be available as PAK file.
# Cityroad with timeline: city_road[number]=name,intro-year,retire-year
# .. number = 0..9 - up to ten different city_roads possible
# .. name = name of an existing pak file
#city_road[0] = dirt_road,1920,1940
# default: asphalt_road
# (old, 102.2.2 and before) city_road_type = city_road

# now river stuff
# first river type (should be defined in pak dependent file)
# The highest number is the smallest. A river with max_speed==0 is not navigable by ships.
#river_type[0] = river
#river_type[1] = small_river
#river_type[2] = just_the source

# river number (16 gives a few nicely branched rivers)
#river_number = 16

# min length
#river_min_length = 16

# max length
#river_max_length = 320

# This is the distance in meters at which train drivers can see signals ahead.
# Trains have to brake in time for signals which might be at danger, so this
# distance affects train speed.
sighting_distance_meters = 250

# This is the maximum speed at which rail (including narrow gauge, monorail and maglev)
# vehicles may travel in the drive by sight working method. 0 = as fast as they can
# stop in time with no other limit. Note that this does not apply to trams.

max_speed_drive_by_sight_kmh = 35

# The following settings apply to time interval signalling only.
#
# time_interval_seconds_to_clear is the time, in seconds, after a train has completely passed a time interval
# signal that it will reset its aspect to clear.
#
# time_interval_seconds_to_caution is the time, in seconds, after a train has completely passed a time interval
# signal that it will reset its aspect to caution.
#
# Default: clear - 600 (10 minutes); caution - 300 (5 minutes)

time_interval_seconds_to_clear = 600
time_interval_seconds_to_caution = 300

# Corners of greater than 45 degrees have their radius calculated. However, because of
# the rigid tile system in Simutrans, this produces unrealistic results when applied
# to corners of 45 degrees. A value is thus specified here. This affects the speed
# at which vehicles can take 45 degree corners.
#
# If this is set to 0, 45 degree corners are treated as straight.
#
# If this is set to 9999, adjacent pairs of degree corners are treated as half the
# radius of a 90 degree corner (and non-adjacent pairs scaled according to the distance
# between them). This is technically correct but does not work well in Simutrans because
# of the inability to have gentler corners.
#
assumed_curve_radius_45_degrees = 1000

# Towns have speed limits. This value, expressed in km/h, is the speed limit
# value for urban roads in the game.
# Default: 50
# Set to 0 to disable town speed limits (roads will have their base speed limits in towns).
# This only applies to roads.

town_road_speed_limit = 50

# This allows for height maps with more height levels
new_height_map_conversion = 0

# disable companies from making ways public with the appropriate tool
# even when disabled companies can still make stops public
# does not affect public service provider player
disable_make_way_public = 0

################################# Tree settings #################################
#  please be careful in changing them, I spent lot of time finding optimals.
#  those values have impact on no. of spawned trees -> memory consumption
#
# Number of trees on square 2 - minimal usable, 3 good, 4 very nice looking
max_no_of_trees_on_square = 3

# Base forest size - minimal size of forest - map independent
forest_base_size = 36

# Map size divisor - smaller it is the larger are individual forests
forest_map_size_divisor = 38

# Forest count divisor - smaller it is, the more forest are generated
forest_count_divisor = 16

# Determines how densely spare trees are going to be planted (works inversely)
forest_inverse_spare_tree_density = 400

# climate with trees entirely (1: water, 2:desert, 4:tropic, 8:mediterran, 16:temperate, 32:tundra, 64:rocky, 128:arctic)
# zero (default) means no climate with at least one tree per tile
tree_climates = 4

# climates with no trees at all (desert and arctic at the moment)
no_tree_climates = 130

# if set, no trees will be created at all (save about 30% memory and
# the season change will be much smoother on small machines)
#no_tree = 0

################################### Time settings ###################################

# Enforce vehicle introduction dates?
# 0 = all vehicles available from start of the game
# 1 = use introduction dates
# 2 = (default) use settings during game creation, new games off
# 3 = use settings during game creation, new games on
#
use_timeline = 3

# Starting year of the game:
# Setting it below 1850 is not recommended for 64 set!
# You will have problems with missing vehicles, do not complain if you do so!
# Setting it above 2050 will render game bit boring - no new vehicles.
#
# other recommended value for 64 is 1956
#
starting_year = 1930

# Starting month of the game for people who want to start in summer (default 1=January)
starting_month = 1

# Should month be shown in date?
# (0=no, 1=yes, 2>=show day in japan format=2, us format=3, german=4, japanese no season=5, us no season=6, german no season = 7, hours/minutes scale = 8)
# This is most useful, if you use longer months than the default length (see below)
# The hours/minutes scale shows the time in hours/minutes as used for determining journey times and other short times. It is the recommended setting.
show_month = 8

# Global time multiplier (will be saved with new games)
# 2^bits_per_month = duration of a game month in milliseconds real time
# default before 99.x was 18. For example, 21 will make the month 2^3=8 times longer in real time
# production and maintenance cost will be adjusted accordingly.
#
bits_per_month = 20

################################# System settings #################################

# compress savegames?
# "binary" means uncompressed, "zipped" means compressed
# "bzip2" uses another compression algorithm
# other options are "xml", "xml_zipped" and "xml_bzip2"
# xml detects more errors of broken savegames but files are much larger
# bzip2 savegames are smaller than zipped but saving/loading takes longer
saveformat = zipped

# Alternate format for faster autosaves
autosaveformat = zipped

# autosave every x months (0=off)
autosave = 0

# display (screen/window) width
# also see readme.txt, -screensize option
#display_width  = 704

# display (screen/window) height
# also see readme.txt, -screensize option
#display_height = 560

# show full screen
fullscreen = 0

# For versions of Simutrans compiled to work in a multithreaded system (which will improve
# certain aspects of performance on multiple core/processor machines), this is the number
# of threads that will be used. Maximum: 12.
threads = 4

# maximum size of tool bars (0 = no limit)
# if more tools than allowed by height,
# next and prev arrows for scrolling appears
toolbar_max_width = 0
toolbar_max_height = 0

# How many frames per second to use? Display may look pretty until 10 or so
# (depends very much on computer, game complexity and graphics driver)
frames_per_second = 15

# during zooming out simutrans may get slow due to the very high number
# of tiles visible. If the tiles become equal or smaller than the tile size
# below, a simpler clipping algorithm will be used, which will give some
# clipping errors, but is faster. (default size = 4)
# However, during normal gaming this will be determined automatically, so you
# usually do not need to set this manually.
simple_drawing_tile_size = 4

# you can force fast redraw for fast forward by this (default off)
simple_drawing_fast_forward = 1

# How much faster should the game proceed with fast forward (limited by your computer and size of the map)
fast_forward = 100

################################### Network settings ##############################
#
# Synchronized networking is always a trade off between fast response and safe
# connections. A more relaxed timing will cause delay of commands but is more
# likely to compensate for clients running slightly faster than the rest.
#

# Sets the local addresses Simutrans should listen on and use for making outgoing connections
# By default it will use all local IPv4 and IPv6 addresses
# This setting has no effect if Simutrans has been compiled with the USE_IP4_ONLY flag set!
# The addresses listed will be tried in the order specified
# A DNS name may be specified, this will be resolved and Simutrans will attempt to listen
# on all of the addresses returned.
listen = 46.32.231.222

# How much delay before commands are executed on the clients.
# A larger number will catch even clients running slightly ahead but cause delay.
# This is set by the server side.
#
# The sum of server_frames_ahead and additional_client_frames_behind should be
# evenly divisible by server_frames_between_checks for best network performance.
server_frames_ahead = 4

# How much extra delay in command execution on the client side, on top of server_frames_ahead.
# A larger number can compensate for larger fluctuations in communication latency.
# This is set by the client side.
#
# The sum of server_frames_ahead and additional_client_frames_behind should be
# evenly divisible by server_frames_between_checks for best network performance.
additional_client_frames_behind = 0

# In network mode, there will be a fixed number of screen updates before a step.
# Reasonable values should result in 2-5 steps per second.
#
# This is the number of sync steps for every step. Sync steps handle user interaction
# and things that update regularly: steps handle things that take much computational
# effort, such as routing. Each step takes much more time than each sync step.
# This setting is only active in network mode: the timing is automated in single
# player mode.
server_frames_per_step = 8

# The server sends after a fixed number of steps some information to the clients.
# Large values here means: reduced server communication (if that is of importance...)
# Small values should improve the timing of the clients.
server_frames_between_checks = 32

# Automatically announce server on the central server directory (http://servers.simutrans.org/)
# 0 (default) = off, 1 = on
server_announce = 1

# Interval of server announcement (if enabled)
# Value is number of seconds between server announcements, default is 900 (15 minutes)
# Minimum value is 60 (1 minute), for accurate listing it is recommended not to increase
# this value to greater than 3600 (1 hour)
# To disable announcements set server_announce to 0
server_announce_interval = 900

# Fully Qualified Domain Name (FQDN) or IP address of your server (IPv6 or IPv4)
server_dns = bridgewater-brunel.me.uk

# Name of server in server listing
server_name = Bridgewater-Brunel 1

# Additional information about your server (for display on the list server)
server_comments = A server for online play of large maps with the latest versions of Simutrans-Extended

# Email address of server maintainer (for display on the list server)
server_email = [REDACTED]

# The password required for administering the server.
# NOTE: This should be changed from the default when used, for
# obvious reasons.
server_admin_pw  = [REDACTED]

# Pakset download URL (for display on the list server)
server_pakurl = http://bridgewater-brunel.me.uk/downloads/nightly/pakset/pak128.britain-ex-nightly.tar.gz

# Server info URL (for display on the list server)
server_infurl = http://www.bridgewater-brunel.me.uk

# Pause server when no clients are connected
pause_server_no_clients = 1

# Nickname when joining network games
nickname = Minister of Transport

# Server saves savegame when being killed (default=0 off)
server_save_game_on_quit = 1

# Chat window transparency (0=off; 25, 50 and 75 are possible)
chat_transparency = 75

# The number of game months before a player making no changes
# to its company is unlocked automatically to allow other players
# to take over.
unprotect_abandoned_player_months = 180

# The number of game months before a player that has never built
# anything substantive is deleted automatically.
remove_dummy_player_months = 24

# Here you can add a message about your server (this file is read again each time a client joins)
server_motd_filename = motd.txt
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on September 25, 2018, 08:54:33 AM
I've pushed a further improvement for partial saves to GitHub, which was discussed for Standard here (https://forum.simutrans.com/index.php/topic,18517.0.html). The improvement is that the temporary filename is derived from the given filename, making it safe to run multiple instances of Simutrans on one filesystem simultaneously.
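For readers unfamiliar with the pattern, here is a minimal sketch of saving through a temporary file whose name is derived from the target filename. It is not the code that was pushed: the helper name `save_via_temp_file` and the `.tmp` suffix are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper, not the Simutrans implementation: write `data` to a
// temporary file named after the target, then rename it over the target.
bool save_via_temp_file(const std::string &filename, const std::string &data)
{
    assert(!filename.empty());

    // Deriving the temporary name from the target means that two instances
    // saving different games never write to the same temporary file.
    // The ".tmp" suffix is an assumption for this sketch.
    const std::string temp_name = filename + ".tmp";

    std::ofstream out(temp_name.c_str(), std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out << data;
    out.close();
    if (out.fail()) {
        // The write or the close failed: discard the partial file and leave
        // any existing save untouched.
        std::remove(temp_name.c_str());
        return false;
    }

    // std::rename replaces an existing target on POSIX but may fail on
    // Windows, so the old file is removed first. This keeps the sketch
    // portable at the cost of a short window in which no save exists.
    std::remove(filename.c_str());
    return std::rename(temp_name.c_str(), filename.c_str()) == 0;
}
```

With a fixed temporary name, two instances saving at the same time could overwrite each other's partial file; with a derived name they can only collide if they save to the same target.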
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 25, 2018, 10:12:11 PM
Thank you for that - I had not thought of that issue, but this is a sensible improvement.

Now incorporated.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 27, 2018, 07:41:34 PM
I have begun some testing in relation to the desync bug. The first test was to run a local client/server pair on the saved game from the 16th of September (early 1937 in game) with the latest cross-compiled release build from the nightly server. The result of this test, just completed, was that this ran for over an hour without losing synchronisation.

The next step is to try with a later saved game. I note, incidentally, that a 21st of September saved game will crash the release build, but I cannot reproduce this in the debug build.

Edit: The second round of testing with a saved game from earlier to-day also failed to show any desync when run in a server/client pair locally for about an hour (until October 1940 in-game).

Can all those who are having desync problems let me know which operating systems they are using? It is possible that the issue is occurring only when Windows clients are connected to a Linux server.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on September 27, 2018, 09:04:38 PM
I notice a lot of talk about the overtaking patch and that it might be the issue.
When I made my airports, I utilized that feature for some of the bus stations to the airports, particularly the "parallel stop mode". That was one of the last things I did before it became more unstable and I had to do other real world stuff.

I could try to rebuild those sections with just the normal way mode and see if that helps. That is, if I ever get on to the server....
Edit:
Got on the server and rebuilt all such stretches that I remember I have built.

Although I am currently connected and have not yet desynced, it happened earlier in the previous weeks, and I run Windows 10 with the 64-bit executable. The 32-bit executable has been unable to connect to the server game without running out of memory within seconds.

edit2:
Just desynced....
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on September 27, 2018, 09:13:32 PM
I have been using the parallel stop mode rather extensively since it was introduced. It seemed to work fine.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 27, 2018, 09:33:57 PM
Thank you both for your feedback - that is useful. The 32-bit executable will not work as the game takes too much memory for this to be usable.

I have tried just now to connect with the latest build to see for how long that I can stay connected, but I had blue screen crashes on two separate occasions of trying, despite having replaced my PSU (with visibly melted capacitors) earlier in the week. I suspect that I may need a more comprehensive computer upgrade before carrying out intensive work on Simutrans, which might take a number of months to prepare and execute.

Ves - thank you for the test, but it seems from your results and Rollmaterial's later observations that this test is inconclusive (the loss of synchronisation still occurring being not inconsistent with the problem either being or not being with the new overtaking code, which is capable of having effect even if the features are not explicitly used).

If anyone could check whether it is possible to stay in sync for an hour or so with a Linux client, that would be very helpful.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 28, 2018, 03:12:41 AM
The overtaking code appears to be typecasting to "(volatile uint8)" in 4 places.
Quote
'type' : top-level volatile in cast is ignored
The compiler detected a cast to an r-value type which is qualified with volatile, or a cast of an r-value type to some type that is qualified with volatile. According to the C standard (6.5.3), properties associated with qualified types are meaningful only for l-value expressions.
I suspect from a compiler perspective this makes no sense. Volatile is used to annotate how the compiler should access a storage location, not a value.
Code: [Select]
for(  uint8 pos=1;  pos<(volatile uint8)sg[i]->get_top();  pos++  ) {
What get_top returns is a value. It cannot be volatile, as a value has no associated storage location whose access volatile could govern.

The fact volatile is being used there raises alarm bells of possible race conditions.
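To make the point concrete, here is a small sketch, assuming a C++11 compiler. `get_top_stub` is a hypothetical stand-in for a getter that returns by value; it is not the real get_top. The static_assert checks that the cast expression has the plain unqualified type, and the two loops run the same number of times with and without the cast.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a getter that returns by value, like get_top().
static std::uint8_t get_top_stub() { return 4; }

// The cast yields a prvalue of non-class type, so the qualifier is dropped:
// the expression has type uint8_t, not volatile uint8_t.
static_assert(std::is_same<decltype((volatile std::uint8_t)get_top_stub()),
                           std::uint8_t>::value,
              "top-level volatile in a cast to a non-class type is ignored");

// Loop written in the style of the line quoted above.
unsigned count_with_cast()
{
    unsigned iterations = 0;
    for (std::uint8_t pos = 1; pos < (volatile std::uint8_t)get_top_stub(); pos++) {
        iterations++;
    }
    return iterations;
}

// The same loop with the cast removed.
unsigned count_without_cast()
{
    unsigned iterations = 0;
    for (std::uint8_t pos = 1; pos < get_top_stub(); pos++) {
        iterations++;
    }
    return iterations;
}

// Both loops visit positions 1, 2 and 3.
void check_casts_make_no_difference()
{
    assert(count_with_cast() == count_without_cast());
}
```

Some compilers warn about the first loop (the MSVC message is the one quoted above), which is itself a sign that the qualifier does nothing there.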

I also spotted the following which might or might not be bad.
Code: [Select]
Severity Code Description Project File Line Suppression State
Warning C4701 potentially uninitialized local variable 'btyp' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\descriptor\reader\building_reader.cc 637
Warning C4701 potentially uninitialized local variable 'f_desc' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\display\font.cc 220
Warning C4701 potentially uninitialized local variable 'f_height' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\display\font.cc 219
Warning C4701 potentially uninitialized local variable 'g_desc' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\display\font.cc 91
Warning C4701 potentially uninitialized local variable 'h' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\display\font.cc 91
Warning C4701 potentially uninitialized local variable 'g_width' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\display\font.cc 127
Warning C4701 potentially uninitialized local variable 'image' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\gui\components\gui_convoy_assembler.cc 1226
Warning C4701 potentially uninitialized local variable 'image' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\gui\components\gui_convoy_assembler.cc 1484
Warning C4701 potentially uninitialized local variable 'byte_length' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\gui\components\gui_textinput.cc 187
Warning C4701 potentially uninitialized local variable 'byte_length' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\gui\components\gui_textinput.cc 216
Warning C4701 potentially uninitialized local variable 'n' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\gui\settings_stats.cc 1113
Warning C4701 potentially uninitialized local variable 'to' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\vehicle\simroadtraffic.cc 223
Warning C4701 potentially uninitialized local variable 'ri' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\simsys_s2.cc 231
Warning C4701 potentially uninitialized local variable 'cost' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\simtool.cc 5041
Warning C4701 potentially uninitialized local variable 'gd' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\vehicle\simvehicle.cc 2276
Warning C4701 potentially uninitialized local variable 'gd' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\vehicle\simvehicle.cc 2254
Warning C4701 potentially uninitialized local variable 'this_direction' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\vehicle\simvehicle.cc 3711
Warning C4701 potentially uninitialized local variable 'station_signal' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\vehicle\simvehicle.cc 7128
Warning C4701 potentially uninitialized local variable 'this_stop_signal_index' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\vehicle\simvehicle.cc 7197
Warning C4701 potentially uninitialized local variable 'next_next_signal' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\vehicle\simvehicle.cc 6879
Warning C4701 potentially uninitialized local variable 'tolerance' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\simworld.cc 6292
Warning C4701 potentially uninitialized local variable 'walking_time' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\simworld.cc 7108
Warning C4701 potentially uninitialized local variable 'car_minutes' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\simworld.cc 7261
Warning C4701 potentially uninitialized local variable 'best_journey_time' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\simworld.cc 6801
Warning C4701 potentially uninitialized local variable 'ts' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\squirrel\sqstdlib\sqstdstring.cc 138
Warning C4701 potentially uninitialized local variable 'ti' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\squirrel\sqstdlib\sqstdstring.cc 139
Warning C4701 potentially uninitialized local variable 'tf' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\squirrel\sqstdlib\sqstdstring.cc 140
Warning C4701 potentially uninitialized local variable 'name' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\gui\times_history.cc 113
Warning C4701 potentially uninitialized local variable 'best' used Simutrans-Extended Normal d:\simutransbuild\simutrans extended\simutrans-extended\bauer\vehikelbauer.cc 306
Obviously some of them might logically always be initialized. However does that apply to all of them? Uninitialized values leaking in and modifying behaviour could easily cause out of sync errors.
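As a generic illustration of the pattern behind this kind of warning (not taken from any of the files listed), here is a sketch in which one path through a switch assigns nothing to the local. `Mode` and `journey_cost` are hypothetical names. Initialising at the declaration gives every path a defined value, so server and client cannot disagree on it.

```cpp
#include <cassert>

// Hypothetical example types and names; nothing here is Simutrans code.
enum class Mode { cheap, normal, express };

int journey_cost(Mode mode)
{
    // Initialised at the declaration. If this were just `int cost;`, the
    // default branch below would return an indeterminate value, which is
    // the situation a "potentially uninitialized" warning points at.
    int cost = 0;
    switch (mode) {
        case Mode::cheap:
            cost = 10;
            break;
        case Mode::normal:
            cost = 25;
            break;
        default:
            // No assignment on this path.
            break;
    }
    return cost;
}

// Usage: the unhandled mode now yields a defined value on every machine.
void demo()
{
    assert(journey_cost(Mode::normal) == 25);
    assert(journey_cost(Mode::express) == 0);
}
```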
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 28, 2018, 09:55:49 AM
Casting to volatile is insane - however, it is uncertain whether this is the cause of the trouble. It will be very difficult to fix if I am not able to reproduce it locally, as there will be no way of testing whether any given change makes a difference.

It would be very useful to know whether anyone is able to connect to the server on a Linux computer to test whether the synchronisation issue is limited to Windows clients connecting to the Linux server, since I cannot reproduce it with a Windows client connecting to a Windows server.
As to the potentially uninitialised local variables, if they were actually uninitialised, they would be detected either in Dr. Memory or using the Visual Studio runtime checks. There are lots of local variables which are not initialised on declaration but are initialised later.
Edit: Looking at the "volatile" part of the code, it is unclear what could simultaneously alter the output of gr->get_top(), as the actual movement of vehicles is not multi-threaded. I have experimentally removed all casts to volatile - it does not seem to cause any instability on a basic map, and it would be interesting to see whether there is any difference when this is pushed to the server. However, it is difficult to see what difference that this could make.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 28, 2018, 10:32:22 AM
Quote
Casting to volatile is insane - however, it is uncertain whether this is the cause of the trouble. It will be very difficult to fix if I am not able to reproduce it locally, as there will be no way of testing whether any given change makes a difference.
It is literally utter nonsense.
Quote
The compiler detected a cast to an r-value type which is qualified with volatile, or a cast of an r-value type to some type that is qualified with volatile. According to the C standard (6.5.3), properties associated with qualified types are meaningful only for l-value expressions.
The volatile key word is being completely ignored as it makes no syntax sense.
https://en.cppreference.com/w/cpp/language/cv
https://en.cppreference.com/w/cpp/language/value_category
Quote
Edit: Looking at the "volatile" part of the code, it is unclear what could simultaneously alter the output of gr->get_top(), as the actual movement of vehicles is not multi-threaded. I have experimentally removed all casts to volatile - it does not seem to cause any instability on a basic map, and it would be interesting to see whether there is any difference when this is pushed to the server. However, it is difficult to see what difference that this could make.
It will make no difference, and that is what is concerning. Someone explicitly placed those type casts there expecting them to do something. They have not been doing anything as they make no sense. This means that the problem that the author considered solved by using them still exists. I suggest trying to track down the author to explain their intended purpose.

Also worth pointing out the following line from the C++ standard (I think? or at least derived from that)...
Quote
This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order).
Quote
Within a thread of execution, accesses (reads and writes) through volatile glvalues cannot be reordered past observable side-effects (including other volatile accesses) that are sequenced-before or sequenced-after within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization.
In addition, volatile accesses are not atomic (concurrent read and write is a data race) and do not order memory (non-volatile memory accesses may be freely reordered around the volatile access).
One notable exception is Visual Studio, where, with default settings, every volatile write has release semantics and every volatile read has acquire semantics (MSDN), and thus volatiles may be used for inter-thread synchronization. Standard volatile semantics are not applicable to multithreaded programming, although they are sufficient for e.g. communication with a std::signal handler that runs in the same thread when applied to sig_atomic_t variables.
Hence one should not be using them to synchronize values at all, except with signal handlers (pseudo interrupts?). Sure they work for that in MSVC, but possibly not GCC.
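For comparison, here is a minimal sketch of the portable way to hand a value from one thread to another, using std::atomic with release/acquire ordering instead of volatile. It assumes C++11 and a platform with thread support; `publish_and_read` and its variables are hypothetical names, not Simutrans code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One thread writes `payload` and then sets `ready`; the calling thread
// waits for `ready` and then reads `payload`.
int publish_and_read()
{
    int payload = 0;
    std::atomic<bool> ready(false);

    std::thread producer([&payload, &ready] {
        payload = 42;
        // Release store: the write to payload above becomes visible to any
        // thread that observes ready == true with an acquire load.
        ready.store(true, std::memory_order_release);
    });
    assert(producer.joinable());

    // Acquire load: once this reads true, reading payload is not a data race.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    const int seen = payload;

    producer.join();
    return seen;
}
```

A volatile flag in place of the atomic would carry no such ordering guarantee under the standard, which is the distinction the quoted text draws.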
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 28, 2018, 10:57:55 PM
I am not sure why it would have been thought necessary to use them here - they are being used on a getter method from Standard. The method is declared as const. Only the main thread should ever be changing the variable that this getter method accesses.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: prissi on September 29, 2018, 12:21:43 PM
When reading through the multitile city building code, I found a remark about floating point calculation. Is this still in? That might be also a cause of desyncs.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 29, 2018, 01:16:08 PM
Quote
When reading through the multitile city building code, I found a remark about floating point calculation. Is this still in? That might be also a cause of desyncs.

I cannot find any relevant floating point code (i.e., code that is not used only in UI or on initial map creation) in simcity.cc, simhalt.cc or gebaeude.cc or their respective header files; I am not sure to what this is referring. I imagine that, if this had been in the city growth code, this desync would have emerged a long time ago.

I have still been unable to reproduce this locally. I tried both with the latest GCC build from the Bridgewater-Brunel server and the latest saved game, and with a Visual Studio release build, and I could not reproduce any desync when the same build was client and server.
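As background on why floating point is a usual suspect: results of floating point code can differ between compilers and platforms, whereas integer arithmetic gives the same answer everywhere. Below is a minimal sketch of doing a rate calculation in integers only. The formula, the name `grow_population` and the choice of expressing the rate in parts per 10,000 are assumptions for illustration, not the city growth code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical growth step: the rate is given in parts per 10,000
// (so 125 means 1.25%), and all arithmetic is done in 64-bit integers.
std::int64_t grow_population(std::int64_t population, std::int32_t rate_per_10000)
{
    assert(population >= 0 && rate_per_10000 >= 0);
    // Integer division truncates identically on every platform, so the
    // server and all clients compute exactly the same result.
    return population + (population * rate_per_10000) / 10000;
}
```

For example, `grow_population(20000, 125)` adds 20000 * 125 / 10000 = 250 and returns 20250.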

This does suggest a desync of the sort where the platform makes a difference - but it is now extremely difficult to test whether this can be reproduced in older versions of either the code or saved game, as I am unable to reproduce this locally.

Again, it would be exceedingly helpful if anyone were able to test whether the desync can be reproduced when connecting from a Linux client to the Bridgewater-Brunel server.

Edit: Incidentally, testing with the Stephenson-Siemens server shows no desync.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 29, 2018, 09:03:47 PM
I am in the process of running a further test: I have temporarily reverted the server to an older saved game from the in-game year of 1937 to see whether clients can stay synchronised with the game with this save. The purpose of the test is to see whether the desync is caused by something new in the code in the last few weeks, or by changes in the saved game itself.

I have kept a backup of the latest saved game both on the server and on my own computer so that this can be restored when the testing is complete.

However, my own testing has been disrupted badly by continuing hardware problems on my own computer. It would therefore be very helpful if anyone could test this by trying to connect and see whether you can stay connected for at least 20 minutes without losing synchronisation.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 30, 2018, 08:06:23 AM
Been connected to the server for >40 minutes, no out of sync, multiple save/load cycles. Something changed with the game state since that save was made.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on September 30, 2018, 09:02:21 AM
Yeah I was also connected for around 30mins with no de-sync during the time I was connected. Also that save is before the great accidental tunnelling incident of 1939 so I am very much in favour of reverting to an earlier save.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on September 30, 2018, 09:05:14 AM
I have been on the server now for 50 minutes without any issues.
This save is from before I started doing any air traffic at all.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 30, 2018, 10:31:22 AM
Thank you all for your feedback: that is most helpful. I am not planning to revert to an earlier save permanently: just for testing, as it is necessary to fix the problem that causes these desyncs in the code, not just try to avoid it in the game.

I note that I do not have a later saved game other than the one backed up from 1940, so I will not be able to try lots of individual tests between the two relevant points.

Dr. Supergood - you mention air traffic. I see that there was already an extensive air network in 1937, so it is doubtful that the issue is general to air traffic. Is there any particular feature of your air transport network which did not exist in anyone's air transport network in 1937?

Has anyone else used any additional features since 1937 that were not used in that year?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on September 30, 2018, 11:37:33 AM
I believe I wrote about air traffic :-P

I used the parallel stop mode on my airports, and I believe I almost doubled the amount of air routes on the server. My routes went to both my own airports but also a lot to other players' airports. The ground connection to my airports was often handled by other players with both buses (with the parallel stop mode) and trains.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 30, 2018, 11:55:30 AM
Quote
Dr. Supergood - you mention air traffic. I see that there was already an extensive air network in 1937, so it is doubtful that the issue is general to air traffic. Is there any particular feature of your air transport network which did not exist in anyone's air transport network in 1937?
You mean Ves?

One of the large airports not owned by me was constantly getting jammed due to a bug involving reserving runways when connected directly to a dock (so convoy loads, and then immediately moves onto runway). Someone rebuilt it.

My air network is largely unchanged, except for maybe a change in rolling stock.
Quote
I used the parallel stop mode on my airports, and I believe I almost doubled the amount of air routes on the server. My routes went to both my own airports but also a lot to other players' airports. The ground connection to my airports was often handled by other players with both buses (with the parallel stop mode) and trains.
Could easily be the parallel stop. I did not use such things at all.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on September 30, 2018, 12:07:50 PM
Quote
Could easily be the parallel stop. I did not use such things at all.
We could try to bog the server down by building that stop mode extensively, as a test....?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on September 30, 2018, 12:32:21 PM
Could it have something to do with other players' convoys using those stops?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on September 30, 2018, 12:39:00 PM
We are trying to do exactly that on the server right now. I have ruined a lot of other players' networks by placing parallel stop mode underneath their stops (which should not be possible...) and we are doing a loop now on the server with different convoys and stops.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on September 30, 2018, 12:47:13 PM
YES DESYNC!!  ;D
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 30, 2018, 12:52:53 PM
No desync here and I was connected at the same time as you. I think you just OoSed due to the long running change schedule OoS that is even in standard.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on September 30, 2018, 12:58:21 PM
Hmmm ok, did everybody usually get kicked out previously?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 30, 2018, 01:02:09 PM
Previously it was so unstable it was impossible for anyone to remain connected for any reasonable length of time. I am guessing the instant a checksum check was performed clients were booted.

In any case our tests are going nowhere slowly. We tried parallel park mode with heavy traffic. We tried one way airports on another server. No OoS.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 30, 2018, 01:55:41 PM
Splendid, thank you all for the testing work, and apologies for the Ves/Dr. Supergood confusion. It would be very useful to know what feature(s) are found to trigger the desyncs. Do make sure to test one feature at a time (although I am sure that you are doing this anyway).

Unfortunately, the fact that this is a problem emerging with new ways of playing/feature use with the existing code will make finding the problem much harder than it would be if it had been introduced very recently, especially if it is not confined to the OTRP. This might take quite a lot of extensive work before progress can be made in other areas.

Dr. Supergood - you mention a bug relating to runways and a dock - I do not believe that this has been reported independently. Would you be able to post a fresh bug report thread for this?
Edit: My apologies: I posted the above without having realised that there was a whole second page of comments beyond my last message.
Ves - were any other people online when you had the desync? The game had been quite stable aside from the out of memory issues before this new issue arose. The 1940 map will lose synchronisation after 1-2 minutes, I believe.
Another possibility is to re-load the 1940 map and remove features, but that might be much harder because people will be kicked before they can do much of that.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on September 30, 2018, 02:59:52 PM
I assume the desync is a combination of RNG seed mismatch and possibly convoy handle mismatch (as the result of RNG seed mismatch creating different private cars)?

One might have to do some crazy code modification to find the fault. For example dumping the RNG state after every stage of every step. The idea would be that one could compare what the server dumps with what the client dumps to find out where/when the OoS occurs. Once the system causing the OoS is found it should either be possible to track it down directly, or more refinement to the listings made to expose details on where during the processing the OoS is occurring.
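A rough sketch of the comparison step described above, assuming both sides have already written out one record per stage of each step. `TraceEntry` and `first_divergence` are hypothetical names, and the stage labels are made up; this is not the existing checklist code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: one entry per stage of each step.
struct TraceEntry {
    std::uint32_t step;      // step counter when the entry was recorded
    std::string stage;       // label of the subsystem that has just run
    std::uint32_t rng_state; // RNG state sampled after that stage
};

// Returns the index of the first entry at which the two traces differ,
// or -1 if they are identical.
long first_divergence(const std::vector<TraceEntry> &server,
                      const std::vector<TraceEntry> &client)
{
    const std::size_t common = std::min(server.size(), client.size());
    for (std::size_t i = 0; i < common; i++) {
        if (server[i].step != client[i].step ||
            server[i].stage != client[i].stage ||
            server[i].rng_state != client[i].rng_state) {
            return static_cast<long>(i);
        }
    }
    // One trace being longer than the other also counts as a divergence.
    return server.size() == client.size() ? -1L : static_cast<long>(common);
}

// Usage: the traces agree on the first entry and differ at index 1, which
// would point at whatever ran during the "cities" stage of step 1.
void demo()
{
    std::vector<TraceEntry> server = {
        {1, "convoys", 111u}, {1, "cities", 222u}, {2, "convoys", 333u}};
    std::vector<TraceEntry> client = server;
    client[1].rng_state = 999u;
    assert(first_divergence(server, client) == 1);
}
```

Once the first differing stage is known, the same idea can be repeated with finer-grained entries inside that stage only.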

I think we have reached the end of meaningful testing with the reverted server game. I was connected without OoS for over 2 hours, including lots of road traffic across a variety of single directional roads with parallel parking.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: TurfIt on September 30, 2018, 06:57:40 PM
Quote
One might have to do some crazy code modification to find the fault. For example dumping the RNG state after every stage of every step. The idea would be that one could compare what the server dumps with what the client dumps to find out where/when the OoS occurs. Once the system causing the OoS is found it should either be possible to track it down directly, or more refinement to the listings made to expose details on where during the processing the OoS is occurring.

That code is exactly what's already there in the checklists - it was never removed after the last major desync hunt a few years ago; Really should have been simply for performance reasons...
Dumping after every stage is too much, instead find the major/minor system and keep diving deeper into them (refining as you say). But, expect many many false roads.
And this assumes you have a really reliable desync. Last time it would desync in the range of 5-15 mins and ended up taking a huge number of hours to finally track down.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on September 30, 2018, 11:24:59 PM
Thank you very much for testing. I am now reverting to the 1940 saved game from yesterday. I am afraid that this issue is likely to take a very long time to resolve, especially since even reproducing the desync is very difficult.

Has anyone any idea about what is different between the 1937 map and the 1940 map such as might cause this?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on October 01, 2018, 04:17:39 PM
During that time I added two large Trolley Bus routes, perhaps a bug with trolleys? I also created 3 bus terminals that used one way roads, one way signs and no-entry signs which I have removed to see if this has any effect.

edit: as expected this made no difference at all.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 01, 2018, 06:42:45 PM
Thank you for that: that is helpful.

I was hoping to try to connect with my Linux computer this evening, but, on attempting to compile, I get the following errors:

Code: [Select]
===> LD  simutrans/simutrans-extended
/usr/bin/ld: build/default/squirrel/sq_extensions.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqclass.o: relocation R_X86_64_32S against symbol `_ZN7SQClass7ReleaseEv' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqdebug.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqlexer.o: relocation R_X86_64_32S against symbol `_ZN7SQTable7ReleaseEv' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqobject.o: relocation R_X86_64_32S against symbol `_ZN15SQFunctionProto4MarkEPP13SQCollectable' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqtable.o: relocation R_X86_64_32S against symbol `_ZTV7SQTable' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqbaselib.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqcompiler.o: relocation R_X86_64_32 against symbol `_ZN10SQCompiler10ThrowErrorEPvPKc' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqfuncstate.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqstate.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqvm.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdaux.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdio.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdrex.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdstring.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdblob.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdmath.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdstream.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdsystem.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output

This all arises in code that I know nothing about and that I believe has been the subject of some merging from Standard lately. If anyone can assist as to what might be the cause of this, that would be most helpful. I note that this does not seem to have affected compiling on the server.
Edit: Using the executable downloaded from the Bridgewater-Brunel server, I have been able to remain connected using a Linux client for a considerable time, over one in-game hour, and across the month boundary into December 1940. This does suggest that this may be one of those exceptionally difficult to find bugs in which Windows and Linux executables diverge in some extremely subtle way. Quite why this arises only in 1940 is hard to understand at this stage.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 01, 2018, 09:15:44 PM
Since GCC is used, one can rule out some causes. The most likely causes would be multithreading (the Linux kernel schedules threads differently from Windows) and memory referencing (different virtual memory allocations).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on October 01, 2018, 11:48:17 PM
Quote
I was hoping to try to connect with my Linux computer this evening, but, on attempting to compile, I get the following errors:
I have transferred virtually no changes to that code recently, although it is possible that changes elsewhere (e.g. to a Makefile) could be relevant.

When did you last successfully compile on that system?
What method of compilation are you using?
Have you tried producing a clean build?
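
To make the clean-build question easier to act on, here is a minimal Python sketch that pulls the object file names out of linker output like that quoted earlier in the thread. It only assumes the message format shown there; the function name is made up for illustration. Whether stale objects are really the cause is a guess: the linker message itself only says that these objects need recompiling with -fPIC.

```python
import re

# Matches messages of the form shown in the linker output above, e.g.
#   /usr/bin/ld: build/.../sqvm.o: relocation R_X86_64_32 against ... recompile with -fPIC
PIE_ERROR = re.compile(
    r"ld: (?P<obj>\S+\.o): relocation \S+ against .* recompile with -fPIC"
)


def objects_needing_rebuild(linker_output):
    """Return the unique object files named in PIE relocation errors, in order."""
    found = []
    for line in linker_output.splitlines():
        match = PIE_ERROR.search(line)
        if match and match.group("obj") not in found:
            found.append(match.group("obj"))
    return found
```

Deleting the listed objects (or the whole build directory) forces them to be recompiled with whatever flags the Makefile currently passes.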
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on October 02, 2018, 01:01:51 PM
Do we have a server save from closer to when the de-sync started happening? I think it was okay until about mid 1939.

The server seemed okay until there was a fatal error (see here: https://forum.simutrans.com/index.php/topic,17611.msg175690.html#msg175690) that was fixed with the nightly build that came out on 21/09/2018. Since then there seem to have been serious de-sync issues. Could it be that, while this bug no longer causes a crash, it is still causing the server issues?

I did notice that several players allowed others access to their networks around this time. Would this have any impact on the server?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 02, 2018, 05:15:32 PM
I have checked - the best that I have is September and December 1939. The fix itself would not cause a desync, as this is a simple test as to whether a pointer has a NULL value before dereferencing that pointer, but it is possible that whatever caused the crash is also somehow responsible for the desync. This crash occurred in the OTRP code (i.e., the new overtaking code). This code had been in the game for a long time before 1939 and it had evidently been used before that time, so this is perhaps not the most promising avenue.

Can anyone give me any idea as to which of the overtaking features were used before and after 1939 in the game?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 02, 2018, 08:46:26 PM
The crash was occurring on a stretch of road that was using classic driving style reliably for around 100 years... I think the road even existed before the feature was merged in.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 02, 2018, 09:12:05 PM
Quote
The crash was occurring on a stretch of road that was using classic driving style reliably for around 100 years... I think the road even existed before the feature was merged in.

Interesting - how were you able to identify what stretch of road was causing the crash?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 03, 2018, 01:18:50 AM
Quote
Interesting - how were you able to identify what stretch of road was causing the crash?
Stack trace. I loaded the server game in a debug build attached to MSVC. Looked at the vehicle that was causing the crash. Was one of my old post horses. Could also resolve the coordinates. Look at the thread for details.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 06, 2018, 12:52:14 PM
I am now running a series of tests on the server game. I have started by using the saved game from September 1939: this does indeed lose synchronisation, so whatever the difference is that is causing the error to be exposed in later games occurred between 1937 and 1939.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 06, 2018, 03:52:39 PM
The result of the first test is that, even with all aircraft removed, the loss of synchronisation still occurs after a few minutes connected.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 06, 2018, 05:18:28 PM
The result of the second test is that, even with a saved game in which all the roads had been set to the standard "two way" overtaking mode on loading using a specially modified executable (as well as all the aircraft removed), the client still lost sync with the server after a few minutes.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 06, 2018, 09:42:35 PM
The result of the third test is significant and interesting. I took the 1937 game that runs without losing synchronisation, then used the public player tool to advance the year to 1940 without changing the game-state other than the date, and re-ran the test. This time, the game would lose synchronisation after a few minutes again. I re-tested with the 1937 game and confirmed that it did not lose synchronisation. This suggests that there is an issue with some item automatically placed in the game which has an introduction or retirement date in around 1939, the most obvious candidates for which are roads.
Edit: Further testing has shown that copying the latest (client) simuconf.tab to the server (save for replicating the server's original network settings) does not prevent the loss of synchronisation from occurring with the >1939 saved game.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 07, 2018, 04:06:21 AM
Might be worth trying to advance the server beyond 1940 to see if there is a date at which the problem stops. This could help locate the specific objects causing the problem. Obviously the year has to be advanced either offline, with the server restarted afterwards, or with clients re-joining afterwards, since one can assume that any Windows client touching 1940 will be out of sync instantly and is only booted later when a checksum check is performed.

It might also be worth clean installing Simutrans on the server (making sure not to lose all saves). Although the pakset is hash checked by clients, files like simuconf.tab are not.

Of course one should make sure that the server is really going out of sync with the clients. It could be something to do with the time that starts somewhere in 1940 causing a false positive OoS detection.
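
To illustrate the point about a client being out of sync instantly but only booted at the next checksum check, here is a toy Python model. It is not the Simutrans network code: the state dictionaries, the check interval and the function names are all invented for the example.

```python
import hashlib


def state_checksum(state):
    """Checksum a toy game state; sorting the keys keeps the result order independent."""
    encoded = repr(sorted(state.items())).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def detect_desync(server_states, client_states, check_interval):
    """Walk both state histories step by step.

    Returns (step at which the states first differed, step at which a
    periodic checksum comparison noticed), or None if no check ever failed.
    """
    diverged_at = None
    for step, (server, client) in enumerate(zip(server_states, client_states)):
        if diverged_at is None and server != client:
            diverged_at = step
        # Checksums are only compared every check_interval steps.
        if step % check_interval == 0:
            if state_checksum(server) != state_checksum(client):
                return diverged_at, step
    return None
```

In this model a divergence at step 3 with a check every 5 steps is only noticed at step 5, which is why the moment of disconnection says little about the moment the states actually diverged.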
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on October 07, 2018, 12:21:59 PM
I went ahead and looked at what ways were becoming available in the time frame leading up to 1940. These objects are taken from all the .dat files in this directory: https://github.com/jamespetts/simutrans-pak128.britain/tree/master/ways

Name=hr-asphalt-road-medium
intro_year=1935
intro_month=6

name=BrickViaduct
intro_year=1838
intro_month=7

Name=city_road
intro_year=1932
intro_month=1

name=ConcreteSteelCantileverRoad
intro_year=1937
intro_month=5

Name=concrete_road
intro_year=1936
intro_month=9

Name=runway
intro_year=1938
intro_month=9

name=airport_oneway
intro_year=1938
intro_month=9

Name=taxiway
intro_year=1938
intro_month=9

---- close retire dates ---- (not a complete list, since I didn't think of checking the retire dates until midway through the list...)

Name=macadam_road
retire_year = 1936
retire_month = 7

Name=WoodenTretleElevatedNarrow
retire_year=1938
retire_month=7

name=WoodenTrestleNarrow
retire_year=1938
retire_month=7
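
A list like the one above could be produced mechanically, which would also catch the retire dates that were skipped. The following Python sketch parses key=value text in the layout of the excerpts and filters by year. It assumes that each object's name line comes before its date lines, as in the excerpts; real .dat files may order their keys differently, and the function names are made up.

```python
def parse_objects(dat_text):
    """Split key=value text into one dict per object, starting a new object at each name= line."""
    objects = []
    current = None
    for raw in dat_text.splitlines():
        line = raw.strip()
        if "=" not in line or line.startswith("#"):
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        key = key.lower()  # the excerpts mix "Name=" and "name="
        if key == "name":
            current = {"name": value}
            objects.append(current)
        elif current is not None:
            current[key] = value
    return objects


def dated_between(objects, prefix, first_year, last_year):
    """Return (name, year, month) for objects whose <prefix>_year is in the given range."""
    hits = []
    for obj in objects:
        year = obj.get(prefix + "_year")
        if year is not None and first_year <= int(year) <= last_year:
            month = int(obj.get(prefix + "_month", 1))
            hits.append((obj["name"], int(year), month))
    return hits
```

Calling dated_between(objects, "intro", 1937, 1940) and dated_between(objects, "retire", 1937, 1940) on the parsed files would give both halves of the list in one pass.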
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 07, 2018, 08:38:15 PM
Dr. Supergood - that is a very useful suggestion. I have tried advancing the time to 2000, and there is no loss of synchronisation with this. I will try a few intermediate dates to see what the cut-off is.
Edit: The loss of synchronisation still occurs in 1950.
Edit: The error also seems to occur in 1975.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 08, 2018, 08:03:17 AM
Might be worth binary searching the exact start and end year.

It could be coupled to town buildings/attractions, industry or private cars since all of those are subject to introduction or phase out with year.
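
A minimal Python sketch of the binary search suggested above. The callback loses_sync is a stand-in for one manual test run (advance the saved game to that year, connect, wait), and the sketch assumes there is a single boundary: every year before it is stable and every year from it onwards fails.

```python
def first_failing_year(stable_year, failing_year, loses_sync):
    """Return the smallest year in (stable_year, failing_year] for which loses_sync is True.

    stable_year must be known to stay in sync and failing_year known to lose sync.
    """
    low, high = stable_year, failing_year
    while high - low > 1:
        mid = (low + high) // 2
        if loses_sync(mid):
            high = mid  # boundary is at mid or earlier
        else:
            low = mid  # boundary is after mid
    return high
```

Each probe is one long test run, so narrowing a 60 year range takes about six runs rather than sixty. The same function can be reused for the end year by inverting the callback.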
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 08, 2018, 10:08:06 AM
Each round of testing takes a considerable amount of time, so it will take a long time to get to the point of checking the exact year. I am planning to try to find it as precisely as possible, however.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 12, 2018, 12:21:08 AM
Further testing has revealed an error in the earlier testing, but that error itself has revealed interesting data. When I advanced to 2000 initially, I had used a game saved in 1937. However, the initial tests of 1950 and 1975 had used the game saved in 1939 - after the problem had arisen. Re-testing in 1952 with the game saved in 1937 shows that the client is able to stay in sync with the server.

This suggests that it is the presence in the game of an object that is automatically built sometime in the 1939-1952 era that causes the problem, rather than the building of the object while the client is connected.

I will have to test further when I have more time to see in which year the problem first goes away.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 12, 2018, 03:56:21 AM
There is a limit to what objects are automatically created or manipulated.
  • Trees.
  • City buildings/attractions.
  • Walking passengers.
  • Private vehicles.
  • Terrain slopes (due to construction of city buildings).
  • City roads.
  • Resurfacing of all existing roads, rails, etc, potentially to a different type due to obsolescence.
  • Industries, and industry linking.
  • Power consumption/generation.
  • Bridges, and hence grounds, due to the construction of city road bridges over obstacles.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: prissi on October 12, 2018, 04:48:43 AM
Are there exponents or square roots used in any generation routine? It may be that those deviate slightly only for numbers generated in that era. Because if there is no desync when running both under Linux, I would suspect something like this ...
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 12, 2018, 05:46:38 AM
Quote
Are there exponents or square roots used in any generation routine? It may be that those deviate slightly only for numbers generated in that era. Because if there is no desync when running both under Linux, I would suspect something like this ...
There is a software implementation for these which should be deterministic between platforms. The software implementation is heavily used by vehicle physics which cannot directly be the cause due to there being dates that the game remains in sync for hours despite ~10,000 vehicles.

Anyway, an idea that occurred to me was to disable multithreading on both server and client for a test. If this stops it going out of sync, then it is caused by something multithreading related.
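
For anyone wondering why a software implementation is deterministic where hardware floating point might not be, here is a small Python sketch of an integer-only square root. It is not the routine used in the game, only an illustration: every step is integer addition and floor division, so every platform produces the same result.

```python
def isqrt_newton(n):
    """Floor of the square root of n, computed with integer arithmetic only."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n
    x = n
    y = (x + 1) // 2
    # Newton iteration; the estimate decreases until it reaches floor(sqrt(n)).
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x
```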
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 20, 2018, 02:41:37 PM
Further testing shows that year skipping the 1937 saved game to 1952 produces a saved game that stays in sync between client and server.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 20, 2018, 05:29:12 PM
Further testing shows that the 1937 game fast forwarded to 1940 also remains in sync with the server, suggesting that the earlier results implying to the contrary were contaminated with the confusion between different starting points identified earlier.

The consequence of this is that the loss of synchronisation is not necessarily (and is probably not) caused by some automatically emergent objects such as buildings, private cars or pedestrians, as previously thought.

Further investigation of the type originally carried out (i.e. into changes made by players to the network) will be needed.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Junna on October 21, 2018, 10:25:41 AM
I replaced something like one thousand two hundred road vehicles; could that be part of it? It was around the time the desynching started... Many buses also got stuck, because a number of them have spuriously high axle loads (equal to their entire weight, 6-7 tonnes).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 21, 2018, 10:48:50 AM
I have been conducting a test to try to determine the cause of the problem by liquidating each company one by one and seeing whether the server remains in sync after that liquidation. I have so far liquidated Crandon & Lakes and Player 11 to no avail. I was about to test liquidating the next company last night when my computer crashed, so I am going to restart to-day. This test will help to determine whether or not anything that you describe might be relevant, although it is difficult to see at present what in what you describe could be part of the problem, since both replacements and vehicles getting stuck/having no route have been encountered commonly before without causing loss of synchronisation.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 21, 2018, 02:31:33 PM
How many of the buses are left to replace? I am aware of an out-of-sync problem related to manual schedule changes, but I did not think it applied to automatic changes.

Also how much power generation is going on? I recall a similar issue like this being caused by power nets on the last server.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 21, 2018, 02:45:25 PM
Preliminary testing seems to show that the loss of synchronisation appears not to occur when the Bay Transport Company is liquidated. However, I have not been able to test this thoroughly, since my computer is currently not stable enough to remain running without hard-crashing when running the server game for more than ~15 minutes at a time (although this is still longer than it took to lose synchronisation before I liquidated Bay Transport).

The server is currently set up with Bay Transport liquidated, but all other companies intact. If anyone can connect and try to remain connected (without interaction) for circa 1 hour in this state, that would be very helpful. I can then try to narrow down the problem once this has been confirmed. Note that you will need to download the latest version from the server as I fixed a crash bug this afternoon.

In relation to the other suggested issues: the electricity related loss of synchronisation was fixed a long time ago. As to schedule changes, I am not aware of this being a current bug. If anyone can reproduce this with the latest version, please post a full bug report in the usual way.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 21, 2018, 05:15:06 PM
Been connected to the server well over an hour. Even survived a save/load cycling of someone joining. No desyncs at all.

EDIT: A thought occurred to me. Now that we know removing Bay Transport solves the OoS, we have to prove that it is Bay Transport causing the OoS and not his interactions with everyone (since practically all companies connected to him in some way). Hence I propose restarting the server with a save that removes all other companies except Bay Transport and seeing if it OoSes still. If it does, then the problem is something in Bay's network and the removal of other companies might make this easier to identify.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 21, 2018, 09:07:34 PM
Thank you very much for testing: that is most helpful. That is a good idea for a further test, too, but first I want to test to see whether the fix to the bug that caused a crash actually fixed the desync by running the original saved game again: whilst this is very unlikely, because the two coincided, I need to rule this out before testing further. Then, I will proceed with Dr. Supergood's proposed further test.
Edit: The conclusion of the first part of the test is that the crash fix did not fix the loss of synchronisation. I will now proceed with Dr. Supergood's suggested test.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 21, 2018, 10:36:47 PM
I have now run the second part of Dr. Supergood's proposed test (and this is on the server now - you will need to update the executable again, as I had to fix another crash bug to run this): with just Bay Transport remaining and the other companies removed, the client still loses synchronisation with the server. This implies that the issue is not at the intersection between Bay Transport and another network, but rather internal to the Bay Transport network.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 22, 2018, 08:50:59 PM
I have just carried out a further test by withdrawing all of Bay Transport's road vehicles. Connecting to the game thus modified still results in a loss of synchronisation after a few minutes.
Edit: Removing the aircraft also does not remove the loss of synchronisation issue.
Edit: Likewise, removing trams has no effect. All that remains is rail, so it seems likely (but not certain without further testing) that the problem is associated in some way with rail transport.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 22, 2018, 11:43:46 PM
Further tests show that removing all of Bay Transport's vehicles appears to allow a stable connexion to be maintained. It would be very helpful, however, if anyone else could test to verify this: the server is currently running in this state, so if anyone can stay connected for ~1 hour, this would be very good evidence of the stability.

Even more interestingly (perhaps), I discovered that I had missed some rail and road vehicles when I was testing earlier: some earlier versions of the testing saved game file (including the ones that I used to test the absence of road vehicles, aircraft and trams) still had one or two road vehicles left, and the first attempt at testing the removal of rail vehicles also still had some road vehicles left. Testing with this version, the loss of synchronisation still seems to occur.

This is most interesting as, if the current saved game can be shown to be long-term stable, I can then remove vehicles one by one and see which one is responsible.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Junna on October 23, 2018, 01:26:02 AM
This is kind of off-topic, but how do you force liquidation of another company on a server game?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: prissi on October 23, 2018, 08:30:57 AM
nettool probably.

pak128.britain standard contains double objects, see here: https://forum.simutrans.com/index.php/topic,18506.msg176239.html

Three buildings appear from 1930 onward and are contained twice, once with the cluster parameter and once without. Their building time is from 1930 to 1960, but if newer buildings appear in 1950 then those are built less frequently. Since the loading order of pak files depends on the file system (and thus is different between Windows and Linux), those COM_JH_1930_00_06A etc. may be the source of desync. With fewer companies, growth is more infrequent and such desync would happen less.

It might be useful, if the pak doublette feature from standard finds its way to experimental early, or if you check the debug messages for overlaid objects.
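
A small Python sketch of the load-order effect described above. It assumes that when two pak files define the same object name, the one loaded later overlays the earlier one; that rule, the file names and the function name are all assumptions made for the example, not a description of the actual loader.

```python
def load_objects(pak_files, sort_names=True):
    """Simulate loading pak files and report duplicated object names.

    pak_files: list of (file_name, [object_name, ...]) in directory order.
    Returns (registry, duplicates), where registry maps each object name to
    the file whose definition ended up being used.
    """
    files = sorted(pak_files) if sort_names else list(pak_files)
    registry = {}
    duplicates = []
    for file_name, object_names in files:
        for name in object_names:
            if name in registry:
                duplicates.append((name, registry[name], file_name))
            registry[name] = file_name  # assumed rule: the later file wins
    return registry, duplicates
```

With sort_names=False the winning definition depends on the order in which the file system lists the files, so two machines can end up with different objects. Sorting the names first, or refusing to load when duplicates is non-empty, removes that dependence.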
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 23, 2018, 09:59:32 AM
Nettool is indeed the way of liquidating single companies - the syntax is nettool [server details] remove-company [company number].

As to the duplicated buildings, thank you for the investigations in this regard. As I posted in the other thread, however, I cannot read the text posted there, so I cannot check whether any of these are duplicated in the Extended version of the pakset. I have just checked for duplication of COM_JH_1930_00_06A, but found only one object with this name.
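
For repeated testing, the nettool call can be assembled by a small helper. This Python sketch only follows the syntax given above; the server details are passed through untouched as a placeholder list, because the exact options are not spelled out in this thread, and the function name is invented.

```python
def remove_company_command(server_details, company_number):
    """Build the argument list for: nettool [server details] remove-company [company number].

    server_details: list of strings supplied by the caller (placeholder here).
    """
    if company_number < 0:
        raise ValueError("company number must be non-negative")
    return ["nettool", *server_details, "remove-company", str(company_number)]
```

The resulting list can be handed to subprocess.run(command, check=True) once the real server details have been filled in.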
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 23, 2018, 10:11:52 AM
Quote
It might be useful, if the pak doublette feature from standard finds its way to experimental early, or if you check the debug messages for overlaid objects.
While the server listing server was still working there was no pakset mismatch shown when connecting hence this is not the problem.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on October 23, 2018, 03:45:52 PM
Quote
if anyone can stay connected for ~1 hour, this would be very good evidence of the stability.

I've been connected for around 40 mins with no issues at the moment.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 23, 2018, 04:00:25 PM
Excellent, thank you very much for testing: that is very helpful.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 23, 2018, 04:40:14 PM
I was connected for 80 minutes, no out of sync.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 23, 2018, 06:01:24 PM
Excellent, thank you very much for testing.

I have now uploaded the other version that I described, which still has some residual road and rail vehicles in it for testing. It will restart with this version running in a few minutes. I intend to unlock Bay Transport so that we can all test to see which thing(s) are causing the trouble by removing them one by one. I should be very interested in anyone's results.


Edit: Now running and unlocked.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on October 24, 2018, 09:43:10 PM
I have managed to stay in sync for ~40 min without doing anything.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 24, 2018, 09:46:31 PM
That is interesting, thank you. I will have to re-test, as I did originally get out of sync errors with this saved game.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 25, 2018, 12:52:51 PM
Yeh the save is stable.

Is there one with all companies except Bay removed? This one has most of Bay's vehicles removed.

That said when I first joined I did get an index out of bounds crash. Not been able to reproduce it however.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 25, 2018, 05:56:55 PM
Thank you very much for testing: that is helpful.

I have now restarted the server game with the version of the saved game with Bay Transport's railway network only (plus one or two 'buses that I omitted in error to remove earlier). The company is unlocked, so there is scope for testing as to which specific line(s) are associated with the loss of synchronisation by way of withdrawing the stock from the lines one by one and testing after each.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 27, 2018, 12:28:58 PM
I am currently running a test in which I am removing rail lines of Bay Transport one by one and checking whether this affects loss of synchronisation for each line. I am going from the bottom of the list of lines upwards.

I appear to have found a stable state by removing all lines up to and including FRC - Roxingstoke - Templecaster (local). Removing all lines up to but not including that line did not prevent the loss of synchronisation, suggesting that something about this line might well be responsible for the issue, although further testing is needed to confirm this.

It would be helpful if people could connect to the server and test whether this is long term stable.

The next round of testing will be reverting to the version of the saved game in which the loss of synchronisation occurs to test whether removing only the abovementioned line will prevent the loss of synchronisation.
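
The two rounds of testing described above can be written down as a short Python sketch. The callback loses_sync stands for one manual test run with only the given lines in service, line names are assumed to be unique, and the function names are made up for illustration.

```python
def first_stabilising_removal(lines, loses_sync):
    """Withdraw lines cumulatively, from the bottom of the list upwards.

    lines: line names in list order (top to bottom).
    Returns the line whose removal first produced a stable game, or None
    if the game still loses sync with every line withdrawn.
    """
    active = list(lines)
    for line in reversed(lines):
        active.remove(line)
        if not loses_sync(active):
            return line
    return None


def confirmed_by_single_removal(lines, suspect, loses_sync):
    """Follow-up round: restore everything, withdraw only the suspect, re-test."""
    remaining = [line for line in lines if line != suspect]
    return not loses_sync(remaining)
```

If the second function returns False for the suspect found by the first, that would suggest either that more than one line contributes or that one of the test runs was too short to show the loss of synchronisation.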
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on October 27, 2018, 02:54:23 PM
I joined the server and suffered loss of synchronisation after about 2 minutes. I will try again and see if that was a one off.

edit: The same happened again. There are a lot of stuck vehicles and vehicles with no route; could these be having an effect on players staying in sync?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 27, 2018, 05:27:15 PM
I have been in the process of testing by starting again with the saved game and by withdrawing vehicles only on the last route before the game apparently became stable: you joined, I think, after this process had started.

I am aware that there are stuck vehicles: however, this should not cause the clients to lose synchronisation with the server. It is unlikely that these are the cause as:

(1) there have been many times in the past when there have been stuck vehicles and/or vehicles with no route on the server and it has not lost synchronisation with clients; and
(2) removing only the Bay Transport Group prevents loss of synchronisation, but does cause some stuck vehicles.

My tests are now revealing bizarre and inconsistent results: starting from the FRC - Roxingstoke - Templecaster (local) line and working down to the bottom of the list, I have not been able to achieve the stability that I had earlier achieved starting at the bottom and working up to the FRC - Roxingstoke - Templecaster (local) line. I am now working up from that line towards the top of the list, but this is an extremely slow process.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 27, 2018, 05:29:45 PM
One cannot rule out that, given enough time, the game magically goes back into sync because so many lines are broken by the removal of all other companies. One might have to test with all companies present so that all lines continue normal operation.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 27, 2018, 06:08:30 PM
Having to reset the game every time would double or triple the time taken to test. This has already taken literally all day, so this sort of testing would have to be an absolute last resort.

Also, this seems unlikely, as the broken lines are stably broken and so do not change much with time.

However, the latest test shows that removing up to and including the Northern Frontier Express seems to have led to stability. I should be grateful if people could log on for further testing in the current state. This line in particular is promising because the rolling stock seems to have been replaced in 1938.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on October 27, 2018, 07:18:16 PM
I have been able to remain in sync for ~1 hour.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Phystam on October 27, 2018, 08:46:07 PM
I was able to stay connected to the server for ~1 hour.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 27, 2018, 08:55:14 PM
Thank you both - that is very helpful.

May I ask now that one of you could try adding trains back to the withdrawn Bay Transport lines one by one, then waiting ~10 minutes after each repopulation to see whether it causes loss of synchronisation?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on October 27, 2018, 11:09:11 PM
I sent the trains out again on the Northern Frontier Express and after a few minutes... desync!
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 28, 2018, 12:13:17 AM
Very interesting - thank you. It might help to test a few times by withdrawing them and then adding them to check that the behaviour is consistent. If it is, this is a very interesting development, although quite what is special about that line remains to be investigated.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on October 28, 2018, 01:39:41 AM
Sending the trains to depot again seems to prevent a desync. I have managed to remain connected for over 30 min.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on October 28, 2018, 09:44:47 AM
Since most trains work, it is likely something to do with the signalling. Try altering signalling, especially around any complex parts. Since one knows that line is the cause, one should check whether any lines sharing the same ways can be terminated without stopping the desync, to make testing easier.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 28, 2018, 11:41:00 AM
Thank you for that testing: that is very helpful.

One thing that might be worth checking is to send out a freight train using the same line to try to distinguish whether the issue is with signalling or similar or whether it is with passenger routing.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on October 28, 2018, 04:47:04 PM
I have sent out a freight train on that route and so far encountered no issues.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 28, 2018, 05:20:22 PM
Splendid, thank you for testing. To make this test more rigorous, it would be helpful if you could:

(1) send out some more freight trains (in case the issue is signalling and requires trains to interact to cause problems); and
(2) try to stay connected for at least 45 minutes.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 29, 2018, 12:37:20 AM
I have added a number of freight trains to this line, and have been able to stay connected for >1 hour, so it seems unlikely that the problem is signalling or similar.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on October 29, 2018, 10:20:17 AM
Quote
Splendid, thank you for testing. To make this test more rigorous, it would be helpful if you could:

(1) send out some more freight trains (in case the issue is signalling and requires trains to interact to cause problems); and
(2) try to stay connected for at least 45 minutes.

Sorry James, I intended to do some more thorough testing with freight trains but got caught up yesterday evening.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on October 29, 2018, 06:16:56 PM
Tested sending out a single new passenger train with different rolling stock. No desync.

Edit: There actually seems to be a systematic desync after a longer period of time than before.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 29, 2018, 08:55:39 PM
Thank you for testing - can I ask what you mean by a "systematic" desync?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on October 29, 2018, 09:01:32 PM
That the game consistently loses sync after a period of time that is longer than for the originally observed desync. It suggests that the desync occurs individually on any train of the line.
Edit: I have managed to stay in sync for well over an hour.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 29, 2018, 10:41:20 PM
Interesting - what is the train doing when you are in sync for so long? I wonder whether there is a pattern in its activity and its connection with when you lose synchronisation?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on October 31, 2018, 10:25:13 PM
Some additional testing this evening shows that using the A4 class locomotive used on the Northern Frontier Express, but hauling fast freight wagons, does not trigger a loss of synchronisation, so any cause specific to the A4 class locomotive can be ruled out.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 03, 2018, 10:22:49 PM
Further testing yields some interesting results: I have re-instituted passenger services on the Northern Frontier Express without any loss of synchronisation. However, I am using different rolling stock: a GWR Hall class and GWR express carriages (just passenger, no mail).

I should be grateful if anyone else could log in and double check for long term stability. The next phase of testing would be to add mail to these trains and see whether that makes any difference. The trains are successfully carrying passengers along this line currently without loss of synchronisation, and I have been connected for over an hour. There are 5 trains in total running.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 04, 2018, 11:05:25 AM
I have now added some mail only trains to the line. After doing so, I experienced a loss of synchronisation after a time. I tried to reconnect and run the test again for confirmation, but my hardware problems caused my computer to crash while running the game on this occasion, so I could not test fully. The game is still running with mail trains on the Northern Frontier Express - I should be very grateful if anyone could log in and test stability with the game in this state.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 04, 2018, 05:47:24 PM
I've connected 3 times and each time I have de-synced from the server.

edit: I am connected at the moment and it seems to be staying in sync, I will see how long that lasts.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 04, 2018, 06:22:47 PM
I have connected twice and desynced both times after 20-30 min.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 04, 2018, 08:18:03 PM
That is extremely interesting - thank you both for that test. This suggests that there may well be a problem specific to mail. SuperTimo - was Bay Transport originally your company - if so, can you remember whether mail was conveyed on the Northern Frontier Express before 1938?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 04, 2018, 08:27:46 PM
From what I remember of previous testing, the line didn't carry mail after 1938 either. At least it didn't carry mail when I tested and it was desyncing after 5-10 minutes.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 04, 2018, 08:31:24 PM
Interesting - can I ask which locomotives and rolling stock you used for that test?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 04, 2018, 08:34:12 PM
I think it was the stock originally used on the line, that being LNER A1's or A3's with the latest GNR corridor carriages.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 04, 2018, 09:07:53 PM
I think it was the stock originally used on the line, that being LNER A1's or A3's with the latest GNR corridor carriages.

Do you remember which carriages specifically? Some brake carriages that go with this set (I think the LNER rather than the GNR set by 1938) carry mail rather than passengers.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 04, 2018, 11:42:49 PM
That is extremely interesting - thank you both for that test. This suggests that there may well be a problem specific to mail. SuperTimo - was Bay Transport originally your company - if so, can you remember whether mail was conveyed on the Northern Frontier Express before 1938?

Bay Transport wasn't my company (Great Highland Railway was mine), but its owner started carrying post at my request, as it would be beneficial to both of us (I think Bay Transport introduced a couple of mail-only trains following this). I think this was not long before the de-sync started, around the time that Green Quantinglow airport was built, which I think was around 1939.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 04, 2018, 11:48:11 PM
This is very interesting. This does suggest that the carriage of mail in particular on this line is likely to be a cause of the problem. It is very odd that this line in particular carrying mail is a problem rather than it always being a problem when mail is carried, however. I know that there was a comprehensive mail network for over 100 game years on the server before this problem started happening, so this is very odd indeed.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 05, 2018, 12:48:06 AM
Do you remember which carriages specifically? Some brake carriages that go with this set (I think the LNER rather than the GNR set by 1938) carry mail rather than passengers.

I remember there were no mail carriages. Simply the last model of GNR passenger carriages, with a restaurant carriage in each train.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on November 05, 2018, 07:26:13 AM
Mail generally has long waiting times compared with passengers. This means the total journey time, or other such metrics used to decide the route travelled, would be considerably longer than with passengers. If, due to an incorrect type choice, an integer were to overflow, one might run into platform-specific behaviour that compilers do not warn about. In C/C++, overflow of signed integer types is undefined behaviour, and additionally a type might not have the same number of bits on different platforms. This could result in mail taking different routes on different platforms.

This would only be a problem with new mail lines because previously there were not many competing, excessively long mail routes, so there was only 1 choice. Shortly before the desyncs started (within the preceding 10 years) a cross-map mail ship line was added, on top of an air mail network as well as additional train mail networks. There are probably well over 3 paths for mail to reach some destinations, with some of them being stupidly slow due to waiting times.

This is speculation but could be worth looking into.
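
To make the speculation above concrete, here is a minimal sketch in Python of how a difference in integer width alone could flip a route choice. It is not Simutrans code: the route names, the journey times and the 16-bit width are all hypothetical, chosen only to make the effect visible. Python integers do not overflow, so fixed-width storage is simulated by masking; this models unsigned wrap-around only and cannot reproduce the undefined behaviour of signed overflow in C/C++.

```python
def wrap_unsigned(value, bits):
    """Reduce value modulo 2**bits, as unsigned fixed-width storage would."""
    return value & ((1 << bits) - 1)


def pick_route(routes, bits):
    """Return the name of the route with the smallest *stored* journey time.

    `routes` maps a route name to its true journey time; the stored value
    is the true value wrapped to `bits` bits.
    """
    return min(routes, key=lambda name: wrap_unsigned(routes[name], bits))


# Hypothetical journey times in tenths of minutes.
routes = {"rail": 50_000, "ship": 70_000}

wide = pick_route(routes, 32)    # both values fit, so the shorter rail route wins
narrow = pick_route(routes, 16)  # 70_000 wraps to 4_464, so the ship route wins
```

If two machines disagreed in this way about which route is shortest, they would route the same packet of mail differently, which is exactly the kind of divergence that shows up as a desync.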
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 05, 2018, 10:17:39 AM
This would only be a problem with new mail lines because previously there were not many competing, excessively long mail routes, so there was only 1 choice. Shortly before the desyncs started (within the preceding 10 years) a cross-map mail ship line was added, on top of an air mail network as well as additional train mail networks. There are probably well over 3 paths for mail to reach some destinations, with some of them being stupidly slow due to waiting times.

My line connected Bay Transport's (BT) and Far Eastern and Western's Railway lines (FE&WR). When BT started carrying post, the combination of my routes and BT's would likely have created a faster route than FE&WR's. The combination of the three companies would have also generated a great deal of additional mail journeys. I had one post train which ran once every month, so this would lead to particularly long waiting/journey times for mail going via the Great Highland Railway.

In addition to this, around when the de-sync started occurring, BT's trains on the Northern Frontier Express were constantly getting stuck due to a lack of one way signs, causing trains to reserve routes for stupidly long distances on the wrong line. This would have caused massive waiting times for post to be picked up/delivered.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 05, 2018, 10:34:47 AM
The integer overflow hypothesis does not seem plausible: journey times are now stored in unsigned 32 bit integers with a resolution of 10ths of minutes, giving a maximum storable journey time of 7,158,278 hours (being 298,261 days, or 817 years).
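
For anyone wanting to check those figures, the arithmetic can be reproduced with a few lines of Python. This is only a worked calculation, not game code; the 365-day year is an assumption made for the final conversion.

```python
UINT32_MAX = 2**32 - 1     # largest value an unsigned 32-bit integer can hold
TENTHS_PER_MINUTE = 10     # journey times are stored in 10ths of minutes

max_minutes = UINT32_MAX / TENTHS_PER_MINUTE
max_hours = max_minutes / 60
max_days = max_hours / 24
max_years = max_days / 365  # assumption: 365-day years

# Truncate to whole units, matching the figures quoted above.
print(int(max_hours), int(max_days), int(max_years))
```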

Rollmaterial - can you be specific as to exactly which carriages you used to test and how many of them there were?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 05, 2018, 12:40:42 PM
Technically I didn't "use" them, I just sent out the trains that were in the depot. If I remember correctly they were made of 5 carriages: brake front+middle+restaurant+middle+brake rear. They were arranged in the same order as the train with SR stock currently running on the line.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: VOLVO on November 05, 2018, 01:39:19 PM
Here are a few things that will hopefully bring new ideas about what went wrong:
1. At some stations there were no post offices, so I just put a bus stop with postboxes or a roadside loading bay with postboxes.
2. The whole Northern Frontier Express line only has one dedicated mail train running with no schedule (as opposed to the Eastern Frontier Express which has some trains with mail brake carriages).
3. The mail line of the Northern Frontier Express (Northern Mail Express) is separate from the passenger Northern Frontier Express: at Bealdean Rye the mail train line terminates at DrSuperGood's terminus, and the passenger line terminates at my own station. (Current testing of mail trains seems to be running on the passenger line.)
4. My lines have strong mixing of old semaphore signals and modern 4 aspect light signals.
5. Some signals are placed at the second or even third platform tiles because the signal was built first and then, for operational reasons, the platform was extended. I have done this for a very long time, so I doubt it is the cause, but I thought I'd mention it in case anyone can think of something.

I also notice that the branch line for Ves's Green Quantinglow Airport was built when the strong desync came, and by that time the NFE mail trains had already been running for quite a few game years. In that period all trains on the Northern Frontier Express had been replaced with LNER A4s and only the mail train was running with an A3, so testing with A1s or A3s may not represent the rolling stock of the desync period.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 06, 2018, 01:02:31 AM
Interesting - thank you both for that. I have re-started the trains that were already in the depot at Elmley which have the formations indicated by Rollmaterial (A3s with GNR corridor carriages as specified).

Loss of synchronisation was encountered, but only after quite a long time.

I should be grateful if others could also test the server in the present state with these trains running to test for ~1 hour for loss of synchronisation. This suggests that it is possible that the problem is specific to certain types of rolling stock, but this will need more testing to narrow down.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 07, 2018, 04:55:42 PM
I have only had a very brief chance to look at this, but the position appears to be that neither the 1937 nor the 1939/1940 versions of this game carried mail on the Northern Frontier Express. The 1937 version used the same A1s and GNR corridor carriages as were found in the depot in 1939 (these sets apparently having been withdrawn from service on upgrade), and the 1939 version used A4 locomotives and the later LNER corridor carriages. The 1937 saved game already had a few trains with the A4s and LNER corridor carriages which had been purchased in that year. However, in the 1937 version, the classes had not been reassigned on the newer LNER carriages, whereas by 1939, Bay Transport had reassigned the classes on all the trains on that line (including those with newer carriages) to very low. The earlier trains with the GNR carriages had their classes already reassigned to very low.

Loss of synchronisation thus appears from this preliminary study to occur when either a mail train is run on this line (even though this line did not originally have mail trains) or when the later LNER carriages are run with reassigned classes.

This is an odd pattern of failure and does not suggest anything useful about the underlying code. However, it would be worthwhile carrying out tests by checking to see whether adding the A4s and LNER carriages to the line again (removing all existing trains on the line) without reassigning the classes causes loss of synchronisation or not, then reassigning the classes after testing for ~1 hour and testing for ~1 hour again to see whether loss of synchronisation results this time.

My time for testing at present is very limited: since these are tests that anyone can perform on the server, it would be extremely helpful, and would help to make progress towards a fix, if anyone could test this on the server and report the results here.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 08, 2018, 01:45:33 AM
Reassigning classes seems to affect the time after which sync is lost: it happens after 5-10 min with classes reassigned to very low and after ~30 min with default low class, independently from the choice of rolling stock. This suggests a correlation with the amount of passengers carried and that the issue is in passenger routing or generation. I will now test with reassigning the classes to medium.
Update: Desyncs after ~40 min.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 08, 2018, 10:59:43 AM
That is very interesting - thank you for that. That the reassignment causes the loss of synchronisation to happen more quickly when reassigned to very low and more slowly when reassigned to medium does seem consistent with your hypothesis that the problem is with passenger routing somehow and that the faster loss of synchronisation is caused by the greater number of passengers in the lower classes who are transported when reassigned to very low.

This does not, therefore, by itself get to the bottom of what the essential changes are between 1937 and 1939 such that the former is stable for >1 hour even with reassigned classes and the latter is not.

May I ask whether there were any changes to the schedule of the Northern Frontier Express between these dates?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on November 08, 2018, 11:49:31 AM
I assume you have tried after removing all his passenger aircraft? There is a known issue with broken passenger routing over air routes. The symptoms are that changing any passenger related schedule anywhere causes all air passenger (not mail) routes to be lost from the network. This is to rule out the rail line being a manifestation of this routing issue spilling over somehow.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 08, 2018, 12:47:43 PM
Were there any changes to the schedule of the Northern Frontier Express, between these dates may I ask?

There was the case that the trains got stuck multiple times, which might have led to a large amount of passengers on the line once the route was running again.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 10, 2018, 12:10:02 AM
I assume you have tried after removing all his passenger aircraft? There is a known issue with broken passenger routing over air routes. The symptoms are that changing any passenger related schedule anywhere causes all air passenger (not mail) routes to be lost from the network. This is to rule out the rail line being a manifestation of this routing issue spilling over somehow.

I am aware of this issue, which I was unable to diagnose after quite a few hours of testing. I did not continue testing for that because this issue then arose. That issue was found to be not specific to aircraft, but rather an issue affecting higher classes of passengers (becoming progressively worse the higher the class if I recall correctly; aircraft of this period default to the "very high" class, so exhibit this behaviour more readily).

Nonetheless, there was a test (which I believe is documented in detail on this thread) involving removing first all aircraft, then all ships, then all road vehicles until only trains remained and the loss of synchronisation still occurred. Then it was discovered that liquidating Bay Transport prevented the issue; then it was discovered that removing the Northern Frontier Express prevented the issue, and now it has been discovered that the problem occurs or not depending on what rolling stock is run on the Northern Frontier Express, but no intelligible pattern can yet be discerned from these data. The problem is that each round of testing takes such an enormous amount of time and so many rounds of testing are required to get good data that even narrowing the issue further is likely to take an extremely long time (i.e. many, many months given the amount of time currently available to me) without considerable assistance in testing.

Once the problem has been narrowed down more accurately, I can start looking at the code in more detail and isolating specific pieces of code and testing with those disabled to see wherein the problem arises.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 10, 2018, 09:32:58 AM
The current state of the NFE led to de-sync for me within ~10mins.

James, what would you like tested next?

I believe that Rollmaterial tested switching to the A4s with reassigned classes; currently the line is running A4s with the class reassigned to medium.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 10, 2018, 09:52:56 AM
Thank you - that is most kind.
I think that the locomotive is not relevant, as it has been tested hauling freight wagons without any loss of synchronisation.

The next useful test would be to reset all classes on that line to the default and, making no other changes, try to stay in sync for circa 1 hour.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 10, 2018, 11:06:07 AM
I reset the class for the route and was able to stay in sync for an hour. I left of my own accord.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 10, 2018, 11:45:34 AM
That is extremely helpful, thank you.

The next useful test would be to try re-assigning the class from low to medium rather than from low to very low and try to stay in sync for ~1 hour and see whether that makes any difference.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 10, 2018, 01:10:51 PM
I have done that. I have de-synced three times, usually within <10 mins
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 10, 2018, 03:52:39 PM
I have done that. I have de-synced three times, usually within <10 mins

This is very interesting - thank you. The next thing to test would be to try some different rolling stock (e.g. GWR carriages) and see whether you lose synchronisation with those (a) with default classes; or (b) with reassigned classes (i) to very low; and (ii) to medium.

Thank you very much for this most useful work.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 10, 2018, 09:18:12 PM
I have replaced the rolling stock with GWR stock, same formation |brake - normal - dining - normal - brake|. It has stayed in sync for at least 40 mins; however, I did suffer a loss of synchronisation during the process of changing the trains. I will test changing the classes tomorrow.

edit: somehow I misspelt the brake the second time.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 11, 2018, 12:39:25 AM
I have replaced the rolling stock with GWR stock, same formation brake - normal - dinner - normal - break. Has stayed in sync for at least 40 mins, however i did suffer a loss of synchronisation during the process of changing the trains. I will test changing the classes tomorrow.

That is extremely helpful, thank you. Another thing that might be worth testing is, for any state where you get loss of synchronisation, try removing the dining cars and seeing whether this makes a difference when everything else remains the same.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 11, 2018, 12:05:07 PM
So far I have the following results:

- GWR default class: de-synced once, but I think that was caused by something else as otherwise I was able to stay connected >50mins.
- GWR Very Low: constant de-syncing <10mins each time.
- GWR medium: yet to de-sync been connected for around 20mins, will see if this lasts.

I will test with removing the dining cars either later or tomorrow as I want to play another game and this is using up a fair amount of RAM and processing power.

edit: medium class still in sync after around an hour so it seems pretty stable.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 11, 2018, 03:21:18 PM
That is very helpful - thank you very much. I shall look forward to your dining car tests.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: SuperTimo on November 12, 2018, 11:07:52 AM
I have started to test the trains without the dining car. Something I have noticed is that someone is playing on the server as a company called Northside Transport. They have built a railway line using some mothballed track and have several bus lines. I am not sure whether this is affecting things but either way it is potentially compromising the testing we are trying to undertake.

edit: So far the results without the dining car are the same as with it. Setting the class to low results in no de-sync, setting it to medium causes de-sync. I am about to test it on very low, but I think it is likely this will result in de-sync.

Something I have noticed is that when changing class the de-sync only seems to begin once the trains have removed all of the passengers of different classes.

edit 2: switching to very low also causes de-sync without a dining car.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 12, 2018, 03:19:45 PM
I have started to test the trains without the dining car. Something I have noticed is that someone is playing on the server as a company called Northside Transport. They have built a railway line using some mothballed track and have several bus lines. I am not sure whether this is affecting things but either way it is potentially compromising the testing we are trying to undertake.

I have noticed that, too: however, if we can get a state where one set of conditions results in loss of synchronisation and another does not, then that should suffice for the test to be independently verifiable. The problem would arise only if what Northside Transport is doing causes a loss of synchronisation in any event, which would be detectable if there ceased to be any states in which the loss of synchronisation did not occur.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on November 12, 2018, 06:08:15 PM
I think people should be more concerned about that player wasting their time playing on a server where no progress will be retained as it is being used to debug...

I have noticed very glitchy mechanics with the passenger class system in the past, shortly before the problem started. For example a train that could only hold around 400 people max (including standing) was suddenly holding over 2,000 people when I made a change in passenger class. It still listed the correct maximum capacity, but it completely ignored this number and pretty much loaded as if it had 400% extra capacity.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 12, 2018, 06:35:59 PM
I think people should be more concerned about that player wasting their time playing on a server where no progress will be retained as it is being used to debug...

I have noticed very glitchy mechanics with the passenger class system in the past, shortly before the problem started. For example a train that could only hold around 400 people max (including standing) was suddenly holding over 2,000 people when I made a change in passenger class. It still listed the correct maximum capacity, but it completely ignored this number and pretty much loaded as if it had 400% extra capacity.

If you have a reproduction case for this, I should be grateful if you could post a fresh bug report for this.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: VOLVO on November 14, 2018, 12:40:57 PM
That is very interesting - thank you for that. That the reassignment causes the loss of synchronisation to happen more quickly when reassigned to very low and more slowly when reassigned to medium does seem consistent with your hypothesis that the problem is with passenger routing somehow and that the faster loss of synchronisation is caused by the greater number of passengers in the lower classes who are transported when reassigned to very low.

This does not, therefore, by itself get to the bottom of what the essential changes are between 1937 and 1939 such that the former is stable for >1 hour even with reassigned classes and the latter is not.

Were there any changes to the schedule of the Northern Frontier Express, between these dates may I ask?

I usually increase the frequency a little after the trains get stuck, to prevent them from blocking stations which are shared with other networks or lines.

I will come back to it this weekend; I should have more free time than previously and can help out with the debugging.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 14, 2018, 12:50:52 PM
That is very helpful - thank you.

After the dining car/catering tests have been completed, the next tests that I want to run involve creating some special rail vehicles just for debugging, being (1) the identical LNER carriages used on the Northern Frontier Express, but with the default class set to very low; (2) the identical LNER carriages but with no overcrowded capacity; and (3) the identical LNER carriages with the default class set to very low and no overcrowded capacity. Each should be tested in turn for ~1 hour (or until earlier loss of synchronisation) to see whether this makes any difference. Each test should change only the carriages used on the line and nothing else.

The aim of these tests is to see whether the reassignment of classes itself causes the problem, whether the problem is caused by overcrowding (as is known to occur on that line) or whether the problem is caused by some combination of reassignment and overcrowding. If these tests yield a result showing evidence of a possible causal relationship between overcrowding and/or class reassignments and the loss of synchronisation, then the area of problematic code in question can potentially be narrowed considerably and work can begin to isolate/disable sections of the code for testing.
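
The reasoning behind this set of tests can be sketched as a small Python helper that, given which runs lost synchronisation, reports which factors were present in every failing run. The factor names are made up for illustration, and the outcomes for the three debug carriages below are purely hypothetical placeholders, not results; only the first row reflects what has been reported in this thread.

```python
def factors_in_every_desync(observations):
    """Return the factors that were present in every run that lost sync.

    `observations` is a list of (factors, desynced) pairs, where `factors`
    is a set of factor names active during the run.
    """
    failing = [set(factors) for factors, desynced in observations if desynced]
    if not failing:
        return set()
    return set.intersection(*failing)


observations = [
    # Reported in this thread: ordinary LNER carriages reassigned to very low.
    ({"reassigned", "overcrowding"}, True),
    # HYPOTHETICAL outcomes for the three debug carriage types:
    ({"overcrowding"}, False),  # (1) very low by default, overcrowding allowed
    ({"reassigned"}, True),     # (2) no overcrowded capacity, classes reassigned
    (set(), False),             # (3) very low by default, no overcrowded capacity
]

suspects = factors_in_every_desync(observations)
```

With these placeholder outcomes the only factor common to every failing run is the class reassignment. A different pattern of outcomes would point at overcrowding, at both together, or at neither.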
Title: Re: Instability on the Bridgewater-Brunel server
Post by: VOLVO on November 18, 2018, 06:29:55 AM
With the currently running Castle Class and GWR carriages there is no desync at all.

After changing the stock for A4 and LNER carriages with very low price > desync after over 2 hours.

Also, if there's a way to disable all the 'no route' message boxes, that would be very nice, as it's difficult to work with everything popping up.

Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 18, 2018, 12:02:25 PM
Can I ask - did you reassign the classes for the GWR carriages? Also, did you run the GWR carriages with a catering vehicle?

To disable the no route pop-up windows, go to Message centre > Options and uncheck the right hand and centre checkboxes for "warnings" and "problems".
Title: Re: Instability on the Bridgewater-Brunel server
Post by: VOLVO on November 19, 2018, 01:52:51 PM
No, they were in very low configuration and no catering vehicles.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 19, 2018, 03:28:46 PM
Thank you for the confirmation.

The next step is for me to set up some special debugging carriages identical in every respect to the existing LNER carriages save that they are of "very low" class by default - testing with these (only) will help to determine whether it is the re-assignment of classes that is the problem.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 19, 2018, 09:40:13 PM
I have now added the debug versions of the Gresley carriages. These will be available from to-morrow's nightly build of the pakset. They can be distinguished because they have no translated name and their default name starts with DEBIUG1_. They have very low classes by default.

It would be very helpful if somebody could run a test to see whether there is any loss of synchronisation when using these carriages (and no others), without any class reassignments, on the Northern Frontier Express.

I am very grateful for the testing work done so far.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 23, 2018, 09:46:59 PM
I have been testing with the debugging carriages - so far, I am able to remain connected stably to the server for quite some time with a number of trains of the special debugging carriages (including dining cars).

I should be grateful if others could test and confirm whether they are also able to stay connected. If so, this strongly suggests class reassignment as the source of the problem.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 24, 2018, 12:01:40 AM
Just tested. I lost sync after ~10 min twice, then managed to stay connected for over an hour. The desyncs may have had something to do with there being a train running in reverse schedule, which I fixed at some point during the second or third attempt. I have also noted that the game takes a considerable time to resume after the client has connected and loaded the map.
Edit: Tried again, once again desynced after ~10 min twice then stayed connected longer on the third attempt.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 24, 2018, 06:04:23 AM
Thank you for testing: that is helpful. Can I ask why you think that running in reverse schedule is relevant? Have you tested for this?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 24, 2018, 12:51:16 PM
My second round of testing seems to indicate that running in reverse schedule isn't related to the bug.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 25, 2018, 01:06:19 PM
Thank you for letting me know. Can I ask what tests produced this result?

What we really need to do is to conduct tests to deduce which features of this line are unique such as cause a loss of synchronisation whereas other lines do not. So far, the tests have been oddly inconclusive. What we may need to do next is duplicate the line, run on it trains that we know cause a loss of synchronisation, and element by element alter the line's schedule until the loss of synchronisation no longer occurs; then return to the original schedule and make the last change alone to see whether that change is decisive in and of itself or whether it is cumulative with other changes, and, if the latter, with what changes it is cumulative.
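
The procedure described above can be sketched in Python as follows. This is only an outline of the testing logic: `loses_sync` stands in for a manual test of roughly an hour on the server, and the representation of a schedule and of each change is hypothetical.

```python
def find_decisive_change(schedule, changes, loses_sync):
    """Apply changes cumulatively until sync holds, then test the last one alone.

    schedule   -- the original schedule (any value the callbacks understand)
    changes    -- ordered list of functions, each returning an altered schedule
    loses_sync -- callback returning True if running the schedule loses sync

    Returns (index, decisive_alone): the index of the change after which
    sync first held, and whether that change on its own, applied to the
    original schedule, is enough to prevent the loss of synchronisation.
    Returns (None, False) if sync is lost even after every change.
    """
    current = schedule
    for index, change in enumerate(changes):
        current = change(current)
        if not loses_sync(current):
            # Return to the original schedule and make the last change alone.
            decisive_alone = not loses_sync(change(schedule))
            return index, decisive_alone
    return None, False


# Hypothetical example: a schedule of three entries, each change removes one.
schedule = frozenset({"A", "B", "C"})
changes = [lambda s: s - {"C"}, lambda s: s - {"A"}, lambda s: s - {"B"}]

# Case 1: entry "A" alone causes the problem.
single = find_decisive_change(schedule, changes, lambda s: "A" in s)

# Case 2: either "A" or "C" causes it, so removing "A" only helps cumulatively.
cumulative = find_decisive_change(schedule, changes, lambda s: "A" in s or "C" in s)
```

If the last change is not decisive alone, the earlier changes would then need to be re-tested in combination with it to find which of them it is cumulative with.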
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 25, 2018, 05:11:54 PM
I simply connected and let the game run. At some point I noticed the train I was following was running in reverse schedule, which I fixed. Then on the third time I connected I stayed in sync for over an hour. I then tested again and once again desynced after ~10 minutes twice, then managed to stay connected for longer on the following attempt, so I ruled out the reverse schedule train having had any effect.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 25, 2018, 05:20:13 PM
I am not sure that I understand this test: trains will run in reverse automatically if they reach the end of their schedule and are set to run their schedules in reverse. You write that you "fixed" the running in reverse - do you mean that you simply unchecked the run in reverse box on that individual train's schedule (which would make that specific train turn around and go in the other direction)?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 25, 2018, 05:23:03 PM
Yes, I unchecked the reverse schedule box in the main convoy window. The line is manually scheduled in both directions, it does not use mirror schedule. As I said, it turned out it was just a coincidence.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 26, 2018, 12:02:37 AM
I am also seeing this pattern - very long periods of stability and then extended periods of instability where a loss of synchronisation will occur within a very short time of disconnecting.

I wonder whether the instability might be related, therefore, to the position of the trains in the schedule: on this occasion, I started the trains all at once, so they will all be in a very similar part of their schedule at the same time.

It would be very helpful if anyone could check to see whether the instability correlates in any way with the position of the trains on the Northern Frontier Express in any particular part of its schedule.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on November 26, 2018, 12:11:38 AM
Also, not everyone desyncs at the same time, but some people desync while others remain connected.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 26, 2018, 12:31:31 AM
That is interesting; however, it would nonetheless be useful to know whether some people lose synchronisation in a way which correlates in any way with where the trains on the Northern Frontier Express are on their schedule.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 27, 2018, 11:40:56 PM
I have been carrying out further tests concentrating on the schedule of the Northern Frontier Express. I have created a new schedule based on that of the Express and moved all the trains to the new schedule. This new schedule ("TEST line") is the same as the previous schedule except that it uses the reverse route feature and lists each station only once.

My findings so far are still inconclusive, but somewhat interesting. Mostly, this runs without error or loss of synchronisation. However, about 30-45 minutes ago, I lost synchronisation. When I logged back in, I found that one of the trains had incorrectly selected "reverse route" part way through the schedule and was blocking the path of one of the other trains on the line by reversing prematurely without crossing to the correct line. After manually correcting this, the game is again running without error or loss of synchronisation for an extended period of time, the train apparently following the schedule correctly.

I note that Rollmaterial had noticed a spurious instance of selecting "reverse route" earlier. I do wonder whether this might be related in some way to the loss of synchronisation. The problem is that it is extremely difficult to test because the conditions for reproducing the error are so infrequent and erratic.

I should be very grateful if anyone else could log into the server and:

(1) check whether it is stable over a long period; and
(2) track the position of convoy no. 4233 and, if it becomes unstable, note the position of that convoy and report here what that position is.

If this be done frequently enough, over the course of a few days or weeks, we should get an idea of whether there is any pattern that suggests a possible correlation between the two, or whether there is no correlation.

The fact that the problem occurs on one particular railway line does suggest that an issue relating to the schedule itself might well be the cause of the loss of synchronisation, but beyond that there is still very little in the way of experimental data to demonstrate the probable causal mechanism or even the region in the code where the causal mechanism is likely to take place.

What is very odd about this issue is that it only arose after the game had been played, with intensity, for nearly 190 game years, with intensive use of railways for 115 of those years. There is evidently something very, very idiosyncratic about this error that makes it exceptionally hard to track down. The more assistance that I can have in localising the error, the sooner that it will be possible to narrow it down to a specific area of the code and, ultimately, fix it and continue work on fixing lower priority bugs, adding features and improving the game balance.

All assistance will be much appreciated.
Edit: Just after posting this, I observed some more interesting results. I noticed a train travelling at 35km/h in the drive by sight working method and realised that it was on the wrong line. It was just beyond St. Mary Beddington station. This is very near where the train was when the last loss of synchronisation occurred. Then, there was a further loss of synchronisation. The trains were correctly in reverse route mode - the problem I traced back to the junctions at Bickstable Fields station, where, just beyond the station, there was a missing one way sign next to crossovers that allow a train to cross over onto the other line. I have replaced the one way sign and resolved the deadlock potential by disabling the reverse route for the two trains that had gone down the wrong line.

I should be very grateful for any further testing to check whether any further loss of synchronisation occurs in this new setup.
Edit 2: I have just lost synchronisation again, at the very point that a train stopped at St. Mary Beddington station (in reverse route mode on the correct track).

Edit 3: After reconnecting again, a train in the reverse route direction called at St. Mary Beddington without loss of synchronisation; but I have just lost synchronisation again a second or two before a train finished stopping in the platform at St. Mary Beddington in the reverse route direction.

Edit 4: Some extremely interesting results. The game remained stable for an extended period (>30 minutes approximately) while no trains called at St. Mary Beddington. Then, a series of two trains called at that station. Seconds after the second of them departed, there was a loss of synchronisation. The direction was as before in reverse route (that is, departing towards Elmley). I have not yet seen departures from that station in the other direction, so cannot confirm whether errors occur in these circumstances.

There now seems to be an extremely strong correlation between trains calling at St. Mary Beddington and the loss of synchronisation. It seems usually to be the second train departing that station after the client first connects that does this. What would be very useful is if others could test and observe St. Mary Beddington station - what I need is a clear idea of whether there are any losses of synchronisation that do not occur shortly before or after a train calls at St. Mary Beddington station. If people could record the direction in which the train is travelling when departing and how many previous trains departed before the loss of synchronisation, that would be extremely helpful.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on November 28, 2018, 04:15:02 AM
I assume you have set the server to perform checksum checks every frame or something like that? Otherwise there will be a delay between the action that causes the clients to go out of sync and the server detecting the out of sync with the client disconnecting.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 28, 2018, 10:29:39 AM
The following is specified in simuconf.tab:

Code: [Select]
server_frames_between_checks = 32

One thing that seems likely on the face of it is that the anomaly occurs at the previous station on the route and is an anomaly affecting departure time, but that this is only manifested when the train arrives at the next station, as this is the first point that that anomaly can be translated into a difference in the step in which the random number generator is called.
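The mechanism suggested here can be sketched in a few lines. This is a toy model under stated assumptions, not Simutrans code: `toy_rng_t`, `rng_state_at` and the step numbers are invented for illustration, and the only thing that ever draws a random number is one train on arrival.

```cpp
#include <cassert>
#include <cstdint>

// Toy linear congruential generator standing in for a shared game RNG
// (hypothetical; not the generator used by Simutrans).
struct toy_rng_t {
    std::uint32_t state;
    std::uint32_t next() {
        state = state * 1664525u + 1013904223u;
        return state;
    }
};

// Returns the generator state that a check at `check_step` would sample,
// given that the train arrives (and draws one number) at `arrival_step`.
std::uint32_t rng_state_at(std::uint32_t seed, int arrival_step, int check_step)
{
    assert(arrival_step >= 0 && check_step >= 0);
    toy_rng_t rng{seed};
    for (int step = 0; step <= check_step; ++step) {
        if (step == arrival_step) {
            rng.next(); // the arriving train draws one random number
        }
    }
    return rng.state;
}
```

With the same seed, an arrival at step 10 on one side and step 11 on the other gives identical generator states for any check before step 10 and different states for a check at step 10, so the mismatch is reported at the arrival even if the timing error arose earlier.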
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on November 29, 2018, 06:24:55 PM
Hello James, and all!

Sorry for having been absent for such a long time. RL has been pushing and I haven't found much time to do Simutrans, other than reading the forum.

I did set myself to do a lot of logging on the trains at the stations today, however, but nothing seemed abnormal about trains stopping, loading, departing or doing other train stuff.
However, I noted that trains departing in the reverse direction tend to initially display a distance to the next station that is way off!
For instance 193km from Camberwell New Town Railway Station to Rockhead Gate Railway Station, which should be closer to around 12-14km. About midway, the distance updates itself to a much more reasonable value.

Other observations:
188km is noted from Bickstablewood Fields Railway Station -> St. Mary Beddington Piccadilly Railway Station
166km is noted from St. Mary Beddington Piccadilly Railway Station -> Buckllock Hill Railway Station
I didn't catch the displayed distance from when the train was leaving, but midway it displayed 72km from Underwater Bridge Railway Station to Wyndingborne Copse Railway Station
149km is noted from Buckllock Hill Railway Station -> Buckllock Bridge Railway Station
I didn't check the rest of the stops, but I assume that there are more instances of this.

During the whole period (ca 5 in-game hours) I had no desyncs at all. Occasionally the game would freeze for some minutes, but nothing more.

Hope this is of any assistance
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 29, 2018, 06:38:47 PM
Thank you for this. The observation regarding the distance appears to be a UI bug. May I ask: are you running Linux?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on November 29, 2018, 10:03:36 PM
No, on Windows 10.
OK, however I have never seen that UI bug before...

Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 30, 2018, 12:43:09 AM
Thank you  - that is very helpful. It is odd that you are not experiencing any loss of synchronisation. Inconsistent results such as this can make tracing the issue hundreds of times more difficult and lengthy than it would otherwise be.

Can I ask - what behaviour are others seeing at present?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on November 30, 2018, 04:30:25 PM
Regarding the distance displayed, I don't think that is just a UI thing. Looking in the code, the information is fetched and calculated from

Code: [Select]
cnv->front()->get_route_index()
and
Code: [Select]
cnv->get_route()->get_count()
It is very basic code that should be fail-proof in the info window, and the distance bar and the written distance are not calculated together, yet they both show the same way-off distance.
Also, so far I have only seen this happen in the reverse direction, when "reverse" is ticked, but not from all stations, though.

It might be completely unrelated, but you stated that the "reverse" status was of interest, as well as the station "St. Mary Beddington Piccadilly Railway Station" (which produces this behaviour).

I have been online now for ca 30 minutes without desync, however, when I first joined the server game, it was so slow to open that I tabbed away and forgot about it. When I remembered, 30 minutes or so later, that I had attempted to join the server, it had desynced in the background.

edit:
just came home from a small walk, and the game had desynced while I was away. The game was tabbed away during my walk.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on November 30, 2018, 11:17:24 PM
Thank you very much for your testing.

First of all, as to the UI issue - what I suspect is happening is that the distance is incorrectly being calculated on the assumption that the convoy is moving forward in its schedule rather than backwards, so calculating the route as it would be if the convoy had to go all the way through the schedule to get to the next point in the other direction. This would be a UI issue. I have not looked into this in detail, however.
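The kind of mistake suspected here can be illustrated with a short sketch. It assumes, purely for illustration, a circular schedule of `count` stops indexed from zero; `legs_to_reach` is a hypothetical helper and is not taken from the Simutrans code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of schedule legs a convoy must cover to get
// from stop `current` to stop `target` in a circular schedule of `count`
// stops, travelling either forwards or in reverse through the schedule.
std::size_t legs_to_reach(std::size_t current, std::size_t target,
                          std::size_t count, bool reversed)
{
    assert(count > 0 && current < count && target < count);
    if (reversed) {
        // Counting down through the schedule, wrapping at the start.
        return (current + count - target) % count;
    }
    // Counting up through the schedule, wrapping at the end.
    return (target + count - current) % count;
}
```

For a four-stop schedule, a convoy at stop 2 heading in reverse to stop 1 has one leg to cover, whereas counting forwards gives three. A display that used the forward count for a reversed convoy would therefore overstate the distance.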

As to the loss of synchronisation, one thing that can be inferred from what I observed is that the trains were departing from the station immediately prior to St. Mary Beddington at different times on the client and server. At one point, I did observe a train depart from St. Mary Beddington immediately that it arrived without even waiting for the minimum loading time, but this is not readily reproducible. I have just pushed some slight changes to the code for calculating the waiting time at stations, in particular, making the numerical types more consistent; I suspect that this will not make much difference, but it is just possible that the error lies here, and it would be worthwhile re-testing to-morrow to see whether this has helped.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 01, 2018, 01:08:53 PM
Further testing shows that the minor changes made yesterday evening have not prevented the loss of synchronisation. The issue still seems to be confined to trains arriving/departing from St. Mary Beddington. There appears to be no clear pattern as to when arriving/departing at this station will cause the loss of synchronisation: the first arrival/departure after logging in can trigger the failure, whereas there can be quite a few in sequence without this occurring. As to whether this occurs only in reverse direction, this is inconclusive so far: the only occasion on which I saw a loss of synchronisation on departure in the forward direction coincided with a train just about to arrive in the reverse direction.

The next phase of the test is to remove St. Mary Beddington from the schedule and see whether this prevents the loss of synchronisation, or simply moves where it occurs (suggesting that the problem occurred at the previous station).

Edit: I have now modified the line on the server to remove the stop at St. Mary Beddington. It will be interesting to see whether this will affect the loss of synchronisation.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on December 01, 2018, 10:28:33 PM
James, did you just lose synchronisation right now, or log out? I have been on the server now for around an hour without any desyncs at all. I have had 4 trains coming through the St. Mary Beddington area and the surrounding stations in reverse up until now without problems.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 01, 2018, 10:34:02 PM
Yes, I have had loss of synchronisation a few times, including quite recently. The latest test was to see whether removing the signal from the second to last tile on the platform at Bucklock Bridge and replacing it where it should be (i.e. on the very last tile) made any difference, but apparently not. Having deleted St. Mary Beddington from the schedule, the loss of synchronisation seems now to occur on departure from Bucklock Bridge - two stops along in the reverse direction.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on December 01, 2018, 10:37:09 PM
Ok, for information, I am still on the servergame. When I first came online, I think I kicked you out in the process, because it said you had left just when I logged in an hour or so ago.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 01, 2018, 10:49:47 PM
Yes, I noted that.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on December 02, 2018, 12:11:52 AM
Now I got a desync. I had the window tabbed away, but noted some 5 minutes ago that the game had frozen, as it has occasionally done throughout the entire session (1½-2 hours)
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 02, 2018, 12:23:47 AM
I do notice that it does that - this appears to be referable to the networking code written by Dr. Supergood to prevent loss of synchronisation: from what I understand, the client does this so as not to get ahead of the server.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on December 02, 2018, 12:30:10 AM
Ok, so that is a good thing then!
Now I got a desync again. It was in the tabbed-away state, but just some minutes before tabbing away (and the desync) I watched two trains going in the reverse direction past the Buckllock Bridge station, and no other train within hours (according to the station window timetable).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 02, 2018, 12:32:30 AM
The loss of synchronisation does now appear to occur either just after trains leave or just before trains arrive at this station in the reverse direction. It is very difficult to deduce what is different about this station on this line compared to all the many hundreds others in use in this huge map. It is really very odd.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on December 02, 2018, 12:35:48 AM
But wasn't it with the St. Mary Beddington Piccadilly Railway Station to begin with? What station are you referring to as being the suspect?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 02, 2018, 12:41:41 AM
It was St. Mary Beddington Piccadilly until I removed that stop from the schedule. It then became Bucklock Bridge. I had suspected that it was the station immediately before St. Mary Beddington that was causing the problem that was only manifesting at St. Mary Beddington, but this does not seem consistent with it occurring two stations further on after St. Mary Beddington is removed from the schedule.

Perhaps you could try removing the station immediately before St. Mary Beddington on the schedule and test to see whether that prevents the loss of synchronisation, and, if it does, try re-adding St. Mary Beddington and seeing whether that makes a difference?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on December 02, 2018, 12:51:00 AM
I have not had a consistent desync at all, so I'm not sure I would produce any consistent results. Remember that in my earlier session, I was online for almost 2 hours with four trains passing through the area in the reverse direction.
I might try something, but it is getting late and I might not be able to conduct any proper tests tonight.

edit 1:
Ok, now it desynced immediately when a train pulled up to the Bucklock Bridge.
Will check if the next train in line does the same thing....

edit 2:
No, it didn't, it stayed in sync...
However, I have now deleted the Bickstablewood Fields Railway Station from the schedule, so let's see over the coming days if that improves things! I have not had any trains passing through at this point.

edit 3:
Nah.. got a desync with no train whatsoever in the vicinity of either Bickstablewood Fields Railway Station or Buckllock Bridge Railway Station...

The only odd thing I can see on the map is the displayed distance in the convoy window. It doesn't really look like the information could be calculated wrongly in the info window, which suggests something wrong in the cnv->front()->get_route_index(), since that is the one that indicates the correct distance to the next destination. The values don't really add up to a circumnavigation of the track that would account for wrong reverse counting, as demonstrated with the first example I gave:

193km from Camberwell New Town Railway Station to Rockhead Gate Railway Station

The distance between the two stations is 10.88 km as the crow flies (from the stop info window), and the distance between Camberwell New Town Railway Station and the reversing terminus (Bealdean Rye....) is 17.62 km as the crow flies. So 10.88 + 17.62 + 17.62 + 10.88 is 57km and far from 193km, even though the trains are not traveling in a straight line.

Given the pretty inconsistent desync results, that might be worthwhile investigating, if only to make sure that it is NOT because of that.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 02, 2018, 05:32:31 PM
Thank you very much for that testing. Can I ask - does the anomaly with the distances occur on any other line?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 08, 2018, 12:15:28 PM
I have carried out some further testing. The distance anomaly appears to be directly connected to the reverse route feature: what seems to happen is that, when in reverse route, a train will calculate its route initially to the opposite platform of the same station, but will shortly afterwards reset it to the correct destination. I am not at this stage sure how it resets it, since the calc_route() method is not called - it may well have calculated it via the correct destination and then later truncated it.

However, what is clear is that this cannot be the cause of the loss of synchronisation on the server game, since the original line (the "Northern Frontier Express") did not use the reverse route feature. Instead, the stations were simply entered manually in both directions. Testing shows that the distance anomaly does not occur on the unmodified saved game.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 08, 2018, 05:59:53 PM
Additional testing shows that the following two modifications applied both to server and client do not prevent loss of synchronisation:

(1) disabling multi-threading for convoys; and
(2) disabling the system for trains stopping in the centre of platforms for non-terminus stops.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 08, 2018, 08:05:10 PM
Further testing shows that disabling overcrowding on the debugging railway carriages does not prevent the loss of synchronisation. Incidentally, I have reverted to the original Northern Frontier Express schedule, and so the loss of synchronisation occurs departing from St. Mary Beddington heading to Elmley.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 09, 2018, 01:09:26 AM
Further testing has shown that simply changing the locomotives on the trains on the Northern Frontier Express to the GWR Castle class prevents the loss of synchronisation. It is not immediately clear why this should be at this stage; however, the prime suspect is air resistance: the A4 class is a streamlined locomotive, and has an air resistance value set manually in the pakset. Earlier in my testing a month or so ago, I also tested with the Southern Merchant Navy class, which has an air smoothed casing and also a custom air resistance value; this caused loss of synchronisation, too.

I have spotted some loss of precision in the fixed point system used to emulate floating point arithmetic where the air resistance value is read; however, this loss of precision seems constant and does not appear to change when using different methods of calculating the air resistance values. The current version running does indeed use a slightly modified way of storing the air resistance values, but this has not solved the loss of synchronisation.
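Why a constant loss of precision is unlikely to matter by itself can be seen from a small sketch. It assumes a generic 16.16 fixed-point format and a value given in thousandths, both chosen for illustration only; this is not the fixed-point type used in the game, and `milli_to_fixed` and `fixed_to_milli` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Number of fractional bits in the illustrative 16.16 format.
const int FRACTION_BITS = 16;

// Converts a value given in thousandths to 16.16 fixed point, truncating.
std::int32_t milli_to_fixed(std::int32_t milli)
{
    assert(milli >= 0 && milli <= 32767000); // keep the result within 32 bits
    return static_cast<std::int32_t>(
        (static_cast<std::int64_t>(milli) << FRACTION_BITS) / 1000);
}

// Converts a 16.16 fixed-point value back to thousandths, truncating.
std::int32_t fixed_to_milli(std::int32_t fixed)
{
    assert(fixed >= 0);
    return static_cast<std::int32_t>(
        (static_cast<std::int64_t>(fixed) * 1000) >> FRACTION_BITS);
}
```

Because only integer operations are involved, 700 thousandths becomes 45875 and converts back to 699 on every machine: the value is slightly wrong, but identically wrong on server and client.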

Because the game had advanced to 1947 by which time no streamlined locomotives were available, I have re-set the game to its previous running state in 1941, although I have saved where it had reached in 1947.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 09, 2018, 03:58:09 PM
I have temporarily disabled the custom air resistance value for the LNER A4 class and re-instated the 1947 version of the saved game with the A4 class hauling the trains (these are slightly longer trains than previously, and I note that the loss of synchronisation no longer occurs at St. Mary Beddington). Before temporarily disabling the custom air resistance value, a loss of synchronisation could still be reproduced.

I have managed now to stay connected for quite a while after disabling the custom air resistance for the A4 class, but my computer crashed after about 30 minutes. I should be very grateful if anyone could carry out a longer-term test to check for stability.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 09, 2018, 04:55:12 PM
Further testing has shown that disabling the custom air resistance for the LNER A4 class makes no difference to the loss of synchronisation: this still occurs on occasions when the train arrives at St. Mary Beddington.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on December 09, 2018, 06:34:06 PM
I just got a desync after half to a full hour, while a train has been traveling through the St Mary station
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 09, 2018, 06:38:03 PM
Thank you for checking.

This is, I have to say, a fantastically and inexplicably complex problem to resolve - no test so far devised seems to have provided any meaningful insight into the nature of the problem. This really is bizarre beyond comprehension. I fear that this may take many, many more months to resolve without some serious progress.

If anyone can devise and execute some tests to narrow this issue down further, that would be most helpful.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on December 10, 2018, 12:26:41 AM
So is it related to the type of locomotive used or was that a premature conclusion?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 10, 2018, 12:37:38 AM
So is it related to the type of locomotive used or was that a premature conclusion?

It may be - so far, it appears not to lose synchronisation with a GWR Castle class, but it does lose synchronisation with the LNER A4 class (with or without a custom air resistance value). An earlier test showed loss of synchronisation with a Bulleid pacific (I think a Merchant Navy class, but it might have been the similar looking but lighter weight West Country class) and mail carriages. Nothing else has been tested so far - testing other locomotives with everything else the same would be a helpful start to try to narrow this down.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 10, 2018, 11:42:18 PM
One possible method of testing is to make a copy of the GWR Castle class for debugging purposes, and, one by one, change its characteristics to those of the LNER A4 class and re-test each time to see whether there is a loss of synchronisation. Because a full round of testing will involve the trains making the full circuit of the Northern Frontier Express for each modification, this will be an extremely time consuming method of testing.

However, I cannot at present devise any alternative. I am not likely to have the time to do this for a few days, but if anyone else can devise any more efficient test in the meantime, that would be very helpful.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 12, 2018, 10:34:37 PM
I have just commenced a test of seeing whether the LMS Princess Coronation Royal class when paired with the debugging carriages will provoke a loss of synchronisation or not. Testing several other classes of locomotives would be helpful to have good data as to whether the locomotive is relevant to the loss of synchronisation and, if so, in what way and to what extent.

Edit: A loss of synchronisation can be reproduced with this class of locomotive on departing St. Mary Beddington towards Elmley.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 22, 2018, 11:35:10 PM
I have been very busy making various Christmas preparations, so have not been able much to test this recently. However, I am now staying with my parents for the festive season with my Linux NUC and have tried logging into the server with that. I find that that also loses synchronisation.

This is significant, as I had previously thought from earlier testing that this problem cannot be reproduced when a Linux client connects to a Linux server: it seems that that is not the case. This significantly broadens the type of error that might cause this behaviour and makes it in some respects less confusing than it appeared to be hitherto.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 27, 2018, 01:39:54 PM
Further testing shows some even more perplexing results.

Running on the Linux NUC with a server and a client connected through the loopback interface, I cannot reproduce the loss of synchronisation at all, even after several hours of running.

I am wondering whether this is a checklist mismatch at all. It may well be necessary to add a feature to the UI to display, when a client loses synchronisation, the type of cause (e.g. checklist mismatch, command executed in the past, etc.) to assist with diagnostics.

Edit: Further testing shows that this is indeed a checklist mismatch: using a debug output and connecting to the server in that way produces the following debugging output (edited to show the relevant part):
Code: [Select]
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 66
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 66
Warning: network_check_activity():    received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: karte_t:::do_network_world_command:    sync_step=5128  server=[ss=5128 st=1282 nfc=0 rand=2876650628 halt=4097 line=1025 cnvy=4097 ssr=1905360171,3649131816,0,0,0,0,0,0 str=3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,2876650628,2876650628,2556660269,644001429,3649131816 exr=0,0,0,0,0,0,0,0  client=[ss=5128 st=1282 nfc=0 rand=2876650628 halt=4097 line=1025 cnvy=4097 ssr=1905360171,3649131816,0,0,0,0,0,0 str=3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,3649131816,2876650628,2876650628,2556660269,644001429,3649131816 exr=0,0,0,0,0,0,0,0 
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: network_check_activity():    received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: network_check_activity():    received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK:    time difference to server 0
Warning: karte_t:::do_network_world_command:    sync_step=5132  server=[ss=5132 st=1283 nfc=0 rand=2026024821 halt=4097 line=1025 cnvy=4097 ssr=3465126746,205948590,0,0,0,0,0,0 str=205948590,205948590,205948590,205948590,205948590,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2563437453,644638429,2026024821 exr=0,0,0,0,0,0,0,0  client=[ss=5132 st=1283 nfc=0 rand=205948590 halt=4097 line=1025 cnvy=4097 ssr=3465126746,205948590,0,0,0,0,0,0 str=205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,2563437453,644638429,205948590 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command:    disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect():    Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:    all static variables are reset
World finished ...
Show banner ...
Edit 2: Extracting from that, we seem to get the following pattern under "str=" for the failed step:
Client:
Code: [Select]
205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,205948590,2563437453,644638429,205948590
Server
Code: [Select]
205948590,205948590,205948590,205948590,205948590,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,2563437453,644638429,2026024821
This is in circumstances where the rand= are as follows on the failed step:
Client
Code: [Select]
rand=205948590
Server
Code: [Select]
rand=2026024821
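The str= comparison above can also be done mechanically. What follows is a rough Python sketch rather than anything from the Simutrans sources: the function names are invented, the parsing assumes the log layout pasted in this thread, and the flat index assumes that the 32 recorded values run ssr (8), then str (16), then exr (8).

```python
import re

LIST_KEYS = ("ssr", "str", "exr")  # assumed order of the 32 recorded seeds


def parse_side(text):
    """Turn 'ss=5132 st=1283 ... exr=0,0' into a dict of ints and int lists."""
    fields = {}
    for key, value in re.findall(r"(\w+)=([\d,]+)", text):
        numbers = [int(n) for n in value.split(",") if n]
        fields[key] = numbers if key in LIST_KEYS else numbers[0]
    return fields


def split_sides(line):
    """Split one do_network_world_command log line into (server, client)."""
    after_server = line.split("server=[", 1)[1]
    server_text, client_text = after_server.split("client=[", 1)
    return parse_side(server_text), parse_side(client_text)


def first_divergence(line):
    """Return (list name, index in that list, flat index), or None if equal."""
    server, client = split_sides(line)
    flat = 0
    for key in LIST_KEYS:
        for i, (s, c) in enumerate(zip(server[key], client[key])):
            if s != c:
                return key, i, flat + i
        flat += len(server[key])
    return None


# The failed step (sync_step=5132) copied from the log above.
FAILED_STEP = (
    "sync_step=5132  server=[ss=5132 st=1283 nfc=0 rand=2026024821 "
    "halt=4097 line=1025 cnvy=4097 ssr=3465126746,205948590,0,0,0,0,0,0 "
    "str=205948590,205948590,205948590,205948590,205948590,2026024821,"
    "2026024821,2026024821,2026024821,2026024821,2026024821,2026024821,"
    "2026024821,2563437453,644638429,2026024821 exr=0,0,0,0,0,0,0,0  "
    "client=[ss=5132 st=1283 nfc=0 rand=205948590 "
    "halt=4097 line=1025 cnvy=4097 ssr=3465126746,205948590,0,0,0,0,0,0 "
    "str=205948590,205948590,205948590,205948590,205948590,205948590,"
    "205948590,205948590,205948590,205948590,205948590,205948590,"
    "205948590,2563437453,644638429,205948590 exr=0,0,0,0,0,0,0,0 "
)

print(first_divergence(FAILED_STEP))
```

For the failed step this reports the first mismatch at str index 5, which is flat index 13 if the ordering assumption holds. Running the same function over several logs of the failure would show whether the divergence always starts in the same place.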
Title: Re: Instability on the Bridgewater-Brunel server
Post by: TurfIt on December 27, 2018, 07:56:22 PM
That means something in either convoi_t::threaded_step() or convoi_t::step() as called from karte_t::step() is grabbing a random on the server but not client...
There's still 'spare' checklist rands in this debug version (that you were supposed to remove after the last big desync hunt - 4? years ago), you can insert in between to disambiguate threaded_step from _step.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 27, 2018, 08:06:56 PM
That is a very interesting and useful analysis: thank you. I earlier experimented with disabling the convoy multi-threading, which did not affect the desync, so whatever is causing the problem is either in convoi_t::step(), or in that part of convoi_t::threaded_step() that is replicated in the single thread only version. 

Of course, it is entirely possible that the ultimate problem is not in either of those methods, but that those methods are where the divergence first manifests itself as a call to simrand() on the server but not on the client.
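
To illustrate why a single extra call is enough to break synchronisation, here is a toy Python sketch. It does not use the Simutrans generator: ToyRand is a made-up linear congruential generator, and run() is a hypothetical stand-in for the game loop that records the generator state after each step, much as the checklist records seeds.

```python
class ToyRand:
    """Toy linear congruential generator standing in for simrand()."""

    def __init__(self, seed):
        self.state = seed

    def next(self, upper):
        # Advance the state, then map it into the range [0, upper).
        self.state = (1103515245 * self.state + 12345) % 2**31
        return self.state % upper


def run(steps, extra_call_at=None):
    """Record the generator state after each step, like a checklist entry."""
    rng = ToyRand(seed=12345)
    recorded = []
    for step in range(steps):
        rng.next(100)          # the draw that both sides make
        if step == extra_call_at:
            rng.next(100)      # the draw that only one side makes
        recorded.append(rng.state)
    return recorded


server = run(10, extra_call_at=6)
client = run(10)
first_bad = next(i for i, (s, c) in enumerate(zip(server, client)) if s != c)
print(first_bad)
```

The recorded states agree up to the step with the extra call and never agree again afterwards. The mismatch therefore shows where the extra call was made, which need not be where the underlying cause lies.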
Title: Re: Instability on the Bridgewater-Brunel server
Post by: TurfIt on December 27, 2018, 08:25:37 PM
Finding the actual source of the problem is a problem itself with this style of debugging... end up down a lot of false paths.

I'd first start by ensuring the desync is consistently triggering from rands[13] (and hope you don't have more than one intermittent desync bug at play!), and then determine where all the simrand calls in convoi stepping are, and then determine what all conditions affect whether those calls are made.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on December 30, 2018, 12:25:16 AM
Thank you for your thoughts on this. As to a problem with this style of debugging, so far as I know, there is no better way of debugging an extremely difficult problem of this nature; unless you are aware of one?

In any event, I am now having trouble reproducing the loss of synchronisation. I have just been connected for ~2 hours with logging enabled and the trains on the Northern Frontier Express have made more than one complete circuit and there was no loss of synchronisation at all. I very much doubt that the problem has fixed itself (or that Ranran's UI improvements have had any effect), so this seems to be more intermittency. It is all very odd, but makes finding the problem much harder. I had been hoping to obtain some more logs of loss of synchronisation to test whether or not the loss of synchronisation always happens in the same place as revealed by the logs. I will try again to-morrow.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on December 30, 2018, 02:18:33 PM
Any sort of code change can move data around in memory.

Might be worth testing a reverted build to see if it comes back. This could explain some of the consistency issues.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 01, 2019, 10:50:11 PM
It is hard to see how having different memory locations would affect the loss of synchronisation bug - can you clarify how this would actually work?

I have re-tested and am still unable to reproduce this. I should be very grateful if anyone else could try connecting to the server to see whether the loss of synchronisation can still be reproduced. If not, I will try reverting the server saved game version to an earlier version in which time had not advanced so much and the issue could reliably be reproduced.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on January 02, 2019, 12:44:15 AM
Quote
It is hard to see how having different memory locations would affect the loss of synchronisation bug - can you clarify how this would actually work?
If the problem is the result of memory corruption or reading memory garbage then moving the code around or changing the size of a struct might change how things get packed in memory. This is likely different between Linux and Windows and possibly even other Linux builds which would explain how random the out of sync is.

The fact this bug is so obscure and hard to reliably recreate means that the cause must be something borderline insane.

The custom memory allocation used by Simutrans for performance might not be helping to detect such issues. Specifically, standard instrumentation such as is often used to check for correct memory usage might not be compatible with the custom allocation model. Have we tried debugging a build where it is bypassed for normal malloc/free (or new/delete) calls?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 02, 2019, 01:08:17 AM
This bug is very obscure, but not so random: there was a time when it would very reliably be triggered whenever a train arrived at St. Mary Beddington heading to Elmley. Removing the St. Mary Beddington stop then produced the exact same symptoms (reliably) two stops down the line. This suggests that the cause is not random, but chaotic: deterministic between different machines and sessions, but ultra-complex in cause (at least in the specific instance). Corrupt memory would be far more likely to cause desyncs at random whenever a particular part of the code were used. The symptoms suggest a logic bug, albeit a highly complex and obscure one, rather than something lower level such as memory corruption.

As to disabling the custom memory allocation, the trouble is that any debug build with this map is almost unusably slow, and one with memory optimisations disabled would be worse still (there is a preprocessor directive that will do this, I think).

I am planning to build a new high performance computer this year, which might well make it easier to debug large maps, but how much easier remains to be seen.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 02, 2019, 04:01:56 AM
I have now got myself a new (powerful) laptop, so, if time permits, I might now be able to investigate the desyncs. I'd like to start by taking a known bad save and seeing if I can reproduce the desync locally (probably running one of the client or server on a Windows virtual machine). It sounds like the current save might not be causing desyncs, so perhaps it would be helpful if you posted a suitable version of the savegame somewhere. This would also make it easier to access the savegame without having to build a matching pakset and client to connect to the server.

Incidentally, one of my issues seems to have been due to an unexpected version mismatch arising from my client having a nine-character revision while the server had a seven-character revision. I'm not sure why the lengths are different by two characters (one character would be more understandable). There also appears to be an assumption in revision.jse that the revision length is nine characters, but the server breaks this assumption.
A fix for this would be simple - replace the two instances of "git rev-parse --short HEAD" with "git rev-parse --short=9 HEAD"
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 02, 2019, 12:13:21 PM
Thank you for looking into this: that is most kind. I am currently away from home, where I have a few more versions on my computer, so the only version that I have available is the network game as it was actually running a short while (~ 2 game years) after the network issue was first detected. It can be downloaded here (http://bridgewater-brunel.me.uk/saves/bb-5-oct-2018.sve). You can increase the performance of this game by liquidating all players except Bay Transport from the command line by running the game as a server and using nettool.

Note, however, that I have never been able to reproduce the loss of synchronisation without actually connecting to the server (be it with a Linux or Windows client): the usual testing method of running a local server and connecting to it with a local client on the loopback interface has never been able to reproduce this error. That seems to be a clue itself as to the cause, but to what it is pointing I have no idea.

As to the server and client giving different numbers of characters for the revision, that is rather odd, especially as they are both compiled on the same machine (albeit the Windows ones are cross-compiled with different libraries). I will, however, implement your suggested fix: thank you for that.

Edit: Being away from my normal development environment, I am having trouble finding where git rev-parse --short HEAD is located: can you remind me? Thank you.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 02, 2019, 01:22:41 PM
Quote
especially as they are both compiled on the same machine

Actually they weren't, because I was using a self-compiled version. Perhaps that explains why the issue hasn't previously been encountered.

I've pushed the revision name fix to Github.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 02, 2019, 01:52:00 PM
Splendid, thank you: now incorporated.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 02, 2019, 11:12:23 PM
I am now again able to reproduce the loss of synchronisation on the game as it is currently on the server when a train arrives at St. Mary Beddington heading for Elmley. This was reproduced consistently twice in a row, and the second time, I logged it. Here is the relevant extract from the log:

Code: [Select]
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Warning: karte_t:::do_network_world_command: sync_step=73568  server=[ss=73568 st=18392 nfc=0 rand=1576148348 halt=4097 line=1025 cnvy=4097 ssr=2991505728,2840015283,0,0,0,0,0,0 str=2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2501980820,1576148348,2536695661,641877393,2840015283 exr=0,0,0,0,0,0,0,0  client=[ss=73568 st=18392 nfc=0 rand=1576148348 halt=4097 line=1025 cnvy=4097 ssr=2991505728,2840015283,0,0,0,0,0,0 str=2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2840015283,2501980820,1576148348,2536695661,641877393,2840015283 exr=0,0,0,0,0,0,0,0 
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=16, client_id=0
Warning: network_check_activity(): received cmd id=16 nwc_step_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Message: network_command_t::rdwr: read packet_id=9, client_id=0
Warning: network_check_activity(): received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK: time difference to server -66
Warning: karte_t:::do_network_world_command: sync_step=73572  server=[ss=73572 st=18393 nfc=0 rand=4030280172 halt=4097 line=1025 cnvy=4097 ssr=390225914,3158243328,0,0,0,0,0,0 str=3158243328,3158243328,3158243328,3158243328,3158243328,2091714030,2091714030,2091714030,2091714030,2091714030,2091714030,4030280172,4030280172,2543500209,642516720,2091714030 exr=0,0,0,0,0,0,0,0  client=[ss=73572 st=18393 nfc=0 rand=2998080384 halt=4097 line=1025 cnvy=4097 ssr=390225914,3158243328,0,0,0,0,0,0 str=3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,3158243328,2998080384,2998080384,2543500209,642516720,3158243328 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset: all static variables are reset
World finished ...
Show banner ...
Title: Re: Instability on the Bridgewater-Brunel server
Post by: TurfIt on January 03, 2019, 01:19:13 AM
Same spot, but you should be running checks every frame for debugging.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 03, 2019, 01:53:57 AM
Interesting - thank you. By checks every frame, you mean the server setting? I can change this fairly easily if so.
Edit: Done - this should take effect when the server restarts to-morrow morning.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 03, 2019, 03:59:49 PM
Another round of testing with the check set to every frame rather than every 32 frames: the first batch of trains to pass St. Mary Beddington did not trigger the loss of synchronisation. However, waiting for the trains to go all the way around the line and back again did trigger it. Here is the relevant part of the debug output:

Penultimate message
Code: [Select]
Message: network_command_t::rdwr:   read packet_id=9, client_id=0
Warning: network_check_activity():   received cmd id=9 nwc_check_t from socket[19]
Warning: NWC_CHECK:   time difference to server -66
Warning: karte_t:::do_network_world_command:   sync_step=129523  server=[ss=129523 st=32380 nfc=3 rand=1421316284 halt=4097 line=1025 cnvy=4097 ssr=4080282562,1421316284,0,0,0,0,0,0 str=3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3257590584,1337000643,1883002831,3548628848 exr=0,0,0,0,0,0,0,0  client=[ss=129523 st=32380 nfc=3 rand=1421316284 halt=4097 line=1025 cnvy=4097 ssr=4080282562,1421316284,0,0,0,0,0,0 str=3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3548628848,3257590584,1337000643,1883002831,3548628848 exr=0,0,0,0,0,0,0,0 

Final message
Code: [Select]
Warning: karte_t:::do_network_world_command:   sync_step=129524  server=[ss=129524 st=32381 nfc=0 rand=2141574950 halt=4097 line=1025 cnvy=4097 ssr=1421316284,3615887852,0,0,0,0,0,0 str=3615887852,3615887852,3615887852,3615887852,3615887852,1588385680,1588385680,1588385680,1588385680,1588385680,1588385680,46953431,2141574950,1343811081,1883642638,1588385680 exr=0,0,0,0,0,0,0,0  client=[ss=129524 st=32381 nfc=0 rand=1588385680 halt=4097 line=1025 cnvy=4097 ssr=1421316284,3615887852,0,0,0,0,0,0 str=3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,3615887852,2997474060,1588385680,1343811081,1883642638,3615887852 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command:   disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect():   Lost synchronisation with server. Random flags: 0
Warning: nwc_routesearch_t::reset:   all static variables are reset
World finished ...
Show banner ...
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 06, 2019, 05:17:40 AM
I have been attempting to reproduce a desync 'locally', using the save from the 5th of October that was linked a few days ago. I have tested running a server in Linux on my new laptop and connecting to it from a Windows VM on the same laptop, but have so far been unable to reproduce a desync by this method. I have also tried running a server on my old laptop (over seven years old and with only 2GB of RAM) which is also running Linux. It seems to be able to run the server, but even with a heavily reduced save it runs very slowly. I do, however, eventually get a desync, which I think happens almost immediately after the server starts running.

Code: [Select]
Warning: karte_t:::do_network_world_command:    sync_step=8  server=[ss=8 st=1 nfc=0 rand=1302837895 halt=2049 line=1025 cnvy=4097 ssr=552184448,3172636355,0,0,0,0,0,0 str=3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,2818197206,2818197206,2818197206,2818197206,1302837895,1302837895,138807,13216,2818197206 exr=0,0,0,0,0,0,0,0  client=[ss=8 st=1 nfc=0 rand=221087431 halt=2049 line=1025 cnvy=4097 ssr=552184448,3172636355,0,0,0,0,0,0 str=3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,3172636355,221087431,221087431,138807,13216,3172636355 exr=0,0,0,0,0,0,0,0 
Warning: karte_t:::do_network_world_command:    disconnecting due to checklist mismatch

My most recent testing has been using the save here (https://ac894.user.srcf.net/bb-5-oct-2018-removed-most-objects.sve). It is produced by modifying the code to remove almost all of the vehicles, stations and objects outside a bounding box for the Northern Frontier Express, along with a few bits that were manually removed. This seems to have led to some slight glitches - in particular, attempting to delete the partially removed bridge in the northern chain crashed the game. I have, however, achieved a significant reduction in memory usage (3GB, down from about 3.7GB after removing other companies, or 6GB with other companies included) and an improvement in running speed, which may help with debugging efforts.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 06, 2019, 12:26:51 PM
Interesting - thank you for your work on this. It is also interesting that you, too, cannot reproduce a loss of synchronisation on the loopback interface. I suspect that there must be some significance in this, but it is very hard to imagine what it is at this stage.

It would be very interesting to know whether the loss of synchronisation that you are able to reproduce occurs when trains on the Northern Frontier Express arrive at St. Mary Beddington station heading towards Elmley.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 07, 2019, 01:01:02 AM
I've had a closer look at the checklists, and it seems that we are looking at different causes of desyncs. My desync happens on the first step and arises during passenger generation, where the server seems to request one or more random numbers while the client does not.

Your desync however occurs (as might be expected) in the part of the code handling convoy stepping. This is between lines 5455 and 5490 in simworld.cc - random seeds written in line 5455 match, but the random seeds written in line 5490 do not.

I don't know whether my desync is triggered by the extreme conditions I'm running under (i.e. a very underresourced laptop), or something inconsistent in the save that I am using - perhaps you could see what desync you get if you run that save on your server.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 07, 2019, 07:26:54 AM
I've pushed some debugging changes to Github:
1) Change goods->cargo so that we can compile with DEBUG_SIMRAND_CALLS
2) Add the thread-local seed to the debugging output, to allow calls from different threads to be distinguished (using the seed seemed to be the easiest way to produce an identifier that always exists and ought to be unique*).
This latter change should help me to debug the desync I'm getting in the passenger generation, since it seems to be caused by an unwanted call to the main thread simrand.

*I've also changed the seeds used for the passenger generation thread to make them actually unique - previously the first passenger generation thread used the same seed as the main thread.
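
As an illustration of why the seed only works as a thread identifier once it is unique, here is a small Python sketch. The seeding formulas are purely hypothetical (the thread does not show how Simutrans actually derives its per-thread seeds); they merely show one way in which the first worker could end up sharing the main thread's seed, and one way to avoid it.

```python
def old_worker_seeds(main_seed, workers):
    """Hypothetical scheme: worker i gets main_seed + i, so worker 0 collides."""
    return [main_seed + i for i in range(workers)]


def new_worker_seeds(main_seed, workers):
    """Hypothetical fix: start the offset at 1 so no worker shares the main seed."""
    return [main_seed + i + 1 for i in range(workers)]


def identifiers_unique(main_seed, worker_seeds):
    """True if the main thread and every worker have distinct seeds."""
    all_seeds = [main_seed] + worker_seeds
    return len(set(all_seeds)) == len(all_seeds)


MAIN_SEED = 12345  # arbitrary example value
print(identifiers_unique(MAIN_SEED, old_worker_seeds(MAIN_SEED, 4)))
print(identifiers_unique(MAIN_SEED, new_worker_seeds(MAIN_SEED, 4)))
```

With a collision, log lines from the main thread and the first worker carry the same identifier and cannot be told apart, which is exactly what the debugging output is meant to allow.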
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 07, 2019, 08:59:45 PM
Thank you for this work and my apologies for the delay in replying: I have been returning from my holiday to-day.

I have now incorporated and pushed your patch, so this should be running on the server from to-morrow morning onwards. Do you still need me to run your saved game on the server?

Thank you again for all your work on this.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 07, 2019, 09:47:01 PM
Quote
Do you still need me to run your saved game on the server?
I suspect that it wouldn't be of much use - I haven't managed to reproduce that desync myself. On my latest run I managed to reach sync_step=46 before the client was somehow aborted (my mistake probably). On the other hand, my old laptop reached sync_step=8000(ish) before something happened to it, so it might just be feasible to run it as far as a desync.

I would suggest that the next step should be to run both a client and a server with DEBUG_SIMRAND_CALLS defined, so that we can hopefully get a better idea of where the deviation is arising. However, that might not be practical without further simplifying the game, due to the volume of debugging output that would give. In particular, for about 8000 sync steps I produced a log of over 1GB - a similar log for the duration that your test ran for would write over 16GB.
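
The 16GB figure can be checked with a quick calculation. This sketch assumes that log size grows linearly with the number of sync steps, and it takes sync_step=129524 (the failing step in the log posted on January 03) as the length of the test run.

```python
LOG_GB_PER_STEP = 1.0 / 8000   # about 1GB of log for about 8000 sync steps
TEST_RUN_STEPS = 129524        # failing sync_step in the log from January 03

estimated_gb = TEST_RUN_STEPS * LOG_GB_PER_STEP
print(f"{estimated_gb:.1f} GB")
```

This gives roughly 16.2GB, and since the 8000-step log was "over 1GB" the real figure would be somewhat higher.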

I'm going to try commenting out chunks of code and see what happens.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 07, 2019, 10:57:56 PM
Thank you - that is very helpful.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 09, 2019, 12:11:12 AM
I have been trying to deduce what calls to simrand() there might be during convoi_t::step(). There are actually very few places where simrand() is called within the convoy stepping, but one place is in void haltestelle_t::liefere_an(ware_t ware, uint8 walked_between_stations) at line 3269, where pedestrian_t::generate_pedestrians_at() is called.

This has in the past been a place where a loss of synchronisation first manifested itself as a result of convoys stopping at stops at a different time. It is consistent with observed behaviour that this is what is happening: the loss of synchronisation occurs fractionally before the train arrives at St. Mary Beddington (on the client) and the server is recorded as having a call to the random number generator that the client does not have: what I infer is probably happening is that, for some reason, the train is arriving at the station on the server a few steps ahead of the step at which it arrives on the client.

There are two possible causes for this type of loss of synchronisation:
(1) the train leaves earlier on the server than on the client; or
(2) the train travels faster on the server than it travels on the client.

The latter seems unlikely, as this would be an issue with the physics engine, and it is very unlikely that this would consistently occur in exactly one place and one place only (and then migrate further down the line when the stop at that station is removed).

That leaves the possibility that the train leaves the previous station slightly earlier on the server than it leaves on the client. I did wonder whether this might have been the trouble some time ago, but investigations into the code for deciding when the convoy would leave the previous station did not reveal anything that looked as if they might possibly cause any loss of synchronisation, and I note that that part of the code is run in the main thread.

However, I do note that there has been a bug report relating to overloading of trains in some cases beyond their overcrowded capacity. It is just possible that this might be related, as there is code that is intended to increase loading time as the number of passengers or amount of goods that board a vehicle increases. I will have to look into this bug and see whether solving it assists this issue.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 09, 2019, 05:26:06 AM
It sounds like a useful test would be to simultaneously run a client that typically desyncs, and a client that typically does not desync. When a desync happens, have a look at the passengers on the train that just arrived at St. Mary Beddington, as well as those listed as 'transferring' to St. Mary Beddington in the station window, and see if there are any differences.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 09, 2019, 09:09:37 AM
Thank you for that suggestion; however, I am not sure that there is a specific client that does and a specific client that does not reliably lose synchronisation at the same time; I am not aware of any test that has identified any such client.

Probably easier at this stage would simply be to fix the known issue with passenger loading and see whether that assists - even if it does not, it is not work wasted. It appears that the link for the file containing the reproduction case is broken, so I am waiting on Spenk009 to re-upload that before proceeding.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: DrSuperGood on January 09, 2019, 09:19:08 PM
One could rule out the first case being a problem by printing the step that all convoys depart and arrive at stations. In worst case this could go to a separate log file. Unlike random calls themselves, this log should be a lot more manageable and could even be filtered for the specific convoys that are causing problems.
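A sketch of how such a pair of logs might be compared is given below, in Python. The log format is an assumption made up for this example (one line per event: sync step, convoy id, "depart" or "arrive", stop name), as are the function names and the sample data; the actual debug output would need its own parser.

```python
from collections import defaultdict


def events_by_convoy(log_text):
    """Group 'step convoy event stop name' lines into per-convoy event lists."""
    per_convoy = defaultdict(list)
    for line in log_text.strip().splitlines():
        step, convoy, event, halt = line.split(None, 3)
        per_convoy[int(convoy)].append((int(step), event, halt))
    return per_convoy


def first_difference(server_log, client_log):
    """Return (step, convoy, server event, client event) for the earliest
    per-convoy event that differs between the two logs, or None."""
    server = events_by_convoy(server_log)
    client = events_by_convoy(client_log)
    differences = []
    for convoy in sorted(set(server) & set(client)):
        for s_event, c_event in zip(server[convoy], client[convoy]):
            if s_event != c_event:
                step = min(s_event[0], c_event[0])
                differences.append((step, convoy, s_event, c_event))
                break
    return min(differences) if differences else None


# Made-up sample data: convoy 17 arrives three steps later on the client.
SERVER_LOG = """
100 17 depart Halt A
100 23 arrive Halt B
140 17 arrive Halt C
"""
CLIENT_LOG = """
100 17 depart Halt A
100 23 arrive Halt B
143 17 arrive Halt C
"""

print(first_difference(SERVER_LOG, CLIENT_LOG))
```

If the departure steps match on both sides but the arrival steps differ, the first of the two candidate causes (the train leaving earlier on the server) is ruled out for that convoy.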
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 09, 2019, 09:28:23 PM
Thank you for that suggestion. I am not sure whether it will work, but I have added a line of code that will show a debug output when a convoy is set to depart, but only with the debug level set to 4, the highest setting. Hopefully, this will assist.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 11, 2019, 02:12:24 AM
I've added some basic checksums to the checklist - the first one sums the mantissae of convoy speeds, and the second sums the convoy weights. (Actually, it isn't quite over all convoys - I think convoys might be excluded for one sync step after exiting a depot, and maybe some other occasional sync steps). I've pushed this to Github.

The upshot is that this should detect the desync much earlier (hopefully identifying the sync step in which it happens), and perhaps even give us more indication of what is causing it.

EDIT: I don't think it's working properly, so probably best to leave this until I've got it working.
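
For readers following along, the idea of the speed checksum can be sketched in a few lines of Python. This is not the code pushed to Github: it assumes each convoy speed is available as a (mantissa, exponent) pair of integers, and the sample values are invented.

```python
MASK32 = 0xFFFFFFFF  # checklist entries are 32-bit unsigned values


def speed_checksum(speeds):
    """Sum the mantissa of each (mantissa, exponent) speed, wrapping at 32 bits."""
    total = 0
    for mantissa, _exponent in speeds:
        total = (total + mantissa) & MASK32
    return total


# Invented sample: the last convoy differs by one unit of mantissa.
server_speeds = [(0x80000000, 3), (0xC0000000, 2), (0xA0000000, 4)]
client_speeds = [(0x80000000, 3), (0xC0000000, 2), (0xA0000001, 4)]

print(speed_checksum(server_speeds) == speed_checksum(client_speeds))
```

A plain sum does not depend on the order in which convoys are visited, but it can miss two differences that happen to cancel out, so a matching checksum is evidence of agreement rather than proof.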
Title: Re: Instability on the Bridgewater-Brunel server
Post by: TurfIt on January 11, 2019, 03:29:20 AM
Might want to rdwr debug_sum instead of rand   ;)
Also, while the extra checklist items are called rand, you can stick anything in them (well anything uint32). Many are not currently used... and IIRC I intended the extra ones (exr=)  24-31 to be more general purpose to narrow in on a location like you're attempting with convoi speeds and weights. i.e. Didn't really need to add debug_sum.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 11, 2019, 03:34:57 AM
You spotted the issue* about the time that I did.

I noticed there were some unused rands, but I thought I'd go with something different to avoid having a misleading variable name - at the moment I think everything in rands is a random number seed recorded at some point, and I think it would be less intuitive to change that.

EDIT:
*Actually, I didn't read what you wrote properly - we actually found similar problems in two different places.

I think I've now pushed a correct version (along with a separate desync bug fix that was very helpful in testing).

EDIT 2:
It turns out convoy weight isn't suitable for inclusion in a checksum, because it is updated only when it is needed, which includes displaying it in gui windows. So I've pushed another commit to remove this. (There's also another commit on my master which is just refactoring some loop logic for simplicity.)
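
The problem with lazily updated values can be shown with a small Python sketch. The Convoy class here is a hypothetical stand-in, not the Simutrans convoi_t: it caches its weight and only recalculates when something asks for it, as a GUI window would.

```python
class Convoy:
    """Hypothetical convoy whose weight is recalculated only on demand."""

    def __init__(self, empty_weight):
        self.empty_weight = empty_weight
        self.cargo = []
        self.cached_weight = empty_weight
        self.dirty = False

    def load(self, cargo_weight):
        self.cargo.append(cargo_weight)
        self.dirty = True            # cache is stale until someone asks

    def get_weight(self):
        if self.dirty:
            self.cached_weight = self.empty_weight + sum(self.cargo)
            self.dirty = False
        return self.cached_weight


def cached_weight_checksum(convoys):
    """Checksum over the cached values, without forcing a recalculation."""
    return sum(c.cached_weight for c in convoys) & 0xFFFFFFFF


# Identical game state on both sides: one convoy that has loaded 30 units.
server_convoy = Convoy(empty_weight=50)
client_convoy = Convoy(empty_weight=50)
server_convoy.load(30)
client_convoy.load(30)

client_convoy.get_weight()  # only the client opens a window showing the weight

print(cached_weight_checksum([server_convoy]) == cached_weight_checksum([client_convoy]))
```

The two sides hold the same game state, yet the checksums differ because only one of them refreshed its cache. Forcing a recalculation before summing would make the values agree, but that changes when the update happens, which may itself be undesirable in a diagnostic.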
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 11, 2019, 04:56:23 PM
Splendid, thank you: now incorporated.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Vladki on January 11, 2019, 10:37:06 PM
Hello. Out of curiosity I have connected to Bridgewater Brunel server. Player "Bay transport Group" was probably using track network of some other player who went bankrupt, and all his stations and bridges were removed, while tracks were left mothballed. So many of his trains are stuck now. However  I have found one funny place: (1861, 3011) - it is a diagonal tunnel entrance. I thought that such thing is impossible. So I just thought if that could not be the cause of the problems?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Rollmaterial on January 11, 2019, 11:33:49 PM
I've been using diagonal tunnel entrances quite extensively for quite some time so I doubt they have anything to do with the desync.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 11, 2019, 11:40:41 PM
Quote
Hello. Out of curiosity I have connected to Bridgewater Brunel server. Player "Bay transport Group" was probably using track network of some other player who went bankrupt, and all his stations and bridges were removed, while tracks were left mothballed. So many of his trains are stuck now. However  I have found one funny place: (1861, 3011) - it is a diagonal tunnel entrance. I thought that such thing is impossible. So I just thought if that could not be the cause of the problems?

That seems most unlikely as it does not appear to be connected in any way with the circumstances in which the loss of synchronisation occurs (i.e. a train arriving in a specific direction at a specific location on a line that does not involve passing through such a tunnel).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 11, 2019, 11:49:23 PM
Quote
So many of his trains are stuck now.

You could try using this save (http://ac894.user.srcf.net/bb-5-oct-2018-removed-most-objects-2.sve) on the server - it has most of the objects in the world removed (most of what's left is the cities served by the Northern Frontier Express and the surrounding infrastructure). If it's possible to reproduce the desync on that then it will probably be much easier to look through the logs to debug it, given that there's very little left going on. I've generally begun by reducing the frequency of the line and sending half the trains to the depot (newly built at Bealdon Rye), to increase passenger loading per train, in case that's relevant, but I don't have a good save for that and haven't managed to trigger this desync.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 11, 2019, 11:52:04 PM
Interesting. Would it be worth using that save on the server itself, do you think?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 12, 2019, 12:04:42 AM
If you can trigger a desync using that save, then it will probably be easier to debug because of how much less is going on. The reduced network will reduce passenger loading, which could be relevant; hence my suggestion to reduce the service frequency. And if we can't get a desync after a while, then we can switch back to using a bigger save.

Incidentally, has anyone made any observations on how full the trains are when they desync?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 12, 2019, 07:25:35 AM
I have not observed that, but I did use special debugging carriages with overcrowding disabled in the current server save.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 13, 2019, 03:31:02 AM
With the extra checksum, I am now getting relatively quick desyncs when connecting to the server (within a couple of minutes). These are not obviously linked to activity around St. Mary Beddington or nearby stations. Unfortunately, I have no way of working out which convoy is out of sync.

I've pushed another checksum to Github which should solve this issue - with both the sum of speeds and the sum (speeds times id), it should usually be possible to derive the convoy's internal id (assuming that only one desyncs at a time).
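
The idea of pairing the two sums can be sketched as follows. This is an illustration only: the type and function names are hypothetical, the sums are assumed to be 32-bit values that wrap on overflow, and convoy ids are assumed to fit in 16 bits. If exactly one convoy differs in speed by d, the plain sums differ by d and the weighted sums differ by d times id (modulo 2^32), so the id can be found by trying each candidate.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical minimal view of a convoy for checksum purposes.
struct toy_speed {
    uint16_t id;
    uint32_t speed;
};

struct debug_sums {
    uint32_t plain;     // sum of speeds, wrapping at 2^32
    uint32_t weighted;  // sum of speed * id, wrapping at 2^32
};

debug_sums make_sums(const std::vector<toy_speed>& convoys) {
    debug_sums s{0, 0};
    for (const toy_speed& c : convoys) {
        s.plain += c.speed;
        s.weighted += c.speed * static_cast<uint32_t>(c.id);
    }
    return s;
}

// Returns every 16-bit id consistent with exactly one convoy being out of sync.
std::vector<uint16_t> candidate_ids(const debug_sums& server, const debug_sums& client) {
    const uint32_t d_plain = server.plain - client.plain;
    const uint32_t d_weighted = server.weighted - client.weighted;
    std::vector<uint16_t> ids;
    if (d_plain == 0) {
        return ids;  // no speed difference to attribute
    }
    for (uint32_t id = 0; id <= 0xFFFFu; ++id) {
        if (d_plain * id == d_weighted) {
            ids.push_back(static_cast<uint16_t>(id));
        }
    }
    return ids;
}

// Made-up numbers: only convoy 42 differs between the two sides.
inline void demo_single_mismatch() {
    const std::vector<toy_speed> server = {{7, 1000}, {42, 2500}, {300, 4096}};
    std::vector<toy_speed> client = server;
    client[1].speed = 2400;
    const std::vector<uint16_t> ids = candidate_ids(make_sums(server), make_sums(client));
    assert(ids.size() == 1 && ids[0] == 42);
}
```

The search returns a list rather than a single id because, when the speed difference is divisible by a large power of two, more than one id can satisfy the congruence; and if two convoys are out of sync at once, the list may be empty or misleading.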
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 13, 2019, 05:04:06 PM
Splendid, thank you: now incorporated.

Thank you very much for your work on this.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 14, 2019, 07:56:02 AM
Right, I now have some diagnostics from a desync.

Desync occurred at sync_step 27245, step 6811, and was due to a speed mismatch with convoy 4233:
Code: [Select]
Warning: karte_t:::do_network_world_command:    sync_step=27245  server=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1871189474,1645687888,0,0,0,0,0,0client=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1849259230,3304245548,0,0,0,0,0,0
Looking later in the log I find:
Code: [Select]
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4233) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 6818
So it looks like the convoy departed on the server approximately 7 steps earlier than on the client.

Since the convoy was probably stationary on the client when the desync occurred, I can conclude that the speed on the server must have had a mantissa of 4273037052, which makes the speed about 0.5% less than a power of 2 (which may just be a coincidence).
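
As a cross-check of these figures, the two leading `sums=` values can be pulled out of each half of the logged line and subtracted with 32-bit wraparound. The sketch below assumes that the first value is the wrapping sum of speed mantissas and the second the wrapping sum of mantissa times convoy id; that reading is an assumption, although it is consistent with the logged numbers. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Parses the first two comma-separated numbers after the first "sums=".
// Returns {0, 0} if the key is not present.
std::pair<uint32_t, uint32_t> first_two_sums(const std::string& text) {
    const std::size_t key = text.find("sums=");
    if (key == std::string::npos) {
        return {0, 0};
    }
    const std::string tail = text.substr(key + 5);
    std::size_t used = 0;
    const unsigned long long a = std::stoull(tail, &used);   // stops at the comma
    const unsigned long long b = std::stoull(tail.substr(used + 1));
    return {static_cast<uint32_t>(a), static_cast<uint32_t>(b)};
}

// Values copied from the server and client halves of the log line above.
inline void check_first_desync() {
    const auto s = first_two_sums("sums=1871189474,1645687888,0,0,0,0,0,0");
    const auto c = first_two_sums("sums=1849259230,3304245548,0,0,0,0,0,0");

    const uint32_t d_plain = s.first - c.first;        // server minus client
    const uint32_t d_weighted = s.second - c.second;   // wraps modulo 2^32

    assert(d_plain == 21930244u);
    // The same difference taken as client minus server wraps to the
    // mantissa figure quoted above.
    assert(static_cast<uint32_t>(c.first - s.first) == 4273037052u);
    // A single mismatching convoy with id 4233 explains both differences.
    assert(static_cast<uint32_t>(d_plain * 4233u) == d_weighted);
}
```

Calling `check_first_desync()` passes all three assertions with the logged values, which supports the attribution of this desync to convoy 4233.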

So the question is: why are trains departing from Bickstablewood Fields at such different times on the client and the server?

Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 14, 2019, 09:45:38 AM
Thank you for that: that is very helpful and interestingly confirms the hypothesis set out earlier in this thread regarding departure times.

The reason for the differing departure times is not straightforward to deduce. I had hoped that the issue would have related to the overcrowding bug, but that has now been fixed and turned out to be deterministic in nature (an integer underflow), so I suspect that this is not now likely to be the cause.

The moment of departure is calculated in convoi_t::hat_gehalten(). The relevant part of the code is here:

Code: [Select]

const sint64 now = welt->get_ticks();
   if(arrival_time > now || arrival_time == WAIT_INFINITE)
   {
      // This is a workaround for an odd bug the origin of which is as yet unclear.
      go_on_ticks = WAIT_INFINITE;
      arrival_time = now;
      if (arrival_time < WAIT_INFINITE)
      {
         dbg->error("void convoi_t::hat_gehalten(halthandle_t halt)", "Arrival time is in the future for convoy %u at stop %u", self.get_id(), halt.get_id());
      }
   }
   const sint64 reversing_time = schedule->get_current_entry().reverse > 0 ? (sint64)calc_reverse_delay() : 0ll;
   bool running_late = false;
   sint64 go_on_ticks_waiting = WAIT_INFINITE;
   const sint64 earliest_departure_time = arrival_time + ((sint64)current_loading_time - reversing_time);
   if(go_on_ticks == WAIT_INFINITE)
   {
      if(haltestelle_t::get_halt(get_pos(), get_owner()) != haltestelle_t::get_halt(schedule->get_current_entry().pos, get_owner()))
      {
         // Sometimes, for some reason, the loading method is entered with the wrong schedule entry. Make sure that this does not cause
         // convoys to become stuck trying to get a full load at stops where this is not possible (freight consumers, etc.).
         loading_limit = 0;
      }
      if((!loading_limit || loading_level >= loading_limit) && !wait_for_time)
      {
         // Simple case: do not wait for a full load or a particular time.
         go_on_ticks = std::max(earliest_departure_time, arrival_time);
      }
      else
      {
         // Wait for a % load or a spacing slot.
         sint64 go_on_ticks_spacing = WAIT_INFINITE;

         if(line.is_bound() && schedule->get_spacing() && line->count_convoys())
         {
            // Departures/month
            const sint64 spacing = welt->ticks_per_world_month / (sint64)schedule->get_spacing();
            const sint64 spacing_shift = (sint64)schedule->get_current_entry().spacing_shift * welt->ticks_per_world_month / (sint64)welt->get_settings().get_spacing_shift_divisor();
            const sint64 wait_from_ticks = ((now + reversing_time - spacing_shift) / spacing) * spacing + spacing_shift; // remember, it is integer division
            sint64 queue_pos = halt.is_bound() ? halt->get_queue_pos(self) : 1ll;
            go_on_ticks_spacing = (wait_from_ticks + spacing * queue_pos) - reversing_time;
         }

         if(schedule->get_current_entry().waiting_time_shift > 0)
         {
            // Maximum wait time
            go_on_ticks_waiting = now + (welt->ticks_per_world_month >> (16ll - (sint64)schedule->get_current_entry().waiting_time_shift)) - (sint64)reversing_time;
         }

         if (schedule->get_spacing() && !line.is_bound())
         {
            // Spacing should not be possible without a line, but this can occasionally occur. Without this, the convoy will wait forever.
            go_on_ticks_spacing = earliest_departure_time;
         }

         go_on_ticks = std::min(go_on_ticks_spacing, go_on_ticks_waiting);
         go_on_ticks = std::max(earliest_departure_time, go_on_ticks);
         running_late = wait_for_time && (go_on_ticks_waiting < go_on_ticks_spacing);
         if(running_late)
         {
            go_on_ticks = earliest_departure_time;
         }
      }
   }

   // loading is finished => maybe drive on
   bool can_go = false;

   can_go = loading_level >= loading_limit && (now >= go_on_ticks || !wait_for_time);
   //can_go = can_go || (now >= go_on_ticks_waiting && !wait_for_time); // This is pre-14 August 2016 code
   can_go = can_go || (now >= go_on_ticks && !wait_for_time);
   can_go = can_go || running_late;
   can_go = can_go || no_load;
   can_go = can_go && state != WAITING_FOR_CLEARANCE && state != WAITING_FOR_CLEARANCE_ONE_MONTH && state != WAITING_FOR_CLEARANCE_TWO_MONTHS;
   can_go = can_go && now > earliest_departure_time;
   if(can_go) {

      if(withdraw  &&  (loading_level==0  ||  goods_catg_index.empty())) {
         // destroy when empty
         self_destruct();
         return;
      }

      // add available capacity after loading(!) to statistics
      for (unsigned i = 0; i<vehicle_count; i++) {
         book(get_vehicle(i)->get_cargo_max()-get_vehicle(i)->get_total_cargo(), CONVOI_CAPACITY);
      }

      // Advance schedule
      advance_schedule();
      state = ROUTING_1;
      dbg->message("void convoi_t::hat_gehalten(halthandle_t halt)", "Convoy %s departing from stop %s at step %i", get_name(), halt.is_bound() ? halt->get_name() : "unknown", welt->get_steps());
   }

   // reset the wait_lock
   if(state == ROUTING_1)
   {
      wait_lock = 0;
   }
   else
   {

      if (loading_limit > 0 && !wait_for_time)
      {
         wait_lock = (earliest_departure_time - now) / 2;
      }
      else
      {
         wait_lock = (go_on_ticks - now) / 2;
      }
      // The random extra wait here is designed to avoid processing every convoy at once
      wait_lock += (self.get_id()) % 1024;
      if (wait_lock < 0 )
      {
         wait_lock = 0;
      }
      else if(wait_lock > 8192 && go_on_ticks == WAIT_INFINITE)
      {
         // This is needed because the above calculation (from Standard) produces excessively
         // large numbers on occasions due to the conversion in Extended of certain values
         // (karte_t::ticks and go_on_ticks) to sint64. It would be better ultimately to fix that,
         // but this seems to work for now.
         wait_lock = 8192;
      }
   }


This code works by calculating the go_on_ticks number, comparing this with the current number of ticks and setting the wait_lock to 0 and state to ROUTING_1 if the go_on_ticks value is equal to or less than the current number of ticks. I have just pushed a change to the code which will give the go_on_ticks value in the convoy departure debug message so that this can be compared between client and server. We will then know whether the problem is that the go_on_ticks value is calculated differently, or whether the problem is that a divergence occurs after this number has been calculated.
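
To make the spacing part of that calculation easier to compare by hand, here is a stripped-down sketch of the arithmetic for the "wait for a spacing slot" branch. It mirrors the expressions in the code quoted above but takes plain integers instead of the game objects; the `spacing_inputs` struct and the function name are hypothetical. Given identical inputs it must give identical results on client and server, so a differing go_on_ticks would point at one of the inputs (now, reversing time or queue position).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the values the quoted code reads from the world,
// the schedule and the halt.
struct spacing_inputs {
    int64_t now;                   // welt->get_ticks()
    int64_t ticks_per_month;       // welt->ticks_per_world_month
    int64_t departures_per_month;  // schedule->get_spacing(), must be > 0
    int64_t shift_numerator;       // spacing_shift of the current entry
    int64_t shift_divisor;         // get_spacing_shift_divisor(), must be > 0
    int64_t reversing_time;
    int64_t queue_pos;             // halt->get_queue_pos(self), or 1
};

int64_t spacing_departure_tick(const spacing_inputs& in) {
    assert(in.departures_per_month > 0 && in.shift_divisor > 0);
    const int64_t spacing = in.ticks_per_month / in.departures_per_month;
    assert(spacing > 0);
    const int64_t shift = in.shift_numerator * in.ticks_per_month / in.shift_divisor;
    // Integer division rounds down to the start of the current slot.
    const int64_t wait_from =
        ((in.now + in.reversing_time - shift) / spacing) * spacing + shift;
    return (wait_from + spacing * in.queue_pos) - in.reversing_time;
}
```

For example, with 1000 ticks per month, 4 departures per month, no shift, no reversing time and queue position 1, a convoy arriving at tick 620 falls in the slot starting at tick 500 and is given a departure tick of 750.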

Edit: One thing that might be worth checking: does the log contain any messages of the following type:
Quote
void convoi_t::hat_gehalten(halthandle_t halt): Arrival time is in the future for convoy 4233 at stop [whatever the ID of Bickstablewood Fields is]
?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 14, 2019, 01:58:19 PM
I am now getting desyncs *before* convoys reach Bickstablewood Fields Railway Station. The latest one happened when the train was still on the viaduct about 1km from the edge of the station, and about 99 steps before departure. I am ... surprised.

Code: [Select]
Warning: karte_t:::do_network_world_command:    sync_step=27245  server=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1871189474,1645687888,0,0,0,0,0,0client=[ss=27245 st=6811 nfc=1 rand=2594737631 halt=4097 line=1025 cnvy=4097 ssr=1910350214,2594737631,0,0,0,0,0,0 str=1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1420675956,1910350214,3692126888,82257853,1420675956 exr=0,0,0,0,0,0,0,0 sums=1849259230,3304245548,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4233) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 6818
Convoy 4233 out of sync
Client speed mantissa is 21930244 less than server speed mantissa

Convoys 4225,4223,4221,4220 passed (I believe) while offline.

Convoys 4233,4225,4223,4221,4220 all passed once without desyncs.

Warning: karte_t:::do_network_world_command:    sync_step=175613  server=[ss=175613 st=43903 nfc=1 rand=2729897307 halt=4097 line=1025 cnvy=4097 ssr=3466879092,2729897307,0,0,0,0,0,0 str=931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,2659737695,3466879092,1858232909,2730814151,931275724 exr=0,0,0,0,0,0,0,0 sums=3985059028,4134635295,0,0,0,0,0,0client=[ss=175613 st=43903 nfc=1 rand=2729897307 halt=4097 line=1025 cnvy=4097 ssr=3466879092,2729897307,0,0,0,0,0,0 str=931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,931275724,2659737695,3466879092,1858232909,2730814151,931275724 exr=0,0,0,0,0,0,0,0 sums=3963128784,1498225659,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4233) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 43912
Convoy 4233 out of sync
Client speed mantissa is 21930244 less than server speed mantissa

Convoys 4225,4223 passed while offline.

Warning: karte_t:::do_network_world_command:    sync_step=260259  server=[ss=260259 st=65064 nfc=3 rand=3777191495 halt=4097 line=1025 cnvy=4097 ssr=520123040,3777191495,0,0,0,0,0,0 str=3921714216,3921714216,3921714216,3921714216,3921714216,3227137456,3227137456,3227137456,3227137456,3227137456,3227137456,3227137456,3993339209,3531713451,331365175,3227137456 exr=0,0,0,0,0,0,0,0 sums=3742272234,3644090924,0,0,0,0,0,0client=[ss=260259 st=65064 nfc=3 rand=3777191495 halt=4097 line=1025 cnvy=4097 ssr=520123040,3777191495,0,0,0,0,0,0 str=3921714216,3921714216,3921714216,3921714216,3921714216,3227137456,3227137456,3227137456,3227137456,3227137456,3227137456,3227137456,3993339209,3531713451,331365175,3227137456 exr=0,0,0,0,0,0,0,0 sums=3715599885,2729255491,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4221) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 65084
Convoy 4221 out of sync
Client speed mantissa is 26672349 less than server speed mantissa

Warning: karte_t:::do_network_world_command:    sync_step=262259  server=[ss=262259 st=65564 nfc=3 rand=2224400415 halt=4097 line=1025 cnvy=4097 ssr=1010227916,2224400415,0,0,0,0,0,0 str=3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,1451021687,1867168680,2503084853,234854611,3548066625 exr=0,0,0,0,0,0,0,0 sums=87831369,260559438,0,0,0,0,0,0client=[ss=262259 st=65564 nfc=3 rand=2224400415 halt=4097 line=1025 cnvy=4097 ssr=1010227916,2224400415,0,0,0,0,0,0 str=3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,3548066625,1451021687,1867168680,2503084853,234854611,3548066625 exr=0,0,0,0,0,0,0,0 sums=63690176,1463940082,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4220) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 65653
Convoy 4220 out of sync
Client speed mantissa is 24141193 less than server speed mantissa
This was well before arrival at Bickstablewood Fields Railway Station!

Convoys 4233,4225,4223,4221,4220 all passed again without desync (last one passed at about 4:30pm).

Convoy 4233 passed again without desync. However, there was a desync shortly before convoy 4225 arrived. It does not seem to be consistent with the speed of only one convoy being out of sync.
Warning: karte_t:::do_network_world_command:    sync_step=399930  server=[ss=399930 st=99982 nfc=2 rand=2054363945 halt=4097 line=1025 cnvy=4097 ssr=585326164,2054363945,0,0,0,0,0,0 str=2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,2846632917,797446143,985621664,199545851,1364821627,2846632917 exr=0,0,0,0,0,0,0,0 sums=1266174254,4190039820,0,0,0,0,0,0client=[ss=399930 st=99982 nfc=2 rand=693739107 halt=4097 line=1025 cnvy=4097 ssr=2343166075,693739107,0,0,0,0,0,0 str=617314065,617314065,617314065,617314065,617314065,617314065,617314065,617314065,617314065,617314065,617314065,4010552739,2385058720,199545769,1364821586,617314065 exr=0,0,0,0,0,0,0,0 sums=3896737550,310673654,0,0,0,0,0,0
Message: void convoi_t::hat_gehalten(halthandle_t halt):        Convoy (4225) LMS Class 7P "Princess Royal" departing from stop Bickstablewood Fields Railway Station at step 100039

Convoys 4223,4221,4220 approached and passed while I was disconnected.

This is a summary of the two key lines from the logs, and some numbers derived from them. I may edit this to add more later.

EDIT: There seems to be a pattern of desyncs occurring only on every second round trip. It could just be a coincidence.

EDIT: Hmm, last bit of data seems to break even more patterns. The debug sums do not seem to be consistent with a single convoy being out of sync, but it would be very strange for two convoys to go out of sync at the same time. This just gets even more confusing.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 14, 2019, 10:31:47 PM
I have also noticed the pattern with the loss of synchronisation occurring at every other round trip. Quite what should be causing this is, however, entirely obscure.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 15, 2019, 01:39:59 AM
I presume you have access to the server logs - perhaps you could look over them to see what was reported during the desync events above. The most useful thing is probably to find all (relevant) lines relating to convoy departure from Bickstablewood Fields Railway Station to compare the convoy departure times.

EDIT: Some more debugging info, quite useful this time.

I enabled the debug output within #ifdef DEBUG_ACCELERATION, and also added some for sp_soll and sp_hat in convoi_t::sync_step. The conclusion is that my most recent desync happened while the convoy was slowing for the semaphore signal before Bickstablewood Fields station, one tile before the signal cleared to allow the train into the station.

Debug output, filtered for relevant lines:
Code: [Select]
Warning: convoi_t::calc_acceleration 1: 153) at tile 163 next limit of   0 km/h, current speed 113 km/h,   187 steps til brake,  2304 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=97969, sp_hat=94208
Warning: karte_t::interactive: sync_step=624721  chklist=[ss=624721 st=156180 nfc=1 rand=2946203112 halt=4097 line=1025 cnvy=4097 ssr=2263788107,2946203112,0,0,0,0,0,0 str=4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,2263788107,3511651656,732188843,4164642788 exr=0,0,0,0,0,0,0,0 sums=1399152953,439014957,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -72 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=99317, sp_hat=98304
Warning: karte_t::interactive: sync_step=624722  chklist=[ss=624722 st=156180 nfc=2 rand=920875045 halt=4097 line=1025 cnvy=4097 ssr=2946203112,920875045,0,0,0,0,0,0 str=4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,2263788107,3511651656,732188843,4164642788 exr=0,0,0,0,0,0,0,0 sums=1310593905,793669416,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -72 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=96599, sp_hat=94208
Warning: karte_t::interactive: sync_step=624723  chklist=[ss=624723 st=156180 nfc=3 rand=3515893034 halt=4097 line=1025 cnvy=4097 ssr=920875045,3515893034,0,0,0,0,0,0 str=4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,4164642788,2263788107,3511651656,732188843,4164642788 exr=0,0,0,0,0,0,0,0 sums=4031560277,3499533278,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -75 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98008, sp_hat=94208
Warning: karte_t::interactive: sync_step=624724  chklist=[ss=624724 st=156181 nfc=0 rand=3574004632 halt=4097 line=1025 cnvy=4097 ssr=3515893034,3090030907,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=355198068,323033399,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -75 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=99447, sp_hat=98304
Warning: karte_t::interactive: sync_step=624725  chklist=[ss=624725 st=156181 nfc=1 rand=1144307454 halt=4097 line=1025 cnvy=4097 ssr=3574004632,1144307454,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=8624033,1041624339,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -75 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=96820, sp_hat=94208
Warning: karte_t::interactive: sync_step=624726  chklist=[ss=624726 st=156181 nfc=2 rand=3716921189 halt=4097 line=1025 cnvy=4097 ssr=1144307454,3716921189,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=316715136,1043566946,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -78 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98319, sp_hat=98304
Warning: karte_t::interactive: sync_step=624727  chklist=[ss=624727 st=156181 nfc=3 rand=3675663255 halt=4097 line=1025 cnvy=4097 ssr=3716921189,3675663255,0,0,0,0,0,0 str=3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3090030907,3574004632,3518523764,732833386,3090030907 exr=0,0,0,0,0,0,0,0 sums=2301287915,37053678,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -78 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=95753, sp_hat=94208
Warning: karte_t::interactive: sync_step=624728  chklist=[ss=624728 st=156182 nfc=0 rand=1400799860 halt=4097 line=1025 cnvy=4097 ssr=3675663255,3931085262,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=2896912166,1038970125,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -81 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=97149, sp_hat=94208
Warning: karte_t::interactive: sync_step=624729  chklist=[ss=624729 st=156182 nfc=1 rand=1556871974 halt=4097 line=1025 cnvy=4097 ssr=1400799860,1556871974,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=1050489881,1571753344,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 113 km/h,   -60 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98087, sp_hat=94208
Warning: karte_t::interactive: sync_step=624730  chklist=[ss=624730 st=156182 nfc=2 rand=3410090894 halt=4097 line=1025 cnvy=4097 ssr=1556871974,3410090894,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=3675369905,3865676177,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 112 km/h,   -40 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=98458, sp_hat=98304
Warning: karte_t::interactive: sync_step=624731  chklist=[ss=624731 st=156182 nfc=3 rand=1696428903 halt=4097 line=1025 cnvy=4097 ssr=3410090894,1696428903,0,0,0,0,0,0 str=3931085262,3931085262,3931085262,3931085262,3931085262,248477531,248477531,248477531,248477531,248477531,248477531,248477531,1400799860,3525395872,733477929,248477531 exr=0,0,0,0,0,0,0,0 sums=453163847,2708091143,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 154) at tile 163 next limit of   0 km/h, current speed 112 km/h,   -17 steps til brake,  2048 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=94161, sp_hat=90112
Warning: karte_t::interactive: sync_step=624732  chklist=[ss=624732 st=156183 nfc=0 rand=726217411 halt=4097 line=1025 cnvy=4097 ssr=1696428903,1249187829,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=3759666517,1546734706,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 111 km/h,  -247 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=97588, sp_hat=94208
Warning: karte_t::interactive: sync_step=624733  chklist=[ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2735623852,2582478221,0,0,0,0,0,0

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 110 km/h,  -227 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=96351, sp_hat=94208
Warning: karte_t::interactive: sync_step=624734  chklist=[ss=624734 st=156183 nfc=2 rand=3125117933 halt=4097 line=1025 cnvy=4097 ssr=332093103,3125117933,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=1847295082,4139723,0,0,0,0,0,0

Warning: NWC_CHECK: time difference to server 7920
Message: network_world_command_t::execute: do_command 9 at sync_step 624859 world now at 624734
Warning: karte_t:::do_network_world_command: sync_step=624733  server=[ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2757554096,748478609,0,0,0,0,0,0client=[ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2735623852,2582478221,0,0,0,0,0,0
Warning: karte_t:::do_network_world_command: disconnecting due to checklist mismatch
Warning: karte_t::network_disconnect(): Lost synchronisation with server. Random flags: 0

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 110 km/h,  -201 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=48456, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 109 km/h,  -193 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=49574, sp_hat=49152

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 109 km/h,  -181 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=46453, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 109 km/h,  -170 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=47285, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 108 km/h,  -156 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=48017, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 108 km/h,  -148 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=48610, sp_hat=45056

Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 108 km/h,  -136 steps til brake,  1792 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=705049, sp_hat=704512

Warning: convoi_t::calc_acceleration 1: 156) at tile 177 next limit of   0 km/h, current speed 103 km/h,  3364 steps til brake,  5120 steps til stop
Warning: convoi_t::sync_step: Convoy (4225) DRIVING with sp_soll=289299, sp_hat=286720
(Convoy 4225 was out of sync. The client speed mantissa was 21930244 less than server speed mantissa)

I enabled the debug output within #ifdef DEBUG_ACCELERATION (for specific convoys only), and also added some for sp_soll and sp_hat in convoi_t::sync_step. The conclusion is that this desync (at around 4am) happened while the convoy was slowing for the semaphore signal before Bickstablewood Fields station, one tile before the signal cleared to allow the train into the station. The nature of the desync is consistent with the signal clearing on the server one hop before it cleared on the client.

Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 15, 2019, 10:07:14 AM
Thank you very much for your work on this: this is most interesting. What I was planning to do was to check to see whether the convoy arriving in the future error message was triggered and try to work out what might cause that, but this suggests that this is not relevant to the cause of the issue.

I am currently away from home and do not have access to the server logs; those logs are not generated at log level 4 in any event (log level 2 or 3 is used instead, I believe, which shows warnings but not messages), but I can check when I get home if this degree of logging would be helpful.

This suggests that the problem is either that there is something divergent between client and server in the signalling code or that the train is leaving the previous station earlier on the server than on the client.

One thing that is interesting to note is that the train slows for a signal ahead of Brickstable Fields station at all: if the signalling is well designed (i.e. with a distant signal far enough in front of the stop signal), the trains should not have to slow for this signal at all, which might suggest a bug with the signalling; but the distant signal may not be placed far enough away, so this slowing may result simply from sub-optimal signalling design.

Does your logging enable us to tell whether the trains departed the station before Brickstable Fields at the same time? If they did, we will need to look intensively at the signalling code itself, although it is not at all clear to me at present how this code might be divergent between client and server (but seemingly in a deterministic way).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 15, 2019, 11:18:22 AM
The debug_sums in the checklist would diverge at the previous station if the convoy was departing at different times, since the speeds wouldn't be the same. I am confident that the first mismatch of speeds was on the approach to that signal.

I will note, however, that the symptoms we have experienced suggest that that is not the only place that things can go wrong. Hopefully fixing what goes wrong at that point will fix the problem elsewhere too.

I had a quick look at what seems to be the relevant bit of signalling code, and my first thoughts were that it's sufficiently hard to read that I'd eventually want to refactor it to help me understand it. It seems I might have to dig into it soon anyway (at least on some superficial level).

I think detailed logs from the server would be less useful now that we have evidence of a problem with the signal clearing.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Spenk009 on January 15, 2019, 12:07:00 PM
Does the convoy resume acceleration immediately after the signal checking completes? If either machine finishes at a different time or the route polling occurs at different times (the polling rate seems random, when vehicles re-route or re-check) the convoy would find itself slightly shifted in time/space/speed.

Without knowing much about the code, do both make the same routing (full reroute vs reserving track ahead)?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 15, 2019, 12:17:37 PM
Thank you for that: I have enough information now that I can look into this part of the code in detail to try to detect any problems, although this may take some time.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on January 15, 2019, 01:30:24 PM
Quote
Does the convoy resume acceleration immediately after the signal checking completes?
I am fairly confident that there is no race condition involving the signal check (which is in hop_check) and the convoy movement, since they are both run as part of the convoy sync_step. If there is a race condition, then it would probably be between the signal check, and something that modifies data used by the signal check (but isn't part of the sync_step).

Quote
the polling rate seems random, when vehicles re-route or re-check
If you're looking at data from a single-player game, then you should note that the step and sync_step durations are handled completely differently in single player and network play.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 15, 2019, 09:26:39 PM
Looking at the signalling at this location, I immediately notice two linked anomalies. First of all, the area is generally signalled by track circuit block signals, but the last signals before the station approaching from the Bealden Rye end are older absolute block mechanical signals. One of these, at 6331,1260 would presumably be the signal that is the issue in your testing.

Secondly, these signals purport to be connected to a mechanical signalbox at 6345,1259. However, there is no such signalbox: instead, part of Bickstable Fields railway station stands on that tile. One can imagine that a mechanical signalbox once stood there before the station was widened to four platforms and resignalled.

It should not be possible for signals to exist where their signalbox has been destroyed, as the destruction of the signalbox should automatically delete all connected signals. I notice that these signals are on an elevated way, so it is possible that this in some way caused the failure to delete these signals properly.

I have set a breakpoint for the point in the code when the block reserver first engages with the signal at 6331,1260, and I will see if I can work out how these anomalies affect the code.

Edit: Note for internal reference that, in a save running in single player mode, the signal first clears when the train is at 6273,1255.

Edit 2: The signalbox had not been deleted after all: it is still present, underneath the bridge.

Edit 3: The check at 6273,1255 is with route_index at 100 and next_block at 162.

Edit 4: This seems to be consistent in position over a round of trains passing Brickstable Fields.

Edit 5: When running as a server in debug mode, the signal first clears when the train is at 6273,1255, with route_index at 100 and next_block at 162, just as in single player mode.


Edit 6: The above is repeatable with multiple passes of trains.


I think that that is as far as I can go to-night before going to bed. I have yet to find any actual anomalies in the running code, but it is noteworthy that the problem occurs with (1) signals placed in a very odd way (it appears as though whoever was playing Bay Transport forgot to delete the old signals when resignalling the station); and (2) a signalbox placed under a single height elevated rail line, which should not be possible.

The signalling system has not been tested for the case of an absolute block combined signal sandwiched between two track circuit block signals each controlled by the same signalbox as each other. However, the signal appears to be working as intended (i.e., as a stand-alone stop signal, the distant aspect not working as the next signal is not of the absolute block type and in any event showing danger as the train is stopping at the station), so there is no immediate clue as to what the issue could be here.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on January 16, 2019, 10:43:16 PM
Looking now at when the trains first attempt to reserve the wayward semaphore signal, my first result shows a route_index of 74, next_signal of 162, tiles_to_check of 7 and last_index of 176. The train had reserved up to this signal already, however, so I will need to check again with some other trains before being sure of the significance of these figures.

Edit: As is usual with the signalling system, it checks the route a number of times as it progresses. The next check seems to be at route_index 86. The next_signal is at 145 before the block reserver has finished running and 162 afterwards. modified_sighting_distance_tiles and tiles_to_check remain at 7.


Edit 2: The next check is at route_index 100 with the same figures otherwise save that now next_signal is 65530 (i.e., the semaphore has cleared).

Edit 3: The next check is at route_index 140 with tiles_to_check being 5, last_index 176 and the next_signal being still at 145 before running the block reserver and 162 afterwards. At this point, the route is shown reserved into the station.


Edit 4: It then continues to re-check every tile up to and including 144 and then again at 155, with variation only in modified_sighting_distance_tiles. At route_index 155 it is called again, this time with the next_signal having not been reset to 145 before it is called, but being at 65535 (meaning that the train is clear to run to the end of its calculated route without stopping first).


Edit 5: It then repeats this up to and including route_index 161 (one tile before the semaphore signal) with variations in the modified_sighting_distance_tiles.

Edit 6: I should note that this is running locally in server mode.



Edit 7: Re-loading a slightly older saved game where the trains had yet to approach this signal, the first hit appears now to be at route_index 19, with next_signal at 145 before and 162 after running the block_reserver, modified_sighting_distance_tiles at 7, last_index at 176 and tiles_to_check at 6. This is when the reservation up to (but not beyond) the semaphore signal is first made.

Edit 8: The next check seems to be at route_index 74 with the usual 145/162 pattern.

Edit 9: The next check is now at 86.

Edit 10: The next are at 100, 140, 141, 142, 143, 144, 155, 156, 157, 158, 159, 160 and 161: all the same as previously.


Edit 11: For the next train, the checks are at 19, 74, 86, 100, 140, 141, 142, 143, 144, 155, 156, 157, 158, 159, 160 and 161: in other words, the same check points as the previous train.
The next step is to attempt the same exercise with a debug client connected to a local server on the loopback interface, but that will have to await another evening.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on February 24, 2019, 01:21:20 PM
Turning back to this and comparing my results above with the server logs: the server logs are somewhat misleading because the text for generating the messages is incorrect. The actual log outputs read as follows, e.g.,

Code: [Select]
Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, current speed 110 km/h,  -227 steps til brake,  1792 steps til stop

This reads as though the convoy is at "tile 163". This is slightly misleading in that 163 is not a tile in an absolute sense (this would have an x,y co-ordinate), but is rather the current route index (i.e. the 163rd tile of the convoy's current route to its next immediate destination), but is also in error, as the 163 is not, in fact, the current route index at all, but rather the index of the route where the convoy next needs to stop. The 155, which is unlabelled in the current output, is the current route index.

I have modified the debug output to correct this and also to add the actual absolute tile co-ordinates for each step.

In any event, putting that aside, we can see that the loss of synchronisation appears to have occurred at route index 156 (i.e. 155 in the debug output, which deducts 1 from the route index; it is not immediately clear why it does this: I did not write this code) at a time when the next_stop_index is 163.
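
To make the relabelling described above concrete, here is a minimal Python sketch that parses a line in the old output format and attaches the corrected meanings to the numbers. The regular expression and the dictionary keys are assumptions made for this illustration and do not come from the Simutrans code; the +1 adjustment simply follows the description above of the output deducting 1 from the route index.

```python
import re

# Pattern for the old-style debug line. The group names are illustrative labels only.
ACCEL_LINE = re.compile(
    r"calc_acceleration 1:\s*(?P<logged_index>\d+)\) at tile\s+(?P<stop_index>\d+)"
    r"\s+next limit of\s+(?P<limit>\d+) km/h,"
    r"\s*current speed\s+(?P<speed>\d+) km/h,"
    r"\s*(?P<brake>-?\d+) steps til brake,"
    r"\s*(?P<stop>-?\d+) steps til stop"
)

def parse_acceleration_line(line):
    """Return the fields of one log line with corrected labels, or None if it does not match."""
    match = ACCEL_LINE.search(line)
    if match is None:
        return None
    logged_index = int(match.group("logged_index"))
    return {
        # The unlabelled first number is the route index minus 1 (per the post above).
        "route_index": logged_index + 1,
        # The number printed after "at tile" is really the index at which to stop next.
        "next_stop_index": int(match.group("stop_index")),
        "next_limit_kmh": int(match.group("limit")),
        "speed_kmh": int(match.group("speed")),
        "steps_til_brake": int(match.group("brake")),
        "steps_til_stop": int(match.group("stop")),
    }

sample = ("Warning: convoi_t::calc_acceleration 1: 155) at tile 163 next limit of   0 km/h, "
          "current speed 110 km/h,  -227 steps til brake,  1792 steps til stop")
parsed = parse_acceleration_line(sample)
```

Read this way, the sample line describes a convoy at route index 156 whose next stop index is 163.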

156 is one of the many different checks and does not seem to be particularly noteworthy. The server logs suggest that route index 155 (recorded as 154 because of the subtraction of 1) is the first route tile at which braking occurs, and a number of further steps (each tile has 16 steps) are traversed before the loss of synchronisation recorded by A. Carlotti occurred.

However, it is interesting that the next stop index ("next_stop_index") is 163 in A. Carlotti's logs. The above tests show this to be 162. This is when the game was running in server mode. I cannot immediately find an explanation for this. The next step is to try to calculate whether this is one and the same thing as "The nature of the desync is consistent with the signal clearing on the server one hop before it cleared on the client" as reported by A. Carlotti above.

A hop is not the same as a step: a hop is the transition from one tile to another, whereas there are 16 steps in a tile. This inconsistency of the next_stop_index of 1 tile suggests that the convoys may act on the basis that the signal is in different places between client and server.

However, this may be a false signal: there may be another -1 or +1 somewhere that explains the difference in the figures without showing any actual difference between client and server.

The next step, I think, is to run the test again locally in stand alone mode and see whether I get a next_stop_index of 163 or 162.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on February 24, 2019, 02:58:32 PM
Running this on the client, the next_stop_index is recorded as 163 consistently. I note also that the semaphore signal on the bridge clears long before the train enters the bridge on the client.

Edit: The semaphore signal on the bridge also clears long before the train enters the bridge on the server.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on February 24, 2019, 09:51:33 PM
Running this test again with a debug build running as the server, I am unable to reproduce any next_stop_index in this vicinity other than 163.

Edit: Looking at the earlier posts carefully, it was next_signal that was 162. This is converted to next_stop_index in convoi_t::set_next_stop_index(), which, at line 6685, adds 1 to the number. So, unfortunately, this is not a significant result.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on February 25, 2019, 09:33:50 AM
Further testing reveals some additional complexity in this issue which is very difficult to track down.

Recent signalling fixes appear to have altered the behaviour of the train approaching the signal at 6331,1260. As briefly noted above, the trains no longer slow down for this signal: instead, they always reserve through from this signal to the station some considerable distance in advance of this signal, when they reach the previous signal a long way before the bridge.

Running a test last night, I could not reproduce the loss of synchronisation with a train departing from St. Mary Beddington (it will be recalled that this is where the loss of synchronisation would always manifest itself, when trains arrived at this station). On the server at present, the trains on this route are bunched together, and it had been the case that, every other time that any of the bunch of trains would pass through St. Mary Beddington, the game would lose synchronisation. Last night, on testing, no loss of synchronisation occurred when any of the bunch of trains passed through St. Mary Beddington on the first occasion after connecting, nor did any loss of synchronisation occur on the first train of the second bunch passing through St. Mary Beddington.

However, leaving the game running overnight demonstrated that it is still not fully stable: a loss of synchronisation had occurred by this morning.

I am trying to analyse and understand the logs posted above by A. Carlotti to try to deduce how one can calculate from these that the issue was the deceleration of the convoy at the signal near Bickstable Fields.

The actual checklists where synchronisation seems to have been lost are as follows; on the server:

Code: [Select]
server= [ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2757554096,748478609,0,0,0,0,0,0

and on the client:

Code: [Select]
client= [ss=624733 st=156183 nfc=1 rand=332093103 halt=4097 line=1025 cnvy=4097 ssr=726217411,332093103,0,0,0,0,0,0 str=1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1249187829,1778163667,726217411,3532267980,734122472,1249187829 exr=0,0,0,0,0,0,0,0 sums=2735623852,2582478221,0,0,0,0,0,0
The difference lies in the latter part, after "sums=":

Code: [Select]
sums=2757554096,748478609,0,0,0,0,0,0
on the server and

Code: [Select]
sums=2735623852,2582478221,0,0,0,0,0,0
on the client.
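
A comparison of this kind can also be done mechanically rather than by eye. The following is a rough Python sketch, assuming only that the checklists are printed as space-separated key=value fields as in the dumps above; the function names are hypothetical and the dumps are shortened here for brevity.

```python
import re

def parse_checklist(text):
    """Split a checklist dump into a dict mapping each key to a list of integers."""
    fields = {}
    for key, values in re.findall(r"(\w+)=(\d+(?:,\d+)*)", text):
        fields[key] = [int(v) for v in values.split(",")]
    return fields

def differing_fields(server_text, client_text):
    """Return {key: (server_values, client_values)} for every field that differs."""
    server = parse_checklist(server_text)
    client = parse_checklist(client_text)
    return {
        key: (server.get(key), client.get(key))
        for key in sorted(set(server) | set(client))
        if server.get(key) != client.get(key)
    }

# Shortened versions of the two dumps quoted above.
server_dump = "ss=624733 st=156183 nfc=1 rand=332093103 sums=2757554096,748478609,0,0,0,0,0,0"
client_dump = "ss=624733 st=156183 nfc=1 rand=332093103 sums=2735623852,2582478221,0,0,0,0,0,0"
mismatches = differing_fields(server_dump, client_dump)
```

For these two dumps the only key reported is "sums", which matches the manual comparison.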

These two entries that are non-zero are defined in the code as "ss" and "st", and I think were added by A. Carlotti recently. I have not yet been able to calculate exactly how these variables are populated.

A. Carlotti - are you able to assist with exactly how these work and how you were able to deduce the immediate cause of the loss of synchronisation in your tests above?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on February 28, 2019, 01:15:48 AM
Further testing has narrowed this somewhat more, although I am getting some very odd results in some places. There is definitely an inconsistency in the way in which the game handles the signal at 6331,1260 clearing, however.

In particular, with this (http://bridgewater-brunel.me.uk/saves/server-desync-signal-test-4a.sve) game, in a Visual Studio debug build, the next convoy to approach the signal at 6331,1260 (being convoy 4233) will, at route_index 100 (long before the train has reached the bridge), reserve the track up to the station and therefore not have to slow in advance of the signal, but with one of the automated builds from the server, this will not happen: the train will not reserve the track beyond that signal to the station until a few tiles before it reaches it, by which time it has slowed down.

I have also noticed anomalies between the release builds: there was one occasion when I was testing a release build when I had waited a long time with the game running for the train to get to the end of its line and return; I had then saved it in just the right place to reproduce this behaviour quickly in the future, and continued running; when I did so, the train reserved early as in the Visual Studio build. However, when I re-loaded the game that I had saved, this time, it reserved late. That saved game is the same as the one to which I link above.

This difference between builds makes this very difficult to debug using conventional methods, as I cannot follow the code path using a debugger to find out why there is ever a situation in which the signal is not cleared when the convoy reaches route_index 100; my test report above, which was based on debug builds, consistently showed the trains clearing the signal well in advance.

This sort of inconsistent behaviour is precisely the sort of thing that would cause a loss of synchronisation. However, working out what sort of error could lead to inconsistent behaviour between two different builds running with an otherwise identical dataset (and to inconsistent behaviour within one build but not the other, the inconsistent build not being the only one that I can probe to see what is happening) is complex in the extreme.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on February 28, 2019, 03:05:42 AM
Quote
The difference lies in the latter part, after "sums=":

Code: [Select]
sums=2757554096,748478609,0,0,0,0,0,0

on the server and

Code: [Select]
sums=2735623852,2582478221,0,0,0,0,0,0

on the client.

These two entries that are non-zero are defined in the code as "ss" and "st", and I think were added by A. Carlotti recently. I have not yet been able to calculate exactly how these variables are populated.

A. Carlotti - are you able to assist with exactly how these work and how you were able to deduce the immediate cause of the loss of synchronisation in your tests above?

The first number is a sum of speed mantissas (the speed being a floating point number). The second is a sum of (speed mantissa * convoy id). The combination of these two values allows us to identify the particular convoy that desynced, and what the discrepancy is.
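
To illustrate how two such sums behave, here is a minimal Python sketch. It assumes that both sums wrap as unsigned 32-bit values; the convoy ids and mantissa values are invented for the example, apart from convoy 4225 and the discrepancy of 21930244 seen in this thread.

```python
MOD = 2**32  # assumption: both sums wrap as unsigned 32-bit integers

def debug_sums(mantissas):
    """mantissas maps convoy id -> speed mantissa; returns (plain sum, id-weighted sum)."""
    total = 0
    weighted = 0
    for convoy_id, mantissa in mantissas.items():
        total = (total + mantissa) % MOD
        weighted = (weighted + mantissa * convoy_id) % MOD
    return total, weighted

# Invented data: three convoys that agree on the server and client, except one.
server = {7: 1_000_000, 4225: 3_000_000_000, 9001: 123_456}
client = dict(server)
client[4225] -= 21930244  # convoy 4225 is slower on the client

server_sums = debug_sums(server)
client_sums = debug_sums(client)
diff_total = (client_sums[0] - server_sums[0]) % MOD
diff_weighted = (client_sums[1] - server_sums[1]) % MOD
```

Because both sums are linear, a discrepancy d in a single convoy shifts the first sum by d and the second by d times the convoy id (modulo 2**32), so dividing out the first difference points at the convoy.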

I wrote the following Python function to compute these details:
Code: [Select]
def debug(servera,serverb,clienta,clientb):
    diffa=clienta-servera
    diffb=clientb-serverb
    for i in range(10000):
        if diffa*i%2**32==diffb%2**32:
            print("Convoy", i, "out of sync")
            print("Client speed mantissa is", (-diffa)%2**32, "less than server speed mantissa")
            print("If client mantissa is 0, then server mantissa is", diffa%2**32)
which gives the results:
Code: [Select]
>>> debug(2757554096,748478609,2735623852,2582478221)
Convoy 4225 out of sync
Client speed mantissa is 21930244 less than server speed mantissa
If client mantissa is 0, then server mantissa is 4273037052

The last line probably isn't particularly helpful as written, though it does effectively give the difference in the opposite direction. Note that if the number in the second line is close to 2**32, then it probably means that the client speed is a bit faster. And if the difference is close to 2**31, then that probably means that the speed is close to a power of two and the client and server mantissa are using different exponents. If testing in a game with convoy ids higher than 10000 then you'd need to increase the range to cover all possible values.
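
The "close to 2**32" rule of thumb amounts to reading the difference as a signed 32-bit number. A small sketch of that interpretation follows; the helper name is hypothetical, and the result is only meaningful when a single convoy differs.

```python
def signed_difference(client_sum, server_sum, bits=32):
    """Client minus server, folded into the range [-2**(bits-1), 2**(bits-1))."""
    modulus = 1 << bits
    diff = (client_sum - server_sum) % modulus
    if diff >= modulus // 2:
        # Values in the upper half of the range represent small negative differences.
        diff -= modulus
    return diff

# First "sums" entries from the checklists in this thread.
delta = signed_difference(2735623852, 2757554096)
```

A negative result means that the client value is lower than the server value; a positive one means that the client value is higher.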

In my testing, it seemed that the discrepancies tended to take one of a small number of values (with one value being particularly common).
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on February 28, 2019, 10:46:37 AM
Thank you for the information. This seems potentially consistent with the train slowing for the signal on the server and not on the client. I will have to carry out further extensive testing to narrow down which parts of the code are invoked on the Visual Studio debug build but not on the optimised release build, but that may take some time.
Edit: I have pushed a small change to a part of the code that may be implicated in this, replacing the built-in Simutrans min() with std::min(), as I am aware that the built-in min/max has given trouble in the past. I do not know whether this will help, but it may be worth re-testing with to-morrow's nightly build.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 02, 2019, 12:09:29 AM
Further testing shows that the min/max changes do not affect the incorrect differentiation between debug and release builds regarding when the signal clears. My time is very limited over the next week, so I am not likely to be able to make progress until about this time next week.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 03, 2019, 02:36:59 PM
I have added some temporary debugging output to aid testing. The divergence appears to occur in rail_vehicle_t::block_reserver() not in rail_vehicle_t::can_enter_tile().

It seems that block_reserver() is called at the same point (at route_index 100) on both debug and release builds; however, on the release build, it returns with a next_signal of 162 reserving only to the semaphore signal on the bridge, whereas on the debug build, it returns with a next_signal of 65530 (a special number denoting that the train need prepare to stop at no signal before the end of its route) and reserves all the way to the end of its route, the far end of the station platform.

I have added debugging output to places in the code in block_reserver() where the reservation might terminate, but the ones to which I have added the code are triggered on neither debug nor release builds.

Further exploration as to the origin of this divergence is required.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: prissi on March 08, 2019, 01:53:20 PM
MSVC Debug builds initialise values to non-zero (often 0xCACA or so), while many variables allocated in releases are often either 0 (when freshly allocated) or random values (when taken from the stack).

The other thing is that debug builds often have some padding around arrays (to catch overruns), while this may be lacking in release builds. So maybe there is some structure overrun. Just some input for further thoughts.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 08, 2019, 02:32:16 PM
Thank you for your thoughts: that is helpful.

The current pattern of different behaviour seems to be deterministic (i.e., it always behaves in one way in the release build and always another way in the debug build), which tends to suggest that it is probably not an uninitialised variable or writing beyond the end of an array; I am continuing to narrow down in the code where the divergence occurs, but this is a very slow process since I cannot use a debugger to check both builds.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 08, 2019, 11:56:08 PM
Further testing shows that the anomaly seems to be related to the treatment of combined signals and whether they are added to the list of pre-signals. More testing with updated testing code will be necessary before this can be narrowed down further.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Vladki on March 09, 2019, 10:19:16 AM
Quote
Further testing shows that the anomaly seems to be related to the treatment of combined signals and whether they are added to the list of pre-signals. More testing with updated testing code will be necessary before this can be narrowed down further.
Would it be helpful if I try using combined signals on the other server game?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 09, 2019, 11:43:10 AM
I think at this stage the best thing to do is for me to track down the divergence with the existing known setup, but thank you for offering.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 09, 2019, 12:08:02 PM
I have now narrowed down the release/debug divergence as first occurring at this line in simvehicle.cc:

Code: [Select]
if(sb && sb->can_add_signal(signal) && !directional_only)

On the debug build, this line evaluates to true at the critical point, whereas on the release build, it evaluates to false.

This is very odd, as nothing there appears to be the sort of thing that one would expect to behave differently between builds.

There are three elements to this: whether the variable "sb" (a pointer to a signalbox) is NULL or not, whether sb->can_add_signal(signal) is true, and whether directional_only is false.

We can ignore the latter, as this will always be false in this case (which we can confirm by noting that the reservation colour is red and not blue on both builds).

The sb pointer comes from the following code a few lines up:

Code: [Select]
const signalbox_t* sb = NULL;
const grund_t* gr_signalbox = welt->lookup(signal->get_signalbox());
if(gr_signalbox)
{
   const gebaeude_t* gb = gr_signalbox->get_building();
   if(gb && gb->get_tile()->get_desc()->is_signalbox())
   {
      sb = (signalbox_t*)gb;
   }
}

The can_add_signal function is as follows:

Code: [Select]
bool signalbox_t::can_add_signal(const signal_t* s) const
{
   if(!s || (s->get_owner() != get_owner()))
   {
      return false;
   }

   return can_add_signal(s->get_desc());
}

The last call is not recursive, but calls an overloaded version, which is here:

Code: [Select]
bool signalbox_t::can_add_signal(const roadsign_desc_t* d) const
{
   uint32 group = d->get_signal_group();

   if(group) // A signal with a group of 0 needs no signalbox and does not work with signalboxes
   {
      uint32 my_groups = get_first_tile()->get_tile()->get_desc()->get_clusters();
      if(my_groups & group)
      {
         // The signals form part of a matching group: allow addition
         return true;
      }
   }
   return false;
}

None of those functions look suspicious for a loss of synchronisation between different builds (or between different instances of the same build, as on the server game), so it is rather perplexing how this arises.

My next test is to add debugging output to the release build to show whether sb is true and, if it is, what its co-ordinates are on the release build, which should narrow down which one of these two pieces of code is the culprit.
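
For readers less familiar with the bitmask test in the overloaded can_add_signal quoted above, here is a simplified Python model of it. The bit assignments are hypothetical and chosen only for illustration; the real group and cluster values come from the pakset.

```python
def can_add_signal(signal_group, signalbox_clusters):
    """Python model of the overloaded check; both arguments are bitmasks."""
    if signal_group == 0:
        # A signal with a group of 0 needs no signalbox and does not work with signalboxes.
        return False
    # The signal can be added if it shares at least one bit with the signalbox's clusters.
    return (signalbox_clusters & signal_group) != 0

# Hypothetical bit assignments, for illustration only.
MECHANICAL = 0b001
TRACK_CIRCUIT = 0b010
POWER = 0b100

mechanical_box = MECHANICAL | TRACK_CIRCUIT
```

Given the same two inputs this always gives the same answer, which is consistent with the observation that the function itself does not look like a source of divergence.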
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Vladki on March 09, 2019, 01:24:30 PM
Just a logical question: why does simvehicle check whether some signal can be added to some signalbox? That should be checked only when a signal is built or reconnected to a new signalbox.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 09, 2019, 01:31:33 PM
The function of this test is to check whether a combined signal is close enough and compatible with the next signalbox: the idea is that a combined signal must be able to be controlled by both the signalbox to which it is attached and the next signalbox along the line.



In any event, I have narrowed this down further: in the release build, the "sb" variable is NULL, which is why the line

Code: [Select]
if(sb && sb->can_add_signal(signal) && !directional_only)

evaluates to false.

The divergence, therefore, must arise in the following code, although I must confess I am currently struggling to see how:

Code: [Select]
const signalbox_t* sb = NULL;
const grund_t* gr_signalbox = welt->lookup(signal->get_signalbox());
if(gr_signalbox)
{
   const gebaeude_t* gb = gr_signalbox->get_building();
   if(gb && gb->get_tile()->get_desc()->is_signalbox())
   {
      sb = (signalbox_t*)gb;
   }
}

Edit: This line refers to the signalbox at 6345,1259 which is the signalbox controlling the signal on the bridge. On the release build, clicking on this signal reveals that it has a signalbox of "none" (which is incorrect), whereas clicking on this signal on the debug version correctly shows it connected to this signalbox.

Edit 2: Further testing reveals some more details, some very odd. The problem appears to be in this line, which is replicated in the UI output described above:

Code: [Select]
const gebaeude_t* gb = gr_signalbox->get_building();

The coordinate for the signalbox is correctly saved and loaded, but, on some occasions, this succeeds and on some occasions this fails to return a building.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ranran on March 09, 2019, 02:31:37 PM
Is it related whether signalbox is multiple tiles?
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 09, 2019, 02:35:59 PM
The issue appears to be that the signalbox in question is actually beneath a viaduct and on that viaduct is a station platform. The code for finding the building is:

Code: [Select]
signalbox_t* grund_t::get_signalbox() const
{
   return dynamic_cast<signalbox_t *>(first_obj());
}

gebaeude_t *grund_t::get_building() const
{
   gebaeude_t *gb = find<gebaeude_t>();
   if (gb) {
      return gb;
   }

   gb = get_signalbox();
   if(gb)
   {
      return gb;
   }

   return get_depot();
}

The critical section appears to be:

Code: [Select]
return dynamic_cast<signalbox_t *>(first_obj());

This line was copied exactly from the get_depot() method from Standard:

Code: [Select]
depot_t* grund_t::get_depot() const
{
   return dynamic_cast<depot_t *>(first_obj());
}

In each case, what appears to happen is that the object that is returned is the first object in the object list. However, the sequence in which objects are stored in the object list appears to be non-deterministic in release builds. So, if there is more than one object on the tile, it is non-deterministic whether get_signalbox() will, in fact, return a signalbox or not, even if one of those objects is a signalbox. I wonder whether the same problem may apply to depots.

I will experiment with changing the code thus:

Code: [Select]
signalbox_t* grund_t::get_signalbox() const
{
   return dynamic_cast<signalbox_t *>(suche_obj(obj_t::signalbox));
}

although I worry that this might affect performance.
Edit: To answer Ranran's question, all of the signalboxes in question are on single tiles, so I do not believe that this is relevant.
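
The order dependence described above, and the effect of searching by type instead, can be modelled in a few lines of Python. This is only a simplified model and not the Simutrans object list: the class names are stand-ins, and Platform is a hypothetical example of another object sharing the tile.

```python
class Obj:
    """Stand-in for a generic object on a tile."""

class Signalbox(Obj):
    """Stand-in for a signalbox."""

class Platform(Obj):
    """Hypothetical other object sharing the tile."""

def signalbox_from_first(objects):
    """Mimics casting only the first object in the list."""
    first = objects[0] if objects else None
    return first if isinstance(first, Signalbox) else None

def signalbox_by_search(objects):
    """Mimics searching the whole list for an object of the wanted type."""
    for obj in objects:
        if isinstance(obj, Signalbox):
            return obj
    return None

box, platform = Signalbox(), Platform()
order_a = [box, platform]   # signalbox happens to be stored first
order_b = [platform, box]   # same objects, different storage order
```

With order_a both lookups find the signalbox, but with order_b the first-object lookup returns nothing whilst the search still finds it, so only the search is independent of the storage order.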
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 09, 2019, 03:40:37 PM
Initial tests of this fix are promising: the modified/fixed version is now running on the Bridgewater-Brunel server, and I am logged in running a stability test. Feel free to log in to join in the testing.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Phystam on March 09, 2019, 05:02:03 PM
I am now in the server. I will try to keep connecting the server during night.

edit:
A desync happened during the night.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 09, 2019, 05:38:37 PM
Splendid, thank you.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 09, 2019, 05:54:53 PM
I have been connected to the server constantly since my last post on this thread with no loss of synchronisation.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on March 09, 2019, 08:17:45 PM
This is great news, James!
I will attempt to keep logged on for as long as possible too.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Ves on March 10, 2019, 12:01:10 AM
Great news, James: not a single desync for 3.5 hours!

Title: Re: Instability on the Bridgewater-Brunel server
Post by: Phystam on March 10, 2019, 02:20:58 AM
Is the server revision #8076566?
I think the latest revision is #fff2765...

Anyway, I was able to stay connected for at least 1 hour without a desync.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 10, 2019, 02:49:36 AM
I have left my computer running for many hours whilst I was in the shed doing something else, and it had lost synchronisation on my return. However, it had been connected successfully for a very long time.

I should note, incidentally, that you will all lose synchronisation at around 0600h GMT when the server resets for its nightly rebuild.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Phystam on March 10, 2019, 03:32:53 AM
Thank you for clarifying this.
Now 2 hours, still connected.

edit:
3 hours. very stable.

edit2:
4 hours.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: jamespetts on March 10, 2019, 10:57:07 AM
Thank you all very much for your testing, that is most helpful. I think that it is now safe to conclude that this particular problem has been solved. I will now revert the server to its status before the problem occurred. I suggest that people re-set their passwords swiftly after I do this. Once this has been done, I will post in the server's main thread.
Title: Re: Instability on the Bridgewater-Brunel server
Post by: Phystam on March 10, 2019, 11:06:38 AM
Thank you, no desync has occurred since the server was reset!
Title: Re: Instability on the Bridgewater-Brunel server
Post by: ACarlotti on April 23, 2019, 01:11:49 AM
I discovered the underlying bug, which was causing signalboxes to be sorted in objlist with an inconsistent priority (due to undefined behaviour). The fix is discussed here (https://forum.simutrans.com/index.php/topic,18940.0.html).