Instability on the Bridgewater-Brunel server

Started by DrSuperGood, September 06, 2018, 03:21:35 PM

jamespetts

#35
Casting to volatile is insane - however, it is uncertain whether this is the cause of the trouble. It will be very difficult to fix if I am not able to reproduce it locally, as there will be no way of testing whether any given change makes a difference.

It would be very useful to know whether anyone is able to connect to the server on a Linux computer to test whether the synchronisation issue is limited to Windows clients connecting to the Linux server, since I cannot reproduce it with a Windows client connecting to a Windows server.
As to the potentially uninitialised local variables, if they were actually uninitialised, they would be detected either in Dr. Memory or using the Visual Studio runtime checks. There are lots of local variables which are not initialised on declaration but are initialised later.
Edit: Looking at the "volatile" part of the code, it is unclear what could simultaneously alter the output of gr->get_top(), as the actual movement of vehicles is not multi-threaded. I have experimentally removed all casts to volatile - it does not seem to cause any instability on a basic map, and it would be interesting to see whether there is any difference when this is pushed to the server. However, it is difficult to see what difference this could make.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

DrSuperGood

#36
Quote: Casting to volatile is insane - however, it is uncertain whether this is the cause of the trouble. It will be very difficult to fix if I am not able to reproduce it locally, as there will be no way of testing whether any given change makes a difference.
It is literally utter nonsense.
Quote: The compiler detected a cast to an r-value type which is qualified with volatile, or a cast of an r-value type to some type that is qualified with volatile. According to the C standard (6.5.3), properties associated with qualified types are meaningful only for l-value expressions.
The volatile keyword is being completely ignored, as it makes no syntactic sense there.
https://en.cppreference.com/w/cpp/language/cv
https://en.cppreference.com/w/cpp/language/value_category
Quote: Edit: Looking at the "volatile" part of the code, it is unclear what could simultaneously alter the output of gr->get_top(), as the actual movement of vehicles is not multi-threaded. I have experimentally removed all casts to volatile - it does not seem to cause any instability on a basic map, and it would be interesting to see whether there is any difference when this is pushed to the server. However, it is difficult to see what difference this could make.
It will make no difference, and that is what is concerning. Someone explicitly placed those type casts there expecting them to do something. They have not been doing anything as they make no sense. This means that the problem that the author considered solved by using them still exists. I suggest trying to track down the author to explain their intended purpose.
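
To illustrate why such a cast has no effect, here is a minimal sketch. The Ground struct and read_top_with_cast() are hypothetical stand-ins and not the actual Simutrans code; the only thing relied upon is the standard C++ rule that a cast to a volatile-qualified scalar type produces a prvalue, and cv-qualifiers are dropped from non-class prvalues.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a ground tile with a const getter,
// loosely modelled on the gr->get_top() call discussed above.
struct Ground {
    unsigned char top;
    unsigned char get_top() const { return top; }
};

// The cast expression is a prvalue of non-class type, so its type is
// adjusted to plain unsigned char: the volatile qualifier is dropped.
static_assert(
    std::is_same<
        decltype(static_cast<volatile unsigned char>(
            std::declval<const Ground&>().get_top())),
        unsigned char>::value,
    "volatile is ignored on a cast that produces a prvalue");

// Behaves exactly like calling g.get_top() directly.
unsigned char read_top_with_cast(const Ground& g) {
    return static_cast<volatile unsigned char>(g.get_top());
}

// Small self-check showing that the cast changes nothing observable.
void run_volatile_cast_example() {
    const Ground g{7};
    assert(read_top_with_cast(g) == g.get_top());
}
```

Some compilers can warn about this pattern (GCC, for instance, has -Wignored-qualifiers), which may be a quick way to locate any remaining casts of this kind.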

Also worth pointing out the following line from the C++ standard (I think? or at least derived from that)...
Quote: (This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order).
Quote: Within a thread of execution, accesses (reads and writes) through volatile glvalues cannot be reordered past observable side-effects (including other volatile accesses) that are sequenced-before or sequenced-after within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization.
In addition, volatile accesses are not atomic (concurrent read and write is a data race) and do not order memory (non-volatile memory accesses may be freely reordered around the volatile access).
One notable exception is Visual Studio, where, with default settings, every volatile write has release semantics and every volatile read has acquire semantics (MSDN), and thus volatiles may be used for inter-thread synchronization. Standard volatile semantics are not applicable to multithreaded programming, although they are sufficient for e.g. communication with a std::signal handler that runs in the same thread when applied to sig_atomic_t variables.
Hence one should not be using them to synchronise values at all, except with signal handlers (pseudo interrupts?). Sure, they work for inter-thread synchronisation in MSVC, but possibly not in GCC.
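
For contrast, here is a minimal sketch of the one use the quoted text does endorse: a volatile std::sig_atomic_t flag set by a signal handler running in the same thread. The names g_signal_seen, on_signal and raise_and_observe are made up for illustration; for communication between threads, std::atomic would be the standard tool rather than volatile.

```cpp
#include <cassert>
#include <csignal>

// Flag shared between the handler and the code it interrupts.
// volatile std::sig_atomic_t is the type the standard allows for this.
volatile std::sig_atomic_t g_signal_seen = 0;

// Signal handler: only sets the flag.
extern "C" void on_signal(int) {
    g_signal_seen = 1;
}

// Installs the handler, raises the signal in this thread, restores the
// previous handler, and reports whether the handler ran.
bool raise_and_observe() {
    g_signal_seen = 0;
    void (*previous)(int) = std::signal(SIGTERM, on_signal);
    assert(previous != SIG_ERR);
    std::raise(SIGTERM);             // handler runs before raise() returns
    std::signal(SIGTERM, previous);  // put the old disposition back
    return g_signal_seen == 1;
}
```

Calling raise_and_observe() should return true, since the handler runs synchronously in the calling thread.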

jamespetts

I am not sure why it would have been thought necessary to use them here - they are being used on a getter method from Standard. The method is declared as const. Only the main thread should ever be changing the variable that this getter method accesses.

prissi

When reading through the multitile city building code, I found a remark about floating point calculation. Is this still in? That might also be a cause of desyncs.
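
As a rough illustration of why floating point is a usual suspect for desyncs: floating point addition is not associative, so two builds that evaluate the same expression in a different order (or with different intermediate precision) can end up with different results, whereas integer or fixed-point arithmetic gives the same answer either way. This is only a sketch with made-up function names, assuming ordinary IEEE-754 doubles; it is not taken from the Simutrans code.

```cpp
#include <cassert>
#include <cstdint>

// The same three values summed in two different orders.
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }

// Fixed-point equivalent: values stored as integer thousandths.
std::int64_t fixed_sum_left(std::int64_t a, std::int64_t b, std::int64_t c) {
    return (a + b) + c;
}
std::int64_t fixed_sum_right(std::int64_t a, std::int64_t b, std::int64_t c) {
    return a + (b + c);
}

// Integer sums do not depend on evaluation order (barring overflow).
void run_rounding_example() {
    assert(fixed_sum_left(100, 200, 300) == fixed_sum_right(100, 200, 300));
}
```

With IEEE-754 doubles, sum_left(0.1, 0.2, 0.3) and sum_right(0.1, 0.2, 0.3) differ in the last bit, which is enough for a client and a server to drift apart if such a value feeds into the game state.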

jamespetts

#39
Quote from: prissi on September 29, 2018, 12:21:43 PM
When reading through the multitile city building code, I found a remark about floating point calculation. Is this still in? That might also be a cause of desyncs.

I cannot find any relevant floating point code (i.e., code that is not used only in UI or on initial map creation) in simcity.cc, simhalt.cc or gebaeude.cc or their respective header files; I am not sure to what this is referring. I imagine that, if this had been in the city growth code, this desync would have emerged a long time ago.

I have still been unable to reproduce this locally. I tried both with the latest GCC build from the Bridgewater-Brunel server and the latest saved game, and with a Visual Studio release build, and I could not reproduce any desync when the same build was client and server.

This does suggest a desync of the sort where the platform makes a difference - but it is now extremely difficult to test whether this can be reproduced in older versions of either the code or saved game, as I am unable to reproduce this locally.

Again, it would be exceedingly helpful if anyone were able to test whether the desync can be reproduced when connecting from a Linux client to the Bridgewater-Brunel server.

Edit: Incidentally, testing with the Stephenson-Siemens server shows no desync.

jamespetts

I am in the process of running a further test: I have temporarily reverted the server to an older saved game from the in-game year of 1937 to see whether clients can stay synchronised with the game with this save. The purpose of the test is to see whether the desync is caused by something new in the code in the last few weeks, or by changes in the saved game itself.

I have kept a backup of the latest saved game both on the server and on my own computer so that this can be restored when the testing is complete.

However, my own testing has been disrupted badly by continuing hardware problems on my own computer. It would therefore be very helpful if anyone could test this by trying to connect and see whether you can stay connected for at least 20 minutes without losing synchronisation.

DrSuperGood

Been connected to the server for >40 minutes, no out of sync, multiple save/load cycles. Something changed with the game state since that save was made.

SuperTimo

Yeah I was also connected for around 30mins with no de-sync during the time I was connected. Also that save is before the great accidental tunnelling incident of 1939 so I am very much in favour of reverting to an earlier save.

Ves

I have been on the server now for 50 minutes without any issues.
This save is before I started doing any airtraffic at all.

jamespetts

Thank you all for your feedback: that is most helpful. I am not planning to revert to an earlier save permanently: just for testing, as it is necessary to fix the problem that causes these desyncs in the code, not just try to avoid it in the game.

I note that I do not have a later saved game other than the one backed up from 1940, so I will not be able to try lots of individual tests between the two relevant points.

Dr. Supergood - you mention air traffic. I see that there was already an extensive air network in 1937, so it is doubtful that the issue is general to air traffic. Is there any particular feature of your air transport network which did not exist in anyone's air transport network in 1937?

Has anyone else used any additional features since 1937 that were not used in that year?

Ves

I believe I wrote about air traffic :-P

I used the parallel stop mode on my airports, and I believe I almost doubled the number of air routes on the server. My routes went to my own airports but also a lot to other players' airports. The ground connections to my airports were often handled by other players with both buses (with the parallel stop mode) and trains.

DrSuperGood

Quote: Dr. Supergood - you mention air traffic. I see that there was already an extensive air network in 1937, so it is doubtful that the issue is general to air traffic. Is there any particular feature of your air transport network which did not exist in anyone's air transport network in 1937?
You mean Ves?

One of the large airports not owned by me was constantly getting jammed due to a bug involving reserving runways when connected directly to a dock (so a convoy loads, and then immediately moves onto the runway). Someone rebuilt it.

My air network is largely unchanged, except for maybe a change in rolling stock.
Quote: I used the parallel stop mode on my airports, and I believe I almost doubled the number of air routes on the server. My routes went to my own airports but also a lot to other players' airports. The ground connections to my airports were often handled by other players with both buses (with the parallel stop mode) and trains.
Could easily be the parallel stop. I did not use such things at all.

Ves

Quote: Could easily be the parallel stop. I did not use such things at all.
We could try to bog the server down by building that stop mode extensively, as a test....?

Rollmaterial

Could it have something to do with other players' convoys using those stops?

Ves

We are trying to do exactly that on the server right now. I have ruined a lot of other players' networks by placing parallel stop mode underneath their stops (which should not be possible...) and we are now running a loop on the server with different convoys and stops.


DrSuperGood

No desync here, and I was connected at the same time as you. I think you just went out of sync (OoS) due to the long-standing change-schedule OoS bug that exists even in Standard.

Ves

Hmmm ok, did everybody usually get kicked out previously?

DrSuperGood

#53
Previously it was so unstable that it was impossible for anyone to remain connected for any reasonable length of time. I am guessing that clients were booted the instant a checksum check was performed.

In any case our tests are going nowhere slowly. We tried parallel park mode with heavy traffic. We tried one way airports on another server. No OoS.

jamespetts

Splendid, thank you all for the testing work, and apologies for the Ves/Dr. Supergood confusion. It would be very useful to know what feature(s) are found to trigger the desyncs. Do make sure to test one feature at a time (although I am sure that you are doing this anyway).

Unfortunately, the fact that this is a problem emerging with new ways of playing/feature use with the existing code will make finding the problem much harder than it would be if it had been introduced very recently, especially if it is not confined to the OTRP. This might take quite a lot of extensive work before progress can be made in other areas.

Dr. Supergood - you mention a bug relating to runways and a dock - I do not believe that this has been reported independently. Would you be able to post a fresh bug report thread for this?
Edit: My apologies: I posted the above without having realised that there was a whole second page of comments beyond my last message.
Ves - were any other people online when you had the desync? The game had been quite stable aside from the out of memory issues before this new issue arose. The 1940 map will lose synchronisation after 1-2 minutes, I believe.
Another possibility is to re-load the 1940 map and remove features, but that might be much harder because people will be kicked before they can do much of that.

DrSuperGood

#55
I assume the desync is a combination of RNG seed mismatch and possibly convoy handle mismatch (as the result of RNG seed mismatch creating different private cars)?

One might have to do some crazy code modification to find the fault. For example dumping the RNG state after every stage of every step. The idea would be that one could compare what the server dumps with what the client dumps to find out where/when the OoS occurs. Once the system causing the OoS is found it should either be possible to track it down directly, or the listings can be refined further to expose details on where during the processing the OoS is occurring.

I think we have reached the end of meaningful testing with the reverted server game. I was connected without OoS for over 2 hours, including lots of road traffic across a variety of single directional roads with parallel parking.
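
The dump-and-compare idea could be sketched roughly as follows. Everything here is hypothetical: StageRecord, the Lcg generator, the stage names and simulate_step() are illustrative stand-ins and not the existing Simutrans checklist or RNG code. The point is only that, given one log from the server and one from the client, the first mismatching entry names the stage in which the two diverged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical log entry: RNG state recorded after one stage of a step.
struct StageRecord {
    std::string stage;
    std::uint32_t rng_state;
};

// Tiny linear congruential generator standing in for the game's RNG.
struct Lcg {
    std::uint32_t state;
    std::uint32_t next() {
        state = state * 1664525u + 1013904223u;
        return state;
    }
};

// Runs one simulated step and logs the RNG state after each stage.
// buggy_stage >= 0 makes that stage draw one extra random number,
// imitating a platform-dependent code path; -1 means no bug.
std::vector<StageRecord> simulate_step(std::uint32_t seed, int buggy_stage) {
    static const char* const stages[] = {"roads", "convoys", "cities", "halts"};
    Lcg rng{seed};
    std::vector<StageRecord> log;
    for (int i = 0; i < 4; ++i) {
        rng.next();
        if (i == buggy_stage) {
            rng.next();  // the extra draw that causes the divergence
        }
        log.push_back(StageRecord{stages[i], rng.state});
    }
    return log;
}

// Returns the index of the first stage whose RNG state differs,
// or -1 if the two logs agree everywhere.
int first_divergence(const std::vector<StageRecord>& server,
                     const std::vector<StageRecord>& client) {
    assert(server.size() == client.size());
    for (std::size_t i = 0; i < server.size(); ++i) {
        if (server[i].rng_state != client[i].rng_state) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For example, first_divergence(simulate_step(42, -1), simulate_step(42, 2)) points at index 2, the "cities" stage in this toy setup; in a real hunt one would then add finer-grained records inside that stage and repeat.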

TurfIt

Quote from: DrSuperGood on September 30, 2018, 02:59:52 PM
One might have to do some crazy code modification to find the fault. For example dumping the RNG state after every stage of every step. The idea would be that one could compare what the server dumps with what the client dumps to find out where/when the OoS occurs. Once the system causing the OoS is found it should either be possible to track it down directly, or the listings can be refined further to expose details on where during the processing the OoS is occurring.

That code is exactly what's already there in the checklists - it was never removed after the last major desync hunt a few years ago; Really should have been simply for performance reasons...
Dumping after every stage is too much, instead find the major/minor system and keep diving deeper into them (refining as you say). But, expect many many false roads.
And this assumes you have a really reliable desync. Last time it would desync in the range of 5-15 mins, and it ended up taking a huge number of hours to finally track down.

jamespetts

Thank you very much for testing. I am now reverting to the 1940 saved game from yesterday. I am afraid that this issue is likely to take a very long time to resolve, especially since even reproducing the desync is very difficult.

Has anyone any idea about what is different between the 1937 map and the 1940 map such as might cause this?

SuperTimo

#58
During that time I added two large Trolley Bus routes, perhaps a bug with trolleys? I also created 3 bus terminals that used one way roads, one way signs and no-entry signs which I have removed to see if this has any effect.

edit: as expected this made no difference at all.

jamespetts

#59
Thank you for that: that is helpful.

I was hoping to try to connect with my Linux computer this evening, but, on attempting to compile, I get the following errors:


===> LD  simutrans/simutrans-extended
/usr/bin/ld: build/default/squirrel/sq_extensions.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqclass.o: relocation R_X86_64_32S against symbol `_ZN7SQClass7ReleaseEv' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqdebug.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqlexer.o: relocation R_X86_64_32S against symbol `_ZN7SQTable7ReleaseEv' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqobject.o: relocation R_X86_64_32S against symbol `_ZN15SQFunctionProto4MarkEPP13SQCollectable' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqtable.o: relocation R_X86_64_32S against symbol `_ZTV7SQTable' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqbaselib.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqcompiler.o: relocation R_X86_64_32 against symbol `_ZN10SQCompiler10ThrowErrorEPvPKc' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqfuncstate.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqstate.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/squirrel/sqvm.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdaux.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdio.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdrex.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdstring.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdblob.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdmath.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdstream.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: build/default/squirrel/sqstdlib/sqstdsystem.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output


This all arises in code that I know nothing about and that I believe has been the subject of some merging from Standard lately. If anyone can assist as to what might be the cause of this, this would be most helpful. I note that this does not seem to have affected compiling on the server.
Edit: Using the executable downloaded from the Bridgewater-Brunel server, I have been able to remain connected using a Linux client for a considerable time, over one in-game hour, and across the month boundary into December 1940. This does suggest that this may be one of those exceptionally difficult to find bugs in which Windows and Linux executables diverge in some extremely subtle way. Quite why this arises only in 1940 is hard to understand at this stage.

DrSuperGood

Since GCC is used one can rule out some causes. Most likely causes would be multi threading (Linux kernel runs threads differently from Windows) and memory referencing (different virtual memory allocations).
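
One way in which "memory referencing" can make two platforms diverge is when the order of processing depends on object addresses, for example by iterating over a container sorted by pointer value. The sketch below is purely illustrative (Vehicle and processing_order are invented names, and nothing here is known to be the cause of this particular desync); it shows ordering by a stable ID instead, which gives the same sequence regardless of where the allocator happened to place the objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical game object with an ID that is identical on all machines.
struct Vehicle {
    std::uint32_t id;
};

// Returns the IDs in the order the vehicles should be processed.
// Sorting by id (never by address) keeps the order identical on every
// client, whatever the memory layout of the process.
std::vector<std::uint32_t> processing_order(std::vector<const Vehicle*> vehicles) {
    for (const Vehicle* v : vehicles) {
        assert(v != nullptr);
    }
    std::sort(vehicles.begin(), vehicles.end(),
              [](const Vehicle* a, const Vehicle* b) { return a->id < b->id; });
    std::vector<std::uint32_t> ids;
    ids.reserve(vehicles.size());
    for (const Vehicle* v : vehicles) {
        ids.push_back(v->id);
    }
    return ids;
}
```

Had the comparison been a < b on the pointers themselves, the resulting order would depend on the virtual memory layout and could legitimately differ between a Windows client and a Linux server.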

ACarlotti

Quote from: jamespetts on October 01, 2018, 06:42:45 PM
I was hoping to try to connect with my Linux computer this evening, but, on attempting to compile, I get the following errors:
I have transferred virtually no changes to that code recently, although it is possible that changes elsewhere (e.g. to a Makefile) could be relevant.

When did you last successfully compile on that system?
What method of compilation are you using?
Have you tried producing a clean build?

SuperTimo

Do we have a server save from closer to when the de-sync started happening? I think it was okay until about mid 1939.

The server seemed okay until there was a fatal error (see here: https://forum.simutrans.com/index.php/topic,17611.msg175690.html#msg175690) that was fixed with the nightly build that came out on 21/09/2018. Since then there seem to be serious de-sync issues. Could it be that, while this bug no longer causes a crash, it is still causing the server issues?

I did notice that several players allowed access to their networks to others around this time, would this have any impact on the server?

jamespetts

I have checked - the best that I have is September and December 1939. The fix itself would not cause a desync, as this is a simple test as to whether a pointer has a NULL value before dereferencing that pointer, but it is possible that whatever caused the crash is also somehow responsible for the desync. This crash occurred in the OTRP code (i.e., the new overtaking code). This code had been in the game for a long time before 1939 and it had evidently been used before that time, so this is perhaps not the most promising avenue.

Can anyone give me any idea as to which of the overtaking features were used before and after 1939 in the game?

DrSuperGood

The crash was occurring on a stretch of road that was using classic driving style reliably for around 100 years... I think the road even existed before the feature was merged in.

jamespetts

Quote from: DrSuperGood on October 02, 2018, 08:46:26 PM
The crash was occurring on a stretch of road that was using classic driving style reliably for around 100 years... I think the road even existed before the feature was merged in.

Interesting - how were you able to identify what stretch of road was causing the crash?

DrSuperGood

#66
Quote: Interesting - how were you able to identify what stretch of road was causing the crash?
Stack trace. I loaded the server game in a debug build attached to MSVC. Looked at the vehicle that was causing the crash. Was one of my old post horses. Could also resolve the coordinates. Look at the thread for details.

jamespetts

I am now running a series of tests on the server game. I have started by using the saved game from September 1939: this does indeed lose synchronisation, so whatever the difference is that is causing the error to be exposed in later games occurred between 1937 and 1939.

jamespetts

The result of the first test is that, even with all aircraft removed, the loss of synchronisation still occurs after a few minutes connected.

jamespetts

The result of the second test is that, even with a saved game in which all the roads had been set to the standard "two way" overtaking mode on loading using a specially modified executable (as well as all the aircraft removed), the client still lost sync with the server after a few minutes.