News:

The Forum Rules and Guidelines
Our forum has Rules and Guidelines. Please, be kind and read them ;).

Online Game with SE

Started by steffen, March 06, 2011, 05:50:39 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

prissi

It is most likely routing. Wasn't there a passenger go away feature also implemented? In this case it could be there.

But I think routing, as there may be many circumstances with indentical comparisons and maybe some random element like pointer address is used to break the tie. This seems most likely to me. Or there are some hashtables with pointers involved.

jamespetts

#36
The latest tests reveal that removing the pakset with private cars ("citycars") on both server and client does not seem to prevent the desyncs (although there is some inconclusive suggestion from my research so far that it might diminish the frequency).

However, I have come upon a very odd divergence of code-paths, relating just to the GUI, and therefore not causative of desyncs in itself, but might point the way to the ultimate problem. I have mentioned it before in passing, but I have been looking into it more, and it is very strange. It arises in the following method in simplay.cc:


/**
* Rückruf, um uns zu informieren, dass ein Vehikel ein Problem hat
* @author Hansjörg Malthaner
* @date 26-Nov-2001
*/
void spieler_t::bescheid_vehikel_problem(convoihandle_t cnv,const koord3d ziel)
{
switch(cnv->get_state()) {

case convoi_t::NO_ROUTE:
DBG_MESSAGE("spieler_t::bescheid_vehikel_problem","Vehicle %s can't find a route to (%i,%i)!", cnv->get_name(),ziel.x,ziel.y);
if(this==welt->get_active_player()) {
cbuffer_t buf(320);
buf.printf( translator::translate("Vehicle %s can't find a route!"), cnv->get_name());
uint32 max_weight = cnv->get_route()->get_max_weight();
uint32 cnv_weight = cnv->get_heaviest_vehicle();
if (cnv_weight > max_weight) {
buf.printf(" ");
buf.printf(translator::translate("Vehicle weighs %dt, but max weight is %dt"), cnv_weight, max_weight);
}
welt->get_message()->add_message( (const char *)buf, cnv->get_pos().get_2d(), message_t::problems, PLAYER_FLAG | player_nr, cnv->front()->get_basis_bild());
}
break;

case convoi_t::WAITING_FOR_CLEARANCE_ONE_MONTH:
case convoi_t::CAN_START_ONE_MONTH:
case convoi_t::CAN_START_TWO_MONTHS:
DBG_MESSAGE("spieler_t::bescheid_vehikel_problem","Vehicle %s stucked!", cnv->get_name(),ziel.x,ziel.y);
{
cbuffer_t buf(256);
buf.printf( translator::translate("Vehicle %s is stucked!"), cnv->get_name());
welt->get_message()->add_message( (const char *)buf, cnv->get_pos().get_2d(), message_t::warnings, PLAYER_FLAG | player_nr, cnv->front()->get_basis_bild());
}
break;

default:
DBG_MESSAGE("spieler_t::bescheid_vehikel_problem","Vehicle %s, state %i!", cnv->get_name(), cnv->get_state());
}
}


Where a vehicle is overweight, and therefore cannot route, in Windows, the "Vehicle weighs %dt, but max weight is %dt" message is correctly shown. In Linux, however, the message is entirely absent. The divergence must occur in the above code, as the only other place where "Vehicle %s can't find a route!" occurs in the code is in the depot window, which is not invoked in the saved game in question (and, from my understanding, is only invoked on the user manually pressing "start" in the depot, or perhaps using they convoy replacer).

The only cause of divergence must be either inconsistent values returned in cnv->get_route()->get_max_weight() or cnv->get_heaviest_vehicle(). Both of those methods are specific to Experimental, but I can't see anything in them that could cause a divergence. The first method is simply a getter function:


uint32 get_max_weight() const { return max_weight; }


The second is likewise:


inline uint32 get_heaviest_vehicle() const { return heaviest_vehicle; }


The value of the heaviest vehicle is set in the calc_heaviest_vehicle() method, the contents of which are:


uint32
convoi_t::calc_heaviest_vehicle()
{
uint32 heaviest = 0;
for(uint8 i = 0; i < anz_vehikel; i ++)
{
uint32 tmp = fahr[i]->get_sum_weight();
if(tmp > heaviest)
{
heaviest = tmp;
}
}
return heaviest;
}


The "max_weight" value in the routing code is set in route_t::intern_calc_route(), in the following piece of code:


if (enforce_weight_limits && w != NULL)
{
// Bernd Gabriel, Mar 10, 2010: way limit info
const uint32 way_max_weight = w->get_max_weight();
max_weight = min(max_weight, way_max_weight);

if(enforce_weight_limits == 2 && weight > way_max_weight)
{
// Avoid routing over ways for which the convoy is overweight.
continue;
}
}


The "max_weight" value is initialised to MAXUINT32 at the beginning of the method. "Weight" in the method comes from a value passed to that method when called, which is, in turn, passed from "calc_route".

I really don't understand where the divergence can arise given that code, or why it isn't working in Linux. The solution to this problem may or may not be the solution to the ultimate problem, but it is worth solving in any event, since it is very useful for players to have an indication of when vehicles are failing to route because they are overweight.

Any assistance would be much appreciated.

Edit

I also wonder, given Dwachs' earlier comments, whether it is possible for divergences to arise in any of the following code (in simworld.cc):


void karte_t::set_citycar_speed_average()
{
if(stadtauto_t::table.empty())
{
// No city cars - use default speed.
citycar_speed_average = 50;
return;
}
stringhashtable_iterator_tpl<const stadtauto_besch_t*> iter(&stadtauto_t::table);
int vehicle_speed_sum = 0;
uint16 count = 0;
while(iter.next())
{
// Take into account the *chance* of vehicles, too: fewer people have sports cars than Minis.
vehicle_speed_sum += (speed_to_kmh(iter.get_current_value()->get_geschw())) * iter.get_current_value()->get_gewichtung();
count += iter.get_current_value()->get_gewichtung();
}
citycar_speed_average = vehicle_speed_sum / count;
}

void karte_t::calc_generic_road_speed_intercity()
{
// This method is used only when private car connexion
// checking is turned off.

// Adapted from the method used to build city roads in the first place, written by Hajo.
const weg_besch_t* besch = einstellungen->get_intercity_road_type(get_timeline_year_month());
if(besch == NULL)
{
// Hajo: try some default (might happen with timeline ... )
besch = wegbauer_t::weg_search(road_wt,80,get_timeline_year_month(),weg_t::type_flat);
}
generic_road_speed_intercity = calc_generic_road_speed(besch);
}

sint32 karte_t::calc_generic_road_speed(const weg_besch_t* besch)
{
if(besch || city_road)
{
const sint32 road_speed_limit = besch ? besch->get_topspeed() : city_road->get_topspeed();
const sint32 speed_average = (float)min(road_speed_limit, citycar_speed_average) / 1.5F;
const uint16 journey_time_per_tile = 600 * (einstellungen->get_distance_per_tile() / speed_average); // *Tenths* of minutes: hence *600, not *60.
return journey_time_per_tile;
}
else
{
return 600 * (einstellungen->get_distance_per_tile() / (citycar_speed_average / 1.5F));
}
}

void karte_t::calc_max_road_check_depth()
{
sint32 max_road_speed = 0;
stringhashtable_tpl <weg_besch_t *> * ways = wegbauer_t::get_all_ways();

if(ways != NULL)
{
stringhashtable_iterator_tpl <weg_besch_t *> iter(ways);
while(iter.next())
{
if(iter.get_current_value()->get_wtyp() != road_wt || iter.get_current_value()->get_intro_year_month() > current_month || iter.get_current_value()->get_retire_year_month() > current_month)
{
continue;
}
if(iter.get_current_value()->get_topspeed() > max_road_speed)
{
max_road_speed = iter.get_current_value()->get_topspeed();
}
}
}
else
{
max_road_speed = citycar_speed_average;
}
if(max_road_speed == 0)
{
max_road_speed = citycar_speed_average;
}

max_road_check_depth = (((float)einstellungen->get_max_longdistance_tolerance() / einstellungen->get_distance_per_tile()) / 60.0F) * min(citycar_speed_average, max_road_speed);
}


I can't immediately see how it might diverge here, as the order in which the values are read from the wegbauer_t::get_all_ways(); (etc.) methods shouldn't make a difference to the calculated average, but might that be wrong somehow, in some obscure way...?

Edit 2

Might it be that GCC and MSVC++ deal with integer division and/or rounding differently? Might it help if, for example, I change


int vehicle_speed_sum = 0;
uint16 count = 0;
...
citycar_speed_average = vehicle_speed_sum / count;


to


double vehicle_speed_sum = 0.0;
double count = 0.0;
...
citycar_speed_average = (sint32)(vehicle_speed_sum / count)

?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs

you might want to look at this commit:

https://github.com/aburch/simutrans/commit/2ee89a67bd2fcc17c28c2230b97bc87dad0e791d

The problem there was that on 'long' was 32bit on one system, 64bit on another system. The differences of koord_distance was computed as 'uint32' though it could be negative, thus in fact the difference was a large uint32 number. If this uint32 was stuffed into a 32bit long -> no problem. But if long was actually 64bit, a negative uint32 suddenly is a large positive number -> divergence.

Maybe this happens here too?
Parsley, sage, rosemary, and maggikraut.

jamespetts

Dwachs,

thank you for your reply. I don't think that the problem can be a 32-bit/64-bit issue, since there is only a 32-bit Windows binary, which will, presumably, behave in a consistent way whether run on a 32-bit or 64-bit system, and I have tested my games by connecting to a 32-bit Linux system. Also, as mentioned before, I had succeeded in connecting a 32-bit Linux system to a 64-bit Linux system in earlier tests, without the same problems. Thank you for your thoughts, though.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Apart from pointers, also the time division between usix systems is different. On windows the scheduler usually runs in 5ms slices. Unix may differ from that.

jamespetts

Thank you for that information. What sort of code would use the scheduler? I don't think that I've interfaced with it knowingly in my coding.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

dr_time() will only give 5ms or 2ms steps in GDI, but may give different steps under Linux. If you did everything properly (i.e. nothing timing related in the routing) then this should not be important; but then you would get desyncs.

jamespetts

I haven't used dr_time() in any of the code that I've written, and I do recall auditing the code carefully when making Experimental compatible with online play before and not using calls to dr_time(). There may be some calls to dr_time() in the goods/passenger routing code written by Knightly, but I can't see how the routing code can be the problem if the desyncs do not occur when there are no towns (but there is a freight transport network) but do occur in similar circumstances when there are towns (and still only a freight transport network).
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

I just compiled SE under mingw and get very many warnings. At some places you are passing doubles for ints. The result of that depends very much on precision and compiler.

VS

You could try compiling for both linux and windows with gcc...

My projects... Tools for messing with Simutrans graphics. Graphic archive - templates and some other stuff for painters. Development logs for most recent information on what is going on. And of course pak128!

jamespetts

Prissi,

thank you for your lead in relation to that; do you get more warnings with mingw than with MSVC++, or are the warnings in essence the same with each compiler? When I get time, I shall look into trying to fix these issues. What is the best sort of fix here - do express casts help?

VS,

thank you for that tip - I shall have to give that a go. Will compiling with mingw suffice, or is there a different compiler that I need to use in Windows?
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

VS

I'm not sure, but probably yes with matching gcc versions. However, if it's not easy to do, then leave it be... (I don't know how to do it.) The idea was, that way you eliminate some sources of divergence and can check if it's just that, or some real logical issue.

My projects... Tools for messing with Simutrans graphics. Graphic archive - templates and some other stuff for painters. Development logs for most recent information on what is going on. And of course pak128!

steffen

It's probably better to switch to using a free build process anyways. I was using gcc-4.4.4 or 4.4.5 when I made my attempts on this.

Edit: If it does turn out to be compiler difference causing this, whilst I'd still suggest switching to a free build, it should be possible to force equal behaviour by using appropriate compiler settings.

jamespetts

#48
My latest test result: running Simutrans-Experimental 9.5 Windows executable on a Windows machine and also on a Linux machine using Wine works without desyncs even on a large and complex map.

I am going to try compiling with MinGW on Windows and connecting to a native Linux build next.

Edit: I have had problems compiling in Windows with MinGW, but have managed to compile in Linux with G++. The latest 9.x still desyncs between Windows (MSVC++ build) and Linux.

Edit 2: I have now managed to get Simutrans-Experimental to compile under MinGW in Windows. Attempting to run that version with my compiled version in Linux still resulted in the same desyncs at the same rate as using the MSVC++ compiled version, so I think that we can probably eliminate compiler differences as the cause here.

Edit 3: Also, I notice that I am not getting the same compiler warnings as Prissi reported about doubles and ints - the only warnings that I get are unused variables (mainly caused by variables used in DBG calls when I have -DNDEBUG defined) and comparisons between signed and unsigned integers, most of which are in code from Standard.

Edit 4: Trying to connect a Windows MinGW compiled version to a Windows MSVC++ compiled version on the same machine also fails with desyncs. Trying to set up a MinGW compiled version as a server on Windows also fails with a "bind error" (the game will not start in server mode), which I cannot debug as there is no debugger that I can use with MinGW.

Edit 5: It seems that I can only reproduce the desyncs reliably on large and complicated maps when using MSVC++ and MinGW on Windows.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Ah, for network code maybe try to compile with assertiond (skip the NDEBUG) Especially for bughunt, assertion are quite valuable.

And Mingw version and Linux version usually differ a lot. I am mostly using Ming 3.4.5 as this is the stable mingw for a long time and the smallest denominator between most systems.

steffen

I think Edit 4 of James' latest post indicates the compiler may in fact be to blame. Which versions of GCC&Ming did you use? I can easily install gcc down to 2.95.3, here's the full list: http://gentoo-portage.com/sys-devel/gcc

prissi

I think it is rather certain actions which order can be compiler dependent. An example was the evaluation of expression in a( b(), c() ); C standard leave it open if first c() and then b() is evaluated or the other way round. There have been some more statements like this. It should be a posting (maybe from knightly) earlier in the standard network thread.

steffen

That would also make sense. This is reminding me of why I hate C/C++ ;)
However, I would imagine that issues like that would be "solved" by using the same or nearly the same compiler on all platforms. Ofc, that's not really solving the problem :/

jamespetts

#53
Thank you both for those thoughts. The latest test results (as between MSVC++ and MinGW on the same machine) with my random caller logging shows that the divergence in random calls always happens in the next method after stadt_t::get_zufallspunkt.

Edit: Prissi: I can't find the discussion topic to which you refer - do you have a link?

Edit 2: Disabling private cars (by replacing the weighted random function to determine whether passengers have a private car with a simple declaration of "false") does not affect desyncs. Further, desyncs can occur with a get_zufallspunkt() call immediately after a previous get_zufallspunkt() call without an intermediate finde_passiagere_zeil() call.

Edit 3: Disabling code relating to checking the distances of passenger destinations resulted in a divergence after finde_passagier_ziel() and not get_zufallspunkt().

Edit 4: Further testing with pedestrians turned off reveals: (1) that pedestrians do not turn off properly, as some remain; and (2) on this occasion, at least, the desync occurred because karte_t::neuer_monat() was called on the client but not on the server at one juncture (but, oddly, the date of the desync messages both showed the same month on the client and the server).

Edit 5: Incidentally, in relation to the latter incident, the immediately preceding call was stadtauto_t::hop_check, which must have been called from an INT_CHECK in step_passagiere() given that the previous call before a list of stadtauto_t::hop_check calls was get_zufallspunkt(), as was the call immediately after neuer_monat(). The client is the MinGW binary and the server the MSVC++ binary. The above all seems to point towards the problem being somewhere in step_passagiere().
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

In which structure do you store your sync_step_t-objects? I did an optimisation and used a hash table; but then the order ding_t sanc_step() was executed became random, i.e. some pedestrian were at a crossing at different times. I had to return to a list.

jamespetts

#55
Prissi,

I haven't changed that part of the code from Standard, so, I assume, if Standard uses a list, Experimental also uses a list.

Edit: Following from Prissi's earlier suggestion:

Quote from: prissi
An example was the evaluation of expression in a( b(), c() ); C standard leave it open if first c() and then b() is evaluated or the other way round.

I have audited the code in stadt_t::step_passagiere() and the methods altered by Simutrans-Experimental called by stadt_t::step_passagiere(), and have found no instances of anything resembling a(b(), c()) that could cause difficulties, except where methods b() and c() are simple getter methods where the order of execution should make no difference.

Also, to answer another earlier question, the version of GCC that I am using from MinGW in Windows is 4.5.2, and in Linux 4.4.5.

Edit 2: Testing with compiling the Linux version in GCC 4.5.1 and with -O2 instead of -O3 optimisation produces the same results, with no reduction in desyncs (both when connected to a Windows client compiled with MSVC++ and with MinGW with GCC 4.5.2).
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Are you using prthashtables somewhere in your system at all?

jamespetts

#57
Prissi,

when I was making Experimental ready for online play initially, I audited all the uses of ptrhashtable, as you had indicated earlier that that system was not compatible with network mode, and eliminated nearly all of the Experimental-specific uses. The only instance of the use of ptrhashtable now that is specific to Experimental is in GUI code in gui_convoi_assembler_t. This code is only invoked when opening a depot or replace window, so cannot be the cause of the desyncs that I am noticing, which occur without any player interaction after connecting at all.

The other uses of ptrhashtable in the code are in obj_reader_t (ptrhashtable_tpl<obj_besch_t **, int> obj_reader_t::fatals;), powernet_t (static ptrhashtable_tpl<powernet_t *, powernet_t *> loading_table;), which I did not code myself so I assume come from Standard.

I also found #include "../tpl/ptrhashtable_tpl.h" in simworld.h, boden.cc, breuckenboden.cc and monorailboden.cc, but removing it in those files made no difference to compilation, so the references were redundant. You might want to check to see whether there are similar unnecessary includes in Standard, as the ones in the *boden files are, I am fairly sure, not ones that I put there (I am not sure about the one in simworld.h).

If the problem was the use of ptrhashtable, though, would this not result in desyncs equally on all clients, rather than desyncs only on clients compiled with a different sort of compiler?

Edit: Latest testing reveals that the problem cannot be in void stadt_t::step_passagiere() after all: if I add, for testing purposes, a "return;" at the very beginning of the method, I still get exactly the same desyncs as before.

Edit 2: Rather odd - starting a large, well-developed game (SDog's 162.sve) with the additional "return" resulted in the usual near-instant desync - however, starting my smaller testing game (one town, three 'bus routes with one 'bus each), which normally desyncs within about a month of game time, resulted in no desyncs at all. It is possible that there is a compiler inconsistency bug in void stadt_t::step_passagiere() and also another location in the code.

Edit 3: I have been working on improved methods of logging calls to simrand() so that I could see potential divergence in the very large map in which I usually see desyncs very quickly. I used the MinGW version and the Linux native version. I found that it ran for a while without desyncing, but desynced as soon as I tried to lay some road on the client. However, there were no divergences in the random calls. The last random call on the client was identical to the random call on the server with the same seed, and the history was also the same; the server had gone on executing into the future, but this to be expected if the client initiates the desync. I am not really sure what is going on here.

Edit 4: On the basis of the last result, I have investigated further, and found that the desyncs in the very large map are not caused by random number mismatches at all, but are generated from the following part of the code:


bool nwc_routesearch_t::execute(karte_t *world)
{
if(  get_map_counter()!=world->get_map_counter()  ) {
// since iteration limits are not specific to any world
// -> it is safe to execute in spite of the difference in map counters
dbg->warning("nwc_routesearch_t::execute", "different map counters: command=%u world=%u", get_map_counter(), world->get_map_counter());
}
if(  apply_limits  ) {
dbg->warning("nwc_routesearch_t::execute", "to be applied: limits=(%u, %u, %u, %llu, %u)", limit_set.rebuild_connexions, limit_set.filter_eligible,
limit_set.fill_matrix, limit_set.explore_paths, limit_set.reroute_goods);
// check if the command's sync step is valid
if(  get_sync_step()<world->get_sync_steps()  ) {
// this command should be executed in the past
world->network_disconnect();
dbg->warning("nwc_routesearch_t::execute", "attempt to execute in the past (sync step: command=%u < world=%u) results in disconnection", get_sync_step(), world->get_sync_steps());
return true; // to delete network command
}
...


Might I ask those who developed this part of the code for Standard - what is its function and why might it be triggered in these circumstances?

This looks to be a separate problem from the more occasional desyncs that I have detected on the smaller map in stadt_t::step_passagiere().

Edit 5: Actually, I'm not sure that that's from Standard at all - it might well be the path explorer code written by Knightly. I shall have to ask him the significance of a desync of that nature.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

Dwachs



if(  get_sync_step()<world->get_sync_steps()  ) {
// this command should be executed in the past
world->network_disconnect();
dbg->warning("nwc_routesearch_t::execute", "attempt to execute in the past (sync step: command=%u < world=%u) results in disconnection", get_sync_step(), world->get_sync_steps());
return true; // to delete network command
}

.. desynchronisation here means, that the client performed its simulation faster than the server. Maybe increasing the timing related settings in simuconf.tab could help.
Parsley, sage, rosemary, and maggikraut.

jamespetts

#59
Dwachs,

thank you: that is useful to know. The client was actually running on a faster machine than the server, so that is not surprising in itself: but isn't there already a built-in way of dealing with this issue (by slowing the client to the speed of the server)? After all, it's hardly likely to be uncommon for the client machine to run faster than the server. Also, I am baffled as to why this issue should only appear when the client and the server are running versions of Simutrans compiled with different compilers. Any help would be much appreciated.

Edit: I tried setting server_frames_ahead to 4, 5 and 15 (it was originally at 1), and found no difference in the rapidity of desyncs (between the MSVC++ version and the MinGW version).

Edit 2: In trials with the Linux GCC 4.5 version and the Windows MSVC++ version, sometimes the map will run a little while without a desync then desync with a checklist error (which I suspect is the same stadt_t::step_passagiere() error as in the smaller map). On other occasions, it desyncs almost instantly with the route search sync step differential error. I have logged a few of the differences in the actual numbers found when using a debugger on these occasions:

Run 1
Remote sync step: 1786
Local sync step: 1788

Run 2
Remote sync step: 2726
Local sync step: 2728

Run 3
Remote sync step: 2882
Local sync step: 2884

Run 4
Remote sync step: 3058
Local sync step: 3060

Run 5 (After restarting the client twice)
Remote sync step: 6707
Local sync step: 6720

Run 6 (Without restarting from run 5)
Remote sync step: 7014
Local sync step: 7016

It seems that, whenever I reconnect without restarting Simutrans on the client, it will always desync with a routesearch sync step mismatch immediately. When, however, I restart the client, it often runs for a long time without a desync, but will desync when I build a road (even not touching anything else). It is impossible to build a road on the client (the command to build the road is ignored), but it will desync with a sync step mismatch when building it on the server. I also notice that, in this state, the city cars are different on the client to the server (different cars in different places at the same time). It is almost as though it ought to be desyncing but is not.

It is noticeable that, with the exception of run 5, which was the first desync after a restart, all of the sync step desyncs are out by exactly 2. I do not appreciate fully the significance of this.22

**

Edit 3: I have been working on trying to narrow down the problem in stadt_t::step_passagiere(). Initial testing suggests that it has several origins. Suspect parts of the code so far are as follows:


const float proportion = ((float) best_journey_time / car_minutes) * car_minutes > best_journey_time ? 1.25F : 0.75F;
car_preference *= (sint16)proportion;


and


uint8 congestion_total;
if(destinations[current_destination].type == 1 && destinations[current_destination].object.town != NULL)
{
congestion_total = (city_history_month[0][HIST_CONGESTION] + destinations[current_destination].object.town->get_congestion()) / 4;
}
else
{
congestion_total = (city_history_month[0][HIST_CONGESTION] / 1.33);
}
car_preference = ((car_preference - congestion_total) > always_prefer_car_percent) ? car_preference - congestion_total : always_prefer_car_percent;


and


if(has_private_car)
{
const uint32 straight_line_distance = accurate_distance(k, destinations[current_destination].location);
uint16 time_per_tile = 65535;
switch(destinations[current_destination].type)
{
case 1:
//Town
time_per_tile = check_road_connexion_to(destinations[current_destination].object.town);
break;
case FACTORY_PAX:
time_per_tile = check_road_connexion_to(destinations[current_destination].object.industry);
break;
case TOURIST_PAX:
time_per_tile = check_road_connexion_to(destinations[current_destination].object.attraction);
break;
default:
//Some error - this should not be reached.
dbg->error("simcity.cc", "Incorrect destination type detected");
};

if(time_per_tile < 65535)
{
// *Tenths* of minutes used here.
car_minutes =  time_per_tile * straight_line_distance;
}
else
{
car_minutes = 65535;
}
}


If anyone has any ideas as to what in those parts of the code might produce different results on different compilers, I should be extremely grateful.

Edit 4: Incidentally, I probably should also add: the problem appears to be that a different number of private car trips are generated on the client than on the server when these parts of the code are enabled: I have tracked it in the city graphs, and, very occasionally, either the client or the server will generate an extra batch of private car trips as compared with the other.

Edit 5: Further testing confirms that, with those three parts of the code commented out, there are no desyncs (for at least four and a half game months) on my one town test map (the small and simple map).

Edit 6: I am told by Knightly that all floating point arithmetic is suspect as far as causing desyncs between differently compiled versions is concerned: see, for example, here. I notice that, in Standard, floating point arithmetic is used only for rendering, UI or map generation code where being deterministic accross different platforms does not matter. Unfortunately, in Experimental, it is widely used in the main running code, as I was not aware of this issue. The problems seem to be most intense in stadt_t::step_passagiere(), so I am looking at removing floating point arithmetic from there first, but I may eventually have to remove all floating point arithmetic in synchronised code. It is not yet clear whether this is the only cause of the random number generator desyncs, and testing is continuing on that. (Unfortunately, testing can be lengthy, as it can take a long time for divergences to appear).

In other news, I managed to play a lengthy online game with Steffen and Djohaal last night: Djohaal and I were using Windows, and Steffen, a Linux user, was running Simutrans through a Windows VM. We had only very occasional desyncs, (no more than one an hour, and possibly less), although a bigger problem was lag. That might have been caused, however, by my potentially poor internet connexion. Steffen, who was running the server, reported that the game would appear to speed up suddenly for short bursts; I then would notice that my client, which had not sped up, was lagging behind the server, as commands would take a long time to execute. On other occasions, the client would seem to catch up and the lag would disappear. A link to the saved game that we were playing is available here. Steffen and Djohaal between them were playing "Steffen". I was playing the "London & Western Railway".

Another noticeable issue was that discussed by Neroden here, which is that the routing by use of average times for the convoy's whole journey causes anomalies, with, in this case, passengers from Manchester to Bristol taking the Manchester to London train from Manchester to Birmingham even though there was a direct train from Manchester to Bristol, because the Manchester to London line (shared by the Bristol trains to Birmingham) was faster than the Birmingham to Bristol line (used by the Manchester to Bristol trains). That issue will need further consideration in its own thread.

Edit 7: I have been working to strip the floating point arithmetic from the code and replace it with integer equivalents. This is a difficult and painstaking process, and it is very easy to get calculations off by a factor of ten or a hundred in ways difficult to spot. However, this has yielded promising results. The first two pieces of suspect code from edit 3 have been converted to integer only operation, and, with those parts of the code activated, the game will run as between the MSVC++ and MinGW versions without desyncs for at least 11 game months. However, the third of the three pieces of code identified, that relating to private car routes, will still cause quite rapid desyncs (often within two weeks of game time) with the small and simple map. Further testing is required. In the meantime, I should appreciate anyone with the ability to compile from source testing the commit that I am about to push to Github to see whether any order of magnitude calculation errors are apparent.

Edit 8: I have removed much more of the floating point arithmetic, but there is still some in the physics engine that Bernd Gabriel wrote some time ago: converting the following to integer only operation seems to be tortuously difficult for somebody who, like me, was never very good at mathematics:


       const double p3 =  Frs / (3.0 * adverse.cf);
const double q2 = get_continuous_power() / (2 * adverse.cf);
const double sd = signed_power(q2 * q2 + p3 * p3 * p3, 1.0/2.0);
const double vmax = (signed_power(q2 + sd, 1.0/3.0) + signed_power(q2 - sd, 1.0/3.0));
return min(vehicle.max_speed, (sint32)(vmax * 3.6 + 1.0)); // 1.0 to compensate inaccuracy of calculation and make sure this is at least what calc_move() evaluates.


(And that is just an example).

I have pushed the latest results of my float-stripping to Github, but none of the additional float removal that I have undertaken to-day seems to have made any difference: the game will still desync with the very simple map with the private car route checking part of the code enabled, and will desync quite quickly even with it disabled on the 162-110.sve (or exhibit the alternative behaviour of desyncing only on the execution of a user command): the large and complicated map. Also, using the saved game that I was playing with Steffen (link above) and running that for some time, I get desyncs with that, too, after a goodly amount of time has elapsed. With the exception of the desyncs caused by laying roads (etc.) in 162-110.sve, which are of the "tried to execute command in the past" sort, all of the desyncs have been random number generator seed checklist mismatches; I have not recently been able to reproduce the route-search sync step desync.

I don't know whether it is the remaining (and extremely difficult to convert) physics code that is causing the trouble or not: the private car routing system does not touch the physics, which, aside from certain revenue calculations (that ought not cause desyncs even if they do not produce identical results), are the last remaining parts of floating point code that are not confined to display, user interface or map generation.

Any assistance in de-floating the physics would be very, very much appreciated, and likewise any suggestions of what, other than floating-point arithmetic, might be the cause of the remaining desyncs (bearing in mind that there are no desyncs when two identical binaries connect over the network).

Edit 9: Further testing with the private car connexions issue has revealed that the problem seems to occur when the client (MinGW) logs an attraction as reachable by road, but the server logs it as unreachable (time_per_tile == 65535). Here is the code for checking whether an attraction is reachable; as will be seen, it does not involve any floating point arithmetic:


uint16 stadt_t::check_road_connexion_to(const gebaeude_t* attraction)
{
if(welt->get_einstellungen()->get_assume_everywhere_connected_by_road())
{
return welt->get_generic_road_speed_intercity();
}

if(connected_attractions.is_contained(attraction->get_pos().get_2d()))
{
return connected_attractions.get(attraction->get_pos().get_2d());
}
const koord pos = attraction->get_pos().get_2d();
grund_t *gr;
weg_t* road = NULL;
for(uint8 i = 0; i < 16; i ++)
{
koord3d pos3d(pos + pos.neighbours[i], welt->lookup_hgt(pos + pos.second_neighbours[i]));
gr = welt->lookup(pos3d);
if(!gr)
{
pos3d.z ++;
gr = welt->lookup(pos3d);
if(!gr)
{
continue;
}
}
road = gr->get_weg(road_wt);
if(road != NULL)
{
break;
}
}
if(road == NULL)
{
// No road connecting to attraction - no connexion at all.
// We should therefore set *every* city to register this
// industry as unconnected.

const weighted_vector_tpl<stadt_t*>& staedte = welt->get_staedte();
for(weighted_vector_tpl<stadt_t*>::const_iterator j = staedte.begin(), end = staedte.end(); j != end; ++j)
{
(*j)->set_no_connexion_to_attraction(attraction);
}
return 65535;
}
const koord3d destination = road->get_pos();
const uint16 journey_time_per_tile = check_road_connexion(destination);
connected_attractions.put(attraction->get_pos().get_2d(), journey_time_per_tile);
if(journey_time_per_tile == 65535)
{
// We know that, if this city is not connected to any given industry, then every city
// to which this city is connected must likewise not be connected. So, avoid
// unnecessary recalculation by propogating this now.
koordhashtable_iterator_tpl<koord, uint16> iter(connected_cities);
while(iter.next())
{
welt->get_city(iter.get_current_key())->set_no_connexion_to_attraction(attraction);
}
}
return journey_time_per_tile;
}


I can't see what in that code might possibly compile differently on MinGW to MSVC++ - does anyone have any ideas?

Edit 10: Knightly has very helpfully found the problem in the above code (relating to undefined behaviour, caused by a bug relating to searching too large an array to see whether there is a road next to an industry or attraction). With that fixed (on Github), the desyncs associated with this part of the code in particular no longer occur.

The other desyncs, however, seem to be associated with the physics: bypassing the entirety of the physics code and stipulating that all moving vehicles should move at exactly 50km/h prevents the desyncs entirely; the physics code is, alas, filled with floating point arithmetic and highly complicated. Anyone who would like to volunteer to assist to convert it would be extremely appreciated.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

Since the physics code did have lots of other trouble (little penalty of hills vs. curves and so on and is very CPU consuming, why not kick this out and use the one of standard anyway? Most pakset are made with this in mind, and work well then.

But, on another note, if desyncs happen frequently after reconnection, then this points to some static variable that survives through initialisation of a new map. This could also make the difference between unix and windows as initialisation order of variable is not defined in the standard concerning static variables.

jamespetts

Prissi,

the issue with the hills is not caused by a problem with the physics calculations per se, but the difficult interface between realistic physics calculations and the somewhat representational world of Simutrans in which things can only be at heights in integer multiples of tiles: the problem is, in effect, that the inclines are too short to have sufficient effect. The solution to this that I have been planning to implement for some time is to treat a certain (and configurable) number of tiles in advance of the hill as part of the hill to increase by a factor of whatever that configurable number is the effect of the hill.

The reason that the physics engine was changed in Experimental in the first place was to ensure a real economic difference between steam and other forms of haulage on rails, the significance of the relative paucity of acceleration of which when compared to even ostensibly less powerful (and certainly slower, in the maximum speed sense) electric transport was of great significance in the development of railways and tramways around the turn of the 20th century and was a crucial economic driver towards the electrification of urban railways, without which driver, there would be little economic incentive to incur the enormous outlay of electrification. The difference between power and tractive effort is also highly significant, and likewise the impact of streamlining (albeit the latter to a lesser degree), and it was thought desirable by both Bernd and I to simulate these things.

I am slowly working through the physics code replacing the doubles with sint32s representing the doubles' values multiplied by either 100 or 10,000 to give the requisite accuracy (checking to ensure that overflows are not likely) and carefully testing the comparability of the results against the existing code by keeping the existing code in place and writing the new code as shadow code, using asserts to check the equivalence of the result. It is a somewhat painstaking process, but it should eventually yield acceptable results.

(Incidentally, I am not aware of the Simutrans-Experimental physics engine being a significant drain on resources - is your obervation to the contrary from profiling, or just looking at the code?)
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

prissi

When the engine was incorporated, people commented on a big drop in speed of larger games. I have barely time for standard, thus I cannot profile experimental.

jamespetts

I don't remember people commenting about speed drops associated with the physics engine particularly, although I may have forgotten.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

#64
Bernd Gabriel, the original author of the physics code, has written a new integer based fraction class to replace floating point arithmetic in the physics code, but did not have time to implement it in the code himself. I have attempted to implement it. Initial results are that there are errors in the calculations that I have not yet tracked down that cause vehicles to move at a constant and very fast speed (but to report that they are moving at 0km/h and that their maximum speed is 0km/h), but that, with this new code enabled, no cross-architecture desyncs occur.

I have pushed the new code to my Github repository; if anyone can lend a hand finding what the trouble is with the new code, I should be very grateful. Thanks to everyone who has helped with this so far.

Edit 1: Unfortunately, there appears to be a bug in the fractions code such that subtracting 1 from x produces a negative value even if x is greater than 1.

Edit 2: The problem appears to be in this piece of code:


n = n * v.d + d * v.n;
d = d * v.d;
return *this;


In this code, n represents the numerator and d the denominator of the first fraction, and v.n and v.d the numerator and denominator in the second fraction. Returning "*this" returns the first fraction. With this code, 1 += 1/-2 = -1/2, when it should resolve to 1/2. My mathematical mojo isn't enough to fix this - does anyone have any ideas?

Edit 3: Hmm, actually, I was able to fix it thus:


const bool negative = v.d < 0;
n = n * v.d + d * v.n;
d = d * v.d;
if(negative)
{
n = -n;
d = -d;
}
return *this;


Unfortunately, there are still numerous undiagnosed problems with the physics arithmetic resulting in all vehicles not having enough power to move.
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.

jamespetts

Simutrans-Experimental 9.6 is now available: see here for details. It should now be possible to play online between Windows and Linux (etc.) machines - I should be interested in feedback from anyone who plays Experimental online.

Remember, if you are setting up a public server, don't forget to register it here (thanks to Frank for setting up an Experimental version of the online server list).
Download Simutrans-Extended.

Want to help with development? See here for things to do for coding, and here for information on how to make graphics/objects.

Follow Simutrans-Extended on Facebook.