
Note: Although I have posted this on the SimutransExperimental board, much of this relates equally to SimutransStandard, as the code for calculating passenger and mail quantities is not distinct between the two. The reference to hourly quantities in the linked spreadsheet, however, is relevant only to Experimental.
Following discussion of the excessive passenger numbers in the recent BridgewaterBrunel (http://www.bridgewaterbrunel.me.uk) SimutransExperimental online game, I decided to look into the code to find out exactly how passenger and mail generation is calculated, and so to work out some bases for calibration. It turns out that, in simple terms, the code works like this: assuming a bits per month setting of 18 and a passenger factor setting of 8, each building in a city will be stepped once per month. Increasing the passenger factor from 8 to 9 will increase the monthly stepping from 1x per building to 1.125x, 10 will give 1.25x, 11 1.375x and so forth. Increasing the bits per month by 1 will double this number, but will also double the month length, and vice versa on decreasing it by 1.
Each step will generate exactly one packet. This packet might be either mail or passengers and can contain one or more of either mail or passengers (but not both). Each packet has a 3/4 chance of being a passenger packet and 1/4 chance of being a mail packet. The number of passengers or bags of mail in the packet is determined by the passenger/mail level shown in the building's dialogue box. For passengers, 6 is added to the level, and the resulting number divided by four. For mail, 8 is added to the level, and the resulting number is divided by 8. Fractions are rounded down to the nearest whole number. This has the consequence that buildings with a passenger/mail level of zero still generate passengers and mail, and still generate passengers and mail at exactly the same rate (1 unit per packet in both cases) as buildings with a level of 1. It takes a mail level of 8 to increase this number beyond 1 for mail, and a passenger level of 2 to reach a packet size of 2, and of 6 to reach a packet size of 3. Any intermediate numbers have precisely the same effect as the next number down: in other words, building levels [0 and 1], [2,3,4 and 5], and [6,7,8 and 9] all have exactly the same effect on passenger generation as each other and building levels [0,1,2,3,4,5,6 and 7] and building levels [8,9,10,11,12,13,14, and 15] all have exactly the same effect on mail generation as each other. I should note that the passenger factor affects what level that buildings are set to be.
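A minimal sketch of the two rounding rules described above. The function names are hypothetical and are not taken from the Simutrans source; only the arithmetic follows the description.

```cpp
#include <cstdint>

// Units in one passenger packet for a building of the given level:
// add 6, divide by 4, round down (integer division truncates).
inline std::uint32_t passenger_packet_size(std::uint32_t level) {
    return (level + 6) / 4;
}

// Units in one mail packet for a building of the given level:
// add 8, divide by 8, round down.
inline std::uint32_t mail_packet_size(std::uint32_t level) {
    return (level + 8) / 8;
}
```

Calling these for levels 0 to 15 shows the plateaux: passenger packets only grow at levels 2, 6, 10 and 14, and mail packets only at level 8.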
Adding these together, a town with 100 buildings (a small but solid town in Simutrans terms) each of level 0 or 1 will produce 75 passengers and 25 bags of mail per month if the passenger factor is 8 and the bits per month setting is 18. However, at the settings currently in use on the BridgewaterBrunel server (taken from Pak128.BritainEx 0.8.4), being a passenger factor of 10 and a bits per month setting of 21, in a town with 100 buildings all of which have a level of 0 or 1, 750 passengers and 250 bags of mail would be generated each month.
If we assume an average packet size of 1.25 units, on the basis that some significant number of buildings will be at level 2 or over, we get 937.5 passengers per month and 312.5 bags of mail per month. A town of 200 buildings, not an uncommon size for a larger town, would have figures of twice this level: 1,875 passengers and 625 bags of mail per month.
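The monthly estimates above can be reproduced with a short sketch. It assumes the simplified model described in this post (one step per building per month at a passenger factor of 8 and 18 bits per month, three packets in four being passengers); the struct and function names are hypothetical, and the average packet size is passed in hundredths to keep everything in integers.

```cpp
#include <cstdint>

struct MonthlyOutput {
    std::uint64_t passengers;
    std::uint64_t mail;
};

// Estimated monthly generation for a town under the simplified model.
// Assumes bits_per_month >= 18; units_per_packet_x100 is the average
// packet size in hundredths (100 = every packet carries one unit).
inline MonthlyOutput monthly_output(std::uint64_t buildings,
                                    std::uint64_t passenger_factor,
                                    unsigned bits_per_month,
                                    std::uint64_t units_per_packet_x100) {
    // Steps scale linearly with the passenger factor (8 = 1x) and
    // double with every extra bit per month above 18.
    const std::uint64_t packets =
        ((buildings * passenger_factor) << (bits_per_month - 18)) / 8;
    // 3/4 of packets are passengers, 1/4 are mail; results round down.
    return { packets * 3 * units_per_packet_x100 / 400,
             packets * units_per_packet_x100 / 400 };
}
```

With 100 buildings, factor 10 and 21 bits per month, a packet size of 100 gives 750 passengers and 250 bags of mail; 125 gives 937 and 312 after rounding down.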
Not all of these passengers and bags of mail would be transported, of course: many would not be able to reach their destination at all or in time, but even at a rate of 16% for passengers and 18% for mail (which are the percentages actually transported from the town of Caringford in the online game), this still gives numbers of 150 passengers per month for a 100 building town or 300 passengers for a 200 building town, and 56 bags of mail per month for a 100 building town or 112 bags of mail per month for a 200 building town. (In fact, these figures underestimate the load, because every town has a town hall of at least level three, and often has attractions, such as a church, which has a level of 12: the real figures for Caringford, a town of 134 buildings, are 1,917 passengers generated of which 442 were transported in the last complete month as at the date of writing).
Whilst these figures may not seem very great at first glance, it is as well to bear in mind that a SimutransExperimental month is defined in terms of a certain number of hours and minutes, and that the frequency of services is measured according to those hours and minutes rather than according to the number of months. In the current online game, following the default for Pak128.BritainEx 0.8.4, the bits per month setting is 21, with the result that there are 6 hours and 24 minutes (or 6.4 hours) in every month. 442 passengers and 645 bags of mail (also the most recent figure from Caringford) every 6.4 hours equates to 69 passengers and 100 bags of mail being transported every hour, and a total of 300 passengers and 74 bags of mail being generated every hour that might be transported if the networks were capable of it. For a game year of 1800, this seems rather excessive, and seems to account in part for the very great passenger and mail numbers seen in the game.
I attach this spreadsheet (http://bridgewaterbrunel.me.uk/misc/Passenger%20factor%20calcs.ods) (in .ods format) to show my calculations in reaching these figures, and to encourage experimentation to suggest an optimum passenger factor as well as any refinements to the code better to simulate all of this more accurately.
Edit: I forgot one important feature of the passenger generation algorithm in the above description (and spreadsheet): return journeys. Every passenger trip other than one between two points in the same town generates a return trip. That is, if a passenger packet is generated at stop A bound for stop B, then, unless stop A and stop B are in the same town, a packet of the same size is automatically generated at stop B bound for stop A at the same time. This has the effect of substantially increasing the number of passengers over and above what the above calculations indicated.

Interesting. But how would one go about allowing for an increase in the total number of journeys as they become less time-consuming and as general income levels increase towards the latter half of the 1800s (if one were to cut the passenger factor)?
If only we could simulate ticket prices, the demand could be regulated by increasing them to discourage excessive traffic (and certainly would add a very interesting dimension to multiplayer games, if one has the ability to handle the flow satisfyingly.)

Journey time tolerance is intended to have this effect for passengers.

A little data from my GB map which might be interesting/useful.
This map runs at 23 bits per month (because of the low metres per tile, 118, this yields a 12 hour month). The passenger_factor is set to 1. With realistic service patterns and convoy capacities, I find that this is *just about* a suitable figure; if anything it yields slightly too large passenger volumes.
I'm sure this is partly because of the high bits per month figure, but I have thought for a while that it would be nice if this value was more fine-grained (and if values below the current '1' were available).
Of course, one can also mitigate passenger levels by fiddling with "alternate destinations" and passenger journey time tolerances, which is what I'll have to do on the GB map if passenger numbers start to get really out of hand.

Hello James
This document is very well done! (I almost understood everything even if my English leaves much to be desired ... :[ ). I suggest putting it among the documents in evidence.
Giuseppe

That is very interesting. May I ask: how do you work out what constitutes the right level of passengers in the first place?

Nothing too precise. I take the "right level" of passengers to be that at which most services are decently loaded, but not too often overcrowded, and given the realistic service frequencies this seems to require a passenger factor of 1.

Do you find that all the services are realistically loaded in this case, or are some overloaded and some underloaded?

For the most part loadings are realistic. There are some obvious and explicable distortions, though. High Speed services from Kent are extremely popular and usually full, whereas of course in real life this is mitigated by the fact that there is a supplement payable to travel on such services. On the other hand, London Underground services tend not to be as busy as one would expect, except in the very centre of London. This is presumably because it's quite difficult to simulate the fact that Londoners are much more likely to use public transport. (There are mechanisms in Experimental which can do this to a certain extent, of course, but I have not yet pushed these to their limits.)

Is this with or without a realistic level of private car transport?

Now that I'm not sure about. I have private car display set to low for performance reasons (since the map is hideously large). I don't know whether that's just a visual feature or if it turns off private car simulation altogether, and I'm afraid I'm not too familiar with the appropriate values governing the level of private car usage.

You need to look in the city chart window. The city car level relates only to the number of vehicle objects appearing on the road, not the number of journeys made by private car. That is determined by the privatecar.tab file in the /[pakset]/config folder. If there is no privatecar.tab (giving variable values by year), a flat figure of 25% access to a car is assumed.
Edit: I have looked at your latest save, and you have car ownership at a steady 25%, meaning that no privatecar.tab is defined. I also notice that the trains seem rather on the deserted side: eleven out of five hundred and something. This seems to be the passenger density of very early on a Sunday morning: 1 is probably too low for Pak128.BritainEx.

It seems like the effect of building levels should be more continuous. Each level should always produce more than the previous, allowing level to simulate a combination of density, relative wealth, and the progression of time. This would especially help balancing earlier years, where routes can get clogged with slow, low capacity 1800s vehicles trying to move around 1900s levels of cargo.

Or even 1700s vehicles trying to move around 1800s cargo. I am minded to agree in principle. The question is implementation...

Maybe the number of passengers in a 'packet' should be a float and .1234 passengers is treated as one passenger with a 12.34% chance of spawning. Then building level would matter and passenger/mail balancing could be done with building levels and the level formula, independent of the strange passenger factor and bits per month stuff.

We can't have floats in running code, sadly, because rounding/truncating differs between different platforms, which causes desyncs when playing online. What we have to do is multiply by 100 or 1,000 (or even more) and use percentages or permilles, etc.
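One way to combine the two ideas above, fractional packet sizes without floating point, is to hold the packet size in thousandths and turn the remainder into a per-mille chance of one extra unit. The sketch below is an assumption about how that could look, not existing Simutrans code: `DemoRng` is a hypothetical stand-in for the game's deterministic random number generator, and `packet_units` is an invented name.

```cpp
#include <cstdint>

// Hypothetical stand-in for the game's deterministic generator.
// A simple LCG, used only so that the sketch is self-contained.
class DemoRng {
    std::uint32_t state;
public:
    explicit DemoRng(std::uint32_t seed) : state(seed) {}
    // Returns a value in [0, max).
    std::uint32_t next(std::uint32_t max) {
        state = state * 1664525u + 1013904223u;
        return static_cast<std::uint32_t>(
            (static_cast<std::uint64_t>(state) * max) >> 32);
    }
};

// size_permille is the packet size in thousandths of a unit:
// 1234 means one unit plus a 23.4% chance of a second unit.
inline std::uint32_t packet_units(std::uint32_t size_permille, DemoRng& rng) {
    const std::uint32_t whole = size_permille / 1000;
    const std::uint32_t remainder = size_permille % 1000;
    // Integer comparison only, so every platform rounds identically.
    return whole + (rng.next(1000) < remainder ? 1u : 0u);
}
```

Because only integer division, modulo and comparison are involved, the result depends solely on the generator's state, which is what network play needs.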

Looking into this topic further, I have found this (http://www.sustrans.org.uk/sites/default/files/documents/guidelines_16.pdf) research, which shows that, on average, each person makes 1,100 trips per year, 84% of which are 16km (10 miles) or under, and 14% of which are between 16 and 80km (10-50 miles), the remaining 2% of which are above 80km (50 miles). The first step, therefore, is to reflect these values in simuconf.tab, which I have done on my Github branch for the forthcoming 0.9.0.
Given that there are 8,760 hours in a year, 1,100 trips per year equates to about 0.13 trips per person per hour. However, those journeys are not spread evenly throughout the day: many hours of the day are spent asleep, time in which no journeys are made. Because Simutrans does not represent fluctuating demand at different times of day, but instead represents an average daytime demand, the inactive nighttime hours must be removed from the calculation. If we remove 8 hours from the day to represent per person sleeping/resting time, we get a figure of 5,844 ((24 - 8) * 365.25) active hours per year. Dividing 1,100 trips by 5,844 active hours results in a figure of 0.19 journeys per person per hour, or approximately one passenger trip every five active hours per head of population.
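For anyone wanting to try other assumptions about the length of the active day, the arithmetic above fits in one small function. This is an offline calculation only (hence the use of double, which would not be acceptable in running game code), and the function name is hypothetical.

```cpp
// Trips per person per hour, given annual trips per person and the
// number of hours per day in which journeys are assumed to be made.
inline double trips_per_active_hour(double trips_per_year,
                                    double active_hours_per_day) {
    const double active_hours_per_year = active_hours_per_day * 365.25;
    return trips_per_year / active_hours_per_year;
}
```

With 1,100 trips per year, 16 active hours per day gives about 0.19 trips per hour and 24 gives about 0.13.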
In 0.8.4, a sample town generated 3,007 passengers in a month, which is 6 hours and 24 minutes long (or 6.4 hours long). 3,007 / 6.4 = 469.84 passengers per hour, which, when divided by 0.19, gives a figure of 2,472.84: the population that would produce that many trips at the realistic rate. In the game, the town reports a population of 4,972 and 223 buildings (for reference, this is Chillhead from the BridgewaterBrunel online game, taken in the year 1808). Oddly, that suggests that there are too few, not too many, passengers being generated. The issue might, therefore, be in the proportions of them that are allowed to travel longer distances: the defaults in 0.8.4 are:
Local: 58%
Mid range: 28%
Long distance: 14%.
(The ranges are defined differently).
Have I done something wrong in this calculation, or do people agree that all this points to the conclusion that the issue is not the passenger factor per se, but the proportion of people prepared to travel long distances?

Possibly change the factors to:
Local: 60%
Mid: 32%
Long: 8%

Hmm, I think that nothing less than using fully realistic figures will suffice here, especially given the scale of the problem, which very small changes seem extremely unlikely to fix. In the Github repository, I have edited the relevant parts of simuconf.tab to the following:
# The program will generate three groups of passengers: (1) local;
# (2) midrange; and (3) long distance. The program will look for a town within
# the specified distance ranges for each class of passenger. If it cannot
# find such a town within a certain number of tries, it will pick a random town.
# The ranges *may* overlap. These values are in kilometres.
local_passengers_min_distance = 0
local_passengers_max_distance = 16
midrange_passengers_min_distance = 16
midrange_passengers_max_distance = 80
longdistance_passengers_min_distance = 55
longdistance_passengers_max_distance = 16384
# The following are percentage chances. They represent the chances that
# any passengers generated will try to find a town within each of the three
# ranges, respectively.
#
# 84% of journeys are 10 miles (16km) or under
# 14% of journeys are 10 - 50 miles (16 - 80km)
# 2% of journeys are >50 miles (80km)
# See: http://www.sustrans.org.uk/assets/files/connect2/guidelines%2016.pdf
# Allow for some overlap, however, between midrange and long distance.
passenger_routing_local_chance = 84
passenger_routing_midrange_chance = 12
# (Longdistance by inference = 4)
# passenger_routing_longdistance_chance is 100 minus the sum of the two above values,
# but not stipulated individually.
# Passengers have a maximum tolerance level for how long that they will
# spend travelling. The further that the passengers want to go, the more
# time that they will be prepared to spend travelling. The number is
# expressed in minutes. For each packet of passengers, the number of
# minutes travelling time (including waiting time) that they are prepared
# to tolerate is randomised between the minimum and maximum amount using a
# normal distribution, meaning that the numbers are more likely to be in
# the middle of the range than at either end. The local, midrange and
# longdistance passenger groups correspond to those above.
min_local_tolerance = 2
max_local_tolerance = 75
min_midrange_tolerance = 5
max_midrange_tolerance = 240
min_longdistance_tolerance = 30
max_longdistance_tolerance = 4320
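To illustrate how the two configured chances imply the third, here is a sketch of picking a journey class from a single percentage roll. The enum and function names are hypothetical; the actual selection code in Experimental may be structured differently.

```cpp
#include <cstdint>

enum class JourneyClass { Local, MidRange, LongDistance };

// roll is a random percentage in [0, 100). Long distance takes whatever
// is left after the local and midrange chances, so with 84 and 12 it
// receives the remaining 4.
inline JourneyClass classify_journey(std::uint32_t roll,
                                     std::uint32_t local_chance,
                                     std::uint32_t midrange_chance) {
    if (roll < local_chance) {
        return JourneyClass::Local;
    }
    if (roll < local_chance + midrange_chance) {
        return JourneyClass::MidRange;
    }
    return JourneyClass::LongDistance;
}
```

With the values above, rolls 0 to 83 are local, 84 to 95 are midrange and 96 to 99 are long distance.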

If you were to do that, I do think the passengers would need to be broken down further for more realistic results.
Off the top of my head...
Local
 Tourist
 Shopper/errands
 Commuter/student
 Commuter/local worker
mid distance
 Commuter/suburbia (anywhere, as long as travel time within 1hr)
 Shopper/by car
long distance
 Business trip (anywhere, as long as comfortable)
 Vacationist
 Immigrant/Emigrant
By far, the largest total number would be commuters.

Isn't mid distance (>16 km) a bit unusual for shoppers? Going more than a few km would immediately cancel out any bargain price.
James, the short distance does most likely include walking as mode of transport too. This would remove a very large number of those trips. I'd expect at 84% you greatly overestimate public transport use. (hardly anyone would take a bus for a trip below 1 km)

We Canadians do cross the border occasionally to buy stuff from the States when they have Black Friday or other amazing deals.
But yes, it's a very limited percentage. Probably not even 1% of us will do it, since it requires a few thousand dollars investment to save several hundreds.

Maybe short/medium/long distance could be defined in terms of not just distance traveled but other factors. People in remote areas often travel much further than those in urban areas for work, shopping and so on.

Breaking down the distance factors further as suggested would require some quite fundamental changes to the basic way in which the passenger generation works; why would this be necessary if realistic figures are used for journey distances and time tolerances, but not otherwise?

The other thing to consider (this may already be taken account of in Exp; I'm not familiar with it) is the fact that earlier in history people travelled less far. Even in recent times there is a significant amount of evidence to suggest that where people travel is governed by how long people are willing to travel. For commuters, this works out at pretty much 1 hour. Build a new road/railway or speed things up and it means you don't simply improve existing journeys, you also CREATE new journeys because you can now commute to new places in under 1 hour. So as transport speeds up, you get more journeys (which should be reflected in an increased transport network capacity with faster and higher capacity vehicles).

This is accounted for by the journey time tolerance feature in Experimental.

I think it's a good idea, using realistic numbers, or maybe slightly less than real.
For era variance in travel distance, and in general, wouldn't journey time tolerance help mitigate the number of pax on the roads as well?
I'm not sure of its current performance, but pax seem to be quite tolerant in the online game, for the year 1821.

Jamespetts
Would it be helpful if I posted an old timetable from Germany here?

I don't think that a timetable per se would assist: what's really needed are the number of people actually travelling. Thank you for the offer, however.
I am experimenting with lowering the tolerances, too.

Jamespetts
From old timetables you can work out how many people historically used a train or a bus.

Perhaps they might be of some interest...

The timetables would not provide that much information. One would need to know what utilisation was aimed at, whether the assumptions of the timetable were met, whether operational constraints were more important (trains have to go back too) and whether any correlation with passenger numbers was intended at all (e.g. a 30 min timetable to meet city requirements for minimum service).
I don't see why having a more fine-grained classification of travel distances would be sensible. Travel time ought to take care of that. Short, mid and long distance seems a very sensible classification. What I don't trust are the numbers you found, unless they are transport specific, not general travel lengths regardless of mode.

What i don't trust are the numbers you found. Unless they are transport specific, not general travel lengths, regardless of mode.
Do you mean the numbers that I found in the paper that I linked? Why do you not trust them? They specifically purport to be mode independent, as the purpose of the paper is to discuss methods of decreasing car usage in proportion to cycle and walking usage for short journeys.

I hadn't seen the link on my first read; I only noticed that the article was linked after you mentioned it.
The paper states 22% of trips are below 1 mile (1609 m?). I would not consider them as possible transport users. I suggest removing them from the total and renormalizing. Doing so would yield
80% short distance, 14% mid distance, and 6% long distance. See the calculation below.
short
Prelude> 1 - (0.05 + 0.11)/(1 - 0.22)
0.7948717948717949
long
Prelude> 0.05/(1 - 0.22)
6.410256410256411e-2
mid
Prelude> 0.11/(1 - 0.22)
0.14102564102564102
ps.: I'm surprised it's in miles 40 years after metrication. (I wonder how people estimate such distances as all odometers and waysigns would show km). That's not a criticism of the article of course, just a surprised observation.
pps.: At average speed (1.1 m/s) walking a mile would take 24 minutes. It's surprising there are people who take a car for such distances, the time saved could be at maximum 5 minutes (driving has more overhead, unlocking, starting, unparking, parking, etc) the cost involved with the greatly increased wear of a cold motor ought not be able to make driving economically sensible for anyone close to an average hourly income.

I have been looking carefully into how the passenger generation system works in practice, and also calibrating some figures. When I apply my new journey time tolerance settings to a saved copy of the BridgewaterBrunel (http://www.bridgewaterbrunel.me.uk) online game, the overall proportion of passengers transported falls from 9% to 4%: a step in the right direction.
However, some rough calculations based on stagecoach departure figures from this (http://www.economics.uci.edu/files/economics/docs/workingpapers/200405/Bogart02.pdf) paper (see page 29), an assumption that stage coaches would on average carry 10 passengers each and these (http://homepage.ntlworld.com/hitch/gendocs/pop.html) historical population statistics suggest that stage coaches transported an average of 654 people per "active" hour (as defined above) in 1800. This works out at 0.00004 journeys per hour per head of population.
In my modified version of the BridgewaterBrunel game from 1808, however, with a population of 1,561,558, I get 56,126 passengers transported an hour total, or 0.035942 passengers transported per hour per head of population: roughly 900 times the historical figure for stagecoaches.
It has to be said that there must have been a goodly number of passengers transported by canals and boats of various kinds in 1800, so the figures for stagecoaches should be taken with some caution, but I should expect to see a difference of no more than one to one and a half orders of magnitude on the most generous assumptions about the numbers of passengers transported by canal (I cannot find any data on that), so a difference of a factor of roughly 900 is still a long way off.
Trying to find the cause in the game mechanics, I think that I have narrowed things down to the journey time tolerance system. I have not found any bugs, per se, but the system is not quite doing the intended job, and I need to think carefully for a sensible solution.
As indicated above, the tolerance ranges for the various sorts of journeys are:
min_local_tolerance = 2
max_local_tolerance = 75
min_midrange_tolerance = 5
max_midrange_tolerance = 240
min_longdistance_tolerance = 30
max_longdistance_tolerance = 4320
These are realistic ranges in themselves, but the problem is how numbers are picked in the range. In 10.15 and previous versions, the system was very simple: a random number anywhere in the range would be found. A journey time tolerance would have an equal chance of being at any point in the range. I have modified this in my 10.x build to use a very simplified normal distribution generation mechanism suggested by Kieron, of dividing the maximum random number by 2, getting two random numbers with that halved maximum and adding them together.
Unfortunately, a normal distribution turns out not really to solve the underlying problem. Let me give some examples of an actual sequence of journey time tolerances generated in the game, trapped using the debugger (all values in minutes):
Local (0 - 16km)
38
27
46
33
61
46
42
43
37
71
Mid range (16 - 80km)
61
46
217
22
109
33
135
120
167
146
40
24
163
21
99
Long distance (55km - effectively unlimited)
3170 (52 hours)
2606 (43 hours)
2121 (35 hours)
2424 (40 hours)
2441 (40 hours)
1754 (29 hours)
2648 (44 hours)
3321 (55 hours)
3492 (58 hours)
2671 (44 hours)
2836 (47 hours)
3340 (55 hours)
1875 (31 hours)
2725 (45 hours)
2329 (38 hours)
2971 (49 hours)
2206 (36 hours)
371 (6 hours)
2182 (36 hours)
1338 (22 hours)
2044 (34 hours)
Although long distance passengers now make up only 4% of those generated, with figures like these, virtually all of them will tolerate the sorts of journey times seen in the online game involving long stage coach journeys or sea voyages. Although there must be some passengers willing to travel for such a long time (according to the paper linked above, a journey from London to Manchester in 1750 would take 90 hours, and multiple day sea voyages were the norm), only a very small proportion of the passengers generated for long distance routes should be prepared to travel for this length of time: less than 1 in 100, probably. There should be substantial numbers of long distance passengers being generated with journey times of three, two and even one hour(s), with only a small number being prepared to travel for four or more hours.
Perhaps what I need is not a bell curve after all, but a decay function of some sort to make numbers at the shorter end of the range progressively more likely than numbers at the longer end of the range.
I should be very keen indeed to know from people here:
(1) whether this seems in principle a good idea (and more generally, whether anybody disagrees with my analysis above, and, if so, why); and
(2) any good suggestions for a formula to implement the decay method that I propose.

Your solution might be to introduce a concurrent process reducing passenger numbers based on time, similar to your car ownership (in fact it could be done with exactly that system). At those times cost was a prohibitive factor in stagecoach use. Using an "exclusive pedestrian" percentage of people who do not use public transport for their travels, you can reduce the number of users considerably.
A first guess for this would be assuming only the middle class would use stagecoaches. Education figures will effectively give this away. So anyone leaving school before the age of 14 could be considered too poor to pay for transport. This would result in 2% transport users in 1870, based on the Robbins Report (mentioned in the somewhat linked mail discussion (http://forum.simutrans.com/index.php?topic=10920.msg106105)). Try changing your caruser table to 98% before 1900 and see what happens :)
http://myweb.tiscali.co.uk/webbsredditch/Chapter%201/Travel%20in%2018thC.html
puts the cost of a 50 km trip in 1794 at 5 shillings outside, 9 sh inside.
A labourer's annual income of 10 to 20 pounds would be 16 to 32 sh a month. Here one also has to consider that in Europe farm hands typically got the majority if not all of their compensation as food and lodging. (Or was this different in England from the rest of Europe?)

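// Multiplies two uniform draws and scales the product back into the
// range, which skews the results towards the low end.
int biased_random(int max) {
	int random1 = simrand(max);
	int random2 = simrand(max);
	return (random1 * random2) / max;
}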
Or just bite the bullet and implement a normal distribution.

Sdog,
I have considered in the past using costing as a factor for transport, but rejected it on the ground that it would introduce undesirable complexities or, if the complexities were to be glossed over or simplified away, perverse results/incentives.
The above figures show that the journey time tolerance feature really does need to be modified in order to produce a better set of numbers, I think.
(Incidentally, the private car system could not be adapted as you imagine, as it involves checking whether the passenger has a private car, then checking whether the journey can be completed with a private car, then checking whether it can be completed with public transport, then comparing the merits of the two modes).
Kieron,
Hmm, I did implement more or less that mechanism:
/* Generates a random number on [0,max-1] interval with a normal distribution*/
#ifdef DEBUG_SIMRAND_CALLS
uint32 simrand_normal(const uint32 max, const char* caller)
#else
uint32 simrand_normal(const uint32 max, const char*)
#endif
{
	const uint32 half_max = max / 2;
#ifdef DEBUG_SIMRAND_CALLS
	return (simrand(half_max, caller) + simrand(half_max, "simrand_normal"));
#else
	return (simrand(half_max, "simrand_normal") + simrand(half_max, "simrand_normal"));
#endif
}
but I think what is needed is not a normal distribution at all, but a declining probability the higher that the numbers get: after all, the midpoint of 30 and 4320 minutes, the long distance range, is 2,175 minutes, or about 36 hours! Having this as the most common journey time tolerance for long distance journeys would not work. What I really need is to have, say, three or two and a half hours as the most common journey time tolerance for long distance passengers, with a significant minority getting tolerances between 30 minutes and that figure, and a dwindling minority getting journey time tolerances above that in the stratospheric ranges.

Note the multiplication rather than addition in the code I gave this morning: this should skew the distribution :)

Interesting! To what extent will it be skewed?

The mean value will be 1/4 of the maximum value. Though actually you are better off with (((rand(max)+1)*(rand(max)+1))/max) - 1 to avoid a large number of 0s being generated. If you multiply 3 random numbers from 1 to max then divide by max*max the mean will be at 1/8 and so on.
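To see how strongly the product skews the result, one can enumerate every equally likely pair of draws rather than sampling. The sketch below does that for (a * b) / max with a and b in [0, max); the struct and function names are hypothetical, and it assumes the generator returns values from 0 to max - 1.

```cpp
#include <cstdint>
#include <vector>

struct SkewSummary {
    std::uint32_t mean;         // average result, rounded down
    std::uint32_t most_common;  // result produced by the most pairs
};

// Enumerates all pairs (a, b) with a, b in [0, max), max >= 1, and
// summarises the distribution of (a * b) / max.
inline SkewSummary summarise_product(std::uint32_t max) {
    std::vector<std::uint64_t> counts(max, 0);
    std::uint64_t total = 0;
    for (std::uint32_t a = 0; a < max; ++a) {
        for (std::uint32_t b = 0; b < max; ++b) {
            const std::uint32_t value = static_cast<std::uint32_t>(
                (static_cast<std::uint64_t>(a) * b) / max);
            ++counts[value];
            total += value;
        }
    }
    std::uint32_t mode = 0;
    for (std::uint32_t v = 1; v < max; ++v) {
        if (counts[v] > counts[mode]) {
            mode = v;
        }
    }
    const std::uint64_t pairs = static_cast<std::uint64_t>(max) * max;
    return { static_cast<std::uint32_t>(total / pairs), mode };
}
```

For max = 100 the average comes out at roughly a quarter of the range, while the single most frequent result is 0, which is why adding 1 to each draw before multiplying is worth considering.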

I think what is needed is not a normal distribution at all, but a declining probability the higher that the numbers get
An exponential distribution is a natural one for waiting times. A fraction (1/e)^n would wait n times the average waiting time (or equivalently, a fraction (1/2)^n would wait n times the 'half-life').
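The half-life formulation lends itself to an integer-only sketch: pick a position within one half-life band, then keep moving out by one band for as long as a coin flip succeeds. The chance of getting at least n bands above the minimum is then (1/2)^n. Everything here is an assumption for illustration: `DemoRng` is a hypothetical stand-in for the game's generator, `sample_tolerance` is an invented name, and the half-life value is a free parameter that would need calibrating.

```cpp
#include <cstdint>

// Hypothetical stand-in for the game's deterministic generator.
class DemoRng {
    std::uint32_t state;
public:
    explicit DemoRng(std::uint32_t seed) : state(seed) {}
    // Returns a value in [0, max).
    std::uint32_t next(std::uint32_t max) {
        state = state * 1664525u + 1013904223u;
        return static_cast<std::uint32_t>(
            (static_cast<std::uint64_t>(state) * max) >> 32);
    }
};

// Draws a tolerance in [min_minutes, max_minutes] whose probability
// halves with every half_life minutes above the minimum.
inline std::uint32_t sample_tolerance(std::uint32_t min_minutes,
                                      std::uint32_t max_minutes,
                                      std::uint32_t half_life,
                                      DemoRng& rng) {
    if (half_life == 0) {
        return min_minutes;
    }
    // Uniform position within the first half-life band.
    std::uint32_t value = min_minutes + rng.next(half_life);
    // Each successful coin flip moves the draw one band further out.
    while (value + half_life <= max_minutes && rng.next(2) == 1) {
        value += half_life;
    }
    return value < max_minutes ? value : max_minutes;
}
```

With a minimum of 30, a maximum of 4320 and a half-life of 150 minutes, about half of all draws fall below three hours and only about one in sixteen exceeds ten and a half hours.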
Best wishes,
Matthew

An exponential distribution is a natural one for waiting times. A fraction (1/e)^n would wait n times the average waiting time (or equivalently, a fraction (1/2)^n would wait n times the 'half-life').
Best wishes,
Matthew
We are dealing with journey time tolerances here, not waiting times, unless that is what you meant?

Is there a fundamental difference between waiting times and journey time tolerances? (Not in their effect or implementation, but in the distribution of their length)

Yes: waiting time tolerances are not randomised and set at passenger generation time, but fixed and checked every 256 steps for all waiting passengers/mail/goods. The formula for calculating waiting times is as follows:
// Checks to see whether the freight has been waiting too long.
// If so, discard it.
if(tmp.get_besch()->get_speed_bonus() > 0)
{
// Only consider for discarding if the goods care about their timings.
// Goods/passengers' maximum waiting times are proportionate to the length of the journey.
const uint16 base_max_minutes = (welt->get_settings().get_passenger_max_wait() / tmp.get_besch()->get_speed_bonus()) * 10; // Minutes are recorded in tenths
halthandle_t h = haltestelle_t::get_halt(welt, tmp.get_zielpos(), besitzer_p);
uint16 journey_time = 65535;
path_explorer_t::get_catg_path_between(tmp.get_besch()->get_catg_index(), tmp.get_origin(), tmp.get_ziel(), journey_time, h);
const uint16 thrice_journey = journey_time * 3;
const uint16 min_minutes = base_max_minutes / 12;
const uint16 max_minutes = base_max_minutes < thrice_journey ? base_max_minutes : max(thrice_journey, min_minutes);
uint16 waiting_minutes = convoi_t::get_waiting_minutes(welt->get_zeit_ms() - tmp.arrival_time);
#ifdef DEBUG_SIMRAND_CALLS
if (talk && i == 2198)
dbg->message("haltestelle_t::step", "%u) check %u of %u minutes: %u %s to \"%s\"",
i, waiting_minutes, max_minutes, tmp.menge, tmp.get_besch()->get_name(), tmp.get_ziel()->get_name());
#endif
if(waiting_minutes > max_minutes)
{
#ifdef DEBUG_SIMRAND_CALLS
if (talk)
dbg->message("haltestelle_t::step", "%u) discard after %u of %u minutes: %u %s to \"%s\"",
i, waiting_minutes, max_minutes, tmp.menge, tmp.get_besch()->get_name(), tmp.get_ziel()->get_name());
#endif
// Waiting too long: discard
if(tmp.is_passenger())
{
// Passengers - use unhappy graph.
add_pax_unhappy(tmp.menge);
}
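The snippet above is cut off, but the way it arrives at max_minutes can be isolated as a small pure function. This is only a sketch: max_wait_tenths and its parameter names are made up for illustration, the result and journey_time are taken to be in tenths of minutes as the original comment suggests, and the example inputs below are not real pakset settings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the waiting limit computed in the snippet above.
uint16_t max_wait_tenths(uint16_t passenger_max_wait, uint16_t speed_bonus, uint16_t journey_time)
{
    assert(speed_bonus > 0); // the original only runs this check for a positive speed bonus
    const uint16_t base_max = static_cast<uint16_t>((passenger_max_wait / speed_bonus) * 10);
    // Like the original, this wraps around if journey_time * 3 exceeds 65535.
    const uint16_t thrice_journey = static_cast<uint16_t>(journey_time * 3);
    const uint16_t min_wait = base_max / 12;
    return base_max < thrice_journey ? base_max : std::max(thrice_journey, min_wait);
}
```

With a hypothetical maximum wait of 1200 and a speed bonus of 10, the base limit is 1200 tenths of a minute: a journey of 300 gives a limit of 900 (three times the journey), a journey of 20 gives the floor of 100 (a twelfth of the base), and a journey of 1000 is capped at the base of 1200.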

We are dealing with journey time tolerances here, not waiting times, unless that is what you meant?
Journey time tolerances in the game are an example of 'waiting times' in a stochastic sense, which was what I meant. (The actual waiting times in the game are of course mostly deterministic.) An exponential distribution results naturally if there is a constant probability per unit time that any given individual will stop waiting.
Best wishes,
Matthew

Journey time tolerances in the game are an example of 'waiting times' in a stochastic sense, which was what I meant. (The actual waiting times in the game are of course mostly deterministic.) An exponential distribution results naturally if there is a constant probability per unit time that any given individual will stop waiting.
Best wishes,
Matthew
Ahh, I see what you mean. This doesn't work for journey time tolerances, however, where the journey time and tolerance must be calculated fully in advance.
Edit: Incidentally, was this what you meant with your original post, or was that different? I am afraid that I am no mathematician...

This doesn't work for journey time tolerances, however, where the journey time and tolerance must be calculated fully in advance.
But the tolerance can be calculated in advance, by drawing from the exponential distribution, as I said in my first post in the thread. An equal probability of giving up per unit time is a justification or motivation for using the distribution, not necessarily the way to do the calculation.
Best wishes,
Matthew

Ahh, I wasn't aware of that. I'm afraid that your understanding of mathematics is rather in advance of mine; would you mind explaining that in a little more detail such that a mathematically challenged mind such as mine might understand it? Thank you very much for your input; it is most appreciated.

would you mind explaining that in a little more detail such that a mathematically challenged mind such as mine might understand it?
The general algorithm is presumably of the form:
1. Generate a possible trip.
2. Based on the distance to be travelled and any other relevant parameters (the game year, the social status of the passenger, ...) determine the typical journey time tolerance T for that trip.
3. Draw the actual journey time tolerance t from a random distribution parameterised by T.
4. If t is larger than the expected journey time, make the trip, otherwise don't bother.
One straightforward way to implement an exponential distribution with mean time T for step 3 is:
1. Generate a uniform random number x in the interval (0,1].
2. Calculate t = -T ln x.
This uses floating-point arithmetic. If you want to use only integers, but are happy with an approximate stepped distribution, then:
1. Generate a uniform random integer x in the interval [0,2^{N}).
2. Find the number n of leading 1s in the binary representation of x; i.e. x consists of n 1s, a 0, and a remainder of N-n-1 other bits. Call the remainder y.
3. Calculate t = (n + y/2^{N-n})T.
In this case T is the median rather than the mean.
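Both recipes can be sketched in a few lines of C++. The function names are hypothetical, N is fixed at 16 bits for the example, and the uniform random input is passed in rather than generated so that the behaviour is easy to check. One assumption to flag: in the integer version the fraction is taken as y / 2^(N-n-1), the full width of the remainder, so that it spans [0,1); this is one reading of the description and not a literal transcription of it.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Floating-point version: x is uniform on (0, 1], T is the mean tolerance.
double exponential_tolerance(double x, double T)
{
    assert(x > 0.0 && x <= 1.0);
    return -T * std::log(x);
}

// Integer version for N = 16 random bits: n leading 1s give n whole multiples
// of T; the bits after the first 0 interpolate linearly within the step.
// Assumption: the fraction is y / 2^(N-n-1), so that it spans [0, 1).
uint32_t stepped_tolerance(uint16_t x, uint32_t T)
{
    const unsigned N = 16;
    assert(T <= 0xFFFFu); // keeps every product below 2^32
    unsigned n = 0;
    while (n < N && (x & (1u << (N - 1 - n)))) {
        ++n;
    }
    if (n >= N - 1) {
        return n * T; // no remainder bits left to interpolate with
    }
    const unsigned rem_bits = N - n - 1;
    const uint32_t y = x & ((1u << rem_bits) - 1u);
    return n * T + ((y * T) >> rem_bits);
}
```

Half of all 16-bit inputs start with a 1 bit and so return at least T, which is why T acts as the median in the integer version.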
Best wishes,
Matthew

Matthew,
thank you for your suggestion; I shall bear that in mind.
In terms of the more general issue of calibration, I have looked further into the code this evening, and discovered why some buildings are reported at level 0, despite being set to a higher level in the .dat files: makeobj subtracts 1 from the level of each building at the time of the compilation of a pakset. It is not entirely clear why this is done, but it is at least a consistent, if somewhat confusing, mechanism.

On the issue of calibration of the number of people in a town to the number of buildings to the density of the town/those buildings, the basic formula in the game for it can be found in simcity.h here:
/**
* ermittelt die Einwohnerzahl der Stadt
* "determines the population of the city"
* @author Hj. Malthaner
*/
sint32 get_einwohner() const {return (buildings.get_sum_weight()*6)+((2*bev-arb-won)>>1);}
get_einwohner() ("get population") is the method that returns the city's population as shown in the city information windows.
"buildings" is a weighted vector of buildings in the city, the weights being their levels assigned in the .dat files (with 1 added to the level to prevent there being zeros). Two buildings with a level set in the .dat file of 1 (displayed as 0) would therefore give a sum_weight of 2; two buildings, one with a level of 0/1 and one with a level of 1/2 would give a sum weight of 3, and so forth.
As to "bev", "arb" and "won", they are defined in comments in simcity.h as follows:
// population statistics
sint32 bev; // total population
sint32 arb; // amount with jobs
sint32 won; // amount with homes
The "bev" is particularly cryptic, as it purports itself to be the measure of population, but is used in the method used to return the population as part of a formula including other things. Only a little light is thrown on it by the following:
uint32 get_buildings() const { return buildings.get_count(); }
sint32 get_unemployed() const { return bev - arb; }
sint32 get_homeless() const { return bev - won; }
The basic calculation seems to be: 6 times the sum of the weight of all city buildings plus the "bev" measure of population less "unemployed" and "homeless" (the *2 and >>1 seem to cancel each other out, and are probably used to avoid rounding errors).
The "bev" value is mainly incremented in units of 1 (aside from the special buttons reserved to the public player to increase or decrease this by 100) in the step_bau() method in simcity.cc. For existing towns, the formula is:
// since we use internally a finer value ...
const int growth_step = (wachstum >> 4);
wachstum &= 0x0F;
// Hajo: let city grow in steps of 1
// @author prissi: No growth without development
for (int n = 0; n < growth_step; n++) {
bev++; // Hajo: bevoelkerung wachsen lassen
for (int i = 0; i < 30 && bev * 2 > won + arb + 100; i++) {
baue(false);
}
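The handling of "wachstum" in the snippet above is a simple fixed-point scheme: the value is kept in sixteenths, the whole part is consumed as growth steps and the low four bits are carried over. A minimal sketch, in which growth_result and split_growth are illustrative names rather than Simutrans code:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative struct, not a Simutrans type.
struct growth_result {
    int32_t steps;     // whole units by which bev grows in this step
    int32_t remainder; // sixteenths carried over in wachstum
};

// Splits the accumulated growth value in the same way as the snippet:
// the upper bits are whole steps, the low four bits are kept for later.
growth_result split_growth(int32_t wachstum)
{
    assert(wachstum >= 0); // negative growth is outside this sketch
    growth_result r;
    r.steps = wachstum >> 4;
    r.remainder = wachstum & 0x0F;
    return r;
}
```

A wachstum of 37, for example, yields 2 growth steps with 5 sixteenths left over.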
"wachstum" itself is set in the calc_growth() method, and is positive when passengers, mail, goods or electricity are supplied to the town (and greater in amount the greater proportion of these things that are transported).
For new towns, the formula is this:
bool new_town = (bev == 0);
if (new_town) {
bev = (wachstum >> 4);
bool need_building = true;
uint32 buildings_count = buildings.get_count();
uint32 try_nr = 0;
while (need_building && try_nr < 1000) {
baue(false); // it update won
if ( buildings_count != buildings.get_count() ) {
if(buildings[buildings_count]->get_haustyp() == gebaeude_t::wohnung) {
need_building = false;
}
}
try_nr++;
buildings_count = buildings.get_count();
}
bev = 0;
}
The elusive "arb" and "won" figures, meanwhile, are set in the baue_gebaude method as follows:
if (sum_gewerbe > sum_industrie && sum_gewerbe > sum_wohnung) {
h = hausbauer_t::get_gewerbe(0, current_month, cl, new_town);
if (h != NULL) {
arb += h->get_level() * 20;
}
}
if (h == NULL && sum_industrie > sum_gewerbe && sum_industrie > sum_wohnung) {
h = hausbauer_t::get_industrie(0, current_month, cl, new_town);
if (h != NULL) {
arb += h->get_level() * 20;
}
}
if (h == NULL && sum_wohnung > sum_industrie && sum_wohnung > sum_gewerbe) {
h = hausbauer_t::get_wohnhaus(0, current_month, cl, new_town);
if (h != NULL) {
// will be aligned next to a street
won += h->get_level() * 10;
}
}
The odd thing to notice here is that there is no addition of 1 to the get_level method, so buildings with a level of 0 will not affect these figures. I am not sure whether this is intended, and suspect that it is not: I have posted a bug report (http://forum.simutrans.com/index.php?topic=11123.new#new) about it.
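The effect of the missing +1 can be made concrete with a small sketch. The multipliers 20 and 10 come from the snippet above; the add_one flag is a hypothetical switch standing for the suspected fix, not something that exists in the code.

```cpp
#include <cassert>
#include <cstdint>

// Jobs added to arb by a commercial or industrial building of the given
// (displayed) level; add_one models the suspected fix.
int32_t jobs_added(int32_t level, bool add_one)
{
    assert(level >= 0);
    return (level + (add_one ? 1 : 0)) * 20;
}

// Homes added to won by a residential building of the given (displayed) level.
int32_t homes_added(int32_t level, bool add_one)
{
    assert(level >= 0);
    return (level + (add_one ? 1 : 0)) * 10;
}
```

As the code stands, a displayed level 0 house adds 0 to won; with the +1 it would add 10, and a displayed level 1 shop would add 40 to arb instead of 20.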
At present, I remain somewhat confused about the relationship between bev, arb, won, the number and level of buildings in a city and the reported population figures, and how this system is intended to work. Any assistance in unpicking this, so that I might work out how population densities are actually calculated (and therefore how to calibrate them), would be much appreciated.
Meanwhile, I have found a useful web page (http://www.demographia.com/dbintluaarea2000.htm) on urban population densities in the largest 10 UK cities.

On the issue of calibration of the number of people in a town to the number of buildings to the density of the town/those buildings, the basic formula in the game for it can be found in simcity.h here:
sint32 get_einwohner() const {return (buildings.get_sum_weight()*6)+((2*bev-arb-won)>>1);}
I suspect that bev may be something like the number of households, rather than the number of individuals. Householders with a home and a job have a family, contributing 6 (or maybe that is 3, depending on the details of get_sum_weight()) to the total population. Those that are unemployed and homeless contribute only 1 (themselves) to the total. (Note that 2*bev-arb-won could be written as get_unemployed()+get_homeless().)
Best wishes,
Matthew

thank you for your suggestion; I shall bear that in mind.
Here's some real code:
#include <limits>
const unsigned int uibits = std::numeric_limits<unsigned int>::digits;
const unsigned int lgbits = uibits-2;
const unsigned int lgbit = 1<<lgbits;
const unsigned int hibit = lgbit<<1;
unsigned int scaled_lg(unsigned int x, unsigned int T=1) {
    if (x==0) return uibits*T;
    unsigned int lg = 0;
    //Find the first significant digit: each leading zero adds T
    while ((x & hibit) == 0) {
        lg += T;
        x <<= 1;
    }
    //Make space for overflow
    x >>= 1;
    //Find leading bits in mantissa by repeated squaring
    for (int j = 1; j<lgbits/2; ++j) {
        if ((x ^ lgbit) == 0 || T == 0) return lg;
        x >>= lgbits/2;
        x *= x;
        unsigned int round = T & 1;
        T >>= 1;
        if (x & hibit) {
            x >>= 1;
            lg -= T+round;
        }
    }
    //Interpolate remaining bits linearly
    while (T != 0) {
        if (x & lgbit) lg -= T;
        T >>= 1;
        x <<= 1;
    }
    return lg;
}
If you call this with a uniform random integer x, it will return an exponentially distributed one with median T.
Best wishes,
Matthew

Matthew,
thank you for both replies. The formula I shall look into when I reach that part of the exercise: I shall concentrate first on getting the right relationship between buildings, population, density and base passenger generation, I think.
I am beginning to suspect that "arb" and "won" refer to the number of workplaces and homes respectively: "arb", I think, is short for the German "Arbeit" meaning "work", and "won" for the German "wohnen" meaning "live". "arb", then, I think, refers to the number of places of employment, and "won" to the number of homes. Quite how this relates to "bev" I am still not entirely sure.

I am beginning to suspect that "arb" and "won" refer to the number of workplaces and homes respectively: "arb", I think, is short for the German "Arbeit" meaning "work", and "won" for the German "wohnen" meaning "live". "arb", then, I think, refers to the number of places of employment, and "won" to the number of homes.
I thought this was already quite clear from the comments in the code you posted, though arb is the number of jobs rather than of workplaces:
// population statistics
sint32 bev; // total population
sint32 arb; // amount with jobs
sint32 won; // amount with homes
The comment on bev is clearly misleading, but those on arb and won make sense. If bev is really the number of households, then 'full employment' (when get_unemployed() is zero) corresponds to one job per household, and 'sufficient housing' (when get_homeless() is zero) to one home per household.
Best wishes,
Matthew

I am not sure that that is correct about "bev", as this formula:
sint32 get_einwohner() const {return (buildings.get_sum_weight()*6)+((2*bev-arb-won)>>1);}
has the effect that "arb" and "won" are accounted for indirectly through buildings (the buildings.get_sum_weight()*6), which is why they are subtracted from "bev", and "bev" is added to the buildings.get_sum_weight()*6 because it (less arb and won) is some sort of remainder value, perhaps representing people living/working beyond the designed capacity of the buildings. Oddly, the consequence of the way that the passenger generation works is that these people will not create any passenger traffic, as passenger traffic is generated only by buildings.

I am not sure that that is correct about "bev", as this formula:
sint32 get_einwohner() const {return (buildings.get_sum_weight()*6)+((2*bev-arb-won)>>1);}
has the effect that "arb" and "won" are accounted for indirectly through buildings (the buildings.get_sum_weight()*6), which is why they are subtracted from "bev", and "bev" is added to the buildings.get_sum_weight()*6 because it (less arb and won) is some sort of remainder value, perhaps representing people living/working beyond the designed capacity of the buildings.
Sort of. Let me restate in a slightly different way what I said a couple of posts back:
buildings.get_sum_weight() is something like the number of people (or more precisely, if my interpretation is correct, the number of householders) with a home and a job; it should be equivalent to some_weighting_factor*arb + some_other_weighting_factor*won. Each of them apparently contributes not 1 but 6 to the total population (i.e. each has an average of 5 dependents).
If the normalisation were fully consistent, some_weighting_factor+some_other_weighting_factor would be equal to 1, but I doubt that is actually the case. If it isn't, then the apparent 6 inhabitants per household is actually more or less by whatever factor is required to fix the normalisation.
(2*bev-arb-won)>>1, or equivalently (get_unemployed()+get_homeless())/2, is the number of remaining householders who have no home or job. Such people only contribute 1 to the total population; evidently they are presumed to have no dependents.
Best wishes,
Matthew

Hmm, I don't think that that's right, because buildings.get_sum_weight() returns the total number of buildings multiplied by the level* of those buildings. Those buildings include commercial and industrial buildings as well as residential buildings.
* As noted above, what the "level" is gets complicated. 1 is subtracted from the value of the "level" set in the .dat file, and added back in in some places but not others. It is added when the weights of buildings are set in the "buildings" vector, so something with a level of 1 in the .dat file would be 1 here, rather than 0. This means that all buildings count for this purpose.
Let me try running a worked example to see how these various things pan out. Suppose that we have an imaginary city (hamlet) with two buildings: one residential building of level 1, and one commercial building of level 2. (I am referring here to the "level" as set in the .dat file, not as it appears; as will be seen, for some purposes, 1 is subtracted from this number).
The residential building will actually contribute nothing to "won", since it is recorded as being level 0, so "won" will be 0. The commercial building will contribute its level * 20 to "arb", so "arb" will be 20 and "won" will be 0. We do not know what "bev" will be, since "bev" determines (indirectly) rather than is determined by the number of buildings. I will assume "bev" to be 20 for now. Running the formula, we would get:
buildings.get_sum_weight() = 3
3 * 6 = 18
bev - arb - won = 0
0 + 18 = 18
If I am right in my suspicions that the failure to add 1 to the weight of buildings when setting arb and won is a bug, then arb would end up being 40 and won would end up being 10, and the correct result would be:
buildings.get_sum_weight() = 3
3 * 6 = 18
bev - arb - won = -30
-30 + 18 = -12
That would give a negative number, which is almost certainly not intended; however "bev" was an assumed value. If we assume that "bev" is 50 instead of 20, we get:
buildings.get_sum_weight() = 3
3 * 6 = 18
bev - arb - won = 0
0 + 18 = 18
In either case, "bev" even unadjusted does not equal buildings.get_sum_weight()*6, although it is closer in the first instance.
In any event, the general intention is for the level of each building to be multiplied by 6 to get the base population: the ((2*bev-arb-won)>>1) formula appears to be only a minor adjustment. We can probably, for the purposes of the calibration of the passenger factor, ignore that part and concentrate on the basic fact that, in Simutrans, population approximately equals the number of buildings multiplied by the (unadjusted) level of each building multiplied by 6.
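For anyone wanting to check figures against the code, here is a minimal sketch that evaluates the get_einwohner() expression literally. Note that the factor of 2 applies to bev only, so the second term equals (get_unemployed() + get_homeless()) / 2 and not bev - arb - won; the totals it gives therefore differ from the simplified sums in the worked example above. The function name is illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Evaluates the get_einwohner() expression literally. sum_weight is the sum
// of (displayed level + 1) over all city buildings.
int32_t population(int32_t sum_weight, int32_t bev, int32_t arb, int32_t won)
{
    assert(2 * bev >= arb + won); // this sketch avoids right-shifting a negative value
    return sum_weight * 6 + ((2 * bev - arb - won) >> 1);
}
```

For the example hamlet (sum weight 3, bev 20, arb 20, won 0) this gives 18 + 10 = 28; with bev 50, arb 40 and won 10 it gives 18 + 25 = 43.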
The formula for generating passengers/mail is thus:
// prissi: since now backtravels occur, we damp the numbers a little
const int num_pax =
(wtyp == warenbauer_t::passagiere) ?
(gb->get_tile()->get_besch()->get_level() + 6) >> 2 :
(gb->get_tile()->get_besch()->get_post_level() + 8) >> 3 ;
This is not the full picture, however, because non-local passengers (that is, passengers going somewhere other than in their own city) will also generate a return trip. This makes the number of passengers generated by each building somewhat non-constant in proportion to other factors, since altering the proportion of passengers to increase local trips will decrease the number of returns, and therefore reduce the overall number of passengers generated. This might need looking into (perhaps a change in the code so that all passengers return).
In any event, the basic formula is that 3/4 of all packets are passengers, and the number of passengers in the packet is determined on the formula (building level (unadjusted) + 6) / 4, rounded down. Both our level 1 and level 2 buildings in the above example would thus produce 1 passenger per packet.
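The packet size arithmetic is easy to tabulate. The sketch below restates the two shifts from the code as free functions; the function names are illustrative, and the level passed in is the one held in the building description (that is, after makeobj has subtracted 1).

```cpp
#include <cassert>
#include <cstdint>

// Passengers per packet for a building of the given level: (level + 6) / 4.
int32_t passengers_per_packet(int32_t level)
{
    assert(level >= 0);
    return (level + 6) >> 2;
}

// Bags of mail per packet for a building of the given mail level: (level + 8) / 8.
int32_t mail_per_packet(int32_t post_level)
{
    assert(post_level >= 0);
    return (post_level + 8) >> 3;
}
```

Levels 0 and 1 both give 1 passenger per packet, levels 2 to 5 give 2 and levels 6 to 9 give 3; for mail, levels 0 to 7 give 1 and levels 8 to 15 give 2.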
The next important piece in the jigsaw puzzle for passenger generation rate is the number of times per game month that the code for generating passengers is called. As can be seen in the spreadsheet attached to the opening post of this thread, for a passenger factor of 8 and a bits per month setting of 18, each building in a town is stepped once per game "month". For a passenger factor of 8 and a bits per month setting of 21 (yielding 6.4 hours per month), this entails the stepping of each city building 0.9375 times per month.
Remembering that the average person makes 1,100 trips per year, we need to find a formulation that properly encapsulates the correct relationship between this and the passenger generation figures. It is necessary, however, to use a figure of greater than 1,100 as a base, as account must be taken of the fact that not all generated passenger packets in Simutrans will actually make journeys, even in a well-connected game, as there is the journey time tolerance to consider. I shall aim for 1,350 as the base figure.
First, the yearly figure must be translated to an hourly figure. For reasons discussed elsewhere, I use a figure for "active hours" in a day as being 16 (24 - 8 hours' sleep). 365.25 * 16 = 5,844 active hours per year. 1,350 / 5,844 = 0.231. Each unit of population should therefore generate 0.231 passengers per hour.
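The conversion from yearly to hourly trips can be written as a one-line helper, which makes it easy to try other base figures. The function name is made up for this sketch; the 365.25 days per year and 16 active hours per day are the assumptions stated above.

```cpp
#include <cassert>

// Trips per person per active hour, given trips per year and active hours per day.
double trips_per_active_hour(double trips_per_year, double active_hours_per_day)
{
    assert(active_hours_per_day > 0.0);
    const double active_hours_per_year = 365.25 * active_hours_per_day;
    return trips_per_year / active_hours_per_year;
}
```

Calling trips_per_active_hour(1350.0, 16.0) gives approximately 0.231.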
Because of the inconstancy in the treatment of return journeys and the mapping of levels to passenger generation discussed above, the current code does not produce a stable number of potential passenger journeys per unit of population. If we assume for the moment (density will be covered later) that all city buildings are of level 1 and 2/3rds of passengers will generate return trips, then we can make some approximate calculations.
It turns out that a passenger factor of 2 produces 0.234 units of passengers per hour for each level 1 building. Each level 1 building will register a population of 6; however, approximately 2/3rds of journeys are return journeys. Multiplying 6 by 2 we get 12, which we then have to multiply by 2/3rds, which yields 8. If we changed the formula to make all journeys result in returns (and, after all, how often do people really go anywhere without also coming back eventually?), that number would be 4.
So, for the moment, 8 is the ideal passenger factor; 4 would be the ideal passenger factor if all journeys were return journeys.
However, as discussed in the opening post, there is a non-constant relationship between level and the number of generated passengers because of the odd formula used. This will produce inconsistent results. A change of formula is needed here. Suppose for a moment we were to simplify the arrangements above, and, instead of adding six and dividing by four, we just added 1 (to compensate for subtracting one in makeobj, so that a level 1 building was treated as being of true level 1). What would that produce?
For a level 1 building, this would not change anything, as it happens, as 1 + 6 = 7; 7 / 4 = 1.75, but, because only integers are used here, not floating point numbers, this would be rounded down to 1 in the code in any event. What it would do, however, is create a linear relationship between the level of buildings and the number of passengers generated, such that a level 2 building would generate 2x the number of passengers of a level 1 building, a level 3 building 3x as many and so forth. This, I think, would be more satisfactory, although it would mean an increase in the number of base passengers generated for any given passenger factor from previous versions. Nonetheless, this sort of precise relationship is necessary for the purposes of accurate density calculations.
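The proposed replacement is simple enough to state as code. This is only a sketch of the suggestion, with an invented function name; the level passed in is the stored level, i.e. the .dat level less the 1 that makeobj subtracts.

```cpp
#include <cassert>
#include <cstdint>

// Proposed linear rule: passengers per packet = stored level + 1,
// which equals the level as written in the .dat file.
int32_t proposed_passengers_per_packet(int32_t level)
{
    assert(level >= 0);
    return level + 1;
}
```

A building set to level 1 in its .dat file (stored as 0) would still produce 1 passenger per packet, one set to level 2 would produce 2, one set to level 3 would produce 3, and so on.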
Turning, then, to density, the one point that is immediately apparent is that, because nothing on population density has been changed so far between Standard and Experimental, there is no adjustment in population density based on the distance scale. This is unfortunate and needs to be remedied.
I will start with the assumption that each tile is 125m x 125m (as will be the default in the next version of Pak128.BritainEx). That means that each tile is 15,625 square meters or 0.015625 square kilometres. The average population density per square kilometre for large urban areas in the UK is 4,100 (source (http://www.demographia.com/dbintluaarea2000.htm)). 4,100 * 0.015625 = 64.0625; each city tile ought therefore account for a population of about 64, at least in a dense city. This figure holds for even smaller urban areas in the South East of England, such as Slough (http://en.wikipedia.org/wiki/Slough). Smaller towns (and towns earlier in history) would have a lower population density, but I cannot currently find reliable figures for this.
In a Simutrans city, it seems to be a reasonable assumption that 50% of the land area is covered with city buildings, the other 50% being taken by roads, railways, stations, open spaces, etc. This means that the above figure needs to be doubled for buildings, to 128 head of population per tile, on average, for a dense town. If the current system prevails, that would mean an average building level of 21.33 for each tile in a densely populated town. (For reference, at 250m/tile, 0.0625 km^2 / tile, there would need to be 256.25 head of population per tile, or 512.5 head of population per built tile, giving an average level of 85.42).
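These density figures can be reproduced with two small helpers, which may be useful for trying other tile scales or densities. The names are invented for this sketch; the 50% built fraction and the 6 people per level are the assumptions stated above.

```cpp
#include <cassert>

// Head of population that one tile must represent at a given density.
double population_per_tile(double people_per_km2, double meters_per_tile)
{
    assert(meters_per_tile > 0.0);
    const double tile_km = meters_per_tile / 1000.0;
    return people_per_km2 * tile_km * tile_km;
}

// Average building level needed if only built_fraction of tiles carry
// buildings and each level is worth people_per_level inhabitants.
double required_level(double per_tile, double built_fraction, double people_per_level)
{
    assert(built_fraction > 0.0 && people_per_level > 0.0);
    return per_tile / built_fraction / people_per_level;
}
```

At 4,100 people per square kilometre and 125m/tile this gives 64.0625 per tile and an average level of about 21.35 without intermediate rounding (21.33 if the per-tile figure is first rounded to 64); at 250m/tile it gives 256.25 per tile and a level of about 85.42.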
These levels are considerably higher than are customary in Simutrans, so some consideration is needed of whether to adopt an alternative formulation in the code. Low density areas, which should have perhaps 25-100 dwellings per square kilometre (source (http://www.teagasc.ie/research/reports/ruraldevelopment/5164/eopr5164.pdf)) are best represented by level "1" buildings in Simutrans. Assuming a figure of 50 dwellings per square kilometre and an average population of 3 persons per dwelling in this low density state of affairs, this will produce 150 head of population per square kilometre for level 1 buildings, or 2.34375 head of population per tile for level 1 buildings at 125m/tile. Taking into account the less than 100% utilisation of land for buildings, the code ought to produce 3-4 units of population per level 1 building tile at 125m/tile (suggesting that much higher levels of buildings really are needed, and that there is far too little variation of population density in Simutrans cities: in Pak128.Britain, for example, a medium rise tower block is level 3 whereas a single family home is level 1, yet the tower block can contain far, far more people than 3 family homes). This would double to 8 at 250 meters, 16 at 500 meters and 32 at 1km per tile.
The base formula for population, therefore, ought to be: (buildings.get_sum_weight() * welt->get_settings().get_meters_per_tile()) / 31, subject to the adjustment with bev, arb and won (if, on further consideration, this is still necessary).
This will then require the reconsideration of the passenger numbers calculations above, as they will be out of step with the revised population figures in light of this density calibration. Firstly, the passenger factor will need adjusting to take into account the meters per tile setting: the higher the meters per tile setting, the greater that the passenger factor has to be. Starting with the correct value for 125m/tile, a passenger factor of 2 gets 0.234 passengers per hour per level 1 building. At 125m/tile, a level 1 building would produce 125/31 = 4 head of population, giving a figure of 8 as the ideal passenger factor. At 250m/tile, however, that ideal passenger factor rises to 16; at 500m/tile to 32 and at 1km/tile to 64. On balance, I think that it is best not to incorporate this change directly into the code, but rather add to the comments in simuconf.tab the explanation of these various passenger factor numbers and let people who adjust the tile density set the figures themselves.
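A short sketch of the proposed formula and the passenger factors that follow from it. Both function names are invented, integer division is used as it would be in the game code, and the rule "factor 2 per head of population in a level 1 building" is the assumption taken from the paragraph above.

```cpp
#include <cassert>
#include <cstdint>

// Proposed base population: (sum of building weights * meters per tile) / 31.
int32_t proposed_population(int32_t sum_weight, int32_t meters_per_tile)
{
    assert(sum_weight >= 0 && meters_per_tile > 0);
    return (sum_weight * meters_per_tile) / 31;
}

// Ideal passenger factor, assuming that a factor of 2 suits one head of
// population per level 1 building (weight 1).
int32_t ideal_passenger_factor(int32_t meters_per_tile)
{
    return 2 * proposed_population(1, meters_per_tile);
}
```

This reproduces the figures above: 4 head of population and a factor of 8 at 125m/tile, then factors of 16, 32 and 64 at 250m, 500m and 1km per tile.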
I should further note that these ideal figures will again need reconsideration if the system of mail generation is to change, which will have to be the subject of a different analysis later in time, as I am already late for Christmas shopping.
Additionally, further consideration will have to be given to whether the town growth formula as it currently stands is compatible with the more realistic population model proposed here, and will work with the greater variations in density.
All of these calculations suggest that, as previously suspected, the population density and the base numbers of passengers generated with the passenger factor currently in use in Pak128.Britain0.8.4 (currently in use on the BridgewaterBrunel server) are both too low. On the face of it, that is at odds with the apparently excessive numbers of passengers being generated, but this appears to be explained by the problems with the formulation of the journey time tolerance code: in the past, far fewer people travelled because journeys took much longer. Long distance journeys in particular need to be curbed by this method.
One major change needed as a result of this is to pakset design, greatly to increase the relative level of high density buildings to low density buildings, on the basis that a level 1 building represents a single household (or, perhaps, in commercial terms, a single small shop or micro-office), and that these must scale in a linear fashion as the density represented increases.
This also raises the possibility that, in urban areas at least, there might not be enough room for a transport network that can realistically carry as many people as are actually generated because of the foreshortened relationship between actual buildings/tiles of road and the size that they represent. This is probably more of a problem with higher numbers of meters per tile. According to this (http://www.american.edu/spa/publicpurpose/upload/Kane_12.pdf) source, the mean number of 'bus stops per square mile in one US city is 76.158, the maximum being 466. 1 mile approx. equals 1.6km. On a scale of 125m/tile, one can fit 163.84 'bus stops into a square mile (in extreme, assuming nothing but 'bus stops; a sensible figure would be much lower); at 250m/tile, that is reduced to only 40.96; at 500m/tile to 10.24 and at 1km/tile to 2.56. Even with to-scale coverage areas, this will not assist much, as the actual number of 'buses able to depart from a single 'bus stop will be the same.
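The tiles-per-square-mile figures follow from squaring the number of tiles along one mile. A minimal sketch, with an invented function name and the same 1 mile = 1.6km approximation:

```cpp
#include <cassert>

// Number of tiles (and hence the upper bound on one-tile 'bus stops)
// that fit into a square mile at the given scale.
double tiles_per_square_mile(double meters_per_tile)
{
    assert(meters_per_tile > 0.0);
    const double tiles_per_mile = 1600.0 / meters_per_tile; // 1 mile taken as 1.6km
    return tiles_per_mile * tiles_per_mile;
}
```

This gives 163.84 at 125m/tile, 40.96 at 250m/tile, 10.24 at 500m/tile and 2.56 at 1km/tile.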
This, in turn, suggests that, at higher values of meters per tile, it becomes increasingly impossible to simulate realistic patterns of urban density and local transport, making the move in Pak128.BritainEx from 250m/tile to 125m/tile of particular significance. This ought not affect long distance transport, however, as fitting in enough infrastructure in that case is far less significant because of the much lower number of passengers that will use it if this is properly calibrated, and the much higher relative amount of space that it is able to occupy.
Edit: One possible way of dealing with this, actually, would be to remove from simulation all of the very short passenger journeys, and define "very short" differently depending on the tile scale. According to this (http://www.sustrans.org.uk/assets/files/connect2/guidelines%2016.pdf) publication (see page 2 for the pie chart), 22% of all trips are under 1 mile, and a further 19% of trips are between 1 and 2 miles.
In SimutransExperimental, we could work on the basis that we simply do not simulate trips of under 1 mile (1.6km) at all, and do not simulate trips of under 2 miles (3.2km) where the meters per tile setting is above perhaps either 250m or 500m (further consideration would be needed of that threshold).
We could then adjust the figures by reducing by either 22% or 22+19 = 41% the total number of annual trips per person (from a nominal 1,350 to a nominal 1,053 for lower values of meters per tile or to a nominal 796.5 for higher values of meters per tile), and recalibrating the local/midrange/long distance passengers as follows:
Low values of meters per tile
Local: 62 (79%)
Midrange: 14 (18%)
Long distance: 2 (3%)
High values of meters per tile
Local: 43 (73%)
Midrange: 14 (24%)
Long distance: 2 (3%)
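The arithmetic behind these figures can be checked with the sketch below. It assumes that the excluded short trips are simply removed from the nominal 1,350 and that each band's percentage is its share of the trips that remain (78 or 59 out of every original 100); the function names are invented.

```cpp
#include <cassert>

// Annual trips left after excluding a percentage of very short trips.
double remaining_trips(double base_trips, double excluded_percent)
{
    assert(excluded_percent >= 0.0 && excluded_percent <= 100.0);
    return base_trips * (100.0 - excluded_percent) / 100.0;
}

// Share of one distance band, as a percentage of the trips that remain.
double share_percent(double band, double total)
{
    assert(total > 0.0);
    return 100.0 * band / total;
}
```

Excluding 22% of 1,350 trips leaves 1,053 and excluding 41% leaves 796.5. The shares come out at roughly 79%, 18% and 3% for 62, 14 and 2 out of 78, and at roughly 73%, 24% and 3% for 43, 14 and 2 out of 59.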
In practice, I also add an overlap between the distance ranges of midrange and long distance, reduce the midrange percentage and increase the long-distance percentage, so an adjusted set of figures for Pak128.BritainEx 0.9.0 might look like this:
Adjusted figures for low values of meters per tile
Local: 79%
Midrange: 16%
Long distance: 5%
I should be interested in views on whether and to what extent this makes sense in the simulation context.
This will form the basis of a test passenger calibration branch of my Github repository to look into these ideas when I have a chance. I shall also produce a test branch of my Pak128.BritainEx Github repository to model different building densities.
In the meantime, I should be very interested in any feedback on these discussions, particularly if anyone has spotted any errors in my formulae.

okay, after finally getting around to reading it, I do think the move to 125m/tile would be good.
However, if you move to 125m/tile, then it should be possible to simulate trips under 1 mile with station coverage of 3. With 3, you have a 875m square for each bus stop and 1000m square for two connected bus stops. The distance between each stop, if left seamless, would be approximately 1000m.
If you move up to station coverage 4, then you get a 1125m square or 1250m square and you could still include trips under 1.6km.
These distances are still twice the length of the average bus stop distance of 500m between each stop. They are, however, within the longer distances of 1km between each stop.
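The catchment sizes quoted above can be reproduced with a little arithmetic. This is only a sketch: it assumes that a stop covers its own tile plus the coverage radius on every side, and that "two connected bus stops" means two stops on adjacent tiles. Both function names are made up for the example.

```cpp
#include <cstdint>

// Side length, in metres, of the square served by one stop:
// the stop's own tile plus `coverage` tiles on each side.
std::uint32_t catchment_side_metres(std::uint32_t coverage, std::uint32_t metres_per_tile)
{
    return (2 * coverage + 1) * metres_per_tile;
}

// Two stops on adjacent tiles (assumed meaning of "connected")
// extend the covered span by one further tile.
std::uint32_t paired_catchment_side_metres(std::uint32_t coverage, std::uint32_t metres_per_tile)
{
    return (2 * coverage + 2) * metres_per_tile;
}
```

At 125m/tile this gives 875m and 1000m for coverage 3, and 1125m and 1250m for coverage 4.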

Yes, I think that you're right that we can't leave out the smaller distance journeys. Do you think that we should halve the number of journeys under 1 mile to compensate for having less infrastructure density because of the scaling, or leave things with fully real figures?

seeing as the bus stops might get overwhelmed, I would compensate, or not simulate under 900m at all.

seeing as the bus stops might get overwhelmed, I would compensate, or not simulate under 900m at all.
The statistics that I have don't distinguish between trips of 900m and trips of under 1.6km (1 mile). I'd have to redo all the calculations, removing half the passengers from the under 1 mile category and thus adjusting both the total number of passengers and the percentages to the various distances accordingly.

In that case, I would just do half for under 1.6km. Whichever is simpler.
Seeing as the results are not fully known, until they are played out in the game, I would use the simpler solution to test the results. If they work, great, if they don't, hopefully not much time was used.

Hello Jamespetts, I can soon post a photo from an old timetable.

Greenling,
that's kind of you to suggest. I don't think that it's quite what we need for this, but by all means post them if you would like: they might come in handy at some point.
AEO,
I think that I might try, with the 125m/tile, using the fully realistic numbers without adaptation and seeing how that pans out.

Sdog,
I have considered in the past using costing as a factor for transport, but rejected it on the ground that it would introduce undesirable complexities or, if the complexities were to be glossed over or simplified away, perverse results/incentives.
The above figures show that the journey time tolerance feature really does need to be modified in order to produce a better set of numbers, I think.
(Incidentally, the private car system could not be adapted as you imagine, as it involves checking whether the passenger has a private car, then checking whether the journey can be completed with a private car, then checking whether it can be completed with public transport, then comparing the merits of the two modes).
Kieron,
Hmm, I did implement more or less that mechanism:
/* Generates a random number on the [0, max-1] interval with a normal distribution */
#ifdef DEBUG_SIMRAND_CALLS
uint32 simrand_normal(const uint32 max, const char* caller)
#else
uint32 simrand_normal(const uint32 max, const char*)
#endif
{
    const uint32 half_max = max / 2;
#ifdef DEBUG_SIMRAND_CALLS
    return (simrand(half_max, caller) + simrand(half_max, "simrand_normal"));
#else
    return (simrand(half_max, "simrand_normal") + simrand(half_max, "simrand_normal"));
#endif
}
but I think what is needed is not a normal distribution at all, but a declining probability the higher that the numbers get: after all, the midpoint between 30 and 4,320 minutes, the long distance range, is 2,175 minutes, or 36.25 hours! Having this as the most common journey time tolerance for long distance journeys would not work. What I really need is having, say, three or two and a half hours as the most common journey time tolerance for long distance passengers, with a significant minority getting tolerances between 30 minutes and 3/2.5 hours, and a dwindling minority getting journey time tolerances above that in the stratospheric ranges.
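To illustrate why adding two uniform random numbers clusters the results around the middle of the range, the sketch below counts how many ways each total can be reached. It assumes that simrand(n) returns each integer from 0 to n-1 with equal probability; the function name is invented for the example.

```cpp
#include <cstdint>

// Number of ordered pairs (a, b), each drawn from [0, half_max - 1],
// whose sum equals s. More ways means a more likely result.
// Hypothetical helper for illustration only.
std::uint64_t ways_to_reach_sum(std::uint32_t half_max, std::uint64_t s)
{
    const std::uint64_t h = half_max;
    if (h == 0 || s > 2 * h - 2) {
        return 0;  // the sum cannot be reached at all
    }
    // The count rises by one per step up to s = h - 1, then falls again.
    return (s < h) ? s + 1 : 2 * h - 1 - s;
}
```

The count climbs steadily up to the midpoint and then falls away symmetrically, so the most likely total is always roughly half the maximum, which is the behaviour objected to above.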
Perhaps it would do to introduce a median value for time tolerance, similar to that of the speedbonus max distance, in such a way that if we set, say, min/med/max 500/1000/3000, then half the journeys will have tolerance 500-1000, and the other half 1000-3000? etc.
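A minimal sketch of how such a min/med/max setting might be applied, assuming a uniform 16-bit random value is available: the top bit chooses which half of the range to use, and the remaining 15 bits pick a point within that half. The function and parameter names are hypothetical, and min_tol <= med_tol <= max_tol is assumed.

```cpp
#include <cstdint>

// Maps a uniform 16-bit value u (0..65535) to a tolerance so that half of
// all results fall in [min_tol, med_tol) and half in [med_tol, max_tol).
// Hypothetical helper; assumes min_tol <= med_tol <= max_tol.
std::uint32_t tolerance_with_median(std::uint16_t u, std::uint32_t min_tol,
                                    std::uint32_t med_tol, std::uint32_t max_tol)
{
    const bool upper_half = (u & 0x8000u) != 0;  // top bit selects the half
    const std::uint64_t frac = u & 0x7FFFu;      // low 15 bits: position inside it
    const std::uint64_t lo = upper_half ? med_tol : min_tol;
    const std::uint64_t hi = upper_half ? max_tol : med_tol;
    return static_cast<std::uint32_t>(lo + (((hi - lo) * frac) >> 15));
}
```

With 500/1000/3000 this spreads values evenly within each half, so the density above the median is a quarter of that below it; the result is a step rather than a smooth decline.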

Perhaps it would do to introduce a median value for time tolerance, similar to that of the speedbonus max distance, in such a way that if we set, say, min/med/max 500/1000/3000, then half the journeys will have tolerance 500-1000, and the other half 1000-3000? etc.
I am currently minded to try out Kieron's later suggestion of multiplying the numbers to get a sort of exponential skew; however, I need to get the basic calibration of raw passenger generation right before I do further work on the tolerances.

It has occurred to me that some adjustment is necessary to the base generation figures set out above. Earlier, I wrote,
Remembering that the average person makes 1,100 trips per year, we need to find a formulation that properly encapsulates the correct relationship between this and the passenger generation figures. It is necessary, however, to use a figure of greater than 1,100 as a base, as account must be taken of the fact that not all generated passenger packets in Simutrans will actually make journeys, even in a well-connected game, as there is the journey time tolerance to consider. I shall aim for 1,350 as the base figure.
First, the yearly figure must be translated to an hourly figure. For reasons discussed elsewhere, I use a figure for "active hours" in a day as being 16 (24 - 8 hours' sleep). 365.25 * 16 = 5,844 active hours per year. 1,350 / 5,844 = 0.231. Each unit of population should therefore generate 0.231 passengers per hour.
However, this makes the error of squashing the number of trips made by passengers in a total of 24 hours into 16. What we should actually do is generate that proportion of all passenger trips that take place during the 16 busiest hours of the day. Data for trips per time of day can be obtained here (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/9935/nts0501.xls).
The quietest 8 hours are 2300 - 0700, in which 3.67% of all trips are made. Therefore, the total number of passengers generated needs to be reduced by 3.67% - taking the base as 1,350 results in an adjusted base of 1,300 - or, if the base were taken instead as 1,250 (which on reflection might be rather better than a base of 1,350), this would give 1,205. This would give 1,300 / 5,844 = 0.222 or 1,205 / 5,844 = 0.206 instead of the previously calculated 0.231. The overall difference that this might make may well be small, but the above figures might need adjusting in consequence.
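The arithmetic above can be captured in a small helper, which makes it easy to try other base figures. This is a sketch only: the function and parameter names are invented, and it works from unrounded intermediate values, so the last digit may differ slightly from the hand-rounded figures quoted.

```cpp
// Passengers generated per unit of population per active hour.
//   trips_per_year        base number of annual trips per person (e.g. 1350)
//   quiet_hours_share     fraction of trips made in the excluded quiet hours (e.g. 0.0367)
//   active_hours_per_day  hours per day in which passengers are generated (e.g. 16)
// Hypothetical helper for illustration only.
double passengers_per_active_hour(double trips_per_year, double quiet_hours_share,
                                  double active_hours_per_day)
{
    const double active_hours_per_year = 365.25 * active_hours_per_day;
    return trips_per_year * (1.0 - quiet_hours_share) / active_hours_per_year;
}
```

Calling it with (1350, 0.0, 16) gives the original rate of roughly 0.231, while (1350, 0.0367, 16) and (1250, 0.0367, 16) give the two adjusted rates.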

I have been doing some work to implement the most basic parts of this (the relationship between the building's level, its size, the population in cities and the overall number of passengers generated) in advance of the next version. I was not originally going to implement any of this, but one of the things on my to do list for the next pakset release was recalibrating the congestion settings, which I came to realise was not possible to do properly without making some sort of start on this.
According to the above figures, we are aiming for a generated passenger to population ratio of 1.22:1 or thereabouts. I have changed the code so that all passengers generate return trips, and removed the code "damping" the amount of passengers and mail, which also had the effect that the different levels of buildings did not cause a linear increase in the number of passengers as would be expected. I have ensured that the level of buildings as entered in their .dat files are now directly represented in the game without intermediate obfuscation involving the passenger factor as occurred before, and recalibrated the passenger generation to interlink with the meters per tile setting.
I have further added a new system for calculating congestion (the old system can still be used: the congestion density factor setting determines which system is used), based on calculations derived from the TomTom Congestion Index. The desired ratio of 1.22:1 (to the nearest 2 decimal places) can be obtained with a passenger factor of 15.
While I am about it, I have also discovered and fixed a number of bugs with private car generation, including a bug that caused the number of recorded private car trips to be considerably too high.
A problem remains, however, that will need the more detailed review of systems that I cannot achieve in time for the next release, which is this: because the current system of destination finding requires that destinations be within a certain fixed distance range, small towns near large towns become the destination for a disproportionate number of passengers, greatly increasing the congestion in those towns, even if there is nothing much of interest there to which passengers might want to travel. This should be addressed with the planned more fundamental overhaul of town growth and passenger destination finding.

Hi James, that sounds very interesting. (And as a side note, I'm happy to hear that the cause of small towns getting disproportionate passengers has been identified.)
I have a very simplistic question to ask. Holding the passenger_factor value fixed, do you expect that the changes you describe above will increase or decrease the typical volume of passengers?

If you mean for a given passenger factor, will the number of passengers generated increase or decrease, the answer is that they will decrease.

Thanks for the reply, James: that's what I was hoping you'd say. ;)

Splendid!

On the other hand, London Underground services tend not to be as busy as one would expect - except in the very centre of London. This is presumably because it's quite difficult to simulate the fact that Londoners are much more likely to use public transport.
I think the biggest problem is probably city growth patterns. There is no mechanism to make cities grow more aggressively near places with public transportation, which is the mechanism which made London so dense, which is what made its public transportation so successful... I would really like to implement something like that, so that when the city looks to add new buildings, it preferentially goes near well-served public transport stops, but I haven't really figured out how to do it yet. It would require changing the new-building algorithm entirely. "renovate_percentage" should not exist; there should be some sort of weighted vector of locations, with a penalty for already having a building, but a bonus for having better public transport.

A problem remains, however, that will need the more detailed review of systems that I cannot achieve in time for the next release, which is this: because the current system of destination finding requires that destinations be within a certain fixed distance range, small towns near large towns become the destination for a disproportionate number of passengers, greatly increasing the congestion in those towns, even if there is nothing much of interest there to which passengers might want to travel. This should be addressed with the planned more fundamental overhaul of town growth and passenger destination finding.
In practice, it's totally realistic that small towns near large towns get big transportation and population booms - as "commuter belt" towns. One problem is the failure of the towns to densify properly, as mentioned above.
A second problem is the failure of the passenger destination code to consider industrial/commercial/residential in any sensible manner. Passengers start at any building and go to any building. I think we should take a page from SimCity: most passengers should start at residential buildings, go to commercial, industrial, or tourist buildings (a fixed percentage for each) and then come back.
Factories should be counted in industrial buildings for this purpose and should have a fixed employment level (unlike what is currently implemented in standard, which is simply bizarre: currently factories employ more people if they are near larger cities, which is weird and makes no sense).
A certain percentage of passengers should, of course, visit other residential buildings.
Finally, while most passengers should go "there and back", a small percentage should randomly go to a new destination before going home.
I think this system would probably be fairly straightforward to implement (for the next version). Certainly easier than fixing the city growth problem. The "second trip" part would be the most work as it would require keeping track of the ultimate building of origin. Actually, the building of origin is worth keeping track of anyway, because we can't send passengers home properly at the moment.
Here are my first wild guesses for appropriate percentages:
Short-range passengers: 40% industrial, 40% commercial, 10% tourist, 10% residential, 5% "second trip"
Medium-range passengers: 20% industrial, 40% commercial, 20% tourist, 20% residential, 10% "second trip"
Long-range passengers: 10% industrial, 35% commercial, 35% tourist, 20% residential, 25% "second trip"
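A minimal sketch of how those guessed percentages might drive the choice of destination type, given a random roll from 0 to 99. The enum and function names are invented for the example, and the separate "second trip" percentage is not modelled here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class DestinationType { Industrial, Commercial, Tourist, Residential };
enum class TripRange { Short, Medium, Long };

// Percentage weights in the order industrial, commercial, tourist,
// residential, copied from the guesses above. Each row sums to 100.
constexpr std::array<std::array<std::uint8_t, 4>, 3> kDestinationWeights = {{
    {{40, 40, 10, 10}},  // short range
    {{20, 40, 20, 20}},  // medium range
    {{10, 35, 35, 20}},  // long range
}};

// Picks a destination type from a roll in [0, 99] by walking the
// cumulative weights for the given range. Hypothetical helper.
DestinationType pick_destination_type(TripRange range, std::uint32_t roll)
{
    const auto& weights = kDestinationWeights[static_cast<std::size_t>(range)];
    std::uint32_t cumulative = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cumulative += weights[i];
        if (roll < cumulative) {
            return static_cast<DestinationType>(i);
        }
    }
    return DestinationType::Residential;  // rolls of 100 or more land in the last class
}
```

Keeping the weights in a table like this would also make it straightforward to read them from a pakset file later, if that were wanted.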

I have actually been planning to implement more or less exactly this (although I had not considered percentages, which should probably be set in the .dat file) for some time: see the earlier discussion in, I think, this thread. Indeed, I had devised, in the abstract at least, a way to deal with the issue of town growth being dependent on public transport, taking another leaf from Sim City's book and using success rates, which you will see implemented (currently for information only) in the latest development branches on Github (and, indeed, in all the recent release candidates). The higher the success rate of transport from a particular building, the more likely that it will be to upgrade, and the higher the success rates of surrounding buildings, the more likely that a new building will be built on an undeveloped tile. The idea is also to separate commuter and non-commuter traffic (replacing the current three distance based classifications), and have different sorts of traffic important for different sorts of growth: industry will need workers and does not care much for visitors, so will care only about commuter traffic; people move to where jobs are primarily, so residential will care mainly about commuter traffic, but will have some preference for better non-commuter success rates; commercial premises need visitors (whether they be customers or business visitors), so the number of non-commuter passengers will be much more important for commercial buildings. (Also, end-consumer "industries" should be classified as commercial for these purposes; player buildings should also be classified as industrial, as people have to work in stations and depots, too).

Apologies for reviving a somewhat elderly topic - my current work in testing the passenger generation code has led me to look into one of the topics discussed in this thread again, being normal distributions.
Some time ago, Kieron very helpfully suggested a simplified means of calculating a normal distribution:
int biased_random(int max) {
    int random1 = simrand(max);
    int random2 = simrand(max);
    return (random1*random2)/max;
}
I implemented a version of that some time ago, but testing today showed that, for visiting trips in which passengers might be inclined to travel for up to 24 hours or more, this was not sufficient to make extremely long trips rare enough. I therefore turned to Kieron's enhanced suggestion:
The peak value will be 1/4 the maximum value. Though actually you are better off with (((rand(max)+1)*(rand(max)+1))/max)-1 to avoid a large number of 0s being generated. If you multiply 3 random numbers from 1 to max then divide by max*max the peak will be at 1/9 and so on.
I implemented it thus to allow flexibility:
#ifdef DEBUG_SIMRAND_CALLS
uint32 simrand_normal(const uint32 max, uint32 exponent, const char* caller)
#else
uint32 simrand_normal(const uint32 max, uint32 exponent, const char*)
#endif
{
#ifdef DEBUG_SIMRAND_CALLS
    uint64 random_number = simrand(max, caller);
#else
    uint64 random_number = simrand(max, "simrand_normal");
#endif
    assert(exponent <= 3); // Any higher number will produce integer overflows even with unsigned 64-bit integers
    if(exponent < 2)
    {
        // Exponents of 1 make this identical to the normal random number generator.
        return (uint32)random_number;
    }
    uint64 adj_max = max == 0 ? 1 : max;
    // Multiply together "exponent" random numbers in total.
    for(uint32 i = 0; i < exponent - 1; i++)
    {
        random_number *= simrand(max, "simrand_normal");
    }
    // Scale the divisor to max ^ (exponent - 1).
    for(uint32 n = 0; n < exponent - 2; n++)
    {
        adj_max *= adj_max;
    }
    return (uint32)(random_number / adj_max);
}
Testing it carefully, it seems to have the same effect as Kieron intended.
Even this I found insufficient, however: testing in 1775, excessive numbers of passengers (proportionately) are happy to use a stage coach service with journey times ranging between 6 and about 26 hours. Checking using a debugger reveals that a significant proportion of passengers are still happy to make journeys of many hours: far too high a proportion.
(I should note that the new passenger generation code splits visiting and commuting trips - the journey time tolerance ranges for each can be set separately in the pakset: for commuting trips, I have set a much narrower range of journey time tolerances, from 20 minutes to 2 hours, than for visiting trips; it is visiting trips that are causing the trouble here).
I have looked at Matthew Collett's earlier suggestion again, which I had not had time to look into before, and then had not considered necessary unless Kieron's simpler (and therefore more computationally economical, as well as easier to code) method proved to be inadequate, and it is only now that I am finding that, for visiting passengers at least, it seems to be.
Matthew's suggestion was as follows:
The general algorithm is presumably of the form:
 1. Generate a possible trip.
 2. Based on the distance to be travelled and any other relevant parameters (the game year, the social status of the passenger, ...) determine the typical journey time tolerance T for that trip.
 3. Draw the actual journey time tolerance t from a random distribution parameterised by T.
 4. If t is larger than the expected journey time, make the trip, otherwise don't bother.
One straightforward way to implement an exponential distribution with mean time T for step 3 is:
 1. Generate a uniform random number x in the interval (0,1].
 2. Calculate t = -T ln x .
This uses floating-point arithmetic. If you want to use only integers, but are happy with an approximate stepped distribution, then:
 1. Generate a uniform random integer x in the interval [0,2^{N}).
 2. Find the number n of leading 1s in the binary representation of x; i.e. x consists of n 1s, a 0, and a remainder of N-n-1 other bits. Call the remainder y.
 3. Calculate t = (n + y/2^{N-n})T .
In this case T is the median rather than the mean.
Best wishes,
Matthew
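The two recipes quoted above can be sketched in C++ as follows. This is an illustration under stated assumptions rather than Matthew's own code: the function names are invented, and in the integer version the remainder y is divided by 2^(N-n-1), its own bit width, so that the fractional part covers the whole of [0,1). The formula as quoted divides by 2^(N-n), which would cover only the lower half of each step.

```cpp
#include <cmath>
#include <cstdint>

// Floating-point version: x is uniform in (0, 1], T is the mean tolerance.
double exponential_tolerance(double x, double T)
{
    return -T * std::log(x);
}

// Integer version: x is uniform in [0, 2^N), T is the median tolerance.
// Assumes 1 <= N <= 31. Hypothetical helper for illustration.
std::uint64_t stepped_tolerance(std::uint32_t x, unsigned N, std::uint32_t T)
{
    // Count the leading 1s of x, reading it as an N-bit number.
    unsigned n = 0;
    while (n < N && ((x >> (N - 1 - n)) & 1u) != 0) {
        ++n;
    }
    if (n + 1 >= N) {
        // All ones, or only the terminating 0 left: no remainder bits.
        return static_cast<std::uint64_t>(n) * T;
    }
    const unsigned remainder_bits = N - n - 1;
    const std::uint32_t y = x & ((1u << remainder_bits) - 1u);
    // ASSUMPTION: scale y by its own width so the fraction spans [0, 1).
    return static_cast<std::uint64_t>(n) * T
         + ((static_cast<std::uint64_t>(y) * T) >> remainder_bits);
}
```

In the integer version each extra leading 1 halves the probability and adds T to the result, so half of all draws fall below T, a quarter between T and 2T, an eighth between 2T and 3T, and so on. That is the "declining probability" shape, with no floating-point arithmetic involved.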
I am afraid, however, that I am finding this somewhat hard to follow, as I am not familiar with the advanced mathematical concepts being deployed here. What does "Generate a uniform random integer x in the interval [0,2^{N})" mean exactly (especially the part of the sentence after "in the interval...")?
Any assistance in deciphering Matthew's helpful suggestion from some time ago, or any alternative means of achieving this objective would be most welcome. I am afraid that I have found the responses that I get to a Google search for "normal distribution" to be a little too technical for me to understand and also focussed in any event on an unbiased normal distribution, whereas I am after a heavily biased distribution.
Edit: It looks like what I am trying to achieve is a positively skewed normal distribution (http://www.mathsrevision.net/advancedlevelmathsrevision/statistics/skewness) between the minimum and maximum times - but how to go about writing a C++ function to make this work, let alone one with good enough performance to be called very frequently, is beyond me without some assistance, I think.

someone correct me if I'm wrong, but I think you could get a similar function by transforming the function to the left.
which should be as simple as subtracting a specific number from the result.

what google shows: http://en.wikipedia.org/wiki/Normal_distribution#Generating_values_from_normal_distribution
Some methods don't seem too computationally intensive
Box-Muller with example of coding and comments http://en.literateprograms.org/BoxMuller_transform_%28C%29

What does "Generate a uniform random integer x in the interval [0,2^{N})" mean exactly
An integer with an equal chance (1 in 2^{N}) of being any of 0, 1, 2, 3, …, 2^{N}-2, 2^{N}-1.
I am afraid that I have found the responses that I get to a Google search for "normal distribution" to be a little too technical for me to understand and also focussed in any event on an unbiased normal distribution, whereas I am after a heavily biased distribution.
Edit: It looks like what I am trying to achieve is a positively skewed normal distribution (http://www.mathsrevision.net/advancedlevelmathsrevision/statistics/skewness)
The suggested algorithm (the integer version of which can probably be improved further*) does not generate a normal distribution or anything related to it: it gives an exponential distribution, which I am fairly confident is a better starting point for this application.
Best wishes,
Matthew
Edit: * in fact, was improved in the code I posted to the thread shortly after the message you quote.

There should be plenty of computationally cheap methods to approximate the effect you want... I think the key starting point is to set a few data points so that a simple method can be devised. On a large map this would be called a lot, so keeping it simple is certainly important.
To achieve half of the bell curve graph, you'd need, in theory, a Gaussian function (http://en.wikipedia.org/wiki/Gaussian_function) (modified to stretch out the tail). But that's pretty computationally heavy. I think it can be easily represented by two exponential functions, each taking care of the commuting portion and the travel portion of the curve.
EDIT: Doing some more reading. Did you try cubing the function as Kieron suggested? That will get you a bias of 1/9 (11%) instead of 1/4 (25%), which pushes the curve far to the left and heavily diminishes it as you approach the maximum journey time.
int biased_random(int max) {
    int random1 = simrand(max);
    int random2 = simrand(max);
    int random3 = simrand(max);
    return (random1*random2*random3)/(max*max); // *Note
}
*Note: You could rework the math so that you don't have to have a large integer to hold this value.
EDIT 2: Did some plotting on a spreadsheet - a cubed function will probably get it much closer to what you want for distribution at the high end. A squared function yields about 16% of its values over 50% while a cubed function is down to 3% over 50%. 4% are over 75% with a square, 0.5% are over 75% with a cube. This is probably ideal for the long distance journeys and perhaps medium. Local journeys are probably better squared so that you get more journeys at the higher end.
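The spreadsheet figures in EDIT 2 can be cross-checked against the exact result for the idealised continuous case. For k independent random numbers spread evenly over (0,1), the probability that their product exceeds t is 1 - t * sum over j from 0 to k-1 of (-ln t)^j / j!. The sketch below evaluates that expression; the function name is invented, and the integer rounding done by the real generator is ignored.

```cpp
#include <cmath>

// Probability that the product of k independent uniform(0,1) values
// exceeds t. Assumes k >= 1. Continuous idealisation only: the integer
// truncation in the actual random number code is not modelled.
double product_tail(double t, int k)
{
    if (t <= 0.0) return 1.0;
    if (t >= 1.0) return 0.0;
    const double l = -std::log(t);
    double term = 1.0;  // holds (-ln t)^j / j!
    double sum = 0.0;
    for (int j = 0; j < k; ++j) {
        sum += term;
        term *= l / (j + 1);
    }
    return 1.0 - t * sum;
}
```

For example, product_tail(0.5, 2) is about 0.153 and product_tail(0.5, 3) about 0.033, broadly in line with the "about 16%" and "3%" read off the spreadsheet.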
EDIT 3: How is longdistance_passengers_max_distance = 16384 used in the calculation? Is it truncated to the maximum possible journey distance for that particular map and scale? Per the simuconf comments, this is in km, correct?
From the online game, there are a lot of passengers generated who are happy to travel the entire length of the eastern island, some 200km in distance (10-15 hours). Beyond that, passenger volumes drop - hardly anyone (basically zero) will attempt the 60+ hour journey between the east and west island on a 15 km/h ship. I have a few long distance connections but it's almost all mail and very few passengers.

Thank you all for your replies - they are much appreciated. Matthew - I did not spot the code when I read it before, but I think that I see it now: apologies for missing it. Is it this:
#include <limits>

const unsigned int uibits = std::numeric_limits<unsigned int>::digits;
const unsigned int lgbits = uibits-2;
const unsigned int lgbit = 1<<lgbits;
const unsigned int hibit = lgbit<<1;
unsigned int scaled_lg(unsigned int x, unsigned int T=1) {
    if (x==0) return uibits*T;
    unsigned int lg = 0;
    //Find the first significant digit
    while ((x & hibit) == 0) {
        lg += T;
        x <<= 1;
    }
    //Make space for overflow
    x >>= 1;
    //Find leading bits in mantissa
    for (int j =1; j<lgbits/2; ++j) {
        if ((x ^ lgbit) == 0 || T == 0) return lg;
        x >>= lgbits/2;
        x *= x;
        unsigned int round = T & 1;
        T >>= 1;
        if (x & hibit) {
            x >>= 1;
            lg -= T+round;
        }
    }
    //Interpolate remaining bits linearly
    while (T != 0) {
        if (x & lgbit) lg -= T;
        T >>= 1;
        x <<= 1;
    }
    return lg;
}
If so, was this intended to be an exponential function or a normal distribution? I ask because it really is a skewed normal distribution that we need, not an exponential function: we want visiting trips to have a range of between 3 minutes and 5,400 minutes (90 hours) tolerance, but for there to be more people willing to travel for 10 minutes than 3 and more willing to travel 30 minutes than 10, and so forth, but much, much, much, much fewer willing to travel 10+ hours than 2 hours.
Sarlock, when you refer to the commuting portion and travel portion of the curve, I think that there is something of a misunderstanding: visiting trips and commuting trips do not occupy different parts of the same curve: they each have their own curves: visiting trips between 3 minutes and 90 hours, and commuting trips between 20 minutes and 2 hours. The commuting distribution can be an unskewed normal distribution, whereas the visiting distribution needs to be extremely heavily positively skewed.
I have tried with the cubed distribution already, using the code that I posted in the previous thread, and passing "3" as the exponent for visiting trips, with "2" for commuting trips. Even then, about 1/10th of passengers are willing to make journeys of 6 hours or more, which is excessive. I tried using an exponent of 4, but the resulting numbers were so large that they could not be contained in unsigned 64-bit integers, and overflowed, producing mostly zero.
As to the long distance passengers and their maximum distance, the whole concept of distance ranges has been completely abolished in this code on which I am working because it created anomalies (such as no passengers willing to take the requisite time to make a (realistically) very slow but fairly short journey in the early days). Instead of local, midrange and long distance passengers each with distance ranges and their own journey time tolerance range, there will now be commuting and visiting passengers with no distance range and each with their own journey time tolerance range, as discussed above.
Incidentally, subtracting a number from the result will not work well (at least, not without adding something to the range), as we need at least some journeys to be at the top of the range, even if they are only 0.001% of all trips.

Thank you for the clarification. It may be difficult to mathematically represent your bell curve that peaks at a middle value, say 30 minutes. You're probably better off setting a few conditions to make the data points conform to the curve you want.
A cubed exponential curve is very steep, so I suspect what is occurring is that the graph has a large maximum value and all small values (even 6 hours) are included in the very left part of the graph and still have a fairly high probability of occurring. In 1750 at 15km/h, a 6 hour trip is 90km, or 720 tiles. In the 20th century, at 200km/h, a 6 hour trip is 1200km, or 9600 tiles - which is likely wider than any map currently in play.
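The tile counts above follow from simple arithmetic, sketched here with an invented function name. The 125 metres per tile used in the examples is inferred from the figures quoted (90km being 720 tiles) and is an assumption of this sketch.

```cpp
#include <cstdint>

// Distance covered, in tiles, by a journey at a constant speed.
// Hypothetical helper; metres_per_tile must be non-zero.
std::uint32_t journey_length_tiles(std::uint32_t speed_kmh, std::uint32_t hours,
                                   std::uint32_t metres_per_tile)
{
    const std::uint64_t metres = static_cast<std::uint64_t>(speed_kmh) * hours * 1000u;
    return static_cast<std::uint32_t>(metres / metres_per_tile);
}
```

With 125m tiles, journey_length_tiles(15, 6, 125) gives 720 and journey_length_tiles(200, 6, 125) gives 9600, matching the two cases described.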
Commuting trips between 20 minutes and 2 hours can probably use the exponential curve and very nicely have most of the data points concentrated in the 20-60 mins range. If you cube it, then it will concentrate even more heavily in the 20-40 range. With a square, 76% of trips will be below 60 minutes. 57% will be 45 minutes or less. With a cube, you get 92% below 60 minutes, 82% below 45 minutes.
Representing visiting trips of 3-5400 minutes, peaking at 30 minutes:
What percentage of passengers do we want taking a trip of 15-45 minutes' length? 30%? How about 45 minutes to 120 minutes (2 hours)? Another 30%? How many taking a trip 3-15 minutes? Volumes would drop off quickly above 120 minutes. This will determine the shape of the curve. You may need to have 4 separate calculations for each range, to confine the data to a set that makes sense (and is easy to calculate). If you force the calculation to perform only, say, 5% of its data points above 2 hours, you can apply a squared exponential function and it should work fine - since we've already determined how many of that type to be calculated.

You are probably right about the large values - and, when I introduce portals, the values can only get larger: if we are simulating (albeit indirectly) a sailing ship voyage of the equivalent distance as from London to Sydney, that will be measured in months rather than hours, which journey at least some people were clearly prepared to tolerate, and yet the median tolerance should still remain in the region of 1 hour. I suspect that it can be done mathematically (and clipping the ranges in random samples has the disadvantage of producing arbitrary results and steep borders between values).
I wonder, however, whether there is something to be gained by the cumulative use of Kieron's formula: it will be recalled that the barrier to using it in a power of four is that the numbers become too large to fit in a 64-bit integer. Suppose, instead of using a power of four, we were to multiply together two random numbers obtained from the power of two version of the formula (rather than, as in the power of two version of the formula itself, two evenly distributed random numbers), and then divide that resulting number by the maximum - would that, I wonder, have the same effect as a power of four?
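A sketch of the idea just described, to show that it can be done without overflow. Since (a*b/max) * (c*d/max) / max equals a*b*c*d / max^3 apart from rounding, multiplying two power-of-two results should behave like a power of four, while no intermediate value exceeds max squared. The uniform_below function is a stand-in for simrand built on the standard library, and all three names are invented for the example.

```cpp
#include <cstdint>
#include <random>

// Stand-in for simrand(): uniform integer in [0, max - 1].
// Hypothetical; the real game uses its own generator.
std::uint32_t uniform_below(std::uint32_t max, std::mt19937& rng)
{
    if (max == 0) return 0;
    return std::uniform_int_distribution<std::uint32_t>(0, max - 1)(rng);
}

// Power-of-two bias: product of two uniform draws, scaled back below max.
std::uint32_t biased_square(std::uint32_t max, std::mt19937& rng)
{
    if (max == 0) return 0;
    const std::uint64_t product =
        static_cast<std::uint64_t>(uniform_below(max, rng)) * uniform_below(max, rng);
    return static_cast<std::uint32_t>(product / max);
}

// Power-of-four bias built from two power-of-two draws. Each intermediate
// product is below max * max, so it always fits in 64 bits.
std::uint32_t biased_fourth(std::uint32_t max, std::mt19937& rng)
{
    if (max == 0) return 0;
    const std::uint64_t product =
        static_cast<std::uint64_t>(biased_square(max, rng)) * biased_square(max, rng);
    return static_cast<std::uint32_t>(product / max);
}
```

The only difference from a true fourth power is the extra rounding down at the intermediate steps, which is negligible when max is in the thousands. Whether so steep a skew is actually desirable is a separate question.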

perhaps lognormal distribution is what you are looking for?
http://www.wolframalpha.com/input/?i=lognormal
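For anyone wanting to experiment with this, the C++ standard library already provides a log-normal generator. The sketch below is only an illustration: the function name, the choice to clamp out-of-range values and the use of std::mt19937 in place of the game's own random number generator are all assumptions, and it relies on floating-point arithmetic, which may be too slow for code called many times per step.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>

// Draws a journey time tolerance in minutes from a log-normal distribution
// whose median is median_minutes, clamped to [min_minutes, max_minutes].
// sigma controls how heavy the right-hand tail is.
// Hypothetical helper; median_minutes must be greater than zero.
std::uint32_t lognormal_tolerance(double median_minutes, double sigma,
                                  std::uint32_t min_minutes, std::uint32_t max_minutes,
                                  std::mt19937& rng)
{
    // For a log-normal distribution the median equals exp(m).
    std::lognormal_distribution<double> dist(std::log(median_minutes), sigma);
    const double value = dist(rng);
    const double clamped = std::min(std::max(value, static_cast<double>(min_minutes)),
                                    static_cast<double>(max_minutes));
    return static_cast<std::uint32_t>(clamped);
}
```

With a median of 60 minutes and a sigma of 1.0 (an arbitrary value chosen for the example), half of all draws fall below an hour while a thin tail stretches out to many hours.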

A 4th exponential will probably return values that are so heavily skewed to the beginning figures that your distribution at the higher end will be almost nonexistent.
I wonder if it's much faster computationally to take a random number and compare it to a data table that has the distribution data points in it... and then apply a small random variance to keep the data from "bunching". ie:
Pick random number 0-9, then pick random number between:
0: 3-15 minutes
1: 16-20 mins
2: 21-25 mins
3: 26-30 mins (bell curve peak)
4: 31-35 mins
5: 36-40 mins
6: 41-60 mins
7: 61-120 mins
8: 121-600 mins
9: 601-5400 mins
More data points would be required to add more detail to the curve; this is just for 10 points and isn't very refined (just an example list). You would likely want less than 10% of your trips being over 600 minutes - and probably want to add more data points even in the 600-5400 range. 50 or 100 points would probably be plenty.
This would allow you to fine-tune the curve to be exactly how you want it to look without having to burden the system with complex math (which is also much more difficult to tweak if you still aren't happy with the results - with this method you can make the distribution match exactly what you want).
If you want to supply some rough data points for the percentage of trips below 30 mins, 30-60, 60-120 and above, I could put together a quick series of data points and see what you think.
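A minimal sketch of the lookup-table idea, using the ten example ranges as I read them from the list above. The names `time_bucket`, `example_buckets` and `pick_tolerance` are hypothetical, and `std::mt19937` merely stands in for whatever generator the game would really use.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// One bucket of the lookup table: an inclusive range of journey times in minutes.
struct time_bucket
{
    std::uint32_t min_minutes;
    std::uint32_t max_minutes;
};

// The ten example buckets; each one is chosen with equal probability.
constexpr std::array<time_bucket, 10> example_buckets = {{
    {3, 15}, {16, 20}, {21, 25}, {26, 30}, {31, 35},
    {36, 40}, {41, 60}, {61, 120}, {121, 600}, {601, 5400}
}};

// Pick a bucket, then pick a uniformly random time inside it
// (the "small random variance" that stops values bunching).
inline std::uint32_t pick_tolerance(std::mt19937& rng)
{
    std::uniform_int_distribution<std::size_t> pick_bucket(0, example_buckets.size() - 1);
    const time_bucket& b = example_buckets[pick_bucket(rng)];
    std::uniform_int_distribution<std::uint32_t> pick_minutes(b.min_minutes, b.max_minutes);
    return pick_minutes(rng);
}
```

With this table roughly one trip in ten lands above 600 minutes, because the last bucket is as likely as any other; narrowing or splitting the buckets is how the curve would be tuned.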

Another possibility is to use part of a quartic function.
If you have a graphing calculator handy, you could potentially make it calculate a best fit function from several points.

Hmm - that log-normal distribution looks interesting - this appears to be the same graph shape as a skewed normal distribution. The data points idea is an interesting one, but I am concerned that it will produce non-smooth results with weird gameplay, and will be difficult to make work with user-defined ranges. As for quadratic functions - I am afraid that the only things that I know about them are that they are called quadratic functions, and that they are a very advanced area of mathematics. What would a quadratic function enable the programme to do, and is it computationally expensive (i.e., does it involve a great deal of recursion, many division operations or things such as square roots)? Remember that this random number generator must be called many times every step.

With enough data points (30 or 40 will likely suffice) the curve will be smoothed out so much that there will be no noticeable effect on the game. I'll create something to this effect to demonstrate. It makes it far easier to tweak the parameters of the curve by just changing a few values.
The data points could be user defined or put into simuconf - allowing the user/pakset designer to determine the shape of the bell curve.
It's also super fast to calculate.

(http://www.ssgholdings.ca/simutrans/images/paxtimegraph2.png)
Spreadsheet:
Passenger Travel Times - Data Point System Excel File (http://www.ssgholdings.ca/simutrans/images/Passenger Travel Times  Data Point System.xlsx)
Pretty nice curve for a quick sample set. Might want to reduce the peak probabilities and round out the bell curve a bit, if desired... it probably peaks a bit too aggressively at 30 minutes. I used a system with two data pieces per time period: maximum time and probability. Total data points required: 20. Probably could use just 15 and still perform fine. Most of the diminishing numbers at the higher times (600+ minutes) could probably just be merged into one set.
For each data sequence, we just choose a random number between the min and max. You can't really see it on the graph (just a bit on the tails of the curve), but each segment is basically a linear section... but even with just 20 segments the lines all smooth out into a nice bell curve shape.
EDIT: Changed slightly to reduce peak probabilities and round out the peak a bit more.
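For anyone wanting to try the "maximum time plus probability" scheme, a minimal sketch follows, using the standard library's `std::piecewise_constant_distribution`, which picks a segment according to its weight and then a uniform value inside it. The boundaries and weights below are hypothetical placeholders, not the values from the spreadsheet.

```cpp
#include <random>
#include <vector>

// Hypothetical segment boundaries (minutes) and relative weights; these are
// NOT the spreadsheet's values, only placeholders to show the mechanism.
// N + 1 boundaries define N segments, so there is one weight per segment.
inline std::piecewise_constant_distribution<double> make_travel_time_distribution()
{
    const std::vector<double> boundaries = {3, 15, 30, 45, 60, 120, 600, 5400};
    const std::vector<double> weights    = {5, 30, 25, 15, 15, 8, 2};
    return std::piecewise_constant_distribution<double>(
        boundaries.begin(), boundaries.end(), weights.begin());
}
```

Usage would be along the lines of `auto dist = make_travel_time_distribution(); double minutes = dist(rng);` with `rng` being, for example, a `std::mt19937`. Reshaping the curve then only means editing the two lists.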

James, I'm afraid I've not read the thread so far; this is just an answer based on your social media requests regarding the distribution problem:
The method of counting the leading 1s in the binary representation, which Matthew described:
"Find the number n of leading 1s in the binary representation of x; i.e. x consists of n 1s, a 0, and a remainder of N-n-1 other bits. Call the remainder y."
is nothing more than an integer calculation to get log_2. You can easily see that when you look at the real representation above. Don't forget that the inverse of the exponential is the logarithm. He is not calculating a skewed normal distribution, but rather a (much more appropriate!) exponential distribution.
You have perhaps heard of exponential distributions in economics under the term 'long tail'. I can't look it up right now, but I think I remember from statistical mechanics that one can both prove and empirically show that passenger waiting is described by such a function. That is not sourced here; if I have some time to hand I shall try to look it up, else I'll pass the ball to someone else here.
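To make the quoted step concrete, here is a minimal sketch of the decomposition alone, assuming N = 32. The names `leading_ones_split` and `split_leading_ones` are hypothetical, and what is then done with n and y is not shown in this part of the thread, so it is left out. For a uniformly random x, the chance of getting exactly n leading 1s is 2^-(n+1), halving with every extra bit, which is where the exponential character comes from.

```cpp
#include <cstdint>

// Result of splitting a 32-bit value as described in the quoted text.
struct leading_ones_split
{
    unsigned n;       // number of leading 1 bits
    std::uint32_t y;  // the remaining 32 - n - 1 low bits
};

inline leading_ones_split split_leading_ones(std::uint32_t x)
{
    unsigned n = 0;
    // Count 1 bits from the most significant end until the first 0.
    while (n < 32 && (x & (0x80000000u >> n)) != 0) {
        ++n;
    }
    // After n ones and a single zero, 31 - n bits remain (none when n >= 31).
    const std::uint32_t y = (n >= 31) ? 0u : (x & ((1u << (31 - n)) - 1u));
    return {n, y};
}
```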

Sdog,
thank you for your input and clarification on Matthew's function: that is most thoughtful. However, I am somewhat doubtful that a logarithmic/exponential function is what is needed here rather than a positively skewed normal distribution. You refer to waiting times above, whereas what I am actually trying to model are journey time tolerances, which are not quite the same thing. Marchetti's constant (http://en.wikipedia.org/wiki/Marchetti%27s_constant) holds that people tend to have a fairly uniform travel time budget of about an hour, which strongly suggests that, for any given journey, more people should be willing to spend, say, half an hour travelling than are prepared to spend only three minutes travelling, but equally, more people should be prepared to spend half an hour travelling than three hours. Is there a particular datum on which you base your suggestion to the contrary?
Edit: Tests that I am in the process of carrying out are showing good results with compounding/recursion of the cubed version of Kieron's algorithm. Taking two results from the cubed version, multiplying them together and dividing by the maximum has produced a more satisfactory result, with 28 passengers being willing to travel in one game month of 6:24h with a minimum journey time of 4:18h travelling and 2:52h waiting, with 1,583 passengers recording "too slow" and 226 "no route". I will try again with a multiple of three and see whether that is better still. The questions are now, assuming this system proves suitable, how to calibrate it and how best to allow it to be customised in simuconf.tab, since there is no easy single exponent for scaling. Perhaps a few numbers with different modes, such as 0 or 1 for an even distribution, 2 or 3 for the squared or cubed skewed normal distribution, 4 for a double recursion of the cubed distribution and 5 for a triple recursion, or something of the sort?
Edit 2: A triple recursion does not seem to work well: dividing by two produces values higher than the maximum, whereas dividing by the square always produces zeros. The best results so far have been obtained by a single recursion.
Edit 3: I have found a way to parameterise it, I think, although I have not had a chance to do any serious performance testing yet or see whether this works well for extremely large numbers. The code is as follows:
/**
 * Generates a random number on [0,max-1] interval with a normal distribution
 * See: http://forum.simutrans.com/index.php?topic=10953.0;all for details
 * Takes an exponent, but produces unreasonably low values with any number
 * greater than 2.
 */
#ifdef DEBUG_SIMRAND_CALLS
uint32 simrand_normal(const uint32 max, uint32 exponent, const char* caller)
#else
uint32 simrand_normal(const uint32 max, uint32 exponent, const char*)
#endif
{
#ifdef DEBUG_SIMRAND_CALLS
    uint64 random_number = simrand(max, caller);
#else
    uint64 random_number = simrand(max, "simrand_normal");
#endif
    if(exponent < 2)
    {
        // Exponents of 0 or 1 make this identical to the normal random number generator.
        return (uint32)random_number;
    }
    // Any higher number than 3 will produce integer overflows even with unsigned 64-bit integers.
    // Interpret higher numbers as directives for recursion.
    uint32 degrees_of_recursion = 0;
    uint32 recursion_exponent = 0;
    if(exponent > 3)
    {
        degrees_of_recursion = exponent - 4;
        if(degrees_of_recursion == 0)
        {
            recursion_exponent = 2;
        }
        else
        {
            recursion_exponent = 3;
        }
        exponent = 3;
    }
    // Guard against division by zero when max is 0.
    const uint64 abs_max = max == 0 ? 1 : max;
    // Multiply together (exponent) uniform random numbers in total.
    for(uint32 i = 0; i < exponent - 1; i++)
    {
        random_number *= simrand(max, "simrand_normal");
    }
    // Scale back into range: divide by max for squared, max * max for cubed.
    uint64 adj_max = abs_max;
    for(uint32 n = 0; n < exponent - 2; n++)
    {
        adj_max *= adj_max;
    }
    uint64 result = random_number / adj_max;
    for(uint32 i = 0; i < degrees_of_recursion; i++)
    {
        // The use of a recursion exponent of 3 or less prevents infinite loops here.
        const uint64 second_result = simrand_normal(max, recursion_exponent, "simrand_normal_recursion");
        result = (result * second_result) / abs_max;
    }
    return (uint32)result;
}
The parameters are:
0 or 1: even distribution;
2: normal distribution with slight skew (squared);
3: normal distribution with large skew (cubed);
4: normal distribution with recursion skew (squared);
5: normal distribution with recursion skew (cubed);
6 and above: normal distribution with multiple recursion skew (cubed; these values produce so extreme a skew as may be of limited usefulness).
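To check how a given parameter value is interpreted, the branching at the top of the listing can be mirrored in isolation. This is only a sketch: `skew_plan` and `decode_exponent` are hypothetical names, and it follows the listing literally rather than the descriptions in the list. Read that way, a value of 4 sets a recursion exponent of 2 but yields zero recursion passes, so it may be worth confirming that 4 really behaves differently from 3.

```cpp
#include <cstdint>

struct skew_plan
{
    std::uint32_t base_exponent;       // uniform draws multiplied in the first pass
    std::uint32_t recursion_passes;    // how many further results are multiplied in
    std::uint32_t recursion_exponent;  // exponent used for each further result
};

// Mirrors the branching in the posted simrand_normal listing.
inline skew_plan decode_exponent(std::uint32_t exponent)
{
    if (exponent < 2) {
        return {1, 0, 0};  // plain uniform draw
    }
    if (exponent <= 3) {
        return {exponent, 0, 0};  // squared or cubed, no recursion
    }
    const std::uint32_t passes = exponent - 4;
    return {3, passes, passes == 0 ? 2u : 3u};
}
```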
Early experimentation appears to show that it might be possible to have a much higher value for the maximum level of tolerance (perhaps circa 1,728,000, representing four months' worth of tenths of minutes, or, if we were to start recording time in seconds rather than minutes*, 20 days) using the number 6, but the results can be somewhat erratic.
* The reason to allow some very high values is that I plan to introduce a "portals" feature, allowing for simplified/abstracted intercontinental travel without an increase in map size. This will entail extremely long journey times of multiple months, thus requiring me to use a 32-bit rather than 16-bit integer type to store all journey time information. However, increasing the integer precision from 16-bit to 32-bit also allows timings to be stored in seconds rather than the current tenths of minutes. This changes the meaning of the numbers returned by the random number generator and thus the effect of the skew factor.
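The unit arithmetic behind these figures can be checked with a few compile-time assertions. This is a sketch only: the helper names are hypothetical, a month is taken as 30 days, and the 16-bit type is assumed to be unsigned.

```cpp
#include <cstdint>

// Convert a count of tenths of minutes to whole days (10 * 60 * 24 per day).
constexpr std::uint64_t tenths_of_minutes_to_days(std::uint64_t t)
{
    return t / (10 * 60 * 24);
}

// Convert a count of seconds to whole days (60 * 60 * 24 per day).
constexpr std::uint64_t seconds_to_days(std::uint64_t s)
{
    return s / (60 * 60 * 24);
}

// 1,728,000 tenths of minutes is 120 days, i.e. four 30-day months...
static_assert(tenths_of_minutes_to_days(1728000) == 120, "four 30-day months");
// ...but the same number read as seconds is only 20 days.
static_assert(seconds_to_days(1728000) == 20, "20 days in seconds");

// Longest time an unsigned 16-bit value can hold, in whole hours:
static_assert(65535 / (10 * 60) == 109, "as tenths of minutes");
static_assert(65535 / (60 * 60) == 18, "as seconds");
```

The last two lines show why a change of unit to seconds would not be practical without the wider integer type.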
Edit 5: I have now implemented this fully and made it customisable in simuconf.tab, as well as adding an unskewed normal mode. The simuconf.tab comments should explain the operation of this system:
# The following settings determine the way in which individual packets of passengers decide
# what their actual journey time tolerance is, within the above ranges. The options are:
#
# 0 - Even distribution
# Every point between the minimum and maximum is equally likely to be selected
#
# 1 - Normal distribution (http://en.wikipedia.org/wiki/Normal_distribution)
# Points nearer the middle of the range between minimum and maximum are more likely
# to be selected than points nearer the ends of the range.
#
# 2 - Positively skewed normal distribution (squared) (http://en.wikipedia.org/wiki/Skewness)
# Points nearer to a point in the range that is not the middle but is nearer to the lower
# end of the range are more likely to be selected. The distance from the middle is the skew.
#
# 3 - Positively skewed normal distribution (cubed)
# As with no. 2, but the degree of skewness (the extent to which the modal point in the range
# is nearer the beginning than the end) is considerably greater.
#
# 4 - Positively skewed normal distribution (squared recursive)
# As with nos. 2 and 3 with an even greater degree of skew.
#
# 5 - Positively skewed normal distribution (cubed recursive)
# As with nos. 2, 3 and 4 with a still greater degree of skew.
#
# 6 and over - Positively skewed normal distribution (cubed multiple recursive)
# As with nos. 2, 3, 4, and 5 with an ever more extreme degree of skew. Use with caution.
random_mode_commuting = 2
random_mode_visiting = 5
Thank you to all who have helped.