Hashtables

jamespetts · February 18, 2014, 12:46:30 AM

I think that this has been discussed previously, but a long time ago, and I think that it might be worth discussing again. Currently, Simutrans has a hashtable template. However, unlike many hashtables, this hashtable has a somewhat inefficient means of looking things up and is no faster at returning a value from a key than, say, searching a vector or singly linked list for a particular value - if I recall correctly, it iterates through all items until it finds something matching.

A more conventional hashtable implementation, if I understand it correctly, has a very high speed algorithm for matching key/value pairs. Unless I have misunderstood something, therefore, there would be something to gain from using a more conventional hashtable implementation in Simutrans - perhaps from the std templates, if there is one? Or am I missing some important piece of information that is a good reason for retaining the existing model.

It is perhaps noteworthy that Experimental uses hashtable lookups rather more than Standard, for example, for storing cached routing information for goods, so it might be that a different strategy is optimum in each case, or even, conceivably, that it might make sense to use a standard hashtable class for some things and the existing hashtable implementation for others.

I should be interested in any words of wisdom from those more familiar with the low level innards of the code than I on the subject.

Ters · February 18, 2014, 05:46:31 AM

Simutrans' hashtable looks like a proper hashtable to me. The only thing seems to be that it has a static number of buckets (Simutrans calls them bags), which I think means that it does not scale as well. A hashvalue never gives direct lookup. It can't since hashvalues aren't unique. The hashvalue is only used to find the correct bucket (many hashvalues map to the same bucket), which quickly rules out the elements in all the other buckets. A normal search is then done within that bucket, using normal equality.

For a hashtable to be most effective, the hashing should be done so as to spread elements evenly across all buckets. The number of buckets should also be so that there are few elements in each bucket. In any case, the complexity of a hashtable search is O(n), just like a straight arrays, but unlike B-trees which are O(log n). Although both normal arrays and hashtables are O(n), the latter should be faster. O(n) just means that an array or hashtable that's twice as big is also twice as slow.

MCollett · February 18, 2014, 06:12:51 AM

Quote from: Ters on February 18, 2014, 05:46:31 AM
In any case, the complexity of a hashtable search is O(n), just like a straight arrays, but unlike B-trees which are O(log n). Although both normal arrays and hashtables are O(n), the latter should be faster. O(n) just means that an array or hashtable that's twice as big is also twice as slow.

A hashtable should be O(1) (constant look-up time), that's the whole point; if it isn't, you need more bins. And while a tree (of any type, not just B-trees) should indeed be O(lg n), so should binary search in a sorted array.

Best wishes,
Matthew

Dwachs · February 18, 2014, 08:30:37 AM

Simutrans standard does use the hashtables only in non-performance critical parts of the code (retrieving objects by name, translation of text snippets).

Combuijs · February 18, 2014, 08:52:05 AM

QuoteA hashtable should be O(1) (constant look-up time), that's the whole point; if it isn't, you need more bins.

Worst case it will have to loop through the buckets so order remains O(n) not O(1). Best case you want it to be a direct hit, average case a little above it. If that's not the case you want either more buckets or a better hash function. The last one is more critical! My experience is that in general a hash table size of 4 * the number items to store is a good balance between space and memory usage when the hash function is doing its job.

A good hash table implementation should in my opinion have a flexible size, not a static one. When the hash table gets too sparse or too populated it should ideally be shrunk or extended.

Ters · February 18, 2014, 06:11:35 PM

I was a bit unclear about reporting worst case, which is what I usually think about. If one can achieve O(1), only doesn't really need buckets at all, just an array indexed by the hashvalue.

But the conclusion seems clear that the problem with Simutrans' hashtable is that it doesn't adjust itself to the number of items. It can only make a search 100 times faster than it would be without the hashing part. A best case of O(1) and worst case of O(n) still hold. It is just more likely to experience the latter.

Quote from: Dwachs on February 18, 2014, 08:30:37 AM
Simutrans standard does use the hashtables only in non-performance critical parts of the code (retrieving objects by name, translation of text snippets).

But that's only because Simutrans' hashtable implementation doesn't scale well. We use hashtables in performance critical code at work. The only problem I think we've had is a big hashtable that wasn't presized, so it constantly had to resize, and therefore rehash all items over and over.

Quote from: MCollett on February 18, 2014, 06:12:51 AM
And while a tree (of any type, not just B-trees) should indeed be O(lg n), so should binary search in a sorted array.

Sorted arrays aren't good for updating though, so I didn't mention. As far as searching goes, the underlying theory is exactly the same.

jamespetts · February 18, 2014, 11:36:32 PM

Thank you for all the clarification on this issue, which is helpful. Is it worth trying to adapt the existing template to improve performance, or would it be easier to use an existing standard template for performance critical cases?

Ters · February 19, 2014, 06:32:13 AM

If performance is really critical, just inserting an existing general solution will likely not be enough. It must be tweaked to the particular situation. A really good hashtable implementation will allow you to do this. As has been mentioned, for a hashtable, a good hashing function and the right number of buckets is essential for peak performance. And if you know how many elements the hashtable will contain, even just roughly, tell it so up front, because changing the number of buckets is really O(n), both best case and worst case, as far as I can tell.

MCollett · February 19, 2014, 08:40:17 AM

Quote from: Ters on February 19, 2014, 06:32:13 AM
And if you know how many elements the hashtable will contain, even just roughly, tell it so up front, because changing the number of buckets is really O(n), both best case and worst case, as far as I can tell.

Yes and no. If the number of buckets grows exponentially rather than linearly (e.g. doubles every time more are needed) then the amortised cost is O(1). And except for hard real-time applications, amortised cost is generally more relevant than worst-case.

Best wishes,
Matthew

Combuijs · February 19, 2014, 11:41:40 AM

If you change the number of buckets (indeed doubling the number of buckets everytime you need extra space is a good strategy), you will have to rehash all existing values in the buckets. That is the O(n) Ters is talking about, I think. (And worst case that is O(n²), if you have a terrible hash function).

Ters · February 19, 2014, 03:18:04 PM

I was indeed thinking about doubling the number of buckets when more buckets are needed. It's probably the most common growth strategy, not only for hashtables.

Markohs · February 19, 2014, 05:54:36 PM

STL!!

Who said that?!?!?!?

Ters · February 19, 2014, 06:26:43 PM

STL actually don't have a hashtable. Hashtables were added to the C++11 standard, but at that point, it was no longer STL. However C++11 support is still not something one can expect.

ArthurDenture · February 24, 2014, 02:17:36 AM

QuoteSTL actually don't have a hashtable.

To be pedantic, STL certainly does have a hashtable, but the C++ Standard Library (before C++11) does not. Citation: http://stackoverflow.com/a/5908859/483595

Are there compilers targeted by Simutrans that do not provide std::hash_map?

Ters · February 24, 2014, 06:47:38 AM

Quote from: ArthurDenture on February 24, 2014, 02:17:36 AM
To be pedantic, STL certainly does have a hashtable, but the C++ Standard Library (before C++11) does not. Citation: http://stackoverflow.com/a/5908859/483595

I wasn't aware that STL was still available as a separate product. Since when has it had a hashtable? Did it have it when the C++ Standard Library was based on it?

Quote from: ArthurDenture on February 24, 2014, 02:17:36 AM
Are there compilers targeted by Simutrans that do not provide std::hash_map?

To be pedantic, GCC does not, nor does any recent MSC++ compiler.

For GCC, it's either __gcc_cxx::std_map or std::unordered_map, with the latter being std::tr1:unordered_map in versions a few years old. GCC 3 does not seem to have had unordered_map. mingw32 was fairly late to move from GCC 3 to GCC 4. I don't even now if they have formally switched yet for stable, but I think they have.

Microsoft seems to have used the std namespace for unordered_map since Visual Studio 2010, but it was std::tr1 in Visual Studio 2008. They also have a hash_map in namespace stdext since 2003. Before that, it was in std.

For Mac and Haiku, I have no idea of the situation.

The namespace change can probably be handled with some macros, as long as the API didn't change between tr1 and final. The API for hash_map and unordered_map might be too different.

News:

Hashtables