Windows Unicode File Path Support

Started by DrSuperGood, October 11, 2017, 04:46:31 AM

TurfIt

r8319 has broken display of symbols and non-English characters.

DrSuperGood

Could you make sure those characters are UTF-8 encoded? The old logic might have translated some invalid UTF-8 into extended Latin, which is a different handling standard. That approach is usually used by software that was not originally Unicode-aware, e.g. Internet browsers, so that it stays compatible with the common extended Latin encodings. Since Simutrans should be using UTF-8 internally, I thought returning the standard replacement character code 0xFFFD was more appropriate. However, that is missing from a lot of fonts, so it might show up as a box instead.

If they are valid UTF-8, maybe it is a problem with the pointer reference manifesting in GCC builds (it is a warning after all...). In my development MSVC builds, using Unifont, I have no problem viewing and entering umlauts, Chinese or Japanese characters, including in save file names.

TurfIt

No idea about the encoding. One of them is from halt_detail.cc:279:   buf.append(" ·");
Otherwise it's simply loading the old yoshi test game - whatever is saved in that. Everything displays correctly in r8318.

prissi

Simutrans was only using UTF-8 internally when the default language was UTF-8. So this patch fails when loading old games, because it assumes the saved strings were UTF-8. Also, there are two encodings, Latin and Latin2, which need to be converted to Unicode upon loading (in all places) if a savegame is from before version 120006: from Latin2 for Eastern Europe and from Latin for Western.

This is a supermessy can of worms opened here. It will affect about 50% of the strings, because internal names (the names of objects) MUST not be translated upon loading, but all other stuff MUST. So this affects convois, lines, stops, ...

I am almost tempted to revert that patch for display until all these issues are solved properly. Here is a short patch that would solve the loading of stop and marker names. I think one would rather have an rdwr_utf8_str function, which does the conversion as in this example if needed, and the plain version, which must be used for objs and other stuff that is not to be meddled with.
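
(Not the attached patch.) To illustrate the kind of conversion such an rdwr_utf8_str helper might perform, here is a minimal C++ sketch. The names latin1_to_utf8, fix_loaded_text and FIRST_UTF8_SAVE_VERSION are hypothetical and not taken from the Simutrans source. It covers Latin1 only (Latin2 would need a lookup table), and it assumes the caller already knows the string really is Latin1; running it on a string that is already UTF-8 would double-encode it.

```cpp
#include <string>

// Hypothetical helper: re-encode a Latin1 (ISO-8859-1) string as UTF-8.
// Every Latin1 byte maps to the Unicode code point with the same value,
// so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string &in)
{
	std::string out;
	out.reserve(in.size() * 2);
	for (unsigned char c : in) {
		if (c < 0x80) {
			out.push_back(static_cast<char>(c));
		}
		else {
			out.push_back(static_cast<char>(0xC0 | (c >> 6)));
			out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
		}
	}
	return out;
}

// Assumed threshold, taken from the version number mentioned in the post.
const unsigned FIRST_UTF8_SAVE_VERSION = 120006;

// Hypothetical gate: object names are passed through untouched,
// user-visible text from older saves is converted.
std::string fix_loaded_text(const std::string &raw, unsigned save_version, bool is_object_name)
{
	if (is_object_name || save_version >= FIRST_UTF8_SAVE_VERSION) {
		return raw;
	}
	return latin1_to_utf8(raw);
}
```

The is_object_name flag stands in for the split between internal names, which must stay byte-identical, and everything else, which gets converted.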

DrSuperGood

I do not understand why this is suddenly a problem... The only thing that has changed as far as text processing goes is that it now correctly decodes UTF-8 into UTF-32, as opposed to into some hacky UTF-16. No other processing was added or removed, meaning that any invalid Unicode sequences that exist now also existed before but were not visible.

Are you trying to say that it relied on ISO-8859-1 interpretation (the Unicode code points U+0080–U+00FF have the same value as the byte) before? If so, one could fix that by changing the Unicode error handler to return the byte as a code point instead of substituting 0xFFFD, literally a one-line change. I have made this change to hopefully combat the regression. TurfIt, please check whether the characters are displayed correctly in r8321 and later.
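
As a rough illustration of that one-line difference, here is a self-contained C++ sketch of a UTF-8 to UTF-32 decoder with a selectable error policy. It is not the actual Simutrans unicode code; ErrorPolicy, decode_one and decode_utf8 are names made up for this example. On an invalid sequence it consumes exactly one byte and returns either 0xFFFD or the byte value itself as a code point (the ISO-8859-1 interpretation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only.
enum class ErrorPolicy { ReplacementChar, Latin1Fallback };

// Decode one code point starting at s[i] and advance i past it.
char32_t decode_one(const std::string &s, std::size_t &i, ErrorPolicy policy)
{
	const unsigned char b0 = static_cast<unsigned char>(s[i]);
	std::size_t len = 0;
	char32_t cp = 0;
	if (b0 < 0x80) { len = 1; cp = b0; }
	else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
	else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
	else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }

	bool ok = len != 0 && i + len <= s.size();
	for (std::size_t k = 1; ok && k < len; ++k) {
		const unsigned char b = static_cast<unsigned char>(s[i + k]);
		if ((b & 0xC0) != 0x80) { ok = false; }
		else { cp = (cp << 6) | (b & 0x3F); }
	}
	// Reject overlong forms, surrogates and values beyond U+10FFFF.
	static const char32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	if (ok && (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
		ok = false;
	}
	if (ok) {
		i += len;
		return cp;
	}
	// The "one line" in question: what to return for a bad byte.
	i += 1;
	return policy == ErrorPolicy::Latin1Fallback ? char32_t(b0) : char32_t(0xFFFD);
}

std::u32string decode_utf8(const std::string &s, ErrorPolicy policy)
{
	std::u32string out;
	for (std::size_t i = 0; i < s.size(); ) {
		out.push_back(decode_one(s, i, policy));
	}
	return out;
}
```

With Latin1Fallback, a lone byte 0xFC decodes to U+00FC (ü); with ReplacementChar the same byte becomes U+FFFD. Valid UTF-8 decodes identically under both policies.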

As long as all new maps produce correctly encoded UTF-8 the odd invalid character should not be a problem in the long run. If the ISO-8859-1 interpretation works then it should be good enough to not require barrier translation of old strings, like what was happening before.

Ters

I don't see why Windows Unicode File Path Support had anything to do with the internal strings in old save games. I would have expected those to be affected by r7590 from two years ago instead. Then again, r8319 is more related to the work in r7590 than to Windows Unicode File Path Support.

I'm not sure whether support for Latin1 characters in what Simutrans assumes to be UTF-8 strings prior to r8319 was intentional or just accidental. When encountering illegal UTF-8 sequences or 4-byte sequences, the old unicode.c fell back to Latin1. (The comments for illegal sequences suggest intention; the comment for 4-byte sequences suggests accidental Latin1 support.) The new code flags them as errors, plain and simple.

Since it is only in the parsing of the strings that this fixing occurred before r8319, maps that once contained Latin1 continued to use Latin1 even after being re-saved by newer "all"-UTF-8 versions of Simutrans (that is, from the past two years). The strings need to be converted on loading using the same fallbacks as pre-8319 unicode.c. String reading ought to be handled by only one utility function, so that should not be much work. (That might not be the case, though.) However, newer Simutranses cannot exactly save games into the old format without knowing what encoding was used for that particular language at that time.

UTF-8 auto-detection is supposedly simple, as valid UTF-8 sequences are unlikely to occur with any other character encoding. The only exception is detecting whether a text is supposed to be UTF-8, or ASCII or some ASCII extension like Latin1, when only characters present in ASCII are used. Even then, assuming UTF-8 when in doubt is safe for reading.
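
A minimal C++ sketch of such a check, under the assumption that "detection" just means "the whole string is well-formed UTF-8". The function name looks_like_utf8 is made up for this example and is not from the Simutrans source. Pure ASCII passes as well, which matches the "assume UTF-8 when in doubt" rule.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every byte sequence in s is well-formed UTF-8.
bool looks_like_utf8(const std::string &s)
{
	std::size_t i = 0;
	while (i < s.size()) {
		const unsigned char b0 = static_cast<unsigned char>(s[i]);
		if (b0 < 0x80) { ++i; continue; } // plain ASCII

		std::size_t len = 0;
		unsigned long cp = 0, min_cp = 0;
		if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
		else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
		else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
		else { return false; } // stray continuation byte or invalid lead byte

		if (i + len > s.size()) { return false; } // truncated sequence
		for (std::size_t k = 1; k < len; ++k) {
			const unsigned char b = static_cast<unsigned char>(s[i + k]);
			if ((b & 0xC0) != 0x80) { return false; }
			cp = (cp << 6) | (b & 0x3F);
		}
		// Overlong encodings, surrogates and out-of-range values are not valid.
		if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
			return false;
		}
		i += len;
	}
	return true;
}
```

A Latin1 string containing, say, a lone 0xFC byte fails this check, so a loader could use the result to decide whether a legacy conversion is needed at all.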

prissi

Apparently this hack is covering up a long-standing issue. So for the time being we need that conversion (which, by the way, produces Latin1 or Latin2 depending on the font used).

And one should probably convert all strings upon loading into UTF-8 names, including the stuff with pak file names (that would assume latin1) and then stick with this for the future.

Ters

I finally got around to merging my local changes with what had happened in SVN. Most was identical, and most of the rest was minor stuff like positioning and white space differences. In the end, I was left with three calls to fopen that I had replaced with dr_fopen but that had not been replaced in SVN. I've attached a patch for them. The one in font.cc is rather harmless, but I included it as well.


Edit:
tabfile and log are also used by makeobj, so makeobj also needs dr_fopen. dr_fopen is now implemented in simsys.cc, while I implemented it in simio.cc, and I think this is the reason why: simsys.cc has too many other irrelevant dependencies as far as makeobj is concerned.
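
For anyone following along, this is roughly what a dependency-free wrapper of this kind can look like. It is only a sketch and not the code in simsys.cc or my patch; the name utf8_fopen is made up, and it assumes paths and modes are passed around as UTF-8. On Windows it widens both arguments with MultiByteToWideChar and calls _wfopen; elsewhere it simply forwards to fopen.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <wchar.h>
#include <windows.h>
#endif

// Hypothetical stand-in for a dr_fopen-style wrapper taking UTF-8 arguments.
std::FILE *utf8_fopen(const char *utf8_path, const char *mode)
{
#ifdef _WIN32
	// Convert a UTF-8 string to UTF-16 for the wide-character CRT call.
	auto widen = [](const char *s) {
		const int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, nullptr, 0);
		std::wstring w(n > 0 ? n : 1, L'\0');
		if (n > 0) {
			MultiByteToWideChar(CP_UTF8, 0, s, -1, &w[0], n);
		}
		return w;
	};
	return _wfopen(widen(utf8_path).c_str(), widen(mode).c_str());
#else
	// Assumes the platform's file APIs accept UTF-8 byte strings as-is.
	return std::fopen(utf8_path, mode);
#endif
}
```

Because it only needs the C runtime (plus windows.h on Windows), something like this could live in a small file that both the game and makeobj link against.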

DrSuperGood

In retrospect I should have just made a simfilesystem.h and used that. However, there were already some functions like what was needed, so that is why I chose to put them where I did. I also was not aware of how makeobj was coupled.

A separate simfilesystem.h could have been implemented with separate c files for the various platforms, reducing macro usage. Potentially one of them could have even been made to couple with C++ filesystem API, if that is Unicode compliant on Windows.

prissi

The one for debug purposes in display does not need dr_fopen, since it will open C:\font.bdf. That will always work ... But it is obsolete code. The rest is in, thank you.