
Unicode always

Started by Ters, September 03, 2015, 07:24:48 PM



Ters

Although discussions on Unicode have taken place in two other topics over the last few weeks, they don't really belong there. I'll therefore raise another Unicode issue here. Windows 10, and maybe Windows 8, which I have never used, uses a non-ASCII character (an n-dash) in the default suffix added to the file name when creating a copy of a file in the same directory, at least in my locale. When this is done to a savegame file, Simutrans doesn't show the name correctly.

Update:
Here is a web page by someone else who has also been thinking about the best way of doing text processing and has come to pretty much the same conclusion I have. Mingw may add some extra challenges, so some of his suggestions may need workarounds.

Ters

When converting between the internal UTF-8 char strings and the wchar_t strings used by Windows' Unicode API, one needs to use MultiByteToWideChar and WideCharToMultiByte. (codecvt_utf8_utf16 is new in C++11 and not available on mingw; mbstowcs and wcstombs operate on locales, and there are apparently no UTF-8 locales on Windows.) It would be nice to have a pair of helper functions for converting between std::string and std::wstring, since calling MultiByteToWideChar or WideCharToMultiByte is not exactly a one-liner. But where should such helper functions go? simsys_w.cc is only used for GDI builds, not for SDL on Windows.
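A minimal sketch of what such a helper pair might look like; the names utf8_to_wide/wide_to_utf8 are illustrative, not from the Simutrans source. On Windows it wraps MultiByteToWideChar/WideCharToMultiByte; for other platforms, a naive hand-rolled decoder for one direction is shown (no validation, unlike the Windows API):

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Windows: let the OS do the conversion, with all its validity checks.
std::wstring utf8_to_wide(const std::string &in)
{
	if(  in.empty()  ) {
		return std::wstring();
	}
	// First call asks for the required length, second call converts.
	const int len = MultiByteToWideChar(CP_UTF8, 0, in.c_str(), (int)in.size(), NULL, 0);
	std::wstring out(len, L'\0');
	MultiByteToWideChar(CP_UTF8, 0, in.c_str(), (int)in.size(), &out[0], len);
	return out;
}

std::string wide_to_utf8(const std::wstring &in)
{
	if(  in.empty()  ) {
		return std::string();
	}
	const int len = WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(), NULL, 0, NULL, NULL);
	std::string out(len, '\0');
	WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(), &out[0], len, NULL, NULL);
	return out;
}
#else
// Elsewhere: a naive UTF-8 decoder (wchar_t is UTF-32 on Linux).
std::wstring utf8_to_wide(const std::string &in)
{
	std::wstring out;
	for(  size_t i = 0; i < in.size(); i++  ) {
		const unsigned char c = (unsigned char)in[i];
		wchar_t cp;
		int extra;
		if(       c < 0x80  ) { cp = c;        extra = 0; } // ASCII
		else if(  c < 0xE0  ) { cp = c & 0x1F; extra = 1; } // 2-byte sequence
		else if(  c < 0xF0  ) { cp = c & 0x0F; extra = 2; } // 3-byte sequence
		else                  { cp = c & 0x07; extra = 3; } // 4-byte sequence
		while(  extra-- > 0  &&  ++i < in.size()  ) {
			cp = (cp << 6) | ((unsigned char)in[i] & 0x3F);
		}
		out += cp;
	}
	return out;
}
#endif
```

As noted above, deferring to the Windows API on Windows avoids having to reimplement the error handling by hand.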

prissi

There is utf8_to_utf16() and counterpart in simgraph.cc.

Second point: yes, the conversion into the ANSI locale will fail, since there is no UTF-8 codepage on Windows. However, doing everything in 16 bits will require different functions and a lot of OS-dependent code. Having said that, I have never heard complaints about Japanese filenames, so some code must be working correctly. Maybe we are just missing the characters in the font when using the ANSI codepage?

Ters

I have turned simsys_w.cc fully Unicode, which didn't take much effort. The main challenge is all the various fopen calls and equivalents around the code. Some of them are within libraries, not in Simutrans itself. So far, I have created wrapper functions ufopen and ugzopen in simio.cc. I have put them into use in loadsave.cc, and also adapted searchfolder.cc to Unicode. With that, the Unicode-named saved game shows up more correctly and loads. The n-dash is not displayed, but what follows it is. Before, the name stopped with three dots where the n-dash was. The dialog doesn't show the details (pak set and date) after the button, though, so something is still amiss.

Quote from: prissi on September 06, 2015, 10:17:46 PM
There is utf8_to_utf16() and counterpart in simgraph.cc.
It only does a single character. And it's in unicode.cc, which might be a great place to put string variants. Though I might still use the Windows API for conversion, at least initially, since it likely has all the error and safety checks in place.

Quote from: prissi on September 06, 2015, 10:17:46 PM
Having said that, I have never heard complaints about Japanese filenames, so some code must be working correctly.
The file names are probably handled in some multi-byte encoding. That's the only way Win9x operated over there. Speaking of very different character sets: how does it work if a Japanese and a European player play a networked game together? Is the encoding of text in save games and in the protocol defined to be anything in particular?

prissi

Yes, there are two encodings: the names of towns and stations, which are generated by the random number generator and hence need to be the same on all systems (hence they will be Japanese), and then the GUI language, which is used on the player's side.

Apart from the filename stuff, my first effort with Simutrans was to enable Unicode for display and strings. (Also, Win2k and up used Unicode for Japanese filenames.) The Japanese games are saved with UTF-8 strings, which display as garbage on screen. (However, very few people ever open their savegame folder with Explorer.) So they have the same "name" as their Unix counterparts ...

Without seeing your patch: I think there must be a way to detect UTF-8 strings (like looking at the savegame version?) and display older filenames correctly. I assume you put all the file-opening stuff into simsys.cc?

Ters

Quote from: prissi on September 07, 2015, 08:22:31 PM
Yes, there are two encodings: the names of towns and stations, which are generated by the random number generator and hence need to be the same on all systems (hence they will be Japanese), and then the GUI language, which is used on the player's side.
I remember from earlier work that translation files can be either UTF-8 or some plain 8-bit encoding. Does this reflect the encoding used by the game, or will both end up as UTF-8 internally?

Quote from: prissi on September 07, 2015, 08:22:31 PM
Apart from the filename stuff, my first effort with Simutrans was to enable Unicode for display and strings. (Also, Win2k and up used Unicode for Japanese filenames.) The Japanese games are saved with UTF-8 strings, which display as garbage on screen. (However, very few people ever open their savegame folder with Explorer.) So they have the same "name" as their Unix counterparts ...
I think NTFS uses Unicode for file names, so it should actually be from NT 3.1. That Simutrans already leaks UTF-8 into places not expecting UTF-8 will be a challenge.

Quote from: prissi on September 07, 2015, 08:22:31 PM
Without seeing your patch: I think there must be a way to detect UTF-8 strings (like looking at the savegame version?) and display older filenames correctly. I assume you put all the file-opening stuff into simsys.cc?
I put the file-opening stuff in simio.cc because the name seemed fitting. There isn't much in that file, so needing to include simio.h doesn't drag a lot of unnecessary stuff into places that just need to open a file. There isn't much to show with regard to what little I've done on Unicode paths so far. It's just a wrapper each for fopen and gzopen, plus some conversion in searchfolder.cc that could use some tidying first. None of it can be patched in without doing the rest. I can make a patch for simsys_w.cc, which seems independent of everything else, but on its own it's not that important.

prissi

Since it is platform-dependent stuff, it should go into simsys. About simio.*, I think it is time for these files to go ...

Just make a function called system_filename_to_utf(const char *), returning a const char *, in simsys or so, which either returns a pointer to a static string (on Windows) or the string itself (on other systems).

Converting all other stuff to unicode just requires a change in the translator (setting the unicode flag).

Ters

Quote from: prissi on September 08, 2015, 09:15:49 AM
Just make a function called system_filename_to_utf(const char *), returning a const char *, in simsys or so, which either returns a pointer to a static string (on Windows) or the string itself (on other systems).

That won't do. There is no way to represent a Unicode filename on Windows using chars. On Windows, one must use wchar_t, and therefore _wfopen rather than fopen, gzopen_w rather than gzopen (both available on Windows only) and _wfindfirst/_wfindnext/_wfinddata_t rather than _findfirst/_findnext/_finddata_t (here all are Windows only). system_filename_to_utf can be used in searchfolder.cc, but it needs a companion for going the other way as well.

DrSuperGood

QuoteThere is no way to represent a Unicode filename on Windows using chars.
There is (or should be?), by using the multi-byte compiler flag. At least in MSVC, according to what I read of the API.

That said, as I mentioned in other threads, Microsoft came to the conclusion that standardizing on single-width wide characters would be easier than supporting multi-byte characters or multi-width wide characters. To the point that Windows 10 apps are required to use only single-width wide characters and do not support multi-width characters at all.

A big question I am asking is why C strings are used and not the C++ string wrappers. Since the C++ string wrapper classes for normal and wide characters share the same interface, one could use a platform selection macro to choose the type of string to use. On Windows, this would use the wide-character string class implementation. All manipulation of the string is then done by calling the appropriate type-independent methods (instead of C style) so that no other code needs changing. The string consumers (file system API) might also need some platform selection macros, but that can be neatly bundled into a single file, or maybe even separate definition files with the make process choosing the appropriate one.

Ters

Quote from: DrSuperGood on September 08, 2015, 11:21:22 PM
There is (or should be?), by using the multi-byte compiler flag. At least in MSVC, according to what I read of the API.
I need a link for that, because everything I've found indicates that UTF-8 is not natively supported by any API on Windows (except MultiByteToWideChar and WideCharToMultiByte).

Quote from: DrSuperGood on September 08, 2015, 11:21:22 PM
That said, as I mentioned in other threads, Microsoft came to the conclusion that standardizing on single-width wide characters would be easier than supporting multi-byte characters or multi-width wide characters. To the point that Windows 10 apps are required to use only single-width wide characters and do not support multi-width characters at all.
I think there might be some misunderstanding in what is meant by multi-byte. There appears to be a special, partial multi-byte API in addition to the "ANSI" API and the "wide" Unicode API. In the cases where I have located all three versions of the same functionality (such as https://msdn.microsoft.com/en-us/library/78zh94ax.aspx), only the multi-byte versions are deprecated. UTF-8 is not considered among these multi-byte encodings. They can't really deprecate it, because it's the encoding of the Internet. Deprecating multi-byte encodings seems to be about things like Shift JIS, though it is odd that Windows 1252 is not going to die (it is not multi-byte, but as far as I know just as bad as Shift JIS). Since MultiByteToWideChar and WideCharToMultiByte are the way to convert to/from UTF-8, they are not slated for removal, despite their names and original use with multi-byte encodings.

Quote from: DrSuperGood on September 08, 2015, 11:21:22 PM
A big question I am asking is why C strings are used and not the C++ string wrappers. Since the C++ string wrapper classes for normal and wide characters share the same interface, one could use a platform selection macro to choose the type of string to use. On Windows, this would use the wide-character string class implementation. All manipulation of the string is then done by calling the appropriate type-independent methods (instead of C style) so that no other code needs changing. The string consumers (file system API) might also need some platform selection macros, but that can be neatly bundled into a single file, or maybe even separate definition files with the make process choosing the appropriate one.
Once more: you can't just switch between string and wstring. Since wstring can't be transparently converted to char arrays, every place a string is passed to some other API (SDL, network protocols, file formats, the script engine) would need a call to a function that conditionally converts to and/or from wide strings. I believe the best thing is to use char and convert at just those few places dealing with Windows-specific APIs. So does the site I linked to in the first post. In either case, switching between fopen and _wfopen is needed, though.

DrSuperGood

QuoteOnce more: you can't just switch between string and wstring. Since wstring can't be transparently converted to char arrays, every place a string is passed to some other API (SDL, network protocols, file formats, the script engine) would need a call to a function that conditionally converts to and/or from wide strings. I believe the best thing is to use char and convert at just those few places dealing with Windows-specific APIs. So does the site I linked to in the first post. In either case, switching between fopen and _wfopen is needed, though.
Internal strings can probably remain as normal single-width characters, because one should not use Unicode for them anyway (they are not localized). Path strings should use whatever the OS file system is happiest supporting (normal char for Linux, I think, and wide char for Windows). Localized strings could be pushed through some universal intermediate form (e.g. UTF-8) and converted where and when needed (so they would probably be of char type).

This would be implemented by a type definition for the file path type, with all file path strings using that type. Hopefully the result would be that the same code (via C++ strings) could be re-used for both types of string.
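A sketch of that type-definition idea; the names path_string, NATIVE_PATH and join_path are hypothetical, nothing here is from the Simutrans source:

```cpp
#include <string>

// Select a platform-native path string type at compile time.
#ifdef _WIN32
typedef std::wstring path_string;  // wide characters for the Windows API
#define NATIVE_PATH(s) L##s       // literals become wide on Windows
#else
typedef std::string path_string;   // UTF-8 bytes for POSIX systems
#define NATIVE_PATH(s) s
#endif

// Path-manipulating code is written once against the shared
// std::basic_string interface and compiles for either type.
path_string join_path(const path_string &dir, const path_string &file)
{
	path_string result = dir;
	result += NATIVE_PATH("/");
	result += file;
	return result;
}
```

This is the proposal as stated; the reply below argues that the conversions at every non-path boundary are the real cost of this approach.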

Ters

Internal strings might as well be UTF-8 too, since it probably doesn't make any difference (ASCII is valid UTF-8, as well as valid Windows 1252 and, with one exception, even Shift JIS), and it saves having to enforce a barrier between internal and localized strings. And Simutrans hardly deals with paths at all, and when it does, at least part of that path is shown to or comes from the user. So in either case, they must be converted at some point, and the external API called will be different on Windows. Even though I've just come home from a conference where someone argued that strings should not be just strings, I'm not going to bother with making a path class myself for this experiment in making Simutrans fully Unicode, rather than optionally partially Unicode.

prissi

OK, I see the problem of needing different functions. But file opening should work using a short name, converted from UTF-8 by
UTF-8 -> WCHAR
WCHAR -> GetShortPathNameW() -> ANSI
and returning that ANSI string. Then the ugliness will all be hidden in the fopen call and does not come to light. However, that function will fail on creation of a new file, so it may need to create such a file first.

I therefore suggest the following: const char *system_filename_to_utf( const char *name, bool create=true ), using the above mechanism for filenames. For directories, one can use the short name even with a Japanese user name (which otherwise fails to start Simutrans in my German Windows 7). So I wanted to submit this (if the svn had not failed):

char const* dr_query_homedir()
{
	static char buffer[PATH_MAX+24];

#if defined _WIN32
	WCHAR bufferW[PATH_MAX+24], bufferW2[PATH_MAX+24];
	if(  SHGetFolderPathW(NULL, CSIDL_PERSONAL, NULL, SHGFP_TYPE_CURRENT, bufferW)  ) {
		// SHGetFolderPathW failed: fall back to reading the shell folder from the registry
		DWORD len = sizeof(bufferW); // RegQueryValueExW wants the buffer size in bytes
		HKEY hHomeDir;
		if(  RegOpenKeyExA(HKEY_CURRENT_USER, "Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Shell Folders", 0, KEY_READ, &hHomeDir) != ERROR_SUCCESS  ) {
			return 0;
		}
		RegQueryValueExW(hHomeDir, L"Personal", 0, 0, (LPBYTE)bufferW, &len);
	}
	// the short name is needed to access multibyte user directories via ASCII names ...
	wcscat( bufferW, L"\\Simutrans" );
	CreateDirectoryW( bufferW, NULL ); // must create it, because otherwise the short name does not exist
	GetShortPathNameW( bufferW, bufferW2, PATH_MAX+24 ); // buffer length in WCHARs, not bytes
	WideCharToMultiByte( CP_UTF8, 0, bufferW2, -1, buffer, MAX_PATH, NULL, NULL );
#elif ...

It is not the nicest way, having single-byte encoded filenames on Windows, but it works in my tests.

Ters

I know of that way of avoiding/hiding Unicode, but it feels like moving backwards into the future, dragging one's feet. Since this is not an acute problem, I'd like to try to see what a full solution would be like, including whether it is practical at all. This might take some time, though.

prissi

It is pressing, as any user with a Japanese user name (or any other name exceeding the current code page) will not be able to access the My Documents directory using the current Simutrans revision. At least for that snippet of code up there.

Ters

Quote from: prissi on September 11, 2015, 09:06:22 PM
It is pressing, as any user with a Japanese user name (or any other name exceeding the current code page) will not be able to access the My Documents directory using the current Simutrans revision. At least for that snippet of code up there.

As far as I know, the path handling in Simutrans hasn't changed in years, and nobody has complained before that I have noticed.

prissi

The default characters are question marks. Hence, if there is only a single user on that machine, it will work by accident. But I tried it with two users on my Windows 7, and the user with the Japanese user name got into trouble.

Ters

I've finally had some time and energy to look into what has been committed related to this, and there appears to be much left to do.

First of all, why does searchfolder only use the Unicode API and MultiByteToWideChar/WideCharToMultiByte when compiling with MSVC, while dr_system_filename_to_uft8/dr_utf8_to_system_filename call MultiByteToWideChar/WideCharToMultiByte when compiled for Windows (MSVC and mingw)?

On Windows, searchfolder appears to work only when using mingw, and then only by accident. Mingw works because the path remains Latin-1 encoded, and Simutrans automatically decodes misencoded UTF-8 as Latin-1. When using the _MSC_VER code, the path is encoded as UTF-8, but this UTF-8 string is passed directly to fopen in pakselector_t::check_file, which fails.

DrSuperGood

Quote
First of all, why does searchfolder only use the Unicode API and MultiByteToWideChar/WideCharToMultiByte when compiling with MSVC, while dr_system_filename_to_uft8/dr_utf8_to_system_filename call MultiByteToWideChar/WideCharToMultiByte when compiled for Windows (MSVC and mingw)?
The MSVC project uses wide characters, so the Windows API automatically defaults to wide characters. There are macros in Windows.h and its module headers which do this.

Quote
On Windows, searchfolder appears to work only when using mingw, and then only by accident. Mingw works because the path remains Latin-1 encoded, and Simutrans automatically decodes misencoded UTF-8 as Latin-1. When using the _MSC_VER code, the path is encoded as UTF-8, but this UTF-8 string is passed directly to fopen in pakselector_t::check_file, which fails.
MSVC builds appear to work fine on my computer. Then again, almost all my folders use English characters (US/UK Eng), so I cannot vouch for other language characters.

Based on Microsoft's guidelines, all modern Windows applications should only be using wide character sets. All new (since Vista) Windows API functions only accept wide characters. The way strings and paths are handled in general by Simutrans is kind of iffy and could even be prone to buffer overflows.

Ters

Quote from: DrSuperGood on January 31, 2016, 07:41:40 PM
The MSVC project uses wide characters, so the Windows API automatically defaults to wide characters. There are macros in Windows.h and its module headers which do this.
[...]
Based on Microsoft's guidelines, all modern Windows applications should only be using wide character sets. All new (since Vista) Windows API functions only accept wide characters. The way strings and paths are handled in general by Simutrans is kind of iffy and could even be prone to buffer overflows.

This has in essence nothing to do with char vs wchar_t. The only relation is that Windows doesn't expose Unicode through anything but its wide API, and today you should always use Unicode in some form. The world has standardized on UTF-8, which unfortunately didn't exist when the Win32 API was made. I've said it before: it is counterproductive for Simutrans to switch all its string processing from UTF-8 to UTF-16 for Windows builds, duplicating a lot of its string processing code and having to convert between the two on all (de)serialization (saving, loading, networking), just because it does a few file system operations (fopen, gzopen, findfile) that should be Unicode. Simutrans must use the wide Win32 API, and Microsoft's non-standard extensions to the C runtime, though, which is probably good enough for Microsoft.

Quote from: DrSuperGood on January 31, 2016, 07:41:40 PM
MSVC builds appear to work fine on my computer. Then again, almost all my folders use English characters (US/UK Eng), so I cannot vouch for other language characters.

If you don't use characters beyond codepoint 127, then this is completely irrelevant to you (as a user at least).

DrSuperGood

Quote
The world has standardized on UTF-8, which unfortunately didn't exist when the Win32 API was made.
As far as I can understand from MSDN, UTF-8 is supported except for the Windows API calls. You can even declare UTF-8 constants. As such, if UTF-8 is used internally, then all that would be required is boundary conversion to UTF-16 when interacting with the Windows API.

Microsoft recommends against UTF-8 to simplify programming. UTF-8 has many programming problems, such as finding the length of a string in characters and determining how much memory to allocate for a string. Furthermore, the memory saved and the possible performance gains are often insignificant in this day and age for most tasks. As such, Microsoft has tried to promote fixed-width character encoding. Mind you, fixed width was Java's solution for strings for a long time, with only next year's release changing the internal String representation to variable-width characters for performance reasons.

One solution I mentioned in the past would be to globally use C++-style strings instead of C strings. One could then change the entire internal representation of strings with compiler macros from a single file. Windows would then be set to use its 16-bit fixed-width wide characters, which can be trivially passed to API calls. Linux can keep using UTF-8 as long as its APIs support it. Save/load I/O would take this platform-dependent string and internally recode it into the appropriate form, in a way similar to how endianness is handled. As far as programmers go, you would not have to care about the internal representation of strings.

As it is, much is wrong with the platform ports. Mac and Windows builds should be using folder names with the first letter capitalized, as is standard on those operating systems (Linux is small letters only).

prissi

Ok, I tested with MSVC for paths with Japanese characters (like a Japanese user name) and it worked fine. But as you wrote, it may have done so by lucky coincidence.

The trouble is that the functions used did not work at first for folder names. But I am happy if you could clean this mess up properly. (I would like to do this too, but I am seriously lacking time this month and the next.)

Ters

Quote from: DrSuperGood on January 31, 2016, 10:56:29 PM
As far as I can understand from MSDN, UTF-8 is supported except for the Windows API calls. You can even declare UTF-8 constants.
What else is there but the Windows API? Declaring UTF-8 constants has nothing to do with it. Firstly, that's a compiler thing. Secondly, of course you can declare UTF-8 constants! UTF-8 was designed to pass through anything that handles ASCII as 8-bit, as long as the contents don't matter (beyond the special characters within the ASCII range, like null, newline, quotes, slashes, etc.). That's why Linux uses UTF-8 to support Unicode. They hardly had to change anything, unlike Windows, which has doubled its API.

Quote from: DrSuperGood on January 31, 2016, 10:56:29 PM
Microsoft recommends against UTF-8 to simplify programming. UTF-8 has many programming problems, such as finding the length of a string in characters and determining how much memory to allocate for a string. Furthermore, the memory saved and the possible performance gains are often insignificant in this day and age for most tasks. As such, Microsoft has tried to promote fixed-width character encoding. Mind you, fixed width was Java's solution for strings for a long time, with only next year's release changing the internal String representation to variable-width characters for performance reasons.
Both Windows and Java use UTF-16, which has exactly the same problems as UTF-8. Except that UTF-16 can lull you into a false sense of security, since only Asian languages will ever require more than one element to encode a character. Back when Windows got its wide API, 16 bits were thought to be enough. If you want simple string operations, we need to move to UTF-32, but Windows doesn't support that. Linux uses UTF-32 in its definition of wchar_t. But UTF-32 is rarely used, because it wastes a lot of space for almost every language.
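The variable-width point can be illustrated with a character outside the Basic Multilingual Plane, here U+1F600 (this sketch assumes a C++11 compiler for the char16_t/char32_t literals, which, as noted earlier in the thread, older mingw may lack):

```cpp
// U+1F600 in the three Unicode encoding forms. Element counts, excluding
// the terminating zero, are 4, 2 and 1 respectively: neither UTF-8 nor
// UTF-16 lets you equate code units with characters, only UTF-32 does.
const char     u8str[]  = "\xF0\x9F\x98\x80"; // UTF-8: four code units
const char16_t u16str[] = u"\U0001F600";      // UTF-16: a surrogate pair
const char32_t u32str[] = U"\U0001F600";      // UTF-32: one code unit
```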

Quote from: DrSuperGood on January 31, 2016, 10:56:29 PM
One solution I mentioned in the past would be to globally use C++-style strings instead of C strings. One could then change the entire internal representation of strings with compiler macros from a single file. Windows would then be set to use its 16-bit fixed-width wide characters, which can be trivially passed to API calls. Linux can keep using UTF-8 as long as its APIs support it. Save/load I/O would take this platform-dependent string and internally recode it into the appropriate form, in a way similar to how endianness is handled. As far as programmers go, you would not have to care about the internal representation of strings.
It's not the API calls that need to be trivial. They are the lesser part of the code, more so considering that most of it already has the proper conversions in place. All that is missing for full Unicode is conversion at the calls to fopen and gzopen, and changing #ifdef _MSC_VER to #ifdef _WIN32 in searchfolder.cc.

Quote from: prissi on January 31, 2016, 11:13:28 PM
Ok, I tested with MSVC for paths with Japanese characters (like a Japanese user name) and it worked fine. But as you wrote, it may have done so by lucky coincidence.
Maybe Japanese is treated differently when using the 8-bit API, or when doing conversions to short paths. When using the non-_MSC_VER code in searchfolder.cc, æøå ended up inside Simutrans encoded as Latin-1 (three bytes). It displayed correctly, likely due to the fallback(s) in unicode.cc when encountering illegal byte sequences. When this was passed back to fopen, which expects Latin-1, it naturally worked. When using the _MSC_VER code, æøå ended up encoded as UTF-8 inside Simutrans (six bytes). When passed to fopen, it failed.

DrSuperGood

Quote
Both Windows and Java use UTF-16, which has exactly the same problems as UTF-8. Except that UTF-16 can lull you into a false sense of security, since only Asian languages will ever require more than one element to encode a character. Back when Windows got its wide API, 16 bits were thought to be enough. If you want simple string operations, we need to move to UTF-32, but Windows doesn't support that. Linux uses UTF-32 in its definition of wchar_t. But UTF-32 is rarely used, because it wastes a lot of space for almost every language.
I retract my statement about fixed width. Windows API functions marked as Unicode aware are fully UTF-16 compliant. They will correctly process UTF-16 characters, including those from supplementary planes 1-16.

The problem with the Windows API and UTF-8 is that UTF-8 was never really supported. Instead, an API known as "code pages" was used, which is now marked as legacy. According to MSDN, all new Windows applications should be using wide characters for UTF-16 support.

Ters

Quote from: DrSuperGood on February 01, 2016, 04:42:23 PM
The problem with the Windows API and UTF-8 is that UTF-8 was never really supported. Instead, an API known as "code pages" was used, which is now marked as legacy. According to MSDN, all new Windows applications should be using wide characters for UTF-16 support.

Exactly. But Simutrans is not a Windows application. It's a cross-platform application, one that already knows how to process UTF-8. I'm working towards eliminating all use of the non-wide Windows API. The direct use has been taken care of, I think. It's just a bit of indirect usage through the C runtime and zlib left.

Ters

I was a bit surprised and disappointed by the number of chdir calls in the code, but I have now managed to put together something that at least starts up from a directory containing a mix of Norwegian and Japanese characters, loads a pak set containing Norwegian characters, and loads a game with a name containing Japanese characters. (The game name wasn't displayed right, but I think that is due to the font.) There might be some missed I/O calls, though.

I have not (yet) touched freetype font loading, png loading, internal squirrel code and code only used by makeobj.

I have tested it with mingw and 32-bit mingw64, but it needs additional testing, at least with MSVC and on other OSes. Proper testing requires non-ASCII characters in the paths Simutrans deals with.

prissi

The logic of simsys would dictate that all UTF functions are named dr_fopen, dr_chdir, ... and so on.

About the folder path, I am not sure, because that path has to be handed down into the library, which uses fopen internally. (Apart from the fact that there is no Windows installation where a font path would contain a character above 127.)

Other than that, thank you.

Ters

Quote from: prissi on February 06, 2016, 10:45:38 PM
The logic of simsys would dictate that all UTF functions are named dr_fopen, dr_chdir, ... and so on.

Well, I didn't originally put them there. It was a quick move from simio.cc, since you didn't like that I put them there. Maybe I was also so distracted by other inconsistencies in the coding standard used that I never thought about the naming.

Quote from: prissi on February 06, 2016, 10:45:38 PM
About the folder path, I am not sure, because that path has to be handed down into the library, which uses fopen internally. (Apart from the fact that there is no Windows installation where a font path would contain a character above 127.)

I wasn't quite sure if it loads OS fonts or fonts from inside Simutrans' directory, or maybe both. And until I could figure that out, I didn't check whether it had special wide-path functions like zlib luckily has.