News:

Simutrans Chat Room
Where cool people of Simutrans can meet up.

(Windows) user_dir and program_dir cannot handle multibyte path name

Started by z9999+, September 25, 2017, 11:13:46 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

z9999+

As reported by fasa on his help request topic, if user_account_name or/and path_to_program_dir includes multibute path name, recent versions of simutrans (since 120.1.1 ?) cannot return correct path string on Windows.
It seems that ANSI short path name are used for user_dir and program_dir. But current simutrans returns utf-8 encoded short path name which is not valid.

For example, if I install simutrans in d:\表示\ folder, simutrans try to read/write from d:\陦ィ遉コ\ folder on Japanese Windows, and it failes. As a result simutrans doesn't start as fasa reported.

For example, if user_account_name is 表示, simutrans try to read/write from C:\Users\陦ィ遉コ\DOCUME~1\SIMUTR~1\ on Japanese Windows, and it failes.
In this case, simutrans itself will start, but cannot save the game. (It is strange that there is no error message and shows 'saved'. But this is a different problem.)

表示 (ANSI, Shift_JIS) is 95 5C 8E A6
表示 (utf-8) is E8 A1 A8 E7 A4 BA
陦ィ遉コ (ANSI, Shift_JIS) is E8 A1 A8 E7 A4 BA

Ters

I have provided a patch that should cope with this by using Window's and Microsoft's C runtime's Unicode API, and converting to/from the UTF-8 strings used internally in Simutrans. It was deemed unnecessary as the path shortening worked just fine...

Not to mention that short paths is a feature that can be disabled on newer Windowses.

DrSuperGood

What you are saying does not make any sense. Windows API internally uses UTF-16 "Unicode", at least on modern OS versions. Legacy calls to code page "multi byte" functions, of which none are used, convert their input into UTF-16 before calling the Unicode windows API. Simutrans will not look for a multi byte path using ANSI or other multi-byte encoding schemes. This arrangement by Microsoft is fully backwards compatible for old multi byte applications while supporting modern Unicode applications.

Could you confirm what OS version (XP, 7, 8 or 10) and file system (FAT or NTFS) you are using?

I have noticed the following errors in the code...
In simu_main the logic to get the working directory is wrong when built with MSVC (does not apply to GCC and other builds, I think, unless they somehow use Direct.h...). MSVC's original implementation of getcwd was deprecated and replaced with a more ISO conformant _getcwd. I cannot find documentation for the behaviour of the old getcwd so it is possible it is returning in native multi byte rather than as UTF8 which would explain why the working path string is nonsense. Again this only effects MSVC self-builds, with GCC used for releases and nightly using a different and more conformant implementation of getcwd.

In dr_query_homedir the logic for the user documents file path is contradictory. The path gets converted to short form, which limits folder and file names to 8.3 form. However then the path is used to create a directory path called "screenshot" which is not a valid short form directory name (name is 10 characters when limit is 8 ). Seeing how it needs to create a folder "Simutrans" which is 9 characters and again not a valid 8.3 name I have to question the entire purpose of using short form at all since that code clearly will not run without long file name awareness. As your example path shows short form this might possibly be a problem? It could well be a problem on some FAT partitions, like common USB sticks.

I have fixed the above problems in a patch. MSVC will now use the correct version of getcwd, _getcwd. The dr_query_homedir method will no longer try to resolve a short form path as it is extremely unlikely that functionality was working correctly in the first place and correct implementation would likely be pointless as I doubt many people run Simutrans from a DOS FAT partition. Please check if the issue persists.

Ters

The problem with using the multi-byte functions is that the strings used with them is mixed with other string inside Simutrans, which is UTF-8. As I understood prissi, he used the short paths because they are in pure ASCII (which seems to be true from a quick test).

DrSuperGood

QuoteThe problem with using the multi-byte functions is that the strings used with them is mixed with other string inside Simutrans, which is UTF-8. As I understood prissi, he used the short paths because they are in pure ASCII (which seems to be true from a quick test).
Which is why no multi byte functions are used. Instead only Unicode windows API functions are used.

From what I can tell, the short name part was a hack to try and support Unicode or other locale code pages when using the old code page based API as the short file names might all use the same code page to encode the characters.

The old, deprecated, API is "multi-byte". These used what was referred to as code pages to encode characters. Microsoft decided to dump this internally with the rise of Unicode as dealing with code pages is confusing as each code page may only have a subset of all desired characters and conversion may be lossy.

The new API is "Unicode". All the "W" postfix functions follow this. Characters are encoded as UTF-16 Unicode. Although Microsoft claims that each character is exactly 2 bytes (16 bits), it is actually variable length encoded as it is Unicode. Internally modern Windows OS work using this, including all file system operations, with the old non W postfixed functions internally converting the input to UTF-16 and then passing it to a W postfixed equivalent.

Simutrans uses UTF-8 internally. As far as Windows is concerned, UTF-8 is a multi-byte code page. Since modern Windows OS can only understand UTF-16 Unicode, all internal UTF-8 strings have to first be converted to UTF-16 before being passed to the API. UTF-8 and UTF-16 are both fully Unicode compliant meaning conversion between each other is completely lossless, for normal applications. It is not possible to pass UTF-8 strings to the old multi-byte API because there is no encoding context so it will try to interpret them as being encoded with a different, usually default, code page when converting to UTF-16 internally resulting in loss of characters or garbage output. Microsoft likely choose UTF-16 over simply switching to universally assuming a UTF-8 code page encoding for backwards compatibility, ease of spotting encoding compatibility problems and the impact to performance is likely trivial for most applications especially when some locales need at least 2 bytes per character anyway.

As far as I can tell all appropriate conversion is already implemented and occurring. The W postfixed functions receive their UTF-16 encoded forms of the internal UTF-8 strings and all UTF-16 results are encoded into UTF-8 for internal use. Exception being the 2 things mentioned in my last post, where with MSVC one of the API calls was to an old deprecated function which might not return UTF-8 (still not sure if the new one does...) and the other to the short file name conversion function which might mangle encoding or fail as it was not really designed for Unicode use.

I will investigate further, using some of the Kanji (I think that is what they are?) provided earlier to test.

EDIT:
Checked self-built MSVC nightly and had no problem starting Simutrans with a working directory of "D:\SimutransBuild\simutrans\simutrans\表示\simutrans". If the nightly still crashes due to parsing the working directory it must be a GCC compatibility problem.

Ters

Quote from: DrSuperGood on September 26, 2017, 08:51:14 PM
As far as I can tell all appropriate conversion is already implemented and occurring. The W postfixed functions receive their UTF-16 encoded forms of the internet UTF-8 strings and all UTF-16 results are encoded into UTF-8 for internal use. Exception being the 2 things mentioned in my last post, where with MSVC one of the API calls was to an old deprecated function which might not return UTF-8 (still not sure if the new one does...) and the other to the short file name conversion function which might mangle encoding or fail as it was not really designed for Unicode use.

Only when dealing with the GUI does Simutrans do UTF-8<->UTF-16 translation, and I don't think even that was 100%. When dealing with the everything related to the file system, Simutrans relies, directly or indirectly, on the ANSI API. (Mostly indirectly, through the C standard library.) The ANSI API can not be set to UTF-8 (unless that changed in Windows 10, but Simutrans doesn't try to force this anyway). This is what my patch was all about. But it has been rejected twice. I am still using this patch personally, so it seems stable.

DrSuperGood

QuoteWhen dealing with the everything related to the file system, Simutrans relies, directly or indirectly, on the ANSI API.
No it does not...char const* dr_query_homedir()
{
static char buffer[PATH_MAX+24];

#if defined _WIN32
WCHAR bufferW[PATH_MAX+24];
if(  SHGetFolderPathW(NULL, CSIDL_PERSONAL, NULL, SHGFP_TYPE_CURRENT, bufferW)  ) {
DWORD len = PATH_MAX;
HKEY hHomeDir;
if(  RegOpenKeyExA(HKEY_CURRENT_USER, "Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Shell Folders", 0, KEY_READ, &hHomeDir) != ERROR_SUCCESS  ) {
return 0;
}
RegQueryValueExW(hHomeDir, L"Personal", 0, 0, (LPBYTE)bufferW, &len);
}
wcscat( bufferW, L"\\Simutrans" );
WideCharToMultiByte( CP_UTF8, 0, bufferW, -1, buffer, MAX_PATH, NULL, NULL );
#elif defined __APPLE__
sprintf(buffer, "%s/Library/Simutrans", getenv("HOME"));
#elif defined __HAIKU__
BPath userDir;
find_directory(B_USER_DIRECTORY, &userDir);
sprintf(buffer, "%s/simutrans", userDir.Path());
#else
sprintf(buffer, "%s/simutrans", getenv("HOME"));
#endif

dr_mkdir(buffer);

// create other subdirectories
#ifdef _WIN32
strcat(buffer, "\\");
#else
strcat(buffer, "/");
#endif
char b2[PATH_MAX+24];
sprintf(b2, "%smaps", buffer);
dr_mkdir(b2);
sprintf(b2, "%ssave", buffer);
dr_mkdir(b2);
sprintf(b2, "%sscreenshot", buffer);
dr_mkdir(b2);

return buffer;
}
void dr_mkdir(char const* const path)
{
#if defined(_WIN32) && !defined(__CYGWIN__)
WCHAR pathW[MAX_PATH];
MultiByteToWideChar( CP_UTF8, 0, path, -1, pathW, sizeof(pathW) );
CreateDirectoryW( pathW, NULL );
#else
mkdir(path, 0777);
#endif
}
Pleanty of UTF-16 to UTF-8 conversions.

Ters

That is a very small amount of file system interaction, located in simsys.cc alone, which I failed to spend time (I really should be at work by now) pointing out as exceptions (there are also exceptions in searchfolder). But basically every call, scattered throughout the codebase, that opens a file uses fopen, directly or indirectly. fopen operates on Windows' ANSI codepage, which means it fails when given UTF-8 paths derived from UTF-16 paths gotten from Windows' Unicode API. There is also a whole bunch of chdir calls.

DrSuperGood

I see the problem. I will look into a neat solution which keeps internal strings as UTF-8 but calls the UTF-16 Windows APIs when required.

It is worth noting that calls such as fopen are already UTF-8 aware on platforms like Linux since the file systems usually use UTF-8 encoded names and the function passes it through without internal conversion. As far as I can tell it is only really a big problem on Windows because of their code page system requires character conversion so Microsoft added UTF-16 versions of the API rather than deciding to phase out non UTF-8 code pages or even supporting UTF-8 encoding with the API calls in the first place.

My proposed solution would be to implement adapter code for Windows builds that fits the standard API forms but assumes the passed char*/string arguments are UTF-8 encoding. Internally the adapters convert this UTF-8 encoding into UTF-16 and call the Unicode windows API. For Linux and other platforms the adapters directly call the functions with the arguments as no conversion is required, if the adapters require to exist at all. Like wise adapters for APIs that return strings would always return UTF-8 encoded strings, performing internal conversion as required.

This keeps the benefits of using UTF-8 with generic API calls while giving the compatibility of using UTF-16 and specific API calls. Only down side is a potential minor performance impact due to the required UTF-8 and UTF-16 conversion however this should be trivial and written off as a necessary overhead (works as opposed to currently...).

Ters

Just look at my patch. It does pretty much exactly what you write.

prissi

Western version could not handle Unicode user names before 120.1.1. So it seems on repair broke the other.

The problem is that windows on Japanese installation using Fat32 may return a path in ShiftJIS. (Since default japanese locals has codepage 932). Making Unicode out of this multibyte string results in gabarge, since it assumes a wrong codepage. On Western systems I could run Simutrans as well as user directories with Kanji names (or Katakana or whatever) fine. I am not sure, if the only thing missing is setting the correct codepage for the program to some UTF one, instead using the systems/s native one. So in short the MB codepage on japanese sytems might be still returning multibyte strings in CP932 instead Unicode. I will investigate this more on the weekend.

If this is true, then Windows need to get the right codepage for the internal conversion, and not assume 8.3 is Ansi.

Unfourtunately, the documentation states that _setmbcp does not allow to set MB codepage to UTF-8. I think this needs some forther experimentation, since the curretn routines assume the wide char is unicode and not CP932 or something else.

(And you are allowed to mix short and long notation. Win95 used that internally all the time, and hence Winodws 10 still allows it.)

Ters

The correct solution is to simply stay away from all Windows API using "codepages". That is just a backwards compatibility layer for programs with no knowledge of Unicode.

Windows 10 still understands short filenames, but the feature may be disabled on individual file systems. Short filenames are not generated on demand, they are created along with the file and stored in the directory entries. So if the file system does not have them, then Windows can't use them.

DrSuperGood

QuoteMaking Unicode out of this multibyte string results in gabarge, since it assumes a wrong codepage.
So you are saying the Unicode "w" IO functions cannot handle such file names due to a Unicode recoding error? In this case should we even care about it as that would clearly be a Microsoft bug.
QuoteThe problem is that windows on Japanese installation using Fat32 may return a path in ShiftJIS.
No one should be using FAT32 for their main drives, I am pretty sure most people, even in Japan, use NTFS for their main drives. The only modern consumer use of FAT I know of is used is for portable USB memory sticks, and only because Mac and Windows cannot agree to use a more modern format (Linux has no problem mounting Mac and NTFS drives).
QuoteUnfourtunately, the documentation states that _setmbcp does not allow to set MB codepage to UTF-8. I think this needs some forther experimentation, since the curretn routines assume the wide char is unicode and not CP932 or something else.
On Windows one has to use the non-standard "w" function versions of everything IO related which return UTF-16, a Unicode implementation which logically is round trip safe with UTF-8. To implement this in a sane and portable way one needs adapters, eg "utf8fopen" which on Linux and other platforms that generally use UTF-8 directly calls fopen while on Windows does UTF-8 to UTF-16 conversion and then calls wfopen. I can see a big problem currently being that all fopen and such calls are not w versions, with the only reason the game working at all on such platforms being that it skips non-ASCII file paths using some working directory trickery.
Quote(And you are allowed to mix short and long notation. Win95 used that internally all the time, and hence Winodws 10 still allows it.)
Starting with Windows 7 one can disable short notation support.
QuoteThe correct solution is to simply stay away from all Windows API using "codepages". That is just a backwards compatibility layer for programs with no knowledge of Unicode.
Exactly. One wraps all the Unicode versions in UTF-8 adapters and if those have internal bugs then blame Microsoft.

prissi

The main problem are no fopen, but the calls to gzopen for the compression libaries. There are no UTF16 calls for it. So the only way to do this (and the main reason for using not the w functions) is to be able to open savegames with gzopen.

The submitted patch using uft8 for gzopen cannot open filesnames with UTF-8 characters at all. I tried it and it generated other filesnames. The current version can use Kanji in filenames (on western codepages) and they are correct on the disk.

Ok, getcwd does not returns a short path, but an invalid one. "simutransト" is returned as "C\:simutrans?" in western code page and with shiftJIS ト in the japanese codepage. Of course any attempt at accessing this is doomed to fail at this point.

Now added short path conversion for program and user dir to fix this.

DrSuperGood

QuoteThe main problem are no fopen, but the calls to gzopen for the compression libaries. There are no UTF16 calls for it. So the only way to do this (and the main reason for using not the w functions) is to be able to open savegames with gzopen.
Why not use gzopen_w...#if (defined(_WIN32) || defined(__CYGWIN__)) && !defined(Z_SOLO)
ZEXTERN gzFile         ZEXPORT gzopen_w OF((const wchar_t *path,
                                            const char *mode));
#endif
Takes Unicode (UTF-16) string and internally uses Windows Unicode APIs. Mark Adler even recommends it... Better yet since we now live in the future it is right there in the header file, ready for us to use.
QuoteOk, getcwd does not returns a short path
Which is why there is _wgetcwd on Windows, which will return a full Unicode path.

Short paths are not a solution. It is like trying to deliver a package to a distant place and instead of doing things the proper way by using a truck (using Unicode UTF-16 interfaces on Windows builds behind UTF8 wrappers) one goes off and shoots the thing with a trebuchet (very old short paths). The packet gets to its destination vaguely and some times is intact (short paths are available) but other times it is lost or smashed to a million pieces (short paths not available, some problem happens resolving the short path).

Hence the idea would be to make something like "simfilesystem.h" which contains functions like "utf8getcwd" or "utf8fopen" which on Linux and other non-Windows builds simply reference the standard C functions, but on Windows reference _wgetcwd and _wfopen with some UTF-16 and UTF-8 conversion happening. Basically an extension to the standard API that guarantees UTF-8 compliance so that all required platform specific code is located in 1 place for maintainability. By it being its own separate header it means that cross platform can be handled at build script time with a special Windows implementation file and generic implementation file rather than having to deal with a sea of macros (better maintainability).

Ters

Quote from: prissi on September 29, 2017, 05:25:06 AM
The main problem are no fopen, but the calls to gzopen for the compression libaries. There are no UTF16 calls for it.

Yes, there is! My patch uses it.

Quote from: prissi on September 29, 2017, 05:25:06 AM
Ok, getcwd does not returns a short path, but an invalid one. "simutransト" is returned as "C\:simutrans?" in western code page and with shiftJIS ト in the japanese codepage. Of course any attempt at accessing this is doomed to fail at this point.

Simutrans should not use getcwd or _getcwd on Windows, but _wgetcwd. (Followed by UTF-16 to UTF-8 conversion.)

Quote from: prissi on September 29, 2017, 05:25:06 AM
Now added short path conversion for program and user dir to fix this.

Please stop with the short paths. It is not 1994 anymore. Short paths were outdated before Simutrans was even conceived.