Author Topic: Windows Unicode File Path Support  (Read 460 times)

0 Members and 1 Guest are viewing this topic.

Offline DrSuperGood

Windows Unicode File Path Support
« on: October 11, 2017, 04:46:31 AM »
Attached is a preliminary patch that converts Windows builds to fully using Unicode (UTF-16) for all file system related operations. Before it used some arbitrary, locale dependant code page that's limitations were masked with some DOS like 8.3 short path trickery.

The theory is that Simutrans now calls all file path related functions via adaptors. On non-Windows builds these mostly call the standard version of the function as one can mostly assume native UTF-8 support. On Windows builds these perform some UTF-8 translation and call the UTF-16 Windows API functions. As far as Simutrans is concerned all file paths are UTF-8. As far as Windows is concerned all file paths are UTF-16 (not code pages). It is not possible to get Windows to natively process UTF-8 file paths as the APIs were largely designed before UTF-8 (and even UTF-16).

This should allow the Windows build of Simutrans to support paths with names containing any valid Unicode characters. It should also allow Simutrans to run on systems where short path support is disabled.

It will not allow Windows builds of Simutrans to use long file paths or take advantage of Window 10's file path length limit removal feature. Much of the coding work done for this patch was created with such support in mind, but there are still a lot of hard coded path lengths which will implicitly force a limit.

This patch was tested using MSVC2017 where the Simutrans build was seen to work correctly and load in game. I do not know about MinGW or other such Windows builds working. I also do not know if the Linux or other platform builds are working.

The design of the code was made more complex due to limitations of the old C++ standard. I am sure a lot of this nonsense could be avoided if one uses the C++ filesystem class (C++17).

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 4802
  • Total likes: 191
  • Helpful: 108
  • Languages: EN, NO
Re: Windows Unicode File Path Support
« Reply #1 on: October 11, 2017, 05:43:04 AM »
What was wrong with the previous patch that warranted starting over from scratch? It was tested with both mingw32, mingw64 and MSC++. And I've been using it for over a year. (Although I'm not a particularly active player.)

And I don't really see the point of having a U16View class, when all that is needed is the constructor as a simple function.

Offline DrSuperGood

Re: Windows Unicode File Path Support
« Reply #2 on: October 11, 2017, 03:11:52 PM »
Quote
What was wrong with the previous patch that warranted starting over from scratch?
This patch also tries to improve some aspects of path processing so that more generic code can be used. For example it removes the can use recycle bin macros scattered around the place.

I also fixed a bug with the function responsible for sending files to the recycle bin. It allocated an array but deleted a non-array which is undefined behaviour. It also might work more successfully now as it tries to resolve the full path before issuing the delete.

Quote
And I don't really see the point of having a U16View class, when all that is needed is the constructor as a simple function.
Memory management. Otherwise it would be up to the caller to free the memory array used to hold the UTF-16 string which is both error prone and results in procedural coupling. Basically a similar reason why std::string is often used.

In theory one could return a wide string object instead. However this assumes the compiler has a 16bit wide character type, which apparently GCC does not. Also it would mean allocating memory twice, once to convert and then once for the string object as one cannot modify string object buffers directly.

Another solution would be returning a std::vector however one cannot access the underlying array until at least C++11 so again this is not a viable solution.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 4802
  • Total likes: 191
  • Helpful: 108
  • Languages: EN, NO
Re: Windows Unicode File Path Support
« Reply #3 on: October 11, 2017, 04:37:00 PM »
This patch also tries to improve some aspects of path processing so that more generic code can be used. For example it removes the can use recycle bin macros scattered around the place.

That is not a reason to do everything else from scratch, so it does not answer the question.

Memory management. Otherwise it would be up to the caller to free the memory array used to hold the UTF-16 string which is both error prone and results in procedural coupling. Basically a similar reason why std::string is often used.

In theory one could return a wide string object instead. However this assumes the compiler has a 16bit wide character type, which apparently GCC does not. Also it would mean allocating memory twice, once to convert and then once for the string object as one cannot modify string object buffers directly.

Another solution would be returning a std::vector however one cannot access the underlying array until at least C++11 so again this is not a viable solution.

As far as I could tell, it only converts to wchar_t. And since it is sized according to the string passed to it, it can't be and isn't used as a buffer for accepting returned UTF-16 strings either. As a plain wrapper around a basically read-only wchar_t array, std::wstring is better.

Offline DrSuperGood

Re: Windows Unicode File Path Support
« Reply #4 on: October 11, 2017, 08:02:05 PM »
Quote
As far as I could tell, it only converts to wchar_t. And since it is sized according to the string passed to it, it can't be and isn't used as a buffer for accepting returned UTF-16 strings either. As a plain wrapper around a basically read-only wchar_t array, std::wstring is better.
As mentioned before...

wchar_t is platform specific with GCC (used for non MSVC building) defining it as 32bits while MSVC 16bits. This is why modern C++ standards added explicit 16 and 32 bit std::string types for better portability. Secondly to produce the string it would still need to allocate a temporary buffer as the std::string API does not allow one to write directly to the underlying buffer hence it would need 2 allocations instead of 1 (first to make the string, second by std::string apis to store it). If one could use newer C++ standards this will likely not be a problem as there are a ton of new APIs added for memory management so chances are one could pass a more standard/primitive type out and get the same result.

One can always remove that class at a later time when better solutions become available. It is internally used so only 1 file would need to be changed. What is more important is if the path works for compilers and platforms other than MSVC Windows.

Offline TurfIt

Re: Windows Unicode File Path Support
« Reply #5 on: October 12, 2017, 12:02:59 AM »
Code: [Select]
simsys.cc: In function 'char* dr_getcwd(char*, size_t)':
simsys.cc:279:56: error: 'WC_ERR_INVALID_CHARS' was not declared in this scope
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, wpath, -1, buf, (int)size, NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
===> HOSTCXX utils/simstring.cc
simsys.cc:279:56: note: suggested alternative: 'MB_ERR_INVALID_CHARS'
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, wpath, -1, buf, (int)size, NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
                                                        MB_ERR_INVALID_CHARS
simsys.cc: In function 'const char* dr_query_homedir()':
simsys.cc:345:56: error: 'WC_ERR_INVALID_CHARS' was not declared in this scope
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, whomedir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
simsys.cc:345:56: note: suggested alternative: 'MB_ERR_INVALID_CHARS'
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, whomedir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
                                                        MB_ERR_INVALID_CHARS
simsys.cc: In function 'const char* dr_query_fontpath(const char*)':
simsys.cc:393:56: error: 'WC_ERR_INVALID_CHARS' was not declared in this scope
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, fontdir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
simsys.cc:393:56: note: suggested alternative: 'MB_ERR_INVALID_CHARS'
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, fontdir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
                                                        MB_ERR_INVALID_CHARS
utils/searchfolder.cc: In member function 'int searchfolder_t::search_path(const string&, const string&, bool, bool)':
utils/searchfolder.cc:73:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for(  int i=0;  i<path.size();  i++  ) {
                  ~^~~~~~~~~~~~
utils/searchfolder.cc: In static member function 'static std::__cxx11::string searchfolder_t::complete(const string&, const string&)':
utils/searchfolder.cc:158:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for(  int i=0;  i<filepath.size();  i++  ) {
                  ~^~~~~~~~~~~~~~~~

Offline DrSuperGood

Re: Windows Unicode File Path Support
« Reply #6 on: October 12, 2017, 12:19:34 AM »
Ok looks like a misunderstanding of the API. I thought it meant that it had different behaviour on Windows Vista and later, not that it was only added with Windows Vista and later. I guess I will just have to remove that safety check...

Attached is patch with that removed. Sorry for the inconvenience.

Offline TurfIt

Re: Windows Unicode File Path Support
« Reply #7 on: October 12, 2017, 12:44:04 AM »
Compiles now, game does not crash when run. That's about the limit of my interest in this... seems like a deficient OS shirking it's responsibility to handle this conversion junk internally and completely transparent to programs.... Or maybe I should just say 7-bit ASCII FTW. yeesh.

EDIT:
file dates on save games all show as 1969-12-31 19:00 in the load game dialog.
« Last Edit: October 12, 2017, 01:06:54 AM by TurfIt »

Offline DrSuperGood

Re: Windows Unicode File Path Support
« Reply #8 on: October 12, 2017, 05:07:46 AM »
Quote
file dates on save games all show as 1969-12-31 19:00 in the load game dialog.
Used the wrong stat call so the resulting struct was not compatible.

I also missed replacing some function calls. Attached new patch.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 4802
  • Total likes: 191
  • Helpful: 108
  • Languages: EN, NO
Re: Windows Unicode File Path Support
« Reply #9 on: October 12, 2017, 08:47:53 AM »
wchar_t is platform specific with GCC (used for non MSVC building) defining it as 32bits while MSVC 16bits.

Yes, wchar_t is platform specific, which GCC respects. It is 32-bits on Linux and 16-bit on Windows. If the compiler did not follow the platform specific definitions, they would be unable to use the APIs provided by the platform as the arguments wouldn't match.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 8793
  • Total likes: 316
  • Helpful: 229
  • Languages: De,EN,JP
Re: Windows Unicode File Path Support
« Reply #10 on: Today at 01:08:52 AM »
Execptions in Simutrans are useless, since the thearding GCC is build without exceptions.

Morevore, there is absolute no way to enter an invalid character in Win32 (since all input comes via Win32 API), nor could a file name ever have an illegal character (they come both too from the Win2 API).

Furthermore this even does not compile with MSVC2012.
Code: [Select]
line (921)
U16View const wparam((std::string(portable ? "/P /D=" : "/D=") + path_to_program).c_str());
error
1>simsys.cc(921): error C2784: "std::_String_iterator<_Mystr> std::operator +(_String_iterator<_Mystr>::difference_type,std::_String_iterator<_Mystr>)": template-Argument für "std::_String_iterator<_Mystr>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xstring(420): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_String_const_iterator<_Mystr> std::operator +(_String_const_iterator<_Mystr>::difference_type,std::_String_const_iterator<_Mystr>)": template-Argument für "std::_String_const_iterator<_Mystr>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xstring(288): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::move_iterator<_RanIt> std::operator +(_Diff,const std::move_iterator<_RanIt> &)": template-Argument für "const std::move_iterator<_RanIt> &" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1947): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_Array_iterator<_Ty,_Size> std::operator +(_Array_iterator<_Ty,_Size>::difference_type,std::_Array_iterator<_Ty,_Size>)": template-Argument für "std::_Array_iterator<_Ty,_Size>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1801): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_Array_const_iterator<_Ty,_Size> std::operator +(_Array_const_iterator<_Ty,_Size>::difference_type,std::_Array_const_iterator<_Ty,_Size>)": template-Argument für "std::_Array_const_iterator<_Ty,_Size>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1662): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::reverse_iterator<_RanIt> std::operator +(_Diff,const std::reverse_iterator<_RanIt> &)": template-Argument für "const std::reverse_iterator<_RanIt> &" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1226): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_Revranit<_RanIt,_Base> std::operator +(_Diff,const std::_Revranit<_RanIt,_Base> &)": template-Argument für "const std::_Revranit<_RanIt,_Base> &" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1031): Siehe Deklaration von 'std::operator +'

Since there seems to be really a pressure for getting this right using low paths, well, I submitted this one. -r8303
« Last Edit: Today at 01:53:43 AM by prissi »

Offline DrSuperGood

Re: Windows Unicode File Path Support
« Reply #11 on: Today at 02:46:59 AM »
Quote
Morevore, there is absolute no way to enter an invalid character in Win32 (since all input comes via Win32 API), nor could a file name ever have an illegal character (they come both too from the Win2 API).
That is not what the checks are there for. They are more there to make sure that some Simutrans programmer later down the line does not mess up string manipulation operations in a way to generate invalid Unicode, eg by accidently injecting non UTF-8 MultiByte code page characters from unwrapped APIs into the UTF-8 file path strings. A Fatal exception being thrown would allow the cause to be immediately apparent, as opposed to generic missing file errors as the IO API tries to process whatever garbage path it turned into.
Quote
Furthermore this even does not compile with MSVC2012.
Very strange. Maybe 2012 has problems with C++98, or the templates cannot deal with argument declared objects. MSVC 2017 had no problem with the code.

I will hopefully fix the MSVC 2012 compile error. Static size allocated buffers is not really a good solution to string production. Even if it is more than sufficient for any possible string it still uses an arbitrary magic number (why 2048 and not 4096 or 1024? the magic number problem...). Since one has access to std::string why not use it when possible?

So far on Windows Simutrans is capable of both seeing and loading a save named "表外漢字表外漢字.sve" (something I pulled randomly off Wikipedia). It also was capable of producing a save named "沓直鳥沓直鳥沓直鳥.sve" from in game (another random string from Wikipedia). The only problem, with my build at least, is that in game it shows up as unknown character boxes. Still at least such a file can be loaded and saved.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 4802
  • Total likes: 191
  • Helpful: 108
  • Languages: EN, NO
Re: Windows Unicode File Path Support
« Reply #12 on: Today at 05:16:46 AM »
The only problem, with my build at least, is that in game it shows up as unknown character boxes. Still at least such a file can be loaded and saved.

I noticed the same thing with what I did, so that is to be expected. en.tab loads Prop-Latin1.bdf, which as the name suggests, only contain Latin-1 characters. In fact, I don't think it even has all of them. The presence of n-dashes in backup copies I make of my savegames was what clued me into problems with path processing in Simutrans back in the days, but they still show up as boxes.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 8793
  • Total likes: 316
  • Helpful: 229
  • Languages: De,EN,JP
Re: Windows Unicode File Path Support
« Reply #13 on: Today at 05:53:35 AM »
That problem is also present if you join a japanese game on a non-japanese languag seting. But I hope with freefonts it will be "soon" history.