News:

Simutrans Chat Room
Where cool people of Simutrans can meet up.

Windows Unicode File Path Support

Started by DrSuperGood, October 11, 2017, 04:46:31 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

DrSuperGood

Attached is a preliminary patch that converts Windows builds to fully using Unicode (UTF-16) for all file system related operations. Before it used some arbitrary, locale dependant code page that's limitations were masked with some DOS like 8.3 short path trickery.

The theory is that Simutrans now calls all file path related functions via adaptors. On non-Windows builds these mostly call the standard version of the function as one can mostly assume native UTF-8 support. On Windows builds these perform some UTF-8 translation and call the UTF-16 Windows API functions. As far as Simutrans is concerned all file paths are UTF-8. As far as Windows is concerned all file paths are UTF-16 (not code pages). It is not possible to get Windows to natively process UTF-8 file paths as the APIs were largely designed before UTF-8 (and even UTF-16).

This should allow the Windows build of Simutrans to support paths with names containing any valid Unicode characters. It should also allow Simutrans to run on systems where short path support is disabled.

It will not allow Windows builds of Simutrans to use long file paths or take advantage of Window 10's file path length limit removal feature. Much of the coding work done for this patch was created with such support in mind, but there are still a lot of hard coded path lengths which will implicitly force a limit.

This patch was tested using MSVC2017 where the Simutrans build was seen to work correctly and load in game. I do not know about MinGW or other such Windows builds working. I also do not know if the Linux or other platform builds are working.

The design of the code was made more complex due to limitations of the old C++ standard. I am sure a lot of this nonsense could be avoided if one uses the C++ filesystem class (C++17).

Ters

What was wrong with the previous patch that warranted starting over from scratch? It was tested with both mingw32, mingw64 and MSC++. And I've been using it for over a year. (Although I'm not a particularly active player.)

And I don't really see the point of having a U16View class, when all that is needed is the constructor as a simple function.

DrSuperGood

QuoteWhat was wrong with the previous patch that warranted starting over from scratch?
This patch also tries to improve some aspects of path processing so that more generic code can be used. For example it removes the can use recycle bin macros scattered around the place.

I also fixed a bug with the function responsible for sending files to the recycle bin. It allocated an array but deleted a non-array which is undefined behaviour. It also might work more successfully now as it tries to resolve the full path before issuing the delete.

QuoteAnd I don't really see the point of having a U16View class, when all that is needed is the constructor as a simple function.
Memory management. Otherwise it would be up to the caller to free the memory array used to hold the UTF-16 string which is both error prone and results in procedural coupling. Basically a similar reason why std::string is often used.

In theory one could return a wide string object instead. However this assumes the compiler has a 16bit wide character type, which apparently GCC does not. Also it would mean allocating memory twice, once to convert and then once for the string object as one cannot modify string object buffers directly.

Another solution would be returning a std::vector however one cannot access the underlying array until at least C++11 so again this is not a viable solution.

Ters

Quote from: DrSuperGood on October 11, 2017, 03:11:52 PM
This patch also tries to improve some aspects of path processing so that more generic code can be used. For example it removes the can use recycle bin macros scattered around the place.

That is not a reason to do everything else from scratch, so it does not answer the question.

Memory management. Otherwise it would be up to the caller to free the memory array used to hold the UTF-16 string which is both error prone and results in procedural coupling. Basically a similar reason why std::string is often used.

Quote from: DrSuperGood on October 11, 2017, 03:11:52 PM
In theory one could return a wide string object instead. However this assumes the compiler has a 16bit wide character type, which apparently GCC does not. Also it would mean allocating memory twice, once to convert and then once for the string object as one cannot modify string object buffers directly.

Another solution would be returning a std::vector however one cannot access the underlying array until at least C++11 so again this is not a viable solution.

As far as I could tell, it only converts to wchar_t. And since it is sized according to the string passed to it, it can't be and isn't used as a buffer for accepting returned UTF-16 strings either. As a plain wrapper around a basically read-only wchar_t array, std::wstring is better.

DrSuperGood

QuoteAs far as I could tell, it only converts to wchar_t. And since it is sized according to the string passed to it, it can't be and isn't used as a buffer for accepting returned UTF-16 strings either. As a plain wrapper around a basically read-only wchar_t array, std::wstring is better.
As mentioned before...

wchar_t is platform specific with GCC (used for non MSVC building) defining it as 32bits while MSVC 16bits. This is why modern C++ standards added explicit 16 and 32 bit std::string types for better portability. Secondly to produce the string it would still need to allocate a temporary buffer as the std::string API does not allow one to write directly to the underlying buffer hence it would need 2 allocations instead of 1 (first to make the string, second by std::string apis to store it). If one could use newer C++ standards this will likely not be a problem as there are a ton of new APIs added for memory management so chances are one could pass a more standard/primitive type out and get the same result.

One can always remove that class at a later time when better solutions become available. It is internally used so only 1 file would need to be changed. What is more important is if the path works for compilers and platforms other than MSVC Windows.

TurfIt


simsys.cc: In function 'char* dr_getcwd(char*, size_t)':
simsys.cc:279:56: error: 'WC_ERR_INVALID_CHARS' was not declared in this scope
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, wpath, -1, buf, (int)size, NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
===> HOSTCXX utils/simstring.cc
simsys.cc:279:56: note: suggested alternative: 'MB_ERR_INVALID_CHARS'
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, wpath, -1, buf, (int)size, NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
                                                        MB_ERR_INVALID_CHARS
simsys.cc: In function 'const char* dr_query_homedir()':
simsys.cc:345:56: error: 'WC_ERR_INVALID_CHARS' was not declared in this scope
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, whomedir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
simsys.cc:345:56: note: suggested alternative: 'MB_ERR_INVALID_CHARS'
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, whomedir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
                                                        MB_ERR_INVALID_CHARS
simsys.cc: In function 'const char* dr_query_fontpath(const char*)':
simsys.cc:393:56: error: 'WC_ERR_INVALID_CHARS' was not declared in this scope
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, fontdir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
simsys.cc:393:56: note: suggested alternative: 'MB_ERR_INVALID_CHARS'
  int const convert_size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, fontdir, -1, buffer, sizeof(buffer), NULL, NULL);
                                                        ^~~~~~~~~~~~~~~~~~~~
                                                        MB_ERR_INVALID_CHARS
utils/searchfolder.cc: In member function 'int searchfolder_t::search_path(const string&, const string&, bool, bool)':
utils/searchfolder.cc:73:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for(  int i=0;  i<path.size();  i++  ) {
                  ~^~~~~~~~~~~~
utils/searchfolder.cc: In static member function 'static std::__cxx11::string searchfolder_t::complete(const string&, const string&)':
utils/searchfolder.cc:158:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for(  int i=0;  i<filepath.size();  i++  ) {
                  ~^~~~~~~~~~~~~~~~

DrSuperGood

Ok looks like a misunderstanding of the API. I thought it meant that it had different behaviour on Windows Vista and later, not that it was only added with Windows Vista and later. I guess I will just have to remove that safety check...

Attached is patch with that removed. Sorry for the inconvenience.

TurfIt

#7
Compiles now, game does not crash when run. That's about the limit of my interest in this... seems like a deficient OS shirking it's responsibility to handle this conversion junk internally and completely transparent to programs.... Or maybe I should just say 7-bit ASCII FTW. yeesh.

EDIT:
file dates on save games all show as 1969-12-31 19:00 in the load game dialog.

DrSuperGood

Quotefile dates on save games all show as 1969-12-31 19:00 in the load game dialog.
Used the wrong stat call so the resulting struct was not compatible.

I also missed replacing some function calls. Attached new patch.

Ters

Quote from: DrSuperGood on October 11, 2017, 08:02:05 PM
wchar_t is platform specific with GCC (used for non MSVC building) defining it as 32bits while MSVC 16bits.

Yes, wchar_t is platform specific, which GCC respects. It is 32-bits on Linux and 16-bit on Windows. If the compiler did not follow the platform specific definitions, they would be unable to use the APIs provided by the platform as the arguments wouldn't match.

prissi

#10
Execptions in Simutrans are useless, since the thearding GCC is build without exceptions.

Morevore, there is absolute no way to enter an invalid character in Win32 (since all input comes via Win32 API), nor could a file name ever have an illegal character (they come both too from the Win2 API).

Furthermore this even does not compile with MSVC2012.

line (921)
U16View const wparam((std::string(portable ? "/P /D=" : "/D=") + path_to_program).c_str());
error
1>simsys.cc(921): error C2784: "std::_String_iterator<_Mystr> std::operator +(_String_iterator<_Mystr>::difference_type,std::_String_iterator<_Mystr>)": template-Argument für "std::_String_iterator<_Mystr>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xstring(420): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_String_const_iterator<_Mystr> std::operator +(_String_const_iterator<_Mystr>::difference_type,std::_String_const_iterator<_Mystr>)": template-Argument für "std::_String_const_iterator<_Mystr>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xstring(288): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::move_iterator<_RanIt> std::operator +(_Diff,const std::move_iterator<_RanIt> &)": template-Argument für "const std::move_iterator<_RanIt> &" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1947): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_Array_iterator<_Ty,_Size> std::operator +(_Array_iterator<_Ty,_Size>::difference_type,std::_Array_iterator<_Ty,_Size>)": template-Argument für "std::_Array_iterator<_Ty,_Size>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1801): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_Array_const_iterator<_Ty,_Size> std::operator +(_Array_const_iterator<_Ty,_Size>::difference_type,std::_Array_const_iterator<_Ty,_Size>)": template-Argument für "std::_Array_const_iterator<_Ty,_Size>" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1662): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::reverse_iterator<_RanIt> std::operator +(_Diff,const std::reverse_iterator<_RanIt> &)": template-Argument für "const std::reverse_iterator<_RanIt> &" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1226): Siehe Deklaration von 'std::operator +'
1>simsys.cc(921): error C2784: "std::_Revranit<_RanIt,_Base> std::operator +(_Diff,const std::_Revranit<_RanIt,_Base> &)": template-Argument für "const std::_Revranit<_RanIt,_Base> &" konnte nicht von "const char *" hergeleitet werden.
1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\xutility(1031): Siehe Deklaration von 'std::operator +'


Since there seems to be really a pressure for getting this right using low paths, well, I submitted this one. -r8303

DrSuperGood

QuoteMorevore, there is absolute no way to enter an invalid character in Win32 (since all input comes via Win32 API), nor could a file name ever have an illegal character (they come both too from the Win2 API).
That is not what the checks are there for. They are more there to make sure that some Simutrans programmer later down the line does not mess up string manipulation operations in a way to generate invalid Unicode, eg by accidently injecting non UTF-8 MultiByte code page characters from unwrapped APIs into the UTF-8 file path strings. A Fatal exception being thrown would allow the cause to be immediately apparent, as opposed to generic missing file errors as the IO API tries to process whatever garbage path it turned into.
QuoteFurthermore this even does not compile with MSVC2012.
Very strange. Maybe 2012 has problems with C++98, or the templates cannot deal with argument declared objects. MSVC 2017 had no problem with the code.

I will hopefully fix the MSVC 2012 compile error. Static size allocated buffers is not really a good solution to string production. Even if it is more than sufficient for any possible string it still uses an arbitrary magic number (why 2048 and not 4096 or 1024? the magic number problem...). Since one has access to std::string why not use it when possible?

So far on Windows Simutrans is capable of both seeing and loading a save named "表外漢字表外漢字.sve" (something I pulled randomly off Wikipedia). It also was capable of producing a save named "沓直鳥沓直鳥沓直鳥.sve" from in game (another random string from Wikipedia). The only problem, with my build at least, is that in game it shows up as unknown character boxes. Still at least such a file can be loaded and saved.

Ters

Quote from: DrSuperGood on October 17, 2017, 02:46:59 AM
The only problem, with my build at least, is that in game it shows up as unknown character boxes. Still at least such a file can be loaded and saved.

I noticed the same thing with what I did, so that is to be expected. en.tab loads Prop-Latin1.bdf, which as the name suggests, only contain Latin-1 characters. In fact, I don't think it even has all of them. The presence of n-dashes in backup copies I make of my savegames was what clued me into problems with path processing in Simutrans back in the days, but they still show up as boxes.

prissi

That problem is also present if you join a japanese game on a non-japanese languag seting. But I hope with freefonts it will be "soon" history.

TurfIt

Whatever compiler setup the nightly farm uses is not happy: (Aside: would be nice if someone with access to it could fix the nightly revision numbers too....  double r  rr9999)

configure: WARNING: Error, libpng is missing!
configure: WARNING: Using GDI backend!
svn: '.' is not a working copy
simsys.cc: In function 'int dr_mkdir(const char*)':
simsys.cc:140:21: error: '_set_errno' was not declared in this scope
simsys.cc:142:21: error: '_set_errno' was not declared in this scope
simsys.cc: In function 'int dr_remove(const char*)':
simsys.cc:207:21: error: '_set_errno' was not declared in this scope
simsys.cc:209:21: error: '_set_errno' was not declared in this scope
simsys.cc: In function 'int dr_rename(const char*, const char*)':
simsys.cc:231:21: error: '_set_errno' was not declared in this scope
simsys.cc:233:21: error: '_set_errno' was not declared in this scope
simsys.cc: In function 'int dr_chdir(const char*)':
simsys.cc:254:21: error: '_set_errno' was not declared in this scope
simsys.cc:256:21: error: '_set_errno' was not declared in this scope
simsys.cc: In function 'const char* dr_query_homedir()':
simsys.cc:331:3: error: 'LSTATUS' was not declared in this scope
simsys.cc:331:11: error: expected ';' before 'status'
simsys.cc:333:6: error: 'status' was not declared in this scope
simsys.cc:326:9: warning: unused variable 'len' [-Wunused-variable]
make: *** [build/default/simsys.o] Error 1

DrSuperGood

I have hopefully fixed the nightly build errors.

Where can one find the nightly build error logs? They used to be somewhere around the nightly download page.

TurfIt


DrSuperGood

Been messing around and tried running Simutrans using Unifont. The Unicode support is there although the windowing system has problems with the different font layout (bug or missing feature?). Not only can the windows build read Unicode save names, but they can also be displayed correctly using Unicode if a Unicode bdf file is provided. To produce the below I replaced "cyr.bdf" with "unifont-10.0.06.bdf".

I have also written a more accurate utf8 to code point decoder that can resolve any Unicode standard code point rather than the current hacky one that does a fake single character UTF-16 conversion and so can only resolve the core code page. I plan to commit it in a few days after polishing it a bit.

prissi

The current one is not hackish. Supporting more than 65535 characters is not needed for any living language, and simutrans has no been need of emoticons. Furthermore, even the TTF font display routines will not support UTF32 on windows. As soon as there are windows TTF fonts with more than 65535 characters (and even for those only few support the far east) then we shoudl consider an extension. But since UTF16 is the files system by definition on windows, I see no pressing for support right now.

Also symutrans has a relatively complete unicode font, if you install the chinese language support. The wenquanyi_9pt.bdf is quite complete, an I think I included also the codepages from cyr.bdf in it. (Have done this a long time ago.) In principle Simutrans can do internally UTF8 for all languages, the freefont is on my todo list for ages too.

Ters

UTF8, UTF16 and UTF32 are just encodings of Unicode, using one or more units of one, two and four bytes respectively. All support the full range of Unicode characters. So that Windows uses UTF16 for the file system doesn't mean anything to Simutrans, except that it must support Unicode to deal with all possible paths. The filename ?.txt is legal. I just created such a file. Whether Simutrans uses UTF8, UTF16 or UTF32 to do so is up to Simutrans.

Leartin

Technology is amazing if we can call our porn-folders "??" xDDD

Ters

Then it is amazing. If any video player or image view is able to open that folder is a different matter.

DrSuperGood

#22
QuoteThe current one is not hackish. Supporting more than 65535 characters is not needed for any living language, and simutrans has no been need of emoticons. Furthermore, even the TTF font display routines will not support UTF32 on windows. As soon as there are windows TTF fonts with more than 65535 characters (and even for those only few support the far east) then we shoudl consider an extension. But since UTF16 is the files system by definition on windows, I see no pressing for support right now.
It is needed to correctly parse Unicode strings. Parsing should stop if any invalid Unicode is encountered, and that includes invalid code points such as those beyond plane 17 (UFT-8 and UTF-32 only) or reserved by the standard for surrogate pairs. This does not include unused code pages, they are still valid even if they currently have no meaning. The current system will keep processing such strings until a 0 code point characters is found.

QuoteAlso symutrans has a relatively complete unicode font, if you install the chinese language support. The wenquanyi_9pt.bdf is quite complete, an I think I included also the codepages from cyr.bdf in it. (Have done this a long time ago.) In principle Simutrans can do internally UTF8 for all languages, the freefont is on my todo list for ages too.
As shown, Simutrans does support correct decoding of the main plane of Unicode. The problem is that there are no complete bdf font for it. Once a complete one is made, it could be used for most locales, with exception of those that need localized glyphs for some characters.
QuoteIf any video player or image view is able to open that folder is a different matter.
Simutrans should be able to open such folders now so I do not see why other applications should have a problem.

Simutran's current font system has a limitation of only the Basic Multilingual Plane. This means such characters will appear as squares no matter the font loaded. The technical solution would be to add multiple plane support to fonts. However actual fonts for it would still be needed.

Ters

Quote from: DrSuperGood on October 21, 2017, 02:33:38 PM
Simutrans should be able to open such folders now so I do not see why other applications should have a problem.

Because other programs don't use Simutrans for path handling, so fixing Simutrans doesn't mean that all other programs now do Unicode properly. Even programs that use wide-characters may fail to handle surrogate pairs properly, but there are still programs that only use plain 8-bit characters. Mercurial, Git's original rival and sort-of-sibling, still gets confused, as far as I know, by the fact that Windows does not use UTF-8. Git doesn't get quite as confused anymore, but I'm not sure how it deals with the fact that Windows isn't UTF-8 like Linux. It may do what Simutrans did, or it may do what Simutrans does.

TurfIt

Build broken again: (and many warnings added...)

dataobj/translator.cc: In function 'char* recode(const char*, bool, bool, bool)':
dataobj/translator.cc:173:65: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
       *dst++ = c = (uint8)utf8_decoder_t::decode((utf8 const *&)src);
                                                                 ^~~
dataobj/translator.cc:176:76: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
       *dst++ = c = unicode_to_latin2(utf8_decoder_t::decode((utf8 const *&)src));
                                                                            ^~~
display/simgraph16.cc: In function 'int display_calc_proportional_string_len_width(const char*, size_t)':
display/simgraph16.cc:4381:58: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
   utf32 iUnicode = utf8_decoder_t::decode((utf8 const *&)text);
                                                          ^~~~
gui/components/gui_textinput.cc: In member function 'virtual void gui_hidden_textinput_t::display_with_cursor(scr_coord, bool, bool)':
gui/components/gui_textinput.cc:692:58: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
    if(  cursor_visible  &&  text_pos - (utf8 const*)text == head_cursor_pos  ) {
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
unicode.cc: In static member function 'static utf32 utf8_decoder_t::decode(const utf8*, size_t&)':
unicode.cc:45:28: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
   if(  !(character == 0xE0 && buff[i] < 0xA0 ||
          ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
unicode.cc:52:28: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
   if(  !(character == 0xF0 && buff[i] < 0x90 ||
          ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
simsys_s2.cc: In function 'void internal_GetEvents(bool)':
simsys_s2.cc:643:15: error: 'utf8_to_utf16' was not declared in this scope
    utf16 uc = utf8_to_utf16( (utf8 *)event.text.text, &in_pos );
               ^~~~~~~~~~~~~
simsys_s2.cc:643:15: note: suggested alternative: 'utf16_to_utf8'
    utf16 uc = utf8_to_utf16( (utf8 *)event.text.text, &in_pos );
               ^~~~~~~~~~~~~
               utf16_to_utf8
make: *** [common.mk:50: /r/build/simsys_s2.o] Error 1


DrSuperGood

When removing the old function that call was not picked up for some reason. Should now be fixed.

I am not sure how to fix the "dereferencing type-punned pointer will break strict-aliasing rules" warnings. Unless such typecast is performed it requires a separate local variable of the appropriate type be declared in order to pass a reference to it. Maybe I am missing some syntax magic for references.

I am not even sure why the "suggest parentheses around '&&' within '||'" warnings are emitted seeing how && has higher precedence than || according to C/C++ standard so the order is implicitly well defined. Added brackets to fix it anyway.

Ters

Quote from: DrSuperGood on October 27, 2017, 08:18:20 PM
I am not sure how to fix the "dereferencing type-punned pointer will break strict-aliasing rules" warnings. Unless such typecast is performed it requires a separate local variable of the appropriate type be declared in order to pass a reference to it. Maybe I am missing some syntax magic for references.

You may take a look at https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule. Furthermore, those are C casts. C++ has more explicit casts. (This probably requires the most unsafe of them anyway.) And although I haven't seen the context, having a & in it seems odd. I don't think I've seen a cast like that before.

Quote from: DrSuperGood on October 27, 2017, 08:18:20 PM
I am not even sure why the "suggest parentheses around '&&' within '||'" warnings are emitted seeing how && has higher precedence than || according to C/C++ standard so the order is implicitly well defined. Added brackets to fix it anyway.

Probably because a lot of programmers, myself included, don't remember the precedence order for those two and end up getting it wrong half the time.

DrSuperGood

QuoteAnd although I haven't seen the context, having a & in it seems odd. I don't think I've seen a cast like that before.
It is a reference to a pointer. The idea being that after a character is decoded it can increment the pointer to the next character, removing the need for other parameters like a variable to hold the length. One could set it to a pointer to a pointer but a reference to a pointer seems more appropriate as it would never make sense trying to process NULL or an arbitrary pointer.

The problem is that many of the string pointers are of type char where as the utf8 type is defined as unsigned character. For that reason it does not like using a parameter or local variable of type "char const *" as a reference target because it will only accept a variable of type "utf8 const *" as a reference. What I suspect is happening is by type casting to "utf8 const *&" it is creating a reference to a variable of "char const *" and then type casting it to a reference of "utf8 const *" which is accepted. Simply casting to "utf8 const *" does not work, it is apparently not the right type to take a reference from, possibly because it is type casting the value and losing context of the underlying variable.

In retrospect that method might not have been the best API choice. That is why I created a more traditional style one which takes a constant pointer to characters and outputs the length to a variable of size_t.

Ters

Yuck. Mutating values through references. Sure, some languages are all references, but then I am in that mindset. In C++, I expect f(a) to not modify a, unlike f(&a) (with arrays, you never know, though).

Quote from: DrSuperGood on October 28, 2017, 12:52:56 AM
[...]a more traditional style [...] takes a constant pointer to characters and outputs the length to a variable of size_t.

I thought the C++ way was to take in a (constant) pointer (or rather iterator) to characters and return the advanced pointer(/iterator).

prissi

I do not understand the waz whz fixing stuff tht is not broken. Also creating and destroyign a useless this pointer just to convert a single character I do not understand. Why change the simple function call to a class? (Hopefully the compiler is clever enough to do not create all the overhead ivolved with a class constructor and destructor just for a single call to convertthe character. Also class'ing stuff makes inlining of such code less likely.) Typing the extra "utf8_decoder_t::has_next" has absolutely no merit, for me.

Moreover, as stated, simutrans does not need to parse UTF32 (which is actually did even before), because there is not way to enter such characters right now, and neither to display them.

Ters

Quote from: prissi on October 28, 2017, 12:56:05 PM
Hopefully the compiler is clever enough to do not create all the overhead ivolved with a class constructor and destructor just for a single call to convertthe character. Also class'ing stuff makes inlining of such code less likely.
The ability of the compiler to do so is essential in C++. Probably even more so in the recent editions.

Quote from: prissi on October 28, 2017, 12:56:05 PM
Moreover, as stated, simutrans does not need to parse UTF32 (which is actually did even before), because there is not way to enter such characters right now, and neither to display them.
I don't see anything suggesting UTF-32 parsing. However one pretty much has to use UTF-32 when working with individual Unicode characters. (I have never seen it actually being called that in that context, though.) Basically, when you parse UTF-8 or UTF-16, you get UTF-32. If you don't, a user will sooner or later trigger a crash. That happened at work a short while back, when incomplete handling of UTF-16 in a customer's software's interface with ours caused a crash when one of their users tried to submit a telephone into a text field in our system. The telephone was broken into two pieces and became corrupt, causing the transaction to fail. (It didn't bring the entire system down, as we have better error handling than Simutrans. Tough luck for the user, though.)

Since UTF-8 has to be "parsed" in order to have any advantage over ASCII, it isn't much more work to handle the full range of Unicode rather than stop at the lower 64k characters. Especially when there actually is a developer dedicated enough to actually do the work. Since characters probably are processed one by one at this point, it's not like we're saving memory by using 16-bit integers over 32-bit integers. An x86 will use (at least) 32-bit registers and stack slots anyway since about 1990 (unless operating in real mode, which Simutrans does not and never has supported).

I'm not much of a fan of wrapping functionality in classes just to use classes/objects (although I don't have a choice as a Java developer), but in this case, it is a class with a state that I think is better off encapsulated. The utf8_decoder_t class does not have virtual members, so it should compile to exactly the same code as a naked char pointer and free standing functions operating on it would have. The compiler should know that this == &utf8str.

DrSuperGood

QuoteI thought the C++ way was to take in a (constant) pointer (or rather iterator) to characters and return the advanced pointer(/iterator).
The problem is that where does one return the code point then? Either the pointer can be returned and the code point set via a reference, or the code point returned and the pointer/length returned via a reference. If one used utf-32 directly this is no problem, but that has a lot of other problems itself.
QuoteI do not understand the waz whz fixing stuff tht is not broken.
It was broken. The old code would try to parse illegal Unicode which violates ISO standards. If the main character is illegal, it should not even test subsequent tails as those might be beyond buffer bounds.
QuoteAlso creating and destroyign a useless this pointer just to convert a single character I do not understand. Why change the simple function call to a class? (Hopefully the compiler is clever enough to do not create all the overhead ivolved with a class constructor and destructor just for a single call to convertthe character. Also class'ing stuff makes inlining of such code less likely.)
There should be practically no overhead from the class due to how good optimizing compilers are. Yes it is slower logically, but it is slower purely because it is in theory more correct as it tests for more cases.
QuoteMoreover, as stated, simutrans does not need to parse UTF32 (which is actually did even before), because there is not way to enter such characters right now, and neither to display them.
When pasting in from external sources on Windows it could encounter UTF-16 pairs, which it would incorrectly process. Now it processes them correctly into UTF-8 so such sequences can exist.

The only disadvantage is that it is technically slower now. However it is more correct as it will not try to process illegal Unicode sequences in compliance with ISO as opposed to before where it might still process tail bytes after an illegal character. That said the difference should be trivial, especially in most cases.

Ters

Quote from: DrSuperGood on October 28, 2017, 09:20:27 PM
The problem is that where does one return the code point then?

You suggested returning a size_t value, so the problem of returning two values was already present. C++ has a solution for that called std::pair. (Although I'm not particularly fond of that structure due to the rather meaningless names. std::tie helps somewhat, but it is much more verbose than such a construct would be in Python.)

DrSuperGood

QuoteYou suggested returning a size_t value, so the problem of returning two values was already present.
I am not suggesting it. That is how it was before, except it would increment the value similar to incrementing the pointer rather than returning the length.

One could also create a utf8_decode_result_t struct which contains both utf32 code point and byte length of the character and then return that. This would basically be similar to std::pair except explicitly typed and remove the need for any non-constant references as parameters. Using union mechanics one could pack the length into the unused bytes of UTF-32, but that might have other problems. Would this be better than the exiting API?

Ters

The current API looks mostly fine to me, except that the public static member functions kind of breaks the encapsulation. For full C++ style points, it should perhaps work like a forward iterator.