Patch to convert source files to UTF-8

Started by ceeac, February 24, 2020, 03:35:34 PM

ceeac

This patch unifies the encoding of all source files (*.cc/*.h) to UTF-8 (without BOM). I did this for a number of reasons:


  • Consistency. Some files were already encoded as UTF-8, most were encoded as CP-1252 (?). Now there is only a single encoding.
  • The text rendering routines (e.g. display_text_proportional_len_clip_rgb) expect UTF-8 encoded strings; however, for example the scrolltext in scrolltext.h was not encoded as UTF-8 (see the byte illustration after this list).
  • GCC's default source encoding is locale-dependent (usually UTF-8 nowadays), Clang only supports UTF-8, and MSVC has had UTF-8 support since VS2015.
  • This fixes a display bug when executing simutrans -help on the command line with the locale set to UTF-8.
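
To make the second and fourth points concrete: the same character has different byte patterns in CP-1252 and UTF-8, so a renderer that expects UTF-8 misreads CP-1252 input. A tiny standalone illustration (not the actual rendering code):

[code]
#include <cstdio>

int main()
{
	// "ö" is the single byte 0xF6 in CP-1252, but the two bytes 0xC3 0xB6 in UTF-8.
	const unsigned char cp1252_o_umlaut[] = { 0xF6 };
	const unsigned char utf8_o_umlaut[]   = { 0xC3, 0xB6 };

	// Fed to a UTF-8 decoder, a lone 0xF6 looks like the start of a four-byte
	// sequence with no continuation bytes, i.e. invalid input -- which is why
	// mixed-encoding strings end up garbled on screen or on the console.
	std::printf("CP-1252: %zu byte(s), UTF-8: %zu byte(s)\n",
	            sizeof cp1252_o_umlaut, sizeof utf8_o_umlaut);
	return 0;
}
[/code]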

I have tested the patch on Linux and Windows (VS2019), and it worked fine there.

Ters

Do the source files need to be anything beyond ASCII in the first place?

Having recently had trouble getting a Linux system to compile source code that had always been UTF-8, I don't have much faith that anything beyond plain ASCII is a good idea for source code.

Mariculous

Imho it doesn't matter whether non-ASCII UTF-8 characters are actually used or not. UTF-8 is what most compilers, IDEs and other tools expect by default.
In any case, we should try to use the same encoding for all project files, and UTF-8 seems to be a better choice than any of the ISO-whatever encodings.
Especially with the translation files in mind: there are currently a lot of different encodings for different languages.
Further, SimuTranslator seems to use UTF-8, which recently caused an issue.

Even further, some old comments are still written in German, a few containing umlauts.

Imho, if possible, strictly switching all project sources to UTF-8 is a good idea. However, as some translations come from paksets, there needs to be legacy support for ISO-whatever encodings read from such language.tab files.
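
Such legacy support on load is not much code as long as the code page is known. A minimal sketch, assuming the file is ISO-8859-1 (the function name and that assumption are mine, not what the existing loader does):

[code]
#include <string>

// ISO-8859-1 code points map 1:1 to Unicode, so every byte >= 0x80 becomes
// exactly one two-byte UTF-8 sequence. Other legacy code pages
// (Windows-1252, ISO-8859-15, ...) would need a small lookup table instead.
std::string latin1_to_utf8(const std::string &in)
{
	std::string out;
	out.reserve(in.size());
	for (unsigned char c : in) {
		if (c < 0x80) {
			out += static_cast<char>(c);
		}
		else {
			out += static_cast<char>(0xC0 | (c >> 6));
			out += static_cast<char>(0x80 | (c & 0x3F));
		}
	}
	return out;
}
[/code]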

prissi

The ancient stuff should be dealt with properly, i.e. remove all Hj. Malthaner copyright messages (which have not been appropriate since the game went open source in 2007 anyway), and remove the German comments or translate them. Also, adding names to routines has long been discouraged, as many people have contributed to them quite often in the meantime.

Mariculous

I don't think using a consistent encoding and translating or removing all German comments are closely tied tasks.
Translating those comments is certainly desirable, but on the other hand it is quite a lot of work.
Mixing different encodings in the same project is not good practice, at least in my opinion.
I don't think changing the encoding is a high-priority thing, but as there already is a patch, why shouldn't it just be accepted?

Leartin

Quote from: prissi on February 25, 2020, 01:42:55 AM
remove all Hj. Malthaner copyright messages (which have not been appropriate since the game went open source in 2007 anyway)
*cough* https://forum.simutrans.com/index.php/topic,17460 *cough* (Technically, it's still copyright. If it goes, so should the copyright parameter - I'd long prefer to just be an author.)

prissi

Some of these files were not from him; the header was just copied lazily.

Also, the correct copyright message (at least in international style) is by the Simutrans team, which is the second line anyway. Some files have only this message and some have nothing at all. So this is rather a mess right now.

makie

Quote from: Ters on February 24, 2020, 06:29:49 PM
Do the source files need to be anything beyond ASCII in the first place?
Yes, source files (*.cc/*.h) should be ASCII without any language-specific characters.

Quote from: Freahk on February 24, 2020, 06:48:35 PM
Even further, some old comments are still written in German, a few containing umlauts.
Just change "ö" to "o" or "oe" and it's OK.

DrSuperGood

Unicode supports German characters so why remove them? One just has to explicitly re-encode the files from their code page to UTF-8 so that the characters are converted properly to their UTF-8 versions.

All the source files should be UTF-8. These should ideally produce UTF-8 string constants, except when UTF-16 is explicitly requested (Windows OS). Program-wise, everything internal should be made UTF-8, except for character drawing, which should programmatically convert to UTF-32 "code points" for lookup, and Windows OS calls, which should convert to and from UTF-16 as required.
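
The drawing-side conversion is just a decode step. A rough sketch of what such a code-point lookup has to do (not the actual helper in the code base, and it skips overlong/surrogate validation for brevity):

[code]
#include <cstddef>
#include <cstdint>

// Decode one UTF-8 sequence starting at s[*pos] into a UTF-32 code point
// and advance *pos. Malformed input is mapped to U+FFFD.
uint32_t decode_utf8(const unsigned char *s, size_t len, size_t *pos)
{
	if (*pos >= len) return 0;                      // end of string
	const unsigned char c = s[(*pos)++];
	if (c < 0x80) return c;                         // plain ASCII, one byte

	int extra;
	uint32_t cp;
	if      ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }  // two-byte sequence
	else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }  // three-byte sequence
	else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }  // four-byte sequence
	else return 0xFFFD;                             // stray continuation byte

	for (int i = 0; i < extra; ++i) {
		if (*pos >= len || (s[*pos] & 0xC0) != 0x80) return 0xFFFD;
		cp = (cp << 6) | (s[(*pos)++] & 0x3F);
	}
	return cp;                                      // ready for glyph lookup or UTF-16 conversion
}
[/code]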

I suspect the reason some of the files are not UTF-8 is an oversight when adding new files to the SVN: on Windows it defaults to the local code page rather than UTF-8.

prissi

The reason is that in 1997 the files were in DOS codepage and nobody cared.

Actually, C keywords and everything else apart from strings and comments must be ASCII, as far as I know. Strings being UTF-8 (or any other code page) might not work on a random system. (However, since most "random systems" are using GCC to compile, this issue has been solved by the GCC monopoly.)

Still, if one touches these places, translation (and removal) seems the cleaner choice.

Vladki

I also think that the right solution is to translate everything to English, and thus 7-bit ASCII should be sufficient.

Mariculous

For code? Sure.
For translations? No!

Does anyone want to translate all the comments right now? Well, I don't think so. As there already is a patch for consistent UTF-8 encoding, it should be applied.
Once all special characters are removed, the encoding doesn't matter anyway, as any of these encodings will look just the same as plain ASCII.

At least this is my opinion. Do whatever you want.

Ters

Quote from: DrSuperGood on February 25, 2020, 05:18:50 PM
Unicode supports German characters so why remove them?
Unicode is not as well supported as one might think. Unicode support on Linux is handled by the locale system. Since those are user settings, it is possible for some users to use a UTF-8 locale while others use something else, which can cause different interpretations of paths.

My particular problem was that I was trying to migrate our build systems to Docker containers. Since the idea behind containers is to have only what you need in them, the base Docker image, which was the official Ubuntu Docker image, contained only three locales, of which only one was UTF-8. When logging into the container using ssh, your locale settings were sent over from your machine. Since that locale did not exist in the container, you ended up with a default that was not UTF-8. This took me a while to figure out, especially since it wasn't me personally who logged in, but the build orchestration program. (Which in turn someone had configured to explicitly use a non-UTF-8 locale, probably in a misguided attempt to deal with problems caused by the lack of correct locales.)

I don't know how GCC works on a system without UTF-8 locales, but one can clearly not just expect UTF-8 to work. Pure ASCII will work whether your system is set up for ASCII, ISO-8859-x, Windows-125x, or (for the most part) Shift JIS and possibly other Asian encodings. Unless you really need to encode a lot of non-ASCII characters in the code, it is probably safest to use just ASCII. For files read by our applications, we have control over how the contents are interpreted, so they can (and should) be UTF-8.

DrSuperGood

Quote from: prissi on February 26, 2020, 12:29:49 PM
Actually, C keywords and everything else apart from strings and comments must be ASCII, as far as I know. Strings being UTF-8 (or any other code page) might not work on a random system. (However, since most "random systems" are using GCC to compile, this issue has been solved by the GCC monopoly.)
UTF-8 is ASCII compatible. Extended code point support is achieved by using byte values outside standard 7-bit ASCII (the "extended ASCII" range) in multi-byte sequences.
Quote from: Ters on February 26, 2020, 06:15:11 PM
I don't know how GCC works on a system without UTF-8 locales, but one can clearly not just expect UTF-8 to work. Pure ASCII will work whether your system is set up for ASCII, ISO-8859-x, Windows-125x, or (for the most part) Shift JIS and possibly other Asian encodings. Unless you really need to encode a lot of non-ASCII characters in the code, it is probably safest to use just ASCII. For files read by our applications, we have control over how the contents are interpreted, so they can (and should) be UTF-8.
One could argue such systems are incorrectly configured to begin with. Possibly for legacy reasons.

Mariculous

Quote from: DrSuperGood on February 26, 2020, 07:31:20 PM
One could argue such systems are incorrectly configured to begin with. Possibly for legacy reasons.
Further, those systems won't be able to work with the current encoding mess either.

prissi

ASCII defines code points 0...127 because it was 7-bit, so UTF-8 is by definition not ASCII. (I am old enough to have had emails break after umlauts, because for a long time email only allowed ASCII. Otherwise people would not have used Base64 encoding for binaries back then...)

However, instead of re-encoding obsolete comments and German text, removing or translating them should rather be the way forward.


ceeac

What about people's names? For example, the scrolltext contains names with "ö" or "é" characters. Should these be transliterated (i.e. ö -> oe, é -> e)? Personally, I'd rather use the correct name if possible instead of a transliteration.

makie

Quote from: ceeac on February 27, 2020, 07:38:49 AM
the scrolltext
This should be a separate file. scrolltext.h - OK, it is a separate file.
UTF-8? Or ISO-8859-15? Or something else? -> translator?
That is probably the exception.

Mariculous

Quote from: prissi on February 27, 2020, 06:42:45 AM
So UTF-8 is by definition not ASCII
Nobody said it's generally the same. Obviously it's not, otherwise there would be no need to define all these different encodings.

However, restricted to the characters defined in ASCII, i.e. 0x00 to 0x7F, any ISO-8859-whatever, Windows-whatever or UTF-8 encoding can be "converted" to ASCII by simply interpreting it as ASCII. The other way round works with a small restriction: the 8th bit in ASCII is undefined. Nowadays it is always 0, but historically it could be used for any purpose, such as parity. In practice this doesn't matter, as most (if not all) modern editors don't use plain ASCII anyway, and the few that do won't use the 8th bit for anything special; it will simply be 0. ASCII is usually embedded in ISO-8859-whatever, Windows-whatever or UTF-8.
As mentioned, the binary representation of ASCII characters is exactly the same in all of these encodings, so a file containing only ASCII characters can properly be interpreted as any of these.

It should be noted that the above assumes UTF-8 without a BOM.
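
Both properties are easy to check mechanically. A small sketch (the helper names are mine) that reports whether a buffer is plain 7-bit ASCII and whether it carries a UTF-8 BOM:

[code]
#include <cstddef>

struct encoding_info {
	bool pure_ascii;    // identical bytes under ISO-8859-x, Windows-125x and UTF-8
	bool has_utf8_bom;  // starts with EF BB BF, which the patch deliberately omits
};

encoding_info inspect(const unsigned char *data, size_t len)
{
	encoding_info info = { true, false };
	info.has_utf8_bom = (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF);
	for (size_t i = 0; i < len; ++i) {
		if (data[i] & 0x80) {   // any byte with the 8th bit set is not 7-bit ASCII
			info.pure_ascii = false;
			break;
		}
	}
	return info;
}
[/code]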

prissi

Historically, not all 8-bit characters could be used for data transmission, since the range from 128 to 160 in the extended ASCII sets repeated the control characters. However, to the thing at hand:

Remove the line with the copyright message and the following line with *
Translate (or remove, if trivial) the German comments

Convert the rest to Unicode. This especially concerns scrolltext.h, which was overlooked when everything else was switched to UTF-8 some time ago.

DrSuperGood

For now I still recommend transcoding all source files to UTF-8 and leaving the German comments and names as is. Removing those is a much bigger project which is less critical since UTF-8 can handle German characters perfectly and most modern setups should support UTF-8.

UTF-8 was designed to be backward compatible with ASCII. In the worst case, some characters would appear as nonsense glyphs.

https://en.wikipedia.org/wiki/UTF-8
Quote
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.

Ters

Quote from: DrSuperGood on February 26, 2020, 07:31:20 PM
UTF-8 is ASCII compatible.
I'd say it is the other way around. All ASCII is valid UTF-8, but UTF-8 is not valid ASCII, since ASCII stops at code point 127. A program made for ASCII might use a 128-element array to look up glyphs. Once fed UTF-8 containing non-ASCII characters, it would end up with out-of-bounds access. (And these programs would likely not do error checking. Every byte counts when you only have 256 kB RAM.) That is not really the issue here, however. I'm just concerned that the actual result of something will be different depending on whether the code is read as UTF-8 or something else. The only thing that I can think of at the moment is wide character and string literals, but there may be something else. UTF-8 in comments should probably work fine on an 8-bit extended ASCII system, since the compiler doesn't care about the contents. The other way around is more uncertain, although a proper UTF-8 implementation should be able to synchronize back. I'm not sure if UTF-8 works as well with other multi-byte encodings.
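
To make the array concern concrete (purely illustrative, not code from Simutrans): an 8-bit-era renderer indexes a 128-entry table directly with each byte, while a UTF-8-aware one has to decode first and bounds-check the code point.

[code]
#include <cstddef>
#include <cstdint>

struct glyph_t { const uint8_t *bitmap; };

// ASCII-era lookup: any UTF-8 lead or continuation byte (>= 0x80) reads out of bounds.
const glyph_t *lookup_ascii_glyph(const glyph_t table[128], unsigned char c)
{
	return &table[c];
}

// UTF-8-aware lookup: decode to a code point first, then bounds-check it.
const glyph_t *lookup_glyph_checked(const glyph_t *table, size_t table_size, uint32_t code_point)
{
	return (code_point < table_size) ? &table[code_point] : nullptr; // caller picks a fallback glyph
}
[/code]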

Quote from: DrSuperGood on February 27, 2020, 06:51:18 PM
UTF-8 was designed to be backward compatible with ASCII. In the worst case, some characters would appear as nonsense glyphs.
That depends on what compatibility is needed. I would say that UTF-8 was designed so that ASCII is forward compatible with it. UTF-7 offers a different kind of backwards compatibility with ASCII. Less readable, but actually usable on true 7-bit ASCII systems. Beyond SMTP, and even then just partially, those are hard to find.

DrSuperGood

Quote from: Ters on February 27, 2020, 07:16:48 PM
A program made for ASCII might use a 128-element array to look up glyphs. Once fed UTF-8 containing non-ASCII characters, it would end up with out-of-bounds access. (And these programs would likely not do error checking. Every byte counts when you only have 256 kB RAM.)
Such programs would be considered poorly made in this day and age, where even Spectre and Meltdown are a concern. Simutrans cannot even run (in a normal sense) on 256 kB of RAM, as that most certainly is not a design consideration for its development (it loads all assets into RAM). Extended is pushing 6 GB+ after all on its intended map sizes.
Quote from: Ters on February 27, 2020, 07:16:48 PM
I'm just concerned that the actual result of something will be different depending on whether the code is read as UTF-8 or something else. The only thing that I can think of at the moment is wide character and string literals, but there may be something else. UTF-8 in comments should probably work fine on an 8-bit extended ASCII system, since the compiler doesn't care about the contents. The other way around is more uncertain, although a proper UTF-8 implementation should be able to synchronize back. I'm not sure if UTF-8 works as well with other multi-byte encodings.
Modern C++ defines the handling of string literals specifically with regard to Unicode. Some time in the future one can swap over to using those once compiler support is mainstream or at least required. Modern GCC builds should support it.
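
For reference, the literal prefixes in question (this is standard C++, not something already used in the code base as far as I know):

[code]
#include <cstdio>

int main()
{
	// u8"..." is guaranteed to be UTF-8 encoded regardless of the compiler's
	// execution character set (since C++11; the element type changes to char8_t
	// in C++20, hence auto here so the snippet compiles under both).
	auto const &utf8_text = u8"Grüße";
	char16_t const utf16_text[] = u"Grüße";   // UTF-16 code units
	char32_t const utf32_text[] = U"Grüße";   // one UTF-32 code point per character

	std::printf("u8 literal: %zu bytes, u16: %zu units, U32: %zu code points\n",
	            sizeof utf8_text,
	            sizeof utf16_text / sizeof utf16_text[0] - 1,
	            sizeof utf32_text / sizeof utf32_text[0] - 1);
	return 0;
}
[/code]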

Currently this is very much looking for problems that might not be there. Sure, some obscure compilers might have issues with the UTF-8 encoding. However, the mainstream ones being used (MSVC, GCC, etc.) should not, especially if kept up to date (latest stable recommended). If someone wants to port Simutrans to DOS (not officially supported), then they might need to hack around with the encoding, but for most people building with UTF-8 should just work.

Later down the road, in another project, one can work to eradicate all German from the source code, making it true ASCII even if still UTF-8 encoded. For German names, one can export those to a separate file, which can use programmatic handling of UTF-8 and hence be portable. That is of course assuming a Unicode-compatible version of C++ has not been adopted by that stage (I think Extended already uses this).

ceeac

I'm with DrSuperGood here - since all major compilers support UTF-8 input, it makes sense to convert the files. Furthermore, even if some other compiler did not support UTF-8, at least the encoding would be consistent, so batch converting with iconv before compiling should work.
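
For that fallback, something along the lines of the POSIX iconv API would do the per-file conversion (a sketch only; glibc's char** signature is assumed, and some platforms declare the input pointer const):

[code]
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Convert one in-memory buffer from CP-1252 to UTF-8. A batch tool would
// apply this to every *.cc/*.h before handing the tree to the compiler.
std::string cp1252_to_utf8(const std::string &input)
{
	iconv_t cd = iconv_open("UTF-8", "CP1252");
	if (cd == (iconv_t)-1) {
		throw std::runtime_error("iconv_open failed");
	}

	std::vector<char> in(input.begin(), input.end());
	std::vector<char> out(input.size() * 4 + 4);    // generous worst-case size
	char  *in_ptr   = in.data();
	size_t in_left  = in.size();
	char  *out_ptr  = out.data();
	size_t out_left = out.size();

	const size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
	iconv_close(cd);
	if (rc == (size_t)-1) {
		throw std::runtime_error("conversion failed");
	}
	return std::string(out.data(), out.size() - out_left);
}
[/code]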

Quote from: prissi on February 27, 2020, 12:57:28 PM
Remove the line with the copyright message and the following line with *
I will work on this next (making the license header comment consistent across files).

Ters

Quote from: DrSuperGood on February 27, 2020, 10:50:39 PM
Such programs would be considered poorly made in this day and age, where even Spectre and Meltdown are a concern.
Sure, but they are still around and still being made. However, all this was just about the actual differences between ASCII (7-bit) and UTF-8 (multi-byte 8-bit).

Quote from: DrSuperGood on February 27, 2020, 10:50:39 PM
Currently this is very much looking for problems that might not be there.
Almost. I was asking whether the benefits of using non-ASCII characters outweigh the risk of running into some character encoding problem nobody has thought about.

Quote from: DrSuperGood on February 27, 2020, 10:50:39 PM
Modern C++ defines the handling of string literals specifically with regard to Unicode. Some time in the future one can swap over to using those once compiler support is mainstream or at least required. Modern GCC builds should support it.
I don't think "support", or even "default", is safe enough. I prefer "mandatory", having just run into Murphy's law after thinking "everything supports UTF-8" and "UTF-8 is the default on Linux".

ceeac

Updated the patch to only contain the changes not yet covered by the translation patch.

prissi

Submitted with r9000, but by god the scrolltext is outdated by almost a decade...

ceeac

I missed one occurrence in gui_halt_detail_t::update_connections, but this is definitely the last one. :)