UTF-8?

jamespetts · March 02, 2018, 06:25:58 PM

I notice that there have been some character encoding changes made to the Standard codebase in the last few months. Recently, an Extended user contacted me noting that he was unable to use UTF-8 encoding in city lists (for the purposes of German characters).

Can I ask - is this an issue that the recent changes to the character encoding in Standard would have addressed? If so, it would be helpful to have some information to narrow down the date of the relevant changes so that I can more easily port them to Extended.

Ters · March 02, 2018, 09:06:28 PM

As far as I can tell, Simutrans has not changed how it reads citylists for years. They are Latin1 by default, but will be read as UTF-8 if they start with a BOM or § character to trigger UTF-8 reading.

Additionally, UTF-8 is not required for German, as all German characters are present in Latin1.
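
For anyone who wants to see the rule spelled out, here is a minimal Python sketch of the detection described above. It is not Simutrans's actual loader: the function name `read_citylist` is made up for illustration, and dropping the marker before parsing is my assumption.

```python
UTF8_BOM = b"\xef\xbb\xbf"
UTF8_SECTION_SIGN = "§".encode("utf-8")  # b"\xc2\xa7"


def read_citylist(data: bytes) -> list:
    """Decode raw citylist bytes and return the non-empty names.

    Hypothetical helper, not the real Simutrans code.
    """
    for marker in (UTF8_BOM, UTF8_SECTION_SIGN):
        if data.startswith(marker):
            # Assumption: the marker itself is dropped before parsing.
            text = data[len(marker):].decode("utf-8")
            break
    else:
        # No marker found: treat every byte as one Latin1 character.
        text = data.decode("latin-1")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Under these assumptions, a UTF-8 file without a marker comes out garbled ("Köln" turns into "KÃ¶ln"), which would match the kind of report described in the first post.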

jamespetts · March 02, 2018, 09:10:30 PM

Thank you - I will pass this on to the person who reported the issue.

Frank · March 02, 2018, 10:28:01 PM

Quote from: Ters on March 02, 2018, 09:06:28 PM
As far as I can tell, Simutrans has not changed how it reads citylists for years. They are Latin1 by default, but will be read as UTF-8 if they start with a BOM or § character to trigger UTF-8 reading.

Additionally, UTF-8 is not required for German, as all German characters are present in Latin1.

The primary problem is mixed characters in vehicle names.

For example, the Czech vehicles: ČSD

The Č is Latin2, but the German or English translation text export is Latin1.
As a result, the Translator export is Latin1 and the Č is broken.

The Translator works in the browser with UTF-8, and the database also stores UTF-8.
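
A small Python sketch of why the export breaks, assuming the export simply encodes each text to Latin1. The helper names are hypothetical and the Translator's real export code may differ.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def lossy_latin1_export(text: str) -> bytes:
    """Encode to Latin1, substituting "?" for characters Latin1 lacks."""
    return text.encode("latin-1", errors="replace")
```

With these helpers, "ČSD" can be encoded as Latin2 or UTF-8 but not as Latin1, so a Latin1 export has to either fail or replace the Č.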

DrSuperGood · March 02, 2018, 11:32:25 PM

One should really try to phase out all non UTF-8 text in the long run, including re-encoding older files.

Frank · March 02, 2018, 11:54:22 PM

Quote from: DrSuperGood on March 02, 2018, 11:32:25 PM
One should really try to phase out all non UTF-8 text in the long run, including re-encoding older files.

The problem is the font files for utf-8.

DrSuperGood · March 03, 2018, 02:22:56 AM

Quote from: Frank on March 02, 2018, 11:54:22 PM
The problem is the font files for utf-8.

Simutrans already uses code page 1 fonts. Internally much of the latin1 is being translated into UTF-8 on load. I tested Simutrans running unifont and all code page 1 characters appeared correctly although the UI was clearly not designed for fonts of a different height.
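
To illustrate what translating Latin1 into UTF-8 on load amounts to, here is a minimal Python sketch. The function name is hypothetical and this is not the actual Simutrans routine.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode Latin1 bytes as UTF-8.

    Every Latin1 byte maps to the Unicode code point with the same
    number, so the decode step can never fail.
    """
    return raw.decode("latin-1").encode("utf-8")
```

Plain ASCII bytes pass through unchanged, while each byte from 0x80 upwards becomes a two-byte UTF-8 sequence.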

prissi · March 03, 2018, 02:10:43 PM

For a long time you could use UTF-8 with latin encoding when using a § as first character (in UTF-8 encoding).

Ters · March 03, 2018, 05:32:20 PM

Quote from: prissi on March 03, 2018, 02:10:43 PM
UTF-8 with latin encoding

That makes absolutely no sense. UTF-8 and the different Latin encodings are mutually exclusive. (Although they are indistinguishable if the text only contains plain letters a through z, upper and/or lower case, digits 0 through 9 and a few other characters like comma, period, space and newline.) Or did you mean UTF-8 encoded files with the Latin1 font? In that case, one is still limited to the characters present in the font.
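
The point in parentheses can be checked with a few lines of Python. The helper name `encodings_agree` is made up for this illustration.

```python
def encodings_agree(text: str) -> bool:
    """Return True if UTF-8, Latin1 and Latin2 give identical bytes.

    The text must be encodable in all three; a character such as Č
    would raise UnicodeEncodeError for Latin1.
    """
    codecs = ("utf-8", "latin-1", "iso8859_2")
    return len({text.encode(codec) for codec in codecs}) == 1
```

Pure ASCII text such as "Simutrans 120, ok." agrees in all three encodings, while "Köln" does not: the ö is the single byte 0xF6 in Latin1 and Latin2 but the two bytes 0xC3 0xB6 in UTF-8.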

I'm not sure how Simutrans uses the Latin2 font. By disrespecting Unicode and turning two "wrongs" into a "right"?

DrSuperGood · March 03, 2018, 11:04:56 PM

Simutrans's font is Unicode. The glyph system uses Unicode code points. The problem is that the default font included with Simutrans is incomplete and missing most of the code page 1 glyphs. A complete font like Unifont will provide full code page 1 support, but Simutrans is not designed for the different character height due to excessive hard coding, so there are visual artefacts.

There is currently no support or plan to support code pages above 1. The glyph engine uses up to a 65536 sized array which is enough for the first code page. Code points from code pages above 1 can still be used and will be interpreted correctly as a character but will always render as an unknown character.
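
As a rough Python sketch of that behaviour (not the real glyph engine; the table layout, the names and the "?" placeholder are all assumptions for illustration):

```python
GLYPH_SLOTS = 0x10000  # 65536 entries, one per code point 0..65535
UNKNOWN_GLYPH = "?"    # stand-in for the "unknown character" glyph


def build_glyph_table(font: dict) -> list:
    """Build a fixed-size table from a {code point: glyph} mapping."""
    table = [None] * GLYPH_SLOTS
    for code_point, glyph in font.items():
        if code_point < GLYPH_SLOTS:  # anything higher has no slot
            table[code_point] = glyph
    return table


def glyph_for(ch: str, table: list):
    """Look up one character, falling back to the unknown glyph."""
    code_point = ord(ch)
    if code_point >= GLYPH_SLOTS or table[code_point] is None:
        return UNKNOWN_GLYPH
    return table[code_point]
```

Both failure cases end up the same way: a code point beyond the table, and a code point inside the table for which the font has no glyph, render as the unknown character.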

Ters · March 04, 2018, 08:10:02 AM

It is just that the Latin2 font defines glyphs for code points up to and including 255, but in Unicode, 0 through 255 is Basic Latin and Latin 1. So Simutrans would either have to remap the code points used in the font file when loading it, or it simply reads Latin 2 encoded text as if it was Latin 1 and "corrects" it by displaying the correspondingly "wrong" glyph. The latter is how it would have been done before Unicode.
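
The first option, remapping on load, could look roughly like this in Python. This is only a sketch of the idea; I do not know what Simutrans actually does, and `remap_latin2_font` is a hypothetical name.

```python
def remap_latin2_font(glyphs_by_byte: dict) -> dict:
    """Re-key a Latin2 font from byte values (0..255) to Unicode code points.

    glyphs_by_byte maps each byte value used in the font file to its glyph.
    """
    remapped = {}
    for byte_value, glyph in glyphs_by_byte.items():
        # Ask the Latin2 codec which Unicode character this byte stands for.
        code_point = ord(bytes([byte_value]).decode("iso8859_2"))
        remapped[code_point] = glyph
    return remapped
```

For example, the glyph stored at byte 0xC8 (Č in Latin2) moves to code point U+010C, while ASCII slots such as 0x41 stay where they are. The second option needs no remapping at all: the byte 0xC8 is read as Latin 1 È (U+00C8), and slot 0xC8 of the Latin2 font just happens to draw a Č.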

prissi · March 09, 2018, 02:35:04 AM

Upon loading files, Simutrans tries to convert them.

Since Simutrans's UTF support is older than the BOM, it uses § to flag a file as UTF-8 (and of course it also accepts a BOM). If it encounters a file with 8-bit characters that is not UTF-8, the result depends on the default font. If the Czech font is used, the content will be assumed to be Latin2, and otherwise Latin1. Objects in the game will be saved as they have been named in the dat file (name="xyz"). That is often Latin2 for Czech editors, and maybe even SJIS for some Japanese ones. That is why there should not be any 8-bit characters in the names. Unfortunately, some German, Czech and Japanese objects ignored this.
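
In Python terms, the fallback described above might be sketched as follows. The function, its parameters and the font label "czech" are hypothetical; the real loader is C++ and may differ in detail.

```python
def decode_loaded_text(raw: bytes, flagged_utf8: bool, default_font: str) -> str:
    """Decode text the way the post above describes.

    flagged_utf8: True if the file started with § or a BOM.
    default_font: hypothetical label for the active font, e.g. "czech".
    """
    if flagged_utf8:
        return raw.decode("utf-8")
    # Unflagged 8-bit text: the guess depends on the default font.
    codec = "iso8859_2" if default_font == "czech" else "latin-1"
    return raw.decode(codec)
```

This shows why 8-bit object names are fragile: the same byte sequence 0xC8 "SD" reads as "ČSD" with the Czech font and as "ÈSD" with any other.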

Internally, Simutrans uses the old Unicode with 65535 characters, and there is no extension planned. Rather, it will be switched to freefont so that the installed system fonts can also be used for display, and to enable font scaling.

I could change makeobj to no longer compile files like this. But of course it will only work in retrospect and break many existing games (because such objects will no longer be found).
