The International Simutrans Forum


Author Topic: Revision control softwares (split from: Using git)  (Read 4949 times)


Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5495
  • Languages: EN, NO
Revision control softwares (split from: Using git)
« on: November 08, 2013, 06:19:34 AM »
(topic split from: http://forum.simutrans.com/index.php?topic=12770.msg127216#msg127216)

The problem on Windows isn't the file system encoding; both Windows and Linux can and do perform translation that, to some degree, hides the true encoding from the application. What is special about Windows is that when it went Unicode, which is used internally in the kernel, it introduced a parallel API so that the old API could continue to operate on the various 8-bit encodings expected by existing applications. At the time, UTF-8 did not exist, so Windows went for a fixed 2-byte Unicode encoding (later changed to variable-length UTF-16, as 2 bytes turned out not to be enough for all of Unicode). Linux, on the other hand, simply switched its existing API over to UTF-8.
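A quick Python sketch (illustrative, not from the original post) of why a fixed 2-byte encoding stopped being enough: characters outside the Basic Multilingual Plane need a surrogate pair in UTF-16, while UTF-8 simply uses a longer byte sequence.

```python
# U+1D11E (musical G clef) lies outside the Basic Multilingual Plane.
ch = "\U0001D11E"

utf16 = ch.encode("utf-16-le")
utf8 = ch.encode("utf-8")

print(len(utf16))  # 4 bytes: two 16-bit code units, i.e. a surrogate pair
print(len(utf8))   # 4 bytes too, but as one variable-length sequence
print(len("A".encode("utf-16-le")))  # 2: BMP characters still fit one code unit
```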

This has the effect that applications on Linux can use Unicode without knowing it, as UTF-8 passes through most ASCII-oriented code unharmed. Standard C functions now take Unicode where they previously accepted ASCII or a localized variation of it. On Windows, however, these functions still map to the old local character encoding. Programs must use a different API to receive or pass file names or other text in Unicode.

This means that when Git and Mercurial read a file name on Windows, they get a series of bytes that can easily be invalid UTF-8, whereas UTF-8 file names from Linux will look strange when viewed with the (in my case Latin-1) encoding used by the legacy Windows API. Subversion can more easily get away with it, as it can translate the file names going between computers, but for Git and Mercurial the file name is part of the data used to generate the hash identifying a change set. It can't be modified afterwards without breaking things.
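A simplified, hypothetical sketch of that last point in Python. The real Git and Mercurial object formats are more involved; the point here is only that if raw file-name bytes feed the change-set hash, the same visible name in two encodings produces two different identities.

```python
import hashlib

# Hypothetical, simplified change-set id: hash of file-name bytes plus content.
def changeset_id(name_bytes: bytes, content: bytes) -> str:
    return hashlib.sha1(name_bytes + b"\0" + content).hexdigest()

content = b"hello"
# The same visible name "æ.txt" as Latin-1 bytes vs. UTF-8 bytes:
latin1_name = "æ.txt".encode("latin-1")  # b'\xe6.txt'
utf8_name = "æ.txt".encode("utf-8")      # b'\xc3\xa6.txt'

# Different bytes in, different hash out - the name can't be re-encoded later
# without changing the change set's identity.
print(changeset_id(latin1_name, content) == changeset_id(utf8_name, content))  # False
```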
« Last Edit: November 09, 2013, 04:50:52 AM by IgorEliezer »

Offline sdog

  • Devotee
  • *
  • Posts: 2039
Re: Revision control softwares
« Reply #1 on: November 08, 2013, 07:49:51 AM »
Thanks Ters, that was a very interesting answer to my rather confused question.

Where I don't follow is: as far as the application is concerned, only the outside of the interface matters. As long as there is a bijective mapping between UTF-8 and the local encoding (e.g. UTF-16), it ought not to matter to the application what is done internally in the interface. I.e. the app sends a UTF-8 string and receives a UTF-8 string in return? Since the hash can only be generated in the app, the layer behind the OS's interfaces can't influence it?

Since it obviously does not work, does this mean there is not a bijective mapping between UTF-8 and UTF-16? Or is there a glaring mistake in my above assumption?
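For what it's worth, the mapping question can be checked directly. An illustrative Python sketch (not from the posts): UTF-8 ↔ UTF-16 round-trips losslessly for any valid Unicode text; the failure happens earlier, because the legacy-codepage bytes handed out by the old API are not valid UTF-8 to begin with, so there is nothing well-formed to map.

```python
text = "smörgåsbord"

# The UTF-8 <-> UTF-16 mapping is bijective for valid Unicode text:
roundtrip = text.encode("utf-16-le").decode("utf-16-le").encode("utf-8").decode("utf-8")
assert roundtrip == text

# But bytes from a legacy code page (cp1252 here) are not valid UTF-8:
legacy_bytes = text.encode("cp1252")  # b'sm\xf6rg\xe5sbord'
try:
    legacy_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8 - decoding fails before any mapping can happen")
```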

Second part where I lost you: where do the legacy APIs play into this? They were likely necessary to ease the transition from old encodings in the move to NT, and later from 98 to XP. Most likely they are still there due to MS's tendency to allow extreme backward compatibility. But why would they matter for software compiled today?
"Programs must use a different API to receive or pass file names or other text in Unicode."
In other words, why run into problems if one does not specifically choose to interface with the legacy API?

"Subversion can more easily get away with it, as it can translate the file names going between computers"
Wait, that means they might still get mojibake in file names; it just doesn't fail because of it?

Thought, with the problem as described: this might also affect EncFS-encrypted files synced with rsync between Windows and *nix (cloud or phone), or anything else that does file-name encryption or initialization-vector chaining based on the path.


ps.: just ignore me when this becomes too tedious. I just found it interesting. (Hope I don't sound aggressive; if so, it's just ignorance.)

Offline Markohs

  • DevTeam, Coder/patcher
  • Devotees (Inactive)
  • *
  • Posts: 1559
  • Languages: EN,ES,CAT
Re: Revision control softwares
« Reply #2 on: November 08, 2013, 12:27:27 PM »
Second part where I lost you: where do the legacy APIs play into this? They were likely necessary to ease the transition from old encodings in the move to NT, and later from 98 to XP. Most likely they are still there due to MS's tendency to allow extreme backward compatibility. But why would they matter for software compiled today?

 Well, I'm getting off-topic, I'll try to be short. :)

 This extreme backwards compatibility is actually a very good thing and desirable in software. Microsoft is not alone in this; for example, Sun Microsystems guaranteed binary compatibility for all their operating systems, and you can still run software compiled for Solaris 5 on Solaris 11 today. Solaris 5 was released in 1991. This is achieved with binary interfaces and versioning. Not to mention IBM, which still keeps execution modes in their computers so they can actually run programs designed for IBM 360 machines. And believe me, that's needed, because there are lots of applications running even today whose source code has disappeared, or that were written in assembler. They still sell you their old Ada/COBOL/Fortran/Pascal compilers; there is a lot of code written for those that is in use and will not disappear.

 Linux's nightmare with binary compatibility is something I can't understand. How GNU floods glibc with incompatibilities that break binary compatibility is beyond my understanding, but from what I read, it's based on some Richard Stallman ideas implying that keeping interfaces standard is against free software freedom. ****, I'd say. :) It looks like any change that could potentially save 4 CPU cycles is accepted into glibc, even if it breaks old applications completely.

 I'd say this is something very bad about Linux, including the lack of stable kernel driver compatibility, so vendors can't just distribute a private module and expect it to work forever; they are forced to spawn a community that implements the driver as open source and maintains it across (gratuitous) kernel version changes until they grow tired of it and disband.

 Regarding UTF, I'd say we as programmers should just use the 16-bit UTF implementation in our programs on Windows. Linux's move to just use UTF-8 has its disadvantages too; both were valid and reasonable options.
« Last Edit: November 08, 2013, 12:37:09 PM by Markohs »

Offline sdog

  • Devotee
  • *
  • Posts: 2039
Re: Revision control softwares
« Reply #3 on: November 08, 2013, 05:35:15 PM »
I can see your point about Linux or GNU, and mostly agree. Perhaps it's not so much ideology but both group dynamics and pragmatism. If a device driver isn't maintained actively, it's better that it breaks than that it remains a security hole. For the end users this means they either pay for further driver maintenance, acquire the source code, or just get new hardware.
 It just has to last until it's amortised; expecting it to run for free afterwards is asking too much.

For backward compatibility: why legacy software has to run on a modern OS is beyond me. If you can't maintain the code, it has to be sandboxed. When doing so, one can use any antique interface or just use a VM.

This was more fundamental, and I didn't want to polemicise in my previous post. I don't understand if or why git happened to interface with the legacy APIs, and why not the current one with UTF-8. When all is Unicode, the interface can be polymorphic, as the encoding can be identified by the length, i.e. whether the second byte is the tail of a 16-bit char or a new 8-bit char. Is it, or is it purely 16 to 16, forcing non-UTF-16 apps to use those other APIs?


ps.: Vernor Vinge had in some sci-fi novels the concept of the programmer-archaeologist. Centuries-old sub-light interstellar spacecraft had software so complex and old that it required different approaches. The fact that the ship survived to get that old makes the crew reluctant to implement new things that might cause bugs that would kill them, but they can assume that over the centuries someone had almost the same problem to solve, and 'dig'.


@mod or Igor: I've pushed the thread so far astray that it might be prudent to move it to the randomness lounge. #9 would be a good break. I suppose Ters wouldn't mind.
« Last Edit: November 08, 2013, 05:52:26 PM by sdog »

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5495
  • Languages: EN, NO
Re: Revision control softwares
« Reply #4 on: November 08, 2013, 05:36:12 PM »
Where I don't follow is: as far as the application is concerned, only the outside of the interface matters. As long as there is a bijective mapping between UTF-8 and the local encoding (e.g. UTF-16), it ought not to matter to the application what is done internally in the interface. I.e. the app sends a UTF-8 string and receives a UTF-8 string in return? Since the hash can only be generated in the app, the layer behind the OS's interfaces can't influence it?

Windows does not have a UTF-8 interface (unless it's possible for applications to switch the "ANSI code page" to UTF-8, preferably for its own process only). UTF-8 wasn't around when Windows NT went Unicode. If you use the old char interface, you get some local encoding; for western Europe, this will be windows-1252, a close relative of ISO 8859-1 aka Latin-1. If you use the wchar_t interface, you get UTF-16. On Linux, there is no wchar_t interface, and the char interface gives you UTF-8 if the system uses Unicode.
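The three interfaces described here can be illustrated by encoding one file name three ways. A Python sketch (illustrative only; cp1252 is the western-European "ANSI" code page mentioned above):

```python
name = "bjørn.txt"

# What the legacy char ("ANSI") API would hand out on a western-European system:
print(name.encode("cp1252"))     # b'bj\xf8rn.txt'  - one byte per character
# What the wchar_t API hands out on Windows:
print(name.encode("utf-16-le"))  # two bytes per (BMP) character
# What the char API hands out on a Unicode-configured Linux system:
print(name.encode("utf-8"))      # b'bj\xc3\xb8rn.txt' - ø becomes two bytes
```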


Second part where I lost you: where do the legacy APIs play into this? They were likely necessary to ease the transition from old encodings in the move to NT, and later from 98 to XP. Most likely they are still there due to MS's tendency to allow extreme backward compatibility. But why would they matter for software compiled today?
"Programs must use a different API to receive or pass file names or other text in Unicode."
In other words, why run into problems if one does not specifically choose to interface with the legacy API?
I believe the problem is that the C library only uses the char data type for file names, which means it's built on top of Windows' legacy "ANSI" API. If you use the Windows API directly, it's relatively easy to switch to Unicode, but portable applications won't be using the Windows API. They rather use the standard C API, which is still "ANSI" on Windows, but "secretly" converted to Unicode with UTF-8 on Linux.

Unicode is opt-in for programs on Windows, where you deal with it as wchar_t, whereas on Linux it is forced upon them by the system, as regular chars. It's possible to get things right, but you have to write some platform-specific code to do it. That both Git and Mercurial were written by and for Linux developers might explain why they didn't include such Windows-specific code, and retrofitting it might be far from trivial.
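Both failure directions follow from this. An illustrative Python sketch (cp1252 again stands in for the legacy Windows "ANSI" code page):

```python
# Direction 1: a UTF-8 file name from Linux, viewed through the legacy
# Windows char API, turns into mojibake but does not fail outright.
utf8_bytes = "æ.txt".encode("utf-8")  # b'\xc3\xa6.txt'
print(utf8_bytes.decode("cp1252"))    # 'Ã¦.txt'

# Direction 2: an "ANSI" file name from Windows, handed to a tool that
# expects UTF-8, is simply invalid - this is the case that breaks things.
ansi_bytes = "æ.txt".encode("cp1252")  # b'\xe6.txt'
try:
    ansi_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8")
```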

"Subversion can more easily get away with it, as it can translate the file names going between computers"
Wait, that means they might still get mojibake in file names; it just doesn't fail because of it?
No, I don't think so. Subversion can freely convert the file name to the proper encoding. I don't remember if I have actually tried, but when googling my problems with Mercurial, no one mentioned Subversion having the same problem.

Thought, with the problem as described: this might also affect EncFS-encrypted files synced with rsync between Windows and *nix (cloud or phone), or anything else that does file-name encryption or initialization-vector chaining based on the path.
Possibly. Git and Mercurial aren't the only software that can screw things up when transferring stuff between machines with different 8-bit character encodings.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9472
  • Languages: De,EN,JP
Re: Revision control softwares
« Reply #5 on: November 08, 2013, 11:19:02 PM »
Well, it is not entirely true. You can get UTF-8 file names (Windows 98 did this on FAT32). And there are functions in Windows to convert between code pages, including UTF-8 (so it was around back then too; even UTF-7 existed almost as long as 16-bit Unicode). Only for the input you will get wchars, and obviously no UTF-8 byte sequence.
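The conversion functions referred to here can be modeled in Python (a sketch, not the Windows API itself): `bytes.decode(codepage)` plays the role of MultiByteToWideChar with a given code page argument, producing the same "wide" string from differently encoded input bytes.

```python
# MultiByteToWideChar(CP_ACP, ...) modeled: bytes in the local "ANSI"
# code page, decoded to a wide (Unicode) string.
raw_ansi = b"caf\xe9"                   # é as a single cp1252 byte
wide_from_ansi = raw_ansi.decode("cp1252")
print(wide_from_ansi)                   # café

# MultiByteToWideChar(CP_UTF8, ...) modeled: the same name as UTF-8 bytes.
raw_utf8 = "café".encode("utf-8")       # b'caf\xc3\xa9'
wide_from_utf8 = raw_utf8.decode("utf-8")

# Different input bytes, same wide string - the code page argument is
# what tells the conversion how to interpret the bytes.
print(wide_from_ansi == wide_from_utf8)  # True
```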

Offline sdog

  • Devotee
  • *
  • Posts: 2039
Re: Revision control softwares
« Reply #6 on: November 09, 2013, 01:09:20 AM »
@Markohs, Ters
Sorry, I missed who replied; I didn't see it properly while reading on the phone.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5495
  • Languages: EN, NO
Re: Revision control softwares (split from: Using git)
« Reply #7 on: November 09, 2013, 11:10:57 AM »
Well, it is not entirely true. You can get UTF-8 file names (Windows 98 did this on FAT32). And there are functions in Windows to convert between code pages, including UTF-8 (so it was around back then too; even UTF-7 existed almost as long as 16-bit Unicode). Only for the input you will get wchars, and obviously no UTF-8 byte sequence.

Yes, but the format used in the file system itself is hidden from normal applications, and conversion to UTF-8 must be done explicitly. You won't get UTF-8 file names from readdir on Windows like you do on Linux, not by default at least.

According to Wikipedia, UTF-8 was officially presented in 1993, the same year Windows NT was released. Support for converting to and from UTF-8 was likely added to the Windows API at a later point, but you still have to read file names as wchar_t from the API and then convert them to UTF-8. Portable applications do neither by default.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9472
  • Languages: De,EN,JP
Re: Revision control softwares (split from: Using git)
« Reply #8 on: November 09, 2013, 09:30:01 PM »
I still had my help file for Windows NT 3.1 programming lying around (it is still the best compact introduction to the Windows API, although a little dated).

Quote
Double-byte Character Sets

The double-byte character set (DBCS) is called an expanded 8-bit character set because its smallest unit is a byte. It can be thought of as the ANSI character set for some Asian versions of Windows (particularly the Japanese version). Win32 functions for the Japanese version of Windows accept DBCS strings for the ANSI versions of the functions. However, unlike the handling of Unicode, DBCS character handling requires detailed changes in the character-processing algorithms throughout an application's source code.

which indeed had single-, double- or three-byte characters (like the Japanese JIS code pages). There was also a function called "IsDBCSLeadByte", which could determine whether a byte is the start of a multibyte sequence, as well as MultiByteToWideChar and WideCharToMultiByte, which needed a code page as argument. The available code pages (taken from a 1995 Borland compiler) were

Code: [Select]
//
//  Code Page Default Values.
//
#define CP_ACP                    0           // default to ANSI code page
#define CP_OEMCP                  1           // default to OEM  code page
#define CP_MACCP                  2           // default to MAC  code page
#define CP_SYMBOL                 42          // SYMBOL translations

#define CP_UTF7                   65000       // UTF-7 translation
#define CP_UTF8                   65001       // UTF-8 translation

and diverse country-specific encodings, which I left out.

Given this, I would say Windows (32-bit) has supported UTF-8 since 1995, resp. Windows NT 3.51 (actually, support for the UTF-8 code page is rather trivial, so it may even have been there for NT 3.1, in the time between standardisation and release).

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5495
  • Languages: EN, NO
Re: Revision control softwares (split from: Using git)
« Reply #9 on: November 09, 2013, 11:32:02 PM »
There were multi-byte encodings before UTF-8. But for the problem I brought up, it doesn't matter what MultiByteToWideChar and WideCharToMultiByte support or how long they have been supported, as long as fopen (or the CreateFileA function in the Windows API it's probably based on) doesn't take a UTF-8 string by default. Windows has a separate wchar_t-based API for Unicode, while Linux repurposed the existing char-based API. That's the issue.

Maybe it's possible to call something like SetAnsiCodepage(CP_UTF8) to switch Windows' "ANSI" API to UTF-8, which would cause the char-based API to behave like it does on Linux, but that's platform-specific code that needs to be added explicitly. If it exists, it doesn't appear that enough people are aware of it.

Of course, Git and Mercurial will also have trouble between a Linux machine configured to use Unicode and a Linux machine not configured to use Unicode.

Offline prissi

  • Developer
  • Administrator
  • *
  • Posts: 9472
  • Languages: De,EN,JP
Re: Revision control softwares (split from: Using git)
« Reply #10 on: November 10, 2013, 12:23:59 AM »
There is no fopen call in the Windows APIs at all... There is "CreateFile" (or the very old and even for NT obsolete _lopen). So it is the library that is to blame. You can have a C standard library using UTF-8 (but isn't C standardized to use the ASCII subset?). With the UTF-8 code page, the Windows CreateFileA version would use UTF-8.

Even Posix does not support UTF-8, but only a local code page (which apparently can be UTF-8). So Posix uses the same practice as Windows.

Offline Ters

  • Coder/patcher
  • Devotee
  • *
  • Posts: 5495
  • Languages: EN, NO
Re: Revision control softwares (split from: Using git)
« Reply #11 on: November 10, 2013, 09:18:52 AM »
There is no fopen call in the Windows APIs at all

Neither Git nor Mercurial uses the Windows API. But fopen on Windows is built on top of the Windows API. Since fopen takes a char string argument, I guess it was natural to base it on the "ANSI" sub-API. I don't think Windows 95 had a Unicode sub-API (unicows came later, and is in any case an add-on), and I believe the C runtime was common to Windows 95 and Windows NT.

Even Posix does not support UTF-8, but only a local code page (which apparently can be UTF-8).

That's the crucial thing. Windows has explicit Unicode support; Linux (or Posix) doesn't really care. But Git and Mercurial assume that every sane system uses a UTF-8 "code page" these days, except Windows doesn't, since Unicode is by convention treated separately there. (Technically, Git and Mercurial don't assume UTF-8. They care as little about that as Linux itself, or even less. But they do assume that all systems use the same code page, which in practice is UTF-8 on Linux.)