[Bug: 21.5-b25] Problems with latin-unity and VM

Wed Feb 7 10:29:49 EST 2007

Joachim Schrod writes:

 > uses its internal character encoding in buffers; while I thought some
 > more meta-information about the origin of that encoding is available,
 > that actually didn't seem to matter here.

It is available, it's just a design decision that it won't be used
during editing.

 > My error was at a different place: I thought that one octet (let's say,
 > 0xe4 or "ä") in a file could end up as two different internal XEmacs
 > characters (let's use for the sake of this example the symbolic names
 > 'octet-0xe4 and 'latin1-aumlaut), depending on the coding system used
 > for reading. There, a coding system of 'binary would trigger the
 > creation of the first XEmacs character, and a coding system of
 > 'iso-8859-1 (or any of its variants) would trigger the creation of the
 > second character.

In fact, this is correct.  For example, all but 8 of the members
of ISO-8859-1 and ISO-8859-15 are the same characters according to ISO
8859 and Unicode, but (due to an unfortunate design decision closely
related to "Japanese exceptionalism"; you can look up "Han
Unification" in Wikipedia for a reasonably accurate story) in Mule
they are different; even NO-BREAK SPACE (U+00A0) in latin-iso8859-1 !=
NO-BREAK SPACE (U+00A0) in latin-iso8859-2.  :-(
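
To make the point above concrete, here is a minimal Python sketch.  It
is only a toy model: `MuleChar`, `CODECS` and `to_unicode` are
hypothetical names invented for this illustration, and the assumption
that a Mule character behaves like a (charset, octet) pair is a
simplification, not XEmacs's actual internal representation.

```python
from typing import NamedTuple


class MuleChar(NamedTuple):
    """Toy stand-in for a Mule character: a charset tag plus one octet."""
    charset: str
    octet: int


# Assumption: map the Mule charset names used above to Python codec names.
CODECS = {
    "latin-iso8859-1": "iso-8859-1",
    "latin-iso8859-2": "iso-8859-2",
}


def to_unicode(ch: MuleChar) -> str:
    """Return the Unicode character that the tagged octet stands for."""
    return bytes([ch.octet]).decode(CODECS[ch.charset])


nbsp_l1 = MuleChar("latin-iso8859-1", 0xA0)
nbsp_l2 = MuleChar("latin-iso8859-2", 0xA0)

# Same character as far as Unicode is concerned ...
same_in_unicode = to_unicode(nbsp_l1) == to_unicode(nbsp_l2)
# ... but two distinct objects in the toy Mule model.
same_in_mule = nbsp_l1 == nbsp_l2
```

In this model `same_in_unicode` is true while `same_in_mule` is false,
which is the mismatch latin-unity was written to paper over.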

This means that the same will be true of the coding systems iso-8859-1
and iso-8859-2.  Thus latin-unity was born.  I really didn't think it
would survive anywhere near this long. :-(

What's special about binary and ISO 8859-1 is that binary ==
iso-8859-1 by definition.  So in your example 'octet-0xe4 ==
'latin1-aumlaut != 'latin2-aumlaut (although in both cases the second
octet of the internal representation is 0xe4!).
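
The reason binary can be identified with iso-8859-1 at all is that
ISO 8859-1 maps octet N to code point N for every N.  The following
sketch checks that property with Python's own "iso-8859-1" codec; it
says nothing about how XEmacs implements either coding system, and the
variable names are made up for the example.

```python
# Every possible octet value, in order.
all_octets = bytes(range(256))

# Decode them as ISO 8859-1 text.
as_text = all_octets.decode("iso-8859-1")

# Octet N becomes the character with code point N ...
identity = all(ord(c) == b for c, b in zip(as_text, all_octets))

# ... so encoding the text again reproduces the original octets exactly.
round_trips = as_text.encode("iso-8859-1") == all_octets
```

Both `identity` and `round_trips` come out true, which is why reading
arbitrary data as iso-8859-1 cannot lose information.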

There's actually another difference: ISO 8859-X for X != 1 is treated
as a version of ISO 2022 (which is correct, as far as it goes),
including the possibility of using ISO 2022 escape sequences to change
character sets, which is horribly wrong.  This disgusting hack was
implemented more or less to prevent data loss (since in principle ISO
2022 can encode all characters, and in practice XEmacs's version
covers about 90% of Unicode).  *But it doesn't apply to iso-8859-1/
binary*, which is why binary is the primary cause of such data loss.
Perhaps even worse, it means that iso-8859-2 is not a spelling for
binary, and people who have LANG=hr_HR.ISO8859-2 (Croatian) find their
tarballs and other binary stuff getting mysteriously corrupted.
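
A rough illustration of how escape-sequence handling can corrupt
binary data follows.  This is a deliberately tiny toy decoder and
encoder, not XEmacs code: the function names, the restriction to two
charsets, and the choice to recognize only the ISO 2022 sequences
"ESC - A" and "ESC - B" (designate the Latin-1 or Latin-2 right half to
G1) are all assumptions made for the sketch.

```python
ESC = 0x1B

# Assumption: only two "designate 96-charset to G1" final bytes are known.
G1_FINALS = {ord("A"): "latin-iso8859-1", ord("B"): "latin-iso8859-2"}
FINAL_FOR = {name: final for final, name in G1_FINALS.items()}


def toy_decode(data, initial_g1="latin-iso8859-2"):
    """Turn octets into (charset, octet) pairs, consuming known escapes."""
    chars = []
    g1 = initial_g1
    i = 0
    while i < len(data):
        if (data[i] == ESC and i + 2 < len(data)
                and data[i + 1] == ord("-") and data[i + 2] in G1_FINALS):
            g1 = G1_FINALS[data[i + 2]]  # switch charset, emit no character
            i += 3
            continue
        octet = data[i]
        if octet < 0x80:
            charset = "ascii"
        elif octet < 0xA0:
            charset = "control-1"
        else:
            charset = g1
        chars.append((charset, octet))
        i += 1
    return chars


def toy_encode(chars, initial_g1="latin-iso8859-2"):
    """Turn pairs back into octets, emitting an escape only when needed."""
    out = bytearray()
    g1 = initial_g1
    for charset, octet in chars:
        if charset in FINAL_FOR and charset != g1:
            out += bytes([ESC, ord("-"), FINAL_FOR[charset]])
            g1 = charset
        out.append(octet)
    return bytes(out)


# "Binary" data that happens to contain a redundant ESC - B.
blob = b"\x00\xe4\x1b-B\xe4\xff"
written_back = toy_encode(toy_decode(blob))
```

Here `written_back` is three octets shorter than `blob`: the decoder
swallowed the escape sequence as a charset switch, and the encoder had
no reason to write it out again.  A tarball treated this way would no
longer match its original.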

There is hope, however!  This will all go away, more or less
automatically, when we convert to Unicode internal coding.  I'll
undoubtedly get death threats from ultra-right-wing Japanese, but I
will bravely face my fate!

 > And that seems to be my primary misconception: Since you tell me that
 > "C1 and Latin-1 characters are represented in *two* octets", it seems
 > that these two situations (difference between coding systems 'binary
 > and 'iso-8859-1 during read) are not distinguished and that the
 > internal character 'latin1-aumlaut is always created, because the
 > internal XEmacs character 'octet-0xe4 does not exist. (The latter
 > would probably correspond to the FSF Emacs unibyte encoding that you
 > mentioned that XEmacs doesn't have.)

Basically, yes.