Typos in GCC output

Sun Oct 22 23:05:44 EDT 2006

Nix writes:

 > But, um, why aren't the codecs expecting UTF-8 when LANG is set to
 > a UTF-8 locale?

They do by default in most situations, such as reading from files.
But processes are a different matter, because they're not rewindable.
So basically the detector needs to make a decision based on the first
bufferful, which is short and very likely to be pure ASCII.  (Don't
say "but ..." for a couple paragraphs, please.)

In that situation, it's typically a bad idea to assume anything but
binary.  The codecs don't know that they're dealing with GCC; they
could easily be dealing with XEmacs, which has a habit of dumping byte
code in its Lisp backtraces.  It turns out that in XEmacs 'binary is
synonymous with 'iso-8859-1 (a backwards-compatibility hack), which
had the misfeature of doing the right thing 90% of the time, keeping
the Mule developers from working on pushing that to 99.4%.

You could keep a list of which apps are well-behaved, but I suspect
this would be unreliable.

I don't know of any (working) resampling codecs that check to see if
things have changed on the fly.  It probably makes sense in the case
of synchronous processes, such as *compile* buffers, to treat them as
"slow" files.  That is, buffer the whole output from the process, do
detection on a large sample, and then convert the whole thing.
(There's an obvious space optimization: buffer only until you have as
much as the detector will actually use, convert that detection sample,
and use the chosen codec for the rest on-the-fly.)  But it's not obvious how to
make this work for asynchronous processes, such as *shell* buffers.
Do you really want to reconvert the buffer every time the process
spits out more output?  (Maybe you do, but somebody's gonna have to
code it and put it in front of users to see what they think.)
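
For concreteness, the "slow file" idea with the space optimization might
look something like the Python sketch below.  This is not how XEmacs does
it; the names (guess_codec, decode_stream, SAMPLE_BYTES) are hypothetical,
and the "detector" is a toy that only chooses between UTF-8 and Latin-1.

```python
import codecs
import io

SAMPLE_BYTES = 4096  # hypothetical size of the detection sample


def guess_codec(sample: bytes, complete: bool) -> str:
    """Toy detector: anything that decodes as UTF-8 is UTF-8, else Latin-1."""
    try:
        # With final=False, a multibyte sequence cut off at the end of the
        # sample is tolerated rather than treated as an error.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=complete)
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def decode_stream(stream, sample_bytes=SAMPLE_BYTES, chunk_size=256):
    """Yield decoded text, buffering raw bytes only until the sample is full."""
    pending = bytearray()
    decoder = None
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        if decoder is None:
            pending += chunk
            if len(pending) < sample_bytes and not at_eof:
                continue  # still collecting the detection sample
            name = guess_codec(bytes(pending), complete=at_eof)
            decoder = codecs.getincrementaldecoder(name)(errors="replace")
            text = decoder.decode(bytes(pending), final=at_eof)
        else:
            # The codec is fixed from here on; convert on the fly.
            text = decoder.decode(chunk, final=at_eof)
        if text:
            yield text
        if at_eof:
            break
```

Note that a short sample reproduces exactly the problem described above:
if the first SAMPLE_BYTES are pure ASCII, the sketch commits to UTF-8 and
any later Latin-1 byte comes out as a replacement character.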

 > We could do, only LANG isn't a system-global parameter. You can easily
 > wrap GCC in a tiny script that unsets LANG, viz
 > 
 > LANG= exec real-gcc "$@"

And what if "real-gcc" is a lie, and in fact it's a script

LANG=Etruscan exec really-real-gcc "$@"

?  People (at least in the Japanese subset) actually do stuff like
that.

Environment is for setting user defaults, but the user rarely
interfaces directly with GCC these days; they interact with some kind
of filter (aka IDE).  "System-global" is incorrect as you point out,
but LANG *is* process-global, which is inappropriate in multilingual
applications.  (Suppose iconv obeyed LANG?  Then in your environment
it would only be useful for converting UTF-8 to UTF-8! ;-)

 > Hm. I find myself wondering if perhaps the XEmacs shell modes shouldn't
 > arrange to reset LANG appropriately for the process-coding-system: of
 > course that doesn't help much if you change it after the shell is
 > started, but if you do *that* presumably you knew you were doing it and
 > could feed a LANG= to the shell.

Nope; it's a chicken-and-egg problem.  Sometimes you start with the
chicken and sometimes with the egg.  There is no general way to do
this; it's AI stuff.  You need a heavy-duty multilingual natural
language processing system to do what even the famed 7-bit
null-lingual American programmer can do "by eye".

 > (The Unix LANG/LC_* model isn't really designed for situations where
 > you're constantly changing applicable encoding, is it?)

You're darn tootin' it isn't.  That's basically the issue that killed
XIM.  setlocale() can take a noticeable amount of time in situations
where you're switching back and forth between input modes all the
time.  And don't even mention "multi-threading"!

 > Because it has to determine if the identifiers are valid, tokenize them,
 > and so on. (i.e. there are a *lot* of possible encodings for, say, `.'
 > in Java, all of which act like a `.'.  C and C++ need some degree of
 > Unicode support as a QoI matter, as well: who are we to say that people
 > can't put their own names in comments or string literals?)

Uh, isn't that what iconv and XEmacs are for?  And what makes you
think that the guy at the next desk will necessarily use Unicode for
his name?---after all, your company only bought him and his code last
week.

It's a serious layering violation for GCC itself to be doing those
translations.  gcc (the gcc controller app itself) should assume that
Java code is native-endian UTF-16 (that's the standard, right?).
Users should invoke gcc via a wrapper that handles the translations
for them.

 > > As stated, the assumption is not violated.  My tools *are* capable of
 > > handling UTF-8.  It is the inference that using UTF-8 is therefore
 > > reliable that is wrong.
 > 
 > I still don't understand why. Is it that LANG might not match the
 > encoding you're using *right now*?

It's that in a multilingual environment the odds are very good that
LANG inherited from a process's initialization doesn't match the
encoding I'm using right now, yes.

 > If so, then, well, this only applies to people who are changing
 > encodings all the time in shell buffers in which they're also running
 > compilations. Is this really common? (If they're changing encodings so
 > often, surely they can change encoding back?)

In Japan it is; UTF-8-encoded Japanese text is very much a minority
taste even today.

Of course you can change encodings back.  The issue is, why should I
have to pay that tax for a *compiler's error output*??  The `'
convention is perfectly usable though ugly, and can easily be
beautified (e.g., with a nearly harmless comint filter in Emacsen).
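
To show how little such a beautifying filter needs to do, here is a rough
sketch in Python rather than Emacs Lisp.  The function name
beautify_quotes is made up for illustration, and the pattern assumes that
a quoted item never spans lines or contains a quote character itself.

```python
import re

# Matches `text' where text contains no backquote, apostrophe, or newline.
_QUOTED = re.compile(r"`([^`'\n]*)'")


def beautify_quotes(line: str) -> str:
    """Replace each `text' pair with the curly-quoted form for display."""
    return _QUOTED.sub("\u2018\\1\u2019", line)
```

A lone apostrophe, as in "don't", has no matching backquote and is left
alone.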

 > No! Stick with a LANG set to UTF-8 and everything should work. I can't
 > understand why it isn't for you.

Because I'm a resident of Japan, which has *5* major encodings
(ISO-2022-JP, EUC-JP, Shift-JIS, UTF-8, and roman transliteration)
reachable with one keystroke from my mail summary buffer, not to
mention GB2312 (aka the preferred encoding of Chinese spam), UTF-16,
and other special-purpose encodings primarily used internally to
various applications that sometimes leak into public (damn that broken
Windows anyway, it's such a pane).

I *can* stick to LANG=ja_JP.UTF-8, precisely because XEmacs ignores
the "UTF-8".  The important part of that to XEmacs is "ja_JP", because
it tells XEmacs to prefer Japanese fonts and encodings where
characters are shared with Chinese and/or Korean.  Once I know it's
Japanese, the statistical characteristics of the octet streams give
"no defects in 10 years of heavy daily use" reliability in
distinguishing the 4 "real" Japanese encodings from each other.  And
even if I'm in an environment where the text is more likely to be
French than Japanese, XEmacs does better than 99%.

 > In that case, said smart tool should have *no* trouble with a couple of
 > Unicode quotes coming out of GCC (and, indeed, for me, it all works.
 > But that doesn't say much because if it didn't work for me I'd have
 > fixed it already.)

XEmacs has no trouble decoding that, and even if it did, you could fix
it with a simple comint filter.  What bothers me is that a useful
protocol was changed without warning, from something that is simple
and robust and well-known even to legacy applications, to something
that is less simple, demonstrably unreliable, and likely to cause bad
interactions with smart applications that use code dating from before
the protocol change.  Since the protocol was never formalized, GCC is
certainly within its rights to say "the joke's on you for trying to do
something useful with our past very regular behavior".  But I don't
think that's very productive.

 > I think I'm still missing the point (no surprise there, I seem to
 > specialize in it).

Well, it's not specific to you.  The basic tension is that for humans
it's all quite obvious, right there in front of your nose in FG_COLOR
and BG_COLOR.  A machine, however, generally has no idea what text
it's spewing at users.  It has even less idea about the encodings it
is being fed.  It's just a variant on the Postel Principle for
Internet clients: be catholic about what you accept, puritan about
what you produce.  In this case, "puritan" can, and IMO should, mean
"use UTF-8, of course! but restrict yourself to the subset 0-127". ;-)
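
As an illustration only, a producer that wants to stay "puritan" in this
sense could run its messages through something like the Python sketch
below before writing them out.  The table ASCII_FALLBACKS and the function
to_ascii_subset are hypothetical names, and the table covers only the
common curly quotes.

```python
# Map the curly quotes back to the traditional ASCII conventions.
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "`",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def to_ascii_subset(message: str) -> bytes:
    """Return bytes in the range 0-127, which are also valid UTF-8."""
    # Anything without a fallback is escaped instead of emitted raw.
    return message.translate(ASCII_FALLBACKS).encode(
        "ascii", errors="backslashreplace"
    )
```

The result decodes identically as ASCII, UTF-8, or Latin-1, so a consumer
that guesses the encoding wrong still sees the same text.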