Typos in GCC output

Nix nix
Mon Oct 23 17:42:34 EDT 2006


On Mon, 23 Oct 2006, stephen at xemacs.org murmured woefully:
> Nix writes:
> 
>  > But, um, why aren't the codecs expecting UTF-8 when LANG is set to
>  > a UTF-8 locale?
> 
> They do by default in most situations, such as reading from files.
> But processes are a different matter, because they're not rewindable.
> So basically the detector needs to make a decision based on the first
> bufferful, which is short and very likely to be pure ASCII.  (Don't
> say "but ..." for a couple paragraphs, please.)

True :(

That's a bit of a sod, really.

> In that situation, it's typically a bad idea to assume anything but
> binary.  The codecs don't know that they're dealing with GCC; they
> could easily be dealing with XEmacs, which has a habit of dumping byte
> code in its Lisp backtraces.  It turns out that in XEmacs 'binary is
> synonymous with 'iso-8859-1 (a backwards-compatibility hack), which
> had the misfeature of doing the right thing 90% of the time, keeping
> the Mule developers from working on pushing that to 99.4%.

Gah.

> You could keep a list of which apps are well-behaved, but I suspect
> this would be unreliable.

I see the problem. Alas, I expect that *most* apps assume that when the
locale is UTF-8, they can emit UTF-8 output. That's kind of a large part
of what the locale setting *means*.

> I don't know of any (working) resampling codecs that check to see if
> things have changed on the fly.  It probably makes sense in the case
> of synchronous processes, such as *compile* buffers, to treat them as
> "slow" files.  That is, buffer the whole output from the process, do
> detection on a large sample, and then convert the whole thing.

True, unless the compiler is *really* dog slow (not unheard of).

> (There's an obvious space optimization of buffering only until you
> have so much you won't use any more, convert the detection sample, and
> use that codec for the rest on-the-fly.)  But it's not obvious how to
> make this work for asynchronous processes, such as *shell* buffers.

I think you'd need a crude high-speed estimator which triggers a full
check of the probable coding system only when a character is emitted
that has not been seen recently and is not a frequently emitted
character in that coding system (so catting a binary by mistake would
trigger the check only a few times, but a sudden emission of a Unicode
quote would trigger a re-evaluation).
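
(A very rough sketch of the sort of thing I mean, treating characters as
plain integer code points; every name here is invented, and the
`frequency table' is a one-line placeholder:)

  ;; Purely illustrative.
  (defvar nix-recently-seen (make-hash-table :test 'eq)
    "Characters seen since the last full detection pass.")

  (defun nix-frequent-char-p (char coding-system)
    "Crude stand-in for a real per-coding-system frequency table."
    (< char 128))                    ; plain ASCII never triggers anything

  (defun nix-maybe-redetect (char coding-system)
    "Return non-nil if CHAR warrants a full coding-system re-detection.
  Only a character we haven't seen recently and which isn't frequent in
  CODING-SYSTEM triggers one, so catting a binary costs a few checks at
  most, while a sudden Unicode quote triggers exactly one."
    (unless (gethash char nix-recently-seen)
      (puthash char t nix-recently-seen)
      (not (nix-frequent-char-p char coding-system))))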

But this is probably totally impossible (as well as wildly impractical)
due to details of some obscure coding system I've never heard of :(

> Do you really want to reconvert the buffer every time the process
> spits out more output?  (Maybe you do, but somebody's gonna have to
> code it and put it in front of users to see what they think.)

No way. Reconversions should be as rare as possible, and no rarer.
(As I said, I have no idea how to actually make this happen :( )

>  > LANG= exec real-gcc "$@"
> 
> And what if "real-gcc" is a lie, and in fact it's a script
> 
> LANG=Etruscan exec really-real-gcc "$@"

!

> ?  People (at least in the Japanese subset) actually do stuff like
> that.

Well, that's strange, but if they've asked for that, presumably they
expect the results (which in this case wouldn't include Unicode
quote marks, but I'll pretend you said Etruscan.UTF-8 ;) )

But I'll agree that perhaps a flag is wanted which arranges to emit
nothing outside the 7-bit subset unless absolutely necessary. (Of
course that's too late for GCC 4.1 and 4.2 now.)

> Environment is for setting user defaults, but the user rarely
> interfaces directly with GCC these days; they interact with some kind
> of filter (aka IDE).

In that case it's the IDE's job to reset LANG to a non-UTF-8 value
if it's not willing to cope with UTF-8 output!
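
(If the IDE happens to be XEmacs, that's nearly a one-liner. Purely
illustrative; `make -k' is just a stand-in for whatever it actually
runs:)

  ;; Prepend LANG=C to the environment of anything run on the user's
  ;; behalf, so GCC sticks to 7-bit output whatever the user's LANG is.
  (let ((process-environment (cons "LANG=C" process-environment)))
    (compile "make -k"))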

>                       "System-global" is incorrect as you point out,
> but LANG *is* process-global, which is inappropriate in multilingual
> applications.  (Suppose iconv obeyed LANG?  Then in your environment
> it would only be useful for converting UTF-8 to UTF-8! ;-)

iconv is a special case because its entire raison d'etre is encoding
conversion: of course it has to be capable of dealing with multiple
encodings simultaneously. GCC, so far, doesn't, so it uses LANG (like,
oh, just about every other noninteractive program out there other than
things that are part of the i18n infrastructure like iconv).

>  > Hm. I find myself wondering if perhaps the XEmacs shell modes shouldn't
>  > arrange to reset LANG appropriately for the process-coding-system: of
>  > course that doesn't help much if you change it after the shell is
>  > started, but if you do *that* presumably you knew you were doing it and
>  > could feed a LANG= to the shell.
> 
> Nope; it's a chicken-and-egg problem.  Sometimes you start with the
> chicken and sometimes with the egg.  There is no general way to do
> this; it's AI stuff.

Well, it could have a conversion table that says `if the
process-coding-system is FOO, set LANG to BAR'. (However, this is
complicated by the divergent locale names in many Unixes, argh.)
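
(Something like this, say; the values are purely illustrative, and the
table would need an entry per platform's idea of locale names, which is
exactly the annoying part:)

  (defvar nix-coding-system-lang-alist
    '((utf-8      . "en_US.UTF-8")
      (euc-jp     . "ja_JP.eucJP")
      (shift_jis  . "ja_JP.SJIS")
      (koi8-r     . "ru_RU.KOI8-R")
      (iso-8859-1 . "en_US.ISO8859-1"))
    "Map a process coding system onto a plausible LANG value.")

  (defun nix-lang-for-coding-system (coding-system)
    (cdr (assq coding-system nix-coding-system-lang-alist)))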

>                       You need a heavy-duty multilingual natural
> language processing system to do what even the famed 7-bit
> null-lingual American programmer can do "by eye".

In the general case, you're right. In a lot of useful special cases it
may be possible anyway.

>  > (The Unix LANG/LC_* model isn't really designed for situations where
>  > you're constantly changing applicable encoding, is it?)
> 
> You're darn tootin' it isn't.  That's basically the issue that killed
> XIM.  setlocale() can take a noticeable amount of time in situations
> where you're switching back and forth between input modes all the
> time.  And don't even mention "multi-threading"!

I could make you scream by mentioning the vile localeconv(). (But I
won't.)

> It's a serious layering violation for GCC itself to be doing those
> translations.  gcc (the gcc controller app itself) should assume that
> Java code is native-endian UTF-16 (that's the standard, right?).

Yeah.

> Users should invoke gcc via a wrapper that handles the translations
> for them.

That means that users using KOI8-R would need a *really* smart wrapper,
one that knows when to switch between that and UTF-16, and so on and so
forth... it's easier for the common case if GCC just uses iconv() to
convert things itself.

(This example was not plucked out of the air.)

>  > I still don't understand why. Is it that LANG might not match the
>  > encoding you're using *right now*?
> 
> It's that in a multilingual environment the odds are very good that
> LANG inherited from a process's initialization doesn't match the
> encoding I'm using right now, yes.

Yeah, that's a bit of a swine. I fear the only approach that might work
there would be to have a wrapper around GCC that used gnudoit to query
XEmacs for the current buffer's coding system (or for the corresponding
LANG if you'd rather do the translation in Lisp than in the shell, and
who wouldn't), and then set LANG accordingly.
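
(The Lisp end of that wrapper would only need to be something on the
order of the following, reusing the dodgy table above; the wrapper then
sets LANG from the output of gnudoit '(nix-buffer-lang)', modulo
stripping whatever quoting gnudoit wraps around the result. All names
invented, as before:)

  (defun nix-buffer-lang ()
    "Return a LANG guessed from the current buffer's coding system.
  Falls back to \"C\", and ignores eol variants and other details a
  real version would have to care about."
    (or (nix-lang-for-coding-system buffer-file-coding-system)
        "C"))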

>  > If so, then, well, this only applies to people who are changing
>  > encodings all the time in shell buffers in which they're also running
>  > compilations. Is this really common? (If they're changing encodings so
>  > often, surely they can change encoding back?)
> 
> In Japan it is; UTF-8-encoded Japanese text is very much a minority
> taste even today.

(Most Japanese correspondents I talk to in my financial-info-thumping
day job seem to use SJIS.)

> Of course you can change encodings back.  The issue is, why should I
> have to pay that tax for a *compiler's error output*??  The `'
> convention is perfectly usable though ugly, and can easily be
> beautified (eg, with a nearly harmless comint filter in Emacsen).

Well, unset LANG then, and GCC will default to 7-bit ASCII; that's even
easier :)

>  > No! Stick with a LANG set to UTF-8 and everything should work. I can't
>  > understand why it isn't for you.
> 
> Because I'm a resident of Japan, which has *5* major encodings
> (ISO-2022-JP, EUC-JP, Shift-JIS, UTF-8, and roman transliteration)

Wow. I knew it had a lot, but not that many. I guess I can see why the
original designers of MULE were Japanese: they had a *reason* to want
something so featureful...

(it's just a shame they didn't remain involved. Does anyone really
understand CCL any longer?)

>                                                      (damn that broken
> Windows anyway, it's such a pane).

I hear a lot of people think it's smashing.

> I *can* stick to LANG=ja_JP.UTF-8, precisely because XEmacs ignores
> the "UTF-8".  The important part of that to XEmacs is "ja_JP", because
> it tells XEmacs to prefer Japanese fonts and encodings where
> characters are shared with Chinese and/or Korean.

So in other words you're saying `use UTF-8' and then relying on every
program you run regularly ignoring it (or so it seems to me, otherwise
you wouldn't be complaining about GCC using UTF-8 in that situation)? 
That seems... brittle.

>                                                    Once I know it's
> Japanese, the statistical characteristics of the octet streams give
> "no defects in 10 years of heavy daily use" reliability in
> distinguishing the 4 "real" Japanese encodings from each other.  And

I knew MULE was good, but I didn't know it was that good. That's an
incredibly low error rate for any estimation function.

>  > In that case, said smart tool should have *no* trouble with a couple of
>  > Unicode quotes coming out of GCC (and, indeed, for me, it all works.
>  > But that doesn't say much because if it didn't work for me I'd have
>  > fixed it already.)
> 
> XEmacs has no trouble decoding that, and even if it did, you could fix
> it with a simple comint filter.  What bothers me is that a useful
> protocol was changed without warning, from something that is simple

It was prominently mentioned in the GCC 4.0 release notes,
<http://gcc.gnu.org/gcc-4.0/changes.html>, along with info on how to
disable it and a link to an article by Markus Kuhn on why using Unicode
quotes was a good idea, dammit.

I can't really see any way of advertising it more widely. We don't have
any rooftops to shout from.

> and robust and well-known even to legacy applications, to something
> that is less simple, demonstrably unreliable, and likely to cause bad
> interactions with smart applications that use code dating from before
> the protocol change.  Since the protocol was never formalized, GCC is
> certainly within its rights to say "the joke's on you for trying to do
> something useful with our past very regular behavior".  But I don't
> think that's very productive.

It was a major version bump. Things change at major version bumps. It's
certainly less disruptive than a C++ ABI bump, and there've been a good
few of those. (There've even been C ABI bumps on some architectures,
e.g.  mips-sgi-irix.)

-- 
`When we are born we have plenty of Hydrogen but as we age our
 Hydrogen pool becomes depleted.'


