Typos in GCC output

Sun Oct 22 20:32:03 EDT 2006

On Mon, 23 Oct 2006, stephen at xemacs.org stated:
> Nix writes:
> 
>  > Alas, the C standard says no :) of course, C code is primarily produced
>  > for machines to read, and they prefer consistency. GCC's standard error
>  > stream is parsed in sufficient detail for quotes to matter by perhaps
>  > one or two programs, and what they do isn't terribly complex...
> 
> What the Emacs Lisp does is not terribly complex, true---but totally
> irrelevant.  By that point, *it's too late*, the codecs (which live
> just on the XEmacs side of the pipe, and pretty much have to if you
> want any kind of efficiency) have already converted those bytes.

But, um, why aren't the codecs expecting UTF-8 when LANG is set to
a UTF-8 locale?

(I'm new to the wonderful multiple-encoding world of MULE, so perhaps
I'm missing something.)

> But it's a poor atom-blaster that won't point both ways.  You realize
> that what *GCC* is doing internally is not very complex, and could
> easily be delegated to a wrapper, so we wouldn't have to recommend
> that the user change a *system global parameter* (ie, LANG) to suit an
> application that rarely produces output directly for the user, but
> rather normally is filtered through one or more wrappers anyway?

We could do, only LANG isn't a system-global parameter. You can easily
wrap GCC in a tiny script that clears LANG, viz

LANG= exec real-gcc "$@"

or just arrange for compile.el or whatever to unset LANG before calling
GCC. (start-process doesn't seem to support running programs with
different environments yet, but that can't be too terribly hard to add,
and is of general utility.)
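
A slightly fuller sketch of that wrapper idea, for illustration only: the
helper name run_without_locale is made up, and which LC_* variables to
clear alongside LANG is an assumption on my part. It drops the locale
variables for one child process and leaves the calling shell alone.

```shell
#!/bin/sh
# Hypothetical helper: run a command with the locale variables removed
# from its environment. The function body is a subshell, so the unset
# never touches the calling shell.
run_without_locale() (
    unset LANG LC_ALL LC_CTYPE LC_MESSAGES
    exec "$@"
)

# Example: the child sees no LANG, while the caller's value is untouched.
LANG=en_US.UTF-8; export LANG   # assumed example locale name
run_without_locale sh -c 'echo "child LANG: ${LANG-<unset>}"'
echo "caller LANG: $LANG"
```

A real wrapper would presumably end with something like
run_without_locale real-gcc "$@", real-gcc being the placeholder name
used above.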

>  > I don't get it.
> 
> That's my point.  If you *did* get it, I would call it a difference of
> values, not "ill-advised". ;-)

OK, I am merely ignorant :)

>  > If a program is looking at GCC's standard error stream, why would
>  > it expect anything other than text (7-bit ASCII before this change,
>  > UTF-8 afterwards)?
> 
> If "a program" is XEmacs, it could be a shell buffer where I'd been
> looking at EUC-JP content in TeX error messages in my last make, or
> Shift JIS I'd grepped out of an email.

Hm. I find myself wondering if perhaps the XEmacs shell modes shouldn't
arrange to reset LANG appropriately for the process-coding-system: of
course that doesn't help much if you change it after the shell is
started, but if you do *that* presumably you knew you were doing it and
could feed a LANG= to the shell.

(The Unix LANG/LC_* model isn't really designed for situations where
you're constantly changing applicable encoding, is it?)

>  > There's no way we could avoid producing UTF-8 output on stderr in some
>  > circumstances, even if the quotes were kept at `': printed Java
>  > identifiers would have to be Unicode, for starters.
> 
> But compile.el is not going to try to parse the Java identifiers, just
> spit them back.  Ie, that's not Unicode *produced* by gcc, that's
> binary crap *rendered* from the data, just like the jTeX content.  If
> the data in the code is something other than UTF-8 (eg, ISO-8859-1), I
> don't see why GCC should give a fig about the value of LANG in the
> environment.

Because it has to determine if the identifiers are valid, tokenize them,
and so on. (i.e. there are a *lot* of possible encodings for, say, `.'
in Java, all of which act like a `.'. C and C++ need some degree of
Unicode support as a QoI matter, as well: who are we to say that people
can't put their own names in comments or string literals?)

>  > I'll admit that I can't figure out why you would set LANG=en_BLAH.UTF-8
>  > in your environment if your tools were *not* capable of handling UTF-8.
>  > It *still* seems like a reasonable assumption to me.
> 
> As stated, the assumption is not violated.  My tools *are* capable of
> handling UTF-8.  It is the inference that using UTF-8 is therefore
> reliable that is wrong.

I still don't understand why. Is it that LANG might not match the
encoding you're using *right now*?

If so, then, well, this only applies to people who are changing
encodings all the time in shell buffers in which they're also running
compilations. Is this really common? (If they're changing encodings so
often, surely they can change encoding back?)

(Why is XEmacs-21.5.27 choosing to garbage-collect between every word I
type? If this is how the incremental GC normally works it's damned
annoying. Yes, each GC round only takes about a second, but still, that's
an unresponsive second every two seconds...)

>  > If your tools can't handle UTF-8, don't set your locale to a UTF-8
>  > locale.
> 
> Uh, we are talking about the LANG variable.  It is *global*.  For

What? It's an environment variable: a per-process attribute.
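
To make the per-process point concrete, here is a small sketch (the
locale name en_US.UTF-8 is just an assumed example): a prefix assignment
changes LANG for a single child and nothing else.

```shell
#!/bin/sh
# The parent shell exports a UTF-8 locale (assumed example value).
LANG=en_US.UTF-8
export LANG

# A prefix assignment applies only to the environment of this one child.
child=$(LANG=C sh -c 'printf %s "$LANG"')

printf 'child saw: %s\n' "$child"
printf 'parent still has: %s\n' "$LANG"
```

The child reports C while the parent keeps its original value, which is
all that "per-process" means here.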

> *most* of what I do, it makes sense to set that variable to *.UTF-8,
> because *most* of my data (including a fair number of file names) *is*
> UTF-8, and (with the exception of XEmacs) *none* of my tools are smart
> enough to DTRT without LANG set appropriately.

Yeah. XEmacs beats the common herd yet again :)

> So what you're saying is that a user whose data is *mostly* UTF-8, and
> whose "dumb" tools all handle UTF-8 and only UTF-8 should change LANG
> to something else

No! Stick with a LANG set to UTF-8 and everything should work. I can't
understand why it doesn't for you.

>                   because he also uses a smart tool that not only can
> DTRT with UTF-8, but UTF-16, EUC-JP, GB2312, KSC5601, and KOI-8 as
> well as ASCII?

In that case, said smart tool should have *no* trouble with a couple of
Unicode quotes coming out of GCC (and, indeed, for me, it all works.
But that doesn't say much because if it didn't work for me I'd have
fixed it already.)

I think I'm still missing the point (no surprise there, I seem to
specialize in it).

One aside: if XEmacs goes into a tight loop, how do I debug it?
I entered a newsgroup a few minutes ago and XEmacs wandered off
computing madly, GCing occasionally, and never came back.

The backtrace was, ahem, unhelpful:

Attaching to program: /usr/bin/xemacs, process 28586
Failed to read a valid object file image from memory.
Reading symbols from /usr/lib/libaudio.so.2...done.
Loaded symbols for /usr/lib/libaudio.so.2
[...]
Loaded symbols for /usr/lib/libXfixes.so.3
0x08204cdb in do_marker_adjustment (mpos=14799561, from=15192266, to=15192601, amount=-392705) at /usr/packages/xemacs/21.5.27/src/insdel.c:192
192           if (mpos > from + amount && mpos <= from)
(gdb) bt
#0  0x08204cdb in do_marker_adjustment (mpos=14799561, from=15192266, to=15192601, amount=-392705) at /usr/packages/xemacs/21.5.27/src/insdel.c:192
#1  0x08204d21 in adjust_markers (buf=<value optimized out>, from=15192266, to=15192601, amount=-392705) at /usr/packages/xemacs/21.5.27/src/insdel.c:231
#2  0x08206bb9 in gap_right (buf=0xa4112cc, cpos=14781487, bpos=14799896) at /usr/packages/xemacs/21.5.27/src/insdel.c:389
#3  0x082088fd in buffer_delete_range (buf=0xa4112cc, from=<value optimized out>, to=14781488, flags=0) at /usr/packages/xemacs/21.5.27/src/insdel.c:1410
#4  0x080c66e7 in Fdelete_char (count=14799561, killp=147669592) at /usr/packages/xemacs/21.5.27/src/cmds.c:283
#5  0x0811a59e in Ffuncall (nargs=2, args=0xbfa791b4) at /usr/packages/xemacs/21.5.27/src/eval.c:3893
#6  0x080abdf2 in execute_optimized_program (program=0xb8a0f78 "eb\210????#?\a??!\210??\b??.db\210?y\210??!?\005??!\210eb\210??!?\n`?y\210`|\210??????#?\a??!\210??\207", stack_depth=4,
    constants_data=0xb746260) at /usr/packages/xemacs/21.5.27/src/bytecode.c:862
[...]
#78 0x080c45f8 in initial_command_loop (load_me=147669592) at /usr/packages/xemacs/21.5.27/src/cmdloop.c:313
#79 0x0810d9d8 in xemacs_21_5_b27_i686_pc_linux (argc=1, argv=0xbfa7bfc4, unused_envp=0x0, restart=0) at /usr/packages/xemacs/21.5.27/src/emacs.c:2667
#80 0x0810e7eb in main (argc=Cannot access memory at address 0xe1d2c9
) at /usr/packages/xemacs/21.5.27/src/emacs.c:3111
(gdb)
(gdb) info locals
No locals.

Thanks heaps, GDB. No valid object file image?!

How can I even tell what Lisp function it was getting stuck in?

(I'm considering just oprofiling it next time to get a clue what
functions the loop is passing through: would that be a worthwhile
approach?)

-- 
`When we are born we have plenty of Hydrogen but as we age our
 Hydrogen pool becomes depleted.'