[Q] Handle bytes in the range 0x80-0xC0 better when dealing with ISO-IR 196.

Thu Nov 23 13:50:48 EST 2006

"Stephen J. Turnbull" <stephen at xemacs.org> writes:

> Aidan Kehoe writes:
>
>  > Well, David's problem is an actual problem. 
>
> David's problem is also reasonably easy to work around.  Read TeX's
> output as binary (which is what it is, of course),

If only it were...

> walk up the buffer or string until you hit a legal UTF-8 first byte,
> then decode from there on.  It's a little bit harder than that, but
> not much.

I am afraid that it is not as easy as that.  TeX converts some (but
quite often only a subset) of the utf-8 bytes into ^^ab hexadecimal
notation, then it cuts the resulting string short a fixed number of
characters to the front and back (the cut may land in the middle of a
^^ab sequence) and uses that as a context to be identified in the
file.
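
To make the damage concrete, here is a toy model in Python.  It is
only an illustration of the effect described above, not TeX's actual
logic: the function names, the "passthrough" set standing in for the
locale-dependent printable bytes, and the cut width are all made up
for this sketch.

```python
def tex_transliterate(data: bytes, passthrough=frozenset()) -> str:
    """Toy model: show bytes >= 0x80 as ^^xx unless they are in `passthrough`.

    `passthrough` is a hypothetical stand-in for whatever bytes the
    locale lets through unchanged.
    """
    out = []
    for b in data:
        if b >= 0x80 and b not in passthrough:
            out.append("^^%02x" % b)
        else:
            # Raw byte kept as is; modelled here as the Latin-1 character.
            out.append(chr(b))
    return "".join(out)


def cut_context(line: str, pos: int, width: int) -> str:
    """Keep `width` characters on each side of `pos`, ignoring ^^xx boundaries."""
    return line[max(0, pos - width):pos + width]


source = "naïve café".encode("utf-8")
log_line = tex_transliterate(source)       # every high byte transliterated
context = cut_context(log_line, 9, 6)      # cut lands inside "^^c3"
mixed = tex_transliterate(source, frozenset({0xC3}))  # 0xC3 let through raw
```

With these inputs, `context` starts with the tail end of a ^^c3
sequence, and `mixed` contains a raw 0xC3 byte followed by a
transliterated continuation byte, which is the hodgepodge one has to
deal with.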

So what we actually do is cut away everything that may be part of
an incomplete ^^ab sequence at the start and end of the contexts, then
encode the result into a raw byte string according to the buffer
encoding, then redecode the result into the buffer encoding, chopping
away illegal utf-8 characters at the start and end of the context
afterwards.

And those strings can then be used for lookup in the buffer.
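
The steps above can be sketched roughly as follows.  This is a Python
approximation under stated assumptions, not the Emacs Lisp that is
actually used: it assumes the log was read as Latin-1 (so every
character stands for one raw byte), that the buffer encoding is utf-8,
and that only lowercase two-digit ^^ab escapes occur.  All names are
hypothetical.

```python
import re

HEX_ESCAPE = re.compile(r"\^\^([0-9a-f]{2})")


def trim_partial_escapes(ctx: str) -> str:
    """Drop anything at either end that may be a fragment of a ^^ab escape."""
    if not HEX_ESCAPE.match(ctx):
        # A leading "^ab", "ab" or "b" may be the tail of a cut escape.
        ctx = re.sub(r"^\^?[0-9a-f]{0,2}", "", ctx, count=1)
    # A trailing "^", "^^" or "^^a" may be the head of a cut escape.
    return re.sub(r"\^{1,2}[0-9a-f]?$", "", ctx)


def unescape_to_bytes(ctx: str) -> bytes:
    """Turn ^^ab escapes into bytes; other characters are raw bytes already."""
    out = bytearray()
    pos = 0
    for m in HEX_ESCAPE.finditer(ctx):
        out += ctx[pos:m.start()].encode("latin-1")
        out.append(int(m.group(1), 16))
        pos = m.end()
    out += ctx[pos:].encode("latin-1")
    return bytes(out)


def chop_partial_utf8(raw: bytes) -> str:
    """Decode as utf-8, discarding broken sequences at the start and end."""
    start = 0
    while start < len(raw) and 0x80 <= raw[start] <= 0xBF:
        start += 1  # leading continuation bytes belong to a cut character
    raw = raw[start:]
    for drop in range(4):  # an incomplete trailing sequence is at most 3 bytes
        try:
            return raw[:max(0, len(raw) - drop)].decode("utf-8")
        except UnicodeDecodeError:
            continue
    # Damage in the middle: give up on exactness (assumption of this sketch).
    return raw.decode("utf-8", errors="replace")


def recover_context(ctx: str) -> str:
    return chop_partial_utf8(unescape_to_bytes(trim_partial_escapes(ctx)))
```

The trimming is deliberately conservative: it may throw away a few
ordinary characters that merely look like hex digits, which only makes
the context shorter.  The recovered string can then be searched for in
the buffer text, e.g. with `buffer_text.find(recover_context(ctx))`.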

The problem is not that TeX's output is binary.  The problem is that
it is a hodgepodge of binary bytes and transliterated binary bytes
(with the transliteration not respecting character boundaries) chopped
off somewhere without regard to character boundaries of either
originating utf-8 characters or even the transliterations of single
bytes of them.

I'll agree immediately that some people ought to collectively have
their heads examined for the combined effects of
a) coming up with the idea of transliterating "unprintable" characters
in the terminal output,
b) finding nothing wrong with chopping such a transliteration in half,
c) letting some bytes through unmolested depending on locale,
d) making LaTeX deal with characters not in the current locale,
e) making LaTeX deal with multibyte encodings without TeX itself
knowing about or supporting anything but 8-bit characters.

The result is a pain to work with.  But there are no alternatives, and
actually nothing better one can do with the current code base.

And the point is that Emacs 22, and even Emacs 21, _manage_ to deal
with this, even when (in a Latin-1 locale, or with a LaTeX that
believes itself to be in one) utf-8 sequences get only partly
transliterated by TeX and thus fail to be legal utf-8.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum