[Q] Handle bytes in the range 0x80-0xC0 better when dealing
with ISO-IR 196.
David Kastrup
dak
Thu Nov 23 19:46:28 EST 2006
Aidan Kehoe <kehoea at parhasard.net> writes:
> On the twenty-third day of November, David Kastrup wrote:
>
> > > I have a tentative plan to add a charset to XEmacs, 256
> > > characters of which reflect corrupt Unicode data. These 256
> > > characters will be generated by Unicode-oriented coding systems
> > > when they encounter invalid data:
> > >
> > > (decode-coding-string "\x80\x80" 'utf-8)
> > > => "\200\200" ;; With funky redisplay properties once display tables
> > > ;; and char tables are integrated. Which, whee, is more
> > > ;; work.
> >
> > Here is what Emacs 22 returns:
> >
> > #("\xc2\x80\xc2\x80" 0 2 (display #("\\200" 0 4 (face escape-glyph)) help-echo utf-8-help-echo untranslated-utf-8 128) 2 4 (display #("\\200" 0 4 (face escape-glyph)) help-echo utf-8-help-echo untranslated-utf-8 128))
> >
> > > And will be ignored by them when writing:
> > >
> > > (encode-coding-string (decode-coding-string "\x80\x80" 'utf-8) 'utf-8)
> > > => ""
> >
> > Here is what Emacs 22 returns:
> >
> > "\200\200"
>
> Quite an old GNU Emacs 23.0.0 gives me this:
>
> (encode-coding-string "\x80\x80" 'utf-8)
> => "\200\200"
>
> (decode-coding-string "\x80\x80" 'utf-8)
> => "\200\200"
>
> Savannah's being unco-operative about allowing me to cvs update, otherwise
> it would be worth reporting the former as a bug.
I am not sure. It is round-trip, but of course it leaves bad bytes in
the buffer. And some of them might combine badly with others.
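
To make the round-trip point concrete, here is a minimal sketch using Python's "surrogateescape" error handler. It is only an analogy: it is not XEmacs or GNU Emacs code, and the variable names are made up for illustration. It shows invalid bytes surviving a decode/encode cycle, and one way two stray bytes might combine into a valid character once they end up next to each other.

```python
# Analogy only: Python's "surrogateescape" handler, not Emacs/XEmacs behaviour.
raw = b"\x80\x80"  # the invalid UTF-8 input used in the thread

# Each undecodable byte 0xNN (0x80-0xFF) becomes the lone surrogate U+DCNN.
decoded = raw.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in decoded])

# Encoding with the same handler restores the original bytes exactly.
restored = decoded.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# A strict encoder refuses the placeholders instead of writing bad bytes.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print("strict encode accepted placeholders:", strict_ok)

# Two fragments that are each invalid on their own...
left = b"\xc3".decode("utf-8", errors="surrogateescape")
right = b"\xa9".decode("utf-8", errors="surrogateescape")

# ...form a valid two-byte sequence once written out side by side.
joined = (left + right).encode("utf-8", errors="surrogateescape")
print(joined.decode("utf-8"))
```

The last step is the double-edged part: reassembling a split sequence is what a fragment-reconstruction tool wants, but the same mechanism can turn unrelated garbage bytes into a character nobody intended.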
> Were it my implementation, I would regard the latter as a bug too.
Yes, could be worth working over. I suppose that "Emacs 23" will only
start getting a really solid beating once it is moved to HEAD.
> > > This will allow applications like David Kastrup's reconstruct-utf-8-
> > > sequences-from-fragmentary-TeX-error-messages to be possible, while
> > > not contradicting the relevant Unicode standards. With Unicode as
> > > the internal encoding, there's no need to have a separate Mule
> > > character set; we can stick their codes somewhere above the astral
> > > planes. But we should maintain the same syntax code for them. Note
> > > also that, as far as I can work out, these 256 codes will be
> > > sufficient for representing error data for all the other
> > > Unicode-oriented representations as well as UTF-8.
> >
> > Not just for "unicode-oriented". The recipe should be workable for
> > the iso-latin-* stuff as a file encoding, too, I think.
>
> Hmm? _Are_ there invalid sequences for the ISO-8859-N file
> encodings?
If the file is in iso-8859-N, but the locale and/or the process
encoding (which might come from a master file in utf-8 that includes a
subfile in iso-*)...

The possibilities for complications are pretty much endless. Polish
people, for example, tend to use Latin-2 encodings in their files, but
Latin-1 locales. They just "know" which characters will be displayed
wrong and how, and in a Polish locale, more things go wrong than in an
English one.

You don't really want to know the number of idiocies I have to cater
for in connection with AUCTeX/TeX/LaTeX.
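
The Latin-2 versus Latin-1 mismatch can be sketched in a few lines of Python. This is an illustration under assumptions: the sample sentence and variable names are invented, and Python's codecs stand in for whatever the editor or terminal actually does. The point is that an ISO-8859-N decoder never sees an "invalid" byte, so the wrong choice fails silently, while a UTF-8 decoder rejects the same bytes outright.

```python
# Hypothetical sample text; any Polish string with diacritics would do.
text = "zażółć gęślą jaźń"

# Bytes as a Latin-2 (ISO-8859-2) file would store them.
latin2_bytes = text.encode("iso8859_2")

# Read back under a Latin-1 assumption: Python's Latin-1 codec maps every
# byte value to a character, so this never raises; it is just quietly wrong.
misread = latin2_bytes.decode("iso8859_1")
print(misread)

# Letters that share a code point in both sets (such as o-acute) survive,
# which is why readers can learn which characters come out wrong and how.
survivors = sorted({a for a, b in zip(text, misread) if a == b and ord(a) > 127})
print(survivors)

# Decoding with the right charset recovers the text.
print(latin2_bytes.decode("iso8859_2") == text)

# The same bytes are not valid UTF-8, so a strict UTF-8 decoder fails loudly.
try:
    latin2_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```

So the "invalid data" case for the ISO-8859-N encodings is less about the file itself and more about which encoding the surrounding process believes it is reading.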
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum