[Q] Handle bytes in the range 0x80-0xC0 better when dealing with ISO-IR 196.

Thu Nov 23 13:07:10 EST 2006

Aidan Kehoe <kehoea at parhasard.net> writes:

>  Ar an dara lá is fiche de mí na Samhain, scríobh Stephen J. Turnbull: 
>
>  >  > +		/* ASCII, or the lower control characters.
>  >  > +                   
>  >  > +                   Perhaps we should signal an error if the character is in
>  >  > +                   the range 0x80-0xc0; this is illegal UTF-8. */
>  >  > +                Dynarr_add (dst, (c & 0x7f));
>  > 
>  > Please do.  This is corrupting the data.
>  > 
>  > I don't have a clue how to recover from it, but the user should at
>  > least be told.
>
> I have a tentative plan to add a charset to XEmacs, 256 characters of which
> reflect corrupt Unicode data. These 256 characters will be generated by
> Unicode-oriented coding systems when they encounter invalid data:
>
> (decode-coding-string "\x80\x80" 'utf-8) 
> => "\200\200" ;; With funky redisplay properties once display tables and
> 	      ;; char tables are integrated. Which, whee, is more work. 

Here is what Emacs 22 returns:

#("\xc2\x80\xc2\x80" 0 2 (display #("\\200" 0 4 (face escape-glyph)) help-echo utf-8-help-echo untranslated-utf-8 128) 2 4 (display #("\\200" 0 4 (face escape-glyph)) help-echo utf-8-help-echo untranslated-utf-8 128))

> And will be ignored by them when writing: 
>
> (encode-coding-string (decode-coding-string "\x80\x80" 'utf-8) 'utf-8)
> => ""

Here is what Emacs 22 returns:

"\200\200"

Of course, the internal coding for Emacs 22 is emacs-mule, not utf-8
based, so this is not completely relevant.  But maybe it is
interesting, nevertheless.

> This will allow applications like David Kastrup's
> reconstruct-utf-8-sequences-from-fragmentary-TeX-error-messages to
> be possible, while not contradicting the relevant Unicode standards.
> With Unicode as the internal encoding, there's no need to have a
> separate Mule
> character set; we can stick their codes somewhere above the astral
> planes. But we should maintain the same syntax code for them. Note
> also that, as far as I can work out, these 256 codes will be
> sufficient for representing error data for all the other
> Unicode-oriented representations as well as UTF-8.

Not just for "unicode-oriented".  The recipe should be workable for
the iso-latin-* stuff as a file encoding, too, I think.
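
To make the recipe concrete, here is a minimal C sketch, not taken from
XEmacs or Emacs. The names decode_lossless, encode_lossless and
ERROR_BASE are hypothetical, and the choice of 0x110000 (the first value
past U+10FFFF) as the base of the 256 error characters is an assumption.
Every byte that cannot be part of well-formed UTF-8 becomes
ERROR_BASE + byte on decoding and is written back unchanged on encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed base of the 256 "corrupt byte" characters, just above U+10FFFF. */
#define ERROR_BASE 0x110000u

/* Length of the sequence introduced by lead byte B, or 0 if B can never
   start one (0x80-0xBF, 0xC0, 0xC1, 0xF5-0xFF). */
static int
seq_len (unsigned char b)
{
  if (b < 0x80) return 1;
  if (b >= 0xC2 && b <= 0xDF) return 2;
  if (b >= 0xE0 && b <= 0xEF) return 3;
  if (b >= 0xF0 && b <= 0xF4) return 4;
  return 0;
}

/* Decode N bytes from SRC into DST, which must have room for N entries.
   Returns the number of code points written. */
size_t
decode_lossless (const unsigned char *src, size_t n, uint32_t *dst)
{
  size_t i = 0, out = 0;
  assert (src != NULL && dst != NULL);
  while (i < n)
    {
      int len = seq_len (src[i]), k;
      int ok = len > 0 && i + (size_t) len <= n;
      uint32_t c = 0;
      if (ok)
        c = (len == 1) ? src[i] : (uint32_t) (src[i] & (0xFF >> (len + 1)));
      for (k = 1; ok && k < len; k++)
        {
          ok = (src[i + k] & 0xC0) == 0x80;   /* must be a continuation byte */
          c = (c << 6) | (src[i + k] & 0x3F);
        }
      /* Reject overlong forms, surrogates and values past U+10FFFF. */
      if (ok && ((len == 3 && c < 0x800) || (len == 4 && c < 0x10000)
                 || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF))
        ok = 0;
      if (ok)
        {
          dst[out++] = c;
          i += (size_t) len;
        }
      else
        {
          dst[out++] = ERROR_BASE + src[i];   /* keep the offending byte */
          i += 1;
        }
    }
  return out;
}

/* Encode N code points produced by decode_lossless into DST, which must
   have room for 4 * N bytes.  Error characters come out as their raw byte. */
size_t
encode_lossless (const uint32_t *src, size_t n, unsigned char *dst)
{
  size_t i, out = 0;
  assert (src != NULL && dst != NULL);
  for (i = 0; i < n; i++)
    {
      uint32_t c = src[i];
      if (c >= ERROR_BASE && c < ERROR_BASE + 0x100)
        dst[out++] = (unsigned char) (c - ERROR_BASE);
      else if (c < 0x80)
        dst[out++] = (unsigned char) c;
      else if (c < 0x800)
        {
          dst[out++] = (unsigned char) (0xC0 | (c >> 6));
          dst[out++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else if (c < 0x10000)
        {
          dst[out++] = (unsigned char) (0xE0 | (c >> 12));
          dst[out++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          dst[out++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else
        {
          dst[out++] = (unsigned char) (0xF0 | (c >> 18));
          dst[out++] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
          dst[out++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          dst[out++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
  return out;
}
```

Written this way the encoder reproduces the input bytes, which matches
the Emacs 22 result quoted above. Skipping the error characters in
encode_lossless instead of writing them would give the "" result from
the quoted proposal. For an iso-latin-* file encoding the same pair of
functions would apply, with the validity check replaced by whatever
that charset leaves unassigned.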

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum