[Q] Handle bytes in the range 0x80-0xC0 better when dealing with ISO-IR 196.

Aidan Kehoe
Thu Nov 23 09:13:38 EST 2006


 On the twenty-second day of November, Stephen J. Turnbull wrote: 

 >  > +		/* ASCII, or the lower control characters.
 >  > +                   
 >  > +                   Perhaps we should signal an error if the character is in
 >  > +                   the range 0x80-0xc0; this is illegal UTF-8. */
 >  > +                Dynarr_add (dst, (c & 0x7f));
 > 
 > Please do.  This is corrupting the data.
 > 
 > I don't have a clue how to recover from it, but the user should at
 > least be told.
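
For concreteness, here is a minimal sketch of the check that comment is
alluding to (illustrative only, not the XEmacs decoder; the function name is
invented). A byte in 0x80-0xBF can only be a continuation byte, so it is
illegal where a new UTF-8 sequence should start, and 0xC0-0xC1 can only
begin overlong encodings, so strictly the illegal lead-byte range runs
through 0xC1:

#include <stdio.h>

/* Return non-zero iff C may legally begin a UTF-8 sequence. */
static int
valid_utf8_lead_byte (unsigned char c)
{
  if (c < 0x80)
    return 1;                   /* ASCII */
  if (c <= 0xC1)
    return 0;                   /* stray continuation byte or overlong lead */
  return c <= 0xF4;             /* 0xF5-0xFF never occur in UTF-8 */
}

int
main (void)
{
  printf ("%d %d %d\n",
          valid_utf8_lead_byte (0x41),   /* 1: 'A' */
          valid_utf8_lead_byte (0x80),   /* 0: lone continuation byte */
          valid_utf8_lead_byte (0xC3));  /* 1: starts a two-byte sequence */
  return 0;
}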

I have a tentative plan to add a charset to XEmacs whose 256 characters
reflect corrupt Unicode data. The Unicode-oriented coding systems will
generate these characters when they encounter invalid input:

(decode-coding-string "\x80\x80" 'utf-8) 
=> "\200\200" ;; With funky redisplay properties once display tables and
	      ;; char tables are integrated. Which, whee, is more work. 

And the coding systems will ignore these characters when encoding: 

(encode-coding-string (decode-coding-string "\x80\x80" 'utf-8) 'utf-8)
=> ""

This will allow applications like David Kastrup's
reconstruct-utf-8-sequences-from-fragmentary-TeX-error-messages to be
possible, while not contradicting the relevant Unicode standards. With
Unicode as the internal encoding, there's no need to have a separate Mule
character set; we can stick their codes somewhere above the astral planes.
But we should maintain the same syntax code for them. Note also that, as far
as I can work out, these 256 codes will be sufficient for representing error
data for all the other Unicode-oriented representations as well as UTF-8.
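
Here is a hedged sketch of that representation (again illustrative;
ERROR_CHAR_BASE is a made-up constant, not an actual XEmacs definition).
Each corrupt byte maps to one of 256 codes just past U+10FFFF on decode, and
an encoder following the plan simply drops any character in that range:

#include <stdio.h>

#define ERROR_CHAR_BASE 0x110000UL  /* hypothetical: just above the astral planes */

/* Internal code standing for the corrupt byte BYTE. */
static unsigned long
error_char_for_byte (unsigned char byte)
{
  return ERROR_CHAR_BASE + byte;
}

/* Non-zero iff CODE is one of the 256 error characters. */
static int
error_char_p (unsigned long code)
{
  return code >= ERROR_CHAR_BASE && code < ERROR_CHAR_BASE + 0x100;
}

int
main (void)
{
  unsigned long c = error_char_for_byte (0x80);
  /* An encoder that skips error characters makes the round trip of
     "\x80\x80" come out as "", as in the example above. */
  printf ("%#lx %d\n", c, error_char_p (c));
  return 0;
}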

-- 
Santa Maradona, pray for me!


