[Q] Handle bytes in the range 0x80-0xC0 better when dealing with ISO-IR 196.

Aidan Kehoe kehoea
Thu Nov 23 11:40:35 EST 2006


 On the twenty-fourth of November, Stephen J. Turnbull wrote: 

 >  > I have a tentative plan to add a charset to XEmacs, 256 characters of
 >  > which reflect corrupt Unicode data. These 256 characters will be
 >  > generated by Unicode-oriented coding systems when they encounter
 >  > invalid data:
 > 
 > I wish you wouldn't.  Let's just get Unicode inside, and figure out
 > how to signal errors in a useful way from inside a coding stream.

I admit that Unicode inside would be better. However, we can now preserve
arbitrary Unicode with the current architecture without trouble--the lack of
this feature was the primary reason I wanted to move when I proposed it--and
building and testing on X11-server-side-mule, X11-server-side-nomule,
XFT-mule, XFT-nomule, GTK-mule, GTK-nomule, Win32-mule and Win32-nomule
before committing any change is already a _lot_ of work. Given both, I
personally have no intention of committing a change that adds Unicode
internally as a compile-time flag.

I intend to get what I have working locally, and up to date--it isn't
now--and I will post that to -patches. But the trade-off of so much more
testing versus the limited advantages to it (better behaviour when
searching, faster redisplay on XFT and Windows, and OK, a more
understandable architecture) doesn't seem sensible to me. If the no-Mule
option were removed, or we removed support for GTK, then that would be
different.

 >  > (decode-coding-string "\x80\x80" 'utf-8) 
 >  > => "\200\200" ;; With funky redisplay properties once display tables and
 >  > 	      ;; char tables are integrated. Which, whee, is more work. 
 >  > 
 >  > And will be ignored by them when writing: 
 >  > 
 >  > (encode-coding-string (decode-coding-string "\x80\x80" 'utf-8) 'utf-8)
 >  > => ""
 > 
 > Yuck.  You realize that you can't do that with the autosave code,
 > right? 

Which is well and good; it's desired that the autosave files reflect XEmacs'
state, not necessarily what will be written. Similarly ? will be preserved
in auto-saves and trashed in iso-8859-2 buffers. 
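
To make the distinction between the two write paths concrete, here is a
minimal sketch. It is written in Python rather than XEmacs Lisp, and it is a
model of the proposal, not XEmacs code: the function names are hypothetical,
and Python's "surrogateescape" error handler stands in for the proposed
charset (it maps each invalid byte 0xNN in the range 0x80-0xFF to U+DCNN, so
it covers 128 byte values rather than 256).

```python
# Hypothetical model of the proposal; none of these names exist in XEmacs.

def decode_keeping_errors(data: bytes) -> str:
    """Decode UTF-8, turning each invalid byte into a placeholder character."""
    return data.decode("utf-8", errors="surrogateescape")

def is_error_char(ch: str) -> bool:
    """True for the placeholder characters produced by surrogateescape."""
    return 0xDC80 <= ord(ch) <= 0xDCFF

def encode_for_file(text: str) -> bytes:
    """Model of the coding system on write: placeholders are ignored."""
    kept = "".join(ch for ch in text if not is_error_char(ch))
    return kept.encode("utf-8")

def encode_for_autosave(text: str) -> bytes:
    """Model of an auto-save: internal state round-trips, placeholders included."""
    return text.encode("utf-8", errors="surrogateescape")

buffer_text = decode_keeping_errors(b"a\x80b")
print(encode_for_file(buffer_text))      # b'ab'
print(encode_for_autosave(buffer_text))  # b'a\x80b'
```

The point of the sketch is only that the two paths are allowed to differ:
the auto-save path reproduces what is in the buffer, while the coding system
decides what is valid to write.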

 > And you don't want to do that if the buffer is unmodified, right?

Which is well and good; what the coding systems do will be irrelevant if the
buffer is unmodified, because they're not invoked. 

 > Sounds like a hell of a lot of work to get right, and it will still be
 > fragile.

Much less work than allowing coding systems to throw errors, as I understand
it.

 >  > This will allow applications like David Kastrup's reconstruct-utf-8-
 >  > sequences-from-fragmentary-TeX-error-messages to be possible, while
 >  > not contradicting the relevant Unicode standards. With Unicode as the
 >  > internal encoding, there's no need to have a separate Mule character
 >  > set; we can stick their codes somewhere above the astral planes. But
 >  > we should maintain the same syntax code for them. Note also that, as
 >  > far as I can work out, these 256 codes will be sufficient for
 >  > representing error data for all the other Unicode-oriented
 >  > representations as well as UTF-8.
 > 
 > Sounds dangerous and messy to me.

Well, David's problem is an actual problem. And the many ways in which we
fail http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt are something
that annoys the purist in me immensely--implementing such a thing would
address both.
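
As a rough illustration of the reconstruction use case, here is a sketch in
Python, not XEmacs Lisp. The function names are hypothetical, and Python's
"surrogateescape" handler is used as an analogue of the proposed error
characters; I am assuming the fragments arrive as separately decoded pieces
of one valid UTF-8 sequence.

```python
# Hypothetical sketch; Python's surrogateescape stands in for the proposal.

def decode_fragment(data: bytes) -> str:
    """Decode one fragment, keeping undecodable bytes as placeholders."""
    return data.decode("utf-8", errors="surrogateescape")

def reconstruct(fragments: list[str]) -> str:
    """Join fragments, recover the original bytes, and decode them again."""
    joined = "".join(fragments)
    raw = joined.encode("utf-8", errors="surrogateescape")
    return raw.decode("utf-8", errors="surrogateescape")

full = "é".encode("utf-8")           # two bytes: 0xC3 0xA9
first = decode_fragment(full[:1])    # lone lead byte, kept as a placeholder
second = decode_fragment(full[1:])   # lone continuation byte, likewise
print(reconstruct([first, second]) == "é")  # True
```

Because the placeholders carry the original byte values, nothing is lost when
a sequence is split across fragments, and the repair can happen after
decoding.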

-- 
Santa Maradona, priez pour moi!


