Moving to Unicode internally.

Stephen J. Turnbull stephen at xemacs.org
Tue Sep 14 01:38:28 EDT 2004


>>>>> "Aidan" == Aidan Kehoe <kehoea at parhasard.net> writes:

    >> [...] Right, but that's because there is no formal notion of
    >> character composition in Mule.

    Aidan> That wasn't ever finished, then?

No, I mean there is no formal notion of character composition in Mule.
A composite character is basically an index into a table of programs
that create the glyphs.  It's also true that we never implemented the
redisplay parts, but that's a different issue.
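
To make that concrete, here is roughly what I mean, as a C sketch
(hypothetical names, not the actual Mule structures):

    /* Sketch only: a composite character is just an index into a table
       whose entries say how to build the glyph at redisplay time.
       Nothing about the character code itself is "composed". */
    struct composition_rule
    {
      int n_components;       /* how many base characters to combine */
      int *components;        /* their character codes */
      /* ... plus whatever layout "program" redisplay would need ... */
    };

    static struct composition_rule *composition_table;

    /* A composite character C just means composition_table[C - BASE];
       there is no formal relation between C and its components. */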

    >> So?  If they live in plane 13 or something, we just don't allow
    >> the user to specify a 16-bit internal representation.

    Aidan> [*Shudder* who mentioned having the user specify the
    Aidan> internal representation? That's something that's far too
    Aidan> low level for everyday use. ]

Sure, but it will take a while to get the logic for autoswitching
right, and there is simply no way for a program to get the tradeoff
between widechar and multibyte representation right for all users.
Maybe we call that "optimize for speed" vs "optimize for space", but
some users are going to need those extra cycles/buffer positions.

    Aidan> If that's a way of nudging me to go and do all the
    Aidan> multiple-character-formats-in-one-build work, I'll do
    Aidan> UTF-16 first, if that's okay.

If you're going to do the work, it's up to you.  The only thing I
would say is: make sure that text containing no surrogates never
breaks merely because code to handle surrogates is present.  It may
also be a good idea to be able to fall back at runtime to an "each
surrogate is just a single unknown character" mode, in case a bug
throws the character counts off or something.
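
To be concrete about the fallback I mean, here is a minimal sketch in
plain C (not XEmacs code; the names and the replacement-character
choice are mine):

    #include <stdint.h>

    #define REPLACEMENT_CHAR 0xFFFD

    /* Decode one character from a UTF-16 code-unit buffer.
       Returns the number of code units consumed (1 or 2).
       If surrogate handling is disabled, or the pair is malformed,
       fall back to one unknown character per code unit so that
       character counts stay consistent. */
    static int
    utf16_decode (const uint16_t *p, int len, int handle_surrogates,
                  uint32_t *out)
    {
      uint16_t u = p[0];
      if (u < 0xD800 || u > 0xDFFF)       /* the common, non-surrogate case */
        {
          *out = u;
          return 1;
        }
      if (handle_surrogates && u <= 0xDBFF && len >= 2
          && p[1] >= 0xDC00 && p[1] <= 0xDFFF)
        {
          *out = 0x10000 + (((uint32_t) (u - 0xD800) << 10) | (p[1] - 0xDC00));
          return 2;
        }
      *out = REPLACEMENT_CHAR;            /* fallback: one unknown char */
      return 1;
    }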

But the easiest thing to do is UTF-8, because it's exactly the same as
Mule once you tweak a few limit parameters on the generic character-
counting code.  Everything's already in place for that.  Next easiest
would be a fixed-width representation: either 16-bit BMP-only, or
32-bit UCS-4/UTF-32.
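
By "tweak a few limit parameters" I mean something like this (a
sketch, not the actual Mule code; the point is that only the
lead-byte test and the maximum sequence length matter to a loop of
this shape):

    /* Sketch only: count characters in a UTF-8 buffer.  The same loop
       shape works for any Mule-style multibyte encoding; what changes
       is the lead-byte test and the maximum sequence length (here 4). */
    static long
    count_chars_utf8 (const unsigned char *p, long nbytes)
    {
      long nchars = 0;
      for (long i = 0; i < nbytes; i++)
        /* Continuation bytes are 10xxxxxx; everything else starts a char. */
        if ((p[i] & 0xC0) != 0x80)
          nchars++;
      return nchars;
    }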

There are probably subtle dependencies on char (vs short for UTF-16)
in the variable-width handling code.  So getting the extra 16 planes
of UTF-16 right is likely to be hard.
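
For example (illustrative only, not real XEmacs code), that code is
written in terms of byte pointers, and every place like the following
would need auditing before the unit type could grow to 16 bits:

    /* Sketch: UTF-8-style "advance to the next character".  Both the
       continuation-byte test and the pointer arithmetic silently assume
       the code unit is an 8-bit char; with 16-bit units for UTF-16,
       both assumptions change. */
    typedef unsigned char text_unit;    /* would have to become uint16_t */

    static const text_unit *
    next_char (const text_unit *p)
    {
      p++;                              /* skip the lead unit */
      while ((*p & 0xC0) == 0x80)       /* skip continuation bytes */
        p++;
      return p;
    }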

    Aidan> I heard complaints that Mule wasn't Fast Enough ...

Yeah, but it's really hard to pin that down to Mule, unless
buffer-motion error-checking is enabled.  It's certainly true that
some things where XEmacs is godawful slow are really painful in Mule
(like font-lock).  But they're not really acceptably efficient in
no-mule builds, either.

    Aidan> Umm. What does [denormalized UTF-8] give us?

It gives us a choice of fixed widths from 1 to 4 bytes, covering the
entire range of Unicode characters, with no changes to the decoding
code _or the search and regexp code_, and only trivial changes to
encoding to pad Ibytes out to the current width.

Relatively slow for Ibyte <-> Ichar conversions, but we never do more
than about 4k of those at a time (i.e., in redisplay).
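
Concretely, by "denormalized" I mean using overlong UTF-8 forms to pad
every character out to one fixed width.  A sketch (mine, not existing
XEmacs code):

    #include <stdint.h>

    /* Sketch: encode character C into exactly WIDTH bytes (1-4) using
       the UTF-8 bit layout, padding with overlong forms as needed.
       Returns WIDTH on success, 0 if C does not fit in WIDTH bytes. */
    static int
    encode_denormalized_utf8 (uint32_t c, int width, unsigned char *out)
    {
      static const unsigned char lead[5] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
      static const int max_bits[5] = { 0, 7, 11, 16, 21 };

      if (width < 1 || width > 4 || c >= ((uint32_t) 1 << max_bits[width]))
        return 0;
      for (int i = width - 1; i > 0; i--)
        {
          out[i] = 0x80 | (c & 0x3F);   /* continuation bytes: low 6 bits */
          c >>= 6;
        }
      out[0] = lead[width] | c;         /* lead byte carries the rest */
      return width;
    }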

    >> That's not what I mean by "auto-loading".  That requires you to
    >> identify the windows-1252 coding system first.  The problem is
    >> knowing whether you need the coding system or not, and that is
    >> going to require having the tables loaded.

    Aidan> No it doesn't. Ben Wing's Unicode detectors all work off
    Aidan> magic values, as does the shift_jis detector, the big5
    Aidan> detector, and the iso2022 detector.

Look again; you're confused by name overloading.  Those are not coding
system detectors, those are coding category detectors.  We need to do
better than that.  For example, suppose the coding category approach
says "genuine Unicode UTF-16" (we've seen a big-endian 16-bit BOM and
a valid surrogate pair).  Now what translation table(s) do we need to
load for the Latin characters in order to use ordinary X iso8859-*
fonts?  Without character distribution information, we don't know.

Note that `load-unicode-tables' just loads everything implemented.  Now
you know why.

    Aidan> Yes, absolutely. This has its downside too, though
    Aidan> --cf. http://www.joelonsoftware.com/articles/Unicode.html :

Joel is simply spreading FUD, at least in this context.  People who
don't use Unicode or Content-Type headers in their web pages are the
problem; the alternative to what IE and Mozilla do is to simply say
"this crap is undisplayable, let's move on!"  Of course we should do
better for our users than just say "sorry", and we can do better than
IE or GNU Emacs (as of the last time I checked) do.  See next point.

    Aidan> We want to take as much language environment information as
    Aidan> is available to us when detecting these things, I think.

Of course.  But the first thing we need to do is to take back the
power to decide encodings for the user.  Mule should rank encodings,
convert and redisplay according to the one it likes best, and if the
top two are "too close" in likelihood, display a warning (remember, if
the top two are Latin-1 and Latin-5 and we're down to character
distribution for guessing, there's a good chance the user will _not_
see corruption on the first screenful).
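
The shape of that decision is simple enough; something like this
sketch (hypothetical names and threshold, nothing to do with the
actual coding-system API):

    #include <stdio.h>

    /* Hypothetical sketch: pick the best-scoring candidate encoding,
       and warn if the runner-up is too close to call. */
    struct candidate { const char *name; double likelihood; };

    static const char *
    choose_encoding (const struct candidate *c, int n, double too_close)
    {
      int best = 0, second = -1;
      for (int i = 1; i < n; i++)
        {
          if (c[i].likelihood > c[best].likelihood)
            { second = best; best = i; }
          else if (second < 0 || c[i].likelihood > c[second].likelihood)
            second = i;
        }
      if (second >= 0
          && c[best].likelihood - c[second].likelihood < too_close)
        fprintf (stderr, "Warning: %s barely beats %s; check the result.\n",
                 c[best].name, c[second].name);
      return c[best].name;    /* convert and redisplay with this one */
    }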

As in latin-unity, we should give the user a short ordered list of
alternatives if he wants to munge the encoding, too.  None of this
"500 scripts you never heard of in alphabetical order" nonsense.

    Aidan> Wonder how much Serbian and Bulgarian's character
    Aidan> distribution varies from that of Russian?

The Mozilla people tell me they do a pretty good job on all Western
European languages with about 4k characters to work with, and there
are commercial systems that do much better.

    Aidan> Is there much need to autodetect X Compound Text, though? 

Unfortunately, there are X applications that send ctext no matter what
you ask them for.

    Aidan> I still don't anticipate ever having to use the word "carp"
    Aidan> IRL :-)

Ah, my man!  Not a Perl programmer, I see.  (If you are, I'm not sure
I want someone who doesn't "use carp" to be working on XEmacs!  ;-)


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.



