Unicodification of sources, part 1

Mon Jun 19 01:56:43 EDT 2006

Eventually we should recode all our sources to Unicode UTF-8 as the
standard encoding.  I'm willing to present rationales for Unicode and
UTF-8, but I believe we already have working consensus that those are
long-term goals, so have omitted them here.

There is already a standard encoding for Lisp sources (ISO-2022-JP,
which is actually a general internationalized standard despite being
designed for Japanese), and we'll need to maintain backward
compatibility with 21.4 for a while.  I'm not sure what issues might
be involved for C sources, but in X11-related code there are external
standards for strings (X Compound Text) and font names (ISO 8859/1)
that need to be considered, and in MS Windows-related code there is a
bias toward UTF-16 rather than UTF-8 as I understand it.  Finally, ISO
8859/1 is the existing standard for ChangeLogs.  (I'm describing this
from memory, but I'll supply documentation if it's important; my point
here is to give a reasonably accurate description of where we are, not
to be authoritative.)

I propose the following process for 21.5 and the packages:

Stage 1:  Re-encode all plain text documentation in UTF-8.  That
includes READMEs, PROBLEMS, etc., and ChangeLogs.  Of these, as far as
I know only the ChangeLogs contain non-ASCII characters, frequently in
contributor names and occasionally in Mule-related logs.  So mostly
this would simply mean establishing a standard that non-ASCII *is*
permitted if encoded in UTF-8, and some guidelines to avoid gratuitous
backward incompatibility where ASCII would do.
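
As a rough illustration of what Stage 1 amounts to for a single
ChangeLog, here is a minimal sketch in Python.  It assumes the file is
either already UTF-8 (which includes pure ASCII) or ISO 8859/1; the
function name recode_changelog is hypothetical, not an existing tool.
The check is only a heuristic, since a few ISO 8859/1 byte sequences
also happen to be valid UTF-8, so results should be reviewed by eye.

```python
def recode_changelog(data: bytes) -> bytes:
    """Return `data` as UTF-8, assuming it is UTF-8 or ISO 8859/1.

    Hypothetical helper for illustration only.
    """
    try:
        # Pure ASCII and existing UTF-8 both pass this check and are
        # returned unchanged, so the function is safe to run twice.
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO 8859/1.
        return data.decode("iso-8859-1").encode("utf-8")
```

Because already-converted input is passed through untouched, running
this over a whole tree more than once should do no harm.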

Stage 2:  Re-encode the C sources in UTF-8.  I believe this currently
affects only comments and some strings in Mule code, mostly related to
input methods.  This would probably be easiest if done concurrently
with the introduction of Unicode as the internal coding for XEmacs.

Stage 3a:  Re-encode core Lisp in UTF-8, and remove the special
treatment of Mule files.  This would involve a small amount of code
allowing the Lisp reader to read UTF-8 whether Mule was configured or
not, and some decisions about how to handle non-ISO-8859-1 characters
in no-Mule.  (This might be moot if we decide to remove the no-Mule
option, but I think that's unlikely to happen soon.)
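
To make the no-Mule decision concrete, here is one possible policy
sketched in Python (not a proposal for the actual reader code): decode
the UTF-8 source and substitute "?" for any character outside ISO
8859/1.  The function name and the choice of "?" are assumptions made
purely for illustration; signaling an error instead would be an
equally plausible policy.

```python
def utf8_to_no_mule(data: bytes) -> bytes:
    """Map UTF-8 source bytes to ISO 8859/1, replacing the rest.

    Hypothetical helper; the '?' replacement is one policy among several.
    """
    text = data.decode("utf-8")
    # errors="replace" emits b"?" for each character that has no
    # ISO 8859/1 representation.
    return text.encode("iso-8859-1", errors="replace")
```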

Stage 3b:  Re-encode Info and other shared documentation (man pages)
in UTF-8.  This has the problem that TeX doesn't yet handle Unicode
input well, although there are solutions for European languages.

Stage 3c:  Provide a parallel set of Mule packages encoded in UTF-8 for
XEmacs 21.5 and above.  This would also require an infrastructure for
automatic transcoding.
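
The per-file core of such a transcoder might look like the following
Python sketch.  It assumes the source is ISO-2022-JP as handled by
Python's "iso2022_jp" codec (real Mule files may use ISO 2022
designations that codec does not cover), and that any coding cookie of
the form "coding: NAME" sits on the first line.  The function name
transcode_source is hypothetical.

```python
import re

# Matches the NAME part of a "coding: NAME" cookie.
CODING_COOKIE = re.compile(r"(coding:\s*)[\w.-]+")


def transcode_source(data: bytes, source_encoding: str = "iso2022_jp") -> bytes:
    """Decode `data` from `source_encoding` and re-encode it as UTF-8.

    Hypothetical helper: also rewrites a first-line coding cookie so
    the file does not claim its old encoding after conversion.
    """
    text = data.decode(source_encoding)
    first, sep, rest = text.partition("\n")
    first = CODING_COOKIE.sub(r"\g<1>utf-8", first, count=1)
    return (first + sep + rest).encode("utf-8")
```

A real infrastructure would also need to walk the package tree and
decide what to do with files that fail to decode, which this sketch
leaves out.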

Stage 4:  Unify the utf8-packages hierarchy (the parallel set from
Stage 3c) with the xemacs-packages hierarchy, and make the
mule-packages hierarchy obsolete.

The target time-frame would be to accomplish Stages 1-3 by the release
of 22.0 (the presumed release version for the 21.5 code base), and
Stage 4 for 22.2 (or maybe 22.4).

I don't see any real reason not to proceed with Stage 1 immediately
after a discussion period of, say, two weeks.

Comments?  Objections?  Obstacles I've overlooked?

-- 
Graduate School of Systems and Information Engineering   University of Tsukuba
http://turnbull.sk.tsukuba.ac.jp/        Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
       "Some men see problems that are, and ask `how can we fix that'?
        I see problems that never were, and say `I have fixed that!'"