Subject: the character type
From: Erik Naggum <erik@naggum.no>
Date: 1996/03/17
Newsgroups: comp.lang.lisp
Message-ID: <3036079150517560@arcana.naggum.no>

suppose you have a file in an unknown character set.  the file starts with
some codes that tell you how it is encoded.  each character could be 7, 8,
14, 16 or 21 bits wide, encoded as 1, 1, 2, 2, and 4 bytes, respectively.
additionally it could be using any of the ISO 2022 or ISO 10646 methods of
encoding.

suppose you want to read this into the system as _characters_, and work on
them as characters, for purposes of mapping, transformation, processing, or
interpreting where the encoding is irrelevant to the result, but you need
to generate some _encoding_ on output, which need not be a function of the
input encodings used.

I can fake it with integers, but I really want to distinguish characters
from integers.  I actually want characters to be known by longish, unique
names, and have several possible encodings.  I can fake this with symbols,
but symbols are not characters, and I think Common Lisp should have a
strong enough character type that it could do all this, not least
because of the requirements of internationalization.

I would like to be able to have a number of character set descriptions
(tables) and encoding algorithms (filter functions) that allow me to return
the character read from an input source.  let me take a simple example.
suppose the character set is ISO 8859-1, but it has been reduced to 7-bit
encoding according to ISO 2022, such that SO and SI are used to switch
between the "low" and the "high" half.  the usual way to deal with this is
to let SO and SI toggle the 8th bit, but I don't want that.  I want SO and
SI to change the mapping for the current 7-bit character set such that the
right character is returned directly.  I want this because SO and SI are
not the only possible character set shift control characters in ISO 2022.
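
the table-switching idea above can be sketched roughly as follows.  this
is only an illustration, not an existing facility: every name in it
(`+so+', `+si+', `make-table', `*low-half*', `*high-half*', `*shifts*',
`decode-7bit') is made up for the example, and it assumes that `code-char'
maps the codes 0-255 to ISO 8859-1, which the standard does not promise.

```lisp
;;; Sketch only.  All names are hypothetical.  Assumes CODE-CHAR maps the
;;; codes 0-255 to ISO 8859-1, which is implementation-dependent.

(defconstant +so+ 14)   ; SHIFT OUT: select the "high" half
(defconstant +si+ 15)   ; SHIFT IN:  select the "low" half

(defun make-table (offset)
  "Return a 128-entry vector mapping a 7-bit code to a character.
Codes below 32 are control codes and are never shifted; codes 32-127 are
shifted by OFFSET (the right half of ISO 8859-1 is a 96-character set)."
  (let ((table (make-array 128)))
    (dotimes (code 128 table)
      (setf (aref table code)
            (code-char (if (< code 32) code (+ code offset)))))))

(defvar *low-half*  (make-table 0))
(defvar *high-half* (make-table 128))

;; Each shift control selects a whole table instead of toggling a bit.
;; Other single-code shift controls would get entries here.
(defvar *shifts*
  (list (cons +so+ *high-half*)
        (cons +si+ *low-half*)))

(defun decode-7bit (codes)
  "Decode a list of 7-bit codes into a list of characters."
  (let ((table *low-half*)
        (result '()))
    (dolist (code codes (nreverse result))
      (let ((shift (assoc code *shifts*)))
        (if shift
            (setf table (cdr shift))          ; change the current mapping
            (push (aref table code) result))))))

;; Under the assumption above:
;; (mapcar #'char-code (decode-7bit '(99 97 102 14 105 15 33)))
;; => (99 97 102 233 33)
```

the point of the alist is that the decoder never does arithmetic on the
code itself; adding another shift control means adding another table.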

additionally, I want to be able to parse escape sequences as their
appropriate pseudo-characters, but this is above and beyond the others.

ideally, I would like to have a facility that allowed me to create new
characters, name them, and assign _multiple_ codes to them.  the key to my
quest is that it should be possible to read and write them using various
encoding schemes and to put them into strings, and still have the rest of
the Common Lisp system deal with them.
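
a minimal sketch of what "named characters with multiple codes" might look
like if faked with ordinary objects.  the names `xchar', `define-xchar',
`decode' and `encode', and the encoding keywords, are hypothetical; none
of this is part of Common Lisp.  note that an `xchar' is not of type
`character' and cannot be put into a string, which is precisely the
limitation described above.

```lisp
;;; Sketch only.  XCHAR, DEFINE-XCHAR, DECODE and ENCODE are hypothetical
;;; names.  An XCHAR is a stand-in object, not a real CHARACTER.

(defstruct xchar
  name      ; longish, unique name (a string)
  codes)    ; alist of (encoding . code)

(defvar *by-name* (make-hash-table :test #'equal))
(defvar *by-code* (make-hash-table :test #'equal)) ; key: (encoding . code)

(defun define-xchar (name &rest codes)
  "Create a character called NAME.  CODES is a plist of encoding and code,
so one character may carry several codes.  Returns the new object."
  (let ((xchar (make-xchar :name name)))
    (loop for (encoding code) on codes by #'cddr
          do (push (cons encoding code) (xchar-codes xchar))
             (setf (gethash (cons encoding code) *by-code*) xchar))
    (setf (gethash name *by-name*) xchar)))

(defun decode (encoding code)
  "Return the character that CODE denotes in ENCODING, or NIL."
  (values (gethash (cons encoding code) *by-code*)))

(defun encode (xchar encoding)
  "Return the code of XCHAR in ENCODING, or NIL if it has none."
  (cdr (assoc encoding (xchar-codes xchar))))

;; Reading in one encoding and writing in another:
;; (define-xchar "LATIN SMALL LETTER E WITH ACUTE"
;;               :iso-8859-1 233 :cp437 130)
;; (encode (decode :cp437 130) :iso-8859-1)  ; => 233
```

the output encoding is chosen independently of the input encoding, since
both go through the same character object.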

I am disappointed with the character type in Common Lisp -- it seems
incongruously inflexible -- and wonder if any work has been done in this
area.  e.g., are there Common Lisp implementations that allow GB5, UTF, JIS
X 0208, etc, or so-called "wide characters"?

I think I would like to see `(setf char-name)', `(setf char-code)', and/or
a (new) `make-char' function that allowed me to build new characters.

the font and bits attributes were a start in the right direction, although
less general than they should have been, but they were removed from the
ANSI standard.

any ideas?

#<Erik>
-- 
the Internet made me do it