Ray Dillinger <bear@sonic.net> wrote:
+---------------
| I guess I'm notorious as the agitator in the scheme camp who has
| been arguing in favor of a character type that corresponds to
| what in unicode parlance is a "grapheme," whereas the emerging
| consensus seems to be for a character type that corresponds to
| what in unicode parlance is a "codepoint."
...
| ...when people write characters in some
| non-preferred form you could just silently read them as
| normalized, sparing people the brain sweat of remembering
| all the proper normalization stuff themselves, so that
| (equal? #\A:macron:cedilla #\A:cedilla:macron) => #t
|
| On the downside, the character set is then infinite,
| char->integer may return bignums, characters may have to
| be boxed (and if we want to preserve eq?-ness, boxed
| characters have to be interned as well) and the number of
| *codepoints* in a string or the specific representation
| of a character as a codepoint sequence becomes undefined.
| So there are tradeoffs...
+---------------
On the plus side, ANSI CL has never required "EQ-ness" of characters,
so there's none to "preserve". [See CLHS 13.1.5 "Identity of Characters".]
The CL programmer is only promised EQL identity for characters and numbers.
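
To make the EQ/EQL distinction concrete, here is a minimal sketch. The
helper name SAME-CHAR-P is made up for illustration; only EQ and EQL
come from the standard.

```lisp
;; SAME-CHAR-P is a hypothetical helper, not a standard function.
;; It compares characters with EQL, the only identity the standard
;; promises for characters.
(defun same-char-p (a b)
  "True if A and B are the same character under EQL."
  (eql a b))

(same-char-p #\A #\A)   ; true in any conforming implementation
(eq #\A #\A)            ; may be true or false; the standard leaves it open
```

So a boxed, un-interned grapheme character would be fine as long as EQL
looks inside the box.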
Another plus is that CHAR-CODE-LIMIT, the "upper exclusive bound on
the value returned by the function char-code", is a non-negative
INTEGER, not a FIXNUM, so while the occasional BIGNUM is ugly, it's
not forbidden.
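
A quick way to see where a given implementation stands is sketched
below. The function name CHAR-CODES-FIT-IN-FIXNUM-P is an assumption
for illustration; CHAR-CODE-LIMIT and TYPEP are standard.

```lisp
;; Hypothetical helper: reports whether every possible character code
;; in this implementation is a FIXNUM. The largest possible code is
;; (1- CHAR-CODE-LIMIT), since the limit is an exclusive bound.
(defun char-codes-fit-in-fixnum-p ()
  (typep (1- char-code-limit) 'fixnum))

;; The result is implementation-dependent. A grapheme-based character
;; type would be allowed to make this return false.
(char-codes-fit-in-fixnum-p)
```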
On the down side, you will have to provide mapping functions
from characters to CL character codes and back, so CHAR-CODE and
CODE-CHAR can work.
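
One possible shape for such a mapping, purely as a sketch: pack the
already-normalized codepoint sequence of a grapheme into a single
integer, 21 bits per codepoint, with the base codepoint in the low
bits. The names +CODEPOINT-BITS+, GRAPHEME->CODE and CODE->GRAPHEME are
hypothetical, the list-of-integers representation is an assumption, and
normalization is assumed to have happened before these are called. It
also assumes only the first codepoint can be zero, which holds for a
base followed by combining marks.

```lisp
;; Hypothetical sketch, not any implementation's actual scheme.
(defconstant +codepoint-bits+ 21)   ; enough for codepoints up to #x10FFFF

(defun grapheme->code (codepoints)
  "Pack a normalized list of codepoints into one non-negative integer.
The first (base) codepoint lands in the low 21 bits."
  (loop for cp in codepoints
        for shift from 0 by +codepoint-bits+
        sum (ash cp shift)))

(defun code->grapheme (code)
  "Inverse of GRAPHEME->CODE: unpack an integer into a codepoint list."
  (loop collect (ldb (byte +codepoint-bits+ 0) code)
        do (setf code (ash code (- +codepoint-bits+)))
        until (zerop code)))

;; A + combining cedilla + combining macron, in canonical order:
(code->grapheme (grapheme->code '(#x41 #x327 #x304)))
```

With this packing a single-codepoint grapheme gets its codepoint as its
code, and longer graphemes grow by 21 bits per combining mark, which is
where the occasional BIGNUM would come from.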
And I'm not sure, but given the history [and confusion] about the topic,
you'd probably want there to *not* be any "implementation-defined
attributes" for characters. That is, make CHAR-CODE and CHAR-INT
the same, and make normalized graphemes EQL to "the same" unnormalized
graphemes.
Another choice might be to allow/provide some "implementation-defined
attributes", and then only make normalized graphemes EQL, but perhaps
allow unnormalized graphemes to be EQUAL. [In that case CHAR-CODE and
CHAR-INT *would* be different for at least some graphemes.]
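
To tell which of the two situations an implementation is in, one could
compare CHAR-CODE and CHAR-INT directly. A minimal sketch follows; the
function name and the default limit of 128 are assumptions for
illustration, and it only samples the low codes rather than proving
anything about the whole character set.

```lisp
;; Hypothetical helper: checks that CHAR-INT and CHAR-CODE agree for
;; every character with a code below LIMIT. Codes for which CODE-CHAR
;; returns NIL are skipped.
(defun char-int-matches-char-code-p (&optional (limit 128))
  (loop for code below (min limit char-code-limit)
        for ch = (code-char code)
        always (or (null ch)
                   (= (char-code ch) (char-int ch)))))

(char-int-matches-char-code-p)
```

Under the first choice this would hold for every character; under the
second it would fail for at least some unnormalized graphemes.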
-Rob
-----
Rob Warnock <rpw3@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607