Subject: Re: Wide character implementation
From: Erik Naggum <erik@naggum.net>
Date: Sun, 24 Mar 2002 06:51:53 GMT
Newsgroups: comp.lang.lisp,comp.lang.scheme
Message-ID: <3225941523389213@naggum.net>

* tb+usenet@becket.net (Thomas Bushnell, BSG)
| Should the Scheme/CL type "character" hold Unicode characters, or
| Unicode glyphs? (It seems clear to me that it should hold characters,
| but I might be thinking about it poorly.)

There are no Unicode glyphs. The question properly refers to the
equivalence between a sequence of characters, starting with a base
character and optionally followed by combining characters, and
"precomposed" characters. This is the canonical-equivalence of character
sequences. A processor of Unicode text is allowed to replace any
character sequence with any of its canonically-equivalent character
sequences. It is in this regard that an application may want to request
a particular composite character either as one character or as a
character sequence, and may decide to examine each coded character
element individually or as an interpreted character. These constitute
three different levels of interpretation that it must be possible to
specify. Since an application is explicitly permitted to choose any of
the canonical-equivalent character sequences for a character, the only
reasonable approach is to normalize characters into a known internal
form.

There is one crucial restriction on the ability to use equivalent
character sequences. ISO 10646 defines implementation levels 1, 2 and 3
that, respectively, prohibit all combining characters, allow most
combining characters, and allow all combining characters. This is a very
important part of the whole Unicode effort, but Unicode has elected to
refer to ISO 10646 for this, instead of adopting it.
From my personal communication with high-ranking officials in the
Unicode consortium, this is a political decision, not a technical one,
because it was feared that implementors who would be happy with trivial
character-to-glyph mapping software (such as a conflation of character
and glyph concepts and fonts that support this conflation), especially
in the Latin script cultures, would simply drop support for the more
complex usage of the Latin script and would fail to implement, e.g.,
Greek properly. It was feared that, far from Unicode being an enabling
technology, implementation of the full set of equivalences would be
omitted and would thus fail to enable the international support that was
so sought after.

ISO 10646, on the other hand, has realized that implementors will need
time to get all this right, and may choose to defer implementation of
Unicode entirely if they are not able to do it stepwise. ISO 10646
Level 1 is intended to be workable for a large number of uses, while
Level 3 is felt not to have an advantage qua requirement until languages
that require far more than composition and decomposition are to be fully
supported. I concur strongly with this.

The character-to-glyph mapping is fraught with problems. One possible
way to do this is actually to use the large private use areas to build
glyphs and then internally use only non-combining characters. The level
of dynamism in the character coding and character-to-glyph mapping here
is so much more difficult to get right that handling the
canonical-equivalent sequences of characters (which is a fairly simple
table-lookup process) pales in comparison. That is, _if_ you allow
combining characters, actually being able to display them and reason
about them (such as computing widths or dealing with character
properties of the implicit base character or converting their case) is
far more difficult than decomposing and composing characters.

As for the scary effect of "variable length" -- if you do not like it,
canonicalize the input stream.
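
To make the canonicalization step concrete, here is a minimal sketch in
Python. It assumes Python's standard unicodedata module and takes the
Unicode normalization forms NFC (composed) and NFD (decomposed) as the
"known internal form"; the helper name canonicalize is hypothetical and
chosen only for illustration.

```python
import unicodedata


def canonicalize(text: str, form: str = "NFC") -> str:
    """Normalize text into one known internal form (hypothetical helper).

    NFC prefers precomposed characters; NFD prefers a base character
    followed by combining characters.
    """
    return unicodedata.normalize(form, text)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one coded character
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT, two coded characters

# Canonically equivalent, but different sequences of coded characters.
print(precomposed == combining)                              # False
# After normalization to the same form they compare equal.
print(canonicalize(precomposed) == canonicalize(combining))  # True
# The same character seen as one element or as a sequence.
print(len(canonicalize(combining, "NFC")))                   # 1
print(len(canonicalize(precomposed, "NFD")))                 # 2
```

Normalizing once at the input boundary means the rest of the program
sees only one of the equivalent spellings. A real stream reader would
also have to avoid splitting a base character from its combining
characters at a buffer boundary, which this sketch does not attempt.
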
This really is an isolatable non-problem.

///
--
In a fight against something, the fight has value, victory has none.
In a fight for something, the fight is a loss, victory merely relief.