Subject: Re: Reviews for lisp implementations
From: Erik Naggum <erik@naggum.no>
Date: 1999/04/18
Newsgroups: comp.lang.lisp
Message-ID: <3133424944726306@naggum.no>

* Lars Marius Garshol <larsga@ifi.uio.no>
| And if it's really sorted separately then I think makes sense to
| consider it a separate character, as Unicode more or less does
| (although it calls it a ligature): U+0132 and U+0133.

this is getting a bit far afield, but collation order, characterness, and glyphness are distinct properties of a writing system element. for one thing, there is no _single_ correct collation order. character sets do _not_ imply collation order. characterness of a writing system element is a fairly fundamental concept and is strongly associated with meaning. glyphness of a writing system element is strongly associated with looks. finally, fonts are made up of instantiations of glyphs.

e.g., a writing system element may exhibit meanings so different that they deserve to be separate characters, although this is very rare. in general, there is also one glyph per character, although some have more (the German short and long s, the open and baggy a, the open and broken vertical line), but more frequent is a glyph for a sequence of characters (ligatures in Latin scripts, but this includes vowels in Indic scripts and Hebrew) or a character in context (the connectives (single, initial, medial, final) in Arabic scripts), etc. collation order is tightly coupled with character, but for hysterical raisins many languages collate sequences of characters as a single unit.

to represent all of this correctly, you need a whole bunch of tables. there are therefore glyph set standards that are very separate from character set standards, and their mapping is non-trivial.
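
to make the point that a character set does not imply a collation order concrete, here is a minimal Python sketch that sorts the same words under two different rules. everything in it is an assumption made for illustration: the names `english_key`, `norwegian_key`, and `NORWEGIAN_ALPHABET` are hypothetical, and the two rules are toy stand-ins for real collation tables (they ignore multi-level comparison and multi-character collation units entirely).

```python
import unicodedata

def english_key(word):
    # toy English-style rule (an assumption, not a real collation table):
    # ignore case and diacritics, so "Å" sorts together with "a".
    # ø and æ have no canonical decomposition, so they are folded by hand.
    folded = word.casefold().replace("ø", "o").replace("æ", "ae")
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# toy Norwegian-style alphabet: æ, ø, å are separate letters after z.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
NORWEGIAN_RANK = {c: i for i, c in enumerate(NORWEGIAN_ALPHABET)}

def norwegian_key(word):
    # compose first so that "a" + combining ring is seen as the letter "å";
    # anything outside the alphabet sorts last in this sketch.
    folded = unicodedata.normalize("NFC", word.casefold())
    return [NORWEGIAN_RANK.get(c, len(NORWEGIAN_RANK)) for c in folded]

words = ["Zebra", "Ångstrøm", "Angst", "Apple"]

english_order = sorted(words, key=english_key)
norwegian_order = sorted(words, key=norwegian_key)

print(english_order)    # ['Angst', 'Ångstrøm', 'Apple', 'Zebra']
print(norwegian_order)  # ['Angst', 'Apple', 'Zebra', 'Ångstrøm']
```

the same four strings in the same character set come out in two different orders, because the order lives in the key function (the table), not in the code points.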
there are huge tables of correct collation orders for different scripts and languages (French requires a five-level deep collation system in full name and dictionary sorting), and conflation of representation makes up most of it (e.g., no significance is attached to the ring in "Ångstrøm" in an English dictionary, where it is sorted with Angst, but you'll find it at the end of a Norwegian one because Å is a separate character).

Unicode is a hybrid of a character and a glyph set. the reason for this is fairly obvious when you consider its major proponents: Xerox and Microsoft. Xerox makes printers and wanted a simple standard for which they could make huge fonts. Microsoft are just too damn stupid to get it right or to respect any traditions. (Xerox didn't want it to replace the first ISO 10646 draft, however, so they may be excused.)

in a typical "is this a font or what?"-misunderstanding, æ was a ligature in Unicode, but I complained about it, so ISO 10646-1 has amended it to be a letter, and "ij" is a character, not a presentation form, which it should have been.

#:Erik
-- 
environmentalists are much too concerned with planet earth. their geocentric attitude prevents them from seeing the greater picture -- lots of planets are much worse off than earth is.