Subject: Re: Reviews for lisp implementations
From: Erik Naggum <erik@naggum.no>
Date: 1999/04/18
Newsgroups: comp.lang.lisp
Message-ID: <3133424944726306@naggum.no>

* Lars Marius Garshol <larsga@ifi.uio.no>
| And if it's really sorted separately then I think it makes sense to
| consider it a separate character, as Unicode more or less does
| (although it calls it a ligature): U+0132 and U+0133.

  this is getting a bit far afield, but collation order, characterness, and
  glyphness are distinct properties of a writing system element.  for one
  thing, there is no _single_ correct collation order.  character sets do
  _not_ imply collation order.  characterness of a writing system element
  is a fairly fundamental concept and is strongly associated with meaning.
  glyphness of a writing system element is strongly associated with looks.
  finally, fonts are made up of instantiations of glyphs.  e.g., a writing
  system element may exhibit meanings so different that they deserve to be
  separate characters, although this is very rare.  in general, there is
  also one glyph per character, although some have more (the German short
  and long s, the open and baggy a, the open and broken vertical line), but
  more frequent is a glyph for a sequence of characters (ligatures in Latin
  scripts, but includes vowels in Indic scripts and Hebrew) or a character
  in context (the connectives (single, initial, medial, final) in Arabic
  scripts), etc.  collation order is tightly coupled with character, but
  for hysterical raisins many languages collate sequences of characters as
  a single unit.  to represent all of this correctly, you need a whole
  bunch of tables.  there are therefore glyph set standards that are very
  separate from character set standards, and their mapping is non-trivial.
  there are huge tables of correct collation orders for different scripts
  and languages (French requires a five-level deep collation system in full
  name and dictionary sorting), and conflation of representation makes up
  most of it (e.g., no significance is attached to the ring in "Ångstrøm"
  in an English dictionary, where it is sorted with Angst, but you'll find
  it at the end of a Norwegian one because Å is a separate character).
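
  a minimal sketch of the "Ångstrøm" example, in Python, to show that the
  same characters sort differently under two collation tables.  the names
  english_key, norwegian_key, and NORWEGIAN_ALPHABET are hypothetical, and
  both keys are crude single-level approximations, not real locale tables.

```python
import unicodedata

# simplified Norwegian alphabet: a-z, then æ, ø, å (an assumption for this
# sketch; real Norwegian collation has more rules than this)
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
NORWEGIAN_RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def english_key(word):
    """Hypothetical English-dictionary key: diacritics carry no weight."""
    folded = unicodedata.normalize("NFC", word).lower()
    # ø and æ have no canonical decomposition, so fold them by hand
    folded = folded.replace("ø", "o").replace("æ", "ae")
    decomposed = unicodedata.normalize("NFD", folded)
    # drop the combining marks (such as the ring) left by decomposition
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def norwegian_key(word):
    """Hypothetical Norwegian key: æ, ø, å are letters ranked after z."""
    composed = unicodedata.normalize("NFC", word).lower()
    unknown = len(NORWEGIAN_ALPHABET)  # anything not in the table sorts last
    return [NORWEGIAN_RANK.get(ch, unknown) for ch in composed]


words = ["Zebra", "Ångstrøm", "Apple", "Angst"]
english_order = sorted(words, key=english_key)
norwegian_order = sorted(words, key=norwegian_key)
print(english_order)    # Ångstrøm lands right after Angst
print(norwegian_order)  # Ångstrøm lands after Zebra
```

  the character set is identical in both calls; only the table changes,
  which is the sense in which a character set does not imply a collation
  order.  a real implementation would need multi-level keys, as the French
  case suggests.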

  Unicode is a hybrid of a character and a glyph set.  the reason for this
  is fairly obvious when you consider its major proponents: Xerox and
  Microsoft.  Xerox makes printers and wanted a simple standard for which
  they could make huge fonts.  Microsoft are just too damn stupid to get it
  right or to respect any traditions.  (Xerox didn't want it to replace the
  first ISO 10646 draft, however, so they may be excused.)  in typical "is
  this a font or what?"-misunderstanding, æ was a ligature in Unicode, but
  I complained about it, so ISO 10646-1 has amended it to be a letter, and
  "ij" is a character, not a presentation form, which it should have been.
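
  the distinction can be inspected with Python's standard unicodedata
  module.  this is only a sketch: it reports whatever Unicode version ships
  with the Python in use, which is much later than the drafts discussed
  here, and the helper name describe is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, decomposition mapping) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.decomposition(ch),  # empty string if there is none
    )


# IJ and ij (U+0132, U+0133), æ (U+00E6), and the fi ligature (U+FB01),
# the last taken from the alphabetic presentation forms block
samples = ["\u0132", "\u0133", "\u00e6", "\ufb01"]
for ch in samples:
    print(*describe(ch), sep="  ")

# compatibility normalization unfolds "ij" into two letters but leaves æ alone
folded_ij = unicodedata.normalize("NFKC", "\u0133")
folded_ae = unicodedata.normalize("NFKC", "\u00e6")
print(folded_ij, folded_ae)
```

  in the data this prints, æ is named a LETTER with no decomposition, while
  U+0132 and U+0133 are named LIGATURE and carry a compatibility mapping to
  two letters, much like the fi ligature among the presentation forms.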

#:Erik
-- 
environmentalists are much too concerned with planet earth.  their geocentric
attitude prevents them from seeing the greater picture -- lots of planets are
much worse off than earth is.