Re: Back to character set implementation thinking

Subject: Re: Back to character set implementation thinking
From: Erik Naggum <erik@naggum.net>
Date: Tue, 26 Mar 2002 01:34:19 GMT
Newsgroups: comp.lang.lisp,comp.lang.scheme
Message-ID: <3226095271716329@naggum.net>

* Thomas Bushnell, BSG
| The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
| Unicode values.  Part of the reason it does this is so that existing byte
| streams of Latin-1 characters can (pretty much) be used without
| modification, and it allows "soft conversion" of existing code, which is
| quite easy and thus helps everybody switch.

  UTF-8 is in fact extremely hostile to applications that would otherwise
  have dealt with ISO 8859-1.  The addition of a prefix byte has some very
  serious implications.  UTF-8 is an inefficient and stupid format that
  should never have been proposed.  However, it has computational elegance
  in that it is a stateless encoding.  I maintain that encoding is stateful
  regardless of whether the state is made explicit or not.  I therefore
  strongly suggest that serious users of Unicode employ the Standard
  Compression Scheme for Unicode (SCSU) described in Unicode Technical
  Report #6.  I recommend reading this technical report.
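
  A minimal sketch, in Python purely for illustration, of what the prefix
  byte does to ISO 8859-1 data.  The sample word and the helper name
  is_valid_utf8 are assumptions made for this example.

```python
# The same text encoded as ISO 8859-1 and as UTF-8.
text = "na\u00efve"  # U+00EF lies in the upper half of ISO 8859-1

latin1 = text.encode("latin-1")
utf8 = text.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1)                 # b'na\xefve'
print(utf8)                   # b'na\xc3\xafve'
print(len(latin1), len(utf8)) # 5 6
print(is_valid_utf8(latin1))  # False
```

  Each character above 127 grows from one byte to two, and the old byte
  stream is not valid UTF-8 as it stands, so code that assumed one byte
  per character cannot read the new data unchanged.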

  Incidentally, if I could design things all over again, I would most
  probably have used a pure 16-bit character set from the get-go.  None of
  this annoying 7- or 8-bit stuff.  Well, actually, I would have opted for
  more than 16-bit units -- it is way too small.  I think I would have
  wanted the smallest storage unit of a computer to be 20 bits wide.  That
  would have allowed addressing of 2.5M of today's bytes with only 20 bits.
  But I digress...

| So even if strings are "compressed" this way, they are not UTF-8.  That's
| Right Out.  They are just direct UCS values.  Procedures like string-set!
| therefore might have to inflate (and thus copy) the entire string if a
| value outside the range is stored.  But that's ok with me; I don't think
| it's a serious lose.
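
  The inflating representation described in the quoted paragraph can be
  sketched as follows, in Python purely for illustration.  The class
  InflatingString and its methods string_ref and string_set are
  hypothetical names, and the choice of 8-bit and 32-bit units is an
  assumption.

```python
from array import array


class InflatingString:
    """Hypothetical mutable string holding direct UCS values.

    Starts with one byte per character and switches to a wider unit
    the first time a value above 0xFF is stored.
    """

    def __init__(self, text):
        narrow = all(ord(ch) <= 0xFF for ch in text)
        # "B" is an unsigned 8-bit unit; "L" is an unsigned unit of
        # at least 32 bits.
        self.units = array("B" if narrow else "L", [ord(ch) for ch in text])

    def string_ref(self, index):
        return chr(self.units[index])

    def string_set(self, index, ch):
        code = ord(ch)
        if code > 0xFF and self.units.typecode == "B":
            # Inflate: copy every existing element into a wider array.
            self.units = array("L", self.units.tolist())
        self.units[index] = code

    def __str__(self):
        return "".join(chr(unit) for unit in self.units)


s = InflatingString("abc")
s.string_set(1, "\u03bb")  # does not fit in 8 bits, so the string inflates
print(s.units.typecode)    # L
```

  Access by index stays constant-time; the cost is a one-time copy of the
  whole string when the first wide value arrives.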

  There is some value to the C/Unix concept of a string as a small stream.
  Most parsing of strings needs to proceed from start to end, so there is
  no point in optimizing them for direct access.  However, a string would
  then be different from a vector of characters.  It would, conceptually,
  be more like a list of characters, but with a more compact encoding, of
  course.  Emacs MULE, with all its horrible faults, has taken a stream
  approach to character sequences and then added direct access into it,
  which has become amazingly expensive.
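
  To see where the expense comes from, here is a minimal sketch of direct
  access into a variable-width byte sequence.  It uses UTF-8 as the
  example encoding, which is not MULE's internal format, assumes the
  input is well-formed, and the function names are made up for the
  illustration.

```python
def next_boundary(buf: bytes, pos: int) -> int:
    """Return the byte offset just past the character starting at pos."""
    if pos >= len(buf):
        raise IndexError("character index out of range")
    pos += 1
    # Continuation bytes in UTF-8 have the bit pattern 10xxxxxx.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def char_at(buf: bytes, index: int) -> str:
    """Return character number index, counting characters rather than bytes.

    Every preceding character has to be skipped first, so the cost
    grows with the index.
    """
    pos = 0
    for _ in range(index):
        pos = next_boundary(buf, pos)
    end = next_boundary(buf, pos)
    return buf[pos:end].decode("utf-8")


data = "a\u00e9\u20acz".encode("utf-8")  # widths 1, 2, 3 and 1 bytes
print(char_at(data, 3))  # z
```

  A loop that calls char_at for every index does quadratic work, which is
  why an implementation built this way ends up caching positions or
  otherwise paying for the mismatch.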

  I believe that trying to make "string" both a stream and a vector at the
  same time is futile and only leads to very serious problems.  The default
  representation of a string should be a stream, not a vector, and accessors
  should use the stream, such as with make-string-{input,output}-stream,
  with new operators like dostring, instead of trying to use the string as
  a vector when it clearly is not.  The character concept needs to be able
  to accommodate this, too.  Such pervasive changes are of course not free.
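
  A rough sketch of what stream-style access might look like, written in
  Python purely for illustration.  io.StringIO stands in for
  make-string-input-stream, and dostring here is a hypothetical function
  with a made-up signature, not an existing operator in any language.

```python
import io


def dostring(stream, visit):
    """Call visit(ch) for each character in the stream, start to end."""
    while True:
        ch = stream.read(1)
        if ch == "":  # end of stream
            break
        visit(ch)


def count_words(text: str) -> int:
    """Count whitespace-separated words in one pass, with no indexing."""
    state = {"words": 0, "in_word": False}

    def visit(ch):
        if ch.isspace():
            state["in_word"] = False
        elif not state["in_word"]:
            state["in_word"] = True
            state["words"] += 1

    dostring(io.StringIO(text), visit)
    return state["words"]


print(count_words("strings  as small streams"))  # 4
```

  The consumer never asks for character number n, so the underlying
  representation is free to use any compact or variable-width encoding.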

| Ok, then the second question is about combining characters.  Level 1
| support is really not appropriate here.  It would be nice to support
| Level 3.  But perhaps Level 2 with Hangul Jamo characters [are those
| required for Level 2?] would be good enough.

  Level 2 requires every other combining character except Hangul Jamo.

| It seems to me that it's most appropriate to use Normalization Form D.

  I agree for the streams approach.  I think it is important to make sure
  that there is a single code for each character sequence in the stream
  when it is converted to a vector.  The private use space should be used
  for these things, and a mapping to and from character sequences should be
  maintained such that if a private use character is queried for its
  properties, those of the character sequence would be returned.
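
  One possible reading of this mapping, sketched in Python purely for
  illustration.  The class SequenceTable and its methods are
  hypothetical.  The sketch assumes the input contains no private use
  characters of its own, draws codes only from U+E000..U+F8FF, and treats
  a character as combining when its canonical combining class is nonzero,
  which is a simplification.

```python
import unicodedata

PUA_START, PUA_END = 0xE000, 0xF8FF  # private use area in the BMP


class SequenceTable:
    """Maps combining sequences to private use codes and back."""

    def __init__(self):
        self.code_for = {}  # sequence -> private use code
        self.seq_for = {}   # private use code -> sequence
        self.next_code = PUA_START

    def intern(self, seq):
        if seq not in self.code_for:
            if self.next_code > PUA_END:
                raise OverflowError("private use area exhausted")
            self.code_for[seq] = self.next_code
            self.seq_for[self.next_code] = seq
            self.next_code += 1
        return self.code_for[seq]

    def _code(self, cluster):
        # A lone code point stands for itself; a sequence gets one code.
        return ord(cluster) if len(cluster) == 1 else self.intern(cluster)

    def to_vector(self, text):
        """Return one code per base character plus its combining marks."""
        vector, cluster = [], ""
        for ch in unicodedata.normalize("NFD", text):
            if cluster and unicodedata.combining(ch) == 0:
                vector.append(self._code(cluster))
                cluster = ""
            cluster += ch
        if cluster:
            vector.append(self._code(cluster))
        return vector

    def to_text(self, vector):
        return "".join(self.seq_for.get(code, chr(code)) for code in vector)

    def category(self, code):
        """General category, taken from the base character of a sequence."""
        seq = self.seq_for.get(code, chr(code))
        return unicodedata.category(seq[0])


table = SequenceTable()
vec = table.to_vector("x\u0301y")  # x, combining acute, y
print(len(vec))                    # 2
print(table.category(vec[0]))      # Ll
```

  Reporting the base character's category is only one choice of what
  "the properties of the sequence" could mean; the table is the part that
  matters.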

| Or is that crazy?  It has the advantage of holding all the Level 3 values
| in a consistent way.  (Since precombined characters do not exist for all
| possibilities, Normalization Form C results in some characters
| precombined and some not, right?)

  Correct.
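
  A small check of this, in Python purely for illustration, using the
  standard unicodedata module.  The helper code_points and the choice of
  sample characters are assumptions made for the example.

```python
import unicodedata


def code_points(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


e_acute = "e\u0301"  # a precomposed form exists: U+00E9
x_acute = "x\u0301"  # no precomposed form exists

print(code_points(unicodedata.normalize("NFC", e_acute)))
# ['U+00E9']
print(code_points(unicodedata.normalize("NFC", x_acute)))
# ['U+0078', 'U+0301']
print(code_points(unicodedata.normalize("NFD", "\u00e9" + x_acute)))
# ['U+0065', 'U+0301', 'U+0078', 'U+0301']
```

  Form C leaves the second sequence as two code points, while Form D
  represents both characters the same way, as a base followed by a
  combining mark.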

| And finally, should the Lisp/Scheme "character" data type refer to a
| single UCS code point, or should it refer to a base character together
| with all the combining characters that are attached to it?

  Primarily the code point, but both, effectively, by using the private use
  space as outlined above.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.