Subject: Re: Back to character set implementation thinking
From: Erik Naggum <erik@naggum.net>
Date: Tue, 26 Mar 2002 01:34:19 GMT
Newsgroups: comp.lang.lisp,comp.lang.scheme
Message-ID: <3226095271716329@naggum.net>

* Thomas Bushnell, BSG
| The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
| Unicode values. Part of the reason it does this is so that existing byte
| streams of Latin-1 characters can (pretty much) be used without
| modification, and it allows "soft conversion" of existing code, which is
| quite easy and thus helps everybody switch.

UTF-8 is in fact extremely hostile to applications that would otherwise
have dealt with ISO 8859-1. The addition of a prefix byte has some very
serious implications. UTF-8 is an inefficient and stupid format that
should never have been proposed. However, it has computational elegance
in that it is a stateless encoding. I maintain that encoding is stateful
regardless of whether it is made explicit or not. I therefore strongly
suggest that serious users of Unicode employ the compression scheme that
has been described in Unicode Technical Report #6. I recommend reading
this technical report.

Incidentally, if I could design things all over again, I would most
probably have used a pure 16-bit character set from the get-go. None of
this annoying 7- or 8-bit stuff. Well, actually, I would have opted for
more than 16-bit units -- it is way too small. I think I would have
wanted the smallest storage unit of a computer to be 20 bits wide. That
would have allowed addressing of 4G of today's bytes with only 20 bits.
But I digress...

| So even if strings are "compressed" this way, they are not UTF-8. That's
| Right Out. They are just direct UCS values. Procedures like string-set!
| therefore might have to inflate (and thus copy) the entire string if a
| value outside the range is stored. But that's ok with me; I don't think
| it's a serious lose.
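
The storage scheme in the quoted paragraph (direct UCS values in the
narrowest unit that fits, inflated by copying when a wider value is
stored) could be sketched roughly as follows. This is only an
illustration in Python, not code from either poster; the class name
CompactString and the helper _typecode_for are hypothetical, and the
1/2/4-or-more byte widths are an assumption made for the example.

```python
from array import array

# Assumed storage widths, narrowest first: 'B' is 1 byte, 'H' is 2 bytes,
# 'L' is at least 4 bytes (its exact size depends on the platform).
_TYPECODES = ("B", "H", "L")


def _typecode_for(code_point):
    """Return the narrowest typecode whose unit can hold code_point."""
    for typecode in _TYPECODES:
        if code_point < 256 ** array(typecode).itemsize:
            return typecode
    raise ValueError("code point too large for any storage width")


class CompactString:
    """Hypothetical string holding raw code points, not UTF-8 bytes."""

    def __init__(self, text):
        points = [ord(ch) for ch in text]
        self._units = array(_typecode_for(max(points, default=0)), points)

    @property
    def width(self):
        """Bytes used per character in the current representation."""
        return self._units.itemsize

    def __len__(self):
        return len(self._units)

    def __getitem__(self, index):
        return chr(self._units[index])

    def __setitem__(self, index, ch):
        needed = _typecode_for(ord(ch))
        if array(needed).itemsize > self._units.itemsize:
            # Inflate: the whole string is copied into wider units.
            self._units = array(needed, list(self._units))
        self._units[index] = ord(ch)

    def __str__(self):
        return "".join(chr(point) for point in self._units)
```

Storing a character that fits the current width is a plain element
write; only a store outside the current range pays for the full copy,
which is the cost the quoted text is willing to accept.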

There is some value to the C/Unix concept of a string as a small stream.
Most parsing of strings proceeds from start to end, so there is no point
in optimizing them for direct access. However, a string would then be
different from a vector of characters. It would, conceptually, be more
like a list of characters, but with a more compact encoding, of course.
Emacs MULE, with all its horrible faults, has taken a stream approach to
character sequences and then added direct access into it, which has
become amazingly expensive.

I believe that trying to make "string" both a stream and a vector at the
same time is futile and only leads to very serious problems. The default
representation of a string should be a stream, not a vector, and
accessors should use the stream, such as with
make-string-{input,output}-stream, with new operators like dostring,
instead of trying to use the string as a vector when it clearly is not.
The character concept needs to be able to accommodate this, too. Such
pervasive changes are of course not free.

| Ok, then the second question is about combining characters. Level 1
| support is really not appropriate here. It would be nice to support
| Level 3. But perhaps Level 2 with Hangul Jamo characters [are those
| required for Level 2?] would be good enough.

Level 2 requires every other combining character except Hangul Jamo.

| It seems to me that it's most appropriate to use Normalization Form D.

I agree for the streams approach. I think it is important to make sure
that there is a single code for all character sequences in the stream
when it is converted to a vector. The private use space should be used
for these things, and a mapping to and from character sequences should
be maintained such that if a private use character is queried for its
properties, those of the character sequence would be returned.

| Or is that crazy? It has the advantage of holding all the Level 3 values
| in a consistent way.
| (Since precombined characters do not exist for all
| possibilities, Normalization Form C results in some characters
| precombined and some not, right?)

Correct.

| And finally, should the Lisp/Scheme "character" data type refer to a
| single UCS code point, or should it refer to a base character together
| with all the combining characters that are attached to it?

Primarily the code point, but both, effectively, by using the private
use space as outlined above.

///
-- 
In a fight against something, the fight has value, victory has none.
In a fight for something, the fight is a loss, victory merely relief.
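
The private use mapping described in the post might look something like
the sketch below. It is only an illustration in Python, not the
author's implementation. The class SequenceInterner and its methods are
hypothetical names. Using the BMP Private Use Area (U+E000 to U+F8FF),
normalizing to Form D before lookup, and answering a property query
with the base character's general category are all assumptions made
for the example.

```python
import unicodedata

# BMP Private Use Area; choosing this range is an assumption of the sketch.
PUA_START, PUA_END = 0xE000, 0xF8FF


class SequenceInterner:
    """Hypothetical two-way map between sequences and private use codes."""

    def __init__(self):
        self._to_code = {}   # NFD sequence -> private use character
        self._to_seq = {}    # private use character -> NFD sequence
        self._next = PUA_START

    def intern(self, sequence):
        """Return a single character standing for the whole sequence."""
        seq = unicodedata.normalize("NFD", sequence)
        if len(seq) == 1:
            return seq  # already a single code point, nothing to map
        if seq not in self._to_code:
            if self._next > PUA_END:
                raise OverflowError("private use area exhausted")
            ch = chr(self._next)
            self._next += 1
            self._to_code[seq] = ch
            self._to_seq[ch] = seq
        return self._to_code[seq]

    def expand(self, ch):
        """Return the sequence behind ch, or ch itself if it is unmapped."""
        return self._to_seq.get(ch, ch)

    def category(self, ch):
        """Property query, answered from the sequence's base character."""
        return unicodedata.category(self.expand(ch)[0])
```

With such a table, a vector of characters can hold one code per
base-plus-combining sequence, while a stream can still emit the
expanded sequence on output.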