Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
From: Erik Naggum <erik@naggum.no>
Date: 1999/02/11
Newsgroups: comp.lang.lisp
Message-ID: <3127703943282468@naggum.no>

* "Howard R. Stearns" <howard@elwood.com>
| One that is one of my pet peeves. A while back in "IEEE Computer"
| magazine, some yahoo decided that we don't need to use 16 bits to handle
| international characters. Instead, we usually only need 8 bits at a
| time, and that we would get better performance by using 8-bit characters
| for everything along with a locally understood "current char set". They
| eventually printed a "letter to the editor" I sent, and the whole thing
| bugs me enough that I'm going to repeat it here.

the first ISO 10646 draft actually had this feature for character sets
that only need 8 bits, complete with a "High Octet Prefix", which was
intended as a stateful encoding that would _never_ be useful in memory.
this was a vastly superior coding scheme to UTF-8, which unfortunately
penalizes everybody outside of the United States. I actually think UTF-8
is one of the least intelligent solutions to this problem around: it
thwarts the whole effort of the Unicode Consortium and has already
proven to be a reason why Unicode is not catching on.

instead of this stupid encoding, only a few system libraries need to be
updated to understand the UCS signature, #xFEFF, at the start of strings
or streams. it can even be byte-swapped without loss of information. I
don't think two bytes is a great loss, but the stateless morons in New
Jersey couldn't be bothered to figure something like this out. argh!

when the UCS signature becomes widespread, any string or stream can be
viewed initially as a byte sequence, and upon first access it can easily
be inspected for its true nature; the object can then change class into
whatever the appropriate class should be, and even be byte-swapped if
appropriate. this is not at all rocket science. I think the UCS
signature is among the smarter things in Unicode. that #xFFFE is an
invalid code and #xFEFF is a zero-width no-break space are signs of a
brilliant mind at work. I don't know who invented this, but I _do_ know
that UTF-8 is a New Jersey-ism.

| One issue you bring up that is not covered in the letter is whether speed
| is affected in Lisp by simultaneously supporting BOTH ASCII and Unicode.

there is actually a lot of evidence that UTF-8 slows things down because
it has to be translated, but UTF-16 can be processed faster than ISO
8859-1 on most modern computers because memory access is simpler with
16-bit units than with 8-bit units. odd addresses are not free.

| It is not quite correct to refer to Unicode as a 16-bit standard.
| Unicode actually uses a 32-bit space. It is one of the more popular
| subsets of Unicode, UCS-2, that happens to fit in 16 bits.

well, Unicode 1.0 was 16 bits, but Unicode is now 16 bits plus 20 bits'
worth of extended space, encoded as pairs of 16-bit units using 1024
high and 1024 low codes from the set of 16-bit codes. ISO 10646 is a
31-bit character set standard without any of this stupid hi-lo cruft.

your point about the distinction between internal and external formats
is generally lost on people who have never seen the concepts provided by
the READ and WRITE functions in Common Lisp.
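(a minimal sketch of the signature check described above, in Common
Lisp, assuming the external data arrives as a vector of octets; the
function names and keyword results below are illustrative only, not
taken from any particular library:)

  ;; sniff the first two octets for the UCS signature and decide how the
  ;; rest of the data should be interpreted.
  (defun sniff-ucs-signature (octets)
    "Return :UCS-2-BE, :UCS-2-LE, or :UNKNOWN for the octet vector OCTETS."
    (if (< (length octets) 2)
        :unknown
        (let ((b0 (aref octets 0))
              (b1 (aref octets 1)))
          (cond ((and (= b0 #xFE) (= b1 #xFF)) :ucs-2-be)
                ((and (= b0 #xFF) (= b1 #xFE)) :ucs-2-le)
                (t :unknown)))))

  (defun decode-ucs-2 (octets &optional (order (sniff-ucs-signature octets)))
    "Decode OCTETS into a list of 16-bit character codes, skipping the
signature and byte-swapping when it indicates the other byte order."
    (let ((start (if (eq order :unknown) 0 2)))
      (loop for i from start below (1- (length octets)) by 2
            collect (ecase order
                      ;; no signature: assume network (big-endian) order.
                      ((:ucs-2-be :unknown)
                       (+ (* 256 (aref octets i)) (aref octets (1+ i))))
                      (:ucs-2-le
                       (+ (* 256 (aref octets (1+ i))) (aref octets i)))))))

  ;; e.g. (decode-ucs-2 #(#xFE #xFF #x00 #x41)) and
  ;;      (decode-ucs-2 #(#xFF #xFE #x41 #x00)) both yield (65), i.e. "A".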
Lispers are used to dealing with different internal and external
representations, and therefore have a strong propensity to understand
much more complex issues than people who are likely to argue in favor
of writing raw bytes from memory out to files as a form of
"interchange", and who deal with all text as _strings_ and repeatedly
maul them with regexps. my experience is that there's no point in
trying to argue with people who don't understand the concepts of
internal and external representation -- if you want to reach them at
all, that's where you have to start, but be prepared for a paradigm
shift happening in your audience's brain.

(it has been instructive to see how people suddenly grasp that a date
is always read and written in ISO 8601 format although the machine
actually deals with it as a large integer, the number of seconds since
an epoch. Unix folks who are used to seeing the number _or_ a hacked-up
crufty version of `ctime' output are truly amazed by this.)

if you can explain how and why conflating internal and external
representation is bad karma, you can usually watch people get a serious
case of revelation, and their coding style changes there and then. but
just criticizing their choice of an internal-friendly external coding
doesn't ring a bell.

#:Erik
-- 
Y2K conversion simplified: Januark, Februark, March, April, Mak, June,
Julk, August, September, October, November, December.
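(a minimal sketch of the date example above, in Common Lisp: internally
the date is just a large integer, a universal time counting seconds
since the 1900 epoch, and only the printer deals in the ISO 8601
external format; the formatter below is illustrative and handles only
GMT:)

  (defun iso-8601 (universal-time)
    "Render the integer UNIVERSAL-TIME as an ISO 8601 string."
    (multiple-value-bind (second minute hour date month year)
        (decode-universal-time universal-time 0)
      (format nil "~4,'0D-~2,'0D-~2,'0DT~2,'0D:~2,'0D:~2,'0DZ"
              year month date hour minute second)))

  ;; e.g. (iso-8601 (get-universal-time)) might yield "1999-02-11T14:03:27Z";
  ;; going the other way, a reader would parse the six fields and hand them
  ;; to ENCODE-UNIVERSAL-TIME to recover the internal integer.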