Subject: Re: Wide character implementation
From: Erik Naggum <erik@naggum.net>
Date: Tue, 19 Mar 2002 10:53:48 GMT
Newsgroups: comp.lang.lisp,comp.lang.scheme
Message-ID: <3225524036151618@naggum.net>

* Thomas Bushnell, BSG
| If one uses tagged pointers, then its easy to implement fixnums as
| ASCII characters efficiently.

  Huh?  No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters?  Or worse yet, 35-bit Unicode characters?

  Unicode is a 31-bit character set.  The base multilingual plane is 16
  bits wide, and then there is the possibility of 20 bits encoded in two
  16-bit values, each carrying a value from 0 to 1023, effectively

(+ (expt 2 20) (- (expt 2 16) 1024 1024))
=> 1112064

  possible codes in this coding scheme, but one does not have to understand
  the lo- and hi-word codes that make up the 20-bit character space.  In
  effect, you need 21 bits.  Therefore, you could represent characters with
  the following bit pattern, with b for bits and c for code.  Fonts are a
  mistake, so they are removed.

000000ccccccccccccccccccccc00110

  This is useful when the fixnum type tag is 000 for even fixnums and 100
  for odd fixnums, effectively 00 for fixnums.  This makes char-code and
  code-char a single shift operation.  Of course, char-bits and char-font
  are not supported in this scheme, but if you _really_ have to, the upper
  4 bits may be used for char-bits.

| At the same time, most characters in the system will of course not be
| wide.  What are the sane implementation strategies for this?

  I would (again) recommend actually reading the specification.  The
  character type can handle everything, but base-char could handle the
  8-bit things that reasonable people use.  The normal string type has
  character elements while base-string has base-char elements.  It would
  seem fairly reasonable to implement a *read-default-string-type* that
  would take string or base-string as value if you choose to implement both
  string types.
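
  The tagging arithmetic above can be checked with a small model.  This is
  only a sketch: tagged words are modeled as plain integers, every function
  and variable name is hypothetical, and nothing here is taken from an
  actual implementation.  It assumes the layout shown in the post, with a
  5-bit character tag of 00110 and a 2-bit fixnum tag of 00.

```lisp
;; Sketch only: all names are hypothetical, and tagged words are modeled
;; as ordinary integers rather than real machine words.
(defparameter *char-tag* #b00110)    ; low five bits of a character word
(defparameter *char-tag-width* 5)
(defparameter *fixnum-tag-width* 2)  ; fixnum words end in the bits 00

(defun surrogate-scheme-code-count ()
  "Number of codes reachable with the BMP plus surrogate pairs."
  (+ (expt 2 20) (- (expt 2 16) 1024 1024)))

(defun make-char-word (code)
  "Build the tagged word for character code CODE (0 <= CODE < 2^21)."
  (logior (ash code *char-tag-width*) *char-tag*))

(defun char-word-to-fixnum-word (word)
  "Model of char-code: a right shift by three bits.
The tag 00110 shifted right by three leaves 00, the fixnum tag."
  (ash word (- *fixnum-tag-width* *char-tag-width*)))

(defun fixnum-word-to-char-word (fixnum-word)
  "Model of code-char: a left shift by three bits, then OR in the tag."
  (logior (ash fixnum-word (- *char-tag-width* *fixnum-tag-width*))
          *char-tag*))

(defun fixnum-word-value (fixnum-word)
  "Strip the fixnum tag to recover the integer the word stands for."
  (ash fixnum-word (- *fixnum-tag-width*)))
```

  For example, (integer-length (1- (surrogate-scheme-code-count))) gives 21,
  which is why the pattern carries 21 code bits, and
  (fixnum-word-value (char-word-to-fixnum-word (make-char-word 65))) gives
  back 65.  Note that in this model the code-char direction still has to
  OR the tag back in after the shift; whether a real implementation can
  avoid that step depends on details the post does not specify.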
///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.