Subject: Re: Questions about Symbolics lisp machines
From: Erik Naggum <erik@naggum.net>
Date: Mon, 25 Mar 2002 09:26:15 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3226037188416051@naggum.net>

* Thomas Bushnell, BSG
| Consider the hair and pain involved in making Unicode work on GNU/Linux
| systems with UTF-8.  This is the easiest way to go, and even so it's lots
| of work converting a jillion applications to work right.  And this is
| because "character stream" is *not* a well defined concept; Unix
| historically only has ASCII character streams, and from this comes a
| giant problem.

  No, this is an important mistake.  Unix has a well-defined concept of an
  "octet stream".  This is _never_ what you really want.  On top of this
  "octet stream" Unix has, via C's lack of a real character type, given its
  users the notion that a _character_ is just the same as a small integer
  that happens to fit in an octet.

  All of this is unfortunately wrong.  A character and its encoding are
  different concepts.  An encoding and its (external) representation are
  different concepts.  An external representation and the numeric values of
  whatever unit it is made up of are different concepts.  By conflating all
  four concepts into one, Unix has held text processing and computing in
  general back several decades.  This is fairly ironic, since Unix started
  out as a text processing vehicle.

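  As an illustrative sketch only (Python, not part of the original post),
  the four concepts can be pulled apart for a single character.  The choice
  of the character and all variable names are assumptions made for the
  example.

```python
import unicodedata

# One character, chosen arbitrarily for the example.
ch = "\u00e5"

# 1. The character itself: an abstract identity with a name.
name = unicodedata.name(ch)

# 2. Its encoding in a coded character set: a code point (here Unicode).
code_point = ord(ch)

# 3. External representations of that code point; each scheme differs.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
latin1 = ch.encode("iso-8859-1")

# 4. The numeric values of the units (octets) in each representation.
utf8_units = list(utf8)
utf16_units = list(utf16)
latin1_units = list(latin1)

print(name, hex(code_point), utf8_units, utf16_units, latin1_units)
```

  The same single character (len(ch) is 1) comes out as three different
  octet sequences, and only one of them is a single octet.  Code that
  treats "character" and "octet" as the same thing is right in that one
  case only.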
  One result of this character = small number = octet confusion is that
  "variable length" encodings are seriously frightening to Unixoid coders
  (partly because of the observation that you make about all the code that
  deals with octet-stream -> anything-else interpretation).  The same
  confusion has caused such a problem with stable and well-defined standards
  such as ISO 2022 that the IETF was utterly unable to use any existing
  standards for the representation of multi--character-set "documents" and
  "streams", and so had to invent both MIME (extremely crude structured
  objects in mail) and a charset property at an extremely high level, such
  that mixing charsets became extremely verbose and difficult.  This is also
  why some people think Unicode sucks because it may force programmers to
  deal with characters differently than "just assume 16 bits" instead of the
  old "just assume 8 bits".

| Care to guess how many times different argument parser routines there are
| in an average GNU/Linux system?  I pick that one because argument parsing
| is actually, a *total waste*, forced by the use of a "shell"--another
| wasted concept unnecessary in a Real System (like the various lispms had,
| and like Sky [should it ever happen] will have).

  I think you overreact now.  The biggest problem here is that _everything_
  in Unix is an octet stream, even strings, and program arguments are just
  strings.  (The fact that you need to parse the "string" from beginning to
  end to find the in-band terminator (which cannot even be escaped) makes it
  a stream, and the "pointer" you have into a stream to read the current
  position is just like the position in a stream.)  Unix is in fact so
  streams-based that it is nearly _impossible_ to work with structured
  objects.  Everywhere an object wants to go, it has to be marshalled into
  and out of an octet-stream--based external format, both in arguments and
  in pipelines.  It is as if you had to call a function foo like
  (eval (read-from-string (format nil "(foo~@{ ~A~})" <arguments>))).
  Hey, I just reinvented Tcl.

  Of course, every object must have an external representation of _some_
  sort to communicate it to external programs, but marshalling to and from
  an octet stream should preserve the object-ness.  Lisp and things like
  ASN.1 enable the preservation of objectness in marshalling, and many other
  attempts have been made.  But treating everything like a string without
  further adornment or syntax (which Unix shells and, ironically, SGML and
  XML do) is just plain wrong.

  On the other hand, there _are_ times when you want to just copy a file or
  ship it across a network bit by bit, in which case the octet stream might
  seem the only alternative.  This is not really a situation that the user
  or even the (application) programmer needs to be exposed to.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.