Subject: Re: reading/writing bytes smaller than 8 bits?
From: Erik Naggum <erik@naggum.no>
Date: 2000/01/26
Newsgroups: comp.lang.lisp
Message-ID: <3157896071676495@naggum.no>

* "Bruce L. Lambert" <lambertb@uic.edu>
| I just figured 1 byte = 8 bits, therefore 1 (unsigned-byte 4) = 4 bits =
| 0.5 bytes both in Lisp and in a file on disk. Simple, yet erroneous,
| deductive logic on my part.

  byte n. 1. adjacent bits within an _integer_. (The specific number of
  bits can vary from point to point in the program; see the _function_
  BYTE.)
      -- from the Common Lisp the Standard (ANSI X3.226-1994)

| If not the OS, then what system determines how an (unsigned-byte 4) or
| any other object gets written to disk? Does each application make its
| own decisions on this point?

  yes. Unix doesn't know about file contents. it stores only bytes (still
  of no predetermined size). the only common byte size these days is 8,
  but Unix was delivered in the past on machines with 9-bit bytes. (this
  historical fact has even escaped most C programmers.)

| I just was trying to understand the space considerations of some code I'm
| writing. I figured there was a direct mapping from the size of the
| internal data structure to the size of the file.

  there never is. the first tenet of information representation is that
  external and internal data formats are incommensurate concepts. there
  simply is no possible way they could be conflated conceptually. to move
  from external to internal representation, you have to go through a
  process of _reading_ the data, and to move from internal to external
  representation, you have to go through a process of _writing_ the data.
  these processes are non-trivial, programmatically, conceptually, and
  physically. that some languages make them appear simple is to their
  credit, but as always, programming languages are about _pragmatics_, and
  it doesn't make sense to make conceptually complex things complex to use
  and program -- quite the contrary, in fact.
  so the more things that are different look the same to a programmer, the
  higher the likelihood that there is something complex and worthwhile
  going on behind the scenes.

  the second tenet of information representation is that optimizations for
  internal representations are invalid for external representation and
  vice versa. the crucial difference is that data in internal
  representation is always under the control of the exact same software at
  all times, while data in external representation _never_ is. therefore,
  decisions you make when optimizing internal representation exploit the
  fact that you have full control over the resources that are being used
  to represent it, such as actually knowing all the assumptions that might
  be involved, while decisions you make when optimizing external
  representation must yield to the fact that you have no control over the
  resources that are being used to represent it. a corollary is that
  storing any data in raw, memory-like form externally (_including_
  network protocols) is so stupid and incompetent that programmers who do
  it without thinking should be punished under law and required to prove
  that they have understood the simplest concepts of computer design
  before they are ever let near a computer again.

  the third tenet of information representation is that data in _any_
  external representation is _never_ outlived by the code, and that is an
  inherent quality of external representation: the very reason you decided
  to write it out to begin with is the reason it won't go away when the
  code that wrote it did. this fact alone so fundamentally alters the
  parameters of optimization of external representation from internal that
  the only consequence of not heeding it is to wantonly destroy
  information.
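
  the explicit read/write process the tenets above call for can be
  sketched in a few lines. this is only an illustration, written in Python
  rather than Common Lisp, and everything in it is an assumption: the
  names write_counts and read_counts are hypothetical, and the layout (a
  4-octet big-endian count followed by 4-octet big-endian unsigned values)
  is invented for the example. the point is that the external form is
  defined octet by octet and owes nothing to how the integers happen to
  sit in memory.

```python
import struct


def write_counts(values):
    """Encode integers in a fixed external layout (hypothetical format).

    Layout: 4-octet big-endian count, then each value as a 4-octet
    big-endian unsigned integer. Nothing here depends on the in-memory
    representation of the list or of the integers.
    """
    out = bytearray(struct.pack(">I", len(values)))
    for v in values:
        # struct.pack raises struct.error if v does not fit in 32 bits.
        out += struct.pack(">I", v)
    return bytes(out)


def read_counts(data):
    """Decode the layout written by write_counts, validating the length."""
    if len(data) < 4:
        raise ValueError("truncated header")
    (count,) = struct.unpack_from(">I", data, 0)
    expected = 4 + 4 * count
    if len(data) != expected:
        raise ValueError(f"expected {expected} octets, got {len(data)}")
    return [struct.unpack_from(">I", data, 4 + 4 * i)[0] for i in range(count)]
```

  reading back what was written returns an equal list, but only because
  read_counts parses the octets according to the stated layout, not
  because anything was copied out of memory.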

  now, there is one particular software company that profits by breaking
  all possible understanding of information representation, and who makes
  their very living from destroying the value of information previously
  committed to the care of their software. this _started_ through sheer,
  idiotic incompetence on their part, but turned into company policy only
  a few years later: the company mission is now to destroy information
  like no company has ever done before, for the sole purpose of causing
  the customers to continue to pay for software that renders their past
  useless and thus in need of re-creation, data conversion, etc.

| That is, if a thousand-element array of 4-bit bytes takes 512 bytes
| (according to time), and I write the contents of that array to disk, I
| expected to see a 512 byte file. Not so, apparently.

  avoid the temptation to confuse internal with external representation,
  and you will do fine. as soon as you think the two are even remotely
  connected (as such -- going through a read/write process is precisely
  the point showing how they are _not_ connected as such), you lose. big
  time, and the way back to unconfusion is long and hard. witness all the
  C victims who actually think it makes sense to dump memory to files.
  they lose so badly you'd think somebody would learn from it, but no --
  their whole philosophy is to remain braindamaged in the face of danger,
  as that kept them out of the scary, scary task of having to _think_
  about things, not something C programmers are very good at.

  a 1000-element vector of (unsigned-byte 4) may optimally be stored as
  500 bytes on a disk file if you are willing to _know_ what the file is
  like. typically, however, you would include metainformation that is not
  needed once it has been processed and is in memory, versioning
  information, some form of validation clues for the information (array
  bounds, etc), and in all likelihood if you store binary data, some
  compression technique.
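
  to make the arithmetic concrete, here is a rough sketch of such a file
  format, in Python for illustration only. every detail is an assumption
  made up for the example: the "NIB4" signature, the version number, the
  14-octet header layout, and the names pack_nibbles and unpack_nibbles.
  two 4-bit bytes go into each 8-bit octet, high half first, and the
  header carries the version, the element count, and a CRC-32 as the
  validation clue. compression is left out.

```python
import struct
import zlib

MAGIC = b"NIB4"                    # hypothetical file signature
VERSION = 1                        # hypothetical format version
HEADER = struct.Struct(">4sHII")   # magic, version, element count, CRC-32


def pack_nibbles(values):
    """Pack 4-bit integers two per octet, preceded by a small header."""
    payload = bytearray((len(values) + 1) // 2)
    for i, v in enumerate(values):
        if not 0 <= v <= 15:
            raise ValueError(f"element {i} is not an (unsigned-byte 4): {v}")
        # Even indices fill the high half of the octet, odd ones the low half.
        payload[i // 2] |= (v << 4) if i % 2 == 0 else v
    payload = bytes(payload)
    header = HEADER.pack(MAGIC, VERSION, len(values), zlib.crc32(payload))
    return header + payload


def unpack_nibbles(data):
    """Validate the header and checksum, then recover the 4-bit integers."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    magic, version, count, crc = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a nibble file")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    payload = data[HEADER.size:]
    if len(payload) != (count + 1) // 2:
        raise ValueError("payload length does not match element count")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return [
        payload[i // 2] >> 4 if i % 2 == 0 else payload[i // 2] & 0x0F
        for i in range(count)
    ]
```

  with this layout a 1000-element vector comes to 514 octets: 500 of
  payload plus 14 of header, none of which is needed once the data is
  back in memory.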

  many arguments to the contrary notwithstanding, binary representation,
  when it has been made to work, is _incredibly_ bloated compared to what
  people set out believing it will be. making binary representation
  efficient is so hard that most people give up, satisfied with gross
  inefficiency. a Microsoft Word document is the typical example of how
  unintelligently things _can_ be done when the real dunces are let at it.

  you may actually be better off writing your vector of 4-bit bytes out as
  hexadecimal digits, pretending that it is a huge number. Common Lisp
  does not offer you any _huge_ help in hacking such notations back to
  more common internal representations, but actually trying to time the
  work you do during I/O has left many programmers bewildered by the cost
  of such a simple operation. a disk read can easily take millions of
  times longer than a memory read. whether you decode digits in that time
  or map bits directly into memory is completely irrelevant to the system
  performance. the conditions under which these low-level things matter
  are so different from what normal people experience that they will have
  studied the topic of internal and external representation very carefully
  before they need to know the details.

| Allegro is my vendor.

  well, no, Franz Inc is the vendor behind Allegro CL.

#:Erik
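
  the hexadecimal-digit approach mentioned above might look something like
  the following. it is a sketch in Python, for illustration only, and the
  names to_hex and from_hex are hypothetical. each 4-bit byte becomes one
  hexadecimal digit, and reading goes through one huge integer, from which
  the 4-bit bytes are peeled off again. the digit count is taken from the
  text itself, so leading zero elements survive the round trip.

```python
import string


def to_hex(values):
    """Write 4-bit integers as one hexadecimal digit each."""
    for i, v in enumerate(values):
        if not 0 <= v <= 15:
            raise ValueError(f"element {i} is not an (unsigned-byte 4): {v}")
    return "".join(format(v, "x") for v in values)


def from_hex(text):
    """Read the digits back as one huge integer, then split it up again."""
    text = text.strip()
    if any(ch not in string.hexdigits for ch in text):
        raise ValueError("not a string of hexadecimal digits")
    number = int(text, 16) if text else 0   # the whole vector as one integer
    # Peel 4-bit bytes off the low end, then reverse to restore the order.
    values = [(number >> (4 * i)) & 0xF for i in range(len(text))]
    values.reverse()
    return values
```

  this costs one character per element, twice the size of the packed
  form, in exchange for a file that can be inspected with any text tool.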