Subject: Re: reading/writing bytes smaller than 8 bits?
From: Erik Naggum <erik@naggum.no>
Date: 2000/01/26
Newsgroups: comp.lang.lisp
Message-ID: <3157896071676495@naggum.no>

* "Bruce L. Lambert" <lambertb@uic.edu>
| I just figured 1 byte = 8 bits, therefore 1 (unsigned-byte 4) = 4 bits =
| 0.5 bytes both in Lisp and in a file on disk. Simple, yet erroneous,
| deductive logic on my part.

byte n. 1. adjacent bits within an _integer_. (The specific number of bits
can vary from point to point in the program; see the _function_ BYTE.)
                         -- from the Common Lisp standard (ANSI X3.226-1994)
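
  for instance, a byte specifier made with the function BYTE just names a
  field of bits within an integer, which LDB and DPB then extract from and
  deposit into:

    (ldb (byte 4 0) #xAB)        => 11    ; the low four bits, #xB
    (ldb (byte 4 4) #xAB)        => 10    ; the next four bits, #xA
    (dpb #xF (byte 4 4) #xAB)    => 251   ; i.e., #xFB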

| If not the OS, then what system determines how an (unsigned-byte 4) or
| any other object gets written to disk?  Does each application make its
| own decisions on this point?

  yes.  Unix doesn't know about file contents.  it stores only bytes (still
  of no predetermined size).  the only common byte size these days is 8,
  but Unix was delivered in the past on machines with 9-bit bytes.  (this
  historical fact has even escaped most C programmers.)
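
  e.g., you can ask a Common Lisp for a stream of 4-bit bytes directly (a
  sketch, assuming your implementation supports this element type; the
  file name is made up):

    ;; whether these 4-bit bytes land on disk packed two per octet, one
    ;; per octet, or in some other arrangement is the Lisp's decision,
    ;; not the OS's -- Unix only ever sees a sequence of octets.
    (with-open-file (out "nibbles.dat" :direction :output
                         :element-type '(unsigned-byte 4)
                         :if-exists :supersede)
      (dolist (n '(1 2 3 15))
        (write-byte n out)))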

| I just was trying to understand the space considerations of some code I'm
| writing.  I figured there was a direct mapping from the size of the
| internal data structure to the size of the file.

  there never is.

  the first tenet of information representation is that external and
  internal data formats are incommensurate concepts.  there simply is no
  possible way they could be conflated conceptually.  to move from external
  to internal representation, you have to go through a process of _reading_
  the data, and to move from internal to external representation, you have
  to go through a process of _writing_ the data.  these processes are
  non-trivial, programmatically, conceptually, and physically.  that some
  languages make them appear simple is to their credit, but as always,
  programming languages are about _pragmatics_, and it doesn't make sense
  to make conceptually complex things complex to use and program -- quite
  the contrary, in fact.  so the more things that are different look the
  same to a programmer, the higher the likelihood that there is something
  complex and worthwhile going on behind the scenes.

  the second tenet of information representation is that optimizations for
  internal representations are invalid for external representation and vice
  versa.  the crucial difference is that data in internal representation is
  always under the control of the exact same software at all times, while
  data in external representation _never_ is.  therefore, decisions you
  make when optimizing internal representation exploit the fact that you
  have full control over the resources that are being used to represent it,
  such as actually knowing all the assumptions that might be involved,
  while decisions you make when optimizing external representation must
  yield to the fact that you have no control over the resources that are
  being used to represent it.  a corollary is that storing any data in raw,
  memory-like form externally (_including_ network protocols) is so stupid
  and incompetent that programmers who do it without thinking should be
  punished under law and required to prove that they have understood the
  simplest concepts of computer design before they are ever let near a
  computer again.

  the third tenet of information representation is that data in _any_
  external representation is _never_ outlived by the code, and that is an
  inherent quality of external representation: the very reason you decided
  to write it out to begin with is the reason it won't go away when the
  code that wrote it did.  this fact alone so fundamentally alters the
  parameters of optimization of external representation from internal that
  the only consequence of not heeding it is to wantonly destroy information.

  now, there is one particular software company that profits by breaking
  all possible understanding of information representation, and who makes
  their very living from destroying the value of information previously
  committed to the care of their software.  this _started_ through sheer,
  idiotic incompetence on their part, but turned into company policy only a
  few years later: the company mission is now to destroy information like
  no company has ever done before, for the sole purpose of causing the
  customers to continue to pay for software that renders their past useless
  and thus in need of re-creation, data conversion, etc.

| That is, if a thousand-element array of 4-bit bytes takes 512 bytes
| (according to time), and I write the contents of that array to disk, I
| expected to see a 512 byte file.  Not so, apparently.

  avoid the temptation to confuse internal with external representation,
  and you will do fine.  as soon as you think the two are even remotely
  connected (as such -- going through a read/write process is precisely the
  point showing how they are _not_ connected as such), you lose.  big time,
  and the way back to unconfusion is long and hard.  witness all the C
  victims who actually think it makes sense to dump memory to files.  they
  lose so badly you'd think somebody would learn from it, but no -- their
  whole philosophy is to remain braindamaged in the face of danger, as that
  kept them out of the scary, scary task of having to _think_ about things,
  not something C programmers are very good at.

  a 1000-element vector of (unsigned-byte 4) may optimally be stored as 500
  bytes on a disk file if you are willing to _know_ what the file is like.
  typically, however, you would include metainformation that is not needed
  once it has been processed and is in memory: versioning information, some
  form of validation clues for the information (array bounds, etc), and, if
  you store binary data, in all likelihood some compression technique.
  many arguments to the contrary notwithstanding, binary representation,
  when it has been made to work, is _incredibly_ bloated compared to what
  people set out believing it will be.  making binary representation
  efficient is so hard that most people give up, satisfied with gross
  inefficiency.  a Microsoft Word document is the typical example of how
  unintelligently things _can_ be done when the real dunces are let at it.
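
  to make the earlier 500-byte figure concrete, here is a sketch (untested,
  names made up) of one such format decision: a naive 16-bit element count
  up front as metainformation, then two 4-bit elements per octet:

    (defun write-nibble-vector (vector path)
      (with-open-file (out path :direction :output
                           :element-type '(unsigned-byte 8)
                           :if-exists :supersede)
        ;; metainformation first: the element count as two octets, high
        ;; octet first, so a reader need not guess the array bounds.
        (write-byte (ldb (byte 8 8) (length vector)) out)
        (write-byte (ldb (byte 8 0) (length vector)) out)
        ;; the data proper: two 4-bit elements per octet; a trailing odd
        ;; element gets a zero in the low half.
        (loop for i from 0 below (length vector) by 2
              do (write-byte (dpb (aref vector i) (byte 4 4)
                                  (if (< (1+ i) (length vector))
                                      (aref vector (1+ i))
                                      0))
                             out))))

  for 1000 elements that comes to 502 octets; the two extra octets buy a
  file the reader can validate instead of one it must take on faith.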

  you may actually be better off writing your vector of 4-bit bytes out as
  hexadecimal digits, pretending that it is a huge number.  Common Lisp
  does not offer you any _huge_ help in hacking such notations back to more
  common internal representations, but actually trying to time the work you
  do during I/O has left many programmers bewildered by the cost of such a
  simple operation.  a disk read can easily take millions of times longer
  than a memory read.  whether you decode digits in that time or map bits
  directly into memory is completely irrelevant to the system performance.
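
  a sketch of the hexadecimal approach (untested, names made up): one
  digit per element going out, one DIGIT-CHAR-P per digit coming back in,
  and the decoding work drowns in the cost of the disk access itself:

    (defun write-nibbles-as-hex (vector stream)
      ;; one hexadecimal digit per 4-bit element, as plain characters
      (loop for n across vector
            do (format stream "~X" n)))

    (defun read-nibbles-as-hex (string)
      ;; DIGIT-CHAR-P with radix 16 undoes what ~X did, digit by digit
      (map '(vector (unsigned-byte 4))
           (lambda (c) (digit-char-p c 16))
           string))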

  the conditions under which these low-level things matter are so different
  from what normal people experience that they will have studied the topic
  of internal and external representation very carefully before they need
  to know the details.

| Allegro is my vendor.

  well, no, Franz Inc is the vendor behind Allegro CL.

#:Erik