Subject: Re: XML->sexpr ideas
From: Erik Naggum <erik@naggum.no>
Date: 19 Jan 2004 12:24:42 +0000
Newsgroups: comp.lang.lisp
Message-ID: <3283503882585065KL2065E_-_@naggum.no>

* Kenny Tilton
| Of course it is easy enough for me to come up with a sexpr format off
| the top of my head, but I seem to recall someone (Erik? Tim? Other?)
| saying they had done some work on a formal approach to an alternative
| to XML/HTML/whatever.
| 
| True that? If so, I am all ears.

  Really?  You are?  Maybe I didn't survive 2003 and this is some Hell
  where people have to do eternal penance, and now I get to do SGML all
  over again.

  Much processing of SGML-like data appears to be stream-like and will
  therefore appear to be equivalent to an in-order traversal of a tree,
  which can therefore be represented with cons cells while the traverser
  maintains its own backward links elsewhere, but this is misleading.

  The amount of work and memory required to maintain the proper backward
  links and to make the right decisions is found in real applications to
  balloon and to cause random hacks; the query languages reflect this
  complexity.  Ease of access to the parent element is crucial to the
  decision-making process, so if one wants to use a simple list to keep
  track of this, the most natural thing is to create a list of the
  element type, the parent, and the contents, such that each element has
  the form (type parent . contents), but this has the annoying property
  that moving from a particular element to the next can only be done by
  remembering the position of the current element in a list, just as one
  cannot move to the next element in a list unless you keep the cons
  cell around.  However, the whole point of this exercise is to be able
  to keep only one pointer around.  So the contents of an element must
  have the form (type parent contents . tail) if it has element contents
  or simply a list of objects, or just the object if simple enough.

  Example: <foo>123</foo> would thus be represented by (foo nil "123"),
  <foo>123</foo><bar>456</bar> by (foo nil "123" bar nil "456"), and
  <zot><foo>123</foo><bar>456</bar></zot> by #1=(zot nil (foo #1# "123"
  bar #1# "456")).

  Navigation inside this kind of structure is easy: When the contents in
  CADDR is exhausted, the CDDDR is the next element, or if NIL, we have
  exhausted the contents of the parent and move up to the CADR and look
  for its next element, etc.  All the important edges of the containers
  that make up the *ML document are easily detectible and the operations
  that are usually found at the edges are normally tied to the element
  type (or as modified by its parents), are easily computable.  However,
  using a list for this is cumbersome, so I cooked up the «quad».  The
  «quad» is devoid of any intrinsic meaning because it is intended to be
  a general data structure, so I looked for the best meaningless names
  for the slots/accessors, and decided on QAR, QBR, QCR, and QDR.  The
  quad points to the element type (like the operator in a sexpr) in the
  QAR, the parent (or back) quad in the QBR, the contents of the element
  in the QCR, and the usual pointer to the next quad in the QDR.

  Since the intent with this model is to «load» SGML/XML/SALT documents
  into memory, one important issue is how to represent long stretches of
  character content or binary content.  The quad can easily be used to
  represent a (sequence of) entity fragments, with the source in QAR,
  the start position in QBR, and the end position in QCR, thereby using
  a minimum of memory for the contents.  Since very large documents are
  intended to be loaded into memory, this property is central to the
  ability to search only selected elements for their contents -- most
  searching processors today parse the entire entity structure and do
  very little to maintain the parsed element structure.

  Speaking of memory, one simple and efficient way to implement the quad
  on systems that lack the ability to add native types without overhead,
  is to use a two-dimensional array with a second dimension of 4 and let
  quad pointers be integers, which is friendly to garbage collection and
  is unambiguous when the quad is used in the way explained above.

  Maybe I'll talk about SALT some other day.

-- 
Erik Naggum | Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.