Subject: Re: HTML parser?
From: rpw3@rigden.engr.sgi.com (Rob Warnock)
Date: 1997/12/12
Newsgroups: comp.lang.scheme
Message-ID: <66qisd$120oj@fido.asd.sgi.com>

Christopher B. Browne <cbbrowne@hex.net> wrote:
+---------------
| Has anyone written an HTML parser in Scheme?
+---------------

You might be able to find some stuff in either the SIOD distribution
<URL:http://people.delphi.com/gjc/siod.html> or in the Scheme
Underground web server code -- code, docs, &c. can be found via
<URL:http://www-swiss.ai.mit.edu/scsh/contrib/net/sunet.html>.

+---------------
| I'd like to take an HTML document presumably in the form of a big
| string and turn it into a list that might look something like:
+---------------

The biggest problem you're likely to encounter is that many (if not most!)
of the HTML composition tools (not to mention manually edited pages)
*violate* proper HTML nesting of opening & closing tags -- and as a
result most of the popular browsers have been coded to be somewhat tolerant
of same. For example, *many* times I have seen the following or similar:

	...<I><B>some bold-italic text</I></B>...
or:
	...go <I><A HREF="somewhere">here</I></A> for more...

If you're going to try to coerce existing HTML into a clean nested form,
you're simply going to have to be tolerant of such silliness, and will
need various (necessarily imperfect) heuristics for how to treat mis-nested
HTML. Sorry 'bout that.
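
As a minimal sketch of one such heuristic (an illustration only, and
certainly not the only reasonable choice): when a closing tag does not
match the innermost open element, implicitly close the intervening
elements, and silently drop any closing tag that matches nothing. The
token format here is hypothetical: a string is text, (open i) stands
for <I>, and (close i) for </I>. The name nest-tokens is made up too.

```scheme
;; Turn a flat token list into nested (tag child ...) forms.
;; Assumed token format: "text", (open tag), (close tag).
(define (nest-tokens tokens)
  ;; The stack is a list of frames, innermost first.  A frame is
  ;; (tag . reversed-children); the bottom frame has the dummy tag *top*.
  (define (add-child stack child)
    (cons (cons (caar stack) (cons child (cdar stack)))
          (cdr stack)))
  (define (close-top stack)
    ;; Finish the innermost element and attach it to its parent.
    (let ((frame (car stack)))
      (add-child (cdr stack)
                 (cons (car frame) (reverse (cdr frame))))))
  (define (open? stack tag)
    ;; Is there an open element named tag?  Never matches *top*.
    (and (pair? (cdr stack))
         (or (eq? (caar stack) tag)
             (open? (cdr stack) tag))))
  (define (close-through stack tag)
    ;; Close elements up to and including the nearest open tag.
    (let ((done? (eq? (caar stack) tag))
          (rest (close-top stack)))
      (if done? rest (close-through rest tag))))
  (define (close-all stack)
    ;; At end of input, close whatever is still open.
    (if (null? (cdr stack))
        (reverse (cdar stack))
        (close-all (close-top stack))))
  (let loop ((tokens tokens) (stack (list (list '*top*))))
    (cond ((null? tokens) (close-all stack))
          ((string? (car tokens))
           (loop (cdr tokens) (add-child stack (car tokens))))
          ((eq? (caar tokens) 'open)
           (loop (cdr tokens) (cons (list (cadar tokens)) stack)))
          ((open? stack (cadar tokens))
           (loop (cdr tokens) (close-through stack (cadar tokens))))
          (else (loop (cdr tokens) stack)))))   ; stray close tag: drop it

(nest-tokens '("..." (open i) (open b) "some bold-italic text"
               (close i) (close b) "..."))
;; => ("..." (i (b "some bold-italic text")) "...")
```

Note what this loses: the </I> closes the <B> early, so any text between
the </I> and the stray </B> would come out un-bolded. A browser-like
heuristic might instead re-open <B> after the </I>.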

+---------------
| (html (head (title "My dumb web page"))
|       (body
|         (h1 "Main title")
|         (ul
|           (li "A list of items")
|           (li "An item with a"
|               (ahref "http://www.hex.net/~cbbrowne"
|                      "Reference to My Home Page")))))
+---------------

Unfortunately, this example is a little too simple. For one thing, I don't
think you're really going to want to collapse "<A HREF=...>" into "ahref" --
what about the *other* possible attributes inside the <A>? How would you
code <A NAME=...> or even both in the same anchor? [It's theoretically legal.]
And there are several other attributes that can be included in an <A> with
an HREF (such as REL and REV). Myself, I'd probably use a "let-like" syntax
for the attributes, that is, a list of attribute/value pairs:

        (a ((href "http://www.hex.net/~cbbrowne")
            (name "para3")
            (rev "http://some.other.place/there")
            (rel first))
           "Reference to My Home Page")

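One payoff of the let-like form is that the attribute list is an
ordinary association list, so the standard assq does the lookup. A small
sketch, assuming every element has the shape (tag ((attr value) ...)
child ...) with a possibly empty attribute list; the accessor names are
made up for illustration:

```scheme
;; Assumed element shape: (tag ((attr value) ...) child ...)
(define (element-attributes elt) (cadr elt))
(define (element-children elt) (cddr elt))

(define (attribute-ref elt name)
  ;; Returns the attribute's value, or #f if the attribute is absent.
  (let ((entry (assq name (element-attributes elt))))
    (and entry (cadr entry))))

(define anchor
  '(a ((href "http://www.hex.net/~cbbrowne")
       (name "para3"))
      "Reference to My Home Page"))

(attribute-ref anchor 'href)    ; => "http://www.hex.net/~cbbrowne"
(attribute-ref anchor 'rel)     ; => #f
(element-children anchor)       ; => ("Reference to My Home Page")
```
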
And similarly for the HEAD & META, though here, since there is no untagged
text allowed [is that true?], you may be able to omit the outer parens:

	(head (title "My dumb web page")
	      (base (href "http://www.hex.net/~cbbrowne"))
	      (meta (name "description")
		    (content "a typical demo page of HTML stuff"))
	      (meta (http-equiv "expires")
		    (content "Sun, 28 Dec 1997 09:32:45 GMT")))

And finally, if you're parsing documents that are already created,
don't forget to define some place to put the <!DOCTYPE> info, either at
a higher level than (html ...) or as a required first element. That is:

    (sgml (doctype html public "-//IETF//DTD HTML 2.0 Strict Level 2//EN")
      (html (head ...)
	    (body ...)))

or, if you never use your database for anything other than HTML:

    (html (doctype html public "-//IETF//DTD HTML 2.0 Strict Level 2//EN")
	  (head ...)
	  (body ...))

or even:

    (html2.0
	  (head ...)
	  (body ...))

The first form feels more "correct", somehow. (Or the third, which at
least could later be macro-expanded into the first form, if needed.)
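
A sketch of that expansion, done here as a plain procedure on the list
rather than as a real macro. It assumes that html2.0 stands for the
DOCTYPE string used in the examples above; the names expand-html2.0 and
html-2.0-doctype are made up:

```scheme
;; Assumption: the html2.0 tag implies this particular DOCTYPE.
(define html-2.0-doctype
  '(doctype html public "-//IETF//DTD HTML 2.0 Strict Level 2//EN"))

(define (expand-html2.0 doc)
  ;; (html2.0 child ...) => (sgml (doctype ...) (html child ...))
  ;; Any other form is returned unchanged.
  (if (and (pair? doc) (eq? (car doc) 'html2.0))
      (list 'sgml html-2.0-doctype (cons 'html (cdr doc)))
      doc))

(expand-html2.0 '(html2.0 (head (title "My dumb web page")) (body)))
;; => (sgml (doctype html public "-//IETF//DTD HTML 2.0 Strict Level 2//EN")
;;      (html (head (title "My dumb web page")) (body)))
```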


-Rob

p.s. Are HTTP documents of "Content-type: text/html" which have no leading
     <!DOCTYPE ...> assumed to be HTML 1.0?  Or some value even lower...?

-----
Rob Warnock, 7L-551		rpw3@sgi.com   http://reality.sgi.com/rpw3/
Silicon Graphics, Inc.		Phone: 650-933-1673 [New area code!]
2011 N. Shoreline Blvd.		FAX: 650-933-4392
Mountain View, CA  94043	PP-ASEL-IA