Christopher B. Browne <cbbrowne@hex.net> wrote:
+---------------
| Has anyone written an HTML parser in Scheme?
+---------------
You might be able to find some stuff in either the SIOD distribution
<URL:http://people.delphi.com/gjc/siod.html> or in the Scheme
Underground web server code -- code, docs, &c. can be found via
<URL:http://www-swiss.ai.mit.edu/scsh/contrib/net/sunet.html>.
+---------------
| I'd like to take an HTML document presumably in the form of a big
| string and turn it into a list that might look something like:
+---------------
The biggest problem you're likely to encounter is that many (if not most!)
of the HTML composition tools (not to mention manually edited pages)
*violate* proper HTML nesting of opening & closing tags -- and as a
result most of the popular browsers have been coded to be somewhat tolerant
of same. For example, *many* times I have seen the following or similar:
    ...<I><B>some bold-italic text</I></B>...
or:
    ...go <I><A HREF="somewhere">here</I></A> for more...
If you're going to try to coerce existing HTML into a clean nested form,
you're simply going to have to be tolerant of such silliness, and will
need various (necessarily imperfect) heuristics for how to treat mis-nested
HTML. Sorry 'bout that.
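One possible heuristic, sketched in Python purely as an illustration (it is
not taken from SIOD or the Scheme Underground code): on a closing tag, close
the nearest open element of that name together with anything still open
inside it, and silently drop a closing tag that matches nothing. The names
TolerantTreeBuilder and parse_tolerant are made up for this sketch; the only
library assumed is Python's standard html.parser. Void elements such as <BR>
and <IMG> are not special-cased here, so real pages would need more rules.

```python
from html.parser import HTMLParser


class TolerantTreeBuilder(HTMLParser):
    """Builds nested lists of the form [tag, attrs, child, ...].

    Hypothetical helper for illustration; attrs is a list of [name, value].
    """

    def __init__(self):
        super().__init__()
        self.root = ["document", []]
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = [tag, [list(pair) for pair in attrs]]
        self.stack[-1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Search from the innermost open element outward (never the root).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                # Close the match and everything opened inside it.
                del self.stack[i:]
                return
        # No open element has this name: ignore the stray closing tag.

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.stack[-1].append(data)


def parse_tolerant(text):
    builder = TolerantTreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


print(parse_tolerant("<I><B>some bold-italic text</I></B>"))
```

Under this rule the </I> closes both the <I> and the <B> nested inside it,
and the trailing </B> is discarded, so the text ends up as a B inside an I.
Other choices (e.g. re-opening the <B> after the </I>) are just as defensible.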
+---------------
| (html (head (title "My dumb web page"))
|       (body
|         (h1 "Main title")
|         (ul
|           (li "A list of items")
|           (li "An item with a"
|               (ahref "http://www.hex.net/~cbbrowne"
|                      "Reference to My Home Page")))))
+---------------
Unfortunately, this example is a little too simple. For example, I don't
think you're really going to want to collapse "<A HREF=...>" into "ahref" --
what about the *other* possible attributes inside the <A>? How would you
code <A NAME=...> or even both in the same anchor? [It's theoretically legal.]
And there are several other attributes that can be included in an <A> with
an HREF (such as REL and REV). Myself, I'd probably use a "let-like" syntax
for the attributes, that is, a list of attribute/value pairs:
    (a ((href "http://www.hex.net/~cbbrowne")
        (name "para3")
        (rev "http://some.other.place/there")
        (rel first))
       "Reference to My Home Page")
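To make the shape concrete, here is a small Python sketch that prints an
element in that let-like form. It is only an illustration under some
assumptions of mine: an element is a (tag, attrs, children) tuple, attrs is
a list of (name, value) string pairs, every element gets an attribute list
(empty if it has none) so that the second position is never ambiguous, and
every value is written as a quoted string (unlike the bare "first" above).
The function names quote and to_sexpr are made up for the sketch.

```python
def quote(text):
    """Render text as a double-quoted string, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_sexpr(node):
    """Render a text string or a (tag, attrs, children) tuple as an S-expression."""
    if isinstance(node, str):
        return quote(node)
    tag, attrs, children = node
    # Attributes always come first, as a list of (name "value") pairs.
    attr_list = "(" + " ".join(f"({name} {quote(value)})" for name, value in attrs) + ")"
    parts = [tag, attr_list] + [to_sexpr(child) for child in children]
    return "(" + " ".join(parts) + ")"


anchor = ("a",
          [("href", "http://www.hex.net/~cbbrowne"), ("name", "para3")],
          ["Reference to My Home Page"])
print(to_sexpr(("li", [], ["An item with a", anchor])))
```

Always emitting the attribute list costs a "()" on plain elements, but it
means a reader of the list never has to guess whether the second item is an
attribute list or the first child.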
And similarly for the HEAD & META, though here since there is no untagged
text allowed [is that true?] you may be able to omit the outer parens:
    (head (title "My dumb web page")
          (base (href "http://www.hex.net/~cbbrowne"))
          (meta (name "description")
                (content "a typical demo page of HTML stuff"))
          (meta (http-equiv "expires")
                (content "Sun, 28 Dec 1997 09:32:45 GMT")))
And finally, if you're parsing documents that are already created,
don't forget to define some place to put the <!DOCTYPE> info, either at
a higher level than (html ...) or as a required first element. That is:
    (sgml (doctype html public "-//IETF//DTD HTML 2.0 Strict Level 2//EN")
          (html (head ...)
                (body ...)))
or, if you never use your database for anything else than HTML:
    (html (doctype html public "-//IETF//DTD HTML 2.0 Strict Level 2//EN")
          (head ...)
          (body ...))
or even:
    (html2.0
      (head ...)
      (body ...))
The first form feels more "correct", somehow. (Or the third, which at
least could later be macro-expanded into the first form, if needed.)
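As a rough illustration of the first form, the Python sketch below pulls the
<!DOCTYPE> out of a page and wraps an already-built (html ...) string in an
(sgml ...) list. It relies on the standard html.parser module, whose
handle_decl hook receives the text between "<!" and ">". Everything else is
an assumption of mine: the names DoctypeFinder, doctype_sexpr and wrap_sgml,
the choice to lower-case bare words while keeping double-quoted literals
verbatim, and the choice to simply omit the (doctype ...) item when the page
has none. Single-quoted literals are not handled.

```python
import re
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remembers the first <!DOCTYPE ...> declaration seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. 'DOCTYPE HTML PUBLIC "..."'.
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def doctype_sexpr(text):
    """Return '(doctype ...)' for the page's DOCTYPE, or None if it has none."""
    finder = DoctypeFinder()
    finder.feed(text)
    finder.close()
    if finder.doctype is None:
        return None
    # Split into double-quoted literals and bare words.
    tokens = re.findall(r'"[^"]*"|\S+', finder.doctype)
    items = [tok if tok.startswith('"') else tok.lower() for tok in tokens]
    return "(" + " ".join(items) + ")"


def wrap_sgml(text, html_sexpr):
    """Wrap an (html ...) string in (sgml ...), with the doctype first if present."""
    doctype = doctype_sexpr(text)
    parts = ["sgml"] + ([doctype] if doctype else []) + [html_sexpr]
    return "(" + " ".join(parts) + ")"


page = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict Level 2//EN"><HTML></HTML>'
print(wrap_sgml(page, "(html (head ...) (body ...))"))
```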
-Rob
p.s. Are HTTP documents of "Content-type: text/html" which have no leading
<!DOCTYPE ...> assumed to be HTML 1.0? Or some value even lower...?
-----
Rob Warnock, 7L-551 rpw3@sgi.com http://reality.sgi.com/rpw3/
Silicon Graphics, Inc. Phone: 650-933-1673 [New area code!]
2011 N. Shoreline Blvd. FAX: 650-933-4392
Mountain View, CA 94043 PP-ASEL-IA