Subject: Re: cl-webapi (was Re: Where to start)
From: rpw3@rpw3.org (Rob Warnock)
Date: Wed, 20 Aug 2003 09:01:15 -0500
Newsgroups: comp.lang.lisp
Message-ID: <EtSdnQx1PJe24t6iXTWc-w@speakeasy.net>
Daniel Barlow  <dan@telent.net> wrote:
+---------------
| rpw3@rpw3.org (Rob Warnock) writes:
| > Initially it seemed rather complex to me, all those different "stages",
| > and the complexity of the URI parsing seemed a bit over-the-top, but
| > I dunno, maybe there are cases where it's needed.
| 
| All you need for "hello world" in  Araneida (ignoring the export-server
| stuff, which I freely admit is messy) is 
| 
| (defclass hello-handler (handler) ())
| (defmethod handle-request-response ((h hello-handler) (method (eql :get))
|                                     request)
|   (request-send-headers request)
|   (format (request-stream request) "<title>hello</title><p>Hello world~%")
|   t)
|   
| (install-handler *root-handler* (make-instance 'hello-handler) 
|                  "http://localhost/hello.html")
+---------------

For handlers which are loaded into the server at init time, mine isn't
all that much different, actually, except for not using CLOS as much:

    (defun hello-page (request)
      (with-html-output (s (http-request-stream request) t)
	"Content-Type: text/html" (lfd) (lfd)
	(:html ()
	  (:head () (:title () "Hello")) (lfd)
	  (:body ()
	    (:h1 () "Hello") (lfd)
	    "Hello, world!" (lfd)))
	(lfd)))

    (register-uri-handler :path "/lhp/hello" :kind :exact
			  :function 'hello-page)


If you should reload the file containing that, REGISTER-URI-HANDLER
is careful to re-use the handler entry if the :PATH & :KIND are equal
to the previous one (just overlaying the function in that case).
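
Schematically, the re-use logic amounts to something like this (a
simplified sketch; the ENTRY struct and the *URI-HANDLERS* list here
are just stand-ins for the real data structures):

    (defstruct entry path kind fn)          ; stand-in registry record

    (defvar *uri-handlers* '())             ; stand-in registry

    (defun register-uri-handler (&key path kind ((:function fn)))
      (let ((old (find-if (lambda (e)
                            (and (equal (entry-path e) path)
                                 (eq (entry-kind e) kind)))
                          *uri-handlers*)))
        (if old
            (setf (entry-fn old) fn)        ; reload case: overlay in place
            (push (make-entry :path path :kind kind :fn fn)
                  *uri-handlers*))))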

[And as noted last time, I don't yet handle virtual hosts correctly.
I don't want to put the virtual host explicitly into a URI passed to
the registration routine the way you do above, since I want a given
".lhp" file (and its cached code) to be able to work on multiple virtual
hosts that share a common $DOCUMENT_ROOT, e.g., "127.0.0.1", "localhost",
"rpw3.org", and "fast.rpw3.org", but *not* "ns1.rpw3.org" (say). Getting
this right without backparsing "httpd.conf" is a bit tricky.]

A minimal LHP page (demand-loaded from the filesystem & cached)
is even more terse, provided one is happy with the default "200"
response and "text/html":

    (lhp-basic-page ()
      (:html ()
	(:head () (:title () "Hello")) (lfd)
	(:body ()
	  (:h1 () "Hello") (lfd)
	  "Hello, world!" (lfd)))
      (lfd))

Or, an ".lhp" file again, but done without using LHP-BASIC-PAGE
or HTOUT at all (close to what the above expands into, actually):

    (flet ((the-page (request)		; FLET avoids package pollution.
	     (let ((s (http-request-stream request)))
	       (format s "Content-Type: text/html~%~%~
			 <html><head><title>hello</title></head>~%~
			 <body><h1>hello</h1>~%~
			       <p>Hello world~%~
			 </body></html>~%"))))
      (lhp:lhp-set-page-function #'the-page))

+---------------
| > Plus... it only handled :EXACT and :PREFIX matches, and I needed
| > :SUFFIX, too (for ".lhp" files like the above).
| 
| Suffix matching.  Hmm.  I can see how that'd be useful, although I'm
| not sure what the priority for "best matching handler" would be.  Does
| /cliki/foo.lhp run the cliki handler that it prefix matched, or the
| lhp handler according to the suffix?  Apache, I know, has both, and I
| find that the usual result is to do something appropriate 95% of the
| time and horribly confuse people in the remaining 5%
+---------------

Well, my decision was that an exact match with an :EXACT entry trumps
both the others, but that may have been due to the way I hacked suffix
handlers into the system. (*blush!*) The way they work [well, there
is only one so far!] is to grovel around on disk to see if the indicated
object(s) actually exist[*], and if so and if the load is successful,
register a new :EXACT URI-handler entry and call *that* one. Future hits
find the new exact entry and don't go through the suffix handler at
all. (Nor read the filesystem, except for a quick FILE-WRITE-DATE, unless
it's changed.)

[*] Note: Apache will cheerfully send your handler matching URIs that
    don't have any existence in the file system, at least as long as
    the directories are there. And if "MultiViews" are enabled, the
    directories don't even have to be there (which is how classic CGI
    scripts effectively do "prefix matching" and extract the $PATH_INFO).
    If a file /foo.lhp exists and MultiViews are enabled, Apache will
    match a path /foo/bar/baz (even if directory "bar" doesn't exist!)
    and send the request to your server (or to /foo.cgi, if that's what
    exists). Useful, when you want it, but dangerous when it happens to
    you accidentally.
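
Schematically, the suffix-handler trick comes out roughly like the
sketch below (the path-translation, loading, and page-function helpers
are stand-in names, not the real ones):

    (defun lhp-suffix-handler (request)
      ;; LHP-URI->PATHNAME, LOAD-LHP-FILE, LHP-URI-PATH, and
      ;; LHP-PAGE-FUNCTION are made-up names for the real helpers.
      (let ((file (lhp-uri->pathname request)))
        (cond ((not (probe-file file))
               (not-found-page request))        ; infrastructure 404 page
              (t
               (load-lhp-file file)             ; compile/load & cache it
               ;; Register an :EXACT entry so future hits bypass this
               ;; suffix handler entirely; the exact entry only re-loads
               ;; when FILE-WRITE-DATE says the source has changed.
               (register-uri-handler :path (lhp-uri-path request)
                                     :kind :exact
                                     :function (lhp-page-function file))
               (funcall (lhp-page-function file) request)))))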

As far as prefix versus suffix goes, I have no strong preference. Or
rather, I *do* have a strong opinion that one should not be mixing the
two in the same part of the virtual URI hierarchy. Prefix matching feels
to me like mainly a way to try to trick search engines into indexing
what are really queries under the skin (that is, all of the stuff
following the prefix is really a query encoded into URI space!).
It's fine for that, but in that case it should (IMHO) be off in its
own corner of the virtual URI hierarchy, and *not* share a prefix
with stuff that's really strongly filesystem based. But YMMV...

+---------------
| > The registered function...does whatever it wants to do throughout
| > the various "stages", calling back into the infrastructure for help
| > when needed [such as issuing ... standard HTML error pages].
| 
| Helper functions for errors etc are another area that need addressing.
| Using cliki as an example again, it'd be nice to customize the 404
| page by subclassing something, so that requests for nonexistent cliki
| pages display differently from requests for other random nonexistent
| resources on the same server
+---------------

Again, my stuff may have to evolve in that direction, but so far I've
gotten by with only three (well, four) infrastructure error pages a
handler can call:

    forbidden-page (403)	Page functions might call this on auth errors.
    not-found-page (404)	Default page for the no-handler case.
    internal-error-page (500)	Some CL error condition (other than EPIPE).
    fallback-page (200)		Does a formatted dump of the request.

The first two are exact clones of the corresponding builtin Apache
pages, except for one character so I can tell them apart.  ;-}
The third is not identical to the Apache one, especially as it
reports the CL condition to the user (if possible), but is similar
in tone. The last is not really an error; it gets used for debugging
new request types, e.g., as an "action" of a FORM, say.
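
From a page function's point of view, using them is about as dull as it
sounds -- roughly this (with a made-up auth predicate and page generator,
and assuming the error pages take the request as their argument):

    (defun members-only-page (request)
      (if (authorized-p request)          ; hypothetical auth predicate
          (emit-members-page request)     ; hypothetical page generator
          (forbidden-page request)))      ; the infrastructure 403 page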

The user applications themselves have a bunch of very application-
specific error pages, to be sure, but they all use the application's
standard page templates -- headers, footers, navbars, etc. -- and
explain the error in user terms. So in HTTP terms, these are in fact
not "errors" (and all produce "200 OK" results).

+---------------
| > Can't it be made to run on CMUCL pretty easily?
| 
| It wouldn't be rocket science.  I just say "SBCL" because I don't
| actually test with CMUCL before releasing new versions, so at any
| given time there are probably sbcl package prefixes and things in
| there that need fixing.
+---------------

Yeah, I noticed that. But it didn't seem to be *too* bad that way.
A bit of grunt work, a couple of #+sbcl (progn ...) and #+cmu (progn ...)
here and there, and...  ;-}
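
(The typical shape of such a patch is nothing more exciting than a pair
of reader conditionals around implementation-specific calls, e.g.
something like this made-up wrapper:)

    (defun quit-lisp ()                   ; made up, just to show the pattern
      #+sbcl (sb-ext:quit)
      #+cmu  (ext:quit))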

+---------------
| > That is, the whole issue (mentioned in passing above) of what might be
| > called "dynamic defsystem" issues (that is, when do Lisp sources get
| > loaded/reloaded/recompiled).
| 
| Yes.  I haven't thought at all about dynamic loading at request time,
| but I agree it needs consideration.
+---------------

I may have more of a bug in my ear about that than most, since I
came to the "CL web API" topic from the direction of doing classic
CGI "scripts" in Lisp, rather starting from the CL-HTTP/AllegroServe
"Lisp is the whole world" point of view. So I when I moved from fork/exec
CGI scripting to a persistent server, naturally one of the first things
I wanted was to "make it work like CGI, only better".

+---------------
| What I'm trying to punt on here is how you get from sexp to stream
| (htout, lml, cl-who, etc)
+---------------

That's fine. There are plenty of packages for that, and plenty of
reasons to have & use different ones. The only thing that's really
important for that from the "CL web API" point of view is that the
handlers -- especially suffix handlers -- need to be able to select
the style themselves. That is, you're probably going to see a ".alp"
handler, a ".tml" handler, and a ".lhp" handler doing a bunch of very
similar things (especially the "dynamic defsystem" kinds of things)
and it would be nice if the obviously-common things could be factored
out and made available as part of the infrastructure. But you're also
going to see all three doing some very *different* stuff, and they need
to be allowed the freedom to do that, too.

To me, that argues for a very minimal interface in the infrastructure-
to-handler direction -- basically, call the handler function with *all*
the information about the HTTP request, and little else -- but a fairly
rich interface in the handler-to-infrastructure direction: callbacks,
library routines, standard error pages, some templated pages, etc.
Some handlers may be little more than shims, calling back into the
infrastructure; others may have to nearly duplicate large hunks of
the infrastructure because it doesn't *quite* work for their precise
needs. Both should be possible.
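
Schematically, the infrastructure-to-handler side can be almost nothing --
something like the sketch below, where FIND-HANDLER and
HTTP-REQUEST-URI-PATH are stand-in names for whatever the dispatch-table
lookup and URI accessor turn out to be:

    (defun dispatch-request (request)
      (let ((handler (find-handler (http-request-uri-path request))))
        (if handler
            (funcall handler request)     ; hand over *all* the request info
            (not-found-page request))))   ; infrastructure fallback (404)

All the interesting stuff lives on the other side of that FUNCALL, in
whatever library routines the handler chooses to call back into.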

+---------------
| It seems that given support for a 'suffix' handler, it would be
| possible for the client application programmer to write something that
| would pick up pages at some filesystem location based on the URL, do
| whatever processing is appropriate (compile, load, read,
| read-sequence, whatever) and shove it out to the client .  With the
| addition of some protocol to register the response as eligible for
| caching in the server (and ideally to invalidate the cache, which I
| agree is a problem in Araneida) I think this covers all three content
| creation patterns that you mention - mostly by punting on the details.
| I take refuge in the Unix "mechanism not policy" mantra here.
+---------------

Almost... But... For space/speed tradeoffs, especially to keep the
server from bloating up into holding the *entire* site in memory,
you need several knobs to say at what level the caching occurs, e.g.,
the final HTML, the compiled code, the interpreted code, the pre-
processed page from disk (that is, ".tml" gets parsed and turned into
trees of Lisp source that makes HTOUT calls -- one could usefully stop
there and cache that output as, say, a parallel ".lhp" file on disk,
which in turn could be loaded on demand), or no caching at all, and
when/why cached things expire. Creating an infrastructure that has all
the right knobs so that *any* desired policy is possible is a *huge*
task... and *will* get it "wrong" in some critical respect anyway. So
leave the policy in the individual handlers (applications), and supply
a rich mechanism (infrastructure) so that implementing whatever weird
policy you want is fairly easy for the easy cases, and at least *possible* for
the hard ones.
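
To make "knobs" a little more concrete, here's one minimal shape such a
per-page cache might take -- all of the names and the particular set of
levels are invented for the sketch:

    (defvar *page-cache* (make-hash-table :test #'equal))

    (defstruct cache-entry
      level        ; e.g. one of :html, :compiled-code, :source-tree, :none
      value        ; whatever the chosen level actually holds
      write-date)  ; FILE-WRITE-DATE of the source when it was cached

    (defun cache-valid-p (path-string)
      (let ((e (gethash path-string *page-cache*)))
        (and e
             (probe-file path-string)
             (eql (cache-entry-write-date e)
                  (file-write-date path-string)))))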

[Oops! Sorry, didn't mean to go off on an architectural style rant.
We obviously agree on the notion of policy/mechanism split, but some
policies are more complex than others...]

I keep harping on the phrase "dynamic defsystem" because it seems that's
the kind of thing you're going to want in the infrastructure, *but* with
the "system descriptions" in the applications, handlers, or even the web
page files. For example, I'm currently working on a small project that --
purely for development convenience -- I'm coding as a bunch of little
".lhp" files, but each of them assumes there is a much larger common set
of stuff already installed in the server when they run. So I've ended up
with the following really ugly boilerplate in each ".lhp" file [please!
no flames about string-bashing vs. pathname merging, I *know* it's ugly!]:

    (eval-when (:compile-toplevel :load-toplevel :execute)
      (unless (find-package :org.rpw3.customer-name.project)
	(load (concatenate 'string (directory-namestring
				    (load-time-value *load-pathname*))
			   "pkg")))) ; so LOAD finds either ".lisp" or ".x86f"

    (in-package :org.rpw3.customer-name.project)

    (lhp-basic-page ()
      (ensure-components-present/up-to-date)	; micro-defsystem, sort of.
      ...rest of page logic
	  and HTML generation code...)

Note that the ENSURE-COMPONENTS-PRESENT/UP-TO-DATE call has to be *inside*
the LHP-BASIC-PAGE macro, since if this page has been accessed before, it
is probably cached in the server, and if one of the *other* component
source files has changed we want the accessing of this page (or any other
".lhp" page in the project, whether previously loaded or not) to trigger
a rebuild of the changed component(s).

The "pkg" contains [yes, against much good advice I've heard here!]
both the package definition and a list of the common component libraries
that all the ".lhp" files in the project need loaded (and compiled
and automatically kept up to date), as well as the definition of the 
ENSURE-COMPONENTS-PRESENT/UP-TO-DATE that does the "make" function.
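
The guts of that micro-"make" are nothing fancy; roughly like the sketch
below, where the component list, the project-directory variable, and the
bookkeeping hash table are all stand-ins for what the real "pkg" file does:

    (defvar *project-directory* #p"/usr/local/www/data/lhp-project/") ; made up
    (defvar *components* '("util" "db-layer" "templates"))            ; made up
    (defvar *loaded-dates* (make-hash-table :test #'equal))

    (defun ensure-components-present/up-to-date ()
      (dolist (name *components*)
        (let* ((src  (merge-pathnames (concatenate 'string name ".lisp")
                                      *project-directory*))
               (date (file-write-date src)))
          ;; When nothing has changed, the only cost per component is the
          ;; one FILE-WRITE-DATE call above.
          (unless (eql date (gethash (namestring src) *loaded-dates*))
            (load (compile-file src))
            (setf (gethash (namestring src) *loaded-dates*) date)))))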

Ugly, like I said. But what it means operationally is that I can edit
just about any file [except "pkg.lisp" -- there are some Catch-22's
with updates of it], whether an LHP "web page" or Lisp sources for
the common library code, and the next load or reload of any ".lhp"
web page from my browser will get a consistent, up-to-date set of
compiled code into the server, and then run that ".lhp" page function.
And when things are up-to-date, the only additional latency that's
added to the web page output is one FILE-WRITE-DATE per project
component file [fast, since FreeBSD caches the inodes].

Talk about your "fast prototyping" of a web site...  ;-}  ;-}

Anyway, the point of this somewhat lengthy digression is that what
I'd really like is to get rid of all that boilerplate and just be
able to say in any of the ".lhp" files:

    (lhp-basic-page (:system-description "project_name.asd")   ;or equivalent
      ...rest of page logic
	  and HTML generation code...)

and have all of it work the same way.

Unfortunately, it can't be *quite* that simple, at least with CMUCL,
since it reads whole top-level forms at one time, so the package has
to be already defined and IN-PACKAGE'd before the LHP-BASIC-PAGE form
is read. But it probably *could* be as simple as this:

    (use-system "custy-project.asd")
    (in-package :org.rpw3.customer-name.project)

    (lhp-basic-page ()
      (refresh-system)
      ...rest of page logic
	  and HTML generation code...)

That's only three lines of not-too-messy boilerplate per web page file.

+---------------
| > |  - URI/URL parsing is probably _in_ scope, however: URI are used all
| > |    over the place in HTTP; it'd be madness to treat them as text...
| > +---------------
| >
| > Well, I'd agree, except... If you're using one of the "active pages
| > stored in the filesystem" styles, fairly often you need to go back and
| > forth from URI to pathnames (and often, *namestrings*). After looking
| 
| Yes.  Jeff Dalton made a similar point on the lispweb list.  My point
| was that given the URL whose printed representation is
| 
|   http://www.example.com/cgi-bin/foo.pl?bar=1&baz=2
| 
| it's a lot nicer to be able to ask for (url-scheme u) or
| (url-query-info u) than chopping it up using position and subseq, and
| my guess is that any web server will do enough of this url processing
| that it will end up growing a set of functions for this purpose.
+---------------

Yes, I think so, but again: when you're operating as a backend to Apache
(which is sort of where I [and maybe we?] started), Apache has already
done much of that for you, and split out the results into individual
CGI environment variables. In particular, for "handled" files, these
variables contain almost all the parsed stuff you need (except for,
as noted last time, the breaking up of QUERY_STRING/POST-data into
query bindings):

    REQUEST_URI			; Cannot be trusted, may have bad "/../".
    REDIRECT_URL		; Safe from bad "/../" (Apache filters it).
    PATH_INFO			;   (ditto)
    PATH_TRANSLATED		;   (ditto)
    SCRIPT_NAME			;   (ditto)
    SCRIPT_FILENAME		;   (ditto)

    REQUEST_METHOD
    QUERY_STRING		; Set but null for both "/foo" & "/foo?".
    REDIRECT_QUERY_STRING	; Unset for "/foo"; set but null for "/foo?".
    CONTENT_LENGTH		; (for POST)

    SERVER_PROTOCOL
    SERVER_NAME/HTTP_HOST	; May be different if "UseCanonicalName On",
    SERVER_PORT			;  then $HTTP_HOST is the URI's virtual server.

    AUTH_TYPE
    REMOTE_USER
    REMOTE_ADDR

Yes, I know some of this is very Apache-specific, and a fully-general API
should be the same under CL-HTTP, AllegroServe, etc. But I mention it only
to show why I didn't spend very much energy on a fancy parsed-URI scheme.
Everything *I* needed was in those vars.
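
The one bit Apache does *not* split out for you -- turning QUERY_STRING
(or the POST data) into individual bindings -- is only a few lines anyway.
A minimal sketch, ignoring %XX-decoding and "+"-to-space conversion:

    (defun split-query-string (qs)
      ;; "bar=1&baz=2" => (("bar" . "1") ("baz" . "2"))
      (when (plusp (length qs))
        (loop for start = 0 then (1+ amp)
              for amp = (position #\& qs :start start)
              for pair = (subseq qs start amp)
              for eqpos = (position #\= pair)
              collect (if eqpos
                          (cons (subseq pair 0 eqpos)
                                (subseq pair (1+ eqpos)))
                          (cons pair ""))
              while amp)))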

*Aha!* I just figured out a major source of our differences over URI
parsing here! You're *NOT* running as a "CGI" or "mod_lisp" client,
but as a proxy. All of that Apache parsing that went into the CGI
environment variables *isn't* available to you. You *have* to (re)parse
the full URI again. [Apache parsed it the first time, to figure out
that it needed to send it to the proxy (your server), but all of the
parsed information got lost when it did.] That explains a lot...

+---------------
| If strings are accepted as url designators everywhere that URLs are
| (which is not uniformly the case in araneida), I don't think that url
| parsing should impact the user too much.  I'm open to being convinced 
| on that, I guess.  Outside of CLiki, my URLs almost never correspond
| in any way at all to things in the filesystem, so people with
| other usage patterns will see it differently.
+---------------

As I commented above, IMHO one should carefully partition the portion
of URI space one is using prefix matching for away from filesystem-mapped
and/or suffix-matched space. The latter are the ones that end up using
strings a lot. (IME)

+---------------
| > "Lispweb"?  Whazzat?
| 
| The lispweb@red-bean.com mailing list.  I've posted an article there
| which you can find on gmane at
|   http://article.gmane.org/gmane.lisp.web/148
| For subscription information, look at 
|   http://www.red-bean.com/mailman/listinfo/lispweb
+---------------

Thanks!


-Rob

-----
Rob Warnock, PP-ASEL-IA		<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607