Subject: Re: hints when trying to implement lisp in c
From: rpw3@rpw3.org (Rob Warnock)
Date: Sat, 15 Sep 2007 23:58:13 -0500
Newsgroups: comp.lang.lisp
Message-ID: <rPWdnbpyfLF4KnHbnZ2dnUVZ_qCgnZ2d@speakeasy.net>
<luis_prosikito@yahoo.es> wrote:
+---------------
| So far I know there is tinyscheme, lisp500, emacs, of which I'm
| always reading source code.
+---------------

Those are certainly good source code references, but if you like
those you'll want to add George Carrette's SIOD (Scheme In One Day)
and Aubrey Jaffer's SCM to your list:

    http://people.delphiforums.com/gjc/siod.html

    http://www-swiss.ai.mit.edu/~jaffer/SCM
    http://www-swiss.ai.mit.edu/~jaffer/SLIB

Both are written in C, with fairly small code bases given
what they provide.

Caveat: SIOD is an R3RS Scheme plus extensions; SCM is R4RS
plus a bunch of extensions in SLIB. Both can be embedded
in C programs and/or call C code linked with them.

+---------------
| I also know the implementation details heavily depends on the kind
| of lisp one is looking for (i.e.: dynamic or lexical scope, etc.)
+---------------

For your sanity's sake, I do hope you're looking for a Lisp
with lexical scope, at least for locals [that is, Scheme or
Common Lisp or similar].

+---------------
| Hoewever my question is if there is some known document which
| enumerate/explains the basic parts of a lisp interpreter.
| 
| I've read/I'm reading many different documents, such as "the roots of
| lisp", "on lisp", "lisp in small pieces", "sicp", "elements of
| artificial intelligence" from tanimoto, etc, yet none of them explains
| what I want to read, which is about the implemtation of lisp from a
| language-independent manner.
+---------------

"Lisp In Small Pieces" is IMHO *THE* best of all of those for a
potential implementer to grasp. The fact that most of it uses a
Scheme-like metalanguage for the concrete examples shouldn't
deter you; there are sections in the back that target C code.

+---------------
| What I've so far is:
| a stack/table driven LL(1) parser, which I use rather than a recursive
| proper so as to keep track of what part of the grammar the interpreter
| is currently in (I might change this to a recursive one once I feel
| more familiarized with the whole thing, as the actual model wastes cpu
| cycles).
+---------------

Having done a few implementations of "a small Lisp" from time to time
[hey, everyone has a *few* time-wasting hobbies! ;-} ], it's my personal
opinion that this is going about it the wrong way. The essence of the
Lisp-family languages [despite R5RS Scheme heading a bit down the
wrong path] is that the semantics of Lisp programs is defined on s-expr
*objects* in memory, *NOT* on any sometimes-equivalent/sometimes-not
externally-serialized character representation. We just recently had
a very long series of threads in this forum on that very topic [look
for the subjects "grammatical complexity" and "simple lisp interpreter"]
about whether terms like "LL(n)" can even apply to Lisp, given the
possibility of user-defined readmacros.

In my experience, a better approach for a beginning Lisp implementer
to take, given that you want to target C at the end, is this:

0. Pick some one implementation of Lisp/Scheme/Common_Lisp/whatever
   that you are familiar & comfortable with and that runs on the
   platform upon which you're going to be doing most of your initial
   development. This will be the language that you will write almost
   all of your prototypes, build tools, test suites, etc., on, at least
   until your own implementation is mature enough for them to be able
   to be ported to it [at which point it will become "self-hosted"].
   Call this language L0.

1. In L0, code up an implementation of the READ function that parses
   the printed representation of objects -- including s-exprs -- of
   your target language (Ln) and returns an "equivalent" object in L0.
   [If Ln provides new atomic object types that don't exist in L0,
   just define some structs or something in L0 to represent them.]
   Note that, given the same input string, it is *not* necessary
   that L0's READ and Ln's READ return the same (EQUAL or EQUALP)
   L0 object! But it's convenient for debugging if they're very
   close to the same. [You want to see something "reasonable" when
   you print one of them at the L0 REPL.]

   This is actually quite a bit of work, but the "parsing" aspects
   are the least of it. All you need at the beginning for the "parser"
   is a simple lexer for numbers, symbols, & strings, and a recursive-
   descent loop for lists and dotted-conses. The complexity is really
   in such things as: (1) character sets [ASCII-only? UTF-8?];
   (2) readtables [especially if Ln is going to allow users to define
   readmacros; at first, readmacros have to be written in L0];
   (3) maybe even how your eventual GC's allocator is going to work,
   that is, deciding on the representation of certain objects, e.g.,
   symbols in Ln may have (or need) different slots than symbols
   in L0.
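
   Since C is the eventual target anyway, here is roughly what that
   "simple lexer plus recursive-descent loop" might look like once
   ported to C. This is a minimal sketch only: the tagged-object
   layout, all the names, and the atom set (fixnums & symbols; no
   strings, dotted pairs, or readmacros) are assumptions, and
   allocation is just calloc()-and-never-free as discussed below.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Minimal tagged objects: fixnums, symbols, conses, and NIL. */
typedef enum { SYM, NUM, CONS, NIL_T } Tag;
typedef struct Obj {
    Tag tag;
    long num;                 /* NUM */
    char name[32];            /* SYM */
    struct Obj *car, *cdr;    /* CONS */
} Obj;

static Obj nil_obj = { NIL_T, 0, "NIL", NULL, NULL };
#define NIL (&nil_obj)

static Obj *alloc(Tag t) {    /* calloc()-and-never-free, for now */
    Obj *o = calloc(1, sizeof(Obj));
    o->tag = t;
    return o;
}

static Obj *read_expr(const char **sp);

static void skip_ws(const char **sp) {
    while (isspace((unsigned char)**sp)) (*sp)++;
}

static Obj *read_list(const char **sp) {   /* called after the '(' */
    skip_ws(sp);
    if (**sp == ')') { (*sp)++; return NIL; }
    Obj *c = alloc(CONS);
    c->car = read_expr(sp);
    c->cdr = read_list(sp);
    return c;
}

/* read_expr: recursive-descent READ over a C string; *sp is
   advanced past whatever was consumed. */
static Obj *read_expr(const char **sp) {
    skip_ws(sp);
    if (**sp == '(') { (*sp)++; return read_list(sp); }
    char buf[32]; int i = 0;              /* lex one atom */
    while (**sp && !isspace((unsigned char)**sp)
           && **sp != '(' && **sp != ')') {
        if (i < 31) buf[i++] = **sp;
        (*sp)++;
    }
    buf[i] = '\0';
    char *end;
    long n = strtol(buf, &end, 10);
    if (end != buf && *end == '\0') {     /* all digits: a fixnum */
        Obj *o = alloc(NUM); o->num = n; return o;
    }
    Obj *o = alloc(SYM);                  /* otherwise a symbol */
    strcpy(o->name, buf);
    return o;
}
```

   Note that nothing here cares what "+" or "foo" *means*; READ only
   builds objects.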

2. Also in L0, code up an implementation of the PRINT & PRINC functions
   [or better, {WRITE/PRIN1/PRINC}-TO-STRING, but you can start with
   just PRINT/PRINC/TERPRI] that takes L0 objects that have been parsed
   by Ln-READ and creates an "equivalent" printed representation.
   Worry a little bit [but not *too* much, at this stage] about
   getting PRINT+READ & READ+PRINT invariance correct, especially
   for numbers.
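
   For symmetry, a PRIN1-style printer over the same kind of minimal
   tagged object might look like this in C [the layout and names are
   assumptions for illustration, not any particular implementation's]:

```c
#include <stdio.h>
#include <string.h>

/* A minimal tagged object, as on the READ side. */
typedef enum { SYM, NUM, CONS, NIL_T } Tag;
typedef struct Obj {
    Tag tag;
    long num;
    const char *name;
    struct Obj *car, *cdr;
} Obj;

/* prin1_to_buf: appends the printed representation of o to buf,
   handling proper lists and dotted tails. */
static void prin1_to_buf(Obj *o, char *buf) {
    switch (o->tag) {
    case NIL_T: strcat(buf, "NIL"); break;
    case NUM:   sprintf(buf + strlen(buf), "%ld", o->num); break;
    case SYM:   strcat(buf, o->name); break;
    case CONS:
        strcat(buf, "(");
        for (;;) {
            prin1_to_buf(o->car, buf);
            if (o->cdr->tag == NIL_T) break;
            if (o->cdr->tag != CONS) {     /* dotted tail */
                strcat(buf, " . ");
                prin1_to_buf(o->cdr, buf);
                break;
            }
            strcat(buf, " ");
            o = o->cdr;
        }
        strcat(buf, ")");
        break;
    }
}
```

   With no strings or characters yet, PRINC would differ only once
   you add escaping; the number cases are where READ/PRINT invariance
   gets tricky later.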

At this point you can go one of two ways, the "make EVAL work" path
or the "start porting to C" path. Said another way, you can continue
building an Ln virtual machine in the L0 environment, and go further
and further into implementing INTERN, EVAL, APPLY [and FUNCALL],
special forms, arithmetic, array, & string ops, etc., in L0 code
that acts on an instance of an Ln VM. Or you can stop here and
immediately port your Ln READ & PRINT functions to C.

[Note: Taking the latter path *doesn't* necessarily mean implementing
a full GC yet (at first you can just "malloc()" stuff and never free it),
but you *do* need to think long and hard about what style of C code
you're going to end up with and the kind of GC it's going to support.
Are you going to take the easy route and use a Boehm/Demers/Reiser
conservative GC, or do you want to write your own precise GC? Each
choice constrains the kind of C code you "generate" (manually, at
this point).]

As one who has taken both paths [with different projects] in the past,
I can tell you that neither is fully satisfactory for a new implementer
who's just noodling around. The "make EVAL work" [really, the "build
an Ln VM in L0"] path will appear to go very fast, and you'll get lots
of stuff to work early, but you might become frustrated that even
after quite a lot of work you seem no closer to the "real" environment
that you started out targeting. But when you *do* finally start the
porting to "the hardware" [C code, in this case] things will probably
go very quickly, and if not, any serious refactoring that's necessary
will be much faster on this path. The danger on this path is that
you'll get discouraged and never get to "the hardware".

On the other hand, if you port your Ln READ & PRINT functions to C as
soon as you get them to "sort of work" in L0, you'll get an initial
rush of accomplishment, since once READ & PRINT work in C [along with
all of the object-allocation code that that implies, including INTERN], 
adding EVAL/APPLY, a few of the basic generic Lisp special forms and
functions, and getting a simple REPL to "work" is a *very* small
amount of work. "Yippee! It works!" Well, not quite. Odds are very
high that as you start adding more and more of the semantics of Ln
[assuming it's a relatively rich language] that you will find yourself
deep down in an architectural dead end, and will have to backtrack
pretty far out to find a solution. The resulting refactoring of all
of that manually-written C code will not be pretty, and may lead to
your giving up entirely.

There is a compromise approach, but it's potentially a bit more work
up front than either of the two alternatives given above, which is
to start in the very beginning with a (simple) Ln-to-C compiler
written in L0, and *stay* with that approach all the way, refactoring
[in L0 source!] as necessary when incrementally adding features to Ln
[that is, as you grow L1, L2, ... all the way to Ln]. And if you
grow your test suite along with the compiler, so that every step
along the way your compiler is passing the test suite, you end up
with the method described here:

    http://lambda-the-ultimate.org/node/1752
    http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
    http://www.cs.indiana.edu/~aghuloum/
    Abdulaziz Ghuloum (Indiana University),
    "An Incremental Approach to Compiler Construction"

He's compiling from Scheme to x86 assembler source, but the same
approach would work compiling from Ln to C [with the compiler
written in L0, of course].

This latter approach is sort of inside-out compared to my earlier
suggestion of starting with READ/PRINT, in that you initially use
*only* L0's READ/PRINT, and don't get to Ln's READ/PRINT at all
until your compiler is already rich enough to implement READ/PRINT
*in* compiled Ln.

[Aside: Since I've never tried it before myself, I suspect that
if I ever attempt another "small Lisp", I'll try the latter
approach for it. But YMMV...]

+---------------
| An nfa in which every primitive/variable/function get registered, as
| the simplest/fastest way I could think of when trying to be able to
| know if the word being read is a primtive/variable or function or not,
| and as to acess it's structure data in the case it is.
+---------------

This is not the Lisp way. There is no need whatsoever when READing
source text to know what the symbols in it "mean". Only after source
text is parsed into an in-core s-expr [including whatever INTERNing
is needed] should you even *care* about what a symbol "means".
After all, the user may have simply typed a quoted list containing
a whole bunch of symbols that -- while the same as some symbols
with value or function definitions in your language -- are in this
case "just data", and thus will *never* need interpretation by
your compiler/interpreter.

+---------------
| Some sort of oblist for the primitives, variables and functions
| (globaly) defined.
+---------------

An oblist is not even necessary at all, if you have a more sophisticated
INTERN, though it's certainly one way to do a quick & stupid INTERN.
[I've done that myself while bootstrapping.] But in any case the
oblist is just for *symbols* [perhaps as an aid to INTERN, perhaps
not], and it's the symbols *themselves* that store [in their
SYMBOL-VALUE, SYMBOL-FUNCTION, & SYMBOL-PLIST, etc., slots] whether
they have any global definition(s). But you don't need any of that
just to READ an s-expr.
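
For what it's worth, a quick & stupid INTERN of exactly that sort --
a linked-list oblist whose only job is to guarantee symbol identity,
with the "meaning" slots living on the symbols themselves -- might
look like this in C [all names and layout are illustrative]:

```c
#include <stdlib.h>
#include <string.h>

/* Quick & stupid INTERN: a linked-list oblist of symbols. The
   oblist exists only so that the same name always yields the same
   (EQ) symbol; whether a symbol *means* anything lives in the
   symbol's own slots, not in the oblist. */
typedef struct Symbol {
    char name[32];
    struct Obj *value;      /* SYMBOL-VALUE slot (unset here) */
    struct Obj *function;   /* SYMBOL-FUNCTION slot (unset here) */
    struct Symbol *next;    /* oblist chain */
} Symbol;

static Symbol *oblist = NULL;

Symbol *intern(const char *name) {
    for (Symbol *s = oblist; s; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;                     /* already interned */
    Symbol *s = calloc(1, sizeof(Symbol));/* never freed, for now */
    strncpy(s->name, name, sizeof(s->name) - 1);
    s->next = oblist;
    oblist = s;
    return s;
}
```

A "more sophisticated INTERN" would just swap the linear scan for a
hash table; nothing else changes.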

And when you get to EVAL [and/or MACROEXPAND and/or COMPILE], don't
forget the lexical environment [also *not* needed while READing],
which in some ways is even more important than the global definition(s)
stored in symbols since different lexical variables can be named by
the same symbol. Queinnec covers several possible implementations
of lexical environment in "Lisp In Small Pieces" in the chapter
(section?) on "Fast Interpretation" [including several subtle traps
in the interaction between lexical environments and continuations
in Scheme].

+---------------
| I say globally cose I've wrapped these last two constructions into a
| data-structure (which for now I call environment), as I know each
| function and other contructions needs it's own.
+---------------

Uh... no. Each *instance* of a function closure will need its
own lexical environment, which you can't even create (in most
cases) until run-time.
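
To make that concrete, here is a minimal C sketch [all names invented
for illustration] in which each closure *instance* carries its own
environment, created at run time; two calls to the same "maker"
produce two closures over different bindings of the same variable:

```c
#include <stdlib.h>

/* A closure instance = code pointer + this instance's environment. */
typedef struct Env { long n; } Env;        /* one binding, named N */
typedef struct Closure {
    long (*code)(struct Closure *self, long arg);
    Env *env;                              /* per-instance, heap-allocated */
} Closure;

static long add_n(Closure *self, long arg) {
    return self->env->n + arg;             /* looks N up in *its own* env */
}

/* make_adder: like evaluating (lambda (x) (+ n x)) -- the env is
   created fresh at run time for each instance. */
Closure *make_adder(long n) {
    Env *e = malloc(sizeof *e);
    e->n = n;
    Closure *c = malloc(sizeof *c);
    c->code = add_n;
    c->env = e;
    return c;
}
```

Calling `c->code(c, x)` is the moral equivalent of FUNCALL; the one
shared `add_n` is the code, while each Closure has its own Env.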

+---------------
| And appart from copying tinyscheme's garbage collector I haven't got
| much more.
+---------------

Is that a precise or conservative collector? If the former,
how are you handling your "GC_protect()" stacking in C?
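
[For the precise case, one common technique is an explicit shadow
stack of root addresses that the collector walks at mark time. The
sketch below is purely illustrative -- the names gc_protect and
gc_unprotect are made up here, not tinyscheme's actual API:]

```c
/* A shadow stack of root *addresses* for a precise collector.
   Each protected slot holds a pointer the mutator may update;
   the GC treats every registered slot as a root. */
#define MAX_ROOTS 256
static void **gc_roots[MAX_ROOTS];
static int gc_nroots = 0;

static void gc_protect(void **slot) { gc_roots[gc_nroots++] = slot; }
static void gc_unprotect(int n)     { gc_nroots -= n; }  /* pop n roots */

/* Usage inside a C primitive would look like:
       Obj *tmp = cons(x, y);
       gc_protect((void **)&tmp);   // tmp survives later allocations
       ...code that may trigger a collection...
       gc_unprotect(1);
   Forgetting an unprotect (or protecting after the allocation that
   can collect) is the classic bug with this style. */
```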

+---------------
| Right now I'm stuck in the part where I should apply all the
| procedures I've provided the interpreter with in order to gather
| input properly and start evaluating it.
| 
| As I've seen within emacs, I've provided internal-function's data
| structures with min an max number of arguments so it would be easier
| to parse the input.
+---------------

Something feels wrong about this, as if it's all being done in the
wrong order [e.g., you need READ *first*, then EVAL is really simple]
and/or it's *way* too complex, but I've probably already said too much
and should shut up and let others comment.


-Rob

p.s. My last attempt at a toy Lisp in C got this far, using
about 2500 lines of C [including some CL-generated tables]:

    qdl> (oblist)
    (FLOAD >= <= > < /= = / * - + 1- 1+ ASSOC LENGTH LIST CONS CDR
     CAR MLO D32 DEBUG OBLIST EVAL APPLY FUNCALL VECTOR TERPRI PRINC
     PRINT LET* LET DEFUN SETQ IF PROGN LAMBDA QUOTE TEST123 T NIL)
    qdl> (defun fact (n)
	   (if (< n 2)
	     1
	     (* n (fact (1- n)))))
    FACT
    qdl> (fact 5)
    120
    qdl> (defun fact2 (n)
	   (let ((aux nil))                                    
	     (setq aux (lambda (n prod)
			 (if (< n 2)
			   prod
			   (funcall aux (1- n) (* n prod)))))
	     (funcall aux n 1)))
    FACT2
    qdl> (fact2 5)
    120
    qdl> 

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607