Subject: Re: representing strings??
From: rpw3@rigden.engr.sgi.com (Rob Warnock)
Date: 1998/08/05
Newsgroups: comp.lang.scheme
Message-ID: <6q9175$1q8m4@fido.engr.sgi.com>
Andrew  <calcheng@uclink4.berkeley.edu> wrote:
+---------------
| any ideas on how to represent strings in an interpreter.....
| i am looking for type checking criteria.....and how a string is 
| actually represented in the interpreter.........
+---------------

It varies wildly all over the place among implementations -- see
Gudeman, "Representing Type Information in Dynamically Typed Languages"
<URL:ftp://ftp.cs.indiana.edu/pub/scheme-repository/doc/pubs/typeinfo.ps.gz>
for an excellent survey -- but the main two styles of representing strings
are [horribly oversimplified!] indirect & inline.

In the "indirect" style, heap "objects" are a fairly small struct (or a
union of small structs) containing either a *very* small amount of data
in the struct itself or pointer(s) to larger data elsewhere, something
like this (cribbed and simplified from the SIOD sources):

	typedef struct obj {
	 short type;
	 union {struct {struct obj *car; struct obj *cdr;} cons;
		struct {double data;} flonum;
		struct {char *pname; struct obj *vcell;} symbol;
		...
		struct {char *name; struct obj *(*f)(void *,...);} subr;
		struct {struct obj *env; struct obj *code;} closure;
		struct {long dim; long *data;} long_array;
		struct {long dim; double *data;} double_array;
		struct {long dim; char *data;} string;
		struct {long dim; struct obj **data;} lisp_array;
	  } u;
	} *LISP;

and there is an enum somewhere (or series of some "#define"s) to enumerate
the values in "type". So one might type-check for a string with:

	#define STRING_P(x) (IS_ON_HEAP(x) ? (x)->type == string_type : 0)

(where the "IS_ON_HEAP" macro decides between "immediate" values and heap
objects) and get the address and length of the string with:

	#define STRING_VAL(x) ((x)->u.string.data)
	#define STRING_LEN(x) ((x)->u.string.dim)

Often in the indirect style only the heap "cells" (as above) are garbage-
collected per se, while the "external" data is explicitly malloc'd/free'd
by type hook routines called by the GC when a heap cell's state changes.

In the direct style, the data (except for "large" objects, which are handled
somewhat as in the indirect style) immediately follows the type indicator,
sometimes with a fixed-position length indicator as well (to make life a
little simler for the GC), so you get something like this, using the idiom
of a length-1 array to signify indefinite size:

	typedef struct obj {
	  short type;
	  short len;
	  union {
	    char  chars[1];
	    short shorts[1];
	    int   ints[1];
	    long  longs[1];
	    void *ptrs[1];
	  } u;
	} *LISP;

and then you have various macros to do casts onto the "obj" type, e.g.:

	#define STRING_P(x) (IS_ON_HEAP(x) ? (x)->type == string_type : 0)
	#define STRING_LEN(x) ((x)->len)
	#define STRING_VAL(x) (&((x)->u.chars[0]))

or maybe:

	struct Lisp_String {
	  short type;
	  short len;
	  char data[MAX_IMMED_STRING];
	};
	...
	#define STRING_VAL(x) (((struct Lisp_String*)(x))->data)
	#define STRING_LEN(x) ((x)->len)

Finally, some (but not all) Schemes ensure that strings are null-terminated
(even though the associated length field makes that uneccessary) to make it
easier to hand Scheme strings to C library routines which expect null-
terminated strings.

Does that answer your question?


-Rob

-----
Rob Warnock, 7L-551		rpw3@sgi.com   http://reality.sgi.com/rpw3/
Silicon Graphics, Inc.		Phone: 650-933-1673
2011 N. Shoreline Blvd.		FAX: 650-933-4392
Mountain View, CA  94043	PP-ASEL-IA