Subject: Re: hash tables with a copying memory gc
From: rpw3@rigden.engr.sgi.com (Rob Warnock)
Date: 1999/11/19
Newsgroups: comp.lang.lisp
Message-ID: <813ama$24o6f@fido.engr.sgi.com>
David Betz <dbetz@mediaone.net> wrote:
+---------------
| I have built a system based on a copying garbage collector and I would
| like to add hash table support...
+---------------

Me too. (Well, "working on", not "have built"...) The following somewhat
half-baked approach is the best I've come up with so far. It *ought* to
avoid re-hashing for many common uses of hash tables. (I'd be interested
to see how it compares with commercial implementations.)

+---------------
| How are hash tables usually implemented in a copying environment?
| ...give each object an id that is never recycled...  This works but
| wastes space... [or] ...address of the object [and] rehash every time
| a collection occurs. This wastes processing time.
| What is the good solution that I'm overlooking.
+---------------

Those are the two main ways I know of. Well, there's third choice which
works only for "persistent" hashes [see below], and that's to recompute
the hash every time you do a lookup, but that wastes time if keys tend
to be looked up lots of times, on average. The direction I've been heading
is a mixed-strategey hybrid between all of the above, with two main elements: 

First, partition the object/type/class space into objects that have
more-or-less obvious persistent (or idempotent or immutable) hashes
(e.g., strings, symbols, & numbers come to mind) and those that don't
(conses, general vectors, etc,). For pragmatic or heuristic reasons
you might decide to shift *some* (but not necessarily all) objects that
"naturally" fall in the latter group into the first group by providing
them with unique IDs -- perhaps "heavyweight" class or structure instances,
where the extra space for a unique ID might not be as big an issue.

Then, upon garbage collection, only rehash the table if at least one
key that *doesn't* have a persistent hash is currently in the table;
otherwise not. That is:

1. When you create an empty hash table or "clrhash" it, flag it as "pure"
   [or whatever you want to call it]:  (setf (hash-table-pure-p ht) t)

2. Let each (setf (gethash key ht) val) implicitly also do something like:
   (setf (hash-table-pure-p ht) (and (hash-table-pure-p ht)
				     (hash-key-persistent-p key)))
   where the "hash-key-persistent-p" predicate is true if it is possible
   (*and* if you've chosen in your design) to calculate a persistent-style
   hash for that object -- one that is unchanged across collections.

3. Each time a hash table is copied by the GC, if (hash-table-pure-p ht)
   then you don't have to rehash the table, else you do.

4. Whenever the GC rehashes a hash table, it must behave AS IF a "clrhash"
   had been done followed by re-inserting all of the entries, allowing an
   "impure" hash table to become "pure" again if all the impure keys are
   remhash'd. [Note that since hash tables always have an integer number
   of keys in them, you could achieve the "as if" cheaply with a simple
   "ref count" of impure keys maintained by setf/gethash & remhash.]

5. For those objects whose hash values can be persistent, you have a number
   of choices of when to compute the hash:
   A. Always compute it when the object is initially created, and store it
      in the object for later use if the object is ever used as a key of any
      hash table. (Instantiation serial numbers fall here, too.) Costs space
      in every object of every persistent-hash-capable type.
   B. When the object is initially created, store an "impossible" hash in
      the object (also costs space per object), but don't actually compute
      the hash until the first time the object is used as a key in *some*
      hash table (*then* store it in the object). Since most objects are
      never used as hash table keys, this might save considerable CPU time.
      [On the other hand, "intern" probably wants to use the same mechanism,
      so you might as well compute & store the hash for every symbol's
      "symbol-name" at read time, as in "A".]
   C. You might *never* store the hash (saving space), but recompute it
      whenever you need it. [Though obviously, this *can't* work for objects
      using instantiation serial numbers!]

The thing that's interesting to me is that you don't have to apply
a uniform strategy to all objects/types/classes, but can tune the
implementation to give a hybrid strategy that's better than any single
strategy. [You might even expose some of the tuning knobs to the user,
though that might get a bit tricky in places...]


-Rob

p.s. One more point: Whatever you use for a hash function -- *especially*
for objects with instantiation-serial-numbers/IDs -- should have the property
that the information is fairly uniformly distributed throughout the bits
of the largest possible hash value. That allows you to grow a hash table
*without* recomputing the hash values -- just use more bits of the ones that
have already been computed. Raj Jain wrote a paper at DEC that showed that
CRCs (particularly the Ethernet CRC-32) have this property, and for many
years I've been using a table-lookup version of CRC-32 almost everywhere
I needed a hash function. (On some machines it's even cheaper than a
multiply-and-add hash.)

-----
Rob Warnock, 8L-846		rpw3@sgi.com
Applied Networking		http://reality.sgi.com/rpw3/
Silicon Graphics, Inc.		Phone: 650-933-1673
1600 Amphitheatre Pkwy.		FAX: 650-933-0511
Mountain View, CA  94043	PP-ASEL-IA