Subject: Re: hashtable w/o keys stored...
From: rpw3@rigden.engr.sgi.com (Rob Warnock)
Date: 1999/01/19
Newsgroups: comp.lang.lisp
Message-ID: <781noq$9smv@fido.engr.sgi.com>
Erik Naggum  <erik@naggum.no> wrote:
+---------------
|   the real difference, as I see it, is that tries will be able to take
|   advantage of _any_ smallest set of differences, so if you have a bunch of
|   people names starting with "Smith", the trie will effectively just move
|   past the first 5 and use the next 5 characters, instead.
+---------------

Exactly.  Plus, if you're doing typical symbol-table or other plaintext stuff,
you can *significantly* cut the amount of storage tries use by applying various
forms of "compression" to each node (much like the way most C compilers
generate different styles of code for "switch" statements depending on the
pattern of "case" values), tagging each node with its "style" or kind
(a rough sketch follows the list):

- As Erik notes, you can collapse a run of fanout==1 nodes into a single
  "match"-only node consisting of a string and one further node pointer.

- If the degree of fanout is >1 but still very low, the node might be
  encoded as a small string that's searched sequentially, together with
  a vector of the same length (indexed by the searched-for character's
  position in the string). While this is a good deal slower than a pure
  "radix-sort" style index, it's also *much* smaller.

- With higher fanout, a node might contain "min" & "max" character codes,
  and a vector as large as "max - min + 1", indexed by "this_char - min".

- With still higher fanout, revert to a full max-char-code size vector.
  In most practical applications, you'll only need a few nodes like this
  (maybe even only the root node).
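
E.g., here's a rough Common Lisp sketch of what those node kinds might look
like (all the names are made up; it only shows the single-character dispatch,
since a "match" node consumes a whole run of characters and so belongs in
the outer walking loop):

    (defstruct (match-node (:conc-name mn-))   ; run of fanout==1 characters
      (chars "" :type simple-string)           ; the string to match verbatim
      next)                                    ; the single further node

    (defstruct (search-node (:conc-name sn-))  ; low fanout: linear search
      (chars "" :type simple-string)           ; the branch characters
      (children #() :type simple-vector))      ; child i goes with (char chars i)

    (defstruct (range-node (:conc-name rn-))   ; higher fanout: min/max window
      (min 0) (max 0)
      (children #() :type simple-vector))      ; indexed by (- code min)

    (defstruct (full-node (:conc-name fn-))    ; full fanout: one slot per code
      (children (make-array 256 :initial-element nil)   ; assumes 8-bit chars
                :type simple-vector))

    (defun node-child (node char)
      "Pick NODE's child for CHAR, or NIL if there isn't one."
      (etypecase node
        (search-node (let ((i (position char (sn-chars node))))
                       (and i (svref (sn-children node) i))))
        (range-node  (let ((i (- (char-code char) (rn-min node))))
                       (and (<= 0 i (- (rn-max node) (rn-min node)))
                            (svref (rn-children node) i))))
        (full-node   (svref (fn-children node) (char-code char)))))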

+---------------
| a hash function would have to be excessively intelligent to adapt
| the same way.
+---------------

That reminds me of a trick that can help hash tables: If you use a really
good hash, one for which nearly every bit position contains nearly a full
bit of information (a CRC-32 is such a hash), then if you store the *full*
hash value with the entry, then when you want to grow the hash table because the
per-bucket chains get too long, you don't have to re-hash the keys -- you
can just use more bits of the original hash value.
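
A quick sketch of that (hypothetical names; assumes the bucket count is a
power of two, so the bucket index is just the low bits of the stored hash):

    (defstruct entry key value hash)    ; HASH = the full 32-bit hash, e.g. a CRC-32

    (defun grow-table (buckets)
      "Double BUCKETS (a vector of entry lists) without re-hashing any key:
    just take one more bit of each entry's stored full hash."
      (let* ((new-size (* 2 (length buckets)))
             (new (make-array new-size :initial-element nil)))
        (dotimes (i (length buckets) new)
          (dolist (e (aref buckets i))
            (push e (aref new (logand (entry-hash e) (1- new-size))))))))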

Finally, there's an interesting hybrid of hash tables and tries -- a
multi-level hash table that uses different bits of the hash at each level
of the tree/trie (that is, it's a binary tree if you only use one bit of
the hash per level). When you hit a leaf node you do a comparison on
the full hash first before wasting time with a full string compare.
If the hashes are the same but the strings differ, then the strings are
probably *very* different (assuming a "good" hash, such as a CRC), so
hanging a "compressed trie" (as above) off that node should be a cheap
way to resolve the conflict (that is, the strings will very likely differ
in the first character!).
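
Something along these lines (again, made-up names; here I use 4 bits of the
hash per level rather than just 1, and I've left out hanging the "compressed
trie" off a conflicting leaf):

    (defconstant +bits-per-level+ 4)

    (defstruct hnode
      (children (make-array (ash 1 +bits-per-level+) :initial-element nil)))

    (defstruct hleaf key value hash)    ; HASH = the full hash of KEY

    (defun hash-trie-find (node key hash &optional (shift 0))
      "Walk NODE by successive fields of HASH; return KEY's value or NIL."
      (etypecase node
        (null  nil)
        (hleaf (and (= hash (hleaf-hash node))        ; cheap full-hash compare first,
                    (string= key (hleaf-key node))    ; string compare only if it matches
                    (hleaf-value node)))
        (hnode (hash-trie-find
                (svref (hnode-children node)
                       (ldb (byte +bits-per-level+ shift) hash))
                key hash (+ shift +bits-per-level+)))))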

[Hmmm... It would be interesting to compare the space & speed of these three
methods as implementations of "intern"...]


-Rob

-----
Rob Warnock, 8L-855		rpw3@sgi.com
Applied Networking		http://reality.sgi.com/rpw3/
Silicon Graphics, Inc.		Phone: 650-933-1673
2011 N. Shoreline Blvd.		FAX: 650-964-0811
Mountain View, CA  94043	PP-ASEL-IA