Erik Naggum <erik@naggum.no> wrote:
+---------------
| the real difference, as I see it, is that tries will be able to take
| advantage of _any_ smallest set of differences, so if you have a bunch of
| people names starting with "Smith", the trie will effectively just move
| past the first 5 and use the next 5 characters, instead.
+---------------
Exactly. Plus, if you're doing typical symbol-table or other plaintext stuff,
you can *significantly* cut the amount of storage used with tries by various
forms of "compression" on each node (much like the way most C compilers
generate different styles of code for "switch" statements depending on
the pattern of "case" values), tagging each node with its "style" or kind:
- As Erik notes, you can collapse a run of fanout==1 nodes into a single
"match"-only node consisting of a string and one further node pointer.
- If the degree of fanout is >1 but still very low, the node might be
encoded as a small string that's searched sequentially, together with
a vector of the same length (indexed by the searched-for character's
position in the string). While this is a good deal slower than a pure
"radix-sort" style index, it's also *much* smaller.
- With higher fanout, a node might contain "min" & "max" character codes,
and a vector as large as "max - min + 1", indexed by "this_char - min".
- With still higher fanout, revert to a full max-char-code size vector.
In most practical applications, you'll only need a few nodes like this
(maybe even only the root node).
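As a concrete illustration of the node-kind idea, here's a minimal Python
sketch (all names hypothetical) of just the first two kinds: a collapsed
"match" node and a low-fanout sequentially-searched node. The trie is built
by hand; insertion and the wider vector-indexed kinds are left out.

```python
# Minimal sketch of a "kind"-tagged compressed trie (hypothetical
# names; only full-key membership is implemented).

LEAF = object()  # sentinel: an accepted key must end exactly here

class Match:
    """A collapsed run of fanout==1 nodes: one string, one next pointer."""
    def __init__(self, chars, nxt):
        self.chars, self.nxt = chars, nxt

class Seq:
    """Low-fanout node: a short string searched sequentially, plus a
    parallel vector of children indexed by the match position."""
    def __init__(self, chars, kids):
        self.chars, self.kids = chars, kids

def lookup(node, key, i=0):
    while True:
        if node is LEAF:
            return i == len(key)
        if isinstance(node, Match):
            n = len(node.chars)
            if key[i:i + n] != node.chars:
                return False
            i, node = i + n, node.nxt
        else:                          # Seq: sequential search
            j = node.chars.find(key[i]) if i < len(key) else -1
            if j < 0:
                return False
            i, node = i + 1, node.kids[j]

# Hand-built trie for the keys "smithers" and "smithson": the shared
# prefix "smith" collapses into one Match node, the 'e'/'s' fork is a
# two-way Seq node, and the distinct tails collapse again.
root = Match("smith", Seq("es", [Match("rs", LEAF), Match("on", LEAF)]))
```

The two keys differ only at the sixth character, so a lookup costs one
5-character compare, one 2-way sequential search, and one 2-character
compare -- exactly the "move past the first 5" behavior described above.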
+---------------
| a hash function would have to be excessively intelligent to adapt
| the same way.
+---------------
That reminds me of a trick that can help hash tables: If you use a really
good hash, one for which nearly every bit position contains nearly a full
bit of information (a CRC-32 is such a hash), then if you store the *full*
hash value with each entry, when you want to grow the hash table because the
per-bucket chains get too long, you don't have to re-hash the keys -- you
can just use more bits of the original hash value.
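The trick can be sketched in a few lines of Python (hypothetical names;
zlib.crc32 stands in for the CRC-32). Each chain entry carries its full
32-bit hash, so growing redistributes entries using bits of the stored
value and never hashes a key twice:

```python
import zlib

# Sketch of a chained hash table that stores the full 32-bit hash with
# each entry (hypothetical names).  With power-of-two table sizes,
# "using more bits of the original hash" is just h % (2 * nbuckets).

def crc(key):
    return zlib.crc32(key.encode())

class Table:
    def __init__(self, nbuckets=4):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key, val):
        h = crc(key)
        self.buckets[h % len(self.buckets)].append((h, key, val))

    def get(self, key):
        h = crc(key)
        for h2, k, v in self.buckets[h % len(self.buckets)]:
            if h2 == h and k == key:   # cheap stored-hash compare first
                return v
        return None

    def grow(self):
        # Double the table: redistribute entries by the *stored* hash.
        # The keys themselves are never re-hashed.
        new = [[] for _ in range(2 * len(self.buckets))]
        for chain in self.buckets:
            for entry in chain:
                new[entry[0] % len(new)].append(entry)
        self.buckets = new
```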
Finally, there's an interesting hybrid of hash tables and tries -- a
multi-level hash table that uses different bits of the hash at each level
of the tree/trie (that is, it's a binary tree if you only use one bit of
the hash per level). When you hit a leaf node you do a comparison on
the full hash first before wasting time with a full string compare.
If the hashes are the same but the strings differ, then the strings are
probably *very* different (assuming a "good" hash, such as a CRC), so
hanging a "compressed trie" (as above) off that node should be a cheap
way to resolve the conflict (that is, the strings will very likely differ
in the first character!).
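A sketch of that hybrid in the same vein (hypothetical names, one hash bit
per level, zlib.crc32 as the "good" hash). Leaves store the full hash
alongside the key, so a miss is usually decided by the cheap 32-bit compare;
the compressed-trie fallback for identical-hash collisions is left out:

```python
import zlib

def crc(key):
    return zlib.crc32(key.encode())

class Node:
    """Internal node: consumes one bit of the hash per level."""
    __slots__ = ("zero", "one")
    def __init__(self):
        self.zero = self.one = None  # None, a child Node, or a (hash, key) leaf

def insert(root, key):
    h, node, bit = crc(key), root, 0
    while True:
        side = "one" if (h >> bit) & 1 else "zero"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, (h, key))    # empty slot: leaf goes here
            return
        if isinstance(child, Node):
            node, bit = child, bit + 1       # descend one level
            continue
        if child == (h, key):
            return                           # already present
        # Slot holds some other leaf: split by pushing it one level down.
        # (Two keys with *identical* full hashes would need the
        # compressed-trie fallback described in the text; not shown.)
        new = Node()
        setattr(node, side, new)
        oh, okey = child
        oside = "one" if (oh >> (bit + 1)) & 1 else "zero"
        setattr(new, oside, (oh, okey))
        node, bit = new, bit + 1

def member(root, key):
    h, node, bit = crc(key), root, 0
    while isinstance(node, Node):
        node = node.one if (h >> bit) & 1 else node.zero
        bit += 1
    # Leaf (or empty slot): compare the stored 32-bit hash before
    # ever touching the string itself.
    return node is not None and node[0] == h and node[1] == key
```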
[Hmmm... It would be interesting to compare these three methods for
space & speed to implement "intern"...]
-Rob
-----
Rob Warnock, 8L-855 rpw3@sgi.com
Applied Networking http://reality.sgi.com/rpw3/
Silicon Graphics, Inc. Phone: 650-933-1673
2011 N. Shoreline Blvd. FAX: 650-964-0811
Mountain View, CA 94043 PP-ASEL-IA