Notes on the Erik Naggum comp.lang.lisp archive

I created the Naggum cll archive to make it easier to browse and find articles from the late Erik Naggum's comp.lang.lisp history.

In 2009, Ron Garret published a 700MB archive file of all of comp.lang.lisp (185MB compressed). I split that file into separate article files and, for this archive, selected every article whose Message-ID contains "naggum" as a substring. The result is a 5MB archive of plain-text Naggum articles.

Is this a good idea?

I'm not sure. I've found interesting articles I hadn't seen before, but it also removes articles from their original flow of discussion and correction. In 2001, Erik commented on a similar idea:

[T]he reason technical answers to technical questions work so well in this particular context is that anyone can, and usually does, correct any and all mistakes and problems in the answers.
⋮
[I]nstead of automated tools that violate people's willingness to share, just collect your own insights and bring them to the next "generation" of people who discover Common Lisp by answering the questions and correcting the mistakes made by other newbies who got it wrong in their eagerness.

You can get back into some of the context by clicking an article's Message-ID, which will lead you to the Google archive for an article, if available, and from there you can see more of the discussion in the article's thread.

Search

I started off with Montezuma, and it was pretty easy to follow the tutorial to get a working search very quickly. It's worth evaluating if you're looking for a CL indexing/search solution. However, I ran into two problems: I wanted to sort all results by article date, and I wanted to parse search strings and construct queries myself. I couldn't figure out how to do that as quickly as I wanted, so I thought I'd try a new approach.

I split each article into terms. There are about 40,000 unique terms in about 5,000 articles. For each term, there's a bit-vector with one bit per article; if bit n is set in a term's bit-vector, then article number n contains that term. Together with some article metadata, the entire search structure is about 30MB in memory.
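A minimal sketch of that structure, assuming a hash table keyed by upcased term strings and a small fixed article count. Every name here (*article-count*, *term-index*, term-vector, index-article) is made up for illustration and is not the archive's actual code.

```lisp
;; Hypothetical names throughout; a toy version of the term index.
(defparameter *article-count* 8
  "Number of articles in this toy index (the real archive has about 5,000).")

(defvar *term-index* (make-hash-table :test 'equal)
  "Maps an upcased term string to a bit-vector with one bit per article.")

(defun term-vector (term &key create)
  "Return the bit-vector for TERM, or NIL if the term is unknown.
When CREATE is true, an unknown term gets a fresh all-zero bit-vector."
  (let ((key (string-upcase term)))
    (or (gethash key *term-index*)
        (and create
             (setf (gethash key *term-index*)
                   (make-array *article-count*
                               :element-type 'bit
                               :initial-element 0))))))

(defun index-article (article-number terms)
  "Record that article ARTICLE-NUMBER contains each string in TERMS."
  (dolist (term terms)
    ;; Setting bit n means "article n contains this term".
    (setf (bit (term-vector term :create t) article-number) 1)))
```

With this layout, memory grows as (number of terms) x (number of articles) bits, which stays small at the sizes mentioned above.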

Search then becomes a matter of parsing a search string into terms, fetching each term's bit-vector, and merging the bit-vectors with bit-and to determine which articles contain all terms. For example:

Term "LISP": #*11101100...
Term "CODE": #*00110100...
.....Result: #*00100100...

Negative term searches like "-cons" use bit-andc2 instead of bit-and.
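The parsing step might look something like the sketch below. It assumes terms are separated by spaces, that a leading "-" marks a negative term, and that terms are upcased to match the index; split-on-spaces and parse-query are hypothetical names, and the real parser may handle more syntax than this.

```lisp
;; Hypothetical helpers; the query syntax beyond "-term" is an assumption.
(defun split-on-spaces (string)
  "Return the non-empty space-separated substrings of STRING, in order."
  (let ((tokens '())
        (start 0))
    (loop
      ;; Skip any run of spaces to find where the next token begins.
      (let ((begin (position-if-not (lambda (c) (char= c #\Space))
                                    string :start start)))
        (when (null begin)
          (return (nreverse tokens)))
        (let ((end (or (position #\Space string :start begin)
                       (length string))))
          (push (subseq string begin end) tokens)
          (setf start end))))))

(defun parse-query (string)
  "Return two values: the positive terms and the negative terms, upcased."
  (let ((positive '())
        (negative '()))
    (dolist (token (split-on-spaces string))
      ;; A lone "-" is treated as an ordinary (positive) token.
      (if (and (> (length token) 1)
               (char= (char token 0) #\-))
          (push (string-upcase (subseq token 1)) negative)
          (push (string-upcase token) positive)))
    (values (nreverse positive) (nreverse negative))))
```

For example, (parse-query "lisp code -cons") returns the positive terms ("LISP" "CODE") and the negative terms ("CONS").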

Getting search results for a large number of terms takes 0.001 seconds or less, and only conses up a few thousand bytes to create template values to pass to HTML-TEMPLATE.

(The optional third parameter to the bit-vector logical functions also means I don't need to cons up extra garbage for intermediate results. Hooray for bit-vectors!)
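The merging described above can be sketched as follows. This is an illustration under stated assumptions, not the archive's code: search-articles and matching-article-numbers are hypothetical names, and the index is assumed to be an EQUAL hash table from term string to bit-vector. A single result vector is allocated per search and passed as the third argument to bit-and and bit-andc2, so no intermediate vectors are created.

```lisp
;; Hypothetical functions; INDEX maps term strings to bit-vectors.
(defun search-articles (index article-count positive negative)
  "Return a bit-vector marking the articles that contain every term in
POSITIVE and no term in NEGATIVE. With no positive terms, all articles
start out as matches."
  (let ((result (make-array article-count
                            :element-type 'bit
                            :initial-element 1)))
    (dolist (term positive)
      (let ((bits (gethash term index)))
        (if bits
            ;; Third argument: store the AND back into RESULT.
            (bit-and result bits result)
            ;; A term that appears nowhere means nothing can match.
            (fill result 0))))
    (dolist (term negative result)
      (let ((bits (gethash term index)))
        (when bits
          ;; RESULT AND (NOT BITS), again stored into RESULT.
          (bit-andc2 result bits result))))))

(defun matching-article-numbers (bits)
  "Return the article numbers whose bit is set in BITS."
  (loop for n from 0 below (length bits)
        when (= 1 (bit bits n))
          collect n))
```

Using the "LISP" and "CODE" vectors from the example above (truncated to eight articles), searching for both terms yields #*00100100, that is, articles 2 and 5.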

This general idea was inspired by the telephone number search in Programming Pearls and some information from Managing Gigabytes.

URLs

Each article has an ID derived from the article's Message-ID:

(defun article-id (message-id)
  "Return MESSAGE-ID with any leading and trailing angle brackets removed."
  (string-trim "<>" message-id))

The URL for a message is then:

http://www.xach.com/naggum/articles/ARTICLE-ID.html for the HTML version
http://www.xach.com/naggum/articles/ARTICLE-ID.txt for the plain text version
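Putting the two together, a URL could be built along these lines. article-url is a hypothetical helper, the Message-ID in the usage note is a made-up placeholder, and the sketch assumes the ID can be dropped into the URL without any escaping.

```lisp
(defun article-id (message-id)
  "Return MESSAGE-ID with any leading and trailing angle brackets removed."
  (string-trim "<>" message-id))

;; Hypothetical helper; assumes no URL escaping of the ID is needed.
(defun article-url (message-id &optional (type "html"))
  "Return the archive URL for MESSAGE-ID. TYPE is \"html\" or \"txt\"."
  (format nil "http://www.xach.com/naggum/articles/~A.~A"
          (article-id message-id)
          type))
```

For a placeholder Message-ID of "<example@naggum.no>", (article-url "<example@naggum.no>" "txt") gives http://www.xach.com/naggum/articles/example@naggum.no.txt.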

Feedback

If you have any questions or comments about this archive, please email me, Zach Beane.

2010-01-12