Subject: Re: massive data analysis with lisp
From: rpw3@rpw3.org (Rob Warnock)
Date: Mon, 09 Oct 2006 18:41:03 -0500
Newsgroups: comp.lang.lisp
Message-ID: <CKKdnUz0Ku6SQ7fYnZ2dnUVZ_tWdnZ2d@speakeasy.net>
<paritos@gmail.com> wrote:
+---------------
| Peder O. Klingenberg wrote:
| > So you allocate room for the entire file in memory before you turn it
| > back into a stream and call read-line on it.  Seems wasteful.  What's
| > wrong with using read-line directly on the file you're reading?
| 
| According to http://www.emmett.ca/~sabetts/slurp.html this can be more
| efficient.
+---------------

Only if you can *fit* the whole file into memory. If not,
you *must* process it incrementally (e.g., use READ-LINE
or modest-sized READ-SEQUENCE on it directly).
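
For the READ-SEQUENCE variant, a minimal sketch might look like the
following. The function name COUNT-NEWLINES-CHUNKED and the 64K default
chunk size are illustrative assumptions, not anything from the thread.
It assumes the file decodes cleanly under the implementation's default
external format, and it counts characters rather than bytes.

```lisp
;; Hypothetical helper (not from the original thread): scan a file in
;; fixed-size chunks so that only one buffer is live at any time.
(defun count-newlines-chunked (path &key (chunk-size 65536))
  "Return a list (NEWLINES CHARS) for the file at PATH."
  (with-open-file (s path :element-type 'character)
    (let ((buffer (make-string chunk-size))
          (newlines 0)
          (chars 0))
      ;; READ-SEQUENCE returns the index of the first element it did
      ;; not fill, so 0 means end of file was reached.
      (loop for end = (read-sequence buffer s)
            until (zerop end)
            do (incf chars end)
               (incf newlines (count #\Newline buffer :end end)))
      (list newlines chars))))
```

The buffer is allocated once and reused, so the heap footprint stays
near CHUNK-SIZE characters no matter how large the file is. Note that
this counts newline characters, so a final line with no terminator is
not counted as a line.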

+---------------
| However, plain read-line as you are suggesting yields the
| same sluggish performance, of roughly 1MB per *minute*
+---------------

Something's seriously wrong here, then. A simple CMUCL loop
gets ~16 MB per *second* (~240 K lines/s) on a 117 MB file --
and it was an *NFS-mounted* file at that!!

    cmu> (time (with-open-file (s "fairly-large-mail-file.txt")
                 (loop for line = (read-line s nil nil)
                       while line
                       count line into count
                       sum (1+ (length line)) into total-length
                       finally (return (list count total-length)))))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ; [GC threshold exceeded with 13,070,440 bytes in use.  Commencing GC.]
    ; [GC completed with 1,165,192 bytes retained and 11,905,248 bytes freed.]
    ; [GC will next occur when at least 13,165,192 bytes are in use.]
    ...[another 12 similar GC triplets]...
    ; [GC threshold exceeded with 13,390,112 bytes in use.  Commencing GC.]
    ; [GC completed with 1,386,680 bytes retained and 12,003,432 bytes freed.]
    ; [GC will next occur when at least 13,386,680 bytes are in use.]
    ; Evaluation took:
    ;   7.18f0 seconds of real time
    ;   5.65f0 seconds of user run time
    ;   0.3f0 seconds of system run time
    ;   12,932,822,018 CPU cycles
    ;   [Run times include 0.27f0 seconds GC run time]
    ;   0 page faults and
    ;   174,137,920 bytes consed.
    ; 
    (1721482 116790736)
    cmu> (/ 116790736 7.18f0)

    1.626612f+7
    cmu> (/ 1721482 7.18f0)

    239760.73f0
    cmu> 

Note that by doing individual READ-LINEs and making sure that the
values were garbage by the time the next GC occurred, the above run
never used more than ~13 MB of additional heap at any one time,
despite the fact that the file was an order of magnitude larger.

Any decent CL implementation should get similar (or better!)
performance. Try a similar simple loop in your configuration,
and see if you can spot the bottleneck.
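
If your implementation's TIME output is hard to compare, a portable
version of the same measurement could be sketched as below. The name
MEASURE-READ-LINE-RATE is a hypothetical helper; it uses only standard
CL (GET-INTERNAL-REAL-TIME and INTERNAL-TIME-UNITS-PER-SECOND) and, like
the loop above, counts one extra character per line for the terminator.

```lisp
;; Hypothetical helper (not from the original thread): time a plain
;; READ-LINE loop and report lines, characters, and characters/second.
(defun measure-read-line-rate (path)
  "Return a list (LINES CHARS CHARS-PER-SECOND) for the file at PATH.
CHARS-PER-SECOND is NIL if the elapsed time is below clock resolution."
  (let ((start (get-internal-real-time))
        (lines 0)
        (chars 0))
    (with-open-file (s path)
      (loop for line = (read-line s nil nil)
            while line
            do (incf lines)
               (incf chars (1+ (length line)))))
    (let ((seconds (/ (- (get-internal-real-time) start)
                      internal-time-units-per-second)))
      (list lines
            chars
            (if (plusp seconds)
                (float (/ chars seconds))
                nil)))))
```

Run it on the same file you are seeing ~1 MB/minute on. If the rate
from this bare loop is reasonable, the bottleneck is probably in the
per-line processing rather than in READ-LINE itself.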


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607