<paritos@gmail.com> wrote:
+---------------
| Peder O. Klingenberg wrote:
| > So you allocate room for the entire file in memory before you turn it
| > back into a stream and call read-line on it. Seems wasteful. What's
| > wrong with using read-line directly on the file you're reading?
|
| According to http://www.emmett.ca/~sabetts/slurp.html this can be more
| efficient.
+---------------
Only if you can *fit* the whole file into memory. If not,
you *must* process it incrementally (e.g., use READ-LINE
or modest-sized READ-SEQUENCE on it directly).
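
As a minimal sketch of the READ-SEQUENCE alternative (not code from
this thread): the helper name COUNT-CHARS-AND-LINES and the 64 K
default chunk size are assumptions chosen for illustration. It reuses
one fixed-size buffer, so heap use stays bounded no matter how large
the file is.

```lisp
;; Hypothetical helper: count characters and newlines by reading the
;; file in fixed-size chunks instead of slurping it whole.
(defun count-chars-and-lines (pathname &key (chunk-size 65536))
  (with-open-file (s pathname)
    ;; One buffer, reused for every chunk.
    (let ((buffer (make-string chunk-size)))
      (loop for end = (read-sequence buffer s)
            ;; READ-SEQUENCE returns the index of the first element
            ;; not updated, so 0 means end of file.
            until (zerop end)
            sum end into chars
            sum (count #\Newline buffer :end end) into lines
            finally (return (list lines chars))))))
```

Called as (count-chars-and-lines "some-file.txt"), it returns a list
of the newline count and the character count. A final line with no
trailing newline is not counted as a line here.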
+---------------
| However, plain read-line as you are suggesting yields the
| same sluggish performance, of roughly 1MB per *minute*
+---------------
Something's seriously wrong here, then. A simple CMUCL loop
gets ~16 MB per *second* (~240 K lines/s) on a 117 MB file --
and it was an *NFS-mounted* file at that!!
cmu> (time (with-open-file (s "fairly-large-mail-file.txt")
             (loop for line = (read-line s nil nil)
                   while line
                   count line into count
                   sum (1+ (length line)) into total-length
                   finally (return (list count total-length)))))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:
; [GC threshold exceeded with 13,070,440 bytes in use. Commencing GC.]
; [GC completed with 1,165,192 bytes retained and 11,905,248 bytes freed.]
; [GC will next occur when at least 13,165,192 bytes are in use.]
...[another 12 similar GC triplets]...
; [GC threshold exceeded with 13,390,112 bytes in use. Commencing GC.]
; [GC completed with 1,386,680 bytes retained and 12,003,432 bytes freed.]
; [GC will next occur when at least 13,386,680 bytes are in use.]
; Evaluation took:
; 7.18f0 seconds of real time
; 5.65f0 seconds of user run time
; 0.3f0 seconds of system run time
; 12,932,822,018 CPU cycles
; [Run times include 0.27f0 seconds GC run time]
; 0 page faults and
; 174,137,920 bytes consed.
;
(1721482 116790736)
cmu> (/ 116790736 7.18f0)
1.626612f+7
cmu> (/ 1721482 7.18f0)
239760.73f0
cmu>
Note that by doing individual READ-LINEs and making sure that the
values were garbage by the time the next GC occurred, the above run
never used more than ~13 MB of additional heap at any one time,
despite the fact that the file was an order of magnitude larger.
Any decent CL implementation should get similar (or better!)
performance. Try a similar simple loop in your configuration,
and see if you can spot the bottleneck.
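
One portable way to run such a test, as a rough sketch only: the
function name READ-LINE-THROUGHPUT and the shape of its result are
assumptions, not part of any implementation. It uses only standard
CL (GET-INTERNAL-REAL-TIME), so it should behave the same way
wherever TIME output is formatted differently from CMUCL's.

```lisp
;; Hypothetical helper: time a plain READ-LINE loop over a file and
;; report lines, characters, elapsed seconds, and characters/second.
(defun read-line-throughput (pathname)
  (let ((start (get-internal-real-time))
        (lines 0)
        (chars 0))
    (with-open-file (s pathname)
      (loop for line = (read-line s nil nil)
            while line
            do (incf lines)
               ;; 1+ accounts for the newline READ-LINE strips; this
               ;; overcounts by one if the last line has no newline.
               (incf chars (1+ (length line)))))
    ;; Clamp to one clock tick so a very fast run cannot divide by zero.
    (let ((seconds (max (/ (- (get-internal-real-time) start)
                           internal-time-units-per-second)
                        (/ 1 internal-time-units-per-second))))
      (list :lines lines
            :chars chars
            :seconds (float seconds)
            :chars-per-second (float (/ chars seconds))))))
```

If this reports something near 1 MB per minute on a local file, the
problem is more likely in the stream setup (external format,
buffering, or the filesystem) than in the loop itself.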
-Rob
-----
Rob Warnock <rpw3@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607