Subject: Re: Can not find older posting: Reading files (fast)
From: rpw3@rpw3.org (Rob Warnock)
Date: Sun, 28 Aug 2005 22:46:18 -0500
Newsgroups: comp.lang.lisp
Message-ID: <XsudnVwdGZGXGI_eRVn-gw@speakeasy.net>
drewc  <drewc@rift.com> wrote:
+---------------
| Rob Warnock wrote:
| >     (defun file-string (path)
| >       "Sucks up an entire file from PATH into a freshly-allocated string,
| >       returning two values: the string and the number of bytes read."
| >       (with-open-file (s path)
| > 	(let* ((len (file-length s))
| > 	       (data (make-string len)))
| > 	  (values data (read-sequence data s)))))
| 
| According to [ <http://www.tfeb.org/lisp/obscurities.html> ] ...
+---------------

Thanks for the ref!

+---------------
| ...this function is not portable :
| 
| "But this almost certainly will not work reliably. file-length will 
| almost certainly tell you the length of the file in octets, not 
| characters...
+---------------

Hmmm... O.k., I'll agree with the non-portability in general, but
it *might* be slightly more portable than Tim's page suggests.  ;-}
According to the CLHS:

    FILE-LENGTH returns the length of stream, or NIL if the length
    cannot be determined.

    For a binary file, the length is measured in units of the
    element type of the stream.

and refers one to OPEN, which says:

    element-type---a type specifier for recognizable subtype of
    CHARACTER; or a type specifier for a finite recognizable subtype
    of INTEGER; or one of the symbols SIGNED-BYTE, UNSIGNED-BYTE, or
    :DEFAULT. The default is CHARACTER.

And 13.1.4.1 "Graphic Characters" says that:

    #\Backspace, #\Tab, #\Rubout, #\Linefeed, #\Return, and #\Page,
    if they are supported by the implementation, are non-graphic.

But 2.1.3 "Standard Characters" only requires that #\Space (which is
graphic) and the non-graphic character #\Newline be supported.
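
Whether a given implementation supports the semi-standard names can be
probed with NAME-CHAR, which returns NIL for a name it does not know.
A minimal sketch (the function name SEMI-STANDARD-CHARS-PRESENT is made
up for illustration, not anything from the CLHS):

```lisp
;; Hypothetical helper: probe for the semi-standard character names
;; listed in CLHS 13.1.7 "Character Names".
(defun semi-standard-chars-present ()
  "Return an alist of (NAME . CHARACTER-OR-NIL) for the semi-standard
  character names; NIL means this implementation has no such character."
  (loop for name in '("Backspace" "Tab" "Linefeed" "Page" "Return" "Rubout")
        collect (cons name (name-char name))))
```

An entry such as ("Return" . NIL) would mean the implementation has no
#\Return at all, in which case a <CR> in a file has to show up as
something else (or not at all) on a character stream.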

So I guess it really boils down to whether in a given implementation
#\Return exists as a CHARACTER, and what happens when you READ-CHAR
a stream containing one, since READ-SEQUENCE is defined that way:

    READ-SEQUENCE is identical in effect to iterating over the
    indicated subsequence and reading one element at a time from
    stream and storing it into sequence, but may be more efficient
    than the equivalent loop. An efficient implementation is more
    likely to exist for the case where the sequence is a vector with
    the same element type as the stream.

Note that this is *not* the same as asking whether:

    (= (length (file-string "foo"))
       (with-open-file (s "foo")
	 (loop for line = (read-line s nil nil)
	       while line
	   sum (1+ (length line)))))
     ==> T

This clearly might be false on platforms where #\Newline is externally
represented as <CR><LF>, but if #\Return is a (non-graphic) CHARACTER
on those machines, then the following might still be true even if the
above is false:

    (= (length (file-string "foo"))
       (with-open-file (s "foo")
	 (loop for char = (read-char s nil nil)
	       while char
	   count t)))

Note that the former (READ-LINE) test returns NIL on CMUCL under Unix when
given a file containing ASCII NULs (a .tar.gz! ;-} ) but the latter
(READ-CHAR) test still returns T.
It would be interesting to know whether the latter also returns T on
MS-DOS or Windows platforms, and for which CL implementations.
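
For anyone who wants to try it, here is a minimal sketch of such an
experiment. The names WRITE-CRLF-TEST-FILE, FILE-STRING-TRIMMED and
COUNT-CHARS are made up for illustration, as is the file name in the
usage note. The test file is written through a binary stream so the
octets on disk are known exactly, and FILE-STRING-TRIMMED is a variant
that assumes FILE-LENGTH may overcount and so keeps only the part of
the buffer that READ-SEQUENCE actually filled:

```lisp
;; Hypothetical helpers for the experiment; none of these names are standard.

(defun write-crlf-test-file (path)
  "Write the 10 octets a b <CR> <LF> c d <CR> <LF> e f to PATH
  through a binary stream, so no newline translation can occur."
  (with-open-file (s path :direction :output
                          :element-type '(unsigned-byte 8)
                          :if-exists :supersede)
    (write-sequence #(97 98 13 10 99 100 13 10 101 102) s))
  path)

(defun file-string-trimmed (path)
  "Like FILE-STRING, but drops any tail of the buffer that READ-SEQUENCE
  did not fill (e.g. when FILE-LENGTH counts octets, not characters)."
  (with-open-file (s path)
    (let* ((data (make-string (file-length s)))
           (end (read-sequence data s)))
      (subseq data 0 end))))

(defun count-chars (path)
  "Count the characters that READ-CHAR delivers from PATH."
  (with-open-file (s path)
    (loop for char = (read-char s nil nil)
          while char
          count t)))
```

Something like the following should then show what a given
implementation does:

    (let ((p (write-crlf-test-file "crlf-test.tmp")))
      (list (length (file-string-trimmed p)) (count-chars p)))

The two numbers should agree with each other; presumably both are 10
where #\Return is passed through as a character, and smaller where
<CR><LF> is folded into a single #\Newline.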


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607