Subject: Re: How to avoid GC in tight numeric test loop? (CMUCL)
From: rpw3@rpw3.org (Rob Warnock)
Date: Thu, 12 Jun 2008 22:25:13 -0500
Newsgroups: comp.lang.lisp
Message-ID: <A6Odnafh_vwEdczVnZ2dnUVZ_qXinZ2d@speakeasy.net>
Madhu  <enometh@meer.net> wrote:
+---------------
| Robert Maas <rem-2008jun12-001@yahoo.com> wrote:
| | (CMU Common Lisp 18b, ... Python 1.0, target Intel x86)
| | The purpose of this test is to learn (by experiment) what size
| | active memory fits within the CPU cache and what size thrashes the
| | CPU cache thereby significantly slowing performace.
| <snip>
| | So what's allocating all that memory? Is it inside RANDOM??
| 
| Yes. I believe CMUCL's RANDOM conses if its integer argument is greater
| than kernel::random-fixnum-max, which is (expt 2 22).
| [I Checked this on rand-mt19937.lisp on an old version of CMU Common
|  Lisp 2006-09-07 (19D), linux Python 1.1, target Intel x86]
+---------------

Well, when I tried it on CMUCL-19c, when given a literal argument
[but much more about that below!] RANDOM didn't cons at all:

    cmu> arrsiz

    268435456
    cmu> (time (dotimes (i 100000000) (random 268435456)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   2.93f0 seconds of real time
    ;   2.892589f0 seconds of user run time
    ;   5.6f-5 seconds of system run time
    ;   5,428,917,417 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.                  <== NO CONSING
    ; 
    NIL
    cmu> 

The real problems here are two:

First & foremost, Robert forgot to *compile* ONE-ACCESS!!

    cmu> (compile 'one-access)
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ONE-ACCESS
    NIL
    NIL
    cmu> 

Now his example TIME conses *much* less:

    cmu> (time (dotimes (k 50000) (one-access)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.12f0 seconds of real time
    ;   0.069097f0 seconds of user run time
    ;   0.038177f0 seconds of system run time
    ;   222,723,659 CPU cycles
    ;   0 page faults and
    ;   4,698,872 bytes consed.
    ; 
    NIL
    cmu>

That's only ~94 bytes/call. [Actually, varies from 80 to 96, see below.]

But there's a more subtle problem, probably CMUCL-specific, which
is that ONE-ACCESS is calling (RANDOM ARRSIZ), which, despite being
a (near-)constant here, seems to cause consing. Compare this:

    cmu> (time (dotimes (i 10000) (random 268435456)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.0f0 seconds of real time
    ;   2.23f-4 seconds of user run time
    ;   3.f-6 seconds of system run time
    ;   408,602 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.                  <== NO CONSING!!
    ; 
    NIL
    cmu> 

versus this: 

    cmu> arrsiz

    268435456
    cmu> (time (dotimes (i 10000) (random arrsiz)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.01f0 seconds of real time
    ;   0.011814f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   23,014,591 CPU cycles
    ;   0 page faults and
    ;   939,296 bytes consed.
    ; 
    NIL
    cmu> 

Whoa!! *That* was unexpected! [...at least, to me.]
So what's going on here?!? Hmmm...

    cmu> (= arrsiz 268435456)

    T
    cmu> (fixnump arrsiz)

    T
    cmu> (time (random 268435456))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.0f0 seconds of real time
    ;   8.f-6 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   6,113 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.                  <== NO CONSING
    ; 
    246292663
    cmu> (time (let ((x 268435456)) (random x)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.0f0 seconds of real time
    ;   9.f-6 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   6,202 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.                  <== NO CONSING
    ; 
    32540891
    cmu> (time (let ((x arrsiz)) (random x)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.0f0 seconds of real time
    ;   2.1f-5 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   30,686 CPU cycles
    ;   0 page faults and
    ;   96 bytes consed.                 <== **WTF?!?**
    ; 
    174987352
    cmu> 

(*rustle*) (*rustle*)... (*grep*) (*grep*)...

Hmmm... Seems there's an internal sort of "compiler macro" built
into the CMUCL compiler -- but *not* visible to user, that is,
(COMPILER-MACRO-FUNCTION 'RANDOM) ==> NIL -- that partially inlines
calls to RANDOM if the arg meets some narrow conditions that I can't
quite parse. It's in the file "cmucl-19c/src/compiler/x86/arith.lisp",
in the (DEFINE-VOP (RANDOM-MT19937) ...) if anyone wants to look at it.
Anyway, when those conditions are met, the compiler partially inlines
the MT19937 algorithm [still calling out to RANDOM-MT19937-UPDATE as
necessary]; otherwise it just generates a normal out-of-line call
to RANDOM. Compare the output of this:

    (disassemble (lambda () (random 268435456)))

with this:

    (disassemble (lambda () (random arrsiz)))

And, indeed, when ONE-ACCESS is changed to manually inline
the value of ARRSIZ, the consing goes away completely!!

    cmu> (defun one-access ()
	   (setf (aref arr (the (unsigned-byte 28) (random 268435456)))
		 (the (unsigned-byte 8) (random 256))))

    ONE-ACCESS
    cmu> (compile *)
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ONE-ACCESS
    NIL
    NIL
    cmu> (time (dotimes (k 50000) (one-access)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.03f0 seconds of real time
    ;   0.028682f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   60,136,254 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> 

Anyway, the main take-away here is to remember to
*always* compile your functions before trying to analyse
speed and/or consing...


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607