Subject: Re: How much tuning does regular lisp compilers do?
From: rpw3@rpw3.org (Rob Warnock)
Date: Sat, 30 Aug 2008 12:37:43 -0500
Newsgroups: comp.lang.lisp
Message-ID: <itadnRbJmcf6GCTVnZ2dnUVZ_uednZ2d@speakeasy.net>
verec  <verec@mac.com> wrote:
+---------------
| rpw3@rpw3.org (Rob Warnock) said:
| > Note that the same loop took ~4, ~4, ~2, & ~4 CPU cycles/iteration.
| > Why? Because the third compilation was more optimally placed for
| > the CPU's branch prediction (Athlon-32, as it happens).
| 
| Don't you think your example would be more convincing...
+---------------

Well, I really wasn't trying to "convince" anyone, just suggest
to Andreas Davour that alignment of compiled code was something
he might want to look at. But if you insist...  ;-}  ;-}

NOTE: For the following examples, I'll still be using 32-bit CMUCL-19c
as before, but since my laptop is packed up at the moment what follows
will be done on my desktop machine's Athlon-64, not my laptop's Athlon-32
[which is what I used yesterday]. The main difference is that on the
Athlon-64 the "good" & "bad" cycle counts per iteration are "2" & "3"
(though still in 32-bit mode). [Plus, it's ~1.6 times as fast as the
laptop.] But the general effect of alignment on performance is similar.

+---------------
| ...if you were to show us a) that the memory address at which each
| instance of your code has been compiled is indeed cache friendly, ...
+---------------

Well, every instance was *NOT* cache friendly, that was the whole point!

But if you want examples of alignments that are reproducibly fast versus
reproducibly slow, then grab yourself a copy of CMUCL and follow along
at home as I reproduce this. [I'm not going to copy/paste the entire
transcript, only key bits.]

We start by compiling (say) 20 copies of a canonical DOTIMES test
function. I use 2000000000 as the count since that's close to the
maximum that CMUCL will force into native 32-bit integer arithmetic
(when compiled) without explicit declarations.
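[Aside: the fixnum limit itself is much smaller on 32-bit CMUCL; what
matters is that the compiler can derive that the DOTIMES index fits in
an unsigned machine word. A quick REPL check (values as I recall them
from my 32-bit build, so treat as illustrative):]

```lisp
;; On 32-bit CMUCL fixnums are 30-bit, so 2000000000 is NOT a fixnum,
;; but it does fit in an unsigned 32-bit machine word, which is what
;; lets the compiled DOTIMES count in a bare register:
most-positive-fixnum                     ; => 536870911 (2^29 - 1)
(typep 2000000000 'fixnum)               ; => NIL
(typep 2000000000 '(unsigned-byte 32))   ; => T
```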

    cmu> (defparameter *funcs* (coerce
				(loop repeat 20 collect
				  (compile nil (lambda ()
						 (dotimes (i 2000000000)))))
				'vector))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ...[and so on]...

    *FUNCS*
    cmu>

The key inner loop here is only 4 instructions (11 bytes), labeled below
with offsets 0x51-0x5A [note that the "JB" is a 2-byte instruction]:

    cmu> (disassemble (aref *funcs* 0))
    48B68C28:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
    ...[rest of prolog]...
      4A:       MOV     EAX, 0               ; No-arg-parsing entry point
      4F:       JMP     L1
    ;;; [2] (DOTIMES (I 2000000000))
      51: L0:   INC     EAX                  ; [:BLOCK-START]
      52: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
      54:       CMP     ECX, 2000000000
      5A:       JB      L0
    ...[epilog]...
    cmu> (time (funcall (aref *funcs* 0)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   1.85f0 seconds of real time
    ;   1.813294f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   4,103,287,566 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> 

The alignment above is one of the "good" ones that for an Athlon gives
only 2 CPU clocks/iteration (plus a couple of percent of total overhead).
"Bad" alignments give 3 CPU clocks/iteration, as we'll see shortly.

Now TIME all 20 of the functions [you can safely ignore the warnings
you'll get about the "TIME form in a non-null environment, forced to
interpret." That only means that the FUNCALL and the AREF are interpreted,
not the compiled LAMBDAs in *FUNCS* themselves]. From the copious output
that results I've extracted just the "CPU cycles" lines:

    cmu> (dotimes (i 20)
	   (time (funcall (aref *funcs* i))))
    ...
    ;   4,034,516,892 CPU cycles
    ;   6,065,078,446 CPU cycles
    ;   6,058,044,433 CPU cycles
    ;   4,066,357,764 CPU cycles
    ;   4,055,854,124 CPU cycles
    ;   4,043,935,574 CPU cycles
    ;   6,040,178,270 CPU cycles
    ;   4,029,870,642 CPU cycles
    ;   6,065,532,242 CPU cycles
    ;   6,037,903,830 CPU cycles
    ;   6,113,910,026 CPU cycles
    ;   6,066,510,660 CPU cycles
    ;   6,040,727,000 CPU cycles
    ;   4,028,146,010 CPU cycles
    ;   6,046,813,682 CPU cycles
    ;   6,062,487,970 CPU cycles
    ;   4,038,351,352 CPU cycles
    ;   6,035,721,860 CPU cycles
    ;   6,039,610,976 CPU cycles
    ;   6,065,662,764 CPU cycles
    ...
    cmu> 

Since we're looping two billion times, that says that (as expected)
a given copy of the function takes either two or three CPU cycles
per iteration of the inner loop. If you capture those lines into a
string (*CYCLE-LINES* below) and pick out the leading digits you get:

    cmu> (loop repeat 20 for i from 4 by 33
	   collect (parse-integer (subseq *cycle-lines* i (1+ i))))

    (4 6 6 4 4 4 6 4 6 6 6 6 6 4 6 6 4 6 6 6)
    cmu> (defparameter *cycles* *)

    *CYCLES*
    cmu> (count 4 *cycles*)

    7
    cmu> 
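
[*CYCLE-LINES* above wasn't shown being built; it's just the "CPU cycles"
lines captured into one string. A hypothetical reconstruction, mirroring
the *TIME-LINES* capture shown further down:]

```lisp
;; Hypothetical: run the 20 TIMEs with *TRACE-OUTPUT* captured to a
;; string, then keep only the "CPU cycles" lines, concatenated
;; (newlines included) into a single string:
(defparameter *cycle-lines*
  (with-output-to-string (out)
    (with-input-from-string
        (s (with-output-to-string (*trace-output*)
             (dotimes (i 20)
               (time (funcall (aref *funcs* i))))))
      (loop for line = (read-line s nil nil)
            while line
            when (search "CPU cycles" line)
              do (write-line line out)))))
```

[The exact line formatting is what determines the FROM 4 BY 33 stride
used in the parse above.]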

(Aside: If CMUCL compiled code alignment were uniformly random, this would
 suggest that roughly 1/3 of the alignments are "good" [2 CPU cycles
 per iteration] and the other 2/3 are "bad" [3 CPU cycles/iteration].
 But in fact [skipping ahead a bit] it's really more like a 50/50 split,
 and the above is skewed only due to an overly small sample size.)

To help look at the alignments in detail, let's collect all the
disassemblies:

    cmu> (defparameter *dis-lines*
	   (coerce
	    (let ((dis (with-output-to-string (*standard-output*)
			 (dotimes (i 20)
			   (disassemble (aref *funcs* i))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))

    *DIS-LINES*
    cmu> (length *dis-lines*)

    640
    cmu>

So each one is 32 lines long. [We'll need that later.]

    cmu> (loop for line across *dis-lines*
	   when (search ".ENTRY" line )
	     collect line)

    ("48B68C28:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4800B620:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480244B0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4803AD98:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480519D8:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4806A4B8:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48080E80:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48097838:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480AE360:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480C5130:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480DBA20:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480F2360:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48108CE0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4811F668:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48136360:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4814CDB0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48163838:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4817A600:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48191290:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "481A7EA0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)")
    cmu> 

CMUCL-19c always allocates objects on 8-byte boundaries, as you can see.
The cache line size of the Athlon is 64 bytes, so there should be only 
8 cases we need to distinguish. Let's match them up against "fast" and
"slow" execution:

    cmu> (loop for line in *
	   collect (mod (parse-integer (subseq line 6 8) :radix 16)
			64))

    (40 32 48 24 24 56 0 56 32 48 32 32 32 40 32 48 56 0 16 32)
    cmu> (defparameter *offsets* *)

    *OFFSETS*
    cmu> (loop for cycles in *cycles*
	       and offset in *offsets*
	   collect (list cycles offset))

    ((4 40) (6 32) (6 48) (4 24) (4 24) (4 56) (6 0) (4 56) (6 32) (6 48)
     (6 32) (6 32) (6 32) (4 40) (6 32) (6 48) (4 56) (6 0) (6 16) (6 32))
    cmu> (stable-sort (copy-list *) #'< :key #'first)

    ((4 40) (4 24) (4 24) (4 56) (4 56) (4 40) (4 56) (6 32) (6 48) (6 0)
     (6 32) (6 48) (6 32) (6 32) (6 32) (6 32) (6 48) (6 0) (6 16) (6 32))
    cmu> (stable-sort (copy-list *) #'< :key #'second)

    ((6 0) (6 0) (6 16) (4 24) (4 24) (6 32) (6 32) (6 32) (6 32) (6 32)
     (6 32) (6 32) (4 40) (4 40) (6 48) (6 48) (6 48) (4 56) (4 56) (4 56))
    cmu> 

This is very interesting: When the offset of the start of the function
is a multiple of 16, the code runs slow; when the offset is 8 *plus*
a multiple of 16, the code runs fast.
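
[You can confirm that pairing mechanically, using the *CYCLES* and
*OFFSETS* lists built above:]

```lisp
;; Every 2-cycle/iteration function ("4" leading digit) starts at
;; 16n+8; every 3-cycle/iteration function ("6") starts at 16n:
(loop for cycles in *cycles*
      and offset in *offsets*
      always (= (mod offset 16)
                (if (= cycles 4) 8 0)))
==> T
```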

Is this reproducible? I won't show all my experiments here, but the
answer is a definite "Yes!!". If you repeat the timing tests -- *without*
doing a GC!!! -- the times stay the same, and the correspondences between
time & function index stay the same:

    cmu> (defparameter *time-lines*
	   (coerce
	    (let ((dis (with-output-to-string (*trace-output*)
			 (dotimes (i 20)
			   (time (funcall (aref *funcs* i)))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))
    ...[lots of warnings about "TIME form in a non-null environment"]...
    *TIME-LINES*
    cmu> (loop for line across *time-lines*
	   when (search "CPU cycles" line)
	     collect (parse-integer line :start 4 :end 5))

    (4 6 6 4 4 4 6 4 6 6 6 6 6 4 6 6 4 6 6 6)
    cmu> (equal * *cycles*)

    T
    cmu> 

So functions which were fast just after being compiled stay fast
upon repeated execution, and functions which were slow stay slow.

From the disassembly above, we know that the code of the inner loop of
the DOTIMES is 11 bytes long. There are only two possible alignments
of interest, "16*n" and "(16*n)+8", but why is the "(16*n)+8" alignment
the *fast* one?!?

For that we have to look at the alignment of the inner loop code itself,
*not* the alignments of the start of the functions. The inner loop code
is at disassembly lines 13, 15, 16, & 17 [from doing a full disassembly
and manually counting, not any predetermined magic], so you can examine
all of them this way [I'll only show the first few]:

    cmu> (loop for j from 0 by 32 below (length *dis-lines*)
	       and time in *cycles*
	   do (format t "===== ~s =====~%" (if (= time 4) "FAST" "SLOW"))
	      (loop for i in '(0 13 15 16 17)
		do (write-line (aref *dis-lines* (+ i j)))))
    ===== "FAST" =====
    48B68C28:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  51: L0:   INC     EAX                  ; [:BLOCK-START]
	  52: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  54:       CMP     ECX, 2000000000
	  5A:       JB      L0
    ===== "SLOW" =====
    4800B620:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  49: L0:   INC     EAX                  ; [:BLOCK-START]
	  4A: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  4C:       CMP     ECX, 2000000000
	  52:       JB      L0
    ===== "SLOW" =====
    480244B0:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  D9: L0:   INC     EAX                  ; [:BLOCK-START]
	  DA: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  DC:       CMP     ECX, 2000000000
	  E2:       JB      L0
    ===== "FAST" =====
    4803AD98:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  C1: L0:   INC     EAX                  ; [:BLOCK-START]
	  C2: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  C4:       CMP     ECX, 2000000000
	  CA:       JB      L0
    ...[trimmed]...
    cmu> 

So for the "fast" cases the inner loop code never crosses a 16-byte
boundary, and for the "slow" cases it does. But why is a 16-byte
boundary important, and not the 64-byte cache line size? A possible
answer for that I dug out of an old copy of Microprocessor Report
on the AMD K7, which notes that the K7 pipeline fetches 16 bytes of
instructions from the primary I-cache per cycle. *AHA!* All becomes clear!
[Yes, I know I'm running on a K8, but I'm guessing it's the same.]
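
[The arithmetic, for the four disassemblies shown above: the 11-byte
loop is "fast" exactly when its first and last bytes land in the same
16-byte fetch block. A quick sketch:]

```lisp
;; Does an 11-byte loop starting at START straddle a 16-byte
;; instruction-fetch boundary?
(defun crosses-fetch-block-p (start &optional (len 11))
  (/= (floor start 16)
      (floor (+ start len -1) 16)))

;; The four loop start offsets shown above, in transcript order:
(mapcar (lambda (start)
          (if (crosses-fetch-block-p start) "SLOW" "FAST"))
        '(#x51 #x49 #xD9 #xC1))
==> ("FAST" "SLOW" "SLOW" "FAST")
```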

As further confirmation, if one now forces a GC [CMUCL has a copying GC],
then all those functions will get moved around, and the alignments will
all change, and with them the run-times under TIME. But what does *not*
change is the pairing between the new function alignments and which
functions now run "fast" or "slow":

    cmu> (gc :full t)

    ; [GC threshold exceeded with 6,902,768 bytes in use.  Commencing GC.]
    ; [GC completed with 1,441,256 bytes retained and 5,461,512 bytes freed.]
    ; [GC will next occur when at least 13,441,256 bytes are in use.]

    NIL
    cmu> (gc :full t)
    ; [GC threshold exceeded with 1,445,288 bytes in use.  Commencing GC.]
    ; [GC completed with 1,437,264 bytes retained and 8,024 bytes freed.]
    ; [GC will next occur when at least 13,437,264 bytes are in use.]

    NIL
    cmu> (defparameter *dis-lines/2*
	   (coerce
	    (let ((dis (with-output-to-string (*standard-output*)
			 (dotimes (i 20)
			   (disassemble (aref *funcs* i))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))

    *DIS-LINES/2*
    cmu> (defparameter *offsets/2*
	   (loop for line across *dis-lines/2*
	     when (search ".ENTRY" line )
	      collect (mod (parse-integer (subseq line 6 8) :radix 16)
			   64)))

    *OFFSETS/2*
    cmu> *offsets/2*

    (16 8 0 40 32 24 16 8 24 0 56 48 40 32 24 16 8 0 56 48)
    cmu> (defparameter *time-lines/2*
	   (coerce
	    (let ((dis (with-output-to-string (*trace-output*)
			 (dotimes (i 20)
			   (time (funcall (aref *funcs* i)))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))
    ...[lots of warnings about "TIME form in a non-null environment"]...
    *TIME-LINES/2*
    cmu> (defparameter *cycles/2*
		       (loop for line across *time-lines/2*
			 when (search "CPU cycles" line)
			   collect (parse-integer line :start 4 :end 5)))

    *CYCLES/2*
    cmu> *cycles/2*

    (6 4 6 4 6 4 6 4 4 6 4 6 4 6 4 6 4 6 4 6)
    cmu> (loop for cycles in *cycles/2*
	       and offset in *offsets/2*
	   collect (list cycles offset))

    ((6 16) (4 8) (6 0) (4 40) (6 32) (4 24) (6 16) (4 8) (4 24) (6 0)
     (4 56) (6 48) (4 40) (6 32) (4 24) (6 16) (4 8) (6 0) (4 56) (6 48))
    cmu> (stable-sort (copy-list *) #'< :key #'first)

    ((4 8) (4 40) (4 24) (4 8) (4 24) (4 56) (4 40) (4 24) (4 8) (4 56)
     (6 16) (6 0) (6 32) (6 16) (6 0) (6 48) (6 32) (6 16) (6 0) (6 48))
    cmu> (stable-sort (copy-list *) #'< :key #'second)

    ((6 0) (6 0) (6 0) (4 8) (4 8) (4 8) (6 16) (6 16) (6 16) (4 24) (4 24)
     (4 24) (6 32) (6 32) (4 40) (4 40) (6 48) (6 48) (4 56) (4 56))
    cmu> 

Here again, function starting offsets of "16*n" are "slow" and offsets
of "(16*n)+8" are "fast", even though the "same" functions have different
offsets/speeds than they did before the relocation (due to the GC).

You can repeat this sequence all day long, and you'll get the same
correlation of speed with alignment, even though for any single function
*its* alignment (and thus speed) may well vary from one GC to the next.

This makes benchmarking Lisp programs with copying garbage collectors
very "interesting", to say the least.  ;-}  ;-}

+---------------
| b) that cache friendliness is preserved accross VM mapping...
+---------------

I'm not sure exactly what you're trying to ask here, but consider
that the inner loop in question is only 11 bytes long, so it either
all fits in a single primary I-cache fetch or it's broken between
exactly two such fetches [given that 32-bit CMUCL always aligns
objects on 8-byte boundaries]. Anyway, since pages consist of an
even number of cache lines, I don't think the "VM mapping" (whatever
you mean by that) matters very much in this particular case.

+---------------
| and c) that no other process on your system was stealing time away,
| 3 runs out of 4 ?
+---------------

There's no way to convince you of that remotely, of course, but I
do know what's running on my systems, and no, there were no other
significant CPU loads during the above TIME runs. By "significant"
I mean loads that would skew the first two decimal digits of the
CPU's cycle count [which CMUCL reads with the x86 "RDTSC" instruction].


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607