Subject: Re: data hygiene [Re: Why is Scheme not a Lisp?]
From: Erik Naggum <erik@naggum.net>
Date: Fri, 22 Mar 2002 07:12:25 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3225769957792533@naggum.net>

* Matthias Blume
| Suppose you can "spawn" as many sub-threads as you like at any point,
| but following a pointer has a fixed cost.

  Are you trying to make me pretend that spawning a sub-thread is free
  and following a pointer has a cost?  This is the part I simply reject.

| In the plist case, you need at least 2n time steps to traverse n
| key-value pairs because the fetch operations (mostly CDRs) are all
| data-dependent on their respective predecessors.  In the alist case,
| one thread could "run down" the spine of the alist in roughly n steps,
| spawning a new thread at every CAR.  The subthreads then independently
| check to see if their respective key is the one that we are looking
| for.  This latter work can effectively overlap with the guy running
| down the spine, so the total time can be less than in the plist case.

  I see that several people have been working this out on a mythical
  computer, not extant hardware.  I am frankly completely unimpressed
  with the arguments that have been made so far.  One does not optimize
  theoretical code for theoretical computers.

| On real hardware, _if_ the key comparison is cheap (e.g., EQ), then
| each of those subthreads will live only very shortly, so the amount of
| effective parallelism that needs to be supported by the hardware is
| bounded (and rather small).

  Well, at least they have to live long enough to grab the cons cell
  pointed to by the car and the contents of the car of that cell.  Now,
  it is all well and good to have a CPU that can pipeline anything you
  want, but we are actually trying to talk to the same _memory_, right?
  And that is not the memory in either L1 or L2 cache, right?  How many
  requests can the _memory_ pipeline from data that are going to be very
  close?  I mean, my CPU has 32-byte cache lines, meaning it will
  usually not do any less than 32-byte fetches, but the crucial thing is
  to prefetch _both_ car and cdr of all cells visited in order to obtain
  maximal memory performance.  To make this work, one would have to do
  massive analysis of the addresses that are actually fetched.  This is
  actually possible as long as one has to fetch from real memory instead
  of the L1 or L2 caches, or even an L3 cache.

  In order to make this a testable condition, there is no point in
  scanning some short list.  The list has to be at least 1 million
  entries long for the test conditions to ensure that all data cache
  lines are flushed.  To make that work, there would be some savings in
  cdr'ing ahead to fill half the L1 cache, so that the cons cells, which
  I assume will occupy 2 adjacent words, are already in the L1 cache by
  the time their cars are visited.  However, in total, there will be
  exactly as many memory references with a plist as with an alist,
  because they use exactly as many cons cells.

  I maintain that on, e.g., the IA32, which I have spent some time
  learning how to optimize the living daylights out of, you will never
  see any difference in performance between an alist and a plist.  There
  is no magic in this, just the willingness not to "suppose" one can do
  something that no processor actually does.
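  To make the cons-cell arithmetic concrete, here is a minimal sketch of
  eq-only lookups over the two representations.  These are not the
  plist-get and alist-get discussed earlier in the thread, and the names
  are only illustrative, but they show the access pattern: the plist
  lookup chases 2n cdrs down a single chain, while the alist lookup
  chases n spine cdrs plus one car into each pair, so both touch exactly
  as many cons cells.

    (defun plist-lookup (indicator plist)
      ;; two cons cells per entry; every fetch depends on the previous cdr
      (loop for (key value) on plist by #'cddr
            when (eq key indicator)
              return value))

    (defun alist-lookup (key alist)
      ;; one spine cell per entry, plus one pair cell reached via its car
      (loop for entry in alist
            when (eq (car entry) key)
              return (cdr entry)))

  And a rough harness for the million-entry condition, with fixnum keys,
  which most implementations happen to compare with eq even though the
  standard only promises eql; build-lists is likewise just a name made
  up for this sketch:

    (defun build-lists (n)
      ;; plist: (0 0 1 1 ...), alist: ((0 . 0) (1 . 1) ...)
      (loop for i below n
            collect i into plist
            collect i into plist
            collect (cons i i) into alist
            finally (return (values plist alist))))

    (multiple-value-bind (plist alist) (build-lists 1000000)
      ;; look up the last key so every cons cell has to come in from memory
      (time (plist-lookup 999999 plist))
      (time (alist-lookup 999999 alist)))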
| Obviously, as Joe's tests demonstrate, getting any real advantage out
| of this on a real system is hard.  It would require tight coding,
| hardware that can actually do something like the above, at least in
| some sense, and would probably still not matter for relatively short
| lists.

  None of the compiled code I have seen has used prefetch instructions,
  and the pipeline on, e.g., IA32, is not long enough to see these
  actual memory fetches coming.  Therefore, current tests exhibit
  nothing of the kind we would like to prove in this exercise.  In order
  to measure the impact of any savings, the best approach is probably to
  have a forerunner that visits both cars and cdrs in an alist and the
  cdrs in a plist.  It should probably run at _least_ 8 cons cells ahead
  of their use, but something like 100 will probably make more sense, in
  order to give the memory time to fill up the cache lines.  This also
  indicates that there are very significant savings to be had from
  starting the storage of any object on a cache line boundary.

  But suppose your list is short enough that all cons cells can fit in
  the L1 cache.  The best optimization approach is simply to call length
  on it to move all the cons cells into the L1 cache, and then to make
  sure that they stay there between each time you need to scan the list.
  If you manage to expire the cache lines between calls to either assoc
  or getf, there is no point in this exercise at all, since the next
  time the list is needed, you will have to go out and prefetch it
  anyway, and since both assoc and getf must return the first
  occurrence, the likelihood of prefetching memory that you will never
  use is pretty high.

| By the way, this is all pretty silly anyway.  If I had to implement a
| dictionary data structure where the number of keys is large enough so
| that the above would matter, then I would *definitely* use something
| else that has better theoretical (and practical) complexity.

  Yup.  I have been trying to argue that assoc is more expensive because
  it is way more general and may do a _lot_ more work than getf, which
  could be as simple as my plist-get.  To get a user call to assoc to be
  as tight as my alist-get requires a lot of effort in both compiler and
  user code.

  I think, however, that I have demonstrated exhaustively that there is
  no performance difference between alist and plist, and that
  performance should never enter the picture when deciding which to use.
  For instance, performance is likely to be good for get-properties and
  destructuring using keyword arguments, neither of which has a parallel
  for alists.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.