Subject: Re: data hygiene [Re: Why is Scheme not a Lisp?]
From: Erik Naggum <erik@naggum.net>
Date: Fri, 22 Mar 2002 10:50:22 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3225783034450486@naggum.net>

* Duane Rettig
| What is critical is that the machine has the ability to do three memory
| references at once.

  That is indeed a critical piece of information.  I have reasoned based
  on explicit prefetch instructions because the pipeline is not long
  enough to _sustain_ three outstanding requests, especially considering
  the destructive effect of taken branches on the pipeline, and the
  amount of time it takes to fetch from external memory compared to the
  on-chip L1 cache, which will hold the entire cons cell in the same
  cache line, unless allocation is really screwed up.  Unrolling loops
  has the significant disadvantage that execution needs more memory
  references.  I have severe problems seeing how you can possibly
  achieve code with three outstanding requests at all times, not to
  mention the severe shortage of registers on IA32.  Are there any CPUs
  that can _actually_ achieve this?  If none, it is an interesting
  exercise in How Things Should Be, but not in optimizing code for real
  processors.

  Sadly, I am far more interested in actual CPUs than imaginary ones,
  but since acquiring a working knowledge of a processor family is
  pretty much depth-first, I have working knowledge of only one current
  architecture, and I am probably not up to speed on what IA32 can do,
  either.  From what I can find in the documentation, however, I am
  unable to write code for this architecture that will _actually_
  achieve more than two concurrent accesses to implement assoc (or my
  simplified alist-get), even though the CPU is supposed to be able to
  handle four.  There are two obvious reasons for this: (1) memory
  accesses that end up with an L1 cache hit are not worth optimizing
  further, and (2) if you have grabbed either the car or the cdr of a
  cons cell, the other access is free.  Even if you have to hit the L2
  cache or, worse, request the cell from real memory, you still get a
  64-bit unit back from the memory subsystem.

  In other words, if you keep moving two cdrs ahead in a plist, the car
  is there for free, whereas the alist lookup must access both the car
  and the cdr of the given cell.  Yes, there are savings to be had in
  pipelining this, but it has no practical edge over plists, because we
  are not optimizing just the processor pipeline: we have to request the
  memory, and that request has to take some time to complete, before any
  of this matters.

  I sort of regret having to be incredibly concrete and low-level, but
  that is where these things actually happen.  It is very convenient to
  optimize some abstract access model, but it has to be implemented, and
  it has to be possible to generate instructions for it, before it
  actually matters.  I have yet to see that be true.

  I guess what I keep saying is that optimizing for a real processor is
  not some theoretical task that can ignore the actual processor.  You
  alluded to measurements that indicated a material difference, Duane,
  and I would just love to see them, along with the code that produced
  them, on various processors.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
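
  For concreteness, the two access patterns under discussion look
  roughly like this; this is only a sketch of the simplified alist-get
  mentioned above and a plist analogue of getf, with names and details
  of my own invention, not code from the thread or from any actual
  implementation:

    (defun alist-get (key alist)
      ;; Each step reads the car and cdr of the spine cell, plus the
      ;; car of the (key . value) cell it points to, which may live
      ;; anywhere in memory -- the extra reference discussed above.
      (dolist (cell alist)
        (when (eq key (car cell))
          (return (cdr cell)))))

    (defun plist-get (key plist)
      ;; Each step strides two cdrs ahead; the indicator in the car
      ;; comes out of the same cons cell (and cache line) as the cdr
      ;; just followed, so it is "there for free".
      (loop for (indicator value) on plist by #'cddr
            when (eq key indicator)
              return value))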