Re: Followup to "Effect of multiple-processors on memory allocation"

William D Clinger <cesura17@yahoo.com> wrote:
+---------------
| Adam Warner quoting Andy Glew:
| > Other companies such as Hewlett-Packard (HP), Intel, and Silicon
| > Graphics (SGI) have chosen to support sequential consistency in
| > their current architectures.
| 
| They did that through a combination of hardware and software
| techniques.  In particular, support for sequential consistency
| can involve compilers that emit memory barriers as needed, and are
| careful not to perform certain optimizations that would have been
| transparent on older architectures.
+---------------

I can't speak for H-P & Intel, but the SGI "Origin" (MIPS/Irix) and
"Altix" (IPF/Linux) ccNUMA systems [with many hundreds of CPUs!!]
*do* provide sequential consistency in the *hardware* -- no software
support needed[1], and *without* any global snooping[2] -- using
DASH-style directory-based cache coherency [where the global state
of a given cache line is held-in/owned-by the *memory* subsystem(s),
not the CPU(s)].

[1] After boot time, that is. At boot time the O/S is of course
    required to set up the cache system and the global memory
    switching fabric to "do the right thing" later, at runtime.

[2] There is of course "local" snooping in the front-side busses
    of each CPU by the cache-directory system, to turn local cache
    misses (reads) and writes into global cache-line reads and
    cache-line write-invalidates, respectively. [To the CPUs the
    cache-directory system appears to be "just another CPU" on
    the same front-side bus.]
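
For anyone who hasn't met directory-based coherency before, here is a
toy sketch of the idea in Python. It is only an illustration under my
own simplifying assumptions, not the actual Origin/Altix protocol: the
class names, the two message types, and the owner/sharers bookkeeping
are all made up for the example. The point it tries to show is that
the memory side tracks who holds each line, so a write only has to
message the CPUs actually recorded as holders; nothing is broadcast.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DirectoryEntry:
    """Global state of one cache line, kept by the memory side (hypothetical layout)."""
    owner: Optional[int] = None                     # CPU holding the line exclusively
    sharers: Set[int] = field(default_factory=set)  # CPUs holding read-only copies


class Directory:
    """Toy directory: records holders per line and logs point-to-point messages."""

    def __init__(self):
        self.lines = {}
        self.messages = []  # (kind, target_cpu, addr) tuples, for inspection only

    def _entry(self, addr):
        return self.lines.setdefault(addr, DirectoryEntry())

    def read_miss(self, cpu, addr):
        e = self._entry(addr)
        if e.owner == cpu:
            return  # requester already owns the line
        if e.owner is not None:
            # Ask the current owner to write back; it keeps a shared copy.
            self.messages.append(("writeback", e.owner, addr))
            e.sharers.add(e.owner)
            e.owner = None
        e.sharers.add(cpu)

    def write(self, cpu, addr):
        e = self._entry(addr)
        # Invalidate only the CPUs the directory knows about.
        for other in sorted(e.sharers - {cpu}):
            self.messages.append(("invalidate", other, addr))
        if e.owner is not None and e.owner != cpu:
            self.messages.append(("invalidate", e.owner, addr))
        e.sharers.clear()
        e.owner = cpu


d = Directory()
d.read_miss(0, 0x100)
d.read_miss(1, 0x100)
d.write(2, 0x100)   # messages go to CPUs 0 and 1 only
print(d.messages)
```

With hundreds of CPUs the number of messages for that write is still
two, which is the whole attraction compared to snooping everybody.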

+---------------
| You should read Boehm's paper. He used to work at SGI, now works
| at HP, and is quite familiar with their architectures and compilers.
+---------------

Yes, but AFAIK the only compiler tweaks [other than for pure
performance] were to fix bugs in the CPUs [e.g., the famous
"branch from the last word of a page" brokenness in R4000].

This *has* to be the case, actually, since with "Origin" & "Altix"
all of the DMA devices also participate in the global cache coherency,
and most of them are hard-coded silicon, with no "compiler tweaks"
even possible.

Though note that for the highest-performance DMA devices -- ones
with multiple connections [either physical paths or just multiple
"virtual channels"] into the memory switching fabric -- special
ordering considerations *were* necessary in a very few cases
where one needed to coerce sequential consistency into a total
order. [And, unfortunately, for a few devices interrupts and
DMA traffic went into the memory switching fabric over different
"virtual channels", and there again...]
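
To make the interrupt-vs-DMA hazard concrete, here is a small Python
sketch. It rests on an assumption of mine for illustration, namely
that each virtual channel delivers its own traffic in order but
nothing orders one channel against the other; the event names are
invented. It simply enumerates the arrival orders that are then
possible and counts those where the interrupt beats the last DMA
write.

```python
def merges(a, b):
    """Yield every interleaving of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in merges(a[1:], b):
        yield [a[0]] + rest
    for rest in merges(a, b[1:]):
        yield [b[0]] + rest


# Hypothetical traffic: two DMA writes on one channel, the completion
# interrupt on another.
dma_channel = ["dma_write_0", "dma_write_1"]
irq_channel = ["interrupt"]


def interrupt_is_early(arrival):
    """True if the interrupt arrives before the final DMA write has landed."""
    return arrival.index("interrupt") < arrival.index("dma_write_1")


orders = list(merges(dma_channel, irq_channel))
early = [o for o in orders if interrupt_is_early(o)]

for o in orders:
    print(o, "<- interrupt early" if interrupt_is_early(o) else "")
```

Under that assumption only one of the three possible arrival orders
is the one the driver writer was hoping for, so something has to
impose the order explicitly.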

Ob. credentials: I worked for ~13 years at SGI designing/debugging
very high performance networking and cluster-interconnect adapters,
and had to deal with this stuff on pretty much a daily basis.


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607