William D Clinger <cesura17@yahoo.com> wrote:
+---------------
| Adam Warner quoting Andy Glew:
| > Other companies such as Hewlett-Packard (HP), Intel, and Silicon
| > Graphics (SGI) have chosen to support sequential consistency in
| > their current architectures.
|
| They did that through a combination of hardware and software
| techniques. In particular, support for sequential consistency
| can involve compilers that emit memory barriers as needed, and are
| careful not to perform certain optimizations that would have been
| transparent on older architectures.
+---------------
I can't speak for H-P & Intel, but the SGI "Origin" (MIPS/Irix) and
"Altix" (IPF/Linux) ccNUMA systems [with many hundreds of CPUs!!]
*do* provide sequential consistency in the *hardware* -- no software
support needed[1], and *without* any global snooping[2] -- using
DASH-style directory-based cache coherency [where the global state
of a given cache line is held-in/owned-by the *memory* subsystem(s),
not the CPU(s)].

[1] After boot time, that is. At boot time the O/S is of course
required to set up the cache system and the global memory
switching fabric to "do the right thing" later, at runtime.
[2] There is of course "local" snooping in the front-side busses
of each CPU by the cache-directory system, to turn local cache
misses (reads) and writes into global cache-line reads and
cache-line write-invalidates, respectively. [To the CPUs the
cache-directory system appears to be "just another CPU" on
the same front-side bus.]
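
For anyone who hasn't met directory-based coherency, here is a toy
sketch of the idea in Python. It is *not* the actual Origin/Altix
protocol: the class name, the method names, and the one-value-per-line
simplification are all made up for illustration. It only shows that the
per-line state (sharers/owner) lives with the memory, and that a write
sends point-to-point invalidates to the CPUs the directory lists rather
than being broadcast to everybody.

```python
class Directory:
    """Toy home-node directory (hypothetical, for illustration only).

    The state of every cache line is kept here, next to memory,
    not in the CPUs. Each "line" holds a single value.
    """

    def __init__(self, n_cpus):
        self.memory = {}     # line address -> value held in memory
        self.sharers = {}    # line address -> set of CPU ids with a read copy
        self.owner = {}      # line address -> CPU id holding the line exclusively
        self.caches = [dict() for _ in range(n_cpus)]  # per-CPU cache contents
        self.invalidates_sent = 0  # counts point-to-point invalidate messages

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr in cache:
            return cache[addr]               # local hit, directory not involved
        # Local miss becomes a global cache-line read.
        prev_owner = self.owner.pop(addr, None)
        if prev_owner is not None:
            # The exclusive owner writes the line back and keeps a shared copy.
            self.memory[addr] = self.caches[prev_owner][addr]
            self.sharers.setdefault(addr, set()).add(prev_owner)
        value = self.memory.get(addr, 0)
        cache[addr] = value
        self.sharers.setdefault(addr, set()).add(cpu)
        return value

    def write(self, cpu, addr, value):
        if self.owner.get(addr) != cpu:
            # Global write-invalidate: only CPUs listed in the directory are told.
            targets = set(self.sharers.pop(addr, set()))
            prev_owner = self.owner.pop(addr, None)
            if prev_owner is not None:
                targets.add(prev_owner)
            targets.discard(cpu)
            for other in targets:
                self.caches[other].pop(addr, None)
                self.invalidates_sent += 1
            self.owner[addr] = cpu
        self.caches[cpu][addr] = value


d = Directory(n_cpus=4)
d.read(0, 0x40)          # CPU 0 and CPU 1 take shared copies
d.read(1, 0x40)
d.write(2, 0x40, 7)      # CPU 2 writes: exactly two invalidates go out
```

Note that CPU 3 never hears about any of this traffic, which is the
whole point of avoiding global snooping on a machine with hundreds
of CPUs.
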
+---------------
| You should read Boehm's paper. He used to work at SGI, now works
| at HP, and is quite familiar with their architectures and compilers.
+---------------
Yes, but AFAIK the only compiler tweaks [other than for pure
performance] were to fix bugs in the CPUs [e.g., the famous
"branch from the last word of a page" brokenness in R4000].

This *has* to be the case, actually, since with "Origin" & "Altix"
all of the DMA devices also participate in the global cache coherency,
and most of them are hard-coded silicon, with no "compiler tweaks"
even possible.

Though note that for the highest-performance DMA devices -- ones
with multiple connections [either physical paths or just multiple
"virtual channels"] into the memory switching fabric -- special
ordering considerations *were* necessary in a very few cases
where one needed to coerce sequential consistency into a total
order. [And, unfortunately, for a few devices interrupts and
DMA traffic went into the memory switching fabric over different
"virtual channels", and there again...]
Ob. credentials: I worked for ~13 years at SGI designing/debugging
very high performance networking and cluster-interconnect adapters,
and had to deal with this stuff on pretty much a daily basis.
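
To make the multiple-"virtual channel" ordering hazard mentioned above
concrete, here is a small Python sketch. It is a generic illustration,
not a description of any particular SGI device: the channel names
("vc0", "vc1"), the message names ("data", "done"), and both functions
are hypothetical. The assumption is that each channel delivers in FIFO
order but that nothing orders one channel against another, so a
completion notice can overtake the DMA data it is supposed to announce.

```python
from itertools import permutations


def arrival_orders(messages):
    """Return every arrival order that preserves FIFO order per channel.

    messages: list of (channel, name) tuples in the order the device
    issued them. Messages on different channels may arrive in any
    relative order; messages on the same channel may not be reordered.
    """
    orders = []
    for perm in permutations(range(len(messages))):
        legal = True
        for a in range(len(perm)):
            for b in range(a + 1, len(perm)):
                i, j = perm[a], perm[b]
                # i arrives before j; illegal if same channel and i was issued later
                if messages[i][0] == messages[j][0] and i > j:
                    legal = False
        if legal:
            orders.append([messages[k][1] for k in perm])
    return orders


def stale_read_possible(messages):
    """True if some legal arrival order delivers 'done' before 'data'."""
    return any(o.index("done") < o.index("data")
               for o in arrival_orders(messages))


# Data and completion on different channels: the notice can win the race.
split = [("vc0", "data"), ("vc1", "done")]
# Both on the same channel: FIFO delivery keeps them in issue order.
same = [("vc0", "data"), ("vc0", "done")]

print(stale_read_possible(split))
print(stale_read_possible(same))
```

Putting both messages on one channel is only one way to force the
order in this toy model; having the device wait for the data write to
be acknowledged before issuing the completion would be another.
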
-Rob
-----
Rob Warnock <rpw3@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607