Subject: Re: network programming
From: (Rob Warnock)
Date: Mon, 17 Dec 2007 04:00:59 -0600
Newsgroups: comp.lang.lisp
Message-ID: <>
George Neuner  <gneuner2/@/> wrote:
| (Rob Warnock) wrote:
| >+---------------
| >| TCP on a LAN is serious overkill ...
| >+---------------
| >
| >Again, rubbish. With a single TCP session any modern OS/CPU/NIC
| >combination can easily saturate even a gigabit Ethernet [~117 MB/s
| >of user data end-to-end, within 1% of "perfect"].
| I don't know what you think you're measuring, but 117 MB/s user data
| isn't even possible without compression.

Of course it is!! GbE is *125* MB/s peak user data rate; 118.66 MB/s
is the theoretical maximum sustained user data rate you get when you
take out link-level framing, IP, and TCP overheads. 117+ MB/s -- a
little less than that -- is readily attainable on commodity hardware.

| For networks, giga means 10^9, not 2^30.

Of course. And by 117 MB/s I meant 117,000,000 (decimal) bytes per second.

| Ethernet is async requiring 10 bits/byte.

***BZZZTTT!!*** Sorry, back to school. Ethernet was *never* "async",
always *synchronous* [once the initial preamble has locked up the
receive PLL]. Transmission on copper GbE (1000BASE-T) runs at 125 MBaud
(8 ns/symbol), using PAM-5 coding on each pair, with the link-level
symbols encoded on all four twisted pairs at the same time. PAM-5 times
4 pairs = (expt 5 4) = 625 different symbols possible per signaling
interval. Of those, only the "best" 512 symbols are used for user
data [some of the others are used for framing], alternating between
two sets of 256 so as to maintain clocking and also shaping the
radiated EMI to stay within FCC limits. That means that for each
signaling interval there's really only a choice of 256 different
symbols, which are used to encode 8 user data bits. 125 MBaud times
8 bits/symbol = 1000 Mbit/s (125 MB/s) of *user* data (peak).
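
For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The constant names are mine, chosen for illustration; only the numbers come from the paragraph above.

```python
# Sanity check of the signaling arithmetic described above.
PAM_LEVELS = 5             # PAM-5: five levels per pair
PAIRS = 4                  # all four twisted pairs carry a symbol at once
BAUD = 125_000_000         # 125 MBaud
DATA_BITS_PER_SYMBOL = 8   # user data bits carried per signaling interval

code_points = PAM_LEVELS ** PAIRS                # 625 possible 4-pair symbols
data_symbols = 2 ** DATA_BITS_PER_SYMBOL         # 256 needed for 8 data bits
symbol_time_ns = 1e9 / BAUD                      # 8 ns per signaling interval
peak_bits_per_s = BAUD * DATA_BITS_PER_SYMBOL    # 1000 Mbit/s
peak_bytes_per_s = peak_bits_per_s // 8          # 125 MB/s (decimal)

print(code_points, data_symbols, symbol_time_ns, peak_bits_per_s, peak_bytes_per_s)
```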

[By the way, this is the *same* 125 MBaud rate used for 100BASE-TX.
GbE runs 10 times faster than 100BASE-TX because GbE: (1) uses PAM-5
instead of MLT-3 (which was really only a binary code, under the hood);
(2) *doesn't* use the FDDI 4B/5B precoding for clock recovery, but
encodes across all four pairs so that there are fewer wasted symbols,
enough to allow clock recovery to be achieved by flipping between symbol
sets; and (3) uses all four pairs at once in both directions instead
of only one pair in each direction. All of which means that GbE gets
8 bits/Baud of user data instead of 100BASE-TX's 0.8 bit/Baud.]
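
The factor of ten falls straight out of the bits-per-Baud figures. A small sketch, assuming only the numbers quoted in the aside above (the variable names are illustrative):

```python
from fractions import Fraction

BAUD = 125_000_000  # both 100BASE-TX and GbE signal at 125 MBaud

# 100BASE-TX: 0.8 user bit per Baud after precoding, one pair per direction.
fast_ethernet_bits_per_baud = Fraction(8, 10)
# GbE: 8 user bits per signaling interval across the four pairs.
gigabit_bits_per_baud = Fraction(8, 1)

fast_ethernet_rate = BAUD * fast_ethernet_bits_per_baud   # 100 Mbit/s
gigabit_rate = BAUD * gigabit_bits_per_baud               # 1000 Mbit/s
speedup = gigabit_rate / fast_ethernet_rate               # factor of 10

print(int(fast_ethernet_rate), int(gigabit_rate), speedup)
```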

The maximum frame size [excluding "jumbo frames"] is 1538 bytes,
made up of preamble (8 bytes), Ethernet link-level header
(6 + 6 + 2 = 14 bytes), Ethernet payload (1500 bytes), CRC-32 (4 bytes),
and interpacket recovery time (12 bytes). [The interpacket gap is
actually silence on 10BASE-T or a stream of IDLEs on 100BASE-T
and GbE, but in any case it's 12 bytes worth of time.] Since the
Ethernet payload *includes* the IP (20 bytes) and TCP (20 bytes)
headers, the maximum user data per TCP packet is only 1460 bytes,
and thus the theoretical maximum sustained TCP user data rate is
(* 125e6 (/ 1460 1538)) ==> 118.66 MB/s [118,660,598.18 bytes/s
if you want to be picky], or 94.93% of wire speed.
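
The same accounting as a short Python sketch, in case you want to play with the numbers. It assumes, as the text does, plain 20-byte IP and TCP headers with no options; any IP or TCP options in use would shrink the user data per packet accordingly. The names are illustrative.

```python
# Per-frame byte accounting for a full-size (non-jumbo) TCP segment on Ethernet.
PREAMBLE = 8
ETH_HEADER = 6 + 6 + 2       # destination, source, type/length
ETH_PAYLOAD = 1500
CRC32 = 4
INTERPACKET_GAP = 12
IP_HEADER = 20               # assumption: no IP options
TCP_HEADER = 20              # assumption: no TCP options

WIRE_RATE = 125e6            # GbE peak, bytes per second

wire_bytes = PREAMBLE + ETH_HEADER + ETH_PAYLOAD + CRC32 + INTERPACKET_GAP
user_bytes = ETH_PAYLOAD - IP_HEADER - TCP_HEADER
efficiency = user_bytes / wire_bytes
max_user_rate = WIRE_RATE * efficiency

print(wire_bytes, user_bytes)
print(f"{efficiency:.4%} of wire speed, {max_user_rate:,.2f} bytes/s")
```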

As I've said before and will absolutely stand by, you can easily
get 98.6% of *that*, or 117 MB/s, on commodity hardware these days.

| 10^8 B/s = 95.37 * 2^20 bytes.

I have no idea what these numbers refer to here. I mentioned no
powers of 2 at all. My numbers show that the theoretical maximum
sustained TCP user data rate on GbE is ~94.93% of the wire speed,
or 949,284,785+ bit/s, and that one can actually achieve 98.6% of
*that* in practice, or 93.6% of peak wire speed = 117 MB/s from
the user process memory of one machine to the user process memory
of another machine. [With or without a GbE switch in the path, by
the way, provided that it's a reasonably performant switch.]
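
To tie those percentages together, a minimal sketch that takes the 117 MB/s figure as given (it is the measurement quoted above, not something this code produces) and derives the ratios from it:

```python
WIRE_RATE = 125_000_000                      # GbE peak, bytes/s
theoretical = WIRE_RATE * 1460 / 1538        # max sustained TCP user data, bytes/s
measured = 117_000_000                       # the 117 MB/s quoted above (decimal)

theoretical_bits = theoretical * 8           # ~949,284,785 bit/s
of_theoretical = measured / theoretical      # ~0.986
of_wire = measured / WIRE_RATE               # 0.936

print(f"{theoretical_bits:,.0f} bit/s theoretical")
print(f"{of_theoretical:.1%} of theoretical, {of_wire:.1%} of peak wire speed")
```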

| And that total must include packet preambles, headers and trailers.

I included all of that, see above.

| The theoretical maximum for GB Ethernet may be 99.1% ...

It's not; it's only 94.93% (see above).

| ...but it is only achievable on an otherwise quiet LAN.
| It takes very little other traffic to screw it up.

But in practice all GbE LANs are *switched*, so if you buy
non-blocking switches, traffic on one path won't interfere
with traffic on another path at all.

| Add to that the fact that exponential back off from CSMA/CD on
| Ethernet begins to severely impact performance when usages drives
| collision rates above 40%.

Oh, boy. Now you're just embarrassing yourself with that old myth.
Look, an Ethernet collision only costs at most one slot time (64 bytes)
plus an interpacket recovery time (12 bytes). A 40% collision rate
would therefore only add (* 0.4 (+ 64 12)) ==> ~30 bytes to the average
frame overhead, lowering the efficiency from 94.93% (1460/1538)
to 93.1% (1460/1568). No way can you consider that to be a "severe
performance impact"!!
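
Here is that estimate as a Python sketch. It rests on two simplifying assumptions, both labeled in the code: every collision costs the full worst case (one slot time plus one interpacket gap), and "40% collision rate" means 0.4 collisions per delivered full-size frame on average.

```python
SLOT_TIME = 64           # bytes: worst-case cost of one collision
INTERPACKET_GAP = 12     # bytes: recovery time after the collision
COLLISION_RATE = 0.4     # assumption: average collisions per delivered frame

FRAME_WIRE_BYTES = 1538  # full-size frame including preamble and gap
USER_BYTES = 1460        # TCP user data in that frame

extra = COLLISION_RATE * (SLOT_TIME + INTERPACKET_GAP)   # ~30 bytes per frame
clean = USER_BYTES / FRAME_WIRE_BYTES
with_collisions = USER_BYTES / (FRAME_WIRE_BYTES + extra)

print(f"extra overhead: {extra:.1f} bytes/frame")
print(f"efficiency: {clean:.2%} -> {with_collisions:.2%}")
```

That is a drop of under two percentage points.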

Back at SGI, even on ancient half-duplex 10BASE-T, we routinely
got >1160 KB/s in our benchmarks. (Which, I'll admit, is slightly
less than the fraction of wire speed I've seen on GbE, but I did
say "half-duplex", and even with "delayed ACKs" turned on ACKs do
take up some bandwidth.)
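
For comparison with the GbE figures, a quick sketch of the 10BASE-T fraction. It assumes "KB" here means 1000 bytes, which is my reading and not something stated above.

```python
WIRE_10BASE_T = 10_000_000 // 8       # 1,250,000 bytes/s at 10 Mbit/s
measured_10 = 1_160_000               # ">1160 KB/s", assuming KB = 1000 bytes

fraction_10 = measured_10 / WIRE_10BASE_T      # fraction of wire speed, 10BASE-T
fraction_gbe = 117_000_000 / 125_000_000       # fraction of wire speed, GbE

print(f"10BASE-T: {fraction_10:.1%}   GbE: {fraction_gbe:.1%}")
```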

Besides, there are no collisions at all on switched Gigabit Ethernet!!!
[Buffer congestion in switches, maybe, but no collisions.]

| For high usage rates, DDI rings are far more performant.

Sounds like someone with a product to sell.

| TCP was designed for WANs and by comparison with other LAN streaming
| protocols, it is quite heavy weight.  That's why IBM, Novell and DEC
| spent millions developing NetBIOS, SPX and DECnet ... because TCP was
| so brilliantly performant on LANs.

Actually, they spent millions developing NetBIOS, SPX and DECnet
because they hoped they could lock customers into using *their*
proprietary protocols instead of open, standards-conforming,
*free* protocols. They lost.

| I never used DECnet even though I personally know Stuart Wecker
| who led the team that designed it.

Yes, yes, Stu's a really nice guy. His protocol description of
DDCMP was a marvel of clear writing. One could easily implement
the protocol directly from the spec -- and some non-DEC folks did!
By the way, did I ever mention that a little company I worked
for called DCA in Atlanta was shipping a networking product for
revenue that used DDCMP as the link-level protocol two years
*before* DEC shipped anything with it?!? We even used Stu's
software CRC-16 algorithm in our driver, this one:

    Wecker, S., "A Table-Lookup Algorithm for Software
    Computation of Cyclic Redundancy Check (CRC),"
    Digital Equipment Corporation memorandum, 1974.

Now that we're done name-dropping, back to GbE...  ;-}  ;-}

| I have used NetBIOS and SPX extensively - both are measurably faster
| than TCP on the same hardware.  TCP can be tuned to the environment,
| but that is beyond most users.  Most stacks today perform some self
| tuning within certain ranges of parameters, but they can't handle
| everything they may encounter.

All I can say, George, is that you don't seem to have kept current.
TCP readily runs at wire speed on GbE these days, using commodity
hardware and readily-available operating systems (Linux, FreeBSD, etc.).
117 MB/s of "ttcp" or "netperf" is just what we routinely expect
to see. If we *don't* get it, we go look to see what's broken.
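
If you have neither tool handy, the idea behind such a test is simple: push a known number of bytes through one TCP connection and divide by the elapsed time. Below is a rough Python sketch of that idea, not a substitute for "ttcp" or "netperf". The function names are made up for illustration, and as written it runs sender and receiver in one process over loopback, so it says nothing about a NIC or a LAN. To measure a real link you would run the receiving half on a second machine, and an interpreted sender may itself become the bottleneck.

```python
import socket
import threading
import time


def receive_all(server_sock, result):
    """Accept one connection and count bytes until the sender closes it."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:          # empty read: the peer closed the connection
                break
            total += len(chunk)
    result["bytes"] = total


def measure_tcp_throughput(total_bytes=16 * 1024 * 1024, host="127.0.0.1"):
    """Send total_bytes over one TCP connection.

    Returns (bytes_received, bytes_per_second).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))         # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()

    buf = b"\0" * 65536
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            n = min(len(buf), total_bytes - sent)
            client.sendall(buf[:n])
            sent += n
    receiver.join()                # wait until the receiver has drained everything
    elapsed = time.perf_counter() - start
    server.close()
    return result["bytes"], result["bytes"] / elapsed


if __name__ == "__main__":
    received, rate = measure_tcp_throughput()
    print(f"{received} bytes at {rate / 1e6:.1f} MB/s (decimal megabytes)")
```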

In conclusion: There are occasionally valid reasons one might choose
*not* to use TCP. The fact that it's "on a LAN" isn't one of them.


Rob Warnock			<>
627 26th Avenue			<URL:>
San Mateo, CA 94403		(650)572-2607