Subject: Re: Politeness and language growth
From: Erik Naggum <erik@naggum.no>
Date: 1999/01/08
Newsgroups: comp.lang.lisp
Message-ID: <3124822115465877@naggum.no>

* Andi Kleen <ak-uu@muc.de>
| This happens when a connection attempt arrives, but it gets removed
| again before the program accepts it - this happens e.g. when an ICMP
| packet arrives exactly in the right window.

  this doesn't make sense as you explain it.  could you be more specific?
  (assume that I know all the protocol specs by heart, actually helped
  write the Internet Host Requirements RFCs, and sometimes do network
  debugging for a living.)  I don't see how you are supposed to be able to
  "remove" a connection attempt (by which I assume you mean a TCP SYN) and
  presumably _before_ the SYN ACK can see a RST from the other end.  which
  type of ICMP packet do you mean could interfere here?

| if an application cannot accept any blocking it is supposed to set the
| listening socket to non blocking mode first, then the accept'ed sockets
| will be non blocking too and return EAGAIN when the connection
| disappeared.  The application has to handle this inherent race by
| retrying.

  this scenario seems extremely unlikely to cause blocking calls.  yet it
  happens.  with both TCP logs and ICMP logs on the system in question,
  select returns an apparently random fd from either the input or output
  set without evidence of activity.  in the test case where this has
  happened to me, the machine does _nothing_ apart from waiting for
  connections on 64 sockets, input on 64 sockets, and output on 64 sockets.
  there is _no_ activity that should trigger and make select return, yet it
  does, and there is, invariably, nothing there.  the network only has the
  usual background noises, to which it does not seem to relate in any way.
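
  for reference, the non-blocking accept pattern from the quoted advice
  can be sketched as below.  this is a minimal illustration in Python, not
  the code of the application discussed here; the function name
  accept_ready is hypothetical, and the explicit setblocking call on the
  accepted socket is an assumption made so the result does not depend on
  what the operating system happens to inherit.

```python
import select
import socket


def accept_ready(listener, timeout=None):
    """Accept one connection from a non-blocking listening socket.

    Returns (conn, addr), or None if select timed out. If the pending
    connection disappears between select and accept, the accept call
    fails with EAGAIN/EWOULDBLOCK or ECONNABORTED, and we simply retry.
    """
    while True:
        readable, _, _ = select.select([listener], [], [], timeout)
        if not readable:
            return None  # nothing arrived within the timeout
        try:
            conn, addr = listener.accept()
        except (BlockingIOError, ConnectionAbortedError):
            continue  # the connection vanished: go back to select
        conn.setblocking(False)  # assumption: make the mode explicit
        return conn, addr
```

  the listener must have been put in non-blocking mode before the first
  call, otherwise accept can still block in exactly the window described.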

| In what other situations do you think select "lies" too?

  it has lied to me about write not blocking on sockets.  just today, I had
  to shut down the whole application and restart it because we couldn't
  wait for my finding the bug during production hours.  uptime requirements
  are extremely tight.  we were down for 3 minutes, and during that time,
  all of nine people had been yanked out of whatever concentration they have
  left and asked to participate in fixing the problem.

  the problem is actually two-fold: select is used on a number of fds, but
  when a false positive is returned, due to system internals, it keeps
  retrying on only the one that failed for a while before giving up on it,
  and returning to normal operation.  sometimes, it doesn't give up.  I
  have looked over the Allegro CL code and I can find no fault with it.

  I have coded around this by asking select if it lied about the return
  value, and if it doesn't confirm the return value, I ignore it and retry.
  (I have replaced the functions FILESYS-WRITE-BYTES and -STRING, and
  FILESYS-READ-BYTES and -BUFFER.)  I have found _nothing_ to support any
  theory of external influences of any kind, although I'm sure there's
  _something_, like a clock ticking or some other apparently-innocuous
  element.  I wish I understood what's going on, but I have stopped my
  research at select, and have coded around select, and that appears to
  remove the problem, for whatever reason.  if and when I have more time,
  I'd like to dig even deeper, but this has already cost us some delays and
  I have some catching up and some code cleanup to do.
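
  the "ask select again" workaround described above can be sketched
  roughly as follows.  this is Python, not the Allegro CL replacements for
  FILESYS-WRITE-BYTES and friends mentioned in the post, and the names
  confirmed_ready and wait_and_confirm are hypothetical.  the idea is only
  this: after select reports a descriptor as ready, poll that one
  descriptor again with a zero timeout, and drop it from the result if the
  second answer disagrees.

```python
import select
import socket


def confirmed_ready(sock, for_write=False):
    """Re-poll a single socket with a zero timeout.

    Returns True only if select repeats its claim that the socket is
    readable (or writable, when for_write is set).
    """
    rlist = [] if for_write else [sock]
    wlist = [sock] if for_write else []
    readable, writable, _ = select.select(rlist, wlist, [], 0)
    return bool(writable if for_write else readable)


def wait_and_confirm(readers, writers, timeout=None):
    """Like select, but discard any descriptor that fails a second poll."""
    readable, writable, _ = select.select(readers, writers, [], timeout)
    readable = [s for s in readable if confirmed_ready(s)]
    writable = [s for s in writable if confirmed_ready(s, for_write=True)]
    return readable, writable
```

  a caller would loop on wait_and_confirm and treat an empty result as a
  reason to retry.  this only filters answers that select itself will not
  repeat; it does not explain why the first answer was wrong.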

#:Erik