[Udpcast] Timing problem with starting PXE boot and udp-sender

Alain Knaff alain.knaff at lll.lu
Thu Apr 29 22:39:52 CEST 2004


begin  Thursday 29 April 2004 21:20, Donald Teed quote:
> If a CONNECT can trigger the rendez-vous, then if I notice a certain
> number of machines not connecting, I should be able to simply
> reboot them and have them try this again.  The weird thing was
> that we tried that, and the same 4 machines did not rendezvous
> while 11 were standing by ready.

You know, you neither confirmed nor denied that you usually start up
the sender after the receivers.

So let's just suppose you always start up the sender after the
receivers, except for those machines where the first PXE boot fails:

In that case, your observed behaviour is consistent with machines
that NEVER send out that first CONNECT after reboot. If, due to some
design limitation, the card is not operational within the first 5
seconds after driver activation, the first CONNECT would ALWAYS
fall within that window. By rebooting the machines, you'd trigger
another driver removal and re-insertion, which would again make the
card unavailable for a short time, and the CONNECT would again be
dropped.

Interesting things to test (in order to confirm or refute the
hypothesis):
 1. Start the sender first
    - do _all_ machines now fail? If yes, I think that's excellent
    confirmation that the first CONNECT after reboot never makes
    it...
    - do only some of the machines fail (... and always the same ones
    after a _complete_ restart of the experiment)? If yes, the problem
    seems to depend not only on the card model, but on each individual
    card.
    - do only some of the machines fail, and always different ones
    after a complete restart of the experiment? If yes, we do have a
    true mystery ;-)
 2. Run tcpdump on the server, and see which packets you get (ports
    9000 and 9001) from which machines.
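
For test 2, something along these lines should do. It is only a
sketch: the interface name eth0 and the RUN_CAPTURE guard variable are
my assumptions for illustration, so adjust them to your server. The
ports are the 9000 and 9001 mentioned above.

```shell
#!/bin/sh
# Interface name is an assumption; override with IFACE=... if needed.
IFACE="${IFACE:-eth0}"

# Only look at UDP traffic on the two ports mentioned above.
FILTER="udp and (port 9000 or port 9001)"

# -n: do not resolve names, so each sender shows up by IP address
# -l: line-buffered output, so packets are printed as they arrive
capture() {
    tcpdump -n -l -i "$IFACE" "$FILTER"
}

# Capturing needs root, so only start when explicitly requested.
if [ "${RUN_CAPTURE:-0}" = "1" ]; then
    capture
fi
```

Run it as root with RUN_CAPTURE=1 before booting the receivers, then
compare the source addresses you see against the list of machines that
failed to rendezvous. A machine that never shows up in the capture
never got its CONNECT onto the wire.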

>  That was what led me to conclude
> there was a window of time to rendez-vous and it had elapsed.

Nope, there is no such window.

> However on a third session the 4 missed machines were included in
> a new batch and did get imaged OK.

good.

> > > I checked the options and I don't see any that are designed to increase
> > > how long it will wait to see more machines responding as ready.
> >
> > There is the "--rexmit-hello-interval 3000" option, which instructs
> > the sender to keep on resending its HELLO packets until transmission
> > is started. The number is the interval, in milliseconds, between two
> > HELLO packets. This might solve the issue.
> >
> > udp-sender --rexmit-hello-interval 3000 --file fileimage.gz
>
> OK, cool, that might be useful.
>
> There are a few things I need to test.  I can try substituting
> the switch involved.

Could help. But from what I've read in the various newsgroups, this
particular problem (card initialization) has more to do with the cards
themselves than the switch.

>  In general the client machines are a
> little unpredictable since they were carried around daily by
> University students for 2 or 3 years.
>
> --Donald Teed

Alain



