[Udpcast] More on the corrupt file problem

Kyle Cordes kyle at kylecordes.com
Fri Oct 31 02:04:33 CET 2008


Alain Knaff wrote:

> support). However, in some two weeks time I'll be more available to
> check out what is going on.

Thanks for your reply. I will continue to investigate (such as with the 
ideas you describe below), and hopefully by the time you are able to 
attack it, I'll have enough diagnostic info to point to the problem.


> The strange thing is, we do use udpcast for duplicating entire disks,
> most of which are larger than 50GB by now, and we never did notice any

I had assumed this was the case, which is why I found the corruption so 
surprising!

(At the risk of sounding too critical, I was also surprised that udpcast 
doesn't do an end-to-end checksum or similar. It makes me think of the 
oft-referenced 1981 paper: 
http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf )
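
As a rough illustration of what an end-to-end check could look like, here is a minimal sketch of a running checksum that a sender and a receiver could each compute over the byte stream and compare at the end. It is not udpcast code: the names fnv1a64_update, FNV64_INIT and FNV64_PRIME are hypothetical, and FNV-1a is chosen only because it is short. It is not a cryptographic hash, and a real implementation might prefer a CRC or a stronger digest.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 64-bit FNV-1a constants; the macro names are hypothetical. */
#define FNV64_INIT  0xcbf29ce484222325ULL
#define FNV64_PRIME 0x100000001b3ULL

/*
 * Fold len bytes from buf into the running hash h and return the new
 * value. Start with FNV64_INIT and call once per block written, so the
 * result does not depend on how the stream was split into blocks.
 */
uint64_t fnv1a64_update(uint64_t h, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];          /* mix in one byte */
        h *= FNV64_PRIME;   /* multiply modulo 2^64 */
    }
    return h;
}
```

Both ends would start from FNV64_INIT, feed every block through fnv1a64_update in stream order, and compare the two final values once the transfer completes.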


> One suggestion (careful: this may take some time, and needs *huge*
> amounts of diskspace): try running udpcast under strace (strace -fo
> log.send udp-sender ... and strace -fo log.recv udp-receiver ...), and

I am not familiar with strace, but I will get familiar with it.

I might combine this with a different idea I had: add a few lines of 
code to compare the file position with the number of bytes udp-receiver 
thinks it has written, and if they don't match, die.  If I do this, it 
seems like the end of the strace would be at more or less exactly where 
the problem occurred.
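
A minimal sketch of that position check, assuming the receiver keeps its own running count of bytes written. The function names check_output_position and die_on_position_mismatch are hypothetical and do not exist in udpcast; only lseek() with SEEK_CUR is taken from the existing source.

```c
#define _FILE_OFFSET_BITS 64  /* ask for a 64-bit off_t on 32-bit systems */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: compare the descriptor's current offset with the
 * number of bytes the caller believes it has written.
 * Returns 0 on a match, 1 on a mismatch, and -1 if the offset cannot be
 * queried at all (for example when fd is a pipe).
 */
int check_output_position(int fd, off_t expected)
{
    off_t actual = lseek(fd, 0, SEEK_CUR);
    if (actual == (off_t)-1)
        return -1;
    return actual == expected ? 0 : 1;
}

/* Abort on the first mismatch so that an strace log ends near the fault. */
void die_on_position_mismatch(int fd, off_t expected)
{
    if (check_output_position(fd, expected) == 1) {
        fprintf(stderr, "output offset does not match %lld bytes written\n",
                (long long)expected);
        abort();
    }
}
```

The check only helps when the output is a regular file or a block device. When the output goes through --pipe, lseek() fails and the helper returns -1 rather than reporting a mismatch.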

Do you have any feel for how much disk space I might need, to strace 
udp-receiver on a file of 50 GB?


> Another weird thing is that although the problem happens relatively
> "early" in the file, it only occurs for certain minimum file sizes...

Yes, this is very weird.  I will hopefully find a way to run the whole 
test in a loop - I have a couple of machines which could pound on it 
24x7 for a few days.


> just as if the file was being corrupted after the fact (say, after 10GB
> have been transferred.) It might be interesting to do a cmp midway
> through and see if the difference is already there "from the

This seems unlikely to be at issue, since the trouble still occurs when 
I grab the output using:

udp-receiver --pipe "tee somefile" >/dev/null

Although my knowledge is incomplete, I don't think the OS will let 
udp-receiver reach through the pipe and "tee" to seek around on somefile.
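
A small sketch that supports this: on POSIX systems lseek() on a pipe fails with ESPIPE, so a process writing into a pipe has no way to reposition whatever the reader does with the data. The helper name is_unseekable is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if fd refuses to seek with ESPIPE (pipe, FIFO or socket),
 * 0 if lseek() works on it, and -1 on any other error such as EBADF.
 */
int is_unseekable(int fd)
{
    if (lseek(fd, 0, SEEK_CUR) != (off_t)-1)
        return 0;
    return errno == ESPIPE ? 1 : -1;
}
```

Calling is_unseekable() on either end of a descriptor pair from pipe() returns 1, which is the situation udp-receiver is in when its output is handed to "tee".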

I also don't see how udp-receiver could possibly seek backward into its 
output, because of this:

$ grep seek *.c
statistics.c:   loff_t offset = lseek64(fd, 0, SEEK_CUR);
statistics.c:   off_t offset = lseek(fd, 0, SEEK_CUR);

... offhand I can't think of a way to move around in a file without 
seek()ing.

> And, do several runs with the same input file always produce the error
> at the exact same spot?

I will test this carefully, and report back. I think the answer is no, 
since in some test runs I udpcasted to three receivers, and each ended 
up with a different file length.

-- 
Kyle Cordes
http://kylecordes.com

