[Udpcast] More on the corrupt file problem

Alain Knaff alain at knaff.lu
Thu Oct 30 21:17:58 CET 2008


Kyle Cordes wrote:
> To try to track down the udpcast corrupt file problem, I ran some more 
> tests. This time I used a ~50GB file, a sender, and only 1 receiver.
> 
> side       bytes
> sender:    53687091200
> receiver:  53686091776
> 
> In all of my large-file runs, udp-receiver comes up a bit "short": it is
> missing some of the data, never "long".
> 
> I created a 50 GB test file with predictable text data in it, using this 
> ugly little program:
> 
> #include <stdio.h>
> 
> // Each entry is 16 bytes: a 15-digit, zero-padded byte offset plus a newline.
> int main(void) {
>          long long gb = 1024LL * 1024 * 1024;
>          long long m = 50 * gb;   // total file size: 50 GB
>          long long i;
>          for(i = 0; i<m; i+= 16) {
>                  printf("%.15lld\n", i);
>          }
>          return 0;
> }
> 
> 
> so that I could easily look at the files. I found that the received file 
> ended with the same data as the sent file; in other words, the problem 
> is *not* a matter of terminating early, or other finishing-out process.
> 
> Rather, it's much earlier.  According to "cmp":
> 
> differ: byte 2098176010, line 131136001
> 
> That's a little under 2 GB of the way into a 50 GB file.
> 
> Strangely, I ran repeated tests with 10 GB files, and didn't get any 
> corruption.
> 
> 
> Alain - it would warm my heart to see you ack these messages, even if 
> you don't have a solution at hand.
> 

I do get your messages, but for the moment I am somewhat busy with
another project (preparing the release of mtools version 4 with Unicode
support). However, in about two weeks' time I'll be more available to
check out what is going on.

The strange thing is, we do use udpcast for duplicating entire disks,
most of which are larger than 50GB by now, and we have never noticed any
ill effect. A large piece of data missing in the middle would have been
pretty obvious, but we have never seen anything like this so far.

So apparently, it only happens under certain circumstances... and we
need to understand exactly which circumstances trigger it.

I appreciate your work on this subject, and I'm pretty confident that
within a couple more tests, you'll have identified what is going on
(... making it easier for me to fix...)

One suggestion (careful: this may take some time, and needs *huge*
amounts of disk space): try running udpcast under strace (strace -fo
log.send udp-sender ... and strace -fo log.recv udp-receiver ...), and
try to locate the system calls around the place where the missing data
occurs. The strace output should contain reads and writes whose parameter
is your textual data. The stretch of output between the read or write of
000002098175984 and that of 000002098225152 is the interesting one here...

Actually, to be precise, as udpcast reads and writes in largish chunks,
you won't see a read or write for every line. So the last read or write
before the error will probably show a number less than 000002098175984,
and the next one will show a number larger than 000002098225152, but
you get the gist of it.
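
Since every 16-byte line of your test file contains its own byte offset,
the received file can also be checked on its own, without the original.
Below is a minimal sketch of that idea, not something that ships with
udpcast: the helper names first_bad_record and missing_before are made up
for illustration, and it assumes the missing stretch starts and ends on a
16-byte boundary (otherwise the first bad line will simply look garbled).

```c
#include <stdio.h>
#include <string.h>

#define RECORD_SIZE 16  /* 15 digits plus newline, as in Kyle's generator */

/*
 * Scan buf, which holds len bytes read from the received file starting at
 * file offset base (base must be a multiple of RECORD_SIZE).
 * Returns the file offset of the first record whose text does not match
 * its own position, or -1 if every complete record matches.
 */
long long first_bad_record(const char *buf, size_t len, long long base)
{
    char expected[RECORD_SIZE + 1];
    size_t pos;

    for (pos = 0; pos + RECORD_SIZE <= len; pos += RECORD_SIZE) {
        long long offset = base + (long long)pos;
        /* Rebuild the line the generator would have written here. */
        snprintf(expected, sizeof expected, "%.15lld\n", offset);
        if (memcmp(buf + pos, expected, RECORD_SIZE) != 0)
            return offset;
    }
    return -1;
}

/*
 * Number of bytes missing in front of a bad record: the value printed in
 * the record minus the offset where it was found. Returns -1 if the
 * record does not start with a parsable number.
 */
long long missing_before(const char *record, long long found_at)
{
    long long value;

    if (sscanf(record, "%15lld", &value) != 1)
        return -1;
    return value - found_at;
}
```

The idea would be to read the received file in chunks and hand each chunk
to first_bad_record together with its starting offset. At the first
mismatch, missing_before tells how many bytes were dropped at that spot,
which can be compared against the overall size difference between the
sent and the received file.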

Another weird thing is that although the problem happens relatively
"early" in the file, it only occurs above a certain minimum file size...
just as if the file was being corrupted after the fact (say, after 10GB
have been transferred). It might be interesting to do a cmp midway
through and see if the difference is already there "from the
beginning..." (for instance, you may start your cmp as soon as your
received file has reached a size of 2GB...)
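
If plain cmp is awkward to run against a file that is still growing, a
small helper that compares only the first N bytes can stand in for it.
This is a rough sketch under my own assumptions (the name first_difference
and its return codes are invented here); GNU cmp's -n option should do
the same job. Note that it reports a zero-based offset, whereas cmp counts
bytes from 1.

```c
#include <stdio.h>

/*
 * Compare the first limit bytes of two files. Returns the zero-based
 * offset of the first differing byte, -1 if the compared prefixes are
 * identical, or -2 if either file cannot be opened.
 * If one file ends before limit, the offset of its end is returned.
 *
 * On 32-bit systems, build with -D_FILE_OFFSET_BITS=64 so that files
 * larger than 2GB can be opened.
 */
long long first_difference(const char *path_a, const char *path_b,
                           long long limit)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    long long offset = 0, result = -1;
    int ca, cb;

    if (a == NULL || b == NULL) {
        if (a != NULL) fclose(a);
        if (b != NULL) fclose(b);
        return -2;
    }
    while (offset < limit) {
        ca = getc(a);
        cb = getc(b);
        if (ca != cb) {         /* differing byte, or one file ended */
            result = offset;
            break;
        }
        if (ca == EOF)          /* both files ended at the same place */
            break;
        offset++;
    }
    fclose(a);
    fclose(b);
    return result;
}
```

Called with the sent file, the partially received file and a limit a bit
below the receiver's current size, a result of -1 would mean that the
first part of the file is still intact at that point of the transfer.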

And, do several runs with the same input file always produce the error
at the exact same spot?

Regards,

Alain
