[Udpcast] More on the corrupt file problem

Alain Knaff alain at knaff.lu
Tue Nov 4 23:09:01 CET 2008


Kyle Cordes wrote:
> Alain Knaff wrote:
> 
>> The strange thing is, we do use udpcast for duplicating entire disks,
>> most of which are larger than 50GB by now, and we never did notice any
>> ill effect. A large piece of data missing in the middle would have been
> pretty obvious, but we've never seen any of this so far.
> 
> 
> Alain,
> 
> By any chance do you typically use it like so on your large files?
> 
> udp-receiver | some-process    ?
> 
> Contrary to my earlier findings, in ongoing testing I have found that if
> I use it like this:
> 
> 
> udp-receiver --file foo
> 
> I sometimes get bad results; and things like this:
> 
> udp-receiver --pipe "lzop -d" --file foo
> 
> also sometimes get bad results.
> 
> But I noticed that my real scripts do this:
> 
> udp-receiver | lzop -d | pg_restore
> 
> and I tested like this:
> 
> udp-receiver | lzop -d >foo
> 
> ... and I get correct results. To 5 or 6 receivers. Every night.
> 
> Also, I found that having lzop (or another common compression tool) in the
> loop acts as a guard against data integrity problems: if udp-receiver
> skips or damages data, lzop's checksum verification fails and the whole
> process fails with it.
> 
> Thus, it looks like there is some issue that comes into play with
> --file, but not when simply letting the data fall out on stdout.
> 
> I'm setting the issue aside for the moment, but later I may beat on it a
> little more to try to track down the specifics of the failure.
> 
> I also think that perhaps the "--pipe" and "--file" features are
> unnecessary; udp-receiver would be better if it were simpler and
> simply assumed that the user will redirect the output where they need it.
> 

I suspect that some versions of the Linux kernel have a bug in seek,
namely that seek is not thread-safe.

Udpcast uses lseek(fd, 0, SEEK_CUR) to read the current file position
for statistics printing. Theoretically, this should not be harmful, as
it should have no influence on the file position. But I suspect that
what it really does is read the file position, do some stuff, and then
_write_back_ that same position, leading to corruption if a read or
write in a different thread happens in between (the file position would
be reset to where it was just before that read or write).
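
One way to probe this suspicion outside of udpcast is a small standalone
test. The following is only a sketch under assumptions: the names
seek_race_check, race_ctx and BLOCK_SIZE are hypothetical and not part of
udpcast. One thread writes numbered blocks sequentially while a second
thread keeps calling lseek(fd, 0, SEEK_CUR) on the same descriptor;
afterwards each block is read back and checked against its expected index.
On a kernel without the suspected bug it should report 0 bad blocks. A
clean run does not prove the absence of a race, it only fails to
reproduce one.

```c
#define _XOPEN_SOURCE 700  /* for pread() under strict -std=c11 */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096  /* hypothetical block size, not udpcast's */

struct race_ctx {
    int fd;
    uint32_t nblocks;
    atomic_int done;     /* set to 1 once the writer has finished */
};

/* Writes nblocks blocks sequentially; each block starts with its index. */
static void *writer_thread(void *arg) {
    struct race_ctx *ctx = arg;
    unsigned char block[BLOCK_SIZE];
    for (uint32_t i = 0; i < ctx->nblocks; i++) {
        memset(block, 0, sizeof block);
        memcpy(block, &i, sizeof i);
        if (write(ctx->fd, block, sizeof block) != (ssize_t) sizeof block)
            break;  /* blocks never written are counted as bad later */
    }
    atomic_store(&ctx->done, 1);
    return NULL;
}

/* Polls the file position, as statistics code would, until the writer ends. */
static void *poller_thread(void *arg) {
    struct race_ctx *ctx = arg;
    while (!atomic_load(&ctx->done))
        (void) lseek(ctx->fd, 0, SEEK_CUR);
    return NULL;
}

/* Returns the number of missing or misplaced blocks, or -1 on setup error.
 * The scratch file at path is created, checked and removed again. */
int seek_race_check(const char *path, uint32_t nblocks) {
    struct race_ctx ctx = { .fd = -1, .nblocks = nblocks };
    pthread_t writer, poller;
    int bad = 0;

    assert(path != NULL);
    atomic_init(&ctx.done, 0);
    ctx.fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (ctx.fd < 0)
        return -1;
    if (pthread_create(&poller, NULL, poller_thread, &ctx) != 0) {
        close(ctx.fd);
        unlink(path);
        return -1;
    }
    if (pthread_create(&writer, NULL, writer_thread, &ctx) != 0) {
        atomic_store(&ctx.done, 1);  /* let the poller exit */
        pthread_join(poller, NULL);
        close(ctx.fd);
        unlink(path);
        return -1;
    }
    pthread_join(writer, NULL);
    pthread_join(poller, NULL);

    /* Read every block back from its expected offset and compare the tag. */
    for (uint32_t i = 0; i < nblocks; i++) {
        unsigned char block[BLOCK_SIZE];
        uint32_t tag;
        ssize_t n = pread(ctx.fd, block, sizeof block, (off_t) i * BLOCK_SIZE);
        if (n != (ssize_t) sizeof block) {
            bad++;
            continue;
        }
        memcpy(&tag, block, sizeof tag);
        if (tag != i)
            bad++;
    }
    close(ctx.fd);
    unlink(path);
    return bad;
}
```

To use it, call seek_race_check("/tmp/seekrace.bin", 100000) from a small
main() and print the result; build with -pthread. A nonzero result on the
affected machines would support the theory.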

Could you check whether you still get the problem if you comment out
the contents of the printFilePosition function in statistics.c?

And what happens if you replace its contents with:

static void printFilePosition(int fd) {
    if(fd != -1) {
        /* Query the position through a duplicate descriptor, not fd itself */
        int fd2 = dup(fd);
        if(fd2 != -1) {
#ifdef HAVE_LSEEK64
            loff_t offset = lseek64(fd2, 0, SEEK_CUR);
            if(offset != -1)
                printLongNum(offset);
#else
            off_t offset = lseek(fd2, 0, SEEK_CUR);
            if(offset != -1)
                /* off_t may be wider than int, so cast and print as long long */
                fprintf(stderr, "%10lld", (long long) offset);
#endif
            close(fd2);
        }
    }
}

(Trying to read the position from a _copy_ of the file descriptor)
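
Note that a descriptor obtained from dup() refers to the same open file
description as the original and therefore shares its file offset, so the
variant above may or may not sidestep the problem. If it does not, a
further experiment would be to avoid lseek altogether and have the
statistics code report a running byte count instead. The following is
only a sketch of that idea: counted_fd, counted_write and
counted_position are hypothetical names, not existing udpcast functions,
and the count equals the file position only if all output goes through
this wrapper sequentially, starting at offset 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical wrapper: a descriptor plus a running count of bytes written. */
struct counted_fd {
    int fd;
    atomic_ullong written;
};

/* Writes all of buf, retrying on short writes and on EINTR.
 * Returns len on success, or -1 on error with errno set by write(). */
ssize_t counted_write(struct counted_fd *c, const void *buf, size_t len) {
    const char *p = buf;
    size_t left = len;

    assert(c != NULL);
    while (left > 0) {
        ssize_t n = write(c->fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        left -= (size_t) n;
        /* Atomic update so another thread can read the count safely. */
        atomic_fetch_add(&c->written, (unsigned long long) n);
    }
    return (ssize_t) len;
}

/* Can be called from the statistics thread; never touches the descriptor. */
unsigned long long counted_position(struct counted_fd *c) {
    return atomic_load(&c->written);
}
```

With this approach the statistics thread only reads a counter, so it
cannot disturb the file position no matter how the kernel implements
seek.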

Regards,

Alain

