hadoop-common-dev mailing list archives

From Michael Stack <st...@archive.org>
Subject Re: I get checksum errors! Was: Re: io.skip.checksum.errors was: Re: Hung job
Date Thu, 13 Apr 2006 17:40:52 GMT
Doug Cutting wrote:
> Michael Stack wrote:
>> One question: The 'io.skip.checksum.errors' is only read in 
>> SequenceFile#next but the LocalFileSystem checksum error "move-aside" 
>> handler can be triggered by other than just a call out of 
>> SequenceFile#next.  If so, stopping the LocalFileSystem move-aside on 
>> checksum error is probably not the right thing to do.
> Right, we ideally want SequenceFile to disable it when that flag is 
> set.  But that would take a lot of plumbing to implement!  
> Perhaps we should instead fix this by not closing the file in 
> LocalFileSystem.reportChecksumFailure.  Then it won't be able to move 
> the file aside on Windows.  To fix that, we can (1) try to move it 
> without closing it (since something on the stack will eventually close 
> it anyway, and may still need it open) and (2) if the move fails, try 
> closing it and moving it (for Windows).  The net effect is that 
> io.skip.checksum.errors will then work on Unix but not on Windows.  Or 
> we could skip moving it altogether, since it seems that most checksum 
> errors we're seeing are not disk errors but memory errors before the 
> data hits the disk. 
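For what it's worth, Doug's (1)-then-(2) fallback seems straightforward to express. Here's a rough plain-java.io sketch of the idea -- `moveAside` and the `bad_files` directory name are my inventions for illustration, not the actual LocalFileSystem code:

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;

public class MoveAside {
    /**
     * Move a corrupt file into a side directory.
     * (1) Try the rename without closing the open stream first; on Unix
     *     a rename succeeds even while the file is held open.
     * (2) If the rename fails (typically Windows, which locks open
     *     files), close the stream and retry the rename.
     * Returns true if the file was moved aside.
     */
    public static boolean moveAside(File f, Closeable openStream) {
        File badDir = new File(f.getParentFile(), "bad_files");
        badDir.mkdirs();
        File target = new File(badDir, f.getName() + "." + System.currentTimeMillis());

        // (1) attempt the move while the file is still open
        if (f.renameTo(target)) {
            return true;
        }

        // (2) the move failed; close the handle and try once more
        try {
            if (openStream != null) {
                openStream.close();
            }
        } catch (IOException e) {
            // ignore -- we only needed to release the file handle
        }
        return f.renameTo(target);
    }
}
```

The net effect matches what Doug describes: on Unix the caller's stream stays open and skipping can continue; on Windows the stream gets closed so the move can happen, and io.skip.checksum.errors won't help there.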
What if we did not move the file at all?  A checksum error would still be 
thrown.  If we're inside SequenceFile#next and 'io.skip.checksum.errors' 
is set, we'd just skip to the next record.  I don't have enough 
experience with the code base to know whether leaving the file in place 
would cause odd behavior elsewhere.
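To make the skip-on-error behavior concrete, here's a rough sketch of the loop I have in mind -- `Reader`, `readRecord`, and the poisoned "CORRUPT" marker are stand-ins for illustration, not the real SequenceFile API:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

// Stand-in for org.apache.hadoop.fs.ChecksumException.
class ChecksumException extends IOException {
    ChecksumException(String msg) { super(msg); }
}

// Sketch of a reader that, when io.skip.checksum.errors is set, skips a
// record that fails its checksum instead of failing the whole task.
class Reader {
    private final Iterator<String> records;    // stand-in for the underlying stream
    private final boolean skipChecksumErrors;  // io.skip.checksum.errors

    Reader(List<String> records, boolean skipChecksumErrors) {
        this.records = records.iterator();
        this.skipChecksumErrors = skipChecksumErrors;
    }

    // Reads one record; a "CORRUPT" entry simulates a checksum failure.
    private String readRecord() throws IOException {
        if (!records.hasNext()) {
            return null;  // EOF
        }
        String r = records.next();
        if ("CORRUPT".equals(r)) {
            throw new ChecksumException("bad record");
        }
        return r;
    }

    // Keep trying until we get a good record or hit EOF.
    String next() throws IOException {
        while (true) {
            try {
                return readRecord();
            } catch (ChecksumException e) {
                if (!skipChecksumErrors) {
                    throw e;
                }
                // flag is set: drop the damaged record and keep going
            }
        }
    }
}
```

With the flag off, the exception propagates as it does today; with it on, the damaged record is silently dropped and iteration continues.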
> A checksum failure on a local file currently causes the task to fail. 
> But it takes multiple checksum errors per job to get a job to fail, 
> right?  Is that what's happening?  
It is.  Jobs are long-running -- a day or more (I should probably try 
cutting them into smaller pieces).  What I usually see is a failure for 
some genuinely odd reason.  Then the task lands on a machine that has 
started to exhibit checksum errors.  After each failure, the task is 
rescheduled and it always seems to land back at the problematic machine 
(Is there anything I can do to randomize which machine a task gets assigned to?)

