hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Help: -copyFromLocal
Date Thu, 30 Mar 2006 16:24:31 GMT
Eric Baldeschwieler wrote:
> Interesting.  It would actually be nice to include the CRCs in an  
> export, so that you can validate your data when you reload it.  CRCs  
> are best if they are kept end to end.

CRCs are included in the export.  As the files are read from dfs, the 
CRCs are checked.  As they're written to the local fs, new CRCs are 
computed and written.  But then when the local files are listed, in 
preparation for writing them back to dfs, the CRC files are listed too.  
So when we copy a file back to dfs, we check its CRC on read and generate 
a new CRC on write.  Then we try to explicitly copy the CRC file and get 
an already-exists error.  Not to mention that we'd be generating a .crc 
file for the .crc file.  So the immediate bug is that we're listing .crc 
files from the local FS.  These should be excluded from directory 
listings there, as they are elsewhere.
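
A minimal sketch of the failure sequence described above, using an in-memory stand-in rather than real Hadoop code.  ToyDfs, copyAll, and the "." + name + ".crc" naming of checksum side files are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class CrcListingBug {
    // Toy destination: writing a file also creates its checksum side file.
    // Assumption: side files are named "." + name + ".crc".
    static final class ToyDfs {
        final TreeSet<String> names = new TreeSet<>();

        void create(String name) {
            if (!names.add(name)) {
                throw new IllegalStateException("already exists: " + name);
            }
            names.add("." + name + ".crc"); // checksum generated on write
        }
    }

    // Copy every listed name in order, collecting the error messages.
    static List<String> copyAll(List<String> listing, ToyDfs dst) {
        List<String> errors = new ArrayList<>();
        for (String name : listing) {
            try {
                dst.create(name);
            } catch (IllegalStateException e) {
                errors.add(e.getMessage());
            }
        }
        return errors;
    }
}
```

With a listing of "a" followed by ".a.crc", the second create fails because writing "a" already produced ".a.crc".  With a listing of just "a", the copy goes through without errors.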

We could try to copy CRC files rather than re-generate them, but that's 
a separate issue.  Things should work correctly if one lists a 
directory, opens each file, and writes its content to a new file in 
another directory.  That's valid user code using standard public APIs, 
and there's no opportunity in that case to copy CRC files directly.  The 
way to fix this is to not list CRC files in copyFromLocal.  The 
FileSystem API has both listFiles and listFilesRaw methods for this very 
purpose, but the copyFromLocal code doesn't use them for local files.  It 
accesses local files through the normal Java APIs rather than through 
the Hadoop FileSystem API.  That's the bug.
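
The raw-versus-filtered listing distinction can be sketched in plain Java as follows.  This is not the Hadoop FileSystem API: listRaw, listData, and copyDirectory are hypothetical helpers, and the ".crc" suffix test is an assumption about how checksum files are named.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FilteredCopy {
    // Raw view: every entry in the directory, checksum files included.
    static List<Path> listRaw(Path dir) throws IOException {
        List<Path> out = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) out.add(p);
        }
        Collections.sort(out);
        return out;
    }

    // Filtered view: hides checksum files (assumed to end in ".crc").
    static List<Path> listData(Path dir) throws IOException {
        List<Path> out = new ArrayList<>();
        for (Path p : listRaw(dir)) {
            if (!p.getFileName().toString().endsWith(".crc")) out.add(p);
        }
        return out;
    }

    // List a directory, open each file, write its content to a new file
    // in another directory. Returns the number of files copied.
    static int copyDirectory(Path src, Path dst) throws IOException {
        Files.createDirectories(dst);
        int copied = 0;
        for (Path p : listData(src)) {
            if (Files.isRegularFile(p)) {
                // Without REPLACE_EXISTING this throws if the target exists.
                Files.copy(p, dst.resolve(p.getFileName()));
                copied++;
            }
        }
        return copied;
    }
}
```

A copy loop built on the filtered view never sees the checksum files, so it can neither collide with them nor generate checksums of checksums.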

We could change file copying to copy CRC files without re-generating 
them, disabling re-generation in this case.  This would make CRCs more 
end-to-end, since it could catch corruption that occurs while the data 
is in the 4k copying buffer.  Where we've seen corruption is when very 
large buffers are used (e.g., when sorting), so this is not a likely 
place for corruption, but still a possible one.  And, again, this is 
separate from the issue reported.
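
One way to picture the end-to-end idea, as a sketch only: carry the source's CRC along with the copy and compare it against a checksum of the bytes handed to the writer, instead of computing a fresh CRC that would bless whatever arrived.  The names crcOf and copyAndVerify are hypothetical, and CRC32 from java.util.zip is assumed here purely for illustration.  Corruption that happens after the checksum update but before the write would still go unnoticed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

public class EndToEndCrc {
    // CRC of a whole byte array, e.g. the checksum stored with the source.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Copy through a 4k buffer, checksumming the bytes passed to the
    // writer, then compare with the CRC that travelled with the source.
    // Returns the number of bytes copied.
    static long copyAndVerify(InputStream in, OutputStream out, long expectedCrc)
            throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            crc.update(buf, 0, n);
            out.write(buf, 0, n);
            total += n;
        }
        if (crc.getValue() != expectedCrc) {
            throw new IOException("checksum mismatch after copy");
        }
        return total;
    }
}
```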

