hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-738) dfs get or copyToLocal should not copy crc file
Date Wed, 22 Nov 2006 19:45:04 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-738?page=comments#action_12452033 ] 
            
Doug Cutting commented on HADOOP-738:
-------------------------------------

> Do we still want to support crc files [ ...]

MapReduce data spends a lot of time in memory (while sorting) and on local disks. Most
checksum errors folks see come from local disks during sorting, not from HDFS. So, yes,
we'll still need CRC files.

And per-block checksums are different: they're not end-to-end. Currently we checksum the
data as it is written to the output stream's buffer and validate it as it is read from the
input stream's buffer. A lot can happen between the moment the data is written and the
moment it ends up in a DFS block. To replace this we'd ideally still compute the checksum
as the data is written, transmit it along with the block to the datanodes, then transmit
it back to the client when the data is read, and verify it as it is read. We'd also need
sub-block checksums, not per-block ones, so that a reader can seek without checksumming an
entire block. Yes, TCP does checksums, but memory errors can be introduced on either end,
outside of the TCP stack, and if blocks are temporarily stored on local disk, that disk
can also be a source of block corruption. So getting rid of CRC files even for HDFS will
take more than just per-block checksums on the datanodes.
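
To make the end-to-end, sub-block idea concrete, here is a minimal Java sketch
(illustration only, not Hadoop's actual implementation; the class name and CHUNK_SIZE are
invented for the example) using java.util.zip.CRC32:

    import java.util.zip.CRC32;

    // Sketch only: per-chunk checksums at a fixed sub-block granularity.
    // A writer computes one CRC per CHUNK_SIZE bytes; a reader that seeks
    // to a chunk boundary verifies just that chunk, not the whole block.
    public class ChunkedChecksum {
        static final int CHUNK_SIZE = 512;  // hypothetical chunk size

        // One CRC per CHUNK_SIZE slice, computed as the data is written.
        static long[] checksumChunks(byte[] data) {
            int nChunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
            long[] sums = new long[nChunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < nChunks; i++) {
                int off = i * CHUNK_SIZE;
                int len = Math.min(CHUNK_SIZE, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }

        // After a seek, verify only the chunk that was actually read.
        static boolean verifyChunk(byte[] data, long[] sums, int chunk) {
            int off = chunk * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            return crc.getValue() == sums[chunk];
        }
    }

Because each chunk carries its own CRC, a reader that seeks to a chunk boundary only has
to verify that one chunk rather than the entire block.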

> dfs get or copyToLocal should not copy crc file
> -----------------------------------------------
>
>                 Key: HADOOP-738
>                 URL: http://issues.apache.org/jira/browse/HADOOP-738
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.8.0
>         Environment: all
>            Reporter: Milind Bhandarkar
>         Assigned To: Milind Bhandarkar
>             Fix For: 0.9.0
>
>         Attachments: hadoop-crc.patch
>
>
> Currently, when we -get or -copyToLocal a directory from DFS, all the files, including
> the crc files, are also copied. When we -put or -copyFromLocal again, since the crc
> files already exist on DFS, the put fails. The solution is to not copy checksum files
> when copying to local. Patch is forthcoming.
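
To illustrate the proposed fix (sketch only, not the attached hadoop-crc.patch; it assumes
checksum files follow the ".<name>.crc" naming convention), a filter of roughly this shape
skips checksum files during the local copy:

    import java.io.File;

    // Sketch only, not the attached patch: skip checksum files when
    // copying a DFS directory to the local filesystem, so that a later
    // -put does not collide with crc files already present on DFS.
    public class SkipCrcOnCopy {
        // Assumes checksum files follow the ".<name>.crc" convention.
        static boolean isChecksumFile(String name) {
            return name.startsWith(".") && name.endsWith(".crc");
        }

        // dfsListing stands in for the real DFS directory listing.
        static void copyToLocal(File[] dfsListing, File localDir) {
            for (File f : dfsListing) {
                if (isChecksumFile(f.getName())) {
                    continue;  // do not materialize crc files locally
                }
                // ... copy f into localDir ...
            }
        }
    }

With the checksum files never materialized locally, a subsequent -put or -copyFromLocal
no longer collides with crc files already present on DFS.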

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
