From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Wed, 22 Nov 2006 11:45:04 -0800 (PST)
Message-ID: <10618106.1164224704682.JavaMail.jira@brutus>
In-Reply-To: <18953484.1164060182168.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-738) dfs get or copyToLocal should not copy crc file

    [ http://issues.apache.org/jira/browse/HADOOP-738?page=comments#action_12452033 ]

Doug Cutting commented on HADOOP-738:
-------------------------------------

> Do we still want to support crc files [...]

MapReduce data spends a lot of time in memory (while sorting) and on local disks. Most checksum errors folks see are from local disks during sorting, not HDFS. So, yes, we'll still need it.

And per-block checksums are different: they're not end-to-end. Currently we checksum the data as it is written to the output stream's buffer and validate it as it is read from the input stream's buffer. A lot can happen between that time and the data winding up in a DFS block. To replace this we'd ideally want to still compute the checksum as the data is written, transmit it along with the block to the datanodes, then transmit it back to the client when the data is read, and verify it as it is read. We'd also need sub-block checksums, not per-block checksums, so that one can seek without checksumming an entire block.

Yes, TCP does checksums, but memory errors can be introduced on either end outside of the TCP stack, and if blocks are temporarily stored on local disk, that can also be a source of block corruption. So getting rid of CRC files even for HDFS will take more than just per-block checksums on datanodes.
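As a rough sketch of the end-to-end, sub-block scheme described above (illustrative code only, not Hadoop's ChecksumFileSystem; the names ChunkChecksumOutputStream and bytesPerChunk are invented here): checksum each fixed-size chunk as it is written and store the per-chunk CRCs alongside the data, so a reader can recompute and compare each chunk's CRC as it reads the data back, and a seek only costs re-verifying one chunk.

  import java.io.IOException;
  import java.io.OutputStream;
  import java.util.zip.CRC32;

  // Illustrative sketch only: the write-side half of an end-to-end,
  // sub-block checksum.  Payload bytes go to `data`; one 32-bit CRC per
  // fixed-size chunk goes to `sums` (e.g. a companion ".crc" stream).
  class ChunkChecksumOutputStream extends OutputStream {
    private final OutputStream data;
    private final OutputStream sums;
    private final int bytesPerChunk;        // sub-block granularity, e.g. 512 bytes
    private final CRC32 crc = new CRC32();
    private int inChunk = 0;                // bytes seen in the current chunk

    ChunkChecksumOutputStream(OutputStream data, OutputStream sums, int bytesPerChunk) {
      this.data = data;
      this.sums = sums;
      this.bytesPerChunk = bytesPerChunk;
    }

    @Override
    public void write(int b) throws IOException {
      data.write(b);                        // checksum as the byte is written
      crc.update(b);
      if (++inChunk == bytesPerChunk) {
        flushChunkSum();
      }
    }

    private void flushChunkSum() throws IOException {
      long sum = crc.getValue();
      for (int shift = 24; shift >= 0; shift -= 8) {
        sums.write((int) (sum >>> shift) & 0xFF);   // 4 CRC bytes per chunk
      }
      crc.reset();
      inChunk = 0;
    }

    @Override
    public void close() throws IOException {
      if (inChunk > 0) {
        flushChunkSum();                    // final, partial chunk
      }
      data.close();
      sums.close();
    }
  }

The symmetric read side would wrap the input stream, recompute the CRC over each chunk as it is consumed, and compare it with the stored value; a mismatch anywhere between the writer's buffer and the reader's buffer is then caught, which is what makes the check end-to-end.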
> dfs get or copyToLocal should not copy crc file
> -----------------------------------------------
>
>                 Key: HADOOP-738
>                 URL: http://issues.apache.org/jira/browse/HADOOP-738
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.8.0
>         Environment: all
>            Reporter: Milind Bhandarkar
>         Assigned To: Milind Bhandarkar
>             Fix For: 0.9.0
>
>         Attachments: hadoop-crc.patch
>
>
> Currently, when we -get or -copyToLocal a directory from DFS, all the files including crc files are also copied. When we -put or -copyFromLocal again, since the crc files already exist on DFS, this put fails. The solution is not to copy checksum files when copying to local. Patch is forthcoming.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
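The behaviour the issue description asks for amounts to skipping checksum files while walking the tree during -get/-copyToLocal. A minimal, illustrative sketch of that idea (plain java.io, not the attached hadoop-crc.patch; the helper name CopyWithoutCrc is invented here):

  import java.io.File;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.StandardCopyOption;

  // Illustrative sketch only: copy a directory tree to the local
  // filesystem, silently skipping checksum files.
  class CopyWithoutCrc {

    static boolean isChecksumFile(File f) {
      return f.getName().endsWith(".crc");
    }

    static void copyToLocal(File src, File dst) throws IOException {
      if (isChecksumFile(src)) {
        return;                                  // leave crc files behind
      }
      if (src.isDirectory()) {
        if (!dst.exists() && !dst.mkdirs()) {
          throw new IOException("could not create " + dst);
        }
        File[] children = src.listFiles();
        if (children != null) {
          for (File child : children) {
            copyToLocal(child, new File(dst, child.getName()));
          }
        }
      } else {
        Files.copy(src.toPath(), dst.toPath(), StandardCopyOption.REPLACE_EXISTING);
      }
    }
  }

The point, per the description above, is simply that checksum files never leave DFS, so a later -put or -copyFromLocal has nothing to collide with.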