From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Wed, 22 Nov 2006 11:45:04 -0800 (PST)
Message-ID: <10618106.1164224704682.JavaMail.jira@brutus>
In-Reply-To: <18953484.1164060182168.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-738) dfs get or copyToLocal should not copy crc file

    [ http://issues.apache.org/jira/browse/HADOOP-738?page=comments#action_12452033 ]

Doug Cutting commented on HADOOP-738:
-------------------------------------

> Do we still want to support crc files [...]

MapReduce data spends a lot of time in memory (while sorting) and on local disks. Most checksum errors folks see are from local disks during sorting, not HDFS. So, yes, we'll still need it.

And per-block checksums are different: they're not end-to-end. Currently we checksum the data as it is written to the output stream's buffer and validate it as it is read from the input stream's buffer. A lot can happen between that time and the data winding up in a DFS block. To replace this we'd ideally want to still compute the checksum as the data is written, transmit it along with the block to the datanodes, then transmit it back to the client when the data is read, and verify it as it is read. We'd also need sub-block checksums, not per-block checksums, so that one can seek without checksumming an entire block.

Yes, TCP does checksums, but memory errors can be introduced on either end outside of the TCP stack, and if blocks are temporarily stored on local disk, that can also be a source of block corruption. So getting rid of CRC files even for HDFS will take more than just per-block checksums on datanodes.
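As a rough sketch of the end-to-end, sub-block scheme described above (illustrative code only, not Hadoop's ChecksumFileSystem; the names ChunkChecksumOutputStream and bytesPerChunk are invented here): checksum each fixed-size chunk as it is written and store the per-chunk CRCs alongside the data, so a reader can recompute and compare each chunk's CRC as it reads the data back, and a seek only costs re-verifying one chunk.

  import java.io.IOException;
  import java.io.OutputStream;
  import java.util.zip.CRC32;

  // Illustrative sketch only: the write-side half of an end-to-end,
  // sub-block checksum.  Payload bytes go to `data`; one 32-bit CRC per
  // fixed-size chunk goes to `sums` (e.g. a companion ".crc" stream).
  class ChunkChecksumOutputStream extends OutputStream {
    private final OutputStream data;
    private final OutputStream sums;
    private final int bytesPerChunk;        // sub-block granularity, e.g. 512 bytes
    private final CRC32 crc = new CRC32();
    private int inChunk = 0;                // bytes seen in the current chunk

    ChunkChecksumOutputStream(OutputStream data, OutputStream sums, int bytesPerChunk) {
      this.data = data;
      this.sums = sums;
      this.bytesPerChunk = bytesPerChunk;
    }

    @Override
    public void write(int b) throws IOException {
      data.write(b);                        // checksum as the byte is written
      crc.update(b);
      if (++inChunk == bytesPerChunk) {
        flushChunkSum();
      }
    }

    private void flushChunkSum() throws IOException {
      long sum = crc.getValue();
      for (int shift = 24; shift >= 0; shift -= 8) {
        sums.write((int) (sum >>> shift) & 0xFF);   // 4 CRC bytes per chunk
      }
      crc.reset();
      inChunk = 0;
    }

    @Override
    public void close() throws IOException {
      if (inChunk > 0) {
        flushChunkSum();                    // final, partial chunk
      }
      data.close();
      sums.close();
    }
  }

The symmetric read side would wrap the input stream, recompute the CRC over each chunk as it is consumed, and compare it with the stored value; a mismatch anywhere between the writer's buffer and the reader's buffer is then caught, which is what makes the check end-to-end.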
> dfs get or copyToLocal should not copy crc file
> -----------------------------------------------
>
>                 Key: HADOOP-738
>                 URL: http://issues.apache.org/jira/browse/HADOOP-738
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.8.0
>         Environment: all
>            Reporter: Milind Bhandarkar
>         Assigned To: Milind Bhandarkar
>             Fix For: 0.9.0
>
>         Attachments: hadoop-crc.patch
>
>
> Currently, when we -get or -copyToLocal a directory from DFS, all the files including crc files are also copied. When we -put or -copyFromLocal again, since the crc files already exist on DFS, this put fails. The solution is not to copy checksum files when copying to local. Patch is forthcoming.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
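The behaviour the issue description asks for amounts to skipping checksum files while walking the tree during -get/-copyToLocal. A minimal, illustrative sketch of that idea (plain java.io, not the attached hadoop-crc.patch; the helper name CopyWithoutCrc is invented here):

  import java.io.File;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.StandardCopyOption;

  // Illustrative sketch only: copy a directory tree to the local
  // filesystem, silently skipping checksum files.
  class CopyWithoutCrc {

    static boolean isChecksumFile(File f) {
      return f.getName().endsWith(".crc");
    }

    static void copyToLocal(File src, File dst) throws IOException {
      if (isChecksumFile(src)) {
        return;                                  // leave crc files behind
      }
      if (src.isDirectory()) {
        if (!dst.exists() && !dst.mkdirs()) {
          throw new IOException("could not create " + dst);
        }
        File[] children = src.listFiles();
        if (children != null) {
          for (File child : children) {
            copyToLocal(child, new File(dst, child.getName()));
          }
        }
      } else {
        Files.copy(src.toPath(), dst.toPath(), StandardCopyOption.REPLACE_EXISTING);
      }
    }
  }

The point, per the description above, is simply that checksum files never leave DFS, so a later -put or -copyFromLocal has nothing to collide with.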