Date: Thu, 23 Aug 2012 01:09:42 +1100 (NCT)
From: "Kihwal Lee (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <228793507.441.1345644582680.JavaMail.jiratomcat@arcas>
In-Reply-To: <1478976591.1207.1333385124633.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-3177) Allow DFSClient to find out and use the CRC type being used for a file.

[ https://issues.apache.org/jira/browse/HDFS-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439549#comment-13439549 ]

Kihwal Lee commented on HDFS-3177:
----------------------------------

bq. For append, it makes a lot of sense to keep using the existing checksum type. What is the use case for using a different checksum type?
I don't think it makes sense either, but that was the design decision made in HDFS-2130. There might have been some use cases for this, so I tried to support it while making the default disallow it. If you feel that this should be the behavior with no configurable option, I will be happy to update the patch accordingly. What do you think we should do for concat()? It is supposed to be a quick, namenode-only operation, so I don't feel comfortable inserting code to check the checksums of the input files.

bq. Suppose the last block is half written with CRC32 in a closed file. Then, the file is re-opened for append with CRC32C. Would the block have two checksum types, i.e. first half is CRC32 and the second half is CRC32C?

No. The datanode will continue to use the checksum parameters of the existing partial block for writing, independent of what the client is sending with the data. Input data integrity checking is still done, of course.

bq. Suppose a closed file is already using more than one checksum type. Then, the file is re-opened for append with dfs.client.append.allow-different-checksum == false. Which checksum should it use? Or should it fail?

I don't think we can do much for existing files. Users can detect them with getFileChecksum(), which will show DataChecksum.Type.MIXED as the checksum type. For these files, checksums will still be used for block-level integrity checks, and nothing will break until something like distcp tries to compare FileChecksums after copying.

> Allow DFSClient to find out and use the CRC type being used for a file.
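To see why a MIXED checksum type defeats FileChecksum comparison, note that CRC32 and CRC32C use different polynomials and so produce different values for the same bytes. The following is a minimal, standalone Java sketch using java.util.zip (CRC32C requires Java 9 or later); it is an illustration, not HDFS code:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;

public class ChecksumTypeDemo {
    public static void main(String[] args) {
        byte[] data = "hello, hdfs".getBytes(StandardCharsets.UTF_8);

        // CRC32 and CRC32C checksum the same bytes with different polynomials.
        CRC32 crc32 = new CRC32();
        crc32.update(data);
        CRC32C crc32c = new CRC32C();
        crc32c.update(data);

        // The two values differ, so per-block checksums of files written with
        // mixed types cannot be meaningfully compared across copies.
        System.out.println("CRC32  = " + Long.toHexString(crc32.getValue()));
        System.out.println("CRC32C = " + Long.toHexString(crc32c.getValue()));
    }
}
```

This is why tools like distcp, which compare composite FileChecksums after copying, would be the first place a mixed-type file surfaces as a problem.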
> -----------------------------------------------------------------------
>
> Key: HDFS-3177
> URL: https://issues.apache.org/jira/browse/HDFS-3177
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node, hdfs client
> Affects Versions: 0.23.0
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Fix For: 2.1.0-alpha, 3.0.0
>
> Attachments: hdfs-3177-after-hadoop-8239-8240.patch.txt, hdfs-3177-after-hadoop-8239.patch.txt, hdfs-3177-branch2-trunk.patch.txt, hdfs-3177.patch, hdfs-3177-with-hadoop-8239-8240.patch.txt, hdfs-3177-with-hadoop-8239-8240.patch.txt, hdfs-3177-with-hadoop-8239-8240.patch.txt, hdfs-3177-with-hadoop-8239.patch.txt
>
> To support HADOOP-8060, DFSClient should be able to find out the checksum type being used for files in HDFS.
> In my prototype, DataTransferProtocol was extended to include the checksum type in the blockChecksum() response. DFSClient uses it in getFileChecksum() to determine the checksum type. Also, append() can be configured to use the existing checksum type instead of the configured one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira