hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2130) Switch default checksum to CRC32C
Date Mon, 31 Oct 2011 22:05:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140609#comment-13140609 ]

Todd Lipcon commented on HDFS-2130:

Turns out this is actually fairly difficult. The reason is that the checksumming is done at
the DFSOutputStream layer, rather than the DataStreamer layer. So, the checksum algorithm
and chunk size need to be known _before_ the output stream connects to the datanode.
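
To make the ordering problem concrete, here is a rough sketch (simplified, hypothetical names
rather than the real DFSOutputStream internals) of why the checksum is baked in before any
datanode is contacted:

{code:java}
import java.util.zip.Checksum;

// Simplified illustration only -- not the real DFSOutputStream. The point is
// that the checksum object and chunk size are fixed at construction time,
// i.e. before the DataStreamer ever opens a pipeline to a datanode.
class SimplifiedChecksummingStream {
  private final Checksum checksum;      // e.g. CRC32 today, CRC32C as the new default
  private final int bytesPerChecksum;   // chunk size, also fixed up front

  SimplifiedChecksummingStream(Checksum checksum, int bytesPerChecksum) {
    this.checksum = checksum;
    this.bytesPerChecksum = bytesPerChecksum;
  }

  // Each full chunk of bytesPerChecksum bytes is checksummed here and queued as
  // packet data; the streamer that later talks to the datanodes never gets a
  // chance to renegotiate the algorithm or the chunk size.
  long checksumChunk(byte[] buf, int off, int len) {
    checksum.reset();
    checksum.update(buf, off, len);
    return checksum.getValue();
  }
}
{code}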

Here are a few possible solutions:
1) When append() is called, make an RPC to the datanode hosting the last block of the file.
This RPC reads the block's meta header and returns the checksum parameters (type and chunk size)
recorded there; the DFSOutputStream then adopts them. A rough sketch follows the pros/cons below.
Pros:
- Fairly simple to implement.
- Allows switching the checksum type *and* the chunk size.
Cons:
- Extra round-trip to set up the pipeline for append.
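
A minimal sketch of what option 1 could look like, with hypothetical names throughout
(ChecksumInfo, readMetaHeader, etc. are illustrative, not an actual datanode RPC):

{code:java}
import java.io.IOException;

// Hypothetical sketch of option 1: one extra round-trip on append() so the
// client can adopt whatever checksum the last block's meta file already uses.
class AppendChecksumNegotiation {

  // What the datanode would return after reading the last block's meta header.
  static class ChecksumInfo {
    final String type;            // e.g. "CRC32" or "CRC32C"
    final int bytesPerChecksum;   // chunk size recorded in the meta file
    ChecksumInfo(String type, int bytesPerChecksum) {
      this.type = type;
      this.bytesPerChecksum = bytesPerChecksum;
    }
  }

  // Hypothetical RPC exposed by the datanode hosting the last block.
  interface LastBlockDatanode {
    ChecksumInfo readMetaHeader(long blockId) throws IOException;
  }

  // Called while setting up the append pipeline: the returned type and chunk
  // size are used to construct the DFSOutputStream's checksummer instead of
  // the configured default.
  static ChecksumInfo negotiateForAppend(LastBlockDatanode dn, long blockId)
      throws IOException {
    return dn.readMetaHeader(blockId);
  }
}
{code}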

2) In the case of append, the DN can allow a writer to use a different checksum _algorithm_
so long as the chunk size and checksum size are the same. In this case, it will verify the
incoming packets using the writer's algorithm, then re-checksum them using the on-disk algorithm
before writing to the meta file. A rough sketch follows the pros/cons below.
Pros:
- No extra round-trip on pipeline creation.
- No need to change client code.
- When the client transitions to the next block of a file being appended, the new (preferred)
checksum is used.
Cons:
- Slight performance hit while filling up the last block of a file being appended.
- Not a general solution (only supports changing the polynomial, not the chunk size).
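
And a minimal sketch of the datanode-side re-checksumming for option 2, assuming the chunk size
and checksum width match on both sides. Names are illustrative, and java.util.zip.CRC32C (JDK 9+)
stands in here for whatever CRC32C implementation we actually ship:

{code:java}
import java.util.zip.CRC32;
import java.util.zip.CRC32C;   // JDK 9+ only; in Hadoop this would be our own CRC32C class
import java.util.zip.Checksum;

// Hypothetical sketch of option 2: verify one chunk with the appending
// writer's algorithm, then recompute the value with the block's existing
// on-disk algorithm before it is written to the meta file.
class RechecksumOnAppend {

  static long verifyAndRechecksum(byte[] chunk, long writerSum) {
    Checksum writerAlgo = new CRC32C();  // what the appending client sent
    writerAlgo.update(chunk, 0, chunk.length);
    if (writerAlgo.getValue() != writerSum) {
      throw new IllegalStateException("checksum error in packet from writer");
    }
    Checksum diskAlgo = new CRC32();     // what the existing block's meta file uses
    diskAlgo.update(chunk, 0, chunk.length);
    return diskAlgo.getValue();          // stored in the meta file instead of writerSum
  }
}
{code}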

Any other ideas?
> Switch default checksum to CRC32C
> ---------------------------------
>                 Key: HDFS-2130
>                 URL: https://issues.apache.org/jira/browse/HDFS-2130
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs client
>            Reporter: Todd Lipcon
> Once the other subtasks/parts of HDFS-2080 are complete, CRC32C will be a much more efficient
> checksum algorithm than CRC32. Hence we should change the default checksum to CRC32C.
> However, in order to continue to support append against blocks created with the old checksum,
> we will need to implement some kind of handshaking in the write pipeline.


