hadoop-common-dev mailing list archives

From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1134) Block level CRCs in HDFS
Date Tue, 20 Mar 2007 21:09:33 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482534 ]

Sameer Paranjpye commented on HADOOP-1134:

+1 for offline upgrades.

>> Owen O'Malley [20/Mar/07 12:59 PM] I think the inline crcs are too problematic. They
>> will add a mapping between logical and physical offsets into the block that will hit a
>> fair amount of code. If the side file is opened with a 4k buffer, it will only take
>> 2 reads of the side file to handle the entire block (assuming 4B CRC/64KB and 128MB
>> blocks). It also is much much easier to handle upgrade.
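
The arithmetic behind the "2 reads" figure can be checked with a short sketch. The function name and parameters below are illustrative only and are not taken from the HDFS code; the sizes are the ones quoted above (4-byte CRC per 64KB of data, 128MB blocks, 4k buffer).

```python
def side_file_reads(block_size, bytes_per_checksum, checksum_size, buffer_size):
    """Return (chunks, checksum_bytes, reads) for scanning one block's side file.

    Hypothetical helper for illustration; all sizes are in bytes.
    """
    # Number of checksummed chunks in the block (ceiling division).
    chunks = -(-block_size // bytes_per_checksum)
    # Total size of the checksum side file for this block.
    checksum_bytes = chunks * checksum_size
    # Buffer-sized reads needed to pull in the whole side file.
    reads = -(-checksum_bytes // buffer_size)
    return chunks, checksum_bytes, reads


chunks, checksum_bytes, reads = side_file_reads(
    block_size=128 * 1024 * 1024,   # 128MB block
    bytes_per_checksum=64 * 1024,   # one CRC per 64KB of data
    checksum_size=4,                # 4-byte CRC
    buffer_size=4 * 1024,           # 4k read buffer
)
print(chunks, checksum_bytes, reads)
```

With these numbers the side file holds 2048 checksums in 8192 bytes, which is exactly two 4k buffers.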

It takes only 2 reads to handle the entire block, which is good. But it takes those same 2
reads to handle a tiny fraction of the block as well, which is where the downside appears.
It's quite clear that doing inline checksums makes the upgrade process a lot harder. The question
is whether taking the hit of a difficult upgrade and complicating the data access code
is a reasonable price to pay for halving the number of seeks in the system for good. It feels
like it is. Thoughts?
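
To make the "mapping between logical and physical offsets" concrete, here is a minimal sketch of what inline checksums would require. It assumes a layout in which every 64KB data chunk is immediately followed by its 4-byte CRC; that layout and the function names are assumptions for illustration, not a design taken from this issue.

```python
BYTES_PER_CHECKSUM = 64 * 1024  # assumed chunk size
CHECKSUM_SIZE = 4               # assumed CRC size in bytes


def physical_offset(logical_offset,
                    bytes_per_checksum=BYTES_PER_CHECKSUM,
                    checksum_size=CHECKSUM_SIZE):
    """Map an offset in the block's data to an offset in the on-disk file.

    Assumes each data chunk is followed by its CRC, so every complete
    chunk before the requested offset shifts it by one checksum.
    """
    full_chunks_before = logical_offset // bytes_per_checksum
    return logical_offset + full_chunks_before * checksum_size


def physical_length(logical_length,
                    bytes_per_checksum=BYTES_PER_CHECKSUM,
                    checksum_size=CHECKSUM_SIZE):
    """On-disk size of a block file holding logical_length bytes of data."""
    chunks = -(-logical_length // bytes_per_checksum)  # ceiling division
    return logical_length + chunks * checksum_size


print(physical_offset(64 * 1024))          # first byte of the second chunk
print(physical_length(128 * 1024 * 1024))  # full 128MB block plus its CRCs
```

Every code path that seeks, reads, or reports lengths on a block file would need a translation of this kind, which is the complexity being weighed against the saved seek.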

> Block level CRCs in HDFS
> ------------------------
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
> Currently CRCs are handled at the FileSystem level and are transparent to core HDFS. See the
> recent improvement HADOOP-928 (which can add checksums to a given filesystem) for more about
> it. Though this has served us well, there are a few disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In many cases,
> it nearly doubles the number of blocks. Taking the namenode out of CRCs would nearly double namespace
> performance in terms of both CPU and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted blocks. With
> block level CRCs, a Datanode can periodically verify the checksums and report corruptions to
> the namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as in GFS.
> I will update the jira with detailed requirements and design. This will include the same guarantees
> provided by the current implementation and will include an upgrade of current data.
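
The periodic verification described in point 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the Datanode implementation: it uses Python's `zlib.crc32`, a 64KB chunk size, and hypothetical helper names, and it works on in-memory bytes rather than block files.

```python
import zlib

BYTES_PER_CHECKSUM = 64 * 1024  # assumed chunk size


def compute_checksums(data, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Return one CRC32 per chunk of data (the last chunk may be short)."""
    return [
        zlib.crc32(data[i:i + bytes_per_checksum])
        for i in range(0, len(data), bytes_per_checksum)
    ]


def find_corrupt_chunks(data, expected, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Return indexes of chunks whose CRC32 differs from the stored value."""
    actual = compute_checksums(data, bytes_per_checksum)
    if len(actual) != len(expected):
        raise ValueError("checksum count does not match block length")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# A 256KB stand-in for a block, checksummed at write time.
block = bytes(range(256)) * 1024
stored = compute_checksums(block)

# Simulate on-disk corruption of one byte inside the second chunk.
damaged = bytearray(block)
damaged[70000] ^= 0xFF
print(find_corrupt_chunks(bytes(damaged), stored))
```

A scanner built along these lines would report the affected block to the namenode instead of printing, so that a new replica can be scheduled.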

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
