hadoop-common-dev mailing list archives

From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1134) Block level CRCs in HDFS
Date Wed, 28 Mar 2007 23:34:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485026 ]

Sameer Paranjpye commented on HADOOP-1134:

> Sameer, does the following sum up your proposal (a refinement of option (3) above):
> 1) For each block for which a CRC is generated (in one of the ways mentioned below), the Datanode reports the CRC of the checksum file to the Namenode.

Not really. I don't think I mentioned the Namenode keeping CRCs of block CRCs. 

There is another issue that appears not to have been dealt with yet: we say that the Datanode
locally generates CRCs if the .crc file for a block is unavailable. We need to be very careful
here, IMO. During registration, many blocks appear to be missing simply because they haven't
been reported yet. We don't want the Namenode to report missing .crc files to the Datanodes
until a certain threshold of blocks has been reported. This threshold should probably be the
same as dfs.safemode.threshold.pct.

I would propose the following:

1) Extend the Namenode interface with a getChecksumAuthority(long blockId) method. This method
takes a block-id as input and responds with a <type, authority, offset> tuple, where 'authority'
is the name of the .crc file in the Namenode's namespace and 'offset' is the offset of the
specified block in the *data* file. It throws an appropriate exception when the input block-id
is unrecognized, belongs to a .crc file, or the checksum authority is missing. The 'type' field
indicates the authority type, which is one of CRCFILE, DATANODE or YOU. The latter two codes
are used by the Namenode when it has determined that blocks of a .crc file are missing.
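A minimal Java sketch of what this interface could look like. All names, types and values here are illustrative, based only on the description above; the actual signatures would be decided in the patch:

```java
// Hypothetical sketch of the proposed Namenode-side method; names and
// types are illustrative, taken from the description above.
public class ChecksumAuthorityExample {

    enum AuthorityType { CRCFILE, DATANODE, YOU }

    /** The <type, authority, offset> tuple returned by getChecksumAuthority(). */
    static class ChecksumAuthority {
        final AuthorityType type;
        final String authority;  // name of the .crc file (or the chosen Datanode)
        final long offset;       // offset of the block in the *data* file
        ChecksumAuthority(AuthorityType type, String authority, long offset) {
            this.type = type;
            this.authority = authority;
            this.offset = offset;
        }
    }

    /** Illustrative stand-in for the proposed Namenode call. */
    static ChecksumAuthority getChecksumAuthority(long blockId) {
        if (blockId < 0) {
            // unrecognized block-id, a block of a .crc file, or a missing authority
            throw new IllegalArgumentException("no checksum authority for " + blockId);
        }
        // In the common case the authority is the .crc file itself;
        // path and offset here are made-up example values.
        return new ChecksumAuthority(AuthorityType.CRCFILE, "/user/foo/.bar.crc", 67108864L);
    }
}
```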

2) Each Datanode does the following during the upgrade:
  - For each block it calls getChecksumAuthority(), discovers the checksum file, opens it
for read, reads the header and discovers the bytes/chksum for the current block
  - If getChecksumAuthority() fails, the Datanode moves on to the next block; it will return
to this block once it has run through all its remaining blocks
  - It uses the bytes/chksum and the data offset to determine where in the .crc file the current
block's checksums lie
  - It reads the checksums from one of the replicas and validates the block data against them.
If validation succeeds, the checksum data is written to disk and it moves on to the next block.
The checksum upgrade for the current block is reported to the Namenode.
  - If validation fails, it tries the other replicas of the checksum data. If validation fails
against all checksum replicas, it arbitrarily chooses one replica, copies the checksum data
from it and reports a corrupt block to the Namenode.
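The offset arithmetic in step 2 can be sketched as follows, assuming the existing .crc layout of a fixed-size header followed by one 4-byte CRC32 per bytes/chksum chunk (the header size used here is an assumption, not taken from the proposal):

```java
// Sketch of the .crc-offset arithmetic from step 2; the header size and
// file layout are assumptions about the existing checksum file format.
public class CrcOffsetExample {

    static final int CHECKSUM_SIZE = 4;  // CRC32 is 4 bytes per chunk

    /**
     * Where in the .crc file the checksums for a block start, given the
     * block's offset in the data file and the bytes-per-checksum read from
     * the .crc header. Assumes dataOffset is a multiple of bytesPerChecksum.
     */
    static long crcOffset(long dataOffset, int bytesPerChecksum, long headerSize) {
        long chunkIndex = dataOffset / bytesPerChecksum;
        return headerSize + chunkIndex * CHECKSUM_SIZE;
    }

    public static void main(String[] args) {
        // A block starting 128MB into the data file, 512 bytes/chksum,
        // and an assumed 8-byte header.
        System.out.println(crcOffset(128L * 1024 * 1024, 512, 8));  // prints 1048584
    }
}
```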

3) When the Namenode determines that the .crc file corresponding to a block is unavailable,
it chooses one of the Datanodes hosting the block as a representative to locally generate
CRCs for the block. It does so by sending YOU in the type field when getChecksumAuthority
is invoked. For the remaining Datanodes hosting the block, the Namenode sends DATANODE in
the type field and asks them to copy CRCs from the chosen representative.
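Step 3 amounts to the Namenode picking one replica-holder as the CRC generator and pointing the rest at it. A hedged sketch; the selection policy (first in the list) is purely illustrative, since the proposal does not specify one:

```java
import java.util.List;

// Illustrative sketch of step 3: when the .crc blocks are gone, one host
// regenerates CRCs locally (YOU) and the others copy from it (DATANODE).
public class AuthorityAssignmentExample {

    enum AuthorityType { CRCFILE, DATANODE, YOU }

    /** What the Namenode would answer the asking Datanode in step 3. */
    static AuthorityType answerFor(String askingDatanode, List<String> hostsOfBlock) {
        // Pick a representative deterministically; the real policy
        // (first reported, least loaded, ...) is not specified in the proposal.
        String representative = hostsOfBlock.get(0);
        return askingDatanode.equals(representative)
            ? AuthorityType.YOU        // "generate CRCs locally"
            : AuthorityType.DATANODE;  // "copy CRCs from the representative"
    }
}
```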

The upgrade is considered complete when dfs.replication.min replicas of all known blocks have
been transitioned to block level CRCs. 
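The completion condition could be checked roughly as follows (a sketch only; the counter structure and names are hypothetical, and dfs.replication.min would be read from the configuration):

```java
import java.util.Map;

// Sketch of the completion condition: every known block must have at least
// dfs.replication.min replicas upgraded to block-level CRCs. The map of
// per-block upgrade counts is a hypothetical bookkeeping structure.
public class UpgradeCompleteExample {

    /** upgradedReplicas: blockId -> number of replicas reported as upgraded. */
    static boolean upgradeComplete(Map<Long, Integer> upgradedReplicas, int minReplication) {
        return upgradedReplicas.values().stream().allMatch(n -> n >= minReplication);
    }
}
```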

In some cases, this condition will not be met because some data blocks are missing.

During the upgrade process 'dfs -report' should indicate how many blocks have not been upgraded
and for what reason. It should also indicate whether 
a) the upgrade is incomplete
b) the upgrade is complete
c) the upgrade is wedged because some blocks are missing

In the case of b) or c) occurring, the sysadmin can issue a 'finishUpgrade' command to the
Namenode which causes the .crc files to be removed and their blocks marked for deletion. Note
that this is different from 'finalizeUpgrade' which causes state from the previous version
to be discarded. Datanodes that join the system after the upgrade is finished are handled
using 3) above.

This is a more complex proposal, but the additional complexity has been introduced in order
to provide much stronger correctness guarantees, so I feel that it is warranted. Comments?

> Block level CRCs in HDFS
> ------------------------
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
> Currently CRCs are handled at the FileSystem level and are transparent to core HDFS. See
the recent improvement HADOOP-928 (which can add checksums to a given filesystem) for more
about it. Though this has served us well, there are a few disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In many cases,
it nearly doubles the number of blocks. Taking the Namenode out of CRCs would nearly double
namespace performance, both in terms of CPU and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted blocks. With
block-level CRCs, the Datanode can periodically verify the checksums and report corruptions
to the Namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as in GFS.
I will update the jira with detailed requirements and design. This will include the same
guarantees provided by the current implementation and will include an upgrade of current data.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
