hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1134) Block level CRCs in HDFS
Date Wed, 28 Mar 2007 19:38:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484970 ]

Raghu Angadi commented on HADOOP-1134:
--------------------------------------


"io.bytes.per.checksum" will still be used for the expected purpose and will dictate bytes/checksum
for all the new data created by the client. Doug prefered client dictating this value and
I preferred Namenode or Datanode informing client (using the same config).
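To make that concrete, here is a rough sketch of how a client-side writer would pick up the
value (only the config key is from this discussion; the class and usage are illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class BytesPerChecksumExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Client reads the configured chunk size; 512 bytes is the usual default.
        int bytesPerChecksum = conf.getInt("io.bytes.per.checksum", 512);
        System.out.println("bytes per checksum = " + bytesPerChecksum);
      }
    }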

Similar to the current behavior, each checksum file has a header that indicates bytes/checksum.
Thus, at any time, each block has its own bytes/checksum; it does not need to match other
blocks or even other replicas. When a block is copied to another datanode, the source datanode
decides what bytes/checksum is used.
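For illustration only, the per-block checksum file header could look something like this
(field names and layout are my assumption, not a final format):

    import java.io.DataInputStream;
    import java.io.IOException;

    // Hypothetical header of a per-block checksum file. Because the header
    // carries bytesPerChecksum, a block never has to agree with other blocks
    // or replicas on chunk size.
    class BlockCrcHeader {
      short version;          // checksum file format version
      int bytesPerChecksum;   // chunk size this block was checksummed with

      static BlockCrcHeader readFrom(DataInputStream in) throws IOException {
        BlockCrcHeader h = new BlockCrcHeader();
        h.version = in.readShort();
        h.bytesPerChecksum = in.readInt();
        return h;
      }
    }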

> Do we simply copy existing checksum data or do we re-generate it? 
During upgrade, we simply copy, of course with a new header. This is Doug's preference since
it will speed up the upgrade. I agree, but don't mind implementing a forced check during upgrade.

During block replication, the destination datanode verifies the checksum and creates its local
copy. The end result is that source and destination will have the same content in the checksum
file (unless the header format has changed).
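A minimal sketch of that destination-side verification, assuming CRC32 and one 4-byte checksum
per chunk (the class and method names are made up for illustration):

    import java.io.IOException;
    import java.util.zip.CRC32;

    class ReplicaVerifier {
      // Recompute the CRC of a received chunk of block data and compare it
      // against the checksum sent by the source datanode.
      static void verifyChunk(byte[] chunk, int len, int expectedCrc) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, len);
        if ((int) crc.getValue() != expectedCrc) {
          throw new IOException("checksum mismatch in replicated block data");
        }
      }
    }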

> I don't think simply copying checksum data is enough since the checksums can themselves
> be corrupt. We need some level of validation.
> We can compare copies of the checksum data against each other, if we find a majority of
> copies that match then we treat those as authoritative. But what happens when we don't
> find a majority? Or we can re-generate checksum data on the Datanode and validate it
> against the existing data.

If we cannot get the old CRC data for any reason, we will generate one based on the local data
(which could be wrong). There are three options to validate upgraded data (for simplicity, not
all details and error conditions are spelled out):
1) use old CRCs (Doug's choice)
2) check the CRC of each replica and choose the majority (Sameer's choice)
3) a combination of (1) and (2), i.e. use (2) if (1) fails, etc. This option is only proposed
now; a rough sketch follows below.
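The sketch of option (3), with both data sources as purely hypothetical placeholders:

    // Option (3): prefer the old .crc data; fall back to a majority vote
    // across replicas. Both helpers below are hypothetical.
    class UpgradeValidation {
      interface CrcSource {
        byte[] oldCrcData();                // null if the old .crc file is unavailable
        byte[] majorityCrcAcrossReplicas(); // null if no majority is found
      }

      static byte[] chooseChecksums(CrcSource src) {
        byte[] crc = src.oldCrcData();            // (1) trust the old CRC file
        if (crc == null) {
          crc = src.majorityCrcAcrossReplicas();  // (2) majority of replicas
        }
        return crc; // still null => regenerate from local data as a last resort
      }
    }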

At this point, I would leave it to you guys to decide which one we should do. Please choose one.

> How does a Datanode discover authoritative sources of checksum data for its blocks?

Is this during upgrade?

> This works while the upgrade is in progress but perhaps it can be extended to deal with
> Datanodes that join the system after the upgrade is complete. If a Datanode joins after a
> complete upgrade and crc file deletion, the Namenode could redirect it to other Datanodes
> that have copies of its blocks, the new Datanode can then pull block level CRC files from
> its peers, validate its data and perform an upgrade even though the .crc files are gone.

My expectation was that once the upgrade is considered done by the namenode, any datanode
that comes up with the old data layout will shut down with a clear error message, without
entering the upgrade phase. The following two conditions should be met before the upgrade is
considered done:
1) All the datanodes that registered should report completion of their upgrade. If the Namenode
restarts, each datanode will re-register and again report its completion.
2) Similar to the current safemode, "dfs.safemode.threshold.pct" of the blocks that belong to
non .crc files should have at least "dfs.replication.min" replicas reported as upgraded (a
small sketch of this check follows).
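The sketch of condition (2); the counting itself is assumed to happen elsewhere, and only the
parameter comments refer to the existing config keys:

    class UpgradeExitCheck {
      // Returns true once enough blocks of non-.crc files have at least
      // dfs.replication.min replicas reported as upgraded.
      static boolean upgradeComplete(long upgradedWithMinReplicas,
                                     long totalNonCrcBlocks,
                                     float thresholdPct /* dfs.safemode.threshold.pct */) {
        if (totalNonCrcBlocks == 0) {
          return true;
        }
        return upgradedWithMinReplicas >= thresholdPct * totalNonCrcBlocks;
      }
    }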

Of course, we can still do what you propose for datanodes that come up with old data after
the above conditions are met, if it is considered required. It means that some of the
upgrade-specific code could spread a little more into the normal operation of the namenode.
The upgrade is already complicated enough that this might not add much more code.






> Block level CRCs in HDFS
> ------------------------
>
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to core HDFS. See the
> recent improvement HADOOP-928 (which can add checksums to a given filesystem) for more about
> it. Though this has served us well, there are a few disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In many cases,
> it nearly doubles the number of blocks. Taking the namenode out of CRCs would nearly double
> namespace performance, both in terms of CPU and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted blocks. With
> block level CRCs, the Datanode can periodically verify the checksums and report corruptions
> to the namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as in GFS.
> I will update the jira with detailed requirements and design. This will include the same
> guarantees provided by the current implementation and will include an upgrade of current data.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

