hadoop-hdfs-issues mailing list archives

From "Vinitha Reddy Gankidi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order
Date Tue, 13 Sep 2016 01:21:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485884#comment-15485884 ]

Vinitha Reddy Gankidi commented on HDFS-10301:

After a thorough investigation of the heartbeat logic, I have verified that unreported storages do
get removed without any code change. The attached patch 014 eliminates the state and the zombie
storage removal logic introduced in HDFS-7960.
I have added a unit test that verifies that when a DN storage with blocks is removed, the
storage is also removed from the DatanodeDescriptor and does not linger forever. Unreported
storages are marked as FAILED in the {{updateHeartbeatState}} method when {{checkFailedStorages}}
is true. Thus, when a DN storage is removed, it is marked as FAILED on the next heartbeat.
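
To illustrate the marking step described above, here is a minimal, self-contained sketch. It is a simplified model and not the actual HDFS code: the class {{HeartbeatSketch}}, the {{State}} enum, the {{storages}} map and the method signature are assumptions made for illustration; only the names {{updateHeartbeatState}} and {{checkFailedStorages}} come from the comment itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified model, NOT the real HDFS classes. Fields and signatures here
// are assumptions; only the two names from the comment are borrowed.
class HeartbeatSketch {
    enum State { NORMAL, FAILED }

    // storage ID -> state; stands in for the storage map kept per DataNode
    final Map<String, State> storages = new HashMap<>();

    void addStorage(String id) {
        storages.put(id, State.NORMAL);
    }

    // Models the marking step only: when checkFailedStorages is true, every
    // known storage that is absent from the heartbeat is marked FAILED.
    void updateHeartbeatState(Set<String> reportedIds, boolean checkFailedStorages) {
        if (!checkFailedStorages) {
            return; // nothing is marked when the check is disabled
        }
        for (Map.Entry<String, State> e : storages.entrySet()) {
            if (!reportedIds.contains(e.getKey())) {
                e.setValue(State.FAILED);
            }
        }
    }

    public static void main(String[] args) {
        HeartbeatSketch dn = new HeartbeatSketch();
        dn.addStorage("DS-1");
        dn.addStorage("DS-2");
        // The heartbeat reports only DS-1, as if DS-2 had been removed.
        dn.updateHeartbeatState(Collections.singleton("DS-1"), true);
        System.out.println(dn.storages);
    }
}
```

The sketch covers only the FAILED marking; the later removal of blocks and of the storage itself is not modeled here.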

The storage removal then happens in two steps (refer to Steps 2 and 3 in https://issues.apache.org/jira/browse/HDFS-10301?focusedCommentId=15427387&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15427387).

The test {{testRemovingStorageDoesNotProduceZombies}} introduced in HDFS-7960 passes; I reduced
the heartbeat recheck interval so that the test doesn't time out. By default, the Heartbeat
Manager removes blocks associated with failed storages every 5 minutes.
I have ignored {{testProcessOverReplicatedAndMissingStripedBlock}} in this patch. Please refer
to HDFS-10854 for more details.

> BlockReport retransmissions may lead to storages falsely being declared zombie if storage
report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, HDFS-10301.004.patch,
HDFS-10301.005.patch, HDFS-10301.006.patch, HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, HDFS-10301.012.patch, HDFS-10301.013.patch,
HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
> When the NameNode is busy, a DataNode can time out sending a block report and then send the
block report again. While processing these two reports at the same time, the NameNode can interleave
processing of storages from different reports. This screws up the blockReportId field, which
makes the NameNode think that some storages are zombies. Replicas from zombie storages are immediately
removed, causing missing blocks.
