hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
Date Thu, 21 Apr 2016 04:58:26 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251275#comment-15251275
] 

Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

Thanks for the bug report.  This is a tricky one.

One small correction: HDFS-7960 was not introduced as part of DataNode hotswap.  It was originally
introduced to solve issues caused by HDFS-7575, although it fixed issues with hotswap as well.

It seems like we should be able to remove existing DataNode storage report RPCs with the old
ID from the queue when we receive one with a new block report ID.  This would also avoid a
possible congestion collapse scenario caused by repeated retransmissions after the timeout.
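
A rough sketch of that idea, for illustration only. QueuedStorageReport and StorageReportQueue
are hypothetical names, not existing HDFS classes, and the sketch assumes that a report arriving
with a different block report ID always supersedes whatever is still queued for that DataNode:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for one queued per-storage report RPC; not an HDFS class.
final class QueuedStorageReport {
    final String datanodeUuid;
    final String storageId;
    final long blockReportId;

    QueuedStorageReport(String datanodeUuid, String storageId, long blockReportId) {
        this.datanodeUuid = datanodeUuid;
        this.storageId = storageId;
        this.blockReportId = blockReportId;
    }
}

// Hypothetical queue that discards stale reports when a retransmission arrives.
final class StorageReportQueue {
    private final Deque<QueuedStorageReport> pending = new ArrayDeque<>();

    // Adds a report. Queued reports from the same DataNode that carry a different
    // block report ID are dropped first (assumption: the newcomer supersedes them).
    // Returns the number of reports that were dropped.
    synchronized int enqueue(QueuedStorageReport report) {
        int before = pending.size();
        pending.removeIf(q -> q.datanodeUuid.equals(report.datanodeUuid)
                && q.blockReportId != report.blockReportId);
        int dropped = before - pending.size();
        pending.addLast(report);
        return dropped;
    }

    // Removes and returns the oldest queued report, or null if the queue is empty.
    synchronized QueuedStorageReport poll() {
        return pending.pollFirst();
    }

    synchronized int size() {
        return pending.size();
    }
}
```

Dropping the stale entries at enqueue time means a retransmitted report never sits behind
the copy it replaces, so the queue does not grow with every retry.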

> Blocks removed by thousands due to falsely detected zombie storages
> -------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Walter Su
>            Priority: Critical
>         Attachments: HDFS-10301.01.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can time out sending a block report. Then it sends the
> block report again. The NameNode, while processing these two reports at the same time, can
> interleave processing of storages from different reports. This screws up the blockReportId field,
> which makes NameNode think that some storages are zombie. Replicas from zombie storages are
> immediately removed, causing missing blocks.
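
The interleaving described above can be reproduced in a toy model. This is a simplified
sketch, not NameNode code: ZombieDetectionModel is a hypothetical class, and it assumes that
each storage remembers the ID of the last report that touched it and that any storage whose
remembered ID differs from the report being finished is treated as a zombie.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified, hypothetical model of per-storage report tracking; not NameNode code.
final class ZombieDetectionModel {
    // Storage ID -> ID of the last block report that touched this storage.
    private final Map<String, Long> lastBlockReportId = new TreeMap<>();

    // Records that one storage from the given report has been processed.
    void processStorage(String storageId, long blockReportId) {
        lastBlockReportId.put(storageId, blockReportId);
    }

    // Called once the last storage of a report has been processed. Returns the
    // storages whose recorded report ID differs from the finished report's ID.
    List<String> findZombies(long finishedReportId) {
        List<String> zombies = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastBlockReportId.entrySet()) {
            if (e.getValue().longValue() != finishedReportId) {
                zombies.add(e.getKey());
            }
        }
        return zombies;
    }
}
```

Take an original report with ID 1 and its retransmission with ID 2, both covering storages
s1 and s2. If the calls arrive as processStorage("s1", 1), processStorage("s1", 2),
processStorage("s2", 1), then findZombies(1) returns [s1], even though s1 is a healthy
storage that appears in both reports.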



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
