hadoop-hdfs-issues mailing list archives

From "Walter Su (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order
Date Fri, 22 Apr 2016 02:08:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253181#comment-15253181 ]

Walter Su commented on HDFS-10301:
----------------------------------

bq. Enabling HDFS-9198 will fifo process BRs. It doesn't solve this implementation bug but
virtually eliminates it from occurring.
bq. This addresses Daryn's comment rather than solving the reported bug, as BTW Daryn correctly
stated.
That's incorrect. Please run the test in the 001 patch with and without the fix and you'll see the
difference. It does solve the issue, because:

The bug only exists when the reports are contained in one RPC. If they are split into multiple
RPCs, it's not a problem, because the {{rpcsSeen}} guard prevents it from happening. So my approach
is to process the reports contained in one RPC contiguously, by putting them into the queue atomically.
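
To illustrate the idea of enqueuing one RPC's reports atomically, here is a minimal sketch. It is not the code from the patch: the class {{AtomicReportQueue}}, the {{StorageReport}} type, and the methods {{enqueueRpc}} and {{drain}} are hypothetical names. It assumes a single consumer drains the queue, so putting all storage reports of one RPC into a single queue item keeps them contiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not the HDFS-10301 patch or NameNode code.
final class AtomicReportQueue {
    /** One storage's part of a block report (illustrative type). */
    static final class StorageReport {
        final long reportId;
        final String storageId;

        StorageReport(long reportId, String storageId) {
            this.reportId = reportId;
            this.storageId = storageId;
        }
    }

    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final List<String> processed = new ArrayList<>();

    /** Enqueue all storage reports carried by one RPC as a single work item. */
    void enqueueRpc(List<StorageReport> reportsInRpc) {
        // Copy so that later changes by the caller cannot affect the queued item.
        final List<StorageReport> snapshot = new ArrayList<>(reportsInRpc);
        queue.add(() -> {
            // Reports from another RPC cannot run between these iterations,
            // because the whole loop is one queue item.
            for (StorageReport r : snapshot) {
                processed.add(r.reportId + ":" + r.storageId);
            }
        });
    }

    /** Run queued items in FIFO order on the calling thread (single consumer). */
    List<String> drain() {
        Runnable task;
        while ((task = queue.poll()) != null) {
            task.run();
        }
        return processed;
    }
}
```

By contrast, enqueuing each storage report as its own item would let items from a retransmitted RPC land between them.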


> BlockReport retransmissions may lead to storages falsely being declared zombie if storage
report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, HDFS-10301.01.patch,
zombieStorageLogs.rtf
>
>
> When NameNode is busy, a DataNode can time out sending a block report. It then sends the
block report again. While processing these two reports at the same time, NameNode can interleave
the processing of storages from different reports. This screws up the blockReportId field, which
makes NameNode think that some storages are zombies. Replicas from zombie storages are immediately
removed, causing missing blocks.
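
The interleaving described above can be reproduced in a simplified model. This is only a sketch and not NameNode code. The class {{ZombieModel}} and its {{process}} method are hypothetical. The model assumes each storage remembers the ID of the last report that touched it. It also assumes that, once the last storage of a report has been processed, any storage with a different ID is treated as a zombie.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the failure; not NameNode code.
final class ZombieModel {
    // storageId -> ID of the last block report that touched this storage
    private final Map<String, Long> lastReportId = new LinkedHashMap<>();

    /**
     * Process one storage of a report. When it is the report's last storage,
     * return the storages whose recorded ID differs from this report's ID.
     */
    List<String> process(long reportId, String storageId, boolean lastStorageInReport) {
        lastReportId.put(storageId, reportId);
        List<String> zombies = new ArrayList<>();
        if (lastStorageInReport) {
            for (Map.Entry<String, Long> e : lastReportId.entrySet()) {
                if (e.getValue().longValue() != reportId) {
                    zombies.add(e.getKey());
                }
            }
        }
        return zombies;
    }
}
```

Suppose report 1 is the original and report 2 the retransmission, each covering storages s1 and s2. Processing them in the order (1, s1), (2, s1), (2, s2), (1, s2) leaves s1 tagged with ID 2 when report 1 finishes. The model then flags s1 as a zombie even though it is healthy. Processing each report contiguously flags nothing.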



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
