hadoop-hdfs-issues mailing list archives

From "Colin P. McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order
Date Sat, 13 Aug 2016 06:17:22 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419823#comment-15419823 ]

Colin P. McCabe commented on HDFS-10301:

I don't think the heartbeat is the right place to handle reconciling the block storages.
One reason is that it adds extra complexity and time to the heartbeat, which happens far
more frequently than an FBR.  We even talked about making the heartbeat lockless -- clearly
you can't do that if you are traversing all the block storages.  Taking the FSN lock is
expensive, and heartbeats are sent quite frequently from each DN -- every few seconds.
Another reason reconciling storages in heartbeats is a bad idea is that if the heartbeat
tells you about a new storage, you won't know what blocks are in it until the FBR arrives.
So the NN may end up assigning a bunch of new blocks to a storage that looks empty but is
actually full.
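
To make that concrete, here is a minimal sketch of what per-heartbeat reconciliation would
look like -- plain illustrative Java, not the actual NameNode code; HeartbeatCostSketch,
reconcileStorage, and the lock field are hypothetical stand-ins for the real FSNamesystem
structures.  The intervals in the comments are the stock defaults (dfs.heartbeat.interval
= 3 seconds, dfs.blockreport.intervalMsec = 6 hours):

{code:java}
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: shows why storage reconciliation belongs in FBR
// processing, not on the heartbeat path.
class HeartbeatCostSketch {
  // Stand-in for the FSNamesystem lock that serializes namespace updates.
  private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

  // Heartbeats arrive every ~3 seconds per DN; FBRs roughly every 6 hours.
  // Any work added here runs orders of magnitude more often than FBR work.
  void handleHeartbeat(Map<String, long[]> storageReports) {
    fsnLock.writeLock().lock();  // reconciliation forces taking the lock,
    try {                        // which rules out a lockless heartbeat
      for (Map.Entry<String, long[]> e : storageReports.entrySet()) {
        reconcileStorage(e.getKey(), e.getValue());
      }
    } finally {
      fsnLock.writeLock().unlock();
    }
  }

  private void reconcileStorage(String storageId, long[] usageStats) {
    // A storage first seen via heartbeat carries no block list.  Until the
    // next FBR the NN has no idea what it holds, so it looks empty and can
    // attract new block allocations even if it is actually full.
  }
}
{code}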

I came up with what I believe is the correct patch to fix this problem months ago.  It's here
as https://issues.apache.org/jira/secure/attachment/12805931/HDFS-10301.005.patch .  It doesn't
modify any RPCs or add any new mechanisms.  Instead, it just fixes the obvious bug in the
HDFS-7960 logic.  The only counter-argument to applying patch 005 that anyone ever came up
with is that it doesn't eliminate zombies when FBRs get interleaved.  But this is not a good
counter-argument, since FBR interleaving is extremely rare in well-run clusters.  The proof
should be obvious -- if FBR interleaving happened on more clusters, more people would hit
this serious data loss bug.
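
For anyone without the HDFS-7960 context, here is a minimal sketch of the failure mode,
following the simplified model in the issue description below.  ZombieInterleavingSketch
and its methods are illustrative names, not the real BlockManager code, which stamps a
per-storage lastBlockReportId in a similar fashion:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the HDFS-7960 zombie-storage pruning and how two
// interleaved copies of the same FBR defeat it.  Not the real NameNode code.
class ZombieInterleavingSketch {
  // Last block-report ID stamped on each storage as its report is processed.
  private final Map<String, Long> lastReportId = new HashMap<>();

  void processStorageReport(String storageId, long blockReportId) {
    lastReportId.put(storageId, blockReportId);
  }

  // After a report's final storage, prune "zombies": storages whose stamp
  // does not match the current report ID.  Pruning deletes their replicas.
  void removeZombieStorages(long currentReportId) {
    lastReportId.entrySet().removeIf(e -> e.getValue() != currentReportId);
  }

  public static void main(String[] args) {
    ZombieInterleavingSketch nn = new ZombieInterleavingSketch();
    // The DN times out and retransmits, so two copies of the same FBR
    // (IDs 1 and 2) are in flight and their storages interleave:
    nn.processStorageReport("s1", 1);  // report 1 stamps s1
    nn.processStorageReport("s1", 2);  // report 2 overtakes and restamps s1
    nn.processStorageReport("s2", 1);  // report 1 finishes on s2...
    nn.removeZombieStorages(1);        // ...and its zombie pass runs:
    // s1 is stamped 2 != 1, so a live, full storage is pruned and all of
    // its replicas are removed -> missing blocks.
    System.out.println("surviving storages: " + nn.lastReportId.keySet());
    // prints: surviving storages: [s2]
  }
}
{code}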

This JIRA has been extremely frustrating.  It seems like most, if not all, of the points that
I brought up in my reviews were ignored.  I talked about the obvious compatibility problems
with [~shv]'s solution and even explicitly asked him to test the upgrade case.  I told him
that this JIRA was a bad one to give to a promising new contributor such as [~redvine], because
it required a lot of context and was extremely tricky.  Both [~andrew.wang] and I commented
that overloading BlockListAsLongs was confusing and unnecessary.  The patch confused "not
modifying the .proto file" with "not modifying the RPC content," which are two very separate
concepts, as I commented over and over.  Clearly those comments were ignored.  If anything,
I think [~shv] got very lucky that the bug manifested itself quickly rather than creating
a serious data loss situation a few months down the road, like the one I had to debug when
fixing HDFS-7960.

Again, I would urge you to just commit patch 005.  Or at least evaluate it.

> BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
> When the NameNode is busy, a DataNode can time out sending a block report, so it sends the block report again. The NameNode, processing these two reports at the same time, can interleave the processing of storages from different reports. This screws up the blockReportId field, which makes the NameNode think that some storages are zombies. Replicas from zombie storages are immediately removed, causing missing blocks.
