hadoop-hdfs-issues mailing list archives

From "Walter Su (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
Date Wed, 09 Mar 2016 11:35:40 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186974#comment-15186974
] 

Walter Su commented on HDFS-9822:
---------------------------------

bq. I am still a little confused how this error happens.
Me too. I don't think we have found the real cause.
bq. But if there are same block group entry exists in different queue..
No two queues can hold the same block group (BG); the update(..) logic is correct.
Nor can a single queue hold the same item twice, because each queue is a HashSet.

My pure guess is that it's caused by a race condition. We have a guard at
{code}
//  BlockManager#scheduleReconstruction(..)
    if (block.isStriped()) {
      if (pendingNum > 0) {
        // Wait the previous reconstruction to finish.
        return null;
      }
{code}
which runs inside the namesystem lock. But the {{ReplicationMonitor}} thread releases the lock before it
reaches {{validateReconstructionWork(..)}}, so the JUnit thread can acquire the lock in between. If both
threads pass the guard, one of them will eventually fail the assert.
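
A minimal sketch of this hypothesis, not the actual BlockManager code. The class {{GuardRaceSketch}} and its methods {{passGuard()}}, {{validate()}} and {{interleave()}} are hypothetical names, and it assumes that validation both re-checks the pending count and increments it. The two "threads" are interleaved by hand on a single thread so the outcome is deterministic: both pass the guard while the count is still zero, and the second one to validate then sees a non-zero count.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

public class GuardRaceSketch {
    // Stand-in for the namesystem lock (assumption: a simple exclusive lock).
    private final ReentrantLock namesystemLock = new ReentrantLock();
    // Stand-in for the pending reconstruction count of one striped block.
    private int pendingNum = 0;

    /** Step 1: the guard, evaluated under the lock. True means "go ahead". */
    boolean passGuard() {
        namesystemLock.lock();
        try {
            return pendingNum == 0;
        } finally {
            namesystemLock.unlock(); // the lock is released between the two steps
        }
    }

    /**
     * Step 2: validation under a re-acquired lock. Returns false where the
     * real code's assert would fire (assumption: it checks pendingNum again
     * and then records the new pending work).
     */
    boolean validate() {
        namesystemLock.lock();
        try {
            if (pendingNum > 0) {
                return false;
            }
            pendingNum++;
            return true;
        } finally {
            namesystemLock.unlock();
        }
    }

    /** Replays the suspected interleaving of the monitor and test threads. */
    static boolean[] interleave() {
        GuardRaceSketch s = new GuardRaceSketch();
        boolean monitorPassedGuard = s.passGuard(); // monitor thread, step 1
        boolean testPassedGuard = s.passGuard();    // test thread, step 1
        boolean monitorValidated = s.validate();    // monitor thread, step 2
        boolean testValidated = s.validate();       // test thread, step 2
        return new boolean[] {
            monitorPassedGuard, testPassedGuard, monitorValidated, testValidated
        };
    }

    public static void main(String[] args) {
        // prints [true, true, true, false]
        System.out.println(Arrays.toString(interleave()));
    }
}
```

Which of the two threads fails depends only on which one validates second. Holding the lock across both steps, or re-checking the guard after re-acquiring it, would close the window in this sketch.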

> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9822
>                 URL: https://issues.apache.org/jira/browse/HDFS-9822
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Rakesh R
>         Attachments: HDFS-9822-001.patch, HDFS-9822-002.patch
>
>
> Found the following AssertionError in https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
> 	at java.lang.Thread.run(Thread.java:745)
> 	at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
> 	at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
