hadoop-hdfs-issues mailing list archives

From "Rakesh R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9435) TestBlockRecovery#testRBWReplicas is failing intermittently
Date Tue, 17 Nov 2015 16:17:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008932#comment-15008932 ]

Rakesh R commented on HDFS-9435:
--------------------------------

It looks like there is a race between the waiting period in {{triggerHeartbeatForTests()}} (shown below) and the {{scheduleNextHeartbeat()}} call made from BPServiceActor#offerService().
{code}
  void triggerHeartbeatForTests() {
    synchronized (pendingIncrementalBRperStorage) {
      final long nextHeartbeatTime = scheduler.scheduleHeartbeat();
      pendingIncrementalBRperStorage.notifyAll();
      while (nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0) {
        try {
          pendingIncrementalBRperStorage.wait(100);
        } catch (InterruptedException e) {
          return;
        }
      }
    }
  }
{code}

Execution sequence that results in the test case failure:

1=> During startup, the test calls {{dn.getAllBpOs().get(0).triggerHeartbeatForTests()}}, which initializes {{final long nextHeartbeatTime = scheduler.scheduleHeartbeat();}}
2=> BPServiceActor#offerService()
3=> BPServiceActor#sendHeartBeat()
4=> BPServiceActor.scheduler.scheduleNextHeartbeat()
5=> Now the loop condition {{nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0}} immediately stops holding, so #triggerHeartbeatForTests() ends its waiting period and the unit test starts running.
6=> During the test, {{BlockRecoveryWorker#getActiveNamenodeForBP()}} is called, sees a null ActiveNN, and throws an exception, because BPServiceActor#offerService() is still executing and has not yet updated the ActiveNN.
{code}
    DatanodeProtocolClientSideTranslatorPB activeNN = bpos.getActiveNN();
    if (activeNN == null) {
      throw new IOException(
          "Block pool " + bpid + " has not recognized an active NN");
    }
{code}
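
The ordering above can be reproduced outside Hadoop with a small standalone sketch. Everything in it is a hypothetical stand-in rather than HDFS code: {{HeartbeatRaceSketch}}, {{observeActiveNN}}, the {{"nn1"}} value, and the 3000 ms interval are made up for illustration. A latch pins the "actor" thread between rescheduling the heartbeat and publishing the active NN, so the window described in steps 4 to 6 is hit every time. The {{publishActiveNNFirst}} flag only contrasts the two orderings; it is not meant to represent the actual fix for this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class HeartbeatRaceSketch {

  /**
   * Runs one simulated startup and returns the active NN value that the
   * "test" thread sees right after its wait loop exits.
   */
  static String observeActiveNN(boolean publishActiveNNFirst)
      throws InterruptedException {
    final Object lock = new Object();
    // Stand-in for scheduler.nextHeartbeatTime; only accessed while holding lock.
    final long[] nextHeartbeatTime = {0L};
    // Stand-in for the active NN reference; null until the actor sets it.
    final AtomicReference<String> activeNN = new AtomicReference<>();
    // Keeps the actor "still inside offerService()" until the test has looked.
    final CountDownLatch testHasLooked = new CountDownLatch(1);

    Thread actor = new Thread(() -> {
      try {
        if (publishActiveNNFirst) {
          activeNN.set("nn1");
        }
        synchronized (lock) {
          // Step 4: reschedule the heartbeat and wake up the waiting test.
          nextHeartbeatTime[0] = 3000L;
          lock.notifyAll();
        }
        // Step 6 window: the actor has not yet published the active NN.
        testHasLooked.await();
        activeNN.set("nn1");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    synchronized (lock) {
      // Step 1: remember the heartbeat time that was requested.
      final long requested = nextHeartbeatTime[0];
      actor.start();
      // Same shape as the wait loop in triggerHeartbeatForTests().
      while (requested - nextHeartbeatTime[0] >= 0) {
        lock.wait(100);
      }
    }

    // Step 5: the wait is over, so the "test" proceeds and reads the active NN.
    String seen = activeNN.get();
    testHasLooked.countDown();
    actor.join();
    return seen;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("reschedule first: activeNN = " + observeActiveNN(false));
    System.out.println("publish first:    activeNN = " + observeActiveNN(true));
  }
}
```

With {{observeActiveNN(false)}} the wait loop exits as soon as the heartbeat is rescheduled, so the caller reads the active NN before it has been set, which corresponds to the "has not recognized an active NN" exception above.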


> TestBlockRecovery#testRBWReplicas is failing intermittently
> -----------------------------------------------------------
>
>                 Key: HDFS-9435
>                 URL: https://issues.apache.org/jira/browse/HDFS-9435
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: testRBWReplicas.log
>
>
> TestBlockRecovery#testRBWReplicas is failing in the [build 13536|https://builds.apache.org/job/PreCommit-HDFS-Build/13536/testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockRecovery/testRBWReplicas/]. It looks like a bug in the tests due to a race condition.
> Note: Logs taken from the build are attached to this JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
