hadoop-hdfs-issues mailing list archives

From "Ivan Kelly (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock
Date Fri, 25 May 2012 15:39:23 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283547#comment-13283547
] 

Ivan Kelly commented on HDFS-3452:
----------------------------------

Patch looks good, Uma. A few comments:

# The version in the data should be a format version, in case we wish to change the data format in the future, not the znode version.
# The creation of the inprogress znode should catch a NodeExistsException, for the case of two nodes starting at once.
# In the javadoc, @update should be #update.
# "Already inprogress node exists" -> "Inprogress node already exists"
# TestBookKeeperJournalManager#testAllBookieFailure: you need to add bkjm.recoverUnfinalizedSegments() before the failing startLogSegment.
# TestBookKeeperAsHASharedDir#testMultiplePrimariesStarted: this needs to be changed. The fix is simple, though: now that the locking has changed, it's the NN that was previously active which dies, not the new one trying to start. Code below.

{code}
  @Test
  public void testMultiplePrimariesStarted() throws Exception {
    Runtime mockRuntime1 = mock(Runtime.class);
    Runtime mockRuntime2 = mock(Runtime.class);
    Path p1 = new Path("/testBKJMMultiplePrimary");
    Path p2 = new Path("/testBKJMMultiplePrimary2");

    MiniDFSCluster cluster = null;
    try {
      Configuration conf = new Configuration();
      conf.setInt(DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_KEY, 1);
      conf.set(DFSConfigKeys.DFS_NAMENODE_SHARED_EDITS_DIR_KEY,
               BKJMUtil.createJournalURI("/hotfailoverMultiple").toString());
      BKJMUtil.addJournalManagerDefinition(conf);

      cluster = new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .numDataNodes(0)
        .manageNameDfsSharedDirs(false)
        .build();
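      NameNode nn1 = cluster.getNameNode(0);
      NameNode nn2 = cluster.getNameNode(1);
      // Mock the runtimes so that an exit can be verified without killing the JVM.
      FSEditLogTestUtil.setRuntimeForEditLog(nn1, mockRuntime1);
      FSEditLogTestUtil.setRuntimeForEditLog(nn2, mockRuntime2);
      cluster.waitActive();
      cluster.transitionToActive(0);

      FileSystem fs = HATestUtil.configureFailoverFs(cluster, conf);
      fs.mkdirs(p1);
      nn1.getRpcServer().rollEditLog();
      // Start a second primary while nn1 is still running.
      cluster.transitionToActive(1);

      // At this point nn1 should not have exited.
      verify(mockRuntime1, times(0)).exit(anyInt());
      fs.mkdirs(p2);

      // After the next edit, the previously active nn1 should have exited,
      // while the newly active nn2 stays up.
      verify(mockRuntime1, atLeastOnce()).exit(anyInt());
      verify(mockRuntime2, times(0)).exit(anyInt());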

    } finally {
      if (cluster != null) {
        cluster.shutdown();
      }
    }
  }
{code}
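
To illustrate comments 1 and 2, here is a rough sketch only, not code from the patch. It assumes a hypothetical "formatVersion;ledgerId;firstTxId" payload layout, and it uses an in-memory map as a stand-in for ZooKeeper so that it runs on its own. The class names, the payload layout and the local NodeExistsException are all made up for illustration; the real code would go through ZooKeeper's create call and KeeperException.NodeExistsException.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical znode payload that carries its own format version. */
final class VersionedLedgerData {
    /** Assumed layout version; bump it when the payload layout changes. */
    static final int CURRENT_FORMAT_VERSION = 1;

    final long ledgerId;
    final long firstTxId;

    VersionedLedgerData(long ledgerId, long firstTxId) {
        this.ledgerId = ledgerId;
        this.firstTxId = firstTxId;
    }

    /** Encodes as "formatVersion;ledgerId;firstTxId" (assumed layout). */
    byte[] serialize() {
        String s = CURRENT_FORMAT_VERSION + ";" + ledgerId + ";" + firstTxId;
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Rejects payloads written in a format this reader does not know. */
    static VersionedLedgerData parse(byte[] data) {
        String[] parts = new String(data, StandardCharsets.UTF_8).split(";");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed payload");
        }
        int version = Integer.parseInt(parts[0]);
        if (version != CURRENT_FORMAT_VERSION) {
            throw new IllegalArgumentException(
                "Unsupported format version " + version);
        }
        return new VersionedLedgerData(
            Long.parseLong(parts[1]), Long.parseLong(parts[2]));
    }
}

/** In-memory stand-in for the znode tree, for illustration only. */
final class InprogressNodeSketch {
    /** Local stand-in for ZooKeeper's KeeperException.NodeExistsException. */
    static final class NodeExistsException extends RuntimeException {
        NodeExistsException(String path) {
            super("Node exists: " + path);
        }
    }

    private final ConcurrentMap<String, byte[]> znodes =
        new ConcurrentHashMap<>();

    /** Mimics a create that fails if the path is already present. */
    void create(String path, byte[] data) {
        if (znodes.putIfAbsent(path, data) != null) {
            throw new NodeExistsException(path);
        }
    }

    /**
     * Returns true if this caller created the inprogress node, false if
     * another node won the race and the node already exists.
     */
    boolean tryCreateInprogress(String path, byte[] data) {
        try {
            create(path, data);
            return true;
        } catch (NodeExistsException e) {
            // Lost the race: report it cleanly instead of failing obscurely.
            return false;
        }
    }
}
```

A caller that gets false back from tryCreateInprogress could then raise an IOException with the "Inprogress node already exists" message from comment 4.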

Other than that, I think this is ready to go. Good work :)
                
> BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3452
>                 URL: https://issues.apache.org/jira/browse/HDFS-3452
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: suja s
>            Assignee: Uma Maheswara Rao G
>            Priority: Blocker
>         Attachments: BK-253-BKJM.patch, HDFS-3452.patch, HDFS-3452.patch
>
>
> Normal switch fails. 
> (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 5000. By the time control comes to acquire lock the previous lock is not released which leads to failure in lock acquisition by NN and NN gets shutdown. Ideally it should have been done)
> =============================================================================
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: Failed to acquire lock with /ledgers/lock/lock-0000000007, lock-0000000006 already has it
> 2012-05-09 20:15:29,732 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec, stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down


        
