hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock
Date Tue, 29 May 2012 12:39:23 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284765#comment-13284765

Uma Maheswara Rao G commented on HDFS-3452:

Hi Ivan, Thanks a lot for taking a look.

Just a clarification.
 Is this the change you are suggesting for #1?
before clearing we should have the check because, at the same time other node may come and
create his node. But this current node will clear the inprogress path and change the version
number. So, other node may fail further. Is this the point you are pointing here?

       zkc.delete(inprogressPath, inprogressStat.getVersion());
+      String inprogressPathFromCI = ci.read();
+      if (inprogressPathFromCI.equals(inprogressPath)) {
+        ci.clear();
+      }
     } catch (KeeperException e) {
       throw new IOException("Error finalising ledger", e);
     } catch (InterruptedException ie) {
       throw new IOException("Error finalising ledger", ie);
-    } finally {
-      wl.release();
-    }
+    } 

for #3, yes I agree, we can have the version check. Also thought to have seperate version
number, again I felt it may not be good to maintain version numbers for each node. We may
can update single version number if any changes for any node. ofcource version checks still
can maintain like in NN for all edit log OPs. 
Anyway I will create seperate version number for CurrentInprogress node and will have the
version check. 

could you please clarify above #1.
> BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing
of lock
> -----------------------------------------------------------------------------------------------
>                 Key: HDFS-3452
>                 URL: https://issues.apache.org/jira/browse/HDFS-3452
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0-alpha
>            Reporter: suja s
>            Assignee: Uma Maheswara Rao G
>            Priority: Blocker
>         Attachments: BK-253-BKJM.patch, HDFS-3452-1.patch, HDFS-3452.patch, HDFS-3452.patch
> Normal switch fails. 
> (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 5000. By the
time control comes to acquire lock the previous lock is not released which leads to failure
in lock acquisition by NN and NN gets shutdown. Ideally it should have been done)
> =============================================================================
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: Failed to
acquire lock with /ledgers/lock/lock-0000000007, lock-0000000006 already has it
> 2012-05-09 20:15:29,732 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error:
recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message