Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Mon, 7 Oct 2013 06:32:43 +0000 (UTC)
From: "Vinay (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12672062.1380757922235.24279.1381127563190@arcas>
In-Reply-To: <JIRA.12672062.1380757922235@arcas>
References: <JIRA.12672062.1380757922235@arcas>
Subject: [jira] [Commented] (HDFS-5291) Standby namenode after transition to
 active goes into safemode
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787946#comment-13787946 ] 

Vinay commented on HDFS-5291:
-----------------------------

{quote}We need the following:
1. Standby when throws safemode exception, the client fails over.
2. Active when throws safemode exception, the client must retry.{quote}
I am not seeing any chance of standby throwing SafeModeException except following 2 places with *dfs.ha.allow.stale.reads* enabled.
{{getBlockLocations()}} and {{getAdditionalDatanode()}}. 

Hi [~arpitagarwal], in your cluster *dfs.ha.allow.stale.reads* was enabled?

> Standby namenode after transition to active goes into safemode
> --------------------------------------------------------------
>
>                 Key: HDFS-5291
>                 URL: https://issues.apache.org/jira/browse/HDFS-5291
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.1.1-beta
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>            Priority: Critical
>         Attachments: HDFS-5291.000.patch, HDFS-5291.001.patch, nn.log
>
>
> Some log snippets
> standby state to active transition
> {code}
> 2013-10-02 00:13:49,482 INFO  ipc.Server (Server.java:run(2068)) - IPC Server handler 69 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from IP:33911 Call#1483 Retry#1: error: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby
> 2013-10-02 00:13:49,689 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth successful for nn/hostname@EXAMPLE.COM (auth:SIMPLE)
> 2013-10-02 00:13:49,696 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful for nn/hostname@EXAMPLE.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
> 2013-10-02 00:13:49,700 INFO  namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1013)) - Stopping services started for standby state
> 2013-10-02 00:13:49,701 WARN  ha.EditLogTailer (EditLogTailer.java:doWork(336)) - Edit log tailer interrupted
> java.lang.InterruptedException: sleep interrupted
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:356)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1463)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:454)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTail
> 2013-10-02 00:13:49,704 INFO  namenode.FSNamesystem (FSNamesystem.java:startActiveServices(885)) - Starting services required for active state
> 2013-10-02 00:13:49,719 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(419)) - Starting recovery process for unclosed journal segments...
> 2013-10-02 00:13:49,755 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth successful for hbase/hostname@EXAMPLE.COM (auth:SIMPLE)
> 2013-10-02 00:13:49,761 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful for hbase/hostname@EXAMPLE.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
> 2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(421)) - Successfully started new epoch 85
> 2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(249)) - Beginning recovery of unclosed segment starting at txid 887112
> 2013-10-02 00:13:49,874 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(258)) - Recovery prepare phase complete. Responses:
> IP:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
> 172.18.145.97:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
> 2013-10-02 00:13:49,875 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recover
> {code}
> And then we get into safemode
> {code}
> Construction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,277 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: IP:1019 is added to blk_IP157{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[IP:1019|RBW], ReplicaUnderConstruction[172.18.145.96:1019|RBW], ReplicaUnde
> rConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,279 INFO  hdfs.StateChange (FSNamesystem.java:reportStatus(4703)) - STATE* Safe mode ON.
> The reported blocks 1071 needs additional 5 blocks to reach the threshold 1.0000 of total blocks 1075.
> Safe mode will be turned off automatically
> 2013-10-02 00:13:50,279 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: IP:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,280 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: 172.18.145.99:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,281 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: 172.18.145.97:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> {code}


--
This message was sent by Atlassian JIRA
(v6.1#6144)