Date: Sat, 18 Aug 2012 06:37:38 +1100 (NCT)
From: "Uma Maheswara Rao G (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Message-ID: <803704605.24923.1345232258416.JavaMail.jiratomcat@arcas>
In-Reply-To: <1771424424.19811.1344307382647.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Commented] (HDFS-3769) standby namenode become active fails because starting log segment fail on shared storage

    [ https://issues.apache.org/jira/browse/HDFS-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437002#comment-13437002 ]

Uma Maheswara Rao G commented on HDFS-3769:
-------------------------------------------

Hi Liaowenrui,

We have already fixed this problem on the BookKeeper side; is it the same problem you are describing here? The corner case was: when the bookies are unavailable, BK recovery fails on a single attempt and the ledger gets closed with lastConfirmed = -1. When the other node then becomes active, it cleans up this ledger because lastConfirmed is -1, and the entries it contained are lost. If that is your case, it is already handled on the BK side. Please confirm; if it is the same issue, we can close this JIRA.
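To make the corner case concrete, here is a minimal sketch of the distinction involved, using the standard BookKeeper client API; the class and method names are illustrative only, not the actual BookKeeperJournalManager code:

{code:java}
import org.apache.bookkeeper.client.BKException;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

// Illustrative sketch only -- not the actual BookKeeperJournalManager code.
// The point: lastAddConfirmed == -1 only proves "empty ledger" if ledger
// recovery actually succeeded. Trusting a -1 left behind by a *failed*
// recovery attempt makes the new active delete a ledger that still holds
// edits, which is the entry loss described above.
class InProgressLedgerCheck {
  private final BookKeeper bk;
  private final byte[] digestPasswd;

  InProgressLedgerCheck(BookKeeper bk, byte[] digestPasswd) {
    this.bk = bk;
    this.digestPasswd = digestPasswd;
  }

  /** Returns true only if the ledger is provably empty. */
  boolean safeToDiscard(long ledgerId) throws InterruptedException {
    try {
      // openLedger() fences the ledger and runs recovery; it only returns
      // a handle once recovery has succeeded.
      LedgerHandle lh = bk.openLedger(ledgerId, DigestType.MAC, digestPasswd);
      long lastAddConfirmed = lh.getLastAddConfirmed();
      lh.close();
      // -1 after a *successful* recovery means no entry was ever
      // acknowledged: the ledger is genuinely empty and safe to discard.
      return lastAddConfirmed == -1;
    } catch (BKException e) {
      // Recovery failed (e.g. bookies unavailable). We must not treat the
      // ledger as empty here; the buggy single-attempt path effectively did,
      // closing it with -1 and letting the new active clean it up.
      return false;
    }
  }
}
{code}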
> standby namenode become active fails because starting log segment fail on shared storage
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-3769
>                 URL: https://issues.apache.org/jira/browse/HDFS-3769
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.0.0-alpha
>         Environment: 3 datanodes: 158.1.132.18, 158.1.132.19, 160.161.0.143
>                      2 namenodes: 158.1.131.18, 158.1.132.19
>                      3 zk: 158.1.132.18, 158.1.132.19, 160.161.0.143
>                      3 bookkeeper: 158.1.132.18, 158.1.132.19, 160.161.0.143
>                      ensemble-size: 2, quorum-size: 2
>            Reporter: liaowenrui
>            Priority: Critical
>             Fix For: 2.1.0-alpha, 2.0.1-alpha
>
> 2012-08-06 15:09:46,264 ERROR org.apache.hadoop.contrib.bkjournal.utils.RetryableZookeeper: Node /ledgers/available already exists and this is not a retry
> 2012-08-06 15:09:46,264 INFO org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager: Successfully created bookie available path : /ledgers/available
> 2012-08-06 15:09:46,273 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /opt/namenodeHa/hadoop-2.0.1/hadoop-root/dfs/name/current
> 2012-08-06 15:09:46,277 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest edits from old active before taking over writer role in edits logs.
> 2012-08-06 15:09:46,363 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues...
> 2012-08-06 15:09:46,363 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks = 239
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of over-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing edit logs at txnid 2354
> 2012-08-06 15:09:46,471 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 2354
> 2012-08-06 15:09:46,472 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 2354 failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@4eda1515, stream=null))
> java.io.IOException: We've already seen 2354. A new stream cannot be created with it
>         at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.startLogSegment(BookKeeperJournalManager.java:297)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:86)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$2.apply(JournalSet.java:182)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:319)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:179)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:894)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:268)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:618)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1322)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1230)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:990)
>         at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
>         at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
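For context on the FATAL: the exception comes from a sanity check in BookKeeperJournalManager.startLogSegment (frame at BookKeeperJournalManager.java:297 above). A minimal sketch of the shape of that check follows; the class and field names here are my own, and this is a simplification rather than the exact source:

{code:java}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the guard behind the FATAL above -- not the exact
// BookKeeperJournalManager source. The journal manager remembers the highest
// txid it has ever promised (the real code persists this in a ZooKeeper
// znode so it survives restarts and failovers) and refuses to open a new
// segment at or below it.
class MaxTxIdGuard {
  private final AtomicLong maxSeenTxId = new AtomicLong(-1);

  void startLogSegment(long txId) throws IOException {
    if (txId <= maxSeenTxId.get()) {
      // This is the message in the stack trace above.
      throw new IOException("We've already seen " + txId
          + ". A new stream cannot be created with it");
    }
    maxSeenTxId.set(txId);
    // ... the real code then creates a fresh BookKeeper ledger for the
    // segment starting at txId ...
  }
}
{code}

In the log above, the new active computes "Will take over writing edit logs at txnid 2354", but the shared BK journal's recorded max txid already covers 2354, so the check trips; because BK is configured as a required journal, the failed startLogSegment aborts the transition to active.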