Date: Tue, 23 Dec 2014 01:49:14 +0000 (UTC)
From: "Enis Soztutar (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

     [ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-12743:
----------------------------------
    Fix Version/s: 1.1.0
                   2.0.0
                   1.0.0

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-12743
>                 URL: https://issues.apache.org/jira/browse/HBASE-12743
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 1.0.0, 2.0.0, 1.1.0
>
>
> Master is stuck for two days trying to rejoin cluster after monkey killed and restarted it.
> After retrying to get namespace 350 times, Master goes down:
> {code}
> 2014-12-20 18:43:54,285 INFO  [c2020:16020.activeMasterManager] client.RpcRetryingCaller: Call exception, tries=349, retries=350, started=6885331 ms ago, cancelled=false, msg=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da., hostname=c2023.halxg.cloudera.com,16020,1418988286696, seqNum=6000000190
> 2014-12-20 18:43:54,303 WARN  [c2020:16020.activeMasterManager] master.TableNamespaceManager: Caught exception in initializing namespace table manager
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=350, exceptions:
> Sat Dec 20 16:49:08 PST 2014, RpcRetryingCaller{globalStartTime=1419122948954, pause=100, retries=350}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da. is not online on c2023.halxg.cloudera.com,16020,1418988286696
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2722)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:851)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1695)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30434)
> {code}
> Seems like 2014-12-20 16:49:03,665 INFO  [RS_LOG_REPLAY_OPS-c2021:16020-0] wal.WALSplitter: DistributedLogReplay = true
> Seems easy enough to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)