From: "stack (JIRA)"
To: dev@hbase.apache.org
Date: Thu, 23 Apr 2015 04:31:39 +0000 (UTC)
Subject: [jira] [Created] (HBASE-13535) Regions unassigned

stack created HBASE-13535:
-----------------------------

             Summary: Regions unassigned
                 Key: HBASE-13535
                 URL: https://issues.apache.org/jira/browse/HBASE-13535
             Project: HBase
          Issue Type: Bug
            Reporter: stack

hbase-1.1 will be the first release with DLR (distributed log replay) on by default. I've been running ITBLLs on a cluster trying to find issues with DLR. My first few runs went nicely, but the current run failed complaining that regions are not online, and indeed recovery is stuck and making no progress. On examination it looks to be an assignment issue rather than a DLR issue: a server carrying meta has its meta log replayed first, but we then seem to fail to assign regions once meta is back online. Meantime, my regionserver logs are filling with complaints that regions are not online (we should dampen our logging of a region not being online).
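For anyone trying to reproduce, it is worth first confirming the cluster really is running with DLR. A minimal check is sketched below, assuming the usual hbase.master.distributed.log.replay switch (which the 1.1 code defaults to true, per the above); this is just a convenience snippet, not part of the failing code paths.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Prints whether distributed log replay (DLR) is enabled according to the
// hbase-site.xml on the classpath. Flipping the key to false and restarting
// falls back to classic log splitting, which is handy for comparison runs.
public class CheckDlr {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    boolean dlr = conf.getBoolean("hbase.master.distributed.log.replay", true);
    System.out.println("distributed log replay enabled: " + dlr);
  }
}
{code}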
And the split log workers are stuck:

{code}
Thread 13206 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-2):
  State: TIMED_WAITING
  Blocked count: 45
  Waited count: 59
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
Thread 13205 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-1):
  State: TIMED_WAITING
  Blocked count: 45
  Waited count: 59
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
Thread 13204 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0):
  State: TIMED_WAITING
  Blocked count: 50
  Waited count: 63
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
{code}

...complaining that:

{code}
2015-04-22 21:28:02,746 DEBUG [RS_LOG_REPLAY_OPS-c2021:16020-1] wal.WALSplitter: Used 134248328 bytes of buffered edits, waiting for IO threads...
{code}
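For context, the stack above shows each writer parked in a sleep/probe loop, waiting for its destination region to come back online before it will ship edits. A minimal sketch of that pattern is below; it uses hypothetical names (regionIsOnline, the 5 minute deadline) and is not the actual WALSplitter source.

{code}
import java.util.concurrent.TimeUnit;

// Sketch of the wait-until-region-online pattern the writer threads are stuck in.
// If assignment never happens, the probe never succeeds and the thread sleeps
// until the surrounding recovery task gives up, which matches the dump above.
public class WaitUntilRegionOnlineSketch {
  public static void main(String[] args) throws InterruptedException {
    long deadline = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(5); // stand-in timeout
    long pauseMs = 1000; // back-off between probes

    while (!regionIsOnline("8d63312bc39a39727afea627bb20fee4")) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("region never came online; cannot replay edits");
      }
      Thread.sleep(pauseMs); // the TIMED_WAITING frame in the dump above
      pauseMs = Math.min(pauseMs * 2, 10_000);
    }
    System.out.println("region online; shipping buffered edits");
  }

  // Hypothetical probe; in HBase this is a meta lookup plus an RPC to the hosting server.
  private static boolean regionIsOnline(String encodedRegionName) {
    return false; // in this incident, the region stays OFFLINE and is never assigned
  }
}
{code}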
The accounting seems off around here in SSH, where it is moving regions that were on the dead server to OFFLINE but reporting no regions to assign:

{code}
143320 2015-04-21 17:05:07,571 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Mark regions in recovery for crashed server c2024.halxg.cloudera.com,16020,1429660802192 before assignment; regions=[]
143321 2015-04-21 17:05:07,572 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Adding to processed servers c2024.halxg.cloudera.com,16020,1429660802192
143322 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {8d63312bc39a39727afea627bb20fee4 state=OPEN, ts=1429660996054, server=c2024.halxg.cloudera.com,16020,1429660802192} to {8d63312bc39a39727afea627bb20fee4 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143323 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {64c97bf39441e09977332c02e628a8c2 state=OPEN, ts=1429660996045, server=c2024.halxg.cloudera.com,16020,1429660802192} to {64c97bf39441e09977332c02e628a8c2 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143324 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3f4ea5ea14653cee6006f13c7d06d10b state=OPEN, ts=1429660996066, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3f4ea5ea14653cee6006f13c7d06d10b state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143325 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {6eaf51e55c9c23356a697f286f473db8 state=OPEN, ts=1429660996051, server=c2024.halxg.cloudera.com,16020,1429660802192} to {6eaf51e55c9c23356a697f286f473db8 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143326 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {1cbe63cd29c7709209adbf4305ebc746 state=OPEN, ts=1429660996062, server=c2024.halxg.cloudera.com,16020,1429660802192} to {1cbe63cd29c7709209adbf4305ebc746 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143327 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3a49d17e189b2a26eb73e1d43cf2d0ac state=OPEN, ts=1429660996053, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3a49d17e189b2a26eb73e1d43cf2d0ac state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143328 2015-04-21 17:05:07,576 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Reassigning 0 region(s) that c2024.halxg.cloudera.com,16020,1429660802192 was carrying (and 0 regions(s) that were opening on this server)
{code}

... and indeed these regions are never assigned. I see DLR on occasion timing out like this:

{code}
2015-04-22 21:27:31,327 ERROR [RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0] wal.WALSplitter: Exiting thread ....
Caused by: Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region IntegrationTestBigLinkedList,\xB3333333(,1429660019160.8d63312bc39a39727afea627bb20fee4. is not online on c2024.halxg.cloudera.com,16020,1429661908290
{code}

... which is one of the regions in the OFFLINE-but-never-assigned set. Digging....
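To make the suspected accounting gap concrete: the handler flips every region the dead server carried to OFFLINE, yet the list it reassigns from comes back empty ("regions=[]"), so nothing is ever handed to the assignment manager. The sketch below only illustrates that ordering; all names are hypothetical and this is not the actual ServerShutdownHandler code.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of the suspected mismatch: OPEN -> OFFLINE transitions are driven from
// one view of what the dead server carried, while the reassignment list is built from
// another (empty) snapshot, so regions end up OFFLINE with nobody assigning them.
public class ShutdownAccountingSketch {
  public static void main(String[] args) {
    // Regions the dead server was actually carrying (per the Transition lines above).
    List<String> carried = Arrays.asList(
        "8d63312bc39a39727afea627bb20fee4",
        "64c97bf39441e09977332c02e628a8c2",
        "3f4ea5ea14653cee6006f13c7d06d10b");

    // Snapshot used to build the reassignment list; in the log above it came back empty.
    List<String> toReassign = new ArrayList<>();

    for (String region : carried) {
      System.out.println("Transition " + region + " OPEN -> OFFLINE");
    }
    System.out.println("Reassigning " + toReassign.size() + " region(s)");
    // Net effect: regions sit in OFFLINE, which matches the NotServingRegionException
    // that log replay then trips over.
  }
}
{code}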
-- This message was sent by Atlassian JIRA (v6.3.4#6332)