From: "stack (JIRA)"
To: dev@hbase.apache.org
Date: Thu, 23 Apr 2015 04:31:39 +0000 (UTC)
Subject: [jira] [Created] (HBASE-13535) Regions unassigned

stack created HBASE-13535:
-----------------------------

             Summary: Regions unassigned
                 Key: HBASE-13535
                 URL: https://issues.apache.org/jira/browse/HBASE-13535
             Project: HBase
          Issue Type: Bug
            Reporter: stack

hbase-1.1 will be the first release with DLR (distributed log replay) on by default. I've been running ITBLLs on a cluster trying to find issues with DLR. My first few runs went nicely, but the current run failed complaining that regions are not online, and indeed recovery is stuck and making no progress. On examination it looks to be an assignment issue rather than a DLR issue: a server carrying meta has its meta log replayed first, but we then seem to fail to assign regions once meta is back online. Meantime, my regionserver logs are filling with complaints that regions are not online (we should dampen our logging of a region not being online).
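For anyone trying to reproduce, it is worth first confirming the cluster really is running with DLR. A minimal check is sketched below, assuming the usual hbase.master.distributed.log.replay switch (which the 1.1 code defaults to true, per the above); this is just a convenience snippet, not part of the failing code paths.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Prints whether distributed log replay (DLR) is enabled according to the
// hbase-site.xml on the classpath. Flipping the key to false and restarting
// falls back to classic log splitting, which is handy for comparison runs.
public class CheckDlr {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    boolean dlr = conf.getBoolean("hbase.master.distributed.log.replay", true);
    System.out.println("distributed log replay enabled: " + dlr);
  }
}
{code}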
And the split log workers are stuck:

{code}
Thread 13206 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-2):
  State: TIMED_WAITING
  Blocked count: 45
  Waited count: 59
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
Thread 13205 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-1):
  State: TIMED_WAITING
  Blocked count: 45
  Waited count: 59
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
Thread 13204 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0):
  State: TIMED_WAITING
  Blocked count: 50
  Waited count: 63
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
    org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
    org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
{code}

...complaining that:

{code}
2015-04-22 21:28:02,746 DEBUG [RS_LOG_REPLAY_OPS-c2021:16020-1] wal.WALSplitter: Used 134248328 bytes of buffered edits, waiting for IO threads...
{code}
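For context, the stack above shows each writer parked in a sleep/probe loop, waiting for its destination region to come back online before it will ship edits. A minimal sketch of that pattern is below; it uses hypothetical names (regionIsOnline, the 5 minute deadline) and is not the actual WALSplitter source.

{code}
import java.util.concurrent.TimeUnit;

// Sketch of the wait-until-region-online pattern the writer threads are stuck in.
// If assignment never happens, the probe never succeeds and the thread sleeps
// until the surrounding recovery task gives up, which matches the dump above.
public class WaitUntilRegionOnlineSketch {
  public static void main(String[] args) throws InterruptedException {
    long deadline = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(5); // stand-in timeout
    long pauseMs = 1000; // back-off between probes

    while (!regionIsOnline("8d63312bc39a39727afea627bb20fee4")) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("region never came online; cannot replay edits");
      }
      Thread.sleep(pauseMs); // the TIMED_WAITING frame in the dump above
      pauseMs = Math.min(pauseMs * 2, 10_000);
    }
    System.out.println("region online; shipping buffered edits");
  }

  // Hypothetical probe; in HBase this is a meta lookup plus an RPC to the hosting server.
  private static boolean regionIsOnline(String encodedRegionName) {
    return false; // in this incident, the region stays OFFLINE and is never assigned
  }
}
{code}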
The accounting seems off around here in SSH, where it is moving regions that were on the dead server to OFFLINE but reporting no regions to assign:

{code}
143320 2015-04-21 17:05:07,571 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Mark regions in recovery for crashed server c2024.halxg.cloudera.com,16020,1429660802192 before assignment; regions=[]
143321 2015-04-21 17:05:07,572 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Adding to processed servers c2024.halxg.cloudera.com,16020,1429660802192
143322 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {8d63312bc39a39727afea627bb20fee4 state=OPEN, ts=1429660996054, server=c2024.halxg.cloudera.com,16020,1429660802192} to {8d63312bc39a39727afea627bb20fee4 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143323 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {64c97bf39441e09977332c02e628a8c2 state=OPEN, ts=1429660996045, server=c2024.halxg.cloudera.com,16020,1429660802192} to {64c97bf39441e09977332c02e628a8c2 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143324 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3f4ea5ea14653cee6006f13c7d06d10b state=OPEN, ts=1429660996066, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3f4ea5ea14653cee6006f13c7d06d10b state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143325 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {6eaf51e55c9c23356a697f286f473db8 state=OPEN, ts=1429660996051, server=c2024.halxg.cloudera.com,16020,1429660802192} to {6eaf51e55c9c23356a697f286f473db8 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143326 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {1cbe63cd29c7709209adbf4305ebc746 state=OPEN, ts=1429660996062, server=c2024.halxg.cloudera.com,16020,1429660802192} to {1cbe63cd29c7709209adbf4305ebc746 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143327 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3a49d17e189b2a26eb73e1d43cf2d0ac state=OPEN, ts=1429660996053, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3a49d17e189b2a26eb73e1d43cf2d0ac state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
143328 2015-04-21 17:05:07,576 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Reassigning 0 region(s) that c2024.halxg.cloudera.com,16020,1429660802192 was carrying (and 0 regions(s) that were opening on this server)
{code}

... and indeed these regions are never assigned. I see DLR on occasion timing out like this:

{code}
2015-04-22 21:27:31,327 ERROR [RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0] wal.WALSplitter: Exiting thread ....
Caused by: Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region IntegrationTestBigLinkedList,\xB3333333(,1429660019160.8d63312bc39a39727afea627bb20fee4. is not online on c2024.halxg.cloudera.com,16020,1429661908290
{code}

... which is one of the regions in the OFFLINE-but-never-assigned set. Digging....
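To make the suspected accounting gap concrete: the handler flips every region the dead server carried to OFFLINE, yet the list it reassigns from comes back empty ("regions=[]"), so nothing is ever handed to the assignment manager. The sketch below only illustrates that ordering; all names are hypothetical and this is not the actual ServerShutdownHandler code.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of the suspected mismatch: OPEN -> OFFLINE transitions are driven from
// one view of what the dead server carried, while the reassignment list is built from
// another (empty) snapshot, so regions end up OFFLINE with nobody assigning them.
public class ShutdownAccountingSketch {
  public static void main(String[] args) {
    // Regions the dead server was actually carrying (per the Transition lines above).
    List<String> carried = Arrays.asList(
        "8d63312bc39a39727afea627bb20fee4",
        "64c97bf39441e09977332c02e628a8c2",
        "3f4ea5ea14653cee6006f13c7d06d10b");

    // Snapshot used to build the reassignment list; in the log above it came back empty.
    List<String> toReassign = new ArrayList<>();

    for (String region : carried) {
      System.out.println("Transition " + region + " OPEN -> OFFLINE");
    }
    System.out.println("Reassigning " + toReassign.size() + " region(s)");
    // Net effect: regions sit in OFFLINE, which matches the NotServingRegionException
    // that log replay then trips over.
  }
}
{code}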
-- This message was sent by Atlassian JIRA (v6.3.4#6332)