hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-13535) Regions go unassigned when meta-carrying RS is killed
Date Thu, 23 Apr 2015 04:33:38 GMT

     [ https://issues.apache.org/jira/browse/HBASE-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-13535:
--------------------------
    Summary: Regions go unassigned when meta-carrying RS is killed  (was: Regions unassigned)

> Regions go unassigned when meta-carrying RS is killed
> -----------------------------------------------------
>
>                 Key: HBASE-13535
>                 URL: https://issues.apache.org/jira/browse/HBASE-13535
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>
> hbase-1.1 will be the first release with DLR (distributed log replay) on by default. I've been running ITBLLs on a cluster trying to find issues with DLR. My first few runs went nicely... but the current run failed complaining that regions are not online, and indeed recovery is stuck, making no progress.
> Upon examination, it looks to be an assignment issue rather than a DLR issue. A server carrying meta has its meta log replayed first, but we are seemingly failing to assign regions after meta is back online.
> Meantime, my regionserver logs are filling with complaints that regions are not online (we should dampen our logging of regions not being online...), and the split log workers are stuck (a simplified sketch of the wait loop they are parked in follows the thread dump):
> {code}
> Thread 13206 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-2):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13205 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-1):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13204 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0):
>   State: TIMED_WAITING
>   Blocked count: 50
>   Waited count: 63
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> {code}
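> All three writer threads above are parked in the same place: a timed wait for the destination region to come online. Here is a minimal, self-contained sketch of that pattern, just to show why the writers hang when the region is never assigned. This is NOT the WALSplitter source; the method name, timeout, and backoff values are assumptions for illustration only.
> {code}
> // Runnable sketch of the wait pattern the thread dump shows: sleep, re-check whether the
> // region is online, repeat until a deadline. Not actual HBase code.
> public class WaitUntilRegionOnlineSketch {
>   // Hypothetical check; in the real path this is a region-location lookup against meta.
>   static boolean isRegionOnline(String encodedRegionName) {
>     return false; // models the bug: the region is never assigned, so it never comes online
>   }
>
>   static boolean waitUntilRegionOnline(String encodedRegionName, long timeoutMs)
>       throws InterruptedException {
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     long sleepMs = 100;
>     while (System.currentTimeMillis() < deadline) {
>       if (isRegionOnline(encodedRegionName)) {
>         return true;
>       }
>       Thread.sleep(sleepMs);                  // the TIMED_WAITING frame in the dump above
>       sleepMs = Math.min(sleepMs * 2, 5000);  // back off between checks
>     }
>     return false;                             // caller eventually gives up and fails the replay task
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     System.out.println(waitUntilRegionOnline("8d63312bc39a39727afea627bb20fee4", 2000));
>   }
> }
> {code}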
> ...complaining that:
> 2015-04-22 21:28:02,746 DEBUG [RS_LOG_REPLAY_OPS-c2021:16020-1] wal.WALSplitter: Used 134248328 bytes of buffered edits, waiting for IO threads...
> The accounting seems off around here in SSH (ServerShutdownHandler), where it moves the regions that were on the dead server to OFFLINE but reports no regions to assign (a toy illustration of that mismatch follows the log excerpt):
> {code}
> 143320 2015-04-21 17:05:07,571 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Mark regions in recovery for crashed server c2024.halxg.cloudera.com,16020,1429660802192 before assignment; regions=[]
> 143321 2015-04-21 17:05:07,572 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Adding to processed servers c2024.halxg.cloudera.com,16020,1429660802192
> 143322 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {8d63312bc39a39727afea627bb20fee4 state=OPEN, ts=1429660996054, server=c2024.halxg.cloudera.com,16020,1429660802192} to {8d63312bc39a39727afea627bb20fee4 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143323 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {64c97bf39441e09977332c02e628a8c2 state=OPEN, ts=1429660996045, server=c2024.halxg.cloudera.com,16020,1429660802192} to {64c97bf39441e09977332c02e628a8c2 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143324 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3f4ea5ea14653cee6006f13c7d06d10b state=OPEN, ts=1429660996066, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3f4ea5ea14653cee6006f13c7d06d10b state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143325 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {6eaf51e55c9c23356a697f286f473db8 state=OPEN, ts=1429660996051, server=c2024.halxg.cloudera.com,16020,1429660802192} to {6eaf51e55c9c23356a697f286f473db8 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143326 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {1cbe63cd29c7709209adbf4305ebc746 state=OPEN, ts=1429660996062, server=c2024.halxg.cloudera.com,16020,1429660802192} to {1cbe63cd29c7709209adbf4305ebc746 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143327 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3a49d17e189b2a26eb73e1d43cf2d0ac state=OPEN, ts=1429660996053, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3a49d17e189b2a26eb73e1d43cf2d0ac state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143328 2015-04-21 17:05:07,576 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Reassigning 0 region(s) that c2024.halxg.cloudera.com,16020,1429660802192 was carrying (and 0 regions(s) that were opening on this server)
> {code}
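> To make the mismatch concrete, here is a toy, self-contained illustration (not HBase code; every name is invented) of how one bookkeeping view can drive the OPEN to OFFLINE transitions while a second, empty view is what gets handed to assignment, producing "Reassigning 0 region(s)" even though regions were clearly transitioned:
> {code}
> import java.util.*;
>
> // Toy illustration only: two views of "what was on the dead server" that disagree.
> public class StaleSnapshotDemo {
>   public static void main(String[] args) {
>     String deadServer = "c2024.halxg.cloudera.com,16020,1429660802192";
>
>     // View 1: the structure consulted when building the list to reassign -- empty here.
>     Map<String, Set<String>> serverHoldings = new HashMap<>();
>
>     // View 2: the structure the transition loop walks -- still knows about the regions.
>     Map<String, String> regionAssignments = new HashMap<>();
>     regionAssignments.put("8d63312bc39a39727afea627bb20fee4", deadServer);
>     regionAssignments.put("64c97bf39441e09977332c02e628a8c2", deadServer);
>
>     Set<String> toAssign = serverHoldings.getOrDefault(deadServer, Collections.emptySet());
>
>     // The transition loop sees regions and moves them OFFLINE...
>     for (Map.Entry<String, String> e : regionAssignments.entrySet()) {
>       if (e.getValue().equals(deadServer)) {
>         System.out.println("Transition " + e.getKey() + " OPEN -> OFFLINE");
>       }
>     }
>     // ...but the assignment step works from the other, empty view.
>     System.out.println("Reassigning " + toAssign.size() + " region(s)");
>   }
> }
> {code}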
> ... and indeed these regions are never assigned. I see DLR on occasion timing out like this....
> 2015-04-22 21:27:31,327 ERROR [RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0] wal.WALSplitter: Exiting thread
> ....
> Caused by:
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region IntegrationTestBigLinkedList,\xB3333333(,1429660019160.8d63312bc39a39727afea627bb20fee4. is not online on c2024.halxg.cloudera.com,16020,1429661908290
> ... which is one of the regions in the set that went OFFLINE but was never assigned.
> Digging....
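> While digging, something like the following against the HBase 1.x client API can be used to confirm what the master still reports as in transition. This is a diagnostic sketch, not part of any fix; I believe Admin#getClusterStatus and ClusterStatus#getRegionsInTransition have these shapes in 1.x, but treat the exact signatures as assumptions.
> {code}
> import java.util.Map;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.ClusterStatus;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Admin;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.master.RegionState;
>
> // Dump the regions the master reports as in transition (diagnostic sketch only).
> public class ListRegionsInTransition {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     try (Connection conn = ConnectionFactory.createConnection(conf);
>          Admin admin = conn.getAdmin()) {
>       ClusterStatus status = admin.getClusterStatus();
>       Map<String, RegionState> rit = status.getRegionsInTransition();
>       System.out.println("Regions in transition: " + rit.size());
>       for (Map.Entry<String, RegionState> e : rit.entrySet()) {
>         System.out.println(e.getKey() + " -> " + e.getValue());
>       }
>     }
>   }
> }
> {code}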



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
