hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Heng Chen <heng.chen.1...@gmail.com>
Subject Re: region stuck in failed close state
Date Mon, 30 May 2016 09:59:35 GMT
I  find something useful.

When we do region.close,  if there is one compaction or flush in progress,
close will wait for compaction or flush be finished.

{code: title=HRegion.java}

@Override
public void waitForFlushesAndCompactions() {
  synchronized (writestate) {
    if (this.writestate.readOnly) {
      // we should not wait for replayed flushed if we are read only
(for example in case the
      // region is a secondary replica).
      return;
    }
    boolean interrupted = false;
    try {
      while (writestate.compacting > 0 || writestate.flushing) {
        LOG.debug("waiting for " + writestate.compacting + " compactions"
          + (writestate.flushing ? " & cache flush" : "") + " to
complete for region " + this);
        try {
          writestate.wait();
        } catch (InterruptedException iex) {
          // essentially ignore and propagate the interrupt back up
          LOG.warn("Interrupted while waiting");
          interrupted = true;
        }
      }
    } finally {
      if (interrupted) {
        Thread.currentThread().interrupt();
      }
    }
  }
}

{code}

And writestate.flushing will be set to be true in two place:

HRegion.flushCache and HRegion.replayWALFlushStartMarker

{code: title=HRegion.flushCache}

synchronized (writestate) {
  if (!writestate.flushing && writestate.writesEnabled) {
    this.writestate.flushing = true;
  } else {
    *....*
  }
}

{code}

{code: title=HRegion.replayWALFlushStartMarker}

synchronized (writestate) {
  try {
    *....*
    if (!writestate.flushing) {

        this.writestate.flushing = true;
        *...*

*     }*

{code}


Notice that,  in HRegion.replayWALFlushStartMarker,  we did not check
writestate.writesEnabled before set writestate.flushing to be true.

So if region.close wake up in writestate.wait, but the lock acquried by
HRegion.replayWALFlushStartMarker,  the flushing will be set to be true
again,  and region.close will stuck in writestate.wait forever.


Will it happen in real logical?


2016-05-27 10:44 GMT+08:00 Heng Chen <heng.chen.1986@gmail.com>:

> Thanks guys,  yesterday i restart relate RS and failed close region reopen
> successfuly.  But today, there is another region fall in this state.
>
> I paste relate RS' jstack information.  This time the failed close region
> is 9368190b3ba46238534b6307702aabae
>
> 2016-05-26 21:50 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
>
>> Heng:
>> Can you pastebin the complete stack trace for the region server ?
>>
>> Snippet from region server log may also provide more clue.
>>
>> Thanks
>>
>> On Wed, May 25, 2016 at 9:48 PM, Heng Chen <heng.chen.1986@gmail.com>
>> wrote:
>>
>> > On master web UI, i could see region (c371fb20c372b8edbf54735409ab5c4a)
>> > always in failed close state,  So balancer could not run.
>> >
>> >
>> > i check the region on RS,  and found logs about this region
>> >
>> > 2016-05-26 12:42:10,490 INFO  [MemStoreFlusher.1]
>> > regionserver.MemStoreFlusher: Waited 90447ms on a compaction to clean up
>> > 'too many store files'; waited long enough... proceeding with flush of
>> >
>> >
>> frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > 2016-05-26 12:42:20,043 INFO
>> >  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
>> > regionserver.HRegionServer:
>> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
>> > requesting flush for region
>> >
>> >
>> frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > after a delay of 20753
>> > 2016-05-26 12:42:30,043 INFO
>> >  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
>> > regionserver.HRegionServer:
>> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
>> > requesting flush for region
>> >
>> >
>> frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > after a delay of 7057
>> >
>> >
>> > relate jstack information like below:
>> >
>> > Thread 12403 (RS_CLOSE_REGION-dx-pipe-regionserver4-online:16020-2):
>> >   State: WAITING
>> >   Blocked count: 1
>> >   Waited count: 2
>> >   Waiting on
>> > org.apache.hadoop.hbase.regionserver.HRegion$WriteState@1390594c
>> >   Stack:
>> >     java.lang.Object.wait(Native Method)
>> >     java.lang.Object.wait(Object.java:502)
>> >
>> >
>> org.apache.hadoop.hbase.regionserver.HRegion.waitForFlushesAndCompactions(HRegion.java:1512)
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1371)
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1336)
>> >
>> >
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >
>> > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >
>> >
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >     java.lang.Thread.run(Thread.java:745)
>> >
>> >
>> > Our HBase cluster version is 1.1.1,   i try to compact this region,
>> > compact stuck in progress 89.58%
>> >
>> >
>> >
>> frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > 85860221 85860221
>> > 89.58%
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message