hbase-user mailing list archives

From Heng Chen <heng.chen.1...@gmail.com>
Subject Re: region stuck in failed close state
Date Tue, 31 May 2016 02:06:29 GMT
@Ted.  The log I pasted is at INFO level; I changed it to DEBUG level
yesterday. If this happens again, I will upload the DEBUG-level log.
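
For reference, the change I mean is the usual log4j one (assuming the stock
HBase log4j setup; the exact logger name here is my assumption, adjust to
your conf/log4j.properties if it differs):

{code:title=conf/log4j.properties}
# raise the regionserver package to DEBUG so the flush/replay
# messages quoted below will actually be logged
log4j.logger.org.apache.hadoop.hbase.regionserver=DEBUG
{code}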

2016-05-30 21:57 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:

> There are debug log statements in HRegion#replayWALFlushStartMarker:
>
>     LOG.debug(getRegionInfo().getEncodedName() + " : "
>         + " Prepared flush with seqId:" + flush.getFlushSequenceNumber());
>
>     ...
>
>     LOG.debug(getRegionInfo().getEncodedName() + " : "
>         + " Prepared empty flush with seqId:" + flush.getFlushSequenceNumber());
>
> I searched for them in the log you attached to HBASE-15900 but didn't find
> any occurrence.
>
> FYI
>
> On Mon, May 30, 2016 at 2:59 AM, Heng Chen <heng.chen.1986@gmail.com>
> wrote:
>
> > I found something useful.
> >
> > When we do region.close, if there is a compaction or flush in progress,
> > close will wait for the compaction or flush to finish:
> >
> > {code:title=HRegion.java}
> > @Override
> > public void waitForFlushesAndCompactions() {
> >   synchronized (writestate) {
> >     if (this.writestate.readOnly) {
> >       // we should not wait for replayed flushed if we are read only (for example in case the
> >       // region is a secondary replica).
> >       return;
> >     }
> >     boolean interrupted = false;
> >     try {
> >       while (writestate.compacting > 0 || writestate.flushing) {
> >         LOG.debug("waiting for " + writestate.compacting + " compactions"
> >           + (writestate.flushing ? " & cache flush" : "") + " to complete for region " + this);
> >         try {
> >           writestate.wait();
> >         } catch (InterruptedException iex) {
> >           // essentially ignore and propagate the interrupt back up
> >           LOG.warn("Interrupted while waiting");
> >           interrupted = true;
> >         }
> >       }
> >     } finally {
> >       if (interrupted) {
> >         Thread.currentThread().interrupt();
> >       }
> >     }
> >   }
> > }
> > {code}
> >
> > And writestate.flushing is set to true in two places:
> > HRegion.flushCache and HRegion.replayWALFlushStartMarker.
> >
> > {code:title=HRegion.flushCache}
> > synchronized (writestate) {
> >   if (!writestate.flushing && writestate.writesEnabled) {
> >     this.writestate.flushing = true;
> >   } else {
> >     // ...
> >   }
> > }
> > {code}
> >
> > {code:title=HRegion.replayWALFlushStartMarker}
> > synchronized (writestate) {
> >   try {
> >     // ...
> >     if (!writestate.flushing) {
> >       this.writestate.flushing = true;
> >       // ...
> >     }
> >     // ...
> > {code}
> >
> > Notice that in HRegion.replayWALFlushStartMarker, we do not check
> > writestate.writesEnabled before setting writestate.flushing to true.
> >
> > So if region.close wakes up from writestate.wait but the lock is acquired
> > first by HRegion.replayWALFlushStartMarker, flushing will be set to true
> > again, and region.close will be stuck in writestate.wait forever.
> >
> > Could this happen in a real scenario?
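> >
> > If it can, here is a minimal sketch of a guard that would avoid it. This
> > is only my own sketch, mirroring the writesEnabled check that
> > HRegion.flushCache already does; it is not a tested patch:
> >
> > {code:title=replayWALFlushStartMarker (sketch)}
> > synchronized (writestate) {
> >   // hypothetical guard: once close() has disabled writes, do not
> >   // re-enter the flushing state from a replayed flush start marker,
> >   // so waitForFlushesAndCompactions() can leave its wait loop
> >   if (!writestate.flushing && writestate.writesEnabled) {
> >     this.writestate.flushing = true;
> >     // ... prepare the replayed flush as before ...
> >   }
> > }
> > {code}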
> >
> >
> > 2016-05-27 10:44 GMT+08:00 Heng Chen <heng.chen.1986@gmail.com>:
> >
> > > Thanks guys. Yesterday I restarted the related RS and the failed-close
> > > region reopened successfully. But today another region has fallen into
> > > this state.
> > >
> > > I pasted the related RS's jstack information. This time the failed-close
> > > region is 9368190b3ba46238534b6307702aabae.
> > >
> > > 2016-05-26 21:50 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
> > >
> > >> Heng:
> > >> Can you pastebin the complete stack trace for the region server?
> > >>
> > >> A snippet from the region server log may also provide more clues.
> > >>
> > >> Thanks
> > >>
> > >> On Wed, May 25, 2016 at 9:48 PM, Heng Chen <heng.chen.1986@gmail.com>
> > >> wrote:
> > >>
> > >> > On the master web UI, I can see that region
> > >> > c371fb20c372b8edbf54735409ab5c4a is always in the failed close state,
> > >> > so the balancer cannot run.
> > >> >
> > >> >
> > >> > I checked the region on the RS and found these logs about it:
> > >> >
> > >> > 2016-05-26 12:42:10,490 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: Waited 90447ms on a compaction to clean up 'too many store files'; waited long enough... proceeding with flush of frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > >> > 2016-05-26 12:42:20,043 INFO  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1] regionserver.HRegionServer: dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore requesting flush for region frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a. after a delay of 20753
> > >> > 2016-05-26 12:42:30,043 INFO  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1] regionserver.HRegionServer: dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore requesting flush for region frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a. after a delay of 7057
> > >> >
> > >> >
> > >> > The related jstack information is below:
> > >> >
> > >> > Thread 12403 (RS_CLOSE_REGION-dx-pipe-regionserver4-online:16020-2):
> > >> >   State: WAITING
> > >> >   Blocked count: 1
> > >> >   Waited count: 2
> > >> >   Waiting on org.apache.hadoop.hbase.regionserver.HRegion$WriteState@1390594c
> > >> >   Stack:
> > >> >     java.lang.Object.wait(Native Method)
> > >> >     java.lang.Object.wait(Object.java:502)
> > >> >     org.apache.hadoop.hbase.regionserver.HRegion.waitForFlushesAndCompactions(HRegion.java:1512)
> > >> >     org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1371)
> > >> >     org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1336)
> > >> >     org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> > >> >     org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> > >> >     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > >> >     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > >> >     java.lang.Thread.run(Thread.java:745)
> > >> >
> > >> >
> > >> > Our HBase cluster version is 1.1.1. I tried to compact this region,
> > >> > but the compaction is stuck at 89.58%:
> > >> >
> > >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > >> > 85860221 85860221 89.58%
> > >> >
> > >>
> > >
> > >
> >
>
