hbase-issues mailing list archives

From "Vikas Vishwakarma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13592) RegionServer sometimes gets stuck during shutdown in case of cache flush failures
Date Wed, 29 Apr 2015 09:21:06 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519016#comment-14519016 ]

Vikas Vishwakarma commented on HBASE-13592:
-------------------------------------------

I tried a patch with the changes suggested by [~lhofhansl], in which wal.sync is also moved
inside the try/catch block of the cache flush. With this patch, all the affected RegionServers
shut down successfully and none of them went into a hung state. The RegionServers that were
not shut down are fully operational and working fine.

Current implementation in HRegion.java
{noformat}
protected FlushResult internalFlushcache(
...
try {
..
this.updatesLock.writeLock().lock();
..
if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {    // <--- this will do DrainBarrier.beginOp()
..
} finally {
  this.updatesLock.writeLock().unlock();
}
...
if (wal != null && !shouldSyncLog()) {
  // <----- this is currently outside the try/catch block for the flush cache below
  // and is moved inside the try/catch block in the submitted patch
  wal.sync();
}
mvcc.waitForRead(w);
...
try {
...
flush cache code
...

} catch (Throwable t) {
    if (wal != null) {
      wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());  // <--- this will do DrainBarrier.endOp()
    }
}
......
// If we get to here, the HStores have been written.
if (wal != null) {
 wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());  // <--- this will do DrainBarrier.endOp()
}

{noformat}

The patch submitted contains the following changes

{noformat}
protected FlushResult internalFlushcache(
...
try {
..
this.updatesLock.writeLock().lock();
..
if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {    // <--- this will do DrainBarrier.beginOp()
..
} finally {
  this.updatesLock.writeLock().unlock();
}
...
try {
if (wal != null && !shouldSyncLog()) {
  // <----- included in the flush cache try/catch block; any exception here will also call
  // abortCacheFlush in the catch block, which decrements the op count in DrainBarrier
  wal.sync();
}
mvcc.waitForRead(w);
...
flush cache code
...

} catch (Throwable t) {
    if (wal != null) {
      wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());  // <--- this will do DrainBarrier.endOp()
    }
}
......
// If we get to here, the HStores have been written.
if (wal != null) {
 wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());  // <--- this will do DrainBarrier.endOp()
}

{noformat}
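
To make the effect of moving wal.sync easier to see in isolation, here is a minimal standalone sketch. It is not HBase code: OpCounter is a hypothetical stand-in for DrainBarrier that only counts operations in flight, failingSync is a hypothetical sync that always throws, and wrapping the failure in an IOException inside the catch block is an assumption made only for this sketch. It runs several failed flush attempts with the sync either outside the guarded block (as in the current code) or inside it (as in the patch) and reports how many operations are left pending.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class FlushBarrierSketch {

    /** Hypothetical stand-in for DrainBarrier: only counts operations in flight. */
    static final class OpCounter {
        private final AtomicInteger pending = new AtomicInteger();
        void beginOp() { pending.incrementAndGet(); }
        void endOp() { pending.decrementAndGet(); }
        int pending() { return pending.get(); }
    }

    /** Hypothetical WAL sync that always fails, to simulate the reported scenario. */
    static void failingSync() throws IOException {
        throw new IOException("simulated WAL sync failure");
    }

    /** One flush attempt, shaped like the listings above. */
    static void flushOnce(OpCounter barrier, boolean syncInsideTry) throws IOException {
        barrier.beginOp();                 // corresponds to wal.startCacheFlush(...)
        if (!syncInsideTry) {
            failingSync();                 // current code: exception escapes, endOp is skipped
        }
        try {
            if (syncInsideTry) {
                failingSync();             // patched code: exception reaches the catch below
            }
            // flush cache code would run here
        } catch (Throwable t) {
            barrier.endOp();               // corresponds to wal.abortCacheFlush(...)
            throw new IOException("flush aborted", t);  // assumption for this sketch only
        }
        barrier.endOp();                   // corresponds to wal.completeCacheFlush(...)
    }

    /** Runs the given number of failing flush attempts and returns the ops still pending. */
    static int pendingOpsAfterFailedFlushes(int attempts, boolean syncInsideTry) {
        OpCounter barrier = new OpCounter();
        for (int i = 0; i < attempts; i++) {
            try {
                flushOnce(barrier, syncInsideTry);
            } catch (IOException e) {
                // the flush attempt failed; a real caller would log and retry later
            }
        }
        return barrier.pending();
    }

    public static void main(String[] args) {
        System.out.println("sync outside try/catch: pending ops = "
                + pendingOpsAfterFailedFlushes(3, false));
        System.out.println("sync inside try/catch:  pending ops = "
                + pendingOpsAfterFailedFlushes(3, true));
    }
}
```

In this sketch a non-zero pending count plays the role of the state in which closeBarrier.stopAndDrainOps() would wait forever. With the sync inside the guarded block, every beginOp is matched by an endOp even when the sync fails.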

> RegionServer sometimes gets stuck during shutdown in case of cache flush failures
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-13592
>                 URL: https://issues.apache.org/jira/browse/HBASE-13592
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.10
>            Reporter: Vikas Vishwakarma
>            Assignee: Vikas Vishwakarma
>
> Observed that a RegionServer sometimes gets stuck during shutdown in case of cache flush failures. After adding a few debug logs and looking through the stack trace, the RegionServer process appears to be stuck in closeWAL -> hlog.close -> closeBarrier.stopAndDrainOps() during the shutdown sequence in the run method.
> From the RegionServer logs we see multiple attempts to flush the cache for a particular region. Each attempt increments the beginOp count in DrainBarrier, but all the flush attempts fail somewhere in the WAL sync, so the DrainBarrier endOp count decrement never happens. Later, when shutdown is initiated, the RegionServer process is permanently stuck here.
> In this case hbase stop also does not work, and the RegionServer process has to be killed explicitly using kill -9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
