hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-13238) Time out locks and abort if HDFS is wedged
Date Fri, 13 Mar 2015 21:58:38 GMT
Andrew Purtell created HBASE-13238:

             Summary: Time out locks and abort if HDFS is wedged
                 Key: HBASE-13238
                 URL: https://issues.apache.org/jira/browse/HBASE-13238
             Project: HBase
          Issue Type: Brainstorming
            Reporter: Andrew Purtell

This is a brainstorming issue on the topic of timing out locks and aborting if HDFS is wedged.

We had a minor production incident where a region was unable to close after 24 hours. The
CloseRegionHandler was waiting for a write lock on the ReentrantReadWriteLock we take in HRegion#doClose.
There were outstanding read locks. Three other threads were stuck in scanning, all blocked
on the same DFSInputStream. Two were blocked in DFSInputStream#getFileLength, the third was
waiting in epoll from SocketIOWithTimeout$SelectorPool#select with an apparently infinite timeout
from PacketReceiver#readChannelFully.

This is similar to other issues we have seen before, where a region wanted to finish a
compaction but could not because some HDFS issue caused the reader to become extremely
slow, if not wedged.

The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning to upgrade, but
[~lhofhansl] and I were discussing the issue in general and wonder whether we should be timing
out locks such as the ReentrantReadWriteLock and, when a timeout fires, aborting the regionserver.
In this case an abort would have triggered recovery and reassignment of the region in question,
and we would not have had a prolonged availability problem.
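
To make the idea concrete, here is a minimal sketch of what a timed write-lock acquisition could look like. It is not HBase code: the class TimedCloseLock, the timeout value, and the abortAction callback are all hypothetical names chosen for illustration. The only real API used is ReentrantReadWriteLock and its writeLock().tryLock(long, TimeUnit) method from java.util.concurrent.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; names here are assumptions, not HBase classes.
final class TimedCloseLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final long timeoutMs;        // assumed configurable timeout
    private final Runnable abortAction;  // stand-in for aborting the regionserver

    TimedCloseLock(long timeoutMs, Runnable abortAction) {
        this.timeoutMs = timeoutMs;
        this.abortAction = abortAction;
    }

    // Exposed so readers (e.g. scanners) can take the read lock.
    ReentrantReadWriteLock lock() {
        return lock;
    }

    /**
     * Tries to take the write lock within the timeout and run closeAction.
     * Returns true if the close ran, false if the lock timed out and the
     * abort action was invoked instead.
     */
    boolean closeWithTimeout(Runnable closeAction) {
        boolean acquired;
        try {
            acquired = lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status and surface the failure unchecked.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for write lock", e);
        }
        if (!acquired) {
            // Readers never released the lock (e.g. stuck on a wedged stream).
            abortAction.run();
            return false;
        }
        try {
            closeAction.run();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With something along these lines, a close that cannot obtain the write lock because readers are stuck would give up after the timeout and hand control to the abort path instead of waiting indefinitely. Choosing the timeout is the hard part: it would need to be long enough to tolerate slow but healthy scans.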

This message was sent by Atlassian JIRA
