hbase-issues mailing list archives

From "zhangduo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-13238) Time out locks and abort if HDFS is wedged
Date Sat, 14 Mar 2015 01:45:38 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361479#comment-14361479 ]

zhangduo edited comment on HBASE-13238 at 3/14/15 1:44 AM:
-----------------------------------------------------------

{quote}
Even if we get hung up at the HDFS level, it's our problem, we can't just have indefinite
unavailability in some known circumstances.
{quote}
Agreed.
I know it is sometimes hard to get HDFS to support us (HBASE-5940, almost 3 years and no progress...);
they have their own plans.
But I suggest we keep the workaround code only in critical places and document clearly why
we do this. 


was (Author: apache9):
{noformat}
Even if we get hung up at the HDFS level, it's our problem, we can't just have indefinite
unavailability in some known circumstances.
{noformat}
Agreed.
I know it is sometimes hard to get HDFS to support us (HBASE-5940, almost 3 years and no progress...);
they have their own plans.
But I suggest we keep the workaround code only in critical places and document clearly why
we do this. 

> Time out locks and abort if HDFS is wedged
> ------------------------------------------
>
>                 Key: HBASE-13238
>                 URL: https://issues.apache.org/jira/browse/HBASE-13238
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Andrew Purtell
>
> This is a brainstorming issue on the topic of timing out locks and aborting rather than
waiting infinitely. Perhaps even as a rule.
> We had a minor production incident where a region was unable to close after trying for
24 hours. The CloseRegionHandler was waiting for a write lock on the ReentrantReadWriteLock
we take in HRegion#doClose. There were outstanding read locks. Three other threads were stuck
in scanning, all blocked on the same DFSInputStream. Two were blocked in DFSInputStream#getFileLength,
the third was waiting in epoll from SocketIOWithTimeout$SelectorPool#select with an apparently
infinite timeout from PacketReceiver#readChannelFully.
> This is similar to other issues we have seen before, where a region wants
to finish a compaction before closing for a split but cannot, because some HDFS issue causes
the reader to become extremely slow if not wedged. This has led to what should be quick SplitTransactions
causing availability problems of many minutes in length.
> The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning to upgrade,
but [~lhofhansl] and I were discussing the issue in general and wonder whether we should be
timing out locks such as the ReentrantReadWriteLock and, when a timeout fires, aborting the regionserver. In
this case this would have caused recovery and reassignment of the region in question and we
would not have had a prolonged availability problem. 
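
For illustration only, the idea in the description (a bounded wait for the write lock, then abort instead of waiting forever) might be sketched as follows. This is not HBase code: the class TimedCloseLockSketch, the Aborter interface, and closeWithTimeout are hypothetical names invented for this sketch, and the 200 ms timeout is arbitrary. The only real API assumed is java.util.concurrent's ReentrantReadWriteLock and its timed tryLock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: class, interface, and method names are hypothetical, not HBase APIs.
final class TimedCloseLockSketch {

    /** Hypothetical hook standing in for "abort the regionserver". */
    interface Aborter {
        void abort(String reason);
    }

    /**
     * Waits at most the given timeout for the write lock. On success, runs
     * closeAction under the lock and returns true. On timeout, calls the
     * aborter and returns false without running closeAction.
     */
    static boolean closeWithTimeout(ReentrantReadWriteLock lock, long timeout, TimeUnit unit,
                                    Runnable closeAction, Aborter aborter)
            throws InterruptedException {
        // Bounded wait instead of writeLock().lock(), which would block indefinitely.
        if (!lock.writeLock().tryLock(timeout, unit)) {
            aborter.abort("Timed out after " + timeout + " " + unit
                    + " waiting for the close lock; read holds outstanding: "
                    + lock.getReadLockCount());
            return false;
        }
        try {
            closeAction.run();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readerHoldsLock = new CountDownLatch(1);
        CountDownLatch releaseReader = new CountDownLatch(1);

        // Simulates a scanner stuck on a wedged read while holding the read lock.
        Thread stuckReader = new Thread(() -> {
            lock.readLock().lock();
            try {
                readerHoldsLock.countDown();
                releaseReader.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        stuckReader.start();
        readerHoldsLock.await();

        boolean closed = closeWithTimeout(lock, 200, TimeUnit.MILLISECONDS,
                () -> System.out.println("region closed"),
                reason -> System.out.println("ABORT: " + reason));
        System.out.println("closed=" + closed);

        releaseReader.countDown();
        stuckReader.join();
    }
}
```

The sketch keeps the timeout handling in one place, at the lock acquisition, which fits the suggestion in the comment above to confine workaround code to critical places and document why it exists. What "abort" should actually do in a real regionserver is left open here.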



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
