hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18058) Zookeeper retry sleep time should have an upper limit
Date Fri, 19 May 2017 04:11:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016870#comment-16016870 ]

Hudson commented on HBASE-18058:

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3036 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3036/])
HBASE-18058 Zookeeper retry sleep time should have an upper limit (Allan (tedyu: rev d137991ccc876988ae8832c316457e525f6bf387)
* (edit) hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
* (edit) hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java
* (edit) hbase-common/src/main/resources/hbase-default.xml

> Zookeeper retry sleep time should have an upper limit
> -----------------------------------------------------
>                 Key: HBASE-18058
>                 URL: https://issues.apache.org/jira/browse/HBASE-18058
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.4.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>         Attachments: HBASE-18058-branch-1.patch, HBASE-18058-branch-1.v2.patch, HBASE-18058-branch-1.v3.patch,
HBASE-18058.patch, HBASE-18058.v2.patch
> Currently, in {{RecoverableZooKeeper}}, the retry backoff sleep time grows exponentially, but
it has no upper limit. This directly leads to a very long recovery time after ZooKeeper
has been down for a while and then comes back.
> An example of the damage done by a high sleep time:
> If the disk on the server hosting ZooKeeper is full, the ZooKeeper quorum won't actually go
down but will reject all write requests. So on the HBase side, new ZK write requests will fail with
an exception and be retried. But the connection remains, so the session won't time out. When the disk-full
situation has been resolved, the ZooKeeper quorum can work normally again. But because of the very high
sleep time, some modules of the RegionServer/HMaster (for example, the balancer) will still sleep for
a long time before resuming work.
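
The capped backoff that the issue asks for can be illustrated with a minimal Java sketch. This is not the committed patch: the class name CappedBackoff, the method sleepTimeMs, and its parameters are hypothetical and chosen only for illustration. It assumes the sleep time doubles on each retry, starting from a base value, and is clamped to a configured maximum.

```java
// Hypothetical illustration only; not the code from the HBASE-18058 patch.
final class CappedBackoff {
    private CappedBackoff() {}

    /**
     * Returns the sleep time for the given retry count: the base sleep time
     * doubled once per retry, but never larger than maxSleepMs.
     */
    static long sleepTimeMs(long baseSleepMs, int retries, long maxSleepMs) {
        if (baseSleepMs <= 0 || maxSleepMs <= 0 || retries < 0) {
            throw new IllegalArgumentException("arguments must be positive (retries >= 0)");
        }
        long sleep = baseSleepMs;
        // Stop doubling as soon as the cap is reached, so the loop stays short
        // and the multiplication can never overflow.
        for (int i = 0; i < retries && sleep < maxSleepMs; i++) {
            sleep = (sleep > maxSleepMs / 2) ? maxSleepMs : sleep * 2;
        }
        return Math.min(sleep, maxSleepMs);
    }
}
```

With a base of 1000 ms and a cap of 60000 ms, sleepTimeMs(1000, 5, 60000) yields 32000, while any retry count of 6 or more yields 60000 instead of continuing to grow. Once the quorum accepts writes again, a waiting caller therefore resumes within at most one cap interval.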

This message was sent by Atlassian JIRA
