Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 15 Nov 2017 16:14:00 +0000 (UTC)
From: "Peter Somogyi (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12835797.1433529877000.246240.1510762440957@Atlassian.JIRA>
In-Reply-To: <JIRA.12835797.1433529877000@Atlassian.JIRA>
References: <JIRA.12835797.1433529877000@Atlassian.JIRA> <JIRA.12835797.1433529877760@jira-lw-us.apache.org>
Subject: [jira] [Resolved] (HBASE-13850) Check for dead server on
 CallTimeoutException
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 15 Nov 2017 16:14:07 -0000


     [ https://issues.apache.org/jira/browse/HBASE-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Somogyi resolved HBASE-13850.
-----------------------------------
    Resolution: Duplicate

> Check for dead server on CallTimeoutException
> ---------------------------------------------
>
>                 Key: HBASE-13850
>                 URL: https://issues.apache.org/jira/browse/HBASE-13850
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client, MTTR
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: Matteo Bertozzi
>            Assignee: huaxiang sun
>            Priority: Minor
>         Attachments: HBASE-13850-v0.patch, TestGetPerf.java
>
>
> WARNING: this may be a misconfiguration, so let me know if there is a timeout parameter I should set.
> {noformat}
> hbase-site.xml
> zookeeper.session.timeout 10000
> hbase.regionserver.storefile.refresh.period 10000
> hbase.client.operation.timeout 5000
> hbase.client.meta.operation.timeout 5000
> hbase.client.scanner.timeout.period 10000
> hbase.regionserver.lease.period 10000
> {noformat}
> I have a test that does a kill -STOP on an RS and then tries to query it.
> According to the configuration the ZK lease is 10 seconds, and the master correctly performs the reassignment after 10 seconds and updates meta.
> The client keeps trying to query the RS for a specific row until it gets a response. The table.get(row) call in the loop throws a CallTimeoutException every 5 seconds (the configured timeout), but instead of succeeding after 2 or 3 retries (more than 10 seconds, by which point the master has reassigned the region), it keeps retrying for up to 60 seconds (I don't know what that 60 seconds is; maybe a configuration parameter that I'm not able to find).
> One simple fix in the code is to handle the CallTimeoutException in RegionServerCallable and clear the meta cache for the RS that is not responding (but maybe there is already a configuration setting that reduces that 60-second period).
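
The fix proposed in the description can be illustrated with a minimal, self-contained sketch. It does not use the HBase client API. Every name in it (SketchCallTimeoutException, StaleLocationSketch, the meta and cache maps) is hypothetical and only models the idea of evicting cached locations for a server whose call timed out. It also assumes, as a simplification, that meta already points at the new server when the client looks the region up again.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a call timeout, so the sketch compiles without HBase.
class SketchCallTimeoutException extends Exception {
    SketchCallTimeoutException(String msg) { super(msg); }
}

// Hypothetical model of a client with a region location cache (not HBase code).
class StaleLocationSketch {
    // Authoritative region -> server mapping, updated by the master on reassignment.
    final Map<String, String> meta = new HashMap<>();
    // Client-side location cache, which can go stale.
    final Map<String, String> cache = new HashMap<>();
    // Server that is frozen (as after kill -STOP) and never answers.
    String frozenServer;
    // Toggle for the proposed behavior.
    boolean clearCacheOnTimeout;

    // Return the cached location, falling back to meta on a cache miss.
    String locate(String region) {
        return cache.computeIfAbsent(region, meta::get);
    }

    // Simulated RPC: a frozen server always times out.
    String call(String server, String region) throws SketchCallTimeoutException {
        if (server.equals(frozenServer)) {
            throw new SketchCallTimeoutException("no response from " + server);
        }
        return "value-from-" + server;
    }

    // Returns the attempt number that succeeded, or -1 if all attempts timed out.
    int getWithRetries(String region, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String server = locate(region);
            try {
                call(server, region);
                return attempt;
            } catch (SketchCallTimeoutException e) {
                if (clearCacheOnTimeout) {
                    // Proposed fix: drop every cached location on the silent server,
                    // so the next attempt re-reads the location from meta.
                    cache.values().removeIf(s -> s.equals(server));
                }
            }
        }
        return -1;
    }

    // Scenario from the report: region cached on rs1, rs1 frozen, master moved it to rs2.
    static int demo(boolean clearCacheOnTimeout) {
        StaleLocationSketch client = new StaleLocationSketch();
        client.clearCacheOnTimeout = clearCacheOnTimeout;
        client.meta.put("region-1", "rs1");
        client.locate("region-1");          // cache now points at rs1
        client.frozenServer = "rs1";        // rs1 stops responding
        client.meta.put("region-1", "rs2"); // master reassigns the region
        return client.getWithRetries("region-1", 12);
    }

    public static void main(String[] args) {
        System.out.println("with eviction: " + demo(true));
        System.out.println("without eviction: " + demo(false));
    }
}
```

In this model, demo(true) succeeds on the second attempt because the timeout evicts the stale entry, while demo(false) exhausts all 12 attempts against the frozen server. In a real cluster the first re-lookup may still return the old server until the master has finished the reassignment, so more than one retry could be needed.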

