Date: Wed, 17 May 2017 04:14:04 +0000 (UTC)
From: "Allan Yang (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Comment Edited] (HBASE-18058) Zookeeper retry sleep time should have a up limit

    [ https://issues.apache.org/jira/browse/HBASE-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013479#comment-16013479 ]

Allan Yang edited comment on HBASE-18058 at 5/17/17 4:13 AM:
-------------------------------------------------------------

{quote}
Normally in this case RegionServer will crash due to zookeeper session timeout, similar like when RS full GC, right? Mind share the case in your scenario? How do you keep RS alive while zookeeper down for some while? Thanks. Allan Yang
{quote}

Yes, it is a very interesting case, and it really happened.
If the server hosting ZooKeeper runs out of disk space, the ZooKeeper quorum does not actually go down; it simply rejects all write requests. On the HBase side, each new ZK write request then fails with an exception and is retried, but the connection stays up, so the session never times out. Once the disk-full situation is resolved, the ZooKeeper quorum works normally again, yet the very large retry sleep time that has accumulated means some modules of the RegionServer/HMaster (in our case, the balancer) still sleep for a long time before they resume work.
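To make the scale of the problem concrete, here is a small, purely illustrative calculation of how an uncapped doubling backoff grows compared with a capped one. The 1-second base and 60-second cap are invented example values, not the actual HBase or {{RecoverableZooKeeper}} defaults, and the real backoff formula may differ in detail.

{code:java}
// Illustration only: growth of an uncapped doubling backoff versus a capped one.
// The 1s base and 60s cap are made-up example values, not HBase defaults.
public class BackoffGrowthDemo {
  public static void main(String[] args) {
    long baseMs = 1_000L;   // assumed base sleep between retries
    long capMs = 60_000L;   // assumed upper limit on the sleep

    for (int retry = 0; retry <= 12; retry++) {
      long uncappedMs = baseMs << retry;             // 1s, 2s, 4s, ... ~68 minutes at retry 12
      long cappedMs = Math.min(uncappedMs, capMs);   // never sleeps longer than the cap
      System.out.printf("retry %2d: uncapped=%6ds  capped=%3ds%n",
          retry, uncappedMs / 1000, cappedMs / 1000);
    }
  }
}
{code}

With these example numbers, after roughly a dozen failed retries during the disk-full window a single remaining sleep can already exceed an hour, which matches the "balancer stays asleep long after ZooKeeper recovers" symptom described above.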
> Zookeeper retry sleep time should have a up limit
> -------------------------------------------------
>
>                 Key: HBASE-18058
>                 URL: https://issues.apache.org/jira/browse/HBASE-18058
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.4.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>         Attachments: HBASE-18058-branch-1.patch, HBASE-18058-branch-1.v2.patch, HBASE-18058.patch
>
> Now, in {{RecoverableZooKeeper}}, the retry backoff sleep time grows exponentially, but it does not have any upper limit. This directly leads to a very long recovery time after ZooKeeper goes down for a while and comes back.
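For readers who just want the shape of the fix: the idea is simply to clamp the exponential sleep to a configurable maximum. The sketch below is a minimal illustration of that idea; the class, field, and method names are invented for the example and are not taken from {{RecoverableZooKeeper}} or the attached patches.

{code:java}
// Minimal sketch of a capped exponential backoff between ZooKeeper retries.
// Names and defaults are illustrative and are not copied from RecoverableZooKeeper.
public final class CappedRetryBackoff {
  private final long baseSleepTimeMs;
  private final long maxSleepTimeMs;

  public CappedRetryBackoff(long baseSleepTimeMs, long maxSleepTimeMs) {
    this.baseSleepTimeMs = baseSleepTimeMs;
    this.maxSleepTimeMs = maxSleepTimeMs;
  }

  /** Sleep time for the given retry: doubles each time, but never exceeds the cap. */
  public long sleepTimeMs(int retryCount) {
    long exponential = baseSleepTimeMs * (1L << Math.min(retryCount, 30)); // bound the shift to avoid overflow
    return Math.min(exponential, maxSleepTimeMs);
  }
}
{code}

With a cap like this in place, even if ZooKeeper is unavailable long enough for many retries to accumulate, the first sleep after it recovers is bounded by {{maxSleepTimeMs}} instead of growing without limit.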