From: "ramkrishna.s.vasudevan (JIRA)"
To: issues@hbase.apache.org
Date: Fri, 29 Jul 2011 12:47:09 +0000 (UTC)
Message-ID: <2083162540.18393.1311943629927.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss

    [ https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072794#comment-13072794 ]

ramkrishna.s.vasudevan commented on HBASE-3065:
-----------------------------------------------

I have uploaded the addendum. I think there is also a better way.
Currently the RecoverableZookeeper.getData() API already does the removeAppend step. But that API does not take an AsyncCallback as a parameter, whereas the one in Zookeeper does.
The problem is that the Zookeeper.getData() variant that takes an AsyncCallback does not return the byte[]; instead it internally invokes AsyncCallback.processResult(). That is the reason we do not have a corresponding API in RecoverableZookeeper.

Please let me know if the patch is OK. Also correct me if my analysis is wrong.

> Retry all 'retryable' zk operations; e.g. connection loss
> ---------------------------------------------------------
>
>                 Key: HBASE-3065
>                 URL: https://issues.apache.org/jira/browse/HBASE-3065
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Liyin Tang
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 3065-v3.txt, 3065-v4.txt, HBASE-3065-addendum.patch, HBase-3065[r1088475]_1.patch, hbase3065_2.patch
>
>
> The 'new' master refactored our zk code, tidying up all zk accesses and corralling them behind nice zk utility classes. One improvement was letting out all KeeperExceptions, letting the client deal. That's good generally because in the old days we'd suppress important zk changes in state. But there is at least one case the new zk utility could handle for the application, and that's the class of retryable KeeperExceptions. The one that comes to mind is connection loss. On connection loss we should retry the just-failed operation.
> Usually the retry will just work. At worst, on reconnect, we'll pick up the expired session event.
> Adding in this change shouldn't be too bad given that the zk refactor corralled all zk access into one or two classes only.
> One thing to consider though is how much we should retry. We could retry on a timer, or we could retry forever as long as the Stoppable interface is passed, so that if another thread has stopped or aborted the hosting service, we'll notice and give up trying. Doing the latter is probably better than some kind of timeout.
> HBASE-3062 adds a timed retry on the first zk operation. This issue is about generalizing what is over there across all zk access.
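
For illustration only, here is a minimal sketch of the "retry until stopped" idea from the issue description. It is not the attached patch and does not use the real ZooKeeper or HBase classes: ConnectionLossException, StoppedException, and the retry() helper are hypothetical names, and a plain BooleanSupplier stands in for a check like Stoppable.isStopped(). The pause between attempts is also an assumption.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

final class RetrySketch {

    /** Hypothetical stand-in for a retryable connection-loss KeeperException. */
    static final class ConnectionLossException extends RuntimeException {
        ConnectionLossException() {
            super("connection loss");
        }
    }

    /** Hypothetical: raised when we give up because the hosting service stopped. */
    static final class StoppedException extends RuntimeException {
        StoppedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs op, retrying on connection loss until it succeeds or isStopped
     * reports that the hosting service has been stopped or aborted.
     */
    static <T> T retry(Supplier<T> op, BooleanSupplier isStopped, long pauseMillis) {
        ConnectionLossException last = null;
        while (!isStopped.getAsBoolean()) {
            try {
                return op.get();
            } catch (ConnectionLossException e) {
                // Retryable failure: remember it, pause, then try again.
                last = e;
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new StoppedException("interrupted while waiting to retry", e);
                }
            }
        }
        throw new StoppedException("service stopped; giving up", last);
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated operation: fails twice with connection loss, then succeeds.
        String result = retry(() -> {
            if (++attempts[0] < 3) {
                throw new ConnectionLossException();
            }
            return "znode-data";
        }, () -> false, 10L);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

Only the retryable exception is caught here; anything else propagates to the caller, which matches the description's point that non-retryable KeeperExceptions should still be let out for the client to deal with.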