Message-ID: <21619717.32711287777496557.JavaMail.jira@thor>
Date: Fri, 22 Oct 2010 15:58:16 -0400 (EDT)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] Resolved: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg
In-Reply-To: <14321588.189621284501285804.JavaMail.jira@thor>

     [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2998.
--------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Thanks for the review, Jon. I did as you suggested (and that test passes).
I just tried it too, up on a cluster w/ a 5-node ensemble. Committing.

> rolling-restart.sh shouldn't rely on zoo.cfg
> --------------------------------------------
>
>                 Key: HBASE-2998
>                 URL: https://issues.apache.org/jira/browse/HBASE-2998
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.90.0
>
>         Attachments: 2998.txt
>
>
> I tried the rolling-restart script on our dev environment, which is configured with zoo.cfg for zookeeper, and it worked pretty well. Then I tried it on our MR cluster, which doesn't have a zoo.cfg, and we suffered some downtime (no biggie tho, nothing critical was running). When the script calls this line:
> {code}
> bin/hbase zkcli stat $zmaster
> {code}
> It directly runs a ZooKeeperMain which isn't modified to read from the HBase configuration files. What happens next if ZK isn't running on the master node is that it receives a ConnectionRefused, ignores it, proceeds to restart the master (which waits on the znode), and then starts restarting the region servers. They can't shut down properly in under 60 seconds, since they need a master, so they get killed. What follows is pretty ugly and pretty much requires a whole restart.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
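
The failure described in the quoted report comes down to a refused connection being treated the same as "the master znode is gone". A minimal sketch of a guard that keeps those two cases apart might look like the following. It is not taken from the attached patch: the function names classify_zk_stat and wait_for_znode_gone are hypothetical, and the strings it matches are assumptions about what the ZooKeeper CLI prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of rolling-restart.sh: classify the text
# printed by a "stat <znode>" call so that a refused connection is never
# mistaken for "the master znode is gone".
# The matched strings are assumptions about the ZooKeeper CLI output.
classify_zk_stat() {
  local output="$1"
  case "$output" in
    *"Connection refused"*|*"ConnectionLoss"*) echo "unreachable" ;;
    *"Node does not exist"*)                   echo "gone" ;;
    *)                                         echo "present" ;;
  esac
}

# Hypothetical guard: poll until the znode is gone, and abort if
# ZooKeeper cannot be reached. "$@" is the command that performs the
# stat, for example: bin/hbase zkcli stat "$zmaster"
wait_for_znode_gone() {
  local state
  while true; do
    state=$(classify_zk_stat "$("$@" 2>&1)")
    case "$state" in
      gone)        return 0 ;;
      unreachable) echo "ZooKeeper unreachable, aborting restart" >&2
                   return 1 ;;
      *)           sleep 1 ;;   # znode still present, keep waiting
    esac
  done
}
```

A caller could then run something like `wait_for_znode_gone bin/hbase zkcli stat "$zmaster" || exit 1` before touching the region servers, so that a refused connection stops the script instead of letting the restart continue.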