hbase-issues mailing list archives

From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg
Date Thu, 21 Oct 2010 08:58:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923373#action_12923373 ]

HBase Review Board commented on HBASE-2998:

Message from: stack@duboce.net

bq.  On 2010-10-20 17:54:04, Jonathan Gray wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java, line
bq.  > <http://review.cloudera.org/r/1057/diff/1/?file=15040#file15040line103>
bq.  >
bq.  >
bq.  Jonathan Gray wrote:
bq.      Not sure where my comment went :)  Just wondering if in the rolling restart case,
we aren't aborting, right?
bq.      Is the stop() now a hard stop and that's why we need to close regions first?

Looking into this more (prompted by your comment), this extra callout to shut down user
regions in the shutdown hook shouldn't be needed.  Something else is going on.  I'm digging in.

Right, we are not aborting during a rolling restart.  We are intercepting the JVM kill signal.

- stack


> rolling-restart.sh shouldn't rely on zoo.cfg
> --------------------------------------------
>                 Key: HBASE-2998
>                 URL: https://issues.apache.org/jira/browse/HBASE-2998
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.90.0
>         Attachments: 2998.txt
> I tried the rolling-restart script on our dev environment, which is configured with zoo.cfg
for zookeeper, and it worked pretty well. Then I tried it on our MR cluster, which doesn't
have a zoo.cfg, and we suffered some downtime (no biggie tho, nothing critical was running).
When the script calls this line:
> {code}
> bin/hbase zkcli stat $zmaster
> {code}
> It directly runs a ZooKeeperMain which isn't modified to read from the HBase configuration
files. What happens next if ZK isn't running on the master node is that it receives a ConnectionRefused,
ignores it, proceeds to restart the master (which waits on the znode), and then starts restarting
the region servers. They can't shut down properly within 60 seconds, since they need a master,
so they get killed. What follows is pretty ugly and pretty much requires a whole restart.
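
The failure described above comes down to a refused ZooKeeper connection that the script ignores before it goes on to restart the master. A minimal sketch of a pre-flight guard follows. It is not the fix attached to this issue: `check_or_abort` is a hypothetical helper, and it assumes that a refused connection shows up either as a non-zero exit status or as text matching "connection refused" in the command's output, which may not hold for every zkcli version.

```shell
# Hypothetical helper (not part of rolling-restart.sh): run a check command
# and fail loudly instead of silently ignoring a refused connection.
check_or_abort() {
  # Capture stdout and stderr together; record the exit status either way.
  if output=$("$@" 2>&1); then
    status=0
  else
    status=$?
  fi

  # Assumption: a refused connection appears as a non-zero status or as
  # "ConnectionRefused" / "Connection refused" somewhere in the output.
  if [ "$status" -ne 0 ] || printf '%s\n' "$output" | grep -Eqi 'connection ?refused'; then
    echo "pre-flight check failed: $*" >&2
    return 1
  fi

  # Pass the captured output through so callers can still inspect it.
  printf '%s\n' "$output"
}

# Intended use inside a restart script (not executed here):
#   check_or_abort bin/hbase zkcli stat "$zmaster" || exit 1
```

With a guard like this, the script would stop before restarting the master or any region server when ZooKeeper cannot be reached, instead of continuing into the kill-after-60-seconds path.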

