hbase-dev mailing list archives

From "Nitay Joffe (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-1629) HRS unable to contact master
Date Wed, 08 Jul 2009 22:25:15 GMT

     [ https://issues.apache.org/jira/browse/HBASE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated HBASE-1629:

    Attachment: hbase-1629.patch

Small patch for a convoluted problem. Amandeep, try this out, see if it fixes it for you.

Here's the problem:

[14:32]  <nitay> reportForDuty()
[14:32]  <nitay>     while (!getMaster()) {
[14:32]  <nitay>       sleeper.sleep();
[14:32]  <nitay>       LOG.warn("Unable to get master for initialization");
[14:32]  <nitay>     }
[14:33]  <nitay> getMaster()
[14:33]  <nitay>     HServerAddress masterAddress = null;
[14:33]  <nitay>     while (masterAddress == null) {
[14:33]  <nitay>       if (stopRequested.get()) {
[14:33]  <nitay>         return false;
[14:33]  <nitay>       }

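This is an infinite loop: once stopRequested is set, getMaster() returns false right away, so the
while loop in reportForDuty() never exits. That causes the messages at the end of the RS Log Amandeep posted.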

The flow of logic that leads to this is the following:
# RS session with ZooKeeper expires.
# Master gets znode expiration, starts cleanup/shutdown of RS.
# RS gets its session expired, begins restart() logic, setting stopRequested.
# Meanwhile, RS run() thread is still talking to master.
# Master gets a message from RS, but doesn't recognize the RS because it's already been removed. This is the
"received server report from unknown server..." stuff. Master tells the RS to reinitialize, sending
MSG_CALL_SERVER_STARTUP.
# RS on getting MSG_CALL_SERVER_STARTUP calls reportForDuty() and is now in a loop. The restart()
thread from ZooKeeper is waiting for the RS run() to finish, but it never will.

This simple patch makes reportForDuty() fail fast when stopRequested is set.
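
The patch itself is not reproduced in this message, so the following is only a minimal sketch of what a fail-fast check of this kind could look like. ReportForDutySketch, masterReachable, and warnings are hypothetical names made up for illustration; only reportForDuty(), getMaster(), and stopRequested come from the snippet above, and the real master lookup is replaced by a simple flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified stand-in for the region server (not the real HRegionServer).
class ReportForDutySketch {
    final AtomicBoolean stopRequested = new AtomicBoolean(false);
    // Assumption: stands in for the real master lookup, which is not shown above.
    volatile boolean masterReachable = false;
    int warnings = 0;

    // Like the quoted getMaster(): returns false as soon as a stop is requested.
    boolean getMaster() {
        if (stopRequested.get()) {
            return false;
        }
        return masterReachable;
    }

    // Returns true once the master is found, false if we gave up.
    boolean reportForDuty() {
        while (!getMaster()) {
            // Fail fast: without this check the loop never exits once
            // stopRequested is set, because getMaster() keeps returning false.
            if (stopRequested.get()) {
                return false;
            }
            try {
                Thread.sleep(10); // stands in for sleeper.sleep()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            warnings++; // stands in for LOG.warn("Unable to get master for initialization")
        }
        return true;
    }
}
```

With stopRequested set, reportForDuty() in this sketch returns false immediately instead of spinning, which would let the run() thread finish so the restart() thread waiting on it can proceed.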

> HRS unable to contact master
> ----------------------------
>                 Key: HBASE-1629
>                 URL: https://issues.apache.org/jira/browse/HBASE-1629
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Amandeep Khurana
>            Assignee: Nitay Joffe
>             Fix For: 0.20.0
>         Attachments: hbase-1629.patch, Master_log, RS_Log
> HRS unable to contact master for initialization after expiration from ZK. Master thinks
> HRS is still up whereas HRS went down and now cannot restart. The RS logs have a flurry of
> the following warning messages:
> 2009-07-08 12:53:19,547 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to get master for initialization
> More logs from the RS and the Master attached.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.