hbase-issues mailing list archives

From "Himanshu Vashishtha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
Date Tue, 26 Jun 2012 23:13:44 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401779#comment-13401779 ]

Himanshu Vashishtha commented on HBASE-2611:

I looked at this issue from the perspective of using the ZooKeeper#multi operation (available in
3.4). This API guarantees that a list of Ops is executed as a single transaction, rolling back all
the Ops if any one of them fails. I tested this functionality as a standalone case (where the
transaction was to move a bunch of znodes from one parent to another), and it works well (out
of N threads that race to do the transfer, only one succeeds). In case of a failure,
all the Ops done so far are rolled back. I can attach the sample code if required.
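
To make the all-or-nothing behaviour concrete, here is a minimal in-memory sketch of the race described above. It is not the standalone test mentioned in this comment and it does not use the ZooKeeper client: FakeTree, Op, tryMove and race are hypothetical names invented for this illustration, and the znode paths are placeholders. The stand-in multi() applies a list of create/delete steps to a copy of the tree and commits only if every step succeeds, which is the property the comment relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class AtomicMoveSketch {
    /** One step of a transaction: create or delete a single path (hypothetical type). */
    static final class Op {
        final boolean isCreate;
        final String path;
        private Op(boolean isCreate, String path) { this.isCreate = isCreate; this.path = path; }
        static Op create(String path) { return new Op(true, path); }
        static Op delete(String path) { return new Op(false, path); }
    }

    /** In-memory stand-in for a znode tree; an assumption for illustration, NOT ZooKeeper. */
    static final class FakeTree {
        private final TreeSet<String> paths = new TreeSet<>();

        synchronized void add(String path) { paths.add(path); }

        /** Returns the names of the direct children of parent. */
        synchronized List<String> children(String parent) {
            String prefix = parent + "/";
            List<String> out = new ArrayList<>();
            for (String p : paths) {
                if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
                    out.add(p.substring(prefix.length()));
                }
            }
            return out;
        }

        /** Applies every op or none: work on a copy and commit only if all ops succeed. */
        synchronized boolean multi(List<Op> ops) {
            TreeSet<String> copy = new TreeSet<>(paths);
            for (Op op : ops) {
                // create fails if the path exists; delete fails if it is missing
                boolean ok = op.isCreate ? copy.add(op.path) : copy.remove(op.path);
                if (!ok) return false; // "roll back" by discarding the copy
            }
            paths.clear();
            paths.addAll(copy);
            return true;
        }
    }

    /** Tries to move every child of deadRs under selfRs in one transaction. */
    static boolean tryMove(FakeTree tree, String deadRs, String selfRs) {
        List<String> logs = tree.children(deadRs);
        if (logs.isEmpty()) return false; // nothing left to move
        List<Op> ops = new ArrayList<>();
        for (String log : logs) {
            ops.add(Op.create(selfRs + "/" + log));
            ops.add(Op.delete(deadRs + "/" + log));
        }
        return tree.multi(ops);
    }

    /** Races n threads over the same dead server; returns how many moves succeeded. */
    static int race(int n) {
        FakeTree tree = new FakeTree();
        for (int i = 0; i < 5; i++) tree.add("/rs/dead/log" + i);
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            final String self = "/rs/live" + i;
            Thread t = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (tryMove(tree, "/rs/dead", self)) winners.incrementAndGet();
            });
            threads.add(t);
            t.start();
        }
        start.countDown();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("successful movers: " + race(8));
    }
}
```

In this model a losing thread either sees no children left, or its delete steps fail because the winner has already removed the source znodes, so its whole transaction is discarded.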

Here is the approach I used to utilize multi for this issue:
a) All the active regionservers try to "move" the logs of peers under the dead regionserver
znode. This involves creating Op objects for creating the new znodes and deleting the old ones.
As per the multi API guarantee, only one regionserver will succeed in moving the znodes.

b) The regionservers will "keep on trying to move" the znodes from the dead regionserver until
they are sure that the node is deleted (by the successful regionserver), or there is no log
to process. This avoids any corner case in which the logs of the dead regionserver are missed.
The number of attempts can be made configurable.

c) In case of cascading failure (when the successful regionserver dies before it gets the
notification from zk about the successful move), other regionservers will get this new event
and will proceed as normal (they will try to move all the znodes from this newly dead regionserver).
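
The retry rule in (b) can be sketched as a bounded loop. This is a hedged illustration, not code from the attached patch: QueueStore, Outcome and claimQueues are hypothetical names, and moveAtomically stands in for whatever builds and submits the multi transaction. The loop stops as soon as the dead regionserver's znode is gone, there is no log left, or the move succeeds; maxAttempts plays the role of the configurable number of attempts.

```java
import java.util.Collections;
import java.util.List;

final class RetryMoveSketch {
    /** Hypothetical view of the replication znodes; not an HBase or ZooKeeper interface. */
    interface QueueStore {
        boolean exists(String deadRs);
        List<String> listLogs(String deadRs);
        /** Assumed to move all listed logs in one all-or-nothing transaction. */
        boolean moveAtomically(String deadRs, String selfRs, List<String> logs);
    }

    enum Outcome { CLAIMED, ALREADY_HANDLED, NOTHING_TO_MOVE, GAVE_UP }

    /** Keeps trying to claim the dead server's logs until one stop condition from (b) holds. */
    static Outcome claimQueues(QueueStore store, String deadRs, String selfRs, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (!store.exists(deadRs)) {
                return Outcome.ALREADY_HANDLED; // znode deleted by the successful regionserver
            }
            List<String> logs = store.listLogs(deadRs);
            if (logs.isEmpty()) {
                return Outcome.NOTHING_TO_MOVE; // no log to process
            }
            if (store.moveAtomically(deadRs, selfRs, logs)) {
                return Outcome.CLAIMED; // this regionserver won the race
            }
            // Lost the race or hit a transient failure: re-check the state and try again.
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        // Toy store in which the dead server's znode has already been removed.
        QueueStore store = new QueueStore() {
            public boolean exists(String deadRs) { return false; }
            public List<String> listLogs(String deadRs) { return Collections.emptyList(); }
            public boolean moveAtomically(String d, String s, List<String> logs) { return false; }
        };
        System.out.println(claimQueues(store, "/rs/dead", "/rs/live1", 3));
    }
}
```

For the cascading case in (c), the same routine would simply be called again with the newly dead regionserver as deadRs.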

It would be good to know what others think about this approach. Are there other rogue conditions
that can happen?

Attached is a patch. I tested it by manually killing regionservers at random (not
totally random: killing one, and then killing the successful one just after it has transferred
the logs; it's difficult to kill it during the transfer, as the transfer is now an atomic
operation). Any ideas/suggestions for more direct testing are welcome.
> Handle RS that fails while processing the failure of another one
> ----------------------------------------------------------------
>                 Key: HBASE-2611
>                 URL: https://issues.apache.org/jira/browse/HBASE-2611
>             Project: HBase
>          Issue Type: Sub-task
>          Components: replication
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBase-2611-upstream-v1.patch
> HBASE-2223 doesn't manage region servers that fail while doing the transfer of HLogs
> queues from other region servers that failed. Devise a reliable way to do it.


