helix-commits mailing list archives

From "Swaroop Jagadish (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HELIX-26) Better support for handling network partition and process freeze
Date Fri, 22 Mar 2013 20:47:15 GMT

    [ https://issues.apache.org/jira/browse/HELIX-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611204#comment-13611204
] 

Swaroop Jagadish commented on HELIX-26:
---------------------------------------

The proposal is to handle a process freeze or network partition by having each participant
acquire a lease from the controller. We will consider distributed leases in the future.
Assuming that the controller and ZooKeeper always have connectivity between them, here is
pseudocode for the Helix client library changes on each participant:

onSyncDisconnect() {
    while (not connected to zk) {
        shouldContinue = check_with_coordinator(local_state)
        if (not shouldContinue) {
            reset local state
            broadcast disabled state
            break
        }
    }
}
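
As a rough illustration only, the loop above could be sketched in Java as follows. Every name
here (DisconnectHandler, ZkConnection, Coordinator, Participant, pollIntervalMs) is hypothetical
and is not part of the Helix API. The sketch also makes three assumptions that the pseudocode
does not state: the participant waits between checks instead of spinning, an unreachable
controller is treated like a refusal (in line with the issue description's requirement that a
node fence itself if it cannot reach the controller), and an interrupted wait also leads to fencing.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the participant-side disconnect loop; not Helix code.
class DisconnectHandler {

    /** Hypothetical view of the ZooKeeper connection state. */
    interface ZkConnection {
        boolean isConnected();
    }

    /** Hypothetical direct channel to the controller that bypasses ZooKeeper. */
    interface Coordinator {
        // True if the participant may keep its current state; throws if unreachable.
        boolean checkWithCoordinator(String localState) throws IOException;
    }

    /** Hypothetical hooks into the participant's own state handling. */
    interface Participant {
        String localState();
        void resetLocalState();
        void broadcastDisabledState();
    }

    private final ZkConnection zk;
    private final Coordinator coordinator;
    private final Participant participant;
    private final long pollIntervalMs;

    DisconnectHandler(ZkConnection zk, Coordinator coordinator,
                      Participant participant, long pollIntervalMs) {
        this.zk = zk;
        this.coordinator = coordinator;
        this.participant = participant;
        this.pollIntervalMs = pollIntervalMs;
    }

    /** Returns true if the participant kept its state until reconnect, false if it fenced itself. */
    boolean onSyncDisconnect() {
        while (!zk.isConnected()) {
            boolean shouldContinue;
            try {
                shouldContinue = coordinator.checkWithCoordinator(participant.localState());
            } catch (IOException e) {
                // Assumption: an unreachable controller counts as "do not continue".
                shouldContinue = false;
            }
            if (!shouldContinue) {
                fence();
                return false;
            }
            try {
                // Assumption: pause between checks rather than busy-looping.
                TimeUnit.MILLISECONDS.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Assumption: if the wait is interrupted, fence conservatively.
                Thread.currentThread().interrupt();
                fence();
                return false;
            }
        }
        return true;
    }

    // Mirrors "reset local state" and "broadcast disabled state" from the pseudocode.
    private void fence() {
        participant.resetLocalState();
        participant.broadcastDisabledState();
    }
}
```

The boolean return value is a convenience for callers and tests; the pseudocode itself does
not specify one.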
                
> Better support for handling network partition and process freeze
> ----------------------------------------------------------------
>
>                 Key: HELIX-26
>                 URL: https://issues.apache.org/jira/browse/HELIX-26
>             Project: Apache Helix
>          Issue Type: Improvement
>    Affects Versions: 0.6.0-incubating
>            Reporter: kishore gopalakrishna
>            Assignee: Swaroop Jagadish
>             Fix For: 0.6.1-incubating
>
>
> Handling network partitions is tricky in distributed systems. ZooKeeper allows us to solve
> this up to some degree with the use of heartbeats, but this is not sufficient in large-scale
> systems with many nodes. One of the problems is that once the client detects a disconnect,
> which happens on the client side, the options are:
> 1. Put yourself in a paused state until you reconnect.
> 2. Continue whatever you are doing until notified of session expiry.
> Unfortunately, 1 is too aggressive and 2 is too passive. Since Helix comes with a centralized
> controller, it is possible to have a middle-ground solution where, once the participant
> receives a disconnect event, it checks with the coordinator(s)/peers to see whether it can
> continue operating.
> The challenge here is for the node to detect whether it belongs to the same partition as the
> coordinator. So its goal is to reach the controller; if it cannot reach the controller,
> it has to disable/fence itself.
> As of now, Helix simply reports the state when it is disconnected from the cluster, and the
> user can choose either 1) or 2).
> This JIRA aims to investigate better ways to enhance network partition detection.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
