hadoop-zookeeper-dev mailing list archives

From "Abmar Barros (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model
Date Fri, 16 Jul 2010 19:38:50 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889280#action_12889280
] 

Abmar Barros commented on ZOOKEEPER-702:
----------------------------------------

Hi Diogo! Thank you for pointing out this paper. I had not come across this type of failure
detection before, and it is very interesting.

It proposes a simple way of estimating heartbeat arrival times based on application messages.
However, it requires attaching the sending time to every application message (or at least to
the ones used for the estimation), which adds overhead to the message size. Still, with the
failure detector factored out into a separate module, it would be easy to implement a new
failure detector that uses such data.

So far, I have adapted the proposed failure detectors so that they compute the estimated next
arrival time only when a heartbeat is received.
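
To illustrate what "compute the estimate only when a heartbeat is received" could look like,
here is a minimal sketch of a Chen-style estimator. It is not the code from the attached
patch: the class name, method names, and parameters (sending interval, safety margin, window
size) are all hypothetical. It assumes heartbeats carry a sequence number and are sent at a
fixed interval, and it estimates the next arrival as the windowed average of
(arrival - interval * seq) plus (seq + 1) * interval.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a Chen-style failure detector; not the ZOOKEEPER-702 patch.
final class ChenEstimatorSketch {
    private final long intervalMs;      // assumed fixed heartbeat sending interval
    private final long safetyMarginMs;  // constant margin added to the estimate
    private final int windowSize;       // number of recent heartbeats to average over
    private final Deque<Long> normalized = new ArrayDeque<>(); // arrival - interval * seq
    private long sum = 0;
    private long nextDeadlineMs = Long.MAX_VALUE; // no suspicion before the first heartbeat

    ChenEstimatorSketch(long intervalMs, long safetyMarginMs, int windowSize) {
        this.intervalMs = intervalMs;
        this.safetyMarginMs = safetyMarginMs;
        this.windowSize = windowSize;
    }

    // The estimate is recomputed here, and only here: once per received heartbeat.
    void heartbeatReceived(long seq, long arrivalMs) {
        long norm = arrivalMs - intervalMs * seq;
        normalized.addLast(norm);
        sum += norm;
        if (normalized.size() > windowSize) {
            sum -= normalized.removeFirst();
        }
        long expectedNext = sum / normalized.size() + (seq + 1) * intervalMs;
        nextDeadlineMs = expectedNext + safetyMarginMs;
    }

    long nextDeadlineMs() {
        return nextDeadlineMs;
    }

    // Cheap check: compares the current time against the cached deadline.
    boolean isSuspected(long nowMs) {
        return nowMs > nextDeadlineMs;
    }
}
```

With this shape, checking whether a peer is suspected is a single comparison, and the
averaging cost is paid once per heartbeat rather than on every check.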

> GSoC 2010: Failure Detector Model
> ---------------------------------
>
>                 Key: ZOOKEEPER-702
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
>             Project: Zookeeper
>          Issue Type: Wish
>            Reporter: Henry Robinson
>            Assignee: Abmar Barros
>         Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, chen-pseudo.txt,
phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch,
ZOOKEEPER-702.patch
>
>
> Failure Detector Module
> Possible Mentor
> Henry Robinson (henry at apache dot org)
> Requirements
> Java, some distributed systems knowledge, comfort implementing distributed systems protocols
> Description
> ZooKeeper servers detect the failure of other servers and clients by counting the number
of 'ticks' for which they don't get a heartbeat from other machines. This is the 'timeout'
method of failure detection and works very well; however, it is possible that it is too aggressive
and not easily tuned for some more unusual ZooKeeper installations (such as in a wide-area
network, or even in a mobile ad-hoc network).
> This project would abstract the notion of failure detection to a dedicated Java module,
and implement several failure detectors to compare and contrast their appropriateness for
ZooKeeper. For example, Apache Cassandra uses a phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf)
which is much more tunable and has some very interesting properties. This is a great project
if you are interested in distributed algorithms, or want to help re-factor some of ZooKeeper's
internal code.
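
As a rough illustration of why an accrual detector is more tunable than a fixed timeout, the
sketch below outputs a continuous suspicion level (phi) instead of a yes/no answer. It is a
simplification and an assumption, not the algorithm from the linked paper verbatim: the paper
models inter-arrival times with a normal distribution, while this sketch swaps in an
exponential distribution, so that phi = elapsed / (mean * ln 10). The class and method names
are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a phi-accrual style detector using an exponential
// approximation of the inter-arrival distribution (an assumption of this sketch).
final class PhiAccrualSketch {
    private final int windowSize;                             // intervals to remember
    private final Deque<Long> intervals = new ArrayDeque<>(); // recent inter-arrival times
    private long intervalSum = 0;
    private long lastArrivalMs = -1;                          // -1 means no heartbeat yet

    PhiAccrualSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    void heartbeatReceived(long arrivalMs) {
        if (lastArrivalMs >= 0) {
            long interval = arrivalMs - lastArrivalMs;
            intervals.addLast(interval);
            intervalSum += interval;
            if (intervals.size() > windowSize) {
                intervalSum -= intervals.removeFirst();
            }
        }
        lastArrivalMs = arrivalMs;
    }

    // phi = -log10(P(next heartbeat arrives later than now)), with
    // P = exp(-elapsed / mean) under the exponential assumption.
    double phi(long nowMs) {
        if (intervals.isEmpty()) {
            return 0.0; // not enough samples to judge
        }
        double mean = (double) intervalSum / intervals.size();
        if (mean <= 0.0) {
            return 0.0; // degenerate sample window
        }
        double elapsed = nowMs - lastArrivalMs;
        return elapsed / (mean * Math.log(10.0));
    }

    // The caller picks the threshold, which is the tuning knob.
    boolean isSuspected(long nowMs, double threshold) {
        return phi(nowMs) > threshold;
    }
}
```

A caller on a wide-area network could choose a higher threshold than one on a LAN without
changing the detector itself, since the threshold is applied outside the estimation logic.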

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

