hama-dev mailing list archives

From "ChiaHung Lin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-370) Failure detector for Hama
Date Tue, 22 Mar 2011 04:25:05 GMT

    [ https://issues.apache.org/jira/browse/HAMA-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009548#comment-13009548 ]

ChiaHung Lin commented on HAMA-370:
-----------------------------------

Indeed, the implementation in the patch also contains a heartbeat mechanism: the monitored
process periodically sends a heartbeat.

The difference is that a conventional heartbeat failure detector has a fixed timeout. The phi
accrual failure detector decomposes its functions into separate components (monitoring,
interpretation, etc.) and exposes a suspicion level (rather than a binary trust-or-suspect
value), so that different applications, each equipped with its own interpreter, can use the
output value for further decisions. For instance, a master may allocate urgent tasks to
workers that have a lower suspicion level. Or the monitoring process may interpret the value
according to its own business logic to determine whether the monitored process has crashed.
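
To make the split between monitoring and interpretation concrete, here is a minimal sketch.
It is not the code in HAMA-370.patch: the class name PhiAccrualSketch, its methods, and the
thresholds mentioned afterwards are hypothetical. It also assumes, as a simplification, that
heartbeat inter-arrival times are exponentially distributed, so that
phi = -log10(P(next heartbeat arrives later than now)) reduces to elapsed / (mean * ln 10).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of an accrual detector; not the class from HAMA-370.patch. */
class PhiAccrualSketch {
    // Sliding window of observed heartbeat inter-arrival times, in milliseconds.
    private final Deque<Long> intervals = new ArrayDeque<>();
    private final int windowSize;
    private long lastHeartbeatMillis = -1;

    PhiAccrualSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Monitoring: record the arrival time of a heartbeat from the monitored process. */
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize) {
                intervals.removeFirst(); // keep only the most recent samples
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /**
     * Suspicion level at time nowMillis. Assumes exponentially distributed
     * inter-arrival times (an assumption of this sketch), which gives
     * P(later) = exp(-elapsed / mean) and phi = elapsed / (mean * ln 10).
     */
    double phi(long nowMillis) {
        if (intervals.isEmpty()) {
            return 0.0; // not enough history to suspect anything yet
        }
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(0.0);
        if (mean <= 0.0) {
            return 0.0;
        }
        double elapsed = Math.max(0L, nowMillis - lastHeartbeatMillis);
        return elapsed / (mean * Math.log(10.0));
    }
}
```

Interpretation stays outside the detector. One caller might treat phi(now) > 8.0 as "crashed"
while a scheduler only compares phi values between workers and hands urgent tasks to the one
with the lowest value. A fixed-timeout detector, by contrast, bakes one such decision into
the detector itself.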

Although a task failure can be solved with a restart, the difficulty lies in distinguishing
between a crashed or failed process and a very slow one. In addition, if the project later
needs fault tolerance between BSPMasters, a failure detection service will be required.




> Failure detector for Hama
> -------------------------
>
>                 Key: HAMA-370
>                 URL: https://issues.apache.org/jira/browse/HAMA-370
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp
>    Affects Versions: 0.3.0
>         Environment: GNU/ Debian, JDK 1.6.0_22-b04 
>            Reporter: ChiaHung Lin
>            Assignee: ChiaHung Lin
>              Labels: patch
>             Fix For: 0.3.0
>
>         Attachments: HAMA-370.patch, HAMA-370.patch
>
>
> In order to enable a fault tolerance service, BSPMaster needs the ability to determine
> GroomServers' status. This can generally be achieved through a failure detector. The
> attached file contains the source for such a patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
