hama-dev mailing list archives

From "Suraj Menon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HAMA-636) Confined recovery
Date Thu, 13 Feb 2014 10:16:19 GMT

     [ https://issues.apache.org/jira/browse/HAMA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suraj Menon updated HAMA-636:
-----------------------------

    Description:
"Confined recovery", as described in the Pregel paper, can be used to reduce the cost and
latency of recovery.

In addition to the existing HDFS checkpoints, 1) the tasks log outgoing messages to the local
filesystem for each superstep (see disk queue). 2) When a task fails, it reverts to the last
checkpoint. 3) Other tasks re-send the messages they sent to the failed task at each superstep
occurring after the last checkpoint.

Today we write the checkpointed messages to HDFS. We want these files to be written to the
local filesystem instead. There should also be a way to move these files across, to optimize
the fault recovery process.
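
The three steps above can be sketched in Java as follows. This is only an illustration under
stated assumptions, not Hama code: the class and method names (ConfinedRecoverySketch,
OutgoingLog, replayInbox) are hypothetical, an in-memory map stands in for the per-superstep
log files on the local filesystem, and "after the last checkpoint" is assumed to mean
supersteps strictly greater than the checkpointed one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of confined recovery; none of these names are Hama APIs. */
final class ConfinedRecoverySketch {

    /** One logged outgoing message: destination task id and payload. */
    static final class LoggedMessage {
        final int destTask;
        final String payload;

        LoggedMessage(int destTask, String payload) {
            this.destTask = destTask;
            this.payload = payload;
        }
    }

    /** Per-task log of outgoing messages, keyed by superstep (stands in for local files). */
    static final class OutgoingLog {
        private final Map<Long, List<LoggedMessage>> bySuperstep = new HashMap<>();

        // Step 1: record every outgoing message under the superstep in which it was sent.
        void log(long superstep, int destTask, String payload) {
            bySuperstep.computeIfAbsent(superstep, s -> new ArrayList<>())
                       .add(new LoggedMessage(destTask, payload));
        }

        // Step 3 helper: payloads this task sent to the failed task in one superstep.
        List<String> messagesFor(int failedTask, long superstep) {
            List<String> out = new ArrayList<>();
            for (LoggedMessage m : bySuperstep.getOrDefault(superstep, new ArrayList<>())) {
                if (m.destTask == failedTask) {
                    out.add(m.payload);
                }
            }
            return out;
        }
    }

    /**
     * Step 2 is assumed to have happened already: the failed task has been restored
     * from the checkpoint taken at lastCheckpoint. This method gathers, for each later
     * superstep, the messages the surviving tasks must re-send to it.
     */
    static Map<Long, List<String>> replayInbox(List<OutgoingLog> survivorLogs, int failedTask,
                                               long lastCheckpoint, long currentSuperstep) {
        Map<Long, List<String>> inbox = new TreeMap<>();
        for (long s = lastCheckpoint + 1; s <= currentSuperstep; s++) {
            List<String> msgs = new ArrayList<>();
            for (OutgoingLog log : survivorLogs) {
                msgs.addAll(log.messagesFor(failedTask, s));
            }
            inbox.put(s, msgs);
        }
        return inbox;
    }
}
```

Only the failed task replays the supersteps after the checkpoint; the surviving tasks keep
their state and merely read back their own logs, which is where the saving over a full
rollback would come from. The sketch does not cover messages the failed task had logged on
its own (possibly lost) local disk.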

  was:
"Confined recovery", as described in the Pregel paper, can be used to reduce the cost and
latency of recovery.

In addition to the existing HDFS checkpoints, 1) the tasks log outgoing messages to the local
filesystem for each superstep (see disk queue). 2) When a task fails, it reverts to the last
checkpoint. 3) Other tasks re-send the messages they sent to the failed task at each superstep
occurring after the last checkpoint.


> Confined recovery
> -----------------
>
>                 Key: HAMA-636
>                 URL: https://issues.apache.org/jira/browse/HAMA-636
>             Project: Hama
>          Issue Type: Sub-task
>          Components: bsp core, messaging
>            Reporter: Edward J. Yoon
>              Labels: gsoc2014, java
>
> "Confined recovery", as described in the Pregel paper, can be used to reduce the cost and
> latency of recovery.
> In addition to the existing HDFS checkpoints, 1) the tasks log outgoing messages to the
> local filesystem for each superstep (see disk queue). 2) When a task fails, it reverts to
> the last checkpoint. 3) Other tasks re-send the messages they sent to the failed task at
> each superstep occurring after the last checkpoint.
> Today we write the checkpointed messages to HDFS. We want these files to be written to the
> local filesystem instead. There should also be a way to move these files across, to
> optimize the fault recovery process.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
