hama-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Trivial Update of "BSPMaster" by ChiaHungLin
Date Mon, 24 Feb 2014 07:31:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.

The "BSPMaster" page has been changed by ChiaHungLin:
https://wiki.apache.org/hama/BSPMaster?action=diff&rev1=17&rev2=18

  ----
   * [[Registrator|Registrator]]
   * Receptionist 
-  * JobOperator
   * Scheduler
   * ResourceConsultant
   * GroomManager
+  * Monitor
-  * Supervisor(?)
+   * Supervisor(?)
   
- 
  
  == State ==
  Two states are applied to BSPMaster node, including:
@@ -40, +39 @@

   * STOPPED
  {{attachment:bspmaster_state3.png|BSPMaster State}}
  
+ == Scenario ==
+ 
+  * Restart
+   * When a task fails on a groom server, restart the job by re-running '''all''' tasks
from the latest checkpoint that is universally available. Re-running only the failed task is
not enough, because the universally available checkpoint may be more than one superstep behind
the current superstep, which can lead to a deadlock between the live tasks and the restarted
one during the sync phase. For example, suppose the latest universally available checkpoint is
at the 6th superstep, and the computation is currently running from the 7th to the 8th
superstep. If one of the tasks fails, the system migrates the failed task to another machine
and resumes it from the 6th superstep checkpoint, while the other tasks keep running until
they hit the barrier sync at the 8th superstep. A deadlock then arises when the resumed task,
which previously failed, hits the barrier sync at the 7th superstep, because no other task is
at the 7th superstep. There is one proposed solution to fix the task failure issue.
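
The deadlock in the example above can be illustrated with a minimal Java sketch. It is not taken from the Hama code base: the class `RestartSketch` and its methods are hypothetical names, and a barrier is modelled under the simplifying assumption that it releases only when every task is waiting at the same superstep.

```java
import java.util.Arrays;

/** Hypothetical illustration only; none of these names exist in Hama. */
final class RestartSketch {

    /**
     * Assumed barrier model: the barrier can release only when every task
     * is waiting at the same superstep.
     */
    static boolean barrierCanRelease(int[] waitingAt) {
        for (int superstep : waitingAt) {
            if (superstep != waitingAt[0]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Resume only the failed task from the checkpoint. It next reaches the
     * barrier at superstep checkpoint + 1, while the other tasks stay put.
     */
    static int[] restartFailedOnly(int[] waitingAt, int failedTask, int checkpoint) {
        int[] next = waitingAt.clone();
        next[failedTask] = checkpoint + 1;
        return next;
    }

    /** Re-run all tasks from the checkpoint, so they reach the same barrier. */
    static int[] restartAll(int[] waitingAt, int checkpoint) {
        int[] next = new int[waitingAt.length];
        Arrays.fill(next, checkpoint + 1);
        return next;
    }

    public static void main(String[] args) {
        // Three tasks heading for the barrier at superstep 8; checkpoint at superstep 6.
        int[] waitingAt = {8, 8, 8};
        int checkpoint = 6;

        int[] partial = restartFailedOnly(waitingAt, 1, checkpoint);
        System.out.println(Arrays.toString(partial) + " releases: " + barrierCanRelease(partial));

        int[] full = restartAll(waitingAt, checkpoint);
        System.out.println(Arrays.toString(full) + " releases: " + barrierCanRelease(full));
    }
}
```

With only the failed task resumed, the tasks wait at supersteps 8, 7 and 8, so the modelled barrier never releases. Re-running all tasks puts every task at superstep 7, which is the behaviour the Restart scenario asks for.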

+ 
  == Source ==
  [[http://svn.apache.org/repos/asf/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPMaster.java|BSPMaster.java]]
  
