incubator-hama-dev mailing list archives

From "Suraj Menon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-498) BSPTask should periodically ping its parent.
Date Thu, 23 Feb 2012 20:56:48 GMT

    [ https://issues.apache.org/jira/browse/HAMA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215039#comment-13215039 ]

Suraj Menon commented on HAMA-498:
----------------------------------

Hi, while testing my changes I found that sometimes, when a BSPTask fails, the BSPMaster
steps in and kills the failed task.
When it does so, it only sets the state of the task in the "runningTasks" variable (which
holds the TaskInProgress objects) to KILLED. It does not purge the task from runningTasks.
Isn't it supposed to purge the task as well? Currently, I purge tasks when I find BSPTasks
that are out of contact.
Please refer to the following, added to GroomServer.java:
{noformat}
  // Monitor that purges tasks which have stopped pinging this GroomServer.
  private class BSPTasksMonitor extends Thread {

    // Reused across runs; sized for the maximum number of tasks per groom.
    private final List<TaskInProgress> outOfContactTasks =
        new ArrayList<GroomServer.TaskInProgress>(
            conf.getInt(Constants.MAX_TASKS_PER_GROOM, 3));

    private BSPTasksMonitor() {
    }


    public void run(){

      getObliviousTasks(outOfContactTasks);

      if(outOfContactTasks.size() > 0){
        LOG.debug("Got " + outOfContactTasks.size() + " oblivious tasks");
      }

      Iterator<TaskInProgress> taskIter = outOfContactTasks.iterator();

      while(taskIter.hasNext()){
        TaskInProgress tip = taskIter.next();
        try{
          LOG.debug("Purging task " + tip);
          purgeTask(tip, true);
        }
        catch(Exception e){
          LOG.error(new StringBuilder(
              "Error while removing a timed-out task - ")
              .append(tip.toString()) , e);

        }
      }
      outOfContactTasks.clear();

    }
  }


  private synchronized void getObliviousTasks(
      List<TaskInProgress> outOfContactTasks){

    if(runningTasks == null){
      LOG.debug("No running tasks to check");
      return;
    }

    long currentTime = System.currentTimeMillis();
    long monitorPeriod = conf.getLong(Constants.GROOM_PING_PERIOD,
        Constants.DEFAULT_GROOM_PING_PERIOD);

    for (Map.Entry<TaskAttemptID, TaskInProgress> entry : runningTasks
        .entrySet()) {
      TaskInProgress tip = entry.getValue();
      
      // A task is out of contact if it has not pinged for more than
      // monitorPeriod. A task that has never pinged is given a leeway of
      // 10 times monitorPeriod to get started.
      boolean running =
          tip.taskStatus.getRunState().equals(TaskStatus.State.RUNNING);
      boolean startOverdue = tip.lastPingedTimestamp == 0
          && (currentTime - tip.startTime) > 10 * monitorPeriod;
      boolean pingOverdue = tip.lastPingedTimestamp > 0
          && (currentTime - tip.lastPingedTimestamp) > monitorPeriod;
      if (running && (startOverdue || pingOverdue)) {

        LOG.info("Adding task to purge list: " + tip.getTask().getTaskID());

        outOfContactTasks.add(tip);
        
      }

    }

  }

      
{noformat}
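
For anyone who wants to try the timeout rule outside of GroomServer, here is a minimal standalone sketch. It is not Hama code: PingMonitorSketch, TaskInfo and purgeOutOfContact are hypothetical stand-ins, and the String keys replace TaskAttemptID. It only assumes what the snippet above shows: a lastPingedTimestamp of 0 means "never pinged", and a task that has never pinged gets 10 times monitorPeriod to start.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the monitor logic; none of these names are Hama APIs.
class PingMonitorSketch {

  /** Minimal stand-in for GroomServer.TaskInProgress. */
  static final class TaskInfo {
    final long startTime;
    final long lastPingedTimestamp; // 0 means "never pinged"
    final boolean running;

    TaskInfo(long startTime, long lastPingedTimestamp, boolean running) {
      this.startTime = startTime;
      this.lastPingedTimestamp = lastPingedTimestamp;
      this.running = running;
    }
  }

  // A task that has never pinged gets this many monitor periods to start.
  static final int STARTUP_LEEWAY_FACTOR = 10;

  /** True if a running task has missed its ping deadline. */
  static boolean isOutOfContact(TaskInfo t, long now, long monitorPeriod) {
    if (!t.running) {
      return false;
    }
    if (t.lastPingedTimestamp == 0) {
      return now - t.startTime > STARTUP_LEEWAY_FACTOR * monitorPeriod;
    }
    return now - t.lastPingedTimestamp > monitorPeriod;
  }

  /** Removes timed-out tasks from the map and returns their ids. */
  static List<String> purgeOutOfContact(
      Map<String, TaskInfo> runningTasks, long now, long monitorPeriod) {
    List<String> purged = new ArrayList<>();
    Iterator<Map.Entry<String, TaskInfo>> it =
        runningTasks.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, TaskInfo> entry = it.next();
      if (isOutOfContact(entry.getValue(), now, monitorPeriod)) {
        purged.add(entry.getKey());
        it.remove(); // safe removal while iterating
      }
    }
    return purged;
  }
}
```

Unlike the patch above, this sketch removes entries through the iterator in a single pass instead of collecting them first and calling purgeTask afterwards. Which of the two fits GroomServer better depends on what purgeTask does besides removing the entry, so treat this as an illustration of the rule only.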



                
> BSPTask should periodically ping its parent.
> --------------------------------------------
>
>                 Key: HAMA-498
>                 URL: https://issues.apache.org/jira/browse/HAMA-498
>             Project: Hama
>          Issue Type: Sub-task
>          Components: bsp
>    Affects Versions: 0.4.0
>            Reporter: Edward J. Yoon
>            Assignee: Suraj Menon
>              Labels: newbie
>             Fix For: 0.5.0
>
>
> As described in http://wiki.apache.org/hama/GroomServerFaultTolerance,
> a BSPTask should periodically ping its parent GroomServer as a health check.
> 1. If a task is unable to ping its parent GroomServer, it should kill itself.
> 2. If the GroomServer does not receive a ping from a child, the GroomServer should check whether that child is running.
> You don't need to implement recovery logic in this issue.
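
The task side of point 1 could look roughly like the sketch below. Everything in it is assumed for illustration: ParentLink, PingTask and the onParentLost callback are hypothetical names, not Hama interfaces, and the issue does not say how the ping is transported or how a task should terminate.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a task pinging its parent; not Hama code.
class ParentPingSketch {

  /** Assumed link to the parent GroomServer. */
  interface ParentLink {
    void ping(String taskId) throws IOException;
  }

  /** One ping attempt per run(); meant to be scheduled at a fixed rate. */
  static final class PingTask implements Runnable {
    private final String taskId;
    private final ParentLink parent;
    private final Runnable onParentLost;
    private final AtomicBoolean parentLost = new AtomicBoolean(false);

    PingTask(String taskId, ParentLink parent, Runnable onParentLost) {
      this.taskId = taskId;
      this.parent = parent;
      this.onParentLost = onParentLost;
    }

    @Override
    public void run() {
      if (parentLost.get()) {
        return; // already shutting down, stop pinging
      }
      try {
        parent.ping(taskId);
      } catch (IOException e) {
        // Fire the shutdown callback once, on the first failed ping.
        if (parentLost.compareAndSet(false, true)) {
          onParentLost.run();
        }
      }
    }

    boolean isParentLost() {
      return parentLost.get();
    }
  }
}
```

A PingTask like this could be handed to ScheduledExecutorService.scheduleAtFixedRate with the ping period, with onParentLost doing whatever cleanup and exit the task needs. Giving up on the first failed ping is a simplification; a real task would probably tolerate a few misses first.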


        
