hadoop-common-dev mailing list archives

From "Peeyush Bishnoi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4938) [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
Date Mon, 12 Jan 2009 17:50:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663026#action_12663026 ]

Peeyush Bishnoi commented on HADOOP-4938:
-----------------------------------------

The approach for building such a script is to identify the HOD-allocated clusters for which
both of the following hold:

1. The ringmaster is down: use the output of "qstat -f <jobid>" to get the first node from the
"exec_host" attribute reported by the Torque resource manager, and poll that node to see
whether it is UP or DOWN.

2. The "Resource Manager notes" field is not available: use the output of "qstat -f <jobid>"
to check whether the "notes" attribute is present.

Clusters that satisfy both of the above conditions are considered problematic. These
problematic clusters need to be found, and either the resource manager job should be deleted
or, if the job has not been deleted, mail should be sent to the administrator requesting its
deletion.
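
Both checks read the same "qstat -f <jobid>" output, so one parsing helper could serve them.
The following is only a rough sketch of that idea, not the actual script. It assumes the
output consists of indented "name = value" lines, that long values wrap onto lines starting
with a tab, and that "exec_host" looks like "node1/0+node2/0". The function names are
hypothetical.

```python
import re


def parse_qstat_full(text):
    """Turn 'qstat -f <jobid>' text into a dict of attribute name -> value.

    Assumption: attributes appear as indented 'name = value' lines, and a
    wrapped value continues on following lines that start with a tab.
    """
    attrs = {}
    last_key = None
    for line in text.splitlines():
        # Continuation of the previous attribute's value.
        if line.startswith("\t") and last_key is not None:
            attrs[last_key] += line.strip()
            continue
        match = re.match(r"^\s+([\w.]+) = (.*)$", line)
        if match:
            last_key = match.group(1)
            attrs[last_key] = match.group(2).strip()
    return attrs


def first_exec_host(attrs):
    """Return the first node named in 'exec_host', or None if it is absent."""
    exec_host = attrs.get("exec_host")
    if not exec_host:
        return None
    # Assumed layout: 'node1/0+node2/0+...'; drop the '/<n>' suffix.
    return exec_host.split("+")[0].split("/")[0]


def has_notes(attrs):
    """Condition 2 is met when this returns False."""
    return "notes" in attrs
```

The node returned by first_exec_host() is the one to poll for UP or DOWN; how that polling is
done is left open here, as it is in the description above.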

Steps 1 and 2 should be carried out for all running jobs, i.e. the jobs listed by
"qstat -r".
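
The overall loop might then look like the sketch below. It is an illustration under stated
assumptions rather than the final script: the list of running job ids is assumed to have been
extracted from "qstat -r" already, and get_attrs and is_host_up are hypothetical callables
supplied by the caller (for example, one that runs "qstat -f <jobid>" and parses the result,
and one that polls a node).

```python
def find_problematic_jobs(job_ids, get_attrs, is_host_up):
    """Return the job ids whose clusters satisfy both conditions.

    job_ids    -- ids of running jobs (assumed taken from 'qstat -r')
    get_attrs  -- hypothetical callable: job id -> dict of 'qstat -f' attributes
    is_host_up -- hypothetical callable: node name -> True if the node is UP
    """
    problematic = []
    for job_id in job_ids:
        attrs = get_attrs(job_id)
        # Condition 1: the first node in 'exec_host' (the ringmaster) is down.
        exec_host = attrs.get("exec_host", "")
        ringmaster = exec_host.split("+")[0].split("/")[0]
        ringmaster_down = bool(ringmaster) and not is_host_up(ringmaster)
        # Condition 2: the 'notes' attribute is missing.
        notes_missing = "notes" not in attrs
        if ringmaster_down and notes_missing:
            problematic.append(job_id)
    return problematic
```

Passing the two callables in keeps the decision logic separate from the resource manager
calls, so it can be exercised with canned data. Deleting the job or mailing the administrator
would act on the returned list.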

---

> [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-4938
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4938
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Hemanth Yamijala
>            Assignee: Peeyush Bishnoi
>
> As mentioned in HADOOP-4937, sometimes in large cluster deployments, faulty nodes on
> which the ringmaster process comes up may go down after the cluster is successfully allocated.
> Such clusters fail to deallocate automatically even if the idleness limit of the cluster is
> exceeded. This is because the idleness is tracked by the ringmaster process which itself has
> gone down.
> As large number of nodes can get held up due to this, such clusters should be detected
> and deallocated in some manner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

