hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4938) [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
Date Mon, 02 Feb 2009 10:51:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669554#action_12669554 ]

Hemanth Yamijala commented on HADOOP-4938:
------------------------------------------

Peeyush, as we discussed, please make the following changes:

- Pass options as command-line parameters. I think this will be easier to manage for now.
Look at how logcondense.py works.
- The state file and log file locations should be configurable. The defaults can be /tmp and
/var/log, respectively.
- The code checks whether the sum of runningJobs and submittedJobs is less than the number
stored in the state file. Since submittedJobs already includes runningJobs, you don't need
to add them together; compare submittedJobs alone.
- The SMTP recipient address should be configurable. Also, does the library you are using
support multiple addresses and a remote SMTP host?
- Submit this as a patch. I think the file should be under $HOD_HOME/support.
- Include the ASF header in the file.
- Can you also submit documentation for this in Forrest?
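
To make the first four points concrete, here is a minimal sketch in Python. It is not taken from externalIdleTracker.py: every flag name, default sender address, and function name below is hypothetical, and it uses the standard argparse and smtplib modules, which may differ from what the script or logcondense.py actually uses. The meaning of the comparison against the state file is assumed from the comment above.

```python
import argparse
import smtplib
from email.message import EmailMessage


def build_parser():
    """Build the option parser. Every flag name here is hypothetical."""
    parser = argparse.ArgumentParser(
        description="Sketch of an external idle tracker for HOD clusters")
    parser.add_argument("--state-dir", default="/tmp",
                        help="directory holding the state file")
    parser.add_argument("--log-dir", default="/var/log",
                        help="directory holding the log file")
    parser.add_argument("--smtp-host", default="localhost",
                        help="SMTP server, which may be a remote host")
    parser.add_argument("--mail-from", default="hod@localhost",
                        help="sender address (placeholder default)")
    parser.add_argument("--mail-to", default="",
                        help="comma-separated recipient addresses")
    return parser


def parse_recipients(value):
    """Split a comma-separated string into a list of addresses."""
    return [addr.strip() for addr in value.split(",") if addr.strip()]


def below_stored_count(submitted_jobs, stored_count):
    """Compare against the number kept in the state file.

    submittedJobs already includes runningJobs, so it is compared
    on its own instead of being added to runningJobs.
    """
    return submitted_jobs < stored_count


def send_notice(smtp_host, sender, recipients, subject, body):
    """Mail a notice to one or more recipients through smtp_host."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    # smtplib connects to whatever host is given, local or remote.
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

With these assumed names, an invocation might look like `externalIdleTracker.py --state-dir /tmp --log-dir /var/log --smtp-host mail.example.com --mail-to a@example.com,b@example.com`. Python's standard smtplib does accept several recipients and a remote host; whether the library in the attached script does is the open question above.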

> [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-4938
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4938
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Hemanth Yamijala
>            Assignee: Peeyush Bishnoi
>         Attachments: externalIdleTracker.py
>
>
> As mentioned in HADOOP-4937, sometimes in large cluster deployments, faulty nodes on
> which the ringmaster process comes up may go down after the cluster is successfully
> allocated. Such clusters fail to deallocate automatically even if the idleness limit of
> the cluster is exceeded. This is because idleness is tracked by the ringmaster process,
> which has itself gone down.
> As a large number of nodes can be held up by this, such clusters should be detected
> and deallocated in some manner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

