hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3217) [HOD] Be less agressive when querying job status from resource manager.
Date Wed, 15 Oct 2008 08:21:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639758#action_12639758 ]

Hemanth Yamijala commented on HADOOP-3217:
------------------------------------------

Attached a patch for Hadoop 0.17. The following are the changes:

- For relevant qsub failures, that is, failures other than a qsub options error or insufficient
resources, we retry a configurable number of times (default 3), with a configurable wait interval
between the retries (default 10 seconds).
- For all qstat errors, we retry a configurable number of times (default 3), with a configurable
wait interval between the retries (default 10 seconds).
- For qstat queries that succeed, where we poll until the job state becomes running or completed,
the polling interval is now configurable (default 30 seconds).
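
The retry and polling behaviour described above can be sketched in Python, the language HOD is
written in. This is only an illustration under stated assumptions, not the code in the attached
patch: the names retry_call, poll_until_done and TransientError are hypothetical, and the sketch
assumes "retry 3 times" means up to three further attempts after the first one. Only the defaults
(3 retries, 10 second wait, 30 second poll interval) come from the list above.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_call(func, retries=3, wait_seconds=10, sleep=time.sleep):
    """Call func(); on TransientError, retry up to `retries` more times.

    Any other exception (standing in for non-retryable cases such as a
    qsub options error or insufficient resources) propagates immediately.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            sleep(wait_seconds)


def poll_until_done(query_state, poll_seconds=30, sleep=time.sleep):
    """Poll query_state() until the job is reported running or completed.

    Each individual query is itself wrapped in retry_call, so a failing
    status query is retried before the poll gives up.
    """
    while True:
        state = retry_call(query_state, sleep=sleep)
        if state in ("running", "completed"):
            return state
        sleep(poll_seconds)
```

Passing the sleep function as a parameter is a choice made for this sketch so that the waits can
be replaced in tests; a real caller would leave the default time.sleep in place.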

Patches for the other branches are in progress.

> [HOD] Be less agressive when querying job status from resource manager.
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-3217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3217
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.16.2
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>            Priority: Blocker
>             Fix For: 0.17.3, 0.18.2, 0.19.0, 0.20.0
>
>         Attachments: HADOOP-3217.patch.0.17
>
>
> After a job is submitted, HOD queries torque periodically until it finds the job to be
> running / completed (due to error). The initial rate of query is once every 0.5 seconds for
> 20 times, and then once every 10 seconds. This is probably a tad too aggressive as we find
> that Torque sometimes returns some odd errors under heavy load in the cluster (HADOOP-3216).
> It may be better to query at a more relaxed rate.
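
To put the query rates in the description side by side, the following rough Python sketch counts
how many status queries each schedule issues while waiting for a job. The function names are
hypothetical, and the count assumes that each query is preceded by its wait and that no query
fails and gets retried. The 0.5 second, 20 query and 10 second figures are from the description;
the 30 second interval is the new default mentioned in the comment.

```python
def old_query_count(elapsed_seconds):
    """Queries issued under the old schedule: 20 at 0.5 s apart, then one per 10 s."""
    fast_phase = 20 * 0.5  # the first 20 queries take 10 seconds in total
    if elapsed_seconds <= fast_phase:
        return int(elapsed_seconds / 0.5)
    return 20 + int((elapsed_seconds - fast_phase) / 10)


def new_query_count(elapsed_seconds, poll_seconds=30):
    """Queries issued when polling at a fixed interval (default 30 s)."""
    return int(elapsed_seconds / poll_seconds)
```

Under these assumptions, a job that takes 60 seconds to start costs 25 queries on the old schedule
and 2 queries with a 30 second interval.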

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

