hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3523) [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on previously allocated directory fails.
Date Tue, 10 Jun 2008 07:53:45 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3523:
-------------------------------------

    Attachment: 3523.patch

The attached patch fixes the issue described above. We now check whether the exit code from qstat
indicates that the job id is invalid (exit code 153) and treat that case as equivalent to a completed job.
As a result, a previously allocated cluster whose cluster id is no longer present in Torque
will continue to be auto-deallocated and allocated again.

However, if any other Torque error occurs, we treat the job state as unknown and leave the
deallocation to the user.
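
The decision described above can be sketched roughly as follows. This is only an illustration and not the code in 3523.patch: the function names, the returned state strings, and the job_completed flag are hypothetical, and it assumes that exit code 0 means qstat found the job. Only the value 153 comes from this issue.

```python
# Exit code qstat returned in this issue when the job id was not known to Torque.
QSTAT_INVALID_JOB_ID = 153


def classify_qstat_result(exit_code, job_completed=False):
    """Map a qstat exit code to 'completed', 'running', or 'unknown'.

    Hypothetical helper. job_completed stands in for whatever the caller
    parsed from qstat's output when the command succeeded (assumption).
    """
    if exit_code == QSTAT_INVALID_JOB_ID:
        # Torque no longer lists the job: treat it as equivalent to completed.
        return "completed"
    if exit_code == 0:
        # Assumption: 0 means qstat succeeded and reported on the job.
        return "completed" if job_completed else "running"
    # Any other Torque error: state is unknown, so do not guess.
    return "unknown"


def should_auto_deallocate(exit_code, job_completed=False):
    """Auto-deallocate only when the job is known to be finished or gone."""
    return classify_qstat_result(exit_code, job_completed) == "completed"


if __name__ == "__main__":
    for code in (0, 153, 1):
        print(code, classify_qstat_result(code), should_auto_deallocate(code))
```

Returning an explicit "unknown" state, rather than None, is one way to keep callers from indexing into a missing result when qstat fails unexpectedly.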

> [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on previously allocated directory fails.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3523
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.18.0
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: 3523.patch
>
>
> HADOOP-3483 addressed the case where a dead cluster could be reallocated without having
> to warn users to clean up the directory themselves, provided the job had completed.
> It missed one case: the job no longer exists in the Torque queue. When an allocation is
> attempted in that case, HOD fails with a bad error message:
> ERROR - qstat error: exit code: 153 | signal: False | core False
> CRITICAL - op: allocate hod-clusters/test 3 failed: <type 'exceptions.TypeError'> 'NoneType' object is unsubscriptable
> This should be addressed to avoid user concerns.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

