hadoop-common-dev mailing list archives

From "Karam Singh (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3523) [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on previously allocated directory fails.
Date Tue, 10 Jun 2008 12:34:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603851#action_12603851 ]

Karam Singh commented on HADOOP-3523:
-------------------------------------

To check the issue, I did the following:
1.
   a. Allocated a hod cluster with --ringmaster.idleness-limit=240.
   b. Waited for 4 minutes.
   c. Verified from hod list and qstat that the cluster was dead.
   d. Restarted Torque. Ran qstat to verify that it does not return anything.
   e. Ran hod allocate with unpatched hod on the same cluster directory; hod throws an error.
   f. Ran hod allocate again with patched hod. Allocation was successful.

2.
   a. Allocated a hod cluster with --ringmaster.idleness-limit=240.
   b. Waited for 4 minutes.
   c. Verified from hod list and qstat that the cluster was dead.
   d. Stopped Torque.
   e. Ran hod allocate with unpatched hod on the same cluster directory; hod throws an error.
   f. Ran hod allocate again with patched hod. Allocation fails with the following error:
    [
        WARNING/30 torque:96 - qstat error: exit code: 255 | signal: False | core False.
       CRITICAL/50 hod:310 - Found a previously allocated cluster at cluster directory '~/c_dirn'. Deallocate the cluster first.
    ]
3. Also checked hod behavior when hod list shows the cluster as dead/mapred dead/hdfs dead but the cluster is actually alive (related Torque job status is R).
4. Normal reallocation of a dead cluster.
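
The difference between scenarios 1 and 2 can be read as a decision on the qstat exit code. The following is only a sketch of that reasoning, not HOD's actual code: the function name `can_reallocate` and the constants are hypothetical, the exit codes are the ones that appear in the log lines in this thread, and their meanings (153 for a job Torque no longer knows, 255 for Torque being unreachable) are assumptions inferred from the scenarios above. The 'C' (completed) job state is likewise assumed; only 'R' is mentioned in this thread.

```python
# Exit codes copied from the log lines in this thread; the meanings
# attached to them here are assumptions, not taken from Torque docs.
QSTAT_OK = 0
QSTAT_UNKNOWN_JOB = 153   # issue description: job no longer in Torque's list
QSTAT_UNREACHABLE = 255   # scenario 2: qstat run while Torque was stopped


def can_reallocate(qstat_exit_code, job_state=None):
    """Return True if a previously used cluster directory may be reused.

    Hypothetical helper: illustrates the decision, not HOD's implementation.
    """
    if qstat_exit_code == QSTAT_OK:
        # Torque knows the job: reuse the directory only if the job
        # has completed (assumed state 'C'); 'R' means still running.
        return job_state == "C"
    if qstat_exit_code == QSTAT_UNKNOWN_JOB:
        # Torque has no record of the job, so the old cluster is gone.
        return True
    # Any other failure (for example 255) leaves the job state unknown,
    # so the safe answer is to refuse and ask for a deallocate first.
    return False
```

Under these assumptions, scenario 1 corresponds to `can_reallocate(153)` and scenario 2 to `can_reallocate(255)`, which matches the observed outcomes: success in the first case and the "Deallocate the cluster first" message in the second.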

> [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on previously allocated directory fails.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3523
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.18.0
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: 3523.patch
>
>
> HADOOP-3483 addressed the issue where a dead cluster could be reallocated without having to issue warnings to users to clean up the directory themselves, provided the job is completed. It missed one case, where the job no longer exists in the Torque queue. When tried in that case, HOD fails with a bad error message:
> ERROR - qstat error: exit code: 153 | signal: False | core False
> CRITICAL - op: allocate hod-clusters/test 3 failed: <type 'exceptions.TypeError'> 'NoneType' object is unsubscriptable
> This should be addressed to avoid user concerns.
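
The TypeError quoted above is what Python raises when code subscripts a value that turned out to be None, which is consistent with a qstat lookup that returned nothing for the missing job. A minimal sketch of that failure mode and a guard against it follows. The names `job_state`, `describe`, and the `job_state` dictionary key are hypothetical and chosen for illustration; this is not the code from 3523.patch.

```python
def job_state(qstat_info):
    """Return the job state from a parsed qstat result.

    `qstat_info` is assumed to be a dict of job attributes, or None
    when qstat could not find the job. Returns None in the latter case.
    """
    if qstat_info is None:
        # Without this check, qstat_info["job_state"] raises TypeError,
        # which is the kind of failure shown in the issue description.
        return None
    return qstat_info.get("job_state")


def describe(qstat_info):
    """Turn the lookup result into a message a user can act on."""
    state = job_state(qstat_info)
    if state is None:
        return "job not found in Torque; treating cluster as dead"
    return "job state: %s" % state
```

With the guard in place, a missing job produces a readable message instead of an unhandled exception.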

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

