Message-ID: <1854249354.1213101286762.JavaMail.jira@brutus>
Date: Tue, 10 Jun 2008 05:34:46 -0700 (PDT)
From: "Karam Singh (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3523) [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on previously allocated directory fails.
In-Reply-To: <1054021920.1213072664974.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603851#action_12603851 ]

Karam Singh commented on HADOOP-3523:
-------------------------------------

To check the issue, I did the following:

1. a. Allocated a hod cluster with --ringmaster.idleness-limit=240.
   b. Waited for 4 minutes.
   c. Verified from hod list and qstat that the cluster was dead.
   d. Restarted torque. Ran qstat to verify that it does not return anything.
   e. Ran hod allocate using hod without the patch on the same cluster dir; hod throws an error.
   f. Ran hod allocate again using the patched hod. Allocation was successful.

2. a. Allocated a hod cluster with --ringmaster.idleness-limit=240.
   b. Waited for 4 minutes.
   c. Verified from hod list and qstat that the cluster was dead.
   d. Stopped torque.
   e. Ran hod allocate using hod without the patch on the same cluster dir; hod throws an error.
   f. Ran hod allocate again using the patched hod. The allocation fails with the following error:

      [
      WARNING/30 torque:96 - qstat error: exit code: 255 | signal: False | core False.
      CRITICAL/50 hod:310 - Found a previously allocated cluster at cluster directory '~/c_dirn'. Deallocate the cluster first.
      ]

3. Also checked hod behavior when hod list shows the cluster as dead/mapred dead/hdfs dead but the cluster is actually alive (related torque job status is R).

4. Normal reallocation of a dead cluster.

> [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on previously allocated directory fails.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3523
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.18.0
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: 3523.patch
>
>
> HADOOP-3483 addressed the issue where a dead cluster could be reallocated without having to issue warnings to users to clean up the directory themselves, provided the job is completed. It missed one case, where the job no longer exists in the Torque queue. When tried in that case, HOD fails with a bad error message:
> ERROR - qstat error: exit code: 153 | signal: False | core False
> CRITICAL - op: allocate hod-clusters/test 3 failed: 'NoneType' object is unsubscriptable
> This should be addressed to avoid user concerns.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
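
The behavior observed in the two test runs above (reallocation succeeds when Torque no longer knows the job, and is refused when qstat itself cannot be reached) could be sketched roughly as follows. This is only an illustration, not the code in 3523.patch: the function name `can_reallocate`, the `job_info` dictionary, and the reading of exit code 153 as "job unknown to Torque" are all assumptions made for the example.

```python
# Hypothetical sketch; names are illustrative and not taken from the HOD source.

# Assumption: the exit code quoted in the issue (153) means that Torque
# no longer has the job in its list.
UNKNOWN_JOB_EXIT_CODE = 153


def can_reallocate(exit_code, job_info):
    """Decide whether a previously allocated cluster directory may be reused.

    exit_code: exit status of the qstat call made for the old job.
    job_info:  dict of job attributes parsed from qstat output,
               or None when qstat produced nothing usable.
    """
    if exit_code == 0 and job_info is not None:
        # The job is still listed: reuse the directory only if it has completed.
        # ("C" as the completed state is an assumption of this sketch.)
        return job_info.get("job_state") == "C"
    if exit_code == UNKNOWN_JOB_EXIT_CODE:
        # The job is gone from Torque's list: treat the old cluster as dead
        # instead of subscripting a None result.
        return True
    # Any other failure (for example, Torque is stopped): the state of the old
    # job is unknown, so refuse and ask the user to deallocate first.
    return False


if __name__ == "__main__":
    print(can_reallocate(153, None))
    print(can_reallocate(255, None))
```

Checking for `None` before indexing is what avoids the "'NoneType' object is unsubscriptable" failure quoted in the issue description.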