Message-ID: <28284531.493541279704049748.JavaMail.jira@thor>
Date: Wed, 21 Jul 2010 05:20:49 -0400 (EDT)
From: "Ari Rabkin (JIRA)"
To: chukwa-dev@hadoop.apache.org
Reply-To: chukwa-dev@hadoop.apache.org
Mailing-List: contact chukwa-dev-help@hadoop.apache.org; run by ezmlm
Subject: [jira] Commented: (CHUKWA-487) Collector left in a bad state after temprorary NN outage
In-Reply-To: <28214922.4031273524390920.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890624#action_12890624 ]

Ari Rabkin commented on CHUKWA-487:
-----------------------------------

I just committed this. :)

> Collector left in a bad state after temprorary NN outage
> --------------------------------------------------------
>
>                 Key: CHUKWA-487
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-487
>             Project: Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>            Priority: Blocker
>         Attachments: CHUKWA-487.patch, CHUKWA-487.threaddump.txt
>
>
> When the name node returns errors to the collector, at some point the collector dies halfway. This behavior should be changed either to resemble the agents and keep trying, or to shut down completely. Instead, what I'm seeing is that the collector logs that it's shutting down, and the var/pidDir/Collector.pid file gets removed, but the collector continues to run, albeit without handling new data. Instead, this log entry is repeated ad infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
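
For illustration only, the two behaviors the report asks for (keep retrying like the agents, or shut down completely) could be sketched in Java roughly as follows. This is not the attached CHUKWA-487 patch and does not use any Chukwa API: the class RetryOrExit, the retry helper, and the attempt and delay parameters are all hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryOrExit {
    /** Result of the retry loop (hypothetical type, for this sketch only). */
    enum Outcome { SUCCEEDED, GAVE_UP }

    /**
     * Runs the given operation until it succeeds or maxAttempts is reached,
     * doubling the pause between attempts. The operation stands in for a
     * write that fails while the name node is unavailable.
     */
    static Outcome retry(Callable<Void> operation, int maxAttempts, long initialDelayMs)
            throws InterruptedException {
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                operation.call();
                return Outcome.SUCCEEDED;
            } catch (Exception e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs *= 2; // simple exponential backoff
                }
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] calls = {0};
        // Simulated write: fails twice, then succeeds.
        Outcome outcome = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("name node unavailable");
            }
            return null;
        }, 5, 10L);
        System.out.println(outcome + " after " + calls[0] + " calls");

        if (outcome == Outcome.GAVE_UP) {
            // Option two from the report: stop completely. Exiting the JVM
            // also ends any background threads (such as a stats timer)
            // that would otherwise keep the process alive.
            System.exit(1);
        }
    }
}
```

The point of the sketch is that the process ends up in one of two well-defined states: still serving after a successful retry, or fully exited. The half-alive state described in the report, with the pid file gone and only the stats timer still logging, is what either option avoids.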