Message-ID: <1692217759.1245804247345.JavaMail.jira@brutus>
Date: Tue, 23 Jun 2009 17:44:07 -0700 (PDT)
From: "Ari Rabkin (JIRA)"
To: chukwa-dev@hadoop.apache.org
Reply-To: chukwa-dev@hadoop.apache.org
Subject: [jira] Commented: (CHUKWA-323) Chukwa agent unable to stream all data source on the jobtracker node
In-Reply-To: <666027488.1245620647336.JavaMail.jira@brutus>
Mailing-List: contact chukwa-dev-help@hadoop.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[ 
https://issues.apache.org/jira/browse/CHUKWA-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723385#action_12723385 ]

Ari Rabkin commented on CHUKWA-323:
-----------------------------------

Tests committed to trunk; underlying issue not yet closed.

> Chukwa agent unable to stream all data source on the jobtracker node
> --------------------------------------------------------------------
>
>                 Key: CHUKWA-323
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-323
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.2.0
>         Environment: Redhat EL 5.1, Java 6
>            Reporter: Eric Yang
>            Assignee: Jerome Boulon
>            Priority: Blocker
>             Fix For: 0.2.0
>
>         Attachments: testForLeaks.patch
>
>
> HDFS namenode and mapreduce related metrics seem to have stopped sending data since 06/21/2009 00:00:00.
> The agent log contains exceptions like these:
> 2009-06-21 21:28:01,165 WARN Thread-10 FileTailingAdaptor - failure reading /usr/local/hadoop/var/log/history/host.example.com_1245463671645_job_200906200207_0351_user_Chukwa-Demux_20090620_09_56
> java.io.FileNotFoundException: /usr/local/hadoop/var/log/history/host.example.com_1245463671645_job_200906200207_0351_user_Chukwa-Demux_20090620_09_56 (Too many open files)
>         at java.io.RandomAccessFile.open(Native Method)
>         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
>         at org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.FileTailingAdaptor.tailFile(FileTailingAdaptor.java:239)
>         at org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.FileTailer.run(FileTailer.java:90)
> 2009-06-21 21:28:01,165 WARN Thread-10 FileTailingAdaptor - Adaptor|58fb855b5c26d36cc1e69e264ce3402c| file: /usr/local/hadoop/var/log/history/host.example.com_1245463671645_job_200906200207_0352_user_PigLatin%3AHadoop_jvm_metrics.pig, has rotated and no detection - reset counters to 0L
> It looks like the number of file offset tracking pointers exceeded the number of files the JVM can have open concurrently.
> This triggers a feedback loop in which FileTailingAdaptor assumes the log file has rotated when it has not.
> FileTailingAdaptor was simply unable to track the offset, that's all.
> [root@gsbl80211 log]# /usr/sbin/lsof -p 29960|wc -l
> 1084
> The number of concurrently open files is 1084, which exceeds the default limit of 1024.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
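
The feedback loop described in the quoted report comes from treating "could not open the file" as "the file rotated". The following is a minimal Python sketch of the distinction, not Chukwa's code and not the fix applied in CHUKWA-323. The names `classify_open_failure`, `read_from_offset` and the status labels are hypothetical, chosen only for illustration. It assumes the tailer reopens the file on each pass, as the `RandomAccessFile` frames in the stack trace suggest.

```python
import errno

# Hypothetical status labels for this sketch; they are not Chukwa identifiers.
OK = "ok"
RETRY_LATER = "retry_later"    # descriptor exhaustion: keep the offset, try again
FILE_MISSING = "file_missing"  # the path really is gone
OTHER_ERROR = "other_error"


def classify_open_failure(exc: OSError) -> str:
    """Map an OSError from open() to one of the labels above."""
    # EMFILE: per-process descriptor limit hit; ENFILE: system-wide table full.
    if exc.errno in (errno.EMFILE, errno.ENFILE):
        return RETRY_LATER
    if exc.errno == errno.ENOENT:
        return FILE_MISSING
    return OTHER_ERROR


def read_from_offset(path, offset, max_bytes=65536):
    """Read up to max_bytes starting at offset.

    Returns (data, new_offset, status). On any failure the offset is
    returned unchanged, so the caller never resets its position to 0
    just because the open failed.
    """
    try:
        # The with-block closes the descriptor even if seek/read raises,
        # so repeated passes do not accumulate open files.
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(max_bytes)
    except OSError as exc:
        return b"", offset, classify_open_failure(exc)
    return data, offset + len(data), OK
```

With this shape, a pass that hits the descriptor limit reports `retry_later` and leaves the stored offset alone, so rotation detection only runs on files that were actually opened.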