Date: Wed, 4 Feb 2015 11:02:57 +0530
Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
From: Suma Shivaprasad
To: user@hadoop.apache.org
Cc: yarn-dev@hadoop.apache.org

Using Hadoop 2.4.0. The number of applications running is small on average, about 40-60. The metrics in Ganglia show around 10-30 apps killed every 5 minutes, which is very high relative to the number of apps running at any given time (40-60). The RM audit logs, though, show 0 failed apps during that hour.

The RM UI also doesn't show any apps in the Applications -> Failed tab. The logs are getting rolled over at a slower rate, every 1-2 hours. I am searching for "Application Finished - Failed" to find the failed apps. Please let me know if I am missing something here.

Thanks
Suma

On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S <rohithsharmaks@huawei.com> wrote:

> Hi
>
> Could you give more information? Which version of Hadoop are you using?
>
> >> The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> >> However, RMAuditLogger shows only 1 or 2 apps as killed/failed in the logs.
>
> I suspect that the logs might have been rolled over. Are more applications
> running?
>
> All of the application history will be displayed on the RM web UI (provided
> the RM has not been restarted, or RM recovery is enabled). Maybe you can
> check these application lists.
>
> To find the reasons an application was killed or failed, one way is to check
> the NodeManager logs as well. There you need to search using the
> container_id for the corresponding application.
>
> Thanks & Regards
> Rohith Sharma K S
>
> *From:* Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> *Sent:* 03 February 2015 21:35
> *To:* user@hadoop.apache.org; yarn-dev@hadoop.apache.org
> *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
> Hello,
>
> I was trying to debug the reasons for killed/failed apps and was checking
> the RM logs (from RMAuditLogger) for the applications that were killed or
> failed.
>
> The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> However, RMAuditLogger shows only 1 or 2 apps as killed/failed in the logs.
> Is it possible that some logs are missed by the audit logger, or is it the
> other way round and the metrics are being over-reported?
>
> Thanks
> Suma
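
For anyone repeating the log search described in this thread, here is a minimal Python sketch of counting audit-log entries per outcome. It is only an illustration: the helper names (count_finished_apps, count_in_files) are made up for this note, and of the two search strings only "Application Finished - Failed" is quoted in the thread. The "Killed" variant is an assumption and should be verified against the actual RM audit log before relying on the counts.

```python
from collections import Counter
from typing import Iterable

# Strings to look for in each log line. Only the "Failed" string is quoted in
# the thread; the "Killed" string is an assumption to verify against real logs.
OPERATIONS = (
    "Application Finished - Failed",
    "Application Finished - Killed",
)


def count_finished_apps(lines: Iterable[str], operations=OPERATIONS) -> Counter:
    """Count how many lines mention each operation string."""
    counts = Counter({op: 0 for op in operations})
    for line in lines:
        for op in operations:
            if op in line:
                counts[op] += 1
    return counts


def count_in_files(paths: Iterable[str]) -> Counter:
    """Sum the counts over several files, e.g. the current log plus rolled-over ones."""
    total = Counter()
    for path in paths:
        # errors="replace" keeps the scan going if a file has stray bytes.
        with open(path, errors="replace") as handle:
            total.update(count_finished_apps(handle))
    return total
```

Passing the current log together with the rolled-over files to count_in_files (for example a hypothetical list such as ["rm-audit.log", "rm-audit.log.1"]) keeps log rollover from hiding entries, and counting killed and failed outcomes separately makes it easier to compare each against the matching QueueMetrics counter.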