hadoop-mapreduce-user mailing list archives

From Suma Shivaprasad <sumasai.shivapra...@gmail.com>
Subject Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
Date Wed, 04 Feb 2015 10:54:44 GMT
Thanks for your inputs. The cluster Metrics API is giving correct numbers
for the failed/killed apps, matching the RM audit logs, so we are planning
to use that instead.
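
For reference, the cluster Metrics API in question is presumably the RM REST
endpoint - a minimal sketch, assuming the RM web UI is on the default port 8088:

    # the clusterMetrics JSON in the response carries cluster-wide counters,
    # including appsFailed and appsKilled
    curl -s http://<rm-host>:8088/ws/v1/cluster/metrics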

Suma

On Wed, Feb 4, 2015 at 12:04 PM, Rohith Sharma K S <
rohithsharmaks@huawei.com> wrote:

> There are several ways to confirm from YARN the total number of
> killed/failed applications in the cluster:
> 1. The application lists on the RM web UI, OR
> 2. From the CLI, to get the number of failed and killed applications (a
> count sketch follows below the list): ./yarn application -list -appStates FAILED,KILLED
> 3. Using the client APIs
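>
> If a quick count is enough, the CLI output from option 2 can be tallied
> directly - a rough sketch, assuming each result line starts with the
> application ID:
>
>     ./yarn application -list -appStates FAILED,KILLED | grep -c "^application_"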
>
> Since the metrics values displayed in Ganglia are incorrect, I have a few
> doubts:
> 1. Is Ganglia pointing at the correct RM cluster? Or
> 2. What method does Ganglia use to retrieve the QueueMetrics? (One place to
> check is the metrics2 sink configuration; a sketch follows below the list.)
> 3. Have you written any client program that retrieves the apps and
> calculates the numbers?
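>
> For 1 and 2, the usual path is that the RM pushes QueueMetrics to Ganglia
> through the metrics2 sink, so the sink section of hadoop-metrics2.properties
> on the RM host shows which Ganglia endpoint and reporting period are in use -
> a sketch, assuming GangliaSink31 and a 10-second period:
>
>     *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
>     *.sink.ganglia.period=10
>     resourcemanager.sink.ganglia.servers=<gmond-host>:8649
>
> If the gmond host or the period there is not what you expect, that could
> explain part of the mismatch.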
>
>
> Thanks & Regards
> Rohith Sharma K S
>
> -----Original Message-----
> From: Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> Sent: 04 February 2015 11:03
> To: user@hadoop.apache.org
> Cc: yarn-dev@hadoop.apache.org
> Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
> Using hadoop 2.4.0. The number of applications running on average is small,
> ~40-60. The metrics in Ganglia show around 10-30 apps killed every 5 minutes,
> which is very high relative to the apps running at any given time (40-60).
> The RM audit logs, though, show 0 failed apps during that hour. The RM UI
> also doesn't show any apps in the Applications -> Failed tab. The logs are
> getting rolled over at a slower rate, every 1-2 hours. I am searching for
> "Application Finished - Failed" to find the failed apps (see the sketch
> below). Please let me know if I am missing something here.
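>
> The search amounts to something like this - a sketch, assuming the default
> RM log file naming and including the rolled-over files:
>
>     grep -c "Application Finished - Failed" yarn-*-resourcemanager-*.log*
>     grep -c "Application Finished - Killed" yarn-*-resourcemanager-*.log*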
>
> Thanks
> Suma
>
>
>
>
> On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S <
> rohithsharmaks@huawei.com> wrote:
>
> >  Hi
> >
> >
> >
> > Could you give more information: which version of hadoop are you using?
> >
> >
> >
> > >> The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> > However, the RMAuditLogger shows only 1 or 2 apps as killed/failed in the logs.
> >
> > I suspect that the logs might have been rolled over. Are more
> > applications running?
> >
> >
> >
> > All of the application history will be displayed on the RM web UI
> > (provided the RM has not been restarted, or RM recovery is enabled). Maybe
> > you can check these application lists.
> >
> >
> >
> > To find the reasons why applications were killed/failed, one way is to
> > also check the NodeManager logs. There you need to search using the
> > container_id of the corresponding application (a sketch follows).
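> >
> > A rough sketch of that check, assuming log aggregation is enabled for the
> > yarn logs command and using placeholder IDs:
> >
> >     # all container logs for one application, to search for the failure reason
> >     ./yarn logs -applicationId application_<cluster-timestamp>_<id>
> >     # or grep the NodeManager log directly for one container
> >     grep "container_<cluster-timestamp>_<id>_01_000001" yarn-*-nodemanager-*.log*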
> >
> >
> >
> > Thanks & Regards
> >
> > Rohith Sharma K S
> >
> >
> >
> > From: Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> > Sent: 03 February 2015 21:35
> > To: user@hadoop.apache.org; yarn-dev@hadoop.apache.org
> > Subject: QueueMetrics.AppsKilled/Failed metrics and failure reasons
> >
> >
> >
> > Hello,
> >
> >
> > I was trying to debug the reasons for killed/failed apps and was checking
> > for the applications that were killed/failed in the RM logs - from the
> > RMAuditLogger.
> >
> > The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e.
> > ~100. However, the RMAuditLogger shows only 1 or 2 apps as killed/failed
> > in the logs. Is it possible that some logs are missed by the AuditLogger,
> > or is it the other way round and the metrics are being reported too high?
> >
> > Thanks
> >
> > Suma
> >
>
