hadoop-mapreduce-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4443) MR AM and job history server should be resilient to jobs that exceed counter limits
Date Tue, 17 Dec 2013 18:34:08 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850746#comment-13850746 ]

Hadoop QA commented on MAPREDUCE-4443:
--------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12579168/MAPREDUCE-4443-trunk-3.patch
  against trunk revision .

    {color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4263//console

This message is automatically generated.

> MR AM and job history server should be resilient to jobs that exceed counter limits 
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4443
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4443
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>            Assignee: Mayank Bansal
>              Labels: usability
>         Attachments: MAPREDUCE-4443-trunk-1.patch, MAPREDUCE-4443-trunk-2.patch, MAPREDUCE-4443-trunk-3.patch, MAPREDUCE-4443-trunk-draft.patch, am_failed_counter_limits.txt
>
>
> We saw this problem while migrating applications to MapReduceV2.
> Our applications use Hadoop counters extensively (1000+ counters for certain jobs). While this may not be a recommended best practice in Hadoop, the real issue here is the reliability of the framework when applications exceed counter limits.
> The Hadoop servers (YARN, history server) were originally brought up with mapreduce.job.counters.max=1000 in core-site.xml.
> We then ran a MapReduce job under an application using its own job-specific overrides, with mapreduce.job.counters.max=10000.
> All the tasks for the job finished successfully; however, the overall job still failed because the AM encountered exceptions such as:
> {code}
> 2012-07-12 17:31:43,485 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 71
> 2012-07-12 17:31:43,502 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 1001 max=1000
>         at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:58)
>         at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:65)
>         at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:77)
>         at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:94)
>         at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:105)
>         at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:202)
>         at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:337)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1212)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1198)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1179)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:711)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.checkJobCompleteSuccess(JobImpl.java:737)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.checkJobForCompletion(JobImpl.java:1360)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1340)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1323)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:380)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:666)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:113)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:890)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:886)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-07-12 17:31:43,502 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> 2012-07-12 17:31:43,503 INFO [Thread-1] org.apache.had
> {code}
> The overall job failed, and the job history was not accessible at the end of the job either (it did not show up in the job history server).
> We were able to work around the issue by raising the limits in core-site.xml and restarting the YARN servers. However, that forced us to raise the global counter limit to the highest value any individual application might need, which is hard to predict.
> The original job then succeeded with the new global limits.
> However, since we had not restarted the job history server, it could not display the job history page for the successful job at all, as it still hit the counter-limit exception. Restarting the job history server finally made the application available under job history.
> I'll also attach AM logs to help debug the issue.
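
The failure mode described above can be illustrated with a small, self-contained sketch. It does not use any Hadoop classes: `CounterLimitSketch`, `Counters`, `taskCounters`, and `aggregate` are hypothetical names, and the assumption that the AM aggregates task counters under the server-side limit (1000) while tasks run under the job-level override (10000) is inferred from the report, not confirmed against the Hadoop source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterLimitSketch {

    /** Hypothetical stand-in for the exception named in the stack trace. */
    static class LimitExceededException extends RuntimeException {
        LimitExceededException(String message) {
            super(message);
        }
    }

    /** A counter holder that rejects new counters once its limit is reached. */
    static class Counters {
        private final int max;
        private final Map<String, Long> values = new LinkedHashMap<>();

        Counters(int max) {
            this.max = max;
        }

        void increment(String name, long delta) {
            // Only a counter name not seen before counts against the limit.
            if (!values.containsKey(name) && values.size() + 1 > max) {
                throw new LimitExceededException(
                        "Too many counters: " + (values.size() + 1) + " max=" + max);
            }
            values.merge(name, delta, Long::sum);
        }

        int size() {
            return values.size();
        }
    }

    /** Simulates a task that creates `count` counters under the job-level limit. */
    static Counters taskCounters(int count, int jobLimit) {
        Counters counters = new Counters(jobLimit);
        for (int i = 0; i < count; i++) {
            counters.increment("counter-" + i, 1L);
        }
        return counters;
    }

    /** Simulates aggregation into a holder bound by a (possibly different) limit. */
    static String aggregate(Counters from, int limit) {
        Counters total = new Counters(limit);
        try {
            for (Map.Entry<String, Long> e : from.values.entrySet()) {
                total.increment(e.getKey(), e.getValue());
            }
            return "ok: " + total.size();
        } catch (LimitExceededException ex) {
            return "failed: " + ex.getMessage();
        }
    }

    public static void main(String[] args) {
        // Tasks succeed because they run with the job override of 10000.
        Counters task = taskCounters(1001, 10000);
        // Aggregation under the server-side limit of 1000 fails.
        System.out.println(aggregate(task, 1000));
        // Aggregation under the raised limit succeeds.
        System.out.println(aggregate(task, 10000));
    }
}
```

In this sketch the tasks finish without error, yet aggregation under the lower limit fails on the 1001st counter, which mirrors the "Too many counters: 1001 max=1000" message in the log. Raising the aggregating side's limit makes the same data aggregate cleanly, as the workaround in the report did.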



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
