hadoop-mapreduce-user mailing list archives

From Rohith Sharma K S <rohithsharm...@huawei.com>
Subject RE: How to troubleshoot failed or stuck jobs
Date Mon, 02 Mar 2015 06:06:05 GMT

1.       For failed jobs, you can check the MRAppMaster logs directly. They give the
reason the job failed.
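
As a first pass over an MRAppMaster log, a small filter for high-severity lines can help. The sketch below assumes the log uses a log4j-style layout in which the level appears as a separate token such as `ERROR` or `FATAL`. The function name and the sample lines are made up for illustration and are not real MRAppMaster output. With log aggregation enabled, the log text itself can usually be fetched with `yarn logs -applicationId <application id>`.

```python
import re

# Assumption: the log level appears as a standalone token, as in common log4j layouts.
LEVEL_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def severe_lines(log_text):
    """Return (line_number, line) pairs for lines that contain ERROR or FATAL."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if LEVEL_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Made-up sample text, only to show the call; not real MRAppMaster output.
sample = "\n".join([
    "2015-03-02 06:00:01,000 INFO [main] example.Component: starting",
    "2015-03-02 06:00:02,000 ERROR [main] example.Component: something failed",
    "2015-03-02 06:00:03,000 INFO [main] example.Component: stopping",
])

for number, line in severe_lines(sample):
    print(number, line)
```

The line numbers make it easier to jump back into the full log and read the surrounding context, which usually includes the stack trace.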

2.       For a stuck job, you need to do some groundwork to identify what is going wrong.
It can be either a YARN issue or a MapReduce issue.

2.1   Recently I have seen jobs get stuck many times when the headroom calculation goes wrong.
 Headroom is sent by the RM to the ApplicationMaster, and the AM uses it as a deciding factor (https://issues.apache.org/jira/i#browse/YARN-1680).
 The corresponding parent JIRA is https://issues.apache.org/jira/i#browse/YARN-1198

2.2   When the job is stuck, check the following:
YARN – Get ClusterMemory Used, ClusterMemory Reserved, and Total Memory. How many NodeManagers are there?
What is the headroom sent to the AM?
MapReduce – Are any NMs blacklisted? Are the reducer tasks using up the ClusterMemory?
By default, reducers start before the mappers complete. If a mapper fails because of an
unstable node, the reducers can take over the cluster. In that case the reducers are
expected to be pre-empted, so you need to identify whether they are actually getting pre-empted.
The MRAppMaster log helps to some extent in analyzing the issue.
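
The YARN-side numbers above can be read from the ResourceManager REST endpoint `/ws/v1/cluster/metrics`. The sketch below is one possible way to pull them out. The function names, the `rm_address` parameter, and the sample figures are illustrative assumptions, and the headroom value is not read here.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_address):
    """Fetch cluster metrics from the ResourceManager web service.

    rm_address is a "host:port" string for the RM web UI (caller-supplied).
    """
    url = "http://%s/ws/v1/cluster/metrics" % rm_address
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(payload):
    """Pick out the memory and node figures worth checking for a stuck job."""
    metrics = payload["clusterMetrics"]
    total = metrics["totalMB"]
    used = metrics["allocatedMB"]
    return {
        "total_mb": total,
        "used_mb": used,
        "reserved_mb": metrics["reservedMB"],
        "available_mb": metrics["availableMB"],
        "active_nodes": metrics["activeNodes"],
        # Guard against an empty cluster reporting zero total memory.
        "used_fraction": (used / total) if total else 0.0,
    }


# Made-up figures, only to show the shape of the payload.
sample_payload = {
    "clusterMetrics": {
        "totalMB": 100000,
        "allocatedMB": 75000,
        "reservedMB": 5000,
        "availableMB": 25000,
        "activeNodes": 10,
    }
}

print(summarize(sample_payload))
```

Against a live cluster, the call would look like `summarize(fetch_cluster_metrics("rm-host:8088"))`, where the host name is a placeholder. A used fraction close to 1.0 while mappers are still pending is a hint to check whether reducers are holding the memory.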

Thanks & Regards
Rohith Sharma K S

From: Krish Donald [mailto:gotomypc27@gmail.com]
Sent: 02 March 2015 11:09
To: user@hadoop.apache.org
Subject: Re: How to troubleshoot failed or stuck jobs

Thanks for the link, Ted.

However, I wanted to understand the approach that should be taken when troubleshooting failed
or stuck jobs.

On Sun, Mar 1, 2015 at 8:52 PM, Ted Yu <yuzhihong@gmail.com> wrote:
Here are some related discussions and JIRA:




On Sun, Mar 1, 2015 at 8:41 PM, Krish Donald <gotomypc27@gmail.com> wrote:

Wanted to understand how to troubleshoot failed or stuck jobs.

