hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4818) Easier identification of tasks that timeout during localization
Date Tue, 19 Aug 2014 21:39:19 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102893#comment-14102893 ]

Jason Lowe commented on MAPREDUCE-4818:

Just thinking out loud: what if we added a separate yarn-localization-log for each container
that recorded the start and end of each resource localization performed on behalf of
that container?  Part of the problem today is that when a container dies during localization,
the job user doesn't always have access to the NM logs to debug the localization failure and
determine which resource was the problem.  If we at least had a per-container log showing the
localization start/end times for each resource, we'd know which one it got stuck on and how long
each resource took to localize (if we timestamped the log entries or otherwise computed the duration).
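
To make the idea concrete, here is a minimal sketch of what such a per-container log could
record and how a user might read it.  Everything in it is an assumption for illustration: the
class name ContainerLocalizationLog, the START/END entry format, and the resource names are
hypothetical, it is written in Python for brevity, and it does not correspond to any existing
NM code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ContainerLocalizationLog:
    """Hypothetical per-container log of localization events (not a YARN API)."""

    container_id: str
    # Each entry is (timestamp_seconds, event, resource); event is "START" or "END".
    entries: List[Tuple[float, str, str]] = field(default_factory=list)

    def start(self, resource: str, now: float) -> None:
        self.entries.append((now, "START", resource))

    def end(self, resource: str, now: float) -> None:
        self.entries.append((now, "END", resource))

    def durations(self) -> Dict[str, float]:
        """Seconds each fully localized resource took, from START to END."""
        started: Dict[str, float] = {}
        result: Dict[str, float] = {}
        for ts, event, resource in self.entries:
            if event == "START":
                started[resource] = ts
            elif resource in started:
                result[resource] = ts - started[resource]
        return result

    def pending(self) -> List[str]:
        """Resources with a START but no END: where the container got stuck."""
        done = {r for _, e, r in self.entries if e == "END"}
        return [r for _, e, r in self.entries if e == "START" and r not in done]


# Example: the container is killed while the second resource is still localizing.
log = ContainerLocalizationLog("container_A")
log.start("job.jar", now=0.0)
log.end("job.jar", now=2.5)
log.start("big-archive.tgz", now=2.5)
print(log.durations())  # {'job.jar': 2.5}
print(log.pending())    # ['big-archive.tgz']
```

With only the START/END entries and their timestamps, the user can see both which resource
never finished and how long the completed ones took, without needing the NM logs.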

The tricky part is that a file is often localized on behalf of many containers, including
those that show up late to the party.  For example, container A needs resource X and has been
localizing for many minutes, then container B shows up also needing X, but we don't start downloading
X again since container A is already handling it.  We could just add an X start entry to container
B's log at that point, since B is waiting on X even though it did not trigger the download.
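
The shared-resource case above could be sketched roughly as follows.  This is only an
illustration under assumed names (SharedLocalizationTracker, request, complete are all
hypothetical and not NM classes or methods): the first requester triggers the download, later
requesters just get their own start entry, and one completion writes an end entry to every
waiting container's log.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[float, str, str]  # (timestamp_seconds, "START" | "END", resource)


class SharedLocalizationTracker:
    """Hypothetical tracker for resources shared by several containers."""

    def __init__(self) -> None:
        # container id -> that container's localization log entries
        self.logs: Dict[str, List[Entry]] = defaultdict(list)
        # in-flight resource -> containers currently waiting on it
        self.waiters: Dict[str, List[str]] = {}
        self.downloads_started = 0

    def request(self, container: str, resource: str, now: float) -> None:
        if resource not in self.waiters:
            # Only the first requester triggers an actual download.
            self.waiters[resource] = []
            self.downloads_started += 1
        # Every requester, early or late, gets its own START entry.
        self.waiters[resource].append(container)
        self.logs[container].append((now, "START", resource))

    def complete(self, resource: str, now: float) -> None:
        # One finished download ends the wait for every container.
        for container in self.waiters.pop(resource, []):
            self.logs[container].append((now, "END", resource))


# Example: A starts localizing X, B arrives five minutes later, X finishes at t=360.
tracker = SharedLocalizationTracker()
tracker.request("A", "X", now=0.0)
tracker.request("B", "X", now=300.0)
tracker.complete("X", now=360.0)
print(tracker.downloads_started)  # 1
print(tracker.logs["B"])          # [(300.0, 'START', 'X'), (360.0, 'END', 'X')]
```

One consequence of this choice is that the start-to-end interval in B's log measures how long
B waited on X, not how long X took to download in total.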

I'm trying to think of ways other than MR job/task status to solve this, since the
difficulty of debugging container localization issues is not specific to MapReduce apps but
applies to YARN apps in general.

> Easier identification of tasks that timeout during localization
> ---------------------------------------------------------------
>                 Key: MAPREDUCE-4818
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4818
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.3-alpha
>            Reporter: Jason Lowe
>              Labels: usability
>
> When a task is taking too long to localize and is killed by the AM due to task timeout,
> the job UI/history is not very helpful.  The attempt simply lists a diagnostic stating it
> was killed due to timeout, but there are no logs for the attempt since it never actually got
> started.  There are log messages on the NM that show the container never made it past localization
> by the time it was killed, but users often do not have access to those logs.

