hadoop-mapreduce-issues mailing list archives

From "Haibo Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6771) Diagnostics information can be lost in .jhist if task containers are killed by Node Manager.
Date Wed, 31 Aug 2016 18:14:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452936#comment-15452936 ]

Haibo Chen commented on MAPREDUCE-6771:
---------------------------------------

bq. Note that we aren't stuck with TaskAttemptUnsuccessfulCompletion event for doing diagnostics.

Agreed. My guess is that diagnostics are included in TaskAttemptUnsuccessfulCompletionEvent
because users only want to see diagnostics when task attempts fail. Introducing a new event
would require additional changes to parse it and to ignore it for successful task attempts.
bq.  but waiting for a container completion event is not something the state machine does
today.
There is no need to wait for a container completion event. My proposal is to wait for the
transition into the FAILED state. As long as the task attempt goes into the FAILED state, which
does not necessarily need to be triggered by a container completion event (a timeout, TA_TIMED_OUT,
is already built into the transitions from FAIL_FINISHING_CONTAINER to FAILED), the diagnostics
will be written into the jhist file. That said, your point that a wide window is susceptible
to an AM crash is still very convincing.
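
To make the proposal concrete, here is a minimal toy sketch of the idea, not the actual MR AM
state machine. Only FAIL_FINISHING_CONTAINER, FAILED and TA_TIMED_OUT are names taken from the
discussion above; the class, the other event names and the in-memory "history" list are
hypothetical stand-ins. It only illustrates that the diagnostics are recorded on entry into
FAILED, whichever event triggers that transition.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this class is hypothetical and is not Hadoop code.
class ToyTaskAttempt {
    enum State { RUNNING, FAIL_FINISHING_CONTAINER, FAILED }

    // TA_TIMED_OUT is named in the discussion; the other event names are invented here.
    enum Event { FAIL_REPORTED, DIAGNOSTICS_UPDATE, CONTAINER_COMPLETED, TA_TIMED_OUT }

    private State state = State.RUNNING;
    private final List<String> diagnostics = new ArrayList<>();
    // Stands in for the jhist file: one entry per failed-attempt record.
    private final List<String> history = new ArrayList<>();

    void handle(Event event, String message) {
        switch (event) {
            case DIAGNOSTICS_UPDATE:
                // Diagnostics are still accepted while we wait for FAILED.
                if (state != State.FAILED) {
                    diagnostics.add(message);
                }
                break;
            case FAIL_REPORTED:
                if (state == State.RUNNING) {
                    state = State.FAIL_FINISHING_CONTAINER;
                }
                break;
            case CONTAINER_COMPLETED:
            case TA_TIMED_OUT:
                // Either event may complete the transition; neither is required.
                if (state == State.FAIL_FINISHING_CONTAINER) {
                    enterFailed();
                }
                break;
        }
    }

    // The record is written only here, on entry into FAILED.
    private void enterFailed() {
        state = State.FAILED;
        history.add(String.join("; ", diagnostics));
    }

    State state() { return state; }

    List<String> history() { return history; }
}
```

In this toy model, a diagnostics message that arrives between the failure report and the timeout
still ends up in the record. The cost is the one raised above: nothing is written if the AM
crashes while the attempt sits in FAIL_FINISHING_CONTAINER.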

Given that there is no clearly preferable approach to the case in MAPREDUCE-4955, do you think
I can go ahead and address the issue in this jira? The symptom of MAPREDUCE-4955 and this one
is the same, but the cause is not quite the same. The case in MAPREDUCE-4955 happens when the
AM thinks the task attempt is already dead, or the diagnostics arrive after a taskUnsuccessfulCompletionEvent
has already been generated, whereas the case in this jira happens when the diagnostics arrive
while the task attempt is still in the running state, or before a taskUnsuccessfulCompletionEvent
is generated. The case in this jira is easy to fix, and we can keep MAPREDUCE-4955 open to address
the other case once we decide what to do.
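
The difference in timing between the two cases can be sketched as follows. This is a
hypothetical model, not Hadoop code: the class and method names are invented, and the "event"
is reduced to a string snapshot of whatever diagnostics were known when it was generated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Hadoop code.
class DiagnosticsTiming {
    private final List<String> diagnostics = new ArrayList<>();
    // Null until the unsuccessful-completion event has been generated.
    private String unsuccessfulCompletionEvent;

    // Returns true if the message arrived early enough to be part of the event.
    boolean addDiagnostics(String message) {
        if (unsuccessfulCompletionEvent != null) {
            // MAPREDUCE-4955-style timing: the event already exists.
            return false;
        }
        // MAPREDUCE-6771-style timing: the event has not been generated yet.
        diagnostics.add(message);
        return true;
    }

    // Generates the event once, as a snapshot of the diagnostics known so far.
    String generateUnsuccessfulCompletionEvent() {
        if (unsuccessfulCompletionEvent == null) {
            unsuccessfulCompletionEvent = String.join("; ", diagnostics);
        }
        return unsuccessfulCompletionEvent;
    }
}
```

Under these assumptions, a message that arrives before the event is generated only needs to be
carried into the event, while a message that arrives afterwards has nowhere to go without a
change to how or when the event is emitted.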

> Diagnostics information can be lost in .jhist if task containers are killed by Node Manager.
> --------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6771
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6771
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.7.3
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: TaUnsuccessfullyEventEmission.jpg, mapreduce6771.001.patch
>
>
> Task containers can go over their resource limit and be killed by the Node Manager. The MR
AM then gets notified of the container status and diagnostics information through its heartbeat
with the RM.  However, it is possible that the diagnostics information never gets into the .jhist
file, so when the job completes, the diagnostics information associated with the failed task
attempts is empty.  This makes it hard for users to find the root cause of job failures, which
are often caused by memory leaks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

