hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
Date Mon, 24 Feb 2014 15:43:23 GMT

    [ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910411#comment-13910411 ]

Jason Lowe commented on YARN-221:
---------------------------------

bq. We can have the MR AM wait for notification, as in container exit -> NM notifies RM ->
RM notifies AM. That will create some delay before the AM can declare the job done. With the NM
-> RM heartbeat interval used in big clusters, it could add a couple of seconds of delay to the job.
That might not be a big deal for regular MR jobs.

The NM does out-of-band heartbeats when containers exit, so the turnaround time can be shorter
than a full NM heartbeat interval. 

If we're really concerned about any additional time added for graceful task exit, we can also
have the AM unregister when the job succeeds/fails but before all tasks exit. The RM will then
kill all containers of the application when the AM eventually exits (or times out waiting).
In that sense it would not add any time from the job client's perspective,
as the job could report completion at the same time it did before.  However, it would add some
time from the YARN perspective, as the application would linger on the cluster in the FINISHING
state a few seconds longer than it did before.

bq. One thing to add: we need a definition and policy for how to handle tasks that are
in the finishing state and that the MR AM ends up stopping because they don't exit by themselves.

I don't think we need to get too tricky here.  The NM will see the container return a non-zero
exit code and assume that's a failure.  If tasks are succeeding but returning non-zero exit
codes, then that's probably a bug, and arguably it's a good thing that we're grabbing the logs
to show what went wrong during teardown.  IMHO we should fix what's causing the non-zero
exit code rather than try to add a mechanism to prevent logs from being aggregated in what
should be a rare and abnormal case.
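
To make the exit-code rule above concrete, here is a minimal sketch of the decision. It assumes a policy where logs of successful containers may be skipped; the class and method names are hypothetical and are not part of the YARN API or the attached patch.

```java
// Hypothetical sketch only; none of these names exist in YARN.
final class ExitCodeLogDecision {
    /** Exit code conventionally meaning success. */
    static final int SUCCESS = 0;

    /**
     * Returns true when the container's logs should be aggregated.
     * A non-zero exit code is treated as a failure, so logs are always kept.
     * aggregateOnSuccess is an assumed flag: whether logs of successful
     * containers are wanted at all.
     */
    static boolean shouldAggregate(int exitCode, boolean aggregateOnSuccess) {
        if (exitCode != SUCCESS) {
            return true; // failure (or a buggy teardown): keep the evidence
        }
        return aggregateOnSuccess;
    }

    public static void main(String[] args) {
        System.out.println(shouldAggregate(0, false));   // prints false
        System.out.println(shouldAggregate(143, false)); // prints true
    }
}
```

With this rule, a task that succeeds but exits non-zero still gets its logs aggregated, which is the "rare and abnormal case" worth investigating rather than suppressing.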

> NM should provide a way for AM to tell it not to aggregate logs.
> ----------------------------------------------------------------
>
>                 Key: YARN-221
>                 URL: https://issues.apache.org/jira/browse/YARN-221
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Robert Joseph Evans
>            Assignee: Chris Trezzo
>         Attachments: YARN-221-trunk-v1.patch
>
>
> The NodeManager should provide a way for an AM to tell it that either the logs should
> not be aggregated, that they should be aggregated with a high priority, or that they should
> be aggregated but with a lower priority.  The AM should be able to do this in the ContainerLaunch
> context to provide a default value, but should also be able to update the value when the container
> is released.
> This would allow for the NM to not aggregate logs in some cases, and avoid connection
> to the NN at all.
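
The issue description above asks for three hints, with a default given at container launch and an optional update at release. A minimal sketch of that shape follows; the enum and class names are hypothetical illustrations, not the YARN API and not what the attached patch implements.

```java
import java.util.Objects;

// Hypothetical names for illustration; not part of YARN.
enum LogAggregationHint { SKIP, HIGH_PRIORITY, LOW_PRIORITY }

final class ContainerLogHint {
    private final LogAggregationHint launchDefault;
    private LogAggregationHint releaseOverride; // null until the AM updates it

    /** The default hint, as it might be supplied at container launch. */
    ContainerLogHint(LogAggregationHint launchDefault) {
        this.launchDefault = Objects.requireNonNull(launchDefault);
    }

    /** The AM may replace the default when it releases the container. */
    void updateOnRelease(LogAggregationHint hint) {
        this.releaseOverride = Objects.requireNonNull(hint);
    }

    /** The release-time value wins if present; otherwise the launch default. */
    LogAggregationHint effective() {
        return releaseOverride != null ? releaseOverride : launchDefault;
    }

    public static void main(String[] args) {
        ContainerLogHint hint = new ContainerLogHint(LogAggregationHint.LOW_PRIORITY);
        System.out.println(hint.effective()); // prints LOW_PRIORITY
        hint.updateOnRelease(LogAggregationHint.SKIP);
        System.out.println(hint.effective()); // prints SKIP
    }
}
```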



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
