hadoop-yarn-issues mailing list archives

From "Anubhav Dhoot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
Date Wed, 02 Jul 2014 18:33:24 GMT

    [ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050533#comment-14050533
] 

Anubhav Dhoot commented on YARN-2175:
-------------------------------------

We have seen this happen when the source file system had issues. Some jobs would intermittently
take a long time to fail, then succeed on rerun because the jars were placed in a new distributed
cache location. Without this timeout we have no lever to mitigate underlying HDFS/hardware
issues in production until the root cause is identified and fixed.
Also, compared with mapreduce.task.timeout, this timeout is narrowly focused on a specific
operation: localization. I would expect it to default to a large value
in production (say 30 min) and to be used only to mitigate an issue when one occurs.
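
To make the idea concrete, here is a minimal sketch of bounding a single localization step with
a timeout. It uses only java.util.concurrent and is not the NodeManager's actual code: the class
name, the method name, and the DEFAULT_TIMEOUT_MS constant are all hypothetical, and the download
is modeled as a plain Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class LocalizationTimeoutSketch {
    // Hypothetical default, mirroring the "say 30 min" suggestion above.
    static final long DEFAULT_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(30);

    /**
     * Runs the (hypothetical) download task on a worker thread.
     * Returns true if it finished within timeoutMs, false if it timed out
     * and was cancelled. A failure inside the download itself surfaces
     * as an ExecutionException.
     */
    static boolean localizeWithTimeout(Callable<Void> download, long timeoutMs)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> future = executor.submit(download);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupt the stuck download instead of waiting indefinitely.
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulates a source file system that hangs for 5 seconds.
        Callable<Void> stuck = () -> { Thread.sleep(5_000); return null; };
        Callable<Void> healthy = () -> null;

        System.out.println(localizeWithTimeout(stuck, 100));                  // false
        System.out.println(localizeWithTimeout(healthy, DEFAULT_TIMEOUT_MS)); // true
    }
}
```

With a large default the healthy path is unaffected, and an operator can lower the value only
while an underlying storage problem is being investigated.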

> Container localization has no timeouts and tasks can be stuck there for a long time
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-2175
>                 URL: https://issues.apache.org/jira/browse/YARN-2175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>
> There are no timeouts that can be used to limit the time taken by various container startup
operations. Localization, for example, could take a long time, and there is no automated way
to kill a task if it is stuck in one of these states. The delay may have nothing to do with the task itself
and could be an issue within the platform.
> Ideally there should be configurable limits on the time a container can spend in various states within
the NodeManager. The RM does not care about most of these; they matter only between the AM
and the NM. We can start by making these global configurable defaults, and in future we can
make it fancier by letting the AM override them in the start container request.
> This jira will be used to limit localization time, and we can open others if we feel we
need to limit other operations.
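
The "global defaults, with an optional AM override" scheme described above could be resolved
along these lines. This is only an illustrative sketch: the class, the enum values, and the
method names are hypothetical and do not correspond to existing YARN classes or configuration
keys, and treating an unconfigured state as unlimited is an assumption.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.OptionalLong;

class StateTimeoutSketch {
    // Illustrative states only; not the NodeManager's real state machine.
    enum ContainerState { LOCALIZING, LAUNCHING }

    // Global, operator-configured defaults in milliseconds.
    private final Map<ContainerState, Long> globalDefaultsMs =
            new EnumMap<>(ContainerState.class);

    StateTimeoutSketch setDefault(ContainerState state, long timeoutMs) {
        globalDefaultsMs.put(state, timeoutMs);
        return this;
    }

    /**
     * Resolves the limit for one state: a per-request override from the AM
     * wins if present, otherwise the global default applies. A state with
     * no configured default is treated as unlimited (an assumption).
     */
    long effectiveTimeoutMs(ContainerState state, OptionalLong amOverrideMs) {
        if (amOverrideMs.isPresent()) {
            return amOverrideMs.getAsLong();
        }
        Long configured = globalDefaultsMs.get(state);
        return configured != null ? configured : Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        StateTimeoutSketch limits = new StateTimeoutSketch()
                .setDefault(ContainerState.LOCALIZING, 30L * 60L * 1000L);

        // Global default applies when the AM sends no override.
        System.out.println(limits.effectiveTimeoutMs(
                ContainerState.LOCALIZING, OptionalLong.empty()));   // 1800000
        // The AM override takes precedence for this one request.
        System.out.println(limits.effectiveTimeoutMs(
                ContainerState.LOCALIZING, OptionalLong.of(60_000L))); // 60000
    }
}
```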



--
This message was sent by Atlassian JIRA
(v6.2#6252)
