tez-issues mailing list archives

From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3130) A bad NodeManager can end up occupying all container launcher threads, delaying new launches
Date Fri, 19 Feb 2016 23:32:18 GMT
Siddharth Seth created TEZ-3130:
-----------------------------------

             Summary: A bad NodeManager can end up occupying all container launcher threads, delaying new launches
                 Key: TEZ-3130
                 URL: https://issues.apache.org/jira/browse/TEZ-3130
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Siddharth Seth
             Fix For: 0.8.3


If there is a single bad NodeManager with a lot of containers allocated on it, all
container launcher threads can end up blocked on that node, delaying subsequent launches
on other nodes. This happens even though timeouts kick in.
1) We should not allow all launcher threads to be used up by a single NM.
2) The retry policy could be enhanced to stop retrying on ConnectionTimeouts (e.g. node down).
3) Launch requests should be interrupted once Tez has detected a container as timed out.
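
As a rough illustration of point 1, the per-node cap could be sketched with one semaphore per NodeManager. This is only a minimal sketch under stated assumptions, not Tez code: the class name PerNodeLaunchLimiter, its methods, the node address strings and the limit of 2 are all hypothetical, and a real launcher would still need to decide what to do with a request that is refused (for example, re-queue it behind requests for other nodes).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical helper (not part of Tez): caps how many launcher threads
 * may be busy with a single NodeManager at any one time.
 */
class PerNodeLaunchLimiter {
    private final int maxPerNode;
    // One semaphore per node, created lazily the first time the node is seen.
    private final ConcurrentMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    PerNodeLaunchLimiter(int maxPerNode) {
        if (maxPerNode < 1) {
            throw new IllegalArgumentException("maxPerNode must be >= 1");
        }
        this.maxPerNode = maxPerNode;
    }

    /** Returns true if a launcher thread may start a launch on this node now. Never blocks. */
    boolean tryAcquire(String nodeId) {
        return permits
            .computeIfAbsent(nodeId, id -> new Semaphore(maxPerNode))
            .tryAcquire();
    }

    /** Call exactly once after a successful tryAcquire, whether the launch succeeded or failed. */
    void release(String nodeId) {
        Semaphore s = permits.get(nodeId);
        if (s != null) {
            s.release();
        }
    }

    public static void main(String[] args) {
        // Assumed limit: at most 2 launcher threads per node.
        PerNodeLaunchLimiter limiter = new PerNodeLaunchLimiter(2);

        System.out.println(limiter.tryAcquire("badNode:8041"));  // true
        System.out.println(limiter.tryAcquire("badNode:8041"));  // true
        System.out.println(limiter.tryAcquire("badNode:8041"));  // false: cap reached for this node
        System.out.println(limiter.tryAcquire("goodNode:8041")); // true: other nodes are unaffected

        limiter.release("badNode:8041");
        System.out.println(limiter.tryAcquire("badNode:8041"));  // true: a permit was returned
    }
}
```

With a cap like this, a launcher thread that is refused a permit can move on to a request for a different node instead of joining the threads already stuck on the bad one. The cap does not shorten the hang on the bad node itself, so points 2 and 3 would still be needed.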

Noticed by [~rajesh.balamohan] - threads would lock up for 15 minutes in 0.7, and potentially
indefinitely in 0.8. The indefinite hang is a separate bug that needs investigation in 0.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
