hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
Date Fri, 06 Jun 2014 12:26:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019798#comment-14019798 ]

Wangda Tan commented on YARN-1408:
----------------------------------

[~sunilg], thanks for the reply.
bq. My doubt is in this context. I felt that we may not need a SchedulerApplicationAttempt
in RMContainer itself. Rather, I can keep the required information in RMContainer, so
it will be more specific to the handling.
bq. A KILL event can come from sources other than preemption. I feel we need to handle this
issue with respect to preemption only, correct? Other KILL events in RMContainer may be for
killing a reserved container, etc. So we may need to do this only in
PreemptableResourceScheduler#killContainer. What do you think?
FairScheduler supports preemption as well, but it doesn't implement the PreemptableResourceScheduler
interface. And do you agree that we ALWAYS need to add the Resource Request back when a container
is killed before the AM has acquired it (regardless of whether it was killed by preemption or
not)? It's as if you've ordered something on Amazon but don't know that Amazon has already
cancelled your order; that's not a good experience for the user. If you agree with me, it might
be better to add the logic to the RMContainer transitions. :)
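
A minimal sketch of that idea, using a toy model rather than the real YARN classes. Every name here (RestoreRequestSketch, PendingRequests, ToyContainer) is hypothetical and only illustrates where the hook would sit: the kill transition itself re-queues the original ask when the container was never acquired, whatever the reason for the kill.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: these types are hypothetical and are NOT the real YARN classes.
class RestoreRequestSketch {

    enum State { ALLOCATED, ACQUIRED, KILLED }

    /** Stand-in for the outstanding resource requests of one application. */
    static final class PendingRequests {
        private final Deque<String> queue = new ArrayDeque<>();

        void add(String request) { queue.addLast(request); }

        int size() { return queue.size(); }
    }

    /** Stand-in for RMContainer; it remembers the request that produced it. */
    static final class ToyContainer {
        private final String originalRequest;
        private final PendingRequests pending;
        private State state = State.ALLOCATED;

        ToyContainer(String originalRequest, PendingRequests pending) {
            this.originalRequest = originalRequest;
            this.pending = pending;
        }

        State getState() { return state; }

        /** The AM picked the container up in a heartbeat. */
        void acquire() {
            // Only an ALLOCATED container can be acquired; this sketch ignores the event otherwise.
            if (state == State.ALLOCATED) {
                state = State.ACQUIRED;
            }
        }

        /** Kill for any reason (preemption, reservation cleanup, and so on). */
        void kill() {
            if (state == State.KILLED) {
                return; // already dead, nothing to restore twice
            }
            if (state == State.ALLOCATED) {
                // The AM never saw this container, so put its ask back in the queue.
                pending.add(originalRequest);
            }
            state = State.KILLED;
        }
    }

    public static void main(String[] args) {
        PendingRequests pending = new PendingRequests();

        ToyContainer neverAcquired = new ToyContainer("ask-1", pending);
        neverAcquired.kill();              // killed before the AM acquired it

        ToyContainer acquired = new ToyContainer("ask-2", pending);
        acquired.acquire();
        acquired.kill();                   // the AM already knows about this one

        System.out.println("requests restored: " + pending.size()); // prints "requests restored: 1"
    }
}
```

Putting the check in the container's own kill transition means every scheduler gets the behaviour, not just the ones that implement a particular preemption interface.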
bq. I agree with the concern about modifying the definition, but to an extent I think it will be
better if we can reuse it with a recovery mode.
+1 for extending and reusing the existing interface.

Wangda

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch
>
>
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Submit a big jobA to queue a, which uses the full cluster capacity.
> Step 2: Submit a jobB to queue b, which would use less than 20% of the cluster capacity.
> A jobA task that was using queue b's capacity is preempted and killed.
> This caused the following problem:
> 1. A new container was allocated for jobA in queue a on a node update from an NM.
> 2. This container was preempted immediately by the preemption policy.
> An "ACQUIRED at KILLED" invalid state exception was then raised when the next AM heartbeat reached the RM.
> ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED
> This also caused the task to time out after 30 minutes, because this container had already been killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs
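
The event ordering in the description above can be reproduced with a toy state machine. This is only an illustrative sketch: the class InvalidTransitionSketch and its transition table are hypothetical and far simpler than the real RMContainerImpl, which has more states and events. It shows how a terminal KILLED state with no arc for an ACQUIRED event rejects the late heartbeat.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy state machine with hypothetical names; not YARN's actual state machine code.
class InvalidTransitionSketch {

    enum State { ALLOCATED, ACQUIRED, KILLED }

    enum Event { ACQUIRED, KILL }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> ARCS =
            new EnumMap<State, Map<Event, State>>(State.class);

    static {
        Map<Event, State> fromAllocated = new EnumMap<Event, State>(Event.class);
        fromAllocated.put(Event.ACQUIRED, State.ACQUIRED);
        fromAllocated.put(Event.KILL, State.KILLED);
        ARCS.put(State.ALLOCATED, fromAllocated);

        Map<Event, State> fromAcquired = new EnumMap<Event, State>(Event.class);
        fromAcquired.put(Event.KILL, State.KILLED);
        ARCS.put(State.ACQUIRED, fromAcquired);

        // KILLED is terminal in this sketch: no outgoing arcs are registered.
        ARCS.put(State.KILLED, new EnumMap<Event, State>(Event.class));
    }

    private State state = State.ALLOCATED;

    State getState() { return state; }

    void handle(Event event) {
        State next = ARCS.get(state).get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + state);
        }
        state = next;
    }

    public static void main(String[] args) {
        InvalidTransitionSketch container = new InvalidTransitionSketch();
        container.handle(Event.KILL);          // preemption kills the fresh container first
        try {
            container.handle(Event.ACQUIRED);  // the AM heartbeat arrives afterwards
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Invalid event: ACQUIRED at KILLED"
        }
    }
}
```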



--
This message was sent by Atlassian JIRA
(v6.2#6252)
