hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
Date Sat, 28 Jun 2014 01:21:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046590#comment-14046590 ]

Wangda Tan commented on YARN-1408:
----------------------------------

Hi [~sunilg],
Thanks for working out this patch so fast!

*A major problem I've seen:*
The ResourceRequests stored in RMContainerImpl should include the rack/any RRs as well.
Currently, only one ResourceRequest is stored in RMContainerImpl, which is not enough
for recovery in the following cases:
Case 1: An RR may contain other fields, like relaxLocality. Assume a node-local RR has
relaxLocality=true (the default), while its rack-local/any RRs have relaxLocality=false.
With your current implementation, you cannot fully recover the original RRs.
Case 2: The rack-local RR will be missing. When a node-local RR is allocated, the outstanding
rack-local/any numContainers are decremented as well; you can check AppSchedulingInfo#allocateNodeLocal
for the logic of how the outstanding rack/any container counts are decreased (a simplified sketch of that logic follows).
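
For reference, here is a simplified paraphrase of that decrement logic. This is not the exact code, it only shows which outstanding counts a single node-local allocation touches:

{code:java}
// Simplified paraphrase of AppSchedulingInfo#allocateNodeLocal; the real
// method differs in detail.
import org.apache.hadoop.yarn.api.records.ResourceRequest;

class AllocateNodeLocalSketch {
  void allocateNodeLocal(ResourceRequest nodeLocalRequest,
                         ResourceRequest rackLocalRequest,
                         ResourceRequest offSwitchRequest) {
    // One node-local allocation decrements numContainers at ALL three levels.
    nodeLocalRequest.setNumContainers(nodeLocalRequest.getNumContainers() - 1);
    rackLocalRequest.setNumContainers(rackLocalRequest.getNumContainers() - 1);
    offSwitchRequest.setNumContainers(offSwitchRequest.getNumContainers() - 1);
    // So recovering only the stored node-local RR after a preemption leaves
    // the rack-local and ANY asks under-counted (Case 2), and their
    // relaxLocality flags cannot be reconstructed at all (Case 1).
  }
}
{code}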

*My thoughts on how to implement this:*
In FiCaScheduler#allocate, appSchedulingInfo.allocate will be invoked. You could change appSchedulingInfo.allocate
to return a list of RRs, including the node/rack/any RRs where applicable,
and pass those RRs to RMContainerImpl.
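
To make this concrete, here is a minimal sketch under assumed signatures. The list return type and the extra RMContainerImpl constructor argument are my guesses, not a final design:

{code:java}
// Minimal sketch of the suggested flow; actual signatures may differ.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

class AllocateFlowSketch {
  // appSchedulingInfo.allocate, changed to return every RR it decremented
  // (node-local, rack-local, and ANY), so nothing is lost for recovery.
  List<ResourceRequest> allocate(ResourceRequest nodeLocalAsk,
      ResourceRequest rackLocalAsk, ResourceRequest anyAsk) {
    List<ResourceRequest> consumed = new ArrayList<ResourceRequest>();
    consumed.add(nodeLocalAsk);
    consumed.add(rackLocalAsk);
    consumed.add(anyAsk);
    return consumed;
  }

  // FiCaScheduler#allocate would then hand the full list to the
  // RMContainerImpl constructor, so a later preemption can re-add the
  // original asks with their numContainers and relaxLocality intact:
  //   new RMContainerImpl(container, appAttemptId, nodeId, user,
  //       rmContext, rrs);
}
{code}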

And could you please elaborate on this?
bq. AM would have asked for NodeLocal in another Hosts, which may not be able to recover.

Does this make sense to you? I'll review the minor issues and test cases in the next cycle.

Thanks,
Wangda

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.patch
>
>
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queues = a, b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Submit a big jobA to queue a which uses the full cluster capacity.
> Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity.
> JobA tasks which use queue b's capacity are preempted and killed.
> This caused the problem below:
> 1. A new container was allocated for jobA in Queue A as per a node update from an NM.
> 2. This container was then immediately preempted as per the preemption policy.
> Here the ACQUIRED at KILLED invalid-state exception occurred when the next AM heartbeat reached the RM.
> ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED
> This also caused the task to hit a 30-minute timeout, as the container had already been killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs



