hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
Date Fri, 06 Jun 2014 11:57:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019783#comment-14019783 ]

Sunil G commented on YARN-1408:

Hi [~leftnoteasy]
bq. I meant we need to keep a complete ResourceRequest, which includes the ResourceRequest itself plus the rack/ANY level ResourceRequests, in case the entire ResourceRequest is removed when its count reaches zero. I think we don't have a difference here, right?
Yes. :)

bq. pass a reference of SchedulerApplicationAttempt to RMContainer
My doubt is in this context: I feel we may not need a reference to SchedulerApplicationAttempt in RMContainer itself. Instead, we can keep the required information in RMContainer directly, so the handling stays specific to this case.
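For example, a rough sketch with simplified stand-in types (not the real YARN classes; the field and method names here are only indicative):

{code:java}
import java.util.List;

// Simplified stand-ins for the real YARN types, only to show the shape.
class ResourceRequestSketch {
  final String resourceName; // node, rack, or "*" (ANY)
  int numContainers;

  ResourceRequestSketch(String resourceName, int numContainers) {
    this.resourceName = resourceName;
    this.numContainers = numContainers;
  }
}

// The container keeps a copy of the request(s) it satisfied, so recovery
// at preemption time needs no back-pointer to SchedulerApplicationAttempt.
class RMContainerSketch {
  private final List<ResourceRequestSketch> requests;

  RMContainerSketch(List<ResourceRequestSketch> requests) {
    this.requests = requests;
  }

  List<ResourceRequestSketch> getRequests() {
    return requests;
  }
}
{code}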

bq. IMHO, it's better to put the ResourceRequest recovery logic in RMContainerImpl.FinishedTransition(); we can check if the original state is ALLOCATED and the event is KILL. The benefit of this choice is that we don't need to separately modify FairScheduler and CapacityScheduler. Makes sense?
A KILL event can happen not only from preemption, so I feel we need to handle this issue with respect to preemption only, correct? Other KILL events in RMContainer may come from a reserved container being killed, etc.
So we may need to do the recovery only in PreemptableResourceScheduler#killContainer. What do you feel?
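A rough sketch of what I mean (illustrative names only, not the real API):

{code:java}
// Recovery is wired into the preemption kill path only, so KILL events
// from other sources (e.g. a released reserved container) keep their
// current behaviour.
enum ContainerStateSketch { NEW, ALLOCATED, ACQUIRED, RUNNING, KILLED }

class PreemptionKillSketch {
  // recoverRequest stands in for "re-add this container's ResourceRequest
  // to the application's pending asks".
  static void killContainer(ContainerStateSketch state, Runnable recoverRequest) {
    // Only a container the AM has not launched yet represents demand that
    // must be put back before the container is killed.
    if (state == ContainerStateSketch.ALLOCATED
        || state == ContainerStateSketch.ACQUIRED) {
      recoverRequest.run();
    }
    // ... then continue with the normal kill / completed-container path.
  }
}
{code}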

bq. Yes, but we cannot reuse this API without modifying its definition. AppSchedulingInfo#updateResourceRequests will replace the original ResourceRequests with new ResourceRequests, and our requirement is only to increase the original ResourceRequest.
I agree with the concern about modifying the definition, but I still think it will be better if we can reuse it with a recovery mode.
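Something like the below, where the recovery mode only bumps the count on the existing request (or re-inserts it if the whole entry was dropped at zero) instead of replacing it (simplified types, indicative names):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a "recovery mode" on AppSchedulingInfo.
class AppSchedulingInfoSketch {
  // priority -> resourceName (node / rack / ANY) -> outstanding containers
  private final Map<Integer, Map<String, Integer>> asks = new HashMap<>();

  // Recovery increments the existing entry by one, re-creating it if the
  // whole ResourceRequest was removed when its count reached zero.
  void recoverRequest(int priority, String resourceName) {
    asks.computeIfAbsent(priority, p -> new HashMap<>())
        .merge(resourceName, 1, Integer::sum);
  }
}
{code}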

bq. And please note that we need to update QueueMetrics as well.
+1, yes, I agree; we have to do this as well.
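Roughly, the pending resources should go back up by one container's worth when we recover the request. The method name below mirrors QueueMetrics#incrPendingResources, but the exact signature here is only an assumption:

{code:java}
// Simplified sketch: keep the metrics consistent with the recovered ask.
class QueueMetricsSketch {
  private int pendingContainers;
  private long pendingMemoryMB;

  void incrPendingResources(int containers, long memoryMBPerContainer) {
    pendingContainers += containers;
    pendingMemoryMB += (long) containers * memoryMBPerContainer;
  }
}
{code}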

Thank you. 

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
> ----------------------------------------------------------------------------------------------
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch,
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queues: a, b
> Capacity of queue a = 80%
> Capacity of queue b = 20%
> Step 1: Assign a big jobA on queue a which uses the full cluster capacity.
> Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity.
> A jobA task that is using queue b's capacity is then preempted and killed.
> This caused the problem below:
> 1. A new container got allocated for jobA in queue a as per a node update from an NM.
> 2. This container was immediately preempted as per the preemption policy.
> Here the "ACQUIRED at KILLED" invalid state exception came when the next AM heartbeat reached the RM:
> ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED
> This also caused the task to hit a 30-minute timeout, as this container was already killed by preemption:
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs

This message was sent by Atlassian JIRA
