hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
Date Mon, 12 May 2014 04:47:15 GMT

    [ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994801#comment-13994801 ]

Sunil G commented on YARN-1408:
-------------------------------

Please consider the following scenario.
After allocating a container to an application, the CapacityScheduler (CS) decrements the
associated ResourceRequest count.
Once this container is selected for preemption, the preemption module in the RM kills it
regardless of the state the container is in.

Assume one container is in the ACQUIRED state [waiting for the launch event to become RUNNING].
If it is now marked for preemption, the container gets preempted.

Hence the next heartbeat to the AM carries the same container in both newlyAllocatedContainers
and completedContainers.
 [Allocation and kill happened within one AM heartbeat cycle]
An invalid state transition [ACQUIRED at KILLED] then occurs while newlyAllocatedContainers
is processed on the AM side. This causes the task to time out after 30 minutes.
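
The failure can be illustrated with a toy state machine. This is only a sketch: the states, events, and transition table are simplified assumptions of mine and not the actual RMContainerImpl code. It assumes the kill from preemption is processed before the acquire event for the same container.

```java
// Toy model only: the states, events, and transition table below are a
// simplified assumption, not the actual RMContainerImpl state machine.
class ContainerStateSketch {
    enum State { ALLOCATED, ACQUIRED, RUNNING, KILLED }
    enum Event { ACQUIRED, LAUNCHED, KILL }

    // Returns the next state, or throws if the event is not valid in the current state.
    static State transition(State current, Event event) {
        switch (current) {
            case ALLOCATED:
                if (event == Event.ACQUIRED) return State.ACQUIRED;
                if (event == Event.KILL) return State.KILLED;
                break;
            case ACQUIRED:
                if (event == Event.LAUNCHED) return State.RUNNING;
                if (event == Event.KILL) return State.KILLED;
                break;
            case RUNNING:
                if (event == Event.KILL) return State.KILLED;
                break;
            default:
                break; // KILLED is terminal: no event is accepted
        }
        throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }

    public static void main(String[] args) {
        State s = State.ALLOCATED;
        s = transition(s, Event.KILL);          // preemption kills the container first
        try {
            transition(s, Event.ACQUIRED);      // the next AM heartbeat then tries to acquire it
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Invalid event: ACQUIRED at KILLED"
        }
    }
}
```

Because KILLED accepts no further events in this model, any acquire that arrives after the kill is rejected, which is the shape of the error reported in the issue.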
 
If we remove the container from newlyAllocatedContainers, we can avoid the invalid state transition.
But this will cause the task to hang. [The RM has lost the resource request]
As explained above, the RM has already allocated a container (and decremented the request),
and the AM is waiting to receive that container to assign it to a task.
Because of the preemption, the container never arrives, so the task hangs.
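
A small bookkeeping sketch of why the task hangs. All names here (LostRequestSketch, pendingRequests, preemptSilently) are hypothetical and only model the counting described above; this is not scheduler code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the request bookkeeping; not actual scheduler code.
class LostRequestSketch {
    int pendingRequests;                                   // asks the RM still has to satisfy
    final List<String> newlyAllocated = new ArrayList<>(); // containers waiting for the next AM heartbeat

    LostRequestSketch(int asks) {
        this.pendingRequests = asks;
    }

    // Allocation decrements the pending request and queues the container for the AM.
    void allocate(String containerId) {
        pendingRequests--;
        newlyAllocated.add(containerId);
    }

    // Preemption that only hides the container from the AM; the request is not restored.
    void preemptSilently(String containerId) {
        newlyAllocated.remove(containerId);
    }

    // Containers the AM can still expect: pending asks plus containers not yet delivered.
    int stillReachable() {
        return pendingRequests + newlyAllocated.size();
    }

    public static void main(String[] args) {
        LostRequestSketch rm = new LostRequestSketch(1); // the AM asked for one container
        rm.allocate("container_1");
        rm.preemptSilently("container_1");
        // The AM still needs one container, but nothing is pending or queued for it.
        System.out.println("reachable containers: " + rm.stillReachable());
    }
}
```

In this model the AM's single ask is neither pending nor deliverable after the silent removal, so the task waits until it times out.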

I feel we should preempt only those containers that are in the RUNNING state. [~devaraj.k] and [~curino],
please share your thoughts.
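
A minimal sketch of that proposal, assuming a simplified container type. The class, enum, and method names are hypothetical and are not taken from ProportionalCapacityPreemptionPolicy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of "preempt only RUNNING containers"; not the actual preemption policy code.
class PreemptionFilterSketch {
    enum State { ALLOCATED, ACQUIRED, RUNNING }

    static final class Container {
        final String id;
        final State state;

        Container(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // Keeps only the candidates that have already reached RUNNING.
    static List<String> selectPreemptable(List<Container> candidates) {
        return candidates.stream()
                .filter(c -> c.state == State.RUNNING)
                .map(c -> c.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Container> candidates = Arrays.asList(
                new Container("c1", State.ACQUIRED),   // not yet launched: skipped
                new Container("c2", State.RUNNING),
                new Container("c3", State.ALLOCATED)); // not yet acquired: skipped
        System.out.println(selectPreemptable(candidates)); // prints "[c2]"
    }
}
```

With this filter, a container that has been allocated or acquired but not launched is never killed, so it cannot appear in both lists of the same heartbeat.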

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>             Fix For: 2.5.0
>
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch
>
>
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Submit a big jobA to queue a that uses the full cluster capacity.
> Step 2: Submit a jobB to queue b that would use less than 20% of the cluster capacity.
> A jobA task that was using queue b's capacity was preempted and killed.
> This caused the following problem:
> 1. A new container was allocated for jobA in queue a on a node update from an NM.
> 2. This container was immediately preempted.
> The ACQUIRED at KILLED invalid state exception then occurred when the next AM heartbeat reached the RM.
> ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED
> This also caused the task to time out after 30 minutes, because the container had already been killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs



--
This message was sent by Atlassian JIRA
(v6.2#6252)
