hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-299) Node Manager throws org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at DONE
Date Mon, 08 Jul 2013 21:27:51 GMT

    [ https://issues.apache.org/jira/browse/YARN-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702456#comment-13702456 ]

Omkar Vinit Joshi commented on YARN-299:
----------------------------------------

The patch looks good overall, but I think we need one additional fix for a related case.
The root cause is more evident in the YARN-820 logs. A container requests multiple resources,
and RESOURCE_LOCALIZED / RESOURCE_FAILED events can arrive for one or more of them between
the time the container receives its first RESOURCE_FAILED event and the time it deregisters
itself from the remaining resources. As a result, ContainerImpl can receive RESOURCE_FAILED /
RESOURCE_LOCALIZED events (for different resources) when the container is already in the DONE
state. So, like RESOURCE_FAILED, we should also ignore the RESOURCE_LOCALIZED event.

I can see one more issue in the logs, and it would be great to fix it as part of this jira
since it looks like a quick change. The LOG.info call below invokes toString on LocalizedResource,
which is not thread-safe for ref (a LinkedList is used internally). I think grabbing the write
lock inside toString will protect it from such exceptions. We need to check the other state
machines as well.

{code}
            } catch (ExecutionException e) {
              LOG.info("Failed to download rsrc " + assoc.getResource(),
                  e.getCause());
              LocalResourceRequest req = assoc.getResource().getRequest();
              publicRsrc.handle(new ResourceFailedLocalizationEvent(req,
                  e.getMessage()));
              assoc.getResource().unlock();
{code}
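
To make the toString concern concrete, here is a minimal sketch of guarding a LinkedList-backed reference list with a lock. `TrackedResource`, `addRef`, and `removeRef` are hypothetical names used only for illustration; this is not the actual LocalizedResource code. The comment above proposes taking the write lock inside toString; this sketch assumes a read lock is enough for a read-only traversal, provided every mutation of the list takes the write lock.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a resource that tracks which containers reference it.
class TrackedResource {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // LinkedList is not thread-safe, so every access goes through the lock.
  private final List<String> refs = new LinkedList<String>();
  private final String path;

  TrackedResource(String path) {
    this.path = path;
  }

  void addRef(String containerId) {
    lock.writeLock().lock();
    try {
      refs.add(containerId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  void removeRef(String containerId) {
    lock.writeLock().lock();
    try {
      refs.remove(containerId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  @Override
  public String toString() {
    // Hold the read lock while iterating so a concurrent addRef/removeRef
    // cannot modify the list in the middle of the traversal.
    lock.readLock().lock();
    try {
      StringBuilder sb = new StringBuilder("{ ").append(path).append(",[");
      for (String ref : refs) {
        sb.append('(').append(ref).append(')');
      }
      return sb.append("]}").toString();
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

With this shape, a logging call such as `LOG.info("Failed to download rsrc " + resource)` can run on one thread while another thread adds or removes references, without the iteration seeing a half-updated list.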

Any thoughts?
                
> Node Manager throws org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at DONE
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-299
>                 URL: https://issues.apache.org/jira/browse/YARN-299
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.0.1-alpha, 2.0.0-alpha
>            Reporter: Devaraj K
>            Assignee: Mayank Bansal
>         Attachments: YARN-299-trunk-1.patch
>
>
> {code:xml}
> 2012-12-31 10:36:27,844 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Can't handle this event at current state: Current: [DONE], eventType: [RESOURCE_FAILED]
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at DONE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:819)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:71)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:504)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:497)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> 	at java.lang.Thread.run(Thread.java:662)
> 2012-12-31 10:36:27,845 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1356792558130_0002_01_000001 transitioned from DONE to null
> {code}
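
The failure in the trace above can be illustrated with a small table-driven state machine. This is only a sketch under stated assumptions: `ContainerSketch`, its reduced state set, and the `CLEANUP_DONE` event are hypothetical, and it does not use the real StateMachineFactory API. It shows the idea discussed in this thread: registering self-transitions so that late RESOURCE_LOCALIZED / RESOURCE_FAILED events are ignored instead of raising an invalid-transition error at DONE.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, heavily simplified container state machine.
class ContainerSketch {
  enum State { LOCALIZING, LOCALIZATION_FAILED, DONE }
  enum Event { RESOURCE_LOCALIZED, RESOURCE_FAILED, CLEANUP_DONE }

  // Transition table: current state -> (event -> next state).
  private static final Map<State, Map<Event, State>> TABLE =
      new EnumMap<State, Map<Event, State>>(State.class);

  private static void add(State from, Event on, State to) {
    Map<Event, State> row = TABLE.get(from);
    if (row == null) {
      row = new EnumMap<Event, State>(Event.class);
      TABLE.put(from, row);
    }
    row.put(on, to);
  }

  static {
    add(State.LOCALIZING, Event.RESOURCE_LOCALIZED, State.LOCALIZING);
    add(State.LOCALIZING, Event.RESOURCE_FAILED, State.LOCALIZATION_FAILED);
    add(State.LOCALIZATION_FAILED, Event.CLEANUP_DONE, State.DONE);
    // Events for other resources may still arrive after the first failure
    // (assumed here to be safe to ignore).
    add(State.LOCALIZATION_FAILED, Event.RESOURCE_LOCALIZED, State.LOCALIZATION_FAILED);
    add(State.LOCALIZATION_FAILED, Event.RESOURCE_FAILED, State.LOCALIZATION_FAILED);
    // Late resource events at DONE are ignored via self-transitions.
    add(State.DONE, Event.RESOURCE_LOCALIZED, State.DONE);
    add(State.DONE, Event.RESOURCE_FAILED, State.DONE);
  }

  private State state = State.LOCALIZING;

  State getState() {
    return state;
  }

  void handle(Event event) {
    Map<Event, State> row = TABLE.get(state);
    State next = (row == null) ? null : row.get(event);
    if (next == null) {
      // Mirrors the failure mode in the trace: no transition is registered.
      throw new IllegalStateException("Invalid event: " + event + " at " + state);
    }
    state = next;
  }
}
```

Without the two DONE entries, a RESOURCE_FAILED delivered after the container reached DONE would hit the `IllegalStateException` branch, which corresponds to the exception reported in this issue.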

