hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (YARN-6837) When the LocalResource's visibility is null, the NodeManager will shutdown
Date Tue, 18 Jul 2017 14:59:02 GMT

     [ https://issues.apache.org/jira/browse/YARN-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reassigned YARN-6837:
--------------------------------

    Assignee: Jinjiang Ling

Thanks for the report and the patch!  Looking at the patch, I'm not a fan of letting an NPE
occur, then catching it and assuming we know where the NPE came from.  That is error-prone for
maintenance: someone could accidentally introduce another NPE, and then we would be
catching and suppressing it for the wrong reason, making things harder to debug.
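
As a minimal illustration of that concern, the sketch below contrasts the two styles. The Visibility enum, the NpeCatchSketch class, and both helper methods are hypothetical stand-ins invented for this example; they are not the NodeManager code or the attached patch. The point is only that a catch block cannot tell which reference was null.

```java
import java.util.Map;

// Hypothetical stand-in for a visibility setting; not the real YARN enum.
enum Visibility { PUBLIC, PRIVATE, APPLICATION }

final class NpeCatchSketch {

    /** Fragile: every NPE raised in the try body is blamed on a missing visibility. */
    static String describeByCatching(Map<String, Visibility> resources, String key) {
        try {
            // Either the map or the looked-up value may be null here.
            return resources.get(key).name();
        } catch (NullPointerException e) {
            return "visibility missing"; // possibly the wrong diagnosis
        }
    }

    /** Explicit: each null case is checked and reported on its own. */
    static String describeByChecking(Map<String, Visibility> resources, String key) {
        if (resources == null) {
            return "resource map missing";
        }
        Visibility visibility = resources.get(key);
        if (visibility == null) {
            return "visibility missing";
        }
        return visibility.name();
    }

    public static void main(String[] args) {
        // The map itself is null, yet the catching variant reports a visibility problem.
        System.out.println(describeByCatching(null, "job.jar")); // visibility missing
        System.out.println(describeByChecking(null, "job.jar")); // resource map missing
    }
}
```

With a null map, the catching variant still reports a missing visibility, which is the kind of misattribution that makes later debugging harder.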

Speaking of suppressing exceptions, the patch simply logs a warning when there is no visibility
and then continues.  What happens to the resource after that?  It doesn't look
like we add it to any localizer list, so I think the container will just hang, waiting
for a resource that will never be localized.

A better way to handle this is to sanity-check the container launch request in ContainerManagerImpl#startContainerInternal
and throw an exception if the request is malformed.  This has the benefit of propagating the
error back to the client making the bad request, so they know both that the request
was bad and that the corresponding container will not be launched.  This looks similar to YARN-6403,
and the resource visibility was missed in that change.
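
A rough sketch of what such an up-front check could look like follows. It is self-contained and deliberately does not use the Hadoop classes: ResourceSpec, ResourceType, ResourceVisibility, and LaunchRequestValidator are hypothetical stand-ins, and IllegalArgumentException is used only for brevity. The real check would presumably throw whatever exception type startContainerInternal already uses to reject a request.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for the YARN records; not the real API.
enum ResourceType { FILE, ARCHIVE, PATTERN }
enum ResourceVisibility { PUBLIC, PRIVATE, APPLICATION }

final class ResourceSpec {
    final String url;
    final ResourceType type;
    final ResourceVisibility visibility;

    ResourceSpec(String url, ResourceType type, ResourceVisibility visibility) {
        this.url = url;
        this.type = type;
        this.visibility = visibility;
    }
}

final class LaunchRequestValidator {

    /** Rejects a malformed request before any container state is created. */
    static void validate(Map<String, ResourceSpec> localResources) {
        for (Map.Entry<String, ResourceSpec> entry : localResources.entrySet()) {
            String name = entry.getKey();
            ResourceSpec resource = entry.getValue();
            if (resource == null) {
                throw new IllegalArgumentException("Null resource for local resource " + name);
            }
            if (resource.url == null) {
                throw new IllegalArgumentException("Null resource URL for local resource " + name);
            }
            if (resource.type == null) {
                throw new IllegalArgumentException("Null resource type for local resource " + name);
            }
            if (resource.visibility == null) {
                throw new IllegalArgumentException(
                    "Null resource visibility for local resource " + name);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, ResourceSpec> request = new LinkedHashMap<>();
        // Mirrors the report: the visibility was never set.
        request.put("job.jar", new ResourceSpec("hdfs:///tmp/job.jar", ResourceType.FILE, null));
        try {
            validate(request);
        } catch (IllegalArgumentException e) {
            // The caller sees the reason directly instead of a hung container.
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because the check runs while the request is still being handled, the failure is reported to the caller rather than surfacing later on the dispatcher thread.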

> When the LocalResource's visibility is null, the NodeManager will shutdown
> --------------------------------------------------------------------------
>
>                 Key: YARN-6837
>                 URL: https://issues.apache.org/jira/browse/YARN-6837
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jinjiang Ling
>            Assignee: Jinjiang Ling
>         Attachments: YARN-6837.patch
>
>
> When writing a YARN application, I created a LocalResource like this:
> {quote}
> LocalResource resource = Records.newRecord(LocalResource.class);
> {quote}
> Because I forgot to set its visibility, the job failed when I submitted it.
> The NodeManagers also shut down one by one at the same time, with a NullPointerException in
> the NodeManager log:
> {quote}
> 2017-07-18 17:54:09,289 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop       IP=10.43.156.177        OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1499221670783_0067    CONTAINERID=container_1499221670783_0067_02_000003
> 2017-07-18 17:54:09,292 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.addResources(ResourceSet.java:84)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:868)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:819)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1684)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:96)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1418)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1411)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-07-18 17:54:09,292 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1499221670783_0067_02_000002 by user hadoop
> {quote}
> Then I changed my code but still set the visibility to null:
> {quote}
> LocalResource resource = LocalResource.newInstance(
>                                 URL.fromURI(dst.toUri()),
>                                 LocalResourceType.FILE, null, // visibility is still null
>                                 fileStatus.getLen(), fileStatus.getModificationTime());
> {quote}
> The error still happened.
> Finally, I set the visibility to a correct value, and the error did not occur again.
> So I think a null LocalResource visibility causes the NodeManager to shut down.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
