hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
Date Thu, 30 Mar 2017 14:31:41 GMT

     [ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jason Lowe reassigned YARN-6403:
--------------------------------

            Assignee: Tao Yang
    Target Version/s: 2.8.1

Submitting patch so Jenkins can comment on it as well.

bq. For the client-side change, IIUIC the generated protobuf code won't throws NPE for this
case actually.

The generated protobuf code won't throw NPE for _this_ particular case, but it does throw
NPE for _other_ fields that you try to set directly to null.  For example, if one tried to
call setLocalResources(null) on the protobuf (not the PBImpl) then the generated protobuf
code explicitly throws NPE.  As such, I believe it's appropriate to throw NPE in our client
check code as well rather than a generic RuntimeException.  It's a minor point since the net
effect will be similar for the client in either case.

The {{localResources != null}} check in checkLocalResources is not necessary since the calling
code explicitly checks for it already.

The error message should be a bit more specific.  It just logs the local resource as a string,
but unfortunately that won't log the fact that the resource itself is null.  We should change
"Got invalid local resource" to something like "Null resource URL for local resource".  Throwing
NPE instead of RuntimeException would at least hint to the user that there's a problem with
a null field here, but we should also be more explicit in the error message to help them along.

This test code:
{code}
    boolean throwsException = false;
    try {
[....]
      containerLaunchContext.setLocalResources(localResources);
    } catch (Throwable e) {
      throwsException = true;
      Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
    }
    Assert.assertTrue(throwsException);
{code}
can be simplified to this:
{code}
    try {
[....]
      containerLaunchContext.setLocalResources(localResources);
      Assert.fail("setting an invalid local resource should be an error!");
    } catch (RuntimeException e) {
      Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
    }
{code}
Note that we should be checking for the specific exception type we are throwing in the test
rather than Throwable, since this is essentially part of the client API.

testClientFailureWithInvalidResource does not belong in ContainerManagerImpl since it has
nothing to do with ContainerManagerImpl.  It's really a test for ContainerLaunchContextPBImpl
and should be moved to an appropriate place in the yarn-common project.  TestApplicationClientProtocolRecords
looks like a decent place since it's already has another test for ContainerLaunchContextPBImpl
there.  The unit test method should be renamed to something more appropriate when moved there,
like testCLCPBImplNullResource.


> Invalid local resource request can raise NPE and make NM exit
> -------------------------------------------------------------
>
>                 Key: YARN-6403
>                 URL: https://issues.apache.org/jira/browse/YARN-6403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>         Attachments: YARN-6403.001.patch, YARN-6403.002.patch
>
>
> Recently we found this problem on our testing environment. The app that caused this problem
added a invalid local resource request(have no location) into ContainerLaunchContext like
this:
> {code}
>     localResources.put("test", LocalResource.newInstance(location,
>         LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
>         System.currentTimeMillis()));
>     ContainerLaunchContext amContainer =
>         ContainerLaunchContext.newInstance(localResources, environment,
>           vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null although app doesn't expect that. This mistake
cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were
deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> NPE occured when created LocalResourceRequest instance for invalid resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>       throws URISyntaxException {
>     this(resource.getResource().toPath(),  //NPE occurred here
>         resource.getTimestamp(),
>         resource.getType(),
>         resource.getVisibility(),
>         resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of local resource request now, but we could avoid damaging
the cluster. Perhaps we can verify the resource both in ContainerLaunchContext and LocalResourceRequest?
Please feel free to give your suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message