Date: Thu, 30 Mar 2017 04:11:41 +0000 (UTC)
From: "Tao Yang (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

     [ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-6403:
---------------------------
    Attachment: YARN-6403.002.patch

[~jlowe] Thanks for correcting me. The last server-side change was not correct, and I have fixed it as you suggested. For the client-side change, IIUC the generated protobuf code won't actually throw an NPE in this case. Unit tests for both the client and server changes have been added. I have attached a new patch for review; please correct me if I missed something.

> Invalid local resource request can raise NPE and make NM exit
> -------------------------------------------------------------
>
>                 Key: YARN-6403
>                 URL: https://issues.apache.org/jira/browse/YARN-6403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>         Attachments: YARN-6403.001.patch, YARN-6403.002.patch
>
>
> Recently we found this problem on our testing environment.
> The app that caused this problem added an invalid local resource request (one with no location) to the ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
>     LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
>     System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
>     ContainerLaunchContext.newInstance(localResources, environment,
>     vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app did not expect that. This mistake caused several NMs to exit with the NPE below, and they could not restart until the NM recovery dirs were deleted.
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> The NPE occurred while creating the LocalResourceRequest instance for the invalid resource request.
> {code}
> public LocalResourceRequest(LocalResource resource)
>     throws URISyntaxException {
>   this(resource.getResource().toPath(), // NPE occurred here
>       resource.getTimestamp(),
>       resource.getType(),
>       resource.getVisibility(),
>       resource.getPattern());
> }
> {code}
> We can't currently guarantee that a local resource request is valid, but we can at least keep an invalid one from damaging the cluster. Perhaps we can verify the resource both in ContainerLaunchContext and in LocalResourceRequest? Please feel free to give your suggestions.
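
For illustration only, the kind of up-front check suggested in the description could look roughly like the sketch below. It is not the attached patch and does not use the Hadoop API: ResourceSpec and findInvalid are hypothetical stand-ins invented here, with ResourceSpec playing the role of LocalResource. The idea is simply to collect the invalid entries when the request is received, so that the request can be rejected with a diagnostic instead of a null location reaching the dispatcher thread later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalResourceCheck {

    /** Hypothetical stand-in for a local resource; not a Hadoop class. */
    static final class ResourceSpec {
        final String location;   // null when the client supplied no URL
        final String type;
        final String visibility;

        ResourceSpec(String location, String type, String visibility) {
            this.location = location;
            this.type = type;
            this.visibility = visibility;
        }
    }

    /**
     * Returns one message per invalid entry; an empty list means the
     * request may proceed. Nothing is dereferenced before it is checked.
     */
    static List<String> findInvalid(Map<String, ResourceSpec> localResources) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, ResourceSpec> e : localResources.entrySet()) {
            ResourceSpec r = e.getValue();
            if (r == null) {
                problems.add(e.getKey() + ": resource is null");
            } else if (r.location == null) {
                problems.add(e.getKey() + ": location is null");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, ResourceSpec> resources = new LinkedHashMap<>();
        // Mirrors the failing case in the report: a resource with no location.
        resources.put("test", new ResourceSpec(null, "FILE", "PRIVATE"));
        resources.put("ok", new ResourceSpec("hdfs://nn/tmp/app.jar", "FILE", "PRIVATE"));
        System.out.println(findInvalid(resources));
    }
}
```

Reporting every invalid entry, rather than stopping at the first one, is a choice made for this sketch; a real check could equally fail fast on the first problem.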