Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 3D2C3200C46 for ; Wed, 29 Mar 2017 16:43:48 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 3BAB2160B8A; Wed, 29 Mar 2017 14:43:48 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 5C9C9160B7C for ; Wed, 29 Mar 2017 16:43:47 +0200 (CEST) Received: (qmail 39207 invoked by uid 500); 29 Mar 2017 14:43:46 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 39196 invoked by uid 99); 29 Mar 2017 14:43:46 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 29 Mar 2017 14:43:46 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 2720ECD065 for ; Wed, 29 Mar 2017 14:43:46 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id zQuDfmGojELH for ; Wed, 29 Mar 2017 14:43:45 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 0733D5F253 for ; Wed, 29 Mar 2017 14:43:44 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id D83D3E0534 for ; Wed, 29 Mar 2017 14:43:42 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id E3B4E24180 for ; Wed, 29 Mar 2017 14:43:41 +0000 (UTC) Date: Wed, 29 Mar 2017 14:43:41 +0000 (UTC) From: "Jason Lowe (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 29 Mar 2017 14:43:48 -0000 [ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947251#comment-15947251 ] Jason Lowe commented on YARN-6403: ---------------------------------- Thanks for the patch! This patch is changing the client code but not the server code. A client who doesn't have the fix or a malicious client can still construct a malformed protobuf that is missing the resource location. Minimally the server needs to validate the request. The client-side change is nice to have but technically not necessary to fix the issue. Nit: Speaking of the client side change, I think NullPointerException is more appropriate to throw in this case. That's what the generated protobuf code already throws when trying to set protobuf fields to null. > Invalid local resource request can raise NPE and make NM exit > ------------------------------------------------------------- > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Affects Versions: 2.8.0 > Reporter: Tao Yang > Attachments: YARN-6403.001.patch > > > Recently we found this problem on our testing environment. The app that caused this problem added a invalid local resource request(have no location) into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This mistake cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could avoid damaging the cluster. Perhaps we can verify the resource both in ContainerLaunchContext and LocalResourceRequest? Please feel free to give your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: yarn-issues-help@hadoop.apache.org