Date: Wed, 1 Oct 2014 02:12:33 +0000 (UTC)
From: "Andrew Or (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-6116) Start container with auxiliary service data race condition

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated MAPREDUCE-6116:
---------------------------------
    Component/s: nodemanager

> Start container with auxiliary service data race condition
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-6116
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6116
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>         Environment: HDP 2.1 on SLES 11
>            Reporter: Andrew Or
>
> This shares the same symptoms
> as MAPREDUCE-2947, which is supposedly fixed. The stack trace I ran into is very similar:
> {code}
> Exception in thread "ContainerLauncher #1" java.lang.Error: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException
>     at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>     at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:224)
>     at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:93)
>     at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:63)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     ... 2 more
> Caused by: java.lang.IllegalArgumentException
>     at java.nio.Buffer.position(Buffer.java:236)
>     at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:147)
>     at java.nio.ByteBuffer.get(ByteBuffer.java:694)
>     at com.google.protobuf.ByteString.copyFrom(ByteString.java:217)
>     at com.google.protobuf.ByteString.copyFrom(ByteString.java:229)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:196)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.convertToProtoFormat(ContainerLaunchContextPBImpl.java:101)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl$2$1.next(ContainerLaunchContextPBImpl.java:312)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl$2$1.next(ContainerLaunchContextPBImpl.java:300)
>     at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>     at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>     at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$Builder.addAllServiceData(YarnProtos.java:32918)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.addServiceDataToProto(ContainerLaunchContextPBImpl.java:323)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToBuilder(ContainerLaunchContextPBImpl.java:112)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToProto(ContainerLaunchContextPBImpl.java:128)
>     at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.getProto(ContainerLaunchContextPBImpl.java:70)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.convertToProtoFormat(StartContainerRequestPBImpl.java:156)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToBuilder(StartContainerRequestPBImpl.java:85)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToProto(StartContainerRequestPBImpl.java:95)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.getProto(StartContainerRequestPBImpl.java:57)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.convertToProtoFormat(StartContainersRequestPBImpl.java:137)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.addLocalRequestsToProto(StartContainersRequestPBImpl.java:97)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToBuilder(StartContainersRequestPBImpl.java:79)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToProto(StartContainersRequestPBImpl.java:72)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.getProto(StartContainersRequestPBImpl.java:48)
>     at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:93)
>     at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
>     ... 5 more
> {code}
> In my application I was calling `ContainerLaunchContext#setServiceData` with my custom shuffle secret. This exception happens frequently but not always, which leads me to conjecture that it's a race condition. After seeing MAPREDUCE-2947, I manually synchronized all of my calls to `NMClient#startContainer`, and I never ran into this issue again. I suspect that there is still a race condition in the AuxiliaryService code even after MAPREDUCE-2947.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
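
The workaround described in the report (serializing every call to `NMClient#startContainer`) can be sketched without Hadoop on the classpath. The trace fails inside `ByteBuffer.get` while the service data is being copied, which is consistent with several launcher threads reading one shared `ByteBuffer`, whose position is mutable state; that reading is an inference from the trace, not something the report confirms. In the Java sketch below, `ServiceDataSketch`, `drain`, `drainPrivateView`, `drainLocked`, and `countCorrect` are hypothetical names, and `drain` only stands in for the serialization step.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ServiceDataSketch {
    // Hypothetical lock standing in for "synchronize all calls to startContainer".
    private static final Object LAUNCH_LOCK = new Object();

    /** Stand-in for serialization: a relative bulk get, which moves buf's position. */
    static byte[] drain(ByteBuffer buf) {
        buf.rewind();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** Option 1: read through a private view; duplicate() shares the bytes
     *  but has its own position and limit. */
    static byte[] drainPrivateView(ByteBuffer shared) {
        return drain(shared.duplicate());
    }

    /** Option 2: serialize readers behind one lock, mirroring the reported workaround. */
    static byte[] drainLocked(ByteBuffer shared) {
        synchronized (LAUNCH_LOCK) {
            return drain(shared);
        }
    }

    /** Runs `threads` concurrent reads and counts how many returned the expected bytes. */
    static int countCorrect(ByteBuffer shared, byte[] expected, int threads, boolean useLock) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger ok = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                byte[] got = useLock ? drainLocked(shared) : drainPrivateView(shared);
                if (Arrays.equals(got, expected)) {
                    ok.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ok.get();
    }

    public static void main(String[] args) {
        byte[] secret = "shuffle-secret".getBytes(StandardCharsets.UTF_8);
        ByteBuffer shared = ByteBuffer.wrap(secret);
        System.out.println(countCorrect(shared, secret, 8, false)); // private views
        System.out.println(countCorrect(shared, secret, 8, true));  // lock-guarded
    }
}
```

Both modes should report 8 correct reads out of 8. The private-view variant avoids the lock entirely, at the cost of one small wrapper object per launch; whether either approach addresses the underlying issue in YARN is not established by the report.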