Date: Wed, 13 Jul 2016 21:07:20 +0000 (UTC)
From: "Wangda Tan (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5374) Preemption causing communication loop

    [ https://issues.apache.org/jira/browse/YARN-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375759#comment-15375759 ]

Wangda Tan commented on YARN-5374:
----------------------------------

[~LucasW], it seems to me that the issue is caused by the Spark application not handling the container preemption message well. If so, I suggest dropping a mail to the Spark mailing list or filing a Spark JIRA instead.
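As a concrete illustration of the suggestion above: a minimal sketch (not Spark's actual code; the class name and println reporting are illustrative) of how an ApplicationMaster can tell a preempted container apart from a genuinely failed one. It assumes the container-completion callback from hadoop-yarn-client's AMRMClientAsync and the ContainerExitStatus.PREEMPTED constant that the ResourceManager reports for preempted containers.

{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Invoked from AMRMClientAsync.CallbackHandler#onContainersCompleted.
public class PreemptionAwareCompletionHandler {

    public void onContainersCompleted(List<ContainerStatus> statuses) {
        for (ContainerStatus status : statuses) {
            if (status.getExitStatus() == ContainerExitStatus.PREEMPTED) {
                // The RM preempted this container: drop the executor's state
                // immediately instead of retrying RPCs to a dead endpoint.
                System.out.println("Container " + status.getContainerId()
                        + " was preempted; releasing executor state.");
            } else if (status.getExitStatus() != ContainerExitStatus.SUCCESS) {
                // A genuine failure: count it against the failure budget and
                // request a replacement container from the RM.
                System.out.println("Container " + status.getContainerId()
                        + " failed with exit status " + status.getExitStatus());
            }
        }
    }
}
{code}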
> Preemption causing communication loop
> -------------------------------------
>
>                 Key: YARN-5374
>                 URL: https://issues.apache.org/jira/browse/YARN-5374
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, nodemanager, resourcemanager, yarn
>    Affects Versions: 2.7.1
>        Environment: YARN version: Hadoop 2.7.1-amzn-0
> AWS EMR cluster running:
> 1 x r3.8xlarge (master)
> 52 x r3.8xlarge (core)
> Spark version: 1.6.0
> Scala version: 2.10.5
> Java version: 1.8.0_51
> Input size: ~10 TB
> Input coming from S3
> Queue configuration:
> Dynamic allocation: enabled
> Preemption: enabled
> Q1: 70% capacity with max of 100%
> Q2: 30% capacity with max of 100%
> Job configuration:
> Driver memory = 10g
> Executor cores = 6
> Executor memory = 10g
> Deploy mode = cluster
> Master = yarn
> maxResultSize = 4g
> Shuffle manager = hash
>            Reporter: Lucas Winkelmann
>            Priority: Blocker
>
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait between 15-30 minutes (completing this job with 100% of the cluster available takes about 1 hour, so job 1 is between 25-50% complete). Note that if I wait less time the issue sometimes does not occur; it appears to happen only after job 1 is at least 25% complete.
> I launch job 2 into Q2, and preemption occurs on Q1, shrinking job 1 to 70% of cluster utilization.
> At this point job 1 basically halts progress while job 2 continues to execute as normal and finishes. Job 1 then either:
> - Fails its attempt and restarts. By the time this attempt fails, the other job is already complete, meaning the second attempt has full cluster availability, and it finishes.
> - Remains at its current progress and simply does not finish (I have waited ~6 hrs before finally killing the application).
>
> Looking into the error log, there is this constant error message:
> WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>
> My observations have led me to believe that the application master does not know that this container has been killed and keeps trying to remove its executor, until it either fails the attempt or goes on trying to remove the executor indefinitely.
>
> I have done much digging online for anyone else experiencing this issue but have come up with nothing.
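For context on the "in X attempts" warning quoted above: a hypothetical, self-contained sketch of the bounded-retry pattern that warning reflects. In Spark 1.6 the retry budget for such RPC sends is governed by the spark.rpc.numRetries (default 3) and spark.rpc.retry.wait (default 3s) settings; everything else below (class and method names, the Callable shape) is illustrative, not Spark's implementation. The point is that when the executor's container is already gone, every attempt fails the same way, and the sender must eventually mark the executor as lost instead of resending the message.

{code:java}
import java.util.concurrent.Callable;

public class BoundedRpcRetry {

    // Defaults chosen to mirror Spark's documented spark.rpc.numRetries (3)
    // and spark.rpc.retry.wait (3s) settings.
    private static final int MAX_ATTEMPTS = 3;
    private static final long RETRY_WAIT_MS = 3000L;

    /** Attempts the RPC up to MAX_ATTEMPTS times, then gives up. */
    public static <T> T sendWithRetry(Callable<T> rpc) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return rpc.call();
            } catch (Exception e) {
                lastFailure = e;
                // Matches the shape of the logged warning above.
                System.err.println("Error sending message in " + attempt + " attempts");
                Thread.sleep(RETRY_WAIT_MS);
            }
        }
        // Giving up here is what lets the caller treat the executor as lost
        // and move on; retrying a dead endpoint without bound is the hang
        // the reporter describes.
        throw lastFailure;
    }
}
{code}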