From: "Marcelo Vanzin (JIRA)"
To: issues@spark.apache.org
Date: Tue, 27 Feb 2018 19:14:00 +0000 (UTC)
Subject: [jira] [Resolved] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

     [ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-23365.
------------------------------------
    Resolution: Fixed
    Fix Version/s: 2.3.1
                   2.4.0

Issue resolved by pull request 20604
[https://github.com/apache/spark/pull/20604]

> DynamicAllocation with failure in straggler task can lead to a hung spark job
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-23365
>                 URL: https://issues.apache.org/jira/browse/SPARK-23365
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.2, 2.2.1, 2.3.0
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>            Priority: Major
>             Fix For: 2.4.0, 2.3.1
>
>
> Dynamic Allocation can lead to a Spark app getting stuck with 0 executors requested when the executors running the last tasks of a taskset fail (e.g. with an OOM).
> This happens when {{ExecutorAllocationManager}}'s internal target number of executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target number. The {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks are active or pending in submitted stages and computes how many executors are needed for them, and as tasks finish it actively decreases that count, informing the {{CGSB}} along the way; (2) when it decides executors have been idle for long enough, it requests that the {{CGSB}} kill them -- this also tells the {{CGSB}} to update its target number of executors: https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622
> So when there is just one task left, you could have the following sequence of events:
> (1) the {{EAM}} sets the desired number of executors to 1 and updates the {{CGSB}} too;
> (2) while that final task is still running, the other executors cross the idle timeout and the {{EAM}} requests that the {{CGSB}} kill them;
> (3) now the {{EAM}} has a target of 1 executor, while the {{CGSB}} has a target of 0 executors.
> If the final task completed normally at this point, everything would be OK: the next taskset would get submitted, the {{EAM}} would increase its target number of executors, and it would update the {{CGSB}}.
> But if the executor for that final task fails (e.g. with an OOM), then the {{EAM}} thinks it [doesn't need to update anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], because its target is already 1, which is all it needs to rerun that final task; and the {{CGSB}} doesn't update anything either, since its target is 0.
> I think you can determine whether this is the cause of a stuck app by looking for
> {noformat}
> yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> {noformat}
> in the logs of the ApplicationMaster (at least on YARN).
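[Editor's note: the toy Scala sketch below is not part of the original report and does not use Spark's real classes; the names {{Backend}}, {{AllocationManager}}, and their methods are invented purely to illustrate how the two targets described above can drift apart and then stay apart.]

{code}
// Hypothetical stand-ins for CoarseGrainedSchedulerBackend (CGSB) and
// ExecutorAllocationManager (EAM); only the target-count bookkeeping is modeled.
object TargetDesyncSketch {

  // Plays the role of the CGSB: holds the total number of executors requested
  // from the cluster manager.
  class Backend {
    var requestedTotal = 0
    def requestTotal(n: Int): Unit = { requestedTotal = n }
    // Killing idle executors with target adjustment also lowers the requested
    // total; per the report, this is how the CGSB ends up asking for 0.
    def killIdleExecutors(newTotal: Int): Unit = { requestedTotal = newTotal }
  }

  // Plays the role of the EAM: it only forwards its target when the value changes.
  class AllocationManager(backend: Backend) {
    private var target = 0
    def updateTarget(needed: Int): Unit = {
      if (needed != target) {   // an unchanged target sends nothing -- the gap the bug hits
        target = needed
        backend.requestTotal(needed)
      }
    }
    def currentTarget: Int = target
  }

  def main(args: Array[String]): Unit = {
    val cgsb = new Backend
    val eam  = new AllocationManager(cgsb)

    eam.updateTarget(1)        // (1) one straggler task left: both sides want 1
    cgsb.killIdleExecutors(0)  // (2) idle executors killed: CGSB's total drops to 0
    eam.updateTarget(1)        // (3) straggler's executor fails and the task must rerun:
                               //     the EAM already wants 1, so no new request is sent
    println(s"EAM target = ${eam.currentTarget}, CGSB requested total = ${cgsb.requestedTotal}")
    // Prints "EAM target = 1, CGSB requested total = 0": the app waits forever for an executor.
  }
}
{code}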
> You can reproduce this with this test app, run with {{--conf "spark.dynamicAllocation.minExecutors=1" --conf "spark.dynamicAllocation.maxExecutors=5" --conf "spark.dynamicAllocation.executorIdleTimeout=5s"}}
> {code}
> import org.apache.spark.SparkEnv
> sc.setLogLevel("INFO")
> sc.parallelize(1 to 10000, 10000).count()
> val execs = sc.parallelize(1 to 1000, 1000).map { _ => SparkEnv.get.executorId }.collect().toSet
> val badExec = execs.head
> println("will kill exec " + badExec)
> new Thread() {
>   override def run(): Unit = {
>     Thread.sleep(10000)
>     println("about to kill exec " + badExec)
>     sc.killExecutor(badExec)
>   }
> }.start()
> sc.parallelize(1 to 5, 5).mapPartitions { itr =>
>   val exec = SparkEnv.get.executorId
>   if (exec == badExec) {
>     Thread.sleep(20000) // long enough that all the other tasks finish, and the executors cross the idle timeout
>     // meanwhile, something else should kill this executor
>     itr
>   } else {
>     itr
>   }
> }.collect()
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org