Date: Tue, 7 Feb 2017 20:21:41 +0000 (UTC)
From: "Kay Ousterhout (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler

     [ https://issues.apache.org/jira/browse/SPARK-19502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout updated SPARK-19502:
-----------------------------------
    Description:
There are a [few lines of code in the DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215) to re-submit shuffle map stages when some of the tasks fail. My understanding is that there should be a 1:1 mapping between pending tasks (tasks that haven't completed successfully) and available output locations, so that code should never be reachable. Furthermore, the approach taken by that code (to re-submit an entire stage as a result of task failures) is not how we handle task failures in a stage (the lower-level scheduler resubmits the individual tasks), which is what the five-year-old TODO on that code seems to imply should be done.

The big caveat is that there's a bug being fixed in SPARK-19263 that means there is *not* a 1:1 relationship between pendingTasks and available outputLocations, so that code is serving as a (buggy) band-aid. This should be fixed once we resolve SPARK-19263.

cc [~imranr] [~markhamstra] [~jinxing6042@126.com] (let me know if any of you see any reason we actually do need that code)

  was:
There are a [few lines of code in the DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215) to re-submit shuffle map stages when some of the tasks fail. My understanding is that there should be a 1:1 mapping between pending tasks (tasks that haven't completed successfully) and available output locations, so that code should never be reachable. Furthermore, the approach taken by that code (to re-submit an entire stage as a result of task failures) is not how we handle task failures in a stage (the lower-level scheduler resubmits the individual tasks), which is what the five-year-old TODO on that code seems to imply should be done.

The big caveat is that there's a bug being fixed in SPARK-19263 that means there is *not* a 1:1 relationship between pendingTasks and available outputLocations, so that code is serving as a (buggy) band-aid. This should be fixed once we resolve SPARK-19263.

cc [~imranr] [~markhamstra] [~jinxing6042@126.com]

> Remove unnecessary code to re-submit stages in the DAGScheduler
> ---------------------------------------------------------------
>
>                 Key: SPARK-19502
>                 URL: https://issues.apache.org/jira/browse/SPARK-19502
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.1.1
>            Reporter: Kay Ousterhout
>            Assignee: Kay Ousterhout
>            Priority: Minor
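
The 1:1 argument in the description can be illustrated with a toy model. What follows is a minimal Python sketch, not Spark code: ToyShuffleMapStage and every method on it are hypothetical names made up for this illustration, and the "inconsistent" case is only a stand-in for the kind of mismatch the caveat mentions, not a reproduction of SPARK-19263. It assumes that a task success both clears the pending entry and registers an output location, in which case the "nothing pending but outputs missing" branch cannot be reached.

```python
class ToyShuffleMapStage:
    """Toy stand-in for a shuffle map stage (hypothetical, not Spark's class)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # Every partition starts out pending, with no registered output.
        self.pending_partitions = set(range(num_partitions))
        self.output_locs = {}  # partition -> host holding its map output

    def task_succeeded(self, partition, host):
        # Assumption: success clears "pending" and registers an output together.
        self.pending_partitions.discard(partition)
        self.output_locs[partition] = host

    def task_failed(self, partition):
        # Per-task retry: the partition stays pending and is handed back for
        # another attempt; the stage itself is not resubmitted.
        return partition

    def missing_outputs(self):
        return set(range(self.num_partitions)) - set(self.output_locs)

    def invariant_holds(self):
        # 1:1 mapping: a partition is pending exactly when it has no output.
        return self.pending_partitions == self.missing_outputs()

    def needs_stage_resubmit(self):
        # The questioned branch: nothing pending, yet some outputs are missing.
        return not self.pending_partitions and bool(self.missing_outputs())


# Normal path: one task fails, is retried individually, then succeeds.
healthy = ToyShuffleMapStage(3)
healthy.task_succeeded(0, "hostA")
retry = healthy.task_failed(1)
healthy.task_succeeded(retry, "hostB")
healthy.task_succeeded(2, "hostA")
print(healthy.invariant_holds(), healthy.needs_stage_resubmit())

# Hypothetical inconsistency: a pending entry is cleared without an output
# being registered, which is the only way the branch becomes reachable here.
inconsistent = ToyShuffleMapStage(2)
inconsistent.task_succeeded(0, "hostA")
inconsistent.pending_partitions.discard(1)
print(inconsistent.invariant_holds(), inconsistent.needs_stage_resubmit())
```

In this model the stage-level resubmit only fires in the second case, where the two pieces of bookkeeping have drifted apart; while they stay in sync, per-task retries are enough.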