Date: Tue, 26 Jun 2018 00:09:00 +0000 (UTC)
From: "Marcelo Vanzin (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13165916.1528919256000.8066.1529971740470@Atlassian.JIRA>
In-Reply-To: <JIRA.13165916.1528919256000@Atlassian.JIRA>
References: <JIRA.13165916.1528919256000@Atlassian.JIRA> <JIRA.13165916.1528919256490@jira-lw-us.apache.org>
Subject: [jira] [Resolved] (SPARK-24552) Task attempt numbers are reused
 when stages are retried


     [ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-24552.
------------------------------------
       Resolution: Fixed
         Assignee: Ryan Blue
    Fix Version/s: 2.4.0
                   2.3.2
                   2.2.2

Giving credit to Ryan since he found the issue and provided the initial fix, although the final fix was a little more extensive.

> Task attempt numbers are reused when stages are retried
> -------------------------------------------------------
>
>                 Key: SPARK-24552
>                 URL: https://issues.apache.org/jira/browse/SPARK-24552
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>            Priority: Blocker
>             Fix For: 2.2.2, 2.3.2, 2.4.0
>
>
> When stages are retried due to shuffle failures, task attempt numbers are reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so that writers can use the attempt number to track and clean up data from failed or speculative attempts. In the v2 docs for DataWriterFactory, the attempt number's javadoc states that "Implementations can use this attempt number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two tasks can create the same data and attempt to commit. The commit coordinator prevents both from committing and will abort the attempt that finishes last. When using the (partition, attempt) pair to track data, the aborted task may delete data associated with the (partition, attempt) pair. If that happens, the data for the task that committed is deleted as well, which is a correctness bug.
> For a concrete example, I have a data source that creates files in place named with {{part-<partition>-<attempt>-<uuid>.<format>}}. Because these files are written in place, both tasks create the same file and the one that is aborted deletes the file, leading to data corruption when the file is added to the table.
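
The failure described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Spark code: the dictionary standing in for the file system, the task dictionaries, the task_id values, and both naming functions are hypothetical. The uuid and format parts of the file name are left out, on the assumption that they are identical for both tasks of the same write job. The second naming function is just one illustrative way to avoid the collision and is not presented as the fix that was merged.

```python
def name_by_partition_and_attempt(task):
    # Key from the report: (partition, attempt). The uuid and format are
    # omitted, assuming they are the same for both tasks of one write job.
    return "part-{}-{}".format(task["partition"], task["attempt"])


def name_by_unique_task_id(task):
    # Hypothetical alternative: a key that differs across stage attempts.
    return "part-{}-{}".format(task["partition"], task["task_id"])


def committed_file_survives(name_for):
    """Return True if the committed task's file still exists after the
    aborted task has cleaned up its own output."""
    files = {}  # toy in-place file system: file name -> contents

    # One task per stage attempt, both for partition 0. With the bug,
    # the retried stage reuses attempt number 0.
    first = {"stage_attempt": 0, "partition": 0, "attempt": 0, "task_id": 100}
    retry = {"stage_attempt": 1, "partition": 0, "attempt": 0, "task_id": 200}

    # Both tasks write their output in place.
    for task in (first, retry):
        files[name_for(task)] = "rows from task {}".format(task["task_id"])

    # The commit coordinator lets the first task commit and aborts the
    # task that finishes last; the abort deletes data tracked by its key.
    committed = name_for(first)
    files.pop(name_for(retry), None)

    return committed in files


if __name__ == "__main__":
    print("reused (partition, attempt):", committed_file_survives(name_by_partition_and_attempt))
    print("unique task id:", committed_file_survives(name_by_unique_task_id))
```

With the (partition, attempt) key, both tasks map to the same name, so the cleanup of the aborted task removes the file that the committed task relies on. With a key that is unique per task, the cleanup only touches the aborted task's own file.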

