From: Steve Loughran <stevel@hortonworks.com>
To: cane <zhoukang199191@gmail.com>
CC: Apache Spark Dev <dev@spark.apache.org>
Subject: Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet
 file?
Date: Tue, 3 Apr 2018 20:44:59 +0000
Message-ID: <137F5053-8119-4FEA-AC5D-3167B89E5028@hortonworks.com>
References: <1522750780340-0.post@n3.nabble.com>
In-Reply-To: <1522750780340-0.post@n3.nabble.com>


> On 3 Apr 2018, at 11:19, cane <zhoukang199191@gmail.com> wrote:
>
> Now, if we use saveAsNewAPIHadoopDataset with speculation enable.It may cause
> data loss.
> I check the comment of thi api:
>
>  We should make sure our tasks are idempotent when speculation is enabled,
> i.e. do
>   * not use output committer that writes data directly.
>   * There is an example in
> https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
>   * result of using direct output committer with speculation enabled.
>   */
>
> But if this the rule we must follow?
> For example,for parquet it will got ParquetOutPutCommitter.
> In this case, speculation must disable for parquet?
>
> Is there some one know the history?
> Thanks too much!


If you are writing to HDFS, or to an object store other than S3, and you make sure that you are using version 1 of the FileOutputCommitter commit algorithm, you can use speculation without problems:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1
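
As a rough illustration of how that setting might be passed alongside speculation, here is a minimal Python sketch that only builds the option strings. The helper names (speculation_conf, as_submit_args) are made up for this example; the two property names are the real ones (spark.speculation and the committer property quoted above). It does not start Spark.

```python
def speculation_conf(algorithm_version=1):
    """Return the settings for speculation plus the chosen commit algorithm.

    Hypothetical helper for illustration; only the property names are real.
    """
    if algorithm_version not in (1, 2):
        raise ValueError("algorithm version must be 1 or 2")
    return {
        "spark.speculation": "true",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": str(
            algorithm_version
        ),
    }


def as_submit_args(conf):
    """Render a settings dict as a list of spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(as_submit_args(speculation_conf(1))))
```

The same key/value pairs could equally go into spark-defaults.conf, one "key value" pair per line, as written above.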

If you use the version 2 algorithm, you are vulnerable to a failure during task commit, but only during task commit, and only if speculative or repeated task attempts generate output files with different names:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
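
To make the difference concrete, here is a toy in-memory model in Python. It is a sketch under stated assumptions, not Hadoop's code: directories are plain dicts, all function names are invented for the example, and the v1 task commit is modelled as a single atomic step (which is what a directory rename is on HDFS). In v1 a task commit publishes the whole attempt at once and the files reach the destination only at job commit; in v2 a task commit moves files into the destination one by one, so a failure partway through leaves some of them visible.

```python
def commit_task_v1(committed_tasks, task_id, attempt_files):
    """v1 model: one atomic step; a repeated attempt replaces the earlier one."""
    committed_tasks[task_id] = dict(attempt_files)


def commit_job_v1(committed_tasks, dest):
    """v1 model: job commit moves every committed task's files to the destination."""
    for files in committed_tasks.values():
        dest.update(files)


def commit_task_v2(dest, attempt_files, fail_after=None):
    """v2 model: files go straight to the destination, one at a time.

    fail_after=N simulates a failure after N files have already been moved.
    """
    for moved, (name, data) in enumerate(attempt_files.items()):
        if fail_after is not None and moved >= fail_after:
            raise RuntimeError("simulated failure during task commit")
        dest[name] = data


def run_v1(first_attempt, second_attempt):
    """Both attempts commit under v1; only the last one survives to the job commit."""
    committed_tasks, dest = {}, {}
    commit_task_v1(committed_tasks, "task-0", first_attempt)
    commit_task_v1(committed_tasks, "task-0", second_attempt)
    commit_job_v1(committed_tasks, dest)
    return sorted(dest)


def run_v2(first_attempt, second_attempt, fail_after):
    """First attempt fails partway through its commit; the retry then commits."""
    dest = {}
    try:
        commit_task_v2(dest, first_attempt, fail_after=fail_after)
    except RuntimeError:
        pass  # the scheduler would rerun the task as a new attempt
    commit_task_v2(dest, second_attempt)
    return sorted(dest)


if __name__ == "__main__":
    a0 = {"part-0000-attempt0": "x", "part-0001-attempt0": "y"}
    a1 = {"part-0000-attempt1": "x", "part-0001-attempt1": "y"}
    same = {"part-0000": "x", "part-0001": "y"}
    print("v1, different names:", run_v1(a0, a1))
    print("v2, different names:", run_v2(a0, a1, fail_after=1))
    print("v2, same names:     ", run_v2(same, same, fail_after=1))
```

In this model the v2 run with different file names ends with a stale file from the failed attempt sitting next to the retry's output, while identical names are simply overwritten.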

If you are using S3 as a direct destination of work, then, in the absence of a consistency layer (S3mper, EMR consistent S3, Hadoop 3.x + S3Guard) or an S3-specific committer, you are always at risk of data loss. Don't do that.
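
One way to act on that advice is a pre-flight check before the job writes anything. The Python sketch below is only an illustration: the function names and the two boolean flags are hypothetical inputs the caller would have to supply, and nothing here is a Spark or Hadoop API. It treats the s3, s3n and s3a URI schemes as S3 destinations.

```python
from urllib.parse import urlparse

# URI schemes used by the Hadoop S3 connectors.
S3_SCHEMES = {"s3", "s3n", "s3a"}


def is_s3_destination(uri):
    """Return True if the output URI points at S3."""
    return urlparse(uri).scheme.lower() in S3_SCHEMES


def check_destination(uri, has_consistency_layer=False, has_s3_committer=False):
    """Return 'ok' or a warning string, following the rule of thumb above.

    The two flags are supplied by the caller; this sketch cannot detect them.
    """
    if is_s3_destination(uri) and not (has_consistency_layer or has_s3_committer):
        return "unsafe: direct S3 destination without consistency layer or S3 committer"
    return "ok"


if __name__ == "__main__":
    print(check_destination("hdfs://namenode:8020/out"))
    print(check_destination("s3a://bucket/out"))
    print(check_destination("s3a://bucket/out", has_s3_committer=True))
```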

Further reading

https://github.com/steveloughran/zero-rename-committer/releases/download/tag_draft_003/a_zero_rename_committer.pdf

