From: Steve Loughran
To: cane
CC: Apache Spark Dev
Subject: Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?
Date: Tue, 3 Apr 2018 20:44:59 +0000

> On 3 Apr 2018, at 11:19, cane wrote:
>
> Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may
> cause data loss.
> I checked the comment on this API:
>
> We should make sure our tasks are idempotent when speculation is enabled,
> i.e. do
> * not use output committer that writes data directly.
> * There is an example in
> https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
> * result of using direct output committer with speculation enabled.
> */
>
> But is this a rule we must follow?
> For example, for Parquet it will get ParquetOutputCommitter.
> In this case, must speculation be disabled for Parquet?
>
> Does anyone know the history?
> Thanks very much!

If you are writing to HDFS, or to object stores other than S3, and you make sure that you are using the FileOutputFormat commit algorithm, you can use speculation without problems:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1

If you use the version 2 algorithm, then you are vulnerable to a failure during task commit, but only during task commit, and even then only if speculative/repeated tasks generate output files with different names:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2

If you are using S3 as a direct destination of work then, in the absence of a consistency layer (S3mper, EMR consistent S3, Hadoop 3.x + S3Guard) or an S3-specific committer, you are always at risk of data loss. Don't do that.

Further reading:

https://github.com/steveloughran/zero-rename-committer/releases/download/tag_draft_003/a_zero_rename_committer.pdf
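As a rough illustration of the rules above, here is a minimal Python sketch that classifies a speculative write by destination and committer algorithm version. The function name `speculation_risk`, the result labels, and the set of S3 URI schemes are assumptions made for this example and are not part of Spark or Hadoop; only the property name comes from this message. It deliberately does not model consistency layers or S3-specific committers, and it treats an unset version as unknown rather than guessing a default.

```python
from urllib.parse import urlparse

# Property name as given in this message.
ALGORITHM_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"

# Assumption for this sketch: URI schemes treated as "S3 as a direct destination".
S3_SCHEMES = {"s3", "s3n", "s3a"}


def speculation_risk(conf, destination):
    """Classify a speculative write according to the rules in this message.

    conf        -- dict of Spark property name -> string value
    destination -- output URI, e.g. "hdfs://namenode/out"

    The returned labels are hypothetical names chosen for this sketch.
    """
    scheme = urlparse(destination).scheme.lower()
    if scheme in S3_SCHEMES:
        # Direct S3 output: at risk unless a consistency layer or an
        # S3-specific committer is in place (not modelled here).
        return "S3_AT_RISK"

    version = conf.get(ALGORITHM_KEY)
    if version == "1":
        # Version 1 algorithm: speculation is fine.
        return "SAFE"
    if version == "2":
        # Version 2 algorithm: exposed to a failure during task commit.
        return "TASK_COMMIT_WINDOW"
    # Unset or unrecognised: do not guess the cluster default.
    return "UNKNOWN"


if __name__ == "__main__":
    print(speculation_risk({ALGORITHM_KEY: "1"}, "hdfs://namenode/out"))
    print(speculation_risk({ALGORITHM_KEY: "2"}, "hdfs://namenode/out"))
    print(speculation_risk({ALGORITHM_KEY: "1"}, "s3a://bucket/out"))
```

A check like this could run before job submission; the property itself is normally supplied as a Spark configuration entry, for example with `--conf` on `spark-submit`.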