Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 61C93200BDC for ; Tue, 29 Nov 2016 20:52:00 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 6131A160B28; Tue, 29 Nov 2016 19:52:00 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id AF1CD160B15 for ; Tue, 29 Nov 2016 20:51:59 +0100 (CET) Received: (qmail 71809 invoked by uid 500); 29 Nov 2016 19:51:58 -0000 Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list common-issues@hadoop.apache.org Received: (qmail 71740 invoked by uid 99); 29 Nov 2016 19:51:58 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 29 Nov 2016 19:51:58 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id A271B2C03E5 for ; Tue, 29 Nov 2016 19:51:58 +0000 (UTC) Date: Tue, 29 Nov 2016 19:51:58 +0000 (UTC) From: "Steve Loughran (JIRA)" To: common-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HADOOP-13786) add output committer which uses s3guard for consistent O(1) commits to S3 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Tue, 29 Nov 2016 19:52:00 -0000 [ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706351#comment-15706351 ] Steve Loughran commented on HADOOP-13786: ----------------------------------------- For the curious, my WiP, in the s3guard branch, is here: https://github.com/steveloughran/hadoop/tree/s3guard/HADOOP-13786-committer I've been tweaking the mapreduce code to add ability to switch to a subclass of FileOutputCommitter in FileOutputFormat (And transitively, all children), by way of a factory. The S3a factory dynamically chooses the committer based on the destination FS. There's currently no difference in the codebase apart from logging of operation durations. This means that we can switch the committer behind all output formats to using the s3a one (and also allowing anyone else to do the same with their own FOF committer subclass). I can see that ParquetOutputFormat has its own committer, {{ParquetOutputCommitter}}, as used by {{ParquetOutputFormat}} and so, I believe spark's {{dataframe.parquet.write()}} pipeline; my IDE isn't highlighting much else. Now, could I get away with just modifying the base FOF committer? for exception reporting it could use {{FileSystem.rename(final Path src, final Path dst, final Rename... options)}}; S3A to impl that with exceptions. That's the "transitory" method for use between FS and FileContext, which does raise exceptions; the default is non-atomic and eventually gets to {{rename(src, dest)}}, except for HDFS which does a full implementation. But: what if the semantics of that rename() (not yet in the FS spec, AFAIK) are different from what callers expect in some of the corner cases? And so commits everywhere break? That would not be good. And it would remove the ability to use any private s3a/s3guard calls if we wanted some (e.g. get a lock on a directory), unless they went in to the standard FS APIs. Making the output factory pluggable would avoid such problems, though logging duration might be nice all round. That said, given the time to rename, its somewhat unimportant when using real filesystems. > add output committer which uses s3guard for consistent O(1) commits to S3 > ------------------------------------------------------------------------- > > Key: HADOOP-13786 > URL: https://issues.apache.org/jira/browse/HADOOP-13786 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3 > Affects Versions: 3.0.0-alpha2 > Reporter: Steve Loughran > Assignee: Steve Loughran > > A goal of this code is "support O(1) commits to S3 repositories in the presence of failures". Implement it, including whatever is needed to demonstrate the correctness of the algorithm. (that is, assuming that s3guard provides a consistent view of the presence/absence of blobs, show that we can commit directly). > I consider ourselves free to expose the blobstore-ness of the s3 output streams (ie. not visible until the close()), if we need to use that to allow us to abort commit operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: common-issues-help@hadoop.apache.org