hadoop-mapreduce-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6956) FileOutputCommitter to gain abstract superclass PathOutputCommitter
Date Fri, 08 Sep 2017 19:12:01 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated MAPREDUCE-6956:
    Status: Patch Available  (was: Open)

> FileOutputCommitter to gain abstract superclass PathOutputCommitter
> -------------------------------------------------------------------
>                 Key: MAPREDUCE-6956
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6956
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>    Affects Versions: 3.0.0-beta1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: MAPREDUCE-6956-001.patch
> This is the initial step of MAPREDUCE-6823, which proposes a factory behind {{FileOutputFormat}}
to create different committers for different filesystems, if so configured.
> This patch simply adds the new abstract superclass of {{FileOutputCommitter}}:
{{PathOutputCommitter extends OutputCommitter}}. This abstract class declares {{getWorkPath()}}
as an abstract method, with {{FileOutputCommitter}} being the implementation.
> {{FileOutputFormat}} then relaxes its requirement on any committer returned by {{getOutputCommitter()}}:
instead of requiring a {{FileOutputCommitter}} or subclass, it only needs a {{PathOutputCommitter}},
and uses {{PathOutputCommitter.getWorkPath()}} to get the work path.
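> A minimal sketch of the hierarchy described above, not taken from the patch. All classes here
are simplified stand-ins in the default package: the real ones live in Hadoop and carry the full
committer lifecycle, {{java.nio.file.Path}} stands in for Hadoop's {{Path}}, and {{WorkFileResolver}}
and the {{_temporary/<attempt>}} layout are hypothetical names used only for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Stand-in for Hadoop's OutputCommitter; lifecycle methods are omitted here.
abstract class OutputCommitter {
}

// The proposed lean parent: its only addition is the work-path accessor.
abstract class PathOutputCommitter extends OutputCommitter {
    abstract Path getWorkPath();
}

// The existing committer becomes one implementation of the new parent.
class FileOutputCommitter extends PathOutputCommitter {
    private final Path workPath;

    FileOutputCommitter(Path outputPath, String taskAttemptId) {
        // Hypothetical layout; the real work-path layout is more involved.
        this.workPath = outputPath.resolve("_temporary").resolve(taskAttemptId);
    }

    @Override
    Path getWorkPath() {
        return workPath;
    }
}

// Hypothetical stand-in for the FileOutputFormat logic that locates a task's work file.
class WorkFileResolver {
    static Path defaultWorkFile(OutputCommitter committer, String fileName) {
        // Relaxed requirement: any PathOutputCommitter will do, not only FileOutputCommitter.
        if (!(committer instanceof PathOutputCommitter)) {
            throw new IllegalArgumentException(
                "Committer has no work path: " + committer.getClass().getName());
        }
        return ((PathOutputCommitter) committer).getWorkPath().resolve(fileName);
    }
}

class HierarchySketchDemo {
    public static void main(String[] args) {
        OutputCommitter committer = new FileOutputCommitter(Paths.get("out"), "attempt_0");
        System.out.println(WorkFileResolver.defaultWorkFile(committer, "part-00000"));
    }
}
```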
> What does that do?
> It allows people to implement subclasses of {{FileOutputFormat}} which can provide their
own committers *which don't need to inherit the complexity that FileOutputCommitter has acquired
over time*.
> Currently anyone implementing a new committer (example: the Netflix S3 committer) needs to
subclass {{FileOutputCommitter}}. That class is too complex to understand except under a debugger:
it has co-recursive routines and lots of methods which need to be overridden to guarantee a safe
subclass. Because of its critical role and known subclassing, it isn't ever going to be cleaned
up.
> A new, lean parent class which {{FileOutputFormat}} can handle allows people to write
new committers which don't have to worry about the implementation details of {{FileOutputCommitter}},
and can instead focus on how well they implement the semantics of committing work.
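> As an illustration of what a committer built on the lean parent might look like, here is
a small sketch. It is not from the patch: {{InMemoryCommitter}}, {{recordOutput()}} and {{committedOutputs()}}
are hypothetical, the stand-in {{OutputCommitter}} keeps only two lifecycle methods, and output
is tracked in memory rather than in a filesystem. The point is that the class only has to get
commit and abort semantics right, without inheriting anything from {{FileOutputCommitter}}.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Stand-in for Hadoop's OutputCommitter, reduced to two task-level lifecycle methods.
abstract class OutputCommitter {
    abstract void commitTask();
    abstract void abortTask();
}

// The lean parent: the lifecycle plus a work path.
abstract class PathOutputCommitter extends OutputCommitter {
    abstract Path getWorkPath();
}

// Hypothetical committer: task output stays pending until the task commits.
class InMemoryCommitter extends PathOutputCommitter {
    private final Path workPath;
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    InMemoryCommitter(Path workPath) {
        this.workPath = workPath;
    }

    @Override
    Path getWorkPath() {
        return workPath;
    }

    // Hypothetical helper: note that a file was written under the work path.
    void recordOutput(String fileName) {
        pending.add(fileName);
    }

    @Override
    void commitTask() {
        // Commit: everything pending becomes visible output.
        committed.addAll(pending);
        pending.clear();
    }

    @Override
    void abortTask() {
        // Abort: pending output is discarded and never becomes visible.
        pending.clear();
    }

    List<String> committedOutputs() {
        return new ArrayList<>(committed);
    }
}

class LeanCommitterDemo {
    public static void main(String[] args) {
        InMemoryCommitter committer = new InMemoryCommitter(Paths.get("staging"));
        committer.recordOutput("part-00000");
        committer.commitTask();
        System.out.println(committer.committedOutputs());
    }
}
```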
> The full MAPREDUCE-6823 goes beyond this, changing {{FileOutputFormat}} to use a factory
which creates FS-specific {{PathOutputCommitter}} instances. This patch doesn't include that,
as it needs to be reviewed in the context of HADOOP-13786 and ideally at least one committer
for another store, so people can say "this factory model works".
> All I'm proposing here is: tune the committer class hierarchy in MRv2 so that people
can more easily implement committers and, when that factory is done, switch to it easily.
And I'd like this in branch-3 from the outset, so existing code which calls {{FileOutputFormat.getOutputCommitter()}}
to get a {{FileOutputCommitter}} *just to call getWorkPath()* can move to the new interface
across all of Hadoop 3.
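
> To make the caller-side migration concrete, here is a hedged sketch of the two styles.
The classes are simplified stand-ins rather than the real Hadoop types, and {{ObjectStoreCommitter}}
and {{WorkPathCallers}} are hypothetical names. The old-style cast only works when the committer
happens to be a {{FileOutputCommitter}}; the new-style cast works for any {{PathOutputCommitter}}.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Simplified stand-ins for the Hadoop committer classes.
abstract class OutputCommitter {
}

abstract class PathOutputCommitter extends OutputCommitter {
    abstract Path getWorkPath();
}

class FileOutputCommitter extends PathOutputCommitter {
    @Override
    Path getWorkPath() {
        return Paths.get("out", "_temporary");
    }
}

// Hypothetical store-specific committer which does not extend FileOutputCommitter.
class ObjectStoreCommitter extends PathOutputCommitter {
    @Override
    Path getWorkPath() {
        return Paths.get("staging");
    }
}

class WorkPathCallers {
    // Old style: throws ClassCastException unless the committer is a FileOutputCommitter.
    static Path oldStyle(OutputCommitter committer) {
        return ((FileOutputCommitter) committer).getWorkPath();
    }

    // New style: accepts any PathOutputCommitter implementation.
    static Path newStyle(OutputCommitter committer) {
        return ((PathOutputCommitter) committer).getWorkPath();
    }
}

class CallerMigrationDemo {
    public static void main(String[] args) {
        System.out.println(WorkPathCallers.newStyle(new FileOutputCommitter()));
        System.out.println(WorkPathCallers.newStyle(new ObjectStoreCommitter()));
    }
}
```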

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org
