crunch-dev mailing list archives

From Tom White <...@cloudera.com>
Subject Re: Output Committers and Crunch Targets
Date Thu, 30 Jan 2014 14:45:41 GMT
Thanks Micah and Josh.

It looks like CRUNCH-82 changed the behaviour so that outputs are
always named, which was part of my confusion: it means that the
null-name code path in configureForMapReduce is never taken.

Most of the Target implementations leave the output format as the
default, which means that the normal file output committer semantics
are used. This is fine for all the file targets, and for HBase which
uses a no-op output committer, but I need to specify a custom
committer that does something different. I can do that when there's
only a single target, but it's not yet clear to me how to make it cope
with multiple output datasets. It may need a generalization of the
PathTarget#handleOutputs mechanism.
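
To make the multiple-dataset case concrete, here is a rough sketch in
plain Java of one possible shape: a composite committer that keeps one
delegate per named output ('out0', 'out1', ...) and fans the job-level
calls out to each. DatasetCommitter, CompositeCommitter and
CompositeCommitterDemo are hypothetical names, not Crunch or Hadoop
classes; DatasetCommitter is only a stand-in for the job-level half of
Hadoop's OutputCommitter so that the example has no dependencies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the job-level methods of an output committer.
interface DatasetCommitter {
    void commitJob() throws IOException;
    void abortJob() throws IOException;
}

// Assumption: each named output registers its own committer, and the
// composite forwards job-level calls to all of them in registration order.
final class CompositeCommitter implements DatasetCommitter {
    private final Map<String, DatasetCommitter> byName = new LinkedHashMap<>();

    void register(String outputName, DatasetCommitter committer) {
        byName.put(outputName, committer);
    }

    @Override
    public void commitJob() throws IOException {
        for (DatasetCommitter c : byName.values()) {
            c.commitJob();
        }
    }

    @Override
    public void abortJob() throws IOException {
        for (DatasetCommitter c : byName.values()) {
            c.abortJob();
        }
    }
}

final class CompositeCommitterDemo {
    // Registers two recording committers and commits the "job".
    static List<String> run() {
        final List<String> log = new ArrayList<>();
        CompositeCommitter composite = new CompositeCommitter();
        for (final String name : new String[] {"out0", "out1"}) {
            composite.register(name, new DatasetCommitter() {
                @Override public void commitJob() { log.add("commit " + name); }
                @Override public void abortJob() { log.add("abort " + name); }
            });
        }
        try {
            composite.commitJob();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Note that a commit done this way is not atomic across datasets: if the
second delegate fails, the first has already committed, so a real
version would need to decide how to handle partial commits.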

> we're still stuck setting the Path field in the Conf via FileOutputFormat.

Crunch should probably set the job's output format to
NullOutputFormat rather than just leaving it as the default, since then
the FileOutputFormat.setOutputPath call wouldn't be needed.

Cheers,
Tom

On Thu, Jan 30, 2014 at 1:35 PM, Josh Wills <jwills@cloudera.com> wrote:
> The first point is correct-- we always use the multiple outputs
> configuration options now, even if there is only a single output.
>
> The second point surprises me-- HBaseTarget (for example) uses a custom
> output committer w/its OutputFormat without issue, although of course we're
> still stuck setting the Path field in the Conf via FileOutputFormat. Maybe
> look at HBaseTarget as a reference here?
>
>
> On Wed, Jan 29, 2014 at 1:21 PM, Micah Whitacre <mkwhit@gmail.com> wrote:
>
>> >> I would expect that
>> >> named outputs would not be used in my simple pipeline, so name would
>> >> be null, but it actually seems that the name parameter is 'out0'. So
>> >> my first question is: what determines when named outputs are used?
>>
>> Looking at the code the output is always named[1] regardless of the number
>> of outputs.  Do you believe the use of a name is causing an issue with the
>> utilization of your custom committer?
>>
>> Regarding your second question I need to do a bit more digging to answer
>> for certain.
>>
>> [1] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/MSCROutputHandler.java#L64
>>
>> On Wed, Jan 29, 2014 at 10:11 AM, Tom White <tom@cloudera.com> wrote:
>>
>> > Hi,
>> >
>> > I'm writing a Crunch Target that is a MapReduceTarget, but not a
>> > PathTarget, since it writes to files in a partitioned manner, so there
>> > is not necessarily a single output path. I'm confused about the 'name'
>> > parameter in configureForMapReduce() though - I would expect that
>> > named outputs would not be used in my simple pipeline, so name would
>> > be null, but it actually seems that the name parameter is 'out0'. So
>> > my first question is: what determines when named outputs are used?
>> >
>> > In the past this hasn't been a problem (e.g. with the Parquet target),
>> > but this output format has a custom output committer which isn't being
>> > used. Instead it looks like the default file committer is being used
>> > by Crunch, so the job fails. Is it possible to use custom output
>> > committers with Crunch?
>> >
>> > My code is here:
>> >
>> > https://github.com/tomwhite/kite/blob/CDK-251-mr/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L100
>> >
>> > Cheers,
>> > Tom
>> >
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
