crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Output Committers and Crunch Targets
Date Thu, 30 Jan 2014 13:35:16 GMT
The first point is correct: we now always use the multiple outputs
configuration options, even when there is only a single output.

The second point surprises me: HBaseTarget, for example, uses a custom
output committer with its OutputFormat without issue, although of course
we're still stuck setting the Path field in the Conf via FileOutputFormat.
Maybe look at HBaseTarget as a reference here?


On Wed, Jan 29, 2014 at 1:21 PM, Micah Whitacre <mkwhit@gmail.com> wrote:

> >> I would expect that
> >> named outputs would not be used in my simple pipeline, so name would
> >> be null, but it actually seems that the name parameter is 'out0'. So
> >> my first question is: what determines when named outputs are used?
>
> Looking at the code, the output is always named [1], regardless of the
> number of outputs.  Do you believe the name is interfering with the use
> of your custom committer?
>
> Regarding your second question, I need to do a bit more digging to answer
> for certain.
>
> [1] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/MSCROutputHandler.java#L64
>
> On Wed, Jan 29, 2014 at 10:11 AM, Tom White <tom@cloudera.com> wrote:
>
> > Hi,
> >
> > I'm writing a Crunch Target that is a MapReduceTarget, but not a
> > PathTarget, since it writes to files in a partitioned manner, so there
> > is not necessarily a single output path. I'm confused about the 'name'
> > parameter in configureForMapReduce() though - I would expect that
> > named outputs would not be used in my simple pipeline, so name would
> > be null, but it actually seems that the name parameter is 'out0'. So
> > my first question is: what determines when named outputs are used?
> >
> > In the past this hasn't been a problem (e.g. with the Parquet target),
> > but this output format has a custom output committer which isn't being
> > used. Instead it looks like the default file committer is being used
> > by Crunch, so the job fails. Is it possible to use custom output
> > committers with Crunch?
> >
> > My code is here:
> >
> > https://github.com/tomwhite/kite/blob/CDK-251-mr/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L100
> >
> > Cheers,
> > Tom
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
