crunch-dev mailing list archives

From Tom White <...@cloudera.com>
Subject Re: Support OutputCommitter?
Date Fri, 28 Feb 2014 08:40:02 GMT
+1

Tom

On Fri, Feb 28, 2014 at 2:26 AM, Chao Shi <stepinto@live.com> wrote:
> How about introducing our own OutputFormat? It can delegate to each
> registered OutputCommitter (if any).
>
>
> 2014-02-28 1:28 GMT+08:00 Josh Wills <josh.wills@gmail.com>:
>
>> It's possible to have multiple targets running in one Crunch job; in fact
>> it was so common that I switched everything over to the named targets in
>> order to simplify the bookkeeping. Every output format can run
>> independently of every other output format using the code in CrunchOutputs;
>> I think the only reason we default to FileOutputFormat is b/c it's an
>> exception for an MR config to _not_ have an OutputFormat configured, even if
>> it's never used.
>>
>>
>> On Thu, Feb 27, 2014 at 9:03 AM, Tom White <tom@cloudera.com> wrote:
>>
>> > Is it possible to have multiple targets that Crunch runs in one
>> > MapReduce job? If so then there will be a conflict, and Crunch will
>> > need some changes to support this case.
>> >
>> > Tom
>> >
>> > On Thu, Feb 27, 2014 at 3:34 PM, Chao Shi <stepinto@live.com> wrote:
>> > > Hi Tom,
>> > >
>> > > I will have to use named-output. About your example DatasetTarget, is it
>> > > safe to setOutputFormat() explicitly here? I guess this may conflict with
>> > > other targets that use the same trick. Is it possible for us to have a
>> > > general approach to get OutputCommitter to work?
>> > >
>> > > Hi Chao,
>> > >
>> > > Crunch doesn't call the output committer explicitly itself; it's
>> > > called by the MR framework as a normal part of running a job. However,
>> > > in Crunch's MapReduceTarget#configureForMapReduce the output format is
>> > > not typically set for the named-output case (which is the only case
>> > > that is executed now, as I discovered in the thread mentioned below),
>> > > so it defaults to FileOutputFormat, with its semantics. (This is why
>> > > HBaseTarget calls FileOutputFormat.setOutputPath, which it wouldn't
>> > > have to if it set the output format explicitly to HBase's
>> > > TableOutputFormat.)
>> > >
>> > > Are you setting the HCatOutputFormat in the named-output case? In the
>> > > Crunch Target I'm writing I've set the OutputFormat explicitly:
>> > >
>> > > https://github.com/tomwhite/kite/blob/CDK-308-dataset-output-format/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L106
>> > >
>> > > Cheers,
>> > > Tom
>> > >
>> > > On Thu, Feb 27, 2014 at 7:54 AM, Gabriel Reid <gabriel.reid@gmail.com>
>> > > wrote:
>> > >> For reference, here's the link to the previous thread on this:
>> > >>
>> > >> http://mail-archives.apache.org/mod_mbox/crunch-dev/201401.mbox/%3cCAF-WD4Sig2n7yMxiZSji8trQy-8wfUy5_7dnKC=dkSxmrfSPVA@mail.gmail.com%3e
>> > >>
>> > >> On Thu, Feb 27, 2014 at 7:56 AM, Josh Wills <jwills@cloudera.com> wrote:
>> > >>> +tom
>> > >>>
>> > >>> Didn't Tom have a thing like this a little while ago?
>> > >>>
>> > >>>
>> > >>> On Wed, Feb 26, 2014 at 8:04 PM, Chao Shi <stepinto@live.com> wrote:
>> > >>>
>> > >>>> Hi crunch devs,
>> > >>>>
>> > >>>> I'm developing a target wrapper for HCatOutputFormat, which uses a custom
>> > >>>> OutputCommitter to get results committed to Hive. It seems its
>> > >>>> OutputCommitter is not called at all. Looking into the code, I can't find
>> > >>>> where Crunch calls it. Is it really supported?
>> > >>>>
>> > >>>> Thanks,
>> > >>>> Chao
>> > >>>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> Director of Data Science
>> > >>> Cloudera <http://www.cloudera.com>
>> > >>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >
>>
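Chao's suggestion above (a Crunch-owned OutputFormat whose committer delegates to each registered OutputCommitter) could look roughly like the sketch below. This is only an illustration of the delegation idea, not Crunch or Hadoop code: SimpleCommitter, DelegatingCommitter and RecordingCommitter are hypothetical names, and SimpleCommitter is a stand-in that covers only job-level calls. Hadoop's real OutputCommitter also has task-level methods (setupTask, needsTaskCommit, commitTask, abortTask) and takes context arguments, all of which are left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the job-level part of an output committer.
interface SimpleCommitter {
    void setupJob() throws Exception;
    void commitJob() throws Exception;
    void abortJob() throws Exception;
}

// Forwards each job-level call to every committer registered under a named output.
class DelegatingCommitter implements SimpleCommitter {
    // LinkedHashMap keeps registration order, so delegates run in a stable order.
    private final Map<String, SimpleCommitter> delegates = new LinkedHashMap<>();

    void register(String outputName, SimpleCommitter committer) {
        delegates.put(outputName, committer);
    }

    @Override
    public void setupJob() throws Exception {
        for (SimpleCommitter c : delegates.values()) {
            c.setupJob();
        }
    }

    @Override
    public void commitJob() throws Exception {
        for (SimpleCommitter c : delegates.values()) {
            c.commitJob();
        }
    }

    @Override
    public void abortJob() throws Exception {
        // Try to abort every delegate, then rethrow the first failure, if any.
        Exception first = null;
        for (SimpleCommitter c : delegates.values()) {
            try {
                c.abortJob();
            } catch (Exception e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}

// Test double that records which calls it received.
class RecordingCommitter implements SimpleCommitter {
    private final String name;
    private final List<String> log;

    RecordingCommitter(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override public void setupJob() { log.add(name + ":setupJob"); }
    @Override public void commitJob() { log.add(name + ":commitJob"); }
    @Override public void abortJob() { log.add(name + ":abortJob"); }
}

class DelegatingCommitterDemo {
    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        DelegatingCommitter committer = new DelegatingCommitter();
        committer.register("hcat", new RecordingCommitter("hcat", log));
        committer.register("files", new RecordingCommitter("files", log));
        committer.setupJob();
        committer.commitJob();
        System.out.println(log);
    }
}
```

Keying the delegates by output name mirrors the named-output bookkeeping Josh describes, so two targets in one job would not have to compete over a single job-wide output format setting. How such a committer would actually be wired into CrunchOutputs is not covered by this sketch.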
