crunch-dev mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Support OutputCommitter?
Date Fri, 28 Feb 2014 02:28:59 GMT
I like it.
On Feb 27, 2014 6:27 PM, "Chao Shi" <stepinto@live.com> wrote:

> How about introducing our own OutputFormat? It can delegate to each
> registered OutputCommitter (if any).
>
>
> 2014-02-28 1:28 GMT+08:00 Josh Wills <josh.wills@gmail.com>:
>
> > It's possible to have multiple targets running in one Crunch job; in fact
> > it was so common that I switched everything over to the named targets in
> > order to simplify the bookkeeping. Every output format can run
> > independently of every other output format using the code in
> > CrunchOutputs; I think the only reason we default to FileOutputFormat is
> > because it's an exception for an MR config to _not_ have an OutputFormat
> > configured, even if it's never used.
> >
> >
> > On Thu, Feb 27, 2014 at 9:03 AM, Tom White <tom@cloudera.com> wrote:
> >
> > > Is it possible to have multiple targets that Crunch runs in one
> > > MapReduce job? If so then there will be a conflict, and Crunch will
> > > need some changes to support this case.
> > >
> > > Tom
> > >
> > > On Thu, Feb 27, 2014 at 3:34 PM, Chao Shi <stepinto@live.com> wrote:
> > > > Hi Tom,
> > > >
> > > > I will have to use named-output. About your example DatasetTarget, is it
> > > > safe to setOutputFormat() explicitly here? I guess this may conflict with
> > > > other targets that only use the same trick. Is it possible for us to have
> > > > a general approach to get OutputCommitter to work?
> > > > Hi Chao,
> > > >
> > > > Crunch doesn't call the output committer explicitly itself; it's
> > > > called by the MR framework as a normal part of running a job. However,
> > > > in Crunch's MapReduceTarget#configureForMapReduce the output format is
> > > > not typically set for the named-output case (which is the only case
> > > > that is executed now, as I discovered in the thread mentioned below),
> > > > so it defaults to FileOutputFormat, with its semantics. (This is why
> > > > HBaseTarget calls FileOutputFormat.setOutputPath, which it wouldn't
> > > > have to if it set the output format explicitly to HBase's
> > > > TableOutputFormat.)
> > > >
> > > > Are you setting the HCatOutputFormat in the named-output case? In the
> > > > Crunch Target I'm writing I've set the OutputFormat explicitly:
> > > >
> > > > https://github.com/tomwhite/kite/blob/CDK-308-dataset-output-format/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L106
> > > >
> > > > Cheers,
> > > > Tom
> > > >
> > > > On Thu, Feb 27, 2014 at 7:54 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> > > >> For reference, here's the link to the previous thread on this:
> > > >>
> > > >> http://mail-archives.apache.org/mod_mbox/crunch-dev/201401.mbox/%3cCAF-WD4Sig2n7yMxiZSji8trQy-8wfUy5_7dnKC=dkSxmrfSPVA@mail.gmail.com%3e
> > > >>
> > > >> On Thu, Feb 27, 2014 at 7:56 AM, Josh Wills <jwills@cloudera.com> wrote:
> > > >>> +tom
> > > >>>
> > > >>> Didn't Tom have a thing like this a little while ago?
> > > >>>
> > > >>>
> > > >>> On Wed, Feb 26, 2014 at 8:04 PM, Chao Shi <stepinto@live.com> wrote:
> > > >>>
> > > >>>> Hi crunch devs,
> > > >>>>
> > > >>>> I'm developing a target wrapper for HCatOutputFormat, which uses a
> > > >>>> custom OutputCommitter to get results committed to Hive. It seems its
> > > >>>> OutputCommitter is not called at all. Looking into the code, I can't
> > > >>>> find where Crunch calls it. Is it really supported?
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Chao
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Director of Data Science
> > > >>> Cloudera <http://www.cloudera.com>
> > > >>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >
> >
>
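
The delegation idea proposed at the top of this thread (one output format whose committer forwards to each registered OutputCommitter) could be sketched roughly as follows. This is only an illustration under assumptions: `Committer`, `CompositeCommitter`, and `RecordingCommitter` are hypothetical names, not Crunch or Hadoop classes, and `Committer` is a two-method stand-in for Hadoop's OutputCommitter so that the example compiles without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the job-level hooks of Hadoop's OutputCommitter.
interface Committer {
    void setupJob();
    void commitJob();
}

// Hypothetical composite: holds one committer per named output and
// forwards each job-level call to all of them in registration order.
final class CompositeCommitter implements Committer {
    // LinkedHashMap keeps registration order, so calls are deterministic.
    private final Map<String, Committer> delegates = new LinkedHashMap<>();

    void register(String outputName, Committer committer) {
        delegates.put(outputName, committer);
    }

    @Override
    public void setupJob() {
        for (Committer c : delegates.values()) {
            c.setupJob();
        }
    }

    @Override
    public void commitJob() {
        for (Committer c : delegates.values()) {
            c.commitJob();
        }
    }
}

// Test double that records every call so the delegation is observable.
final class RecordingCommitter implements Committer {
    private final String name;
    private final List<String> log;

    RecordingCommitter(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void setupJob() {
        log.add(name + ":setup");
    }

    @Override
    public void commitJob() {
        log.add(name + ":commit");
    }
}

class CompositeCommitterDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        CompositeCommitter composite = new CompositeCommitter();
        composite.register("hcat", new RecordingCommitter("hcat", log));
        composite.register("files", new RecordingCommitter("files", log));
        composite.setupJob();
        composite.commitJob();
        System.out.println(log);
    }
}
```

A real version would presumably also have to forward the task-level hooks of OutputCommitter (setupTask, needsTaskCommit, commitTask, abortTask) and abortJob, and decide what should happen when one delegate fails after another has already committed.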
