crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: MultipleOutput in crunch
Date Sat, 09 Mar 2013 06:41:57 GMT
MultipleOutputs is baked pretty deep into the Crunch system, although we
have our own implementation (the class is named CrunchMultipleOutputs) to
handle some of the peculiarities around how we configure OutputFormats.

I would do something similar to what Micah suggested, but I would leave out
the groupByKey step. Start with a PCollection<T> and use a MapFn to convert
it to a PCollection<Pair<T, Boolean>> (or equivalently, a
PTable<T, Boolean>), with the boolean initially set to true. Then have each
of the filter fns in the sequence check the current value of the boolean
for each record: if it's already false, don't bother doing the filter
check, just pass along Pair.of(T, false); if it's true, do the check, and
emit Pair.of(T, true) if it passes and Pair.of(T, false) if it fails.

After all of the filter checks are done, use two FilterFns to route the
records that passed the checks separately from the ones that didn't pass
them, whether to subsequent processing logic, to separate files, or
whatever. If you can get away with doing everything in a single pass over
the data using a map-only job, that's the best of all worlds from a
performance perspective.
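
The tag-then-route idea above can be sketched in plain Java with no Crunch
dependency. This is only an illustration under assumptions: the Tagged class
stands in for Pair<T, Boolean>, and the names TaggedFilterSketch, tag, check,
and split are hypothetical, not part of the Crunch API. In a real pipeline,
tag and check would presumably live in MapFns and the final routing in two
FilterFns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class TaggedFilterSketch {

    /** Hypothetical stand-in for Pair<T, Boolean>: a record plus a "still passing" flag. */
    static final class Tagged<T> {
        final T value;
        final boolean passing;

        Tagged(T value, boolean passing) {
            this.value = value;
            this.passing = passing;
        }
    }

    /** Step 1: tag every record as passing before any check has run. */
    static <T> Tagged<T> tag(T value) {
        return new Tagged<>(value, true);
    }

    /** Step 2: apply one check, skipping it if an earlier check already failed. */
    static <T> Tagged<T> check(Tagged<T> in, Predicate<T> predicate) {
        if (!in.passing) {
            return in; // already false: pass the record along untouched
        }
        return new Tagged<>(in.value, predicate.test(in.value));
    }

    /**
     * Step 3: run all checks in order, then route each record by its flag.
     * Returns a two-element list: index 0 holds the records that passed
     * every check, index 1 holds the records that failed at least one.
     */
    static <T> List<List<T>> split(List<T> records, List<Predicate<T>> checks) {
        List<T> passed = new ArrayList<>();
        List<T> failed = new ArrayList<>();
        for (T record : records) {
            Tagged<T> tagged = tag(record);
            for (Predicate<T> c : checks) {
                tagged = check(tagged, c);
            }
            (tagged.passing ? passed : failed).add(tagged.value);
        }
        return Arrays.asList(passed, failed);
    }
}
```

Every record is emitted exactly once with its flag, so nothing is dropped
until the final routing step, and the input is read only once.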

Josh


On Fri, Mar 8, 2013 at 8:16 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:

> Instead of implementing a filter could you switch to using a DoFn and
> emit a Pair?  Then the first part of the pair would be the identifier
> for the category of data.  You can then group by key to process them
> differently or just keep processing them by the same DoFn using the
> key as a flag for how to process each record.
>
> That being said, I'm not really sure this would be any more efficient
> than filtering twice.
>
>
> On Fri, Mar 8, 2013 at 8:53 PM, Peter Knap <pknap@yahoo.com> wrote:
> > Hi,
> >
> > Is multiple output functionality supported by crunch? I have looked at
> > the source code but couldn't find a way to do it. I have the following
> > scenario: the input file would be processed by multiple sequential
> > filters, and the records passing the filter criteria need to be processed
> > differently than the ones which are not. What's the best way to do it in
> > crunch? I know I can process the input data twice with two different
> > filters, but this is not efficient. Any suggestions from you guys?
> >
> > Thanks,
> > Piotr
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
