crunch-dev mailing list archives

From Chandan Biswas <cbiswas1...@gmail.com>
Subject Re: Process of CombineFn<S,T> returns <S,U>?
Date Thu, 17 Oct 2013 23:41:49 GMT
Yeah, I agree with Micah that it will not eliminate the reduce phase
entirely. But the dummy object of U suggested by Josh (or converting to
type U in the map for every record) will not improve performance, because
the same number of records will be sorted and aggregated in the reduce
phase. My point is: can we improve this by applying a combiner whose
CombineFn produces output of a different type? When the types are the
same, we can use the combiner to do some aggregation on the map side,
which improves performance. Can we have some mechanism by which the same
advantage is achieved when the CombineFn emits a different type? I think
requiring CombineFn to emit the same type has restricted its use. Can we
have a new CombineFn that allows the output type to differ from the input
type?


On Thu, Oct 17, 2013 at 5:05 PM, Josh Wills <jwills@cloudera.com> wrote:

> Yeah, my experience in these kinds of situations is that you need to come
> up with a "dummy" or singleton version of U for the case where there is
> only a single T and do that conversion on the map side of the job, before
> the combiner runs. I think Chao had an issue like this awhile ago, where he
> had a PTable<String, Double> and wanted to write a combiner that would
> return a PTable<String, Collection<Double>>. The solution was to convert
> the map-side object to a PTable<String, Collection<Double>>, where the
> value on the map-side was a singleton list containing just that double
> value. Does that sort of trick work here?
>
>
> On Thu, Oct 17, 2013 at 2:57 PM, Micah Whitacre <mkwhit@gmail.com> wrote:
>
> > Ok so the feature you are trying to achieve is the proactive combination
> > of data before performing the GBK like the javadoc describes.  Essentially
> > in that situation the CombineFn is being used as a Combiner[1] to combine
> > the data local to that mapper before doing the GBK and then further
> > combining the data in the reduce operation.  It will not necessarily
> > eliminate the need for all processing in the reduce.
> >
> > If you want to use this functionality you will need to do the following:
> >
> > PTable<S, T> map to PTable<S, U>
> > PTable<S, U> gbk to PGT<S, U>
> > PGT<S, U> combine PTable<S, U>
> >
> > This will take advantage of any optimization provided by the CombineFn.
> >
> > [1] - http://wiki.apache.org/hadoop/HadoopMapReduce
> >
> >
> >
> > On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas <cbiswas1983@gmail.com> wrote:
> >
> > > Hello Micah,
> > > Yes, we are using a MapFn now. The aggregation and computation are being
> > > done in the reduce phase. Since a CombineFn applied after a GBK runs on the
> > > map side, most of the computation that now runs in the reduce phase could
> > > be done on the map side, leaving only some smaller aggregations and
> > > computations for the reduce phase. My point was to do some aggregation
> > > (and create a new object) in the map phase instead of the reduce phase.
> > >
> > > Thanks,
> > > Chandan
> > >
> > >
> > > On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre <mkwhit@gmail.com> wrote:
> > >
> > > > Chandan,
> > > >    I think what you are wanting will just be a simple MapFn instead of a
> > > > CombineFn.  The doc of the CombineFn[1] sounds like what you want with the
> > > > statement "A special DoFn
> > > > <http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/DoFn.html>
> > > > implementation that converts an Iterable
> > > > <http://download.oracle.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true>
> > > > of values into a single value" but it is expecting the value to be of the
> > > > same type.  Since you are wanting to combine the values into a different
> > > > form it should be fairly trivial to write a MapFn that converts the
> > > > Iterable<T> -> U.
> > > >
> > > > [1] - http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html
> > > >
> > > >
> > > > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas <cbiswas1983@gmail.com> wrote:
> > > >
> > > > > I was trying to refactor some code to use CombineFn, but when I dug
> > > > > deeper I found that I can't, because Crunch doesn't provide the
> > > > > functionality I need. For example, I have a PGroupedTable<S,T>. I wanted
> > > > > to apply a CombineFn<S,T> to it and get a PCollection<S,U> instead of T.
> > > > > Right now, CombineFn only allows the same type as its return value. The
> > > > > use case is that there would be some time saved in sorting. It's natural
> > > > > that aggregating some objects on the map side can create an object of a
> > > > > new, different type.
> > > > >
> > > > > Any thoughts on it? Am I missing anything? If this can be written in a
> > > > > different way using existing functionality, please let me know.
> > > > >
> > > > > Thanks
> > > > > Chandan
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
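
For readers following the thread, the singleton trick Josh describes (and
the map / gbk / combine sequence Micah lists) can be sketched roughly as
follows. This is only an illustration in plain Java: it does not use the
Crunch API, ordinary JDK collections stand in for PTable and PGroupedTable,
and the class and method names (SingletonCombineSketch, lift, combine,
mapAndCombine, reduce) are hypothetical. T is Double and U is List<Double>,
as in the PTable<String, Collection<Double>> example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; plain JDK collections stand in for Crunch types.
public class SingletonCombineSketch {

    // Map-side conversion T -> U: wrap a single Double in a singleton list.
    public static List<Double> lift(double value) {
        List<Double> single = new ArrayList<>();
        single.add(value);
        return single;
    }

    // Combine step U + U -> U. Input and output types match, so the same
    // function can be applied on the map side and again on the reduce side.
    public static List<Double> combine(List<Double> left, List<Double> right) {
        List<Double> merged = new ArrayList<>(left);
        merged.addAll(right);
        return merged;
    }

    // Simulates one mapper: lift every record, then combine locally per key.
    public static Map<String, List<Double>> mapAndCombine(
            List<Map.Entry<String, Double>> records) {
        Map<String, List<Double>> out = new TreeMap<>();
        for (Map.Entry<String, Double> r : records) {
            out.merge(r.getKey(), lift(r.getValue()), SingletonCombineSketch::combine);
        }
        return out;
    }

    // Simulates the reduce side: merge the per-mapper partial results
    // using the same combine function.
    public static Map<String, List<Double>> reduce(
            List<Map<String, List<Double>>> partials) {
        Map<String, List<Double>> out = new TreeMap<>();
        for (Map<String, List<Double>> partial : partials) {
            for (Map.Entry<String, List<Double>> e : partial.entrySet()) {
                out.merge(e.getKey(), e.getValue(), SingletonCombineSketch::combine);
            }
        }
        return out;
    }
}
```

In this sketch each simulated mapper hands the reduce side one entry per
key rather than one entry per record. Whether that helps in a real job
depends on how costly U is to build and serialize, which is the concern
raised at the top of this message.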
