crunch-dev mailing list archives

From Micah Whitacre <mkw...@gmail.com>
Subject Re: Process of CombineFn<S,T> returns <S,U>?
Date Thu, 17 Oct 2013 21:57:34 GMT
OK, so the feature you are after is the proactive combination of data before
the GBK (group by key), as the javadoc describes.  In that situation the
CombineFn is used as a Combiner[1]: it combines the data local to each mapper
before the GBK, and the reduce operation then combines the data further.  It
will not necessarily eliminate the need for all processing in the reduce.

If you want to use this functionality you will need to do the following:

PTable<S, T> map to PTable<S, U>
PTable<S, U> gbk to PGroupedTable<S, U>
PGroupedTable<S, U> combine to PTable<S, U>

This will take advantage of any optimization provided by the CombineFn.
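
To make the three steps above concrete, here is a rough plain-Java sketch of the data flow. It deliberately does not use the Crunch API: the class CombineSketch, the Stats type (standing in for U), and the helper methods are all hypothetical names made up for illustration, and Integer stands in for T. The point it tries to show is that once the values are already of type U, the same U-to-U combine logic can run on each mapper's local data and again on the reduce side without changing the result.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the data flow; none of these names are Crunch API.
final class CombineSketch {

    // Hypothetical intermediate type U: a partial aggregate that merges with another.
    static final class Stats {
        final long count;
        final long sum;

        Stats(long count, long sum) {
            this.count = count;
            this.sum = sum;
        }

        // The combine step: U in, U out, so it is safe to apply more than once.
        Stats merge(Stats other) {
            return new Stats(count + other.count, sum + other.sum);
        }

        double mean() {
            return (double) sum / count;
        }
    }

    // Step 1 (map): turn each raw value T (an Integer here) into a U.
    static Stats toStats(int value) {
        return new Stats(1, value);
    }

    // Steps 2 and 3 (group by key, then combine the U values for each key).
    static Map<String, Stats> groupAndCombine(List<Map.Entry<String, Stats>> records) {
        Map<String, Stats> byKey = new TreeMap<>();
        for (Map.Entry<String, Stats> record : records) {
            byKey.merge(record.getKey(), record.getValue(), Stats::merge);
        }
        return byKey;
    }

    // Runs the flow over several input splits, one per simulated mapper.
    static Map<String, Stats> run(List<List<Map.Entry<String, Integer>>> splits,
                                  boolean combineOnMapSide) {
        List<Map.Entry<String, Stats>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> split : splits) {
            List<Map.Entry<String, Stats>> mapped = new ArrayList<>();
            for (Map.Entry<String, Integer> record : split) {
                mapped.add(new SimpleImmutableEntry<>(record.getKey(), toStats(record.getValue())));
            }
            if (combineOnMapSide) {
                // Pre-aggregate locally so fewer records cross the shuffle.
                shuffled.addAll(groupAndCombine(mapped).entrySet());
            } else {
                shuffled.addAll(mapped);
            }
        }
        // Reduce side: the same combine logic finishes the aggregation.
        return groupAndCombine(shuffled);
    }

    // Made-up sample input: two splits with keys "a" and "b".
    static List<List<Map.Entry<String, Integer>>> demoSplits() {
        List<Map.Entry<String, Integer>> first = new ArrayList<>();
        first.add(new SimpleImmutableEntry<>("a", 1));
        first.add(new SimpleImmutableEntry<>("a", 3));
        first.add(new SimpleImmutableEntry<>("b", 10));
        List<Map.Entry<String, Integer>> second = new ArrayList<>();
        second.add(new SimpleImmutableEntry<>("a", 5));
        second.add(new SimpleImmutableEntry<>("b", 20));
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Stats> e : run(demoSplits(), true).entrySet()) {
            System.out.println(e.getKey() + " mean=" + e.getValue().mean());
        }
    }
}
```

Calling run with combineOnMapSide set to false gives the same per-key totals, which is the property a combiner relies on: the framework may apply it zero or more times, so its input and output types have to match.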

[1] - http://wiki.apache.org/hadoop/HadoopMapReduce



On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas <cbiswas1983@gmail.com> wrote:

> Hello Micah,
> Yes, we are using a MapFn now, so the aggregation and computation is done
> in the reduce phase. Since a CombineFn applied after a GBK also runs on the
> map side, most of the computation that now runs in the reduce phase could be
> done on the map side, leaving only some smaller aggregations and
> computations for the reduce phase.
> My point was to do some aggregation (and create a new object) in the map
> phase instead of in the reduce phase.
>
> Thanks,
> Chandan
>
>
> On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre <mkwhit@gmail.com> wrote:
>
> > Chandan,
> >    I think what you want is just a simple MapFn instead of a
> > CombineFn.  The doc of the CombineFn[1] sounds like what you want, with
> > the statement "A special DoFn implementation that converts an Iterable of
> > values into a single value", but it expects the value to be of the same
> > type.  Since you want to combine the values into a different form, it
> > should be fairly trivial to write a MapFn that converts the Iterable<T> ->
> > U.
> >
> > [1] -
> > http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html
> >
> >
> > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas <cbiswas1983@gmail.com> wrote:
> >
> > > I was trying to refactor some code to use a CombineFn, but when I
> > > looked deeper I found that I can't, because Crunch doesn't allow the
> > > functionality I need. For example, I have a PGroupedTable<S,T>. I wanted
> > > to apply a CombineFn<S,T> to it and get a PCollection<S,U> instead of T.
> > > Right now, CombineFn only allows the return value to be of the same
> > > type. The use case is that this would save some time in sorting:
> > > aggregating objects on the map side can naturally create an object of a
> > > new, different type.
> > >
> > > Any thoughts on this? Am I missing anything? If this can be written
> > > another way with the existing API, please let me know.
> > >
> > > Thanks
> > > Chandan
> > >
> >
>
