crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Process of CombineFn<S,T> returns <S,U>?
Date Thu, 17 Oct 2013 22:05:29 GMT
Yeah, my experience in these kinds of situations is that you need to come
up with a "dummy" or singleton version of U for the case where there is
only a single T, and do that conversion on the map side of the job, before
the combiner runs. I think Chao had an issue like this a while ago, where he
had a PTable<String, Double> and wanted to write a combiner that would
return a PTable<String, Collection<Double>>. The solution was to convert
the map-side output to a PTable<String, Collection<Double>>, where each
value on the map side was a singleton list containing just that double
value. Does that sort of trick work here?
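
A minimal sketch of that trick in plain Java, with ordinary collections
standing in for the Crunch types. It does not use the Crunch API, and every
name in it (SingletonCombineSketch, toSingleton, record, combine,
groupAndCombine, run) is made up for illustration. The point it tries to show
is that once the map side emits singleton collections, the combiner's input and
output types match, so it can be applied per mapper and then again on the
reduce side:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: plain collections stand in for PTable / PGroupedTable.
class SingletonCombineSketch {

    // Map side: wrap each Double (T) in a singleton list, so the value type
    // is already Collection<Double> (U) before any combining happens.
    static Collection<Double> toSingleton(double value) {
        return Collections.singletonList(value);
    }

    // One map-side output record: key plus singleton value.
    static Map.Entry<String, Collection<Double>> record(String key, double value) {
        return new AbstractMap.SimpleEntry<>(key, toSingleton(value));
    }

    // Combiner: many Collection<Double> in, one Collection<Double> out.
    // Same type on both sides, so it can safely run more than once.
    static Collection<Double> combine(Iterable<Collection<Double>> values) {
        List<Double> merged = new ArrayList<>();
        for (Collection<Double> v : values) {
            merged.addAll(v);
        }
        return merged;
    }

    // Stand-in for "group by key, then combine each group".
    static Map<String, Collection<Double>> groupAndCombine(
            Collection<Map.Entry<String, Collection<Double>>> records) {
        Map<String, List<Collection<Double>>> grouped = new TreeMap<>();
        for (Map.Entry<String, Collection<Double>> r : records) {
            grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        Map<String, Collection<Double>> combined = new TreeMap<>();
        for (Map.Entry<String, List<Collection<Double>>> g : grouped.entrySet()) {
            combined.put(g.getKey(), combine(g.getValue()));
        }
        return combined;
    }

    static Map<String, Collection<Double>> run() {
        // Output of two imaginary mappers, already converted to singletons.
        List<Map.Entry<String, Collection<Double>>> mapper1 = new ArrayList<>();
        mapper1.add(record("a", 1.0));
        mapper1.add(record("b", 2.0));
        mapper1.add(record("a", 3.0));
        List<Map.Entry<String, Collection<Double>>> mapper2 = new ArrayList<>();
        mapper2.add(record("a", 4.0));

        // Combine locally per mapper, then combine the partial results again.
        List<Map.Entry<String, Collection<Double>>> shuffled = new ArrayList<>();
        shuffled.addAll(groupAndCombine(mapper1).entrySet());
        shuffled.addAll(groupAndCombine(mapper2).entrySet());
        return groupAndCombine(shuffled);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=[1.0, 3.0, 4.0], b=[2.0]}
    }
}
```

In a real Crunch job the singleton conversion would live in the MapFn and the
merge in the CombineFn; the sketch only mirrors the shape of the types.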


On Thu, Oct 17, 2013 at 2:57 PM, Micah Whitacre <mkwhit@gmail.com> wrote:

> Ok so the feature you are trying to achieve is the proactive combination of
> data before performing the GBK like the javadoc describes.  Essentially in
> that situation the CombineFn is being used as a Combiner[1] to combine the
> data local to that mapper before doing the GBK and then further combining
> the data in the reduce operation.  It will not necessarily eliminate the
> need for all processing in the reduce.
>
> If you want to use this functionality you will need to do the following:
>
> PTable<S, T> map to PTable<S, U>
> PTable<S, U> gbk to PGroupedTable<S, U>
> PGroupedTable<S, U> combine to PTable<S, U>
>
> This will take advantage of any optimization provided by the CombineFn.
>
> [1] - http://wiki.apache.org/hadoop/HadoopMapReduce
>
>
>
> On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas <cbiswas1983@gmail.com> wrote:
>
> > Hello Micah,
> > Yes, we are using a MapFn now. That aggregation and computation is being
> > done in the reduce phase. Since a CombineFn after the GBK runs on the map
> > side, most of the computation that now runs in the reduce phase could be
> > done on the map side instead. Some smaller aggregations and computations
> > can still be done in the reduce phase.
> > My point was to do some aggregation (and create a new object) in the map
> > phase instead of in the reduce phase.
> >
> > Thanks,
> > Chandan
> >
> >
> > On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre <mkwhit@gmail.com> wrote:
> >
> > > Chandan,
> > >    I think what you want is just a simple MapFn instead of a
> > > CombineFn. The doc of the CombineFn[1] sounds like what you want with
> > > the statement "A special DoFn
> > > <http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/DoFn.html>
> > > implementation that converts an Iterable
> > > <http://download.oracle.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true>
> > > of values into a single value", but it expects the value to be of the
> > > same type. Since you want to combine the values into a different form,
> > > it should be fairly trivial to write a MapFn that converts the
> > > Iterable<T> -> U.
> > >
> > > [1] - http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html
> > >
> > >
> > > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas <cbiswas1983@gmail.com> wrote:
> > >
> > > > I was trying to refactor some code to use CombineFn, but when I dug
> > > > deeper I found that I can't, because Crunch doesn't allow the
> > > > functionality I need. For example, I have a PGroupedTable<S,T>. I
> > > > wanted to apply a CombineFn<S,T> to it and get a PCollection<S,U>
> > > > instead of T. Right now, CombineFn allows only the same type as its
> > > > return value. The use case is that it would save some time in sorting.
> > > > It's natural that aggregating objects on the map side can produce an
> > > > object of a new, different type.
> > > >
> > > > Any thoughts on it? Am I missing anything? If this can be written a
> > > > different way using existing functionality, please let me know.
> > > >
> > > > Thanks
> > > > Chandan
> > > >
> > >
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
