crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: Use of Iterable with combine in Scrunch
Date Wed, 30 Jul 2014 14:47:34 GMT
I would be in favor of both changes, if no one else minded. I love that
Scala allows us to express the TraversableOnce concept, and I'm all for
using it in this instance.

While we're on the subject of combiners, I was toying with the idea of a
GenericAggregator, which would basically encapsulate a combine + a map so
that you could do things like averages.

Oh, and slightly related, I hope you saw where I pulled in Algebird for
monoid calculations in Scrunch in
https://issues.apache.org/jira/browse/CRUNCH-424

J

J


On Wed, Jul 30, 2014 at 6:52 AM, David Whiting <davidwhiting@gmail.com>
wrote:

> The Scrunch version of combine accepts a function Iterable[V] => V . This
> causes a lot of unexpected behaviour because the iterable that is wrapped
> is actually a SingleUseIterable, and much of Scala's collection function
> implementations actually try and access the underlying iterator multiple
> times if they know that it's possible. This leads to often having to write
> code like this:
>
> ...
> .groupByKey()
> .combine { _.iterator reduce { _ + _ } }
>
> This is a silly example of course, because there's an Aggregator for
> summation, but if your reduce function is more complex you have to do this
> indirection via iterator in order to get correct behaviour.
>
> Possible fixes:
> a) Change combine to accept a function TraversableOnce[V] => V or
> Iterator[V] => V, better reflecting the single-use nature of the underlying
> Iterable
> b) Given that most custom combines will in fact be folds over monoids, we
> could promote the notion of reduce or fold up the the PGroupedTable itself,
> so you can do .groupByKey().foldValues(_+_)
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message