crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Concerning the use of the Iterable parameter to CombineFn
Date Fri, 05 Apr 2013 19:40:03 GMT
Hi Chad,

Good point -- I know that this has tripped people up in the past. I think documenting this,
and possibly enforcing it, is definitely a good idea -- I've logged a ticket in JIRA (with
the content of your mail), see https://issues.apache.org/jira/browse/CRUNCH-192

- Gabriel


On 05 Apr 2013, at 21:30, Chad Urso McDaniel <chadum@gmail.com> wrote:

> BLUF: The Iterable parameter to CombineFn.process implies you can iterate multiple times when you cannot, and this leads to surprising behavior.
> 
> As many of you probably know, the signature of CombineFn.process is 
> ---
> process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
> ---
> 
> The corresponding Hadoop Reducer signature is
> ---
> reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
> ---
> 
> I assume the Crunch use of Iterable is for convenient use in "for" loops.
> 
> Unfortunately, this Iterable seems to return the same Iterator object each time Iterable.iterator() is called.
> 
> This makes sense to me given the underlying Hadoop MapReduce, but it violates what I think most people expect from the Iterable interface.
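
The behavior described above can be reproduced outside Crunch with a small stand-in. This is only a sketch: `SinglePassDemo`, `singlePass`, `count`, and `materialize` are hypothetical names, not part of Crunch or Hadoop, and the wrapper merely assumes (as the message reports) that every call to `iterator()` hands back the same underlying Iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SinglePassDemo {
    // Hypothetical stand-in for the values Iterable passed to a combiner:
    // every call to iterator() returns the same underlying Iterator.
    static <V> Iterable<V> singlePass(final Iterator<V> source) {
        return new Iterable<V>() {
            @Override
            public Iterator<V> iterator() {
                return source;
            }
        };
    }

    // Counts elements with an ordinary for-each loop.
    static <V> int count(Iterable<V> values) {
        int n = 0;
        for (V ignored : values) {
            n++;
        }
        return n;
    }

    // Possible workaround: copy the values once, then iterate the copy freely.
    static <V> List<V> materialize(Iterable<V> values) {
        List<V> copy = new ArrayList<V>();
        for (V v : values) {
            copy.add(v);
        }
        return copy;
    }

    public static void main(String[] args) {
        Iterable<Integer> values = singlePass(Arrays.asList(1, 2, 3).iterator());
        System.out.println(count(values)); // prints 3
        System.out.println(count(values)); // prints 0: the iterator is already exhausted

        List<Integer> copy = materialize(singlePass(Arrays.asList(1, 2, 3).iterator()));
        System.out.println(count(copy));   // prints 3
        System.out.println(count(copy));   // prints 3 again
    }
}
```

Copying costs memory proportional to the number of values for a key, and if the framework reuses value objects between iterations, copying references as above may not be enough; a per-element deep copy would then be needed.
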
> 
> I understand that it's too late to change the interface, but could we at least have a javadoc note, or an exception thrown if the Iterable is used more than once?
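
One way the "exception if used more than once" idea could look is sketched below. `SingleUseIterable` is a hypothetical class written for illustration; it is not what Crunch does or did in response to CRUNCH-192, just a minimal guard that fails fast on a second call to `iterator()`.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical guard: wraps a one-shot Iterator and rejects a second traversal
// instead of silently yielding nothing.
class SingleUseIterable<V> implements Iterable<V> {
    private final Iterator<V> source;
    private boolean consumed = false;

    SingleUseIterable(Iterator<V> source) {
        this.source = source;
    }

    @Override
    public Iterator<V> iterator() {
        if (consumed) {
            throw new IllegalStateException(
                "This Iterable supports only a single pass over its values");
        }
        consumed = true;
        return source;
    }

    public static void main(String[] args) {
        Iterable<String> values =
            new SingleUseIterable<String>(Arrays.asList("a", "b").iterator());

        for (String v : values) {
            System.out.println(v); // first pass works normally
        }

        try {
            values.iterator();     // second pass is rejected
        } catch (IllegalStateException e) {
            System.out.println("second pass rejected: " + e.getMessage());
        }
    }
}
```

The guard trades silent data loss for a loud failure, which would break existing code that calls `iterator()` twice without noticing, so it is a behavioral change and not just documentation.
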

