crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Concerning the use of the Iterable parameter to CombineFn
Date Sat, 06 Apr 2013 18:41:01 GMT
On Sat, Apr 6, 2013 at 8:28 PM, Josh Wills <jwills@cloudera.com> wrote:

> We could also try caching/spilling the contents of the Iterable so that it
> could actually be used more than once. I'm wondering if we could detect
> that multiple clients were calling the same groupByKey() output and
> automatically swap out the Iterable for one that cached the results.
>
>
Yeah, that's definitely an option -- but are we talking about two different
issues here? The issue that Chad brought up is the ability for a single
DoFn to iterate over an iterable of values multiple times, while (as far as
I understand) you're talking about having multiple DoFns running on reducer
input, right?
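
For reference, the caching idea could look roughly like the sketch below. It is purely illustrative: CachingIterable is a hypothetical name, not an existing Crunch class, and it buffers everything in memory with no spilling. As far as I know Hadoop may reuse value objects between calls to next(), so a real version would probably need to copy each value before caching it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch (not a Crunch class): replays a one-shot Iterator from an in-memory cache. */
final class CachingIterable<T> implements Iterable<T> {
    private final Iterator<T> source;                // the underlying single-pass iterator
    private final List<T> cache = new ArrayList<>(); // values already pulled from the source

    CachingIterable(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        // Each call gets its own cursor over the shared cache.
        return new Iterator<T>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < cache.size() || source.hasNext();
            }

            @Override
            public T next() {
                if (pos < cache.size()) {
                    return cache.get(pos++);         // replay a cached value
                }
                if (!source.hasNext()) {
                    throw new NoSuchElementException();
                }
                T value = source.next();             // pull a fresh value and remember it
                cache.add(value);
                pos++;
                return value;
            }
        };
    }
}
```

With something like this, two passes over the same wrapped iterator would see the same values, at the cost of holding every value for the key in memory.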

For the first case, it seems acceptable to me to just enforce that the
iterable can only be iterated over once. For the second case, I think it
could definitely be interesting to try to do what you're talking about (if
I'm correctly understanding what you were suggesting :-))
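
To make the first case concrete, here is a minimal sketch of what enforcing single iteration could look like. SingleUseIterable is a hypothetical name used only for illustration, not something that exists in Crunch, and the exception type and message are my own assumptions.

```java
import java.util.Iterator;

/** Hypothetical sketch (not a Crunch class): an Iterable that may only be iterated over once. */
final class SingleUseIterable<T> implements Iterable<T> {
    private final Iterator<T> delegate; // the underlying single-pass iterator
    private boolean consumed = false;   // set once iterator() has been handed out

    SingleUseIterable(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            // Fail loudly instead of silently returning an exhausted iterator.
            throw new IllegalStateException("This Iterable can only be iterated over once");
        }
        consumed = true;
        return delegate;
    }
}
```

A first "for" loop over the wrapper works as usual; a second one throws right away, instead of quietly seeing no values.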

- Gabriel




>
> On Fri, Apr 5, 2013 at 12:40 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
>> Hi Chad,
>>
>> Good point -- I know that this has tripped people up in the past. I think
>> that definitely documenting this and possibly enforcing it sounds like a
>> good idea -- I've logged a ticket in JIRA (with the content of your mail),
>> see https://issues.apache.org/jira/browse/CRUNCH-192
>>
>> - Gabriel
>>
>>
>> On 05 Apr 2013, at 21:30, Chad Urso McDaniel <chadum@gmail.com> wrote:
>>
>> > BLUF: The Iterable parameter to CombineFn.process implies you can
>> > iterate multiple times when you cannot, and this leads to surprising
>> > behavior.
>> >
>> > As many of you probably know, the signature of CombineFn.process is
>> > ---
>> > process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
>> > ---
>> >
>> > The corresponding Hadoop Reducer signature is
>> > ---
>> > reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
>> > ---
>> >
>> > I assume the Crunch use of Iterable is for convenient use in "for"
>> > loops.
>> >
>> > Unfortunately, this Iterable seems to return the same Iterator
>> > object each time Iterable.iterator() is called.
>> >
>> > This makes sense to me based on the underlying Hadoop MapReduce, but it
>> > violates what I think most people expect from the Iterable interface.
>> >
>> > I understand that it's too late to change the interface, but could we
>> > at least have a javadoc note, or an exception thrown if the Iterable is
>> > used more than once?
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
