crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Concerning the use of the Iterable parameter to CombineFn
Date Sat, 06 Apr 2013 18:44:37 GMT
On Sat, Apr 6, 2013 at 11:41 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

>
>
>
> On Sat, Apr 6, 2013 at 8:28 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> We could also try caching/spilling the contents of the Iterable so that
>> it could actually be used more than once. I'm wondering if we could detect
>> that multiple clients were calling the same groupByKey() output and
>> automatically swap out the Iterable for one that cached the results.
>>
>>
> Yeah, that's definitely an option -- but are we talking about two
> different issues here? The issue that Chad brought up is the ability for a
> single DoFn to iterate over an iterable of values multiple times, while (as
> far as I understand) you're talking about having multiple DoFns running on
> reducer input, right?
>
> For the first case, it seems acceptable to me to just enforce that the
> iterable can only be iterated over once. For the second case, I think it
> could definitely be interesting to try to do what you're talking about (if
> I'm correctly understanding what you were suggesting :-))
>

A very good point -- I misread Chad's email. Will open up a separate JIRA
for the caching idea.
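
As a rough illustration of the caching idea, here is a minimal in-memory sketch. CachingIterable is a hypothetical name, not an existing Crunch class, and it skips the spilling part entirely: it assumes the values for one key fit in memory and that the value objects are safe to hold on to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper, not part of Crunch: buffers a one-shot iterator in memory. */
final class CachingIterable<V> implements Iterable<V> {
    private final List<V> cache = new ArrayList<>();

    CachingIterable(Iterator<V> source) {
        // Drain the single-pass iterator eagerly. Assumption: everything fits in
        // memory; a real version would need to spill to disk, and would probably
        // have to copy each value if the framework reuses value objects.
        while (source.hasNext()) {
            cache.add(source.next());
        }
    }

    @Override
    public Iterator<V> iterator() {
        // Each call returns a fresh iterator over the buffered values.
        return cache.iterator();
    }

    /** Counts the elements seen in one full pass over the given Iterable. */
    static <T> int count(Iterable<T> values) {
        int n = 0;
        for (T ignored : values) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterable<String> values =
            new CachingIterable<>(Arrays.asList("a", "b", "c").iterator());
        System.out.println(count(values)); // first pass
        System.out.println(count(values)); // second pass sees the same elements
    }
}
```

The trade-off is memory: buffering is only reasonable when more than one consumer actually needs the values, which is why detecting that case automatically would matter.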


> - Gabriel
>
>
>
>
>>
>> On Fri, Apr 5, 2013 at 12:40 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>>
>>> Hi Chad,
>>>
>>> Good point -- I know that this has tripped people up in the past. I
>>> think that definitely documenting this and possibly enforcing it sounds
>>> like a good idea -- I've logged a ticket in JIRA (with the content of your
>>> mail), see https://issues.apache.org/jira/browse/CRUNCH-192
>>>
>>> - Gabriel
>>>
>>>
>>> On 05 Apr 2013, at 21:30, Chad Urso McDaniel <chadum@gmail.com> wrote:
>>>
>>> > BLUF: The Iterable parameter to CombineFn.process implies you can
>>> iterate multiple times when you cannot and this leads to surprising
>>> behavior.
>>> >
>>> > As many of you probably know, the signature of CombineFn.process is
>>> > ---
>>> > process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
>>> > ---
>>> >
>>> > The corresponding Hadoop Reducer signature is
>>> > ---
>>> > reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output,
>>> Reporter reporter)
>>> > ---
>>> >
>>> > I assume the Crunch use of Iterable is for convenient use in "for"
>>> loops.
>>> >
>>> > Unfortunately, this Iterable seems to return the same Iterator
>>> object each time Iterable.iterator() is called.
>>> >
>>> > This makes sense to me based on the underlying Hadoop MapReduce, but
>>> violates what I think most people expect from the Iterable interface.
>>> >
>>> > I understand that it's too late to change the interface, but could we
>>> at least have a javadoc or an exception thrown if the Iterable is used
>>> more than once?
>>>
>>>
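
For reference, the "exception thrown if the Iterable is used more than once" option could look roughly like the sketch below. SinglePassIterable is a hypothetical wrapper written for illustration, not the actual Crunch implementation; it only assumes the framework hands over a one-shot Iterator, as the Hadoop reduce signature quoted above does.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical helper, not part of Crunch: fails fast on a second iterator() call. */
final class SinglePassIterable<V> implements Iterable<V> {
    private final Iterator<V> source;
    private boolean consumed = false;

    SinglePassIterable(Iterator<V> source) {
        this.source = source;
    }

    @Override
    public Iterator<V> iterator() {
        if (consumed) {
            // Without this check, a second loop would silently see no values.
            throw new IllegalStateException("values can only be iterated once");
        }
        consumed = true;
        return source;
    }

    /** Sums the values with a for-each loop, using up the single allowed pass. */
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        Iterable<Integer> values =
            new SinglePassIterable<>(Arrays.asList(1, 2, 3).iterator());
        System.out.println(sum(values)); // first pass works
        try {
            sum(values); // second pass is rejected instead of returning 0
        } catch (IllegalStateException e) {
            System.out.println("second pass rejected: " + e.getMessage());
        }
    }
}
```

Throwing on the second call turns a silent wrong answer into an immediate, explainable failure, at the cost of breaking any existing code that happens to call iterator() twice.
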
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
