incubator-crunch-user mailing list archives

From Rahul <>
Subject Re: Exception due to same iterator returned back by PGroupedTableType
Date Thu, 28 Jun 2012 07:29:27 GMT
Hi Gabriel,

Yes, this is indeed a small PoC to get familiar with Crunch in relation 
to my problem. Basically, I have the following algorithm in play:
1. Read data rows
2. Create custom keys for each row, built from various attributes 
of the data (for now the key is just a simple hash code, but I would 
like to emit multiple key-value pairs per row)
3. Group similar data based on the created keys
4. Iterate over the individual items in each group and do an extensive 
comparison between all of them

I just built an outline in the test case to see what can be done and how. 
Can you advise something better?
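
To make step 4 concrete, here is a minimal sketch in plain Java with no Crunch or Hadoop dependency. The class `GroupBuffering`, the stand-in `OneShotIterable`, and the methods `buffer` and `orderedPairs` are hypothetical names made up for this illustration. They only imitate an Iterable that always hands back the same iterator, and show that copying a group into a list first allows the n*(n-1) comparisons. It assumes each group fits in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class GroupBuffering {

    /** Hypothetical stand-in for a grouped Iterable: every call returns the same iterator. */
    static final class OneShotIterable<T> implements Iterable<T> {
        private final Iterator<T> iterator;

        OneShotIterable(Iterable<T> source) {
            this.iterator = source.iterator();
        }

        @Override
        public Iterator<T> iterator() {
            return iterator;
        }
    }

    /** Copies the values into a list; this is the single pass over the Iterable. */
    static <T> List<T> buffer(Iterable<T> values) {
        List<T> buffered = new ArrayList<>();
        for (T value : values) {
            buffered.add(value);
        }
        return buffered;
    }

    /** Returns every ordered pair of distinct positions, i.e. n*(n-1) pairs. */
    static <T> List<List<T>> orderedPairs(Iterable<T> values) {
        List<T> buffered = buffer(values);
        List<List<T>> pairs = new ArrayList<>();
        for (int i = 0; i < buffered.size(); i++) {
            for (int j = 0; j < buffered.size(); j++) {
                if (i != j) {
                    pairs.add(Arrays.asList(buffered.get(i), buffered.get(j)));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Iterable<String> group = new OneShotIterable<>(Arrays.asList("a", "b", "c"));
        System.out.println(orderedPairs(group).size()); // prints 6 (3 * 2 pairs)
        System.out.println(buffer(group).size());       // prints 0: the iterator is already exhausted
    }
}
```

In a real job the framework may reuse the value objects it hands out during iteration, so each element might need a defensive copy before it is added to the list. That is not shown here.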


On 28-06-2012 12:30, Gabriel Reid wrote:
> Hi Rahul,
> Ok, looks like I misunderstood your code. In that case, you're indeed
> correct that a PeekingIterator won't help you -- it looks like you will
> indeed need to store the data in a collection per group in order to do
> the processing that you're trying to do.
> Am I correct in assuming that this code is an attempt to get familiar
> with Crunch, and less about solving a real-world problem right now? If
> you are trying to put together a solution for a problem, maybe you could
> outline what you're trying to get to -- there may be a better way to get
> there. I noticed that you're grouping values by the hash code of the
> input line, which looks questionable to me.
> Regards,
> Gabriel
> On Thu, Jun 28, 2012 at 8:05 AM, Rahul <> wrote:
>> Hi Gabriel,
>> I am doing n*(n-1) comparisons here: every element is compared with
>> every other element, so a peeking iterator would not help much. It would
>> give me the next element, but I need to keep all the elements that have
>> already been accessed in another Collection so that I can iterate over
>> them again and again.
>> Or is there something that would help here?
>> regards,
>> Rahul
>> On 27-06-2012 17:48, Gabriel Reid wrote:
>>> On Wed, Jun 27, 2012 at 1:41 PM, Rahul <> wrote:
>>>> I am trying to create multiple iterators in a DoFn process method.
>>>>   public void process(Pair<Integer, Iterable<TupleN>> input,
>>>>          Emitter<Pair<String, Integer>> emitter) {}
>>>> Every time I ask for an iterator it gives back the same one, and thus I
>>>> cannot traverse the list again and again, as I am hitting the following
>>>> stack trace.
>>> The Iterable.iterator call always returns the same iterator because
>>> this is the behaviour that is inherited from the reduce method of the
>>> Hadoop Reducer class (and this behaviour is there because of the
>>> underlying way in which Hadoop MapReduce functions). In both Crunch and
>>> pure MapReduce, you've just got one shot at looping over an Iterable in
>>> a reducer (or a DoFn that is functioning on a PGroupedTable).
>>> If I understood your code correctly, you're trying to loop over an
>>> Iterable while looking at two consecutive elements at a time. Probably
>>> the easiest way of doing this is using the PeekingIterator class in
>>> Google Guava. This will allow you to look one element ahead within an
>>> iterator.
>>> Regards,
>>> Gabriel
