incubator-crunch-user mailing list archives

From Rahul <rsha...@xebia.com>
Subject Re: Exception due to same iterator returned back by PGroupedTableType
Date Thu, 28 Jun 2012 11:13:41 GMT
Hi Gabriel,

I have found a way in which Crunch supports the use case of iterating over the 
grouped values more than once, although I am not completely sure of the ins and 
outs of it. Basically, rather than doing a groupByKey on the PTable to get back 
a PGroupedTable, I used the collectValues API to get back a 
PTable<key, Collection<values>>.

    PTable<Integer, Collection<TupleN>> collectValues = classifiedData.collectValues();
    PTable<String, Integer> scores = collectValues.parallelDo("compute pairs",
        new PTableScoreCalculator(),
        Writables.tableOf(Writables.strings(), Writables.ints()));


Now when I do a parallelDo on the new collection, my DoFn receives a Pair of the 
key type and a Collection of the value type, over which I can iterate as I wish.

    class PTableScoreCalculator extends DoFn<Pair<Integer, Collection<TupleN>>, Pair<String, Integer>> {
      public void process(Pair<Integer, Collection<TupleN>> input,
          Emitter<Pair<String, Integer>> emitter) {
        Iterator<TupleN> primary = input.second().iterator();
        .....................
      }
    }
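
For illustration, this is roughly the full shape of the class; the counting 
below is only a placeholder for the real comparison logic, which is in the 
attached test case:

    // Sketch only: the counting is a stand-in for the real comparison logic.
    // Uses java.util.Collection plus the usual Crunch DoFn/Emitter/Pair imports.
    class PTableScoreCalculator extends DoFn<Pair<Integer, Collection<TupleN>>, Pair<String, Integer>> {
      @Override
      public void process(Pair<Integer, Collection<TupleN>> input,
          Emitter<Pair<String, Integer>> emitter) {
        Collection<TupleN> values = input.second();
        for (TupleN outer : values) {
          int matches = 0;
          // A second, independent pass over the same values -- exactly what the
          // single-use Iterator handed out by PGroupedTable does not allow.
          for (TupleN inner : values) {
            if (inner != outer) {
              matches++;   // placeholder for the real pairwise comparison
            }
          }
          emitter.emit(Pair.of(String.valueOf(input.first()), matches));
        }
      }
    }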

This way I can iterate over the values again and again. Any comments on this 
approach? I am attaching my test case for reference.

BTW, why are there two methods that can do the same thing, the groupByKey 
method and the collectValues method? I see that an Aggregator gets invoked for 
the collectValues API, while in the other case a lazy collection gets created. 
Any idea on the different applications of the two?
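
To make the comparison concrete, this is how I understand the groupByKey route 
would have to look; since the Iterable handed to the DoFn can only be walked 
once, I would have to copy the values into a list myself before making repeated 
passes (the class name and the copy step are just my own sketch, not something 
taken from the Crunch sources):

    // My own sketch of the groupByKey alternative, assuming the values are
    // copied into a java.util.ArrayList so they can be traversed more than once.
    class GroupedScoreCalculator extends DoFn<Pair<Integer, Iterable<TupleN>>, Pair<String, Integer>> {
      @Override
      public void process(Pair<Integer, Iterable<TupleN>> input,
          Emitter<Pair<String, Integer>> emitter) {
        List<TupleN> values = new ArrayList<TupleN>();
        for (TupleN t : input.second()) {   // single pass over the lazy Iterable
          values.add(t);
        }
        // 'values' can now be iterated as often as needed, just like the
        // Collection produced by collectValues.
      }
    }

    PGroupedTable<Integer, TupleN> grouped = classifiedData.groupByKey();
    PTable<String, Integer> scores = grouped.parallelDo("compute pairs",
        new GroupedScoreCalculator(),
        Writables.tableOf(Writables.strings(), Writables.ints()));

Either way the trade-off seems to be memory: collectValues materializes the 
whole group into a Collection on the reduce side anyway, so in both cases a 
group has to fit in the reducer's memory.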

regards,
Rahul


On 28-06-2012 14:17, Gabriel Reid wrote:
> On Thu, Jun 28, 2012 at 9:29 AM, Rahul<rsharma@xebia.com>  wrote:
>> Yes indeed this is a small PoC to get familiar with Crunch in relation to my
>> problem. Basically I have the following algo at play:
>> 1. Read data rows
>> 2. Create custom keys for each of them, built using various attributes of
>> data (this time it is just a simple hash code, but I would like to emit
>> multiple key-value pairs)
>> 3. Group similar data based on created Keys
>> 4. Iterate over individual items in the group and do extensive comparison
>> between all of them
>>
>> I just built an outline in the test case to see what/how can be done, can
>> you advise something better ?
>
> Thanks for the outline. In this case, your approach (putting the
> contents of the incoming Iterable into a collection) should work fine,
> as long as the number of elements per group is relatively small (i.e.
> easily able to fit in the memory available to each reducer in your
> Hadoop cluster).
>
> - Gabriel

