incubator-crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Exception due to same iterator returned back by PGroupedTableType
Date Thu, 28 Jun 2012 11:25:49 GMT
Hey Rahul,

Re: groupByKey vs. collectValues: collectValues calls groupByKey in
the course of its operations; it's basically just a convenience method
for the kind of problem you're trying to solve (i.e., I need to
iterate over the values returned by groupByKey multiple times, so
please put them into a collection.) There are lots of other cases
where you do not need to iterate over the values multiple times, and
so Crunch (much like MapReduce) doesn't bother to keep everything
around unless you explicitly ask it to do so.
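
To illustrate the difference, here is a minimal sketch in plain Java. It does not use Crunch and is not how Crunch is implemented. `OneShotIterable` is a hypothetical stand-in for a values Iterable that returns the same underlying iterator on every call (the behavior named in the subject line), and `collect` stands in for what collectValues does conceptually.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the grouped values handed to a reducer:
// every call to iterator() returns the SAME underlying iterator.
class OneShotIterable<T> implements Iterable<T> {
  private final Iterator<T> source;

  OneShotIterable(Iterator<T> source) {
    this.source = source;
  }

  @Override
  public Iterator<T> iterator() {
    return source; // not a fresh iterator, so a second pass finds it exhausted
  }
}

class OneShotIterableDemo {
  // Walks the Iterable once and counts what it sees.
  static <T> int count(Iterable<T> values) {
    int n = 0;
    for (T ignored : values) {
      n++;
    }
    return n;
  }

  // Copies the values into an in-memory collection that can be re-read.
  static <T> Collection<T> collect(Iterable<T> values) {
    List<T> copy = new ArrayList<T>();
    for (T value : values) {
      copy.add(value);
    }
    return copy;
  }

  public static void main(String[] args) {
    Iterable<String> oneShot =
        new OneShotIterable<String>(Arrays.asList("a", "b", "c").iterator());
    System.out.println(count(oneShot)); // first pass sees all three values
    System.out.println(count(oneShot)); // second pass sees none

    Collection<String> kept = collect(
        new OneShotIterable<String>(Arrays.asList("a", "b", "c").iterator()));
    System.out.println(count(kept)); // a materialized collection can be
    System.out.println(count(kept)); // traversed as often as needed
  }
}
```

The trade-off is memory: the copy holds every value for a key at once, which the lazy, single-pass form avoids.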

On Thu, Jun 28, 2012 at 4:13 AM, Rahul <rsharma@xebia.com> wrote:
> Hi Gabriel,
>
> I have found a way in which Crunch supports the use case of iterating
> repeatedly, but I am not completely sure of the ins and outs of it.
> Basically, rather than doing a groupByKey on the PTable to get back a
> PGroupedTable, I used the collectValues API to get back a
> PTable<key, Collection<values>>.
>
>     PTable<Integer, Collection<TupleN>> collectValues =
>         classifiedData.collectValues();
>     PTable<String, Integer> scores = collectValues.parallelDo("compute pairs",
>         new PTableScoreCalculator(),
>         Writables.tableOf(Writables.strings(), Writables.ints()));
>
>
> Now when I do a parallelDo on the new collection, I get back a Pair of the
> key type and an ArrayList<valueType>, over which I can iterate as I wish.
>
> class PTableScoreCalculator
>     extends DoFn<Pair<Integer, Collection<TupleN>>, Pair<String, Integer>> {
>   public void process(Pair<Integer, Collection<TupleN>> input,
>       Emitter<Pair<String, Integer>> emitter) {
>     Iterator<TupleN> primary = input.second().iterator();
>     .....................
>   }
> }
>
> This way I can iterate over the values again and again. Any comments on this
> approach? I am attaching my test case for reference.
>
> BTW, why are there two methods that do the same thing, groupByKey and
> collectValues? I see that an aggregation gets invoked for collectValues,
> while in the other case a lazy collection gets created. Any idea about the
> different applications of the two?
>
> regards,
> Rahul
>
>
>
> On 28-06-2012 14:17, Gabriel Reid wrote:
>
> On Thu, Jun 28, 2012 at 9:29 AM, Rahul <rsharma@xebia.com> wrote:
>
> Yes indeed this is a small PoC to get familiar with Crunch in relation to my
> problem. Basically I have the following algo at play:
> 1. Read data rows
> 2. Create custom keys for each of them, built using various attributes of
> data (this time it is just a simple hash code, but I would like to emit
> multiple key-value pairs)
> 3. Group similar data based on created Keys
> 4. Iterate over individual items in the group and do extensive comparison
> between all of them
>
> I just built an outline in the test case to see what can be done and how.
> Can you advise something better?
>
> Thanks for the outline. In this case, your approach (with putting the
> contents of the incoming Iterable into a collection) should work fine, as
> long as the number of elements per group is relatively small (i.e. easily
> able to fit in the memory available to each reducer in your Hadoop cluster).
>
> - Gabriel
>
>
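
The outline in the quoted thread (build a key per row, group rows by key, then compare every pair within a group) can be sketched in plain Java as below. This is only an illustration under stated assumptions: it runs in memory without Crunch, and `PairwiseGroupComparison`, `groupBy`, and `countMatchingPairs` are hypothetical names that do not come from the attached test case.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Function;

class PairwiseGroupComparison {
  // Steps 2 and 3 of the outline: derive a key for each row and group rows by it.
  static <K, T> Map<K, List<T>> groupBy(Collection<T> rows, Function<T, K> keyFn) {
    Map<K, List<T>> groups = new LinkedHashMap<K, List<T>>();
    for (T row : rows) {
      K key = keyFn.apply(row);
      List<T> group = groups.get(key);
      if (group == null) {
        group = new ArrayList<T>();
        groups.put(key, group);
      }
      group.add(row);
    }
    return groups;
  }

  // Step 4: compare every unordered pair in one group.
  // This needs the whole group in memory and revisits each element many
  // times, which a single-pass iterator cannot support.
  static <T> int countMatchingPairs(Collection<T> group, BiPredicate<T, T> similar) {
    List<T> items = new ArrayList<T>(group);
    int matches = 0;
    for (int i = 0; i < items.size(); i++) {
      for (int j = i + 1; j < items.size(); j++) {
        if (similar.test(items.get(i), items.get(j))) {
          matches++;
        }
      }
    }
    return matches;
  }
}
```

A group of n elements costs n(n-1)/2 comparisons and is held in memory in full, which is why this approach only suits groups that are relatively small.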



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
