incubator-crunch-dev mailing list archives

From "Rahul Sharma (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-51) PCollection#sort relies on using a single reducer for total order sorting
Date Thu, 20 Sep 2012 08:39:07 GMT


Rahul Sharma commented on CRUNCH-51:

I had to develop the Reservoir code because CrunchTotalOrderPartitioner requires a sequence
file (containing the keys) to work with. The approach you are advocating is definitely simpler
and more efficient, but you would have to hack your way through the partitioner to make it
work. In the distributed cache you would have data of type T, but the partitioner receives
the corresponding mapped type, so the binary tree it holds would have to be built from that
mapped type.
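
To make the type mismatch concrete, here is a rough, self-contained sketch of the idea: sample
the input with a reservoir, convert each sampled T value to its mapped type, and then look up
partitions by binary search over the sorted split points. This is not the code from the patch.
The class name TotalOrderSketch, the method names, and the toMapped function (standing in for
whatever conversion the PType applies) are all assumptions made for illustration, and a sorted
list stands in for the partitioner's binary tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from Crunch or the patch.
public class TotalOrderSketch {

    /** Reservoir sampling (Algorithm R): keeps a uniform sample of at most k items. */
    static <T> List<T> reservoirSample(Iterable<T> input, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        for (T item : input) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);                       // fill the reservoir first
            } else {
                long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, item);          // replace with probability k/seen
                }
            }
        }
        return reservoir;
    }

    /**
     * Converts the sampled T values to the mapped key type K (assumed to be what the
     * partitioner sees) and picks numPartitions - 1 evenly spaced split points.
     */
    static <T, K extends Comparable<K>> List<K> splitPoints(
            List<T> sample, Function<T, K> toMapped, int numPartitions) {
        if (sample.isEmpty()) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        List<K> keys = new ArrayList<>();
        for (T t : sample) {
            keys.add(toMapped.apply(t));
        }
        Collections.sort(keys);
        List<K> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            splits.add(keys.get(i * keys.size() / numPartitions));
        }
        return splits;
    }

    /**
     * Returns the partition for a mapped key: keys below the first split point go to
     * partition 0, and a key equal to a split point goes to the partition after it.
     */
    static <K extends Comparable<K>> int getPartition(K key, List<K> splits) {
        int pos = Collections.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

The point of the sketch is that the split points are built from the mapped type K rather than
from T, so the lookup in getPartition compares like with like. Because every key in partition
i sorts before every key in partition i + 1, concatenating the sorted reducer outputs in
partition order gives a total order without forcing everything through one reducer.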

As for CrunchTotalOrderPartitionerTest, I wrote it as a unit test for CrunchTotalOrderPartitioner
in order to understand how the partitioner works. I feel we should still keep it and update it
to match the changes we are making to the partitioner.

> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>                 Key: CRUNCH-51
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>         Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch, CRUNCH-51.patch,
> The total-order sorting provided by the Sort class (and therefore PCollection#sort) relies
> on using a single reducer in order to provide total-order sorting. This is very inefficient
> for large datasets, and should be replaced with a total order partitioner instead.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).
