crunch-dev mailing list archives

From "Rahul Sharma (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values
Date Fri, 24 Aug 2012 05:06:42 GMT


Rahul Sharma commented on CRUNCH-23:

TotalOrderPartitioner in its current form is not usable with Avro; MAPREDUCE-4574 reports
the same problem. We will need to reimplement TotalOrderPartitioner if we want to use it.
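
For reference, the idea behind a total-order partitioner can be sketched in plain Java.
This is only an illustration of range partitioning over sampled split points, not Hadoop's
TotalOrderPartitioner and not Crunch code; the class and method names (RangePartitionSketch,
chooseSplitPoints, partitionFor, sortViaRangePartitions) are made up for this example, and
plain int keys stand in for real records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: range-partition keys so that every key in partition i
// is <= every key in partition i + 1. Sorting each partition independently
// then yields output that can be concatenated into one sorted sequence.
public class RangePartitionSketch {

    // Pick (numPartitions - 1) split points from a sample of the keys.
    public static int[] chooseSplitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Map a key to its partition by binary search over the sorted split points.
    public static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // Found: the key equals a split point and goes to the partition above it.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the partition index.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Simulate the "reducers": sort each partition, then concatenate in order.
    public static List<Integer> sortViaRangePartitions(List<Integer> values, int[] splits) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int v : values) {
            partitions.get(partitionFor(v, splits)).add(v);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            Collections.sort(p);
            out.addAll(p);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(42, 7, 19, 88, 3, 61, 25, 7, 90, 14);
        int[] splits = chooseSplitPoints(new int[] {42, 19, 3, 61, 90}, 3);
        System.out.println(Arrays.toString(splits));
        System.out.println(sortViaRangePartitions(values, splits));
    }
}
```

A reimplementation for Avro would presumably need the same two pieces: a way to sample keys
and choose split points, and a comparison that follows the Avro schema's ordering rather
than a raw byte comparison.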

On second thought, though, do we want this to work with Avro data? In Avro, the sort order
is imposed by the schema. If the user specifies an order in the schema, Avro uses that order
when comparing records. If none is specified, Avro defaults to ascending order on each of
the fields of the record. It feels like Avro data is sorted out-of the
> PCollection#sort doesn't do a full sort on values
> -------------------------------------------------
>                 Key: CRUNCH-23
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>            Assignee: Rahul Sharma
>         Attachments: 0001-CRUNCH-23-fix-sorting.patch, CRUNCH-23-sorting-issue.patch,
> When a PCollection is sorted (using PCollection#sort), the sorting that is performed
> is only per reducer, and not an absolute sort over all values. This means that the values
> are not in sorted order if they are iterated over on a materialized collection. It also
> means that the sorted files that are output from a sort operation cannot simply be
> concatenated to produce a single sorted file.
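
The behaviour described above can be reproduced with a small stand-alone Java sketch. It
does not use Crunch or Hadoop; it assumes keys are spread over reducers by hash code (modeled
here with Math.floorMod), and the names PerReducerSortDemo, sortPerReducer, concatenate and
isSorted are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: values are assigned to "reducers" by hash code and each
// reducer sorts only its own share, so concatenating the outputs is not sorted.
public class PerReducerSortDemo {

    // Distribute values by hash code, then sort within each reducer only.
    public static List<List<Integer>> sortPerReducer(List<Integer> values, int numReducers) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int v : values) {
            // Assumption: hash-style partitioning; Integer.hashCode(v) is v itself.
            partitions.get(Math.floorMod(Integer.hashCode(v), numReducers)).add(v);
        }
        for (List<Integer> p : partitions) {
            Collections.sort(p);
        }
        return partitions;
    }

    // Concatenate the per-reducer outputs in reducer order.
    public static List<Integer> concatenate(List<List<Integer>> partitions) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            out.addAll(p);
        }
        return out;
    }

    // True if the list is in non-decreasing order.
    public static boolean isSorted(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i - 1) > xs.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(5, 2, 8, 1, 4, 7);
        List<List<Integer>> parts = sortPerReducer(values, 2);
        System.out.println(parts);                        // each part is sorted on its own
        System.out.println(isSorted(concatenate(parts))); // the concatenation is not
    }
}
```

With two reducers the even values land in one part and the odd values in the other, so each
part is sorted but the concatenation interleaves out of order.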


