crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-167) Sort.sortTuples and related methods write out duplicate values
Date Fri, 22 Feb 2013 07:08:12 GMT


Josh Wills updated CRUNCH-167:

    Attachment: CRUNCH-167.patch

This is a fairly substantial change to how sorting is implemented, but it fixes the problem
thoroughly. I believe it also positions us well to use a TotalOrderPartitioner to distribute
the sort across multiple reducers, via a two-phase MR job that uses ParallelDoOptions to
introduce the dependency on the partition file.
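
To make the two-phase idea concrete, here is a minimal plain-Java sketch of range partitioning for a total-order sort. It is not the attached patch and does not use the Hadoop or Crunch APIs: the class TotalOrderSketch and its methods are hypothetical names, the int[] of split points stands in for the partition file, and the "sample" is simply the full input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Crunch or Hadoop code.
public class TotalOrderSketch {

    // Phase 1: pick numPartitions - 1 split points from a sorted sample.
    // In a real job these would be written to the partition file.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Phase 2: route a key to the partition (reducer) whose range contains it.
    // Keys equal to a split point go to the partition on its right.
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Each partition sorts locally; concatenating the partitions in order
    // yields a globally sorted result.
    static List<Integer> sort(int[] keys, int numPartitions) {
        int[] splits = splitPoints(keys, numPartitions);
        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            parts.get(partitionFor(key, splits)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(new int[] {5, 3, 9, 1, 7, 3}, 3));
    }
}
```

The second phase cannot start until the split points exist, which is the dependency on the partition file mentioned above.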
> Sort.sortTuples and related methods write out duplicate values
> --------------------------------------------------------------
>                 Key: CRUNCH-167
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.4.0, 0.5.0
>            Reporter: Josh Wills
>             Fix For: 0.6.0
>         Attachments: CRUNCH-167.patch
> While debugging CRUNCH-166, I noticed that the strategy used by the Sort.sortPairs,
> sortTrips, etc. methods can write out duplicate values when we are sorting/grouping on
> only a subset of the fields. All of the records that have the same values for those
> sub-fields are passed to the same reduce() call, where only one of those records is used
> as the key and the rest have been thrown away.
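
The failure mode described above can be simulated in plain Java. This is a sketch under assumptions, not the Crunch implementation: the class SortDuplicateSketch and its methods are hypothetical, records are two-field String arrays, and grouping compares the first field only. It assumes the buggy reducer emits the group's key once per grouped value, which is one way the duplicates in the issue title could arise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not Crunch code.
public class SortDuplicateSketch {

    // Grouping comparator: looks at the first field only (the "subset").
    static final Comparator<String[]> BY_FIRST_FIELD =
            Comparator.comparing((String[] r) -> r[0]);

    static List<String[]> sampleInput() {
        return Arrays.asList(
                new String[] {"b", "1"}, new String[] {"a", "1"},
                new String[] {"a", "2"}, new String[] {"b", "2"});
    }

    static String format(String[] r) {
        return r[0] + ":" + r[1];
    }

    // Buggy: the whole record is the key, and one key per group survives.
    static List<String> buggyReduce(List<String[]> records) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort(BY_FIRST_FIELD); // List.sort is stable
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String[] key = sorted.get(i); // first record of the group is the key
            int j = i;
            while (j < sorted.size()
                    && BY_FIRST_FIELD.compare(key, sorted.get(j)) == 0) {
                out.add(format(key)); // same key written once per grouped value
                j++;
            }
            i = j;
        }
        return out;
    }

    // Fixed: each full record travels as a value, so none is lost.
    static List<String> fixedReduce(List<String[]> records) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort(BY_FIRST_FIELD);
        List<String> out = new ArrayList<>();
        for (String[] r : sorted) {
            out.add(format(r));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(buggyReduce(sampleInput())); // [a:1, a:1, b:1, b:1]
        System.out.println(fixedReduce(sampleInput())); // [a:1, a:2, b:1, b:2]
    }
}
```

In the buggy variant the records a:2 and b:2 never reach the output, and a:1 and b:1 each appear twice.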
