crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-167) Sort.sortTuples and related methods write out duplicate values
Date Sun, 24 Feb 2013 09:22:12 GMT


Gabriel Reid commented on CRUNCH-167:

I've just gone through this in more detail, and it looks good to me -- much cleaner and more
logical in comparison to the old implementation. CRUNCH-51 will still be a fair bit of work
regardless, but these changes definitely won't hurt.

BTW, I think that Sort.createPairSchema can be removed; I can't see it being used anywhere.

> Sort.sortTuples and related methods write out duplicate values
> --------------------------------------------------------------
>                 Key: CRUNCH-167
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.4.0, 0.5.0
>            Reporter: Josh Wills
>             Fix For: 0.6.0
>         Attachments: CRUNCH-167.patch
> I noticed while debugging CRUNCH-166 that the strategy used by the Sort.sortPairs,
> sortTrips, etc. methods can write out duplicate values in cases where we are only
> sorting/grouping on a subset of the fields. All of the records that have the same
> value for those sub-fields are passed to the same reduce() call, where only a single
> one of those records is used as the key, and the rest of the records have been
> thrown away.
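
To make the failure mode described above concrete, here is a minimal stand-alone sketch in plain Java. It does not use Crunch or Hadoop: the class SubsetSortDemo and all of its methods are hypothetical names invented for illustration, and it assumes the problematic reducer emits its key once per grouped value. The second method is only one possible alternative and is not necessarily what CRUNCH-167.patch does.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the shuffle/reduce behaviour; not Crunch code.
class SubsetSortDemo {

    // Builds a two-field record. We sort/group on the first field only.
    static Map.Entry<String, Integer> rec(String first, int second) {
        return new SimpleImmutableEntry<>(first, second);
    }

    // Simulates the shuffle: records that share the same first field
    // are delivered to the same reduce() call, in sorted key order.
    static Map<String, List<Map.Entry<String, Integer>>> shuffle(
            List<Map.Entry<String, Integer>> input) {
        Map<String, List<Map.Entry<String, Integer>>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> r : input) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    // Assumed problematic strategy: the whole record is the key, only one
    // record per group survives as that key, and it is emitted once per value.
    static List<Map.Entry<String, Integer>> emitKeyPerValue(
            Map<String, List<Map.Entry<String, Integer>>> groups) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> group : groups.values()) {
            Map.Entry<String, Integer> representative = group.get(0);
            for (int i = 0; i < group.size(); i++) {
                out.add(representative);
            }
        }
        return out;
    }

    // One possible alternative: carry the full record as the value
    // and emit every value, so no record is lost or repeated.
    static List<Map.Entry<String, Integer>> emitValues(
            Map<String, List<Map.Entry<String, Integer>>> groups) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> group : groups.values()) {
            out.addAll(group);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input =
                Arrays.asList(rec("b", 1), rec("a", 2), rec("a", 3));
        Map<String, List<Map.Entry<String, Integer>>> groups = shuffle(input);
        System.out.println(emitKeyPerValue(groups)); // [a=2, a=2, b=1]
        System.out.println(emitValues(groups));      // [a=2, a=3, b=1]
    }
}
```

In the first output the record (a, 3) is gone and (a, 2) appears twice, which matches the duplicate values described in the report.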
