crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Brandon Vargo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-528) Pair: Integer overflow during comparison can cause inconsistent sort.
Date Wed, 27 May 2015 21:59:53 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561823#comment-14561823
] 

Brandon Vargo commented on CRUNCH-528:
--------------------------------------

Ahh, makes sense. I thought you were only going to replace the == with equals in the first
line of cmp, my apologies.

I could see a case where a join would have interleaving keys in the case of a hash collision,
due to the arbitrary ordering, so the cross-product would not be complete. I went looking
for a way to serialize the keys using the underlying type (e.g. Writable) in order to break
the tie in a more deterministic way, but I didn't see an easy way of doing so in a way that
makes sense. At least now (k1, v1) and (k2, v2) won't get joined together if hash(k1) == hash(k2).

Short of there being another tie-breaker that I am missing, the patch looks good to me. Thanks!

> Pair: Integer overflow during comparison can cause inconsistent sort.
> ---------------------------------------------------------------------
>
>                 Key: CRUNCH-528
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-528
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Vargo
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: 0001-Pair-Fix-comparison-for-large-hash-codes.patch, CRUNCH-528.2.patch
>
>
> Pair uses the hash code of each value for comparison if the values are not themselves
comparable. If the hash code values are too large, then the values will wrap when doing subtraction.
This results in a comparison function that is not transitive.
> Among other things, this makes Joins using the in-memory pipeline not work, since the
in-memory shuffler uses a TreeMap if the key type is Comparable. Since the key in a join is
a Pair of the original key and a join tag, the key is always comparable. With a non-transitive
comparison function, it is possible for the two join tags of the original key to sort differently,
resulting in the two join tags not being adjacent for the original key. This results either
in either the cross product erroneously producing no values in the case of an inner join,
since the two join tags are not adjacent, or null values appearing when they should not in
the case of an outer join.
> As a workaround, ensure that the key used in a Join is comparable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message