Date: Wed, 27 May 2015 18:58:18 +0000 (UTC)
From: "Brandon Vargo (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Commented] (CRUNCH-528) Pair: Integer overflow during comparison can cause inconsistent sort.

    [ https://issues.apache.org/jira/browse/CRUNCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561507#comment-14561507 ]

Brandon Vargo commented on CRUNCH-528:
--------------------------------------

Identity instead of equality in cmp is a small speed optimization but shouldn't affect correctness.
If two objects are equal in the value sense but are not the identical object, then the comparison falls through to either compareTo for Comparable objects, which should return 0 for a well-formed comparison, or hashCode for non-Comparable objects, which should also be the same for objects that are equal by value. Not calling equals is consistent with how TreeMap works internally, for example. The common case is for two objects not to be equal, so calling compareTo and checking for a result of 0 is faster than calling equals and then having to call compareTo anyway when equals returns false.

I don't have a better suggestion for how to handle joins with distinct objects that have a hash collision. That's still broken unless the key implements Comparable, since there is no secondary sort or equality check after the hashCode comparison, and TreeMap requires that compareTo be consistent with equals. This isn't an issue when running on MapReduce, because the shuffler requires that the key be a WritableComparable and optionally uses an optimized WritableComparator, but it is an issue for in-memory mode, since that is just a TreeMap and Pairs are allowed to contain any objects.

> Pair: Integer overflow during comparison can cause inconsistent sort.
> ---------------------------------------------------------------------
>
>                 Key: CRUNCH-528
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-528
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Vargo
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: 0001-Pair-Fix-comparison-for-large-hash-codes.patch
>
>
> Pair uses the hash code of each value for comparison if the values are not themselves comparable. If the hash codes are far enough apart, the subtraction used for the comparison overflows and wraps around. This results in a comparison function that is not transitive.
> Among other things, this makes Joins using the in-memory pipeline not work, since the in-memory shuffler uses a TreeMap if the key type is Comparable.
> Since the key in a join is a Pair of the original key and a join tag, the key is always comparable. With a non-transitive comparison function, it is possible for the two join tags of the original key to sort differently, so that the two join tags are not adjacent for the original key. This results in either the cross product erroneously producing no values in the case of an inner join, since the two join tags are not adjacent, or null values appearing when they should not in the case of an outer join.
> As a workaround, ensure that the key used in a Join is comparable.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
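
The non-transitivity described in the issue can be reproduced with three plain int values standing in for hash codes. This is a minimal sketch, not the attached patch: the class and method names are hypothetical, and Integer.compare is shown only as one standard overflow-safe alternative to subtraction.

```java
public class HashCompareSketch {

    // Subtraction-based comparison, the pattern the issue reports as broken.
    // The result wraps around when the true difference does not fit in an int.
    static int compareBySubtraction(int h1, int h2) {
        return h1 - h2;
    }

    // Overflow-safe comparison from the standard library (Java 7 and later).
    static int compareSafely(int h1, int h2) {
        return Integer.compare(h1, h2);
    }

    public static void main(String[] args) {
        // Hypothetical hash codes chosen to trigger the overflow.
        int a = Integer.MIN_VALUE;
        int b = 0;
        int c = Integer.MAX_VALUE;

        // With subtraction: a < b and b < c, yet a compares as greater than c.
        System.out.println("a < b: " + (compareBySubtraction(a, b) < 0));
        System.out.println("b < c: " + (compareBySubtraction(b, c) < 0));
        System.out.println("a < c: " + (compareBySubtraction(a, c) < 0));

        // With Integer.compare the ordering is transitive again.
        System.out.println("a < c (safe): " + (compareSafely(a, c) < 0));
    }
}
```

With the subtraction version, a sorted structure such as a TreeMap can place the same logical key in different positions depending on which elements it is compared against, which matches the symptom of join tags not ending up adjacent.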