Date: Wed, 2 Apr 2014 02:05:16 +0000 (UTC)
From: "Josh Wills (JIRA)"
Reply-To: dev@crunch.apache.org
To: crunch-dev@incubator.apache.org
Subject: [jira] [Updated] (CRUNCH-373) Problem while Performing MapSide join with ImmutableBytesWritable/Text

     [ https://issues.apache.org/jira/browse/CRUNCH-373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-373:
------------------------------
    Attachment: CRUNCH-373b.patch

I think what you really want is this, which fixes the Text test and provides a (fairly dumb) cut at a fix to SeqFileReaderFactory. [~gabriel.reid], is there a more elegant way to do this w/detached values?

> Problem while Performing MapSide join with ImmutableBytesWritable/Text
> ----------------------------------------------------------------------
>
>                 Key: CRUNCH-373
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-373
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.9.0, 0.8.2
>            Reporter: Rachit Soni
>            Assignee: Josh Wills
>         Attachments: CRUNCH-371_test.patch, CRUNCH-373b.patch, CrunchHBaseIT.java
>
>
> I have been having issues performing a map-side join with ImmutableBytesWritable as the join key: the map created in the initialize method of MapSideJoinDoFn [1] always ends up with only one value. With the same data set, a reduce-side join works perfectly and gives the correct result.
> Additionally, I am making sure the map can be loaded in memory.
> The results of the two joins differ. When I dug into the code where the map-side join is performed in MapSideJoinDoFn [1], I found that when the right side is read into memory and converted to a map [2], all the keys get overwritten with the last key put into the map. It seems to reference the same memory location every time and does not clone the key properly. This only happens with ImmutableBytesWritable/Text; any other key type works fine.
>
> It looks like SeqFileReaderFactory (which I believe implements the PTable under the hood for writables) does indeed reuse keys/values [3] in much the same way reducers do. So I think this code [4] needs to clone the keys/values rather than just store them in a map.
>
> Also, I am attaching a test that I wrote to reproduce the issue.
>
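The key-reuse behaviour described in the report can be reproduced outside Crunch. What follows is a minimal sketch, not the Crunch code: MutableKey, reusingReader, load and keyContents are hypothetical names standing in for a mutable Writable such as Text, a reader that recycles its key object, and the step that builds the in-memory join map. In Crunch itself the copy would presumably go through the detached-value support mentioned in the comment above; that wiring is an assumption and is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class ReusedKeyDemo {

    // Hypothetical stand-in for a mutable Writable such as Text.
    static final class MutableKey {
        private byte[] bytes = new byte[0];

        void set(String s) {
            bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        // Deep copy, so the new key no longer shares state with the reader.
        MutableKey copy() {
            MutableKey k = new MutableKey();
            k.bytes = bytes.clone();
            return k;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof MutableKey && Arrays.equals(bytes, ((MutableKey) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Reader that refills and returns one shared key object for every record.
    static Iterator<MutableKey> reusingReader(List<String> records) {
        final MutableKey shared = new MutableKey();
        final Iterator<String> source = records.iterator();
        return new Iterator<MutableKey>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public MutableKey next() {
                shared.set(source.next());
                return shared;
            }
        };
    }

    // Builds the in-memory map, optionally copying each key before storing it.
    static Map<MutableKey, Integer> load(List<String> records, boolean copyKeys) {
        Map<MutableKey, Integer> map = new HashMap<>();
        Iterator<MutableKey> reader = reusingReader(records);
        int position = 0;
        while (reader.hasNext()) {
            MutableKey key = reader.next();
            map.put(copyKeys ? key.copy() : key, position++);
        }
        return map;
    }

    // Distinct key contents as the map sees them after loading.
    static SortedSet<String> keyContents(Map<MutableKey, Integer> map) {
        SortedSet<String> contents = new TreeSet<>();
        for (MutableKey key : map.keySet()) {
            contents.add(key.toString());
        }
        return contents;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c");
        System.out.println(keyContents(load(records, false))); // [c]
        System.out.println(keyContents(load(records, true)));  // [a, b, c]
    }
}
```

Without the copy, every stored key is the same object and reads back as the last record, so lookups for earlier keys fail. Copying each key before it goes into the map restores the expected contents.
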
> [1] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/lib/join/MapsideJoinStrategy.java#L131
>
> [2] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/lib/join/MapsideJoinStrategy.java#L153
> [3] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/seq/SeqFileReaderFactory.java#L88
> [4] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/lib/join/MapsideJoinStrategy.java#L153



--
This message was sent by Atlassian JIRA
(v6.2#6252)