Date: Thu, 10 Oct 2013 20:36:46 +0000 (UTC)
From: "Josh Wills (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code

    [ https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791959#comment-13791959 ]

Josh Wills commented on CRUNCH-278:
-----------------------------------

So I had two contexts in mind: in-memory for unit testing, and also running these DoFns inside an MR context, where they're not strictly part of the CrunchMapper/CrunchReducer flow but are instead embedded in the initialize() step that reads records in from the distributed cache and then performs filters/transforms on them.

For example, think of being able to do mapside joins against (say) an HBase table: you could construct the in-memory PTable of key-value pairs by reading the table into the client and then doing some processing on those values inside the map initialization, vs. having to run an MR job that processes that data into a file as a pre-processing step before running the job.

I'm not sure if that's the sort of thing folks would be interested in doing, but it seemed cool to me.

> Improvements to MapsideJoin code
> --------------------------------
>
>                 Key: CRUNCH-278
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-278
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction between the two impls. The patch creates a new interface, ReadableSourceBundle, that encapsulates the MR-specific and in-memory-specific logic for doing mapside joins. That removes the special-case code in MapsideJoinStrategy and should make other implementations that use our mapside-join patterns much easier to test.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
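
The init-time pattern described in the comment above (read the small side once, filter/transform it, then join streamed records against the in-memory result) can be sketched in plain Java. This is only an illustration under stated assumptions: InitTimeJoiner and all of its members are hypothetical names, not part of Crunch or of the CRUNCH-278 patch, and the Supplier stands in for whatever would actually read the distributed cache or the HBase table.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/**
 * Hypothetical stand-in, NOT Crunch's API: joins streamed records against a
 * small side table that is read, filtered, and transformed at initialization.
 */
final class InitTimeJoiner<K, R, V> {
  private final Supplier<Iterable<R>> sideSource; // assumed reader (cache file, table scan, ...)
  private final Predicate<R> filter;              // filter applied while loading
  private final Function<R, K> keyFn;             // extracts the join key from a raw record
  private final Function<R, V> valueFn;           // transform applied while loading
  private Map<K, List<V>> lookup;                 // built once by initialize()

  InitTimeJoiner(Supplier<Iterable<R>> sideSource, Predicate<R> filter,
                 Function<R, K> keyFn, Function<R, V> valueFn) {
    this.sideSource = sideSource;
    this.filter = filter;
    this.keyFn = keyFn;
    this.valueFn = valueFn;
  }

  /** Loads the small side into memory, filtering and transforming as it goes. */
  void initialize() {
    Map<K, List<V>> loaded = new HashMap<>();
    for (R record : sideSource.get()) {
      if (filter.test(record)) {
        loaded.computeIfAbsent(keyFn.apply(record), k -> new ArrayList<>())
              .add(valueFn.apply(record));
      }
    }
    lookup = loaded;
  }

  /** Inner join of one streamed (key, left) record against the in-memory side. */
  <L> List<Map.Entry<L, V>> join(K key, L left) {
    if (lookup == null) {
      throw new IllegalStateException("initialize() must be called first");
    }
    List<Map.Entry<L, V>> out = new ArrayList<>();
    for (V right : lookup.getOrDefault(key, Collections.emptyList())) {
      out.add(new AbstractMap.SimpleImmutableEntry<>(left, right));
    }
    return out;
  }
}
```

In a unit test the supplier can simply return an in-memory list, while in an MR task it could wrap a reader over a cached file. The join logic is identical in both cases, which is the kind of special-casing the issue description wants to remove.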