crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Thu, 10 Oct 2013 20:36:46 GMT


Josh Wills commented on CRUNCH-278:

I had two contexts in mind: in-memory execution for unit testing, and running these DoFns
inside an MR job where they are not strictly part of the CrunchMapper/CrunchReducer flow.
In the second case they would be embedded in the initialize() step, which reads records in
from the distributed cache and then applies filters/transforms to them.

For example, think of being able to do mapside joins against (say) an HBase table. You could
construct the in-memory PTable of key-value pairs by reading the table into the client and
then processing those values during map initialization, instead of having to run an MR job
that writes the processed data to a file as a pre-processing step before the real job. I'm
not sure if that's the sort of thing folks would be interested in doing, but it seemed cool
to me.
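
To make the idea concrete, here is a rough sketch in plain Java of the pattern described
above: a side input is read once during initialization, filtered and transformed on the fly,
and then held in memory for lookups while the main records stream through. None of these
names are Crunch APIs. SideInput, InMemoryJoiner and demo() are hypothetical stand-ins, and
the hard-coded list stands in for whatever would really be read from the distributed cache
or an HBase scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch only: these types are NOT part of Crunch.
public class MapsideJoinSketch {

  /** Stand-in for a small readable side input (cache file, table scan, ...). */
  interface SideInput<T> {
    Iterable<T> read();
  }

  /** Builds a lookup table during initialize(), then joins against it. */
  static class InMemoryJoiner<K, R, V> {
    private final SideInput<R> source;
    private final Predicate<R> filter;
    private final Function<R, K> keyFn;
    private final Function<R, V> valueFn;
    private Map<K, V> table;

    InMemoryJoiner(SideInput<R> source, Predicate<R> filter,
                   Function<R, K> keyFn, Function<R, V> valueFn) {
      this.source = source;
      this.filter = filter;
      this.keyFn = keyFn;
      this.valueFn = valueFn;
    }

    /** Runs once per task, before any main-input record is processed. */
    void initialize() {
      table = new HashMap<>();
      for (R raw : source.read()) {
        // Filter and transform while loading, so no pre-processing job is needed.
        if (filter.test(raw)) {
          table.put(keyFn.apply(raw), valueFn.apply(raw));
        }
      }
    }

    /** Returns the joined value, or null when the key has no match. */
    V lookup(K key) {
      return table.get(key);
    }
  }

  /** Joins a few "userId:action" events against an in-memory user table. */
  static List<String> demo() {
    // Assumed record layout: "id,name,status".
    SideInput<String> users =
        () -> List.of("1,alice,active", "2,bob,inactive", "3,carol,active");

    InMemoryJoiner<String, String, String> joiner = new InMemoryJoiner<>(
        users,
        line -> line.split(",")[2].equals("active"), // keep active users only
        line -> line.split(",")[0],                  // join key: user id
        line -> line.split(",")[1]);                 // joined value: user name
    joiner.initialize();

    List<String> out = new ArrayList<>();
    for (String event : List.of("1:login", "2:login", "3:logout")) {
      String[] parts = event.split(":");
      String name = joiner.lookup(parts[0]);
      if (name != null) { // inner join: drop events with no matching user
        out.add(name + " " + parts[1]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [alice login, carol logout]
  }
}
```

The same InMemoryJoiner could be driven by a plain in-memory list in a unit test or by a
real reader inside a map task, which is roughly the distinction between the two contexts
mentioned above.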

> Improvements to MapsideJoin code
> --------------------------------
>                 Key: CRUNCH-278
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
> and MR-based Pipeline instances has always bugged me. I set out to eliminate the
> distinction between the two impls by creating a new interface, ReadableSourceBundle<T>,
> that encapsulates the MR-specific and in-memory-specific logic for doing mapside joins.
> This removes the special-case code in MapsideJoinStrategy and hopefully makes other
> implementations that use our mapside-join patterns much easier to test.
