crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Fri, 11 Oct 2013 15:16:42 GMT


Gabriel Reid commented on CRUNCH-278:

Ok, I get it. 

Making it possible in the API to specify the boundary between the MR job and in-memory
processing is what I was going for with the MaterializedPCollection constructor that I
posted before (copied here below).

PTable<ImmutableBytesWritable,Result> htableContents =;
PTable<A,B> convertedHTable = new MaterializedPCollection(htableContents).parallelDo(new
PTable<A,Pair<C,B>> joined = new MapsideJoinStrategy().join(anotherPTable, convertedHTable);

My idea was that everything coming out of the MaterializedPCollection would be done in memory,
so you could have something that was being calculated upstream in the pipeline be read into
memory starting from the point where you instantiated a MaterializedPCollection.
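
To make the in-memory half of that boundary concrete, here is a minimal sketch in plain Java.
It is not Crunch code: MapsideJoinSketch, materialize and join are hypothetical names made up
for illustration, and the inputs are ordinary java.util collections rather than PTables. It
only shows the general shape of a mapside join, assuming the small side fits in memory: read
the small side fully into a hash index, then stream the other side past it.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Crunch APIs.
final class MapsideJoinSketch {

    // "Materialize" the small side: read it fully into an in-memory index.
    // A key may appear more than once, so each key maps to a list of values.
    static <K, V> Map<K, List<V>> materialize(Iterable<Map.Entry<K, V>> smallSide) {
        Map<K, List<V>> index = new HashMap<>();
        for (Map.Entry<K, V> e : smallSide) {
            index.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return index;
    }

    // Stream the large side and probe the index for each row (inner join).
    // Rows whose key is missing from the index are dropped.
    static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> join(
            Iterable<Map.Entry<K, U>> largeSide, Map<K, List<V>> index) {
        List<Map.Entry<K, Map.Entry<U, V>>> out = new ArrayList<>();
        for (Map.Entry<K, U> row : largeSide) {
            List<V> matches = index.get(row.getKey());
            if (matches == null) {
                continue;
            }
            for (V v : matches) {
                Map.Entry<U, V> pair = new SimpleImmutableEntry<>(row.getValue(), v);
                out.add(new SimpleImmutableEntry<>(row.getKey(), pair));
            }
        }
        return out;
    }
}
```

In this sketch the call to materialize plays the role of the boundary: whatever produced the
small side upstream is finished by then, and everything after that call runs against memory.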

In any case, yeah, I think it would be pretty important to be able to clearly specify which
things you want done in MR and which you want done in memory.

> Improvements to MapsideJoin code
> --------------------------------
>                 Key: CRUNCH-278
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
> and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction
> between the two impls by creating a new interface, ReadableSourceBundle<T>, that encapsulates
> the MR-specific and in-memory-specific logic for doing mapside joins. This should remove the
> special-case code in MapsideJoinStrategy and hopefully make other implementations that use our
> mapside-join patterns much easier to test.

This message was sent by Atlassian JIRA