crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Fri, 11 Oct 2013 06:56:42 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792410#comment-13792410 ]

Gabriel Reid commented on CRUNCH-278:
-------------------------------------

So just to make sure I'm on the same page here: I'm thinking that in the MapsideJoin case,
the way it would work today is like this:

{code}
PTable<ImmutableBytesWritable,Result> htableContents = pipeline.read(FromHBase.table());
PTable<A,B> convertedHTable = htableContents.parallelDo(new DoSomethingFn());
PTable<A,Pair<C,B>> joined = new MapsideJoinStrategy().join(anotherPTable, convertedHTable);
{code}

and this would have the drawback that creating convertedHTable would require a whole MR
job to be kicked off, whereas what we want is for the conversion to convertedHTable to
happen in the initialize method of the MapsideJoin, to avoid kicking off the MR job.

Wouldn't this be possible with something like a "materialized" PCollection, which could then
operate in the same way as the in-memory pcollections? So then we would end with something
like this:

{code}
PTable<ImmutableBytesWritable,Result> htableContents = pipeline.read(FromHBase.table());
PTable<A,B> convertedHTable = new MaterializedPCollection(htableContents).parallelDo(new DoSomethingFn());
PTable<A,Pair<C,B>> joined = new MapsideJoinStrategy().join(anotherPTable, convertedHTable);
{code}
Then when materialize() was called on a MaterializedPCollection, we would just materialize
the root PCollection, load everything into memory, and pass it through the rest of its pipeline
in memory, so that the processing of the DoSomethingFn would occur in memory in the mapper.
I guess that this would also imply that calling Pipeline#write on a MaterializedPCollection
would throw an exception, unless there was some way of getting around that.
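
To make the idea above a bit more concrete, here is a rough, self-contained sketch in plain Java. It does not use the Crunch API at all: MaterializedSketch, of() and write() are hypothetical names made up for illustration, and a Supplier returning a List stands in for reading the root PCollection. It only shows the deferred behaviour described above, under those assumptions: nothing is read until materialize() is called, the function is then applied in memory, and writing is rejected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for the "materialized" PCollection idea.
// NOT part of Crunch; a Supplier<List<T>> plays the role of the root read.
final class MaterializedSketch<T> {
    private final Supplier<List<T>> source;

    private MaterializedSketch(Supplier<List<T>> source) {
        this.source = source;
    }

    // Wraps the root read; the supplier is not invoked here.
    static <T> MaterializedSketch<T> of(Supplier<List<T>> rootReader) {
        return new MaterializedSketch<T>(rootReader);
    }

    // Records the function; it only runs when materialize() is called.
    <R> MaterializedSketch<R> parallelDo(Function<? super T, ? extends R> fn) {
        return new MaterializedSketch<R>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : source.get()) {
                out.add(fn.apply(item));
            }
            return out;
        });
    }

    // Reads the root and applies every recorded function in memory.
    List<T> materialize() {
        return source.get();
    }

    // Assumption from the comment above: writing is simply not supported.
    void write() {
        throw new UnsupportedOperationException("cannot write a materialized collection");
    }

    public static void main(String[] args) {
        MaterializedSketch<String> root = MaterializedSketch.of(() -> Arrays.asList("a", "bb"));
        MaterializedSketch<Integer> lengths = root.parallelDo(String::length);
        System.out.println(lengths.materialize());
    }
}
```

In the mapside join case, the call to materialize() would presumably happen in the join's initialize step, so the conversion would run inside the mapper rather than as a separate job.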

Is that kind of what you had in mind? Or am I talking about something totally different?


> Improvements to MapsideJoin code
> --------------------------------
>
>                 Key: CRUNCH-278
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-278
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
> and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction
> between the two impls by creating a new interface, ReadableSourceBundle<T>, that encapsulates
> the MR and in-memory specific logic for doing mapside joins in order to remove the special-case
> code in MapsideJoinStrategy and hopefully make other implementations that use our mapside-join
> patterns much easier to test.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
