crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Thu, 10 Oct 2013 19:11:42 GMT


Gabriel Reid commented on CRUNCH-278:

I like the idea behind it, and it looks good to me. Definitely good to get rid of what was
basically two difference implementations of the MapsideJoinStrategy, and I have the feeling
that this could be useful for a number of things outside of the context of mapside joins as
well, even if I don't have a totally clear idea of what that would be exactly.

One thing I would mention is to add some docs on the ReadableSourceBundle, particularly on
the configure method to specify where it gets called in the lifecycle. I understand how it
works now, but I'm pretty sure I'll forget that before the next time I need to know it :-)

The idea of generalizing ReadableSourceBundle to allow including DoFns in it sounds interesting.
If I think about it, it's a kind of optimization so that you could do something like filter/transform
values from the in-memory side of the PCollection within the context of the MapsideJoin instead
of doing it within the context of a MR job leading up to the use of the MapsideJoin. Am I
seeing that correctly? Or do you see new functionality that would be added that isn't possible
at all right now? In case it's just an optimization, maybe better hold off on that for this
ticket, as it seems like there could be a fair bit of extra complexity that could creep in

> Improvements to MapsideJoin code
> --------------------------------
>                 Key: CRUNCH-278
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction
between the two impls by creating a new interface, ReadableSourceBundle<T>, that encapsulates
the MR and in-memory specific logic for doing mapside joins in order to remove the special-case
code in MapsideJoinStrategy and hopefully make other implementations that use our mapside-join
patterns much easier to test.

This message was sent by Atlassian JIRA

View raw message