crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Tue, 15 Oct 2013 06:49:45 GMT


Gabriel Reid commented on CRUNCH-278:

Yeah, I think that could work for the more general case. Calling toBundle on a PCollection
would then back up to the last call to materialize() and execute everything from that point
on in memory, and the default case would be to do nothing in memory.

The only issue I see with this is that it makes the materialize() call into something that
visibly mutates the state of a PCollection. Materializing a PCollection mutates state under
the covers anyhow, but adding these semantics to materialize very slightly breaks the idea
of immutability around PCollection. That's probably not a big enough reason to not take this
approach though.
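
To make the semantics above concrete, here is a toy sketch in plain Java. It is not
the Crunch API: ToyCollection, its materialize() and toBundle() methods, and the
inMemorySteps() counter are all hypothetical names invented for illustration, and the
"nothing in memory by default" rule is just one reading of the proposal. The toy always
computes inside the JVM; the counter only reports how many steps toBundle() would treat
as in-memory work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Toy model only: ToyCollection, materialize() and toBundle() are hypothetical
// stand-ins for the proposal under discussion, not the Crunch PCollection API.
final class ToyCollection<T> {
    private final ToyCollection<?> parent;   // null for a source
    private final Supplier<List<T>> derive;  // recomputes the elements from upstream
    private List<T> checkpoint;              // non-null once materialize() was called

    private ToyCollection(ToyCollection<?> parent, Supplier<List<T>> derive) {
        this.parent = parent;
        this.derive = derive;
    }

    // Wraps existing data as a source collection (not materialized by default).
    static <T> ToyCollection<T> of(List<T> data) {
        return new ToyCollection<>(null, () -> data);
    }

    // Lazily describes a derived collection; nothing runs until
    // toBundle() or materialize() is called on it or on a descendant.
    <R> ToyCollection<R> map(Function<? super T, ? extends R> fn) {
        return new ToyCollection<>(this, () -> {
            List<R> out = new ArrayList<>();
            for (T item : this.toBundle()) {
                out.add(fn.apply(item));
            }
            return out;
        });
    }

    // Records a checkpoint. This is the visible state change discussed above:
    // descendants answer inMemorySteps() differently before and after this call.
    ToyCollection<T> materialize() {
        checkpoint = derive.get();
        return this;
    }

    // Returns the elements, reusing the nearest checkpoint when one exists.
    List<T> toBundle() {
        return checkpoint != null ? checkpoint : derive.get();
    }

    // Counts the steps between this collection and the last materialize() call.
    // With no materialize() upstream, the assumed default is zero in-memory steps.
    int inMemorySteps() {
        int steps = 0;
        for (ToyCollection<?> c = this; c != null; c = c.parent) {
            if (c.checkpoint != null) {
                return steps;
            }
            steps++;
        }
        return 0;
    }
}
```

In this toy, calling materialize() on an intermediate collection changes what
inMemorySteps() reports for collections derived from it, even though none of those
derived objects were touched directly. That is the small dent in immutability
mentioned above.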

> Improvements to MapsideJoin code
> --------------------------------
>                 Key: CRUNCH-278
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
> and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction
> between the two impls by creating a new interface, ReadableSourceBundle<T>, that encapsulates
> the MR and in-memory specific logic for doing mapside joins in order to remove the special-case
> code in MapsideJoinStrategy and hopefully make other implementations that use our mapside-join
> patterns much easier to test.

This message was sent by Atlassian JIRA