crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Mon, 14 Oct 2013 02:07:41 GMT


Micah Whitacre commented on CRUNCH-278:

The MaterializedPCollection seems nice because it meshes well with metaphors already in Crunch,
but it seems dangerous for the ill-informed consumer. Specifically, since the PCollection can
be passed around, it might be handed to functionality that expects to be able to persist the
collection and then runs into the issue.

The bundle approach therefore seems nice because it makes that distinction clear. To confirm,
though, if we went with this approach...

PTable<K, V> cnt = stuff.count();
ReadableSourceBundle<Pair<K, V>> bundle = cnt.toBundle();

Consumers could still do whatever processing/persisting they wanted with the "cnt" value, correct?
So calling cnt.toBundle() would have no effect on it? Also, would GBKs (groupByKey operations)
be allowed prior to creating the bundle? In HBase, a single row can be broken up across several
PTable entries due to the configured batch size, so it could require that grouping first.
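
To illustrate the grouping concern, here is a rough sketch in plain Java with no Crunch
dependency. The Fragment class and groupByRowKey method are hypothetical names made up for
this example and are not part of Crunch or the attached patch. The sketch only shows why
fragments of the same row would need to be collected under one key before being loaded
into an in-memory lookup for a mapside join.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowFragmentGrouping {

    /** Hypothetical stand-in for one (row key, cell) entry of a PTable. */
    static final class Fragment {
        final String rowKey;
        final String cell;

        Fragment(String rowKey, String cell) {
            this.rowKey = rowKey;
            this.cell = cell;
        }
    }

    /**
     * Collects all cells that share a row key, preserving first-seen key order.
     * This mimics, in memory, the effect a group-by-key step would have on
     * a row that was split into several entries.
     */
    static Map<String, List<String>> groupByRowKey(List<Fragment> fragments) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Fragment f : fragments) {
            grouped.computeIfAbsent(f.rowKey, k -> new ArrayList<>()).add(f.cell);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        // "row1" arrives as two separate entries, as it might after batching.
        fragments.add(new Fragment("row1", "a"));
        fragments.add(new Fragment("row2", "b"));
        fragments.add(new Fragment("row1", "c"));

        Map<String, List<String>> grouped = groupByRowKey(fragments);
        System.out.println(grouped);
    }
}
```

Without a step like this, a lookup built directly from the ungrouped entries would see
"row1" twice and would have to either overwrite one fragment or treat them as separate rows.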

> Improvements to MapsideJoin code
> --------------------------------
>                 Key: CRUNCH-278
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
>
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
> and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction
> between the two impls by creating a new interface, ReadableSourceBundle<T>, that encapsulates
> the MR and in-memory specific logic for doing mapside joins in order to remove the special-case
> code in MapsideJoinStrategy and hopefully make other implementations that use our mapside-join
> patterns much easier to test.

This message was sent by Atlassian JIRA
