crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Tue, 22 Oct 2013 07:27:45 GMT


Gabriel Reid commented on CRUNCH-278:

Looks good to me in general, and it seems to work as it should.

About the name, yeah, ReadableData isn't super clear, but it's of course difficult to think
of something better. DataSourceIterable? MaterializableDataStream? MaterializableIterable?

A few other things I came up with while going over it:
* if you don't materialize a PCollection in a whole stream of DoFns between a Source and where
a ReadableData is being used (e.g. MapsideJoinStrategy), then everything gets run in-memory.
I know that this is what was discussed/agreed upon, but it's also a pretty big change with
the previous (default) behaviour which worries me a little bit. Maybe we should at least add
some javadoc in MapsideJoin to make this clear.
* I think that DelegatingReadableData and DoFnIterator should probably be moved to the
package. They're only used from that package right, and they're currently in o.a.c.util which
I think is intended to include user-facing utils.
* in the default case (ReadableData is read directly from the source and DoFns are run there)
it doesn't show up in the job plan dotfile at all. Definitely not a reason to block this ticket,
but it is something that should be added in a follow-up ticket.
* the file headers seem a little wonky. I saw a couple of files with Cloudera headers, which
I'm guessing is a bigger issue, as well as a few with some kind of messed up pasted-in file
headers (i.e. I know that's being super nit-picky (and I totally
feel like I'm being _that_ guy for pointing it out), but I guess that's stuff that's probably
best to keep in order.

> Improvements to MapsideJoin code
> --------------------------------
>                 Key: CRUNCH-278
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278b.patch, CRUNCH-278.patch
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction
between the two impls by creating a new interface, ReadableSourceBundle<T>, that encapsulates
the MR and in-memory specific logic for doing mapside joins in order to remove the special-case
code in MapsideJoinStrategy and hopefully make other implementations that use our mapside-join
patterns much easier to test.

This message was sent by Atlassian JIRA

View raw message