crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-278) Improvements to MapsideJoin code
Date Fri, 11 Oct 2013 01:56:42 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792250#comment-13792250 ]

Micah Whitacre commented on CRUNCH-278:
---------------------------------------

{quote}
For example, think of being able to do mapside joins against (say) an HBase table, where you
could construct the PTable of key-value pairs that is loaded in memory by reading the table
into the client and then doing some processing on those values inside of the map initialization
vs. having to run a MR job to process that data into a file as a pre-processing step to running
the job. I'm not sure if that's the sort of thing folks would be interested in doing, but
it seemed cool to me.
{quote}

Did someone give you a copy of our code? :)  We don't do the map-side portion, but we have a
number of use cases where the data should be small enough that we could do the join map-side.
Additionally, our APIs are written in terms of PTable<Avro,Avro>, so we usually transform the
PTable<ImmutableBytesWritable, Result> from HBase into a PTable<Avro,Avro> using
simple MapFns before we would want to do the joins.
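
For readers who have not seen the pattern, here is a minimal plain-Java sketch of the idea under discussion: convert the small table's raw values with a simple mapping function, hold the result in memory, and probe it while streaming the larger side. It does not use the Crunch or HBase APIs. The class and method names (MapsideJoinSketch, loadSmallSide, join) are hypothetical, byte[] stands in for an HBase Result, String stands in for an Avro record, and the small side is assumed to have unique keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not part of Crunch.
class MapsideJoinSketch {

    /** One joined row: the large-side value and the matching small-side value. */
    static final class Joined<L, R> {
        final L left;
        final R right;

        Joined(L left, R right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Loads the small side into memory once, converting each raw value on the way in. */
    static <K, RAW, V> Map<K, V> loadSmallSide(Map<K, RAW> rawRows, Function<RAW, V> convert) {
        Map<K, V> inMemory = new HashMap<>();
        for (Map.Entry<K, RAW> row : rawRows.entrySet()) {
            inMemory.put(row.getKey(), convert.apply(row.getValue()));
        }
        return inMemory;
    }

    /** Inner join: streams the large side and probes the in-memory map for each key. */
    static <K, L, R> List<Joined<L, R>> join(List<Map.Entry<K, L>> largeSide, Map<K, R> smallSide) {
        List<Joined<L, R>> out = new ArrayList<>();
        for (Map.Entry<K, L> row : largeSide) {
            R match = smallSide.get(row.getKey());
            if (match != null) { // keys missing from the small side are dropped
                out.add(new Joined<>(row.getValue(), match));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Small side: raw bytes keyed by row key (stand-in for an HBase scan).
        Map<String, byte[]> raw = new HashMap<>();
        raw.put("u1", "alice".getBytes(StandardCharsets.UTF_8));

        Map<String, String> small =
            loadSmallSide(raw, bytes -> new String(bytes, StandardCharsets.UTF_8));

        // Large side: the records that would stream through the map tasks.
        List<Map.Entry<String, String>> large = new ArrayList<>();
        large.add(new AbstractMap.SimpleEntry<>("u1", "click"));
        large.add(new AbstractMap.SimpleEntry<>("u9", "view"));

        for (Joined<String, String> j : join(large, small)) {
            System.out.println(j.left + " -> " + j.right); // prints "click -> alice"
        }
    }
}
```

In a real pipeline the conversion step would be a MapFn and the in-memory map would be built during map-task initialization; the sketch only shows why the small side has to fit in memory.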

I still need to review the ReadableSourceBundle, but I just wanted to confirm that the use case
you were heading towards would definitely get used.

> Improvements to MapsideJoin code
> --------------------------------
>
>                 Key: CRUNCH-278
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-278
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the in-memory
> and MR-based Pipeline instances has always bugged me, so I set out to eliminate the distinction
> between the two impls by creating a new interface, ReadableSourceBundle<T>, that encapsulates
> the MR and in-memory specific logic for doing mapside joins in order to remove the special-case
> code in MapsideJoinStrategy and hopefully make other implementations that use our mapside-join
> patterns much easier to test.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
