incubator-crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-34) Refactor the MSCRPlanner logic
Date Wed, 08 Aug 2012 00:21:09 GMT


Josh Wills commented on CRUNCH-34:

Thanks Gabriel, I think it will be a significant improvement and make lots of other stuff

Re: bigger jobs, not yet, which is why I don't want to include it in the first release. I'll
update the patch w/the above description in the javadoc.
> Refactor the MSCRPlanner logic
> ------------------------------
>                 Key: CRUNCH-34
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.3.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: PLANNER-REFACTORING.patch
> I had a conversation with Robert a while back about one of the shoddier areas of the Crunch
> codebase: the planning logic. It relies on a whole bunch of mutable state, which makes the
> logic of the overall planning process incomprehensible to anyone except for me (back when
> I wrote it) and Gabriel (who grokked it well enough to fix some bugs in it).
> It turns out that understanding the planning process is actually pretty easy if you map
> the logical plan to a graph that has three kinds of vertices: Source, Target, and GroupByKey
> (GBK). All of the other nodes in the logical plan (primarily DoCollection/DoTable instances)
> make up the edges of the graph.
> Once you take this graph perspective, you can think of the MapReduce job creation process
> entirely in terms of graph operations:
> 1) Walk the logical plan and construct the initial Graph object, which allows Edges to
>    exist between GBK vertices.
> 2) Build a new graph that is identical to the first one, except it eliminates Edges between
>    GBK vertices by constructing additional Source and Target vertices.
> 3) Identify all of the (weakly) connected components of the new graph.
> 4) Construct one job out of each connected component: a map-only job when the component
>    has no GBK vertex, a MapReduce job when it has exactly one, or a fusion job when it
>    has more than one.
> I've been working on this off and on for a couple of weeks, and I have a version of the
> planning code that implements the description above and passes all of our tests. There are
> still places where we have mutable state that will need to be cleaned up, but I think this
> is a step in the right direction. I'm not sure it's ready for prime time yet, but I wanted
> to get the conversation started.
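
As a rough illustration of steps 1) through 4) in the quoted description, here is a minimal sketch in Python. It is not Crunch's Java code and not what PLANNER-REFACTORING.patch contains; the function names, the dict/list graph representation, and the "tmpN.out"/"tmpN.in" vertex names are all assumptions made up for this example. It also ignores the ordering dependency between a job that writes an intermediate Target and the job that reads the matching Source.

```python
from collections import defaultdict

SOURCE, TARGET, GBK = "Source", "Target", "GBK"


def split_gbk_edges(kinds, edges):
    """Step 2: replace each GBK -> GBK edge with GBK -> Target and Source -> GBK."""
    kinds = dict(kinds)  # copy, so the initial graph is left untouched
    new_edges = []
    for i, (u, v) in enumerate(edges):
        if kinds[u] == GBK and kinds[v] == GBK:
            # Hypothetical names for the intermediate output and its re-read.
            t, s = f"tmp{i}.out", f"tmp{i}.in"
            kinds[t], kinds[s] = TARGET, SOURCE
            new_edges += [(u, t), (s, v)]
        else:
            new_edges.append((u, v))
    return kinds, new_edges


def weak_components(kinds, edges):
    """Step 3: weakly connected components (edge direction is ignored)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in kinds:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def job_type(kinds, comp):
    """Step 4: classify a component by how many GBK vertices it contains."""
    n = sum(1 for v in comp if kinds[v] == GBK)
    return "map-only" if n == 0 else "MapReduce" if n == 1 else "fusion"


def plan(kinds, edges):
    """Steps 2-4 applied to an already-built initial graph (step 1)."""
    kinds2, edges2 = split_gbk_edges(kinds, edges)
    return [(sorted(c), job_type(kinds2, c)) for c in weak_components(kinds2, edges2)]


# Step 1 is assumed done: vertex kinds plus directed edges of the initial graph.
kinds = {"in": SOURCE, "gbk1": GBK, "gbk2": GBK, "out": TARGET,
         "log_in": SOURCE, "log_out": TARGET}
edges = [("in", "gbk1"), ("gbk1", "gbk2"), ("gbk2", "out"), ("log_in", "log_out")]

for nodes, kind in plan(kinds, edges):
    print(kind, nodes)
# MapReduce ['gbk1', 'in', 'tmp1.out']
# MapReduce ['gbk2', 'out', 'tmp1.in']
# map-only ['log_in', 'log_out']
```

In this sketch the chained GBKs end up in two separate MapReduce jobs because the edge between them is cut in step 2. A component only contains more than one GBK (the "fusion" case) when the GBKs are connected through something other than a direct GBK-to-GBK edge, for example two GBKs fed by the same Source.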


