crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joseph Adler (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-222) Planner should choose to run DoFns on Map or Reduce side depending on data size
Date Mon, 17 Jun 2013 17:31:20 GMT
Joseph Adler created CRUNCH-222:
-----------------------------------

             Summary: Planner should choose to run DoFns on Map or Reduce side depending on
data size
                 Key: CRUNCH-222
                 URL: https://issues.apache.org/jira/browse/CRUNCH-222
             Project: Crunch
          Issue Type: New Feature
    Affects Versions: 0.7.0
            Reporter: Joseph Adler


Hi guys,

I was using Crunch to run a large data pipeline, and came across a problem. In one stage (between
two group functions), I have a DoFn that increases the output data size by a factor of 5.
You can picture the flow like this:

  GroupByKeyA -> MyDoFn -> GroupByKeyB

On Hadoop, this translates to something like this:

  -----------------------------         -----------------------------
  | map1  -> reduce1 |   ->   | map2  -> reduce2 |
  -----------------------------         -----------------------------

Logically, you can either run MyDoFn within reduce1 or map2 and get the same results. (The
same data will be created as an input to reduce2.)

In the current implementation, MyDoFn always runs in Reduce1. That means that the first map/reduce
job will write out 5x more data than it would if MyDoFn ran in Map2. For my job, this is a
big deal: that's 5x more data written to HDFS and read from HDFS; that's a lot of extra work
on HDFS and a lot of extra network traffic. Clearly, it would be more efficient to run MyDoFn
in map2.

I'd like to propose that we change the Crunch Planner to take into account the output size
of a DoFn (or set of DoFns) when deciding where it should be run. When the scaleFactor() is
<= 1 (and reduces the data size), it should run on the reduce side. When the scaleFactor
is > 1 (and increases the data size), it should run on the map side. (For a chain of n
DoFns, this implies that Crunch will need to inspect up to n+1 different scales when deciding
where to run each DoFn.)

-- Joe



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message