crunch-dev mailing list archives

From: "Josh Wills (JIRA)" <j...@apache.org>
Subject: [jira] [Created] (CRUNCH-294) Cost-based job planning
Date: Thu, 14 Nov 2013 19:57:21 GMT

Josh Wills created CRUNCH-294:
---------------------------------

             Summary: Cost-based job planning
                 Key: CRUNCH-294
                 URL: https://issues.apache.org/jira/browse/CRUNCH-294
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Josh Wills
            Assignee: Josh Wills
         Attachments: CRUNCH-294.patch

A bug report on the user list drove me to revisit some of the core planning logic, particularly
around how we decide where to split up DoFns between two dependent MapReduce jobs.

I found an old TODO about using the scale factor from a DoFn to decide where to split up the
nodes between dependent GBKs, so I implemented a new version of the split algorithm. It takes
advantage of the support we've propagated for multiple outputs on both the map and reduce
sides of a job to do finer-grained splits, and it uses information from the scaleFactor
calculations to make smarter split decisions.

One high-level change along with this: I changed the default scaleFactor() value in DoFn to
0.99f so that, by default, the planner slightly prefers writes that occur later in a pipeline flow.
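
To illustrate why a default just below 1.0 favors later writes, here is a minimal sketch in Java.
It is not the code from CRUNCH-294.patch: the class SplitSketch, the method chooseSplit, and the
rule "write where the estimated size is smallest, with ties going to the later position" are all
hypothetical simplifications. It assumes only that each DoFn in a chain between two GBKs
multiplies the estimated data size by its scale factor.

```java
import java.util.Arrays;

/**
 * Hypothetical, simplified model of picking a split point from scale factors.
 * This is an illustration only, not Crunch's actual planner logic.
 */
final class SplitSketch {
    /** The default scale factor named in the message above. */
    static final float DEFAULT_SCALE_FACTOR = 0.99f;

    /**
     * Given the estimated input size and the scale factors of a chain of DoFns,
     * returns how many DoFns to run before writing the intermediate output.
     * A result of 0 means "write before any DoFn runs"; a result equal to
     * scaleFactors.length means "write after the whole chain".
     * Assumed rule: minimize the estimated bytes written, ties go to the later position.
     */
    static int chooseSplit(double inputSize, float[] scaleFactors) {
        int best = 0;
        double bestSize = inputSize;
        double size = inputSize;
        for (int i = 0; i < scaleFactors.length; i++) {
            size *= scaleFactors[i];      // estimated size after running DoFn i
            if (size <= bestSize) {       // "<=" lets a tie move the split later
                bestSize = size;
                best = i + 1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three DoFns that all use the default: each one shrinks the estimate a
        // little, so the cheapest write is after the last DoFn.
        float[] defaults = new float[3];
        Arrays.fill(defaults, DEFAULT_SCALE_FACTOR);
        System.out.println(chooseSplit(1000.0, defaults)); // prints 3

        // A strong filter followed by an expanding DoFn: write right after the filter.
        System.out.println(chooseSplit(1000.0, new float[] {0.1f, 5.0f, 0.99f})); // prints 1
    }
}
```

In this model, a default of exactly 1.0f would leave every position tied, and any default above
1.0f would push the write as early as possible, so a value slightly under 1.0f is one way to
express a mild preference for later writes.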



--
This message was sent by Atlassian JIRA
(v6.1#6144)
