Date: Wed, 20 Nov 2013 15:23:37 +0000 (UTC)
From: "Gabriel Reid (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Commented] (CRUNCH-294) Cost-based job planning

    [ https://issues.apache.org/jira/browse/CRUNCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827755#comment-13827755 ]

Gabriel Reid commented on CRUNCH-294:
-------------------------------------

The logic for choosing splits sounds good to me.

About the breakpoint() method: is it not possible to get the same functionality by calling materialize() on a PCollection? I'm also a little bit worried about the name -- I think it could cause some confusion in terms of debugger breakpoints (but maybe that's just me).
The only other name I can think of for that is checkpoint(), which maybe also brings its share of confusion along with it.

And one other small thing: I noticed a Cloudera file header in BreakpointIT.java.

> Cost-based job planning
> -----------------------
>
>                 Key: CRUNCH-294
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-294
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-294.patch, CRUNCH-294b.patch, jobplan-default-new.png, jobplan-default-old.png, jobplan-large_s2_s3.png, jobplan-lopsided.png
>
>
> A bug report on the user list drove me to revisit some of the core planning logic, particularly around how we decide where to split up DoFns between two dependent MapReduce jobs.
> I found an old TODO about using the scale factor from a DoFn to decide where to split up the nodes between dependent GBKs, so I implemented a new version of the split algorithm that takes advantage of how we've propagated support for multiple outputs on both the map and reduce sides of a job to do finer-grained splits that use information from the scaleFactor calculations to make smarter split decisions.
> One high-level change along with this: I changed the default scaleFactor() value in DoFn to 0.99f to slightly prefer writes that occur later in a pipeline flow by default.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
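
To illustrate why a default scaleFactor() of 0.99f would "slightly prefer writes that occur later", here is a minimal, self-contained sketch. It is not the planner code from the attached patches: the class SplitSketch and the method chooseSplit are hypothetical names, and the rule it applies (write the intermediate output at the point where the estimated data size is smallest, keeping the earliest point on ties) is only an assumption about how scale factors could drive a split decision along a simple chain of DoFns between two GBKs.

```java
public class SplitSketch {

    /**
     * Hypothetical helper: given the scale factors of a linear chain of DoFns,
     * return how many DoFns run before the intermediate write.
     * 0 means "write before any DoFn", n means "write after all n DoFns".
     */
    static int chooseSplit(float[] scaleFactors) {
        double size = 1.0;      // estimated data size relative to the chain's input
        double best = size;     // smallest estimated size seen so far
        int bestIdx = 0;        // split position that achieves it
        for (int i = 0; i < scaleFactors.length; i++) {
            size *= scaleFactors[i];
            // Strict comparison: on a tie, the earlier split position is kept.
            if (size < best) {
                best = size;
                bestIdx = i + 1;
            }
        }
        return bestIdx;
    }

    public static void main(String[] args) {
        // Every DoFn at 0.99f: each step looks slightly smaller, so the latest point wins.
        System.out.println(chooseSplit(new float[] {0.99f, 0.99f, 0.99f}));
        // A neutral factor of 1.0f (used here only for contrast): all points tie, earliest wins.
        System.out.println(chooseSplit(new float[] {1.0f, 1.0f, 1.0f}));
        // Mixed chain: the write lands right after the DoFn that shrinks the data most.
        System.out.println(chooseSplit(new float[] {2.0f, 0.1f, 3.0f}));
    }
}
```

Under these assumptions, a factor just below 1.0 acts as a tie-breaker toward later writes without overriding DoFns that declare a real expansion or reduction.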