Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Date: Mon, 21 Oct 2013 12:20:44 +0000 (UTC)
From: "Josh Wills (JIRA)" <jira@apache.org>
To: crunch-dev@incubator.apache.org
Message-ID: <JIRA.12674757.1382349860316.97061.1382358044066@arcas>
In-Reply-To: <JIRA.12674757.1382349860316@arcas>
References: <JIRA.12674757.1382349860316@arcas>
Subject: [jira] [Commented] (CRUNCH-284) Optimize for minimal disk i/o
 rather than the number of stages?
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CRUNCH-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800589#comment-13800589 ] 

Josh Wills commented on CRUNCH-284:
-----------------------------------

Another thought on the read() option: what if I wrote an intermediate output at some point in the pipeline (like a checkpoint) that I only wanted subsequent stages to read once (as with a SourceTarget)? Maybe the option should be a property on the Source interface instead of on the read method, so that:

pipeline.read(From.textFile(path).onlyOnce())

and

pipeline.write(At.textFile(path).onlyOnce(), WriteMode.CHECKPOINT)

would both work. Or is the onlyOnce() in the context of a write sort of strange?
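
To make the idea concrete, here is a rough sketch in plain Java of what "a property on the source" could look like. None of these types are Crunch's real classes: the Source interface, TextFileSource, PlannerSketch and plannedScans are hypothetical stand-ins, and the rule "a read-once source is scanned a single time no matter how many downstream jobs consume it" is only my assumption about the intended semantics.

```java
// Hypothetical stand-in for a source that can carry a read-once hint.
// This is NOT the Crunch Source interface.
interface Source {
    String path();
    boolean readOnlyOnce();
    // Returns a copy of this source with the read-once hint set.
    Source onlyOnce();
}

// Hypothetical immutable text-file source used only for illustration.
final class TextFileSource implements Source {
    private final String path;
    private final boolean onlyOnce;

    TextFileSource(String path) {
        this(path, false);
    }

    private TextFileSource(String path, boolean onlyOnce) {
        this.path = path;
        this.onlyOnce = onlyOnce;
    }

    @Override
    public String path() {
        return path;
    }

    @Override
    public boolean readOnlyOnce() {
        return onlyOnce;
    }

    @Override
    public Source onlyOnce() {
        return new TextFileSource(path, true);
    }
}

// Hypothetical planner rule: how many full scans of a source are planned
// when `consumers` downstream jobs depend on it.
final class PlannerSketch {
    static int plannedScans(Source source, int consumers) {
        if (consumers <= 0) {
            return 0; // nothing reads the source
        }
        // Assumed semantics: honor the hint with one scan, otherwise one scan per job.
        return source.readOnlyOnce() ? 1 : consumers;
    }
}
```

Because the hint lives on the source object rather than on read(), the same flag would be visible whether the source was passed to a read or produced by a checkpointing write.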

> Optimize for minimal disk i/o rather than the number of stages?
> ---------------------------------------------------------------
>
>                 Key: CRUNCH-284
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-284
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Chao Shi
>
> I have a pipeline as follows:
> PCollection in = pipeline.read(...)
> PCollection part1 = f1(in)
> PCollection part2 = f2(in)
> pipeline.write(part1.groupByKey...)
> pipeline.write(part2.groupByKey...)
> where f1 extracts a small portion of "in" and f2 returns the rest. Crunch optimizes the pipeline into two independent MR jobs, both of which read the input in full.
> I think the ideal plan would be a map-only job that reads the input once and splits it into two outputs, followed by two MR jobs that each read one of those outputs.
> The problem is that Crunch minimizes the number of MR stages, which is optimal in most cases but not in this one.
> What do you think of this, folks?
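
The map-only step proposed in the description can be illustrated with a small sketch in plain Java. It does not use the Crunch API: SinglePassSplit, run and belongsToPart1 are hypothetical names, and in-memory lists stand in for the input and the two intermediate outputs. It only shows that one pass over the input is enough to route every record to exactly one of the two outputs, at the cost of writing both outputs before the grouping jobs can start.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of a single-pass split; not Crunch code.
final class SinglePassSplit {
    final List<String> part1 = new ArrayList<>();
    final List<String> part2 = new ArrayList<>();
    int recordsRead = 0;

    // Reads each input record exactly once and routes it to part1 or part2.
    static SinglePassSplit run(List<String> input, Predicate<String> belongsToPart1) {
        SinglePassSplit result = new SinglePassSplit();
        for (String record : input) {
            result.recordsRead++;
            if (belongsToPart1.test(record)) {
                result.part1.add(record); // plays the role of f1's small portion
            } else {
                result.part2.add(record); // plays the role of f2's "the rest"
            }
        }
        return result;
    }
}
```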

