crunch-dev mailing list archives

From "Chao Shi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-284) Optimize for minimal disk i/o rather than the number of stages?
Date Mon, 21 Oct 2013 10:04:43 GMT
Chao Shi created CRUNCH-284:
-------------------------------

             Summary: Optimize for minimal disk i/o rather than the number of stages?
                 Key: CRUNCH-284
                 URL: https://issues.apache.org/jira/browse/CRUNCH-284
             Project: Crunch
          Issue Type: Bug
            Reporter: Chao Shi


I have a pipeline as follows:

PCollection in = pipeline.read(...)
PCollection part1 = f1(in)
PCollection part2 = f2(in)
pipeline.write(part1.groupByKey...)
pipeline.write(part2.groupByKey...)

where f1 extracts a small portion of "in" and f2 returns the rest. Crunch optimizes the pipeline
into two independent MR jobs, each of which reads the full input.

I think the ideal plan would be a map-only job that reads the input once and splits it into two
outputs, followed by two MR jobs that each read one of those outputs.
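
To make the difference concrete, here is a rough sketch in plain Java. It is not Crunch API:
CountingInput, scanAndFilter and scanAndSplit are hypothetical names made up for illustration,
and the prefix test is only a stand-in for f1. It counts reads of the original input only; the
two split outputs would still have to be written and then read by the follow-up jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class SplitScanSketch {

    /** Simulated input that counts every record handed to a scan. */
    static final class CountingInput {
        private final List<String> records;
        long recordsRead = 0;

        CountingInput(List<String> records) {
            this.records = records;
        }

        void scan(Consumer<String> sink) {
            for (String rec : records) {
                recordsRead++;
                sink.accept(rec);
            }
        }
    }

    /** Current plan: one full scan per branch, each keeping only its own subset. */
    static List<String> scanAndFilter(CountingInput in, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        in.scan(rec -> {
            if (keep.test(rec)) {
                out.add(rec);
            }
        });
        return out;
    }

    /** Proposed plan: a single scan routes each record to part1 or part2. */
    static List<List<String>> scanAndSplit(CountingInput in, Predicate<String> isPart1) {
        List<String> part1 = new ArrayList<>();
        List<String> part2 = new ArrayList<>();
        in.scan(rec -> (isPart1.test(rec) ? part1 : part2).add(rec));
        return Arrays.asList(part1, part2);
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a1", "b2", "a3", "b4", "b5", "b6");
        Predicate<String> f1 = rec -> rec.startsWith("a"); // stand-in for f1

        // Two independent jobs: the same input is scanned once per branch.
        CountingInput twoJobs = new CountingInput(data);
        scanAndFilter(twoJobs, f1);
        scanAndFilter(twoJobs, f1.negate());
        System.out.println("two independent jobs: " + twoJobs.recordsRead + " input reads");

        // Split first: the input is scanned once and both parts come out of that scan.
        CountingInput splitFirst = new CountingInput(data);
        scanAndSplit(splitFirst, f1);
        System.out.println("split first: " + splitFirst.recordsRead + " input reads");
    }
}
```

With the six sample records, the first plan reads the input twice over and the second reads it
once, while both produce the same two parts.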

The problem is that Crunch minimizes the number of MR stages, which is optimal in most cases
but not in this one, where it costs a second full read of the input.

What do you think of this, folks?



--
This message was sent by Atlassian JIRA
(v6.1#6144)
