crunch-dev mailing list archives

From "Chao Shi (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-284) Optimize for minimal disk i/o rather than the number of stages?
Date Mon, 21 Oct 2013 14:17:43 GMT


Chao Shi commented on CRUNCH-284:

Another use case (which I ran into a few months ago):

PCollection in = 
PCollection tmp = someExpensiveFn(in)
PCollection part1 = f1(tmp) 
PCollection part2 = f2(tmp) 

I hope we can find a way to tell Crunch that someExpensiveFn is really expensive and
should be executed only once.
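
To make the cost concrete, here is a rough plain-Java sketch (it does not use the Crunch API). It counts how often a stand-in for someExpensiveFn runs when each branch recomputes it, versus when tmp is computed once and shared. The class name, the counter, the squaring body, and the even/odd filters standing in for f1 and f2 are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class ExpensiveFnSketch {
    // Counts invocations of the hypothetical expensive function.
    static final AtomicInteger calls = new AtomicInteger();

    // Stand-in for someExpensiveFn; the real function is not shown in the thread.
    static int someExpensiveFn(int x) {
        calls.incrementAndGet();
        return x * x;
    }

    // Each branch re-runs the expensive function over the whole input,
    // which mirrors two independent jobs that both start from "in".
    static int callsWhenRecomputed(List<Integer> in) {
        calls.set(0);
        List<Integer> part1 = in.stream()
                .map(ExpensiveFnSketch::someExpensiveFn)
                .filter(v -> v % 2 == 0)   // hypothetical f1
                .collect(Collectors.toList());
        List<Integer> part2 = in.stream()
                .map(ExpensiveFnSketch::someExpensiveFn)
                .filter(v -> v % 2 != 0)   // hypothetical f2
                .collect(Collectors.toList());
        return calls.get();
    }

    // tmp is computed once and both branches read the stored result.
    static int callsWhenMaterialized(List<Integer> in) {
        calls.set(0);
        List<Integer> tmp = in.stream()
                .map(ExpensiveFnSketch::someExpensiveFn)
                .collect(Collectors.toList());
        List<Integer> part1 = tmp.stream()
                .filter(v -> v % 2 == 0)   // hypothetical f1
                .collect(Collectors.toList());
        List<Integer> part2 = tmp.stream()
                .filter(v -> v % 2 != 0)   // hypothetical f2
                .collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        List<Integer> in = Arrays.asList(1, 2, 3, 4);
        System.out.println("recomputed:   " + callsWhenRecomputed(in));   // 8
        System.out.println("materialized: " + callsWhenMaterialized(in)); // 4
    }
}
```

With four input records the recomputing variant calls the function eight times and the shared variant four times. The shared variant pays for storing tmp instead, which is the trade-off the planner would need to be told about.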

> Optimize for minimal disk i/o rather than the number of stages?
> ---------------------------------------------------------------
>                 Key: CRUNCH-284
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Chao Shi
> I have a pipeline as follows:
> PCollection in =
> PCollection part1 = f1(in)
> PCollection part2 = f2(in)
> pipeline.write(part1.groupByKey...)
> pipeline.write(part2.groupByKey...)
> where f1 extracts a small portion of "in" and f2 returns the rest. Crunch optimizes
the pipeline into two independent MR jobs, both of which read the full input.
> I think the ideal plan would be a map-only job that reads the input and splits it into two
outputs, followed by two MR jobs that each read one of those outputs.
> The problem is that Crunch minimizes the number of MR stages, which is optimal for most
cases, but not optimal in this case. 
> What do you think of this, folks?

This message was sent by Atlassian JIRA