crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-284) Optimize for minimal disk i/o rather than the number of stages?
Date Mon, 21 Oct 2013 11:16:42 GMT


Gabriel Reid commented on CRUNCH-284:

Yes, definitely would be useful -- this kind of thing has come up before too. I think that
the main question is how to expose something like this via the API. 

The first way (which is currently possible, but not very nice or intuitive) is just to explicitly
write part1 and part2 and then call This could also probably be exposed in
a nicer way, like by adding a "checkpoint" method to PCollection or Pipeline.

The second way is to have the planner transparently make a decision like this based on the
calculated size of the PCollections that will be output (based on input size and the combined
scaleFactors of all DoFns). This implicit approach always makes me a bit nervous, as it's
very dependent on the correctness of the reporting of the input size (which is an issue when
working with something like HBase), as well as the correctness of the return value of DoFn#scaleFactor
(which is also very difficult to get right).

Any thoughts on how you would want to expose this kind of thing in the API?

> Optimize for minimal disk i/o rather than the number of stages?
> ---------------------------------------------------------------
>                 Key: CRUNCH-284
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Chao Shi
> I have a pipeline as follows:
> PCollection in =
> PCollection part1 = f1(in)
> PCollection part2 = f2(in)
> pipelien.write(part1.groupByKey...)
> pipeline.write(part2.groupByKey...)
> where f1 extracts a small potion from "in" and f2 returns the rest. Crunch optimizes
the pipeline into two independent MR jobs, both of which fully read the input.
> I think the ideal MRs should be a map-only job reads the input and split them to two
outputs, and then two MRs read them respectively.
> The problem is that Crunch minimizes the number of MR stages, which is optimal for most
cases, but not optimal in this case. 
> What do you think of this folks?

This message was sent by Atlassian JIRA

View raw message