crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Created] (CRUNCH-247) Planner should take advantage of to-be-materialized outputs during planning
Date Wed, 07 Aug 2013 03:29:48 GMT
Josh Wills created CRUNCH-247:

             Summary: Planner should take advantage of to-be-materialized outputs during planning
                 Key: CRUNCH-247
             Project: Crunch
          Issue Type: Bug
          Components: Core
            Reporter: Josh Wills
            Assignee: Josh Wills
             Fix For: 0.8.0

In the following pipeline, the Crunch planner runs the "op1" step twice, in two independent
map-only jobs. It should instead run one job that executes the op1 step, followed by a second
job that consumes that output and runs the op2 step:

     PCollection<String> in =;
     PTable<String, String> op = in.parallelDo("op1", new DoFn<String, Pair<String, String>>() {
       public void process(String input, Emitter<Pair<String, String>> emitter) {
         if (input.length() > 5) {
           emitter.emit(Pair.of(input.substring(0, 3), input));
         }
       }
     }, tableOf(strings(), strings()));
     SourceTarget src = (SourceTarget)((MaterializableIterable<Pair<String, String>>)
     op = op.parallelDo("op2", IdentityFn.<Pair<String,String>>getInstance(), tableOf(strings(), strings()),
     PCollection<String> output = op.values();

The planner should take advantage of the materialized output from op1, so that the op2 job
reads that output instead of re-running the op1 step.
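
The difference between the two plans can be illustrated with a toy model. This is only a
sketch under stated assumptions: it is not the Crunch planner, the class and method names
(MaterializeAwarePlanSketch, plan) are hypothetical, and it handles only a linear chain of
steps whose requested outputs are listed in upstream-to-downstream order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model for illustration; none of these names are Crunch classes. */
final class MaterializeAwarePlanSketch {

    /**
     * Plans one map-only job per requested output of a linear chain of steps.
     * Each job is the list of steps it executes. When reuse is true, a job
     * starts just after the closest upstream step that an earlier job has
     * already written out, instead of starting from the beginning of the chain.
     */
    static List<List<String>> plan(List<String> chain, List<String> outputs, boolean reuse) {
        List<List<String>> jobs = new ArrayList<>();
        Set<String> written = new HashSet<>();
        for (String output : outputs) {
            int end = chain.indexOf(output);
            int start = 0;
            if (reuse) {
                // Walk upstream looking for an output that is already on disk.
                for (int i = end - 1; i >= 0; i--) {
                    if (written.contains(chain.get(i))) {
                        start = i + 1;
                        break;
                    }
                }
            }
            jobs.add(new ArrayList<>(chain.subList(start, end + 1)));
            written.add(output);
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<String> chain = List.of("op1", "op2");
        // op1 is materialized and op2 is the final output, so both are requested.
        List<String> outputs = List.of("op1", "op2");
        System.out.println(plan(chain, outputs, false)); // [[op1], [op1, op2]]
        System.out.println(plan(chain, outputs, true));  // [[op1], [op2]]
    }
}
```

The first plan corresponds to the behavior reported here, where op1 is executed in both jobs;
the second corresponds to the requested behavior, where the op2 job starts from op1's output.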
