crunch-user mailing list archives

From Dave Beech <>
Subject Re: Question about mapreduce job planner
Date Wed, 16 Jan 2013 10:34:26 GMT
Hi Josh. A follow-up, just to check I've got this straight.

I've amended my pipeline and added a "" call after the write
to HDFS. Now I do get two mapreduce jobs, but instead of the second
carrying on where the first left off, it actually re-does all the steps
needed to generate the PCollection that was written. I get the same jobs A
and B I described in my original email, but running sequentially rather
than in parallel. Is that what you'd expect?

So I guess what I have to do following the write is re-read from the output
path using

It'd be good if the pipeline could hold onto information about PCollections
even after they're written, so that they can be used by follow-on steps.
I'll file a JIRA to this effect so we can discuss it there.
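
To illustrate the behaviour described above, here is a minimal toy model of a lazy pipeline. It is not the Crunch API: ToyPipeline, Node, demo, the stage names and the "/out/d" and "/out/e" paths are all made up for this sketch, and a dictionary stands in for HDFS. It assumes the planner only knows each collection's lineage, so a follow-on step built on the original collection recomputes its upstream stages, while one built on a re-read of the written output does not.

```python
class Node:
    """A lazily evaluated dataset: a source (load) or a transform of a parent (fn)."""

    def __init__(self, name, parent=None, fn=None, load=None):
        self.name, self.parent, self.fn, self.load = name, parent, fn, load

    def map(self, name, fn):
        # Declares a transform; nothing is computed until the pipeline runs.
        return Node(name, parent=self, fn=fn)


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; 'storage' plays the role of HDFS."""

    def __init__(self):
        self.storage = {}   # path -> rows that have been written
        self.pending = []   # (node, path) writes declared but not yet run
        self.log = []       # names of stages actually executed, in order

    def source(self, name, data):
        return Node(name, load=lambda: data)

    def read(self, path):
        # Reads rows back from storage instead of recomputing their lineage.
        return Node("read:" + path, load=lambda: self.storage[path])

    def write(self, node, path):
        self.pending.append((node, path))

    def run(self):
        # Executes every pending write, walking each node's full lineage.
        for node, path in self.pending:
            self.storage[path] = self._compute(node)
        self.pending = []

    def _compute(self, node):
        if node.parent is None:
            rows = list(node.load())
        else:
            rows = [node.fn(x) for x in self._compute(node.parent)]
        self.log.append(node.name)
        return rows


def demo(reread):
    p = ToyPipeline()
    d = p.source("input", [1, 2, 3]).map("expensive", lambda x: x * 10)
    p.write(d, "/out/d")
    p.run()                                   # first job
    base = p.read("/out/d") if reread else d  # re-read vs. reuse the collection
    e = base.map("followon", lambda x: x + 1)
    p.write(e, "/out/e")
    p.run()                                   # second job
    return p.log, p.storage["/out/e"]


if __name__ == "__main__":
    for reread in (False, True):
        log, out = demo(reread)
        print("reread =", reread, "| stages run:", log, "| output:", out)
```

In this toy, demo(reread=False) runs the "expensive" stage in both jobs, while demo(reread=True) runs it once and starts the second job from the stored output. Both variants produce the same final rows.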


On 15 January 2013 21:00, Dave Beech <> wrote:

> Thanks Josh - that's great. I'll file a JIRA about the side-outputs
> feature, but the call will serve my purpose for now.
> Cheers,
> Dave
> On 15 January 2013 18:03, Josh Wills <> wrote:
>> Hey Dave,
>> The way to force a sequential run would be to call after
>> you write D to HDFS and before you declare the operations in step 6. What
>> we would really want here is a single MapReduce job that wrote side outputs
>> on the map side to create the dataset in step D, but we don't have support
>> for side-outputs in maps yet. Worth filing a JIRA, I think.
>> Thanks!
>> Josh
