crunch-user mailing list archives

From Dave Beech <d...@paraliatech.com>
Subject Re: Question about mapreduce job planner
Date Wed, 16 Jan 2013 10:34:26 GMT
Hi Josh. A follow-up, just to check I've got this straight.

I've amended my pipeline and added a "pipeline.run()" call after the write
to HDFS. Now I do get two mapreduce jobs, but instead of the second
carrying on where the first left off, it actually re-does all the steps
needed to generate the PCollection that was written. I get the same jobs A
and B I described in my original email, but running sequentially rather
than in parallel. Is that what you'd expect?

So I guess what I have to do following the write is to re-read from the output
path using pipeline.read(From.avroFile(...)).
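
A rough sketch of what I mean, in case it helps (the paths, class names and the
MyRecord Avro specific-record class are placeholders, not from my actual
pipeline):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.crunch.types.avro.Avros;

public class RereadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(RereadSketch.class);

    // Steps that produce the intermediate PCollection (jobs A and B from my
    // earlier mail) would go here; this read just stands in for them.
    PCollection<MyRecord> intermediate =
        pipeline.read(From.avroFile("/data/input", Avros.records(MyRecord.class)));

    // Write the intermediate result to HDFS...
    intermediate.write(To.avroFile("/data/intermediate"));

    // ...and force the planner to execute everything declared so far.
    pipeline.run();

    // Re-read the materialised output so the follow-on steps start from the
    // written files instead of re-deriving the PCollection from scratch.
    PCollection<MyRecord> reread =
        pipeline.read(From.avroFile("/data/intermediate", Avros.records(MyRecord.class)));

    // ... declare the follow-on steps against 'reread' here ...

    pipeline.done();
  }
}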

It'd be good if the pipeline could hold onto information about PCollections
even after they're written, so that they can be used by follow-on steps.
I'll file a JIRA to this effect so we can discuss it there.

Thanks,
Dave


On 15 January 2013 21:00, Dave Beech <dave@paraliatech.com> wrote:

> Thanks Josh - that's great. I'll file a JIRA about the side-outputs
> feature, but the pipeline.run() call will serve my purpose for now.
>
> Cheers,
> Dave
>
> On 15 January 2013 18:03, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Dave,
>>
>> The way to force a sequential run would be to call pipeline.run() after
>> you write D to HDFS and before you declare the operations in step 6. What
>> we would really want here is a single MapReduce job that wrote side outputs
>> on the map side to create the dataset in step D, but we don't have support
>> for side-outputs in maps yet. Worth filing a JIRA, I think.
>>
>> Thanks!
>> Josh
>>
>
>
>
