crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Question about mapreduce job planner
Date Wed, 16 Jan 2013 15:40:44 GMT
Hey Dave,

I forgot to tell you something important: your intermediate job should use
At.avroFile(...) instead of To.avroFile(...) since you're planning on
consuming additional data from it. If you do that, I believe it will work
as expected (two sequential jobs with the second one picking up where the
first one left off). In any case, we should make that transparent to users,
so I'm writing a small patch to do the underlying Target -> SourceTarget
conversion automatically when we can.
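
A rough sketch of what I mean -- the paths, the String payload, and the
trivial count step are stand-ins for your real pipeline, and this assumes
the intermediate collection already uses the Avro type family:

import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.io.At;
import org.apache.crunch.types.avro.Avros;

void writeIntermediateAndContinue(Pipeline pipeline, PCollection<String> d) {
  // At.avroFile returns a SourceTarget, so the planner can read the
  // materialized output back in for the next job instead of recomputing d.
  d.write(At.avroFile("/data/intermediate", Avros.strings()));
  pipeline.run();  // executes the first MapReduce job now

  // Steps declared on d after this point should pick up from the Avro
  // files written above rather than re-running the earlier stages.
  PTable<String, Long> counts = d.count();
  pipeline.writeTextFile(counts, "/data/final");
  pipeline.done();
}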

Josh


On Wed, Jan 16, 2013 at 2:34 AM, Dave Beech <dave@paraliatech.com> wrote:

> Hi Josh. A follow-up just to check I've got this straight.
>
> I've amended my pipeline and added a "pipeline.run()" call after the write
> to HDFS. Now I do get two mapreduce jobs, but instead of the second
> carrying on where the first left off, it actually re-does all the steps
> needed to generate the PCollection that was written. I get the same jobs A
> and B I described in my original email, but running sequentially rather
> than in parallel. Is that what you'd expect?
>
> So I guess what I have to do following the write is re-read from the
> output path using pipeline.read(From.avroFile(...)).
>
> It'd be good if the pipeline could hold onto information about
> PCollections even after they're written, so that they can be used by
> follow-on steps. I'll file a JIRA to this effect so we can discuss it
> there.
>
> Thanks,
> Dave
>
>
> On 15 January 2013 21:00, Dave Beech <dave@paraliatech.com> wrote:
>
>> Thanks Josh - that's great. I'll file a JIRA about the side-outputs
>> feature, but the pipeline.run() call will serve my purpose for now.
>>
>> Cheers,
>> Dave
>>
>> On 15 January 2013 18:03, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Hey Dave,
>>>
>>> The way to force a sequential run would be to call pipeline.run() after
>>> you write D to HDFS and before you declare the operations in step 6. What
>>> we would really want here is a single MapReduce job that wrote side outputs
>>> on the map side to create the dataset in step D, but we don't have support
>>> for side-outputs in maps yet. Worth filing a JIRA, I think.
>>>
>>> Thanks!
>>> Josh
>>>
>>
>>
>>
>
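
For reference, the re-read workaround Dave describes above would look
roughly like this (same caveats: placeholder paths and types, and the
collection is assumed to use the Avro type family):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.crunch.types.avro.Avros;

void rereadWorkaround(Pipeline pipeline, PCollection<String> d) {
  // Write the intermediate data to a plain Target and run the first job.
  d.write(To.avroFile("/data/intermediate"));
  pipeline.run();

  // Re-read the materialized output as a new PCollection and build the
  // follow-on steps from it, so the planner doesn't re-derive the earlier
  // stages when the second job runs.
  PCollection<String> reread =
      pipeline.read(From.avroFile("/data/intermediate", Avros.strings()));
  pipeline.writeTextFile(reread.count(), "/data/final");
  pipeline.done();
}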


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
