crunch-user mailing list archives

From: Dave Beech <d...@paraliatech.com>
Subject: Re: Question about mapreduce job planner
Date: Wed, 16 Jan 2013 16:11:18 GMT
Thanks Josh - that works. At least I was only 2 characters away from the
right answer! ;)


On 16 January 2013 15:40, Josh Wills <jwills@cloudera.com> wrote:

> Hey Dave,
>
> I forgot to tell you something important: your intermediate job should use
> At.avroFile(...) instead of To.avroFile(...) since you're planning on
> consuming additional data from it. If you do that, I believe it will work
> as expected (two sequential jobs with the second one picking up where the
> first one left off). In any case, we should make that transparent to users,
> so I'm writing a small patch to do the underlying Target -> SourceTarget
> conversion automatically when we can.
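
For reference, a minimal sketch of the pattern Josh describes, assuming a
hypothetical Avro specific record class MyRecord and placeholder paths
(neither comes from the original pipeline):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.At;
import org.apache.crunch.io.From;
import org.apache.crunch.types.avro.Avros;
import org.apache.hadoop.conf.Configuration;

public class AtAvroFileExample {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(AtAvroFileExample.class, new Configuration());

    // Steps 1-5: derive the intermediate dataset D. MyRecord stands in for
    // whatever Avro specific record the real pipeline uses.
    PCollection<MyRecord> input =
        pipeline.read(From.avroFile("/data/input", Avros.records(MyRecord.class)));
    PCollection<MyRecord> d = input; // placeholder for the real transformations

    // Write D with At.avroFile (a SourceTarget) rather than To.avroFile (a
    // plain Target), so the planner knows it can re-read the materialized
    // files when scheduling the follow-on job.
    d.write(At.avroFile("/data/intermediate", Avros.records(MyRecord.class)));

    // Step 6: further operations on d are planned as a second job that picks
    // up from /data/intermediate instead of recomputing D.

    pipeline.done();
  }
}

The key difference is that At.avroFile returns a SourceTarget, so the written
output can also serve as an input for later stages, whereas To.avroFile only
returns a write-only Target.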
>
> Josh
>
>
> On Wed, Jan 16, 2013 at 2:34 AM, Dave Beech <dave@paraliatech.com> wrote:
>
>> Hi Josh. A follow up just to check I've got this straight.
>>
>> I've amended my pipeline and added a "pipeline.run()" call after the
>> write to HDFS. Now I do get two mapreduce jobs, but instead of the second
>> carrying on where the first left off, it actually re-does all the steps
>> needed to generate the PCollection that was written. I get the same jobs A
>> and B I described in my original email, but running sequentially rather
>> than in parallel. Is that what you'd expect?
>>
>> So I guess what I have to do following the write is re-read from the
>> output path using pipeline.read(From.avroFile(...)).
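
In code, that workaround would look roughly like the fragment below, reusing
the hypothetical MyRecord class and placeholder paths from the sketch further
up the page:

// Write D with an ordinary Target, then force the planned job to execute.
d.write(To.avroFile("/data/intermediate"));
pipeline.run();

// Re-read the materialized output so the follow-on steps start from HDFS
// rather than re-deriving D from the original inputs.
PCollection<MyRecord> d2 =
    pipeline.read(From.avroFile("/data/intermediate", Avros.records(MyRecord.class)));
// Step 6 then operates on d2 instead of d.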
>>
>> It'd be good if the pipeline could hold onto information about
>> PCollections even after they're written, so that they can be used by
>> follow-on steps. I'll file a JIRA to this effect so we can discuss it
>> there.
>>
>> Thanks,
>> Dave
>>
>>
>> On 15 January 2013 21:00, Dave Beech <dave@paraliatech.com> wrote:
>>
>>> Thanks Josh - that's great. I'll file a JIRA about the side-outputs
>>> feature, but the pipeline.run() call will serve my purpose for now.
>>>
>>> Cheers,
>>> Dave
>>>
>>> On 15 January 2013 18:03, Josh Wills <jwills@cloudera.com> wrote:
>>>
>>>> Hey Dave,
>>>>
>>>> The way to force a sequential run would be to call pipeline.run() after
>>>> you write D to HDFS and before you declare the operations in step 6. What
>>>> we would really want here is a single MapReduce job that wrote side outputs
>>>> on the map side to create the dataset in step D, but we don't have support
>>>> for side-outputs in maps yet. Worth filing a JIRA, I think.
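
As a rough sketch of that placement, again using the hypothetical MyRecord
class and placeholder paths from the sketches above:

d.write(To.avroFile("/data/d")); // write D to HDFS
pipeline.run();                  // the first MapReduce job executes here

// Operations for step 6 are declared only after run() returns, so they are
// planned and executed as a separate, subsequent job.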
>>>>
>>>> Thanks!
>>>> Josh
>>>>
>>>
>>>
>>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
