hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Business logic in cleanup?
Date Fri, 18 Nov 2011 15:44:27 GMT
Given that you are sure about it, and you also know why that's the
case, I'd definitely write inside the cleanup(…) hook. No harm at all
in doing that.

Take a look at the mapreduce.Mapper#run(…) method in the source and
you'll understand what I mean by it not being a stage or even an event,
but just a tail call made after all the map() calls are done.
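
For readers without the source at hand, here is a rough plain-Java sketch
of that shape. It is not Hadoop code: CleanupSketch, SketchMapper,
CachingCountMapper and runSketch are hypothetical names, and the output
list merely stands in for context.write(…). It assumes the "local
caching" mentioned in this thread means accumulating results in memory
during map() and writing them once at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the shape of mapreduce.Mapper#run(...).
// None of these classes are Hadoop's; all names are hypothetical.
public class CleanupSketch {

    /** Minimal mapper contract: setup once, map per record, cleanup once. */
    static abstract class SketchMapper {
        // Stands in for records emitted through context.write(...).
        final List<String> output = new ArrayList<>();

        void setup() {}
        abstract void map(String line);
        void cleanup() {}

        // cleanup() is simply the last call after the input is exhausted,
        // not a separate stage or event.
        final void run(Iterable<String> input) {
            setup();
            for (String line : input) {
                map(line);
            }
            cleanup();
        }
    }

    /** Caches word counts in memory during map(); writes them in cleanup(). */
    static class CachingCountMapper extends SketchMapper {
        private final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        void map(String line) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);  // nothing emitted yet
                }
            }
        }

        @Override
        void cleanup() {
            // One write per distinct word instead of one per occurrence.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                output.add(e.getKey() + "\t" + e.getValue());
            }
        }
    }

    /** Runs the caching mapper over the given lines and returns what it wrote. */
    static List<String> runSketch(List<String> lines) {
        CachingCountMapper mapper = new CachingCountMapper();
        mapper.run(lines);
        return mapper.output;
    }

    public static void main(String[] args) {
        System.out.println(runSketch(Arrays.asList("a b a", "b c")));
    }
}
```

Since a failed attempt starts afresh and reprocesses all lines, nothing
held in the in-memory cache has to survive a crash. The cache does have
to fit in the task's memory, though.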

On Fri, Nov 18, 2011 at 8:58 PM, Something Something
<mailinglists19@gmail.com> wrote:
> Thanks again for the clarification.  Not sure what you mean by it's not a
> 'stage'!  Okay.. maybe not a stage, but I think of it as an 'Event', such as
> 'Mouseover', 'Mouseout'.  The 'cleanup' is really a 'MapperCompleted' event,
> right?
> Confusion comes with the name of this method.  The name 'cleanup' makes me
> think it should not be really used as 'mapperCompleted', but it appears
> there's no harm in using it that way.
> Here's our dilemma - when we use (local) caching in the Mapper & write in
> the 'cleanup', our job completes in 18 minutes.  When we don't write in
> 'cleanup' it takes 3 hours!!!  Knowing this if you were to decide, would you
> use 'cleanup' for this purpose?
> Thanks once again for your advice.
> On Thu, Nov 17, 2011 at 9:35 PM, Harsh J <harsh@cloudera.com> wrote:
>> Hello,
>> On Fri, Nov 18, 2011 at 10:44 AM, Something Something
>> <mailinglists19@gmail.com> wrote:
>> > Thanks for the reply.  Here's another concern we have.  Let's say Mapper
>> > has
>> > finished processing 1000 lines from the input file & then the machine
>> > goes
>> > down.  I believe Hadoop is smart enough to re-distribute the input split
>> > that was assigned to this Mapper, correct?  After re-assigning will it
>> > reprocess the 1000 lines that were processed successfully before & start
>> > from line 1001  OR  would it reprocess ALL lines?
>> Attempts of any task start afresh. That's the default nature of Hadoop.
>> So, it would begin from start again and hence reprocess ALL lines.
>> Understand that cleanup is just a fancy API call here, that's called
>> after the input reader completes - not a "stage".
>> --
>> Harsh J

Harsh J
