hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Business logic in cleanup?
Date Fri, 18 Nov 2011 19:12:40 GMT
Arun,

On 19-Nov-2011, at 12:16 AM, Arun C Murthy wrote:

> 
> On Nov 18, 2011, at 10:44 AM, Harsh J wrote:
>> 
>> If you could follow up on that patch, and see it through, it's a wish granted for a
>> lot of us as well, as we move ahead with the newer APIs in future Hadoop releases ;-)
>> 
> 
> The plan is to support both mapred and mapreduce MR APIs for the foreseeable future.

That is surely good news (and I do know we are going to). But there may be some clarity
needed here.

I reckon this is primarily to avoid breakage, among other reasons, but it helps new developers
to know which one to choose when they start out (say, steered via deprecation as we tried
before, or via a documentation note?).

I personally think it's good enough if we recommend one, and simply support the other (via
regular deprecation periods, documentation notes, or other ways you can think of).

One place high in confusion _today_ is when a user determines he's supposed to use the stable
MR APIs, and then tries out HBase, which supports only the newer ones as it rolls ahead. Some
other downstream projects, too, can't afford to maintain both the way we could.
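
For anyone starting out, a minimal side-by-side of the two APIs may help (the package and
method names below are the real ones; the class bodies are only illustrative):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;

  // Old "stable" API - org.apache.hadoop.mapred: Mapper is an interface;
  // the lifecycle hooks are JobConfigurable.configure() and Closeable.close().
  class OldStyleMapper extends org.apache.hadoop.mapred.MapReduceBase
      implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        org.apache.hadoop.mapred.OutputCollector<Text, Text> output,
        org.apache.hadoop.mapred.Reporter reporter) throws IOException {
      output.collect(value, value);
    }
  }

  // New API - org.apache.hadoop.mapreduce: Mapper is a class with
  // setup()/map()/cleanup() hooks and an overridable run().
  class NewStyleMapper
      extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, value);
    }
  }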

> 
> Arun
> 
>> On 18-Nov-2011, at 10:32 PM, Something Something wrote:
>> 
>>> Thanks again.  Will look at Mapper.run to understand better.  Actually, just
>>> a few minutes ago I got the AvroMapper to work (which will read from Avro
>>> files). This will hopefully improve performance even more.
>>> 
>>> Interestingly, AvroMapper doesn't extend from Mapper, so it doesn't have the
>>> 'cleanup' method.  Instead it provides a 'close' method, which seems to behave
>>> the same way.  Honestly, I like the method name 'close' better than 'cleanup'.
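>>> 
>>> A rough sketch of that shape (signatures as in Avro's old-API AvroMapper,
>>> bodies illustrative):
>>> 
>>>   public class MyAvroMapper extends org.apache.avro.mapred.AvroMapper<String, String> {
>>>     @Override
>>>     public void map(String datum, org.apache.avro.mapred.AvroCollector<String> collector,
>>>         org.apache.hadoop.mapred.Reporter reporter) throws java.io.IOException {
>>>       collector.collect(datum);
>>>     }
>>> 
>>>     @Override
>>>     public void close() throws java.io.IOException {
>>>       // Flush any locally cached state here - the analogue of Mapper#cleanup().
>>>     }
>>>   }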
>>> 
>>> Doug - Is there a reason you chose to not extend from org/apache/hadoop/mapreduce/Mapper?
>>> 
>>> Thank you all for your help.
>>> 
>>> 
>>> On Fri, Nov 18, 2011 at 7:44 AM, Harsh J <harsh@cloudera.com> wrote:
>>> Given that you are sure about it, and you also know why that's the
>>> case, I'd definitely write inside the cleanup(…) hook. No harm at all
>>> in doing that.
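>>> 
>>> Roughly what I mean, as a sketch (names made up, new API assumed; imports
>>> come from java.util, org.apache.hadoop.io and org.apache.hadoop.mapreduce):
>>> 
>>>   public class CachingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
>>>     private final Map<String, Integer> cache = new HashMap<String, Integer>();
>>> 
>>>     @Override
>>>     protected void map(LongWritable key, Text value, Context context) {
>>>       // Aggregate locally instead of emitting one record per input line.
>>>       Integer count = cache.get(value.toString());
>>>       cache.put(value.toString(), count == null ? 1 : count + 1);
>>>     }
>>> 
>>>     @Override
>>>     protected void cleanup(Context context)
>>>         throws IOException, InterruptedException {
>>>       // One flush after all map() calls; safe, since a failed attempt is
>>>       // rerun from scratch and its partial output is discarded anyway.
>>>       for (Map.Entry<String, Integer> e : cache.entrySet()) {
>>>         context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
>>>       }
>>>     }
>>>   }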
>>> 
>>> Take a look at mapreduce.Mapper#run(…) method in source and you'll
>>> understand what I mean by it not being a stage or even an event, but
>>> just a tail call after all map()s are called.
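>>> 
>>> For reference, run() is essentially just this (paraphrased from the
>>> 0.20-era source):
>>> 
>>>   public void run(Context context) throws IOException, InterruptedException {
>>>     setup(context);
>>>     while (context.nextKeyValue()) {
>>>       map(context.getCurrentKey(), context.getCurrentValue(), context);
>>>     }
>>>     cleanup(context);  // a plain tail call, not a separate stage
>>>   }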
>>> 
>>> On Fri, Nov 18, 2011 at 8:58 PM, Something Something
>>> <mailinglists19@gmail.com> wrote:
>>> > Thanks again for the clarification.  Not sure what you mean by it's not a
>>> > 'stage'!  Okay, maybe not a stage, but I think of it as an 'Event', such as
>>> > 'Mouseover' or 'Mouseout'.  The 'cleanup' is really a 'MapperCompleted' event,
>>> > right?
>>> >
>>> > Confusion comes from the name of this method.  The name 'cleanup' makes me
>>> > think it should not really be used as 'mapperCompleted', but it appears
>>> > there's no harm in using it that way.
>>> >
>>> > Here's our dilemma - when we use (local) caching in the Mapper & write in
>>> > the 'cleanup', our job completes in 18 minutes.  When we don't write in
>>> > 'cleanup' it takes 3 hours!!!  Knowing this, if you were to decide, would you
>>> > use 'cleanup' for this purpose?
>>> >
>>> > Thanks once again for your advice.
>>> >
>>> >
>>> > On Thu, Nov 17, 2011 at 9:35 PM, Harsh J <harsh@cloudera.com> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> On Fri, Nov 18, 2011 at 10:44 AM, Something Something
>>> >> <mailinglists19@gmail.com> wrote:
>>> >> > Thanks for the reply.  Here's another concern we have.  Let's say a Mapper
>>> >> > has finished processing 1000 lines from the input file & then the machine
>>> >> > goes down.  I believe Hadoop is smart enough to re-distribute the input split
>>> >> > that was assigned to this Mapper, correct?  After re-assigning, will it
>>> >> > reprocess the 1000 lines that were processed successfully before & start
>>> >> > from line 1001, OR would it reprocess ALL lines?
>>> >>
>>> >> Attempts of any task start afresh. That's the default nature of Hadoop.
>>> >>
>>> >> So, it would begin from start again and hence reprocess ALL lines.
>>> >> Understand that cleanup is just a fancy API call here, that's called
>>> >> after the input reader completes - not a "stage".
>>> >>
>>> >> --
>>> >> Harsh J
>>> >
>>> >
>>> 
>>> 
>>> 
>>> --
>>> Harsh J
>>> 
>> 
> 

