hadoop-common-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Task fails: starts over with first input key?
Date Tue, 14 Dec 2010 17:13:28 GMT
Hmmm, I'll take that under advisement.  So, even if I manually avoided redoing earlier work
(by keeping a log of which input key/values have been processed and short-circuiting map()
if a key/value has already been processed), you're saying those previously completed key/values
would not be passed on to the reducer if I skipped them the second time the task was attempted?
Is that correct?

Man, I'm trying to figure out the best design here.

My mapper can take up to an hour to process a single input key/value.  If a mapper fails on
the second input, I really can't afford to calculate the first input all over again even though
it was successful the first time.  The job basically never finishes at that rate of inefficiency.
Reprocessing any data even twice is basically unacceptable, much less four times, which is the
number of times a task is attempted before giving up and letting the reducer work with what
it's got.  (I've tried setMaxMapAttempts(), but it has no effect; tasks are always attempted
four times regardless.)

I wish there were a less burdensome version of skipbadrecords.  I don't want it to perform
a binary search trying to find the bad record while reprocessing data over and over again.
 I want it to just skip failed calls to map() and move on to the next input key/value.  I
want the mapper to just iterate through its list of inputs, skipping any that fail, and sending
all the successfully processed data to the reducer, all in a single nonredundant pass.  Is
there any way to make Hadoop do that?
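
A minimal sketch of the single-pass behavior described above, in plain Java with no Hadoop
dependencies. The names SkipFailedRecords, Result, and processAll are hypothetical and exist
only for illustration. In a real job the try/catch would sit inside map() itself. This assumes
a failure surfaces as a Java exception; it does nothing for a task whose JVM dies outright.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SkipFailedRecords {

    /** Holds the outputs that succeeded and a count of the inputs that were skipped. */
    public static final class Result<O> {
        public final List<O> outputs = new ArrayList<>();
        public int skipped = 0;
    }

    /**
     * Applies work to each input exactly once. An input whose call throws is
     * counted and skipped; earlier outputs are kept and later inputs still run.
     */
    public static <I, O> Result<O> processAll(List<I> inputs, Function<I, O> work) {
        Result<O> result = new Result<>();
        for (I input : inputs) {
            try {
                result.outputs.add(work.apply(input));
            } catch (RuntimeException e) {
                // Hypothetical policy: record the failure and move on to the next input.
                result.skipped++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the expensive per-record computation; "bad" simulates a failing input.
        Function<String, String> work = s -> {
            if (s.equals("bad")) {
                throw new IllegalStateException("cannot process " + s);
            }
            return s.toUpperCase();
        };
        Result<String> r = processAll(Arrays.asList("a", "bad", "c"), work);
        System.out.println(r.outputs + " skipped=" + r.skipped);
    }
}
```

Because the exception never escapes the per-record call, the surrounding task would not be
marked as failed, so nothing already computed would be thrown away and rerun.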



On Dec 13, 2010, at 21:46 , Eric Sammer wrote:

> What you are seeing is correct and the intended behavior. The unit of work
> in a MR job is the task. If something causes the task to fail, it starts
> again. Any output from the failed task attempt is thrown away. The reducers
> will not see the output of the failed map tasks at all. There is no way
> (within Hadoop proper) to teach a task to be stateful, nor should you as you
> lose a lot of flexibility with respect to features like speculative
> execution and the ability to deal with failures of the machine (unless you
> maintained task state in HDFS or another external system). It's just not
> worth it.
> On Mon, Dec 13, 2010 at 7:51 PM, Keith Wiley <kwiley@keithwiley.com> wrote:
>> I think I am seeing a behavior in which if a mapper task fails (crashes) on
>> one input key/value, the entire task is rescheduled and rerun, starting over
>> again from the first input key/value even if all of the inputs preceding the
>> troublesome input were processed successfully.
>> Am I correct about this or am I seeing something that isn't there?
>> If I am correct, what happens to the outputs of the successful duplicate
>> map() calls?  Which output key/value is the one that is sent to shuffle (and
>> a reducer): Is it the result of the first attempt on the input in question
>> or the result of the last attempt?
>> Is there any way to prevent it from recalculating those duplicate inputs
>> other than something manual on the side like keeping a job-log of the map
>> attempts and scanning the log at the beginning of each map() call?
>> Thanks.

Keith Wiley               kwiley@keithwiley.com               www.keithwiley.com

"Luminous beings are we, not this crude matter."
  -- Yoda
