hadoop-common-user mailing list archives

From Dali Kilani <dali.kil...@gmail.com>
Subject Re: Why is Spilled Records always equal to Map output records
Date Tue, 14 Jul 2009 01:40:59 GMT
If I am not mistaken (I am new to this stuff), that's because you need to
have a checkpoint from which you can restart the reduce tasks that use those
spilled records if a reduce task fails.

Dali
On Mon, Jul 13, 2009 at 6:32 PM, Mu Qiao <qiaomuf@gmail.com> wrote:

> Thank you. But why do map outputs need to be written to disk at least once?
> I think my io.sort.mb is large enough to do in-memory operations. Could you
> provide me some information about it?
>
> On Tue, Jul 14, 2009 at 1:27 AM, Owen O'Malley <omalley@apache.org> wrote:
>
> >
> > On Jul 12, 2009, at 3:55 AM, Mu Qiao wrote:
> >
> >> I notice it from the web console after I've tried to run several jobs.
> >> Every one of the jobs has the number of Spilled Records equal to Map
> >> output records, even if there are only 5 map output records.
> >>
> >
> >
> > This is good. The map outputs need to be written to disk at least once. So
> > if they are equal, things are fitting in memory. If multiple passes are
> > needed, you'll see 2x or more spilled records.
> >
> >> In the reduce phase, the number of spilled records is also equal to the
> >> number of reduce input records.
> >>
> >
> > This is reasonable, although 0.19 and 0.20 don't need to spill the records
> > in the reduce at all, if you make the buffer big enough.
> >
> > -- Owen
> >
>
>
>
> --
> Best wishes,
> Qiao Mu
>
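
Owen's rule of thumb quoted above (equal counters mean one write to disk, 2x or
more means extra passes) can be sketched as a small check on the two counter
values. This is only an illustration: the function names spill_ratio and
needs_multiple_passes are hypothetical, the counter values are made up, and
reading the counters from a real job is left out.

```python
def spill_ratio(map_output_records: int, spilled_records: int) -> float:
    """Ratio of the Spilled Records counter to the Map output records counter."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


def needs_multiple_passes(map_output_records: int, spilled_records: int) -> bool:
    """True when, on average, records were written to disk more than once."""
    return spill_ratio(map_output_records, spilled_records) > 1.0


if __name__ == "__main__":
    # Hypothetical counter values, not taken from a real job.
    for outputs, spilled in [(5, 5), (1_000_000, 2_000_000)]:
        ratio = spill_ratio(outputs, spilled)
        if needs_multiple_passes(outputs, spilled):
            verdict = "multiple passes"
        else:
            verdict = "single spill"
        print(f"{outputs} map output records, {spilled} spilled: "
              f"{ratio:.1f}x ({verdict})")
```

A ratio of exactly 1 matches the case in this thread: every map output record
was written to disk once and the sort buffer was large enough.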



-- 
Dali Kilani
===========
Phone :  (650) 492-5921 (Google Voice)
E-Fax  :  (775) 552-2982
