Subject: Re: Hadoop map reduce merge algorithm
From: Bai Shen <baishen.lists@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Fri, 13 Jan 2012 08:33:26 -0500

As far as I can tell, the amount of RAM available has no effect on the
merge. Regardless of what you set io.sort.mb and io.sort.factor to, it
will eventually end up attempting to bring the entire output into memory
to merge. If the mb and factor are set low, it will simply require more
passes.

Is there a way to actually configure the amount of RAM used by the merge?
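For concreteness, here is a minimal sketch of how these two properties can
be set on a job configuration; the property names are the pre-2.x ones
discussed in this thread, and the class name and values are made up purely
for illustration, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

// Illustrative values only: both should be tuned against the RAM
// actually available to each task JVM, as discussed below.
public class SortTuningExample {
    public static void main(String[] args) {
        JobConf job = new JobConf(new Configuration());

        // Size (in MB) of the in-memory buffer that collects map output;
        // when it fills, a sorted spill file is written to disk.
        job.setInt("io.sort.mb", 256);

        // Maximum number of spill segments merged together in one pass;
        // with more segments than this, extra intermediate passes happen.
        job.setInt("io.sort.factor", 64);
    }
}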
On Thu, Jan 12, 2012 at 11:14 PM, Ravi Gummadi <gravi@yahoo-inc.com> wrote:

> Yes. Spills of map output get merged into a single file. The spills are
> triggered by the buffer size set using the configuration property
> io.sort.mb. Obviously a bigger value for io.sort.mb is preferred for
> better performance, but the limit has to be set based on the amount of
> RAM available.
> Also, the bigger the value of the configuration property io.sort.factor,
> the better the performance. Even then, a smaller value may have to be set
> for this config property based on the amount of RAM available.
>
> -Ravi
>
> On 1/13/12 Friday 3:12 AM, "Bai Shen" <baishen.lists@gmail.com> wrote:
>
> That's my understanding as well. I can't seem to find any settings that
> govern the step where the output is merged into a single file.
> io.sort.factor modifies the number of passes that is done, but it
> eventually ends up doing the same thing no matter how many spill files
> there are. They're simply combined incrementally instead of all at once.
>
> Is anybody more familiar with this step of the process?
>
> Thanks.
>
> On Thu, Jan 12, 2012 at 2:27 PM, Robert Evans <evans@yahoo-inc.com> wrote:
>
> My understanding is that the mapper will cache the output in memory until
> its memory buffer fills up, at which point it will sort the data and
> spill it to disk. Once a given number of spill files are created, they
> will be merged together into a larger spill file. Once the mapper
> finishes, the output is totally merged into a single file that can be
> served to the Reducer through the TaskTracker, or the NodeManager under
> YARN. The reducer does a similar thing as it merges the output from all
> of the mappers. I don't understand all of the reasons behind this, but I
> think much of it is to optimize the time it takes to sort the data. If
> you try to merge too many files, then you waste a lot of time doing seeks
> and spend less time reading data. But I was not involved with developing
> it, so I don't know for sure.
>
> --Bobby Evans
>
> On 1/12/12 10:27 AM, "Bai Shen" <baishen.lists@gmail.com> wrote:
>
> Can someone explain how the map reduce merge is done? As far as I can
> tell, it appears to pull all of the spill files into one giant file to
> send to the reducer. Is this correct? Even if you set smaller spill
> files and a lower sort factor, the eventual merge is still the same. It
> just takes more passes to get there.
>
> Thanks.
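To make the "more passes" point concrete, here is a rough standalone sketch
of the arithmetic; it is not Hadoop's actual merge code, just a simplified
model in which each intermediate round folds io.sort.factor spill segments
into one new segment, so a lower factor or more spills only adds rounds,
and the final round still streams everything into a single output file.

// Simplified model of multi-pass merging (a sketch, not Hadoop's Merger):
// each intermediate round merges `factor` spill segments into one new
// segment; the final round merges whatever is left into the single output.
public class MergePassesSketch {
    static int mergeRounds(int segments, int factor) {
        int rounds = 0;
        while (segments > factor) {
            segments = segments - factor + 1; // factor segments become one
            rounds++;
        }
        return rounds + 1; // plus the final merge that produces one file
    }

    public static void main(String[] args) {
        // Many small spills with a low factor -> many rounds.
        System.out.println(mergeRounds(100, 10)); // prints 11
        // Same spill count with a higher factor -> fewer rounds.
        System.out.println(mergeRounds(100, 64)); // prints 2
    }
}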
