Subject: Re: distributed cache
From: Lin Ma <linlma@gmail.com>
To: Kai Voigt, user@hadoop.apache.org
Date: Sat, 22 Dec 2012 21:24:12 +0800

Hi Kai,

Smart answer! :-)

- The assumption you make is that one distributed cache replica can serve
only one download session for a TaskTracker node at a time (this is why you
get concurrency n/r). The question is: why can't one distributed cache
replica serve multiple concurrent download sessions? For example, suppose a
TaskTracker takes elapsed time t to download a file from a specific replica.
Isn't it possible for 2 TaskTrackers to download from that same replica in
parallel, each also taking time t (or say 1.5t), which would be faster than
the sequential download time 2t you mentioned?

- "In total, r+n/r concurrent operations. If you optimize r depending on n,
SQRT(n) is the optimal replication level." -- how do you get SQRT(n) as the
minimizer of r+n/r? I'd appreciate a pointer to more details.

regards,
Lin
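For reference, the optimization Kai mentions works out as follows if you
treat r as a continuous variable (a sketch; the second-derivative check is
added here, it is not spelled out in the thread):

    f(r)   = r + n/r
    f'(r)  = 1 - n/r^2 = 0   =>   r^2 = n   =>   r = SQRT(n)
    f''(r) = 2n/r^3 > 0 for r > 0, so r = SQRT(n) is a minimum.

For example, n = 100 TaskTrackers gives r = SQRT(100) = 10, which lines up
with the default replication of 10 mentioned in Kai's reply below.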
On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k@123.org> wrote:

> Hi,
>
> Simple math. Assume you have n TaskTrackers in your cluster that will
> need to access the files in the distributed cache, and that r is the
> replication level of those files.
>
> Copying the files into HDFS requires r copy operations over the network.
> The n TaskTrackers then need to get their local copies from HDFS, and
> since they fetch from r DataNodes, that is roughly n/r copy operations
> per DataNode. In total, r + n/r concurrent operations. If you optimize r
> depending on n, SQRT(n) is the optimal replication level. So 10 is a
> reasonable default setting for most clusters that are not 500+ nodes big.
>
> Kai
>
> On 22.12.2012 at 13:46, Lin Ma <linlma@gmail.com> wrote:
>
> Thanks Kai -- a higher replication count for what purpose?
>
> regards,
> Lin
>
> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k@123.org> wrote:
>
>> Hi,
>>
>> On 22.12.2012 at 13:03, Lin Ma <linlma@gmail.com> wrote:
>>
>> > I want to confirm that when a mapper or reducer on a task node
>> > accesses a distributed cache file, the file resides on disk, not in
>> > memory. I just want to make sure a distributed cache file is not fully
>> > loaded into memory, where it would compete for memory with the
>> > mapper/reducer tasks. Is that correct?
>>
>> Yes, you are correct. The JobTracker will put files for the distributed
>> cache into HDFS with a higher replication count (10 by default).
>> Whenever a TaskTracker needs those files for a task it is launching
>> locally, it will fetch a copy to its local disk, so it won't need to do
>> this again for future tasks on this node. After a job is done, all
>> local copies and the HDFS copies of files in the distributed cache are
>> cleaned up.
>>
>> Kai
>>
>> --
>> Kai Voigt
>> k@123.org
>
> --
> Kai Voigt
> k@123.org
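To make the mechanics in Kai's answer concrete, here is a minimal sketch
against the Hadoop 1.x "mapred" API (the current API when this thread was
written). The HDFS path, file name, and class names are made up for
illustration; mapred.submit.replication is, to my knowledge, the 1.x
property for the replication of job files pushed to HDFS at submit time
(default 10), but verify it against your version:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CacheSketch {

        // Driver side: register a file and pick the replication r.
        public static JobConf buildConf() throws Exception {
            JobConf conf = new JobConf(CacheSketch.class);
            // Replication for job files pushed to HDFS at submit time;
            // this is the r in Kai's r + n/r argument (default 10).
            conf.setInt("mapred.submit.replication", 10);
            // Hypothetical HDFS path; the #fragment names the local link.
            DistributedCache.addCacheFile(
                    new URI("/user/lin/lookup.txt#lookup.txt"), conf);
            return conf;
        }

        // Task side: the TaskTracker has already fetched a copy to its
        // local disk before the task starts; nothing is preloaded into
        // the task's heap.
        public static class MyMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {

            private Path localCopy;

            @Override
            public void configure(JobConf job) {
                try {
                    // Local on-disk paths, one per cached file.
                    Path[] files = DistributedCache.getLocalCacheFiles(job);
                    localCopy = files[0];
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> out,
                            Reporter reporter) throws IOException {
                // Read from local disk as needed; the file only competes
                // for task memory if your own code loads it into the heap.
                BufferedReader in = new BufferedReader(
                        new FileReader(localCopy.toString()));
                try {
                    // ... look up values from the cached file ...
                } finally {
                    in.close();
                }
            }
        }
    }

This matches what Kai describes: the copy is fetched once per node, reused
by later tasks on that node, and cleaned up when the job finishes.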