From: John Meza <j_mezazap@hotmail.com>
To: user@hadoop.apache.org
Subject: RE: Distributed cache: how big is too big?
Date: Tue, 9 Apr 2013 19:03:26 -0700
The Distributed Cache uses the shared file system (whichever is specified).

The Distributed Cache can be loaded via the GenericOptionsParser / ToolRunner parameters. Those parameters (-files, -archives, -libjars) are given on the command line and are available to a MapReduce driver class that implements the Tool interface.
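To make that concrete, here is a minimal sketch of such a driver (class name, job name, jar name, and paths are made up; this assumes the org.apache.hadoop.mapreduce API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Implementing Tool is what lets GenericOptionsParser strip
// -files / -archives / -libjars before run() sees the args.
public class CacheDemoDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects anything the generic options set.
        Job job = new Job(getConf(), "cache-demo");
        job.setJarByClass(CacheDemoDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new CacheDemoDriver(), args));
    }
}

Then something like:

hadoop jar cache-demo.jar CacheDemoDriver -files lookup.dat -archives ref-data.tgz input output

and the framework takes care of shipping lookup.dat and ref-data.tgz.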

Those parameters, as well as the methods in the DistributedCache API, load the files into the shared filesystem used by the JobTracker (JT). From there the framework manages the distribution to the DataNodes (DNs).
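If you'd rather do it programmatically than on the command line, the old-API DistributedCache class does the same job. A rough sketch (the HDFS paths are hypothetical):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URIs must already point into the shared filesystem (HDFS);
        // the framework then localizes them onto each node running a task.
        DistributedCache.addCacheFile(new URI("/user/john/lookup.dat"), conf);
        DistributedCache.addCacheArchive(new URI("/user/john/ref-data.tgz"), conf);
        // On the task side, the localized copies are found with
        // DistributedCache.getLocalCacheFiles(conf).
    }
}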

A couple of unique characteristics are:

1. The Distributed Cache manages the deployment of the files into the cache directory, where they can be used by all the jobs that need them. The TaskTracker (TT) maintains a reference count to help ensure the file(s) aren't deleted prematurely.

2. Archives are unarchived, with directory structures left intact. This is an important requirement for my application: during the unarchive the directory structure is recreated on each node (sketched below).
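For example, if the job is submitted with -archives ref-data.tgz#refdata, each node unpacks the tarball and symlinks it as "refdata" in the task's working directory, internal layout preserved. The file names below are made up:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RefDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException {
        // "refdata" is the symlink created by -archives ref-data.tgz#refdata;
        // subdirectories inside the tarball survive the unarchive.
        File appConfig = new File("refdata/conf/app.properties");
        if (!appConfig.exists()) {
            throw new IOException("unarchived directory structure is missing");
        }
    }
}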

Most of this info is directly from Hadoop: The Definitive Guide and various other sources on the net.

I also look forward to comments and corrections from those with more experience.
John



Date: Tue, 9 Apr 2013 16:07:12 -0700
Subject: Re: Distributed cache: how big is too big?
From: bjornjon@gmail.com
To: user@hadoop.apache.org

I think the correct question is why you would use the distributed cache for a large file that is read during map/reduce, instead of plain HDFS. It does not sound wise to shuffle GBs of data onto all nodes on each job submission and then just remove them when the job is done. I would think about picking another "data strategy" and just use HDFS for the file. It's no problem to make sure the file is available on every node.

Anyway... maybe someone with more knowledge on this will chip in :)


On Tue, Apr 9, 2013 at 7:56 AM, Jay Vyas <jayunit100@gmail.com> wrote:
Hmmm... maybe I'm missing something... but (@bjorn) why would you use HDFS as a replacement for the distributed cache?

After all, the distributed cache is just a file with replication over the whole cluster, which isn't in HDFS. Can't you just make the cache size big and store the file there?

What advantage is HDFS distribution of the file over all nodes?

On Apr 9, 2013, at 6:49 AM, Bjorn Jonsson <bjornjon@gmail.com> wrote:

Put it once on HDFS with a replication factor equal to the number of DNs. No startup latency on job submission, no max size, and you can access it from anywhere with fs since it sticks around until you replace it. Just a thought.
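A rough sketch of that approach through the FileSystem API (the path and the replication factor of 10 are made up; hadoop fs -put plus hadoop fs -setrep does the same from the shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PinReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/user/john/ref-data/lookup.dat");
        // Load the file once...
        fs.copyFromLocalFile(new Path("lookup.dat"), target);
        // ...then pin its replication to the DataNode count so every
        // node ends up with a local copy of the blocks.
        fs.setReplication(target, (short) 10);
    }
}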

On Apr 8, 2013 9:59 PM, "John Meza" <j_mezazap@hotmail.com> wrote:
I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing.

To make the Hadoop solution work I need to deploy the data directory to each DN when the job is executed.
I know this isn't new and is commonly done with a Distributed Cache.

Based on experience, what are the common file sizes deployed in a Distributed Cache?
I know smaller is better, but how big is too big? I have read that the larger the cache deployed, the more startup latency there will be. I also assume there are other factors that play into this.

I know that -> default local.cache.size = 10 GB (see the sketch after this list)

- Range of desirable sizes for Distributed Cache = 10 KB - 1 GB??
- Distributed Cache is normally not used if larger than = ____?
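For what it's worth, that limit is the local.cache.size property, a byte count enforced per node against the whole cache directory (not per file). A minimal sketch of raising it; the 20 GB figure is arbitrary:

import org.apache.hadoop.conf.Configuration;

public class CacheLimit {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // local.cache.size is in bytes; 20 GB here, up from the 10 GB default.
        conf.setLong("local.cache.size", 20L * 1024 * 1024 * 1024);
        System.out.println(conf.getLong("local.cache.size", -1));
    }
}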
Another option: put the data directories on each DN and provide the location to the TaskTracker?

thanks
John
