From: John Meza <j_mezazap@hotmail.com>
To: user@hadoop.apache.org
Subject: RE: Distributed cache: how big is too big?
Date: Tue, 9 Apr 2013 19:03:26 -0700
The Distributed Cache uses the shared file system (whichever is specified).

The Distributed Cache can be loaded via the GenericOptionsParser / ToolRunner parameters. Those parameters (-files, -archives, -libjars) are given on the command line and are available to a MapReduce driver class that implements the Tool interface.
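To make that concrete, here is a minimal sketch of such a driver (class name, job name, jar name, and paths are made up; this assumes the org.apache.hadoop.mapreduce API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Implementing Tool is what lets GenericOptionsParser strip
// -files / -archives / -libjars before run() sees the args.
public class CacheDemoDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects anything the generic options set.
        Job job = new Job(getConf(), "cache-demo");
        job.setJarByClass(CacheDemoDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new CacheDemoDriver(), args));
    }
}

Then something like:

hadoop jar cache-demo.jar CacheDemoDriver -files lookup.dat -archives ref-data.tgz input output

and the framework takes care of shipping lookup.dat and ref-data.tgz.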

Those parameters, as well as the methods in the DistributedCache API, load the files into the shared filesystem used by the JobTracker (JT). From there the framework manages the distribution to the DataNodes (DNs).
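If you'd rather do it programmatically than on the command line, the old-API DistributedCache class does the same job. A rough sketch (the HDFS paths are hypothetical):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URIs must already point into the shared filesystem (HDFS);
        // the framework then localizes them onto each node running a task.
        DistributedCache.addCacheFile(new URI("/user/john/lookup.dat"), conf);
        DistributedCache.addCacheArchive(new URI("/user/john/ref-data.tgz"), conf);
        // On the task side, the localized copies are found with
        // DistributedCache.getLocalCacheFiles(conf).
    }
}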

A couple of unique characteristics are:

1. The Distributed Cache manages the deployment of the files into the cache directory, where they can be used by all the jobs that need them. The TaskTracker (TT) maintains a reference count to help ensure the file(s) aren't deleted prematurely.

2. Archives are unarchived, with directory structures left intact. This is an important requirement for my application: during the unarchive the directory structure is recreated on each node (sketched below).
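For example, if the job is submitted with -archives ref-data.tgz#refdata, each node unpacks the tarball and symlinks it as "refdata" in the task's working directory, internal layout preserved. The file names below are made up:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RefDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException {
        // "refdata" is the symlink created by -archives ref-data.tgz#refdata;
        // subdirectories inside the tarball survive the unarchive.
        File appConfig = new File("refdata/conf/app.properties");
        if (!appConfig.exists()) {
            throw new IOException("unarchived directory structure is missing");
        }
    }
}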

Most of this info is directly from Hadoop: The Definitive Guide and various other sources on the net.

I also look forward to comments and corrections from those with more experience.
John



Date: Tue, 9 Apr 2013 16:07:12 -0700
Subject: Re: Distributed cache: how big is too big?
From: bjornjon@gmail.com
To: user@hadoop.apache.org

I think the correct question is why you would use the distributed cache for a large file that is read during map/reduce, instead of plain HDFS. It does not sound wise to shuffle GBs of data onto all nodes on each job submission and then just remove them when the job is done. I would think about picking another "data strategy" and just use HDFS for the file. It's no problem to make sure the file is available on every node.

Anyway... maybe someone with more knowledge on this will chip in :)


On Tue, Apr 9, 2013 at 7:56 AM, Jay Vyas <jayunit100@gmail.com> wrote:
Hmmm... maybe I'm missing something... but (@bjorn) why would you use HDFS as a replacement for the distributed cache?

After all, the distributed cache is just a file with replication over the whole cluster, which isn't in HDFS. Can't you just make the cache size big and store the file there?

What advantage is HDFS distribution of the file over all nodes?

On Apr 9, 2013, at 6:49 AM, Bjorn Jonsson <bjornjon@gmail.com> wrote:

Put it once on HDFS with a replication factor equal to the number of DNs. No startup latency on job submission, no max size, and you can access it from anywhere with fs since it sticks around until you replace it. Just a thought.
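A rough sketch of that approach through the FileSystem API (the path and the replication factor of 10 are made up; hadoop fs -put plus hadoop fs -setrep does the same from the shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PinReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/user/john/ref-data/lookup.dat");
        // Load the file once...
        fs.copyFromLocalFile(new Path("lookup.dat"), target);
        // ...then pin its replication to the DataNode count so every
        // node ends up with a local copy of the blocks.
        fs.setReplication(target, (short) 10);
    }
}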

On Apr 8, 2013 9:59 PM, "John Meza" <j_mezazap@hotmail.com> wrote:
I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing.

To make the Hadoop solution work I need to deploy the data directory to each DN when the job is executed.
I know this isn't new and is commonly done with a Distributed Cache.

Based on experience, what are the common file sizes deployed in a Distributed Cache?
I know smaller is better, but how big is too big? I have read that the larger the cache deployed, the more startup latency there will be. I also assume there are other factors that play into this.

I know that -> default local.cache.size = 10 GB (see the sketch after this list)

- Range of desirable sizes for Distributed Cache = 10 KB - 1 GB??
- Distributed Cache is normally not used if larger than = ____?
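For what it's worth, that limit is the local.cache.size property, a byte count enforced per node against the whole cache directory (not per file). A minimal sketch of raising it; the 20 GB figure is arbitrary:

import org.apache.hadoop.conf.Configuration;

public class CacheLimit {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // local.cache.size is in bytes; 20 GB here, up from the 10 GB default.
        conf.setLong("local.cache.size", 20L * 1024 * 1024 * 1024);
        System.out.println(conf.getLong("local.cache.size", -1));
    }
}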
Another option: put the data directories on each DN and provide the location to the TaskTracker?

thanks
John
