From: Bjorn Jonsson
To: user@hadoop.apache.org
Date: Tue, 9 Apr 2013 04:49:17 -0700
Subject: Re: Distributed cache: how big is too big?

Put it once on HDFS with a replication factor equal to the number of DNs.
That way there is no startup latency on job submission and no max size, and
you can access it from anywhere with fs, since it sticks around until you
replace it. Just a thought.

On Apr 8, 2013 9:59 PM, "John Meza" <j_mezazap@hotmail.com> wrote:

> I am researching a Hadoop solution for an existing application that
> requires a directory structure full of data for processing.
>
> To make the Hadoop solution work I need to deploy the data directory to
> each DN when the job is executed.
> I know this isn't new and is commonly done with a Distributed Cache.
>
> *Based on experience, what are the common file sizes deployed in a
> Distributed Cache?*
> I know smaller is better, but how big is too big? I have read that the
> larger the cache deployed, the more startup latency there will be. I also
> assume there are other factors that play into this.
>
> I know that -> Default local.cache.size=10GB
>
> - Range of desirable sizes for Distributed Cache = 10KB - 1GB??
> - Distributed Cache is normally not used if larger than = ____?
>
> *Another Option:* Put the data directories on each DN and provide the
> location to TaskTracker?
>
> thanks
> John
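
For reference, the suggestion at the top of this reply (put the data on HDFS once, then raise its replication factor to the number of DataNodes) could be sketched roughly as below. This is only an illustration: the helper name, the paths ./refdata and /shared/refdata, and the DataNode count of 12 are all hypothetical, and the script just builds and prints the `hadoop fs -put` and `hadoop fs -setrep` commands rather than running them.

```python
import shlex


def hdfs_publish_commands(local_dir, hdfs_dir, num_datanodes):
    """Build the `hadoop fs` commands for the approach; nothing is executed.

    local_dir, hdfs_dir and num_datanodes are placeholders supplied by the
    caller; they are assumptions for illustration, not values from the thread.
    """
    if num_datanodes < 1:
        raise ValueError("num_datanodes must be at least 1")
    return [
        # Upload the data directory to HDFS once.
        ["hadoop", "fs", "-put", local_dir, hdfs_dir],
        # Set the replication factor to the number of DataNodes (recursively).
        ["hadoop", "fs", "-setrep", "-R", str(num_datanodes), hdfs_dir],
    ]


if __name__ == "__main__":
    # Hypothetical paths and cluster size, printed as shell-ready commands.
    for cmd in hdfs_publish_commands("./refdata", "/shared/refdata", 12):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Running the printed commands requires a configured Hadoop client on the machine; tasks would then read the data through the normal HDFS filesystem API instead of a Distributed Cache copy.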