Subject: Re: Distributed cache: how big is too big?
From: Jay Vyas
Date: Tue, 9 Apr 2013 09:56:14 -0500
To: user@hadoop.apache.org

Hmmm... maybe I'm missing something, but (@bjorn) why would you use HDFS as a replacement for the distributed cache?

After all, the distributed cache is just a file with replication over the whole cluster, which isn't in HDFS. Can't you just make the cache size big and store the file there?

What advantage does distributing the file over all nodes via HDFS give you?

On Apr 9, 2013, at 6:49 AM, Bjorn Jonsson <bjornjon@gmail.com> wrote:

Put it once on HDFS with a replication factor equal to the number of DNs. There is no startup latency on job submission and no max size, and you can access it from anywhere with fs since it sticks around until you replace it? Just a thought.
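
To put rough numbers on that suggestion, here is a minimal sketch (not from the thread) of what replicating to every DN costs in raw storage, plus the `hadoop fs -setrep` command that would request it. The file size, node count, and HDFS path are hypothetical, and the command is only built, not run.

```python
def full_replication_footprint(file_bytes, num_datanodes):
    """Raw bytes stored cluster-wide when every DataNode holds one replica."""
    if file_bytes < 0 or num_datanodes < 1:
        raise ValueError("file_bytes must be >= 0 and num_datanodes >= 1")
    return file_bytes * num_datanodes


def setrep_command(hdfs_path, num_datanodes):
    """Build (but do not execute) the shell command that sets the replication factor.

    -w asks the command to wait until the target replication is reached.
    """
    return ["hadoop", "fs", "-setrep", "-w", str(num_datanodes), hdfs_path]


if __name__ == "__main__":
    size = 2 * 1024**3   # hypothetical 2 GiB data archive
    nodes = 20           # hypothetical number of DataNodes
    total_gib = full_replication_footprint(size, nodes) / 1024**3
    print(f"{total_gib:.0f} GiB of raw storage across the cluster")
    print(" ".join(setrep_command("/shared/appdata.tar", nodes)))
```

The trade-off is that the copy cost is paid once, when the file is written, rather than at each job submission, but the cluster permanently holds one replica per node.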

On Apr 8, 2013 9:59 PM, "John Meza" <j_mezazap@hotmail.com> wrote:
I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing.

To make the Hadoop solution work I need to deploy the data directory to each DN when the job is executed.
I know this isn't new and is commonly done with a Distributed Cache.

Based on experience, what are the common file sizes deployed in a Distributed Cache?
I know smaller is better, but how big is too big? I have read that the larger the deployed cache, the more startup latency there will be. I also assume there are other factors that play into this.

I know that the default local.cache.size = 10 GB.

- Range of desirable sizes for a Distributed Cache = 10 KB - 1 GB?
- A Distributed Cache is normally not used if larger than ____?
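
One way to ground the size question before submitting a job is to measure the data directory and compare it with the cache limit. The sketch below is an illustration only: the function names are made up, and the limit assumes the 10 GB default quoted above means 10 * 1024^3 bytes.

```python
import os
import tempfile

# Assumption: the 10 GB default quoted in the thread, taken as binary gigabytes.
DEFAULT_LOCAL_CACHE_SIZE = 10 * 1024**3


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_cache(root, cache_limit_bytes=DEFAULT_LOCAL_CACHE_SIZE):
    """Return True if the directory's total size is within the cache limit."""
    return directory_size_bytes(root) <= cache_limit_bytes


if __name__ == "__main__":
    # Tiny self-contained demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        with open(os.path.join(demo_dir, "sample.dat"), "wb") as handle:
            handle.write(b"\0" * 4096)
        print(directory_size_bytes(demo_dir), "bytes;",
              "fits" if fits_in_cache(demo_dir) else "too big")
```

Fitting under the limit only says the directory can be cached at all; it says nothing about how long the copy to each node takes, which is the startup latency in question.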

Another option: put the data directories on each DN and provide the location to the TaskTracker?

thanks
John
