hadoop-hdfs-user mailing list archives

From John Meza <j_meza...@hotmail.com>
Subject RE: Distributed cache: how big is too big?
Date Tue, 09 Apr 2013 14:42:13 GMT
"a replication factor equal to the number of DN"Hmmm... I'm not sure I understand: there are
 8 DN in mytest cluster. 
Date: Tue, 9 Apr 2013 04:49:17 -0700
Subject: Re: Distributed cache: how big is too big?
From: bjornjon@gmail.com
To: user@hadoop.apache.org

Put it once on HDFS with a replication factor equal to the number of DNs. No startup latency
on job submission or max size, and you can access it from anywhere with fs since it sticks around
until you replace it. Just a thought.
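The suggestion above can be sketched with the FileSystem shell. This is a hedged example, not from the thread itself: the paths are illustrative, and the replication count of 8 matches the cluster size mentioned earlier. Commands assume a running cluster, so adjust to your environment.

```shell
# Upload the reference data once (illustrative paths).
hadoop fs -mkdir -p /shared/refdata
hadoop fs -put ./refdata /shared/refdata

# Raise replication to the number of DataNodes (8 here);
# -w waits until every block actually reaches that replication.
hadoop fs -setrep -w 8 /shared/refdata
```

With every DN holding a local replica, map tasks reading the data get local reads without the per-job distribution step the Distributed Cache performs.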

On Apr 8, 2013 9:59 PM, "John Meza" <j_mezazap@hotmail.com> wrote:

I am researching a Hadoop solution for an existing application that requires a directory structure
full of data for processing.
To make the Hadoop solution work I need to deploy the data directory to each DN when the job
is executed.
I know this isn't new and commonly done with a Distributed Cache.
Based on experience, what are the common file sizes deployed in a Distributed Cache? I know
smaller is better, but how big is too big? I have read that the larger the cache deployed, the
longer the startup latency. I also assume there are other factors that play into this.
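For reference, the usual way to populate the Distributed Cache from the command line is through the generic options that `hadoop jar` accepts. A hedged sketch, with jar name, driver class, and paths all illustrative:

```shell
# Ship an archive via the Distributed Cache using the -archives
# generic option; it is unpacked on each task node and symlinked
# as ./refdata inside the task working directory.
hadoop jar myjob.jar com.example.MyDriver \
  -archives hdfs:///shared/refdata.tgz#refdata \
  /input /output
```

The driver must use `GenericOptionsParser` (e.g. by extending `Configured` and implementing `Tool`) for `-archives` and `-files` to be honored.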

I know that the default local.cache.size = 10 GB.
- Range of desirable sizes for a Distributed Cache = 10 KB - 1 GB??
- Distributed Cache is normally not used if larger than ____?
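For completeness, the limit mentioned above is a per-TaskTracker setting in mapred-site.xml. A hedged config sketch (the value is in bytes; 10737418240 is the 10 GB default):

```xml
<!-- Per-TaskTracker cap on total Distributed Cache storage. -->
<property>
  <name>local.cache.size</name>
  <value>10737418240</value>
</property>
```

Note this caps the total size of all cached files on a node, not the size of any single cached file.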
Another option: put the data directories on each DN and provide the location to the TaskTracker?