hadoop-hdfs-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Dfs usage calculation
Date Wed, 01 Feb 2012 11:05:26 GMT
Michael,

I do think node-locality would be a little impacted, since with one
less replica there is one less node-local choice. But with a properly
tuned scheduler (delay scheduling in the FairScheduler helps, if you
plan to use that), you should still be able to eke out good locality.
It should be easy to try an example app and see the difference in
speed for yourself! I do not think there'd be a major loss over five
nodes carrying evenly spread data.
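
If you do go the FairScheduler route, a minimal mapred-site.xml sketch
would look like the below (property names as of the 1.x FairScheduler;
the locality-delay knob in particular has varied across versions, so
verify against your scheduler's docs before relying on it):

  <!-- Hand task assignment over to the FairScheduler -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>

  <!-- Delay scheduling: how long (in ms) a job waits for a node-local
       slot before settling for a rack-local one -->
  <property>
    <name>mapred.fairscheduler.locality.delay</name>
    <value>5000</value>
  </property>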

P.S. Please do run the balancer at regular intervals (monthly is
usually fine, but do monitor for skews), so that blocks stay evenly
distributed and your MR jobs can take advantage of that. This matters
especially if you are using the cluster's own nodes as client nodes
for HDFS writes.
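
To give a concrete (if minimal) example, the threshold argument below
is the allowed per-DN deviation from average utilization, in percent,
and 10 is the default:

  hadoop balancer -threshold 10

The balancer throttles its own transfer bandwidth
(dfs.balance.bandwidthPerSec), so it is generally safe to leave it
running while jobs execute.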

On Wed, Feb 1, 2012 at 1:43 PM, Michael Lok <fulat2k@gmail.com> wrote:
> Hi Harsh,
>
> Thanks for the info.  If the replication is set to 2, will there be
> any difference in performance when running MR jobs?
>
> On Wed, Feb 1, 2012 at 1:02 PM, Harsh J <harsh@cloudera.com> wrote:
>> Usable space is (total configured space / replication factor). With
>> your values, applied FS-wide: ((500 GB x 5) / 3 replicas) = (2.5 TB
>> / 3) ≈ 833 GB.
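>>
>> You can sanity-check the raw total with a dfsadmin report, e.g.:
>>
>>   hadoop dfsadmin -report
>>
>> The "Configured Capacity" it prints is the pre-replication figure
>> (the 2.5 TB here); divide by your replication factor for usable
>> space.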
>>
>> Note, however, that replication is a per-file property, and you can
>> control it granularly instead of keeping it constant FS-wide if need
>> be. Use the setrep utility:
>> http://hadoop.apache.org/common/docs/current/file_system_shell.html#setrep.
>> For instance, you can keep non-critical files at 1 (no redundancy) or
>> 2 replicas, and all important ones at 3. The usable space then
>> becomes a more complex function of that mix.
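>>
>> For example (the paths below are placeholders, substitute your own):
>>
>>   # scratch output can live with a single replica
>>   hadoop fs -setrep -w 1 /tmp/scratch
>>   # keep an important tree at 3 copies, recursively
>>   hadoop fs -setrep -R 3 /data/important
>>
>> The -w flag makes the command wait until re-replication completes,
>> and -R applies it recursively.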
>>
>> Also, for 5 nodes, using a replication factor of two may be okay too.
>> It will let you bear one DN failure at a time, while 3 will let you
>> bear two DN failures at the same time (unsure if you'll need that,
>> since a power or switch loss in your case would mean the whole cluster
>> going down anyway). You can raise the replication factor once you grow
>> larger, and rebalance the cluster to spread the new replicas evenly.
>> With rep=2, you'd have 1.25 TB of usable space.
>>
>> On Wed, Feb 1, 2012 at 9:06 AM, Michael Lok <fulat2k@gmail.com> wrote:
>>> Hi folks,
>>>
>>> We're planning to set up a 5-node Hadoop cluster. I'm thinking of
>>> just setting dfs.replication to 3, which is the default. Each data
>>> node will have 500 GB of local storage for DFS use.
>>>
>>> How do I calculate the amount of usable DFS space given the
>>> replication setting and the number of nodes in this case? Is there a
>>> formula I can use?
>>>
>>> Any help is greatly appreciated.
>>>
>>> Thanks
>>
>>
>>
>> --
>> Harsh J
>> Customer Ops. Engineer
>> Cloudera | http://tiny.cloudera.com/about



-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
