Usable space is roughly (total configured space / replication factor).
Plugging in your values across the whole FS: ((500 GB x 5 nodes) / rep
factor 3) = (2.5 TB / 3) = ~833 GB.
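The arithmetic above can be sketched in a couple of lines of bash (the node count, per-node capacity, and replication factor are just the example values from this thread):

```shell
#!/bin/sh
# Rough usable-HDFS-space estimate: raw capacity / replication factor.
nodes=5
per_node_gb=500
replication=3

total_gb=$((nodes * per_node_gb))      # 2500 GB raw across the cluster
usable_gb=$((total_gb / replication))  # integer division -> 833 GB usable
echo "${usable_gb} GB"
```

Note this ignores non-DFS reserved space, block metadata, and temporary MapReduce output, so treat it as an upper bound.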
Note, however, that replication is a per-file property, so you can
control it granularly instead of keeping it constant FS-wide, if need
be. Use the setrep utility:
http://hadoop.apache.org/common/docs/current/file_system_shell.html#setrep.
For instance, you can keep non-critical files at 1 (no redundancy) or 2
replicas, and all important ones at 3. The calculation of usable
space then becomes a more complex function of your file mix.
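As a concrete sketch of the per-file approach (the path here is hypothetical; -R recurses into a directory and -w waits until the target replication is reached):

```shell
# Drop a non-critical dataset to 2 replicas to reclaim space.
hadoop fs -setrep -R -w 2 /data/scratch

# Verify: the second column of ls output shows the replication factor.
hadoop fs -ls /data/scratch
```

These commands require a running HDFS cluster, so they are shown for illustration rather than as a runnable snippet.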
Also, for 5 nodes, a replication factor of two may be fine too. That
lets you tolerate one DN failure at a time, while 3 lets you tolerate
two DN failures at once (unsure if you'll need that, since a power or
switch loss in your case would take the whole cluster down anyway).
You can raise the replication factor once you grow larger, then
rebalance the cluster to spread the new replicas evenly.
With rep=2, you should have 1.25 TB worth of usable space.
On Wed, Feb 1, 2012 at 9:06 AM, Michael Lok <fulat2k@gmail.com> wrote:
> Hi folks,
>
> We're planning to setup a 5 node hadoop cluster. I'm thinking of just
> setting the dfs.replication to 3; which is the default. Each data node will
> have 500gb of local storage for dfs use.
>
> How do i calculate the amount of usable dfs space given the replication
> setting and the number of nodes in this case? is there a formula which i
> can use?
>
> Any help is greatly appreciated.
>
> Thanks

Harsh J
Customer Ops. Engineer
Cloudera  http://tiny.cloudera.com/about
